Beyond the Notebook: The Engineering Realities of Production AI Agents

Shift from stateless inference to stateful architectures to resolve infrastructure bottlenecks like memory management, concurrency limits, and runaway jobs in production AI agents.

You spent weeks building an LLM agent. It reasons across steps, calls tools, retrieves from memory, and produces results you're proud of. You run it in your notebook and it works exactly the way you designed it.

But when you move that logic to production, it behaves in ways you never imagined. 

You encounter latency spikes, mysterious OOM errors, and runaway jobs that pin your GPUs indefinitely. These aren't bugs in your agent logic; they are symptoms of an infrastructure mismatch. Agentic workloads behave differently than standard inference APIs. To productionize them, you have to move from simple API deployment to stateful architectural design.

This post is about the three structural problems agents introduce at the GPU layer, the configuration mistakes that make each one worse, and the architecture patterns that actually work in production.

Why Agents Don't Behave Like Inference APIs

Most GPU infrastructure is designed around a simple model: a request comes in, a model runs a forward pass, a response goes out. The worker is done in under a second. You can scale horizontally by adding more workers, each stateless and interchangeable.

LLM agents don't work that way.

A single agent task (not a user session, a single task) might involve a query decomposition step, three parallel web lookups, an embedding search against your knowledge base, a synthesis pass over the retrieved content, and a critique loop before the final response goes back. That whole sequence runs as one continuous process. The GPU worker holding it can't be recycled between steps. The model, the context window, and the intermediate state all stay in VRAM until the task is done.

That means a five-step research agent isn't five inference calls. It's one session that runs for 30–90 seconds, holds 4–16GB of VRAM, and branches differently every run depending on what the tools return.

Put ten concurrent users through that pipeline and you have a resource contention problem that cannot be addressed at the model layer.

The Three Infrastructure Problems Agents Create

1. Memory: You're Sizing for the Model, Not the Session

What goes wrong:

In standard inference, you size for model weights. In agentic workflows, you have to size for the model weights plus the KV cache, and the KV cache keeps growing for as long as the session runs.

Why it happens:

Agents are dynamic memory consumers. During a reasoning chain, the KV cache grows with every token added to the context, including every tool output the agent reads back in. In long-running sessions with large context windows, the KV cache can exceed the size of the model weights themselves. Using a default GpuGroup.ANY setting is a common mistake here; it treats your agent like a stateless API, potentially landing it on hardware that lacks the VRAM to support the growing context window of a complex session.

What you can do:

  • Pin your hardware: Use specific GPU types (e.g., NVIDIA_A100_80GB_PCIe) to ensure predictable VRAM availability.
  • Size for peak context: Account for the maximum possible context window the agent might traverse, not just the model checkpoint size.
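
To make "size for peak context" concrete, here is a rough estimate that puts KV cache size next to weight size. It is a sketch under stated assumptions: ModelShape and its values are hypothetical (a 7B-class model without grouped-query attention, FP16 everywhere), and it ignores activations, fragmentation, and any paging or quantization your serving stack applies.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer shape; read the real values from your model config."""
    n_layers: int
    n_kv_heads: int           # smaller than the attention head count under GQA
    head_dim: int
    n_params: float           # total parameter count
    bytes_per_value: int = 2  # FP16 / BF16


def kv_cache_bytes(shape: ModelShape, context_tokens: int, sessions: int = 1) -> int:
    """Keys plus values (the factor of 2) for every layer, KV head, and token."""
    per_token = (2 * shape.n_layers * shape.n_kv_heads
                 * shape.head_dim * shape.bytes_per_value)
    return per_token * context_tokens * sessions


def weight_bytes(shape: ModelShape) -> int:
    """Memory for the weights alone at the configured precision."""
    return int(shape.n_params * shape.bytes_per_value)


shape = ModelShape(n_layers=32, n_kv_heads=32, head_dim=128, n_params=7e9)
weights_gib = weight_bytes(shape) / GIB
cache_gib = kv_cache_bytes(shape, context_tokens=32_768) / GIB
print(f"weights: {weights_gib:.1f} GiB, KV cache at 32k tokens: {cache_gib:.1f} GiB")
# weights: 13.0 GiB, KV cache at 32k tokens: 16.0 GiB
```

With these assumed numbers, a single session at a 32k-token context needs more memory for its cache than for the weights, which is why the checkpoint size alone is a poor guide to the GPU you need.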

2. Concurrency: Your Dev Configuration Doesn't Work in Production

What goes wrong:

Developers often configure workers=(0, n) for development, assuming the cloud will handle auto-scaling. In production, this causes a catastrophic "cold start" penalty.

Why it happens:

If an endpoint scales to zero, the first user request triggers a cold boot of the GPU worker. This can take 20–90 seconds. If a user is waiting synchronously for an agent to "think," they will have refreshed the page or left before your worker is even ready.

The Strategy:

Configuration  | Cost    | Cold Start   | Best For
---------------|---------|--------------|------------------------------------------
workers=(0, n) | Lowest  | 20–90s       | Batch jobs, dev, non-urgent tasks
workers=(1, n) | Medium  | <1s          | Production batch, variable traffic
workers=(3, n) | Highest | Always ready | Real-time APIs, latency-sensitive traffic
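
One way to choose the minimum in workers=(min, n) is Little's law: the average number of sessions in flight equals the arrival rate multiplied by the session duration. The helper below is a minimal sketch of that arithmetic; the function name and the default of one session per worker are assumptions for illustration, not Runpod settings.

```python
import math


def min_warm_workers(requests_per_minute: float,
                     session_seconds: float,
                     sessions_per_worker: int = 1) -> int:
    """Average sessions in flight (Little's law), rounded up to whole workers."""
    in_flight = requests_per_minute * session_seconds / 60.0
    return max(1, math.ceil(in_flight / sessions_per_worker))


# 4 requests per minute, 45-second agent sessions, one session per worker.
print(min_warm_workers(4, 45))  # 3
```

This is an average, so bursty traffic needs headroom above it; the upper bound n in the worker range is what absorbs the peaks.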

3. Runaway Jobs: Why Agents Fail in Non-Obvious Ways

What goes wrong:

An inference request is bounded; an agent session is open-ended. Agents can get stuck in infinite reasoning loops, repetitive generation patterns, or hang while waiting for a tool response from an external API.

Why it happens:

Without an explicit ceiling, zombie jobs pin your GPU indefinitely, starving your healthy agents of compute resources.

What you can do:

  • Implement Circuit Breakers: Set execution_timeout_ms on every endpoint. A 60-second limit is a reasonable starting point for real-time endpoints.
  • Adopt Distributed Tracing: A timeout kills the runaway job, but it doesn't tell you why it hung. Integrate tools like OpenTelemetry or LangSmith. If a job times out, the trace should show you exactly which tool call or reasoning loop caused the hang. Never deploy to production without visibility into the step-by-step reasoning chain.
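
The endpoint timeout is the last line of defense. Inside the agent, a step budget catches loops earlier and leaves a trace of what happened. The sketch below is one way to do that in plain Python; run_agent, BudgetExceeded, and the shape of the step callable are all hypothetical, and the trace here is a simple list of action names rather than a real OpenTelemetry or LangSmith span.

```python
import time
from typing import Callable, Optional


class BudgetExceeded(Exception):
    """Raised when the agent hits its step, time, or repetition ceiling."""


def run_agent(step: Callable[[list], Optional[str]],
              max_steps: int = 8,
              deadline_s: float = 60.0,
              max_repeats: int = 2) -> tuple:
    """Call `step` until it returns a final answer or a budget runs out.

    `step` receives the trace so far, appends the name of the action it took
    (a string), and returns the final answer, or None to keep going.
    """
    trace: list = []
    started = time.monotonic()
    for i in range(max_steps):
        if time.monotonic() - started > deadline_s:
            raise BudgetExceeded(f"deadline hit after {i} steps; trace={trace}")
        answer = step(trace)
        if answer is not None:
            return answer, trace
        # The same action more than max_repeats times in a row looks like a loop.
        tail = trace[-(max_repeats + 1):]
        if len(tail) == max_repeats + 1 and len(set(tail)) == 1:
            raise BudgetExceeded(f"repeated action {tail[0]!r}; trace={trace}")
    raise BudgetExceeded(f"no answer after {max_steps} steps; trace={trace}")
```

The deadline is only checked between steps, so a single hung tool call still needs its own timeout; this loop bounds the reasoning chain, not the individual calls.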

Architecture: Stateless vs. Stateful Workers

These three problems (memory pressure, cold start latency, and runaway jobs) compound differently depending on how you structure your agent workers.

The Stateful Orchestrator:

Production systems typically use a hybrid pattern. A stateful orchestrator manages the "brain" of the agent, decomposing tasks, managing the context window, and maintaining the reasoning loop. Because this holds the model state, it requires pinned hardware and high availability.

The Stateless Tooling Layer:

Parallelizable sub-tasks (like web scraping, embedding retrieval, or data formatting) should be sent to stateless worker endpoints.

The Engineering Trade-off:

Stateful workers require session affinity in your load balancing. If a worker dies, the session context is wiped. You must implement a "rehydration" strategy: if a node fails, your orchestrator must be able to pull intermediate state from an external cache (like Redis) to resume.
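
A rehydration strategy can be as small as writing the session's intermediate state to the external cache after each step and reading it back on a fresh worker. The sketch below illustrates the idea with an in-memory stand-in; InMemoryStore, checkpoint, rehydrate, and the key format are hypothetical names. A real Redis client exposes set and get calls of a similar shape, but the wiring to one is left out here.

```python
import json
from typing import Optional


class InMemoryStore:
    """Stand-in for an external cache such as Redis, for illustration only."""

    def __init__(self) -> None:
        self._data: dict = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def checkpoint(store, session_id: str, state: dict) -> None:
    """Persist the session state after each completed step."""
    store.set(f"agent:{session_id}", json.dumps(state))


def rehydrate(store, session_id: str) -> dict:
    """Load the last checkpoint, or start a fresh session if none exists."""
    raw = store.get(f"agent:{session_id}")
    if raw is None:
        return {"step": 0, "messages": [], "tool_results": {}}
    return json.loads(raw)


store = InMemoryStore()
checkpoint(store, "session-1", {
    "step": 2,
    "messages": ["decompose query", "fan out 3 lookups"],
    "tool_results": {"q1": "source text"},
})

# A replacement worker picks up where the failed one stopped.
resumed = rehydrate(store, "session-1")
print(resumed["step"])  # 2
```

What gets restored is the logical state (messages, tool results, step counter). The KV cache itself lived in the failed worker's VRAM, so the replacement worker has to run prefill over the restored context again before it can continue generating.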

Important Configurations for Runpod Serverless

These three infrastructure problems each have a corresponding configuration mistake on Runpod Serverless. Addressed together, they're straightforward.

For memory: The default is GpuGroup.ANY, which is the fastest provisioning option and the right call in development. In production, it means the GPU type varies between runs. If your agent needs 24GB minimum and gets assigned hardware that doesn't have it, the job starts, makes progress, runs out of memory, and fails silently. Pin a specific GPU type instead. Use GpuType.NVIDIA_A100_80GB_PCIe for 70B parameter models or workloads with large context windows. Use GpuType.NVIDIA_GEFORCE_RTX_4090 for 7–13B parameter models where cost is the constraint.

For concurrency: The workers=(0, n) setting scales to zero and is appropriate for offline batch pipelines, document processing, or anything where queue latency is acceptable. For production APIs handling user-facing requests, a non-zero minimum is required. See the table above for the right setting for your use case.

For runaway jobs: Every endpoint needs an explicit timeout. 60 seconds (60000ms) is a reasonable starting point for real-time API endpoints. For batch jobs that legitimately run long, 600 seconds (600000ms) is more appropriate. The specific number matters less than having one, as this setting turns an indefinite hang into a clean failure your application can handle.

Configuring Agents for Production

Here is the blueprint for a hybrid agent pipeline.

The Orchestrator (Stateful)

Use this for the reasoning layer. It needs to be warm and ready.

api = Endpoint(
    name="agent-orchestrator",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    datacenter=DataCenter.US_GA_2,
    workers=(3, 20),
    idle_timeout=1800,           # Keep workers warm between session turns
    execution_timeout_ms=120000, # 2 min ceiling
    volume=NetworkVolume(name="model-weights", size=50, datacenter=DataCenter.US_GA_2)
)

What each setting is doing:

  • workers=(3, 20): three always-warm workers means the first user request doesn't cold-start. The ceiling of 20 handles concurrent sessions without unbounded cost.
  • idle_timeout=1800: a user pausing between turns in a multi-step task shouldn't trigger a worker spin-down. 30 minutes keeps workers ready through natural interaction gaps.
  • execution_timeout_ms=120000: two minutes is long enough for complex reasoning tasks and short enough to catch infinite loops before they cost you.
  • NetworkVolume: model weights load once when the worker starts, not on every request. For a 70B model, this is the difference between a 30-second cold start and a 3-minute one.

A note about the size value: size=50 is illustrative. The actual value depends on the size of your model weights. For a 70B model in FP16, the weights alone are ~140GB (70 billion parameters at 2 bytes each), so a realistic value is closer to size=200. Size the volume for your specific model.

The Tooling Layer (Stateless)

Use this for non-GPU tasks, such as web scraping, data formatting, and external API calls. Move scraping and API calls to CPU endpoints to avoid VRAM contention.

@Endpoint(
    name="web-retrieval",
    cpu="cpu5c-4-8",             # CPU endpoint: no GPU needed for web scraping
    workers=(0, 10),             # Scale to zero; web retrieval tolerates queue latency
    idle_timeout=300,            # 5 min: scale down fast between fan-out bursts
    execution_timeout_ms=30000   # 30 s ceiling so a stuck scrape fails cleanly
)
def scrape(data): ...

The CPU endpoint for web retrieval is worth calling out specifically. Most developers default to GPU for everything because that's what the development environment uses. For a tool that's doing HTTP requests and HTML parsing, you're paying GPU rates for a workload that runs fine on CPU. Separating it out reduces cost per fan-out call and frees GPU capacity for the steps that actually need it.

Putting It Together: A Research Agent in Production

Here's what the full architecture looks like for a research and synthesis agent that takes a user query, searches the web, retrieves from an internal knowledge base, and synthesizes a response with a critique pass.

User query
    │
    ▼
Orchestrator endpoint
(workers=(3,20), A100, network volume)
    │
    ├──► Query decomposition [sub-step, same worker]
    │         └──► 3 sub-queries
    ├──► Web retrieval × 3 [parallel fan-out]
    │    (CPU endpoint, workers=(0,10))
    │         └──► Raw source text per query
    ├──► Embedding search [queue-based]
    │    (RTX 4090, workers=(1,10))
    │         └──► Relevant chunks from internal KB
    └──► Synthesis + critique [same orchestrator worker]
              └──► Final response

The query decomposition and synthesis steps run on the orchestrator worker because they're tightly coupled to the session state; decomposition sets up the fan-out and synthesis reads everything the fan-out returns. Keeping them on the same stateful worker avoids serializing and passing that state over the network.

The web retrieval and embedding search steps are inherently parallelizable and stateless, so they go to separate endpoints. They don't touch the orchestrator's VRAM, they can scale independently, and their failures are contained. A stuck web scraper times out at 30 seconds and the pipeline handles it gracefully, rather than holding the orchestrator worker hostage.
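
On the orchestrator side, failure containment for the fan-out can look like the sketch below: each sub-query gets its own timeout, and a slow or failing call comes back as None instead of stalling the synthesis step. fan_out and fake_fetch are hypothetical names; in a real pipeline the fetch callable would invoke the web-retrieval endpoint, and the demo uses a 0.1-second timeout only so it finishes quickly.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def fan_out(queries: list,
                  fetch: Callable[[str], Awaitable[str]],
                  timeout_s: float = 30.0) -> dict:
    """Run `fetch` for every query in parallel; failures and timeouts map to None."""

    async def guarded(query: str) -> Optional[str]:
        try:
            return await asyncio.wait_for(fetch(query), timeout=timeout_s)
        except Exception:  # includes the timeout raised by wait_for
            return None

    results = await asyncio.gather(*(guarded(q) for q in queries))
    return dict(zip(queries, results))


async def fake_fetch(query: str) -> str:
    """Stand-in for a call to the retrieval endpoint."""
    if query == "broken":
        raise RuntimeError("upstream returned 500")
    await asyncio.sleep(5.0 if query == "slow" else 0.01)
    return f"text for {query}"


results = asyncio.run(fan_out(["fast", "slow", "broken"], fake_fetch, timeout_s=0.1))
print(results)
# {'fast': 'text for fast', 'slow': None, 'broken': None}
```

The synthesis step can then decide what to do with the gaps: proceed with the sources it has, retry once, or tell the user which lookup failed.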

Before anything like this goes to production, run through this list:

  • GPU pinned: specific type, not GpuGroup.ANY, on every production endpoint
  • Worker minimums set: min ≥ 1 on anything latency-sensitive; min ≥ 3 for production APIs
  • Timeouts configured: execution_timeout_ms set on every endpoint, tighter than you think you need
  • Network volume attached: on any endpoint loading a model larger than a few GB
  • No hardcoded config: everything sensitive goes through env, not the function body
  • Health check route: on all load-balanced endpoints
  • Local test first: flash dev before production deploy
  • Reasoning loops traced: so that when a job fails, you can see exactly which step it failed on
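
Part of this list can be checked mechanically before a deploy. The sketch below lints a plain dict that mirrors the endpoint settings used in this post; lint_endpoint and the dict layout are hypothetical and not part of any Runpod tooling.

```python
def lint_endpoint(cfg: dict, latency_sensitive: bool) -> list:
    """Return the checklist items this endpoint config violates."""
    problems = []
    # CPU endpoints have no GPU to pin, so only check GPU endpoints.
    if not cfg.get("cpu") and cfg.get("gpu") in (None, "ANY"):
        problems.append("GPU not pinned")
    min_workers = cfg.get("workers", (0, 0))[0]
    if latency_sensitive and min_workers < 1:
        problems.append("min workers is 0 on a latency-sensitive endpoint")
    if not cfg.get("execution_timeout_ms"):
        problems.append("execution_timeout_ms not set")
    return problems


orchestrator = {
    "gpu": "NVIDIA_A100_80GB_PCIe",
    "workers": (3, 20),
    "execution_timeout_ms": 120000,
}
draft = {"gpu": "ANY", "workers": (0, 10)}

print(lint_endpoint(orchestrator, latency_sensitive=True))  # []
print(lint_endpoint(draft, latency_sensitive=True))
```

A check like this fits naturally in CI, so a development config with scale-to-zero workers and no timeout cannot reach a user-facing endpoint unnoticed.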

Where to Go From Here

If you're starting from scratch, the serverless endpoint builder is the fastest way to test the configuration options: deploy one endpoint, adjust the settings, run it locally with flash dev until the behavior matches what you need.

If you're designing a more complex pipeline, perhaps with multiple agents, long-running autonomous workflows, or pipelines with more than a few tool hops, Deploying AI Agents covers the deployment patterns for those architectures in more depth.

And if you want to step back and think through how orchestration, memory, and tool routing fit together at the design level before you write any configuration, Agentic AI Workflows Explained is the right starting point.
