
One Million Developers on Runpod, and the Cloud We’re Building Next
We raised a $100 million Series A. Here's what it means for you.
Shift from stateless inference to stateful architectures to resolve infrastructure bottlenecks like memory management, concurrency limits, and runaway jobs in production AI agents.

You spent weeks building an LLM agent. It reasons across steps, calls tools, retrieves from memory, and produces results you're proud of. You run it in your notebook and it works exactly the way you designed it.
But when you move that logic to production, it behaves in ways you never imagined.
You encounter latency spikes, mysterious OOM errors, and runaway jobs that pin your GPUs indefinitely. These aren't bugs in your agent logic; they are symptoms of an infrastructure mismatch. Agentic workloads behave differently than standard inference APIs. To productionize them, you have to move from simple API deployment to stateful architectural design.
This post is about the three structural problems agents introduce at the GPU layer, the configuration mistakes that make each one worse, and the architecture patterns that actually work in production.
Most GPU infrastructure is designed around a simple model: a request comes in, a model runs a forward pass, a response goes out. The worker is done in under a second. You can scale horizontally by adding more workers, each stateless and interchangeable.
LLM agents don't work that way.
A single agent task (not a user session, a single task) might involve a query decomposition step, three parallel web lookups, an embedding search against your knowledge base, a synthesis pass over the retrieved content, and a critique loop before the final response goes back. That whole sequence runs as one continuous process. The GPU worker holding it can't be recycled between steps. The model, the context window, and the intermediate state all stay in VRAM until the task is done.
That means a five-step research agent isn't five inference calls. It's one session that runs for 30–90 seconds, holds 4–16GB of VRAM, and branches differently every run depending on what the tools return.
Put ten concurrent users through that pipeline and you have a resource contention problem that cannot be addressed at the model layer.
In standard inference, you size for model weights. In agentic workflows, you are sizing for the Model + KV Cache.
Agents are dynamic memory consumers. During a reasoning chain, the KV cache grows with the context window and with every tool output appended to it. In long-running sessions with large contexts, the KV cache can exceed the size of the model weights themselves. Leaving the default GpuGroup.ANY setting in place is a common mistake here; it treats your agent like a stateless API and can land it on hardware that lacks the VRAM to support the growing context window of a complex session.
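To make the sizing argument concrete, here is a rough estimate of KV cache size for a decoder-only transformer. It is a back-of-the-envelope sketch, not a measurement: the model shape (32 layers, 32 KV heads, head dimension 128, FP16) is a hypothetical 7B-class configuration chosen for illustration, and real serving stacks add their own overhead or reduce the cache with techniques like grouped-query attention.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size for a decoder-only transformer.

    The leading factor of 2 covers the separate key and value tensors
    stored per layer. bytes_per_value=2 corresponds to FP16/BF16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def to_gib(num_bytes):
    """Convert bytes to GiB for readability."""
    return num_bytes / 1024**3


# Hypothetical 7B-class shape with full multi-head attention (assumption).
one_session = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                             seq_len=32_768)
print(f"KV cache for one 32k-token session: {to_gib(one_session):.1f} GiB")
```

Under these assumptions a single 32k-token session needs 16 GiB of cache, which is more than the roughly 14 GB of FP16 weights for a 7B model, and that is before a second concurrent session is scheduled on the same worker.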
Pin a specific GPU type (for example, GpuType.NVIDIA_A100_80GB_PCIe) to ensure predictable VRAM availability.
Developers often configure workers=(0, n) for development, assuming the cloud will handle auto-scaling. In production, this causes a catastrophic "cold start" penalty.
If an endpoint scales to zero, the first user request triggers a cold boot of the GPU worker. This can take 20–90 seconds. If a user is waiting synchronously for an agent to "think," they will have refreshed the page or left before your worker is even ready.
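One way to pick a non-zero minimum is Little's law: average concurrent sessions equal the arrival rate multiplied by the average session duration. The sketch below applies it to agent sessions. The function name, the `headroom` factor, and the one-session-per-worker default are assumptions for illustration, and the result is a steady-state average that does not account for bursts.

```python
import math


def min_warm_workers(requests_per_minute, avg_session_seconds,
                     sessions_per_worker=1, headroom=1.2):
    """Estimate how many workers to keep warm for a given load.

    Little's law: concurrent sessions = arrival rate * session duration.
    headroom is an assumed safety margin on top of the average.
    """
    concurrent_sessions = requests_per_minute * avg_session_seconds / 60.0
    needed = concurrent_sessions * headroom / sessions_per_worker
    return max(1, math.ceil(needed))


# Two agent tasks per minute, each holding a worker for about 60 seconds.
print(min_warm_workers(requests_per_minute=2, avg_session_seconds=60))
```

Because agent sessions run for tens of seconds rather than milliseconds, even a modest request rate translates into several workers that need to be warm at all times.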
An inference request is bounded; an agent session is open-ended. Agents can get stuck in infinite reasoning loops, repetitive generation patterns, or hang while waiting for a tool response from an external API.
Without an explicit ceiling, zombie jobs pin your GPU indefinitely, starving your healthy agents of compute resources.
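An application-level budget complements any platform timeout. The sketch below caps both the number of reasoning steps and the wall-clock time of an agent loop. The `step_fn(state) -> (new_state, done)` interface and the `BudgetExceeded` exception are hypothetical names introduced for this example, not part of any SDK.

```python
import time


class BudgetExceeded(Exception):
    """Raised when an agent run exhausts its step or time budget."""


def run_agent(step_fn, state, max_steps=8, deadline_seconds=60.0,
              clock=time.monotonic):
    """Run step_fn until it reports completion or a budget runs out.

    step_fn(state) returns (new_state, done). The clock is injectable
    so the deadline logic can be tested without sleeping.
    """
    started = clock()
    for step in range(max_steps):
        # Checked between steps only; a step that hangs internally is
        # not interrupted here and still needs an outer timeout.
        if clock() - started > deadline_seconds:
            raise BudgetExceeded(
                f"deadline of {deadline_seconds}s passed at step {step}")
        state, done = step_fn(state)
        if done:
            return state
    raise BudgetExceeded(f"no result after {max_steps} steps")
```

The step cap catches infinite reasoning loops, and the deadline catches slow drift across many steps. Neither can interrupt a single call that hangs on an external API, which is why a hard ceiling at the endpoint level is still needed.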
Set an explicit execution_timeout_ms. A 60-second limit is standard for real-time endpoints.
These three problems (memory pressure, cold start latency, and runaway jobs) compound differently depending on how you structure your agent workers.
Production systems typically use a hybrid pattern. A stateful orchestrator manages the "brain" of the agent, decomposing tasks, managing the context window, and maintaining the reasoning loop. Because this holds the model state, it requires pinned hardware and high availability.
Parallelizable sub-tasks (like web scraping, embedding retrieval, or data formatting) should be sent to stateless worker endpoints.
Stateful workers require session affinity in your load balancing. If a worker dies, the session context is wiped. You must implement a "rehydration" strategy: if a node fails, your orchestrator must be able to pull intermediate state from an external cache (like Redis) to resume.
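A minimal sketch of that rehydration pattern follows. To keep it self-contained, `InMemoryStore` stands in for the external cache; a Redis client exposes comparable `get` and `set` calls, but the key format, the helper names, and the assumption that session state is JSON-serializable are all illustrative choices, not a prescribed design.

```python
import json


class InMemoryStore:
    """Stand-in for an external cache such as Redis (assumption)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def checkpoint(store, session_id, next_step, state):
    """Persist the index of the next step to run plus the state so far."""
    store.set(f"agent:{session_id}",
              json.dumps({"next_step": next_step, "state": state}))


def rehydrate(store, session_id):
    """Return (next_step, state), or a fresh start if nothing was saved."""
    raw = store.get(f"agent:{session_id}")
    if raw is None:
        return 0, {}
    saved = json.loads(raw)
    return saved["next_step"], saved["state"]


def run_session(store, session_id, steps):
    """Run steps in order, checkpointing after each completed step."""
    start, state = rehydrate(store, session_id)
    for index in range(start, len(steps)):
        state = steps[index](state)
        checkpoint(store, session_id, index + 1, state)
    return state
```

If a worker dies mid-session, calling `run_session` again with the same `session_id` on a new worker skips the steps that already completed. Only the serialized intermediate state is recovered this way; the KV cache on the failed worker is gone and has to be rebuilt by re-processing the restored context.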
These three infrastructure problems each have a corresponding configuration mistake on Runpod Serverless. Addressed together, they're straightforward.
For memory: The default is GpuGroup.ANY, which is the fastest provisioning option and the right call in development. In production, it means the GPU type varies between runs. If your agent needs 24GB minimum and gets assigned hardware that doesn't have it, the job starts, makes progress, runs out of memory, and fails silently. Pin a specific GPU type instead. Use GpuType.NVIDIA_A100_80GB_PCIe for 70B parameter models or workloads with large context windows; note that a 70B model fits on a single 80GB card only when quantized, since its FP16 weights alone are ~140GB. Use GpuType.NVIDIA_GEFORCE_RTX_4090 for 7–13B parameter models where cost is the constraint.
For concurrency: The workers=(0, n) setting scales to zero and is appropriate for offline batch pipelines, document processing, or anything where queue latency is acceptable. For production APIs handling user-facing requests, a non-zero minimum is required. See the table above for the right setting for your use case.
For runaway jobs: Every endpoint needs an explicit timeout. 60 seconds (60000ms) is a reasonable starting point for real-time API endpoints. For batch jobs that legitimately run long, 600 seconds (600000ms) is more appropriate. The specific number matters less than having one, as this setting turns an indefinite hang into a clean failure your application can handle.
Here is the blueprint for a hybrid agent pipeline.
Use this for the reasoning layer. It needs to be warm and ready.
api = Endpoint(
    name="agent-orchestrator",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    datacenter=DataCenter.US_GA_2,
    workers=(3, 20),
    idle_timeout=1800,  # Keep workers warm between session turns
    execution_timeout_ms=120000,  # 2 min ceiling
    volume=NetworkVolume(name="model-weights", size=50, datacenter=DataCenter.US_GA_2)
)
What each setting is doing:
A note about the size value. size=50 is illustrative. The actual value depends on model weight size. For a 70B model in FP16, model weights alone are ~140GB, giving a realistic value of size=200. You should size for your specific model.
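The arithmetic behind those numbers is simple enough to script. This is a rough sizing helper, assuming the volume mainly holds model weights: the 1.4× headroom factor (for tokenizer files, alternate checkpoints, and similar extras) and the rounding to 50GB increments are assumptions made for illustration, not platform requirements.

```python
import math


def weights_gb(num_params, bytes_per_param=2):
    """Size of the raw weights in GB; 2 bytes per parameter is FP16."""
    return num_params * bytes_per_param / 1e9


def volume_size_gb(num_params, bytes_per_param=2, headroom=1.4, round_to=50):
    """Suggest a volume size with headroom, rounded up to a round number."""
    needed = weights_gb(num_params, bytes_per_param) * headroom
    return int(math.ceil(needed / round_to) * round_to)


print(weights_gb(70e9))      # FP16 weights for a 70B model, in GB
print(volume_size_gb(70e9))  # suggested volume size for that model
print(volume_size_gb(7e9))   # suggested volume size for a 7B model
```

With these assumptions the helper reproduces both figures from the note: about 140GB of weights and a 200GB volume for a 70B model in FP16, and 50GB for a 7B model.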
Use this for non-GPU tasks, such as web scraping, data formatting, and external API calls. Move scraping and API calls to CPU endpoints to avoid VRAM contention.
@Endpoint(
    name="web-retrieval",
    cpu="cpu5c-4-8",  # CPU endpoint; no GPU needed for web scraping
    workers=(0, 10),  # Scale to zero; web retrieval tolerates queue latency
    idle_timeout=300  # 5 min; scale down fast between fan-out bursts
)
def scrape(data): ...
The CPU endpoint for web retrieval is worth calling out specifically. Most developers default to GPU for everything because that's what the development environment uses. For a tool that's doing HTTP requests and HTML parsing, you're paying GPU rates for a workload that runs fine on CPU. Separating it out reduces cost per fan-out call and frees GPU capacity for the steps that actually need it.
Here's what the full architecture looks like for a research and synthesis agent that takes a user query, searches the web, retrieves from an internal knowledge base, and synthesizes a response with a critique pass.
User query
    │
    ▼
Orchestrator endpoint
(workers=(3,20), A100, network volume)
    │
    ├──► Query decomposition [sub-step, same worker]
    │        └──► 3 sub-queries
    │
    ├──► Web retrieval × 3 [parallel fan-out]
    │        (CPU endpoint, workers=(0,10))
    │        └──► Raw source text per query
    │
    ├──► Embedding search [queue-based]
    │        (RTX 4090, workers=(1,10))
    │        └──► Relevant chunks from internal KB
    │
    └──► Synthesis + critique [same orchestrator worker]
             └──► Final response
The query decomposition and synthesis steps run on the orchestrator worker because they're tightly coupled to the session state; decomposition sets up the fan-out and synthesis reads everything the fan-out returns. Keeping them on the same stateful worker avoids serializing and passing that state over the network.
The web retrieval and embedding search steps are inherently parallelizable and stateless, so they go to separate endpoints. They don't touch the orchestrator's VRAM, they can scale independently, and their failures are contained, meaning that a stuck web scraper times out at 30 seconds and the pipeline handles it gracefully, rather than holding the orchestrator worker hostage.
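On the orchestrator side, containing those failures can look like the sketch below: each fan-out call gets its own timeout, and a timed-out call yields `None` instead of stalling the rest. Here `tool` is a hypothetical async callable standing in for whatever client invokes the retrieval endpoint, and the 30-second default mirrors the figure mentioned above.

```python
import asyncio


async def fan_out(tool, queries, timeout_seconds=30.0):
    """Run tool(query) for every query in parallel with a per-call timeout.

    Returns a dict mapping each query to its result, or to None if that
    call timed out. Other exceptions are not swallowed and propagate.
    """
    async def guarded(query):
        try:
            return await asyncio.wait_for(tool(query), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            return None

    results = await asyncio.gather(*(guarded(q) for q in queries))
    return dict(zip(queries, results))
```

The synthesis step can then work with whichever sources came back and note the gaps, rather than waiting on the slowest sub-query while the orchestrator's GPU sits idle.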
Before anything like this goes to production, run through this list:
A pinned GPU type, not GpuGroup.ANY, on every production endpoint
If you're starting from scratch, the serverless endpoint builder is the fastest way to test the configuration options: deploy one endpoint, adjust the settings, and run it locally with flash dev until the behavior matches what you need.
If you're designing a more complex pipeline, perhaps with multiple agents, long-running autonomous workflows, or pipelines with more than a few tool hops, Deploying AI Agents covers the deployment patterns for those architectures in more depth.
And if you want to step back and think through how orchestration, memory, and tool routing fit together at the design level before you write any configuration, Agentic AI Workflows Explained is the right starting point.