Deploy CrewAI Multi-Agent Systems on Runpod: CPU Orchestration + GPU Inference on One Platform

Imagine a five-step CrewAI Flow running on a single container that bundles CPU orchestration and GPU inference together. The first step kicks off, the LLM endpoint accepts the call, and the worker takes 4 seconds to respond. Steps two and three run against the same worker, each adding its own wait. By the fifth step you have spent twenty-plus seconds, your timeout budget is gone, and the errors the system returns point everywhere and nowhere.

Why Your CrewAI Deployment Fails Under Load

The monolithic container failure mode works like this: when orchestration logic and inference serving run in the same container, they compete for the same resources and share the same scaling ceiling. Adding more CPU does not help the GPU bottleneck. Throwing more GPU at the problem can waste money when CPU orchestration threads are the bottleneck: GPU cores sit underutilized while Python threads queue for CPU time (Why LLM Inference Has Low GPU Utilization). Because both layers log to the same stdout, distinguishing an orchestration bottleneck from an inference bottleneck in the logs becomes difficult.

Two concrete failure modes are worth naming.

Cold-start latency compounds across Flow steps when workers scale to zero. A CrewAI Flow can execute sequentially; in the simple chaining pattern, each @listen step waits on the output of the step it listens to, beginning with the @start step. If your GPU workers scale to zero between requests, each step in that Flow can trigger its own cold start. A Flow with five steps and an 8-second cold start per step spends 40-plus seconds on cold starts alone before it returns a result. With workers kept warm by traffic, cold starts do not compound, but queue latency still adds up if your worker count is lower than your number of concurrent requests. Once that accumulated latency approaches your request timeout, it shows up as failed jobs, not just slow ones.
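The compounding is plain arithmetic, and a rough sketch makes the two regimes easy to compare. The function name and the 2-second inference time per step are illustrative assumptions, not measurements.

```python
def flow_latency_s(steps: int, inference_s: float, cold_start_s: float,
                   cold_on_every_step: bool) -> float:
    """Estimate end-to-end latency of a sequential Flow, in seconds."""
    # Workers that scale to zero between calls pay the cold start on every
    # step; workers kept warm by traffic pay it at most once, on the first step.
    cold_starts = steps if cold_on_every_step else 1
    return steps * inference_s + cold_starts * cold_start_s


# Five steps, an 8 s cold start, and an assumed 2 s of inference per step.
worst_case = flow_latency_s(5, 2.0, 8.0, cold_on_every_step=True)
warm_case = flow_latency_s(5, 2.0, 8.0, cold_on_every_step=False)
print(worst_case, warm_case)  # 50.0 18.0
```

The model ignores queueing and network time, so treat the numbers as a lower bound on what a caller would observe.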

Per-layer observability is absent in a monolithic setup. When you see a 30-second p95 latency in your logs, you cannot tell if the CPU coordination layer spent 28 seconds routing between agents or the vLLM worker spent 28 seconds on token generation. Without separate endpoints, there is no metric surface to attach alerts to. You are debugging by guessing.

The solution: separate the CPU orchestration plane from the GPU inference plane and deploy them as independent Runpod Serverless endpoints that each scale to their own demand.

The Two-Layer Architecture for CrewAI on Runpod

Runpod runs both layers of this architecture (CPU orchestration and GPU inference) as serverless endpoints under a single console, a single billing account, and a unified monitoring dashboard (Deploying AI Agents at Scale). You do not have to wire your orchestrator to a separate GPU cloud and hope the cross-provider latency is acceptable. Everything runs under one roof.

The CPU plane runs your CrewAI Flow logic. A handler.py using the standard Runpod handler pattern wraps your Flow and exposes it as a Runpod Serverless CPU endpoint. This layer handles the event["input"] parsing that triggers each Flow execution and serves as the coordination point for job dispatch and Flow state management. It scales on request volume, not GPU availability, which matters when orchestration is cheap and inference is expensive.

The GPU plane runs vLLM behind an OpenAI-compatible API, deployed from the vLLM worker template in Runpod Hub. Your agents call this endpoint at https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1 exactly as they would call the OpenAI API, with the same base_url pattern, model parameter structure, and response schema. This layer scales on inference demand, completely independent of the orchestration layer.

Sizing max_workers on the GPU endpoint. This is where readers most often over-provision. A standard sequential CrewAI Flow (one @start, then a chain of @listen steps) has exactly one LLM call in flight at a time per Flow execution. For a single sequential Flow per job, max_workers=1 on the GPU endpoint is sufficient. You only need more when:

  • Your Flow has parallel branches (using or_() / and_() fan-in, or independent @start methods that fire concurrently), in which case set max_workers to the maximum number of simultaneous LLM calls a single Flow execution produces.
  • You are dispatching multiple concurrent Flow jobs to the CPU endpoint, in which case set max_workers to your expected concurrent job count.

Set it lower than your real concurrency and you have moved the queue from hidden to visible: the queue still exists, but now it shows up in your delay-time metrics (Mastering Serverless Scaling on Runpod).
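The sizing rules above can be folded into one small helper. This is a sketch under two assumptions that are not from the rules themselves: that the two cases combine by multiplication (every concurrent job hitting its widest parallel branch at the same moment), and that each worker serves one request at a time. vLLM can batch several requests on one worker, so the result is best read as an upper bound.

```python
def suggested_max_workers(peak_llm_calls_per_flow: int = 1,
                          concurrent_flow_jobs: int = 1) -> int:
    """Upper-bound estimate of simultaneous LLM calls hitting the GPU endpoint."""
    if peak_llm_calls_per_flow < 1 or concurrent_flow_jobs < 1:
        raise ValueError("both arguments must be at least 1")
    # Worst case: every concurrent job is in its widest parallel section.
    return peak_llm_calls_per_flow * concurrent_flow_jobs


print(suggested_max_workers())        # 1  (one sequential Flow per job)
print(suggested_max_workers(3, 4))    # 12 (3 parallel branches, 4 jobs)
```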

The two-endpoint setup is a starting point. You can extend it by pointing routing or classification agents at a lower-cost endpoint (such as an RTX 4090-backed one) and reasoning agents at a higher-memory endpoint like an A100 or H100. Each gets its own max_workers budget and GPU tier. The Production Hardening section covers the cost math on this multi-endpoint pattern.
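One way to keep the multi-endpoint wiring readable is a small lookup from agent kind to endpoint URL. The sketch below is illustrative: the endpoint IDs are placeholders and the "routing" / "reasoning" labels are hypothetical names, not CrewAI or Runpod concepts.

```python
RUNPOD_OPENAI_URL = "https://api.runpod.ai/v2/{endpoint_id}/openai/v1"

# Placeholder endpoint IDs; replace with the IDs shown in your Runpod console.
ENDPOINT_IDS = {
    "routing": "your-rtx4090-endpoint-id",    # cheap tier for classification
    "reasoning": "your-a100-endpoint-id",     # high-memory tier for reasoning
}


def base_url_for(agent_kind: str) -> str:
    """Return the OpenAI-compatible base URL for a given kind of agent."""
    try:
        endpoint_id = ENDPOINT_IDS[agent_kind]
    except KeyError:
        raise ValueError(f"Unknown agent kind: {agent_kind!r}") from None
    return RUNPOD_OPENAI_URL.format(endpoint_id=endpoint_id)


print(base_url_for("routing"))
```

Each returned URL would then be used as the base_url of a separate LLM object, one per agent group.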

Step 1: Deploy the vLLM Inference Endpoint (GPU Layer)

In the Runpod console, navigate to Serverless > New Endpoint. Under templates, select the vLLM worker from Runpod Hub. This pre-built worker handles the vLLM process lifecycle, health checks, and the OpenAI-compatible routing layer, so you are not configuring any of that yourself.

Set HUGGING_FACE_HUB_MODEL_ID to your target model (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct). If the model requires authentication, set HUGGING_FACE_HUB_TOKEN as a secret in the endpoint configuration.

GPU tier selection for CrewAI agents

GPU Tier | Best For
RTX 4090 (24GB VRAM) | Models ≤13B (a 13B model needs quantization to fit in 24GB); routing/classification agents; low-concurrency Flows
A100 80GB | Models 13B–70B (70B with INT4/AWQ quantization); production multi-agent Flows; 4–8 concurrent jobs
H100 80GB | Large context (>32K tokens); high-concurrency systems; reasoning-heavy pipelines

Set max_workers using the sizing rules from the architecture section (max_workers=1 for a sequential Flow). If you expect burst traffic well above your steady state, add headroom.

FlashBoot should be enabled. It is on by default for new endpoints and pre-warms your container so cold starts do not cascade across Flow steps. Do not disable it to “save money”: the latency cost on a cold multi-step Flow is generally higher than the pre-warming overhead, based on Runpod’s published cold-start benchmarks.

Key environment variables for the vLLM template

  • MAX_MODEL_LEN: maximum sequence length; match this to your agent’s context window requirements
  • DTYPE: compute precision; bfloat16 is preferred for both A100 (Ampere) and H100 (Hopper) deployments thanks to native hardware support and better numerical stability; float16 is acceptable if your model or toolchain does not support bfloat16
  • GPU_MEMORY_UTILIZATION: fraction of VRAM allocated to the model; default is 0.9; lower this if you are running out of headroom for concurrent requests
  • QUANTIZATION: accepts awq, gptq, or bitsandbytes; loading a pre-quantized INT4 checkpoint cuts weight memory roughly 4x vs FP16, which matters on RTX 4090 for 13B+ models; vLLM does not quantize on the fly, so QUANTIZATION=awq requires HUGGING_FACE_HUB_MODEL_ID to point at an already-AWQ-quantized checkpoint
  • OPENAI_SERVED_MODEL_NAME_OVERRIDE: if your CrewAI agent config specifies a model name like "gpt-4" or "llama-3-8b", set this to that exact string so vLLM responds to it without a model-not-found rejection. Confirm the override resolves on your vLLM version before you depend on it in production.

Once the endpoint is live, the CrewAI LLM config is a few lines:

from crewai import LLM
import os

llm = LLM(
    model="openai/meta-llama/Meta-Llama-3.1-8B-Instruct",
    base_url="https://api.runpod.ai/v2/{YOUR_ENDPOINT_ID}/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

The openai/ prefix is required: it tells CrewAI to route through its OpenAI-compatible client to your base_url rather than treating meta-llama as a provider. The rest of the string is the model name your vLLM endpoint serves (the HUGGING_FACE_HUB_MODEL_ID, unless you set OPENAI_SERVED_MODEL_NAME_OVERRIDE). Pass this llm to your agents (Agent(llm=llm, ...)); your tools and task definitions are untouched. Swap the config, keep everything else.

Step 2: Build the CrewAI Flow Handler (CPU Layer)

CrewAI Flows are the production-grade orchestration construct: event-driven, composable, and built for multi-step pipelines where outputs chain forward. A minimal Flow skeleton instantiates each crew inside the step that runs it:

from crewai.flow.flow import Flow, listen, start
from pydantic import BaseModel
from my_crews import ResearchCrew, ReportCrew  # standard CrewAI crew classes

class ResearchState(BaseModel):
    topic: str = ""
    research_output: str = ""
    final_report: str = ""

class ResearchFlow(Flow[ResearchState]):
    @start()
    def run_research(self):
        result = ResearchCrew().crew().kickoff(inputs={"topic": self.state.topic})
        self.state.research_output = result.raw
        return result.raw

    @listen(run_research)
    def compile_report(self, research_output):
        result = ReportCrew().crew().kickoff(inputs={"research": research_output})
        self.state.final_report = result.raw
        return result.raw

ResearchCrew and ReportCrew are ordinary CrewAI crew classes (the kind crewai create scaffolds), each exposing a .crew() method that returns a configured Crew. Each @listen step triggers when its upstream @start (or another @listen) completes. The orchestration logic is pure Python, cheap to run on CPU. The expensive work happens at the vLLM endpoint when kickoff() fires an agent’s LLM call.

The handler.py: Runpod Serverless entry point

Initialize shared resources outside the handler function. This is one of the most impactful optimizations in the entire setup.

Every time your handler runs, any code inside it executes from scratch. If you initialize a model, open a database connection, or load a large file inside the handler, you pay that cost on every single request. For a 7B parameter model, that can mean 20–40 seconds of cold setup on what should be a sub-second inference call, depending on storage configuration.

Move those initializations outside the handler, at module level. Runpod’s worker process loads your module once, then calls the handler repeatedly. Anything at module level gets initialized during that first load and stays resident in memory for the lifetime of the worker.

The same principle applies to database connections, file handles, and SDK clients. Opening a new Postgres connection on every invocation will exhaust your connection pool under any real load. Initialize the connection pool at module level, pass it into your handler logic, and you are done.

One genuine friction point: if your initialization code raises an exception, the worker may get stuck in an initializing state in some configurations without clear error reporting. Add explicit logging around your startup code so you can distinguish a bad initialization from a bad request.
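A minimal sketch of that startup logging, using only the standard library. The helper name init_or_fail_loudly and the settings dict are hypothetical stand-ins for whatever clients or pools your worker actually creates.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orchestrator.startup")


def init_or_fail_loudly(name, factory):
    """Run one module-level initializer and log its outcome."""
    logger.info("Initializing %s", name)
    try:
        resource = factory()
    except Exception:
        # logger.exception records the full traceback, so a failed startup
        # is visible in the worker logs instead of looking like a hang.
        logger.exception("Initialization of %s failed", name)
        raise
    logger.info("Initialized %s", name)
    return resource


# Module level: runs once when the worker loads this file.
# The dict below stands in for a real client or connection pool.
SETTINGS = init_or_fail_loudly("settings", lambda: {"request_timeout_s": 30})
```

Re-raising after logging keeps the failure fatal, which is usually what you want: a worker that cannot initialize should not start accepting jobs.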

A note for AWS Lambda veterans. Runpod workers persist between invocations. Unlike Lambda, the Python process stays alive and global variables retain their values, so the assumption that each request gets a fresh execution environment does not hold here. The module-level pattern above is exactly what this persistence rewards.

Module-scope initialization has one important exception: the Flow itself. A Flow carries per-request data in its state object, so a single module-scope Flow reused across jobs would leak one job’s output into the next. The documented CrewAI pattern instantiates the Flow per run and sets its state fields before kickoff(), which keeps each job’s state isolated. Reserve module scope for the genuinely expensive, stateless resources (SDK clients, database connection pools, the LLM object pointing at your vLLM endpoint), and create the Flow inside the handler:

import runpod
from my_flow import ResearchFlow

def handler(event):
    job_input = event.get("input", {})

    # Validate inputs before any agent work starts.
    topic = job_input.get("topic")
    if not topic:
        return {"error": "Missing required input: 'topic'"}

    # A fresh Flow per job keeps one invocation's state out of the next.
    flow = ResearchFlow()
    flow.state.topic = topic
    flow.kickoff()

    return {"report": flow.state.final_report}

runpod.serverless.start({"handler": handler})

Instantiating the Flow per job costs only a few milliseconds, because the heavy resources already live at module scope and the model itself runs on the GPU endpoint, not in this worker. Module-scope initialization of those shared clients is what reduces per-request overhead from seconds to milliseconds, the difference between a usable API and a slow one.

Dockerfile for the CPU orchestration container

FROM runpod/base:1.0.6

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "handler.py"]

Build with --platform linux/amd64, which is a hard requirement for Runpod Serverless:

docker build --platform linux/amd64 -t your-dockerhub-username/crewai-orchestrator:latest .
docker push your-dockerhub-username/crewai-orchestrator:latest

Local testing before cloud deployment

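The Runpod Python SDK can run your handler against a test payload on your own machine. Because handler.py already calls runpod.serverless.start(), no separate test script is needed; pass the payload with the --test_input flag:

python handler.py --test_input '{"input": {"topic": "quantum computing applications in logistics"}}'

Alternatively, save the same JSON as test_input.json next to handler.py and run python handler.py with no flags; the SDK picks the file up when it is not running inside a Runpod worker.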

Run this before pushing to Docker Hub. It catches event["input"] key mismatches, missing environment variables, and serialization errors that are invisible until they surface as silent job failures in the Runpod console. Catching them locally saves a full deploy-push-test cycle per bug.

Step 3: Deploy the Orchestration Layer on Runpod Serverless CPU

Build and push your Docker image to Docker Hub or ECR. Alternatively, use Runpod’s Import Git Repository option to point directly at a GitHub repo. Note that worker rebuilds are not automatic on push to main. Per the official GitHub integration docs, you create a GitHub release to trigger an endpoint rebuild. That release-based workflow still removes manual docker build and docker push from the loop; it just gates the trigger on an intentional release rather than every commit.

In the Runpod console, navigate to Serverless > New Endpoint > Import from Docker Registry. Paste your image URL. Then choose your endpoint type:

  • Load Balancer: direct HTTP routing to available workers; callers POST a job and receive a result without queueing; right choice for direct API integrations and user-facing pipelines that need low-latency, request-response access
  • Queue-based: async job dispatch; right choice for batch workloads, background processing, or large-scale Flows where you submit many jobs and retrieve results asynchronously (Runpod Load Balancing Overview)
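For the queue-based type, the caller submits a job and then polls until it finishes. The sketch below shows only the polling loop and takes the HTTP call as an injected function so it runs without network access. The status strings and the /status/{JOB_ID} path reflect Runpod's queue API as commonly documented; treat them as assumptions and confirm against the current API reference.

```python
import time

# Assumed terminal job states for a queue-based endpoint.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def wait_for_job(fetch_status, poll_interval_s=2.0, max_polls=150,
                 sleep=time.sleep):
    """Poll until the job reaches a terminal status; return the last payload.

    fetch_status is any zero-argument callable returning the status payload
    as a dict, for example a function that GETs
    https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}.
    """
    for _ in range(max_polls):
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        sleep(poll_interval_s)
    raise TimeoutError("Job did not reach a terminal status in time")


# Simulated status sequence so the sketch runs offline.
_responses = iter([
    {"status": "IN_QUEUE"},
    {"status": "IN_PROGRESS"},
    {"status": "COMPLETED", "output": {"report": "..."}},
])
result = wait_for_job(lambda: next(_responses), sleep=lambda _s: None)
print(result["status"])  # COMPLETED
```

A load-balancer endpoint needs none of this: the caller gets the result in the HTTP response itself.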

Flex workers vs. active workers: the cost decision

Flex Workers scale to zero when no jobs are queued, so you pay nothing between jobs (Runpod Serverless pricing). This is the right choice for bursty pipelines, such as research runs, content generation, or periodic analysis, where traffic is unpredictable and 24/7 availability is not required.

Active Workers run around the clock with zero cold starts. Runpod prices Active Workers roughly 40 percent below the Flex hourly rate. For user-facing agents with consistent traffic and latency SLAs, like a live assistant, a real-time analysis tool, or a customer-facing chatbot, Active Workers pay for themselves. The exact breakeven depends on your utilization rate and the current Active Worker discount. Compare daily cost: (active_hourly_rate × 24) versus (flex_hourly_rate × active_hours_per_day), where active_hours_per_day is the number of hours per day a Flex Worker would actually spend running. Plug in the current pricing and your expected traffic to confirm.
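Setting the two daily costs equal and solving for the busy hours gives the breakeven point. The rates below are placeholders that only encode the roughly 40 percent discount; substitute current pricing.

```python
def breakeven_hours_per_day(flex_hourly_rate: float,
                            active_hourly_rate: float) -> float:
    """Daily busy hours above which an always-on Active Worker is cheaper."""
    # Active cost per day: active_hourly_rate * 24 (billed around the clock).
    # Flex cost per day:   flex_hourly_rate * busy_hours.
    # Equating the two and solving for busy_hours gives the breakeven.
    return 24 * active_hourly_rate / flex_hourly_rate


# Placeholder rates: a Flex rate of 1.00 and an Active rate 40 percent lower.
print(round(breakeven_hours_per_day(1.00, 0.60), 1))  # 14.4
```

With a 40 percent discount, an endpoint that is busy more than about 14.4 hours a day costs less on Active Workers; below that, Flex is cheaper on price alone, before counting the value of avoiding cold starts.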

If your Flow routinely runs for extended periods (multiple agents, multi-hop research, long tool call chains), consider a persistent Runpod Pod instead of Serverless, which is designed for stateless on-demand workloads. Pods bill per second, and /workspace gives you a persistent volume. One caveat: CrewAI Flows do not automatically persist state to disk. To resume from the last successful @listen after a failure, serialize state explicitly at each step (for example, Path('/workspace/state.json').write_text(self.state.model_dump_json()), which also closes the file for you) and implement resume logic that loads the last saved state at startup and skips already-completed steps. With that scaffolding in place, a retry on a failed downstream step can resume from the last checkpoint instead of re-running the entire Flow from scratch.
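The checkpoint-and-resume scaffolding can be sketched with the standard library alone. Everything here is illustrative: FlowCheckpoint is a dataclass standing in for the Flow's Pydantic state, the completed_steps field and the helper names are additions for this example, and the demo writes to a temporary directory where a Pod would use a path under /workspace.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class FlowCheckpoint:
    """Stand-in for the Flow state plus a record of finished steps."""
    topic: str = ""
    research_output: str = ""
    final_report: str = ""
    completed_steps: list = field(default_factory=list)


def save_checkpoint(state: FlowCheckpoint, path: str) -> None:
    # Write to a temp file and rename it, so a crash mid-write never
    # leaves a truncated checkpoint behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(asdict(state), f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> FlowCheckpoint:
    # No file yet means a first run: start from an empty state.
    if not os.path.exists(path):
        return FlowCheckpoint()
    with open(path) as f:
        return FlowCheckpoint(**json.load(f))


def run_step(state, name, step_fn, path):
    """Run step_fn unless an earlier run already completed this step."""
    if name in state.completed_steps:
        return
    step_fn(state)
    state.completed_steps.append(name)
    save_checkpoint(state, path)


# Demo in a temporary directory; on a Pod the path would sit under /workspace.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "state.json")
    state = load_checkpoint(path)
    run_step(state, "research",
             lambda s: setattr(s, "research_output", "notes"), path)
    resumed = load_checkpoint(path)  # what a restarted process would see
    print(resumed.completed_steps)  # ['research']
```

Saving only after a step succeeds means a crash inside a step re-runs that step on resume, so steps should be safe to repeat.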

With both endpoints deployed and verified, the architecture is live. The next section covers the operational layer: monitoring, cost controls, and failure-mode handling for a production system.

Production Hardening: Observability, Cost Control, and Failure Modes

A two-endpoint deployment that runs in a demo is not yet one you can operate. Three concerns separate the two: knowing which layer is slow when latency climbs, keeping the GPU bill proportional to the work, and making sure a transient failure does not take down a whole Flow. The rest of this section covers each in turn.

Monitoring your two-layer CrewAI system

The Runpod dashboard surfaces per-endpoint metrics: execution time and delay time (how long jobs waited before a worker picked them up), with P70/P90/P98 percentile breakdowns. Watch delay time specifically. A rising delay time on the GPU endpoint is a strong indicator that max_workers may be too low for your concurrent request rate. A rising delay time on the CPU endpoint suggests orchestration workers are queuing; check your CPU endpoint’s worker scaling settings if this occurs under high-concurrency Flow dispatch.
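If you also collect delay-time samples yourself, the same check the dashboard supports by eye can be automated. This sketch assumes you already have a list of delay times in seconds for one endpoint; the 5-second threshold and the sample values are made up for illustration, and the percentile uses the nearest-rank method, which may differ slightly from how the dashboard computes its figures.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def delay_time_alert(delay_times_s, p=90, threshold_s=5.0):
    """Return True when the chosen delay-time percentile exceeds the threshold."""
    return percentile(delay_times_s, p) > threshold_s


# Hypothetical delay-time samples in seconds for one endpoint.
samples = [0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 6.5, 7.0, 9.0]
print(percentile(samples, 90), delay_time_alert(samples))  # 7.0 True
```

Running the check separately per endpoint is what tells you which layer is queuing.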

For active debugging, Runpod supports SSH access into running workers. You can connect to a live worker mid-execution without redeployment, which is useful for inspecting agent state, tracing a stuck @listen step, or verifying that the vLLM endpoint is reachable from inside the worker’s network context.

GPU cost optimization for multi-agent pipelines

The multi-endpoint pattern is a cost lever as much as an architecture decision. A routing agent that classifies incoming jobs (low token count, simple prompts, fast turnaround) typically does not need high-end datacenter GPU memory; a lower-cost tier is sufficient. Deploy it on an RTX 4090 with max_workers set to match its concurrent demand. Point your reasoning agents at an A100 endpoint sized to their own load. Two endpoints sized to their respective workloads will generally cost less and perform better than one A100 endpoint trying to serve both profiles.

For models, QUANTIZATION is your biggest single cost reduction. An AWQ-quantized Llama 3.1 70B Instruct fits on one 80GB GPU. Loading the pre-quantized INT4 checkpoint cuts weight memory roughly 4x against a half-precision (FP16 or bfloat16) baseline, with minimal quality loss on most instruction tasks. AWQ quantizes weights only, so the saving is in memory footprint and memory traffic, not in arithmetic; measure throughput on your own workload before counting on a speedup. As Step 1 noted, the checkpoint has to be quantized already; search Hugging Face for [model-name]-AWQ to find pre-quantized variants for most popular models.
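The 4x figure follows directly from bits per weight. A back-of-the-envelope helper (the name is illustrative) covers weights only; KV cache, activations, and quantization metadata such as scales add to the total, so leave headroom.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


print(weight_memory_gb(70, 16))  # 140.0  (half precision: too big for one 80GB GPU)
print(weight_memory_gb(70, 4))   # 35.0   (INT4: fits with room for KV cache)
```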

Handler-level reliability patterns

Three patterns prevent silent failures at the orchestration layer.

Validate every key in event["input"] at the top of handler() and return structured errors before any agent work starts. If a required input is missing, fail fast and explicitly rather than letting the Flow propagate a None reference to a deep step and surface a cryptic exception with no context.
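A small validation helper keeps that check in one place as the input grows beyond a single key. This is a sketch: the function name and the error wording are choices made for this example, with the missing-key message matching the one the handler already returns.

```python
def validate_input(job_input, required):
    """Return a structured error dict, or None when the input is valid.

    required maps each key to the type its value must have.
    """
    if not isinstance(job_input, dict):
        return {"error": "Input must be a JSON object"}
    problems = []
    for key, expected_type in required.items():
        value = job_input.get(key)
        if value is None or value == "":
            problems.append(f"Missing required input: '{key}'")
        elif not isinstance(value, expected_type):
            problems.append(
                f"Input '{key}' must be of type {expected_type.__name__}"
            )
    # Report every problem at once so the caller can fix them in one pass.
    return {"error": "; ".join(problems)} if problems else None


print(validate_input({"topic": "logistics"}, {"topic": str}))  # None
print(validate_input({}, {"topic": str}))
```

Inside handler(), the call would sit at the top: run the validation, and return its result immediately if it is not None.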

Implement retry logic at the handler level for transient vLLM endpoint failures. GPU workers can cold-start during low-traffic periods even with FlashBoot. Wrapping the whole job in a retry with exponential backoff handles these transient failures transparently without surfacing errors to end users. This pattern targets cold-start failures, where re-running the job is safe; recovering partial progress after a mid-Flow crash is what the /workspace serialization covered earlier is for. Replace the Step 2 handler with this retry-wrapped version:

import time
import runpod
from my_flow import ResearchFlow

def handler(event):
    job_input = event.get("input", {})
    topic = job_input.get("topic")
    if not topic:
        return {"error": "Missing required input: 'topic'"}

    delay = 2
    for attempt in range(3):
        try:
            # A fresh Flow per attempt means a retry never inherits
            # half-written state from a failed attempt.
            flow = ResearchFlow()
            flow.state.topic = topic
            flow.kickoff()
            return {"report": flow.state.final_report}
        except Exception as e:
            if attempt == 2:
                return {"error": str(e)}
            time.sleep(delay)
            delay = min(delay * 2, 30)

runpod.serverless.start({"handler": handler})

Tune the starting delay and cap to your observed cold-start latency.

Write intermediate state to self.state at each @listen step. Keeping each step’s output in self.state is what makes the /workspace checkpointing above possible: a downstream failure can be resumed from the last serialized step instead of re-running prior successful work. Step failures that do not cascade through the whole pipeline are what separate a system that degrades gracefully from one that fails completely.

Your Next Concrete Action

Deploy the GPU endpoint first. Follow the steps in “Step 1: Deploy the vLLM Inference Endpoint” (Hub template, HUGGING_FACE_HUB_MODEL_ID, GPU tier, max_workers). Before running the test, retrieve your API key from the Runpod console under Settings > API Keys and export it:

export RUNPOD_API_KEY=your_key_here

Then confirm the inference layer is healthy with a direct curl:

curl -X POST https://api.runpod.ai/v2/{YOUR_ENDPOINT_ID}/openai/v1/chat/completions \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, are you working?"}]
  }'

A valid choices[0].message.content response confirms the inference layer is alive and reachable end-to-end. If this call fails, no amount of CrewAI integration code will fix the underlying vLLM layer problem. Resolve the vLLM endpoint issue before moving on to the CPU orchestration layer.
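The same health check can be scripted in Python with the standard library, which is handy in a deploy pipeline. This is a sketch, not an official client: the function names are illustrative, and the request is built separately from sending it so the construction can be inspected without touching the network.

```python
import json
import urllib.request


def build_chat_request(endpoint_id, api_key, model, prompt):
    """Build (but do not send) the same request the curl command issues."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1/chat/completions"
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def first_reply(request, timeout_s=120):
    """Send the request and return choices[0].message.content."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

To run it, build a request with your endpoint ID, the RUNPOD_API_KEY value, the served model name, and a short prompt, then pass it to first_reply(). The generous default timeout allows for a cold start on the first call.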

Once the GPU endpoint passes, build the handler.py container, test it locally with the Runpod SDK, push the image, and deploy the CPU endpoint. You will have a multi-agent system with two independently scalable layers on Runpod, each billed per second of active execution.

Keep the sizing decisions on hand as you build:

  • Sequential Flow: max_workers=1. Parallel branches or concurrent jobs: size to peak simultaneous LLM calls.
  • 70B model: AWQ INT4 on a single 80GB GPU.
  • DTYPE: bfloat16 on A100 and H100.
  • Consistent all-day traffic: weigh Active Workers against Flex on the breakeven math above.
