
Deploy vLLM with Docker on Runpod: Container Config, Model Loading, and Production Tuning

Most vLLM tutorials end where the real work begins. They show you how to pip install vllm and run a quick inference test, then leave you to figure out Docker packaging, GPU selection, network volume caching, and the server flags that actually determine whether your endpoint handles 5 concurrent requests or 500.

This guide covers the complete deploy vLLM Docker Runpod workflow, from pod creation through production tuning, using Llama 3.1 8B Instruct as the reference model and an L40S as the default GPU. The result is a running, OpenAI-compatible inference server you can point real traffic at.

1. Prerequisites and GPU Selection for vLLM on Runpod

Before you start, you need:

  • A Runpod account with billing configured
  • A HuggingFace token with access to meta-llama/Meta-Llama-3.1-8B-Instruct (request access at hf.co/meta-llama)
  • Docker installed locally (only if you’re building a custom image; optional for standard deployments)
  • Basic familiarity with Docker CLI

VRAM requirements are the first constraint to size correctly. Weight-only memory for common model sizes:

| Model size | Precision | Min VRAM (weights only) |
|------------|-----------|-------------------------|
| 7B         | fp16      | ~14 GB                  |
| 7B         | AWQ 4-bit | ~7 GB                   |
| 13B        | fp16      | ~26 GB                  |
| 13B        | AWQ 4-bit | ~13 GB                  |
| 34B        | fp16      | ~68 GB                  |
| 70B        | fp16      | ~140 GB                 |
| 70B        | AWQ 4-bit | ~70 GB                  |

These are floor numbers; KV cache overhead comes on top, which is why --gpu-memory-utilization matters (covered in Section 4).
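As a sanity check, both the weight floor and the per-token KV cache cost can be estimated from the model architecture. A rough calculator (a sketch; the Llama 3.1 8B figures — 32 layers, 8 KV heads via grouped-query attention, head dim 128 — come from the public model config):

```python
def weight_gb(params_billion, bytes_per_param=2):
    # fp16/bf16 store 2 bytes per parameter; AWQ/GPTQ 4-bit is roughly 0.5
    return params_billion * bytes_per_param

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # One K and one V vector per layer, each n_kv_heads * head_dim wide
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
print(weight_gb(8))                    # 16 GB of weights in bf16
per_tok = kv_bytes_per_token(32, 8, 128)
print(per_tok)                         # 131072 bytes = 128 KiB per token
print(per_tok * 8192 / 2**30)          # 1.0 GiB of KV cache for one full 8K sequence
```

At 128 KiB per token, a single 8K-token sequence consumes 1 GiB of KV cache, which is why the same GPU can hold far fewer concurrent long-context requests than short-chat ones.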

GPU-to-model mapping for Runpod SKUs:

  • RTX 4090 (24 GB): 7B quantized (AWQ/GPTQ 4-bit) or smaller. The VRAM ceiling rules out fp16 7B without aggressive KV cache constraints.
  • L40S (48 GB): 7B fp16 comfortably, or 13B quantized. The right pick for most 7–13B use cases and the tutorial’s reference GPU.
  • A100 80 GB or H100 PCIe (80 GB): 13B fp16 or 34B quantized. Good fit for higher-throughput 13B deployments where fp16 matters.
  • H100 SXM (80 GB, multi-GPU with NVLink): 70B fp16 with --tensor-parallel-size 2. Also the right choice for 34B fp16 on a single card with room for KV cache.

Runpod provisions all of these GPU SKUs on-demand with per-second billing, which is useful when iterating on deployment configuration without committing to a long-running pod.

2. Launching a Runpod GPU Pod

Navigate to Pods > GPU Cloud in the Runpod console. Choose Secure Cloud (not Community Cloud) for production workloads; Secure Cloud gives you isolated hardware and more predictable network performance.

Select your GPU type. For this tutorial, choose L40S. Under Container Image, enter:

vllm/vllm-openai:latest

The official vLLM Docker image bundles CUDA, the inference engine, and the OpenAI-compatible HTTP server. You don’t need to build anything custom for standard model deployments.


Attach a Network Volume before launching. Create a volume of at least 50 GB (Llama 3.1 8B weights are ~16 GB; budget extra for the tokenizer, model index, and any future models you cache). Mount it at:

/root/.cache/huggingface

This is the default HuggingFace Hub cache directory. When vLLM downloads model weights on first boot, they land here. On every subsequent pod restart with the same volume attached, weights load from disk with no re-download, cutting startup time from ~3 minutes to ~20 seconds.
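To confirm the cache is actually warm before relying on the fast restart path, you can check for the model's snapshot directory on the volume. A sketch that relies on the HuggingFace Hub cache layout (`models--{org}--{name}/snapshots` under `$HF_HOME/hub`); the path and model ID are the ones used in this tutorial:

```python
from pathlib import Path

def is_cached(model_id, hf_home="/root/.cache/huggingface"):
    # HF Hub cache layout: $HF_HOME/hub/models--{org}--{name}/snapshots/{revision}/
    org, name = model_id.split("/")
    snapshots = Path(hf_home) / "hub" / f"models--{org}--{name}" / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())

print(is_cached("meta-llama/Meta-Llama-3.1-8B-Instruct"))
```

Run it in the pod's terminal: True means the next restart will load weights from disk rather than re-downloading.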

Under Expose Ports, add port 8000. Runpod’s reverse proxy will generate a public HTTPS endpoint at https://{POD_ID}-8000.proxy.runpod.net.

Add your HuggingFace token as an environment variable:

HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

Set the pod’s start command to:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256 \
  --served-model-name llama-3.1-8b \
  --api-key YOUR_SECRET_KEY
```

To automate pod provisioning via the Runpod API, install the SDK and use create_pod() directly:

pip install runpod
```python
import runpod

runpod.api_key = "your_runpod_api_key"

pod = runpod.create_pod(
    name="vllm-llama-31-8b",
    image_name="vllm/vllm-openai:latest",
    gpu_type_id="NVIDIA L40S",
    cloud_type="SECURE",
    docker_args="vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --gpu-memory-utilization 0.90 --max-model-len 8192 --max-num-seqs 256 --api-key YOUR_SECRET_KEY",
    volume_in_gb=50,
    container_disk_in_gb=20,
    ports="8000/http",
    volume_mount_path="/root/.cache/huggingface",
    env={"HF_TOKEN": "hf_xxxxxxxxxxxxxxxxxxxx"}
)
print(pod["id"])
```

From here you can build a CI/CD pipeline around pod provisioning: swap the model name, adjust flags, and re-deploy without touching the console.
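For that pipeline, it helps to generate the docker_args string from a flag dict rather than hand-editing it on every deploy. A small helper (hypothetical, not part of the Runpod SDK):

```python
def build_serve_command(model, **flags):
    # Turns flags like max_model_len=8192 into "--max-model-len 8192"
    parts = [f"vllm serve {model}"]
    for key, value in flags.items():
        parts.append(f"--{key.replace('_', '-')} {value}")
    return " ".join(parts)

cmd = build_serve_command(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    max_num_seqs=256,
)
print(cmd)
```

Pass the result as docker_args to create_pod(), so the serve flags live in version-controlled config instead of a console text box.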

3. Writing the Dockerfile for a Custom vLLM Image

vllm/vllm-openai:latest covers most use cases. You need a custom Dockerfile when:

  • You’re loading weights from a private registry or internal model store (not HuggingFace Hub)
  • You have extra Python dependencies such as custom tokenizers, adapter loading libraries, or internal packages
  • You’re pinning to a specific CUDA version or vLLM release for reproducibility

When you do need a custom image:

```dockerfile
# Pin to a specific vLLM release for reproducibility
FROM vllm/vllm-openai:v0.6.3

# Must match the Network Volume mount path set in the Runpod pod configuration
ENV HF_HOME=/root/.cache/huggingface

# Required for tensor parallelism across multiple GPUs
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn

# Install any additional dependencies
RUN pip install --no-cache-dir \
    some-custom-tokenizer==1.2.0

EXPOSE 8000

# Default serve command; override at pod launch for flexibility
CMD ["vllm", "serve", "meta-llama/Meta-Llama-3.1-8B-Instruct", \
     "--dtype", "bfloat16", \
     "--gpu-memory-utilization", "0.90", \
     "--max-model-len", "8192", \
     "--max-num-seqs", "256"]
```

VLLM_WORKER_MULTIPROC_METHOD=spawn is the one environment variable that bites people silently. Without it, tensor parallelism across multiple GPUs fails at worker initialization on some CUDA configurations; the error isn’t always obvious. (Skip this if you’re on a single GPU; tensor parallelism is covered in Section 4.)

Build and push to Docker Hub or GitHub Container Registry:

```bash
docker build -t your-org/vllm-custom:v1 .
docker push your-org/vllm-custom:v1
```

Then use your-org/vllm-custom:v1 as the container image in the Runpod pod template instead of vllm/vllm-openai:latest.

4. Configuring vLLM Server Flags for Production

The flags you pass to vllm serve are where most of the performance lives. Here’s the full reference command for the L40S + Llama 3.1 8B Instruct deployment:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256 \
  --served-model-name llama-3.1-8b
```

What each flag does:

--dtype bfloat16: Use bfloat16 precision for H100s and A100s; they have native bfloat16 support and peak throughput on that dtype. On RTX GPUs (4090, 3090), switch to float16 instead. Both Ampere and Ada Lovelace support bfloat16 tensor cores, but bfloat16 throughput on those architectures is slower than float16 in practice.

--gpu-memory-utilization 0.90: vLLM claims this fraction of total VRAM for model weights, activation workspace, and the KV cache; whatever remains after weights and activations is pre-allocated as KV cache blocks. The remaining 10% is headroom for the CUDA context and temporary buffers. Below 0.85, you’re wasting KV cache capacity and hurting throughput. Above 0.95, you risk OOM at startup or under long prompts. 0.90 is the right default.

--max-model-len 8192: Sets the maximum sequence length (prompt + completion). This directly controls how much KV cache memory each in-flight request can consume. If you’re running a summarization workload with 16K+ token inputs, bump this to 16384 and reduce --gpu-memory-utilization slightly. Llama 3.1 8B supports 128K context natively, but fitting the full context window on a single L40S requires aggressive VRAM management.

--max-num-seqs 256: Controls how many requests vLLM processes simultaneously during continuous batching. The tradeoff is direct: more concurrency raises throughput but adds latency per request. 128-256 works well for interactive chat; batch inference pipelines that don’t care about per-request latency can push to 512+.

--served-model-name llama-3.1-8b: Sets the string clients use in the model field of API requests. Keeping this stable while upgrading the underlying checkpoint means client code never needs updating.

Multi-GPU Deployment with Tensor Parallelism

For a 2×A100 80 GB pod running Llama 3.1 70B:

```bash
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 128 \
  --served-model-name llama-3.1-70b
```

--tensor-parallel-size must equal the number of GPUs on the pod. It shards model weights across GPUs: each GPU holds a shard of each weight matrix, and NVLink (on H100 SXM pods) handles the all-reduce communication. Requires VLLM_WORKER_MULTIPROC_METHOD=spawn in the environment (see Section 3).
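The memory arithmetic behind that GPU pairing is worth spelling out (weights only; KV cache still comes on top):

```python
# Per-GPU weight memory under tensor parallelism: weights shard evenly across GPUs.
def per_gpu_weight_gb(params_billion, tp_size, bytes_per_param=2):
    return params_billion * bytes_per_param / tp_size

print(per_gpu_weight_gb(70, 2))   # 70.0 GB per GPU on 2x80 GB: fits, little KV headroom
print(per_gpu_weight_gb(70, 4))   # 35.0 GB per GPU on 4x80 GB: ample KV cache room
```

This is why --max-num-seqs is dialed down to 128 in the 2-GPU command above: the KV cache budget per GPU is thin after the 70 GB weight shard.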

Quantization Flags: AWQ and GPTQ

For AWQ-quantized models (e.g., TheBloke’s AWQ variants on HuggingFace):

```bash
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq \
  --dtype float16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096
```

For GPTQ models:

```bash
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
  --quantization gptq \
  --dtype float16
```

--quantization awq requires the model checkpoint to already be AWQ-quantized. You can’t pass this flag with a base fp16 checkpoint and expect vLLM to quantize on load; vLLM does not perform on-the-fly quantization.
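One way to verify a checkpoint is actually pre-quantized before passing the flag: quantized HuggingFace checkpoints carry a quantization_config block in their config.json, which a base fp16 checkpoint lacks. A sketch (assumes you have the checkpoint's config.json locally; the quant_method field holds values like "awq" or "gptq"):

```python
import json

def quant_method(config_path):
    # Returns "awq", "gptq", etc. for quantized checkpoints, or None for fp16/bf16
    with open(config_path) as f:
        cfg = json.load(f)
    qc = cfg.get("quantization_config")
    return qc.get("quant_method") if qc else None
```

If this returns None, drop the --quantization flag and size VRAM for the full-precision weights instead.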

5. Testing the vLLM API Endpoint

Once the pod is running, find the proxy URL in the pod dashboard under Connect. It follows this pattern:

https://{POD_ID}-8000.proxy.runpod.net

Before testing, confirm vLLM has finished loading. Open the pod’s Logs tab in the Runpod console and wait for INFO: Application startup complete. Alternatively, poll the readiness endpoint: curl https://{POD_ID}-8000.proxy.runpod.net/health returns HTTP 200 when the server is ready.
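If you’d rather script the wait than watch logs, a small poller against /health works. A sketch using only the standard library; substitute your pod’s proxy URL:

```python
import time
import urllib.error
import urllib.request

def wait_for_health(base_url, timeout_s=300, interval_s=5):
    # Poll vLLM's /health endpoint until it returns HTTP 200 or the timeout elapses.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False
```

wait_for_health("https://{POD_ID}-8000.proxy.runpod.net") returns True once the model has finished loading, which makes it a convenient gate in deployment scripts.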

vLLM does not enforce API key authentication by default. Add --api-key YOUR_SECRET_KEY to the vllm serve command before exposing any endpoint to the internet; without it, the proxy URL is publicly accessible and any caller can run inference billed to your Runpod account. Pass the same value as the Bearer token in the Authorization header.

Test with curl:

```bash
curl https://{POD_ID}-8000.proxy.runpod.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_SECRET_KEY" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [
      {"role": "user", "content": "Explain KV cache in one sentence."}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

Expected response shape:

```json
{
  "id": "cmpl-abc123",
  "object": "chat.completion",
  "model": "llama-3.1-8b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "KV cache stores key-value attention pairs from prior tokens..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 24,
    "total_tokens": 42
  }
}
```

The OpenAI Python SDK works unchanged; just override base_url and api_key:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://{POD_ID}-8000.proxy.runpod.net/v1",
    api_key="YOUR_SECRET_KEY"
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Explain PagedAttention."}]
)
print(response.choices[0].message.content)
```

No other changes are needed: any existing code that targets the OpenAI API works against your self-hosted vLLM inference server once base_url and api_key are updated.

Three Common Failure Modes

1. CUDA OOM on startup: The model won’t load if VRAM is insufficient. Reduce --gpu-memory-utilization to 0.85, switch to a quantized model variant, or upgrade the GPU SKU.

2. Port not reachable (502 or connection refused): Declare port 8000 in Runpod pod settings; the proxy cannot route traffic to ports not listed there. Confirm it appears under Exposed Ports in the pod configuration, not just in the Dockerfile.

3. Model download fails (401 Unauthorized): HuggingFace gated model auth failed. Verify HF_TOKEN is set in the pod’s environment variables and that the account has accepted the Llama 3.1 license at hf.co/meta-llama.

6. Tuning vLLM for Production Traffic

vLLM’s continuous batching works like this: while one request is mid-generation, new incoming requests join the same forward pass rather than waiting in a queue. That’s the mechanism behind its throughput advantage over naive single-request serving. The flag that controls the aggressiveness of this batching is --max-num-batched-tokens.

```bash
# Higher throughput, slightly more latency per request
--max-num-batched-tokens 8192

# Lower latency, less peak throughput
--max-num-batched-tokens 2048
```

Setting --max-num-batched-tokens to match --max-model-len is a reasonable starting point. Interactive chat favors lower values; bulk inference pipelines favor higher.

Prometheus Monitoring for vLLM

vLLM exposes a /metrics endpoint on port 8000. A minimal scrape config:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['{POD_ID}-8000.proxy.runpod.net']
    metrics_path: '/metrics'
    scheme: https
    bearer_token: 'YOUR_SECRET_KEY'
```

The three metrics that matter most for a self-hosted vLLM inference server:

  • vllm:num_requests_running — requests currently in the running batch; pinned at --max-num-seqs means the server is saturated (vllm:num_requests_waiting shows the queue building behind it)
  • vllm:gpu_cache_usage_perc — KV cache utilization; sustained values above 90% indicate preemption or OOM risk on long prompts
  • vllm:request_success_total — throughput counter; rate() on this gives you requests/sec

Persistent Pod vs. Runpod Serverless: Billing Fit

Runpod’s per-second billing means an idle pod still accumulates cost. If your traffic is bursty (predictable peaks with long quiet periods between them), Runpod Serverless for vLLM is the better architecture. Serverless bills per request and scales to zero, but requires a handler function wrapper rather than a direct drop-in of the same image and flags. For steady baseline traffic where a pod stays at consistent utilization, a persistent pod is more economical.

Baseline Benchmark Before Tuning

Run this before you change any flags so you have numbers to compare:

```python
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://{POD_ID}-8000.proxy.runpod.net/v1",
    api_key="YOUR_SECRET_KEY"
)

async def single_request():
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
        max_tokens=50
    )
    latency = time.perf_counter() - start
    tokens = response.usage.completion_tokens
    return latency, tokens

async def benchmark(concurrency=50):
    tasks = [single_request() for _ in range(concurrency)]
    results = await asyncio.gather(*tasks)
    latencies = [r[0] for r in results]
    total_tokens = sum(r[1] for r in results)
    print(f"Median latency: {sorted(latencies)[len(latencies)//2]:.2f}s")
    print(f"Total tokens/sec: {total_tokens / max(latencies):.1f}")

asyncio.run(benchmark(concurrency=50))
```

Run this before and after changing --max-num-batched-tokens or --max-num-seqs. The numbers tell you whether the flag change actually helped.

7. Deploy vLLM on Runpod: Quick-Start Checklist

Here’s the exact sequence to execute right now:

  1. Create an L40S pod on Runpod with vllm/vllm-openai:latest, attach a 50 GB Network Volume at /root/.cache/huggingface, expose port 8000
  2. Set HF_TOKEN in pod environment variables
  3. Set the start command to vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --gpu-memory-utilization 0.90 --max-model-len 8192 --max-num-seqs 256 --api-key YOUR_SECRET_KEY
  4. Wait ~3 minutes for the model to download on first boot (or ~20 seconds on restart if the Network Volume cache is warm)
  5. Hit the proxy URL with the curl command from Section 5

That’s the complete deploy vLLM Docker Runpod workflow.

If your traffic is bursty, Runpod Serverless for vLLM is the better fit: it bills per request and scales to zero, though it requires a handler function wrapper rather than a direct pod drop-in.

If you need multi-GPU scale, the same Dockerfile and flag patterns apply. Add --tensor-parallel-size 2 (or 4) and select a multi-GPU pod SKU (2x or 4x H100 SXM). Client code doesn’t change.

Run the checklist above and the endpoint will be serving requests in under 10 minutes.