
Deploy vLLM with Docker on Runpod: Container Config, Model Loading, and Production Tuning

Most vLLM tutorials end where the real work begins. They show you how to pip install vllm and run a quick inference test, then leave you to figure out Docker packaging, GPU selection, network volume caching, and the server flags that actually determine whether your endpoint handles 5 concurrent requests or 500.

This guide covers the complete deploy vLLM Docker Runpod workflow, from pod creation through production tuning, using Llama 3.1 8B Instruct as the reference model and an L40S as the default GPU. The result is a running, OpenAI-compatible inference server you can point real traffic at.

1. Prerequisites and GPU Selection for vLLM on Runpod

Before you start, you need:

  • A Runpod account with billing configured
  • A HuggingFace token with access to meta-llama/Meta-Llama-3.1-8B-Instruct (request access at hf.co/meta-llama)
  • Docker installed locally (only if you’re building a custom image; optional for standard deployments)
  • Basic familiarity with Docker CLI

VRAM requirements are the first constraint to size correctly. Weight-only memory for common model sizes:

| Model size | Precision | Min VRAM (weights only) |
|------------|-----------|-------------------------|
| 7B         | fp16      | ~14 GB                  |
| 7B         | AWQ 4-bit | ~7 GB                   |
| 13B        | fp16      | ~26 GB                  |
| 13B        | AWQ 4-bit | ~13 GB                  |
| 34B        | fp16      | ~68 GB                  |
| 70B        | fp16      | ~140 GB                 |
| 70B        | AWQ 4-bit | ~70 GB                  |

These are floor numbers; KV cache overhead comes on top, which is why --gpu-memory-utilization matters (covered in Section 4).
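As a sanity check, both the weight floor and the per-token KV cache cost can be estimated from the model architecture. A rough calculator (a sketch; the Llama 3.1 8B figures — 32 layers, 8 KV heads via grouped-query attention, head dim 128 — come from the public model config):

```python
def weight_gb(params_billion, bytes_per_param=2):
    # fp16/bf16 store 2 bytes per parameter; AWQ/GPTQ 4-bit is roughly 0.5
    return params_billion * bytes_per_param

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # One K and one V vector per layer, each n_kv_heads * head_dim wide
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
print(weight_gb(8))                    # 16 GB of weights in bf16
per_tok = kv_bytes_per_token(32, 8, 128)
print(per_tok)                         # 131072 bytes = 128 KiB per token
print(per_tok * 8192 / 2**30)          # 1.0 GiB of KV cache for one full 8K sequence
```

At 128 KiB per token, a single 8K-token sequence consumes 1 GiB of KV cache, which is why the same GPU can hold far fewer concurrent long-context requests than short-chat ones.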

GPU-to-model mapping for Runpod SKUs:

  • RTX 4090 (24 GB): 7B quantized (AWQ/GPTQ 4-bit) or smaller. The VRAM ceiling rules out fp16 7B without aggressive KV cache constraints.
  • L40S (48 GB): 7B fp16 comfortably, or 13B quantized. The right pick for most 7–13B use cases and the tutorial’s reference GPU.
  • A100 80 GB or H100 PCIe (80 GB): 13B fp16 or 34B quantized. Good fit for higher-throughput 13B deployments where fp16 matters.
  • H100 SXM (80 GB, multi-GPU with NVLink): 70B fp16 with --tensor-parallel-size 2. Also the right choice for 34B fp16 on a single card with room for KV cache.

Runpod provisions all of these GPU SKUs on-demand with per-second billing, which is useful when iterating on deployment configuration without committing to a long-running pod.

2. Launching a Runpod GPU Pod

Navigate to Pods > GPU Cloud in the Runpod console. Choose Secure Cloud (not Community Cloud) for production workloads; Secure Cloud gives you isolated hardware and more predictable network performance.

Select your GPU type. For this tutorial, choose L40S. Under Container Image, enter:

vllm/vllm-openai:latest

The official vLLM Docker image bundles CUDA, the inference engine, and the OpenAI-compatible HTTP server. You don’t need to build anything custom for standard model deployments.


Attach a Network Volume before launching. Create a volume of at least 50 GB (Llama 3.1 8B weights are ~16 GB; budget extra for the tokenizer, model index, and any future models you cache). Mount it at:

/root/.cache/huggingface

This is the default HuggingFace Hub cache directory. When vLLM downloads model weights on first boot, they land here. On every subsequent pod restart with the same volume attached, weights load from disk with no re-download, cutting startup time from ~3 minutes to ~20 seconds.
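To confirm the cache is actually warm before relying on the fast restart path, you can check for the model's snapshot directory on the volume. A sketch that relies on the HuggingFace Hub cache layout (`models--{org}--{name}/snapshots` under `$HF_HOME/hub`); the path and model ID are the ones used in this tutorial:

```python
from pathlib import Path

def is_cached(model_id, hf_home="/root/.cache/huggingface"):
    # HF Hub cache layout: $HF_HOME/hub/models--{org}--{name}/snapshots/{revision}/
    org, name = model_id.split("/")
    snapshots = Path(hf_home) / "hub" / f"models--{org}--{name}" / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())

print(is_cached("meta-llama/Meta-Llama-3.1-8B-Instruct"))
```

Run it in the pod's terminal: True means the next restart will load weights from disk rather than re-downloading.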

Under Expose Ports, add port 8000. Runpod’s reverse proxy will generate a public HTTPS endpoint at https://{POD_ID}-8000.proxy.runpod.net.

Add your HuggingFace token as an environment variable:

HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

Set the pod’s start command to:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256 \
  --served-model-name llama-3.1-8b \
  --api-key YOUR_SECRET_KEY
```

To automate pod provisioning via the Runpod API, install the SDK and use create_pod() directly:

pip install runpod
```python
import runpod

runpod.api_key = "your_runpod_api_key"

pod = runpod.create_pod(
    name="vllm-llama-31-8b",
    image_name="vllm/vllm-openai:latest",
    gpu_type_id="NVIDIA L40S",
    cloud_type="SECURE",
    docker_args="vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --gpu-memory-utilization 0.90 --max-model-len 8192 --max-num-seqs 256 --api-key YOUR_SECRET_KEY",
    volume_in_gb=50,
    container_disk_in_gb=20,
    ports="8000/http",
    volume_mount_path="/root/.cache/huggingface",
    env={"HF_TOKEN": "hf_xxxxxxxxxxxxxxxxxxxx"}
)
print(pod["id"])
```

From here you can build a CI/CD pipeline around pod provisioning: swap the model name, adjust flags, and re-deploy without touching the console.
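For that pipeline, it helps to generate the docker_args string from a flag dict rather than hand-editing it on every deploy. A small helper (hypothetical, not part of the Runpod SDK):

```python
def build_serve_command(model, **flags):
    # Turns flags like max_model_len=8192 into "--max-model-len 8192"
    parts = [f"vllm serve {model}"]
    for key, value in flags.items():
        parts.append(f"--{key.replace('_', '-')} {value}")
    return " ".join(parts)

cmd = build_serve_command(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    max_num_seqs=256,
)
print(cmd)
```

Pass the result as docker_args to create_pod(), so the serve flags live in version-controlled config instead of a console text box.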

3. Writing the Dockerfile for a Custom vLLM Image

vllm/vllm-openai:latest covers most use cases. You need a custom Dockerfile when:

  • You’re loading weights from a private registry or internal model store (not HuggingFace Hub)
  • You have extra Python dependencies such as custom tokenizers, adapter loading libraries, or internal packages
  • You’re pinning to a specific CUDA version or vLLM release for reproducibility

When you do need a custom image:

```dockerfile
# Pin to a specific vLLM release for reproducibility
FROM vllm/vllm-openai:v0.6.3

# Must match the Network Volume mount path set in the Runpod pod configuration
ENV HF_HOME=/root/.cache/huggingface

# Required for tensor parallelism across multiple GPUs
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn

# Install any additional dependencies
RUN pip install --no-cache-dir \
    some-custom-tokenizer==1.2.0

EXPOSE 8000

# Default serve command; override at pod launch for flexibility
CMD ["vllm", "serve", "meta-llama/Meta-Llama-3.1-8B-Instruct", \
     "--dtype", "bfloat16", \
     "--gpu-memory-utilization", "0.90", \
     "--max-model-len", "8192", \
     "--max-num-seqs", "256"]
```

VLLM_WORKER_MULTIPROC_METHOD=spawn is the one environment variable that bites people silently. Without it, tensor parallelism across multiple GPUs fails at worker initialization on some CUDA configurations; the error isn’t always obvious. (Skip this if you’re on a single GPU; tensor parallelism is covered in Section 4.)

Build and push to Docker Hub or GitHub Container Registry:

```bash
docker build -t your-org/vllm-custom:v1 .
docker push your-org/vllm-custom:v1
```

Then use your-org/vllm-custom:v1 as the container image in the Runpod pod template instead of vllm/vllm-openai:latest.

4. Configuring vLLM Server Flags for Production

The flags you pass to vllm serve are where most of the performance lives. Here’s the full reference command for the L40S + Llama 3.1 8B Instruct deployment:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256 \
  --served-model-name llama-3.1-8b
```

What each flag does:

--dtype bfloat16: Use bfloat16 precision for H100s and A100s; they have native bfloat16 support and peak throughput on that dtype. On RTX GPUs (4090, 3090), switch to float16 instead. Both Ampere and Ada Lovelace support bfloat16 tensor cores, but bfloat16 throughput on those architectures is slower than float16 in practice.

--gpu-memory-utilization 0.90: vLLM claims this fraction of total VRAM for model weights, activation workspace, and the KV cache; whatever remains after weights and activations is pre-allocated as KV cache blocks. The remaining 10% is headroom for the CUDA context and temporary buffers. Below 0.85, you’re wasting KV cache capacity and hurting throughput. Above 0.95, you risk OOM at startup or under long prompts. 0.90 is the right default.

--max-model-len 8192: Sets the maximum sequence length (prompt + completion). This directly controls how much KV cache memory each in-flight request can consume. If you’re running a summarization workload with 16K+ token inputs, bump this to 16384 and reduce --gpu-memory-utilization slightly. Llama 3.1 8B supports 128K context natively, but fitting the full context window on a single L40S requires aggressive VRAM management.

--max-num-seqs 256: Controls how many requests vLLM processes simultaneously during continuous batching. The tradeoff is direct: more concurrency raises throughput but adds latency per request. 128-256 works well for interactive chat; batch inference pipelines that don’t care about per-request latency can push to 512+.

--served-model-name llama-3.1-8b: Sets the string clients use in the model field of API requests. Keeping this stable while upgrading the underlying checkpoint means client code never needs updating.

Multi-GPU Deployment with Tensor Parallelism

For a 2×A100 80 GB pod running Llama 3.1 70B:

```bash
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 128 \
  --served-model-name llama-3.1-70b
```

--tensor-parallel-size must equal the number of GPUs on the pod. It shards model weights across GPUs: each GPU holds a shard of each weight matrix, and NVLink (on H100 SXM pods) handles the all-reduce communication. Requires VLLM_WORKER_MULTIPROC_METHOD=spawn in the environment (see Section 3).
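The memory arithmetic behind that GPU pairing is worth spelling out (weights only; KV cache still comes on top):

```python
# Per-GPU weight memory under tensor parallelism: weights shard evenly across GPUs.
def per_gpu_weight_gb(params_billion, tp_size, bytes_per_param=2):
    return params_billion * bytes_per_param / tp_size

print(per_gpu_weight_gb(70, 2))   # 70.0 GB per GPU on 2x80 GB: fits, little KV headroom
print(per_gpu_weight_gb(70, 4))   # 35.0 GB per GPU on 4x80 GB: ample KV cache room
```

This is why --max-num-seqs is dialed down to 128 in the 2-GPU command above: the KV cache budget per GPU is thin after the 70 GB weight shard.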

Quantization Flags: AWQ and GPTQ

For AWQ-quantized models (e.g., TheBloke’s AWQ variants on HuggingFace):

```bash
vllm serve TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq \
  --dtype float16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096
```

For GPTQ models:

```bash
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
  --quantization gptq \
  --dtype float16
```

--quantization awq requires the model checkpoint to already be AWQ-quantized. You can’t pass this flag with a base fp16 checkpoint and expect vLLM to quantize on load; vLLM does not perform on-the-fly quantization.
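One way to verify a checkpoint is actually pre-quantized before passing the flag: quantized HuggingFace checkpoints carry a quantization_config block in their config.json, which a base fp16 checkpoint lacks. A sketch (assumes you have the checkpoint's config.json locally; the quant_method field holds values like "awq" or "gptq"):

```python
import json

def quant_method(config_path):
    # Returns "awq", "gptq", etc. for quantized checkpoints, or None for fp16/bf16
    with open(config_path) as f:
        cfg = json.load(f)
    qc = cfg.get("quantization_config")
    return qc.get("quant_method") if qc else None
```

If this returns None, drop the --quantization flag and size VRAM for the full-precision weights instead.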

5. Testing the vLLM API Endpoint

Once the pod is running, find the proxy URL in the pod dashboard under Connect. It follows this pattern:

https://{POD_ID}-8000.proxy.runpod.net

Before testing, confirm vLLM has finished loading. Open the pod’s Logs tab in the Runpod console and wait for INFO: Application startup complete. Alternatively, poll the readiness endpoint: curl https://{POD_ID}-8000.proxy.runpod.net/health returns HTTP 200 when the server is ready.
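If you’d rather script the wait than watch logs, a small poller against /health works. A sketch using only the standard library; substitute your pod’s proxy URL:

```python
import time
import urllib.error
import urllib.request

def wait_for_health(base_url, timeout_s=300, interval_s=5):
    # Poll vLLM's /health endpoint until it returns HTTP 200 or the timeout elapses.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False
```

wait_for_health("https://{POD_ID}-8000.proxy.runpod.net") returns True once the model has finished loading, which makes it a convenient gate in deployment scripts.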

vLLM does not enforce API key authentication by default. Add --api-key YOUR_SECRET_KEY to the vllm serve command before exposing any endpoint to the internet; without it, the proxy URL is publicly accessible and any caller can run inference billed to your Runpod account. Pass the same value as the Bearer token in the Authorization header.

Test with curl:

```bash
curl https://{POD_ID}-8000.proxy.runpod.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_SECRET_KEY" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [
      {"role": "user", "content": "Explain KV cache in one sentence."}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

Expected response shape:

```json
{
  "id": "cmpl-abc123",
  "object": "chat.completion",
  "model": "llama-3.1-8b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "KV cache stores key-value attention pairs from prior tokens..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 24,
    "total_tokens": 42
  }
}
```

The OpenAI Python SDK works unchanged; just override base_url and api_key:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://{POD_ID}-8000.proxy.runpod.net/v1",
    api_key="YOUR_SECRET_KEY"
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Explain PagedAttention."}]
)
print(response.choices[0].message.content)
```

No other changes are needed: any existing code that targets the OpenAI API works against your self-hosted vLLM inference server once base_url and api_key are updated.

Three Common Failure Modes

1. CUDA OOM on startup: The model won’t load if VRAM is insufficient. Reduce --gpu-memory-utilization to 0.85, switch to a quantized model variant, or upgrade the GPU SKU.

2. Port not reachable (502 or connection refused): Declare port 8000 in Runpod pod settings; the proxy cannot route traffic to ports not listed there. Confirm it appears under Exposed Ports in the pod configuration, not just in the Dockerfile.

3. Model download fails (401 Unauthorized): HuggingFace gated model auth failed. Verify HF_TOKEN is set in the pod’s environment variables and that the account has accepted the Llama 3.1 license at hf.co/meta-llama.

6. Tuning vLLM for Production Traffic

vLLM’s continuous batching works like this: while one request is mid-generation, new incoming requests join the same forward pass rather than waiting in a queue. That’s the mechanism behind its throughput advantage over naive single-request serving. The flag that controls the aggressiveness of this batching is --max-num-batched-tokens.

```bash
# Higher throughput, slightly more latency per request
--max-num-batched-tokens 8192

# Lower latency, less peak throughput
--max-num-batched-tokens 2048
```

Setting --max-num-batched-tokens to match --max-model-len is a reasonable starting point. Interactive chat favors lower values; bulk inference pipelines favor higher.

Prometheus Monitoring for vLLM

vLLM exposes a /metrics endpoint on port 8000. A minimal scrape config:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['{POD_ID}-8000.proxy.runpod.net']
    metrics_path: '/metrics'
    scheme: https
    bearer_token: 'YOUR_SECRET_KEY'
```

The three metrics that matter most for a self-hosted vLLM inference server:

  • vllm:num_requests_running — requests currently in the running batch; pinned at --max-num-seqs means the server is saturated (vllm:num_requests_waiting shows the queue building behind it)
  • vllm:gpu_cache_usage_perc — KV cache utilization; sustained values above 90% indicate preemption or OOM risk on long prompts
  • vllm:request_success_total — throughput counter; rate() on this gives you requests/sec

Persistent Pod vs. Runpod Serverless: Billing Fit

Runpod’s per-second billing means an idle pod still accumulates cost. If your traffic is bursty (predictable peaks with long quiet periods between them), Runpod Serverless for vLLM is the better architecture. Serverless bills per request and scales to zero, but requires a handler function wrapper rather than a direct drop-in of the same image and flags. For steady baseline traffic where a pod stays at consistent utilization, a persistent pod is more economical.

Baseline Benchmark Before Tuning

Run this before you change any flags so you have numbers to compare:

```python
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://{POD_ID}-8000.proxy.runpod.net/v1",
    api_key="YOUR_SECRET_KEY"
)

async def single_request():
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
        max_tokens=50
    )
    latency = time.perf_counter() - start
    tokens = response.usage.completion_tokens
    return latency, tokens

async def benchmark(concurrency=50):
    tasks = [single_request() for _ in range(concurrency)]
    results = await asyncio.gather(*tasks)
    latencies = [r[0] for r in results]
    total_tokens = sum(r[1] for r in results)
    print(f"Median latency: {sorted(latencies)[len(latencies)//2]:.2f}s")
    print(f"Total tokens/sec: {total_tokens / max(latencies):.1f}")

asyncio.run(benchmark(concurrency=50))
```

Run this before and after changing --max-num-batched-tokens or --max-num-seqs. The numbers tell you whether the flag change actually helped.

7. Deploy vLLM on Runpod: Quick-Start Checklist

Here’s the exact sequence to execute right now:

  1. Create an L40S pod on Runpod with vllm/vllm-openai:latest, attach a 50 GB Network Volume at /root/.cache/huggingface, expose port 8000
  2. Set HF_TOKEN in pod environment variables
  3. Set the start command to vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --gpu-memory-utilization 0.90 --max-model-len 8192 --max-num-seqs 256 --api-key YOUR_SECRET_KEY
  4. Wait ~3 minutes for the model to download on first boot (or ~20 seconds on restart if the Network Volume cache is warm)
  5. Hit the proxy URL with the curl command from Section 5

That’s the complete deploy vLLM Docker Runpod workflow.

If your traffic is bursty, Runpod Serverless for vLLM is the better fit: it bills per request and scales to zero, though it requires a handler function wrapper rather than a direct pod drop-in.

If you need multi-GPU scale, the same Dockerfile and flag patterns apply. Add --tensor-parallel-size 2 (or 4) and select a multi-GPU pod SKU (2x or 4x H100 SXM). Client code doesn’t change.

Run the checklist above and the endpoint will be serving requests in under 10 minutes.