Your Llama-3-70B deployment burns $2.99/hr on a single H100. At $0.002 per 1K output tokens, that's roughly 1.5 million tokens just to break even on one hour of compute, before networking, storage, or the engineer who set it up. Running in FP16 makes the math worse: the 70B model requires 140GB of VRAM, which a single H100 can't fit, forcing a 2×H100 configuration at ~$5.98/hr. At ~890 tok/s throughput, you're paying roughly $1.87 per million tokens before you've optimized a single parameter.
Four compounding decisions close that gap: pick the right GPU for your model size, apply the correct quantization format, tune your serving engine for throughput and latency, and architect your infrastructure to pay only for what you actually use. Execute all four, and cost per million tokens on that same 70B model drops to under $0.55, without a perceptible change in output quality.
This playbook gives you the benchmarks and configuration files to make those decisions. On a bare-metal provider like Nebius, optimization starts with two-plus weeks of cluster provisioning (Kubernetes, Helm, driver configuration, health checks) before you run a single benchmark. On Runpod, you deploy with a template and start benchmarking in minutes. Keep this playbook open on one screen while you configure your Runpod templates on the other.
Chapter 1: The hardware and model matrix
The single most expensive mistake in LLM deployment is reflexive H100 selection. Engineers reach for it because it's the fastest card available, which is true, but fastest does not mean most cost-efficient for every model size and traffic pattern.
Here's the core constraint: a model in FP16 precision requires roughly 2 bytes per parameter. Llama-3-8B needs 16GB of VRAM. Llama-3-70B needs 140GB. Mixtral 8x7B (46.7B total parameters) needs approximately 93GB. Your GPU choice follows directly from those numbers, adjusted by whatever quantization you apply (Chapter 2 covers that).
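The 2-bytes-per-parameter rule is easy to encode as a sanity check. This is an illustrative sketch only: it counts weight memory alone, ignoring KV cache, activations, and framework overhead, which add several GB on top.

```python
# Weight-only VRAM estimate: bytes per parameter x parameter count.
# Ignores KV cache, activations, and runtime overhead (several extra GB).
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param

# FP16 = 2 bytes/param, FP8 = 1, AWQ/GPTQ 4-bit = 0.5
print(weight_vram_gb(8, 2))     # Llama-3-8B FP16   -> 16.0
print(weight_vram_gb(70, 2))    # Llama-3-70B FP16  -> 140.0
print(weight_vram_gb(46.7, 2))  # Mixtral 8x7B FP16 -> ~93.4
print(weight_vram_gb(70, 0.5))  # Llama-3-70B 4-bit -> 35.0
```

The same arithmetic tells you how much headroom quantization buys: a 70B model at 4-bit fits where an FP16 8B-class budget used to sit.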
The table below shows benchmarked performance across three scenarios on Runpod hardware, using vLLM with continuous batching and a 256-token input / 512-token output workload.

| Scenario | GPU configuration | Model (format) | Price | TTFT | Cost per M tokens |
|---|---|---|---|---|---|
| A | 1× RTX 4090 (24GB) | Llama-3-8B (AWQ 4-bit) | $0.74/hr | 120ms | $0.059 |
| B | 2× A6000 Ada (96GB total) | Llama-3-70B (AWQ 4-bit) | $1.58/hr | 380ms | ~$0.52 |
| C | 1× H100 (80GB) | Mixtral 8x7B (FP8) | $2.99/hr | 95ms | – |
At $0.059/M tokens, Scenario A is the workhorse for most production chatbot workloads. The RTX 4090's 24GB of GDDR6X runs Llama-3-8B AWQ at $0.74/hr, the lowest cost per token of any configuration, and 120ms TTFT is fully acceptable for interactive use. The only constraint is the 24GB VRAM ceiling: beyond it, you're either quantizing more aggressively or moving to a larger GPU.
Scenario B brings the 70B parameter class within reach at a cost that's still workable for production. Two A6000 Ada cards give you 96GB of total VRAM, which comfortably fits Llama-3-70B in AWQ 4-bit format (roughly 35GB). The 380ms TTFT rules this out for latency-sensitive applications, but for batch processing, document analysis, or agentic tasks where quality matters more than speed, it produces 70B-quality output for roughly a third of H100 pricing.
Scenario C is the configuration for low-latency, high-concurrency workloads where you can't afford to batch users. The H100's FP8 Tensor Core support means Mixtral 8x7B fits within 80GB while running at near-FP16 accuracy. The 95ms TTFT makes it viable for real-time voice applications and sub-200ms API responses.
The takeaway: you don't need an H100 to run production LLMs. You need an H100 when TTFT is your primary constraint and budget allows. For most startups at the post-seed stage, Scenario A or B delivers better unit economics while TTFT remains acceptable to users.
On Nebius or Lambda Labs, selecting Scenario B means you get the hardware and then spend days configuring multi-GPU interconnects, setting up the tensor parallelism orchestration layer, and validating health checks. On Runpod, you set --tensor-parallel-size 2 in your vLLM template and multi-GPU orchestration is automatic.
The hardware decision is locked. The next lever is quantization, which determines how much of that hardware you actually use.
Chapter 2: Quantization strategy: AWQ, GPTQ, or FP8?
Quantization is the single biggest optimization you can apply before touching your serving engine. It cuts VRAM requirements by 50–75% and lifts throughput by removing memory bandwidth bottlenecks. On modern hardware, the accuracy cost is negligible for most production tasks.
The three formats worth knowing are AWQ (Activation-aware Weight Quantization), GPTQ, and FP8. They're not interchangeable.
FP8 is the cleanest path on H100s. NVIDIA's Hopper architecture includes native FP8 Tensor Core support, meaning the quantized computation runs on dedicated silicon rather than emulation. The perplexity delta against the FP16 baseline is under 1% on standard benchmarks (WikiText-2 perplexity for Llama-3-8B: FP16 baseline ~6.1, FP8 ~6.15). You get a 50% VRAM reduction with essentially no accuracy cost. If you're running H100s, FP8 is the only quantization format worth using.
AWQ dominates on Ada Lovelace hardware (RTX 4090, A6000 Ada). The AWQ paper demonstrates that instead of uniformly quantizing all weights, the algorithm identifies the 1% of weights that have the largest activation magnitude and protects them at higher precision. The result is a perplexity delta of roughly 3% against FP16 at 4-bit precision; at 4-bit, AWQ consistently outperforms GPTQ at the same bit-width. Critically, AWQ's GEMM kernels are tuned for Ada architecture memory hierarchies, which means you see the full throughput benefit rather than leaving compute on the table.
GPTQ is the fallback for Ampere and older GPUs (A100, V100, A6000 non-Ada). It delivers similar VRAM savings to AWQ but with a larger perplexity delta (approximately 6% for 4-bit on Llama-class models) and slower kernel performance. If you're still running A100s, start planning a migration to A6000 Ada on Runpod for the AWQ throughput improvement.
To quantize a custom model using AutoAWQ on a Runpod Pod, run the following on any A6000 Ada or RTX 4090 instance:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_path = "llama-3-8b-instruct-awq-4bit"
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Use GEMM for Ada, GEMV for decode-heavy workloads
}

model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Default calibration dataset: PILEVAL (The Pile validation set).
# For domain-specific models (code, legal, medical), pass calib_data=["your domain text samples"]
# to reduce accuracy loss on specialized vocabulary.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

The q_group_size: 128 setting quantizes weights in groups of 128, which balances accuracy against memory. Smaller groups (64) improve accuracy at the cost of a larger calibration overhead. The version: GEMM flag selects the matrix-multiplication kernel path optimized for batch inference; if you're running a decode-heavy chatbot with small batches, switch to GEMV for better single-token latency.
Once quantized, you need to get the model onto a Runpod network volume before it can be referenced in your vLLM launch command. Two patterns:
# Option A: Download a pre-quantized model directly to the volume (recommended)
# Run this inside a Runpod pod with the network volume mounted at /runpod-volume
pip install huggingface_hub
huggingface-cli download TheBloke/Llama-3-70B-Instruct-AWQ \
--local-dir /runpod-volume/llama-3-70b-instruct-awq
# Option B: Save your custom quantization output directly to the volume
# Set quant_path above to point at the mounted volume path:
# quant_path = "/runpod-volume/llama-3-8b-instruct-awq-4bit"
# The model.save_quantized() call will write directly there.

Create the network volume in the Runpod console before launching your pod, and attach it at /runpod-volume. The quantization process on an 8B model takes approximately 20 minutes on a single A6000 Ada.
Chapter 3: Serving engine configuration: tuning vLLM and SGLang
You've picked the right hardware and quantized your model. Now you need to serve it, and your engine configuration determines whether that GPU runs at 30% or 90%.
Version note: All configurations in this chapter are tested against vLLM ≥ 0.6.x. Pin your container to vllm/vllm-openai:v0.6.x in Runpod templates. The python -m vllm.entrypoints.openai.api_server invocation is equivalent to the vllm serve command available from v0.5 onward.
Choosing between vLLM and SGLang
vLLM is the default choice for general-purpose chat and high-throughput batch inference. Its PagedAttention memory manager eliminates KV cache fragmentation, which is what makes continuous batching possible at scale. At batch size 32, vLLM consistently saturates GPU compute on both Ada and Hopper architectures.
SGLang targets a different problem: complex, multi-step agentic workflows that require structured JSON output. Where vLLM treats each request independently, SGLang's RadixAttention mechanism caches and reuses KV states across requests that share common prefixes. In agentic pipelines where every tool-call response starts with the same system prompt and conversation history, that prefix sharing reduces TTFT by 30–60% compared to naive per-request serving. If you're building a document processing pipeline or a ReAct agent loop, SGLang's structured output mode (direct JSON grammar enforcement at the token level) also eliminates the post-processing overhead of parsing malformed model outputs.
Use vLLM for chat, completions, and RAG. Use SGLang when your workflow runs multiple sequential LLM calls, requires guaranteed JSON schemas, or has long shared prefixes.
For the vLLM path, three parameters separate a saturated GPU from one wasting half its compute.
vLLM configuration deep dive
The three parameters that determine whether vLLM saturates your GPU or wastes it are --max-num-seqs, --gpu-memory-utilization, and --tensor-parallel-size.
--max-num-seqs sets the maximum number of sequences vLLM processes simultaneously. This is your effective batch size ceiling. The default of 256 is conservative for most models. On a 4090 running Llama-3-8B AWQ, start at 128 and profile using vLLM's Prometheus endpoint (covered in Chapter 4): watch vllm:avg_generation_throughput_toks_per_s and compare it against the throughput ceiling from Chapter 1's benchmarks. You're saturated when throughput plateaus despite increasing --max-num-seqs. If throughput is still climbing, increase in increments of 64 until TTFT starts degrading beyond your SLA threshold.

--gpu-memory-utilization is where most engineers create OOM errors. This parameter sets the fraction of total GPU VRAM vLLM is allowed to claim; whatever remains after model weights and activations goes to the KV cache. The default is 0.90, meaning vLLM claims 90% of available VRAM. Setting it to 0.95 increases KV cache capacity (longer sequences, larger batches) but leaves only 5% headroom for activation memory during large prefill operations. On a 24GB 4090 running a 16GB AWQ model, vLLM reserves 21.6GB (24 × 0.90), leaving roughly 5.6GB for the KV cache. Bumping to 0.95 adds about 1.2GB more cache capacity, which translates to ~15% more concurrent context length. The safe ceiling in production is 0.93: it gives you meaningful cache growth while keeping enough headroom to survive a burst of long-context requests.

--tensor-parallel-size splits the model's weight matrices across multiple GPUs. Setting --tensor-parallel-size 2 on a 2× A6000 Ada node shards the attention heads and FFN layers evenly across both GPUs, connected via PCIe Gen 4 (~64 GB/s inter-GPU bandwidth). Because PCIe carries an all-reduce synchronization between layers, tensor parallelism won't deliver a 2× throughput increase, but you will see 1.6–1.8× on generation and unlock models that exceed single-GPU VRAM capacity.
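A minimal profiling sketch for the saturation check described above, assuming the vLLM ≥ 0.6.x metric names and a server on localhost:8000 (adjust both to your deployment):

```python
# Watch vLLM's generation throughput while you step --max-num-seqs upward;
# when the printed series goes flat under rising load, the GPU is saturated.
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # adjust to your pod
THROUGHPUT = "vllm:avg_generation_throughput_toks_per_s"

def parse_metric(exposition: str, name: str):
    """Return the first sample value for `name` from Prometheus text format."""
    for line in exposition.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])  # "name{labels} value"
    return None

def watch_throughput(samples: int = 6, interval_s: float = 10.0) -> None:
    """Print periodic throughput samples from the /metrics endpoint."""
    for _ in range(samples):
        body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
        print(f"{THROUGHPUT}: {parse_metric(body, THROUGHPUT)} tok/s")
        time.sleep(interval_s)
```

Run `watch_throughput()` between each `--max-num-seqs` bump; the Grafana/Prometheus setup in Chapter 4 gives you the same signal with history.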
Here's the full launch command for the Scenario B configuration from Chapter 1:
python -m vllm.entrypoints.openai.api_server \
--model /runpod-volume/llama-3-70b-instruct-awq \
--quantization awq \
--tensor-parallel-size 2 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.93 \
--max-model-len 8192 \
--enforce-eager \
--host 0.0.0.0 \
    --port 8000

The --enforce-eager flag disables CUDA graph capture, which is useful when you're iterating on configurations and want faster startup times. Remove it in a stable production deployment to recover the ~15% throughput gain from graph optimization.
For agentic pipelines that match the SGLang profile (multiple sequential LLM calls, shared prefixes, guaranteed JSON), here's the equivalent server launch configuration.
SGLang for agentic workloads
For an agent pipeline serving structured JSON, the SGLang runtime configuration looks like this:
python -m sglang.launch_server \
--model-path /runpod-volume/llama-3-70b-instruct-awq \
--quantization awq \
--tp 2 \
--mem-fraction-static 0.88 \
--max-prefill-tokens 16384 \
--enable-flashinfer \
--host 0.0.0.0 \
    --port 30000

The --mem-fraction-static 0.88 is SGLang's equivalent of vLLM's --gpu-memory-utilization. Keep it slightly lower than your vLLM setting because SGLang's RadixAttention tree grows dynamically; you need headroom for prefix cache expansion. The --enable-flashinfer flag activates FlashInfer attention kernels, which reduce TTFT by 10–15% on prefill-heavy agentic requests.
Chapter 4: Advanced latency reduction
Once your engine is tuned for throughput, you can attack TTFT and generation latency directly with two techniques: speculative decoding and KV cache monitoring.
Speculative decoding
Speculative decoding targets a known bottleneck in autoregressive generation. The target model (70B) generates one token per forward pass, but much of that compute goes to tokens the model would predict with very high confidence anyway: "the", "is", punctuation, common phrases. The solution is to run a smaller draft model (8B) that speculatively generates 4–5 candidate tokens, then verify all of them with a single forward pass of the target model. When the draft tokens are correct (which happens 70–80% of the time with a well-matched draft model on chat workloads), you've generated 5 tokens for the cost of one target forward pass.
In practice, this delivers 1.8–2.2× speedup on generation throughput for chat-style workloads, with zero change to output quality (the verification step ensures the output distribution is identical to the original model).
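The expected gain follows from simple arithmetic. Assuming each of the k draft tokens is accepted independently with probability p (a simplification), the expected tokens produced per target-model forward pass is (1 - p^(k+1)) / (1 - p); the realized speedup is lower because the draft model's own forward passes aren't free.

```python
# Expected tokens emitted per target forward pass in speculative decoding,
# assuming k draft tokens each accepted independently with probability p.
# Real end-to-end speedup is lower: draft-model passes also cost compute.
def expected_tokens_per_pass(p: float, k: int) -> float:
    if p >= 1.0:
        return k + 1  # every draft token accepted, plus the verified token
    return (1 - p ** (k + 1)) / (1 - p)

print(round(expected_tokens_per_pass(0.7, 5), 2))  # -> 2.94
print(round(expected_tokens_per_pass(0.8, 5), 2))  # -> 3.69
```

At the 70–80% acceptance rates quoted above, roughly 3–4 tokens per target pass is what makes the 1.8–2.2× figure plausible once draft-model overhead is subtracted.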
Enabling speculative decoding in vLLM requires only two additional flags:
python -m vllm.entrypoints.openai.api_server \
--model /runpod-volume/llama-3-70b-instruct-awq \
--speculative-model /runpod-volume/llama-3-8b-instruct-awq \
--num-speculative-tokens 5 \
--speculative-draft-tensor-parallel-size 1 \
--tensor-parallel-size 2 \
--quantization awq \
--max-num-seqs 64 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--host 0.0.0.0 \
    --port 8000

The draft model runs on GPU 0 via --speculative-draft-tensor-parallel-size 1; the target model spans GPUs 1–2. You need a 3-GPU pod for this configuration. On Runpod, provision it by setting gpu_count: 3 in your pod template (use the vllm-production.yaml from the Appendix as your base, changing gpu_count from 2 to 3). Note --gpu-memory-utilization is set to 0.90 rather than 0.93: the third GPU carries the draft model's weights and KV cache, so leave extra headroom. On Runpod, a 3× A6000 Ada node at $2.37/hr delivers roughly 1,700 tok/s on the 70B model with speculative decoding active, cutting the effective cost per million tokens to around $0.39. TTFT drops to approximately 210ms.
Speculative decoding accelerates generation. The other half is knowing when your inference engine is the bottleneck, and fixing it before users notice.
KV cache management with Prometheus
vLLM exposes a Prometheus metrics endpoint at /metrics that gives you the observability needed to tune cache behavior in production. The two metrics to watch are vllm:gpu_cache_usage_perc and vllm:num_preemptions_total.
When gpu_cache_usage_perc runs consistently above 95%, you're cache-bound: new requests queue while waiting for cache slots to free up. The fix is either lowering --max-model-len (reducing max context per sequence) or increasing --gpu-memory-utilization by 0.01–0.02 increments.
When num_preemptions_total is climbing, vLLM is evicting in-progress sequences to free cache space for new prefills. This is the latency killer: preempted sequences must restart their prefill computation. If you see more than 5 preemptions per 1,000 requests, reduce --max-num-seqs until the preemption rate drops to near zero.
Add this scrape config to your Prometheus setup to capture both metrics:
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['<VLLM_POD_IP>:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s

Deploying Prometheus on Runpod: Runpod's single-container model means Prometheus can't run as a sidecar in the same pod. The simplest approach is to launch a separate Runpod pod using the prom/prometheus Docker image, then target your vLLM pod's internal IP in the scrape config above. Find the vLLM pod's internal IP in the Runpod console under pod details. Alternatively, expose port 8000 publicly on your vLLM pod and scrape it from an external Prometheus instance. A minimal Prometheus pod launch:
docker run -d \
-p 9090:9090 \
-v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest

Your model is optimized, your engine is tuned, and your cache behavior is instrumented. The final decision is where and how you run it.
Chapter 5: Infrastructure architecture: serverless vs. pods
Chapters 1–4 determine your cost per token. This chapter determines your monthly bill. At moderate traffic, the gap between the wrong and right deployment architecture is 77% of your compute spend.
How Runpod serverless works
Traditional serverless GPU infrastructure was impractical for LLM serving because cold starts took 30–120 seconds (time to pull container, load CUDA, initialize model weights). At that latency, a user's first message in any given session would time out.
Runpod's FlashBoot technology changes this constraint. By pre-warming container snapshots and caching model weight shards at the storage layer, FlashBoot reduces serverless cold starts to under 250ms for most requests, with roughly half of cold starts completing in under 200ms. That makes scale-to-zero viable for production, because a cold-start pod responds before most users register any latency.
The request flow is simple: incoming requests hit the endpoint queue, FlashBoot-warmed workers spin up on demand to serve them, and workers scale back to zero when the queue drains.
Runpod bills serverless on a per-second basis against actual compute consumed, not reserved capacity. When no requests are in flight, your cost is zero.
The cost calculation
Here's the real-world comparison for a chatbot with 1 million monthly requests, averaging 512 output tokens per response, running Llama-3-70B AWQ on 2× A6000 Ada hardware.
24/7 dedicated pod:
- $1.58/hr × 24 × 30 = $1,137.60/month
- GPU utilization: approximately 15–25% (most SaaS chatbots are bursty)
- Effective cost per million tokens: $2.90+
Runpod Serverless with autoscaling:
- 1M requests × 512 tokens ÷ 850 tok/s = ~602,000 seconds of GPU time
- 602,000 ÷ 3,600 = 167.2 GPU-hours
- 167.2 × $1.58/hr = $264.20/month
- Effective cost per million tokens: $0.52
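The comparison above reduces to a few lines of arithmetic. Here's a sketch using the article's figures ($1.58/hr for 2× A6000 Ada, ~850 tok/s on Llama-3-70B AWQ) so you can substitute your own traffic:

```python
# Dedicated 24/7 pod vs. serverless (billed per second of actual compute).
def monthly_costs(requests: int, tokens_per_request: int,
                  hourly_rate: float = 1.58, throughput_tps: float = 850.0):
    total_tokens = requests * tokens_per_request
    gpu_hours = total_tokens / throughput_tps / 3600
    serverless = gpu_hours * hourly_rate   # pay only for compute consumed
    dedicated = hourly_rate * 24 * 30      # always-on pod, 30-day month
    return dedicated, serverless, serverless / (total_tokens / 1e6)

dedicated, serverless, per_m = monthly_costs(1_000_000, 512)
print(f"dedicated ${dedicated:,.2f}/mo, serverless ${serverless:,.2f}/mo, "
      f"${per_m:.2f}/M tokens")
# -> dedicated $1,137.60/mo, serverless $264.37/mo, $0.52/M tokens
```

Rerun it monthly as traffic grows; once the serverless figure approaches the dedicated one, you're near the 65–70% utilization break-even discussed below.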
At moderate traffic, serverless costs 77% less. The break-even point occurs when your utilization consistently exceeds 65–70% of a 24/7 pod. At that utilization level, switching to reserved capacity (Runpod's committed pricing tier) drops the hourly rate further while maintaining the same API surface and configuration.
The math favors serverless at this traffic level.
Autoscaling configuration
In the Runpod Serverless console, set your endpoint concurrency as follows:
- Min workers: 0 for overnight/weekend scale-to-zero, or 1 if your SLA requires sub-200ms cold start avoidance
- Max workers: Set to your peak expected concurrent users divided by your --max-num-seqs setting. For a 70B model at 64 max sequences per worker, 10 max workers handles 640 simultaneous requests.
- Idle timeout: 30 seconds balances cold start frequency against wasted idle compute
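The max-workers rule is just a ceiling division; a one-liner sketch (function name is my own):

```python
import math

# Max workers = peak concurrent requests / per-worker batch capacity,
# rounded up (--max-num-seqs is each worker's concurrency ceiling).
def workers_needed(peak_concurrent: int, max_num_seqs: int) -> int:
    return math.ceil(peak_concurrent / max_num_seqs)

print(workers_needed(640, 64))  # -> 10
```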
The Runpod SDK lets you trigger endpoint calls and monitor queue depth programmatically:
import os
import runpod
runpod.api_key = os.environ["RUNPOD_API_KEY"] # set via: export RUNPOD_API_KEY=...
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")
run_request = endpoint.run({
    "input": {
        "prompt": "Explain KV cache eviction in vLLM.",
        "max_new_tokens": 512,
        "temperature": 0.7,
        "stream": True
    }
})
output = run_request.output(timeout=60)
print(output)

For streaming responses in production, use endpoint.run() and call .stream() on the result to get an iterator of partial outputs as they generate:
run_request = endpoint.run({"input": {"prompt": "...", "stream": True, ...}})
for chunk in run_request.stream():
    print(chunk["output"], end="", flush=True)

Deployment mode note: The templates in the Appendix are Pod templates; use them for the dedicated pod scenarios in Chapters 1–4. For the Runpod Serverless endpoint in this chapter, use Runpod's official vLLM worker image (runpod/worker-vllm:latest) and configure it via the Serverless endpoint console, passing vLLM flags as environment variables. The Docker args structure differs between Pods and Serverless workers; a Pod template deployed as-is will not give you scale-to-zero autoscaling.
Appendix: Copy-paste configurations
These three templates are production-ready starting points. Adjust model paths and VRAM settings to match your specific hardware.
Template 1: vllm-production.yaml (optimized for throughput)
# vllm-production.yaml
# Target: Llama-3-70B AWQ on 2x A6000 Ada
# Optimized for: throughput-first serving
version: "1.0"
container_disk_in_gb: 50
volume_in_gb: 100
volume_mount_path: "/runpod-volume"
gpu_type_id: "NVIDIA RTX A6000 Ada"
gpu_count: 2
docker_args: >
  python -m vllm.entrypoints.openai.api_server
  --model /runpod-volume/llama-3-70b-instruct-awq
  --quantization awq
  --tensor-parallel-size 2
  --max-num-seqs 64
  --gpu-memory-utilization 0.93
  --max-model-len 8192
  --disable-log-requests
  --host 0.0.0.0
  --port 8000
ports: "8000/http"
env:
  - VLLM_WORKER_MULTIPROC_METHOD: spawn
  - RAY_DISABLE_MEMORY_MONITOR: "1"

Template 2: sglang-agent.json (optimized for structured output)
{
"name": "sglang-agent-endpoint",
"imageName": "lmsysorg/sglang:v0.4.1",
"gpuTypeId": "NVIDIA RTX A6000 Ada",
"gpuCount": 2,
"containerDiskInGb": 50,
"volumeInGb": 100,
"volumeMountPath": "/runpod-volume",
"dockerArgs": "python -m sglang.launch_server --model-path /runpod-volume/llama-3-70b-instruct-awq --quantization awq --tp 2 --mem-fraction-static 0.88 --max-prefill-tokens 16384 --enable-flashinfer --enable-cache-report --host 0.0.0.0 --port 30000",
"ports": "30000/http",
"env": [
{ "key": "SGL_DISABLE_DISK_CACHE", "value": "0" },
{ "key": "TOKENIZERS_PARALLELISM", "value": "false" }
]
}

Pin the image tag to a stable release; check https://github.com/sgl-project/sglang/releases for the latest. (JSON does not allow comments, so keep version notes like this one outside the template file.)

Template 3: prometheus-vllm.yaml (GPU and inference metrics)
# prometheus-vllm.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'vllm-inference'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s
    params:
      format: ['prometheus']

  - job_name: 'node-gpu'
    static_configs:
      - targets: ['localhost:9835']  # nvidia dcgm-exporter
    scrape_interval: 5s

rule_files:
  - 'vllm_alerts.yaml'

# Key metrics to dashboard:
# vllm:gpu_cache_usage_perc                  -> KV cache pressure (alert >90%)
# vllm:num_preemptions_total                 -> eviction rate (alert >5/1000 req)
# vllm:avg_generation_throughput_toks_per_s  -> throughput baseline
# vllm:time_to_first_token_seconds           -> P50/P95/P99 TTFT

Template 4: vllm-alerts.yaml (alerting rules)
groups:
  - name: vllm_inference
    rules:
      - alert: VllmCacheNearFull
        expr: vllm:gpu_cache_usage_perc > 0.90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "vLLM KV cache above 90%, consider reducing --max-model-len or --max-num-seqs"
      - alert: VllmHighPreemptions
        # Preemptions per request over 5m; 0.005 = 5 per 1,000 requests
        expr: rate(vllm:num_preemptions_total[5m]) / rate(vllm:request_success_total[5m]) > 0.005
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM preemption rate above 5 per 1,000 requests, consider reducing --max-num-seqs"

Conclusion
These optimizations stack. Move from FP16 to AWQ on a 4090 and your cost per million tokens drops 70%. Tune --max-num-seqs and --gpu-memory-utilization to saturation and you recover another 20–30% in throughput. Add speculative decoding on a 3-GPU configuration and generation latency drops by another 45%, bringing TTFT to ~210ms at $0.39/M tokens. Deploy serverless instead of a 24/7 pod at moderate traffic and your monthly bill drops by more than half. Apply all four and the same seed-round budget that previously bought you 30 days of compute buys you four months.
The infrastructure layer shouldn't be the bottleneck on any of this. On Nebius or a bare-metal provider, you'd still be installing drivers when your competitor ships. On Runpod, you take the configurations from the Appendix above, deploy them via the console or the Runpod SDK in under three minutes, and start profiling against real traffic on day one.
Optimization is ongoing. Pick the matching template, wire up Prometheus, and start iterating from a working baseline.
Log in to the Runpod Console and launch your first optimized endpoint using the vllm-production.yaml template above. You'll have your first benchmark numbers in under three minutes.

