You picked a model off Hugging Face, picked a GPU that sounded big enough, and the deployment crashed on startup with CUDA out of memory. Or it loaded fine, ran for a few requests, then died the first time three users hit the endpoint at once. Both failures trace back to the same root cause: a mismatch between available VRAM and what the model requires.
This is the single most common mistake teams make moving from "I called the OpenAI API" to "I'm serving my own model." Training has forgiving failure modes; a job that runs out of memory just restarts from a checkpoint. Inference doesn't have that luxury. An undersized endpoint fails in production, in front of users, and an oversized one quietly burns budget every hour it sits idle.
This guide walks through what happens during inference, why memory, not raw compute, is what limits how many users you can serve, and how to calculate VRAM requirements before you select hardware.
It closes with a GPU tier comparison for inference workloads and a walkthrough for deploying your first endpoint on Runpod. For the deeper mechanics of why memory behaves this way (KV cache, paged attention, batching), see LLM Inference from First Principles; for how inference's hardware profile differs from training's, see AI Inference vs Training: GPU Selection and Cost.
What Happens During LLM Inference: Prefill, Decode, and KV Cache
Inference is the process of running a trained model forward to generate output tokens from input tokens. The weights are frozen: you're not updating anything, just executing a fixed function, very fast, many times per second.
That execution splits into two phases with different resource profiles:
Prefill processes your entire input prompt in one parallel pass. The model computes attention across all input tokens simultaneously, which is why prefill is compute-bound; more prompt tokens means more matrix multiplication. This is also where time-to-first-token (TTFT) accumulates, since nothing streams back to the user until prefill finishes.
Decode generates output tokens one at a time, sequentially. Each step produces a single token, appends it to the sequence, and runs again. Decode is memory-bandwidth-bound: at every step, the GPU has to pull the full model weights plus the growing KV cache out of VRAM, and the actual compute is comparatively trivial. This is why a GPU with more memory bandwidth often beats a GPU with more theoretical FLOPS on decode latency.
The KV cache is the part that catches most people off guard. Every decode step reuses the key and value tensors computed for every prior token in the sequence, rather than recomputing them from scratch. That cache has to live in VRAM alongside the model weights, and it grows with every token generated and with every concurrent request you're serving. A model that fits comfortably on a GPU at idle can run out of memory the moment real concurrent traffic shows up, not because the weights changed, but because the cache did.
That's the core insight this guide builds on: sizing for inference means sizing for weights and KV cache together, under your expected concurrency, not just sizing for the model file on disk.
Why GPU Memory, Not FLOPS, Is the Binding Constraint for LLM Inference
It's tempting to shop for GPUs on TFLOPS, the way you'd shop for a CPU on clock speed. For inference, that's the wrong number to anchor on.
Decode, the phase that dominates wall-clock time on any multi-token response, spends most of its time moving data, not computing. The matrix multiplications involved in generating one token are small relative to how much weight and cache data has to travel from VRAM to the compute units to make that multiplication happen. A GPU can have enormous FLOPS on paper and still serve tokens slowly if its memory bandwidth can't keep the compute units fed.
This is also why VRAM capacity, not just bandwidth, sets a hard ceiling on what you can serve at all. If the model's weights plus the KV cache for your expected concurrent load don't fit in the GPU's memory, the deployment doesn't run slowly; it doesn't run. There's no graceful degradation; vLLM and similar serving engines will reject new requests or crash with an out-of-memory error once the cache pool is exhausted.
So the sizing question isn't "is this GPU fast enough?" It's "does this GPU have enough memory, with enough bandwidth, to hold my model and my cache and still move data fast enough to hit my latency target?" That's a memory-first question, and it's why the formula in the next section starts with VRAM, not FLOPS.
How to Calculate VRAM Requirements for LLM Inference
The fast version of this calculation has three components: model weights, KV cache, and overhead.
Step 1: Weights
Weight memory follows directly from parameter count and numeric precision:
Weight memory (GB) = parameters (billions) × bytes per parameterBytes per parameter depends on precision:
A practical rule of thumb that holds up well for serving estimates: roughly 2 GB of VRAM per billion parameters at FP16, or roughly 0.5 GB per billion parameters at 4-bit quantization. Quality degradation isn't linear across this range. BF16 is effectively lossless versus FP32 for serving, INT8 is usually imperceptible with reasonable calibration, and INT4 starts to show measurable degradation on reasoning-heavy tasks. Validate on your own evals before committing INT4 to production.
Step 2: KV cache
This is the part most back-of-envelope estimates skip, and it's the part that determines how many concurrent users you can actually serve. The KV cache cost per token is:
KV bytes per token = 2 × num_kv_heads × head_dim × num_layers × bytes_per_element(The factor of 2 covers the separate K and V tensors.) Multiply by sequence length and by the number of concurrent sequences you want to support, and you get total cache memory:
Total KV cache (GB) = KV bytes per token × sequence_length × concurrent_requests ÷ 1e9
Models using grouped-query attention (GQA), including Llama 3 and most current open-weight models, have far fewer KV heads than query heads, which keeps this number much smaller than it would be under older multi-head attention. For Llama 3.1 8B Instruct in BF16 (32 layers, 8 KV heads, head_dim 128), the KV cache works out to roughly 512 MB per request at a 4,096-token sequence length. Older multi-head-attention models, which would use the full 32 query heads as KV heads instead of 8, can run closer to 2 GB per request at the same length. Architecture matters as much as parameter count here.
Step 3: Overhead
Add 15–20% on top of weights plus KV cache for activation memory, framework overhead, and CUDA context. In Runpod's vLLM deployments, this is roughly what GPU_MEMORY_UTILIZATION controls: setting it to 0.90 tells the engine to reserve 90% of VRAM for weights and cache combined, leaving the remainder as safety margin.
Worked example: Llama 3.1 8B Instruct, serving 50 concurrent users
- Weights (BF16): 8 × 2 GB = 16 GB
- KV cache: ~512 MB/request × 50 requests = ~25 GB at a 4,096-token sequence length
- Subtotal: 41 GB
- Plus 15% overhead: ~47 GB
That's already past a 24 GB RTX 4090 and into 48 GB L40S / A6000 Ada territory, and this is an 8B model. The lesson generalizes: weights are usually the smaller line item once you're serving real concurrent traffic. Cache dominates.
Worked example: Llama 3 70B, single request vs. quantized
- FP16 weights: 70 × 2 GB = 140 GB. Doesn't fit on a single 80 GB H100 or A100; needs at least two GPUs with tensor parallelism.
- 4-bit (AWQ) weights: 70 × 0.5 GB = 35 GB in theory; quantization metadata and runtime overhead push real-world consumption to roughly 40–45 GB, which fits a single 80 GB A100 or H100 with limited room left for KV cache.
This is the calculation to run before you pick a GPU, not after a deployment fails. Pick your model, pick your precision, pick your target concurrency and context length, run the math, and only then go shopping for hardware.
GPU Tier Comparison for Inference Workloads: RTX 4090 vs A100 vs H100
Three tiers cover the overwhelming majority of LLM inference workloads. Each is optimized for memory bandwidth, the decode-phase bottleneck described above, rather than peak FLOPS.
(Pricing reflects Runpod's published rates checked against runpod.io/pricing in June 2026; check for current figures.)
A few things worth calling out beyond the raw spec sheet:
The RTX 4090 is a consumer card, and that matters for more than price. It lacks ECC memory, which corrects silent bit-level memory errors. For short-lived prototyping or low-stakes endpoints, that's a non-issue. For a production endpoint that runs unattended for weeks, undetected memory corruption is a real (if rare) risk that the A100 and H100 are built to handle.
A100 and H100 differ more in bandwidth than in capacity. Both ship with 80 GB of VRAM, so for workloads where capacity, not speed, is the constraint (fitting a 70B model's weights, for instance), the A100 does the job at a meaningfully lower hourly rate. Where the H100 earns its premium is decode-bound, high-concurrency serving, where its ~3.35 TB/s of bandwidth (and native FP8 Tensor Core support) directly translates into more tokens per second per dollar at scale. NVIDIA's own benchmarking on H100 FP8 reports up to 4.6x higher peak throughput and 4.4x faster time-to-first-token versus A100 on equivalent workloads, though that comparison reflects NVIDIA's specific test configuration, and your mileage will vary by model and batch size.
Real measured throughput, for calibration: Runpod's own benchmarking of an AWQ-quantized Llama-3-8B on a single RTX 4090 measured roughly 3,500 tokens/sec at batch size 32 with a ~120ms TTFT, published in Runpod's LLM inference optimization playbook. On the high end, NVIDIA's own published TensorRT-LLM benchmarks for Llama-3.3-70B-Instruct at FP8 show throughput ranging from roughly 2,785 tokens/sec at a 2,048/2,048-token input/output split up to over 6,000 tokens/sec at a 128/128-token split, on a 2×H100 tensor-parallel configuration. That's two GPUs working together, not one, so don't expect those exact numbers from a single-H100 70B deployment; the point of citing it is a real, sourced throughput ceiling to sanity-check against vendor marketing claims, not a single-GPU equivalent. Either way, these numbers won't transfer directly to your model and traffic shape: run your own benchmark against your own model before committing to a GPU tier at scale.
Runpod Serverless vs. Dedicated Pods for LLM Inference
Once you know how much VRAM you need, the next decision is how you pay for it.
Serverless fits variable or bursty traffic, the common case for early-stage products, internal tools, or anything without a predictable request volume. Runpod Serverless scales workers from zero and bills per second of active compute, so an endpoint that sees 10 requests an hour costs nothing in between. Runpod's FlashBoot optimization keeps cold starts low (the majority complete in under a few seconds with cached models), which makes scale-to-zero viable for most latency budgets, though baking model weights into your container image removes cold-start variance entirely for latency-sensitive deployments.
Dedicated Pods make sense once sustained request volume is high enough that a fixed hourly rate beats per-second billing, typically once utilization is consistently high through most of the day. A Pod also gives you full environment control and zero cold starts, which matters if your latency SLA can't tolerate even an occasional cold-start request.
The decision isn't permanent. Plenty of teams start with a Serverless Flex endpoint to validate demand, then move to Active workers or a dedicated Pod once traffic is steady enough to justify it. Run the math on your own expected GPU-hours against both billing models before committing either way; see the optimization tradeoffs in the LLM Inference Optimization Playbook for the full cost-modeling approach.
Runpod Serverless Endpoint Setup Walkthrough
With sizing done and a GPU tier picked, deploying takes under 10 minutes:
- Log in to the Runpod console and go to the vLLM worker page in the Runpod Hub.
- Click Deploy, and use the latest available vLLM worker version.
- Enter your model name in the Model field, for example, meta-llama/Llama-3.1-8B-Instruct. For gated Hugging Face models, add HF_TOKEN under environment variables.
- Pick your GPU tier based on the VRAM math above. An RTX 4090 or L4 handles an 8B model comfortably; a 70B model at 4-bit needs an A100 80GB; full-precision 70B needs multiple GPUs with tensor parallelism.
- Click Advanced to expand the vLLM settings, and set MAX_MODEL_LEN to your real expected sequence length (not the model's default context window: over-provisioning this wastes VRAM that could otherwise go to concurrent capacity), and GPU_MEMORY_UTILIZATION to 0.90 as a starting point.
- Click Create Endpoint, then watch the Logs tab for Application startup complete. A CUDA out of memory error here means your VRAM math was off somewhere; drop to a larger GPU, lower MAX_MODEL_LEN, or apply quantization.
- Test the endpoint directly from the Requests tab before wiring it into your application, no separate tooling needed, and the endpoint is OpenAI-compatible from the moment it's live.
For the full mechanics of what's happening under the hood during that deployment, PagedAttention, continuous batching, the serving engine's scheduler, see vLLM Explained: PagedAttention and Continuous Batching. For a from-scratch Docker deployment on a dedicated Pod instead of Serverless, see Deploy vLLM with Docker on Runpod.
GPU Memory Sizing Checklist for LLM Inference
Before you rent a GPU for an inference workload, you should be able to answer:
- What's your model's parameter count and target precision? (That's your weight memory.)
- What's your expected concurrent request count and typical sequence length? (That's your KV cache memory.)
- Does weights + cache + 15–20% overhead fit inside a single GPU's VRAM, or do you need tensor parallelism across multiple cards?
- Is your traffic bursty enough to favor Serverless, or steady enough to justify a dedicated Pod?
Get those four answers before you pick hardware, and you avoid both failure modes this guide opened with: the endpoint that crashes under real traffic, and the GPU that sits oversized and overpriced because nobody ran the math.
FAQ: GPU Memory Sizing for LLM Inference
How much VRAM do I need for an 8B model?
At BF16, roughly 16 GB for weights alone, but real-world serving needs more headroom for the KV cache and overhead. Run the full calculation in the "How to Calculate VRAM Requirements" section above rather than sizing off the model's file size: a 16 GB model file can easily need 40+ GB of VRAM once you account for concurrent requests at a realistic context length.
What happens if my endpoint runs out of VRAM?
The deployment fails outright rather than degrading gracefully. vLLM and similar serving engines reject new requests or crash with a CUDA out of memory error once weights and KV cache exceed available VRAM. On Runpod, the fix is usually one of three things: pick a larger GPU tier, lower MAX_MODEL_LEN, or apply quantization (AWQ, GPTQ, or INT8) to shrink the weight footprint.
Can I run a 70B model on a single GPU?
Only with aggressive quantization. At FP16, a 70B model needs about 140 GB of VRAM, which exceeds even an 80 GB H100 or A100, so you'd need tensor parallelism across at least two GPUs. At 4-bit (AWQ/GPTQ), real-world VRAM use lands around 40–45 GB, which fits a single 80 GB card with limited room left for KV cache. If you do need multiple GPUs, Runpod's TENSOR_PARALLEL_SIZE environment variable handles the split for you.
Does longer context length really use that much more memory?
Yes, and it's the part most sizing estimates miss. KV cache memory scales linearly with both sequence length and concurrent requests, not just model size. Doubling your expected context length roughly doubles your KV cache footprint at the same concurrency. Set MAX_MODEL_LEN to what your application actually needs rather than the model's maximum supported context window.
What's the cheapest way to test a GPU tier before committing to it?
Deploy a Runpod Serverless endpoint instead of a dedicated Pod first. Pay-per-second billing means you can test your actual model against an RTX 4090, A100, or H100 tier for a few dollars and compare real throughput and VRAM usage before committing to hardware for production. The setup walkthrough above covers the full process.
Ready to size and deploy your own endpoint? Spin up a Serverless vLLM endpoint on Runpod: pay-per-second billing means testing a GPU tier against your own model costs a few dollars, not a reserved-instance commitment.
