You've decided to self-host. The model's picked, the use case is real, and now there's one decision left before you can ship: which serving engine actually runs the model in production. The two real contenders are vLLM and TensorRT-LLM, and both are good choices.
The right one depends on your traffic shape, your team's tolerance for build complexity, and how much that extra throughput is worth to you. This article compares them directly and shows how to deploy either on Runpod Serverless.
What are vLLM and TensorRT-LLM?
vLLM is an open-source inference engine built around PagedAttention and continuous batching. It manages the KV cache in non-contiguous memory blocks (the way an OS manages virtual memory pages), which eliminates the memory fragmentation that plagued earlier serving stacks, and it ships an OpenAI-compatible API out of the box. The practical upshot: pip install vllm, point it at a Hugging Face model ID, and you have a production-grade endpoint in minutes. We cover the internals - PagedAttention, the batching scheduler - in detail in vLLM Explained: PagedAttention and Continuous Batching; this article assumes that background and focuses on the comparison.
TensorRT-LLM is NVIDIA's compiler-based inference library. Instead of running your model through a general-purpose Python/PyTorch execution path, it compiles the model into an optimized engine specific to your exact GPU architecture, batch size range, and precision - fusing kernels, applying NVIDIA-specific quantization (including FP8 on Hopper and newer), and stripping out generality the way an ahead-of-time compiler strips out a JIT's runtime checks. That compilation step is also the cost: building an engine takes real time, has to be redone when you change GPU architecture or key parameters, and exposes more configuration surface than vLLM's drop-in deployment. (TensorRT-LLM's newer PyTorch backend, stable and default since v1.0, removes the explicit compile step at some cost to peak throughput - more on that below.)
Both are real production systems used at scale. The difference is where on the speed-versus-flexibility curve they sit.
vLLM vs TensorRT-LLM Throughput Comparison
NVIDIA publishes its own throughput numbers for TensorRT-LLM. On Llama-3.3-70B-Instruct-FP8 with tensor parallelism of 2 on H100, NVIDIA's official TensorRT-LLM performance overview reports the following maximum-load throughput using the current PyTorch backend:
vLLM tracks its own performance on a continuously updated, interactive performance dashboard rather than a fixed benchmark. Overall:
- TensorRT-LLM's compiled, kernel-fused execution path tends to win on raw tokens/sec at sustained high load, consistent with NVIDIA's own numbers above.
- vLLM gives back some of that peak throughput in exchange for a deployment path with no compile step.
If you need an exact number for your own model and traffic shape, the right move is to check vLLM's dashboard for your model/GPU pair, and run NVIDIA's trtllm-bench (documented on the same performance-overview page) for TensorRT-LLM.
Cold start is where the comparison flips, and it matters more than the throughput gap for most Serverless deployments:
- TensorRT-LLM's legacy compile-first workflow builds an engine before it can serve a single request, and for a 70B-class model that build routinely takes tens of minutes.
- vLLM's cold start is just weight loading, no compilation, so it's typically a small fraction of that. For Serverless deployments that scale workers up and down with traffic, this directly affects how many requests hit a cold worker and eat the full cold-start latency penalty.
- TensorRT-LLM's newer PyTorch backend (default since v1.0) skips the ahead-of-time compile step entirely and brings cold start down to roughly 60-90 seconds, closing most of that gap at some cost to peak throughput.
- If cold starts are a concern, check whether your build is using the PyTorch backend before ruling TensorRT-LLM out on that basis alone.
NVIDIA's own framing of TensorRT-LLM's ceiling, from a separate, earlier benchmark: NVIDIA's official "H100 has 4.6x A100 Performance" blog reports an H100 hitting over 10,000 output tokens/sec at around 100ms time-to-first-token at 64 concurrent requests. That figure used TensorRT-LLM v0.5.0 against an A100, an older release than the v0.21-era PyTorch-backend numbers in the table above, and it's not a head-to-head against vLLM, so treat it as a best-case ceiling under NVIDIA's own tuned configuration rather than a number directly comparable to the table.
As with any vendor benchmark, treat absolute numbers as directional and re-run against your own model and traffic pattern before sizing a production deployment.
TensorRT-LLM: When the Throughput Edge Is Worth the Build Time
A consistent double-digit throughput gain at scale can be the difference between needing one extra GPU and not. TensorRT-LLM is worth the extra build investment when:
Your traffic is high and predictable enough to amortize the compile cost. Building an engine takes meaningfully longer than starting a vLLM worker, and that engine is tied to a specific GPU architecture, batch size range, and precision configuration. If you're running a steady, high-volume production workload on a fixed GPU tier, you compile once and serve millions of requests against that engine - the build cost disappears into the denominator.
You're running on Hopper-class or newer GPUs and want native FP8. TensorRT-LLM's FP8 support is mature and well-integrated with NVIDIA's own quantization tooling, and FP8's throughput and memory benefits compound with the compiler's other optimizations.
You have the engineering capacity to own a more complex deployment pipeline. Engine builds need to be regenerated when you change GPU type, update the model, or shift your expected batch size range - that's a build step that has to live somewhere in CI, not just a container restart. (The PyTorch backend reduces, but doesn't eliminate, this overhead.)
Where it's the wrong call: if your model lineup changes frequently, if you're iterating on prompts/fine-tunes/model versions week to week, or if your team doesn't have spare cycles to own engine-build infrastructure, the throughput gain isn't worth what you give up in iteration speed.
vLLM: When Iteration Speed Beats Raw Throughput
vLLM is the right default for most teams, and that's not a hedge - it's a reflection of how most inference workloads actually behave in their first year of production.
You're iterating on models. Swapping a model is a config change, not a rebuild. If you're still testing which open-weight model fits your use case, or you fine-tune frequently, vLLM's lack of a compile step means every iteration cycle is faster.
Your traffic doesn't justify the engineering investment yet. If you're not yet at a scale where a double-digit percentage of throughput meaningfully changes your GPU bill, the operational simplicity of vLLM is the better trade.
You need to move across GPU types or cloud regions without re-engineering. A vLLM deployment is portable in a way a compiled TensorRT-LLM engine isn't - there's no architecture-specific artifact to regenerate.
You want the larger open-source ecosystem. vLLM has broader day-one support for newly released open-weight models, a larger contributor base, and faster community turnaround when a new model architecture lands on Hugging Face.
The honest framing: vLLM gets you to production faster and keeps you flexible. TensorRT-LLM gets you more tokens per dollar once you know exactly what you're running and you're running it at volume.
How to Deploy vLLM and TensorRT-LLM on Runpod Serverless
vLLM on Runpod Serverless:
- Go to the vLLM worker page in the Runpod Hub and click Deploy.
- Enter your Hugging Face model ID in the Model field (add HF_TOKEN as an environment variable for gated models).
- Click Advanced to set MAX_MODEL_LEN to your real sequence length and GPU_MEMORY_UTILIZATION to 0.90 as a starting point. Select your GPU tier - see the hardware-sizing table in vLLM Explained: PagedAttention and Continuous Batching if you're not sure how much VRAM your model and expected concurrency need.
- Click Create Endpoint. The endpoint is OpenAI-API-compatible immediately - no client-side changes needed if you're already using an OpenAI SDK.
For a from-scratch Docker-based vLLM deployment with more control over the container (note: this is a Pods-based workflow, not Serverless), see Deploy vLLM with Docker on Runpod.
TensorRT-LLM on Runpod Serverless:
- Build (or obtain) a TensorRT-LLM engine matched to your target GPU architecture, model, precision, and expected batch size range. This step happens before you touch Runpod - engine builds are GPU-architecture-specific, so build on the same GPU family you intend to serve on. (Using the PyTorch backend instead of the legacy compile path removes this step, at some throughput cost - see above.)
- Package the compiled engine (or PyTorch-backend setup) plus the TensorRT-LLM runtime into a container image, with a handler function for Runpod Serverless. Push it to a registry Runpod can pull from.
- In the Runpod console, create a Serverless endpoint from a Custom Container, pointing at your image, rather than deploying a pre-built worker from the Runpod Hub.
- Select the GPU tier matching the architecture your engine was compiled for - an engine built for H100 will not run on an A100.
- Set worker concurrency and scaling parameters based on your benchmarked per-worker throughput (from a process like the one in the table above, run against your own model).
- Deploy and load-test before routing production traffic - confirm the loaded engine matches the GPU it's actually running on.
The deployment-complexity gap between these two paths is itself a data point in the decision: if step 1 of the TensorRT-LLM path is a blocker for your team right now, that's a legitimate reason to start with vLLM and revisit once traffic justifies the investment.
vLLM vs TensorRT-LLM Decision Matrix
For most teams shipping their first self-hosted inference endpoint, vLLM is the pragmatic starting point. For teams running stable, high-volume production traffic on a known GPU architecture where the throughput gain translates directly into fewer GPUs and lower spend, TensorRT-LLM earns its build complexity.
Either way, Runpod Serverless is the deployment substrate underneath: scale-to-zero billing while you're testing, dedicated worker pools once you've committed to a framework and a traffic pattern. Compare GPU pricing and start a Serverless endpoint.
FAQ: vLLM vs TensorRT-LLM on Runpod Serverless
Is vLLM or TensorRT-LLM faster for LLM inference?
TensorRT-LLM is faster in most published benchmarks, especially at sustained high load where its compiled execution path pays off (see NVIDIA's own throughput numbers above). But vLLM has a much faster cold start since it skips the engine-compile step, which matters more than peak throughput for a lot of serverless workloads. You can test both on the same GPU without committing upfront: deploy vLLM through the Runpod Hub in minutes, then benchmark a TensorRT-LLM custom container against it.
Can I run TensorRT-LLM on Runpod Serverless?
Yes. TensorRT-LLM deploys as a custom container endpoint rather than through the Runpod Hub's pre-built worker. You build the engine (or use the PyTorch backend to skip that step), package it with the TensorRT-LLM runtime and a handler function, and push that image to a registry Runpod can pull from. See the deployment steps above for the full process.
Can I run vLLM on Runpod Serverless?
Yes, vLLM runs on Runpod Serverless: it's a one-click deploy from the Runpod Hub's vLLM worker page. You pick a Hugging Face model, set GPU tier and a couple of vLLM settings (MAX_MODEL_LEN, GPU_MEMORY_UTILIZATION), and the endpoint comes up with an OpenAI-compatible API (and Anthropic-compatible Messages API) out of the box.
Do I need an NVIDIA GPU to run TensorRT-LLM?
Yes, TensorRT-LLM only runs on NVIDIA hardware. vLLM is more flexible: it supports both NVIDIA (CUDA) and AMD (ROCm) GPUs, so it's the better choice if you want to keep your hardware options open. Runpod's GPU catalog includes both NVIDIA (A100, H100, H200) and AMD (MI300X) instances if portability matters to your deployment.
Can I switch between vLLM and TensorRT-LLM without switching cloud providers?
Yes. Both frameworks deploy on Runpod Serverless, vLLM through the Hub's pre-built worker, TensorRT-LLM through a custom container. That means you can start with vLLM for faster iteration and move to TensorRT-LLM later for the throughput gain, all on the same platform.
Which framework should I start with if I'm not sure yet?
Start with vLLM. It deploys in minutes through the Runpod Hub with no compile step, so you can validate your model and traffic pattern before investing in TensorRT-LLM's build pipeline. Once your traffic is high and predictable enough that the throughput gain translates into real GPU savings, that's the point to evaluate a TensorRT-LLM migration, on the same Runpod Serverless infrastructure you're already using.
Is SGLang a third option besides vLLM and TensorRT-LLM?
Yes. SGLang uses RadixAttention to share KV cache across requests with a common prefix, and its throughput and cold start generally land between vLLM and TensorRT-LLM, which is why it's not the focus of this two-framework comparison. Runpod supports it as well: deploy it through the SGLang worker on the Runpod Hub, or see Building Production LLM Pipelines with SGLang for a Pods-based deployment walkthrough. Worth benchmarking against whichever of vLLM or TensorRT-LLM you're leaning toward if you're choosing between all three.
