

Compare every GPU on Runpod by tokens/sec, VRAM, and cost for your specific model, task, and cluster config.
| GPU | VRAM | Tokens/sec | Max Context | VRAM Fit | Rec. Engine | Price ($/hr) | Availability |
|---|---|---|---|---|---|---|---|
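
The VRAM Fit column reduces to simple arithmetic: weights take roughly one byte per parameter at FP8 (two at BF16), and the KV cache grows with context length and batch size. A minimal sketch of that check, with illustrative numbers rather than measured values:

```python
def fits_in_vram(params_b: float, bytes_per_param: float,
                 kv_cache_gb: float, vram_gb: float,
                 overhead: float = 1.2) -> bool:
    """Rough fit check: weights + KV cache + ~20% runtime overhead vs. card VRAM."""
    weights_gb = params_b * bytes_per_param  # e.g. 70B params at FP8 ~= 70 GB
    return (weights_gb + kv_cache_gb) * overhead <= vram_gb

# A 70B model at FP8 with a 20 GB KV-cache budget fits one H200 (141 GB)...
print(fits_in_vram(params_b=70, bytes_per_param=1, kv_cache_gb=20, vram_gb=141))  # True
# ...but not at BF16, which is why BF16 pushes you to more cards.
print(fits_in_vram(params_b=70, bytes_per_param=2, kv_cache_gb=20, vram_gb=141))  # False
```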
Find your model below to see the ideal Runpod deployment for your specific workload.
Runs comfortably at FP8 on an 8× H100 SXM node. Requires 8× H200 for BF16. LoRA fine-tuning feasible on 8× H100; full-parameter requires multi-node.
MoE architecture reduces active VRAM usage during inference. FP8 quantization is strongly recommended. SGLang provides the best throughput on this architecture.
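
As a concrete sketch of that setup, here is SGLang's offline Engine API serving an FP8 checkpoint across an 8-GPU node. The model path is a placeholder, and the kwargs follow current SGLang releases, so verify them against your installed version:

```python
import sglang as sgl

# Placeholder checkpoint id; substitute the FP8 MoE model you are deploying.
llm = sgl.Engine(
    model_path="org/moe-model-fp8",  # hypothetical repo id
    tp_size=8,                       # tensor parallel across the 8x H100 SXM node
    quantization="fp8",              # usually auto-detected for pre-quantized weights
)

outputs = llm.generate(
    ["Explain why MoE models activate only a fraction of their parameters."],
    {"temperature": 0.6, "max_new_tokens": 128},
)
print(outputs[0]["text"])
```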
Fits on a single H200 at FP8. LoRA fine-tuning achievable on 2× H100. SGLang provides the best throughput for chain-of-thought workloads.
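
If you take the LoRA route on 2× H100, a minimal PEFT setup looks like the sketch below; the checkpoint id and target modules are assumptions to adjust for your model:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint id; swap in the model you are fine-tuning.
base = AutoModelForCausalLM.from_pretrained(
    "org/reasoning-model",      # hypothetical repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",          # shards the base weights across both H100s
)

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # adapters are typically well under 1% of total params
```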
Fits comfortably on a single H100 SXM at FP8. H200 provides extra headroom for long reasoning traces at full context.
7B distilled variant. Fits on a single RTX 4090. L40S offers better throughput for batch inference. 14B variant requires L40S or H100.
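
For the 7B variant on a single 24 GB card, a minimal vLLM sketch (the model id is a placeholder; capping context length keeps the KV cache inside the 4090's VRAM):

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint id; any ~7B model at FP16 (~14 GB of weights) fits a 24 GB RTX 4090.
llm = LLM(model="org/distilled-7b", max_model_len=8192)  # bound context to bound KV cache

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Why does batching raise GPU throughput?"], params)
print(outputs[0].outputs[0].text)
```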
MoE architecture; active parameters are much lower than total. L40S handles standard inference well. H100 recommended for 10M-token context window workloads.
Runs well on a single RTX 4090. L40S is the sweet spot for batch generation. H100 offers minimal additional benefit for diffusion workloads at standard resolutions.
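
A minimal diffusers sketch of that single-4090 setup; the pipeline id is a placeholder for whichever image model you deploy:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder pipeline id; substitute the diffusion model you are serving.
pipe = DiffusionPipeline.from_pretrained(
    "org/image-model",          # hypothetical repo id
    torch_dtype=torch.float16,  # FP16 keeps weights well inside the 4090's 24 GB
).to("cuda")

image = pipe("a GPU server rack lit in violet, photorealistic").images[0]
image.save("out.png")
```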
Side-by-side specs and benchmarks for every GPU Runpod offers.
The most cost-effective platform for building, training, and scaling machine learning models—ready when you are.