
GPU Benchmarks Directory

GPU Comparison Tool

Compare every GPU on Runpod by tokens/sec, VRAM, and cost for your specific model, task, and cluster config.

[Interactive comparison table: columns are GPU, VRAM, tokens/sec, max context, VRAM fit, recommended engine, price, and availability, with a toggle to hide incompatible GPUs.]

Recommended Configurations

Find your model below to see the ideal Runpod deployment for your specific workload.

Llama 3.1 405B (SGLang)
VRAM required: ~800 GB (FP8) / ~1.6 TB (BF16)
Min. viable GPU: 8× H100 SXM
Optimal GPU: 8× H200
Fine-tuning: 8× H200 (LoRA)

Runs comfortably at FP8 on an 8× H100 SXM node; BF16 serving needs at least 8× H200. LoRA fine-tuning is feasible on 8× H100; full-parameter fine-tuning requires multi-node.
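
As a concrete starting point, the sketch below launches an FP8 Llama 3.1 405B endpoint with SGLang across all eight GPUs of one node. The checkpoint id, port, and flag spellings are assumptions; check your SGLang version's `--help` for the exact options.

```python
# Hedged sketch: serve Llama 3.1 405B (FP8) with SGLang on a single 8-GPU node.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-405B-Instruct-FP8",  # assumed FP8 repo id
    "--tp", "8",        # tensor parallelism across all 8 H100/H200 GPUs
    "--port", "30000",  # OpenAI-compatible API will listen here
])
```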

DeepSeek-V3.2 / V4 (SGLang)
VRAM required: ~700 GB (FP8)
Min. viable GPU: 8× H100 SXM
Optimal GPU: 8× H200
Fine-tuning: 8× H200

The MoE architecture keeps per-token compute and activation memory low during inference. FP8 quantization is strongly recommended, and SGLang provides the best throughput on this architecture.

Kimi K2.5 (vLLM)
VRAM required: ~640 GB (FP8)
Min. viable GPU: 8× H100 SXM
Optimal GPU: 8× H200
Fine-tuning: multi-node H200

MoE architecture, best served at FP8; sparse activation yields high throughput relative to total model size.

GPT-OSS 120B (vLLM)
VRAM required: ~240 GB (FP8)
Min. viable GPU: 4× H100 SXM
Optimal GPU: 2× H200
Fine-tuning: 4× H200

Fits on 2× H200 at FP8 with headroom for long context. Full-parameter fine-tuning requires 4× H200 minimum.
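
A minimal vLLM sketch for the 2× H200 configuration, assuming the `openai/gpt-oss-120b` Hugging Face repo id; tensor parallelism shards the weights across both GPUs.

```python
# Hedged sketch: offline batch inference for GPT-OSS 120B on 2× H200 with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # assumed Hugging Face repo id
    tensor_parallel_size=2,       # shard weights across both H200s
)
params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Summarize tensor parallelism in two sentences."], params)
print(outputs[0].outputs[0].text)
```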

DeepSeek-R1 (SGLang)
VRAM required: ~130 GB (FP8)
Min. viable GPU: 2× H100 SXM
Optimal GPU: H200 (single)
Fine-tuning: 2× H100 SXM

Fits on a single H200 at FP8. LoRA fine-tuning achievable on 2× H100. SGLang provides the best throughput for chain-of-thought workloads.
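
Once launched (as in the Llama 3.1 405B sketch above), SGLang exposes an OpenAI-compatible HTTP API, so any standard OpenAI client can query the pod. The URL below is a placeholder for your deployment.

```python
# Hedged sketch: query a running SGLang server via its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",  # replace with your pod's endpoint
    api_key="EMPTY",                       # local servers typically ignore the key
)
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Reason step by step: is 221 prime?"}],
)
print(resp.choices[0].message.content)
```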

QwQ-32B (vLLM)
VRAM required: ~64 GB (FP8)
Min. viable GPU: H100 SXM (single)
Optimal GPU: H200 (single)
Fine-tuning: H100 SXM (single)

Fits comfortably on a single H100 SXM at FP8. H200 provides extra headroom for long reasoning traces at full context.
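
On a single 80 GB H100, capping the context length keeps the KV cache inside the remaining VRAM. A sketch using vLLM's load-time FP8 weight quantization; the context cap and memory fraction are illustrative values, not tuned settings.

```python
# Hedged sketch: single-GPU QwQ-32B with runtime FP8 weight quantization in vLLM.
from vllm import LLM

llm = LLM(
    model="Qwen/QwQ-32B",
    quantization="fp8",           # quantize weights to FP8 at load time
    max_model_len=32768,          # cap context so the KV cache fits in 80 GB
    gpu_memory_utilization=0.92,  # leave a little headroom for activations
)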

Llama 4 Maverick (vLLM)
VRAM required: ~95 GB (FP8)
Min. viable GPU: 2× H100 SXM
Optimal GPU: H200 (single)
Fine-tuning: 2× H100 SXM

MoE architecture. Single H200 recommended for full-context inference. LoRA fine-tuning achievable on 2× H100.

Mixtral 8x22B (vLLM)
VRAM required: ~140 GB (FP8)
Min. viable GPU: 2× H100 SXM
Optimal GPU: 2× H100 SXM
Fine-tuning: 4× H100 SXM

MoE architecture with low active-parameter overhead. 2× H100 SXM handles inference efficiently. LoRA fine-tuning on 2× H100; full-parameter needs 4×.
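
For the LoRA path, a minimal Hugging Face PEFT sketch is shown below; the rank, alpha, and target modules are illustrative starting points rather than tuned values for Mixtral.

```python
# Hedged sketch: attach LoRA adapters to Mixtral 8x22B with Hugging Face PEFT.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard across the available H100s
)
config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total params
```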

DeepSeek-R1-Distill (vLLM)
VRAM required: ~16 GB (FP8, 7B variant)
Min. viable GPU: RTX 4090
Optimal GPU: L40S
Fine-tuning: RTX 4090 (single)

The 7B distilled variant fits on a single RTX 4090. An L40S offers better throughput for batch inference, and the 14B variant requires an L40S or H100.

Llama 4 Scout (vLLM)
VRAM required: ~50 GB (FP8)
Min. viable GPU: L40S
Optimal GPU: H100 SXM (single)
Fine-tuning: H100 SXM (single)

MoE architecture; active parameters are much lower than total. L40S handles standard inference well. H100 recommended for 10M-token context window workloads.

Gemma 3 (vLLM)
VRAM required: ~28 GB (BF16, 12B variant)
Min. viable GPU: RTX 4090
Optimal GPU: L40S
Fine-tuning: L40S

The 12B variant fits on a single L40S at BF16; an RTX 4090 handles the 4B and smaller variants. H100 recommended for the 27B variant at full precision.

GPT-OSS 20B (vLLM)
VRAM required: ~40 GB (BF16)
Min. viable GPU: L40S
Optimal GPU: H100 SXM (single)
Fine-tuning: L40S or H100

Fits on a single L40S at BF16 with some headroom. H100 offers significantly higher throughput and better support for long context.

Flux 2 Max (ComfyUI / Diffusers)
VRAM required: ~24 GB
Min. viable GPU: RTX 4090
Optimal GPU: L40S
Fine-tuning: RTX 4090 or L40S

Runs well on a single RTX 4090. L40S is the sweet spot for batch generation. H100 offers minimal additional benefit for diffusion workloads at standard resolutions.

Flux.1 Dev (Diffusers)
VRAM required: ~16–24 GB
Min. viable GPU: RTX 4090
Optimal GPU: L40S
Fine-tuning (LoRA): RTX 4090

The cost-effective choice for Flux.1 Dev inference. RTX 4090 handles single-image generation well; use L40S for high-throughput batch generation.
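
A minimal Diffusers sketch for single-image generation on a 24 GB card; FLUX.1-dev is a gated checkpoint, so an accepted license and Hugging Face token are assumed.

```python
# Hedged sketch: text-to-image with FLUX.1-dev via diffusers on an RTX 4090.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trades some speed for fitting in 24 GB VRAM
image = pipe(
    "a watercolor fox in a snowy forest",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("fox.png")
```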

GPU Comparison Directory

Side-by-side specs and benchmarks for every GPU Runpod offers.


