

Compare every GPU on Runpod by tokens/sec, VRAM, and cost for your specific model, task, and cluster config.
| GPU | VRAM | Tokens/sec | Max Context | VRAM Fit | Rec. Engine | Price ($/hr) | Availability |
|---|---|---|---|---|---|---|---|
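
The VRAM Fit column reduces to simple arithmetic: weights take roughly one byte per parameter at FP8 (two at BF16), and the KV cache grows with context length and batch size. A minimal sketch of that check, with illustrative numbers rather than measured values:

```python
def fits_in_vram(params_b: float, bytes_per_param: float,
                 kv_cache_gb: float, vram_gb: float,
                 overhead: float = 1.2) -> bool:
    """Rough fit check: weights + KV cache + ~20% runtime overhead vs. card VRAM."""
    weights_gb = params_b * bytes_per_param  # e.g. 70B params at FP8 ~= 70 GB
    return (weights_gb + kv_cache_gb) * overhead <= vram_gb

# A 70B model at FP8 with a 20 GB KV-cache budget fits one H200 (141 GB)...
print(fits_in_vram(params_b=70, bytes_per_param=1, kv_cache_gb=20, vram_gb=141))  # True
# ...but not at BF16, which is why BF16 pushes you to more cards.
print(fits_in_vram(params_b=70, bytes_per_param=2, kv_cache_gb=20, vram_gb=141))  # False
```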
Find your model below to see the ideal Runpod deployment for your specific workload.
Runs comfortably at FP8 on an 8× H100 SXM node. Requires 8× H200 for BF16. LoRA fine-tuning feasible on 8× H100; full-parameter requires multi-node.
MoE architecture reduces active VRAM usage during inference. FP8 quantization is strongly recommended. SGLang provides the best throughput on this architecture.
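
As a concrete sketch of that setup, here is SGLang's offline Engine API serving an FP8 checkpoint across an 8-GPU node. The model path is a placeholder, and the kwargs follow current SGLang releases, so verify them against your installed version:

```python
import sglang as sgl

# Placeholder checkpoint id; substitute the FP8 MoE model you are deploying.
llm = sgl.Engine(
    model_path="org/moe-model-fp8",  # hypothetical repo id
    tp_size=8,                       # tensor parallel across the 8x H100 SXM node
    quantization="fp8",              # usually auto-detected for pre-quantized weights
)

outputs = llm.generate(
    ["Explain why MoE models activate only a fraction of their parameters."],
    {"temperature": 0.6, "max_new_tokens": 128},
)
print(outputs[0]["text"])
```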
Fits on a single H200 at FP8. LoRA fine-tuning achievable on 2× H100. SGLang provides the best throughput for chain-of-thought workloads.
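
If you take the LoRA route on 2× H100, a minimal PEFT setup looks like the sketch below; the checkpoint id and target modules are assumptions to adjust for your model:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint id; swap in the model you are fine-tuning.
base = AutoModelForCausalLM.from_pretrained(
    "org/reasoning-model",      # hypothetical repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",          # shards the base weights across both H100s
)

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # adapters are typically well under 1% of total params
```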
Fits comfortably on a single H100 SXM at FP8. H200 provides extra headroom for long reasoning traces at full context.
7B distilled variant. Fits on a single RTX 4090. L40S offers better throughput for batch inference. 14B variant requires L40S or H100.
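
For the 7B variant on a single 24 GB card, a minimal vLLM sketch (the model id is a placeholder; capping context length keeps the KV cache inside the 4090's VRAM):

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint id; any ~7B model at FP16 (~14 GB of weights) fits a 24 GB RTX 4090.
llm = LLM(model="org/distilled-7b", max_model_len=8192)  # bound context to bound KV cache

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Why does batching raise GPU throughput?"], params)
print(outputs[0].outputs[0].text)
```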
MoE architecture; active parameters are much lower than total. L40S handles standard inference well. H100 recommended for 10M-token context window workloads.
Runs well on a single RTX 4090. L40S is the sweet spot for batch generation. H100 offers minimal additional benefit for diffusion workloads at standard resolutions.
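
A minimal diffusers sketch of that single-4090 setup; the pipeline id is a placeholder for whichever image model you deploy:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder pipeline id; substitute the diffusion model you are serving.
pipe = DiffusionPipeline.from_pretrained(
    "org/image-model",          # hypothetical repo id
    torch_dtype=torch.float16,  # FP16 keeps weights well inside the 4090's 24 GB
).to("cuda")

image = pipe("a GPU server rack lit in violet, photorealistic").images[0]
image.save("out.png")
```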
Side-by-side specs and benchmarks for every GPU Runpod offers.
The most cost-effective platform for building, training, and scaling machine learning models—ready when you are.