The Nvidia H200 is the memory-upgraded evolution of the H100, built on the same Hopper architecture but with a significantly expanded memory subsystem. Where the H100 tops out at 80 GB of HBM3, the H200 jumps to 141 GB of HBM3e with 4.8 TB/s of bandwidth. It is designed specifically for workloads where the H100's 80 GB VRAM is the limiting factor: frontier-scale language models, long-context inference, and large-batch training at the highest parameter counts.
This guide covers the full H200 specs, VRAM capacity and bandwidth, SXM vs NVL variant differences, current pricing, how it compares to the H100 and A100, and how to access H200 instances on-demand through Runpod.
How Much VRAM Does the H200 Have?
The Nvidia H200 has 141 GB of HBM3e memory with approximately 4.8 TB/s of memory bandwidth. This is 76% more VRAM and 43% more bandwidth than the H100 SXM5, which has 80 GB of HBM3 at 3.35 TB/s.
For context across the current GPU landscape:
- H200 SXM: 141 GB HBM3e, ~4.8 TB/s bandwidth
- H100 SXM5: 80 GB HBM3, ~3.35 TB/s bandwidth
- A100 80GB: 80 GB HBM2e, ~2 TB/s bandwidth
- RTX 5090: 32 GB GDDR7, ~1.79 TB/s bandwidth
- RTX 4090: 24 GB GDDR6X, ~1 TB/s bandwidth
The 141 GB capacity is the H200's defining feature. It enables loading and serving models whose weight footprints exceed the H100's 80 GB limit and would otherwise require quantization or multi-GPU parallelism, and it reduces the GPU count needed for models in the 100B-180B parameter range. For long-context inference, where KV cache size scales linearly with sequence length, the extra memory directly extends the maximum context window a single GPU can handle without offloading.
For workloads that fit within 80 GB, the H100 and H200 perform comparably on compute-bound tasks. The H200's advantage is specifically in memory-bound scenarios: large-batch inference, long context windows, and models that would otherwise require quantization or multi-GPU tensor parallelism on the H100.
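As a rough illustration of how these numbers interact, the sketch below estimates total serving memory for a 70B-class model with FP8 weights, a 32K context, and a small batch. The layer count, head configuration, and dtype choices are illustrative assumptions (loosely modeled on a grouped-query-attention 70B architecture), not measurements; substitute your model's actual config.

```python
# Back-of-envelope VRAM sizing: FP8 weights plus a BF16 KV cache.
# All model numbers below are illustrative assumptions for a
# 70B-class model with grouped-query attention, not measured values.
def kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                seq_len=32_768, batch=4, dtype_bytes=2):
    # 2x accounts for the separate K and V tensors per layer
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes / 1e9

weights_gb = 70e9 * 1 / 1e9          # 70B params at FP8 = 1 byte/param
total_gb = weights_gb + kv_cache_gb()
print(f"weights {weights_gb:.0f} GB + KV cache {kv_cache_gb():.1f} GB "
      f"= {total_gb:.1f} GB")
print("fits H100 (80 GB): ", total_gb <= 80)    # False
print("fits H200 (141 GB):", total_gb <= 141)   # True
```

At these assumed settings the workload lands around 113 GB: over the H100's ceiling, comfortably within the H200's.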
Nvidia H200 Specs
The H200 uses the same GH100 die as the H100, meaning the core compute architecture is identical. The upgrade is entirely in the memory subsystem: six stacks of HBM3e totaling 141 GB usable (vs. five active 16 GB HBM3 stacks on the H100), delivering both the capacity and bandwidth increases.
H200 SXM (High-Performance Variant)
- CUDA Cores: 16,896 (identical to H100 SXM5)
- Tensor Cores: 528 (4th generation with FP8 Transformer Engine)
- VRAM: 141 GB HBM3e
- Memory Bandwidth: 4.8 TB/s
- FP8 Throughput: 3,958 TFLOPS with sparsity (identical to H100 SXM5)
- TF32 Throughput: 989 TFLOPS (with sparsity)
- FP64 Throughput: 34 TFLOPS
- TDP: 700W
- NVLink: NVLink 4.0, 900 GB/s bidirectional bandwidth per GPU
- MIG: Up to 7 instances
- Form Factor: SXM5 (requires DGX H200 or HGX H200 server platform)
H200 NVL (Standard Variant)
- CUDA Cores: 16,896
- VRAM: 141 GB HBM3e
- Memory Bandwidth: 4.8 TB/s
- TDP: Up to 600W (configurable), lower than SXM and air-cooling compatible
- NVLink: 2-4 way NVLink bridges (not full NVLink 4.0 fabric)
- Form Factor: Dual-slot PCIe card that fits in standard server slots with appropriate power delivery
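If you want to confirm which variant you are actually running on and how much usable memory it reports, a quick query through the nvidia-ml-py (pynvml) bindings works on any of these cards. This is a minimal sketch, not Runpod-specific tooling; older pynvml releases return the device name as bytes rather than str.

```python
# Quick sanity check on a rented instance: confirm the GPU model and
# report usable VRAM via the nvidia-ml-py (pynvml) bindings.
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetName, nvmlDeviceGetMemoryInfo)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
name = nvmlDeviceGetName(handle)          # e.g. "NVIDIA H200"
mem = nvmlDeviceGetMemoryInfo(handle)
print(f"{name}: {mem.total / 1e9:.0f} GB total VRAM")
nvmlShutdown()
```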
H200 SXM vs NVL: Which Should You Use?
Unlike the H100, which ships in SXM and PCIe variants, the H200 comes in SXM and NVL form factors. The distinctions matter for deployment planning:
- Performance: The SXM variant offers approximately 18% higher throughput than NVL due to its superior memory interconnect and thermal headroom at 700W TDP.
- NVLink: SXM uses point-to-point NVLink 4.0 with 900 GB/s bidirectional bandwidth per GPU, enabling efficient large-scale multi-GPU training. NVL uses 2-4 way NVLink bridges, still capable of multi-GPU scaling but at lower inter-GPU bandwidth.
- Infrastructure requirements: SXM requires liquid cooling and a specialized server platform (DGX H200 or HGX H200). NVL is air-cooling compatible and fits in more standard server configurations.
- Cost and availability: SXM variants are primarily available in complete DGX H200 systems. NVL offers a more accessible entry point for organizations that need H200 VRAM capacity without the full SXM infrastructure commitment.
For frontier-scale distributed training, SXM is the correct choice. For inference serving that needs 141 GB VRAM but not maximum multi-GPU throughput, NVL is more practical to deploy.
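One way to see the interconnect difference for yourself is a simple device-to-device copy benchmark. The sketch below uses plain PyTorch and needs at least two visible GPUs; it measures a single copy path rather than full NVLink fabric bandwidth, so treat the output as a rough indicator, and expect results to depend heavily on topology.

```python
# Minimal sketch: time device-to-device copies between two GPUs.
# Over NVLink (SXM) this should land well above PCIe-only peers.
import time
import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"

src = torch.empty(2 * 1024**3, dtype=torch.float16, device="cuda:0")  # 4 GiB
dst = torch.empty_like(src, device="cuda:1")

dst.copy_(src)                  # warm-up copy
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)

t0 = time.perf_counter()
for _ in range(10):
    dst.copy_(src)
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
elapsed = time.perf_counter() - t0

gib_moved = 10 * src.numel() * src.element_size() / 1024**3
print(f"~{gib_moved / elapsed:.0f} GiB/s device-to-device")
```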
H200 AI Performance
Because the H200 shares the same GH100 die and Transformer Engine as the H100, compute-bound performance is essentially identical between the two. The H200's performance advantages are specifically memory-driven:
- Long-context inference: For LLMs with extended context windows, KV cache grows linearly with sequence length. The H200's 141 GB allows significantly longer contexts on a single GPU before requiring multi-GPU tensor parallelism. Testing with large language models shows up to 3.4x throughput improvement over H100 for long-context workloads where the H100 runs out of memory.
- Large-batch inference: With more VRAM available for activations and batch data, the H200 sustains larger batch sizes before hitting memory limits. BF16 precision workloads show approximately 47% throughput improvement over H100 at batch sizes that exceed the H100's capacity.
- Standard inference (models under 80 GB): For models that fit comfortably within the H100's 80 GB, the H200 offers minimal improvement, typically 0-11%. The bandwidth increase helps at the margin but compute throughput is identical.
- Training at 100B+ parameters: Models in the 100-180B parameter range that require tensor parallelism across multiple H100s can sometimes fit on fewer H200s, reducing inter-GPU communication overhead and improving training efficiency.
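The last point reduces to simple arithmetic. The sketch below counts the minimum GPUs needed to shard BF16 weights alone; real training also needs gradients, optimizer states, and activations, which multiply the footprint several times, but the H100-vs-H200 ratio illustrates the point. The 90% usable-VRAM factor is an assumption, not a measured figure.

```python
# Back-of-envelope: minimum GPU count to hold BF16 weights alone.
# Training overheads (gradients, optimizer states, activations) are
# deliberately excluded; only the relative H100/H200 counts matter here.
import math

def min_gpus(params_b: float, vram_gb: float,
             bytes_per_param: int = 2, usable: float = 0.9) -> int:
    """GPUs needed to shard `params_b` billion params across VRAM."""
    weight_gb = params_b * bytes_per_param  # 1e9 params * bytes / 1e9
    return math.ceil(weight_gb / (vram_gb * usable))

for params in (70, 120, 180):
    print(f"{params}B BF16 weights: "
          f"{min_gpus(params, 80)}x H100 vs {min_gpus(params, 141)}x H200")
```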
H200 Price: What Does It Cost?
The H200 carries a meaningful premium over the H100, reflecting both the advanced HBM3e memory and limited production volumes. Indicative pricing:
- H200 NVL (standalone): Approximately $35,000-$45,000 per card through authorized resellers
- DGX H200 (8x SXM H200): Approximately $350,000-$500,000 for complete systems
- HGX H200 baseboard: Varies by OEM and configuration
Lead times through enterprise channels have historically been 6-12 months, though availability has improved as production has ramped. Cloud access remains the fastest and most cost-effective path for most organizations that need H200 capabilities without the hardware investment.
H200 vs H100: How Do They Compare?
The H200 is an upgrade within the Hopper generation, not a new architecture. The comparison is straightforward:
- VRAM: 141 GB (H200) vs 80 GB (H100). The 76% increase is the primary differentiator.
- Memory bandwidth: 4.8 TB/s (H200) vs 3.35 TB/s (H100 SXM5), a 43% improvement that benefits memory-bound workloads.
- Compute throughput: Identical. Same GH100 die, same Tensor Core generation, same FP8 Transformer Engine.
- NVLink: Identical in SXM configuration, 900 GB/s bidirectional per GPU.
- MIG: Identical, up to 7 instances per GPU.
- Cost: H200 commands a 30-40% premium over H100 pricing.
The H100 remains the better value for any workload that fits within 80 GB. The H200 is justified specifically when 80 GB is the constraint, either through model size, context length, or batch size requirements.
H200 vs A100: How Do They Compare?
The A100 is still widely deployed and available at lower cost. For teams evaluating the full range of Nvidia data-center GPUs:
- VRAM: H200: 141 GB. A100: up to 80 GB (also available in 40 GB variant).
- Memory bandwidth: H200: 4.8 TB/s. A100 80GB: ~2 TB/s. The H200 has more than double the bandwidth.
- Compute throughput: The H200 (like the H100) delivers roughly 3-4x the AI training throughput of the A100 via the FP8 Transformer Engine; the A100 does not support FP8 (see the sketch at the end of this section).
- Multi-GPU: Both support NVLink in their SXM configurations, though the H200 (like the H100) uses NVLink 4.0 at 900 GB/s vs the A100's NVLink 3.0 at 600 GB/s.
- Cost: A100 PCIe has declined to $10,000-$15,000 on the secondary market, making it significantly more accessible for workloads within its 80 GB VRAM ceiling.
The A100 remains a strong choice for inference workloads that don't need FP8 throughput and fit within 80 GB. The H200 is the right choice when 141 GB VRAM and maximum memory bandwidth are required.
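To make the FP8 point concrete, Hopper's Transformer Engine is exposed through Nvidia's transformer-engine library; the sketch below shows the basic FP8 autocast pattern. It assumes the transformer-engine package is installed and only runs on FP8-capable hardware (H100/H200), which is precisely the capability the A100 lacks. Recipe settings are one reasonable default, not a tuned configuration.

```python
# Minimal FP8 forward pass with Nvidia Transformer Engine (Hopper+).
# Assumes `pip install transformer-engine`; will not run on an A100,
# which has no FP8 Tensor Cores.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for forward activations/weights, E5M2 for gradients
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the matmul executes on FP8 Tensor Cores
print(y.shape)
```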
Rent an H200 on Runpod
Runpod provides on-demand access to H200 GPU instances without hardware procurement, enterprise contracting, or 6-12 month lead times. You get full root access to a containerized GPU environment and can be running within minutes of signing up.
- On-demand H200 instances: Hourly billing with no upfront commitment, giving immediate access to 141 GB VRAM without hardware procurement
- Serverless endpoints: Deploy models on H200 workers and pay per inference request with no idle GPU cost (see the request sketch after this list)
- Network Volumes: Attach persistent storage to keep model weights, datasets, and checkpoints available across sessions without re-downloading. Standard storage starts at $0.07/GB/mo (first TB, then $0.05/GB/mo), with high-performance storage at $0.14/GB/mo for maximum throughput on large training pipelines. Volumes can be shared across multiple H200 instances simultaneously.
- Templates: Pre-built environments for PyTorch, vLLM, TGI, Jupyter, and other frameworks with no manual CUDA setup required
- Flexibility: Switch between H200, H100, A100, and consumer GPU instances based on workload and budget requirements
- Immediate availability: Access H200 instances today without waiting for enterprise procurement cycles
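As an example of the serverless flow mentioned above, the sketch below sends a synchronous request to a deployed endpoint through Runpod's /runsync route. The endpoint ID and input payload are placeholders; your worker's handler defines the actual input schema, so check your endpoint's docs for the exact fields.

```python
# Hedged sketch: call a deployed Runpod serverless endpoint.
# ENDPOINT_ID is a placeholder; export RUNPOD_API_KEY before running.
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder, not a real endpoint
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}
payload = {"input": {"prompt": "Explain HBM3e in one sentence."}}

resp = requests.post(url, json=payload, headers=headers, timeout=120)
resp.raise_for_status()
print(resp.json())
```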
See current H200 availability and pricing on the Runpod pricing page.
H200 FAQs
How much VRAM does the H200 have?
The Nvidia H200 has 141 GB of HBM3e memory with approximately 4.8 TB/s of memory bandwidth. This is 76% more VRAM and 43% more bandwidth than the H100 SXM, which has 80 GB of HBM3 at 3.35 TB/s.
What is the Nvidia H200?
The Nvidia H200 is a data-center GPU built on the same Hopper architecture (GH100 die) as the H100, upgraded with 141 GB of HBM3e memory and 4.8 TB/s of bandwidth. It is designed for memory-intensive AI workloads including large language model inference, long-context processing, and large-batch training at parameter counts that exceed the H100's 80 GB VRAM capacity.
What are the Nvidia H200 specs?
The H200 SXM features 16,896 CUDA cores, 528 4th-generation Tensor Cores with FP8 Transformer Engine, 141 GB HBM3e VRAM, 4.8 TB/s memory bandwidth, 3,958 TFLOPS of FP8 throughput (with sparsity), 700W TDP, and NVLink 4.0 with 900 GB/s bidirectional bandwidth per GPU. It supports MIG with up to 7 instances.
H200 vs H100: what is the difference?
The H200 and H100 use the same GH100 die, so compute throughput (TFLOPS) is identical. The H200 upgrades the memory subsystem to 141 GB HBM3e at 4.8 TB/s, vs the H100's 80 GB HBM3 at 3.35 TB/s. This 76% VRAM increase and 43% bandwidth improvement benefits memory-bound workloads significantly, while compute-bound tasks see minimal difference. The H200 carries a 30-40% price premium over the H100.
How much does the H200 cost?
H200 NVL cards sell for approximately $35,000-$45,000 through authorized resellers. Complete DGX H200 systems (8x SXM H200) are priced at approximately $350,000-$500,000. On Runpod, you can access H200 instances on-demand at hourly rates without the upfront hardware cost.
Is the H200 better than the H100 for LLM inference?
For LLMs that fit within 80 GB, the H200 and H100 perform similarly on inference throughput since compute is identical. The H200's advantage is for models between 80-141 GB (which require quantization or multi-GPU parallelism on H100) and for long-context inference where KV cache size is the bottleneck. At long contexts, the H200 can show up to 3.4x better throughput than the H100 on memory-bound workloads.
What is H200 NVL?
The H200 NVL is the non-SXM form factor of the H200, designed for deployment in standard server configurations with air-cooling compatibility. It has the same 141 GB HBM3e VRAM and 4.8 TB/s bandwidth as the SXM variant but uses 2-4 way NVLink bridges rather than the full NVLink 4.0 point-to-point fabric, resulting in approximately 18% lower throughput than SXM in multi-GPU configurations. It is more accessible for organizations that cannot support SXM infrastructure requirements.