Emmett Fear

Training LLMs on H100 PCIe GPUs in the Cloud: Setup and Optimization

As large language models (LLMs) continue to grow in complexity and parameter count, the demand for high-performance GPUs has never been greater. NVIDIA's H100 GPUs, built on the Hopper architecture, offer cutting-edge capabilities for training LLMs. While the SXM version of H100 delivers the highest throughput, the H100 PCIe variant offers a more accessible and scalable alternative — especially in cloud-based environments.

In this article, we’ll walk through how to set up a robust training environment using H100 PCIe GPUs on Runpod, leverage frameworks like DeepSpeed and Fully Sharded Data Parallel (FSDP), and explore tips to optimize performance. Whether you're fine-tuning a 7B model or tackling a full-scale pretraining run, this guide will help you take full advantage of H100 PCIe GPUs in the cloud.

Why Choose H100 PCIe GPUs?

The H100 PCIe GPUs provide a cost-effective, high-performance option for LLM training. While they offer lower memory and GPU-to-GPU interconnect bandwidth than their SXM counterparts, they retain the core architectural innovations of Hopper, including the Transformer Engine, NVLink bridge support between GPU pairs (where available), and FP8 precision for faster matrix multiplications.

Key advantages:

  • Widely available in cloud environments
  • Lower upfront cost compared to SXM configurations
  • Excellent compatibility with popular AI frameworks

You can deploy H100 PCIe GPUs instantly on Runpod and start training without managing physical infrastructure.

Step 1: Setting Up Your Environment on Runpod

Setting up your training stack on Runpod is straightforward. Here’s how to get started:

1. Launch an H100 PCIe Pod

  1. Log in to your Runpod account. Don’t have one? Sign up here — it’s free.
  2. Navigate to the Cloud tab and select H100 PCIe under GPU options.
  3. Choose your preferred region and storage size.
  4. Select a base image or bring your own Docker container.

2. Use a Pre-Built Docker Image

For most LLM training workflows, we recommend starting with a Docker image that includes:

  • Python 3.10+
  • PyTorch 2.x with CUDA 12 support
  • NVIDIA Apex (optional, for mixed precision)
  • DeepSpeed and/or FSDP
  • Tokenizers and Hugging Face Transformers
  • NCCL and UCX libraries for multi-GPU communication

If you're using Hugging Face Transformers and DeepSpeed, consider using this Dockerfile as a base:

FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
RUN apt update && apt install -y git build-essential python3 python3-pip
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install deepspeed transformers datasets accelerate

Alternatively, you can select one of Runpod’s pre-configured containers optimized for LLM training. Just head to the Templates section in your dashboard.

Step 2: Optimizing with DeepSpeed and FSDP

DeepSpeed

DeepSpeed provides efficient memory and compute optimization for large models. It offers the Zero Redundancy Optimizer (ZeRO), CPU/NVMe offloading, and mixed precision training. While H100 PCIe GPUs offer strong performance, DeepSpeed helps minimize memory bottlenecks and maximize throughput.

Best practices for DeepSpeed on H100 PCIe:

  • Use ZeRO-3 for large models (>6B parameters).
  • Enable CPU or NVMe offloading (ZeRO-Infinity) for parameters and optimizer states when GPU memory is tight.
  • Enable FP8 training (via NVIDIA's Transformer Engine) when your model and stack support it.

Launch training with the DeepSpeed launcher:

deepspeed train.py --deepspeed ds_config.json

Sample ds_config.json snippet:

{ "train_batch_size": 32, "zero_optimization": { "stage": 3, "offload_param": { "device": "cpu" } }, "fp16": { "enabled": true }}

FSDP (Fully Sharded Data Parallel)

FSDP is a PyTorch-native alternative that performs model sharding across GPUs. It is well-suited for multi-GPU training on H100 PCIe and integrates cleanly with Hugging Face's Trainer.

Tips for FSDP:

  • Use auto_wrap_policy to wrap large submodules (such as transformer blocks) individually rather than the whole model as one unit.
  • Mixed precision (BF16) and, where supported, FP8 can reduce memory footprint and improve throughput.
  • Launch with torchrun (the successor to the deprecated torch.distributed.launch) for multi-GPU scaling.

Example:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(model, auto_wrap_policy=policy)
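
Building on the snippet above, here is a more complete, hedged sketch of FSDP with a transformer-aware auto-wrap policy, meant to run under torchrun. The gpt2 checkpoint and its GPT2Block layer class are stand-ins; substitute your own model and its decoder-block class.

import functools

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

# Each process is launched by torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint

# Shard at the transformer-block level; swap GPT2Block for your model's layer class.
policy = functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={GPT2Block})

model = FSDP(
    model,
    auto_wrap_policy=policy,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)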

Performance Tips: PCIe vs. SXM

While PCIe models of H100 GPUs offer tremendous value, their interconnect bandwidth is limited compared to SXM (NVLink) models. This can influence training performance, especially across multiple GPUs.

Feature | H100 PCIe | H100 SXM
GPU-GPU Bandwidth | ~64 GB/s | ~900 GB/s (NVLink)
Max Power (TDP) | 300W | 700W
NVLink Support | Optional (via bridge) | Native
Availability | High (cloud) | Limited

Optimization Tips for PCIe:

  • Favor data parallelism over tensor parallelism to reduce communication overhead.
  • Use gradient accumulation to simulate larger batch sizes while synchronizing gradients less often (see the sketch after this list).
  • Profile GPU utilization with nvidia-smi and the PyTorch profiler.
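
To make the gradient accumulation tip concrete, here is a minimal, self-contained sketch; the toy linear model, micro-batch size of 4, and 8 accumulation steps are arbitrary stand-ins for a real training loop.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins so the loop runs as-is; replace with your LLM, optimizer, and dataset.
model = torch.nn.Linear(16, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()
dataloader = DataLoader(TensorDataset(torch.randn(256, 16), torch.randn(256, 1)), batch_size=4)

accumulation_steps = 8  # effective batch size = 4 * 8 = 32, with fewer optimizer steps
optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    # Scale the loss so accumulated gradients match one large-batch step.
    loss = loss_fn(model(x.cuda()), y.cuda()) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()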

Checkpointing and Storage

Efficient checkpointing is vital, especially with large models. On Runpod, you can mount persistent storage volumes or use object storage (e.g., AWS S3) for remote checkpoints.

Best Practices:

  • Save checkpoints in FP16 or BF16 rather than FP32 to reduce disk usage.
  • Store logs and checkpoints in a /workspace/checkpoints directory mounted to persistent storage.
  • Use accelerate or DeepSpeed’s built-in checkpoint manager for seamless saving and resuming (sketched below).
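
As a rough sketch of that flow with DeepSpeed (assuming engine is the DeepSpeedEngine returned by deepspeed.initialize and global_step is your training step counter):

CHECKPOINT_DIR = "/workspace/checkpoints"  # persistent volume mount on Runpod

# Each rank writes its shard of model/optimizer state under a tagged subdirectory.
engine.save_checkpoint(CHECKPOINT_DIR, tag=f"step_{global_step}")

# On restart, tag=None resumes from whatever the "latest" marker points to.
load_path, client_state = engine.load_checkpoint(CHECKPOINT_DIR, tag=None)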

Sample Deployment Guide on Runpod

Here’s a step-by-step guide to deploy and train a 7B LLM using H100 PCIe on Runpod:

  1. Create a Pod with H100 PCIe GPU and at least 200GB storage.
  2. Select Docker Image: Use a pre-built image with DeepSpeed and Transformers.
  3. Clone Your Repo:
     git clone https://github.com/your-org/llm-training.git
     cd llm-training
  4. Install Dependencies:
     pip install -r requirements.txt
  5. Configure DeepSpeed or FSDP.
  6. Run Training:
     deepspeed train.py --deepspeed ds_config.json

Want to try it yourself? Launch your training Pod now and get started in minutes.

FAQs

Q: What are the bandwidth tradeoffs of PCIe vs. SXM?

PCIe has lower GPU-to-GPU bandwidth, which can affect communication-heavy tasks like tensor parallelism. To mitigate this, use data parallelism and gradient accumulation. For more on GPU architectures, see NVIDIA's Hopper overview.

Q: Will PCIe bottlenecks limit my model size?

Not necessarily. With DeepSpeed's ZeRO optimizations and FSDP, you can train massive models by partitioning memory and optimizer states. PCIe may slow down inter-GPU communication, but it won't limit model size if properly parallelized.

Q: Can I scale to multiple GPUs or nodes on Runpod?

Yes! Runpod supports multi-GPU and multi-node training. Use torchrun or the deepspeed launcher to coordinate across GPUs. Make sure NCCL is configured correctly and set --nproc_per_node to the number of GPUs on each node.
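
Before a long run, it can be worth confirming that NCCL communication works across your GPUs. The following is a small, hedged sanity check (the nccl_check.py filename is arbitrary):

# nccl_check.py -- launch with: torchrun --nproc_per_node=2 nccl_check.py
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Every rank contributes 1.0; after all_reduce each rank should see the world size.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}: all_reduce sum = {t.item()} (expected {dist.get_world_size()})")
dist.destroy_process_group()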

Get Started with Runpod Today

Training LLMs at scale doesn’t have to be complicated or costly. With Runpod, you can access high-performance H100 PCIe GPUs in the cloud, fully optimized for frameworks like DeepSpeed and FSDP. Whether you're experimenting with a new architecture or scaling production training, Runpod provides the flexibility and power you need.

Sign up for a Runpod account

🚀 Launch your H100 PCIe Pod now

📚 Check out our documentation for more setup guides and tutorials.

Maximize your LLM training potential — all in the cloud, all on your terms.
