Rent NVIDIA A100 GPUs from $1.39/hr

High-performance data center GPU based on Ampere architecture with up to 80 GB HBM2e memory and third-generation Tensor Cores, purpose-built for large-scale AI training, LLM fine-tuning, and multi-tenant inference workloads.

A100

Powering the next generation of AI & high-performance computing.

Engineered for large-scale AI training, deep learning, and high-performance workloads, delivering unprecedented compute power and efficiency.

NVIDIA Ampere Architecture

Advanced AI acceleration with third-generation Tensor Cores delivering up to 312 TFLOPS of TF32 performance with structured sparsity (156 TFLOPS dense).

Third-Generation Tensor Cores

Enhanced AI acceleration with 432 Tensor Cores supporting TF32, BF16, FP16, and INT8 precision modes.

80 GB HBM2e Memory

High-capacity memory with up to 2 TB/s bandwidth enables training and inference on large-scale AI models without CPU offloading.
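
As a rough illustration of why memory bandwidth matters for inference: when a model generates one token at a time, the GPU has to read every weight once per token, so bandwidth divided by model size gives an upper bound on single-stream decode speed. The sketch below assumes the 2 TB/s peak figure quoted above and ignores KV-cache reads, batching, and kernel overhead. The function name and the 7B example model are illustrative assumptions, not measured results.

```python
def max_decode_tokens_per_sec(params_billion: float,
                              bytes_per_param: float = 2.0,
                              bandwidth_tb_s: float = 2.0) -> float:
    """Upper bound on single-stream tokens/s for a bandwidth-bound decoder.

    Assumes every weight is read exactly once per generated token.
    """
    model_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_sec = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_sec / model_bytes


if __name__ == "__main__":
    # Hypothetical 7B-parameter model stored in FP16 (2 bytes per weight).
    print(f"{max_decode_tokens_per_sec(7):.0f} tokens/s upper bound")
```

Real throughput will be lower than this ceiling for a single stream and can be much higher in aggregate once requests are batched.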

Multi-Instance GPU (MIG)

Partition a single A100 into up to 7 isolated compute instances, ideal for multi-tenant inference and cost-efficient serving of multiple models.

Why rent the A100 instead of buying?

A proven workhorse for AI training and inference

The A100's Ampere architecture delivers up to 312 TFLOPS of TF32 Tensor Core throughput with structured sparsity, significantly faster than its predecessor, the V100. With MIG support for up to 7 isolated instances, it handles multi-tenant inference at scale and large-model training runs equally well. The 80 GB variant holds the weights of models up to roughly 30B parameters in FP16 on a single GPU without CPU offloading; larger models can be quantized or split across multiple GPUs.
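
To check whether a given model's weights fit on a single card, a back-of-the-envelope estimate is parameters times bytes per parameter. The sketch below is a simplified check under stated assumptions: it counts weights only (no activations, KV cache, or optimizer state), treats 1 GB as 10^9 bytes, and reserves an arbitrary 10% headroom. The helper names are hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weights_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate weight size in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits_on_gpu(params_billion: float, dtype: str = "fp16",
                gpu_gb: float = 80, headroom: float = 0.9) -> bool:
    """True if the weights alone fit within `headroom` of GPU memory.

    The 0.9 headroom factor is an assumption, not a vendor figure.
    """
    return weights_gb(params_billion, dtype) <= gpu_gb * headroom


if __name__ == "__main__":
    for size in (13, 30, 70):
        print(size, "B params, fp16:", weights_gb(size), "GB, fits:", fits_on_gpu(size))
```

Training needs considerably more memory than this estimate, since gradients and optimizer state are stored alongside the weights.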

Pay only for what you use

A100 hardware costs $10,000-$20,000 per card depending on variant and availability. Runpod's on-demand pricing gives you access to the same hardware without a capital commitment. There’s no upfront cost, no maintenance, no idle hardware.
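
For a sense of scale, the break-even point between buying and renting can be estimated by dividing the card price by the hourly rate. The sketch below uses the $10,000-$20,000 price range above and the $1.39/hr rate from this page; it deliberately ignores power, hosting, maintenance, and resale value, so treat it as a lower bound on how long a purchased card must stay busy.

```python
def break_even_hours(card_price_usd: float, hourly_rate_usd: float) -> float:
    """GPU-hours of rental that cost the same as buying the card outright."""
    return card_price_usd / hourly_rate_usd


if __name__ == "__main__":
    for price in (10_000, 20_000):
        hours = break_even_hours(price, 1.39)
        months = hours / 24 / 30  # continuous 24/7 use, 30-day months
        print(f"${price:,}: {hours:,.0f} GPU-hours (~{months:.1f} months of 24/7 use)")
```

Under these assumptions the break-even lands at roughly 7,200 to 14,400 GPU-hours, or about 10 to 20 months of round-the-clock use.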

Deploy in seconds, scale without limits

Provision an A100 instance in seconds. Scale from a single GPU for development and fine-tuning to a multi-GPU cluster for full pre-training runs. When the job is done, scale back down or switch GPUs entirely. Runpod handles the infrastructure so you stay focused on your model.

Key specs at a glance.

Performance benchmarks that push AI, ML, and HPC workloads further.

Memory Bandwidth

2.0

TB/s

FP16 Tensor Performance (with sparsity)

624

TFLOPS

NVLink Bandwidth

600

GB/s

Popular use cases.

Designed for demanding workloads. Learn if this GPU fits your needs.

Inference

Serve inference for image, text, and audio generation at any scale.

Fine-tuning

Train custom models on your specific datasets.

Agents

Build intelligent agent-based systems and workflows.

Compute-heavy tasks

Run compute-heavy workloads like rendering and simulations.

Ready for your most demanding workloads.

Essential technical specifications to help you choose the right GPU for your workload.

| Specification | Details | Great for... |
| --- | --- | --- |
| Memory Bandwidth | 2.0 TB/s | Large model inference, LLM serving, high-throughput batch inference |
| FP16 Tensor Performance | 624 TFLOPS (with sparsity) | Model training, fine-tuning, compute-intensive deep learning workloads |
| NVLink Bandwidth | 600 GB/s | Multi-GPU distributed training, large model parallelism (70B+ models), tensor parallelism across GPUs |
| Architecture | NVIDIA Ampere (GA100) | Large-scale AI training, inference, and HPC workloads requiring MIG support and NVLink scalability |
| Manufacturing Process | 7nm, 54.2 billion transistors | |
| CUDA Cores | 6,912 | Parallel workloads including large-batch inference, scientific simulation, and model training |
| Tensor Cores | 432 (3rd generation) | Mixed-precision training with TF32, BF16, FP16, and INT8 support for LLMs and vision models |
| GPU Memory | 40 GB HBM2 or 80 GB HBM2e | The 80 GB variant handles large-scale NLP and scientific simulations; 40 GB is cost-effective for standard training and inference |
| TDP | PCIe: 250–300W / SXM4: 400W | Predictable power budgeting across both form factors |
| FP64 Performance | 9.7 TFLOPS | Scientific computing, HPC simulations, and high-precision workloads |
| FP32 Performance | 19.5 TFLOPS | Standard-precision AI training and inference |
| TF32 Tensor Core | 156 TFLOPS (312 with sparsity) | Accelerated training with near-FP32 accuracy at roughly 2× the standard throughput |
| Multi-Instance GPU (MIG) | Up to 7 isolated instances | Multi-tenant inference and cost-efficient serving of multiple models from a single GPU |
| Structural Sparsity | Up to 2× speedup | Production inference with pruned or sparsified models for higher throughput |

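
The "near-FP32 accuracy" of TF32 comes from its layout: it keeps FP32's 8 exponent bits but only 10 of its 23 mantissa bits. The sketch below emulates that in plain Python by clearing the low 13 mantissa bits of an FP32 value. Truncation is a simplification chosen here for clarity (the hardware's exact rounding behavior may differ), and on the GPU only the multiply inputs are reduced while accumulation stays in FP32.

```python
import struct


def to_tf32(x: float) -> float:
    """Emulate TF32 precision by truncating an FP32 value to 10 mantissa bits."""
    # Reinterpret the value as a 32-bit float bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 has 23 mantissa bits; TF32 keeps the top 10, so clear the low 13.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relative_error(x: float) -> float:
    """Relative error introduced by the TF32 truncation above."""
    return abs(to_tf32(x) - x) / abs(x)


if __name__ == "__main__":
    for value in (1.0, 3.14159, 12345.678):
        print(value, "->", to_tf32(value), "rel. error:", relative_error(value))
```

With 10 mantissa bits the relative error stays below about 2^-10 (roughly 0.1%), which is why TF32 is usually acceptable for deep learning matrix multiplies but not for high-precision scientific work.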
"The Runpod team has clearly prioritized the developer experience to create an elegant solution that enables individuals to rapidly develop custom AI apps or integrations while also paving the way for organizations to truly deliver on the promise of AI."

Amjad Masad

"Runpod is the only place I can deploy high-end GPU models instantly—no sales calls, no rate limits, no nonsense."

Daniel Chang

“The main value proposition for us was the flexibility Runpod offered. We were able to scale up effortlessly to meet the demand at launch.”

Josh Payne

“Runpod helped us scale the part of our platform that drives creation. That’s what fuels the rest—image generation, sharing, remixing. It starts with training.”

Matty Shimura

Powerful GPUs. Globally available. Reliability you can trust.

30+ GPUs, 31 regions, instant scale. Fine-tune or go full Skynet—we’ve got you.

| | Community Cloud | Secure Cloud |
| --- | --- | --- |
| Price | $1.19/hr | $1.39/hr |
| Unique GPU Models | 25 | 19 |
| Global Regions | 17 | 14 |
| Network Storage | | |
| Enterprise-Grade Reliability | | |
| Savings Plans | | |
| 24/7 Support | | |
| Delightful Dev Experience | | |

Questions? Answers.

What are the current hourly rates for renting an A100 on Runpod?


For current A100 rental rates including on-demand and reserved options for both the PCIe and the SXM variants, refer to the Runpod pricing page.

How does the A100 compare to the H100 for AI workloads?


The A100 delivers strong performance for training and inference, with up to 312 TFLOPS of TF32 throughput (with sparsity) and up to 2 TB/s memory bandwidth. The H100 surpasses it in raw throughput, particularly for large-scale LLM training, but the A100 offers a compelling cost-performance ratio for fine-tuning, inference, and development workloads. For a detailed comparison, see our GPU benchmarks.

Can I run multiple workloads on a single A100 using MIG?


Yes. The A100's Multi-Instance GPU (MIG) feature partitions a single GPU into up to 7 isolated instances, each with dedicated memory, compute cores, and cache. This makes it well suited to multi-tenant environments or serving multiple models without interference.
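
As a sketch of how the partitioning budget works: an A100 80 GB exposes 7 compute slices and 8 memory slices, and each MIG profile consumes a fixed number of each. The profile names below follow NVIDIA's naming for the A100 80 GB; the `fits` helper is a hypothetical capacity check only and does not model NVIDIA's placement rules, which further restrict which combinations are valid.

```python
# (compute slices, memory slices) consumed by each profile on an A100 80 GB.
PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def fits(requested: list[str]) -> bool:
    """True if the requested profiles stay within the GPU's slice budget."""
    compute = sum(PROFILES[name][0] for name in requested)
    memory = sum(PROFILES[name][1] for name in requested)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


if __name__ == "__main__":
    print(fits(["1g.10gb"] * 7))          # seven small instances
    print(fits(["4g.40gb", "3g.40gb"]))   # one larger pair
    print(fits(["4g.40gb", "4g.40gb"]))   # exceeds the 7 compute slices
```

The seven-instance limit quoted above corresponds to seven of the smallest profile, one per compute slice.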

What's the difference between the A100 PCIe and A100 SXM?

The PCIe card connects to the system via standard PCIe lanes and is the more cost-efficient option for most training and inference workloads. The SXM module uses a direct socket mount with a higher power limit, somewhat higher memory bandwidth (about 2.0 TB/s vs about 1.9 TB/s on the 80 GB variants), and 600 GB/s NVLink, making it better suited for multi-GPU distributed training, large model parallelism, and memory-intensive workloads where inter-GPU communication is a bottleneck. For current pricing on both variants, see the Runpod pricing page.
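
To see why interconnect bandwidth matters for distributed training, the sketch below estimates the ideal time for a ring all-reduce of gradients, where each GPU sends 2(N-1)/N times the payload. The bandwidth figures are assumptions for illustration: 300 GB/s per direction for NVLink (half of the 600 GB/s total) and about 32 GB/s per direction for a PCIe Gen4 x16 link. Real collectives achieve less than peak, and the function name is hypothetical.

```python
def ring_allreduce_seconds(payload_gb: float, num_gpus: int,
                           link_gb_s: float) -> float:
    """Ideal ring all-reduce time, ignoring latency and protocol overhead."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    # Each GPU sends (and receives) 2 * (N - 1) / N of the payload in total.
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / link_gb_s


if __name__ == "__main__":
    gradients_gb = 14  # hypothetical 7B-parameter model with FP16 gradients
    for name, bandwidth in (("NVLink (assumed 300 GB/s)", 300),
                            ("PCIe Gen4 x16 (assumed 32 GB/s)", 32)):
        seconds = ring_allreduce_seconds(gradients_gb, 8, bandwidth)
        print(f"{name}: {seconds * 1000:.0f} ms per all-reduce")
```

If this synchronization happens every training step, the gap between the two links adds up quickly, which is the bottleneck the SXM variant is meant to relieve.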

Is the A100 good for both training and inference?


The A100 is ideal for both training and inference. For inference, it handles high-throughput workloads efficiently, and MIG allows concurrent serving of multiple models from a single GPU. For training, it scales across multiple GPUs via NVLink and is fully compatible with PyTorch, TensorFlow, and JAX.

10,100,100,100

Requests since launch & 400k developers worldwide

Build what’s next.

Build, train, and scale AI workloads on Runpod with cloud GPUs, Serverless, and Clusters.