Deploying Hugging Face models in the cloud has become one of the most efficient ways to run large-scale machine learning inference and fine-tuning workloads. As model sizes and demand for real-time inference grow, picking the right GPU infrastructure becomes critical. In this guide, we’ll walk through how to deploy Hugging Face models on NVIDIA A100 SXM GPUs using cloud containers, with a special focus on Runpod’s high-performance infrastructure.
You’ll learn how to set up an inference server using Hugging Face Transformers, optimize for batching and token limits, and fully utilize the powerful A100 SXM GPU architecture. Whether you're running LLMs like LLaMA 2, BLOOM, or BERT, this tutorial will get you up and running quickly.
Why Choose A100 SXM Over PCIe?
Before diving into deployment, let’s explore why the A100 SXM variant stands out.
The A100 GPU comes in two primary form factors: SXM and PCIe. The A100 SXM is designed for high-throughput, data center-grade workloads. Here’s how it compares to the PCIe version:
| Feature | A100 PCIe | A100 SXM |
|---|---|---|
| NVLink Support | Limited | Full NVLink for peer-to-peer communication |
| Memory Bandwidth | ~1.5 TB/s | Up to 2 TB/s |
| Power Budget | 250W | 400W |
| Performance | Great | Exceptional (up to 1.5x faster) |
SXM’s superior interconnect bandwidth and power budget translate to better throughput, lower latency, and higher model capacity—making it perfect for large language models that need high memory bandwidth and multi-GPU parallelism.
Runpod offers on-demand A100 SXM GPU instances at a fraction of the cost you’d expect from traditional cloud providers. Get started with Runpod and launch a GPU container in minutes.
Step 1: Set Up a Runpod Container with A100 SXM
To deploy your Hugging Face model, start by setting up a containerized environment on Runpod.
1.1. Create a Runpod Account
If you haven't already, sign up for Runpod and choose a GPU cloud instance.
1.2. Launch a Custom Cloud Container
- Navigate to “Cloud → Create Pod”
- Select A100 SXM (40GB or 80GB) as your GPU type
- Choose a base image such as `runpod/pytorch:2.0.1-cuda11.8`, or use your custom Dockerfile
- Expose ports (e.g., 8000 for inference)
- Add volumes for model storage if needed
Click “Deploy Pod”—your container will be live in seconds.
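If you prefer to script pod creation instead of clicking through the dashboard, Runpod also ships a Python SDK (`pip install runpod`). The snippet below is only an illustrative sketch: the `gpu_type_id` string and the exact `create_pod` arguments are assumptions, so check the SDK docs and the GPU types listed for your account before relying on it.

```python
# Illustrative sketch of scripted pod creation with the Runpod Python SDK.
# The gpu_type_id value and argument names are assumptions; verify them
# against the SDK documentation for your account.
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

pod = runpod.create_pod(
    name="hf-inference",
    image_name="runpod/pytorch:2.0.1-cuda11.8",
    gpu_type_id="NVIDIA A100-SXM4-80GB",  # assumed identifier; list available types first
)
print(pod)
```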
Step 2: Install Hugging Face Transformers and Dependencies
Once your container is running, SSH into it or use the built-in terminal from the Runpod dashboard.
Install the required Python packages:
pip install transformers accelerate torch
For larger models or efficient inference, consider installing additional tools:
pip install bitsandbytes optimum
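Before loading anything, it's worth confirming that PyTorch can actually see the A100. A quick sanity check:

```python
# Quick sanity check that the A100 is visible to PyTorch
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))  # should report an A100-SXM4 card
print("Total memory (GB):", torch.cuda.get_device_properties(0).total_memory / 1e9)
```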
Step 3: Load and Serve Your Hugging Face Model
Let’s use the popular `text-generation` pipeline as an example. Here’s how to load and serve a model like LLaMA 2:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "meta-llama/Llama-2-7b-hf"  # Change to your preferred model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",    # place the model on the available GPU(s)
    torch_dtype="auto",   # pick fp16/bf16 automatically when supported
)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```
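Note that meta-llama/Llama-2-7b-hf is a gated repository, so you'll need to accept Meta's license on Hugging Face and authenticate (for example with `huggingface-cli login`) before the download works. Once the pipeline is loaded, a quick smoke test looks like this:

```python
# Quick smoke test of the pipeline before wiring it into a server
result = generator("Explain NVLink in one sentence:", max_new_tokens=50, do_sample=True)
print(result[0]["generated_text"])
```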
To serve this via an API, you can use FastAPI or Flask. Here’s a FastAPI example:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Prompt(BaseModel):
    prompt: str
    max_tokens: int = 100

@app.post("/generate")
async def generate_text(prompt: Prompt):
    output = generator(prompt.prompt, max_new_tokens=prompt.max_tokens, do_sample=True)
    return {"response": output[0]["generated_text"]}
```
Save the file as `app.py` and run:
uvicorn app:app --host 0.0.0.0 --port 8000
Your inference server is now live and accessible on the container’s public IP.
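You can verify the endpoint from any machine that can reach the pod. Here's a minimal client sketch using Python's `requests`; swap the placeholder URL for your pod's address:

```python
# Minimal client for the /generate endpoint; replace the URL with your pod's address
import requests

resp = requests.post(
    "http://<YOUR_POD_IP>:8000/generate",
    json={"prompt": "Write a haiku about GPUs.", "max_tokens": 64},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```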
Step 4: Optimize for Batching and Token Limits
To maximize GPU utilization, especially on A100 SXM, it’s important to configure batching and tokenization efficiently.
Token Limit Tips
- Check the model’s max token limit (e.g., GPT-2 supports 1024 tokens, LLaMA 2 supports up to 4096).
- Pad or truncate inputs to align with model expectations.
- Use `tokenizer.pad_token = tokenizer.eos_token` if the tokenizer has no pad token defined (see the sketch below).
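Here's a minimal sketch of padding and truncation for a batch of prompts, assuming a causal LM with a 4096-token context window (adjust `max_length` to your model):

```python
# Pad/truncate a batch of prompts to the model's context window (assumed 4096 here)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS when no pad token is defined

inputs = tokenizer(
    ["First prompt...", "A much longer second prompt..."],
    padding=True,       # pad shorter prompts in the batch
    truncation=True,    # cut anything past max_length
    max_length=4096,    # match your model's context window
    return_tensors="pt",
).to(model.device)
```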
Dynamic Batching
Batching multiple requests together reduces per-inference overhead. Use a serving framework like Text Generation Inference (TGI) or vLLM for optimized batching. For example, you can launch TGI with its official Docker image:
```bash
# Gated models like Llama 2 also need -e HUGGING_FACE_HUB_TOKEN=<your token>
docker run --gpus all --shm-size 1g -p 8000:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-hf
```
TGI handles:
- Token streaming
- Smart batching
- Multi-GPU inference via tensor-parallel sharding (the --num-shard option)
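Once the TGI server is up, you can call it from Python with the `text-generation` client package. A minimal sketch, assuming the Docker command above is listening on port 8000:

```python
# Minimal TGI client sketch (pip install text-generation); assumes TGI on localhost:8000
from text_generation import Client

client = Client("http://localhost:8000")
response = client.generate("Explain tensor parallelism briefly:", max_new_tokens=64)
print(response.generated_text)
```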
Step 5: Monitor GPU Utilization
Runpod provides built-in GPU usage stats, but you can also monitor from within your container using:
watch -n 1 nvidia-smi
Or use PyTorch to profile usage:
```python
import torch

# Bytes currently allocated by PyTorch tensors on the active GPU
print("GPU Memory Usage:", torch.cuda.memory_allocated())
```
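Since `memory_allocated()` reports bytes, it helps to put the number in context by comparing it against the card's total memory. A small sketch:

```python
# Report allocated memory as a fraction of the GPU's total memory
import torch

allocated = torch.cuda.memory_allocated()
total = torch.cuda.get_device_properties(0).total_memory
print(f"Allocated: {allocated / 1e9:.1f} GB of {total / 1e9:.1f} GB "
      f"({100 * allocated / total:.1f}%)")
```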
To further optimize:
- Use mixed precision (`torch_dtype=torch.float16`)
- Offload layers to CPU if memory-constrained
- Use quantization with `bitsandbytes` or `optimum` (see the sketch below)
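For example, here's a minimal sketch of loading the model in 4-bit with `bitsandbytes` via `BitsAndBytesConfig`; exact memory savings and quality trade-offs depend on the model:

```python
# Load the model in 4-bit to cut GPU weight memory substantially (quality trade-offs apply)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 while weights stay 4-bit
)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)
```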
Cost Estimates on Runpod
One of the biggest advantages of Runpod is transparent, usage-based pricing. As of this writing:
- A100 80GB SXM: Starting at $1.99/hour
- A100 40GB SXM: Starting at $1.49/hour
When compared to major cloud providers charging $3–$4/hour for similar performance, Runpod offers a clear edge.
Use Runpod’s pricing calculator to estimate your monthly inference costs.
Want to save even more? Run your pods as spot instances or configure them to auto-terminate when idle.
Frequently Asked Questions
When should I choose A100 SXM over other GPUs?
Use SXM if:
- Your model exceeds 15B parameters
- You need low latency for real-time inference
- You run multi-GPU training or inference jobs
- You require high memory bandwidth
For smaller models (e.g., BERT, DistilGPT2), a lower-tier GPU like the RTX 4090 may be more cost-effective.
Are Hugging Face models compatible with A100 SXM?
Yes, Hugging Face models run on all modern NVIDIA GPUs, including the A100 SXM. Most models can be placed on the GPU automatically with `device_map="auto"`.
For extremely large models, use `accelerate`, DeepSpeed, or the `transformers` integration with memory offloading.
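As a rough sketch of what that offloading looks like with accelerate's device map (the memory limits below are placeholders you'd tune for your pod):

```python
# Sketch of CPU offloading via accelerate's device map; the memory caps are placeholders
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    max_memory={0: "70GiB", "cpu": "120GiB"},  # cap GPU 0, spill the rest to CPU RAM
    offload_folder="offload",                  # disk offload if CPU RAM also runs out
)
```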
How can I reduce costs while using A100?
- Use quantized models (e.g., 4-bit with bitsandbytes)
- Batch inference requests
- Use Runpod’s spot instances
- Load only the model components you need (e.g., encoder-only)
Start Deploying Today
With Hugging Face and Runpod, deploying state-of-the-art models like LLaMA 2, Falcon, or BLOOM has never been easier—or faster.
By leveraging the raw power of A100 SXM GPUs, you can achieve higher throughput and lower latency, all while keeping costs under control.
Ready to get started?
👉 Launch your first A100 SXM container on Runpod
👉 Explore Runpod containers and templates
👉 Join our Discord community for support and deployment tips
Stay tuned for more tutorials on deploying LLMs, fine-tuning on Runpod, and optimizing resource usage for production AI workloads.
Happy deploying!