How to Deploy a Hugging Face Model on a GPU-Powered Docker Container
Hugging Face has made it easier than ever to use powerful transformer models for NLP, computer vision, and more. But running these models efficiently, especially the larger ones, requires GPU acceleration. And if you want your model to be accessible via an API or run in a controlled environment, deploying it inside a Docker container on a GPU machine is often the best approach.
In this guide, we’ll walk through the process of packaging a Hugging Face model into a Docker container, setting it up for inference, and deploying it on a GPU with RunPod. This setup allows you to scale from local testing to full production with ease.
Sign up for RunPod to deploy your GPU container, inference API and model in the cloud.
Why Use Docker for Hugging Face Models?
When you’re dealing with machine learning dependencies, version mismatches and environment issues can slow you down. Docker simplifies this by letting you:
- Package the exact runtime your model needs
- Reproduce environments consistently across systems
- Move easily between local, cloud, or on-prem machines
- Use GPU acceleration with minimal extra setup
With a containerized setup, you can go from testing on your laptop to serving production traffic on a cloud GPU, without reconfiguring anything.
What You'll Need
Before you begin, make sure you have:
- A Hugging Face model (hosted or downloaded)
- A Dockerfile to build your container
- Access to a GPU (locally or on RunPod)
- Python and the Hugging Face transformers library
- Optional: FastAPI or Flask to expose your model via REST API
We’ll assume you’re working with a text generation model like distilGPT2, but the steps apply broadly to most transformer models.
Step 1: Create a Simple Inference Script
First, let’s write a small Python script to handle text input and generate responses using a Hugging Face model.
python
# app.py
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="distilgpt2")

class InputPrompt(BaseModel):
    prompt: str
    max_tokens: int = 50

@app.post("/generate")
def generate_text(data: InputPrompt):
    output = generator(data.prompt, max_new_tokens=data.max_tokens)
    return {"response": output[0]["generated_text"]}
This uses FastAPI to expose a /generate endpoint. When you send it a prompt, it returns the model’s generated response.
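If you want to sanity-check the script before containerizing it, you can run it directly with uvicorn. This is just a quick local check, assuming fastapi, uvicorn, transformers, and torch are installed in your environment:
bash
pip install fastapi uvicorn transformers torch
uvicorn app:app --reload --port 8000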
Step 2: Write Your Dockerfile
Now you need a Dockerfile to containerize the inference app. Here’s a simple one:
Dockerfile
FROM python:3.10-slim
RUN pip install --no-cache-dir fastapi uvicorn transformers torch
COPY app.py /app/app.py
WORKDIR /app
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
If you plan to use a GPU, make sure your Docker runtime is set up to use NVIDIA drivers. You’ll also need to pull a compatible base image (e.g., one with CUDA support) and adjust the install commands accordingly.
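For example, a GPU-ready variant might look something like the sketch below. The base image tag and the PyTorch CUDA wheel index are assumptions here; pick versions that match your target driver and model.
Dockerfile
# Sketch of a GPU-oriented build: the CUDA runtime base image provides the
# libraries that CUDA-enabled PyTorch wheels expect. Versions are assumptions.
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install a CUDA build of PyTorch, then the serving dependencies.
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121 && \
    pip3 install --no-cache-dir fastapi uvicorn transformers

COPY app.py /app/app.py
WORKDIR /app
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]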
Step 3: Build and Test Locally
Once your files are ready, build the container locally:
bash
docker build -t huggingface-api .
Then run it:
bash
docker run -p 8000:8000 huggingface-api
Test the endpoint using curl or Postman:
bash
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Once upon a time", "max_tokens": 50}'
You should get a response with generated text.
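If you prefer testing from Python, a small requests snippet does the same thing. This is a sketch that assumes the container is reachable at localhost:8000:
python
import requests

# Call the /generate endpoint exposed by the container.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Once upon a time", "max_tokens": 50},
)
resp.raise_for_status()
print(resp.json()["response"])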
Step 4: Push Your Image to a Registry
To deploy on RunPod, your container image must be accessible from a public or private Docker registry.
If you're using Docker Hub:
bash
docker tag huggingface-api yourusername/huggingface-api
docker push yourusername/huggingface-api
If you’re using GitHub Container Registry or another provider, adjust the commands accordingly.
Step 5: Deploy on RunPod
Now that your image is ready, it’s time to run it on a GPU in the cloud.
- Go to RunPod Containers
- Create a New Pod
- Select an appropriate GPU (e.g., RTX A4000 or A5000, depending on model size)
- Use the “Custom Image” option and paste the Docker image name (e.g., yourusername/huggingface-api)
- Expose port 8000 by adding it to the exposed ports section
- Set resources: choose enough disk and RAM to handle your model (usually 10GB+ disk and 16GB+ RAM)
- Launch
In a few minutes, your container will be live, and you’ll get a proxy URL like:
https://<pod-id>-8000.proxy.runpod.net
Use this as the base URL for your API.
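For example, the request from Step 3 would now be sent to the proxy URL instead of localhost (the pod ID is whatever your deployment shows):
bash
curl -X POST https://<pod-id>-8000.proxy.runpod.net/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_tokens": 50}'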
Best Practices for Hugging Face Containers
To make your deployment more robust:
- Use torch_dtype=torch.float16 for faster inference (if supported by your model); see the sketch after this list
- Use quantized models if GPU memory is limited
- Persist your model files in a mounted volume if needed
- Set environment variables to toggle between CPU/GPU or enable debugging
- Profile GPU usage with tools like nvidia-smi or torch.cuda.memory_allocated()
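Here is a minimal sketch of loading the pipeline in half precision when a CUDA GPU is available, falling back to defaults on CPU:
python
import torch
from transformers import pipeline

# Use fp16 on GPU for lower memory use and faster inference;
# keep the default dtype when running on CPU.
use_gpu = torch.cuda.is_available()
generator = pipeline(
    "text-generation",
    model="distilgpt2",
    device=0 if use_gpu else -1,
    torch_dtype=torch.float16 if use_gpu else None,
)

print(generator("Once upon a time", max_new_tokens=20)[0]["generated_text"])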
You can also start from a RunPod template if you want a base container with PyTorch, CUDA, or Jupyter pre-installed.
Troubleshooting Tips
Here are some common issues and how to fix them:
Model won’t load:
- Make sure the model is available in the container
- Check Hugging Face’s from_pretrained() path or access token
- Try loading from local files instead of downloading on startup

API not responding:
- Your container may not be running, or the port isn’t exposed
- Make sure FastAPI is running on 0.0.0.0 and port 8000

GPU not being used:
- Ensure you deployed on a GPU pod
- Add --gpus all when running locally (see the example below), or use RunPod’s default GPU-enabled runtime
- Modify your script to move the model to cuda (e.g., model.to("cuda"))
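For local GPU testing, that flag looks like this (assuming the NVIDIA Container Toolkit is installed on the host):
bash
docker run --gpus all -p 8000:8000 huggingface-api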
Cost and Performance Considerations
Running models in the cloud gives you power and flexibility, but it’s important to understand costs:
- A mid-range GPU like an RTX A4000 costs around $0.17/hour
- Larger GPUs like A100s can run $2–$3/hour but handle 13B+ models easily
- To optimize costs:
  - Use smaller or quantized models
  - Shut down the pod when idle
  - Test locally with CPUs before scaling to GPUs
- Compare your options on the RunPod GPU Pricing Page.
Extending Your Deployment
Once your base container is running:
- Add authentication to your API if needed (a minimal sketch follows this list)
- Connect it to a front-end app or chatbot
- Switch to RunPod’s Serverless Inference Endpoints for automatic scaling
- Integrate with RunPod’s API to launch pods programmatically – see API Docs
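As one way to approach the authentication point, you could gate requests behind a shared API key using a FastAPI dependency. This is a standalone sketch, not part of the original app.py: the X-API-Key header and the API_KEY environment variable are illustrative assumptions, and in practice you would attach the same dependency to your existing /generate route.
python
import os
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Illustrative assumption: the expected key is supplied through an
# environment variable (here called API_KEY) that you set on the pod.
EXPECTED_KEY = os.environ.get("API_KEY", "")

def require_api_key(x_api_key: str = Header(default="")):
    # FastAPI maps the X-API-Key request header onto this parameter.
    if not EXPECTED_KEY or x_api_key != EXPECTED_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

@app.post("/generate", dependencies=[Depends(require_api_key)])
def generate_text():
    # In your real app.py, keep the existing pipeline-based body here.
    return {"response": "ok"}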
RunPod makes it easy to evolve from a solo dev project to a full-scale deployment with no infrastructure headaches.
Frequently Asked Questions (FAQ)
Which GPU should I choose for my model?
The right GPU depends largely on the model size and precision you’re using. If you’re working with smaller models like distilGPT2 or bert-base, a GPU such as the NVIDIA T4 or RTX A4000 (16GB VRAM) will generally be enough to get fast, responsive inference. These are excellent options for prototyping or lightweight applications. For larger models, such as LLaMA 13B, Gemma 7B, or StarCoder2 15B, you’ll want something with more VRAM, like the RTX A5000 (24GB) or even an A100 (40GB or 80GB) if you’re dealing with high throughput or long-context generation. RunPod’s GPU Pricing Page breaks down each available GPU by VRAM size, hourly cost, and performance benchmarks, so you can find the one that fits your budget and model requirements.
How can I confirm my model is actually using the GPU?
To confirm your model is using GPU acceleration, you first need to explicitly move your model and input tensors to CUDA in your script. For example, after loading your model with Hugging Face’s transformers library, you should use model.to("cuda") and ensure any input tensors are also moved to the same device with .to("cuda"). On RunPod, GPU support is built in for GPU-selected pods, so you don’t need to set up the driver or runtime yourself. Once your container is running, you can check whether the GPU is being used by opening a terminal inside the container and running nvidia-smi. This tool shows whether the GPU is active, how much memory is in use, and which processes are currently accessing it.
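A quick complementary check from Python inside the container is the sketch below, which only relies on standard PyTorch calls:
python
import torch

# Does PyTorch see a CUDA device inside the container?
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))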
What’s the best way to serve a Hugging Face model in production?
For most use cases, serving Hugging Face models through a REST API using FastAPI is the most efficient and flexible approach. FastAPI is lightweight, Pythonic, and supports async handling, which can help with responsiveness under load. A simple server setup with FastAPI allows you to expose endpoints for text generation, classification, summarization, or whatever task your model supports. If you're planning to scale beyond a single container or want to support higher traffic, you can deploy multiple pods on RunPod and load balance across them. For even more automation and scaling, RunPod offers Serverless GPU Endpoints, which allow your model API to autoscale and handle incoming requests without requiring you to manage the container lifecycle yourself.
How much disk space should I allocate?
Disk space requirements depend on the size of your model and any additional dependencies you're including in the container. Most smaller transformer models (like distilBERT or GPT2-small) are around 500MB to 2GB in size, and the container environment itself might need another 1–3GB depending on the base image and installed libraries. For typical inference setups, allocating 10–20GB of disk space is a good starting point. However, if you’re working with larger models like Gemma 27B or running multiple models in the same container, you may need significantly more, up to 40–60GB. On RunPod, you can configure this in the container settings before deployment. If you need the models or data to persist across sessions or pod restarts, it's best to use volume storage, which gives you a persistent file system attached to your container.
Can I update the model or code after the container is deployed?
Yes, containers on RunPod are flexible enough to allow updates even after deployment. If you need to make changes to your model files or application code, you have a few options. One approach is to SSH into the running pod via the RunPod console and manually edit the files or pull changes from a Git repository. Another option is to rebuild your Docker image locally with the new model or updated code and push it to your container registry, then deploy a new pod with the updated image. If your use case requires frequent updates but not permanent changes, you can also leverage RunPod’s persistent volume feature to store the model separately from the container. That way, you can update the model or switch to a different one without rebuilding the entire image. This method is ideal when you want to separate infrastructure from model logic or iterate rapidly without downtime.
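As a sketch of that volume-based approach for local runs, you can point the Hugging Face cache at a mounted volume so downloaded models survive container restarts. HF_HOME is the standard Hugging Face cache variable; the host path below is just an example:
bash
# Cache models on a mounted volume so they persist across container restarts.
docker run --gpus all -p 8000:8000 \
  -v /path/on/host/hf-cache:/data/hf-cache \
  -e HF_HOME=/data/hf-cache \
  huggingface-api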
By containerizing your Hugging Face model and running it on a GPU-powered platform like RunPod, you unlock speed, scale, and full control over your deployment. Whether you’re building a chatbot, a summarization tool, or a complex AI pipeline, this setup gives you a solid foundation.
Sign up for RunPod to deploy your GPU container, inference API, and model today.