How to Deploy a Hugging Face Model on a GPU-Powered Docker Container
Hugging Face has made it easier than ever to use powerful transformer models for NLP, computer vision, and more. But running these models efficiently, especially the larger ones, requires GPU acceleration. And if you want your model to be accessible via an API or run in a controlled environment, deploying it inside a Docker container on a GPU machine is often the best approach.
In this guide, we’ll walk through the process of packaging a Hugging Face model into a Docker container, setting it up for inference, and deploying it on a GPU with RunPod. This setup allows you to scale from local testing to full production with ease.
Sign up for RunPod to deploy your GPU container, inference API and model in the cloud.
Why Use Docker for Hugging Face Models?
When you’re dealing with machine learning dependencies, version mismatches and environment issues can slow you down. Docker simplifies this by letting you:
- Package the exact runtime your model needs
- Reproduce environments consistently across systems
- Move easily between local, cloud, or on-prem machines
- Use GPU acceleration with minimal extra setup
With a containerized setup, you can go from testing on your laptop to serving production traffic on a cloud GPU, without reconfiguring anything.
What You'll Need
Before you begin, make sure you have:
- A Hugging Face model (hosted or downloaded)
- A Dockerfile to build your container
- Access to a GPU (locally or on RunPod)
- Python and the Hugging Face transformers library
- Optional: FastAPI or Flask to expose your model via REST API
We’ll assume you’re working with a text generation model like distilGPT2, but the steps apply broadly to most transformer models.
Step 1: Create a Simple Inference Script
First, let’s write a small Python script to handle text input and generate responses using a Hugging Face model.
python
# app.py
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="distilgpt2")

class InputPrompt(BaseModel):
    prompt: str
    max_tokens: int = 50

@app.post("/generate")
def generate_text(data: InputPrompt):
    output = generator(data.prompt, max_new_tokens=data.max_tokens)
    return {"response": output[0]["generated_text"]}
This uses FastAPI to expose a /generate endpoint. When you send it a prompt, it returns the model’s generated response.
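If you want to sanity-check the script before containerizing it, you can run it directly with uvicorn. This is just a quick local check, assuming fastapi, uvicorn, transformers, and torch are installed in your environment:
bash
pip install fastapi uvicorn transformers torch
uvicorn app:app --reload --port 8000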
Step 2: Write Your Dockerfile
Now you need a Dockerfile to containerize the inference app. Here’s a simple one:
Dockerfile
FROM python:3.10-slim
RUN pip install --no-cache-dir fastapi uvicorn transformers torch
COPY app.py /app/app.py
WORKDIR /app
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
If you plan to use a GPU, make sure your Docker runtime is set up to use NVIDIA drivers. You’ll also need to pull a compatible base image (e.g., one with CUDA support) and adjust the install commands accordingly.
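For example, a GPU-ready variant might look something like the sketch below. The base image tag and the PyTorch CUDA wheel index are assumptions here; pick versions that match your target driver and model.
Dockerfile
# Sketch of a GPU-oriented build: the CUDA runtime base image provides the
# libraries that CUDA-enabled PyTorch wheels expect. Versions are assumptions.
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install a CUDA build of PyTorch, then the serving dependencies.
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121 && \
    pip3 install --no-cache-dir fastapi uvicorn transformers

COPY app.py /app/app.py
WORKDIR /app
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]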
Step 3: Build and Test Locally
Once your files are ready, build the container locally:
bash
docker build -t huggingface-api .
Then run it:
bash
docker run -p 8000:8000 huggingface-api
Test the endpoint using curl or Postman:
bash
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Once upon a time", "max_tokens": 50}'
You should get a response with generated text.
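If you prefer testing from Python, a small requests snippet does the same thing. This is a sketch that assumes the container is reachable at localhost:8000:
python
import requests

# Call the /generate endpoint exposed by the container.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Once upon a time", "max_tokens": 50},
)
resp.raise_for_status()
print(resp.json()["response"])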
Step 4: Push Your Image to a Registry
To deploy on RunPod, your container image must be accessible from a public or private Docker registry.
If you're using Docker Hub:
bash
docker tag huggingface-api yourusername/huggingface-api
docker push yourusername/huggingface-api
If you’re using GitHub Container Registry or another provider, adjust the commands accordingly.
Step 5: Deploy on RunPod
Now that your image is ready, it’s time to run it on a GPU in the cloud.
- Go to RunPod Containers
- Create a New Pod
- Select an appropriate GPU (e.g., RTX A4000 or A5000, depending on model size)
- Use the “Custom Image” option and paste the Docker image name (e.g., yourusername/huggingface-api)
- Expose port 8000 by adding it to the exposed ports section
- Set resources: choose enough disk and RAM to handle your model (usually 10GB+ disk and 16GB+ RAM)
- Launch
In a few minutes, your container will be live, and you’ll get a proxy URL like:
https://<pod-id>-8000.proxy.runpod.net
Use this as the base URL for your API.
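For example, the request from Step 3 would now be sent to the proxy URL instead of localhost (the pod ID is whatever your deployment shows):
bash
curl -X POST https://<pod-id>-8000.proxy.runpod.net/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_tokens": 50}'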
Best Practices for Hugging Face Containers
To make your deployment more robust:
- Use torch_dtype=torch.float16 for faster inference (if supported by your model); see the sketch after this list
- Use quantized models if GPU memory is limited
- Persist your model files in a mounted volume if needed
- Set environment variables to toggle between CPU/GPU or enable debugging
- Profile GPU usage with tools like nvidia-smi or torch.cuda.memory_allocated()
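Here is a minimal sketch of loading the pipeline in half precision when a CUDA GPU is available, falling back to defaults on CPU:
python
import torch
from transformers import pipeline

# Use fp16 on GPU for lower memory use and faster inference;
# keep the default dtype when running on CPU.
use_gpu = torch.cuda.is_available()
generator = pipeline(
    "text-generation",
    model="distilgpt2",
    device=0 if use_gpu else -1,
    torch_dtype=torch.float16 if use_gpu else None,
)

print(generator("Once upon a time", max_new_tokens=20)[0]["generated_text"])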
You can also start from a RunPod template if you want a base container with PyTorch, CUDA, or Jupyter pre-installed.
Troubleshooting Tips
Here are some common issues and how to fix them:
Model won’t load:
- Make sure the model is available in the container
- Check Hugging Face’s from_pretrained() path or access token
- Try loading from local files instead of downloading on startup

API not responding:
- Your container may not be running, or the port isn’t exposed
- Make sure FastAPI is running on 0.0.0.0 and port 8000

GPU not being used:
- Ensure you deployed on a GPU pod
- Add --gpus all when running locally (see the example below), or use RunPod’s default GPU-enabled runtime
- Modify your script to move the model to cuda (e.g., model.to("cuda"))
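For local GPU testing, that flag looks like this (assuming the NVIDIA Container Toolkit is installed on the host):
bash
docker run --gpus all -p 8000:8000 huggingface-api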
Cost and Performance Considerations
Running models in the cloud gives you power and flexibility, but it’s important to understand costs:
- A mid-range GPU like an RTX A4000 costs around $0.17/hour
- Larger GPUs like A100s can run $2–$3/hour but handle 13B+ models easily
- To optimize costs:
  - Use smaller or quantized models
  - Shut down the pod when idle
  - Test locally with CPUs before scaling to GPUs
- Compare your options on the RunPod GPU Pricing Page.
Extending Your Deployment
Once your base container is running:
- Add authentication to your API if needed (a minimal sketch follows this list)
- Connect it to a front-end app or chatbot
- Switch to RunPod’s Serverless Inference Endpoints for automatic scaling
- Integrate with RunPod’s API to launch pods programmatically – see API Docs
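As one way to approach the authentication point, you could gate requests behind a shared API key using a FastAPI dependency. This is a standalone sketch, not part of the original app.py: the X-API-Key header and the API_KEY environment variable are illustrative assumptions, and in practice you would attach the same dependency to your existing /generate route.
python
import os
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Illustrative assumption: the expected key is supplied through an
# environment variable (here called API_KEY) that you set on the pod.
EXPECTED_KEY = os.environ.get("API_KEY", "")

def require_api_key(x_api_key: str = Header(default="")):
    # FastAPI maps the X-API-Key request header onto this parameter.
    if not EXPECTED_KEY or x_api_key != EXPECTED_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

@app.post("/generate", dependencies=[Depends(require_api_key)])
def generate_text():
    # In your real app.py, keep the existing pipeline-based body here.
    return {"response": "ok"}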
RunPod makes it easy to evolve from a solo dev project to a full-scale deployment with no infrastructure headaches.
Frequently Asked Questions (FAQ)
Which GPU should I choose for my model?
The right GPU depends largely on the model size and precision you’re using. If you’re working with smaller models like distilGPT2 or bert-base, a GPU such as the NVIDIA T4 or RTX A4000 (16GB VRAM) will generally be enough to get fast, responsive inference. These are excellent options for prototyping or lightweight applications. For larger models, such as LLaMA 13B, Gemma 7B, or StarCoder2 15B, you’ll want something with more VRAM, like the RTX A5000 (24GB) or even an A100 (40GB or 80GB) if you’re dealing with high throughput or long-context generation. RunPod’s GPU Pricing Page breaks down each available GPU by VRAM size, hourly cost, and performance benchmarks, so you can find the one that fits your budget and model requirements.
How can I confirm my model is actually using the GPU?
To confirm your model is using GPU acceleration, you first need to explicitly move your model and input tensors to CUDA in your script. For example, after loading your model with Hugging Face’s transformers library, you should use model.to("cuda") and ensure any input tensors are also moved to the same device with .to("cuda"). On RunPod, GPU support is built in for GPU-selected pods, so you don’t need to set up the driver or runtime yourself. Once your container is running, you can check whether the GPU is being used by opening a terminal inside the container and running nvidia-smi. This tool shows whether the GPU is active, how much memory is in use, and which processes are currently accessing it.
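A quick complementary check from Python inside the container is the sketch below, which only relies on standard PyTorch calls:
python
import torch

# Does PyTorch see a CUDA device inside the container?
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))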
What’s the best way to serve a Hugging Face model in production?
For most use cases, serving Hugging Face models through a REST API using FastAPI is the most efficient and flexible approach. FastAPI is lightweight, Pythonic, and supports async handling, which can help with responsiveness under load. A simple server setup with FastAPI allows you to expose endpoints for text generation, classification, summarization, or whatever task your model supports. If you're planning to scale beyond a single container or want to support higher traffic, you can deploy multiple pods on RunPod and load balance across them. For even more automation and scaling, RunPod offers Serverless GPU Endpoints, which allow your model API to autoscale and handle incoming requests without requiring you to manage the container lifecycle yourself.
How much disk space should I allocate?
Disk space requirements depend on the size of your model and any additional dependencies you're including in the container. Most smaller transformer models (like distilBERT or GPT2-small) are around 500MB to 2GB in size, and the container environment itself might need another 1–3GB depending on the base image and installed libraries. For typical inference setups, allocating 10–20GB of disk space is a good starting point. However, if you’re working with larger models like Gemma 27B or running multiple models in the same container, you may need significantly more, up to 40–60GB. On RunPod, you can configure this in the container settings before deployment. If you need the models or data to persist across sessions or pod restarts, it's best to use volume storage, which gives you a persistent file system attached to your container.
Can I update the model or code after the container is deployed?
Yes, containers on RunPod are flexible enough to allow updates even after deployment. If you need to make changes to your model files or application code, you have a few options. One approach is to SSH into the running pod via the RunPod console and manually edit the files or pull changes from a Git repository. Another option is to rebuild your Docker image locally with the new model or updated code and push it to your container registry, then deploy a new pod with the updated image. If your use case requires frequent updates but not permanent changes, you can also leverage RunPod’s persistent volume feature to store the model separately from the container. That way, you can update the model or switch to a different one without rebuilding the entire image. This method is ideal when you want to separate infrastructure from model logic or iterate rapidly without downtime.
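As a sketch of that volume-based approach for local runs, you can point the Hugging Face cache at a mounted volume so downloaded models survive container restarts. HF_HOME is the standard Hugging Face cache variable; the host path below is just an example:
bash
# Cache models on a mounted volume so they persist across container restarts.
docker run --gpus all -p 8000:8000 \
  -v /path/on/host/hf-cache:/data/hf-cache \
  -e HF_HOME=/data/hf-cache \
  huggingface-api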
By containerizing your Hugging Face model and running it on a GPU-powered platform like RunPod, you unlock speed, scale, and full control over your deployment. Whether you’re building a chatbot, a summarization tool, or a complex AI pipeline, this setup gives you a solid foundation.
Sign up for RunPod to deploy your GPU container, inference API, and model today.