How to Expose an AI Model as a REST API from a Docker Container
Turning your AI model into a production-ready service starts with a simple idea: wrap it in a REST API and run it in a container. This approach makes it portable, scalable, and easy to integrate into any application—from web tools and mobile apps to internal systems and workflows.
In this guide, we’ll walk through how to expose an AI model as a REST API using Docker. Whether you’re using a Hugging Face model, a custom PyTorch model, or something built in TensorFlow, this method works across the board. We’ll show you how to go from a local script to a containerized, API-powered endpoint running on a GPU in the cloud.
Sign up for RunPod to deploy your GPU container, inference API, or model with full cloud support and custom templates.
Why Wrap Your Model in an API?
Running a model locally is great for testing, but APIs unlock production use cases. Here’s why this step is essential:
- Standardized Access – Any application that can send an HTTP request can call your model.
- Remote Hosting – You can run the model in the cloud and access it from anywhere.
- Scaling & Monitoring – REST APIs make it easier to log, throttle, or distribute requests.
By exposing a /predict or /generate endpoint, you decouple your model from the frontend and make it available to any client.
Step 1: Build a Simple Inference Script
Start with a script that loads your model and accepts input. Here's an example using a Hugging Face model in PyTorch:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
You can test this function locally before wrapping it in an API.
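Before wiring up a server, a quick sanity check confirms the function works on its own; the prompt below is just an arbitrary example:

```python
if __name__ == "__main__":
    # Generate a short completion to confirm the model and tokenizer load correctly
    print(generate_text("Once upon a time"))
```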
Step 2: Set Up a FastAPI Server
FastAPI is a lightweight, high-performance web framework perfect for serving ML models:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Prompt(BaseModel):
    prompt: str

@app.post("/generate")
def generate(prompt: Prompt):
    result = generate_text(prompt.prompt)
    return {"output": result}
```
To run this locally (with the code from Steps 1 and 2 saved as myapp.py):
```bash
uvicorn myapp:app --host 0.0.0.0 --port 8000
```
Now your model is accessible via POST /generate.
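To confirm the endpoint responds, you can send a quick test request; this sketch assumes the server above is running on localhost and uses the requests library:

```python
import requests

# Send a prompt to the local FastAPI server and print the generated text
response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Explain containers in one sentence."},
)
response.raise_for_status()
print(response.json()["output"])
```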
Step 3: Create a Dockerfile
A Dockerfile packages everything needed to run your model and API in one container:
```dockerfile
FROM python:3.10-slim

WORKDIR /app
COPY . .
RUN pip install fastapi uvicorn transformers torch
CMD ["uvicorn", "myapp:app", "--host", "0.0.0.0", "--port", "8000"]
```
This setup:
- Copies your app code into the image
- Installs the Python dependencies
- Launches the API server when the container starts
To build and test the image locally:
```bash
docker build -t my-api .
docker run -p 8000:8000 --gpus all my-api
```
Make sure your Docker runtime has the NVIDIA Container Toolkit installed so the --gpus all flag can expose GPUs for acceleration.
Step 4: Run It on a Cloud GPU (with RunPod)
Once your container works locally, deploy it to the cloud to run at scale and speed.
RunPod supports custom containers with GPU acceleration and makes setup fast:
- Launch a GPU Pod
  Choose a GPU based on your model's size. For example:
  - GPT-2 or BERT → RTX A4000 (16GB)
  - Llama 2 13B → A100 40GB or L40S 48GB
  → Compare GPU options and pricing
- Use a Template or Bring Your Own Container
  RunPod supports:
  - Prebuilt templates with Python, CUDA, and PyTorch
  - Custom Docker images from Docker Hub or any other registry
  → Explore available containers and templates
- Configure Ports and Variables
  - Open port 8000
  - Set environment variables if needed (e.g., MODEL_PATH)
  - Ensure your CMD launches the API server on container start
- Deploy and Test
  RunPod will provide a proxy URL like https://<your-pod-id>-8000.proxy.runpod.net, and you can then send HTTP POST requests to your /generate endpoint, as shown below.
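Calling the deployed endpoint looks the same as calling it locally; only the URL changes. The pod ID below stays a placeholder until you substitute your own:

```python
import requests

# Replace <your-pod-id> with the ID shown in the RunPod dashboard
url = "https://<your-pod-id>-8000.proxy.runpod.net/generate"

response = requests.post(
    url,
    json={"prompt": "Write a haiku about containers."},
    timeout=60,  # generations can take a while, so allow a generous timeout
)
response.raise_for_status()
print(response.json()["output"])
```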
Tips for Model Deployment Success
Some quick lessons from deploying dozens of containerized APIs:
- Optimize Load Time: Preload the model on startup to avoid delays on first request (see the sketch after this list).
- Limit Output Size: Use a sensible max_new_tokens or max_length value.
- Stream Output (Optional): Stream generated text as it's created using Server-Sent Events or WebSockets (a sketch follows below).
- Handle Timeouts: Add retry logic or timeout settings for longer generations.
- Use Logging: Print model input/output and any errors for easy debugging.
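The Step 2 example already preloads the model at import time; if you prefer an explicit startup hook that also logs progress, FastAPI's lifespan handler is one way to combine the preloading and logging tips (a sketch, not the only approach):

```python
import logging
from contextlib import asynccontextmanager

from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("inference-api")

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup so the first request isn't slowed by model loading
    logger.info("Loading model...")
    models["tokenizer"] = AutoTokenizer.from_pretrained("gpt2")
    models["model"] = AutoModelForCausalLM.from_pretrained("gpt2")
    logger.info("Model loaded and ready to serve")
    yield
    # Release references on shutdown
    models.clear()

app = FastAPI(lifespan=lifespan)
```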
Tip: Use RunPod Volumes to persist model files and avoid re-downloading on every deployment.
→ Check volume storage options
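For the streaming option, one approach (among several) pairs Hugging Face's TextIteratorStreamer with FastAPI's StreamingResponse; the /generate/stream route name here is just an example:

```python
from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

app = FastAPI()

class Prompt(BaseModel):
    prompt: str

@app.post("/generate/stream")
def generate_stream(prompt: Prompt):
    inputs = tokenizer(prompt.prompt, return_tensors="pt")
    # The streamer yields decoded text chunks as generate() produces tokens
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread so we can stream while it works
    Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 50},
    ).start()
    return StreamingResponse(streamer, media_type="text/plain")
```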
Docker Best Practices for AI Inference
Containerization ensures reproducibility. Here’s how to do it right:
- Base Image: Use a minimal base (like python:slim) to keep image size small.
- Layer Caching: Install Python packages before copying code to use the Docker cache.
- Avoid Re-downloading Models: Mount volumes or pre-bake weights into the image (see the sketch after this list).
- GPU Support: Always run with --gpus all and check nvidia-smi inside the container.
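For the model-caching and GPU points, a little startup code makes problems visible early. This sketch assumes a MODEL_PATH environment variable pointing at weights on a mounted volume; the variable name echoes the example from Step 4 and is not required by any library:

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load weights from a mounted volume if MODEL_PATH is set, otherwise fall back to the Hub
model_path = os.environ.get("MODEL_PATH", "gpt2")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Confirm the container actually sees a GPU before serving traffic
if torch.cuda.is_available():
    model = model.to("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; falling back to CPU")
```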
→ Read Docker container deployment guide
Security and Networking Considerations
When exposing a REST API:
- Set the host to 0.0.0.0 to allow external access
- Use HTTPS if the API handles sensitive data
- Limit access using API keys or reverse proxies if needed (see the sketch after this list)
- Avoid hardcoding credentials in your Dockerfile or codebase
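As a minimal sketch of the API-key idea, assuming the key arrives in an X-API-Key header and is configured through an API_KEY environment variable (both names are examples, not a standard):

```python
import os

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Assumption: the expected key is injected via an environment variable, never hardcoded
EXPECTED_KEY = os.environ.get("API_KEY", "")

def require_api_key(x_api_key: str = Header(default="")):
    # FastAPI maps the X-API-Key request header to this parameter
    if not EXPECTED_KEY or x_api_key != EXPECTED_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

@app.post("/generate", dependencies=[Depends(require_api_key)])
def generate(body: dict):
    # ... run the model as in Step 2 ...
    return {"output": "generated text goes here"}
```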
Real-World Use Cases
Developers use this setup across industries:
- Chatbots: Wrap LLMs with APIs for internal tools or customer support bots.
- Content Generation: Let marketing or creative teams access AI through a simple interface.
- Custom Classifiers: Expose a model trained for a specific task (e.g., sentiment detection) to your app frontend.
- Document Parsing: Use OCR + AI in a container that receives PDFs and returns extracted insights.