How to Expose an AI Model as a REST API from a Docker Container
Turning your AI model into a production-ready service starts with a simple idea: wrap it in a REST API and run it in a container. This approach makes it portable, scalable, and easy to integrate into any application—from web tools and mobile apps to internal systems and workflows.
In this guide, we’ll walk through how to expose an AI model as a REST API using Docker. Whether you’re using a Hugging Face model, a custom PyTorch model, or something built in TensorFlow, this method works across the board. We’ll show you how to go from a local script to a containerized, API-powered endpoint running on a GPU in the cloud.
Sign up for RunPod to deploy your GPU container, inference API, or model with full cloud support and custom templates.
Why Wrap Your Model in an API?
Running a model locally is great for testing, but APIs unlock production use cases. Here’s why this step is essential:
- Standardized Access – Any application that can send an HTTP request can call your model.
- Remote Hosting – You can run the model in the cloud and access it from anywhere.
- Scaling & Monitoring – REST APIs make it easier to log, throttle, or distribute requests.
By exposing a /predict or /generate endpoint, you decouple your model from the frontend and make it available to any client.
Step 1: Build a Simple Inference Script
Start with a script that loads your model and accepts input. Here's an example using a Hugging Face model in PyTorch:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
You can test this function locally before wrapping it in an API.
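Before wiring up a server, a quick sanity check confirms the function works on its own; the prompt below is just an arbitrary example:

```python
if __name__ == "__main__":
    # Generate a short completion to confirm the model and tokenizer load correctly
    print(generate_text("Once upon a time"))
```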
Step 2: Set Up a FastAPI Server
FastAPI is a lightweight, high-performance web framework perfect for serving ML models:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Prompt(BaseModel):
    prompt: str

@app.post("/generate")
def generate(prompt: Prompt):
    result = generate_text(prompt.prompt)
    return {"output": result}
```
To run this locally (with the code from Steps 1 and 2 saved as myapp.py):
```bash
uvicorn myapp:app --host 0.0.0.0 --port 8000
```
Now your model is accessible via POST /generate.
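To confirm the endpoint responds, you can send a quick test request; this sketch assumes the server above is running on localhost and uses the requests library:

```python
import requests

# Send a prompt to the local FastAPI server and print the generated text
response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Explain containers in one sentence."},
)
response.raise_for_status()
print(response.json()["output"])
```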
Step 3: Create a Dockerfile
A Dockerfile packages everything needed to run your model and API in one container:
```dockerfile
FROM python:3.10-slim

WORKDIR /app
COPY . .
RUN pip install fastapi uvicorn transformers torch
CMD ["uvicorn", "myapp:app", "--host", "0.0.0.0", "--port", "8000"]
```
This setup:
- Copies your app code into the image
- Installs the Python dependencies
- Launches the API server when the container starts
To build and test the image locally:
```bash
docker build -t my-api .
docker run -p 8000:8000 --gpus all my-api
```
Make sure your Docker runtime has the NVIDIA Container Toolkit installed so the --gpus all flag can expose GPUs for acceleration.
Step 4: Run It on a Cloud GPU (with RunPod)
Once your container works locally, deploy it to the cloud to run at scale and speed.
RunPod supports custom containers with GPU acceleration and makes setup fast:
- Launch a GPU Pod
  Choose a GPU based on your model's size. For example:
  - GPT-2 or BERT → RTX A4000 (16GB)
  - Llama 2 13B → A100 40GB or L40S 48GB
  → Compare GPU options and pricing
- Use a Template or Bring Your Own Container
  RunPod supports:
  - Prebuilt templates with Python, CUDA, and PyTorch
  - Custom Docker images from Docker Hub or any other registry
  → Explore available containers and templates
- Configure Ports and Variables
  - Open port 8000
  - Set environment variables if needed (e.g., MODEL_PATH)
  - Ensure your CMD launches the API server on container start
- Deploy and Test
  RunPod will provide a proxy URL like https://<your-pod-id>-8000.proxy.runpod.net, and you can then send HTTP POST requests to your /generate endpoint, as shown below.
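Calling the deployed endpoint looks the same as calling it locally; only the URL changes. The pod ID below stays a placeholder until you substitute your own:

```python
import requests

# Replace <your-pod-id> with the ID shown in the RunPod dashboard
url = "https://<your-pod-id>-8000.proxy.runpod.net/generate"

response = requests.post(
    url,
    json={"prompt": "Write a haiku about containers."},
    timeout=60,  # generations can take a while, so allow a generous timeout
)
response.raise_for_status()
print(response.json()["output"])
```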
Tips for Model Deployment Success
Some quick lessons from deploying dozens of containerized APIs:
- Optimize Load Time: Preload the model on startup to avoid delays on first request (see the sketch after this list).
- Limit Output Size: Use a sensible max_new_tokens or max_length value.
- Stream Output (Optional): Stream generated text as it's created using Server-Sent Events or WebSockets (a sketch follows below).
- Handle Timeouts: Add retry logic or timeout settings for longer generations.
- Use Logging: Print model input/output and any errors for easy debugging.
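The Step 2 example already preloads the model at import time; if you prefer an explicit startup hook that also logs progress, FastAPI's lifespan handler is one way to combine the preloading and logging tips (a sketch, not the only approach):

```python
import logging
from contextlib import asynccontextmanager

from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("inference-api")

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup so the first request isn't slowed by model loading
    logger.info("Loading model...")
    models["tokenizer"] = AutoTokenizer.from_pretrained("gpt2")
    models["model"] = AutoModelForCausalLM.from_pretrained("gpt2")
    logger.info("Model loaded and ready to serve")
    yield
    # Release references on shutdown
    models.clear()

app = FastAPI(lifespan=lifespan)
```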
Tip: Use RunPod Volumes to persist model files and avoid re-downloading on every deployment.
→ Check volume storage options
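For the streaming option, one approach (among several) pairs Hugging Face's TextIteratorStreamer with FastAPI's StreamingResponse; the /generate/stream route name here is just an example:

```python
from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

app = FastAPI()

class Prompt(BaseModel):
    prompt: str

@app.post("/generate/stream")
def generate_stream(prompt: Prompt):
    inputs = tokenizer(prompt.prompt, return_tensors="pt")
    # The streamer yields decoded text chunks as generate() produces tokens
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread so we can stream while it works
    Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 50},
    ).start()
    return StreamingResponse(streamer, media_type="text/plain")
```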
Docker Best Practices for AI Inference
Containerization ensures reproducibility. Here’s how to do it right:
- Base Image: Use a minimal base (like python:slim) to keep image size small.
- Layer Caching: Install Python packages before copying code to use the Docker cache.
- Avoid Re-downloading Models: Mount volumes or pre-bake weights into the image (see the sketch after this list).
- GPU Support: Always run with --gpus all and check nvidia-smi inside the container.
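For the model-caching and GPU points, a little startup code makes problems visible early. This sketch assumes a MODEL_PATH environment variable pointing at weights on a mounted volume; the variable name echoes the example from Step 4 and is not required by any library:

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load weights from a mounted volume if MODEL_PATH is set, otherwise fall back to the Hub
model_path = os.environ.get("MODEL_PATH", "gpt2")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Confirm the container actually sees a GPU before serving traffic
if torch.cuda.is_available():
    model = model.to("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; falling back to CPU")
```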
→ Read Docker container deployment guide
Security and Networking Considerations
When exposing a REST API:
- Set the host to 0.0.0.0 to allow external access
- Use HTTPS if the API handles sensitive data
- Limit access using API keys or reverse proxies if needed (see the sketch after this list)
- Avoid hardcoding credentials in your Dockerfile or codebase
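As a minimal sketch of the API-key idea, assuming the key arrives in an X-API-Key header and is configured through an API_KEY environment variable (both names are examples, not a standard):

```python
import os

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Assumption: the expected key is injected via an environment variable, never hardcoded
EXPECTED_KEY = os.environ.get("API_KEY", "")

def require_api_key(x_api_key: str = Header(default="")):
    # FastAPI maps the X-API-Key request header to this parameter
    if not EXPECTED_KEY or x_api_key != EXPECTED_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

@app.post("/generate", dependencies=[Depends(require_api_key)])
def generate(body: dict):
    # ... run the model as in Step 2 ...
    return {"output": "generated text goes here"}
```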
Real-World Use Cases
Developers use this setup across industries:
- Chatbots: Wrap LLMs with APIs for internal tools or customer support bots.
- Content Generation: Let marketing or creative teams access AI through a simple interface.
- Custom Classifiers: Expose a model trained for a specific task (e.g., sentiment detection) to your app frontend.
- Document Parsing: Use OCR + AI in a container that receives PDFs and returns extracted insights.