Deploying a FastAPI application that uses GPU acceleration (e.g. with PyTorch) can be achieved by containerizing the app and running it on a cloud GPU platform. In this guide, we’ll walk through building a simple FastAPI API that uses a GPU, packaging it into a Docker container with CUDA support, and deploying it on the Runpod GPU Cloud platform. We’ll cover code examples, Dockerfile setup, and detailed steps for deployment (via both Runpod’s web UI and CLI). Configuration tips and an FAQ are included to ensure your app runs securely and efficiently. Let’s dive in!
Building a FastAPI App with GPU Support (PyTorch Example)
First, create a basic FastAPI application that utilizes PyTorch to perform a simple inference on a GPU. This example will load a dummy PyTorch model and expose an endpoint for inference:
# app/main.py
from fastapi import FastAPI
from pydantic import BaseModel
import torch

# Define an input schema for the request
class InputData(BaseModel):
    values: list[float]  # e.g., a list of floats to feed into the model

# Initialize FastAPI app
app = FastAPI()

# Determine device: use GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a simple model (e.g., a single linear layer) and move it to the device
model = torch.nn.Linear(10, 1).to(device)
model.eval()

@app.post("/predict")
async def predict(data: InputData):
    # Convert the input list to a tensor and run it through the model
    x = torch.tensor(data.values, dtype=torch.float32).to(device)
    x = x.view(1, -1)  # reshape to [1, 10] if needed
    with torch.no_grad():
        output = model(x)
    result = output.cpu().item()  # move the result to CPU and get a Python number
    return {"result": result}
In this code:
- We use torch.cuda.is_available() to pick the device (GPU or CPU). When running on a GPU-enabled environment with the correct drivers, this should be True, and the model will run on the GPU.
- We define a FastAPI app with a single POST endpoint /predict that accepts a JSON body containing a list of floats (values) and returns a simple computed result.
- The model here is a placeholder (a single linear layer with 10 inputs and 1 output). In a real scenario, this could be a loaded pretrained model (e.g., a neural network for inference). You would load your actual model.pt or define your model architecture accordingly.
- The model is moved to the GPU device at startup, and we perform inference inside the endpoint. We use async def for the endpoint to allow FastAPI’s asynchronous request handling (FastAPI fully supports async endpoints for performing concurrent I/O or other awaitable tasks).
Tip: It’s a good idea to preload or “warm up” your model on startup. For example, you might call model(torch.randn(1,10).to(device)) once at launch to ensure any CUDA kernels are initialized before the first real request, which can help reduce the first-request latency.
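Here is a minimal sketch of such a warm-up, reusing the model and device objects from app/main.py above. It uses FastAPI's startup event hook (newer FastAPI releases also offer a lifespan handler, which works equally well):
# app/main.py (addition) -- warm up the model once when the server starts
@app.on_event("startup")
def warm_up_model():
    # One dummy forward pass with the same input shape the /predict endpoint uses,
    # so the CUDA context and kernels are initialized before the first real request.
    with torch.no_grad():
        model(torch.randn(1, 10).to(device))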
Containerizing the FastAPI App with CUDA and PyTorch
To deploy this app on a cloud GPU service, we should containerize it using Docker. The container needs to include the CUDA runtime libraries, PyTorch, and our FastAPI code (the NVIDIA driver itself is provided by the host). We’ll create a Docker image that has all dependencies and is configured to run our app with Uvicorn.
Dockerfile Example (with CUDA and PyTorch):
# Use an official PyTorch image with CUDA support as base (contains Python, CUDA, PyTorch)
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
# Install FastAPI and Uvicorn (and any other dependencies your app needs)
RUN pip install --no-cache-dir fastapi uvicorn
# Copy application code
WORKDIR /app
COPY app/ /app/
# Expose port 8000 (the port our FastAPI app will run on)
EXPOSE 8000
# Specify the entrypoint to run the app with Uvicorn (listening on all interfaces at port 8000)
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
A few notes on this Dockerfile:
- We chose an official PyTorch CUDA runtime base image (pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime). This image already has Ubuntu, Python, the CUDA 11.7 runtime libraries, cuDNN, and PyTorch 2.0.1 installed, which simplifies our setup. (You can adjust the tag to match the PyTorch/CUDA version you need; the NVIDIA driver comes from the host machine, not the image.)
- We install FastAPI and Uvicorn via pip. This ensures our API framework and ASGI server are available.
- We copy the application code into the image (assuming your FastAPI code is in an app/ directory with an __init__.py and the main.py we wrote).
- We expose port 8000 and set the container to run Uvicorn on startup. Important: We bind Uvicorn to 0.0.0.0 (all network interfaces) so that the FastAPI app is accessible from outside the container. If you bind to the default 127.0.0.1 inside the container, the service wouldn’t be reachable externally. The --port 8000 flag matches the port we’ll expose.
After creating the Dockerfile, build and test the image locally:
- Build the Docker image (run this in the directory containing the Dockerfile):
docker build -t fastapi-gpu-app .
- This will produce an image named fastapi-gpu-app. (Optionally tag it with a version or your Docker Hub username, e.g., myusername/fastapi-gpu-app:v1 if you plan to push to a registry.)
- Test locally (optional): If you have Docker set up with NVIDIA Container Toolkit for GPU access, you can test run the container on a GPU machine:
docker run --rm -p 8000:8000 --gpus all fastapi-gpu-app
- This maps container port 8000 to localhost:8000 and grants the container access to all GPUs on the host (--gpus all). Ensure that the NVIDIA drivers and Docker’s NVIDIA runtime are installed on your system for this to work. Inside the container, torch.cuda.is_available() should return True if configured correctly. You can then visit http://localhost:8000/docs to see the FastAPI auto-generated docs or send a test POST request to http://localhost:8000/predict with a JSON body (e.g. {"values": [0.1, 0.2, ..., 0.0]}) to see the model result, as in the example request after this list.
- If you don’t have a local GPU environment, you can still run the container without --gpus (it will use CPU), or proceed to deploy on Runpod’s cloud GPU which is our main goal.
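As a quick way to exercise the endpoint, here is a minimal request script (a sketch; it assumes the requests package is installed on your host and the container from the previous step is running):
# test_local.py -- send a sample request to the container mapped to localhost:8000
import requests

payload = {"values": [0.1] * 10}  # ten floats, matching the model's (10 -> 1) input size
resp = requests.post("http://localhost:8000/predict", json=payload, timeout=30)
print(resp.status_code, resp.json())  # expect 200 and a JSON body like {"result": ...}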
For more on writing Dockerfiles and customizing images, see Runpod’s Dockerfile guide. Once the image is built and working, push it to a container registry (such as Docker Hub or GitHub Container Registry) so that Runpod can pull it.
Deploying the Container on Runpod with GPU Access
With our container ready, we can deploy it on Runpod, a cloud platform that lets you run Docker containers on a variety of GPUs on-demand. We’ll cover both the Runpod web interface (console) method and the Runpod CLI method to launch a GPU pod running our FastAPI app.
Using the Runpod Web UI
- Log in to Runpod and create a new Pod: In the Runpod console, go to the Pods section and click “Deploy Pod”. This starts the setup process for a new container Pod (a Pod is a container instance with your specified resources).
- Select GPU type and cloud: Choose GPU (not CPU) for the pod type. You’ll be presented with GPU options like NVIDIA RTX A10, RTX 3090, L40S, etc. Select a GPU that suits your model’s needs and your budget (these GPUs have different performance and memory – e.g., an RTX 3090 has 24 GB of memory). You can also choose between Community Cloud (cheaper, spot-instance style) or Secure Cloud (more robust, static instances) as available on Runpod.
- Container image and settings: Provide the Docker image name for your FastAPI app. If you pushed to Docker Hub, this might be something like myusername/fastapi-gpu-app:latest. You can optionally specify a custom container start command or arguments, but since our image already defines the entrypoint (uvicorn ... command), we can leave the command blank to use the default.
- Expose the FastAPI port: In the Pod configuration, find the “Expose HTTP Ports” field and enter 8000 (the port our app listens on). This tells Runpod to expose port 8000 as a public HTTP endpoint. For example, if your Pod ID is abcd1234, after deployment you could access the API at a URL like https://abcd1234-8000.proxy.runpod.net (Runpod auto-generates a proxy address combining your pod ID and port). This proxy will route HTTP traffic to your container. (Behind the scenes, Runpod uses a proxy with a Cloudflare front-end; note that extremely long-running requests over 100 seconds may be cut off by the proxy, so for long jobs consider designing around that or use a direct TCP port.)
- Example: Configuring a Runpod Pod via the web UI with an image name and exposed port.
- Persistent storage (if needed): If your application needs to save files or store models, you can allocate a persistent volume. In the Pod setup, you’ll see options for Volume Disk size and Volume Mount Path (default is often /workspace or /runpod). Set these if you need your data to persist across restarts. For a basic inference service, you might not need an extra volume beyond the container’s 20 GB default disk, but if you expect to write outputs or logs that should survive pod restarts, consider using a volume. (Runpod also supports attaching a Network Volume to share data between pods.)
- Deploy: Give your pod a name (for your reference) and click Deploy. Runpod will pull your Docker image and start the container on the selected GPU. The first launch can take a bit of time if the image is large (several GB), since the image has to be downloaded to the host. Once it’s downloaded, the container will start and your FastAPI app will launch on the GPU.
- Access your API: When the pod is running, you can click the Connect button on the pod details. Since we exposed port 8000 as HTTP, you’ll get a proxy URL (as mentioned in step 4) to access the FastAPI service. Open the URL in a browser to see the FastAPI documentation (if not disabled) or send requests to your endpoints. For example, a GET request to https://<podID>-8000.proxy.runpod.net/ (with the proper pod ID) should hit the FastAPI root. A POST to /predict on that URL will invoke our model inference. You can also use tools like curl or Postman to test the endpoint, or a short script like the one shown after this list.
- Runpod also provides a live logs view in the web UI (and via CLI) so you can see the Uvicorn server logs and any printouts in real-time. This helps verify that the app started correctly. If something goes wrong, check the logs for errors (for instance, import errors or CUDA not found errors).
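Calling the deployed API from code looks the same as calling it locally, just with the proxy URL. The sketch below uses a hypothetical pod ID (abcd1234) and assumes the requests package:
# call_runpod.py -- send a request to the FastAPI app through Runpod's HTTP proxy
import requests

POD_ID = "abcd1234"  # hypothetical; replace with your actual pod ID
url = f"https://{POD_ID}-8000.proxy.runpod.net/predict"
payload = {"values": [0.1] * 10}
resp = requests.post(url, json=payload, timeout=60)
print(resp.status_code, resp.json())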
Using the Runpod CLI (runpodctl)
For automation or integration into scripts, Runpod offers a CLI tool, runpodctl. After [installing and configuring runpodctl with your API key](https://docs.runpod.io/runpodctl/install), you can create and manage pods from your terminal.
Here’s an example command to deploy the FastAPI GPU container via CLI:
runpodctl create pod \
--name fastapi-gpu-app \
--imageName myusername/fastapi-gpu-app:latest \
--gpuType "NVIDIA GeForce RTX 3090" \
--gpuCount 1 \
--containerDiskSize 20 \
--ports "8000/http"
Let’s break down the important flags used above (there are more available, see Runpod’s CLI reference):
- --imageName: The Docker image to run (including tag). Ensure this image is accessible (public on Docker Hub or a registry you’ve connected credentials for).
- --gpuType: The GPU type (as a string). In this example, we specified RTX 3090. You could use other IDs like "NVIDIA RTX A10" or "NVIDIA L40S" depending on availability. The string must match one of the supported GPU names on Runpod.
- --gpuCount: Number of GPUs (default is 1; you can deploy multi-GPU pods if your workload can leverage them).
- --containerDiskSize: Disk size in GB for the container (default 20 GB). Increase this if your image or runtime needs more space.
- --ports: Ports to expose. We use "8000/http" to expose port 8000 via HTTP proxy (as opposed to a raw TCP port). Only one HTTP port is allowed per pod. If you needed a raw TCP port instead, you could specify e.g. "8000/tcp" here (or even both an HTTP and a TCP port, up to one each).
You can also specify environment variables with --env KEY=value if needed (e.g., secrets or configurations), request a persistent volume with --volumeSize and --volumePath, and set a maximum price or specific cloud region. See the Runpod CLI docs for more details on these flags.
After running the runpodctl create pod command, it will return information about the created pod, including its Pod ID. You can then use runpodctl get pods to list pods and see status. Once the pod status is Running, retrieve the connect info:
runpodctl get pod --id <YOUR_POD_ID>
This will show details including the exposed endpoint. For an HTTP port, you’ll typically see the proxy URL (as discussed earlier). You can then interact with your FastAPI app at that URL.
(Alternatively, Runpod has an API and SDKs to programmatically launch pods and get URLs, which can integrate with your CI/CD or application code. But using the CLI covers most use cases for now.)
Configuration Tips for FastAPI on GPUs in Production
When deploying a FastAPI application with GPU access, keep these best practices and configuration tips in mind:
- Port and Networking: Always bind your FastAPI/Uvicorn host to 0.0.0.0 inside the container, and expose the correct container port on Runpod. Runpod’s proxy will provide a public URL for the port. Remember that the external URL includes the internal port number (e.g., -8000.proxy.runpod.net for port 8000). If you change the app port, update it in both the Dockerfile/command and the Runpod settings accordingly.
- Environment Variables: Use environment variables to configure your app at runtime. For example, you might toggle debug mode, set a model path, or configure a cache directory via env vars. You can supply these through Runpod’s interface or CLI (--env). FastAPI itself can read env vars or you can integrate something like Pydantic’s settings management. Keep secrets (like API keys) out of your code and use env vars or Runpod’s secrets management features (see the settings sketch after this list).
- Security Hardening: In production, consider disabling FastAPI’s interactive docs and redoc (especially if your API is public). This can be done by initializing FastAPI with docs_url=None and redoc_url=None. Also, if your application will be accessed from a browser or other domains, configure CORS properly. FastAPI allows adding CORS middleware easily:
from fastapi.middleware.cors import CORSMiddleware
app = FastAPI()
app.add_middleware(CORSMiddleware, allow_origins=["https://your-frontend.com"], allow_methods=["*"], allow_headers=["*"])
- Adjust the origins and methods as needed. This ensures only allowed domains can use your API from a frontend.
- Logging and Monitoring: By default, Uvicorn will log requests and errors to stdout. These logs appear in Runpod’s pod logs (accessible via the web UI or runpodctl logs). Ensure that any important info or errors are indeed logged to stdout/stderr so you can troubleshoot in the cloud. You might want to increase log level or structure logs in JSON for easier parsing. Also, be mindful of sensitive information in logs if your app is public.
- Performance Optimization: Running on GPU means you want to maximize throughput:
- Use async endpoints in FastAPI for handling network or I/O-bound portions of requests (FastAPI and Uvicorn are asynchronous and can handle many concurrent requests efficiently). In our example, the inference itself is CPU/GPU bound. Note that a plain synchronous model call inside an async def endpoint blocks the event loop for the duration of the forward pass; if that matters for your load, either declare the endpoint with a regular def (FastAPI then runs it in a threadpool, and PyTorch releases the GIL during GPU work, so other requests can still be served) or offload the call, as discussed in the FAQ below.
- If you expect high load, consider running multiple Uvicorn worker processes (especially if the CPU can handle more concurrency or if you want to utilize multiple CPU cores for preprocessing). This can be done by using Gunicorn with Uvicorn workers or specifying the --workers flag in Uvicorn. For example, in Docker CMD: ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]. Multiple workers can increase throughput on CPU-bound parts and handle more concurrent connections, but if they share a single GPU, you need to ensure the model can be loaded in each worker (which doubles memory usage) or use a strategy to share the model (more advanced).
- Warm-up the model at startup (as noted earlier) so that the first request doesn’t pay the overhead of CUDA context initialization or lazy loading.
- Take advantage of persistent pods if you need to keep the app running to serve traffic with minimal cold starts. Runpod pods will keep running until you stop them (billing by the second), so your service can remain available. If you instead use a serverless approach, you might have cold starts; but in this guide, we focused on a dedicated pod.
- Persistent Storage & Models: If your model weights are large and you prefer not to bake them into the Docker image (to keep the image smaller), you can mount a volume and download the model on startup. Runpod volumes (either the built-in container disk or attached volumes) can be used to store these weights. Just be aware of the trade-off: downloading at startup vs. a larger image. For private models, you may need to include credentials or download from a secure source. In many cases, bundling the model in the image (or a base image) is simplest for deployment. (A sketch of the download-at-startup pattern also follows this list.)
- Scaling and Clusters: For higher availability or scaling out, you could deploy multiple pods (each running the same container) behind an external load balancer or traffic manager. Runpod doesn’t automatically load balance multiple pods for you, but you can manage this at the application level. Alternatively, look into Runpod’s Instant Clusters or orchestration using their API if you want to automate spinning up pods on demand.
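As promised in the environment-variables tip above, here is one possible settings sketch using Pydantic. With Pydantic v2 the BaseSettings class lives in the separate pydantic-settings package (pip install pydantic-settings); on Pydantic v1 it is imported from pydantic directly. The field names and defaults below are purely illustrative:
# app/config.py -- typed settings read from environment variables (illustrative sketch)
from pydantic_settings import BaseSettings  # on Pydantic v1: from pydantic import BaseSettings

class Settings(BaseSettings):
    weights_path: str = "/workspace/model.pt"  # override with WEIGHTS_PATH=...
    debug: bool = False                        # override with DEBUG=true

settings = Settings()  # values are picked up from the environment at import time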
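And here is a rough sketch of the download-at-startup pattern mentioned in the storage tip. The URL and volume path are placeholders, and the weights format (a state_dict saved with torch.save) is an assumption; adapt it to wherever your model actually lives:
# app/weights.py -- fetch model weights onto the mounted volume if they are missing (sketch)
import os
import urllib.request

import torch

WEIGHTS_URL = "https://example.com/model.pt"  # placeholder; point at your own storage
WEIGHTS_PATH = "/workspace/model.pt"          # assumes a volume mounted at /workspace

def load_weights(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # Download once; later pod restarts reuse the file already on the persistent volume.
    if not os.path.exists(WEIGHTS_PATH):
        os.makedirs(os.path.dirname(WEIGHTS_PATH), exist_ok=True)
        urllib.request.urlretrieve(WEIGHTS_URL, WEIGHTS_PATH)
    state_dict = torch.load(WEIGHTS_PATH, map_location=device)
    model.load_state_dict(state_dict)
    return model.to(device).eval()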
By following these tips, you’ll ensure your FastAPI application is robust, secure, and performant on the cloud GPU instance.
FAQ: FastAPI with GPU on Runpod
Q: How can I verify that my FastAPI app is indeed using the GPU?
A: One simple way is to call torch.cuda.is_available() in your application code and log or return the result. For example, you could add a temporary endpoint in FastAPI:
@app.get("/gpu-check")
def gpu_check():
    return {
        "gpu_available": torch.cuda.is_available(),
        "device_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU",
    }
Deploy your app and send a request to /gpu-check. It should respond with gpu_available: true and the name of the GPU (e.g., “NVIDIA GeForce RTX 3090”) if everything is set up correctly. You can also check the server logs on startup – printing torch.cuda.is_available() or using nvidia-smi inside the container (if installed) are good indicators. On Runpod, if you selected a GPU and your image has CUDA and PyTorch, it should show as true. (Keep in mind that if you accidentally ran the container on a CPU instance or without the NVIDIA runtime, it would return false. On Runpod GPUs, no special --gpus flag is needed – the platform attaches the GPU automatically.)
Q: Does FastAPI support async inference calls?
A: Yes. FastAPI is built on Starlette, an ASGI framework, and it fully supports asynchronous endpoints. You can define your route functions with async def and perform async I/O operations within them. In the context of inference, the heavy computation (the model forward pass) is a synchronous call to PyTorch (which will utilize the GPU). You cannot await a pure computation directly: a blocking model call inside an async def endpoint ties up the event loop until it finishes, so to keep serving other requests while the GPU works you can offload the call with something like asyncio.to_thread (sketched below) or declare the endpoint with a plain def so FastAPI runs it in a threadpool. If you have long-running inference and don’t want to tie up the request at all, you might also explore FastAPI’s BackgroundTasks or an external task queue. For most cases, though, you can write your inference code in the request handler directly (as we did above) – FastAPI will handle it just fine. Just be careful to avoid global locks and ensure thread-safety if using multiple workers. In summary, FastAPI + Uvicorn can handle multiple requests concurrently, and async endpoints are recommended for any I/O-bound parts of the workflow.
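For illustration, here is a minimal offloading sketch that builds on the example app from earlier (it assumes the same app, InputData, model, and device objects; the /predict-async route name is just for the example, and asyncio.to_thread requires Python 3.9+):
# app/main.py (addition) -- run the blocking forward pass in a worker thread
import asyncio

@app.post("/predict-async")
async def predict_async(data: InputData):
    def run_inference() -> float:
        x = torch.tensor(data.values, dtype=torch.float32).to(device).view(1, -1)
        with torch.no_grad():
            return model(x).cpu().item()

    # While the thread waits on the GPU, the event loop is free to serve other requests.
    result = await asyncio.to_thread(run_inference)
    return {"result": result}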
Q: How long does a container take to start on Runpod?
A: Startup times on Runpod depend on a couple of factors: primarily the image size and the container initialization. For a typical PyTorch + CUDA image (which can be a few GB), the first time your pod is scheduled on a new node, it may take roughly 30 seconds to a few minutes to pull the image and start. In many cases, users report pods becoming ready in ~1-2 minutes for medium-sized images. Subsequent starts (if the image is cached on the node) are faster. Runpod charges by the second, and the billing starts when the pod status is initializing/running, so you’re paying for that startup time as well – but it’s usually quite short compared to, say, training jobs. If your container has a very large image (10+ GB) or downloads huge files at startup, that will add to the time. The Community Cloud pods might queue for a bit if capacity is limited, whereas Secure Cloud pods start immediately when capacity is available. Overall, expect on the order of a minute for the pod to be up and your FastAPI app responsive. It’s always good to test the startup to know the behavior; you can use the Runpod CLI or API to measure how long it takes from create pod to first successful response. If you need near-instant availability, keep the pod running or consider strategies to handle cold start (like a loading screen or warm-up requests).
Deploying a FastAPI application with GPU access is straightforward with the right approach. We built a simple example with PyTorch and FastAPI, containerized it with a proper CUDA-enabled environment, and deployed it on Runpod’s GPU cloud infrastructure. By containerizing your app, you ensure consistency between your local tests and the cloud deployment. Runpod’s platform takes care of provisioning the GPU and exposing your service, so you can focus on your application logic.
Next Steps: You can expand this approach to deploy more complex models (e.g. Transformers, image generation, etc.), add authentication to your API, or even serve multiple models via different endpoints. For more information on container deployments, check out Runpod’s documentation on container networking and ports, as well as its Dockerfile best practices. When you’re ready to scale up or deploy other projects, the Runpod GPU Cloud offers a variety of GPU options and pricing to fit your needs. Happy deploying!