
Guides
May 16, 2025

How to Serve Phi-2 on a Cloud GPU with vLLM and FastAPI

Emmett Fear
Solutions Engineer

Small, efficient language models can often surprise you with their capabilities. Phi-2 is a 2.7 billion-parameter model from Microsoft that has shown nearly state-of-the-art performance among models under 13B parameters. It’s a great candidate for deployment when you need solid performance but want to keep the resource footprint low. In this article, we’ll walk through how to serve the Phi-2 model on a cloud GPU using vLLM (an advanced LLM inference engine) and FastAPI (a lightweight web framework) to create a robust API endpoint. The combination of vLLM and FastAPI lets us maximize inference speed and handle requests efficiently, all while keeping the service easy to interact with.

By the end, you’ll have a blueprint for deploying not just Phi-2, but potentially any Hugging Face-compatible model with a similar approach. This is geared toward intermediate engineers who have some familiarity with Python web APIs and want to step into AI model serving.

Why Choose vLLM for Inference?

vLLM is an open-source library designed for high-performance LLM inference. One of its key features is efficient use of GPU memory via a technology called PagedAttention, which allows handling longer context and batching multiple requests without blowing up memory. Essentially, vLLM can serve more requests in parallel and utilize your GPU more fully than naive approaches. For serving a model like Phi-2, which is relatively small by today’s standards, vLLM ensures we squeeze out maximal throughput – meaning it can handle many Q&A requests per second if needed.

In contrast, using the model through a standard transformer pipeline might not leverage these optimizations. vLLM also provides an OpenAI-compatible API out of the box (making it easy to swap into existing apps that expect an OpenAI API), but here we’ll integrate it with our own FastAPI app for full control.
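For reference, vLLM’s bundled OpenAI-compatible server can be started with a single command (exact flags vary by vLLM version, and newer releases also ship a vllm serve CLI that does the same thing):

python -m vllm.entrypoints.openai.api_server --model microsoft/phi-2 --port 8000

We won’t use that server here, but it’s a quick way to confirm the model runs before wiring up a custom app.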

Setting Up the RunPod Environment

To get started, set up a GPU environment on RunPod:

  1. Launch a GPU Pod: Select a GPU type that fits the model. Phi-2 at 2.7B parameters will run on practically any modern GPU – even an 8GB or 12GB card should do in half-precision, though for comfort a 16GB or larger card is ideal (giving room for bigger batches or longer prompts). Using RunPod’s template library, pick a base that has the essentials. A good choice is the “PyTorch” or “Transformers” template, which comes with CUDA, PyTorch, etc., pre-installed. This saves you from installing GPU drivers or base libraries. Ensure the pod has internet access initially (to download model weights).

  2. Install vLLM and FastAPI: Once your pod is up, open a terminal (or use the Jupyter/VSCode interface if provided). Install the required Python packages:

pip install vllm fastapi uvicorn[standard]
This will install vLLM, FastAPI, and Uvicorn (an ASGI server to run the FastAPI app). You might also need to install transformers if vLLM doesn’t pull it in, and torch (PyTorch) should already be present via the template.

  3. Download the Phi-2 Model: There are a couple of ways. The simplest is to use Hugging Face’s APIs. You can either:

    • Use transformers to download: e.g., in Python, AutoModelForCausalLM.from_pretrained("microsoft/phi-2"). But since we plan to use vLLM’s optimized loader, we might not need to explicitly load via transformers.

    • vLLM can accept a model name and handle the download internally the first time. For example, vLLM’s API server can be pointed at "microsoft/phi-2" and it will fetch the weights. Check vLLM documentation for the exact usage. If needed, do a quick dry run to cache the model: python -c "from vllm import LLM; llm = LLM(model='microsoft/phi-2')" – this will download weights to the local cache.

    • Ensure you have enough disk space for the model (Phi-2 is 2.7B parameters, so FP16 weights are roughly 5.4 GB, plus the tokenizer and other files). The default container disk might be small, but you can increase it in RunPod settings if required.

  4. Set up the FastAPI app: We will create a simple FastAPI application that exposes an endpoint (say, POST /generate) where users can send a prompt and get the model’s completion. In a file (for example, app.py), write something like:

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Initialize the vLLM runtime with the Phi-2 model
llm = LLM(model="microsoft/phi-2")

class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 100

@app.post("/generate")
def generate_text(request: GenerationRequest):
    # Use vLLM to generate text
    outputs = llm.generate([request.prompt], 
                            sampling_params=SamplingParams(max_tokens=request.max_tokens))
    # llm.generate returns a list of outputs (one per prompt)
    result = outputs[0].outputs[0].text  # get the text of the first (and only) prompt
    return {"completion": result}
This is a basic example. We create an LLM instance from vLLM pointed at Phi-2. When a request comes in, we call llm.generate with the provided prompt. The SamplingParams can be adjusted – here we just set max_tokens from the request (default 100); a sketch with extra sampling options follows after the walkthrough. We then return the generated text in a JSON response.

  5. Run the server: Start Uvicorn to serve this FastAPI app:

uvicorn app:app --host 0.0.0.0 --port 8000
This will launch the server on port 8000. On RunPod, ensure you’ve exposed port 8000 in the pod settings (if you launched via a template, you can add an exposed port either at launch or via the UI). If not, you may need to restart the pod with port 8000 open. Once running, you can hit the endpoint from within the pod. For example, open a web terminal on the pod and run:
curl -X POST -H "Content-Type: application/json" \
     -d '{"prompt": "Hello, how are you?", "max_tokens": 50}' \
     http://localhost:8000/generate
This should return a JSON response with a "completion" field containing Phi-2’s answer. If you have global networking enabled (or access the pod via a forwarded port), you can call this API from outside as well.

At this point, you have a working API serving Phi-2! 🎉 The FastAPI app handles HTTP requests and uses vLLM under the hood for efficient inference.
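If you want callers to control decoding behavior, the request model can be extended with a couple of standard sampling knobs. Below is a minimal sketch that replaces the request model and endpoint in the app.py above; the temperature and top_p fields (and their defaults) are our own choices, mapped onto vLLM’s SamplingParams:

from pydantic import BaseModel
from vllm import SamplingParams

class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7   # higher = more varied output (illustrative default)
    top_p: float = 0.95        # nucleus sampling cutoff (illustrative default)

@app.post("/generate")
def generate_text(request: GenerationRequest):
    # Map the request fields onto vLLM's sampling parameters
    params = SamplingParams(
        max_tokens=request.max_tokens,
        temperature=request.temperature,
        top_p=request.top_p,
    )
    outputs = llm.generate([request.prompt], sampling_params=params)
    return {"completion": outputs[0].outputs[0].text}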

Sign up for RunPod to deploy workflows like this quickly on cloud GPUs. With an account, you can spin up the environment, install vLLM, and serve models like Phi-2 without the usual infrastructure hassle.

Advantages of This Setup

  • Efficiency: vLLM will manage GPU memory and inference optimizations, allowing you to serve multiple requests concurrently and utilize batching. This means if two users hit your /generate endpoint at the same time, vLLM can efficiently handle them together rather than strictly one-by-one, improving throughput.

  • Flexibility: Using FastAPI gives you full control over the API. You can implement authentication, input validation (we used Pydantic’s BaseModel for the request), custom logic (pre/post-processing), and even additional endpoints (e.g., a health check or a batch generation endpoint); a short sketch of a health check and a simple API-key check follows this list.

  • Scalability: If you later decide to scale out, you could run this container on multiple pods and put a load balancer in front (or use RunPod’s serverless endpoints to scale automatically). The separation of concerns (serving logic vs. model inference) makes it easier to manage scaling – vLLM takes care of using one machine’s resources well, and FastAPI plays nice with external scaling tools.

  • Ease of Development: You can test this setup locally (on a smaller model or CPU) by running the FastAPI with vLLM on your machine. It will behave the same on RunPod with a GPU, just faster.
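As an example of the flexibility point above, a health-check route and a header-based API-key check take only a few lines on top of the app.py from earlier. The X-API-Key header, the API_KEY environment variable, and the /generate_secure route name are illustrative choices, not a prescribed pattern:

import os
from fastapi import Header, HTTPException

API_KEY = os.environ.get("API_KEY", "change-me")  # illustrative; set a real key via environment variable

@app.get("/health")
def health():
    # Lightweight liveness check that doesn't touch the GPU
    return {"status": "ok"}

@app.post("/generate_secure")
def generate_secure(request: GenerationRequest, x_api_key: str = Header(default="")):
    # Reject requests that don't carry the expected X-API-Key header
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    outputs = llm.generate([request.prompt],
                           sampling_params=SamplingParams(max_tokens=request.max_tokens))
    return {"completion": outputs[0].outputs[0].text}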

Frequently Asked Questions

How do I handle the model download and loading time?

The first time you run LLM(model="microsoft/phi-2"), vLLM will load the model weights. This can take some time (tens of seconds up to a couple of minutes, depending on size and disk speed). It’s a one-time hit per session. To avoid this impacting your API users, you should initialize the LLM at app startup (as we did in the code, globally). That way, when the first request comes in, the model is already loaded into memory. If you restart the pod, it will have to load again – consider keeping the pod running to serve traffic continuously if you have steady usage. If using a serverless approach, cold starts will incur that loading time; for a 2.7B model it’s not too bad, but it’s still noticeable. One strategy to mitigate cold start issues is warming up the endpoint (trigger a dummy request right after deployment to force the model to load before real users hit it).
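One way to do that warm-up from inside the app itself is a startup hook that runs a tiny dummy generation, so the first real request doesn’t pay any first-run overhead. A minimal sketch, assuming the llm object from the app.py above (on newer FastAPI versions a lifespan handler plays the same role):

from vllm import SamplingParams

@app.on_event("startup")
def warm_up_model():
    # The model is already loaded when LLM(...) runs at import time;
    # this just exercises the generation path once before real traffic arrives.
    llm.generate(["Hello"], sampling_params=SamplingParams(max_tokens=1))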

What if I want to use a different model or multiple models?

vLLM and FastAPI aren’t tied to Phi-2 specifically. You can replace "microsoft/phi-2" with any model name on Hugging Face (or a path on disk to local weights). Just ensure your GPU has enough VRAM for the model you choose. For larger models (like 13B or 30B), you might need a higher-memory GPU and possibly adjust vLLM settings (for example, splitting the model across multiple GPUs or offloading part of the weights to CPU memory, which slows things down). If you want to serve multiple models, you have a few options:

  • Load multiple LLM instances, each with a different model, in the same FastAPI app under different endpoints (e.g., /generate_phi2 vs /generate_othermodel); see the sketch after this list. This obviously consumes more memory, since both models reside in GPU memory at once.

  • Run separate pods for different models, which is cleaner in terms of isolation.

  • Or load a big model and a small model together if the GPU can handle it, etc. vLLM’s LLM class is for one model at a time, so you’d instantiate two LLM objects if doing multi-model in one app.
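To make the first option concrete, here’s a rough sketch of two LLM instances behind separate endpoints. The second model name is a placeholder for whatever else you’d serve, and the gpu_memory_utilization values are illustrative – by default each vLLM engine tries to claim most of the GPU, so two engines in one process need to share:

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Two independent vLLM engines, each holding its own model in GPU memory
phi2 = LLM(model="microsoft/phi-2", gpu_memory_utilization=0.4)
other = LLM(model="org/other-model", gpu_memory_utilization=0.4)  # placeholder model name

class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 100

def run(engine: LLM, req: GenerationRequest):
    outputs = engine.generate([req.prompt],
                              sampling_params=SamplingParams(max_tokens=req.max_tokens))
    return {"completion": outputs[0].outputs[0].text}

@app.post("/generate_phi2")
def generate_phi2(req: GenerationRequest):
    return run(phi2, req)

@app.post("/generate_othermodel")
def generate_othermodel(req: GenerationRequest):
    return run(other, req)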

How does vLLM compare to vanilla Hugging Face Transformers pipeline in serving?

In terms of serving, vLLM is specialized for optimized inference. It can do things like:

  • Dynamically batch incoming requests (even if prompts differ) to utilize GPU parallelism.

  • Manage KV-cache memory in paged blocks (PagedAttention) to maximize throughput for generation.

  • Stream output token by token (not covered here, but vLLM supports streaming similar to OpenAI’s API).

A vanilla Transformers pipeline would load the model (possibly with half precision if you set it) and generate outputs one by one. It might not batch different requests unless you code it to do so. For a low request volume, you might not notice a huge difference – Phi-2 will respond quickly either way for single users. But under load, vLLM will shine by serving more users efficiently. Additionally, vLLM is often more memory-efficient with large models due to its paging technique.
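For comparison, the vanilla Transformers route looks roughly like this – simple and fine at low volume, but each call is handled on its own unless you add batching yourself. The dtype and device settings are typical choices rather than requirements, and depending on your transformers version Phi-2 may require trust_remote_code=True:

import torch
from transformers import pipeline

# Plain Hugging Face pipeline: load Phi-2 in half precision on the available GPU
generator = pipeline(
    "text-generation",
    model="microsoft/phi-2",
    torch_dtype=torch.float16,
    device_map="auto",
)

result = generator("Hello, how are you?", max_new_tokens=50)
print(result[0]["generated_text"])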

How can I scale this API for more users?

There are a few ways:

  • Vertically: Get a bigger GPU or a more powerful pod. If you find the GPU is at 100% utilization or memory full with your current setup, you might move to a higher tier GPU. This is limited though – there’s a cap to one GPU’s capacity.

  • Horizontally: Run multiple instances of this service. You could start multiple pods on RunPod each running the FastAPI+vLLM app. Then use a simple round-robin scheme or a load balancer (even something like Nginx or HAProxy) to distribute requests among them. RunPod might not provide an out-of-the-box load balancer for pods, so you’d manage that externally or via your application logic.

  • Serverless scaling: Package the logic into a RunPod Serverless Endpoint. RunPod can then scale out workers in response to load. However, as mentioned, the initial model load time in a serverless worker might be a factor. For truly high throughput, persistent pods might be more predictable.

  • Batch requests on the client side: If you have a scenario where a single client wants to ask multiple questions at once, design an endpoint that accepts a batch of prompts in one request. vLLM can handle a list of prompts in its llm.generate and will process them together, which is more efficient than doing each separately. We did llm.generate([request.prompt], ...) for one prompt; you could allow prompt to be a list and pass it through to generate.
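A rough sketch of that batch endpoint, reusing the llm object from the app.py above (the /generate_batch route name and request shape are our own choices):

from typing import List
from pydantic import BaseModel
from vllm import SamplingParams

class BatchGenerationRequest(BaseModel):
    prompts: List[str]
    max_tokens: int = 100

@app.post("/generate_batch")
def generate_batch(request: BatchGenerationRequest):
    # vLLM processes the whole list of prompts together in one call
    outputs = llm.generate(request.prompts,
                           sampling_params=SamplingParams(max_tokens=request.max_tokens))
    # One completion per prompt, in the same order as the input list
    return {"completions": [o.outputs[0].text for o in outputs]}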

How do I ensure the FastAPI app stays responsive?

One thing to be mindful of: by default, Uvicorn with FastAPI will use a single process, and the llm.generate call will use the GPU and possibly block that worker until completion. If you expect long-running requests (say someone asks for a very long completion), it could hold up that process. FastAPI (through Uvicorn/Starlette) can handle multiple requests concurrently via asyncio, but if the Python code is busy (especially in GIL-heavy operations, or waiting on GPU), it might not start processing another request until the current one finishes. To mitigate this:

  • You could run Uvicorn with multiple workers (uvicorn app:app --workers 2 --host 0.0.0.0 --port 8000), which starts multiple processes. However, note that each worker would load its own model instance – meaning higher memory usage and potentially redundant computation. Two copies of a large model generally won’t fit in one GPU’s memory at all; two copies of Phi-2 might fit on a 40GB GPU, but it’s wasteful. So multiple workers on one GPU doesn’t make much sense here; it’s better to handle concurrency within one process if possible.

  • vLLM’s internal server (if you used it via the OpenAI API mode) inherently handles multiple streams. Since we’re using it programmatically, the key is to lean on its batching: vLLM can process many prompts together when they arrive in the same llm.generate call. In a single FastAPI worker, requests are handled asynchronously, so if one request is waiting on llm.generate, another could come in and call llm.generate too. vLLM’s LLM class may not be thread-safe for simultaneous calls, so this needs careful handling – for example, funnel all generation through a single async task or queue, or guard generate calls with an asyncio lock (a sketch follows this list).

  • Simpler alternative: keep requests short and accept that they are handled roughly one at a time. For most typical use (one request returns before the next starts), this works fine.
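One way to keep the event loop responsive while still serializing GPU work is to run generation in a worker thread and guard it with a lock. A sketch of that idea, assuming the llm object and GenerationRequest model from the app.py above; whether the lock is strictly necessary depends on your vLLM version, so treat this as a conservative pattern rather than the official one:

import asyncio
from vllm import SamplingParams

# Serialize access to the engine and keep blocking generate calls off the event loop
generate_lock = asyncio.Lock()

@app.post("/generate_async")
async def generate_async(request: GenerationRequest):
    async with generate_lock:
        # Run the blocking generate call in a thread so other requests can still be accepted
        outputs = await asyncio.to_thread(
            llm.generate,
            [request.prompt],
            SamplingParams(max_tokens=request.max_tokens),
        )
    return {"completion": outputs[0].outputs[0].text}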

What about model outputs and formatting?

Phi-2 is a general model (trained on lots of Q&A and code). Depending on your use case, you might want to format prompts or outputs. For example, to ensure the model follows a certain style, you could include a system prompt or some few-shot examples. Since we have full control in FastAPI, you can bake in a prompt template, e.g., always prepend "Instruction: {user prompt}\nResponse:" to the user prompt before sending it to llm.generate, if that improves model behavior. You can also post-process the output (strip extra whitespace, etc.). If the application requires it, you might also implement safety checks on the output (Phi-2 is open, so if you worry about offensive content, you might add a simple filter or use another model to classify the response).
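A small sketch of that kind of templating and post-processing inside an endpoint, using the Instruction/Response wrapper mentioned above (the exact template is something you’d tune for your use case, and the route name is arbitrary):

PROMPT_TEMPLATE = "Instruction: {prompt}\nResponse:"

@app.post("/generate_templated")
def generate_templated(request: GenerationRequest):
    # Wrap the raw user prompt in a fixed template before generation
    full_prompt = PROMPT_TEMPLATE.format(prompt=request.prompt)
    outputs = llm.generate([full_prompt],
                           sampling_params=SamplingParams(max_tokens=request.max_tokens))
    # Light post-processing: trim surrounding whitespace from the completion
    return {"completion": outputs[0].outputs[0].text.strip()}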

By serving Phi-2 on a cloud GPU with this stack, you get a nimble yet powerful API. The model’s smaller size keeps costs low and speed high, while vLLM squeezes out performance, and FastAPI makes it accessible over HTTP. This general pattern can be reused for many models and use-cases – so it’s a great addition to your AI engineering toolbox. Happy deploying!
