How to Run StarCoder2 as a REST API in the Cloud
StarCoder2 is an open-source code generation model family from the BigCode project, available in three sizes (3B, 7B, and 15B parameters). The largest, StarCoder2-15B, delivers strong coding capabilities with a 16k-token context window – ideal for tasks like code completion and generating functions. In this article, we’ll show how to deploy StarCoder2 as a RESTful API service on a cloud GPU, so developers can send code prompts to the model and receive generated code suggestions over HTTP.
Sign up for RunPod to deploy this workflow in the cloud with GPU support and custom containers.
Preparing the Model and Environment
Getting the model: StarCoder2’s weights are hosted on Hugging Face. You’ll need to download the model (for example, the 15B variant) and have it accessible in your environment. Since the model is large (the 15B version is around 32 GB in memory at float16 precision), using a GPU with sufficient VRAM is important. You might start with the 7B model if you have a smaller GPU, as it will require roughly half the memory of the 15B.
On RunPod, you can launch a GPU instance with high VRAM (for instance, an NVIDIA A100 40GB or an L40 with 48GB, depending on availability) for the 15B model. Alternatively, you can employ model optimizations:
- 8-bit or 4-bit loading: Using libraries like bitsandbytes, you can load the model in 8-bit or 4-bit precision, dramatically reducing memory usage. For example, the 15B model that uses ~32 GB in full precision shrinks to about 16 GB in 8-bit mode, and to around 9 GB in 4-bit mode. This lets you run StarCoder2-15B on a single 16 GB GPU (with 4-bit quantization) or a 24 GB GPU (with 8-bit precision). See the loading sketch after this list.
- Sharded loading: Hugging Face’s Transformers library can automatically shard a large model across GPUs if you have more than one available. However, in a cloud context, it’s often simpler and more cost-effective to use a single powerful GPU rather than multiple smaller GPUs.
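As a rough sketch, quantized loading with Transformers and bitsandbytes might look like the following; the quantization settings shown are common starting points rather than tuned recommendations:

```python
# Sketch: load StarCoder2-15B in 4-bit precision via bitsandbytes to cut VRAM usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # weights stay 4-bit, compute runs in fp16
)

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b")
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b",
    device_map="auto",                 # place layers on the available GPU(s)
    quantization_config=quant_config,
)
```

Switching to 8-bit is a one-line change (load_in_8bit=True in place of the 4-bit settings).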
Setting up the API server: To serve StarCoder2 via REST API, you’ll want to create a small web server that loads the model and handles inference requests. A common approach is to use FastAPI (Python) or Flask to set up an endpoint. For instance, you could write a Python script that does the following:
- Load the model and tokenizer at startup (for example with AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-15b", device_map="auto", torch_dtype=torch.float16) for automatic GPU placement).
- Define a POST endpoint (e.g., /generate_code) that accepts a JSON payload with a prompt (and any generation parameters like max length) and returns the model’s generated code completion.
- Within that endpoint, implement the generation call. Using the Transformers pipeline is a convenient option: e.g., generator = pipeline("text-generation", model=model, tokenizer=tokenizer, ...); then result = generator(prompt, max_new_tokens=128, do_sample=False), and return the resulting text.
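Put together, a minimal app.py might look like this sketch; the endpoint name, request fields, and response shape are illustrative choices rather than fixed conventions:

```python
# Minimal sketch of a FastAPI server wrapping StarCoder2 code generation.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_ID = "bigcode/starcoder2-15b"

# Load once at startup so every request reuses the weights already on the GPU.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.float16
)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

api = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@api.post("/generate_code")
def generate_code(req: GenerateRequest):
    result = generator(req.prompt, max_new_tokens=req.max_new_tokens, do_sample=False)
    # The pipeline returns the prompt plus completion in "generated_text".
    return {"completion": result[0]["generated_text"]}
```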
Ensure you set the server to run with Uvicorn or Gunicorn (if using FastAPI) and bind to the appropriate host/port. In a Dockerfile, you might use an entrypoint like uvicorn app:api --host 0.0.0.0 --port 8000. This will start the service inside a container.
Containerizing the service: Create a Docker image that contains all necessary dependencies:
- Base it on a CUDA-enabled base image (e.g., nvidia/cuda:12.2.0-runtime-ubuntu20.04 or an official PyTorch image that includes CUDA support) so that GPU usage is enabled.
- Install Python libraries: Transformers, FastAPI, Uvicorn, and bitsandbytes (if using 8-bit or 4-bit loading), among others.
- Copy your model weights into the image (or better, mount a volume at runtime with the model files to avoid a huge image size) and copy your API server code.
- Expose port 8000 (or whichever port you chose for the API).
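A minimal Dockerfile along these lines might look like the following sketch; the base image tag and package list are assumptions to adjust for your setup, and the model weights are expected to be mounted as a volume rather than baked into the image:

```dockerfile
# Illustrative Dockerfile sketch for the StarCoder2 API service.
FROM nvidia/cuda:12.2.0-runtime-ubuntu20.04

RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir torch transformers accelerate bitsandbytes fastapi uvicorn

WORKDIR /app
COPY app.py /app/app.py

EXPOSE 8000
CMD ["uvicorn", "app:api", "--host", "0.0.0.0", "--port", "8000"]
```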
Build and test this image locally if possible, to ensure that the model can load and the API responds.
Deploying on RunPod and Exposing the API
With the container ready, deploying on RunPod is straightforward:
- Upload the image or use a registry: Push your Docker image to a registry (like Docker Hub). In RunPod, you can specify a custom container by providing the image name. Alternatively, use RunPod’s template library – if there’s an existing template for a similar setup (for example, a template with Hugging Face Transformers installed), you can start from that and just add your code.
- Select an appropriate GPU pod: For StarCoder2-15B, choose a pod type with sufficient VRAM (such as an A100 40GB or RTX 6000 Ada 48GB). For the 7B model or quantized versions, a 16 GB GPU (like an RTX A4000) might suffice. The GPU pricing page on RunPod shows options and costs – for instance, an A100 40GB on the community cloud might cost around $2–$3/hour, whereas smaller GPUs like an RTX 3090 (24GB) or RTX A5000 (24GB) are closer to ~$1/hour.
- Environment configuration: When launching the pod, be sure to set the container’s command or entrypoint to start your API server (if it’s not already the default). Also, open the API port in the RunPod interface (if using port 8000, add that to the exposed ports). You can set environment variables if needed as well (for example, if you need a Hugging Face token to download the model, you might set HF_TOKEN as an env var).
- Deploy and test: Once the pod is running, you’ll get a unique endpoint URL. For example, RunPod might provide a proxy URL like https://<pod-id>-8000.proxy.runpod.net. You can send an HTTP request to your /generate_code endpoint at that URL. Try it out with a small prompt, e.g., a JSON payload like {"prompt": "def hello_world():"}, to see if the model completes the function. The first request may take a while (several seconds to a minute or more, given the model’s size); subsequent requests should be faster because the model stays in memory.
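For example, a quick smoke test from your own machine might look like this sketch (the pod ID is a placeholder, and the response field name matches the example server above):

```python
# Send a small prompt to the deployed endpoint and print the completion.
import requests

url = "https://<pod-id>-8000.proxy.runpod.net/generate_code"  # substitute your pod ID
payload = {"prompt": "def hello_world():", "max_new_tokens": 64}

resp = requests.post(url, json=payload, timeout=120)  # generous timeout for the first request
resp.raise_for_status()
print(resp.json()["completion"])
```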
One upside of using RunPod is that you can scale this service horizontally by deploying multiple pods if needed, or scale vertically by choosing a bigger GPU if the load increases. RunPod also supports Serverless GPU Endpoints, where your containerized API can automatically scale and handle requests without you managing the Pod lifecycle – this could be a next step when you want to productionize the service.
Frequently Asked Questions (FAQ)
Q: What are the hardware requirements for StarCoder2?
A: For the full 15B model at float16 precision, you’ll need roughly 32 GB of GPU memory. This typically means using a high-end GPU like an NVIDIA A100 40GB or RTX 6000 Ada (48GB). If you use 8-bit quantization, you can cut that requirement roughly in half (about 16 GB, which fits on cards like RTX A5000 or 3090). The 7B model needs about half the VRAM of the 15B, so roughly 16 GB in full precision (or ~8 GB with 8-bit). The smallest 3B model is much lighter (~6 GB or less). Always leave a bit of headroom – if a model “just barely” fits in VRAM, you might still run into out-of-memory errors during inference, so it’s better to use a slightly larger GPU or reduce the batch size.
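As a rough rule of thumb, you can estimate the weight footprint as parameter count times bytes per parameter; actual usage will be higher once activations and the KV cache are included. A quick sketch:

```python
# Back-of-the-envelope VRAM estimate for the StarCoder2-15B weights alone.
params = 15e9
for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.0f} GB")
# fp16: ~30 GB, int8: ~15 GB, int4: ~8 GB, roughly in line with the figures above
```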
Q: How does StarCoder2 compare to other code models in terms of inference speed?
A: Being a 15B parameter model, StarCoder2 is quite large, so each inference step is somewhat heavy. In practice, generating code with StarCoder2 will be a bit slower than smaller models (like a 7B model) but it’s faster than using something like OpenAI’s GPT-4 via API, since you have dedicated GPU power and no network overhead. If you need faster responses and are okay with slightly less capable outputs, you could try the 7B model which will run roughly twice as fast. Another strategy for speed is to limit the maximum tokens generated – because StarCoder2 has a 16k context, if you ask it to produce a very long completion, it will take more time. For many code completion tasks (where the outputs are maybe a few dozen lines of code), it should respond in a couple of seconds on a good GPU. You can also experiment with turning off certain features like sliding window attention if supported, but that tends to affect quality more than speed.
Q: Can I run StarCoder2 on a CPU-only setup?
A: Technically yes, but it’s not practical for anything beyond experimentation. The 15B model would take a huge amount of RAM and be extremely slow on CPU (potentially taking many tens of seconds or minutes per request). If you don’t have a suitable GPU yourself, it’s much better to use a cloud service like RunPod to access a GPU on-demand. Even the 3B model on CPU will be sluggish. However, you could use quantization and a smaller model if absolutely needed on CPU, but expect significant latency.
Q: How do I handle multiple concurrent requests to the API?
A: If you anticipate many simultaneous requests, there are a few approaches:
- Queueing: Since the model can only generate for one request at a time per GPU, you might implement a simple queue on the server side to handle requests one after the other. This is easiest to do in the application logic (for example, using asyncio in FastAPI to queue tasks); see the sketch after this list.
- Batching: Advanced inference engines like vLLM can batch multiple prompts together to better utilize the GPU, but if you’re rolling your own FastAPI server, this would require more complex logic. It’s often not needed unless you have high throughput requirements.
- Scaling out: You can run multiple pods (containers), each with the model, and use a load balancer to distribute requests. For instance, on RunPod you might deploy 2–3 instances of the service and front them with a simple round-robin in your application, or use RunPod’s serverless endpoints (which scale under the hood). This way each instance handles a subset of requests. Remember that running multiple copies will multiply your GPU cost, so ensure it’s necessary before scaling out.
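One minimal way to serialize access to the GPU is an asyncio lock. The sketch below replaces the synchronous endpoint from the earlier app.py example and assumes the generator pipeline, GenerateRequest model, and api object defined there:

```python
# Sketch: queue concurrent requests behind a lock so only one generation runs at a time.
import asyncio

gpu_lock = asyncio.Lock()

@api.post("/generate_code")
async def generate_code(req: GenerateRequest):
    async with gpu_lock:  # later requests wait here until the GPU is free
        result = await asyncio.to_thread(
            generator,
            req.prompt,
            max_new_tokens=req.max_new_tokens,
            do_sample=False,
        )
    return {"completion": result[0]["generated_text"]}
```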
Q: How much will it cost to serve StarCoder2 on RunPod?
A: It depends on the GPU you choose and how long you run it. As a ballpark, an A100 40GB on RunPod is around $2.50 per hour on the community cloud. If you used that for 4 hours a day during workdays, it’d be about $10 a day. Smaller GPUs like an RTX A5000 (24GB) or RTX 3090 are cheaper (around $1–$1.50/hour) and can handle the 15B model in 8-bit mode. If you opt for the 7B model, you could even run it (quantized) on an RTX 3080 (10GB) at roughly $0.17–$0.20/hour, significantly reducing costs. Also consider the serverless option – with that, you pay per second of usage, which can be cost-effective if your usage is sporadic (the endpoint can scale down when idle).
Q: The model sometimes generates incorrect or incomplete code. How can I improve its outputs?
A: Large language models for code can sometimes produce syntactically correct but logically incorrect code. To improve outputs:
- Provide more detailed prompts or context. StarCoder2 has a large context window, so you can supply it with relevant codebase context (like function definitions, docstrings, etc.) to guide its output.
- Tune the generation parameters. A lower temperature (e.g., 0.2) makes sampled outputs more focused and consistent, which is often what you want for code; alternatively, set do_sample=False for fully deterministic (greedy) completions. See the sketch after this list.
- Fine-tune the model on your specific code style or problem domain if you have the data and resources. StarCoder2 can be fine-tuned, and if you have a domain-specific dataset (for example, all your company’s internal API usage patterns), a fine-tune could yield more tailored code generation.
- Lastly, always test and review generated code. Consider integrating linting or even simple test runs for the generated code, if possible, to catch errors. Running an automated unit test on the output (in a sandbox) can give quick feedback on whether the generated function works as intended.
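As a sketch of the two decoding styles, assuming the generator pipeline from the earlier server example (the parameter values are illustrative starting points, not tuned recommendations):

```python
prompt = "def fibonacci(n):"  # example code context sent by the client

# Deterministic (greedy) completion: usually the safest default for code.
deterministic = generator(prompt, max_new_tokens=128, do_sample=False)

# Light sampling for a bit of variety while staying close to the likeliest code.
sampled = generator(prompt, max_new_tokens=128, do_sample=True,
                    temperature=0.2, top_p=0.95)
```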