How to Serve Gemma Models on L40S GPUs with Docker
Google’s Gemma family of models represents a series of lightweight, state-of-the-art open large language models, built from the same research and technology used in Google’s Gemini models. Gemma models are available in various sizes – for instance, Gemma 2 comes in 2B, 9B, and 27B parameter versions. The largest (Gemma 2 27B) pushes the limits of what you can run on a single GPU. In this guide, we’ll explain how to serve Gemma models (especially the larger ones) using Docker on an NVIDIA L40S GPU. The L40S, with 48 GB of VRAM, is a powerful GPU choice for deploying these models in the cloud.
Sign up for RunPod to deploy this workflow in the cloud with GPU support and custom containers.
Why L40S for Gemma Models?
The Gemma 27B model is quite resource-intensive; in 16-bit precision it can require around 50–60 GB of memory for inference, which means it won’t fit on typical 24 GB cards without modifications. The NVIDIA L40S offers 48 GB of VRAM, giving you a single-GPU solution to run these larger models. It’s roughly comparable to the A100 40GB or RTX 6000 Ada in terms of memory capacity, but with the newer Ada Lovelace architecture that provides excellent inference performance for generative AI workloads. Using an L40S means you might be able to run the Gemma 27B model with some optimizations (like 8-bit loading) without needing multiple GPUs.
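As a quick back-of-the-envelope check (weights only, ignoring KV cache and runtime overhead), you can estimate the weight memory at different precisions:

# Rough weight-memory estimate for a 27B-parameter model (weights only).
params = 27e9
print(f"fp16/bf16: ~{params * 2 / 1e9:.0f} GB")    # ~54 GB – does not fit in 48 GB
print(f"int8:      ~{params * 1 / 1e9:.0f} GB")    # ~27 GB – fits on an L40S with headroom
print(f"int4:      ~{params * 0.5 / 1e9:.0f} GB")  # ~14 GB – fits even on 24 GB cards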
On RunPod, you can spin up an L40S instance directly. According to RunPod’s GPU pricing table, an L40S 48GB on the community cloud is about $0.79 per hour, making it a cost-effective choice for short-term heavy workloads. (For comparison, the older RTX A6000 48GB is around $0.33/hr but has less processing power; the L40S is newer and faster for these tasks.)
Setting Up the Docker Environment
To serve Gemma models via Docker, we have two main approaches:
1. Using an off-the-shelf inference server (fast setup): One convenient method is to use Hugging Face’s Text Generation Inference (TGI) server. This is a specialized, highly optimized server for hosting LLMs. Hugging Face provides a Docker image that you can use. The basic idea is:
- Pull the TGI Docker image (e.g., ghcr.io/huggingface/text-generation-inference:latest).
- Run it with the appropriate model and GPU settings. For example:
docker run --gpus all -p 8000:80 -e MODEL_ID=google/gemma-2-27b -e QUANTIZE=bitsandbytes ghcr.io/huggingface/text-generation-inference:latest
- In this illustrative command, we’re telling the container to load the Gemma 2 27B model from Hugging Face (MODEL_ID) and apply 8-bit quantization on the fly (QUANTIZE=bitsandbytes) to reduce memory usage; for gated weights you would also pass your Hugging Face access token (e.g., via -e HF_TOKEN). The container exposes an API (on port 80 inside, mapped to 8000 outside) that you can call for text generation – see the example request after this list. The details might vary, and you should check Hugging Face’s documentation for the exact usage, but this approach saves a lot of coding. Essentially, you let the TGI server handle the model loading, GPU optimization, and serving of a /generate endpoint.
- Advantages: Quick to deploy, an optimized inference backend (written largely in Rust and Python, with support for efficient batching and quantization), and built-in features like streaming token output.
- Disadvantages: Less customizable in terms of adding your own pre- or post-processing logic around the model output, and you have to trust the container as a black box for serving.
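Once the container is up, you can send it a request over HTTP. A minimal sketch in Python (assuming the server is reachable at http://localhost:8000, as in the port mapping above):

import requests

# TGI serves a /generate endpoint that takes a prompt plus generation parameters.
resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "inputs": "Q: What is the capital of France?\nA:",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["generated_text"])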
2. Rolling your own server with Docker (custom setup): If you need more control or want to integrate Gemma into a larger application pipeline, you can create your own Docker image:
- Start from a CUDA-enabled base image (ensuring drivers for the L40S will work – RunPod’s base images or their PyTorch template are a good starting point).
- Install the required libraries. You’ll likely use 🤗 Transformers to load the Gemma model. For 27B, make sure to use a recent transformers version that supports large model loading (and possibly accelerate for low-memory loading techniques). Also install bitsandbytes if you plan to leverage 8-bit loading.
- Copy or download the Gemma model weights. Google has made Gemma weights available openly (Gemma 2B and 7B are on Hugging Face; the 27B may require accepting a license or using Google’s AI platform to download). You might use the Hugging Face Hub with an access token to download directly in your container build, or download externally and COPY into the image.
- Write a lightweight server script (similar to what we described for StarCoder2) – for example, a FastAPI app with a single /generate endpoint; a fuller sketch of such a server appears below. Load the model with something like:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-27b", device_map="auto", torch_dtype=torch.float16, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")
- This uses 8-bit loading to cut memory usage roughly in half. On an L40S, 8-bit mode should allow the 27B model to fit in memory (it might use around 30 GB out of 48 GB, leaving room for overhead and runtime).
- In the server endpoint, implement generation using the model. Because Gemma is a decoder-only model (similar to Llama or GPT-style models), you’ll use model.generate() or a transformers text-generation pipeline. Set reasonable limits like max_new_tokens to prevent extremely long completions that could fill up memory or take too long.
- Expose the required port (say, 8000) in your Dockerfile and ensure the container starts the server on run (e.g., via CMD in the Dockerfile).
Build this image and test it locally (you could test with Gemma 2B or 7B to ensure your code works, then trust that it will handle 27B on the L40S in the cloud).
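Here is a minimal sketch of such a server. The file layout, model ID, and defaults are illustrative assumptions rather than a fixed recipe; adapt them to your own image:

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2-27b"  # assumed checkpoint; swap in whichever Gemma variant you deploy

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",           # let accelerate place layers on the available GPU
    torch_dtype=torch.float16,
    load_in_8bit=True,           # requires bitsandbytes; keeps the 27B within 48 GB
)

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128    # cap output length so requests stay bounded

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return {"generated_text": text}

# Start with, for example: uvicorn server:app --host 0.0.0.0 --port 8000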
Deployment on RunPod with L40S
Once you have chosen your approach (whether the TGI container or your custom image), deploy it on RunPod:
- Choose the L40S GPU type when creating a new pod. This ensures you get the 48 GB GPU. You can also choose the Secure Cloud or Community Cloud option depending on availability and cost preference (Community is cheaper, as seen in the pricing table).
- Attach your container:
- If using the TGI approach, you can directly use the public image with the proper environment variables. RunPod’s interface allows you to specify environment variables like MODEL_ID and QUANTIZE when deploying. You’d input the image name (ghcr.io/huggingface/text-generation-inference:latest) and then set MODEL_ID=google/gemma-2-27b (for example) and QUANTIZE=bitsandbytes in the env vars section. Also add the port (8000) to exposed ports.
- If using a custom image, push it to a registry and provide that image name in the deployment form. Include any necessary startup command, or it will use the container’s default CMD.
- Storage considerations: The Gemma 27B model is large (the weights are tens of gigabytes). On RunPod, ensure the pod has enough container disk allocated. You might need, say, 60–80 GB of disk to comfortably hold the model and libraries. This can be configured in the pod settings (for example, set container disk size to 80 GB if you plan to download the model inside the pod).
- Networking: Open the port that the server uses (if 8000, add that; if using the TGI default, perhaps 80 or 8000). RunPod will then provide a proxy URL to access the service.
- Launch and monitor: Deploy the pod and watch the logs. Loading a 27B model will take some time (possibly 20–30 seconds or more once the weights are cached, and considerably longer if they have to be downloaded first). If you see out-of-memory errors in the logs, double-check that 8-bit mode is enabled, or opt for even more aggressive quantization (4-bit, as sketched below) or the 9B model instead. Once the pod is running, test the endpoint. For example, if you have a /generate endpoint, try a prompt like “Q: What is the capital of France?\nA:” to see if it responds with a reasonable answer (the base Gemma checkpoints are not instruction-tuned, so the format of the prompt can influence the answer).
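If 8-bit still leaves you short on memory, 4-bit loading is the next step. A minimal sketch using bitsandbytes through transformers (the exact savings and quality impact depend on your library versions and model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization roughly halves memory again relative to int8,
# at some cost in output quality.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b",        # assumed repo ID; use whichever checkpoint you deployed
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")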
Frequently Asked Questions (FAQ)
Q: What performance can I expect from Gemma 27B on an L40S?
A: The L40S is a very capable GPU. In terms of raw performance, it should handle Gemma 27B quite well, though generation will not be instantaneous for long outputs – expect on the order of a few tokens per second for long completions. With 8-bit quantization, throughput may shift somewhat depending on the kernels in use, but the important gain is being able to run the model at all without out-of-memory issues. The quality of responses from Gemma 27B is among the best for open models of that size (since it’s built with techniques from Google’s Gemini research). The 27B model will certainly outperform the 2B and 7B versions in terms of knowledge and coherence, at the cost of speed. If you find it too slow and your task can tolerate slightly less accuracy, the 9B model on the L40S would be much faster (and could even be run on smaller GPUs or with more headroom).
Q: Do I have to use an L40S, or can I run Gemma on other GPUs?
A: You can run smaller Gemma models on smaller GPUs easily. For example, the Gemma 2B and 7B models can run on a 16 GB GPU (like an RTX A4000 or T4) with no issues, and even on 8 GB GPUs if you use 8-bit or 4-bit quantization. The 9B model ideally wants around 16–20 GB (so it fits on a 24 GB card comfortably, or perhaps a 16 GB card with 8-bit compression). The 27B model is the most demanding – if you don’t have an L40S or similar 48 GB card, you would need to either use multiple GPUs (e.g., split the model across two 24 GB GPUs) or heavy quantization to fit it into a single 24 GB card. RunPod does offer multi-GPU options and other 48 GB cards (like the A6000 or RTX 6000 Ada). The L40S just happens to combine large memory with great throughput, making it a convenient single-GPU solution for Gemma 27B.
Q: What’s the benefit of using Docker in this setup?
A: Docker ensures that your environment is consistent and portable. When dealing with complex ML libraries and large models, environment issues can be frustrating (library versions, CUDA dependencies, etc.). By containerizing the Gemma serving setup, you can move it between machines (or share it with colleagues) without having to re-do the environment each time. On RunPod, you can deploy a custom Docker container in one click, which means once you’ve built and pushed your Docker image with Gemma, scaling out or redeploying it is trivial. Additionally, using Docker allows you to encapsulate model files within the image (though for very large models it might be better to download or mount them at runtime to avoid a huge image). Overall, Dockerizing the service makes it more reproducible and easier to manage.
Q: How do I handle inference if the model responses are too slow or too long?
A: If responses are too slow, consider these adjustments:
- Reduce the output length. If you set a very high token limit, the model will take correspondingly longer for large answers. Determine a reasonable maximum for your use case (maybe you rarely need more than a 100-token completion for a given query).
- Use batching if you have many short requests. With a single model, you can batch multiple prompts into one forward pass to better utilize the GPU (the TGI server does this automatically). This is advanced to implement yourself, but if you anticipate high load, using an inference server like TGI or vLLM that handles batching can massively improve throughput.
- Enable streaming. Rather than waiting for the full completion, you can stream tokens as they are generated. This way, the user starts seeing output sooner. Both Hugging Face TGI and a custom FastAPI setup (using server-sent events or websockets) can support streaming tokens.
- If responses are too verbose or rambling, you might need to adjust your prompting strategy or your decoding parameters (for example, by using stop sequences or tightening the token limit). Since the base Gemma models aren’t instruction-tuned, they might not know when to stop without a clear stop cue. You can insert a stop cue (like "\nQ:" to simulate a next question prompt as a stopping point) or simply limit the max tokens – see the sketch below.
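As a concrete illustration of the stop-cue idea (a sketch that assumes model and tokenizer are already loaded as in the earlier snippet; the cue string and token cap are arbitrary choices):

import torch

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100)   # hard cap on length

# Strip the prompt, then truncate at the first "next question" cue if one appears.
completion = tokenizer.decode(output[0], skip_special_tokens=True)[len(prompt):]
completion = completion.split("\nQ:")[0]
print(completion.strip())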
Q: How much does it cost to run Gemma 27B on RunPod’s L40S?
A: As mentioned, the L40S on the community cloud costs about $0.79 per hour. So if you ran it continuously for 24 hours, that’s roughly $19. Depending on your usage pattern, you might only spin it up when needed. For example, if you use it 8 hours a day on weekdays, that’s around $0.79 * 8 * 22 ≈ $139 per month. If you need it less often, you can cut those numbers down. By contrast, a smaller model on a cheaper GPU could cost under $0.20/hr (e.g., running Gemma 2B on an RTX 4000), which might be only ~$35 per month at 8hrs/day usage. Always consider whether the 27B’s improved accuracy is critical, or if a fine-tuned smaller model could achieve your goals at a fraction of the cost. For development and testing, you can use the smaller models and then scale up to L40S/27B when you’re ready to deploy the best model for production.
Q: Are there any special considerations when using Google’s Gemma models?
A: One thing to note is that the Gemma models, while open, have their own licenses (Apache 2.0 for the code, and a separate terms-of-use agreement for the weights). Make sure to review the license if you plan to use them in a commercial application. Technically, Gemma models use transformer architectures similar to other LLMs, so serving them doesn’t require any Google-specific software. They are available on platforms like Hugging Face, which makes them easy to work with. If you use the Google AI versions (e.g., via their API or JAX implementation), that’s a different approach; in our Docker/RunPod scenario, we’re using the PyTorch-compatible weights via Hugging Face, which is straightforward. Just keep your libraries up to date – these models are new, and support in libraries like Transformers is continually improving (for example, optimized kernels for new architectures or memory-saving features). If you encounter issues, checking the Hugging Face forums or GitHub for the Gemma model can provide solutions, since the community is actively working with these models.