A previous blog post, No-Code AI: How I Ran My First LLM Without Coding, walked through deploying the Mistral-7B LLM on Runpod without writing any code. That post used the text-generation-webui-oneclick-UI-and-API template, which lets you interact with the model through a chat interface powered by text-generation-web-ui.
This post builds on that foundation, exploring the same deployment path through a more technical lens. If you’ve deployed Mistral before, you’ll see how to further optimize and customize it for better control and performance.
What you’ll learn
In this blog post, you’ll learn how to:
- Deploy Mistral-7B with quantized weights
- Deploy Mistral-7B using vLLM workers
Requirements
Before following this tutorial, make sure you have:
- A Runpod account with credits added
- A Hugging Face account and a user access token
Deploy Mistral with quantized weights
LLM quantization is a compression technique that reduces the precision of a model’s parameters from a high-precision format, such as 32-bit floating point, to a lower-precision format, such as 8-bit integers. This reduces the size of the model, meaning that it consumes less memory, requires less storage, is more energy-efficient, and runs faster.
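To get a feel for the savings, here’s a rough back-of-envelope sketch in Python. It counts weight storage only and ignores activations, the KV cache, and file metadata, so real quantized files will differ somewhat:

```python
# Approximate weight-storage size for a ~7B-parameter model at different precisions.
# Weights only: activations, KV cache, and file metadata are ignored.
PARAMS = 7_000_000_000  # Mistral-7B has roughly 7 billion parameters

for name, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size_gib = PARAMS * bits / 8 / 1024**3
    print(f"{name:>5}: ~{size_gib:.1f} GiB")

# Prints roughly: FP32 ~26.1 GiB, FP16 ~13.0 GiB, 8-bit ~6.5 GiB, 4-bit ~3.3 GiB
```

That ~3 GiB figure is why a 4-bit model fits comfortably on a 16 GB consumer GPU, with room left over for the KV cache.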
We’ll deploy Mistral-7B in the GGUF format, which is optimized for storing and loading models efficiently and is designed to work well on consumer hardware. It also offers good performance and broad support for quantization.
Let’s deploy a 4-bit quantization of the Mistral-7B model and compare performance across different GPUs.
- Sign into the Runpod Console and open the Explore page.
- Search for Text Generation Web UI and APIs and select it. This template is developed by Runpod and uses the same web UI that was used in No-Code AI: How I Ran My First LLM Without Coding.

- Select Deploy.

- Let’s start with the cheapest GPU, the RTX 2000 Ada with 16 GB VRAM.

- Enter a name in the Pod Name field, then select Edit Template.

- Select Add environment variable. Set the key to HF_TOKEN and the value to your Hugging Face access token. Then, select Set Overrides.
Note: These environment variables are public, meaning that anyone can view them. Rather than enter your Hugging Face access token directly here, use a secret.

- Runpod starts to deploy your pod. Select the arrow next to the pod to expand it, then select Logs to view progress.

- Wait until the log on the Container tab says “Container is READY!”, then select Connect on the pod.
- Connect to port 3000 (the Text Generation Web UI service).

- Let’s deploy the Mistral-7B model. Select the Model tab.

- We’ll deploy the MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF model from Hugging Face. This is the mistralai/Mistral-7B-Instruct-v0.3 model in GGUF format.
An instruct model is a type of LLM that has been fine-tuned to follow specific instructions or commands, making it effective for tasks like summarization, translation, and answering questions directly.
Under Download model or LoRA, enter “MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF”. In the box below that, enter “Mistral-7B-Instruct-v0.3.IQ4_XS.gguf”, which is the file name of the IQ4_XS (4-bit quantization) version of the model. Then select Download.

- After the model is downloaded, select the refresh button so that it appears in the Model drop-down menu. Select the model from the menu, then select Load; its settings should load automatically. Finally, select Save settings.

- Go back to the Chat tab and try asking the model a question, such as, “What is the best Linux distro?” Confirm that it gives an intelligible response.

- Let’s check the performance of the model by reading through the logs. Go back to the pod overview page and select Connect. Under Web Terminal, select Start. After Runpod starts it up, select Open Web Terminal.

- Enter the following command:
```bash
tail -f /workspace/logs/textgen.log
```
tail outputs the last 10 lines (by default) of a file. The -f switch “follows” the file as new text is added, and outputs it to the terminal.
The main part we are interested in is the last line, which should look something like this:
```
Output generated in 3.98 seconds (29.17 tokens/s, 116 tokens, context 95, seed 17592148)
```
Make a note of this number, because we’ll use it to compare against another GPU’s performance (there’s a small log-parsing script after these steps).
- Create another pod following the exact same steps, but this time, select a GPU with more VRAM, such as the RTX 5090.
- Ask the same question that you asked the model running on the lower-performance GPU.
Note: In my testing, the first question took noticeably longer to answer than subsequent ones. For the purposes of this blog post, I’m comparing performance on the second question I ask each model.
- Check the log in the web terminal and compare the throughput (tokens/second) of the two pods. Here are my results:
- RTX 2000 Ada: 29.17 tokens/second
- RTX 5090: 91.84 tokens/second
You should see a substantial increase in performance between the lower- and higher-end GPUs. In my case, moving from 16 GB to 32 GB of VRAM roughly tripled throughput.
Use this information to determine the types of GPUs that you use for your workloads, balancing the tradeoffs of cost and performance.
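If you run this comparison across several GPUs or prompts, it gets tedious to eyeball the log. Here’s a small, hypothetical helper in Python, assuming the log stays at /workspace/logs/textgen.log and keeps the “Output generated in … (XX.XX tokens/s, …)” format shown above:

```python
import re

# Extract throughput values from a text-generation-webui log.
# Assumes lines like:
#   Output generated in 3.98 seconds (29.17 tokens/s, 116 tokens, context 95, seed 17592148)
PATTERN = re.compile(r"\(([\d.]+) tokens/s")

def throughputs(log_path="/workspace/logs/textgen.log"):
    """Return every tokens/s value found in the log, in order."""
    with open(log_path) as log:
        return [float(m.group(1)) for m in map(PATTERN.search, log) if m]

if __name__ == "__main__":
    rates = throughputs()
    if rates:
        print(f"runs: {len(rates)}  last: {rates[-1]:.2f} tok/s  best: {max(rates):.2f} tok/s")
    else:
        print("No throughput lines found yet.")
```

Save it anywhere in the pod and run it from the web terminal after you’ve asked each model a couple of questions.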
Next steps
Compare the performance of the 4-bit quantized model with the non-quantized model, or other quantizations. You should see performance gains with lower-precision models.
Don’t forget to Stop your pods when you’re done using them, and Terminate them when you no longer need them!
Deploy vLLM workers
A vLLM worker is a specialized container that efficiently deploys and serves LLMs on Runpod Serverless, designed for performance, scaling, and cost-effective operation.
Runpod Serverless allows you to create an endpoint that you can call via an API. Runpod allocates workers to the endpoint, which handle API calls as they come in.
Serverless vLLM endpoints have several advantages over pods:
- vLLM workers come with the vLLM inference engine pre-configured, optimized for memory usage and fast inference.
- The worker’s API is compatible with the OpenAI API, so you can reuse your existing code just by switching the endpoint URL and API key (see the code sketch at the end of this section).
- Runpod Serverless automatically scales your endpoint workers based on demand and bills on a per-second basis.
If you are not comfortable writing code to call your endpoint, deploying a pod template is a more user-friendly option. However, if you know how to call an API, vLLM on Runpod Serverless is a great option.
Let’s deploy Mistral-7B using vLLM workers. You can deploy LLMs using either the Runpod Console or a Docker image. We’ll go through the Runpod Console method in this blog post, but if you plan to run a vLLM endpoint in production, it’s best to package your model into a Docker image ahead of time, as this can significantly reduce cold start times. See Deploy using a Docker image.
- Sign into the Runpod Console and open the Serverless page.
- Under Quick Deploy, in the Text tab, under Serverless vLLM, select Configure.

- Under Hugging Face Models, enter mistralai/Mistral-7B-Instruct-v0.3. Under Hugging Face Token, enter your Hugging Face user access token. Select Next.
Note: Rather than enter your Hugging Face access token directly here, use a secret.

- Leave the default vLLM settings and select Next.

- Leave the default worker configuration settings and select Deploy. In this example, we’ll test workers with 48 GB GPUs.
Note: Prices are subject to change.

- Runpod initializes the endpoint and deploys a vLLM worker image with the Mistral-7B model. Wait until the status is Ready.
- On your endpoint page, select the Requests tab. Note the default test request:
```json
{
  "input": {
    "prompt": "Hello World"
  }
}
```
Select Run to initialize the workers and send a “Hello World” prompt to the model. You can watch the status of the request on the page, which is refreshed every few seconds.

- When the worker finishes processing the request, you can see the response on the right, which should be similar to this:
```json
{
  "delayTime": 839,
  "executionTime": 3289,
  "id": "03d33dea-2294-4b85-9c7d-ed76345ed5f7-e2",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            "<CHAT RESPONSE>"
          ]
        }
      ],
      "usage": {
        "input": 9,
        "output": 100
      }
    }
  ],
  "status": "COMPLETED",
  "workerId": "n514p4lqp6ylq8"
}
```
The response to this generic prompt may be a bit random; I got some commentary on an American football team. You should get a better response if you ask an actual question. Select New Request and enter a different prompt, like “What is the best Linux distro?”
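Once the endpoint shows Ready, you can also call it from your own code rather than the Requests tab. Below is a minimal Python sketch showing both routes: the native Runpod endpoint API, which takes the same {"input": {"prompt": ...}} payload as the test request above, and the OpenAI-compatible route mentioned earlier. Treat the endpoint ID as a placeholder, and note that the base-URL patterns and served model name are assumptions based on Runpod’s vLLM worker documentation, so check them against your endpoint page:

```python
import os

import requests
from openai import OpenAI  # pip install openai requests

ENDPOINT_ID = "your-endpoint-id"        # placeholder: copy it from the endpoint overview page
API_KEY = os.environ["RUNPOD_API_KEY"]  # a Runpod API key, not your Hugging Face token

# 1) Native Runpod endpoint API: the same payload shape as the Requests tab test request.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "What is the best Linux distro?"}},
    timeout=120,
)
print(response.json())

# 2) OpenAI-compatible route: reuse existing OpenAI client code by swapping
#    the base URL and API key (URL pattern assumed from Runpod's vLLM worker docs).
client = OpenAI(
    api_key=API_KEY,
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
)
chat = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "What is the best Linux distro?"}],
    max_tokens=200,
)
print(chat.choices[0].message.content)
```

Either way, you authenticate with a Runpod API key created in the console, not the Hugging Face token you supplied during deployment.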

Next steps
Compare the performance of using vLLM workers across various situations:
- vLLM workers versus pod + web UI
- Quantized versus non-quantized models
- Low-end versus high-end GPUs
You should see improvements across the board when using vLLM workers, both in inference speed and in cost.
If you want to further optimize performance, try increasing the minimum active workers. When a worker is not being used, it goes into an idle state, from which it takes some time to wake up and start a new task. If you have a certain number of active workers available at all times, you can decrease the time it takes to start a new task. For more ways to optimize your endpoints, see Endpoint configuration and optimization.
Don’t forget to delete your endpoints when you’re done using them. On the endpoint overview page, select Manage > Delete Endpoint.
Conclusion
Once you are comfortable deploying a model like Mistral-7B using a pod template or vLLM on Runpod Serverless, there are many things you can do to further optimize for performance and cost. We’ve only touched on a few here, but there are so many more areas to explore.
In summary, to optimize the performance of your workloads:
- Use quantized (lower-precision) GGUF models
- Use higher-VRAM GPUs
- Deploy vLLM workers on Runpod Serverless
- Compare different strategies to balance cost and performance