Jonmichael Hands

Benchmarking LLMs: A Deep Dive into Local Deployment & Optimization

July 4, 2024

I just love the idea of running an LLM locally. It has huge implications for data security and the ability to use AI on private datasets. Get your company’s DevOps teams some real GPU servers as soon as possible.

Benchmarking LLM performance has been a blast, and I’ve found it a lot like benchmarking SSD performance because there are so many variables. In SSD benchmarking you have to worry about the workload inputs, like block size, queue depth, access pattern, and the state of the drive. In LLMs you have inputs like model architecture, model size, parameters, quantization, number of concurrent requests, and the size of the request in tokens in and tokens out; the prompts themselves can also make a huge difference.

The main things you want to optimize are latency (time to first token and time to completion) and total throughput in tokens per second. For something a user is reading along with, even 10-20 tokens per second is plenty fast, but for coding you may want something much faster so you can iterate quickly.
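
To make those numbers concrete, here’s a quick sketch (my own illustration, nothing tool-specific) of how time to first token and generation throughput fall out of a single request’s timestamps:

```python
# Minimal sketch (my own illustration) of the two latency metrics plus
# generation throughput, computed from a single request's timestamps.

def completion_metrics(t_request, t_first_token, t_done, output_tokens):
    ttft = t_first_token - t_request                          # time to first token (s)
    e2e_latency = t_done - t_request                          # total completion latency (s)
    gen_tok_per_s = output_tokens / (t_done - t_first_token)  # generation throughput
    return ttft, e2e_latency, gen_tok_per_s

# Example: 0.2 s to first token, 256 tokens finishing 13 s after the request
# works out to 20 tokens/s -- a comfortable reading speed.
print(completion_metrics(0.0, 0.2, 13.0, 256))
```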

“Best practices in deploying an LLM for a chatbot involves a balance of low latency, good reading speed and optimal GPU use to reduce costs.” - NVIDIA

OpenAI currently charges for their API endpoints by the number of tokens in and tokens out; pricing is around $5 per 1M tokens in and $2.50 per 1M tokens out.
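
As a quick sanity check on what that means per request, here’s the arithmetic at those prices (the request sizes are made-up numbers, just for illustration):

```python
# Illustrative per-request cost at the API prices quoted above
# (request sizes are made-up numbers).
PRICE_IN_PER_M = 5.00    # $ per 1M input tokens
PRICE_OUT_PER_M = 2.50   # $ per 1M output tokens

tokens_in, tokens_out = 1_000, 500
cost = tokens_in / 1e6 * PRICE_IN_PER_M + tokens_out / 1e6 * PRICE_OUT_PER_M
print(f"${cost:.5f} per request")  # $0.00625
```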

Open-source endpoints like ollama are a great place to start, since they are extremely easy to run. You can just spin up a Docker container, or if you want a nice front end as well you can use open-webui. I don’t need a front end, since I’m using an OpenAI-compatible API to target the LLM endpoint. I’m using llmperf to do the benchmarking.
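
For reference, here’s roughly how I point a client at the local endpoint. This is a minimal sketch assuming ollama’s OpenAI-compatible API on its default port (11434) and a model tag that has already been pulled; llmperf can be aimed at the same base URL in its OpenAI-compatible mode.

```python
# Minimal sketch: point the standard OpenAI client at a local ollama endpoint.
# Assumes ollama's OpenAI-compatible API on its default port (11434) and that
# the llama3:8b-instruct-fp16 tag has already been pulled.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama3:8b-instruct-fp16",
    messages=[{"role": "user", "content": "Explain PCIe lanes in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```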

When testing ollama (which I suspect uses the CUDA backend), it oddly only uses one of the cards when I run 10 concurrent requests, and throughput is similar to a single instance but with much higher latency.

[Table: ollama results. Columns: GPU, Model, Quant, Sample Size, Requests]

Inference performance on ollama seems okay for single requests but doesn't scale.
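
The concurrency test conceptually boils down to something like this simplified sketch (llmperf does it properly; this is just the idea, under the same endpoint assumptions as above):

```python
# Simplified sketch of what the concurrency test boils down to: fire 10
# requests at once at the same OpenAI-compatible endpoint and record latency.
# (llmperf does this properly; this is just the concept.)
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

def one_request(i: int) -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model="llama3:8b-instruct-fp16",
        messages=[{"role": "user", "content": f"Write a haiku about GPU #{i}."}],
        max_tokens=128,
    )
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(one_request, range(10)))

print(f"mean request latency: {sum(latencies) / len(latencies):.2f}s")
```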

The start of this project was me getting excited about NVIDIA NIMs from the keynotes at GTC and Computex this year. I wish NVIDIA had just made these optimized microservices with a completely open-source backend, but at least they can be run anywhere (on your local compute) with their API key authentication. They are extremely easy to run: you can spin up a completely optimized and tuned LLM endpoint in just a few minutes (most of that time being spent downloading the model).

NVIDIA claims in their blog post that an H200 can get up to 3,000 tokens/s. I wanted to put this to the test and also compare the inference performance against some high-end consumer cards like the RTX 4090.

I grabbed the NIM from here and started it in Docker on the same system.

https://build.nvidia.com/meta/llama3-8b
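
Once the container is up, it can be queried the same way as any other OpenAI-compatible endpoint. This sketch assumes the NIM’s default port (8000) and the meta/llama3-8b-instruct model name; check the container logs if yours differ.

```python
# Querying the local NIM the same way -- it also exposes an OpenAI-compatible
# API. Port 8000 and the "meta/llama3-8b-instruct" model name are assumptions
# based on the defaults; check the container logs if yours differ.
from openai import OpenAI

nim = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = nim.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "Summarize NVLink in one paragraph."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```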

This is a system I’m messing around with that doesn’t have the GPU at full PCIe bandwidth. I accidentally discovered that inference doesn’t appear to be very PCIe-bandwidth intensive, though I haven’t tried a model big enough to really stress it.

GPU         | Model                   | Quant  | Sample Size (Requests) | Tokens/sec
2x RTX 4090 | llama3:8b-instruct-fp16 | 16-bit | 20                     | 184.68
2x RTX 4090 | llama3:8b-instruct-fp16 | 16-bit | 300                    | 1053.51

4090s are currently renting for $0.54/hr each on Runpod community cloud, so by my math, in one hour I could service 3.79M tokens on these two cards for a total of $1.08. Not bad at all! The only problem with this setup is that I can’t run the higher-end models because I don’t have enough GPU VRAM. To run llama3-70b at 16-bit precision, I would need 141GB total, or 75GB at 8-bit.
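
Spelling out that back-of-the-envelope math:

```python
# Back-of-the-envelope math from the numbers above.
tok_per_s = 1053.51          # 2x RTX 4090, 300-request run
rental_per_gpu_hr = 0.54     # Runpod community cloud, $/hr per 4090
num_gpus = 2

tokens_per_hour = tok_per_s * 3600
cost_per_hour = rental_per_gpu_hr * num_gpus
print(f"{tokens_per_hour / 1e6:.2f}M tokens/hr for ${cost_per_hour:.2f}")
print(f"-> ${cost_per_hour / (tokens_per_hour / 1e6):.3f} per 1M tokens")
```

That works out to under $0.30 per 1M tokens at full utilization.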

At https://ollama.com/library/llama3 you can see there are 68 different tags for llama3, because people want to run these models on systems with different amounts of GPU VRAM, or with system DRAM for CPU inference (which is much slower).
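
A rough rule of thumb for why those tags exist: weight memory is roughly parameter count times bits per weight divided by 8, before you add KV cache and runtime overhead. A quick sketch:

```python
# Rough weight-memory footprint per quantization level: params * bits / 8,
# ignoring KV cache and runtime overhead (the real numbers land a bit higher).
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8  # billions of params * bytes/param = GB

for bits in (16, 8, 4):
    print(f"llama3 8B  @ {bits:>2}-bit: ~{weight_gb(8, bits):5.1f} GB")
    print(f"llama3 70B @ {bits:>2}-bit: ~{weight_gb(70, bits):5.1f} GB")
```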
