Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Block quote

Ordered list

Item 1
Item 2
Item 3

Unordered list

Item A
Item B
Item C

Text link

Bold text

Emphasis

^Superscript

_Subscript

When deploying large language models on Runpod, choosing the right inference framework can dramatically impact both performance and cost efficiency. While vLLM has dominated the high-throughput inference space, SGLang emerges as the clear winner for a specific but increasingly important use case: multi-turn conversations with shared context.

Understanding Multi-Turn Conversations and Caching

Most production AI applications handle complex, multi-turn interactions where context builds over time, such as customer support chatbots, coding assistants, or educational tutoring systems. Both vLLM and SGLang recognize that reprocessing identical context repeatedly is wasteful, but they solve this problem differently.

vLLM's Automatic Prefix Caching vs SGLang's RadixAttention

vLLM's Automatic Prefix Caching (APC):

Caches exact prefix matches using block-level storage
Requires identical token sequences to trigger cache hits
Optimized for batch inference where multiple requests share exact prefixes
Works best with templated prompts and structured batch processing
Manual configuration often needed for optimal cache utilization

SGLang's RadixAttention:

Uses a radix tree structure for more flexible prefix matching
Automatically detects and caches partial overlaps in conversation context
Designed for dynamic multi-turn conversations with evolving context
Zero configuration - automatically optimizes cache usage patterns
Better handles branching conversations and varied interaction patterns

The Key Difference: Static vs Dynamic Optimization

The fundamental difference lies in their design philosophy:

vLLM excels when you can predict and structure your caching patterns. If you're running batch inference on templated prompts or have consistent request patterns, vLLM's APC provides excellent performance with precise control.

SGLang shines in unpredictable, dynamic scenarios where conversation flows vary. Its radix tree approach automatically discovers caching opportunities that would require manual optimization in vLLM.

Prompting example

Here's an example of what setting the stage for a multi-turn prompt might look like:

‍

And here's the multi-turn prompting we will be doing for our tests:

‍You can see that during this test, it successively builds on each prior example. Notice how each user question is completely different:

"what would you recommend for training a large language model on a budget?"
"how would you optimize inference costs for a production deployment?"
"What are the key considerations for distributed training mentioned in the context?"

RadixAttention's tree structure elegantly handles this pattern. It caches the shared system context once, then efficiently processes each unique user query. The cache hit covers the expensive part (processing thousands of tokens of technical context), while only the small user queries need fresh computation.

Testing parameters

To get some benchmarking numbers, I used 2X H100 SXM pods in Secure Cloud, using deepseek-ai/DeepSeek-R1-Distill-Llama-70B. Running the prompt, I get the following results.

Large Context RadixAttention Analysis

Test Type	Duration	Prompt Tokens	Response Tokens	Speed (tok/s)
7k context, fresh	5.093s	6,835	150	29.5
7k context, cache	4.287s	6,829	150	35.0
7k context, cache 2	4.295s	6,824	150	34.9
Small context	4.154s	72	150	36.1

So, hitting the cache on a 7k size prompt results in about a ~20% increase, and roughly matching the speed on a small prompt with no context at all.

vLLM Benchmark Results

Test Type	Duration	Prompt Tokens	Response Tokens	Speed (tok/s)
7k context, fresh	5.253s	6,801	150	28.6
7k context, cache	4.572s	6,801	150	32.8
7k context, cache 2	4.510s	6,801	150	33.3
Small context	4.124s	241	149	36.1

The results paint a picture - on fresh context, the two engines are relatively equally matched. However, RadixAttention gives a clear benefit in the larger multi-turn conversations, especially when the cache is involved, giving about a 10% boost over vLLM at the same context loads. These benchmark results translate to significant cost savings in production scenarios. Consider a customer support chatbot handling 1,000 conversations per hour, where each conversation averages 5 turns with substantial context. The 10-20% performance improvement from SGLang's RadixAttention means significant compute save, especially in a serverless environment where compute is paid for by the second.

When to Choose Each Framework

Choose SGLang when:

Building conversational AI applications with unpredictable dialog flows
Handling customer support, tutoring, or coding assistance use cases
Working with large context windows that vary between conversations
Prioritizing zero-configuration optimization
Deploying applications where conversation context frequently overlaps but isn't identical

Choose vLLM when:

Running batch inference with predictable, templated prompts
Handling high-throughput scenarios with exact prefix matches
Need fine-grained control over caching behavior
Working with structured workflows where request patterns are consistent
Prioritizing maximum throughput over conversation-specific optimizations

See for yourself

I've written a Jupyter notebook and some handy scripts that let you run the same prompt against both engines and it will tell you which is better for your GPU spec and your prompt, using the same real world hardware that you'll be using in production. Download it from GitHub here.

The script provides detailed metrics for each engine. Here's an example for a simple one-shot prompt (Do an in-depth historical analysis of the Declaration of Independence)‍

Engine	Duration	Tokens Generated	Speed (tok/s)
vLLM 🥇	17.06s	~1024	60.0
SGLang	15.49s	~817	52.7

Note: vLLM was 1.1× faster in tokens per second.

‍

So in this one-shot case, vLLM actually came up on top - so clearly, it's not as simple as just picking one package or another every single time.

Conclusion

We offer both SGLang and vLLM in our Quick Deploy endpoints - but to date, we haven't been super clear about why you would want to use one over the other. We want to empower you to choose the right one for the job. Both engines are very powerful and flexible, but the truth is there's going to be one tool that's better for any particular job, and we want to empower you to choose the right one for the job. Now, we have a notebook that will help you do that.

‍

Author

Brendan McKeag

Date

June 11, 2025

Table of contents

TOC

Get started

When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse

Understanding Multi-Turn Conversations and Caching

vLLM's Automatic Prefix Caching vs SGLang's RadixAttention

vLLM's Automatic Prefix Caching (APC):

Caches exact prefix matches using block-level storage
Requires identical token sequences to trigger cache hits
Optimized for batch inference where multiple requests share exact prefixes
Works best with templated prompts and structured batch processing
Manual configuration often needed for optimal cache utilization

SGLang's RadixAttention:

Uses a radix tree structure for more flexible prefix matching
Automatically detects and caches partial overlaps in conversation context
Designed for dynamic multi-turn conversations with evolving context
Zero configuration - automatically optimizes cache usage patterns
Better handles branching conversations and varied interaction patterns

The Key Difference: Static vs Dynamic Optimization

The fundamental difference lies in their design philosophy:

Prompting example

Here's an example of what setting the stage for a multi-turn prompt might look like:

‍

And here's the multi-turn prompting we will be doing for our tests:

‍You can see that during this test, it successively builds on each prior example. Notice how each user question is completely different:

"what would you recommend for training a large language model on a budget?"
"how would you optimize inference costs for a production deployment?"
"What are the key considerations for distributed training mentioned in the context?"

Testing parameters

To get some benchmarking numbers, I used 2X H100 SXM pods in Secure Cloud, using deepseek-ai/DeepSeek-R1-Distill-Llama-70B. Running the prompt, I get the following results.

Large Context RadixAttention Analysis

Test Type	Duration	Prompt Tokens	Response Tokens	Speed (tok/s)
7k context, fresh	5.093s	6,835	150	29.5
7k context, cache	4.287s	6,829	150	35.0
7k context, cache 2	4.295s	6,824	150	34.9
Small context	4.154s	72	150	36.1

So, hitting the cache on a 7k size prompt results in about a ~20% increase, and roughly matching the speed on a small prompt with no context at all.

vLLM Benchmark Results

Test Type	Duration	Prompt Tokens	Response Tokens	Speed (tok/s)
7k context, fresh	5.253s	6,801	150	28.6
7k context, cache	4.572s	6,801	150	32.8
7k context, cache 2	4.510s	6,801	150	33.3
Small context	4.124s	241	149	36.1

When to Choose Each Framework

Choose SGLang when:

Building conversational AI applications with unpredictable dialog flows
Handling customer support, tutoring, or coding assistance use cases
Working with large context windows that vary between conversations
Prioritizing zero-configuration optimization
Deploying applications where conversation context frequently overlaps but isn't identical

Choose vLLM when:

Running batch inference with predictable, templated prompts
Handling high-throughput scenarios with exact prefix matches
Need fine-grained control over caching behavior
Working with structured workflows where request patterns are consistent
Prioritizing maximum throughput over conversation-specific optimizations

See for yourself

The script provides detailed metrics for each engine. Here's an example for a simple one-shot prompt (Do an in-depth historical analysis of the Declaration of Independence)‍

Engine	Duration	Tokens Generated	Speed (tok/s)
vLLM 🥇	17.06s	~1024	60.0
SGLang	15.49s	~817	52.7

Note: vLLM was 1.1× faster in tokens per second.

‍

So in this one-shot case, vLLM actually came up on top - so clearly, it's not as simple as just picking one package or another every single time.

Conclusion

‍

When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

Understanding Multi-Turn Conversations and Caching

vLLM's Automatic Prefix Caching vs SGLang's RadixAttention

The Key Difference: Static vs Dynamic Optimization

Prompting example

Testing parameters

Large Context RadixAttention Analysis

vLLM Benchmark Results

When to Choose Each Framework

See for yourself

Conclusion

When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse

Understanding Multi-Turn Conversations and Caching

vLLM's Automatic Prefix Caching vs SGLang's RadixAttention

The Key Difference: Static vs Dynamic Optimization

Prompting example

Testing parameters

Large Context RadixAttention Analysis

vLLM Benchmark Results

When to Choose Each Framework

See for yourself

Conclusion

Runpod Serverless Pricing Update

Introducing Serverless CPU: High-Performance VMs Without GPUs

Spin up a Text Generation Pod with Vicuna and Experience a GPT-4 Rival

Build what’s next.

When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

Understanding Multi-Turn Conversations and Caching

vLLM's Automatic Prefix Caching vs SGLang's RadixAttention

The Key Difference: Static vs Dynamic Optimization

Prompting example

Testing parameters

Large Context RadixAttention Analysis

vLLM Benchmark Results

When to Choose Each Framework

See for yourself

Conclusion

When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse

Understanding Multi-Turn Conversations and Caching

vLLM's Automatic Prefix Caching vs SGLang's RadixAttention

The Key Difference: Static vs Dynamic Optimization

Prompting example

Testing parameters

Large Context RadixAttention Analysis

vLLM Benchmark Results

When to Choose Each Framework

See for yourself

Conclusion

Related articles.

Runpod Serverless Pricing Update

Introducing Serverless CPU: High-Performance VMs Without GPUs

Spin up a Text Generation Pod with Vicuna and Experience a GPT-4 Rival

Build what’s next.

You’ve unlocked areferral bonus!

You’ve unlocked a
referral bonus!