Design high-performance model serving systems that deliver consistent AI capabilities at enterprise scale.
Prerequisites: This guide assumes familiarity with Kubernetes (pods, deployments, CRDs), basic GPU infrastructure concepts, and REST API design. You will need a GPU-enabled Kubernetes cluster (or cloud account with GPU instances), kubectl access, and a Hugging Face account to access gated models. No prior LLM serving experience is required.
Model serving architecture is the structural design that governs how software components, infrastructure, and operational patterns work together to expose a trained AI model as a production-ready API, covering everything from model loading and request routing to GPU memory management, autoscaling, and observability. Every production stack is organized into three layers: the inference engine that executes the model, the serving layer that exposes it as an API, and the orchestration layer that scales and operates it.
Getting these three layers to work together reliably is now the defining challenge in production AI, because the other barriers have fallen. Open-weight models like Llama 4, DeepSeek R1, and Qwen 3 now rival proprietary frontier models in quality, while H100 cloud pricing has dropped from approximately $8/hr in 2024 to $1.99–$2.85/hr depending on provider. With capable models and affordable hardware both widely accessible, the bottleneck has shifted from building great models to serving them at scale.
This guide walks through each layer of the serving stack with practical strategies for building systems that hold up under production traffic. Architecture decisions start with the model, so before examining each layer, we need to look at what teams are actually deploying, because model size, architecture, and modality constrain every choice downstream.
What models are teams serving in production today?
Five open-weight families account for the majority of production LLM workloads in 2026, based on deployment volume and community adoption:
- Llama 4 (April 2025) fields three Mixture of Experts (MoE) variants: Scout (109B total, 17B active, 10M-token context), Maverick (400B total, 17B active, 128 experts), and Behemoth (2T total, 288B active), all natively multimodal via early fusion.
- DeepSeek R1 (671B MoE, 37B active) proved open-weight reasoning can match proprietary frontier models. Its MoE routing and long chain-of-thought outputs make it a recurring benchmark reference for distributed inference architectures.
- Qwen 3 spans 0.6B to 235B across dense and MoE variants, covering everything from edge deployment to frontier-scale inference.
- Gemma 3 (up to 27B) is sized for single-GPU deployment.
- Mistral Large 3 (675B total, 41B active, Apache 2.0) joins the heavyweight MoE tier.
The spread here matters. A 7B dense model fits on a single GPU with basic batching; a 671B MoE model like DeepSeek R1 requires expert parallelism, disaggregated serving, and multi-node orchestration. Multimodal models like Llama 4 add another axis: vision encoders are compute-bound while text decoders are memory-bound, a split that shapes how inference engines schedule work. These differences drive every architectural decision that follows, starting with the inference engine.
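The weight footprint alone tells you which deployment class you are in. A back-of-envelope sketch (weights only; KV cache, activations, and runtime overhead typically add 20% or more on top):

```python
def weight_memory_gb(params_billions, bytes_per_param):
    """Weight-only memory footprint in GB: billions of parameters
    times bytes per parameter. Excludes KV cache and activations."""
    return params_billions * bytes_per_param

# 7B dense at FP16 (2 bytes/param): ~14 GB -> fits a single 80GB H100.
# DeepSeek R1 at FP8 (1 byte/param): ~671 GB of expert weights must stay
# resident across GPUs even though only 37B are active per token.
```

This is why the MoE "active parameters" number is misleading for capacity planning: routing selects a subset per token, but every expert must be loaded.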
What are the key inference engines and optimization techniques?
The inference engine is the lowest layer of the stack: it loads model weights onto accelerators, executes forward passes, and manages the KV cache that makes autoregressive generation possible. In early 2026, three open-source engines handle the bulk of production workloads. All three share a common baseline of continuous batching, PagedAttention-based KV cache management, and multi-GPU parallelism, but they diverge on optimization strategy, hardware affinity, and ecosystem integration.
vLLM (v0.17.1) is among the most widely deployed open-source LLM inference engines. Beyond the shared baseline, it adds prefix caching, chunked prefill, quantization across FP8, GPTQ, AWQ, and GGUF, and mature multimodal serving for text, image, video, and audio models. Its main differentiator is hardware breadth: NVIDIA GPUs, AMD ROCm, Intel XPU, Google TPU, and emerging accelerators via plugins. On an H100 cluster, vLLM achieves approximately 12,500 tokens/sec throughput on batched workloads (benchmarked on Llama 3.1 8B with 1,000 ShareGPT prompts). Model size, GPU count, quantization, and concurrency significantly affect absolute numbers, so treat these figures as directional.
The following example launches Llama 4 Maverick on vLLM; all three engines expose OpenAI-compatible endpoints with similar launch patterns.
# Prerequisite: accept the Llama 4 license at https://huggingface.co/meta-llama/Llama-4-Maverick-17B-Instruct
# then set your Hugging Face token:
export HF_TOKEN="hf_your_token_here"
# Launch vLLM with Llama 4 Maverick on 8xH200 with FP8
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-4-Maverick-17B-Instruct \
--tensor-parallel-size 8 \
--quantization fp8 \
--enable-prefix-caching \
--max-model-len 131072

Authentication: For gated models like Llama 4, you must first create a Hugging Face token and accept the model’s license agreement before the token grants download access. vLLM reads the token from the HF_TOKEN environment variable.

SGLang (v0.5.9) has emerged as a strong alternative, reportedly used by organizations including xAI (Grok 3), AMD, NVIDIA, LinkedIn, and Cursor. Its core innovations are RadixAttention for automatic prefix caching and a zero-overhead CPU scheduler. SGLang achieves approximately 16,200 tokens/sec in benchmarks, outperforming vLLM by roughly 29% in multi-turn scenarios (under equivalent benchmark conditions; performance varies by workload). It supports structured generation with zero overhead and TPU-native execution via its SGLang-Jax backend. FlashAttention-4 integration is in active development.
NVIDIA TensorRT-LLM (v1.2.0) provides the highest raw throughput on NVIDIA hardware through deep graph optimization, kernel fusion, and hardware-specific acceleration. The v1.2.0 release refined the PyTorch-based architecture introduced in v1.0 as the default workflow. Key features include speculative decoding (EAGLE-3, N-gram, MTP), disaggregated serving, FP8 and NVFP4 quantization, and optimized performance for DeepSeek-R1 on Blackwell GPUs. Multiblock Attention delivers significant throughput improvement for long sequences, and NVSwitch MultiShot enables 3x faster AllReduce for distributed inference.
Unlike the engines above, NVIDIA Dynamo (announced at GTC 2025) is not an inference engine itself. It is a distributed inference orchestration layer that sits on top of vLLM, TensorRT-LLM, or SGLang and coordinates work across GPU pools. Its defining innovation is disaggregated prefill and decode: rather than running both phases on the same GPUs, Dynamo routes compute-bound prefill and memory-bound decode to separate pools that scale independently. Combined with NIXL (NVIDIA Inference Transfer Library) for accelerated KV cache transfer between GPUs, dynamic GPU scheduling, and KV-cache-aware routing, the Blackwell hardware and Dynamo software stack delivers up to 30x throughput improvement on Llama models compared to previous-generation Hopper GPUs.
Triton Inference Server, now branded as Dynamo-Triton, continues to serve as a multi-framework model server supporting TensorRT, PyTorch, ONNX, OpenVINO, and more, with dynamic batching, concurrent execution, and ensemble model support.
For local and edge deployment, Ollama and llama.cpp provide lightweight serving with the GGUF quantization format, OpenAI-compatible APIs, and support for a wide range of hardware from NVIDIA GPUs to Apple Silicon via MLX.
Engine-level optimization
Choosing an inference engine is only the first decision. How you configure quantization, attention kernels, and decoding strategy determines how much performance you actually extract from the hardware.
Quantization has become essential for cost-effective production inference. Each of the major formats below targets a specific hardware generation or deployment profile; the notes cover the technical details that matter when choosing between them.
- FP8 uses E4M3 (forward pass) and E5M2 (gradients) encodings with native hardware support on Hopper and Ada. Supported across vLLM, SGLang, and TensorRT-LLM.
- NVFP4 delivers approximately 1.8x memory reduction versus FP8 on Blackwell GPUs. Quantization-Aware Distillation (QAD) recovers most of the accuracy gap.
- AWQ won Best Paper at MLSys 2024. It preserves salient weights at higher precision, which is why it edges out GPTQ on perplexity.
- GPTQ uses Hessian-based error compensation for INT4 weight-only quantization. Mature ecosystem support.
- GGUF binary format supports 2-bit to 8-bit quantization via llama.cpp and Ollama.
Quantization reduces memory and compute per token. Speculative decoding attacks the other bottleneck: generating tokens faster. It uses a small draft model to generate candidate tokens that the full model verifies in parallel. EAGLE-3 is the leading algorithm, achieving 1.5-2.5x speedups in production, with checkpoints available across Llama 4 and Qwen3 model families via Red Hat’s Speculators project (v0.2.0), which standardizes speculator model packaging for vLLM. DeepSeek models take a different approach with native Multi-Token Prediction (MTP), integrated directly into vLLM and SGLang.
The gains are workload-dependent. Speculative decoding works best on tasks where the target model’s outputs are predictable from a smaller draft: greedy or low-temperature generation, repetitive coding tasks, and structured outputs. For high-temperature creative generation or highly diverse prompt distributions, draft acceptance rates can drop below 50%, reducing or eliminating the benefit. Always benchmark on a representative sample of your production traffic before enabling it, monitoring the spec_decode_draft_acceptance_rate metric exposed by vLLM.
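The arithmetic behind this is worth internalizing. Under the standard speculative sampling analysis (per-token acceptance rate alpha, draft length gamma), the expected number of tokens emitted per target-model forward pass is (1 − alpha^(gamma+1)) / (1 − alpha). A quick sketch:

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target-model forward pass, given
    per-token draft acceptance rate alpha and draft length gamma
    (from the standard speculative decoding analysis)."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# 80% acceptance with a 4-token draft: ~3.36 tokens per verification
# pass. At 50% acceptance the same draft yields only ~1.94, so the
# extra draft-model work can erase the speedup entirely.
```

This is why the acceptance-rate metric is the first thing to check when speculative decoding underperforms.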
The third optimization lever is the attention computation itself. FlashAttention reduces memory usage and speeds up transformer inference by restructuring how attention is computed on GPU hardware. FlashAttention-2 remains the stable production default for broad hardware compatibility (Ampere, Ada). FlashAttention-3 is optimized for Hopper GPUs, achieving 1.5-2x speedup over FA2 with BF16. It reaches up to 740 TFLOPs/s in FP16 (75% utilization on H100 SXM5 per the original paper) and is the production standard for H100/H200 deployments. FlashAttention-4, written in CuTe-DSL and optimized for Blackwell, delivers up to 30% faster than cuDNN attention. The initial release targets forward-pass workloads, with variable-length and GQA/MQA support under active development; use selectively behind benchmarking and fallbacks.
Multi-model serving
Not every deployment runs a single large model. NVIDIA’s Multi-Instance GPU (MIG) on A100, H100, and H200 partitions a single GPU into up to 7 isolated instances with dedicated VRAM, cache, and compute cores, enabling one GPU to serve multiple small models simultaneously. For LoRA/PEFT adapters, KV-cache-aware routing (supported in llm-d v0.5 and the Kubernetes Gateway API Inference Extension) dynamically loads fine-tuned adapters onto base model servers, enabling hundreds of customized models from shared infrastructure.
Engine selection and optimization determine what is possible at the inference level, but even the fastest engine is only as useful as the API that exposes it to production traffic.
How do you design model serving APIs that handle production traffic?
The serving layer sits between clients and the inference engine. It defines the API contract, validates and routes requests, manages connections, and handles failures. Where the engine layer optimizes computation, this layer optimizes the path requests take to reach that computation and how responses get back to clients. The patterns described here assume H100/H200-class hardware (with Blackwell B200 as the emerging flagship) as the baseline; the hardware section that follows covers how different accelerators constrain these decisions. Security concerns (authentication, rate limiting, input validation) are addressed in the Security section later in this guide.
API contract and request flow
The OpenAI API format (/v1/chat/completions, /v1/completions, /v1/embeddings) has become the universal interface for LLM serving. vLLM, SGLang, and TensorRT-LLM all expose these endpoints natively, which means existing OpenAI SDKs work as drop-in clients against any of them. For multi-model or multi-provider setups, LLM gateways like LiteLLM (supporting 100+ providers) or the Envoy AI Gateway provide unified routing, cost tracking, and fallback logic across backends.
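Because the schema is shared, a request body built once works against any of these backends; only the base URL and credentials change. A minimal sketch of the payload (model name and parameter values are examples):

```python
import json

def chat_request(model, user_message, stream=True):
    """Build an OpenAI-style /v1/chat/completions body; the same
    payload can be POSTed to vLLM, SGLang, or TensorRT-LLM servers."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,   # token-by-token delivery over SSE
        "max_tokens": 512,  # example cap; tune per workload
    })

body = chat_request("meta-llama/Llama-4-Maverick-17B-Instruct", "Hello")
# POST this body to http://<host>:8000/v1/chat/completions
```

Swapping backends then becomes a routing decision at the gateway, not a client code change.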
Requests flow through a pipeline of validation, dispatch, and response formatting. On the input side, this means token count estimation before dispatch (rejecting requests that exceed the model’s context window), prompt injection detection, and PII filtering at the gateway layer. On the output side, streaming via Server-Sent Events (SSE) is now standard for LLM workloads, enabling token-by-token delivery that reduces perceived latency. SGLang’s structured generation capabilities support grammar-based output formatting and guaranteed JSON schema compliance, which is critical for agentic and tool-calling workflows.
Rate limiting deserves special attention for LLM APIs. Token-based rate limiting (capping total tokens consumed per API key per minute, rather than raw request counts) is the current best practice, supported natively by the Envoy AI Gateway and LiteLLM’s virtual key management. A 50-token request and a 50,000-token request consume vastly different GPU resources; token-based limits account for this difference.
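A sliding-window limiter keyed on tokens rather than requests can be sketched in a few lines (a simplified illustration, not any gateway's actual implementation):

```python
import time

class TokenRateLimiter:
    """Sliding-window limiter: caps tokens consumed per API key per
    minute instead of raw request counts. Simplified illustration."""

    def __init__(self, tokens_per_minute):
        self.limit = tokens_per_minute
        self.events = {}  # api_key -> list of (timestamp, tokens)

    def allow(self, api_key, tokens, now=None):
        now = time.monotonic() if now is None else now
        # Keep only events inside the trailing 60-second window
        window = [(t, n) for t, n in self.events.get(api_key, [])
                  if now - t < 60]
        if sum(n for _, n in window) + tokens > self.limit:
            self.events[api_key] = window
            return False
        window.append((now, tokens))
        self.events[api_key] = window
        return True
```

Under this scheme a 50,000-token request correctly consumes 1,000x the budget of a 50-token request, instead of counting as one request each.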
Batching and caching
Continuous batching (also called iteration-level batching) is the universal standard across all major inference engines. Unlike static batching, it allows new requests to join a running batch as soon as any in-flight request completes, eliminating idle GPU cycles. Combined with PagedAttention (which manages KV cache in fixed-size pages analogous to OS virtual memory), continuous batching in vLLM and SGLang can improve throughput by 3-10x over naive request-at-a-time serving.
Caching operates at multiple levels. Prefix caching reuses KV cache entries for shared prompt prefixes (system prompts, few-shot examples, repeated instructions) across requests. In SGLang, RadixAttention handles this automatically; in vLLM, you must explicitly enable it via the --enable-prefix-caching flag. For multi-turn chat, prefix caching across conversation turns combined with intelligent context window management (summarizing older turns) keeps memory usage sustainable without sacrificing conversation quality. At the application layer, semantic caching of complete inference results for identical or near-identical inputs can further reduce costs (Portkey, arXiv). KV cache quantization (FP8 in SGLang and vLLM, or q8_0/q4_0 in llama.cpp) reduces memory pressure at the engine level.
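The payoff of prefix caching is easy to quantify: the reusable KV entries are the shared leading tokens, rounded down to the cache's block granularity, since paged KV caches allocate and match at block level. A sketch (block size and token values are illustrative):

```python
def reusable_prefix_tokens(new_prompt, cached_prompt, block_size=16):
    """Tokens whose KV entries can be reused from a cached request:
    the shared leading run, rounded down to block granularity."""
    shared = 0
    for a, b in zip(new_prompt, cached_prompt):
        if a != b:
            break
        shared += 1
    return (shared // block_size) * block_size

# Two requests sharing a 500-token system prompt reuse 496 tokens of
# prefill at block_size=16, paying compute only for the remainder.
```

The larger the shared system prompt relative to the user turn, the closer prefill cost approaches zero for repeat traffic.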
Connection management
HTTP/2 SSE is the standard protocol for OpenAI-compatible streaming endpoints and works with all major clients without modification. WebSockets are better suited for bidirectional, session-oriented use cases like voice agents or interactive coding assistants. For streaming LLM responses, maintain persistent connections to avoid per-token connection overhead. Set client-side read timeouts to at least 300 seconds for long-generation requests, and configure keep-alive intervals of 15-30 seconds to prevent midstream disconnects through proxies and load balancers.
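On the wire, each streamed token arrives as a `data:` line carrying a JSON chunk, and the stream terminates with `data: [DONE]`. A minimal client-side parser for this format:

```python
import json

def parse_sse_stream(raw):
    """Collect content deltas from an OpenAI-style SSE response body.
    Each event is a `data: <json>` line; `data: [DONE]` ends it."""
    chunks = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:  # first chunk may carry only the role
            chunks.append(delta["content"])
    return chunks
```

Real clients read these lines incrementally off the socket; the parsing logic is the same.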
Reliability and error handling
Circuit breakers prevent cascading failures by temporarily disabling unresponsive backends. LLM gateways like Helicone provide health-aware routing with automatic circuit breaking: if a model server stops responding, the gateway reroutes traffic to healthy replicas.
Graceful degradation keeps the system functional under pressure. Strategies include falling back to smaller quantized models (e.g., from a 70B FP16 model to its 4-bit AWQ variant), reducing maximum output length under load, or routing to a cached response service for common queries.
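A fallback chain is straightforward to express as an ordered list of backends tried in sequence (backend names and error handling here are hypothetical):

```python
def call_with_fallback(backends, prompt):
    """Try each (name, callable) backend in priority order, e.g.
    full model -> quantized variant -> cached-response service."""
    failures = []
    for name, fn in backends:
        try:
            return name, fn(prompt)
        except RuntimeError as exc:
            failures.append((name, str(exc)))
    raise RuntimeError("all backends failed: %s" % failures)
```

In production the "backends" would be HTTP calls behind circuit breakers, but the control flow is the same.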
Health probes are essential for Kubernetes-based deployments. vLLM exposes a /health endpoint; configure startup probes with long timeouts to accommodate model loading, readiness probes that verify GPU warmth and model availability, and liveness probes for ongoing health verification.
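A sketch of the corresponding probe block for a vLLM container (port 8000 and thresholds are illustrative; tune the startup probe's failureThreshold to your measured model load time):

```yaml
# Pod spec fragment: probes against vLLM's /health endpoint
startupProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 60     # allow up to ~10 minutes for model loading
readinessProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 5
livenessProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 3
```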
The serving layer defines how requests reach the engine and how responses get back to clients. API design patterns operate within the constraints set by the underlying accelerator hardware, which determines which models fit, at what batch sizes, and with what latency profile.
What hardware powers production inference in 2026?
The hardware options for AI inference have expanded. Choosing the right accelerator depends on model size, latency requirements, throughput targets, and budget.
NVIDIA H100 remains the workhorse of production inference, with cloud pricing ranging from approximately $1.99/hr to $3.00/hr. Its 80GB HBM3 memory, Transformer Engine with FP8 support, and mature software ecosystem make it the safe default for most deployments.
NVIDIA H200 offers 141GB HBM3e and 4.8 TB/s bandwidth on the same Hopper die, roughly 1.8x the H100’s 80GB capacity and 1.4x its memory bandwidth. It delivers approximately 1.8x faster inference for transformer models. Cloud pricing starts around $2.32/hr on providers like Vast.ai, though spot pricing fluctuates frequently.
NVIDIA B200 (Blackwell) entered production in early 2026. With 180GB HBM3e per chip, 8 TB/s memory bandwidth, and approximately 18 PFLOPS FP4 with sparsity (roughly 5x the H100’s FP8 with sparsity), Blackwell represents a generational leap. The second-generation Transformer Engine supports FP4 precision natively. The GB200 NVL72 rack-scale system (72 Blackwell GPUs connected via NVLink) delivers 1.4 ExaFLOPS FP4 with sparsity with 13.5 TB HBM3e and 17 TB LPDDR5X memory. Supply remains constrained through mid-2026.
AMD MI300X (192GB HBM3) provides competitive inference performance as an increasingly viable alternative to NVIDIA hardware, with potential TCO advantages depending on workload. ROCm 7.x (currently v7.2.0) delivers 3.5x average training and 3x average inference performance uplift over its predecessor, with support for vLLM, PyTorch 2.7, and leading model families. The MI350 series (CDNA 4, up to 288GB HBM3e) and the MI400 series are expanding AMD’s footprint.
Google TPU v6e (Trillium) delivers 4x performance improvement over its predecessor with 67% better energy efficiency. Google’s latest Ironwood (v7) TPUs extend this trajectory further, competing directly with Blackwell and AWS Trainium2 for next-generation inference workloads. Google trained Gemini 3 entirely on its own TPU v5e and v6e pods, validating TPUs as a viable pathway for frontier-scale model development.
AWS Trainium2 is generally available, claiming 30–40% better price-performance than GPU-based H100 instances. AWS has deployed nearly 500,000 Trainium2 chips for Anthropic’s model training. AWS continues to offer its Inferentia line for inference workloads alongside Trainium.
Specialized inference accelerators have also gained traction. Cerebras CS-3 targets high-throughput inference on 120B+ parameter models and has secured a $10B+ commitment from OpenAI for inference workloads. SambaNova SN40L demonstrated high-throughput inference on full DeepSeek-R1 671B with only 16 chips (BusinessWire, SambaNova).
Choosing hardware for your workload comes down to matching model memory footprint, latency targets, throughput requirements, and budget against the accelerators above.
Hardware constraints shape deployment topology. Parallelism strategies and orchestration patterns build on the GPU memory, bandwidth, and interconnect characteristics covered above.
What are the best production deployment patterns?
The orchestration layer handles scaling, health, and resource allocation. It determines how many engine instances run, where they run, and how traffic moves between them. This section covers deployment strategies, Kubernetes-native tooling, model loading, serverless options, and parallelism.
Blue-green deployment enables zero-downtime model updates by maintaining two identical environments and switching traffic atomically once the new version passes all quality gates. In KServe, maintain two InferenceService resources (e.g., llm-blue and llm-green) and switch traffic by updating the InferenceModel targetModels weight (0% to 100%) only after the new service’s readiness probe passes. Account for model startup time in the rollout gate: the new version must be fully loaded and warm before the switch.
Canary releases gradually route traffic to new model versions (typically 5% to 10% to 25% to 50% to 100%) while monitoring latency, error rates, guardrail triggers, and quality scores at each stage. KServe provides native traffic splitting; LLM gateways like Portkey support weighted load balancing for canary testing via configuration without application code changes. Automatic rollback triggers if metrics degrade beyond defined thresholds.
Shadow deployments duplicate live traffic to the new version without affecting users, enabling output comparison and quality evaluation before promotion. A/B testing capabilities enable scientific comparison of different model versions or serving configurations.
Kubernetes-native LLM serving
Kubernetes has become the dominant orchestration platform for LLM inference. The CNCF Annual Cloud Native Survey (published January 2026) reports that approximately 66% of organizations hosting generative AI models use Kubernetes for some or all inference workloads.
KServe (v0.15), now a CNCF incubating project, provides the standard Kubernetes-native model serving framework with first-class generative AI support. It integrates with KEDA for LLM-specific autoscaling, the Envoy AI Gateway for token rate limiting and dynamic model routing, and Knative for scale-to-zero GPU workloads.
# Minimal KServe InferenceService for vLLM
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama4-maverick
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: kserve-vllm
      storageUri: "pvc://model-store/llama4-maverick-fp8"
      resources:
        limits:
          nvidia.com/gpu: "8"

Storage prerequisite: pvc://model-store/llama4-maverick-fp8 uses KServe’s PVC URI format (pvc://<pvc-name>/<path-on-volume>) and references a PersistentVolumeClaim named model-store with model weights pre-loaded at the path llama4-maverick-fp8/. To populate this PVC, use an init-container job that pulls weights from the Hugging Face Hub via huggingface-cli download, or use KServe’s built-in ClusterStorageContainer with model caching. See the KServe PVC storage docs for complete setup instructions.
Once model servers are running, the next problem is routing traffic to them. Traditional round-robin load balancing is inadequate for LLM workloads because request shapes are non-uniform: a 100-token prompt and a 10,000-token prompt consume vastly different resources.
The Kubernetes Gateway API Inference Extension solves this with LLM-aware routing primitives (InferencePool groups identical model server backends; InferenceModel defines routing priorities), routing requests based on KV cache utilization, pending queue depth, and prefix affinity to minimize time-to-first-token. In practice, this cache-aware routing achieves lower TTFT than round-robin approaches. The extension also supports native LoRA/PEFT dynamic adapter loading.
llm-d (founded by Google, NVIDIA, IBM Research, and CoreWeave, with Red Hat as a major contributor) builds on these primitives to provide a full distributed inference stack. It wires together vLLM as the model server, the Kubernetes Inference Gateway as the control plane, and Envoy proxy for load balancing. Version 0.5 (February 2026) added hierarchical KV offloading, cache-aware LoRA routing, and scale-to-zero autoscaling.
Models that span multiple GPUs or nodes need additional scheduling primitives. LeaderWorkerSet treats pod groups as a single unit for 400B+ parameter models, and Dynamic Resource Allocation (DRA) (GA in Kubernetes v1.34) automates GPU and TPU scheduling via a claim-based model, replacing the static device plugin approach.
# KEDA ScaledObject targeting vLLM concurrent requests
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    name: llama4-maverick
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_num_requests_running
        query: avg(vllm:num_requests_running{model="llama4-maverick"})
        threshold: "12"

Note: The scaleTargetRef must include apiVersion and kind when targeting a KServe InferenceService (a custom resource). Omitting these fields causes KEDA to default to targeting a Deployment, which will silently fail to scale. Custom resource scaling has been supported since KEDA v2.10; current stable is v2.19.0. See the KEDA ScaledObject spec for details on custom resource targets.
The key is choosing the right scaling signal. CPU and memory utilization are poor proxies for LLM load; scale instead on concurrent active requests (vllm:num_requests_running), KV cache utilization, or token generation rate. As a starting point, target 8-16 concurrent requests per replica for 7B-13B models on a single high-end GPU, dropping to 4-8 for 70B models with multi-GPU tensor parallelism. For workloads with variable prompt lengths, scaling on KV cache utilization (e.g., trigger scale-out at 80% cache fill) tends to be more stable than request count alone. Profile your actual TTFT distribution under load and tune from there.
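The replica math an autoscaler performs from this signal is simple enough to sanity-check by hand (targets and bounds below use this guide's starting-point numbers and are illustrative):

```python
import math

def desired_replicas(active_requests, target_per_replica, lo=1, hi=10):
    """Replica count implied by the concurrency signal, clamped to
    the autoscaler's bounds (mirrors KEDA's metric/threshold math)."""
    return max(lo, min(hi, math.ceil(active_requests / target_per_replica)))

# 50 concurrent requests at a target of 12 per replica -> 5 replicas
```

With scale-to-zero enabled, `lo` drops to 0 and the cold-start mitigation strategies below become load-bearing.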
Model loading and startup optimization
Large language models like Llama 4 Behemoth (2T total parameters) can take significant time to load, making warm pools and intelligent caching essential for responsive service. Modern approaches include advanced model caching (supported in KServe v0.15), GKE secondary boot disk caching that delivers up to 29x faster container load times compared to pulling from a registry, and persistent volume weight preloading. Quantized formats, particularly FP8 on Hopper/Blackwell GPUs and GGUF for CPU/hybrid inference, reduce both load times and memory footprints.
Serverless and managed inference
Serverless inference platforms eliminate infrastructure management entirely. RunPod offers FlashBoot technology with cold starts in the 500ms to 2-second range, per-second billing with zero egress fees, and multi-GPU workers up to 4x80GB, with H100 PCIe instances starting at $2.39/hr. Modal provides true scale-to-zero with a Rust-based container stack for ultra-fast spin-ups. Together AI and Fireworks AI offer managed inference for 200+ open-source models with pre-optimized serving configurations.
For cold start mitigation on serverless platforms specifically, pre-built TensorRT engines and ReadWriteMany PVCs complement the model loading strategies described above.
Multi-GPU parallelism strategies
When a model exceeds a single GPU’s memory, the parallelism strategy you choose determines your cluster topology and node configuration.
- Tensor Parallelism (TP): Splits layer weights across GPUs within a node. Requires NVLink. Best for latency-sensitive serving. Set TP equal to the number of GPUs per node.
- Pipeline Parallelism (PP): Splits layers across nodes. Adds bubble overhead, so use m>=3 microbatches to keep it under 25%. Best for throughput-oriented batch workloads; avoid PP=4+ for interactive use cases.
- Expert Parallelism (EP): Distributes MoE experts across GPUs (DeepSeek R1, Llama 4 Maverick). Combined EP+PP configurations improve interactivity with minimal throughput loss.
- Context Parallelism (CP): Distributes sequence activations across GPUs for long-context inference (1M+ tokens), critical for Llama 4 Scout’s 10M-token context window.
A common production configuration for a 70B dense model on 2 nodes of 8 GPUs each: TP=8, PP=2. For MoE models, wide-EP in the decode phase with data parallelism in the prefill phase. NVIDIA Dynamo can dynamically reallocate GPUs between these pools based on real-time demand.
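A first-order memory check makes these configurations concrete: weights divide across the TP x PP grid (assuming even sharding, and ignoring KV cache, activations, and unsharded small layers):

```python
def weights_per_gpu_gb(total_params_b, bytes_per_param, tp, pp):
    """First-order estimate of per-GPU weight memory when weights
    shard evenly across a TP x PP grid."""
    return total_params_b * bytes_per_param / (tp * pp)

# 70B dense in FP16 on TP=8 x PP=2 (16 GPUs): ~8.75 GB of weights per
# GPU, leaving most of an 80GB H100 for KV cache and batch headroom.
```

The headroom left after weights is what determines maximum batch size and context length, which is why parallelism degree and throughput are coupled decisions.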
Deployment patterns are only as reliable as the monitoring that watches them.
How do you monitor and optimize model serving in production?
Monitoring LLM serving differs from traditional API monitoring because the metrics that matter are model-specific: token throughput, KV cache pressure, and generation latency.
Prometheus and Grafana dashboards should track the key LLM-specific metrics: time-to-first-token (TTFT) p99, inter-token latency (ITL), KV cache utilization, request queue depth, GPU utilization, GPU memory usage, and aggregate tokens/sec throughput. vLLM exposes these metrics natively via its Prometheus endpoint. Essential PromQL queries for getting started:
```promql
# p99 time-to-first-token
histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))

# KV cache utilization (leading indicator of queue buildup)
vllm:gpu_cache_usage_perc

# Running request queue depth
vllm:num_requests_running
```

Alert when p99 TTFT exceeds your SLA threshold sustained for 5 minutes, or when KV cache utilization exceeds 85%. vLLM’s Grafana dashboard template on GitHub provides a ready-to-deploy starting point.
For end-to-end visibility across client, API gateway, proxy, and inference server, OpenTelemetry-based distributed tracing is the standard. Specialized LLM observability platforms (Langfuse, Helicone, Fiddler AI, and Datadog’s LLM Monitoring) add token-level cost attribution, quality scoring, and prompt analytics.
Beyond infrastructure metrics, tracking business-level signals (model accuracy, user satisfaction, conversion rates) connects serving performance to outcomes. For LLM output drift detection, traditional tabular drift metrics (Evidently, Alibi Detect) are not directly applicable. Teams typically measure LLM-specific drift via embedding distribution shifts, output quality regression via an LLM-as-judge scorer, or downstream task metric degradation. Platforms like Fiddler AI and Arize Phoenix support LLM-native drift detection.
Cost management and optimization
GPU utilization below 60% during peak hours typically signals over-provisioned infrastructure or suboptimal batching configuration. The most impactful cost levers for LLM inference in 2026:
- Quantization: Moving from FP16 to FP8 cuts memory in half with negligible quality loss; INT4 (AWQ/GPTQ) reduces it by 4x.
- Scale-to-zero: KServe + KEDA or serverless platforms (RunPod Flex Workers) eliminate 100% of idle costs.
- Spot/preemptible instances: Up to 60% savings for fault-tolerant workloads.
- Prefix caching: Eliminates redundant computation for shared system prompts, saving significant prefill compute.
- Disaggregated serving: Independently scaling prefill and decode phases via NVIDIA Dynamo or llm-d optimizes GPU utilization for each phase’s distinct compute profile.
For capacity planning, NVIDIA Dynamo’s Planner component uses pre-deployment profiling to evaluate how TP configuration and batching parameters affect performance, recommending configurations that meet TTFT/ITL targets within a given GPU budget.
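A simple unit-economics check ties these levers to a dollar figure (using the H100 price and vLLM throughput numbers quoted earlier; your own numbers will differ):

```python
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_sec):
    """Serving cost per 1M generated tokens at sustained throughput.
    Ignores idle time, which scale-to-zero exists to eliminate."""
    tokens_per_hour = tokens_per_sec * 3600.0
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1e6

# One H100 at $1.99/hr sustaining 12,500 tok/s: ~$0.044 per 1M tokens.
# Halve utilization and the effective cost per token doubles.
```

Running this against your actual utilization curve is the fastest way to see whether quantization, batching, or scale-to-zero is the lever worth pulling first.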
Monitoring and cost data also inform how you secure the serving stack.
Security, compliance, and guardrails
Security is a cross-cutting concern that applies to every layer of the serving architecture.
Authentication and authorization start at the API gateway. Virtual API key management (supported by LiteLLM and Portkey) enables per-team spend tracking and rate limiting, while controlling access for different user types and applications.
Beyond access control, production LLM deployments require guardrails on both sides of the inference call: input guardrails (regex-based PII detection, jailbreak detection, prompt injection prevention) and output guardrails (toxicity filtering, sensitive data exposure detection, schema validators). Teams typically implement guardrails at two layers: input filtering at the API gateway (fast regex/classifier checks before the request reaches the inference server) and output filtering as a post-processing step before returning the response to the client.
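The gateway-layer input pass is typically a fast regex sweep, with classifier-based checks layered behind it. An illustrative sketch (patterns are simplified examples, not production-grade PII detection):

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text):
    """Gateway-layer regex sweep: replace matches with a placeholder
    and report which categories fired."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            hits.append(label)
            text = pattern.sub("[%s]" % label.upper(), text)
    return text, hits
```

The same shape works on the output side, with the redaction applied before the response (or each streamed chunk) reaches the client.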
For programmable policy-based guardrails, NVIDIA NeMo Guardrails runs as a separate container in the same pod (sidecar pattern) or as an independent service, using its Colang policy language to intercept both requests and responses. A minimal configuration defines rails for input and output:
```yaml
# config.yml - NeMo Guardrails minimal configuration
models:
  - type: main
    engine: vllm
    model: meta-llama/Llama-4-Maverick-17B-Instruct

rails:
  input:
    flows:
      - check jailbreak
      - check input sensitive data
  output:
    flows:
      - check output sensitive data
```

Synchronous guardrail checks add 5-30ms per request depending on classifier complexity; consider async evaluation for non-blocking output filtering on latency-sensitive paths. For complete setup and Colang policy examples, see the NeMo Guardrails quickstart. The OWASP LLM Top 10 (2025) provides a useful threat model for deciding which rails to prioritize, with prompt injection, sensitive data leakage, and system prompt leakage at the top of the list.
Finally, audit logging ties the security layer together. Track API usage, model versions, and system changes to support compliance reporting and incident investigation. Tag every request with model_name, model_version, dataset_id, run_id, tenant_id, region, and endpoint so you can reconstruct any inference event after the fact.
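A structured audit record carrying those tags might look like the following (field names follow the list above; the values are illustrative):

```python
import json
import datetime

def audit_record(model_name: str, model_version: str, dataset_id: str, run_id: str,
                 tenant_id: str, region: str, endpoint: str, status: str) -> str:
    """Emit one inference event as a JSON line for the audit log."""
    return json.dumps({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_name": model_name,
        "model_version": model_version,
        "dataset_id": dataset_id,
        "run_id": run_id,
        "tenant_id": tenant_id,
        "region": region,
        "endpoint": endpoint,
        "status": status,
    })

# Illustrative values only:
line = audit_record("llama-4-maverick", "2026-01-12", "ds-support-kb",
                    "run-8f3a", "team-billing", "us-east-1",
                    "/v1/chat/completions", "ok")
```

Emitting one JSON line per request keeps the log greppable and lets a log pipeline (Loki, CloudWatch, BigQuery) filter by tenant or model version without custom parsing.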
With the full serving stack in place, the question is no longer any single component choice but how well these layers work together.
Conclusion
Architecture decisions in model serving are deeply interdependent. Choosing an engine determines which quantization formats and parallelism strategies are available. Hardware selection constrains which models fit and at what batch sizes. Deployment patterns dictate how the system behaves under failure and during updates. Monitoring closes the loop by feeding real-time data back into scaling and routing decisions. No single component choice matters as much as the coherence of the full stack.
Change in this space remains intense. Disaggregated serving (Dynamo, llm-d) is redefining how GPU pools are managed. New quantization formats (NVFP4) are unlocking performance on Blackwell hardware. Kubernetes-native primitives (Gateway API Inference Extension, DRA) are making LLM serving a first-class citizen in the orchestration layer. Teams that invest in understanding these building blocks, rather than treating model serving as a solved deployment problem, will have the infrastructure advantage that compounds over time.
For teams looking to deploy these architectures on managed infrastructure, RunPod offers optimized GPU instances including H100, H200, and B200 with per-second billing and zero egress fees.
FAQ
Q: What is model serving architecture?
A: Model serving architecture is the complete system of software, infrastructure, and design patterns that takes a trained AI model and makes it available as a reliable, scalable production service. It is typically organized into three layers: an inference engine that executes model computations on accelerator hardware (e.g., vLLM, TensorRT-LLM, SGLang), a serving layer that manages request routing, batching, and API contracts (e.g., KServe, LiteLLM, Envoy AI Gateway), and an orchestration layer that handles scaling, health monitoring, and resource allocation (e.g., Kubernetes + KEDA, llm-d, GKE Inference Gateway). A well-designed model serving architecture delivers low-latency predictions to concurrent users while optimizing hardware utilization and cost.
Q: What response times should I expect from a well-optimized model serving API?
A: Response times vary significantly by model size, hardware, quantization, and concurrency. As a general reference, production LLM serving in 2026 targets TTFT in the low hundreds of milliseconds for standard workloads and inter-token latency in the tens of milliseconds during generation (BentoML, Databricks). Simple classification and embedding models respond in 10–50ms. Large reasoning models (DeepSeek R1) may take seconds for the thinking phase before output begins. On specialized inference hardware like Cerebras CS-3, output throughput can be significantly higher than GPU-based alternatives for supported model sizes (benchmarks).
Q: How many concurrent requests can a single GPU handle for LLM serving?
A: With continuous batching and PagedAttention, concurrency depends on model size, quantization, and the ratio of input to output tokens. As a starting point, expect 8-16 concurrent requests per replica for 7B-13B models on a single high-end GPU, dropping to 4-8 for 70B models with multi-GPU tensor parallelism. MIG partitioning on H100/H200 enables serving multiple small models on a single GPU with isolated resources. Profile your actual workload to tune these thresholds.
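One way to sanity-check those numbers is a KV-cache budget estimate. For a standard transformer, the KV cache per token is 2 × num_layers × num_kv_heads × head_dim × bytes_per_value; dividing the memory left after weights by the per-request cache gives a ceiling on concurrent sequences. This is a rough upper bound from memory alone; real limits also come from latency targets and compute saturation:

```python
def max_concurrent_requests(free_mem_gb: float, num_layers: int, num_kv_heads: int,
                            head_dim: int, seq_len: int, bytes_per_value: int = 2) -> int:
    """Upper bound on concurrent sequences from KV-cache memory alone."""
    # 2x accounts for both the key and value tensors at every layer.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    kv_bytes_per_request = kv_bytes_per_token * seq_len
    return int(free_mem_gb * 1e9 // kv_bytes_per_request)

# Example: an 8B-class model with grouped-query attention
# (32 layers, 8 KV heads, head_dim 128), 60 GB free after weights,
# 8K-token sequences, FP16 KV cache.
print(max_concurrent_requests(60, 32, 8, 128, 8192))
```

The memory ceiling usually lands well above the practical concurrency limit, which is set by the inter-token-latency budget; profiling under real traffic remains the final word.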
Q: What’s the difference between synchronous and asynchronous model serving?
A: Synchronous serving provides immediate streaming responses and is standard for interactive chat and code completion. Asynchronous (queue-based) serving accepts requests immediately and delivers results through callbacks or polling, ideal for batch processing, long-running reasoning tasks, and workloads that tolerate higher latency. RunPod Serverless, for example, supports both patterns with queue-based scaling for async and load-balanced endpoints for sync.
Q: How should I choose between vLLM, SGLang, and TensorRT-LLM?
A: vLLM (v0.17.1) offers the broadest hardware support (NVIDIA, AMD, Intel, TPU), a large ecosystem, and mature production tooling; it is the default choice for most deployments. SGLang (v0.5.9) excels in multi-turn and structured generation scenarios with higher throughput in certain benchmarks, and is a strong choice for agentic and programming-heavy workloads. TensorRT-LLM (v1.2.0) delivers the highest raw throughput on NVIDIA hardware but requires more complex setup and is NVIDIA-only. For orchestration across engines, NVIDIA Dynamo provides engine-agnostic disaggregated serving.
Q: What infrastructure is needed for high-availability model serving?
A: High-availability LLM serving requires multiple inference replicas across availability zones, LLM-aware load balancing (KV-cache-aware routing via GKE Inference Gateway, Envoy, or llm-d), comprehensive health monitoring with startup/readiness/liveness probes, KEDA-based autoscaling on LLM-specific metrics, and automatic failover with circuit breakers. Plan for 2–3x capacity during peak loads. Canary deployments with automatic rollback prevent bad model updates from affecting production traffic.
Q: How do I optimize costs for LLM serving while maintaining performance?
A: The highest-impact cost optimizations: (1) Quantize: FP8 on Hopper/Blackwell halves memory with negligible quality loss; INT4 (AWQ) cuts it by 4x. (2) Scale to zero with KServe + KEDA or serverless platforms to eliminate idle GPU costs entirely. (3) Enable prefix caching to avoid redundant computation on shared system prompts. (4) Use disaggregated serving (Dynamo or llm-d) to independently scale prefill and decode phases, optimizing utilization for each. (5) Right-size hardware: an H200 may be more cost-effective than two H100s for memory-bound models due to its larger capacity enabling higher batch sizes.

