KV cache is the unwritten cost line on every long-context deployment. Above 32K tokens, KV memory becomes a significant fraction of parameter memory; above 128K it exceeds it; at 1M tokens, KV cache eats 70-90% of available GPU VRAM and 60-85% of wall-clock time per token.
That makes KV optimization the single biggest cost lever in 2026 production inference. Five technique families do the work: paged attention (vLLM's memory-management substrate), prefix caching (vLLM and SGLang RadixAttention), attention-layer compression (MQA, GQA, DeepSeek's MLA), sliding-window attention, and KV-cache quantization (INT8 and FP8). Used together, they collapse long-context inference cost by 4-40×.
This engineering guide covers each technique's mechanics, the workloads where it pays off, and the measured throughput / cost numbers across vLLM 0.7, SGLang, and TensorRT-LLM at 32K, 128K, and 512K context.
- 01 — Above 128K context, KV cache memory is bigger than parameter memory. On a Llama 70B baseline at 1M context, KV cache hits ~135 GB at FP16, on par with the 140 GB of FP16 model weights. The hot path moves from compute-bound to memory-bound, and optimization priorities flip.
- 02 — Paged attention (vLLM) is the substrate, not an optimization — it's required, not optional. Without paged attention, GPU memory fragments rapidly under variable-length batches and you waste 30-50% of available VRAM. Every 2026 production inference stack has paged attention by default; the question is what you layer on top.
- 03 — Prefix caching gives 85-95% cost savings on cache hits — it's the highest-leverage application optimization. vLLM's prefix cache and SGLang's RadixAttention both hash prompt prefixes and reuse the KV state. On agent loops, multi-tenant SaaS, repo Q&A, and long-doc workflows, hit rates of 60-85% are achievable, dropping per-call cost by 5-12×.
- 04 — DeepSeek's MLA is the architectural endgame for KV efficiency in 2026. Multi-head Latent Attention compresses KV cache 7-14× by storing a low-rank projection instead of full keys and values. DeepSeek V2/V3/V4 build on it; competing models are still on GQA at 4-8× compression. MLA is the reason V4-Pro can run 1M context economically.
- 05 — FP8 KV cache halves memory with sub-1% accuracy regression — turn it on. INT8 KV cache costs 1.5-3 points on long-context retrieval (NIAH-2 multi-needle); FP8 costs 0.3-0.7 points — within noise for most production workloads. The 50% memory savings translate to 30-50% throughput gains via larger batch size at the same VRAM budget.
01 — Why It Matters
KV cache is the biggest line item in long-context cost.
The KV cache stores per-layer key and value tensors for every token in the context window so that subsequent tokens can attend to them without recomputing. Memory grows linearly with context length, layer count, and head count — and at frontier scale, the growth is faster than parameter count growth. The result: above 128K tokens, KV memory exceeds parameter memory on most architectures.
[Chart] KV cache size · 1M context across attention variants (source: vLLM benchmarks · DeepSeek V2 paper · Apr 2026)
Two reads matter. First: at 1M context with naive MHA + FP16, KV cache rivals parameter memory — the model literally cannot hold its own context on the same GPU as its weights without help. Second: stacking optimizations compounds. MLA + FP8 together drop KV from 135 GB to 8 GB — a 17× reduction. That is the difference between not being able to serve 1M context at all and serving it economically.
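The per-token arithmetic is worth having at your fingertips. The sketch below computes KV cache size from layer count, KV-head count, head dimension, and context length. The 70B-class shape and the 512-dim MLA latent are illustrative assumptions, not any vendor's published config, so expect the absolute totals above to differ with the exact architecture and serving setup.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: float) -> float:
    """Full-attention KV cache: a K and a V vector per KV head, per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def mla_cache_bytes(n_layers: int, latent_dim: int,
                    n_tokens: int, bytes_per_elem: float) -> float:
    """MLA-style cache: one low-rank latent per layer per token
    (the small decoupled RoPE key is ignored here for simplicity)."""
    return n_layers * latent_dim * bytes_per_elem * n_tokens


if __name__ == "__main__":
    GiB = 2 ** 30
    ctx = 128_000  # 128K-token context
    # Hypothetical 70B-class shape: 80 layers, 8 KV heads, head_dim 128.
    # Substitute your model's real config; published totals vary with it.
    print(f"GQA, FP16: {kv_cache_bytes(80, 8, 128, ctx, 2) / GiB:6.1f} GiB")
    print(f"GQA, FP8 : {kv_cache_bytes(80, 8, 128, ctx, 1) / GiB:6.1f} GiB")
    print(f"MLA, FP8 : {mla_cache_bytes(80, 512, ctx, 1) / GiB:6.1f} GiB")
```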
02 — Paged Attention
Paged attention is the substrate everything else builds on.
Paged attention (introduced by vLLM in 2023) treats KV cache memory like virtual memory in an operating system — divided into fixed-size blocks, allocated on demand, and indirected through a block table. Without it, KV memory fragments rapidly under variable-length batches: a finished request frees blocks in the middle of physical memory, and the resulting holes are rarely contiguous enough to host a new long request. Effective utilization drops to 50-65% under typical bursty traffic.
Paged attention solves this by removing the contiguity requirement entirely. Each request's KV blocks can be scattered across physical memory, with the attention kernel reading through a page-table indirection. The trade is a small per-token compute overhead (~2-5%) for a dramatic gain in effective utilization (95%+) and the ability to support continuous batching cleanly.
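A toy block-table allocator makes the indirection concrete. This is an illustration of the idea only: names like PagedKVAllocator are invented for the sketch, and the real vLLM block manager additionally handles prefix sharing, copy-on-write, preemption, and the GPU-side kernel that reads KV through the table.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is also 16)


@dataclass
class BlockTable:
    """Logical-to-physical block mapping for one request."""
    physical_blocks: list[int] = field(default_factory=list)
    num_tokens: int = 0


class PagedKVAllocator:
    """Toy allocator: fixed-size blocks handed out from a free list."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))

    def append_token(self, table: BlockTable) -> None:
        # Allocate a new physical block only when the current one fills up.
        if table.num_tokens % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted; preempt or swap a request")
            table.physical_blocks.append(self.free_blocks.pop())
        table.num_tokens += 1

    def free(self, table: BlockTable) -> None:
        # Blocks return to the pool individually; no contiguity is required.
        self.free_blocks.extend(table.physical_blocks)
        table.physical_blocks.clear()
        table.num_tokens = 0


allocator = PagedKVAllocator(num_physical_blocks=1024)
req = BlockTable()
for _ in range(40):               # 40 tokens -> ceil(40 / 16) = 3 blocks
    allocator.append_token(req)
print(req.physical_blocks)        # [1023, 1022, 1021] -- scattered placement is fine
allocator.free(req)
```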
03 — Prefix Caching
Prefix caching is the highest-leverage application-side win.
Prefix caching reuses the KV state of a shared prompt prefix across multiple calls. If two calls start with the same 200K tokens of system prompt + reference docs and only differ in the final 5K of user content, the cache stores the 200K KV state on the first call and reuses it on every subsequent call — collapsing 200K tokens of attention compute into a memory read.
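In vLLM this is a single engine flag. A minimal offline sketch is below; the model id is illustrative, and the shared prefix only gets cache hits when it is byte-identical across calls and long enough to fill whole KV blocks.

```python
from vllm import LLM, SamplingParams

# Stand-ins for what would be a long, byte-identical shared prefix in production.
SYSTEM_PROMPT = "You are a support assistant for the ACME API.\n"
REFERENCE_DOCS = "<...imagine 200K tokens of stable reference documentation...>\n"
shared_prefix = SYSTEM_PROMPT + REFERENCE_DOCS

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any vLLM-supported model works
    enable_prefix_caching=True,                # hash full KV blocks and reuse them
)
params = SamplingParams(temperature=0.0, max_tokens=256)

# First call pays full prefill on the shared prefix and warms the cache.
llm.generate([shared_prefix + "Q: What changed in v2?"], params)

# Second call only prefills the differing suffix; the prefix KV blocks are hits.
llm.generate([shared_prefix + "Q: Summarize the rate-limit rules."], params)
```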
- Hash-based, automatic, broadly supported. vLLM's automatic prefix caching hashes incoming prompt prefixes and reuses KV blocks when matches are found. Works without application changes; works across requests in the same batch and across batches if the cache is warm. Default for vLLM 0.4+. (Default vLLM)
- Tree-structured, optimal for branching workloads. SGLang's RadixAttention organizes the cache as a radix tree, optimal for branching prompt patterns (multiple completions sharing a prefix, then diverging). Wins on agent loops with multiple candidate paths, multi-turn chat with sibling thread structure, and Monte Carlo decoding. (Branching workloads)
- Application opts in via cache-control breakpoints. The application places explicit cache markers in the prompt; the serving stack honors them as cache breakpoints. More control, slightly more code. This is the pattern Anthropic exposes in its API and the recommended approach for self-hosted multi-tenant SaaS; a sketch follows after this list. (Multi-tenant SaaS)
- Lower-level, NVIDIA-only, more setup friction. TensorRT-LLM exposes KV reuse APIs that integrate with the lower-level engine. Highest peak performance for static workloads, more setup friction. Right call for stable, high-volume production where the prompt structure is locked in. (Stable + max throughput)
"Prefix caching is the only optimization on this list that gets cheaper the longer the context. The math compounds in your favor." — Internal serving-stack notes, May 2026
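For the explicit-marker pattern, the sketch below shows roughly how a cache breakpoint looks against Anthropic's Messages API: a cache_control entry on the last stable content block marks everything up to that point as cacheable. The model id and the placeholder reference text are illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model id
    max_tokens=512,
    system=[
        {"type": "text", "text": "You answer questions about the ACME codebase."},
        {
            "type": "text",
            "text": "<...the long, stable reference material goes here...>",
            # Everything up to and including this block becomes a cache
            # breakpoint; later calls with an identical prefix reuse its KV.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Where is rate limiting enforced?"}],
)
print(response.content[0].text)
```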
04 — Attention Compression
MQA, GQA, MLA — architectural KV compression.
The architectural side of KV optimization compresses the attention layer itself. Three patterns dominate: Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and DeepSeek's Multi-head Latent Attention (MLA). Each trades a degree of attention expressivity for dramatic KV cache savings.
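Before the variant-by-variant breakdown, the Q-head/KV-head asymmetry is easiest to see in code. The minimal grouped-query attention below (PyTorch, causal mask omitted, shapes illustrative) caches only n_kv_heads key/value heads and broadcasts each across its group of query heads at score time; that head ratio is exactly the KV cache saving.

```python
import torch


def gqa_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: [T, n_q_heads, d]; k, v: [T, n_kv_heads, d] -- only k and v are cached.
    Causal masking is omitted to keep the sketch short."""
    group = q.shape[1] // k.shape[1]
    # Broadcast each KV head across its group of query heads at compute time;
    # the cache itself stays n_kv_heads wide, which is the whole point of GQA.
    k = k.repeat_interleave(group, dim=1)               # [T, n_q_heads, d]
    v = v.repeat_interleave(group, dim=1)
    scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v)


T, d, n_q, n_kv = 256, 128, 32, 8                       # 4 query heads per KV head
q = torch.randn(T, n_q, d)
k = torch.randn(T, n_kv, d)                             # cached: 8 heads, not 32
v = torch.randn(T, n_kv, d)
out = gqa_attention(q, k, v)                            # [256, 32, 128]
print(out.shape, f"-> KV cache is {n_q // n_kv}x smaller than MHA at these shapes")
```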
- Multi-Query Attention · 1 KV head shared across all Q heads. Original aggressive compression. All query heads share a single key and value head — 32× compression on a 32-head model. Quality regression at 1-3 points on most tasks; not used in frontier 2026 models except for very latency-sensitive serving. (32× compression · -1 to -3 pts)
- Grouped-Query Attention · 1 KV head per group of Q heads. The 2024-2026 default. Llama 3.1, 3.3, and 4 all use GQA; Llama 3.1 8B, for example, pairs 32 query heads with 8 KV heads. Mistral and Qwen use similar configurations. 4-8× compression at sub-0.5 point quality regression. Right default for non-latent architectures. (4-8× compression · default 2026)
- Multi-head Latent Attention · low-rank projection + RoPE-decoupled K. DeepSeek's contribution. Stores a compressed latent state (typically 512 dims) instead of full K, V tensors; decoupled RoPE-encoded keys handle position. 7-14× compression at <0.2 point regression. Only DeepSeek V2/V3/V4 use it as of Apr 2026. The latent-caching idea is sketched after this list. (7-14× compression · 2026 SOTA)
- Window-bounded attention · each token attends to the last W tokens only. Mistral-family models have used 4K sliding windows. Caps KV memory growth past the window size at the cost of long-range attention. Right for very long contexts where most reasoning is local; wrong for long-document Q&A or code-search workloads where long-range attention matters. (Constant KV · narrow attention)
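And the latent-caching idea referenced in the MLA entry, reduced to its core: cache one small latent per token, and up-project to full K and V only at attention time. This is a sketch of the idea, not DeepSeek's implementation; it omits the decoupled RoPE key and the matrix-absorption trick that keeps the up-projections cheap, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn


class LatentKVCache(nn.Module):
    """MLA-flavored sketch: cache a small latent per token instead of full K/V."""

    def __init__(self, d_model=4096, n_heads=32, head_dim=128, latent_dim=512):
        super().__init__()
        self.down = nn.Linear(d_model, latent_dim, bias=False)             # h -> c (cached)
        self.up_k = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # c -> K
        self.up_v = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # c -> V
        self.n_heads, self.head_dim = n_heads, head_dim

    def compress(self, h: torch.Tensor) -> torch.Tensor:
        # h: [T, d_model] -> latent [T, latent_dim]; this is what goes in the KV cache.
        return self.down(h)

    def expand(self, c: torch.Tensor):
        # Rebuild full K and V on demand at attention time; they are never stored.
        T = c.shape[0]
        k = self.up_k(c).view(T, self.n_heads, self.head_dim)
        v = self.up_v(c).view(T, self.n_heads, self.head_dim)
        return k, v


mla = LatentKVCache()
hidden = torch.randn(1024, 4096)          # 1024 tokens of hidden states
latent = mla.compress(hidden)             # cached: 512 values per token
k, v = mla.expand(latent)                 # transient: 2 x 32 x 128 values per token
print("cached per token:", latent.shape[1],
      "vs", 2 * mla.n_heads * mla.head_dim, "for full K/V at these shapes")
# The real MLA also caches a small RoPE-carrying key per token, which is why the
# practical compression lands in the 7-14x range rather than this raw ratio.
```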
05 — Quantization
KV cache quantization is free money most teams leave on the table.
KV cache quantization stores the K and V tensors at lower precision than the model's compute precision — typically INT8 or FP8 instead of FP16. The savings are direct: 50% memory reduction at FP8/INT8 vs FP16, 75% at INT4. The cost is small accuracy regression — and on FP8, that regression is often within measurement noise.
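The mechanics underneath are just a scale per channel. The sketch below round-trips one cached K tensor through symmetric per-channel INT8 (shapes illustrative); in a serving stack you would not hand-roll this, and vLLM, for example, exposes it as the kv_cache_dtype engine option.

```python
import torch


def quantize_kv_int8(x: torch.Tensor):
    """Symmetric per-channel INT8 quantization of a cached K or V tensor.

    x: [tokens, kv_heads, head_dim]. One fp scale is kept per (head, channel),
    which is what "per-channel scale" refers to in the text.
    """
    scale = x.abs().amax(dim=0).clamp(min=1e-6) / 127.0        # [kv_heads, head_dim]
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize_kv_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale


k = torch.randn(4096, 8, 128)               # stands in for one layer's fp16 K cache
q, scale = quantize_kv_int8(k)
k_hat = dequantize_kv_int8(q, scale)

bytes_fp16 = k.numel() * 2                  # what the fp16 cache would occupy
bytes_int8 = q.numel() + scale.numel() * 2  # int8 values + fp16-sized scales
print(f"memory vs fp16: {bytes_int8 / bytes_fp16:.0%}, "
      f"max abs error: {(k - k_hat).abs().max().item():.4f}")
```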
- Reference precision (FP16). FP16 KV cache is the typical baseline. No quality regression versus the model's training precision; full memory cost. Always run baseline benchmarks at FP16 before evaluating any compression. (Reference)
- Memory reduction (FP8). FP8 (E4M3 or E5M2) at 1 byte per element instead of 2. Quality regression: 0.3-0.7 points on long-context retrieval (NIAH-2 multi-needle), within noise on most application workloads. Default 2026 production setting on H100 and newer. (Right default)
- Memory reduction (INT8). INT8 with per-channel scale. Slightly more aggressive quality regression than FP8 (1.5-3 points on long-context tasks) because INT8 has less dynamic range. Useful on hardware without native FP8 support; otherwise prefer FP8. (Pre-Hopper hardware)
Compounding: KV quantization stacks cleanly with attention compression. A DeepSeek V4-Pro deployment running MLA at FP8 sees 7× (MLA) × 2× (FP8) = 14× total KV cache compression versus a naive MHA-at-FP16 baseline — the difference between 135 GB and 10 GB at 1M context on a 70B-class footprint. That ratio is what makes long-context production economical at all.
06 — Decision
Picking techniques by workload class.
Most teams shouldn't treat these as separate decisions. The right starting point is "all five together, then back off anything that breaks your eval." The architectural choices (MQA vs GQA vs MLA) are pinned by model selection; the runtime choices (paged attention, prefix cache, FP8 quant) are knobs in the serving stack.
- Long-document Q&A · static reference. Single corpus, repeated questions, low cache invalidation. Stack: paged + prefix cache (24h TTL via explicit markers) + FP8 KV. Architectural: any GQA or MLA model. Wins are 80-95% per-call savings after prime. (All 5 layers · MLA-class model)
- Multi-tenant SaaS knowledge base. Many corpora (per-tenant), bursty hit-rate. Stack: paged + prefix cache (per-tenant cache markers) + FP8 KV. Avoid sliding window. SGLang RadixAttention often wins here over vLLM due to the branching pattern. (SGLang + per-tenant markers)
- Long-running agent loops. Single session, growing tool history. Stack: paged + sliding-window-style prefix cache + FP8 KV. Re-anchor at 30-50 turns to keep the prefix matchable. SGLang RadixAttention helps with branching candidate paths. (Re-anchored prefix cache)
- Highly dynamic / short-context. Chat without long reference, fresh content per call. Stack: paged + FP8 KV. Skip prefix cache (low hit rate). Sliding window OK. Architecture: GQA is fine; MLA's advantage shrinks below 32K. (Paged + FP8 only)
07 — Conclusion
KV is the real long-context cost.
The model picks the floor; the serving stack picks the ceiling.
By April 2026 the technique landscape is settled: paged attention is the substrate, prefix caching is the high-leverage application optimization, GQA or MLA picks the architectural floor, and FP8 KV is the free 50% memory savings every team should already have on. Used together, these compound to 4-40× cost reduction on long-context inference — the difference between "1M context is a marketing claim" and "1M context is in production."
The deeper move is to stop thinking about KV cache as a byproduct of inference and start thinking about it as the primary cost variable for any deployment that touches more than 32K tokens. Once the team is measuring KV memory consumption and KV bandwidth as first-class metrics, the right optimizations become obvious — and the wrong ones (eagerly raising batch size, heroic parameter quantization) stop being suggested.
The next architectural compression is already in flight: DeepSeek's CSA + HCA in V4 brings MLA-style compression into a sparser substrate, and the Mamba-MoE hybrids point to sub-1 GB KV at 1M context within 12 months. Build the muscle now to take advantage of it when it ships.