KV cache is the unwritten cost line on every long-context deployment. Above 32K tokens, KV memory becomes a significant fraction of parameter memory; above 128K it exceeds it; at 1M tokens, KV cache eats 70-90% of available GPU VRAM and 60-85% of wall-clock time per token.
That makes KV optimization the single biggest cost lever in 2026 production inference. Five technique families do the work: paged attention (vLLM's memory-management substrate), prefix caching (vLLM and SGLang RadixAttention), attention-layer compression (MQA, GQA, DeepSeek's MLA), sliding-window attention, and KV-cache quantization (INT8 and FP8). Used together, they collapse long-context inference cost by 4-40×.
This engineering guide covers each technique's mechanics, the workloads where it pays off, and the measured throughput / cost numbers across vLLM 0.7, SGLang, and TensorRT-LLM at 32K, 128K, and 512K context.
- 01 — Above 128K context, KV cache memory is bigger than parameter memory. On a Llama 70B baseline at 1M context, KV cache hits ~135 GB at FP16, on par with the 140 GB of FP16 model weights. The hot path moves from compute-bound to memory-bound, and optimization priorities flip.
- 02 — Paged attention (vLLM) is the substrate, not an optimization — it's required, not optional. Without paged attention, GPU memory fragments rapidly under variable-length batches and you waste 30-50% of available VRAM. Every 2026 production inference stack has paged attention by default; the question is what you layer on top.
- 03 — Prefix caching gives 85-95% cost savings on cache hits — it's the highest-leverage application optimization. vLLM's prefix cache and SGLang's RadixAttention both hash prompt prefixes and reuse the KV state. On agent loops, multi-tenant SaaS, repo Q&A, and long-doc workflows, hit rates of 60-85% are achievable, dropping per-call cost by 5-12×.
- 04 — DeepSeek's MLA is the architectural endgame for KV efficiency in 2026. Multi-head Latent Attention compresses KV cache 7-14× by storing a low-rank projection instead of full keys and values. DeepSeek V2/V3/V4 build on it; competing models are still on GQA at 4-8× compression. MLA is the reason V4-Pro can run 1M context economically.
- 05 — FP8 KV cache halves memory with sub-1% accuracy regression — turn it on. INT8 KV cache costs 1.5-3 points on long-context retrieval (NIAH-2 multi-needle); FP8 costs 0.3-0.7 points — within noise for most production workloads. The 50% memory savings translate to 30-50% throughput gains via larger batch size at the same VRAM budget.
01 — Why It Matters
KV cache is the biggest line item in long-context cost.
The KV cache stores per-layer key and value tensors for every token in the context window so that subsequent tokens can attend to them without recomputing. Memory grows linearly with context length, layer count, and head count — and at frontier scale, the growth is faster than parameter count growth. The result: above 128K tokens, KV memory exceeds parameter memory on most architectures.
[Chart] KV cache size · 1M context across attention variants (source: vLLM benchmarks · DeepSeek V2 paper · Apr 2026)
Two reads matter. First: at 1M context with naive MHA + FP16, KV cache rivals parameter memory — the model literally cannot hold its own context on the same GPU as its weights without help. Second: stacking optimizations compounds. MLA + FP8 together drop KV from 135 GB to 8 GB — a 17× reduction. That is the difference between not being able to serve 1M context at all and serving it economically.
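The per-token arithmetic is worth having at your fingertips. The sketch below computes KV cache size from layer count, KV-head count, head dimension, and context length. The 70B-class shape and the 512-dim MLA latent are illustrative assumptions, not any vendor's published config, so expect the absolute totals above to differ with the exact architecture and serving setup.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: float) -> float:
    """Full-attention KV cache: a K and a V vector per KV head, per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def mla_cache_bytes(n_layers: int, latent_dim: int,
                    n_tokens: int, bytes_per_elem: float) -> float:
    """MLA-style cache: one low-rank latent per layer per token
    (the small decoupled RoPE key is ignored here for simplicity)."""
    return n_layers * latent_dim * bytes_per_elem * n_tokens


if __name__ == "__main__":
    GiB = 2 ** 30
    ctx = 128_000  # 128K-token context
    # Hypothetical 70B-class shape: 80 layers, 8 KV heads, head_dim 128.
    # Substitute your model's real config; published totals vary with it.
    print(f"GQA, FP16: {kv_cache_bytes(80, 8, 128, ctx, 2) / GiB:6.1f} GiB")
    print(f"GQA, FP8 : {kv_cache_bytes(80, 8, 128, ctx, 1) / GiB:6.1f} GiB")
    print(f"MLA, FP8 : {mla_cache_bytes(80, 512, ctx, 1) / GiB:6.1f} GiB")
```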
02 — Paged Attention
Paged attention is the substrate everything else builds on.
Paged attention (introduced by vLLM in 2023) treats KV cache memory like virtual memory in an operating system — divided into fixed-size blocks, allocated on demand, and indirected through a block table. Without it, KV memory fragments rapidly under variable-length batches: a finished request frees blocks in the middle of physical memory, and the resulting holes are rarely contiguous enough to host a new long request. Effective utilization drops to 50-65% under typical bursty traffic.
Paged attention solves this by removing the contiguity requirement entirely. Each request's KV blocks can be scattered across physical memory, with the attention kernel reading through a page-table indirection. The trade is a small per-token compute overhead (~2-5%) for a dramatic gain in effective utilization (95%+) and the ability to support continuous batching cleanly.
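A toy block-table allocator makes the indirection concrete. This is an illustration of the idea only: names like PagedKVAllocator are invented for the sketch, and the real vLLM block manager additionally handles prefix sharing, copy-on-write, preemption, and the GPU-side kernel that reads KV through the table.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is also 16)


@dataclass
class BlockTable:
    """Logical-to-physical block mapping for one request."""
    physical_blocks: list[int] = field(default_factory=list)
    num_tokens: int = 0


class PagedKVAllocator:
    """Toy allocator: fixed-size blocks handed out from a free list."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))

    def append_token(self, table: BlockTable) -> None:
        # Allocate a new physical block only when the current one fills up.
        if table.num_tokens % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted; preempt or swap a request")
            table.physical_blocks.append(self.free_blocks.pop())
        table.num_tokens += 1

    def free(self, table: BlockTable) -> None:
        # Blocks return to the pool individually; no contiguity is required.
        self.free_blocks.extend(table.physical_blocks)
        table.physical_blocks.clear()
        table.num_tokens = 0


allocator = PagedKVAllocator(num_physical_blocks=1024)
req = BlockTable()
for _ in range(40):               # 40 tokens -> ceil(40 / 16) = 3 blocks
    allocator.append_token(req)
print(req.physical_blocks)        # [1023, 1022, 1021] -- scattered placement is fine
allocator.free(req)
```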
03 — Prefix Caching
Prefix caching is the highest-leverage application-side win.
Prefix caching reuses the KV state of a shared prompt prefix across multiple calls. If two calls start with the same 200K tokens of system prompt + reference docs and only differ in the final 5K of user content, the cache stores the 200K KV state on the first call and reuses it on every subsequent call — collapsing 200K tokens of attention compute into a memory read.
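In vLLM this is a single engine flag. A minimal offline sketch is below; the model id is illustrative, and the shared prefix only gets cache hits when it is byte-identical across calls and long enough to fill whole KV blocks.

```python
from vllm import LLM, SamplingParams

# Stand-ins for what would be a long, byte-identical shared prefix in production.
SYSTEM_PROMPT = "You are a support assistant for the ACME API.\n"
REFERENCE_DOCS = "<...imagine 200K tokens of stable reference documentation...>\n"
shared_prefix = SYSTEM_PROMPT + REFERENCE_DOCS

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any vLLM-supported model works
    enable_prefix_caching=True,                # hash full KV blocks and reuse them
)
params = SamplingParams(temperature=0.0, max_tokens=256)

# First call pays full prefill on the shared prefix and warms the cache.
llm.generate([shared_prefix + "Q: What changed in v2?"], params)

# Second call only prefills the differing suffix; the prefix KV blocks are hits.
llm.generate([shared_prefix + "Q: Summarize the rate-limit rules."], params)
```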
- Hash-based, automatic, broadly supported. vLLM's automatic prefix caching hashes incoming prompt prefixes and reuses KV blocks when matches are found. Works without application changes; works across requests in the same batch and across batches if the cache is warm. Default for vLLM 0.4+. (Default vLLM)
- Tree-structured, optimal for branching workloads. SGLang's RadixAttention organizes the cache as a radix tree, optimal for branching prompt patterns (multiple completions sharing a prefix, then diverging). Wins on agent loops with multiple candidate paths, multi-turn chat with sibling thread structure, and Monte Carlo decoding. (Branching workloads)
- Application opts in via cache-control breakpoints. The application places explicit cache markers in the prompt; the serving stack honors them as cache breakpoints. More control, slightly more code. This is the pattern Anthropic exposes in its API and the recommended approach for self-hosted multi-tenant SaaS; a sketch follows after this list. (Multi-tenant SaaS)
- Lower-level, NVIDIA-only, more setup friction. TensorRT-LLM exposes KV reuse APIs that integrate with the lower-level engine. Highest peak performance for static workloads, more setup friction. Right call for stable, high-volume production where the prompt structure is locked in. (Stable + max throughput)
"Prefix caching is the only optimization on this list that gets cheaper the longer the context. The math compounds in your favor." — Internal serving-stack notes, May 2026
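For the explicit-marker pattern, the sketch below shows roughly how a cache breakpoint looks against Anthropic's Messages API: a cache_control entry on the last stable content block marks everything up to that point as cacheable. The model id and the placeholder reference text are illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model id
    max_tokens=512,
    system=[
        {"type": "text", "text": "You answer questions about the ACME codebase."},
        {
            "type": "text",
            "text": "<...the long, stable reference material goes here...>",
            # Everything up to and including this block becomes a cache
            # breakpoint; later calls with an identical prefix reuse its KV.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Where is rate limiting enforced?"}],
)
print(response.content[0].text)
```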
04 — Attention Compression
MQA, GQA, MLA — architectural KV compression.
The architectural side of KV optimization compresses the attention layer itself. Three patterns dominate: Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and DeepSeek's Multi-head Latent Attention (MLA). Each trades a degree of attention expressivity for dramatic KV cache savings.
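Before the variant-by-variant breakdown, the Q-head/KV-head asymmetry is easiest to see in code. The minimal grouped-query attention below (PyTorch, causal mask omitted, shapes illustrative) caches only n_kv_heads key/value heads and broadcasts each across its group of query heads at score time; that head ratio is exactly the KV cache saving.

```python
import torch


def gqa_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: [T, n_q_heads, d]; k, v: [T, n_kv_heads, d] -- only k and v are cached.
    Causal masking is omitted to keep the sketch short."""
    group = q.shape[1] // k.shape[1]
    # Broadcast each KV head across its group of query heads at compute time;
    # the cache itself stays n_kv_heads wide, which is the whole point of GQA.
    k = k.repeat_interleave(group, dim=1)               # [T, n_q_heads, d]
    v = v.repeat_interleave(group, dim=1)
    scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v)


T, d, n_q, n_kv = 256, 128, 32, 8                       # 4 query heads per KV head
q = torch.randn(T, n_q, d)
k = torch.randn(T, n_kv, d)                             # cached: 8 heads, not 32
v = torch.randn(T, n_kv, d)
out = gqa_attention(q, k, v)                            # [256, 32, 128]
print(out.shape, f"-> KV cache is {n_q // n_kv}x smaller than MHA at these shapes")
```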
- Multi-Query Attention · 1 KV head shared across all Q heads. Original aggressive compression. All query heads share a single key and value head — 32× compression on a 32-head model. Quality regression at 1-3 points on most tasks; not used in frontier 2026 models except for very latency-sensitive serving. (32× compression · -1 to -3 pts)
- Grouped-Query Attention · 1 KV head per group of Q heads. The 2024-2026 default. Llama 3.1, 3.3, and 4 all use GQA; Llama 3.1 8B, for example, pairs 32 query heads with 8 KV heads. Mistral and Qwen use similar configurations. 4-8× compression at sub-0.5 point quality regression. Right default for non-latent architectures. (4-8× compression · default 2026)
- Multi-head Latent Attention · low-rank projection + RoPE-decoupled K. DeepSeek's contribution. Stores a compressed latent state (typically 512 dims) instead of full K, V tensors; decoupled RoPE-encoded keys handle position. 7-14× compression at <0.2 point regression. Only DeepSeek V2/V3/V4 use it as of Apr 2026. The latent-caching idea is sketched after this list. (7-14× compression · 2026 SOTA)
- Window-bounded attention · each token attends to the last W tokens only. Mistral-family models have used 4K sliding windows. Caps KV memory growth past the window size at the cost of long-range attention. Right for very long contexts where most reasoning is local; wrong for long-document Q&A or code-search workloads where long-range attention matters. (Constant KV · narrow attention)
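And the latent-caching idea referenced in the MLA entry, reduced to its core: cache one small latent per token, and up-project to full K and V only at attention time. This is a sketch of the idea, not DeepSeek's implementation; it omits the decoupled RoPE key and the matrix-absorption trick that keeps the up-projections cheap, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn


class LatentKVCache(nn.Module):
    """MLA-flavored sketch: cache a small latent per token instead of full K/V."""

    def __init__(self, d_model=4096, n_heads=32, head_dim=128, latent_dim=512):
        super().__init__()
        self.down = nn.Linear(d_model, latent_dim, bias=False)             # h -> c (cached)
        self.up_k = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # c -> K
        self.up_v = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # c -> V
        self.n_heads, self.head_dim = n_heads, head_dim

    def compress(self, h: torch.Tensor) -> torch.Tensor:
        # h: [T, d_model] -> latent [T, latent_dim]; this is what goes in the KV cache.
        return self.down(h)

    def expand(self, c: torch.Tensor):
        # Rebuild full K and V on demand at attention time; they are never stored.
        T = c.shape[0]
        k = self.up_k(c).view(T, self.n_heads, self.head_dim)
        v = self.up_v(c).view(T, self.n_heads, self.head_dim)
        return k, v


mla = LatentKVCache()
hidden = torch.randn(1024, 4096)          # 1024 tokens of hidden states
latent = mla.compress(hidden)             # cached: 512 values per token
k, v = mla.expand(latent)                 # transient: 2 x 32 x 128 values per token
print("cached per token:", latent.shape[1],
      "vs", 2 * mla.n_heads * mla.head_dim, "for full K/V at these shapes")
# The real MLA also caches a small RoPE-carrying key per token, which is why the
# practical compression lands in the 7-14x range rather than this raw ratio.
```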
05 — Quantization
KV cache quantization is free money most teams leave on the table.
KV cache quantization stores the K and V tensors at lower precision than the model's compute precision — typically INT8 or FP8 instead of FP16. The savings are direct: 50% memory reduction at FP8/INT8 vs FP16, 75% at INT4. The cost is small accuracy regression — and on FP8, that regression is often within measurement noise.
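The mechanics underneath are just a scale per channel. The sketch below round-trips one cached K tensor through symmetric per-channel INT8 (shapes illustrative); in a serving stack you would not hand-roll this, and vLLM, for example, exposes it as the kv_cache_dtype engine option.

```python
import torch


def quantize_kv_int8(x: torch.Tensor):
    """Symmetric per-channel INT8 quantization of a cached K or V tensor.

    x: [tokens, kv_heads, head_dim]. One fp scale is kept per (head, channel),
    which is what "per-channel scale" refers to in the text.
    """
    scale = x.abs().amax(dim=0).clamp(min=1e-6) / 127.0        # [kv_heads, head_dim]
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize_kv_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale


k = torch.randn(4096, 8, 128)               # stands in for one layer's fp16 K cache
q, scale = quantize_kv_int8(k)
k_hat = dequantize_kv_int8(q, scale)

bytes_fp16 = k.numel() * 2                  # what the fp16 cache would occupy
bytes_int8 = q.numel() + scale.numel() * 2  # int8 values + fp16-sized scales
print(f"memory vs fp16: {bytes_int8 / bytes_fp16:.0%}, "
      f"max abs error: {(k - k_hat).abs().max().item():.4f}")
```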
- Reference precision (FP16). FP16 KV cache is the typical baseline. No quality regression versus the model's training precision; full memory cost. Always run baseline benchmarks at FP16 before evaluating any compression. (Reference)
- Memory reduction (FP8). FP8 (E4M3 or E5M2) at 1 byte per element instead of 2. Quality regression: 0.3-0.7 points on long-context retrieval (NIAH-2 multi-needle), within noise on most application workloads. Default 2026 production setting on H100 and newer. (Right default)
- Memory reduction (INT8). INT8 with per-channel scale. Slightly more aggressive quality regression than FP8 (1.5-3 points on long-context tasks) because INT8 has less dynamic range. Useful on hardware without native FP8 support; otherwise prefer FP8. (Pre-Hopper hardware)
Compounding: KV quantization stacks cleanly with attention compression. A DeepSeek V4-Pro deployment running MLA at FP8 sees 7× (MLA) × 2× (FP8) = 14× total KV cache compression versus a naive MHA-at-FP16 baseline — the difference between 135 GB and 10 GB at 1M context on a 70B-class footprint. That ratio is what makes long-context production economical at all.
06 — Decision
Picking techniques by workload class.
Most teams shouldn't treat these as separate decisions. The right starting point is "all five together, then back off anything that breaks your eval." The architectural choices (MQA vs GQA vs MLA) are pinned by model selection; the runtime choices (paged attention, prefix cache, FP8 quant) are knobs in the serving stack.
- Long-document Q&A · static reference. Single corpus, repeated questions, low cache invalidation. Stack: paged + prefix cache (24h TTL via explicit markers) + FP8 KV. Architectural: any GQA or MLA model. Wins are 80-95% per-call savings after prime. (All 5 layers · MLA-class model)
- Multi-tenant SaaS knowledge base. Many corpora (per-tenant), bursty hit-rate. Stack: paged + prefix cache (per-tenant cache markers) + FP8 KV. Avoid sliding window. SGLang RadixAttention often wins here over vLLM due to the branching pattern. (SGLang + per-tenant markers)
- Long-running agent loops. Single session, growing tool history. Stack: paged + sliding-window-style prefix cache + FP8 KV. Re-anchor at 30-50 turns to keep the prefix matchable. SGLang RadixAttention helps with branching candidate paths. (Re-anchored prefix cache)
- Highly dynamic / short-context. Chat without long reference, fresh content per call. Stack: paged + FP8 KV. Skip prefix cache (low hit rate). Sliding window OK. Architecture: GQA is fine; MLA's advantage shrinks below 32K. (Paged + FP8 only)
07 — Conclusion
KV is the real long-context cost.
The model picks the floor; the serving stack picks the ceiling.
By April 2026 the technique landscape is settled: paged attention is the substrate, prefix caching is the high-leverage application optimization, GQA or MLA picks the architectural floor, and FP8 KV is the free 50% memory savings every team should already have on. Used together, these compound to 4-40× cost reduction on long-context inference — the difference between "1M context is a marketing claim" and "1M context is in production."
The deeper move is to stop thinking about KV cache as a byproduct of inference and start thinking about it as the primary cost variable for any deployment that touches more than 32K tokens. Once the team is measuring KV memory consumption and KV bandwidth as first-class metrics, the right optimizations become obvious — and the wrong ones (eagerly raising batch size, heroic parameter quantization) stop being suggested.
The next architectural compression is already in flight: DeepSeek's CSA + HCA in V4 brings MLA-style compression into a sparser substrate, and the Mamba-MoE hybrids point to sub-1 GB KV at 1M context within 12 months. Build the muscle now to take advantage of it when it ships.