Prompt caching is the single highest-leverage cost lever in production LLM engineering for 2026: it stores the computed key-value tensors behind a repeated prompt prefix so that the static portion of every request — your tool definitions, system prompt, and reference documents — bills at up to 90% off, with the model producing byte-identical output. No distillation, no quantization, no quality trade-off. Just a structural change to how you order a prompt.
The reason it matters now is scale. Agentic workloads loop the same 20,000-token system prompt across dozens of steps per task; retrieval pipelines re-send the same document corpus on every query; chat products replay an entire conversation history each turn. Every one of those tokens was already paid for once and recomputed from scratch on the next call. Caching reclaims that spend — but only if the cacheable block is genuinely stable, which most teams discover the hard way.
This guide is the cross-provider reference: how the key-value cache mechanism actually works, current Anthropic, OpenAI, and Google pricing with the write premiums and TTL tiers spelled out, a breakeven matrix that shows exactly when caching saves money versus when it costs more, the prompt-ordering rules that move a workload from a 7% to an 84% hit rate, and the worked cost math behind all of it. Every figure is sourced to a provider doc or a published study.
- 01Caching is a 90% input discount with zero quality loss.Cache reads bill at 0.10× base input on Anthropic and on OpenAI's newer models; Google's implicit caching delivers a 75% discount. The model recomputes nothing it has already seen, so output is unchanged. Caching only ever touches input-side tokens — output is never discounted.
- 02Breakeven is low but real — track your hit rate.Anthropic's 5-minute tier breaks even at roughly 1.4 reads per cached write. Below a ~30% hit rate on a stable-prompt workload, the write premium can cost more than the reads save. A hit rate under 60% on stable prompts signals a structural problem.
- 03Prompt order is the whole game.Order content most-to-least stable: tool definitions, then system prompt, then reference docs, then conversation history, then the live user query. Any change to a block invalidates that block and everything after it on Anthropic — so dynamic data must live at the very end.
- 04Moving working memory out of the prefix is the big win.ProjectDiscovery raised its cache hit rate from 7% to 84% by relocating dynamic working memory out of the system prompt and into a user message at the end of the prompt — cutting overall LLM cost 59%, with 9.8 billion tokens served from cache.
- 05Pick the TTL tier to match request frequency.Use the 5-minute tier for high-QPS workloads, the 1-hour tier for medium frequency despite its higher write premium, and OpenAI's 24-hour extended retention for slow-burn batch agents. Naive full-context caching can paradoxically raise latency, so control the cache boundary deliberately.
01 — The MechanismWhat a cache actually stores.
Prompt caching operates on the key-value (KV) cache — the same data structure that makes transformer decoding tractable in the first place. During the prefill phase, every prompt token is processed and the key (K) and value (V) matrix projections from each attention layer are computed and stored. During decode, each newly generated token attends to those cached K/V matrices rather than recomputing the projections for every previous token. For a repeated prefix, that reduces the attention work from O(n²) to O(n).
Provider-level prompt caching extends that mechanism across requests. When a new request shares an identical prefix with a recent one, the provider reuses the already-computed KV state for that prefix instead of running prefill again. You pay a reduced “cache read” rate for those tokens, and time-to-first-token drops because the expensive prefill is skipped. The catch is that the match must be exact — the cache is keyed on the literal token sequence, so a single changed character before the cache boundary breaks the hit.
Provider prefix caching
Reuses the cached KV state for a repeated prompt prefix. This is the Anthropic / OpenAI / Gemini mechanism this guide covers. Saves input-side cost and time-to-first-token; output is recomputed and billed in full.
Semantic response caching
Returns a stored response for a semantically similar prompt using vector embeddings — it bypasses the model entirely on a hit, saving both input and output tokens. vCache (arXiv 2502.03771, 2026) adds per-prompt learned similarity thresholds with user-defined error-rate guarantees.
Exact-match response cache
Hashes the full request and returns a stored response on an identical match. Low hit rate over natural language, but excellent for templated or repeated queries. Stacking all three layers (Redis pattern) maximizes coverage.
02 — 2026 PricingWhat each provider charges in 2026.
The three major providers landed on caching at different times and with structurally different pricing. Anthropic launched prompt caching in public beta on August 14, 2024 and reached general availability on December 17, 2024; OpenAI shipped automatic caching on October 1, 2024 (initially a 50% discount, since raised to up to 90% on newer models); and Google introduced explicit context caching at Google I/O in May 2024, then added zero-setup implicit caching for Gemini 2.5 models on May 8, 2025. The table below consolidates the current 2026 figures — no single vendor page has all three side by side.
| Provider / model | Base input | Cache read | Discount | Write premium | Min prefix | Setup |
|---|---|---|---|---|---|---|
| Anthropic — 5-minute / 1-hour TTL tiers | ||||||
| Claude Sonnet 4.6 | $3.00 | $0.30 | 90% | +25% / +100% | 1,024 | Breakpoints |
| Claude Opus 4.8 | $5.00 | $0.50 | 90% | +25% / +100% | 1,024 | Breakpoints |
| Claude Haiku 4.5 | $1.00 | $0.10 | 90% | +25% / +100% | 4,096 | Breakpoints |
| OpenAI — automatic, 24-hour extended on newer models | ||||||
| GPT-5.5 | $5.00 | $0.50 | 90% | None | 1,024 | Automatic |
| Google Gemini — implicit (zero-setup) and explicit caching | ||||||
| Gemini 2.5 Flash (implicit) | $0.30 | $0.03 | 75–90% | None | 1,024 | Automatic |
| Gemini 2.5 Pro (explicit) | $1.25 | ~90% off* | ~90%* | $1/MTok/hr storage | 2,048 | Explicit cache |
| DeepSeek — automatic, no storage fee | ||||||
| DeepSeek V4 Flash | See provider | ~10% of input | ~90% | None | Not published | Automatic |
Two structural differences are worth internalizing. First, Anthropic charges a write premium — you pay more than base input the first time a prefix is cached (1.25× for the 5-minute tier, 2.0× for the 1-hour tier), then 0.10× on every read. OpenAI and Google implicit caching charge no write premium at all; the trade-off is less control over the cache boundary. Second, Google’s explicit caching adds a time-based storage fee ($1.00/MTok/hour for most models, $4.50 for Gemini 3.1 Pro Preview) that the others do not — so explicit caching only pays off when read volume per stored hour is high enough to amortize that storage cost.
Pairing caching with model selection compounds the savings. If you are routing cheap, cacheable steps to a small model and reserving a frontier model for the hard ones, our deep dive on LLM model routing strategies covers the second half of that equation.
Prompt caching is a critical new innovation for language model inference — saving developers up to 90% and making long context inputs suddenly viable.— Artificial Analysis, caching analysis
03 — BreakevenWhen caching helps and when it hurts.
The intuition that caching always saves money is wrong. On a write-premium provider like Anthropic, a low hit rate means you keep paying the inflated write cost without enough reads to amortize it. The decision hinges on one number: your cache hit rate — the fraction of requests that land on an already-cached prefix. The matrix below recomputes the net cost of caching as a percentage of the no-caching baseline at each hit rate, for the four most common configurations.
The formula is straightforward. A request either misses (you pay the write multiplier) or hits (you pay the read multiplier), so blended cost relative to base input is (1 − hitRate) × writeMultiplier + hitRate × readMultiplier. For Anthropic 5-minute the multipliers are 1.25× write / 0.10× read; for 1-hour they are 2.0× / 0.10×; for OpenAI and Google implicit there is no write premium so a miss costs 1.0× and a read costs 0.10× (OpenAI) or 0.25× (Google’s 75% implicit discount).
| Cache hit rate | Anthropic 5m (1.25× / 0.10×) | Anthropic 1h (2.0× / 0.10×) | OpenAI (1.0× / 0.10×) | Google implicit (1.0× / 0.25×) |
|---|---|---|---|---|
| 30% | 90.5% | 143% — costs more | 73% | 77.5% |
| 50% | 67.5% | 105% — costs more | 55% | 62.5% |
| 60% | 56% | 86% | 46% | 55% |
| 70% | 44.5% | 67% | 37% | 47.5% |
| 80% | 33% | 48% | 28% | 40% |
| 90% | 21.5% | 29% | 19% | 32.5% |
The matrix exposes the trap. Anthropic’s 1-hour tier costs more than not caching at 50% hit rate and below — its 2.0× write premium needs roughly a 67% hit rate just to break even versus 5-minute, and a 50%+ rate to beat the no-cache baseline at all. The 5-minute tier, OpenAI, and Google implicit caching all stay net-positive across the whole range because their write cost is at or below base input. This is why Anthropic’s documented breakeven of about 1.4 reads per cached write applies specifically to the 5-minute tier.
04 — Prompt OrderingOrder content most-to-least stable.
Because cache invalidation is hierarchical — on Anthropic, any change to a content block invalidates that block and every block after it — the entire optimization reduces to a single principle: put the content that never changes first, and the content that changes every request last. Get the order right and a long, expensive prefix caches cleanly; get it wrong and a single volatile token at the top wipes the cache on every call.
| Content block | Frequency of change | Cache position | Impact if misplaced |
|---|---|---|---|
| Tool / function definitions | Almost never | 1st (top) | Reordering tools invalidates everything below |
| System prompt instructions | Rarely | 2nd | A single edited word breaks the prefix |
| Reference docs / RAG corpus | Occasionally | 3rd | Re-ranking retrieved chunks each call kills hits |
| Conversation history | Grows; older turns fixed | 4th | Editing prior turns invalidates the tail |
| Live user query / working memory | Every request | Last (bottom) | Placing it early invalidates the whole prefix |
On Anthropic specifically, the discipline is stricter than “put dynamic data last.” Tool definitions must remain byte-identical and in the same order across requests; changing tool_choice, thinking parameters, or an image in the system prompt invalidates downstream cache entries. The provider supports up to four cache breakpoints per request with a lookback window of up to 20 content blocks per breakpoint — enough to cache tools, system prompt, and a document corpus as separate stable segments while leaving the query uncached.
05 — Case StudyProjectDiscovery’s 7% to 84% overnight.
The most instructive published case comes from ProjectDiscovery, whose security agent Neo runs 20 to 40-plus LLM steps per task on top of a 20,000-token system prompt. Their initial cache hit rate was a dismal 7% — because the system prompt contained dynamic working memory that mutated as the agent worked, invalidating the entire cacheable prefix on nearly every step. The fix was a single architectural change: move the dynamic working memory out of the system prompt and place it as a user message at the end of the prompt.
That one relocation raised the cache hit rate to 84% overnight and cut overall LLM cost by 59% — peaking at 70% in the final 10 days of the engagement, with 9.8 billion tokens ultimately served from cache. The lesson generalizes: the highest-value caching work is almost never tuning TTLs or breakpoints, it is auditing what dynamic content has crept into the supposedly static prefix.
Cache hit rate · ProjectDiscovery Neo agent
Source: ProjectDiscovery engineering blog, 202506 — TTL StrategyMatch the TTL tier to request frequency.
Time-to-live is the second decision after prompt order. The cache entry expires after its TTL window of inactivity, so the right tier depends entirely on how frequently the same prefix is hit. Pick wrong and you either pay a write premium you cannot amortize, or you let a still-useful cache expire between requests.
5-minute TTL
For workloads hitting the same prefix more than ~12 times per hour, the 5-minute tier (1.25× write) breaks even at roughly 1.4 reads and stays cheapest. The default for chat backends and busy agent loops.
1-hour TTL
For prefixes hit a few times per hour, the 1-hour tier (2.0× write) avoids constant re-warming. It only beats the baseline above a ~50% hit rate, so reserve it for medium-frequency, high-prefix-size workloads.
24-hour extended
OpenAI offers extended 24-hour cache retention on newer models (gpt-5.5 and select gpt-5.x variants). Ideal for batch agents and slow-burn pipelines that revisit the same prefix across a day with no write premium.
Gemini explicit cache
For a large static corpus queried repeatedly within an hour, Google's explicit caching trades a storage fee ($1/MTok/hour) for a steep read discount. Worth it only when read volume per stored hour clears the storage cost.
One counterintuitive finding deserves emphasis. The academic study “Don’t Break the Cache” (arXiv 2601.06007, 2026) tested caching across 500-plus agent sessions with 10,000-token system prompts and found that caching reduced API costs 41–80% and improved time-to-first-token 13–31% — but that naive full-context caching can paradoxically increase latency rather than reduce it. Strategic control of the cache boundary outperforms caching everything by default. The implication for engineers: do not just wrap the whole context in a cache breakpoint and assume you have won.
Strategic cache boundary control...outperforms naive full-context caching, which can paradoxically increase latency rather than improve it.— Lumer et al., Don't Break the Cache (arXiv 2601.06007)
07 — Worked MathThe cost math, recomputed end to end.
Abstract discounts only land when you run them against a real workload. Below are three worked scenarios with every cell recomputed from the provider’s stated per-token prices. These build on the kind of real-world baselines we measured in our analysis of token cost ROI across 50 agency workflows.
Scenario A — Claude Sonnet 4.6, 50K-token system prompt
One hundred requests a day against a 50,000-token system prompt. With no caching, that is 100 × 50,000 × $3.00/MTok = $15.00/day. On the 5-minute tier at a 90% hit rate: 10 cache writes (10 × 50,000 × $3.75/MTok = $1.875) plus 90 cache reads (90 × 50,000 × $0.30/MTok = $1.35) totals $3.225/day — a 78.5% reduction. On the 1-hour tier at a 95% hit rate: 5 writes (5 × 50,000 × $6.00/MTok = $1.50) plus 95 reads (95 × 50,000 × $0.30/MTok = $1.425) totals $2.925/day — an 80.5% reduction.
Scenario B — OpenAI GPT-5.5, agent loop
A 10,000-token system prompt looped 50 steps per run across 200 runs a day. With no caching: 200 × 50 × 10,000 × $5.00/MTok = $500/day. With 24-hour extended caching at a 95% hit rate (no write premium on OpenAI): misses are 200 × 5 × 10,000 × $5.00/MTok = $50, hits are 200 × 45 × 10,000 × $0.50/MTok = $45, totaling $95/day — an 81% reduction.
Sonnet 4.6 · 50K prompt
$15.00/day → $3.225/day at a 90% hit rate. Ten writes at $3.75/MTok plus ninety reads at $0.30/MTok. Recomputed from Anthropic's stated 1.25× write / 0.10× read multipliers on $3.00 base input.
Sonnet 4.6 · higher hit rate
$15.00/day → $2.925/day at a 95% hit rate. Five writes at $6.00/MTok plus ninety-five reads at $0.30/MTok. The 1-hour tier wins here because the higher hit rate amortizes the 2.0× write premium.
GPT-5.5 · agent loop
$500/day → $95/day at a 95% hit rate across 200 runs of 50 steps each. No write premium on OpenAI, so misses bill at full $5.00 and hits at $0.50. Output tokens excluded — caching never discounts them.
Two caveats keep these numbers honest. First, every figure is an input-side saving — output token cost is identical with or without caching, so a workload that is output-heavy will see a smaller blended reduction than these input-only percentages suggest. Second, the hit rates assumed here (90–95%) are achievable only on genuinely stable prefixes; plug your measured hit rate into the breakeven matrix above before promising a finance team an 80% cut.
08 — Anti-PatternsThe mistakes that break every cache.
Most caching failures trace back to a handful of avoidable mistakes. Because the cache key is the literal token sequence, even a capitalization or whitespace difference before the boundary breaks the match — and the resulting miss looks identical to a working cache except for the bill. The defensive rules are simple but unforgiving.
Timestamps & clock times
A datetime in the cacheable prefix changes every request and invalidates it on every call. Date-only formatting is safe if the date does not change mid-session — but full timestamps are the single most common cache killer.
Session IDs & user names
Per-request identifiers — session IDs, user names, request IDs — belong in the trailing user message, never in the system prompt. They are unique per call by definition, so any prefix containing them can never hit.
Whitespace & casing drift
A single changed character, an extra space, or a capitalization difference before the cache boundary breaks the exact match. Serialize the cacheable prefix deterministically and diff it across requests if hit rate drops.
A useful operational habit on Anthropic is pre-warming: send a request with max_tokens: 0 against the system prompt before traffic arrives. The API returns immediately with stop_reason: "max_tokens" and populates cache_creation_input_tokens, confirming the cache is warm before the first user shows up. For monitoring, OpenAI surfaces the cached token count as cached_tokens in the response usage field — wire that into your observability so a hit rate regression triggers an alert rather than a surprise invoice. For the deeper memory-management mechanics underneath all of this, our companion piece on KV cache optimization techniques goes a layer below the provider abstraction.
Looking forward, the next frontier is semantic caching that bypasses the model on near-matches rather than exact ones. Approaches like vCache return cached responses for semantically similar prompts under user-defined error-rate guarantees, saving both input and output tokens where prefix caching saves only input. Expect 2026 production stacks to layer exact-match, semantic, and prefix caches together — and the teams that instrument hit rate per layer will be the ones who actually capture the savings the pricing tables promise.
09 — ConclusionA structural win, not a tuning trick.
Prompt caching is the rare cost lever that asks for engineering discipline, not quality compromise.
Prompt caching is the highest-leverage, lowest-risk cost reduction available to production LLM teams in 2026. It cuts the input bill on repeated prefixes by up to 90% with no change to model output — the saving is structural, not a quality trade-off. Every major provider now ships it, the discounts are steep, and the only real work is ordering a prompt so the static part actually stays static.
The honest framing is that caching rewards discipline and punishes carelessness. Get the order right — tools, system prompt, docs, history, then the live query — and keep timestamps and session data out of the prefix, and an 80% saving is routine. Leave dynamic data in the cacheable block and you land at ProjectDiscovery’s starting 7%, paying write premiums for a cache that almost never hits. The breakeven matrix is the tool that tells you which side of that line you are on.
The broader signal is that token economics, not raw capability, is where the next wave of LLM-product margin is won. As models converge on capability, the teams that win on unit economics will be the ones who treat caching, model routing, and prompt structure as first-class engineering concerns — measured, monitored, and tuned — rather than afterthoughts bolted on once the bill arrives.