Prompt caching is the single highest-leverage cost lever in production LLM engineering for 2026: it stores the computed key-value tensors behind a repeated prompt prefix so that the static portion of every request — your tool definitions, system prompt, and reference documents — bills at up to 90% off, with the model producing byte-identical output. No distillation, no quantization, no quality trade-off. Just a structural change to how you order a prompt.

The reason it matters now is scale. Agentic workloads loop the same 20,000-token system prompt across dozens of steps per task; retrieval pipelines re-send the same document corpus on every query; chat products replay an entire conversation history each turn. Every one of those tokens was already paid for once and recomputed from scratch on the next call. Caching reclaims that spend — but only if the cacheable block is genuinely stable, which most teams discover the hard way.

This guide is the cross-provider reference: how the key-value cache mechanism actually works, current Anthropic, OpenAI, and Google pricing with the write premiums and TTL tiers spelled out, a breakeven matrix that shows exactly when caching saves money versus when it costs more, the prompt-ordering rules that move a workload from a 7% to an 84% hit rate, and the worked cost math behind all of it. Every figure is sourced to a provider doc or a published study.

Key takeaways

01
Caching is a 90% input discount with zero quality loss.Cache reads bill at 0.10× base input on Anthropic and on OpenAI's newer models; Google's implicit caching delivers a 75% discount. The model recomputes nothing it has already seen, so output is unchanged. Caching only ever touches input-side tokens — output is never discounted.
02
Breakeven is low but real — track your hit rate.Anthropic's 5-minute tier breaks even at roughly 1.4 reads per cached write. Below a ~30% hit rate on a stable-prompt workload, the write premium can cost more than the reads save. A hit rate under 60% on stable prompts signals a structural problem.
03
Prompt order is the whole game.Order content most-to-least stable: tool definitions, then system prompt, then reference docs, then conversation history, then the live user query. Any change to a block invalidates that block and everything after it on Anthropic — so dynamic data must live at the very end.
04
Moving working memory out of the prefix is the big win.ProjectDiscovery raised its cache hit rate from 7% to 84% by relocating dynamic working memory out of the system prompt and into a user message at the end of the prompt — cutting overall LLM cost 59%, with 9.8 billion tokens served from cache.
05
Pick the TTL tier to match request frequency.Use the 5-minute tier for high-QPS workloads, the 1-hour tier for medium frequency despite its higher write premium, and OpenAI's 24-hour extended retention for slow-burn batch agents. Naive full-context caching can paradoxically raise latency, so control the cache boundary deliberately.

01 — The MechanismWhat a cache actually stores.

Prompt caching operates on the key-value (KV) cache — the same data structure that makes transformer decoding tractable in the first place. During the prefill phase, every prompt token is processed and the key (K) and value (V) matrix projections from each attention layer are computed and stored. During decode, each newly generated token attends to those cached K/V matrices rather than recomputing the projections for every previous token. For a repeated prefix, that reduces the attention work from O(n²) to O(n).

Provider-level prompt caching extends that mechanism across requests. When a new request shares an identical prefix with a recent one, the provider reuses the already-computed KV state for that prefix instead of running prefill again. You pay a reduced “cache read” rate for those tokens, and time-to-first-token drops because the expensive prefill is skipped. The catch is that the match must be exact — the cache is keyed on the literal token sequence, so a single changed character before the cache boundary breaks the hit.

Layer 1

Provider prefix caching

Reduces input tokens only

Reuses the cached KV state for a repeated prompt prefix. This is the Anthropic / OpenAI / Gemini mechanism this guide covers. Saves input-side cost and time-to-first-token; output is recomputed and billed in full.

Exact prefix match required

Layer 2

Semantic response caching

Bypasses the model on a hit

Returns a stored response for a semantically similar prompt using vector embeddings — it bypasses the model entirely on a hit, saving both input and output tokens. vCache (arXiv 2502.03771, 2026) adds per-prompt learned similarity thresholds with user-defined error-rate guarantees.

Saves input + output

Layer 3

Exact-match response cache

Hash lookup, no model call

Hashes the full request and returns a stored response on an identical match. Low hit rate over natural language, but excellent for templated or repeated queries. Stacking all three layers (Redis pattern) maximizes coverage.

Best for templated queries

The one rule that governs everything

Caching only ever discounts input tokens. No provider’s caching scheme touches output token cost. Any savings estimate that extrapolates the discount to output spend is wrong — model the input side only, and treat the generated-token bill as unchanged.

02 — 2026 PricingWhat each provider charges in 2026.

The three major providers landed on caching at different times and with structurally different pricing. Anthropic launched prompt caching in public beta on August 14, 2024 and reached general availability on December 17, 2024; OpenAI shipped automatic caching on October 1, 2024 (initially a 50% discount, since raised to up to 90% on newer models); and Google introduced explicit context caching at Google I/O in May 2024, then added zero-setup implicit caching for Gemini 2.5 models on May 8, 2025. The table below consolidates the current 2026 figures — no single vendor page has all three side by side.

2026 prompt caching pricing across Anthropic, OpenAI, Google, and DeepSeek, showing base input price, cache read price, discount, write premium, minimum prefix tokens, and setup requirement. All figures are vendor-stated and current as of June 2026.
Provider / model	Base input	Cache read	Discount	Write premium	Min prefix	Setup
Anthropic — 5-minute / 1-hour TTL tiers
Claude Sonnet 4.6	$3.00	$0.30	90%	+25% / +100%	1,024	Breakpoints
Claude Opus 4.8	$5.00	$0.50	90%	+25% / +100%	1,024	Breakpoints
Claude Haiku 4.5	$1.00	$0.10	90%	+25% / +100%	4,096	Breakpoints
OpenAI — automatic, 24-hour extended on newer models
GPT-5.5	$5.00	$0.50	90%	None	1,024	Automatic
Google Gemini — implicit (zero-setup) and explicit caching
Gemini 2.5 Flash (implicit)	$0.30	$0.03	75–90%	None	1,024	Automatic
Gemini 2.5 Pro (explicit)	$1.25	~90% off*	~90%*	$1/MTok/hr storage	2,048	Explicit cache
DeepSeek — automatic, no storage fee
DeepSeek V4 Flash	See provider	~10% of input	~90%	None	Not published	Automatic

Prices in USD per 1M tokens, vendor-stated and current as of June 2026. Anthropic write premium shown as 5-minute / 1-hour TTL (1.25× / 2.0× base input). *Gemini 2.5 Pro’s exact cached-read rate is not separately published for this model on the pricing page; the ~90% figure is inferred from the discount structure — verify on ai.google.dev before budgeting. DeepSeek’s minimum cacheable prefix is not published in its documentation.

Two structural differences are worth internalizing. First, Anthropic charges a write premium — you pay more than base input the first time a prefix is cached (1.25× for the 5-minute tier, 2.0× for the 1-hour tier), then 0.10× on every read. OpenAI and Google implicit caching charge no write premium at all; the trade-off is less control over the cache boundary. Second, Google’s explicit caching adds a time-based storage fee ($1.00/MTok/hour for most models, $4.50 for Gemini 3.1 Pro Preview) that the others do not — so explicit caching only pays off when read volume per stored hour is high enough to amortize that storage cost.

Pairing caching with model selection compounds the savings. If you are routing cheap, cacheable steps to a small model and reserving a frontier model for the hard ones, our deep dive on LLM model routing strategies covers the second half of that equation.

Prompt caching is a critical new innovation for language model inference — saving developers up to 90% and making long context inputs suddenly viable.— Artificial Analysis, caching analysis

03 — BreakevenWhen caching helps and when it hurts.

The intuition that caching always saves money is wrong. On a write-premium provider like Anthropic, a low hit rate means you keep paying the inflated write cost without enough reads to amortize it. The decision hinges on one number: your cache hit rate — the fraction of requests that land on an already-cached prefix. The matrix below recomputes the net cost of caching as a percentage of the no-caching baseline at each hit rate, for the four most common configurations.

The formula is straightforward. A request either misses (you pay the write multiplier) or hits (you pay the read multiplier), so blended cost relative to base input is (1 − hitRate) × writeMultiplier + hitRate × readMultiplier. For Anthropic 5-minute the multipliers are 1.25× write / 0.10× read; for 1-hour they are 2.0× / 0.10×; for OpenAI and Google implicit there is no write premium so a miss costs 1.0× and a read costs 0.10× (OpenAI) or 0.25× (Google’s 75% implicit discount).

Cache hit rate breakeven matrix: net cost of caching as a percentage of the no-caching baseline at each hit rate, for Anthropic 5-minute TTL, Anthropic 1-hour TTL, OpenAI, and Google implicit caching. Values above 100 percent mean caching costs more than not caching. Computed from each provider’s stated write and read multipliers.
Cache hit rate	Anthropic 5m (1.25× / 0.10×)	Anthropic 1h (2.0× / 0.10×)	OpenAI (1.0× / 0.10×)	Google implicit (1.0× / 0.25×)
30%	90.5%	143% — costs more	73%	77.5%
50%	67.5%	105% — costs more	55%	62.5%
60%	56%	86%	46%	55%
70%	44.5%	67%	37%	47.5%
80%	33%	48%	28%	40%
90%	21.5%	29%	19%	32.5%

Net input cost as a percentage of the no-caching baseline (lower is better; 100% = no savings). Computed from blended cost = (1 − hit rate) × write multiplier + hit rate × read multiplier, using each provider’s stated multipliers. Storage fees for Google explicit caching are excluded — model those separately against read volume.

The matrix exposes the trap. Anthropic’s 1-hour tier costs more than not caching at 50% hit rate and below — its 2.0× write premium needs roughly a 67% hit rate just to break even versus 5-minute, and a 50%+ rate to beat the no-cache baseline at all. The 5-minute tier, OpenAI, and Google implicit caching all stay net-positive across the whole range because their write cost is at or below base input. This is why Anthropic’s documented breakeven of about 1.4 reads per cached write applies specifically to the 5-minute tier.

The hit-rate floor

On a stable-prompt workload, a cache hit rate below 60% signals a structural problem — usually dynamic data leaking into the cacheable prefix. Below 30%, the write premium can cost more than the reads save on a write-premium provider. Production reports from Vellum and Helicone cluster in the 50–80% range; treat anything under that as a bug to fix, not a ceiling to accept.

04 — Prompt OrderingOrder content most-to-least stable.

Because cache invalidation is hierarchical — on Anthropic, any change to a content block invalidates that block and every block after it — the entire optimization reduces to a single principle: put the content that never changes first, and the content that changes every request last. Get the order right and a long, expensive prefix caches cleanly; get it wrong and a single volatile token at the top wipes the cache on every call.

Agentic workflow prompt ordering guide: content type, how often it changes, recommended cache position from first to last, and the impact on cache hit rate if the block is misplaced. Synthesized from Anthropic docs, the Redis caching guide, and arXiv 2601.06007.
Content block	Frequency of change	Cache position	Impact if misplaced
Tool / function definitions	Almost never	1st (top)	Reordering tools invalidates everything below
System prompt instructions	Rarely	2nd	A single edited word breaks the prefix
Reference docs / RAG corpus	Occasionally	3rd	Re-ranking retrieved chunks each call kills hits
Conversation history	Grows; older turns fixed	4th	Editing prior turns invalidates the tail
Live user query / working memory	Every request	Last (bottom)	Placing it early invalidates the whole prefix

Optimal prompt ordering for maximum cache hits, synthesized from Anthropic’s caching docs, the Redis caching guide, and arXiv 2601.06007. Anthropic supports up to 4 cache breakpoints per request and two coexisting TTLs (1-hour must precede 5-minute in message order).

On Anthropic specifically, the discipline is stricter than “put dynamic data last.” Tool definitions must remain byte-identical and in the same order across requests; changing tool_choice, thinking parameters, or an image in the system prompt invalidates downstream cache entries. The provider supports up to four cache breakpoints per request with a lookback window of up to 20 content blocks per breakpoint — enough to cache tools, system prompt, and a document corpus as separate stable segments while leaving the query uncached.

05 — Case StudyProjectDiscovery’s 7% to 84% overnight.

The most instructive published case comes from ProjectDiscovery, whose security agent Neo runs 20 to 40-plus LLM steps per task on top of a 20,000-token system prompt. Their initial cache hit rate was a dismal 7% — because the system prompt contained dynamic working memory that mutated as the agent worked, invalidating the entire cacheable prefix on nearly every step. The fix was a single architectural change: move the dynamic working memory out of the system prompt and place it as a user message at the end of the prompt.

That one relocation raised the cache hit rate to 84% overnight and cut overall LLM cost by 59% — peaking at 70% in the final 10 days of the engagement, with 9.8 billion tokens ultimately served from cache. The lesson generalizes: the highest-value caching work is almost never tuning TTLs or breakpoints, it is auditing what dynamic content has crept into the supposedly static prefix.

Cache hit rate · ProjectDiscovery Neo agent

Source: ProjectDiscovery engineering blog, 2025

Before refactorDynamic working memory in the system prompt

After refactorWorking memory moved to a trailing user message

84%

Overall cost reductionPeaked at 70% in the final 10 days

−59%

The 5K-line takeaway in one sentence

The fastest way to fix a bad cache hit rate is not a config change — it is to find the dynamic data hiding in your system prompt and relocate it to the end of the request, where it no longer invalidates the static prefix.

06 — TTL StrategyMatch the TTL tier to request frequency.

Time-to-live is the second decision after prompt order. The cache entry expires after its TTL window of inactivity, so the right tier depends entirely on how frequently the same prefix is hit. Pick wrong and you either pay a write premium you cannot amortize, or you let a still-useful cache expire between requests.

High QPS

5-minute TTL

For workloads hitting the same prefix more than ~12 times per hour, the 5-minute tier (1.25× write) breaks even at roughly 1.4 reads and stays cheapest. The default for chat backends and busy agent loops.

Pick 5-minute tier

Medium QPS

1-hour TTL

For prefixes hit a few times per hour, the 1-hour tier (2.0× write) avoids constant re-warming. It only beats the baseline above a ~50% hit rate, so reserve it for medium-frequency, high-prefix-size workloads.

Pick 1-hour tier

Slow burn

24-hour extended

OpenAI offers extended 24-hour cache retention on newer models (gpt-5.5 and select gpt-5.x variants). Ideal for batch agents and slow-burn pipelines that revisit the same prefix across a day with no write premium.

Pick OpenAI 24h

Document corpus

Gemini explicit cache

For a large static corpus queried repeatedly within an hour, Google's explicit caching trades a storage fee ($1/MTok/hour) for a steep read discount. Worth it only when read volume per stored hour clears the storage cost.

Pick explicit cache

One counterintuitive finding deserves emphasis. The academic study “Don’t Break the Cache” (arXiv 2601.06007, 2026) tested caching across 500-plus agent sessions with 10,000-token system prompts and found that caching reduced API costs 41–80% and improved time-to-first-token 13–31% — but that naive full-context caching can paradoxically increase latency rather than reduce it. Strategic control of the cache boundary outperforms caching everything by default. The implication for engineers: do not just wrap the whole context in a cache breakpoint and assume you have won.

Strategic cache boundary control...outperforms naive full-context caching, which can paradoxically increase latency rather than improve it.— Lumer et al., Don't Break the Cache (arXiv 2601.06007)

07 — Worked MathThe cost math, recomputed end to end.

Abstract discounts only land when you run them against a real workload. Below are three worked scenarios with every cell recomputed from the provider’s stated per-token prices. These build on the kind of real-world baselines we measured in our analysis of token cost ROI across 50 agency workflows.

Scenario A — Claude Sonnet 4.6, 50K-token system prompt

One hundred requests a day against a 50,000-token system prompt. With no caching, that is 100 × 50,000 × $3.00/MTok = $15.00/day. On the 5-minute tier at a 90% hit rate: 10 cache writes (10 × 50,000 × $3.75/MTok = $1.875) plus 90 cache reads (90 × 50,000 × $0.30/MTok = $1.35) totals $3.225/day — a 78.5% reduction. On the 1-hour tier at a 95% hit rate: 5 writes (5 × 50,000 × $6.00/MTok = $1.50) plus 95 reads (95 × 50,000 × $0.30/MTok = $1.425) totals $2.925/day — an 80.5% reduction.

Scenario B — OpenAI GPT-5.5, agent loop

A 10,000-token system prompt looped 50 steps per run across 200 runs a day. With no caching: 200 × 50 × 10,000 × $5.00/MTok = $500/day. With 24-hour extended caching at a 95% hit rate (no write premium on OpenAI): misses are 200 × 5 × 10,000 × $5.00/MTok = $50, hits are 200 × 45 × 10,000 × $0.50/MTok = $45, totaling $95/day — an 81% reduction.

Scenario A · 5m TTL

Sonnet 4.6 · 50K prompt

78.5%

$15.00/day → $3.225/day at a 90% hit rate. Ten writes at $3.75/MTok plus ninety reads at $0.30/MTok. Recomputed from Anthropic's stated 1.25× write / 0.10× read multipliers on $3.00 base input.

100 requests/day

Scenario A · 1h TTL

Sonnet 4.6 · higher hit rate

80.5%

$15.00/day → $2.925/day at a 95% hit rate. Five writes at $6.00/MTok plus ninety-five reads at $0.30/MTok. The 1-hour tier wins here because the higher hit rate amortizes the 2.0× write premium.

95% hit rate

Scenario B · OpenAI 24h

GPT-5.5 · agent loop

81%

$500/day → $95/day at a 95% hit rate across 200 runs of 50 steps each. No write premium on OpenAI, so misses bill at full $5.00 and hits at $0.50. Output tokens excluded — caching never discounts them.

50 steps · 200 runs

Two caveats keep these numbers honest. First, every figure is an input-side saving — output token cost is identical with or without caching, so a workload that is output-heavy will see a smaller blended reduction than these input-only percentages suggest. Second, the hit rates assumed here (90–95%) are achievable only on genuinely stable prefixes; plug your measured hit rate into the breakeven matrix above before promising a finance team an 80% cut.

08 — Anti-PatternsThe mistakes that break every cache.

Most caching failures trace back to a handful of avoidable mistakes. Because the cache key is the literal token sequence, even a capitalization or whitespace difference before the boundary breaks the match — and the resulting miss looks identical to a working cache except for the bill. The defensive rules are simple but unforgiving.

Never cache

Timestamps & clock times

A datetime in the cacheable prefix changes every request and invalidates it on every call. Date-only formatting is safe if the date does not change mid-session — but full timestamps are the single most common cache killer.

Top failure mode

Never cache

Session IDs & user names

Per-request identifiers — session IDs, user names, request IDs — belong in the trailing user message, never in the system prompt. They are unique per call by definition, so any prefix containing them can never hit.

Move to suffix

Watch for

Whitespace & casing drift

A single changed character, an extra space, or a capitalization difference before the cache boundary breaks the exact match. Serialize the cacheable prefix deterministically and diff it across requests if hit rate drops.

Silent and expensive

A useful operational habit on Anthropic is pre-warming: send a request with max_tokens: 0 against the system prompt before traffic arrives. The API returns immediately with stop_reason: "max_tokens" and populates cache_creation_input_tokens, confirming the cache is warm before the first user shows up. For monitoring, OpenAI surfaces the cached token count as cached_tokens in the response usage field — wire that into your observability so a hit rate regression triggers an alert rather than a surprise invoice. For the deeper memory-management mechanics underneath all of this, our companion piece on KV cache optimization techniques goes a layer below the provider abstraction.

Looking forward, the next frontier is semantic caching that bypasses the model on near-matches rather than exact ones. Approaches like vCache return cached responses for semantically similar prompts under user-defined error-rate guarantees, saving both input and output tokens where prefix caching saves only input. Expect 2026 production stacks to layer exact-match, semantic, and prefix caches together — and the teams that instrument hit rate per layer will be the ones who actually capture the savings the pricing tables promise.

09 — ConclusionA structural win, not a tuning trick.

The shape of LLM cost control, mid-2026

Prompt caching is the rare cost lever that asks for engineering discipline, not quality compromise.

Prompt caching is the highest-leverage, lowest-risk cost reduction available to production LLM teams in 2026. It cuts the input bill on repeated prefixes by up to 90% with no change to model output — the saving is structural, not a quality trade-off. Every major provider now ships it, the discounts are steep, and the only real work is ordering a prompt so the static part actually stays static.

The honest framing is that caching rewards discipline and punishes carelessness. Get the order right — tools, system prompt, docs, history, then the live query — and keep timestamps and session data out of the prefix, and an 80% saving is routine. Leave dynamic data in the cacheable block and you land at ProjectDiscovery’s starting 7%, paying write premiums for a cache that almost never hits. The breakeven matrix is the tool that tells you which side of that line you are on.

The broader signal is that token economics, not raw capability, is where the next wave of LLM-product margin is won. As models converge on capability, the teams that win on unit economics will be the ones who treat caching, model routing, and prompt structure as first-class engineering concerns — measured, monitored, and tuned — rather than afterthoughts bolted on once the bill arrives.

Prompt Caching in 2026: Cut LLM Costs, Keep Quality