AI Development · Cost Playbook · 5 min read · Published Apr 23, 2026

Six production patterns · four cache topologies · honest break-even tables

Claude Opus 4.7 1M Context: Cost Strategy

Claude Opus 4.7 ships 1M tokens of context, but most teams burn the budget by treating the window as free. At $5 / $25 per 1M input/output, a single uncached 800K call costs $4 in input alone — and the same call hit at 90% cache discount drops to $0.40. The difference between burning money and shipping production is cache topology, not the window itself.

Digital Applied Team · Senior strategists
Published: Apr 23, 2026
Read time: 5 min
Sources: Anthropic API · cache docs · internal evals
  • Cache discount: 90% on cached input tokens ($0.50 / 1M cached)
  • Uncached 800K call: $4.00 in input tokens alone
  • Same call, cached: $0.40 after first prime (−90% per call)
  • RAG break-even: ~5 cached calls / day

Most teams that buy into Claude Opus 4.7's 1M-token window blow through their AI budget within a quarter. The window is not the problem — the problem is treating it like a free resource and paying full uncached input rates on every call.

The math is uncompromising. A single 800K-token call at the API rack rate of $5 per 1M input costs $4 in input alone, before a single output token is billed. Hit that same call as a cache read at $0.50 per 1M and the cost drops to $0.40 — a 90% reduction. Make the call ten times in a day and the uncached version costs $40; the cached version costs $4 plus a single $5 prime. The difference between burning money and shipping production is cache topology, not the size of the window.
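The arithmetic above can be checked in a few lines. This is a minimal sketch using the rates quoted in this article ($5/1M input uncached, $0.50/1M on cached reads); the function name is illustrative:

```python
# Rates quoted in this article: $5 per 1M input tokens uncached,
# $0.50 per 1M on cached reads (the 90% discount).
INPUT_RATE = 5.00 / 1_000_000
CACHED_READ_RATE = 0.50 / 1_000_000

def input_cost(tokens: int, cached: bool = False) -> float:
    """Input-side cost of one call in USD, ignoring output tokens."""
    rate = CACHED_READ_RATE if cached else INPUT_RATE
    return tokens * rate

print(input_cost(800_000))               # ~ $4.00 uncached
print(input_cost(800_000, cached=True))  # ~ $0.40 after the prime
print(10 * input_cost(800_000))          # ~ $40 for ten uncached calls
```

Ten cached calls cost roughly the $5 prime plus nine reads at $0.40, which is the ~$9-versus-$40 gap the paragraph above describes.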

This guide walks through the cost math, the break-even decision tree (when 1M context pays off versus when retrieval-augmented generation wins on cost), and the four prompt-cache topologies that actually hold up in production, plus the hidden costs that wreck naive deployments.

Key takeaways
  1. Uncached 1M context is a bug, not a feature, of your deployment. If your average call is paying $5/1M input on context that would repeat across requests, you are leaving 80–90% on the table. Cache topology is the first cost question to answer.
  2. RAG wins on cost below ~5 cached calls per day; 1M context wins above ~12. Below 5 cached calls/day on the same context, the cache prime cost amortizes too slowly. Above 12, retrieval overhead and miss-rate make RAG more expensive overall. The 5–12 band is where you measure.
  3. The four cache topologies (static prefix, layered prefix, sliding window, and hybrid RAG-cache) solve different production problems. Static prefix handles repo Q&A; layered prefix handles multi-tenant SaaS; sliding window handles long agent loops; hybrid handles knowledge-base scale. Pick by workload, not by which one looks newest.
  4. Output-token amplification is the silent budget killer. 1M of input often elicits 5–20K of output even on simple questions, because the model summarizes everything in scope. At $25/1M output, that's $0.13–$0.50 per call before reasoning. Trim outputs aggressively.
  5. Tool-call loops invalidate cache after roughly 50–80 turns. Long agent loops that mutate tool history past ~50 turns push the cached prefix out of effective scope. Plan multi-turn agents around either short loops (under 30 turns) or explicit cache-refresh checkpoints.

01 · The Math: The uncompromising arithmetic of 1M context.

Claude Opus 4.7's pricing is straightforward — what catches teams off guard is how the absolute numbers scale at long context. The same per-million rate that feels harmless at 8K tokens becomes load-bearing at 800K. The table below shows the full input cost for one call at four context sizes, uncached and with the 90% cache discount.

Input cost per call · Claude Opus 4.7 · four context sizes

Source: Anthropic API pricing · Apr 2026

  • 8K context · uncached: $0.04 (single short message, no system-prompt cache)
  • 200K context · uncached: $1.00 (repo skeleton + style guide, no cache)
  • 200K context · 90% cached: $0.10, −90% (same payload, 5-minute cache active)
  • 800K context · uncached: $4.00 (full long-document call, no cache)
  • 800K context · 90% cached: $0.40, −90% (same call after cache prime)
  • 1M context · uncached: $5.00 (maximum window, no cache)
  • 1M context · 90% cached: $0.50, −90% (maximum window, post-prime)

Two reads of this data matter. First: the cached 800K call ($0.40) is cheaper than the uncached 200K call ($1.00), which is the entire point of using cache aggressively — you can run with much more context for less money. Second: the cache prime is real. The first call at 800K still costs roughly $5 (1.25× input rate, the cache-write tariff), so the savings only materialize once you hit the same prefix more than once within the cache TTL.

That second point is what governs the workflow design. If you cannot guarantee at least 5–10 hits on the same cached prefix within the 5-minute (or 1-hour, or 24-hour) cache window, the prime cost will dominate your spend and you will be better off with retrieval.

Cache tier reality
The default cache TTL is 5 minutes. Anthropic offers extended tiers — 1 hour and 24 hours — at a write-time premium. The 5-min tier is right for interactive workflows (chat sessions, IDE plug-ins). The 1-hour tier is right for batch agents and overnight jobs. The 24-hour tier is right for static reference content (repo skeleton, brand guides, regulatory text) where the underlying payload changes infrequently. Pick by how long the payload is stable, not by your budget intuition.
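The tier rule in that callout can be captured as a tiny helper. The thresholds below mirror only the guidance above (pick by payload stability); the function name and cutoffs are illustrative, not a vendor API:

```python
def pick_cache_ttl(payload_stable_minutes: float) -> str:
    """Choose the cache tier by how long the prefix stays byte-identical,
    per the rule above: stability, not budget intuition."""
    if payload_stable_minutes >= 24 * 60:
        return "24h"    # static reference: repo skeleton, brand guide, regulatory text
    if payload_stable_minutes >= 60:
        return "1h"     # batch agents, overnight jobs
    return "5min"       # interactive chat sessions, IDE plug-ins

print(pick_cache_ttl(3))      # 5min
print(pick_cache_ttl(90))     # 1h
print(pick_cache_ttl(2880))   # 24h
```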

02 · 1M vs RAG: The decision tree.

Long-context plus aggressive caching does not always beat retrieval-augmented generation. The crossover point depends on three numbers: how often you hit the same context, how stable the underlying corpus is, and how much answer fidelity you lose from chunked retrieval. The matrix below is the policy we use internally as a starting point.

Pattern: Static repo / book / contract Q&A
Single corpus, repeated questions, low frequency of payload change. 1M cache wins decisively here. Prime once per day; questions cost cents.
Verdict: 1M context · 24h cache

Pattern: Multi-tenant SaaS knowledge base
Many corpora (per-tenant), bursty usage. Cache hit-rate per tenant is low; RAG over a per-tenant index almost always wins on $/answer.
Verdict: RAG · per-tenant index

Pattern: Long-running agent loops
Single session, growing tool-history context, sub-50 turns. Sliding-window cache lets the agent re-use the early prefix even as recent turns mutate.
Verdict: Sliding window cache

Pattern: Highly dynamic corpus (news, prices, logs)
Underlying data changes by the minute. Cache constantly invalidates; retrieval against a fresh index dominates on both cost and freshness.
Verdict: RAG · fresh index
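As a starting point, the matrix can be encoded as a routing function. The 5- and 12-call thresholds are the article's own numbers; everything else here is an illustrative encoding, not a rule from any vendor:

```python
def choose_pattern(corpora: int, cached_calls_per_day: float,
                   corpus_changes_per_day: float, agent_loop: bool) -> str:
    """Route a workload to one of the four patterns in the matrix above."""
    if agent_loop:
        return "sliding-window cache"
    if corpus_changes_per_day > 24:       # data changes by the minute/hour
        return "RAG over a fresh index"
    if corpora > 1:                       # multi-tenant: per-tenant hit rate too low
        return "RAG with a per-tenant index"
    if cached_calls_per_day >= 12:
        return "1M context + 24h cache"
    if cached_calls_per_day < 5:
        return "RAG"
    return "measure both"                 # the 5-12 band: no default answer

print(choose_pattern(1, 40, 0, False))   # 1M context + 24h cache
print(choose_pattern(50, 3, 0, False))   # RAG with a per-tenant index
print(choose_pattern(1, 8, 0, False))    # measure both
```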
"Below five cached calls per day, the prime cost wrecks the math. Above twelve, retrieval miss-rate wrecks the answer."— Internal cost-policy doc, May 2026

03 · Cache Topologies: The four patterns that actually work.

Anthropic's prompt-cache is a prefix cache — the model caches a prefix of the input once, then re-uses it as long as subsequent calls share the exact same prefix. That constraint shapes the four topologies below. Each one solves a different production problem; they are not interchangeable.
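In request terms, the prefix constraint looks like this. A minimal sketch of a static-prefix request body: the `cache_control` field follows the shape in Anthropic's prompt-caching docs, but the model id and payload text are placeholders — check the current API reference before relying on exact field names:

```python
# Static-prefix topology: all stable content sits before the cache marker,
# all per-request content sits after it, so the cached bytes never change.
request = {
    "model": "claude-opus-4.7",   # illustrative model id
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are the repo Q&A assistant."},
        {
            "type": "text",
            "text": "<800K of repo skeleton + style guide here>",
            # Marker: everything up to and including this block is cacheable.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Only the user question varies call-to-call; because it comes after
    # the marker, the cached prefix stays byte-identical across requests.
    "messages": [{"role": "user", "content": "What does the auth module do?"}],
}
```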

Topology 1
Static prefix
system + reference docs · cache marker

System prompt + style guide + repo skeleton or knowledge base, all before any user-specific content. Best for repo Q&A, brand-content generation, single-corpus reference.

5-min or 24-hour cache
Topology 2
Layered prefix
global · org · user — three cache markers

Multi-tenant SaaS pattern. Layer cache markers at global system, org-level config, and user-session levels. Each layer hits its own cache TTL; eviction propagates correctly.

Layered TTL · per tier
Topology 3
Sliding window
anchor prefix + rolling tail

Long agent loops. Anchor the static portion (system, tools schema, mission) at the start; let the most recent turns of tool-history sit outside the cache. Re-anchor every 30-50 turns.

Anchor + rolling · 5-min
Topology 4
Hybrid RAG-cache
RAG retrieve · then cache assembled context

Knowledge base with selective context. RAG picks the relevant chunks, then those chunks form a cacheable prefix. Wins when the corpus is too big for 1M but the chunk-set repeats across users.

Cache assembled prefix · 5-min
The cache-marker discipline
Each cache marker after the first costs a write premium (1.25× input rate). Two markers are fine; four are fine; ten is a sign your prompt is not actually structured around stable layers. Measure cache write spend separately from read spend — when the ratio of writes to reads exceeds 0.4, your topology is wrong.
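That write-to-read ratio is easy to track. A sketch using the rates quoted in this guide ($6.25/1M writes, $0.50/1M reads); the function name and return shape are illustrative:

```python
def cache_health(write_tokens: int, read_tokens: int,
                 write_rate: float = 6.25, read_rate: float = 0.50) -> dict:
    """Write-vs-read cache spend, rates in USD per 1M tokens.
    Per the discipline above, a ratio over 0.4 flags a bad topology."""
    write_spend = write_tokens / 1e6 * write_rate
    read_spend = read_tokens / 1e6 * read_rate
    ratio = write_spend / read_spend if read_spend else float("inf")
    return {"write_spend": write_spend, "read_spend": read_spend,
            "ratio": ratio, "topology_ok": ratio <= 0.4}

# One 800K prime, then 50 cached reads of the same prefix within the TTL:
# $5 of writes against $20 of reads, ratio 0.25 -> topology looks healthy.
print(cache_health(800_000, 50 * 800_000))
```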

04 · Break-Even Tables: The arithmetic of when caching pays off.

The break-even point is the number of times you have to hit a cached prefix within its TTL before the cache-write premium plus the prime is recovered by the per-call savings. The numbers below assume the typical 800K-token reference context and $5/$25 input/output pricing.

5-min tier · 2 hits · break-even on an 800K prefix

Cache write ~$5 (1.25× input); each cached read ~$0.40 instead of $4.00 uncached. Two reads inside the 5-minute TTL save $7.20 against the $5 write, which is where the break-even lands.

Interactive · IDE plugin

1-hour tier · 5 hits · break-even on an 800K prefix

Higher write premium (~$10), but a 60-minute window. Right for batch overnight jobs and document-pipeline agents that re-read the same reference set across many tasks.

Batch · pipeline

24-hour tier · 20 hits · break-even on an 800K prefix

Highest write premium (~$25), 24-hour window. Right for static-reference content (brand guidelines, regulatory text, repo skeleton) where the corpus is stable for a day or more.

Static · daily

The mistake teams make most often is choosing the long-tier TTL because it sounds more durable, then failing to hit the break-even count. A 24-hour cache that gets 8 hits is more expensive than a 5-minute cache hit 8 times in a single session.
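That closing claim checks out numerically. A sketch using the approximate write costs from the tier cards above (~$5 / ~$10 / ~$25 for an 800K prime) and the $0.50/1M cached-read rate:

```python
READ_RATE = 0.50  # USD per 1M cached-read tokens
# Approximate 800K-prefix write costs per tier, from the cards above.
WRITE_COST = {"5min": 5.00, "1h": 10.00, "24h": 25.00}

def total_cost(tier: str, hits: int, prefix_tokens: int = 800_000) -> float:
    """One cache prime plus `hits` cached reads of the same prefix."""
    return WRITE_COST[tier] + hits * prefix_tokens / 1e6 * READ_RATE

print(total_cost("24h", 8))   # ~ $28.20 -- 8 hits never recover the $25 write
print(total_cost("5min", 8))  # ~ $8.20  -- same 8 hits, cheap prime
```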

05 · Hidden Costs: The costs that wreck naive deployments.

Even with a clean cache topology, three categories of cost slip past most production budgets. Each is small per call; each compounds at scale.

  • Output amplification. Long-context inputs tend to elicit long outputs because the model summarizes everything in scope. A 200K-token context call asking "what does this codebase do?" routinely returns 5K tokens of output at $25/1M, which is $0.13 per call before reasoning. Add an explicit output token budget (max_tokens plus a stop-on-section instruction) or an output-shape system prompt to cut this 60–80%.
  • Tool-call cache invalidation. In long agent loops the tool history mutates with every turn. Past roughly 50 turns, the cached prefix sits behind so much new content that re-anchoring becomes cheaper than continued reads. Build an explicit re-anchor checkpoint into agent loops longer than 30 turns.
  • Multi-region replication. Cache lives in the region where it was written. Multi-region deployments that fail over without warming the new region's cache pay full uncached rates until the new region's cache primes. For mission-critical apps, build pre-warming into the fail-over runbook.
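The first bullet's numbers, worked out at the article's $25/1M output rate (the token counts are the illustrative figures above):

```python
OUTPUT_RATE = 25.00 / 1_000_000  # USD per output token

def output_cost(tokens: int) -> float:
    """Output-side cost of one call in USD."""
    return tokens * OUTPUT_RATE

print(output_cost(5_000))  # ~ $0.125 -- untrimmed summary of everything in scope
print(output_cost(1_000))  # ~ $0.025 -- shape-prompted answer, an 80% trim
```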
"Output amplification was 41% of our spend in the first month — we cut it to 9% with one paragraph of output-shape instructions."— Internal cost-tuning retro, May 2026

06 · Production Recipes: Six patterns, costed out.

The recipes below are what we run for clients today. Each lists the cache topology, the typical context size, and the per-call cost band so the production budget question has a real answer up front.

Recipe 1
Repo Q&A · static prefix

Repo skeleton + style guide + relevant 50K of file content cached for 24h. Per-question cost: $0.06–$0.12 after prime. 12+ hits/day = clean win versus RAG.

$0.06 / question
Recipe 2
Long-document review · sliding window

200K-token contract + role/constraints prompt. Each query iterates over a different page range. Cache hits hold for system + contract; query rotates.

$0.18 / query
Recipe 3
Multi-tenant docs assistant · layered

Global system + org config + tenant data, three cache markers. Per-tenant TTL set to organic session length. Mid-band cost; correctness payoff is the win.

$0.10 / answer
Recipe 4
Agent loop · re-anchor at 40 turns

Tools + mission cached at start; re-anchor every 40 turns to keep prefix in cache scope. Prevents the silent invalidation that wrecks long-running agents.

$0.04 / turn
Recipe 5
Brand content generator · static

Brand voice + style guide + 200K of approved reference content cached for 24h. Generation calls each cost ~$0.08 + output. Pays back after ~10 daily generations.

$0.08 + output
Recipe 6
Knowledge base · hybrid RAG-cache

Vector retrieval narrows to top 30K of context, then the assembled prefix is cached for 5 minutes. Wins when same chunks repeat across users in a session.

$0.05 / query

07 · Conclusion: Cache topology is the real 1M context strategy.

The shape of Opus 4.7 economics · April 2026

The window is the headline; the topology is what decides the bill.

Claude Opus 4.7's 1M-token context is the headline feature, but the economics are governed by the cache. Every team that ships production on Opus 4.7 either learns this in the first month and adapts, or learns it in the third month and rewrites.

The four cache topologies above are not theoretical — they are what we run today. Static prefix for repo Q&A, layered prefix for multi-tenant SaaS, sliding window for long agent loops, hybrid RAG-cache for knowledge bases. Picking the right one is a function of how often the same context repeats and how stable it is, not how new the topology looks.

The deeper move is to stop measuring per-token rate and start measuring cost-per-answer. That number tells you whether the cache is earning its keep — and it's the only metric that holds up across model releases, pricing shifts, and the next context-window jump.

Production-grade Opus economics

Move past per-token pricing. Optimize for cost-per-answer.

We design and operate cache-aware Claude Opus 4.7 deployments for engineering and content teams shipping production at scale — covering topology selection, break-even modelling, output-amplification trim, and per-workload cost telemetry.

What we work on

Opus 4.7 cost engagements

  • Cache-topology selection by workload class
  • Break-even modelling — 5-min vs 1-h vs 24-h tiers
  • Output-amplification trim and shape-prompts
  • Tool-loop re-anchor checkpoints
  • Multi-vendor routing — Opus / GPT-5.5 / V4 / open weights
FAQ · Claude Opus 4.7 1M context economics

The questions we get every week.

What does a single 800K-token call actually cost?

At Anthropic's API rack rate of $5/1M input and $25/1M output, a single 800K-token call costs $4 in input alone; typical 5K-token outputs add another $0.13. Cache writes are billed at 1.25× input ($6.25/1M) and reads at 0.10× ($0.50/1M), so the same 800K call hit at the 90% cache discount drops to roughly $0.40 plus output. The headline rate is misleading without modelling cache topology: most teams pay 8–12× their cached steady-state rate during the first month of deployment because they have not put the cache in.