The Claude Opus 4.6 to 4.7 migration looks like a routine version bump on the release notes and behaves like a four-axis breaking change in production. Prompt-cache keys change shape, the tool-use schema gains structured outputs plus parallel calls, 1M context arrives with a new economic envelope, and a built-in compaction API quietly retires every hand-rolled conversation summarizer your team has ever shipped.
None of those are listed as breaking changes in the headline release notes. All four are breaking in the sense that matters — they change observable behavior in production code, they change the cost curve, and they change the patterns your agents were written against. Treat 4.7 as a minor version at your peril; treat it as a migration and the work gets predictable.
This guide is the playbook we run for clients moving from Opus 4.6 to 4.7 across agent stacks, RAG pipelines, and production chat. Seven sections cover what shipped, what broke under the hood, how to decide which workloads jump to 1M, when to adopt the compaction API, and the four pitfalls every migration we have audited hits in the first 72 hours. The brief is short: phased beats big-bang, shadow before cut over, and the playbook matters more than the version bump.
- 01 · Prompt-cache key changes are the silent breaking change. Cache keys are derived differently in 4.7; 4.6 cache entries do not transfer. Re-warm cache assumptions before cut over, or expect a hit-rate cliff on day one and a surprise on the next invoice.
- 02 · Parallel tool calls fundamentally shift agentic patterns. 4.7 can emit multiple tool_use blocks in a single turn. Audit your orchestration loops: code written for one-tool-per-turn will silently drop calls or serialize what should be parallel.
- 03 · 1M context is a workload-by-workload decision. Default to selective adoption. 1M is excellent for full-codebase analysis and long-document RAG; for chat and short-context agents it inflates spend without measurable lift. Decide per workload, not per model.
- 04 · The compaction API removes the most fragile custom code teams wrote. Adopt aggressively. Built-in compaction handles the conversation summarization every agent stack hand-rolled, and it does it with the model that produced the conversation, which is the right primitive.
- 05 · Phased rollout beats big-bang every time. Shadow-test on 5% of traffic before cutting over. Two-thirds of the regressions we have audited surface in the first 72 hours and would have been caught in shadow at zero customer cost.
01 — What's New
4.7 ships in four axes: capability, cache, tools, compaction.
Opus 4.7 is the first Claude release in the 4.x line that is simultaneously a capability bump, an infrastructure refactor, and a primitives expansion. The capability story is the one the release notes lead with — improved coding, longer planning horizons, better tool-use reliability. The other three stories are the ones that decide whether your migration is clean.
Frame the four axes deliberately, because each one has a different blast radius and a different rollout strategy. Capability is backward-compatible by definition — your prompts still work, they just work better. Cache is silent and infrastructural — wrong, and you only notice on the invoice. Tools is interface-level — the schema looks similar but agentic code written for 4.6 will misbehave. Compaction is purely additive — adopting it deletes code, but ignoring it leaves you running fragile hand-rolled summarizers in production.
- Capability (backward-compatible; drop-in lift). Coding, planning horizon, and tool-use reliability all improve. SWE-bench, AIME, and Anthropic's internal agentic harnesses all move up. No prompt changes required, but worth re-running your evals because the rank-ordering of your best prompts may shift.
- Prompt cache (silent breaking change; re-warm before cut over). Cache key derivation changed in 4.7. Existing 4.6 cache entries do not transfer. On cut-over day, hit rate drops to zero until the new cache warms — for high-traffic workloads, that's a real spend spike. Plan for a warm-up window, or pre-warm via shadow traffic.
- Tool-use schema (structured outputs + parallel calls; audit orchestration loops). 4.7 can return strict-schema structured outputs and emit multiple tool_use blocks per turn. Most agent loops written for 4.6 assume one-tool-per-turn; they will either drop calls or serialize, defeating the latency benefit of parallel tools.
- Compaction API (built-in conversation compaction; adopt aggressively). Native API for summarizing in-flight conversations to fit budget windows. Replaces the hand-rolled summarizers most agent stacks ship today. The same model that produced the conversation does the compaction — the correct primitive, finally exposed.

One framing note worth getting right early: 4.7 is not a strict superset of 4.6 in every dimension. On a small number of narrow evals — particularly highly-rehearsed prompt patterns that were tuned to 4.6 quirks — 4.6 still edges out. Run your own evals before cut over. The aggregate trend is clearly up; the per-prompt picture is occasionally messier.
02 — Prompt Cache
Hashing, TTLs, and the new cache-miss patterns.
Prompt caching is the single highest-leverage cost lever Anthropic ships, and the 4.7 release changes how it works under the hood in three ways that matter. First, cache-key derivation includes new model-version components, so 4.6 cache entries are effectively invalidated for 4.7 traffic. Second, TTL semantics are extended — the default 5-minute window is joined by a longer-lived tier for workloads with stable system prompts. Third, the cache-miss patterns themselves change shape: a miss in 4.7 is more often a partial miss with a shorter re-encode, rather than a full re-encode.
The practical implications stack. If your workload was caching a stable 50K-token system prompt on 4.6 and seeing 90%+ hit rates, day one on 4.7 is a 0% hit rate until traffic warms the new cache. For high-QPS workloads, that's a meaningful day-one spend spike — measurable, transient, but visible on the invoice. The fix is to pre-warm via shadow traffic, or to time the cut over for a low-traffic window so the warm-up cost is absorbed at off-peak rates.
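If you pre-warm, each warm-up request only needs to replay the stable prefix once and ask for a minimal completion. A sketch of the request shape, assuming 4.7 keeps the `cache_control` ephemeral-block interface from earlier Anthropic API releases; the model id and prompt text are placeholders:

```python
# Sketch: pre-warm the 4.7 cache by replaying the stable prompt prefix
# before cut over. Assumes the cache_control block interface carries over
# from earlier Anthropic API releases; the model id is a placeholder.

def build_prewarm_request(stable_system_prompt: str,
                          model: str = "claude-opus-4-7") -> dict:
    """Build a request whose system block is marked cache-eligible."""
    return {
        "model": model,
        "max_tokens": 16,  # minimal completion -- we only want the prefix cached
        "system": [
            {
                "type": "text",
                "text": stable_system_prompt,
                # ephemeral cache_control marks this block as a cache breakpoint
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": "ping"}],
    }

req = build_prewarm_request("You are a support agent for Acme...")
```

Sending one such request per distinct stable prefix during the shadow window means cut-over-day traffic lands on a warm cache instead of paying full re-encode rates.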
- Shadow traffic before cut over (best for high-QPS). Route a duplicate copy of production traffic to 4.7 for 24-48 hours before cut over. Cache warms naturally, the spend spike happens during the shadow window, and cut-over day is uneventful. Most expensive of the three options, but the safest.
- Time the switch to low-traffic windows (best for variable traffic). Cut over during the lowest-traffic hour. Cache warms organically on the smallest customer cohort, the spend spike is contained, and by the time peak traffic hits, the cache is hot. Cheapest option for variable-traffic workloads.
- Cut over and absorb 24 hours of cold cache (default; only with finance sign-off). Cut over instantly and accept a one-day spend bump while the cache warms naturally. Operationally simplest. Only acceptable if your finance team has been pre-briefed and the one-day delta is within tolerance — for some workloads it is, for others it is not.
- Adopt the extended TTL where it fits (worth evaluating per workload). For workloads with truly stable system prompts (a customer-support bot, a fixed-persona assistant), the longer-lived cache tier reduces re-warm frequency. Not for prompts that change daily — the longer TTL only helps if the underlying prompt is genuinely stable.

One subtle behavior to test for: cache-key derivation in 4.7 is more sensitive to ordering and whitespace in the system prompt than the 4.6 implementation was. If your prompt-assembly code occasionally produces semantically-identical-but-byte-different outputs — say, a JSON serializer that doesn't pin key order — you'll see lower hit rates than expected. Pin the byte representation of the cached portion of your prompt deterministically.
For workloads where prompt caching is doing real economic work, the migration is a good moment to also audit which portions of your prompt are actually cache-eligible. Anthropic's cache granularity has tightened over time; some patterns that worked on 4.5 and earlier are sub-optimal on 4.7. If you want a structured audit of your stack's prompt-cache hit rate plus recommendations on partitioning, our AI transformation engagements include exactly that as part of the migration scope.
"The cache layer is the silent profit center of every production Claude deployment. A migration that ignores cache key changes can erase a quarter of margin in one cut-over weekend."— Production lesson · Digital Applied migration kit
03 — Tool-Use Schema
Structured outputs and parallel tool calls.
The tool-use surface in 4.7 changes in two ways that matter for anyone running agents. First, structured outputs gain strict-schema adherence — when you provide a JSON schema, 4.7 produces output that conforms by construction, not by best-effort. Second, and more consequentially, a single model turn can now emit multiple tool_use blocks. The 4.6 pattern of one tool call per turn becomes one or more tool calls per turn, with the model deciding when parallelism is appropriate.
Parallel tool calls are a strict capability improvement — but only if your orchestration loop knows how to handle them. The patterns we have seen break in the first week of cut over are consistent enough to be worth listing.
The three orchestration anti-patterns
- Drop-the-second-call. The loop reads the first tool_use block, dispatches it, and ignores subsequent blocks in the same response. Symptom: the agent silently drops half its work. The model thinks it called five tools; you returned results for one.
- Serialize-what-should-be-parallel. The loop detects multiple tool_use blocks and dispatches them sequentially. Symptom: latency that should be parallel-bounded becomes additive. The agent works, just slowly — and the latency-improvement story that justified the migration evaporates.
- Lose-tool-ordering. The loop dispatches parallel calls correctly but returns the tool_result blocks in the wrong order. The model handled them but is now reasoning over scrambled context. Symptom: subtle quality drops that don't show up on aggregate evals.
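A loop that avoids all three anti-patterns dispatches every tool_use block in the turn, runs them concurrently, and returns results in the model's original order. A sketch under the Anthropic Messages API block shapes (`tool_use` / `tool_result`); `dispatch_tool` is your own function:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of an orchestration loop that handles N tool_use blocks per turn:
# dispatch in parallel, return tool_result blocks in the model's order.
def run_tool_calls(content_blocks: list[dict], dispatch_tool) -> list[dict]:
    # Collect EVERY tool_use block, not just the first (anti-pattern 1).
    tool_uses = [b for b in content_blocks if b["type"] == "tool_use"]
    # Dispatch concurrently (anti-pattern 2); executor.map preserves input
    # order, so results line up with the model's call order (anti-pattern 3).
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(
            pool.map(lambda b: dispatch_tool(b["name"], b["input"]), tool_uses)
        )
    return [
        {"type": "tool_result", "tool_use_id": b["id"], "content": r}
        for b, r in zip(tool_uses, results)
    ]

# Illustrative usage with a fake dispatcher.
blocks = [
    {"type": "text", "text": "planning..."},
    {"type": "tool_use", "id": "tu_1", "name": "search", "input": {"q": "a"}},
    {"type": "tool_use", "id": "tu_2", "name": "fetch", "input": {"q": "b"}},
]
results = run_tool_calls(blocks, lambda name, inp: f"{name}:{inp['q']}")
```

Latency on a turn with N independent calls is bounded by the slowest call rather than the sum, which is the improvement that justified parallel tools in the first place.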
Structured outputs are the gentler half of the schema story. Strict-schema adherence means you can trust the JSON shape coming back — no more defensive parsers, no more retry-on-malformed logic. For teams that built elaborate validation layers around 4.6 outputs, the migration is also a deletion opportunity. Keep the schema validation as a safety net, delete the retry-on-parse fallback loops.
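The surviving safety net can be a few lines of shape-checking rather than a retry loop. A hypothetical sketch — the field names and `required` map are illustrative, not from any real schema:

```python
# Sketch: keep a cheap shape check as a safety net behind strict-schema
# outputs, but delete the retry-on-parse fallback. Fields are illustrative.
def validate_shape(payload: dict, required: dict[str, type]) -> dict:
    """Raise if any required field is missing or mistyped; else pass through."""
    for field, ftype in required.items():
        if not isinstance(payload.get(field), ftype):
            raise ValueError(f"unexpected shape: {field!r} is not {ftype.__name__}")
    return payload

out = validate_shape(
    {"ticket_id": "T-123", "priority": 2},
    {"ticket_id": str, "priority": int},
)
```

With strict-schema adherence the raise path should never fire in practice; keeping it turns a silent downstream corruption into a loud, attributable alert if it ever does.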
For the deeper context on how to write tool definitions that play well with the new parallel-call behavior, our piece on building Claude Code custom subagents covers the same tool-allowlist discipline applied to the subagent surface — the lessons port directly to plain API tool definitions.
04 — 1M Context
Break-even tables — when 1M pays off.
Opus 4.7 ships with a 1M-token context window as a configurable tier. The capability is real, the pricing is non-trivial, and the right answer is almost always per-workload selective adoption rather than turning it on everywhere. The economics flip at different points for different workload classes — the chart below sketches the rough break-evens.
The framing question is simple: at what input-token count does the marginal cost of the longer context window stop being amortized by the marginal capability lift? For most chat traffic, the answer is "never" — chat messages don't approach 1M tokens, so you pay the long-context overhead with no benefit. For full-codebase analysis, RAG pipelines over large corpora, and multi-document reasoning, the answer flips the other way and 1M becomes economically dominant.
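The question reduces to simple arithmetic once you fix your tier rates. A sketch with PLACEHOLDER prices — these are not Anthropic's actual rates; substitute your own per-million-token figures:

```python
# Sketch: marginal cost of running a request on the 1M tier instead of 200K.
# RATE_* are PLACEHOLDER $/1M-input-token figures, not real pricing.
RATE_200K = 15.00
RATE_1M = 22.50

def tier_cost_delta(input_tokens: int) -> float:
    """Extra dollars per request paid for the 1M tier at the same input size."""
    return input_tokens / 1_000_000 * (RATE_1M - RATE_200K)

# Under these placeholder rates, a 4K-token chat turn pays cents extra for no
# capability lift; an 800K-token codebase pass pays dollars extra and unlocks
# behavior with no 200K equivalent.
chat_delta = tier_cost_delta(4_000)
codebase_delta = tier_cost_delta(800_000)
```

The adoption rule falls out directly: move a workload to 1M only where the capability lift you can measure is worth more than this delta at that workload's typical input length.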
When to adopt 1M context · workload-class decision matrix
Illustrative break-even framing — pricing and capability deltas vary by workload; benchmark on your own prompts.

Read the chart two ways. The bars on the right are the workloads where staying on the 200K tier is fine and 1M is overhead. The bars on the left are workloads where 1M unlocks behavior that was either impossible (cross-file refactor reasoning) or required elaborate chunking workarounds (long-document RAG). The middle band — mid-context agents — is the one that takes per-workload measurement.
One pattern that has worked well in client migrations: route within a single deployment based on input length. A simple length-based router that sends short requests to the 200K tier and long requests to 1M captures the economic upside without forcing a single-tier decision on every workload. The downside is the routing layer becomes part of your migration scope — another reason phased rollout matters.
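The router itself is tiny; the migration cost is owning it. A minimal sketch — the threshold and tier labels are illustrative, and the cutoff should be tuned against your own input-token histogram:

```python
# Sketch: length-based routing between context tiers. Threshold and tier
# names are illustrative; tune the cutoff to your own token distribution.
LONG_CONTEXT_THRESHOLD = 150_000  # tokens; leaves headroom under the 200K tier

def pick_tier(estimated_input_tokens: int) -> str:
    """Send short requests to the 200K tier, long ones to 1M."""
    if estimated_input_tokens > LONG_CONTEXT_THRESHOLD:
        return "1m"
    return "200k"
```

Note that the token estimate feeding the router matters: a cheap character-count heuristic is usually enough, since requests near the boundary are rare and either tier handles them correctly.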
- Long-document RAG (cross-document reasoning). Legal, financial, and scientific corpus workflows where chunking previously meant retrieval misses and stitched-together answers. 1M lets the model see the whole document, and quality lifts in ways that show up on user-perceived metrics.
- Full-codebase analysis (cross-file dependency). Architecture review, cross-file refactor reasoning, large-monorepo onboarding. There is no equivalent capability at 200K — chunking a codebase fragments the dependency graph the model needs to reason over. 1M is the only path.
- Short interactive chat (default to the shorter tier). Customer-support bots, in-product assistants, anything where 95% of turns fit comfortably in 32K. 1M adds cost without measurable benefit. Stay on the shorter tier; route the rare long-context outlier to 1M if it appears.

05 — Compaction API
Built-in conversation compaction.
Every production agent stack ships with a hand-rolled conversation summarizer. The pattern is universal because the problem is universal — long-running agents accumulate context that eventually exceeds budget, and the only way to keep running is to summarize older turns into a compressed form. The code that does this in every stack we have audited is the most fragile, least-tested, most-likely-to-silently-lose-information code in the agent. Opus 4.7's compaction API replaces all of it.
The primitive is the right one. Pass in the conversation, get back a structured summary that preserves the information the model itself judges relevant for continued reasoning. The model doing the compaction is the same model that produced the conversation, so it knows which threads it was tracking, which entities matter, and what state it had built up internally. The hand-rolled equivalent — using a separate cheap model to summarize, or asking the same model to re-summarize from scratch on every call — never had that context.
"Conversation compaction was the single most-fragile piece of code in every agent stack we audited. The compaction API moves that responsibility from product code to platform — which is exactly where it belongs."— Production lesson · Digital Applied migration kit
Three adoption patterns are worth knowing. The first is the drop-in replacement — wherever you currently call your hand-rolled summarizer, call the compaction API instead. This is the lowest-risk path and produces immediate quality lift on most workloads. The second is the threshold-triggered pattern — compact only when the conversation approaches a configurable budget, rather than on every turn. This minimizes compaction cost while preserving budget headroom. The third is the opportunistic pattern — compact during natural turn boundaries (user idle, system events) so compaction latency never appears on the critical path.
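The threshold-triggered pattern is a small wrapper around whatever the SDK exposes. A sketch of the trigger logic only — `compact` below stands in for the actual 4.7 compaction call, whose real name and signature we leave to the Anthropic SDK release notes:

```python
# Sketch of the threshold-triggered pattern: compact only when the running
# conversation approaches a budget. `compact` stands in for the platform
# compaction call; its real name/signature belong to the SDK release notes.
def maybe_compact(history: list[dict], token_count: int,
                  budget: int, trigger_ratio: float, compact) -> list[dict]:
    """Return history unchanged below the trigger, compacted above it."""
    if token_count < int(budget * trigger_ratio):
        return history
    # Platform compaction replaces the hand-rolled summarizer here.
    return compact(history)

# Illustrative usage with a fake compaction function.
history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
fake_compact = lambda h: [{"role": "user", "content": "summary of 10 turns"}]
untouched = maybe_compact(history, token_count=10_000,
                          budget=200_000, trigger_ratio=0.8,
                          compact=fake_compact)
compacted = maybe_compact(history, token_count=170_000,
                          budget=200_000, trigger_ratio=0.8,
                          compact=fake_compact)
```

A trigger ratio around 0.8 leaves headroom for the next few turns, so compaction never races the budget on the critical path.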
One adoption note: the compaction API output is a structured summary, not a raw text blob. Code that previously stitched a summary string into the prompt will need to handle the structured shape. The shape is stable and well-documented in the Anthropic SDK release notes, but it's not a one-line substitution — budget a few hours per agent for the integration.
06 — Phased Rollout
Audit → shadow → cut over → retire.
Big-bang migrations look efficient on a sprint plan and burn the most engineering time in the rollback. Phased rollout looks slower on paper and ends up faster end-to-end because the surprises surface during the shadow phase rather than in production. The four phases below are the rhythm we run across client migrations.
Each phase has a clear entry and exit criterion, a defined duration, and a specific class of work that happens during it. Skipping a phase doesn't make the migration faster — it just shifts which phase eats the surprises.
- Audit (1-3 days, read-only). Inventory every workload calling 4.6. For each: classify input-length distribution, identify tool-use patterns, locate hand-rolled summarizers, measure current cache hit rate. Output is a per-workload migration plan, not a single decision for the org. Entry: 4.7 available. Exit: per-workload plan.
- Shadow (1-2 sprints, duplicate traffic). Route a duplicate copy of production traffic to 4.7. Compare outputs, measure latency, watch cache warm-up, validate parallel tool-call handling. No customer-visible change. Most of the surprises surface here, at zero customer impact. Entry: per-workload plan. Exit: 72 regression-free hours.
- Cut over (5% → 25% → 100% over 1-2 weeks). Route real traffic to 4.7 in increasing slices. Monitor user-facing metrics at each step. Roll back instantly on regression. The 5% slice is the one that catches edge cases shadow missed; 25% confirms cache behavior at meaningful scale; 100% is the boring step. Entry: clean shadow. Exit: 100% on 4.7.
- Retire (1 sprint, deletion). Delete the 4.6 code paths, the hand-rolled summarizers replaced by the compaction API, the parse-retry loops replaced by structured outputs, the fallback model routing. Phase 4 is where the migration pays for itself in maintenance. Entry: 100% on 4.7. Exit: 4.6 code gone.

Two phase-level rules earn their keep. First, the shadow phase exits only when 72 hours of duplicated traffic have produced no regressions — anything shorter and you haven't seen the workload's full cycle (the report that runs daily, the batch job that runs Sunday night, the campaign that fires Monday morning). Second, the cut-over phase has explicit rollback criteria documented before the first slice goes out — not invented under pressure when an alert fires at 3am.
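The 5% → 25% → 100% slicing works best when a given user sticks to one model across requests, which means bucketing by a stable hash rather than rolling dice per request. A minimal sketch — the salt string is illustrative:

```python
import hashlib

# Sketch: deterministic percentage rollout for the cut-over phase. Hashing
# the user id (rather than random choice) keeps each user on one model
# across requests; the salt string is illustrative.
def in_rollout(user_id: str, percent: int, salt: str = "opus-4-7") -> bool:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket 0-99
    return bucket < percent

# Because the bucket is stable, raising percent 5 -> 25 -> 100 only ever
# adds users to the 4.7 cohort -- no one flips back and forth between models.
```

The same property makes rollback clean: dropping the percentage returns exactly the most recently added cohort to 4.6, so regression reports map onto a known user set.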
For broader context on how Anthropic SDK changes interact with the migration — and the specific SDK upgrade path that needs to happen in parallel — our companion guide on the Anthropic SDK v2 to v3 migration walks through the TypeScript-specific deltas. For most teams, the SDK upgrade lands at the same time as the model migration and the two are best planned together.
07 — Common Pitfalls
Four ways the migration bites.
Across the migrations we have audited, four pitfalls show up often enough to be worth naming explicitly. None of them are exotic — all of them are easy to miss if the migration is run as a version bump rather than a four-axis change.
- Cache-miss cliff on day one (fix: pre-warm cache). Cache invalidated by the version bump; hit rate drops to 0% until traffic warms it. The finance team notices before the engineering team. Mitigation: pre-warm via shadow traffic, time the cut over to off-peak, or brief finance ahead of the spike.
- Silent parallel tool-call drops (fix: audit orchestration). The agent loop reads the first tool_use block and ignores the rest. The model thinks it called five tools; only one runs. The symptom is subtle — quality regresses on multi-step tasks but aggregate metrics look fine. Mitigation: explicit N-tool-block handling.
- 1M context spend spike (fix: selective adoption). Turning on 1M everywhere because the release notes mentioned it. Most workloads don't benefit; all of them pay. Mitigation: per-workload adoption decisions, with length-based routing for mixed-distribution workloads.
- Lingering hand-rolled summarizers (fix: include in scope). Adopting 4.7 capabilities without adopting the compaction API. The summarizer still runs, still ships bugs, still consumes engineering time. Mitigation: include the compaction-API swap in the migration scope, not as a follow-up.

One meta-pitfall worth naming: assuming the migration is done when traffic is on 4.7. Phase 4 — retirement — is where the migration actually pays back. Code paths that linger past the cut-over date accumulate drift, get accidentally re-exercised by unrelated changes, and confuse the next engineer who joins the team. Schedule the retirement work explicitly. The migration ends when the 4.6 code is gone, not when the 4.7 code is live.
Frontier-model migrations compound — the playbook matters more than the version bump.
Opus 4.6 to 4.7 looks like a routine upgrade and behaves like a four-axis migration. Capability is the drop-in part; cache, tools, and compaction are the parts that decide whether the rollout is clean. Each axis has a different blast radius and a different rollout strategy, and the playbook that treats them separately is the one that ships in a sprint instead of a quarter.
The broader signal is that frontier-model release cadence has quietly become a continuous-migration problem. The teams that run frontier models in production are running a migration every six to twelve weeks — same patterns, different specifics each time. The playbook above generalizes: audit, shadow, cut over, retire applies just as cleanly to the next Opus release, the next GPT release, the next Gemini release. Build the playbook once; execute it on cadence.
One last framing. The teams that get hurt by frontier-model migrations are the ones who treat each release as a separate project. The teams that get leverage are the ones who treat migration as a recurring competence — a fixed-cost engineering capability that turns a quarterly disruption into a sprint-level task. The four-axis playbook is how you get there.