Claude Opus 4.7: Anthropic's New Frontier Model Guide
Claude Opus 4.7 scores 64.3% on SWE-bench Pro with 2576px vision, xhigh effort, and same Opus 4.6 pricing. Full benchmark and migration guide.
- SWE-bench Pro: 64.3%
- SWE-bench Verified: 87.6%
- Max Image Edge: 2,576 px
- Input / Output: $5 / $25 per million tokens
Key Takeaways
On April 16, 2026, Anthropic made Claude Opus 4.7 generally available. It is the company's latest frontier model and the first Mythos-class release to ship with production safeguards, positioned as a direct upgrade to Opus 4.6 with substantial gains in advanced software engineering, vision, and long-horizon agentic work.
For agencies, platform teams, and anyone building on top of Claude, this release matters for three reasons. The coding benchmarks show a genuine step change rather than an incremental bump. The pricing holds flat at $5 input and $25 output per million tokens. And the model's stricter instruction following means existing prompts and harnesses need a careful second pass before rollout. This guide walks through what actually changed, what the numbers mean in production, and how to plan the migration.
Release snapshot: Opus 4.7 launched April 16, 2026 across the Claude apps, Claude Code, and the Claude API as claude-opus-4-7, plus Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Based on the official Anthropic announcement.
What Is Claude Opus 4.7
Claude Opus 4.7 is Anthropic's latest generally available frontier model. It sits below Claude Mythos Preview, Anthropic's most powerful but deliberately restricted model, in raw capability, and above Opus 4.6 on essentially every benchmark Anthropic reported. The Mythos restriction ties back to Project Glasswing, Anthropic's framework for managing cybersecurity risks from frontier AI, which established that Mythos-class models would be held back while safeguards were tested on less capable systems. Opus 4.7 is the first such model.
During training, Anthropic specifically experimented with reducing Opus 4.7's cyber capabilities relative to Mythos, and the release ships with safeguards that automatically detect and block prompts signaling prohibited or high-risk cybersecurity use. Legitimate security professionals, red teamers, and vulnerability researchers can apply to a new Cyber Verification Program to unlock the relevant capabilities.
- Stronger advanced software engineering, with the largest gains on the hardest tasks.
- Better long-running task handling with self-verification before reporting back.
- Substantially higher-resolution vision, with image edges up to 2,576 pixels.
- More tasteful and creative outputs on professional work like interfaces, slides, and docs.
- Better use of file system memory across long, multi-session work.
- A new xhigh effort level for fine-grained reasoning control.
Context Window and Output Limits
Opus 4.7 keeps the same envelope as Opus 4.6 for context size, output limits, and platform features. The model supports a full 1M-token context window at standard API pricing with no long-context premium, and a 128k-token maximum output ceiling per response. Adaptive thinking, tool use, file system memory, and the other platform capabilities shipped with Opus 4.6 are all available on day one. The model ID on the API is claude-opus-4-7.
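A minimal sketch of what a call against the new model ID looks like, using the Anthropic Python SDK (the prompt and output limit here are placeholders, not recommendations):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # model ID from the announcement
    max_tokens=64000,         # anything up to the 128k output ceiling
    messages=[
        {"role": "user", "content": "Summarize the trade-offs of moving this service to event sourcing."}
    ],
)
print(response.content[0].text)
```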
How It Fits the Claude Lineup
Opus remains Anthropic's most capable generally available tier, with Sonnet as the balanced workhorse and Haiku as the fast, cost-efficient option. Mythos Preview sits above Opus 4.7 but under controlled access. For agency workflows that lean heavily on reasoning depth, tool use, and long-horizon autonomy, Opus 4.7 is the new default choice.
Benchmark Results Breakdown
Anthropic published a comparison table covering Opus 4.7 against Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Mythos Preview across twelve benchmark categories. A few themes emerge from the numbers.
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | Mythos Preview |
|---|---|---|---|---|---|
| SWE-bench Pro | 64.3% | 53.4% | 57.7% | 54.2% | 77.8% |
| SWE-bench Verified | 87.6% | 80.8% | — | 80.6% | 93.9% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | 75.1% | 68.5% | 82.0% |
| Humanity's Last Exam (tools) | 54.7% | 53.3% | 58.7% | 51.4% | 64.7% |
| BrowseComp | 79.3% | 83.7% | 89.3% | 85.9% | 86.9% |
| MCP-Atlas | 77.3% | 75.8% | 68.1% | 73.9% | — |
| OSWorld-Verified | 78.0% | 72.7% | 75.0% | — | 79.6% |
| Finance Agent v1.1 | 64.4% | 60.1% | 61.5% | 59.7% | — |
| CyberGym | 73.1% | 73.8% | 66.3% | — | 83.1% |
| GPQA Diamond | 94.2% | 91.3% | 94.4% | 94.3% | 94.6% |
| CharXiv Reasoning (tools) | 91.0% | 84.7% | — | — | 93.2% |
| MMMLU | 91.5% | 91.1% | — | 92.6% | — |
The standout gains sit squarely in agentic coding and long-horizon work. SWE-bench Pro jumps almost eleven points against Opus 4.6 and beats GPT-5.4 by more than six. SWE-bench Verified clears 87.6%, and Opus 4.7 holds the top spot on MCP-Atlas for scaled tool use and on Finance Agent v1.1.
The weak spots are worth noting too. BrowseComp dropped from 83.7% on Opus 4.6 to 79.3% on Opus 4.7, with GPT-5.4 at 89.3% holding clear leadership for agentic search. Terminal-Bench 2.0 at 69.4% trails GPT-5.4's self-reported 75.1%. And on multilingual Q&A (MMMLU), Gemini 3.1 Pro keeps a narrow edge. For most coding-heavy agency work these tradeoffs will be acceptable, but teams running production web research pipelines should test both Opus 4.7 and GPT-5.4 before switching.
For a deeper head-to-head on the GPT-5.4 matchup, see our Claude Opus 4.7 vs GPT-5.4 agentic coding comparison.
Need help deciding which model fits your stack? Model selection rarely comes down to a single benchmark. Explore our AI Digital Transformation service to map models to your actual workloads.
The Coding Leap Explained
The benchmark table only tells part of the coding story. The more interesting signal comes from early-access partner evaluations across production workloads, which consistently describe Opus 4.7 as a step change rather than a routine upgrade.
Cursor's internal CursorBench score jumped from 58% on Opus 4.6 to over 70% on Opus 4.7, described by Cursor co-founder Michael Truell as a meaningful jump in capabilities and more creative reasoning.
Another early-access partner reports resolution rates lifted 13% over Opus 4.6, including four tasks neither Opus 4.6 nor Sonnet 4.6 could solve, with faster median latency and stricter instruction following.
Rakuten teams report Opus 4.7 resolves 3x more production tasks than Opus 4.6, with double-digit gains in code quality and test quality on real engineering work.
Hex reports low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6, and that the model correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks.
Long-Horizon Autonomy
Partner reports keep circling back to the same theme: Opus 4.7 sustains coherent work over much longer runs than Opus 4.6. Devin's team describes it as working coherently for hours, pushing through hard problems rather than giving up. Genspark highlights loop resistance as the most critical production metric, noting that a model looping indefinitely on 1 in 18 queries wastes compute and blocks users. And Notion Agent's team reports a 14% improvement on complex multi-step workflows at fewer tokens and a third of the tool errors.
Self-Verification Before Reporting
A behavioral shift worth highlighting: Opus 4.7 devises ways to verify its own outputs before reporting back. Vercel's Joe Haddad noted the model even does proofs on systems code before starting work, which is new behavior compared to earlier Claude models. For agency workloads where clients push back on confident-but-wrong AI output, this is the single most valuable behavior change in the release.
Vision and Multimodal Gains
Opus 4.7 now accepts images up to 2,576 pixels on the long edge, roughly 3.75 megapixels, more than three times the pixel count of prior Claude models. Crucially, this is a model-level change rather than an API parameter, so images sent to Claude are automatically processed at higher fidelity. Users who do not need the extra detail can downsample before sending to manage token cost.
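If you do want to downsample, a minimal sketch with Pillow (the target edge length here is an arbitrary example, not an Anthropic-recommended value):

```python
from PIL import Image

def downsample_long_edge(path: str, out_path: str, max_edge: int = 1568) -> None:
    """Shrink an image so its long edge is at most max_edge pixels, keeping aspect ratio."""
    img = Image.open(path)
    scale = max_edge / max(img.size)
    if scale < 1:  # only shrink, never upscale
        img = img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.LANCZOS,
        )
    img.save(out_path)
```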
XBOW, a company running autonomous penetration testing with heavy computer-use workloads, reports Opus 4.7 scoring 98.5% on their visual-acuity benchmark versus 54.5% for Opus 4.6. CEO Oege de Moor described it as a step change where their single biggest Opus pain point effectively disappeared, unlocking computer-use workflows they could not previously run.
Low-Level Perception and Coordinate Mapping
Beyond raw resolution, Opus 4.7 improves on low-level perception tasks like pointing, measuring, and counting, and on natural-image bounding-box localization and detection. Coordinates returned by the model now map 1:1 to actual image pixels, so operations that involve mapping coordinates back onto an image no longer require scale-factor math. For computer-use agents and UI automation workloads, that removes a common source of brittle glue code.
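As a sketch of what that simplification looks like in practice, assuming a hypothetical bounding box returned by the model in pixel coordinates:

```python
from PIL import Image

# Hypothetical (left, top, right, bottom) box the model returned for a UI element.
bbox = (1412, 388, 1630, 442)

screenshot = Image.open("dashboard.png")
# Coordinates map 1:1 to the pixels of the image that was sent,
# so no scale-factor correction is needed before cropping.
element = screenshot.crop(bbox)
element.save("element.png")
```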
Practical Use Cases Unlocked
- Computer-use agents reading dense application screenshots, CRM dashboards, and complex SaaS interfaces without losing small UI elements.
- Data extraction from diagrams including engineering schematics, technical drawings, and chemical structures, which Solve Intelligence flagged as newly viable for life sciences patent workflows.
- Dashboard and data interface work, with v0/Vercel's team describing Opus 4.7 as the best model in the world for building dashboards and data-rich interfaces.
- Document analysis including tables, annotated PDFs, and marked-up legal documents where fine visual detail changes meaning.
- .docx redlining and .pptx editing, where Opus 4.7 produces and self-verifies tracked changes and slide layouts. If your prompts include mitigations like "double-check the slide layout before returning," Anthropic recommends removing that scaffolding and re-baselining.
- Chart and figure analysis via programmatic tool-calling with image libraries like PIL, including pixel-level data transcription from rendered charts.
Effort Control, Pricing, and Tokenizer
Three changes matter for anyone planning capacity and cost against Opus 4.7: a new effort level, an updated tokenizer, and task budgets entering public beta.
The New xhigh Effort Level
Opus 4.7 introduces an xhigh tier that sits between the existing high and max settings, giving developers a finer-grained tradeoff between reasoning depth and latency on hard problems. In Claude Code the default has been raised to xhigh across all plans, and Anthropic recommends starting with high or xhigh for coding and agentic use cases. For latency-sensitive interactive work, stepping down from max to xhigh can recover meaningful response time without a meaningful quality drop.
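On the API, effort is set per request. A minimal sketch, assuming the output_config shape used in the task-budget example later in this guide and the literal string "xhigh" as the tier value:

```python
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=64000,
    output_config={"effort": "xhigh"},  # step down to "high" or "medium" for latency-sensitive work
    messages=[
        {"role": "user", "content": "Diagnose and fix the failing integration test described below."}
    ],
)
```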
Pricing Holds Flat at $5 / $25
Opus 4.7 lands at the same price as Opus 4.6: $5 per million input tokens and $25 per million output tokens across the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Holding price flat on a genuine capability upgrade is the main commercial headline for agencies running existing Claude pipelines.
The Tokenizer Change to Watch
Tokenizer shift: Opus 4.7 uses an updated tokenizer. The same input text can map to roughly 1.0 to 1.35x as many tokens depending on content type, which directly affects both input cost and context window usage. Opus 4.7 also thinks more at higher effort levels, producing more output tokens on hard problems.
Anthropic's internal testing shows the net effect is favorable on an internal coding evaluation, with token usage improved across all effort levels. But those internal evaluations run autonomously from a single prompt, so the results may not map cleanly to interactive coding or client workloads. The practical guidance is to measure real token spend on representative traffic before rolling out across production pipelines. Task budgets, now in public beta on the Claude Platform, give a way to cap per-task token spend so long runs stay bounded.
Real Agency Applications
For agencies running AI-assisted delivery, here are the areas where Opus 4.7's specific improvements translate into client-facing value:
Long-Running Engineering Tasks
Before: Complex refactors, migrations, and multi-file builds required heavy human supervision because the model would lose context or give up on hard sub-problems.
After: Opus 4.7 sustains coherent work for hours on the same task, with Devin and Factory both reporting the model carries work all the way through rather than stopping halfway.
Impact: Agency engineers can run tasks in parallel rather than 1:1, supervising multiple agents on long client projects.
Dashboard and Interface Builds
Before: Client dashboard projects required heavy design iteration because AI-generated interfaces often lacked taste on spacing, hierarchy, and color.
After: Vercel's v0 team calls Opus 4.7 the best model in the world for building dashboards and data-rich interfaces, with design choices they would actually ship.
Impact: Faster first-draft dashboards for SaaS and analytics clients, fewer revision cycles on visual hierarchy.
Code Review and QA Workflows
Before: Automated code review missed subtle bugs, produced noisy false-positive comments, and struggled with race conditions or concurrency issues.
After: CodeRabbit reports recall improved over 10% while precision held steady, and Warp notes Opus 4.7 cracked a concurrency bug Opus 4.6 could not. The new /ultrareview Claude Code command produces a dedicated reviewer session.
Impact: Higher-quality PR reviews delivered to clients, with the most-difficult-to-detect bugs surfaced automatically.
Document and Diagram Analysis
Before: Client document analysis was limited by image resolution, forcing manual extraction of tables and figures from technical PDFs.
After: The 2,576-pixel image support and 21% fewer document-reasoning errors (per Databricks OfficeQA Pro) enable pulling data from complex diagrams, annotated PDFs, and dense dashboards.
Impact: New service lines for agencies serving life sciences, legal, and financial clients with document-heavy workflows.
Agent Team Orchestration
Before: Multi-agent workflows suffered from loop conditions, poor tool-call planning, and agents drifting from assigned roles on longer jobs.
After: Genspark reports the highest quality-per-tool-call ratio they have measured, Hebbia sees double-digit gains in accuracy of tool calls and planning, and Ramp reports stronger role fidelity and coordination.
Impact: More reliable production agent systems for clients, supporting use cases that were previously too flaky to ship.
API Changes and Behavior Shifts
Claude Managed Agents absorb these API shifts automatically, but teams calling the Messages API directly need to plan for a few breaking changes and several behavior changes before switching traffic to Opus 4.7.
Messages API Breaking Changes
- Extended thinking budgets are removed. Passing `thinking: { type: "enabled", budget_tokens: N }` now returns a 400 error. Adaptive thinking is the only thinking-on mode, and Anthropic's internal evaluations show it reliably outperforms the old extended-thinking mode.
- Adaptive thinking is off by default. Requests without an explicit `thinking` field run with thinking off. Set `thinking: { type: "adaptive" }` to opt back in.
- Sampling parameters are removed. Setting `temperature`, `top_p`, or `top_k` to any non-default value returns a 400 error. The safest migration path is to drop those parameters entirely and steer behavior through prompting.
- Thinking content is omitted by default. Thinking blocks still appear in the response stream but the `thinking` field is empty unless callers opt in with `display: "summarized"`. Products that stream reasoning to users will otherwise show a long pause before output.
- Updated tokenizer. The new tokenizer may use roughly 1x to 1.35x as many tokens on the same input (up to about 35% more, varying by content). Anthropic recommends raising `max_tokens` for additional headroom and reviewing compaction triggers.
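Taken together, a request updated for these changes might look like the following sketch: adaptive thinking opted back in, no sampling overrides, and a raised output ceiling for tokenizer headroom. The prompt is a placeholder, and the exact placement of the `display: "summarized"` opt-in is left to Anthropic's API reference.

```python
# Migrated request shape: no budget_tokens, no temperature/top_p/top_k.
# To surface reasoning to users, also opt in with display: "summarized" as described above.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=64000,                # raised for tokenizer headroom
    thinking={"type": "adaptive"},   # the old budget_tokens form now returns a 400
    messages=[
        {"role": "user", "content": "Refactor the auth module to remove the legacy session store."}
    ],
)
```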
Task Budgets (Public Beta)
Task budgets are a new advisory cap across a full agentic loop, covering thinking, tool calls, tool results, and final output. The model sees a running countdown and uses it to scope work and finish the task gracefully as the budget is consumed. This is distinct from max_tokens, which is a hard per-request ceiling that the model never sees.
```python
response = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=128000,
    output_config={
        "effort": "high",
        "task_budget": {"type": "tokens", "total": 128000},
    },
    messages=[
        {"role": "user", "content": "Review the codebase and propose a refactor plan."}
    ],
    betas=["task-budgets-2026-03-13"],
)
```

The minimum task budget is 20k tokens. Skip task budgets for open-ended work where quality matters more than speed; reserve them for workloads where scoping the run to a token allowance is part of the product. Too-restrictive budgets can cause Opus 4.7 to complete tasks less thoroughly or refuse outright.
Behavior Shifts That Affect Prompts
- More literal instruction following, especially at lower effort levels. The model will not silently generalize from one example to another or infer requests you did not make.
- Response length calibrates to perceived complexity rather than defaulting to a fixed verbosity, so hardcoded "keep it under 3 paragraphs" scaffolding may become redundant.
- Fewer tool calls by default, with the model leaning more on reasoning. Raising effort increases tool usage when that is what you want.
- More direct, opinionated tone with less validation-forward phrasing and fewer emoji than Opus 4.6's warmer style. Client-facing chat prompts tuned to Opus 4.6's tone may need refreshing.
- More regular progress updates during long agentic traces. If you have added scaffolding to force interim status messages, try removing it.
- Fewer subagents spawned by default; steerable through prompting when a fan-out is needed.
- Real-time cybersecurity safeguards that may refuse prohibited or high-risk requests. Legitimate security work should apply to the Cyber Verification Program.
Migrating from Opus 4.6
Anthropic describes Opus 4.7 as a direct upgrade to Opus 4.6, but the migration is not strictly drop-in. Three practical areas deserve attention before flipping traffic.
1. Re-Tune Prompts for Literal Instruction Following
Opus 4.7 takes instructions literally rather than loosely. Where Opus 4.6 might have softened or skipped parts of a prompt, Opus 4.7 follows them as written. Prompts that relied on the model filling in gaps, or that contained contradictory instructions the old model silently resolved, will produce unexpected results. Audit system prompts for leftover instructions, clarify ambiguity, and drop redundant guardrails that were compensating for Opus 4.6's looseness.
2. Measure Token Spend on Real Traffic
The tokenizer change plus deeper thinking at higher effort levels shift the token math. On an internal agentic coding benchmark Anthropic shows token usage improved across all effort levels, but the right answer for any given workload is to measure. Run a representative sample of production traffic through Opus 4.7 at the effort level you plan to deploy, compare input-plus-output token totals against Opus 4.6, and adjust task budgets accordingly.
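A minimal sketch of that measurement loop, reading the `usage` fields the Messages API returns. The Opus 4.6 model ID and the output_config placement are assumptions here, and the prompt list stands in for your own traffic sample:

```python
def measure_spend(client, model: str, prompts: list[str], effort: str = "high") -> dict:
    """Run representative prompts through one model and total input/output tokens."""
    totals = {"input": 0, "output": 0}
    for prompt in prompts:
        resp = client.messages.create(
            model=model,
            max_tokens=32000,
            output_config={"effort": effort},  # assumes the shape shown in the task-budget example
            messages=[{"role": "user", "content": prompt}],
        )
        totals["input"] += resp.usage.input_tokens
        totals["output"] += resp.usage.output_tokens
    return totals

# Compare the same traffic sample across both models before switching;
# model IDs other than claude-opus-4-7 are illustrative.
# old = measure_spend(client, "claude-opus-4-6", sample_prompts)
# new = measure_spend(client, "claude-opus-4-7", sample_prompts, effort="xhigh")
```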
3. Start with High or xhigh Effort
Anthropic explicitly recommends starting with high or xhigh effort for coding and agentic use cases, and in Claude Code xhigh is the new default across all plans. For consumer-facing interactive workloads, test medium first to hold latency down, since Hex's observation that low-effort Opus 4.7 matches medium-effort Opus 4.6 applies here.
Further reading: Anthropic's full Opus 4.7 migration guide covers effort-level tuning, task budgets, and prompt adjustments in detail.
Safeguards and Alignment
Opus 4.7 is the first model released under the Project Glasswing framework, where Mythos-class capability improvements roll out with production safeguards on less capable models before broader Mythos release. For Opus 4.7 specifically, that means two deliberate decisions.
First, Anthropic experimented during training with reducing the model's cyber capabilities relative to Mythos. Second, the released model ships with automated safeguards that detect and block requests indicating prohibited or high-risk cybersecurity uses. On the CyberGym benchmark Opus 4.7 scores 73.1%, modestly below Opus 4.6's 73.8% and well below Mythos Preview's 83.1%.
Security professionals conducting legitimate vulnerability research, penetration testing, and red-teaming can apply to Anthropic's new Cyber Verification Program to access Opus 4.7 for cybersecurity use cases that would otherwise be blocked by the default safeguards.
Alignment Assessment
Anthropic's alignment evaluation concluded the model is largely well-aligned and trustworthy, though not fully ideal in its behavior. Opus 4.7 shows a similar overall safety profile to Opus 4.6, with improvements on honesty and resistance to prompt injection attacks, and a modest regression in one area: it can give overly detailed harm-reduction advice on controlled substances. Mythos Preview remains the best-aligned model Anthropic has trained, by Anthropic's own evaluations.
For enterprise use the practical implication is that Opus 4.7 is an acceptable drop-in for production workloads, with the standard caveats around domain-specific red-teaming. Full details are in the Claude Opus 4.7 System Card on Anthropic's site.
Conclusion
Claude Opus 4.7 is a meaningful step up for any team building production AI workflows. The coding gains are real and broad, confirmed across dozens of independent partner benchmarks rather than just Anthropic's own evaluations. The vision and memory improvements unlock use cases that were previously unreliable. And the new xhigh effort level plus task budgets give finer control over the cost-quality tradeoff than Opus 4.6 offered.
The migration requires care rather than being strictly drop-in. Stricter instruction following means prompts need auditing, the updated tokenizer shifts token spend math, and teams leaning on agentic search should compare against GPT-5.4 before fully switching. For coding, long-horizon agents, dashboard work, and document analysis, Opus 4.7 is the new default choice on the Claude family.
Ready to Put Opus 4.7 to Work?
Whether you're evaluating frontier models, migrating an existing Claude pipeline, or building new agentic workflows for clients, we can help you navigate model selection, prompt engineering, and production rollout.