Claude Opus 4.7: Anthropic's New Frontier Model Guide
Claude Opus 4.7 scores 64.3% on SWE-bench Pro with 2576px vision, xhigh effort, and same Opus 4.6 pricing. Full benchmark and migration guide.
- SWE-bench Pro: 64.3%
- SWE-bench Verified: 87.6%
- Max Image Edge: 2,576 px
- Input / Output: $5 / $25 per million tokens
Key Takeaways
On April 16, 2026, Anthropic made Claude Opus 4.7 generally available. It is the company's latest frontier model and the first Mythos-class release to ship with production safeguards, positioned as a direct upgrade to Opus 4.6 with substantial gains in advanced software engineering, vision, and long-horizon agentic work.
For agencies, platform teams, and anyone building on top of Claude, this release matters for three reasons. The coding benchmarks show a genuine step change rather than an incremental bump. The pricing holds flat at $5 input and $25 output per million tokens. And the model's stricter instruction following means existing prompts and harnesses need a careful second pass before rollout. This guide walks through what actually changed, what the numbers mean in production, and how to plan the migration.
Release snapshot: Opus 4.7 launched April 16, 2026 across the Claude apps, Claude Code, and the Claude API as claude-opus-4-7, plus Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Based on the official Anthropic announcement.
What Is Claude Opus 4.7
Claude Opus 4.7 is Anthropic's latest generally available frontier model. It sits below Claude Mythos Preview, Anthropic's most powerful but deliberately restricted model, in raw capability, and above Opus 4.6 on essentially every benchmark Anthropic reported. The Mythos restriction ties back to Project Glasswing, Anthropic's framework for managing cybersecurity risks from frontier AI, which established that Mythos-class models would be held back while safeguards were tested on less capable systems. Opus 4.7 is the first such model.
During training, Anthropic specifically experimented with reducing Opus 4.7's cyber capabilities relative to Mythos, and the release ships with safeguards that automatically detect and block prompts signaling prohibited or high-risk cybersecurity use. Legitimate security professionals, red teamers, and vulnerability researchers can apply to a new Cyber Verification Program to unlock the relevant capabilities.
- Stronger advanced software engineering, with the largest gains on the hardest tasks.
- Better long-running task handling with self-verification before reporting back.
- Substantially higher-resolution vision, with image edges up to 2,576 pixels.
- More tasteful and creative outputs on professional work like interfaces, slides, and docs.
- Better use of file system memory across long, multi-session work.
- A new xhigh effort level for fine-grained reasoning control.
Context Window and Output Limits
Opus 4.7 keeps the same envelope as Opus 4.6 for context size, output limits, and platform features. The model supports a full 1M-token context window at standard API pricing with no long-context premium, and a 128k-token maximum output ceiling per response. Adaptive thinking, tool use, file system memory, and the other platform capabilities shipped with Opus 4.6 are all available on day one. The model ID on the API is claude-opus-4-7.
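A minimal sketch of what a call against the new model ID looks like, using the Anthropic Python SDK (the prompt and output limit here are placeholders, not recommendations):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # model ID from the announcement
    max_tokens=64000,         # anything up to the 128k output ceiling
    messages=[
        {"role": "user", "content": "Summarize the trade-offs of moving this service to event sourcing."}
    ],
)
print(response.content[0].text)
```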
How It Fits the Claude Lineup
Opus remains Anthropic's most capable generally available tier, with Sonnet as the balanced workhorse and Haiku as the fast, cost-efficient option. Mythos Preview sits above Opus 4.7 but under controlled access. For agency workflows that lean heavily on reasoning depth, tool use, and long-horizon autonomy, Opus 4.7 is the new default choice.
Benchmark Results Breakdown
Anthropic published a comparison table covering Opus 4.7 against Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Mythos Preview across twelve benchmark categories. A few themes emerge from the numbers.
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | Mythos Preview |
|---|---|---|---|---|---|
| SWE-bench Pro | 64.3% | 53.4% | 57.7% | 54.2% | 77.8% |
| SWE-bench Verified | 87.6% | 80.8% | — | 80.6% | 93.9% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | 75.1% | 68.5% | 82.0% |
| Humanity's Last Exam (tools) | 54.7% | 53.3% | 58.7% | 51.4% | 64.7% |
| BrowseComp | 79.3% | 83.7% | 89.3% | 85.9% | 86.9% |
| MCP-Atlas | 77.3% | 75.8% | 68.1% | 73.9% | — |
| OSWorld-Verified | 78.0% | 72.7% | 75.0% | — | 79.6% |
| Finance Agent v1.1 | 64.4% | 60.1% | 61.5% | 59.7% | — |
| CyberGym | 73.1% | 73.8% | 66.3% | — | 83.1% |
| GPQA Diamond | 94.2% | 91.3% | 94.4% | 94.3% | 94.6% |
| CharXiv Reasoning (tools) | 91.0% | 84.7% | — | — | 93.2% |
| MMMLU | 91.5% | 91.1% | — | 92.6% | — |
The standout gains sit squarely in agentic coding and long-horizon work. SWE-bench Pro jumps almost eleven points against Opus 4.6 and beats GPT-5.4 by more than six. SWE-bench Verified clears 87.6%, and Opus 4.7 holds the top spot on MCP-Atlas for scaled tool use and on Finance Agent v1.1.
The weak spots are worth noting too. BrowseComp dropped from 83.7% on Opus 4.6 to 79.3% on Opus 4.7, with GPT-5.4 at 89.3% holding clear leadership for agentic search. Terminal-Bench 2.0 at 69.4% trails GPT-5.4's self-reported 75.1%. And on multilingual Q&A (MMMLU), Gemini 3.1 Pro keeps a narrow edge. For most coding-heavy agency work these tradeoffs will be acceptable, but teams running production web research pipelines should test both Opus 4.7 and GPT-5.4 before switching.
For a deeper head-to-head on the GPT-5.4 matchup, see our Claude Opus 4.7 vs GPT-5.4 agentic coding comparison.
Need help deciding which model fits your stack? Model selection rarely comes down to a single benchmark. Explore our AI Digital Transformation service to map models to your actual workloads.
The Coding Leap Explained
The benchmark table only tells part of the coding story. The more interesting signal comes from early-access partner evaluations across production workloads, which consistently describe Opus 4.7 as a step change rather than a routine upgrade.
Cursor's internal CursorBench score jumped from 58% on Opus 4.6 to over 70% on Opus 4.7, described by Cursor co-founder Michael Truell as a meaningful jump in capabilities and more creative reasoning.
Another early-access partner reports resolution rates lifted 13% over Opus 4.6, including four tasks neither Opus 4.6 nor Sonnet 4.6 could solve, with faster median latency and stricter instruction following.
Rakuten teams report Opus 4.7 resolves 3x more production tasks than Opus 4.6, with double-digit gains in code quality and test quality on real engineering work.
Hex reports low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6, and that the model correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks.
Long-Horizon Autonomy
Partner reports keep circling back to the same theme: Opus 4.7 sustains coherent work over much longer runs than Opus 4.6. Devin's team describes it as working coherently for hours, pushing through hard problems rather than giving up. Genspark highlights loop resistance as the most critical production metric, noting that a model looping indefinitely on 1 in 18 queries wastes compute and blocks users. And Notion Agent's team reports a 14% improvement on complex multi-step workflows at fewer tokens and a third of the tool errors.
Self-Verification Before Reporting
A behavioral shift worth highlighting: Opus 4.7 devises ways to verify its own outputs before reporting back. Vercel's Joe Haddad noted the model even does proofs on systems code before starting work, which is new behavior compared to earlier Claude models. For agency workloads where clients push back on confident-but-wrong AI output, this is the single most valuable behavior change in the release.
Vision and Multimodal Gains
Opus 4.7 now accepts images up to 2,576 pixels on the long edge, roughly 3.75 megapixels, more than three times the pixel count of prior Claude models. Crucially, this is a model-level change rather than an API parameter, so images sent to Claude are automatically processed at higher fidelity. Users who do not need the extra detail can downsample before sending to manage token cost.
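If you do want to downsample, a minimal sketch with Pillow (the target edge length here is an arbitrary example, not an Anthropic-recommended value):

```python
from PIL import Image

def downsample_long_edge(path: str, out_path: str, max_edge: int = 1568) -> None:
    """Shrink an image so its long edge is at most max_edge pixels, keeping aspect ratio."""
    img = Image.open(path)
    scale = max_edge / max(img.size)
    if scale < 1:  # only shrink, never upscale
        img = img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.LANCZOS,
        )
    img.save(out_path)
```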
XBOW, a company running autonomous penetration testing with heavy computer-use workloads, reports Opus 4.7 scoring 98.5% on their visual-acuity benchmark versus 54.5% for Opus 4.6. CEO Oege de Moor described it as a step change where their single biggest Opus pain point effectively disappeared, unlocking computer-use workflows they could not previously run.
Low-Level Perception and Coordinate Mapping
Beyond raw resolution, Opus 4.7 improves on low-level perception tasks like pointing, measuring, and counting, and on natural-image bounding-box localization and detection. Coordinates returned by the model now map 1:1 to actual image pixels, so operations that involve mapping coordinates back onto an image no longer require scale-factor math. For computer-use agents and UI automation workloads, that removes a common source of brittle glue code.
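As a sketch of what that simplification looks like in practice, assuming a hypothetical bounding box returned by the model in pixel coordinates:

```python
from PIL import Image

# Hypothetical (left, top, right, bottom) box the model returned for a UI element.
bbox = (1412, 388, 1630, 442)

screenshot = Image.open("dashboard.png")
# Coordinates map 1:1 to the pixels of the image that was sent,
# so no scale-factor correction is needed before cropping.
element = screenshot.crop(bbox)
element.save("element.png")
```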
Practical Use Cases Unlocked
- Computer-use agents reading dense application screenshots, CRM dashboards, and complex SaaS interfaces without losing small UI elements.
- Data extraction from diagrams including engineering schematics, technical drawings, and chemical structures, which Solve Intelligence flagged as newly viable for life sciences patent workflows.
- Dashboard and data interface work, with v0/Vercel's team describing Opus 4.7 as the best model in the world for building dashboards and data-rich interfaces.
- Document analysis including tables, annotated PDFs, and marked-up legal documents where fine visual detail changes meaning.
- .docx redlining and .pptx editing, where Opus 4.7 produces and self-verifies tracked changes and slide layouts. If your prompts include mitigations like "double-check the slide layout before returning," Anthropic recommends removing that scaffolding and re-baselining.
- Chart and figure analysis via programmatic tool-calling with image libraries like PIL, including pixel-level data transcription from rendered charts.
Effort Control, Pricing, and Tokenizer
Three changes matter for anyone planning capacity and cost against Opus 4.7: a new effort level, an updated tokenizer, and task budgets entering public beta.
The New xhigh Effort Level
Opus 4.7 introduces an xhigh tier that sits between the existing high and max settings, giving developers a finer-grained tradeoff between reasoning depth and latency on hard problems. In Claude Code the default has been raised to xhigh across all plans, and Anthropic recommends starting with high or xhigh for coding and agentic use cases. For latency-sensitive interactive work, stepping down from max to xhigh can recover meaningful response time without a meaningful quality drop.
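On the API, effort is set per request. A minimal sketch, assuming the output_config shape used in the task-budget example later in this guide and the literal string "xhigh" as the tier value:

```python
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=64000,
    output_config={"effort": "xhigh"},  # step down to "high" or "medium" for latency-sensitive work
    messages=[
        {"role": "user", "content": "Diagnose and fix the failing integration test described below."}
    ],
)
```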
Pricing Holds Flat at $5 / $25
Opus 4.7 lands at the same price as Opus 4.6: $5 per million input tokens and $25 per million output tokens across the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Holding price flat on a genuine capability upgrade is the main commercial headline for agencies running existing Claude pipelines.
The Tokenizer Change to Watch
Tokenizer shift: Opus 4.7 uses an updated tokenizer. The same input text can map to roughly 1.0 to 1.35x as many tokens depending on content type, which directly affects both input cost and context window usage. Opus 4.7 also thinks more at higher effort levels, producing more output tokens on hard problems.
Anthropic's internal testing shows the net effect is favorable on an internal coding evaluation, with token usage improved across all effort levels. But those internal evaluations run autonomously from a single prompt, so the results may not map cleanly to interactive coding or client workloads. The practical guidance is to measure real token spend on representative traffic before rolling out across production pipelines. Task budgets, now in public beta on the Claude Platform, give a way to cap per-task token spend so long runs stay bounded.
Real Agency Applications
For agencies running AI-assisted delivery, here are the areas where Opus 4.7's specific improvements translate into client-facing value:
Long-Running Engineering Tasks
Before: Complex refactors, migrations, and multi-file builds required heavy human supervision because the model would lose context or give up on hard sub-problems.
After: Opus 4.7 sustains coherent work for hours on the same task, with Devin and Factory both reporting the model carries work all the way through rather than stopping halfway.
Impact: Agency engineers can run tasks in parallel rather than 1:1, supervising multiple agents on long client projects.
Dashboard and Interface Builds
Before: Client dashboard projects required heavy design iteration because AI-generated interfaces often lacked taste on spacing, hierarchy, and color.
After: Vercel's v0 team calls Opus 4.7 the best model in the world for building dashboards and data-rich interfaces, with design choices they would actually ship.
Impact: Faster first-draft dashboards for SaaS and analytics clients, fewer revision cycles on visual hierarchy.
Code Review and QA Workflows
Before: Automated code review missed subtle bugs, produced noisy false-positive comments, and struggled with race conditions or concurrency issues.
After: CodeRabbit reports recall improved over 10% while precision held steady, and Warp notes Opus 4.7 cracked a concurrency bug Opus 4.6 could not. The new /ultrareview Claude Code command produces a dedicated reviewer session.
Impact: Higher-quality PR reviews delivered to clients, with the most-difficult-to-detect bugs surfaced automatically.
Document and Diagram Analysis
Before: Client document analysis was limited by image resolution, forcing manual extraction of tables and figures from technical PDFs.
After: The 2,576-pixel image support and 21% fewer document-reasoning errors (per Databricks OfficeQA Pro) enable pulling data from complex diagrams, annotated PDFs, and dense dashboards.
Impact: New service lines for agencies serving life sciences, legal, and financial clients with document-heavy workflows.
Agent Team Orchestration
Before: Multi-agent workflows suffered from loop conditions, poor tool-call planning, and agents drifting from assigned roles on longer jobs.
After: Genspark reports the highest quality-per-tool-call ratio they have measured, Hebbia sees double-digit gains in accuracy of tool calls and planning, and Ramp reports stronger role fidelity and coordination.
Impact: More reliable production agent systems for clients, supporting use cases that were previously too flaky to ship.
API Changes and Behavior Shifts
Claude Managed Agents absorb these API shifts automatically, but teams calling the Messages API directly need to plan for a few breaking changes and several behavior changes before switching traffic to Opus 4.7.
Messages API Breaking Changes
- Extended thinking budgets are removed. Passing `thinking: { type: "enabled", budget_tokens: N }` now returns a 400 error. Adaptive thinking is the only thinking-on mode, and Anthropic's internal evaluations show it reliably outperforms the old extended-thinking mode.
- Adaptive thinking is off by default. Requests without an explicit `thinking` field run with thinking off. Set `thinking: { type: "adaptive" }` to opt back in.
- Sampling parameters are removed. Setting `temperature`, `top_p`, or `top_k` to any non-default value returns a 400 error. The safest migration path is to drop those parameters entirely and steer behavior through prompting.
- Thinking content is omitted by default. Thinking blocks still appear in the response stream but the `thinking` field is empty unless callers opt in with `display: "summarized"`. Products that stream reasoning to users will otherwise show a long pause before output.
- Updated tokenizer. The new tokenizer may use roughly 1x to 1.35x as many tokens on the same input (up to about 35% more, varying by content). Anthropic recommends raising `max_tokens` for additional headroom and reviewing compaction triggers.
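Taken together, a request updated for these changes might look like the following sketch: adaptive thinking opted back in, no sampling overrides, and a raised output ceiling for tokenizer headroom. The prompt is a placeholder, and the exact placement of the `display: "summarized"` opt-in is left to Anthropic's API reference.

```python
# Migrated request shape: no budget_tokens, no temperature/top_p/top_k.
# To surface reasoning to users, also opt in with display: "summarized" as described above.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=64000,                # raised for tokenizer headroom
    thinking={"type": "adaptive"},   # the old budget_tokens form now returns a 400
    messages=[
        {"role": "user", "content": "Refactor the auth module to remove the legacy session store."}
    ],
)
```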
Task Budgets (Public Beta)
Task budgets are a new advisory cap across a full agentic loop, covering thinking, tool calls, tool results, and final output. The model sees a running countdown and uses it to scope work and finish the task gracefully as the budget is consumed. This is distinct from max_tokens, which is a hard per-request ceiling that the model never sees.
```python
response = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=128000,
    output_config={
        "effort": "high",
        "task_budget": {"type": "tokens", "total": 128000},
    },
    messages=[
        {"role": "user", "content": "Review the codebase and propose a refactor plan."}
    ],
    betas=["task-budgets-2026-03-13"],
)
```

The minimum task budget is 20k tokens. Skip task budgets for open-ended work where quality matters more than speed; reserve them for workloads where scoping the run to a token allowance is part of the product. Too-restrictive budgets can cause Opus 4.7 to complete tasks less thoroughly or refuse outright.
Behavior Shifts That Affect Prompts
- More literal instruction following, especially at lower effort levels. The model will not silently generalize from one example to another or infer requests you did not make.
- Response length calibrates to perceived complexity rather than defaulting to a fixed verbosity, so hardcoded "keep it under 3 paragraphs" scaffolding may become redundant.
- Fewer tool calls by default, with the model leaning more on reasoning. Raising effort increases tool usage when that is what you want.
- More direct, opinionated tone with less validation-forward phrasing and fewer emoji than Opus 4.6's warmer style. Client-facing chat prompts tuned to Opus 4.6's tone may need refreshing.
- More regular progress updates during long agentic traces. If you have added scaffolding to force interim status messages, try removing it.
- Fewer subagents spawned by default; steerable through prompting when a fan-out is needed.
- Real-time cybersecurity safeguards that may refuse prohibited or high-risk requests. Legitimate security work should apply to the Cyber Verification Program.
Migrating from Opus 4.6
Anthropic describes Opus 4.7 as a direct upgrade to Opus 4.6, but the migration is not strictly drop-in. Three practical areas deserve attention before flipping traffic.
1. Re-Tune Prompts for Literal Instruction Following
Opus 4.7 takes instructions literally rather than loosely. Where Opus 4.6 might have softened or skipped parts of a prompt, Opus 4.7 follows them as written. Prompts that relied on the model filling in gaps, or that contained contradictory instructions the old model silently resolved, will produce unexpected results. Audit system prompts for leftover instructions, clarify ambiguity, and drop redundant guardrails that were compensating for Opus 4.6's looseness.
2. Measure Token Spend on Real Traffic
The tokenizer change plus deeper thinking at higher effort levels shift the token math. On an internal agentic coding benchmark Anthropic shows token usage improved across all effort levels, but the right answer for any given workload is to measure. Run a representative sample of production traffic through Opus 4.7 at the effort level you plan to deploy, compare input-plus-output token totals against Opus 4.6, and adjust task budgets accordingly.
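A minimal sketch of that measurement loop, reading the `usage` fields the Messages API returns. The Opus 4.6 model ID and the output_config placement are assumptions here, and the prompt list stands in for your own traffic sample:

```python
def measure_spend(client, model: str, prompts: list[str], effort: str = "high") -> dict:
    """Run representative prompts through one model and total input/output tokens."""
    totals = {"input": 0, "output": 0}
    for prompt in prompts:
        resp = client.messages.create(
            model=model,
            max_tokens=32000,
            output_config={"effort": effort},  # assumes the shape shown in the task-budget example
            messages=[{"role": "user", "content": prompt}],
        )
        totals["input"] += resp.usage.input_tokens
        totals["output"] += resp.usage.output_tokens
    return totals

# Compare the same traffic sample across both models before switching;
# model IDs other than claude-opus-4-7 are illustrative.
# old = measure_spend(client, "claude-opus-4-6", sample_prompts)
# new = measure_spend(client, "claude-opus-4-7", sample_prompts, effort="xhigh")
```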
3. Start with High or xhigh Effort
Anthropic explicitly recommends starting with high or xhigh effort for coding and agentic use cases, and in Claude Code xhigh is the new default across all plans. For consumer-facing interactive workloads, test medium first to hold latency down, since Hex's observation that low-effort Opus 4.7 matches medium-effort Opus 4.6 applies here.
Further reading: Anthropic's full Opus 4.7 migration guide covers effort-level tuning, task budgets, and prompt adjustments in detail.
Safeguards and Alignment
Opus 4.7 is the first model released under the Project Glasswing framework, where Mythos-class capability improvements roll out with production safeguards on less capable models before broader Mythos release. For Opus 4.7 specifically, that means two deliberate decisions.
First, Anthropic experimented during training with reducing the model's cyber capabilities relative to Mythos. Second, the released model ships with automated safeguards that detect and block requests indicating prohibited or high-risk cybersecurity uses. On the CyberGym benchmark Opus 4.7 scores 73.1%, modestly below Opus 4.6's 73.8% and well below Mythos Preview's 83.1%.
Security professionals conducting legitimate vulnerability research, penetration testing, and red-teaming can apply to Anthropic's new Cyber Verification Program to access Opus 4.7 for cybersecurity use cases that would otherwise be blocked by the default safeguards.
Alignment Assessment
Anthropic's alignment evaluation concluded the model is largely well-aligned and trustworthy, though not fully ideal in its behavior. Opus 4.7 shows a similar overall safety profile to Opus 4.6, with improvements on honesty and resistance to prompt injection attacks, and a modest regression in one area: it can give overly detailed harm-reduction advice on controlled substances. Mythos Preview remains the best-aligned model Anthropic has trained, by Anthropic's own evaluations.
For enterprise use the practical implication is that Opus 4.7 is an acceptable drop-in for production workloads, with the standard caveats around domain-specific red-teaming. Full details are in the Claude Opus 4.7 System Card on Anthropic's site.
Conclusion
Claude Opus 4.7 is a meaningful step up for any team building production AI workflows. The coding gains are real and broad, confirmed across dozens of independent partner benchmarks rather than just Anthropic's own evaluations. The vision and memory improvements unlock use cases that were previously unreliable. And the new xhigh effort level plus task budgets give finer control over the cost-quality tradeoff than Opus 4.6 offered.
The migration requires care rather than being strictly drop-in. Stricter instruction following means prompts need auditing, the updated tokenizer shifts token spend math, and teams leaning on agentic search should compare against GPT-5.4 before fully switching. For coding, long-horizon agents, dashboard work, and document analysis, Opus 4.7 is the new default choice on the Claude family.
Ready to Put Opus 4.7 to Work?
Whether you're evaluating frontier models, migrating an existing Claude pipeline, or building new agentic workflows for clients, we can help you navigate model selection, prompt engineering, and production rollout.