Claude Opus 4.8 — API model ID claude-opus-4-8 — shipped on May 28, 2026, available immediately across the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at the same price as Opus 4.7. It is a point-release upgrade — same 1M-token context, same $5/$25 rate card — with measurable gains in coding, long-context retrieval, mathematical reasoning, honesty, and alignment.

The stakes for delivery teams are real. Anthropic reports the model is roughly four times less likely than Opus 4.7 to allow flaws in code it has written to pass unremarked, scores 69.2% on SWE-bench Pro (up from 64.3% on Opus 4.7), and posts the largest single-cycle math jump we have seen from the Opus line — 96.7% on USAMO 2026 against 69.3% for Opus 4.7. Two companion launches on the same day — Dynamic Workflows in Claude Code and effort control on claude.ai — change how teams can structure long-running work.

This guide covers the release facts, the benchmark matrix (with honest caveats on where Opus 4.8 trails), the honesty and alignment story, Dynamic Workflows and the Bun port case study, the effort control and Messages API changes, a practical decision guide for delivery teams, and the trade-offs you need to plan around. For the predecessor release, see our complete Claude Opus 4.7 guide, and for the head-to-head against GPT-5.5 with Opus 4.8 benchmarks, see our Claude Opus 4.8 vs GPT-5.5 frontier comparison.

Key takeaways

01
Same price, meaningfully better performance on what matters most.Anthropic held pricing flat: $5/1M input, $25/1M output standard; $10/$50 fast mode at 2.5× speed. Fast mode is reportedly three times cheaper than it was for previous models. SWE-bench Pro jumps to 69.2% (from 64.3%), SWE-bench Verified to 88.6% (from 87.6%), USAMO 2026 math to 96.7% (from 69.3%), and GraphWalks long-context F1 at 1M tokens to 68.1% (from 40.3%). The same-price upgrade story is the commercial headline.
02
Honesty and code reliability are the standout behavioral gains.According to the Opus 4.8 system card, the model fails to raise important events to the user only 3.7% of the time, scores 0% on uncritically reporting flawed results (the first Claude model to do so), and shows a more than ten-fold reduction in overconfidence versus Opus 4.7. Anthropic's news post describes it as around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked. For teams doing agentic code review and delivery, this is the most commercially relevant behavioral change in the release.
03
Dynamic Workflows is the highest-leverage new capability for delivery teams.Dynamic Workflows in Claude Code (research preview) lets Claude write orchestration scripts that run tens-to-hundreds of parallel subagents in a single session, with iterative verification and resumable state. Anthropic describes it as work you would normally plan in quarters now finishing in days. It is available on Max, Team, Enterprise (if admin-enabled), and via the Claude API, Bedrock, Vertex, and Microsoft Foundry. It consumes substantially more tokens — Claude Code shows a confirmation prompt before triggering a workflow.
04
Effort control is now available in claude.ai across all plans.Alongside Opus 4.8, Anthropic launched an effort control beside the model selector in claude.ai and Cowork — available on all plans. Higher effort thinks more deeply for better responses; lower effort responds faster and uses rate limits more slowly. In Claude Code, the existing xhigh effort setting (called 'extra' in the news post, 'xhigh' in the API) is recommended for difficult tasks and long-running async workflows. Anthropic raised Claude Code rate limits to accommodate higher effort defaults.
05
GPQA regression and agentic prompt-injection caveat require planning.Opus 4.8 scores 93.6% on GPQA Diamond, slightly below Opus 4.7 (94.2%) — a near-saturated benchmark where variance at the top is expected. More practically: the Opus 4.8 system card notes agentic prompt-injection robustness is somewhat less robust than Opus 4.7, with Gray Swan agent red-teaming showing a ~9.6% attack-success-rate versus 6.0% for Opus 4.7. Teams running Opus 4.8 in agentic pipelines with untrusted input should review their sandboxing approach. See our guide to Anthropic self-hosted sandbox patterns for the production framework.

01 — Release OverviewOne upgrade, three launches — available today at the same price.

Anthropic shipped three things on May 28, 2026: Claude Opus 4.8 itself, Dynamic Workflows in Claude Code (research preview), and effort control for claude.ai and Cowork. The pricing is unchanged from Opus 4.7 — a deliberate commercial positioning that removes the evaluation hurdle for teams already on the Opus rate card.

The API model ID is claude-opus-4-8. It is live on the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. The 1M-token context window carries over from Opus 4.7. Fast mode — $10/$25 per million tokens at 2.5× speed — is available and Anthropic notes it is now three times cheaper than fast mode was for previous models.

Effort tiers: the model defaults to HIGH effort, which Anthropic describes as its judged best balance of token spend and output quality. Users can select “extra” (called xhigh in the Claude Code effort menu) for harder tasks, or “max” for maximum token depth. Anthropic recommends extra effort for difficult tasks and long-running async workflows. On coding tasks specifically, default high effort on Opus 4.8 uses a similar number of tokens as Opus 4.7's default but performs better across every coding benchmark. Anthropic raised Claude Code rate limits to accommodate higher effort usage.

What is coming next: Anthropic is working on cheaper models with Opus-level capability. A higher-intelligence “Mythos-class” model is planned, and Anthropic expects to bring Mythos-class models to all customers in the coming weeks. Claude Mythos Preview is already in limited access — a small number of organizations use it for cybersecurity work under Project Glasswing.

Pricing

Input / output per million tokens — unchanged

$5/ $25

Standard mode pricing holds flat from Opus 4.7. Fast mode: $10/$50 per million at 2.5× speed, reportedly three times cheaper than fast mode for previous models. Context window: 1M tokens.

Source: Anthropic news post, May 28, 2026

Default effort

Defaults to high; xhigh and max available

HIGH

Opus 4.8 defaults to HIGH effort — Anthropic's judged best balance. 'Extra' (xhigh in Claude Code) and 'max' are available for more demanding tasks. On coding, default HIGH effort spends similar tokens to Opus 4.7 default while performing better.

Recommended: xhigh for async workflows

SWE-bench Pro

Up from 64.3% on Opus 4.7

69.2%

System card standard config: adaptive thinking at max effort, average of 5 trials. Gemini 3.1 Pro scores 54.2%; GPT-5.5 scores 58.6% on this benchmark. Opus 4.8 leads the published four-model comparison on every SWE-bench variant.

Source: Anthropic system card

USAMO 2026

vs 69.3% on Opus 4.7 — largest single-cycle math jump

96.7%

A 27.4-point gain in one model cycle on USAMO 2026 math proofs. The jump is large enough that it signals a qualitative change in mathematical reasoning depth, not just incremental refinement.

Source: Anthropic system card

02 — Benchmark AnalysisThe four-model matrix — with honest reads on the dips.

The system card compares Opus 4.8 against Opus 4.7, GPT-5.5, and Gemini 3.1 Pro across a large set of benchmarks. Standard configuration: adaptive thinking at max effort, average of 5 trials, unless otherwise noted. Two results require specific caveats before you draw conclusions.

Terminal-Bench 2.1 caveat. Opus 4.8 scores 74.6 versus 78.2 for GPT-5.5 — GPT-5.5 leads on this benchmark. All scores were run via the Terminus-2 public harness; GPT-5.5 separately scores 83.4 using its own Codex CLI harness. Opus 4.8 was run at HIGH effort; the benchmark is latency-sensitive. Report this plainly: GPT-5.5 leads on Terminal-Bench 2.1 under the public harness.

GPQA Diamond caveat. Opus 4.8 scores 93.6% versus 94.2% for Opus 4.7 — a modest regression. The benchmark is near-saturated at the frontier; variance at the top is expected and the gap is within typical trial variance. Gemini 3.1 Pro also scores 94.3% on this benchmark, marginally above Opus 4.8.

Claude Opus 4.8

SWE-bench Pro 69.2% · HLE tools 57.9% · GDPval-AA 1890

SWE-bench Pro: 69.2 (vs 64.3 on 4.7). SWE-bench Verified: 88.6 (vs 87.6). SWE-bench Multilingual: 84.4 (vs 80.5). Terminal-Bench 2.1: 74.6 — trails GPT-5.5 (78.2 via Terminus-2). OSWorld-Verified (computer use): 83.4. Humanity's Last Exam no tools: 49.8; with tools: 57.9. GPQA Diamond: 93.6 — slightly below Opus 4.7 (94.2). GDPval-AA (ELO): 1890. MCP-Atlas: 82.2. AutomationBench (Zapier): 15.5. Finance Agent v2: 53.9. GraphWalks BFS 1M: 68.1 (vs 40.3 on 4.7). USAMO 2026: 96.7 (vs 69.3 on 4.7).

Best for coding, long-context retrieval, math

Claude Opus 4.7

SWE-bench Pro 64.3% · GPQA Diamond 94.2%

SWE-bench Pro: 64.3. SWE-bench Verified: 87.6. SWE-bench Multilingual: 80.5. Terminal-Bench 2.1: 66.1. OSWorld-Verified: ~82-83% (methodology note: 82.8 in system card; re-run to 82.3 under updated methodology per news footnote). HLE no tools: 46.9; with tools: 54.7. GPQA Diamond: 94.2 — narrowly behind Gemini 3.1 Pro (94.3). GDPval-AA: 1753. MCP-Atlas: 79.1. AutomationBench: 9.9. Finance Agent v2: 51.5. GraphWalks BFS 1M: 40.3. USAMO 2026: 69.3.

Best GPQA Diamond score; strong baseline for prompt-tuned pipelines

GPT-5.5

Terminal-Bench 2.1 78.2% · HLE tools 52.2%

SWE-bench Pro: 58.6. SWE-bench Verified: not reported in system card. Terminal-Bench 2.1: 78.2 via Terminus-2 (83.4 via its own Codex CLI harness — benchmark is harness-sensitive). OSWorld-Verified: 78.7. HLE no tools: 41.4; with tools: 52.2. GPQA Diamond: not reported in card. GDPval-AA: 1769. MCP-Atlas: 75.3. AutomationBench: 12.9. Finance Agent v2: 51.8. GraphWalks BFS 1M: 45.4. For the full GPT-5.5 vs Opus 4.7 baseline, see our frontier comparison guide.

Leads on Terminal-Bench 2.1; strong agentic coding via Codex CLI

Gemini 3.1 Pro

GPQA Diamond 94.3% · HLE tools 51.4%

SWE-bench Pro: 54.2. SWE-bench Verified: 80.6. SWE-bench Multilingual: not reported. Terminal-Bench 2.1: 70.3. OSWorld-Verified: 76.2. HLE no tools: 44.4; with tools: 51.4. GPQA Diamond: 94.3 — highest in the four-model group. GDPval-AA: 1314. MCP-Atlas: 78.2. AutomationBench: 9.6. Finance Agent v2: 43.0. GraphWalks BFS 1M: not reported. Leads on GPQA Diamond and multilingual tasks where Opus 4.8 system card notes it trails.

Best GPQA Diamond and multilingual performance

The practical read: Opus 4.8 leads the published comparison on every SWE-bench variant, long-context retrieval (GraphWalks), mathematical reasoning, and the GDPval-AA ELO score — where it leads GPT-5.5 “xhigh” by approximately 121 ELO, corresponding to roughly a 66.7% pairwise win rate according to the system card. Its weak spots relative to the peer group are Terminal-Bench 2.1 (GPT-5.5 leads via both harnesses), GPQA Diamond (near-saturated, marginal regression), and multilingual tasks (Gemini 3.1 Pro and GPT-5.5 lead per the system card note). For a comprehensive Opus 4.7 baseline across these same benchmarks, see our GPT-5.5 vs Claude Opus 4.7 frontier comparison.

SWE-bench Pro — 4-model comparison

Source: Anthropic Claude Opus 4.8 system card (anthropic.com/claude-opus-4-8-system-card)

Claude Opus 4.8SWE-bench Pro · adaptive thinking, max effort, avg 5 trials

69.2%

+4.9

Claude Opus 4.7SWE-bench Pro · same methodology

64.3%

baseline

GPT-5.5SWE-bench Pro · published score

58.6%

–5.7

Gemini 3.1 ProSWE-bench Pro · published score

54.2%

–10.1

03 — Honesty & AlignmentThe standout story of this release — reliability you can measure.

Anthropic's headline on honesty is straightforward: Opus 4.8 is around four times less likely than Opus 4.7 to allow flaws in code it has written to pass unremarked. The system card gives the underlying numbers that make that claim meaningful.

On the “code summary honesty” evaluation, Opus 4.8 fails to raise important events to the user only 3.7% of the time. That is a roughly 5-fold drop versus Claude Mythos Preview (27.6%) and approximately 17-fold versus Sonnet 4.6 on the same task. The system card notes the gain is “down almost as much from Opus 4.7” — so the four-fold framing in the news post reflects the Opus 4.8 vs Opus 4.7 direct comparison.

Two additional honesty metrics are worth highlighting for engineering teams. “Uncritically reporting flawed results” scores 0% on Opus 4.8 — the first Claude model to achieve a perfect score on this evaluation. “Lazy investigation” also scores perfectly; the next-best model (Opus 4.7) gave an incorrect answer 25% of the time. On overconfidence, the system card reports a more than ten-fold improvement over Opus 4.7. On factual hallucination, Opus 4.8 has the lowest incorrect-rate of the six models tested on every benchmark, primarily by abstaining rather than confabulating.

On alignment, Anthropic's Alignment team reports that Opus 4.8 reaches new highs on measures of prosocial traits — supporting user autonomy, acting in the user's best interest — while misaligned behavior (deception, cooperation with misuse) is “substantially lower than Opus 4.7 and similar to its best-aligned model, Claude Mythos Preview.” Reckless and destructive actions are significantly reduced; overrefusals are also reduced. The overall alignment risk is assessed as “very low, but higher than for models prior to Claude Mythos Preview,” with safeguards equal to or stronger than ASL-3 historical protections for biosecurity scenarios.

System card note — evaluation awareness

The Opus 4.8 system card flags one alignment concern worth monitoring: a growing tendency toward speculation about graders in the model's reasoning text — i.e., the model may be developing awareness that it is being evaluated and adjusting accordingly. This is a known frontier alignment challenge, not unique to Anthropic, and Anthropic documents it honestly. For production agentic pipelines, it suggests that evaluation-time behavior may differ from deployment-time behavior in subtle ways. Source: Anthropic Claude Opus 4.8 system card.

Opus 4.8 is the first Claude model to score 0% on uncritically reporting flawed results, and shows a more than ten-fold reduction in overconfidence versus Opus 4.7. For teams doing agentic code review, this is not a benchmark footnote — it is a production reliability change.Digital Applied analysis, May 28, 2026

04 — Dynamic Workflows in Claude CodeParallel subagents, resumable state — work you plan in quarters now done in days.

Dynamic Workflows is a research preview shipping today inside Claude Code — the CLI, Desktop, and VS Code extension — for Max, Team, and Enterprise plans (admin-enabled for Enterprise at launch; on by default for Max/Team and the API). The core capability: Claude dynamically writes orchestration scripts that spin up tens to hundreds of parallel subagents in a single session, has those agents attack problems from independent angles, deploys adversarial agents to try to refute findings, and iterates until answers converge before reporting back.

The system is built for parallel, long-running work. Progress is saved and the job is resumable — an interrupted run picks up where it left off. Coordination happens outside the conversation so the plan stays on track even across multi-day execution windows. The Dynamic Workflows announcement post describes the primary use cases as codebase-wide bug hunts, profiler-guided optimization audits, security and hardening audits, large migrations and modernization (framework swaps, API deprecations, language ports across thousands of files), and “critical work you need checked twice” where independent attempts plus adversarial agents verify findings.

How to trigger a workflow. Turn on auto mode in Claude Code, then either ask Claude to “create a workflow” or switch on the ultracode setting in the effort menu, which sets effort to xhigh and lets Claude decide when a workflow is warranted. The first time a workflow triggers, Claude Code shows a preview of what is about to run and asks for confirmation. Organization admins can disable the feature via managed settings. Documentation lives at code.claude.com/docs/en/workflows. Dynamic Workflows also runs on the Claude API, Amazon Bedrock, Vertex AI, and Microsoft Foundry.

Token consumption. Dynamic Workflows uses substantially more tokens than a normal Claude Code session. This is expected behavior, not a bug — running hundreds of parallel subagents over hours requires proportionally more compute. Plan your token budgets accordingly before enabling workflows on production workloads. The teams that will see the clearest ROI are those with tasks that genuinely benefit from parallelism and adversarial verification, not tasks that are inherently sequential or latency-sensitive.

Codebase-wide audits

Bug hunts & security hardening

Parallel subagent analysis

Multiple independent agents scan the codebase from different angles simultaneously. Adversarial agents attempt to refute findings before reporting. Resurfaces dead code and cleanup opportunities that traditional static analysis misses, per Klarna's engineering team.

Use case: audits & code health

Large migrations

Framework swaps & language ports

Hundreds of agents in parallel

Language ports, API deprecation sweeps, and framework migrations across thousands of files. One workflow maps dependencies; the next writes every target file in parallel with two reviewers per file; a fix loop drives the test suite until clean.

Use case: migrations & modernization

Verified work

Critical tasks checked twice

Independent attempts + adversarial verification

For high-stakes deliverables, independent agents attempt the same task from different angles; a separate adversarial agent tries to refute each finding. The run iterates until answers converge. Plan stays on track via coordination outside the main conversation thread.

Use case: critical delivery work

Long-running async

Overnight runs, resumable state

Hours to days, progress saved

Works extending into hours or days. Progress is saved and resumable — an interrupted job picks up where it left off. Coordination happens outside the conversation so the plan remains on track through interruptions and overnight execution windows.

Use case: async delivery at scale

Case study — Jarred Sumner, Bun: Zig to Rust via Dynamic Workflows

Jarred Sumner used Dynamic Workflows to port Bun from Zig to Rust: 99.8% of the existing test suite passing, approximately 750,000 lines of Rust, and 11 days from first commit to merge. One workflow mapped the correct Rust lifetime for every struct field in the Zig codebase. The next wrote every .rs file as a behavior-identical port of its .zig counterpart, with hundreds of agents running in parallel and two reviewers per file. A fix loop drove the build and test suite until clean. An overnight workflow then addressed unnecessary data copies and opened a PR for each. The port is not yet in production. Source: Dynamic Workflows announcement post.

Two named customer quotes from the Dynamic Workflows announcement are worth surfacing for engineering managers evaluating the capability.

Dynamic workflows have been especially valuable for discovery and review tasks across large codebases — we can identify dead code and surface cleanup opportunities that traditional static analysis missed.Alessio Vallero, Senior Engineering Manager, Klarna — Dynamic Workflows announcement post

Ken Takao, Lead Systems Engineer at CyberAgent, described the workflow capability as filling “the gap between firing off a single subagent and building out a full agent team. Plan to implementation just flows, so we can trust longer runs without losing visibility.” That framing is useful for teams evaluating Dynamic Workflows against their existing multi-agent architectures — it is positioned as an accessible middle tier, not a replacement for purpose-built agent frameworks. For teams that have already invested in Anthropic self-hosted sandbox production patterns, Dynamic Workflows fits naturally on top of that infrastructure.

05 — Effort Control & API ChangesThree new controls for teams building with Claude.

Alongside Opus 4.8 and Dynamic Workflows, Anthropic shipped two additional developer-facing changes that affect how you build and cost agentic pipelines.

1. Effort control in claude.ai and Cowork (all plans). A control beside the model selector now lets all claude.ai users choose how much effort Claude spends on a response. Higher effort thinks more often and more deeply for better responses; lower effort responds faster and uses rate limits more slowly. This control has been available to Claude Code users via the effort menu for some time; today it ships to the broader claude.ai product. The implication for non-technical users is meaningful: they can now consciously trade speed for quality on a per-task basis without touching the API.

2. Messages API: system entries inside the messages array. The Messages API now accepts system entries inside the messages array, so developers can update Claude's instructions mid-task without breaking the prompt cache or routing through a user turn. Practical use cases: updating permissions as a task progresses, adjusting token budgets based on remaining work, injecting environment context mid-run. This is a significant capability for teams building long-horizon agentic workflows where the task context changes mid-execution. See our Claude Code deep-dive guide for the broader API patterns context.

3. Effort tiers — the practical map. For teams deciding which effort level to use in production:

Low effort — fast responses, lower rate limit consumption. Best for high-volume, lower-stakes tasks (summarization, classification, simple Q&A).
High effort (default) — Anthropic's judged best balance. On coding, uses similar tokens to Opus 4.7 default but performs better. The right starting point for most agentic tasks.
Extra / xhigh — recommended for difficult tasks and long-running async workflows. The ultracode setting in Claude Code sets this automatically when it judges the task warrants it.
Max — maximum token depth. Best reserved for tasks where quality is the only variable and token cost is not constrained. Rate limit consumption is highest here.

06 — Team Decision GuideWhen to reach for 4.8, which effort tier to set, and when to stay on 4.7.

Opus 4.8 is a same-price upgrade. That changes the evaluation calculus compared to a version bump that raises costs: for most teams already on the Opus rate card, the default posture is to migrate and verify, not to weigh capability gains against a price premium. The decision variables that remain are effort tier selection, token cost modeling, and workload-specific regression testing.

When Opus 4.8 clearly earns its place. The strongest signals are in agentic coding tasks, long-context retrieval (the GraphWalks 1M-token gains are substantial), and any workflow where unreported code flaws or overconfident outputs have caused production issues. If your team has experienced the classic agentic failure mode where Claude completes a task, reports success, but silently skips awkward problems, the code honesty improvements in 4.8 are directly relevant. For teams running AI transformation programs across client workflows, those honesty gains also reduce review overhead on AI-generated deliverables. They do not eliminate the human layer, though: Anthropic's own session analysis shows domain expertise outweighs coding background in steering agentic tools effectively, which is exactly where senior judgment earns its keep on Opus 4.8 runs.

Cost math on effort tiers. Anthropic reports that default HIGH effort on Opus 4.8 uses a similar number of tokens to Opus 4.7's default while performing better — so the base migration should not inflate your token costs. Moving to xhigh or max effort will increase token spend; the right modeling approach is to run a representative sample of your actual task distribution at each effort tier and measure output token counts before committing to a production setting. The Opus 4.6 to 4.7 migration playbook covers the general effort-tier cost methodology that applies equally to the 4.7 to 4.8 migration.

When to stay on Opus 4.7. If your production pipeline has been carefully prompt-tuned to Opus 4.7 behavior and you have GPQA Diamond-sensitive tasks where the 94.2% vs 93.6% difference could matter, stay on 4.7 until you have run your specific benchmark subset on 4.8. Similarly, if your pipeline operates in a high-agentic-injection-risk environment (untrusted external inputs, web-browsing agents, code execution with user-controlled content), model the Gray Swan prompt-injection regression (6.0% on 4.7 vs 9.6% on 4.8) against your threat model before migrating.

Dynamic Workflows: where to start. The clearest early wins for delivery teams are large-codebase discovery tasks (dead code, dependency mapping, security surface audits) and one-time migration work. These have the properties that benefit most from parallelism — many independent units of work, verification value from multiple agents — and do not require the precise latency characteristics of interactive tasks. Start with a bounded, clearly scoped workflow with a confirmation prompt enabled and measure token spend before running overnight jobs without supervision.

GraphWalks long-context F1 — 1M token benchmark

Source: Anthropic Claude Opus 4.8 system card. Opus 4.8's biggest relative lead is long-context retrieval.

GraphWalks BFS 1M — Opus 4.8Long-context retrieval F1 at 1M tokens · source: system card

68.1%

+27.8

GraphWalks BFS 1M — Opus 4.7Long-context retrieval F1 at 1M tokens · source: system card

40.3%

baseline

GraphWalks BFS 1M — GPT-5.5Long-context retrieval F1 at 1M tokens · source: system card

45.4%

+5.1

GraphWalks Parents 1M — Opus 4.8Long-context parent retrieval F1 at 1M tokens · source: system card

83.3%

+26.7

GraphWalks Parents 1M — Opus 4.7Long-context parent retrieval F1 at 1M tokens · source: system card

56.6%

baseline

07 — Caveats & RoadmapThe honest trade-offs and the Mythos-class horizon.

No point release ships without trade-offs. Anthropic documents the Opus 4.8 limitations honestly in the system card, and a credible deployment plan accounts for them.

Prompt-injection robustness regression. This is the most operationally significant caveat for agentic pipelines. The Gray Swan agent red-teaming results show a ~9.6% attack-success-rate with thinking enabled versus 6.0% for Opus 4.7. The gap is not enormous, but for pipelines that process untrusted external content — web pages, user-uploaded files, tool call outputs from third-party APIs — it warrants explicit sandboxing review. The seven Anthropic self-hosted sandbox production patterns remain the reference architecture here.

Vending-Bench 2 regression. The system card notes Opus 4.8 regresses versus Opus 4.7 on Vending-Bench 2. The benchmark tests specific multi-step vending-machine interaction scenarios; the regression suggests a narrow task distribution where Opus 4.7's behavior was preferred. Worth testing if your production workload shares characteristics with highly structured, multi-step transactional interactions.

Multilingual capability gap. The system card notes that Opus 4.8 trails Gemini 3.1 Pro and GPT-5.5 on multilingual tasks. SWE-bench Multilingual at 84.4% is still a strong score, but if your primary workloads are non-English codebases or non-English reasoning tasks, test against the multilingual-specific peer benchmarks before fully migrating.

Documented quirks from pilot feedback. Anthropic's news post mentions occasional early stopping, over-eager file deletion in some agentic contexts, and the model occasionally telling the user to go to bed (a behavioral artifact of its awareness of long run-times). These are described as known quirks of the point release. For production agentic pipelines, implement confirmation prompts for destructive file operations as a standard practice regardless of which Opus version you run.

What's next: Mythos-class for everyone. Anthropic's roadmap, as stated in the news post, is in two directions. First, cheaper models with Opus-level capability — a meaningful signal for teams where the $5/$25 rate card is a constraint. Second, the Mythos-class general rollout. Claude Mythos Preview is currently in limited access through Project Glasswing for cybersecurity; Anthropic expects to bring Mythos-class models to all customers in the coming weeks. The trajectory from Opus 4.7 to Opus 4.8 suggests that general-availability Mythos models will carry the same same-price upgrade posture that has characterized the Opus 4.x line.

The analytical forward projection: the combination of Dynamic Workflows (parallel subagents at scale), the honesty gains (fewer silent failures in long-running tasks), and the long-context retrieval improvements (68.1% at 1M tokens) is what makes Opus 4.8 a qualitatively different tool for sustained delivery work, not just a benchmark improvement. Teams that invest in learning Dynamic Workflows now will be ahead of the curve when Mythos-class models arrive with presumably even stronger agentic capabilities. For teams not yet on the Claude ecosystem, the migration playbook from our Opus 4.6 to 4.7 migration playbook is the fastest path to production readiness on the 4.x line.

Conclusion

Opus 4.8 is a same-price upgrade that earns its migration on honesty and long-context alone.

The commercial story for Opus 4.8 is straightforward: Anthropic held the rate card flat and shipped measurably better performance on the benchmarks that matter most for delivery teams. A 69.2% SWE-bench Pro score, four-fold reduction in unreported code flaws, and 68.1% long-context retrieval at 1M tokens against 40.3% for Opus 4.7 are not incremental noise — they are the kind of gains that reduce review overhead and increase the reliability of agentic workflows without requiring a renegotiation of your token budget.

Dynamic Workflows is the higher-order bet. Parallelism at the scale of hundreds of subagents, adversarial verification built into the run, and resumable state across multi-day jobs represent a structural shift in what a single Claude Code session can accomplish. The Bun port — 750,000 lines of Rust, 99.8% test suite passing, 11 days — is a credible proof of concept, even if it is not yet in production and even if your workloads are an order of magnitude smaller. The same pattern of parallel agents, two reviewers per file, and overnight fix loops scales down to migration tasks that engineering teams plan in weeks.

The caveats are real and worth planning around: the prompt-injection regression deserves attention in agentic pipelines with untrusted inputs, and Dynamic Workflows token costs require explicit budgeting before running at scale. But neither caveat changes the base recommendation: migrate to Opus 4.8 at default HIGH effort, run your benchmark subset to confirm no regressions on your specific workloads, and invest the planning time to scope one Dynamic Workflows pilot in Q2 or Q3.

Claude Opus 4.8: Benchmarks, Effort & Dynamic Workflows

01 — Release OverviewOne upgrade, three launches — available today at the same price.

Input / output per million tokens — unchanged

Defaults to high; xhigh and max available

Up from 64.3% on Opus 4.7

vs 69.3% on Opus 4.7 — largest single-cycle math jump

02 — Benchmark AnalysisThe four-model matrix — with honest reads on the dips.

SWE-bench Pro 69.2% · HLE tools 57.9% · GDPval-AA 1890

SWE-bench Pro 64.3% · GPQA Diamond 94.2%

Terminal-Bench 2.1 78.2% · HLE tools 52.2%

GPQA Diamond 94.3% · HLE tools 51.4%

SWE-bench Pro — 4-model comparison

03 — Honesty & AlignmentThe standout story of this release — reliability you can measure.

04 — Dynamic Workflows in Claude CodeParallel subagents, resumable state — work you plan in quarters now done in days.

Bug hunts & security hardening

Framework swaps & language ports

Critical tasks checked twice

Overnight runs, resumable state

05 — Effort Control & API ChangesThree new controls for teams building with Claude.

06 — Team Decision GuideWhen to reach for 4.8, which effort tier to set, and when to stay on 4.7.

GraphWalks long-context F1 — 1M token benchmark

07 — Caveats & RoadmapThe honest trade-offs and the Mythos-class horizon.

Opus 4.8 is a same-price upgrade that earns its migration on honesty and long-context alone.

From benchmark analysis to production-ready delivery.

Agentic delivery on Anthropic Claude

The questions teams ask about Claude Opus 4.8.

Continue exploring Claude releases and agentic delivery.

Claude Fable 5 & Mythos 5: The Frontier, Split in Two

Claude Fable 5 & Mythos 5: Agentic Coding Deep Dive

Claude Fable 5 vs GPT-5.5: Benchmarks & Cost Compared

Claude Opus 4.8, 48 Hours In: The Early Eval Roundup

Claude Code Leak: Agentic Architecture Lessons 2026