Claude Opus 4.8 launched on May 28, 2026, just 41 days after Opus 4.7, and the first wave of independent evaluations is now in. The 48-hour read is genuinely strong but genuinely uneven — Opus 4.8 edges to the top of one composite index, leads the headline coding benchmarks, and still trails GPT-5.5 on one prominent test that everyone is conveniently skipping past.

What makes this release worth a careful look is not the leaderboard. It is that almost every confident claim circulating two days in falls into one of three buckets: independently corroborated, third-party measured with a methodology caveat, or vendor-stated with no outside replication at all. Treating those as equivalent is how launch-week hype gets baked into roadmaps.

This roundup grades each major claim by source type and corroboration count, separates the confirmed wins from the launch-week noise, and points to the quiet API changes that matter regardless of where the benchmarks settle. For the launch-day mechanics and the new dynamic workflows feature, see our launch-day announcement and dynamic workflows deep-dive; this post is the eval-quality companion to it.

Key takeaways

01
Top of one index, not every benchmark.Opus 4.8 reaches the top of Artificial Analysis's composite Intelligence Index at 61.4, narrowly ahead of GPT-5.5 at 60.2. That is a single aggregate index, not a clean sweep — the per-benchmark picture is mixed.
02
GPT-5.5 still leads Terminal-Bench 2.1.On Terminal-Bench 2.1, GPT-5.5 scores 78.2% vs Opus 4.8's 74.6%. Vellum explicitly flags that this benchmark is sensitive to harness choice — the one prominent test where Opus 4.8 trails GPT-5.5.
03
Anthropic called it 'modest but tangible.'Anthropic described the release as a modest but tangible improvement — unusually restrained framing that practitioners read as a deliberate cultural signal rather than typical launch hyperbole.
04
Headline gains are vendor-stated at 48 hours.The roughly 4x code-honesty improvement and the USAMO math jump are Anthropic's own internal figures with no independent replication in the 48-hour window. Useful signals, not yet confirmed findings.
05
API teams win regardless of the leaderboard.A 1,024-token prompt-cache minimum (down from 4,096), mid-conversation system messages, and documented refusal stop details are concrete developer wins independent of any benchmark result.

01 — The SnapshotWhy a deliberate 48-hour read.

Frontier-model launches now arrive faster than the community can properly evaluate them. Opus 4.8 shipped 41 days after Opus 4.7 — by most accounts the fastest flagship upgrade cycle Anthropic has run. That speed is the story behind the story: the gap between "a model is announced" and "the model is independently understood" has widened, and most coverage publishes inside that gap.

So this is a snapshot, dated and bounded. As of 48 hours post-launch, some claims have multiple independent corroborations, some have a single third-party measurement, and several rest entirely on Anthropic's own evaluations. We mark which is which throughout, because the difference is the whole point. A number that one vendor measured on its own harness is not the same kind of fact as a number two independent labs reproduced.

The honest version of an early roundup states its own epistemic limits up front. Two things in particular are simply not available yet: blind-preference leaderboard data, which needs weeks of sustained voting to update, and any independent replication of the vendor-stated honesty and math figures. We treat those as pending, not as evidence either way.

What this post is — and isn't

This is an evidence-graded roundup of the first 48 hours, not a definitive benchmark verdict. Where a figure is Anthropic's own internal evaluation, we say so. Where a third-party benchmark carries a disclosed methodology change, we flag it. Where data does not exist yet, we call it pending rather than guessing. Re-verify the live numbers on each primary source before you make a production decision.

02 — The Framing"Modest but tangible" — a rare act of restraint.

The single most under-covered detail of the launch is Anthropic's own language. The company described Opus 4.8 as "a modest but tangible improvement" — phrasing that reads as deliberate calibration on a release that, by the benchmark numbers, is actually quite strong. In a market where every launch is the new state of the art, choosing "modest" is itself a signal.

Independent developer Simon Willison picked up on it directly, describing the candor as a refreshing contrast to typical AI-lab marketing. That framing matters because it is consistent with the release's headline behavioral claim — that the model is meaningfully more willing to point out problems in its own work — and because it sets a more useful expectation for teams than "this changes everything" ever does.

"I appreciated Anthropic's candor describing the release as 'a modest but tangible improvement' — contrasting refreshingly with typical AI lab marketing hyperbole."— Simon Willison, independent developer and blogger

There is a thematically linked behavioral claim worth naming carefully. Anthropic states that Opus 4.8 is approximately four times less likely than Opus 4.7 to let flaws in its own generated code pass without comment. Secondary coverage repeated that figure in identical language, but no outsider ran their own calibration test in the launch window — so it is best read as an Anthropic-stated internal metric, not an independently replicated result. As an enterprise reference point, Anthropic cited Bridgewater Associates, whose primary observed distinction was the model's tendency to proactively surface issues with the inputs and outputs of an analysis.

03 — BenchmarksWhere the numbers actually land.

Two independent eval sources carried the bulk of the early benchmarking: Vellum's side-by-side explainer and Artificial Analysis's analysis. The chart below pulls Opus 4.8 scores from both, with orange bars marking benchmarks where Opus 4.8 leads the comparison set and blue bars marking where another model leads. Read it alongside the deltas-versus-Opus-4.7 in the table that follows.

Opus 4.8 across early independent benchmarks · selected

Source: Vellum AI and Artificial Analysis, May 28, 2026

SWE-Bench ProOpus 4.8 69.2% · Opus 4.7 64.3% · GPT-5.5 58.6%

69.2%

Opus 4.8

SWE-Bench VerifiedOpus 4.8 88.6% · Opus 4.7 87.6% · Gemini 3.1 Pro 80.6%

88.6%

Opus 4.8

HLE (with tools)Opus 4.8 57.9% · Opus 4.7 54.7% · GPT-5.5 52.2%

57.9%

Opus 4.8

OSWorld-VerifiedOpus 4.8 83.4% · GPT-5.5 78.7% · harness change disclosed

83.4%

Caveat

Terminal-Bench 2.1GPT-5.5 78.2% · Opus 4.8 74.6% · harness-sensitive

78.2%

GPT-5.5

Finance Agent v2Gemini 3.5 Flash 57.9% · Opus 4.8 53.9%

57.9%

Gemini 3.5 Flash

Opus 4.8 leads the setAnother model leads

On the coding evals that matter most for agentic engineering work, Opus 4.8 has a clean, independently measured edge. SWE-Bench Pro rises to 69.2% from Opus 4.7's 64.3%, comfortably ahead of GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%). SWE-Bench Verified ticks up to 88.6% from 87.6%. Humanity's Last Exam with tools improves to 57.9% from 54.7%. These are the gains a senior team will feel in day-to-day refactors and multi-file changes, and they hold across both eval sources.

The composite picture is where care is required. Artificial Analysis's Intelligence Index puts Opus 4.8 first at 61.4, just ahead of GPT-5.5's 60.2 and well above Opus 4.7's 57.3 — but that is a blended index, not a single standardized test, and a 1.2-point composite margin is not a decisive lead. On the same source's GDPval-AA real-work measure, Opus 4.8 reaches a notably higher Elo than both GPT-5.5 and Opus 4.7, and reportedly used about 15% fewer turns than Opus 4.7 on the same task set. That turn-efficiency number is a different metric from token count; we keep the two distinct rather than collapsing them into a single "faster" claim.

Coding step-up

SWE-Bench Pro vs Opus 4.7

+4.9pts

Opus 4.8 hits 69.2% on SWE-Bench Pro versus 64.3% for Opus 4.7 — the cleanest independently measured improvement, ahead of GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%).

Vellum AI · cross-checked

Composite lead

AA Intelligence Index

61.4

First place on Artificial Analysis's composite index, narrowly ahead of GPT-5.5 at 60.2. A blended aggregate, not a single benchmark — read the margin as narrow, not decisive.

Composite, not standardized

Not #1 everywhere

Finance Agent v2

53.9%

Gemini 3.5 Flash leads Finance Agent v2 at 57.9%, ahead of Opus 4.8 (53.9%). A useful reminder that the leaderboard is task-dependent, not a single ranking.

Balance check

04 — The ExceptionThe Terminal-Bench exception, and why the caveat matters.

Almost every headline this week declared Opus 4.8 the new frontier leader. The cleanest counter-example is Terminal-Bench 2.1, where GPT-5.5 scores 78.2% against Opus 4.8's 74.6%. It is the one prominent benchmark in the early set where Opus 4.8 trails GPT-5.5 — and the reasons for it are more instructive than the gap itself.

Vellum, which published the side-by-side, explicitly notes that Terminal-Bench is sensitive to harness choice: the scaffolding, tool-call wiring, and execution environment around the model can move the score meaningfully. That cuts both ways. It means GPT-5.5's lead may be partly a harness artifact — and equally that you cannot assume Opus 4.8's wins elsewhere transfer cleanly to your own terminal-automation harness without testing.

The methodology asterisk

OSWorld-Verified deserves the same scrutiny in the other direction. Opus 4.8's 83.4% is the top computer-use score in the set, but Anthropic discloses that the OSWorld benchmark methodology changed between versions — so part of the gain over Opus 4.7's 82.8% may reflect harness improvement rather than pure model capability. Read it as a strong result with a disclosed asterisk, not a clean apples-to-apples jump.

The editorial takeaway is simple and it is the opposite of most launch coverage: Opus 4.8 won almost everywhere in the early set, except on Terminal-Bench, and the reasons it won or lost matter more than the raw position. A benchmark that swings on harness choice is a prompt to run your own evaluation on your own scaffolding, not a number to quote in a slide.

05 — Confidence MatrixWhat we know versus what we don't — yet.

The most useful thing a 48-hour roundup can produce is not a verdict but a map of confidence. The grid below sorts the major launch claims by how they are sourced — Anthropic-stated, third-party measured, or community anecdote — and what that implies for acting on them now versus waiting for more data. The pattern is the lesson: the most spectacular numbers are also the least corroborated.

Confirmed

Coding benchmark gains

SWE-Bench Pro 69.2%, SWE-Bench Verified 88.6%, HLE-with-tools 57.9%. Third-party measured by Vellum and cross-checked against Artificial Analysis. Two independent sources agree — safe to act on with your own validation.

Act now, validate

Likely, asterisked

OSWorld 83.4% computer use

Top score in the set, but Anthropic discloses a methodology change versus Opus 4.7. Single-vendor measurement with a self-disclosed caveat — directionally real, not apples-to-apples. Treat the delta as soft.

Trust the direction

Vendor-stated

4x honesty + USAMO math

The ~4x code-honesty figure and the USAMO 2026 math jump are Anthropic's internal evaluations. Widely repeated, but no independent replication in the 48-hour window. Promising signals, not confirmed findings — wait before quoting.

Hold for replication

Pending / disputed

Arena Elo and the Qwen claim

Blind-preference leaderboard data for Opus 4.8 had not landed at 48h — rankings need weeks of voting. The viral 'Qwen distillation' screenshot points to training-data contamination, not distillation. Both are non-evidence right now.

Do not cite yet

Notice the gradient. The benchmark gains a working engineer cares about most — coding throughput on SWE-Bench-style tasks — are exactly the ones with two independent measurements behind them. The most quotable numbers, the roughly fourfold honesty improvement and the large single-cycle math jump, are vendor-stated and unreplicated. That inversion is normal for launch week, and it is precisely why a source-typed confidence map beats a single composite score for making decisions.

06 — Developer SurfaceThe quiet API wins that don't need a benchmark.

Beneath the leaderboard noise sits a set of platform changes that are concrete, documented, and useful regardless of where the benchmarks settle. These are the parts of the release most coverage skipped, and they are the parts most likely to change how an API team actually builds.

Caching

1,024-token cache floor

down from 4,096 tokens

The prompt-cache minimum drops to 1,024 tokens. Prompts that were too short to cache on Opus 4.7 can now create cache entries on Opus 4.8 with no code changes — a direct cost win for short-system-prompt workloads.

No code changes required

Sessions

Mid-conversation system messages

role: system, after a user turn

Opus 4.8 accepts a system-role entry immediately after a user turn in the messages array, letting you update instructions mid-session without breaking the prompt cache. No beta header needed.

No beta header

Routing

Documented refusal stop details

stop_details object on refusals

Refusal responses now carry a documented stop_details object describing the refusal category, so applications can distinguish classes of declined requests and route users appropriately instead of guessing.

Refusal category exposed

A few facts keep the picture accurate. Standard pricing is unchanged from Opus 4.7 at $5 per million input tokens and $25 per million output tokens, and code targeting Opus 4.7 runs unchanged on 4.8 — the same constraint carries over that non-default temperature, top-p, or top-k values return an error, and extended-thinking token budgets remain unsupported, with adaptive thinking the supported mode and a default effort of high across surfaces. The context window is one million tokens on the API, Bedrock, and Vertex, but capped at 200k on Microsoft Foundry — an easy detail to get wrong.

One efficiency option comes with a caveat. A faster output tier, priced at $10 / $50 per million tokens for roughly 2.5x output speed, is available — but at 48 hours it is a research preview on the Claude API only, not a generally available feature across the claude.ai UI or every plan tier. As a practical cost anchor, Simon Willison reported a single maximum-thinking query running about 43 cents, a reminder that top-effort single queries are not free even at unchanged list pricing.

The cost reality

The new dynamic workflows preview can orchestrate large fan-outs of parallel agents, and the public proof-of-concept was striking — but the consistent operator note is that costs climb fast at scale. A 1,024-token cache floor saves cents per call; an unbounded multi-agent run can spend dollars per minute. Both are real; budget for the second before you celebrate the first. Full mechanics in our dynamic workflows deep-dive.

07 — Signal QualityReading launch-week signal without getting played.

The clearest case study from this launch is the "Qwen distillation" episode. Within hours, screenshots circulated of Opus 4.8 identifying itself as Alibaba's Qwen in Chinese, and the claim spread quickly. The mundane explanation is the likely one: self-introduction text from popular models saturates the training data scraped from the Chinese-language web, and behavior like this is inconsistent across runs and languages. A genuine distillation would fail the same way every time; this did not.

The point is not the controversy — it is the speed. A screenshot became a narrative before anyone checked whether the behavior was reproducible. That is the texture of launch-week signal: high volume, low verification. The discipline that protects a team is boring and effective — attribute every claim to a source, check whether the behavior reproduces, and ask whether a benchmark's harness or methodology could explain the result before the model does.

"The real shift isn't 'AI helps a person do a task.' It's that the task gets redesigned around AI workers from the beginning."— Avi Hacker, independent practitioner

Practitioner reports add useful texture without being dispositive. Production testing across a batch of open-source pull requests described competitive cross-file reasoning and long-horizon agentic sessions, with an actionable-finding rate that landed roughly flat versus Opus 4.7 even as the volume of minor findings rose. Separately, developers reported visible degradation past roughly 200k tokens despite the one-million-token ceiling — consistent with prior Opus-family behavior — and noted that some Opus 4.7 prompt patterns need retuning. These are small-sample, anecdotal early signals, not measured findings, and we present them as such.

08 — Practical DecisionWhat to actually do this week.

The right move at 48 hours is neither "switch everything" nor "ignore it." It is to act on the confirmed gains where they touch your highest-value workloads, and to wait on the unreplicated claims. For teams already running a multi-vendor routing strategy, the deltas here are small enough to slot in by task class rather than a wholesale migration.

Coding & refactors

Test on your own repos

Now

The SWE-Bench Pro and Verified gains are independently corroborated. Benchmark Opus 4.8 against your current default on your own repositories before changing anything — the gain is real, the magnitude is yours to measure.

Confirmed lift

Terminal automation

Run your own harness

Test

GPT-5.5 still leads Terminal-Bench 2.1, and the benchmark is harness-sensitive. Do not switch your terminal-agent default on the headline — evaluate both models inside your actual scaffolding.

Harness-dependent

API cost & UX

Adopt the platform wins

Ship

The 1,024-token cache floor, mid-conversation system messages, and refusal stop details require no benchmark belief. They are documented platform changes you can adopt immediately for cost and UX gains.

Zero-risk wins

Looking a few weeks ahead, the picture should sharpen in predictable ways. Blind-preference leaderboard data will accumulate and either confirm or temper the composite-index lead; independent math and calibration tests may or may not reproduce the vendor-stated honesty and USAMO figures; and the harness-sensitivity questions around Terminal-Bench and OSWorld will get cleaner as more teams publish runs on standardized scaffolding. The teams that benefit most are the ones that ran their own evals during this window rather than waiting for a consensus that always arrives late.

If you are deciding where Opus 4.8 fits against the rest of your stack, two companion reads go deeper than a roundup can: our full head-to-head comparison with GPT-5.5 and the agent routing use-case comparison. If you would rather hand the comparative eval to a team that does it for a living, our AI digital transformation engagements start with exactly this kind of model-selection work, and our custom development practice wires the chosen model into production with the API changes above.

09 — ConclusionA strong release, read honestly.

The 48-hour verdict, May 30, 2026

Strong where it's confirmed, pending where it's spectacular.

Forty-eight hours in, Claude Opus 4.8 looks like a genuinely strong release with an unusually honest face. It reaches the top of one composite index, leads the coding benchmarks that working engineers feel daily, and ships real API improvements that need no benchmark to justify. Anthropic's own "modest but tangible" framing turned out to be the most accurate sentence written about it all week.

The discipline is in the asterisks. The most spectacular claims — the roughly fourfold code-honesty gain and the large single-cycle math jump — are vendor-stated and unreplicated at this point; the OSWorld result carries a disclosed methodology change; blind-preference rankings simply have not landed; and GPT-5.5 still wins the one harness-sensitive terminal benchmark. None of that diminishes the release. It just means a careful team grades the claim by its source before it grades the model.

The broader lesson outlasts this launch. As upgrade cycles compress toward a month, the gap between announcement and understanding is where most decisions now get made — and the winning move is not to trust the leaderboard faster, but to run your own evals on the workloads you actually care about. Act on the confirmed gains, wait on the unreplicated ones, and treat every viral screenshot as a hypothesis, not a finding.

Claude Opus 4.8, 48 Hours In: The Early Eval Roundup

01 — The SnapshotWhy a deliberate 48-hour read.

02 — The Framing"Modest but tangible" — a rare act of restraint.

03 — BenchmarksWhere the numbers actually land.

Opus 4.8 across early independent benchmarks · selected

SWE-Bench Pro vs Opus 4.7

AA Intelligence Index

Finance Agent v2

04 — The ExceptionThe Terminal-Bench exception, and why the caveat matters.

05 — Confidence MatrixWhat we know versus what we don't — yet.

Coding benchmark gains

OSWorld 83.4% computer use

4x honesty + USAMO math

Arena Elo and the Qwen claim

06 — Developer SurfaceThe quiet API wins that don't need a benchmark.

1,024-token cache floor

Mid-conversation system messages

Documented refusal stop details

07 — Signal QualityReading launch-week signal without getting played.

08 — Practical DecisionWhat to actually do this week.

Test on your own repos

Run your own harness

Adopt the platform wins

09 — ConclusionA strong release, read honestly.

Strong where it's confirmed, pending where it's spectacular.

Choose your frontier model on evidence, not launch hype.

Frontier model engagements

The questions we get every week.

Continue exploring frontier releases.

Claude Opus 4.8: Benchmarks, Effort & Dynamic Workflows

Gemini 3.5 Flash vs GPT-5.5 vs Opus 4.7: Agentic Coding

Prompt Caching in 2026: Cut LLM Costs, Keep Quality

SWE-bench in 2026: Benchmarks vs Scaffolding Reality