SWE-bench Verified is the benchmark every coding-model launch quotes, and as of June 16, 2026 the headline is striking: Claude Fable 5 sits at the top of the llm-stats leaderboard with a reported 95.0%. The problem is what that number hides. Ninety-nine of the hundred entries on that board are self-reported, the same model family scores far lower on a standardized harness, and one of the labs that helped make the benchmark famous has quietly stopped reporting it at all.

This is not a story about which model is best. It is a story about why a single benchmark name now points to four genuinely different numbers, and why the agent scaffold around a model can move a score by 10 to 20 points without changing the model at all. For anyone choosing a coding model for real work, the rank order on a leaderboard is the least useful thing on the page.

Below we separate the four distinct measurements hiding behind “SWE-bench,” show where the vendor numbers and the standardized numbers diverge, explain the contamination problem that pushed OpenAI to walk away, and lay out how to read all of it when you actually have to pick a model. It builds on our broader SWE-bench live leaderboard Q2 2026 analysis with a sharper focus on the harness.

Key takeaways

01
The headline rank is mostly self-reported. The llm-stats SWE-bench Verified leaderboard listed 100 models on June 16, 2026, but only 1 result was independently verified; the other 99 were submitted by the vendors themselves. Fable 5's 95.0% is the rare independently confirmed one (vals.ai).
02
One benchmark name, several different numbers. SWE-bench Verified (self-reported), SWE-bench Pro on a vendor scaffold, SWE-bench Pro on Scale's standardized SEAL harness, and SWE-bench Pro on a private commercial subset can all carry the same model and produce sharply different results.
03
The harness can be worth more than the model. Three different agent systems running the same Claude Opus 4.5 produced a 50.2%–55.4% range on SWE-bench Pro, a 5.2-point spread from scaffold differences alone. Scale's own analysis attributes 10–20 point swings to harness choices.
04
Vendor Pro scores sit well above standardized ones. Anthropic reports 69.2% for Opus 4.8 on SWE-bench Pro using its own scaffold, while the best Claude score on Scale's standardized SEAL board is 51.9% (Opus 4.6 thinking), a 17.3-point gap within a single model family.
05
Buy on the standardized and private numbers, not the headline. For purchase decisions, treat SWE-bench Verified as a pass/fail tier filter and lean on SEAL-standardized SWE-bench Pro plus the private-codebase subset: these are the only numbers measured the same way for every model, on tasks closer to proprietary work.

01 — The June Snapshot: What the SWE-bench Verified board actually says today.

SWE-bench was published at ICLR 2024 by Princeton NLP, framed around a single question: can language models resolve real-world GitHub issues? The original set drew 2,294 problems from real repositories. SWE-bench Verified is the 500-task, human-validated subset that became the industry-standard headline number, drawn from popular open-source Python projects including Django, SymPy, scikit-learn, pytest, Flask, and matplotlib (the list is illustrative, not exhaustive).

On the llm-stats SWE-bench Verified leaderboard as of June 16, 2026, the top of the board is dominated by the Claude family. Claude Fable 5 leads at 95.0% — a figure vals.ai independently confirms at 95.00%. Claude Mythos Preview follows at 93.9% (a model restricted to Project Glasswing partners, not generally available), then Opus 4.8 at 88.6%, Opus 4.7 at 87.6%, GPT-5.5 at a reported 82.60% (vals.ai), and Gemini 3.1 Pro at 80.6% on third-party leaderboard aggregation.

SWE-bench Verified · top of the board · June 16, 2026
Source: llm-stats.com & vals.ai SWE-bench Verified leaderboards, retrieved June 16, 2026 (mostly self-reported)
Model	Verified score	Note
Claude Fable 5	95.0%	Rank #1; independently confirmed by vals.ai
Claude Mythos Preview	93.9%	Project Glasswing partners only
Claude Opus 4.8	88.6%	Released May 28, 2026
Claude Opus 4.7	87.6%	Self-reported
GPT-5.5	82.60%	vals.ai, June 13, 2026
Gemini 3.1 Pro	80.6%	Leaderboard-aggregated

Availability caveat

The two models at the very top are not freely deployable. Fable 5 launched June 9, 2026 but was suspended on June 12, 2026 under US export-control requirements, with an expected return around July 1, 2026; the currently available Claude alternative is Opus 4.8. Mythos Preview is restricted to Project Glasswing partners. A 95% headline against a model you cannot reliably buy this week is a marketing number first and a procurement number second.

02 — The Reporting Problem: Ninety-nine of a hundred results are self-reported.

The single most important fact about that leaderboard rarely appears in the coverage that cites it. Of the 100 models listed on llm-stats as of June 16, 2026, only one carries an independent verification badge — the other 99 scores were submitted by the model vendors themselves. Most reporting treats the board as an objective ranking; it is closer to a self-attested press-release aggregator with a single audited row.

That matters because each vendor runs its own evaluation harness — its own scaffold of tool definitions, retry logic, context management, and prompting around the raw model. SWE-bench gives an agent a Docker container with the target repo, the issue text, and a test runner, but crucially not the failing test itself. The agent has to discover the failing tests, understand the issue, and produce a patch that flips the failing tests to passing without breaking the ones already passing. How well the surrounding scaffold supports that discovery is a vendor choice — and a self-reported number bakes the vendor’s best scaffold into the score.
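The pass criterion described above can be sketched in a few lines. This is a minimal illustration, not the official evaluation code: the two list names mirror the FAIL_TO_PASS and PASS_TO_PASS fields of the public SWE-bench dataset, while the function name, the result format, and the test names are assumptions made for the example.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    results: Dict[str, bool],
) -> bool:
    """Return True only if every target test now passes and nothing regressed.

    `results` maps a test name to True (passed) or False (failed) after the
    candidate patch is applied. A test missing from `results` counts as failed.
    """
    fixed = all(results.get(test, False) for test in fail_to_pass)
    not_broken = all(results.get(test, False) for test in pass_to_pass)
    return fixed and not_broken


# Hypothetical test names, for illustration only.
fail_to_pass = ["test_issue_repro"]
pass_to_pass = ["test_existing_a", "test_existing_b"]

good_patch = {"test_issue_repro": True, "test_existing_a": True, "test_existing_b": True}
regressing_patch = {"test_issue_repro": True, "test_existing_a": False, "test_existing_b": True}

print(is_resolved(fail_to_pass, pass_to_pass, good_patch))        # True
print(is_resolved(fail_to_pass, pass_to_pass, regressing_patch))  # False
```

The second case is the one a scaffold has to guard against: the patch fixes the issue but breaks an existing test, so the task does not count as resolved.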

"SWE-bench scores are harness-dependent."— benchmarkingagents.com, SWE-bench Verified Explained: 2026 Methodology, Tiers, Caveats

Independent benchmarking analysis goes further, noting that any Verified score above 80% “warrants scrutiny about harness and tool access” — which is precisely the band the entire top of the June 2026 board now occupies. When the leading dozen models cluster between 80% and 95% on a self-reported benchmark, the rank order stops carrying much signal and the methodology underneath it starts carrying all of it.

03 — The Three-Score Problem: One model, three different SWE-bench Pro numbers.

SWE-bench Pro is the harder, more contamination-resistant successor: 1,865 total tasks (731 public, 858 held-out, 276 commercial) spanning 41 repositories in Python, Go, TypeScript, and JavaScript, with tasks averaging 107.4 lines changed across 4.1 files. The public set deliberately uses copyleft-licensed repositories as a legal deterrent against quietly folding them into training data. It is a better benchmark — and it produces multiple, genuinely different numbers for the same model depending on how it is run.

There are three meaningfully distinct SWE-bench Pro measurements, and no single published table normally spans all of them next to the Verified headline. The matrix below assembles the research-sourced cells we could confirm. Where a cell shows “—,” that number is not available in a primary source as of the snapshot date — for example, Fable 5 has no Scale SEAL-standardized Pro score yet (Epoch AI’s independent evaluation was still pending as of June 10, 2026), and we deliberately do not interpolate one.

SWE-bench Pro score reality check: each model shown across SWE-bench Verified (self-reported), SWE-bench Pro on a vendor scaffold, SWE-bench Pro on Scale’s standardized SEAL public harness, and SWE-bench Pro on Scale’s private commercial subset. Dashes mark numbers not available from a primary source as of June 16, 2026.
Model	Verified (self-reported)	Pro · vendor scaffold	Pro · SEAL public	Pro · SEAL private
Claude Fable 5	95.0%	80.3%	—	—
Claude Mythos Preview	93.9%	77.8%	—	—
Claude Opus 4.8	88.6%	69.2%	—	—
GPT-5.5	82.60%	58.6%	—	—
Gemini 3.1 Pro	80.6%	54.2%	46.1%	—
GPT-5.4 (xHigh)	—	—	59.1%	43.4%
Muse Spark	—	—	55.0%	—
Claude Opus 4.6 (thinking)	—	—	51.9%	47.1%
Claude Opus 4.5	80.9%	—	45.9%	—

Read the rows where the data exists and the pattern is unmistakable. Gemini 3.1 Pro falls from 80.6% on Verified to a vendor-reported 54.2% on SWE-bench Pro — a 26.4-point drop between two benchmarks wearing the same family name. Opus 4.5 falls from 80.9% on Verified to 45.9% on Scale’s standardized SEAL public set, a 35.0-point cliff. And within the Claude family on Pro alone, Anthropic’s own vendor-scaffold 69.2% for Opus 4.8 sits 17.3 points above the best standardized Claude score on the SEAL board (Opus 4.6 thinking at 51.9%).
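The gaps quoted above are plain subtractions over the matrix, and it helps to compute them rather than eyeball them. The sketch below copies the relevant cells from the table; the dictionary layout and the `gap` helper are our own illustrative choices, and `None` stands for a cell with no primary-source number.

```python
from typing import Optional

# Scores in percent, copied from the matrix; None marks an unavailable cell.
scores = {
    "Gemini 3.1 Pro": {"verified": 80.6, "pro_vendor": 54.2, "pro_seal_public": 46.1},
    "Claude Opus 4.5": {"verified": 80.9, "pro_vendor": None, "pro_seal_public": 45.9},
    "Claude Opus 4.8": {"verified": 88.6, "pro_vendor": 69.2, "pro_seal_public": None},
    "Claude Opus 4.6 (thinking)": {"verified": None, "pro_vendor": None, "pro_seal_public": 51.9},
}


def gap(higher: Optional[float], lower: Optional[float]) -> Optional[float]:
    """Point difference between two scores, or None if either one is missing."""
    if higher is None or lower is None:
        return None
    return round(higher - lower, 1)


gemini = scores["Gemini 3.1 Pro"]
opus45 = scores["Claude Opus 4.5"]

print(gap(gemini["verified"], gemini["pro_vendor"]))       # 26.4
print(gap(opus45["verified"], opus45["pro_seal_public"]))  # 35.0

# Vendor-scaffold Opus 4.8 against the best standardized Claude entry.
print(gap(scores["Claude Opus 4.8"]["pro_vendor"],
          scores["Claude Opus 4.6 (thinking)"]["pro_seal_public"]))  # 17.3

# A missing cell stays missing instead of being interpolated.
print(gap(opus45["verified"], opus45["pro_vendor"]))       # None
```

Returning `None` for a missing cell keeps the arithmetic honest: a gap is only reported where both numbers exist in a primary source.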

"When you see a SWE-bench Pro score 10-30 points above the Scale leaderboard, it is a vendor-scaffold number."— morphllm.com, SWE-bench Pro Leaderboard analysis

On the 80.3% vendor figure

Fable 5’s widely-quoted 80.3% on SWE-bench Pro is vendor-reported using Anthropic’s own scaffold, not a neutral harness — and Fable 5 does not yet appear on Scale’s standardized SEAL board at all (the best Claude entry there is Opus 4.6 thinking at 51.9%). Epoch AI’s independent evaluation was still pending as of June 10, 2026. Treat the 80.3% as a vendor claim awaiting standardized confirmation, and do not assume a SEAL-equivalent number exists for it.

04 — The Harness: What the scaffold controls — and why it moves the number.

“The harness matters” is easy to say and hard to picture. The cleanest demonstration in the research needs no abstraction at all: three different agent systems each ran the same Claude Opus 4.5 model against SWE-bench Pro and produced scores from 50.2% to 55.4% — a 5.2-point spread coming entirely from differences in how each agent managed context and tool calls. The model was held constant; only the scaffold changed. Scale AI’s own analysis puts the swing from harness choices at 10 to 20 points.

That is why a vendor harness and a standardized harness can disagree so sharply on the same weights. The table below names the variables the scaffold actually controls. None of them are the model; all of them move the score.

Harness variables taxonomy: how Scale's standardized SEAL harness and a typical vendor proprietary harness differ across context management, attempt budget, tool-use integration, retry logic, file navigation, and error feedback.
Harness variable	SEAL standardized	Typical vendor scaffold	Why it moves the score
Context management	Fixed, identical for all models	Tuned per model, often generous	More retained context helps the model find the right files.
Attempt / turn budget	Capped and uniform	Higher caps, sometimes multi-attempt	More turns means more chances to converge on a passing patch.
Tool-use integration	Common mini-agent toolset	Model-native, deeply optimized	Cleaner tool calls reduce wasted turns and parse failures.
Retry logic	Minimal, consistent	Custom recovery on test failure	Recovering from a failed run rescues otherwise-lost tasks.
File navigation	Standard repo tools	Bespoke search and indexing	Faster file discovery is most of the work on real repos.
Error feedback format	Raw runner output	Parsed, summarized for the model	Readable errors help the model self-correct between attempts.

The practical lesson is that a SWE-bench number is a measurement of a system — model plus scaffold — not of a model alone. When you wire a model into your own tooling, you are building a harness of your own, and its quality will move your real-world success rate by a similar margin. This is the same dynamic we documented when measuring tool-use success rates across frontier models: the integration around the model is often the deciding variable.
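One way to make "model plus scaffold" concrete is to write the scaffold down as data, so that two runs can be compared on their settings before their scores are compared. The sketch below is a hypothetical config shape, not any vendor's or Scale's actual schema: the field names follow the table of harness variables (with the attempt and turn budget split into two fields), and every value is made up for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class HarnessConfig:
    """Scaffold settings that sit around the model (all values illustrative)."""
    context_management: str
    max_turns: int
    attempts_per_task: int
    toolset: str
    retry_on_test_failure: bool
    file_navigation: str
    error_feedback: str


def differing_fields(a: HarnessConfig, b: HarnessConfig) -> List[str]:
    """List the scaffold variables on which two harness configs disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Made-up settings for illustration; these are not published values.
standardized = HarnessConfig(
    context_management="fixed window",
    max_turns=50,
    attempts_per_task=1,
    toolset="common mini-agent tools",
    retry_on_test_failure=False,
    file_navigation="standard repo tools",
    error_feedback="raw runner output",
)
vendor_style = HarnessConfig(
    context_management="tuned per model",
    max_turns=200,
    attempts_per_task=3,
    toolset="model-native tools",
    retry_on_test_failure=True,
    file_navigation="bespoke search and index",
    error_feedback="parsed summary",
)

print(differing_fields(standardized, vendor_style))
```

If two published scores come from configs that differ on every field, the score difference cannot be attributed to the model alone.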

05 — Contamination: Why OpenAI quietly walked away from Verified.

There is a second reason to distrust a saturated Verified score: contamination. OpenAI’s Frontier Evals team stopped reporting SWE-bench Verified in early 2026 after an internal audit of 138 problematic tasks found that more than 60% were unsolvable as written due to flawed tests — and that frontier models could reproduce the gold-patch solutions verbatim from just the task ID, a clear fingerprint of training-data contamination.

Independent research backs the concern: one study found that 32.67% of successful SWE-bench Verified patches involved solution leakage, and that models recall the correct file paths from training data up to 76% of the time. When a model can “solve” a third of the tasks partly by remembering the answer, a 90%-plus score is measuring memory as much as capability. Flawed questions are not unique to coding either — Epoch AI’s discovery of errors in a corrected math benchmark shows how often the test itself, not the model, is the thing that needs auditing.
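To see how much a leakage rate of that size could matter, here is a back-of-envelope what-if. It is not a measured result: the 32.67% figure comes from one study's own sample of models and patches, and applying it to any other model's score is an assumption made purely for illustration. The function name and the 90% input are likewise hypothetical.

```python
def leakage_adjusted_rate(reported_rate: float, leakage_share: float) -> float:
    """Resolve rate left after discounting successes that involved leakage.

    Assumes `leakage_share` of the successful patches would not count at all,
    which is a deliberately pessimistic simplification.
    """
    if not 0.0 <= reported_rate <= 1.0 or not 0.0 <= leakage_share <= 1.0:
        raise ValueError("rates must be between 0 and 1")
    return reported_rate * (1.0 - leakage_share)


reported = 0.90         # hypothetical headline score
leakage_share = 0.3267  # share of successful patches with leakage, from the study

adjusted = leakage_adjusted_rate(reported, leakage_share)
print(f"{adjusted:.1%}")  # 60.6%
```

The point of the exercise is the order of magnitude rather than the exact figure: under this assumption, a score in the 90s shrinks to something much closer to the standardized Pro numbers.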

"The assessment no longer measures coding capability of our agents, but like the agent's ability to like correctly guess how to name a specific function."— Mia Glaese, OpenAI Frontier Evals

The editorial asymmetry is worth sitting with. One of the labs that helped make SWE-bench Verified the industry yardstick has concluded it no longer measures what it claims to — yet the benchmark still anchors the headline of nearly every coding-model launch, including the 95% Verified figure marketed for Fable 5. A benchmark does not stop being quoted just because the people closest to it stopped trusting it.

What this changes

A Verified score above ~80% is best read as a tier filter, not a ranking. It tells you a model is in the “serious candidate” bracket worth evaluating further; it does not reliably tell you that a 95% model will outperform an 88% model on your codebase, because much of the gap is scaffold and memorization rather than capability.

06 — The Private Cliff: On a private codebase, every score drops.

The most decision-relevant number is also the least quoted. Scale maintains a private commercial SWE-bench Pro subset — 276 tasks drawn from proprietary codebases the models could not have trained on — and every model’s score falls when it moves there. Claude Opus 4.6 (thinking) leads it at 47.1%, down from 51.9% on the public set. GPT-5.4 xHigh drops from 59.1% to 43.4%, a 15.7-point fall. The public set, built from copyleft repos, is the lower bound of contamination risk; the private set is the closest available proxy for what a model does on code it has genuinely never seen.
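The two public-to-private drops quoted above are easy to verify, and computing them side by side shows something the prose only implies: the leader changes between the two sets. The sketch uses only the four scores given in the text; the variable names are our own.

```python
# (public SEAL score, private SEAL score) in percent, from the text above.
seal = {
    "Claude Opus 4.6 (thinking)": (51.9, 47.1),
    "GPT-5.4 (xHigh)": (59.1, 43.4),
}

# Point drop when moving from the public set to the private subset.
drops = {model: round(public - private, 1) for model, (public, private) in seal.items()}

# Which model leads on each set, among the two with both numbers available.
public_leader = max(seal, key=lambda m: seal[m][0])
private_leader = max(seal, key=lambda m: seal[m][1])

print(drops)           # {'Claude Opus 4.6 (thinking)': 4.8, 'GPT-5.4 (xHigh)': 15.7}
print(public_leader)   # GPT-5.4 (xHigh)
print(private_leader)  # Claude Opus 4.6 (thinking)
```

Among the models with both scores, the ranking on public tasks does not survive the move to private code, which is a further reason to treat rank order with caution.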

For any team working on a proprietary codebase — which is most teams — the private-subset numbers are the honest expectation-setter. They say that even the best agents resolve fewer than half of realistic, never-before-seen engineering tasks unaided. That is not a reason to avoid these tools; it is a reason to deploy them with a human in the loop and to measure their real hit rate on your own backlog rather than trusting a 95% headline.

Verified headline (marketing-grade): 95.0%, Claude Fable 5. The number that leads the launch coverage. Independently confirmed for Fable 5 by vals.ai, but still measured on a vendor scaffold against a contamination-prone public set.

SEAL public (apples-to-apples): 59.1%, the best standardized score. GPT-5.4 xHigh tops Scale's standardized SEAL public board at 59.1%, the highest score any model reaches when every model runs the same harness on the public Pro set.

SEAL private (decision-grade): 47.1%, the best score on proprietary code. Claude Opus 4.6 (thinking) leads the 276-task private commercial subset at 47.1%. It is the closest proxy for performance on code a model has never seen, and the number to plan around.

07 — Reading It For Buying: How to actually use these numbers.

The reframe that makes benchmarks useful is to treat them as tier filters, not ordinal rankings. Any model above roughly 80% on Verified and 45% on the SEAL-standardized Pro set is in the deploy-and-evaluate bracket; the exact rank inside that bracket is mostly noise for production purposes. Use the leaderboard to build a shortlist, then decide on your own tasks.
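The tier-filter rule above translates directly into a small function. This is a sketch under stated assumptions: the two thresholds are the rough 80% and 45% floors from the text, while the tier labels and the choice to report "insufficient data" when either score is missing are ours. Scores are taken from the matrix earlier in the article.

```python
from typing import Optional

VERIFIED_FLOOR = 80.0  # rough SWE-bench Verified floor from the text
SEAL_PRO_FLOOR = 45.0  # rough SEAL-standardized Pro floor from the text


def tier(verified: Optional[float], seal_pro: Optional[float]) -> str:
    """Classify a model into a coarse bracket instead of ranking it."""
    if verified is None or seal_pro is None:
        return "insufficient data"
    if verified > VERIFIED_FLOOR and seal_pro > SEAL_PRO_FLOOR:
        return "deploy-and-evaluate"
    return "below bracket"


# (Verified, SEAL public Pro) in percent; None marks an unavailable score.
candidates = {
    "Gemini 3.1 Pro": (80.6, 46.1),
    "Claude Opus 4.5": (80.9, 45.9),
    "Claude Fable 5": (95.0, None),
    "GPT-5.4 (xHigh)": (None, 59.1),
}

for model, (verified, seal_pro) in candidates.items():
    print(f"{model}: {tier(verified, seal_pro)}")
```

Note that the function never returns an ordering. Two models in the same bracket are treated as equal until your own evaluation separates them, and a model with a missing score is flagged rather than assumed to pass.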

Shortlisting: SWE-bench Verified (filter, do not rank). Use it as a coarse pass/fail filter: above ~80% means worth testing. Do not use the rank order to choose between two strong models; 99 of 100 entries are self-reported, and the top of the board is both saturated and exposed to contamination.

Apples-to-apples: SEAL standardized Pro (trust for ranking). Scale's standardized SEAL public board is the only place every model runs the same harness. This is the comparison to trust when you need to rank capability rather than scaffolding.

Proprietary code: SEAL private subset (plan around this). The 276-task private commercial set is the closest proxy for your never-seen codebase. Best scores sit below 50%, so set expectations and staffing around that, not around the Verified headline.

Your decision: your own eval harness (run your own). Nothing substitutes for measuring real success on your backlog with your tooling. The harness you build moves the result as much as the model does, so budget for building and measuring it.
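When you do measure a hit rate on your own backlog, the sample will usually be small, so it is worth reporting an interval instead of a single percentage. The sketch below uses the standard Wilson score interval for a binomial proportion; the function name and the example counts (23 resolved out of 50 tasks) are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a resolve rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical backlog run: 23 of 50 tasks resolved.
low, high = wilson_interval(23, 50)
print(f"observed 46.0%, plausible range {low:.1%} to {high:.1%}")
```

On a sample this size the interval spans tens of points, which is why a difference of a few points between two models on a small internal eval should not drive a purchase on its own.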

Cost belongs in the same frame. Fable 5 launched at $10 input / $50 output per million tokens with a 1M-token context window, which is a material premium over the available alternatives, and its production safety guardrails fall back to Opus 4.8 for sensitive request classes — meaning a team buying on the Fable 5 benchmark may be partly paying for Opus 4.8 behavior in practice. We work the cost-per-capability trade-off in detail in our guide to AI coding tool pricing and seat economics. For teams that want this comparison run on their own repositories rather than on a leaderboard, our AI digital transformation engagements start with exactly this kind of standardized, scaffold-controlled eval, and our web development team wires the winning model into a harness tuned for your codebase.
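To put the list price in the same frame as the resolve rate, a rough cost-per-resolved-task calculation helps. Only the $10 and $50 per-million-token prices come from the text; the token counts per attempt and the 50% resolve rate below are placeholder assumptions, and should be replaced with numbers measured on your own backlog.

```python
def cost_per_attempt(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 10.0,   # USD per million input tokens (from the text)
    output_price_per_m: float = 50.0,  # USD per million output tokens (from the text)
) -> float:
    """Token cost in USD for one agent attempt at a task."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)


def cost_per_resolved_task(attempt_cost: float, resolve_rate: float) -> float:
    """Expected spend per successfully resolved task at a given resolve rate."""
    if not 0.0 < resolve_rate <= 1.0:
        raise ValueError("resolve_rate must be in (0, 1]")
    return attempt_cost / resolve_rate


# Placeholder usage figures, not measurements.
attempt = cost_per_attempt(input_tokens=400_000, output_tokens=20_000)
print(f"per attempt:  ${attempt:.2f}")                               # $5.00
print(f"per resolved: ${cost_per_resolved_task(attempt, 0.5):.2f}")  # $10.00
```

Dividing by the resolve rate is what ties price to capability: halving the resolve rate doubles the effective cost of every task that actually gets fixed.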

"Benchmark scores tell you which AI models are worth testing further, not which model will work for your users."— pioneer.ai, How to Choose the Best Coding Models in 2026

08 — Conclusion: The headline number is the least useful one.

Reading benchmarks in 2026

A single benchmark name now points to four different numbers — pick the one that matches your decision.

SWE-bench Verified made model evaluation legible, and then the frontier saturated it. As of June 16, 2026, the top of the board is a cluster of self-reported 80%-to-95% scores measured on vendor scaffolds against a contamination-prone public set — useful as a tier filter, misleading as a ranking. The 95% headline for Fable 5 is real and independently confirmed, but it is also not the number that predicts what the model does on your code this week, especially for a model that was suspended three days after launch.

The honest hierarchy runs the other way from the coverage. The standardized SEAL Pro board is where capability is comparable across models on the same harness; the private commercial subset, where the best scores sit below 50%, is the closest proxy for proprietary work; and your own eval on your own backlog is the only number that actually decides anything. The harness you build around a model will move your results by 10 to 20 points — the same margin that separates vendor scores from standardized ones.

The broader signal is that the benchmark era of single-number model comparison is ending. When one name carries four numbers and the scaffold can outweigh the model, the discipline that wins is not picking the highest row on a leaderboard — it is building a standardized, scaffold-controlled evaluation against the work you actually do, and treating every published score as a starting point rather than a verdict.

SWE-bench in 2026: 95% Headlines vs Scaffolding Reality
