AI Development · Decision Matrix · 10 min read · Published June 16, 2026

One model · three different SWE-bench Pro scores · the harness moves results 10–20 points

SWE-bench in 2026: 95% Headlines vs Scaffolding Reality

As of June 16, 2026, Claude Fable 5 tops SWE-bench Verified at 95.0% — but 99 of the 100 leaderboard entries are self-reported, and the same family of models scores far lower on a standardized harness. The number that sells a model and the number that predicts your results are rarely the same number.

Digital Applied Team
Senior strategists · Published June 16, 2026
Sources: 8 leaderboards & papers
Fable 5 · SWE-bench Verified: 95.0% (leaderboard rank #1, independently confirmed by vals.ai)
Leaderboard entries verified: 1/100 (the other 99 are self-reported)
Claude vendor-vs-SEAL gap: 17.3 pt (SWE-bench Pro, vendor scaffold vs standardized)
Same model, 3 agents: 5.2 pt (harness-only spread)

SWE-bench Verified is the benchmark every coding-model launch quotes, and as of June 16, 2026 the headline is striking: Claude Fable 5 sits at the top of the llm-stats leaderboard with a reported 95.0%. The problem is what that number hides. Ninety-nine of the hundred entries on that board are self-reported, the same model family scores far lower on a standardized harness, and one of the labs that helped make the benchmark famous has quietly stopped reporting it at all.

This is not a story about which model is best. It is a story about why a single benchmark name now points to three or four genuinely different numbers — and why the agent scaffold around a model can move a score by 10 to 20 points without changing the model at all. For anyone choosing a coding model for real work, the rank order on a leaderboard is the least useful thing on the page.

Below we separate the four distinct measurements hiding behind “SWE-bench,” show where the vendor numbers and the standardized numbers diverge, explain the contamination problem that pushed OpenAI to walk away, and lay out how to read all of it when you actually have to pick a model. It builds on our broader SWE-bench live leaderboard Q2 2026 analysis with a sharper focus on the harness.

Key takeaways
  1. The headline rank is mostly self-reported. The llm-stats SWE-bench Verified leaderboard listed 100 models on June 16, 2026, but only 1 result was independently verified; the other 99 were submitted by the vendors themselves. Fable 5's 95.0% is the rare independently confirmed one (vals.ai).
  2. One benchmark name, several different numbers. SWE-bench Verified (self-reported), SWE-bench Pro on a vendor scaffold, SWE-bench Pro on Scale's standardized SEAL harness, and SWE-bench Pro on a private commercial subset can all carry the same model and produce sharply different results.
  3. The harness can be worth more than the model. Three different agent systems running the same Claude Opus 4.5 produced a 50.2%–55.4% range on SWE-bench Pro, a 5.2-point spread from scaffold differences alone. Scale's own analysis attributes 10–20 point swings to harness choices.
  4. Vendor Pro scores sit well above standardized ones. Anthropic reports 69.2% for Opus 4.8 on SWE-bench Pro using its own scaffold, while the best Claude score on Scale's standardized SEAL board is 51.9% (Opus 4.6 thinking), a 17.3-point gap within a single model family.
  5. Buy on the standardized and private numbers, not the headline. For purchase decisions, treat SWE-bench Verified as a pass/fail tier filter and lean on SEAL-standardized SWE-bench Pro plus the private-codebase subset: the only numbers measured the same way for every model, on tasks closer to proprietary work.

01 · The June Snapshot
What the SWE-bench Verified board actually says today.

SWE-bench was published at ICLR 2024 by Princeton NLP, framed around a single question: can language models resolve real-world GitHub issues? The original set drew 2,294 problems from real repositories. SWE-bench Verified is the 500-task, human-validated subset that became the industry-standard headline number, drawn from popular open-source Python projects including Django, SymPy, scikit-learn, pytest, Flask, and matplotlib (the list is illustrative, not exhaustive).

On the llm-stats SWE-bench Verified leaderboard as of June 16, 2026, the top of the board is dominated by the Claude family. Claude Fable 5 leads at 95.0% — a figure vals.ai independently confirms at 95.00%. Claude Mythos Preview follows at 93.9% (a model restricted to Project Glasswing partners, not generally available), then Opus 4.8 at 88.6%, Opus 4.7 at 87.6%, GPT-5.5 at a reported 82.60% (vals.ai), and Gemini 3.1 Pro at 80.6% on third-party leaderboard aggregation.

SWE-bench Verified · top of the board · June 16, 2026

Source: llm-stats.com & vals.ai SWE-bench Verified leaderboards, retrieved June 16, 2026 (mostly self-reported)
Model | SWE-bench Verified | Notes
Claude Fable 5 | 95.0% | rank #1; independently confirmed by vals.ai
Claude Mythos Preview | 93.9% | Project Glasswing partners only
Claude Opus 4.8 | 88.6% | released May 28, 2026
Claude Opus 4.7 | 87.6% | self-reported
GPT-5.5 | 82.60% | vals.ai, June 13, 2026
Gemini 3.1 Pro | 80.6% | leaderboard-aggregated
Availability caveat
The two models at the very top are not freely deployable. Fable 5 launched June 9, 2026 but was suspended on June 12, 2026 under US export-control requirements, with an expected return around July 1, 2026; the currently available Claude alternative is Opus 4.8. Mythos Preview is restricted to Project Glasswing partners. A 95% headline against a model you cannot reliably buy this week is a marketing number first and a procurement number second.

02 · The Reporting Problem
Ninety-nine of a hundred results are self-reported.

The single most important fact about that leaderboard rarely appears in the coverage that cites it. Of the 100 models listed on llm-stats as of June 16, 2026, only one carries an independent verification badge — the other 99 scores were submitted by the model vendors themselves. Most reporting treats the board as an objective ranking; it is closer to a self-attested press-release aggregator with a single audited row.

That matters because each vendor runs its own evaluation harness — its own scaffold of tool definitions, retry logic, context management, and prompting around the raw model. SWE-bench gives an agent a Docker container with the target repo, the issue text, and a test runner, but crucially not the failing test itself. The agent has to discover the failing tests, understand the issue, and produce a patch that flips the failing tests to passing without breaking the ones already passing. How well the surrounding scaffold supports that discovery is a vendor choice — and a self-reported number bakes the vendor’s best scaffold into the score.
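The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not the official SWE-bench evaluator: `TaskResult`, `is_resolved`, and `resolve_rate` are hypothetical names, and the two test groups are modeled as plain dictionaries mapping a test name to whether it passes once the agent's patch is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after the agent's patch is applied (hypothetical shape)."""
    fail_to_pass: dict[str, bool]  # tests that failed before the patch: do they pass now?
    pass_to_pass: dict[str, bool]  # tests that passed before the patch: do they still pass?


def is_resolved(result: TaskResult) -> bool:
    """Resolved only if every target test now passes and nothing else broke."""
    fixed = all(result.fail_to_pass.values())
    unbroken = all(result.pass_to_pass.values())
    # A task with no target tests cannot count as resolved vacuously.
    return bool(result.fail_to_pass) and fixed and unbroken


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved, the kind of figure a leaderboard reports."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

Note that a patch which fixes the issue but breaks one previously passing test scores zero for that task, which is one reason scaffold features such as retry logic and readable error feedback can change the final percentage.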

"SWE-bench scores are harness-dependent."— benchmarkingagents.com, SWE-bench Verified Explained: 2026 Methodology, Tiers, Caveats

Independent benchmarking analysis goes further, noting that any Verified score above 80% “warrants scrutiny about harness and tool access” — which is precisely the band the entire top of the June 2026 board now occupies. When the leading dozen models cluster between 80% and 95% on a self-reported benchmark, the rank order stops carrying much signal and the methodology underneath it starts carrying all of it.

03 · The Three-Score Problem
One model, three different SWE-bench Pro numbers.

SWE-bench Pro is the harder, more contamination-resistant successor: 1,865 total tasks (731 public, 858 held-out, 276 commercial) spanning 41 repositories in Python, Go, TypeScript, and JavaScript, with tasks averaging 107.4 lines changed across 4.1 files. The public set deliberately uses copyleft-licensed repositories as a legal deterrent against quietly folding them into training data. It is a better benchmark — and it produces multiple, genuinely different numbers for the same model depending on how it is run.

There are three meaningfully distinct SWE-bench Pro measurements, and no single published table normally spans all of them next to the Verified headline. The matrix below assembles the research-sourced cells we could confirm. Where a cell shows “—,” that number is not available in a primary source as of the snapshot date — for example, Fable 5 has no Scale SEAL-standardized Pro score yet (Epoch AI’s independent evaluation was still pending as of June 10, 2026), and we deliberately do not interpolate one.

SWE-bench Pro score reality check: each model shown across SWE-bench Verified (self-reported), SWE-bench Pro on a vendor scaffold, SWE-bench Pro on Scale’s standardized SEAL public harness, and SWE-bench Pro on Scale’s private commercial subset. Dashes mark numbers not available from a primary source as of June 16, 2026.
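Columns: Model | Verified (self-reported) | Pro · vendor scaffold | Pro · SEAL public | Pro · SEAL private
Claude Fable 5: Verified 95.0%; Pro vendor scaffold 80.3%
Claude Mythos Preview: Verified 93.9%; Pro vendor scaffold 77.8%
Claude Opus 4.8: Verified 88.6%; Pro vendor scaffold 69.2%
GPT-5.5: Verified 82.60%; Pro vendor scaffold 58.6%
Gemini 3.1 Pro: Verified 80.6%; Pro vendor scaffold 54.2%; Pro SEAL 46.1%
GPT-5.4 (xHigh): Pro SEAL public 59.1%; Pro SEAL private 43.4%
Muse Spark: Pro SEAL 55.0%
Claude Opus 4.6 (thinking): Pro SEAL public 51.9%; Pro SEAL private 47.1%
Claude Opus 4.5: Verified 80.9%; Pro SEAL public 45.9%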

Read the rows where the data exists and the pattern is unmistakable. Gemini 3.1 Pro falls from 80.6% on Verified to a vendor-reported 54.2% on SWE-bench Pro — a 26.4-point drop between two benchmarks wearing the same family name. Opus 4.5 falls from 80.9% on Verified to 45.9% on Scale’s standardized SEAL public set, a 35.0-point cliff. And within the Claude family on Pro alone, Anthropic’s own vendor-scaffold 69.2% for Opus 4.8 sits 17.3 points above the best standardized Claude score on the SEAL board (Opus 4.6 thinking at 51.9%).
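The three gaps quoted above are plain subtractions, and writing them down makes the comparison easy to rerun when new scores land. The sketch below uses only figures stated in this section; the dictionary layout and the `gap` helper are illustrative assumptions, not part of any leaderboard API.

```python
# Scores in percent, as quoted in this section.
VERIFIED = {"Gemini 3.1 Pro": 80.6, "Claude Opus 4.5": 80.9}
PRO_VENDOR = {"Gemini 3.1 Pro": 54.2, "Claude Opus 4.8": 69.2}
PRO_SEAL_PUBLIC = {"Claude Opus 4.5": 45.9, "Claude Opus 4.6 (thinking)": 51.9}


def gap(high: float, low: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(high - low, 1)


# Verified vs vendor-scaffold Pro, same model.
gemini_drop = gap(VERIFIED["Gemini 3.1 Pro"], PRO_VENDOR["Gemini 3.1 Pro"])

# Verified vs standardized SEAL public Pro, same model.
opus45_drop = gap(VERIFIED["Claude Opus 4.5"], PRO_SEAL_PUBLIC["Claude Opus 4.5"])

# Vendor-scaffold Pro vs best standardized Claude score (two different models).
claude_family_gap = gap(
    PRO_VENDOR["Claude Opus 4.8"],
    PRO_SEAL_PUBLIC["Claude Opus 4.6 (thinking)"],
)

print(gemini_drop, opus45_drop, claude_family_gap)  # 26.4 35.0 17.3
```

Only the first two gaps compare one model against itself. The 17.3-point figure compares two different Claude models, so it mixes a harness effect with a model-version effect.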

"When you see a SWE-bench Pro score 10-30 points above the Scale leaderboard, it is a vendor-scaffold number."— morphllm.com, SWE-bench Pro Leaderboard analysis
On the 80.3% vendor figure
Fable 5’s widely-quoted 80.3% on SWE-bench Pro is vendor-reported using Anthropic’s own scaffold, not a neutral harness — and Fable 5 does not yet appear on Scale’s standardized SEAL board at all (the best Claude entry there is Opus 4.6 thinking at 51.9%). Epoch AI’s independent evaluation was still pending as of June 10, 2026. Treat the 80.3% as a vendor claim awaiting standardized confirmation, and do not assume a SEAL-equivalent number exists for it.

04 · The Harness
What the scaffold controls, and why it moves the number.

“The harness matters” is easy to say and hard to picture. The cleanest demonstration in the research needs no abstraction at all: three different agent systems each ran the same Claude Opus 4.5 model against SWE-bench Pro and produced scores from 50.2% to 55.4% — a 5.2-point spread coming entirely from differences in how each agent managed context and tool calls. The model was held constant; only the scaffold changed. Scale AI’s own analysis puts the swing from harness choices at 10 to 20 points.

That is why a vendor harness and a standardized harness can disagree so sharply on the same weights. The table below names the variables the scaffold actually controls. None of them are the model; all of them move the score.

Harness variables taxonomy: how Scale's standardized SEAL harness and a typical vendor scaffold differ across context management, attempt budget, tool-use integration, retry logic, file navigation, and error feedback.
Harness variable | SEAL standardized | Typical vendor scaffold | Why it moves the score
Context management | Fixed, identical for all models | Tuned per model, often generous | More retained context helps the model find the right files.
Attempt / turn budget | Capped and uniform | Higher caps, sometimes multi-attempt | More turns means more chances to converge on a passing patch.
Tool-use integration | Common mini-agent toolset | Model-native, deeply optimized | Cleaner tool calls reduce wasted turns and parse failures.
Retry logic | Minimal, consistent | Custom recovery on test failure | Recovering from a failed run rescues otherwise-lost tasks.
File navigation | Standard repo tools | Bespoke search and indexing | Faster file discovery is most of the work on real repos.
Error feedback format | Raw runner output | Parsed, summarized for the model | Readable errors help the model self-correct between attempts.

The practical lesson is that a SWE-bench number is a measurement of a system — model plus scaffold — not of a model alone. When you wire a model into your own tooling, you are building a harness of your own, and its quality will move your real-world success rate by a similar margin. This is the same dynamic we documented when measuring tool-use success rates across frontier models: the integration around the model is often the deciding variable.
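One way to make "model plus scaffold" concrete is to record every harness variable next to each score, then refuse to compare two scores unless those variables match. The sketch below is an assumption throughout: `HarnessConfig`, its field names, and all the example values are hypothetical placeholders, not settings published by Scale or by any vendor.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical record of the settings behind one benchmark run."""
    model: str
    context_policy: str
    max_turns: int
    attempts_per_task: int
    toolset: str
    retry_on_test_failure: bool
    file_navigation: str
    error_feedback: str


def scaffold_differences(a: HarnessConfig, b: HarnessConfig) -> list[str]:
    """Names of every setting that differs between two runs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def same_scaffold(a: HarnessConfig, b: HarnessConfig) -> bool:
    """True when the runs differ in nothing except (possibly) the model."""
    return set(scaffold_differences(a, b)) <= {"model"}


# Placeholder values for illustration only.
standardized = HarnessConfig(
    model="model-x", context_policy="fixed", max_turns=50, attempts_per_task=1,
    toolset="common", retry_on_test_failure=False,
    file_navigation="standard", error_feedback="raw",
)
vendor = HarnessConfig(
    model="model-x", context_policy="tuned", max_turns=200, attempts_per_task=3,
    toolset="model-native", retry_on_test_failure=True,
    file_navigation="indexed-search", error_feedback="summarized",
)

print(scaffold_differences(standardized, vendor))  # every field except "model"
print(same_scaffold(standardized, replace(standardized, model="model-y")))  # True
```

The design choice here is that a score without its configuration is treated as incomparable by default, which mirrors how the section suggests reading vendor-reported numbers.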

05 · Contamination
Why OpenAI quietly walked away from Verified.

There is a second reason to distrust a saturated Verified score: contamination. OpenAI’s Frontier Evals team stopped reporting SWE-bench Verified in early 2026 after an internal audit of 138 problematic tasks found that more than 60% were unsolvable as written due to flawed tests — and that frontier models could reproduce the gold-patch solutions verbatim from just the task ID, a clear fingerprint of training-data contamination.

Independent research backs the concern: one study found that 32.67% of successful SWE-bench Verified patches involved solution leakage, and that models recall the correct file paths from training data up to 76% of the time. When a model can “solve” a third of the tasks partly by remembering the answer, a 90%-plus score is measuring memory as much as capability.
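The verbatim-recall signal can be pictured as a similarity check between a model's output and the gold patch. The sketch below is an illustration under stated assumptions, not the method used in OpenAI's audit or in the cited study: the function names and the 0.95 threshold are hypothetical, and the score comes from Python's standard `difflib` after whitespace is normalized.

```python
import difflib


def normalize(patch: str) -> str:
    """Strip indentation and blank lines so formatting noise does not hide a match."""
    lines = (line.strip() for line in patch.splitlines())
    return "\n".join(line for line in lines if line)


def recall_similarity(model_output: str, gold_patch: str) -> float:
    """Similarity in [0, 1] between a model's output and the gold patch."""
    matcher = difflib.SequenceMatcher(
        None, normalize(model_output), normalize(gold_patch), autojunk=False
    )
    return matcher.ratio()


def looks_memorized(model_output: str, gold_patch: str, threshold: float = 0.95) -> bool:
    """Flag outputs that nearly reproduce the gold patch (threshold is an assumption)."""
    return recall_similarity(model_output, gold_patch) >= threshold
```

A check like this is only meaningful when the model was given too little information to derive the patch legitimately, for example a prompt containing nothing but the task ID. A high score on a normal run may simply mean the model solved the task the same way the maintainers did.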

"The assessment no longer measures coding capability of our agents, but like the agent's ability to like correctly guess how to name a specific function."— Mia Glaese, OpenAI Frontier Evals

The editorial asymmetry is worth sitting with. One of the labs that helped make SWE-bench Verified the industry yardstick has concluded it no longer measures what it claims to — yet the benchmark still anchors the headline of nearly every coding-model launch, including the 95% Verified figure marketed for Fable 5. A benchmark does not stop being quoted just because the people closest to it stopped trusting it.

What this changes
A Verified score above ~80% is best read as a tier filter, not a ranking. It tells you a model is in the “serious candidate” bracket worth evaluating further; it does not reliably tell you that a 95% model will outperform an 88% model on your codebase, because much of the gap is scaffold and memorization rather than capability.

06 · The Private Cliff
On a private codebase, every score drops.

The most decision-relevant number is also the least quoted. Scale maintains a private commercial SWE-bench Pro subset (276 tasks drawn from proprietary codebases the models could not have trained on), and every model’s score falls when it moves there. Claude Opus 4.6 (thinking) leads it at 47.1%, down 4.8 points from 51.9% on the public set. GPT-5.4 xHigh drops from 59.1% to 43.4%, a 15.7-point fall. The public set, built from copyleft repos, reduces contamination risk without eliminating it; the private set is the closest available proxy for what a model does on code it has genuinely never seen.

For any team working on a proprietary codebase — which is most teams — the private-subset numbers are the honest expectation-setter. They say that even the best agents resolve fewer than half of realistic, never-before-seen engineering tasks unaided. That is not a reason to avoid these tools; it is a reason to deploy them with a human in the loop and to measure their real hit rate on your own backlog rather than trusting a 95% headline.
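Measuring a hit rate on your own backlog usually means a small sample, so the uncertainty matters as much as the point estimate. A minimal sketch, assuming each ticket is an independent pass/fail trial: the Wilson score interval below gives a 95% range for the true resolve rate. The function name and the example counts are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical pilot: the agent resolved 19 of 40 backlog tickets unaided.
low, high = wilson_interval(19, 40)
print(f"observed {19 / 40:.1%}, plausible range {low:.1%} to {high:.1%}")
```

With a few dozen tickets the interval is wide, which is a useful reminder that a small pilot cannot separate two models whose true hit rates differ by only a few points.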

Verified headline · marketing-grade
Fable 5 · SWE-bench Verified: 95.0%
The number that leads the launch coverage. Independently confirmed for Fable 5 by vals.ai, but still measured on a vendor scaffold against a contamination-prone public set.

SEAL public · apples-to-apples
Best standardized score: 59.1%
GPT-5.4 xHigh tops Scale's standardized SEAL public board at 59.1%, the highest score any model reaches when every model runs the same harness on the public Pro set.

SEAL private · decision-grade
Best on proprietary code: 47.1%
Claude Opus 4.6 (thinking) leads the 276-task private commercial subset at 47.1%. This is the closest proxy for performance on code a model has never seen, and the number to plan around.

07 · Reading It For Buying
How to actually use these numbers.

The reframe that makes benchmarks useful is to treat them as tier filters, not ordinal rankings. Any model above roughly 80% on Verified and 45% on the SEAL-standardized Pro set is in the deploy-and-evaluate bracket; the exact rank inside that bracket is mostly noise for production purposes. Use the leaderboard to build a shortlist, then decide on your own tasks.
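The tier-filter idea can be written as a small function. This is a sketch under assumptions: the 80% and 45% floors come from the paragraph above, but treating them as inclusive, the `tier` name, and the "needs data" outcome for a model missing one of the two scores are choices made here for illustration.

```python
from typing import Optional

VERIFIED_FLOOR = 80.0   # SWE-bench Verified, percent
SEAL_PRO_FLOOR = 45.0   # SEAL-standardized SWE-bench Pro, percent


def tier(verified: Optional[float], seal_pro: Optional[float]) -> str:
    """Classify a model as 'shortlist', 'needs data', or 'below tier'.

    None means the score is not published; it is never guessed or interpolated.
    """
    below_verified = verified is not None and verified < VERIFIED_FLOOR
    below_seal = seal_pro is not None and seal_pro < SEAL_PRO_FLOOR
    if below_verified or below_seal:
        return "below tier"
    if verified is None or seal_pro is None:
        return "needs data"
    return "shortlist"


# Scores quoted in this article; None where no figure was available.
print(tier(80.9, 45.9))  # Claude Opus 4.5: both scores known
print(tier(95.0, None))  # Claude Fable 5: no SEAL-standardized score yet
```

The function deliberately returns a bracket rather than a rank, so a 95% model and an 81% model land in the same bucket until your own evaluation separates them.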

Shortlisting · SWE-bench Verified · Filter, do not rank
Use it as a coarse pass/fail filter: above ~80% means worth testing. Do not use the rank order to choose between two strong models; 99 of 100 entries are self-reported, and the benchmark is saturated and contamination-prone.

Apples-to-apples · SEAL standardized Pro · Trust for ranking
Scale's standardized SEAL public board is the only place every model runs the same harness. This is the comparison to trust when you need to rank capability rather than scaffolding.

Proprietary code · SEAL private subset · Plan around this
The 276-task private commercial set is the closest proxy for your never-seen codebase. Best scores sit below 50%, so set expectations and staffing around that, not around the Verified headline.

Your decision · Your own eval harness · Run your own
Nothing substitutes for measuring real success on your backlog with your tooling. The harness you build moves the result as much as the model does, so budget for building and measuring it.

Cost belongs in the same frame. Fable 5 launched at $10 input / $50 output per million tokens with a 1M-token context window, which is a material premium over the available alternatives, and its production safety guardrails fall back to Opus 4.8 for sensitive request classes — meaning a team buying on the Fable 5 benchmark may be partly paying for Opus 4.8 behavior in practice. We work the cost-per-capability trade-off in detail in our guide to AI coding tool pricing and seat economics. For teams that want this comparison run on their own repositories rather than on a leaderboard, our AI digital transformation engagements start with exactly this kind of standardized, scaffold-controlled eval, and our web development team wires the winning model into a harness tuned for your codebase.
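To put cost in the same frame as hit rate, it helps to price a resolved task rather than an attempt. The sketch below uses the Fable 5 list prices quoted above ($10 input and $50 output per million tokens); the token counts and the resolve rate are invented placeholders you would replace with measurements from your own runs, and the function names are hypothetical.

```python
INPUT_PRICE_PER_MTOK = 10.0    # USD per million input tokens (quoted above)
OUTPUT_PRICE_PER_MTOK = 50.0   # USD per million output tokens (quoted above)


def cost_per_attempt(input_tokens: int, output_tokens: int) -> float:
    """USD spent on one agent attempt at a task."""
    total = input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    return total / 1_000_000


def cost_per_resolved_task(input_tokens: int, output_tokens: int, resolve_rate: float) -> float:
    """Expected USD per successfully resolved task, given a measured resolve rate in (0, 1]."""
    if not 0 < resolve_rate <= 1:
        raise ValueError("resolve_rate must be in (0, 1]")
    return cost_per_attempt(input_tokens, output_tokens) / resolve_rate


# Placeholder workload: 400,000 input and 20,000 output tokens per attempt.
attempt = cost_per_attempt(400_000, 20_000)
resolved = cost_per_resolved_task(400_000, 20_000, resolve_rate=0.5)
print(f"${attempt:.2f} per attempt, ${resolved:.2f} per resolved task")
```

With those placeholder figures an attempt costs $5.00, and at a 50% hit rate each resolved task costs $10.00. The same attempt price looks very different if the true hit rate on your code is closer to the private-subset numbers than to the Verified headline.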

"Benchmark scores tell you which AI models are worth testing further, not which model will work for your users."— pioneer.ai, How to Choose the Best Coding Models in 2026

08 · Conclusion
The headline number is the least useful one.

Reading benchmarks in 2026

A single benchmark name now points to four different numbers — pick the one that matches your decision.

SWE-bench Verified made model evaluation legible, and then the frontier saturated it. As of June 16, 2026, the top of the board is a cluster of self-reported 80%-to-95% scores measured on vendor scaffolds against a contamination-prone public set — useful as a tier filter, misleading as a ranking. The 95% headline for Fable 5 is real and independently confirmed, but it is also not the number that predicts what the model does on your code this week, especially for a model that was suspended three days after launch.

The honest hierarchy runs the other way from the coverage. The standardized SEAL Pro board is where capability is comparable across models on the same harness; the private commercial subset, where the best scores sit below 50%, is the closest proxy for proprietary work; and your own eval on your own backlog is the only number that actually decides anything. The harness you build around a model will move your results by 10 to 20 points — the same margin that separates vendor scores from standardized ones.

The broader signal is that the benchmark era of single-number model comparison is ending. When one name carries four numbers and the scaffold can outweigh the model, the discipline that wins is not picking the highest row on a leaderboard — it is building a standardized, scaffold-controlled evaluation against the work you actually do, and treating every published score as a starting point rather than a verdict.

FAQ · SWE-bench in 2026

The questions we get every week.

As of June 16, 2026, Claude Fable 5 leads the llm-stats SWE-bench Verified leaderboard at 95.0%, a figure vals.ai independently confirms at 95.00%. Claude Mythos Preview follows at 93.9% (restricted to Project Glasswing partners), then Claude Opus 4.8 at 88.6%, Claude Opus 4.7 at 87.6%, GPT-5.5 at a reported 82.60%, and Gemini 3.1 Pro at 80.6%. Two important caveats: 99 of the 100 leaderboard entries are self-reported rather than independently verified, and the two top models are not freely available — Fable 5 was suspended on June 12, 2026 under US export-control requirements (expected return around July 1), and Mythos Preview is a non-public partner model. Read the rank as a tier filter, not a definitive ordering.