SWE-bench Verified is the benchmark every coding-model launch quotes, and as of June 16, 2026 the headline is striking: Claude Fable 5 sits at the top of the llm-stats leaderboard with a reported 95.0%. The problem is what that number hides. Ninety-nine of the hundred entries on that board are self-reported, the same model family scores far lower on a standardized harness, and one of the labs that helped make the benchmark famous has quietly stopped reporting it at all.
This is not a story about which model is best. It is a story about why a single benchmark name now points to three or four genuinely different numbers — and why the agent scaffold around a model can move a score by 10 to 20 points without changing the model at all. For anyone choosing a coding model for real work, the rank order on a leaderboard is the least useful thing on the page.
Below we separate the four distinct measurements hiding behind “SWE-bench,” show where the vendor numbers and the standardized numbers diverge, explain the contamination problem that pushed OpenAI to walk away, and lay out how to read all of it when you actually have to pick a model. It builds on our broader SWE-bench live leaderboard Q2 2026 analysis with a sharper focus on the harness.
- 01. The headline rank is mostly self-reported. The llm-stats SWE-bench Verified leaderboard listed 100 models on June 16, 2026, but only 1 result was independently verified; the other 99 were submitted by the vendors themselves. Fable 5's 95.0% is the rare independently confirmed one (vals.ai).
- 02. One benchmark name, several different numbers. SWE-bench Verified (self-reported), SWE-bench Pro on a vendor scaffold, SWE-bench Pro on Scale's standardized SEAL harness, and SWE-bench Pro on a private commercial subset can all carry the same model and produce sharply different results.
- 03. The harness can be worth more than the model. Three different agent systems running the same Claude Opus 4.5 produced a 50.2%–55.4% range on SWE-bench Pro, a 5.2-point spread from scaffold differences alone. Scale's own analysis attributes 10–20 point swings to harness choices.
- 04. Vendor Pro scores sit well above standardized ones. Anthropic reports 69.2% for Opus 4.8 on SWE-bench Pro using its own scaffold, while the best Claude score on Scale's standardized SEAL board is 51.9% (Opus 4.6 thinking), a 17.3-point gap within a single model family.
- 05. Buy on the standardized and private numbers, not the headline. For purchase decisions, treat SWE-bench Verified as a pass/fail tier filter and lean on SEAL-standardized SWE-bench Pro plus the private-codebase subset: the only numbers measured the same way for every model, on tasks closer to proprietary work.
01 — The June Snapshot
What the SWE-bench Verified board actually says today.
SWE-bench was published at ICLR 2024 by Princeton NLP, framed around a single question: can language models resolve real-world GitHub issues? The original set drew 2,294 problems from real repositories. SWE-bench Verified is the 500-task, human-validated subset that became the industry-standard headline number, drawn from popular open-source Python projects including Django, SymPy, scikit-learn, pytest, Flask, and matplotlib (the list is illustrative, not exhaustive).
On the llm-stats SWE-bench Verified leaderboard as of June 16, 2026, the top of the board is dominated by the Claude family. Claude Fable 5 leads at 95.0% — a figure vals.ai independently confirms at 95.00%. Claude Mythos Preview follows at 93.9% (a model restricted to Project Glasswing partners, not generally available), then Opus 4.8 at 88.6%, Opus 4.7 at 87.6%, GPT-5.5 at a reported 82.60% (vals.ai), and Gemini 3.1 Pro at 80.6% on third-party leaderboard aggregation.
SWE-bench Verified · top of the board · June 16, 2026
Source: llm-stats.com & vals.ai SWE-bench Verified leaderboards, retrieved June 16, 2026 (mostly self-reported)
02 — The Reporting Problem
Ninety-nine of a hundred results are self-reported.
The single most important fact about that leaderboard rarely appears in the coverage that cites it. Of the 100 models listed on llm-stats as of June 16, 2026, only one carries an independent verification badge — the other 99 scores were submitted by the model vendors themselves. Most reporting treats the board as an objective ranking; it is closer to a self-attested press-release aggregator with a single audited row.
That matters because each vendor runs its own evaluation harness — its own scaffold of tool definitions, retry logic, context management, and prompting around the raw model. SWE-bench gives an agent a Docker container with the target repo, the issue text, and a test runner, but crucially not the failing test itself. The agent has to discover the failing tests, understand the issue, and produce a patch that flips the failing tests to passing without breaking the ones already passing. How well the surrounding scaffold supports that discovery is a vendor choice — and a self-reported number bakes the vendor’s best scaffold into the score.
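The pass criterion described above can be written as a small check. This is a minimal sketch, not the official evaluation code: the function name `is_resolved` and the `results` mapping are hypothetical, while `fail_to_pass` and `pass_to_pass` mirror the `FAIL_TO_PASS` and `PASS_TO_PASS` fields in the SWE-bench dataset.

```python
def is_resolved(fail_to_pass, pass_to_pass, results):
    """Return True if a candidate patch counts as resolving the task.

    fail_to_pass: tests that fail before the patch and must pass after it.
    pass_to_pass: tests that already pass and must keep passing.
    results: hypothetical mapping of test name -> True (passed) / False (failed),
             collected by running the suite with the patch applied.
    """
    # A test missing from the results (for example, it never ran) counts as a failure.
    fixed = all(results.get(test, False) for test in fail_to_pass)
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken


# Illustrative run: the issue is fixed, but an existing test regressed.
after_patch = {
    "test_issue_repro": True,
    "test_existing_a": True,
    "test_existing_b": False,
}
print(is_resolved(["test_issue_repro"],
                  ["test_existing_a", "test_existing_b"],
                  after_patch))  # False
```

The all-or-nothing rule is why scaffold details matter: a patch that fixes the issue but breaks one existing test scores the same as no patch at all.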
"SWE-bench scores are harness-dependent."— benchmarkingagents.com, SWE-bench Verified Explained: 2026 Methodology, Tiers, Caveats
Independent benchmarking analysis goes further, noting that any Verified score above 80% “warrants scrutiny about harness and tool access” — which is precisely the band the entire top of the June 2026 board now occupies. When the leading dozen models cluster between 80% and 95% on a self-reported benchmark, the rank order stops carrying much signal and the methodology underneath it starts carrying all of it.
03 — The Three-Score Problem
One model, three different SWE-bench Pro numbers.
SWE-bench Pro is the harder, more contamination-resistant successor: 1,865 total tasks (731 public, 858 held-out, 276 commercial) spanning 41 repositories in Python, Go, TypeScript, and JavaScript, with tasks averaging 107.4 lines changed across 4.1 files. The public set deliberately uses copyleft-licensed repositories as a legal deterrent against quietly folding them into training data. It is a better benchmark — and it produces multiple, genuinely different numbers for the same model depending on how it is run.
There are three meaningfully distinct SWE-bench Pro measurements, and no single published table normally spans all of them next to the Verified headline. The matrix below assembles the research-sourced cells we could confirm. Where a cell shows “—,” that number is not available in a primary source as of the snapshot date — for example, Fable 5 has no Scale SEAL-standardized Pro score yet (Epoch AI’s independent evaluation was still pending as of June 10, 2026), and we deliberately do not interpolate one.
| Model | Verified (self-reported) | Pro · vendor scaffold | Pro · SEAL public | Pro · SEAL private |
|---|---|---|---|---|
| Claude Fable 5 | 95.0% | 80.3% vendor | — | — |
| Claude Mythos Preview | 93.9% | 77.8% vendor | — | — |
| Claude Opus 4.8 | 88.6% | 69.2% vendor | — | — |
| GPT-5.5 | 82.60% | 58.6% vendor | — | — |
| Gemini 3.1 Pro | 80.6% | 54.2% vendor | 46.1% | — |
| GPT-5.4 (xHigh) | — | — | 59.1% | 43.4% |
| Muse Spark | — | — | 55.0% | — |
| Claude Opus 4.6 (thinking) | — | — | 51.9% | 47.1% |
| Claude Opus 4.5 | 80.9% | — | 45.9% | — |
Read the rows where the data exists and the pattern is unmistakable. Gemini 3.1 Pro falls from 80.6% on Verified to a vendor-reported 54.2% on SWE-bench Pro — a 26.4-point drop between two benchmarks wearing the same family name. Opus 4.5 falls from 80.9% on Verified to 45.9% on Scale’s standardized SEAL public set, a 35.0-point cliff. And within the Claude family on Pro alone, Anthropic’s own vendor-scaffold 69.2% for Opus 4.8 sits 17.3 points above the best standardized Claude score on the SEAL board (Opus 4.6 thinking at 51.9%).
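The gaps quoted in this section are plain subtraction over the matrix, and they can be reproduced in a few lines. A minimal sketch: the dictionary layout and the helper name `gap` are illustrative, and the scores are copied from the table rather than fetched from any leaderboard.

```python
# Scores copied from the matrix above (percent of tasks resolved).
verified = {"Gemini 3.1 Pro": 80.6, "Claude Opus 4.5": 80.9}
pro_vendor = {"Gemini 3.1 Pro": 54.2, "Claude Opus 4.8": 69.2}
pro_seal_public = {"Claude Opus 4.5": 45.9, "Claude Opus 4.6 (thinking)": 51.9}


def gap(high, low):
    """Difference in percentage points, rounded to one decimal to hide float noise."""
    return round(high - low, 1)


gemini_drop = gap(verified["Gemini 3.1 Pro"], pro_vendor["Gemini 3.1 Pro"])
opus45_drop = gap(verified["Claude Opus 4.5"], pro_seal_public["Claude Opus 4.5"])
claude_family_gap = gap(pro_vendor["Claude Opus 4.8"],
                        pro_seal_public["Claude Opus 4.6 (thinking)"])

print(gemini_drop, opus45_drop, claude_family_gap)  # 26.4 35.0 17.3
```

Note that the third figure compares two different models under two different harnesses, so it bounds the vendor-versus-standardized gap for the family rather than isolating the scaffold effect.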
"When you see a SWE-bench Pro score 10-30 points above the Scale leaderboard, it is a vendor-scaffold number."— morphllm.com, SWE-bench Pro Leaderboard analysis
04 — The Harness
What the scaffold controls — and why it moves the number.
“The harness matters” is easy to say and hard to picture. The cleanest demonstration in the research needs no abstraction at all: three different agent systems each ran the same Claude Opus 4.5 model against SWE-bench Pro and produced scores from 50.2% to 55.4% — a 5.2-point spread coming entirely from differences in how each agent managed context and tool calls. The model was held constant; only the scaffold changed. Scale AI’s own analysis puts the swing from harness choices at 10 to 20 points.
That is why a vendor harness and a standardized harness can disagree so sharply on the same weights. The table below names the variables the scaffold actually controls. None of them are the model; all of them move the score.
| Harness variable | SEAL standardized | Typical vendor scaffold | Why it moves the score |
|---|---|---|---|
| Context management | Fixed, identical for all models | Tuned per model, often generous | More retained context helps the model find the right files. |
| Attempt / turn budget | Capped and uniform | Higher caps, sometimes multi-attempt | More turns means more chances to converge on a passing patch. |
| Tool-use integration | Common mini-agent toolset | Model-native, deeply optimized | Cleaner tool calls reduce wasted turns and parse failures. |
| Retry logic | Minimal, consistent | Custom recovery on test failure | Recovering from a failed run rescues otherwise-lost tasks. |
| File navigation | Standard repo tools | Bespoke search and indexing | Faster file discovery is most of the work on real repos. |
| Error feedback format | Raw runner output | Parsed, summarized for the model | Readable errors help the model self-correct between attempts. |
The practical lesson is that a SWE-bench number is a measurement of a system — model plus scaffold — not of a model alone. When you wire a model into your own tooling, you are building a harness of your own, and its quality will move your real-world success rate by a similar margin. This is the same dynamic we documented when measuring tool-use success rates across frontier models: the integration around the model is often the deciding variable.
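One way to keep that lesson honest in your own records is to store the scaffold settings next to every score and refuse to rank runs that were measured differently. The sketch below is hypothetical throughout: the class names, field names, and all values are placeholders that mirror the table's six variables, not the configuration of SEAL or of any vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical record of the scaffold a score was measured under."""
    context_policy: str     # e.g. "fixed" or "tuned-per-model"
    max_turns: int          # attempt / turn budget
    toolset: str            # e.g. "common" or "model-native"
    retry_on_failure: bool  # custom recovery after a failed test run
    file_search: str        # e.g. "standard" or "indexed"
    error_format: str       # "raw" runner output or "summarized"


@dataclass(frozen=True)
class ScoredRun:
    model: str
    harness: HarnessConfig
    resolved_pct: float


def comparable(a: ScoredRun, b: ScoredRun) -> bool:
    """Two scores can rank models only if every harness setting matches."""
    return a.harness == b.harness


# Placeholder settings and scores, for illustration only.
shared = HarnessConfig("fixed", 50, "common", False, "standard", "raw")
tuned = HarnessConfig("tuned-per-model", 200, "model-native", True, "indexed", "summarized")

run_a = ScoredRun("model-a", shared, 48.0)
run_b = ScoredRun("model-b", shared, 52.0)
run_c = ScoredRun("model-a", tuned, 63.0)

print(comparable(run_a, run_b))  # True: same harness, the gap reflects the models
print(comparable(run_a, run_c))  # False: same model, the gap reflects the scaffold
```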
05 — Contamination
Why OpenAI quietly walked away from Verified.
There is a second reason to distrust a saturated Verified score: contamination. OpenAI’s Frontier Evals team stopped reporting SWE-bench Verified in early 2026 after an internal audit of 138 problematic tasks found that more than 60% were unsolvable as written due to flawed tests — and that frontier models could reproduce the gold-patch solutions verbatim from just the task ID, a clear fingerprint of training-data contamination.
Independent research backs the concern: one study found that 32.67% of successful SWE-bench Verified patches involved solution leakage, and that models recall the correct file paths from training data up to 76% of the time. When roughly a third of a model's successful patches lean on a leaked solution, a 90%-plus score is measuring memory as much as capability.
"The assessment no longer measures coding capability of our agents, but like the agent's ability to like correctly guess how to name a specific function."— Mia Glaese, OpenAI Frontier Evals
The editorial asymmetry is worth sitting with. One of the labs that helped make SWE-bench Verified the industry yardstick has concluded it no longer measures what it claims to — yet the benchmark still anchors the headline of nearly every coding-model launch, including the 95% Verified figure marketed for Fable 5. A benchmark does not stop being quoted just because the people closest to it stopped trusting it.
06 — The Private Cliff
On a private codebase, every score drops.
The most decision-relevant number is also the least quoted. Scale maintains a private commercial SWE-bench Pro subset of 276 tasks drawn from proprietary codebases the models could not have trained on, and every model with a published score on both sets drops when it moves there. Claude Opus 4.6 (thinking) leads it at 47.1%, down 4.8 points from 51.9% on the public set. GPT-5.4 xHigh drops from 59.1% to 43.4%, a 15.7-point fall that also costs it the lead it holds on the public board. The public set, built from copyleft repos, reduces contamination risk but cannot eliminate it; the private set is the closest available proxy for what a model does on code it has genuinely never seen.
For any team working on a proprietary codebase — which is most teams — the private-subset numbers are the honest expectation-setter. They say that even the best agents resolve fewer than half of realistic, never-before-seen engineering tasks unaided. That is not a reason to avoid these tools; it is a reason to deploy them with a human in the loop and to measure their real hit rate on your own backlog rather than trusting a 95% headline.
Fable 5 · self-reported
The number that leads the launch coverage. Independently confirmed for Fable 5 by vals.ai, but still measured on a vendor scaffold against a contamination-prone public set.
Best standardized score
GPT-5.4 xHigh tops Scale's standardized SEAL public board at 59.1% — the highest score any model reaches when every model runs the same harness on the public Pro set.
Best on proprietary code
Claude Opus 4.6 (thinking) leads the 276-task private commercial subset at 47.1%. The closest proxy for performance on code a model has never seen — and the number to plan around.
07 — Reading It For Buying
How to actually use these numbers.
The reframe that makes benchmarks useful is to treat them as tier filters, not ordinal rankings. Any model above roughly 80% on Verified and 45% on the SEAL-standardized Pro set is in the deploy-and-evaluate bracket; the exact rank inside that bracket is mostly noise for production purposes. Use the leaderboard to build a shortlist, then decide on your own tasks.
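The tier-filter idea can be sketched as a function that returns a bracket instead of a rank. This is an illustration under stated assumptions: the thresholds are the rough 80% and 45% floors from the paragraph above, the names `tier`, `VERIFIED_FLOOR`, and `SEAL_PRO_FLOOR` are hypothetical, and a missing score is treated as "unknown" rather than guessed.

```python
from typing import Optional

VERIFIED_FLOOR = 80.0  # "roughly 80%" on SWE-bench Verified
SEAL_PRO_FLOOR = 45.0  # "45%" on the SEAL-standardized Pro public set


def tier(verified: Optional[float], seal_pro: Optional[float]) -> str:
    """Classify a model as 'shortlist', 'below', or 'unknown'. Never ranks."""
    known_below = (
        (verified is not None and verified < VERIFIED_FLOOR)
        or (seal_pro is not None and seal_pro < SEAL_PRO_FLOOR)
    )
    if known_below:
        return "below"
    if verified is None or seal_pro is None:
        return "unknown"  # a missing score is not interpolated
    return "shortlist"


# (Verified, SEAL public Pro) pairs copied from the matrix; None means not published.
scores = {
    "Claude Fable 5": (95.0, None),
    "Gemini 3.1 Pro": (80.6, 46.1),
    "Claude Opus 4.5": (80.9, 45.9),
    "GPT-5.4 (xHigh)": (None, 59.1),
}

for model, (verified_pct, seal_pct) in scores.items():
    print(f"{model}: {tier(verified_pct, seal_pct)}")
```

On the published cells, only Gemini 3.1 Pro and Claude Opus 4.5 have both numbers, and both clear the floors. Fable 5 and GPT-5.4 xHigh come back "unknown" because one column is empty, which is itself useful information when building a shortlist.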
SWE-bench Verified
Use it as a coarse pass/fail filter: above ~80% means worth testing. Do not use the rank order to choose between two strong models; 99 of 100 entries are self-reported, and the benchmark itself is saturated and contamination-prone.
SEAL standardized Pro
Scale's standardized SEAL public board is the only place every model runs the same harness. This is the comparison to trust when you need to rank capability rather than scaffolding.
SEAL private subset
The 276-task private commercial set is the closest proxy for your never-seen codebase. Best scores sit below 50% — set expectations and staffing around that, not around the Verified headline.
Your own eval harness
Nothing substitutes for measuring real success on your backlog with your tooling. The harness you build moves the result as much as the model does — budget for building and measuring it.
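When you measure on your own backlog, the sample is usually small, so it helps to report an interval rather than a single percentage. The sketch below uses the standard Wilson score interval for a binomial proportion; the function name and the 18-of-40 example are hypothetical, and it assumes tasks are independent and each is attempted once.

```python
import math


def wilson_interval(resolved: int, attempted: int, z: float = 1.96):
    """Wilson score interval for a resolve rate (z=1.96 gives about 95%)."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    if not 0 <= resolved <= attempted:
        raise ValueError("resolved must be between 0 and attempted")
    p = resolved / attempted
    denom = 1 + z ** 2 / attempted
    center = (p + z ** 2 / (2 * attempted)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / attempted + z ** 2 / (4 * attempted ** 2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical backlog run: 18 of 40 tickets resolved without human edits.
low, high = wilson_interval(resolved=18, attempted=40)
print(f"{18 / 40:.1%} resolved, 95% interval {low:.1%} to {high:.1%}")
```

With 40 tasks, a 45% hit rate is compatible with anything from roughly 31% to 60%, which is wider than most of the leaderboard gaps discussed above. A small internal eval can confirm a tier, but it takes a larger sample to separate two models in the same bracket.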
Cost belongs in the same frame. Fable 5 launched at $10 input / $50 output per million tokens with a 1M-token context window, which is a material premium over the available alternatives, and its production safety guardrails fall back to Opus 4.8 for sensitive request classes — meaning a team buying on the Fable 5 benchmark may be partly paying for Opus 4.8 behavior in practice. We work the cost-per-capability trade-off in detail in our guide to AI coding tool pricing and seat economics. For teams that want this comparison run on their own repositories rather than on a leaderboard, our AI digital transformation engagements start with exactly this kind of standardized, scaffold-controlled eval, and our web development team wires the winning model into a harness tuned for your codebase.
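A simple way to put price and hit rate in one number is cost per resolved task: what you pay per attempt divided by the fraction of attempts that succeed. The sketch below uses the Fable 5 launch prices quoted above; the token counts and the 45% resolve rate are hypothetical placeholders, and it assumes one attempt per task with failed attempts still billed.

```python
INPUT_USD_PER_MTOK = 10.0   # Fable 5 launch price, input, per million tokens
OUTPUT_USD_PER_MTOK = 50.0  # Fable 5 launch price, output, per million tokens


def cost_per_attempt(input_tokens: int, output_tokens: int) -> float:
    """USD spent on one agent attempt at the prices above."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


def cost_per_resolved_task(input_tokens: int, output_tokens: int,
                           resolve_rate: float) -> float:
    """Expected USD per successfully resolved task."""
    if not 0 < resolve_rate <= 1:
        raise ValueError("resolve_rate must be in (0, 1]")
    return cost_per_attempt(input_tokens, output_tokens) / resolve_rate


# Hypothetical workload: 400k input and 20k output tokens per attempt.
attempt_usd = cost_per_attempt(400_000, 20_000)
resolved_usd = cost_per_resolved_task(400_000, 20_000, resolve_rate=0.45)
print(f"${attempt_usd:.2f} per attempt, ${resolved_usd:.2f} per resolved task")
```

Plugging in your own measured resolve rate rather than a leaderboard figure matters here: the same per-attempt cost roughly doubles per resolved task if the real hit rate is 45% instead of 95%.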
"Benchmark scores tell you which AI models are worth testing further, not which model will work for your users."— pioneer.ai, How to Choose the Best Coding Models in 2026
08 — Conclusion
The headline number is the least useful one.
A single benchmark name now points to four different numbers — pick the one that matches your decision.
SWE-bench Verified made model evaluation legible, and then the frontier saturated it. As of June 16, 2026, the top of the board is a cluster of self-reported 80%-to-95% scores measured on vendor scaffolds against a contamination-prone public set — useful as a tier filter, misleading as a ranking. The 95% headline for Fable 5 is real and independently confirmed, but it is also not the number that predicts what the model does on your code this week, especially for a model that was suspended three days after launch.
The honest hierarchy runs the other way from the coverage. The standardized SEAL Pro board is where capability is comparable across models on the same harness; the private commercial subset, where the best scores sit below 50%, is the closest proxy for proprietary work; and your own eval on your own backlog is the only number that actually decides anything. The harness you build around a model will move your results by 10 to 20 points — the same margin that separates vendor scores from standardized ones.
The broader signal is that the benchmark era of single-number model comparison is ending. When one name carries four numbers and the scaffold can outweigh the model, the discipline that wins is not picking the highest row on a leaderboard — it is building a standardized, scaffold-controlled evaluation against the work you actually do, and treating every published score as a starting point rather than a verdict.