AI Development · Methodology · 14 min read · Published May 17, 2026

Who built the eval matters as much as who scored it

SWE-Bench vs Terminal-Bench: Governance First

SWE-Bench Verified, Pro, Multilingual, and Live each measure different things. Terminal-Bench is Stanford-published. CursorBench is built by Cursor. The harness is half the score. This is the governance-first guide to reading AI coding benchmark claims without getting fooled.

Digital Applied Team
Senior strategists · Published May 17, 2026
Sources: 17 primary sources
Cross-model Verified-to-Pro gap: 20–25 points on average, structural (Verified ≠ Pro)
Opus 4.6 gap: 27 points (80.8 Verified vs 53.4 Pro)
Opus 4.7 gap: 23 points (87.6 Verified vs 64.3 Pro)
Harness swing: 10–20 points on the same model weights (harness = half the score)

SWE-Bench has become the de-facto standard for coding-agent evaluation, but the benchmark family now has five distinct variants — original, Verified, Pro, Multilingual, and Live — and comparing scores across them is methodological malpractice. Add harness variance of 10–20 percentage points on identical model weights, a proliferation of vendor-controlled evals like CursorBench and Aider polyglot, and an emerging generation of tool-use benchmarks anchored by MCP Atlas and OSWorld, and the signal-to-noise ratio in AI coding benchmarks has rarely been worse.

The stakes are real. Engineering teams use benchmark claims to select coding agents, negotiate AI infrastructure spend, and make hiring decisions around automation. When those claims mix variants, hide harness choices, or originate from the same lab that built the model, they systematically mislead. The gap between a vendor's published 87.6% and a reproduced 64.3% on the same model is not noise — it is a structural feature of how this benchmark family is designed and a predictable consequence of letting vendors control their own evals.

This guide covers the methodology behind every major coding and agent benchmark in use as of May 2026: what each variant measures, who governs it, how the harness affects scores, and how to apply a five-question checklist to any new model release. The governance lens — not the leaderboard position — is the load-bearing insight.

Key takeaways
  1. The Verified-to-Pro gap is structural, not model-specific. Opus 4.6 drops 27 points from Verified (80.8%) to Pro (53.4%). Opus 4.7 drops 23 points. MiniMax drops ~24 points. GPT-5.2 drops ~24 points. The ~20–25 point gap is a property of the benchmark design, not of any individual model's weakness.
  2. The harness is half the score. Identical model weights in different scaffolding harnesses commonly produce 10–20 percentage-point score differences on SWE-Bench. When two vendors report different numbers for the same base model, harness is usually the explanation, not capability.
  3. CursorBench and Aider polyglot are vendor-controlled. CursorBench v3.1 is built and scored by Cursor. Aider polyglot is built and published by Aider. In both cases, the benchmark publisher also ships the product that benefits from high scores. Treat scores from these benchmarks with the same skepticism you would apply to any self-reported metric.
  4. Terminal-Bench is independently governed by Stanford. Terminal-Bench 2.0 (tbench.ai) is a Stanford + Laude Institute benchmark for terminal-mastery tasks. Anthropic, Cursor, and Codex CLI all publish Terminal-Bench 2.0 scores; it is the canonical eval for CLI coding agents precisely because it is not controlled by any of them.
  5. Pair benchmarks for real coverage. No single benchmark covers the full real-work profile. Pair SWE-Bench with Terminal-Bench for shell tasks, MCP Atlas for tool use, and OSWorld for computer-use workloads. The benchmark portfolio thesis is the practical takeaway from this guide.

01 · The Structural Gap
The 20–25 point Verified-to-Pro gap is structural: cross-model evidence.

The most consequential fact about SWE-Bench is not any single model's score — it is the consistent ~20–25 point gap between scores on SWE-Bench Verified and SWE-Bench Pro across every model that has published both. This gap is not a quirk of one model's architecture. It is a structural property of how the two variants are designed.

The data is consistent. Opus 4.6 posts 80.8% on Verified and 53.4% on Pro, a 27-point spread. Opus 4.7 posts 87.6% on Verified and 64.3% on Pro, a 23-point spread. MiniMax posts 80.2% on Verified with M2.5 and 56.22% on Pro with M2.7 (same lab, similar lineage), a spread of approximately 24 points, though the two figures come from different model versions. GPT-5.2 posts approximately 80% on Verified (an estimate) and 55.6% on Pro, also roughly a 24-point spread. The convergence across architectures and providers makes the case: the gap is in the benchmark, not the model.
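The spreads quoted above are plain subtractions, and a minimal Python sketch makes them easy to recheck. The figures are the ones cited in this section; the `SCORES` layout and the `gap` helper are illustrative names, not part of any benchmark tooling.

```python
# Verified and Pro scores as cited in this section (percent resolved).
# Caveats: the MiniMax row mixes M2.5 (Verified) with M2.7 (Pro), and the
# GPT-5.2 Verified value is the article's approximate figure.
SCORES = {
    "Opus 4.6": (80.8, 53.4),
    "Opus 4.7": (87.6, 64.3),
    "MiniMax M2.5/M2.7": (80.2, 56.22),
    "GPT-5.2 (est.)": (80.0, 55.6),
}


def gap(verified: float, pro: float) -> float:
    """Verified-to-Pro spread in percentage points, rounded to 0.1."""
    return round(verified - pro, 1)


gaps = {model: gap(v, p) for model, (v, p) in SCORES.items()}
mean_gap = round(sum(gaps.values()) / len(gaps), 1)

for model, points in gaps.items():
    print(f"{model}: {points} points")
print(f"mean: {mean_gap} points")
```

With these inputs the mean lands just under 25 points, which is where the 20–25 point headline range comes from. Opus 4.6 sits slightly above that range on its own.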

The practical implication is significant. Vendor releases almost invariably cite the Verified score — it is higher, it is the most widely cited, and it is the number that appears on leaderboards. Any evaluation of a model for production coding tasks should seek the Pro score as the more demanding, cross-file signal of real engineering capability.

Claude Opus 4.6
SWE-Bench Verified: 80.8% (Feb 2026). SWE-Bench Pro: 53.4% (Feb 2026). Gap: −27 points. Terminal-Bench 2.0: 65.4%. Source: Anthropic / fact pack §1.1.

Claude Opus 4.7
SWE-Bench Verified: 87.6% (April 2026). SWE-Bench Pro: 64.3% (April 2026). Gap: −23 points. Terminal-Bench 2.0: 69.4%. Source: Anthropic Opus 4.7 release notes.

MiniMax M2.5 / M2.7
M2.5 SWE-Bench Verified: 80.2% (Feb 12, 2026). M2.7 SWE-Bench Pro: 56.22% (Mar 18, 2026). Same lab, similar lineage. Gap: approximately −24 points. Source: Digital Applied SWE-Bench Live leaderboard analysis.

GPT-5.2
SWE-Bench Pro: 55.6% (Dec 11, 2025). Estimated Verified: approximately 80%. Gap: approximately −24 points. Source: Digital Applied GPT-5.2 complete guide.

Verified is a 500-task subset manually validated by the SWE-Bench team, and Pro is a harder variant with more cross-file, long-context tasks. Comparing the same model across variants can shift its score by twenty points or more. (Digital Applied synthesis, May 17, 2026)

02 · Foundation
The original SWE-Bench paper: 2,294 instances, Jimenez et al. 2023.

SWE-Bench was introduced in Jimenez et al. (arXiv:2310.06770, October 2023), a collaboration led by Princeton NLP with the University of Chicago. The benchmark consists of 2,294 real GitHub issues drawn from 12 popular Python repositories, including Django, Flask, scikit-learn, matplotlib, SymPy, and pytest. Each instance pairs a GitHub issue description with the code patch that resolved it; the task is to produce, from the issue text and the repository snapshot, a patch that makes the instance's held-out tests pass.
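As a rough illustration of what one instance contains, the sketch below models a task record and the pass/fail rule. The field names follow the public SWE-bench dataset (`problem_statement`, `patch`, `FAIL_TO_PASS`, `PASS_TO_PASS`), but the `Instance` class and `is_resolved` helper are simplified assumptions rather than the harness's actual code, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Simplified stand-in for one SWE-Bench task record."""
    instance_id: str
    repo: str
    problem_statement: str      # the GitHub issue text shown to the model
    patch: str                  # the reference fix, hidden from the model
    fail_to_pass: tuple         # tests that must go from failing to passing
    pass_to_pass: tuple = ()    # tests that must keep passing


def is_resolved(instance: Instance, passed_tests: set) -> bool:
    """An instance counts as resolved only if every required test passes."""
    required = set(instance.fail_to_pass) | set(instance.pass_to_pass)
    return required <= passed_tests


# Hypothetical instance; none of these identifiers come from the real dataset.
example = Instance(
    instance_id="example__repo-1",
    repo="example/repo",
    problem_statement="parse('') raises IndexError instead of returning None.",
    patch="(reference diff omitted)",
    fail_to_pass=("tests/test_parse.py::test_empty_input",),
    pass_to_pass=("tests/test_parse.py::test_basic",),
)

fixed_everything = {
    "tests/test_parse.py::test_empty_input",
    "tests/test_parse.py::test_basic",
}
only_old_tests = {"tests/test_parse.py::test_basic"}

print(is_resolved(example, fixed_everything))  # True
print(is_resolved(example, only_old_tests))    # False
```

The point of the sketch is that scoring is test-based: a patch that differs textually from the reference fix still counts if the required tests pass.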

The evaluation harness is open-source at github.com/princeton-nlp/SWE-bench. Any team can clone the repo and re-run evals on their own hardware, which is what makes SWE-Bench the governance reference point: unlike vendor-controlled benchmarks, independent reproduction is possible and has been performed by multiple research groups. The canonical leaderboard lives at swebench.com.

The Python-only constraint is important context. SWE-Bench original measures real-world Python repository engineering, not general code generation across languages, competitive-programming problem solving, or terminal-level shell task execution. A 70% score on SWE-Bench Verified predicts how a model handles the specific task of resolving realistic Python GitHub issues. It does not directly predict shell scripting performance, polyglot code generation, or computer-use agentic workflows.

Instance count (SWE-Bench original): 2,294
Real GitHub issues from 12 popular Python repositories: Django, Flask, scikit-learn, matplotlib, SymPy, pytest, and others. Each is paired with its resolution patch. Source: arXiv:2310.06770, Oct 2023.

Repositories (Python repo coverage): 12
All open-source Python repos. The Python-only scope is a core design constraint: a deliberate choice, rather than an oversight, that enables ground-truth patch verification at scale.

Harness status (princeton-nlp/SWE-bench): Open
The evaluation harness is public on GitHub at github.com/princeton-nlp/SWE-bench. Independent reproduction has been performed by multiple teams, which is the defining governance advantage over vendor-controlled benchmarks.

03 · Verified Subset
SWE-Bench Verified: 500 human-validated instances, OpenAI partnership.

SWE-Bench Verified was announced by OpenAI on August 13, 2024, in partnership with the Princeton SWE-Bench team. It is a curated 500-instance subset of the original 2,294, produced by human reviewers who filtered out under-specified problems, broken tests, and flaky issues — tasks where the ground-truth solution was ambiguous or the automated test runner was unreliable.

The rationale was practical: the original 2,294 instances contain a non-trivial fraction of tasks where even a correct patch fails automated verification due to test environment instability or issue ambiguity. Verified removes those, making automated scoring more reliable. The trade-off is that 500 instances saturate faster than 2,294 — a model can memorize or overfit more easily, and the distribution of difficulty is different from the full set.
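A smaller set also carries more sampling noise, which is worth keeping in mind alongside saturation. The sketch below uses the textbook normal-approximation interval for a pass rate; treating tasks as independent trials is an assumption, the 80% pass rate is only an example, and `margin_95` is an illustrative name.

```python
import math


def margin_95(pass_rate: float, n_tasks: int) -> float:
    """Half-width of a 95% normal-approximation interval, in percentage points."""
    standard_error = math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)
    return 1.96 * standard_error * 100.0


for n in (500, 2294):
    print(f"n={n}: +/- {margin_95(0.80, n):.1f} points")
```

At an 80% pass rate this gives roughly ±3.5 points on 500 tasks versus ±1.6 points on 2,294, so a one- or two-point difference between two Verified scores may not mean much on its own.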

Verified is the most commonly cited SWE-Bench variant in model release notes and leaderboards. This is largely because it produces higher scores — and higher scores are more useful for marketing. OpenAI's funding of the validation work created a structural conflict: the organization best-positioned to benefit from a popular benchmark subset also co-designed that subset. This does not invalidate Verified, but it is a governance fact worth knowing.

Governance note — Verified
OpenAI partnered with Princeton to produce SWE-Bench Verified. The harness remains open-source and the methodology is published, but OpenAI's role as co-funder means it occupies a mixed-governance position: more independent than CursorBench, less independent than Terminal-Bench. When evaluating OpenAI models on SWE-Bench Verified, note that the benchmark subset was co-designed with OpenAI's resources.

04 · Harder Variant
SWE-Bench Pro: cross-file changes, longer context, trickier repos.

SWE-Bench Pro is the hardest SWE-Bench variant in current use. Where Verified filters the original 2,294 for quality and reliability, Pro is a distinct variant that selects for difficulty — specifically the kind of difficulty that arises from real engineering work: multi-file changes, longer context requirements, and repository structures that resist simple local edits.

The cross-file requirement is the key differentiator. Many SWE-Bench Verified tasks can be resolved by modifying a single file. Pro tasks are weighted toward issues that require coordinated changes across multiple files — a closer analogue to how real software development works, where fixing a bug often requires touching an interface definition, its implementation, and its test suite simultaneously. This is what explains the structural 20–25 point drop from Verified to Pro: it is not that models are suddenly bad at coding, but that coordinated cross-file reasoning under long-context pressure is genuinely harder.

Exact Pro instance count varies in public coverage and was not confirmed against swebench.com at publish time — verify the current count directly on the leaderboard before citing it in production documentation. The qualitative framing (harder cross-file variant, longer context, trickier repos) is confirmed by multiple independent sources.

For practical purposes: if you are evaluating a coding agent for production tasks involving real multi-file repositories, the Pro score is a better predictor than the Verified score. The Verified score remains the number most vendors will cite in their release notes — because it is higher. Always ask for Pro when it matters.

05 · Refresh Cadence
SWE-Bench Multilingual and Live: contamination and language coverage.

Two newer SWE-Bench variants address distinct limitations of the original: language scope and training-set contamination.

SWE-Bench Multilingual

SWE-Bench Multilingual extends the Python-only original to non-Python languages. Cursor's Composer 2.5 release notes (May 18, 2026, vendor-published) cite 79.8% on SWE-Bench Multilingual — making it the first major public citation of this variant in a commercial release announcement. Exact instance count and language coverage were not confirmed against swebench.com at publish time; verify before citing specific numbers. Because Multilingual scores are newer and less cross-reproduced than Verified scores, treat them with the same skepticism applied to any newly introduced variant.

SWE-Bench Live

SWE-Bench Live pulls fresh GitHub issues on a rolling weekly cadence, which substantially reduces training-set contamination. A frozen benchmark like Verified or Pro can be memorized: a model trained on data that includes the original GitHub issues has an unfair advantage. Live rotates the instance set continuously, making memorization economically unviable.

The trade-off is variance: because the instance set changes weekly, scores are less comparable across time than Verified or Pro. A team comparing Live scores published in March 2026 against Live scores published in May 2026 is comparing different instance distributions, so the higher score may reflect an easier set, not a better model.

The SWE-Bench Live leaderboard Q2 2026 analysis tracks current scores across models and provides the companion data context for this methodology guide.

Benchmark version discipline
Always cite the benchmark version and date. A SWE-Bench Verified score reported in Q1 2026 is not automatically comparable to one reported in Q4 2025: harness versions and scoring setups change between runs, and on rolling variants such as Live the instance set itself changes. A model release comparing its own score to a competitor's score from a different evaluation date is making a methodologically unsound claim. Benchmark version + date must accompany every citation.
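One way to enforce this discipline is to store the variant and evaluation date next to every score and refuse comparisons that mix them. The sketch below is a minimal illustration: the `Citation` record, the `comparable` helper and the 90-day window are assumptions chosen for the example, and the day-of-month values are placeholders because only the month is cited in this guide.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    model: str
    variant: str          # e.g. "Verified", "Pro", "Live"
    evaluated_on: date    # when the score was produced
    score: float          # percent resolved


def comparable(a: Citation, b: Citation, max_days_apart: int = 90) -> bool:
    """True only when the variant matches and the evaluation dates are close.

    The 90-day default is an arbitrary illustration, not an established rule.
    """
    same_variant = a.variant == b.variant
    close_in_time = abs((a.evaluated_on - b.evaluated_on).days) <= max_days_apart
    return same_variant and close_in_time


opus46_verified = Citation("Opus 4.6", "Verified", date(2026, 2, 1), 80.8)
opus46_pro = Citation("Opus 4.6", "Pro", date(2026, 2, 1), 53.4)
opus47_verified = Citation("Opus 4.7", "Verified", date(2026, 4, 1), 87.6)

print(comparable(opus46_verified, opus47_verified))  # True: same variant, 59 days apart
print(comparable(opus46_verified, opus46_pro))       # False: different variants
```

A stricter version would also compare harness identifiers, which this sketch leaves out.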

06 · Shell Mastery
Terminal-Bench 2.0: Stanford-published, independent, terminal-native.

Terminal-Bench is a benchmark of terminal-mastery tasks: shell scripting, CLI tooling, file system manipulation, process management, and the kinds of infrastructure automation that SWE-Bench Python tasks do not cover. It was developed by Stanford and the Laude Institute and lives at tbench.ai. Terminal-Bench 2.0 is the version currently cited in production coding-agent releases.

The governance structure is important. Terminal-Bench is published by an academic institution (Stanford + Laude Institute) with no commercial stake in the models it evaluates. Anthropic, Cursor, and OpenAI Codex CLI all publish Terminal-Bench 2.0 scores — the benchmark's credibility depends precisely on being external to all of them. This makes it the closest analogue on the terminal-task side to what SWE-Bench original is on the Python repository side.

Terminal-Bench 2.0 vs the original

Terminal-Bench 2.0 extends the original benchmark with harder multi-step shell tasks and improved evaluation reliability. Anthropic's Opus 4.7 release notes cite 69.4% on Terminal-Bench 2.0, up from Opus 4.6's 65.4% on the same benchmark. Codex CLI also publishes Terminal-Bench 2.0 scores; CLI coding agents are the benchmark's primary audience. Verify the exact version name against tbench.ai at evaluation time, as versioning conventions may differ between what vendors cite and what the canonical site tracks.

The complementarity between SWE-Bench and Terminal-Bench is straightforward: SWE-Bench covers Python repository issue resolution; Terminal-Bench covers shell and infrastructure tasks. A coding agent that scores well on both covers meaningfully more of the real engineering work profile than one that scores well on only one.

Opus 4.7 (Terminal-Bench 2.0 score): 69.4%
April 2026. Up from 65.4% (Opus 4.6) on the same benchmark, a gain of 4 points. Source: Anthropic Opus 4.7 release notes.

Publisher: Stanford + Laude Institute
No commercial stake in any model. Anthropic, Cursor, and Codex CLI all publish Terminal-Bench 2.0 scores; the benchmark's independence is its credibility. Site: tbench.ai.

Task type (terminal mastery tasks): Shell
Shell scripting, CLI tooling, file system manipulation, process management, and infrastructure automation: the SWE-Bench complement for real engineering work.

07 · Vendor Governance
CursorBench and Aider polyglot: vendor-controlled red flags.

Two widely-cited coding benchmarks share a governance problem: the organization that built and scored the benchmark is the same organization that benefits from the model scoring well on it.

CursorBench v3.1

CursorBench is built by Cursor and scored by Cursor. Composer 2.5 (launched May 18, 2026) posts 63.2% on CursorBench v3.1 — a figure sourced exclusively from the Cursor blog. No independent reproduction of this score has been confirmed at publish time. The critical governance risk is not that Cursor is dishonest — it is that the incentive structure makes self-serving choices harder to detect. Task distribution selection, iteration budget calibration, and evaluation harness design all influence scores, and a vendor controlling all three cannot offer the structural guarantee of independence.

Composer 2.5 also cites 79.8% on SWE-Bench Multilingual (vendor-published, not independently reproduced at publish time). The Multilingual score is on an external benchmark; the CursorBench score is not. The distinction matters.

Aider polyglot

Aider's polyglot benchmark evaluates LLM coding across multiple languages using a fixed exercise set. It is open-source and publicly documented — which is meaningful. But the benchmark is built by the Aider team and published by Aider. When Aider publishes leaderboard results, it is simultaneously the benchmark author and the primary beneficiary of high scores on models that work well with Aider's tooling. This is a different governance structure from SWE-Bench original (Princeton) or Terminal-Bench (Stanford).

Open-source does not automatically equal independent. The Aider polyglot harness is reproducible, which is a significant advantage over CursorBench. But the task selection and scoring remain in the hands of the same team whose commercial interest lies in demonstrating that frontier models work well with Aider. Score it accordingly.

The vendor-control checklist
When you encounter a benchmark score in a vendor release announcement, ask three questions before trusting it: Was the eval built by someone other than the vendor? Was the run scored by someone other than the vendor? Has an independent team reproduced the result? A "yes" to all three is rare. A "no" to the first two, as with CursorBench, means the number is a self-reported claim, not an independent measurement.
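The three questions above reduce to a small decision rule. The sketch below is one possible encoding; the function name, the parameter names and the three output labels are illustrative choices, not an established taxonomy.

```python
def classify_claim(
    built_by_vendor: bool,
    scored_by_vendor: bool,
    independently_reproduced: bool,
) -> str:
    """Map the three checklist answers to a rough trust label."""
    if not built_by_vendor and independently_reproduced:
        return "independent measurement"
    if built_by_vendor and scored_by_vendor and not independently_reproduced:
        return "self-reported claim"
    return "mixed: read with caveats"


# CursorBench as described in this section: vendor-built, vendor-scored,
# and with no independent reproduction confirmed at publish time.
print(classify_claim(True, True, False))    # self-reported claim

# An externally built benchmark whose score a third party has reproduced.
print(classify_claim(False, True, True))    # independent measurement
```

Anything that falls through to the last branch, such as an open harness that the vendor built but others can rerun, deserves a closer look rather than a blanket verdict.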

08 · Tool Use & Computer Use
MCP Atlas and OSWorld: the agentic evaluation layer.

SWE-Bench and Terminal-Bench measure coding and shell capability for single-agent workflows. As AI agents increasingly orchestrate tool calls, web actions, and operating-system interactions, a second evaluation layer has emerged to cover agentic behavior directly.

MCP Atlas

MCP Atlas evaluates agent tool-use over the Model Context Protocol (MCP). Anthropic's Opus 4.7 release notes cite the model holding "the top spot on MCP-Atlas for scaled tool use." MCP Atlas lives in the MCP ecosystem at github.com/modelcontextprotocol. As of publish time, the canonical MCP Atlas repository URL was not independently confirmed — verify the exact repo path at the MCP organization before citing a specific URL in downstream documentation.

The governance of MCP Atlas is mixed: the MCP specification itself is maintained as an open standard, but Anthropic was the founding contributor and remains deeply involved in its development. Anthropic's claim to the top MCP-Atlas spot should be read with awareness that Anthropic has structural influence over the platform on which the benchmark runs. As MCP Atlas matures and more independent researchers publish results, the governance picture will become clearer.

OSWorld

OSWorld (Xie et al. 2024, arXiv:2404.07972) is a multimodal benchmark of 369 real computer tasks spanning web browsers, desktop applications, and operating-system interactions. It is designed for computer-use agents — the class of AI systems that observe a screen and control a cursor rather than receiving structured tool outputs. OSWorld is independently published and not controlled by any of the model vendors that score on it.

For teams evaluating AI agents for computer-use automation (RPA replacement, web research workflows, or complex multi-application workflows), OSWorld is the most credible available benchmark. It complements SWE-Bench (Python repository tasks), Terminal-Bench (shell tasks), and MCP Atlas (tool-use orchestration) in a portfolio that covers most of the real agentic work profile. See also our AI evaluation metrics reference guide for broader context on how these benchmarks fit together.
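The portfolio idea can be checked mechanically: list the workload types that matter and see which ones a vendor's published scores leave uncovered. The mapping below restates the pairing from this section; the workload labels and the `missing_coverage` helper are illustrative names.

```python
# Workload type -> benchmark that covers it, per the pairing in this guide.
COVERAGE = {
    "python repository issues": "SWE-Bench",
    "shell and infrastructure tasks": "Terminal-Bench",
    "tool-use orchestration": "MCP Atlas",
    "computer use": "OSWorld",
}


def missing_coverage(reported_benchmarks: set) -> list:
    """Return the workload types with no reported benchmark, sorted by name."""
    return sorted(
        workload
        for workload, benchmark in COVERAGE.items()
        if benchmark not in reported_benchmarks
    )


# A release that reports only SWE-Bench and Terminal-Bench scores.
print(missing_coverage({"SWE-Bench", "Terminal-Bench"}))
# ['computer use', 'tool-use orchestration']
```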

09 · Harness Effect
The harness is half the score: 10–20 point swings on identical model weights.

Even when benchmark variant and instance set are held constant, two vendors running the same base model can report scores that differ by 10–20 percentage points. The explanation in almost every documented case is the harness — the scaffolding layer that wraps the model, manages context windows, handles tool calls, and determines how many attempts the model gets before marking a task failed.

Harness variables that commonly explain score differences include: iteration budget (how many edit-test cycles the model is allowed before giving up), context-window management strategy (how prior context is summarized or truncated), tool-call format and error handling, and whether the harness provides the model with the test-failure output from previous attempts. A "smarter" harness that provides richer feedback on failed attempts may produce 5–10 points of apparent model improvement with no change to the underlying weights.
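To make those variables concrete, the sketch below records them as a configuration object and lists which settings differ between two runs. The `HarnessConfig` fields paraphrase the variables named above; the class, the helper and every value are hypothetical and do not describe any real harness.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HarnessConfig:
    iteration_budget: int        # edit-test cycles allowed before giving up
    context_strategy: str        # e.g. "truncate" or "summarize"
    tool_call_format: str        # how tool calls are encoded for the model
    shows_test_failures: bool    # whether prior test output is fed back


def differing_settings(a: HarnessConfig, b: HarnessConfig) -> list:
    """Names of the settings whose values differ between two configurations."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Two invented configurations for the same model weights.
optimized = HarnessConfig(20, "summarize", "json", True)
minimal = HarnessConfig(1, "truncate", "json", False)

print(differing_settings(optimized, minimal))
# ['iteration_budget', 'context_strategy', 'shows_test_failures']
```

If a vendor cannot fill in a record like this for a published score, the score cannot be compared against another run with any confidence.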

The six canonical MCP host harnesses — Claude Desktop, Claude Code, Cursor, Codex CLI, Windsurf, and VS Code Copilot — each implement these variables differently. When the same base model reports different SWE-Bench scores running through different hosts, harness is the expected explanation. The SWE-Bench Live leaderboard Q2 2026 analysis documents specific cases where harness choice shifts scores on otherwise identical configurations.

Figure (illustrative): Harness effect on the same model and the same benchmark across three harness configurations

  • Model A, optimized harness (extended iteration budget, rich failure feedback, smart context management): 82%
  • Model A, standard harness (default iteration budget, basic error output, standard context): 68%
  • Model A, minimal harness (single-shot, no retry, no test feedback): 61%
  • Harness swing range (same model weights, same benchmark variant, three harness configurations): 21 pts

Source: Digital Applied synthesis, illustrative of commonly reported 10–20 point swings on identical model weights

The harness-effect figure above is illustrative — it represents the commonly reported 10–20 point swing documented across multiple independent analyses of SWE-Bench scoring. The specific numbers are a synthesis; do not treat them as sourced from a single paper. The structural claim — that harness accounts for a substantial fraction of observed score differences between vendors — is supported by the pattern of self-reported scores across the Opus 4.6, Opus 4.7, and GPT-5.x release cycles.

The implication for practitioners is direct: when comparing two vendors' SWE-Bench scores, the first question is not "which model is better?" — it is "which harness produced each score, and are those harnesses comparable?" Without that context, cross-vendor comparisons are unreliable.

10 · Governance Matrix
Vendor-controlled vs independent: every benchmark rated.

The table below rates every major coding and agent benchmark in current use on four governance dimensions: who published it, whether the model-author overlaps with the benchmark publisher, whether the harness is open-source, and whether independent reproduction has occurred. The verdict column translates those dimensions into a single governance classification.

SWE-Bench Original · Verdict: Independent
Open-source harness · Reproduced
Published by Princeton NLP + collaborators. No vendor overlap. Harness open-source at github.com/princeton-nlp/SWE-bench. Reproduced by multiple independent teams. Canonical independent baseline.

SWE-Bench Verified · Verdict: Mixed
Funded by vendor · Harness open
500-instance subset curated with OpenAI funding. Princeton maintains the harness; OpenAI co-funded the validation. Harness open-source. Most-cited variant in vendor release notes.

Terminal-Bench 2.0 · Verdict: Independent
Open-source harness · Reproduced
Published by Stanford + Laude Institute. No commercial stake in any model. Anthropic, Cursor, Codex CLI all publish scores. tbench.ai is the canonical evaluation site.

CursorBench v3.1 · Verdict: Vendor-controlled
Vendor built + scored · Not reproduced
Built by Cursor, scored by Cursor. Composer 2.5 posts 63.2%. No independent reproduction confirmed. Task distribution, iteration budget, and harness design all controlled by the same entity benefiting from high scores.

Aider polyglot · Verdict: Vendor-published
Open harness · Aider built + scored
Aider built the benchmark, publishes the leaderboard, and benefits when models score well with Aider's tooling. Harness is open-source (advantage over CursorBench). Task selection remains with the same team.

MCP Atlas (tool use) · Verdict: Mixed
Open spec · Anthropic-influenced
MCP is an open standard; Anthropic was founding contributor. Opus 4.7 claims top spot. As MCP Atlas matures with more independent contributors, governance may shift toward independent. Verify canonical repo URL.

OSWorld (computer use) · Verdict: Independent
Open-source harness · Reproduced
Published by Xie et al. 2024 (arXiv:2404.07972). 369 real computer tasks across web, desktop, and OS. Not controlled by any model vendor. Most credible available benchmark for computer-use agents.

LiveCodeBench (coding) · Verdict: Independent
Open · Contamination-resistant
Contamination-resistant coding benchmark using recently-published problems (Jain et al. 2024, arXiv:2403.07974). Rolling problem cadence reduces memorization risk. Published independently of major model vendors.

11 · Developer Playbook
How to read vendor benchmark claims without getting fooled.

Every new model release announcement now leads with a benchmark table. The following five-question checklist is designed to be applied to any benchmark claim before it influences a procurement or architecture decision. These questions are ordered by how frequently the answer reveals a meaningful caveat.

Question 1
Which variant?

Verified, Pro, Multilingual, Live, or original? Each measures different things at different difficulty levels. A Verified score is not comparable to a Pro score for the same model — the cross-variant gap averages 20-25 points. Always require the variant label before accepting a score.

Require: variant + instance count
Question 2
Which harness?

Which scaffolding ran the eval? What iteration budget, context-management strategy, and tool-call format? Harness alone accounts for 10-20 point swings. If the vendor doesn't disclose the harness, the score is not reproducible.

Require: harness + iteration budget
Question 3
Vendor-published or reproduced?

Did the same organization publish the benchmark and the model score? CursorBench and Aider polyglot are yes. SWE-Bench and Terminal-Bench are no. An independent reproduction is structurally more trustworthy than a vendor self-report.

Prefer: independent reproduction
Question 4
Cross-variant spread?

Does the vendor publish both Verified and Pro scores? A 27-point Verified-to-Pro gap (as seen with Opus 4.6) is normal. A vendor that only publishes Verified is presenting the favorable half of the picture. Ask for Pro when making production decisions.

Require: Verified + Pro both
Question 5
Raw pass rate or cost-per-task?

A model that passes 70% of SWE-Bench tasks at $2 per task is a different production choice from one that passes 65% at $0.40 per task. Raw pass rate is a capability ceiling; cost-per-successful-task is the operational reality. See our analysis of the cost-per-successful-task metric for how to calculate it.

Ask for: cost-per-successful-task
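The two options in Question 5 can be compared with one division each. The sketch below assumes that failed attempts cost the same as successful ones, so cost per successful task is simply cost per attempt divided by pass rate; the `cost_per_success` helper is an illustrative name, and the linked analysis may define the metric in more detail.

```python
def cost_per_success(cost_per_task: float, pass_rate: float) -> float:
    """Expected spend per resolved task, assuming every attempt costs the same."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_task / pass_rate


option_a = cost_per_success(2.00, 0.70)   # 70% pass rate at $2.00 per task
option_b = cost_per_success(0.40, 0.65)   # 65% pass rate at $0.40 per task

print(f"Option A: ${option_a:.2f} per successful task")  # $2.86
print(f"Option B: ${option_b:.2f} per successful task")  # $0.62
```

Under that assumption the cheaper model resolves tasks at roughly a fifth of the cost while giving up five points of raw pass rate, which is the trade-off the headline number hides.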

Applying this checklist takes less than five minutes per benchmark claim and frequently reveals that a headline number is not comparable to the number from the previous model generation, the competitor model, or the team's own internal evaluation. The disciplines of specifying variant, harness, governance, cross-variant spread, and cost-adjusted pass rate are what separate benchmark-literate engineering decisions from marketing-driven ones.
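The checklist also works as a completeness check on a structured record of a vendor claim. In the sketch below, the field names are illustrative stand-ins for the five questions, and the sample claim is a hypothetical release note that reports only a variant and a score.

```python
# One field per checklist question; the names are illustrative.
REQUIRED_FIELDS = (
    "variant",                     # Q1: which variant?
    "harness",                     # Q2: which harness?
    "independently_reproduced",    # Q3: vendor-published or reproduced?
    "pro_score",                   # Q4: cross-variant spread
    "cost_per_successful_task",    # Q5: raw pass rate or cost-per-task?
)


def unanswered(claim: dict) -> list:
    """Checklist fields that the claim leaves missing or set to None."""
    return [name for name in REQUIRED_FIELDS if claim.get(name) is None]


headline_only = {"variant": "Verified", "score": 87.6}
print(unanswered(headline_only))
# ['harness', 'independently_reproduced', 'pro_score', 'cost_per_successful_task']
```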

The forward-looking implication is structural. As more coding agents publish scores across Terminal-Bench 2.0, MCP Atlas, and OSWorld alongside SWE-Bench, the benchmark landscape will become richer — and also noisier. The number of vendor-controlled evals will grow faster than the number of independent ones, because vendors have stronger incentives to create evaluations they can win. The governance lens — applied systematically — is the skill that will separate teams that evaluate well from teams that buy marketing. For teams operationalizing AI coding at scale, our AI transformation services include benchmark evaluation design as a core component of model selection work. And for teams building on the web development side, understanding which coding agents actually perform in production is directly relevant to web development automation decisions.

The deeper truth is that no single benchmark predicts real-world productivity. As our Opus 4.7 complete guide documents, a model that posts 87.6% on SWE-Bench Verified and 69.4% on Terminal-Bench 2.0 is a different production reality from one that posts the same Verified score with a terminal score 15 points lower. The benchmark portfolio — not the leaderboard position on any single eval — is what informs a real architecture decision. See also the broader context in our AI evaluation metrics reference guide and the practical comparison in our cost-per-successful-task metric analysis.

Conclusion — Benchmark Methodology

The benchmark wars are about who controls the eval — not who has the smartest model.

Every major coding benchmark covered in this guide — SWE-Bench, Terminal-Bench, CursorBench, Aider polyglot, MCP Atlas, OSWorld, LiveCodeBench — encodes a governance structure. That structure determines how much trust the score deserves and under what conditions. Independent evaluation by Princeton, Stanford, and academic collaborators produces different epistemic guarantees than self-published scores from the same organization that built the model and designed the benchmark. The 20–25 point Verified-to-Pro gap and the 10–20 point harness swing are the two most important quantitative facts in this space — and neither appears in most vendor release notes.

The five-question checklist in Section 11 is the practical operationalization of the governance lens: which variant, which harness, vendor-published or reproduced, cross-variant spread, cost-per-task or raw pass rate. Apply it to the next model release announcement you read and you will almost certainly find that the headline number answers fewer of those questions than you initially assumed. The benchmark portfolio thesis — pairing SWE-Bench with Terminal-Bench for shell tasks, MCP Atlas for tool use, and OSWorld for computer-use workloads — is the practical alternative to trusting any single leaderboard position.

The trajectory of this space points toward more vendor-controlled evaluations, not fewer. Vendors have stronger incentives to create benchmarks they can win than to support benchmarks they might not. The organizations and teams that will evaluate AI coding agents most accurately are those that build the habit of governance-first benchmark reading now, before the leaderboard landscape becomes even more difficult to navigate.

FAQ · SWE-Bench Benchmark Guide

Benchmark questions we answer every week.

SWE-Bench original (Jimenez et al. 2023) contains 2,294 real GitHub issues from 12 popular Python repositories. SWE-Bench Verified is a 500-instance subset, curated in partnership with OpenAI in August 2024, that removes ambiguous or flaky tasks to improve scoring reliability — it is the most commonly cited variant in vendor release notes because it produces higher scores. SWE-Bench Pro is a separate harder variant that emphasizes cross-file changes, longer context requirements, and more complex repository structures that better reflect real multi-file engineering work. The same model typically scores 20–25 points lower on Pro than on Verified — Opus 4.6 posts 80.8% Verified and 53.4% Pro; Opus 4.7 posts 87.6% Verified and 64.3% Pro. Comparing a Verified score to a Pro score is not a valid apples-to-apples comparison.