When Cursor announced Composer 2.5 earlier today, the headline benchmark was CursorBench v3.1 — built, run, and scored by Cursor itself. That governance structure matters more than the 63.2% score: the organization publishing the benchmark is the same organization that published the model, and the actual #1 on Cursor's own leaderboard belongs to Anthropic's Claude Opus 4.7 (Adaptive) at 64.8%, a fact that received far less attention in the launch coverage.
The methodology gap CursorBench is filling is real. SWE-Bench Verified is contaminated: OpenAI's own audit found that 59.4% of 138 problems its o3 model failed contain flawed test cases, and OpenAI has since stopped reporting Verified scores entirely. Cursor's response — build a fresher benchmark sourced from real production sessions via a proprietary tool called “Cursor Blame” — is methodologically defensible. The execution problem is governance: the publisher is also the model author, the instance count is not disclosed, the test set is not downloadable, and the harness is not open-source.
This analysis covers what CursorBench v3.1 actually measures, why independent rating agencies like BenchLM exclude it from their scoring formulas, and how to apply a 10-point vendor-benchmark literacy framework to CursorBench — and to every other vendor-published score you read. The framework is the artifact; CursorBench is the worked example.
- 01. Composer 2.5 is not the absolute leader on Cursor's own benchmark. Claude Opus 4.7 (Adaptive) scores 64.8% on CursorBench v3.1; Composer 2.5 scores 63.2%. Cursor's framing is quality-per-dollar leadership, not absolute capability leadership. These are different claims, and the distinction matters when reading the launch narrative.
- 02. CursorBench v3.1 fails three of the four core transparency tests. Instance count is not disclosed. The test set is not publicly downloadable. The evaluation harness is not open-source. Independent reproduction is structurally impossible. BenchLM tracks CursorBench v3.1 as display-only because it is a first-party benchmark and excludes it from scoring formulas.
- 03. The methodology gap CursorBench exploits is real. OpenAI stopped reporting SWE-Bench Verified after finding that frontier models can reproduce gold patches from memory. The 59.4%-flawed-test-case audit confirms that Verified has structural integrity problems. Cursor's contamination critique is accurate. The problem is that the response introduces a different governance failure.
- 04. Vendor benchmarks are optimistic, not fraudulent. CursorBench is not secretive; Cursor publicly documents the methodology. The risk is structural: task distribution, iteration budget, and harness design all influence scores, and a vendor controlling all three cannot offer the independence guarantee that makes SWE-Bench Verified and Terminal-Bench 2.0 trustworthy. Adjust scores accordingly.
- 05. The 10-point literacy checklist is the reusable artifact. Apply the framework in section 09 to every benchmark you read: CursorBench, Aider polyglot, vendor-published SWE-Bench Multilingual, internal Sourcegraph evals. The same governance questions reveal the same structural patterns across all vendor-controlled benchmarks.
01 — Launch-Day Claim
Composer 2.5 shipped today, and CursorBench v3.1 is the leadership claim.
Cursor's Composer 2.5 launch post, published today (May 18, 2026), anchors its quality claim on CursorBench v3.1 scores. The model is built on a Kimi K2.5 base, priced at $0.50 input / $2.50 output per million tokens, and positioned as a quality-per-dollar leader in the agentic coding space. Composer 2.5 also reports 79.8% on SWE-Bench Multilingual — a different, more externally constructed benchmark — but the CursorBench v3.1 number anchors the headline narrative.
The timing is structurally significant. CursorBench was first announced by Naman Jain of Cursor research on March 11, 2026, roughly two months before the v3.1 score was used to anchor the Composer 2.5 leadership claim. Cursor built the evaluation methodology, ran the v3.1 evaluation, published the scores, and launched the model being scored. All four decisions rested with the same organization, and the scores and the model went public on the same day. That is what “vendor-controlled” means in practice.
This does not make the benchmark fraudulent. Cursor's technical documentation of CursorBench is substantive: the four-dimension evaluation framework, the “Cursor Blame” sourcing methodology, and the correctness-vs-efficiency scatter plot are all described in the announcement post. The problem is governance, not secrecy about the method: Cursor explains what it does, but nobody outside Cursor can check it. Understanding the distinction is the starting point for reading any vendor-published score.
For the launch deep-dive context on Composer 2.5 itself — pricing, model architecture, and the Kimi K2.5 base — see our Composer 2.5 launch analysis. For the head-to-head routing comparison between Composer 2.5 and Claude Code, see the Composer 2.5 vs Claude Code routing guide.
02 — The #1 Spot
Opus 4.7 (Adaptive) leads at 64.8%; Composer 2.5 is second on its own maker's benchmark.
The most rhetorically inconvenient fact about CursorBench v3.1 is that Cursor's model does not lead it. Claude Opus 4.7 (Adaptive) scores 64.8% — 1.6 points above Composer 2.5's 63.2%. Cursor's launch framing is careful: the positioning is quality-per-dollar leadership, not absolute capability leadership. Composer 2.5 at $0.50 input / $2.50 output per million tokens costs roughly one-tenth of Opus 4.7 standard ($5 / $25), which makes a cost-adjusted leadership argument viable. But much of the press coverage collapsed this nuance into a headline that Composer 2.5 “tops” CursorBench. It does not.
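The cost-adjusted framing can be made concrete with a small calculation. The sketch below uses the prices and scores quoted above; the token mix per task (200,000 input and 20,000 output tokens) and the "score points per dollar" ratio are illustrative assumptions, not figures or metrics published by Cursor. It also assumes Opus 4.7 (Adaptive) is billed at the standard Opus 4.7 rate and that both models consume the same number of tokens, which may not hold in practice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    score_pct: float         # CursorBench v3.1 score as quoted in this article
    usd_per_m_input: float   # list price per million input tokens
    usd_per_m_output: float  # list price per million output tokens

def task_cost_usd(m: Model, input_tokens: int, output_tokens: int) -> float:
    """List-price cost of one task for a given token mix."""
    return (input_tokens * m.usd_per_m_input
            + output_tokens * m.usd_per_m_output) / 1_000_000

composer = Model("Composer 2.5", 63.2, 0.50, 2.50)
opus = Model("Claude Opus 4.7 (Adaptive)", 64.8, 5.00, 25.00)

# Hypothetical token mix for one agentic task (assumption, not from Cursor).
INPUT_TOKENS, OUTPUT_TOKENS = 200_000, 20_000

for m in (composer, opus):
    cost = task_cost_usd(m, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{m.name}: ${cost:.2f} per task, "
          f"{m.score_pct / cost:.1f} score points per dollar")
```

Under these assumptions the price ratio is exactly tenfold while the score gap is 1.6 points, which is why the quality-per-dollar claim and the absolute-leadership claim have to be read separately.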
The full seven-model leaderboard, drawn from BenchLM's CursorBench v3.1 tracking page (retrieved May 22, 2026), is shown below. Note that BenchLM displays these scores for reference while explicitly excluding them from its independent scoring formula. Every score on this leaderboard is vendor-controlled — published by Cursor, run by Cursor, on a test set that only Cursor has seen.
CursorBench v3.1 leaderboard — 7 models, vendor-controlled benchmark
Source: BenchLM CursorBench v3.1 leaderboard, retrieved May 22, 2026. All scores are vendor-published; independent reproduction not confirmed. BenchLM tracks display-only.
One structural observation the leaderboard makes visible: Kimi K2.5 (the base model Cursor fine-tuned to produce Composer 2.5) scores 31.9%. Composer 2.5 scores 63.2%, a 31.3-point lift attributed to Cursor's fine-tuning and agent scaffolding. This is a meaningful capability delta. It also means Cursor was evaluating its own fine-tuning on a benchmark it designed, run by its own harness, on a test set it sourced from its own production sessions. Each of those decisions influences the magnitude of the delta.
For cross-benchmark context: Opus 4.7 reports 87.6% on SWE-Bench Verified (vendor-published) and 64.3% on SWE-Bench Pro (also vendor-published, but on an externally constructed test set from Scale AI). See our Claude Opus 4.7 complete guide for the full benchmark spread. CursorBench scores are not numerically comparable to SWE-Bench scores — they measure different things on different test sets with different scoring methodology.
03 — What It Measures
Four dimensions: correctness, code quality, efficiency, interaction behavior.
Cursor documents CursorBench's evaluation framework in the March 11, 2026 announcement post. The benchmark's stated purpose, per Naman Jain: “We built CursorBench to measure multiple dimensions of agent performance including solution correctness, code quality, efficiency, and interaction behavior.” This four-dimension framing is the methodological advance over SWE-Bench Verified, which scores only whether the final patch passes the test suite — a binary correctness measure.
The four dimensions address meaningfully different aspects of coding-agent quality. Correctness — does the solution work? — is the SWE-Bench baseline. Code quality captures whether the solution is readable, maintainable, and consistent with the repository's conventions; this is the dimension most relevant to teams that care about what lands in production, not just what passes tests. Efficiency measures completion token usage — how verbose or concise the agent's interaction is — visualized on a correctness-vs-median-completion-tokens scatter plot. Interaction behavior captures the quality of the agent's conversational pattern: clarification questions, tool use sequencing, and recovery from errors.
CursorBench tasks scale in complexity. Cursor notes that tasks “roughly doubled in scope from v1 to v3” in both lines-of-code touched and number of files involved. This is a deliberate design choice to make the benchmark more resistant to models that can solve simple single-file edits but struggle with cross-file coordination — the same gap that separates SWE-Bench Verified from SWE-Bench Pro. The SWE-Bench vs Terminal-Bench benchmark guide covers the cross-file complexity dimension in detail.
The four-dimension framing is the strongest part of CursorBench's methodology case. The problem is that these dimensions are scored by Cursor's own harness, on tasks sourced from Cursor's own production sessions, using criteria that Cursor defines but does not publish. A four-dimension eval where the publisher controls every dimension is not more rigorous than a one-dimension eval with an open harness — it is more elaborate.
Solution correctness
The baseline SWE-Bench dimension: does the final patch pass the test suite? CursorBench extends this to a production-session context where 'correctness' includes whether the committed code actually runs in the user's environment.
Code quality score
Readability, maintainability, and consistency with the repository's coding conventions. This dimension is the most practically relevant for engineering teams that care about what lands in production — and the hardest to score objectively without a reproducible rubric.
Completion token efficiency
Measured as median completion tokens per task. Visualized on a correctness-vs-token-usage scatter plot. A model that solves the same task in fewer tokens is preferable at scale — particularly at Composer 2.5's $2.50/Mtok output price.
Interaction behavior
Covers clarification question quality, tool use sequencing, and error recovery patterns. The dimension closest to real-world agent UX — but also the dimension most dependent on Cursor's own scoring criteria, which are not publicly documented.
04 — Cursor Blame
Tasks sourced via “Cursor Blame”: production sessions, not public repos.
The core methodological claim Cursor makes for CursorBench is that its tasks are drawn from real production coding sessions, not from public repositories that end up in model training data. Direct quote from the announcement post: “We source tasks for CursorBench using Cursor Blame, which traces committed code back to the agent request that produced it.”
“Cursor Blame” is Cursor's proprietary tooling — not an industry-standard instrument. The mechanism is described conceptually: when a Cursor user commits code that was produced by an AI agent request, Cursor Blame creates a link between the committed code and the original request. This linkage enables Cursor to construct evaluation tasks from real developer-agent interactions, which reduces the contamination risk that plagues public-repository benchmarks.
The contamination argument is substantive. Cursor notes in the announcement that “SWE-bench Verified, Pro, and Multilingual all draw tasks from public repositories that end up in model training data, inflating scores.” This is consistent with what OpenAI found in its own audit and what independent researchers like Han Yang have documented: public benchmark tasks appear in training corpora, and models that memorize the expected solutions post inflated scores without demonstrating genuine capability.
The Cursor Blame sourcing approach sidesteps that specific contamination risk by drawing tasks from internal codebases and controlled production sessions. The structural cost is auditability: because the tasks come from Cursor's proprietary production data, no external party can verify the distribution, check for selection bias toward tasks where Cursor's own agents excel, or confirm that the task set is representative of real-world software engineering workloads beyond Cursor's user base. The contamination fix introduces a different opacity.
05 — Transparency Gaps
Instance count undisclosed, test set not downloadable, harness not open-source.
The four transparency failures below are distinct from methodology choices — they are structural absences that make independent verification impossible regardless of whether Cursor's methodology is sound. Each represents a gap between what open benchmarks like SWE-Bench and Terminal-Bench provide and what CursorBench v3.1 provides.
How many tasks?
Cursor states that 'the suite is refreshed every few months' but does not publish the v3.1 task count. The instance count is not on the CursorBench announcement page, the Composer 2.5 launch post, or the BenchLM tracking page, so any specific figure quoted for it is unsourced.
Can you inspect the tasks?
The CursorBench v3.1 task set is not available on github.com/getcursor or Hugging Face as of May 24, 2026. You cannot inspect the distribution of task types, difficulty levels, or repository sources. Independent reproduction is structurally impossible.
How are scores computed?
The evaluation harness — the scaffolding that wraps the model, manages context, sets iteration budgets, and scores responses — is not publicly documented or open-sourced. Harness choices routinely account for 10–20 percentage-point score differences on identical model weights.
Has anyone else run it?
No independent team has confirmed a reproduction of CursorBench v3.1 scores as of publish time. Without a public test set and open harness, reproduction requires Cursor's cooperation. This makes every CursorBench v3.1 number a self-reported claim, not an independently verified measurement.
These four gaps collectively produce BenchLM's institutional response: “CursorBench v3.1 is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.” A credit-rating agency that declines to factor an issuer's self-assigned rating into its own assessment is applying the same principle. The scores are not hidden (they are visible on the BenchLM page), but they do not carry the same epistemic weight as independently verified scores.
For practitioners, the practical implication is simple: when someone cites a CursorBench v3.1 number in a procurement discussion or architecture decision, the correct question is not “what is the score?” — it is “has this been independently reproduced?” The answer, today, is no. See our AI evaluation metrics reference guide for a broader taxonomy of what different benchmark governance structures imply for practical trust levels.
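One reason the undisclosed instance count matters is that sampling error depends on it. The sketch below treats a CursorBench score as a simple pass rate over independent tasks, which is an assumption: the benchmark scores four dimensions and Cursor has not published how they combine. The suite sizes are hypothetical placeholders, since the real count is not public.

```python
import math

def standard_error_pct(score_pct: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)

# Gap between Opus 4.7 (Adaptive) and Composer 2.5 on the leaderboard.
GAP = 64.8 - 63.2

# Hypothetical suite sizes; Cursor has not published the real one.
for n in (100, 500, 2000):
    half_width = 1.96 * standard_error_pct(63.2, n)
    print(f"n={n}: 63.2% +/- {half_width:.1f} points (95% interval); "
          f"leaderboard gap is {GAP:.1f}")
```

Under this simplified model, even a 2,000-task suite leaves a 95% interval of roughly plus or minus 2.1 points around 63.2%, wider than the 1.6-point gap at the top of the leaderboard. A paired comparison on identical tasks could be tighter, but that analysis also requires the task-level data Cursor does not release.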
06 — Why It Exists
The real methodology gap CursorBench is exploiting: Verified saturation and contamination.
The case for CursorBench is not simply marketing. The benchmark fills a genuine void created by SWE-Bench Verified's saturation and contamination problems. Understanding what Cursor is responding to is necessary to evaluate whether the response is proportionate.
SWE-Bench Verified is a 500-instance subset of 2,294 GitHub issues curated in August 2024. At the time of its release, top models scored in the 40–50% range. By early 2026, frontier models post 80%+ on Verified. The saturation problem is structural: a 500-instance benchmark where top models score 87.6% has less discriminatory power than a larger, harder benchmark where top models score 23%. SWE-Bench Pro (1,865 tasks across 41 professional repositories, with reference solutions averaging 107.4 lines of code across 4.1 files) addresses the difficulty gap, but its task set has been frozen since 2025 and is also subject to training-set contamination as it ages.
The contamination problem is documented and severe. Benchmarks sourced from public repositories are at continuous risk of appearing in training data. When a model is trained on data that includes the original GitHub issues, it has an unfair advantage on those specific tasks — it is recalling a memorized answer, not demonstrating reasoning capability. Independent analysis suggests contamination can inflate scores by 5–15 or more points on popular benchmarks, according to BenchLM's benchmark reliability analysis.
Cursor's response — source tasks from real production sessions that are not in any model's training data, score multiple dimensions of agent quality rather than just patch correctness, and refresh the benchmark periodically rather than freezing it — addresses all three of the main Verified criticisms. The problem is that solving the contamination and saturation problems by moving to a vendor-controlled benchmark trades one set of epistemic risks for another. The SWE-Bench Live leaderboard analysis covers how the broader benchmark ecosystem is responding to these contamination pressures.
Cursor also notes that it supplements offline CursorBench evaluations with live, controlled online experiments — A/B test infrastructure that most vendors lack. This is a meaningful methodological addition: live experiments on real user sessions can confirm or challenge what offline benchmark numbers suggest. But live experiment results are also internal and not publishable in a way that independent researchers can verify.
07 — OpenAI's Audit
OpenAI's 59.4% audit: the evidence that Verified no longer measures frontier capability.
OpenAI's decision to stop reporting SWE-Bench Verified scores is the strongest institutional validation of the contamination critique. The evidence is documented in the OpenAI GPT-5.5 deployment safety hub. OpenAI found that all tested frontier models — including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash — can reproduce exact gold patches and verbatim problem details from SWE-Bench Verified. Frontier models are not solving these problems; they are recalling memorized answers.
The audit finding that receives the most attention is the 59.4% figure. Precision matters here: this is not 59.4% of the full 500-problem Verified set. It is 59.4% of the 138 problems that OpenAI o3 failed. Of those 138 failed cases, more than half contain flawed test cases: either too narrow (a correct solution fails because the test enforces one specific implementation) or too wide (the test checks behavior the problem statement never specified). OpenAI now recommends the industry move to SWE-Bench Pro.
This is the evidentiary foundation for CursorBench's existence. Cursor cites the OpenAI finding directly in the March 11, 2026 announcement: “OpenAI recently stopped reporting SWE-bench Verified results entirely after finding that frontier models could reproduce gold patches from memory.” The citation is accurate. The argument it supports — that the industry needs a fresher, contamination-resistant benchmark — is correct. The question is whether Cursor building and controlling that replacement is the right governance response.
The independent alternative recommended by OpenAI is SWE-Bench Pro, published by Scale AI (arXiv:2509.16941). SWE-Bench Pro contains 1,865 total tasks across 41 professional repositories: 731 public, 276 commercial, and 858 held-out. Top models score around 23% on the Pro public set — compared to 70%+ on Verified — exposing the ~47-percentage-point gap that illustrates how saturated Verified has become. The public-facing leaderboard at labs.scale.com uses the 731-task public subset; the 1,865 total includes commercial and held-out subsets not in public scoring.
OpenAI's GPT-5.5 deployment safety hub documents that 59.4% of the 138 problems OpenAI o3 failed on SWE-Bench Verified contain flawed test cases — either too narrow or too wide. OpenAI found that all tested frontier models can reproduce gold patches from memory and has stopped reporting SWE-Bench Verified scores entirely. Its institutional recommendation is to adopt SWE-Bench Pro (Scale AI, arXiv:2509.16941), where top models score approximately 23% vs 70%+ on Verified. The public leaderboard is at labs.scale.com/leaderboard/swe_bench_pro_public.
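The scope of the 59.4% figure is easy to misread, so it helps to work the arithmetic. This sketch uses only the numbers quoted above; the variable names are illustrative. Because only the failed problems were audited, the share of the full set it produces is a lower bound, not an estimate of the overall flaw rate.

```python
VERIFIED_TOTAL = 500   # size of SWE-Bench Verified
AUDITED = 138          # problems o3 failed, i.e. the audited subset
FLAWED_RATE = 0.594    # share of audited problems with flawed test cases

flawed_in_audit = round(AUDITED * FLAWED_RATE)              # count of flawed problems found
audited_share = AUDITED / VERIFIED_TOTAL                    # fraction of Verified that was audited
flawed_floor_full_set = flawed_in_audit / VERIFIED_TOTAL    # lower bound over all 500

print(f"Flawed problems found in the audit: {flawed_in_audit}")
print(f"Share of Verified that was audited: {audited_share:.1%}")
print(f"Known-flawed share of the full set (lower bound): {flawed_floor_full_set:.1%}")
```

The remaining 362 problems were not part of the audit described here, so the true flaw rate across all 500 could be higher than this floor.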
08 — BenchLM Response
Third-party rating agency: “display-only” because it is a first-party benchmark.
BenchLM is an independent benchmark tracking platform that aggregates and rates AI benchmark scores across models. Its treatment of CursorBench v3.1, documented on the CursorBench v3.1 tracking page (retrieved May 22, 2026), is explicit and carries a direct quote that deserves verbatim preservation: “BenchLM tracks it as display-only because it is a first-party benchmark.”
The implication is institutionally significant. BenchLM publishes CursorBench v3.1 scores for reference — practitioners can see them — but excludes them from the scoring formula that determines overall model rankings. This is the equivalent of a financial ratings agency showing a company's self-issued credit assessment in the footnotes while declining to incorporate it into the official rating. The information is present; its epistemic weight is explicitly discounted.
BenchLM's response is the right institutional pattern. An independent rating agency cannot rank a benchmark whose construction, harness, and scoring are controlled by the model author without undermining the independence that makes its ratings valuable. The same logic applies to how engineering teams should read CursorBench scores: visible as context, but not determinative in a production evaluation.
The broader BenchLM methodology for benchmark reliability is documented in their benchmark reliability post, which covers contamination inflation, vendor-control patterns, and the governance criteria they use to classify benchmarks. It is worth reading alongside this analysis as a complement to the 10-point checklist in the next section.
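The display-only pattern is simple to mirror in an internal scorecard. The sketch below is not BenchLM's formula, which this article does not reproduce; it is a minimal illustration of the principle, with a hypothetical `first_party` flag and an unweighted mean as the aggregate. The two scores are the Composer 2.5 figures quoted earlier, and treating SWE-Bench Multilingual as the independently constructed entry is an assumption made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkScore:
    benchmark: str
    score_pct: float
    first_party: bool  # True when the benchmark publisher also authors a ranked model

def split_scores(scores):
    """Return (scored, display_only); first-party benchmarks are display-only."""
    scored = [s for s in scores if not s.first_party]
    display_only = [s for s in scores if s.first_party]
    return scored, display_only

def overall(scores) -> float:
    """Unweighted mean over benchmarks that are allowed to affect the ranking."""
    scored, _ = split_scores(scores)
    if not scored:
        raise ValueError("no independently constructed benchmarks to aggregate")
    return sum(s.score_pct for s in scored) / len(scored)

composer_scores = [
    BenchmarkScore("CursorBench v3.1", 63.2, first_party=True),
    BenchmarkScore("SWE-Bench Multilingual", 79.8, first_party=False),
]

scored, display_only = split_scores(composer_scores)
print("Shown for reference only:", [s.benchmark for s in display_only])
print(f"Overall (excludes display-only): {overall(composer_scores):.1f}")
```

The design choice worth copying is that the first-party score stays visible in the output while contributing nothing to the number used for ranking.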
A benchmark whose construction, harness, and scoring are entirely in the hands of the model author is not a measurement; it is a self-description. Treat it as informative context, not independent evidence. Adjust the confidence interval you place on any vendor-published score by asking who designed the test before asking what it measured.
Digital Applied synthesis, May 18, 2026
09 — Literacy Checklist
The 10-point vendor-benchmark literacy framework applied to CursorBench v3.1.
The following framework is designed to be applied to any vendor-published benchmark before a score influences a procurement decision or architecture choice. CursorBench v3.1 is the worked example. The “Compare: SWE-Bench Pro” column shows how the independent alternative answers the same question. This checklist is adapted from governance-transparency frameworks used in independent benchmark evaluation practice — see the cost-per-successful-task metric analysis for the complementary cost-adjusted evaluation dimension.
1. Is the test set publicly downloadable?
CursorBench v3.1: No. Not available on github.com/getcursor or Hugging Face. You cannot inspect task distribution, difficulty calibration, or repository coverage.
Compare: SWE-Bench Pro: Yes. The 731-task public subset is downloadable from the Scale Labs leaderboard.
Why this matters: Without access to the test set, you cannot verify that it is representative of your workload or that task selection is unbiased.
2. Is the evaluation harness open-source?
CursorBench v3.1: No. The scoring harness is not publicly documented or available. Harness choices commonly account for 10–20-point score swings on identical model weights.
Compare: SWE-Bench Pro: Yes. The harness inherits from the open-source SWE-Bench Princeton harness.
Why this matters: The harness is half the score. Without it, the number is not reproducible.
3. Can third parties reproduce the score?
CursorBench v3.1: No. No independent reproduction confirmed as of May 2026. Without a public test set and open harness, reproduction requires Cursor's cooperation.
Compare: SWE-Bench Pro: Yes. Independent teams can run the public subset.
Why this matters: A non-reproducible score is a self-reported claim. Reproducibility is the line between a measurement and marketing.
4. Is contamination prevention disclosed and verifiable?
CursorBench v3.1: Partial. Cursor Blame sourcing from production sessions is described conceptually, which addresses contamination. But without seeing the tasks, you cannot verify that the sourcing method actually produces uncontaminated instances.
Compare: SWE-Bench Pro: Better. The held-out subset is not released publicly, reducing contamination risk.
Why this matters: Contamination can inflate scores 5–15+ points. Disclosure without verifiability is still a transparency gap.
5. Are tasks representative of real-world workloads?
CursorBench v3.1: Unknown. Tasks come from Cursor's user base, which may not represent your codebase, language mix, or task type distribution. The production-session sourcing is a strength; the opaque distribution is a limitation.
Compare: SWE-Bench Pro: Partial. 41 professional repos across multiple domains, but still Python-centric.
Why this matters: A benchmark optimized for Cursor's user base may not predict performance on your team's workloads.
6. Is the comparison harness identical across all tested models?
CursorBench v3.1: Unknown. Cursor does not publish whether the same harness configuration (iteration budget, context management, tool-call format) is applied identically to all 7 models on the leaderboard. If Composer 2.5 runs with a better-tuned harness than Opus 4.7, the comparison is not valid.
Compare: SWE-Bench Pro: Partial. Teams run their own harnesses on the shared test set.
Why this matters: Cross-model comparisons are invalid if the harness is not held constant.
7. Is score consistency (variance across runs) disclosed?
CursorBench v3.1: No. Cursor does not publish variance across runs, random seeds, or prompt variations. A 63.2% point estimate without a confidence interval carries less information than one with error bars.
Compare: SWE-Bench Pro: Partial. Most published Pro scores are single-run point estimates without variance disclosure.
Why this matters: If the score varies ±3 points across runs, a 1.6-point gap between Opus 4.7 and Composer 2.5 may not be statistically meaningful.
8. Is there a peer-reviewed companion paper?
CursorBench v3.1: No. The benchmark is documented in a company blog post (Naman Jain, March 11, 2026). No peer-reviewed paper has been submitted or published as of May 2026.
Compare: SWE-Bench Pro: Yes. arXiv:2509.16941, Scale AI, November 2025.
Why this matters: Peer review provides independent methodological scrutiny that a blog post does not.
9. Is the leaderboard updated to reflect competitor improvements?
CursorBench v3.1: Unknown. The leaderboard at BenchLM shows 7 models. It is unclear whether Cursor proactively re-evaluates competitors when they release new models or whether competitor scores are updated only when Cursor runs a new benchmark version.
Compare: SWE-Bench Pro: Partial. Teams self-submit results; Scale Labs does not run competitor evaluations.
Why this matters: A leaderboard that lags competitor updates systematically overstates the benchmark publisher's model performance.
10. Does the vendor publish both wins AND losses?
CursorBench v3.1: Partially. To Cursor's credit, Opus 4.7 (Adaptive) at 64.8% does appear above Composer 2.5 at 63.2% on the public leaderboard. Cursor does not hide this. But the instance count, harness details, and task distribution are not disclosed, so you cannot assess where Composer 2.5 specifically underperforms.
Compare: SWE-Bench Pro: Better. Held-out instances are not published, but the public subset is fully open.
Why this matters: A vendor that selectively reports wins introduces survivorship bias into the leaderboard narrative.
Across the 10 questions, CursorBench v3.1 scores: 5 “No”, 3 “Unknown”, 1 “Partial”, and 1 “Mixed” (question 10, answered “Partially” above). The only clear positive sits inside question 10: Cursor does publish competitor scores above its own model. That is a meaningful credibility signal. It is also why this analysis treats CursorBench as optimistic rather than fraudulent. The structural problem is not intent; it is governance architecture.
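Because the checklist is meant to be reused, it can help to keep it as data rather than prose. The sketch below encodes the ten CursorBench v3.1 answers from this section and tallies them; the shortened question wording and the `CHECKLIST` name are illustrative choices, and question 10 is recorded as "Mixed" to match the tally rather than the "Partially" wording in the checklist itself.

```python
from collections import Counter

# (question, CursorBench v3.1 answer) pairs, in checklist order.
CHECKLIST = [
    ("Test set publicly downloadable?", "No"),
    ("Evaluation harness open-source?", "No"),
    ("Third parties can reproduce the score?", "No"),
    ("Contamination prevention disclosed and verifiable?", "Partial"),
    ("Tasks representative of real-world workloads?", "Unknown"),
    ("Harness identical across all tested models?", "Unknown"),
    ("Variance across runs disclosed?", "No"),
    ("Peer-reviewed companion paper?", "No"),
    ("Leaderboard updated for competitor improvements?", "Unknown"),
    ("Vendor publishes both wins and losses?", "Mixed"),
]

tally = Counter(answer for _, answer in CHECKLIST)

for answer, count in tally.most_common():
    print(f"{answer}: {count}")
```

Swapping in a second list of answers for another benchmark gives a side-by-side governance profile without rewriting the questions.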
The same checklist applied to Aider polyglot produces a different profile: “No” on peer review and independent reproduction, but “Yes” on open harness — a better governance position than CursorBench. Aider is also vendor-published, but its open-source harness means the task set and scoring methodology can be independently verified even if Aider's team controls which tasks are included. That distinction matters. Apply the same checklist to any benchmark you read — see our Composer 2 vs Opus 4.6 benchmark coverage for a prior-generation example of how CursorBench scores were used in an earlier launch narrative.
10 — Verdict
How to read CursorBench v3.1 as one signal among many.
CursorBench v3.1 is not a benchmark to ignore. The methodology gap it is filling is real, the four-dimension framework is more representative of production coding agent quality than binary patch-pass scoring, and Cursor's decision to publish a leaderboard where a competitor model (Opus 4.7) leads its own model (Composer 2.5) is a meaningful credibility signal. A vendor willing to show that it ranks second on its own benchmark is applying more honesty than a vendor that publishes a benchmark only when its model wins.
The appropriate calibration is to treat CursorBench v3.1 as an informative signal with structural limitations, not as a definitive independent measurement. Concretely: use it to understand how Cursor characterizes the quality of its own and competitor models in the context of Cursor-native workflows. Weight it alongside SWE-Bench Pro (independent, harder, 1,865 tasks), Terminal-Bench 2.0 (independently governed by Stanford), and the SWE-Bench Live leaderboard for a benchmark portfolio that covers multiple dimensions without over-relying on any single vendor-controlled signal.
For teams running an AI transformation engagement that includes coding-agent selection, the operational implication is direct: do not select or deselect a coding agent based on CursorBench v3.1 scores alone. Treat the scores as a starting hypothesis, run internal evaluations on your own codebase, and use cost-per-successful-task on your actual task distribution as the decision metric. The cost-per-successful-task metric framework provides the operational scaffolding for that calculation.
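As a rough illustration of that decision metric, the sketch below computes cost per successful task from an internal evaluation run. The definition used here (total spend, failed attempts included, divided by the number of successes) is one common reading and is an assumption; the agent names and all figures are hypothetical placeholders, not measurements of any real product.

```python
def cost_per_successful_task(total_cost_usd: float, tasks_succeeded: int) -> float:
    """Total spend, failed attempts included, divided by the number of successes."""
    if tasks_succeeded <= 0:
        return float("inf")  # nothing succeeded, so every success is infinitely costly
    return total_cost_usd / tasks_succeeded

# Hypothetical results from running two agents on the same 50 internal tasks.
runs = {
    "agent_a": {"total_cost_usd": 12.00, "attempted": 50, "succeeded": 30},
    "agent_b": {"total_cost_usd": 70.00, "attempted": 50, "succeeded": 35},
}

for name, r in runs.items():
    cpst = cost_per_successful_task(r["total_cost_usd"], r["succeeded"])
    success_rate = r["succeeded"] / r["attempted"]
    print(f"{name}: {success_rate:.0%} success, ${cpst:.2f} per successful task")
```

In this made-up example the agent with the higher success rate is also five times more expensive per success, which is the kind of trade-off a single leaderboard percentage cannot show.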
The forward-looking trajectory is worth tracking. As CursorBench matures, three changes would meaningfully improve its epistemic status: (1) publishing the instance count and task distribution summary without releasing the individual tasks; (2) open-sourcing the harness configuration used across all tested models; and (3) engaging Scale AI, Stanford, or another independent party to run a subset of tasks as a spot-check reproduction. None of these would require releasing the proprietary production-session data. They would move CursorBench from “display-only” toward “informative with independent validation” — a meaningfully better governance position.
Vendor benchmarks are optimistic, not fraudulent. Adjust accordingly.
The methodology gap CursorBench exploits is real. SWE-Bench Verified is contaminated — OpenAI's own audit found 59.4% of the 138 problems o3 failed have flawed test cases, and frontier models can reproduce gold patches from memory. OpenAI stopped reporting Verified entirely. Cursor's response — build a fresher benchmark sourced from real production sessions via “Cursor Blame,” score multiple dimensions of agent quality, and refresh the suite regularly — is a defensible methodological answer to a genuine problem. The execution failure is governance: the publisher is also the model author, the instance count is undisclosed, the test set is not downloadable, and the harness is not open-source.
BenchLM's “display-only” treatment is the right institutional response. An independent rating agency cannot assign scoring weight to a benchmark whose construction, harness, and scoring are fully vendor-controlled without undermining the independence that gives its ratings value. Mirror this discipline in your own evaluations: show CursorBench v3.1 numbers as context, but do not let them be determinative in production model selection decisions. The same principle applies to Aider polyglot, any vendor-published SWE-Bench Multilingual score, and internal Sourcegraph evaluations — wherever the benchmark publisher and the model author are the same entity, structural self-interest is present whether or not the intent is honest.
The 10-point literacy checklist is the takeaway artifact. Apply it to every benchmark score you read in a model release announcement. CursorBench v3.1 scores 5 “No,” 3 “Unknown,” 1 “Partial,” and 1 “Mixed” — which means it is a self-reported claim with four specific structural limitations, not an independent measurement. Vendor benchmarks are not fraudulent by default. They are optimistic by structure. Reading them with the right calibration — one signal among many, weighted by governance quality — is the practical skill.