AI Development

SWE-Bench Live Leaderboard Q2 2026: Deep Analysis

SWE-Bench Live Q2 2026 leaderboard analysis — what the scores actually predict, delivery velocity vs test pass rate, and why some top models underperform.

Digital Applied Team
April 13, 2026
11 min read
3 Benchmark Variants · Weekly Live Refresh · 500 Verified Tasks · 80%+ Top Tier

Key Takeaways

Live Means Live: SWE-Bench Live pulls fresh GitHub issues on a rolling cadence, which reduces training-set contamination but introduces higher variance from week to week than the frozen Verified and Pro splits.
Verified Is Quality-Curated, Pro Is Harder: Verified is a 500-task subset manually validated by the SWE-Bench team, and Pro is a harder variant with more cross-file, long-context tasks. Comparing the same model across the three variants can shift its score by twenty points or more.
The Harness Is Half the Score: Agent scaffolding around a model such as iteration budget, tool availability, reflection loops, and repository navigation heuristics can swing results by ten to twenty percentage points on identical underlying weights.
Scores Predict Scores: A 70 percent result on SWE-Bench Verified predicts other benchmark results, not how quickly your agency ships billable features. The two correlate loosely at best on real delivery work.
Use a Benchmark Portfolio: Pair SWE-Bench with Terminal-Bench for shell and infra tasks, MCP-Atlas for tool use, BrowseComp for web research, and OSWorld for computer-use workloads to cover more of the real work profile.
Q2 2026 Leaders Are Plural: No single model sweeps the top of every variant. Closed frontier models hold Verified and Pro leads while open-source entrants such as Nemotron 3 Super 120B and MiniMax M2.5 sit within striking range on specific splits.

SWE-Bench Live scores predict SWE-Bench Live scores — they don't predict how your agency's delivery velocity will change. That gap is where most tool selection goes wrong. A model can top the leaderboard on a Tuesday, land in your Claude Code or Cursor setup on Wednesday, and still take longer to close a real client ticket than the model you replaced. The benchmark is not the deliverable.

That is not an argument against benchmarks. SWE-Bench and its relatives are the best public signal we have for coarse model capability on software engineering tasks. But reading the leaderboard well takes more than looking at the top row. This guide walks through what SWE-Bench Live actually measures, how the Live, Verified, and Pro variants differ, what the Q2 2026 numbers show, and where benchmark position genuinely predicts real-world behaviour versus where it breaks down.

What SWE-Bench Live Actually Measures

SWE-Bench is an academic benchmark from Princeton-NLP that tests whether a language model can resolve real GitHub issues by producing a patch that passes the repository's test suite. The task format is simple on the surface: the model receives an issue description, the repository at the pre-fix commit, and a reference test. A run is scored as a pass if the model's patch makes the test green without breaking other tests.

SWE-Bench Live extends that harness with a rolling task selection policy. Rather than evaluating against a fixed split frozen at publication time, it draws fresh issues and pull requests from participating repositories on a regular cadence. The motivation is contamination control. Frontier models are trained on enormous web corpora that almost certainly include public GitHub. If a model has seen a specific issue and its fix during pretraining, its benchmark score reflects memorisation rather than capability. Live's rolling window narrows that risk.

What a Single SWE-Bench Task Contains
  • A human-written issue describing a bug or missing feature.
  • The repository snapshot at the pre-fix commit, with full git history.
  • One or more reference tests that should pass once the issue is resolved.
  • An evaluation harness that runs the model's proposed patch against the test suite in isolation.
  • Binary pass or fail scoring, aggregated across all tasks in the split.
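
The mechanics above can be sketched as a minimal harness. This is an illustrative outline only, not the official SWE-Bench evaluation code: the real harness runs each task in an isolated container, and the sketch assumes a pytest-based suite and a plain git checkout.

```python
import subprocess

def run_tests(repo_dir, test_args=()):
    # True iff pytest exits cleanly for the given test selection.
    result = subprocess.run(
        ["python", "-m", "pytest", *test_args], cwd=repo_dir
    )
    return result.returncode == 0

def score_task(repo_dir, patch_file, reference_tests):
    # Apply the model's patch to the pre-fix snapshot; a patch
    # that does not apply is scored as a fail.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False
    # Pass iff the reference tests go green AND the rest of the
    # suite does not regress.
    return run_tests(repo_dir, reference_tests) and run_tests(repo_dir)

def resolve_rate(results):
    # Binary scoring aggregated across all tasks in the split,
    # reported as a percentage.
    return 100.0 * sum(results) / len(results)
```

The binary aggregation is the part worth internalising: a near-miss patch that fixes the bug but breaks one unrelated test scores exactly the same as no patch at all.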

Why Python-Heavy Tasks Dominate

The original SWE-Bench split was drawn from twelve large Python repositories including Django, scikit-learn, and sympy. Live adds newer sources, but Python still dominates. That has implications for how well leaderboard position generalises to TypeScript-first, Go-heavy, or Rust-native engineering teams. A model tuned for Django refactors may land differently on a Next.js monorepo, and the benchmark will not tell you so directly.

Live vs Verified vs Pro: Benchmark Family Explained

The SWE-Bench family has grown into a small set of related splits, each optimised for a different evaluation goal. Confusing them is the most common cause of apples-to-oranges comparisons in model marketing.

Variant | What It Optimises | Task Count | Typical Use
SWE-Bench (full) | Breadth across 12 Python repos | ~2,294 | Academic baseline
SWE-Bench Verified | Curated, manually validated subset | 500 | Default number in model releases
SWE-Bench Pro | Harder tasks, long-context, cross-file | Smaller, curated | Separation at the frontier
SWE-Bench Live | Rolling fresh tasks, contamination control | Rotating | Signal on recent capability

Verified is the number most model announcements quote. When a release post says "87 percent on SWE-Bench," it almost always means Verified. Pro and Live scores tend to be meaningfully lower on the same model. On the April 2026 cycle, top-tier models cluster between sixty and seventy percent on Pro while clearing eighty percent on Verified. MiniMax M2.5, for example, has been measured at 80.2 percent on Verified, and MiniMax M2.7 at 56.22 percent on Pro — same provider, different model and variant, very different number.

Leaderboard Q2 2026 Snapshot

Because Live refreshes continuously and harness reporting varies, we do not attempt a single ranked table here. Instead, the data points below are drawn from vendor release notes and provider documentation published through April 2026, and are grouped by what the score tells us rather than by raw position.

Documented Q2 2026 Data Points

  • MiniMax M2.5 — 80.2 percent on SWE-Bench Verified at its February 12, 2026 release. Notable for being an open-weight Chinese model competitive with closed frontier entrants on Verified. Detailed in our MiniMax M2.7 release guide.
  • MiniMax M2.7 — 56.22 percent on SWE-Bench Pro at its March 18, 2026 release. Positioned as a self-evolving successor to M2.5 with substantially lower cost.
  • NVIDIA Nemotron 3 Super 120B — 60.47 percent on SWE-Bench Verified, released March 10, 2026 as an open-source 120B-parameter MoE with 12B active weights.
  • Frontier closed models — Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Qwen 3.6 Plus, and MiMo V2 Pro have all been cited in the upper tier across Verified, Pro, and Live through Q2 2026, with exact scores varying by harness and refresh.

The pattern worth noting: no single model sweeps every variant. A provider that dominates Verified may sit mid-pack on Pro, and the Live ranking shifts as new tasks enter the pool. This plurality is healthy for the ecosystem, but it complicates the "just pick the top model" procurement strategy that a lot of teams default to.

Reading a Live Leaderboard Row

A typical row contains: model name, harness identifier, pass rate, refresh date, median tokens per task, and sometimes cost per successful task. The last three are where most of the actionable signal lives.

  • Harness identifier tells you which scaffold ran the model — critical for apples-to-apples comparison.
  • Median tokens per task reveals reasoning verbosity that directly affects production cost.
  • Cost per successful task is the closest public proxy to delivery economics and should beat pass rate for procurement decisions.

The Score-to-Delivery Gap

Here is the awkward truth. The correlation between SWE-Bench Verified rank and observed delivery velocity on a real agency project is loose. We have seen teams migrate from a model at seventy-five percent on Verified to one at eighty-five percent and ship slower the following quarter. We have also seen the reverse. The benchmark is measuring something real, but the something is not cleanly the same as "how quickly your engineers close tickets."

Why the Gap Exists

Benchmarks evaluate models on clean, well-specified tasks with reliable tests. Real agency work does not look like that. Client tickets are ambiguous. Context lives in Slack threads and Figma comments. Tests are missing, flaky, or misleading. Deploy pipelines require approvals. Design and code review add latency that is not in any model's latency figure. A model that is world-class on Verified can still be bottlenecked by everything around it.

What Benchmarks Measure
Controlled, isolated, binary
  • Single-task completion against a known test.
  • Clean repository state at a specific commit.
  • Well-specified issue descriptions.
  • Isolated sandbox with fresh dependencies.
What Delivery Requires
Messy, iterative, collaborative
  • Ambiguous tickets with missing context.
  • Long-lived branches with live conflict risk.
  • Human review, QA, design feedback.
  • Deploy approvals and rollback planning.

The practical implication for agencies is that a ten-point benchmark gap is usually smaller than the difference a better prompt library, a tighter harness, or a well-organised repository makes on the same model. Before switching models, check that the wrapper around the model is not the real bottleneck. Our AI coding tool adoption survey covers how teams actually distribute time between model and wrapper.

Systematic Biases in the Benchmark

Every benchmark encodes choices about what to measure, and those choices create blind spots. SWE-Bench and its variants are no exception. Understanding the biases is the difference between reading the leaderboard literally and reading it well.

Task Selection Bias

SWE-Bench selects issues that have clear reference tests. That self-selects for bugs with deterministic reproduction and feature requests with obvious success criteria. The entire category of "make this UI feel right" work, which is a huge share of agency delivery, is essentially invisible. A model that excels at visually subjective interface polish will score no differently from one that cannot render a sensible layout, because neither gets tested on it.

Repository Type Bias

The repositories chosen for SWE-Bench skew toward mature open-source libraries with strong test coverage, linear history, and coherent coding conventions. Most agency work involves client codebases that are none of those things. Inherited monoliths, partially-tested greenfield builds, and multi-service monorepos all have structural characteristics the benchmark does not capture.

Language Bias

Python dominates. That matters because models trained on Python-heavy code sometimes underperform when the target language is TypeScript, Go, Rust, or Elixir. Live has widened the source pool somewhat, but the Python skew persists in Verified and Pro. Agencies that live in the JavaScript-TypeScript ecosystem should weight benchmark position accordingly and supplement with JS-native evaluations such as Terminal-Bench test runs or in-house harnesses.

Harness Effects

The single biggest source of between-score variance on SWE-Bench is not the model — it is the harness wrapped around it. A harness is the agent scaffold that turns a raw completion model into something that can navigate a repository, run tests, and iterate. Different harnesses produce meaningfully different scores on the same underlying weights.

Harness Knobs That Move the Score

  • Iteration budget — how many round-trips the agent can take before giving up. Raising this from five to twenty commonly adds several points.
  • Tool availability — whether the agent can execute shell commands, read the file system, search dependencies, or run tests incrementally.
  • Reflection loops — whether the agent reviews its own output before submitting. A single well-timed reflection step can add five points on hard tasks.
  • Repository navigation heuristics — whether the harness uses code-aware search, language-server signals, or naive grep. Better navigation narrows the effective context.
  • Test-driven iteration — whether the agent runs the reference test, reads the failure, and iterates, or produces a one-shot patch.
  • Error-recovery strategy — whether the harness can recognise when it has wandered off-task and restart.
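
As a concrete sketch, the knobs above can be collected into one configuration that drives a test-driven agent loop. Everything here is hypothetical: the field names are not any real harness's schema, and `model.propose/revise/review` and `task.run_reference_test` are stand-ins for whatever interfaces a real scaffold exposes.

```python
from dataclasses import dataclass

@dataclass
class HarnessConfig:
    # Illustrative knobs mirroring the list above.
    iteration_budget: int = 20      # round-trips before giving up
    shell_access: bool = True       # can the agent execute commands?
    reflection: bool = True         # self-review before submitting
    navigation: str = "code-aware"  # vs naive "grep"
    test_driven: bool = True        # run the test, read the failure, iterate

def solve(task, model, cfg):
    # Skeleton of the loop the knobs control.
    patch = model.propose(task)
    if cfg.test_driven:
        for _ in range(cfg.iteration_budget):
            failure = task.run_reference_test(patch)
            if failure is None:
                break                       # reference test is green
            patch = model.revise(patch, failure)
    if cfg.reflection:
        patch = model.review(patch)         # one reflection pass
    return patch
```

Flipping `test_driven` off collapses the whole loop into a one-shot patch, which is exactly why two published scores for the same weights can sit fifteen points apart.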

The practical consequence is that a fifteen-point score difference between two public numbers may reflect nothing about the underlying models and everything about the harnesses that ran them. The Claude Code versus Codex versus Jules matrix goes deeper into how individual agent harnesses differ in production.

Apples-to-Apples Comparison Checklist
  • Same split (Live, Verified, or Pro) on both models.
  • Same refresh date, not a score from three refreshes ago.
  • Same harness, or at least the same iteration budget.
  • Same sampling strategy and effort level.
  • Documented tool availability.
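
The checklist translates directly into a guard you can run before trusting any head-to-head number. The field names below are assumptions about how a leaderboard row might be recorded, not a published schema.

```python
COMPARABILITY_FIELDS = (
    "split",         # Live, Verified, or Pro
    "refresh_date",  # same refresh, not three refreshes ago
    "harness",       # or at least the same iteration budget
    "sampling",      # sampling strategy and effort level
    "tools",         # documented tool availability
)

def comparable(row_a, row_b):
    # Only compare two scores when every comparability field matches.
    return all(row_a.get(f) == row_b.get(f) for f in COMPARABILITY_FIELDS)
```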

When Benchmark Scores DO Predict Real-World Performance

The previous sections argued that leaderboard position is a noisy signal for delivery velocity. That is not the same as saying it is useless. There are conditions under which SWE-Bench Verified, Pro, and Live scores become genuinely predictive.

Scores Predict When the Work Profile Matches

If your team primarily ships bug fixes and small feature additions to mature Python or JavaScript libraries, SWE-Bench scores map reasonably well. The further your work drifts from that profile — into greenfield product work, design-heavy interface builds, data pipeline integration, or complex multi-service orchestration — the weaker the correlation becomes.

Scores Predict at Tier Boundaries

The difference between a model scoring forty percent and a model scoring seventy percent on Verified is usually real and consequential. The difference between a model at seventy-eight and a model at eighty-three is often within the noise floor of how the harnesses were configured. Treat the leaderboard as a tiering tool, not a ranking tool. Cut-offs at ten-point boundaries work better than chasing the absolute top row.
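
The noise-floor intuition has a quick back-of-envelope check. Treating each of the 500 Verified tasks as an independent pass/fail outcome, the sampling error alone on a measured pass rate is:

```python
import math

def ci_half_width(pass_rate, n_tasks=500, z=1.96):
    # Normal-approximation 95% confidence half-width for a pass
    # rate measured on n_tasks binary outcomes.
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)

# At 80% on 500 tasks the interval is roughly +/-3.5 points --
# before any harness-configuration variance is layered on top.
noise = ci_half_width(0.80)
```

Two models five points apart on the same 500-task split can therefore have overlapping intervals, which is why tier cut-offs are a sounder reading than row-by-row ranking.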

Scores Predict Cost-Per-Success Better Than Velocity

When leaderboards publish median tokens per task alongside pass rate, the derived "cost per successful task" is a far cleaner signal for budgeting than raw pass rate. A model that passes seventy-five percent of tasks at a third of the tokens of a seventy-eight percent competitor will often win the procurement decision, even though it sits lower on the leaderboard. For broader context on this tradeoff, see our performance-versus-price efficient frontier analysis.
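
The derivation is one line: expected spend per resolved task is the cost of an average attempt divided by the probability an attempt succeeds. The token counts and the $3-per-million price below are illustrative assumptions, not published figures.

```python
def cost_per_success(tokens_per_task, usd_per_mtok, pass_rate):
    # Expected dollars per *resolved* task, not per attempt.
    return tokens_per_task / 1e6 * usd_per_mtok / pass_rate

# The scenario from the paragraph above: a 75% model spending a
# third of the tokens of a 78% competitor, at equal assumed prices.
cheaper = cost_per_success(30_000, 3.0, 0.75)   # $0.12 per success
pricier = cost_per_success(90_000, 3.0, 0.78)   # ~$0.35 per success
```

Despite sitting three points lower on the leaderboard, the verbose model costs nearly three times as much per shipped fix under these assumptions.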

How Agencies Should Read the Leaderboard

For agency decision-makers, the leaderboard is a filter, not a decision. A practical reading process looks like this.

Step 1: Shortlist by Tier

Draw a line at your capability tier (for most agencies, the seventy-percent-plus tier on Verified) and shortlist every model above it. Exact ordering inside the tier is rarely decisive.

Step 2: Filter by Profile

Eliminate models whose strengths do not match your work profile. A top-Pro scorer that trails on Terminal-Bench is the wrong choice for an infra-heavy agency. A top-Verified scorer with weak MCP-Atlas results is the wrong choice for a tool-use-heavy consultancy.

Step 3: Local Evaluation

Run a short, representative evaluation against the shortlist. Three to five real client tasks measured on completion, token spend, and human edit distance will tell you more than any public benchmark.
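
Of the three metrics, human edit distance is the least standardised. One cheap proxy, sketched here with the standard library, is the similarity between the patch the model produced and the patch that actually shipped:

```python
import difflib

def shipped_similarity(model_patch, shipped_patch):
    # 1.0 means the model's patch shipped untouched; values near
    # 0 mean a human effectively rewrote it before merge.
    return difflib.SequenceMatcher(None, model_patch, shipped_patch).ratio()
```

Averaged over even a handful of real tickets, this number separates "the model did the work" from "the model produced a starting point" far more directly than any public pass rate.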

Step 4: Factor the Harness

Before switching models, confirm the harness you are running inside is not the real bottleneck. A tighter prompt library, a faster repository indexer, or a better reflection step frequently beats a model upgrade.

Step 5: Re-Run Quarterly

The Q2 2026 leaderboard will not be the Q3 2026 leaderboard. Treat model selection as a quarterly review rather than a one-off decision.

Alternative Benchmarks Worth Tracking

SWE-Bench alone is an incomplete view. Each of the benchmarks below covers a different slice of real engineering work and, taken together, they give a much more honest picture of what a model will actually feel like in a production harness.

Terminal-Bench
Shell and infrastructure work

Evaluates command-line competence, infrastructure scripts, and environment configuration — the kind of work that happens outside an IDE. Complements SWE-Bench for agencies running DevOps-heavy engagements.

MCP-Atlas
Tool use at scale

Tests whether a model can coordinate many tool calls across a Model Context Protocol server fleet without losing coherence. Predictive for agents embedded in CRMs, analytics stacks, and multi-service orchestration.

BrowseComp
Agentic web research

Measures how well a model navigates the open web, extracts and synthesises information, and terminates research loops. Relevant for any workload involving competitive research, content briefs, or live data gathering.

OSWorld
Computer-use agents

Tests a model's ability to operate a real desktop via screenshots and input events. Increasingly important as agencies deploy agents that drive CRMs, dashboards, and SaaS applications without API coverage.

The broader lesson is that a benchmark portfolio beats a single benchmark. Our 20-platform agentic coding tools matrix and the Nemotron 3 Super 120B release analysis both lean on multi-benchmark readings rather than SWE-Bench alone.

Conclusion

SWE-Bench Live, Verified, and Pro are the most useful public leaderboards we have for frontier coding models. They are also partial views. Reading them well means understanding what they measure, what they miss, how the harness distorts the numbers, and where your agency's work profile diverges from the benchmark's implicit model of "engineering."

For Q2 2026 procurement decisions, the honest answer is that several models occupy the upper tier and no single one dominates every split. Use the leaderboard to shortlist, use your own evaluation to decide, and revisit the decision every quarter because the ordering will change.

Pick the Right Model for Your Actual Work

Leaderboards filter the noise; local evaluation makes the call. We help agencies and platform teams translate benchmark data into model-selection decisions that match real client delivery.


