FrontierMath v2 landed on June 12, 2026, and the headline most outlets ran with was the leaderboard: Claude Fable 5 reaching roughly 88% on the benchmark’s hardest tier. The more important story is buried in the release note. An Epoch AI audit found small but critical errors in 42% of the original FrontierMath problems — meaning every “state of the art” claim made on the prior version was scored against a test that was wrong about two in five of its own questions.

That matters far beyond the math-benchmark community. Engineering teams pick models, justify budgets, and ship architectures partly on benchmark numbers. If the benchmark itself carried a 42% error rate, the procurement signal it sent was noisier than anyone treating those scores as ground truth assumed. The correction did not change who was winning — it changed how much confidence the numbers deserved.

This guide walks through what v2 actually changed, why a flagship benchmark had errors in nearly half its problems, what the post-correction score jump reveals about the nature of those errors, and a five-question checklist your team can apply to any vendor benchmark claim — not just FrontierMath.

Key takeaways

01
An audit found errors in 42% of the original problems. Epoch AI described them as small but critical errors, comparable to the error rates it notes for other major machine-learning benchmarks. This is a quality-assurance story, not a fraud one, but it invalidated prior absolute scores.
02
v2 has 338 problems, down from 350. It corrected 123 problems in Tiers 1-3 and 12 in Tier 4, and removed 5 from Tiers 1-3 and 7 from Tier 4, leaving a base set of 295 plus 43 Tier 4 expansion problems.
03
Scores rose across the board; rankings broadly held. Epoch AI reports that model rankings on v2 are similar but scores are higher across the board. That pattern strongly suggests the errors were penalizing correct answers, not making problems harder.
04
Fable 5 leads Tier 4; the top of Tiers 1-3 is a tie. Claude Fable 5 clearly leads Tier 4 at roughly 88%. On Tiers 1-3 it sits effectively level with GPT-5.5 Pro on a secondary leaderboard (87.0% versus 87.7%), well within the stated error bars.
05
Treat any benchmark score as a claim to verify. Ask who ran the eval, which version, what effort setting, how the test was verified, and whether rankings stay stable when the test is corrected. FrontierMath v2 is the case study for why each question matters.

01 · What Changed
v2 corrected 135 problems and removed twelve.

FrontierMath was established in November 2024 by Epoch AI, a non-profit research organization that tracks and forecasts AI progress. It was built with more than 60 mathematicians, whose contributors include holders of 14 International Mathematical Olympiad gold medals, with a Fields Medalist among the advisors. Every problem is designed to be “guessproof”: answers are large numbers or complex mathematical objects, leaving less than a 1% chance of a correct guess without genuine work. It was, by construction, one of the most rigorous evaluations in the field.

That rigor is exactly why the v2 correction is striking. Following an audit, Epoch AI corrected 123 problems in Tiers 1-3 and 12 in Tier 4, and removed a further 5 from Tiers 1-3 and 7 from Tier 4. The net result: the full dataset dropped from 350 problems (300 in Tiers 1-3 plus 50 in Tier 4) to 338 — a base set of 295 plus an expansion set of 43 Tier 4 problems. For a primer on how evaluations like this are constructed and scored, our guide to standard AI evaluation metrics sets the baseline.

Tiers 1-3 (base)

295 problems

123 corrected · 5 removed

The original base set of expert-crafted problems with known answers. Most of the v2 corrections landed here, where the bulk of the dataset lives. Down from 300 in v1.

down from 300

Tier 4 (expansion)

43 problems

12 corrected · 7 removed

The hardest tier, completed in June 2025 — each problem the output of a several-week research project by a mathematics professor or postdoc. Down from 50 in v1.

down from 50

Release snapshot

Epoch AI released FrontierMath v2 on June 12, 2026, announced in The Epoch Brief. The update corrected 135 problems in total and removed 12, leaving 338. Epoch AI characterizes the original error rate as “comparable to error rates in other major ML benchmarks like ImageNet” — context that frames this as routine evaluation maintenance rather than a benchmark failure.
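The counts in this section can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and the final calculation tests one possible reading of the 42% figure (corrected plus removed problems as a share of the v1 set). That reading is an assumption, not something the release note is quoted as stating.

```python
# v1 sizes and v2 changes, as reported in this article.
V1_SIZE = {"tiers_1_3": 300, "tier_4": 50}
CORRECTED = {"tiers_1_3": 123, "tier_4": 12}
REMOVED = {"tiers_1_3": 5, "tier_4": 7}

# v2 size per tier is the v1 size minus the removed problems.
v2_size = {tier: V1_SIZE[tier] - REMOVED[tier] for tier in V1_SIZE}

total_v1 = sum(V1_SIZE.values())
total_v2 = sum(v2_size.values())
total_corrected = sum(CORRECTED.values())
total_removed = sum(REMOVED.values())

# ASSUMPTION: "problems with errors" = corrected + removed, over the v1 total.
affected_share = (total_corrected + total_removed) / total_v1

print(v2_size, total_v2)
print(f"corrected={total_corrected}, removed={total_removed}")
print(f"affected share of v1: {affected_share:.0%}")
```

Under that assumption the numbers reconcile: 135 corrected plus 12 removed is 147 of 350 problems.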

02 · The Error Rate
Why a flagship benchmark had errors in 42% of problems.

The instinct on reading “42% of problems had errors” is to assume something went badly wrong. The more accurate reading is that verifying expert mathematics at scale is genuinely hard, and that AI itself made the audit possible. Early human quality reviews had flagged something closer to 1 in 20 problems — roughly 5% — needing corrections. It was a later AI-assisted audit that surfaced the much higher rate of small but critical errors across the set.

These are not the kind of errors a casual reader would spot. The problems were authored to be brutally hard. Terence Tao, the Fields Medalist, described the difficulty of solving them at all in stark terms, and Timothy Gowers, another Fields Medalist, said the problems he reviewed looked to be at a different level of difficulty from Olympiad problems. When a problem sits that far past the edge of human expertise, a subtle slip in the stated answer or a boundary condition can survive review for a long time.

"The first thing to understand about FrontierMath is that it's genuinely extremely hard. Almost everyone on Earth would score approximately 0%, even if they're given a full day to solve each problem."— Matthew Barnett, AI researcher

That difficulty is the point — and it is also why error rates this high are not unique to FrontierMath. Epoch AI itself draws the comparison to ImageNet, the benchmark that arguably launched the modern deep-learning era and which carried its own well-documented label errors for years. The lesson is not that FrontierMath was uniquely flawed; it is that even the most carefully constructed evaluations carry measurable error, and that error sets a floor on how precisely any single score can be trusted.

03 · What The Jump Reveals
Scores went up, rankings held: that is the tell.

Epoch AI’s own framing of the v2 results is the most analytically useful line in the release: model rankings on v2 are similar, but scores are higher across the board. Read carefully, that sentence tells you what kind of errors the audit found. If the corrections had simply removed easy problems or added harder ones, scores would have fallen. Instead they rose — which points to errors that were marking correct answers as wrong.

The clearest illustration is the same model measured before and after. GPT-5.5 at its xhigh effort setting scored roughly 35% on Tier 4 against the uncorrected set; on v2 the same model scores about 72.5%. A score that roughly doubles on an unchanged model is not a capability leap. It is more plausibly the benchmark no longer penalizing the model for answers it had right all along. The trajectory table below recomputes that delta alongside the other models with a before-and-after figure.

FrontierMath Tier 4 score trajectory by model, before and after the v2 error correction. Delta columns recomputed from the two figures stated in each row. Sources: The Epoch Brief, LM Council, The Decoder, OfficeChai, retrieved June 14, 2026. Secondary-leaderboard figures carry error margins.
Model	Window	Tier 4 (v1)	Tier 4 (v2)	Delta	Note
Claude Fable 5 (max)	Jun 2026	n/a	≈88%	—	Predecessor Opus 4.5 sat below 10% on Tier 4 in early 2026.
GPT-5.5 Pro (xhigh)	Jun 2026	n/a	≈78%	—	Trails Fable 5 on Tier 4 by roughly ten points (LM Council).
Google AI co-mathematician	May → Jun 2026	48%	≈75.6%	+27.6 pts	v1 48% ran without the compute caps applied to other models.
GPT-5.5 (xhigh)	Apr → Jun 2026	≈35.4%	72.5%	+37.1 pts	Same model, score roughly doubled; attributed mostly to the error-corrected set.
Claude Opus 4.8	Jun 2026	n/a	56.1%	—	LM Council Tier 4 figure, ±7.8 error margin.
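The Delta column can be reproduced directly from the two figures in each row. This is a minimal sketch using only the rows that report both a v1 and a v2 score; the function and variable names are hypothetical.

```python
# (model, Tier 4 v1 score, Tier 4 v2 score), taken from the table above.
ROWS = [
    ("Google AI co-mathematician", 48.0, 75.6),
    ("GPT-5.5 (xhigh)", 35.4, 72.5),
]


def delta_and_ratio(v1_score: float, v2_score: float) -> tuple[float, float]:
    """Return the change in percentage points and the v2/v1 ratio."""
    return round(v2_score - v1_score, 1), v2_score / v1_score


results = {name: delta_and_ratio(v1, v2) for name, v1, v2 in ROWS}

for name, (points, ratio) in results.items():
    print(f"{name}: +{points} pts, {ratio:.2f}x the v1 score")
```

The ratio is worth printing alongside the point delta: a 37-point move from a base of 35 is a very different claim from a 37-point move from a base of 60.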

Two things are worth holding lightly here. The v1 and v2 figures come from different reporting moments and, in the secondary cases, carry wide error margins — so the deltas are directional, not surgical. And the AI co-mathematician’s earlier 48% Tier 4 result was achieved without the computational caps applied to the other models, which makes it not strictly comparable. But the shape of the pattern holds: where a model has both a before and an after, the after is markedly higher, and the ordering survives.

That is the genuinely reassuring finding. A benchmark can carry a high error rate and still be useful for ranking models, provided the errors hit all models proportionally. What it cannot do is support confident claims about absolute capability — “the model solves 35% of the hardest math problems” was simply the wrong number, and no amount of hedging would have rescued it.
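A toy model makes the "proportional errors preserve rankings" argument concrete. The sketch assumes the simplest possible case, in which a fixed share of every model's correct answers is marked wrong by a faulty answer key. The scores and the error rate are hypothetical values chosen for illustration; this is not Epoch AI's analysis.

```python
def observed_score(true_score: float, key_error_rate: float) -> float:
    """Score reported when a fixed share of correct answers is marked wrong."""
    return true_score * (1.0 - key_error_rate)


def ranking(scores: dict[str, float]) -> list[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# HYPOTHETICAL true scores (percent) and answer-key error rate.
true_scores = {"model_a": 80.0, "model_b": 70.0, "model_c": 50.0}
KEY_ERROR_RATE = 0.4

observed = {
    name: observed_score(score, KEY_ERROR_RATE)
    for name, score in true_scores.items()
}

same_order = ranking(true_scores) == ranking(observed)
print(observed)
print("ranking preserved:", same_order)
```

Every observed score is well below its true value, yet the order is unchanged, because multiplying all scores by the same positive factor cannot reorder them. If faulty keys hit some models more than others, that guarantee no longer holds.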

04 · The Leaderboard
Where the frontier models actually land on v2.

Epoch AI’s newsletter reports that Claude Fable 5 holds the top spot on v2, at approximately 87% on Tiers 1-3 and 88% on Tier 4. A secondary leaderboard, LM Council, fills in the field below it. Read those two sources together and a careful picture emerges — one that is more interesting than a single “winner.” On Tier 4, Fable 5 leads clearly. On Tiers 1-3, the top is effectively a tie.

FrontierMath v2 · Tier 4 standings (secondary leaderboard)

Source: The Epoch Brief (Fable 5 top spot) + LM Council leaderboard (field figures, with error bars), June 2026

Claude Fable 5 (max): ≈88% (Tier 4 · LM Council 87.8% ±5.2)

GPT-5.5 Pro (xhigh): 78% (Tier 4 · LM Council 78.0% ±6.5)

AI co-mathematician: ≈75.6% (Tier 4 · LM Council 75.6% ±6.7 · uncapped on v1)

GPT-5.5 (xhigh): 72.5% (Tier 4 · LM Council 72.5% ±7.1)

Claude Opus 4.8: 56.1% (Tier 4 · LM Council 56.1% ±7.8)

Read the error bars

On Tiers 1-3, the LM Council leaderboard shows GPT-5.5 Pro at 87.7% ±1.9 and Claude Fable 5 at 87.0% ±2.0 — a 0.7-point gap inside overlapping error margins. Epoch AI’s own newsletter puts Fable 5 at the top. The honest read: the two are effectively tied at the top of Tiers 1-3, and Fable 5 leads Tier 4 by roughly ten points. LM Council is a secondary aggregator, not Epoch AI’s official ranking — attribute it as such.
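One rough way to read these margins is to compare the gap between two scores with their combined margin. The sketch below uses a root-sum-of-squares combination, a common rule of thumb that assumes the two margins are symmetric and independent. How LM Council actually computes its ± figures is not described here, so treat this as an assumption-laden screen rather than a significance test.

```python
import math


def gap_vs_combined_margin(
    score_a: float, margin_a: float, score_b: float, margin_b: float
) -> tuple[float, float, float]:
    """Return (gap, combined margin, gap / combined margin)."""
    gap = abs(score_a - score_b)
    # ASSUMPTION: independent, symmetric margins combined in quadrature.
    combined = math.hypot(margin_a, margin_b)
    return gap, combined, gap / combined


# Figures quoted from the LM Council leaderboard in this article.
tiers_1_3 = gap_vs_combined_margin(87.7, 1.9, 87.0, 2.0)  # GPT-5.5 Pro vs Fable 5
tier_4 = gap_vs_combined_margin(87.8, 5.2, 78.0, 6.5)     # Fable 5 vs GPT-5.5 Pro

for label, (gap, combined, ratio) in (("Tiers 1-3", tiers_1_3), ("Tier 4", tier_4)):
    print(f"{label}: gap {gap:.1f} vs combined margin {combined:.1f} (ratio {ratio:.2f})")
```

On these inputs the Tiers 1-3 gap is a small fraction of the combined margin, while the Tier 4 gap exceeds it, which matches the reading of a tie on the base tiers and a lead on Tier 4. The Tier 4 lead is not enormous relative to its margins, though, which is one more reason to validate on your own problems.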

The trajectory matters as much as the standings. As recently as early 2026, Claude Opus 4.5 scored below 10% on Tier 4; at FrontierMath’s launch in November 2024, all six models evaluated solved fewer than 2% of problems. Today’s top scores sit near 88%. That is more than a fortyfold improvement in well under two years — a pace that is precisely why Epoch AI expects the benchmark to saturate soon. If you are weighing Fable 5 against GPT-5.5 specifically, our Claude Fable 5 vs GPT-5.5 comparison goes beyond a single benchmark to cost and coding fit.

"There are easier math benchmarks that are already obsolete, several generations of them. FrontierMath will probably saturate within the next two years — could be faster."— Greg Burnham, Senior Researcher at Epoch AI

05 · Origin Story
The controversy that arguably forced the scrutiny.

FrontierMath did not arrive uncontested. OpenAI funded the benchmark’s development and had access to most of the dataset — all but a holdout set — under a verbal agreement not to use it for training. That arrangement became a flashpoint in December 2024, when OpenAI reported roughly 25% on the benchmark with its o3 model. Several mathematicians who had contributed problems said they were unaware at the time of OpenAI’s funding or its level of access.

Epoch AI later built in structural safeguards: it holds 20 Tier 4 problems back from OpenAI for evaluation, while OpenAI has access to the remaining problems, so that some portion of the test always stays uncontaminated. It is reasonable to draw a line from the contested launch through to the v2 audit: public scrutiny of a high-profile benchmark tends to invite exactly the kind of close inspection that surfaces a 42% error rate. Benchmark contamination and provenance are their own discipline; our guide to reading LLM leaderboards covers how training-data leakage distorts scores.

06 · The Bigger Picture
FrontierMath is not an outlier: benchmark error is systemic.

The most uncomfortable implication of v2 is that a 42% error rate is not extraordinary. Independent analyses of AI benchmarks have reported invalid-question rates ranging from a couple of percent on some math sets to roughly 40% on others, and annotation-error rates on certain text-to-SQL benchmarks have been reported above half the sampled questions. The specific figures vary by source and method, but the direction is consistent: the tests we rank models with carry meaningful, often unmeasured error.

For teams, the practical risk is not the benchmark error itself — it is the gap between a lab score and a production result. Real deployments introduce messy inputs, tool failures, latency budgets, and cost ceilings that no static benchmark captures. A model that tops a leaderboard can still underperform on your actual workload, which is why the better evaluation question is cost per successful task, not raw accuracy. Our piece on measuring AI performance beyond benchmarks makes that case, and our study of model accuracy and benchmark reliability shows how far reported numbers can drift from observed behavior.
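Cost per successful task is simple to compute once you have a per-attempt cost and a success rate from your own workload. The sketch below assumes failed attempts are retried independently and that detecting a failure costs nothing, so the expected number of attempts per success is 1 divided by the success rate. The prices and success rates are hypothetical placeholders, not measurements of any real model.

```python
def cost_per_successful_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to obtain one successful result.

    ASSUMPTION: attempts are independent and failures are retried, so the
    expected number of attempts per success is 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in the interval (0, 1]")
    if cost_per_attempt < 0.0:
        raise ValueError("cost_per_attempt must be non-negative")
    return cost_per_attempt / success_rate


# HYPOTHETICAL figures for two unnamed models on the same task set.
accurate_but_pricey = cost_per_successful_task(cost_per_attempt=0.50, success_rate=0.90)
cheaper_but_weaker = cost_per_successful_task(cost_per_attempt=0.20, success_rate=0.60)

print(f"accurate but pricey: ${accurate_but_pricey:.3f} per success")
print(f"cheaper but weaker:  ${cheaper_but_weaker:.3f} per success")
```

With these placeholder inputs the less accurate model is cheaper per successful task. The result flips if failures are expensive to detect or cannot be retried, so the retry assumption deserves as much scrutiny as the benchmark score.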

Capability reversal

Launch-day scores

<2%

In November 2024, all six evaluated models solved under 2% of FrontierMath problems. Top models now sit near 88% on the hardest tier — a complete reversal in well under two years.

Nov 2024 → Jun 2026

The human bar

MIT team average

19%

An MIT competition of roughly 40 exceptional undergraduates and experts averaged 19% on 23 Tier 1-3 problems, solving 35% collectively. Today's top models well exceed that level.

2025 human baseline

Guessproof by design

Odds of a lucky guess

<1%

Every problem uses large numerical answers or complex objects, leaving under a 1% chance of guessing correctly without genuine mathematical work. Scores reflect reasoning, not luck.

benchmark design

07 · A Practical Checklist
Five questions to stress-test any benchmark claim.

The durable takeaway from v2 is a habit, not a number. Before a benchmark score becomes a procurement input, run it through five questions. The matrix below applies them to FrontierMath v2 itself — and the same grid works on any vendor leaderboard you are handed.

A five-question decision matrix for stress-testing any AI benchmark claim, with red, amber, and green flag answers and how FrontierMath v2 scores against each. Derived from benchmark-integrity principles; FrontierMath v2 status drawn from Epoch AI and LM Council, retrieved June 14, 2026.
Question to ask	Red flag	Amber flag	Green flag	FrontierMath v2
Who ran the eval?	The vendor selling the model, with no third party.	Vendor numbers echoed by press, no independent run.	An independent lab or maintainer ran it.	Run by Epoch AI; cross-listed on LM Council.
Which version of the benchmark?	Unstated, or a version since corrected.	A named version, but you have to dig for it.	Version stated up front, with a changelog.	v2, released June 12, 2026, after the v1 audit.
What effort or mode setting?	No setting disclosed.	A setting named but not held constant across models.	Setting disclosed and matched across the field.	Reported per effort tier (max, xhigh) on LM Council.
How was the test itself verified?	No QA process described.	QA mentioned but no error rate published.	Published error rate and a correction process.	Audit found errors in 42% of problems; v2 corrects them.
Are rankings stable across versions?	Order changes when the test is fixed.	Order shifts within the error bars.	Order holds after corrections.	Rankings held; absolute scores rose across the board.

FrontierMath v2 scores green on most of this grid — which is the point. A benchmark that publishes its error rate, names the correction, and survives the version change with stable rankings is more trustworthy after the audit than before it. The benchmarks to distrust are the ones that never disclose a version, never publish a QA process, and quietly reshuffle when the test is fixed.
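For teams that want the grid in a reviewable form, the five questions can be encoded as a small data structure. This is a sketch under stated assumptions: the `Flag` enum, the `summarize` function, and the rule that any red flag blocks reliance on the score are illustrative choices, not part of Epoch AI's or LM Council's process. The example claim is hypothetical.

```python
from enum import Enum


class Flag(Enum):
    RED = "red"
    AMBER = "amber"
    GREEN = "green"


# The five questions from the matrix above.
QUESTIONS = (
    "Who ran the eval?",
    "Which version of the benchmark?",
    "What effort or mode setting?",
    "How was the test itself verified?",
    "Are rankings stable across versions?",
)


def summarize(answers: dict[str, Flag]) -> str:
    """Collapse five flags into one recommendation (an illustrative rule)."""
    missing = [q for q in QUESTIONS if q not in answers]
    if missing:
        raise ValueError(f"unanswered questions: {missing}")
    flags = [answers[q] for q in QUESTIONS]
    if Flag.RED in flags:
        return "do not rely on this score"
    if Flag.AMBER in flags:
        return "verify before relying on this score"
    return "usable as a starting hypothesis"


# HYPOTHETICAL vendor claim: self-reported, version unstated.
vendor_claim = dict.fromkeys(QUESTIONS, Flag.GREEN)
vendor_claim["Who ran the eval?"] = Flag.AMBER
vendor_claim["Which version of the benchmark?"] = Flag.RED

print(summarize(vendor_claim))
```

Requiring an answer to every question is deliberate: an unanswered question is itself a finding, and the function refuses to summarize until someone has looked.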

08 · What Teams Should Do
Turning a benchmark story into a buying discipline.

The v2 episode does not mean benchmarks are useless — it means they are evidence, not verdicts. Here is how that translates into how a team should actually choose and operate models.

Frontier reasoning

Hardest math and proof-style work

On v2 Tier 4, Claude Fable 5 leads at roughly 88%, ahead of GPT-5.5 Pro by about ten points. If your workload genuinely needs frontier mathematical reasoning, that gap is meaningful — but validate on your own problems, not the leaderboard.

Lead with Fable 5

General capability

Broad knowledge and Tier 1-3 reasoning

Fable 5 and GPT-5.5 Pro are effectively tied at the top of Tiers 1-3, within the error bars. At parity, decide on cost, latency, and ecosystem fit — not the 0.7-point benchmark difference.

Decide on cost & fit

Procurement signal

Treating a score as a decision input

Never act on an absolute score without its version, effort setting, and who ran the eval. The pre-v2 35% Tier 4 figure for GPT-5.5 was simply wrong; the correction was 37 points. Demand the same provenance from every vendor claim.

Verify before you buy

Production reality

What actually predicts deployment results

Benchmark accuracy is a weak proxy for production outcomes once tool use, messy inputs, and cost ceilings enter. Run your own eval on representative tasks and measure cost per successful task before committing a pipeline.

Run your own eval

The practical move for most teams is to stop treating leaderboards as answers and start treating them as starting hypotheses. Pull the two or three models a benchmark suggests for your task class, then run a small, representative evaluation on your own prompts — the work your product actually does — and measure not just accuracy but cost, latency, and failure modes. That is the exact shape of the comparative evaluations in our AI transformation engagements, where model choice is grounded in your workload rather than someone else’s headline. If the benchmark you are relying on is a coding one, our guide to AI coding benchmarks applies the same scrutiny to SWE-bench-style claims.
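A small evaluation of this kind does not need much machinery. The sketch below is one possible shape, assuming each candidate model can be wrapped as a plain Python callable that takes a prompt and returns a string. The `run_eval` function, the exact-match grading, and the stub model are all hypothetical stand-ins for whatever client and grading logic your product actually uses.

```python
import time
from collections import Counter
from typing import Callable


def run_eval(model: Callable[[str], str], tasks: list[tuple[str, str]]) -> dict:
    """Run (prompt, expected) pairs and report accuracy, latency, failures."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    failures: Counter = Counter()
    latencies: list[float] = []
    correct = 0
    for prompt, expected in tasks:
        start = time.perf_counter()
        try:
            answer, error = model(prompt), None
        except Exception as exc:  # record the failure mode, keep going
            answer, error = None, type(exc).__name__
        latencies.append(time.perf_counter() - start)
        if error is not None:
            failures[error] += 1
        elif answer.strip() == expected:  # ASSUMPTION: exact-match grading
            correct += 1
        else:
            failures["wrong_answer"] += 1
    return {
        "accuracy": correct / len(tasks),
        "mean_latency_s": sum(latencies) / len(latencies),
        "failures": dict(failures),
    }


def stub_model(prompt: str) -> str:
    """Stand-in for a real model client; behavior is invented for the demo."""
    if "timeout" in prompt:
        raise TimeoutError("simulated timeout")
    return "4" if prompt == "2 + 2" else "unknown"


tasks = [("2 + 2", "4"), ("3 + 3", "6"), ("timeout case", "n/a")]
report = run_eval(stub_model, tasks)
print(report)
```

Counting failures by type rather than as a single number is the useful part: a model that is wrong on a third of tasks and one that times out on a third of tasks have the same accuracy but call for different fixes.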

09 · Conclusion
A more honest number is worth more than a bigger one.

The shape of AI evaluation, June 2026

The benchmark that corrected itself is more trustworthy than the one that never did.

FrontierMath v2 is, on its surface, a leaderboard update — Claude Fable 5 at the top, scores up across the board. The more durable story is what the update revealed about evaluation itself: a flagship benchmark, built by dozens of elite mathematicians, carried small but critical errors in 42% of its problems, and the AI community was making model-selection calls on those numbers without knowing it.

The reassuring part is that the corrections did not move the rankings. Errors that hit all models proportionally still let a benchmark do its core job of distinguishing capability, even as they make any single absolute score untrustworthy. The pre-correction “35% on the hardest tier” was the wrong number; the right one was roughly double. No hedge would have rescued the original.

For teams building with AI, the takeaway is a habit rather than a model preference. Treat every benchmark score as a claim with a provenance — which version, which effort setting, who ran it, how it was verified — and reserve real confidence for the evaluation you run on your own workload. The benchmark that publishes its own error rate and survives the correction has earned more trust than the one that never looked.

FrontierMath v2: When AI Benchmarks Get Error-Corrected
