Frontier models in 2026 ship a reasoning_effort dial. The dial works — quality lifts 8-22 points across the curve. The dial also costs — fees inflate 4-17×, latency 5-60×. The economic question is no longer which model; it is which tier, picked per workload.
We ran 900 tasks across five frontier models and three effort tiers on math (AIME 2026 problems), code (Expert-SWE refactor), and analytic reasoning (GPQA Diamond). The crossover point — where higher effort starts costing more per correct answer than the quality lift earns — lands at a different tier for each workload. This piece publishes the data and the decision matrix.
Cost-per-correct-answer is the right unit. A 22-point pass-rate lift at 17× cost is a great deal on a hard math contest where the answer is binary; the same lift on a PR-scale review where humans edit anyway is a waste. The matrix in §07 maps nine common workflows to the right tier.
- 01 · High reasoning_effort lifts AIME pass-rate by 18-22 points across the frontier; medium lifts Expert-SWE by 11-14. Math reasoning shows the steepest curve — high effort earns out cleanly because the answer is verifiable and binary. Code reasoning peaks at medium for refactor tasks; high adds little. Analytic reasoning peaks in the medium-high band.
- 02 · Cost-per-correct-answer is the right metric; per-token rate misleads in both directions. DeepSeek V4 at high reasoning is cheaper per correct answer on AIME than GPT-5.5 Pro at medium — until you slice by topic. Cost-per-correct-answer changes the apparent ranking on every workload we tested. Per-token rate is the input, not the output.
- 03 · Latency tax is the underrated cost — time-to-first-token (TTFT) inflates 5-60× at high effort. On Claude Opus 4.7 with extended thinking, P50 TTFT rises from 0.8s (low) to 28s (high). For chat UX latency budgets, the high tier is unusable; for batch and async, irrelevant. Pick by workflow latency budget, not capability ceiling.
- 04 · Open-weight at high reasoning is cost-competitive with frontier at medium. DeepSeek V4 at high reasoning lands within 4-7 quality points of GPT-5.5 Pro at medium across our test suite, at 1/12 the cost. For workloads where the 4-7 point gap is acceptable, open-weight high-effort is the procurement floor.
- 05 · Don't pick the tier ceiling — pick the workload's quality bar and work backwards. The most common mistake is defaulting every reasoning workload to high effort because it sounds safer. Quality-bar reasoning (what pass-rate is genuinely required?) plus latency-budget reasoning will land most workflows at low or medium, 4-12× cheaper than the default.
01 — Methodology
The test harness.
Five frontier models (GPT-5.5 Pro, Claude Opus 4.7, Gemini 3 Pro Deep Think, Grok 4.5 Reasoning, DeepSeek V4) tested at three reasoning_effort tiers: low, medium, high. Each provider exposes the dial differently — OpenAI uses the explicit reasoning_effort parameter; Anthropic uses extended thinking budget; Google Deep Think uses thinking_budget; xAI Grok uses reasoning_mode; DeepSeek uses an internal CoT toggle. We normalised by approximate token-spend tier rather than vendor parameter name.
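For concreteness, the normalisation reduces to a lookup from (model, tier) to vendor-specific request kwargs. A minimal sketch: the parameter names follow the vendor knobs listed above, while every budget value and mode string below is an illustrative assumption, not a measured vendor default.

```python
# Normalised effort tiers keyed by approximate token-spend, not vendor
# parameter name. All budget values and mode strings are illustrative.
EFFORT_TIERS = {
    "gpt-5.5-pro": {                 # OpenAI: explicit reasoning_effort
        "low":    {"reasoning_effort": "low"},
        "medium": {"reasoning_effort": "medium"},
        "high":   {"reasoning_effort": "high"},
    },
    "claude-opus-4.7": {             # Anthropic: extended thinking budget
        "low":    {"thinking": {"type": "enabled", "budget_tokens": 2_000}},
        "medium": {"thinking": {"type": "enabled", "budget_tokens": 12_000}},
        "high":   {"thinking": {"type": "enabled", "budget_tokens": 48_000}},
    },
    "gemini-3-pro-deep-think": {     # Google: thinking_budget
        "low":    {"thinking_budget": 2_000},
        "medium": {"thinking_budget": 12_000},
        "high":   {"thinking_budget": 48_000},
    },
    "grok-4.5-reasoning": {          # xAI: reasoning_mode (values assumed)
        "low":    {"reasoning_mode": "minimal"},
        "medium": {"reasoning_mode": "standard"},
        "high":   {"reasoning_mode": "max"},
    },
    "deepseek-v4": {                 # DeepSeek: CoT toggle (values assumed)
        "low":    {"cot": False},
        "medium": {"cot": True},
        "high":   {"cot": True},     # high = CoT plus a larger sampling budget
    },
}

def request_kwargs(model: str, tier: str) -> dict:
    """Vendor-specific request kwargs for a normalised effort tier."""
    return EFFORT_TIERS[model][tier]
```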
Three task families, 60 problems each: 180 problems per model+effort cell, each run three times. Across the 15-cell grid that is 900 problem × model pairs and 8,100 individual runs. Pass-rate is computed as a majority vote across each problem's three runs.
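The scoring rule, as a minimal sketch in code:

```python
def majority_pass(runs: list[bool]) -> bool:
    """A problem passes in a cell if at least 2 of its 3 runs are correct."""
    return sum(runs) >= 2

def cell_pass_rate(results: list[list[bool]]) -> float:
    """Pass-rate for one model+effort cell: one inner list of three
    run outcomes per problem."""
    return sum(majority_pass(runs) for runs in results) / len(results)
```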
Math · AIME 2026
60 problems · binary answer · 3-run majority
American Invitational Math Exam 2026 (post-cutoff), verified by exact match. Picks up reasoning depth and self-correction; weak signal for shallow models.
Hardest reasoning floor

Code · Expert-SWE refactor
60 multi-file refactor tasks · pytest + integration tests
Real-world refactors drawn from open-source PRs outside every model's training cutoff. Pass = full test suite green after the model's edit. Our internal benchmark; the methodology is open-sourced.
Production-style code

Analysis · GPQA Diamond
60 graduate-level science questions · multiple-choice · 3-run majority
Graduate-level physics, chemistry, and biology; Diamond subset. Tests deep reasoning on novel scientific scenarios, with negative incentives for shortcuts.
Scientific reasoning

02 — AIME 2026
Math reasoning · steep quality curve.
Math is where reasoning_effort earns its keep. Across all five models, the low-to-high tier delta on AIME 2026 is 18-22 points. The chart below shows the per-tier pass-rate for each model.
AIME 2026 pass-rate · 5 models × 3 effort tiers
Source: Internal benchmark · 60 AIME 2026 problems · 3-run majority · April 2026

Two reads matter. First: the low-to-high curve is steeper on math than on any other family — 22 points on GPT-5.5 Pro, 18-22 across the board. The compute pays for itself in verifiable correctness. Second: DeepSeek V4 at high reasoning (84.2%) beats GPT-5.5 Pro at low (69.3%) and is competitive with all four frontier closed-source models at medium. The cost gap (15-30×) is substantial.
"Math reasoning is where the dial pays its rent. Code reasoning is where the dial is misused."— Internal eval retro, May 2026
03 — Expert-SWE
Code reasoning · medium is the sweet spot.
Code reasoning behaves differently from math. The marginal lift from medium to high is small (3-5 points across the frontier) and sometimes negative: extra reasoning on Expert-SWE refactors often produces over-engineered solutions that fail integration tests. Medium is the right default for production code workflows.
Expert-SWE refactor pass-rate · 5 models × 3 effort tiers
Source: Internal benchmark · 60 Expert-SWE refactor tasks · pytest + integration · April 2026

04 — GPQA Diamond
Analytic reasoning · medium-high band wins.
Graduate-level scientific reasoning sits between math and code in curve shape. Quality lifts cleanly from low to medium (12-15 points) and modestly from medium to high (3-7 points). Most analytic-reasoning workflows should sit in the medium-to-high band, with the exact tier set by latency budget.
GPT-5.5 Pro · GPQA Diamond · high
+15.2 points vs low. Steady curve through medium (74.1%) to high (78.4%). Strongest performer overall on analytic reasoning. The cost premium is rational on novel scientific tasks.
Best analytic frontier

Claude Opus 4.7 · GPQA Diamond · high
Strong on biology and chemistry; slightly behind on physics. Extended thinking adds 11.8 points over the default tier. A solid second choice for scientific analysis.
Biology · chemistry leader

Gemini 3 Pro Deep Think · GPQA Diamond · high
Multimodal advantage on questions with figures (12% of GPQA Diamond). The high Deep Think tier adds 13.4 points over the default. Right for vision-adjacent scientific tasks.
Multimodal advantage

DeepSeek V4 · GPQA Diamond · high
Strongest open-weight result; 11-15 points behind the closed frontier at the high tier. The CoT-enabled mode delivers most of the lift. Cost-per-correct-answer winner at scale.
Open-weight ceiling

05 — The Real Metric
Cost-per-correct-answer changes the ranking.
Quality and cost in isolation tell you nothing. The chart that matters is cost-per-correct-answer — total spend on a task family, divided by the number of correct answers. Below: cost-per-correct for AIME 2026 across the model+effort grid.
Cost-per-correct-answer · AIME 2026
Source: Internal benchmark · cost = total tokens × rate / correct-answer count · April 2026

The ranking inverts. GPT-5.5 Pro at high effort wins on raw pass-rate (91.7%) but lands at $0.78/answer — 19× the DeepSeek V4 high-effort cost ($0.04). For workloads where the 7.5 percentage points of extra correctness do not justify the cost (most internal workflows), DeepSeek V4 at high reasoning is the procurement floor.
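The caption formula as a function. A minimal sketch; the token counts and rate in the usage line are hypothetical, chosen only to land near the published $0.78 figure and show the shape of the calculation.

```python
def cost_per_correct(total_tokens: int, rate_per_mtok: float, correct: int) -> float:
    """Total spend on a task family divided by the number of correct answers."""
    if correct == 0:
        return float("inf")  # no correct answers: the metric diverges
    spend = (total_tokens / 1_000_000) * rate_per_mtok
    return spend / correct

# Hypothetical inputs, for shape only: 3M tokens at $14/Mtok, 54 correct
print(cost_per_correct(total_tokens=3_000_000, rate_per_mtok=14.0, correct=54))
```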
06 — Latency Tax
The latency tax is the third axis.
Cost and quality are two axes; latency is the third. Reasoning modes inflate TTFT 5-60× depending on model and tier. For chat UX workflows with sub-2-second latency budgets, high reasoning is unusable regardless of capability ceiling.
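A minimal sketch of picking the tier by latency budget first, using the worst case of each measured P50 TTFT band from the cards below; the function name is ours.

```python
# Worst-case P50 TTFT per tier (seconds), from the measured bands below.
TTFT_P50_WORST = {"low": 1.5, "medium": 12.0, "high": 90.0}

def highest_tier_within_budget(ttft_budget_s: float) -> str:
    """Most capable effort tier whose worst-case P50 TTFT fits the budget."""
    for tier in ("high", "medium", "low"):
        if TTFT_P50_WORST[tier] <= ttft_budget_s:
            return tier
    raise ValueError("no reasoning tier fits this latency budget")

assert highest_tier_within_budget(2.0) == "low"     # sub-2s chat budget
assert highest_tier_within_budget(30.0) == "medium"  # active-wait workflows
```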
Minimal · low effort
TTFT P50 0.4-1.5s across frontier. Right for chat UX, autocompletions, codemod, fast extraction. Pick this tier for anything user-facing under 2-second budget.
Chat UX · 0.4-1.5s

Medium effort
TTFT P50 4-12s across frontier. Right for code refactors, content briefs, and document analysis, where users are actively waiting but tolerant. Streaming output helps perceived latency.

Refactor · 4-12s

High effort
TTFT P50 18-90s across frontier. Right for batch jobs, async workflows, research analysis where the user submits and returns later. Unusable for sync chat.
Batch · 18-90s

07 — Decision Matrix
Workload to tier — nine common cases.
The matrix below maps nine workloads to the right effort tier based on the empirical pass-rate curves and cost-per-correct numbers. Use this as the starting policy, then measure against your specific quality bar.
Math contest / verifiable answers
High effort wins. Quality curve is steep, answer is binary, latency budget is generous. Default to GPT-5.5 Pro high or Claude Opus 4.7 high. DeepSeek V4 high if cost-bound.
High · GPT-5.5 Pro · $0.78

Multi-file code refactor
Medium wins. High effort regresses 3-5 points by over-engineering. Default to GPT-5.5 Pro medium or Claude Opus 4.7 medium. Latency budget tolerable in IDE.
Medium · Pro $0.42

PR-scale code review
Low effort wins. Humans edit the output anyway; reasoning quality marginal. Default to standard tier without extended thinking. Sonnet 4.6 or GPT-5.5 standard.
Low · Sonnet $0.12

Scientific / analytic research
Medium-high. Quality curve continues lifting through high but latency unbearable. Pick high for batch research, medium for interactive analysis sessions.
Medium-high · Opus $0.27

Long-document Q&A (cached)
Low-medium. Cache neutralizes input cost; output budget governs. Use medium for synthesis questions; low for direct extraction. Pick model by cache discount.
Low-medium · Gemini 3 cached

Customer-facing chat / live UX
Low effort, latency-bound. High and medium TTFT exceed UX budget. Default to standard tier with minimal reasoning. Stream output for perceived responsiveness.
Low only · TTFT-bound

Agentic outreach personalization
Low effort, volume-bound. At 50K+ emails/month the economics tip to DeepSeek V4 at minimal reasoning, $0.002/email. The quality bar is human acceptance, not factuality.
Low · V4 $0.002

Eval / benchmarking harness
Match production tier. The point of an eval is to mirror production conditions, not maximize capability. If production runs medium, eval runs medium.
Match prod tier

Novel research / hard analysis
High effort. The genuine novel-reasoning use case where the dial earns its rent. Batch-tolerant. GPT-5.5 Pro high or Opus 4.7 high; DeepSeek V4 high if cost-bound.

High · Pro $0.78

"Most teams default every workflow to high reasoning out of caution and pay 4-12× over the right tier. The cost is real; the quality lift is illusory."
— Internal procurement memo, May 2026
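As a starting policy, the matrix above reduces to a small routing table. A sketch under our own naming; per the conclusion, your telemetry should override these defaults.

```python
# Default tier per workload, mirroring the nine cases above.
TIER_POLICY = {
    "math_contest":             "high",
    "code_refactor":            "medium",
    "pr_review":                "low",
    "scientific_research":      "high",    # medium for interactive sessions
    "long_doc_qa_cached":       "medium",  # low for direct extraction
    "live_chat":                "low",
    "outreach_personalization": "low",
    "eval_harness":             None,      # match the production tier
    "novel_research":           "high",
}

def tier_for(workload: str, production_tier: str = "medium") -> str:
    """Starting tier for a workload; eval harnesses inherit production's tier."""
    tier = TIER_POLICY[workload]
    return production_tier if tier is None else tier
```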
08 — Conclusion
The dial is workload-specific — not a default.
Pick the tier per workflow. Measure cost-per-correct-answer. Don't default to high.
The reasoning_effort dial is a real tool with a real cost. The mistake we see most often is teams setting the dial to high once and forgetting it — paying 4-17× the right tier on workflows where the quality curve is flat. The corrective is a workload-by-workload policy, not a model-wide default.
The decision matrix above is the starting point. The actual policy for your stack is the result of measuring cost-per-correct-answer on your specific tasks against your specific quality bar — not the published benchmark. Build that telemetry into your AI ops stack as a first-class metric.
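One possible shape for that telemetry: a per-request record carrying the three axes plus the eventual outcome. Every field name here is an assumption.

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class ReasoningCallRecord:
    """Per-request telemetry for tier-policy review. Field names illustrative."""
    workload: str            # e.g. "code_refactor"
    model: str
    effort_tier: str         # "low" | "medium" | "high"
    total_tokens: int
    cost_usd: float
    ttft_s: float
    passed: bool | None = None   # filled in once the task outcome is known

def emit(record: ReasoningCallRecord) -> None:
    """Write one JSON line per call; aggregate offline into cost-per-correct."""
    print(json.dumps({"ts": time.time(), **asdict(record)}))
```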
We re-run this benchmark every quarter as new model tiers ship. Bookmark this page if you want the canonical reference; subscribe to the newsletter for the change log.