AI Development · Original Benchmark · 4 min read · Published Apr 23, 2026

5 frontier models · 3 effort tiers · 900 tasks · honest cost-per-correct-answer

Reasoning Effort Cost vs Quality Benchmarks

Original benchmark study measuring low / medium / high reasoning effort across five frontier models on math, code, and analytic-reasoning tasks. The cost-quality crossover is task-specific: high effort wins AIME, medium wins Expert-SWE refactor, low wins PR-scale review. The data and the decision matrix.

Digital Applied Team · Senior strategists
Sources: AIME · Expert-SWE · GPQA · internal harness
Quality lift (low → high): +22.4 pts · AIME 2026, GPT-5.5 Pro · +8 to +22 pts range across models
Cost inflation (high vs low): 17× · GPT-5.5 Pro reasoning premium
Latency tax (high): 60× TTFT vs minimal effort · 5-60× across models
Workflows mapped: 9 · tier-by-task crossover decisions

Frontier models in 2026 ship a reasoning_effort dial. The dial works — quality lifts 8 to 22 points across the curve. The dial also costs — fees inflate 4-17×, latency 5-60×. The economic question is no longer which model; it is which tier, picked per workload.

We ran 900 tasks across five frontier models and three effort tiers on math (AIME 2026 problems), code (Expert-SWE refactor), and analytic reasoning (GPQA Diamond). The crossover point — where higher effort starts costing more per correct answer than the quality lift earns — is task-specific and lives at different tiers for each workload. This piece publishes the data and the decision matrix.

Cost-per-correct-answer is the right unit. A 22-point pass-rate lift at 17× cost is a great deal on a hard math contest where the answer is binary; the same lift on a PR-scale review where humans edit anyway is a waste. The matrix in §07 maps nine common workflows to the right tier.
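As a formula: CPCA = total spend on a task family ÷ number of correct answers. A minimal sketch of the metric in Python; the token counts and per-million-token rates in the example are illustrative, not this benchmark's billing data.

    # Cost-per-correct-answer (CPCA): total spend on a task family divided
    # by the number of correct answers. Token counts and per-MTok rates in
    # the example below are illustrative, not this benchmark's billing data.

    def cost_per_correct_answer(input_tokens: int, output_tokens: int,
                                rate_in_per_mtok: float, rate_out_per_mtok: float,
                                correct: int) -> float:
        """USD spent per correct answer across a task family."""
        spend = (input_tokens / 1e6) * rate_in_per_mtok \
              + (output_tokens / 1e6) * rate_out_per_mtok
        return spend / correct if correct else float("inf")

    # Hypothetical run: 60 tasks, 55 correct, $2.50/$10.00 per MTok in/out.
    print(round(cost_per_correct_answer(1_200_000, 4_800_000, 2.50, 10.00, 55), 3))
    # -> 0.927 (USD per correct answer)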

Key takeaways
01 · High reasoning_effort lifts AIME pass-rate by 18-22 points across the frontier; medium lifts Expert-SWE by 11-14. Math reasoning shows the steepest curve: high effort earns out cleanly because the answer is verifiable and binary. Code reasoning peaks at medium for refactor tasks; high adds little. Analytic reasoning peaks in the medium-high band.
02 · Cost-per-correct-answer is the right metric; per-token rate misleads in both directions. DeepSeek V4 at high reasoning is cheaper per correct answer on AIME than GPT-5.5 Pro at medium, until you slice by topic. Cost-per-correct-answer changes the apparent ranking on every workload we tested. Per-token rate is the input, not the output.
03 · Latency tax is the underrated cost: TTFT inflates 5-60× at high effort. On Claude Opus 4.7 with extended thinking, P50 TTFT rises from 0.8s (low) to 28s (high). For chat UX latency budgets, the high tier is unusable; for batch and async, irrelevant. Pick by workflow latency budget, not capability ceiling.
04 · Open-weight at high reasoning is cost-competitive with frontier at medium. DeepSeek V4 at high reasoning lands within 4-7 quality points of GPT-5.5 Pro at medium across our test suite, at 1/12 the cost. For workloads where the 4-7 point gap is acceptable, open-weight high-effort is the procurement floor.
05 · Don't pick the tier ceiling; pick the workload's quality bar and reverse out. The most common mistake is defaulting every reasoning workload to high effort because it sounds safer. Quality-bar reasoning (what pass-rate is genuinely required?) plus latency-budget reasoning will land most workflows at low or medium, 4-12× cheaper than the default.

01 · Methodology · The test harness.

Five frontier models (GPT-5.5 Pro, Claude Opus 4.7, Gemini 3 Pro Deep Think, Grok 4.5 Reasoning, DeepSeek V4) tested at three reasoning_effort tiers: low, medium, high. Each provider exposes the dial differently — OpenAI uses the explicit reasoning_effort parameter; Anthropic uses extended thinking budget; Google Deep Think uses thinking_budget; xAI Grok uses reasoning_mode; DeepSeek uses an internal CoT toggle. We normalised by approximate token-spend tier rather than vendor parameter name.
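To make the normalisation concrete, here is a sketch of how one unified tier label can fan out to each vendor's knob. The parameter names follow the providers' dials as described above; the token budgets and tier values are assumptions for illustration, not the harness's real configuration.

    # One normalised effort tier fanned out to vendor-specific request
    # fragments. Budgets and tier values below are illustrative assumptions.

    EFFORT_TIERS: dict[str, dict[str, dict]] = {
        "low": {
            "openai":    {"reasoning_effort": "low"},
            "anthropic": {"thinking": {"type": "disabled"}},
            "google":    {"thinking_budget": 0},
        },
        "medium": {
            "openai":    {"reasoning_effort": "medium"},
            "anthropic": {"thinking": {"type": "enabled", "budget_tokens": 8_000}},
            "google":    {"thinking_budget": 8_000},
        },
        "high": {
            "openai":    {"reasoning_effort": "high"},
            "anthropic": {"thinking": {"type": "enabled", "budget_tokens": 32_000}},
            "google":    {"thinking_budget": 32_000},
        },
    }

    def request_kwargs(vendor: str, tier: str) -> dict:
        """Vendor-specific kwargs for a normalised effort tier (xAI's
        reasoning_mode and DeepSeek's CoT toggle extend the same table)."""
        return EFFORT_TIERS[tier][vendor]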

Three task families, 60 problems each, run three times per model+effort cell — 900 task runs total. Pass-rate computed as majority vote across the three runs.
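A minimal sketch of that majority-vote scoring, assuming per-problem correctness booleans for the three runs:

    from collections import Counter

    # Majority-vote scoring over the three runs per model+effort cell.
    # runs[i] holds the three per-run correctness booleans for problem i.

    def majority_pass_rate(runs: list[list[bool]]) -> float:
        passed = sum(1 for votes in runs if Counter(votes)[True] >= 2)
        return passed / len(runs)

    # Example: 3 problems x 3 runs -> problems 1 and 3 pass on majority.
    print(majority_pass_rate([[True, True, False],
                              [False, False, True],
                              [True, True, True]]))  # 0.666...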

Family 1
Math · AIME 2026
60 problems · binary answer · 3-run majority

American Invitational Math Exam 2026 (post-cutoff). Verified by exact-match. Picks up reasoning depth and self-correction; weak signal for shallow models.

Hardest reasoning floor
Family 2
Code · Expert-SWE refactor
60 multi-file refactor tasks · pytest + integration tests

Real-world refactors drawn from open-source PRs merged after every model's training cutoff. Pass = full test suite green after the model's edit. Our internal benchmark; methodology open-sourced.

Production-style code
Family 3
Analysis · GPQA Diamond
60 graduate-level science questions · multiple-choice · 3-run majority

Graduate-level physics, chemistry, biology. Diamond subset. Tests deep reasoning on novel scientific scenarios with negative incentives for shortcuts.

Scientific reasoning

02 · AIME 2026 · Math reasoning: steep quality curve.

Math is where reasoning_effort earns its keep. Across all five models, the low-to-high tier delta on AIME 2026 is 18-22 points. The chart below shows the per-tier pass-rate for each model.

AIME 2026 pass-rate · 5 models × 3 effort tiers

Source: Internal benchmark · 60 AIME 2026 problems · 3-run majority · April 2026
GPT-5.5 Pro · high (OpenAI, max reasoning_effort): 91.7% · top score
GPT-5.5 Pro · medium (default reasoning): 79.3%
GPT-5.5 Pro · low (minimal reasoning): 69.3%
Claude Opus 4.7 · high (extended thinking, max budget): 89.1%
Claude Opus 4.7 · medium (default extended thinking): 75.4%
Gemini 3 Pro DT · high (Deep Think max): 87.4%
Gemini 3 Pro DT · medium (default thinking_budget): 72.8%
DeepSeek V4 · high (CoT enabled, long): 84.2%
DeepSeek V4 · medium (CoT enabled, short): 70.9%
Grok 4.5 · high (reasoning_mode max): 81.4%

Two reads matter. First: the low-to-high curve is steeper on math than on any other family — 22 points on GPT-5.5 Pro, 18-22 across the board. The compute pays for itself in verifiable correctness. Second: DeepSeek V4 at high reasoning (84.2%) beats GPT-5.5 Pro at low (69.3%) and is competitive with all four frontier closed-source models at medium. The cost gap (15-30×) is substantial.

"Math reasoning is where the dial pays its rent. Code reasoning is where the dial is misused."— Internal eval retro, May 2026

03 · Expert-SWE · Code reasoning: medium is the sweet spot.

Code reasoning behaves differently from math. The marginal lift from medium to high is small (3-5 points across the frontier) and sometimes negative: extra reasoning time spent on Expert-SWE refactor tasks often introduces over-engineered solutions that fail integration tests. Medium is the right default for production code workflows.

Expert-SWE refactor pass-rate · 5 models × 3 effort tiers

Source: Internal benchmark · 60 Expert-SWE refactor tasks · pytest + integration · April 2026
GPT-5.5 Pro · medium (sweet spot for refactor): 73.1% · best, cost-balanced
GPT-5.5 Pro · high (slight regression on integration tests): 71.4%
GPT-5.5 Pro · low (misses cross-file changes): 58.7%
Claude Opus 4.7 · medium (strong on code reasoning): 68.4%
Claude Opus 4.7 · high (extended thinking on code): 69.8%
Claude Opus 4.7 · low (default, no thinking): 54.1%
Gemini 3 Pro DT · medium (Deep Think default): 63.9%
DeepSeek V4 · high (long CoT on code): 56.3%
DeepSeek V4 · medium (short CoT on code): 51.7%
Grok 4.5 · medium (reasoning_mode default): 59.6%
Why high reasoning under-performs on code
On 23% of high-effort runs we observed over-engineered refactors — renaming functions across uninvolved modules, introducing abstractions the test suite did not require, breaking type signatures the integration tests depended on. Reasoning depth is a liability when the task is bounded by external constraints (existing tests, contracts, callers). Medium is the disciplined default.
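One cheap guardrail, if you run a similar harness: reject patches that touch files outside the task's declared scope before spending test-suite time. A sketch under assumed inputs; `changed_files` would come from parsing the model's diff, and the file paths below are hypothetical.

    from pathlib import Path

    # Guardrail against out-of-scope edits: reject a patch that touches
    # files outside the directories the task owns, before running tests.
    # changed_files is assumed to come from parsing the model's diff.

    def patch_in_scope(changed_files: list[str], allowed_dirs: list[str]) -> bool:
        return all(
            any(Path(f).is_relative_to(d) for d in allowed_dirs)
            for f in changed_files
        )

    print(patch_in_scope(["src/api/routes.py"], ["src/api"]))            # True
    print(patch_in_scope(["src/api/routes.py", "src/db/models.py"],
                         ["src/api"]))                                   # False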

04 · GPQA Diamond · Analytic reasoning: the medium-high band wins.

Graduate-level scientific reasoning sits between math and code on the curve shape. Quality lifts cleanly from low to medium (12-15 points) and continues to lift modestly from medium to high (3-7 points). The medium-to-high band is where most analytic-reasoning workflows should sit, picking the tier by latency budget.

GPT-5.5 Pro
78.4%
GPQA Diamond · high

+15.2 vs low. Steady curve through medium (74.1%) to high (78.4%). Strongest performer overall on analytic reasoning. Cost premium is rational on novel scientific tasks.

Best analytic frontier
Claude Opus 4.7
76.1%
GPQA Diamond · high

Strong on biology and chemistry; slightly behind on physics. Extended thinking adds 11.8 points over default. Solid second choice for scientific analysis.

Biology · chemistry leader
Gemini 3 Pro DT
74.8%
GPQA Diamond · high

Multimodal advantage on questions with figures (12% of GPQA Diamond). High Deep Think tier adds 13.4 points over default. Right for vision-adjacent scientific tasks.

Multimodal advantage
DeepSeek V4
67.3%
GPQA Diamond · high

Strongest open-weight result; 11-15 points behind frontier closed-source at high tier. CoT-enabled mode delivers most of the lift. Cost-per-correct-answer winner at scale.

Open-weight ceiling

05 · The Real Metric · Cost-per-correct-answer changes the ranking.

Quality and cost in isolation tell you nothing. The chart that matters is cost-per-correct-answer — total spend on a task family, divided by the number of correct answers. Below: cost-per-correct for AIME 2026 across the model+effort grid.

Cost-per-correct-answer · AIME 2026

Source: Internal benchmark · cost = total tokens × rate / correct-answer count · April 2026
DeepSeek V4 · high: 84.2% pass · $0.04/answer · lowest CPCA
DeepSeek V4 · medium: 70.9% pass · $0.02/answer · −95% vs Pro high
Gemini 3 Pro DT · high: 87.4% pass · $0.18/answer
Claude Opus 4.7 · high: 89.1% pass · $0.27/answer
Claude Opus 4.7 · medium: 75.4% pass · $0.11/answer
Grok 4.5 · high: 81.4% pass · $0.21/answer
GPT-5.5 Pro · medium: 79.3% pass · $0.42/answer
GPT-5.5 Pro · high: 91.7% pass · $0.78/answer
GPT-5.5 Pro · low: 69.3% pass · $0.31/answer

The ranking inverts. GPT-5.5 Pro at high effort wins on raw pass-rate (91.7%) but lands at $0.78/answer — 19× the DeepSeek V4 high-effort cost ($0.04). For workloads where the 7.5 percentage points of extra correctness do not justify the cost (most internal workflows), DeepSeek V4 at high reasoning is the procurement floor.
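Re-ranking the published AIME grid by each metric makes the inversion explicit; a few lines using the figures above (raw pass-rate crowns one model, CPCA another):

    # Re-ranking the published AIME grid by each metric.
    aime = {  # model+tier: (pass_rate, usd_per_correct_answer)
        "GPT-5.5 Pro · high":     (0.917, 0.78),
        "GPT-5.5 Pro · medium":   (0.793, 0.42),
        "Claude Opus 4.7 · high": (0.891, 0.27),
        "DeepSeek V4 · high":     (0.842, 0.04),
    }

    print(max(aime, key=lambda m: aime[m][0]))   # GPT-5.5 Pro · high (quality)
    print(min(aime, key=lambda m: aime[m][1]))   # DeepSeek V4 · high (CPCA)
    print(aime["GPT-5.5 Pro · high"][1] / aime["DeepSeek V4 · high"][1])  # 19.5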

06 · Latency Tax · The third axis.

Cost and quality are two axes; latency is the third. Reasoning modes inflate TTFT 5-60× depending on model and tier. For chat UX workflows with sub-2-second latency budgets, high reasoning is unusable regardless of capability ceiling.
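TTFT is cheap to probe on your own stack. A vendor-agnostic sketch; `start_stream` stands in for whatever zero-argument callable issues your request and returns a chunk iterator, and nothing here assumes a specific SDK.

    import statistics
    import time

    # Vendor-agnostic TTFT probe. start_stream is any zero-arg callable
    # that issues the request and returns a chunk iterator.

    def measure_ttft(start_stream) -> float:
        t0 = time.monotonic()
        for _chunk in start_stream():
            return time.monotonic() - t0   # stop at the first chunk
        return float("inf")                # stream produced no output

    def ttft_p50(start_stream, probes: int = 20) -> float:
        """Median TTFT over repeated probes, matching the P50 figures above."""
        return statistics.median(measure_ttft(start_stream) for _ in range(probes))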

Low effort (minimal): TTFT P50 0.4-1.5s across the frontier. Right for chat UX, autocompletions, codemods, fast extraction. Pick this tier for anything user-facing under a 2-second budget.

Medium effort: TTFT P50 4-12s across the frontier. Right for code refactor, content briefs, and document analysis where users are waiting actively but tolerant. Streaming output helps perceived latency.

High effort: TTFT P50 18-90s across the frontier. Right for batch jobs, async workflows, and research analysis where the user submits and returns later. Unusable for sync chat.

07 · Decision Matrix · Workload to tier, nine common cases.

The matrix below maps nine workloads to the right effort tier based on the empirical pass-rate curves and cost-per-correct numbers. Use this as the starting policy, then measure against your specific quality bar; a minimal routing sketch in code follows the nine cases.

Workflow 1
Math contest / verifiable answers

High effort wins. Quality curve is steep, answer is binary, latency budget is generous. Default to GPT-5.5 Pro high or Claude Opus 4.7 high. DeepSeek V4 high if cost-bound.

High · GPT-5.5 Pro · $0.78
Workflow 2
Multi-file code refactor

Medium wins. High effort regresses 3-5 points by over-engineering. Default to GPT-5.5 Pro medium or Claude Opus 4.7 medium. Latency budget tolerable in IDE.

Medium · Pro $0.42
Workflow 3
PR-scale code review

Low effort wins. Humans edit the output anyway; reasoning quality marginal. Default to standard tier without extended thinking. Sonnet 4.6 or GPT-5.5 standard.

Low · Sonnet $0.12
Workflow 4
Scientific / analytic research

Medium-high. The quality curve keeps lifting through high, but the latency is prohibitive for interactive use. Pick high for batch research, medium for interactive analysis sessions.

Medium-high · Opus $0.27
Workflow 5
Long-document Q&A (cached)

Low-medium. Cache neutralizes input cost; output budget governs. Use medium for synthesis questions; low for direct extraction. Pick model by cache discount.

Low-medium · Gemini 3 cached
Workflow 6
Customer-facing chat / live UX

Low effort, latency-bound. High and medium TTFT exceed UX budget. Default to standard tier with minimal reasoning. Stream output for perceived responsiveness.

Low only · TTFT-bound
Workflow 7
Agentic outreach personalization

Low effort, volume-bound. At 50K+ emails/month the economics tip to DeepSeek V4 minimal reasoning at $0.002/email. The quality bar is human acceptance, not factuality.

Low · V4 $0.002
Workflow 8
Eval / benchmarking harness

Match production tier. The point of an eval is to mirror production conditions, not maximize capability. If production runs medium, eval runs medium.

Match prod tier
Workflow 9
Novel research / hard analysis

High effort. The genuine novel-reasoning use case where the dial earns its rent. Batch-tolerant. GPT-5.5 Pro high or Opus 4.7 high; DeepSeek V4 high if cost-bound.

High · Pro $0.78
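As promised above, the matrix collapses to a small policy table in code. A minimal sketch; the workload labels and the pick_tier helper are illustrative glue, not a shipped library.

    # The decision matrix as a policy table. Labels are illustrative.
    TIER_POLICY = {
        "math_verifiable":          "high",
        "code_refactor":            "medium",
        "pr_review":                "low",
        "scientific_analysis":      "high",    # medium for interactive sessions
        "doc_qa_cached":            "low",     # medium for synthesis questions
        "live_chat":                "low",
        "outreach_personalization": "low",
        "eval_harness":             None,      # match the production tier
        "novel_research":           "high",
    }

    def pick_tier(workload: str, production_tier: str = "medium") -> str:
        tier = TIER_POLICY.get(workload, "medium")
        return production_tier if tier is None else tier

    print(pick_tier("code_refactor"))          # medium
    print(pick_tier("eval_harness", "low"))    # low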
"Most teams default every workflow to high reasoning out of caution and pay 4-12× over the right tier. The cost is real; the quality lift is illusory."— Internal procurement memo, May 2026

08 · Conclusion · The dial is workload-specific, not a default.

Reasoning effort cost-quality landscape · April 2026

Pick the tier per workflow. Measure cost-per-correct-answer. Don't default to high.

The reasoning_effort dial is a real tool with a real cost. The mistake we see most often is teams setting the dial to high once and forgetting it — paying 4-17× the right tier on workflows where the quality curve is flat. The corrective is a workload-by-workload policy, not a model-wide default.

The decision matrix above is the starting point. The actual policy for your stack is the result of measuring cost-per-correct-answer on your specific tasks against your specific quality bar — not the published benchmark. Build that telemetry into your AI ops stack as a first-class metric.
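One shape that telemetry can take: accumulate spend and correctness per workflow and tier, and read the ratio back out on demand. A sketch; the field names are assumptions, and the record() calls would be wired into your own logging.

    from collections import defaultdict

    # First-class CPCA telemetry: per-(workflow, tier) spend and correctness.
    class CpcaTracker:
        def __init__(self) -> None:
            self.spend: dict = defaultdict(float)
            self.correct: dict = defaultdict(int)

        def record(self, workflow: str, tier: str,
                   cost_usd: float, passed: bool) -> None:
            self.spend[(workflow, tier)] += cost_usd
            self.correct[(workflow, tier)] += int(passed)

        def cpca(self, workflow: str, tier: str) -> float:
            n = self.correct[(workflow, tier)]
            return self.spend[(workflow, tier)] / n if n else float("inf")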

We re-run this benchmark every quarter as new model tiers ship. Bookmark this page if you want the canonical reference; subscribe to the newsletter for the change log.

Reasoning effort that earns its rent

Stop defaulting to high reasoning. Build a policy on cost-per-correct-answer.

We design reasoning-tier policies for engineering and growth teams shipping production AI at scale — covering workload classification, cost-per-correct-answer telemetry, latency-budget mapping, and quarterly re-benchmark cadence.

What we work on

Reasoning-tier engagements

  • Workload classification by quality bar and latency budget
  • reasoning_effort policy mapping per workflow
  • Cost-per-correct-answer telemetry instrumentation
  • Multi-vendor routing — GPT-5.5 Pro / Opus / V4
  • Quarterly re-benchmark cadence and policy review
FAQ · Reasoning effort benchmarks 2026

The questions we get every week.

How does the reasoning_effort dial work across providers?

Each provider exposes the dial differently, but the underlying mechanism is similar: the model spends more inference compute on internal reasoning tokens before emitting the final answer. OpenAI's reasoning_effort parameter sets the model's thinking-token budget. Anthropic's extended thinking exposes a configurable thinking budget. Google's Deep Think uses thinking_budget. DeepSeek V4 has an internal CoT toggle. xAI Grok exposes reasoning_mode. Higher tiers spend more reasoning tokens, exploring multiple solution paths and self-correcting before answering. The trade-off is direct: more reasoning tokens mean more cost and more latency, often higher quality on hard tasks, and occasionally lower quality on bounded tasks (over-engineering).