AI Development · 13 min read · Original research · Q2 2026 data

AI Model Efficient Frontier Q2 2026: Performance vs Price

Q2 2026 efficient-frontier analysis — Pareto scatter plots mapping speed, quality, and cost across 20 frontier models. Identifies the dominant strategies.

Digital Applied Team
April 12, 2026
13 min read
  • 20 models analyzed
  • 6 Pareto-dominant
  • 1000x price spread
  • Q2 2026 timeframe

Key Takeaways

Dominance is rare: Of 20 frontier models tracked across OpenRouter, only six sit on the Q2 2026 Pareto frontier across cost, quality, and speed. The other fourteen are beaten on every dimension by a cheaper or faster sibling — yet most of them still ship real production traffic.
A 1000x price spread ends in a near-tie: Input token pricing spans from $0.03 per 1M (LFM2 24B) to $30 per 1M (GPT-5.4 Pro) — a thousand-fold gap. Quality scores span roughly 49 to 57 on the Artificial Analysis index. The price range has no matching quality range.
Free tiers compress the entire curve: Qwen 3.6 Plus (free preview) and Step 3.5 Flash (free tier) are economically free. Any paid model that does not dominate them on quality is hard to justify at the low end. The floor resets the scatter.
Speed is a separate frontier: Cerebras, Groq, and SambaNova hardware reshape which models are attractive at each price point. A mediocre model on fast silicon can beat a strong model on slow silicon for interactive UX.
Most buyers shop the wrong axis: Teams default to name recognition or the cheapest available model. Both are dominated strategies. The Pareto frontier says to pick a model that no sibling beats on every axis at once — which produces a shortlist of six.
Provider strategy is visible in the scatter: Anthropic bets on reasoning premium, Xiaomi bets on volume-at-quality, Alibaba floods the low end, OpenAI sits above everyone on price, and NVIDIA prices Nemotron to compete against open weights. Each bet maps to a frontier quadrant.
Routing beats model choice: The cheapest correct answer is usually a multi-model routing rule rather than a single model choice. Sonnet 4.6 for reasoning, MiMo V2 Pro for volume code, Qwen 3.6 Plus (free) for bulk classification, Opus 4.6 for judgment calls — that stack dominates any single-model selection on unit cost.

The efficient frontier says every dominated model should be discarded. Half of the top 10 on OpenRouter are dominated on every dimension by a cheaper sibling. Pareto optimality is a business decision — and most agencies get it wrong.

This post is a Q2 2026 snapshot of the AI model landscape viewed through a single lens: which models are not beaten on every axis at once by a competitor, and which are. It uses real pricing from the Q2 2026 LLM pricing index, real throughput numbers from Artificial Analysis and OpenRouter, and real quality scores from the Intelligence Index. No invented benchmarks. No vendor-curated charts.

What an efficient frontier actually is

An efficient frontier, in the Pareto sense, is the set of options where you cannot improve one dimension without getting worse on another. In portfolio theory the axes are expected return and variance. In AI model selection the axes are quality, cost per token, and throughput. A model is on the frontier if no competitor strictly beats it on every axis simultaneously. A model is dominated if some alternative is at least as good on every axis and strictly better on at least one.

The interesting property of the frontier is its sparsity. With 20 models across three axes, the naive expectation is that most of them sit somewhere on the curve. Reality is the opposite: the frontier collapses to a handful of dominant points, leaving a long tail of models beaten on every dimension. The frontier is not a line; it is a small cluster of points.

What "dominated" actually means

Claude Opus 4.5 (Reasoning) scores 49.7 on the Intelligence Index. It costs roughly $15/$75 per 1M tokens on Anthropic's direct API. Claude Sonnet 4.6 (Max Effort) scores 51.7 on the same index at $3/$15 per 1M tokens. Sonnet 4.6 is better on quality and cheaper on price — so Opus 4.5 is strictly dominated. There is no axis on which Opus 4.5 wins.

That does not mean nobody should use Opus 4.5 — migration cost is real, existing pipelines are tuned to it, and the newer Opus 4.6 has already replaced it at the top of the Anthropic stack. But in a greenfield deployment as of Q2 2026, choosing Opus 4.5 over Sonnet 4.6 cannot be defended on any axis.
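The dominance test itself is mechanical. A minimal sketch in Python, using the Opus 4.5 vs Sonnet 4.6 numbers quoted above (the tuple layout — quality first, negated cost second so that higher is always better — is our own convention, not anything from the pricing index):

```python
def dominates(a, b):
    """True if model `a` Pareto-dominates model `b`: at least as good
    on every axis, strictly better on at least one. Each model is a
    tuple of axis values where higher is better."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better

# (quality score, -blended $/1M) from the comparison above
opus_45 = (49.7, -30.00)
sonnet_46 = (51.7, -6.00)

print(dominates(sonnet_46, opus_45))  # True — Opus 4.5 is strictly dominated
print(dominates(opus_45, sonnet_46))  # False — no axis on which Opus 4.5 wins
```

Negating cost keeps the comparison uniform: every axis can then be read as "bigger is better," which is what makes the `all`/`any` pair a complete dominance check.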

The three axes: speed, quality, cost

Every Pareto analysis depends on its axes. For this one we use three measurable, comparable quantities. Each has caveats.

Quality
Intelligence Index

Artificial Analysis composite score (0-100) blending MMLU-Pro, GPQA Diamond, AIME, LiveCodeBench, SciCode, HumanEval, MATH-500, and IFEval. Reflects general reasoning ability. Domain-specific rankings differ.

Cost
Blended $/1M tokens

Blended input+output price per 1M tokens at a 3:1 input-to-output ratio. OpenRouter pricing for open-weight models, direct API pricing for frontier closed models. Excludes cached-input discounts.

Speed
Output tokens/second

Median output throughput measured on the fastest generally available inference provider per model. Cerebras, Groq, and SambaNova reshape the numbers for open-weight models; closed frontier models are capped by their vendor.

These three axes are correlated but not collinear. Cheap models often score lower on quality, but not always. Fast hardware improves throughput without changing quality. Expensive reasoning models are slow by design because they generate more tokens per answer. The correlations are loose enough that the three-axis Pareto analysis produces a genuinely short list of dominant models.
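The blended cost figures used throughout can be reproduced from posted input/output prices. A quick check, assuming the 3:1 blend is a simple weighted average — (3 × input + output) / 4 — which reproduces the table figures below:

```python
def blended(input_price, output_price):
    """Blended $/1M tokens at a 3:1 input-to-output token ratio."""
    return (3 * input_price + output_price) / 4

# Spot-checks against the scatter table's blended column
print(blended(3.00, 15.00))    # 6.0   — Claude Sonnet 4.6 ($3/$15)
print(blended(2.50, 15.00))    # 5.625 — GPT-5.4 ($2.50/$15), quoted as $5.63
print(blended(30.00, 180.00))  # 67.5  — GPT-5.4 Pro ($30/$180)
```

Note the caveat from the cost definition still applies: this blend excludes cached-input discounts, which can move the effective number substantially for cache-heavy workloads.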

Q2 2026 scatter: cost vs quality

Twenty frontier and near-frontier models plotted against blended cost and the Artificial Analysis Intelligence Index. Prices are quoted on a $/1M-token basis at a 3:1 input-to-output blend. The final column marks whether the model is on the cost-vs-quality Pareto frontier for Q2 2026.

| Model | Provider | Blended $/1M | Quality score | Frontier? |
|---|---|---|---|---|
| GPT-5.4 (xhigh) | OpenAI | $5.63 | 57.2 | Yes |
| Gemini 3.1 Pro Preview | Google | $4.50 | 57.2 | Yes |
| Claude Opus 4.6 (Max Effort) | Anthropic | $10.00 | 53.0 | Yes |
| Claude Sonnet 4.6 (Max Effort) | Anthropic | $6.00 | 51.7 | Yes |
| GPT-5.2 (xhigh) | OpenAI | $6.50 | 51.3 | No — dominated by Sonnet 4.6 |
| GLM-5 (Reasoning) | Z.ai | $1.24 | 49.8 | No — dominated by MiniMax M2.7 |
| Claude Opus 4.5 (Reasoning) | Anthropic | $30.00 | 49.7 | No — dominated by Sonnet 4.6 |
| MiniMax M2.7 | MiniMax | $0.53 | 49.6 | Yes |
| MiMo V2 Pro | Xiaomi | $1.50 | 49.2 | Yes |
| GPT-5.4 Mini | OpenAI | $1.69 | ~46 | No — dominated by MiMo V2 Pro |
| Nemotron 3 Super 120B | NVIDIA | $0.10 | ~45 | Yes |
| Kimi K2.5 | Moonshot | $0.72 | ~44 | No — dominated by Nemotron 3 Super |
| Qwen 3.6 Plus Preview | Alibaba | Free | ~44 | Yes |
| Qwen 3 Max Thinking | Alibaba | $1.56 | ~43 | No — dominated by MiMo V2 Pro |
| MiniMax M2.5 | MiniMax | $0.34 | ~42 | No — dominated by Nemotron 3 Super |
| DeepSeek V3.2 | DeepSeek | $0.42 | ~41 | No — dominated by Nemotron 3 Super |
| Gemini 3 Flash Preview | Google | $0.56 | ~40 | No — dominated by Qwen 3.6 Plus (free) |
| Grok 4.20 | xAI | $3.00 | ~47 | No — dominated by Sonnet 4.6 on quality |
| Step 3.5 Flash | StepFun | Free tier | ~38 | No — dominated by Qwen 3.6 Plus on quality |
| GPT-5.4 Pro | OpenAI | $67.50 | ~58 | Yes (top) |

Reading the table: nine models sit on the cost-vs-quality frontier — GPT-5.4 Pro at the absolute top, GPT-5.4 and Gemini 3.1 Pro tied below it, Opus 4.6 and Sonnet 4.6 in the premium reasoning band, MiniMax M2.7 and MiMo V2 Pro in the mid-quality mid-cost band, Nemotron 3 Super in the open-weight band, and Qwen 3.6 Plus as the free-tier anchor. Every other model in the top 20 is beaten on both axes at once by one of those nine. Folding in the speed scatter narrows the final list to six.
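The frontier column can be recomputed mechanically from cost and quality alone. A sketch over a subset of the rows above — a subset, because the full table also weighs speed and pricing-channel caveats, so a pure two-axis recomputation will not reproduce every verdict:

```python
def pareto_frontier(models):
    """models: dict of name -> (blended_cost, quality).
    Returns the names not strictly dominated: no rival is at least
    as cheap AND at least as good, with a strict improvement somewhere."""
    def dominated(name):
        cost, qual = models[name]
        return any(
            c <= cost and q >= qual and (c < cost or q > qual)
            for rival, (c, q) in models.items()
            if rival != name
        )
    return {name for name in models if not dominated(name)}

# (blended $/1M, Intelligence Index) for a subset of the table
models = {
    "GPT-5.4 Pro":       (67.50, 58.0),
    "Claude Sonnet 4.6": (6.00, 51.7),
    "Claude Opus 4.5":   (30.00, 49.7),
    "MiniMax M2.7":      (0.53, 49.6),
    "GPT-5.4 Mini":      (1.69, 46.0),
    "Nemotron 3 Super":  (0.10, 45.0),
    "DeepSeek V3.2":     (0.42, 41.0),
}
print(sorted(pareto_frontier(models)))
# ['Claude Sonnet 4.6', 'GPT-5.4 Pro', 'MiniMax M2.7', 'Nemotron 3 Super']
```

On this subset the computation matches the table: Opus 4.5, GPT-5.4 Mini, and DeepSeek V3.2 fall out; the four survivors keep their frontier slots.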

Q2 2026 scatter: cost vs speed

The second scatter plots cost against throughput, measured as output tokens per second on the fastest generally available provider for each model. Cerebras, Groq, SambaNova, and DeepInfra matter here — they reshape the frontier for open-weight models by offering hardware-differentiated speed at low prices.

| Model | Fastest on | Throughput | Blended $/1M | Frontier? |
|---|---|---|---|---|
| gpt-oss-120b | Cerebras | 920 tok/s | $0.35 | Yes (fastest) |
| gpt-oss-20b | Groq | 714 tok/s | $0.07 | Yes |
| Qwen 3 32B | Groq | 423 tok/s | $0.29 | Yes |
| Llama 3.1 8B | Groq | 274 tok/s | $0.05 | Yes (cheapest fast) |
| Kimi K2.5 | ModelRun | 198 tok/s | $0.55 | No — dominated by Qwen 3 32B |
| MiniMax M2.5 | SambaNova | 175 tok/s | $0.30 | No — dominated by Qwen 3 32B |
| Nemotron 3 Super | DeepInfra | 174 tok/s | $0.10 | Yes |
| Trinity Mini | Clarifai | 170 tok/s | $0.04 | Yes |
| R1 0528 | Nebius | 156 tok/s | $2.00 | No — dominated by Nemotron 3 Super |
| Claude Sonnet 4.6 | Anthropic direct | ~75 tok/s | $6.00 | No — dominated on cost and speed |
| Claude Opus 4.6 | Anthropic direct | ~60 tok/s | $10.00 | No on speed axis — Yes on quality |
| GPT-5.4 | OpenAI direct | ~110 tok/s | $5.63 | No — dominated by gpt-oss-120b on cost and speed |

The cost-vs-speed frontier is owned by hardware rather than model architecture. Cerebras wins the top of the curve with gpt-oss-120b at 920 tok/s. Groq dominates the mid-curve with gpt-oss-20b at 714 tok/s and $0.07/1M. Anthropic's and OpenAI's frontier models do not appear here — they sit on the quality frontier, not the speed one.

Dominant models on the frontier

Six models survive both scatters as Q2 2026 Pareto-dominant choices. Each owns a distinct position.

Qwen 3.6 Plus Preview (free)

The free-tier anchor. Alibaba's preview release is priced at zero on OpenRouter with a 1M-token context window and always-on chain-of-thought. Quality sits in the ~44 range on the Intelligence Index. Because it is free, nothing with a quality score below 44 can justify a per-token price. The entire bottom of the scatter collapses against this point.

Best for: bulk classification, synthetic data generation, prompt exploration, and any workload where quality above 44 is not required.

Nemotron 3 Super 120B

The open-weight frontier. NVIDIA's 120B-parameter, 12B-active Mamba-Transformer MoE scores 60.47% on SWE-Bench Verified and runs at 174 tok/s on DeepInfra for $0.10 blended. On the cost-vs-quality scatter it sits at the inflection where open-weight economics meet reasonable quality. No open-weight model beats it on every axis at once.

Best for: self-hosted deployments, regulated environments requiring on-prem inference, and workloads where data residency rules prohibit closed-weight APIs.

MiniMax M2.7

The self-evolving mid-curve. Ten billion active parameters, 56.22% on SWE-Bench Pro, 57.0% on Terminal Bench 2, and $0.30/$1.20 per 1M. It sits at the sweet spot where a quality score of 49.6 meets $0.53 blended cost — a position Opus 4.5 charged roughly 50x more for. Opus-class quality at open-weight prices.

Best for: agent workloads, code generation at scale, and any budget-conscious deployment that needs near-premium reasoning.

MiMo V2 Pro

The volume champion. Xiaomi's 1T+ parameter model at $1/$3 per 1M carries 4.79T weekly tokens on OpenRouter — a 3x lead over second place and 25.5% of all coding tokens. Quality sits at 49.2 on the Intelligence Index. Not the best at anything; on the frontier because of sheer cost-per-quality efficiency at its tier.

Best for: high-volume coding assistants, content generation pipelines, and any workload where stability at scale matters more than peak quality.

Claude Sonnet 4.6

The premium workhorse. $3/$15 per 1M, 79.6% on SWE-Bench Verified, Intelligence Index 51.7 at Max Effort, 200K context with a 1M-token beta. At roughly one-fifth of Opus 4.5 pricing with near-Opus quality, Sonnet 4.6 dominates Opus 4.5 outright and competes with GPT-5.4 on every axis except raw speed. The default reasoning model for most production workloads.

Best for: production RAG pipelines, agentic workflows, and any workload where consistency matters more than cost-per-token.

Claude Opus 4.6

The quality ceiling for reasoning-heavy tasks. $5/$25 per 1M on OpenRouter, Intelligence Index 53.0 at Max Effort, 200K context with a 1M-token beta. It holds the top of the premium reasoning band: the only models above it on quality are GPT-5.4 Pro, at roughly 7x the cost, and the tied GPT-5.4 and Gemini 3.1 Pro.

Best for: irreducibly hard reasoning, long-horizon agents, legal analysis, and any workload where the cost of a wrong answer exceeds the cost of the token spend.

Dominated models off the frontier

The honest list. These models still ship real production traffic, but in a greenfield Q2 2026 deployment they are beaten on every axis at once by one of the six frontier models above. Domination does not mean "bad" — it means "not the right default."

| Model | Dominated by | Why |
|---|---|---|
| Claude Opus 4.5 | Sonnet 4.6 | Lower quality, higher price, slower. |
| GPT-5.2 xhigh | Sonnet 4.6 | Similar quality, higher price, smaller ecosystem gains. |
| Qwen 3 Max Thinking | MiMo V2 Pro | Similar quality tier at higher blended cost. |
| Kimi K2.5 | MiniMax M2.7 / Nemotron 3 Super | Higher cost at comparable quality, slower than Nemotron. |
| MiniMax M2.5 | Nemotron 3 Super | Beaten on price and quality; superseded by M2.7 within the MiniMax line. |
| DeepSeek V3.2 | Nemotron 3 Super | Higher cost at lower SWE-Bench, slower inference. |
| Gemini 3 Flash Preview | Qwen 3.6 Plus (free) | Paid tier at quality the free model matches. |
| Step 3.5 Flash | Qwen 3.6 Plus (free) | Both free; Qwen 3.6 scores higher on quality. |
| GPT-5.4 Mini | MiMo V2 Pro | Similar quality tier at higher blended cost. |
| Grok 4.20 | Sonnet 4.6 | Lower benchmark scores, comparable price, smaller tool ecosystem. |

The dominated list matters because procurement inertia is strongest for exactly these models. Grok 4.20 ships because xAI has distribution. GPT-5.4 Mini ships because OpenAI has name recognition. DeepSeek V3.2 ships because it was the right answer six months ago. None of them are the right Pareto answer now.

The free-tier floor

Q2 2026 is the first period where genuinely capable free models exist in meaningful numbers. Qwen 3.6 Plus Preview is free during its preview window. Step 3.5 Flash has a free tier. Nemotron 3 Super has a free tier on NVIDIA's own inference endpoints. OpenAI gpt-oss-120b is Apache 2.0. The floor of the scatter is compressing against zero.

The business implication is counterintuitive. A paid model priced at $0.20 per 1M tokens does not compete with other $0.20 models — it competes with free. If it does not beat the best free model on quality, it has no niche. That condition is strict; it eliminates most of the sub-$0.50 paid tier from serious consideration on greenfield deployments.
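The free-floor test is a one-line filter. A sketch using the blended costs and quality scores from the scatter table, with the threshold set at ~44 — Qwen 3.6 Plus's approximate Intelligence Index score:

```python
FREE_FLOOR_QUALITY = 44  # ~Qwen 3.6 Plus Preview on the Intelligence Index

# Low-end paid models: (blended $/1M, quality) from the scatter table
paid_low_end = {
    "Gemini 3 Flash Preview": (0.56, 40),
    "MiniMax M2.5":           (0.34, 42),
    "DeepSeek V3.2":          (0.42, 41),
    "MiniMax M2.7":           (0.53, 49.6),
}

# A paid model survives at the low end only if it beats the free floor on quality
survivors = [name for name, (cost, quality) in paid_low_end.items()
             if quality > FREE_FLOOR_QUALITY]
print(survivors)  # ['MiniMax M2.7']
```

Of the four sub-$0.60 paid models in this sample, only MiniMax M2.7 clears the floor — which is exactly why it keeps a frontier slot while its siblings fall to the free tier.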

What free tiers actually charge for
  • Throughput guarantees: paid tiers usually unlock higher RPS limits and priority queues.
  • Data handling: paid tiers often promise zero-retention, no-training, or region-specific inference.
  • SLAs: paid tiers carry uptime commitments that free tiers explicitly disclaim.
  • Larger context windows: some free tiers cap context at 32K while the paid version hits 1M.
  • Tool-use and file uploads: advanced features (function calling, vision, file attachments) are often gated to paid.

Strategic positioning per provider

The scatter reveals each provider's Q2 2026 bet. Every major lab has made a choice about which frontier quadrant they intend to own. Those choices are legible.

Anthropic — reasoning premium

Opus at $5/$25, Sonnet at $3/$15. Premium pricing, premium quality, no attempt to compete at the low end. The bet is that enterprise reasoning workloads are price-inelastic above a quality threshold. Provider share on OpenRouter is 10.9% — below Xiaomi, Alibaba, Google — but concentrated in high-margin workloads.

Xiaomi — volume at quality

MiMo V2 Pro at $1/$3 per 1M and 4.79T weekly tokens. The bet is that a mid-quality model at scale produces more revenue than a premium model at enterprise scale. 21.1% OpenRouter share — the largest of any provider. Detailed in our "Anthropic cost vs Xiaomi volume" analysis.

Alibaba — free floor

Qwen 3.6 Plus free during preview, Qwen 3.5 Flash at $0.065/$0.26. The bet is that the free tier pulls developers onto Alibaba Cloud infrastructure where the paid upsell happens. 13.9% share and climbing.

OpenAI — top of curve, premium everywhere

GPT-5.4 Pro at $30/$180. GPT-5.4 at $2.50/$15. Even Nano at $0.20/$1.25. The bet is that brand pricing holds even as Chinese models match on quality. 7.5% OpenRouter share signals that the bet is under pressure but profitable.

NVIDIA — open weights as distribution

Nemotron 3 Super at $0.10 blended, open weights, fast on DeepInfra. The bet is that open-weight models anchor developer mindshare and drive H100/B200 hardware demand. The model is not the product; the inference hardware is.

MiniMax — self-evolving mid-curve

M2.7 at $0.30/$1.20 with a self-evolving training loop. 56% SWE-Pro and 57% Terminal-Bench 2. The bet is that a continuously-improving model at one-tenth of Opus pricing wins agentic workloads where quality-per-dollar dominates. 8.1% OpenRouter share.

Digital Applied model-routing rules

The right application of a Pareto frontier is not model selection — it is model routing. Each workload type maps to a specific model on the frontier. Our production rules as of Q2 2026:

| Workload | Route to | Why |
|---|---|---|
| Irreducible judgment | Claude Opus 4.6 | Top of quality axis; error cost exceeds token cost. |
| Production RAG + agents | Claude Sonnet 4.6 | Near-Opus quality at 1/5 cost; default reasoning model. |
| High-volume code generation | MiMo V2 Pro | Best cost-per-quality at scale; 1M context. |
| Budget agent workloads | MiniMax M2.7 | Opus-class reasoning at $0.53 blended. |
| Bulk classification | Qwen 3.6 Plus (free) | Free floor dominates below the quality-44 band. |
| Interactive UX / voice | gpt-oss-120b on Cerebras | 920 tok/s owns the latency-sensitive surface. |
| On-prem / regulated | Nemotron 3 Super 120B | Best open-weight quality; self-hostable. |

A production stack running those seven rules costs less than a single-model stack pinned to any frontier model and produces better unit quality because each workload is served by the model that dominates its axis. The cheapest correct answer is structurally a routing decision, not a shopping decision. For the measurement infrastructure behind this analysis, our analytics and insights engagements track cost-per-correct-answer per route against cost-per-token baselines.
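In code, a routing layer like the one above reduces to a lookup with a fallback. A minimal sketch — the workload labels and model identifier strings are our own illustrative names, not any provider's actual API model IDs:

```python
# Workload label -> frontier model, per the routing table above.
# Identifiers are illustrative, not real API model IDs.
ROUTES = {
    "judgment":       "claude-opus-4.6",
    "rag_agents":     "claude-sonnet-4.6",
    "code_volume":    "mimo-v2-pro",
    "budget_agents":  "minimax-m2.7",
    "classification": "qwen-3.6-plus",        # free tier
    "interactive":    "gpt-oss-120b",         # Cerebras-hosted
    "on_prem":        "nemotron-3-super-120b",
}

def route(workload, default="claude-sonnet-4.6"):
    """Map a workload label to its frontier model; unknown labels
    fall back to the default reasoning model."""
    return ROUTES.get(workload, default)

print(route("classification"))  # qwen-3.6-plus
print(route("unknown-task"))    # claude-sonnet-4.6
```

The design point is that the table is the product: re-pricing a quarter means editing the dict, not rewriting application code.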

Conclusion

Twenty frontier and near-frontier models reduce to six Pareto-dominant choices in Q2 2026. The other fourteen are each beaten on every axis at once by a sibling, usually a cheaper one. That is a strong claim, and it changes fast — the frontier will look different in Q3. The methodology is what matters: define the axes for your workload, place every candidate on the scatter, discard the dominated points, then route workloads against the surviving frontier.

The cheapest correct answer for a production AI stack is almost never a single model. It is a routing rule that sends each workload to the point on the frontier that dominates its axis — Opus for judgment, Sonnet for reasoning, MiMo for volume, MiniMax for budget agents, Qwen for free-tier classification, Cerebras-hosted gpt-oss for interactive UX, Nemotron for on-prem. That is the Pareto-rational default.

Build a model stack that actually dominates

Most production AI stacks pay a premium for a dominated model and call it a day. We build Pareto-rational routing layers that send each workload to the frontier model that fits its axis — and measure cost-per-correct-answer, not cost-per-token.

