AI model pricing in Q2 2026 spans more than two orders of magnitude per token, and the spread widens further once you account for cached, batch, and reasoning-mode multipliers. The result is a procurement landscape where the wrong model choice on a high-volume workload can be a 10× cost decision, made without anybody noticing for a quarter.
We track 30 models from 12 providers across input, output, cached-read, cache-write, and batch-tier rates. The full data lives below, organised so that engineering and finance teams can run a real cost model rather than reconstruct per-token rates from memory. The headline numbers as of April 23, 2026: Claude Opus 4.7 at $5/$25, GPT-5.5 at $5/$30, GPT-5.5 Pro at $30/$180, Gemini 3 Pro at $4/$20, DeepSeek V4 at $0.40/$1.60, and Llama 4 405B hosted at $0.70/$2.80.
Per-token rates are the headline; they are not the budget. The last section translates rack-rate pricing into real cost-per-successful-task numbers for six common workflows, because that is the unit most procurement RFPs are quietly converging on.
- 01 · Frontier output rates span 113× — DeepSeek V4 ($1.60) to GPT-5.5 Pro ($180). The cheapest credible frontier output ($1.60/1M, DeepSeek V4) is 113× cheaper than the most expensive ($180/1M, GPT-5.5 Pro). Per-token rate alone is rarely the right comparison, but at this spread it cannot be ignored.
- 02 · Cached-read discounts now reach 95% on the leading providers. Anthropic, OpenAI, and Google all ship cached-read tiers at 0.05-0.10× of the input rate. On long-context workloads with stable prefixes, the cached rate is the only number that matters; uncached input is for cold starts and cache-priming calls.
- 03 · Batch APIs cut input cost by 50% across OpenAI, Anthropic, and Google. Every major provider ships a batch tier at a 50% input discount with a 24-hour SLA. Most teams forget to use it. For non-interactive workflows — embedding refresh, content rewrites, classification jobs — batch is the default, not an optimisation.
- 04 · Open-weight pricing is converging toward $0.20-0.80/1M output via Together, Fireworks, Anyscale, and Groq. Llama 4 405B, Qwen 3 235B, and DeepSeek V4 hosted offerings now sit in a tight $0.20-0.80/1M output band across major inference providers. The competitive dynamic is no longer model selection at this tier — it is provider selection on latency, throughput, and uptime.
- 05 · $/token is the wrong unit; cost-per-successful-task is the right one. Pass-rate, retry-rate, and output amplification reshape the apparent ranking. A model that is 6× cheaper per token but needs 3× as many retries and emits 3× the tokens per attempt is more expensive in production; the sketch below makes the arithmetic concrete. The §06 worked examples translate the rate table into the metric procurement actually cares about.
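A minimal sketch of that arithmetic in Python. All rates, token counts, and pass-rates here are illustrative assumptions rather than rows from the tracker; the point is the shape of the calculation.

```python
# Cost-per-successful-task vs per-token rate. All numbers here are
# illustrative assumptions, not rates from the tracker.

def cost_per_successful_task(
    input_rate: float,    # $ per 1M input tokens
    output_rate: float,   # $ per 1M output tokens
    input_tokens: int,    # prompt tokens per attempt
    output_tokens: int,   # generated tokens per attempt
    pass_rate: float,     # fraction of attempts that clear the quality bar
) -> float:
    """Expected cost of one passing result, assuming independent retries
    (expected attempts = 1 / pass_rate)."""
    per_attempt = (input_tokens * input_rate + output_tokens * output_rate) / 1e6
    return per_attempt / pass_rate

# Premium model: concise output, passes 90% of the time.
premium = cost_per_successful_task(5.00, 30.0, 2_000, 4_000, pass_rate=0.90)

# Budget model: 6x cheaper per output token, but 3x the retries
# (pass-rate 30%) and 3x the output tokens per attempt.
budget = cost_per_successful_task(0.40, 5.0, 2_000, 12_000, pass_rate=0.30)

print(f"premium: ${premium:.3f}/task, budget: ${budget:.3f}/task")
# premium: $0.144/task, budget: $0.203/task -- the "cheaper" model loses.
```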
01 — Snapshot
The Q2 2026 top-line pricing chart.
Below is the chart we update every quarter. It plots output rate per 1M tokens against the rough quality tier of each model, so the cost-quality frontier is visible at a glance. The bars are normalised to GPT-5.5 Pro's output rate ($180) for visual comparison — the actual rate label sits to the right of each bar.
Output rate per 1M tokens · 10 reference models
Source: Provider pricing pages · April 23, 2026

Read this chart as a positioning map, not a buying guide. A 113× output spread between DeepSeek V4 and GPT-5.5 Pro does not mean DeepSeek is the right choice for most workloads — it means the cost of picking the wrong tier on a misaligned workload is enormous. The sections below decompose the bands.
02 — Frontier Tier
The frontier band — eight headline models.
Eight models share what we are calling the frontier band — released in the last 12 months, competent across the major capability families (reasoning, code, agentic tool use, long context), and priced broadly in the $3-$30 input / $15-$180 output range. The cards below show each model's rate in every billing dimension.
GPT-5.5 Pro (Premium)
$30 / $180 per 1M
Cached read $1.50 · Cache write $37.50 · Batch $15
Top reasoning model. Premium tier for the hardest tasks; the cost is only justified at high reasoning_effort on complex multi-file work or research-grade analysis.

GPT-5.5 (Standard · default)
$5 / $30 per 1M
Cached read $0.50 · Cache write $6.25 · Batch $2.50
Default frontier choice for most production workloads. Strong general capability, broad provider support across Azure and AWS Bedrock. Workhorse tier.

Claude Opus 4.7 (Coding · 1M context)
$5 / $25 per 1M
Cached read $0.50 · Cache write $6.25 · Batch $2.50
Top long-context and coding model. 1M-token context, strongest cache pricing, leads on SWE-Bench Pro and MCP tool use. Cheaper than GPT-5.5 on output.

Claude Sonnet 4.6 (Workhorse)
$3 / $15 per 1M
Cached read $0.30 · Cache write $3.75 · Batch $1.50
Workhorse model — most enterprise routes default here. Good capability per dollar, especially with cache. Sweet spot for multi-step agentic workflows.

Gemini 3 Pro (Multimodal · cache leader)
$4 / $20 per 1M
Cached read $0.20 · Cache write $5.00 · Batch $2.00
Multimodal flagship. Best-in-class context-cache discount (95%). Strong on vision, audio, video understanding, and code. Good value when long context dominates.

High-volume tier (frontier-derived)
$0.30 / $1.20 per 1M
Cached read $0.075 · Batch $0.15
Fast frontier-derived tier. Right for high-volume tagging, classification, and summary generation. Quality holds up better than the headline rate suggests.

Real-time tier
$3 / $15 per 1M
Cached read $0.75 · Real-time mode +20%
Reasoning plus real-time data integration. Premium for live web context. Good fit for time-sensitive analysis where Google search-grounding doesn't suffice.

EU-sovereignty tier
$2 / $9 per 1M
Cached read $0.20 · Batch $1.00
European data-residency tier; strong multilingual performance. Fits regulated enterprise workloads where EU sovereignty is required. Capability lags GPT-5.5 by roughly 5-8 points.

"The pricing chart for frontier models in 2026 is not a ladder — it is a fan. Pick the model that matches the workload class, then optimize for cache."
— Internal pricing-policy doc, May 2026
03 — Open-Weight Tier
The open-weight band — convergence to a tight floor.
Open-weight pricing has compressed dramatically since Q1 2026. The major MoE models — Llama 4, Qwen 3, DeepSeek V4 — now sit in a tight $0.20-$0.80 per 1M output band across the leading inference providers. The shift in competitive dynamics is real: open-weight buyers no longer pick a model; they pick a provider on latency, throughput, and reliability.
Open-weight pricing across 5 inference providers
Source: Provider pricing pages · April 23, 2026 · Output cost per 1M tokens

04 — Tier Mechanics
Cached, batch, and tier-of-service discounts.
The headline rates above are the rack-rate input and output. In production, three discount tiers reshape the effective cost: cached read, cache write, and batch. Each major provider ships at least two; OpenAI, Anthropic, and Google ship all three.
Cached read (Anthropic & OpenAI · stable prefix tax)
Both ship cached read at 10% of the input rate; Google Gemini 3 ships at 5%. On any workload with a stable system prefix and repeat traffic, this is the only number that matters in the steady state. Caches come in 5-min, 1-h, and 24-h TTL tiers.

Cache write (prime cost · Anthropic 5-min tier)
The first call that writes the cache pays a 25% premium over base input. Longer TTL tiers (1-h, 24-h) add additional write premium. Break-even depends on TTL: the write pays back after 2-5 cached reads in the 5-min tier, 5+ in the 1-h tier, and 20+ in the 24-h tier.

Batch (non-interactive workflows · 24-h SLA · async only)
OpenAI, Anthropic, and Google batch APIs ship 50% off the input rate with a 24-hour completion SLA. Right for embedding refresh, content rewrites, classification, and evaluation harnesses. Not for live UX.

The mistake we see most often is procurement teams modelling cost against rack rate and forgetting batch and cache entirely. On workloads where 70%+ of input tokens hit a cached prefix and 30%+ of output is non-interactive, the effective rate sits 60-75% below rack — and that gap is the difference between a viable AI feature and a budget overrun. The sketch below works one such blend through.
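A minimal sketch of the blended-rate arithmetic, using Claude Opus 4.7's rates from the cards above ($5 input, $0.50 cached read, 1.25× cache write, 50% batch input discount). The traffic shares are illustrative assumptions, and the break-even comment uses the simplest possible model, ignoring TTL expiry and partial prefix hits, which push the practical figure toward the 2-5 reads cited above.

```python
# Blended effective input rate under cache and batch discounts.
# Rates are Claude Opus 4.7's from the table; traffic shares are
# illustrative assumptions.

def effective_input_rate(
    rack_rate: float,    # $ / 1M input tokens at rack
    cached_mult: float,  # cached-read multiplier (0.10 here, 0.05 for Gemini)
    write_mult: float,   # cache-write multiplier (1.25, 5-min tier)
    cache_hit: float,    # share of input tokens read from cache
    cache_write: float,  # share of input tokens that prime the cache
    batch_share: float,  # share of remaining cold tokens routed via batch
) -> float:
    cold = 1.0 - cache_hit - cache_write
    blended = cache_hit * cached_mult + cache_write * write_mult + cold
    blended -= batch_share * cold * 0.5   # batch tier: 50% off input
    return rack_rate * blended

rate = effective_input_rate(
    5.00, 0.10, 1.25, cache_hit=0.70, cache_write=0.05, batch_share=0.5
)
print(f"effective input ${rate:.2f}/1M vs $5.00 rack "
      f"({1 - rate / 5.00:.0%} below rack)")  # -> $1.60/1M, 68% below

# Cache-write break-even, simplest model: the extra (write_mult - 1)
# paid on the priming call is recovered after
# (write_mult - 1) / (1 - cached_mult) cached reads of the same prefix.
# TTL expiry and partial hits push the practical figure higher.
```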
05 — Provider Spread
Same model, different provider — what changes.
For frontier closed-source models, the model choice fixes the provider: you pay OpenAI for GPT-5.5, Anthropic for Claude. For open-weight models, you pick. The grid below shows where each provider lands on price, throughput, and reliability for the same Llama 4 405B model. The throughput numbers are what justify the premium tiers; the sketch below turns TPS into wall-clock time.
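A quick sketch of why throughput carries a price premium. The token count is an illustrative assumption; the TPS figures are the generalist and throughput-leader numbers from the grid that follows.

```python
# Turning TPS into wall-clock: why a 4x throughput premium can justify
# a higher per-token rate for interactive UX. Token count is illustrative.

def generation_seconds(output_tokens: int, tps: float) -> float:
    """Decode time only; excludes TTFT and queueing, which tend to
    favour the high-throughput providers further."""
    return output_tokens / tps

for provider, tps in [("generalist", 120.0), ("throughput leader", 480.0)]:
    print(f"{provider}: {generation_seconds(2_000, tps):.1f}s for a 2K-token reply")
# generalist: 16.7s, throughput leader: 4.2s -- the difference between
# a sluggish chat experience and a usable one.
```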
Together (generalist · best price)
$0.70 / $2.80 · 120 TPS
The headline Llama 4 405B hosted rate of $0.70/$2.80 quoted in the snapshot above. Strong model coverage (Llama 4, Qwen 3, DeepSeek, Mistral, Mixtral). Solid throughput at 80-160 TPS. Sweet spot for batch and bulk workflows.

Generalist · enterprise leaning
$0.65 / $2.50 · 110 TPS
Slightly cheaper than Together on Llama 4 ($0.65/$2.50). Strong fine-tuning hosting; SOC-2 and HIPAA tiers. Right for regulated workloads needing serverless inference plus customer-data residency.

Groq (throughput leader)
$0.65 / $3.00 · 480 TPS
Llama 4 405B at 480 TPS — 4× the generalist providers. The premium output rate ($3.00/1M) pays for the speed. Right for chat UX where TTFT is the metric. Limited model coverage.

Cerebras (throughput leader · alt architecture)
$0.30 / $0.60 · 525 TPS · 70B
Wafer-scale silicon hits 525 TPS on Llama 4 70B. Pricing matches Groq's at this model size. Limited frontier-model coverage; right for specific use cases where 70B-class capability suffices.

AWS Bedrock (enterprise default)
+10-20% over native
Hosts Anthropic, Meta, Mistral, AI21, and Cohere models. The 10-20% premium over native pricing buys VPC privacy, IAM integration, and bundled billing. Right for AWS-aligned workloads; wrong otherwise.

Open-weight hosted
$0.18 / $0.65 · 90 TPS
Strong open-weight coverage with serverless and dedicated tiers. Pricing competitive with Together on most models. Solid SLOs, though less prominent in 2026 than in 2024.

06 — From Rate to Cost
Translating per-token rate to real cost.
Per-token rate is the input to a cost model, not the output. The output is cost-per-successful-task — what it actually costs to complete a real workflow end to end, including retries, output amplification, and tool-loop overhead. The six worked examples below translate the rate table into the unit procurement RFPs increasingly cite.
Multi-file refactor (Expert-SWE-style)
GPT-5.5 Pro at high effort: $0.42 per successful task at a 72.6% pass-rate. Claude Opus 4.7 at standard: $0.31 at 64.3%. DeepSeek V4 at high reasoning: $0.06 at 51.7%. Pick by quality bar.

PR-scale code review
GPT-5.5 standard: $0.18 per review at an 84.1% find-rate. Sonnet 4.6: $0.12 at 81.3%. Llama 4 405B: $0.04 at 73.4%. The cost spread justifies open-weight on internal-tool reviews.

RAG knowledge-base Q&A (cached prefix)
Claude Opus 4.7 with 90% cache: $0.05 per answer. Gemini 3 Pro with 95% cache: $0.04. GPT-5.5: $0.07. Cache mechanics flatten the cost gap; pick by quality.

Long-document summary (200K input)
Cached at 90%: GPT-5.5 $0.18, Opus $0.16, Gemini 3 Pro $0.10. Uncached: GPT-5.5 $1.00, Opus $1.00, Gemini $0.80. Cache discipline is everything at long context.

Agentic outreach personalization (email-class)
DeepSeek V4: $0.002 per email at 78% engagement quality against a human baseline. Sonnet 4.6: $0.008 at 84%. GPT-5.5: $0.016 at 86%. Volume tips the scale to V4 above 50K emails per month.

Daily content brief (4K output)
Sonnet 4.6: $0.06 per brief at 88% editor acceptance. GPT-5.5 Mini: $0.02 at 79%. Opus 4.7: $0.10 at 91%. Pick by acceptance bar; Mini wins on internal drafts.

"By Q3 2026 every serious AI procurement RFP will quote cost-per-successful-task, not $/1M. The token rate becomes a sub-input."
— Internal procurement memo, May 2026
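A worked check on the long-document summary row above, as a minimal sketch using the rack and cached-read rates from the frontier-tier cards. Treating the example's figures as input-only is our assumption (a 2K-token summary adds a few cents of output on these models); the uncached results reproduce the $1.00/$1.00/$0.80 figures exactly, and the cached results land close to the table's $0.18/$0.16/$0.10.

```python
# Worked check on the long-document summary example: 200K input tokens,
# input side only. Rack and cached-read rates per 1M tokens are taken
# from the frontier-tier cards above.

RATES = {  # model: (rack input, cached read), $ / 1M tokens
    "GPT-5.5":      (5.00, 0.50),
    "Opus 4.7":     (5.00, 0.50),
    "Gemini 3 Pro": (4.00, 0.20),
}

def input_cost(model: str, tokens: int, cached_share: float) -> float:
    """Blended input cost when cached_share of the prompt hits a
    previously written prefix cache."""
    rack, cached = RATES[model]
    blended = cached_share * cached + (1.0 - cached_share) * rack
    return tokens * blended / 1e6

for model in RATES:
    cold = input_cost(model, 200_000, cached_share=0.0)
    warm = input_cost(model, 200_000, cached_share=0.9)
    print(f"{model}: uncached ${cold:.2f}, 90% cached ${warm:.2f}")
# GPT-5.5:      uncached $1.00, 90% cached $0.19
# Opus 4.7:     uncached $1.00, 90% cached $0.19
# Gemini 3 Pro: uncached $0.80, 90% cached $0.12
```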
07 — Quarterly Delta
What changed since Q1 2026.
Three changes since the Q1 2026 tracker matter for cost modelling.
- Open-weight floor dropped roughly 30%. Llama 4 405B hosted output fell from $4.00/1M (Q1) to $2.80/1M (Q2); DeepSeek V4 native landed at $1.60/1M output, undercutting Q1 V3.5 by 18%. The compression is provider-led, not vendor-led — Together, Fireworks, Groq, Cerebras competing on margin.
- Cached-read tier became universal. Q1 2026 had Anthropic and OpenAI shipping cache; Google Gemini 3 Pro joined in late Q1 at the most aggressive 5% rate. Mistral added 10% cached read in February. Every major provider now ships prefix-cache pricing.
- Reasoning-mode cost is now load-bearing. GPT-5.5 Pro at high reasoning_effort is 6× the cost of GPT-5.5 standard; Claude Opus 4.7 with extended thinking adds 2-4×; Gemini 3 Deep Think adds 3×. The reasoning premium has become the most important sub-tier choice on any cost model — bigger than picking between the headline models in some workloads; the sketch below shows how quickly the multiplier compounds.
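A minimal sketch of that compounding, assuming the 6× rate multiplier cited above and an illustrative 3× output amplification for reasoning traces; the amplification figure is our assumption, not a number from the tracker.

```python
# How reasoning-mode premiums compound in a per-task cost model.
# The 6x rate multiplier is from the tracker; the 3x output
# amplification for reasoning traces is an illustrative assumption.

BASE_OUTPUT_RATE = 30.0  # GPT-5.5 standard, $ / 1M output tokens

def output_cost(output_tokens: int, rate_mult: float, amplification: float) -> float:
    """Output-side cost of one task under a reasoning-mode premium."""
    return output_tokens * amplification * BASE_OUTPUT_RATE * rate_mult / 1e6

standard = output_cost(4_000, rate_mult=1.0, amplification=1.0)
pro_high = output_cost(4_000, rate_mult=6.0, amplification=3.0)

print(f"standard: ${standard:.2f}/task, Pro high effort: ${pro_high:.2f}/task")
# standard: $0.12/task, Pro high effort: $2.16/task -- an 18x gap,
# larger than the spread between most headline models.
```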
08 — Conclusion
The pricing chart is the input — not the answer.
Track rates quarterly. Translate to cost-per-successful-task. Re-bid annually.
The 200+ data points above are a snapshot. The frontier is moving fast enough that any rate locked into a procurement contract for 12 months will be paying somewhere between 15% and 60% over the going market rate by the end of the term. A quarterly re-cost cadence is the right operating tempo for 2026.
The deeper move is to stop treating per-token rate as the metric at all. The workloads that win in production are the ones whose owners measure cost-per-successful-task and accept that the headline rate is just one of six inputs to that number. Cache topology, batch usage, retry rate, output amplification, and tool-loop overhead are the others.
We re-publish this tracker every quarter. Bookmark this page if you want the canonical reference; subscribe to the newsletter if you want the change log delivered.