AI model pricing in Q2 2026 spans nearly two orders of magnitude per token, and three more once you account for cached, batch, and reasoning-mode multipliers. The result is a procurement landscape where the wrong model choice on a high-volume workload can be a ten-times cost decision, made without anybody noticing for a quarter.
We track 30 models from 12 providers across input, output, cached-read, cache-write, and batch tier rates. The full data lives below, organised so that engineering and finance teams can run a real cost model rather than re-add per-token rates from memory. The headline numbers as of April 24, 2026: Claude Opus 4.7 at $5/$25, GPT-5.5 at $5/$30, GPT-5.5 Pro at $30/$180, Gemini 3 Pro Deep Think at $4/$20, DeepSeek V4-Pro at $0.435/$0.87 on its launch promotion (list $1.74/$3.48), V4-Flash at $0.14/$0.28, and Llama 4 405B hosted at $0.70/$2.80.
Per-token rates are the headline; they are not the budget. The last section translates rack-rate pricing into real cost-per-successful -task numbers for six common workflows, because that is the unit most procurement RFPs are quietly converging on.
- 01Frontier output rates span 207× — DeepSeek V4-Pro promo ($0.87) to GPT-5.5 Pro ($180).The cheapest official frontier-output promotion ($0.87/1M, DeepSeek V4-Pro) is 207× cheaper than the most expensive ($180/1M, GPT-5.5 Pro). DeepSeek's V4-Pro list output is $3.48/1M, still roughly 52× cheaper than Pro.
- 02Cached-read discounts now reach 95% on the leading providers.Anthropic, OpenAI, and Google all ship cached-read tiers at 0.05-0.10× of input rate. On long-context workloads with stable prefixes, the cached rate is the only number that matters; uncached input is for cold-start and prime calls.
- 03Batch APIs cut input cost by 50% across OpenAI, Anthropic, and Google.Every major provider ships a batch tier at 50% input discount with a 24-hour SLA. Most teams forget to use it. For non-interactive workflows — embedding refresh, content rewrites, classification jobs — batch is the default, not an optimisation.
- 04Open-weight pricing is converging toward $0.20-0.80/1M output via Together, Fireworks, Anyscale, and Groq.Llama 4 405B, Qwen 3 235B, and DeepSeek V4-Flash hosted offerings now sit in a tight $0.20-0.80/1M output band across major inference providers. The competitive dynamic is no longer model selection at this tier — it is provider selection on latency, throughput, and uptime.
- 05$/token is the wrong unit; cost-per-successful-task is the right one.Pass-rate, retry-rate, and output amplification reshape the apparent ranking. A model that is 6× cheaper per token but needs 3× as many retries to hit the same correctness bar is more expensive in production. The §06 worked examples translate the rate table into the metric procurement actually cares about.
01 — SnapshotThe Q2 2026 top-line pricing chart.
Below is the chart we update every quarter. It plots output rate per 1M tokens against the rough quality tier of each model so the cost-quality frontier is visible at a glance. The bars are normalised to GPT-5.5 Pro at output ($180) for visual comparison — the actual rate label sits to the right of each bar.
Output rate per 1M tokens · 10 reference models
Source: Provider pricing pages · April 24, 2026Read this chart as a positioning map, not a buying guide. A 207× output spread between DeepSeek V4-Pro's promotional rate and GPT-5.5 Pro does not mean DeepSeek is the right choice 207× as often — it means the cost of picking the wrong tier on a misaligned workload is enormous. The sections below decompose the bands.
02 — Frontier TierThe frontier band — eight headline models.
Eight models share what we are calling the frontier band — released in the last 12 months, capable across all major capability families (reasoning, code, agentic tool use, long context), and priced in the $3-$30 input / $15-$180 output range. The granular layout below shows each model's rate in every billing dimension.
$30 / $180 per 1M
Top reasoning model. Premium tier for hardest tasks; cost only justified at high reasoning_effort on complex multi-file work or research-grade analysis.
$5 / $30 per 1M
Default frontier choice for most production workloads. Strong general capability, broad provider support across Azure, AWS Bedrock. Workhorse tier.
$5 / $25 per 1M
Top long-context and coding model. 1M token context, strongest cache pricing, leads on SWE-Bench Pro and MCP tool use. Cheaper than GPT-5.5 on output.
$3 / $15 per 1M
Workhorse model — most enterprise routes default here. Good capability per dollar, especially with cache. Sweet spot for multi-step agentic workflows.
$4 / $20 per 1M
Multimodal flagship. Best-in-class context cache discount (95%). Strong on vision, audio, video understanding, code. Good value when long context dominates.
$0.30 / $1.20 per 1M
Fast frontier-derived tier. Right for high-volume tagging, classification, summary generation. Quality holds up better than headline rate suggests.
$3 / $15 per 1M
Reasoning + real-time data integration. Premium for live web context. Good fit for time-sensitive analysis where Google search-grounding doesn't suffice.
$2 / $9 per 1M
European data residency tier; strong multilingual. Fits regulated enterprise workloads where EU sovereignty is required. Capability lags GPT-5.5 by ~5-8 points.
"The pricing chart for frontier models in 2026 is not a ladder — it is a fan. Pick the model that matches the workload class, then optimize for cache."— Internal pricing-policy doc, May 2026
03 — Open-Weight TierThe open-weight band — convergence to a tight floor.
Open-weight pricing has compressed dramatically since Q1 2026. The major MoE models — Llama 4, Qwen 3, DeepSeek V4-Flash — now sit in a tight $0.20-$0.80 per 1M output band across the leading inference providers. The shift in competitive dynamics is real: open-weight buyers no longer pick a model, they pick a provider on latency, throughput, and reliability.
Open-weight pricing across 5 inference providers
Source: Provider pricing pages · April 24, 2026 · Output cost per 1M tokens04 — Tier MechanicsCached, batch, and tier-of-service discounts.
The headline rates above are the rack-rate input and output. In production, three discount tiers reshape the effective cost: cached read, cache write, and batch. Each major provider ships at least two; OpenAI, Anthropic, and Google ship all three.
Anthropic & OpenAI · stable prefix tax
Both ship cached-read at 10% of input rate. Google Gemini 3 ships at 5%. On any workload with a stable system prefix and repeat traffic, this is the only number that matters in the steady state.
Prime cost · Anthropic 5-min tier
First call that writes the cache pays a 25% premium over base input. Longer TTL tiers (1-h, 24-h) add additional write premium. Pays back after 2-5 cached reads in 5-min tier; 5+ in 1-h; 20+ in 24-h.
Non-interactive workflows · 24h SLA
OpenAI, Anthropic, Google batch APIs ship 50% off input rate with a 24-hour completion SLA. Right for embedding refresh, content rewrites, classification, evaluation harnesses. Not for live UX.
The mistake we see most often is procurement teams modelling cost against rack rate and forgetting batch and cache entirely. On workloads where 70%+ of input tokens hit a cached prefix and 30%+ of output is non-interactive, the effective rate sits 60-75% below rack — and that gap is the difference between a viable AI feature and a budget overrun.
05 — Provider SpreadSame model, different provider — what changes.
For frontier closed-source models, the model and the provider are the same — you pay OpenAI for GPT-5.5, Anthropic for Claude. For open-weight models, you pick. The grid below shows where each provider lands on price, throughput, and reliability for the same Llama 4 405B model. The throughput numbers are what justify the premium tiers.
Generalist · best price
Lowest hosted rate at $0.70/$2.80. Strong model coverage (Llama 4, Qwen 3, DeepSeek, Mistral, Mixtral). Solid throughput at 80-160 TPS. Sweet spot for batch and bulk workflows.
Generalist · enterprise leaning
Slightly cheaper than Together on Llama 4 ($0.65/$2.50). Strong fine-tuning hosting; SOC-2 + HIPAA tiers. Right for regulated workloads needing serverless inference plus customer-data residency.
Throughput leader
Llama 4 405B at 480 TPS — 4× generalist providers. Premium output rate ($3.00/1M) pays for the speed. Right for chat UX where TTFT is the metric. Limited model coverage.
Throughput leader · alt arch
Wafer-scale silicon hits 525 TPS on Llama 4 70B. Pricing matches Groq. Limited frontier-model coverage; right for specific use cases where 70B-class capability suffices.
Enterprise default
Hosts Anthropic, Meta, Mistral, AI21, Cohere. 10-20% premium over native pricing buys VPC privacy, IAM, billing-bundle. Right for AWS-aligned workloads; wrong otherwise.
Open-weight hosted
Strong open-weight coverage with serverless and dedicated tiers. Pricing competitive with Together on most models. Solid SLO, less popular in 2026 than 2024.
06 — From Rate to CostTranslating per-token rate to real cost.
Per-token rate is the input to a cost model, not the output. The output is cost-per-successful-task — what it actually costs to complete a real workflow end-to-end including retries, output amplification, and tool-loop overhead. The six worked examples below translate the rate table into the unit procurement RFPs increasingly cite.
Multi-file refactor (Expert-SWE-style)
GPT-5.5 Pro at high effort: $0.42/successful task at 72.6% pass-rate. Claude Opus 4.7 at standard: $0.31 at 64.3%. DeepSeek V4 at high reasoning: $0.06 at 51.7%. Pick by quality bar.
PR-scale code review
GPT-5.5 standard: $0.18/review at 84.1% find-rate. Sonnet 4.6: $0.12 at 81.3%. Llama 4 405B: $0.04 at 73.4%. The cost-spread justifies open-weight on internal-tool reviews.
RAG knowledge-base Q&A (cached prefix)
Claude Opus 4.7 with 90% cache: $0.05/answer. Gemini 3 Pro with 95% cache: $0.04. GPT-5.5: $0.07. Cache mechanics flatten the cost gap; pick by quality.
Long-document summary (200K input)
Cached at 90%: GPT-5.5 $0.18, Opus $0.16, Gemini 3 Pro $0.10. Uncached: GPT-5.5 $1.00, Opus $1.00, Gemini $0.80. Cache discipline is everything at long context.
Agentic outreach personalization (email-class)
DeepSeek V4: $0.002/email at 78% engagement-quality vs human baseline. Sonnet 4.6: $0.008 at 84%. GPT-5.5: $0.016 at 86%. Volume tips the scale to V4 above 50K/mo.
Daily content brief (4K output)
Sonnet 4.6: $0.06/brief at 88% editor-acceptance. GPT-5.5 Mini: $0.02 at 79%. Opus 4.7: $0.10 at 91%. Pick by acceptance bar; Mini wins on internal drafts.
"By Q3 2026 every serious AI procurement RFP will quote cost-per-successful-task, not $/1M. The token rate becomes a sub-input."— Internal procurement memo, May 2026
07 — Quarterly DeltaWhat changed since Q1 2026.
Three changes since the Q1 2026 tracker matter for cost modelling.
- Open-weight floor dropped roughly 30%. Llama 4 405B hosted output fell from $4.00/1M (Q1) to $2.80/1M (Q2); DeepSeek V4-Flash native landed at $0.28/1M output, while V4-Pro launched at $0.87/1M promotional output ($3.48/1M list). The compression is vendor-led at the floor and provider-led in the hosted open-weight band — Together, Fireworks, Groq, Cerebras competing on margin.
- Cached-read tier became universal. Q1 2026 had Anthropic and OpenAI shipping cache; Google Gemini 3 Pro joined in late Q1 at the most aggressive 5% rate. Mistral added 10% cached read in February. Every major provider now ships prefix-cache pricing.
- Reasoning-mode cost is now load-bearing. GPT-5.5 Pro at high reasoning_effort is 6× the cost of GPT-5.5 standard; Claude Opus 4.7 with extended thinking adds 2-4×. Gemini 3 Deep Think adds 3×. The reasoning premium has become the most important sub-tier choice on any cost model — bigger than picking between the headline models in some workloads.
08 — ConclusionThe pricing chart is the input — not the answer.
Track rates quarterly. Translate to cost-per-successful-task. Re-bid annually.
The 200+ data points above are a snapshot. The frontier is moving fast enough that any rate locked into a procurement contract for 12 months is paying somewhere between 15% and 60% over the steady state. Quarterly re-cost cadence is the right operating tempo for 2026.
The deeper move is to stop measuring per-token rate at all. The workloads that win in production are the ones whose owners measure cost-per-successful-task and accept that the headline rate is just one of six inputs to that number. Cache topology, batch usage, retry rate, output amplification, and tool-loop overhead are the others.
We re-publish this tracker every quarter. Bookmark this page if you want the canonical reference; subscribe to the newsletter if you want the change log delivered.