AI model pricing in Q2 2026 spans nearly two orders of magnitude per token, and three more once you account for cached, batch, and reasoning-mode multipliers. The result is a procurement landscape where the wrong model choice on a high-volume workload can be a ten-times cost decision, made without anybody noticing for a quarter.

We track 30 models from 12 providers across input, output, cached-read, cache-write, and batch tier rates. The full data lives below, organised so that engineering and finance teams can run a real cost model rather than re-add per-token rates from memory. The headline numbers as of April 23, 2026: Claude Opus 4.7 at $5/$25, GPT-5.5 at $5/$30, GPT-5.5 Pro at $30/$180, Gemini 3 Pro Deep Think at $4/$20, DeepSeek V4 at $0.40/$1.60, and Llama 4 405B hosted at $0.70/$2.80.

Per-token rates are the headline; they are not the budget. The last section translates rack-rate pricing into real cost-per-successful -task numbers for six common workflows, because that is the unit most procurement RFPs are quietly converging on.

Key takeaways

01
Frontier output rates span 113× — DeepSeek V4 ($1.60) to GPT-5.5 Pro ($180).The cheapest credible frontier output ($1.60/1M, DeepSeek V4) is 113× cheaper than the most expensive ($180/1M, GPT-5.5 Pro). Per-token rate alone is rarely the right comparison, but at this spread it cannot be ignored.
02
Cached-read discounts now reach 95% on the leading providers.Anthropic, OpenAI, and Google all ship cached-read tiers at 0.05-0.10× of input rate. On long-context workloads with stable prefixes, the cached rate is the only number that matters; uncached input is for cold-start and prime calls.
03
Batch APIs cut input cost by 50% across OpenAI, Anthropic, and Google.Every major provider ships a batch tier at 50% input discount with a 24-hour SLA. Most teams forget to use it. For non-interactive workflows — embedding refresh, content rewrites, classification jobs — batch is the default, not an optimisation.
04
Open-weight pricing is converging toward $0.20-0.80/1M output via Together, Fireworks, Anyscale, and Groq.Llama 4 405B, Qwen 3 235B, and DeepSeek V4 hosted offerings now sit in a tight $0.20-0.80/1M output band across major inference providers. The competitive dynamic is no longer model selection at this tier — it is provider selection on latency, throughput, and uptime.
05
$/token is the wrong unit; cost-per-successful-task is the right one.Pass-rate, retry-rate, and output amplification reshape the apparent ranking. A model that is 6× cheaper per token but needs 3× as many retries to hit the same correctness bar is more expensive in production. The §06 worked examples translate the rate table into the metric procurement actually cares about.

01 — SnapshotThe Q2 2026 top-line pricing chart.

Below is the chart we update every quarter. It plots output rate per 1M tokens against the rough quality tier of each model so the cost-quality frontier is visible at a glance. The bars are normalised to GPT-5.5 Pro at output ($180) for visual comparison — the actual rate label sits to the right of each bar.

Output rate per 1M tokens · 10 reference models

Source: Provider pricing pages · April 23, 2026

GPT-5.5 ProOpenAI · top-tier reasoning

$30 / $180

Claude Opus 4.7Anthropic · top-tier coding & long context

$5 / $25

−86% vs Pro

GPT-5.5 (standard)OpenAI · default frontier

$5 / $30

Gemini 3 Pro Deep ThinkGoogle · multimodal flagship

$4 / $20

−89% vs Pro

Grok 4.5xAI · reasoning + real-time

$3 / $15

DeepSeek V4DeepSeek · MoE efficiency

$0.40 / $1.60

−99% vs Pro

Qwen 3 235B (hosted)Alibaba · open-weight via Together

$0.18 / $0.72

Llama 4 405B (hosted)Meta · open-weight via Together

$0.70 / $2.80

Mistral Large 3Mistral · enterprise default

$2 / $9

Claude Sonnet 4.6Anthropic · workhorse tier

$3 / $15

Read this chart as a positioning map, not a buying guide. A 113× output spread between DeepSeek V4 and GPT-5.5 Pro does not mean DeepSeek is the right choice 113× as often — it means the cost of picking the wrong tier on a misaligned workload is enormous. The sections below decompose the bands.

02 — Frontier TierThe frontier band — eight headline models.

Eight models share what we are calling the frontier band — released in the last 12 months, capable across all major capability families (reasoning, code, agentic tool use, long context), and priced in the $3-$30 input / $15-$180 output range. The granular layout below shows each model's rate in every billing dimension.

OpenAI · GPT-5.5 Pro

$30 / $180 per 1M

Cached read $1.50 · Cache write $37.50 · Batch $15

Top reasoning model. Premium tier for hardest tasks; cost only justified at high reasoning_effort on complex multi-file work or research-grade analysis.

Premium · Pro

OpenAI · GPT-5.5

$5 / $30 per 1M

Cached read $0.50 · Cache write $6.25 · Batch $2.50

Default frontier choice for most production workloads. Strong general capability, broad provider support across Azure, AWS Bedrock. Workhorse tier.

Standard · default

Anthropic · Claude Opus 4.7

$5 / $25 per 1M

Cached read $0.50 · Cache write $6.25 · Batch $2.50

Top long-context and coding model. 1M token context, strongest cache pricing, leads on SWE-Bench Pro and MCP tool use. Cheaper than GPT-5.5 on output.

Coding · 1M context

Anthropic · Sonnet 4.6

$3 / $15 per 1M

Cached read $0.30 · Cache write $3.75 · Batch $1.50

Workhorse model — most enterprise routes default here. Good capability per dollar, especially with cache. Sweet spot for multi-step agentic workflows.

Workhorse

Google · Gemini 3 Pro Deep Think

$4 / $20 per 1M

Cached read $0.20 · Cache write $5.00 · Batch $2.00

Multimodal flagship. Best-in-class context cache discount (95%). Strong on vision, audio, video understanding, code. Good value when long context dominates.

Multimodal · cache leader

Google · Gemini 3 Flash

$0.30 / $1.20 per 1M

Cached read $0.075 · Batch $0.15

Fast frontier-derived tier. Right for high-volume tagging, classification, summary generation. Quality holds up better than headline rate suggests.

High volume

xAI · Grok 4.5

$3 / $15 per 1M

Cached read $0.75 · Real-time mode +20%

Reasoning + real-time data integration. Premium for live web context. Good fit for time-sensitive analysis where Google search-grounding doesn't suffice.

Real-time

Mistral · Mistral Large 3

$2 / $9 per 1M

Cached read $0.20 · Batch $1.00

European data residency tier; strong multilingual. Fits regulated enterprise workloads where EU sovereignty is required. Capability lags GPT-5.5 by ~5-8 points.

EU sovereignty

"The pricing chart for frontier models in 2026 is not a ladder — it is a fan. Pick the model that matches the workload class, then optimize for cache."— Internal pricing-policy doc, May 2026

03 — Open-Weight TierThe open-weight band — convergence to a tight floor.

Open-weight pricing has compressed dramatically since Q1 2026. The major MoE models — Llama 4, Qwen 3, DeepSeek V4 — now sit in a tight $0.20-$0.80 per 1M output band across the leading inference providers. The shift in competitive dynamics is real: open-weight buyers no longer pick a model, they pick a provider on latency, throughput, and reliability.

Open-weight pricing across 5 inference providers

Source: Provider pricing pages · April 23, 2026 · Output cost per 1M tokens

DeepSeek V4 (671B MoE, 37B active)DeepSeek native API · MoE flagship

$0.40 / $1.60

DeepSeek V4 · Together AIHosted · same model

$0.30 / $1.30

−19% vs native

Llama 4 405B · Together AIMeta · open-weight

$0.70 / $2.80

Llama 4 405B · FireworksSame model · alt provider

$0.65 / $2.50

−11% vs Together

Llama 4 70B · Together AIMid-tier dense

$0.18 / $0.40

Qwen 3 235B (Alibaba native)Alibaba Model Studio

$0.14 / $0.56

Qwen 3 235B · Together AIHosted · same model

$0.18 / $0.72

Mixtral 8x22B · FireworksMistral open-weight

$0.45 / $0.90

Llama 4 405B · GroqSame model · 480 TPS throughput

$0.65 / $3.00

Llama 4 70B · CerebrasMid-tier · 525 TPS throughput

$0.30 / $0.60

The provider-spread tax

Pricing on the same open-weight model varies by 2-3× across hosted providers. The premium on Groq and Cerebras pays for throughput (480-525 TPS) and TTFT (under 200ms), not capability — both are serving the same model weights as Together. For interactive UX, the throughput premium is rational; for batch jobs it is waste.

04 — Tier MechanicsCached, batch, and tier-of-service discounts.

The headline rates above are the rack-rate input and output. In production, three discount tiers reshape the effective cost: cached read, cache write, and batch. Each major provider ships at least two; OpenAI, Anthropic, and Google ship all three.

Cached read

0.05× input

Anthropic & OpenAI · stable prefix tax

Both ship cached-read at 10% of input rate. Google Gemini 3 ships at 5%. On any workload with a stable system prefix and repeat traffic, this is the only number that matters in the steady state.

5-min, 1-h, 24-h tiers

Cache write

1.25× input

Prime cost · Anthropic 5-min tier

First call that writes the cache pays a 25% premium over base input. Longer TTL tiers (1-h, 24-h) add additional write premium. Pays back after 2-5 cached reads in 5-min tier; 5+ in 1-h; 20+ in 24-h.

Break-even depends on TTL

Batch tier

0.50× input

Non-interactive workflows · 24h SLA

OpenAI, Anthropic, Google batch APIs ship 50% off input rate with a 24-hour completion SLA. Right for embedding refresh, content rewrites, classification, evaluation harnesses. Not for live UX.

24-h SLA · async only

The mistake we see most often is procurement teams modelling cost against rack rate and forgetting batch and cache entirely. On workloads where 70%+ of input tokens hit a cached prefix and 30%+ of output is non-interactive, the effective rate sits 60-75% below rack — and that gap is the difference between a viable AI feature and a budget overrun.

05 — Provider SpreadSame model, different provider — what changes.

For frontier closed-source models, the model and the provider are the same — you pay OpenAI for GPT-5.5, Anthropic for Claude. For open-weight models, you pick. The grid below shows where each provider lands on price, throughput, and reliability for the same Llama 4 405B model. The throughput numbers are what justify the premium tiers.

Together AI

Generalist · best price

Lowest hosted rate at $0.70/$2.80. Strong model coverage (Llama 4, Qwen 3, DeepSeek, Mistral, Mixtral). Solid throughput at 80-160 TPS. Sweet spot for batch and bulk workflows.

$0.70 / $2.80 · 120 TPS

Fireworks AI

Generalist · enterprise leaning

Slightly cheaper than Together on Llama 4 ($0.65/$2.50). Strong fine-tuning hosting; SOC-2 + HIPAA tiers. Right for regulated workloads needing serverless inference plus customer-data residency.

$0.65 / $2.50 · 110 TPS

Groq

Throughput leader

Llama 4 405B at 480 TPS — 4× generalist providers. Premium output rate ($3.00/1M) pays for the speed. Right for chat UX where TTFT is the metric. Limited model coverage.

$0.65 / $3.00 · 480 TPS

Cerebras

Throughput leader · alt arch

Wafer-scale silicon hits 525 TPS on Llama 4 70B. Pricing matches Groq. Limited frontier-model coverage; right for specific use cases where 70B-class capability suffices.

$0.30 / $0.60 · 525 TPS · 70B

AWS Bedrock

Enterprise default

Hosts Anthropic, Meta, Mistral, AI21, Cohere. 10-20% premium over native pricing buys VPC privacy, IAM, billing-bundle. Right for AWS-aligned workloads; wrong otherwise.

+10-20% over native

Anyscale Endpoints

Open-weight hosted

Strong open-weight coverage with serverless and dedicated tiers. Pricing competitive with Together on most models. Solid SLO, less popular in 2026 than 2024.

$0.18 / $0.65 · 90 TPS

06 — From Rate to CostTranslating per-token rate to real cost.

Per-token rate is the input to a cost model, not the output. The output is cost-per-successful-task — what it actually costs to complete a real workflow end-to-end including retries, output amplification, and tool-loop overhead. The six worked examples below translate the rate table into the unit procurement RFPs increasingly cite.

Workflow 1

Multi-file refactor (Expert-SWE-style)

GPT-5.5 Pro at high effort: $0.42/successful task at 72.6% pass-rate. Claude Opus 4.7 at standard: $0.31 at 64.3%. DeepSeek V4 at high reasoning: $0.06 at 51.7%. Pick by quality bar.

Pro $0.42 · Opus $0.31 · V4 $0.06

Workflow 2

PR-scale code review

GPT-5.5 standard: $0.18/review at 84.1% find-rate. Sonnet 4.6: $0.12 at 81.3%. Llama 4 405B: $0.04 at 73.4%. The cost-spread justifies open-weight on internal-tool reviews.

Sonnet $0.12 · Llama $0.04

Workflow 3

RAG knowledge-base Q&A (cached prefix)

Claude Opus 4.7 with 90% cache: $0.05/answer. Gemini 3 Pro with 95% cache: $0.04. GPT-5.5: $0.07. Cache mechanics flatten the cost gap; pick by quality.

Gemini $0.04 · Opus $0.05

Workflow 4

Long-document summary (200K input)

Cached at 90%: GPT-5.5 $0.18, Opus $0.16, Gemini 3 Pro $0.10. Uncached: GPT-5.5 $1.00, Opus $1.00, Gemini $0.80. Cache discipline is everything at long context.

Cached Gemini $0.10

Workflow 5

Agentic outreach personalization (email-class)

DeepSeek V4: $0.002/email at 78% engagement-quality vs human baseline. Sonnet 4.6: $0.008 at 84%. GPT-5.5: $0.016 at 86%. Volume tips the scale to V4 above 50K/mo.

V4 $0.002 at scale

Workflow 6

Daily content brief (4K output)

Sonnet 4.6: $0.06/brief at 88% editor-acceptance. GPT-5.5 Mini: $0.02 at 79%. Opus 4.7: $0.10 at 91%. Pick by acceptance bar; Mini wins on internal drafts.

Sonnet $0.06 · Mini $0.02

"By Q3 2026 every serious AI procurement RFP will quote cost-per-successful-task, not $/1M. The token rate becomes a sub-input."— Internal procurement memo, May 2026

07 — Quarterly DeltaWhat changed since Q1 2026.

Three changes since the Q1 2026 tracker matter for cost modelling.

Open-weight floor dropped roughly 30%. Llama 4 405B hosted output fell from $4.00/1M (Q1) to $2.80/1M (Q2); DeepSeek V4 native landed at $1.60/1M output, undercutting Q1 V3.5 by 18%. The compression is provider-led, not vendor-led — Together, Fireworks, Groq, Cerebras competing on margin.
Cached-read tier became universal. Q1 2026 had Anthropic and OpenAI shipping cache; Google Gemini 3 Pro joined in late Q1 at the most aggressive 5% rate. Mistral added 10% cached read in February. Every major provider now ships prefix-cache pricing.
Reasoning-mode cost is now load-bearing. GPT-5.5 Pro at high reasoning_effort is 6× the cost of GPT-5.5 standard; Claude Opus 4.7 with extended thinking adds 2-4×. Gemini 3 Deep Think adds 3×. The reasoning premium has become the most important sub-tier choice on any cost model — bigger than picking between the headline models in some workloads.

Pricing change cadence

Provider rate cuts are frequent. We saw four announced rate adjustments in the last 30 days alone (Together −15% Llama 4 70B, Fireworks −10% Mixtral, Anthropic +1-h cache tier, Google Gemini Flash batch tier launch). Build a quarterly re-cost cadence into procurement rather than locking annual rates.

08 — ConclusionThe pricing chart is the input — not the answer.

Q2 2026 pricing landscape · April 2026

Track rates quarterly. Translate to cost-per-successful-task. Re-bid annually.

The 200+ data points above are a snapshot. The frontier is moving fast enough that any rate locked into a procurement contract for 12 months is paying somewhere between 15% and 60% over the steady state. Quarterly re-cost cadence is the right operating tempo for 2026.

The deeper move is to stop measuring per-token rate at all. The workloads that win in production are the ones whose owners measure cost-per-successful-task and accept that the headline rate is just one of six inputs to that number. Cache topology, batch usage, retry rate, output amplification, and tool-loop overhead are the others.

We re-publish this tracker every quarter. Bookmark this page if you want the canonical reference; subscribe to the newsletter if you want the change log delivered.

AI Model API Pricing Tracker · Q2 2026

01 — SnapshotThe Q2 2026 top-line pricing chart.

Output rate per 1M tokens · 10 reference models

02 — Frontier TierThe frontier band — eight headline models.

$30 / $180 per 1M

$5 / $30 per 1M

$5 / $25 per 1M

$3 / $15 per 1M

$4 / $20 per 1M

$0.30 / $1.20 per 1M

$3 / $15 per 1M

$2 / $9 per 1M

03 — Open-Weight TierThe open-weight band — convergence to a tight floor.

Open-weight pricing across 5 inference providers

04 — Tier MechanicsCached, batch, and tier-of-service discounts.

Anthropic & OpenAI · stable prefix tax

Prime cost · Anthropic 5-min tier

Non-interactive workflows · 24h SLA

05 — Provider SpreadSame model, different provider — what changes.

Generalist · best price

Generalist · enterprise leaning

Throughput leader

Throughput leader · alt arch

Enterprise default

Open-weight hosted

06 — From Rate to CostTranslating per-token rate to real cost.

Multi-file refactor (Expert-SWE-style)

PR-scale code review

RAG knowledge-base Q&A (cached prefix)

Long-document summary (200K input)

Agentic outreach personalization (email-class)

Daily content brief (4K output)

07 — Quarterly DeltaWhat changed since Q1 2026.

08 — ConclusionThe pricing chart is the input — not the answer.

Track rates quarterly. Translate to cost-per-successful-task. Re-bid annually.

Move past per-token pricing. Build a procurement model on cost-per-successful-task.

AI cost engineering engagements

The questions we get every week.

Continue exploring frontier model economics.

Self-Hosting Frontier AI Models: 2026 TCO Analysis

Claude Opus 4.7 1M Context: The Cost-Strategy Guide

Cost-Per-Successful-Task: A New AI Evaluation Metric