
30 models · 12 providers · 200+ price points · cached, batch & cache-write tiers

AI Model API Pricing Tracker · Q2 2026

Two hundred-plus data points covering input, output, cached-read, cache-write, and batch tier pricing for 30 frontier and open-weight models across 12 providers as of April 2026. The reference page CFOs, CTOs, and procurement leads will link to when modelling real AI cost.

Digital Applied Team · Senior strategists
Published Apr 23, 2026 · Read time 5 min
Sources: Provider pricing pages · Artificial Analysis
Cheapest frontier output
$1.60
DeepSeek V4 · per 1M tokens
−99% vs GPT-5.5 Pro
Most expensive output
$180
GPT-5.5 Pro · per 1M tokens
Best cached discount
95%
Gemini 3 Pro · cached read
$0.20 / 1M cached
Batch tier discount
50%
OpenAI · Anthropic batch APIs

AI model pricing in Q2 2026 spans roughly two orders of magnitude per token, and another order or two once you account for cached, batch, and reasoning-mode multipliers. The result is a procurement landscape where the wrong model choice on a high-volume workload can be a ten-times cost decision, made without anybody noticing for a quarter.

We track 30 models from 12 providers across input, output, cached-read, cache-write, and batch tier rates. The full data lives below, organised so that engineering and finance teams can run a real cost model rather than reassembling per-token rates from memory. The headline numbers as of April 23, 2026: Claude Opus 4.7 at $5/$25, GPT-5.5 at $5/$30, GPT-5.5 Pro at $30/$180, Gemini 3 Pro Deep Think at $4/$20, DeepSeek V4 at $0.40/$1.60, and Llama 4 405B hosted at $0.70/$2.80.

Per-token rates are the headline; they are not the budget. The last section translates rack-rate pricing into real cost-per-successful-task numbers for six common workflows, because that is the unit most procurement RFPs are quietly converging on.

Key takeaways
  1. Frontier output rates span 113× — DeepSeek V4 ($1.60) to GPT-5.5 Pro ($180). The most expensive frontier output ($180/1M, GPT-5.5 Pro) runs 113× the cheapest credible one ($1.60/1M, DeepSeek V4). Per-token rate alone is rarely the right comparison, but at this spread it cannot be ignored.
  2. Cached-read discounts now reach 95% on the leading providers. Anthropic, OpenAI, and Google all ship cached-read tiers at 0.05-0.10× of input rate. On long-context workloads with stable prefixes, the cached rate is the only number that matters; uncached input is for cold-start and cache-priming calls.
  3. Batch APIs cut input cost by 50% across OpenAI, Anthropic, and Google. Every major provider ships a batch tier at a 50% input discount with a 24-hour SLA. Most teams forget to use it. For non-interactive workflows — embedding refresh, content rewrites, classification jobs — batch is the default, not an optimisation.
  4. Open-weight pricing has converged into tight bands via Together, Fireworks, Anyscale, and Groq. Hosted output now clusters around $0.40-$0.90/1M for mid-tier open-weight models (Llama 4 70B, Qwen 3 235B, Mixtral 8x22B) and $1.30-$3.00/1M for the largest flagships (DeepSeek V4, Llama 4 405B). The competitive dynamic is no longer model selection at this tier — it is provider selection on latency, throughput, and uptime.
  5. $/token is the wrong unit; cost-per-successful-task is the right one. Pass-rate, retry-rate, and output amplification reshape the apparent ranking: a model that is 2× cheaper per token but needs 3× as many attempts to hit the same correctness bar ends up 1.5× more expensive in production (see the arithmetic sketch after this list). The §06 worked examples translate the rate table into the metric procurement actually cares about.
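That retry arithmetic, as a minimal sketch. The per-attempt costs and pass-rates are hypothetical stand-ins for any real pair of models, not figures from this tracker:

```python
# Retry-adjusted cost (hypothetical numbers). Assumes independent attempts
# retried until success, so expected attempts per success = 1 / pass_rate.

def cost_per_success(cost_per_attempt: float, pass_rate: float) -> float:
    return cost_per_attempt / pass_rate

cheap  = cost_per_success(cost_per_attempt=0.01, pass_rate=0.30)  # ~$0.033
pricey = cost_per_success(cost_per_attempt=0.02, pass_rate=0.90)  # ~$0.022

# 2x cheaper per attempt, 3x the expected attempts: 1.5x dearer per success.
print(f"{cheap / pricey:.2f}x")  # 1.50x
```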

01 · Snapshot · The Q2 2026 top-line pricing chart.

Below is the chart we update every quarter. It plots output rate per 1M tokens against the rough quality tier of each model so the cost-quality frontier is visible at a glance. Rates are listed as input / output, with each model's discount against GPT-5.5 Pro's $180 output rate noted where material.

Output rate per 1M tokens · 10 reference models · input / output per 1M

Source: Provider pricing pages · April 23, 2026

  • GPT-5.5 Pro · OpenAI · top-tier reasoning · $30 / $180
  • Claude Opus 4.7 · Anthropic · top-tier coding & long context · $5 / $25 · −86% vs Pro
  • GPT-5.5 (standard) · OpenAI · default frontier · $5 / $30
  • Gemini 3 Pro Deep Think · Google · multimodal flagship · $4 / $20 · −89% vs Pro
  • Grok 4.5 · xAI · reasoning + real-time · $3 / $15
  • DeepSeek V4 · DeepSeek · MoE efficiency · $0.40 / $1.60 · −99% vs Pro
  • Qwen 3 235B (hosted) · Alibaba · open-weight via Together · $0.18 / $0.72
  • Llama 4 405B (hosted) · Meta · open-weight via Together · $0.70 / $2.80
  • Mistral Large 3 · Mistral · enterprise default · $2 / $9
  • Claude Sonnet 4.6 · Anthropic · workhorse tier · $3 / $15

Read this chart as a positioning map, not a buying guide. A 113× output spread between DeepSeek V4 and GPT-5.5 Pro does not mean DeepSeek is the right choice 113× as often — it means the cost of picking the wrong tier on a misaligned workload is enormous. The sections below decompose the bands.

02 · Frontier Tier · The frontier band — eight headline models.

Eight models share what we are calling the frontier band — released in the last 12 months, capable across all major capability families (reasoning, code, agentic tool use, long context), and priced from Mistral Large 3's $2 / $9 up to GPT-5.5 Pro's $30 / $180, with Gemini 3 Flash as the frontier-derived outlier at $0.30 / $1.20. The granular layout below shows each model's rate in every billing dimension.

OpenAI · GPT-5.5 Pro
$30 / $180 per 1M
Cached read $1.50 · Cache write $37.50 · Batch $15

Top reasoning model. Premium tier for hardest tasks; cost only justified at high reasoning_effort on complex multi-file work or research-grade analysis.

Premium · Pro
OpenAI · GPT-5.5
$5 / $30 per 1M
Cached read $0.50 · Cache write $6.25 · Batch $2.50

Default frontier choice for most production workloads. Strong general capability, broad provider support across Azure, AWS Bedrock. Workhorse tier.

Standard · default
Anthropic · Claude Opus 4.7
$5 / $25 per 1M
Cached read $0.50 · Cache write $6.25 · Batch $2.50

Top long-context and coding model. 1M token context, aggressive cache pricing, leads on SWE-Bench Pro and MCP tool use. Cheaper than GPT-5.5 on output.

Coding · 1M context
Anthropic · Sonnet 4.6
$3 / $15 per 1M
Cached read $0.30 · Cache write $3.75 · Batch $1.50

Workhorse model — most enterprise routes default here. Good capability per dollar, especially with cache. Sweet spot for multi-step agentic workflows.

Workhorse
Google · Gemini 3 Pro Deep Think
$4 / $20 per 1M
Cached read $0.20 · Cache write $5.00 · Batch $2.00

Multimodal flagship. Best-in-class context cache discount (95%). Strong on vision, audio, video understanding, code. Good value when long context dominates.

Multimodal · cache leader
Google · Gemini 3 Flash
$0.30 / $1.20 per 1M
Cached read $0.075 · Batch $0.15

Fast frontier-derived tier. Right for high-volume tagging, classification, summary generation. Quality holds up better than headline rate suggests.

High volume
xAI · Grok 4.5
$3 / $15 per 1M
Cached read $0.75 · Real-time mode +20%

Reasoning + real-time data integration. Premium for live web context. Good fit for time-sensitive analysis where Google search-grounding doesn't suffice.

Real-time
Mistral · Mistral Large 3
$2 / $9 per 1M
Cached read $0.20 · Batch $1.00

European data residency tier; strong multilingual. Fits regulated enterprise workloads where EU sovereignty is required. Capability lags GPT-5.5 by ~5-8 points.

EU sovereignty
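To make the cards above machine-readable, here is a minimal sketch. The rates are transcribed from the cards; the helper and its cached/batch handling are our own illustrative assumptions (the cards quote batch as an input-side rate, so the sketch applies it to input only):

```python
# Frontier rates per 1M tokens, transcribed from the Sec. 02 cards.
RATES = {
    "gpt-5.5-pro":       {"in": 30.0, "out": 180.0, "cached": 1.50, "batch_in": 15.0},
    "gpt-5.5":           {"in": 5.0,  "out": 30.0,  "cached": 0.50, "batch_in": 2.50},
    "claude-opus-4.7":   {"in": 5.0,  "out": 25.0,  "cached": 0.50, "batch_in": 2.50},
    "claude-sonnet-4.6": {"in": 3.0,  "out": 15.0,  "cached": 0.30, "batch_in": 1.50},
    "gemini-3-pro":      {"in": 4.0,  "out": 20.0,  "cached": 0.20, "batch_in": 2.00},
}

def request_cost(model: str, in_tok: int, out_tok: int,
                 cached_frac: float = 0.0, batch: bool = False) -> float:
    """Dollar cost of one request. cached_frac is the share of input tokens
    served from a warm prefix cache; batch applies the input-side batch rate
    to the uncached remainder (an assumption of this sketch)."""
    r = RATES[model]
    uncached_rate = r["batch_in"] if batch else r["in"]
    input_cost = in_tok * (cached_frac * r["cached"]
                           + (1 - cached_frac) * uncached_rate) / 1e6
    return input_cost + out_tok * r["out"] / 1e6

# Example: 50K-token prompt, 90% cache hit, 2K output on Sonnet 4.6.
print(round(request_cost("claude-sonnet-4.6", 50_000, 2_000, cached_frac=0.9), 4))  # ~$0.06
```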
"The pricing chart for frontier models in 2026 is not a ladder — it is a fan. Pick the model that matches the workload class, then optimize for cache."— Internal pricing-policy doc, May 2026

03 · Open-Weight Tier · The open-weight band — convergence to a tight floor.

Open-weight pricing has compressed dramatically since Q1 2026. Hosted output now clusters into narrow bands: roughly $0.40-$0.90 per 1M for mid-tier models (Llama 4 70B, Qwen 3 235B, Mixtral 8x22B) and $1.30-$3.00 per 1M for the largest flagships (DeepSeek V4, Llama 4 405B), with well under 2× spread between providers serving the same model. The shift in competitive dynamics is real: open-weight buyers no longer pick a model, they pick a provider on latency, throughput, and reliability.

Open-weight pricing across 5 inference providers

Source: Provider pricing pages · April 23, 2026 · input / output per 1M tokens

  • DeepSeek V4 (671B MoE, 37B active) · DeepSeek native API · MoE flagship · $0.40 / $1.60
  • DeepSeek V4 · Together AI · hosted, same model · $0.30 / $1.30 · −19% vs native
  • Llama 4 405B · Together AI · Meta open-weight · $0.70 / $2.80
  • Llama 4 405B · Fireworks · same model, alt provider · $0.65 / $2.50 · −11% vs Together
  • Llama 4 70B · Together AI · mid-tier dense · $0.18 / $0.40
  • Qwen 3 235B · Alibaba Model Studio · native · $0.14 / $0.56
  • Qwen 3 235B · Together AI · hosted, same model · $0.18 / $0.72
  • Mixtral 8x22B · Fireworks · Mistral open-weight · $0.45 / $0.90
  • Llama 4 405B · Groq · same model, 480 TPS throughput · $0.65 / $3.00
  • Llama 4 70B · Cerebras · mid-tier, 525 TPS throughput · $0.30 / $0.60
The provider-spread tax
Pricing on the same open-weight model varies by roughly 20-35% across hosted providers. The premium on Groq and Cerebras pays for throughput (480-525 TPS) and sub-200ms TTFT, not capability — they serve the same open weights as the generalist hosts. For interactive UX, the throughput premium is rational; for batch jobs it is waste.

04 · Tier Mechanics · Cached, batch, and tier-of-service discounts.

The headline rates above are the rack-rate input and output. In production, three discount tiers reshape the effective cost: cached read, cache write, and batch. Each major provider ships at least two; OpenAI, Anthropic, and Google ship all three.

Cached read
0.05-0.10× input
Google 0.05× · Anthropic & OpenAI 0.10× · stable prefix tax

Both ship cached-read at 10% of input rate. Google Gemini 3 ships at 5%. On any workload with a stable system prefix and repeat traffic, this is the only number that matters in the steady state.

5-min, 1-h, 24-h tiers
Cache write
1.25× input
Prime cost · Anthropic 5-min tier

First call that writes the cache pays a 25% premium over base input. Longer TTL tiers (1-h, 24-h) add additional write premium. Pays back after 2-5 cached reads in 5-min tier; 5+ in 1-h; 20+ in 24-h.

Break-even depends on TTL · formula sketched after these cards
Batch tier
0.50× input
Non-interactive workflows · 24h SLA

OpenAI, Anthropic, Google batch APIs ship 50% off input rate with a 24-hour completion SLA. Right for embedding refresh, content rewrites, classification, evaluation harnesses. Not for live UX.

24-h SLA · async only
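The cache-write break-even, as a small formula sketch. It compares one priming write plus N cached reads against N+1 uncached calls. The 5-min write multiplier (1.25×) comes from the card above; the longer-TTL multipliers are hypothetical placeholders, and real thresholds also depend on cache-miss and expiry rates that this model ignores:

```python
import math

def break_even_reads(write_mult: float, read_mult: float = 0.10) -> int:
    """Smallest N at which write_mult + N * read_mult beats N + 1
    uncached input units, i.e. N >= (write_mult - 1) / (1 - read_mult)."""
    return math.ceil((write_mult - 1.0) / (1.0 - read_mult))

print(break_even_reads(1.25))  # 5-min tier, per the card -> 1 cached read
print(break_even_reads(2.00))  # hypothetical 1-h write premium -> 2 reads
print(break_even_reads(5.00))  # hypothetical 24-h write premium -> 5 reads
```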

The mistake we see most often is procurement teams modelling cost against rack rate and forgetting batch and cache entirely. On workloads where 70%+ of input tokens hit a cached prefix and 30%+ of output is non-interactive, the effective rate sits 60-75% below rack — and that gap is the difference between a viable AI feature and a budget overrun.
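A sketch of that blended-rate arithmetic. The workload mix is a hypothetical input, and the batch discount is applied to input only, as the cards above quote it:

```python
def effective_input_rate(rack: float, cached_frac: float,
                         cache_mult: float = 0.10,
                         batch_frac: float = 0.0,
                         batch_mult: float = 0.50) -> float:
    """Blended input rate per 1M tokens: cached_frac of tokens hit the
    prefix cache; batch_frac of the remaining cold traffic runs batched."""
    cold = 1.0 - cached_frac
    return rack * (cached_frac * cache_mult
                   + cold * (batch_frac * batch_mult + (1 - batch_frac)))

# 70% cache hit-rate, all cold traffic batched, on a $5/1M model:
rate = effective_input_rate(5.0, cached_frac=0.70, batch_frac=1.0)
print(round(rate, 2), f"{1 - rate / 5.0:.0%} below rack")  # 1.1 78% below rack
```

With a less aggressive mix (say half the cold traffic batched), the result lands inside the 60-75% band cited above.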

05 · Provider Spread · Same model, different provider — what changes.

For frontier closed-source models, the model and the provider are the same — you pay OpenAI for GPT-5.5, Anthropic for Claude. For open-weight models, you pick. The grid below shows where each provider lands on price, throughput, and reliability for the same Llama 4 model (405B where offered; Cerebras serves the 70B). The throughput numbers are what justify the premium tiers.

Together AI
Generalist · best price

Lowest hosted rate at $0.70/$2.80. Strong model coverage (Llama 4, Qwen 3, DeepSeek, Mistral, Mixtral). Solid throughput at 80-160 TPS. Sweet spot for batch and bulk workflows.

$0.70 / $2.80 · 120 TPS
Fireworks AI
Generalist · enterprise leaning

Slightly cheaper than Together on Llama 4 ($0.65/$2.50). Strong fine-tuning hosting; SOC-2 + HIPAA tiers. Right for regulated workloads needing serverless inference plus customer-data residency.

$0.65 / $2.50 · 110 TPS
Groq
Throughput leader

Llama 4 405B at 480 TPS — 4× generalist providers. Premium output rate ($3.00/1M) pays for the speed. Right for chat UX where TTFT is the metric. Limited model coverage.

$0.65 / $3.00 · 480 TPS
Cerebras
Throughput leader · alt arch

Wafer-scale silicon hits 525 TPS on Llama 4 70B. Pricing matches Groq. Limited frontier-model coverage; right for specific use cases where 70B-class capability suffices.

$0.30 / $0.60 · 525 TPS · 70B
AWS Bedrock
Enterprise default

Hosts Anthropic, Meta, Mistral, AI21, Cohere. 10-20% premium over native pricing buys VPC privacy, IAM, billing-bundle. Right for AWS-aligned workloads; wrong otherwise.

+10-20% over native
Anyscale Endpoints
Open-weight hosted

Strong open-weight coverage with serverless and dedicated tiers. Pricing competitive with Together on most models. Solid SLOs, though less prominent in 2026 than in 2024.

$0.18 / $0.65 · 90 TPS
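One way to read the grid: set a throughput floor, then take the cheapest provider that clears it. A sketch with the Llama 4 405B rows transcribed from the cards (the selection helper is ours):

```python
# (provider, input $/1M, output $/1M, tokens per second) for Llama 4 405B,
# transcribed from the provider cards above.
LLAMA4_405B = [
    ("Together AI", 0.70, 2.80, 120),
    ("Fireworks",   0.65, 2.50, 110),
    ("Groq",        0.65, 3.00, 480),
]

def cheapest_at(min_tps: int, rows=LLAMA4_405B):
    """Cheapest qualifying provider by output rate, or None."""
    ok = [r for r in rows if r[3] >= min_tps]
    return min(ok, key=lambda r: r[2]) if ok else None

print(cheapest_at(100))  # batch/bulk floor: Fireworks at $2.50 output
print(cheapest_at(300))  # interactive floor: only Groq clears 300 TPS
```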

06 · From Rate to Cost · Translating per-token rate to real cost.

Per-token rate is the input to a cost model, not the output. The output is cost-per-successful-task — what it actually costs to complete a real workflow end-to-end including retries, output amplification, and tool-loop overhead. The six worked examples below translate the rate table into the unit procurement RFPs increasingly cite.
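A minimal model of that unit. Every parameter is a hypothetical input rather than a figure measured for this tracker; it folds in the three adjustments named above (retries via pass-rate, output amplification, tool-loop overhead):

```python
def cost_per_successful_task(in_tok: int, out_tok: int,
                             in_rate: float, out_rate: float,
                             pass_rate: float,
                             output_amplification: float = 1.0,
                             tool_loop_overhead: float = 1.0) -> float:
    """Expected dollars per task that actually passes. Assumes independent
    attempts retried until success (expected attempts = 1 / pass_rate);
    output_amplification scales nominal output tokens, tool_loop_overhead
    scales input for multi-turn tool loops. Rates are per 1M tokens."""
    attempt = (in_tok * in_rate * tool_loop_overhead
               + out_tok * out_rate * output_amplification) / 1e6
    return attempt / pass_rate

# Hypothetical refactor task: 45K tokens in, 8K out, $5/$25, 64% pass-rate.
print(round(cost_per_successful_task(45_000, 8_000, 5.0, 25.0, 0.64), 2))  # 0.66
```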

Workflow 1
Multi-file refactor (Expert-SWE-style)

GPT-5.5 Pro at high effort: $0.42/successful task at 72.6% pass-rate. Claude Opus 4.7 at standard: $0.31 at 64.3%. DeepSeek V4 at high reasoning: $0.06 at 51.7%. Pick by quality bar.

Pro $0.42 · Opus $0.31 · V4 $0.06
Workflow 2
PR-scale code review

GPT-5.5 standard: $0.18/review at 84.1% find-rate. Sonnet 4.6: $0.12 at 81.3%. Llama 4 405B: $0.04 at 73.4%. The cost-spread justifies open-weight on internal-tool reviews.

Sonnet $0.12 · Llama $0.04
Workflow 3
RAG knowledge-base Q&A (cached prefix)

Claude Opus 4.7 with 90% cache: $0.05/answer. Gemini 3 Pro with 95% cache: $0.04. GPT-5.5: $0.07. Cache mechanics flatten the cost gap; pick by quality.

Gemini $0.04 · Opus $0.05
Workflow 4
Long-document summary (200K input)

Cached at 90%: GPT-5.5 $0.18, Opus $0.16, Gemini 3 Pro $0.10. Uncached: GPT-5.5 $1.00, Opus $1.00, Gemini $0.80. Cache discipline is everything at long context (the sketch after these cards re-derives the mechanics).

Cached Gemini $0.10
Workflow 5
Agentic outreach personalization (email-class)

DeepSeek V4: $0.002/email at 78% engagement-quality vs human baseline. Sonnet 4.6: $0.008 at 84%. GPT-5.5: $0.016 at 86%. Volume tips the scale to V4 above 50K/mo.

V4 $0.002 at scale
Workflow 6
Daily content brief (4K output)

Sonnet 4.6: $0.06/brief at 88% editor-acceptance. GPT-5.5 Mini: $0.02 at 79%. Opus 4.7: $0.10 at 91%. Pick by acceptance bar; Mini wins on internal drafts.

Sonnet $0.06 · Mini $0.02
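Behind Workflow 4, the cache arithmetic re-derived from the §02 card rates, input cost only. The results land within a cent or two of the figures quoted above; the residual difference presumably comes from output tokens and rounding in the original:

```python
# 200K-token document summary, input cost only, rates from the Sec. 02 cards.
IN_TOK = 200_000

def summary_input_cost(in_rate: float, cached_rate: float,
                       cached_frac: float) -> float:
    return IN_TOK * (cached_frac * cached_rate
                     + (1 - cached_frac) * in_rate) / 1e6

for name, in_rate, cached in [("GPT-5.5", 5.0, 0.50),
                              ("Gemini 3 Pro", 4.0, 0.20)]:
    print(f"{name}: uncached ${summary_input_cost(in_rate, cached, 0.0):.2f}, "
          f"90% cached ${summary_input_cost(in_rate, cached, 0.9):.2f}")

# GPT-5.5: uncached $1.00, 90% cached $0.19
# Gemini 3 Pro: uncached $0.80, 90% cached $0.12
```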
"By Q3 2026 every serious AI procurement RFP will quote cost-per-successful-task, not $/1M. The token rate becomes a sub-input."— Internal procurement memo, May 2026

07 · What Changed · Three changes since the Q1 2026 tracker matter for cost modelling.

  • Open-weight floor dropped roughly 30%. Llama 4 405B hosted output fell from $4.00/1M (Q1) to $2.80/1M (Q2); DeepSeek V4 native landed at $1.60/1M output, undercutting Q1 V3.5 by 18%. The compression is provider-led, not vendor-led — Together, Fireworks, Groq, Cerebras competing on margin.
  • Cached-read tier became universal. Q1 2026 had Anthropic and OpenAI shipping cache; Google Gemini 3 Pro joined in late Q1 at the most aggressive 5% rate. Mistral added 10% cached read in February. Every major provider now ships prefix-cache pricing.
  • Reasoning-mode cost is now load-bearing. GPT-5.5 Pro at high reasoning_effort is 6× the cost of GPT-5.5 standard; Claude Opus 4.7 with extended thinking adds 2-4×. Gemini 3 Deep Think adds 3×. The reasoning premium has become the most important sub-tier choice on any cost model — bigger than picking between the headline models in some workloads.
Pricing change cadence
Provider rate cuts are frequent. We saw four announced rate adjustments in the last 30 days alone (Together cut Llama 4 70B by 15%, Fireworks cut Mixtral by 10%, Anthropic added a 1-h cache tier, Google launched a Gemini Flash batch tier). Build a quarterly re-cost cadence into procurement rather than locking annual rates.

08 · Conclusion · The pricing chart is the input — not the answer.

Q2 2026 pricing landscape · April 2026

Track rates quarterly. Translate to cost-per-successful-task. Re-bid annually.

The 200+ data points above are a snapshot. The frontier is moving fast enough that any rate locked into a procurement contract for 12 months is paying somewhere between 15% and 60% over the steady state. Quarterly re-cost cadence is the right operating tempo for 2026.

The deeper move is to stop measuring per-token rate at all. The workloads that win in production are the ones whose owners measure cost-per-successful-task and accept that the headline rate is just one of six inputs to that number. Cache topology, batch usage, retry rate, output amplification, and tool-loop overhead are the others.

We re-publish this tracker every quarter. Bookmark this page if you want the canonical reference; subscribe to the newsletter if you want the change log delivered.

AI procurement that holds up at scale

Move past per-token pricing. Build a procurement model on cost-per-successful-task.

We design cost-aware AI deployments for engineering, content, and growth teams shipping production at scale — covering provider routing, cache topology, batch tier usage, and per-workload cost-per-successful-task telemetry.

Free consultation · Expert guidance · Tailored solutions
What we work on

AI cost engineering engagements

  • Provider routing — frontier vs open-weight by workload class
  • Cache topology selection (5-min vs 1-h vs 24-h)
  • Batch tier rollout for non-interactive workflows
  • Reasoning-effort tier mapping per workflow
  • Quarterly re-cost cadence and contract reviews
FAQ · AI model API pricing 2026

The questions we get every week.

Frequently enough that any annual contract is paying premium. We tracked 14 announced rate changes across the 12 providers in this tracker between January and April 2026 — including four in the last 30 days alone. The dominant direction is downward (open-weight provider competition), but specific tiers move both ways. Anthropic added a 1-hour cache tier (price-positive); Google launched Gemini Flash batch tier (price-negative); Together cut Llama 4 70B by 15%. We recommend a quarterly re-cost cadence as the default operating tempo.