
7 providers · 12 models · 60+ data points across price, latency, throughput for Q2 2026

AI Inference Providers: Q2 2026 Pricing Matrix

By Q2 2026 the serverless inference market has consolidated around seven providers — Together, Fireworks, Anyscale, Groq, Cerebras, Replicate, and OctoAI. Pricing on the same model spreads 6×, and P50 latency spreads 5–7×.

Digital Applied Team · Senior strategists
Published Apr 24, 2026 · 4 min read
Sources: Provider pages · Artificial Analysis · Apr 2026
  • Llama 4 70B · cheapest: $0.65 / 1M tokens (Together batch) · −85% vs most expensive
  • Llama 4 70B · most expensive: $4.20 / 1M tokens (listed pricing)
  • Groq LPU · throughput: 750 tps on Llama 4 70B output tokens (5–7× typical H100)
  • P50 latency spread: 5–7× across 7 providers

By Q2 2026, the serverless inference market for open-weight models has consolidated around seven providers. Pricing on the same model spreads 6× across the field; P50 latency spreads 5-7×; throughput on specialty hardware (Groq LPU, Cerebras wafer-scale) spreads up to 10× over commodity H100 endpoints. The matrix matters because the wrong default costs more than the engineering work to switch.

This post compiles the per-token pricing, P50 latency, and output throughput across the seven providers and twelve popular open-weight models — Llama 4 70B / 405B, Qwen 3 72B / 235B-MoE, DeepSeek V4-Pro / V4-Flash, Mistral Large 2, Command-R+, Mixtral 8x22B, plus Phi-5 and Granite Code at smaller scale. Numbers are taken from public pricing pages, Artificial Analysis benchmarks, and direct Apr 2026 testing.

Key takeaways
  1. Same model, 6× pricing spread — pick by total cost-of-answer, not headline rate. Llama 4 70B costs $0.65/1M at Together's batch tier and $4.20/1M at the most expensive listed price. The cheap tier has tail-latency caveats; the expensive tier guarantees premium SLAs. Without knowing your latency tolerance, the headline number is meaningless.
  2. Specialty hardware (Groq, Cerebras) wins on throughput-bound workloads by 5-10×. Groq's LPU hits 750 tokens/sec on Llama 4 70B output decode; Cerebras hits 600+. A typical H100 endpoint runs 100-150. The premium-priced specialty providers are the right default for streaming chat, real-time coding, and any workload where time-to-finish matters more than per-token cost.
  3. Together AI is the price leader at scale; Fireworks is the developer-experience leader. Together's batch and reserved tiers are the cheapest in the market for steady-state production. Fireworks ships the cleanest API, fastest model integration, and best fine-tuning workflow. Most agencies end up with both — Together for stable production, Fireworks for fast iteration and custom models.
  4. Replicate and OctoAI fill the model-availability gap for niche or custom models. If you need a model the big providers don't host (custom fine-tunes, smaller Granite Code, niche releases), Replicate and OctoAI run almost anything via container. Pricing is higher per-token, but the alternative is self-hosting just for that one model.
  5. Anyscale Endpoints is the enterprise option — higher price floor, deeper compliance and SLAs. Anyscale is built on Ray, ships HIPAA / SOC 2 / EU data residency by default, and has the strongest enterprise contract terms. It runs a higher per-token rate (often 1.5-2× Together) but is the right call for regulated industries that can't run with a smaller vendor's compliance posture.
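The spread in the first takeaway is easy to sanity-check from the listed prices. A quick sketch (prices are the figures from this post; the dictionary keys are our own labels, not any provider's API):

```python
# Listed Llama 4 70B output prices from the matrix, USD per 1M tokens.
# Keys are illustrative labels only.
prices = {
    "together_batch": 0.65,
    "together_reserved": 0.95,
    "fireworks_serverless": 1.20,
    "groq_lpu": 3.20,
    "cerebras_listed": 4.20,
}

cheapest = min(prices.values())
priciest = max(prices.values())
spread = priciest / cheapest        # ~6.5x headline spread
discount = 1 - cheapest / priciest  # ~85% off at the bottom of the market

print(f"spread: {spread:.1f}x, discount vs top: {discount:.0%}")
```

The ratio lands at roughly 6.5×, which is where the "6× spread" and "−85%" figures in the stat cards come from.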

01 · Landscape: Seven providers, three positioning bands.

The seven providers in this matrix occupy three competitive bands, plus an enterprise tier. Price-leaders (Together, Fireworks) compete on per-token cost and broad model coverage. Performance-leaders (Groq, Cerebras) compete on throughput and time-to-first-token. Coverage-leaders (Replicate, OctoAI) compete on model breadth and custom-fine-tune support. Anyscale Endpoints occupies a fourth, enterprise band: competitive on price, strongest on contract terms and compliance.

Band 1
Price-led — Together AI · Fireworks AI
Cheapest per-token · broad model coverage

Both run vLLM-class stacks on H100 / H200 clusters with aggressive batching. Together's batch + reserved tiers price-lead the market. Fireworks ships the cleanest DX. Right default for steady-state production volume.

Default · steady-state
Band 2
Speed-led — Groq · Cerebras
Specialty hardware · 5-10× throughput

Groq's LPU and Cerebras's wafer-scale chip skip the H100 commodity stack entirely. 600-750 tokens/sec output decode on 70B-class models. Premium pricing (often 2-3× Together's). Right call for streaming chat, real-time coding, and time-bound workloads.

Real-time · streaming
Band 3
Coverage-led — Replicate · OctoAI
Anything model · container-based

Run almost any open-weight model, including custom fine-tunes and niche releases. Higher per-token cost but the only way to get specific models (custom fine-tunes, smaller Granite, regional fine-tunes) without self-hosting. Right for portfolio breadth.

Niche / custom
Band 4
Enterprise — Anyscale Endpoints
Built on Ray · enterprise compliance default

HIPAA / SOC 2 / EU data residency built in. Strongest enterprise contract terms. Premium pricing (1.5-2× Together). Right for regulated industries (healthcare, finance, public sector) that can't accept smaller-vendor compliance posture.

Regulated industries

02 · Pricing: Per-token pricing on Llama 4 70B.

The cleanest cross-provider comparison is on a single popular open-weight model — Llama 4 70B is the 2026 reference because every provider in this matrix hosts it and the architecture (GQA + 70B dense) is identical across endpoints. The 6× spread below is real; the question is what comes with each tier.

Llama 4 70B · per-token pricing across providers (output tokens)

Source: Provider pricing pages · Artificial Analysis · Apr 24, 2026
  • Together AI · batch tier (60-min latency band, lowest priority): $0.65 / 1M · cheapest
  • Together AI · reserved (committed capacity, 1-month minimum): $0.95 / 1M
  • Fireworks AI · serverless (on-demand, real-time SLA): $1.20 / 1M
  • OctoAI · serverless (on-demand, broad model coverage): $1.50 / 1M
  • Anyscale Endpoints (enterprise contract, SOC 2 / HIPAA): $2.10 / 1M
  • Replicate (on-demand, container-based): $2.55 / 1M
  • Groq · LPU (specialty hardware, premium throughput): $3.20 / 1M · speed premium
  • Cerebras · wafer-scale (highest tier, listed pricing): $4.20 / 1M

The pattern: prices reflect what you're buying. Together batch tier saves 75% per token but accepts 60-minute SLA. Groq LPU charges 5× more but delivers 5-7× throughput. The matrix makes the trade explicit instead of implicit.
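To make the trade concrete, here is a sketch of the monthly output-token bill across three of the tiers above. The 2B tokens/month volume is an assumed example figure; the prices are the listed rates from the matrix:

```python
# Listed Llama 4 70B output prices, USD per 1M tokens (from the matrix above).
PRICE_PER_1M = {
    "Together batch": 0.65,
    "Fireworks serverless": 1.20,
    "Groq LPU": 3.20,
}

# Assumed steady-state volume for illustration.
monthly_tokens = 2_000_000_000

for provider, price in PRICE_PER_1M.items():
    bill = monthly_tokens / 1_000_000 * price  # millions of tokens × rate
    print(f"{provider:>22}: ${bill:>8,.0f}/mo")
```

At that volume the batch tier bills $1,300/mo against $6,400/mo on Groq — the same 5× gap as the per-token rates, which is why the latency tolerance of the workload decides the tier.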

03 · Latency: Time-to-first-token and decode throughput.

Cost is one axis; latency is the other. The two move independently — cheap providers can have low latency (Together on-demand) and expensive providers can have high latency (Cerebras at low utilization). The numbers below are P50 measurements from Apr 2026 on Llama 4 70B output decode.

Groq LPU · 750 tps output decode · speed leader

750+ tokens/sec on Llama 4 70B. P50 time-to-first-token under 200ms. The fastest commercially available endpoint for open-weight models. The trade is per-token cost (5× Together batch).

Cerebras · 620 tps output decode · speed band

600-650 tokens/sec on Llama 4 70B. P50 TTFT slightly higher than Groq (~250ms). Wafer-scale chip; capacity is scarcer than Groq's LPU fleet, with occasional waitlists for high-volume contracts.

Together / Fireworks · 115 tps output decode · commodity tier

100-150 tokens/sec on Llama 4 70B at typical H100 endpoints. P50 TTFT 400-700ms. Standard commodity-cluster performance — what most production deployments run on. Acceptable for chat, marginal for streaming code.
"You don't pay for inference in tokens. You pay in time-to-finish — and that depends as much on the silicon as the price."
— Internal provider-eval notes, Apr 2026
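That trade is easy to model. A sketch of the time-to-finish arithmetic, using the P50 figures above and an assumed 800-token response length (the response length is our example, not a benchmark):

```python
# Time-to-finish = time-to-first-token + decode time for the full response.
def time_to_finish(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Wall-clock seconds for one streamed response."""
    return ttft_s + n_tokens / tokens_per_s

n = 800  # output tokens in a longish answer (assumed example)

groq = time_to_finish(0.20, 750, n)       # ~1.3 s
commodity = time_to_finish(0.55, 115, n)  # ~7.5 s

print(f"Groq {groq:.1f}s vs commodity {commodity:.1f}s "
      f"({commodity / groq:.1f}x faster on Groq)")
```

At this response length the gap works out to roughly 5.9×, squarely inside the 5–7× P50 spread quoted above — and it widens as responses get longer, because decode time dominates TTFT.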

04 · Providers: Provider-by-provider notes worth knowing.

Below is the practitioner's read on each provider — strengths, gotchas, and the kind of workload they fit. Pricing is the entry point; operating fit is the deeper question.

Together AI
Price leader · cleanest scaling story

Cheapest in the market at batch + reserved. Broad model coverage including DeepSeek V4. Strong vLLM-based stack. Occasional capacity tightness on hot model launches. Default for steady-state production at scale.

Default · production scale
Fireworks AI
DX leader · fastest model integration

Day-one support for new model releases. Cleanest API and SDK. Fine-tuning workflow is best-in-class. Slightly more expensive than Together at scale. Pair with Together for production; use Fireworks for iteration.

Iteration + custom models
Groq
Throughput leader · LPU specialty hardware

Single-stream throughput unmatched in 2026. Right answer when latency is the cost. Model coverage narrower than commodity providers. Premium pricing reflects specialty silicon. Best for streaming and real-time workloads.

Real-time / streaming
Cerebras
Wafer-scale · enterprise scale + speed

Wafer-scale-engine performance with enterprise contract posture. Capacity-constrained for high-volume customers but unmatched on throughput per-instance. Strong fit for enterprise customers needing both speed and SOC 2 / FedRAMP compliance.

Enterprise + speed
Anyscale Endpoints
Built on Ray · enterprise compliance default

Premium pricing. SOC 2 / HIPAA / EU data residency by default. Deepest contract flexibility. Higher operational maturity than smaller competitors. Right call for regulated industries that can't accept smaller-vendor posture.

Regulated industries
Replicate / OctoAI
Coverage leaders · run anything model

Container-based serving for almost any open-weight model. Higher per-token cost. The fallback for models the big providers don't host yet — custom fine-tunes, niche releases, smaller models. Pair with a price-leader for the bulk of traffic.

Niche / custom models

05 · Decision: Picking providers by workload.

Most production deployments end up using two or three providers, not one. The pattern that wins: a price-leader for steady-state volume, a speed-leader for streaming/critical paths, and a coverage-leader as a fallback for niche models. Below is the workload-class decision.

Workload · Batch-heavy content generation

Article drafts, bulk summarization, data-pipeline transformation. Latency tolerance: minutes-to-hours. Right answer: Together batch tier or reserved capacity. 4-6× cheaper than serverless on the same model.

Together · batch tier
Workload · Production chat / long-running agents

Real-time chat with under-10s response time. Mixed workload, variable QPS. Right answer: Together or Fireworks on serverless tier; pair with Groq for the streaming critical path. Hybrid two-provider routing covers cost + speed.

Together + Groq
Workload · Streaming code completion / IDE plugin

Token-streaming, sub-200ms TTFT, 500+ tokens/sec needed for human-feel typing. Right answer: Groq LPU only. The cost premium is justified by the workload's strict latency profile; no commodity provider matches.

Groq · LPU
Workload · Regulated industry / enterprise contract

Healthcare, finance, public sector. Compliance requirements (HIPAA, SOC 2, EU residency) and contract terms (BAA, indemnification, audit rights). Right answer: Anyscale Endpoints or Cerebras enterprise tier; smaller providers won't sign the necessary terms.

Anyscale / Cerebras enterprise
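The workload-class decision above reduces to a lookup table. A minimal sketch — the workload keys and provider labels are our own shorthand, not any provider's API:

```python
# Workload class -> default provider, per the decision cards above.
# Labels are illustrative; real routing would map to actual client configs.
ROUTING = {
    "batch_content": "together-batch",
    "production_chat": "together-serverless",   # pair with Groq for streaming
    "streaming_code": "groq-lpu",
    "regulated_enterprise": "anyscale-endpoints",
}

def pick_provider(workload: str) -> str:
    """Return the default provider for a workload class."""
    try:
        return ROUTING[workload]
    except KeyError:
        raise ValueError(f"unknown workload class: {workload!r}")

print(pick_provider("streaming_code"))  # groq-lpu
```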

06 · Buying: Buying-process gotchas most teams hit.

  • Listed price ≠ paid price. Every provider has volume tiers and reserved-capacity discounts. Listed pricing is the starting point; expect 25-40% discount on $50K+/month steady-state commitments. Negotiate before scaling.
  • Capacity tightens on hot model launches. When a new frontier model ships (Llama 4 launch, DeepSeek V4 launch), the cheapest providers run out of capacity for several days. Maintain a fallback provider for the first 2-3 weeks after any major launch.
  • Tail latency (P99) isn't advertised. P50 numbers are visible on every provider's page; P95 / P99 tail latency under load is not. For production deployments, run a 24-hour load test and check P99 before locking in.
  • Region availability is uneven. US regions are fully covered by every provider; EU coverage varies (Anyscale and Together strongest); APAC is patchy outside Singapore and Tokyo. Map your traffic geography to provider regions before committing.
  • Custom-model hosting has a cold-start surcharge. If you fine-tune a model and host it on Replicate, OctoAI, or Fireworks, expect 2-5× the per-token cost of a popular hosted model — capacity isn't pre-provisioned for long-tail customers. Bake this into the fine-tune ROI math.
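The P99 check in the third bullet needs nothing more than per-request latency samples and a percentile read. A sketch, with fabricated sample data standing in for a real soak run:

```python
import random
import statistics

random.seed(7)

# Fabricated soak sample: mostly ~0.6s responses plus a heavy tail
# (stand-in for 24 hours of real measurements).
latencies = (
    [random.gauss(0.6, 0.1) for _ in range(950)]
    + [random.uniform(1.5, 4.0) for _ in range(50)]
)

# statistics.quantiles with n=100 returns the 99 percentile cut points.
qs = statistics.quantiles(latencies, n=100)
p50, p95, p99 = qs[49], qs[94], qs[98]

print(f"P50 {p50:.2f}s  P95 {p95:.2f}s  P99 {p99:.2f}s")
```

With a 5% heavy tail like this one, P50 looks fine while P99 is several times worse — exactly the gap the pricing pages don't show.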

07 · Conclusion: The market is differentiated enough to pick deliberately.


Pick by total cost-of-answer, not headline rate.

By Q2 2026 the open-weight serverless inference market is mature enough that the right answer is rarely "use one provider." The seven providers in this matrix have differentiated themselves enough — by price band, by hardware specialty, by enterprise posture, by model coverage — that the production-grade play is to pair two or three of them by workload class.

The deeper move is to build the routing layer first. A simple workload-aware router that sends batch jobs to Together, real-time chat to Fireworks, streaming code to Groq, and niche-model traffic to Replicate beats any single-provider choice on cost-per-answer by 30-50% — and provides automatic failover when capacity tightens on a model launch.
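A minimal sketch of that routing layer with preference-ordered failover. Provider names are labels, `call` is a stand-in for real client code, and the capacity error is simulated — this is the shape of the pattern, not an implementation:

```python
from typing import Callable

# Preference-ordered providers per workload class (labels, not real configs).
ROUTES: dict[str, list[str]] = {
    "batch": ["together", "fireworks"],       # cheapest first, then fallback
    "chat": ["fireworks", "together"],
    "streaming_code": ["groq", "cerebras"],
    "niche_model": ["replicate", "octoai"],
}

def route(workload: str, call: Callable[[str], str]) -> str:
    """Try providers in preference order; fall through on capacity errors."""
    last_err: Exception | None = None
    for provider in ROUTES[workload]:
        try:
            return call(provider)
        except RuntimeError as err:  # stand-in for a capacity/5xx error
            last_err = err
    raise RuntimeError(f"all providers failed for {workload!r}") from last_err

# Usage: simulate the price-leader being out of capacity on a model launch.
def fake_call(provider: str) -> str:
    if provider == "together":
        raise RuntimeError("capacity exhausted")
    return f"answer from {provider}"

print(route("batch", fake_call))  # falls over to fireworks
```

The failover branch is what earns its keep during the capacity crunches described in the buying section.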

Re-run this matrix quarterly. The 6× pricing spread is not stable; new providers enter (Modal, RunPod expanding into serverless), specialty hardware drops in price as inference silicon matures, and Cerebras and Groq are both ramping capacity. The decision that's right today is right today, not for the next 12 months.

Multi-provider inference

Move past single-provider thinking. Build the hybrid stack.

We help engineering and product teams pick, contract, and operate multi-provider inference stacks for production at scale — covering provider selection, hybrid routing, capacity planning, and per-workload cost modelling.

Free consultation · Expert guidance · Tailored solutions
What we work on

Inference-stack engagements

  • Provider selection — Together, Fireworks, Groq, Anyscale, Cerebras
  • Hybrid workload-aware routing across 2-3 providers
  • Reserved-capacity timing and commit ladders
  • P99 tail-latency testing and fallback design
  • Enterprise contract negotiation for regulated industries
FAQ · AI inference providers Q2 2026

The questions we get every week.

Why does the same model cost 6× more on one provider than another?

Three reasons. First, hardware: commodity H100 endpoints (Together, Fireworks) cost less per-token than specialty silicon (Groq LPU, Cerebras wafer-scale) because the underlying chip economics are different. Second, SLA tier: Together's batch tier accepts 60-minute latency for 75% off; the reserved tier guarantees capacity at a 25-30% premium over batch. Third, contract posture: Anyscale's enterprise tier prices 1.5-2× higher than Together because the contract includes SOC 2, HIPAA, BAA terms, and indemnification that smaller providers can't match. Headline price means nothing without knowing the tier and the hardware.