By Q2 2026, the serverless inference market for open-weight models has consolidated around seven providers. Pricing on the same model spreads 6× across the field; P50 latency spreads 5-7×; throughput on specialty hardware (Groq LPU, Cerebras wafer-scale) spreads up to 10× over commodity H100 endpoints. The matrix matters because the wrong default costs more than the engineering work to switch.
This post compiles the per-token pricing, P50 latency, and output throughput across the seven providers and eleven popular open-weight models — Llama 4 70B / 405B, Qwen 3 72B / 235B-MoE, DeepSeek V4-Pro / V4-Flash, Mistral Large 2, Command-R+, Mixtral 8x22B, plus Phi-5 and Granite Code at smaller scale. Numbers are taken from public pricing pages, Artificial Analysis benchmarks, and direct Apr 2026 testing.
- 01: Same model, 6× pricing spread — pick by total cost-of-answer, not headline rate. Llama 4 70B costs $0.65/1M at Together's batch tier and $4.20/1M at the most expensive listed price. The cheap tier has tail-latency caveats; the expensive tier guarantees premium SLAs. Without knowing your latency tolerance, the headline number is meaningless.
- 02: Specialty hardware (Groq, Cerebras) wins on throughput-bound workloads by 5-10×. Groq's LPU hits 750 tokens/sec on Llama 4 70B output decode; Cerebras hits 600+. A typical H100 endpoint runs 100-150. The premium-priced specialty providers are the right default for streaming chat, real-time coding, and any workload where time-to-finish matters more than per-token cost.
- 03: Together AI is the price leader at scale; Fireworks is the developer-experience leader. Together's batch and reserved tiers are the cheapest in the market for steady-state production. Fireworks ships the cleanest API, fastest model integration, and best fine-tuning workflow. Most agencies end up with both — Together for stable production, Fireworks for fast iteration and custom models.
- 04: Replicate and OctoAI fill the model-availability gap for niche or custom models. If you need a model the big providers don't host (custom fine-tunes, smaller Granite Code, niche releases), Replicate and OctoAI run almost anything via container. Per-token pricing is higher, but the alternative is self-hosting just for that one model.
- 05: Anyscale Endpoints is the enterprise option — higher price floor, deeper compliance and SLAs. Anyscale is built on Ray, ships HIPAA / SOC 2 / EU data residency by default, and has the strongest enterprise contract terms. Higher per-token rate (often 1.5-2× Together) but the right call for regulated industries that can't accept a smaller vendor's compliance posture.
01 — Landscape
Seven providers, three positioning bands.
The seven providers in this matrix occupy three positioning bands. The price-leaders (Together, Fireworks) compete on per-token cost and broad model coverage. The performance-leaders (Groq, Cerebras) compete on throughput and time-to-first-token. The coverage-leaders (Replicate, OctoAI) compete on model breadth and custom-fine-tune support. Anyscale Endpoints sits across bands — competitive on price, enterprise-strong on contracts.
Price-led — Together AI · Fireworks AI
Cheapest per-token · broad model coverage
Both run vLLM-class stacks on H100 / H200 clusters with aggressive batching. Together's batch + reserved tiers price-lead the market. Fireworks ships the cleanest DX. Right default for steady-state production volume.
Default · steady-state

Speed-led — Groq · Cerebras
Specialty hardware · 5-10× throughput
Groq's LPU and Cerebras's wafer-scale chip skip the H100 commodity stack entirely. 600-750 tokens/sec output decode on 70B-class models. Premium pricing (often 2-3× Together's). Right call for streaming chat, real-time coding, and time-bound workloads.
Real-time · streaming

Coverage-led — Replicate · OctoAI
Anything model · container-based
Run almost any open-weight model, including custom fine-tunes and niche releases. Higher per-token cost but the only way to get specific models (custom fine-tunes, smaller Granite, regional fine-tunes) without self-hosting. Right for portfolio breadth.
Niche / custom

Enterprise — Anyscale Endpoints
Built on Ray · enterprise compliance default
HIPAA / SOC 2 / EU data residency built in. Strongest enterprise contract terms. Premium pricing (1.5-2× Together). Right for regulated industries (healthcare, finance, public sector) that can't accept smaller-vendor compliance posture.
Regulated industries

02 — Pricing
Per-token pricing on Llama 4 70B.
The cleanest cross-provider comparison is on a single popular open-weight model — Llama 4 70B is the 2026 reference because every provider in this matrix hosts it and the architecture (GQA + 70B dense) is identical across endpoints. The 6× spread below is real; the question is what comes with each tier.
Llama 4 70B · per-token pricing across providers (output tokens)
Source: Provider pricing pages · Artificial Analysis · Apr 24, 2026

The pattern: prices reflect what you're buying. Together's batch tier saves 75% per token but accepts a 60-minute SLA. Groq's LPU charges 5× more but delivers 5-7× the throughput. The matrix makes the trade explicit instead of implicit.
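To make the spread concrete, here is a back-of-envelope sketch in Python using the two Llama 4 70B output prices cited earlier ($0.65/1M and $4.20/1M). The per-request output length and monthly volume are hypothetical placeholders, there only to show the shape of the math, not benchmark data.

```python
# Cost-of-answer arithmetic on the cited Llama 4 70B output prices.
# Request shape and monthly volume are assumed values -- swap in your own.

PRICES_PER_M_OUTPUT = {              # USD per 1M output tokens (from the text above)
    "together_batch": 0.65,
    "most_expensive_listed": 4.20,
}

OUTPUT_TOKENS_PER_REQUEST = 800      # assumed average completion length
REQUESTS_PER_MONTH = 2_000_000       # assumed steady-state volume

for tier, price in PRICES_PER_M_OUTPUT.items():
    per_request = OUTPUT_TOKENS_PER_REQUEST / 1_000_000 * price
    per_month = per_request * REQUESTS_PER_MONTH
    print(f"{tier:>24}: ${per_request:.6f}/request  ≈ ${per_month:,.0f}/month")

# At these assumptions: ~$1,040/month vs ~$6,720/month for the same answers.
# The gap either buys SLAs you actually need, or it buys nothing.
```

Swap in your own request shape and the quotes you actually receive; cost-of-answer is a per-workload number, not a pricing-page number.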
03 — Latency
Latency, time-to-first-token, and decode throughput.
Cost is one axis; latency is the other. The two move independently — cheap providers can have low latency (Together on-demand) and expensive providers can have high latency (Cerebras at low utilization). The numbers below are P50 measurements from Apr 2026 on Llama 4 70B output decode.
Groq — output decode throughput
750+ tokens/sec on Llama 4 70B. P50 time-to-first-token under 200ms. The fastest commercially available endpoint for open-weight models. Trade is per-token cost (5× Together batch).
Speed leader

Cerebras — output decode throughput
600-650 tokens/sec on Llama 4 70B. P50 TTFT slightly higher than Groq (~250ms). Wafer-scale chip; capacity scarcer than Groq's LPU fleet, occasional waitlists for high-volume contracts.
Speed band

Commodity H100 endpoints — output decode throughput
100-150 tokens/sec on Llama 4 70B at typical H100 endpoints. P50 TTFT 400-700ms. Standard commodity-cluster performance — what most production deployments run on. Acceptable for chat, marginal for streaming code.
Commodity tier

"You don't pay for inference in tokens. You pay in time-to-finish — and that depends as much on the silicon as the price."
— Internal provider-eval notes, May 2026
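To put the throughput numbers in wall-clock terms, here is a small sketch of time-to-finish for one streamed response. The throughput and TTFT values are midpoints of the P50 ranges quoted above, and the 1,000-token completion length is an assumed example.

```python
# Time-to-finish ≈ TTFT + output_tokens / decode_throughput, per endpoint class.
# Figures are midpoints of the P50 ranges above; completion length is assumed.

ENDPOINTS = {                      # (P50 TTFT seconds, output tokens/sec)
    "Groq LPU":       (0.20, 750),
    "Cerebras":       (0.25, 625),
    "Commodity H100": (0.55, 125),
}

OUTPUT_TOKENS = 1_000              # assumed completion length

for name, (ttft, tps) in ENDPOINTS.items():
    total = ttft + OUTPUT_TOKENS / tps
    print(f"{name:>14}: {total:5.2f} s to finish {OUTPUT_TOKENS} tokens")

# Roughly 1.5 s on Groq vs 8.6 s on a commodity endpoint for the same answer --
# which is the point of the quote above: you pay in time-to-finish.
```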
04 — Providers
Provider-by-provider notes worth knowing.
Below is the practitioner's read on each provider — strengths, gotchas, and the kind of workload they fit. Pricing is the entry point; operating fit is the deeper question.
Together AI — price leader · cleanest scaling story
Cheapest in the market at batch + reserved. Broad model coverage including DeepSeek V4. Strong vLLM-based stack. Occasional capacity tightness on hot model launches. Default for steady-state production at scale.
Default · production scale

Fireworks AI — DX leader · fastest model integration
Day-one support for new model releases. Cleanest API and SDK. Fine-tuning workflow is best-in-class. Slightly more expensive than Together at scale. Pair with Together for production; use Fireworks for iteration.
Iteration + custom models

Groq — throughput leader · LPU specialty hardware
Single-stream throughput unmatched in 2026. The right answer when latency is the dominant cost. Model coverage narrower than commodity providers. Premium pricing reflects specialty silicon. Best for streaming and real-time workloads.
Real-time / streaming

Cerebras — wafer-scale · enterprise scale + speed
Wafer-scale-engine performance with enterprise contract posture. Capacity-constrained for high-volume customers but unmatched on per-instance throughput. Strong fit for enterprise customers needing both speed and SOC 2 / FedRAMP compliance.
Enterprise + speed

Anyscale Endpoints — built on Ray · enterprise compliance default
Premium pricing. SOC 2 / HIPAA / EU data residency by default. Deepest contract flexibility. Higher operational maturity than smaller competitors. Right call for regulated industries that can't accept smaller-vendor posture.
Regulated industries

Replicate · OctoAI — coverage leaders · run-anything model catalog
Container-based serving for almost any open-weight model. Higher per-token cost. The fallback for models the big providers don't host yet — custom fine-tunes, niche releases, smaller models. Pair with a price-leader for the bulk of traffic.
Niche / custom models

05 — Decision
Picking providers by workload.
Most production deployments end up using two or three providers, not one. The pattern that wins: a price-leader for steady-state volume, a speed-leader for streaming and critical paths, and a coverage-leader as a fallback for niche models. Below is the decision by workload class.
Batch-heavy content generation
Article drafts, bulk summarization, data-pipeline transformation. Latency tolerance: minutes-to-hours. Right answer: Together batch tier or reserved capacity. 4-6× cheaper than serverless on the same model.
Together · batch tier

Production chat / long-running agents
Real-time chat with under-10s response time. Mixed workload, variable QPS. Right answer: Together or Fireworks on serverless tier; pair with Groq for the streaming critical path. Hybrid two-provider routing covers cost + speed.
Together + Groq

Streaming code completion / IDE plugin
Token-streaming, sub-200ms TTFT, 500+ tokens/sec needed for human-feel typing. Right answer: Groq LPU only. The cost premium is justified by the workload's strict latency profile; no commodity provider matches.
Groq · LPU

Regulated industry · enterprise contract
Healthcare, finance, public sector. Compliance requirements (HIPAA, SOC 2, EU residency) and contract terms (BAA, indemnification, audit rights). Right answer: Anyscale Endpoints or Cerebras enterprise tier; smaller providers won't sign the necessary terms.
Anyscale / Cerebras enterprise

06 — Buying
Buying-process gotchas most teams hit.
- Listed price ≠ paid price. Every provider has volume tiers and reserved-capacity discounts. Listed pricing is the starting point; expect 25-40% discount on $50K+/month steady-state commitments. Negotiate before scaling.
- Capacity tightens on hot model launches. When a new frontier model ships (Llama 4 launch, DeepSeek V4 launch), the cheapest providers run out of capacity for several days. Maintain a fallback provider for the first 2-3 weeks after any major launch.
- P99 tail latency isn't advertised. P50 numbers are visible on every provider's page; P95 / P99 tail latency under load is not. For production deployments, run a 24-hour load test and measure P99 before locking in (a minimal measurement sketch follows this list).
- Region availability is uneven. US regions are fully covered by every provider; EU coverage varies (Anyscale and Together strongest); APAC is patchy outside Singapore and Tokyo. Map your traffic geography to provider regions before committing.
- Custom-model hosting has a cold-start surcharge. If you fine-tune a model and host it on Replicate, OctoAI, or Fireworks, expect 2-5× the per-token cost of a popular hosted model — capacity isn't pre-provisioned for long-tail customers. Bake this into the fine-tune ROI math.
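For the P99 point above, a minimal measurement sketch, assuming an OpenAI-compatible streaming endpoint: the base URL, API key, model id, request count, and concurrency are placeholders to adapt, and a real pre-commit test should run at production-like load for the full 24 hours rather than the few minutes this takes.

```python
# Probe P50/P95/P99 time-to-first-token against an OpenAI-compatible streaming
# endpoint. All identifiers below are placeholders, not a specific provider's.

import os
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = os.environ.get("EVAL_BASE_URL", "https://api.example-provider.com/v1")
API_KEY = os.environ.get("EVAL_API_KEY", "")
MODEL = os.environ.get("EVAL_MODEL", "llama-4-70b")   # placeholder model id

def time_to_first_token() -> float:
    """Send one streaming chat request and return seconds until the first token."""
    start = time.monotonic()
    with requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Say ok."}],
            "stream": True,
            "max_tokens": 16,
        },
        stream=True,
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # First SSE "data:" chunk is a close proxy for the first token.
            if line and line.startswith(b"data: ") and b"[DONE]" not in line:
                return time.monotonic() - start
    return float("nan")

if __name__ == "__main__":
    n_requests, concurrency = 200, 8          # placeholders; scale up for a real test
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = sorted(pool.map(lambda _: time_to_first_token(), range(n_requests)))
    p = statistics.quantiles(samples, n=100)  # 99 cut points: p[49]=P50, p[94]=P95, p[98]=P99
    print(f"TTFT  P50 {p[49]*1000:.0f} ms   P95 {p[94]*1000:.0f} ms   P99 {p[98]*1000:.0f} ms")
```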
07 — Conclusion
The market is differentiated enough to pick deliberately.
Pick by total cost-of-answer, not headline rate.
By Q2 2026 the open-weight serverless inference market is mature enough that the right answer is rarely "use one provider." The seven providers in this matrix have differentiated themselves enough — by price band, by hardware specialty, by enterprise posture, by model coverage — that the production-grade play is to pair two or three of them by workload class.
The deeper move is to build the routing layer first. A simple workload-aware router that sends batch jobs to Together, real-time chat to Fireworks, streaming code to Groq, and niche-model traffic to Replicate beats any single-provider choice on cost-per-answer by 30-50% — and provides automatic failover when capacity tightens on a model launch.
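A minimal sketch of that router, assuming the OpenAI-compatible endpoints most of these providers expose. The base URLs, API keys, model ids, and failover order are placeholders to fill in from each provider's docs and your own eval numbers, not verbatim configuration.

```python
# Workload-aware routing with ordered failover. Every identifier in ROUTES is
# a placeholder -- substitute real endpoints, keys, and model ids before use.

from dataclasses import dataclass

from openai import OpenAI

@dataclass
class Route:
    base_url: str   # provider's OpenAI-compatible endpoint (placeholder)
    api_key: str
    model: str

# Workload class -> ordered routes: primary first, failover(s) after it.
ROUTES: dict[str, list[Route]] = {
    "batch":          [Route("https://<together-endpoint>/v1", "KEY", "llama-4-70b"),
                       Route("https://<fireworks-endpoint>/v1", "KEY", "llama-4-70b")],
    "realtime_chat":  [Route("https://<fireworks-endpoint>/v1", "KEY", "llama-4-70b"),
                       Route("https://<together-endpoint>/v1", "KEY", "llama-4-70b")],
    "streaming_code": [Route("https://<groq-endpoint>/v1", "KEY", "llama-4-70b")],
    "niche_model":    [Route("https://<replicate-endpoint>/v1", "KEY", "custom-finetune")],
}

def complete(workload: str, messages: list[dict]) -> str:
    """Send a chat completion to the workload's primary route, failing over in order."""
    last_err: Exception | None = None
    for route in ROUTES[workload]:
        try:
            client = OpenAI(base_url=route.base_url, api_key=route.api_key)
            resp = client.chat.completions.create(model=route.model, messages=messages)
            return resp.choices[0].message.content
        except Exception as err:        # capacity / 5xx / timeout: try the next route
            last_err = err
    raise RuntimeError(f"all providers failed for workload {workload!r}") from last_err

# Usage: complete("realtime_chat", [{"role": "user", "content": "hello"}])
```

A router this small already gives you the two properties described above: per-workload cost and speed placement, plus automatic failover when a provider's capacity tightens.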
Re-run this matrix quarterly. The 6× pricing spread is not stable; new providers enter (Modal, RunPod expanding into serverless), specialty hardware drops in price as inference silicon matures, and Cerebras and Groq are both ramping capacity. The decision that's right today is right today, not for the next 12 months.