By Q2 2026, the serverless inference market for open-weight models has consolidated around seven providers. Pricing on the same model spreads 6× across the field; P50 latency spreads 5-7×; throughput on specialty hardware (Groq LPU, Cerebras wafer-scale) spreads up to 10× over commodity H100 endpoints. The matrix matters because the wrong default costs more than the engineering work to switch.
This post compiles the per-token pricing, P50 latency, and output throughput across the seven providers and eleven popular open-weight models — Llama 4 70B / 405B, Qwen 3 72B / 235B-MoE, DeepSeek V4-Pro / V4-Flash, Mistral Large 2, Command-R+, Mixtral 8x22B, plus Phi-5 and Granite Code at smaller scale. Numbers are taken from public pricing pages, Artificial Analysis benchmarks, and direct Apr 2026 testing.
- 01: Same model, 6× pricing spread — pick by total cost-of-answer, not headline rate. Llama 4 70B costs $0.65/1M at Together's batch tier and $4.20/1M at the most expensive listed price. The cheap tier has tail-latency caveats; the expensive tier guarantees premium SLAs. Without knowing your latency tolerance, the headline number is meaningless.
- 02: Specialty hardware (Groq, Cerebras) wins on throughput-bound workloads by 5-10×. Groq's LPU hits 750 tokens/sec on Llama 4 70B output decode; Cerebras hits 600+. A typical H100 endpoint runs 100-150. The premium-priced specialty providers are the right default for streaming chat, real-time coding, and any workload where time-to-finish matters more than per-token cost.
- 03: Together AI is the price leader at scale; Fireworks is the developer-experience leader. Together's batch and reserved tiers are the cheapest in the market for steady-state production. Fireworks ships the cleanest API, fastest model integration, and best fine-tuning workflow. Most agencies end up with both — Together for stable production, Fireworks for fast iteration and custom models.
- 04: Replicate and OctoAI fill the model-availability gap for niche or custom models. If you need a model the big providers don't host (custom fine-tunes, smaller Granite Code, niche releases), Replicate and OctoAI run almost anything via container. Per-token pricing is higher, but the alternative is self-hosting just for that one model.
- 05: Anyscale Endpoints is the enterprise option — higher price floor, deeper compliance and SLAs. Anyscale is built on Ray, ships HIPAA / SOC 2 / EU data residency by default, and has the strongest enterprise contract terms. Higher per-token rate (often 1.5-2× Together) but the right call for regulated industries that can't accept a smaller vendor's compliance posture.
01 — Landscape
Seven providers, three positioning bands.
The seven providers in this matrix occupy three positioning bands. The price-leaders (Together, Fireworks) compete on per-token cost and broad model coverage. The performance-leaders (Groq, Cerebras) compete on throughput and time-to-first-token. The coverage-leaders (Replicate, OctoAI) compete on model breadth and custom-fine-tune support. Anyscale Endpoints sits across bands — competitive on price, enterprise-strong on contracts.
Price-led — Together AI · Fireworks AI
Cheapest per-token · broad model coverage
Both run vLLM-class stacks on H100 / H200 clusters with aggressive batching. Together's batch + reserved tiers price-lead the market. Fireworks ships the cleanest DX. Right default for steady-state production volume.
Default · steady-state

Speed-led — Groq · Cerebras
Specialty hardware · 5-10× throughput
Groq's LPU and Cerebras's wafer-scale chip skip the H100 commodity stack entirely. 600-750 tokens/sec output decode on 70B-class models. Premium pricing (often 2-3× Together's). Right call for streaming chat, real-time coding, and time-bound workloads.
Real-time · streaming

Coverage-led — Replicate · OctoAI
Anything model · container-based
Run almost any open-weight model, including custom fine-tunes and niche releases. Higher per-token cost but the only way to get specific models (custom fine-tunes, smaller Granite, regional fine-tunes) without self-hosting. Right for portfolio breadth.
Niche / custom

Enterprise — Anyscale Endpoints
Built on Ray · enterprise compliance default
HIPAA / SOC 2 / EU data residency built in. Strongest enterprise contract terms. Premium pricing (1.5-2× Together). Right for regulated industries (healthcare, finance, public sector) that can't accept smaller-vendor compliance posture.
Regulated industries

02 — Pricing
Per-token pricing on Llama 4 70B.
The cleanest cross-provider comparison is on a single popular open-weight model — Llama 4 70B is the 2026 reference because every provider in this matrix hosts it and the architecture (GQA + 70B dense) is identical across endpoints. The 6× spread below is real; the question is what comes with each tier.
Llama 4 70B · per-token pricing across providers (output tokens)
Source: Provider pricing pages · Artificial Analysis · Apr 24, 2026

The pattern: prices reflect what you're buying. Together's batch tier saves 75% per token but accepts a 60-minute SLA. Groq's LPU charges 5× more but delivers 5-7× the throughput. The matrix makes the trade explicit instead of implicit.
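To make the spread concrete, here is a back-of-envelope sketch in Python using the two Llama 4 70B output prices cited earlier ($0.65/1M and $4.20/1M). The per-request output length and monthly volume are hypothetical placeholders, there only to show the shape of the math, not benchmark data.

```python
# Cost-of-answer arithmetic on the cited Llama 4 70B output prices.
# Request shape and monthly volume are assumed values -- swap in your own.

PRICES_PER_M_OUTPUT = {              # USD per 1M output tokens (from the text above)
    "together_batch": 0.65,
    "most_expensive_listed": 4.20,
}

OUTPUT_TOKENS_PER_REQUEST = 800      # assumed average completion length
REQUESTS_PER_MONTH = 2_000_000       # assumed steady-state volume

for tier, price in PRICES_PER_M_OUTPUT.items():
    per_request = OUTPUT_TOKENS_PER_REQUEST / 1_000_000 * price
    per_month = per_request * REQUESTS_PER_MONTH
    print(f"{tier:>24}: ${per_request:.6f}/request  ≈ ${per_month:,.0f}/month")

# At these assumptions: ~$1,040/month vs ~$6,720/month for the same answers.
# The gap either buys SLAs you actually need, or it buys nothing.
```

Swap in your own request shape and the quotes you actually receive; cost-of-answer is a per-workload number, not a pricing-page number.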
03 — Latency
Latency, time-to-first-token, and decode throughput.
Cost is one axis; latency is the other. The two move independently — cheap providers can have low latency (Together on-demand) and expensive providers can have high latency (Cerebras at low utilization). The numbers below are P50 measurements from Apr 2026 on Llama 4 70B output decode.
Groq — output decode throughput
750+ tokens/sec on Llama 4 70B. P50 time-to-first-token under 200ms. The fastest commercially available endpoint for open-weight models. Trade is per-token cost (5× Together batch).
Speed leader

Cerebras — output decode throughput
600-650 tokens/sec on Llama 4 70B. P50 TTFT slightly higher than Groq (~250ms). Wafer-scale chip; capacity scarcer than Groq's LPU fleet, occasional waitlists for high-volume contracts.
Speed band

Commodity H100 endpoints — output decode throughput
100-150 tokens/sec on Llama 4 70B at typical H100 endpoints. P50 TTFT 400-700ms. Standard commodity-cluster performance — what most production deployments run on. Acceptable for chat, marginal for streaming code.
Commodity tier

"You don't pay for inference in tokens. You pay in time-to-finish — and that depends as much on the silicon as the price."
— Internal provider-eval notes, May 2026
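To put the throughput numbers in wall-clock terms, here is a small sketch of time-to-finish for one streamed response. The throughput and TTFT values are midpoints of the P50 ranges quoted above, and the 1,000-token completion length is an assumed example.

```python
# Time-to-finish ≈ TTFT + output_tokens / decode_throughput, per endpoint class.
# Figures are midpoints of the P50 ranges above; completion length is assumed.

ENDPOINTS = {                      # (P50 TTFT seconds, output tokens/sec)
    "Groq LPU":       (0.20, 750),
    "Cerebras":       (0.25, 625),
    "Commodity H100": (0.55, 125),
}

OUTPUT_TOKENS = 1_000              # assumed completion length

for name, (ttft, tps) in ENDPOINTS.items():
    total = ttft + OUTPUT_TOKENS / tps
    print(f"{name:>14}: {total:5.2f} s to finish {OUTPUT_TOKENS} tokens")

# Roughly 1.5 s on Groq vs 8.6 s on a commodity endpoint for the same answer --
# which is the point of the quote above: you pay in time-to-finish.
```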
04 — Providers
Provider-by-provider notes worth knowing.
Below is the practitioner's read on each provider — strengths, gotchas, and the kind of workload they fit. Pricing is the entry point; operating fit is the deeper question.
Together AI — price leader · cleanest scaling story
Cheapest in the market at batch + reserved. Broad model coverage including DeepSeek V4. Strong vLLM-based stack. Occasional capacity tightness on hot model launches. Default for steady-state production at scale.
Default · production scale

Fireworks AI — DX leader · fastest model integration
Day-one support for new model releases. Cleanest API and SDK. Fine-tuning workflow is best-in-class. Slightly more expensive than Together at scale. Pair with Together for production; use Fireworks for iteration.
Iteration + custom models

Groq — throughput leader · LPU specialty hardware
Single-stream throughput unmatched in 2026. The right answer when latency is the dominant cost. Model coverage narrower than commodity providers. Premium pricing reflects specialty silicon. Best for streaming and real-time workloads.
Real-time / streaming

Cerebras — wafer-scale · enterprise scale + speed
Wafer-scale-engine performance with enterprise contract posture. Capacity-constrained for high-volume customers but unmatched on per-instance throughput. Strong fit for enterprise customers needing both speed and SOC 2 / FedRAMP compliance.
Enterprise + speed

Anyscale Endpoints — built on Ray · enterprise compliance default
Premium pricing. SOC 2 / HIPAA / EU data residency by default. Deepest contract flexibility. Higher operational maturity than smaller competitors. Right call for regulated industries that can't accept smaller-vendor posture.
Regulated industries

Replicate · OctoAI — coverage leaders · run-anything model catalog
Container-based serving for almost any open-weight model. Higher per-token cost. The fallback for models the big providers don't host yet — custom fine-tunes, niche releases, smaller models. Pair with a price-leader for the bulk of traffic.
Niche / custom models

05 — Decision
Picking providers by workload.
Most production deployments end up using two or three providers, not one. The pattern that wins: a price-leader for steady-state volume, a speed-leader for streaming and critical paths, and a coverage-leader as a fallback for niche models. Below is the decision by workload class.
Batch-heavy content generation
Article drafts, bulk summarization, data-pipeline transformation. Latency tolerance: minutes-to-hours. Right answer: Together batch tier or reserved capacity. 4-6× cheaper than serverless on the same model.
Together · batch tier

Production chat / long-running agents
Real-time chat with under-10s response time. Mixed workload, variable QPS. Right answer: Together or Fireworks on serverless tier; pair with Groq for the streaming critical path. Hybrid two-provider routing covers cost + speed.
Together + Groq

Streaming code completion / IDE plugin
Token-streaming, sub-200ms TTFT, 500+ tokens/sec needed for human-feel typing. Right answer: Groq LPU only. The cost premium is justified by the workload's strict latency profile; no commodity provider matches.
Groq · LPU

Regulated industry · enterprise contract
Healthcare, finance, public sector. Compliance requirements (HIPAA, SOC 2, EU residency) and contract terms (BAA, indemnification, audit rights). Right answer: Anyscale Endpoints or Cerebras enterprise tier; smaller providers won't sign the necessary terms.
Anyscale / Cerebras enterprise

06 — Buying
Buying-process gotchas most teams hit.
- Listed price ≠ paid price. Every provider has volume tiers and reserved-capacity discounts. Listed pricing is the starting point; expect 25-40% discount on $50K+/month steady-state commitments. Negotiate before scaling.
- Capacity tightens on hot model launches. When a new frontier model ships (Llama 4 launch, DeepSeek V4 launch), the cheapest providers run out of capacity for several days. Maintain a fallback provider for the first 2-3 weeks after any major launch.
- P99 tail latency isn't advertised. P50 numbers are visible on every provider's page; P95 / P99 tail latency under load is not. For production deployments, run a 24-hour load test and measure P99 before locking in (a minimal measurement sketch follows this list).
- Region availability is uneven. US regions are fully covered by every provider; EU coverage varies (Anyscale and Together strongest); APAC is patchy outside Singapore and Tokyo. Map your traffic geography to provider regions before committing.
- Custom-model hosting has a cold-start surcharge. If you fine-tune a model and host it on Replicate, OctoAI, or Fireworks, expect 2-5× the per-token cost of a popular hosted model — capacity isn't pre-provisioned for long-tail customers. Bake this into the fine-tune ROI math.
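For the P99 point above, a minimal measurement sketch, assuming an OpenAI-compatible streaming endpoint: the base URL, API key, model id, request count, and concurrency are placeholders to adapt, and a real pre-commit test should run at production-like load for the full 24 hours rather than the few minutes this takes.

```python
# Probe P50/P95/P99 time-to-first-token against an OpenAI-compatible streaming
# endpoint. All identifiers below are placeholders, not a specific provider's.

import os
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = os.environ.get("EVAL_BASE_URL", "https://api.example-provider.com/v1")
API_KEY = os.environ.get("EVAL_API_KEY", "")
MODEL = os.environ.get("EVAL_MODEL", "llama-4-70b")   # placeholder model id

def time_to_first_token() -> float:
    """Send one streaming chat request and return seconds until the first token."""
    start = time.monotonic()
    with requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Say ok."}],
            "stream": True,
            "max_tokens": 16,
        },
        stream=True,
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # First SSE "data:" chunk is a close proxy for the first token.
            if line and line.startswith(b"data: ") and b"[DONE]" not in line:
                return time.monotonic() - start
    return float("nan")

if __name__ == "__main__":
    n_requests, concurrency = 200, 8          # placeholders; scale up for a real test
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = sorted(pool.map(lambda _: time_to_first_token(), range(n_requests)))
    p = statistics.quantiles(samples, n=100)  # 99 cut points: p[49]=P50, p[94]=P95, p[98]=P99
    print(f"TTFT  P50 {p[49]*1000:.0f} ms   P95 {p[94]*1000:.0f} ms   P99 {p[98]*1000:.0f} ms")
```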
07 — Conclusion
The market is differentiated enough to pick deliberately.
Pick by total cost-of-answer, not headline rate.
By Q2 2026 the open-weight serverless inference market is mature enough that the right answer is rarely "use one provider." The seven providers in this matrix have differentiated themselves enough — by price band, by hardware specialty, by enterprise posture, by model coverage — that the production-grade play is to pair two or three of them by workload class.
The deeper move is to build the routing layer first. A simple workload-aware router that sends batch jobs to Together, real-time chat to Fireworks, streaming code to Groq, and niche-model traffic to Replicate beats any single-provider choice on cost-per-answer by 30-50% — and provides automatic failover when capacity tightens on a model launch.
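A minimal sketch of that router, assuming the OpenAI-compatible endpoints most of these providers expose. The base URLs, API keys, model ids, and failover order are placeholders to fill in from each provider's docs and your own eval numbers, not verbatim configuration.

```python
# Workload-aware routing with ordered failover. Every identifier in ROUTES is
# a placeholder -- substitute real endpoints, keys, and model ids before use.

from dataclasses import dataclass

from openai import OpenAI

@dataclass
class Route:
    base_url: str   # provider's OpenAI-compatible endpoint (placeholder)
    api_key: str
    model: str

# Workload class -> ordered routes: primary first, failover(s) after it.
ROUTES: dict[str, list[Route]] = {
    "batch":          [Route("https://<together-endpoint>/v1", "KEY", "llama-4-70b"),
                       Route("https://<fireworks-endpoint>/v1", "KEY", "llama-4-70b")],
    "realtime_chat":  [Route("https://<fireworks-endpoint>/v1", "KEY", "llama-4-70b"),
                       Route("https://<together-endpoint>/v1", "KEY", "llama-4-70b")],
    "streaming_code": [Route("https://<groq-endpoint>/v1", "KEY", "llama-4-70b")],
    "niche_model":    [Route("https://<replicate-endpoint>/v1", "KEY", "custom-finetune")],
}

def complete(workload: str, messages: list[dict]) -> str:
    """Send a chat completion to the workload's primary route, failing over in order."""
    last_err: Exception | None = None
    for route in ROUTES[workload]:
        try:
            client = OpenAI(base_url=route.base_url, api_key=route.api_key)
            resp = client.chat.completions.create(model=route.model, messages=messages)
            return resp.choices[0].message.content
        except Exception as err:        # capacity / 5xx / timeout: try the next route
            last_err = err
    raise RuntimeError(f"all providers failed for workload {workload!r}") from last_err

# Usage: complete("realtime_chat", [{"role": "user", "content": "hello"}])
```

A router this small already gives you the two properties described above: per-workload cost and speed placement, plus automatic failover when a provider's capacity tightens.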
Re-run this matrix quarterly. The 6× pricing spread is not stable; new providers enter (Modal, RunPod expanding into serverless), specialty hardware drops in price as inference silicon matures, and Cerebras and Groq are both ramping capacity. The decision that's right today is right today, not for the next 12 months.