Latency is the single most-cited LLM-ops metric and the one that shifts fastest. We probed 30 model+provider pairings weekly for 90 days; the data below is a canonical Q2 2026 reference. We report time-to-first-token (TTFT) for chat UX, tokens per second (TPS) for streaming output, and both P50 and P95 to capture tail-latency reality.
Headlines: Groq runs Llama 4 405B at 480 tokens/sec with 0.18-second TTFT P50. Cerebras runs Qwen 3 235B at 525 tokens/sec. Claude Opus 4.7 standard sits at 78 TPS / 0.85s TTFT P50; GPT-5.5 standard at 92 TPS / 1.1s TTFT P50. Reasoning mode is the latency landmine: setting GPT-5.5 Pro to high reasoning effort pushes TTFT P50 to 67 seconds. Pick by UX class, not by capability ceiling.
The decision matrix in §07 maps six UX classes (chat, autocomplete, agentic background, batch, real-time voice, IDE) to the right model+provider pairing. Use it as a starting policy and measure against your specific traffic shape.
- 01 — Throughput leaders are alt-architecture providers. Groq and Cerebras hit 480-525 TPS, 4-6× the generalist providers. Groq's LPU and Cerebras's wafer-scale silicon are not faster models; they are faster providers serving the same Llama 4 / Qwen 3 weights. The premium pays for throughput, not capability. For chat UX, where TTFT and TPS dominate perceived performance, alt-architecture providers are the right default at the open-weight tier.
- 02 — Frontier closed-source models cluster at 70-110 TPS with 0.7-1.4s TTFT. GPT-5.5 standard, Claude Opus 4.7, and Gemini 3 Pro all sit in this band; none has traded capability for raw throughput. For sub-2-second UX, this is the steady state; for sub-1-second, only standard reasoning works (extended thinking adds 5-30s).
- 03 — The reasoning-mode latency tax is 5-30× on TTFT; chat UX cannot afford it. Extended thinking inflates TTFT 5-30× across the frontier: GPT-5.5 Pro at high reasoning_effort, 67s P50; Claude Opus 4.7 with extended thinking, 28s; Gemini 3 Pro Deep Think high, 52s. For interactive UX with sub-2-second budgets, reasoning mode is unusable; for batch and async, it's irrelevant.
- 04 — Regional spread is 30-200ms of TTFT across the four major regions; pick by user concentration. Provider region matters: US-East to APAC adds 180-220ms TTFT P50 across all major providers, and EU to US-East adds 80-110ms. For latency-bound UX, deploy to the region nearest your largest user base or use regional routing through a gateway. A single global endpoint costs 100-200ms on average across distributed users.
- 05 — P95 inflates 1.6-3.2× over P50 in 2026; most production SLOs need P95-anchored design. P50 is the marketing number; P95 is the reality of streaming UX, where outliers ruin perceived performance. Across our test set, the P95/P50 ratio averaged 2.1×, and the worst pairings hit 3.2×. SLO design and budget planning should anchor on P95, where provider quality differs more than at P50 (cheaper providers tend to have noisier tails).
01 — Methodology: the measurement harness.
10,000 probes per model+provider pairing per region, distributed across 90 days, with a 1,024-token input and 256-token output. Times are measured client-side: from request send to first stream chunk for TTFT, and over the full output phase for TPS. Probes run from real infrastructure in each region, not synthetic monitors, so the numbers include actual round-trip times.
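Concretely, each probe is one timed streaming request. Below is a minimal sketch of the idea, assuming an OpenAI-compatible streaming endpoint; the URL, API key, and the one-chunk-equals-one-token shortcut are illustrative placeholders, not our production harness.

```python
import json
import time

import requests

# Placeholder endpoint and key: any OpenAI-compatible streaming API.
ENDPOINT = "https://api.example-provider.com/v1/chat/completions"
API_KEY = "sk-..."  # placeholder


def probe(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """One probe: client-side TTFT (s) and TPS (tok/s) for one request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }
    sent = time.monotonic()
    ttft, tokens = None, 0
    with requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=body,
        stream=True,
        timeout=120,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip blank keep-alive lines
            if ttft is None:
                ttft = time.monotonic() - sent  # first stream chunk
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                break
            # Rough proxy: one content delta ~ one token. A real harness
            # should tokenize the accumulated output text instead.
            if json.loads(payload)["choices"][0]["delta"].get("content"):
                tokens += 1
    total = time.monotonic() - sent
    # TPS is throughput over the output phase only (total minus TTFT).
    tps = tokens / (total - ttft) if ttft and total > ttft else 0.0
    return {"ttft_s": ttft, "tps": tps}
```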
Time-to-first-token (TTFT) · UX latency
Client-send → first stream chunk · ms
The headline UX metric. Determines how soon the user sees output appear. P50 is the typical case; P95 captures the tail. For chat UX, sub-2-second P95 is the defensible bar; sub-1-second is the premium bar.
Tokens per second (TPS) · Streaming throughput
Output tokens / output duration · tok/s
Throughput once streaming starts. Determines perceived completion time. 50 TPS feels slow; 100 TPS feels normal; 200+ TPS feels instant. Above 300 TPS the bottleneck shifts to the renderer, not the model.
Regional spread · Geo penalty
Δ TTFT P50 across 4 regions · ms
How much latency the user pays for being far from the provider. US-East to EU adds ~80-110ms; to APAC, 180-220ms. Single-region deployments penalize distant users; multi-region routing reduces the penalty but does not eliminate it.
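Taken together, the first two metrics collapse into the one number a user actually feels: time until the full reply has rendered, which is TTFT plus output length divided by TPS. A quick sketch using headline figures from this tracker:

```python
def perceived_completion_s(ttft_s: float, out_tokens: int, tps: float) -> float:
    """Send-to-last-token time: TTFT plus the streaming phase."""
    return ttft_s + out_tokens / tps


# A 256-token reply, using headline figures from this tracker:
claude = perceived_completion_s(0.85, 256, 78)   # ~4.1 s, Claude Opus 4.7
groq   = perceived_completion_s(0.18, 256, 480)  # ~0.7 s, Llama 4 405B on Groq
```

Same output length, roughly a 6× gap in felt completion time; this is why the throughput leaders win chat UX while serving the same open weights.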
02 — TTFT: time-to-first-token P50 / P95 across 30 pairings.
TTFT is the metric chat UX lives or dies on. Below: the P50 and P95 measurements for a representative subset of model+provider pairings. Lower is better.
TTFT P50 and P95 · 12 representative model-provider pairings
Source: Internal probes · 10,000 samples per pairing · April 2026 · 1,024-tok input

"Reasoning mode is not a latency increment — it is a different latency category. Sub-2-second UX simply can't use it."
— Internal latency report, May 2026
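The aggregation behind those chart numbers is small enough to inline. A sketch with the standard library, assuming `samples_ms` holds one pairing's raw client-side TTFT measurements from the §01 harness:

```python
import statistics


def summarize_ttft(samples_ms: list[float]) -> dict:
    """Collapse raw TTFT samples into P50, P95, and the tail ratio."""
    # quantiles(n=20) yields 19 cut points at 5% steps:
    # index 9 is the 50th percentile, index 18 the 95th.
    cuts = statistics.quantiles(samples_ms, n=20)
    p50, p95 = cuts[9], cuts[18]
    return {"ttft_p50_ms": p50, "ttft_p95_ms": p95, "tail_ratio": p95 / p50}
```

The `tail_ratio` field is the P95/P50 multiplier from takeaway 05; tracking it per pairing is the cheapest way to catch providers with noisy tails before users do.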
03 — Throughput: the tokens-per-second leaderboard.
Once streaming starts, throughput governs perceived completion time. The leaderboard below shows the headline TPS for each model+provider pairing under standard load.
Tokens-per-second · 12 model-provider pairings
Source: Internal probes · 10,000 samples per pairing · April 2026

04 — Regional Spread: the geo penalty.
All providers route to specific regions. A user in Tokyo hitting an OpenAI US-East endpoint pays an extra 180-220ms TTFT versus a US-East user. Below: TTFT P50 deltas for GPT-5.5 standard across the four major regions, expressed as the additional latency over the home region.
Home region (US-East) · reference baseline
OpenAI's primary region. TTFT P50 for GPT-5.5 standard: 1.12s. All other regions are measured as an additional latency over this baseline. Most US enterprise users land here.
Cross-coast penalty (US-West) · +3% over baseline
TTFT P50 1.16s (+38ms over baseline). Crossing the continental US adds modest latency. Some providers run their primary region in US-West (Anthropic), which reverses this penalty.
Atlantic crossing (EU) · +9% over baseline
TTFT P50 1.22s (+98ms over baseline). Atlantic latency is irreducible. EU users gain meaningfully from EU-resident providers (Mistral La Plateforme, AWS Bedrock EU, Azure OpenAI Europe).
Pacific penalty (APAC) · +18% over baseline
TTFT P50 1.33s (+207ms over baseline). The Pacific crossing is the worst case for US-anchored providers. APAC users gain from Together/Fireworks regional inference if open-weight models are acceptable.
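Operationally, the fix is a gateway routing table rather than a single global endpoint. A sketch; the endpoint URLs are placeholders, and the penalty table restates this section's GPT-5.5 deltas as what a lone US-East endpoint costs each user region:

```python
# Placeholder regional endpoints behind one gateway.
REGIONAL_ENDPOINTS = {
    "us-east": "https://us-east.gw.example.com/v1",
    "us-west": "https://us-west.gw.example.com/v1",
    "eu-central": "https://eu-central.gw.example.com/v1",
    "apac-tokyo": "https://apac-tokyo.gw.example.com/v1",
}

# Extra TTFT P50 (ms) each user region pays against a single US-East
# endpoint, i.e., roughly what regional routing buys back.
SINGLE_ENDPOINT_PENALTY_MS = {
    "us-east": 0,
    "us-west": 38,
    "eu-central": 98,
    "apac-tokyo": 207,
}


def route(user_region: str) -> str:
    """Send each request to the endpoint nearest the user."""
    return REGIONAL_ENDPOINTS.get(user_region, REGIONAL_ENDPOINTS["us-east"])
```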
05 — Reasoning Tax: the reasoning-mode latency tax.
Reasoning modes (extended thinking, Deep Think, reasoning_effort) inflate TTFT 5-30×. The compute is real and visible in the client-side latency. Below: TTFT P50 across reasoning tiers for three frontier models.
Reasoning-mode latency tax · TTFT P50 by tier
Source: Internal probes · April 2026 · TTFT P50 across reasoning tiers
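The practical consequence: gate the reasoning tier by the request's latency budget instead of configuring it per model. A sketch; the tier names and TTFT figures restate this section's P50s, and the actual knob (reasoning_effort, an extended-thinking budget) varies by provider:

```python
# Representative TTFT P50 per reasoning tier (seconds), lowest tier first.
TIER_TTFT_P50_S = {
    "standard": 1.1,   # GPT-5.5 standard, no extended thinking
    "medium": 8.0,     # light extended thinking, top of the 2-8 s band
    "high": 67.0,      # GPT-5.5 Pro at high reasoning_effort
}


def pick_reasoning_tier(latency_budget_s: float) -> str:
    """Highest reasoning tier whose TTFT fits the latency budget."""
    viable = [tier for tier, ttft in TIER_TTFT_P50_S.items()
              if ttft <= latency_budget_s]
    # Dict insertion order runs lowest to highest, so take the last fit.
    return viable[-1] if viable else "standard"


assert pick_reasoning_tier(2.0) == "standard"       # chat: no thinking
assert pick_reasoning_tier(30.0) == "medium"        # background agent
assert pick_reasoning_tier(float("inf")) == "high"  # async / batch
```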
06 — UX Classes: latency by UX class.

Latency budgets are a function of UX class. Below: six common classes, their budgets, and the model+provider pairings that meet them.
Real-time voice (sub-300ms TTFT)
Only Groq/Cerebras meet the budget. Llama 4 70B on Cerebras at 0.16s P50 with 520 TPS is the canonical voice stack. Frontier closed-source is not viable for real-time voice without latency-shaping infra.
Pick: Cerebras Llama 4 70B

Autocomplete / inline IDE (sub-500ms TTFT)
Groq, Cerebras, and Fireworks open-weight pairings fit; GPT-5.5 Mini at 0.61s is borderline. Latency-tier Gemini 3 Flash at 0.42s is also viable. For chat-class quality on autocomplete budgets, Mini-class is the floor.
Pick: Mini · Flash · open-weight

Chat / interactive UX (sub-2s TTFT)
Most frontier closed-source models fit at standard reasoning: GPT-5.5, Claude Opus 4.7, and Gemini 3 Pro all sit between 0.85s and 1.4s P50. P95 is the constraint, at 1.6-2.4s; some providers fail here.
Pick: Frontier standard reasoning

Background agentic (10-30s budget)
Reasoning modes become viable: Claude Opus 4.7 with light extended thinking (2-8s) or GPT-5.5 Pro at low-to-medium reasoning effort. Right for agents the user submits and briefly waits on.
Pick: Reasoning · medium tier

Async / batch (no latency budget)
The highest reasoning tier is viable: GPT-5.5 Pro high (67s), Opus 4.7 max thinking (28s), Gemini 3 Deep Think high (52s). Pair with the batch tier for a 50% input discount when async.
Pick: High reasoning + batch

IDE inline-suggest (sub-200ms ideal)
Groq/Cerebras open-weight is the only category that consistently meets the bar. For deeper completions where 500ms is acceptable, GPT-5.5 Mini or Codex Mini can fit.
Pick: Cerebras · Groq

07 — Provider Selection: the decision tree.
The provider-selection logic for latency-bound deployments, distilled to a starting policy. Use as default; measure against your specific traffic to refine.
Pin the UX class · Budget first
Real-time voice → IDE → Chat → Agentic → Async
Determine the budget. Sub-300ms: only Groq/Cerebras open-weight. Sub-2-second: frontier standard reasoning works. Above 10s: anything works, including extended thinking.
Pick model tier · Model fit
Frontier closed-source · Frontier open-weight · Latency-tier
Within the budget, pick the model by capability needs. Don't try to fit Claude Opus 4.7 extended thinking into a chat budget; pick GPT-5.5 standard or Sonnet 4.6 instead.
Provider for region · Region fit
US-East · EU-Central · APAC-Tokyo
Pick the provider region nearest your largest user concentration. For globally distributed users, regional routing through a gateway is worth 80-200ms on average.
Validate P95 · Tail latency
P50 → P95 multiplier · 1.6-3.2×
P50 alone hides outliers. Anchor SLO design on P95. The cheapest providers tend to have noisier tails; the throughput leaders (Groq/Cerebras) have remarkably tight P95.
"Most teams design SLOs against P50, ship to production, and discover the P95 outliers ruin perceived UX. Anchor on P95 from day one."
— Internal SLO review, May 2026
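As code, the four steps collapse into a short starting-policy function. A sketch only: the candidate rows and P95 multipliers restate this tracker's measurements, the table is illustrative rather than exhaustive, and candidates should be listed in your capability-preference order per step 2.

```python
# Step 1: UX class → TTFT budget (seconds), judged at P95.
UX_BUDGET_TTFT_S = {
    "realtime_voice": 0.3,
    "ide_inline": 0.2,
    "autocomplete": 0.5,
    "chat": 2.0,
    "background_agent": 30.0,
    "batch": float("inf"),
}

# Step 2: candidate pairings in capability-preference order.
# (model, provider, region, TTFT P50 s, measured P95/P50 multiplier)
CANDIDATES = [
    ("Llama 4 70B", "Cerebras", "us-east", 0.16, 1.6),
    ("Claude Opus 4.7", "Anthropic", "us-west", 0.85, 2.1),
    ("GPT-5.5 standard", "OpenAI", "us-east", 1.1, 2.1),
    ("GPT-5.5 Pro high", "OpenAI", "us-east", 67.0, 2.1),
]


def select_pairing(ux_class: str, user_region: str):
    """First candidate whose estimated P95 TTFT fits the class budget."""
    budget = UX_BUDGET_TTFT_S[ux_class]
    for model, provider, region, ttft_p50, p95_mult in CANDIDATES:
        if region != user_region:          # step 3: region fit
            continue
        if ttft_p50 * p95_mult <= budget:  # step 4: anchor on P95
            return model, provider
    return None  # nothing fits: widen the budget or add a region


print(select_pairing("chat", "us-east"))  # ('Llama 4 70B', 'Cerebras')
```

Swap in your own measured P50s and multipliers per pairing and region; the structure, not the numbers, is the policy.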
08 — Conclusion: latency is UX-class-specific.
Pin the UX class. Pick the model+provider. Anchor on P95. Re-measure quarterly.
AI latency is a moving target. Provider-side improvements (Groq/Cerebras throughput climbs, Anthropic latency tier additions, Google Flash variants) ship every few weeks; the data in this tracker will be partly stale within a quarter. Build quarterly re-measurement into your SLO process.
The framing that lasts: latency budget is a UX-class function, not a model property. Pin the class first (real-time voice, chat, background agent, batch); pick the model+provider that fits; then measure P95 against your traffic shape. Premature optimization for raw TPS without UX-class fit is wasted spend.
We re-publish this tracker every quarter. Bookmark this page; subscribe to the newsletter for the change log.