
30 model-provider pairings · 4 regions · P50 + P95 + TPS · UX-class decision matrix

AI Model Latency Benchmarks · 2026

Time-to-first-token and tokens-per-second across 30 model+provider pairings measured at P50 and P95 in four regions. Groq Llama 4 hits 480 TPS; Cerebras Qwen 3 hits 525 TPS; reasoning mode inflates TTFT 5-30×. The quarterly canonical reference for UX-latency planning.

Digital Applied Team · Senior strategists
Published Apr 23, 2026 · 3 min read
Sources: Internal probes · Artificial Analysis · provider status
Fastest TPS
525
Cerebras · Qwen 3 235B
+475 vs Bedrock baseline
Fastest TTFT P50 · flagship class
0.18s
Groq · Llama 4 405B
Reasoning mode tax
30×
TTFT P50 inflation · GPT-5.5 Pro
5-30× across frontier
Pairings tested
30
models × providers

Latency is the single most-cited LLM-ops metric and the one that shifts fastest. We probed 30 model+provider pairings weekly for 90 days; the tables below are the Q2 2026 canonical reference. We report time-to-first-token (TTFT) for chat UX, tokens per second (TPS) for streaming output, and P50 plus P95 to capture tail-latency reality.

Headlines: Groq runs Llama 4 405B at 480 tokens/sec with 0.18-second TTFT P50. Cerebras runs Qwen 3 235B at 525 tokens/sec. Claude Opus 4.7 standard sits at 78 TPS / 0.85s TTFT P50. GPT-5.5 standard at 92 TPS / 1.1s TTFT. Reasoning mode is the latency landmine — adding extended thinking to GPT-5.5 Pro pushes TTFT P50 to 67 seconds. Pick by UX class, not by capability ceiling.

The decision matrix in §06 maps six UX classes (chat, autocomplete, agentic background, batch, real-time voice, IDE) to the right model+provider pairing; §07 distills the provider-selection decision tree. Use them as a starting policy and measure against your specific traffic shape.

Key takeaways
  1. Throughput leaders are alt-architecture providers: Groq and Cerebras hit 480-525 TPS, 4-6× the generalist providers. Groq's LPU and Cerebras's wafer-scale silicon are not faster models; they are faster providers serving the same Llama 4 / Qwen 3 weights. The premium pays for throughput, not capability. For chat UX, where TTFT and TPS dominate perceived performance, alt-architecture providers are the right default at the open-weight tier.
  2. Frontier closed-source models cluster at 70-110 TPS with 0.7-1.4s TTFT. GPT-5.5 standard, Claude Opus 4.7, and Gemini 3 Pro all sit in this band; none has traded capability for raw throughput. For sub-2-second UX this is the steady state; for sub-1-second, only standard reasoning works (extended thinking adds 5-30s).
  3. The reasoning-mode latency tax is 5-30× on TTFT; chat UX cannot afford it. Extended thinking inflates TTFT 5-30× across the frontier: GPT-5.5 Pro at high reasoning_effort hits 67s P50, Claude Opus 4.7 with extended thinking 28s, and Gemini 3 Pro Deep Think high 52s. For interactive UX with sub-2-second budgets, reasoning mode is unusable; for batch and async work, it is irrelevant.
  4. Regional spread is 30-200ms of TTFT across the four major regions; pick by user concentration. US-East to APAC adds 180-220ms TTFT P50 across all major providers; EU to US-East adds 80-110ms. For latency-bound UX, deploy to the region nearest your largest user base, or use regional routing through a gateway. A single global endpoint costs 100-200ms on average across distributed users.
  5. P95 inflates 1.6-3.2× over P50; most production SLOs need P95-anchored design. P50 is the marketing number; P95 is the reality of streaming UX, where outliers ruin perceived performance. Across our probes the P95/P50 ratio averaged 2.1×, and the worst pairings hit 3.2×. Anchor SLO design and budget planning on P95: provider quality differs more at P95 than at P50, and cheaper providers tend to have noisier tails.

01 · Methodology · The measurement harness.

10,000 probes per model+provider pairing per region, distributed across 90 days, with a 1,024-token input and 256-token output. Times are measured client-side: request send to first stream chunk for TTFT, and over the full output duration for TPS. Probes run from real infrastructure in each region, not synthetic monitors, so the numbers reflect actual round-trip times.
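Mechanically, the harness reduces to a few lines of client-side timing. A minimal sketch, assuming a hypothetical SSE-style streaming endpoint (the URL, payload shape, and whitespace-based token counting are placeholders, not any specific provider's API):

```python
import time

import httpx

# Hypothetical streaming endpoint; not any specific provider's API.
STREAM_URL = "https://api.example-provider.com/v1/stream"

def probe(payload: dict) -> dict:
    """One probe: TTFT (request send -> first stream chunk) and TPS (full output)."""
    t_send = time.perf_counter()
    t_first = None
    approx_tokens = 0
    with httpx.stream("POST", STREAM_URL, json=payload, timeout=120.0) as resp:
        for chunk in resp.iter_text():
            if not chunk:
                continue
            if t_first is None:
                t_first = time.perf_counter()    # first stream chunk: TTFT endpoint
            approx_tokens += len(chunk.split())  # crude proxy; the real harness counts model tokens
    t_done = time.perf_counter()
    if t_first is None:
        raise RuntimeError("stream produced no output")
    return {
        "ttft_s": t_first - t_send,
        "tps": approx_tokens / max(t_done - t_first, 1e-9),
    }
```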

Metric 1
Time-to-first-token (TTFT)
Client-send → first stream chunk · ms

The headline UX metric. Determines how soon the user sees output appear. P50 is the typical case; P95 captures the tail. For chat UX, sub-2-second P95 is the defensible bar; sub-1-second is the premium bar.

UX latency
Metric 2
Tokens per second (TPS)
Output tokens / output duration · tok/s

Throughput once streaming starts. Determines perceived completion time. 50 TPS feels slow; 100 TPS feels normal; 200+ TPS feels instant. Above 300 TPS the bottleneck shifts to the renderer, not the model.

Streaming throughput
Metric 3
Regional spread
Δ TTFT P50 across 4 regions · ms

How much latency a user pays for being far from the provider. US-East to EU adds ~80-110ms; US-East to APAC adds 180-220ms. Single-region deployments penalize distant users; multi-region routing reduces the penalty but does not eliminate it.

Geo penalty

02 · TTFT · Time-to-first-token P50 / P95 across 30 pairings.

TTFT is the metric chat UX lives or dies on. Below: the P50 and P95 measurements for a representative subset of model+provider pairings. Lower is better.

TTFT P50 and P95 · 12 representative model-provider pairings

Source: Internal probes · 10,000 samples per pairing · April 2026 · 1024-tok input
Groq · Llama 4 405B · P50 0.18s / P95 0.34s · Fastest flagship TTFT
Cerebras · Qwen 3 235B · P50 0.21s / P95 0.42s
Cerebras · Llama 4 70B · P50 0.16s / P95 0.31s · Fastest mid-tier
Together · Llama 4 405B · P50 0.42s / P95 0.91s
Fireworks · Llama 4 405B · P50 0.39s / P95 0.84s
GPT-5.5 standard · OpenAI · P50 1.12s / P95 2.41s
GPT-5.5 Mini · OpenAI · P50 0.61s / P95 1.32s
Claude Sonnet 4.6 · Anthropic · P50 0.74s / P95 1.61s
Claude Opus 4.7 · standard · P50 0.85s / P95 1.83s
Gemini 3 Pro · default · P50 0.93s / P95 2.04s
GPT-5.5 Pro · medium reasoning · P50 8.4s / P95 18.7s
Claude Opus 4.7 · extended thinking · P50 28s / P95 67s
"Reasoning mode is not a latency increment — it is a different latency category. Sub-2-second UX simply can't use it."— Internal latency report, May 2026

03 · Throughput · Tokens-per-second leaderboard.

Once streaming starts, throughput governs perceived completion time. The leaderboard below shows the headline TPS for each model+provider pairing under standard load.

Tokens-per-second · 12 model-provider pairings

Source: Internal probes · 10,000 samples per pairing · April 2026
Cerebras · Qwen 3 235B · wafer-scale silicon · 525 TPS · Fastest
Cerebras · Llama 4 70B · wafer-scale, 70B variant · 520 TPS
Groq · Llama 4 405B · LPU, 405B at scale · 480 TPS
Groq · Llama 4 70B · LPU, mid-tier · 485 TPS
GPT-5.5 standard · OpenAI · frontier closed-source default · 92 TPS
GPT-5.5 Mini · OpenAI · latency-optimized tier · 168 TPS
Claude Opus 4.7 standard · Anthropic · frontier coding default · 78 TPS
Claude Sonnet 4.6 · Anthropic · workhorse tier · 104 TPS
Gemini 3 Pro default · Google · multimodal flagship · 84 TPS
Gemini 3 Flash · Google · latency tier, multimodal · 200 TPS
DeepSeek V4 · DeepSeek native · open-weight MoE flagship · 67 TPS
DeepSeek V4 · Together · hosted, same model · 98 TPS
When throughput stops mattering
Above ~300 TPS, the bottleneck shifts to the rendering surface — UI rate-limiting, terminal scroll buffers, voice synthesis pacing. For chat-style streaming UI, perceived performance flattens around 200 TPS regardless of model speed. The Groq/Cerebras throughput edge matters most for non-streaming workloads (batch, agent loops, code generation that fills a buffer before display).
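A back-of-envelope way to see the flattening, folding the article's ~200 TPS render-cap observation into simple arithmetic (the function and cap parameter are illustrative, not a measured renderer model):

```python
def perceived_completion_s(ttft_s: float, output_tokens: int, model_tps: float,
                           render_cap_tps: float = 200.0) -> float:
    """Perceived completion time in streaming chat UI: the renderer caps effective TPS."""
    effective_tps = min(model_tps, render_cap_tps)
    return ttft_s + output_tokens / effective_tps

# 256-token reply: Groq at 480 TPS vs a 200 TPS latency-tier model.
print(perceived_completion_s(0.18, 256, 480))  # ~1.46s (480 TPS capped to 200)
print(perceived_completion_s(0.42, 256, 200))  # ~1.70s
```

Once capped, the 480 TPS and 200 TPS pairings render at the same speed; only TTFT separates them.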

04 · Regional Spread · The geo penalty.

All providers route to specific regions. A user in Tokyo hitting an OpenAI US-East endpoint pays an extra 180-220ms TTFT versus a US-East user. Below: TTFT P50 deltas for GPT-5.5 standard across the four major regions, expressed as the additional latency over the home region.

US-East
0ms
Home region · baseline

OpenAI's primary region. TTFT P50 for GPT-5.5 standard: 1.12s. All other regions are measured as additional latency over this baseline. Most US enterprise users land here.

Reference
US-West
+38ms
Cross-coast penalty

TTFT P50 1.16s (+38ms over baseline). Cross-continental US adds modest latency. Some providers operate primary regions in US-West (Anthropic), reversing this penalty.

+3% over baseline
EU-Central
+98ms
Atlantic crossing

TTFT P50 1.22s (+98ms over baseline). Atlantic latency is irreducible. EU users gain meaningfully from EU-resident providers (Mistral La Plateforme, AWS Bedrock EU, Azure OpenAI Europe).

+9% over baseline
APAC-Tokyo
+207ms
Pacific penalty

TTFT P50 1.33s (+207ms over baseline). Pacific latency is the worst case for US-anchored providers. APAC users gain from Together/Fireworks regional inference if open-weight is acceptable.

+18% over baseline
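To see what a single endpoint costs in aggregate, the deltas above fold into a simple expected-penalty calculation. A sketch, with a hypothetical traffic mix (a production gateway would use live per-pairing probes rather than these static constants):

```python
# TTFT P50 added over the US-East home region (GPT-5.5 standard, April 2026 probes).
GEO_PENALTY_MS = {"us-east": 0, "us-west": 38, "eu-central": 98, "apac-tokyo": 207}

def avg_penalty_single_endpoint(user_mix: dict[str, float]) -> float:
    """Expected extra TTFT (ms) of a single US-East endpoint for a given user-region mix."""
    return sum(GEO_PENALTY_MS[region] * share for region, share in user_mix.items())

# Hypothetical traffic mix; regional routing would recover most of this average.
mix = {"us-east": 0.40, "us-west": 0.20, "eu-central": 0.25, "apac-tokyo": 0.15}
print(f"{avg_penalty_single_endpoint(mix):.0f}ms")  # ~63ms for this mix
```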

05 · Reasoning Tax · Reasoning-mode latency tax.

Reasoning modes (extended thinking, Deep Think, reasoning_effort) inflate TTFT 5-30×. The compute is real and visible in the client-side latency. Below: TTFT P50 across reasoning tiers for three frontier models.

Reasoning-mode latency tax · TTFT P50 by tier

Source: Internal probes · April 2026 · TTFT P50 across reasoning tiers
GPT-5.5 standard · default · OpenAI, no reasoning · 1.12s · Chat-UX viable
GPT-5.5 Pro · low reasoning · Pro tier, low effort · 2.4s
GPT-5.5 Pro · medium reasoning · Pro tier, medium effort · 8.4s
GPT-5.5 Pro · high reasoning · Pro tier, max effort · 67s
Claude Opus 4.7 · default · Anthropic, no thinking · 0.85s · Chat-UX viable
Claude Opus 4.7 · light thinking · low extended-thinking budget · 2.0s
Claude Opus 4.7 · default thinking · medium extended thinking · 7.9s
Claude Opus 4.7 · max thinking · max extended-thinking budget · 28s
Gemini 3 Pro · default · Google, no Deep Think · 0.93s · Chat-UX viable
Gemini 3 Pro DT · medium · default Deep Think · 6.5s
Gemini 3 Pro DT · high · max Deep Think · 52s
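One way to operationalize the tax is a budget gate: given a latency budget, pick the highest reasoning tier that still fits. A sketch using the P50s above as a static lookup (model and tier strings are shorthand for the table rows; a production policy layer would key off live P95s, not published P50s):

```python
# TTFT P50 (s) by reasoning tier, from the April 2026 probes above.
# "none" rows are the standard-model baselines from the table.
TTFT_P50_S = {
    ("gpt-5.5-pro", "none"): 1.12, ("gpt-5.5-pro", "low"): 2.4,
    ("gpt-5.5-pro", "medium"): 8.4, ("gpt-5.5-pro", "high"): 67.0,
    ("claude-opus-4.7", "none"): 0.85, ("claude-opus-4.7", "light"): 2.0,
    ("claude-opus-4.7", "default"): 7.9, ("claude-opus-4.7", "max"): 28.0,
}

def max_affordable_tier(model: str, budget_s: float) -> str | None:
    """Highest reasoning tier whose TTFT P50 fits the budget (None = all tiers over budget)."""
    tiers = [(t, s) for (m, t), s in TTFT_P50_S.items() if m == model and s <= budget_s]
    return max(tiers, key=lambda x: x[1])[0] if tiers else None

print(max_affordable_tier("claude-opus-4.7", 2.0))  # 'light': barely fits a 2s budget
print(max_affordable_tier("gpt-5.5-pro", 30.0))     # 'medium': 'high' (67s) blows a 30s budget
```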

06 · UX Classes · Latency by UX class.

Latency budgets are a function of UX class. Six common classes, their budgets, and the model+provider pairings that meet them.

UX 1
Real-time voice (sub-300ms TTFT)

Only Groq/Cerebras meet the budget. Llama 4 70B on Cerebras at 0.16s P50 with 520 TPS is the canonical voice stack. Frontier closed-source models are not viable for real-time voice without latency-shaping infrastructure.

Cerebras Llama 4 70B
UX 2
Autocomplete / inline IDE (sub-500ms TTFT)

Groq, Cerebras, Fireworks open-weight pairings; GPT-5.5 Mini at 0.61s is borderline. Latency-tier Gemini 3 Flash at 0.42s also viable. For chat-class quality on autocomplete budgets, Mini-class is the floor.

Mini · Flash · open-weight
UX 3
Chat / interactive UX (sub-2s TTFT)

Most frontier closed-source models fit at standard reasoning. GPT-5.5, Claude Opus 4.7, and Gemini 3 Pro all sit in the 0.85-1.4s P50 band. P95 is the real constraint at 1.6-2.4s; some providers fail here.

Frontier standard reasoning
UX 4
Background agentic (10-30s budget)

Reasoning modes become viable. Claude Opus 4.7 with light extended thinking (2-8s) or GPT-5.5 Pro at low-to-medium reasoning. Right for agents where the user submits a task and waits briefly for the result.

Reasoning · medium tier
UX 5
Async / batch (no latency budget)

Highest reasoning tier viable. GPT-5.5 Pro high (67s), Opus 4.7 max thinking (28s), Gemini 3 DT high (52s). Pair with batch tier for 50% input discount when async.

High reasoning + batch
UX 6
IDE inline-suggest (sub-200ms ideal)

Groq/Cerebras open-weight is the only category that meets the bar consistently. For deeper completions where 500ms is acceptable, GPT-5.5 Mini or Codex Mini can fit.

Cerebras · Groq
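As a starting policy, the matrix condenses to a small lookup. A sketch with the budgets and defaults from the six cards above (pairing strings are shorthand, and the agentic budget is the upper end of the article's 10-30s range, not a hard limit):

```python
# Starting policy: UX class -> TTFT budget + default pairing, per the matrix above.
UX_POLICY = {
    "realtime-voice":   {"ttft_budget_s": 0.30, "default": "cerebras/llama-4-70b"},
    "ide-inline":       {"ttft_budget_s": 0.20, "default": "groq-or-cerebras open-weight"},
    "autocomplete":     {"ttft_budget_s": 0.50, "default": "gemini-3-flash or gpt-5.5-mini"},
    "chat":             {"ttft_budget_s": 2.00, "default": "frontier, standard reasoning"},
    "background-agent": {"ttft_budget_s": 30.0, "default": "opus-4.7 light thinking"},
    "async-batch":      {"ttft_budget_s": None, "default": "highest reasoning + batch tier"},
}

def pairing_for(ux_class: str) -> str:
    """Look up the default pairing; measure against your own traffic before committing."""
    return UX_POLICY[ux_class]["default"]
```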

07 · Provider Selection · The decision tree.

The provider-selection logic for latency-bound deployments, distilled to a starting policy. Use it as the default, then measure against your specific traffic to refine.

Step 1
Pin the UX class
Real-time voice → IDE → Chat → Agentic → Async

Determine the budget first. Sub-300ms: only Groq/Cerebras open-weight. Sub-2-second: frontier models at standard reasoning. Above 10 seconds, anything works, including extended thinking.

Budget first
Step 2
Pick model tier
Frontier closed-source · Frontier open-weight · Latency-tier

Within the budget, pick model by capability needs. Don't try to fit Claude Opus 4.7 extended thinking into a chat budget — pick GPT-5.5 standard or Sonnet 4.6 instead.

Model fit
Step 3
Provider for region
US-East · EU-Central · APAC-Tokyo

Pick the provider region nearest your largest user concentration. For globally distributed users, regional routing through a gateway is worth 80-200ms on average.

Region fit
Step 4
Validate P95
P50 → P95 multiplier · 1.6-3.2×

P50 alone hides outliers. Anchor SLO design on P95. The cheapest providers tend to have noisier tails; the throughput leaders (Groq/Cerebras) have remarkably tight P95.

Tail latency
"Most teams design SLOs against P50, ship to production, and discover the P95 outliers ruin perceived UX. Anchor on P95 from day one."— Internal SLO review, May 2026

08 · Conclusion · Latency is UX-class-specific.

Latency landscape · April 2026

Pin the UX class. Pick the model+provider. Anchor on P95. Re-measure quarterly.

AI latency is a moving target. Provider-side improvements (Groq/Cerebras throughput climbs, Anthropic latency tier additions, Google Flash variants) ship every few weeks; the data in this tracker will be partly stale within a quarter. Build quarterly re-measurement into your SLO process.

The framing that lasts: latency budget is a UX-class function, not a model property. Pin the class first (real-time voice, chat, background agent, batch); pick the model+provider that fits; then measure P95 against your traffic shape. Premature optimization for raw TPS without UX-class fit is wasted spend.

We re-publish this tracker every quarter. Bookmark this page; subscribe to the newsletter for the change log.

Latency that matches your UX

Stop chasing TPS in isolation. Build for UX-class fit.

We design latency-aware AI deployments for engineering and product teams shipping production at scale — covering UX-class budgeting, provider routing, P95 SLO design, and quarterly re-measurement cadence.

Free consultation · Expert guidance · Tailored solutions
What we work on

Latency engineering engagements

  • UX-class budgeting and SLO design
  • Provider routing — Groq / Cerebras / frontier
  • P95-anchored monitoring and alerting
  • Reasoning-tier policy with latency budgets
  • Quarterly re-measurement cadence
FAQ · AI latency benchmarks 2026

The questions we get every week.

What's the difference between TTFT and TPS, and which matters more?

TTFT (time-to-first-token) is the latency from request submission to the first stream chunk arriving; it governs perceived responsiveness, i.e. how soon the user sees anything at all. TPS (tokens per second) is the streaming throughput once output starts; it governs perceived completion time. For chat UX both matter, but TTFT dominates user impression: a 0.4s TTFT with 100 TPS feels faster than a 1.2s TTFT with 200 TPS, even though the 200 TPS pairing finishes sooner. Anchor design on TTFT for the first impression and on TPS for the completion experience.
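Worked arithmetic for a 256-token reply: 0.4s + 256/100 TPS ≈ 3.0s total, versus 1.2s + 256/200 TPS ≈ 2.5s. The second pairing finishes roughly half a second sooner, but the first paints output three times earlier, and that first paint dominates perceived speed.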