
30 model-provider pairings · 4 regions · P50 + P95 + TPS · UX-class decision matrix

AI Model Latency Benchmarks · 2026

Time-to-first-token and tokens-per-second across 30 model+provider pairings measured at P50 and P95 in four regions. Groq Llama 4 hits 480 TPS; Cerebras Qwen 3 hits 525 TPS; reasoning mode inflates TTFT 5-30×. The quarterly canonical reference for UX-latency planning.

Digital Applied Team · Senior strategists
Published Apr 23, 2026 · 3 min read
Sources: Internal probes · Artificial Analysis · provider status
Fastest TPS
525
Cerebras · Qwen 3 235B
+475 vs Bedrock baseline
Fastest TTFT P50 · flagship class
0.18s
Groq · Llama 4 405B
Reasoning mode tax
30×
TTFT P50 inflation · GPT-5.5 Pro
5-30× across frontier
Pairings tested
30
models × providers

Latency is the single most-cited LLM-ops metric and the one that shifts fastest. We probed 30 model+provider pairings weekly for 90 days; the tables below are the Q2 2026 canonical reference. We report time-to-first-token (TTFT) for chat UX, tokens per second (TPS) for streaming output, and P50 plus P95 to capture tail-latency reality.

Headlines: Groq runs Llama 4 405B at 480 tokens/sec with 0.18-second TTFT P50. Cerebras runs Qwen 3 235B at 525 tokens/sec. Claude Opus 4.7 standard sits at 78 TPS / 0.85s TTFT P50. GPT-5.5 standard at 92 TPS / 1.1s TTFT. Reasoning mode is the latency landmine — adding extended thinking to GPT-5.5 Pro pushes TTFT P50 to 67 seconds. Pick by UX class, not by capability ceiling.

The decision matrix in §06 maps six UX classes (chat, autocomplete, agentic background, batch, real-time voice, IDE) to the right model+provider pairing; §07 distills the provider-selection decision tree. Use them as a starting policy and measure against your specific traffic shape.

Key takeaways
  1. Throughput leaders are alt-architecture providers: Groq and Cerebras hit 480-525 TPS, 4-6× the generalist providers. Groq's LPU and Cerebras's wafer-scale silicon are not faster models; they are faster providers serving the same Llama 4 / Qwen 3 weights. The premium pays for throughput, not capability. For chat UX, where TTFT and TPS dominate perceived performance, alt-architecture providers are the right default at the open-weight tier.
  2. Frontier closed-source models cluster at 70-110 TPS with 0.7-1.4s TTFT. GPT-5.5 standard, Claude Opus 4.7, and Gemini 3 Pro all sit in this band; none has traded capability for raw throughput. For sub-2-second UX this is the steady state; for sub-1-second, only standard reasoning works (extended thinking adds 5-30s).
  3. The reasoning-mode latency tax is 5-30× on TTFT; chat UX cannot afford it. Extended thinking inflates TTFT 5-30× across the frontier: GPT-5.5 Pro at high reasoning_effort hits 67s P50, Claude Opus 4.7 with extended thinking 28s, and Gemini 3 Pro Deep Think high 52s. For interactive UX with sub-2-second budgets, reasoning mode is unusable; for batch and async work, it is irrelevant.
  4. Regional spread is 30-200ms of TTFT across the four major regions; pick by user concentration. US-East to APAC adds 180-220ms TTFT P50 across all major providers; EU to US-East adds 80-110ms. For latency-bound UX, deploy to the region nearest your largest user base, or use regional routing through a gateway. A single global endpoint costs 100-200ms on average across distributed users.
  5. P95 inflates 1.6-3.2× over P50; most production SLOs need P95-anchored design. P50 is the marketing number; P95 is the reality of streaming UX, where outliers ruin perceived performance. Across our probes the P95/P50 ratio averaged 2.1×, and the worst pairings hit 3.2×. Anchor SLO design and budget planning on P95: provider quality differs more at P95 than at P50, and cheaper providers tend to have noisier tails.

01 · Methodology · The measurement harness.

10,000 probes per model+provider pairing per region, distributed across 90 days, with a 1,024-token input and 256-token output. Times are measured client-side: request send to first stream chunk for TTFT, and over the full output duration for TPS. Probes run from real infrastructure in each region, not synthetic monitors, so the numbers reflect actual round-trip times.
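Mechanically, the harness reduces to a few lines of client-side timing. A minimal sketch, assuming a hypothetical SSE-style streaming endpoint (the URL, payload shape, and whitespace-based token counting are placeholders, not any specific provider's API):

```python
import time

import httpx

# Hypothetical streaming endpoint; not any specific provider's API.
STREAM_URL = "https://api.example-provider.com/v1/stream"

def probe(payload: dict) -> dict:
    """One probe: TTFT (request send -> first stream chunk) and TPS (full output)."""
    t_send = time.perf_counter()
    t_first = None
    approx_tokens = 0
    with httpx.stream("POST", STREAM_URL, json=payload, timeout=120.0) as resp:
        for chunk in resp.iter_text():
            if not chunk:
                continue
            if t_first is None:
                t_first = time.perf_counter()    # first stream chunk: TTFT endpoint
            approx_tokens += len(chunk.split())  # crude proxy; the real harness counts model tokens
    t_done = time.perf_counter()
    if t_first is None:
        raise RuntimeError("stream produced no output")
    return {
        "ttft_s": t_first - t_send,
        "tps": approx_tokens / max(t_done - t_first, 1e-9),
    }
```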

Metric 1
Time-to-first-token (TTFT)
Client-send → first stream chunk · ms

The headline UX metric. Determines how soon the user sees output appear. P50 is the typical case; P95 captures the tail. For chat UX, sub-2-second P95 is the defensible bar; sub-1-second is the premium bar.

UX latency
Metric 2
Tokens per second (TPS)
Output tokens / output duration · tok/s

Throughput once streaming starts. Determines perceived completion time. 50 TPS feels slow; 100 TPS feels normal; 200+ TPS feels instant. Above 300 TPS the bottleneck shifts to the renderer, not the model.

Streaming throughput
Metric 3
Regional spread
Δ TTFT P50 across 4 regions · ms

How much latency a user pays for being far from the provider. US-East to EU adds ~80-110ms; US-East to APAC adds 180-220ms. Single-region deployments penalize distant users; multi-region routing reduces the penalty but does not eliminate it.

Geo penalty

02 · TTFT · Time-to-first-token P50 / P95 across 30 pairings.

TTFT is the metric chat UX lives or dies on. Below: the P50 and P95 measurements for a representative subset of model+provider pairings. Lower is better.

TTFT P50 and P95 · 12 representative model-provider pairings

Source: Internal probes · 10,000 samples per pairing · April 2026 · 1024-tok input
Groq · Llama 4 405B · P50 0.18s / P95 0.34s · Fastest flagship TTFT
Cerebras · Qwen 3 235B · P50 0.21s / P95 0.42s
Cerebras · Llama 4 70B · P50 0.16s / P95 0.31s · Fastest mid-tier
Together · Llama 4 405B · P50 0.42s / P95 0.91s
Fireworks · Llama 4 405B · P50 0.39s / P95 0.84s
GPT-5.5 standard · OpenAI · P50 1.12s / P95 2.41s
GPT-5.5 Mini · OpenAI · P50 0.61s / P95 1.32s
Claude Sonnet 4.6 · Anthropic · P50 0.74s / P95 1.61s
Claude Opus 4.7 · standard · P50 0.85s / P95 1.83s
Gemini 3 Pro · default · P50 0.93s / P95 2.04s
GPT-5.5 Pro · medium reasoning · P50 8.4s / P95 18.7s
Claude Opus 4.7 · extended thinking · P50 28s / P95 67s
"Reasoning mode is not a latency increment — it is a different latency category. Sub-2-second UX simply can't use it."— Internal latency report, May 2026

03 · Throughput · Tokens-per-second leaderboard.

Once streaming starts, throughput governs perceived completion time. The leaderboard below shows the headline TPS for each model+provider pairing under standard load.

Tokens-per-second · 12 model-provider pairings

Source: Internal probes · 10,000 samples per pairing · April 2026
Cerebras · Qwen 3 235B · wafer-scale silicon · 525 TPS · Fastest
Cerebras · Llama 4 70B · wafer-scale, 70B variant · 520 TPS
Groq · Llama 4 405B · LPU, 405B at scale · 480 TPS
Groq · Llama 4 70B · LPU, mid-tier · 485 TPS
GPT-5.5 standard · OpenAI · frontier closed-source default · 92 TPS
GPT-5.5 Mini · OpenAI · latency-optimized tier · 168 TPS
Claude Opus 4.7 standard · Anthropic · frontier coding default · 78 TPS
Claude Sonnet 4.6 · Anthropic · workhorse tier · 104 TPS
Gemini 3 Pro default · Google · multimodal flagship · 84 TPS
Gemini 3 Flash · Google · latency tier, multimodal · 200 TPS
DeepSeek V4 · DeepSeek native · open-weight MoE flagship · 67 TPS
DeepSeek V4 · Together · hosted, same model · 98 TPS
When throughput stops mattering
Above ~300 TPS, the bottleneck shifts to the rendering surface — UI rate-limiting, terminal scroll buffers, voice synthesis pacing. For chat-style streaming UI, perceived performance flattens around 200 TPS regardless of model speed. The Groq/Cerebras throughput edge matters most for non-streaming workloads (batch, agent loops, code generation that fills a buffer before display).
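A back-of-envelope way to see the flattening, folding the article's ~200 TPS render-cap observation into simple arithmetic (the function and cap parameter are illustrative, not a measured renderer model):

```python
def perceived_completion_s(ttft_s: float, output_tokens: int, model_tps: float,
                           render_cap_tps: float = 200.0) -> float:
    """Perceived completion time in streaming chat UI: the renderer caps effective TPS."""
    effective_tps = min(model_tps, render_cap_tps)
    return ttft_s + output_tokens / effective_tps

# 256-token reply: Groq at 480 TPS vs a 200 TPS latency-tier model.
print(perceived_completion_s(0.18, 256, 480))  # ~1.46s (480 TPS capped to 200)
print(perceived_completion_s(0.42, 256, 200))  # ~1.70s
```

Once capped, the 480 TPS and 200 TPS pairings render at the same speed; only TTFT separates them.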

04 · Regional Spread · The geo penalty.

All providers route to specific regions. A user in Tokyo hitting an OpenAI US-East endpoint pays an extra 180-220ms TTFT versus a US-East user. Below: TTFT P50 deltas for GPT-5.5 standard across the four major regions, expressed as the additional latency over the home region.

US-East
0ms
Home region · baseline

OpenAI's primary region. TTFT P50 for GPT-5.5 standard: 1.12s. All other regions are measured as additional latency over this baseline. Most US enterprise users land here.

Reference
US-West
+38ms
Cross-coast penalty

TTFT P50 1.16s (+38ms over baseline). Cross-continental US adds modest latency. Some providers operate primary regions in US-West (Anthropic), reversing this penalty.

+3% over baseline
EU-Central
+98ms
Atlantic crossing

TTFT P50 1.22s (+98ms over baseline). Atlantic latency is irreducible. EU users gain meaningfully from EU-resident providers (Mistral La Plateforme, AWS Bedrock EU, Azure OpenAI Europe).

+9% over baseline
APAC-Tokyo
+207ms
Pacific penalty

TTFT P50 1.33s (+207ms over baseline). Pacific latency is the worst case for US-anchored providers. APAC users gain from Together/Fireworks regional inference if open-weight is acceptable.

+18% over baseline
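To see what a single endpoint costs in aggregate, the deltas above fold into a simple expected-penalty calculation. A sketch, with a hypothetical traffic mix (a production gateway would use live per-pairing probes rather than these static constants):

```python
# TTFT P50 added over the US-East home region (GPT-5.5 standard, April 2026 probes).
GEO_PENALTY_MS = {"us-east": 0, "us-west": 38, "eu-central": 98, "apac-tokyo": 207}

def avg_penalty_single_endpoint(user_mix: dict[str, float]) -> float:
    """Expected extra TTFT (ms) of a single US-East endpoint for a given user-region mix."""
    return sum(GEO_PENALTY_MS[region] * share for region, share in user_mix.items())

# Hypothetical traffic mix; regional routing would recover most of this average.
mix = {"us-east": 0.40, "us-west": 0.20, "eu-central": 0.25, "apac-tokyo": 0.15}
print(f"{avg_penalty_single_endpoint(mix):.0f}ms")  # ~63ms for this mix
```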

05 · Reasoning Tax · Reasoning-mode latency tax.

Reasoning modes (extended thinking, Deep Think, reasoning_effort) inflate TTFT 5-30×. The compute is real and visible in the client-side latency. Below: TTFT P50 across reasoning tiers for three frontier models.

Reasoning-mode latency tax · TTFT P50 by tier

Source: Internal probes · April 2026 · TTFT P50 across reasoning tiers
GPT-5.5 standard · default · OpenAI, no reasoning · 1.12s · Chat-UX viable
GPT-5.5 Pro · low reasoning · Pro tier, low effort · 2.4s
GPT-5.5 Pro · medium reasoning · Pro tier, medium effort · 8.4s
GPT-5.5 Pro · high reasoning · Pro tier, max effort · 67s
Claude Opus 4.7 · default · Anthropic, no thinking · 0.85s · Chat-UX viable
Claude Opus 4.7 · light thinking · low extended-thinking budget · 2.0s
Claude Opus 4.7 · default thinking · medium extended thinking · 7.9s
Claude Opus 4.7 · max thinking · max extended-thinking budget · 28s
Gemini 3 Pro · default · Google, no Deep Think · 0.93s · Chat-UX viable
Gemini 3 Pro DT · medium · default Deep Think · 6.5s
Gemini 3 Pro DT · high · max Deep Think · 52s
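One way to operationalize the tax is a budget gate: given a latency budget, pick the highest reasoning tier that still fits. A sketch using the P50s above as a static lookup (model and tier strings are shorthand for the table rows; a production policy layer would key off live P95s, not published P50s):

```python
# TTFT P50 (s) by reasoning tier, from the April 2026 probes above.
# "none" rows are the standard-model baselines from the table.
TTFT_P50_S = {
    ("gpt-5.5-pro", "none"): 1.12, ("gpt-5.5-pro", "low"): 2.4,
    ("gpt-5.5-pro", "medium"): 8.4, ("gpt-5.5-pro", "high"): 67.0,
    ("claude-opus-4.7", "none"): 0.85, ("claude-opus-4.7", "light"): 2.0,
    ("claude-opus-4.7", "default"): 7.9, ("claude-opus-4.7", "max"): 28.0,
}

def max_affordable_tier(model: str, budget_s: float) -> str | None:
    """Highest reasoning tier whose TTFT P50 fits the budget (None = all tiers over budget)."""
    tiers = [(t, s) for (m, t), s in TTFT_P50_S.items() if m == model and s <= budget_s]
    return max(tiers, key=lambda x: x[1])[0] if tiers else None

print(max_affordable_tier("claude-opus-4.7", 2.0))  # 'light': barely fits a 2s budget
print(max_affordable_tier("gpt-5.5-pro", 30.0))     # 'medium': 'high' (67s) blows a 30s budget
```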

06 · UX Classes · Latency by UX class.

Latency budgets are a function of UX class. Six common classes, their budgets, and the model+provider pairings that meet them.

UX 1
Real-time voice (sub-300ms TTFT)

Only Groq/Cerebras meet the budget. Llama 4 70B on Cerebras at 0.16s P50 with 520 TPS is the canonical voice stack. Frontier closed-source models are not viable for real-time voice without latency-shaping infrastructure.

Cerebras Llama 4 70B
UX 2
Autocomplete / inline IDE (sub-500ms TTFT)

Groq, Cerebras, Fireworks open-weight pairings; GPT-5.5 Mini at 0.61s is borderline. Latency-tier Gemini 3 Flash at 0.42s also viable. For chat-class quality on autocomplete budgets, Mini-class is the floor.

Mini · Flash · open-weight
UX 3
Chat / interactive UX (sub-2s TTFT)

Most frontier closed-source models fit at standard reasoning. GPT-5.5, Claude Opus 4.7, and Gemini 3 Pro all sit in the 0.85-1.4s P50 band. P95 is the real constraint at 1.6-2.4s; some providers fail here.

Frontier standard reasoning
UX 4
Background agentic (10-30s budget)

Reasoning modes become viable. Claude Opus 4.7 with light extended thinking (2-8s) or GPT-5.5 Pro at low-to-medium reasoning. Right for agents where the user submits a task and waits briefly for the result.

Reasoning · medium tier
UX 5
Async / batch (no latency budget)

Highest reasoning tier viable. GPT-5.5 Pro high (67s), Opus 4.7 max thinking (28s), Gemini 3 DT high (52s). Pair with batch tier for 50% input discount when async.

High reasoning + batch
UX 6
IDE inline-suggest (sub-200ms ideal)

Groq/Cerebras open-weight is the only category that meets the bar consistently. For deeper completions where 500ms is acceptable, GPT-5.5 Mini or Codex Mini can fit.

Cerebras · Groq
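As a starting policy, the matrix condenses to a small lookup. A sketch with the budgets and defaults from the six cards above (pairing strings are shorthand, and the agentic budget is the upper end of the article's 10-30s range, not a hard limit):

```python
# Starting policy: UX class -> TTFT budget + default pairing, per the matrix above.
UX_POLICY = {
    "realtime-voice":   {"ttft_budget_s": 0.30, "default": "cerebras/llama-4-70b"},
    "ide-inline":       {"ttft_budget_s": 0.20, "default": "groq-or-cerebras open-weight"},
    "autocomplete":     {"ttft_budget_s": 0.50, "default": "gemini-3-flash or gpt-5.5-mini"},
    "chat":             {"ttft_budget_s": 2.00, "default": "frontier, standard reasoning"},
    "background-agent": {"ttft_budget_s": 30.0, "default": "opus-4.7 light thinking"},
    "async-batch":      {"ttft_budget_s": None, "default": "highest reasoning + batch tier"},
}

def pairing_for(ux_class: str) -> str:
    """Look up the default pairing; measure against your own traffic before committing."""
    return UX_POLICY[ux_class]["default"]
```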

07 · Provider Selection · The decision tree.

The provider-selection logic for latency-bound deployments, distilled to a starting policy. Use it as the default, then measure against your specific traffic to refine.

Step 1
Pin the UX class
Real-time voice → IDE → Chat → Agentic → Async

Determine the budget first. Sub-300ms: only Groq/Cerebras open-weight. Sub-2-second: frontier models at standard reasoning. Above 10 seconds, anything works, including extended thinking.

Budget first
Step 2
Pick model tier
Frontier closed-source · Frontier open-weight · Latency-tier

Within the budget, pick model by capability needs. Don't try to fit Claude Opus 4.7 extended thinking into a chat budget — pick GPT-5.5 standard or Sonnet 4.6 instead.

Model fit
Step 3
Provider for region
US-East · EU-Central · APAC-Tokyo

Pick the provider region nearest your largest user concentration. For globally distributed users, regional routing through a gateway is worth 80-200ms on average.

Region fit
Step 4
Validate P95
P50 → P95 multiplier · 1.6-3.2×

P50 alone hides outliers. Anchor SLO design on P95. The cheapest providers tend to have noisier tails; the throughput leaders (Groq/Cerebras) have remarkably tight P95.

Tail latency
"Most teams design SLOs against P50, ship to production, and discover the P95 outliers ruin perceived UX. Anchor on P95 from day one."— Internal SLO review, May 2026

08 · Conclusion · Latency is UX-class-specific.

Latency landscape · April 2026

Pin the UX class. Pick the model+provider. Anchor on P95. Re-measure quarterly.

AI latency is a moving target. Provider-side improvements (Groq/Cerebras throughput climbs, Anthropic latency tier additions, Google Flash variants) ship every few weeks; the data in this tracker will be partly stale within a quarter. Build quarterly re-measurement into your SLO process.

The framing that lasts: latency budget is a UX-class function, not a model property. Pin the class first (real-time voice, chat, background agent, batch); pick the model+provider that fits; then measure P95 against your traffic shape. Premature optimization for raw TPS without UX-class fit is wasted spend.

We re-publish this tracker every quarter. Bookmark this page; subscribe to the newsletter for the change log.

Latency that matches your UX

Stop chasing TPS in isolation. Build for UX-class fit.

We design latency-aware AI deployments for engineering and product teams shipping production at scale — covering UX-class budgeting, provider routing, P95 SLO design, and quarterly re-measurement cadence.

Free consultation · Expert guidance · Tailored solutions
What we work on

Latency engineering engagements

  • UX-class budgeting and SLO design
  • Provider routing — Groq / Cerebras / frontier
  • P95-anchored monitoring and alerting
  • Reasoning-tier policy with latency budgets
  • Quarterly re-measurement cadence
FAQ · AI latency benchmarks 2026

The questions we get every week.

What's the difference between TTFT and TPS, and which matters more?

TTFT (time-to-first-token) is the latency from request submission to the first stream chunk arriving; it governs perceived responsiveness, i.e. how soon the user sees anything at all. TPS (tokens per second) is the streaming throughput once output starts; it governs perceived completion time. For chat UX both matter, but TTFT dominates user impression: a 0.4s TTFT with 100 TPS feels faster than a 1.2s TTFT with 200 TPS, even though the 200 TPS pairing finishes sooner. Anchor design on TTFT for the first impression and on TPS for the completion experience.
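Worked arithmetic for a 256-token reply: 0.4s + 256/100 TPS ≈ 3.0s total, versus 1.2s + 256/200 TPS ≈ 2.5s. The second pairing finishes roughly half a second sooner, but the first paints output three times earlier, and that first paint dominates perceived speed.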