Marketing claims of 1M-token context windows hide a 30-60 point retrieval drop between 200K and 1M for every frontier model except Gemini 3 Deep Think. The phrase "1M context" on a model card is a capacity statement; it is not a quality statement. Effective context — the window over which retrieval and reasoning hold up — is dramatically shorter than the advertised window for three of the four leading frontier models.
On NIAH-2 single-needle at 1M: GPT-5.5 hits 96%, Gemini 3 Deep Think hits 99%, Claude Opus 4.7 hits 89%, DeepSeek V4-Pro hits 78%. That looks acceptable; the multi-needle and reasoning-over-context numbers tell a different story. Multi-needle (8 needles): GPT-5.5 74%, Gemini 3 89%, Opus 4.7 56%, V4-Pro 41%. RULER reasoning-over-context at 256K is harsher still — only Gemini 3 stays above 80%.
This post publishes the full benchmark grid, the failure modes (positional bias, attention-sink collapse, MLA distortion at long context), and the production-side implications for teams designing long-context workflows.
- 01 — Claimed context window ≠ effective context; the gap is 30-60 points on multi-needle retrieval. Every frontier model advertises 1M tokens. None of them performs at 1M as well as it does at 200K — except Gemini 3 Deep Think, which uniquely holds near-perfect retrieval through the full window. For the other three, design for effective context (typically 200-400K) rather than the claimed window.
- 02 — Gemini 3 Deep Think is the long-context leader by a wide and persistent margin. NIAH-2 at 1M: single-needle 99%, multi-needle 89%; RULER above 80% at 256K. The architectural and training-data advantages compound — Google's long-context pipeline is roughly 12 months ahead of competitors, and the gap is not closing on existing benchmarks.
- 03 — Multi-needle retrieval is where models silently fail; single-needle scores are misleading. Single-needle NIAH measures whether the model can find one piece of information in a haystack. Multi-needle measures whether it can integrate multiple pieces. Production workloads are multi-needle; single-needle scores overstate production capability by 15-40 points across the field.
- 04 — RULER reasoning-over-context is harsher than NIAH and more production-realistic. RULER tests reasoning over retrieved long-context content rather than pure retrieval. Scores typically run 10-25 points below NIAH-2 single-needle for the same model. At 256K context, only Gemini 3 stays above 80% on RULER. For workloads requiring reasoning over long context (legal analysis, research synthesis), use RULER as the headline benchmark.
- 05 — Production implication: design for effective context, not claimed; use RAG above 200-400K for non-Gemini stacks. If your workload sits comfortably under 200K tokens, the claim-vs-effective gap doesn't matter. Above 200K with non-Gemini frontier models, supplement with retrieval — RAG over a focused chunk-set typically outperforms naive long-context for the same total budget. Above 400K on non-Gemini, RAG almost always wins.
01 — Claim vs Reality
Claimed context ≠ effective context.
The 1M-token claim is a capacity statement. It says: the model architecturally accepts a 1M-token input. It does not say: the model performs at 1M as well as it performs at 200K. The two statements are independent and the gap between them is what governs whether long-context production works.
Three benchmark families measure effective context: NIAH-2 (Greg Kamradt's updated needle-in-haystack tests), RULER (Nvidia's reasoning-over-context suite), and MRCR v2 (Multi-Round Context Retrieval, Anthropic-aligned). All three agree on the qualitative picture: effective context is much shorter than claimed for every frontier model except Gemini 3. A fourth suite, LongBench-v2, serves as a cross-task sanity check.
NIAH-2 — single + multi-needle retrieval · Retrieval ceiling
Find specific info in a long-context haystack. The classic: place a 'needle' (a specific fact) in long-context distractor text and ask the model to retrieve it. Single-needle is the easy version; multi-needle (typically 8 needles) is closer to production.
RULER — reasoning over context · Production-realistic
Multi-step reasoning over long-context input. Reasoning over retrieved content rather than pure recall; more production-realistic. Scores run 10-25 points below NIAH-2 single-needle for the same model and context length.
MRCR v2 — multi-round context retrieval · Conversation realism
Multi-turn questions referencing scattered context. Anthropic-aligned benchmark in which multiple turns reference different parts of a long context; tests both retrieval breadth and conversational stability. GPT-5.5 leads at 1M (74.0% on 8-needle 512K-1M); Opus 4.7 lags despite its long-context positioning (32.2%).
LongBench-v2 — open eval suite · Cross-task sanity check
Code, documents, and dialogue at long context. Multi-task long-context eval; useful as a sanity check across more diverse tasks than NIAH or RULER. A less crisp differentiator at the top end, but useful for picking up workload-specific quirks.
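Before the numbers, it helps to see how little machinery a NIAH-style probe needs. The sketch below is illustrative only: `call_model` is a hypothetical stand-in for whatever provider client you use, and the needle and filler are toy data, not the NIAH-2 corpus.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your provider's completion client."""
    raise NotImplementedError("wire up your SDK here")

NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"
FILLER = "The quick brown fox jumps over the lazy dog. "  # toy distractor text

def build_haystack(total_chars: int, depth: float) -> str:
    """Embed NEEDLE at `depth` (0.0 = start, 1.0 = end) of the filler text."""
    body = FILLER * (total_chars // len(FILLER))
    cut = int(len(body) * depth)
    return body[:cut] + NEEDLE + body[cut:]

def single_needle_score(total_chars: int,
                        depths=(0.1, 0.3, 0.5, 0.7, 0.9)) -> float:
    """Fraction of positional depths at which the model retrieves the needle."""
    hits = 0
    for depth in depths:
        prompt = build_haystack(total_chars, depth) + f"\n\n{QUESTION}"
        hits += "Dolores Park" in call_model(prompt)  # crude string-match scoring
    return hits / len(depths)
```

Sweeping `total_chars` upward while holding the task fixed is exactly how the claimed-vs-effective gap shows up: the score curve, not the context-window spec, is the capability statement.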
02 — NIAH-2 Single-Needle
NIAH-2 single-needle across the field.
Single-needle NIAH-2 is the headline retrieval benchmark — and also the one most likely to overstate production capability. All four frontier models look acceptable here at long context. The deeper differentiation lives in the multi-needle and reasoning-over-context tests.
NIAH-2 single-needle · 200K vs 1M context
Source: NIAH-2 (Greg Kamradt) · public model cards · Apr 2026
The pattern: at 200K tokens every frontier model is essentially saturated. At 1M, the spread opens to 21 points (78% to 99%). The single-needle drop from 200K to 1M is small for Gemini 3 (-0.5 pts) and GPT-5.5 (-3 pts), large for Opus 4.7 (-10 pts), and dramatic for DeepSeek V4-Pro (-18 pts). The MLA-class attention compression that gives V4-Pro its KV-cache advantage also distorts long-range attention slightly more than GQA-class compression does — a real, measured trade-off.
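As a sanity check on those deltas, the 200K column of the chart can be recovered from numbers already quoted in this post: each model's 1M score plus its stated drop. A few lines of Python make the arithmetic explicit (no new data, just the figures above):

```python
# 1M single-needle scores and 200K->1M drops, as quoted in this post.
SINGLE_1M = {"Gemini 3 Deep Think": 99.0, "GPT-5.5": 96.0,
             "Claude Opus 4.7": 89.0, "DeepSeek V4-Pro": 78.0}
DROP = {"Gemini 3 Deep Think": 0.5, "GPT-5.5": 3.0,
        "Claude Opus 4.7": 10.0, "DeepSeek V4-Pro": 18.0}

single_200k = {m: SINGLE_1M[m] + DROP[m] for m in SINGLE_1M}
print(single_200k)
# {'Gemini 3 Deep Think': 99.5, 'GPT-5.5': 99.0,
#  'Claude Opus 4.7': 99.0, 'DeepSeek V4-Pro': 96.0}
# Consistent with "essentially saturated at 200K" and the 21-point 1M spread.
```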
03 — Multi-Needle
Multi-needle is where models silently fail.
The single-to-multi-needle drop is the most important number on the page. Single-needle measures whether the model can find one specific item; multi-needle (8 needles) measures whether it can integrate eight pieces of information across long context. For production workloads — research synthesis, legal analysis, multi-document Q&A — multi-needle is the realistic capability.
NIAH-2 multi-needle (8) · 200K vs 1M
Source: NIAH-2 8-needle · model cards · Apr 2026
"The 1M-token claim sells the demo. The multi-needle score sells the production deployment. They are not the same number."
— Internal long-context eval notes, May 2026
The single-to-multi gap at 1M context is dramatic across the field: GPT-5.5 drops 22 points, Opus 4.7 drops 33 points, DeepSeek V4-Pro drops 37 points. Only Gemini 3 stays close (10-point drop). For any production workload that involves integrating multiple pieces of information across long context, weight multi-needle scores heavily over single-needle.
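To see why the multi-needle setting is so much harder, consider a toy 8-needle probe in the same spirit: eight scattered facts and one question whose answer requires all of them, so a single dropped needle produces a wrong answer. As before, `call_model` is a hypothetical client stub and the fact template is toy data, not the NIAH-2 corpus.

```python
import random

def call_model(prompt: str) -> str:  # hypothetical client stub
    raise NotImplementedError

CITIES = ["Lyon", "Quito", "Osaka", "Perth", "Lagos", "Turku", "Bogotá", "Hanoi"]
FILLER = "Lorem ipsum dolor sit amet. " * 2000  # stand-in distractor text

def multi_needle_prompt(seed: int = 0) -> tuple[str, int]:
    """Scatter 8 fact-needles at random depths; return (prompt, expected answer)."""
    rng = random.Random(seed)
    codes = {city: rng.randint(10, 99) for city in CITIES}
    body = FILLER
    for city in CITIES:  # one needle per city, at a random position
        cut = rng.randint(0, len(body))
        body = body[:cut] + f" The access code for {city} is {codes[city]}. " + body[cut:]
    question = "What is the sum of all eight access codes mentioned above?"
    return body + "\n\n" + question, sum(codes.values())

prompt, expected = multi_needle_prompt()
# Scoring: the answer is only correct if ALL eight needles were retrieved
# AND integrated — a single missed needle changes the sum.
# correct = str(expected) in call_model(prompt)
```

Single-needle scoring gives partial credit implicitly; this construction does not, which is why it tracks production workloads better.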
04 — RULER
RULER — reasoning over long context.
RULER asks the model to reason over long-context content rather than retrieve from it. The benchmark covers 13 tasks (variable tracking, multi-hop reasoning, frequent-word extraction, query-aware aggregation, among others) at context lengths from 4K to 128K and beyond. It is the most production-realistic long-context benchmark available — and the harshest.
Gemini 3 Deep Think · Only model >80% at 256K
Only frontier model that holds above 80% at 256K on RULER. Reasoning over retrieved long-context content stays accurate. Default for any workload that requires reasoning + long context together.
GPT-5.5 · Strong at 128K
Drops below 80% at 256K — the threshold most teams target for production reliability. Acceptable at shorter context (under 128K), where it scores 86-89%; weakens at the long end.
Claude Opus 4.7 · Retrieval > reasoning
Strong on document retrieval (DocVQA-class tasks) but weaker on RULER's reasoning-over-context. The pattern: Opus is excellent at finding and quoting, weaker at reasoning across the full long-context space.
The qualitative pattern: every model's RULER score is 10-25 points below its NIAH-2 single-needle score at the same context length. Reasoning over long context is harder than retrieving from it. Production teams should treat RULER as the realistic ceiling for any long-context workflow that requires more than "quote the relevant passage" behavior.
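For a feel of what RULER-style tasks look like, here is a toy generator in the spirit of its variable-tracking category (mentioned above). The final value is never stated verbatim, so pure retrieval is not enough; the model must chain the assignments. The task shape is an illustration under that assumption, not the actual RULER implementation.

```python
import random

def variable_tracking_task(hops: int = 6, seed: int = 0) -> tuple[str, int]:
    """Build a chain like: VAR X1 = 742. VAR X2 = X1. VAR X3 = X2. ...
    buried in filler; the question asks for the final variable's value."""
    rng = random.Random(seed)
    value = rng.randint(100, 999)
    lines = [f"VAR X1 = {value}."]
    lines += [f"VAR X{i} = X{i - 1}." for i in range(2, hops + 1)]
    doc = ["The committee reconvened after a short recess."] * 40
    for line in lines:  # interleave assignments at random positions;
        # the assignments are declarative, so order does not change the answer
        doc.insert(rng.randint(0, len(doc)), line)
    question = f"\n\nWhat is the value of X{hops}? Answer with the number only."
    return " ".join(doc) + question, value
```

Each hop is trivial in isolation; the difficulty is holding the whole chain while scanning long distractor text, which is exactly the gap between NIAH and RULER scores.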
05 — Failure Modes
The mechanical failure modes.
Long-context degradation is not random. Three mechanical failure modes explain most of the score drop between 200K and 1M, and each suggests a different mitigation. Knowing which mode is in play on your workload helps choose between architectural fixes (model swap), application fixes (RAG + chunking), and operational fixes (retry / re-anchor).
Positional bias — middle gets dropped · Re-pack prompt by importance
The classic 'lost in the middle' pattern. Models attend more to the start and end of the context than to the middle, so needles placed at 30-70% positional depth show a 5-15 point retrieval drop. Mitigation: re-pack the prompt to put critical content at the start or end, or use RAG to deliver the relevant chunks tightly (see the re-packing sketch after this list).
Attention-sink collapse · Architecture-bound, model swap
At very long context, attention concentrates on a small set of 'sink' tokens (often the BOS token and the most recent few tokens), and mid-context tokens become effectively invisible. DeepSeek V4's MLA + CSA design partially mitigates this; GQA-class models without explicit sink handling are more affected. Visible as catastrophic mid-context drops.
MLA distortion at long range · Architecture-bound, accept or swap
DeepSeek's MLA compresses keys into a low-rank latent space. Excellent for memory, but the projection introduces small distortions that compound at very long range. V4-Pro shows the largest single-to-multi-needle gap of any frontier model, traceable to this. The trade-off is real and should be understood when picking V4 for long-context-heavy workloads.
Multi-fact integration failure · Use CoT or RAG decomposition
Even when all needles are successfully retrieved individually, models can fail to integrate multiple facts when the answer requires combining them. This is a reasoning failure, not a retrieval failure; it shows up worst on RULER and on multi-hop benchmarks. Mitigation: chain-of-thought prompting, agentic decomposition, or RAG with focused chunks.
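The re-packing mitigation from the positional-bias card above is mechanical enough to sketch. The function below assumes the caller already has a relevance score per chunk (from a retriever, reranker, or heuristic; the scores are a hypothetical input) and simply places high-value chunks at the prompt's edges, where attention is strongest, demoting filler to the middle:

```python
def repack_by_importance(chunks: list[tuple[str, float]]) -> str:
    """Arrange (text, score) chunks so the most important land at the
    start and end of the prompt, the least important in the middle.

    Assumes higher score = more important. Scores come from the caller
    (retriever / reranker); this function only reorders."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front, back = [], []
    # Deal high-importance chunks alternately to the two edges; the tail
    # of `ranked` (lowest scores) naturally ends up mid-prompt.
    for i, (text, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    return "\n\n".join(front + back[::-1])
```

It is a pure application-side fix: no model swap, no retraining, and it composes with RAG, which supplies the scores for free.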
06 — Production
What to do with this in production.
The data says one thing clearly: stop treating "1M-token context window" as a feature checkbox. Treat it as a capacity ceiling and design for effective context. Below are the four production patterns that work today.
Stay under 200K · single-pass long-context · Full long-context, any model
Below 200K, every frontier model performs cleanly: use full long-context. The gap between models in this band is small; pick on cost (Opus cached at $0.50/1M is hard to beat) and capability (which model wins your specific task).
200K-1M · Gemini 3 Deep Think only
Gemini 3 is the only frontier model that holds retrieval and reasoning quality from 200K to 1M. For workloads that require >200K and don't fit RAG (whole-corpus reasoning, multi-document analysis), Gemini 3 is the default.
200K+ with non-Gemini · supplement with RAG · RAG + focused long-context
If you can't or don't want to use Gemini 3, layer RAG on top: use the long-context model to read focused chunks (50-100K) selected by retrieval rather than running 500K+ raw. The hybrid gives back most of the long-context benefit while restoring multi-needle reliability.
Reasoning-heavy long-context · always RAG · RAG required
If the workload requires multi-hop reasoning over long context (legal analysis, research synthesis, technical documentation), RAG with explicit chunk selection beats naive long-context for every model except Gemini 3 Deep Think — and even Gemini 3 benefits from RAG above 400K. These four patterns are wired together in the routing sketch after this list.
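The four patterns reduce to a small routing rule. The sketch below encodes them with the thresholds argued in this post; `count_tokens`, `retrieve_chunks`, and `call_model` are hypothetical stand-ins for your tokenizer, retriever, and provider client, and the model id string is an assumption.

```python
# Hypothetical stand-ins: wire these to your tokenizer / retriever / SDK.
def count_tokens(text: str) -> int: ...
def retrieve_chunks(corpus: str, query: str, budget_tokens: int) -> list[str]: ...
def call_model(model: str, prompt: str) -> str: ...

def route_long_context(corpus: str, query: str, model: str,
                       reasoning_heavy: bool = False) -> str:
    """Maps the four production patterns above onto a strategy choice."""
    n = count_tokens(corpus)
    gemini = model == "gemini-3-deep-think"  # assumed model id

    if gemini:
        # Pattern 2: Gemini holds quality through 1M single-pass; pattern 4
        # caveat: reasoning-heavy work still benefits from RAG above 400K.
        use_rag = reasoning_heavy and n > 400_000
    elif reasoning_heavy:
        use_rag = True               # pattern 4: always RAG off-Gemini
    else:
        use_rag = n > 200_000        # pattern 1 vs pattern 3

    if not use_rag:
        return call_model(model, corpus + "\n\n" + query)
    # Pattern 3: read a focused 50-100K chunk-set selected by retrieval
    # instead of running 500K+ raw.
    chunks = retrieve_chunks(corpus, query, budget_tokens=100_000)
    return call_model(model, "\n\n".join(chunks) + "\n\n" + query)
```

Treat the thresholds as defaults to re-measure on your own workload, not constants; they will move as the benchmark grid moves.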
07 — Conclusion
Effective context is shorter than claimed.
The 1M claim is a ceiling. The multi-needle score is the production reality.
By April 2026 the long-context picture is well characterized but widely misunderstood. Every frontier model claims 1M tokens; only Gemini 3 Deep Think actually holds retrieval and reasoning quality through the full window. For the other three (GPT-5.5, Claude Opus 4.7, DeepSeek V4-Pro), effective context for multi-needle production workloads sits in the 200-400K band. Above that, performance degrades meaningfully and production deployments should supplement with retrieval.
The mechanical failure modes — positional bias, attention-sink collapse, MLA distortion, multi-fact integration failure — are different in cause but similar in consequence. The right response depends on which mode dominates your workload. Single-needle benchmarks miss this; multi-needle and RULER expose it; production teams should weight RULER and 8-needle scores most heavily when choosing models for long-context applications.
The next 12 months will see GPT-5.5 and Opus 4.7 close some of the gap on Gemini 3 (both are working aggressively on this); we do not expect the gap to vanish. Plan production architectures around the current benchmark grid, and re-evaluate quarterly as the numbers shift.