Frontier model releases in H1 2026 painted a clearer picture than any prior half-year. Four labs shipped more than twenty production models between January and May, and the pattern across them was consistent enough to call a trend rather than a coincidence: capabilities converged, context windows standardised at one million tokens, and pricing per intelligence-unit fell faster than in any previous half.
What changed isn't just the headline benchmarks. It's the shape of how teams now consume these models. Reasoning-effort knobs, once an Anthropic curiosity, became the default control surface across every major lab. Structured outputs moved from a coin-flip feature to a reliability primitive teams can build on. Agent loops stopped being scaffolding teams wrote themselves and started being native API behaviour. Million-token context stopped being an aspirational demo and turned into a price-competitive default.
This retrospective tracks the data behind those shifts — release cadence by lab, benchmark deltas across the half, pricing-tier comparisons, the context-window timeline, and the four operating trends teams should be calibrating against. We close with a forecast for H2 2026 grounded in the trajectory, not in vibes.
- 01: Reasoning-effort routing became the default control. Every major lab now exposes a low/medium/high or non-think/think-high/think-max knob. The interesting product surface moved from model selection to effort selection within a single model family.
- 02: One-million-token context is now economically viable. DeepSeek V4 Preview, Gemini 3.1 Pro, and Claude Opus 4.7 all priced 1M context within range of what was 200k pricing six months ago. The context-as-luxury era ended in H1.
- 03: Structured outputs hit production-grade reliability. Schema adherence rates above 99.5% across labs made JSON-mode and tool-calling the assumed substrate rather than a fragile feature. Validators around model output became optional, not load-bearing.
- 04: Agent loops are now a native API primitive. Multi-turn tool-calling with parallel execution, interleaved reasoning, and crash-safe pause-resume moved from custom scaffolding into the SDK surface. The framework conversation shifted accordingly.
- 05: Frontier labs converged on capability profiles. Best-of-class on most benchmarks now sits within a single-digit gap across labs. The differentiation moved from raw capability to latency, price, deployment posture, and the ergonomics of the surrounding tooling.
01 — Why Retrospective
Six months is the shortest window that smooths the noise.
Tracking frontier model releases week-by-week produces noise that looks like signal. A new benchmark high, a price cut, a deprecation — each one feels load-bearing in the moment, and most don't survive the next release cycle. Six months is the shortest window that smooths individual-release noise into a trajectory teams can actually plan against.
H1 2026 was a particularly productive window for that smoothing. The four labs we track — Anthropic, OpenAI, Google DeepMind, and DeepSeek — collectively shipped more than twenty production-grade models in the period. That density means the half includes enough data points to separate trend from outlier on every dimension that matters: cadence, capability, price, context, and agent ergonomics.
Our methodology for this report: only models that shipped publicly usable inference are counted (Hugging Face open weights, public API availability, or chat-product exposure). Research previews without inference paths are excluded. Pricing is taken from each lab's official documentation as of May 15, 2026. Benchmark numbers are sourced from each lab's technical report or model card, with cross-reference against independent evaluations where the spread between labs warranted it.
One framing matters before the data: this is a capabilities-and-cadence retrospective, not a vendor ranking. The question we're answering isn't "which lab is winning" — it's "what changed in the substrate that teams build on, and what does that imply for how to plan the next six months." Vendor selection is a per-workload decision driven by latency, deployment posture, and the specific benchmark profile your application cares about. The data below should inform that decision, not replace it.
02 — Release Cadence
Four labs, twenty-plus shipped models, distinct rhythms.
Across the half, the four labs settled into recognisable rhythms. Anthropic released two Opus generations and a refreshed Sonnet family. OpenAI shipped GPT-5.4 in January, GPT-5.5 in April, with interim point releases. Google DeepMind moved Gemini 3.0 to 3.1 Pro and added two specialised siblings. DeepSeek shipped V3.2 in February and V4 Preview in April — its first 1.6T-parameter open release.
The pattern beneath the count is more interesting than the count itself. Anthropic shifted to roughly six-week cadence on point releases, with backwards-compatible API contracts and explicit migration playbooks. OpenAI continued its tradition of fewer, larger releases with more dramatic capability jumps. Google ran a two-track strategy — a flagship Gemini line on a slower cadence, and rapid iteration on the smaller Flash and Nano siblings. DeepSeek leaned into a paper-plus-release model where every shipment is accompanied by a complete technical report.
Anthropic · Opus + Sonnet · six-week cadence (anthropic.com)
Opus 4.6 · Opus 4.7 · Sonnet 4.7. Two Opus generations and a refreshed Sonnet family. Reasoning-effort knob (low/medium/high) became the default product surface; backwards-compatible API contracts kept migration friction low.

OpenAI · GPT-5 family · quarterly major releases (platform.openai.com)
GPT-5.4 · GPT-5.5 · interim point releases. Fewer, larger releases with more dramatic capability jumps. xLow / Low / Med / High / xHigh effort tiers covering the price-capability surface; structured outputs reached 99.7% schema adherence.

Google DeepMind · Gemini 3.0 → 3.1 Pro (ai.google.dev)
3.0 → 3.1 Pro · Flash · Nano siblings. Two-track strategy: flagship line on a slower cadence, rapid iteration on Flash and Nano. Native 1M context across the family; multimodal parity remained the principal differentiator.

DeepSeek · V3.2 → V4 Preview (huggingface.co/deepseek-ai)
V3.2 · V4-Pro 1.6T · V4-Flash 284B. Paper-plus-release model — every shipment came with a complete technical report. V4 Preview's hybrid CSA+HCA attention made 1M context economically viable on open weights.

What changed in cadence terms across the half is the compression of the gap between point releases. What used to be six-to-nine-month flagship cycles tightened toward quarterly for OpenAI and six-weekly for Anthropic. The implication for teams is that "current model" is a moving target — integration patterns that depend on hard-coded model IDs need a review cadence that matches the upstream cadence, not the older twice-a-year deployment rhythm.
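One lightweight way to get that review cadence is an alias layer: application code asks for a stable alias, and a single reviewed config pins the concrete model ID. A minimal sketch, with made-up model IDs and review dates standing in for whatever your stack actually pins:

```python
# Hypothetical model-alias layer: application code asks for a stable alias, and one
# reviewed config pins the concrete model ID. IDs and dates below are illustrative.
from datetime import date

MODEL_ALIASES = {
    "coding-agent":  {"model_id": "example-opus-4-7",       "review_by": date(2026, 7, 1)},
    "generalist-qa": {"model_id": "example-gpt-5-5",        "review_by": date(2026, 7, 1)},
    "multimodal":    {"model_id": "example-gemini-3-1-pro", "review_by": date(2026, 7, 1)},
}

def resolve_model(alias: str) -> str:
    """Return the pinned model ID for an alias, flagging entries past their review date."""
    entry = MODEL_ALIASES[alias]
    if date.today() > entry["review_by"]:
        print(f"warning: alias '{alias}' is past its review date; re-check the upstream lineup")
    return entry["model_id"]
```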
"What used to be six-to-nine-month flagship cycles tightened toward quarterly for OpenAI and six-weekly for Anthropic — and the integration discipline that pattern requires has not caught up in most teams."— Our reading of H1 2026 release telemetry
03 — Benchmark Gains
Where the frontier actually moved in six months.
The headline benchmarks shifted meaningfully in H1, though not evenly across categories. Coding and competitive programming saw the largest gains — open and closed models both crossed thresholds that looked aspirational in December 2025. Formal reasoning and mathematical proof improved enough that proof-graded benchmarks (where every solution must be a valid graded proof, not a numeric answer) stopped being out-of-reach for the strongest models.
General knowledge and retrieval moved more modestly, and long-context retrieval remained the area with the widest spread between labs. The chart below tracks selected benchmark gains from December 2025 baselines to the strongest H1 2026 frontier model in each category. The deltas are in absolute percentage points.
[Chart: H1 2026 benchmark deltas · selected categories. Source: lab-reported benchmarks, December 2025 baseline vs May 2026 high.]

Two observations from the data above warrant calling out. First, ceiling effects are starting to show on a handful of long-standing benchmarks — MMLU-Pro and GPQA Diamond moved single-digit percentage points across the half because the strongest models are already in the high 80s and low 90s. New benchmarks designed to be harder (frontier math, multi-step agent tasks, very-long-context retrieval) are where the meaningful differentiation now lives.
Second, the inter-lab spread on long-context retrieval is the data point teams should care most about. MRCR 1M scores ranged from the high 70s to the low 90s across labs, with the gap stable through the half. That spread directly affects which lab a team should pick for long-document RAG and agent workloads — and the answer is currently lab-dependent, not convergent. Routing across labs by workload class remained non-optional for any team running long-context RAG in production.
04 — Pricing Shifts
Per-token pricing fell faster than capability rose.
H1 2026 was the period where per-intelligence-unit pricing fell faster than capability rose for the first time in two years. Two mechanics drove that: open-weight efficiency releases (DeepSeek V4 being the canonical example) put pressure on closed-frontier list prices, and the labs themselves competed on cached-input and structured-output discounts that quietly cut effective production costs by 30 to 60% for typical agent workloads.
The matrix below compares strategic positioning across the four labs at the close of H1. It's not a pricing chart — we've covered the actual cost-per-million-tokens deltas in our DeepSeek V4 Preview launch coverage — but a strategic-fit guide for how to think about each lab's current shape.
Premium reasoning workhorse (Anthropic) · pick for agentic + long-context
Best-in-class agentic coding and long-context retrieval, three-tier reasoning effort, the strongest MRCR 1M scores across labs. Highest list price among the four, partially offset by prompt-caching and batch-API discounts.

Generalist with five-tier effort (OpenAI) · pick for generalist routing
Strongest broad-knowledge profile, native structured outputs at 99.7% adherence, five effort tiers (xLow through xHigh) spanning a wide price-capability surface. Default first choice when reasoning effort is the variable, not the model.

Multimodal-first frontier (Google DeepMind) · pick for multimodal
Native 1M context across the whole family, the strongest multimodal parity (image, video, audio, code), aggressive pricing on Flash and Nano siblings. The right pick when video and image dominate the workload.

Open-weight efficiency (DeepSeek) · pick for sovereignty + on-prem
Open weights on Hugging Face, hybrid CSA+HCA attention making 1M context economically viable, three reasoning modes. Strongest open-model code and formal-reasoning scores; trails the closed frontier on general knowledge by 3 to 6 months.

The strategic implication for production routing: the "use this one model for everything" era ended in H1 2026. A well-architected production stack now routes by workload class — agentic coding to Opus 4.7, multimodal to Gemini 3.1 Pro, generalist Q&A to GPT-5.5 at mid effort, sovereignty-bound long-document RAG to V4-Pro on-prem. Routing tax (the latency and engineering cost of operating multiple labs) is real, but for any workload above modest scale, the price-capability gains from routing dominate the tax.
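As a sketch of what that routing looks like in code, here is a minimal workload-class table following the examples in the paragraph above. The model identifier strings are illustrative placeholders, not the labs' actual API model IDs, and a production router would also weigh latency, data residency, and fallback behaviour:

```python
# Illustrative workload-class routing table. Model identifiers are placeholders,
# not real API model IDs; the class-to-lab mapping follows the text above.
ROUTING_TABLE = {
    "agentic-coding":     {"provider": "anthropic", "model": "opus-4.7"},
    "multimodal":         {"provider": "google",    "model": "gemini-3.1-pro"},
    "generalist-qa":      {"provider": "openai",    "model": "gpt-5.5", "effort": "medium"},
    "sovereign-long-rag": {"provider": "on-prem",   "model": "deepseek-v4-pro"},
}

def route(workload_class: str) -> dict:
    """Resolve a workload class to a provider/model choice, defaulting to the generalist."""
    return ROUTING_TABLE.get(workload_class, ROUTING_TABLE["generalist-qa"])
```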
One pricing-mechanic worth flagging for H2 planning: cached-input pricing. Every lab now offers a meaningful discount (50 to 90%) for repeated prompt prefixes served from a managed cache. For agent workloads where system prompts and tool definitions repeat across thousands of calls, the cached-input price — not the headline list rate — is the number that determines unit economics. Most teams are still calibrating their architecture to take advantage of it.
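To see why the cached rate is the number that matters, here is a back-of-envelope calculation for a prefix-heavy agent workload. Every number in it is an assumed placeholder chosen to sit inside the ranges quoted above, not a real list price:

```python
# Back-of-envelope unit economics for a prefix-heavy agent workload. All numbers are
# assumed placeholders inside the ranges quoted in the text, not real list prices.
list_price_per_mtok = 3.00    # $ per million input tokens (hypothetical list rate)
cache_discount      = 0.80    # 80% off cached prefix tokens (within the 50-90% range)
prefix_tokens       = 40_000  # repeated system prompt + tool definitions per call
fresh_tokens        = 4_000   # new tokens per call (user turn, tool results)
calls_per_day       = 10_000

def daily_input_cost(cached: bool) -> float:
    prefix_rate = list_price_per_mtok * (1 - cache_discount) if cached else list_price_per_mtok
    per_call = (prefix_tokens * prefix_rate + fresh_tokens * list_price_per_mtok) / 1e6
    return per_call * calls_per_day

print(f"uncached: ${daily_input_cost(False):,.2f}/day  cached: ${daily_input_cost(True):,.2f}/day")
# With these assumptions the cached-prefix rate cuts input spend by roughly 70%.
```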
05 — Context Growth
One million tokens stopped being a luxury.
The context-window story of H1 2026 isn't that limits got bigger — they were already at 1M for Gemini in 2025, and 200k for Anthropic and OpenAI. The story is that one-million-token context became economically reasonable across every major lab. Three things happened in parallel: DeepSeek shipped V4's hybrid CSA+HCA attention that runs 1M context at 27% of V3.2's FLOPs, OpenAI expanded GPT-5.5 to 1M context at production pricing, and Anthropic added 1M context to Opus 4.7 with differentiated pricing tiers above 200k. By May, all four labs shipped a 1M-context model at price points within range of what was 200k pricing six months earlier.
January 2026 · median frontier context · wide inter-lab spread
January frontier median was 200k tokens for Anthropic and OpenAI, 1M for Gemini, 128k for DeepSeek V3.2. Wide spread, with 1M still feeling like a Google specialty rather than a baseline expectation.

May 2026 · universal frontier context · universal baseline
May frontier ceiling is 1M tokens across all four labs at competitive pricing. Anthropic, OpenAI, Google, and DeepSeek all shipped 1M-capable production models — the era of 1M-as-luxury ended in H1.

Where retrieval still wins · RAG still wins above 400k
Above 400k tokens, retrieval-augmented generation still outperforms naive long-context loading for most workloads on most labs. 1M context is a power tool for specific workload classes, not a universal architecture default.

What didn't change as much as the context floor: effective retrieval quality at very long context. MRCR 1M scores ranged from the high 70s to low 90s across labs, and the "needle in a haystack" intuition that worked at 128k breaks down at 1M for most labs. Above roughly 400k tokens, retrieval-augmented generation paired with a smaller context window still outperforms naive long-context loading for most workloads. 1M context is a power tool for specific workload classes — full-codebase analysis, multi-document legal review, very-long agent sessions — not a universal architecture default.
The architecture decision tree to carry into H2: under 200k tokens, load the context directly into the model; 200k to 400k, evaluate per workload because the answer is workload-specific; above 400k, RAG-plus-medium-context still wins for most teams, with specific exceptions (codebase navigation, very-long agent sessions) where the model's coherent attention across the full window matters more than retrieval precision.
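Written down as a first-pass check, that tree is small. The thresholds come straight from the paragraph above; the exception names are illustrative:

```python
# First-pass context-architecture check following the thresholds in the text.
# The exception names are illustrative workload labels, not a fixed taxonomy.
FULL_CONTEXT_EXCEPTIONS = {"codebase-navigation", "long-agent-session"}

def context_strategy(context_tokens: int, workload: str) -> str:
    if context_tokens < 200_000:
        return "load-directly"
    if context_tokens <= 400_000:
        return "evaluate-per-workload"
    if workload in FULL_CONTEXT_EXCEPTIONS:
        return "load-directly-1m"
    return "rag-plus-medium-context"
```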
06 — Four Trends
The four operating trends teams should calibrate against.
Stepping back from individual releases, four operating trends defined how H1 2026 changed the substrate teams build on. None of them are particular to any single lab, and all four will shape H2 architectural decisions more than any single benchmark gain.
1. Reasoning-effort routing replaced model selection
In H1 2025, the interesting question was "which model do I pick for this task". In H1 2026, the more interesting question is "within this model family, which effort tier do I route this task to". Every major lab now ships a low/medium/high or non-think/think-high/think-max knob, and the cost-capability surface within a single model family often spans an order of magnitude. Teams that have rebuilt their routing layer around effort tiers (rather than model IDs) are reporting 30 to 50% cost reductions on like-for-like quality.
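A minimal sketch of what an effort-tier router can look like, assuming the low/medium/high convention and illustrative relative-cost multipliers spanning roughly the order of magnitude mentioned above; the difficulty estimate is whatever cheap signal your pipeline already has:

```python
# Illustrative effort-tier router within one model family. The relative-cost
# multipliers are assumed, chosen to span roughly an order of magnitude.
EFFORT_TIERS = {
    "low":    {"relative_cost": 1.0,  "use_for": "extraction, classification, short rewrites"},
    "medium": {"relative_cost": 3.0,  "use_for": "multi-step reasoning, code edits"},
    "high":   {"relative_cost": 10.0, "use_for": "hard debugging, proofs, long agent plans"},
}

def pick_effort(task_difficulty: float) -> str:
    """Map a 0-1 difficulty estimate (from a heuristic or a cheap classifier) to a tier."""
    if task_difficulty < 0.3:
        return "low"
    if task_difficulty < 0.7:
        return "medium"
    return "high"
```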
2. One-million-token context became economically viable
DeepSeek V4 Preview's hybrid CSA+HCA attention, Gemini 3.1 Pro's native-1M pricing, and Claude Opus 4.7's differentiated 1M tier collectively ended the 1M-as-luxury era. For specific workload classes, the architecture decision shifted from "build RAG" to "load the whole thing and let the model figure it out." For most workloads, RAG still wins above roughly 400k, but the threshold moved up meaningfully across the half.
3. Structured outputs hit production-grade reliability
Schema adherence rates above 99.5% across labs made JSON-mode and tool-calling the assumed substrate rather than a fragile feature. Validators around model output became optional rather than load-bearing. The implication for application architecture: building on top of structured-output guarantees no longer requires the defensive-coding overhead that defined 2024-era AI integration.
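Concretely, the validator demotes from a repair loop to a cheap tripwire. A minimal sketch using the `jsonschema` package; the invoice schema and sample payload are made up for illustration:

```python
# Validation as a cheap tripwire rather than a load-bearing repair loop.
# The schema and sample payload are made up for illustration.
import json
import jsonschema  # pip install jsonschema

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_id", "total"],
}

def parse_invoice(model_output: str) -> dict:
    """Parse a structured-output response; validate() catches the rare schema miss."""
    data = json.loads(model_output)
    jsonschema.validate(instance=data, schema=INVOICE_SCHEMA)
    return data

print(parse_invoice('{"invoice_id": "INV-001", "total": 120.5}'))
```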
4. Agent loops became a native API primitive
Multi-turn tool-calling with parallel execution, interleaved reasoning across user turns, and crash-safe pause-resume moved from custom scaffolding into the SDK surface. The framework conversation shifted accordingly — agent frameworks that re-implemented loop semantics outside the model API lost ground to thin SDK wrappers that delegated loop state to the lab's own infrastructure.
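The resulting application code tends to look like a thin wrapper: tools are defined and owned locally, and the loop itself is handed to the provider. The sketch below uses a deliberately generic `client.agents.run` call as a stand-in; it is not any lab's actual SDK method, just the shape of the pattern:

```python
# Thin-wrapper pattern: tools are defined and owned in application code, while loop
# state (turn sequencing, parallel tool calls, pause/resume) is delegated upstream.
# `client.agents.run` is a deliberately generic, assumed interface, not a real SDK method.
def lookup_order(order_id: str) -> dict:
    """Application-owned tool; the provider only needs its schema and a callable."""
    return {"order_id": order_id, "status": "shipped"}

TOOLS = [{
    "name": "lookup_order",
    "description": "Fetch order status by ID",
    "handler": lookup_order,
}]

def answer(client, user_message: str) -> str:
    # One call: the provider runs the multi-turn tool loop and returns the final text.
    result = client.agents.run(model="frontier-model", tools=TOOLS, input=user_message)
    return result.output_text
```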
07 — H2 Projection
What we expect H2 2026 to commoditize.
Forecasts in this space age badly, so we'll keep the H2 projection short, grounded, and explicit about confidence. Three things we expect with reasonable confidence, two we're less sure about, and one thing we'd be surprised to see.
Expected with reasonable confidence
Per-token pricing falls another 30 to 50% across the half. The mechanic that drove H1's price compression — open-weight pressure plus cached-input and structured-output discounts — accelerates rather than slows. DeepSeek V4 going to full release (not Preview), Anthropic and OpenAI both shipping further efficiency generations, and a probable Gemini 3.2 release together push per-intelligence-unit cost down meaningfully.
Agent loops become the dominant programming model. Teams building "chat with model" integrations in H2 will increasingly find themselves out of step with the surrounding ecosystem — the SDK surfaces, framework patterns, and observability tooling are all converging on multi-turn tool-using agents as the default. One-shot LLM calls remain valid for narrow tasks but stop being the default mental model.
Multi-lab routing becomes table stakes. The inter-lab spread on long-context retrieval, multimodal handling, and price-per-effort-tier persists through H2. Production stacks with serious scale will route across at least two labs by year-end, and the routing layer itself becomes a category of tooling (rather than a thing teams build from scratch).
Less certain
A new benchmark generation displaces MMLU-Pro and GPQA. Ceiling effects on existing benchmarks are starting to bite, and a successor generation (frontier math, ARC-AGI-2, multi-step agent harnesses) is gaining attention. Whether they consolidate into a new canon during H2 is the open question.
Open-weight closes the frontier gap further. DeepSeek V4 narrowed the open-weight gap to roughly 3 to 6 months behind closed frontier. Whether that gap stays stable, narrows further (open catches up), or widens (closed pulls away) depends on whether other open labs release at V4's ambition during H2. Our central case is gap-stable; the surprise direction would be open closing within 2 to 3 months by year-end.
Would be surprised to see
A meaningful new lab entering the frontier race during H2. The capital, compute, and talent concentration required to ship a frontier model has consolidated to a handful of labs, and the path from announcement to production frontier is multi-quarter even with full funding. The frontier race during H2 stays among the labs already on the field, with the action being how they differentiate within a converged capability ceiling rather than who else joins.
The headline framing for H2 planning: H1 2026 was the year frontier capabilities converged on a common substrate. H2 2026 will be the year those capabilities commoditize — pricing compresses further, routing across labs becomes standard, and the differentiation moves from "what can the model do" to "how cheaply can the architecture do it at scale." That commoditization is good for application builders and harder for labs trying to maintain pricing power on raw capability.
"H1 2026 was the year frontier capabilities converged on a common substrate; H2 will be the year those capabilities commoditize."— Our H2 2026 forecast, low-confidence projection
For teams planning their H2 model strategy, the right move is to calibrate against the trend lines rather than the current snapshot. The architecture that performs best in May 2026 may be roughly 30% more expensive than the architecture that performs best in November — that's the trajectory implied by the H1 data. Building for commoditization (effort-tier routing, multi-lab fallback, structured-output-first contracts, agent-loop-native patterns) is what positions a stack to take advantage of H2 pricing rather than be caught flat-footed by it. If you're sizing that work, our team's AI digital transformation engagements start with exactly this calibration. And if you want the deepest cut on the migration mechanics for one specific upgrade, our Claude Opus 4.6 to 4.7 migration playbook covers breaking changes in detail.
H1 2026 was the year frontier capabilities converged — H2 will be the year they commoditize.
Six months, four labs, twenty-plus releases. The data behind the half tells a more coherent story than any individual release does: reasoning-effort routing became default, one-million-token context became economically viable, structured outputs hit production-grade reliability, and agent loops became a native API primitive. Together those four trends rewired the substrate that production AI architectures build on.
The honest framing for teams calibrating their H2 strategy: capability converged during H1, and commoditization is the most likely H2 storyline. Per-token pricing keeps falling, multi-lab routing becomes table stakes, and the differentiation between labs moves from raw benchmark wins to ergonomics, latency, and the cost-per-effort-tier surface. Application architectures that internalise that shift early position themselves to take advantage of H2's pricing trajectory; those that don't pay the routing tax without capturing the efficiency upside.
The data above is a snapshot. Frontier releases keep coming, and the trend lines could surprise. The discipline this retrospective tries to model is the right one for the moment: track release cadence by lab, track benchmark deltas in absolute terms, calibrate pricing against effective production rates rather than list prices, and update the operating model when the substrate changes — not when the headline changes. H2 2026 will reward teams that hold that discipline through the noise.