AI Development · Competitive Forecast

DeepSeek V5, Llama 5, Qwen 3.5, Mistral Large 3 — the Q3 2026 open-weight competitive forecast and the gap to closed frontier.

Open-Weight Model Q3 2026 Projection: Competitive Forecast

The open-weight pack — DeepSeek, Llama, Qwen, Mistral — closed the gap to closed frontier faster than the consensus 2025 read expected. This forecast lays out the Q3 2026 trajectory for each lab, the capability deltas that will and won't close by September, the enterprise deployment split, and the hardware enablement curve that gates self-host economics.

Digital Applied Team · AI industry analysts
Published May 15, 2026 · Read time: 14 min · Sources: lab roadmaps + benchmarks
  • Models tracked: 6 (open-weight pack)
  • Forecast scenarios: 10 (probability-weighted)
  • Forecast horizon: Sep 30 (Q3 2026 close)
  • Watch-list signals: 12 (leading indicators)

Open-weight model trajectory through Q3 2026 is the most consequential competitive question facing engineering leaders this half. The pack — DeepSeek, Meta's Llama, Alibaba's Qwen, Mistral — closed the capability gap to closed frontier faster than the consensus 2025 read expected, and the Q3 cadence will determine whether the gap narrows further, holds, or widens again before year end.

The stakes are practical, not academic. Enterprises planning production deployments in H2 2026 face a binary choice that increasingly maps to a per-workload decision: self-host an open-weight model for cost and sovereignty, or pay frontier API rates for the marginal capability that closed models still hold. The forecast below shapes that decision — which workloads will cross over to open-weight by September, which will stay on closed API, and which still require holding both options open.

This forecast covers where the open-weight pack stands at Q2 end, the DeepSeek V5 release trajectory across three scenarios, the Llama 5 and Qwen 3.5 release windows, the capability gap to closed frontier broken out by workload class, the enterprise deployment split projection, the hardware enablement curve that gates self-host economics, and ten forecast scenarios with the watch-list signals that matter more than benchmarks for tracking which scenario is actually playing out.

Key takeaways
  1. DeepSeek V5 likely lands September — the headline release of the quarter. Following the V4 Preview cadence, V5 is the most probable Q3 2026 frontier-open release. Expect another step on long-context efficiency and a meaningful narrowing of the closed-frontier gap on coding and math; agentic eval is the open question.
  2. Open-weight closes the gap on coding and math — but the agentic-eval gap persists. Competitive-programming and formal-reasoning benchmarks are likely to see open-weight reach parity or lead by end-Q3. Multi-turn agentic evaluation — tool use, planning, long-horizon coherence — remains the durable capability moat for closed frontier through the half.
  3. Enterprise deployment split projected at ~40% open-weight self-host, ~60% closed-API by Q3 end. Driven by data-sovereignty workloads, cost-sensitive bulk inference, and code-automation pipelines moving on-prem. Closed-API retains generalist knowledge work, agentic orchestration, and any workload where capability headroom matters more than unit economics.
  4. Hardware enablement gates self-host economics — B200 and MI400 are the inflection. Open-weight 1M-context models become economically tractable on B200-class hardware in a way they aren't on H100. MI400 availability through H2 is the swing factor for cost-sensitive deployments. Self-host TCO is a hardware-curve question, not a software-curve question.
  5. Watch-list signals matter more than benchmarks for tracking which scenario plays out. Release-date slippage, training-token disclosure, API-pricing moves from the closed labs, hyperscaler tenancy announcements — these leading indicators tell you which forecast scenario is actually unfolding. Benchmark scores are lagging indicators by the time they publish.

01 · Q2 End Baseline: where the open-weight pack stands at the half.

The open-weight competitive landscape at Q2 end looks materially different from twelve months ago. The pack reorganised in early 2026 around four labs producing frontier-credible releases on quarterly or faster cadence — DeepSeek with V4 Preview in April, Meta with Llama 4.5 in March, Alibaba with Qwen 3 series rolling through the half, and Mistral with the Large 2.5 refresh. The long tail of smaller open releases continues to thicken, but the competitive frontier sits with that four-lab cohort.

DeepSeek V4 Preview, covered in detail in our V4 launch analysis, is the strongest open-weight signal of the half — frontier-class on coding and formal reasoning, three-to-six months behind on generalist knowledge, and a hybrid attention architecture that makes 1M context economically tractable rather than aspirational. That release reset the open-weight benchmark for Q3 entrants.

Llama 4.5 prioritised tool-use coherence and multi-turn agent reliability over raw benchmark headlines. Qwen 3 doubled down on multilingual capability and the long-context retrieval profile that matters for enterprise RAG. Mistral Large 2.5 held its Europe-centric positioning with strong instruction following and a permissive license posture that still beats most of the pack on commercial-friendliness. The four labs increasingly differentiate on capability axis rather than competing head-to-head on the same benchmark suite.

The Q2 end picture, in one paragraph
The open-weight pack ends Q2 2026 with four labs producing frontier-credible releases on quarterly cadence, differentiated by capability axis rather than head-to-head benchmark competition. DeepSeek leads on long-context efficiency and competitive programming; Meta on tool-use and agent reliability; Alibaba on multilingual and long-context retrieval; Mistral on instruction-following and license posture. The closed-frontier gap has narrowed across every workload class, but persists unevenly — small on coding and math, durable on agentic evaluation and generalist knowledge.

The competitive question for Q3 isn't whether any single open release will leapfrog closed frontier — that's a misframing of the trajectory. The question is which capability axes will see the open pack reach parity by September, which will narrow but not close, and which will stay durably behind. The forecast below addresses each axis explicitly rather than collapsing the question into a single "open vs closed" framing.

02 · DeepSeek V5 trajectory: three scenarios for the headline release.

DeepSeek V5 is the single most probable Q3 2026 frontier-open release. The lab's cadence — V3 in late 2025, V3.1 / V3.2 refreshes through the first quarter of 2026, V4 Preview in April — suggests a V5 Preview window in the August-to-September range with non-trivial probability of slipping into October. The three scenarios below cover the range of plausible Q3 outcomes.

Scenario weighting reflects our reading of the V4 paper's own framing — the lab signals where the next architectural and efficiency moves are heading and how aggressively they're being pursued. Each scenario carries forward implications for enterprise deployment decisions and the closed-frontier gap.

Scenario A · 40%
Aggressive V5 in September
September Preview · frontier-class agentic

V5 Preview lands by mid-September with material agentic-evaluation gains layered on the V4 hybrid attention stack. Codeforces and LiveCodeBench leadership extends; agentic eval narrows to within striking distance of closed frontier. Enterprise self-host economics improve another step.

Most probable single outcome
Scenario B · 35%
Conservative V4 refresh
V4.1 / V4.2 only · no V5 in Q3

DeepSeek ships V4.1 and V4.2 refreshes through Q3 — incremental efficiency and post-training improvements without architectural step-change. V5 deferred to Q4 / early 2027. Coding-and-math lead narrows further; agentic-eval gap holds at ~3-6 months behind closed frontier.

Slippage-driven outcome
Scenario C · 25%
V5 + multimodal
September V5 + native vision

V5 ships with native multimodal capability — the open-weight pack's standing weakness against closed frontier. Adds vision and document-grounded reasoning to the V4 efficiency story. Highest-impact but least probable outcome; would reset the enterprise deployment-split projection sharply.

Upside surprise scenario

The probability weights are inherently soft — they reflect our reading of release cadence, architectural signaling in the V4 paper, and the lab's public roadmap commentary. The honest interpretation is that Scenario A and Scenario B together absorb roughly three-quarters of plausible outcomes; the multimodal scenario is the genuine open question and the one with the largest downstream impact on enterprise deployment decisions if it materialises.

For teams planning around DeepSeek specifically, the operational implication is to scenario-plan against A and B rather than betting either way. Architecture and post-training improvements in scenarios A and B are sufficient to justify continued open-weight investment for coding-automation pipelines and long-document RAG. Scenario C — if it materialises — accelerates the multimodal workload class crossing over to open-weight, but the planning posture for that is "maintain readiness" rather than "commit infrastructure now".

"V5 is the most probable Q3 frontier-open release. Coding and math lead extends. Agentic eval is the open question."— Digital Applied open-weight forecast working notes, May 2026

03 · Llama 5 and Qwen 3.5: release windows and capability bets.

Beyond DeepSeek, the other two probable Q3 releases come from Meta and Alibaba. Both labs hold meaningful share of the open-weight installed base and both have signalled next-generation releases for the second half of 2026. The release windows and capability bets differ enough that a single forecast collapses too much detail; the breakdown below treats them separately.

Llama 5 · expected window late Q3 / early Q4

Meta's release cadence on the Llama family has stretched from roughly quarterly in the 2024-2025 window to closer to half-yearly through 2026. The 4.x line shipped in March; a 5.0 release in late Q3 is the central case, with non-trivial probability of slipping into Q4. Capability bets we expect to see emphasised: continued agent reliability and tool-use coherence (the axis Llama 4.5 already leads on within the open pack), native multimodal, and meaningful long-context improvements that close some of the distance to DeepSeek V4's efficiency story.

Qwen 3.5 · expected window mid-Q3

Alibaba's release cadence on Qwen 3 has been faster — the series has rolled through revisions across H1 2026 with strong multilingual and long-context retrieval profiles. A 3.5 step in mid-Q3 is the central case. Capability bets to expect: multilingual leadership extension, RAG-pipeline-friendly long-context characteristics, and continued aggressive open-source licensing that beats most of the pack on commercial use terms.

Mistral Large 3 · expected window late Q3

Mistral's release cadence has been more conservative; a Large 3 step in late Q3 is plausible but lower-probability than the Llama 5 and Qwen 3.5 estimates. The capability bet remains European-data-sovereignty positioning, strong instruction following, and the most commercially-friendly license posture in the pack.

Llama 5
Q3/Q4
Agent reliability + multimodal

Late-Q3 to early-Q4 release window. Capability bets: tool-use coherence extension, native multimodal, long-context improvements. Meta retains the largest open-weight installed base; Llama 5 reception shapes the enterprise self-host base rate for H2.

Half-yearly cadence
Qwen 3.5
Mid-Q3
Multilingual + RAG long-context

Mid-Q3 release window. Capability bets: multilingual leadership extension, RAG-pipeline-friendly retrieval characteristics, continued aggressive open licensing. Strongest open-weight option for non-English-primary enterprise deployments.

Fast iteration cadence
Mistral Large 3
Late Q3
Sovereignty + license posture

Late-Q3 release window, lower probability than Llama 5 / Qwen 3.5. European-data-sovereignty positioning, strong instruction following, most commercially-friendly license posture. Niche but durable share of the pack.

Conservative cadence

The cross-cutting observation is that the four labs are increasingly competing on capability axis rather than head-to-head benchmark scores. A Q3 calendar that includes V5, Llama 5, Qwen 3.5, and possibly Mistral Large 3 produces an open-weight pack that's plurally strong across coding, multilingual, agent reliability, and license posture — without any single release displacing closed frontier as the generalist default.

04 · Frontier Gap: where the closed-frontier gap closes and where it persists.

The closed-frontier gap is not a single number — it's a distribution across workload classes. The matrix below breaks the Q3 2026 projection out by four capability axes: coding, math, agentic evaluation, and long-context retrieval. Each axis carries different projections, different watch-list signals, and different implications for enterprise deployment decisions. Treating the gap as uniform is the most common analytical mistake in open-vs-closed framing.

Coding
Gap closes by Q3 end

V4-Pro-Max already leads on LiveCodeBench and Codeforces. V5 likely extends. By September, open-weight is the credible default for competitive-programming-style tasks and matches closed frontier on most pragmatic code-automation pipelines.

Move code-automation workloads
Math
Gap closes — open-weight leads on formal proofs

V4-Pro-Max scored 120/120 on Putnam-2025. Open-weight is competitive or leading on formal-reasoning evals through Q3. The remaining gap is on under-specified, real-world quantitative reasoning where closed-frontier post-training still holds an edge.

Move formal-reasoning workloads
Agentic eval
Gap persists through Q3

Multi-turn agent coherence, tool-use reliability, long-horizon planning — the closed-frontier pack still leads here by a meaningful margin. Llama 5 narrows the gap on tool use; V5 may narrow on planning. Full parity is unlikely before Q4 / 2027.

Stay with closed frontier
Long-context
Gap narrows — workload-specific verdict

Open-weight efficiency improvements (V4 hybrid attention) make 1M-context economically tractable for self-host. Closed-frontier still leads on MRCR-style retrieval accuracy. The verdict is per-workload — efficiency-bound deployments move open, accuracy-bound stay closed.

Pick per workload

The asymmetry is the planning insight. By Q3 end, the open-weight pack likely reaches parity or leads on coding and math; narrows but doesn't close on long-context retrieval; persists 3-to-6 months behind on agentic evaluation. That distribution implies specific workload classes can credibly cross over to open-weight in H2 (code automation, formal reasoning, efficiency-bound long-context RAG) while others should plan to stay closed-API (multi-turn agentic orchestration, accuracy-critical retrieval, generalist knowledge work).

For engineering leaders building production architecture, the recommended posture is multi-vendor routing rather than vendor lock-in either direction. Route by task class — competitive programming and formal reasoning to open-weight, agentic orchestration to closed frontier, long-context per-workload — and re-baseline quarterly as the gap distribution shifts. We cover the multi-vendor routing pattern in detail in the frontier model Q3 release forecast companion piece.
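To make the routing pattern concrete, here is a minimal Python sketch of a routing table keyed by task class. The task classes, provider labels, and model identifiers are illustrative assumptions rather than a prescribed stack; the shape of the table, and the per-workload branch for long-context, is the point.

```python
from dataclasses import dataclass
from enum import Enum, auto

class TaskClass(Enum):
    CODE_AUTOMATION = auto()    # codegen and competitive-programming-style tasks
    FORMAL_REASONING = auto()   # proof-style math, verification
    AGENTIC = auto()            # multi-turn tool use, planning
    LONG_CONTEXT_RAG = auto()   # long-document retrieval
    GENERALIST = auto()         # broad knowledge work

@dataclass
class Route:
    provider: str    # "open-weight-selfhost" or "closed-api"
    model: str       # placeholder model identifier
    rationale: str

# Q3 2026 routing table mirroring the gap distribution above.
# Model identifiers are placeholders; re-baseline this table quarterly.
ROUTING_TABLE = {
    TaskClass.CODE_AUTOMATION:  Route("open-weight-selfhost", "deepseek-v4", "parity or lead on coding evals"),
    TaskClass.FORMAL_REASONING: Route("open-weight-selfhost", "deepseek-v4", "open-weight leads on formal proofs"),
    TaskClass.AGENTIC:          Route("closed-api", "frontier-agentic", "agentic-eval gap persists"),
    TaskClass.GENERALIST:       Route("closed-api", "frontier-general", "generalist-knowledge gap persists"),
}

def route(task: TaskClass, accuracy_bound: bool = False) -> Route:
    """Resolve a task class to a provider; long-context is decided per workload."""
    if task is TaskClass.LONG_CONTEXT_RAG:
        # Efficiency-bound long-context moves open; accuracy-bound stays closed.
        if accuracy_bound:
            return Route("closed-api", "frontier-longctx", "MRCR-style accuracy lead")
        return Route("open-weight-selfhost", "deepseek-v4-1m", "1M context tractable on self-host")
    return ROUTING_TABLE[task]
```

The quarterly re-baseline then reduces to editing one table rather than re-architecting the serving layer.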

05 · Enterprise deployment split: ~40/60 by Q3 end.

The aggregate enterprise deployment split — what share of production AI inference runs on open-weight self-host versus closed-API — is one of the more measurable forecast outputs of this analysis. Our Q3 2026 projection sits at roughly 40% open-weight self-host and 60% closed-API, up from a roughly 25/75 split at Q1 end. The drivers are concrete: data sovereignty requirements, cost-sensitive bulk inference workloads, and code-automation pipelines moving on-prem as open-weight closes the coding gap.

Enterprise inference deployment split · Q3 2026 projection

Source: Digital Applied enterprise AI deployment forecast, May 2026
  • Open-weight self-host · projected Q3 end: ~40% (data-sovereignty workloads, bulk-inference cost optimisation, code-automation on-prem)
  • Closed-API · projected Q3 end: ~60% (agentic orchestration, generalist knowledge work, accuracy-critical retrieval)
  • Open-weight self-host · Q1 2026 baseline: ~25% (where the split stood three months ago; open-weight share growing roughly 5pp / quarter)
  • Hybrid deployments · share of enterprise base: ~55% (multi-vendor routing, with both open and closed in production simultaneously, is the modal pattern)

The hybrid-deployment number — roughly 55% of enterprise installations running both open and closed in production simultaneously — is the operational story underneath the headline split. The modal enterprise pattern is no longer "pick one vendor"; it's multi-vendor routing by task class. That pattern is the practical implementation of the workload-specific gap distribution from Section 04: route coding-automation and formal-reasoning to open-weight, agentic orchestration to closed frontier, long-context per-workload.

Three drivers anchor the open-weight share growth through H2. First, data sovereignty obligations expand under EU AI Act enforcement and sector-specific regulatory frameworks — pushing healthcare, financial services, and public-sector workloads toward self-host. Second, the cost-sensitivity gap widens as bulk-inference workloads (content generation, embedding pipelines, internal-tool assistants) crystallise into stable volume that justifies dedicated infrastructure. Third, the hardware enablement curve (Section 06) makes self-host TCO decisively favourable for the right workloads on B200-class hardware in a way it isn't on H100.

For engineering leaders, the practical implication is to plan for a hybrid architecture by default. Build the routing layer, the cost-attribution model, and the deployment-decision framework this half regardless of which vendor mix you end up running. Our AI transformation engagements cover exactly this architecture pattern — open vs closed deployment modelling, capability-fit analysis, self-host economics, and the migration paths between them.

The deployment split asymmetry
The 40/60 projection is the aggregate — the inference-volume share across enterprise production deployments. The spend split runs differently because closed-frontier pricing carries premium per token. Open-weight self-host likely absorbs ~40% of inference volume by Q3 end but only ~25-30% of inference spend. The gap is the unit-economics dividend that motivates the migration in the first place.
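The volume-versus-spend arithmetic is worth making explicit. A minimal sketch, assuming an illustrative self-host unit cost of roughly 55% of the closed-API rate (a placeholder ratio, not a measured figure):

```python
def spend_share(volume_share_open: float, cost_ratio: float) -> float:
    """Open-weight share of inference *spend*, given its share of inference
    *volume* and its unit cost as a fraction of the closed-API rate."""
    v = volume_share_open
    return (v * cost_ratio) / (v * cost_ratio + (1 - v))

# 40% of volume at ~55% of the closed-API unit cost (placeholder ratio)
# lands near the ~27% spend share implied by the ~25-30% band above.
print(f"{spend_share(0.40, 0.55):.0%}")  # -> 27%
```

Run the same function against your own cost ratio to see where your spend split lands relative to your volume split.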

06 · Hardware enablement: B200 and MI400 gate self-host economics.

Self-host economics is a hardware-curve question, not a software-curve question. The open-weight efficiency improvements from V4-class hybrid attention are necessary but not sufficient — the actual TCO crossover relative to closed-API depends on which generation of accelerator you're running the inference on. The Q3 2026 hardware picture has three distinct curves running simultaneously.

Nvidia B200 · Inflection
Production availability · Q3 ramp

B200-class hardware is the inflection point for open-weight self-host economics. Open-weight 1M-context models become economically tractable on B200 in a way they aren't on H100. By Q3 end, B200 capacity supports the deployment-split projection.

Primary enabler
AMD MI400 · Swing factor
Selective availability · cost-sensitive workloads

MI400 availability through H2 is the swing factor for cost-sensitive open-weight deployments. Where MI400 capacity is accessible, the self-host TCO crossover comes earlier and at a sharper angle. Capacity constraints remain the operational unknown.

Cost lever
Nvidia H100 generation · Floor
Legacy installed base · constraining curve

H100 remains the installed-base reality for most enterprise GPU capacity through Q3. Open-weight self-host on H100 is workable but the economics narrow versus closed-API; the strongest case is data sovereignty rather than pure unit cost. H100 anchors the lower bound of the curve.

Installed base

The practical reading is that the open-weight deployment-split projection in Section 05 is gated on B200 ramp and selective MI400 availability through Q3. A scenario where B200 capacity ramps slower than expected — supply constraints, hyperscaler allocation politics, or yield issues — pushes the deployment split back toward the closed-API side regardless of how strong the V5 release is. The hardware curve is the binding constraint, not the model-release calendar.

For engineering leaders evaluating self-host commitments, the decision framework should anchor on hardware-access realism. Map your inference workloads against the hardware generation you can credibly secure capacity on through Q3 and Q4. If your B200 access is uncertain, the self-host case narrows; if your H100 inventory is sunk capital, the self-host case is real but constrained to specific workload classes. The hardware question is the under-discussed half of the open-vs-closed framing.
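A minimal sketch of that hardware-access framing: compare self-host cost per million tokens across the generations you can actually secure, against your closed-API rate. Every number below is a placeholder assumption to be replaced with real quotes and measured throughput for your model and context length.

```python
from dataclasses import dataclass

@dataclass
class SelfHostOption:
    name: str
    monthly_node_cost: float   # amortised hardware + power + ops, USD/month
    tokens_per_month: float    # sustained throughput at realistic utilisation

    def cost_per_mtok(self) -> float:
        """Self-host cost per million tokens at this node's throughput."""
        return self.monthly_node_cost / (self.tokens_per_month / 1e6)

# Placeholder figures for illustration only.
options = [
    SelfHostOption("H100 node", monthly_node_cost=25_000, tokens_per_month=4e9),
    SelfHostOption("B200 node", monthly_node_cost=35_000, tokens_per_month=12e9),
]
closed_api_per_mtok = 5.00  # placeholder closed-API rate, USD per million tokens

for opt in options:
    c = opt.cost_per_mtok()
    verdict = "beats closed-API" if c < closed_api_per_mtok else "loses to closed-API"
    print(f"{opt.name}: ${c:.2f}/Mtok ({verdict} at ${closed_api_per_mtok:.2f}/Mtok)")
# Higher throughput per node is what moves the B200 line below the API rate:
# the same workload that is marginal on H100 clears the crossover on B200.
```

The design point is that the verdict flips on throughput per dollar of node cost, not on model quality, which is why the hardware curve binds independently of the release calendar.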

"Self-host economics is a hardware-curve question. B200 is the inflection. MI400 is the swing factor. H100 is the floor."— Digital Applied open-weight hardware forecast, May 2026

07 · Scenarios + Watch List: ten Q3 scenarios and the watch-list signals that tell you which is unfolding.

The bars below are ten probability-weighted scenarios for how the open-weight competitive picture resolves by Q3 2026 close. The weights are inherently soft — the value is in the scenario range, not in any single probability estimate. After the scenarios, the watch-list section catalogues the twelve leading indicators that tell you which scenario is actually playing out in real time, faster than benchmark scores can confirm it.

Ten Q3 2026 open-weight scenarios · probability-weighted

Source: Digital Applied Q3 2026 open-weight forecast · probability weights are scenario anchors, not predictions
  • S01 · DeepSeek V5 ships September with agentic gains · ~38%. Most probable single outcome — central case for the forecast.
  • S02 · V4.x refreshes only, no V5 in Q3 · ~32%. Slippage scenario — V5 deferred to Q4 / early 2027.
  • S03 · Llama 5 ships late Q3 with strong agentic eval · ~30%. Meta closes the agentic-eval gap before DeepSeek.
  • S04 · Qwen 3.5 ships mid-Q3, extends multilingual lead · ~55%. Highest-probability non-DeepSeek release of the quarter.
  • S05 · Enterprise deployment split reaches 40/60 by Sep 30 · ~50%. Central forecast for the aggregate inference-volume split.
  • S06 · B200 capacity meets demand through Q3 · ~42%. Hardware enablement on track — does not constrain the deployment split.
  • S07 · Open-weight reaches coding-benchmark parity · ~60%. By Q3 end on LiveCodeBench, Codeforces, and formal-proof evals.
  • S08 · Agentic-eval gap closes meaningfully · ~20%. Lower-probability — the closed-frontier pack likely retains its lead.
  • S09 · Closed-frontier prices drop to defend share · ~45%. OpenAI / Anthropic / Google price moves through Q3.
  • S10 · Native multimodal lands in one Q3 open release · ~28%. Highest-impact upside surprise — would reset the deployment split.

The scenarios are not mutually exclusive — most plausible Q3 outcomes are a combination (S01 + S04 + S07 + S09 is a coherent scenario bundle, for example). The value of the breakout is that it identifies which specific events have the largest implications for enterprise deployment decisions, so engineering leaders can track those events directly rather than waiting for aggregate benchmark publication to confirm the trajectory.
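One way to sanity-check a bundle is to multiply the anchors under a loudly simplifying independence assumption. The scenarios are in fact correlated (a September V5 makes coding parity more likely), so the product understates coherent bundles; read it as a rough floor rather than an estimate.

```python
from math import prod

# Probability anchors from the scenario chart (soft anchors, not predictions).
P = {"S01": 0.38, "S02": 0.32, "S03": 0.30, "S04": 0.55, "S05": 0.50,
     "S06": 0.42, "S07": 0.60, "S08": 0.20, "S09": 0.45, "S10": 0.28}

def bundle_floor(*scenarios: str) -> float:
    """Joint probability of a scenario bundle under an independence
    assumption. Positively correlated scenarios (e.g. S01 and S07) push
    the true joint probability higher, so treat this as a rough floor."""
    return prod(P[s] for s in scenarios)

# The S01 + S04 + S07 + S09 bundle named above:
print(f"{bundle_floor('S01', 'S04', 'S07', 'S09'):.1%}")  # ~5.6%
```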

The twelve-signal watch list

Benchmark scores are lagging indicators by the time they publish. The watch list below catalogues the twelve leading indicators that confirm which scenario is unfolding faster than the next benchmark suite can:

  • Release-date slippage signals. DeepSeek V5 technical-report preview publication. Llama 5 community preview leaks. Qwen 3.5 announcement cadence on Alibaba channels.
  • Training-token disclosure. Token-count and compute-budget figures in the V5 paper. Llama 5 paper compute claims. Qwen 3.5 multilingual coverage disclosures.
  • Closed-frontier pricing moves. OpenAI list price changes through Q3. Anthropic API pricing updates. Google Gemini pricing posture. Defensive pricing signals capability-gap erosion.
  • Hyperscaler tenancy announcements. Open-weight model offerings on AWS Bedrock, Azure AI Foundry, GCP Vertex. Hyperscaler hosting signals enterprise demand validation.
  • Hardware-capacity signals. B200 shipment guidance from Nvidia. MI400 customer wins from AMD. Hyperscaler capacity announcements for open-weight inference.
  • Agentic-eval benchmark releases. New multi-turn agent evaluation suites. Tool-use coherence benchmarks. Long-horizon planning evals.
  • Open-weight installed-base signals. Hugging Face download trends. Inference-provider model availability. Open-router routing-share shifts.
  • Regulatory-disclosure signals. EU AI Act enforcement actions against open or closed models. US sector-regulator AI guidance updates.
  • Cost-per-token benchmarks. Self-host TCO publications. Inference-provider pricing for open-weight models. Spread between open-weight self-host and closed-API equivalents.
  • License-posture signals. Commercial-use license changes from open-weight labs. Patent or IP positioning from closed-frontier labs.
  • Safety / alignment publications. Capability-and-alignment papers from open-weight labs. Defensive safety posture from closed-frontier.
  • Multimodal-capability signals. Vision and audio capability extensions in open-weight releases. Document-grounded reasoning publications.
How to use the watch list
The watch-list signals are leading indicators by design. If you track six-to-eight of them weekly through Q3, you'll know which scenario is unfolding faster than any single benchmark publication can confirm. The signals that matter most for your deployment decision depend on your workload mix — code-automation teams should weight agentic-eval and coding-benchmark signals; bulk inference teams should weight hardware-capacity and cost-per-token signals; sovereignty-bound teams should weight regulatory and license-posture signals.

For teams that want a structured cadence on the watch list, quarterly re-baselining is the right rhythm. Re-evaluate the ten scenarios at the next forecast publication, update the probability weights against actuals through the quarter, and adjust the deployment posture accordingly. Open-weight trajectory shifts too fast for annual planning cycles to stay calibrated; quarterly is the cadence the data supports.
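A sketch of what that weekly tracking loop can look like in practice. The signal-to-scenario mapping, the directions, and the workload weights below are all illustrative assumptions, not a calibrated model; the value is the consistent cadence, not the specific numbers.

```python
from collections import defaultdict

# Which scenarios each signal bears on, and in which direction.
# Mapping and directions are illustrative assumptions.
SIGNAL_TO_SCENARIOS = {
    "release_slippage":     [("S01", -1), ("S02", +1)],  # slippage hurts S01, supports S02
    "pricing_moves":        [("S09", +1)],
    "hardware_capacity":    [("S05", +1), ("S06", +1)],
    "agentic_eval_results": [("S03", +1), ("S08", +1)],
    "cost_per_token":       [("S05", +1)],
}

# Workload-mix weights: e.g. a bulk-inference team up-weights hardware and cost.
WORKLOAD_WEIGHTS = {"release_slippage": 1.0, "pricing_moves": 0.5,
                    "hardware_capacity": 1.5, "agentic_eval_results": 1.0,
                    "cost_per_token": 1.5}

def weekly_update(observations: dict) -> dict:
    """Fold a week of signal readings (-1..+1, evidence that the signal is
    firing) into per-scenario scores. A heuristic tracker, not a Bayesian
    update; the consistent weekly cadence is the point, not precision."""
    scores = defaultdict(float)
    for signal, reading in observations.items():
        for scenario, direction in SIGNAL_TO_SCENARIOS.get(signal, []):
            scores[scenario] += WORKLOAD_WEIGHTS.get(signal, 1.0) * direction * reading
    return dict(scores)

# Example week: a V5 preview paper appears (slippage evidence drops sharply),
# while B200 shipment guidance softens (capacity evidence weakens).
print(weekly_update({"release_slippage": -0.8, "hardware_capacity": -0.4}))
# -> roughly {'S01': +0.8, 'S02': -0.8, 'S05': -0.6, 'S06': -0.6}
```

At the quarterly re-baseline, the accumulated scores give you a paper trail for which probability weights to move and why.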

Conclusion

Open-weight Q3 2026 rewards teams who model both deployment paths.

The Q3 2026 open-weight forecast resolves to a clear practical recommendation: model both deployment paths in your H2 architecture. The gap to closed frontier is closing unevenly — small on coding and math, durable on agentic evaluation — which implies a multi-vendor routing pattern rather than a vendor commitment either direction. The teams that win H2 will be the ones who built the routing layer, the cost-attribution model, and the deployment-decision framework before their next quarterly re-baseline rather than during it.

The DeepSeek V5 trajectory is the headline release of the quarter — most probable in September, with a non-trivial chance of slipping to Q4. The Llama 5 and Qwen 3.5 releases shape the non-DeepSeek share of the open-weight pack. The hardware enablement curve (B200 inflection, MI400 swing) gates self-host economics independently of the model-release calendar. The enterprise deployment split projection of ~40% open-weight self-host by Q3 end is anchored on those three trajectories playing out roughly as the central case suggests.

The recommended posture for engineering leaders is multi-vendor routing by default, quarterly re-baselining of the forecast, and disproportionate investment in the workload-specific decision framework rather than vendor-specific infrastructure. Open-weight Q3 2026 doesn't reward bets — it rewards modelling. The teams who treat the deployment decision as a per-workload analysis with explicit watch-list signals are the ones who will be operating on the right vendor mix when the actuals land.

Model both paths

Q3 2026 open-weight rewards teams who model both paths.

Our team helps engineering leaders model open-weight self-host vs closed-API economics, capability fit, and migration paths.

Free consultation · Expert guidance · Tailored solutions
What we work on

Open-weight forecast engagements

  • Open vs closed deployment modelling
  • Capability-fit analysis
  • Self-host economics modelling
  • Migration path design
  • Quarterly forecast re-baseline
FAQ · Open-weight forecast

The questions engineering leaders ask before picking open vs closed.

How do you measure the gap to closed frontier?
We treat the gap as a distribution across four workload classes rather than a single number — coding, math, agentic evaluation, and long-context retrieval. Each axis is measured against the strongest publicly-evaluated mode of each closed-frontier model on the canonical public benchmark suite for that axis (LiveCodeBench and Codeforces for coding, Putnam-style proof grading and AIME for math, multi-turn agent benchmarks and tool-use coherence suites for agentic evaluation, MRCR and long-document retrieval evals for long-context). The Q3 projection reads each axis separately: coding and math likely reach parity or open-weight leadership by September, long-context narrows but stays workload-specific, agentic evaluation persists three-to-six months behind through the half. Collapsing the four axes into a single gap number is the most common analytical mistake in open-vs-closed framing — different workloads see different gaps and require different deployment decisions.