AI Development · Quarterly Report · 11 min read · Published May 11, 2026

Three families, fifteen releases, benchmark closure with frontier — the half open-weight became enterprise-grade.

Open-Weight Models H1 2026 Retrospective: DeepSeek, Qwen, Llama

Three families — DeepSeek, Qwen, Llama — shipped fifteen tracked releases between January and May 2026. Open-weight closed the gap with closed frontier on code and reasoning, hosting cost dropped via vLLM, SGLang, and Cerebras, and sovereign-cloud deployment patterns moved from theory to production.

Digital Applied Team
Senior strategists · Published May 11, 2026 · Read time: 11 min · Sources: vendor reports + community evals
  • Major families tracked: 3 (DeepSeek · Qwen · Llama)
  • Releases counted: 15+ (Jan–May 2026; +9 vs H2 2025)
  • Benchmark closure on code: within 5% (open vs closed frontier)
  • H2 horizon: 6 months (Jun–Dec 2026 forecast)

Open-weight models in H1 2026 stopped being the experimentation tier and started being the production tier. Three families — DeepSeek, Qwen, Llama — shipped fifteen tracked releases between January and the second week of May, closed the benchmark gap with closed frontier on code and reasoning, and pushed hosting cost down by roughly an order of magnitude on the leading inference stacks.

What changed isn't one model. It's a cadence. DeepSeek published V3.1 in January, V3.2 in February, V3.2-Long in March, and V4 Preview in late April. Qwen shipped three Qwen 3 base releases plus coder, math, and vision variants. Meta's Llama 4 series moved from preview to general availability with a sovereign-deployment licence track. By May the question for most enterprise teams was no longer "is open-weight good enough" — it was "which open-weight, which hosting stack, and which workloads stay closed."

This retrospective compiles the H1 2026 release inventory, the benchmark closure picture against closed frontier, the four trend lines defining the half, and an explicit forecast for H2. Every number cited is sourced from a published vendor report or community evaluation; we don't fabricate metrics and we don't pretend the open-versus-closed gap has fully closed.

Key takeaways
  1. Release cadence acceleration is the defining signal. Fifteen tracked releases across DeepSeek, Qwen, and Llama in roughly four and a half months — DeepSeek alone shipped four major checkpoints, Qwen three base releases plus six variants, and Llama 4 moved from preview to GA. Cadence reflects competitive pressure, not novelty for its own sake.
  2. Benchmark closure with closed frontier on code and reasoning. On LiveCodeBench, Codeforces, and Putnam-style formal reasoning, the strongest open-weight model is now within roughly 5% of the strongest closed-frontier model — and in several cases ahead. General knowledge and the hardest retrieval workloads still trail by 3 to 6 months.
  3. Hosting cost reductions via vLLM, SGLang, and Cerebras. Inference cost per million tokens on leading open-weight stacks reportedly fell by roughly an order of magnitude versus H2 2025. vLLM and SGLang absorbed most of the open-weight production volume; Cerebras pushed latency-sensitive workloads into a new price band.
  4. Enterprise adoption velocity went vertical. Hyperscaler integrations, sovereign-cloud picks, and procurement-ready licence tracks made open-weight viable for finance, healthcare, and public-sector buyers in the half. Adoption is still concentrated in code automation and document agents — generalist replacement remains rare.
  5. Sovereign-cloud deployment patterns became standard. EU, UK, Middle East, and APAC buyers consolidated around a small number of deployment patterns — on-prem with vLLM, sovereign hyperscaler with Llama 4, or air-gapped clusters with quantized DeepSeek or Qwen. The pattern, not the model, is what unlocks the procurement conversation.

01 · Why Open-Weight in H1: The half open-weight became enterprise-grade.

The story of open-weight in H1 2026 is less about any single model and more about three simultaneous shifts that compounded. Cadence accelerated. Benchmarks closed. And hosting cost fell far enough that the total cost of ownership conversation flipped for specific workload classes — code automation, long-context retrieval, and sovereign-deployment scenarios.

What enterprise buyers asked about in January was "can we use open-weight for anything serious?" What they asked about in May was "which open-weight family for which workload, and which hosting partner." That is the entire shift, summarized in one sentence. The rest of this retrospective is the data behind it.

A note on methodology. The release inventory counts vendor-marked production releases — preview, base, instruct, coder, math, vision, and reasoning variants — across the three families in the window. Benchmark closure compares the strongest mode of each family's flagship against the strongest publicly evaluated mode of GPT-5.4, Claude Opus 4.6, and Gemini-3.1-Pro. Hosting cost comparisons reference vendor-published or community-measured price per million tokens on vLLM and SGLang reference implementations. We treat all numbers as directionally accurate rather than precise to the basis point.

The shift, in one line
The question stopped being "is open-weight good enough" and became "which open-weight, which hosting stack, which workloads." That's a buyer-side shift, not a vendor-side one — and it shows up most clearly in procurement conversations, not benchmark tables.

02 · DeepSeek V4 + Family: Four checkpoints, one architectural redesign.

DeepSeek shipped four major checkpoints in H1 2026 — V3.1 (January), V3.2 (February), V3.2-Long (March), and V4 Preview (April 24). The V4 launch was the most consequential open-weight release of the quarter, but the V3 series carried most of the production traffic through the half because of compatibility with existing tool-call schemas and inference setups.

The signature contribution across the family is efficiency at long context. V4-Pro at 1.6T total / 49B active uses roughly 27% of V3.2's single-token inference FLOPs and 10% of the KV cache at 1M-token context — the result of a hybrid attention stack that interleaves Compressed Sparse Attention with Heavily Compressed Attention across layers. For the full V4 architecture and benchmark breakdown, see our DeepSeek V4 Preview launch analysis.
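To make the KV-cache claim concrete, here is a back-of-envelope sizing sketch. The formula is the standard one for a dense multi-head attention cache; the layer, head, and precision numbers are illustrative assumptions, not DeepSeek's published V4 configuration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Standard multi-head attention KV cache: two tensors (K and V)
    per layer, each of shape [kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative dense baseline only -- not DeepSeek's published architecture.
dense = kv_cache_bytes(layers=64, kv_heads=16, head_dim=128,
                       seq_len=1_000_000, bytes_per_elem=2)  # bf16
print(f"dense KV cache @ 1M tokens: {dense / 2**30:.0f} GiB")       # ~488 GiB
print(f"at 10% (the V4-vs-V3.2 ratio): {dense * 0.10 / 2**30:.0f} GiB")
```

At million-token contexts the cache, not the weights, dominates per-request memory, which is why the 10% figure matters more than the FLOPs figure for hosting density.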

  • January: V3.1 Refresh (671B total · 37B active). Production-tier refresh of the V3 stack — sparse-attention tuning, improved tool-call schema, longer training run. The workhorse of the half for code-automation pipelines.
  • February: V3.2 Sparse (DeepSeek Sparse Attention v1). First production version of DeepSeek Sparse Attention — a top-k selector over compressed KV blocks. Cut inference cost on long-context workloads by roughly half versus V3.1.
  • March: V3.2-Long (extended context · long-doc tuning). Long-context fine-tuning variant of V3.2 — the bridge release before V4, extending context handling without the architectural redesign.
  • April: V4 Preview (1.6T total · 49B active · 1M ctx). Hybrid attention (CSA + HCA), three reasoning modes, On-Policy Distillation post-training. Roughly 27% of V3.2's FLOPs at 1M context — the architectural reset.

The cadence matters as much as the V4 architecture. Three V3-line refreshes in three months kept production teams able to upgrade incrementally without breaking tool-call contracts; V4 then reset the architecture without forcing a same-month migration. That sequencing — incremental for production, architectural for the future — is what made DeepSeek the dominant open-weight family for code automation through H1.

On the V4-specific numbers: V4-Pro-Max in Think Max mode hits 93.5 on LiveCodeBench (Pass@1), 3206 Codeforces rating, and a proof-perfect 120/120 on Putnam-2025. It trails Gemini-3.1-Pro on MMLU-Pro and GPQA Diamond, and trails Claude Opus 4.6 on MRCR 1M. DeepSeek's own framing — "3 to 6 months behind absolute frontier" — is the honest version and the one we recommend using when scoping eval work.

03 · Qwen 3 Family: Three base releases, six domain variants.

Alibaba's Qwen 3 family was the breadth story of H1 2026. Three base releases — Qwen 3, Qwen 3.1, and Qwen 3.5 — plus six domain variants spanning coder, math, vision, audio, embedding, and a long-context reasoning track. The family went from "competitive open-weight alternative" in January to "default open-weight choice across APAC and many EU deployments" by May.

What distinguishes Qwen 3 strategically is the variant tree. Where DeepSeek concentrated on a flagship and a smaller sibling, Qwen shipped purpose-built checkpoints per workload — Qwen 3 Coder for code-automation pipelines, Qwen 3 Math for formal reasoning, Qwen 3 VL for multimodal — each with its own benchmark profile and licence terms. For procurement-bound buyers needing to justify a model choice per workload, that tree structure simplified the conversation.

  • Base releases: 3 (Qwen 3 · 3.1 · 3.5). Three production base releases in five months — the Qwen 3 baseline in January, a 3.1 mid-Q1 refresh with improved instruction tuning, and 3.5 in early Q2 with extended-context support and a refreshed evaluation profile.
  • Variants: 6 domain checkpoints. Coder, Math, VL (vision), Audio, Embedding, and a long-context reasoning track — each tuned and licensed independently, so buyers can adopt one variant without committing to the full family.
  • APAC adoption: the default open-weight pick. Reported as the default open-weight family across APAC enterprise deployments by Q2, plus material EU pickup; sovereign-cloud partners shipped procurement-ready Qwen deployment templates.

The Qwen 3 Coder variant deserves a specific call-out. Community evaluations through April reported the Coder variant within a narrow band of DeepSeek V3.2 on most code benchmarks, with the advantage of being meaningfully cheaper to host on most inference stacks. For teams running code-automation workloads where DeepSeek V4-Pro-Max is over-provisioned, Qwen 3 Coder is the right starting point — benchmark on your own repositories before defaulting either way.

The vision variant — Qwen 3 VL — also reset the open-weight multimodal bar. On document understanding, chart reasoning, and screenshot Q&A, community-run benchmarks reportedly place it within a single-digit-percent gap of GPT-5.4 vision and Gemini-3.1-Pro multimodal. The picture for multimodal closure is less complete than for text — methodology varies more across evaluations — but the direction is consistent with the text picture.

"Qwen 3 turned open-weight from a one-flagship story into a per-workload variant tree — that's the procurement-friendly form."— Our reading of enterprise adoption patterns, Q2 2026

04 · Llama 4 Series: Sovereign-cloud-ready, hyperscaler-integrated.

Meta's Llama 4 series moved through three milestones in H1 2026 — preview release in late January, general availability in March, and a sovereign-deployment licence track in April. The family covers Llama 4 Scout, Maverick, and Behemoth (Behemoth was still in preview at the end of H1), with separate instruct and base checkpoints per size class.

What makes Llama 4 strategically distinct from DeepSeek and Qwen isn't the model itself — benchmark-wise the family sits in the same band — it's the integration footprint. Llama 4 ships with managed deployment options across all major hyperscalers, sovereign-cloud partners, and the major AI hosting providers, with procurement-ready licence terms that fit existing enterprise contracts. For finance, healthcare, and public-sector buyers, that integration is often the deciding factor.

  • Code automation: pick DeepSeek V4 or Qwen 3 Coder. The strongest open-weight signal on LiveCodeBench and Codeforces sits with DeepSeek V4-Pro-Max; Qwen 3 Coder is the cost-efficient alternative. Llama 4 trails on competitive-programming benchmarks — viable but not the lead choice.
  • Long-context retrieval: pick DeepSeek V4-Pro for on-prem. V4's hybrid attention plus 1M context makes it the strongest open-weight candidate for on-prem long-document RAG. Llama 4 is the easier procurement story; decide on workload weight versus procurement weight.
  • Sovereign-cloud workloads: pick Llama 4. Its sovereign-deployment licence track and hyperscaler-managed integrations are the smoothest procurement path for finance, healthcare, and public-sector buyers. Performance is competitive; integration is the differentiator.
  • Multimodal (vision + text): pick Qwen 3 VL. It set the open-weight multimodal bar in Q2 — within a single-digit percent of closed-frontier vision on community-run benchmarks. Llama 4 multimodal trails; DeepSeek V4 ships without first-party vision in the preview.

The takeaway for buyers is that open-weight in H1 2026 is no longer a single-model decision. It's a per-workload routing decision, where the three families occupy partly overlapping niches. For most multi-workload enterprises, the right answer is a routing layer that picks per task class — DeepSeek for code-heavy and long-context, Qwen for multimodal and cost-sensitive, Llama for sovereign and integration-bound.

Three-family routing
The end-of-H1 picture: pick DeepSeek for code-automation and 1M-context retrieval, Qwen 3 for per-variant fit and multimodal, Llama 4 for sovereign deployments and hyperscaler procurement. Most enterprise teams will run two or three of these in production by year-end — plan the routing layer now.
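In practice the routing layer can start very small. The sketch below is a minimal illustration of per-task-class routing; the task classes and model identifiers are assumptions for illustration, not fixed names.

```python
from dataclasses import dataclass

# Hypothetical model identifiers -- substitute your deployed endpoints.
ROUTES = {
    "code":       "deepseek-v4-pro",    # code automation, competitive programming
    "long_doc":   "deepseek-v4-pro",    # 1M-context retrieval / RAG
    "multimodal": "qwen3-vl",           # vision + text
    "cost_light": "qwen3-coder",        # cheap code workloads
    "sovereign":  "llama-4-maverick",   # procurement-bound deployments
}

@dataclass
class Task:
    kind: str
    payload: str

def route(task: Task, default: str = "llama-4-maverick") -> str:
    """Pick a model per task class; fall back to the integration-friendly default."""
    return ROUTES.get(task.kind, default)

print(route(Task(kind="code", payload="refactor this module")))  # deepseek-v4-pro
```

The point of starting this simple is that the route table, not the router, is the asset: every eval cycle updates the table without touching calling code.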

05 · Benchmark Closure: Within 5% of closed frontier on code and reasoning.

The table below summarizes where the strongest open-weight mode in each category lands against the strongest closed-frontier mode on the same benchmark. Scores are normalized to the leading model's score, and the open-weight absolute score is shown for each row. This is the closure picture, not a head-to-head winner table.

Open-weight vs closed frontier · benchmark closure picture
Source: aggregated from vendor reports + community evaluations, May 2026

  • LiveCodeBench (Pass@1): DeepSeek V4-Pro-Max 93.5 vs Gemini-3.1-Pro 91.7
  • Codeforces (rating): DeepSeek V4-Pro-Max 3206 vs GPT-5.4 xHigh 3168
  • Putnam-2025 (proof): DeepSeek V4-Pro-Max 120/120 (perfect score)
  • Apex Shortlist (Pass@1): DeepSeek V4 90.2 vs Gemini-3.1-Pro 89.1
  • MMLU-Pro (EM): open best 87.5 vs Gemini-3.1-Pro 91.0
  • GPQA Diamond (Pass@1): open best 90.1 vs Gemini-3.1-Pro 94.3
  • MRCR 1M (long-context): open best 83.5 vs Claude Opus 4.6 Max 92.9
  • SimpleQA-Verified: open best 57.9 vs Gemini-3.1-Pro 75.6

The pattern in that table is the H1 2026 story compressed into one view. On code generation, competitive programming, and formal reasoning, the strongest open-weight model is now at or ahead of the strongest closed-frontier model. On general knowledge, graduate-level science, and very-long-context retrieval, the closed-frontier lead persists at roughly 5–18 percentage points.

The implication for production buyers is that benchmark closure isn't uniform — it's task-class-specific. For workloads in the code and formal-reasoning band, open-weight is now a credible default. For workloads in the general-knowledge and hardest-retrieval band, closed frontier still leads by enough margin that switching purely on cost is premature. The right architectural pattern by mid-H1 is a routing layer that picks per task class.

One methodological caveat is worth stating explicitly. Vendor self-reports tend to flatter; community evaluations tend to normalize. Where vendor numbers and community numbers diverge, we've preferred the lower of the two and noted the difference. The 5% "closure" framing is a directional claim, not a precise margin — your corpus may move the picture in either direction.
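For teams reproducing this picture on their own numbers, the normalization and the conservative-merge rule are both one-liners. A minimal sketch, using rows from the table above:

```python
# (open-weight best, closed-frontier best) -- rows from the table above.
PAIRS = {
    "LiveCodeBench":     (93.5, 91.7),
    "MMLU-Pro":          (87.5, 91.0),
    "GPQA Diamond":      (90.1, 94.3),
    "MRCR 1M":           (83.5, 92.9),
    "SimpleQA-Verified": (57.9, 75.6),
}

def conservative(vendor: float, community: float) -> float:
    """Where vendor and community numbers diverge, prefer the lower."""
    return min(vendor, community)

def closure(open_score: float, closed_score: float) -> float:
    """Normalize to the leading model's score, as in the table above."""
    return open_score / max(open_score, closed_score)

for name, (open_s, closed_s) in PAIRS.items():
    print(f"{name:18} open at {closure(open_s, closed_s):6.1%} of the lead")
```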

"Closure isn't uniform — it's task-class-specific. Code and reasoning have closed. General knowledge and hardest retrieval have not."— Digital Applied, H1 2026 retrospective

06 · Trend Lines: Four structural shifts behind the half.

Below the release inventory and the benchmark table sit four trend lines that define how open-weight actually moved in the half. None of them is a single-model story; each is a structural shift in the way open-weight gets produced, hosted, and bought.

  • Trend 01 · Cadence acceleration (15+ releases, Jan–May 2026; +9 vs H2 2025). Three families collectively shipped fifteen production releases in roughly four and a half months — DeepSeek four, Qwen three base releases plus six variants, Llama through three milestones. Cadence reflects competitive pressure, not novelty.
  • Trend 02 · Hosting cost collapse (vLLM · SGLang · Cerebras; ~10× reduction). Inference cost on the leading open-weight stacks reportedly dropped by roughly an order of magnitude versus H2 2025. vLLM and SGLang absorbed most production volume; Cerebras pushed latency-sensitive workloads into a new price band.
  • Trend 03 · Sovereign-cloud standardization (EU · UK · ME · APAC; pattern-led adoption). Deployment patterns consolidated around three shapes: on-prem with vLLM, sovereign hyperscaler with Llama 4, and air-gapped clusters with quantized DeepSeek or Qwen. The pattern, not the model, unlocks procurement.
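The first of those three shapes is also the simplest to stand up. Here is a minimal sketch of the on-prem pattern using vLLM's offline Python API — the model path, parallelism, and quantization settings are placeholders to adapt, not a reference configuration:

```python
# On-prem / air-gapped pattern: vLLM offline engine over locally mirrored
# weights, no external calls. All settings below are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/qwen3-coder",   # local weights path (air-gapped clusters)
    tensor_parallel_size=4,        # shard across 4 GPUs
    max_model_len=131072,          # cap context to what the workload needs
    quantization="fp8",            # quantized checkpoints cut memory further
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize the key obligations in this clause: ..."],
                       params)
print(outputs[0].outputs[0].text)
```

The sovereign-hyperscaler and managed variants differ mainly in who operates this layer, not in the interface the application sees.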

The fourth trend sits apart from the three above because it's less a technical shift and more a buyer-side one — enterprise adoption velocity. Through H1 we saw open-weight move from a small number of early-mover engineering teams into a much larger set of procurement-bound enterprises across finance, healthcare, public sector, and mid-market manufacturing. The mechanism is the combination of the prior three trends: cadence gives buyers confidence the family will keep up; hosting cost collapse changes the TCO conversation; sovereign-cloud patterns unblock procurement.

Adoption is still concentrated. Code-automation pipelines, long-context document agents, and specific internal-knowledge-retrieval workloads dominate the production deployments we've observed. Wholesale replacement of closed frontier with open-weight for general-purpose agents remains rare; the routing pattern is more common, and we think it stays the dominant pattern through H2.

The buyer-side compression
What changed underneath the headlines is how fast a procurement-bound buyer can go from interest to production. Six-month sales cycles compressed into eight-week trials. That isn't about the model — it's about cadence, hosting cost, and procurement-ready deployment patterns landing in the same half.

07 · H2 Projection: Where open-weight likely goes next.

Forecasts in a market moving this fast should be hedged. The shape of H1 was hard to predict in late 2025; the shape of H2 is harder still. Treat the projections below as base-case directional calls, not point forecasts.

Base-case calls for H2 2026

  • DeepSeek V4 GA — the Preview becomes a full V4 GA in Q3, with at least one efficiency-focused revision before year-end. Expect a Flash-class model further optimized for the long-context-RAG workload class.
  • Qwen 3 → Qwen 4 transition — the variant tree expands rather than consolidating. Expect an embedding refresh, a stronger audio variant, and the first Qwen 4 base in late Q3 or early Q4.
  • Llama 4 Behemoth GA — the largest size class moves from preview to GA, with hyperscaler-managed deployments landing on the same day. The sovereign-deployment licence track expands to cover more jurisdictions.
  • General-knowledge gap narrows — open-weight MMLU-Pro and GPQA Diamond gaps versus closed frontier reduce from the current 5–10 percentage points to roughly 3–6. Closed frontier still leads on the hardest retrieval workloads.
  • Hosting cost stabilizes — the order-of-magnitude cost collapse of H1 won't repeat; expect another roughly 2–3× reduction on the leading stacks before plateauing as hardware utilization saturates.
  • Routing becomes the production default — the clean "pick one model" architecture loses ground to routing layers that pick per task class. Expect at least one major open-source routing framework to emerge as a category standard.

Two harder-to-call shifts could change the picture meaningfully. First, a new entrant — a fourth family at the DeepSeek / Qwen / Llama tier — would compress competitive timelines further; we assign it moderate probability over six months. Second, a major closed-frontier vendor opening weights for a flagship would re-shape the open-versus-closed framing entirely; we assign it low probability but high impact if it happens.

For teams planning H2 capacity, the practical move is to design the deployment for routing rather than for any single family — and to invest in a benchmark harness that runs against your own corpus, not just published evaluations. That's where our AI digital transformation engagements typically start: per-workload eval, hosting-stack picks, and routing-layer design calibrated to actual workload volumes rather than vendor-pitched headline numbers. For the self-hosting cost side specifically, our companion analysis on self-hosting frontier-model TCO covers the hardware, utilization, and break-even math.
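What that harness looks like at its smallest: a corpus of prompt/expected pairs run against an OpenAI-compatible endpoint, which vLLM and SGLang both expose. The endpoint, model names, and exact-match scoring below are placeholder assumptions — swap in the scoring rule your workload actually needs.

```python
import json
from openai import OpenAI  # vLLM and SGLang both serve OpenAI-compatible APIs

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def run_eval(jsonl_path: str, model: str) -> float:
    """Score a model on a corpus of {"prompt": ..., "expected": ...} lines.
    Exact match is a stand-in; real workloads need a task-specific scorer."""
    hits = total = 0
    with open(jsonl_path) as f:
        for line in f:
            case = json.loads(line)
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["prompt"]}],
                temperature=0.0,
            )
            hits += int(resp.choices[0].message.content.strip()
                        == case["expected"])
            total += 1
    return hits / total

# Compare candidates on the same in-house corpus before committing routes.
for model in ("deepseek-v4-pro", "qwen3-coder"):   # hypothetical names
    print(model, run_eval("internal_eval.jsonl", model))
```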

08 · Conclusion: The half open-weight became enterprise-grade.

The shape of open-weight, May 2026

H1 2026 was the half open-weight became enterprise-grade.

Three families. Fifteen tracked releases. Benchmark closure with closed frontier on code and reasoning. An order-of-magnitude hosting cost reduction on the leading stacks. Sovereign-cloud deployment patterns that consolidated around three procurement-ready shapes. The story of the half isn't any single model — it's the way these four shifts compounded into something an enterprise procurement team can buy.

The honest framing on the gap is the right one. Open-weight has closed with closed frontier on code, competitive programming, and formal reasoning. It hasn't closed on general knowledge, graduate-level science, or the hardest long-context retrieval — those still trail by roughly 3 to 6 months. That gap is real, and the right response is per-workload routing rather than wholesale replacement.

The broader signal for H2 is that the question changes again. H1's question was "is open-weight good enough." H2's question is "which routing pattern, which hosting stack, which sovereign-deployment shape." That's a more specific, more buyable, more enterprise-grade conversation — and it's what makes H1 2026 the half that mattered.

Adopt open-weight in H2

Open-weight became enterprise-grade in H1 2026.

Our team designs production open-weight deployments — DeepSeek V4, Qwen 3, Llama 4 — calibrated to H1 2026 hosting and benchmark realities.

Free consultation · Expert guidance · Tailored solutions
What we work on

Open-weight engagements

  • Model family selection per workload
  • Hosting stack picks (vLLM / SGLang / Cerebras)
  • Benchmark harness for your corpus
  • Sovereign-cloud deployment design
  • H2 trajectory planning
FAQ · Open-weight H1 retrospective

The questions teams ask after H1 data.

How are the fifteen releases counted?
We count vendor-marked production releases across DeepSeek, Qwen, and Llama between January 1 and the second week of May 2026 — base, instruct, coder, math, vision, and reasoning variants each count as a distinct release when shipped with their own model card and weights. DeepSeek contributed four (V3.1, V3.2, V3.2-Long, V4 Preview); Qwen contributed three base releases plus six domain variants; Llama 4 moved through three release milestones (preview, GA, sovereign track). Internal-only checkpoints, research-only drops without a Hugging Face model card, and minor patch releases aren't counted. The total is a conservative, externally verifiable inventory rather than a vendor-reported headline.