AI Development · Architecture · 7 min read · Published Apr 24, 2026

4 frontier models · 7 routing variants · the architecture choices behind 2026 inference cost

MoE Architecture in 2026: GPT, Claude, DeepSeek, Qwen Compared

By April 2026, every frontier model except Anthropic's Claude line is mixture-of-experts. GPT-5.5, DeepSeek V4-Pro, and Qwen 3 all ship sparsity ratios between 3% and 35%, but the routing strategy — top-k, expert-choice, fine-grained shared — drives radically different per-token economics.

By the Digital Applied Team · Senior strategists
Sources: DeepSeek V4 paper · Mixtral · Qwen 3 model cards
  • DeepSeek V4-Pro sparsity: 3.1% · 49B active / 1.6T total · tightest in 2026
  • GPT-5.5 sparsity (est.): ~6% · ~110B active / ~1.8T total (rumored)
  • Qwen 3 235B-MoE sparsity: 9.4% · 22B active / 235B total
  • Mixtral 8x22B sparsity: 28% · 39B active / 141B total · coarse-grained

By April 2026, mixture-of-experts is no longer the contrarian architectural bet. Every frontier model that ships open weights, and every model rumored to power a closed frontier API (with the explicit exception of the Claude family), is MoE — a sparse architecture where each forward pass routes through a small subset of the total parameters.

The shift happened fast. In 2024, Mixtral 8x22B was the open-weight standard at 28% sparsity. In 2025, DeepSeek V3 dropped that to 5.4%. By Q2 2026, DeepSeek V4-Pro pushes it to 3.1% (49B active out of 1.6T total) and Qwen 3's 235B-MoE variant lands at 9.4%. The consequence is that frontier models now hold ten to thirty times more parameters in VRAM than a dense model of equivalent throughput — and the question is no longer whether MoE wins, but which MoE pattern wins for which serving constraint.

This post compares the four canonical 2026 MoE patterns side by side: the rumored GPT-5.5 architecture, DeepSeek V4-Pro's published spec, Qwen 3's public model cards, and the open-weight reference points (Mixtral, Llama-MoE, Granite-MoE). It also covers Anthropic's notable hold-out — Claude Opus 4.7 is the only frontier model where the public evidence still suggests a dense (or very lightly-MoE) architecture.

Key takeaways
  1. The 2026 frontier is sparse — except Claude. GPT-5.5 (rumored), DeepSeek V4-Pro, Qwen 3 235B-MoE, and the open-weight Mixtral / Llama-MoE / Granite-MoE families all ship MoE in 2026. Claude Opus 4.7's public evidence still points to dense or very lightly MoE — Anthropic has not confirmed a switch.
  2. Sparsity-ratio compression has been the single biggest 2024→2026 lever. Mixtral (28%) → DeepSeek V3 (5.4%) → DeepSeek V4-Pro (3.1%). Each step roughly tripled the total-to-active ratio while preserving or improving downstream evals. The cost gain compounds: a lower sparsity ratio means cheaper per-token compute, at the price of more total VRAM held resident.
  3. Top-k routing, expert-choice routing, and fine-grained shared experts are not interchangeable. Top-k (Mixtral) is simplest but suffers expert imbalance under load. Expert-choice (Switch / Llama-MoE) hard-balances at the cost of dropped tokens. Fine-grained shared experts (DeepSeek, Qwen) split routed and shared experts to combine specialization with stable general knowledge — the dominant 2026 pattern.
  4. MoE shifts cost from compute to memory bandwidth and inter-GPU traffic. On a per-token basis MoE saves 70–95% of the FLOPs of an equivalent dense model. But all 1.6T parameters of V4-Pro must be held resident across an 8×H100 cluster, and expert routing pushes 30–40 GB/s of inter-GPU traffic. Picking a serving stack that handles MoE all-to-all (vLLM 0.7+, SGLang, TensorRT-LLM with expert parallelism) is non-optional.
  5. Pick the pattern that fits your serving constraints, not the lowest sparsity number. DeepSeek V4-Pro's 3.1% sparsity wins on per-token compute; Qwen 3's 9.4% wins on cluster-size economics; Mixtral's 28% wins on single-GPU dev-box deployment. The right MoE for an agency stack is the one that maps onto the GPU budget and tail-latency target you can actually serve.

01 · The Thesis · Why the frontier went sparse between 2024 and 2026.

The MoE thesis is mechanical. A dense model with N parameters performs on the order of N multiply-accumulates per token at inference. An MoE model with N total parameters and a k% sparsity ratio performs roughly k% × N — because only the activated experts contribute to the forward pass. Holding N constant, a 5% sparse model is twenty times cheaper per token to compute than its dense equivalent.

That trade was a research curiosity until two things converged: knowledge density stopped scaling well past about 70B dense parameters (the Llama 3 70B / 405B gap is the canonical evidence), and serving infrastructure matured to the point that all-to-all expert routing across 8 or 16 GPUs became practical at production tail-latency. The combination meant frontier teams could keep adding knowledge in the form of more experts without paying for it on every token.
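The arithmetic is easy to sanity-check. A minimal sketch using the active/total figures quoted in this post, with the standard ~2 FLOPs per active weight (multiply + accumulate) approximation:

```python
# Back-of-envelope per-token compute for the MoE models discussed here.
# Specs are this post's published/estimated figures; FLOPs use the usual
# ~2 FLOPs per active weight (multiply + accumulate) convention.

def sparsity_ratio(active: float, total: float) -> float:
    """Fraction of parameters activated per token."""
    return active / total

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token."""
    return 2 * active_params

models = {
    "Mixtral 8x22B":   (39e9, 141e9),
    "Qwen 3 235B-MoE": (22e9, 235e9),
    "DeepSeek V4-Pro": (49e9, 1.6e12),
}

for name, (active, total) in models.items():
    r = sparsity_ratio(active, total)
    # A dense model with the same TOTAL parameter count costs 1/r more per token.
    print(f"{name}: {r:.1%} sparse, {flops_per_token(active) / 1e9:.0f} GFLOPs/token, "
          f"{1 / r:.0f}x cheaper than same-size dense")
```

At V4-Pro's 3.1% ratio, a dense model with the same 1.6T total would be roughly 33× more expensive per token — which is the whole thesis in one number.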

Sparsity ratio · 2024 → 2026 frontier MoE compression
Source: public model cards and technical reports · Apr 2026

  • Mixtral 8x7B (2023) · 47B total / 13B active · 8 experts, top-2 · 28%
  • Mixtral 8x22B (2024) · 141B total / 39B active · 8 experts, top-2 · 28%
  • DBRX Instruct (2024) · 132B total / 36B active · 16 experts, top-4 · 27%
  • DeepSeek V2 (2024) · 236B total / 21B active · 160+2 fine-grained · 8.9%
  • DeepSeek V3 (2024) · 671B total / 37B active · 256+1 shared · 5.4%
  • Qwen 3 235B-MoE (2025) · 235B total / 22B active · 128 experts top-8 + shared · 9.4% (−66% vs Mixtral)
  • DeepSeek V4-Pro (2026) · 1.6T total / 49B active · CSA+HCA + Muon · 3.1% (−89% vs Mixtral)

The trend line is straight: roughly a 9× compression in sparsity ratio over two years. Every step preserved or improved downstream evaluation while reducing per-token FLOPs. The architectural moves that made it work — fine-grained expert split, shared experts for general knowledge, latent-attention compression for KV cache — are now standard, and the next frontier of compression is in the attention layer (MLA, CSA/HCA) rather than the FFN layer where the experts live.

The hold-out: Claude
Anthropic has not confirmed an MoE architecture for any production Claude model. Public evidence (latency curves, training-compute disclosures, the explicit framing in the Sonnet 4.5 launch post) is consistent with dense or very lightly-MoE designs. This is a deliberate position — Anthropic has argued that dense models are easier to interpret, easier to align, and more predictable under adversarial conditions. For agencies choosing a stack, the read is: if interpretability and uniform behavior matter more than per-token cost, Claude is the only frontier-class option still committed to dense.

02 · The Four Models · Four MoE patterns, compared.

The four side-by-side specs below are what we work from when sizing client deployments. Numbers are taken from official model cards (DeepSeek V4, Qwen 3) or from the most credible public reverse-engineering available as of Apr 24, 2026 (GPT-5.5's architecture is not officially disclosed; the figures listed are widely cited estimates).

Pattern 1
GPT-5.5 — closed, ~6% sparsity (rumored)
~1.8T total · ~110B active · top-k routing (k≈2-4)

OpenAI has not disclosed architecture publicly since GPT-4. Inferred from latency, throughput, and API behavior: a coarse-to-fine MoE at ~6% sparsity, likely with shared experts for safety/alignment. Routes through ~110B active per token. Pricing ($5/$30) reflects high active-param compute.

Closed weights · top-k routing
Pattern 2
DeepSeek V4-Pro — open, 3.1% sparsity
1.6T total · 49B active · 256+1 shared, top-8

Tightest sparsity ratio in 2026. 256 routed experts + 1 shared expert per layer; CSA+HCA attention compresses the KV cache to 10% of V3.2's. Muon optimizer instead of AdamW. The architecture pairs aggressive FFN sparsity with aggressive attention compression — and the two effects compound.

Open weights · fine-grained + shared
Pattern 3
Qwen 3 235B-MoE — open, 9.4% sparsity
235B total · 22B active · 128+1 shared, top-8

Alibaba's flagship open-weight 2025-2026 release. Slightly less sparse than DeepSeek V4 but with stronger multilingual coverage and a more conservative routing policy that produces lower expert-imbalance variance under bursty traffic. Cheaper to serve on 4×H100 than V4-Pro.

Open weights · stable routing
Pattern 4
Claude Opus 4.7 — closed, dense (no MoE)
Architecture not disclosed · estimated dense or near-dense

The frontier hold-out. Latency-vs-throughput curves and Anthropic's interpretability stance both point to dense. Pricing ($5/$25 with 90% cached read) makes Opus 4.7 economical for cache-friendly workloads, even without MoE compute savings. Strongest long-context retrieval among 1M-context models.

Closed weights · dense (likely)
"DeepSeek V4-Pro fits 1.6T parameters in 8×H100 — but only because the per-token active footprint is 49B."— Internal serving-stack notes, Apr 2026

03 · Routing · Three routing strategies and what they cost.

Sparsity is one variable; routing is the other. Three families dominate 2026 production: top-k token-choice, expert-choice (the Switch family), and fine-grained shared-experts (the DeepSeek/Qwen pattern). Each family has a characteristic failure mode under bursty traffic, and that failure mode is what governs which stack you can serve cheaply.

Top-k token-choice
Each token picks its k experts (Mixtral, GPT-style)

Simple, easy to scale, well-understood. The token decides which experts to fire (typically k=2). Failure mode is expert-imbalance: under bursty workloads, certain experts get over-routed and become tail-latency bottlenecks. Auxiliary load-balancing loss is mandatory — without it, throughput drops 20-40% under load.

Mixtral · GPT (rumored)
Expert-choice
Each expert picks its tokens (Switch, Llama-MoE)

Hard-balances by construction — every expert gets exactly its capacity-fraction of tokens. Trade-off is dropped tokens: under capacity pressure, some tokens skip the FFN entirely and pass through with the residual only. Strong in research benchmarks, weaker in production tail-latency under traffic spikes.

Switch · Llama-MoE
Fine-grained + shared
Many small routed experts plus 1-2 shared experts

DeepSeek V2/V3/V4 and Qwen 3 pattern. The shared expert always fires (general-knowledge backbone); the routed experts add specialization through top-k. Combines the stability of dense behavior with the cost savings of sparsity. Dominant 2026 pattern — every new frontier MoE since mid-2024 ships some variant of this.

DeepSeek V4 · Qwen 3
Hybrid coarse-fine
Outer top-k routes to inner fine-grained groups

Two-stage routing — outer routes to coarse expert clusters, inner routes within the cluster. Reduces inter-GPU traffic at the cost of slightly worse specialization. Used selectively in research; not yet a mainstream production pattern in 2026.

Research · niche
Why shared experts work
The intuition: in every layer there is general-purpose computation (token-mixing, common patterns) that every input needs, plus specialized computation that only some inputs need. Splitting them gives the specialization benefits of MoE while keeping the general-purpose path always-on. This is also why DeepSeek and Qwen add a single shared expert rather than running with shared = 0 — the shared path absorbs the load that pure top-k routing would otherwise spread across over-routed specialists.
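The fine-grained + shared pattern is easy to see in a toy forward pass. Below is a pure-Python sketch with toy dimensions and random stand-in weights — illustrative of the mechanism only, not any model's real layer: the shared expert always fires, and the top-k routed experts are mixed in by softmaxed router scores.

```python
import math
import random

random.seed(0)
d_model, n_routed, k = 8, 4, 2   # toy sizes, far below production scale

def rand_matrix(rows, cols, seed):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1 / math.sqrt(cols)) for _ in range(cols)] for _ in range(rows)]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

gate = rand_matrix(n_routed, d_model, seed=1)                      # router weights
experts = [rand_matrix(d_model, d_model, seed=10 + i) for i in range(n_routed)]
shared = rand_matrix(d_model, d_model, seed=99)                    # shared expert

def moe_layer(x):
    # 1. Router scores every routed expert for this token.
    logits = matvec(gate, x)
    # 2. Token-choice top-k: keep the k highest-scoring experts.
    topk = sorted(range(n_routed), key=lambda e: logits[e])[-k:]
    # 3. Softmax the selected scores into mixing weights.
    mx = max(logits[e] for e in topk)
    exps = {e: math.exp(logits[e] - mx) for e in topk}
    z = sum(exps.values())
    # 4. Shared expert fires unconditionally; routed experts are weighted in.
    out = matvec(shared, x)
    for e in topk:
        out = [o + (exps[e] / z) * v for o, v in zip(out, matvec(experts[e], x))]
    # 5. Residual connection.
    return [xi + oi for xi, oi in zip(x, out)], topk

token = [random.gauss(0, 1) for _ in range(d_model)]
y, chosen = moe_layer(token)
print("routed through experts:", sorted(chosen))
```

Setting `shared` aside recovers plain top-k token-choice (the Mixtral pattern); the shared path is exactly the always-on backbone the callout above describes.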

04 · Economics · The sparsity ratio drives the per-token bill.

Sparsity ratio sets the per-token compute floor. The math: a 1.6T model at 3.1% sparsity does roughly the FLOPs of a 50B dense model on each token. A 235B model at 9.4% sparsity does roughly the FLOPs of a 22B dense model. That ratio shows up directly in tokens-per-second on identical hardware — and indirectly in $/1M tokens through provider markup.

DeepSeek V4-Pro
3.1%
Sparsity ratio

49B active out of 1.6T total. The smallest active footprint of any 2026 frontier model. Per-token compute equivalent to a 50B dense model — but holding 1.6T parameters resident across 8×H100 means ~$25K/month in pure GPU rent before utilization adjustments.

Tightest 2026
Qwen 3 235B-MoE
9.4%
Sparsity ratio

22B active out of 235B total. A sparsity ratio three times higher than V4-Pro's, but it fits comfortably on 4×H100. Per-token compute roughly equivalent to a 22B dense model. The cheapest open-weight frontier model to operate at small-cluster scale.

Best mid-cluster
Mixtral 8x22B
28%
Sparsity ratio

39B active out of 141B total. Coarse-grained legacy pattern. Fits on 2×H100 or 1×H200. Per-token compute equivalent to a 40B dense model. Still the simplest open-weight MoE to run on a single dev box; usable if model quality is acceptable for the workload.

Single-box friendly

The trap is treating sparsity ratio as a pure win. Lower sparsity buys cheaper per-token compute, but the trade-off shows up in three places: VRAM resident (the full 1.6T must fit somewhere), inter-GPU traffic (expert all-to-all on 8 GPUs is bandwidth-heavy), and tail-latency variance (more experts means more potential for imbalance). Each of these costs money in production — not all of it on the per-token line.
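One way to see where the money actually goes is to amortize cluster rent over delivered tokens. A rough sketch — the ~$25K/month 8×H100 figure appears earlier in this section; the throughput and utilization values are placeholder assumptions to replace with your own measurements:

```python
# Rough serving-cost model: amortize GPU rent over delivered tokens.
# Rent is the figure quoted in this section; throughput and utilization
# are illustrative assumptions (both vary with context length and batching).

SECONDS_PER_MONTH = 30 * 24 * 3600

def dollars_per_million_tokens(monthly_gpu_rent: float,
                               tokens_per_second: float,
                               utilization: float) -> float:
    """Cluster rent divided by tokens actually delivered in a month."""
    delivered = tokens_per_second * utilization * SECONDS_PER_MONTH
    return monthly_gpu_rent / delivered * 1e6

cost = dollars_per_million_tokens(monthly_gpu_rent=25_000,   # 8xH100, from the text
                                  tokens_per_second=2_000,   # assumed aggregate throughput
                                  utilization=0.4)           # assumed duty cycle
print(f"~${cost:.2f} per 1M tokens")
```

The utilization term is where the three hidden costs above bite: imbalance and tail-latency headroom push effective utilization down, and the per-token bill up, without touching the FLOPs line.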

05 · Serving · MoE shifts cost from compute to memory and bandwidth.

From a serving perspective, MoE is not a strict win — it is a trade. The compute savings are real, but they shift the hot path from FLOPs to two other resources: VRAM (because all experts must be resident) and inter-GPU bandwidth (because expert routing is inherently all-to-all). A serving stack that does not handle these two correctly will throw away most of the per-token compute savings.

  • VRAM resident. DeepSeek V4-Pro at FP8 needs ~1.6 TB of VRAM just for parameters, plus another ~200 GB for KV cache and activations at long context. That is more than 8×H100 (640 GB) or even 8×H200 (~1.1 TB) can hold, so an 8-GPU box only works with sub-8-bit weight-only quantization; 16×H100 is the practical floor for full-FP8 serving.
  • Inter-GPU bandwidth. Every token triggers expert all-to-all communication. With 49B active across 256 experts spread over 8 GPUs, this pushes 30-40 GB/s of cross-GPU traffic at production batch sizes. NVLink/NVSwitch handles it; cheap PCIe clusters do not — and that is the dominant reason frontier MoE is hard to self-host.
  • Capacity tuning. Each expert has a configurable capacity (the maximum tokens it will accept per batch). Set too low, tokens get dropped or routed to a fallback path. Set too high, memory is wasted. vLLM 0.7+ exposes this as expert_capacity; SGLang as experts_per_token; production stacks tune this per workload.
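The capacity trade-off in the last bullet fits in a few lines. This is an illustrative toy, not any framework's actual scheduler, and the names are hypothetical:

```python
# Toy illustration of expert capacity: when a batch over-routes one expert,
# tokens past its capacity skip that expert's FFN and fall through to the
# residual path. Names here are illustrative, not any framework's real API.

from collections import defaultdict

def route_with_capacity(assignments, capacity):
    """assignments: list of (token_id, expert_id) routing decisions."""
    accepted = defaultdict(list)
    dropped = []
    for token, expert in assignments:
        if len(accepted[expert]) < capacity:
            accepted[expert].append(token)
        else:
            dropped.append(token)  # token bypasses the FFN via the residual
    return accepted, dropped

# Bursty batch: expert 0 is over-routed (the classic top-k failure mode).
batch = [(t, 0) for t in range(6)] + [(6, 1), (7, 2)]
accepted, dropped = route_with_capacity(batch, capacity=4)
print("dropped tokens:", dropped)
```

Raising `capacity` trades those drops for reserved memory on every expert — which is why production stacks tune it per workload rather than setting it once.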
"You don't pay for MoE in FLOPs. You pay in NVLink, in 1.6 TB of resident VRAM, and in the on-call engineer who knows what to do when expert load balance goes off."— Internal infra retrospective, Apr 2026

06 · Decision · How to pick an MoE for your stack.

For most agency or product teams the choice collapses to four options: closed-API frontier (GPT-5.5 or Claude Opus 4.7), open-weight frontier MoE (DeepSeek V4-Pro), open-weight mid-cluster MoE (Qwen 3 235B-MoE or Llama 4-MoE 70B), or single-box MoE (Mixtral 8x22B). The decision is governed by token volume, latency target, and how much inference engineering the team can sustain.

Profile A
Under 600M tokens/month, no infra team

Closed-API only. GPT-5.5, Claude Opus 4.7, or Gemini 3. Token spend dominates GPU spend at this scale; self-hosting is a distraction.

GPT-5.5 / Opus 4.7
Profile B
1-5B tokens/month, one infra engineer

Open-weight mid-cluster. Qwen 3 235B-MoE on 4×H100 or Llama 4-MoE on 4-8×H100. Break-even with API around 1.2B tokens/month for chat, ~600M for completion. Self-hosting pays off if model swaps are rare.

Qwen 3 / Llama 4-MoE
Profile C
5B+ tokens/month, dedicated team

Frontier open-weight MoE. DeepSeek V4-Pro on 8-16×H100. Lowest per-token cost in the open ecosystem; needs serious inference engineering and constant capacity tuning. Pairs naturally with a closed-API fallback for spiky workloads.

DeepSeek V4-Pro
Profile D
Dev box / experimentation only

Single-box MoE. Mixtral 8x22B on 1×H200 or 2×H100. Per-token compute is far worse than V4-Pro's, but there is no multi-GPU cluster to build or operate. Right for prototyping, not for production.

Mixtral 8x22B

The mistake we see most often is a Profile-A team — under 600M tokens/month, no infra engineer — spending three weeks evaluating DeepSeek V4-Pro self-hosting because the per-token rack rate looks irresistible. At their volume, the 8×H100 cluster cost would dominate spend by a factor of four, before counting engineer time. The right answer for that team is the closed-API rack rate plus aggressive caching — not heroic self-hosting.

The opposite mistake is Profile C — 5B+ tokens/month with a dedicated infra team — running everything on closed APIs because the migration feels expensive. At that volume, the migration pays for itself in the first month, every month. We have helped clients cut $1.8M/year in inference spend with a six-week move from closed-API to V4-Pro on 16×H100.
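The Profile-A and Profile-C mistakes both come down to the same break-even arithmetic. A minimal sketch — the blended API rate, cluster rent, and marginal self-host cost below are illustrative assumptions to tune against your own quotes; the structure is what matters:

```python
# Break-even sketch: closed-API spend vs. self-hosted cluster rent.
# All dollar figures are illustrative assumptions, not quotes.

def monthly_api_cost(tokens_m: float, dollars_per_m: float) -> float:
    """API spend: pure per-token pricing, no fixed cost."""
    return tokens_m * dollars_per_m

def monthly_selfhost_cost(tokens_m: float, cluster_rent: float,
                          marginal_per_m: float) -> float:
    """Self-host spend: fixed cluster rent plus a small marginal cost."""
    return cluster_rent + tokens_m * marginal_per_m

# Assumed $10/1M blended API rate vs. a 4xH100-class cluster at $12.5K/month
# with ~$2/1M marginal cost (power, egress, on-call amortization).
for volume_m in (300, 1_200, 5_000):   # monthly volume, millions of tokens
    api = monthly_api_cost(volume_m, 10)
    own = monthly_selfhost_cost(volume_m, 12_500, 2)
    better = "self-host" if own < api else "API"
    print(f"{volume_m:>5}M tokens/mo -> API ${api:,.0f} vs self-host ${own:,.0f} ({better})")
```

Under these assumptions the crossover lands in the low billions of tokens per month, which is consistent with the Profile-B break-even range above — and it moves fast when either the API rate or the cluster rent changes.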

07 · Conclusion · MoE is the new dense.

Frontier architecture, April 2026

Sparsity is the default. Dense is the deliberate exception.

By April 2026 the architectural question has flipped. In 2024 a new frontier model had to defend a decision to be MoE; in 2026 it has to defend a decision to be dense. Anthropic's Claude line is the only credible hold-out, and the public arguments for that position — interpretability, alignment uniformity, predictable latency — are real but increasingly narrow as the rest of the industry matures the production patterns around MoE.

For agency and product teams, the practical takeaway is to stop evaluating models on raw parameter counts and start evaluating them on three numbers: active parameters per token, sparsity ratio, and routing pattern. Those three governing variables predict cost, latency, and hardware fit better than any benchmark column.
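Those three numbers fit in a record you can sort and filter during model selection. A sketch using the specs quoted in this post:

```python
# The three governing numbers from the text, per model, as sortable records.
# Specs are this post's published/estimated figures.

from dataclasses import dataclass

@dataclass
class MoESpec:
    name: str
    active_b: float   # active parameters per token, billions
    total_b: float    # total parameters, billions
    routing: str

    @property
    def sparsity(self) -> float:
        return self.active_b / self.total_b

models = [
    MoESpec("DeepSeek V4-Pro", 49, 1_600, "fine-grained + shared"),
    MoESpec("Qwen 3 235B-MoE", 22, 235, "fine-grained + shared"),
    MoESpec("Mixtral 8x22B", 39, 141, "top-k token-choice"),
]

# Rank by active parameters per token: the first number that drives cost.
for m in sorted(models, key=lambda m: m.active_b):
    print(f"{m.name}: {m.active_b:.0f}B active · {m.sparsity:.1%} sparse · {m.routing}")
```

Extending the record with a VRAM-resident field (total_b × bytes per weight) gives the hardware-fit check from section 05 in the same pass.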

The next architectural compression — moving past the FFN expert layer into more aggressive attention compression (DeepSeek's MLA and CSA/HCA, the Mamba-MoE hybrids) — is already in flight. Expect 2027's frontier to land near 1% sparsity at 5T total, with attention costs dropped another 5-10× through latent compression. The window of MoE-as-novelty is closed; the window of MoE-as-substrate is the one to optimize for.

Production-grade MoE serving

Move past parameter counts. Optimize for active parameters per token.

We design and operate frontier-MoE deployments for engineering teams shipping production at scale — covering model selection (closed-API vs open-weight), serving-stack tuning (vLLM, SGLang, TensorRT-LLM expert parallel), capacity sizing, and per-workload cost telemetry.

What we work on

MoE serving engagements

  • Model selection — closed API vs DeepSeek V4 vs Qwen 3 vs Mixtral
  • Serving-stack tuning — vLLM, SGLang, TensorRT-LLM expert parallel
  • Capacity & load-balance tuning under bursty traffic
  • VRAM and KV-cache budget modelling at 32K-1M context
  • Closed-API fallback routing for spike protection
FAQ · MoE architecture in 2026

The questions we get every week.

OpenAI has not officially disclosed GPT-5.5's architecture. The MoE characterization is widely-cited but inferred — from latency curves (consistent with sparse activation), from the cost gap between GPT-5.5 and GPT-5.5 Pro (consistent with different active-parameter counts at similar total parameters), and from the SemiAnalysis-led reverse-engineering of GPT-4 that placed it as MoE. Treat the ~6% sparsity / ~110B active / ~1.8T total figures as informed estimates, not confirmed specs. Anthropic, Google, and OpenAI all decline to publish detailed architecture for production frontier models — public-evidence inference is the only available source.