By April 2026, mixture-of-experts is no longer the contrarian architectural bet. Every frontier model that ships open weights, and every model rumored to power the closed-API frontier (with the explicit exception of the Claude family), is sparse MoE: an ensemble where each forward pass routes through only a small subset of the total parameters.
The shift happened fast. In 2024, Mixtral 8x22B was the open-weight standard at 28% sparsity. In 2025, DeepSeek V3 dropped that to 5.4%. By Q2 2026, DeepSeek V4-Pro pushes it to 3.1% (49B active out of 1.6T total) and Qwen 3's 235B-MoE variant lands at 9.4%. The consequence is that frontier models now hold ten to thirty times more parameters in VRAM than a dense model of equivalent throughput — and the question is no longer whether MoE wins, but which MoE pattern wins for which serving constraint.
This post compares the four canonical 2026 MoE patterns side by side: the rumored GPT-5.5 architecture, DeepSeek V4-Pro's published spec, Qwen 3's public model cards, and the open-weight reference points (Mixtral, Llama-MoE, Granite-MoE). It also covers Anthropic's notable hold-out — Claude Opus 4.7 is the only frontier model where the public evidence still suggests a dense (or very lightly-MoE) architecture.
- 01 · The 2026 frontier is sparse, except Claude; every other major model is MoE. GPT-5.5 (rumored), DeepSeek V4-Pro, Qwen 3 235B-MoE, and the open-weight Mixtral / Llama-MoE / Granite-MoE families all ship MoE in 2026. Claude Opus 4.7's public evidence still points to dense or very lightly MoE; Anthropic has not confirmed a switch.
- 02 · Sparsity ratio compression has been the single biggest 2024→2026 lever. Mixtral (28%) → DeepSeek V3 (5.4%) → DeepSeek V4-Pro (3.1%). Each step multiplied the total-to-active ratio, roughly 5× then 2×, while preserving or improving downstream evals. The cost gain compounds: lower sparsity means cheaper per-token compute at the price of more total parameters held resident in VRAM.
- 03 · Top-k routing, expert-choice routing, and fine-grained shared experts are not interchangeable. Top-k (Mixtral) is simplest but suffers expert imbalance under load. Expert-choice (Switch / Llama-MoE) hard-balances at the cost of dropped tokens. Fine-grained shared experts (DeepSeek, Qwen) split routed and shared experts to combine specialization with stable knowledge; this is the dominant 2026 pattern.
- 04 · MoE shifts cost from compute to memory bandwidth and inter-GPU traffic. On a per-token basis MoE saves 70-95% of the FLOPs of an equivalent dense model. But all 1.6T parameters of V4-Pro must stay resident in GPU memory across the serving cluster, and expert routing sustains on the order of 30-40 GB/s of inter-GPU traffic under load. Picking a serving stack that handles MoE all-to-all (vLLM 0.7+, SGLang, TensorRT-LLM with expert parallelism) is non-optional.
- 05 · Pick the pattern that fits your serving constraints, not the lowest sparsity number. DeepSeek V4-Pro's 3.1% sparsity wins on per-token compute; Qwen 3's 9.4% wins on cluster-size economics; Mixtral's 28% wins on single-GPU dev-box deployment. The right MoE for an agency stack is the one that maps onto the GPU budget and tail-latency target you can actually serve.
01 — The Thesis
Why frontier went sparse between 2024 and 2026.
The MoE thesis is mechanical. A dense model with N parameters performs N FLOPs per token at inference. An MoE model with N total parameters and k% sparsity performs roughly k% × N FLOPs per token — because only the activated experts contribute to the forward pass. Holding N constant, a 5% sparse model is twenty times cheaper per token to compute than its dense equivalent.
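To make that arithmetic concrete, here is a minimal sketch of the active-parameter accounting, using the figures quoted in this post plus DeepSeek V3's published 671B-total / 37B-active spec. The 2-FLOPs-per-active-parameter approximation is a rough rule of thumb that ignores attention, embeddings, and routing overhead.

```python
# Rough per-token forward-pass compute for the MoE models discussed in this post.
# Approximation: ~2 FLOPs per *active* parameter per token; ignores attention,
# embeddings, and routing overhead.

MODELS = {
    # name: (total params, active params), in billions
    "Mixtral 8x22B":   (141,  39),
    "DeepSeek V3":     (671,  37),
    "DeepSeek V4-Pro": (1600, 49),
    "Qwen 3 235B-MoE": (235,  22),
}

def per_token_gflops(active_params_b: float) -> float:
    """Approximate forward-pass GFLOPs per token from active parameters (billions)."""
    return 2.0 * active_params_b

for name, (total_b, active_b) in MODELS.items():
    sparsity = active_b / total_b
    print(f"{name:16s} sparsity {sparsity:5.1%}  "
          f"~{per_token_gflops(active_b):>4.0f} GFLOPs/token  "
          f"(dense-equivalent ~{active_b}B)")
```

Running this reproduces the sparsity ratios quoted above (27.7%, 5.5%, 3.1%, 9.4%) and shows why a 1.6T-total model can cost roughly what a 50B dense model costs on each forward pass.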
That trade was a research curiosity until two things converged: knowledge density stopped scaling well past about 70B dense parameters (the Llama 3 70B / 405B gap is the canonical evidence), and serving infrastructure matured to the point that all-to-all expert routing across 8 or 16 GPUs became practical at production tail-latency. The combination meant frontier teams could keep adding knowledge in the form of more experts without paying for it on every token.
Sparsity ratio · 2024 → 2026 frontier MoE compression
Source: Public model cards, technical reports · Apr 2026

The trend line is straight: roughly a 9× compression in sparsity ratio over two years. Every step preserved or improved downstream evaluation while reducing per-token FLOPs. The architectural moves that made it work (fine-grained expert split, shared experts for general knowledge, latent-attention compression for the KV cache) are now standard, and the next frontier of compression is in the attention layer (MLA, CSA/HCA) rather than the FFN layer where the experts live.
02 — The Four Models
Four MoE patterns, compared.
The four side-by-side specs below are what we work from when sizing client deployments. Numbers are taken from official model cards (DeepSeek V4, Qwen 3) or from the most credible public reverse-engineering available as of Apr 24, 2026 (GPT-5.5's architecture is not officially disclosed; the figures listed are widely cited estimates).
GPT-5.5 — closed, ~6% sparsity (rumored)
~1.8T total · ~110B active · top-k routing (k≈2-4)

OpenAI has not disclosed architecture publicly since GPT-4. Inferred from latency, throughput, and API behavior: a coarse-to-fine MoE at ~6% sparsity, likely with shared experts for safety/alignment. Routes through ~110B active parameters per token. Pricing ($5 / $30 per 1M input/output tokens) reflects high active-param compute.
Closed weights · top-k routing

DeepSeek V4-Pro — open, 3.1% sparsity
1.6T total · 49B active · 256+1 shared, top-8

Tightest sparsity ratio in 2026. 256 routed experts + 1 shared expert per layer; CSA+HCA attention compresses the KV cache to 10% of V3.2. Muon optimizer instead of AdamW. The architecture pairs aggressive FFN sparsity with aggressive attention compression, and the two effects compound.
Open weights · fine-grained + shared

Qwen 3 235B-MoE — open, 9.4% sparsity
235B total · 22B active · 128+1 shared, top-8

Alibaba's flagship open-weight 2025-2026 release. Less sparse than DeepSeek V4 but with stronger multilingual coverage and a more conservative routing policy that produces lower expert-imbalance variance under bursty traffic. Cheaper to serve on 4×H100 than V4-Pro.
Open weights · stable routing

Claude Opus 4.7 — closed, dense (no MoE)
Architecture not disclosed · estimated dense or near-dense

The frontier hold-out. Latency-vs-throughput curves and Anthropic's interpretability stance both point to dense. Pricing ($5 / $25 per 1M input/output tokens, with 90% cached read) makes Opus 4.7 economical for cache-friendly workloads, even without MoE compute savings. Strongest long-context retrieval among 1M-context models.
Closed weights · dense (likely)

"DeepSeek V4-Pro fits 1.6T parameters in 8×H100 — but only because the per-token active footprint is 49B."
— Internal serving-stack notes, Apr 2026
03 — Routing
Three routing strategies and what they cost.
Sparsity is one variable; routing is the other. Three families dominate 2026 production: top-k token-choice, expert-choice (the Switch family), and fine-grained shared-experts (the DeepSeek/Qwen pattern). Each family has a characteristic failure mode under bursty traffic, and that failure mode is what governs which stack you can serve cheaply.
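To ground the comparison, here is a minimal, illustrative sketch of top-k token-choice routing with an always-on shared expert (the fine-grained shared-expert shape), plus the standard Switch-style auxiliary load-balancing loss. It is a toy single-GPU reference under simplified assumptions, not any model's actual implementation; real stacks add capacity limits, fused kernels, and expert parallelism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Toy MoE layer: top-k token-choice routing plus one always-on shared expert."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        # Shared expert fires for every token: the general-knowledge backbone.
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model), batch and sequence dims already flattened.
        probs = F.softmax(self.router(x), dim=-1)              # (tokens, n_experts)
        weights, chosen = probs.topk(self.top_k, dim=-1)       # token-choice top-k
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts

        routed = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    routed[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return self.shared(x) + routed

def load_balancing_loss(probs: torch.Tensor, chosen_top1: torch.Tensor, n_experts: int) -> torch.Tensor:
    """Switch-style auxiliary loss: penalizes correlated routing mass and token counts.

    Call with the router probabilities and the top-1 expert index per token.
    """
    token_frac = F.one_hot(chosen_top1, n_experts).float().mean(dim=0)  # fraction of tokens per expert
    prob_frac = probs.mean(dim=0)                                       # mean router probability per expert
    return n_experts * (token_frac * prob_frac).sum()
```

Expert-choice routing inverts the topk call: each expert picks its highest-scoring tokens up to a fixed capacity, which balances load by construction but can drop tokens, exactly the trade-off described in the cards below.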
Each token picks its k experts (Mixtral, GPT-style)
Simple, easy to scale, well-understood. The token decides which experts to fire (typically k=2). Failure mode is expert-imbalance: under bursty workloads, certain experts get over-routed and become tail-latency bottlenecks. Auxiliary load-balancing loss is mandatory — without it, throughput drops 20-40% under load.
Mixtral · GPT (rumored)

Each expert picks its tokens (Switch, Llama-MoE)
Hard-balances by construction — every expert gets exactly its capacity-fraction of tokens. Trade-off is dropped tokens: under capacity pressure, some tokens skip the FFN entirely and pass through with the residual only. Strong in research benchmarks, weaker in production tail-latency under traffic spikes.
Switch · Llama-MoE

Many small routed experts plus 1-2 shared experts
DeepSeek V2/V3/V4 and Qwen 3 pattern. The shared expert always fires (general-knowledge backbone); the routed experts add specialization through top-k. Combines the stability of dense behavior with the cost savings of sparsity. Dominant 2026 pattern — every new frontier MoE since mid-2024 ships some variant of this.
DeepSeek V4 · Qwen 3

Outer top-k routes to inner fine-grained groups
Two-stage routing — outer routes to coarse expert clusters, inner routes within the cluster. Reduces inter-GPU traffic at the cost of slightly worse specialization. Used selectively in research; not yet a mainstream production pattern in 2026.
Research · niche

04 — Economics
The sparsity ratio drives the per-token bill.
Sparsity ratio sets the per-token compute floor. The math: a 1.6T model at 3.1% sparsity does roughly the FLOPs of a 50B dense model on each token. A 235B model at 9.4% sparsity does roughly the FLOPs of a 22B dense model. That ratio shows up directly in tokens-per-second on identical hardware — and indirectly in $/1M tokens through provider markup.
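As a very rough translation into dollars, the sketch below turns active parameters into a compute-only $/1M-token floor. The realized-throughput and GPU-price numbers are illustrative assumptions, not measurements, and real decode is usually memory-bandwidth bound rather than FLOPs bound, which is exactly the point the serving section below makes.

```python
# Compute-only cost floor per 1M tokens, derived from active parameters.
# Assumes ~2 FLOPs per active parameter per token and a fixed realized
# throughput per GPU; both defaults below are illustrative, not measured.

def dollars_per_mtok(active_params_b: float,
                     realized_tflops_per_gpu: float = 400.0,
                     gpu_cost_per_hour: float = 3.0) -> float:
    flops_per_token = 2 * active_params_b * 1e9
    tokens_per_sec_per_gpu = realized_tflops_per_gpu * 1e12 / flops_per_token
    return gpu_cost_per_hour / 3600 / tokens_per_sec_per_gpu * 1e6

for active_b in (49, 22, 39):   # V4-Pro-like, Qwen-3-like, Mixtral-like active sizes
    print(f"{active_b}B active -> ${dollars_per_mtok(active_b):.2f} per 1M tokens")
# -> roughly $0.20, $0.09, and $0.16 respectively, before the memory-bandwidth
#    and all-to-all overheads that dominate real MoE serving.
```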
Sparsity ratio · 3.1% · DeepSeek V4-Pro
49B active out of 1.6T total. The tightest total-to-active ratio of any 2026 frontier model. Per-token compute equivalent to a 50B dense model, but holding 1.6T parameters resident means a multi-node deployment at roughly $25K/month per 8-GPU node in pure GPU rent, before utilization adjustments.
Tightest 2026

Sparsity ratio · 9.4% · Qwen 3 235B-MoE
22B active out of 235B total. Three-times less sparse than V4-Pro but fits comfortably on 4×H100. Per-token compute roughly equivalent to a 22B dense model. Cheapest open-weight frontier to operate at small-cluster scale.
Best mid-cluster

Sparsity ratio · 28% · Mixtral 8x22B
39B active out of 141B total. Coarse-grained legacy pattern. Fits on 2×H100 or 1×H200. Per-token compute equivalent to a 40B dense model. Still the simplest open-weight MoE to run on a single dev box; usable if model quality is acceptable for the workload.
Single-box friendly

The trap is treating sparsity ratio as a pure win. Lower sparsity buys cheaper per-token compute, but the trade-off shows up in three places: VRAM resident (the full 1.6T must fit somewhere), inter-GPU traffic (expert all-to-all across 8+ GPUs is bandwidth-heavy), and tail-latency variance (more experts means more potential for imbalance). Each of these costs money in production, and not all of it lands on the per-token line.
05 — Serving
MoE shifts cost from compute to memory and bandwidth.
From a serving perspective, MoE is not a strict win — it is a trade. The compute savings are real, but they shift the hot path from FLOPs to two other resources: VRAM (because all experts must be resident) and inter-GPU bandwidth (because expert routing is inherently all-to-all). A serving stack that does not handle these two correctly will throw away most of the per-token compute savings.
- VRAM resident. DeepSeek V4-Pro at FP8 needs ~1.6 TB of VRAM just for parameters, plus another ~200 GB for KV cache and activations at long context. That does not fit a single 8×H100 node (640 GB of HBM): FP8 serving realistically starts at a 16×H200 deployment (~2.2 TB), and even 4-bit weight-only quantization (~0.8 TB of weights) still calls for 8×H200 or 16×H100.
- Inter-GPU bandwidth. Every token triggers expert all-to-all communication. With 49B active parameters routed across 256 experts sharded over the cluster, this sustains on the order of 30-40 GB/s of cross-GPU traffic under load. NVLink/NVSwitch handles it; cheap PCIe clusters do not, and that is the dominant reason frontier MoE is hard to self-host.
- Capacity tuning. Each expert has a configurable capacity (the maximum tokens it will accept per batch). Set too low, tokens get dropped or routed to a fallback path; set too high, memory is wasted. vLLM 0.7+ exposes this as expert_capacity, SGLang as experts_per_token; production stacks tune it per workload.
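To make the resident-VRAM bullet concrete, here is a hypothetical back-of-the-envelope helper (not part of any serving stack); the V4-Pro figures are the ones quoted above, and the 80 GB default models H100-class HBM.

```python
# Back-of-the-envelope resident-memory estimate for an MoE deployment.
# Hypothetical helper for illustration; real footprints depend on the
# quantization scheme, KV-cache layout, and the engine's expert-parallel sharding.

def moe_footprint(total_params_b: float, bytes_per_weight: float,
                  kv_cache_gb: float, hbm_per_gpu_gb: float = 80.0) -> dict:
    weights_gb = total_params_b * bytes_per_weight      # every expert stays resident
    resident_gb = weights_gb + kv_cache_gb
    min_gpus = int(-(-resident_gb // hbm_per_gpu_gb))   # ceiling division
    return {"weights_gb": weights_gb, "resident_gb": resident_gb, "min_gpus": min_gpus}

# DeepSeek V4-Pro figures as quoted in this post: 1.6T total parameters,
# FP8 weights (1 byte each), ~200 GB of KV cache and activations.
print(moe_footprint(1600, bytes_per_weight=1.0, kv_cache_gb=200))
# -> {'weights_gb': 1600.0, 'resident_gb': 1800.0, 'min_gpus': 23}
```

At 4-bit weights (bytes_per_weight=0.5) the same call returns ~1.0 TB resident, which is why an 8×H200-class node shows up as the practical single-node floor in the VRAM bullet above.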
"You don't pay for MoE in FLOPs. You pay in NVLink, in 1.6 TB of resident VRAM, and in the on-call engineer who knows what to do when expert load balance goes off."— Internal infra retrospective, May 2026
06 — Decision
How to pick an MoE for your stack.
For most agency or product teams the choice collapses to four options: closed-API frontier (GPT-5.5 or Claude Opus 4.7), open-weight frontier MoE (DeepSeek V4-Pro), open-weight mid-cluster MoE (Qwen 3 235B-MoE or Llama 4-MoE 70B), or single-box MoE (Mixtral 8x22B). The decision is governed by token volume, latency target, and how much inference engineering the team can sustain.
Profile A · Under 600M tokens/month, no infra team
Closed-API only. GPT-5.5, Claude Opus 4.7, or Gemini 3. Token spend dominates GPU spend at this scale; self-hosting is a distraction.
GPT-5.5 / Opus 4.7

Profile B · 1-5B tokens/month, one infra engineer
Open-weight mid-cluster. Qwen 3 235B-MoE on 4×H100 or Llama 4-MoE on 4-8×H100. Break-even with API around 1.2B tokens/month for chat, ~600M for completion. Self-hosting pays off if model swaps are rare.
Qwen 3 / Llama 4-MoE

Profile C · 5B+ tokens/month, dedicated team
Frontier open-weight MoE. DeepSeek V4-Pro on 16×H100 or 8×H200. Lowest per-token cost in the open ecosystem; needs serious inference engineering and constant capacity tuning. Pairs naturally with a closed-API fallback for spiky workloads.
DeepSeek V4-Pro

Profile D · Dev box / experimentation only
Single-box MoE. Mixtral 8x22B on 1×H200 or 2×H100. Quality per active parameter is far below V4-Pro, but there is no multi-node cluster to pay for. Right for prototyping, not for production.
Mixtral 8x22B

The mistake we see most often is a Profile A team (under 600M tokens/month, no infra engineer) spending three weeks evaluating DeepSeek V4-Pro self-hosting because the per-token rack rate looks irresistible. At their volume, the GPU cluster cost would dominate spend by a factor of four before counting engineer time. The right answer for that team is the closed-API rack rate plus aggressive caching, not heroic self-hosting.
The opposite mistake is Profile C, the 5B+ tokens/month team with dedicated infra, running everything on closed APIs because the migration feels expensive. At that volume the migration pays for itself within the first month, and keeps paying every month after. We have helped clients cut $1.8M/year in inference spend with a six-week move from closed-API to V4-Pro on 16×H100.
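The break-even arithmetic behind these profiles is simple enough to sketch. All the prices and the cluster rent below are illustrative placeholders, not quotes; with the assumed numbers it lands near the ~1.2B tokens/month figure cited for the mid-cluster profile.

```python
# Rough self-hosting break-even: the monthly token volume at which a fixed
# GPU cluster undercuts a per-token API bill. All numbers are illustrative.

def breakeven_tokens_per_month(api_price_per_mtok: float,
                               selfhost_marginal_per_mtok: float,
                               cluster_rent_per_month: float) -> float:
    """Monthly tokens at which self-hosting total cost equals the API bill."""
    saving_per_mtok = api_price_per_mtok - selfhost_marginal_per_mtok
    return cluster_rent_per_month / saving_per_mtok * 1e6

# Assumed: $10 blended API price per 1M tokens, $2 marginal self-host cost
# per 1M tokens (power, egress, amortized ops), $10K/month for a 4-GPU node.
print(f"{breakeven_tokens_per_month(10.0, 2.0, 10_000):,.0f} tokens/month")
# -> 1,250,000,000 tokens/month (≈ 1.3B)
```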
07 — Conclusion
MoE is the new dense.
Sparsity is the default. Dense is the deliberate exception.
By April 2026 the architectural question has flipped. In 2024 a new frontier model had to defend a decision to be MoE; in 2026 it has to defend a decision to be dense. Anthropic's Claude line is the only credible hold-out, and the public arguments for that position — interpretability, alignment uniformity, predictable latency — are real but increasingly narrow as the rest of the industry matures the production patterns around MoE.
For agency and product teams, the practical takeaway is to stop evaluating models on raw parameter counts and start evaluating them on three variables: active parameters per token, sparsity ratio, and routing pattern. Those three governing variables predict cost, latency, and hardware fit better than any benchmark column.
The next architectural compression — moving past the FFN expert layer into more aggressive attention compression (DeepSeek's MLA and CSA/HCA, the Mamba-MoE hybrids) — is already in flight. Expect 2027's frontier to land near 1% sparsity at 5T total, with attention costs dropped another 5-10× through latent compression. The window of MoE-as-novelty is closed; the window of MoE-as-substrate is the one to optimize for.