SYS/2026.Q1Agentic SEO audits delivered in 72 hoursSee how →
AI DevelopmentNew Release10 min readPublished Apr 24, 2026

Two open-weight MoE models · 1M context · 27%of V3.2's inference cost

DeepSeek V4 Launches: 1.6T MoE, 1M Context, 10% KV

DeepSeek V4 Preview shipped on April 24, 2026 — V4-Pro at 1.6T / 49B active and V4-Flash at 284B / 13B. The story isn't raw capability; it's a hybrid attention architecture that makes million-token context economically feasible. Weights on Hugging Face, API live, reasoning modes built in.

DA
Digital Applied Team
Senior strategists · Published Apr 24, 2026
PublishedApr 24, 2026
Read time10 min
SourcesDeepSeek technical report
V4-Pro @ 1M context
27%
of V3.2's FLOPs
−73 vs V3.2
V4-Pro KV cache
10%
of V3.2 baseline
−90 vs V3.2
Codeforces rating
3206
open-model SOTA
+154 vs GPT-5.4
Putnam-2025 proof
120/120
perfect score

DeepSeek published the DeepSeek-V4 Preview on April 24, 2026 — a new open-weight Mixture-of-Experts series that stakes the lab's thesis on a single claim: million-token context processing is no longer a capability problem, it's an efficiency problem.

Two models shipped today. V4-Pro packs 1.6 trillion total parameters with 49 billion activated per token. V4-Flash is the efficient sibling at 284 billion total and 13 billion active. Both support native 1M context, both are open weights on Hugging Face, and both are already live on the DeepSeek API and chat.deepseek.com as Expert Mode and Instant Mode respectively.

This guide covers what actually launched, the attention architecture that makes the efficiency story real, honest benchmark positioning versus GPT-5.4, Gemini-3.1-Pro, and Claude Opus 4.6, and how to start using V4 in your own stack today. Everything below is sourced from DeepSeek's technical report published alongside the launch.

Key takeaways
  1. 01
    Two open-weight MoE models shipped together.V4-Pro at 1.6T total / 49B active and V4-Flash at 284B / 13B. Weights are on Hugging Face; the API and chat.deepseek.com Expert/Instant modes are live same-day.
  2. 02
    The headline is efficiency, not raw capability.At 1M-token context, V4-Pro uses 27% of V3.2's single-token inference FLOPs and 10% of the KV cache; V4-Flash drops further to 10% of FLOPs and 7% of KV.
  3. 03
    New hybrid attention — CSA plus HCA.Compressed Sparse Attention keeps a ~1/m-sized KV plus a top-k selector; Heavily Compressed Attention folds many more tokens into a single entry. Interleaving the two makes 1M context affordable.
  4. 04
    Three reasoning modes replace one-shot inference.Non-Think, Think High, and Think Max modes toggle via the <think> token and a Max-only prepended system prompt. Post-training replaces mixed RL with On-Policy Distillation from domain specialists.
  5. 05
    Open SOTA, honestly 3 to 6 months behind frontier.V4-Pro-Max sets open-model highs on LiveCodeBench (93.5) and Codeforces (3206), proof-perfect 120/120 on Putnam-2025, while trailing GPT-5.4 and Gemini-3.1-Pro on MMLU-Pro and GPQA Diamond.

01What ShippedA Preview release, live on three surfaces.

The launch is a Preview release — DeepSeek's term for a production model that the lab considers complete enough for API traffic and open distribution, but that may still evolve before a full "V4" branding. Three surfaces went live simultaneously: open weights on Hugging Face, the DeepSeek API updated to the new models, and chat.deepseek.com exposing V4-Pro as Expert Mode and V4-Flash as Instant Mode.

Both models share a single architectural stack. The differences are the expert count, the hidden size, and the training budget. The paper is explicit that Flash-Base already surpasses V3.2-Base across the majority of benchmarks despite roughly 42% of V3.2's total parameter count — largely because of the architectural and data-quality improvements documented in the sections below.

Expert Mode
DeepSeek-V4-Pro
1.6T total · 49B active · 33T tokens

Frontier open-weight MoE. Native 1M context, three reasoning modes, Codeforces 3206 and a perfect 120/120 on Putnam-2025. Available as Expert Mode on chat.deepseek.com.

huggingface.co/deepseek-ai/DeepSeek-V4-Pro
Instant Mode
DeepSeek-V4-Flash
284B total · 13B active · 32T tokens

Efficient sibling. Surpasses V3.2-Base across most benchmarks at ~42% of V3.2's parameter count. 10% of V3.2's FLOPs and 7% of its KV cache at 1M context.

huggingface.co/deepseek-ai/DeepSeek-V4-Flash
Release snapshot
DeepSeek-V4 Preview launched April 24, 2026 on Hugging Face, the DeepSeek API, and chat.deepseek.com. Checkpoints are in the DeepSeek-V4 collection on Hugging Face with a reference inference implementation in the same repo. Launch pricing on the DeepSeek API: V4-Flash at $0.14 / $0.28 per 1M tokens (input / output) and V4-Pro at $0.435 / $0.87 on a 75%-off launch promotion through May 31, 2026 — list rates of $1.74 / $3.48 apply after.

The release follows the pattern DeepSeek established with V3.1 and V3.2: permissively-licensed open weights, a detailed technical report published alongside the release, and a reference inference implementation in the same repository. Always verify the exact license text on the Hugging Face repo before shipping production workloads.

02EfficiencyThe real headline: 27% FLOPs, 10% KV cache.

The central claim of the V4 paper sits in a single sentence of the abstract: at a one-million-token context length, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared with V3.2. V4-Flash goes further — 10% of the FLOPs and 7% of the KV cache. Those are not theoretical numbers; they are DeepSeek's measurements of equivalent FP8 FLOPs against its own prior-generation model on the same hardware.

Inference cost at 1M-token context · V4 vs V3.2 baseline

Source: DeepSeek-V4 technical report
V3.2 baseline671B total / 37B active
100%
V4-Pro FLOPs1.6T / 49B active · single-token inference
27%
V4-Pro KV cache1M-token context window
10%
V4-Flash FLOPs284B / 13B active · single-token inference
10%
V4-Flash KV cache1M-token context window
7%

The practical implication is worth reading twice. V4-Pro has more than double the active parameters of V3.2 (49B vs 37B) and more than twice the total parameters. By every intuition built on dense transformer scaling, inference should cost more — not ~3.7× less. The efficiency gain comes entirely from the attention redesign in Section 03, combined with FP4-quantized routed experts and a fundamentally different KV cache management strategy.

For teams operating long-context workloads — full-codebase analysis, multi-document reasoning, legal or financial corpus Q&A — this changes the math. A workload that required $1 of compute on V3.2 at 1M tokens costs roughly $0.27 on V4-Pro with meaningfully stronger capability, or roughly $0.10 on V4-Flash with modestly weaker knowledge but comparable reasoning. That is the entire thesis of this release.

"Million-token context processing is no longer a capability problem — it's an efficiency problem. The hybrid attention stack in V4 is designed around that."— DeepSeek-V4 technical report, §1 Abstract

03Hybrid AttentionCSA plus HCA, interleaved across layers.

DeepSeek-V4's attention stack interleaves two distinct compression strategies across layers. Neither is exotic in isolation; the contribution is the hybrid configuration and the engineering to make it train stably at this scale.

Compressed Sparse Attention (CSA)

What it is. CSA takes the Key-Value cache for every m tokens and compresses that block into a single entry — so the effective KV sequence is 1/m the length of the raw token sequence. Then, for each query token, a learned Lightning Indexer scores those compressed blocks and a top-k selector picks only the most relevant ones to attend to. A sliding window of recent uncompressed tokens is concatenated alongside so that local fine-grained dependencies are preserved.

Why it matters. Two savings stack: the KV cache itself is 1/m the size of a dense attention KV, and even among the compressed entries, attention is sparse (top-k only). CSA is a generalization of DeepSeek Sparse Attention from V3.2 — same idea, more aggressive compression.

Heavily Compressed Attention (HCA)

What it is. HCA applies much more aggressive compression — every m' tokens (with m' ≫ m) fold into a single KV entry. The trade-off is that HCA skips the sparse selection step entirely; attention over the compressed KV remains dense.

Why it matters. HCA is designed for the layers where retaining a broad, low-resolution view of the full context is more valuable than fine-grained selection. Interleaving HCA blocks with CSA blocks gives the model both modes — precise look-up on some layers, smeared global summary on others.

Attention stack · illustrative layer interleaving
L 01
L 02
L 03
L 04
CSA — 1/m compressed + top-k sparseHCA — heavy compression, denseEdge layers (dense)
The key insight
The efficiency numbers above are not achievable by either CSA or HCA alone. DeepSeek tried variants in ablations; the interleaved hybrid is what holds long-context quality while dropping FLOPs and KV cache by an order of magnitude.

04Architecture StackmHC, Muon, and what carries over from V3.

Beyond the attention redesign, V4 introduces two further innovations and retains the best parts of the V3 stack. The net effect is a model that trains more stably at larger scale than any prior DeepSeek release.

Manifold-Constrained Hyper-Connections (mHC)

mHC replaces the conventional residual connections between Transformer blocks. In standard Hyper-Connections, the residual mapping can amplify signals in unstable ways when many layers stack — causing numerical blow-ups during training. mHC constrains the residual mapping to lie on the Birkhoff polytope (the manifold of doubly stochastic matrices), which bounds the spectral norm to ≤ 1 and makes signal propagation non-expansive by construction. The result: signals stay numerically stable across very deep stacks, which is what unlocks the 1.6T parameter scale at all.

Muon Optimizer

V4 trains with Muonrather than AdamW. In DeepSeek's setup, Muon delivers faster convergence and better training stability, though the paper is careful to note that several training-time stabilizers — Anticipatory Routing and SwiGLU clamping — were still required to keep loss spikes under control at scale.

Carried over from V3

  • DeepSeekMoE — the fine-grained routed-expert FFN framework, with a small tweak: V4 uses Sqrt(Softplus) rather than Sigmoid for affinity scoring, removes the cap on routing target nodes, and replaces the dense FFN layers in the first few Transformer blocks with Hash-routed MoE layers.
  • Multi-Token Prediction (MTP) — retained unchanged from V3. Still used to accelerate inference and to improve training signal.
  • Auxiliary-loss-free load balancing — with a mild sequence-wise balance loss added on top to avoid extreme expert imbalance inside individual sequences.

05Pre-Training33T / 32T tokens, FP4 QAT.

V4-Pro is pre-trained on 33 trillion tokens, V4-Flash on 32 trillion. Both training runs use FP4 quantization-aware training for the routed expert weights and the indexer query/key path, while keeping non-expert computation in FP8. The practical effect today is a smaller memory footprint during training and inference; the paper notes that on current hardware, peak throughput for FP4×FP8 operations is identical to FP8×FP8, but explicitly flags that purpose-built hardware could make FP4 roughly 1.33× more efficient than FP8 — an open lane for future inference gains.

Training stability was actively managed. The paper introduces Anticipatory Routing, which decouples routing updates from the backbone network by one step, fetched in advance — triggered automatically when a loss spike is detected. Combined with SwiGLU clamping (the linear component clamped to [-10, 10], upper gate capped at 10), the authors were able to avoid loss-spike recovery without compromising final-model quality.

Pre-training tokens
33T
V4-Pro dataset

Trained with FP4 quantization-aware training on routed expert weights, FP8 for non-expert computation. Anticipatory Routing handles loss-spike recovery.

V4-Flash: 32T
MMLU-Base lift
88.7
Flash-Base vs V3.2

V4-Flash-Base hits 88.7 on MMLU versus 87.8 for V3.2-Base — despite Flash having 284B total / 13B active vs V3.2's 671B / 37B. Less than half the parameters, higher score.

<½ params
Long-context base
51.5
LongBench-V2 · Pro-Base

V4-Pro-Base reaches 51.5 on LongBench-V2 vs 40.2 for V3.2-Base. The hybrid attention earns its keep most visibly on long-context scenarios, where the numeric step-up is largest.

+11.3 vs V3.2

06Post-TrainingOn-Policy Distillation and three reasoning modes.

V4's post-training pipeline is the second big break from V3-series practice. The mixed-RL stage — previously used to consolidate capabilities across domains — is entirely replaced by On-Policy Distillation (OPD).

The sequence: train a separate specialist model for each target domain (math, code, agent, instruction following) via Supervised Fine-Tuning followed by Reinforcement Learning using Group Relative Policy Optimization (GRPO). Those specialists each become state-of-the-art in their respective field. Then train a single unified model via multi-teacher OPD, where the unified model is the student and the specialists are teachers — the student optimizes a reverse-KL loss against teacher output distributions on its own generated trajectories. The result: one model that inherits the specialists' capabilities without their per-domain narrowness.

Three reasoning modes

Each V4 model supports three inference modes, distinguished by how they use the <think> / </think> tokens:

Fast
Non-Think
</think> summary

Intuition-style output with no deliberate chain-of-thought. Appropriate for routine, low-risk tasks where latency and cost dominate.

Low cost · Low latency
Default
Think High
<think> trace </think>

Explicit reasoning trace before the answer. The right default for medium-risk problem solving — code review, multi-step analysis, structured retrieval.

Production default
Frontier
Think Max
prepended prompt + extended trace

Expanded context budget, reduced length penalties, Max-only system prompt demanding exhaustive decomposition and edge-case stress-testing. Produces V4-Pro's strongest benchmark numbers.

Max tokens · Max accuracy

Think Max is what produces V4-Pro-Max's strongest benchmark numbers in Section 07. At inference time, Max mode adds a prepended instruction demanding "absolute maximum" reasoning — explicitly decomposing the problem, documenting rejected hypotheses, stress-testing against edge cases — and uses a meaningfully larger context budget and reduced length penalty than High mode. The cost is output token count; the payoff is the frontier-competitive numbers on hard reasoning.

Two operational changes worth knowing

  • DSML XML tool-call schema.V4 replaces V3.2's tool-call format with an XML-based schema using dedicated |DSML| tokens. The paper reports fewer escaping failures and tool-call errors; practically, any agent scaffolding calling V4 should expect a slightly different tool-invocation format than V3.2 used.
  • Interleaved Thinking. Unlike V3.2, which discarded reasoning traces at the start of each new user turn, V4 retains the complete reasoning history across tool calls and user messages during tool-using conversations. That preserves coherent long-horizon chains of thought for agent tasks, at the cost of more context consumption.

07BenchmarksWhere V4-Pro-Max leads, matches, and trails.

The chart below is a direct subset of Table 6 from the V4 paper, comparing V4-Pro-Max against the strongest publicly evaluated modes of Claude Opus 4.6 (Max), GPT-5.4 (xHigh), and Gemini-3.1-Pro (High). Orange bars mark V4-Pro-Max scores where V4 leads the field; blue bars mark where a closed-frontier model leads.

V4-Pro-Max vs frontier · selected benchmarks

Source: DeepSeek-V4 report, Table 6
LiveCodeBenchPass@1 · V4-Pro-Max 93.5 · Gemini 91.7
93.5
V4 wins
CodeforcesRating · ~23rd among human contestants
3206
V4 wins
Apex ShortlistPass@1 · Gemini 89.1 · V4 90.2
90.2
V4 wins
Putnam-2025Proof-graded · 120/120 perfect
120/120
V4 wins
MMLU-ProEM · Gemini leads at 91.0
91.0
Gemini 3.1
GPQA DiamondPass@1 · V4 90.1 · Gemini 94.3
94.3
Gemini 3.1
SimpleQA-VerifiedV4 57.9 · Gemini 75.6
75.6
Gemini 3.1
MRCR 1MLong-context retrieval · Opus 92.9 · V4 83.5
92.9
Opus 4.6
V4-Pro-Max leadsClosed-frontier model leads

DeepSeek's own framing is the honest version: V4 trails the absolute frontier by approximately 3 to 6 months on general knowledge and the hardest retrieval workloads, while setting new open-model highs on competitive programming and formal reasoning. Open Codeforces rating of 3206 places the model roughly 23rd among human contest participants. The 120/120 Putnam-2025 is proof-perfect — every solution a valid, graded proof rather than a numeric answer.

"Strong enough to be a serious option, honest enough to set expectations."— Our reading of the V4 paper's positioning, §8 Conclusion

08Access V4 TodayWeights, API, chat — all live.

Three paths are live today. Pick the one that matches the workload: open weights for on-prem or fine-tuning, the API for production integration, or the chat UI for exploration and team evaluation.

Surface
huggingface.co/deepseek-ai
Model exposure
V4-Pro + V4-Flash weights
Best for
On-prem deployment, fine-tuning, quantization for edge, sovereignty-bound workloads. Flash is tractable on modest clusters; Pro needs serious hardware.
Surface
api.deepseek.com
Model exposure
Non-Think / Think High / Think Max
Best for
Production integration. Reasoning mode selected via response format + system prompt. No pricing published at launch.
Surface
chat.deepseek.com
Model exposure
Expert (Pro) · Instant (Flash)
Best for
Exploration, team evaluation, product UX, quick prompt testing. Expert Mode = V4-Pro, Instant Mode = V4-Flash.

For organizations with data-sovereignty or sector-compliance requirements, V4's open weights plus the 1M-context efficiency story make it the strongest open candidate today for on-prem long-document RAG replacement. For most agencies and engineering teams, the practical starting point is the API and chat surfaces — benchmark on your own prompts, measure token spend and latency, decide per-workload before considering on-prem. If you're deciding between V4 and closed frontier for specific pipelines, our AI digital transformation engagements start with exactly this kind of comparative eval.

09ImplicationsWhat this means for agencies and engineering teams.

V4's release changes the practical decision tree for three specific workload classes. For everything else, the frontier picture hasn't moved.

Code automation
Competitive programming & formal reasoning

93.5 LiveCodeBench, 3206 Codeforces, 120/120 Putnam-2025 is the strongest open-model signal ever released. Benchmark V4 against your current stack on your own repos before switching defaults.

Pick V4-Pro-Max
Long-context RAG
On-prem document agents

V4's open weights plus 1M-context efficiency make it the strongest open candidate today for on-prem long-document RAG in sovereignty-bound sectors. Trails Claude Opus 4.6 on MRCR 1M — pick per-workload.

Pick V4-Pro open weights
General knowledge work
Broad Q&A & retrieval

V4 trails GPT-5.4 xHigh and Gemini-3.1-Pro on MMLU-Pro, GPQA Diamond, and SimpleQA-Verified. Stay with closed frontier for generalist knowledge work until the gap closes.

Stay with frontier
Production architecture
Multi-vendor routing

Default to GPT-5.5 for agentic coding broadly; route competitive-programming and Putnam-style workloads to V4-Pro-Max; keep Opus 4.6 for MRCR 1M retrieval; Gemini 3.1 Pro for price-sensitive bulk long-context.

Route by task class

For teams currently running V3.2 in production, the migration is straightforward — same architectural family, same weight-distribution story, meaningfully lower inference cost at 1M context, and three reasoning modes that replace the per-task prompt engineering V3.2 typically needed. For teams evaluating open weights for the first time because of sovereignty or cost, V4-Flash is the right starting point: 284B total / 13B active is tractable on modest clusters, and Flash-Base already surpasses V3.2-Base on most benchmarks.

10ConclusionThe strongest open release of the quarter.

The shape of open frontier, April 2026

Million-token context is no longer a capability question — it's a cost question.

DeepSeek V4 Preview is the most consequential open-weight release of the quarter. Two tightly-related models, a hybrid attention architecture that makes 1M context genuinely economical rather than aspirational, and a post-training pipeline that replaces one-shot inference with three explicit reasoning modes — all shipped on day one as weights, API, and chat.

The honest framing from the paper itself is the right one: V4 trails absolute frontier by three to six months on general knowledge, and sets new open-model highs on competitive programming and formal reasoning. That's enough to be a serious option for specific workload classes — not enough to displace closed frontier for general knowledge work. The practical move is to run your own evals on the specific prompts you care about, not to treat the headline numbers as a vendor decision.

The broader signal is clearer: efficiency, not raw capability, is the axis that matters for the next generation of open models. When a release doubles your active parameter count and still cuts your inference bill by three-quarters, the question stops being "which model is smartest" and becomes "which model is cheap enough to actually run the workload I care about at the scale I care about it." V4 is the first open model to convincingly land on that side of the line.

Deploy open-weight frontier in production

Open weights plus 1M context make on-prem long-document RAG genuinely viable.

Our team helps businesses evaluate, benchmark, and operate open-weight frontier models — including V4 — for code automation, long-context retrieval, and sovereign deployment, delivered in days not quarters.

Free consultationExpert guidanceTailored solutions
What we work on

Open-weight model engagements

  • V4 benchmarking against closed frontier on your corpus
  • On-prem long-document RAG — sovereignty-bound sectors
  • Fine-tuning & quantization for production deployment
  • Multi-vendor routing — V4 / GPT-5.5 / Opus 4.6 / Gemini
  • Cost & governance programs for open + closed mix
FAQ · DeepSeek V4 guide

The questions we get every week.

DeepSeek V4 Preview is a new series of open-weight Mixture-of-Experts language models from DeepSeek-AI, announced on April 24, 2026. The release includes two variants: V4-Pro with 1.6 trillion total parameters (49 billion activated per token) and V4-Flash with 284 billion total parameters (13 billion activated). Both support a native 1-million-token context window. Model weights are published on Hugging Face, the DeepSeek API has been updated the same day, and chat.deepseek.com now exposes the models as Expert Mode (Pro) and Instant Mode (Flash).