DeepSeek published the DeepSeek-V4 Preview on April 24, 2026 — a new open-weight Mixture-of-Experts series that stakes the lab's thesis on a single claim: million-token context processing is no longer a capability problem, it's an efficiency problem.
Two models shipped today. V4-Pro packs 1.6 trillion total parameters with 49 billion activated per token. V4-Flash is the efficient sibling at 284 billion total and 13 billion active. Both support native 1M context, both are open weights on Hugging Face, and both are already live on the DeepSeek API and chat.deepseek.com as Expert Mode and Instant Mode respectively.
This guide covers what actually launched, the attention architecture that makes the efficiency story real, honest benchmark positioning versus GPT-5.4, Gemini-3.1-Pro, and Claude Opus 4.6, and how to start using V4 in your own stack today. Everything below is sourced from DeepSeek's technical report published alongside the launch.
- 01 Two open-weight MoE models shipped together. V4-Pro at 1.6T total / 49B active and V4-Flash at 284B / 13B. Weights are on Hugging Face; the API and chat.deepseek.com Expert/Instant modes are live same-day.
- 02 The headline is efficiency, not raw capability. At 1M-token context, V4-Pro uses 27% of V3.2's single-token inference FLOPs and 10% of the KV cache; V4-Flash drops further to 10% of FLOPs and 7% of KV.
- 03 New hybrid attention: CSA plus HCA. Compressed Sparse Attention keeps a ~1/m-sized KV plus a top-k selector; Heavily Compressed Attention folds many more tokens into a single entry. Interleaving the two makes 1M context affordable.
- 04 Three reasoning modes replace one-shot inference. Non-Think, Think High, and Think Max modes toggle via the <think> token and a Max-only prepended system prompt. Post-training replaces mixed RL with On-Policy Distillation from domain specialists.
- 05 Open SOTA, honestly 3 to 6 months behind frontier. V4-Pro-Max sets open-model highs on LiveCodeBench (93.5) and Codeforces (3206), proof-perfect 120/120 on Putnam-2025, while trailing GPT-5.4 and Gemini-3.1-Pro on MMLU-Pro and GPQA Diamond.
01 — What Shipped
A Preview release, live on three surfaces.
The launch is a Preview release — DeepSeek's term for a production model that the lab considers complete enough for API traffic and open distribution, but that may still evolve before a full "V4" branding. Three surfaces went live simultaneously: open weights on Hugging Face, the DeepSeek API updated to the new models, and chat.deepseek.com exposing V4-Pro as Expert Mode and V4-Flash as Instant Mode.
Both models share a single architectural stack. The differences are the expert count, the hidden size, and the training budget. The paper is explicit that Flash-Base already surpasses V3.2-Base across the majority of benchmarks despite having roughly 42% of V3.2's total parameter count, largely because of the architectural and data-quality improvements documented in the sections below.
DeepSeek-V4-Pro
1.6T total · 49B active · 33T tokens
Frontier open-weight MoE. Native 1M context, three reasoning modes, Codeforces 3206 and a perfect 120/120 on Putnam-2025. Available as Expert Mode on chat.deepseek.com.
huggingface.co/deepseek-ai/DeepSeek-V4-Pro
DeepSeek-V4-Flash
284B total · 13B active · 32T tokens
Efficient sibling. Surpasses V3.2-Base across most benchmarks at ~42% of V3.2's parameter count. 10% of V3.2's FLOPs and 7% of its KV cache at 1M context.
huggingface.co/deepseek-ai/DeepSeek-V4-Flash
The release follows the pattern DeepSeek established with V3.1 and V3.2: permissively licensed open weights, a detailed technical report published alongside the release, and a reference inference implementation in the same repository. Always verify the exact license text on the Hugging Face repo before shipping production workloads.
02 — Efficiency
The real headline: 27% FLOPs, 10% KV cache.
The central claim of the V4 paper sits in a single sentence of the abstract: at a one-million-token context length, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared with V3.2. V4-Flash goes further — 10% of the FLOPs and 7% of the KV cache. Those are not theoretical numbers; they are DeepSeek's measurements of equivalent FP8 FLOPs against its own prior-generation model on the same hardware.
Chart: Inference cost at 1M-token context · V4 vs V3.2 baseline (source: DeepSeek-V4 technical report)
The practical implication is worth reading twice. V4-Pro has roughly a third more active parameters than V3.2 (49B vs 37B) and more than twice the total parameters (1.6T vs 671B). By every intuition built on dense transformer scaling, inference should cost more, not ~3.7× less. The efficiency gain comes from the attention redesign in Section 03, combined with FP4-quantized routed experts and a fundamentally different KV cache management strategy.
For teams operating long-context workloads — full-codebase analysis, multi-document reasoning, legal or financial corpus Q&A — this changes the math. A workload that required $1 of compute on V3.2 at 1M tokens costs roughly $0.27 on V4-Pro with meaningfully stronger capability, or roughly $0.10 on V4-Flash with modestly weaker knowledge but comparable reasoning. That is the entire thesis of this release.
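The arithmetic behind those figures can be sketched as follows. This is a minimal illustration that assumes compute cost scales linearly with single-token inference FLOPs; the ratios are the ones quoted above, while the function names and the $1 baseline are hypothetical.

```python
# Relative cost at 1M-token context, normalized so that V3.2 = 1.0.
# Ratios are the figures quoted from the report; everything else is illustrative.
RELATIVE_FLOPS = {"V3.2": 1.00, "V4-Pro": 0.27, "V4-Flash": 0.10}
RELATIVE_KV_CACHE = {"V3.2": 1.00, "V4-Pro": 0.10, "V4-Flash": 0.07}


def scaled_cost(baseline_cost: float, model: str) -> float:
    """Scale a V3.2 compute cost by the model's FLOPs ratio.

    Assumes cost tracks FLOPs linearly, which ignores memory bandwidth,
    batching, and pricing decisions.
    """
    return baseline_cost * RELATIVE_FLOPS[model]


def reduction_factor(model: str) -> float:
    """How many times cheaper the model is than V3.2 under the same assumption."""
    return 1.0 / RELATIVE_FLOPS[model]
```

Under these assumptions, `scaled_cost(1.0, "V4-Pro")` gives 0.27 and `reduction_factor("V4-Pro")` rounds to 3.7, matching the "~3.7× less" figure in the text.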
"Million-token context processing is no longer a capability problem — it's an efficiency problem. The hybrid attention stack in V4 is designed around that."— DeepSeek-V4 technical report, §1 Abstract
03 — Hybrid Attention
CSA plus HCA, interleaved across layers.
DeepSeek-V4's attention stack interleaves two distinct compression strategies across layers. Neither is exotic in isolation; the contribution is the hybrid configuration and the engineering to make it train stably at this scale.
Compressed Sparse Attention (CSA)
What it is. CSA takes the Key-Value cache for every m tokens and compresses that block into a single entry — so the effective KV sequence is 1/m the length of the raw token sequence. Then, for each query token, a learned Lightning Indexer scores those compressed blocks and a top-k selector picks only the most relevant ones to attend to. A sliding window of recent uncompressed tokens is concatenated alongside so that local fine-grained dependencies are preserved.
Why it matters. Two savings stack: the KV cache itself is 1/m the size of a dense attention KV, and even among the compressed entries, attention is sparse (top-k only). CSA is a generalization of DeepSeek Sparse Attention from V3.2 — same idea, more aggressive compression.
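The three CSA steps described above (compress, select, add a local window) can be sketched in plain Python. This is a toy illustration, not the report's implementation: mean pooling stands in for the learned compression, a dot product stands in for the Lightning Indexer score, and all function names are hypothetical.

```python
from typing import List

Vector = List[float]


def compress_blocks(kv: List[Vector], m: int) -> List[Vector]:
    """Fold every m consecutive KV vectors into one entry.

    Mean pooling is an assumption; the real model learns its compression.
    """
    blocks = [kv[i:i + m] for i in range(0, len(kv), m)]
    return [[sum(col) / len(block) for col in zip(*block)] for block in blocks]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_top_k(query: Vector, compressed: List[Vector], k: int) -> List[int]:
    """Score each compressed entry against the query and keep the k best indices."""
    ranked = sorted(range(len(compressed)),
                    key=lambda i: dot(query, compressed[i]),
                    reverse=True)
    return sorted(ranked[:k])


def csa_attended_entries(query: Vector, kv: List[Vector],
                         m: int, k: int, window: int) -> List[Vector]:
    """Entries one query attends to: top-k compressed blocks plus recent raw tokens."""
    compressed = compress_blocks(kv, m)
    chosen = select_top_k(query, compressed, k)
    recent = kv[-window:] if window > 0 else []
    return [compressed[i] for i in chosen] + recent
```

With 8 raw tokens, `m=4`, `k=1`, and `window=2`, the query attends to 3 entries instead of 8, which is the stacking of the two savings: fewer stored entries, and fewer of those actually read.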
Heavily Compressed Attention (HCA)
What it is. HCA applies much more aggressive compression — every m' tokens (with m' ≫ m) fold into a single KV entry. The trade-off is that HCA skips the sparse selection step entirely; attention over the compressed KV remains dense.
Why it matters. HCA is designed for the layers where retaining a broad, low-resolution view of the full context is more valuable than fine-grained selection. Interleaving HCA blocks with CSA blocks gives the model both modes — precise look-up on some layers, smeared global summary on others.
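To make the trade-off between the two layer types concrete, here is a small counting sketch. The values of `m`, `m_prime`, `top_k`, and `window` are hypothetical placeholders chosen only for illustration; the text above does not give the real settings, and the sketch does not model any local window HCA layers may also use.

```python
import math


def stored_entries(seq_len: int, block: int) -> int:
    """KV entries kept when every `block` tokens fold into one entry."""
    return math.ceil(seq_len / block)


def attended_entries(kind: str, seq_len: int, m: int, m_prime: int,
                     top_k: int, window: int) -> int:
    """Entries a single query reads in one layer of the given kind."""
    if kind == "csa":
        # Sparse: only top-k compressed blocks, plus the uncompressed local window.
        return min(top_k, stored_entries(seq_len, m)) + min(window, seq_len)
    if kind == "hca":
        # Dense over a much shorter, heavily compressed sequence.
        return stored_entries(seq_len, m_prime)
    raise ValueError(f"unknown layer kind: {kind}")


# Hypothetical settings, for illustration only.
SEQ_LEN = 1_000_000
M, M_PRIME, TOP_K, WINDOW = 4, 128, 512, 128
```

Under these made-up settings a CSA layer stores 250,000 compressed entries but reads only 640 per query, while an HCA layer stores and reads 7,813. Either way the per-query work is far below the 1,000,000 entries a dense layer would touch.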
04 — Architecture Stack
mHC, Muon, and what carries over from V3.
Beyond the attention redesign, V4 introduces two further innovations and retains the best parts of the V3 stack. The net effect is a model that trains more stably at larger scale than any prior DeepSeek release.
Manifold-Constrained Hyper-Connections (mHC)
mHC replaces the conventional residual connections between Transformer blocks. In standard Hyper-Connections, the residual mapping can amplify signals in unstable ways when many layers stack — causing numerical blow-ups during training. mHC constrains the residual mapping to lie on the Birkhoff polytope (the manifold of doubly stochastic matrices), which bounds the spectral norm to ≤ 1 and makes signal propagation non-expansive by construction. The result: signals stay numerically stable across very deep stacks, which is what unlocks the 1.6T parameter scale at all.
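The non-expansive property is easy to check numerically. The sketch below uses Sinkhorn-Knopp normalization to push a small matrix toward the doubly stochastic set; that projection method is an assumption chosen for illustration, not something the text above attributes to DeepSeek, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn(logits: Matrix, iterations: int = 50) -> Matrix:
    """Alternately normalize rows and columns of exp(logits).

    For a strictly positive matrix this converges toward a doubly
    stochastic matrix (all rows and all columns sum to 1).
    """
    mat = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iterations):
        mat = [[x / sum(row) for x in row] for row in mat]
        col_sums = [sum(col) for col in zip(*mat)]
        mat = [[x / col_sums[j] for j, x in enumerate(row)] for row in mat]
    return mat


def apply(mat: Matrix, vec: List[float]) -> List[float]:
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def l2_norm(vec: List[float]) -> float:
    return math.sqrt(sum(v * v for v in vec))
```

For any doubly stochastic `H` and any vector `x`, `l2_norm(apply(H, x))` never exceeds `l2_norm(x)`, so stacking many such mixing steps cannot amplify the residual stream.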
Muon Optimizer
V4 trains with Muon rather than AdamW. In DeepSeek's setup, Muon delivers faster convergence and better training stability, though the paper is careful to note that several training-time stabilizers (Anticipatory Routing and SwiGLU clamping) were still required to keep loss spikes under control at scale.
Carried over from V3
- DeepSeekMoE — the fine-grained routed-expert FFN framework, with a small tweak: V4 uses Sqrt(Softplus) rather than Sigmoid for affinity scoring, removes the cap on routing target nodes, and replaces the dense FFN layers in the first few Transformer blocks with Hash-routed MoE layers.
- Multi-Token Prediction (MTP) — retained unchanged from V3. Still used to accelerate inference and to improve training signal.
- Auxiliary-loss-free load balancing — with a mild sequence-wise balance loss added on top to avoid extreme expert imbalance inside individual sequences.
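The affinity-scoring change in the DeepSeekMoE bullet above can be illustrated with a small routing sketch. Only the two scoring functions come from the text; the top-k selection and the normalization of the selected scores into gate weights are assumptions made for the example, and the function names are hypothetical.

```python
import math
from typing import Callable, Dict, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x: float) -> float:
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sqrt_softplus(x: float) -> float:
    return math.sqrt(softplus(x))


def route_top_k(logits: List[float], k: int,
                score_fn: Callable[[float], float]) -> Dict[int, float]:
    """Pick the k highest-affinity experts and normalize their scores into gates."""
    scores = [score_fn(x) for x in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}
```

Both functions are monotonic, so they select the same experts for the same logits. The difference is in the gate weights: sigmoid saturates near 1 for large logits, while Sqrt(Softplus) keeps growing, so strongly preferred experts remain distinguishable.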
05 — Pre-Training
33T / 32T tokens, FP4 QAT.
V4-Pro is pre-trained on 33 trillion tokens, V4-Flash on 32 trillion. Both training runs use FP4 quantization-aware training for the routed expert weights and the indexer query/key path, while keeping non-expert computation in FP8. The practical effect today is a smaller memory footprint during training and inference; the paper notes that on current hardware, peak throughput for FP4×FP8 operations is identical to FP8×FP8, but explicitly flags that purpose-built hardware could make FP4 roughly 1.33× more efficient than FP8 — an open lane for future inference gains.
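As a rough picture of what quantization-aware training simulates in the forward pass, the sketch below rounds a block of weights to a 4-bit grid and back. It assumes the common E2M1 FP4 value set and a single scale per block; the text above does not specify the FP4 format or scaling scheme DeepSeek used, so both are assumptions, as is the function name.

```python
import math
from typing import List

# Magnitudes representable in the E2M1 4-bit float format (assumed here).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize_fp4(weights: List[float]) -> List[float]:
    """Round each weight to the nearest FP4 grid point after per-block scaling.

    The output stays in full precision ("fake" quantization), so the rest of
    the forward pass sees the values the quantized model would produce.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / FP4_GRID[-1]  # map the largest magnitude onto 6.0
    quantized = []
    for w in weights:
        magnitude = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        quantized.append(math.copysign(magnitude * scale, w))
    return quantized
```

In a typical QAT setup the rounding is applied in the forward pass only, with gradients passed through as if it were the identity; whether V4 does exactly this is not stated in the text above.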
Training stability was actively managed. The paper introduces Anticipatory Routing, which decouples routing updates from the backbone network by one step, with routing fetched in advance; it is triggered automatically when a loss spike is detected. Combined with SwiGLU clamping (the linear component clamped to [-10, 10], the gate capped from above at 10), this let the authors keep loss spikes under control without compromising final-model quality.
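The clamping rule is simple enough to write out for a single activation pair. This is a scalar sketch under one reading of the description: the clamp is applied to the pre-activation values, the gate is capped only from above, and the linear branch is clamped on both sides. The function names are hypothetical.

```python
import math


def silu(x: float) -> float:
    """x * sigmoid(x), written to avoid overflow for large negative x."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamped_swiglu(gate_pre: float, linear_pre: float, limit: float = 10.0) -> float:
    """SwiGLU output for one unit with the clamps described above.

    gate_pre   -- input to the SiLU gate, capped from above at `limit`
    linear_pre -- linear branch, clamped to [-limit, limit]
    """
    gate = min(gate_pre, limit)
    linear = max(-limit, min(linear_pre, limit))
    return silu(gate) * linear
```

Because `silu(10.0)` is just under 10, the magnitude of any single output is bounded by roughly 100 no matter how large the pre-activations grow, which is the property that keeps one outlier unit from destabilizing a training step.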
V4-Pro dataset
Trained with FP4 quantization-aware training on routed expert weights, FP8 for non-expert computation. Anticipatory Routing handles loss-spike recovery.
V4-Flash: 32T
Flash-Base vs V3.2
V4-Flash-Base hits 88.7 on MMLU versus 87.8 for V3.2-Base, despite Flash having 284B total / 13B active vs V3.2's 671B / 37B. Less than half the parameters, higher score.
<½ params
LongBench-V2 · Pro-Base
V4-Pro-Base reaches 51.5 on LongBench-V2 vs 40.2 for V3.2-Base. The hybrid attention earns its keep most visibly on long-context scenarios, where the numeric step-up is largest.
+11.3 vs V3.2
06 — Post-Training
On-Policy Distillation and three reasoning modes.
V4's post-training pipeline is the second big break from V3-series practice. The mixed-RL stage — previously used to consolidate capabilities across domains — is entirely replaced by On-Policy Distillation (OPD).
The sequence: train a separate specialist model for each target domain (math, code, agent, instruction following) via Supervised Fine-Tuning followed by Reinforcement Learning using Group Relative Policy Optimization (GRPO). Those specialists each become state-of-the-art in their respective field. Then train a single unified model via multi-teacher OPD, where the unified model is the student and the specialists are teachers — the student optimizes a reverse-KL loss against teacher output distributions on its own generated trajectories. The result: one model that inherits the specialists' capabilities without their per-domain narrowness.
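The reverse-KL objective at the heart of that last step can be written out for a toy vocabulary. This sketch assumes per-token logits from one student and one teacher on a trajectory the student generated; how teachers are assigned to trajectories and how the loss is weighted are not described above, so the averaging here is an assumption and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits: List[float], teacher_logits: List[float]) -> float:
    """KL(student || teacher) at one token position.

    The expectation is taken under the student's own distribution, which is
    what makes the divergence "reverse" from the teacher's point of view.
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    return sum(s * (math.log(s) - math.log(t))
               for s, t in zip(p_student, p_teacher) if s > 0.0)


def opd_loss(student_traj: List[List[float]], teacher_traj: List[List[float]]) -> float:
    """Mean reverse KL over the positions of a student-generated trajectory."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_traj, teacher_traj)]
    return sum(per_token) / len(per_token)
```

The loss is zero when the student already matches the teacher and grows when the student puts probability mass on tokens the teacher considers unlikely, which pushes the student toward outputs the specialist would endorse.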
Three reasoning modes
Each V4 model supports three inference modes, distinguished by how they use the <think> / </think> tokens:
Non-Think
</think> summary
Intuition-style output with no deliberate chain-of-thought. Appropriate for routine, low-risk tasks where latency and cost dominate.
Low cost · Low latency
Think High
<think> trace </think>
Explicit reasoning trace before the answer. The right default for medium-risk problem solving: code review, multi-step analysis, structured retrieval.
Production default
Think Max
prepended prompt + extended trace
Expanded context budget, reduced length penalties, Max-only system prompt demanding exhaustive decomposition and edge-case stress-testing. Produces V4-Pro's strongest benchmark numbers.
Max tokens · Max accuracy
Think Max is what produces V4-Pro-Max's strongest benchmark numbers in Section 07. At inference time, Max mode adds a prepended instruction demanding "absolute maximum" reasoning (explicitly decomposing the problem, documenting rejected hypotheses, stress-testing against edge cases) and uses a meaningfully larger context budget and reduced length penalty than High mode. The cost is output token count; the payoff is the frontier-competitive numbers on hard reasoning.
Two operational changes worth knowing
- DSML XML tool-call schema. V4 replaces V3.2's tool-call format with an XML-based schema using dedicated |DSML| tokens. The paper reports fewer escaping failures and tool-call errors; practically, any agent scaffolding calling V4 should expect a slightly different tool-invocation format than V3.2 used.
- Interleaved Thinking. Unlike V3.2, which discarded reasoning traces at the start of each new user turn, V4 retains the complete reasoning history across tool calls and user messages during tool-using conversations. That preserves coherent long-horizon chains of thought for agent tasks, at the cost of more context consumption.
07 — Benchmarks
Where V4-Pro-Max leads, matches, and trails.
The chart below is a direct subset of Table 6 from the V4 paper, comparing V4-Pro-Max against the strongest publicly evaluated modes of Claude Opus 4.6 (Max), GPT-5.4 (xHigh), and Gemini-3.1-Pro (High). Orange bars mark V4-Pro-Max scores where V4 leads the field; blue bars mark where a closed-frontier model leads.
Chart: V4-Pro-Max vs frontier · selected benchmarks (source: DeepSeek-V4 report, Table 6)
DeepSeek's own framing is the honest version: V4 trails the absolute frontier by approximately 3 to 6 months on general knowledge and the hardest retrieval workloads, while setting new open-model highs on competitive programming and formal reasoning. Its Codeforces rating of 3206 places the model roughly 23rd among human contest participants. The 120/120 Putnam-2025 score is proof-perfect: every solution is a valid, graded proof rather than a numeric answer.
"Strong enough to be a serious option, honest enough to set expectations."— Our reading of the V4 paper's positioning, §8 Conclusion
08 — Access V4 Today
Weights, API, chat: all live.
Three paths are live today. Pick the one that matches the workload: open weights for on-prem or fine-tuning, the API for production integration, or the chat UI for exploration and team evaluation.
| Surface | Model exposure | Best for |
|---|---|---|
| huggingface.co/deepseek-ai | V4-Pro + V4-Flash weights | On-prem deployment, fine-tuning, quantization for edge, sovereignty-bound workloads. Flash is tractable on modest clusters; Pro needs serious hardware. |
| api.deepseek.com | Non-Think / Think High / Think Max | Production integration. Reasoning mode selected via response format + system prompt. No pricing published at launch. |
| chat.deepseek.com | Expert (Pro) · Instant (Flash) | Exploration, team evaluation, product UX, quick prompt testing. Expert Mode = V4-Pro, Instant Mode = V4-Flash. |
For organizations with data-sovereignty or sector-compliance requirements, V4's open weights plus the 1M-context efficiency story make it the strongest open candidate today for on-prem long-document RAG replacement. For most agencies and engineering teams, the practical starting point is the API and chat surfaces — benchmark on your own prompts, measure token spend and latency, decide per-workload before considering on-prem. If you're deciding between V4 and closed frontier for specific pipelines, our AI digital transformation engagements start with exactly this kind of comparative eval.
09 — Implications
What this means for agencies and engineering teams.
V4's release changes the practical decision tree for three specific workload classes. For everything else, the frontier picture hasn't moved.
Competitive programming & formal reasoning
Together, 93.5 on LiveCodeBench, 3206 on Codeforces, and 120/120 on Putnam-2025 are the strongest open-model results reported to date. Benchmark V4 against your current stack on your own repos before switching defaults.
Pick V4-Pro-Max
On-prem document agents
V4's open weights plus 1M-context efficiency make it the strongest open candidate today for on-prem long-document RAG in sovereignty-bound sectors. It trails Claude Opus 4.6 on MRCR 1M, so pick per workload.
Pick V4-Pro open weights
Broad Q&A & retrieval
V4 trails GPT-5.4 xHigh and Gemini-3.1-Pro on MMLU-Pro, GPQA Diamond, and SimpleQA-Verified. Stay with closed frontier for generalist knowledge work until the gap closes.
Stay with frontier
Multi-vendor routing
Default to GPT-5.5 for agentic coding broadly; route competitive-programming and Putnam-style workloads to V4-Pro-Max; keep Opus 4.6 for MRCR 1M retrieval; Gemini 3.1 Pro for price-sensitive bulk long-context.
Route by task class
For teams currently running V3.2 in production, the migration is straightforward: same architectural family, same weight-distribution story, meaningfully lower inference cost at 1M context, and three reasoning modes that replace the per-task prompt engineering V3.2 typically needed. For teams evaluating open weights for the first time because of sovereignty or cost, V4-Flash is the right starting point: 284B total / 13B active is tractable on modest clusters, and Flash-Base already surpasses V3.2-Base on most benchmarks.
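The multi-vendor routing recommendation in this section amounts to a lookup table keyed by task class. A minimal sketch follows; the task-class labels are hypothetical, the model strings are display names rather than real API identifiers, and falling back to the agentic-coding default for unknown classes is an assumption.

```python
from typing import Dict

# Task class -> model display name, following the routing card in this section.
ROUTES: Dict[str, str] = {
    "agentic_coding": "GPT-5.5",
    "competitive_programming": "DeepSeek-V4-Pro-Max",
    "putnam_style_math": "DeepSeek-V4-Pro-Max",
    "retrieval_1m_context": "Claude Opus 4.6",
    "bulk_long_context": "Gemini 3.1 Pro",
}

# Assumption: unknown task classes fall back to the broad default.
DEFAULT_MODEL = "GPT-5.5"


def pick_model(task_class: str) -> str:
    """Return the model display name for a task class, or the default."""
    return ROUTES.get(task_class, DEFAULT_MODEL)
```

Keeping the table in data rather than in branching code makes it cheap to update after running your own evals, which is the step the section recommends before changing any defaults.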
10 — Conclusion
The strongest open release of the quarter.
Million-token context is no longer a capability question — it's a cost question.
DeepSeek V4 Preview is the most consequential open-weight release of the quarter. Two tightly related models, a hybrid attention architecture that makes 1M context genuinely economical rather than aspirational, and a post-training pipeline that replaces one-shot inference with three explicit reasoning modes: all shipped on day one as weights, API, and chat.
The honest framing from the paper itself is the right one: V4 trails absolute frontier by three to six months on general knowledge, and sets new open-model highs on competitive programming and formal reasoning. That's enough to be a serious option for specific workload classes — not enough to displace closed frontier for general knowledge work. The practical move is to run your own evals on the specific prompts you care about, not to treat the headline numbers as a vendor decision.
The broader signal is clearer: efficiency, not raw capability, is the axis that matters for the next generation of open models. When a release raises your active parameter count by about a third (49B vs 37B) and still cuts your inference bill by roughly three-quarters, the question stops being "which model is smartest" and becomes "which model is cheap enough to actually run the workload I care about at the scale I care about it." V4 is the first open model to convincingly land on that side of the line.