DeepSeek V4 Launches: 1.6T MoE, 1M Context, 10% KV
DeepSeek-V4 ships April 24, 2026 as open-weight MoE: Pro (1.6T/49B active) and Flash (284B/13B), 1M context, 27% FLOPs and 10% KV cache vs V3.2.
- V4-Pro: 1.6T total / 49B active parameters
- V4-Flash: 284B total / 13B active parameters
- Context window: 1M tokens, native
- V4-Pro FLOPs at 1M context: 27% of V3.2
Key Takeaways
DeepSeek published the DeepSeek-V4 Preview on April 24, 2026 — a new open-weight Mixture-of-Experts series that stakes the lab's thesis on a single claim: million-token context processing is not a capability problem anymore, it's an efficiency problem. Two models ship today. V4-Pro packs 1.6 trillion total parameters with 49 billion activated per token. V4-Flash is the efficient sibling at 284 billion total / 13 billion active. Both support native 1M context, and both are open weights.
This guide covers what actually launched, the attention architecture that makes the efficiency story real, the honest benchmark positioning versus GPT-5.4, Gemini-3.1-Pro, and Claude Opus 4.6, and how to start using V4 in your own stack today. Everything below is sourced from DeepSeek's technical report published alongside the launch.
Release snapshot: DeepSeek-V4 Preview launched April 24, 2026 on Hugging Face, the DeepSeek API, and chat.deepseek.com (Expert Mode = V4-Pro, Instant Mode = V4-Flash). Based on the official DeepSeek-V4 technical report and the model collection on Hugging Face.
What DeepSeek V4 Preview Actually Ships
The launch is a Preview release — DeepSeek's term for a production model that the lab considers complete enough for API traffic and open distribution, but that may still evolve before a full "V4" branding. Three surfaces went live simultaneously.
**V4-Pro**

- 1.6T total parameters
- 49B activated per token (MoE)
- Pre-trained on 33T tokens
- 1M-token native context
- Exposed as Expert Mode in the chatbot

**V4-Flash**

- 284B total parameters
- 13B activated per token (MoE)
- Pre-trained on 32T tokens
- 1M-token native context
- Exposed as Instant Mode in the chatbot
Both models share a single architectural stack — the differences are the expert count, the hidden size, and the training budget. The paper is explicit that Flash-Base already surpasses V3.2-Base across the majority of benchmarks despite roughly 42% of V3.2's total parameter count, largely because of the architectural and data-quality improvements described in the sections below.
The release follows the pattern DeepSeek established with V3.1 and V3.2 — permissively licensed open weights, a detailed technical report published alongside the release, and a reference inference implementation in the same Hugging Face repository. No pricing disclosure for the API accompanied the launch announcement, but the paper's efficiency claims strongly suggest V4-Pro will be priced at or near V3.2 levels per million tokens despite the larger parameter count.
The Efficiency Story: 27% FLOPs, 10% KV Cache
The central claim of the V4 paper sits in a single sentence of the abstract: at a one-million-token context length, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared with V3.2. V4-Flash goes further — 10% of the FLOPs and 7% of the KV cache. Those are not theoretical numbers; they are DeepSeek's measurements of equivalent FP8 FLOPs against its own prior-generation model on the same hardware.
| Model (at 1M context) | Total Params | Active Params | Single-token FLOPs vs V3.2 | KV cache vs V3.2 |
|---|---|---|---|---|
| V3.2 (reference) | 671B | 37B | 100% (baseline) | 100% (baseline) |
| V4-Pro | 1.6T | 49B | 27% | 10% |
| V4-Flash | 284B | 13B | 10% | 7% |
The practical implication is worth reading twice. V4-Pro has roughly a third more active parameters than V3.2 (49B vs 37B) and well over twice the total parameters. By every intuition built on dense transformer scaling, inference should cost more, not roughly 3.7x less. The efficiency gain comes from the attention redesign described in the next section, combined with FP4-quantized routed experts and a fundamentally different KV cache management strategy.
For teams operating long-context workloads — full-codebase analysis, multi-document reasoning, legal or financial corpus Q&A — this changes the math. A workload that required $1 of compute on V3.2 at 1M tokens costs roughly $0.27 on V4-Pro with meaningfully stronger capability, or roughly $0.10 on V4-Flash with modestly weaker knowledge but comparable reasoning. That is the entire thesis of this release.
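The arithmetic behind those dollar figures is easy to reproduce. A minimal sketch using the paper's published ratios; the $1.00 baseline is illustrative, not a published price:

```python
# Relative single-token inference cost at 1M context, using the ratios
# published in the V4 paper. The $1.00 baseline is illustrative, not a price.
FLOPS_VS_V32 = {"V3.2": 1.00, "V4-Pro": 0.27, "V4-Flash": 0.10}
KV_VS_V32 = {"V3.2": 1.00, "V4-Pro": 0.10, "V4-Flash": 0.07}

def relative_cost(model: str, baseline_dollars: float = 1.00) -> float:
    """Assume spend scales linearly with single-token FLOPs."""
    return round(baseline_dollars * FLOPS_VS_V32[model], 2)

for m in ("V3.2", "V4-Pro", "V4-Flash"):
    print(f"{m}: ${relative_cost(m):.2f} per $1.00 of V3.2 compute, "
          f"KV cache {KV_VS_V32[m]:.0%} of V3.2")
```

The linear-scaling assumption is the optimistic case; real serving costs also depend on batch shapes and memory bandwidth, which is why the KV cache ratio matters independently of FLOPs.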
Where this matters in practice: if you are building long-context retrieval or document-agent workloads, our AI Digital Transformation engagements start with benchmarking open-weight models like V4 against your current stack on your own corpus — the place the published numbers stop being theoretical.
Hybrid Attention: CSA + HCA Explained
DeepSeek-V4's attention stack interleaves two distinct compression strategies across layers. Neither is exotic in isolation; the contribution is the hybrid configuration and the engineering to make it train stably at this scale.
Compressed Sparse Attention (CSA)
What it is. CSA takes the Key-Value cache for every m tokens and compresses that block into a single entry, so the effective KV sequence is 1/m the length of the raw token sequence. Then, for each query token, a learned Lightning Indexer scores those compressed blocks and a top-k selector picks only the most relevant ones to attend to. A sliding window of recent uncompressed tokens is concatenated alongside so that local fine-grained dependencies are preserved.
Why it matters. Two savings stack: the KV cache itself is 1/m the size of a dense attention KV, and even among the compressed entries, attention is sparse (top-k only). CSA is a generalization of DeepSeek Sparse Attention from V3.2 — same idea, more aggressive compression.
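A toy sketch of the CSA mechanics for a single query vector. Mean pooling stands in for the learned block compression and a plain dot product stands in for the Lightning Indexer; both are simplifications of the paper's learned components:

```python
import numpy as np

def csa_sketch(q, K, V, m=4, top_k=2, window=8):
    """Toy Compressed Sparse Attention for one query vector.
    Mean pooling and dot-product scoring are stand-ins for the
    learned compression and the Lightning Indexer."""
    T, d = K.shape
    nblk = T // m
    # 1) compress: every m-token block becomes one KV entry (cache ~ T/m)
    Kc = K[: nblk * m].reshape(nblk, m, d).mean(axis=1)
    Vc = V[: nblk * m].reshape(nblk, m, d).mean(axis=1)
    # 2) index: score compressed blocks, keep only the top-k
    keep = np.argsort(Kc @ q)[-top_k:]
    # 3) attend over selected blocks plus a sliding window of raw recent tokens
    K_att = np.concatenate([Kc[keep], K[-window:]])
    V_att = np.concatenate([Vc[keep], V[-window:]])
    w = np.exp(K_att @ q / np.sqrt(d))
    w /= w.sum()
    return w @ V_att

rng = np.random.default_rng(0)
K, V = rng.standard_normal((64, 16)), rng.standard_normal((64, 16))
out = csa_sketch(rng.standard_normal(16), K, V)
print(out.shape)  # (16,)
```

Note how the two savings show up directly: the cache holds T/m compressed entries, and the attention itself touches only top_k of them plus the window.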
Heavily Compressed Attention (HCA)
What it is. HCA applies much more aggressive compression — every m' tokens (with m' ≫ m) fold into a single KV entry. The trade-off is that HCA skips the sparse selection step entirely; attention over the compressed KV remains dense.
Why it matters. HCA is designed for the layers where retaining a broad, low-resolution view of the full context is more valuable than fine-grained selection. Interleaving HCA blocks with CSA blocks gives the model both modes — precise look-up on some layers, smeared global summary on others.
The key insight: the efficiency numbers in Section 02 are not achievable by either CSA or HCA alone. DeepSeek tried variants in ablations; the interleaved hybrid is what holds long-context quality while dropping FLOPs and KV cache by an order of magnitude.
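The interleaving itself is just a layer schedule. A sketch with a 3:1 CSA-to-HCA ratio as an illustrative assumption; the actual ratio and ordering V4 uses are not specified in the material summarized here:

```python
# Hypothetical hybrid-attention layer schedule. The 3:1 CSA:HCA ratio is an
# illustrative assumption, not V4's published configuration.
def layer_schedule(n_layers: int, hca_every: int = 4) -> list[str]:
    return ["HCA" if (i + 1) % hca_every == 0 else "CSA" for i in range(n_layers)]

print(layer_schedule(8))
# ['CSA', 'CSA', 'CSA', 'HCA', 'CSA', 'CSA', 'CSA', 'HCA']
```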
mHC, Muon, and What Carries Over from V3
Beyond the attention redesign, V4 introduces two further innovations and retains the best parts of the V3 stack. The net effect is a model that trains more stably at larger scale than any prior DeepSeek release.
Manifold-Constrained Hyper-Connections (mHC)
mHC replaces the conventional residual connections between Transformer blocks. In standard Hyper-Connections, the residual mapping can amplify signals in unstable ways when many layers stack — causing numerical blow-ups during training. mHC constrains the residual mapping to lie on the Birkhoff polytope (the manifold of doubly stochastic matrices), which bounds the spectral norm to ≤ 1 and makes signal propagation non-expansive by construction. The result: signals stay numerically stable across very deep stacks, which is what unlocks the 1.6T parameter scale at all.
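One standard way to keep a learned matrix on the Birkhoff polytope is Sinkhorn normalization, alternating row and column renormalization of a positive matrix. The sketch below illustrates the constraint and the norm bound; it is not DeepSeek's exact mHC parameterization:

```python
import numpy as np

def sinkhorn_doubly_stochastic(logits, n_iters=200):
    """Map an unconstrained matrix toward the Birkhoff polytope (doubly
    stochastic matrices) via Sinkhorn iterations. Illustrative of the mHC
    constraint; the paper's exact parameterization may differ."""
    M = np.exp(logits - logits.max())            # strictly positive entries
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)        # rows sum to 1
        M /= M.sum(axis=0, keepdims=True)        # cols sum to 1
    return M

M = sinkhorn_doubly_stochastic(np.random.default_rng(1).standard_normal((4, 4)))
print(np.round(M.sum(axis=0), 6))  # [1. 1. 1. 1.]
```

Because a doubly stochastic matrix is a convex combination of permutation matrices, its spectral norm is at most 1, which is exactly the non-expansive property the paragraph above describes.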
Muon Optimizer
V4 trains with Muon rather than AdamW. In DeepSeek's setup, Muon delivers faster convergence and better training stability, though the paper is careful to note that several training-time stabilizers (Anticipatory Routing, SwiGLU clamping) were still required to keep loss spikes under control at scale.
What carries over from V3
- DeepSeekMoE — the fine-grained routed-expert FFN framework, with a few tweaks: V4 uses Sqrt(Softplus) rather than Sigmoid for affinity scoring, removes the cap on routing target nodes, and replaces the dense FFN layers in the first few Transformer blocks with Hash-routed MoE layers.
- Multi-Token Prediction (MTP) — retained unchanged from V3. Still used to accelerate inference and to improve training signal.
- Auxiliary-loss-free load balancing — with a mild sequence-wise balance loss added on top to avoid extreme expert imbalance inside individual sequences.
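The Sqrt(Softplus) affinity change above is easy to see in isolation. A sketch comparing the two gating functions; everything beyond the function names themselves is illustrative:

```python
import math

def affinity_v3(logit: float) -> float:
    """V3-style affinity: Sigmoid, saturating in (0, 1)."""
    return 1 / (1 + math.exp(-logit))

def affinity_v4(logit: float) -> float:
    """V4-style affinity: Sqrt(Softplus), per the paper."""
    return math.sqrt(math.log1p(math.exp(logit)))

# Unlike Sigmoid, Sqrt(Softplus) is unbounded above, so strong expert
# affinities are not squashed toward 1.
print(affinity_v3(4.0), affinity_v4(4.0))
```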
Pre-Training: 33T / 32T Tokens, FP4 QAT
V4-Pro is pre-trained on 33 trillion tokens, V4-Flash on 32 trillion. Both training runs use FP4 quantization-aware training for the routed expert weights and the indexer query/key path, while keeping non-expert computation in FP8. The practical effect today is a smaller memory footprint during training and inference; the paper notes that on current hardware, peak throughput for FP4×FP8 operations is identical to FP8×FP8, but explicitly flags that purpose-built hardware could make FP4 roughly 1.33x more efficient than FP8 — an open lane for future inference gains.
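What FP4 quantization-aware training means for a weight tensor can be sketched with fake quantization. The E2M1 magnitude grid below is a common 4-bit float layout and an assumption here; the paper's exact FP4 format is not quoted above:

```python
import numpy as np

# Representable magnitudes in FP4 E2M1, a common 4-bit float layout
# (assumption -- the paper's exact FP4 format is not specified here).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w: np.ndarray) -> np.ndarray:
    """Simulate FP4 quantization of expert weights for QAT: scale the
    tensor, snap each magnitude to the nearest FP4 value, rescale."""
    scale = np.abs(w).max() / FP4_GRID[-1]
    mags = np.abs(w) / scale
    snapped = FP4_GRID[np.argmin(np.abs(mags[..., None] - FP4_GRID), axis=-1)]
    return np.sign(w) * snapped * scale

print(fake_quant_fp4(np.array([0.03, -0.11, 0.25, -0.6])))
```

In actual QAT the forward pass uses the quantized values while gradients flow through a straight-through estimator; the sketch shows only the value-snapping step.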
Training stability was actively managed. The paper introduces Anticipatory Routing, which decouples routing updates from the backbone network by computing them one step in advance; the mechanism triggers automatically when a loss spike is detected. Combined with SwiGLU clamping (the linear component clamped to [-10, 10], the upper gate capped at 10), this let the authors avoid loss-spike recovery runs without compromising final-model quality.
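The clamping itself can be sketched directly; weight shapes and naming below are illustrative:

```python
import numpy as np

def silu(x):
    return x / (1 + np.exp(-x))

def clamped_swiglu(x, W_gate, W_up, W_down):
    """SwiGLU FFN with the stabilizers described in the paper: the linear
    ('up') branch clamped to [-10, 10], the gate capped at 10.
    Weight shapes and naming are illustrative."""
    gate = np.minimum(silu(x @ W_gate), 10.0)   # cap gate activations at 10
    up = np.clip(x @ W_up, -10.0, 10.0)         # clamp linear branch
    return (gate * up) @ W_down

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
y = clamped_swiglu(x, *(rng.standard_normal(s) for s in [(8, 16), (8, 16), (16, 8)]))
print(y.shape)  # (2, 8)
```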
Per Table 1 of the paper, V4-Flash-Base scores 88.7 on MMLU vs 87.8 for V3.2-Base, 89.4 on MMLU-Redux vs 87.5, and 44.7 on LongBench-V2 vs 40.2 — all with 284B total / 13B active vs V3.2's 671B / 37B. V4-Pro-Base extends the lead across the board, reaching 90.1 on MMLU and 51.5 on LongBench-V2. The gains are especially pronounced in long-context scenarios, which is where the hybrid attention earns its keep.
Post-Training: On-Policy Distillation and Reasoning Modes
V4's post-training pipeline is the second big break from V3-series practice. The mixed-RL stage — previously used to consolidate capabilities across domains — is entirely replaced by On-Policy Distillation (OPD).
The sequence: train a separate specialist model for each target domain (math, code, agent, instruction following) via Supervised Fine-Tuning followed by Reinforcement Learning using Group Relative Policy Optimization (GRPO). Those specialists each become state-of-the-art in their respective fields. Then train a single unified model via multi-teacher OPD, where the unified model is the student and the specialists are teachers: the student optimizes a reverse-KL loss against teacher output distributions on its own generated trajectories. The result is one model that inherits the specialists' capabilities without their per-domain narrowness.
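The student-side objective reduces to a reverse KL at each token position of a student-sampled trajectory. A minimal single-position sketch:

```python
import numpy as np

def reverse_kl(student_logits, teacher_logits) -> float:
    """Reverse KL D(student || teacher) for one token position -- the loss
    direction used in On-Policy Distillation, evaluated on trajectories
    the student itself generated."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    return float(np.sum(p_s * (np.log(p_s) - np.log(p_t))))

same = reverse_kl(np.array([1.0, 2.0, 0.5]), np.array([1.0, 2.0, 0.5]))
print(same)  # 0.0 when student already matches the teacher
```

Reverse KL (rather than forward KL) is mode-seeking: the student is penalized for placing mass where the teacher places little, which suits distilling a specialist's sharp output distribution.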
Three reasoning modes
Each V4 model supports three inference modes, distinguished by how they use the <think> / </think> tokens:
| Mode | Characteristics | Response format |
|---|---|---|
| Non-Think | Fast, intuition-style — for routine tasks and low-risk decisions | </think> summary |
| Think High | Conscious logical analysis — medium-risk problem solving | <think> trace </think> summary |
| Think Max | Maximum compute — exhaustive decomposition, stress-test against edge cases | Prepended system prompt + extended <think> trace |
Think Max is what produces V4-Pro-Max's strongest benchmark numbers (Section 07). At inference time, Max mode prepends an instruction demanding "absolute maximum" reasoning (explicitly decomposing the problem, documenting rejected hypotheses, stress-testing against edge cases) and uses a meaningfully larger context budget and a weaker length penalty than High mode. The cost is output token count; the payoff is the frontier-competitive numbers on hard reasoning.
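For orientation only: if DeepSeek keeps its OpenAI-compatible chat-completions request shape for V4, selecting a reasoning mode might look like the sketch below. The model id and the reasoning-effort field name are placeholders to verify against the live API docs, not confirmed parameters:

```python
import json

def build_v4_payload(prompt: str, effort: str = "high") -> str:
    """Build a chat-completions request body. "deepseek-v4-pro" and
    "reasoning_effort" are PLACEHOLDERS, not confirmed API fields --
    check the DeepSeek API docs before use."""
    return json.dumps({
        "model": "deepseek-v4-pro",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # assumed: "none" | "high" | "max"
    })

print(build_v4_payload("Prove the claim step by step.", effort="max"))
```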
Two operational changes worth knowing
- DSML XML tool-call schema. V4 replaces V3.2's tool-call format with an XML-based schema using dedicated `|DSML|` tokens. The paper reports fewer escaping failures and tool-call errors; practically, any agent scaffolding calling V4 should expect a slightly different tool-invocation format than V3.2 used.
- Interleaved Thinking. Unlike V3.2, which discarded reasoning traces at the start of each new user turn, V4 retains the complete reasoning history across tool calls and user messages during tool-using conversations. That preserves coherent long-horizon chains of thought for agent tasks, at the cost of more context consumption.
Benchmarks: Where V4 Leads, Matches, and Trails
The table below is a direct subset of Table 6 from the V4 paper, comparing V4-Pro-Max against the strongest publicly evaluated modes of Claude Opus 4.6 (Max), GPT-5.4 (xHigh), and Gemini-3.1-Pro (High). Bold marks the best score per row in this comparison; an em-dash marks data the paper did not publish.
| Benchmark (metric) | V4-Pro-Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro |
|---|---|---|---|---|
| MMLU-Pro (EM) | 87.5 | 89.1 | 87.5 | **91.0** |
| GPQA Diamond (Pass@1) | 90.1 | 91.3 | 93.0 | **94.3** |
| SimpleQA-Verified | 57.9 | 46.2 | 45.3 | **75.6** |
| LiveCodeBench (Pass@1) | **93.5** | 88.8 | — | 91.7 |
| Codeforces (rating) | **3206** | — | 3168 | 3052 |
| HMMT 2026 Feb (Pass@1) | 95.2 | 96.2 | **97.7** | 94.7 |
| Apex Shortlist (Pass@1) | **90.2** | 85.9 | 78.1 | 89.1 |
| SWE-Verified (Resolved) | 80.6 | **80.8** | — | 80.6 |
| SWE-Pro (Resolved) | 55.4 | 57.3 | **57.7** | 54.2 |
| Terminal-Bench 2.0 (Acc) | 67.9 | 65.4 | **75.1** | 68.5 |
| MRCR 1M (MMR) | 83.5 | **92.9** | — | 76.3 |
| BrowseComp (Pass@1) | 83.4 | 83.7 | 82.7 | **85.9** |
| Putnam-2025 Frontier | **120/120** | — | — | — |
Methodology and date. All numbers are from DeepSeek's evaluations as reported in the V4 technical report (April 2026). Comparison models were evaluated with the paper's unified configurations where feasible; some cells are unpopulated where the paper reports API timeouts or did not evaluate that mode. Frontier benchmark standings change monthly — verify current numbers before making procurement decisions.
Honest read of the table
Where V4-Pro-Max wins. LiveCodeBench (93.5 — a new open-model high), Codeforces rating (3206 — DeepSeek reports this places the model roughly 23rd among human contest participants), Apex Shortlist (90.2), SimpleQA-Verified against the other open and near-open models (57.9 is a large margin over the prior open-model best), and a proof-perfect 120/120 on Putnam-2025 formal reasoning under the hybrid informal-formal pipeline.
Where V4-Pro-Max matches. SWE-Verified is a statistical tie with Opus 4.6 and Gemini-3.1-Pro (within the paper's 0.3-point equivalence threshold), and BrowseComp and Terminal-Bench 2.0 land in the middle of the pack. For production code automation, the SWE-Verified tie is the frontier-competitive result that matters most.
Where V4-Pro-Max trails. MMLU-Pro (87.5 vs Gemini-3.1-Pro 91.0) and GPQA Diamond (90.1 vs 94.3) — raw knowledge tasks are where frontier proprietary models still have a clear lead, driven by larger training data and longer post-training. MRCR 1M (83.5 vs Opus 4.6 92.9) — V4-Pro beats Gemini-3.1-Pro at 1M context retrieval but Opus 4.6 remains the long-context benchmark leader. DeepSeek's own framing in the paper: V4 trails the absolute frontier by approximately three to six months.
How to Access V4 Today
Three paths went live on launch day: open weights on Hugging Face, the DeepSeek API, and the consumer chat app.
Both models are published at huggingface.co/collections/deepseek-ai/deepseek-v4 under DeepSeek's standard open-weight license (read the license text on the repository before production use).
A reference inference implementation ships in the same repository at DeepSeek-V4-Pro/tree/main/inference.
The DeepSeek API was updated the same day; the launch post states "API is updated & available today." Pricing was not published in the launch announcement, but historical DeepSeek pricing for V3-series models has been among the lowest per-million-token rates of any frontier-adjacent model, and the V4 efficiency story strongly suggests rates will remain aggressive.
The consumer chat surface exposes V4 via two modes: Expert Mode routes to V4-Pro for maximum capability, Instant Mode routes to V4-Flash for faster, cheaper responses. Reasoning effort is selected per message.
For teams building with V4, the reference inference implementation in the Hugging Face repo is the shortest path to a correct local setup — the paper specifies architecture details (hash-routed layers in the first few blocks, exact CSA compression ratios, Muon hyperparameters) that generic transformer runtimes will not get right by default.
What This Means for Agencies and Engineering Teams
Three segments of the market should treat V4 as a serious decision point today rather than a curiosity to evaluate next quarter.
Long-context workloads and on-prem RAG
The efficiency numbers are the specific reason to pay attention. If your workload already runs against open-weight models for data-sovereignty, EU compliance, or sector-specific reasons, V4-Flash at 7% of V3.2's KV cache at 1M context makes previously uneconomical long-document workflows tractable, and V4-Pro makes strong-capability long-context inference affordable. This is where the headline story actually lands in production budgets. For teams evaluating this shift, our practice at AI Digital Transformation begins with workload-level benchmarking against your current stack.
Code automation and developer tooling
The LiveCodeBench 93.5 and Codeforces 3206 numbers are the strongest open-model signal on competitive-programming-style workloads to date. For agent platforms, editor integrations, and code-automation products that already allow a bring-your-own-model posture, V4-Pro is worth benchmarking against your current production model on your own repositories before committing to a switch — the open benchmarks are directional, not universal. Worth noting: on SWE-Pro, V4-Pro-Max still trails Opus 4.6 and GPT-5.4 by 1.9 to 2.3 points, so the code-task picture is bench-specific.
Cross-provider model strategy
For teams running multi-model stacks, V4 slots in as the strongest open-weight default for 2026. Pair it with closed frontier models for the specific workloads where proprietary leads (frontier knowledge, MRCR-style 1M retrieval, SWE-Pro) and retain V4-Flash as the low-cost path for everything else. The practical result is a portfolio that spends frontier-tier dollars only on frontier-tier tasks.
Who should not switch
If your workload is dominated by raw knowledge retrieval or very long-context retrieval with high precision requirements (legal discovery, medical summarization at hundreds of thousands of tokens), the MMLU-Pro, GPQA, and MRCR 1M gaps versus Gemini-3.1-Pro and Opus 4.6 are real and measurable. The frontier leads are not margin-of-error. Pick per workload.
Plan Your 2026 AI Model Strategy
Whether you are evaluating open-weight models like V4 for on-prem deployment, building multi-model routing, or deciding which workloads should stay on proprietary frontier models, we can help you design a portfolio that matches cost to capability.