DeepSeek published the DeepSeek-V4 Preview on April 24, 2026 — a new open-weight Mixture-of-Experts series that stakes the lab's thesis on a single claim: million-token context processing is no longer a capability problem, it's an efficiency problem.

Two models shipped today. V4-Pro packs 1.6 trillion total parameters with 49 billion activated per token. V4-Flash is the efficient sibling at 284 billion total and 13 billion active. Both support native 1M context, both are open weights on Hugging Face, and both are already live on the DeepSeek API and chat.deepseek.com as Expert Mode and Instant Mode respectively.

This guide covers what actually launched, the attention architecture that makes the efficiency story real, honest benchmark positioning versus GPT-5.4, Gemini-3.1-Pro, and Claude Opus 4.6, and how to start using V4 in your own stack today. Everything below is sourced from DeepSeek's technical report published alongside the launch.

Key takeaways

01
Two open-weight MoE models shipped together. V4-Pro at 1.6T total / 49B active and V4-Flash at 284B / 13B. Weights are on Hugging Face; the API and chat.deepseek.com Expert/Instant modes are live same-day.
02
The headline is efficiency, not raw capability. At 1M-token context, V4-Pro uses 27% of V3.2's single-token inference FLOPs and 10% of the KV cache; V4-Flash drops further to 10% of FLOPs and 7% of KV.
03
New hybrid attention: CSA plus HCA. Compressed Sparse Attention keeps a ~1/m-sized KV plus a top-k selector; Heavily Compressed Attention folds many more tokens into a single entry. Interleaving the two makes 1M context affordable.
04
Three reasoning modes replace one-shot inference. Non-Think, Think High, and Think Max modes toggle via the <think> token and a Max-only prepended system prompt. Post-training replaces mixed RL with On-Policy Distillation from domain specialists.
05
Open SOTA, honestly 3 to 6 months behind frontier. V4-Pro-Max sets open-model highs on LiveCodeBench (93.5) and Codeforces (3206) and scores a proof-perfect 120/120 on Putnam-2025, while trailing GPT-5.4 and Gemini-3.1-Pro on MMLU-Pro and GPQA Diamond.

01 — What Shipped

A Preview release, live on three surfaces.

The launch is a Preview release — DeepSeek's term for a production model that the lab considers complete enough for API traffic and open distribution, but that may still evolve before a full "V4" branding. Three surfaces went live simultaneously: open weights on Hugging Face, the DeepSeek API updated to the new models, and chat.deepseek.com exposing V4-Pro as Expert Mode and V4-Flash as Instant Mode.

Both models share a single architectural stack. The differences are the expert count, the hidden size, and the training budget. The paper is explicit that Flash-Base already surpasses V3.2-Base across the majority of benchmarks despite having roughly 42% of V3.2's total parameter count, largely because of the architectural and data-quality improvements documented in the sections below.

Expert Mode

DeepSeek-V4-Pro

1.6T total · 49B active · 33T tokens

Frontier open-weight MoE. Native 1M context, three reasoning modes, Codeforces 3206 and a perfect 120/120 on Putnam-2025. Available as Expert Mode on chat.deepseek.com.

huggingface.co/deepseek-ai/DeepSeek-V4-Pro

Instant Mode

DeepSeek-V4-Flash

284B total · 13B active · 32T tokens

Efficient sibling. Surpasses V3.2-Base across most benchmarks at ~42% of V3.2's parameter count. 10% of V3.2's FLOPs and 7% of its KV cache at 1M context.

huggingface.co/deepseek-ai/DeepSeek-V4-Flash

Release snapshot

DeepSeek-V4 Preview launched April 24, 2026 on Hugging Face, the DeepSeek API, and chat.deepseek.com. Checkpoints are in the DeepSeek-V4 collection on Hugging Face with a reference inference implementation in the same repo. Launch pricing on the DeepSeek API: V4-Flash at $0.14 / $0.28 per 1M tokens (input / output) and V4-Pro at $0.435 / $0.87 on a 75%-off launch promotion through May 31, 2026 — list rates of $1.74 / $3.48 apply after.
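For budgeting, the launch rates quoted above can be turned into a rough per-request estimate. The sketch below is illustrative only: the tier keys and the function name are hypothetical (they are not DeepSeek API identifiers), and the token counts in the usage lines are made up.

```python
# Per-1M-token prices in USD as (input, output), copied from the release snapshot.
# Tier keys are illustrative labels, not DeepSeek API model identifiers.
PRICES = {
    "v4-flash": (0.14, 0.28),
    "v4-pro-promo": (0.435, 0.87),  # 75%-off launch promotion
    "v4-pro-list": (1.74, 3.48),    # list rates after the promotion ends
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the given price tier."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical long-context request: 800k input tokens, 4k output tokens.
    for tier in PRICES:
        print(f"{tier}: ${request_cost(tier, 800_000, 4_000):.4f}")
```

Because input tokens dominate long-context requests, the input rate is the number to watch when comparing tiers.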

The release follows the pattern DeepSeek established with V3.1 and V3.2: permissively licensed open weights, a detailed technical report published alongside the release, and a reference inference implementation in the same repository. Always verify the exact license text on the Hugging Face repo before shipping production workloads.

02 — Efficiency

The real headline: 27% FLOPs, 10% KV cache.

The central claim of the V4 paper sits in a single sentence of the abstract: at a one-million-token context length, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared with V3.2. V4-Flash goes further — 10% of the FLOPs and 7% of the KV cache. Those are not theoretical numbers; they are DeepSeek's measurements of equivalent FP8 FLOPs against its own prior-generation model on the same hardware.

Inference cost at 1M-token context · V4 vs V3.2 baseline

Source: DeepSeek-V4 technical report

Metric	Model details	Relative to V3.2
V3.2 baseline	671B total / 37B active	100%
V4-Pro FLOPs	1.6T / 49B active · single-token inference	27%
V4-Pro KV cache	1M-token context window	10%
V4-Flash FLOPs	284B / 13B active · single-token inference	10%
V4-Flash KV cache	1M-token context window	7%

The practical implication is worth reading twice. V4-Pro has about a third more active parameters than V3.2 (49B vs 37B) and more than twice the total parameters (1.6T vs 671B). By every intuition built on dense transformer scaling, inference should cost more, not ~3.7× less. The efficiency gain comes from the attention redesign in Section 03, combined with FP4-quantized routed experts and a fundamentally different KV cache management strategy.

For teams operating long-context workloads (full-codebase analysis, multi-document reasoning, legal or financial corpus Q&A), this changes the math. If cost tracks single-token inference FLOPs, a workload that required $1 of compute on V3.2 at 1M tokens costs roughly $0.27 on V4-Pro with meaningfully stronger capability, or roughly $0.10 on V4-Flash with modestly weaker knowledge but comparable reasoning. That is the entire thesis of this release.
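The ratios in this section are easy to check from the quoted figures. The following sketch only restates numbers from the article as multipliers relative to V3.2; the dictionary layout and function name are illustrative choices, not anything from the report.

```python
# Figures quoted in this section; V3.2 is the 100% baseline at 1M-token context.
V32 = {"total_b": 671, "active_b": 37}
MODELS = {
    "V4-Pro": {"total_b": 1600, "active_b": 49, "flops": 0.27, "kv": 0.10},
    "V4-Flash": {"total_b": 284, "active_b": 13, "flops": 0.10, "kv": 0.07},
}


def summarize(name: str) -> dict:
    """Express one V4 model relative to V3.2 as plain multipliers."""
    m = MODELS[name]
    return {
        "total_params_x": m["total_b"] / V32["total_b"],
        "active_params_x": m["active_b"] / V32["active_b"],
        "flops_reduction_x": 1 / m["flops"],
        "kv_reduction_x": 1 / m["kv"],
    }


if __name__ == "__main__":
    for name in MODELS:
        print(name, {k: round(v, 2) for k, v in summarize(name).items()})
```

For V4-Pro this works out to roughly 2.4× the total parameters and 1.3× the active parameters of V3.2, alongside a roughly 3.7× reduction in single-token FLOPs and a 10× smaller KV cache.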

"Million-token context processing is no longer a capability problem — it's an efficiency problem. The hybrid attention stack in V4 is designed around that."— DeepSeek-V4 technical report, §1 Abstract

03 — Hybrid Attention

CSA plus HCA, interleaved across layers.

DeepSeek-V4's attention stack interleaves two distinct compression strategies across layers. Neither is exotic in isolation; the contribution is the hybrid configuration and the engineering to make it train stably at this scale.

Compressed Sparse Attention (CSA)

What it is. CSA takes the Key-Value cache for every m tokens and compresses that block into a single entry — so the effective KV sequence is 1/m the length of the raw token sequence. Then, for each query token, a learned Lightning Indexer scores those compressed blocks and a top-k selector picks only the most relevant ones to attend to. A sliding window of recent uncompressed tokens is concatenated alongside so that local fine-grained dependencies are preserved.

Why it matters. Two savings stack: the KV cache itself is 1/m the size of a dense attention KV, and even among the compressed entries, attention is sparse (top-k only). CSA is a generalization of DeepSeek Sparse Attention from V3.2 — same idea, more aggressive compression.
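The selection logic described for CSA can be sketched in a few lines. This is a toy illustration under loud assumptions: mean pooling stands in for the learned compressor, a plain dot product stands in for the Lightning Indexer, and all function names are hypothetical. It shows only which entries a query would attend to, not the attention computation itself.

```python
import random


def mean_pool(block):
    """Average a block of vectors into one vector (assumed compressor)."""
    dim = len(block[0])
    return [sum(vec[d] for vec in block) / len(block) for d in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def csa_select(query, keys, m, top_k, window):
    """Return (compressed-block ids, recent raw-token ids) for one query.

    keys   : one key vector per past token
    m      : tokens folded into each compressed entry
    top_k  : compressed entries kept by the indexer
    window : recent uncompressed tokens that are always kept
    """
    # 1. Compress every full block of m keys into a single entry.
    #    Tokens past the last full block are only reachable via the window.
    n_blocks = len(keys) // m
    compressed = [mean_pool(keys[i * m:(i + 1) * m]) for i in range(n_blocks)]

    # 2. Score each compressed entry against the query (stand-in indexer).
    scores = [dot(query, c) for c in compressed]

    # 3. Keep only the top-k highest-scoring compressed entries.
    ranked = sorted(range(n_blocks), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:top_k])

    # 4. Always keep a sliding window of the most recent raw tokens.
    recent = list(range(max(0, len(keys) - window), len(keys)))
    return selected, recent


if __name__ == "__main__":
    random.seed(0)
    dim, n_tokens = 8, 64
    keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    query = [random.gauss(0, 1) for _ in range(dim)]
    blocks, recent = csa_select(query, keys, m=4, top_k=3, window=8)
    print("compressed blocks attended:", blocks)
    print("raw tokens attended:", recent)
    print("entries attended:", len(blocks) + len(recent), "of", n_tokens)
```

With these toy settings the query reads 11 entries (3 compressed blocks plus 8 recent tokens) instead of 64, which is the stacking of the two savings in miniature.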

Heavily Compressed Attention (HCA)

What it is. HCA applies much more aggressive compression — every m' tokens (with m' ≫ m) fold into a single KV entry. The trade-off is that HCA skips the sparse selection step entirely; attention over the compressed KV remains dense.

Why it matters. HCA is designed for the layers where retaining a broad, low-resolution view of the full context is more valuable than fine-grained selection. Interleaving HCA blocks with CSA blocks gives the model both modes — precise look-up on some layers, smeared global summary on others.
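To make the trade-off between the two layer types concrete, the sketch below counts stored KV entries and entries read per query for each. The values of m, m', top-k, and the window size are hypothetical placeholders chosen for illustration; the article states only that m' is much larger than m.

```python
def kv_entries(n_tokens: int, block: int) -> int:
    """Number of KV entries when every `block` tokens fold into one entry."""
    return -(-n_tokens // block)  # ceiling division


def layer_profile(kind: str, n_tokens: int, block: int,
                  top_k: int = 0, window: int = 0) -> dict:
    """Stored entries and entries read per query for one attention layer."""
    stored = kv_entries(n_tokens, block)
    if kind == "csa":
        # Sparse: read only the top-k compressed entries plus the local window.
        read = min(top_k, stored) + min(window, n_tokens)
    else:
        # HCA: dense attention over every compressed entry.
        read = stored
    return {"stored": stored, "read_per_query": read}


if __name__ == "__main__":
    n = 1_000_000
    # All four settings below are hypothetical, for illustration only.
    print("CSA:", layer_profile("csa", n, block=4, top_k=2048, window=128))
    print("HCA:", layer_profile("hca", n, block=128))
```

The pattern to notice is that a CSA layer stores more entries but reads few of them, while an HCA layer stores very few entries and reads them all.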

[Figure: Attention stack, illustrative layer interleaving across layers L 01 to L 04. Legend: CSA (1/m compressed + top-k sparse), HCA (heavy compression, dense), edge layers (dense).]

The key insight

The efficiency numbers above are not achievable by either CSA or HCA alone. DeepSeek tried variants in ablations; the interleaved hybrid is what holds long-context quality while dropping FLOPs and KV cache by an order of magnitude.

04 — Architecture Stack

mHC, Muon, and what carries over from V3.

Beyond the attention redesign, V4 introduces two further innovations and retains the best parts of the V3 stack. The net effect is a model that trains more stably at larger scale than any prior DeepSeek release.

Manifold-Constrained Hyper-Connections (mHC)

mHC replaces the conventional residual connections between Transformer blocks. In standard Hyper-Connections, the residual mapping can amplify signals in unstable ways when many layers stack — causing numerical blow-ups during training. mHC constrains the residual mapping to lie on the Birkhoff polytope (the manifold of doubly stochastic matrices), which bounds the spectral norm to ≤ 1 and makes signal propagation non-expansive by construction. The result: signals stay numerically stable across very deep stacks, which is what unlocks the 1.6T parameter scale at all.
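The non-expansive property is easy to see numerically. The sketch below uses Sinkhorn-Knopp normalization, one standard way to push a positive matrix toward the Birkhoff polytope; whether the paper parameterizes its mapping this way is not stated here, so treat that choice, the stream count, and all function names as assumptions. It mixes a small signal vector through many such matrices and compares its norm before and after.

```python
import math
import random


def sinkhorn(logits, n_iters=100):
    """Push a square matrix of logits toward a doubly stochastic matrix by
    alternately normalizing rows and columns (Sinkhorn-Knopp iteration)."""
    n = len(logits)
    mat = [[math.exp(v) for v in row] for row in logits]  # strictly positive
    for _ in range(n_iters):
        rows = []
        for row in mat:
            s = sum(row)
            rows.append([v / s for v in row])  # each row now sums to 1
        col_sums = [sum(rows[i][j] for i in range(n)) for j in range(n)]
        mat = [[rows[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


def matvec(mat, vec):
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]


def l2(vec):
    return math.sqrt(sum(x * x for x in vec))


if __name__ == "__main__":
    random.seed(0)
    n, depth = 4, 64  # toy sizes: 4 residual streams, 64 stacked mixings
    signal = [random.gauss(0, 1) for _ in range(n)]
    start = l2(signal)
    for _ in range(depth):
        logits = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
        signal = matvec(sinkhorn(logits), signal)
    print(f"norm before: {start:.4f}, after {depth} mixings: {l2(signal):.4f}")
```

A doubly stochastic matrix is a convex combination of permutation matrices, so it can never increase the L2 norm of a vector, and neither can a product of such matrices. That is the sense in which depth stops being a source of numerical blow-up.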

Muon Optimizer

V4 trains with Muon rather than AdamW. In DeepSeek's setup, Muon delivers faster convergence and better training stability, though the paper is careful to note that several training-time stabilizers — Anticipatory Routing and SwiGLU clamping — were still required to keep loss spikes under control at scale.

Carried over from V3

DeepSeekMoE — the fine-grained routed-expert FFN framework, with a small tweak: V4 uses Sqrt(Softplus) rather than Sigmoid for affinity scoring, removes the cap on routing target nodes, and replaces the dense FFN layers in the first few Transformer blocks with Hash-routed MoE layers.
Multi-Token Prediction (MTP) — retained unchanged from V3. Still used to accelerate inference and to improve training signal.
Auxiliary-loss-free load balancing — with a mild sequence-wise balance loss added on top to avoid extreme expert imbalance inside individual sequences.
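The change in affinity scoring listed above is small enough to show directly. The sketch compares the two scoring functions and applies each in a toy top-k router; normalizing gate weights over the selected experts is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sqrt_softplus(x: float) -> float:
    """sqrt(log(1 + e^x)): positive like sigmoid, but not capped at 1."""
    return math.sqrt(math.log1p(math.exp(x)))


def route(logits, top_k, score_fn):
    """Pick the top-k experts by affinity and normalize their gate weights.

    Normalizing over the selected experts is an assumption of this sketch.
    """
    scores = [score_fn(v) for v in logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


if __name__ == "__main__":
    logits = [-2.0, 0.0, 3.0, 6.0]
    print("sigmoid      :", route(logits, 2, sigmoid))
    print("sqrt_softplus:", route(logits, 2, sqrt_softplus))
```

Both functions are monotonic, so they select the same experts. The difference is in the gate weights: sigmoid saturates near 1 for large logits, while Sqrt(Softplus) keeps growing, so strongly preferred experts remain distinguishable from each other.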

05 — Pre-Training

33T / 32T tokens, FP4 QAT.

V4-Pro is pre-trained on 33 trillion tokens, V4-Flash on 32 trillion. Both training runs use FP4 quantization-aware training for the routed expert weights and the indexer query/key path, while keeping non-expert computation in FP8. The practical effect today is a smaller memory footprint during training and inference; the paper notes that on current hardware, peak throughput for FP4×FP8 operations is identical to FP8×FP8, but explicitly flags that purpose-built hardware could make FP4 roughly 1.33× more efficient than FP8 — an open lane for future inference gains.

Training stability was actively managed. The paper introduces Anticipatory Routing, which decouples routing updates from the backbone network by one step, fetched in advance, and is triggered automatically when a loss spike is detected. Combined with SwiGLU clamping (the linear component clamped to [-10, 10], the gate capped from above at 10), this let the authors keep loss spikes under control without compromising final-model quality.
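The clamping described above can be illustrated on a single pair of pre-activations. This is a minimal scalar sketch, assuming the clamps are applied before the activation and that the gate branch uses SiLU as in a standard SwiGLU; the function names are hypothetical.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def clamped_swiglu(gate: float, linear: float, limit: float = 10.0) -> float:
    """SwiGLU on one pair of pre-activations with the clamps described above.

    gate   : input to the SiLU branch, capped from above at `limit`
    linear : input to the linear branch, clamped to [-limit, limit]
    Applying the clamps before the activation is an assumption of this sketch.
    """
    gate = min(gate, limit)
    linear = max(-limit, min(linear, limit))
    return silu(gate) * linear


if __name__ == "__main__":
    for gate, linear in [(1.0, 2.0), (50.0, -30.0), (-50.0, 30.0)]:
        print(f"gate={gate}, linear={linear} -> {clamped_swiglu(gate, linear)}")
```

Without the clamps, the second example would produce a magnitude near 1500. With them, the output magnitude stays below 100, which bounds how far a single outlier activation can propagate.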

Pre-training tokens

V4-Pro dataset

33T

Trained with FP4 quantization-aware training on routed expert weights, FP8 for non-expert computation. Anticipatory Routing handles loss-spike recovery.

V4-Flash: 32T

MMLU-Base lift

Flash-Base vs V3.2

88.7

V4-Flash-Base hits 88.7 on MMLU versus 87.8 for V3.2-Base — despite Flash having 284B total / 13B active vs V3.2's 671B / 37B. Less than half the parameters, higher score.

<½ params

Long-context base

LongBench-V2 · Pro-Base

51.5

V4-Pro-Base reaches 51.5 on LongBench-V2 vs 40.2 for V3.2-Base. The hybrid attention earns its keep most visibly on long-context scenarios, where the numeric step-up is largest.

+11.3 vs V3.2

06 — Post-Training

On-Policy Distillation and three reasoning modes.

V4's post-training pipeline is the second big break from V3-series practice. The mixed-RL stage — previously used to consolidate capabilities across domains — is entirely replaced by On-Policy Distillation (OPD).

The sequence: train a separate specialist model for each target domain (math, code, agent, instruction following) via Supervised Fine-Tuning followed by Reinforcement Learning using Group Relative Policy Optimization (GRPO). Those specialists each become state-of-the-art in their respective field. Then train a single unified model via multi-teacher OPD, where the unified model is the student and the specialists are teachers — the student optimizes a reverse-KL loss against teacher output distributions on its own generated trajectories. The result: one model that inherits the specialists' capabilities without their per-domain narrowness.
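The reverse-KL objective at the heart of this pipeline can be sketched on a toy vocabulary. The code below is a simplified illustration, not the paper's training loop: teachers are stand-in functions keyed by a hypothetical domain name, and the "trajectory" is a hand-written list of student logits rather than real samples.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one next-token distribution."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opd_loss(trajectory, teachers, domain):
    """Average reverse KL over the positions of a student-generated trajectory.

    trajectory : list of (student_logits, context) pairs sampled from the student
    teachers   : dict mapping a domain name to a function context -> teacher logits
    """
    teacher = teachers[domain]
    losses = [reverse_kl(s_logits, teacher(ctx)) for s_logits, ctx in trajectory]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy 3-token vocabulary; the "math" teacher always prefers token 2.
    teachers = {"math": lambda ctx: [0.0, 0.0, 2.0]}
    trajectory = [([1.0, 0.0, 0.0], "step-1"), ([0.0, 0.0, 2.0], "step-2")]
    print("OPD loss:", opd_loss(trajectory, teachers, "math"))
```

The expectation is taken under the student's own distribution, which is what makes the loss on-policy: the student is corrected on the tokens it would actually generate, not on tokens drawn from a teacher's trajectories.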

Three reasoning modes

Each V4 model supports three inference modes, distinguished by how they use the <think> / </think> tokens:

Fast

Non-Think

</think> summary

Intuition-style output with no deliberate chain-of-thought. Appropriate for routine, low-risk tasks where latency and cost dominate.

Low cost · Low latency

Default

Think High

<think> trace </think>

Explicit reasoning trace before the answer. The right default for medium-risk problem solving — code review, multi-step analysis, structured retrieval.

Production default

Frontier

Think Max

prepended prompt + extended trace

Expanded context budget, reduced length penalties, Max-only system prompt demanding exhaustive decomposition and edge-case stress-testing. Produces V4-Pro's strongest benchmark numbers.

Max tokens · Max accuracy

Think Max is what produces V4-Pro-Max's strongest benchmark numbers in Section 07. At inference time, Max mode adds a prepended instruction demanding "absolute maximum" reasoning — explicitly decomposing the problem, documenting rejected hypotheses, stress-testing against edge cases — and uses a meaningfully larger context budget and reduced length penalty than High mode. The cost is output token count; the payoff is the frontier-competitive numbers on hard reasoning.

Two operational changes worth knowing

DSML XML tool-call schema. V4 replaces V3.2's tool-call format with an XML-based schema using dedicated |DSML| tokens. The paper reports fewer escaping failures and tool-call errors; practically, any agent scaffolding calling V4 should expect a slightly different tool-invocation format than V3.2 used.
Interleaved Thinking. Unlike V3.2, which discarded reasoning traces at the start of each new user turn, V4 retains the complete reasoning history across tool calls and user messages during tool-using conversations. That preserves coherent long-horizon chains of thought for agent tasks, at the cost of more context consumption.

07 — Benchmarks

Where V4-Pro-Max leads, matches, and trails.

The comparison below is a direct subset of Table 6 from the V4 paper, setting V4-Pro-Max against the strongest publicly evaluated modes of Claude Opus 4.6 (Max), GPT-5.4 (xHigh), and Gemini-3.1-Pro (High). Each entry gives the leading score and names the model that holds it, either V4-Pro-Max or a closed-frontier model.

V4-Pro-Max vs frontier · selected benchmarks

Source: DeepSeek-V4 report, Table 6

Benchmark	Details	Top score	Leader
LiveCodeBench	Pass@1 · V4-Pro-Max 93.5 · Gemini 91.7	93.5	V4-Pro-Max
Codeforces	Rating · ~23rd among human contestants	3206	V4-Pro-Max
Apex Shortlist	Pass@1 · Gemini 89.1 · V4 90.2	90.2	V4-Pro-Max
Putnam-2025	Proof-graded · 120/120 perfect	120/120	V4-Pro-Max
MMLU-Pro	EM · Gemini leads at 91.0	91.0	Gemini 3.1
GPQA Diamond	Pass@1 · V4 90.1 · Gemini 94.3	94.3	Gemini 3.1
SimpleQA-Verified	V4 57.9 · Gemini 75.6	75.6	Gemini 3.1
MRCR 1M	Long-context retrieval · Opus 92.9 · V4 83.5	92.9	Opus 4.6

DeepSeek's own framing is the honest version: V4 trails the absolute frontier by approximately 3 to 6 months on general knowledge and the hardest retrieval workloads, while setting new open-model highs on competitive programming and formal reasoning. Its Codeforces rating of 3206 places the model roughly 23rd among human contest participants. The 120/120 on Putnam-2025 is proof-perfect: every solution is a valid, graded proof rather than a numeric answer.

"Strong enough to be a serious option, honest enough to set expectations."— Our reading of the V4 paper's positioning, §8 Conclusion

08 — Access V4 Today

Weights, API, chat: all live.

Three paths are live today. Pick the one that matches the workload: open weights for on-prem or fine-tuning, the API for production integration, or the chat UI for exploration and team evaluation.

Surface	Model exposure	Best for
huggingface.co/deepseek-ai	V4-Pro + V4-Flash weights	On-prem deployment, fine-tuning, quantization for edge, sovereignty-bound workloads. Flash is tractable on modest clusters; Pro needs serious hardware.
api.deepseek.com	Non-Think / Think High / Think Max	Production integration. Reasoning mode selected via response format + system prompt. Launch pricing is listed in the release snapshot in Section 01.
chat.deepseek.com	Expert (Pro) · Instant (Flash)	Exploration, team evaluation, product UX, quick prompt testing. Expert Mode = V4-Pro, Instant Mode = V4-Flash.

For organizations with data-sovereignty or sector-compliance requirements, V4's open weights plus the 1M-context efficiency story make it the strongest open candidate today for on-prem long-document RAG replacement. For most agencies and engineering teams, the practical starting point is the API and chat surfaces — benchmark on your own prompts, measure token spend and latency, decide per-workload before considering on-prem. If you're deciding between V4 and closed frontier for specific pipelines, our AI digital transformation engagements start with exactly this kind of comparative eval.

09 — Implications

What this means for agencies and engineering teams.

V4's release changes the practical decision tree for three specific workload classes. For everything else, the frontier picture hasn't moved.

Code automation

Competitive programming & formal reasoning

93.5 LiveCodeBench, 3206 Codeforces, 120/120 Putnam-2025 is the strongest open-model signal ever released. Benchmark V4 against your current stack on your own repos before switching defaults.

Pick V4-Pro-Max

Long-context RAG

On-prem document agents

V4's open weights plus 1M-context efficiency make it the strongest open candidate today for on-prem long-document RAG in sovereignty-bound sectors. Trails Claude Opus 4.6 on MRCR 1M — pick per-workload.

Pick V4-Pro open weights

General knowledge work

Broad Q&A & retrieval

V4 trails GPT-5.4 xHigh and Gemini-3.1-Pro on MMLU-Pro, GPQA Diamond, and SimpleQA-Verified. Stay with closed frontier for generalist knowledge work until the gap closes.

Stay with frontier

Production architecture

Multi-vendor routing

Default to GPT-5.5 for agentic coding broadly; route competitive-programming and Putnam-style workloads to V4-Pro-Max; keep Opus 4.6 for MRCR 1M retrieval; Gemini 3.1 Pro for price-sensitive bulk long-context.

Route by task class
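The routing guidance above maps naturally onto a small lookup table. The sketch below is one hypothetical way to encode it: the task-class keys are invented labels, the model names are display labels taken from the text rather than API model identifiers, and the fallback choice is an assumption.

```python
# Task-class keys are illustrative; model names are display labels from the
# routing guidance above, not API model identifiers.
ROUTES = {
    "agentic_coding": "GPT-5.5",
    "competitive_programming": "V4-Pro-Max",
    "formal_proofs": "V4-Pro-Max",
    "long_context_retrieval": "Claude Opus 4.6",
    "bulk_long_context": "Gemini 3.1 Pro",
}
DEFAULT_MODEL = "GPT-5.5"  # assumed fallback for unclassified work


def pick_model(task_class: str) -> str:
    """Return the model label for a task class, falling back to the default."""
    return ROUTES.get(task_class, DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ["competitive_programming", "long_context_retrieval", "summarization"]:
        print(f"{task} -> {pick_model(task)}")
```

Keeping the table as data rather than branching logic makes it cheap to revise as your own evals come in, which is the point of routing by task class.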

For teams currently running V3.2 in production, the migration is straightforward — same architectural family, same weight-distribution story, meaningfully lower inference cost at 1M context, and three reasoning modes that replace the per-task prompt engineering V3.2 typically needed. For teams evaluating open weights for the first time because of sovereignty or cost, V4-Flash is the right starting point: 284B total / 13B active is tractable on modest clusters, and Flash-Base already surpasses V3.2-Base on most benchmarks.

10 — Conclusion

The strongest open release of the quarter.

The shape of open frontier, April 2026

Million-token context is no longer a capability question — it's a cost question.

DeepSeek V4 Preview is the most consequential open-weight release of the quarter. Two tightly-related models, a hybrid attention architecture that makes 1M context genuinely economical rather than aspirational, and a post-training pipeline that replaces one-shot inference with three explicit reasoning modes — all shipped on day one as weights, API, and chat.

The honest framing from the paper itself is the right one: V4 trails absolute frontier by three to six months on general knowledge, and sets new open-model highs on competitive programming and formal reasoning. That's enough to be a serious option for specific workload classes — not enough to displace closed frontier for general knowledge work. The practical move is to run your own evals on the specific prompts you care about, not to treat the headline numbers as a vendor decision.

The broader signal is clearer: efficiency, not raw capability, is the axis that matters for the next generation of open models. When a release more than doubles total parameters, adds about a third more active parameters, and still cuts single-token inference FLOPs at 1M context by roughly three-quarters, the question stops being "which model is smartest" and becomes "which model is cheap enough to actually run the workload I care about at the scale I care about it." V4 is the first open model to convincingly land on that side of the line.

DeepSeek V4 Launches: 1.6T MoE, 1M Context, 10% KV
