The DeepSeek V3.2 to V4 migration is the first open-weight upgrade in this generation that genuinely moves the cost needle — V4-Pro runs at roughly 27% of V3.2's single-token inference FLOPs and 10% of the KV cache at 1M context — but it is not a drop-in swap. Four breaking-change axes need careful sequencing or quality will regress before the cost win lands.
What's at stake: teams that treat V4 as a version bump tend to ship the new model behind a tokenizer that quietly invalidates every prompt-prefix cache, on an inference stack that wasn't built for HCA/CSA attention, with one default reasoning mode used for every workload class. The result is a regression in both quality and latency, which then gets blamed on the model rather than the migration. The win is real; the path matters.
This playbook covers seven parts: the four axes V4 actually breaks on, when to use each of the three reasoning modes, the tokenizer change and its impact on prompt caching, what the HCA + CSA attention stack means for vLLM / TensorRT-LLM / SGLang, the KV-cache reduction and what new workload archetypes it unlocks, a phased rollout that earns the cost win without staking production quality on it, and the four pitfalls that most often trip teams up.
- 01 · Three reasoning modes need per-workload routing. Non-Think, Think High, and Think Max are not interchangeable. Defaulting every workload to Max wastes tokens; defaulting to Non-Think regresses quality on the workloads V4 was tuned around. Route by task class.
- 02 · Tokenizer change invalidates prompt caches — re-warm carefully. V4 uses a different tokenizer than V3.2. Any provider-side prefix cache built against V3.2 tokens is dead on the first V4 request. Plan a re-warm window or expect a cost spike during the cutover.
- 03 · Inference-stack adjustments are non-trivial. vLLM, TensorRT-LLM, and SGLang all required code changes to support HCA + CSA hybrid attention. Pin to version ranges the V4 model card endorses; do not run V4 on a stack that does not list V4 as supported.
- 04 · KV-cache reduction unlocks new workload archetypes. 10% of V3.2's KV cache at 1M context makes million-token retrieval and full-codebase agents economically viable for workloads that previously cleared the cost bar only at 32k or 128k context.
- 05 · FP4 quantization-aware training requires hardware checks. V4 trains routed experts in FP4. Inference can still run on FP8 today with identical throughput, but the future cost lane is FP4×FP8 on capable hardware. Verify GPU support before promising the FP4-native win.
01 — What's New
V4 ships in four axes — reasoning modes, attention, tokenizer, KV cache.
The honest way to scope a V3.2 to V4 migration is to enumerate the axes that broke. The model is a bigger MoE (1.6T total / 49B active for V4-Pro versus 671B / 37B for V3.2), but raw size is not what makes the upgrade a migration project. Four axes do.
Reasoning modes. V3.2 had a single inference path; V4 exposes three (Non-Think, Think High, Think Max) selected via the <think> token and a prepended Max-only system prompt. Every workload now needs a routing decision.
Attention stack. V3.2 used DeepSeek Sparse Attention uniformly. V4 interleaves Compressed Sparse Attention (CSA, a generalization of V3.2's sparse attention) with Heavily Compressed Attention (HCA, which folds tokens far more aggressively) across layers. That hybrid is what unlocks the cost numbers — and it required code changes in every major open-weight inference stack.
Tokenizer. V4 ships a new tokenizer. Any prompt-prefix cache, KV-cache prefix index, or evaluation suite built against V3.2 token IDs is invalidated on the first V4 request. This is the silent breakage that catches the most teams.
KV cache. At 1M-token context, V4-Pro's KV cache is roughly 10% of V3.2's. That opens new workload archetypes (full-codebase agents, multi-document corpus reasoning) that previously could not clear the inference cost bar — and changes the right memory configuration for an inference replica.
Two things did not break, and the migration benefits from calling them out. DeepSeekMoE's fine-grained routed-expert FFN framework carries over from V3 to V4 with a small affinity-scoring tweak; Multi-Token Prediction (MTP) is retained unchanged. If your V3.2 stack already integrated with MTP for inference acceleration, the integration surface stays compatible.
02 — Reasoning Modes
Non-Think, Think High, Think Max — when to use which.
The three reasoning modes are the most visible change at the application layer. They map roughly onto the latency / cost / quality tradeoff every team already makes — but the tradeoff is now explicit in the protocol rather than buried in prompt engineering.
Non-Think · Latency-sensitive
</think> immediate answer. No deliberate chain-of-thought. Lowest output token count, lowest latency. Route routine classification, structured extraction, formatting, and short Q&A here. Do not route multi-step reasoning to Non-Think — V4 was not trained to do its best work in this mode.

Think High · Production default
<think> trace </think>. Explicit reasoning trace before the answer. The production default for most workloads — code review, retrieval-grounded answering, multi-step analysis. Output token count rises meaningfully versus Non-Think, but the quality lift on medium-complexity tasks is the largest of the three modes.

Think Max · Maximum quality
Max-only prompt + extended trace. Expanded context budget, reduced length penalties, a prepended Max-only system prompt demanding exhaustive decomposition and edge-case stress-testing. Produces V4-Pro's strongest benchmark numbers — and burns the most output tokens. Reserve for hard reasoning, formal proofs, and high-stakes one-shot decisions.

The migration-time decision is a per-workload routing policy, not a single global default. Map your existing V3.2 traffic into three buckets — routine, default, frontier — then route each bucket to the matching V4 mode, as in the sketch below. Treating every request as Think Max wastes tokens and raises tail latency; treating every request as Non-Think regresses quality on the workloads V4 was actually tuned around.
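The routing policy itself can be a few lines of data plus a thin wrapper. A minimal sketch follows, with the caveat that the bucket names, token budgets, and the `reasoning_mode` request field are illustrative assumptions — check how your gateway or inference server actually exposes Non-Think / Think High / Think Max before copying it.

```python
# Hedged sketch of a per-workload reasoning-mode routing policy.
# Bucket names, token budgets, and the `reasoning_mode` field are assumptions
# for illustration -- adapt them to your serving layer's real API.

ROUTING_POLICY = {
    # routine: classification, extraction, formatting, short Q&A
    "routine":  {"mode": "non_think",  "max_tokens": 512},
    # default: code review, retrieval-grounded answering, multi-step analysis
    "default":  {"mode": "think_high", "max_tokens": 4096},
    # frontier: hard reasoning, proofs, high-stakes one-shot decisions
    "frontier": {"mode": "think_max",  "max_tokens": 16384},
}

def build_request(workload_class: str, messages: list[dict]) -> dict:
    """Translate a workload class into request kwargs for the serving layer."""
    policy = ROUTING_POLICY.get(workload_class, ROUTING_POLICY["default"])
    return {
        "messages": messages,
        "max_tokens": policy["max_tokens"],
        # Hypothetical field name; your gateway may express the mode via a
        # chat-template flag or a prepended system prompt instead.
        "extra_body": {"reasoning_mode": policy["mode"]},
    }
```

Keeping `ROUTING_POLICY` in configuration rather than code pays off in Section 06: re-routing a workload class after the first week of production data becomes a config change, not a redeploy.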
One operational detail worth knowing: V4 retains complete reasoning history across tool calls within a conversation (Interleaved Thinking), whereas V3.2 discarded reasoning traces at each new user turn. For agent workloads this is a quality win and a token-budget hit at the same time — coherent long-horizon chains of thought are preserved, but context consumption per turn rises. Re-tune your max-output-tokens budget during migration.
"Routing reasoning modes by workload class is the single highest-leverage migration decision — the mode you default to determines whether the cost win lands or evaporates."— Migration retrospective, agentic coding pipeline cutover
03 — Tokenizer
Prompt-token impact and cache invalidation.
V4 ships a different tokenizer than V3.2. That single sentence hides the most reliably underestimated risk in this migration. Any system, anywhere in the request path, that caches tokenized content keyed by V3.2 token IDs is dead on first contact with V4.
What breaks the moment V4 traffic starts
- Provider-side prefix caches. DeepSeek's own API prompt cache, and any third-party gateway you sit behind, indexes cache hits by the tokenized prefix of the request. Different tokenizer means zero overlap — every V4 request is a cache miss until the cache re-warms.
- Self-hosted KV-cache prefix indexes. If you run vLLM or TensorRT-LLM with prefix caching enabled, the radix tree indexed by V3.2 tokens does not interoperate with V4 tokens. Hot-swap the model and the cache is functionally empty.
- Evaluation suites with stored token counts. Any test that asserts on a specific token count or compares token IDs against a golden output needs regeneration.
- Cost dashboards keyed on tokens-per-request. The same prompt will tokenize to a different count under V4. Year-over-year comparisons need a tokenizer-version dimension.
The cache-rebuild window
Plan it. A production system that runs at 60% cache-hit on V3.2 will start the V4 migration at 0% hits. Cost spikes by the amount of context you previously got for free, until your most common prompt prefixes re-warm. Two patterns work: (a) run a scripted warmup pass on your top-N system prompts and tool descriptions immediately after cutover, or (b) cut over a single workload at a time and let cache warmth rebuild naturally before adding the next workload.
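Pattern (a) is small enough to script. The sketch below assumes an OpenAI-compatible endpoint and a file containing your top-N system prompts; the base URL, model id, and file name are placeholders, not values from the V4 release.

```python
# Hedged warm-up sketch for pattern (a): replay the top-N system prompts once
# so the prefix cache re-warms before real traffic lands. Endpoint, model id,
# and prompt file are placeholders for your own deployment.
import json
from openai import OpenAI

client = OpenAI(base_url="http://v4-staging:8000/v1", api_key="unused")

with open("top_prefixes.json") as f:   # [{"system": "..."}, ...] -- your top-N prefixes
    prefixes = json.load(f)

for p in prefixes:
    # One cheap completion per prefix is enough to populate the prefix cache;
    # the single-token answer is discarded.
    client.chat.completions.create(
        model="deepseek-v4",           # placeholder model id
        messages=[{"role": "system", "content": p["system"]},
                  {"role": "user", "content": "ping"}],
        max_tokens=1,
    )
```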
Operationally, the migration recipe that holds up: tokenize a representative sample of your production traffic with both the V3.2 and V4 tokenizers before cutover, measure the average and P95 token-count delta, and re-tune your max-tokens budgets, timeout windows, and cost dashboards accordingly. The delta is usually modest in either direction, but per-workload variance can be significant on technical content with heavy code or domain-specific vocabulary.
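A short script makes the before/after comparison concrete. The Hugging Face repo ids below are placeholders (substitute whichever V3.2 and V4 checkpoints you actually deploy), and the sample should come from real production prompts, not synthetic text.

```python
# Hedged sketch: measure the prompt-token delta between the V3.2 and V4
# tokenizers on a sample of production traffic before cutover.
import numpy as np
from transformers import AutoTokenizer

# Placeholder repo ids -- substitute the checkpoints you actually serve.
tok_v32 = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.2")
tok_v4  = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4")

def token_delta_report(prompts: list[str]) -> None:
    """Print mean and P95 relative change in prompt token count under V4."""
    deltas = []
    for text in prompts:
        n_old = len(tok_v32(text)["input_ids"])
        n_new = len(tok_v4(text)["input_ids"])
        deltas.append((n_new - n_old) / max(n_old, 1))
    deltas = np.array(deltas)
    print(f"mean delta: {deltas.mean():+.1%}  P95 delta: {np.percentile(deltas, 95):+.1%}")
```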
04 — Attention Stack
HCA + CSA — inference-stack adjustments needed.
V4's attention stack interleaves Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA) across layers. The architectural rationale is in the V4 launch coverage — for the migration playbook, what matters is that every open-weight inference stack needed code changes to support the hybrid. The wrong version of vLLM, TensorRT-LLM, or SGLang will either fail to load V4 weights or run them with degraded quality.
Pin to versions the V4 model card endorses. Do not assume the latest tag works; do not assume your existing V3.2 deployment scripts are forward-compatible. The matrix below captures the shape of the decision — verify exact version numbers against current upstream release notes before you commit.
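One cheap guard is a preflight check in the replica's startup path that refuses to serve V4 on an engine version outside the range you pinned. A minimal sketch, assuming the endorsed range has been copied from the model card into the `PINNED` table (the versions shown are deliberate placeholders, not real release numbers):

```python
# Hedged preflight sketch: fail fast if the installed engine version is outside
# the V4-endorsed range. The ranges below are placeholders -- copy the exact
# versions the V4 model card / upstream release notes list.
from importlib.metadata import version
from packaging.version import Version

PINNED = {
    "vllm":   (Version("0.0.0"), Version("0.0.0")),   # placeholder range
    "sglang": (Version("0.0.0"), Version("0.0.0")),   # placeholder range
}

def check_engine(package: str) -> None:
    lo, hi = PINNED[package]
    installed = Version(version(package))
    if not (lo <= installed <= hi):
        raise SystemExit(
            f"{package} {installed} is outside the V4-endorsed range {lo}..{hi}; "
            "refusing to start this replica."
        )

check_engine("vllm")   # run for whichever engine this replica uses
```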
vLLM · Best for open-source serving · Pin to the V4-endorsed range
Active vLLM development tracks DeepSeek releases closely. V4 support arrived alongside the launch. Pin to the version range the model card lists, and verify the HCA + CSA kernels are enabled — running V4 weights through a V3.2-era attention path is the most common quality regression we see.

TensorRT-LLM · Best for NVIDIA-only fleets · Use when latency dominates
Highest steady-state throughput on H100/H200 once compiled. Engine builds are non-trivial — plan 1-3 hours per shape per GPU class. V4 support landed slightly behind vLLM; check the engine-build matrix before committing.

SGLang · Best for structured agent traffic · Use for agent-heavy stacks
Strong fit for high-concurrency agent workloads with prefix-cache reuse. V4 support tracks vLLM's cadence closely. Native handling of tool-call structures pairs well with V4's new DSML XML tool-call schema.

TGI, llama.cpp, and smaller engines · Generally not recommended yet · Wait for upstream support
Stacks without explicit V4 support in their release notes will either fail to load or load with degraded attention. Do not run V4 on these engines until they ship a release that names V4 in the changelog.

The implementation work each stack needed is itself instructive. HCA collapses many tokens into a single KV entry with no sparse selection step; CSA collapses fewer tokens per entry with a top-k selector and a sliding window. Both required new kernel paths — naively running V4 weights through a uniform DeepSeek Sparse Attention implementation built for V3.2 produces output that loads but is meaningfully worse than the same weights run on a V4-aware stack. The failure mode is silent; nothing throws. Validate with a benchmark, not a smoke test.
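Because nothing throws, the validation has to be a quality comparison, not a liveness check. A minimal sketch of that comparison follows, assuming OpenAI-compatible endpoints for the V3.2 baseline and the V4 candidate and a labeled eval set with a per-case grading function; the endpoint URLs, model ids, and grader are placeholders for your own harness.

```python
# Hedged sketch: run the same labeled eval set against the V3.2 baseline and
# the V4 candidate stack and compare aggregate quality, latency, and output
# token count. Endpoints, model ids, and the grading callable are placeholders.
import statistics
import time
from openai import OpenAI

BASELINE  = OpenAI(base_url="http://v32-prod:8000/v1",   api_key="unused")
CANDIDATE = OpenAI(base_url="http://v4-staging:8000/v1", api_key="unused")

def run_eval(client: OpenAI, model: str, eval_set: list[dict]) -> dict:
    """eval_set items look like {"messages": [...], "grade": callable} (assumed shape)."""
    scores, latencies, out_tokens = [], [], []
    for case in eval_set:
        t0 = time.time()
        resp = client.chat.completions.create(
            model=model, messages=case["messages"], max_tokens=2048)
        latencies.append(time.time() - t0)
        out_tokens.append(resp.usage.completion_tokens)
        scores.append(case["grade"](resp.choices[0].message.content))
    return {
        "quality": statistics.mean(scores),
        "p50_latency_s": statistics.median(latencies),
        "mean_output_tokens": statistics.mean(out_tokens),
    }
```

If the V4 candidate unexpectedly trails the V3.2 baseline on your own eval set, suspect the attention path and the pinned engine version before blaming the model.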
05 — KV Cache
10% of V3.2's footprint — what it unlocks.
At 1M-token context, V4-Pro's KV cache occupies roughly 10% of V3.2's footprint; V4-Flash drops further to 7%. The chart below puts the numbers side by side. The practical consequence is not just "cheaper inference" — it is new workload archetypes that previously could not clear the cost-per-query bar.
KV cache and FLOPs at 1M context · V4 vs V3.2
Source: DeepSeek-V4 technical report, §1 Abstract

What this unlocks. Three workload classes that sat just outside production-economic viability on V3.2 cross into "ship it" territory on V4:
- Full-codebase agents. Loading 500k-1M tokens of a real production codebase plus a task description was a cost-prohibitive luxury on V3.2 for most teams. V4 brings it into per-request economics that justify routine use, not ceremonial use.
- Multi-document corpus reasoning. Legal, financial, and regulatory workflows that span dozens of long documents per query benefit directly. The KV-cache budget that previously forced chunked retrieval can now afford whole-corpus context for the hardest workloads.
- Long-running agent sessions. Interleaved Thinking (Section 02) is more affordable to keep alive across many turns when each turn's KV is an order of magnitude smaller. Coherent multi-step agents stop costing what hourly human reasoning would cost.
On the operational side, re-budget your per-replica memory allocation. A V3.2 inference replica sized for a particular concurrent-request count at 128k context has substantial headroom on V4 at the same context — either raise concurrency on the same hardware, raise context, or downsize the replica. The correct choice depends on whether your workload is latency-bound or cost-bound; measure both before committing.
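The re-budgeting itself is arithmetic once you have measured per-token KV bytes for each model on your own stack (from engine logs or a profiler). A back-of-envelope sketch follows; nothing in it encodes published figures — the inputs are whatever you measure.

```python
# Hedged back-of-envelope sketch for re-budgeting a replica after the KV-cache
# reduction. Feed in per-token KV bytes measured on your own deployment;
# no published figures are assumed here.

def kv_gb_per_request(kv_bytes_per_token: float, context_tokens: int) -> float:
    """KV memory one request holds at a given context length, in GB."""
    return kv_bytes_per_token * context_tokens / 1e9

def max_concurrency(replica_kv_budget_gb: float, kv_bytes_per_token: float,
                    context_tokens: int) -> int:
    """How many concurrent requests of that context length fit in the KV budget."""
    return int(replica_kv_budget_gb // kv_gb_per_request(kv_bytes_per_token, context_tokens))

# Usage pattern: compare the same budget and context under both measurements, e.g.
#   max_concurrency(60.0, measured_v32_kv_bytes, 128_000)
#   max_concurrency(60.0, measured_v4_kv_bytes,  128_000)
# The ratio tells you whether to raise concurrency, raise context, or downsize.
```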
06 — Phased Rollout
Test → benchmark → swap engine → cut over.
The rollout sequence that holds up across migrations is four phases run in order, not parallel. The temptation is to do stack-version pinning, tokenizer testing, mode routing, and cutover in a single sprint; the failure mode is that quality regressions become impossible to attribute. One axis at a time.
Phase 1 · Stand up V4 alongside V3.2 · Read-only
Deploy V4 to a non-production inference replica. Pin vLLM/TensorRT-LLM/SGLang to a V4-endorsed version. Tokenize a representative sample of production traffic with both tokenizers and capture the delta. Do not yet route any traffic.

Phase 2 · Quality + cost + latency on real prompts · Evidence-first
Run a labeled evaluation set on V3.2 and V4 (all three modes) and capture quality, output token count, latency, and cost per query. The benchmark is the cutover-decision artifact — without it, you are migrating on vibes.

Phase 3 · Internal traffic, low-stakes workloads first · Reversible
Route internal tooling and dev environments to V4 with the new reasoning-mode routing policy in place. Watch cost dashboards for the cache-rebuild spike. Tune max-tokens budgets and timeout windows based on real V4 latency.

Phase 4 · Production traffic, workload by workload · Per-workload
One workload class at a time, with a rollback plan per class. Watch error rates, quality samples, and cost per request for 24-72 hours after each cutover before progressing. The migration ends when all production workloads route to V4 cleanly.

Two cross-cutting practices make every phase smoother. First, keep V3.2 reachable on a parallel endpoint for at least a week after full cutover — a quick rollback is the safety net that makes the rest of the plan executable. Second, treat the reasoning-mode routing decisions as configuration, not code, so you can re-route a workload class without a redeploy. The first production week almost always surfaces at least one workload that benefits from a different mode than your initial mapping assumed.
For teams without dedicated AI infrastructure capacity, our AI transformation engagements handle the migration end-to-end — inference-stack version pinning, benchmarking, reasoning-mode routing, tokenizer-aware cache rebuild planning, and per-workload cutover with measurable rollout. The same playbook applies to Claude 4.6 to 4.7 migrations and other frontier-model upgrades.
07 — Common Pitfalls
Four ways the migration trips.
Most of the pitfalls below have appeared in this playbook already, in context. Collecting them here as a pre-flight checklist makes them easier to catch before the cutover meeting.
Defaulting every workload to Think Max · Route by workload class
Burns output tokens without proportional quality lift on most workloads. Map traffic into three buckets — routine, default, frontier — and route each to the matching mode. Re-evaluate after a week of production data.

Forgetting the cache-rebuild window · Warm caches deliberately
The tokenizer change zeroes out every prompt-prefix cache. The first 24-72 hours after cutover run at 1.5-3x steady-state cost. Schedule a warmup pass on top-N system prompts and communicate the expected spike before cutover.

Loading V4 on a non-V4-aware stack · Validate with a benchmark
vLLM, TensorRT-LLM, and SGLang all needed code changes for HCA + CSA. A model that loads on the wrong stack will run with quietly degraded quality, not throw. Pin to a version that names V4 in its release notes.

Skipping the per-workload benchmark · Benchmark before cutover
Headline benchmarks tell you V4 is strong in aggregate. Your specific workload may sit on one of the axes where V4 trails closed frontier models or where V3.2's mode mix was actually well-tuned. Always benchmark on your own prompts before cutover.

One pitfall not in the matrix because it deserves its own paragraph: assuming FP4 quantization-aware training in V4 translates to immediate FP4 inference wins on your hardware today. On current GPUs, peak throughput for FP4×FP8 operations is identical to FP8×FP8 — the FP4 weights save memory, not FLOPs. Purpose-built hardware in the next 12-24 months could make FP4 roughly 1.33x more efficient than FP8 at the same quality, but that lane is a forecast, not a present-day cost line. Plan migrations on FP8 economics; treat FP4 as the optionality lane it actually is.
Open-weight migrations move the cost needle — the playbook is the difference between a sprint and a quarter.
DeepSeek V4 is the first open-weight upgrade in this generation where the cost arithmetic genuinely changes the workload calculus. 10% of V3.2's KV cache at 1M context turns full-codebase agents, multi-document corpus reasoning, and long-running agent sessions from ceremonial demos into routine production workloads. The cost win is real and measurable. It also requires a migration that explicitly handles four breaking-change axes — not a version bump.
The teams that ship V4 cleanly do three things in common. They pin their inference stack to a version that names V4 in its release notes; they route reasoning modes per workload class rather than defaulting to a single mode; and they treat the tokenizer-change cache-rebuild window as a budgeted operational event rather than a surprise cost spike. The playbook above sequences those into a two-to-three-week rollout that earns the cost win without staking production quality on the cutover.
The broader signal is the one to act on. Open-weight releases now ship at a cadence where the migration discipline you build for V4 is the same discipline you will need for the next two open releases after it. Build the muscle once — benchmark suite, per-workload routing policy, tokenizer-aware cache plan, version-pinned inference stack — and the next migration is days rather than weeks.