AI Development · Migration · 10 min read · Published May 6, 2026

Three reasoning modes, tokenizer changes, HCA/CSA attention, KV-cache reduction — the open-weight stack migration that actually moves the cost needle.

DeepSeek V3.2 to V4 Migration Playbook: Open-Weight Stack

DeepSeek V4 is not a drop-in upgrade. Three reasoning modes, tokenizer changes that invalidate prompt caches, a new HCA + CSA attention stack that vLLM and TensorRT-LLM both needed updates for, and a KV-cache footprint that drops to 10% of V3.2 — this playbook sequences the swap so the migration delivers the cost win without taking quality with it.

Digital Applied Team
Senior AI engineers · Published May 6, 2026
Read time: 10 min
Sources: DeepSeek V4 technical report
Breaking-change axes: 4 (modes · tokenizer · attention · KV)
KV cache reduction: 90% (V4-Pro vs V3.2 at 1M context)
Reasoning modes: 3 (Non-Think · High · Max)
Typical migration: 2-3 weeks end-to-end

The DeepSeek V3.2 to V4 migration is the first open-weight upgrade in this generation that genuinely moves the cost needle — V4-Pro runs at roughly 27% of V3.2's single-token inference FLOPs and 10% of the KV cache at 1M context — but it is not a drop-in swap. Four breaking-change axes need careful sequencing or quality will regress before the cost win lands.

What's at stake: teams that treat V4 as a version bump tend to ship the new model behind a tokenizer that quietly invalidates every prompt-prefix cache, on an inference stack that wasn't built for HCA/CSA attention, with one default reasoning mode used for every workload class. The result is a regression in both quality and latency, which then gets blamed on the model rather than the migration. The win is real; the path matters.

This playbook covers seven stages: the four axes V4 actually breaks on, when to use each of the three reasoning modes, the tokenizer change and its impact on prompt caching, what the HCA + CSA attention stack means for vLLM / TensorRT-LLM / SGLang, the KV-cache reduction and what new workload archetypes it unlocks, a phased rollout that earns the cost win without staking production quality on it, and the four pitfalls that trip teams the most.

Key takeaways
  1. Three reasoning modes need per-workload routing. Non-Think, Think High, and Think Max are not interchangeable. Defaulting every workload to Max wastes tokens; defaulting to Non-Think regresses quality on the workloads V4 was tuned around. Route by task class.
  2. Tokenizer change invalidates prompt caches — re-warm carefully. V4 uses a different tokenizer than V3.2. Any provider-side prefix cache built against V3.2 tokens is dead on first V4 request. Plan a re-warm window or expect a cost spike during the cutover.
  3. Inference-stack adjustments are non-trivial. vLLM, TensorRT-LLM, and SGLang all required code changes to support HCA + CSA hybrid attention. Pin to version ranges the V4 model card endorses; do not run V4 on a stack that does not list V4 as supported.
  4. KV-cache reduction unlocks new workload archetypes. 10% of V3.2's KV cache at 1M context makes million-token retrieval and full-codebase agents economically viable for workloads that previously cleared the cost bar only at 32k or 128k context.
  5. FP4 quantization-aware training requires hardware checks. V4 trains routed experts in FP4. Inference can still run on FP8 today with identical throughput, but the future cost lane is FP4×FP8 on capable hardware. Verify GPU support before promising the FP4-native win.

01 · What's New: V4 breaks on four axes — reasoning modes, attention, tokenizer, KV cache.

The honest way to scope a V3.2 to V4 migration is to enumerate the axes that broke. The model is a bigger MoE (1.6T total / 49B active for V4-Pro versus 671B / 37B for V3.2), but raw size is not what makes the upgrade a migration project. Four axes do.

Reasoning modes. V3.2 had a single inference path; V4 exposes three (Non-Think, Think High, Think Max), selected via the <think> token and a Max-only prepended system prompt. Every workload now needs a routing decision.

Attention stack. V3.2 used DeepSeek Sparse Attention uniformly. V4 interleaves Compressed Sparse Attention (CSA, a generalization of V3.2's sparse attention) with Heavily Compressed Attention (HCA, much more aggressive folding) across layers. That hybrid is what unlocks the cost numbers — and it required code changes in every major open-weight inference stack.

Tokenizer. V4 ships a new tokenizer. Any prompt-prefix cache, KV-cache prefix index, or evaluation suite built against V3.2 token IDs is invalidated on the first V4 request. This is the silent breakage that catches the most teams.

KV cache. At 1M-token context, V4-Pro's KV cache is roughly 10% of V3.2's. That opens new workload archetypes (full-codebase agents, multi-document corpus reasoning) that previously could not clear the inference cost bar — and changes the right memory configuration for an inference replica.

The four-axis rule
A V3.2 to V4 swap that does not explicitly plan for all four axes will regress before it wins. Reasoning-mode routing, tokenizer-aware cache rebuild, inference-stack version pinning, and KV-cache memory re-budgeting are the four items every migration ticket needs to touch.

Two things did not break, and the migration benefits from calling them out. DeepSeekMoE's fine-grained routed-expert FFN framework carries over from V3 to V4 with a small affinity-scoring tweak; Multi-Token Prediction (MTP) is retained unchanged. If your V3.2 stack already integrated with MTP for inference acceleration, the integration surface stays compatible.

02 · Reasoning Modes: Non-Think, Think High, Think Max — when to use which.

The three reasoning modes are the most visible change at the application layer. They map roughly onto the latency / cost / quality tradeoff every team already makes — but the tradeoff is now explicit in the protocol rather than buried in prompt engineering.

Non-Think · fast · latency-sensitive
</think> immediate answer

No deliberate chain-of-thought. Lowest output token count, lowest latency. Route routine classification, structured extraction, formatting, and short Q&A here. Do not route multi-step reasoning to Non-Think — V4 was not trained to do its best work in this mode.

Think High · default · production default
<think> trace </think>

Explicit reasoning trace before the answer. The production default for most workloads — code review, retrieval-grounded answering, multi-step analysis. Output token count rises meaningfully versus Non-Think, but the quality lift on medium-complexity tasks is the largest of the three modes.

Think Max · frontier · maximum quality
Max-only prompt + extended trace

Expanded context budget, reduced length penalties, and a prepended Max-only system prompt demanding exhaustive decomposition and edge-case stress-testing. Produces V4-Pro's strongest benchmark numbers — and burns the most output tokens. Reserve for hard reasoning, formal proofs, and high-stakes one-shot decisions.

The migration-time decision is a per-workload routing policy, not a single global default. Map your existing V3.2 traffic into three buckets — routine, default, frontier — then route each bucket to the matching V4 mode. Treating every request as Think Max wastes tokens and raises tail latency; treating every request as Non-Think regresses quality on the workloads V4 was actually tuned around.
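The bucket-to-mode mapping can live in a small lookup table. A minimal Python sketch — the workload-class names and the `route_mode` helper are illustrative, not part of any DeepSeek API:

```python
# Minimal sketch of a per-workload reasoning-mode routing policy.
# Class names and mode strings are illustrative assumptions.

ROUTING_POLICY = {
    # routine bucket -> Non-Think: classification, extraction, short Q&A
    "classification": "non-think",
    "extraction": "non-think",
    "short_qa": "non-think",
    # default bucket -> Think High: the production default
    "code_review": "think-high",
    "rag_answering": "think-high",
    "multi_step_analysis": "think-high",
    # frontier bucket -> Think Max: hard reasoning, one-shot decisions
    "formal_proof": "think-max",
    "hard_reasoning": "think-max",
}

def route_mode(workload_class: str) -> str:
    """Resolve a workload class to a V4 reasoning mode.

    Unknown classes fall back to Think High, the safest default:
    cheaper than Max, stronger than Non-Think.
    """
    return ROUTING_POLICY.get(workload_class, "think-high")

print(route_mode("classification"))  # non-think
print(route_mode("formal_proof"))    # think-max
print(route_mode("unmapped_task"))   # think-high
```

The fall-through default matters: new workload classes appear faster than routing tables get updated, and Think High is the mode that degrades least gracefully in neither direction.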

One operational detail worth knowing: V4 retains complete reasoning history across tool calls within a conversation (Interleaved Thinking), whereas V3.2 discarded reasoning traces at each new user turn. For agent workloads this is a quality win and a token-budget hit at the same time — coherent long-horizon chains of thought are preserved, but context consumption per turn rises. Re-tune your max-output-tokens budget during migration.
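To size the token-budget hit, compare cumulative context growth with and without retained traces. A back-of-envelope sketch with assumed per-turn token counts (the numbers are illustrative, not measured):

```python
def context_at_turn(n_turns, user_tok, answer_tok, trace_tok, retain_traces):
    """Approximate context size entering turn n_turns + 1.

    Illustrative arithmetic only: assumes fixed per-turn token counts and
    no truncation. Under V3.2-style behavior, reasoning traces are dropped
    at each new user turn; under V4 Interleaved Thinking they stay in context.
    """
    per_turn = user_tok + answer_tok + (trace_tok if retain_traces else 0)
    return per_turn * n_turns

# 10-turn agent session: 200 user + 400 answer + 900 trace tokens per turn
discarded = context_at_turn(10, 200, 400, 900, retain_traces=False)  # 6000
retained  = context_at_turn(10, 200, 400, 900, retain_traces=True)   # 15000
print(discarded, retained)
```

Even with made-up numbers the shape is clear: trace retention multiplies context consumption linearly with turn count, which is exactly what the smaller V4 KV cache makes affordable.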

"Routing reasoning modes by workload class is the single highest-leverage migration decision — the mode you default to determines whether the cost win lands or evaporates."
— Migration retrospective, agentic coding pipeline cutover

03 · Tokenizer: prompt-token impact and cache invalidation.

V4 ships a different tokenizer than V3.2. That single sentence hides the most reliably underestimated risk in this migration. Any system, anywhere in the request path, that caches tokenized content keyed by V3.2 token IDs is dead on first contact with V4.

What breaks the moment V4 traffic starts

  • Provider-side prefix caches. DeepSeek's own API prompt cache, and any third-party gateway you sit behind, indexes cache hits by the tokenized prefix of the request. Different tokenizer means zero overlap — every V4 request is a cache miss until the cache re-warms.
  • Self-hosted KV-cache prefix indexes. If you run vLLM or TensorRT-LLM with prefix caching enabled, the radix tree indexed by V3.2 tokens does not interoperate with V4 tokens. Hot-swap the model and the cache is functionally empty.
  • Evaluation suites with stored token counts. Any test that asserts on a specific token count or compares token IDs against a golden output needs regeneration.
  • Cost dashboards keyed on tokens-per-request. The same prompt will tokenize to a different count under V4. Year-over-year comparisons need a tokenizer-version dimension.

The cache-rebuild window

Plan it. A production system that runs at 60% cache-hit on V3.2 will start the V4 migration at 0% hits. Cost spikes by the amount of context you previously got for free, until your most common prompt prefixes re-warm. Two patterns work: (a) run a scripted warmup pass on your top-N system prompts and tool descriptions immediately after cutover, or (b) cut over a single workload at a time and let cache warmth rebuild naturally before adding the next workload.
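Pattern (a) starts by deciding which prefixes to warm. A minimal sketch of picking the smallest set of prompt prefixes whose combined traffic share hits a coverage target — the prefix names and traffic numbers are illustrative:

```python
def warmup_plan(prefix_traffic, coverage=0.8):
    """Pick the smallest set of prompt prefixes, most-hit first, whose
    combined traffic share reaches `coverage`.

    prefix_traffic: {prefix_id: requests_per_hour}. Illustrative helper:
    actually warming the cache means replaying one request per selected
    prefix against the V4 endpoint immediately after cutover.
    """
    total = sum(prefix_traffic.values())
    plan, covered = [], 0.0
    for prefix, hits in sorted(prefix_traffic.items(),
                               key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        plan.append(prefix)
        covered += hits
    return plan

traffic = {"support_agent_sys": 5000, "code_review_sys": 3000,
           "extraction_sys": 1500, "long_tail_a": 300, "long_tail_b": 200}
print(warmup_plan(traffic))  # ['support_agent_sys', 'code_review_sys']
```

The 80% default reflects the usual shape of prefix traffic: a handful of system prompts and tool descriptions dominate, and warming only those captures most of the steady-state hit rate.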

The silent tokenizer trap
Teams routinely budget for the model cost difference between V3.2 and V4 and forget the cache-rebuild window. The first 24-72 hours after cutover can run at 1.5-3x steady-state cost while prefix caches re-warm. Communicate the expected spike before cutover, not during.

Operationally, the migration recipe that holds up: tokenize a representative sample of your production traffic with both the V3.2 and V4 tokenizers before cutover, measure the average and P95 token-count delta, and re-tune your max-tokens budgets, timeout windows, and cost dashboards accordingly. The delta is usually modest in either direction, but per-workload variance can be significant on technical content with heavy code or domain-specific vocabulary.
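One way to run that comparison is a small harness that takes the two tokenizers as plain callables. The stand-in tokenizers below are illustrative only — in a real migration you would pass the actual V3.2 and V4 tokenizers:

```python
import statistics

def token_delta_report(samples, tokenize_old, tokenize_new):
    """Compare token counts for the same texts under two tokenizers.

    tokenize_old / tokenize_new: any callables returning token lists; in
    a real migration these wrap the V3.2 and V4 tokenizers. Returns
    (mean delta, p95 delta) in tokens, new minus old.
    """
    deltas = sorted(len(tokenize_new(s)) - len(tokenize_old(s))
                    for s in samples)
    p95 = deltas[min(len(deltas) - 1, int(0.95 * len(deltas)))]
    return statistics.mean(deltas), p95

# Stand-in tokenizers for illustration: whitespace vs. 4-chars-per-token
old_tok = lambda s: s.split()
new_tok = lambda s: [s[i:i + 4] for i in range(0, len(s), 4)]

mean_d, p95_d = token_delta_report(
    ["def f(x): return x + 1", "short text", "a much longer sample " * 8],
    old_tok, new_tok)
print(mean_d, p95_d)
```

Run it on a few thousand real production prompts, stratified by workload class, and the per-workload P95 — not the global mean — is the number that should drive your max-tokens and timeout re-tuning.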

04 · Attention Stack: HCA + CSA — inference-stack adjustments needed.

V4's attention stack interleaves Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA) across layers. The architectural rationale is in the V4 launch coverage — for the migration playbook, what matters is that every open-weight inference stack needed code changes to support the hybrid. The wrong version of vLLM, TensorRT-LLM, or SGLang will either fail to load V4 weights or run them with degraded quality.

Pin to versions the V4 model card endorses. Do not assume the latest tag works; do not assume your existing V3.2 deployment scripts are forward-compatible. The matrix below captures the shape of the decision — verify exact version numbers against current upstream release notes before you commit.

vLLM
Best for open-source serving

Active vLLM development tracks DeepSeek releases closely. V4 support arrived alongside the launch. Pin to the version range the model card lists. Verify HCA + CSA kernels are enabled — running V4 weights through a V3.2-era attention path is the most common quality regression we see.

Pin to V4-endorsed range
TensorRT-LLM
Best for NVIDIA-only fleets

Highest steady-state throughput on H100/H200 once compiled. Engine builds are non-trivial — plan 1-3 hours per shape per GPU class. V4 support landed slightly behind vLLM; check the engine-build matrix before committing.

Use when latency dominates
SGLang
Best for structured agent traffic

Strong fit for high-concurrency agent workloads with prefix-cache reuse. V4 support tracks vLLM's cadence closely. Native handling of tool-call structures pairs well with V4's new DSML XML tool-call schema.

Use for agent-heavy stacks
Other stacks
Generally not recommended yet

Stacks without explicit V4 support in their release notes will either fail to load or load with degraded attention. Do not run V4 on TGI, llama.cpp, or smaller engines until they ship a release that names V4 in the changelog.

Wait for upstream support

The implementation work each stack needed is itself instructive. HCA collapses many tokens into a single KV entry with no sparse selection step; CSA collapses fewer tokens per entry with a top-k selector and a sliding window. Both required new kernel paths — naively running V4 weights through a uniform DeepSeek Sparse Attention implementation built for V3.2 produces output that loads but is meaningfully worse than the same weights run on a V4-aware stack. The failure mode is silent; nothing throws. Validate with a benchmark, not a smoke test.

The silent quality regression
The most common V4 migration failure is loading V4 weights on a stack that does not name V4 in its release notes. The model loads. Inference succeeds. Output quality is substantially worse than expected. Always validate with a benchmark suite before cutover, not just a smoke test.

05 · KV Cache: 10% of V3.2 — what it unlocks.

At 1M-token context, V4-Pro's KV cache occupies roughly 10% of V3.2's footprint; V4-Flash drops further to 7%. The chart below puts the numbers side by side. The practical consequence is not just "cheaper inference" — it is new workload archetypes that previously could not clear the cost-per-query bar.

KV cache and FLOPs at 1M context · V4 vs V3.2
Source: DeepSeek-V4 technical report, §1 Abstract

  • V3.2 baseline · 1M context (671B / 37B active · DeepSeek Sparse Attention): 100%
  • V4-Pro KV cache · 1M context (1.6T / 49B active · HCA + CSA hybrid): 10%
  • V4-Flash KV cache · 1M context (284B / 13B active · same hybrid attention): 7%
  • V4-Pro FLOPs · 1M context (single-token inference, FP8 equivalent): 27%
  • V4-Flash FLOPs · 1M context (single-token inference, FP8 equivalent): 10%

What this unlocks. Three workload classes that sat just outside production-economic viability on V3.2 cross into "ship it" territory on V4:

  • Full-codebase agents. Loading 500k-1M tokens of a real production codebase plus a task description was a cost-prohibitive luxury on V3.2 for most teams. V4 brings it into per-request economics that justify routine use, not ceremonial use.
  • Multi-document corpus reasoning. Legal, financial, and regulatory workflows that span dozens of long documents per query benefit directly. The KV-cache budget that previously forced chunked retrieval can now afford whole-corpus context for the hardest workloads.
  • Long-running agent sessions. Interleaved Thinking (Section 02) is more affordable to keep alive across many turns when each turn's KV is an order of magnitude smaller. Coherent multi-step agents stop costing what hourly human reasoning would cost.

On the operational side, re-budget your per-replica memory allocation. A V3.2 inference replica sized for a particular concurrent-request count at 128k context has substantial headroom on V4 at the same context — either raise concurrency on the same hardware, raise context, or downsize the replica. The correct choice depends on whether your workload is latency-bound or cost-bound; measure both before committing.
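The re-budgeting itself is simple capacity arithmetic. A sketch with hypothetical memory numbers — real deployments must also budget for activations, fragmentation, and paged-KV overhead, so treat the result as an upper bound:

```python
def max_concurrency(gpu_mem_mb, weights_mb, kv_per_request_mb):
    """Upper bound on concurrent requests per replica.

    Illustrative arithmetic: free memory after weights, divided by the
    KV footprint of one request at the target context length. Ignores
    activations and allocator overhead.
    """
    return (gpu_mem_mb - weights_mb) // kv_per_request_mb

# Hypothetical replica: 80 GB GPU, 40 GB weights, 2 GB KV per request
# at 128k context on V3.2; V4 at ~10% of that KV footprint.
v32 = max_concurrency(80_000, 40_000, 2_000)  # 20
v4  = max_concurrency(80_000, 40_000, 200)    # 200
print(v32, v4)
```

The same arithmetic answers the other two options: hold concurrency fixed and see how far context can rise, or hold both fixed and see how small a GPU still clears the bar.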

06 · Phased Rollout: test → benchmark → swap engine → cut over.

The rollout sequence that holds up across migrations is four phases run in order, not parallel. The temptation is to do stack-version pinning, tokenizer testing, mode routing, and cutover in a single sprint; the failure mode is that quality regressions become impossible to attribute. One axis at a time.

Phase 1 · Week 1 · Test: Stand up V4 alongside V3.2

Deploy V4 to a non-production inference replica. Pin vLLM/TensorRT-LLM/SGLang to a V4-endorsed version. Tokenize a representative sample of production traffic with both tokenizers and capture the delta. Do not yet route any traffic.

Read-only

Phase 2 · Week 1-2 · Benchmark: Quality + cost + latency on real prompts

Run a labeled evaluation set on V3.2 and V4 (all three modes) and capture quality, output token count, latency, and cost per query. The benchmark is the cutover-decision artifact — without it, you are migrating on vibes.

Evidence-first

Phase 3 · Week 2 · Swap engine: Internal traffic, low-stakes workloads first

Route internal tooling and dev environments to V4 with the new reasoning-mode routing policy in place. Watch cost dashboards for the cache-rebuild spike. Tune max-tokens budgets and timeout windows based on real V4 latency.

Reversible

Phase 4 · Week 2-3 · Cut over: Production traffic, workload by workload

One workload class at a time, with a rollback plan per class. Watch error rates, quality samples, and cost per request for 24-72 hours after each cutover before progressing. The migration ends when all production workloads route to V4 cleanly.

Per-workload

Two cross-cutting practices make every phase smoother. First, keep V3.2 reachable on a parallel endpoint for at least a week after full cutover — a quick rollback is the safety net that makes the rest of the plan executable. Second, treat the reasoning-mode routing decisions as configuration, not code, so you can re-route a workload class without a redeploy. The first production week almost always surfaces at least one workload that benefits from a different mode than your initial mapping assumed.
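Treating the routing policy as data can be as simple as a JSON document the service reloads at runtime. The schema below is an illustrative assumption, not a standard:

```python
import json

# Sketch of mode routing kept as configuration rather than code: the
# policy lives in a JSON document that can be re-read without a deploy.
# Schema and mode strings are illustrative.

POLICY_JSON = """
{
  "default_mode": "think-high",
  "overrides": {
    "classification": "non-think",
    "formal_proof": "think-max"
  }
}
"""

VALID_MODES = {"non-think", "think-high", "think-max"}

def load_policy(raw: str) -> dict:
    """Parse and validate a routing policy; reject unknown modes early."""
    policy = json.loads(raw)
    assert policy["default_mode"] in VALID_MODES
    assert all(m in VALID_MODES for m in policy["overrides"].values())
    return policy

def route(policy: dict, workload_class: str) -> str:
    return policy["overrides"].get(workload_class, policy["default_mode"])

policy = load_policy(POLICY_JSON)
print(route(policy, "classification"))  # non-think
print(route(policy, "code_review"))     # think-high (falls to default)
```

Because re-routing is a config edit, the week-one discovery that some workload wants a different mode becomes a one-line change rather than a redeploy.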

For teams without dedicated AI infrastructure capacity, our AI transformation engagements handle the migration end-to-end — inference-stack version pinning, benchmarking, reasoning-mode routing, tokenizer-aware cache rebuild planning, and per-workload cutover with measurable rollout. The same playbook applies to Claude 4.6 to 4.7 migrations and other frontier-model upgrades.

07 · Common Pitfalls: four ways the migration trips.

Most of the pitfalls below have appeared in this playbook already, in context. Collecting them here as a pre-flight checklist makes them easier to catch before the cutover meeting.

Pitfall 1
Defaulting every workload to Think Max

Burns output tokens without proportional quality lift on most workloads. Map traffic into three buckets — routine, default, frontier — and route each to the matching mode. Re-evaluate after a week of production data.

Route by workload class
Pitfall 2
Forgetting the cache rebuild window

Tokenizer change zeroes out every prompt-prefix cache. First 24-72 hours after cutover run at 1.5-3x steady-state cost. Schedule a warmup pass on top-N system prompts and communicate the expected spike before cutover.

Warm caches deliberately
Pitfall 3
Loading V4 on a non-V4-aware stack

vLLM, TensorRT-LLM, and SGLang all needed code changes for HCA + CSA. A model that loads on the wrong stack will run with quietly degraded quality, not throw. Pin to a version that names V4 in its release notes.

Validate with a benchmark
Pitfall 4
Skipping the per-workload benchmark

Headline benchmarks tell you V4 is strong in aggregate. Your specific workload may sit on one of the axes where V4 trails closed frontier or where V3.2's mode mix was actually well-tuned. Always benchmark on your own prompts before cutover.

Benchmark before cutover

One pitfall not in the matrix because it deserves its own paragraph: assuming FP4 quantization-aware training in V4 translates to immediate FP4 inference wins on your hardware today. On current GPUs, peak throughput for FP4×FP8 operations is identical to FP8×FP8 — the FP4 weights save memory, not FLOPs. Purpose-built hardware in the next 12-24 months could make FP4 roughly 1.33x more efficient than FP8 at the same quality, but that lane is a forecast, not a present-day cost line. Plan migrations on FP8 economics; treat FP4 as the optionality lane it actually is.
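The memory-not-FLOPs point is easy to sanity-check with weight-storage arithmetic. An illustrative sketch — it ignores quantization scale/metadata overhead, and in practice only the routed experts are FP4-trained, so the real saving is smaller than the total-parameter figure suggests:

```python
def weight_memory_gb(n_params, bits):
    """Memory to store n_params weights at the given bit width.

    Pure arithmetic: no quantization scales or block metadata included,
    so real footprints run slightly higher.
    """
    return n_params * bits / 8 / 1e9

# 1.6T total parameters, FP8 vs FP4 storage (illustrative upper bound --
# only routed experts are FP4, and active params per token are far fewer)
fp8 = weight_memory_gb(1.6e12, 8)  # 1600.0 GB
fp4 = weight_memory_gb(1.6e12, 4)  # 800.0 GB
print(fp8, fp4)
```

Halved weight memory is a real deployment win (fewer GPUs to hold the model), but it changes the memory line, not the FLOPs line — which is exactly why FP4 should be planned as optionality rather than as today's cost basis.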

The shape of the migration

Open-weight migrations move the cost needle — the playbook is the difference between a sprint and a quarter.

DeepSeek V4 is the first open-weight upgrade in this generation where the cost arithmetic genuinely changes the workload calculus. 10% of V3.2's KV cache at 1M context turns full-codebase agents, multi-document corpus reasoning, and long-running agent sessions from ceremonial demos into routine production workloads. The cost win is real and measurable. It also requires a migration that explicitly handles four breaking-change axes — not a version bump.

The teams that ship V4 cleanly do three things in common. They pin their inference stack to a version that names V4 in its release notes; they route reasoning modes per workload class rather than defaulting to a single mode; and they treat the tokenizer-change cache-rebuild window as a budgeted operational event rather than a surprise cost spike. The playbook above sequences those into a two-to-three-week rollout that earns the cost win without staking production quality on the cutover.

The broader signal is the one to act on. Open-weight releases now ship at a cadence where the migration discipline you build for V4 is the same discipline you will need for the next two open releases after it. Build the muscle once — benchmark suite, per-workload routing policy, tokenizer-aware cache plan, version-pinned inference stack — and the next migration is days rather than weeks.

Migrate open-weight stacks cleanly

Open-weight migrations are infrastructure projects, not model bumps.

Our team executes DeepSeek migrations across vLLM, TensorRT-LLM, and SGLang — reasoning-mode routing, tokenizer migration, KV-cache optimization, FP4 adoption — with measurable rollout.

Free consultation · Expert guidance · Tailored solutions
What we work on

Open-weight migration engagements

  • Per-workload reasoning-mode routing
  • Inference-stack upgrade (vLLM, TensorRT-LLM, SGLang)
  • Tokenizer migration and cache rebuild planning
  • KV-cache and 1M-context economics modeling
  • FP4 quantization-aware training feasibility audit
FAQ · DeepSeek V4 migration

The questions infrastructure teams ask before the swap.

Which reasoning mode should be the production default after migrating to V4?

Think High is the right production default for most workloads — explicit reasoning trace before the answer, suitable for medium-complexity problem solving, retrieval-grounded answering, and code review. Route routine classification, short Q&A, and structured extraction to Non-Think to save tokens. Reserve Think Max for hard reasoning, formal proofs, and high-stakes one-shot decisions where the output-token cost is justified by the quality lift. Defaulting every workload to Think Max wastes tokens at scale; defaulting every workload to Non-Think regresses quality on the tasks V4 was trained to do its best work on. The actionable migration step is to map your existing V3.2 traffic into three buckets — routine, default, frontier — then route each bucket to the matching V4 mode and re-evaluate after a week of production data.