AI Development · Industry Guide · 12 min read · Published June 28, 2026

Weight bytes set the floor · KV cache eats your context budget · the math, recomputed

The VRAM math: weights, a KV cache, and your real context limit

“How much VRAM do I need?” is the wrong question. The right one asks for two numbers: the fixed bytes of the weights, and the KV cache that grows with every token of context. Get both and you know your real context limit on any GPU, and why the cloud sells 1M tokens of context that your single card can’t hold.

Digital Applied Team
Senior engineers · Published Jun 28, 2026
Sources: 8 primary
70B at 4-bit: ~40 GB (down from 140 GB at FP16)
KV cache · 70B @ 128K: 42 GB (rivals the weights themselves)
Single-card ceiling: ~128K (for a 70B vs 1M in the cloud)
70B KV @ 1M tokens: 327 GB (3.4× a single 96 GB card; needs many GPUs)

How much VRAM you need to run an LLM comes down to two numbers, not one. The first is fixed the moment you choose a model and a precision — the weights, measured as parameters times bytes per parameter. The second grows with every token you feed it — the key-value (KV) cache. Most guides stop at the first number. The second is the one that decides whether your context window is 8K or 128K.

That gap is why a 70-billion-parameter model that loads in 40 GB can still run out of memory: at 128K tokens of context, its KV cache alone needs roughly 42 GB — as much as the weights. Invert the usual framing and the real question becomes obvious. Once the model fits, the VRAM that’s left over is your context budget. Nothing else.

This guide gives you both formulas, a recomputed sizing table for the model sizes you’ll actually run, and a worked example — a 35B model at 4-bit on a 96 GB card — that turns leftover VRAM into a concrete token count. It also explains why the cloud can advertise a million tokens of context that no single GPU you can buy will hold. The stakes are simple: pick the wrong number and you either overpay for hardware or hit an out-of-memory wall in production. For the wider backdrop on why context lengths exploded in the first place, see our guide to the 10M-token era.

Key takeaways
  1. Weight VRAM is just parameters × bytes-per-parameter. FP16 is 2 bytes, INT8/FP8 ~1 byte, 4-bit ~0.5 bytes. A 70B model is 140 GB at FP16 and about 35 GB at 4-bit, or roughly 40 GB once you add ~15% overhead for activations and framework buffers.
  2. The KV cache scales linearly with context length. KV bytes = 2 × layers × KV-heads × head-dim × tokens × bytes-per-element. For a Llama-class 70B that is ~42 GB at 128K tokens and ~327 GB at 1M. The cache, not the weights, is what blows your budget.
  3. Leftover VRAM is your context budget. A 35B at 4-bit loads in ~20 GB, leaving ~76 GB on a 96 GB card. At ~0.26 MB per token that headroom buys roughly 290K tokens of context, or ~580K if you quantize the KV cache to FP8.
  4. A single card tops out near 128K; the cloud’s 1M is sharded. A 70B at 1M tokens needs ~327 GB of KV cache, 3.4× a 96 GB card. Cloud 1M context works by distributing that cache across many GPUs with ring attention, context parallelism, and aggressive KV quantization.
  5. Capacity is not speed: decode is bandwidth-bound. Token generation is limited by memory bandwidth, not capacity. The DGX Spark fits big models in 128 GB but its 273 GB/s makes a dense 70B slow; a 96 GB RTX PRO 6000 at 1,792 GB/s is ~6.6× faster per token.

01 · The Core Equation · Two numbers decide everything.

Every VRAM question collapses into the same pair. One number is fixed and one grows. Confusing them is why so much hardware advice is wrong: a card that comfortably loads a model can still be useless for the context length you actually need. Get both into a single spreadsheet row and the rest of this guide is just arithmetic.

Number 1 · Fixed
Model weights
params × bytes/param

Set the instant you pick a model and a precision. A 70B at FP16 is 140 GB; at 4-bit it is about 35 GB of raw weights. This number does not move while the model runs — it is a one-time footprint you load once.

One-time footprint
Number 2 · Grows
The KV cache
2 × L × H_kv × d × T × bytes

Grows linearly with every token of context and every concurrent request. At 128K tokens a Llama-class 70B’s cache (~42 GB) rivals its own 4-bit weights. This is the number that quietly decides your context ceiling.

Scales with context

The mental model: weights are the cost of admission; the KV cache is the metered tab that runs while you’re inside. You pay the first once and the second per token. Everything that follows — quantization, GPU choice, the gap between local and cloud context lengths — is a consequence of how these two numbers interact inside a fixed VRAM envelope.
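Written out, the two numbers and the budget they share look roughly like this for a single request. The symbols are shorthand introduced here for illustration, not notation from any vendor guide: P is the parameter count, b_w the bytes per weight, o the overhead fraction (about 0.15 in this guide), L the layer count, H_kv the number of key/value heads, d the head dimension, T the tokens in context, b_kv the bytes per cache element, and V the card's VRAM.

```latex
\begin{aligned}
W &= P \cdot b_w \cdot (1 + o) \\
C(T) &= 2 \cdot L \cdot H_{kv} \cdot d \cdot T \cdot b_{kv} \\
W + C(T) \le V
  \quad &\Longrightarrow \quad
  T_{\max} = \left\lfloor \frac{V - W}{2 \cdot L \cdot H_{kv} \cdot d \cdot b_{kv}} \right\rfloor
\end{aligned}
```

The last line is the inversion this guide keeps returning to: fix the card and the model, and the context ceiling is whatever the leftover bytes can buy.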

02 · Weight Math · Weight VRAM = params × bytes per param.

The first number is the easy one. Multiply the parameter count by the bytes each parameter occupies at your chosen precision: 4 bytes at FP32, 2 at FP16/BF16, roughly 1 at FP8 or INT8, and about 0.5 at 4-bit. A 70B model is therefore 280 GB at FP32, 140 GB at FP16, 70 GB at INT8, and around 35 GB at 4-bit. Then add 10–20% on top for activation tensors, framework buffers, the CUDA context, and a small KV cache at default context length — a 1.15× multiplier is a safe rule of thumb.
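The weight arithmetic above fits in a few lines. This is a minimal sketch, assuming decimal gigabytes and the 1.15× rule-of-thumb overhead; the function name and the precision labels are illustrative, not part of any library.

```python
# Bytes per parameter at each precision, as quoted in this guide.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_vram_gb(params_billions: float, precision: str, overhead: float = 1.15) -> float:
    """Estimated footprint in decimal GB: params x bytes/param x overhead.

    Pass overhead=1.0 to get the raw weight bytes with nothing added.
    """
    return params_billions * BYTES_PER_PARAM[precision] * overhead


if __name__ == "__main__":
    for size in (7, 13, 35, 70, 120):
        raw = weight_vram_gb(size, "q4", overhead=1.0)
        loaded = weight_vram_gb(size, "q4")
        print(f"{size}B  raw 4-bit: {raw:.1f} GB  loaded: {loaded:.1f} GB")
```

For a 70B this gives 35 GB raw and about 40 GB loaded at 4-bit. Swap the overhead for 1.10 or 1.20 to see the range the 10–20% guidance implies.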

Approximate weight VRAM for common model sizes at FP16/BF16, FP8/INT8, and 4-bit, plus the loaded 4-bit footprint (raw bytes × 1.15 overhead) and the smallest GPU that holds it. Computed as parameters × bytes per parameter.
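| Model | FP16 / BF16 | FP8 / INT8 | 4-bit (Q4) | Loaded @ 4-bit | Smallest card              |
|-------|-------------|------------|------------|----------------|----------------------------|
| 7B    | 14 GB       | 7 GB       | 3.5 GB     | ~4 GB          | 8 GB+                      |
| 13B   | 26 GB       | 13 GB      | 6.5 GB     | ~7.5 GB        | 12 GB+                     |
| 35B   | 70 GB       | 35 GB      | 17.5 GB    | ~20 GB         | 24 GB (RTX 4090)           |
| 70B   | 140 GB      | 70 GB      | 35 GB      | ~40 GB         | 48 GB+ (not a 32 GB 5090)  |
| 120B  | 240 GB      | 120 GB     | 60 GB      | ~69 GB         | 96 GB (RTX PRO 6000)       |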

The “Loaded @ 4-bit” column is the number that matters for planning: raw 4-bit weights times 1.15. Note the 70B row — about 40 GB loaded. That is why a 32 GB RTX 5090 cannot hold a 4-bit 70B on a single card no matter how you slice it, while a 96 GB card swallows the weights and still leaves room for a large context. Real 4-bit GGUF formats (Q4_K_M and friends) run slightly above a clean 0.5 bytes per weight, so treat these as floors, not exact reservations. For the quality-versus-size trade behind each precision, see our 4-bit vs 8-bit vs FP8 tradeoff data.

03 · KV Cache · The formula that grows with context.

During generation, the model caches the key and value tensors for every token it has already seen so it doesn’t recompute them. That cache is the KV cache, and it grows one token at a time. The total size is governed by a single equation built entirely from the model’s architecture and the context length.

The canonical formula
NVIDIA’s inference-optimization guide states it plainly: size of KV cache per token (bytes) = 2 × num_layers × (num_heads × dim_head) × precision_in_bytes. The leading 2 is one key tensor plus one value tensor; for grouped-query models, num_heads here means the key/value heads, not the query heads. Multiply by the number of tokens in context for the total. The layer, head, and dimension terms are fixed architecture specs and the precision is set when you configure the cache, so context length is the one term that moves at runtime.

Plug in a Llama 3.1 70B (80 layers, 8 KV heads, 128-dim heads, BF16): 2 × 80 × 8 × 128 × 2 = 327,680 bytes per token, about 0.33 MB. At 128,000 tokens that is ~42 GB; at a million tokens it is ~327 GB. The table below runs the same arithmetic for the models and context lengths you’re most likely to hit, including what happens when you quantize the cache itself.
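The same plug-in can be reproduced in code. This is a sketch under the guide's own conventions: decimal gigabytes, and "128K" read as 128,000 tokens. The helper names and the LLAMA_70B dictionary are illustrative; the layer, head, and dimension values are the ones quoted above.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: float = 2.0) -> float:
    """Per-token KV cache size in bytes.

    The leading 2 is one key tensor plus one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: float = 2.0) -> float:
    """Total KV cache for one request, in decimal GB."""
    return kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem) * tokens / 1e9


# Architecture values quoted in this guide for Llama 3.1 70B (BF16 cache).
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    print(kv_bytes_per_token(**LLAMA_70B))            # bytes per token
    for tokens in (4_000, 128_000, 1_000_000):
        print(tokens, round(kv_cache_gb(tokens, **LLAMA_70B), 2), "GB")
```

Passing bytes_per_elem=1.0 or 0.5 reproduces the FP8 and INT4 cache figures, since the formula is linear in every term.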

KV-cache VRAM by context length for Llama 3.1 8B and 70B at BF16, plus the 70B with an FP8 and an INT4 quantized cache. Computed as 2 × layers × KV-heads × head-dim × tokens × bytes-per-element, in decimal GB.
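| Context | 8B · BF16 | 70B · BF16 | 70B · FP8 KV | 70B · INT4 KV |
|---------|-----------|------------|--------------|---------------|
| 4K      | 0.52 GB   | 1.31 GB    | 0.66 GB      | 0.33 GB       |
| 16K     | 2.10 GB   | 5.24 GB    | 2.62 GB      | 1.31 GB       |
| 32K     | 4.19 GB   | 10.49 GB   | 5.24 GB      | 2.62 GB       |
| 64K     | 8.39 GB   | 20.97 GB   | 10.49 GB     | 5.24 GB       |
| 128K    | 16.78 GB  | 41.94 GB   | 20.97 GB     | 10.49 GB      |
| 256K    | 33.55 GB  | 83.89 GB   | 41.94 GB     | 20.97 GB      |
| 512K    | 67.11 GB  | 167.77 GB  | 83.89 GB     | 41.94 GB      |
| 1M      | 131.07 GB | 327.68 GB  | 163.84 GB    | 81.92 GB      |

Why the numbers are survivable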
Modern open models use grouped-query attention (GQA): Llama 3.1 70B has 64 query heads but only 8 key/value heads, an 8× cut to the cache. Without GQA, that 128K cache would be roughly 336 GB instead of ~42 GB. DeepSeek goes further with multi-head latent attention (MLA), reported at about 70 KB per token versus 192–328 KB for GQA models — figures computed from published architecture specs, not independently reproduced here.
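The GQA saving is easy to check by running the cache formula twice, once with 8 key/value heads and once with a hypothetical 64. The 64-head case is a counterfactual for illustration only; it is not a configuration that Llama 3.1 70B ships with.

```python
def cache_gb(kv_heads: int, layers: int = 80, head_dim: int = 128,
             tokens: int = 128_000, bytes_per_elem: int = 2) -> float:
    """KV cache in decimal GB for a Llama-3.1-70B-shaped model at BF16."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9


gqa = cache_gb(kv_heads=8)    # grouped-query attention, as quoted in this guide
mha = cache_gb(kv_heads=64)   # hypothetical: every query head keeps its own K/V

print(f"GQA: {gqa:.1f} GB, no GQA: {mha:.1f} GB, ratio: {mha / gqa:.0f}x")
```

The ratio is simply the ratio of head counts, because the head count enters the formula as a plain multiplier.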

Two cautions before you build a budget on this table. First, the formula counts a single request; multiply by batch size for concurrent users. Four simultaneous 128K requests on a 70B is 4 × 42 = 168 GB of cache alone, which is why throughput planning is a different problem from single-session context. Second, the cache is also the biggest lever you have: dropping it to an FP8 element halves every cell, and an INT4 cache quarters it. The full toolbox — paged attention, prefix caching, quantized caches — lives in our KV-cache optimization techniques guide.
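Both cautions are multipliers on the same per-token figure, which a short sketch makes concrete. It assumes every concurrent request runs at the full context length and shares nothing with the others (no prefix caching); the dtype labels and function name are illustrative.

```python
# Bytes per cache element for the cache precisions discussed in this guide.
KV_BYTES = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def total_kv_gb(tokens: int, concurrent_requests: int = 1, kv_dtype: str = "bf16",
                layers: int = 80, kv_heads: int = 8, head_dim: int = 128) -> float:
    """Aggregate KV cache in decimal GB across concurrent requests."""
    per_token = 2 * layers * kv_heads * head_dim * KV_BYTES[kv_dtype]
    return per_token * tokens * concurrent_requests / 1e9


if __name__ == "__main__":
    for dtype in ("bf16", "fp8", "int4"):
        gb = total_kv_gb(128_000, concurrent_requests=4, kv_dtype=dtype)
        print(f"4 x 128K, {dtype} cache: {gb:.1f} GB")
```

At BF16 the unrounded total is about 167.8 GB (the 168 GB above uses the rounded 42 GB per request); an FP8 cache brings it to about 83.9 GB and INT4 to about 41.9 GB.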

04 · Worked Example · Leftover VRAM is your context budget.

Here is the calculation almost no guide does in full: take a real card, load a real model, and ask how much context the remaining VRAM actually buys. Work a 35B model at 4-bit on a 96 GB RTX PRO 6000 in three steps and the answer falls out cleanly.

Step 1 · Weights
35B at 4-bit, loaded
20 GB

Raw 4-bit weights are 35B × 0.5 bytes = 17.5 GB. Add ~15% for activations, framework buffers, and CUDA context and you load around 20 GB — versus 70 GB for the same model at FP16.

vs 70 GB at FP16
Step 2 · Headroom
What is left on a 96 GB card
76 GB

96 GB total minus ~20 GB of loaded weights leaves 76 GB. That leftover — not the model size — is the number that determines how much context you can actually use.

96 − 20 = 76
Step 3 · Context
Tokens the headroom buys
~290K

At ~0.26 MB per token (assuming a representative 32B-class GQA layout as a stand-in for the 35B: 64 layers, 8 KV heads, 128-dim, BF16 cache), 76 GB ÷ 0.26 MB ≈ 290K tokens of BF16 KV, or roughly 580K if you quantize the cache to FP8.

76 GB ÷ 0.26 MB/tok

That is the whole insight in one line: a smaller model doesn’t just cost less VRAM, it frees enormous context headroom. The same card that gives a 35B model nearly 290K theoretical tokens gives a 70B far less, because the 70B both weighs more and burns cache faster. Swap the 35B for a 70B at 4-bit (~40 GB loaded) and the headroom drops to about 56 GB — enough for the model’s native 128K window (~42 GB of cache) with a thin margin for activations, and nowhere near the 256K tier. Treat these as theoretical ceilings: a real framework reserves activation and scratch memory too, so usable context lands somewhat below the cache-only math.
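The three steps collapse into one function. This is a cache-only sketch using the guide's rounded loaded-weight figures (20 GB and 40 GB) and the layer and head counts quoted above; it reserves nothing for activations or scratch memory, so real frameworks will land below these numbers. The function name is illustrative.

```python
def max_context_tokens(card_gb: float, loaded_weights_gb: float, layers: int,
                       kv_heads: int, head_dim: int, kv_bytes: float = 2.0) -> int:
    """Tokens of KV cache that fit in the VRAM left after the weights load."""
    headroom_bytes = (card_gb - loaded_weights_gb) * 1e9
    per_token = 2 * layers * kv_heads * head_dim * kv_bytes
    return max(0, int(headroom_bytes // per_token))


if __name__ == "__main__":
    # 35B-class model at 4-bit on a 96 GB card, BF16 then FP8 cache.
    print(max_context_tokens(96, 20, layers=64, kv_heads=8, head_dim=128))
    print(max_context_tokens(96, 20, layers=64, kv_heads=8, head_dim=128, kv_bytes=1.0))
    # 70B at 4-bit on the same card, BF16 cache.
    print(max_context_tokens(96, 40, layers=80, kv_heads=8, head_dim=128))
```

The first two calls return roughly 290K and 580K tokens. The third returns about 171K, which is why the 70B clears 128K but not the 256K tier.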

05 · The 1M Paradox · Why one card stops near 128K while the cloud sells 1M.

Run the budget for a 70B at 4-bit on a 96 GB card and the ceiling is stark. Weights take ~40 GB. A 128K cache adds ~42 GB, for ~82 GB total — it fits, with about 14 GB left for activations and the CUDA context. Push to 256K and the cache alone is ~84 GB; weights plus cache is ~124 GB, over budget. The card tops out near 128K, which also happens to be the model’s trained window. Both limits converge on the same answer.

A 70B on one 96 GB card · where the budget runs out

Calculated · 4-bit weights, BF16 KV cache

| Item                         | Note                    | Footprint | Status          |
|------------------------------|-------------------------|-----------|-----------------|
| 96 GB card · total capacity  | RTX PRO 6000 Blackwell  | 96 GB     |                 |
| 70B weights · 4-bit, loaded  | ~35 GB raw + overhead   | 40 GB     |                 |
| + KV cache @ 128K            | cumulative footprint    | 82 GB     | Fits            |
| + KV cache @ 256K            | cumulative footprint    | 124 GB    | Over budget     |
| KV cache @ 1M (alone)        | 3.4× a single card      | 327 GB    | Needs many GPUs |
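The budget above can be checked tier by tier. A minimal sketch, assuming the guide's 40 GB loaded-weight figure and the 327,680 bytes-per-token cache for a Llama 3.1 70B at BF16; the constant and function names are illustrative.

```python
CARD_GB = 96.0
WEIGHTS_GB = 40.0              # 4-bit 70B, loaded (this guide's rounded figure)
KV_BYTES_PER_TOKEN = 327_680   # Llama 3.1 70B, BF16 cache


def budget(tokens: int) -> tuple[float, bool]:
    """Return (weights + cache in decimal GB, whether it fits on the card)."""
    cache_gb = KV_BYTES_PER_TOKEN * tokens / 1e9
    total_gb = WEIGHTS_GB + cache_gb
    return total_gb, total_gb <= CARD_GB


if __name__ == "__main__":
    for label, tokens in [("128K", 128_000), ("256K", 256_000), ("1M", 1_000_000)]:
        total, fits = budget(tokens)
        print(f"{label}: {total:.1f} GB -> {'fits' if fits else 'over budget'}")
```

Only the 128K tier passes, at about 82 GB; 256K lands near 124 GB.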

So how does an API advertise a million tokens? Not by fitting it on one card. That same 70B at 1M needs ~327 GB of KV cache — 3.4× a 96 GB card. Cloud providers reach it by sharding the cache across many GPUs and overlapping the work: context parallelism splits the sequence across devices, ring attention passes key/value blocks around a GPU ring so each device sees the whole context, and cache-offloading plus NVFP4 quantization shrink what has to stay resident. The gap between “what fits on my GPU” and “what the API offers” is architecture and orchestration, not just raw VRAM. We dig into the provider-by-provider numbers in our 1M-to-10M context window comparison.
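For a rough sense of scale, here is a back-of-envelope GPU count for that 1M-token cache. It is deliberately naive and rests on assumptions that are not from any provider: the cache splits perfectly evenly across devices, every GPU also holds a full 40 GB copy of the weights, and nothing is reserved for activations or communication buffers. Real deployments typically shard the weights as well, so treat the result as an order-of-magnitude sketch.

```python
import math


def gpus_needed(kv_total_gb: float, card_gb: float = 96.0,
                weights_per_gpu_gb: float = 40.0) -> int:
    """Smallest GPU count whose leftover VRAM holds an evenly sharded KV cache."""
    usable_gb = card_gb - weights_per_gpu_gb
    if usable_gb <= 0:
        raise ValueError("weights alone exceed the card's capacity")
    return math.ceil(kv_total_gb / usable_gb)


if __name__ == "__main__":
    print(gpus_needed(327.68))                          # BF16 cache at 1M tokens
    print(gpus_needed(163.84))                          # FP8 cache at 1M tokens
    print(gpus_needed(327.68, weights_per_gpu_gb=0.0))  # cache-only lower bound
```

Under these assumptions a BF16 cache needs six 96 GB cards and an FP8 cache needs three, which is one way to see why cache quantization appears in every long-context stack.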

06 · Capacity vs Speed · Fitting a model and running it fast are different problems.

VRAM capacity tells you whether a model loads. It says nothing about how fast it generates. Inference has two phases with different bottlenecks, and conflating them is the most common buying mistake.

“Prefill is usually compute-bound, meaning it’s limited by how fast the GPU can do math. Decode is usually memory-bandwidth-bound, meaning it’s limited by how fast the GPU can move data around.”
Jim Allen Wallace · Redis Engineering Blog

Decode — generating one token at a time — is where you spend most of your runtime, and it is bound by memory bandwidth, not compute. A useful ceiling: theoretical tokens per second is roughly memory bandwidth divided by the bytes the model must read per token, and real-world numbers land at 60–80% of that. Double the bandwidth and you roughly double the speed. Capacity gets the model in the door; bandwidth decides how fast it talks.
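That ceiling is a single division. The sketch below assumes a dense 4-bit 70B streams roughly its 35 GB of raw weights for every generated token and ignores KV-cache reads, which is a simplification; the bandwidth figures are the vendor numbers quoted in this guide, and the function name is illustrative.

```python
def decode_tok_per_s(bandwidth_gb_s: float, gb_read_per_token: float,
                     efficiency: float = 1.0) -> float:
    """Bandwidth-bound decode estimate: bandwidth / bytes read per token."""
    return bandwidth_gb_s / gb_read_per_token * efficiency


WEIGHTS_GB = 35.0  # assumption: raw 4-bit 70B weights are read once per token

if __name__ == "__main__":
    for name, bandwidth in [("RTX PRO 6000", 1792), ("Apple M5 Max", 614), ("DGX Spark", 273)]:
        ceiling = decode_tok_per_s(bandwidth, WEIGHTS_GB)
        low = decode_tok_per_s(bandwidth, WEIGHTS_GB, efficiency=0.6)
        high = decode_tok_per_s(bandwidth, WEIGHTS_GB, efficiency=0.8)
        print(f"{name}: ceiling {ceiling:.1f} tok/s, realistic {low:.1f} to {high:.1f}")
```

For the 1,792 GB/s card the ceiling works out to 51.2 tok/s, or roughly 31 to 41 tok/s at 60–80% efficiency. Because the estimate is linear in bandwidth, the 273 GB/s box lands about 6.6× lower.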

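Memory bandwidth sets the decode-speed ceiling

Vendor specifications

| Device                 | Memory type      | Bandwidth   |
|------------------------|------------------|-------------|
| RTX PRO 6000 (96 GB)   | GDDR7            | 1,792 GB/s  |
| RTX 5090 (32 GB)       | GDDR7            | 1,792 GB/s  |
| Apple M5 Max (128 GB)  | unified memory   | 614 GB/s    |
| DGX Spark (128 GB)     | LPDDR5x unified  | 273 GB/s    |

Capacity is not speed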
The DGX Spark fits a 70B in its 128 GB of unified memory, something a 32 GB RTX 5090 cannot do, but it has only 273 GB/s of memory bandwidth against 1,792 GB/s for a discrete GDDR7 card, a gap of roughly 6.6×. Because decode is bandwidth-bound, a dense 70B at 4-bit generates only single-digit tokens per second on it; independent reports land near 3–7 tok/s, and NVIDIA’s own published benchmarks stop at 8B–14B models. Higher figures you’ll see quoted tend to be mixture-of-experts models with few active parameters, smaller models, or NVFP4/TensorRT-LLM-tuned stacks, not a dense 70B. Treat the Spark, with its ~140 W typical draw (240 W-rated PSU), as a way to load big models, not to run them fast.

The same bandwidth lens explains the rest of the field. A 96 GB RTX PRO 6000 has been independently measured at around 31–32 tok/s on a 70B, the practical reference for a single-card 70B. An Apple M5 Max (614 GB/s, up to 128 GB) lands lower; community benchmarks put a 70B in the ~15–32 tok/s range depending on framework and quantization, but the top of that range sits above the roughly 17 tok/s that 614 GB/s allows for ~35 GB of 4-bit weights, so treat any single figure as an estimate. And the fast-but-small RTX 5090, despite matching the PRO 6000’s 1,792 GB/s, can’t hold a 70B at all; point it at models up to ~30B, where its bandwidth implies roughly 60–90 tok/s (an estimate derived from memory bandwidth, not a measured benchmark). The tokens-per-second figures here are approximate and stack-dependent: measured for the PRO 6000, community estimates for the M5 Max, and bandwidth-derived for the 5090.

07 · Hardware Shortlist · Pick the card for the job, not the spec sheet.

With the weight math, the cache math, and the bandwidth lens in hand, the hardware choice gets concrete. Match the card to whether your bottleneck is fitting a model, generating fast, or both — and budget for a hardware market that is anything but stable right now.

Max capacity · workstation
RTX PRO 6000 Blackwell

96 GB GDDR7, 1,792 GB/s, 600 W rated TDP. The only single workstation card that fits a 4-bit 70B with room for a 128K context. Launched at an ~$8,565 MSRP; mid-2026 listings have surged to roughly $12,000–$14,500 amid the memory shortage.

Pick for a local 70B
Fast · consumer
RTX 5090

32 GB GDDR7 at the same 1,792 GB/s — fastest decode in this list, but 32 GB cannot hold a ~40 GB 4-bit 70B. Keep it to models up to ~30B (≈60–90 tok/s, estimated from bandwidth); step up to two cards or a 96 GB card for 70B.

Pick for ≤30B speed
Big-model capacity · compact
NVIDIA DGX Spark

128 GB LPDDR5x unified, 273 GB/s, ~140 W typical draw (240 W-rated PSU), ~$3,999. Loads models a 5090 can’t, but the low bandwidth means modest, stack-dependent decode speed. A capacity play, not a speed play.

Pick to fit, not to race
Unified memory · Mac
Apple M5 Max

Up to 128 GB unified, 614 GB/s, announced March 2026. Community estimates put a 70B around 15–32 tok/s — verify per framework. Note the Mac Studio M3 Ultra now tops out at 96 GB after its larger configs were pulled in 2026.

Pick for a quiet desktop
One shortage, two price stories
The same 2026 DRAM and GDDR7 shortage shows up on both sides of this market. Apple pulled the 512 GB Mac Studio M3 Ultra in March 2026 and the 256 GB config in May, leaving 96 GB as the largest M3 Ultra you can buy. The same squeeze pushed the RTX PRO 6000 from its ~$8,565 launch MSRP toward roughly $12,000–$14,500 in mid-2026 listings (NVIDIA’s own marketplace around $13,250). Budget for the hardware market, not just the model.

08 · Quantization · How low can you quantize?

Quantization is the dial that makes all of this affordable — fewer bytes per parameter shrinks both the weights and, when applied to the cache, the KV footprint. The trade is quality, but on modern formats it is smaller than most people expect until you push past 4-bit.

Lossless
FP16 / BF16

2 bytes per parameter — the reference quality every quantized tier is measured against. Use it when VRAM is plentiful or quality is genuinely non-negotiable.

Reference quality
Near-free
FP8

~1 byte per parameter, under ~0.4 perplexity points of degradation versus BF16, and benchmarks suggest roughly 1.4–1.8× throughput on Hopper/Blackwell at large batch. The practical GPU sweet spot.

Best on new GPUs
Safe 2×
INT8 / Q8

~1 byte per parameter with about half a percent of measured degradation. A conservative choice when you want the memory saving without thinking hard about quality.

Conservative saving
Local sweet spot
Q4_K_M

Roughly 4.5 effective bits per weight (about 0.56 bytes) and only ~0.05 perplexity points over FP16 on a 7B, inside run-to-run noise. The default for local inference on consumer and Apple hardware.

Local default

The quality numbers above are directional — they vary by model, benchmark, and the specific quantization implementation — but the shape is consistent: FP8 and INT8 are close to free, Q4_K_M is the local default that almost nobody can distinguish blind, and only at 2-bit does degradation become obvious. The same logic applies to the cache: an FP8 KV halves every cell in the table above and an INT4 KV quarters it, often with negligible quality cost on long-context retrieval. For the cross-model regression data behind each tier, our quantization tradeoffs breakdown has the benchmarks. And if you’re weighing local inference against a managed deployment for a production workload, that build-versus-buy call is exactly what our AI transformation engagements are built to answer.
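The bits-per-weight figures in the tiers above convert to gigabytes with one division by eight. A small sketch, using this guide's ~4.5 effective bits for Q4_K_M; actual file sizes vary by model and quantization recipe, and the function name is illustrative.

```python
def raw_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Raw weight size in decimal GB for a given effective bit width."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for label, bits in [("BF16", 16), ("FP8 / INT8", 8), ("Q4_K_M (~4.5 bits)", 4.5), ("clean 4-bit", 4)]:
        print(f"70B at {label}: {raw_weight_gb(70, bits):.1f} GB")
```

At 4.5 effective bits a 70B comes to about 39.4 GB of raw weights rather than a clean 35 GB, which is why the 4-bit figures in the sizing table are best read as floors.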

09 · Conclusion · Two numbers, one spreadsheet row.

The shape of local inference, June 2026

Once the model fits, the VRAM that’s left over is your real context budget.

The whole question of “how much VRAM” reduces to two figures you can put in one row. Weights are parameters times bytes per parameter, fixed by your model and precision. The KV cache is 2 × layers × KV-heads × head-dim × tokens × bytes, growing linearly with context. Subtract the loaded weights from your card’s capacity and divide what’s left by the per-token cache size — that quotient, not the model size, is the context you can actually run.

That reframing is what the hardware tiers are really about. A 35B at 4-bit on a 96 GB card has room for hundreds of thousands of tokens; a 70B on the same card is capacity-bound near its 128K window; a million tokens needs more memory than any single card you can buy, which is why the cloud shards it across GPUs. And remember that fitting a model is not the same as running it fast: decode is bandwidth-bound, so a high-capacity, low-bandwidth box can load big models and still generate tokens slowly.

Looking ahead, the pressure is on the cache, not the weights. Architectures like MLA and ever-more-aggressive KV quantization are attacking the one term that scales with context, because that is the term standing between a single workstation and genuinely long local context. For now, do the arithmetic before you buy: it’s the difference between a card that serves your workload and one that stalls at the first long prompt. If a custom build is on the table, our engineering team can size the stack with you.

FAQ · VRAM & KV-cache math

The questions we get every week.

Start with the weights: a 70B model is about 140 GB at FP16/BF16, 70 GB at INT8 or FP8, and roughly 35 GB at 4-bit — call it ~40 GB once you add 10–20% overhead for activations, framework buffers, and the CUDA context. But weights are only half the answer. You also need room for the KV cache, which grows with context length: at 128K tokens a Llama-class 70B needs another ~42 GB of cache. So a realistic single-card target for a 4-bit 70B at long context is a 96 GB card, which holds weights plus a 128K cache with a thin margin to spare. Smaller contexts need much less — at 4K tokens the cache is barely over 1 GB.