How much VRAM you need to run an LLM comes down to two numbers, not one. The first is fixed the moment you choose a model and a precision — the weights, measured as parameters times bytes per parameter. The second grows with every token you feed it — the key-value (KV) cache. Most guides stop at the first number. The second is the one that decides whether your context window is 8K or 128K.

That gap is why a 70-billion-parameter model that loads in 40 GB can still run out of memory: at 128K tokens of context, its KV cache alone needs roughly 42 GB — as much as the weights. Invert the usual framing and the real question becomes obvious. Once the model fits, the VRAM that’s left over is your context budget. Nothing else.

This guide gives you both formulas, a recomputed sizing table for the model sizes you’ll actually run, and a worked example — a 35B model at 4-bit on a 96 GB card — that turns leftover VRAM into a concrete token count. It also explains why the cloud can advertise a million tokens of context that no single GPU you can buy will hold. The stakes are simple: pick the wrong number and you either overpay for hardware or hit an out-of-memory wall in production. For the wider backdrop on why context lengths exploded in the first place, see our guide to the 10M-token era.

Key takeaways

01 · Weight VRAM is just parameters × bytes-per-parameter. FP16 is 2 bytes, INT8/FP8 ~1 byte, 4-bit ~0.5 bytes. A 70B model is 140 GB at FP16 and about 35 GB at 4-bit, or roughly 40 GB once you add ~15% overhead for activations and framework buffers.
02 · The KV cache scales linearly with context length. KV bytes = 2 × layers × KV-heads × head-dim × tokens × bytes-per-element. For a Llama-class 70B that is ~42 GB at 128K tokens and ~327 GB at 1M; the cache, not the weights, is what blows your budget.
03 · Leftover VRAM is your context budget. A 35B at 4-bit loads in ~20 GB, leaving ~76 GB on a 96 GB card. At ~0.26 MB per token that headroom buys roughly 290K tokens of context, or ~580K if you quantize the KV cache to FP8.
04 · For a 70B, a single card tops out near 128K; the cloud’s 1M is sharded. A 70B at 1M tokens needs ~327 GB of KV cache, 3.4× a 96 GB card. Cloud 1M context works by distributing that cache across many GPUs with ring attention, context parallelism, and aggressive KV quantization.
05 · Capacity is not speed: decode is bandwidth-bound. Token generation is limited by memory bandwidth, not capacity. The DGX Spark fits big models in 128 GB but its 273 GB/s makes a dense 70B slow; a 96 GB RTX PRO 6000 at 1,792 GB/s is ~6.6× faster per token.

01 — The Core Equation: Two numbers decide everything.

Every VRAM question collapses into the same pair. One number is fixed and one grows. Confusing them is why so much hardware advice is wrong: a card that comfortably loads a model can still be useless for the context length you actually need. Get both into a single spreadsheet row and the rest of this guide is just arithmetic.

Number 1 · Fixed

Model weights

params × bytes/param

Set the instant you pick a model and a precision. A 70B at FP16 is 140 GB; at 4-bit it is about 35 GB of raw weights. This number does not move while the model runs — it is a one-time footprint you load once.

One-time footprint

Number 2 · Grows

The KV cache

2 × L × H_kv × d × T × bytes

Grows linearly with every token of context and every concurrent request. At 128K tokens a Llama-class 70B’s cache (~42 GB) rivals its own 4-bit weights. This is the number that quietly decides your context ceiling.

Scales with context

The mental model: weights are the cost of admission; the KV cache is the metered tab that runs while you’re inside. You pay the first once and the second per token. Everything that follows — quantization, GPU choice, the gap between local and cloud context lengths — is a consequence of how these two numbers interact inside a fixed VRAM envelope.
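Written out as a sketch, the two numbers and the envelope they share look like this. The symbols $P$, $b_w$, $b_{kv}$, and $M_{\text{VRAM}}$ are labels chosen here for illustration, not notation from any framework, and the inequality leaves out the 10–20% runtime overhead that a real deployment adds on top.

```latex
\begin{aligned}
M_{\text{weights}} &= P \cdot b_w \\
M_{\text{KV}}(T) &= 2 \cdot L \cdot H_{kv} \cdot d \cdot T \cdot b_{kv} \\
M_{\text{weights}} + M_{\text{KV}}(T) &\le M_{\text{VRAM}}
\quad\Longrightarrow\quad
T_{\max} = \frac{M_{\text{VRAM}} - M_{\text{weights}}}{2 \cdot L \cdot H_{kv} \cdot d \cdot b_{kv}}
\end{aligned}
```

Here $P$ is the parameter count, $b_w$ the bytes per parameter, $L$ the layer count, $H_{kv}$ the number of key/value heads, $d$ the head dimension, $T$ the tokens in context, and $b_{kv}$ the bytes per cache element.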

02 — Weight Math: Weight VRAM = params × bytes per param.

The first number is the easy one. Multiply the parameter count by the bytes each parameter occupies at your chosen precision: 4 bytes at FP32, 2 at FP16/BF16, roughly 1 at FP8 or INT8, and about 0.5 at 4-bit. A 70B model is therefore 280 GB at FP32, 140 GB at FP16, 70 GB at INT8, and around 35 GB at 4-bit. Then add 10–20% on top for activation tensors, framework buffers, the CUDA context, and a small KV cache at default context length — a 1.15× multiplier is a safe rule of thumb.

Approximate weight VRAM for common model sizes at FP16/BF16, FP8/INT8, and 4-bit, plus the loaded 4-bit footprint (raw bytes × 1.15 overhead) and the smallest GPU that holds it. Computed as parameters × bytes per parameter.
Model	FP16 / BF16	FP8 / INT8	4-bit (Q4)	Loaded @ 4-bit	Smallest card
7B	14 GB	7 GB	3.5 GB	~4 GB	8 GB+
13B	26 GB	13 GB	6.5 GB	~7.5 GB	12 GB+
35B	70 GB	35 GB	17.5 GB	~20 GB	24 GB (RTX 4090)
70B	140 GB	70 GB	35 GB	~40 GB	48 GB+ (not a 32 GB 5090)
120B	240 GB	120 GB	60 GB	~69 GB	96 GB (RTX PRO 6000)

The “Loaded @ 4-bit” column is the number that matters for planning: raw 4-bit weights times 1.15. Note the 70B row — about 40 GB loaded. That is why a 32 GB RTX 5090 cannot hold a 4-bit 70B on a single card no matter how you slice it, while a 96 GB card swallows the weights and still leaves room for a large context. Real 4-bit GGUF formats (Q4_K_M and friends) run slightly above a clean 0.5 bytes per weight, so treat these as floors, not exact reservations. For the quality-versus-size trade behind each precision, see our 4-bit vs 8-bit vs FP8 tradeoff data.
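The table's arithmetic can be reproduced with a few lines of Python. This is a minimal sketch: the function names `weight_gb` and `loaded_gb` are made up for illustration, the byte counts are the clean values from the text (real 4-bit formats run slightly higher), and 1.15 is the rule-of-thumb multiplier, not a measured figure.

```python
# Bytes per parameter at each precision, as listed in the text.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}

# Rule-of-thumb overhead multiplier from the text (10-20%, 1.15 by default).
OVERHEAD = 1.15


def weight_gb(params_billions: float, precision: str) -> float:
    """Raw weight size in decimal GB: parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


def loaded_gb(params_billions: float, precision: str = "q4",
              overhead: float = OVERHEAD) -> float:
    """Raw weights scaled by the overhead multiplier."""
    return weight_gb(params_billions, precision) * overhead


if __name__ == "__main__":
    for size in (7, 13, 35, 70, 120):
        print(f"{size:>4}B  "
              f"fp16={weight_gb(size, 'fp16'):6.1f} GB  "
              f"int8={weight_gb(size, 'int8'):6.1f} GB  "
              f"q4={weight_gb(size, 'q4'):5.1f} GB  "
              f"loaded@q4={loaded_gb(size):5.1f} GB")
```

Because parameters are given in billions and sizes in decimal GB, the two factors of 10^9 cancel and the multiplication stays readable.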

03 — KV Cache: The formula that grows with context.

During generation, the model caches the key and value tensors for every token it has already seen so it doesn’t recompute them. That cache is the KV cache, and it grows one token at a time. The total size is governed by a single equation built entirely from the model’s architecture and the context length.

The canonical formula

NVIDIA’s inference-optimization guide states it plainly: size of KV cache per token (bytes) = 2 × num_layers × (num_heads × dim_head) × precision_in_bytes. The leading 2 is one key tensor plus one value tensor, and for grouped-query models num_heads means the number of key/value heads, not query heads. Multiply by the number of tokens in context for the total. The layer count, head count, and head dimension are fixed by the architecture, which leaves context length and cache precision as the only terms you can change at deployment.

Plug in a Llama 3.1 70B (80 layers, 8 KV heads, 128-dim heads, BF16): 2 × 80 × 8 × 128 × 2 = 327,680 bytes per token, about 0.33 MB. At 128,000 tokens that is ~42 GB; at a million tokens it is ~327 GB. The table below runs the same arithmetic for the models and context lengths you’re most likely to hit, including what happens when you quantize the cache itself.

KV-cache VRAM by context length for Llama 3.1 8B and 70B at BF16, plus the 70B with an FP8 and an INT4 quantized cache. Computed as 2 × layers × KV-heads × head-dim × tokens × bytes-per-element, in decimal GB.
Context	8B · BF16	70B · BF16	70B · FP8 KV	70B · INT4 KV
4K	0.52 GB	1.31 GB	0.66 GB	0.33 GB
16K	2.10 GB	5.24 GB	2.62 GB	1.31 GB
32K	4.19 GB	10.49 GB	5.24 GB	2.62 GB
64K	8.39 GB	20.97 GB	10.49 GB	5.24 GB
128K	16.78 GB	41.94 GB	20.97 GB	10.49 GB
256K	33.55 GB	83.89 GB	41.94 GB	20.97 GB
512K	67.11 GB	167.77 GB	83.89 GB	41.94 GB
1M	131.07 GB	327.68 GB	163.84 GB	81.92 GB
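The table can be regenerated from the formula alone. The sketch below assumes "K" means 1,000 tokens (which is what reproduces the published cells) and uses hypothetical helper names. The 70B architecture figures come from the text; the 8B figures (32 layers, 8 KV heads, 128-dim heads) are not stated in the text and are assumed here because they match the 8B column.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """2 (one key + one value) x layers x KV heads x head dim x bytes."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(tokens: int, per_token_bytes: float) -> float:
    """Cache size for a single request, in decimal GB."""
    return tokens * per_token_bytes / 1e9


# (layers, KV heads, head dim, bytes per element) for each table column.
# The 70B figures are from the text; the 8B figures are an assumption.
COLUMNS = {
    "8B BF16": (32, 8, 128, 2.0),
    "70B BF16": (80, 8, 128, 2.0),
    "70B FP8 KV": (80, 8, 128, 1.0),
    "70B INT4 KV": (80, 8, 128, 0.5),
}

# "K" is taken as 1,000 tokens, which is what reproduces the table.
CONTEXTS = {"4K": 4_000, "16K": 16_000, "32K": 32_000, "64K": 64_000,
            "128K": 128_000, "256K": 256_000, "512K": 512_000,
            "1M": 1_000_000}

if __name__ == "__main__":
    for label, tokens in CONTEXTS.items():
        cells = [f"{kv_cache_gb(tokens, kv_bytes_per_token(*spec)):7.2f} GB"
                 for spec in COLUMNS.values()]
        print(f"{label:>5}  " + "  ".join(cells))
```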

Why the numbers are survivable

Modern open models use grouped-query attention (GQA): Llama 3.1 70B has 64 query heads but only 8 key/value heads, an 8× cut to the cache. Without GQA, that 128K cache would be roughly 336 GB instead of ~42 GB. DeepSeek goes further with multi-head latent attention (MLA), reported at about 70 KB per token versus 192–328 KB for GQA models — figures computed from published architecture specs, not independently reproduced here.
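To see where the 8× saving comes from, compare the same 70B layout with one key/value head per query head against the grouped layout. This is an illustrative calculation only: the "no GQA" case is hypothetical, and the variable names are chosen here.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: float = 2.0) -> float:
    """Per-token KV cache in bytes (BF16 by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


LAYERS, HEAD_DIM, TOKENS = 80, 128, 128_000
QUERY_HEADS, KV_HEADS = 64, 8  # Llama 3.1 70B head counts from the text

# Hypothetical layout without GQA: every query head keeps its own K and V.
mha_gb = kv_bytes_per_token(LAYERS, QUERY_HEADS, HEAD_DIM) * TOKENS / 1e9
# Grouped-query layout: 8 query heads share each K/V head.
gqa_gb = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM) * TOKENS / 1e9

print(f"without GQA: {mha_gb:.1f} GB")
print(f"with GQA:    {gqa_gb:.1f} GB")
print(f"reduction:   {mha_gb / gqa_gb:.0f}x")
```

The reduction factor is simply the ratio of query heads to key/value heads, since every other term in the formula is unchanged.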

Two cautions before you build a budget on this table. First, the formula counts a single request; multiply by batch size for concurrent users. Four simultaneous 128K requests on a 70B is 4 × 42 = 168 GB of cache alone, which is why throughput planning is a different problem from single-session context. Second, the cache is also the biggest lever you have: dropping it to an FP8 element halves every cell, and an INT4 cache quarters it. The full toolbox — paged attention, prefix caching, quantized caches — lives in our KV-cache optimization techniques guide.
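Both cautions are multipliers on the same formula, so they fit in one small function. A rough sketch, assuming every concurrent request holds a full-length context and using the 70B layout as the default; `total_cache_gb` is a name invented for this example.

```python
def total_cache_gb(tokens: int, concurrent_requests: int = 1,
                   bytes_per_elem: float = 2.0, layers: int = 80,
                   kv_heads: int = 8, head_dim: int = 128) -> float:
    """KV cache for several same-length requests, in decimal GB."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens * concurrent_requests / 1e9


if __name__ == "__main__":
    # Four simultaneous 128K requests at three cache precisions.
    for name, width in (("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)):
        gb = total_cache_gb(128_000, concurrent_requests=4,
                            bytes_per_elem=width)
        print(f"4 x 128K, {name} cache: {gb:.1f} GB")
```

In practice requests rarely all sit at the maximum length, so this is an upper bound for a given batch size rather than a typical figure.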

04 — Worked Example: Leftover VRAM is your context budget.

Here is the calculation almost no guide does in full: take a real card, load a real model, and ask how much context the remaining VRAM actually buys. Work a 35B model at 4-bit on a 96 GB RTX PRO 6000 in three steps and the answer falls out cleanly.

Step 1 · Weights

35B at 4-bit, loaded

20 GB

Raw 4-bit weights are 35B × 0.5 bytes = 17.5 GB. Add ~15% for activations, framework buffers, and CUDA context and you load around 20 GB — versus 70 GB for the same model at FP16.

vs 70 GB at FP16

Step 2 · Headroom

What is left on a 96 GB card

76 GB

96 GB total minus ~20 GB of loaded weights leaves 76 GB. That leftover — not the model size — is the number that determines how much context you can actually use.

96 − 20 = 76

Step 3 · Context

Tokens the headroom buys

~290K

At ~0.26 MB per token (a representative 32B-class GQA layout, used here as a stand-in for the 35B: 64 layers, 8 KV heads, 128-dim heads, BF16), 76 GB ÷ 0.26 MB ≈ 290K tokens of BF16 KV, or roughly 580K if you quantize the cache to FP8.

76 GB ÷ 0.26 MB/tok

That is the whole insight in one line: a smaller model doesn’t just cost less VRAM, it frees enormous context headroom. The same card that gives a 35B model nearly 290K theoretical tokens gives a 70B far less, because the 70B both weighs more and burns cache faster. Swap the 35B for a 70B at 4-bit (~40 GB loaded) and the headroom drops to about 56 GB — enough for the model’s native 128K window (~42 GB of cache) with a thin margin for activations, and nowhere near the 256K tier. Treat these as theoretical ceilings: a real framework reserves activation and scratch memory too, so usable context lands somewhat below the cache-only math.
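The three steps collapse into one function. The sketch below is cache-only math under the same assumptions as the worked example (loaded weights rounded to 20 GB and 40 GB, BF16 cache, no extra activation or scratch reservation); `context_budget_tokens` is a hypothetical helper, not part of any framework.

```python
def context_budget_tokens(card_gb: float, loaded_weights_gb: float,
                          kv_bytes_per_token: float) -> int:
    """Tokens of KV cache that fit in the VRAM left after loading weights."""
    headroom_bytes = (card_gb - loaded_weights_gb) * 1e9
    if headroom_bytes <= 0:
        return 0  # the model itself does not fit
    return int(headroom_bytes // kv_bytes_per_token)


# Per-token cache sizes from the formula (BF16, 2 bytes per element).
KV_35B = 2 * 64 * 8 * 128 * 2  # representative 32B-class layout
KV_70B = 2 * 80 * 8 * 128 * 2  # Llama 3.1 70B layout

if __name__ == "__main__":
    print("35B, BF16 cache:", context_budget_tokens(96, 20, KV_35B))
    print("35B, FP8 cache: ", context_budget_tokens(96, 20, KV_35B / 2))
    print("70B, BF16 cache:", context_budget_tokens(96, 40, KV_70B))
```

Swapping in your own card size, loaded weight figure, and per-token cache size gives a theoretical ceiling; the usable number will be somewhat lower once the framework's own reservations are counted.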

05 — The 1M Paradox: Why one card stops near 128K while the cloud sells 1M.

Run the budget for a 70B at 4-bit on a 96 GB card and the ceiling is stark. Weights take ~40 GB. A 128K cache adds ~42 GB, for ~82 GB total — it fits, with about 14 GB left for activations and the CUDA context. Push to 256K and the cache alone is ~84 GB; weights plus cache is ~124 GB, over budget. The card tops out near 128K, which also happens to be the model’s trained window. Both limits converge on the same answer.

A 70B on one 96 GB card: where the budget runs out. Calculated with 4-bit weights and a BF16 KV cache.
Item	Footprint	Status
96 GB card, total capacity (RTX PRO 6000 Blackwell)	96 GB	-
70B weights, 4-bit, loaded (~35 GB raw + overhead)	40 GB	-
+ KV cache @ 128K (cumulative footprint)	82 GB	Fits
+ KV cache @ 256K (cumulative footprint)	124 GB	Over budget
KV cache @ 1M (alone, 3.4× a single card)	327 GB	Needs many GPUs

So how does an API advertise a million tokens? Not by fitting it on one card. That same 70B at 1M needs ~327 GB of KV cache — 3.4× a 96 GB card. Cloud providers reach it by sharding the cache across many GPUs and overlapping the work: context parallelism splits the sequence across devices, ring attention passes key/value blocks around a GPU ring so each device sees the whole context, and cache-offloading plus NVFP4 quantization shrink what has to stay resident. The gap between “what fits on my GPU” and “what the API offers” is architecture and orchestration, not just raw VRAM. We dig into the provider-by-provider numbers in our 1M-to-10M context window comparison.
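A quick way to see the scale of the sharding problem is to count how many 96 GB cards the combined footprint would need if it were split evenly. This is a lower bound under loud assumptions: it ignores activation memory, communication buffers, and the way a real parallelism scheme replicates or partitions the weights. `min_cards` is a name invented here.

```python
import math


def min_cards(weights_gb: float, cache_gb: float,
              card_gb: float = 96.0) -> int:
    """Lower bound on card count if weights and cache split evenly."""
    return math.ceil((weights_gb + cache_gb) / card_gb)


WEIGHTS_GB = 40.0  # 70B at 4-bit, loaded (from the text)

# KV cache sizes for the 70B, taken from the table in section 03.
CACHE_GB = {
    "128K, BF16 cache": 41.94,
    "256K, BF16 cache": 83.89,
    "1M, BF16 cache": 327.68,
    "1M, FP8 cache": 163.84,
    "1M, INT4 cache": 81.92,
}

if __name__ == "__main__":
    for label, cache in CACHE_GB.items():
        print(f"{label:>17}: at least {min_cards(WEIGHTS_GB, cache)} card(s)")
```

The count drops as the cache is quantized, which is why aggressive KV quantization sits alongside sharding in the cloud providers' toolbox.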

06 — Capacity vs Speed: Fitting a model and running it fast are different problems.

VRAM capacity tells you whether a model loads. It says nothing about how fast it generates. Inference has two phases with different bottlenecks, and conflating them is the most common buying mistake.

"Prefill is usually compute-bound, meaning it's limited by how fast the GPU can do math. Decode is usually memory-bandwidth-bound, meaning it's limited by how fast the GPU can move data around." Jim Allen Wallace · Redis Engineering Blog

Decode — generating one token at a time — is where you spend most of your runtime, and it is bound by memory bandwidth, not compute. A useful ceiling: theoretical tokens per second is roughly memory bandwidth divided by the bytes the model must read per token, and real-world numbers land at 60–80% of that. Double the bandwidth and you roughly double the speed. Capacity gets the model in the door; bandwidth decides how fast it talks.

Memory bandwidth sets the decode-speed ceiling (vendor specifications).
Device	Memory	Bandwidth
RTX PRO 6000 (96 GB)	GDDR7	1,792 GB/s
RTX 5090 (32 GB)	GDDR7	1,792 GB/s
Apple M5 Max (128 GB)	unified memory	614 GB/s
DGX Spark (128 GB)	LPDDR5x unified	273 GB/s
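The bandwidth ceiling described above is a single division. The sketch below assumes a dense model reads all of its weights once per generated token, so the bytes read per token are taken as the raw 4-bit weight size (35 GB for a 70B); KV cache reads are ignored. The function names are illustrative, and the results are ceilings, not benchmarks.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         gb_read_per_token: float) -> float:
    """Theoretical tokens per second: bandwidth / bytes read per token."""
    return bandwidth_gb_s / gb_read_per_token


def realistic_range(ceiling: float, low: float = 0.6,
                    high: float = 0.8) -> tuple[float, float]:
    """Apply the 60-80% real-world band quoted in the text."""
    return ceiling * low, ceiling * high


# Assumption: a dense 70B at 4-bit reads ~35 GB of weights per token.
WEIGHTS_70B_Q4_GB = 35.0

if __name__ == "__main__":
    for name, bandwidth in (("RTX PRO 6000", 1792.0), ("DGX Spark", 273.0)):
        ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_70B_Q4_GB)
        lo, hi = realistic_range(ceiling)
        print(f"{name:>12}: ceiling {ceiling:.1f} tok/s, "
              f"expect roughly {lo:.1f}-{hi:.1f} tok/s")
```

Mixture-of-experts models read only their active parameters per token, so this estimate would overstate their cost; it applies to dense models only.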

Capacity is not speed

The DGX Spark fits a 70B in its 128 GB of unified memory, something a 32 GB RTX 5090 cannot do, but it moves data at only 273 GB/s, roughly 6.6× slower than a discrete GDDR7 card’s 1,792 GB/s. Because decode is bandwidth-bound, a dense 70B at 4-bit generates only single-digit tokens per second on it; independent reports land near 3–7 tok/s, and NVIDIA’s own published benchmarks stop at 8B–14B models. Higher figures you’ll see quoted tend to be mixture-of-experts models with few active parameters, smaller models, or NVFP4/TensorRT-LLM-tuned stacks, not a dense 70B. Treat the Spark, with its ~140 W typical draw (240 W-rated PSU), as a way to load big models, not to run them fast.

The same bandwidth lens explains the rest of the field. A 96 GB RTX PRO 6000 has been independently measured at around 31–32 tok/s on a 70B — the practical reference for a single-card 70B. An Apple M5 Max (614 GB/s, up to 128 GB) lands lower; community benchmarks put a 70B in the ~15–32 tok/s range depending on framework and quantization, so treat any single figure as an estimate. And the fast-but-small RTX 5090, despite matching the PRO 6000’s 1,792 GB/s, can’t hold a 70B at all — point it at models up to ~30B, where its bandwidth implies roughly 60–90 tok/s (an estimate derived from memory bandwidth, not a measured benchmark). The tokens-per-second figures here are approximate and stack-dependent — measured for the PRO 6000, community estimates for the M5 Max, and bandwidth-derived for the 5090.

07 — Hardware Shortlist: Pick the card for the job, not the spec sheet.

With the weight math, the cache math, and the bandwidth lens in hand, the hardware choice gets concrete. Match the card to whether your bottleneck is fitting a model, generating fast, or both — and budget for a hardware market that is anything but stable right now.

Max capacity · workstation

RTX PRO 6000 Blackwell

96 GB GDDR7, 1,792 GB/s, 600 W rated TDP. The only single workstation card that fits a 4-bit 70B with room for a 128K context. Launched at an ~$8,565 MSRP; mid-2026 listings have surged to roughly $12,000–$14,500 amid the memory shortage.

Pick for a local 70B

Fast · consumer

RTX 5090

32 GB GDDR7 at the same 1,792 GB/s — fastest decode in this list, but 32 GB cannot hold a ~40 GB 4-bit 70B. Keep it to models up to ~30B (≈60–90 tok/s, estimated from bandwidth); step up to two cards or a 96 GB card for 70B.

Pick for ≤30B speed

Big-model capacity · compact

NVIDIA DGX Spark

128 GB LPDDR5x unified, 273 GB/s, ~140 W typical draw (240 W-rated PSU), ~$3,999. Loads models a 5090 can’t, but the low bandwidth means modest, stack-dependent decode speed. A capacity play, not a speed play.

Pick to fit, not to race

Unified memory · Mac

Apple M5 Max

Up to 128 GB unified, 614 GB/s, announced March 2026. Community estimates put a 70B around 15–32 tok/s — verify per framework. Note the Mac Studio M3 Ultra now tops out at 96 GB after its larger configs were pulled in 2026.

Pick for a quiet desktop

One shortage, two price stories

The same 2026 DRAM and GDDR7 shortage shows up on both sides of this market. Apple pulled the 512 GB Mac Studio M3 Ultra in March 2026 and the 256 GB config in May, leaving 96 GB as the largest M3 Ultra you can buy. The same squeeze pushed the RTX PRO 6000 from its ~$8,565 launch MSRP toward roughly $12,000–$14,500 in mid-2026 listings (NVIDIA’s own marketplace around $13,250). Budget for the hardware market, not just the model.

08 — Quantization: How low can you quantize?

Quantization is the dial that makes all of this affordable — fewer bytes per parameter shrinks both the weights and, when applied to the cache, the KV footprint. The trade is quality, but on modern formats it is smaller than most people expect until you push past 4-bit.

Lossless

FP16 / BF16

2 bytes per parameter — the reference quality every quantized tier is measured against. Use it when VRAM is plentiful or quality is genuinely non-negotiable.

Reference quality

Near-free

FP8

~1 byte per parameter, under ~0.4 perplexity points of degradation versus BF16, and benchmarks suggest roughly 1.4–1.8× throughput on Hopper/Blackwell at large batch. The practical GPU sweet spot.

Best on new GPUs

Safe 2×

INT8 / Q8

~1 byte per parameter with about half a percent of measured degradation. A conservative choice when you want the memory saving without thinking hard about quality.

Conservative saving

Local sweet spot

Q4_K_M

Roughly 4.5 effective bits per weight, about 0.56 bytes, and only ~0.05 perplexity points over FP16 on a 7B, inside run-to-run noise. The default for local inference on consumer and Apple hardware.

Local default

The quality numbers above are directional — they vary by model, benchmark, and the specific quantization implementation — but the shape is consistent: FP8 and INT8 are close to free, Q4_K_M is the local default that almost nobody can distinguish blind, and only at 2-bit does degradation become obvious. The same logic applies to the cache: an FP8 KV halves every cell in the table above and an INT4 KV quarters it, often with negligible quality cost on long-context retrieval. For the cross-model regression data behind each tier, our quantization tradeoffs breakdown has the benchmarks. And if you’re weighing local inference against a managed deployment for a production workload, that build-versus-buy call is exactly what our AI transformation engagements are built to answer.
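Effective bits per weight convert to a footprint with one division by 8. The sketch below uses the ~4.5 bits quoted for Q4_K_M above as an approximate input, not a format specification, and `model_size_gb` is a hypothetical helper. It shows why a clean 0.5 bytes per weight is a floor rather than an exact reservation.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Raw weight size in decimal GB from effective bits per weight."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # Effective bit widths: 16 and 8 are exact; 4.5 is the approximate
    # Q4_K_M figure from the text; 4.0 is the idealized "clean" 4-bit case.
    for label, bits in (("FP16/BF16", 16.0), ("FP8/INT8", 8.0),
                        ("Q4_K_M (~4.5 bits)", 4.5), ("clean 4-bit", 4.0)):
        print(f"70B at {label:<19}: {model_size_gb(70, bits):6.1f} GB")
```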

09 — Conclusion: Two numbers, one spreadsheet row.

The shape of local inference, June 2026

Once the model fits, the VRAM that’s left over is your real context budget.

The whole question of “how much VRAM” reduces to two figures you can put in one row. Weights are parameters times bytes per parameter, fixed by your model and precision. The KV cache is 2 × layers × KV-heads × head-dim × tokens × bytes, growing linearly with context. Subtract the loaded weights from your card’s capacity and divide what’s left by the per-token cache size — that quotient, not the model size, is the context you can actually run.

That reframing is what the hardware tiers are really about. A 35B at 4-bit on a 96 GB card has room for hundreds of thousands of tokens; a 70B on the same card is capacity-bound near its 128K window; a million tokens needs more memory than any single card you can buy, which is why the cloud shards it across GPUs. And remember that fitting a model is not the same as running it fast: decode is bandwidth-bound, so a high-capacity, low-bandwidth box can load big models but generates from them slowly.

Looking ahead, the pressure is on the cache, not the weights. Architectures like MLA and ever-more-aggressive KV quantization are attacking the one term that scales with context, because that is the term standing between a single workstation and genuinely long local context. For now, do the arithmetic before you buy: it’s the difference between a card that serves your workload and one that stalls at the first long prompt. If a custom build is on the table, our engineering team can size the stack with you.

The VRAM math: weights, a KV cache, and your real context limit

