How much VRAM you need to run an LLM comes down to two numbers, not one. The first is fixed the moment you choose a model and a precision — the weights, measured as parameters times bytes per parameter. The second grows with every token you feed it — the key-value (KV) cache. Most guides stop at the first number. The second is the one that decides whether your context window is 8K or 128K.
That gap is why a 70-billion-parameter model that loads in 40 GB can still run out of memory: at 128K tokens of context, its KV cache alone needs roughly 42 GB — as much as the weights. Invert the usual framing and the real question becomes obvious. Once the model fits, the VRAM that’s left over is your context budget. Nothing else.
This guide gives you both formulas, a recomputed sizing table for the model sizes you’ll actually run, and a worked example — a 35B model at 4-bit on a 96 GB card — that turns leftover VRAM into a concrete token count. It also explains why the cloud can advertise a million tokens of context that no single GPU you can buy will hold. The stakes are simple: pick the wrong number and you either overpay for hardware or hit an out-of-memory wall in production. For the wider backdrop on why context lengths exploded in the first place, see our guide to the 10M-token era.
- 01 Weight VRAM is just parameters × bytes-per-parameter. FP16 is 2 bytes, INT8/FP8 ~1 byte, 4-bit ~0.5 bytes. A 70B model is 140 GB at FP16 and about 35 GB at 4-bit, or roughly 40 GB once you add ~15% overhead for activations and framework buffers.
- 02 The KV cache scales linearly with context length. KV bytes = 2 × layers × KV-heads × head-dim × tokens × bytes-per-element. For a Llama-class 70B that is ~42 GB at 128K tokens and ~327 GB at 1M; the cache, not the weights, is what blows your budget.
- 03 Leftover VRAM is your context budget. A 35B at 4-bit loads in ~20 GB, leaving ~76 GB on a 96 GB card. At ~0.26 MB per token that headroom buys roughly 290K tokens of context, or ~580K if you quantize the KV cache to FP8.
- 04 A single card tops out near 128K for a 70B; the cloud’s 1M is sharded. A 70B at 1M tokens needs ~327 GB of KV cache, 3.4× the capacity of a 96 GB card. Cloud 1M context works by distributing that cache across many GPUs with ring attention, context parallelism, and aggressive KV quantization.
- 05 Capacity is not speed; decode is bandwidth-bound. Token generation is limited by memory bandwidth, not capacity. The DGX Spark fits big models in 128 GB but its 273 GB/s makes a dense 70B slow; a 96 GB RTX PRO 6000 at 1,792 GB/s has ~6.6× the bandwidth, and with it a ~6.6× higher decode ceiling.
01 — The Core Equation: Two numbers decide everything.
Every VRAM question collapses into the same pair. One number is fixed and one grows. Confusing them is why so much hardware advice is wrong: a card that comfortably loads a model can still be useless for the context length you actually need. Get both into a single spreadsheet row and the rest of this guide is just arithmetic.
Model weights
Set the instant you pick a model and a precision. A 70B at FP16 is 140 GB; at 4-bit it is about 35 GB of raw weights. This number does not move while the model runs: it is a fixed footprint you load once.
The KV cache
Grows linearly with every token of context and every concurrent request. At 128K tokens a Llama-class 70B’s cache (~42 GB) rivals its own 4-bit weights. This is the number that quietly decides your context ceiling.
The mental model: weights are the cost of admission; the KV cache is the metered tab that runs while you’re inside. You pay the first once and the second per token. Everything that follows — quantization, GPU choice, the gap between local and cloud context lengths — is a consequence of how these two numbers interact inside a fixed VRAM envelope.
02 — Weight Math: Weight VRAM = params × bytes per param.
The first number is the easy one. Multiply the parameter count by the bytes each parameter occupies at your chosen precision: 4 bytes at FP32, 2 at FP16/BF16, roughly 1 at FP8 or INT8, and about 0.5 at 4-bit. A 70B model is therefore 280 GB at FP32, 140 GB at FP16, 70 GB at INT8, and around 35 GB at 4-bit. Then add 10–20% on top for activation tensors, framework buffers, the CUDA context, and a small KV cache at default context length — a 1.15× multiplier is a safe rule of thumb.
| Model | FP16 / BF16 | FP8 / INT8 | 4-bit (Q4) | Loaded @ 4-bit | Smallest card |
|---|---|---|---|---|---|
| 7B | 14 GB | 7 GB | 3.5 GB | ~4 GB | 8 GB+ |
| 13B | 26 GB | 13 GB | 6.5 GB | ~7.5 GB | 12 GB+ |
| 35B | 70 GB | 35 GB | 17.5 GB | ~20 GB | 24 GB (RTX 4090) |
| 70B | 140 GB | 70 GB | 35 GB | ~40 GB | 48 GB+ (not a 32 GB 5090) |
| 120B | 240 GB | 120 GB | 60 GB | ~69 GB | 96 GB (RTX PRO 6000) |
The “Loaded @ 4-bit” column is the number that matters for planning: raw 4-bit weights times 1.15. Note the 70B row — about 40 GB loaded. That is why a 32 GB RTX 5090 cannot hold a 4-bit 70B on a single card no matter how you slice it, while a 96 GB card swallows the weights and still leaves room for a large context. Real 4-bit GGUF formats (Q4_K_M and friends) run slightly above a clean 0.5 bytes per weight, so treat these as floors, not exact reservations. For the quality-versus-size trade behind each precision, see our 4-bit vs 8-bit vs FP8 tradeoff data.
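The weight arithmetic above can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the helper names `weight_gb` and `loaded_gb` are hypothetical, 1 GB is taken as 10^9 bytes, and the 4-bit tier uses the clean 0.5 bytes per parameter, so real GGUF files will come in somewhat higher.

```python
# Bytes per parameter for each precision tier used in the table above.
# "q4" is the idealized 0.5 bytes; real Q4 formats run slightly higher.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Raw weight footprint in GB (billions of params x bytes per param)."""
    return params_billion * BYTES_PER_PARAM[precision]


def loaded_gb(params_billion: float, precision: str, overhead: float = 1.15) -> float:
    """Raw weights times the ~1.15x rule-of-thumb overhead multiplier."""
    return weight_gb(params_billion, precision) * overhead


for size in (7, 13, 35, 70, 120):
    print(
        f"{size}B  fp16={weight_gb(size, 'fp16'):.1f} GB  "
        f"q4={weight_gb(size, 'q4'):.1f} GB  "
        f"loaded@q4={loaded_gb(size, 'q4'):.1f} GB"
    )
```

Changing the `overhead` argument to 1.10 or 1.20 shows how sensitive the "smallest card" column is to the 10–20% overhead range quoted above.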
03 — KV Cache: The formula that grows with context.
During generation, the model caches the key and value tensors for every token it has already seen so it doesn’t recompute them. That cache is the KV cache, and it grows one token at a time. The total size is governed by a single equation built entirely from the model’s architecture and the context length: KV bytes = 2 × layers × KV-heads × head-dim × tokens × bytes-per-element, where the leading 2 counts one key and one value tensor per layer.
Plug in a Llama 3.1 70B (80 layers, 8 KV heads, 128-dim heads, BF16): 2 × 80 × 8 × 128 × 2 = 327,680 bytes per token, about 0.33 MB. At 128,000 tokens that is ~42 GB; at a million tokens it is ~327 GB. The table below runs the same arithmetic for the models and context lengths you’re most likely to hit, including what happens when you quantize the cache itself.
| Context | 8B · BF16 | 70B · BF16 | 70B · FP8 KV | 70B · INT4 KV |
|---|---|---|---|---|
| 4K | 0.52 GB | 1.31 GB | 0.66 GB | 0.33 GB |
| 16K | 2.10 GB | 5.24 GB | 2.62 GB | 1.31 GB |
| 32K | 4.19 GB | 10.49 GB | 5.24 GB | 2.62 GB |
| 64K | 8.39 GB | 20.97 GB | 10.49 GB | 5.24 GB |
| 128K | 16.78 GB | 41.94 GB | 20.97 GB | 10.49 GB |
| 256K | 33.55 GB | 83.89 GB | 41.94 GB | 20.97 GB |
| 512K | 67.11 GB | 167.77 GB | 83.89 GB | 41.94 GB |
| 1M | 131.07 GB | 327.68 GB | 163.84 GB | 81.92 GB |
Two cautions before you build a budget on this table. First, the formula counts a single request; multiply by batch size for concurrent users. Four simultaneous 128K requests on a 70B is 4 × 42 = 168 GB of cache alone, which is why throughput planning is a different problem from single-session context. Second, the cache is also the biggest lever you have: dropping it to an FP8 element halves every cell, and an INT4 cache quarters it. The full toolbox — paged attention, prefix caching, quantized caches — lives in our KV-cache optimization techniques guide.
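As a rough sketch of the same arithmetic, the snippet below computes the per-token cache size and scales it by context length, batch size, and cache precision. The function names are hypothetical, the Llama 3.1 70B shape is the one quoted above, and "K" means 1,000 tokens and 1 GB means 10^9 bytes, which is the convention the table appears to use.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: float = 2.0) -> float:
    """2 (one key + one value) x layers x KV heads x head dim x bytes per element."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


def kv_cache_gb(tokens: int, batch: int = 1, **arch) -> float:
    """Total cache in GB for `batch` concurrent requests of `tokens` each."""
    return kv_bytes_per_token(**arch) * tokens * batch / 1e9


# Architecture figures quoted in the text for Llama 3.1 70B.
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)

print(kv_bytes_per_token(**LLAMA_70B))                           # bytes per token, BF16
print(kv_cache_gb(128_000, **LLAMA_70B))                         # one 128K request, BF16
print(kv_cache_gb(128_000, bytes_per_element=1.0, **LLAMA_70B))  # FP8 cache
print(kv_cache_gb(128_000, bytes_per_element=0.5, **LLAMA_70B))  # INT4 cache
print(kv_cache_gb(128_000, batch=4, **LLAMA_70B))                # four concurrent requests
```

Keeping `batch` as an explicit argument makes the throughput caveat visible: the per-request number and the serving number differ by exactly that factor.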
04 — Worked Example: Leftover VRAM is your context budget.
Here is the calculation almost no guide does in full: take a real card, load a real model, and ask how much context the remaining VRAM actually buys. Work a 35B model at 4-bit on a 96 GB RTX PRO 6000 in three steps and the answer falls out cleanly.
35B at 4-bit, loaded
Raw 4-bit weights are 35B × 0.5 bytes = 17.5 GB. Add ~15% for activations, framework buffers, and CUDA context and you load around 20 GB — versus 70 GB for the same model at FP16.
What is left on a 96 GB card
96 GB total minus ~20 GB of loaded weights leaves 76 GB. That leftover — not the model size — is the number that determines how much context you can actually use.
Tokens the headroom buys
At ~0.26 MB per token (assuming a representative 32B-class GQA layout as a stand-in for the 35B: 64 layers, 8 KV heads, 128-dim heads, BF16 cache), 76 GB ÷ 0.26 MB ≈ 290K tokens of BF16 KV, or roughly 580K if you quantize the cache to FP8.
That is the whole insight in one line: a smaller model doesn’t just cost less VRAM, it frees enormous context headroom. The same card that gives a 35B model nearly 290K theoretical tokens gives a 70B far less, because the 70B both weighs more and burns cache faster. Swap the 35B for a 70B at 4-bit (~40 GB loaded) and the headroom drops to about 56 GB — enough for the model’s native 128K window (~42 GB of cache) with a thin margin for activations, and nowhere near the 256K tier. Treat these as theoretical ceilings: a real framework reserves activation and scratch memory too, so usable context lands somewhat below the cache-only math.
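The three steps of the worked example reduce to one subtraction and one division. The sketch below is illustrative only: `context_budget_tokens` is a hypothetical helper, the loaded-weight figures are the rounded ~20 GB and ~40 GB from the text, and the result is a cache-only ceiling that ignores activation and scratch memory.

```python
def context_budget_tokens(card_gb: float, loaded_weights_gb: float,
                          kv_bytes_per_token: float) -> int:
    """Tokens of KV cache that fit in the VRAM left after loading the weights."""
    headroom_bytes = (card_gb - loaded_weights_gb) * 1e9
    return max(0, int(headroom_bytes // kv_bytes_per_token))


# Assumed 32B-class GQA layout from the text: 64 layers, 8 KV heads, 128-dim heads.
KV_35B_BF16 = 2 * 64 * 8 * 128 * 2   # 262,144 bytes per token
KV_35B_FP8 = 2 * 64 * 8 * 128 * 1    # FP8 cache halves the per-token cost
KV_70B_BF16 = 2 * 80 * 8 * 128 * 2   # Llama-class 70B, 327,680 bytes per token

print(context_budget_tokens(96, 20, KV_35B_BF16))  # 35B, BF16 cache
print(context_budget_tokens(96, 20, KV_35B_FP8))   # 35B, FP8 cache
print(context_budget_tokens(96, 40, KV_70B_BF16))  # 70B, BF16 cache
```

The 70B result lands between the 128K and 256K tiers, which matches the claim that the card covers the native window but not the next tier up.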
05 — The 1M Paradox: Why one card stops near 128K while the cloud sells 1M.
Run the budget for a 70B at 4-bit on a 96 GB card and the ceiling is stark. Weights take ~40 GB. A 128K cache adds ~42 GB, for ~82 GB total — it fits, with about 14 GB left for activations and the CUDA context. Push to 256K and the cache alone is ~84 GB; weights plus cache is ~124 GB, over budget. The card tops out near 128K, which also happens to be the model’s trained window. Both limits converge on the same answer.
Figure: A 70B on one 96 GB card · where the budget runs out (calculated · 4-bit weights, BF16 KV cache)
So how does an API advertise a million tokens? Not by fitting it on one card. That same 70B at 1M needs ~327 GB of KV cache, 3.4× the capacity of a 96 GB card. Cloud providers reach it by sharding the cache across many GPUs and overlapping the work: context parallelism splits the sequence across devices, ring attention passes key/value blocks around a GPU ring so each device sees the whole context, and cache-offloading plus NVFP4 quantization shrink what has to stay resident. The gap between “what fits on my GPU” and “what the API offers” is architecture and orchestration, not just raw VRAM. We dig into the provider-by-provider numbers in our 1M-to-10M context window comparison.
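The single-card budget for a 70B can be checked tier by tier with a short script. This is a capacity-only sketch under the section's own assumptions (about 40 GB of loaded 4-bit weights, a BF16 cache at 327,680 bytes per token, a 96 GB card); `budget` and `min_cards_for_cache` are hypothetical helpers, and the card count is a lower bound that says nothing about how a real serving stack shards weights or overlaps communication.

```python
import math

CARD_GB = 96.0
LOADED_WEIGHTS_GB = 40.0                    # 70B at 4-bit with ~15% overhead
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2   # 327,680 bytes, BF16 cache


def budget(tokens: int):
    """Return (weights + cache in GB, whether that total fits on one card)."""
    total = LOADED_WEIGHTS_GB + KV_BYTES_PER_TOKEN * tokens / 1e9
    return total, total <= CARD_GB


def min_cards_for_cache(tokens: int) -> int:
    """Capacity-only lower bound on 96 GB cards needed to hold the cache alone."""
    return math.ceil(KV_BYTES_PER_TOKEN * tokens / 1e9 / CARD_GB)


for tokens in (128_000, 256_000, 1_000_000):
    total, fits = budget(tokens)
    print(f"{tokens:>9,} tokens: {total:6.1f} GB total, fits on one card: {fits}")

print("96 GB cards for a 1M-token cache:", min_cards_for_cache(1_000_000))
```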
06 — Capacity vs Speed: Fitting a model and running it fast are different problems.
VRAM capacity tells you whether a model loads. It says nothing about how fast it generates. Inference has two phases with different bottlenecks, and conflating them is the most common buying mistake.
"Prefill is usually compute-bound, meaning it's limited by how fast the GPU can do math. Decode is usually memory-bandwidth-bound, meaning it's limited by how fast the GPU can move data around."
Jim Allen Wallace · Redis Engineering Blog
Decode — generating one token at a time — is where you spend most of your runtime, and it is bound by memory bandwidth, not compute. A useful ceiling: theoretical tokens per second is roughly memory bandwidth divided by the bytes the model must read per token, and real-world numbers land at 60–80% of that. Double the bandwidth and you roughly double the speed. Capacity gets the model in the door; bandwidth decides how fast it talks.
Figure: Memory bandwidth sets the decode-speed ceiling (vendor specifications)
The same bandwidth lens explains the rest of the field. A 96 GB RTX PRO 6000 has been independently measured at around 31–32 tok/s on a 70B, which makes it the practical reference for a single-card 70B. An Apple M5 Max (614 GB/s, up to 128 GB) lands lower; community benchmarks put a 70B in the ~15–32 tok/s range depending on framework and quantization, so treat any single figure as an estimate. And the fast-but-small RTX 5090, despite matching the PRO 6000’s 1,792 GB/s, can’t hold a 70B at all. Point it at models up to ~30B, where its bandwidth implies roughly 60–90 tok/s (an estimate derived from memory bandwidth, not a measured benchmark). The tokens-per-second figures here are approximate and stack-dependent: measured for the PRO 6000, community estimates for the M5 Max, and bandwidth-derived for the 5090.
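The bandwidth rule of thumb is easy to turn into a back-of-envelope estimator. The sketch below assumes the bytes read per generated token are roughly the loaded model footprint (about 40 GB for a 4-bit 70B, and 30B × 0.5 bytes × 1.15 for a 4-bit 30B) and ignores KV-cache reads, so it is an optimistic ceiling rather than a benchmark. `decode_ceiling_tok_s` is a hypothetical helper.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float,
                         efficiency: float = 1.0) -> float:
    """Bandwidth-bound decode estimate: GB/s divided by GB read per token."""
    return bandwidth_gb_s / gb_read_per_token * efficiency


# (label, memory bandwidth in GB/s, assumed GB read per generated token)
SETUPS = [
    ("RTX PRO 6000, 70B @ 4-bit", 1792.0, 40.0),
    ("DGX Spark, 70B @ 4-bit", 273.0, 40.0),
    ("RTX 5090, 30B @ 4-bit", 1792.0, 30 * 0.5 * 1.15),
]

for label, bandwidth, gb_per_token in SETUPS:
    ceiling = decode_ceiling_tok_s(bandwidth, gb_per_token)
    low, high = ceiling * 0.6, ceiling * 0.8   # the 60-80% real-world band
    print(f"{label}: ceiling {ceiling:.1f} tok/s, expect roughly {low:.0f}-{high:.0f}")
```

Under these assumptions the PRO 6000's 60–80% band brackets the measured 31–32 tok/s, and the 5090 estimate falls inside the 60–90 tok/s range quoted above.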
07 — Hardware Shortlist: Pick the card for the job, not the spec sheet.
With the weight math, the cache math, and the bandwidth lens in hand, the hardware choice gets concrete. Match the card to whether your bottleneck is fitting a model, generating fast, or both — and budget for a hardware market that is anything but stable right now.
RTX PRO 6000 Blackwell
96 GB GDDR7, 1,792 GB/s, 600 W rated TDP. The only single workstation card that fits a 4-bit 70B with room for a 128K context. Launched at an ~$8,565 MSRP; mid-2026 listings have surged to roughly $12,000–$14,500 amid the memory shortage.
RTX 5090
32 GB GDDR7 at the same 1,792 GB/s — fastest decode in this list, but 32 GB cannot hold a ~40 GB 4-bit 70B. Keep it to models up to ~30B (≈60–90 tok/s, estimated from bandwidth); step up to two cards or a 96 GB card for 70B.
NVIDIA DGX Spark
128 GB LPDDR5x unified, 273 GB/s, ~140 W typical draw (240 W-rated PSU), ~$3,999. Loads models a 5090 can’t, but the low bandwidth means modest, stack-dependent decode speed. A capacity play, not a speed play.
Apple M5 Max
Up to 128 GB unified, 614 GB/s, announced March 2026. Community estimates put a 70B around 15–32 tok/s — verify per framework. Note the Mac Studio M3 Ultra now tops out at 96 GB after its larger configs were pulled in 2026.
08 — Quantization: How low can you quantize?
Quantization is the dial that makes all of this affordable — fewer bytes per parameter shrinks both the weights and, when applied to the cache, the KV footprint. The trade is quality, but on modern formats it is smaller than most people expect until you push past 4-bit.
FP16 / BF16
2 bytes per parameter — the reference quality every quantized tier is measured against. Use it when VRAM is plentiful or quality is genuinely non-negotiable.
FP8
~1 byte per parameter, under ~0.4 perplexity points of degradation versus BF16, and benchmarks suggest roughly 1.4–1.8× throughput on Hopper/Blackwell at large batch. The practical GPU sweet spot.
INT8 / Q8
~1 byte per parameter with about half a percent of measured degradation. A conservative choice when you want the memory saving without thinking hard about quality.
Q4_K_M
Roughly 4.5 effective bits per weight, or about 0.56 bytes, slightly above the clean 0.5 used in the sizing math. It adds only ~0.05 perplexity points over FP16 on a 7B, inside run-to-run noise. The default for local inference on consumer and Apple hardware.
The quality numbers above are directional — they vary by model, benchmark, and the specific quantization implementation — but the shape is consistent: FP8 and INT8 are close to free, Q4_K_M is the local default that almost nobody can distinguish blind, and only at 2-bit does degradation become obvious. The same logic applies to the cache: an FP8 KV halves every cell in the table above and an INT4 KV quarters it, often with negligible quality cost on long-context retrieval. For the cross-model regression data behind each tier, our quantization tradeoffs breakdown has the benchmarks. And if you’re weighing local inference against a managed deployment for a production workload, that build-versus-buy call is exactly what our AI transformation engagements are built to answer.
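One way to combine the quantization tiers with the capacity math is to walk down the precision ladder and stop at the first tier that fits. The sketch below is a simplification under stated assumptions: `best_precision` is a hypothetical helper, it uses the idealized bytes-per-parameter values and the 1.15× overhead from earlier, and it considers capacity only, not quality or speed.

```python
# Precision tiers from highest to lowest quality, with idealized bytes per parameter.
PRECISIONS = [("fp16", 2.0), ("fp8/int8", 1.0), ("q4", 0.5)]


def best_precision(params_billion: float, card_gb: float,
                   kv_cache_gb: float = 0.0, overhead: float = 1.15):
    """Highest-precision tier whose loaded weights plus KV cache fit on the card.

    Returns None when even the 4-bit tier does not fit.
    """
    for name, bytes_per_param in PRECISIONS:
        loaded = params_billion * bytes_per_param * overhead
        if loaded + kv_cache_gb <= card_gb:
            return name
    return None


print(best_precision(70, 96, kv_cache_gb=42))  # 70B with a 128K BF16 cache on 96 GB
print(best_precision(70, 32))                  # 70B on a 32 GB card, no cache budget
print(best_precision(35, 96))                  # 35B on 96 GB, no cache budget
```

Passing a smaller `kv_cache_gb` (for example, an FP8 cache at half the size) shows how quantizing the cache can buy back a precision tier on the weights.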
09 — Conclusion: Two numbers, one spreadsheet row.
Once the model fits, the VRAM that’s left over is your real context budget.
The whole question of “how much VRAM” reduces to two figures you can put in one row. Weights are parameters times bytes per parameter, fixed by your model and precision. The KV cache is 2 × layers × KV-heads × head-dim × tokens × bytes, growing linearly with context. Subtract the loaded weights from your card’s capacity and divide what’s left by the per-token cache size — that quotient, not the model size, is the context you can actually run.
That reframing is what the hardware tiers are really about. A 35B at 4-bit on a 96 GB card has room for hundreds of thousands of tokens; a 70B on the same card is capacity-bound near its 128K window; a million tokens needs more memory than any single card you can buy, which is why the cloud shards it across GPUs. And remember that fitting a model is not the same as running it fast: decode is bandwidth-bound, so a high-capacity, low-bandwidth box can load big models and still generate tokens slowly.
Looking ahead, the pressure is on the cache, not the weights. Architectures like MLA and ever-more-aggressive KV quantization are attacking the one term that scales with context, because that is the term standing between a single workstation and genuinely long local context. For now, do the arithmetic before you buy: it’s the difference between a card that serves your workload and one that stalls at the first long prompt. If a custom build is on the table, our engineering team can size the stack with you.