The best hardware to run local AI models in 2026 is not the card with the most TOPS — it is the one with the most memory bandwidth and enough memory to hold your model. Token generation reloads every weight from memory once per token, so decode speed tracks bandwidth, not raw compute. Get that one idea right and the whole buying decision falls into place.
What is at stake is real money in a market that moved underneath buyers all year. A global DRAM and GDDR7 shortage forced Apple to withdraw its largest Mac Studio memory tiers, drove up workstation-GPU street prices, and pushed Apple to raise Mac prices on June 25, 2026. Buying on last year’s launch prices, or on a spec sheet that lists peak TOPS, is how people overpay for a box that decodes slowly.
This guide compares five concrete brackets — the RTX 5090, the MacBook Pro M5 Max, the NVIDIA DGX Spark, the RTX PRO 6000 Blackwell, and the Mac Studio M3 Ultra — on price, memory, bandwidth, the largest model each can hold, and real-world tokens per second. Every number traces to a primary source or an independent benchmark, and anything vendor-stated or estimated is labelled as such.
- 01. Decode speed is bound by memory bandwidth, not TOPS. Every generated token reloads the whole model from memory once, so tokens/sec is roughly bandwidth divided by model size. Two cards with the same bandwidth decode at the same speed even if one has six times the TOPS.
- 02. The RTX 5090 cannot hold a 70B model at Q4. A 70B model at 4-bit is about 35GB; the 5090 has 32GB of VRAM. It is an excellent sub-30B card (around 66 tok/sec on a 30B) but needs heavy quantization or slow CPU offload to touch 70B.
- 03. The RTX PRO 6000 is the fastest real-world 70B box, at a shortage premium. Its 96GB and 1,792 GB/s run a dense 70B at roughly 32 tok/sec. The same GDDR7 shortage that hit Apple pushed its listings from an ~$8,565 launch MSRP toward roughly $12,000 to $14,500 by mid-2026.
- 04. DGX Spark buys capacity, not raw 70B speed. 128GB lets it load very large models, but its 273 GB/s bus caps dense-70B decode in the single digits with common tools. Its real wins are the CUDA software stack and big-model headroom.
- 05. The DRAM shortage rewrote 2026 prices in real time. Apple pulled the 512GB Mac Studio in March and the 256GB tier in May, then raised Mac prices on June 25 (the M3 Ultra alone jumped about $1,300). Anchor on this quarter’s prices, not last year’s.
01 — The One Rule
Decode speed is memory-bandwidth-bound.
Here is the single most useful fact for buying local AI hardware. At batch size one — one user, one conversation — generating each token requires reading every weight in the model out of memory exactly once. The arithmetic is cheap; the data movement is the bottleneck. So the ceiling on tokens per second is set by how fast the chip can stream weights from memory, which is its memory bandwidth.
That collapses to a formula you can do in your head:
tokens/sec (ceiling) ≈ memory bandwidth (GB/s) ÷ model size (GB)
At Q4 (4-bit), weights take about 0.5GB per billion parameters, so a 70B model is roughly 35GB and a 30B is roughly 15GB. An RTX PRO 6000 at 1,792 GB/s therefore tops out near 1,792 ÷ 35 ≈ 51 tok/sec on a 70B — and measures about 32 in practice (roughly 63% of ceiling). A DGX Spark at 273 GB/s tops out near 7.8 tok/sec on the same model. Real-world output is typically 55 to 70% of the ceiling, depending on the inference framework and the quantization level you choose (4-bit, 8-bit, FP8).
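The rule of thumb above is easy to turn into a few lines of Python. This is a minimal sketch, not a benchmark: the function names are illustrative, the 0.5GB-per-billion figure and the 55 to 70% real-world band are the ones quoted in the text, and real results will vary with framework and quantization.

```python
# Rule-of-thumb decode ceiling at batch size one: bandwidth / model size.

GB_PER_BILLION_PARAMS_Q4 = 0.5  # from the text: Q4 weights take ~0.5 GB per billion parameters


def model_size_gb(params_billion: float,
                  gb_per_billion: float = GB_PER_BILLION_PARAMS_Q4) -> float:
    """Approximate weight footprint in GB."""
    return params_billion * gb_per_billion


def decode_ceiling(bandwidth_gbps: float, size_gb: float) -> float:
    """Theoretical tokens/sec, assuming every weight is read once per token."""
    return bandwidth_gbps / size_gb


def realistic_range(ceiling: float, low: float = 0.55, high: float = 0.70) -> tuple:
    """Apply the 55-70% real-world band quoted in the text."""
    return ceiling * low, ceiling * high


if __name__ == "__main__":
    size_70b = model_size_gb(70)  # 35.0 GB
    for name, bandwidth in [("RTX PRO 6000", 1792), ("DGX Spark", 273)]:
        ceiling = decode_ceiling(bandwidth, size_70b)
        low, high = realistic_range(ceiling)
        print(f"{name}: ceiling {ceiling:.1f} tok/s, expect ~{low:.1f}-{high:.1f}")
```

Swapping in another bandwidth figure or model size gives a quick first-pass estimate before looking at any published benchmark.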
This is why TOPS on the box is misleading. A device can advertise thousands of trillions of operations per second and still decode slowly, because at batch one those tensor cores spend most of their time waiting on memory. The chart below shows the theoretical 70B ceiling for each bracket — purely a function of bandwidth divided by model size — and it already explains most of the buying decision.
Chart: Theoretical 70B Q4 decode ceiling (bandwidth ÷ 35 GB). Derived from vendor memory-bandwidth specs; real-world output ≈ 55–70% of this ceiling.
One important caveat keeps this from being the whole story. Prefill, the work of reading a long input prompt before the model starts answering, is compute-bound, not bandwidth-bound. For very long contexts (100K-plus tokens) a high-TOPS NVIDIA card processes input far faster than Apple Silicon: a 128K-token prompt that takes several minutes to ingest on an M5 Max can take seconds on an RTX PRO 6000. So bandwidth governs how fast you read the answer; compute governs how fast the machine reads your question. Most local chat and coding sits in the bandwidth-bound regime; heavy long-document workloads lean on prefill.
02 — Price Brackets
The five brackets, on one decision matrix.
The table below is the heart of this guide. It puts upfront price, memory, bandwidth, real-world tokens per second, and annual electricity cost on a single row per bracket — the combination almost no review assembles in one place. Prices are late-June 2026 street or post-hike figures, not launch MSRPs. For a head-to-head deep dive on the three flagship options, see our DGX Spark vs M5 Max vs RTX 6000 showdown.
| Bracket | Street price | Memory | Bandwidth | 30B Q4 | 70B Q4 | Power | Elec/yr* |
|---|---|---|---|---|---|---|---|
| RTX 5090 | $3,000–$5,000+ | 32GB GDDR7 | 1,792 GB/s | ~66* | won’t fit† | 575W | ~$201 |
| MacBook Pro M5 Max | from ~$3,899 | up to 128GB | 614 GB/s | ~30–40* | ~12–18* | ~80W | ~$28 |
| NVIDIA DGX Spark | ~$4,699 | 128GB LPDDR5x | 273 GB/s | ~10* | ~5* | ~140W‖ | ~$49 |
| RTX PRO 6000 | ~$12,000–$14,500 | 96GB GDDR7 ECC | 1,792 GB/s | ~68* | ~32 | 600W | ~$210‡ |
| Mac Studio M3 Ultra | ~$5,299 | up to 96GB§ | 819 GB/s | ~40–55* | ~16–22* | ~150W | ~$53 |
*Community, derived, or vendor benchmarks; real-world is roughly 55 to 70% of the bandwidth ceiling and depends on framework and quantization. †A 70B at Q4 (~35GB) exceeds the 5090’s 32GB VRAM. ‡GPU at 600W; a whole workstation around it draws closer to 920W (~$322/yr). §The 512GB tier was pulled in March 2026 and the 256GB tier in May 2026, leaving 96GB as the maximum purchasable M3 Ultra Mac Studio configuration in late June 2026. ‖The DGX Spark ships with a 240W-rated power supply but draws about 140W in typical use; its electricity figure uses the ~140W draw. Electricity assumes 8 hours/day at the US-average $0.12/kWh; EU rates run roughly 2.5 to 3 times higher.
Read the matrix and the trend jumps out. The two GDDR7 cards share identical 1,792 GB/s bandwidth, but only the RTX PRO 6000’s 96GB lets it actually hold a 70B; the 5090’s 32GB does not. The Mac Studio M3 Ultra’s 819 GB/s makes it the fastest Apple path on a dense 70B at roughly 16 to 22 tok/sec. That trails the RTX PRO 6000’s ~32 tok/sec but is several times the DGX Spark’s ~5 tok/sec, at a quarter of the PRO 6000’s power draw. And the DGX Spark, despite its price, sits at the bottom of the speed column; its value is elsewhere. The interpretation worth holding onto: in a shortage year, bandwidth-per-dollar and memory capacity have become the metrics that separate these machines, far more than headline compute.
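To make bandwidth-per-dollar concrete, the sketch below divides each bracket's bandwidth and memory by an assumed price. The prices are assumptions: the low end of each street-price range in the matrix, except the M5 Max, where the roughly $5,000 launch-era price of a 128GB configuration is used because the $3,899 entry model does not carry 128GB. The two GPU entries are card-only prices and exclude the host PC. The `Bracket` class and its property names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bracket:
    name: str
    price_usd: float      # assumed price, see the note above (not a quote)
    memory_gb: int
    bandwidth_gbps: int

    @property
    def bandwidth_per_kusd(self) -> float:
        """GB/s of memory bandwidth per $1,000 spent."""
        return self.bandwidth_gbps / (self.price_usd / 1000)

    @property
    def memory_per_kusd(self) -> float:
        """GB of memory per $1,000 spent."""
        return self.memory_gb / (self.price_usd / 1000)


BRACKETS = [
    Bracket("RTX 5090", 3000, 32, 1792),            # card only
    Bracket("MacBook Pro M5 Max", 5000, 128, 614),  # 128GB config, launch-era price
    Bracket("NVIDIA DGX Spark", 4699, 128, 273),
    Bracket("RTX PRO 6000", 12000, 96, 1792),       # card only
    Bracket("Mac Studio M3 Ultra", 5299, 96, 819),
]

if __name__ == "__main__":
    ranked = sorted(BRACKETS, key=lambda b: b.bandwidth_per_kusd, reverse=True)
    for b in ranked:
        print(f"{b.name:<22} {b.bandwidth_per_kusd:7.1f} GB/s per $1k"
              f"  {b.memory_per_kusd:5.1f} GB per $1k")
```

On these assumed prices the RTX 5090 leads on bandwidth per dollar and the DGX Spark on memory per dollar, which matches the roles the guide assigns them. Re-run it with current street prices before drawing conclusions.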
03 — Model Fit
What each bracket can actually hold.
Speed is moot if the model does not fit. At Q4, weights take about 0.5GB per billion parameters, so the largest model a device can hold, in billions of parameters, is roughly its memory in GB divided by 0.5, minus headroom for the KV cache, context, and the operating system. Mixture-of-Experts (MoE) models are the wildcard: they activate only a fraction of their parameters per token, so a 120B MoE still has to fit by total size, yet it decodes far faster than a dense 120B would.
| Bracket (models at Q4) | ≤13B | 30B | 70B dense | 120B MoE | ≥180B |
|---|---|---|---|---|---|
| RTX 5090 (32GB) | Yes | Yes | No | No | No |
| M5 Max (128GB) | Yes | Yes | Yes | Yes | Tight |
| DGX Spark (128GB) | Yes | Yes | Yes | Yes | ~200B |
| RTX PRO 6000 (96GB) | Yes | Yes | Yes | Yes | No |
| Mac Studio M3 Ultra (96GB) | Yes | Yes | Yes | Yes | No |
The correction to remember is in that bottom row: with the 512GB and 256GB Mac Studio tiers withdrawn during 2026, the M3 Ultra now tops out at 96GB. That puts it in the same large-model tier as the 96GB RTX PRO 6000, not the capacity crown it held a year ago. If your goal is to load the very largest open models at Q4, the 128GB unified devices (the M5 Max and the DGX Spark) now hold more than the M3 Ultra does. If you want the full VRAM-versus-context math behind these fit calls, our companion piece on how much VRAM an LLM really needs works the KV-cache formula in detail.
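The fit rule from this section can be sketched as a small check. The 8GB headroom default is a hypothetical placeholder for KV cache, context, and operating-system overhead, not a figure from this guide; unified-memory machines in particular may need more. The function names are illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Weight footprint in GB: parameters x bits / 8 (0.5 GB per billion at 4-bit)."""
    return params_billion * bits_per_weight / 8.0


def fits(memory_gb: float, params_billion: float,
         headroom_gb: float = 8.0, bits_per_weight: float = 4.0) -> bool:
    """True if the weights plus the assumed headroom fit in device memory."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= memory_gb


def largest_params_billion(memory_gb: float, headroom_gb: float = 8.0,
                           bits_per_weight: float = 4.0) -> float:
    """Largest model (in billions of parameters) that fits under the same assumptions."""
    return max(memory_gb - headroom_gb, 0.0) * 8.0 / bits_per_weight


if __name__ == "__main__":
    devices = {"RTX 5090": 32, "RTX PRO 6000": 96, "DGX Spark": 128}
    for name, memory in devices.items():
        row = {size: fits(memory, size) for size in (13, 30, 70, 120, 180)}
        print(name, row, f"max ~{largest_params_billion(memory):.0f}B")
```

With the assumed headroom this reproduces the Yes/No pattern of the table for the 32GB, 96GB, and 128GB devices; the "Tight" and "~200B" cells are judgement calls that a simple threshold does not capture.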
04 — Consumer GPU
RTX 5090: the 30B speed king that can’t hold 70B.
The RTX 5090 is the best value in this guide for models that fit it. It launched at $1,999 in early 2025 and now trades on the street between roughly $3,000 and $5,000-plus, pushed up by the same memory shortage affecting everything else. On a 30B-class model at Q4, a bandwidth-derived estimate puts it around 66 tok/sec (roughly 55% of its theoretical ceiling) — fast, responsive, and more than enough for interactive coding and chat with sub-30B models.
32GB VRAM · GDDR7, 512-bit bus
Plenty for 7B to 30B at Q4 with comfortable context headroom. A 70B at Q4 (~35GB) simply does not fit on one card.
1,792 GB/s bandwidth · identical to the PRO 6000
Same memory bandwidth as the five-figure workstation card, but a third of the VRAM. Bandwidth was never the 5090's limit; capacity is.
~66 tok/sec on a 30B at Q4 · bandwidth-derived estimate
Roughly 55% of the 119 tok/sec theoretical ceiling for a 15GB model, which is typical framework overhead. Snappy for everyday local use.
Do not buy a single RTX 5090 to run 70B models. A 70B at Q4 is about 35GB and the card holds 32GB. Your only options are to drop to Q3 or Q2 (a real quality hit) or offload layers to system RAM across the PCIe bus — which collapses throughput to a few tokens per second. Below 30B the 5090 is excellent; at 70B it is the wrong tool. If you need 70B on one box, the RTX PRO 6000 or an Apple Silicon machine is the honest answer.
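Why offload collapses throughput follows from the same bandwidth arithmetic: time per token is the sum of the time to read each portion of the weights over its own path. The sketch below is a simplified model under stated assumptions. It assumes 28GB of the 35GB model stays in VRAM and that the remaining 7GB is read over a hypothetical 64 GB/s slow path (roughly the order of a PCIe 5.0 x16 link). Real offload behaviour depends on the framework and on system RAM speed.

```python
def split_decode_ceiling(model_gb: float, fast_gb: float,
                         fast_bw_gbps: float, slow_bw_gbps: float) -> float:
    """Ceiling tok/s when fast_gb of weights sit in VRAM and the rest uses a slower path."""
    if not 0 <= fast_gb <= model_gb:
        raise ValueError("fast_gb must be between 0 and model_gb")
    slow_gb = model_gb - fast_gb
    # Each token still reads every weight once; the two reads add up in time.
    seconds_per_token = fast_gb / fast_bw_gbps + slow_gb / slow_bw_gbps
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    all_in_vram = split_decode_ceiling(35, 35, 1792, 64)  # hypothetical: if 35GB did fit
    offloaded = split_decode_ceiling(35, 28, 1792, 64)    # 7GB over the assumed slow path
    print(f"all in VRAM: {all_in_vram:.1f} tok/s ceiling")
    print(f"7GB offloaded: {offloaded:.1f} tok/s ceiling")
```

Under these assumptions, moving just a fifth of the weights off the card drops the ceiling from about 51 tok/sec to 8, before any real-world overhead. The slow path dominates because its 7GB takes far longer to read than the 28GB in VRAM.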
05 — Grace Blackwell
DGX Spark: capacity and CUDA, not raw speed.
NVIDIA’s DGX Spark is the most misunderstood box in this lineup. Built on the GB10 Grace Blackwell Superchip, it pairs 128GB of LPDDR5x unified memory with 1 PFLOP of FP4 compute (with sparsity) in a desktop the size of a hardback book. It ships with a 240W-rated power supply, typically draws around 140W, and currently costs around $4,699. People see “1 PFLOP” and expect blistering speed. But its memory bus runs at 273 GB/s, roughly a sixth of the GDDR7 cards’ 1,792 GB/s, and decode is bandwidth-bound, so a dense 70B is genuinely slow here.
How slow depends entirely on the software stack, and this is the disclosure most buyer guides skip. The two paths look like this:
Ollama / GGUF
Install Ollama, pull a 70B, and an independent benchmark measured about 4.67 tok/sec, roughly 60% of the 273 GB/s ÷ 35GB ≈ 7.8 tok/sec ceiling and in line with typical real-world efficiency. Usable for a patient single user; not snappy. This is what most buyers actually experience on day one.
TensorRT-LLM / NVFP4
NVIDIA's own published numbers cover 8B, 14B and 20B plus MoE — for example GPT-OSS-20B at ~82.7 tok/sec. Higher dense-70B figures circulate, but they are stack- and model-dependent and a true dense 70B cannot exceed the bandwidth ceiling. Treat any big dense-70B number with suspicion.
Big models & agents
The real reason to buy one: load models a 32GB card cannot touch, run NVIDIA's agent and microservice stack, and keep CUDA parity with the datacenter. Capacity and ecosystem, not tokens per second, are the pitch.
The DGX Spark is the clearest example of why you must always ask three questions about any local-LLM benchmark: which framework, which quantization, and which model type (dense or MoE). The same box ranges from a few tokens per second on a dense 70B with Ollama to tens of tokens per second on smaller or MoE models with NVIDIA’s optimized stack. Never read a single hero number as typical. For the full story on running 120B-class agent models on this device, see our NVIDIA DGX Spark and the GB10 superchip deep dive.
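Those three questions can be attached to a simple sanity check: compare any claimed figure with the bandwidth ceiling and flag dense-model numbers that exceed it. This is a sketch; the `BenchmarkClaim` record and `verdict` wording are illustrative, the 35GB size assumes Q4, and the 20 tok/sec entry is a made-up claim used only to show the failure case.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkClaim:
    device: str
    bandwidth_gbps: float
    model_gb: float        # weight size; for a dense model, all of it is read per token
    tokens_per_sec: float
    framework: str
    quantization: str
    dense: bool


def efficiency(claim: BenchmarkClaim) -> float:
    """Claimed speed as a fraction of the bandwidth / model-size ceiling."""
    return claim.tokens_per_sec / (claim.bandwidth_gbps / claim.model_gb)


def verdict(claim: BenchmarkClaim) -> str:
    if not claim.dense:
        return "MoE: compare against active weights, not total size"
    eff = efficiency(claim)
    if eff > 1.0:
        return "implausible: exceeds the bandwidth ceiling"
    return f"plausible: {eff:.0%} of ceiling"


if __name__ == "__main__":
    measured = BenchmarkClaim("DGX Spark", 273, 35, 4.67, "Ollama", "Q4 (assumed)", True)
    made_up = BenchmarkClaim("DGX Spark", 273, 35, 20.0, "unknown", "unknown", True)
    for claim in (measured, made_up):
        print(claim.framework, claim.tokens_per_sec, "->", verdict(claim))
```

A check like this does not say whether a number is typical. It only catches dense-model figures that cannot be bandwidth-limited decode at the stated size.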
06 — Workstation GPU
RTX PRO 6000: the fastest real 70B, at a shortage premium.
If you want the fastest single-box 70B inference money can buy right now, this is it. The RTX PRO 6000 Blackwell pairs 96GB of GDDR7 ECC memory with 1,792 GB/s of bandwidth, and independent testing in LM Studio clocked Llama 3.1 70B at 31.84 tok/sec and Llama 3.3 70B at 31.74 — the roughly 32 tok/sec figure this guide quotes. A Gemma 3 27B ran at 68.06 tok/sec on the same card. It is the only workstation GPU on the market with 96GB, which is exactly what lets it hold a 70B in VRAM and run it at full bandwidth instead of crawling through CPU offload.
Its 96GB also unlocks models that simply will not load on consumer hardware. In the same review, an OpenAI GPT-OSS 120B — a Mixture-of-Experts model, so only a fraction of its parameters fire per token — ran at 163.15 tok/sec, a number no 32GB card can reach because it cannot hold the model at all.
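The bandwidth rule also shows why that 163 tok/sec figure can only come from an MoE model. A dense 120B would have to read all of its weights per token, which caps it well below the measured speed. The sketch below assumes the guide's 0.5GB-per-billion Q4 figure applies to this model; the quantization used in the review is not stated here, and the function names are illustrative.

```python
def dense_ceiling(bandwidth_gbps: float, params_billion: float,
                  gb_per_billion: float = 0.5) -> float:
    """Ceiling tok/s if every parameter were read for every token."""
    return bandwidth_gbps / (params_billion * gb_per_billion)


def max_gb_read_per_token(bandwidth_gbps: float, measured_tok_s: float) -> float:
    """Upper bound on data streamed per token that is consistent with a measured speed."""
    return bandwidth_gbps / measured_tok_s


if __name__ == "__main__":
    ceiling = dense_ceiling(1792, 120)               # dense 120B at ~60 GB
    per_token = max_gb_read_per_token(1792, 163.15)  # implied by the measurement
    print(f"dense 120B ceiling: {ceiling:.1f} tok/s")
    print(f"measured 163.15 tok/s implies at most {per_token:.1f} GB read per token")
```

The dense ceiling comes out near 30 tok/sec, so a measured 163 tok/sec means the card is streaming at most about 11GB per token. That is a fraction of the roughly 60GB total, which is exactly what activating only some experts per token would produce.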
"The standout benchmark was the OpenAI GPT-OSS 120B, which achieved 163.15 tokens per second — a capability unique to the RTX PRO 6000 due to its 96GB memory capacity, as competing consumer GPUs cannot accommodate models of this scale."— StorageReview, independent RTX PRO 6000 review
The catch is price, and it is a 2026 story. The card launched around an $8,565 MSRP, but the same GDDR7 shortage that forced Apple to pull its largest Mac Studio memory tiers has pushed RTX PRO 6000 marketplace listings toward roughly $12,000 to $14,500 by mid-year (NVIDIA Marketplace around $13,250, Newegg around $12,099, B&H around $14,499), even as some retail held nearer $8,500 to $9,200. This is the second face of the shortage narrative: it does not just raise list prices, it widens the gap between MSRP and what you actually pay.
The 512GB Mac Studio pulled in March, the 256GB pulled in May, Apple’s June 25 across-the-board Mac price hike, the DGX Spark’s $700 bump, and the RTX PRO 6000’s street spike are not separate events — they are the same global DRAM and GDDR7 shortage showing up in different stores. The practical takeaway: memory capacity is the scarce, expensive resource of 2026, and it is exactly what determines which models you can run. The PRO 6000’s whole-workstation draw is also real — about 920W under load, closer to $322/yr in electricity than the GPU-only $210.
07 — Apple Silicon
Apple Silicon: quiet, efficient, and re-priced.
Apple’s unified-memory architecture is a natural fit for local LLMs: the GPU addresses the same large memory pool as the CPU, so a 128GB MacBook Pro can hold models that would need a server rack of consumer GPUs. The trade-offs are no CUDA, a thinner advanced-tooling ecosystem, and slower long-context prefill than NVIDIA. But for many users the wins — silence, low power, and the privacy and cost of on-device AI — outweigh them.
MacBook Pro M5 Pro
From about $2,699 for the 16-inch (pre-hike). Its 307 GB/s caps a dense 70B near 9 tok/sec, so it is really a sub-30B machine — but a very capable, affordable one.
MacBook Pro M5 Max
From about $3,899; a 128GB config ran roughly $5,000-plus at launch and more after June 25. Community estimates put a dense 70B near 12 to 18 tok/sec, at about 80W, near-silent, battery-optional.
Mac Studio M3 Ultra
About $5,299 after the June 25 hike. Its 819 GB/s is the fastest Apple-silicon decode here — roughly 16 to 22 tok/sec on a dense 70B — in a silent desktop. But its 512GB and 256GB memory tiers were withdrawn in 2026.
This is the correction that breaks most older buyer guides. Apple removed the 512GB Mac Studio M3 Ultra option in March 2026 and the 256GB option in May 2026, both attributed to the DRAM shortage — so 96GB is the maximum purchasable M3 Ultra Mac Studio configuration as of late June 2026. Do not plan around a 256GB or 512GB Mac Studio; it is no longer on sale. The M3 Ultra remains the fastest Apple option for 70B decode thanks to its 819 GB/s bandwidth, but it is no longer the capacity king it once was.
08 — Total Cost
The bill after you buy it.
Upfront price is only half the decision; power draw is the recurring half. The spread here is dramatic — an Apple machine sips power while a workstation GPU pulls as much as a space heater. The chart uses a consistent basis: typical device power draw, eight hours a day, at the US-average $0.12/kWh, with each annual figure recomputed from watts directly.
Chart: Electricity cost, 8 h/day at $0.12/kWh (US average). Derived: watts × 2,920 h ÷ 1,000 × $0.12; EU rates run 2.5–3× higher.
Two things follow. First, location matters: at European rates of roughly €0.30 to €0.40/kWh, every figure above roughly triples, which can swing a multi-year ownership decision toward the efficient Apple machines. Second, electricity is small next to hardware depreciation in most cases, but for an always-on workstation GPU it is not negligible, and it is the kind of line item that belongs in a proper total cost of ownership including electricity and hardware depreciation. If you are weighing a hardware purchase against cloud subscriptions at all, the math shifted again with Apple’s June 2026 price hike.
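The annual electricity figures in the matrix follow directly from the stated basis of watts × 2,920 h ÷ 1,000 × $0.12. A minimal sketch that reproduces them is below. The `ownership_cost` helper is an illustrative addition that simply adds purchase price to electricity over a number of years and ignores depreciation and resale value.

```python
HOURS_PER_YEAR = 8 * 365  # 8 hours/day -> 2,920 h, the basis used in this guide


def annual_electricity_cost(watts: float, rate_per_kwh: float = 0.12,
                            hours: float = HOURS_PER_YEAR) -> float:
    """Yearly electricity cost: watts x hours / 1000 x rate per kWh."""
    return watts * hours / 1000 * rate_per_kwh


def ownership_cost(price: float, watts: float, years: float,
                   rate_per_kwh: float = 0.12) -> float:
    """Purchase price plus electricity only (no depreciation, a simplification)."""
    return price + years * annual_electricity_cost(watts, rate_per_kwh)


if __name__ == "__main__":
    draws = {"RTX 5090": 575, "MacBook Pro M5 Max": 80, "DGX Spark": 140,
             "RTX PRO 6000 (GPU only)": 600, "RTX PRO 6000 (workstation)": 920,
             "Mac Studio M3 Ultra": 150}
    for name, watts in draws.items():
        print(f"{name:<28} ~${annual_electricity_cost(watts):.0f}/yr")
```

Passing a different `rate_per_kwh` (for example, a local tariff) or a different `hours` value rescales every figure, which is the quickest way to adapt the matrix to another country or an always-on workload.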
09 — The Decision
How to choose your bracket.
There is no single best machine — there is a best machine for your workload, model size, and tolerance for setup. Match yourself to one of the five profiles below. Before spending anything, though, it is worth deciding whether to self-host open-weight models at all versus renting GPU time for bursty workloads.
RTX 5090
Best tokens-per-dollar for 7B to 30B models — about 66 tok/sec on a 30B at Q4. Skip it if you need a 70B at full quality; 32GB simply will not hold one.
MacBook Pro M5 Max
128GB unified, near-silent, about $28/yr to run, no CUDA. Runs a 70B passably at an estimated 12 to 18 tok/sec (community range). The best laptop for on-device privacy work.
RTX PRO 6000 Blackwell
About 32 tok/sec on a dense 70B, 96GB ECC, and fast long-context prefill. The only card that runs 70B at full VRAM speed — if you can absorb the ~$12k-plus shortage price.
NVIDIA DGX Spark
Capacity and the NVIDIA software stack in a tiny box. Load ~200B-class models and build agents — but accept single-digit dense-70B decode with common tools.
Mac Studio M3 Ultra
819 GB/s is the fastest Apple-silicon path to roughly 16 to 22 tok/sec on a 70B, in a silent desktop — now capped at 96GB after the 2026 memory-tier cuts.
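The five profiles above can also be read as a filter over the decision matrix: pick a model class and a minimum speed, and see which brackets remain. The sketch below copies the low end of each tokens-per-second range from the matrix (a conservative assumption) and uses `None` where the model does not fit. The `shortlist` function and its structure are illustrative, and it deliberately ignores price, noise, and software stack.

```python
# Low-end tok/sec figures copied from the decision matrix; None means the model does not fit.
MATRIX = {
    "RTX 5090": {"30B": 66, "70B": None},
    "MacBook Pro M5 Max": {"30B": 30, "70B": 12},
    "NVIDIA DGX Spark": {"30B": 10, "70B": 5},
    "RTX PRO 6000": {"30B": 68, "70B": 32},
    "Mac Studio M3 Ultra": {"30B": 40, "70B": 16},
}


def shortlist(model_class: str, min_tok_s: float) -> list:
    """Brackets that hold the model and meet the speed floor, fastest first."""
    candidates = []
    for name, speeds in MATRIX.items():
        speed = speeds[model_class]
        if speed is not None and speed >= min_tok_s:
            candidates.append((speed, name))
    return [name for speed, name in sorted(candidates, reverse=True)]


if __name__ == "__main__":
    print("70B at >= 15 tok/s:", shortlist("70B", 15))
    print("30B at >= 60 tok/s:", shortlist("30B", 60))
```

A filter like this only narrows the field. The remaining choice among the shortlisted brackets still comes down to the trade-offs described in each profile.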
If you would rather not guess — or you need to validate a purchase against the exact models and prompts your team runs — our AI digital transformation engagements benchmark candidate hardware and open-weight models on your real workloads before you commit budget, and pair the hardware decision with the application and tooling layer that actually puts a local model to work.
10 — Conclusion
Buy bandwidth and capacity, in that order.
In a shortage year, bandwidth-per-dollar and memory capacity decide the buy — not TOPS.
Strip away the marketing and local AI hardware reduces to two questions. Does the model fit in memory? And how fast can the chip stream those weights? Memory bandwidth sets your decode ceiling; memory capacity sets which models you can run at all. TOPS, the number plastered on every box, barely moves the needle for single-user inference — it matters for prefill and batching, not for the tokens you watch appear.
Against that frame, the 2026 picks are clear. The RTX 5090 is the sub-30B value champion that cannot hold a 70B. The RTX PRO 6000 is the fastest single-box 70B, if you can pay a shortage-inflated five figures. The DGX Spark buys capacity and CUDA, not speed. And Apple Silicon trades raw throughput for silence, efficiency, and privacy — with the hard caveat that the Mac Studio’s big memory tiers were pulled this year.
Looking forward, the shortage is the variable to watch. As long as DRAM and GDDR7 stay scarce, expect capacity to remain the expensive, rationed resource and bandwidth-per-dollar to keep being the metric that separates these machines. If Apple restores higher memory tiers or prices ease, the capacity calculus shifts again. Until then, the durable advice is simple: size the model first, buy the bandwidth to feed it, and verify today’s street price before you commit — because in this market, last quarter’s number is already wrong.