AI Development · Industry Guide · 12 min read · Published June 28, 2026

Decode is memory-bandwidth-bound · five brackets · post-shortage 2026 prices

Best Hardware to Run Local AI in 2026: Bandwidth Beats TOPS

Five price brackets, from a $3k RTX 5090 to a five-figure RTX PRO 6000 — compared on the one number that actually sets local LLM speed: memory bandwidth. With the 2026 DRAM shortage reshaping prices in real time, this guide uses post-shortage pricing and real-world tokens-per-second, not launch-day spec sheets.

Digital Applied Team
Senior strategists
Sources: 9 primary sources
  • 70B model at Q4: 35GB (exceeds a 32GB RTX 5090)
  • RTX PRO 6000 · 70B: ~32 tokens/sec, real-world
  • 5090 & PRO 6000: 1,792 GB/s memory bandwidth
  • Mac Studio M3 Ultra: $5,299 after the June 25 hike (+$1,300)

The best hardware to run local AI models in 2026 is not the card with the most TOPS — it is the one with the most memory bandwidth and enough memory to hold your model. Token generation reloads every weight from memory once per token, so decode speed tracks bandwidth, not raw compute. Get that one idea right and the whole buying decision falls into place.

Real money is at stake, in a market that shifted under buyers all year. A global DRAM and GDDR7 shortage led Apple to pull its largest Mac Studio memory tiers, spiked workstation-GPU street prices, and pushed Apple to raise Mac prices on June 25, 2026. Buying on last year’s launch prices, or on a spec sheet that lists peak TOPS, is how people overpay for a box that decodes slowly.

This guide compares five concrete brackets — the RTX 5090, the MacBook Pro M5 Max, the NVIDIA DGX Spark, the RTX PRO 6000 Blackwell, and the Mac Studio M3 Ultra — on price, memory, bandwidth, the largest model each can hold, and real-world tokens per second. Every number traces to a primary source or an independent benchmark, and anything vendor-stated or estimated is labelled as such.

Key takeaways
  1. Decode speed is bound by memory bandwidth, not TOPS. Every generated token reloads the whole model from memory once, so tokens/sec is roughly bandwidth divided by model size. Two cards with the same bandwidth decode at the same speed even if one has six times the TOPS.
  2. The RTX 5090 cannot hold a 70B model at Q4. A 70B model at 4-bit is about 35GB; the 5090 has 32GB of VRAM. It is an excellent sub-30B card (around 66 tok/sec on a 30B) but needs heavy quantization or slow CPU offload to touch 70B.
  3. The RTX PRO 6000 is the fastest real-world 70B box, at a shortage premium. Its 96GB and 1,792 GB/s run a dense 70B at roughly 32 tok/sec. The same GDDR7 shortage that hit Apple pushed its listings from an ~$8,565 launch MSRP toward roughly $12,000 to $14,500 by mid-2026.
  4. DGX Spark buys capacity, not raw 70B speed. 128GB lets it load very large models, but its 273 GB/s bus caps dense-70B decode in the single digits with common tools. Its real wins are the CUDA software stack and big-model headroom.
  5. The DRAM shortage rewrote 2026 prices in real time. Apple pulled the 512GB Mac Studio in March and the 256GB tier in May, then raised Mac prices on June 25 (the M3 Ultra alone jumped about $1,300). Anchor on this quarter’s prices, not last year’s.

01 · The One Rule · Decode speed is memory-bandwidth-bound.

Here is the single most useful fact for buying local AI hardware. At batch size one — one user, one conversation — generating each token requires reading every weight in the model out of memory exactly once. The arithmetic is cheap; the data movement is the bottleneck. So the ceiling on tokens per second is set by how fast the chip can stream weights from memory, which is its memory bandwidth.

That collapses to a formula you can do in your head:

The bandwidth-to-token formula

tokens/sec (ceiling) ≈ memory bandwidth (GB/s) ÷ model size (GB)

At Q4 (4-bit), weights take about 0.5GB per billion parameters, so a 70B model is roughly 35GB and a 30B is roughly 15GB. An RTX PRO 6000 at 1,792 GB/s therefore tops out near 1,792 ÷ 35 ≈ 51 tok/sec on a 70B — and measures about 32 in practice (roughly 63% of ceiling). A DGX Spark at 273 GB/s tops out near 7.8 tok/sec on the same model. Real-world output is typically 55 to 70% of the ceiling, depending on the inference framework and the quantization level you choose (4-bit, 8-bit, FP8).
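
The same arithmetic can be sketched in a few lines of Python. The function names are ours, and the 0.5GB-per-billion and 55 to 70% figures are this guide's rules of thumb; treat the output as a rough ceiling, not a benchmark.

```python
GB_PER_BILLION_PARAMS_Q4 = 0.5  # guide's rule of thumb for 4-bit weights


def model_size_gb(params_billion: float,
                  gb_per_billion: float = GB_PER_BILLION_PARAMS_Q4) -> float:
    """Approximate weight footprint in GB for a dense model."""
    return params_billion * gb_per_billion


def decode_ceiling(bandwidth_gb_s: float, size_gb: float) -> float:
    """Upper bound on batch-1 tokens/sec: every weight is read once per token."""
    return bandwidth_gb_s / size_gb


def realistic_range(ceiling: float, low: float = 0.55, high: float = 0.70):
    """Apply the guide's 55-70% real-world band to a theoretical ceiling."""
    return ceiling * low, ceiling * high


size = model_size_gb(70)  # 35 GB for a dense 70B at Q4
for name, bandwidth in [("RTX PRO 6000", 1792), ("DGX Spark", 273)]:
    ceiling = decode_ceiling(bandwidth, size)
    low, high = realistic_range(ceiling)
    print(f"{name}: ceiling {ceiling:.1f} tok/s, expect roughly {low:.0f}-{high:.0f}")
```

Swap in your own bandwidth and model size to sanity-check any spec sheet before reading the benchmarks.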

This is why TOPS on the box is misleading. A device can advertise thousands of trillions of operations per second and still decode slowly, because at batch one those tensor cores spend most of their time waiting on memory. The chart below shows the theoretical 70B ceiling for each bracket — purely a function of bandwidth divided by model size — and it already explains most of the buying decision.

Theoretical 70B Q4 decode ceiling · bandwidth ÷ 35 GB

Derived from vendor memory-bandwidth specs · real-world output ≈ 55–70% of this ceiling
  • RTX PRO 6000: 1,792 GB/s ÷ 35 GB ≈ 51 tok/s
  • Mac Studio M3 Ultra: 819 GB/s ÷ 35 GB ≈ 23 tok/s
  • MacBook Pro M5 Max: 614 GB/s ÷ 35 GB ≈ 18 tok/s
  • DGX Spark: 273 GB/s ÷ 35 GB ≈ 7.8 tok/s
  • RTX 5090: won’t fit (32 GB VRAM; a 70B at Q4 is ~35 GB)

One important caveat keeps this from being the whole story. Prefill — the work of reading a long input prompt before the model starts answering — is compute-bound, not bandwidth-bound. For very long contexts (100K-plus tokens) a high-TOPS NVIDIA card processes input far faster than Apple Silicon: a 128K-token prompt that takes several minutes to ingest on an M5 Max can take seconds on an RTX PRO 6000. So bandwidth governs how fast you read the answer; compute governs how fast the machine reads your question. Most local chat and coding sits in the bandwidth-bound regime; heavy long-document workloads lean on prefill.

02 · Price Brackets · The five brackets, on one decision matrix.

The table below is the heart of this guide. It puts upfront price, memory, bandwidth, real-world tokens per second, and annual electricity cost on a single row per bracket — the combination almost no review assembles in one place. Prices are late-June 2026 street or post-hike figures, not launch MSRPs. For a head-to-head deep dive on the three flagship options, see our DGX Spark vs M5 Max vs RTX 6000 showdown.

Local AI hardware decision matrix 2026, comparing the RTX 5090, MacBook Pro M5 Max, DGX Spark, RTX PRO 6000 and Mac Studio M3 Ultra across street price, memory, bandwidth, 30B and 70B tokens per second, power draw, and annual electricity cost.
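| Bracket | Street price | Memory | Bandwidth | 30B Q4 (tok/s) | 70B Q4 (tok/s) | Power | Elec/yr* |
|---|---|---|---|---|---|---|---|
| RTX 5090 | $3,000–$5,000+ | 32GB GDDR7 | 1,792 GB/s | ~66* | won’t fit† | 575W | ~$201 |
| MacBook Pro M5 Max | from ~$3,899 | up to 128GB | 614 GB/s | ~30–40* | ~12–18* | ~80W | ~$28 |
| NVIDIA DGX Spark | ~$4,699 | 128GB LPDDR5x | 273 GB/s | ~10* | ~5* | ~140W‖ | ~$49 |
| RTX PRO 6000 | ~$12,000–$14,500 | 96GB GDDR7 ECC | 1,792 GB/s | ~68* | ~32 | 600W | ~$210‡ |
| Mac Studio M3 Ultra | ~$5,299 | up to 96GB§ | 819 GB/s | ~40–55* | ~16–22* | ~150W | ~$53 |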

*Community, derived, or vendor benchmarks; real-world is roughly 55 to 70% of the bandwidth ceiling and depends on framework and quantization. †A 70B at Q4 (~35GB) exceeds the 5090’s 32GB VRAM. ‡GPU at 600W; a whole workstation around it draws closer to 920W (~$322/yr). §The 512GB tier was pulled in March 2026 and the 256GB tier in May 2026, leaving 96GB as the maximum purchasable M3 Ultra Mac Studio configuration in late June 2026. ‖The DGX Spark ships with a 240W-rated power supply but draws about 140W in typical use; its electricity figure uses the ~140W draw. Electricity assumes 8 hours/day at the US-average $0.12/kWh; EU rates run roughly 2.5 to 3 times higher.

Read the matrix and the trend jumps out. The two GDDR7 cards share an identical 1,792 GB/s of bandwidth, but only the RTX PRO 6000’s 96GB lets it actually hold a 70B; the 5090’s 32GB does not. The Mac Studio M3 Ultra’s 819 GB/s makes it the fastest Apple path and, notably, faster on a dense 70B than the DGX Spark, which leans on raw compute, though still well behind the RTX PRO 6000’s ~32 tok/sec. And the DGX Spark, despite its price, sits at the bottom of the speed column; its value is elsewhere. The interpretation worth holding onto: in a shortage year, bandwidth-per-dollar and memory capacity have become the metrics that separate these machines, far more than headline compute.
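
Bandwidth-per-dollar is easy to work out from the matrix. The sketch below is illustrative only: it uses the low end of each price range, so it flatters brackets whose listed bandwidth may need a pricier configuration, and the names `BRACKETS` and `bandwidth_per_kilodollar` are ours.

```python
# (low-end street price in USD, memory bandwidth in GB/s), copied from the matrix
BRACKETS = {
    "RTX 5090": (3000, 1792),
    "MacBook Pro M5 Max": (3899, 614),
    "NVIDIA DGX Spark": (4699, 273),
    "RTX PRO 6000": (12000, 1792),
    "Mac Studio M3 Ultra": (5299, 819),
}


def bandwidth_per_kilodollar(price_usd: float, bandwidth_gb_s: float) -> float:
    """GB/s of memory bandwidth bought per $1,000 spent."""
    return bandwidth_gb_s / price_usd * 1000


# Sort brackets from most to least bandwidth per $1,000.
ranking = sorted(
    ((name, bandwidth_per_kilodollar(price, bw)) for name, (price, bw) in BRACKETS.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, value in ranking:
    print(f"{name:<22} {value:6.0f} GB/s per $1,000")
```

On these inputs the RTX 5090 leads by a wide margin and the DGX Spark comes last. The 5090's lead only counts for models that fit in 32GB, which is why capacity has to be read alongside this number.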

03 · Model Fit · What each bracket can actually hold.

Speed is moot if the model does not fit. At Q4, weights take about 0.5GB per billion parameters, so the largest model a device can hold, in billions of parameters, is roughly its memory in GB divided by 0.5, less headroom for the KV cache (which grows with context length) and the operating system. Mixture-of-Experts (MoE) models are the wildcard: they activate only a fraction of their parameters per token, so a 120B MoE needs the memory of a 120B model yet decodes far faster than a dense 120B would.

Which model sizes fit each hardware bracket at Q4, split into small models up to 30B and large models from 70B, including dense and Mixture-of-Experts examples.
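| Bracket (models at Q4) | Small: ≤13B | Small: 30B | Large: 70B dense | Large: 120B MoE | Large: ≥180B |
|---|---|---|---|---|---|
| RTX 5090 (32GB) | Yes | Yes | No | No | No |
| M5 Max (128GB) | Yes | Yes | Yes | Yes | Tight |
| DGX Spark (128GB) | Yes | Yes | Yes | Yes | ~200B |
| RTX PRO 6000 (96GB) | Yes | Yes | Yes | Yes | No |
| Mac Studio M3 Ultra (96GB) | Yes | Yes | Yes | Yes | No |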

The correction to remember is in that bottom row: with the 512GB and 256GB Mac Studio tiers withdrawn during 2026, the M3 Ultra now tops out at 96GB, the same large-model tier as the 96GB RTX PRO 6000 rather than the capacity crown it held a year ago. If your goal is to load the very largest open models at Q4, the 128GB unified devices (the M5 Max and the DGX Spark) now hold more than the M3 Ultra does. If you want the full VRAM-versus-context math behind these fit calls, our companion piece on how much VRAM an LLM really needs works the KV-cache formula in detail.
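
The fit calls above can be approximated with a small check. This is a rough sketch, not the companion piece's method: `overhead_gb` is a placeholder allowance for the runtime and operating system, and the layer and head counts in the example are assumed values for a Llama-3-class 70B with grouped-query attention, not figures from this guide.

```python
def weights_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Weight footprint in GB: parameters times bits, converted to bytes."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size: keys plus values, per layer, per KV head, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


def fits(memory_gb: float, params_billion: float,
         kv_gb: float = 0.0, overhead_gb: float = 2.0) -> bool:
    """True if weights + KV cache + a fixed overhead fit in device memory."""
    return weights_gb(params_billion) + kv_gb + overhead_gb <= memory_gb


# Assumed architecture for illustration: 80 layers, 8 KV heads, head dim 128, fp16 cache.
kv = kv_cache_gb(context_tokens=8192, layers=80, kv_heads=8, head_dim=128)

print(f"KV cache at 8K context: {kv:.2f} GB")
print("RTX 5090, 70B:    ", fits(32, 70, kv))
print("RTX PRO 6000, 70B:", fits(96, 70, kv))
```

Unified-memory machines usually cannot hand all of their RAM to the GPU, so raise `overhead_gb` for those before trusting a borderline result.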

04 · Consumer GPU · RTX 5090: the 30B speed king that can’t hold 70B.

The RTX 5090 is the best value in this guide for models that fit it. It launched at $1,999 in early 2025 and now trades on the street between roughly $3,000 and $5,000-plus, pushed up by the same memory shortage affecting everything else. On a 30B-class model at Q4, a bandwidth-derived estimate puts it around 66 tok/sec (roughly 55% of its theoretical ceiling) — fast, responsive, and more than enough for interactive coding and chat with sub-30B models.

  • VRAM (GDDR7, 512-bit bus): 32GB · ~30B Q4 ceiling
    Plenty for 7B to 30B at Q4 with comfortable context headroom. A 70B at Q4 (~35GB) simply does not fit on one card.
  • Bandwidth (identical to the PRO 6000): 1,792 GB/s · 575W TGP
    Same memory bandwidth as the five-figure workstation card, but a third of the VRAM. Bandwidth was never the 5090's limit; capacity is.
  • 30B decode (bandwidth-derived estimate): 66 tok/s · 30B-class Q4
    Roughly 55% of the 119 tok/sec theoretical ceiling for a 15GB model, which is typical framework overhead. Snappy for everyday local use.

The 70B trap

Do not buy a single RTX 5090 to run 70B models. A 70B at Q4 is about 35GB and the card holds 32GB. Your only options are to drop to Q3 or Q2 (a real quality hit) or offload layers to system RAM across the PCIe bus — which collapses throughput to a few tokens per second. Below 30B the 5090 is excellent; at 70B it is the wrong tool. If you need 70B on one box, the RTX PRO 6000 or an Apple Silicon machine is the honest answer.

05 · Grace Blackwell · DGX Spark: capacity and CUDA, not raw speed.

NVIDIA’s DGX Spark is the most misunderstood box in this lineup. Built on the GB10 Grace Blackwell Superchip, it pairs 128GB of LPDDR5x unified memory with 1 PFLOP of FP4 compute (with sparsity) in a desktop the size of a hardback book. It ships with a 240W-rated power supply, typically draws around 140W, and currently sells for around $4,699. People see “1 PFLOP” and expect blistering speed. But its memory bus runs at 273 GB/s, roughly a sixth of the GDDR7 cards’ 1,792 GB/s, and decode is bandwidth-bound, so a dense 70B is genuinely slow here.

How slow depends entirely on the software stack, and this is the disclosure most buyer guides skip. The two paths look like this:

  • Out of the box (Ollama / GGUF): ~5 tok/s on a dense 70B · what most buyers get
    Install Ollama, pull a 70B, and an independent benchmark measured about 4.67 tok/sec, roughly 60% of the 273 GB/s ÷ 35GB ≈ 7.8 tok/sec ceiling. Usable for a patient single user; not snappy. This is what most buyers actually experience on day one.
  • Optimized (TensorRT-LLM / NVFP4): 8B–20B fly; dense 70B is contested · setup required
    NVIDIA's own published numbers cover 8B, 14B and 20B plus MoE, for example GPT-OSS-20B at ~82.7 tok/sec. Higher dense-70B figures circulate, but they are stack- and model-dependent and a true dense 70B cannot exceed the bandwidth ceiling. Treat any big dense-70B number with suspicion.
  • Capacity play (big models & agents): up to ~200B params · CUDA ecosystem
    The real reason to buy one: load models a 32GB card cannot touch, run NVIDIA's agent and microservice stack, and keep CUDA parity with the datacenter. Capacity and ecosystem, not tokens per second, are the pitch.

Same hardware, a 10x range

The DGX Spark is the clearest example of why you must always ask three questions about any local-LLM benchmark: which framework, which quantization, and which model type (dense or MoE). The same box ranges from a few tokens per second on a dense 70B with Ollama to tens of tokens per second on smaller or MoE models with NVIDIA’s optimized stack. Never read a single hero number as typical. For the full story on running 120B-class agent models on this device, see our NVIDIA DGX Spark and the GB10 superchip deep dive.
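
Those three questions can be baked into how you record a benchmark. The sketch below is one hypothetical way to do it: the `Benchmark` class and its fields are ours, and the second record is an invented claim used only to show the check firing. It compares a reported figure with the bandwidth ceiling and ignores batching and speculative decoding, which can legitimately push numbers higher.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    device: str
    bandwidth_gb_s: float
    framework: str            # which framework?
    quantization: str         # which quantization?
    model_type: str           # dense or MoE?
    gb_read_per_token: float  # weights streamed per token (active experts only for MoE)
    tok_s: float              # reported decode speed

    def ceiling(self) -> float:
        """Bandwidth-bound upper limit on batch-1 tokens/sec."""
        return self.bandwidth_gb_s / self.gb_read_per_token

    def fraction_of_ceiling(self) -> float:
        return self.tok_s / self.ceiling()

    def plausible(self) -> bool:
        """A batch-1 result above the ceiling deserves a second look."""
        return self.fraction_of_ceiling() <= 1.0


# Measured figure quoted in this guide.
measured = Benchmark("DGX Spark", 273, "Ollama", "Q4 GGUF", "dense", 35, 4.67)
# Invented claim, for illustration only.
claimed = Benchmark("DGX Spark", 273, "unspecified", "unspecified", "dense", 35, 20.0)

for b in (measured, claimed):
    verdict = "plausible" if b.plausible() else "exceeds bandwidth ceiling, check the setup"
    print(f"{b.framework}: {b.fraction_of_ceiling():.0%} of ceiling, {verdict}")
```

A record with an empty framework or quantization field is itself a warning sign: the number cannot be compared with anything.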

06 · Workstation GPU · RTX PRO 6000: the fastest real 70B, at a shortage premium.

If you want the fastest single-box 70B inference money can buy right now, this is it. The RTX PRO 6000 Blackwell pairs 96GB of GDDR7 ECC memory with 1,792 GB/s of bandwidth, and independent testing in LM Studio clocked Llama 3.1 70B at 31.84 tok/sec and Llama 3.3 70B at 31.74 — the roughly 32 tok/sec figure this guide quotes. A Gemma 3 27B ran at 68.06 tok/sec on the same card. It is the only workstation GPU on the market with 96GB, which is exactly what lets it hold a 70B in VRAM and run it at full bandwidth instead of crawling through CPU offload.

Its 96GB also unlocks models that simply will not load on consumer hardware. In the same review, an OpenAI GPT-OSS 120B — a Mixture-of-Experts model, so only a fraction of its parameters fire per token — ran at 163.15 tok/sec, a number no 32GB card can reach because it cannot hold the model at all.
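
A 120B MoE can outrun a dense 70B on the same card because only the active experts are streamed per token, while all experts must still sit in memory. A rough sketch, using the guide's 0.5GB-per-billion rule at 4-bit and assuming about 5B active parameters for GPT-OSS 120B (our assumption for illustration, not a figure from the review):

```python
GB_PER_BILLION_Q4 = 0.5  # guide's rule of thumb for 4-bit weights


def memory_needed_gb(total_params_billion: float) -> float:
    """Capacity is set by total parameters: every expert must be resident."""
    return total_params_billion * GB_PER_BILLION_Q4


def moe_decode_ceiling(bandwidth_gb_s: float, active_params_billion: float) -> float:
    """Speed is set by active parameters: only those weights are read per token."""
    return bandwidth_gb_s / (active_params_billion * GB_PER_BILLION_Q4)


BANDWIDTH = 1792        # RTX PRO 6000, GB/s
TOTAL, ACTIVE = 120, 5  # ACTIVE is an assumed value for illustration

print(f"memory needed:        ~{memory_needed_gb(TOTAL):.0f} GB")
print(f"ceiling if dense:     ~{moe_decode_ceiling(BANDWIDTH, TOTAL):.0f} tok/s")
print(f"ceiling as MoE:       ~{moe_decode_ceiling(BANDWIDTH, ACTIVE):.0f} tok/s")
```

Under these assumptions the measured 163.15 tok/sec sits far above what a dense 120B could reach on this card and well below the MoE ceiling, which is what you would expect once routing and attention overheads are included.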

"The standout benchmark was the OpenAI GPT-OSS 120B, which achieved 163.15 tokens per second — a capability unique to the RTX PRO 6000 due to its 96GB memory capacity, as competing consumer GPUs cannot accommodate models of this scale."— StorageReview, independent RTX PRO 6000 review

The catch is price, and it is a 2026 story. The card launched around an $8,565 MSRP, but the same GDDR7 shortage that forced Apple to pull its largest Mac Studio memory tiers has pushed RTX PRO 6000 marketplace listings toward roughly $12,000 to $14,500 by mid-year (NVIDIA Marketplace around $13,250, Newegg around $12,099, B&H around $14,499), even as some retail held nearer $8,500 to $9,200. This is the second face of the shortage narrative: it does not just raise list prices, it widens the gap between MSRP and what you actually pay.
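
To put that gap in numbers, the premium over MSRP for each quoted listing is a one-line calculation. The prices are the ones cited above; the helper name `premium_pct` is ours.

```python
MSRP = 8565  # launch MSRP cited above, USD

LISTINGS = {
    "Newegg": 12099,
    "NVIDIA Marketplace": 13250,
    "B&H": 14499,
}


def premium_pct(street: float, msrp: float = MSRP) -> float:
    """Percentage paid above MSRP."""
    return (street - msrp) / msrp * 100


for seller, price in LISTINGS.items():
    print(f"{seller:<20} ${price:,}  +{premium_pct(price):.0f}% over MSRP")
```

On the cited listings the premium runs from roughly 41% to 69%. Rerun it with the price in front of you on the day you buy.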

One shortage, two symptoms

The 512GB Mac Studio pulled in March, the 256GB pulled in May, Apple’s June 25 across-the-board Mac price hike, the DGX Spark’s $700 bump, and the RTX PRO 6000’s street spike are not separate events — they are the same global DRAM and GDDR7 shortage showing up in different stores. The practical takeaway: memory capacity is the scarce, expensive resource of 2026, and it is exactly what determines which models you can run. The PRO 6000’s whole-workstation draw is also real — about 920W under load, closer to $322/yr in electricity than the GPU-only $210.

07 · Apple Silicon · Apple Silicon: quiet, efficient, and re-priced.

Apple’s unified-memory architecture is a natural fit for local LLMs: the GPU addresses the same large memory pool as the CPU, so a 128GB MacBook Pro can hold models that would otherwise have to be split across four 32GB consumer GPUs. The trade-offs are no CUDA, a thinner advanced-tooling ecosystem, and slower long-context prefill than NVIDIA. But for many users the wins (silence, low power, and the privacy and cost of on-device AI) outweigh them.

  • Portable entry: MacBook Pro M5 Pro (up to 64GB · 307 GB/s)
    From about $2,699 for the 16-inch (pre-hike). Its 307 GB/s caps a dense 70B near 9 tok/sec, so it is really a sub-30B machine, though a very capable, affordable one.
  • Portable max: MacBook Pro M5 Max (up to 128GB · 614 GB/s)
    From about $3,899; a 128GB config ran roughly $5,000-plus at launch and more after June 25. Community estimates put a dense 70B near 12 to 18 tok/sec, at about 80W, near-silent, battery-optional.
  • Desktop bandwidth: Mac Studio M3 Ultra (up to 96GB · 819 GB/s)
    About $5,299 after the June 25 hike. Its 819 GB/s is the fastest Apple-silicon decode here (roughly 16 to 22 tok/sec on a dense 70B) in a silent desktop. But its 512GB and 256GB memory tiers were withdrawn in 2026.

The Mac Studio memory caveat

This is the correction that breaks most older buyer guides. Apple removed the 512GB Mac Studio M3 Ultra option in March 2026 and the 256GB option in May 2026, both attributed to the DRAM shortage — so 96GB is the maximum purchasable M3 Ultra Mac Studio configuration as of late June 2026. Do not plan around a 256GB or 512GB Mac Studio; it is no longer on sale. The M3 Ultra remains the fastest Apple option for 70B decode thanks to its 819 GB/s bandwidth, but it is no longer the capacity king it once was.

08 · Total Cost · The bill after you buy it.

Upfront price is only half the decision; power draw is the recurring half. The spread here is dramatic — an Apple machine sips power while a workstation GPU pulls as much as a space heater. The chart uses a consistent basis: typical device power draw, eight hours a day, at the US-average $0.12/kWh, with each annual figure recomputed from watts directly.

Electricity cost · 8 h/day at $0.12/kWh (US average)

Derived: watts × 2,920 h ÷ 1,000 × $0.12 · EU rates run 2.5–3× higher
  • RTX PRO 6000 (600 W GPU): ~$210/yr · whole workstation ≈ $322
  • RTX 5090 (575 W TGP): ~$201/yr
  • Mac Studio M3 Ultra (≈150 W under load): ~$53/yr
  • DGX Spark (≈140 W typical · 240 W PSU): ~$49/yr
  • MacBook Pro M5 Max (≈80 W whole system): ~$28/yr
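
The derivation is simple enough to recompute for your own usage and tariff. A minimal sketch, with a hypothetical helper name and the guide's 8 hours/day and $0.12/kWh as defaults:

```python
def annual_electricity_cost(watts: float, hours_per_day: float = 8.0,
                            rate_per_kwh: float = 0.12) -> float:
    """Yearly cost: watts -> kWh per year -> currency at the given rate."""
    kwh_per_year = watts * hours_per_day * 365 / 1000
    return kwh_per_year * rate_per_kwh


# Typical draw in watts, as listed in this guide.
DRAW_WATTS = {
    "RTX PRO 6000 (GPU only)": 600,
    "RTX PRO 6000 (whole workstation)": 920,
    "RTX 5090": 575,
    "Mac Studio M3 Ultra": 150,
    "DGX Spark": 140,
    "MacBook Pro M5 Max": 80,
}

for device, watts in DRAW_WATTS.items():
    print(f"{device:<34} ~${annual_electricity_cost(watts):.0f}/yr")
```

Pass a higher `rate_per_kwh` for European tariffs, or `hours_per_day=24` for an always-on box, to see how quickly the spread between brackets widens.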

Two things follow. First, location matters: at European rates of roughly €0.30 to €0.40/kWh, every figure above roughly triples, which can swing a multi-year ownership decision toward the efficient Apple machines. Second, electricity is small next to hardware depreciation in most cases, but for an always-on workstation GPU it is not negligible, and it belongs in a proper total-cost-of-ownership calculation that covers both electricity and hardware depreciation. If you are weighing a hardware purchase against cloud subscriptions at all, the math shifted again with Apple’s June 2026 price hike.

09 · The Decision · How to choose your bracket.

There is no single best machine — there is a best machine for your workload, model size, and tolerance for setup. Match yourself to one of the five profiles below. Before spending anything, though, it is worth deciding whether to self-host open-weight models at all versus renting GPU time for bursty workloads.

  • Sub-30B at top speed → pick the RTX 5090
    Best tokens-per-dollar for 7B to 30B models, at about 66 tok/sec on a 30B at Q4. Skip it if you need a 70B at full quality; 32GB simply will not hold one.
  • Portable & private → pick the MacBook Pro M5 Max
    128GB unified, near-silent, about $28/yr to run, no CUDA. Runs a 70B passably at an estimated 12 to 18 tok/sec (community range). The best laptop for on-device privacy work.
  • Fastest real 70B → pick the RTX PRO 6000 Blackwell
    About 32 tok/sec on a dense 70B, 96GB ECC, and fast long-context prefill. The only card that runs 70B at full VRAM speed, if you can absorb the ~$12k-plus shortage price.
  • Big models & CUDA → pick the NVIDIA DGX Spark
    Capacity and the NVIDIA software stack in a tiny box. Load ~200B-class models and build agents, but accept single-digit dense-70B decode with common tools.
  • Quiet Apple desktop → pick the Mac Studio M3 Ultra
    819 GB/s is the fastest Apple-silicon path to roughly 16 to 22 tok/sec on a 70B, in a silent desktop, now capped at 96GB after the 2026 memory-tier cuts.
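
As a summary, the five profiles can be written down as a simple rule table. This is our own simplification of the guidance above, with hypothetical parameter names; it ignores budget and long-context prefill needs, so treat it as a starting point rather than a verdict.

```python
def pick_bracket(params_billion: float, *, portable: bool = False,
                 needs_cuda: bool = False, quiet: bool = False) -> str:
    """Map a workload to one of the guide's five brackets (models at Q4)."""
    if portable:
        # The only laptop among the five; 128GB unified memory.
        return "MacBook Pro M5 Max"
    if params_billion >= 180:
        # Beyond what the 96GB devices hold in the fit table.
        return "NVIDIA DGX Spark"
    if quiet and not needs_cuda:
        # Silent desktop with the highest Apple bandwidth.
        return "Mac Studio M3 Ultra"
    if params_billion <= 30:
        # Fits in 32GB and gets the full 1,792 GB/s.
        return "RTX 5090"
    # Larger than 30B on NVIDIA: needs the 96GB card.
    return "RTX PRO 6000 Blackwell"


print(pick_bracket(30))
print(pick_bracket(70))
print(pick_bracket(70, portable=True))
print(pick_bracket(70, quiet=True))
print(pick_bracket(200, needs_cuda=True))
```

The order of the checks encodes the priorities: form factor first, then whether the model fits, then noise and ecosystem.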

If you would rather not guess — or you need to validate a purchase against the exact models and prompts your team runs — our AI digital transformation engagements benchmark candidate hardware and open-weight models on your real workloads before you commit budget, and pair the hardware decision with the application and tooling layer that actually puts a local model to work.

10 · Conclusion · Buy bandwidth and capacity, in that order.

The shape of local AI hardware, mid-2026

In a shortage year, bandwidth-per-dollar and memory capacity decide the buy — not TOPS.

Strip away the marketing and local AI hardware reduces to two questions. Does the model fit in memory? And how fast can the chip stream those weights? Memory bandwidth sets your decode ceiling; memory capacity sets which models you can run at all. TOPS, the number plastered on every box, barely moves the needle for single-user inference — it matters for prefill and batching, not for the tokens you watch appear.

Against that frame, the 2026 picks are clear. The RTX 5090 is the sub-30B value champion that cannot hold a 70B. The RTX PRO 6000 is the fastest single-box 70B, if you can pay a shortage-inflated five figures. The DGX Spark buys capacity and CUDA, not speed. And Apple Silicon trades raw throughput for silence, efficiency, and privacy — with the hard caveat that the Mac Studio’s big memory tiers were pulled this year.

Looking forward, the shortage is the variable to watch. As long as DRAM and GDDR7 stay scarce, expect capacity to remain the expensive, rationed resource and bandwidth-per-dollar to keep being the metric that separates these machines. If Apple restores higher memory tiers or prices ease, the capacity calculus shifts again. Until then, the durable advice is simple: size the model first, buy the bandwidth to feed it, and verify today’s street price before you commit — because in this market, last quarter’s number is already wrong.

Match the hardware to the workload, not the spec sheet

Pick the box that runs your model fastest, not the one with the biggest number.

We benchmark candidate GPUs, Apple Silicon, and open-weight models on your real prompts and workloads, then design the local or hybrid inference stack around the result — delivered in days, not quarters.

What we work on

Local & hybrid AI engagements

  • Hardware benchmarking on your models and prompts
  • Local vs cloud TCO and break-even analysis
  • Quantization and serving-stack tuning for throughput
  • On-prem and sovereignty-bound deployment design
  • Hybrid routing across local + frontier models
FAQ · Local AI hardware 2026

The questions we get every week.

What matters most for local LLM decode speed: TOPS or memory bandwidth?

Memory bandwidth. At batch size one, generating each token requires reading every model weight from memory once, so the ceiling on tokens per second is roughly the chip's memory bandwidth divided by the model's size in memory. A device with thousands of TOPS but modest bandwidth will still decode slowly, because the tensor cores spend most of their time waiting on memory. This is why an RTX PRO 6000 and an RTX 5090, which share 1,792 GB/s, have the same decode ceiling on any model that fits both, and why a 273 GB/s DGX Spark is slow on a dense 70B despite its 1 PFLOP of FP4 compute. Buy bandwidth for decode speed; compute mostly matters for prefilling long prompts.