Pick a local AI machine in 2026 and you are really choosing a bottleneck. The NVIDIA DGX Spark, Apple’s MacBook Pro M5 Max, and the NVIDIA RTX PRO 6000 Blackwell each run large language models on a single local machine, but they spend their money on three different constraints, and a spec sheet read at face value will point you at the wrong one.
The number that decides how fast tokens come out is rarely the one in the headline. At batch size 1, decoding is bound by memory bandwidth, not peak compute — so a card advertising thousands of TOPS can still stream tokens at a modest rate. That single fact reorders the whole comparison, and it is where most buyer guides go wrong.
This guide recomputes the figures from vendor specs and independent benchmarks, builds one proprietary comparison table with every derived cell traced back to its formula, keeps the CUDA-versus-MLX question balanced, and ends with a plain verdict: which box for which job. We also flag the 2026 wildcard that matters more than any benchmark — a DRAM shortage that reshaped all three price tags.
- 01. Decode speed is bandwidth-bound, and that bounds everything. At batch size 1, each token requires reading the whole model from memory, so tokens-per-second tracks memory bandwidth more than raw compute. The DGX Spark moves 273 GB/s, the M5 Max 460–614 GB/s, and the RTX PRO 6000 1,792 GB/s.
- 02. The M5 Max holds the bandwidth edge over the Spark; software narrows it. On a same-format dense decode, the 40-core M5 Max (614 GB/s) leads the DGX Spark (273 GB/s) by about 2.25x. NVIDIA's native NVFP4/MXFP4 on Blackwell cuts bytes-per-parameter to about a quarter of FP16 (half of an 8-bit format), which closes or reverses that gap in practice. Benchmark your own model and runtime.
- 03. Treat every tokens-per-second figure as approximate. A dense 70B at 4-bit reads roughly 40GB per token, which means single-digit to low-double-digit tok/s on a 273 GB/s box. A sparse 120B Mixture-of-Experts activates far fewer parameters and decodes several times faster. Numbers only mean something next to a named model, quantization and runtime.
- 04. One DRAM shortage reshaped all three price tags. The same memory squeeze pushed Apple to pull the Mac Studio M3 Ultra's 512GB (March) and 256GB (May) options, leaving 96GB as the 2026 ceiling; added $700 to the DGX Spark; and pushed RTX PRO 6000 listings to roughly $12,000–$14,500, well above its ~$8,565 2025 launch MSRP.
- 05. Match the box to the job, not the spec sheet. M5 Max for a portable, efficient, always-on agent; DGX Spark as a 128GB local CUDA API server for big models and concurrent streams; RTX PRO 6000 for maximum single-card speed and training, if you can absorb 600W and the price.
01 — The Rule
Why local LLM speed is a memory problem, not a compute one.
Generating text with a transformer happens one token at a time, and for a dense model every token requires reading the entire set of model weights out of memory once. At batch size 1 (one user, one stream) there is almost no parallel work to hide that read behind, so the binding constraint is how quickly the chip can stream weights from memory. That is memory bandwidth, measured in gigabytes per second, and it matters far more for decode than the peak FLOPs or TOPS a vendor puts on the box.
This gives you a back-of-envelope ceiling that is genuinely useful: divide memory bandwidth by the model’s size in memory. A 70B model quantized to 4-bit weighs roughly 40GB, so a 273 GB/s device tops out near 7 tokens per second and a 1,792 GB/s card near 45 — before any real-world inefficiency. Those are ceilings, not observed speeds; actual throughput is a fraction of them and shifts with the runtime, the quantization format and the size of the KV cache.
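The back-of-envelope ceiling above can be sketched in a few lines. This is a minimal illustration, not a benchmark: the function name `decode_ceiling_tok_s` is made up for this example, the 40GB footprint is the article's rough figure for a 70B model at 4-bit, and the result is an upper bound that real runtimes will not reach.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode rate, assuming one full weight read per token."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Rough in-memory footprint of a 70B model at 4-bit, as used in the article.
MODEL_70B_4BIT_GB = 40.0

# Memory bandwidth figures quoted in the article, in GB/s.
DEVICES = {
    "DGX Spark": 273.0,
    "M5 Max (32-core)": 460.0,
    "M5 Max (40-core)": 614.0,
    "RTX PRO 6000": 1792.0,
}

for name, bandwidth in DEVICES.items():
    ceiling = decode_ceiling_tok_s(bandwidth, MODEL_70B_4BIT_GB)
    print(f"{name}: ~{ceiling:.1f} tok/s ceiling")
```

Swapping in your own model's footprint (weights plus any per-token KV cache read) is the quickest way to sanity-check a quoted tokens-per-second figure against what the memory system could deliver at best.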
Prefill — processing the prompt you send in — is the mirror image. All input tokens are handled in parallel as matrix-matrix operations, which makes prefill compute-bound and far faster than decode. So a device can be slow at generating yet quick at digesting a long document, which is exactly the pattern that makes the DGX Spark interesting for retrieval-heavy agent work even where its decode is unremarkable.
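One way to see why prefill and decode land on opposite sides of the line is a roofline-style comparison of arithmetic intensity against machine balance. The sketch below rests on stated assumptions: about 2 FLOPs per parameter per token, only weight traffic counted (KV cache and activations ignored), and a placeholder 100 TFLOPS of usable compute that is not a spec for any of the three machines. All function names are hypothetical.

```python
def arithmetic_intensity(tokens_in_flight: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read.

    Each weight is read once per step and reused for every token in flight,
    at roughly 2 FLOPs (multiply + add) per parameter per token.
    """
    return 2.0 * tokens_in_flight / bytes_per_param


def machine_balance(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """FLOPs the chip can execute in the time it takes to stream one byte."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def is_compute_bound(tokens_in_flight: int, bytes_per_param: float,
                     peak_tflops: float, bandwidth_gb_s: float) -> bool:
    """True when the work per byte exceeds what the memory system can feed."""
    return (arithmetic_intensity(tokens_in_flight, bytes_per_param)
            >= machine_balance(peak_tflops, bandwidth_gb_s))


# Placeholder compute figure (an assumption, not a vendor spec) with 273 GB/s bandwidth.
PEAK_TFLOPS = 100.0
BANDWIDTH_GB_S = 273.0
BYTES_PER_PARAM_4BIT = 0.5

for tokens in (1, 64, 2048):
    bound = "compute" if is_compute_bound(
        tokens, BYTES_PER_PARAM_4BIT, PEAK_TFLOPS, BANDWIDTH_GB_S) else "memory"
    print(f"{tokens} tokens in flight: {bound}-bound")
```

With one token in flight (decode) the intensity is a handful of FLOPs per byte, far below the balance point; with a few thousand prompt tokens processed together (prefill) it is far above it.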
02 — The Field
Three machines, three different bets.
These are not three versions of the same thing. One is a laptop, one is a compact CUDA desktop, and one is a 600W workstation GPU. Each optimizes a different part of the bandwidth equation, and that is the real basis for choosing between them. Our deeper write-up on the DGX Spark as a local 120B agent box covers the Spark’s capacity story in more detail.
NVIDIA DGX Spark
A compact CUDA desktop with the memory to load big models but not the bandwidth to race through them. Full TensorRT-LLM and vLLM stack on ARM64, native NVFP4/MXFP4 on Blackwell tensor cores, and a real edge at high concurrency for multi-agent serving. Linux only. $4,699.
MacBook Pro M5 Max
The only contender you can close and carry. Higher raw memory bandwidth than the DGX Spark (460–614 GB/s versus 273 GB/s), mature MLX and llama.cpp/Metal for inference and LoRA, near-silent and battery-friendly. macOS only. From $3,599.
NVIDIA RTX PRO 6000
The only 96GB GDDR7 workstation card, with about 6.6x the DGX Spark's memory bandwidth and the full CUDA training stack. The catch: 600W and a price that climbed from a launch near $8,565 MSRP to roughly $12,000–$14,500 in mid-2026 listings. A density buy, not a generalist one.
03 — Bandwidth
The number that sets the ceiling.
Memory bandwidth is where these machines diverge most sharply. The RTX PRO 6000’s 1,792 GB/s of GDDR7 is roughly 6.6 times the DGX Spark’s 273 GB/s of LPDDR5X — a genuinely large gap that, all else equal, sets a far higher decode ceiling. The MacBook Pro M5 Max sits between them at 460 GB/s on the 32-core GPU or 614 GB/s on the 40-core GPU, both with up to 128GB of unified memory.
[Chart: Memory bandwidth, GB/s (the decode ceiling). Source: NVIDIA and Apple spec sheets, 2026]
Here is the recomputed bandwidth ratio at the heart of this comparison: the 40-core M5 Max moves about 2.25x more memory bandwidth than the DGX Spark (614 versus 273 GB/s). Because dense single-stream decode is bandwidth-bound, that translates into roughly a 2x token-rate ceiling in the Mac’s favor, in a same-format, same-model comparison. Read on a spec sheet alone, the Mac looks like the faster decoder of the two, and on raw memory physics it is.
But raw bandwidth is only half the equation. NVIDIA serves models in native NVFP4 and MXFP4 on Blackwell tensor cores (not emulated), which cuts the bytes read per parameter to roughly a quarter of FP16, or half of an 8-bit format. Fewer bytes per token effectively raises the Spark’s decode ceiling for FP4-served weights, narrowing or reversing the bandwidth gap in practice. The bandwidth-bound principle holds; the format you run in changes what it implies. That is why the next two sections matter as much as this chart.
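To make the format effect concrete, here is a small sketch of how the weight footprint, and with it the decode ceiling, moves with bits per parameter. It uses nominal bit widths only, so it ignores the scale metadata that real block formats such as NVFP4 and MXFP4 carry (one reason the article's 70B 4-bit figure is closer to 40GB than 35GB). The helper names are invented for this example.

```python
# Nominal storage widths in bits per parameter; block-scale metadata is ignored.
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_footprint_gb(params_billions: float, bits_per_param: float) -> float:
    """Nominal weight size in GB: parameters times bytes per parameter."""
    return params_billions * bits_per_param / 8


def ceiling_by_format(bandwidth_gb_s: float, params_billions: float) -> dict:
    """Bandwidth-bound decode ceiling (tok/s) for each nominal format."""
    return {
        fmt: bandwidth_gb_s / weight_footprint_gb(params_billions, bits)
        for fmt, bits in BITS_PER_PARAM.items()
    }


# DGX Spark bandwidth from the article, applied to a dense 70B model.
for fmt, tok_s in ceiling_by_format(273.0, 70.0).items():
    size = weight_footprint_gb(70.0, BITS_PER_PARAM[fmt])
    print(f"{fmt}: {size:.0f} GB of weights, ~{tok_s:.1f} tok/s ceiling")
```

The point of the comparison is relative: on the same 273 GB/s, moving from 16-bit to 4-bit weights lifts the ceiling fourfold, and from 8-bit to 4-bit twofold. A Mac running the same model at 4-bit gets the same reduction in bytes, which is why the outcome depends on the formats each stack actually serves.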
04 — At a Glance
Specs, decode ceiling, and 2026 pricing in one place.
The table below is our proprietary side-by-side. Every figure traces to a vendor spec or an independent benchmark, the decode row is a recomputed ceiling rather than a vendor hero number, and the pricing reflects mid-2026 reality, not launch-day MSRP. Where a number is genuinely contested, we say so rather than print false precision.
| Spec | DGX Spark (GB10) | MacBook Pro M5 Max | RTX PRO 6000 Blackwell |
|---|---|---|---|
| Silicon and memory | | | |
| Memory | 128GB LPDDR5X unified | Up to 128GB unified | 96GB GDDR7 ECC |
| Memory bandwidth | 273 GB/s | 460 GB/s (32-core) / 614 GB/s (40-core) | 1,792 GB/s |
| Peak FP4 compute (sparse) | 1,000 TOPS (1 PFLOP) | Not published | 4,000 TOPS |
| CUDA cores | 6,144 | None (Metal GPU) | 24,064 |
| Power draw | 240W PSU (~140W typical draw) | ~50–100W (laptop SoC) | 600W TGP |
| Throughput and runtime | | | |
| 70B 4-bit decode ceiling | ~7 tok/s | ~15 tok/s (40-core) | ~45 tok/s |
| OS and runtime | Linux — CUDA, TRT-LLM, vLLM | macOS — MLX, llama.cpp/Metal | Windows/Linux — CUDA, TRT-LLM |
| Fine-tuning support | Full (PyTorch CUDA, RLHF/DPO) | LoRA via mlx-lm (RLHF/DPO immature) | Full (PyTorch CUDA) |
| Price (mid-2026) | | | |
| Street price | $4,699 (was $3,999) | From $3,599 (14-inch) | ~$12k–$14.5k listings (launch ~$8,565 MSRP) |
Two cells deserve a footnote. The decode-ceiling row is computed, not measured: it is memory bandwidth divided by roughly 40GB (a 70B model at 4-bit), so 273/40 ≈ 7, 614/40 ≈ 15, and 1,792/40 ≈ 45 tokens per second. Real-world throughput is a fraction of these ceilings and varies by runtime and quantization. We deliberately do not print a single “observed” 70B decode number for the Spark, because that figure is genuinely contested — independent reports for a dense 70B land anywhere from low single digits to low double digits depending on the stack, and higher numbers usually reflect a different model class entirely.
05 — Software
CUDA or MLX: the software is the other half of the spec sheet.
Silicon sets the ceiling; software decides how close you get to it and what you can actually do with the box. Here the two camps are genuinely different, and a fair comparison resists declaring a single winner. Apple’s strength is mature, efficient inference; NVIDIA’s is depth — native low-precision serving and the full training pipeline.
Native low-precision
NVFP4 and MXFP4 run on Blackwell tensor cores (not emulation), cutting bytes-per-parameter to about a quarter of FP16 (half of an 8-bit format). That lowers the per-token weight read and lifts the Spark's effective decode ceiling for FP4-served models, the software lever that partly offsets its lower raw bandwidth.
Inference and LoRA, mature
Ollama's native MLX engine, llama.cpp/Metal and PyTorch MPS cover everyday inference and LoRA fine-tuning well, on the highest raw bandwidth in this class. What is not yet mature on Apple Silicon: full-parameter fine-tuning, RLHF and DPO.
M5 Max vs DGX Spark
On raw memory bandwidth the 40-core M5 Max leads the DGX Spark by about 2.25x (614 vs 273 GB/s), so a same-format dense decode favors the Mac by roughly 2x. NVIDIA narrows or reverses that in practice with native FP4 and a more mature server stack. The honest answer is stack-dependent.
The practical takeaway: if your work is portable inference, prototyping and the occasional LoRA, the Mac’s ecosystem is more than enough and far more pleasant to live with. If you need native FP4 serving, high-concurrency production inference, or serious fine-tuning — RLHF, DPO, full-parameter runs — the CUDA stack on the DGX Spark or RTX PRO 6000 is still the path of least resistance, ARM64 and all. Neither camp is strictly ahead; they are ahead at different things.
06 — Sparsity
Where single-stream benchmarks mislead.
The tokens-per-second numbers everyone quotes are single-stream: one request at a time. For agentic workloads that is the wrong test. When many requests run at once, they read the same model weights from memory together, so the bandwidth cost is shared rather than multiplied. Aggregate throughput climbs far above the single-stream figure. Independent measurement on the DGX Spark puts aggregate output near 2,451 tokens per second at a concurrency of 256, orders of magnitude above its single-stream rate, because the weight-read budget is amortized across streams.
"Memory bandwidth is a budget you spend... when you run two streams simultaneously, you spend the same bandwidth budget reading the same weights, and both streams get the result." — Dendro Logic engineering blog, DGX Spark concurrency benchmark
Sparsity is the second reason a single number deceives. A dense 70B model forces a full read of every weight on every token. A sparse Mixture-of-Experts activates only a fraction of its parameters per token, so it reads far fewer bytes and decodes much faster on identical silicon. NVIDIA reports the 120B GPT-OSS model in MXFP4 at roughly 55 tokens per second single-stream on the DGX Spark — a vendor figure, and crucially an MoE result, not a dense-70B one. Quoting it as if it were a dense 70B number is the most common way these comparisons mislead.
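Both effects in this section, shared weight reads under concurrency and smaller reads under sparsity, fall out of the same bandwidth arithmetic. The sketch below is an idealized, bandwidth-only model: it ignores compute limits and scheduling overhead, so it overstates throughput at high concurrency and is not meant to reproduce the measured figures quoted above. The 5B active-parameter value is a hypothetical placeholder for a sparse model, not a spec for any named model, and the function names are invented.

```python
def bytes_read_gb(active_params_billions: float, bits_per_param: float) -> float:
    """Weights touched per decode step, in GB (nominal bits, no scale metadata)."""
    return active_params_billions * bits_per_param / 8


def aggregate_tok_s(bandwidth_gb_s: float, weights_read_gb: float,
                    streams: int = 1, kv_gb_per_stream: float = 0.0) -> float:
    """Idealized aggregate decode rate across concurrent streams.

    Each step reads the weights once (shared by every stream) plus each
    stream's own KV cache, and emits one token per stream.
    """
    step_gb = weights_read_gb + streams * kv_gb_per_stream
    return streams * bandwidth_gb_s / step_gb


SPARK_GB_S = 273.0                      # bandwidth figure from the article
dense_70b = bytes_read_gb(70.0, 4)      # every weight read on every token
sparse_moe = bytes_read_gb(5.0, 4)      # hypothetical: 5B parameters active per token

print(f"dense, 1 stream:  {aggregate_tok_s(SPARK_GB_S, dense_70b, 1):.1f} tok/s")
print(f"dense, 8 streams: {aggregate_tok_s(SPARK_GB_S, dense_70b, 8):.1f} tok/s")
print(f"dense, 8 streams with 0.5 GB KV each: "
      f"{aggregate_tok_s(SPARK_GB_S, dense_70b, 8, 0.5):.1f} tok/s")
print(f"sparse, 1 stream: {aggregate_tok_s(SPARK_GB_S, sparse_moe, 1):.1f} tok/s")
```

The sparse case is shown single-stream only, because concurrent requests to a Mixture-of-Experts model can activate different experts, so their weight reads are only partly shared.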
07 — The Wildcard
One DRAM shortage, three rewritten price tags.
The biggest 2026 story in this comparison is not a benchmark — it is memory supply. As manufacturers shifted capacity toward high-bandwidth memory (HBM) for data-center AI accelerators, the supply of conventional DRAM and GDDR7 tightened and prices rose. One shortage explains three otherwise unrelated facts.
First, Apple quietly pulled the Mac Studio M3 Ultra’s 512GB option around early March 2026, then removed the 256GB option in May — leaving 96GB as the maximum purchasable M3 Ultra configuration in its store by mid-2026. Second, the DGX Spark took a $700 increase, from $3,999 at its October 2025 launch to $4,699, explicitly attributed to DRAM costs. Third, the RTX PRO 6000 Blackwell, which launched near $8,565 MSRP in March 2025, saw mid-2026 listings climb to roughly $12,000–$14,500 amid the shortage — NVIDIA’s marketplace near $13,250, Newegg near $12,099 and B&H near $14,499, even as some retail held closer to $8,500–$9,200. The same forces are tracked in our note on Apple’s recent local-AI price moves.
At those prices, the cost of memory tells its own story. The DGX Spark works out to about $37 per gigabyte ($4,699 ÷ 128GB), while the RTX PRO 6000 lands near $138 per gigabyte ($13,250 listing ÷ 96GB), roughly 3.8 times more per gigabyte for the privilege of GDDR7’s bandwidth. If your decision hinges on capacity rather than raw speed, that ratio matters as much as any tok/s figure, and it is central to any honest total cost of ownership comparison against renting cloud GPUs.
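The per-gigabyte arithmetic is easy to rerun as listings move. A minimal sketch using the prices and capacities quoted above (the function name is illustrative):

```python
def price_per_gb(price_usd: float, memory_gb: float) -> float:
    """Purchase price divided by on-device memory capacity."""
    return price_usd / memory_gb


spark = price_per_gb(4699, 128)       # DGX Spark, mid-2026 price
rtx = price_per_gb(13250, 96)         # RTX PRO 6000, NVIDIA marketplace listing
ratio = rtx / spark

print(f"DGX Spark:    ${spark:.2f}/GB")
print(f"RTX PRO 6000: ${rtx:.2f}/GB")
print(f"Ratio:        {ratio:.2f}x")
```

Computed from the unrounded prices the ratio comes out near 3.76; dividing the rounded $138 by $37 gives a slightly lower figure.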
08 — Verdict
Which box for which job.
There is no overall winner — there is a right answer per workload. Map your actual job to one of these four lanes, then buy the box that owns it. For teams running fleets of background agents, our notes on on-device agent deployments extend this beyond a single machine.
The only laptop in the race
Need a model that travels, runs near-silent and sips power? The MacBook Pro M5 Max is the sole 128GB portable here, with the highest raw bandwidth in its class for everyday inference and LoRA fine-tuning.
Local CUDA API server
Loading 100B-plus models and serving many concurrent agent streams from one box? The DGX Spark's 128GB, native FP4 and concurrency amortization make it a capable local API server at ~140W typical draw — capacity and prefill over single-stream speed.
Density at a price
If you need the fastest single-card decode and serious fine-tuning headroom and can absorb 600W and a $12k–$14.5k sticker, the RTX PRO 6000's 1,792 GB/s and full CUDA training stack lead. A density buy, not a generalist one.
Skip the 70B premium
Most local workloads fit a 30B model. A 32GB RTX 5090 can't hold a 70B at 4-bit (~40GB) on one card, but it runs a 30B at 4-bit at an estimated 60–90 tok/s (a bandwidth-derived estimate) for a fraction of the price — often the rational pick before any of these three.
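The fit-then-estimate reasoning behind this lane can be sketched as follows. Several inputs are assumptions rather than figures from this article: the 4-bit footprint is scaled linearly from the "70B is about 40GB" rule of thumb, the 2GB headroom for KV cache and runtime is a guess, and the 1,792 GB/s bandwidth for the 32GB card is taken from NVIDIA's published RTX 5090 spec. Function names are hypothetical.

```python
# Scaled from the article's rule of thumb: a 70B model at 4-bit is roughly 40GB.
GB_PER_BILLION_PARAMS_4BIT = 40.0 / 70.0


def footprint_4bit_gb(params_billions: float) -> float:
    """Approximate in-memory size of a dense model at 4-bit."""
    return params_billions * GB_PER_BILLION_PARAMS_4BIT


def fits_on_card(params_billions: float, vram_gb: float,
                 headroom_gb: float = 2.0) -> bool:
    """True if weights plus an assumed headroom (KV cache, runtime) fit in VRAM."""
    return footprint_4bit_gb(params_billions) + headroom_gb <= vram_gb


def decode_ceiling_tok_s(params_billions: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode."""
    return bandwidth_gb_s / footprint_4bit_gb(params_billions)


VRAM_GB = 32.0
ASSUMED_BANDWIDTH_GB_S = 1792.0  # assumption: not stated in this article

for size in (30.0, 70.0):
    fits = fits_on_card(size, VRAM_GB)
    line = f"{size:.0f}B at 4-bit: ~{footprint_4bit_gb(size):.0f} GB, fits={fits}"
    if fits:
        line += f", ceiling ~{decode_ceiling_tok_s(size, ASSUMED_BANDWIDTH_GB_S):.0f} tok/s"
    print(line)
```

Under these assumptions the 30B ceiling is a little over 100 tok/s, so an estimated 60–90 tok/s corresponds to reaching roughly 57% to 86% of the bandwidth limit.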
One discipline ties these lanes together: do not let a single benchmark headline make the call. The right purchase falls out of three concrete questions — how big is the model you must run, how many streams will hit it at once, and what is your power and budget envelope. Answer those honestly and the box chooses itself. If you want that decision pressure-tested against your real workloads, our AI transformation engagements and custom AI development work start with exactly this kind of comparative evaluation.
09 — Conclusion
Size the model to the job, then buy the cheapest box that runs it well.
There is no single winner — there is a right box for each workload.
The three machines answer three different questions. The MacBook Pro M5 Max is the portable, efficient, single-user option with the highest raw bandwidth in its class. The DGX Spark is a 128GB local CUDA API server that prizes capacity, prefill and concurrency over single-stream speed. The RTX PRO 6000 is the bandwidth king for those who need maximum single-card decode and full-stack training and can pay for it.
The deeper lesson is to stop reading spec sheets as if one number decides the race. Decode is bounded by memory bandwidth, but native FP4, Mixture-of-Experts sparsity and runtime maturity move the real result by multiples. A headline tokens-per-second figure without its model, quantization and software stack attached is close to meaningless — which is exactly the nuance a fair local-AI evaluation has to surface.
Looking ahead, the pressure point is memory, not silicon. The same DRAM shortage that erased Apple’s 512GB ceiling and pushed the RTX PRO 6000 into the $12,000–$14,500 listing range will shape 2026 buying more than any benchmark. If an M5 Ultra Mac Studio arrives later this year with very high capacity, the big-memory race reopens; until then, the smart move is to size the model to the job, then buy the cheapest box that runs it well.