AI Development · Decision Matrix · 11 min read · Published June 28, 2026

Memory bandwidth bounds decode · 273 to 1,792 GB/s · stack-dependent in practice

DGX Spark vs M5 Max vs RTX 6000: Local AI Showdown

Three very different machines all promise to run frontier models on your desk. The honest comparison isn’t about who wins — it’s about which bottleneck you’re paying to solve: raw memory bandwidth, big-model capacity, or single-card speed. We recompute the numbers and strip out the spec-sheet theater.

Digital Applied Team
Senior strategists · Published June 28, 2026
Sources: Vendor specs + benchmarks
  • Bandwidth spread: 6.6× (DGX Spark → RTX 6000, 273 → 1,792 GB/s)
  • RTX PRO 6000 listings: $12k–$14.5k (mid-2026 shortage, vs ~$8.5k launch)
  • M3 Ultra Mac Studio max: 96GB (mid-2026, was 512GB, −416 GB)
  • DGX Spark: $4,699 (128GB · 273 GB/s, +$700 DRAM hike)

Pick a local AI machine in 2026 and you are really choosing a bottleneck. The NVIDIA DGX Spark, Apple’s MacBook Pro M5 Max, and the NVIDIA RTX PRO 6000 Blackwell each run large language models on a single local machine, but each spends its budget on a different constraint, and a spec sheet read at face value will point you at the wrong one.

The number that decides how fast tokens come out is rarely the one in the headline. At batch size 1, decoding is bound by memory bandwidth, not peak compute — so a card advertising thousands of TOPS can still stream tokens at a modest rate. That single fact reorders the whole comparison, and it is where most buyer guides go wrong.

This guide recomputes the figures from vendor specs and independent benchmarks, builds one proprietary comparison table with every derived cell traced back to its formula, keeps the CUDA-versus-MLX question balanced, and ends with a plain verdict: which box for which job. We also flag the 2026 wildcard that matters more than any benchmark — a DRAM shortage that reshaped all three price tags.

Key takeaways
  1. Decode speed is bandwidth-bound, and that bounds everything. At batch size 1, each token requires reading the whole model from memory, so tokens-per-second tracks memory bandwidth more than raw compute. The DGX Spark moves 273 GB/s, the M5 Max 460–614 GB/s, and the RTX PRO 6000 1,792 GB/s.
  2. The M5 Max holds the bandwidth edge; software narrows it. On a same-format dense decode, the 40-core M5 Max (614 GB/s) leads the DGX Spark (273 GB/s) by about 2.25x. NVIDIA's native NVFP4/MXFP4 on Blackwell cuts bytes per parameter to roughly a quarter of FP16 (half of 8-bit formats) and closes or reverses that gap in practice. Benchmark your own model and runtime.
  3. Treat every tokens-per-second figure as approximate. A dense 70B at 4-bit reads roughly 40GB per token, which means single-digit to low-double-digit tok/s on a 273 GB/s box. A sparse 120B Mixture-of-Experts activates far fewer parameters and decodes several times faster. Numbers only mean something next to a named model, quantization and runtime.
  4. One DRAM shortage reshaped all three price tags. The same memory squeeze pushed Apple to pull the Mac Studio M3 Ultra's 512GB (March) and 256GB (May) options, leaving 96GB as the 2026 ceiling. It also added $700 to the DGX Spark and pushed RTX PRO 6000 listings to roughly $12,000–$14,500, well above the card's ~$8,565 2025 launch MSRP.
  5. Match the box to the job, not the spec sheet. M5 Max for a portable, efficient, always-on agent; DGX Spark as a 128GB local CUDA API server for big models and concurrent streams; RTX PRO 6000 for maximum single-card speed and training, if you can absorb 600W and the price.

01 · The Rule · Why local LLM speed is a memory problem, not a compute one.

Generating text with a transformer happens one token at a time, and every single token requires reading the entire set of model weights out of memory once. At batch size 1 — one user, one stream — there is almost no parallel work to hide that read behind, so the binding constraint is how quickly the chip can stream weights from memory. That is memory bandwidth, measured in gigabytes per second, and it matters far more for decode than the peak FLOPs or TOPS a vendor puts on the box.

This gives you a back-of-envelope ceiling that is genuinely useful: divide memory bandwidth by the model’s size in memory. A 70B model quantized to 4-bit weighs roughly 40GB, so a 273 GB/s device tops out near 7 tokens per second and a 1,792 GB/s card near 45 — before any real-world inefficiency. Those are ceilings, not observed speeds; actual throughput is a fraction of them and shifts with the runtime, the quantization format and the size of the KV cache.
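As a minimal sketch of that back-of-envelope division (the helper name `decode_ceiling_tok_s` is ours, and 40GB is the rough 70B 4-bit footprint quoted above), the ceiling can be computed like this:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode: one full weight read per token."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


MODEL_GB = 40.0  # rough footprint of a dense 70B model at 4-bit (figure from the text)

for name, bandwidth in [("DGX Spark", 273.0), ("RTX PRO 6000", 1792.0)]:
    ceiling = decode_ceiling_tok_s(bandwidth, MODEL_GB)
    print(f"{name}: ~{ceiling:.1f} tok/s ceiling")
```

The result is an upper bound only; it says nothing about how close a given runtime gets to it.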

Prefill — processing the prompt you send in — is the mirror image. All input tokens are handled in parallel as matrix-matrix operations, which makes prefill compute-bound and far faster than decode. So a device can be slow at generating yet quick at digesting a long document, which is exactly the pattern that makes the DGX Spark interesting for retrieval-heavy agent work even where its decode is unremarkable.
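The decode and prefill asymmetry can be sketched with a roofline-style estimate. This is a rough model under stated assumptions, not a benchmark: it counts about 2 FLOPs per parameter per token, charges one weight read per forward pass, and ignores attention and KV-cache traffic. `SUSTAINED_TFLOPS` is a hypothetical placeholder rather than a vendor figure, and `pass_time_s` is our own helper name.

```python
def pass_time_s(params_b, tokens, model_gb, bandwidth_gb_s, sustained_tflops):
    """Estimate one forward pass over `tokens` tokens as max(compute, memory) time."""
    flops = 2.0 * params_b * 1e9 * tokens                 # ~2 FLOPs per parameter per token
    compute_s = flops / (sustained_tflops * 1e12)
    memory_s = (model_gb * 1e9) / (bandwidth_gb_s * 1e9)  # weights streamed once per pass
    bound = "compute" if compute_s > memory_s else "memory"
    return max(compute_s, memory_s), bound


SUSTAINED_TFLOPS = 100.0  # hypothetical sustained throughput, chosen only for illustration

# Decode: one token per pass. Prefill: a 4,096-token prompt handled in a single pass.
decode_s, decode_bound = pass_time_s(70, 1, 40, 273, SUSTAINED_TFLOPS)
prefill_s, prefill_bound = pass_time_s(70, 4096, 40, 273, SUSTAINED_TFLOPS)

print(f"decode:  {decode_s * 1e3:.0f} ms/token ({decode_bound}-bound)")
print(f"prefill: {prefill_s / 4096 * 1e3:.1f} ms/token ({prefill_bound}-bound)")
```

Under these assumptions the single-token pass is limited by the weight read, while the long-prompt pass is limited by arithmetic and costs far less per token.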

The one rule to carry through this guide
Decode tokens-per-second is bounded by memory bandwidth divided by the bytes you must read per token. You can move that bound two ways: a faster memory bus, or fewer bytes per parameter (4-bit formats, or a sparse Mixture-of-Experts that activates only a slice of its weights). Everything below is an application of that single idea.

02 · The Field · Three machines, three different bets.

These are not three versions of the same thing. One is a laptop, one is a compact CUDA desktop, and one is a 600W workstation GPU. Each optimizes a different part of the bandwidth equation, and that is the real basis for choosing between them. Our deeper write-up on the DGX Spark as a local 120B agent box covers the Spark’s capacity story in more detail.

Capacity play
NVIDIA DGX Spark
GB10 · 128GB LPDDR5X · 273 GB/s · ~140W

A compact CUDA desktop that loads big models the bandwidth can't race through. Full TensorRT-LLM and vLLM stack on ARM64, native NVFP4/MXFP4 on Blackwell tensor cores, and a real edge at high concurrency for multi-agent serving. Linux only. $4,699.

Big-model capacity, not raw speed
Portable play
MacBook Pro M5 Max
Up to 128GB unified · 460–614 GB/s · ~50–100W

The only contender you can close and carry. Highest raw memory bandwidth in its class, mature MLX and llama.cpp/Metal support for inference and LoRA, near-silent and battery-friendly. macOS only. From $3,599.

Single-user, always-on, efficient
Bandwidth king
NVIDIA RTX PRO 6000
96GB GDDR7 ECC · 1,792 GB/s · 600W

The only 96GB GDDR7 workstation card, with about 6.6x the DGX Spark's memory bandwidth and the full CUDA training stack. The catch: 600W and a price that climbed from a launch near $8,565 MSRP to roughly $12,000–$14,500 in mid-2026 listings. A density buy, not a generalist one.

Raw speed at a steep 2026 price

03 · Bandwidth · The number that sets the ceiling.

Memory bandwidth is where these machines diverge most sharply. The RTX PRO 6000’s 1,792 GB/s of GDDR7 is roughly 6.6 times the DGX Spark’s 273 GB/s of LPDDR5X — a genuinely large gap that, all else equal, sets a far higher decode ceiling. The MacBook Pro M5 Max sits between them at 460 GB/s on the 32-core GPU or 614 GB/s on the 40-core GPU, both with up to 128GB of unified memory.

Memory bandwidth, GB/s (the decode ceiling)
Source: NVIDIA and Apple spec sheets, 2026

  • RTX PRO 6000 Blackwell (96GB GDDR7 · 512-bit bus): 1,792
  • MacBook Pro M5 Max, 40-core GPU (up to 128GB unified): 614
  • MacBook Pro M5 Max, 32-core GPU (up to 128GB unified): 460
  • DGX Spark, GB10 (128GB LPDDR5X): 273

Here is the recomputed bandwidth ratio at the heart of this comparison: the 40-core M5 Max moves about 2.25x more memory bandwidth than the DGX Spark (614 versus 273 GB/s). Because dense single-stream decode is bandwidth-bound, that translates into roughly a 2x token-rate ceiling in the Mac’s favor — in a same-format, same-model comparison. Read on a spec sheet alone, the Mac looks like the faster decoder of the two, and on raw memory physics it is.

But raw bandwidth is only half the equation. NVIDIA serves models in native NVFP4 and MXFP4 on Blackwell tensor cores, not emulated, which cuts the bytes read per parameter to roughly a quarter of FP16 (half of an 8-bit format). Fewer bytes per token effectively raises the Spark’s decode ceiling for FP4-served weights, narrowing or reversing the bandwidth gap in practice whenever the other machine runs a wider format. The bandwidth-bound principle holds; the format you run in changes what it implies. That is why the next two sections matter as much as this chart.
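To make the format lever concrete, here is a small sketch that compares bandwidth-bound ceilings across precisions. The 30B dense model is hypothetical (picked so that it fits at 16-bit on both machines), the footprint ignores quantization scale factors and KV cache, and the helper names are ours.

```python
def weights_gb(params_b, bits_per_weight):
    """Approximate weight footprint in GB; ignores scale factors, KV cache and activations."""
    return params_b * bits_per_weight / 8.0


def ceiling_tok_s(bandwidth_gb_s, params_b, bits_per_weight):
    """Bandwidth-bound decode ceiling for a dense model served at a given precision."""
    return bandwidth_gb_s / weights_gb(params_b, bits_per_weight)


PARAMS_B = 30.0  # hypothetical dense model size, in billions of parameters

spark_4bit = ceiling_tok_s(273.0, PARAMS_B, 4)
for bits in (16, 8, 4):
    mac = ceiling_tok_s(614.0, PARAMS_B, bits)
    leader = "Spark" if spark_4bit > mac else "Mac"
    print(f"Spark @ 4-bit {spark_4bit:.1f} vs Mac @ {bits}-bit {mac:.1f} tok/s -> {leader} leads")
```

In this sketch the 4-bit Spark ceiling sits above a 16-bit Mac ceiling, slightly below an 8-bit one, and well below a 4-bit one, which is the same-format gap set by the raw bandwidth ratio.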

04 · At a Glance · Specs, decode ceiling, and 2026 pricing in one place.

The table below is our proprietary side-by-side. Every figure traces to a vendor spec or an independent benchmark, the decode row is a recomputed ceiling rather than a vendor hero number, and the pricing reflects mid-2026 reality, not launch-day MSRP. Where a number is genuinely contested, we say so rather than print false precision.

Side-by-side memory, bandwidth, FP4 compute, power, software ecosystem, a recomputed 70B 4-bit decode ceiling, and mid-2026 pricing for the NVIDIA DGX Spark, Apple MacBook Pro M5 Max, and NVIDIA RTX PRO 6000 Blackwell.
| Spec | DGX Spark (GB10) | MacBook Pro M5 Max | RTX PRO 6000 Blackwell |
|---|---|---|---|
| Silicon and memory | | | |
| Memory | 128GB LPDDR5X unified | Up to 128GB unified | 96GB GDDR7 ECC |
| Memory bandwidth | 273 GB/s | 460 GB/s (32-core) / 614 GB/s (40-core) | 1,792 GB/s |
| Peak FP4 compute (sparse) | 1,000 TOPS (1 PFLOP) | Not published | 4,000 TOPS |
| CUDA cores | 6,144 | None (Metal GPU) | 24,064 |
| Power draw | 240W PSU (~140W typical draw) | ~50–100W (laptop SoC) | 600W TGP |
| Throughput and runtime | | | |
| 70B 4-bit decode ceiling | ~7 tok/s | ~15 tok/s (40-core) | ~45 tok/s |
| OS and runtime | Linux: CUDA, TRT-LLM, vLLM | macOS: MLX, llama.cpp/Metal | Windows/Linux: CUDA, TRT-LLM |
| Fine-tuning support | Full (PyTorch CUDA, RLHF/DPO) | LoRA via mlx-lm (RLHF/DPO immature) | Full (PyTorch CUDA) |
| Price (mid-2026) | | | |
| Street price | $4,699 (was $3,999) | From $3,599 (14-inch) | ~$12k–$14.5k listings (launch ~$8,565 MSRP) |

Two cells deserve a footnote. The decode-ceiling row is computed, not measured: it is memory bandwidth divided by roughly 40GB (a 70B model at 4-bit), so 273/40 ≈ 7, 614/40 ≈ 15, and 1,792/40 ≈ 45 tokens per second. Real-world throughput is a fraction of these ceilings and varies by runtime and quantization. We deliberately do not print a single “observed” 70B decode number for the Spark, because that figure is genuinely contested — independent reports for a dense 70B land anywhere from low single digits to low double digits depending on the stack, and higher numbers usually reflect a different model class entirely.

05 · Software · CUDA or MLX: the software is the other half of the spec sheet.

Silicon sets the ceiling; software decides how close you get to it and what you can actually do with the box. Here the two camps are genuinely different, and a fair comparison resists declaring a single winner. Apple’s strength is mature, efficient inference; NVIDIA’s is depth — native low-precision serving and the full training pipeline.

DGX Spark / CUDA edge
Native low-precision
FP4

NVFP4 and MXFP4 run on Blackwell tensor cores (not emulation), cutting bytes per parameter to roughly a quarter of FP16 (half of an 8-bit format). That lowers the per-token weight read and lifts the Spark's effective decode ceiling for FP4-served models, which is the software lever that partly offsets its lower raw bandwidth.

Plus full TRT-LLM, vLLM, PyTorch CUDA
Apple / Metal edge
Inference and LoRA, mature
MLX

Ollama's native MLX engine, llama.cpp/Metal and PyTorch MPS cover everyday inference and LoRA fine-tuning well, on the highest raw bandwidth in this class. What is not yet mature on Apple Silicon: full-parameter fine-tuning, RLHF and DPO.

Inference strong, deep training thin
The bandwidth ratio
M5 Max vs DGX Spark
2.25×

On raw memory bandwidth the 40-core M5 Max leads the DGX Spark by about 2.25x (614 vs 273 GB/s), so a same-format dense decode favors the Mac by roughly 2x. NVIDIA narrows or reverses that in practice with native FP4 and a more mature server stack. The honest answer is stack-dependent.

Benchmark your model + runtime

The practical takeaway: if your work is portable inference, prototyping and the occasional LoRA, the Mac’s ecosystem is more than enough and far more pleasant to live with. If you need native FP4 serving, high-concurrency production inference, or serious fine-tuning — RLHF, DPO, full-parameter runs — the CUDA stack on the DGX Spark or RTX PRO 6000 is still the path of least resistance, ARM64 and all. Neither camp is strictly ahead; they are ahead at different things.

06 · Sparsity · Where single-stream benchmarks mislead.

The tokens-per-second numbers everyone quotes are single-stream — one request at a time. For agentic workloads that is the wrong test. When many requests run at once, they read the same model weights from memory together, so the bandwidth cost is shared rather than multiplied. Aggregate throughput climbs far above the single-stream figure. Independent measurement on the DGX Spark puts aggregate output near 2,451 tokens per second at a concurrency of 256 — orders of magnitude above its single-stream rate, because the weight-read budget is amortized across streams.
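The amortization argument can be sketched as a toy model. It assumes every stream in a batch shares one weight read per decode step, ignores per-stream KV-cache traffic and scheduling overhead, and applies a compute cap. The 10GB model and the 1,000 tok/s cap are hypothetical values, and the sketch does not attempt to reproduce the measured figure above.

```python
def aggregate_tok_s(batch, model_gb, bandwidth_gb_s, compute_cap_tok_s):
    """Aggregate decode throughput when `batch` streams share one weight read per step."""
    step_s = model_gb / bandwidth_gb_s   # one shared weight read per decode step
    bandwidth_bound = batch / step_s     # each stream receives one token per step
    return min(bandwidth_bound, compute_cap_tok_s)


MODEL_GB = 10.0           # hypothetical weight footprint
COMPUTE_CAP_TOK_S = 1000  # hypothetical ceiling once arithmetic becomes the limit

for batch in (1, 8, 64):
    total = aggregate_tok_s(batch, MODEL_GB, 273.0, COMPUTE_CAP_TOK_S)
    print(f"batch {batch:>2}: ~{total:.0f} tok/s aggregate, ~{total / batch:.1f} per stream")
```

Aggregate throughput scales with batch size until the compute cap takes over, after which adding streams only lowers the per-stream rate.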

"Memory bandwidth is a budget you spend... when you run two streams simultaneously, you spend the same bandwidth budget reading the same weights, and both streams get the result."— Dendro Logic engineering blog, DGX Spark concurrency benchmark

Sparsity is the second reason a single number deceives. A dense 70B model forces a full read of every weight on every token. A sparse Mixture-of-Experts activates only a fraction of its parameters per token, so it reads far fewer bytes and decodes much faster on identical silicon. NVIDIA reports the 120B GPT-OSS model in MXFP4 at roughly 55 tokens per second single-stream on the DGX Spark — a vendor figure, and crucially an MoE result, not a dense-70B one. Quoting it as if it were a dense 70B number is the most common way these comparisons mislead.

Read the asterisk on every tok/s figure
A tokens-per-second number means nothing without the model and runtime that produced it. A dense 70B forces a full ~40GB weight read every token — single-digit to low-double-digit tok/s on a 273 GB/s box. A sparse MoE activates a slice of its parameters, so a 120B MoE can decode several times faster than a dense 70B on the same hardware. Treat every figure as approximate and stack-dependent.
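A short sketch of why sparsity changes the picture: capacity is governed by total parameters, while the per-token read is governed by active parameters. The MoE shape below (120B total, 6B active) is a hypothetical illustration, not the published configuration of any named model, and the footprint counts raw weight bytes only, so the dense 70B case comes out nearer 35GB than the rounded ~40GB used elsewhere.

```python
def moe_profile(total_params_b, active_params_b, bits_per_weight, bandwidth_gb_s):
    """Return (resident_gb, read_gb_per_token, ceiling_tok_s) for bandwidth-bound decode."""
    gb_per_billion = bits_per_weight / 8.0
    resident_gb = total_params_b * gb_per_billion   # every expert must sit in memory
    read_gb = active_params_b * gb_per_billion      # only active weights are read per token
    return resident_gb, read_gb, bandwidth_gb_s / read_gb


dense = moe_profile(70, 70, 4, 273.0)    # dense: all parameters active
sparse = moe_profile(120, 6, 4, 273.0)   # hypothetical MoE shape

for label, (resident, read, ceiling) in [("dense 70B", dense), ("MoE 120B", sparse)]:
    print(f"{label}: {resident:.0f}GB resident, {read:.0f}GB read/token, ~{ceiling:.0f} tok/s ceiling")
```

The larger model needs more memory to load yet reads far less per token, which is why a single tok/s figure cannot be compared across model classes.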

07 · The Wildcard · One DRAM shortage, three rewritten price tags.

The biggest 2026 story in this comparison is not a benchmark — it is memory supply. As manufacturers shifted capacity toward high-bandwidth memory (HBM) for data-center AI accelerators, the supply of conventional DRAM and GDDR7 tightened and prices rose. One shortage explains three otherwise unrelated facts.

First, Apple quietly pulled the Mac Studio M3 Ultra’s 512GB option around early March 2026, then removed the 256GB option in May — leaving 96GB as the maximum purchasable M3 Ultra configuration in its store by mid-2026. Second, the DGX Spark took a $700 increase, from $3,999 at its October 2025 launch to $4,699, explicitly attributed to DRAM costs. Third, the RTX PRO 6000 Blackwell, which launched near $8,565 MSRP in March 2025, saw mid-2026 listings climb to roughly $12,000–$14,500 amid the shortage — NVIDIA’s marketplace near $13,250, Newegg near $12,099 and B&H near $14,499, even as some retail held closer to $8,500–$9,200. The same forces are tracked in our note on Apple’s recent local-AI price moves.

At those prices, the cost of memory tells its own story. The DGX Spark works out to about $37 per gigabyte ($4,699 ÷ 128GB), while the RTX PRO 6000 lands near $138 per gigabyte ($13,250 listing ÷ 96GB), roughly 3.8 times more per gigabyte for the privilege of GDDR7’s bandwidth. If your decision hinges on capacity rather than raw speed, that ratio matters as much as any tok/s figure, and it is central to any honest total cost of ownership comparison against renting cloud GPUs.
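The per-gigabyte arithmetic is easy to rerun as listings move. A minimal sketch, using the prices quoted above (the helper name `usd_per_gb` is ours):

```python
def usd_per_gb(price_usd: float, memory_gb: float) -> float:
    """Price divided by memory capacity."""
    return price_usd / memory_gb


spark = usd_per_gb(4699, 128)   # DGX Spark, mid-2026 price
rtx = usd_per_gb(13250, 96)     # RTX PRO 6000, NVIDIA marketplace listing

print(f"DGX Spark:    ${spark:.2f}/GB")
print(f"RTX PRO 6000: ${rtx:.2f}/GB")
print(f"ratio:        {rtx / spark:.2f}x")
```

Swapping in a different listing price shows how sensitive the ratio is to where the card is bought.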

The Apple big-memory caveat
The Mac Studio that once let you load a 400B-class model on your desk at 512GB now tops out at 96GB for the M3 Ultra as of mid-2026 — Apple’s big-memory advantage has narrowed sharply. Note that the contender in this comparison, the MacBook Pro M5 Max, still reaches 128GB of unified memory. An M5 Ultra Mac Studio is widely expected later in 2026 and would restore high capacity, but Apple has not announced it as of this writing — treat it as rumored, not roadmap.

08 · Verdict · Which box for which job.

There is no overall winner — there is a right answer per workload. Map your actual job to one of these four lanes, then buy the box that owns it. For teams running fleets of background agents, our notes on on-device agent deployments extend this beyond a single machine.

Portable / always-on agent
The only laptop in the race

Need a model that travels, runs near-silent and sips power? The MacBook Pro M5 Max is the sole 128GB portable here, with the highest raw bandwidth in its class for everyday inference and LoRA fine-tuning.

Pick M5 Max
Big models + multi-agent serving
Local CUDA API server

Loading 100B-plus models and serving many concurrent agent streams from one box? The DGX Spark's 128GB, native FP4 and concurrency amortization make it a capable local API server at ~140W typical draw — capacity and prefill over single-stream speed.

Pick DGX Spark
Maximum single-card speed / training
Density at a price

If you need the fastest single-card decode and serious fine-tuning headroom and can absorb 600W and a $12k–$14.5k sticker, the RTX PRO 6000's 1,792 GB/s and full CUDA training stack lead. A density buy, not a generalist one.

Pick RTX PRO 6000
Sub-30B models on a budget
Skip the 70B premium

Most local workloads fit a 30B model. A 32GB RTX 5090 can't hold a 70B at 4-bit (~40GB) on one card, but it runs a 30B at 4-bit at an estimated 60–90 tok/s (a bandwidth-derived estimate) for a fraction of the price — often the rational pick before any of these three.

Consider an RTX 5090

One discipline ties these lanes together: do not let a single benchmark headline make the call. The right purchase falls out of three concrete questions — how big is the model you must run, how many streams will hit it at once, and what is your power and budget envelope. Answer those honestly and the box chooses itself. If you want that decision pressure-tested against your real workloads, our AI transformation engagements and custom AI development work start with exactly this kind of comparative evaluation.
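Those questions can be turned into a rough shortlist filter. The sketch below is our own illustrative encoding, not a sizing tool: it applies only hard constraints (fit, power, budget, portability, fine-tuning depth) using figures from the table above, then orders the survivors by memory bandwidth. The 20% memory headroom is a hypothetical allowance for KV cache and runtime, the prices are entry or low-end listing figures (a 128GB M5 Max costs more than the entry price, and the RTX card needs a host system), and stream count is left out because it needs a measured benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    name: str
    memory_gb: int
    bandwidth_gb_s: int
    power_w: int          # PSU or TGP figure quoted in this guide
    price_usd: int        # entry price or low-end listing quoted in this guide
    portable: bool
    full_finetune: bool   # RLHF, DPO or full-parameter training is mature


BOXES = [
    Box("DGX Spark", 128, 273, 240, 4699, False, True),
    Box("MacBook Pro M5 Max", 128, 614, 100, 3599, True, False),
    Box("RTX PRO 6000", 96, 1792, 600, 12000, False, True),
]

HEADROOM = 0.8  # hypothetical: keep 20% of memory free for KV cache and runtime


def candidates(model_gb, power_budget_w, budget_usd, portable=False, full_finetune=False):
    """Return box names that meet every hard constraint, fastest memory first."""
    ok = [
        b for b in BOXES
        if model_gb <= b.memory_gb * HEADROOM
        and b.power_w <= power_budget_w
        and b.price_usd <= budget_usd
        and (b.portable or not portable)
        and (b.full_finetune or not full_finetune)
    ]
    return [b.name for b in sorted(ok, key=lambda b: b.bandwidth_gb_s, reverse=True)]


print(candidates(model_gb=90, power_budget_w=300, budget_usd=6000))
print(candidates(model_gb=90, power_budget_w=300, budget_usd=6000, full_finetune=True))
```

A shortlist like this only narrows the field; the final choice still depends on benchmarking the actual model and runtime.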

09 · Conclusion · Size the model to the job, then buy the cheapest box that runs it well.

The honest verdict, June 2026

There is no single winner — there is a right box for each workload.

The three machines answer three different questions. The MacBook Pro M5 Max is the portable, efficient, single-user option with the highest raw bandwidth in its class. The DGX Spark is a 128GB local CUDA API server that prizes capacity, prefill and concurrency over single-stream speed. The RTX PRO 6000 is the bandwidth king for those who need maximum single-card decode and full-stack training and can pay for it.

The deeper lesson is to stop reading spec sheets as if one number decides the race. Decode is bounded by memory bandwidth, but native FP4, Mixture-of-Experts sparsity and runtime maturity move the real result by multiples. A headline tokens-per-second figure without its model, quantization and software stack attached is close to meaningless — which is exactly the nuance a fair local-AI evaluation has to surface.

Looking ahead, the pressure point is memory, not silicon. The same DRAM shortage that erased Apple’s 512GB ceiling and pushed the RTX PRO 6000 into the $12,000–$14,500 listing range will shape 2026 buying more than any benchmark. If an M5 Ultra Mac Studio arrives later this year with very high capacity, the big-memory race reopens; until then, the smart move is to size the model to the job, then buy the cheapest box that runs it well.

Run the benchmark, not the brochure

From spec sheets to a running benchmark on your own models.

We help teams cut through spec-sheet theater — benchmarking local and cloud AI hardware on your own models, then standing up the inference or fine-tuning stack that actually fits the workload, delivered in days not quarters.

What we work on

Local AI hardware engagements

  • Decode and prefill benchmarking on your own models
  • Local vs cloud GPU TCO modeling
  • Quantization and FP4 serving for production inference
  • Fine-tuning pipelines — LoRA, RLHF, DPO routing
  • Multi-agent serving and concurrency sizing
FAQ · Local AI hardware

The questions we get every week.

It depends entirely on the model and runtime. At batch size 1, decode is memory-bandwidth-bound, and the 40-core M5 Max moves roughly 2.25x more memory bandwidth than the DGX Spark (614 vs 273 GB/s), so a dense model served in the same format generally decodes faster on the Mac. NVIDIA narrows or reverses that gap by serving weights in native NVFP4/MXFP4 on Blackwell tensor cores, which cuts bytes per parameter to roughly a quarter of FP16 (half of 8-bit formats), and with a more mature server stack such as TensorRT-LLM and vLLM. The Spark also wins decisively on prefill and on high-concurrency serving. So 'slower' is the wrong frame: the Mac leads single-stream dense decode on raw bandwidth, while the Spark leads on capacity, prefill and concurrency. Benchmark your specific model and quantization before deciding.