An AI PC with a 40-plus TOPS NPU clears Microsoft’s Copilot+ bar, but the headline TOPS number tells you almost nothing about whether it can run a local LLM at a usable speed. NPUs are genuinely good at small, sustained on-device models and Windows AI features — they are not the way to run a 70B model. This guide separates the marketing from the silicon.
The confusion is everywhere: buyers see “45 TOPS” or “50 TOPS of AI performance” on a spec sheet and assume it means the laptop can run Llama or Mistral locally. It can run some AI workloads beautifully — and a large language model is usually not one of them, at least not on the NPU. The reason is a metric mismatch that almost no marketing page explains.
Below, we decode TOPS into real tokens per second, show why autoregressive LLM inference is bound by memory bandwidth rather than raw compute, map exactly which Copilot+ features run on the NPU versus which fall back to the iGPU or CPU, and give an honest buy-this-for-that decision matrix. Every spec is traceable to the silicon vendor or a named benchmark.
- 01. The 40 TOPS bar is a features gate, not an LLM rating. Copilot+ certification requires a 40-plus TOPS NPU, 16 GB RAM and a 256 GB SSD on Windows 11 24H2. That unlocks Recall, Live Captions, Studio Effects and Click to Do, not fast local LLMs.
- 02. TOPS is the wrong yardstick for LLM speed. An RTX 4090 advertises 1,300-plus TOPS versus a Copilot+ NPU's 40-80 TOPS, but the comparison that actually predicts tokens per second is memory bandwidth and usable VRAM, not the TOPS figure.
- 03. LLM decode is memory-bandwidth bound. Token generation streams the model's weights from memory each step. Throughput tracks bandwidth divided by model size, which is why nearly doubling an NPU's TOPS barely moves the tokens-per-second ceiling.
- 04. Mainstream runtimes don't even use the NPU. Ollama, llama.cpp and LM Studio route to the iGPU (Vulkan/ROCm/Metal) or CPU. The NPU needs hand-converted ONNX models via QNN or OpenVINO, which is an opt-in specialist path, not a drop-in for GGUF models.
- 05. Buy by job: AI PC for features, GPU box for big LLMs. Pick a Copilot+ AI PC for all-day battery and Windows AI features; pick a discrete-GPU or 96 GB unified-memory machine for 30B-plus inference. The Ryzen AI Max+ 395 is the one laptop-class exception, and it runs big models on the iGPU, not the NPU.
01 — The Copilot+ Bar: What an AI PC actually guarantees.
Microsoft defined the “Copilot+ PC” tier with a hardware floor: a neural processing unit rated at 40-plus TOPS, at least 16 GB of RAM, and a 256 GB SSD, all on Windows 11 24H2. That bar is what unlocks the on-device feature set — Windows Recall, Live Captions with real-time translation, Click to Do, and Windows Studio Effects — all running locally on the NPU rather than in the cloud.
Four silicon families clear that bar today. Their NPU ratings are close to one another and very far from a discrete GPU. Crucially, these are NPU-only numbers; some marketing slides quote a combined platform figure (Intel’s 120 TOPS, for example) that sums the NPU, GPU and CPU. The NPU alone is what runs the Windows AI features.
- Snapdragon X Elite: The Arm-based launch platform for Copilot+. The NPU targets INT8 and INT4 quantized inference, not FP16, with up to 64 GB of shared LPDDR5x at 8533 MT/s (16 GB is a common config, not the ceiling).
- Core Ultra 200V: Lunar Lake. The NPU alone is 47-48 TOPS on the Ultra 7 258V and Ultra 9 288V; the 120 TOPS figure is the combined NPU + GPU + CPU total, so don't read it as the NPU spec.
- Ryzen AI 300: Strix Point. The only NPU supporting the full 50 TOPS in Block FP16, not just INT8, which is useful headroom for higher-precision on-device models.
- Snapdragon X2 Elite: Detailed September 2025 with an 80 TOPS NPU, 152 GB/s bandwidth and up to 128 GB RAM. A real jump on paper, but treat shipping devices and pricing as unconfirmed until a primary source lists them.
02 — The TOPS Myth: Why TOPS is the wrong yardstick.
TOPS — trillions of operations per second — is a theoretical peak-compute figure. It is a fine way to compare two NPUs doing the same fixed-function job. It is a misleading way to predict large language model speed, for two compounding reasons.
First, the absolute gap is enormous. A discrete RTX 4090 advertises 1,300-plus TOPS; a Copilot+ NPU sits at 40-80 TOPS. That is roughly a 29× difference against a 45 TOPS NPU, and still about 16× against the 80 TOPS X2 Elite. If TOPS were the LLM metric, an AI PC would be a non-starter. But that framing is itself a trap because, second, TOPS is not how LLM speed is measured at all.
[Chart: Raw TOPS, discrete GPU vs Copilot+ NPU. Source: NVIDIA, Qualcomm, Intel and AMD vendor specifications.]

This is the two-stage debunk at the heart of every AI PC purchase decision. Stage one: the NPU’s TOPS is tiny next to a GPU. Stage two: it does not matter, because TOPS is the wrong metric for both of them. What you actually want to know is how fast the chip can stream a model’s weights out of memory, and that is a bandwidth question, which the next section makes concrete. If you are weighing an AI PC against a Mac or a GPU box on cost, our local-AI hardware buyer’s guide by price bracket runs the same numbers across every platform.
03 — Bandwidth Bound: Memory bandwidth, not TOPS, sets the speed.
Autoregressive token generation works one token at a time, and each step has to read the model’s weights out of memory. For single-stream decoding, the bottleneck is almost never compute — it is how fast those weights can be streamed. Throughput is bound by memory bandwidth divided by the model size in memory. A 70B model in 4-bit quantization occupies roughly 40 GB and needs substantial sustained bandwidth just to reach single-digit tokens per second. The NPU’s TOPS rating never enters this equation.
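The two quantities in that paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the footprint is weights-only (the article's "roughly 40 GB" for a 4-bit 70B model includes quantization metadata and runtime overhead on top of this figure), and the bandwidth estimate assumes every generated token streams the full model from memory exactly once.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only size in GB: billions of parameters times bytes per weight."""
    return params_billions * bits_per_weight / 8


def min_bandwidth_gb_s(model_gb: float, target_tok_s: float) -> float:
    """Bandwidth needed if each token requires one full read of the model."""
    return model_gb * target_tok_s


# 70B parameters at 4 bits per weight, before any overhead.
print(weight_footprint_gb(70, 4))   # 35.0

# Sustaining 5 tok/s on a ~40 GB model already needs about 200 GB/s.
print(min_bandwidth_gb_s(40, 5))    # 200
```

Neither function takes a TOPS argument, which mirrors the point of the paragraph above.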
The table below makes that vivid. The decode ceiling column is computed only from memory bandwidth and model size — TOPS is deliberately left out, because it does not appear in the math. These are theoretical ceilings; real-world output lands below them once runtime overhead, the KV cache and prompt processing are included.
| Platform | Quoted compute | Memory bandwidth | 8B 4-bit ceiling | Reality check |
|---|---|---|---|---|
| Snapdragon X Elite (NPU) | 45 TOPS | 135 GB/s | ≈30 tok/s | NPU itself caps near ~4B models; an 8B runs on the iGPU, not the NPU. |
| Snapdragon X2 Elite (NPU) | 80 TOPS | 152 GB/s | ≈34 tok/s | Nearly double the TOPS, only ~13% more bandwidth — so only ~13% more decode headroom. |
| Ryzen AI Max+ 395 (unified, iGPU) | 50 TOPS NPU | ≈256 GB/s | ≈57 tok/s | The 96 GB-capable iGPU — not the NPU — is what runs 70B at ~14 tok/s (Q4). |
| RTX 4090 (discrete GPU) | 1,300+ TOPS | ≈1,008 GB/s (GDDR6X) | ≈224 tok/s | ~29× a 45 TOPS NPU's rating, but real tok/s tracks bandwidth and the 24 GB of VRAM, not the TOPS headline. |
The math is simple and worth doing by hand. Decode ceiling ≈ memory bandwidth ÷ ~4.5 GB (an 8B model at 4-bit). At 135 GB/s the Snapdragon X Elite tops out near 30 tok/s; at 152 GB/s the X2 Elite reaches about 34. That is the punchline: the X2 Elite nearly doubles the NPU’s TOPS (45 to 80) yet lifts the bandwidth ceiling only about 13%, because bandwidth rose only 13%. TOPS climbed; real LLM headroom barely budged. The Ryzen AI Max+ 395, with roughly 256 GB/s from a 256-bit LPDDR5X-8000 bus, clears 57 tok/s on the same workload — not because its NPU is bigger, but because its memory is wider.
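The same hand calculation can be written as a short script. This is a minimal sketch using only the bandwidth and TOPS figures quoted above; the function and variable names are hypothetical, and the result is a theoretical ceiling, not a benchmark.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode: one full weight read per token."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 4.5  # an 8B model at 4-bit, as used in the article

# name: (quoted NPU TOPS, memory bandwidth in GB/s)
PLATFORMS = {
    "Snapdragon X Elite": (45, 135),
    "Snapdragon X2 Elite": (80, 152),
    "Ryzen AI Max+ 395": (50, 256),
}

for name, (tops, bandwidth) in PLATFORMS.items():
    ceiling = decode_ceiling_tok_s(bandwidth, MODEL_GB)
    print(f"{name}: {tops} TOPS, {bandwidth} GB/s -> {ceiling:.1f} tok/s ceiling")
# Snapdragon X Elite: 45 TOPS, 135 GB/s -> 30.0 tok/s ceiling
# Snapdragon X2 Elite: 80 TOPS, 152 GB/s -> 33.8 tok/s ceiling
# Ryzen AI Max+ 395: 50 TOPS, 256 GB/s -> 56.9 tok/s ceiling

tops_gain = 80 / 45 - 1
ceiling_gain = 152 / 135 - 1
print(f"X2 Elite vs X Elite: TOPS +{tops_gain:.0%}, ceiling +{ceiling_gain:.0%}")
# X2 Elite vs X Elite: TOPS +78%, ceiling +13%
```

The TOPS column is carried along only for display; it never enters `decode_ceiling_tok_s`.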
04 — The NPU's Real Job: What the NPU is genuinely good at.
None of this means the NPU is useless — the opposite. It is a fixed-function accelerator built for small, sustained AI tasks at very low power, and it does that job better than a CPU or GPU could. The flagship example is Phi Silica, Microsoft’s on-device small language model, preinstalled on the NPU of every Copilot+ PC and based on Phi-3.5-mini.
- Phi Silica · short prompts: Time-to-first-token for short prompts on the NPU. Fast enough to feel instant inside Word and Outlook Rewrite and Summarize. Vendor-stated.
- Phi Silica · sustained: Up to ~20 tokens per second for on-device generation, with a 2k context window (4k planned). Vendor-stated; Microsoft has not published independent third-party benchmarks.
- Lower draw than CPU: Phi Silica draws roughly 56% less power than equivalent CPU inference, which is why the NPU runs these features without spinning the fan or draining the battery.
- Mainstream NPU runtimes: NPU-optimized models on Snapdragon X Elite top out around 4 billion parameters (Llama-3B, Phi4-mini, Qwen3-4B) via specialist SDKs, not the standard GGUF runtimes.
That ~4B ceiling is the key. The NPU is brilliant at small language models and narrow vision and audio tasks — and small models are exactly where most on-device agent work belongs. Our small language model business guide covers why a 3-9B model like Phi or Qwen handles the majority of real-world steps, and the NPU is the most power-efficient place to run the smallest of them. A reviewer testing NPU acceleration captured the experience well.
"The fan didn't even spin up during audio processing" (XDA Developers, hands-on Snapdragon X Elite NPU test)
That silent, low-power profile is the whole point of the NPU. The table below separates the Windows AI features that genuinely run on the NPU from the open LLMs that buyers assume run there but do not — the single most common misconception in this category.
| Feature / workload | Runs on NPU? | Needs Copilot+? | Model path | Approx speed | Power profile |
|---|---|---|---|---|---|
| Windows Recall | Yes — NPU | Yes | ONNX (built-in) | Continuous indexing | Low · silent |
| Live Captions (real-time translation) | Yes — NPU | Yes | ONNX (built-in) | Real-time | Low · silent |
| Windows Studio Effects (blur, eye contact) | Yes — NPU | Yes | ONNX (built-in) | Real-time | Low · silent |
| Click to Do (preview) | Yes — NPU | Yes | ONNX · Phi Silica | Interactive | Low |
| Phi Silica Rewrite / Summarize (Word, Outlook) | Yes — NPU | Yes | ONNX · Phi-3.5-mini | Up to ~20 tok/s (vendor-stated) | ~56% less than CPU |
| Llama 3 8B via Ollama | No — iGPU / CPU | No | GGUF | Depends on iGPU | Moderate · fan |
| Mistral 7B via llama.cpp | No — iGPU / CPU | No | GGUF | Depends on iGPU | Moderate · fan |
| Llama 3.3 70B (quantized) | No — discrete GPU or 96 GB unified | No | GGUF | ~14 tok/s on Strix Halo iGPU | High · sustained |
05 — The Hidden Blocker: Why your runtime ignores the NPU.
Here is the detail that kills most “I’ll just run Llama on my NPU” plans before they start. As of mid-2026, the mainstream local-LLM runtimes — Ollama, llama.cpp and LM Studio — do not route workloads to the NPU at all. They run on the integrated GPU (via Vulkan, ROCm or Metal) or the CPU. Those tools load GGUF-format models; the NPU needs models hand-converted to ONNX and compiled for the vendor’s execution provider — QNN for Qualcomm, OpenVINO for Intel.
That conversion path exists and works, but it is a specialist, opt-in pipeline rather than a drop-in accelerator. Qualcomm’s AI Hub publishes 175-plus pre-optimized ONNX models validated for Snapdragon X Elite, and SDKs such as Nexa enable NPU acceleration for a curated set — Llama-3B, Phi4-mini, Qwen3-4B and an OmniNeural-4B multimodal model. But if you download a GGUF off Hugging Face and point Ollama at it, the NPU sits idle.
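A quick way to see which of the two paths a downloaded model file belongs to is to inspect it. The sketch below is a hypothetical helper, not part of any runtime: it relies on the fact that GGUF files begin with the four ASCII bytes `GGUF`, and it falls back to the `.onnx` file extension for ONNX models, which is only a naming convention.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def model_format(path: str) -> str:
    """Classify a model file as 'gguf', 'onnx' or 'unknown'."""
    file = Path(path)
    with file.open("rb") as handle:
        header = handle.read(4)
    if header == GGUF_MAGIC:
        return "gguf"   # loads in Ollama, llama.cpp, LM Studio (iGPU or CPU)
    if file.suffix.lower() == ".onnx":
        return "onnx"   # candidate for an NPU execution provider
    return "unknown"
```

A result of `"onnx"` only means the file is a candidate for the NPU; per the paragraph above, it still has to be compiled for the vendor's execution provider.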
06 — The Honest Exception: One laptop chip that does run 70B.
There is one platform that breaks the “AI PCs can’t run big models” rule — and it proves the point precisely because of how it does it. The AMD Ryzen AI Max+ 395 (Strix Halo) pairs 128 GB of LPDDR5X-8000 on a 256-bit bus with a 40-CU RDNA 3.5 integrated GPU. Up to 96 GB of that memory can be allocated to the GPU, which is enough to hold a 70B model.
- LPDDR5X-8000 · 256-bit: Shared between the CPU and a 40-CU RDNA 3.5 iGPU on a 256-bit bus, giving roughly 256 GB/s of bandwidth, far wider than a standard Copilot+ laptop.
- Enough to hold 70B: Up to 96 GB can be assigned to the iGPU, which is what lets a quantized 70B model fit and run entirely on-device, with no discrete card required.
- Quantized · iGPU path: Around 14 tokens per second quantized, or roughly 5 tok/s at BF16. Both are vendor and community figures for this unified-memory APU, and both sit above what raw bandwidth alone predicts: 256 GB/s ÷ 40 GB is about 6 tok/s for a 4-bit 70B model, and 70B weights at 16 bits come to about 140 GB, more than the 96 GB allocation. Treat them as unverified. Whatever the real number, it is delivered by the iGPU and wide memory, not the NPU.
- Workstation-class: This is a premium workstation laptop SoC in machines like the Asus ProArt, not a $1,000-1,400 Copilot+ consumer laptop. Price it against a Mac Studio M4 Max, not a thin-and-light.
Read that carefully: the Ryzen AI Max+ 395 runs 70B because of wide unified memory and a capable iGPU, not because of its 50 TOPS NPU. It is the honest answer to “can an AI PC run a big LLM?”: yes, if you buy a $2,000-plus workstation-class machine whose value is memory capacity, and you accept single-digit to low-double-digit tokens per second. If running larger models on a laptop is the goal, the Gemma 12B on a laptop guide and our DGX Spark vs M5 Max vs RTX 6000 comparison walk the same unified-memory-versus-discrete-GPU trade-off in depth.
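The capacity half of that argument can be sketched as a simple fit check. The helper below is hypothetical, and the 8 GB reserve for the KV cache and runtime buffers is an assumed placeholder rather than a measured figure; the 96 GB allocation and the ~40 GB 4-bit 70B footprint are the article's own numbers.

```python
def fits_on_igpu(model_gb: float,
                 allocatable_gb: float = 96.0,
                 reserve_gb: float = 8.0) -> bool:
    """True if the weights plus an assumed reserve fit in iGPU-assignable memory."""
    return model_gb + reserve_gb <= allocatable_gb


# 70B at 4-bit, roughly 40 GB per the article: fits with room to spare.
print(fits_on_igpu(40))    # True

# 70B at 16 bits is 70 * 16 / 8 = 140 GB of weights: does not fit in 96 GB.
print(fits_on_igpu(140))   # False
```

This is why the platform's large-model story depends on quantization as much as on the memory pool itself.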
07 — The Platform Shift: Windows AI Foundry and the loosening NPU-only frame.
Microsoft’s own platform direction is the strongest signal that the NPU-only framing was always too narrow. At Build 2025, Microsoft announced Windows AI Foundry, the evolution of the Windows Copilot Runtime — a unified layer that selects, optimizes, fine-tunes and deploys AI models across NPU, GPU and CPU, spanning AMD, Intel, NVIDIA and Qualcomm silicon, and integrating Foundry Local with open catalogs including Ollama and NVIDIA NIMs.
Underneath it, Windows ML automatically selects the correct execution provider — QNN for Qualcomm NPUs, OpenVINO for Intel NPUs — and falls back to the GPU or CPU when an NPU provider is unavailable. Developers no longer bundle execution providers by hand. The architecture quietly concedes that the right engine depends on the workload, not on a single 40 TOPS threshold.
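The fallback behaviour described above can be illustrated with a hand-rolled preference list. This is not Windows ML's actual selection logic, which is not shown in this article; it is a sketch of the same idea using ONNX Runtime's execution-provider names, with the ordering and the `pick_providers` helper being assumptions.

```python
# Preference order: Qualcomm NPU, Intel (OpenVINO), DirectML GPU, then CPU.
PREFERRED_PROVIDERS = [
    "QNNExecutionProvider",
    "OpenVINOExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available: list[str]) -> list[str]:
    """Keep preferred providers that are actually available, CPU as last resort."""
    chosen = [name for name in PREFERRED_PROVIDERS if name in available]
    return chosen or ["CPUExecutionProvider"]


# On a Snapdragon machine with the QNN provider installed:
print(pick_providers(["CPUExecutionProvider", "QNNExecutionProvider"]))
# ['QNNExecutionProvider', 'CPUExecutionProvider']
```

With ONNX Runtime installed, `onnxruntime.get_available_providers()` can supply the `available` list, and the result can be passed as the `providers` argument of `onnxruntime.InferenceSession`. Provider-specific options, such as the OpenVINO device type, are omitted here.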
That trajectory matters for buyers planning two or three years ahead. The NPU’s role is consolidating around what it does best (efficient, always-on small-model and feature inference) while the heavy LLM work migrates to whichever engine has the memory bandwidth to feed it. Expect future Windows AI features to lean on a hybrid NPU-plus-GPU split rather than the NPU alone, and expect the marketing to keep quoting TOPS long after the platform has stopped treating it as the deciding number.
08 — What To Buy: The honest buy-this-for-that matrix.
Match the machine to the job. The mistake is buying an AI PC expecting a local LLM box, or skipping one because “NPUs can’t run LLMs” when the Windows AI features are exactly what you wanted. Four clear lanes.
- Buy a Copilot+ AI PC: If you want all-day battery, Recall, Live Captions, Studio Effects and on-device Rewrite and Summarize, a 40-plus TOPS Copilot+ laptop is exactly right. Just don't expect it to be your 30B inference box.
- Buy a discrete-GPU box: For interactive 30B-plus inference, a discrete GPU with ample VRAM and GDDR bandwidth wins decisively. This is the path mainstream runtimes are actually built for: GGUF in, fast tokens out.
- Buy Strix Halo unified memory: The Ryzen AI Max+ 395 with 96 GB allocatable to the iGPU runs a quantized 70B at ~14 tok/s on-device. Workstation-class price (~$2k+), and the work happens on the iGPU, not the NPU.
- Lean on Phi Silica and SLMs: For small, private, always-on agent steps at minimal power, the NPU path through Phi Silica and 4B-class ONNX models is the most efficient option on a Copilot+ PC: silent, fast-to-first-token, battery-friendly.
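The four lanes can be expressed as a small decision function. This is an illustrative sketch, not a buying tool: the `recommend` helper and its parameters are hypothetical, the 4B and 30B thresholds come from the figures in this guide, and the middle band (models between 4B and 30B running through a GGUF runtime on the iGPU or CPU) is an assumption drawn from the earlier sections rather than one of the four lanes.

```python
def recommend(largest_model_b: float,
              wants_windows_ai: bool,
              needs_laptop: bool) -> list[str]:
    """Map a workload to the buying lanes described in this guide."""
    lanes = []
    if wants_windows_ai:
        lanes.append("Copilot+ AI PC")
    if largest_model_b >= 30:
        # 30B-plus: discrete GPU, or Strix Halo when it has to be a laptop.
        lanes.append("Strix Halo unified memory" if needs_laptop
                     else "Discrete-GPU box")
    elif largest_model_b <= 4:
        # Small models are where the NPU path is most efficient.
        lanes.append("NPU path (Phi Silica or 4B-class ONNX) on a Copilot+ PC")
    else:
        # Assumption: mid-size models go through a GGUF runtime, not the NPU.
        lanes.append("GGUF runtime on iGPU or CPU")
    return lanes


print(recommend(70, wants_windows_ai=True, needs_laptop=False))
# ['Copilot+ AI PC', 'Discrete-GPU box']
```

Returning a list rather than a single answer reflects that one buyer can land in more than one lane.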
For most teams, the answer is “both, for different reasons”: a Copilot+ AI PC for the laptop fleet’s battery life and Windows AI features, and a separate GPU or unified-memory machine where heavier local inference actually happens. If you are mapping local versus cloud economics across a fleet, our local AI versus cloud subscription ROI analysis frames the same trade-off in dollars, and our AI transformation engagements start with exactly this kind of hardware-and-workload mapping.
09 — Conclusion: The number on the box is the wrong number.
An NPU is a features chip, not a 70B box — and that's fine.
The single most useful thing to internalize before buying an AI PC in 2026 is that the headline TOPS figure answers a different question than the one you are asking. It tells you the laptop will run Windows AI features locally at low power. It tells you nothing about how fast a local LLM will run, because LLM decode is bound by memory bandwidth, and the mainstream runtimes do not even touch the NPU.
Buy a Copilot+ AI PC for what it is genuinely excellent at: all-day battery, silent on-device features, and small sustained models like Phi Silica. Buy a discrete-GPU box, or a wide-memory machine like the Ryzen AI Max+ 395, when 30B-plus inference is the job. The two are not substitutes, and the spec sheet will not tell you which one you are looking at unless you know to read past the TOPS.
The broader shift is already underway. Microsoft’s own Windows AI Foundry is opening from NPU-only toward GPU and CPU paths, conceding that one threshold can’t carry every workload. The next generation of AI PCs will be judged less by a single TOPS number and more by the boring metrics that actually decide local AI speed — memory bandwidth, usable capacity, and which engine your software will really use.