An AI PC with a 40-plus TOPS NPU clears Microsoft’s Copilot+ bar, but the headline TOPS number tells you almost nothing about whether it can run a local LLM at a usable speed. NPUs are genuinely good at small, sustained on-device models and Windows AI features — they are not the way to run a 70B model. This guide separates the marketing from the silicon.

The confusion is everywhere: buyers see “45 TOPS” or “50 TOPS of AI performance” on a spec sheet and assume it means the laptop can run Llama or Mistral locally. It can run some AI workloads beautifully — and a large language model is usually not one of them, at least not on the NPU. The reason is a metric mismatch that almost no marketing page explains.

Below, we decode TOPS into real tokens per second, show why autoregressive LLM inference is bound by memory bandwidth rather than raw compute, map exactly which Copilot+ features run on the NPU versus which fall back to the iGPU or CPU, and give an honest buy-this-for-that decision matrix. Every spec is traceable to the silicon vendor or a named benchmark.

Key takeaways

01
The 40 TOPS bar is a features gate, not an LLM rating. Copilot+ certification requires a 40-plus TOPS NPU, 16 GB RAM and a 256 GB SSD on Windows 11 24H2. That unlocks Recall, Live Captions, Studio Effects and Click to Do, not fast local LLMs.
02
TOPS is the wrong yardstick for LLM speed. An RTX 4090 advertises 1,300-plus TOPS versus a Copilot+ NPU's 40-80 TOPS, but what actually predicts tokens per second is memory bandwidth and usable VRAM, not the TOPS figure.
03
LLM decode is memory-bandwidth bound. Token generation streams the model's weights from memory each step. Throughput tracks bandwidth divided by model size, which is why nearly doubling an NPU's TOPS barely moves the tokens-per-second ceiling.
04
Mainstream runtimes don't even use the NPU. Ollama, llama.cpp and LM Studio route to the iGPU (Vulkan/ROCm/Metal) or CPU. The NPU needs hand-converted ONNX models via QNN or OpenVINO, an opt-in specialist path rather than a drop-in for GGUF models.
05
Buy by job: AI PC for features, GPU box for big LLMs. Pick a Copilot+ AI PC for all-day battery and Windows AI features; pick a discrete-GPU or 96 GB unified-memory machine for 30B-plus inference. The Ryzen AI Max+ 395 is the one laptop-class exception, and it runs on the iGPU, not the NPU.

01 — The Copilot+ Bar
What an AI PC actually guarantees.

Microsoft defined the “Copilot+ PC” tier with a hardware floor: a neural processing unit rated at 40-plus TOPS, at least 16 GB of RAM, and a 256 GB SSD, all on Windows 11 24H2. That bar is what unlocks the on-device feature set — Windows Recall, Live Captions with real-time translation, Click to Do, and Windows Studio Effects — all running locally on the NPU rather than in the cloud.

Four silicon families clear that bar today. Their NPU ratings are close to one another and very far from a discrete GPU. Crucially, these are NPU-only numbers; some marketing slides quote a combined platform figure (Intel’s 120 TOPS, for example) that sums the NPU, GPU and CPU. The NPU alone is what runs the Windows AI features.

Qualcomm

Snapdragon X Elite

45 TOPS (INT8) · 135 GB/s

The Arm-based launch platform for Copilot+. The NPU targets INT8 and INT4 quantized inference — not FP16 — with up to 64 GB of shared LPDDR5x at 8533 MT/s (16 GB is a common config, not the ceiling).

2024 launch

Intel

Core Ultra 200V

47-48 TOPS NPU · 120 total

Lunar Lake. The NPU alone is 47-48 TOPS on the Ultra 7 258V and Ultra 9 288V; the 120 TOPS figure is the combined NPU + GPU + CPU total — don't read it as the NPU spec.

Sep 2024 launch

AMD

Ryzen AI 300

50 TOPS (XDNA 2)

Strix Point. The only NPU supporting the full 50 TOPS in Block FP16, not just INT8 — useful headroom for higher-precision on-device models.

Jul 2024 launch

Next-gen

Snapdragon X2 Elite

80 TOPS · 152 GB/s

Detailed September 2025: 80 TOPS NPU, 152 GB/s bandwidth, up to 128 GB RAM. A real jump on paper — but treat shipping devices and pricing as unconfirmed until a primary source lists them.

Announced Sep 2025

Read the spec carefully

The NPU number and the “total AI TOPS” number are not the same thing. Intel’s Core Ultra 200V is a 47-48 TOPS NPU; the 120 TOPS you may see is the combined NPU, GPU and CPU. Conflating them overstates the NPU by roughly 2.5×. For Windows AI features, the NPU-only figure is the one that matters.

02 — The TOPS Myth
Why TOPS is the wrong yardstick.

TOPS — trillions of operations per second — is a theoretical peak-compute figure. It is a fine way to compare two NPUs doing the same fixed-function job. It is a misleading way to predict large language model speed, for two compounding reasons.

First, the absolute gap is enormous. A discrete RTX 4090 advertises 1,300-plus TOPS; a Copilot+ NPU sits at 40-80 TOPS. Measured against a 45 TOPS NPU, that is roughly a 29× difference. If TOPS were the LLM metric, an AI PC would be a non-starter. But that framing is itself a trap because, second, TOPS is not how LLM speed is measured at all.

Raw TOPS · discrete GPU vs Copilot+ NPU (chart; NPU values as listed in the platform cards above)
Chip	Note	Rated TOPS
RTX 4090 (discrete GPU)	Advertised peak compute	1,300+
Snapdragon X2 Elite NPU	Next-gen, announced Sep 2025	80
AMD Ryzen AI 300 NPU	XDNA 2, Block FP16	50
Intel Core Ultra 200V NPU	NPU-only, not the 120 total	47-48
Snapdragon X Elite NPU	Copilot+ launch platform	45
Copilot+ minimum	Certification floor	40 (min)

Source: NVIDIA, Qualcomm, Intel and AMD vendor specifications

The metric that matters

NVIDIA’s own developer blog puts it plainly: LLM performance is measured in the number of tokens generated by the model — that is, tokens per second, not TOPS. The same post notes the RTX 4090’s 1,300-plus TOPS, yet the relevant comparison for a local LLM is memory bandwidth and VRAM capacity, not the compute headline.

This is the two-stage debunk at the heart of every AI PC purchase decision. Stage one: the NPU’s TOPS is tiny next to a GPU. Stage two: it does not matter, because TOPS is the wrong metric for both of them. What you actually want to know is how fast the chip can stream a model’s weights out of memory — and that is a bandwidth question, which the next section makes concrete. If you are weighing an AI PC against a Mac or a GPU box on cost, our local-AI hardware buyer’s guide by price bracket runs the same numbers across every platform.

03 — Bandwidth Bound
Memory bandwidth, not TOPS, sets the speed.

Autoregressive token generation works one token at a time, and each step has to read the model’s weights out of memory. For single-stream decoding, the bottleneck is almost never compute; it is how fast those weights can be streamed. Throughput is bounded by memory bandwidth divided by the model size in memory. A 70B model in 4-bit quantization occupies roughly 40 GB, so even 256 GB/s of bandwidth caps decode near 6 tokens per second. The NPU’s TOPS rating never enters this equation.
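As a rough illustration of where the ~4.5 GB and ~40 GB footprints come from, the sketch below multiplies parameter count by bits per weight. The 4.5 effective bits per weight is an assumption (a nominal 4-bit format plus per-block scaling overhead), and the function name is purely illustrative; real quantization formats vary.

```python
def model_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Assumption: ~4.5 effective bits per weight for a "4-bit" quantization.
EFFECTIVE_4BIT = 4.5

for name, params in [("8B", 8), ("70B", 70)]:
    size = model_footprint_gb(params, EFFECTIVE_4BIT)
    print(f"{name} at ~4-bit: {size:.1f} GB")
```

The KV cache and runtime buffers come on top of this, so treat the result as a lower bound on memory use rather than a sizing guarantee.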

The table below makes that vivid. The decode ceiling column is computed only from memory bandwidth and model size — TOPS is deliberately left out, because it does not appear in the math. These are theoretical ceilings; real-world output lands below them once runtime overhead, the KV cache and prompt processing are included.

Memory-bandwidth decode ceiling for an 8-billion-parameter model at 4-bit, computed as memory bandwidth divided by approximately 4.5 GB, across Snapdragon X Elite, Snapdragon X2 Elite, Ryzen AI Max+ 395 and the RTX 4090. NPU TOPS is excluded from the calculation. Bandwidth figures are vendor-stated; the Ryzen AI Max+ 395 figure is derived from its 256-bit LPDDR5X-8000 bus; ceilings are theoretical and real throughput lands below them.
Platform	Quoted compute	Memory bandwidth	8B 4-bit ceiling	Reality check
Snapdragon X Elite (NPU)	45 TOPS	135 GB/s	≈30 tok/s	NPU itself caps near ~4B models; an 8B runs on the iGPU, not the NPU.
Snapdragon X2 Elite (NPU)	80 TOPS	152 GB/s	≈34 tok/s	Nearly double the TOPS, only ~13% more bandwidth — so only ~13% more decode headroom.
Ryzen AI Max+ 395 (unified, iGPU)	50 TOPS NPU	≈256 GB/s	≈57 tok/s	The 96 GB-capable iGPU — not the NPU — is what runs 70B at ~14 tok/s (Q4).
RTX 4090 (discrete GPU)	1,300+ TOPS	≈1,008 GB/s (GDDR6X)	≈224 tok/s	~29× an NPU's TOPS, but real tok/s tracks bandwidth and its 24 GB of VRAM, not the TOPS headline.

The math is simple and worth doing by hand. Decode ceiling ≈ memory bandwidth ÷ ~4.5 GB (an 8B model at 4-bit). At 135 GB/s the Snapdragon X Elite tops out near 30 tok/s; at 152 GB/s the X2 Elite reaches about 34. That is the punchline: the X2 Elite nearly doubles the NPU’s TOPS (45 to 80) yet lifts the decode ceiling only about 13%, because bandwidth rose only about 13%. TOPS climbed; real LLM headroom barely budged. The Ryzen AI Max+ 395, with roughly 256 GB/s from a 256-bit LPDDR5X-8000 bus, has a ceiling near 57 tok/s on the same workload, not because its NPU is bigger but because its memory bus is wider.
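The hand calculation above can be reproduced in a few lines. This is a minimal sketch that uses only the bandwidth figures quoted in this section and the ~4.5 GB model size; the names are illustrative, and the results are theoretical ceilings, not benchmarks.

```python
MODEL_SIZE_GB = 4.5  # 8B model at ~4-bit, as used in this section

# Memory bandwidth in GB/s, as quoted in this section.
PLATFORMS = {
    "Snapdragon X Elite": 135.0,
    "Snapdragon X2 Elite": 152.0,
    "Ryzen AI Max+ 395": 256.0,
}

def decode_ceiling(bandwidth_gb_s: float, model_gb: float = MODEL_SIZE_GB) -> float:
    """Upper bound on tokens per second: every token reads all weights once."""
    return bandwidth_gb_s / model_gb

for name, bandwidth in PLATFORMS.items():
    print(f"{name}: ~{decode_ceiling(bandwidth):.0f} tok/s ceiling")

# TOPS is absent from the formula, so the X2 Elite's gain follows bandwidth only.
uplift = PLATFORMS["Snapdragon X2 Elite"] / PLATFORMS["Snapdragon X Elite"] - 1
print(f"X2 Elite vs X Elite ceiling uplift: {uplift:.0%}")
```

Note that the NPU's TOPS rating is not an input anywhere, which is the whole point of the comparison.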

04 — The NPU's Real Job
What the NPU is genuinely good at.

None of this means the NPU is useless — the opposite. It is a fixed-function accelerator built for small, sustained AI tasks at very low power, and it does that job better than a CPU or GPU could. The flagship example is Phi Silica, Microsoft’s on-device small language model, preinstalled on the NPU of every Copilot+ PC and based on Phi-3.5-mini.

First-token latency

Phi Silica · short prompts

230 ms

Time-to-first-token for short prompts on the NPU. Fast enough to feel instant inside Word and Outlook Rewrite and Summarize. Vendor-stated.

Dec 2024

On-device throughput

Phi Silica · sustained

20 tok/s

Up to ~20 tokens per second for on-device generation, with a 2k context window (4k planned). Vendor-stated; Microsoft has not published independent third-party benchmarks.

Vendor-stated

Power vs CPU

Lower draw than CPU

56%

Phi Silica draws roughly 56% less power than equivalent CPU inference — the reason the NPU runs these features without spinning the fan or draining the battery.

Efficiency

NPU model ceiling

Mainstream NPU runtimes

~4B

NPU-optimized models on Snapdragon X Elite top out around 4 billion parameters — Llama-3B, Phi4-mini, Qwen3-4B — via specialist SDKs, not the standard GGUF runtimes.

Specialist path

That ~4B ceiling is the key. The NPU is brilliant at small language models and narrow vision and audio tasks — and small models are exactly where most on-device agent work belongs. Our small language model business guide covers why a 3-9B model like Phi or Qwen handles the majority of real-world steps, and the NPU is the most power-efficient place to run the smallest of them. A reviewer testing NPU acceleration captured the experience well.

"The fan didn't even spin up during audio processing"
XDA Developers, hands-on Snapdragon X Elite NPU test

That silent, low-power profile is the whole point of the NPU. The table below separates the Windows AI features that genuinely run on the NPU from the open LLMs that buyers assume run there but do not — the single most common misconception in this category.

Copilot+ NPU feature map showing, for each workload, whether it runs on the NPU, whether it requires a Copilot+ PC, its model path (ONNX or GGUF), approximate speed and power profile. Phi Silica throughput is vendor-stated; the 70B figure is a sourced Strix Halo iGPU benchmark; named open LLMs run on the iGPU or CPU, not the NPU.
Feature / workload	Runs on NPU?	Needs Copilot+?	Model path	Approx speed	Power profile
Windows Recall	Yes — NPU	Yes	ONNX (built-in)	Continuous indexing	Low · silent
Live Captions (real-time translation)	Yes — NPU	Yes	ONNX (built-in)	Real-time	Low · silent
Windows Studio Effects (blur, eye contact)	Yes — NPU	Yes	ONNX (built-in)	Real-time	Low · silent
Click to Do (preview)	Yes — NPU	Yes	ONNX · Phi Silica	Interactive	Low
Phi Silica Rewrite / Summarize (Word, Outlook)	Yes — NPU	Yes	ONNX · Phi-3.5-mini	Up to ~20 tok/s (vendor-stated)	~56% less than CPU
Llama 3 8B via Ollama	No — iGPU / CPU	No	GGUF	Depends on iGPU	Moderate · fan
Mistral 7B via llama.cpp	No — iGPU / CPU	No	GGUF	Depends on iGPU	Moderate · fan
Llama 3.3 70B (quantized)	No — discrete GPU or 96 GB unified	No	GGUF	~14 tok/s on Strix Halo iGPU	High · sustained

05 — The Hidden Blocker
Why your runtime ignores the NPU.

Here is the detail that kills most “I’ll just run Llama on my NPU” plans before they start. As of mid-2026, the mainstream local-LLM runtimes — Ollama, llama.cpp and LM Studio — do not route workloads to the NPU at all. They run on the integrated GPU (via Vulkan, ROCm or Metal) or the CPU. Those tools load GGUF-format models; the NPU needs models hand-converted to ONNX and compiled for the vendor’s execution provider — QNN for Qualcomm, OpenVINO for Intel.

That conversion path exists and works, but it is a specialist, opt-in pipeline rather than a drop-in accelerator. Qualcomm’s AI Hub publishes 175-plus pre-optimized ONNX models validated for Snapdragon X Elite, and SDKs such as Nexa enable NPU acceleration for a curated set — Llama-3B, Phi4-mini, Qwen3-4B and an OmniNeural-4B multimodal model. But if you download a GGUF off Hugging Face and point Ollama at it, the NPU sits idle.

The practical takeaway

If your plan is “download a model and run it,” you are using the iGPU or CPU — not the NPU — no matter how many TOPS the box advertises. The NPU accelerates LLMs only through hand-converted ONNX models on a specialist SDK, and only up to roughly 4B parameters in mainstream tooling. For the privacy and cost case that makes local inference worth the effort in the first place, see our on-device local AI agents forecast.
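The routing rule described in this section can be written down as a small lookup. This is a hypothetical sketch that encodes only the article's claims (GGUF runtimes skip the NPU; ONNX models reach it up to roughly 4B parameters); the function name, the labels and the `vendor-sdk` runtime string are illustrative, and the code does not query any real runtime.

```python
# Approximate NPU model ceiling cited in this article, in billions of parameters.
NPU_PARAM_CEILING_B = 4.0

# Mainstream GGUF runtimes that, per the article, do not target the NPU.
GGUF_RUNTIMES = {"ollama", "llama.cpp", "lm studio"}

def likely_engine(runtime: str, model_format: str, params_billions: float) -> str:
    """Return where a workload most likely runs, following the article's rules."""
    rt = runtime.strip().lower()
    fmt = model_format.strip().lower()
    if rt in GGUF_RUNTIMES or fmt == "gguf":
        return "iGPU or CPU"
    if fmt == "onnx" and params_billions <= NPU_PARAM_CEILING_B:
        return "NPU (via vendor execution provider)"
    return "iGPU or CPU"

print(likely_engine("Ollama", "GGUF", 8))      # mainstream path
print(likely_engine("vendor-sdk", "ONNX", 4))  # specialist path, small model
print(likely_engine("vendor-sdk", "ONNX", 8))  # above the ~4B ceiling
```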

06 — The Honest Exception
One laptop chip that does run 70B.

There is one platform that breaks the “AI PCs can’t run big models” rule — and it proves the point precisely because of how it does it. The AMD Ryzen AI Max+ 395 (Strix Halo) pairs 128 GB of LPDDR5X-8000 on a 256-bit bus with a 40-CU RDNA 3.5 integrated GPU. Up to 96 GB of that memory can be allocated to the GPU, which is enough to hold a 70B model.

Unified memory

LPDDR5X-8000 · 256-bit

128 GB

Shared between the CPU and a 40-CU RDNA 3.5 iGPU on a 256-bit bus — roughly 256 GB/s of bandwidth, far wider than a standard Copilot+ laptop.

Strix Halo

Allocatable to GPU

Enough to hold 70B

96 GB

Up to 96 GB can be assigned to the iGPU, which is what lets a quantized 70B model fit and run entirely on-device — no discrete card required.

On-device

Llama 3.3 70B

Quantized · iGPU path

~14 tok/s

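Around 14 tokens per second quantized, or roughly 5 tok/s at BF16, is a vendor and community figure for this unified-memory APU. Both numbers sit above what raw bandwidth predicts: 256 GB/s ÷ ~40 GB gives a ceiling near 6 tok/s for a 4-bit 70B, and 70B weights at BF16 (about 140 GB) exceed the 96 GB the iGPU can address, so verify the model and settings behind each figure. Either way, the work is done by the iGPU and wide memory, not the NPU.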

Vendor / community est.

Workstation-class

$2k+

This is a premium workstation laptop SoC in machines like the Asus ProArt, not a $1,000-1,400 Copilot+ consumer laptop. Price it against a Mac Studio M4 Max, not a thin-and-light.

Not mainstream

Read that carefully: the Ryzen AI Max+ 395 runs 70B because of wide unified memory and a capable iGPU, not because of its 50 TOPS NPU. It is the honest answer to “can an AI PC run a big LLM?”: yes, if you buy a $2,000-plus workstation-class machine whose value is memory capacity, and you accept single-digit to low-double-digit tokens per second. If running larger models on a laptop is the goal, the Gemma 12B on a laptop guide and our DGX Spark vs M5 Max vs RTX 6000 comparison walk the same unified-memory-versus-discrete-GPU trade-off in depth.
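A quick way to sanity-check any "runs 70B" claim is to test capacity first and bandwidth second. The sketch below does that for the figures quoted in this section (96 GB allocatable, roughly 256 GB/s). The 4.5 bits per weight for a 4-bit quantization is an assumption, and the ceiling ignores KV cache and runtime overhead.

```python
ALLOCATABLE_GB = 96.0   # iGPU-assignable memory quoted in this section
BANDWIDTH_GB_S = 256.0  # approximate memory bandwidth quoted in this section

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

def assess(params_billions: float, bits_per_weight: float) -> dict:
    """Capacity check plus bandwidth-only decode ceiling for one configuration."""
    size = weights_gb(params_billions, bits_per_weight)
    return {
        "size_gb": size,
        "fits": size <= ALLOCATABLE_GB,
        "ceiling_tok_s": BANDWIDTH_GB_S / size,
    }

# Assumption: ~4.5 effective bits per weight for 4-bit; BF16 is 16 bits.
for label, bits in [("~4-bit", 4.5), ("BF16", 16.0)]:
    r = assess(70, bits)
    print(f"70B {label}: {r['size_gb']:.0f} GB, fits={r['fits']}, "
          f"ceiling ~{r['ceiling_tok_s']:.1f} tok/s")
```

Under these assumptions the simple rule gives lower numbers than the quoted throughput figures, which is one more reason to treat those figures as estimates until they are tied to a specific model file and runtime.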

07 — The Platform Shift
Windows AI Foundry and the loosening NPU-only frame.

Microsoft’s own platform direction is the strongest signal that the NPU-only framing was always too narrow. At Build 2025, Microsoft announced Windows AI Foundry, the evolution of the Windows Copilot Runtime — a unified layer that selects, optimizes, fine-tunes and deploys AI models across NPU, GPU and CPU, spanning AMD, Intel, NVIDIA and Qualcomm silicon, and integrating Foundry Local with open catalogs including Ollama and NVIDIA NIMs.

Underneath it, Windows ML automatically selects the correct execution provider — QNN for Qualcomm NPUs, OpenVINO for Intel NPUs — and falls back to the GPU or CPU when an NPU provider is unavailable. Developers no longer bundle execution providers by hand. The architecture quietly concedes that the right engine depends on the workload, not on a single 40 TOPS threshold.
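The fallback idea can be sketched as a preference list over execution providers. This is an illustration of the concept, not Windows ML's actual selection logic: the provider strings are the names ONNX Runtime uses, while the preference order and the `pick_providers` helper are assumptions made for the example.

```python
# Preference order is an assumption for illustration: NPU providers first,
# then a GPU provider, then the CPU as the universal fallback.
PREFERRED = [
    "QNNExecutionProvider",       # Qualcomm NPU
    "OpenVINOExecutionProvider",  # Intel NPU / GPU / CPU
    "DmlExecutionProvider",       # DirectML (GPU)
    "CPUExecutionProvider",       # always-available fallback
]

def pick_providers(available: list[str], preferred: list[str] = PREFERRED) -> list[str]:
    """Keep preferred providers that are available, in priority order."""
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

# With onnxruntime installed, `available` could come from
# onnxruntime.get_available_providers(), and the result could be passed as
# the `providers` argument of onnxruntime.InferenceSession.
print(pick_providers(["CPUExecutionProvider", "QNNExecutionProvider"]))
print(pick_providers(["DmlExecutionProvider", "CPUExecutionProvider"]))
```

Keeping the CPU provider at the end of the list mirrors the behavior described above: the workload still runs when no NPU provider is present, just on a different engine.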

The direction of travel

Reporting on Build 2026 indicates Microsoft broadened Windows AI Foundry beyond NPU-only, officially supporting GPU and CPU inference paths — and Phi Silica on GPU entered an experimental Windows Insider release in June 2026. Treat the Build 2026 specifics as third-party coverage to verify against Microsoft directly, but the trend is clear: the 40 TOPS NPU bar was never meant to carry the full range of local AI, and the platform is opening up to the GPU and CPU to match.

That trajectory matters for buyers planning two or three years ahead. The NPU’s role is consolidating around what it does best (efficient, always-on small-model and feature inference), while the heavy LLM work migrates to whichever engine has the memory bandwidth to feed it. Expect future Windows AI features to lean on a hybrid NPU-plus-GPU split rather than the NPU alone, and expect the marketing to keep quoting TOPS long after the platform has stopped treating it as the deciding number.

08 — What To Buy
The honest buy-this-for-that matrix.

Match the machine to the job. The mistake is buying an AI PC expecting a local LLM box, or skipping one because “NPUs can’t run LLMs” when the Windows AI features are exactly what you wanted. Four clear lanes.

Battery + Windows AI

Buy a Copilot+ AI PC

If you want all-day battery, Recall, Live Captions, Studio Effects and on-device Rewrite and Summarize, a 40-plus TOPS Copilot+ laptop is exactly right. Just don't expect it to be your 30B inference box.

Pick an AI PC

Fast 30B+ local LLMs

Buy a discrete-GPU box

For interactive 30B-plus inference, a discrete GPU with ample VRAM and GDDR bandwidth wins decisively. This is the path mainstream runtimes are actually built for — GGUF in, fast tokens out.

Pick a GPU box

70B in a laptop

Buy Strix Halo unified memory

The Ryzen AI Max+ 395 with 96 GB allocatable to the iGPU runs a quantized 70B at ~14 tok/s on-device. Workstation-class price (~$2k+), and the work happens on the iGPU, not the NPU.

Pick Ryzen AI Max+ 395

Small on-device agents

Lean on Phi Silica and SLMs

For small, private, always-on agent steps at minimal power, the NPU path through Phi Silica and 4B-class ONNX models is the most efficient option on a Copilot+ PC — silent, fast-to-first-token, battery-friendly.

Pick the NPU path

For most teams, the answer is “both, for different reasons”: a Copilot+ AI PC for the laptop fleet’s battery life and Windows AI features, and a separate GPU or unified-memory machine where heavier local inference actually happens. If you are mapping local versus cloud economics across a fleet, our local AI versus cloud subscription ROI analysis frames the same trade-off in dollars, and our AI transformation engagements start with exactly this kind of hardware-and-workload mapping.

09 — Conclusion
The number on the box is the wrong number.

AI PCs in 2026, honestly

An NPU is a features chip, not a 70B box — and that's fine.

The single most useful thing to internalize before buying an AI PC in 2026 is that the headline TOPS figure answers a different question than the one you are asking. It tells you the laptop will run Windows AI features locally at low power. It tells you nothing about how fast a local LLM will run, because LLM decode is bound by memory bandwidth, and the mainstream runtimes do not even touch the NPU.

Buy a Copilot+ AI PC for what it is genuinely excellent at: all-day battery, silent on-device features, and small sustained models like Phi Silica. Buy a discrete-GPU box, or a wide-memory machine like the Ryzen AI Max+ 395, when 30B-plus inference is the job. The two are not substitutes, and the spec sheet will not tell you which one you are looking at unless you know to read past the TOPS.

The broader shift is already underway. Microsoft’s own Windows AI Foundry is opening from NPU-only toward GPU and CPU paths, conceding that one threshold can’t carry every workload. The next generation of AI PCs will be judged less by a single TOPS number and more by the boring metrics that actually decide local AI speed — memory bandwidth, usable capacity, and which engine your software will really use.

AI PCs and NPUs in 2026: Can They Really Run Local AI?