AI Development · Industry Guide · 12 min read · Published June 29, 2026

NPU vs GPU vs unified memory · 40 TOPS doesn’t equal fast LLMs

AI PCs and NPUs in 2026: Can They Really Run Local AI?

Every Copilot+ PC clears a 40 TOPS NPU bar — but that number tells you almost nothing about how fast a local LLM will run. This buyer’s guide decodes TOPS into real tokens per second, shows why inference is memory-bandwidth bound, and maps what the NPU genuinely accelerates versus what still needs a GPU.

Digital Applied Team
Senior strategists · Published Jun 29, 2026
Sources: Vendor docs + benchmarks
  • Copilot+ NPU bar: 40 TOPS (minimum to qualify)
  • RTX 4090 compute: 1,300+ TOPS (~29× an NPU)
  • NPU model ceiling: ~4B params (mainstream NPU path)
  • Phi Silica throughput: 20 tok/s (on-device · vendor-stated)

An AI PC with a 40-plus TOPS NPU clears Microsoft’s Copilot+ bar, but the headline TOPS number tells you almost nothing about whether it can run a local LLM at a usable speed. NPUs are genuinely good at small, sustained on-device models and Windows AI features — they are not the way to run a 70B model. This guide separates the marketing from the silicon.

The confusion is everywhere: buyers see “45 TOPS” or “50 TOPS of AI performance” on a spec sheet and assume it means the laptop can run Llama or Mistral locally. It can run some AI workloads beautifully — and a large language model is usually not one of them, at least not on the NPU. The reason is a metric mismatch that almost no marketing page explains.

Below, we decode TOPS into real tokens per second, show why autoregressive LLM inference is bound by memory bandwidth rather than raw compute, map exactly which Copilot+ features run on the NPU versus which fall back to the iGPU or CPU, and give an honest buy-this-for-that decision matrix. Every spec is traceable to the silicon vendor or a named benchmark.

Key takeaways
  1. The 40 TOPS bar is a features gate, not an LLM rating. Copilot+ certification requires a 40-plus TOPS NPU, 16 GB RAM and a 256 GB SSD on Windows 11 24H2. That unlocks Recall, Live Captions, Studio Effects and Click to Do, not fast local LLMs.
  2. TOPS is the wrong yardstick for LLM speed. An RTX 4090 advertises 1,300-plus TOPS versus a Copilot+ NPU's 40-80 TOPS, but the comparison that actually predicts tokens per second is memory bandwidth and usable VRAM, not the TOPS figure.
  3. LLM decode is memory-bandwidth bound. Token generation streams the model's weights from memory each step. Throughput tracks bandwidth divided by model size, which is why nearly doubling an NPU's TOPS barely moves the tokens-per-second ceiling.
  4. Mainstream runtimes don't even use the NPU. Ollama, llama.cpp and LM Studio route to the iGPU (Vulkan/ROCm/Metal) or CPU. The NPU needs hand-converted ONNX models via QNN or OpenVINO: an opt-in specialist path, not a drop-in for GGUF models.
  5. Buy by job: AI PC for features, GPU box for big LLMs. Pick a Copilot+ AI PC for all-day battery and Windows AI features; pick a discrete-GPU or 96 GB unified-memory machine for 30B-plus inference. The Ryzen AI Max+ 395 is the one laptop-class exception, and it runs on the iGPU, not the NPU.

01 · The Copilot+ Bar

What an AI PC actually guarantees.

Microsoft defined the “Copilot+ PC” tier with a hardware floor: a neural processing unit rated at 40-plus TOPS, at least 16 GB of RAM, and a 256 GB SSD, all on Windows 11 24H2. That bar is what unlocks the on-device feature set — Windows Recall, Live Captions with real-time translation, Click to Do, and Windows Studio Effects — all running locally on the NPU rather than in the cloud.
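The hardware floor described above reduces to three threshold checks. The sketch below is illustrative only: the `Machine` record, its field names and the two sample machines are hypothetical, and it ignores the Windows 11 24H2 requirement and any other certification criteria Microsoft may apply.

```python
from dataclasses import dataclass

# Hardware floor as stated in the text: 40+ TOPS NPU, 16 GB RAM, 256 GB SSD.
MIN_NPU_TOPS = 40
MIN_RAM_GB = 16
MIN_SSD_GB = 256


@dataclass
class Machine:
    """Hypothetical spec record; field names are illustrative."""
    name: str
    npu_tops: float  # NPU-only rating, not a combined NPU + GPU + CPU figure
    ram_gb: int
    ssd_gb: int


def meets_copilot_plus_floor(m: Machine) -> bool:
    """Return True if the machine clears all three hardware minimums."""
    return (
        m.npu_tops >= MIN_NPU_TOPS
        and m.ram_gb >= MIN_RAM_GB
        and m.ssd_gb >= MIN_SSD_GB
    )


# Made-up sample machines, for illustration only.
laptop = Machine("example-laptop", npu_tops=45, ram_gb=16, ssd_gb=512)
older = Machine("older-laptop", npu_tops=11, ram_gb=32, ssd_gb=1024)

print(meets_copilot_plus_floor(laptop))
print(meets_copilot_plus_floor(older))
```

Note that the check takes the NPU-only rating as input; feeding it a combined platform figure would pass machines whose NPU alone falls short.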

Four silicon families clear that bar today. Their NPU ratings are close to one another and very far from a discrete GPU. Crucially, these are NPU-only numbers; some marketing slides quote a combined platform figure (Intel’s 120 TOPS, for example) that sums the NPU, GPU and CPU. The NPU alone is what runs the Windows AI features.

Qualcomm Snapdragon X Elite · 45 TOPS (INT8) · 135 GB/s (2024 launch)
The Arm-based launch platform for Copilot+. The NPU targets INT8 and INT4 quantized inference, not FP16, with up to 64 GB of shared LPDDR5X at 8533 MT/s (16 GB is a common config, not the ceiling).

Intel Core Ultra 200V · 47-48 TOPS NPU · 120 total (Sep 2024 launch)
Lunar Lake. The NPU alone is 47-48 TOPS on the Ultra 7 258V and Ultra 9 288V; the 120 TOPS figure is the combined NPU + GPU + CPU total, so don't read it as the NPU spec.

AMD Ryzen AI 300 · 50 TOPS (XDNA 2) (Jul 2024 launch)
Strix Point. The only NPU supporting the full 50 TOPS in Block FP16, not just INT8, which is useful headroom for higher-precision on-device models.

Next-gen: Snapdragon X2 Elite · 80 TOPS · 152 GB/s (announced Sep 2025)
Detailed September 2025: 80 TOPS NPU, 152 GB/s bandwidth, up to 128 GB RAM. A real jump on paper, but treat shipping devices and pricing as unconfirmed until a primary source lists them.

Read the spec carefully
The NPU number and the “total AI TOPS” number are not the same thing. Intel’s Core Ultra 200V is a 47-48 TOPS NPU; the 120 TOPS you may see is the combined NPU, GPU and CPU. Conflating them overstates the NPU by roughly 2.5×. For Windows AI features, the NPU-only figure is the one that matters.

02 · The TOPS Myth

Why TOPS is the wrong yardstick.

TOPS — trillions of operations per second — is a theoretical peak-compute figure. It is a fine way to compare two NPUs doing the same fixed-function job. It is a misleading way to predict large language model speed, for two compounding reasons.

First, the absolute gap is enormous. A discrete RTX 4090 advertises 1,300-plus TOPS; a Copilot+ NPU sits at 40-80 TOPS. Against a 45 TOPS launch-generation NPU that is roughly a 29× difference, and still about 16× against the 80 TOPS X2 Elite. If TOPS were the LLM metric, an AI PC would be a non-starter. But that framing is itself a trap because, second, TOPS is not how LLM speed is measured at all.

Raw TOPS · discrete GPU vs Copilot+ NPU
Source: NVIDIA, Qualcomm, Intel and AMD vendor specifications

| Chip | Note | TOPS |
| --- | --- | --- |
| RTX 4090 (discrete GPU) | Advertised peak compute | 1,300+ |
| Snapdragon X2 Elite NPU | Next-gen, announced Sep 2025 | 80 |
| AMD Ryzen AI 300 NPU | XDNA 2, Block FP16 | 50 |
| Intel Core Ultra 200V NPU | NPU-only, not the 120 total | 48 |
| Snapdragon X Elite NPU | Copilot+ launch platform | 45 |
| Copilot+ minimum | Certification floor | 40 (min) |
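To see where the "~29×" headline comes from, the advertised figures above can be divided directly. This is a small sketch using only the vendor-stated numbers quoted in this guide; 1,300 is used as a lower bound for the RTX 4090's "1,300-plus", and the function name is illustrative.

```python
# Advertised peak figures quoted in this guide (vendor-stated).
GPU_TOPS = 1300  # RTX 4090, "1,300-plus", so treated as a lower bound

NPU_TOPS = {
    "Copilot+ minimum": 40,
    "Snapdragon X Elite": 45,
    "Intel Core Ultra 200V": 48,
    "AMD Ryzen AI 300": 50,
    "Snapdragon X2 Elite": 80,
}


def tops_ratio(gpu_tops: float, npu_tops: float) -> float:
    """How many times larger the GPU's advertised TOPS is than the NPU's."""
    return gpu_tops / npu_tops


for name, tops in NPU_TOPS.items():
    ratio = tops_ratio(GPU_TOPS, tops)
    print(f"{name:24s} {tops:3d} TOPS   GPU advertises ~{ratio:.1f}x as much")
```

The ~29× figure corresponds to the 45 TOPS Snapdragon X Elite; the ratio shrinks to roughly 16× against the 80 TOPS X2 Elite. Either way, it is a ratio of peak-compute headlines, which is not the quantity that sets LLM decode speed.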
The metric that matters
NVIDIA’s own developer blog puts it plainly: LLM performance is measured in the number of tokens generated by the model — that is, tokens per second, not TOPS. The same post notes the RTX 4090’s 1,300-plus TOPS, yet the relevant comparison for a local LLM is memory bandwidth and VRAM capacity, not the compute headline.

This is the two-stage debunk at the heart of every AI PC purchase decision. Stage one: the NPU’s TOPS is tiny next to a GPU. Stage two: it does not matter, because TOPS is the wrong metric for both of them. What you actually want to know is how fast the chip can stream a model’s weights out of memory — and that is a bandwidth question, which the next section makes concrete. If you are weighing an AI PC against a Mac or a GPU box on cost, our local-AI hardware buyer’s guide by price bracket runs the same numbers across every platform.

03 · Bandwidth Bound

Memory bandwidth, not TOPS, sets the speed.

Autoregressive token generation works one token at a time, and each step has to read the model’s weights out of memory. For single-stream decoding, the bottleneck is almost never compute — it is how fast those weights can be streamed. Throughput is bound by memory bandwidth divided by the model size in memory. A 70B model in 4-bit quantization occupies roughly 40 GB and needs substantial sustained bandwidth just to reach single-digit tokens per second. The NPU’s TOPS rating never enters this equation.
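The "roughly 40 GB" figure, and the bandwidth it implies, can be reproduced with two lines of arithmetic. The sketch below is a rough estimate, not a measurement: it assumes about 4.5 effective bits per weight for a "4-bit" quantized model (an assumption chosen because it matches the ~40 GB and ~4.5 GB sizes used in this guide), and it ignores the KV cache and runtime overhead.

```python
# ASSUMPTION: a "4-bit" quantized model stores ~4.5 effective bits per weight
# once scales and other metadata are included. This value is illustrative.
EFFECTIVE_BITS_4BIT = 4.5


def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def bandwidth_needed_gbps(model_gb: float, target_tok_s: float) -> float:
    """Bandwidth required if every token streams all weights once."""
    return model_gb * target_tok_s


size_70b = model_size_gb(70, EFFECTIVE_BITS_4BIT)
size_8b = model_size_gb(8, EFFECTIVE_BITS_4BIT)

print(f"70B at 4-bit: ~{size_70b:.1f} GB")
print(f" 8B at 4-bit: ~{size_8b:.1f} GB")
print(f"70B at 5 tok/s needs ~{bandwidth_needed_gbps(size_70b, 5):.0f} GB/s")
```

Under these assumptions, even 5 tokens per second on a 70B model asks for more sustained bandwidth than the 135-152 GB/s quoted for the Snapdragon parts, and no TOPS figure appears anywhere in the calculation.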

The table below makes that vivid. The decode ceiling column is computed only from memory bandwidth and model size — TOPS is deliberately left out, because it does not appear in the math. These are theoretical ceilings; real-world output lands below them once runtime overhead, the KV cache and prompt processing are included.

Memory-bandwidth decode ceiling for an 8-billion-parameter model at 4-bit, computed as memory bandwidth divided by approximately 4.5 GB, across Snapdragon X Elite, Snapdragon X2 Elite, Ryzen AI Max+ 395 and the RTX 4090. NPU TOPS is excluded from the calculation. Bandwidth figures are vendor-stated; the Ryzen AI Max+ 395 figure is derived from its 256-bit LPDDR5X-8000 bus; ceilings are theoretical and real throughput lands below them.
| Platform | Quoted compute | Memory bandwidth | 8B 4-bit ceiling | Reality check |
| --- | --- | --- | --- | --- |
| Snapdragon X Elite (NPU) | 45 TOPS | 135 GB/s | ≈30 tok/s | NPU itself caps near ~4B models; an 8B runs on the iGPU, not the NPU. |
| Snapdragon X2 Elite (NPU) | 80 TOPS | 152 GB/s | ≈34 tok/s | Nearly double the TOPS, only ~13% more bandwidth, so only ~13% more decode headroom. |
| Ryzen AI Max+ 395 (unified, iGPU) | 50 TOPS NPU | ≈256 GB/s | ≈57 tok/s | The 96 GB-capable iGPU, not the NPU, is what runs 70B at ~14 tok/s (Q4). |
| RTX 4090 (discrete GPU) | 1,300+ TOPS | GDDR6X ≫ LPDDR | Far higher | ~29× an NPU's TOPS, but real tok/s tracks bandwidth and VRAM, not the TOPS headline. |

The math is simple and worth doing by hand. Decode ceiling ≈ memory bandwidth ÷ ~4.5 GB (an 8B model at 4-bit). At 135 GB/s the Snapdragon X Elite tops out near 30 tok/s; at 152 GB/s the X2 Elite reaches about 34. That is the punchline: the X2 Elite nearly doubles the NPU’s TOPS (45 to 80) yet lifts the bandwidth ceiling only about 13%, because bandwidth rose only 13%. TOPS climbed; real LLM headroom barely budged. The Ryzen AI Max+ 395, with roughly 256 GB/s from a 256-bit LPDDR5X-8000 bus, clears 57 tok/s on the same workload — not because its NPU is bigger, but because its memory is wider.
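The hand calculation above can be written out as a short script. It uses only the bandwidth figures and the ~4.5 GB model size quoted in this guide; the function names are illustrative, and the results are theoretical ceilings rather than benchmarks.

```python
MODEL_GB = 4.5  # 8B model at 4-bit, the size used in the text

# Vendor-stated memory bandwidth in GB/s, as quoted in this guide.
BANDWIDTH_GBPS = {
    "Snapdragon X Elite": 135.0,
    "Snapdragon X2 Elite": 152.0,
    "Ryzen AI Max+ 395": 256.0,  # derived from the bus width, see below
}


def decode_ceiling_tok_s(bandwidth_gbps: float, model_gb: float) -> float:
    """Upper bound on single-stream decode: bandwidth divided by model size."""
    return bandwidth_gbps / model_gb


def bus_bandwidth_gbps(bus_bits: int, mega_transfers_per_s: int) -> float:
    """Peak bandwidth of a memory bus: bytes per transfer times transfer rate."""
    return (bus_bits / 8) * mega_transfers_per_s / 1000


for name, bw in BANDWIDTH_GBPS.items():
    print(f"{name:20s} {bw:5.0f} GB/s -> ~{decode_ceiling_tok_s(bw, MODEL_GB):.0f} tok/s")

# X Elite to X2 Elite: TOPS uplift versus bandwidth (and ceiling) uplift.
tops_uplift = 80 / 45 - 1
bw_uplift = BANDWIDTH_GBPS["Snapdragon X2 Elite"] / BANDWIDTH_GBPS["Snapdragon X Elite"] - 1
print(f"TOPS uplift: {tops_uplift:.0%}, bandwidth uplift: {bw_uplift:.0%}")

# 256-bit LPDDR5X-8000 bus on the Ryzen AI Max+ 395.
print(f"Strix Halo bus: {bus_bandwidth_gbps(256, 8000):.0f} GB/s")
```

Because the ceiling is a straight division, any percentage gain in bandwidth carries over one-to-one to the decode ceiling, while a gain in TOPS does not appear at all.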

04 · The NPU's Real Job

What the NPU is genuinely good at.

None of this means the NPU is useless — the opposite. It is a fixed-function accelerator built for small, sustained AI tasks at very low power, and it does that job better than a CPU or GPU could. The flagship example is Phi Silica, Microsoft’s on-device small language model, preinstalled on the NPU of every Copilot+ PC and based on Phi-3.5-mini.

First-token latency · Phi Silica, short prompts: 230 ms (Dec 2024)
Time-to-first-token for short prompts on the NPU. Fast enough to feel instant inside Word and Outlook Rewrite and Summarize. Vendor-stated.

On-device throughput · Phi Silica, sustained: 20 tok/s (vendor-stated)
Up to ~20 tokens per second for on-device generation, with a 2k context window (4k planned). Vendor-stated; Microsoft has not published independent third-party benchmarks.

Power vs CPU · lower draw than CPU: 56% (efficiency)
Phi Silica draws roughly 56% less power than equivalent CPU inference, which is why the NPU runs these features without spinning the fan or draining the battery.

NPU model ceiling · mainstream NPU runtimes: ~4B (specialist path)
NPU-optimized models on Snapdragon X Elite top out around 4 billion parameters (Llama-3B, Phi4-mini, Qwen3-4B) via specialist SDKs, not the standard GGUF runtimes.

That ~4B ceiling is the key. The NPU is brilliant at small language models and narrow vision and audio tasks, and small models are exactly where most on-device agent work belongs. Our small language model business guide covers why a 3-9B model like Phi or Qwen handles the majority of real-world steps, and the NPU is the most power-efficient place to run the smallest of them. A reviewer testing NPU acceleration captured the experience well.

"The fan didn't even spin up during audio processing" (XDA Developers, hands-on Snapdragon X Elite NPU test)

That silent, low-power profile is the whole point of the NPU. The table below separates the Windows AI features that genuinely run on the NPU from the open LLMs that buyers assume run there but do not — the single most common misconception in this category.

Copilot+ NPU feature map showing, for each workload, whether it runs on the NPU, whether it requires a Copilot+ PC, its model path (ONNX or GGUF), approximate speed and power profile. Phi Silica throughput is vendor-stated; the 70B figure is a sourced Strix Halo iGPU benchmark; named open LLMs run on the iGPU or CPU, not the NPU.
| Feature / workload | Runs on NPU? | Needs Copilot+? | Model path | Approx speed | Power profile |
| --- | --- | --- | --- | --- | --- |
| Windows Recall | Yes (NPU) | Yes | ONNX (built-in) | Continuous indexing | Low · silent |
| Live Captions (real-time translation) | Yes (NPU) | Yes | ONNX (built-in) | Real-time | Low · silent |
| Windows Studio Effects (blur, eye contact) | Yes (NPU) | Yes | ONNX (built-in) | Real-time | Low · silent |
| Click to Do (preview) | Yes (NPU) | Yes | ONNX · Phi Silica | Interactive | Low |
| Phi Silica Rewrite / Summarize (Word, Outlook) | Yes (NPU) | Yes | ONNX · Phi-3.5-mini | Up to ~20 tok/s (vendor-stated) | ~56% less than CPU |
| Llama 3 8B via Ollama | No (iGPU / CPU) | No | GGUF | Depends on iGPU | Moderate · fan |
| Mistral 7B via llama.cpp | No (iGPU / CPU) | No | GGUF | Depends on iGPU | Moderate · fan |
| Llama 3.3 70B (quantized) | No (discrete GPU or 96 GB unified) | No | GGUF | ~14 tok/s on Strix Halo iGPU | High · sustained |
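The two Phi Silica figures quoted in this section (about 230 ms to first token and up to ~20 tok/s) combine into a rough response-time estimate. The model below is a simplification and an assumption of this sketch: it treats first-token latency as fixed and decode speed as constant at the vendor-stated upper bound, so real responses should be expected to take at least this long.

```python
TTFT_S = 0.230       # time to first token for short prompts (vendor-stated)
DECODE_TOK_S = 20.0  # "up to ~20 tok/s" (vendor-stated upper bound)


def response_time_s(
    output_tokens: int,
    ttft_s: float = TTFT_S,
    tok_s: float = DECODE_TOK_S,
) -> float:
    """Rough wall-clock estimate: first-token latency plus steady decode."""
    if output_tokens <= 0:
        return 0.0
    # The first token arrives after ttft_s; the rest stream at tok_s.
    return ttft_s + (output_tokens - 1) / tok_s


for n in (1, 25, 100, 300):
    print(f"{n:4d} output tokens -> ~{response_time_s(n):.2f} s")
```

This is why the NPU path suits short rewrites and summaries: a sentence-length answer returns in a second or two, while a multi-paragraph generation at 20 tok/s runs well past ten seconds.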

05 · The Hidden Blocker

Why your runtime ignores the NPU.

Here is the detail that kills most “I’ll just run Llama on my NPU” plans before they start. As of mid-2026, the mainstream local-LLM runtimes — Ollama, llama.cpp and LM Studio — do not route workloads to the NPU at all. They run on the integrated GPU (via Vulkan, ROCm or Metal) or the CPU. Those tools load GGUF-format models; the NPU needs models hand-converted to ONNX and compiled for the vendor’s execution provider — QNN for Qualcomm, OpenVINO for Intel.

That conversion path exists and works, but it is a specialist, opt-in pipeline rather than a drop-in accelerator. Qualcomm’s AI Hub publishes 175-plus pre-optimized ONNX models validated for Snapdragon X Elite, and SDKs such as Nexa enable NPU acceleration for a curated set — Llama-3B, Phi4-mini, Qwen3-4B and an OmniNeural-4B multimodal model. But if you download a GGUF off Hugging Face and point Ollama at it, the NPU sits idle.

The practical takeaway
If your plan is “download a model and run it,” you are using the iGPU or CPU — not the NPU — no matter how many TOPS the box advertises. The NPU accelerates LLMs only through hand-converted ONNX models on a specialist SDK, and only up to roughly 4B parameters in mainstream tooling. For the privacy and cost case that makes local inference worth the effort in the first place, see our on-device local AI agents forecast.
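The routing rule in this section can be summarized as a small decision function. This is a hypothetical helper for reasoning about a purchase, not an API of Ollama, llama.cpp, LM Studio or any vendor SDK; the `likely_engine` name, the return labels and the treatment of ONNX models above ~4B as falling back off the NPU are all assumptions of the sketch.

```python
NPU_PARAM_CEILING_B = 4.0  # ~4B: the mainstream NPU ceiling described in the text


def likely_engine(model_format: str, params_billion: float) -> str:
    """Hypothetical helper: where a model most likely executes on a Copilot+ PC."""
    fmt = model_format.lower()
    if fmt == "gguf":
        # Ollama, llama.cpp and LM Studio load GGUF and target the iGPU or CPU.
        return "iGPU/CPU"
    if fmt == "onnx":
        if params_billion <= NPU_PARAM_CEILING_B:
            # Converted ONNX model compiled for a vendor provider (QNN, OpenVINO).
            return "NPU"
        # ASSUMPTION: above the ~4B ceiling the work falls back off the NPU.
        return "iGPU/CPU"
    raise ValueError(f"unknown model format: {model_format!r}")


print(likely_engine("GGUF", 8))    # a GGUF download pointed at Ollama
print(likely_engine("ONNX", 3.8))  # a small converted model on a specialist SDK
print(likely_engine("ONNX", 8))
```

Note that the NPU's TOPS rating is not an input to the function: the model format and size decide the engine before compute throughput is ever considered.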

06 · The Honest Exception

One laptop chip that does run 70B.

There is one platform that breaks the “AI PCs can’t run big models” rule — and it proves the point precisely because of how it does it. The AMD Ryzen AI Max+ 395 (Strix Halo) pairs 128 GB of LPDDR5X-8000 on a 256-bit bus with a 40-CU RDNA 3.5 integrated GPU. Up to 96 GB of that memory can be allocated to the GPU, which is enough to hold a 70B model.

Unified memory · LPDDR5X-8000, 256-bit: 128 GB (Strix Halo)
Shared between the CPU and a 40-CU RDNA 3.5 iGPU on a 256-bit bus, giving roughly 256 GB/s of bandwidth, far wider than a standard Copilot+ laptop.

Allocatable to GPU · enough to hold 70B: 96 GB (on-device)
Up to 96 GB can be assigned to the iGPU, which is what lets a quantized 70B model fit and run entirely on-device, with no discrete card required.

Llama 3.3 70B · quantized, iGPU path: ~14 tok/s (vendor / community est.)
Around 14 tokens per second quantized, or roughly 5 tok/s at BF16. This is a vendor and community figure for this unified-memory APU, and it sits above the roughly 6 tok/s that 256 GB/s ÷ ~40 GB would predict for a 4-bit 70B model, so read it as optimistic. Delivered by the iGPU and wide memory, not the NPU.

Category · workstation-class: $2k+ (not mainstream)
This is a premium workstation laptop SoC in machines like the Asus ProArt, not a $1,000-1,400 Copilot+ consumer laptop. Price it against a Mac Studio M4 Max, not a thin-and-light.

Read that carefully: the Ryzen AI Max+ 395 runs 70B because of wide unified memory and a capable iGPU, not because of its 50 TOPS NPU. It is the honest answer to “can an AI PC run a big LLM?”: yes, if you buy a $2,000-plus workstation-class machine whose value is memory capacity, and you accept single-digit to low-double-digit tokens per second. If running larger models on a laptop is the goal, the Gemma 12B on a laptop guide and our DGX Spark vs M5 Max vs RTX 6000 comparison walk the same unified-memory-versus-discrete-GPU trade-off in depth.
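Two questions decide whether a machine can host a 70B model: does it fit, and how fast can the weights be streamed. The sketch below applies both checks with the figures quoted in this guide (96 GB allocatable, ~256 GB/s, ~40 GB for a 4-bit 70B model). The 8 GB headroom for KV cache and runtime is an assumption for illustration, as are the function names.

```python
GPU_ALLOCATABLE_GB = 96.0      # Ryzen AI Max+ 395, memory assignable to the iGPU
MEMORY_BANDWIDTH_GBPS = 256.0  # 256-bit LPDDR5X-8000
MODEL_70B_Q4_GB = 40.0         # 70B at 4-bit, the text's approximate figure

# ASSUMPTION: illustrative allowance for KV cache and runtime overhead.
HEADROOM_GB = 8.0


def fits_in_memory(model_gb: float, budget_gb: float, headroom_gb: float = 0.0) -> bool:
    """True if the weights plus headroom fit in the memory budget."""
    return model_gb + headroom_gb <= budget_gb


def decode_ceiling_tok_s(bandwidth_gbps: float, model_gb: float) -> float:
    """Bandwidth-only upper bound on single-stream decode speed."""
    return bandwidth_gbps / model_gb


print("Fits in 96 GB iGPU allocation:",
      fits_in_memory(MODEL_70B_Q4_GB, GPU_ALLOCATABLE_GB, HEADROOM_GB))
print("Fits in a 16 GB Copilot+ laptop:",
      fits_in_memory(MODEL_70B_Q4_GB, 16.0, HEADROOM_GB))
print(f"Bandwidth-only ceiling: ~{decode_ceiling_tok_s(MEMORY_BANDWIDTH_GBPS, MODEL_70B_Q4_GB):.1f} tok/s")
```

The simple ceiling comes out near 6 tok/s, below the ~14 tok/s vendor and community figure, which is consistent with the caveat that the quoted number is on the optimistic side of what raw bandwidth alone predicts. Capacity is what makes the machine viable; bandwidth is what keeps it in single or low double digits.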

07 · The Platform Shift

Windows AI Foundry and the loosening NPU-only frame.

Microsoft’s own platform direction is the strongest signal that the NPU-only framing was always too narrow. At Build 2025, Microsoft announced Windows AI Foundry, the evolution of the Windows Copilot Runtime — a unified layer that selects, optimizes, fine-tunes and deploys AI models across NPU, GPU and CPU, spanning AMD, Intel, NVIDIA and Qualcomm silicon, and integrating Foundry Local with open catalogs including Ollama and NVIDIA NIMs.

Underneath it, Windows ML automatically selects the correct execution provider — QNN for Qualcomm NPUs, OpenVINO for Intel NPUs — and falls back to the GPU or CPU when an NPU provider is unavailable. Developers no longer bundle execution providers by hand. The architecture quietly concedes that the right engine depends on the workload, not on a single 40 TOPS threshold.
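The fallback behavior described above amounts to walking an ordered preference list. The sketch below illustrates that idea only; it is not Windows ML's actual selection logic, and the preference order is an assumption. The strings are ONNX Runtime execution provider names; in practice the `available` list could come from `onnxruntime.get_available_providers()`, but here it is passed in so the example runs without any dependency.

```python
from typing import Sequence

# ASSUMPTION: illustrative preference order, NPU providers first, then GPU, then CPU.
PREFERRED_PROVIDERS = (
    "QNNExecutionProvider",       # Qualcomm NPU
    "OpenVINOExecutionProvider",  # Intel NPU
    "DmlExecutionProvider",       # GPU via DirectML
    "CPUExecutionProvider",       # always-available fallback
)


def pick_provider(available: Sequence[str]) -> str:
    """Return the first preferred provider that is present on this machine."""
    for provider in PREFERRED_PROVIDERS:
        if provider in available:
            return provider
    raise RuntimeError("no usable execution provider found")


# Example provider lists for three hypothetical machines.
print(pick_provider(["QNNExecutionProvider", "CPUExecutionProvider"]))
print(pick_provider(["DmlExecutionProvider", "CPUExecutionProvider"]))
print(pick_provider(["CPUExecutionProvider"]))
```

The design point is that the same model can land on a different engine per machine, which is the workload-dependent behavior the 40 TOPS threshold alone cannot express.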

The direction of travel
Reporting on Build 2026 indicates Microsoft broadened Windows AI Foundry beyond NPU-only, officially supporting GPU and CPU inference paths — and Phi Silica on GPU entered an experimental Windows Insider release in June 2026. Treat the Build 2026 specifics as third-party coverage to verify against Microsoft directly, but the trend is clear: the 40 TOPS NPU bar was never meant to carry the full range of local AI, and the platform is opening up to the GPU and CPU to match.

That trajectory matters for buyers planning two or three years ahead. The NPU’s role is consolidating around what it does best (efficient, always-on small-model and feature inference), while the heavy LLM work migrates to whichever engine has the memory bandwidth to feed it. Expect future Windows AI features to lean on a hybrid NPU-plus-GPU split rather than the NPU alone, and expect the marketing to keep quoting TOPS long after the platform has stopped treating it as the deciding number.

08 · What To Buy

The honest buy-this-for-that matrix.

Match the machine to the job. The mistake is buying an AI PC expecting a local LLM box, or skipping one because “NPUs can’t run LLMs” when the Windows AI features are exactly what you wanted. Four clear lanes.

Battery + Windows AI: Buy a Copilot+ AI PC (pick an AI PC)
If you want all-day battery, Recall, Live Captions, Studio Effects and on-device Rewrite and Summarize, a 40-plus TOPS Copilot+ laptop is exactly right. Just don't expect it to be your 30B inference box.

Fast 30B+ local LLMs: Buy a discrete-GPU box (pick a GPU box)
For interactive 30B-plus inference, a discrete GPU with ample VRAM and GDDR bandwidth wins decisively. This is the path mainstream runtimes are actually built for: GGUF in, fast tokens out.

70B in a laptop: Buy Strix Halo unified memory (pick Ryzen AI Max+ 395)
The Ryzen AI Max+ 395 with 96 GB allocatable to the iGPU runs a quantized 70B at ~14 tok/s on-device. Workstation-class price (~$2k+), and the work happens on the iGPU, not the NPU.

Small on-device agents: Lean on Phi Silica and SLMs (pick the NPU path)
For small, private, always-on agent steps at minimal power, the NPU path through Phi Silica and 4B-class ONNX models is the most efficient option on a Copilot+ PC: silent, fast-to-first-token, battery-friendly.

For most teams, the answer is “both, for different reasons”: a Copilot+ AI PC for the laptop fleet’s battery life and Windows AI features, and a separate GPU or unified-memory machine where heavier local inference actually happens. If you are mapping local versus cloud economics across a fleet, our local AI versus cloud subscription ROI analysis frames the same trade-off in dollars, and our AI transformation engagements start with exactly this kind of hardware-and-workload mapping.

09 · Conclusion

The number on the box is the wrong number.

AI PCs in 2026, honestly

An NPU is a features chip, not a 70B box — and that's fine.

The single most useful thing to internalize before buying an AI PC in 2026 is that the headline TOPS figure answers a different question than the one you are asking. It tells you the laptop will run Windows AI features locally at low power. It tells you nothing about how fast a local LLM will run, because LLM decode is bound by memory bandwidth, and the mainstream runtimes do not even touch the NPU.

Buy a Copilot+ AI PC for what it is genuinely excellent at: all-day battery, silent on-device features, and small sustained models like Phi Silica. Buy a discrete-GPU box, or a wide-memory machine like the Ryzen AI Max+ 395, when 30B-plus inference is the job. The two are not substitutes, and the spec sheet will not tell you which one you are looking at unless you know to read past the TOPS.

The broader shift is already underway. Microsoft’s own Windows AI Foundry is opening from NPU-only toward GPU and CPU paths, conceding that one threshold can’t carry every workload. The next generation of AI PCs will be judged less by a single TOPS number and more by the boring metrics that actually decide local AI speed — memory bandwidth, usable capacity, and which engine your software will really use.

Get the local AI hardware decision right

Stop buying TOPS. Buy the bandwidth and capacity your workload actually needs.

We help teams match real hardware to real workloads — AI PCs, unified-memory machines and discrete-GPU boxes — and build the on-device and hybrid local AI stacks that actually run on them, without the TOPS marketing.

What we work on

Local & on-device AI engagements

  • Hardware-to-workload mapping — NPU, GPU, unified memory
  • On-device small-model agents with Phi, Qwen and Gemma
  • Hybrid local-plus-cloud routing for cost and privacy
  • Benchmarking real tokens per second on your machines
  • Windows AI Foundry and ONNX deployment pipelines
FAQ · AI PC and NPU guide

The questions buyers ask every week.

Can an AI PC run a 70B model?

Not on the NPU itself. The NPU on a Copilot+ PC is a fixed-function accelerator tuned for small, sustained models (around 4 billion parameters or less) and Windows AI features. A 70B model in 4-bit quantization needs roughly 40 GB of memory and far more bandwidth than a typical AI PC provides. The one laptop-class exception is the AMD Ryzen AI Max+ 395, which can allocate up to 96 GB of its 128 GB unified memory to the integrated GPU and run a quantized 70B at around 14 tokens per second, but that work happens on the iGPU and wide memory bus, not the NPU. For fast 70B inference you want a discrete GPU or a wide-memory machine.