NVIDIA's Vera CPU, unveiled at the GTC Taipei keynote on May 31, 2026 and branded "the CPU for agents," is the company's argument that the next bottleneck in AI infrastructure is not the GPU at all — it's the host processor that runs the agent orchestration loop around it. Vera carries 88 of NVIDIA's own Olympus cores and a memory subsystem tuned for the control-heavy, data-movement-intensive work that agents generate.
Why now: a single chat completion barely taxes a CPU, but an agent is different. It plans, calls tools, spins up code sandboxes, hits retrieval stores, and aggregates results — often dozens of times per request, across hundreds of concurrent sessions. Every one of those steps is CPU-bound work that gates GPU utilisation. NVIDIA's framing is blunt: CPU time compounds across the request.
This guide covers what actually launched at the keynote, the Olympus core and its new Spatial Multithreading design, how the Vera Rubin rack-scale system fits together, and — critically — how to read the wall of "10x" numbers NVIDIA published. Many of the figures below are vendor-stated and not yet independently replicated; we label each one so you can weight it accordingly.
- 01 Vera is a CPU designed around the agent loop. NVIDIA unveiled Vera at GTC Taipei (May 31, 2026) as 'the CPU for agents.' Its 88 custom Olympus cores (NVIDIA's first in-house CPU core design, ARM v9.2-compatible but not Neoverse) target orchestration, tool dispatch, and sandbox concurrency rather than raw FP throughput.
- 02 Spatial Multithreading is the genuinely novel idea. SMT-X runs two hardware threads per core by physically partitioning core resources instead of time-slicing them. Unlike Intel Hyper-Threading or AMD SMT, that preserves strong isolation between threads, the property a multi-tenant AI factory needs to pack thousands of agent sandboxes onto one socket.
- 03 The headline numbers are mostly vendor-stated. NVIDIA claims up to 10x agent throughput, 10x tokens per megawatt, and 10x lower cost per token versus the Grace Blackwell generation at pod scale. All three trace to NVIDIA's own materials and have not been independently replicated as of June 1, 2026. Treat them as direction, not verified fact.
- 04 Independent testing exists, but on a curated subset. Phoronix found Vera roughly 11% faster than AMD EPYC and 55% faster than the best single-socket Intel Xeon in geometric mean, but NVIDIA chose which workloads Phoronix could run (Python, code compilation, Java, database, memory-stream). General-purpose and legacy workloads were excluded.
- 05 It isn't shipping yet, but the adopter list is real. Commercial availability is fall 2026 with no pricing disclosed. Anthropic, OpenAI, and OCI received pre-production Vera hardware in mid-May 2026 for evaluation; OCI says it plans to deploy hundreds of thousands of units beginning in 2026. NYSE is exploring Vera for high-throughput market data, a different use case from agentic AI.
01 — What Was Announced
A keynote unveiling, not a shipping product.
NVIDIA introduced Vera at the GTC Taipei keynote on May 31, 2026, positioning it explicitly as a CPU built for AI agents and announcing it alongside the Vera Rubin platform entering full production. For the broader keynote context — new GPUs, networking, and the platform roadmap — see our first-take on NVIDIA's GTC Taipei keynote. This post drills into the CPU itself and why NVIDIA built it.
Two things are worth getting straight up front. First, Vera is the first CPU NVIDIA has designed from its own core up — the Olympus core is ARM v9.2 instruction-set compatible but architecturally distinct from the licensed Arm Neoverse cores in the previous Grace CPUs and in competitors' ARM server parts. Second, this is an unveiling, not a launch you can buy against: commercial availability is fall 2026 from system builders and cloud partners, and no pricing was disclosed in any announcement.
NVIDIA Vera
NVIDIA's first in-house CPU core design, branded 'the CPU for agents.' Up to 1.2 TB/s memory bandwidth and 1.5 TB capacity per socket (vendor-stated). Configurable 250–450W TDP. Standalone Vera racks scale to 256 CPUs.
Vera Rubin NVL72
Rack-scale system pairing Vera with the new Rubin GPUs over 6th-gen NVLink Switch. Vendor-stated 3,600 PFLOPS NVFP4 inference and 20.7 TB HBM4. The CPU and GPU share a coherent address space via NVLink-C2C.
02 — The Real Story
Why agentic workloads are CPU-bound.
Most coverage of Vera leads with specs and adopters. The more useful lens is the "why." Traditional LLM serving scales with parameter count and lives on the GPU. Agents scale differently — they scale through actions. Each additional agent step (a tool call, a sandbox execution, a retrieval, an aggregation) adds CPU-bound work that has to complete before the next GPU pass can run. As NVIDIA puts it, CPU time compounds across the request.
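To make "CPU time compounds" concrete, here is a minimal sketch of a per-request timing model. It assumes each agent step runs its host-side work and its GPU pass strictly in sequence within one session; the `Step` type and all millisecond values are made up for illustration and are not NVIDIA measurements.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One agent action: host-side work followed by a GPU pass."""
    cpu_ms: float  # orchestration, tool call, sandbox, retrieval (hypothetical)
    gpu_ms: float  # token generation for this step (hypothetical)


def request_profile(steps: list[Step]) -> dict[str, float]:
    """Sum serialized CPU and GPU time and derive GPU utilisation for one request."""
    cpu = sum(s.cpu_ms for s in steps)
    gpu = sum(s.gpu_ms for s in steps)
    total = cpu + gpu
    return {"cpu_ms": cpu, "gpu_ms": gpu, "total_ms": total, "gpu_util": gpu / total}


# A single chat completion: almost no host work.
chat = request_profile([Step(cpu_ms=2, gpu_ms=400)])

# A 20-step agent request: every step pays host overhead before the GPU runs again.
agent = request_profile([Step(cpu_ms=60, gpu_ms=40)] * 20)

print(f"chat  GPU utilisation: {chat['gpu_util']:.1%}")
print(f"agent GPU utilisation: {agent['gpu_util']:.1%}")
```

With these invented inputs the chat request keeps the GPU busy about 99.5% of the time and the agent request only 40%. Real serving stacks overlap sessions to hide some of that gap, so treat the model as a way to reason about where time goes, not as a prediction.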
The table below maps a single agent action loop — plan, dispatch, execute, retrieve, aggregate, schedule — to where the work actually lands, the kind of pressure it creates, and the Vera design choice aimed at it. It is a diagnostic frame as much as a product map: if your agent platform is starving GPUs, this is usually where the queue is forming.
| Agent loop stage | Primary load | Vera design answer |
|---|---|---|
| LLM inference (token gen) | GPU — compute & HBM bandwidth | Not the CPU's job, but the CPU has to keep the GPU fed; stalls upstream show up as idle GPU cycles. NVLink-C2C coherence keeps prompt/KV data close. |
| Tool dispatch & orchestration | CPU — thread concurrency, branchy control flow | Fan-out to many tools is control-heavy. The Olympus core's wide fetch/decode and neural branch predictor target exactly this kind of branchy, low-arithmetic code. |
| Code execution / sandboxes | CPU — per-tenant isolation at high concurrency | Thousands of concurrent sandboxes need isolation without time-slice contention. SMT-X physically partitions core resources so tenants don't interfere. |
| Retrieval / RAG lookups | CPU — memory bandwidth & pointer chasing | Graph and index traversal is bandwidth- and latency-bound. The monolithic die and graph prefetcher are pitched at indirect, pointer-chasing access patterns. |
| Output aggregation & post-processing | CPU — data movement | Marshalling tool outputs back into the prompt stresses memory bandwidth. The LPDDR5X subsystem targets high aggregate bandwidth at low power. |
| Scheduling across requests | CPU — multi-tenant thread management | Packing many agents per socket is a scheduling problem. 176 isolated threads per socket give the scheduler more independent lanes to fill. |
The interpretation worth drawing out: this is a re-balancing of the server, not a replacement of the GPU. For a year the industry has optimised GPU utilisation as the single number that matters. As agent architectures spread, the constraint quietly migrates to the host — and a GPU sitting idle waiting on tool-call orchestration is just as expensive as one that is genuinely saturated. Vera is NVIDIA's bet that the conductor, not the orchestra, is now the limiting seat.
"AI agents will be the largest users of computing. Vera is the first CPU designed for that future."— Jensen Huang, Founder & CEO, NVIDIA · GTC Taipei keynote, May 31, 2026
03 — The Core
Inside the Olympus core.
Olympus is the headline departure. Prior NVIDIA server CPUs (the Grace family) used licensed Arm Neoverse cores. Vera replaces them with NVIDIA's own microarchitecture — still ARM v9.2 instruction-set compatible, so the software story carries over, but designed specifically for control-heavy agent code rather than the dense-numeric profiles that dominate HPC.
Three design choices stand out, and all three read as a response to the agent-loop workload above rather than to a benchmark leaderboard:
- 10-wide instruction fetch/decode. A wide front end keeps many instructions in flight, which matters for the branchy, low-arithmetic-intensity code that orchestration and tool-dispatch logic tends to be.
- Neural branch predictor. NVIDIA states it can sustain two taken branches per cycle with zero penalty — aimed squarely at the unpredictable control flow of agent decision trees and interpreted languages.
- Graph prefetcher for indirect access. A prefetcher tuned for pointer-chasing and graph traversal — the memory access pattern retrieval and RAG produce — rather than the linear streaming that classical prefetchers assume.
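The difference between linear streaming and pointer chasing is easier to see as a dependency pattern than as a description. The sketch below is only an illustration of the two access shapes; the function names are ours, and Python hides the actual cache and prefetcher behaviour, so it says nothing about how Olympus performs.

```python
import random


def streaming_sum(values: list[int]) -> int:
    """Linear scan: the address of element i+1 is known before element i is loaded."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def pointer_chase(next_index: list[int], start: int, hops: int) -> int:
    """Indirect access: each load's address is the result of the previous load."""
    i = start
    for _ in range(hops):
        i = next_index[i]
    return i


def make_cycle(n: int, seed: int = 0) -> list[int]:
    """Build a single random cycle over n slots so a chase visits every slot once."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    next_index = [0] * n
    for pos, node in enumerate(order):
        next_index[node] = order[(pos + 1) % n]
    return next_index


chain = make_cycle(8)
# After exactly n hops around a single cycle we are back where we started.
print(pointer_chase(chain, start=0, hops=8))
print(streaming_sum(list(range(8))))
```

In the streaming loop a conventional stride prefetcher can run ahead because the next address is predictable. In the chase, the next address does not exist until the current load returns, which is the pattern graph and index traversal tends to produce.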
Olympus cores
NVIDIA's first in-house CPU core, ARM v9.2-compatible (not Neoverse). 10-wide fetch/decode, neural branch predictor, graph prefetcher — all tuned for control-heavy, data-movement-intensive agent code.
Monolithic die, no chiplets
A single-die design with a 3.4 TB/s bisection on-chip fabric (vendor-stated). No chiplet boundaries means no cross-die latency penalty on the pointer-chasing and graph workloads agents generate.
vs NVIDIA Grace
NVIDIA states roughly 50% higher instructions-per-cycle than its own prior-gen Grace CPU. Vendor-stated and measured by NVIDIA — treat as a generational direction, not an audited number.
04 — Spatial Multithreading
SMT-X: partitioning, not time-slicing.
The most genuinely new idea in Vera is how it does multithreading. Intel Hyper-Threading and AMD SMT run two threads per core by time-sharing the same execution units — when both threads want the same resource, one waits. That contention is fine for throughput servers and a problem for multi-tenant isolation, because one tenant's burst can degrade another's latency.
Vera's Spatial Multithreading (SMT-X) instead physically partitions core resources between the two threads. Each thread gets its own slice of the core rather than competing for a shared pool. NVIDIA frames this as a runtime tradeoff: you can bias a core toward two-thread throughput or toward single-thread performance. The practical payoff is isolation — a requirement, not a nicety, when you are tenanting thousands of independent agent sandboxes on one socket.
At 88 cores and two SMT-X threads each, a Vera socket exposes up to 176 hardware threads with strong inter-thread isolation. That is the architectural underpinning behind the rack-level concurrency claims further down.
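As a rough sketch of what 176 lanes mean for a scheduler, the snippet below assigns sandboxes to (core, hardware thread) slots. The placement policy, filling one thread on every core before doubling up, is a hypothetical choice of ours and not something NVIDIA has described; only the core and thread counts come from the announcement.

```python
CORES_PER_SOCKET = 88      # from the article
THREADS_PER_CORE = 2       # SMT-X, from the article


def place_sandboxes(n_sandboxes: int) -> list[tuple[int, int]]:
    """Assign each sandbox a (core, hw_thread) lane.

    Hypothetical policy: fill thread 0 on every core first, then thread 1,
    so a core is only shared once every core already has one tenant.
    """
    lanes = CORES_PER_SOCKET * THREADS_PER_CORE
    if n_sandboxes > lanes:
        raise ValueError(f"{n_sandboxes} sandboxes exceed {lanes} hardware threads")
    return [(i % CORES_PER_SOCKET, i // CORES_PER_SOCKET) for i in range(n_sandboxes)]


placement = place_sandboxes(100)
shared_cores = {core for core, thread in placement if thread == 1}
print(f"lanes per socket: {CORES_PER_SOCKET * THREADS_PER_CORE}")
print(f"cores running two tenants: {len(shared_cores)}")
```

With 100 sandboxes, 12 cores end up hosting two tenants and the other 76 host one. Whether doubling up costs anything is exactly the isolation question SMT-X is meant to answer, and it is worth measuring rather than assuming.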
05 — Memory & Fabric
Bandwidth, capacity, and a coherent link to the GPU.
Agents move data constantly — between tools, retrieval stores, and the GPU — so the memory subsystem is as central to Vera's thesis as the cores. NVIDIA pairs an LPDDR5X subsystem with a second-generation NVLink-C2C link to the GPU, and the design trades raw DDR5 peak for bandwidth-per-watt.
- Up to 1.2 TB/s aggregate bandwidth and up to 1.5 TB capacity per socket (vendor-stated). NVIDIA characterises this as roughly 3x the per-core bandwidth and 2x the total bandwidth of leading x86 CPUs on DDR5.
- SOCAMM modules. Unlike soldered LPDDR designs, Vera's LPDDR5X uses detachable, field-replaceable SOCAMM modules, a server-class flexibility win that allows on-site capacity upgrades.
- Power efficiency. NVIDIA states the entire LPDDR5X subsystem draws under 30W versus over 100W for a DDR5 rack equivalent. Combined with a configurable 250–450W socket TDP, the memory subsystem is pitched as the efficiency story.
- NVLink-C2C (gen 2): 1.8 TB/s coherent CPU-GPU bandwidth in the Vera Rubin NVL72 configuration (vendor-stated). This unifies the address space across CPU and GPU memory, supporting KV-cache offload and multi-model execution directly from CPU-attached DRAM.
The coherent link is the subtle one. If CPU and GPU share an address space, you can park a model's KV cache in the CPU's much larger DRAM pool and stream it to the GPU on demand — which is exactly what long-context and multi-model serving need. That capability is where the CPU stops being a chaperone for the GPU and starts being part of the memory hierarchy itself.
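A back-of-envelope sizing shows why the larger DRAM pool matters for KV-cache offload. The sketch uses the standard transformer KV-cache size formula; the model shape (80 layers, 8 KV heads, head dimension 128, 128,000 tokens, 2-byte values) is hypothetical, and the capacity and bandwidth constants are the vendor-stated peaks quoted above, treated here as best-case inputs.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Standard transformer KV-cache size: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
per_session = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=128_000)

CPU_DRAM_BYTES = 1.5e12    # up to 1.5 TB per socket (vendor-stated)
C2C_BYTES_PER_S = 1.8e12   # 1.8 TB/s NVLink-C2C (vendor-stated peak)

sessions_in_dram = int(CPU_DRAM_BYTES // per_session)
stream_ms = per_session / C2C_BYTES_PER_S * 1e3

print(f"KV cache per session: {per_session / 1e9:.1f} GB")
print(f"sessions that fit in CPU DRAM: {sessions_in_dram}")
print(f"best-case time to stream one cache to the GPU: {stream_ms:.1f} ms")
```

Under these assumptions one long-context session needs about 42 GB, 35 such sessions fit in a socket's DRAM, and moving one cache across the link takes roughly 23 ms at peak rate. Real numbers will be lower on capacity (the OS and other data share that DRAM) and slower on transfer (peak bandwidth is rarely sustained).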
06 — Rack Scale
Vera Rubin NVL72 at full scale.
Vera is sold in two shapes. A standalone Vera rack scales to up to 256 CPUs (paired with BlueField-4 storage DPUs and ConnectX-9 SuperNICs), which NVIDIA says supports over 22,500 concurrent sandbox environments — the pure agent-orchestration play. The flagship, though, is the Vera Rubin NVL72, which pairs Vera with the new Rubin GPUs. For the rack-system roadmap and cloud-deployment angle specifically, see our coverage of Google Cloud's Vera Rubin NVL72 plans.
72 Rubin GPUs + 36 Vera CPUs
The NVL72 pairs 72 Rubin GPUs with 36 Vera CPUs over a 6th-generation NVLink Switch — a 2:1 GPU-to-CPU ratio that reflects how much orchestration the agent era expects.
3,600 PFLOPS NVFP4, vendor-stated
NVIDIA states 3,600 PFLOPS of NVFP4 inference and 2,520 PFLOPS training across the rack, with 20.7 TB of HBM4. All figures are vendor-stated and not independently verified.
256 CPUs per rack
Without Rubin GPUs, a Vera rack scales to 256 CPUs supporting over 22,500 concurrent sandbox environments (vendor-stated) — the configuration aimed at pure agent orchestration at hyperscale.
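The rack-level figures are easier to sanity-check when reduced to per-device numbers. The arithmetic below uses only the vendor-stated values quoted in this section; the per-GPU and per-core results are our own division, not figures NVIDIA published, and terabytes are converted to gigabytes at 1,000:1.

```python
# All inputs are the vendor-stated figures quoted in this article.
GPUS, CPUS = 72, 36
INFERENCE_PFLOPS, TRAINING_PFLOPS = 3600, 2520
HBM4_TB = 20.7

STANDALONE_CPUS = 256
CORES_PER_CPU = 88
SANDBOXES = 22_500  # quoted as "over 22,500"

print(f"GPUs per CPU:            {GPUS / CPUS:.0f}")
print(f"NVFP4 inference per GPU: {INFERENCE_PFLOPS / GPUS:.0f} PFLOPS")
print(f"NVFP4 training per GPU:  {TRAINING_PFLOPS / GPUS:.0f} PFLOPS")
print(f"HBM4 per GPU:            {HBM4_TB * 1000 / GPUS:.1f} GB")

rack_cores = STANDALONE_CPUS * CORES_PER_CPU
print(f"cores per standalone rack: {rack_cores}")
print(f"sandboxes per core:        {SANDBOXES / rack_cores:.2f}")
```

One observation falls out of the division: 256 sockets of 88 cores is 22,528 cores, so the "over 22,500" sandbox figure works out to roughly one sandbox per physical core. That is our inference from the arithmetic; NVIDIA has not said how the number was derived.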
One clarification matters for the rack architecture. NVIDIA's materials reference a "Groq 3 LPX" ultra-low-latency inference tier layered below the Rubin GPU tier in the full POD. This is a component tier within NVIDIA's own architecture for real-time agentic inference, not the standalone Groq Cloud service from the company of a similar name. Any throughput-per-watt figures attached to that tier are, again, vendor-stated.
07 — Read The Claims
The 10x triad, and what we can actually verify.
Three of NVIDIA's most quotable numbers all arrive together in the same announcement: up to 10x agent throughput, 10x more tokens per megawatt, and 10x lower cost per token for interactive reasoning workloads, each measured at pod scale against the previous Grace Blackwell generation. They are worth knowing and worth discounting in equal measure: every one of the three traces back to NVIDIA's own newsroom, product pages, and technical blog, and none has been independently replicated as of June 1, 2026.
The honest framing is that a triple "10x" in a single launch, all comparing a new pod to a one-generation-old pod, is a marketing artifact until a third party reproduces it on a workload you recognise. The most useful number in the whole release is the one NVIDIA did not fully control: the Phoronix benchmark.
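To see why workload selection matters for a geometric-mean result, consider the sketch below. Every speedup value is invented for illustration and none of it is Phoronix data; the workload names only loosely mirror the categories mentioned in this article, and the two "excluded" entries are hypothetical.

```python
from statistics import geometric_mean

# Hypothetical per-workload speedups (new / baseline); not Phoronix data.
speedups = {
    "python": 1.30,
    "compile": 1.20,
    "java": 1.15,
    "database": 1.10,
    "memory_stream": 1.40,
    "legacy_simd": 0.80,       # a workload class a curated run might leave out
    "general_purpose": 0.95,   # likewise hypothetical
}

curated = ["python", "compile", "java", "database", "memory_stream"]

lead_curated = geometric_mean([speedups[w] for w in curated]) - 1
lead_all = geometric_mean(list(speedups.values())) - 1

print(f"lead on curated subset: {lead_curated:.1%}")
print(f"lead on all workloads:  {lead_all:.1%}")
```

The point is structural rather than numerical: a geometric mean is honest about the workloads it includes and silent about the ones it does not. When reading any curated benchmark, ask which workloads from your own fleet are missing from the list.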
Figure: Independent Phoronix testing, Vera lead in geometric mean. Source: Phoronix review; NVIDIA selected which workloads were tested.

One networking claim deserves a harder hedge still. NVIDIA's companion Spectrum-X co-packaged-optics switches are quoted with very large improvement multipliers, including a "63x" figure for signal integrity. The methodology behind that specific multiplier is not disclosed, and an order-of-magnitude claim with no published basis should be treated as a marketing figure rather than an engineering standard. We mention it only to note that it exists and to caution against repeating it as fact.
For teams that have to turn these numbers into a budget, the framing we use is straightforward: a vendor "cost per token" claim is an input to your own model, never the conclusion. Pair any Vera evaluation with a real inference cost optimization pass on your own traffic before you let a 10x headline move a capex line.
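One way to treat a vendor multiplier as an input is to sweep it instead of accepting the headline value. The sketch below assumes you already have a measured baseline; the hourly cost and throughput are placeholders, and because Vera pricing is undisclosed it holds hourly cost constant, which is itself an optimistic assumption.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Fully loaded hourly cost divided by measured throughput, per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical baseline measured on your own traffic.
baseline = cost_per_million_tokens(hourly_cost_usd=98.0, tokens_per_second=12_000)
print(f"baseline: ${baseline:.3f} per 1M tokens")

# Sweep the claimed improvement instead of assuming the headline value.
for claimed_gain in (1.5, 3.0, 10.0):
    projected = baseline / claimed_gain
    print(f"{claimed_gain:>4}x gain -> ${projected:.3f} per 1M tokens")
```

If the decision only makes sense at the 10x row, it is resting on the least verified number in the announcement.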
08 — Adopters & Availability
Who has it, and when you can buy it.
The adopter list is the most concrete part of the announcement, because it is partly verifiable. Anthropic, OpenAI, and SpaceXAI received first Vera hardware on May 17, 2026, and Oracle Cloud Infrastructure on May 20 — these are pre-production deliveries for evaluation, not general-availability shipments. CoreWeave, Lambda, Nebius, and Nscale are named among cloud-provider adopters (ByteDance also appears on NVIDIA's list). OCI says it plans to deploy hundreds of thousands of Vera CPUs beginning in 2026 — itself an OCI-sourced, forward-looking statement.
The surprise name is the New York Stock Exchange. NYSE is exploring Vera — in partnership with HPE and Redpanda — to scale the infrastructure behind the 1.1 trillion messages a day it says it processes. That is a high-throughput, low-latency market-data workload, not agentic AI. Read it as NVIDIA deliberately widening Vera's addressable market beyond AI labs, not as a stock exchange running agents.
| Channel | Who | Notes |
|---|---|---|
| System builders | Dell, HPE, Lenovo, Supermicro, Cisco, others | On-prem and colo deployments. HPE has already shown the ProLiant Compute DL394 Gen12 on Vera. General availability fall 2026; no pricing disclosed. |
| Cloud providers | OCI, CoreWeave, Lambda, Nebius, Nscale | Rent before you buy. OCI plans hyperscale deployment from 2026; cloud access is the lowest-commitment way to benchmark Vera on your own agent traffic. |
| Pre-production (labs) | Anthropic, OpenAI, SpaceXAI, OCI | Early evaluation hardware delivered mid-May 2026. Not GA — useful as a signal of who is taking the agent-CPU thesis seriously, not a buying channel. |
For teams weighing whether agent-class silicon is worth the capital outlay at all, the deciding analysis is rarely the spec sheet — it is total cost of ownership against your actual utilisation. Our writeup on self-hosted AI infrastructure TCO lays out the model: amortised hardware, power, staffing, and utilisation, set against managed-cloud pricing. Hardware this specialised only pays off above a utilisation threshold most teams haven't hit yet.
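The utilisation threshold can be estimated with a few lines of arithmetic. This is a simplified sketch of that kind of model, not the full TCO method: every input below is a placeholder, Vera pricing has not been disclosed, and the comparison assumes the cloud alternative is billed only for hours actually used.

```python
def breakeven_utilisation(capex_usd: float, amortisation_years: float,
                          annual_opex_usd: float, cloud_usd_per_hour: float) -> float:
    """Fraction of hours owned hardware must be busy to match renting the same capacity."""
    hours_per_year = 24 * 365
    annual_owned = capex_usd / amortisation_years + annual_opex_usd
    annual_cloud_at_full_use = cloud_usd_per_hour * hours_per_year
    return annual_owned / annual_cloud_at_full_use


# Hypothetical inputs for illustration only.
threshold = breakeven_utilisation(
    capex_usd=3_000_000,
    amortisation_years=4,
    annual_opex_usd=450_000,   # power, staffing, colocation
    cloud_usd_per_hour=250,
)
print(f"break-even utilisation: {threshold:.0%}")
```

With these placeholder inputs, owning only wins above roughly 55% sustained utilisation. A result above 100% would mean renting is cheaper at any load.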
"The CPU is now the conductor, and the GPU is the orchestra."— Jensen Huang, Founder & CEO, NVIDIA · GTC Taipei 2026 keynote
09 — Implications
What this means for agencies and engineering teams.
For nearly everyone reading this in 2026, Vera is not a near-term purchase — it is a signal about where the constraint is moving. That signal still changes a few decisions today, even before any hardware lands.
Profile the host, not just the GPU
If your agents fan out to many tools and sandboxes, your GPU-idle time is probably orchestration overhead. Measure CPU saturation and sandbox scheduling now — the bottleneck Vera targets is one you can diagnose on the hardware you already own.
Treat 10x as a hypothesis
The throughput, tokens-per-megawatt, and cost-per-token claims are vendor-stated and pod-scale. Don't move a capex line on a launch slide. Wait for independent benchmarks on workloads you recognise, then model TCO against your real utilisation.
Isolation is the differentiator
SMT-X's physical partitioning is the one feature with a clear, defensible advantage for multi-tenant agent isolation. If you run untrusted code per tenant, that property — not the headline throughput — is what to test for first.
Optimise the stack you have
Vera ships in fall 2026 with no pricing. The higher-leverage move today is cutting the inference and orchestration cost of your current stack — better caching, routing, and sandbox reuse — which pays off regardless of what silicon you buy later.
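Profiling the host can start with something as small as a timing wrapper around each stage of the loop. The sketch below is one minimal way to do it using only the standard library; the stage names are ours, and the `time.sleep` calls are stand-ins for your real planner, tool, sandbox, and model calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds: dict[str, float] = defaultdict(float)


@contextmanager
def stage(name: str):
    """Accumulate wall-clock time spent in one named stage of the agent loop."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[name] += time.perf_counter() - start


def run_agent_step() -> None:
    # Placeholders: replace the sleeps with your real tools, sandbox, and model call.
    with stage("tool_dispatch"):
        time.sleep(0.02)
    with stage("sandbox_exec"):
        time.sleep(0.03)
    with stage("gpu_inference"):
        time.sleep(0.01)


for _ in range(3):
    run_agent_step()

total = sum(stage_seconds.values())
for name, seconds in sorted(stage_seconds.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14} {seconds / total:6.1%}")
```

Wall-clock time per stage is a blunt instrument, but it is usually enough to show whether the GPU call is the largest slice or a minority of the request. If host-side stages dominate, that is the bottleneck this generation of hardware is aimed at, and it can often be reduced in software first.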
Projecting forward: if the agent-as-primary-workload thesis holds, the GPU-to-CPU ratio in AI servers will keep falling, and the host processor becomes a first-class part of the inference budget rather than a rounding error. Vera is the first chip to make that argument in silicon. Whether NVIDIA's specific numbers survive contact with independent testing is a separate question from whether the underlying shift is real — and the shift looks real. The teams that benefit earliest will be the ones already instrumenting where their agent loops spend time, so they can recognise the bottleneck when the hardware arrives to address it.
10 — Conclusion
A CPU built for the agent era.
The bottleneck is moving from the GPU to the host — and Vera is the first chip built around that.
NVIDIA's Vera CPU is the clearest statement yet that agentic workloads change what a server needs. By designing its own Olympus core, partitioning threads spatially rather than time-slicing them, and pairing the whole thing with a coherent link into the GPU's memory space, NVIDIA is building for a world where the orchestration loop — not the matrix multiply — is on the critical path.
The discipline to apply is the same one we apply to every vendor launch: separate the architecture from the arithmetic. The architecture (CPU-bound agent loops, physical thread isolation, coherent CPU-GPU memory) is a genuine and well-argued response to a real shift. The arithmetic is a different matter: the triple "10x" and the "63x" networking multiplier are vendor-stated and unverified, and the one independent result, from Phoronix, covers a subset of workloads that NVIDIA selected. Weight each accordingly until broader independent results land.
For most teams the right move is not to wait for fall 2026 hardware but to act on the insight now: instrument your agent loops, find where CPU time compounds, and optimise the stack you already run. The shift Vera is built for is already underway in your own traffic — the silicon is just catching up to it.