NVIDIA released Nemotron 3 Ultra on June 4, 2026 — a 550-billion-parameter open Mixture-of-Experts reasoning model that ships not just weights but training data and recipes under a permissive Linux Foundation license. The headline isn't a new capability ceiling. It's that the strongest US-origin open-weight model now runs fast enough, and opens enough of its supply chain, to change how teams build long-running agents.

The model carries 550B total parameters with roughly 55B active per token (about 10% of parameters active, or roughly 90% sparsity) on a hybrid Mamba-Transformer architecture. On the one independent benchmark available at launch, Artificial Analysis's Intelligence Index, it scores 48 and ranks ninth of eighty-nine models evaluated. That places it above every other US open-weight model. It also places it six points below China's Kimi K2.6, which scores 54. Both facts are true at once, and the honest version of this launch holds them together rather than picking the flattering one.

This guide covers what actually shipped, the architecture behind the speed, how to read benchmark numbers that are mostly vendor-stated, the verbosity finding that complicates NVIDIA's cost claim, what the OpenMDW-1.1 license does and doesn't change, and a routing framework for deciding when Ultra is the right call versus a frontier API. Where a number is vendor-stated and not yet independently checked, we say so.

Key takeaways

01
A genuinely open 550B reasoning model shipped. Nemotron 3 Ultra is a 550B-parameter MoE with ~55B active per token, released June 4, 2026 with four checkpoints (NVFP4, BF16 instruct, BF16 base, GenRM) plus training data and recipes under the Linux Foundation's OpenMDW-1.1 license.
02
Leading US open-weight model, not the global leader. On Artificial Analysis's Intelligence Index it scores 48 (ninth of eighty-nine), ahead of all US open peers but trailing China's Kimi K2.6 at 54, a six-point gap. Describe it as the leading US-origin open-weight model, not the best open model overall.
03
Speed is the real differentiator. Artificial Analysis independently measured 140.3 tokens/second output (seventh of eighty-nine) and a 1.33-second time-to-first-token, against DeepSeek and Kimi serving roughly 50–100 tok/s. NVIDIA's vendor figures claim 4.8–5.9x throughput gains versus comparable open models on GB200.
04
The cost claim has an honest asterisk. NVIDIA states up to 30% lower cost per task. But Artificial Analysis measured the model generating 2.3x more output tokens than the median peer in the same benchmark suite. Per-task economics depend on how verbose your workload makes it, so run your own numbers.
05
1M context is architectural, not always served. The model card and NVIDIA blog confirm a 1M-token architectural window via interleaved Mamba-2 and selective Attention layers. Providers may serve a reduced context for cost and latency reasons, so confirm the served limit on whichever endpoint you deploy.

01 — What Shipped

Four checkpoints, day-zero on 25+ platforms.

NVIDIA released Nemotron 3 Ultra in four checkpoint variants on the same day: an NVFP4-quantized build, a BF16 post-trained instruct model, a BF16 base model, and a GenRM (generative reward model) variant for building reward pipelines. This is a deliberate spread — the instruct checkpoint for direct deployment, the base for further pre-training or domain adaptation, and the reward model for teams building their own reinforcement-learning loops on top.

Availability was unusually wide at launch. Day-zero access spanned OpenRouter, NVIDIA NIM, Hugging Face weight downloads, Perplexity, Together AI, Fireworks AI, DeepInfra, Amazon SageMaker JumpStart, and more than twenty additional cloud and inference providers — over twenty-five platforms in total. It builds directly on the lineage we covered in our look at NVIDIA's earlier Nemotron 3 Super 120B model, and was positioned within the agent-platform story NVIDIA laid out at Jensen Huang's Computex keynote.

Instruct

BF16 post-trained

550B total · 55B active · 1M context

The deploy-ready reasoning model. Hybrid Mamba-Transformer MoE post-trained via Multi-Teacher On-Policy Distillation. This is the checkpoint behind the independently measured benchmark and speed numbers.

huggingface.co/nvidia · BF16 instruct

Efficient

NVFP4 quantized

FP4 weights · Blackwell-native

NVFP4 quantization with E2M1 encoding and 2D block microscaling. NVIDIA states up to 5x throughput versus BF16 on Blackwell Tensor Cores, calling it their largest-scale stable FP4 training run to date.

Blackwell B200 / B300 / GB200
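To make the E2M1 idea behind the NVFP4 checkpoint concrete, here is a minimal sketch of 4-bit block quantization in plain Python. It is a simplification and not NVIDIA's NVFP4 kernel: the shared scale is kept as an ordinary Python float, the block is one-dimensional rather than 2D, and `quantize_block` and `dequantize_block` are hypothetical helper names. The one fixed fact it relies on is the set of magnitudes a 4-bit E2M1 value can represent (0, 0.5, 1, 1.5, 2, 3, 4, 6).

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_VALUES = sorted({sign * m for m in E2M1_MAGNITUDES for sign in (-1.0, 1.0)})


def quantize_block(block):
    """Map one block of floats to E2M1 values plus a shared scale (simplified)."""
    peak = max(abs(x) for x in block)
    # Scale so the largest magnitude in the block lands on 6.0, the E2M1 maximum.
    scale = peak / 6.0 if peak > 0 else 1.0
    codes = [min(E2M1_VALUES, key=lambda v: abs(v - x / scale)) for x in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from E2M1 values and the block scale."""
    return [c * scale for c in codes]


codes, scale = quantize_block([0.0, 0.3, -1.2, 6.0])
restored = dequantize_block(codes, scale)
```

With only eight magnitudes available, small values such as 0.3 snap to the nearest grid point, which is why a per-block scale matters: it keeps each block's values spread across the usable range.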

Build-on

Base + GenRM

BF16 base · reward-model variant

The base checkpoint for further pre-training or domain adaptation, plus a generative reward model for teams building their own RL pipelines. Released with 10M new SFT samples and 1M new RL tasks.

Cumulative: 50M SFT · 2M RL tasks

Release snapshot

Nemotron 3 Ultra shipped June 4, 2026 in four checkpoints across 25+ platforms. NVIDIA also released companion models — a distinct Nemotron 3.5 Content Safety (4B) model (June 2) covering 23 safety categories across 12 languages, and a Nemotron 3.5 ASR (0.6B) streaming speech model (June 4) supporting 40 language-locales. Note these are version 3.5, a separate family from Ultra — not "Ultra Content Safety." Enterprise partners named alongside the launch include Microsoft, SAP, ServiceNow, Red Hat, Palantir, CrowdStrike, Siemens, and Synopsys.

02 — Architecture

A hybrid Mamba-Transformer MoE built for long-running agents.

According to NVIDIA's technical materials, Ultra is a Mixture-of-Experts hybrid Mamba-Transformer. It interleaves Mamba-2 layers — which give sub-quadratic efficiency on long sequences — with selective Attention layers that preserve precise factual recall. That hybrid is what NVIDIA credits for making the 1M-token architectural context window tractable rather than ruinously expensive. The MoE configuration is reported at 512 experts per layer with top-22 routing, an 8,192 model dimension, and 108 layers.
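As a rough illustration of what "512 experts with top-22 routing" means, the sketch below implements generic top-k expert gating in plain Python. It is not NVIDIA's router and does not model LatentMoE; the random logits stand in for real router scores, and `top_k_route` is a hypothetical helper name. Only the expert count and the k value come from the reported configuration.

```python
import math
import random


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(512)]  # stand-in router scores
routes = top_k_route(logits, k=22)
expert_fraction = len(routes) / len(logits)  # share of experts used for this token
```

Note that 22 of 512 experts is about 4.3% of the experts, which is a different quantity from the roughly 10% of total parameters reported as active per token.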

Two engineering choices target inference speed directly. The first is Multi-Token Prediction (MTP) layers baked into the architecture for native speculative decoding — faster generation without a separate draft model. The second is LatentMoE, which NVIDIA describes as projecting tokens into a smaller latent dimension so more experts can be routed at a fixed inference cost. Both are vendor-stated design claims; the payoff that has been independently confirmed is the throughput number in the next section.
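For readers new to speculative decoding, here is a toy sketch of the generic greedy accept-or-correct rule. It is not Ultra's implementation: `toy_target` is a made-up stand-in for the full model, and the draft tokens would come from the MTP layers in NVIDIA's description. Real systems check all drafted positions in one batched forward pass; the loop below is sequential only to keep the logic readable.

```python
def speculative_step(context, draft_tokens, target_next):
    """Accept drafted tokens while they match the target model's greedy choice."""
    accepted = []
    for token in draft_tokens:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # keep the target's token, drop the remaining drafts
            return accepted
        accepted.append(token)
    # Every draft matched, so the verification pass yields one extra token for free.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(tokens):
    # Stand-in for the full model: the next token is the sequence length mod 5.
    return len(tokens) % 5


partial = speculative_step([0, 1, 2], [3, 4, 9], toy_target)
full = speculative_step([0, 1, 2], [3, 4, 0], toy_target)
```

The output always equals what the target model would have produced on its own, so the speedup depends entirely on how often the drafts are right.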

The training story is unusually open. NVIDIA states the model was trained on roughly 20 trillion tokens across a diversity-focused and a quality-focused phase, with a data cutoff of September 2025, using Megatron-LM on NVIDIA clusters between December 2025 and April 2026. Post-training used Multi-Teacher On-Policy Distillation (MOPD), where ten-plus domain-specialized teacher models score student rollouts in an asynchronous pipeline and are themselves periodically retrained from updated student checkpoints. Because NVIDIA released the SFT and RL corpora alongside the weights, that pipeline is partially reproducible — a meaningful asset for smaller labs training specialized domain agents.

Sparsity

Active per token

55B

Of 550B total parameters, roughly 55B are active per token, so about 10% of the weights are used for any given token. That is the lever behind the speed: a large knowledge store with a small per-token compute footprint, routed through 512 experts at top-22.

~10% active · 512 experts

Training

Tokens, two phases

20T

Vendor-stated ~20T-token run: ~15T diversity-focused, ~5T quality-focused, cutoff September 2025. Trained with Megatron-LM between Dec 2025 and Apr 2026. NVIDIA released the data and recipes, not just weights.

Data cutoff: Sept 2025

Languages

Natural + 43 code

Vendor-stated support for 12 natural languages — including English, French, Spanish, German, Japanese, Korean, Hindi, Brazilian Portuguese, and Chinese — plus 43 programming languages. Self-hostable on vLLM, SGLang, and TRT-LLM.

Ampere → Blackwell GPUs

Agents don't just answer once. They plan, call tools, delegate work to sub-agents, check results, and keep going across hundreds of turns. — AWS SageMaker JumpStart launch blog, June 4, 2026

03 — Context Window

1M is the architecture. Check what your provider serves.

The Hugging Face model card and NVIDIA developer blog both confirm a one-million-token architectural context window, enabled by the interleaved Mamba-2 and selective Attention design. NVIDIA reports a RULER score of 94.7% at one million tokens — a vendor-stated long-context retrieval result that, if it holds up under independent testing, would be strong.

Here is the distinction that matters for deployment, and that most launch coverage skips. The 1M figure is what the model can do architecturally. The context a given provider actually serves can be lower — endpoints frequently cap served context well below the architectural ceiling for cost and latency reasons. Do not assume every Nemotron 3 Ultra endpoint gives you a full million tokens; confirm the served limit on whichever provider you deploy against before designing a long-document pipeline around it. The architectural number is a ceiling, not a guarantee at the API boundary.

Deployment caution

Treat 1M tokens as the architectural maximum and verify the served window per endpoint. A long-context RAG pipeline designed for a million tokens will fail quietly if the provider you chose caps served context far below that — a real risk worth a five-minute check before you build.
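The five-minute check can be as simple as the sketch below, which compares a job's token budget against the window an endpoint actually serves. The served limit here is a made-up number for illustration, and `fits_served_context` and its `safety_margin` parameter are hypothetical; look up the real served limit in your provider's documentation or model listing.

```python
def fits_served_context(prompt_tokens, max_output_tokens, served_limit, safety_margin=0.05):
    """Return True if prompt plus reserved output fits under the served window."""
    budget = int(served_limit * (1.0 - safety_margin))  # leave headroom for tokenizer drift
    return prompt_tokens + max_output_tokens <= budget


ARCHITECTURAL_LIMIT = 1_000_000  # from the model card
HYPOTHETICAL_SERVED = 262_144    # made-up provider cap, for illustration only

job = {"prompt_tokens": 600_000, "max_output_tokens": 8_000}
ok_on_paper = fits_served_context(**job, served_limit=ARCHITECTURAL_LIMIT)
ok_as_served = fits_served_context(**job, served_limit=HYPOTHETICAL_SERVED)
```

Running a guard like this before each request turns a quiet truncation into an explicit failure you can route around.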

04 — Benchmarks

Where Ultra leads, and where the asterisks live.

One independent benchmark exists at launch: Artificial Analysis's Intelligence Index. Ultra scores 48, ranking ninth of eighty-nine models and sitting well above the peer average of 31. Among US-origin open-weight models it leads the field: the nearest peers are Gemma 4 31B at 39, the earlier Nemotron 3 Super at 36, and gpt-oss-120b at 33. Above it sits Kimi K2.6 at 54, the Chinese open model that currently leads it. The comparison below lays out that landscape.

AA Intelligence Index · Ultra vs open-weight peers
Source: Artificial Analysis Intelligence Index
Model	Note	Score
Kimi K2.6	China open-weight · current open leader	54
Nemotron 3 Ultra	Leading US open-weight model · #9 of 89	48
Gemma 4 31B	Nearest US open peer	39
Nemotron 3 Super	Earlier NVIDIA open model	36
gpt-oss-120b	US open-weight	33
Peer average	Across models evaluated	31

On the vendor-stated coding and agentic benchmarks, treat the numbers with appropriate care — none have been independently replicated at scale yet. The one worth reading precisely is SWE-Bench Verified. NVIDIA's materials cite a peak of 71.9%, but the underlying range across five different agent harnesses (Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent) is 65.0% to 70.4%, with the 71.9% peak attributed to an unspecified configuration. The honest figure to plan around is the 65.0–70.4% harness range, not the peak — the gap between them is a measure of how much harness choice moves the result. Other vendor-stated agentic scores (PinchBench at 90.0%, Terminal Bench 2.1 at 56.4%) carry the same caveat: stated by NVIDIA, awaiting third-party replication.

Vendor-stated benchmarks · read with caveats
Source: NVIDIA tech report + MarkTechPost · all VENDOR-STATED
Benchmark	Note	Score
SWE-Bench Verified (range)	65.0–70.4% across 5 harnesses · vendor-stated	65–70%
RULER @ 1M tokens	Long-context retrieval · vendor-stated	94.7%
PinchBench (agentic)	Vendor-stated · no third-party replication	90.0%
Terminal Bench 2.1	Vendor-stated · agentic terminal tasks	56.4%

The framing that holds all of this together: Ultra is the strongest US-origin open-weight model on the only independent benchmark we have, it is not the best open model overall, and most of its capability story is still vendor-stated. That is not a knock — it is simply where every major launch sits in its first week, before the community runs its own evals. The responsible move is to benchmark on your own workload rather than treat a launch-day table as settled.

Chinese labs have been flooding the open ecosystem with strong models while American companies (OpenAI, Anthropic, Google) keep their best systems behind APIs. — Decrypt launch analysis, June 4, 2026

05 — The Verbosity Tax

Why "30% cheaper" and 2.3x verbose can both be true.

NVIDIA states up to 30% lower cost to task completion versus open frontier models in its class, attributing the saving to fewer tokens per turn in agentic loops. That is a vendor-stated figure, not an independently audited one. And it runs straight into a finding from the same independent source that gave us the Intelligence Index: Artificial Analysis flagged Ultra as "very verbose," measuring it generating 100 million output tokens while running the Intelligence Index against a median of 43 million tokens for comparable models, roughly 2.3x more verbose.

These two facts do not cancel; they interact, and the interaction is the whole point. Output tokens are what you pay for. A model priced below a peer per token can still cost more per completed task if it emits enough extra tokens to overwhelm the per-token discount. Run the arithmetic on a worked example: take a task that costs $1.00 of output on a peer model. A 30% lower per-token price puts the same token count at $0.70. But if Ultra emits 2.3x the output tokens to finish the same task, that $0.70 becomes roughly $1.61 — more than the peer, not less. The 30% claim and the 2.3x verbosity are not contradictory; they describe different axes, and which one dominates depends entirely on how verbose your specific workload makes the model.
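The worked example above reduces to two small formulas, sketched here in Python. The function names are hypothetical, and the 30% figure is applied as a per-token discount exactly as in the illustration, which is a simplifying assumption rather than NVIDIA's own accounting.

```python
def per_task_cost(baseline_cost, price_discount, verbosity_multiplier):
    """Relative output cost per task after a per-token discount and extra tokens."""
    return baseline_cost * (1.0 - price_discount) * verbosity_multiplier


def break_even_verbosity(price_discount):
    """Token multiplier at which the per-token discount is fully cancelled."""
    return 1.0 / (1.0 - price_discount)


same_tokens = per_task_cost(1.00, 0.30, 1.0)  # 0.70: cheaper at equal token counts
adjusted = per_task_cost(1.00, 0.30, 2.3)     # about 1.61: dearer at 2.3x the tokens
threshold = break_even_verbosity(0.30)        # about 1.43x
```

The break-even point is the useful number: under these assumptions a 30% per-token discount survives only while the model emits less than about 1.43x the peer's output tokens for the same task.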

The verbosity tax · per-token price vs per-task cost
Illustrative arithmetic · NVIDIA 30% claim + AA 2.3x verbosity
Scenario	Basis	Output cost	Read
Peer model · same task	Baseline output cost	$1.00	Baseline
Ultra · per-token only	30% cheaper per token, same token count	$0.70	Looks cheaper
Ultra · verbosity-adjusted	30% cheaper × 2.3x more output tokens	~$1.61	Can cost more

The honest read

Treat the 30% cost-reduction claim as vendor-stated and application-dependent. The independently observed 2.3x verbosity means actual savings vary widely by use case, and can invert into a premium on chatty workloads. Measure tokens-per-completed-task on your own prompts before you assume Ultra is the cheaper option.

There is a more optimistic reading too, and it is worth stating to be fair. Verbosity measured on a reasoning-heavy benchmark suite is not necessarily representative of a production agent loop with tight system prompts and tool-call constraints. It is plausible that disciplined prompting narrows the gap. But "plausible" is not "audited," and the only way to know for your workload is to measure it. The point of this section is not that Ultra is expensive — it is that the cost question is genuinely open, and the vendor's framing answers only half of it.
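One way to measure it, sketched under stated assumptions: log output tokens and a success flag per run, then divide total output spend by the number of runs that actually completed. The log records below are made up for illustration, the field names are hypothetical, and the $2.50 per million output price is the OpenRouter list price quoted in this guide.

```python
def cost_per_completed_task(runs, output_price_per_million):
    """Total output spend divided by the number of runs that actually succeeded."""
    total_tokens = sum(r["output_tokens"] for r in runs)  # failed runs are still billed
    completed = sum(1 for r in runs if r["succeeded"])
    if completed == 0:
        raise ValueError("no completed tasks to average over")
    total_cost = total_tokens / 1_000_000 * output_price_per_million
    return total_cost / completed


# Made-up logs for illustration; replace with token counts from your own traffic.
sample_runs = [
    {"output_tokens": 12_000, "succeeded": True},
    {"output_tokens": 30_000, "succeeded": False},
    {"output_tokens": 18_000, "succeeded": True},
]
cost = cost_per_completed_task(sample_runs, output_price_per_million=2.50)
```

Counting failed runs in the numerator is deliberate: tokens spent on an attempt that has to be retried are part of what a completed task really costs.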

06 — Claims vs Checks

A launch-day verification ledger.

Because most of the capability story is vendor-stated, the most useful thing we can hand you is a ledger: each major claim, its source, and whether an independent check exists yet. This is the discipline we apply to any model launch before recommending it in a client AI transformation engagement. The table below is our launch-day read; the verdict column will move as the community publishes its own evals over the coming weeks.

Nemotron 3 Ultra — vendor claims vs independent verification status at launch
Claim	Source	Independent check	Verdict
140 tok/s output speed	Artificial Analysis	Yes — directly measured	Confirmed
Intelligence Index 48 (#9/89)	Artificial Analysis	Yes — independent eval	Confirmed
4.8–5.9x throughput vs open peers	NVIDIA research page	Partial — speed lead corroborated, ratio not	Plausible
SWE-Bench 71.9% peak	NVIDIA tech report	No — harness range 65.0–70.4% cited	Use the range
94.7% RULER @ 1M tokens	NVIDIA / MarkTechPost	No — not yet replicated	Needs data
90.0% PinchBench (agentic)	NVIDIA developer blog	No — vendor-only	Needs data
Up to 30% lower cost per task	NVIDIA developer blog	In tension — 2.3x verbosity observed	Varies by use

Read top to bottom, the ledger tells a consistent story. The two things independently confirmed at launch — raw speed and a top-ten intelligence ranking — are exactly the two things that make Ultra interesting for agents. The things still awaiting confirmation are the precise capability ceilings, which matter most for one-shot difficulty rather than throughput. That shape is genuinely useful: you can deploy on the confirmed strengths today while treating the vendor-stated ceilings as hypotheses to validate on your own data.

07 — The License

OpenMDW-1.1 is a licensing inflection point.

Released on May 28, 2026, OpenMDW-1.1 is a Linux Foundation permissive license built specifically for AI model artifacts — weights, code, data, and docs together — rather than software alone. On its surface terms, it grants royalty-free rights including commercial use, and it carries a patent termination clause. NVIDIA adopted it simultaneously across its Cosmos, Isaac GR00T, Ising, and Nemotron model families. One practically important surface term: model outputs are explicitly free from the license's obligations, so end-user products built on Nemotron-generated outputs are not encumbered by it.

A careful caveat, because licensing is where confident overstatement does the most damage. OpenMDW-1.1 is not Apache 2.0 and should not be described as "Apache-equivalent." It is purpose-built for model artifacts, with a patent-termination mechanism and a scope that spans data and weights, not just source code. Beyond those surface terms, the legal analysis of what OpenMDW-1.1 changes in practice — how its termination clause interacts with downstream redistribution, how its data scope is interpreted — is still immature. It is a new license from a neutral foundation, and the community's reading of its finer points will firm up over months, not days. Characterize it as "permissive with patent termination," cite the surface terms, and get your own counsel before betting a redistribution strategy on it.

We're helping establish a simpler, more consistent standard for open models at scale. — Kari Briski, VP of Generative AI, NVIDIA

08 — Access & Pricing

Where to run it, and what it costs.

Pricing landed competitively. OpenRouter lists Nemotron 3 Ultra at $0.50 per million input tokens and $2.50 per million output, with a free tier at $0/$0, and the model appears across its June listings alongside other major launches — context we tracked in our OpenRouter June 2026 model listings roundup. Artificial Analysis computes a blended rate near $0.52 per million at a typical cache/input/output mix, with cache-hit tokens discounted roughly two-thirds against input. The model is self-hostable on vLLM, SGLang, and TRT-LLM, and fine-tunable through NVIDIA's NeMo stack, across GPU families from Ampere to Blackwell.
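A blended rate is just a weighted average over token types, which the sketch below makes explicit. The input and output prices are the OpenRouter list prices quoted above; the cached price assumes the "roughly two-thirds off input" discount, and the 7:2:1 token mix is a made-up ratio, not the mix Artificial Analysis uses.

```python
def blended_price(prices, mix):
    """Weighted average price per million tokens for a given token mix."""
    total = sum(mix.values())
    return sum(prices[kind] * share for kind, share in mix.items()) / total


prices = {
    "input": 0.50,       # OpenRouter list price, $ per 1M tokens
    "output": 2.50,      # OpenRouter list price, $ per 1M tokens
    "cached": 0.50 / 3,  # assumption: roughly two-thirds off the input price
}
hypothetical_mix = {"cached": 7, "input": 2, "output": 1}  # made-up ratio, not AA's
rate = blended_price(prices, hypothetical_mix)
```

This particular mix works out to about $0.47 per million, in the same neighborhood as the quoted $0.52 but not a reproduction of it. Shifting even a little weight toward output tokens moves the blended rate quickly, which is the practical reason to plug in your own mix.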

Hosted API

OpenRouter input

$0.50/1M

Output runs $2.50 per 1M tokens, with a $0/$0 free tier for evaluation. AA's blended rate sits near $0.52 per 1M at a typical cache-heavy mix. Remember to weight output cost by your workload's verbosity, not the list price alone.

Free tier available

Self-host

Serving frameworks

Run on vLLM, SGLang, or TRT-LLM. Fine-tune via NeMo Automodel, NeMo Megatron Bridge, or NeMo RL (GRPO). Supported across Ampere A100, Hopper H100/H200, and Grace Blackwell GB200/GB300.

Open weights on Hugging Face

Cloud

Day-zero platforms

25+

NVIDIA NIM, Perplexity, Together AI, Fireworks AI, DeepInfra, Amazon SageMaker JumpStart, Nebius, and more. The NVFP4 checkpoint targets Blackwell for the highest throughput tier.

Verify served context per endpoint

09 — Routing

When Ultra is the right call, and when it isn't.

The genuine decision here is a tradeoff, not a recommendation. On independently measured output speed, Ultra runs at about 140 tok/s against roughly 50–100 tok/s for DeepSeek and Kimi (about 1.4x to 2.8x faster), but it sits six intelligence-index points beneath Kimi K2.6. For long-horizon agent tasks, faster throughput directly cuts wall-clock time and GPU-hour cost. A lower capability ceiling, though, raises the odds of a failed run that needs a retry, and a retry eats the speed advantage. The right answer depends on whether your bottleneck is latency or one-shot difficulty.
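The retry effect can be put into numbers with a simple expected-value sketch. It assumes each attempt succeeds independently with a fixed probability, so the expected number of attempts is one over that probability. The 140 tok/s figure is the measured Ultra speed; the 70 tok/s rival speed, the 50,000-token run length, and both success rates are made-up values for illustration.

```python
def expected_wall_clock(tokens_per_attempt, tokens_per_second, success_probability):
    """Expected seconds to the first successful run, with independent retries."""
    seconds_per_attempt = tokens_per_attempt / tokens_per_second
    expected_attempts = 1.0 / success_probability  # mean of a geometric distribution
    return seconds_per_attempt * expected_attempts


# 140 tok/s is the measured Ultra figure; every other number here is hypothetical.
fast_model = expected_wall_clock(50_000, 140.0, success_probability=0.60)
slow_model = expected_wall_clock(50_000, 70.0, success_probability=0.90)
```

With a 2x speed advantage, the faster model still wins on expected wall-clock time as long as its success rate stays above half the slower model's. Once it drops below that, retries consume the whole speed lead.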

Long-horizon agents

Speed-bound agent loops

Hundreds of tool-calling turns where wall-clock time and GPU-hours dominate, and individual steps are not at the edge of model capability. Ultra's measured 140 tok/s and open weights are the strongest fit here. Pair with disciplined prompts to control verbosity.

Pick Nemotron 3 Ultra

Open + sovereign

On-prem deployment

Sovereignty, sector-compliance, or data-residency requirements that rule out a closed API. Open weights plus the OpenMDW-1.1 license make Ultra a leading US-origin candidate — verify the served context window and confirm license terms with counsel first.

Pick Ultra open weights

Hardest reasoning

Raw capability ceiling

When a single hard step decides the outcome and a retry is expensive, the six-point index gap to Kimi K2.6 — and the gap to closed frontier — matters. Route ceiling-critical work to a frontier API, or to Kimi where an open model is required.

Use a frontier API

Cost-sensitive bulk

High-volume, chatty workloads

Where output volume drives the bill, the 2.3x verbosity finding means Ultra's per-token discount may not survive contact with your traffic. Measure tokens-per-task on your own prompts before committing — the answer is workload-specific.

Benchmark before you commit
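The four buckets above can be sketched as a small routing function. The `Task` fields and the order of the checks are our own illustrative assumptions, not a prescribed policy; adjust the precedence to match which constraint is hard in your environment.

```python
from dataclasses import dataclass


@dataclass
class Task:
    needs_open_weights: bool  # sovereignty, compliance, or data residency
    ceiling_critical: bool    # one hard step decides the outcome
    output_heavy: bool        # bill dominated by output volume
    long_horizon: bool        # hundreds of tool-calling turns


def route(task: Task) -> str:
    """Map a task profile to the four routing buckets described above."""
    if task.ceiling_critical:
        return "Kimi K2.6 (open)" if task.needs_open_weights else "frontier API"
    if task.needs_open_weights:
        return "Nemotron 3 Ultra open weights"
    if task.output_heavy:
        return "benchmark before you commit"
    if task.long_horizon:
        return "Nemotron 3 Ultra"
    return "benchmark before you commit"


choice = route(Task(needs_open_weights=False, ceiling_critical=False,
                    output_heavy=False, long_horizon=True))
```

Capability ceiling is checked first because a failed hard step is the most expensive mistake in this framing; the default for anything unclassified is to measure before committing.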

Our forward read: the most durable thing about this release is not the benchmark line, it is the supply-chain openness. Shipping the training data and recipes under a neutral-foundation license turns Ultra from a model you call into a pipeline you can rebuild. Over the next two quarters we expect the more consequential downstream effect to be smaller labs using NVIDIA's released SFT and RL corpora to distill specialized domain agents via the same multi-teacher method: "training data as a product" rather than weights as a product. The capability gap to Kimi K2.6 may narrow or widen from week to week; the reproducibility advantage is structural and harder to undo.

10 — Conclusion

The strongest US open model, read honestly.

The shape of open frontier, June 2026

Nemotron 3 Ultra is a speed-and-openness story, not a new capability ceiling.

NVIDIA shipped a genuinely open 550B reasoning model — weights, data, and recipes under a permissive license, four checkpoints, day-zero across twenty-five-plus platforms. On the one independent benchmark available, it is the leading US-origin open-weight model, and it runs measurably faster than its Chinese open rivals. Both of those are real, confirmed wins.

The honest asterisks are equally real. It trails Kimi K2.6 by six index points, so it is not the best open model overall. Most of its capability story is vendor-stated and awaiting community replication — cite SWE-Bench as the 65.0–70.4% harness range, not the 71.9% peak. And the 30% cost claim collides with an independently observed 2.3x verbosity, which means per-task economics are genuinely workload-dependent rather than settled in NVIDIA's favor. The 1M context is an architectural ceiling, not a served guarantee.

The practical move is the one we apply to every launch: route by task class, not by headline. Pick Ultra for speed-bound, long-horizon agents that need open weights; reach for a frontier API when a single hard step decides the run; and measure tokens-per-completed-task on your own prompts before you trust any cost claim. The most consequential thing here is the supply-chain openness — and that is the part no benchmark table captures.

NVIDIA Nemotron 3 Ultra: 550B Open Reasoning Model Live