NVIDIA released Nemotron 3 Ultra on June 4, 2026 — a 550-billion-parameter open Mixture-of-Experts reasoning model that ships not just weights but training data and recipes under a permissive Linux Foundation license. The headline isn't a new capability ceiling. It's that the strongest US-origin open-weight model now runs fast enough, and opens enough of its supply chain, to change how teams build long-running agents.
The model carries 550B total parameters with roughly 55B active per token (about 10% of the weights in use at any step) on a hybrid Mamba-Transformer architecture. On the one independent benchmark available at launch, Artificial Analysis's Intelligence Index, it scores 48 and ranks ninth of eighty-nine models evaluated. That places it above every other US open-weight model. It also places it six points below China's Kimi K2.6, which scores 54. Both facts are true at once, and the honest version of this launch holds them together rather than picking the flattering one.
This guide covers what actually shipped, the architecture behind the speed, how to read benchmark numbers that are mostly vendor-stated, the verbosity finding that complicates NVIDIA's cost claim, what the OpenMDW-1.1 license does and doesn't change, and a routing framework for deciding when Ultra is the right call versus a frontier API. Where a number is vendor-stated and not yet independently checked, we say so.
- 01. A genuinely open 550B reasoning model shipped. Nemotron 3 Ultra is a 550B-parameter MoE with ~55B active per token, released June 4, 2026 with four checkpoints (NVFP4, BF16 instruct, BF16 base, GenRM) plus training data and recipes under the Linux Foundation's OpenMDW-1.1 license.
- 02. Leading US open-weight model, not the global leader. On Artificial Analysis's Intelligence Index it scores 48 (ninth of eighty-nine), ahead of all US open peers but trailing China's Kimi K2.6 at 54, a six-point gap. Describe it as the leading US-origin open-weight model, not the best open model overall.
- 03. Speed is the real differentiator. Artificial Analysis independently measured 140.3 tokens/second output (seventh of eighty-nine) and a 1.33-second time-to-first-token, against DeepSeek and Kimi serving roughly 50–100 tok/s. NVIDIA's vendor figures claim 4.8–5.9x throughput gains versus comparable open models on GB200.
- 04. The cost claim has an honest asterisk. NVIDIA states up to 30% lower cost per task. But Artificial Analysis measured the model generating 2.3x more output tokens than the median peer in the same benchmark suite. Per-task economics depend on how verbose your workload makes it, so run your own numbers.
- 05. 1M context is architectural, not always served. The model card and NVIDIA blog confirm a 1M-token architectural window via interleaved Mamba-2 and selective Attention layers. Providers may serve a reduced context for cost and latency reasons, so confirm the served limit on whichever endpoint you deploy.
01 — What Shipped
Four checkpoints, day-zero on 25+ platforms.
NVIDIA released Nemotron 3 Ultra in four checkpoint variants on the same day: an NVFP4-quantized build, a BF16 post-trained instruct model, a BF16 base model, and a GenRM (generative reward model) variant for building reward pipelines. This is a deliberate spread — the instruct checkpoint for direct deployment, the base for further pre-training or domain adaptation, and the reward model for teams building their own reinforcement-learning loops on top.
Availability was unusually wide at launch. Day-zero access spanned OpenRouter, NVIDIA NIM, Hugging Face weight downloads, Perplexity, Together AI, Fireworks AI, DeepInfra, Amazon SageMaker JumpStart, and more than twenty additional cloud and inference providers — over twenty-five platforms in total. It builds directly on the lineage we covered in our look at NVIDIA's earlier Nemotron 3 Super 120B model, and was positioned within the agent-platform story NVIDIA laid out at Jensen Huang's Computex keynote.
BF16 post-trained
The deploy-ready reasoning model. Hybrid Mamba-Transformer MoE post-trained via Multi-Teacher On-Policy Distillation. This is the checkpoint behind the independently measured benchmark and speed numbers.
NVFP4 quantized
NVFP4 quantization with E2M1 encoding and 2D block microscaling. NVIDIA states up to 5x throughput versus BF16 on Blackwell Tensor Cores, calling it their largest-scale stable FP4 training run to date.
Base + GenRM
The base checkpoint for further pre-training or domain adaptation, plus a generative reward model for teams building their own RL pipelines. Released with 10M new SFT samples and 1M new RL tasks.
02 — Architecture
A hybrid Mamba-Transformer MoE built for long-running agents.
According to NVIDIA's technical materials, Ultra is a Mixture-of-Experts hybrid Mamba-Transformer. It interleaves Mamba-2 layers — which give sub-quadratic efficiency on long sequences — with selective Attention layers that preserve precise factual recall. That hybrid is what NVIDIA credits for making the 1M-token architectural context window tractable rather than ruinously expensive. The MoE configuration is reported at 512 experts per layer with top-22 routing, an 8,192 model dimension, and 108 layers.
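To make the "512 experts, top-22 routing" figure concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not NVIDIA's router: the random logits stand in for a learned gating network, and `route_token` is a name we made up for illustration. Only the two constants come from the reported configuration.

```python
import math
import random

NUM_EXPERTS = 512  # experts per layer, as reported for Ultra
TOP_K = 22         # experts selected per token, as reported for Ultra


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the top_k largest logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only (subtract the max for stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)  # assumption: random scores replace a learned router
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)
print(len(assignment), "of", NUM_EXPERTS, "experts run for this token")
```

Note that 22 of 512 is about 4% of the experts, which is not the same quantity as the roughly 10% of parameters active per token; the two figures measure different things.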
Two engineering choices target inference speed directly. The first is Multi-Token Prediction (MTP) layers baked into the architecture for native speculative decoding — faster generation without a separate draft model. The second is LatentMoE, which NVIDIA describes as projecting tokens into a smaller latent dimension so more experts can be routed at a fixed inference cost. Both are vendor-stated design claims; the payoff that has been independently confirmed is the throughput number in the next section.
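The idea behind speculative decoding can be sketched with a toy draft-and-verify loop. This is the generic greedy-acceptance scheme, not NVIDIA's MTP implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for the full model and the cheap predictor, and a real system verifies all drafted tokens in one batched forward pass rather than one call per token.

```python
def target_next(context):
    """Hypothetical 'full model': next token is the sum of the last two, mod 10."""
    return (context[-1] + context[-2]) % 10


def draft_next(context):
    """Hypothetical cheap predictor: matches the target except it always guesses 0 after a 7."""
    if context[-1] == 7:
        return 0
    return target_next(context)


def speculative_step(context, num_draft=4):
    """Draft num_draft tokens, keep the prefix the target agrees with,
    then append one token produced by the target itself."""
    drafted, ctx = [], list(context)
    for _ in range(num_draft):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    accepted, ctx = [], list(context)
    for token in drafted:
        if target_next(ctx) != token:
            break  # first disagreement: discard the rest of the draft
        accepted.append(token)
        ctx.append(token)
    # The target always contributes one token, so every step makes progress.
    accepted.append(target_next(ctx))
    return accepted


def generate(context, length, num_draft=4):
    """Generate `length` tokens; return them with the number of verify steps used."""
    out, steps = list(context), 0
    while len(out) - len(context) < length:
        out.extend(speculative_step(out, num_draft))
        steps += 1
    return out[len(context):len(context) + length], steps


tokens, steps = generate([1, 1], 30)
print(f"{len(tokens)} tokens in {steps} verify steps")
```

The output is identical to decoding with the target alone; the saving is in the number of sequential steps, which is why acceptance rate matters more than draft length.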
The training story is unusually open. NVIDIA states the model was trained on roughly 20 trillion tokens across a diversity-focused and a quality-focused phase, with a data cutoff of September 2025, using Megatron-LM on NVIDIA clusters between December 2025 and April 2026. Post-training used Multi-Teacher On-Policy Distillation (MOPD), where ten-plus domain-specialized teacher models score student rollouts in an asynchronous pipeline and are themselves periodically retrained from updated student checkpoints. Because NVIDIA released the SFT and RL corpora alongside the weights, that pipeline is partially reproducible — a meaningful asset for smaller labs training specialized domain agents.
~55B active per token
Of 550B total parameters, roughly 55B are active per token, about 10% of the total. That is the lever behind the speed: a large knowledge store with a small per-token compute footprint, routed through 512 experts at top-22.
~20T tokens, two phases
Vendor-stated ~20T-token run: ~15T diversity-focused, ~5T quality-focused, cutoff September 2025. Trained with Megatron-LM between Dec 2025 and Apr 2026. NVIDIA released the data and recipes, not just weights.
12 natural + 43 code languages
Vendor-stated support for 12 natural languages — including English, French, Spanish, German, Japanese, Korean, Hindi, Brazilian Portuguese, and Chinese — plus 43 programming languages. Self-hostable on vLLM, SGLang, and TRT-LLM.
Agents don't just answer once. They plan, call tools, delegate work to sub-agents, check results, and keep going across hundreds of turns. — AWS SageMaker JumpStart launch blog, June 4, 2026
03 — Context Window
1M is the architecture. Check what your provider serves.
The Hugging Face model card and NVIDIA developer blog both confirm a one-million-token architectural context window, enabled by the interleaved Mamba-2 and selective Attention design. NVIDIA reports a RULER score of 94.7% at one million tokens — a vendor-stated long-context retrieval result that, if it holds up under independent testing, would be strong.
Here is the distinction that matters for deployment, and that most launch coverage skips. The 1M figure is what the model can do architecturally. The context a given provider actually serves can be lower — endpoints frequently cap served context well below the architectural ceiling for cost and latency reasons. Do not assume every Nemotron 3 Ultra endpoint gives you a full million tokens; confirm the served limit on whichever provider you deploy against before designing a long-document pipeline around it. The architectural number is a ceiling, not a guarantee at the API boundary.
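A pre-flight check along these lines keeps a pipeline from silently assuming the full window. This is a sketch under assumptions: the 256K served limit is an invented example, `fits_served_context` is our own helper name, and whether prompt and output share one window depends on the provider.

```python
ARCHITECTURAL_LIMIT = 1_000_000  # tokens, per the model card


def fits_served_context(prompt_tokens, max_output_tokens, served_limit):
    """Return True if the prompt plus reserved output fits the provider's served window."""
    if served_limit > ARCHITECTURAL_LIMIT:
        raise ValueError("served limit cannot exceed the architectural window")
    # Assumption: prompt and output tokens count against the same window.
    return prompt_tokens + max_output_tokens <= served_limit


# Hypothetical provider serving a 256K window (an assumed figure, not a quoted one).
print(fits_served_context(300_000, 8_000, served_limit=256_000))  # False
print(fits_served_context(200_000, 8_000, served_limit=256_000))  # True
```

Read the served limit from your provider's own model listing rather than hard-coding it, since it can change after launch.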
04 — Benchmarks
Where Ultra leads, and where the asterisks live.
One independent benchmark exists at launch: Artificial Analysis's Intelligence Index. Ultra scores 48, ranking ninth of eighty-nine models and sitting well above the peer average of 31. Among US-origin open-weight models it leads the field — the nearest peers are Gemma 4 31B at 39, the earlier Nemotron 3 Super at 36, and gpt-oss-120b at 33. Above it sits Kimi K2.6 at 54. The chart below reads that landscape honestly: orange marks Ultra, blue marks the Chinese open model that currently leads it.
[Chart: AA Intelligence Index · Ultra vs open-weight peers. Source: Artificial Analysis Intelligence Index]
On the vendor-stated coding and agentic benchmarks, treat the numbers with appropriate care: none have been independently replicated at scale yet. The one worth reading precisely is SWE-Bench Verified. NVIDIA's materials cite a peak of 71.9%, but the underlying range across five different agent harnesses (Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent) is 65.0% to 70.4%, with the 71.9% peak attributed to an unspecified configuration. The honest figure to plan around is the 65.0–70.4% harness range, not the peak; the width of that range is a measure of how much harness choice moves the result. Other vendor-stated agentic scores (PinchBench at 90.0%, Terminal Bench 2.1 at 56.4%) carry the same caveat: stated by NVIDIA, awaiting third-party replication.
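Planning around a range rather than a peak is easy to put into numbers. The sketch below uses only the vendor-stated SWE-Bench Verified figures quoted above; the batch size of 200 tasks and the helper name `planning_band` are our own illustrative choices.

```python
# Vendor-stated SWE-Bench Verified figures quoted above (percent resolved).
HARNESS_LOW, HARNESS_HIGH, PEAK = 65.0, 70.4, 71.9


def planning_band(num_tasks, low=HARNESS_LOW, high=HARNESS_HIGH):
    """Expected resolved-task count at each end of the harness range."""
    return num_tasks * low / 100, num_tasks * high / 100


# Points of movement attributable to harness choice alone.
harness_spread = round(HARNESS_HIGH - HARNESS_LOW, 1)
# Points by which the cited peak sits above the best named harness.
peak_margin = round(PEAK - HARNESS_HIGH, 1)
# 200 is an arbitrary example batch size, not a benchmark property.
low_count, high_count = planning_band(200)

print(f"harness spread: {harness_spread} points")
print(f"peak above range: {peak_margin} points")
print(f"plan for {low_count:.0f} to {high_count:.0f} resolved out of 200")
```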
[Chart: Vendor-stated benchmarks · read with caveats. Source: NVIDIA tech report + MarkTechPost · all vendor-stated]
The framing that holds all of this together: Ultra is the strongest US-origin open-weight model on the only independent benchmark we have, it is not the best open model overall, and most of its capability story is still vendor-stated. That is not a knock. It is simply where every major launch sits in its first week, before the community runs its own evals. The responsible move is to benchmark on your own workload rather than treat a launch-day table as settled.
Chinese labs have been flooding the open ecosystem with strong models while American companies — OpenAI, Anthropic, Google — keep their best systems behind APIs. — Decrypt launch analysis, June 4, 2026
05 — The Verbosity Tax
Why "30% cheaper" and 2.3x verbose can both be true.
NVIDIA states up to 30% lower cost to task completion versus open frontier models in its class, attributing the saving to fewer tokens per turn in agentic loops. That is a vendor-stated figure, not an independently audited one. And it runs straight into a finding from the same independent source that gave us the Intelligence Index: Artificial Analysis flagged Ultra as "very verbose," measuring 100 million output tokens generated while running the Intelligence Index against a median of 43 million for comparable models, roughly 2.3x more.
These two facts do not cancel; they interact, and the interaction is the whole point. Output tokens are what you pay for. A model priced below a peer per token can still cost more per completed task if it emits enough extra tokens to overwhelm the per-token discount. Run the arithmetic on a worked example, assuming for illustration that the saving shows up as a 30% lower per-token price: take a task that costs $1.00 of output on a peer model. At the same token count, Ultra's bill is $0.70. But if Ultra emits 2.3x the output tokens to finish the same task, that $0.70 becomes roughly $1.61, which is more than the peer, not less. The discount survives only while the output multiplier stays below about 1.43x (1 / 0.70). The 30% claim and the 2.3x verbosity are not contradictory; they describe different axes, and which one dominates depends entirely on how verbose your specific workload makes the model.
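The worked example reduces to one multiplication, which makes it easy to rerun with your own measured ratios. A minimal sketch, treating the 30% figure as a per-token discount purely for illustration; `per_task_cost` is our own helper name.

```python
def per_task_cost(peer_cost, price_ratio, token_ratio):
    """Cost of one task relative to a peer, given price and output-token ratios."""
    return peer_cost * price_ratio * token_ratio


# Artificial Analysis: 100M output tokens against a 43M median.
verbosity = 100 / 43
# Illustrative assumption: the 30% saving applies to the per-token price.
price_ratio = 0.70

cost = per_task_cost(1.00, price_ratio, 2.3)
break_even = 1 / price_ratio  # output multiplier at which the discount is gone

print(f"measured verbosity ratio: {verbosity:.2f}x")       # 2.33x
print(f"per-task cost at 2.3x verbosity: ${cost:.2f}")     # $1.61
print(f"break-even output multiplier: {break_even:.2f}x")  # 1.43x
```

Swap in the tokens-per-task ratio you measure on your own prompts; anything under the break-even multiplier keeps the saving, anything over it reverses it.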
[Chart: The verbosity tax · per-token price vs per-task cost. Illustrative arithmetic · NVIDIA 30% claim + AA 2.3x verbosity]
There is a more optimistic reading too, and it is worth stating to be fair. Verbosity measured on a reasoning-heavy benchmark suite is not necessarily representative of a production agent loop with tight system prompts and tool-call constraints. It is plausible that disciplined prompting narrows the gap. But "plausible" is not "audited," and the only way to know for your workload is to measure it. The point of this section is not that Ultra is expensive. It is that the cost question is genuinely open, and the vendor's framing answers only half of it.
06 — Claims vs Checks
A launch-day verification ledger.
Because most of the capability story is vendor-stated, the most useful thing we can hand you is a ledger: each major claim, its source, and whether an independent check exists yet. This is the discipline we apply to any model launch before recommending it in a client AI transformation engagement. The proprietary table below is our launch-day read; the verdict column will move as the community publishes its own evals over the coming weeks.
| Claim | Source | Independent check | Verdict |
|---|---|---|---|
| 140 tok/s output speed | Artificial Analysis | Yes — directly measured | Confirmed |
| Intelligence Index 48 (#9/89) | Artificial Analysis | Yes — independent eval | Confirmed |
| 4.8–5.9x throughput vs open peers | NVIDIA research page | Partial — speed lead corroborated, ratio not | Plausible |
| SWE-Bench 71.9% peak | NVIDIA tech report | No — harness range 65.0–70.4% cited | Use the range |
| 94.7% RULER @ 1M tokens | NVIDIA / MarkTechPost | No — not yet replicated | Needs data |
| 90.0% PinchBench (agentic) | NVIDIA developer blog | No — vendor-only | Needs data |
| Up to 30% lower cost per task | NVIDIA developer blog | In tension — 2.3x verbosity observed | Varies by use |
Read top to bottom, the ledger tells a consistent story. The two things independently confirmed at launch — raw speed and a top-ten intelligence ranking — are exactly the two things that make Ultra interesting for agents. The things still awaiting confirmation are the precise capability ceilings, which matter most for one-shot difficulty rather than throughput. That shape is genuinely useful: you can deploy on the confirmed strengths today while treating the vendor-stated ceilings as hypotheses to validate on your own data.
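If you want to track the ledger as the verdicts move, it fits in a small data structure. The sketch below copies the claim, source, and verdict columns from the table; the `Claim` type and the two derived lists are our own illustrative names.

```python
from typing import NamedTuple


class Claim(NamedTuple):
    claim: str
    source: str
    verdict: str


LEDGER = [
    Claim("140 tok/s output speed", "Artificial Analysis", "Confirmed"),
    Claim("Intelligence Index 48 (#9/89)", "Artificial Analysis", "Confirmed"),
    Claim("4.8-5.9x throughput vs open peers", "NVIDIA research page", "Plausible"),
    Claim("SWE-Bench 71.9% peak", "NVIDIA tech report", "Use the range"),
    Claim("94.7% RULER @ 1M tokens", "NVIDIA / MarkTechPost", "Needs data"),
    Claim("90.0% PinchBench (agentic)", "NVIDIA developer blog", "Needs data"),
    Claim("Up to 30% lower cost per task", "NVIDIA developer blog", "Varies by use"),
]

# Strengths you can deploy on today versus hypotheses to validate yourself.
confirmed = [c for c in LEDGER if c.verdict == "Confirmed"]
to_validate = [c for c in LEDGER if c.verdict != "Confirmed"]

for c in to_validate:
    print(f"validate on own data: {c.claim} ({c.source})")
```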
07 — The License
OpenMDW-1.1 is a licensing inflection point.
Released on May 28, 2026, OpenMDW-1.1 is a Linux Foundation permissive license built specifically for AI model artifacts (weights, code, data, and docs together) rather than software alone. On its surface terms, it grants royalty-free rights including commercial use, and it carries a patent termination clause. NVIDIA adopted it simultaneously across its Cosmos, Isaac GR00T, Ising, and Nemotron model families. One practically important surface term: model outputs are explicitly free from the license's obligations, so end-user products built on Nemotron-generated outputs are not encumbered by it.
A careful caveat, because licensing is where confident overstatement does the most damage. OpenMDW-1.1 is not Apache 2.0 and should not be described as "Apache-equivalent." It is purpose-built for model artifacts, with a patent-termination mechanism and a scope that spans data and weights, not just source code. Beyond those surface terms, the legal analysis of what OpenMDW-1.1 changes in practice — how its termination clause interacts with downstream redistribution, how its data scope is interpreted — is still immature. It is a new license from a neutral foundation, and the community's reading of its finer points will firm up over months, not days. Characterize it as "permissive with patent termination," cite the surface terms, and get your own counsel before betting a redistribution strategy on it.
We're helping establish a simpler, more consistent standard for open models at scale. — Kari Briski, VP of Generative AI, NVIDIA
08 — Access & Pricing
Where to run it, and what it costs.
Pricing landed competitively. OpenRouter lists Nemotron 3 Ultra at $0.50 per million input tokens and $2.50 per million output, with a free tier at $0/$0, and the model appears across its June listings alongside other major launches — context we tracked in our OpenRouter June 2026 model listings roundup. Artificial Analysis computes a blended rate near $0.52 per million at a typical cache/input/output mix, with cache-hit tokens discounted roughly two-thirds against input. The model is self-hostable on vLLM, SGLang, and TRT-LLM, and fine-tunable through NVIDIA's NeMo stack, across GPU families from Ampere to Blackwell.
OpenRouter input: $0.50 per 1M tokens
Output runs $2.50 per 1M tokens, with a $0/$0 free tier for evaluation. AA's blended rate sits near $0.52 per 1M at a typical cache-heavy mix. Remember to weight output cost by your workload's verbosity, not the list price alone.
Serving frameworks
Run on vLLM, SGLang, or TRT-LLM. Fine-tune via NeMo Automodel, NeMo Megatron Bridge, or NeMo RL (GRPO). Supported across Ampere A100, Hopper H100/H200, and Grace Blackwell GB200/GB300.
Day-zero platforms
NVIDIA NIM, Perplexity, Together AI, Fireworks AI, DeepInfra, Amazon SageMaker JumpStart, Nebius, and more. The NVFP4 checkpoint targets Blackwell for the highest throughput tier.
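To turn the list prices into a per-request figure, a sketch like the following is enough. It uses the OpenRouter rates quoted in this section and deliberately ignores cache-hit discounts; the token counts and the `request_cost` helper are illustrative assumptions, not measurements.

```python
INPUT_PER_M = 0.50   # USD per 1M input tokens (OpenRouter list price)
OUTPUT_PER_M = 2.50  # USD per 1M output tokens (OpenRouter list price)


def request_cost(input_tokens, output_tokens, verbosity=1.0):
    """List-price cost of one request; verbosity scales the output side only."""
    billed_output = output_tokens * verbosity
    return (input_tokens * INPUT_PER_M + billed_output * OUTPUT_PER_M) / 1_000_000


# Hypothetical agent turn: 20K tokens in, 4K tokens out.
baseline = request_cost(20_000, 4_000)
# Same turn if the model emits 2.3x the output, per the AA verbosity finding.
verbose = request_cost(20_000, 4_000, verbosity=2.3)

print(f"baseline: ${baseline:.4f}  verbose: ${verbose:.4f}")
```

Because output costs five times input per token here, the verbosity multiplier moves the bill far more than prompt length does.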
09 — Routing
When Ultra is the right call, and when it isn't.
The genuine decision here is a tradeoff, not a recommendation. On measured output speed, Ultra runs roughly 1.4x to 2.8x faster than Kimi K2.6 (140.3 tok/s against roughly 50–100 tok/s) but sits six intelligence-index points beneath it. For long-horizon agent tasks, faster throughput directly cuts wall-clock time and GPU-hour cost, but a lower capability ceiling raises the odds of a failed run that needs a retry, and a retry eats the speed advantage. The right answer depends on whether your bottleneck is latency or one-shot difficulty.
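The speed-versus-retry tradeoff can be made explicit with a small expected-time model. This is a sketch under loud assumptions: retries are treated as independent attempts (so the expected attempt count is 1 / success rate), both models are assumed to emit the same token count, and the success rates below are invented for illustration. Only the 140.3 tok/s figure is a measured number; 75 tok/s is the midpoint of the quoted 50–100 range.

```python
def expected_seconds(output_tokens, tokens_per_second, success_rate):
    """Expected generation time to one successful run, counting retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    single_run = output_tokens / tokens_per_second
    return single_run / success_rate  # geometric retries: 1 / p attempts on average


def break_even_success(peer_success, peer_speed, own_speed):
    """Success rate the faster model needs to match the peer's expected time."""
    return peer_success * peer_speed / own_speed


TASK_TOKENS = 50_000  # hypothetical long-horizon task

ultra = expected_seconds(TASK_TOKENS, 140.3, success_rate=0.60)  # assumed rate
peer = expected_seconds(TASK_TOKENS, 75.0, success_rate=0.80)    # assumed rate
needed = break_even_success(0.80, 75.0, 140.3)

print(f"Ultra: {ultra:.0f}s  peer: {peer:.0f}s  break-even success: {needed:.2f}")
```

If Ultra's verbosity inflates its token count on your workload, scale `TASK_TOKENS` for the Ultra run accordingly; the speed advantage shrinks in proportion.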
Speed-bound agent loops
Hundreds of tool-calling turns where wall-clock time and GPU-hours dominate, and individual steps are not at the edge of model capability. Ultra's measured 140 tok/s and open weights are the strongest fit here. Pair with disciplined prompts to control verbosity.
On-prem deployment
Sovereignty, sector-compliance, or data-residency requirements that rule out a closed API. Open weights plus the OpenMDW-1.1 license make Ultra a leading US-origin candidate — verify the served context window and confirm license terms with counsel first.
Raw capability ceiling
When a single hard step decides the outcome and a retry is expensive, the six-point index gap to Kimi K2.6 — and the gap to closed frontier — matters. Route ceiling-critical work to a frontier API, or to Kimi where an open model is required.
High-volume, chatty workloads
Where output volume drives the bill, the 2.3x verbosity finding means Ultra's per-token discount may not survive contact with your traffic. Measure tokens-per-task on your own prompts before committing — the answer is workload-specific.
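The four cases above can be written down as a first-pass routing rule. This is a sketch, not a policy: the `Task` fields, the order in which the checks run, and the returned labels are all our own assumptions layered on the framework described in this section.

```python
from dataclasses import dataclass


@dataclass
class Task:
    speed_bound: bool = False         # many turns, wall-clock dominates
    needs_open_weights: bool = False  # on-prem, sovereignty, data residency
    ceiling_critical: bool = False    # one hard step decides the outcome
    output_heavy: bool = False        # output volume drives the bill


def route(task):
    """Return a routing suggestion for a task class (assumed rule order)."""
    if task.ceiling_critical:
        # Capability ceiling first: a failed hard step is the costliest outcome.
        return "Kimi K2.6" if task.needs_open_weights else "frontier API"
    if task.output_heavy:
        # The verbosity finding makes list price unreliable for this class.
        return "measure tokens-per-task first"
    if task.speed_bound or task.needs_open_weights:
        return "Nemotron 3 Ultra"
    return "benchmark on your own workload"


print(route(Task(speed_bound=True)))
print(route(Task(ceiling_critical=True, needs_open_weights=True)))
```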
Our forward read: the most durable thing about this release is not the benchmark line, it is the supply-chain openness. Shipping the training data and recipes under a neutral-foundation license turns Ultra from a model you call into a pipeline you can rebuild. Over the next two quarters we expect the more consequential downstream effect to be smaller labs using NVIDIA's released SFT and RL corpora to distill specialized domain agents via the same multi-teacher method — "training data as a product" rather than weights as a product. The capability gap to Kimi K2.6 will likely narrow or widen on any given week; the reproducibility advantage is structural and harder to undo.
10 — Conclusion
The strongest US open model, read honestly.
Nemotron 3 Ultra is a speed-and-openness story, not a new capability ceiling.
NVIDIA shipped a genuinely open 550B reasoning model — weights, data, and recipes under a permissive license, four checkpoints, day-zero across twenty-five-plus platforms. On the one independent benchmark available, it is the leading US-origin open-weight model, and it runs measurably faster than its Chinese open rivals. Both of those are real, confirmed wins.
The honest asterisks are equally real. It trails Kimi K2.6 by six index points, so it is not the best open model overall. Most of its capability story is vendor-stated and awaiting community replication — cite SWE-Bench as the 65.0–70.4% harness range, not the 71.9% peak. And the 30% cost claim collides with an independently observed 2.3x verbosity, which means per-task economics are genuinely workload-dependent rather than settled in NVIDIA's favor. The 1M context is an architectural ceiling, not a served guarantee.
The practical move is the one we apply to every launch: route by task class, not by headline. Pick Ultra for speed-bound, long-horizon agents that need open weights; reach for a frontier API when a single hard step decides the run; and measure tokens-per-completed-task on your own prompts before you trust any cost claim. The most consequential thing here is the supply-chain openness — and that is the part no benchmark table captures.