AI Development · New Release · 12 min read · Published June 5, 2026

Open 550B MoE · 1M architectural context · #9 of 89 models independently measured

NVIDIA Nemotron 3 Ultra: 550B Open Reasoning Model Live

NVIDIA shipped Nemotron 3 Ultra on June 4, 2026 — a 550B-parameter open Mixture-of-Experts reasoning model with 55B active per token, published with weights, training data, and recipes under the Linux Foundation's permissive OpenMDW-1.1 license. It tops US open-weight rankings yet trails China's Kimi K2.6 by six intelligence-index points — while running several times faster.

Digital Applied Team
Senior strategists · Published Jun 5, 2026
Sources: 13 primary + independent
  • Total parameters: 550B (55B active per token, ~10% sparsity)
  • AA Intelligence Index: 48 (#9 of 89 models; Kimi K2.6 at 54)
  • Output speed (measured): 140 tok/s (#7 of 89, independent; vs 50–100 for rivals)
  • Verbosity vs median: 2.3× (output tokens per eval; cost watch-out)

NVIDIA released Nemotron 3 Ultra on June 4, 2026 — a 550-billion-parameter open Mixture-of-Experts reasoning model that ships not just weights but training data and recipes under a permissive Linux Foundation license. The headline isn't a new capability ceiling. It's that the strongest US-origin open-weight model now runs fast enough, and opens enough of its supply chain, to change how teams build long-running agents.

The model carries 550B total parameters with roughly 55B active per token — about 10% sparsity — on a hybrid Mamba-Transformer architecture. On the one independent benchmark available at launch, Artificial Analysis's Intelligence Index, it scores 48 and ranks ninth of eighty-nine models evaluated. That places it above every other US open-weight model. It also places it six points below China's Kimi K2.6, which scores 54. Both facts are true at once, and the honest version of this launch holds them together rather than picking the flattering one.

This guide covers what actually shipped, the architecture behind the speed, how to read benchmark numbers that are mostly vendor-stated, the verbosity finding that complicates NVIDIA's cost claim, what the OpenMDW-1.1 license does and doesn't change, and a routing framework for deciding when Ultra is the right call versus a frontier API. Where a number is vendor-stated and not yet independently checked, we say so.

Key takeaways
  1. A genuinely open 550B reasoning model shipped. Nemotron 3 Ultra is a 550B-parameter MoE with ~55B active per token, released June 4, 2026 with four checkpoints (NVFP4, BF16 instruct, BF16 base, GenRM) plus training data and recipes under the Linux Foundation's OpenMDW-1.1 license.
  2. Leading US open-weight model, not the global leader. On Artificial Analysis's Intelligence Index it scores 48 (ninth of eighty-nine), ahead of all US open peers but trailing China's Kimi K2.6 at 54, a six-point gap. Describe it as the leading US-origin open-weight model, not the best open model overall.
  3. Speed is the real differentiator. Artificial Analysis independently measured 140.3 tokens/second output (seventh of eighty-nine) and a 1.33-second time-to-first-token, against DeepSeek and Kimi serving roughly 50–100 tok/s. NVIDIA's vendor figures claim 4.8–5.9x throughput gains versus comparable open models on GB200.
  4. The cost claim has an honest asterisk. NVIDIA states up to 30% lower cost per task. But Artificial Analysis measured the model generating 2.3x more output tokens than the median peer in the same benchmark suite. Per-task economics depend on how verbose your workload makes it, so run your own numbers.
  5. 1M context is architectural, not always served. The model card and NVIDIA blog confirm a 1M-token architectural window via interleaved Mamba-2 and selective Attention layers. Providers may serve a reduced context for cost and latency reasons, so confirm the served limit on whichever endpoint you deploy.

01 What Shipped: Four checkpoints, day-zero on 25+ platforms.

NVIDIA released Nemotron 3 Ultra in four checkpoint variants on the same day: an NVFP4-quantized build, a BF16 post-trained instruct model, a BF16 base model, and a GenRM (generative reward model) variant for building reward pipelines. This is a deliberate spread — the instruct checkpoint for direct deployment, the base for further pre-training or domain adaptation, and the reward model for teams building their own reinforcement-learning loops on top.

Availability was unusually wide at launch. Day-zero access spanned OpenRouter, NVIDIA NIM, Hugging Face weight downloads, Perplexity, Together AI, Fireworks AI, DeepInfra, Amazon SageMaker JumpStart, and more than twenty additional cloud and inference providers — over twenty-five platforms in total. It builds directly on the lineage we covered in our look at NVIDIA's earlier Nemotron 3 Super 120B model, and was positioned within the agent-platform story NVIDIA laid out at Jensen Huang's Computex keynote.

Instruct · BF16 post-trained
550B total · 55B active · 1M context
The deploy-ready reasoning model. Hybrid Mamba-Transformer MoE post-trained via Multi-Teacher On-Policy Distillation. This is the checkpoint behind the independently measured benchmark and speed numbers.
huggingface.co/nvidia · BF16 instruct

Efficient · NVFP4 quantized
FP4 weights · Blackwell-native
NVFP4 quantization with E2M1 encoding and 2D block microscaling. NVIDIA states up to 5x throughput versus BF16 on Blackwell Tensor Cores, calling it their largest-scale stable FP4 training run to date.
Blackwell B200 / B300 / GB200

Build-on · Base + GenRM
BF16 base · reward-model variant
The base checkpoint for further pre-training or domain adaptation, plus a generative reward model for teams building their own RL pipelines. Released with 10M new SFT samples and 1M new RL tasks.
Cumulative: 50M SFT · 2M RL tasks

Release snapshot
Nemotron 3 Ultra shipped June 4, 2026 in four checkpoints across 25+ platforms. NVIDIA also released two companion models: a distinct Nemotron 3.5 Content Safety (4B) model (June 2) covering 23 safety categories across 12 languages, and a Nemotron 3.5 ASR (0.6B) streaming speech model (June 4) supporting 40 language-locales. Note that these are version 3.5, a separate family from Ultra, not "Ultra Content Safety." Enterprise partners named alongside the launch include Microsoft, SAP, ServiceNow, Red Hat, Palantir, CrowdStrike, Siemens, and Synopsys.

02 Architecture: A hybrid Mamba-Transformer MoE built for long-running agents.

According to NVIDIA's technical materials, Ultra is a Mixture-of-Experts hybrid Mamba-Transformer. It interleaves Mamba-2 layers — which give sub-quadratic efficiency on long sequences — with selective Attention layers that preserve precise factual recall. That hybrid is what NVIDIA credits for making the 1M-token architectural context window tractable rather than ruinously expensive. The MoE configuration is reported at 512 experts per layer with top-22 routing, an 8,192 model dimension, and 108 layers.
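To make "512 experts with top-22 routing" concrete, here is a minimal sketch of generic top-k softmax gating in plain Python. It is not NVIDIA's router: the function name, the random logits, and the choice to renormalize over only the selected experts are our assumptions for illustration, and LatentMoE's latent projection is not modeled. Only the 512 and 22 figures come from the reported configuration.

```python
import math
import random

NUM_EXPERTS = 512  # experts per layer, as reported for Ultra
TOP_K = 22         # experts routed per token, as reported


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and weight them.

    Returns a list of (expert_index, weight) pairs. Weights are a softmax
    over the selected logits only, so they sum to 1. (Renormalizing over
    the selected experts is an illustrative choice, not a confirmed detail.)
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router scores for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)
print(len(assignment), "experts selected out of", NUM_EXPERTS)
```

Only the selected experts run for that token, which is how a very large parameter count can coexist with a much smaller per-token compute footprint.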

Two engineering choices target inference speed directly. The first is Multi-Token Prediction (MTP) layers baked into the architecture for native speculative decoding — faster generation without a separate draft model. The second is LatentMoE, which NVIDIA describes as projecting tokens into a smaller latent dimension so more experts can be routed at a fixed inference cost. Both are vendor-stated design claims; the payoff that has been independently confirmed is the throughput number in the next section.
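The idea behind speculative decoding can be sketched as a draft-and-verify loop. The toy below uses two stand-in callables, `draft_next` and `target_next`, which are hypothetical; with MTP the drafts come from prediction layers built into the model itself, and a real system verifies all drafted tokens in one batched forward pass rather than one call at a time. Treat this as an illustration of the greedy acceptance rule only, not of NVIDIA's implementation.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One draft-and-verify step with greedy acceptance.

    Returns the tokens appended this step: between 1 and k + 1 of them.
    """
    # 1. Draft k tokens cheaply.
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2. Verify: keep drafted tokens for as long as the target agrees.
    accepted = []
    for token in drafted:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # the target's correction ends the step
            return accepted
        accepted.append(token)

    # 3. Every draft was accepted: the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy stand-ins (hypothetical): tokens are small integers.
def target_next(seq):
    return (sum(seq) + 1) % 5


def draft_next(seq):
    # Agrees with the target except at every third position.
    return target_next(seq) if len(seq) % 3 else 0


def generate(prefix, n_tokens, k=4):
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + n_tokens]


print(generate([1], 12))
```

The property worth noticing is that the output matches plain greedy decoding with the target alone; drafting changes how many tokens land per verification step, not which tokens are produced.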

The training story is unusually open. NVIDIA states the model was trained on roughly 20 trillion tokens across a diversity-focused and a quality-focused phase, with a data cutoff of September 2025, using Megatron-LM on NVIDIA clusters between December 2025 and April 2026. Post-training used Multi-Teacher On-Policy Distillation (MOPD), where ten-plus domain-specialized teacher models score student rollouts in an asynchronous pipeline and are themselves periodically retrained from updated student checkpoints. Because NVIDIA released the SFT and RL corpora alongside the weights, that pipeline is partially reproducible — a meaningful asset for smaller labs training specialized domain agents.

Sparsity · Active per token: 55B
Of 550B total parameters, roughly 55B are active per token, about 10% sparsity. That is the lever behind the speed: a large knowledge store with a small per-token compute footprint, routed through 512 experts at top-22.
~10% active · 512 experts

Training · Tokens, two phases: 20T
Vendor-stated ~20T-token run: ~15T diversity-focused, ~5T quality-focused, cutoff September 2025. Trained with Megatron-LM between Dec 2025 and Apr 2026. NVIDIA released the data and recipes, not just weights.
Data cutoff: Sept 2025

Languages · 12 natural + 43 code
Vendor-stated support for 12 natural languages (including English, French, Spanish, German, Japanese, Korean, Hindi, Brazilian Portuguese, and Chinese) plus 43 programming languages. Self-hostable on vLLM, SGLang, and TRT-LLM.
Ampere → Blackwell GPUs

"Agents don't just answer once. They plan, call tools, delegate work to sub-agents, check results, and keep going across hundreds of turns." (AWS SageMaker JumpStart launch blog, June 4, 2026)

03 Context Window: 1M is the architecture. Check what your provider serves.

The Hugging Face model card and NVIDIA developer blog both confirm a one-million-token architectural context window, enabled by the interleaved Mamba-2 and selective Attention design. NVIDIA reports a RULER score of 94.7% at one million tokens — a vendor-stated long-context retrieval result that, if it holds up under independent testing, would be strong.

Here is the distinction that matters for deployment, and that most launch coverage skips. The 1M figure is what the model can do architecturally. The context a given provider actually serves can be lower — endpoints frequently cap served context well below the architectural ceiling for cost and latency reasons. Do not assume every Nemotron 3 Ultra endpoint gives you a full million tokens; confirm the served limit on whichever provider you deploy against before designing a long-document pipeline around it. The architectural number is a ceiling, not a guarantee at the API boundary.

Deployment caution
Treat 1M tokens as the architectural maximum and verify the served window per endpoint. A long-context RAG pipeline designed for a million tokens will fail quietly if the provider you chose caps served context far below that — a real risk worth a five-minute check before you build.
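That five-minute check can be turned into a guard in the pipeline itself. The sketch below assumes a provider counts prompt and output tokens against one shared window, which is common but not universal, and the 262,144-token served limit is a made-up example rather than any real endpoint's figure. Look up the actual value for the provider you use.

```python
ARCHITECTURAL_LIMIT = 1_000_000  # tokens, per the model card


def check_context_budget(prompt_tokens, max_output_tokens, served_limit):
    """Return (fits, headroom) for one request against a served window.

    Assumes prompt and output share a single window (provider-dependent).
    """
    if served_limit > ARCHITECTURAL_LIMIT:
        raise ValueError("served limit cannot exceed the architectural window")
    needed = prompt_tokens + max_output_tokens
    return needed <= served_limit, served_limit - needed


# Hypothetical endpoint that serves 262,144 tokens, not the full 1M.
fits, headroom = check_context_budget(
    prompt_tokens=800_000, max_output_tokens=8_000, served_limit=262_144
)
print(fits, headroom)  # False -545856
```

Failing loudly at this point is the design goal: a request that would be silently truncated or rejected downstream gets caught before the pipeline is built around it.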

04 Benchmarks: Where Ultra leads, and where the asterisks live.

One independent benchmark exists at launch: Artificial Analysis's Intelligence Index. Ultra scores 48, ranking ninth of eighty-nine models and sitting well above the peer average of 31. Among US-origin open-weight models it leads the field: the nearest peers are Gemma 4 31B at 39, the earlier Nemotron 3 Super at 36, and gpt-oss-120b at 33. Above it sits Kimi K2.6 at 54. The comparison below lays out that landscape, including the Chinese open model that currently leads it.

AA Intelligence Index · Ultra vs open-weight peers
Source: Artificial Analysis Intelligence Index

  • Kimi K2.6 (China open-weight, current open leader): 54, leads index
  • Nemotron 3 Ultra (leading US open-weight model): 48, #9 of 89
  • Gemma 4 31B (nearest US open peer): 39
  • Nemotron 3 Super (earlier NVIDIA open model): 36
  • gpt-oss-120b (US open-weight): 33
  • Peer average (across models evaluated): 31

On the vendor-stated coding and agentic benchmarks, treat the numbers with appropriate care — none have been independently replicated at scale yet. The one worth reading precisely is SWE-Bench Verified. NVIDIA's materials cite a peak of 71.9%, but the underlying range across five different agent harnesses (Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent) is 65.0% to 70.4%, with the 71.9% peak attributed to an unspecified configuration. The honest figure to plan around is the 65.0–70.4% harness range, not the peak — the gap between them is a measure of how much harness choice moves the result. Other vendor-stated agentic scores (PinchBench at 90.0%, Terminal Bench 2.1 at 56.4%) carry the same caveat: stated by NVIDIA, awaiting third-party replication.

Vendor-stated benchmarks · read with caveats
Source: NVIDIA tech report + MarkTechPost · all VENDOR-STATED

  • SWE-Bench Verified (range across 5 harnesses, vendor-stated): 65.0–70.4%
  • RULER @ 1M tokens (long-context retrieval, vendor-stated): 94.7%
  • PinchBench (agentic, vendor-stated, no third-party replication): 90.0%
  • Terminal Bench 2.1 (agentic terminal tasks, vendor-stated): 56.4%

The framing that holds all of this together: Ultra is the strongest US-origin open-weight model on the only independent benchmark we have, it is not the best open model overall, and most of its capability story is still vendor-stated. That is not a knock — it is simply where every major launch sits in its first week, before the community runs its own evals. The responsible move is to benchmark on your own workload rather than treat a launch-day table as settled.

"Chinese labs have been flooding the open ecosystem with strong models while American companies (OpenAI, Anthropic, Google) keep their best systems behind APIs." (Decrypt launch analysis, June 4, 2026)

05 The Verbosity Tax: Why "30% cheaper" and 2.3x verbose can both be true.

NVIDIA states up to 30% lower cost to task completion versus open frontier models in its class, attributing the saving to fewer tokens per turn in agentic loops. That figure is vendor-stated and not independently audited. It also runs straight into a finding from the same independent source that gave us the Intelligence Index: Artificial Analysis flagged Ultra as "very verbose," measuring 100 million output tokens generated while running the Intelligence Index against a median of 43 million tokens for comparable models, roughly 2.3x more verbose.

These two facts do not cancel; they interact, and the interaction is the whole point. Output tokens are what you pay for. A model priced below a peer per token can still cost more per completed task if it emits enough extra tokens to overwhelm the per-token discount. Run the arithmetic on a worked example that, purely for illustration, treats the 30% as a per-token discount: take a task that costs $1.00 of output on a peer model. A 30% lower per-token price puts the same token count at $0.70. But if Ultra emits 2.3x the output tokens to finish the same task, that $0.70 becomes roughly $1.61 (0.70 × 2.3), which is more than the peer, not less. The 30% claim and the 2.3x verbosity are not contradictory; they describe different axes, and which one dominates depends entirely on how verbose your specific workload makes the model.
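The same arithmetic fits in a few lines, which makes it easy to swap in your own measurements. Like the worked example, this sketch treats the 30% as a per-token discount, which is a simplification of NVIDIA's per-task claim; the function names are ours. The break-even helper shows how much extra verbosity a given discount can absorb before the saving disappears.

```python
def per_task_cost(peer_cost, per_token_discount, verbosity_ratio):
    """Output cost per task after a per-token discount and a verbosity multiplier."""
    return peer_cost * (1.0 - per_token_discount) * verbosity_ratio


def break_even_verbosity(per_token_discount):
    """Verbosity ratio at which the per-token discount is exactly cancelled."""
    return 1.0 / (1.0 - per_token_discount)


same_tokens = per_task_cost(1.00, 0.30, 1.0)
verbose = per_task_cost(1.00, 0.30, 2.3)
print(f"same token count:   ${same_tokens:.2f}")   # $0.70
print(f"verbosity-adjusted: ${verbose:.2f}")       # $1.61
print(f"break-even ratio:   {break_even_verbosity(0.30):.2f}x")  # 1.43x
```

Under these assumptions the discount survives only while the model emits less than about 1.43x the peer's output tokens on your workload, which is why tokens-per-completed-task is the number to measure.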

The verbosity tax · per-token price vs per-task cost
Illustrative arithmetic · NVIDIA 30% claim + AA 2.3x verbosity

  • Peer model, same task (baseline output cost): $1.00
  • Ultra, per-token only (30% cheaper per token, same token count): $0.70, looks cheaper
  • Ultra, verbosity-adjusted (30% cheaper × 2.3x more output tokens): ~$1.61, can cost more

The honest read
Treat the 30% cost-reduction claim as vendor-stated, and application-dependent. The independently observed 2.3x verbosity means actual savings vary widely by use case — and can invert into a premium on chatty workloads. Measure tokens-per-completed-task on your own prompts before you assume Ultra is the cheaper option.

There is a more optimistic reading too, and it is worth stating to be fair. Verbosity measured on a reasoning-heavy benchmark suite is not necessarily representative of a production agent loop with tight system prompts and tool-call constraints. It is plausible that disciplined prompting narrows the gap. But "plausible" is not "audited," and the only way to know for your workload is to measure it. The point of this section is not that Ultra is expensive — it is that the cost question is genuinely open, and the vendor's framing answers only half of it.

06 Claims vs Checks: A launch-day verification ledger.

Because most of the capability story is vendor-stated, the most useful thing we can hand you is a ledger: each major claim, its source, and whether an independent check exists yet. This is the discipline we apply to any model launch before recommending it in a client AI transformation engagement. The proprietary table below is our launch-day read; the verdict column will move as the community publishes its own evals over the coming weeks.

Nemotron 3 Ultra — vendor claims vs independent verification status at launch
Claim | Source | Independent check | Verdict
140 tok/s output speed | Artificial Analysis | Yes: directly measured | Confirmed
Intelligence Index 48 (#9/89) | Artificial Analysis | Yes: independent eval | Confirmed
4.8–5.9x throughput vs open peers | NVIDIA research page | Partial: speed lead corroborated, ratio not | Plausible
SWE-Bench 71.9% peak | NVIDIA tech report | No: harness range 65.0–70.4% cited | Use the range
94.7% RULER @ 1M tokens | NVIDIA / MarkTechPost | No: not yet replicated | Needs data
90.0% PinchBench (agentic) | NVIDIA developer blog | No: vendor-only | Needs data
Up to 30% lower cost per task | NVIDIA developer blog | Contradicted: 2.3x verbosity observed | Varies by use

Read top to bottom, the ledger tells a consistent story. The two things independently confirmed at launch — raw speed and a top-ten intelligence ranking — are exactly the two things that make Ultra interesting for agents. The things still awaiting confirmation are the precise capability ceilings, which matter most for one-shot difficulty rather than throughput. That shape is genuinely useful: you can deploy on the confirmed strengths today while treating the vendor-stated ceilings as hypotheses to validate on your own data.

07 The License: OpenMDW-1.1 is a licensing inflection point.

Released on May 28, 2026, OpenMDW-1.1 is a Linux Foundation permissive license built specifically for AI model artifacts (weights, code, data, and docs together) rather than software alone. On its surface terms, it grants royalty-free rights including commercial use, and it carries a patent termination clause. NVIDIA adopted it simultaneously across its Cosmos, Isaac GR00T, Ising, and Nemotron model families. One practically important surface term: model outputs are explicitly free from the license's obligations, so end-user products built on Nemotron-generated outputs are not encumbered by it.

A careful caveat, because licensing is where confident overstatement does the most damage. OpenMDW-1.1 is not Apache 2.0 and should not be described as "Apache-equivalent." It is purpose-built for model artifacts, with a patent-termination mechanism and a scope that spans data and weights, not just source code. Beyond those surface terms, the legal analysis of what OpenMDW-1.1 changes in practice — how its termination clause interacts with downstream redistribution, how its data scope is interpreted — is still immature. It is a new license from a neutral foundation, and the community's reading of its finer points will firm up over months, not days. Characterize it as "permissive with patent termination," cite the surface terms, and get your own counsel before betting a redistribution strategy on it.

"We're helping establish a simpler, more consistent standard for open models at scale." (Kari Briski, VP of Generative AI, NVIDIA)

08 Access & Pricing: Where to run it, and what it costs.

Pricing landed competitively. OpenRouter lists Nemotron 3 Ultra at $0.50 per million input tokens and $2.50 per million output, with a free tier at $0/$0, and the model appears across its June listings alongside other major launches — context we tracked in our OpenRouter June 2026 model listings roundup. Artificial Analysis computes a blended rate near $0.52 per million at a typical cache/input/output mix, with cache-hit tokens discounted roughly two-thirds against input. The model is self-hostable on vLLM, SGLang, and TRT-LLM, and fine-tunable through NVIDIA's NeMo stack, across GPU families from Ampere to Blackwell.
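For budgeting, the list prices translate into a per-request estimate like the one below. The input and output rates are the OpenRouter figures quoted above; the exact cache discount is our assumption (we read "roughly two-thirds" as exactly 2/3), and the token counts in the example are invented. This does not attempt to reproduce Artificial Analysis's blended rate, whose mix is not specified here.

```python
INPUT_PER_M = 0.50      # USD per 1M input tokens (list price quoted above)
OUTPUT_PER_M = 2.50     # USD per 1M output tokens (list price quoted above)
CACHE_DISCOUNT = 2 / 3  # assumption: "roughly two-thirds" off the input price


def request_cost(fresh_input, cached_input, output):
    """Estimated USD cost of one request, given token counts by type."""
    cached_rate = INPUT_PER_M * (1 - CACHE_DISCOUNT)
    total = (fresh_input * INPUT_PER_M
             + cached_input * cached_rate
             + output * OUTPUT_PER_M)
    return total / 1_000_000


# Hypothetical agent turn: mostly cached context, modest output.
cost = request_cost(fresh_input=20_000, cached_input=80_000, output=4_000)
print(f"${cost:.4f} per turn")
```

In this invented turn, 4,000 output tokens cost as much as 20,000 fresh input tokens, which is why the output side of the bill is the one to watch for a verbose model.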

Hosted API · OpenRouter input: $0.50/1M
Output runs $2.50 per 1M tokens, with a $0/$0 free tier for evaluation. AA's blended rate sits near $0.52 per 1M at a typical cache-heavy mix. Remember to weight output cost by your workload's verbosity, not the list price alone.
Free tier available

Self-host · Serving frameworks: 4
Run on vLLM, SGLang, or TRT-LLM. Fine-tune via NeMo Automodel, NeMo Megatron Bridge, or NeMo RL (GRPO). Supported across Ampere A100, Hopper H100/H200, and Grace Blackwell GB200/GB300.
Open weights on Hugging Face

Cloud · Day-zero platforms: 25+
NVIDIA NIM, Perplexity, Together AI, Fireworks AI, DeepInfra, Amazon SageMaker JumpStart, Nebius, and more. The NVFP4 checkpoint targets Blackwell for the highest throughput tier.
Verify served context per endpoint

09 Routing: When Ultra is the right call, and when it isn't.

The genuine decision here is a tradeoff, not a recommendation. Ultra runs several times faster per token than Kimi K2.6 but sits six intelligence-index points beneath it. For long-horizon agent tasks, faster throughput directly cuts wall-clock time and GPU-hour cost — but a lower capability ceiling raises the odds of a failed run that needs a retry, and a retry eats the speed advantage. The right answer depends on whether your bottleneck is latency or one-shot difficulty.
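The retry effect can be put in rough numbers. The sketch below assumes each attempt succeeds independently with a fixed probability and that a failed run restarts from scratch, so the expected number of attempts is 1 divided by the success rate. The success rates and the 70 tok/s rival speed are invented for illustration, not measurements of any model, and differences in verbosity between models are ignored.

```python
def expected_wall_clock(output_tokens, tokens_per_second, success_rate,
                        ttft_seconds=0.0):
    """Expected seconds until a successful run, retrying failures from scratch.

    With independent attempts, expected attempts = 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    single_run = ttft_seconds + output_tokens / tokens_per_second
    return single_run / success_rate


# Hypothetical comparison on a 50,000-output-token agent run.
fast_model = expected_wall_clock(50_000, 140.0, success_rate=0.60)
slow_model = expected_wall_clock(50_000, 70.0, success_rate=0.90)
print(f"faster, less reliable: {fast_model:.0f} s")
print(f"slower, more reliable: {slow_model:.0f} s")
```

Under this simple model the faster option wins whenever its speed advantage is larger than its reliability deficit: a 2x speed lead is fully offset only if the slower model succeeds twice as often.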

Long-horizon agents · Speed-bound agent loops
Hundreds of tool-calling turns where wall-clock time and GPU-hours dominate, and individual steps are not at the edge of model capability. Ultra's measured 140 tok/s and open weights are the strongest fit here. Pair with disciplined prompts to control verbosity.
Pick Nemotron 3 Ultra

Open + sovereign · On-prem deployment
Sovereignty, sector-compliance, or data-residency requirements that rule out a closed API. Open weights plus the OpenMDW-1.1 license make Ultra a leading US-origin candidate; verify the served context window and confirm license terms with counsel first.
Pick Ultra open weights

Hardest reasoning · Raw capability ceiling
When a single hard step decides the outcome and a retry is expensive, the six-point index gap to Kimi K2.6 (and the gap to closed frontier) matters. Route ceiling-critical work to a frontier API, or to Kimi where an open model is required.
Use a frontier API

Cost-sensitive bulk · High-volume, chatty workloads
Where output volume drives the bill, the 2.3x verbosity finding means Ultra's per-token discount may not survive contact with your traffic. Measure tokens-per-task on your own prompts before committing; the answer is workload-specific.
Benchmark before you commit
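The four cases above can be written down as a small routing function. This is a sketch of the decision order only: the `Task` fields and the returned labels are hypothetical names of ours, and the precedence (capability ceiling first, then output volume) is one reasonable reading of the framework rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Task:
    needs_open_weights: bool  # sovereignty or data-residency constraint
    ceiling_critical: bool    # a single hard step decides the outcome
    output_heavy: bool        # output volume drives the bill


def route(task):
    """Map a task profile to a routing decision (labels are illustrative)."""
    if task.ceiling_critical:
        # Capability ceiling dominates; stay open-weight only if required.
        return "kimi-k2.6" if task.needs_open_weights else "frontier-api"
    if task.output_heavy:
        # Verbosity may erase the per-token discount; measure first.
        return "benchmark-first"
    # Speed-bound agent loops and on-prem deployments.
    return "nemotron-3-ultra"


agent_loop = Task(needs_open_weights=True, ceiling_critical=False,
                  output_heavy=False)
print(route(agent_loop))  # nemotron-3-ultra
```

Keeping the rule explicit in code makes it easy to revisit once independent evals or your own tokens-per-task measurements change the inputs.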

Our forward read: the most durable thing about this release is not the benchmark line, it is the supply-chain openness. Shipping the training data and recipes under a neutral-foundation license turns Ultra from a model you call into a pipeline you can rebuild. Over the next two quarters we expect the more consequential downstream effect to be smaller labs using NVIDIA's released SFT and RL corpora to distill specialized domain agents via the same multi-teacher method — "training data as a product" rather than weights as a product. The capability gap to Kimi K2.6 will likely narrow or widen on any given week; the reproducibility advantage is structural and harder to undo.

10 Conclusion: The strongest US open model, read honestly.

The shape of open frontier, June 2026

Nemotron 3 Ultra is a speed-and-openness story, not a new capability ceiling.

NVIDIA shipped a genuinely open 550B reasoning model — weights, data, and recipes under a permissive license, four checkpoints, day-zero across twenty-five-plus platforms. On the one independent benchmark available, it is the leading US-origin open-weight model, and it runs measurably faster than its Chinese open rivals. Both of those are real, confirmed wins.

The honest asterisks are equally real. It trails Kimi K2.6 by six index points, so it is not the best open model overall. Most of its capability story is vendor-stated and awaiting community replication — cite SWE-Bench as the 65.0–70.4% harness range, not the 71.9% peak. And the 30% cost claim collides with an independently observed 2.3x verbosity, which means per-task economics are genuinely workload-dependent rather than settled in NVIDIA's favor. The 1M context is an architectural ceiling, not a served guarantee.

The practical move is the one we apply to every launch: route by task class, not by headline. Pick Ultra for speed-bound, long-horizon agents that need open weights; reach for a frontier API when a single hard step decides the run; and measure tokens-per-completed-task on your own prompts before you trust any cost claim. The most consequential thing here is the supply-chain openness — and that is the part no benchmark table captures.

Deploy open-weight frontier in production

Open weights plus measured speed make long-running agents genuinely viable.

Our team helps businesses evaluate, benchmark, and route open-weight frontier models — including Nemotron 3 Ultra — for long-running agents, sovereign deployment, and cost-aware production architectures, delivered in days not quarters.

What we work on

Open-weight model engagements

  • Ultra benchmarking against frontier on your own corpus
  • Tokens-per-task cost modeling — verbosity-adjusted
  • On-prem long-context agents — sovereignty-bound sectors
  • Multi-vendor routing — Ultra / Kimi / frontier APIs
  • License and governance review for open + closed mix
FAQ · Nemotron 3 Ultra guide

The questions we get every week.

Nemotron 3 Ultra is a 550-billion-parameter open Mixture-of-Experts reasoning model from NVIDIA, released on June 4, 2026. It has roughly 55 billion active parameters per token — about 10% sparsity — on a hybrid Mamba-Transformer architecture. NVIDIA shipped four checkpoints (an NVFP4-quantized build, a BF16 post-trained instruct model, a BF16 base model, and a GenRM reward-model variant), and published not just the weights but training data and recipes under the Linux Foundation's permissive OpenMDW-1.1 license. Day-zero availability spanned more than twenty-five cloud and inference platforms, including OpenRouter, NVIDIA NIM, Hugging Face, Perplexity, Together AI, and Amazon SageMaker JumpStart.