NVIDIA Cosmos 3 is the first fully open physical-AI omnimodel — a single model that natively understands and generates text, images, video and ambient sound, and also outputs the action signals a robot needs to move. Announced at GTC Taipei / Computex on May 31, 2026, its weights are downloadable from Hugging Face today.
Earlier physical-AI systems chained separate specialist models — one for scene understanding, another for world generation, a third for predicting actions — and paid an integration tax at every handoff. Cosmos 3 collapses that pipeline into one Mixture-of-Transformers architecture, which is the part of this release that actually changes how robotics and autonomous-vehicle teams build.
This guide covers what shipped, what "omnimodel" means in practice, the two-tower design, the five use modes a single model supports, the three hardware tiers, NVIDIA's leaderboard claims (and why we read every one of them as vendor-stated), and how the OpenMDW-1.1 license changes the calculus for commercial products.
- 01 One open model that reasons, simulates, and acts. Cosmos 3 natively handles text, images, video, ambient sound, and action trajectories in a single unified model (NVIDIA's framing of an 'omnimodel'). It is the first physical-AI foundation model to put all of these in one set of weights.
- 02 A two-tower Mixture-of-Transformers architecture. A Reasoner Tower (autoregressive vision-language model) feeds context to a Generator Tower (diffusion transformer). They share 3D rotary position embeddings, and the Generator cannot run without the Reasoner: reasoning comes before generation by design.
- 03 Three tiers map to your hardware budget. Super (64B) targets Hopper/Blackwell datacenter GPUs, Nano (16B) runs on RTX PRO 6000 workstation-class GPUs, and Edge (4B) is built for Jetson devices. Edge is 'coming soon' with no announced date; Nano and Super weights are on Hugging Face now.
- 04 It outputs robot control signals, not just video. Beyond text and video, Cosmos 3 produces numerical control outputs: joint angles, gripper positions, trajectory points. That action output is the key differentiator over prior vision-only generative world models.
- 05 Open license, but read the leaderboard claims carefully. Cosmos 3 ships under the Linux Foundation's OpenMDW-1.1 license, which permits commercial use and derivatives with a 'Built on NVIDIA Cosmos' attribution. NVIDIA's #1 rankings are stated among open models and were not independently reproduced at launch.
01 — What Shipped
A launch at Computex, weights live the same week.
NVIDIA announced Cosmos 3 during Jensen Huang's GTC Taipei keynote at Computex 2026 on May 31, 2026, positioning it as the open frontier foundation model for physical AI. The launch arrived in the same event cycle that NVIDIA used to frame the rest of its physical-AI roadmap — for the wider keynote context, see our GTC Taipei / Computex 2026 keynote first take and the broader order-pipeline analysis from the same keynote.
Cosmos 3 is not NVIDIA's first move into world models. The original Cosmos family, released in 2025, shipped separate models for world generation, physical understanding, and controlled scene generation. Cosmos 3 is best understood as a unification: it folds what previously required chaining several specialist models into one architecture that runs in a single forward pass. Two of the three tiers — Nano and Super — are downloadable now; the smaller Edge tier is announced but not yet released.
Cosmos 3 Super
The maximum-capability tier, targeted at NVIDIA Hopper and Blackwell datacenter GPUs. Best for large-scale synthetic data generation and the most demanding world-modeling and policy workloads. Weights available now on Hugging Face.
Cosmos 3 Nano
Optimized for RTX PRO 6000 workstation-class GPUs. The pragmatic on-prem starting point for most teams: tractable to run, available as a NIM microservice, and exposed on build.nvidia.com for GPU-free trials.
Cosmos 3 Edge
A compact model built for Jetson edge devices and on-robot inference. NVIDIA describes it as 'coming soon' — no release date has been announced, and it is not downloadable today. Plan device-side deployments around Nano until Edge ships.
The release is published on Hugging Face (the nvidia/cosmos3 collection lists 15 items), with deployable NIM microservices, a Cosmos3OmniPipeline in Hugging Face Diffusers, and a GPU-free trial on build.nvidia.com. Six open synthetic-data datasets ship alongside it, spanning robotics, physics, spatial reasoning, digital humans, autonomous driving, and warehouse operations.
02 — The Omnimodel Idea
What an omnimodel actually collapses.
NVIDIA's term for Cosmos 3 is "omnimodel": one model that natively understands and generates text, images, video, and ambient sound, and additionally produces robot action signals. The label matters less than what it removes. Modality encoders — a Vision Transformer for vision, a Variational Autoencoder for generation, and domain-aware vectors for actions — all project into a shared representation space, so the model reasons across modalities instead of bolting them together with glue code.
The differentiator over prior generative world models is the action output. Cosmos 3 does not just describe or render a scene; it can emit numerical control signals — joint angles, gripper positions, and trajectory points — that a robot controller or AV stack can execute directly. That is the difference between a model that imagines the world and one that can be wired into a physical control loop.
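To make the action output concrete, here is a minimal sketch of how a downstream consumer might represent such a trajectory and sanity-check it before handing it to a controller. The `ActionStep` and `validate_trajectory` names, the field layout, the gripper convention, and the joint limits are all hypothetical; the article does not describe Cosmos 3's actual output schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ActionStep:
    """One step of a hypothetical action trajectory (not Cosmos 3's real schema)."""
    t: float                          # timestamp in seconds
    joint_angles: Tuple[float, ...]   # radians, one entry per joint
    gripper: float                    # assumed convention: 0.0 open, 1.0 closed


def validate_trajectory(
    steps: Sequence[ActionStep],
    joint_limits: Sequence[Tuple[float, float]],
) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems: List[str] = []
    prev_t: Optional[float] = None
    for i, step in enumerate(steps):
        # Timestamps must strictly increase so the controller can interpolate.
        if prev_t is not None and step.t <= prev_t:
            problems.append(f"step {i}: timestamp {step.t} is not increasing")
        prev_t = step.t

        if not 0.0 <= step.gripper <= 1.0:
            problems.append(f"step {i}: gripper {step.gripper} outside [0, 1]")

        if len(step.joint_angles) != len(joint_limits):
            problems.append(
                f"step {i}: expected {len(joint_limits)} joints, "
                f"got {len(step.joint_angles)}"
            )
            continue
        for j, (angle, (lo, hi)) in enumerate(zip(step.joint_angles, joint_limits)):
            if not lo <= angle <= hi:
                problems.append(f"step {i}: joint {j} angle {angle} outside [{lo}, {hi}]")
    return problems
```

A gate like this is cheap insurance whenever model output feeds a physical control loop: anything that returns a non-empty list never reaches the robot.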
"Cosmos 3 doesn't just understand the physical world — it generates it, predicts actions within it, and outputs the action trajectories that robot controllers and AV systems need to act in it." — NVIDIA Technical Blog
Here is why the pipeline-collapse framing is the right lens. Before an omnimodel, a robotics team typically ran a perception model to understand a scene, a separate world model to simulate outcomes, and a third policy model to predict actions — passing outputs between them and absorbing integration error and latency at every boundary. A single model that handles all of those tasks does more than improve any one score: it removes inference steps, cuts handoff latency, and simplifies the MLOps stack a team has to maintain. NVIDIA frames this as compressing physical-AI training and evaluation cycles "from months to days" — a vendor claim worth testing against your own workload rather than taking at face value.
03 — Architecture
A two-tower design where reasoning comes first.
Cosmos 3 is built as a Mixture-of-Transformers (MoT) with a two-tower design. The two towers split the work — and the dependency direction between them is the whole point.
Reasoner Tower — autoregressive vision-language model
The Reasoner is an autoregressive vision-language model that ingests the scene and the instruction and builds a structured understanding of what is happening and what should happen next. It is the "think before you act" half of the model: it produces the context that conditions everything the Generator does.
Generator Tower — diffusion transformer
The Generator is a diffusion-based transformer that produces the output — future video frames, generated worlds, or action trajectories. Critically, the Generator cannot run without the Reasoner's context. Generation is always conditioned on reasoning, rather than the two operating as independent stages you could swap out.
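The dependency direction described here can be illustrated with a toy sketch in which the generator's only entry point requires the reasoner's output. Everything below is a stand-in: `ReasonerContext`, `ToyReasoner`, and `ToyGenerator` are hypothetical names, strings replace real hidden states, and no actual model is involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReasonerContext:
    """Stand-in for whatever the Reasoner hands to the Generator (hypothetical)."""
    summary: str


class ToyReasoner:
    def reason(self, scene: str, instruction: str) -> ReasonerContext:
        # A real reasoner would run an autoregressive VLM; here we just join strings.
        return ReasonerContext(summary=f"{scene} | {instruction}")


class ToyGenerator:
    def generate(self, context: ReasonerContext, num_frames: int) -> List[str]:
        # There is no context-free path: generation is always conditioned.
        if not isinstance(context, ReasonerContext):
            raise TypeError("generation requires a ReasonerContext")
        return [f"frame {i} conditioned on: {context.summary}" for i in range(num_frames)]


def run(scene: str, instruction: str, num_frames: int = 2) -> List[str]:
    """Reason first, then generate from the resulting context."""
    context = ToyReasoner().reason(scene, instruction)
    return ToyGenerator().generate(context, num_frames)
```

The design point is in the signature: `generate` has no overload that skips the context, which mirrors the claim that the two towers are not independently swappable stages.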
Shared spatial-temporal structure
Both towers share a 3D multi-dimensional rotary position embedding (mRoPE), which gives them a consistent sense of spatial and temporal structure across modalities. That shared coordinate system is part of what lets reasoning and generation stay aligned on the same scene rather than drifting apart.
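For readers unfamiliar with rotary embeddings, the sketch below shows one common way to extend them to three axes: split a feature vector into three chunks and rotate each chunk by the token's time, height, or width coordinate. This is the generic technique, not NVIDIA's implementation; the chunking scheme, the `base` of 10000, and the function names `rope_1d` and `rope_3d` are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def rope_1d(x: Sequence[float], pos: float, base: float = 10000.0) -> List[float]:
    """Rotate each feature pair (x[i], x[i + half]) by the angle pos * base**(-i / half)."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        angle = pos * base ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[half + i] * s
        out[half + i] = x[i] * s + x[half + i] * c
    return out


def rope_3d(x: Sequence[float], position: Tuple[float, float, float]) -> List[float]:
    """Split x into three equal chunks and rotate them by (t, h, w) respectively."""
    if len(x) % 6 != 0:
        raise ValueError("feature size must split into three even-length chunks")
    chunk = len(x) // 3
    out: List[float] = []
    for axis, pos in enumerate(position):
        out.extend(rope_1d(x[axis * chunk:(axis + 1) * chunk], pos))
    return out
```

The useful property is that the dot product between two rotated vectors depends only on the offset between their positions, not on the absolute coordinates. Shifting both tokens by the same amount in time and space leaves their attention score unchanged, which is one way a shared embedding can keep two components aligned on the same scene.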
04 — Five Modes
One set of weights, five ways to use it.
The same Cosmos 3 weights support five distinct use modes, which is the practical expression of the omnimodel claim. Rather than picking a specialist model per task, a builder selects a mode based on the input and output they need. The capability map below pairs each mode with a concrete robotics or autonomous-vehicle example.
Vision Language Model
Text and video in, text reasoning out. Use it to answer questions about a scene or generate a structured description — e.g. a warehouse robot asked what is on a shelf, or an AV stack reasoning about a traffic situation.
World Model / video generation
Text, image, or video in, generated video out. Produce synthetic worlds and rollouts for training data — a manipulation scene rendered from a prompt, or rare driving scenarios generated for AV evaluation.
Forward Dynamics Model
Action plus image in, future video out. Given a candidate action and the current frame, predict what happens next — letting a robot 'imagine' the result of a grasp before it commits, or an AV preview a maneuver.
Inverse Dynamics Model
Video in, action out. Recover the actions implied by a demonstration video — useful for learning from human demonstrations or auto-labeling teleoperation footage into action trajectories for training.
Policy Model
Image and text in, video and action out. The full policy loop: a dual-arm robot given a goal produces both the predicted rollout and the joint-angle trajectory to execute it — the mode early partners use for pick-and-place.
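The five modes above reduce to an input/output table, which can be encoded as a small lookup for choosing a mode from the data on hand. This is a sketch of the capability map as the article describes it, not an NVIDIA API: the mode keys, the `Modality` enum, and `modes_for` are hypothetical names, and treating the world model's "text, image, or video" as "any one of these is enough" is an interpretation.

```python
from enum import Enum
from typing import Dict, FrozenSet, List, NamedTuple, Set


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    ACTION = "action"


class ModeSpec(NamedTuple):
    inputs: FrozenSet[Modality]
    outputs: FrozenSet[Modality]
    any_input: bool = False  # True when a single listed input is sufficient


T, I, V, A = Modality.TEXT, Modality.IMAGE, Modality.VIDEO, Modality.ACTION

# Transcribed from the capability map in the article.
MODES: Dict[str, ModeSpec] = {
    "vision_language_model": ModeSpec(frozenset({T, V}), frozenset({T})),
    "world_model": ModeSpec(frozenset({T, I, V}), frozenset({V}), any_input=True),
    "forward_dynamics": ModeSpec(frozenset({A, I}), frozenset({V})),
    "inverse_dynamics": ModeSpec(frozenset({V}), frozenset({A})),
    "policy": ModeSpec(frozenset({I, T}), frozenset({V, A})),
}


def modes_for(available: Set[Modality], wanted: Set[Modality]) -> List[str]:
    """Return modes that produce exactly `wanted` and can be fed from `available`."""
    matches: List[str] = []
    for name, spec in MODES.items():
        if spec.outputs != frozenset(wanted):
            continue
        if spec.any_input:
            has_inputs = bool(spec.inputs & available)
        else:
            has_inputs = spec.inputs <= available
        if has_inputs:
            matches.append(name)
    return matches
```

For example, a team holding only teleoperation video and wanting action labels gets `modes_for({Modality.VIDEO}, {Modality.ACTION})`, which returns `["inverse_dynamics"]`.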
05 — Tier Selection
Which tier to pick for your hardware.
Most launch coverage skips the hardware reality, but the tier you choose is dictated by where you deploy. The decision matrix below maps each tier to its hardware target, primary use case, and availability — including the Edge tier that is announced but not yet downloadable, so device-side teams know not to plan around it arriving today.
| Tier | Parameters | Hardware target | Best for & availability |
|---|---|---|---|
| Cosmos 3 Super | 64B total (32B + 32B) | Hopper / Blackwell datacenter GPUs | Large-scale synthetic data generation and the heaviest world-modeling and policy workloads. Quantization to BF16, FP8, and NVFP4 supported. Weights available now on Hugging Face; deployable as a NIM microservice. |
| Cosmos 3 Nano | 16B total (8B + 8B) | RTX PRO 6000 workstation-class GPU | The pragmatic on-prem starting point. Tractable for most teams, exposed on build.nvidia.com for GPU-free trials, and available as a NIM microservice. Use this to prototype before committing datacenter spend. |
| Cosmos 3 Edge | 4B (coming soon) | Jetson family edge devices | Compact on-robot / on-device inference. Announced but not yet released; no date given. Do not plan an embedded launch around Edge being downloadable today; prototype on Nano and watch for the release. |
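The matrix above can be captured as a small lookup that also encodes the article's advice to fall back to Nano while Edge is unreleased. This is a planning sketch, not an NVIDIA tool: the `Tier` class, the deployment keys, and `pick_tier` are hypothetical, and the `released` flags reflect availability as stated at launch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Tier:
    name: str
    params_b: int     # total parameters, in billions
    hardware: str
    released: bool    # availability as stated at launch


TIERS: Dict[str, Tier] = {
    "datacenter": Tier("Cosmos 3 Super", 64, "Hopper / Blackwell datacenter GPUs", True),
    "workstation": Tier("Cosmos 3 Nano", 16, "RTX PRO 6000 workstation-class GPU", True),
    "edge": Tier("Cosmos 3 Edge", 4, "Jetson family edge devices", False),
}


def pick_tier(deployment: str) -> Tier:
    """Return the tier to start on for a deployment target.

    Unreleased tiers fall back to Nano, following the guidance to prototype
    there until Edge ships.
    """
    if deployment not in TIERS:
        raise ValueError(f"unknown deployment target: {deployment!r}")
    tier = TIERS[deployment]
    return tier if tier.released else TIERS["workstation"]
```

Flipping `released` to `True` for the edge entry is the only change needed once Edge becomes downloadable.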
06 — Benchmarks
The leaderboard claims, read honestly.
NVIDIA says Cosmos 3 ranks #1 across a wide set of physical-AI leaderboards. Every one of these is stated among open models, and we treat them all as vendor-stated: NVIDIA had not published point-score comparison tables at launch, and independent reproduction of the rankings was not available. These leaderboards are also new, with limited third-party audit history. We list the claims qualitatively below — no invented scores — because the honest version is more useful than a precise number we cannot verify.
[Figure: NVIDIA leaderboard claims, among open models (vendor-stated). Source: NVIDIA; all rankings among open models; vendor-stated, not independently reproduced at launch.]
What can be said with confidence: the rankings span three distinct problem families (world generation, robot policy, and vision understanding), which is itself notable for a single model, since most systems specialize. What cannot be said yet: how Cosmos 3 compares to closed frontier physical-AI systems, or how the open-model rankings hold up under independent evaluation. For sophisticated teams, the right move is the same one we recommend for any new model: run the eval on your own scenes and tasks, not on the press release.
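Running the eval on your own scenes does not need much scaffolding to start. The sketch below scores any callable against per-task cases with exact-match accuracy. It is deliberately model-agnostic: `evaluate`, the `Case` shape, and exact-match scoring are assumptions for illustration, and a real physical-AI evaluation would swap in task-appropriate metrics and a real model client.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (prompt, expected answer)


def evaluate(
    model: Callable[[str], str],
    cases_by_task: Dict[str, List[Case]],
) -> Dict[str, float]:
    """Return exact-match accuracy per task (case-insensitive, whitespace-trimmed).

    Each task needs at least one case; statistics.mean raises on empty input.
    """
    def score(prompt: str, expected: str) -> float:
        return 1.0 if model(prompt).strip().lower() == expected.strip().lower() else 0.0

    return {
        task: mean(score(prompt, expected) for prompt, expected in cases)
        for task, cases in cases_by_task.items()
    }
```

Reporting scores per task rather than as one average keeps weak spots visible, which matters when a single model is claimed to cover several problem families.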
07 — License & Access
Open weights under a Linux Foundation license.
Cosmos 3 ships under OpenMDW-1.1, a license stewarded by the Linux Foundation. It permits commercial use, training, modification, redistribution, and derivative models. The one notable constraint: products built on Cosmos must display "Built on NVIDIA Cosmos" somewhere visible — a website, UI, about page, or documentation. NVIDIA does not claim ownership of outputs generated from Cosmos or its derivatives.
The licensing story is bigger than one model. NVIDIA adopted OpenMDW-1.1 across four model families at once — Cosmos, Isaac GR00T, Ising, and Nemotron — standardizing open-model licensing across its physical-AI, robotics, quantum, and general foundation-model lines. For the coding-model side of that same strategy, see our coverage of NVIDIA's Nemotron open-model strategy. Note that OpenMDW-1.1 is distinct from NVIDIA's prior NVIDIA Open Model License — Cosmos 3 uses the Linux Foundation standard.
| Surface | What you get | Best for |
|---|---|---|
| huggingface.co/nvidia | Cosmos3-Super + Cosmos3-Nano weights | On-prem deployment, fine-tuning, quantization. Run via the Cosmos3OmniPipeline in Hugging Face Diffusers. Six open synthetic-data datasets are published in the same org for training. |
| NIM microservices | Docker deploy with an NGC API key | Production integration on managed infra. Partners include CoreWeave, Microsoft Azure, Baseten, Nebius, Deep Infra, and Classmethod. Best when you want managed serving rather than self-hosting weights. |
| build.nvidia.com | Cosmos 3 Nano Reasoner + full model | GPU-free trial and evaluation. The fastest path to test prompts and judge fit before any infrastructure decision; two hosted experiences are live at launch. |
A full builder toolchain ships with the model: Cosmos Curator for data filtering, annotation, and deduplication; Cosmos Evaluator for output scoring; NVIDIA TAO 7 for fine-tuning; and the Cosmos Cookbook of domain-specific recipes, with post-training scripts on the github.com/nvidia/Cosmos repository. If you are weighing an open physical-AI model against alternatives for a specific pipeline, our AI & digital transformation engagements start with exactly this kind of comparative, scenario-grounded evaluation.
08 — Implications
What this means for builders and the ecosystem.
Cosmos 3 also launched with a coalition and a roster of early adopters, which together signal where this is headed. The NVIDIA Cosmos Coalition has six founding members — Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI — drawn deliberately from different corners of the field: Agile Robots is a hardware manufacturer, Black Forest Labs, LTX, and Runway are generative-video labs, Skild AI works on generalist robot policy, and Generalist rounds out the policy side. The coalition's premise is a shared ecosystem with DGX Cloud infrastructure access and open contribution — a different model from a typical vendor-partner program.
Agile Robots is already an early-access partner, using Cosmos 3 to generate action-conditioned training trajectories for its Thor 3 and FR3 robots on complex industrial pick-and-place tasks, with testing inside the European Industrial AI Cloud. The launch adopter list also spans Samsung, LG Electronics, and Doosan Robotics in robotics; Li Auto in autonomous vehicles; and Centific, Fogsphere, Linker Vision, Milestone Systems, and Yuan in vision AI.
"Thanks to the training in NVIDIA Cosmos 3, our robotic arms Thor 3 and FR3 can grasp a variation of objects with greater accuracy." — Agile Robots, early-access partner
For builders, the decision tree depends on what you are building. The matrix below sorts the most common physical-AI workload classes — and for the broader open-model landscape this fits into, our open-weight frontier models retrospective and our humanoid robotics pipeline coverage give useful context.
World-model rollouts for training
Generating rare scenarios and labeled rollouts at scale is the clearest immediate win — replacing chained specialist models with one. Start on Super for throughput; validate output quality with Cosmos Evaluator before trusting the data.
Workstation-scale evaluation
Most teams should prototype on Nano on an RTX PRO 6000 (or the GPU-free build.nvidia.com trial) before committing datacenter spend. Benchmark on your own scenes — the leaderboards are vendor-stated.
Device-side inference
Edge (4B, Jetson) is the eventual fit, but it is not released yet. Do not block a roadmap on it. Prototype the policy on Nano now and migrate to Edge once it ships, treating the timeline as unannounced.
Managed serving vs self-host
If you want managed infra, deploy via NIM microservices on a partner cloud; if you need sovereignty or fine-tuning control, self-host the open weights. Either way, the OpenMDW-1.1 'Built on NVIDIA Cosmos' attribution applies.
09 — Conclusion
The first open omnimodel for the physical world.
The real story is pipeline collapse, not a single benchmark.
Cosmos 3 is a genuine step in open physical AI: one model that reasons over a scene, generates the world, predicts outcomes, and outputs the action signals a robot needs — under a permissive Linux Foundation license with weights you can download today. The two-tower Mixture-of-Transformers design, where generation is always conditioned on reasoning, is the engineering choice that makes the omnimodel framing more than marketing.
The most consequential change is not any leaderboard position — every one of those is vendor-stated, among open models, and unproven by independent evaluation at launch. It is the pipeline collapse: replacing chained perception, world, and policy models with a single forward pass removes inference steps, handoff latency, and a large slice of MLOps complexity. That is a structural advantage that holds even if the headline rankings move.
The practical move for any serious team is the same as with any new model: ignore the press-release numbers, download Nano or open the build.nvidia.com trial, and run the evaluation on the scenes and tasks you actually care about. Cosmos 3 is the strongest open starting point for physical AI right now — but "strongest open starting point" is a reason to test it seriously, not a reason to ship it on faith.