MLPerf Training v6.0 results landed on June 16, 2026, and the most useful way to read the industry-standard training benchmark is not as a podium but as a market signal. The headline numbers — NVIDIA Blackwell topping the chart, AMD closing in, a doubling of cloud entries — describe where training cost, chip procurement, and deployment patterns are heading over the next year.
MLCommons runs MLPerf Training every six months as a neutral, audited measure of how long real models take to train on real hardware. This round drew 95 unique system submissions from 24 organizations, spanning 13 different accelerators and 19 host processors — a diversity record for the suite. Crucially, it added the first 671-billion-parameter mixture-of-experts (MoE) benchmark, making the leaderboard finally test the architecture that underlies nearly every major 2025–2026 model.
This guide walks through what changed in v6.0, why the MoE addition matters, a consolidated time-to-train reference card for all seven workloads, where NVIDIA and AMD actually stand, the software compounding story most coverage skips, and a practical read for teams making hardware and cloud decisions. Several performance figures are vendor-stated; we flag them as such throughout.
- 01. MoE is now a first-class benchmark, not a side note. v6.0 added DeepSeek V3 (671B total / 37B active) and GPT-OSS 20B (21B total / 3.6B active), both mixture-of-experts. DeepSeek V3 is now the largest workload in the suite, reflecting the industry-wide shift to sparse models.
- 02. NVIDIA Blackwell led every benchmark it entered. NVIDIA reports its GB300 NVL72 (Blackwell Ultra) posted the fastest time-to-train on all seven workloads, and NVIDIA was the only vendor to submit across the full suite. The fastest DeepSeek-V3 run finished in 2.02 minutes on 8,192 GPUs (CoreWeave).
- 03. AMD is no longer a token alternative. AMD's MI355X came within 5% of NVIDIA's B200 on Llama 2-70B fine-tuning and within 6% on Llama 3.1-8B pre-training (vendor-stated, both using FP4), and AMD made its first-ever multi-node MLPerf Training submission.
- 04. Software is compounding faster than hardware cycles. NVIDIA reports its per-GPU DeepSeek-V3 throughput on Blackwell rose roughly 1.3x in about six months from software optimization alone, with no hardware change. That means already-deployed systems keep getting faster between upgrade cycles.
- 05. Training-as-a-service is now a leaderboard category. Cloud submissions doubled versus v5.1 six months earlier, 60% of all submissions were multi-node, and four first-time submitters appeared. How teams procure training compute is shifting structurally toward the cloud.
01. What Changed
A diversity record and a bigger benchmark.
MLPerf Training is the benchmark hardware vendors actually compete on — a fixed set of models trained to a defined quality target, with time-to-train measured under audited rules. Reading it well means separating the genuinely new signals from the marketing. Three things define v6.0.
Record participation. The round collected 95 unique system submissions from 24 organizations, using 13 different hardware accelerators and 19 different host processors — MLCommons describes it as a new diversity record for the suite. Four first-time submitters appeared: Inventec, Netweb Technologies India, TTA, and Vultr, a mix of Asian ODMs and cloud providers that signals the benchmark is broadening beyond the usual hyperscaler-and-OEM core.
Two new workloads. v6.0 added DeepSeek V3 (671B total parameters, 37B activated per token) and GPT-OSS 20B (21B total, 3.6B activated) — both mixture-of-experts architectures. DeepSeek V3 is now the largest benchmark in the suite and shares its base with DeepSeek-R1; GPT-OSS 20B was deliberately scoped to train on a single 8-GPU node, giving teams without large clusters an entry point.
A maturing precision story. Multiple FP4-precision training recipes appeared for the first time, from more than one vendor — NVIDIA’s NVFP4 and AMD’s MXFP4. These are vendor-specific FP4 formats, not interchangeable, but their arrival in an audited benchmark is a real maturation signal for lower-precision training.
- 95 submissions across 24 organizations: spanning 13 different hardware accelerators and 19 host processors, which MLCommons calls a diversity record for the benchmark suite.
- DeepSeek V3 MoE: now the biggest workload in the suite, with 671B total parameters, 37B activated per token, and the same base model that underlies DeepSeek-R1.
- New on the board: Inventec, Netweb Technologies India, TTA, and Vultr, a mix of Asian ODMs and cloud providers expanding the benchmark's base.
02. The MoE Pivot
Mixture-of-experts is the new baseline.
The most consequential change in v6.0 is not who finished fastest — it is what they were asked to train. Adding DeepSeek V3 and GPT-OSS 20B means MLPerf now measures the sparse mixture-of-experts architecture that dominates frontier model design, rather than the dense transformers that defined earlier rounds. That reframing matters: dense-architecture results (the older Llama benchmarks) are now legacy context rather than leading indicators.
MoE matters for hardware because it decouples total parameter count from per-token compute. DeepSeek V3 holds 671B parameters but activates only 37B per token; GPT-OSS 20B holds 21B but activates 3.6B. That sparsity rewards memory capacity and interconnect bandwidth — moving expert weights and routing tokens between them — as much as raw matrix-multiply throughput, which changes how chips should be compared.
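The decoupling described above can be made concrete with a few lines of arithmetic. The sketch below uses only the parameter counts quoted in this article; the function and variable names are illustrative, and the "idle but resident" framing is a simplification that ignores how experts are sharded across devices.

```python
# Total vs. per-token active parameters (in billions), as quoted in the text.
MODELS = {
    "DeepSeek V3": (671.0, 37.0),
    "GPT-OSS 20B": (21.0, 3.6),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that participate in each token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    frac = active_fraction(total_b, active_b)
    idle_b = total_b - active_b
    # Every parameter must be stored somewhere, but only `frac` of them
    # contribute matrix-multiply work for any given token.
    print(f"{name}: {frac:.1%} active per token, {idle_b:.1f}B stored but idle")
```

Roughly 5.5% of DeepSeek V3's weights and 17.1% of GPT-OSS 20B's weights are active for any one token, which is why memory capacity and interconnect matter so much relative to raw compute.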
"Sparse computation is a dominant trend in AI right now. Over the past two years, all major new generative AI models have utilized sparse computation architecture, frequently MoE." (Shriya Rishab, MLPerf Training Working Group co-chair, MLCommons)
The two additions also signal an intent to test both ends of the scale. GPT-OSS 20B was built specifically as an entry-point benchmark — trainable on a single 8-GPU node, using the same dataset as the existing Llama 3.1 8B workload. For enterprise AI teams, that is MLCommons explicitly acknowledging that meaningful training no longer happens only at hyperscaler scale, and that a credible benchmark should exist within a single-node hardware footprint. If your team is weighing where model training fits alongside hosted inference, our guide to reading AI benchmarks critically is a useful companion to this round.
03. Reference Card
Time-to-train across all seven workloads.
The complete v6.0 suite comprises seven workloads. The table below consolidates the fastest reported time-to-train for each, the platform and GPU count behind it, and a scale tier we derive from the GPU count: single-node (≤8 GPUs), multi-node (9–999), or hyperscale (1,000+). Every time figure is NVIDIA-reported from its own submissions, so read it as vendor-stated best-case rather than a cross-vendor head-to-head.
| Workload | Fastest time-to-train | Platform | GPU count | Scale tier |
|---|---|---|---|---|
| DeepSeek-V3 (671B) | 2.02 min | GB300 NVL72 (CoreWeave) | 8,192 | Hyperscale |
| Llama 3.1 405B | 7.07 min | GB200 NVL72 (Azure) | 8,192 | Hyperscale |
| GPT-OSS 20B | 7.43 min | GB300 NVL72 | 512 | Multi-node |
| Llama 3.1 8B | 4.46 min | GB200 NVL72 | 1,024 | Hyperscale |
| Llama 2 70B LoRA | 0.4 min | GB300 NVL72 | 512 | Multi-node |
| FLUX.1 | 17.1 min | GB300 NVL72 | 512 | Multi-node |
| DLRM-dcnv2 | 0.67 min | GB300 NVL72 | 64 | Multi-node |
Two patterns are worth naming. First, the very largest models still need hyperscale clusters to hit these times — DeepSeek-V3 and Llama 3.1 405B both ran on 8,192 GPUs. Second, fine-tuning and recommendation workloads (Llama 2 70B LoRA at 0.4 minutes, DLRM-dcnv2 at 0.67) finish in well under a minute at far smaller scale, which is closer to what most enterprise teams actually run. The leaderboard's extremes get the headlines; the sub-minute multi-node rows are the realistic reference points.
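For readers who want to re-derive the scale tiers or compare rows on something other than wall-clock time, here is a minimal sketch. The tier thresholds are the ones defined for the reference card; "GPU-minutes" (time multiplied by GPU count) is our own rough proxy for resources consumed, not an MLPerf metric, and it ignores differences between GPU generations.

```python
# Rows copied from the reference card: (workload, minutes, GPU count).
RESULTS = [
    ("DeepSeek-V3 (671B)", 2.02, 8192),
    ("Llama 3.1 405B", 7.07, 8192),
    ("GPT-OSS 20B", 7.43, 512),
    ("Llama 3.1 8B", 4.46, 1024),
    ("Llama 2 70B LoRA", 0.4, 512),
    ("FLUX.1", 17.1, 512),
    ("DLRM-dcnv2", 0.67, 64),
]


def scale_tier(gpus: int) -> str:
    """Tier derived from GPU count, using the article's thresholds."""
    if gpus <= 8:
        return "Single-node"
    if gpus < 1000:
        return "Multi-node"
    return "Hyperscale"


def gpu_minutes(minutes: float, gpus: int) -> float:
    """Rough resource proxy (an assumption of this sketch, not an MLPerf metric)."""
    return minutes * gpus


for workload, minutes, gpus in RESULTS:
    print(f"{workload:20s} {scale_tier(gpus):12s} {gpu_minutes(minutes, gpus):10.1f} GPU-min")
```

Seen this way, even the 0.4-minute LoRA run consumes about 205 GPU-minutes, so a fast time-to-train does not by itself mean a cheap run.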
04. NVIDIA
Blackwell Ultra tops every workload it entered.
NVIDIA reports that its GB300 NVL72 (Blackwell Ultra) posted the fastest time-to-train on every benchmark in v6.0 — and that it was the only vendor to submit across all seven workloads. That last point matters more than the first: because NVIDIA was the sole full-suite submitter, "fastest on every benchmark" is partly a statement about coverage. AMD, for instance, did not submit to DeepSeek-V3 or GPT-OSS 20B, so a like-for-like full-suite comparison does not exist this round.
Within NVIDIA's own results, the generational story is clean. The company reports the GB300 NVL72 delivers up to 1.6x faster training than the prior GB200 NVL72 at an equivalent GPU count, attributing the gain to higher compute density with NVFP4, expanded memory capacity, and a higher power ceiling. The speedup varies by workload: roughly 1.6x on DeepSeek-V3, 1.5x on Llama 3.1 405B, and 1.3x on GPT-OSS 20B.
Figure: GB300 vs GB200 generational speedup, by workload. Source: NVIDIA Technical Blog (vendor-stated), Jun 16, 2026.
05. AMD
The real story for anyone not buying NVIDIA.
AMD's v6.0 submission is the round's quietest important result. On the LLM workloads enterprises actually run, AMD reports its Instinct MI355X came within 5% of NVIDIA's B200 on Llama 2-70B fine-tuning and within 6% on Llama 3.1-8B pre-training, both using FP4 (AMD's MXFP4 against NVIDIA's NVFP4). Those figures are vendor-stated, though echoed in third-party coverage, and they are close enough to reframe AMD from token alternative to genuine procurement option.
The memory dimension sharpens the case. AMD's MI350-series parts carry 288 GB of HBM3E and, per AMD, support models up to 520 billion parameters on a single GPU, which is meaningfully more on-package memory capacity than the B200 generation offers. For MoE training, where holding expert weights close to compute reduces interconnect pressure, that headroom is a real consideration, not a spec-sheet footnote.
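A quick back-of-the-envelope check shows how the 520-billion-parameter figure relates to 288 GB. The sketch below counts weights only and uses decimal gigabytes; the text does not say which precision AMD assumes for that claim, so the precision table is our assumption, and real training also needs room for gradients, optimizer state, activations, and quantization scale factors.

```python
# Bytes per parameter for common weight precisions (weights only).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}

HBM_GB = 288.0          # MI350-series HBM3E capacity quoted in the text
PARAMS_BILLIONS = 520.0  # AMD's stated single-GPU model size


def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weights-only footprint in decimal GB: 1e9 params * bytes / 1e9 bytes per GB."""
    return params_billions * bytes_per_param


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_footprint_gb(PARAMS_BILLIONS, nbytes)
    verdict = "fits" if gb <= HBM_GB else "does not fit"
    print(f"{precision}: {gb:.0f} GB of weights, {verdict} in {HBM_GB:.0f} GB")
```

Only the 4-bit case (260 GB) fits, which suggests the claim is about holding low-precision weights rather than about training a model of that size on one GPU.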
| Workload | AMD chip | NVIDIA chip | Gap (AMD-stated) | Precision |
|---|---|---|---|---|
| Llama 2-70B fine-tuning | MI355X | B200 | Within 5% | MXFP4 vs NVFP4 |
| Llama 3.1-8B pre-training | MI355X | B200 | Within 6% | MXFP4 vs NVFP4 |
Two structural firsts round out the picture. v6.0 marked AMD's first-ever multi-node MLPerf Training submission — it entered FLUX.1 at 64 MI325X GPUs, while Oracle Cloud Infrastructure, working with AMD, submitted FLUX.1 at 512 GPUs. AMD also reports a record 10 ecosystem partners submitting results on its Instinct platforms, with partner numbers landing within 6% of AMD's official figures. A broad partner ecosystem is what turns a competitive chip into a buyable one.
The generational gain is large but needs its timeframe stated precisely. AMD reports a 3.5x improvement on Llama 2-70B training from MLPerf Training 5.0 (MI300X, MXFP8) to v6.0 (MI355X, MXFP4), but v5.0 was published a little over a year before v6.0, so this is an annual-scale generational leap that also spans a precision change, not a six-month one. Within the MI350 series over the most recent seven months, AMD reports more modest gains: MI355X improved Llama 2-70B fine-tuning by 19% and Llama 3.1-8B pre-training by 13%.
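One way to put gains measured over different windows on a common footing is to annualize them. The sketch below is illustrative only: it assumes gains compound at a constant pace, which real hardware and software progress does not do, and it treats the v5.0-to-v6.0 window as 12 months, which is our rounding.

```python
def annualized(gain: float, months: float) -> float:
    """Scale a multiplicative gain measured over `months` to a 12-month rate,
    assuming (simplistically) that it compounds at a constant pace."""
    return gain ** (12.0 / months)


# (label, gain factor, measurement window in months)
CLAIMS = [
    ("Llama 2-70B, v5.0 MI300X to v6.0 MI355X", 3.5, 12.0),  # window assumed
    ("Llama 2-70B fine-tuning, MI355X", 1.19, 7.0),
    ("Llama 3.1-8B pre-training, MI355X", 1.13, 7.0),
]

for label, gain, months in CLAIMS:
    yearly = annualized(gain, months)
    print(f"{label}: {gain:.2f}x over {months:.0f} months -> {yearly:.2f}x per year")
```

Even after annualizing, the same-chip gains stay well below the cross-generation figure, which is consistent with reading the 3.5x as mostly a hardware-and-precision step.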
06. Software Compounding
The story hardware vendors don't lead with.
The most under-covered result in v6.0 is not about silicon at all. NVIDIA reports that its per-GPU DeepSeek-V3 throughput on Blackwell rose from roughly 1,088 TFLOPS/GPU in November 2025 to about 1,648 TFLOPS/GPU (around 6,338 tokens/sec/GPU) in June 2026 — a 1.3x uplift in roughly six months driven purely by software optimization, with no hardware change. Re-checking the arithmetic, 1,648 ÷ 1,088 ≈ 1.51, so the "1.3x" NVIDIA cites for this comparison is best read as its own stated figure rather than a clean ratio of those two throughput numbers; either way, the direction is a substantial same-hardware gain.
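The arithmetic check is easy to reproduce. This sketch uses only the two throughput figures and the stated multiplier from the paragraph above; the helper names are ours, and the "implied baseline" is simply what the starting figure would have to be for the stated 1.3x to hold.

```python
BEFORE_TFLOPS = 1088.0   # NVIDIA-reported, November 2025
AFTER_TFLOPS = 1648.0    # NVIDIA-reported, June 2026
STATED_UPLIFT = 1.3      # multiplier NVIDIA cites


def uplift(before: float, after: float) -> float:
    """Plain ratio of the later figure to the earlier one."""
    return after / before


def implied_baseline(after: float, stated_uplift: float) -> float:
    """Starting throughput that would make the stated uplift exact."""
    return after / stated_uplift


print(f"Ratio of quoted figures: {uplift(BEFORE_TFLOPS, AFTER_TFLOPS):.2f}x")
print(f"Baseline implied by 1.3x: {implied_baseline(AFTER_TFLOPS, STATED_UPLIFT):.0f} TFLOPS/GPU")
```

The gap between the two suggests the stated multiplier and the quoted throughput pair may come from different measurement bases, so neither should be treated as a derivation of the other.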
The mechanics behind it are concrete. For GPT-OSS 20B on Blackwell, NVIDIA reports a CuTe DSL-based kernel achieving a 93% end-to-end speedup, a router optimization delivering a 5x kernel speedup, and all-to-all communication overlap reaching nearly 100% overlap for an 8% performance benefit. These are vendor-stated, but they describe a real pattern: a large share of each round's gains now comes from the software stack maturing around fixed hardware.
- Starting throughput: roughly 1,088 TFLOPS/GPU, NVIDIA's reported per-GPU DeepSeek-V3 training throughput on Blackwell at the prior measurement point and the baseline for the software-only improvement.
- Software-optimized: about 1,648 TFLOPS/GPU (roughly 6,338 tokens/sec/GPU), reached with no hardware change. NVIDIA frames this as a ~1.3x same-hardware uplift over about six months.
- Ceiling not yet reached: if already-deployed Blackwell systems keep getting faster from software, the performance gap to a hardware upgrade narrows, changing the upgrade calculus for existing buyers.
The procurement implication is the part most coverage skips. If a deployed system gains meaningful throughput between hardware generations purely through software, the case for rushing the next upgrade weakens — buyers of current-generation systems are still getting faster without a refresh. That is a genuinely different way to value a multi-million-dollar cluster than a static spec sheet suggests, and it argues for weighting a vendor's software cadence alongside its peak hardware numbers.
07. Cloud
Training-as-a-service becomes a leaderboard category.
The structural shift in v6.0 is who is submitting. Cloud submissions doubled compared with v5.1 six months earlier, and 60% of all v6.0 submissions were multi-node systems. The benchmark is increasingly populated by training systems you rent rather than own — CoreWeave, Microsoft Azure, Oracle Cloud Infrastructure, Nebius, and Vultr all appear in the results.
"There are more ways of getting your AI training than ever before. Several companies now offer training systems in the cloud." (Pavan Yalamanchili, MLPerf Working Group co-chair, MLCommons)
Cloud providers also surfaced useful head-to-head data. Nebius, using NVIDIA HGX B300 and GB300 NVL72, reported the GB300 NVL72 ran Llama 3.1-8B about 12% faster than the HGX B300 at an equivalent GPU count (64.35 minutes versus 72.01 on an 8-GPU node). And several NVIDIA partners reported their own gains — Cohere said it saw 3x faster training on GB200 NVL72 versus prior-generation hardware, and Nebius and Higgsfield reported a 30% reduction in training time on GB300 NVL72. All of these are self-reported partner testimony surfaced through NVIDIA's blog, not independently audited benchmark entries.
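Claims phrased as "X% faster" and "X% reduction in training time" are not interchangeable, and the figures above mix both. The sketch below converts between them using the Nebius times and the 30% reduction quoted in this section; the function names are ours.

```python
def speedup_pct(slow_minutes: float, fast_minutes: float) -> float:
    """How much faster the quicker system is, as a throughput-style percentage."""
    return (slow_minutes / fast_minutes - 1.0) * 100.0


def time_reduction_pct(slow_minutes: float, fast_minutes: float) -> float:
    """How much wall-clock time the quicker system saves, as a percentage."""
    return (1.0 - fast_minutes / slow_minutes) * 100.0


def reduction_to_speedup(reduction_pct: float) -> float:
    """Convert a 'percent less time' claim into an 'N times faster' multiplier."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


hgx_b300, gb300_nvl72 = 72.01, 64.35  # minutes, as reported by Nebius
print(f"Faster by:      {speedup_pct(hgx_b300, gb300_nvl72):.1f}%")
print(f"Time saved:     {time_reduction_pct(hgx_b300, gb300_nvl72):.1f}%")
print(f"30% less time = {reduction_to_speedup(30.0):.2f}x faster")
```

The Nebius pair works out to about 12% faster but only about 10.6% less time, so it is worth checking which convention a vendor is using before comparing claims.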
The takeaway for buyers is that the "build versus rent" decision for training compute now has audited evidence on both sides. A doubling of cloud submissions in a single six-month cycle is a fast structural move, and it means time-to-train comparisons increasingly need a cost-and-availability column the leaderboard itself does not provide.
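A minimal sketch of what that missing column might look like follows. Everything here except the 0.4-minute, 512-GPU LoRA result is hypothetical: the provider names, hourly prices, and queue waits are placeholders to be replaced with real quotes, and benchmark time-to-train excludes provisioning, data staging, and checkpointing.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    provider: str              # hypothetical label
    minutes_to_train: float    # from a benchmark row or your own measurement
    gpus: int
    usd_per_gpu_hour: float    # hypothetical price; use a real quote
    queue_wait_minutes: float  # hypothetical availability delay

    def cost_usd(self) -> float:
        """Compute cost of the training run itself."""
        return self.minutes_to_train / 60.0 * self.gpus * self.usd_per_gpu_hour

    def elapsed_minutes(self) -> float:
        """Wall-clock time including the wait for capacity."""
        return self.queue_wait_minutes + self.minutes_to_train


offers = [
    Offer("provider-a", 0.40, 512, 5.00, 0.0),
    Offer("provider-b", 0.45, 512, 3.50, 45.0),
]

for offer in sorted(offers, key=lambda o: o.cost_usd()):
    print(f"{offer.provider}: ${offer.cost_usd():.2f}, "
          f"{offer.elapsed_minutes():.1f} min including queue")
```

With these placeholder inputs the cheaper offer is also the slower one end to end, which is exactly the trade-off a time-to-train ranking cannot show.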
08. What To Do
How teams should read this round.
For most teams the right use of MLPerf Training v6.0 is as a decision input, not a verdict. Match the signal to your situation.
- Largest MoE training runs: NVIDIA's full-suite coverage and GB300 generational gains make it the safe default for the very largest runs. The 8,192-GPU DeepSeek-V3 and Llama 3.1 405B results are hyperscale-only, so plan for cloud or a major cluster.
- Memory-bound workloads: AMD's MI355X landing within 5-6% of B200 on the workloads enterprises run, combined with 288 GB of HBM3E, makes it a serious procurement option, especially where per-GPU memory headroom matters. Benchmark on your own models first.
- Already own Blackwell: the software-only throughput gains argue against rushing an upgrade. Track your vendor's software cadence and re-measure your own workloads before committing capital to the next generation.
- Cloud or single-node training: cloud submissions doubled and GPT-OSS 20B is single-node by design. Rent capacity from the providers on the leaderboard and add the cost-and-availability column the benchmark omits.
Across all four paths, one discipline applies: treat vendor-stated figures as starting hypotheses and run your own evaluation on the models and data you actually train. A 5% benchmark gap can invert on a different workload, a different batch size, or a different interconnect. If you want help turning a benchmark round like this into a concrete hardware-and-cloud decision, our AI and digital transformation engagements start with exactly this kind of comparative evaluation, and it pairs naturally with the inference-side view in our AI inference and latency benchmarks guide.
09. Conclusion
A benchmark that reads as a market map.
The 2026 hardware race is about memory, software, and where you rent — not just peak speed.
MLPerf Training v6.0 is most valuable read as a forward indicator. The arrival of a 671B MoE benchmark confirms that sparse models are now the default thing worth measuring. NVIDIA Blackwell tops every benchmark it entered, but its full-suite sweep is partly a coverage story, and AMD's 5-to-6-percent gap on the workloads enterprises actually run, paired with more on-package memory, makes the competitive field real for the first time in a while.
The quieter signals matter more for budgets. Software optimization is delivering same-hardware throughput gains between generations, which changes the upgrade calculus for anyone who already owns current silicon. And a doubling of cloud submissions in six months says the procurement question is shifting from which chip to buy toward whether to buy at all.
For teams making 2026 training decisions, the practical move is the same one the benchmark itself encourages: take the audited numbers as inputs, treat vendor framing skeptically, and run your own evaluation on the models and data you care about. The leaderboard tells you where the field is heading. It does not tell you which row is right for your workload — that is still a decision you have to make on your own hardware.