Development · New Release · 10 min read · Published June 21, 2026

95 systems · 24 organizations · 671B MoE benchmark joins the suite

MLPerf Training v6.0: Reading the 2026 Hardware Race

MLCommons published MLPerf Training v6.0 on June 16, 2026, and the results are less a speed contest than a map of where AI training compute is heading. A 671-billion-parameter mixture-of-experts benchmark arrives, NVIDIA Blackwell tops every workload it entered, AMD lands within 5–6% of NVIDIA's B200 on two core LLM workloads (vendor-stated), and cloud submissions double.

Digital Applied Team
Senior strategists · Published Jun 21, 2026
Sources: MLCommons + vendor blogs
  • Systems submitted: 95, from 24 organizations (diversity record)
  • Fastest DeepSeek-V3: 2.02 min, on 8,192 GB300 GPUs
  • AMD MI355X vs B200: 5–6% gap on core LLM work (vendor-stated)
  • Cloud submissions: 2×, doubled vs v5.1 six months prior

MLPerf Training v6.0 results landed on June 16, 2026, and the most useful way to read the industry-standard training benchmark is not as a podium but as a market signal. The headline numbers — NVIDIA Blackwell topping the chart, AMD closing in, a doubling of cloud entries — describe where training cost, chip procurement, and deployment patterns are heading over the next year.

MLCommons runs MLPerf Training every six months as a neutral, audited measure of how long real models take to train on real hardware. This round drew 95 unique system submissions from 24 organizations, spanning 13 different accelerators and 19 host processors — a diversity record for the suite. Crucially, it added the first 671-billion-parameter mixture-of-experts (MoE) benchmark, making the leaderboard finally test the architecture that underlies nearly every major 2025–2026 model.

This guide walks through what changed in v6.0, why the MoE addition matters, a consolidated time-to-train reference card for all seven workloads, where NVIDIA and AMD actually stand, the software compounding story most coverage skips, and a practical read for teams making hardware and cloud decisions. Several performance figures are vendor-stated; we flag them as such throughout.

Key takeaways
  1. MoE is now a first-class benchmark, not a side note. v6.0 added DeepSeek V3 (671B total / 37B active) and GPT-OSS 20B (21B total / 3.6B active), both mixture-of-experts. DeepSeek V3 is now the largest workload in the suite, reflecting the industry-wide shift to sparse models.
  2. NVIDIA Blackwell led every benchmark it entered. NVIDIA reports its GB300 NVL72 (Blackwell Ultra) posted the fastest time-to-train on all seven workloads, and was the only vendor to submit across the full suite. The fastest DeepSeek-V3 run finished in 2.02 minutes on 8,192 GPUs (CoreWeave).
  3. AMD is no longer a token alternative. AMD's MI355X came within 5% of NVIDIA's B200 on Llama 2-70B fine-tuning and within 6% on Llama 3.1-8B pre-training (vendor-stated, both using FP4), and made its first-ever multi-node MLPerf Training submission.
  4. Software is compounding faster than hardware cycles. NVIDIA reports its per-GPU DeepSeek-V3 throughput on Blackwell rose roughly 1.3x in about six months from software optimization alone, with no hardware change. That means already-deployed systems keep getting faster between upgrade cycles.
  5. Training-as-a-service is now a leaderboard category. Cloud submissions doubled versus v5.1 six months earlier, 60% of all submissions were multi-node, and four first-time submitters appeared. How teams procure training compute is shifting structurally toward the cloud.

01 · What Changed · A diversity record and a bigger benchmark.

MLPerf Training is the benchmark hardware vendors actually compete on — a fixed set of models trained to a defined quality target, with time-to-train measured under audited rules. Reading it well means separating the genuinely new signals from the marketing. Three things define v6.0.

Record participation. The round collected 95 unique system submissions from 24 organizations, using 13 different hardware accelerators and 19 different host processors — MLCommons describes it as a new diversity record for the suite. Four first-time submitters appeared: Inventec, Netweb Technologies India, TTA, and Vultr, a mix of Asian ODMs and cloud providers that signals the benchmark is broadening beyond the usual hyperscaler-and-OEM core.

Two new workloads. v6.0 added DeepSeek V3 (671B total parameters, 37B activated per token) and GPT-OSS 20B (21B total, 3.6B activated) — both mixture-of-experts architectures. DeepSeek V3 is now the largest benchmark in the suite and shares its base with DeepSeek-R1; GPT-OSS 20B was deliberately scoped to train on a single 8-GPU node, giving teams without large clusters an entry point.

A maturing precision story. Multiple FP4-precision training recipes appeared for the first time, from more than one vendor — NVIDIA’s NVFP4 and AMD’s MXFP4. These are vendor-specific FP4 formats, not interchangeable, but their arrival in an audited benchmark is a real maturation signal for lower-precision training.

  • Systems submitted (v6.0 · Jun 16, 2026): 95 across 24 organizations, spanning 13 different hardware accelerators and 19 host processors, which MLCommons calls a diversity record for the benchmark suite.
  • Largest benchmark: DeepSeek V3 MoE, now the biggest workload in the suite at 671B total parameters with 37B activated per token, the same base model that underlies DeepSeek-R1.
  • First-time submitters: 4 (Inventec, Netweb Technologies India, TTA, and Vultr), a mix of Asian ODMs and cloud providers expanding the benchmark's base.

02 · The MoE Pivot · Mixture-of-experts is the new baseline.

The most consequential change in v6.0 is not who finished fastest — it is what they were asked to train. Adding DeepSeek V3 and GPT-OSS 20B means MLPerf now measures the sparse mixture-of-experts architecture that dominates frontier model design, rather than the dense transformers that defined earlier rounds. That reframing matters: dense-architecture results (the older Llama benchmarks) are now legacy context rather than leading indicators.

MoE matters for hardware because it decouples total parameter count from per-token compute. DeepSeek V3 holds 671B parameters but activates only 37B per token; GPT-OSS 20B holds 21B but activates 3.6B. That sparsity rewards memory capacity and interconnect bandwidth — moving expert weights and routing tokens between them — as much as raw matrix-multiply throughput, which changes how chips should be compared.
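To make the sparsity concrete, the short sketch below computes what share of each model's weights is active per token, using only the total and active parameter counts quoted above. The function and dictionary names are ours, chosen for illustration.

```python
# Total and active parameter counts (in billions), as quoted in this article.
MODELS = {
    "DeepSeek V3": {"total_b": 671.0, "active_b": 37.0},
    "GPT-OSS 20B": {"total_b": 21.0, "active_b": 3.6},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a model's parameters that is activated for each token."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected total_b > 0 and 0 <= active_b <= total_b")
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of weights active per token")
```

Roughly 5.5% of DeepSeek V3's weights and 17.1% of GPT-OSS 20B's are touched per token, yet all of them must still be stored and reachable, which is why capacity and interconnect weigh so heavily in MoE comparisons.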

"Sparse computation is a dominant trend in AI right now. Over the past two years, all major new generative AI models have utilized sparse computation architecture, frequently MoE."
Shriya Rishab, MLPerf Training Working Group co-chair, MLCommons

The two additions also signal an intent to test both ends of the scale. GPT-OSS 20B was built specifically as an entry-point benchmark — trainable on a single 8-GPU node, using the same dataset as the existing Llama 3.1 8B workload. For enterprise AI teams, that is MLCommons explicitly acknowledging that meaningful training no longer happens only at hyperscaler scale, and that a credible benchmark should exist within a single-node hardware footprint. If your team is weighing where model training fits alongside hosted inference, our guide to reading AI benchmarks critically is a useful companion to this round.

03 · Reference Card · Time-to-train across all seven workloads.

The complete v6.0 suite comprises seven workloads. The table below consolidates the fastest reported time-to-train for each, the platform and GPU count behind it, and a scale tier we derive from the GPU count: single-node (≤8 GPUs), multi-node (9–999), or hyperscale (1,000+). Every time figure is NVIDIA-reported from its own submissions, so read it as vendor-stated best-case rather than a cross-vendor head-to-head.

MLPerf Training v6.0 fastest reported time-to-train by workload, with platform, GPU count, and derived scale tier. Times are NVIDIA-reported from its own submissions.
| Workload | Fastest time-to-train | Platform | GPU count | Scale tier |
| --- | --- | --- | --- | --- |
| DeepSeek-V3 (671B) | 2.02 min | GB300 NVL72 (CoreWeave) | 8,192 | Hyperscale |
| Llama 3.1 405B | 7.07 min | GB200 NVL72 (Azure) | 8,192 | Hyperscale |
| GPT-OSS 20B | 7.43 min | GB300 NVL72 | 512 | Multi-node |
| Llama 3.1 8B | 4.46 min | GB200 NVL72 | 1,024 | Hyperscale |
| Llama 2 70B LoRA | 0.4 min | GB300 NVL72 | 512 | Multi-node |
| FLUX.1 | 17.1 min | GB300 NVL72 | 512 | Multi-node |
| DLRM-dcnv2 | 0.67 min | GB300 NVL72 | 64 | Multi-node |

Two patterns are worth naming. First, the very largest models still need hyperscale clusters to hit these times — DeepSeek-V3 and Llama 3.1 405B both ran on 8,192 GPUs. Second, fine-tuning and recommendation workloads (Llama 2 70B LoRA at 0.4 minutes, DLRM-dcnv2 at 0.67) finish in well under a minute at far smaller scale, which is closer to what most enterprise teams actually run. The leaderboard's extremes get the headlines; the sub-minute multi-node rows are the realistic reference points.
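The scale tier in the reference card is our own derived label, not an MLCommons category. A minimal sketch of that rule, using the thresholds stated above (8 or fewer GPUs, 9 to 999, and 1,000 or more), looks like this; the function name and the data structure are illustrative.

```python
def scale_tier(gpu_count: int) -> str:
    """Map a GPU count to the scale tier used in the reference card."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    if gpu_count <= 8:
        return "Single-node"
    if gpu_count < 1000:
        return "Multi-node"
    return "Hyperscale"


# (workload, GPU count) pairs taken from the reference card.
REFERENCE_CARD = [
    ("DeepSeek-V3 (671B)", 8192),
    ("Llama 3.1 405B", 8192),
    ("GPT-OSS 20B", 512),
    ("Llama 3.1 8B", 1024),
    ("Llama 2 70B LoRA", 512),
    ("FLUX.1", 512),
    ("DLRM-dcnv2", 64),
]

for workload, gpus in REFERENCE_CARD:
    print(f"{workload:<20} {gpus:>6,d}  {scale_tier(gpus)}")
```

Note that none of the fastest runs lands in the single-node tier: even GPT-OSS 20B, which is scoped to train on one 8-GPU node, posted its fastest time at 512 GPUs.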

04 · NVIDIA · Blackwell Ultra tops every workload it entered.

NVIDIA reports that its Blackwell platforms posted the fastest time-to-train on every benchmark in v6.0, and that it was the only vendor to submit across all seven workloads. In the reference card above, GB300 NVL72 (Blackwell Ultra) holds five of the seven fastest times and GB200 NVL72 holds the other two (Llama 3.1 405B and Llama 3.1 8B). The coverage point matters more than the speed point: because NVIDIA was the sole full-suite submitter, "fastest on every benchmark" is partly a statement about coverage. AMD, for instance, did not submit to DeepSeek-V3 or GPT-OSS 20B, so a like-for-like full-suite comparison does not exist this round.

Within NVIDIA's own results, the generational story is clean. The company reports the GB300 NVL72 delivers up to 1.6x faster training than the prior GB200 NVL72 at an equivalent GPU count, attributing the gain to higher compute density with NVFP4, expanded memory capacity, and a higher power ceiling. The speedup varies by workload: roughly 1.6x on DeepSeek-V3, 1.5x on Llama 3.1 405B, and 1.3x on GPT-OSS 20B.

GB300 NVL72 vs GB200 NVL72 generational speedup by workload, at equal GPU count. Source: NVIDIA Technical Blog (vendor-stated), Jun 16, 2026.
| Workload | Speedup |
| --- | --- |
| DeepSeek-V3 (671B) | 1.6× |
| Llama 3.1 405B | 1.5× |
| GPT-OSS 20B | 1.3× |

Read the framing carefully
NVIDIA's "fastest on every benchmark" claim is vendor-stated and shaped by the fact that it was the only vendor to enter all seven workloads. It is a real result, but it is not the same as beating every competitor head-to-head on the workloads they both ran. For the chip roadmap context behind GB300, see our coverage of NVIDIA's Blackwell Ultra announcement.

05 · AMD · The real story for anyone not buying NVIDIA.

AMD's v6.0 submission is the round's quietest important result. On the LLM workloads enterprises actually run, AMD reports its Instinct MI355X came within 5% of NVIDIA's B200 on Llama 2-70B fine-tuning and within 6% on Llama 3.1-8B pre-training, both using FP4 (AMD's MXFP4 against NVIDIA's NVFP4). Those figures are vendor-stated, though independent third-party coverage echoes them, and they are close enough to reframe AMD from token alternative to genuine procurement option.

The memory dimension sharpens the case. AMD's MI350-series parts carry 288 GB of HBM3E and, per AMD, support models up to 520 billion parameters on a single GPU, meaningfully more on-package capacity than the B200 generation. For MoE training, where holding expert weights close to compute reduces interconnect pressure, that headroom is a real consideration, not a spec-sheet footnote.
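One way to sanity-check the 520-billion-parameter figure is a weights-only footprint estimate. The sketch below assumes 4-bit weights, which is our assumption: AMD's claim as quoted here does not specify a precision. It also ignores scale factors, activations, gradients, and optimizer state, so it says nothing about whether a model of that size could be trained on one GPU.

```python
HBM_CAPACITY_GB = 288.0  # MI350-series HBM3E capacity quoted in this article


def weight_footprint_gb(params_billions: float, bits_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes), ignoring all overheads."""
    if params_billions < 0 or bits_per_param <= 0:
        raise ValueError("expected non-negative params and positive bit width")
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bits_per_param / 8


# Bit widths are illustrative assumptions, not AMD-stated configurations.
for bits in (4, 8, 16):
    size_gb = weight_footprint_gb(520.0, bits)
    verdict = "fits" if size_gb <= HBM_CAPACITY_GB else "does not fit"
    print(f"520B params at {bits}-bit: {size_gb:.0f} GB, {verdict} in {HBM_CAPACITY_GB:.0f} GB")
```

Under these assumptions only the 4-bit case (260 GB) fits in 288 GB, which suggests the headline capacity figure depends on low-precision weights.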

MLPerf Training v6.0 AMD versus NVIDIA on two core LLM workloads, showing the chips compared, the AMD-stated performance gap, and the precision formats. Gap figures are AMD-stated.
| Workload | AMD chip | NVIDIA chip | Gap (AMD-stated) | Precision |
| --- | --- | --- | --- | --- |
| Llama 2-70B fine-tuning | MI355X | B200 | Within 5% | MXFP4 vs NVFP4 |
| Llama 3.1-8B pre-training | MI355X | B200 | Within 6% | MXFP4 vs NVFP4 |

Two structural firsts round out the picture. v6.0 marked AMD's first-ever multi-node MLPerf Training submission — it entered FLUX.1 at 64 MI325X GPUs, while Oracle Cloud Infrastructure, working with AMD, submitted FLUX.1 at 512 GPUs. AMD also reports a record 10 ecosystem partners submitting results on its Instinct platforms, with partner numbers landing within 6% of AMD's official figures. A broad partner ecosystem is what turns a competitive chip into a buyable one.

The generational gain is large but needs its timeframe stated precisely. AMD reports a 3.5x improvement on Llama 2-70B training from MLPerf Training v5.0 (MI300X, MXFP8) to v6.0 (MI355X, MXFP4), but v5.0 landed a year or more before v6.0, so this is an annual-scale generational leap, not a six-month one. Within the MI350 series over the most recent seven months, AMD reports more modest gains: MI355X improved Llama 2-70B fine-tuning by 19% and Llama 3.1-8B pre-training by 13%.

Where the comparison ends
AMD did not submit to the DeepSeek-V3 or GPT-OSS 20B MoE benchmarks in v6.0, and its blog reports relative speedups rather than absolute minute figures for MI355X runs — so there is no public head-to-head on the largest MoE workloads, and the AMD-versus-NVIDIA picture is complete only on the dense LLM workloads above.

06 · Software Compounding · The story hardware vendors don't lead with.

The most under-covered result in v6.0 is not about silicon at all. NVIDIA reports that its per-GPU DeepSeek-V3 throughput on Blackwell rose from roughly 1,088 TFLOPS/GPU in November 2025 to about 1,648 TFLOPS/GPU (around 6,338 tokens/sec/GPU) in June 2026 — a 1.3x uplift in roughly six months driven purely by software optimization, with no hardware change. Re-checking the arithmetic, 1,648 ÷ 1,088 ≈ 1.51, so the "1.3x" NVIDIA cites for this comparison is best read as its own stated figure rather than a clean ratio of those two throughput numbers; either way, the direction is a substantial same-hardware gain.
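The mismatch is easy to reproduce. This sketch takes the two throughput figures and the stated multiplier exactly as quoted above and compares them; the variable and function names are ours.

```python
NOV_2025_TFLOPS = 1088.0  # per-GPU throughput NVIDIA reports for November 2025
JUN_2026_TFLOPS = 1648.0  # per-GPU throughput NVIDIA reports for June 2026
STATED_UPLIFT = 1.3       # the multiplier NVIDIA cites for the same period


def throughput_ratio(before: float, after: float) -> float:
    """Ratio of the later throughput to the earlier one."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after / before


ratio = throughput_ratio(NOV_2025_TFLOPS, JUN_2026_TFLOPS)
implied_end = NOV_2025_TFLOPS * STATED_UPLIFT

print(f"Ratio of the two quoted figures: {ratio:.2f}x")
print(f"1.3x applied to the November figure: {implied_end:.0f} TFLOPS/GPU")
```

The quoted figures imply about 1.51x, while a 1.3x uplift from the November baseline would land near 1,414 TFLOPS/GPU rather than 1,648. We cannot tell from the published material which comparison the 1.3x refers to, so both numbers are reported here as stated.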

The mechanics behind it are concrete. For GPT-OSS 20B on Blackwell, NVIDIA reports a CuTe DSL-based kernel achieving a 93% end-to-end speedup, a router optimization delivering a 5x kernel speedup, and all-to-all communication overlap reaching nearly 100% overlap for an 8% performance benefit. These are vendor-stated, but they describe a real pattern: a large share of each round's gains now comes from the software stack maturing around fixed hardware.

  • Nov 2025, starting throughput: ~1,088 TFLOPS/GPU (DeepSeek-V3 on Blackwell). This is NVIDIA's reported per-GPU DeepSeek-V3 training throughput at the prior measurement point, the baseline for the software-only improvement.
  • Jun 2026, software-optimized: ~1,648 TFLOPS/GPU, roughly 6,338 tokens/sec/GPU, reached with no hardware change. NVIDIA frames this as a ~1.3x same-hardware uplift over about six months.
  • Why it matters (procurement signal): the ceiling has not yet been reached, and buyers keep gaining. If already-deployed Blackwell systems keep getting faster from software, the performance gap to a hardware upgrade narrows, changing the upgrade calculus for existing buyers.

The procurement implication is the part most coverage skips. If a deployed system gains meaningful throughput between hardware generations purely through software, the case for rushing the next upgrade weakens — buyers of current-generation systems are still getting faster without a refresh. That is a genuinely different way to value a multi-million-dollar cluster than a static spec sheet suggests, and it argues for weighting a vendor's software cadence alongside its peak hardware numbers.

07 · Cloud · Training-as-a-service becomes a leaderboard category.

The structural shift in v6.0 is who is submitting. Cloud submissions doubled compared with v5.1 six months earlier, and 60% of all v6.0 submissions were multi-node systems. The benchmark is increasingly populated by training systems you rent rather than own — CoreWeave, Microsoft Azure, Oracle Cloud Infrastructure, Nebius, and Vultr all appear in the results.

"There are more ways of getting your AI training than ever before. Several companies now offer training systems in the cloud."
Pavan Yalamanchili, MLPerf Working Group co-chair, MLCommons

Cloud providers also surfaced useful head-to-head data. Nebius, using NVIDIA HGX B300 and GB300 NVL72, reported the GB300 NVL72 ran Llama 3.1-8B about 12% faster than the HGX B300 at an equivalent GPU count (64.35 minutes versus 72.01 on an 8-GPU node). And several NVIDIA partners reported their own gains — Cohere said it saw 3x faster training on GB200 NVL72 versus prior-generation hardware, and Nebius and Higgsfield reported a 30% reduction in training time on GB300 NVL72. All of these are self-reported partner testimony surfaced through NVIDIA's blog, not independently audited benchmark entries.
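"Percent faster" can mean two different things, so it is worth checking which one a quoted figure uses. The sketch below applies both definitions to the Nebius times quoted above; the function names are ours.

```python
HGX_B300_MINUTES = 72.01     # Llama 3.1-8B time on HGX B300, as quoted
GB300_NVL72_MINUTES = 64.35  # Llama 3.1-8B time on GB300 NVL72, as quoted


def speedup(slow_minutes: float, fast_minutes: float) -> float:
    """How many times faster the quicker system is (a throughput ratio)."""
    if fast_minutes <= 0:
        raise ValueError("fast_minutes must be positive")
    return slow_minutes / fast_minutes


def time_reduction(slow_minutes: float, fast_minutes: float) -> float:
    """Fraction of wall-clock time saved by the quicker system."""
    if slow_minutes <= 0:
        raise ValueError("slow_minutes must be positive")
    return (slow_minutes - fast_minutes) / slow_minutes


faster = speedup(HGX_B300_MINUTES, GB300_NVL72_MINUTES) - 1
saved = time_reduction(HGX_B300_MINUTES, GB300_NVL72_MINUTES)
print(f"Throughput gain: {faster:.1%}")  # 11.9%
print(f"Time saved:      {saved:.1%}")   # 10.6%
```

The "about 12% faster" figure matches the throughput definition. Measured as wall-clock time saved, the same pair of runs comes out closer to 10.6%, which is the number that matters if you are billed by the hour.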

The takeaway for buyers is that the "build versus rent" decision for training compute now has audited evidence on both sides. A doubling of cloud submissions in a single six-month cycle is a fast structural move, and it means time-to-train comparisons increasingly need a cost-and-availability column the leaderboard itself does not provide.
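A first step toward that missing cost column is converting each result into GPU-hours, which needs only the GPU count and time-to-train from the reference card. The hourly price below is a hypothetical placeholder, not a quote from any provider, and real bills also include setup, queueing, and failed runs that a benchmark clock does not capture.

```python
# (workload, GPU count, fastest time-to-train in minutes) from the reference card.
RUNS = [
    ("DeepSeek-V3 (671B)", 8192, 2.02),
    ("Llama 3.1 405B", 8192, 7.07),
    ("GPT-OSS 20B", 512, 7.43),
    ("Llama 3.1 8B", 1024, 4.46),
    ("Llama 2 70B LoRA", 512, 0.4),
    ("FLUX.1", 512, 17.1),
    ("DLRM-dcnv2", 64, 0.67),
]

# Hypothetical rental price for illustration only; substitute your own quote.
HYPOTHETICAL_USD_PER_GPU_HOUR = 3.00


def gpu_hours(gpu_count: int, minutes: float) -> float:
    """Aggregate GPU-hours consumed by one timed run."""
    if gpu_count < 1 or minutes < 0:
        raise ValueError("expected gpu_count >= 1 and minutes >= 0")
    return gpu_count * minutes / 60


def run_cost_usd(gpu_count: int, minutes: float, usd_per_gpu_hour: float) -> float:
    """Cost of one timed run at a flat per-GPU-hour price."""
    return gpu_hours(gpu_count, minutes) * usd_per_gpu_hour


for workload, gpus, minutes in RUNS:
    hours = gpu_hours(gpus, minutes)
    cost = run_cost_usd(gpus, minutes, HYPOTHETICAL_USD_PER_GPU_HOUR)
    print(f"{workload:<20} {hours:>8.1f} GPU-h  ~${cost:,.0f} at the placeholder rate")
```

Seen this way, the 2.02-minute DeepSeek-V3 run consumes roughly 276 GPU-hours, while the sub-minute Llama 2 70B LoRA run uses under 4. Time-to-train alone hides that two-orders-of-magnitude difference in resource use.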

08 · What To Do · How teams should read this round.

For most teams the right use of MLPerf Training v6.0 is as a decision input, not a verdict. Match the signal to your situation.

  • Frontier-scale pretraining (largest MoE training runs): default to NVIDIA GB300. NVIDIA's full-suite coverage and GB300 generational gains make it the safe default for the very largest runs. The 8,192-GPU DeepSeek-V3 and Llama 405B results are hyperscale-only, so plan for cloud or a major cluster.
  • Fine-tuning & mid-scale (memory-bound workloads): seriously evaluate AMD. AMD's MI355X landing within 5–6% of B200 on the workloads enterprises run, plus 288 GB of HBM3E, makes it a serious procurement option, especially where on-package memory headroom matters. Benchmark on your own models first.
  • Existing deployments (already own Blackwell): hold and re-measure. The software-only throughput gains argue against rushing an upgrade. Track your vendor's software cadence and re-measure your own workloads before committing capital to the next generation.
  • No large cluster (cloud or single-node training): rent, don't buy. Cloud submissions doubled and GPT-OSS 20B is single-node by design. Rent capacity from the providers on the leaderboard and add a cost-and-availability column the benchmark omits.

Across all four paths, one discipline applies: treat vendor-stated figures as starting hypotheses and run your own evaluation on the models and data you actually train. A 5% benchmark gap can invert on a different workload, a different batch size, or a different interconnect. If you want help turning a benchmark round like this into a concrete hardware-and-cloud decision, our AI and digital transformation engagements start with exactly this kind of comparative evaluation, and it pairs naturally with the inference-side view in our AI inference and latency benchmarks guide.

09 · Conclusion · A benchmark that reads as a market map.

The shape of AI training, mid-2026

The 2026 hardware race is about memory, software, and where you rent — not just peak speed.

MLPerf Training v6.0 is most valuable read as a forward indicator. The arrival of a 671B MoE benchmark confirms that sparse models are now the default thing worth measuring. NVIDIA Blackwell tops every workload it entered, but its full-suite sweep is partly a coverage story, and AMD's 5–6% gap on the workloads enterprises actually run, paired with more on-package memory, makes the competitive field real for the first time in a while.

The quieter signals matter more for budgets. Software optimization is delivering same-hardware throughput gains between generations, which changes the upgrade calculus for anyone who already owns current silicon. And a doubling of cloud submissions in six months says the procurement question is shifting from which chip to buy toward whether to buy at all.

For teams making 2026 training decisions, the practical move is the same one the benchmark itself encourages: take the audited numbers as inputs, treat vendor framing skeptically, and run your own evaluation on the models and data you care about. The leaderboard tells you where the field is heading. It does not tell you which row is right for your workload — that is still a decision you have to make on your own hardware.

FAQ · MLPerf Training v6.0

The questions teams ask about this round.

MLPerf Training is the industry-standard benchmark for measuring how long real machine-learning models take to train on real hardware, run by the MLCommons consortium roughly every six months under audited rules. Version 6.0 results were published on June 16, 2026. This round drew 95 unique system submissions from 24 organizations, using 13 different hardware accelerators and 19 host processors — a diversity record for the suite. The big change in v6.0 was the addition of the first 671-billion-parameter mixture-of-experts benchmark, DeepSeek V3, alongside a smaller entry-point benchmark, GPT-OSS 20B.