AI Development · New Release · 11 min read · Published June 13, 2026

First open-weight text diffusion LLM · 26B MoE · ~4x faster, lower quality

Google DiffusionGemma: First Open-Weight Text Diffusion

Google DeepMind shipped DiffusionGemma on June 10, 2026: a 26B mixture-of-experts model under Apache 2.0 that generates text by denoising whole blocks in parallel instead of one token at a time. The vendor-stated headline is up to 4x faster than Gemma 4 on a single H100. The honest catch: lower benchmark quality almost everywhere. This guide maps which workloads it actually earns a place in.

Digital Applied Team · Senior strategists
Sources: Google blog + model card
H100 throughput (FP8): 1,100+ tok/s (vendor-stated, low-concurrency only)
Active parameters: 3.8B of 25.2B total (MoE)
Fits in VRAM (NVFP4): 18GB, single RTX 5090
MMLU Pro vs Gemma 4: 77.6 vs 82.6 (vendor-stated), −5.0 pts

Google DiffusionGemma is Google DeepMind’s first open-weight text diffusion model, released on June 10, 2026 under an Apache 2.0 license. It is a 26B mixture-of-experts model — roughly 3.8B active parameters per pass — that abandons left-to-right token generation for a parallel, block-by-block denoising process, reaching a vendor-stated 1,100-plus tokens per second on a single NVIDIA H100.

The reason this matters is the architecture, not the leaderboard. Autoregressive models — every GPT-style LLM you have used — emit one token at a time, bottlenecked by memory bandwidth. DiffusionGemma instead starts each 256-token block as random placeholders and iteratively refines the whole block in parallel, saturating compute instead of waiting on sequential memory reads. That is what produces the speed. It is also why the model can look back and forward inside a block, fixing its own mistakes mid-generation.

This guide covers what actually shipped, how discrete text diffusion works in plain terms, the architectural tradeoff against autoregressive models, the honest benchmark picture — Google itself recommends Gemma 4 when quality matters — and a workload routing matrix so you know exactly which tasks belong on DiffusionGemma. Every figure is sourced from Google’s announcement, developer docs, and the official model card, with vendor-stated numbers labelled as such.

Key takeaways
  1. First open-weight text diffusion LLM from a tier-one lab. Released June 10, 2026 under Apache 2.0, DiffusionGemma is a 26B MoE (~3.8B active) built on the Gemma 4 26B-A4B backbone with a diffusion head. It is the open-weight counterpart to closed commercial diffusion models.
  2. Speed is the headline, and it is vendor-stated. Google reports up to 4x faster than Gemma 4 26B and 1,000–1,100+ tokens/sec on one H100 (FP8). Treat these as vendor figures for local, low-concurrency inference, not universal throughput guarantees.
  3. It trades quality for speed, by Google's own framing. DiffusionGemma trails Gemma 4 on almost every published benchmark, including MMLU Pro (77.6 vs 82.6) and AIME 2026 (69.1 vs 88.3). Google labels it experimental and recommends Gemma 4 where maximum quality is required.
  4. Document parsing is the one clear win. On OmniDocBench 1.5, DiffusionGemma leads Gemma 4. Bidirectional attention during denoising gives it a structural edge on OCR and layout-aware extraction, which makes this the workload to actually route to it.
  5. It fits on a single RTX 5090. An NVIDIA-quantized NVFP4 build runs within ~18GB VRAM at a vendor-stated 700+ tokens/sec. That is a different deployment story from most 26B models, which expect A100/H100-class hardware.

01 · What Shipped
An open-weight model on every major surface on day one.

DiffusionGemma launched as google/diffusiongemma-26B-A4B-it on Hugging Face under Apache 2.0, with same-day availability on Kaggle, Google Cloud Vertex AI Model Garden (corroborated via secondary coverage rather than the primary release notes), and NVIDIA NIM. The named research scientists are Brendan O’Donoghue and Sebastian Flennerhag at Google DeepMind. There is no dedicated DiffusionGemma arXiv preprint as of this writing — the canonical references are the Google blog and the official model card, with the block-diffusion technique itself rooted in the BD3-LMs paper (arXiv:2503.09573), an ICLR 2025 Oral.

Under the hood it is a 25.2B-parameter mixture-of-experts with 30 transformer layers, 8 active experts of 128 total plus 1 shared expert, and roughly 3.8B active parameters per forward pass. It carries a 262,144-token vocabulary, a 256K-token context window, and an approximately 550M-parameter vision encoder. It accepts text, images, and video up to 60 seconds at 1fps; it does not accept audio input and generates text only. The training data spans web documents in 140-plus languages, code, mathematics, and images, with a January 2025 cutoff — so its world knowledge is over a year stale at launch.
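To make the expert counts concrete, here is a minimal sketch of top-k routing with 8 of 128 routed experts, plus the active-parameter share implied by the figures above. The router logits are random stand-ins, `route_token` is a hypothetical helper, and the softmax-over-selected-logits weighting is one common convention assumed for illustration; it is not a description of DiffusionGemma's actual router.

```python
import math
import random

NUM_EXPERTS = 128       # routed experts (figure quoted above)
TOP_K = 8               # routed experts active per token (figure quoted above)
TOTAL_PARAMS_B = 25.2   # total parameters, in billions
ACTIVE_PARAMS_B = 3.8   # approximate active parameters per pass, in billions

def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k routed experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
chosen = route_token(logits)

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"routed experts used: {len(chosen)} of {NUM_EXPERTS} (+1 shared, always on)")
print(f"active parameter share: {active_fraction:.1%}")
```

The point of the arithmetic is that only about 15% of the weights participate in any one pass, which is why a 25.2B model can behave like a much smaller one at inference time.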

Open weights
DiffusionGemma 26B
25.2B total · 3.8B active · Apache 2.0

26B MoE on the Gemma 4 26B-A4B backbone with an integrated diffusion head. 30 layers, 8/128 experts + 1 shared, 256K context, 262,144 vocab. The first open-weight text diffusion model from a tier-one lab.

huggingface.co/google/diffusiongemma-26B-A4B-it
NVIDIA NVFP4
Quantized for the desktop
~18GB VRAM · 700+ tok/s (vendor-stated)

An NVFP4-quantized variant fits inside 18GB, enabling deployment on a single RTX 5090. FP8 needs ~28GB; BF16 full precision is 50GB+ and effectively multi-GPU.

huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4
Release snapshot
DiffusionGemma launched June 10, 2026 from Google DeepMind under Apache 2.0, with day-zero support in vLLM (the first diffusion LLM natively supported there), Hugging Face Transformers, MLX, and SGLang, plus Unsloth and NVIDIA NeMo for fine-tuning. One integration gotcha: in Transformers you must use the DiffusionGemmaForBlockDiffusion class — AutoModelForCausalLM will not work.

02 · The Mechanism
From sequential typewriter to a printing press.

A standard language model is a typewriter: it predicts one token, commits it, then predicts the next conditioned on everything written so far. DiffusionGemma works the other way. It begins each 256-token block — Google calls it a “canvas” — as random placeholder tokens, then iteratively denoises the whole canvas at once, locking in the tokens it is most confident about and using them as context to refine the rest. When the block converges, it commits that block to the KV cache and starts the next canvas. This is the block-autoregressive interpolation between pure autoregression and pure diffusion that BD3-LMs introduced.
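The canvas loop described above can be sketched as a toy in a few lines. Everything here is an assumption made for illustration: `toy_predict` stands in for the real denoiser by reading from a known target and drawing a random confidence, `denoise_canvas` is a hypothetical name, and the fixed 16 commits per pass is a simplification. It shows only the control flow (propose for every open slot, lock in the most confident, repeat), not the model's actual sampler.

```python
import random

MASK = None  # placeholder for a position that has not been committed yet

def toy_predict(canvas, target, rng):
    """Stand-in for the denoiser: propose a token and a confidence per open slot.

    A real model would run one bidirectional forward pass over the canvas;
    this toy reads the answer from `target` and invents a confidence score.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(canvas) if tok is MASK}

def denoise_canvas(target, commit_per_step=16, max_steps=48, seed=0):
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    steps = 0
    while any(tok is MASK for tok in canvas) and steps < max_steps:
        proposals = toy_predict(canvas, target, rng)
        # Lock in the most confident proposals; the rest stay open for the next pass.
        best = sorted(proposals, key=lambda i: proposals[i][1],
                      reverse=True)[:commit_per_step]
        for i in best:
            canvas[i] = proposals[i][0]
        steps += 1
    return canvas, steps

target = [f"tok{i}" for i in range(256)]  # one 256-token canvas
canvas, steps = denoise_canvas(target)
print(f"canvas filled in {steps} passes")
```

Note that positions are committed in confidence order rather than left to right, which is the property that separates this loop from autoregressive decoding.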

The recommended inference recipe is concrete: up to 48 denoising steps per canvas, a linear temperature schedule decaying from 0.8 to 0.4, and an entropy threshold of 0.005 for adaptive early stopping so easy blocks finish in fewer steps. Each forward pass refines the full canvas and commits roughly 15–20 tokens; at low batch sizes those passes compound into the headline throughput. Two attention regimes are in play: causal attention during the prefill stage that encodes your prompt into the KV cache, then bidirectional attention during denoising — and that bidirectionality is precisely what lets the canvas self-correct, because every position can see both directions.
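The recipe numbers can be written down as a small sketch. The constants come from the paragraph above; the helper names are hypothetical, and two details are assumptions because the source does not spell them out: that the linear schedule hits 0.8 on the first step and 0.4 on the last, and that early stopping compares the mean per-position entropy (in nats) against the threshold.

```python
import math

MAX_STEPS = 48
T_START, T_END = 0.8, 0.4
ENTROPY_THRESHOLD = 0.005
CANVAS = 256

def temperature_at(step, max_steps=MAX_STEPS):
    """Linear decay from T_START at step 0 to T_END at the final step (assumed endpoints)."""
    frac = step / (max_steps - 1)
    return T_START + (T_END - T_START) * frac

def entropy(probs):
    """Shannon entropy in nats of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_stop(per_position_probs, threshold=ENTROPY_THRESHOLD):
    """Assumed rule: stop once mean entropy over open positions drops below threshold."""
    if not per_position_probs:
        return True
    mean_h = sum(entropy(p) for p in per_position_probs) / len(per_position_probs)
    return mean_h < threshold

def passes_needed(commit_per_pass, canvas=CANVAS):
    """Passes required to fill one canvas at a fixed commit rate."""
    return math.ceil(canvas / commit_per_pass)

print([round(temperature_at(s), 3) for s in (0, 24, 47)])
print(passes_needed(20), passes_needed(15))
```

At 15–20 committed tokens per pass, a 256-token canvas fills in roughly 13 to 18 passes, well under the 48-step ceiling, which is consistent with the early-stopping behaviour the recipe describes.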

“It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.”
Brendan O’Donoghue and Sebastian Flennerhag, Research Scientists at Google DeepMind

The clearest demonstration of why iterative refinement is more than a speed trick is a Sudoku fine-tuning demo Google published. The base model solved essentially 0% of Sudoku puzzles; after fine-tuning, it reached an 80% success rate and solved puzzles in 12 denoising steps versus 48 for the base model. Constrained problems that stump a left-to-right model — where an early wrong digit poisons everything downstream — suit a model that can revisit and rewrite the whole grid as a unit. Built-in thinking mode and function calling are both supported, though each adds overhead worth profiling before you enable it in a throughput-sensitive pipeline.
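If you want to reproduce a success-rate figure like the one above on your own fine-tune, you need a strict checker for the model's output. The sketch below is one way to score completed grids; `is_valid_solution` and `success_rate` are hypothetical helpers, and the example grids are synthetic, not taken from Google's demo.

```python
def is_valid_solution(grid, puzzle):
    """True if grid is a complete, legal 9x9 Sudoku that keeps the puzzle's givens (0 = blank)."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(9)} for c in range(9)]
    boxes = [{grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    if any(unit != digits for unit in rows + cols + boxes):
        return False
    return all(puzzle[r][c] in (0, grid[r][c]) for r in range(9) for c in range(9))

def success_rate(pairs):
    """Fraction of (puzzle, model_output) pairs that are valid solutions."""
    return sum(is_valid_solution(out, puz) for puz, out in pairs) / len(pairs)

# Synthetic example: a known-valid grid, a puzzle derived from it, and a corrupted copy.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [[v if (r + c) % 2 == 0 else 0 for c, v in enumerate(row)]
          for r, row in enumerate(solved)]
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]  # breaks two columns

print(success_rate([(puzzle, solved), (puzzle, broken)]))
```

Checking rows, columns, boxes, and the givens together matters: a grid can satisfy every row constraint and still be wrong, which is exactly the kind of early error that compounds in left-to-right generation.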

03 · Architecture Tradeoff
Autoregressive versus diffusion, line by line.

The two paradigms differ on more than speed. The table below operationalizes the tradeoffs so an engineering decision-maker can reason about where each one belongs, rather than treating “diffusion = faster” as a blanket truth.

Architectural comparison of autoregressive (GPT-style) generation versus discrete text diffusion as implemented in DiffusionGemma, across generation mechanism, attention, bottleneck, and latency behaviour. Sources: Google Developers Blog, the official model card, and independent coverage, retrieved June 13, 2026.
| Dimension | Autoregressive (GPT-style) | Text diffusion (DiffusionGemma) |
| --- | --- | --- |
| Generation mechanism | One token at a time, left to right | Denoises a 256-token canvas in parallel, block by block |
| Attention during generation | Causal (look back only) | Bidirectional within the canvas (look back and forward) |
| Computational bottleneck | Memory bandwidth: sequential KV reads per token | Compute: saturates tensor cores with parallel matmuls |
| Self-correction mid-generation | No: a committed token cannot be revised | Yes: the canvas is re-refined until it converges |
| VRAM profile (comparable size) | Standard for a 26B MoE | ~18GB (NVFP4) / ~28GB (FP8) / 50GB+ (BF16) |
| Best latency scenario | High-QPS cloud serving (batching saturates compute) | Local, single-user, low-concurrency inference |
| Worst latency scenario | Single-user local decode (memory-bound) | High-QPS cloud serving (advantage shrinks or inverts) |
The concurrency trap
The single most under-reported caveat: Google’s speed advantage is for local, low-concurrency inference. In high-QPS cloud serving, an autoregressive model can batch many requests to saturate compute, potentially making DiffusionGemma’s parallel decoding more costly per token. If you are serving 50 concurrent users, model the cost before assuming the 4x carries over — it may shrink or disappear.

04 · The Speed Claim
The 4x figure, with the asterisks attached.

Google’s headline is up to 4x faster than Gemma 4 26B-A4B at a matched model size, and roughly 2.25x faster than Gemma 4 12B with speculative decoding enabled. Reported throughput is 1,000–1,100-plus tokens per second on an H100 in FP8, and 700-plus tokens per second on an RTX 5090 with the NVFP4 quantized build. Every one of these is vendor-stated at low batch sizes, and Google is explicit that they describe local, low-concurrency inference. Real-world throughput varies with concurrency and quantization.

DiffusionGemma throughput and speed multipliers · vendor-stated

Source: Google blog and developer docs (throughput figures vendor-stated); speed-vs-12B per The Register
  • DiffusionGemma · H100 FP8 (vendor-stated, low batch size): 1,100+ tok/s
  • DiffusionGemma · RTX 5090 NVFP4 (vendor-stated, within 18GB VRAM): 700+ tok/s
  • Speed vs Gemma 4 26B (matched size, vendor-stated): up to 4x
  • Speed vs Gemma 4 12B + spec. decode (per independent coverage): ~2.25x

The mechanism behind the gain is worth understanding, because it tells you when the speed is real. Autoregressive decoding is memory-bandwidth bound: each new token requires a sequential read of the growing KV cache, and the accelerator spends most of its time waiting on memory rather than computing. DiffusionGemma moves that bottleneck onto compute — denoising a whole canvas is a dense batch of matrix operations that keeps the tensor cores busy. When you are the only user and the GPU would otherwise sit idle between sequential reads, that is a large, real win. When the GPU is already saturated by batched cloud traffic, there is less idle time to reclaim, which is exactly why the advantage is local-inference-shaped.
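The memory-bound versus compute-bound argument can be made tangible with a deliberately crude roofline toy. Every constant below is an illustrative assumption: the bandwidth and FLOP/s values are round numbers for a data-centre-class GPU, not measured specs, and the model ignores attention, KV-cache traffic, expert routing, and kernel overheads. The absolute ratios it prints are therefore not predictions and will not match the vendor's 4x figure; only the direction of the trend as batch size grows is the point.

```python
MEM_BW = 3.0e12        # bytes/s, illustrative round number (assumption)
PEAK_FLOPS = 2.0e15    # FLOP/s, illustrative round number (assumption)
ACTIVE_PARAMS = 3.8e9  # active parameters per pass (figure quoted in this article)
BYTES_PER_PARAM = 1    # FP8 weights
CANVAS = 256           # positions processed per diffusion pass
COMMIT_PER_PASS = 16   # tokens committed per pass (within the quoted 15-20 range)

def step_time(positions):
    """Time for one forward pass: whichever of weight reads or matmuls is slower."""
    mem_time = ACTIVE_PARAMS * BYTES_PER_PARAM / MEM_BW
    compute_time = 2 * ACTIVE_PARAMS * positions / PEAK_FLOPS
    return max(mem_time, compute_time)

def ar_tokens_per_s(batch):
    # Autoregressive: one position per request per pass, one token committed each.
    return batch / step_time(batch)

def diffusion_tokens_per_s(batch):
    # Diffusion: a full canvas of positions per request, but only some tokens commit.
    return batch * COMMIT_PER_PASS / step_time(batch * CANVAS)

def speedup(batch):
    return diffusion_tokens_per_s(batch) / ar_tokens_per_s(batch)

for batch in (1, 8, 64):
    print(f"batch {batch:>3}: diffusion / autoregressive = {speedup(batch):.2f}x")
```

In this toy the diffusion pass is memory-bound at batch 1, so the extra committed tokens are nearly free; once batching makes both models compute-bound, the diffusion model pays for 256 positions to commit about 16 tokens and the ratio falls below 1.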

05 · Benchmarks
Where DiffusionGemma wins and where it loses.

Most coverage stops at “it is faster but worse.” The more useful view is benchmark by benchmark, because the one place it wins tells you the workload to route to it. The table below is the vendor-stated comparison against Gemma 4 26B-A4B from the official model card (Codeforces ELO and OmniDocBench corroborated via independent coverage). On the seven percentage benchmarks and on Codeforces ELO, higher is better and Gemma 4 leads; on OmniDocBench 1.5, the document-parsing benchmark, the order flips.

Vendor-stated benchmark comparison of DiffusionGemma 26B versus Gemma 4 26B-A4B across nine benchmarks, with the gap and the direction of each metric. DiffusionGemma trails on eight and leads only on OmniDocBench 1.5 document parsing. Source: Google AI model card, with Codeforces and OmniDocBench corroborated by independent coverage, retrieved June 13, 2026.
| Benchmark | Direction · focus | DiffusionGemma 26B | Gemma 4 26B | Gap |
| --- | --- | --- | --- | --- |
| Gemma 4 leads: knowledge, reasoning, code | | | | |
| MMLU Pro | Higher is better · general knowledge | 77.6% | 82.6% | −5.0 pts |
| GPQA Diamond | Higher is better · graduate-level science | 73.2% | 82.3% | −9.1 pts |
| AIME 2026 (no tools) | Higher is better · competition math | 69.1% | 88.3% | −19.2 pts |
| LiveCodeBench v6 | Higher is better · coding | 69.1% | 77.1% | −8.0 pts |
| BigBench Extra Hard | Higher is better · hard reasoning | 47.6% | 64.8% | −17.2 pts |
| MMMU Pro (Vision) | Higher is better · multimodal | 54.3% | 73.8% | −19.5 pts |
| MRCR v2 · 8-needle · 128k | Higher is better · long-context retrieval | 32.0% | 44.1% | −12.1 pts |
| Codeforces ELO | Higher is better · competitive programming | 1429 | 1718 | −289 ELO |
| DiffusionGemma leads: document parsing | | | | |
| OmniDocBench 1.5 | Document parsing · DiffusionGemma's one win | 0.319 | 0.149 | DiffusionGemma leads |

The pattern is unambiguous and Google does not hide it: the model trails Gemma 4 across general knowledge, science, competition math, coding, hard reasoning, vision, and long-context retrieval, with the widest gaps on MMMU Pro vision (−19.5 points), AIME 2026 (−19.2 points), and a 289-point Codeforces ELO deficit. The one place the order flips is OmniDocBench 1.5, where bidirectional attention gives it a structural advantage on OCR and layout-aware extraction. This is not a model that is “almost as good and much faster” — it is a model with one genuine strength and a real quality cost everywhere else. Independent coverage notes the same trend held for earlier diffusion LLMs, so the trade-off looks like a property of the paradigm rather than a one-off.
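The gaps are easy to recompute from the table, which is a useful sanity check when you paste these numbers into your own evaluation notes. The sketch below uses only the vendor-stated figures quoted in this section; the dictionary layout and variable names are illustrative choices.

```python
# (DiffusionGemma 26B, Gemma 4 26B) scores in percent, as quoted in the table above.
PERCENT_BENCHMARKS = {
    "MMLU Pro": (77.6, 82.6),
    "GPQA Diamond": (73.2, 82.3),
    "AIME 2026 (no tools)": (69.1, 88.3),
    "LiveCodeBench v6": (69.1, 77.1),
    "BigBench Extra Hard": (47.6, 64.8),
    "MMMU Pro (Vision)": (54.3, 73.8),
    "MRCR v2 8-needle 128k": (32.0, 44.1),
}
CODEFORCES_ELO = (1429, 1718)

# Negative gap means DiffusionGemma trails Gemma 4.
gaps = {name: round(dg - g4, 1) for name, (dg, g4) in PERCENT_BENCHMARKS.items()}
widest = sorted(gaps, key=gaps.get)[:3]
codeforces_gap = CODEFORCES_ELO[0] - CODEFORCES_ELO[1]

for name in sorted(gaps, key=gaps.get):
    print(f"{name:<24} {gaps[name]:+.1f} pts")
print(f"{'Codeforces ELO':<24} {codeforces_gap:+d} ELO")
```

Sorting by gap rather than by benchmark name makes the routing signal obvious: the largest deficits cluster on vision, competition math, and hard reasoning.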

Read the numbers carefully
OmniDocBench 1.5 is the lone benchmark where DiffusionGemma leads Gemma 4. Everywhere else, Gemma 4 wins — and the two MMLU Pro figures (77.6 vs 82.6) are close enough to swap by accident. Google’s own guidance settles the ambiguity: this is an experimental model, and for applications that demand maximum quality, deploy Gemma 4.

06 · Run It Locally
The deployment story that fits on a single GPU.

The most interesting practical fact is the hardware envelope. The NVFP4-quantized build fits within roughly 18GB of VRAM, which puts a 26B-class model on a single consumer RTX 5090 — a fundamentally different deployment story from most 26B models that expect A100/H100-class hardware. FP8 needs about 28GB; BF16 full precision is 50GB-plus and effectively multi-GPU. For teams that care about local, private inference, the question shifts from “can we afford the cluster” to “does a single desktop GPU clear the bar.”
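A quick back-of-envelope check shows where those footprints come from. The sketch below multiplies the 25.2B total parameter count by the bits per weight for each format and compares the result with the quoted VRAM figures. Treating the difference as headroom for quantization scales, any layers kept at higher precision, the vision encoder, KV cache, and activations is an assumption; the source does not break the footprint down.

```python
TOTAL_PARAMS = 25.2e9  # total parameters (figure quoted in this article)
BITS_PER_WEIGHT = {"NVFP4": 4, "FP8": 8, "BF16": 16}
QUOTED_VRAM_GB = {"NVFP4": 18, "FP8": 28, "BF16": 50}  # approximate figures quoted above

def weight_gb(bits):
    """Raw weight storage in decimal gigabytes, ignoring all runtime overhead."""
    return TOTAL_PARAMS * bits / 8 / 1e9

for fmt, bits in BITS_PER_WEIGHT.items():
    raw = weight_gb(bits)
    quoted = QUOTED_VRAM_GB[fmt]
    print(f"{fmt:<6} raw weights {raw:5.1f} GB | quoted ~{quoted} GB | difference {quoted - raw:+5.1f} GB")
```

The BF16 line is the instructive one: raw weights alone already exceed the quoted 50GB floor, which is why the article describes full precision as 50GB-plus and effectively multi-GPU.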

Day-zero frameworks · Serving stacks supported: 4
vLLM (the first diffusion LLM natively supported there), Hugging Face Transformers, MLX, and SGLang at launch, plus Unsloth and NeMo for fine-tuning. Use the DiffusionGemmaForBlockDiffusion class in Transformers; AutoModelForCausalLM will not load it.

Single-GPU footprint · VRAM, NVFP4 quantized: 18GB
Runs on one RTX 5090 at a vendor-stated 700+ tokens/sec. FP8 ~28GB, BF16 50GB+. A genuinely different cost curve from cluster-bound 26B models.

Inference recipe · Max denoising steps per canvas: 48
Linear temperature 0.8 → 0.4, entropy early-stop at 0.005, 256-token canvas. Each pass commits ~15–20 tokens. Watch for incoherence at canvas boundaries on very long structured outputs.

One known limitation deserves a flag for long-form work: because each canvas denoises semi-independently, very long structured outputs can show incoherence at the 256-token canvas boundaries. For document-length structured generation, validate the seams. If you are sizing the speed against your existing stack, our roundup of AI model latency benchmarks puts figures like 1,100 tokens per second in context against mainstream autoregressive throughput.
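One lightweight way to validate the seams is to pull a small window of tokens around every canvas boundary and inspect or lint just those windows. The sketch below is a hypothetical helper, and it assumes the generated output starts at the beginning of a canvas so that boundaries fall at multiples of 256 generated tokens; check that against your serving stack before relying on it.

```python
CANVAS = 256

def seam_windows(tokens, canvas=CANVAS, radius=8):
    """Return (boundary_index, surrounding_tokens) for each interior canvas seam.

    Assumption: generation starts on a canvas boundary, so seams sit at
    multiples of `canvas` within the generated token list.
    """
    return [(b, tokens[max(0, b - radius): b + radius])
            for b in range(canvas, len(tokens), canvas)]

tokens = list(range(600))  # stand-in for 600 generated token ids
for boundary, window in seam_windows(tokens):
    print(boundary, window[0], window[-1])
```

For structured formats, decoding each window and running it through the same validator you use on the full output (a JSON or table-row check, for instance) concentrates review effort on the places the limitation is most likely to show.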

07 · Routing Matrix
Which workloads to route where.

The benchmark profile resolves cleanly into a routing decision. Send DiffusionGemma the workloads that reward its speed and bidirectional attention; keep everything quality-sensitive on Gemma 4. The four cases below are the ones where the choice is non-obvious.

Document parsing & OCR (layout-aware extraction) · Verdict: pick DiffusionGemma
DiffusionGemma's one clear win is OmniDocBench 1.5, where bidirectional attention helps it read structure. Document extraction and dense parsing are the workloads to actively route here.

Real-time local chat (single-user streaming) · Verdict: pick DiffusionGemma
Low-concurrency, latency-sensitive local inference is exactly the scenario the speed advantage is built for. On one RTX 5090 you get a vendor-stated 700+ tokens/sec inside 18GB, fast and private.

Math, code & reasoning (AIME, LiveCodeBench, Codeforces) · Verdict: stay with Gemma 4
It trails Gemma 4 hard here: AIME 2026 −19.2 points, a 289-point Codeforces ELO gap. For multi-step math, competitive programming, or anything where a wrong step compounds, stay on Gemma 4.

High-QPS cloud serving (50+ concurrent users) · Verdict: default to AR (Gemma 4)
The 4x is a local, low-concurrency figure. Under heavy batched load an autoregressive model saturates compute efficiently and the diffusion advantage shrinks or inverts. Benchmark before committing.
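The four cases above translate directly into a small routing function. This is a sketch, not a recommendation engine: the task labels, the `route` function, and the 50-user cutoff as a hard threshold are all hypothetical, and the rule order (quality-sensitive work first, then document parsing, then concurrency) is one reasonable reading of the matrix rather than anything Google specifies.

```python
from enum import Enum

class Model(Enum):
    DIFFUSION_GEMMA = "DiffusionGemma"
    GEMMA_4 = "Gemma 4"

QUALITY_SENSITIVE = {"math", "code", "reasoning"}   # hypothetical task labels
DOCUMENT_TASKS = {"document_parsing", "ocr"}
HIGH_CONCURRENCY = 50                               # assumed cutoff, from the "50+" case above

def route(task, concurrent_users=1):
    """Pick a model for a task label and an expected concurrency level."""
    if task in QUALITY_SENSITIVE:
        return Model.GEMMA_4            # wrong steps compound; quality wins
    if task in DOCUMENT_TASKS:
        return Model.DIFFUSION_GEMMA    # the one benchmark where it leads
    if concurrent_users >= HIGH_CONCURRENCY:
        return Model.GEMMA_4            # speed advantage shrinks or inverts under batching
    if task == "local_chat":
        return Model.DIFFUSION_GEMMA    # low-concurrency, latency-sensitive
    return Model.GEMMA_4                # default to the higher-quality model

for task, users in [("ocr", 1), ("local_chat", 1), ("math", 1), ("local_chat", 200)]:
    print(f"{task:<12} users={users:<4} -> {route(task, users).value}")
```

Defaulting unknown tasks to Gemma 4 mirrors the article's framing that DiffusionGemma is a routing decision for specific workloads, not a default swap.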

For the open-weight versus closed-weight framing of the broader text-diffusion landscape, the natural comparison point is Inception Labs Mercury 2, the first commercial text diffusion model. Both land near the 1,000-tokens-per-second mark, but Mercury 2 is a closed commercial API while DiffusionGemma is Apache 2.0 open weights with day-zero vLLM support. The two were not run through a shared benchmark suite, so treat the contrast as structural — open versus closed, self-hosted versus API — rather than a head-to-head quality number. The wider open-versus-closed question is covered in our open-weight vs closed-source AI models comparison.

08 · Implications
What a first open diffusion model changes.

Read narrowly, DiffusionGemma is an experimental release with one standout workload. Read as a signal, it is the moment text diffusion stopped being a closed-lab demo and became something any team can download, inspect, fine-tune, and self-host. That open-weight status is the part that compounds. Until now the fastest diffusion LLMs were commercial APIs; an Apache 2.0 model with native vLLM support and a JAX fine-tuning toolbox turns the paradigm into infrastructure researchers and product teams can build on without a vendor contract.

Projecting forward, the practical near-term value is not in replacing your default model but in routing specific workloads. Document extraction pipelines, single-user local assistants, and constrained generation tasks like the Sudoku demo are where iterative refinement and a single-GPU footprint pay off today. Expect the quality gap to narrow as the technique matures, the way it did across successive autoregressive generations, but do not bet a production pipeline on that convergence yet. The honest read for June 2026 is that diffusion is now a real, open option for a defined slice of workloads, and a poor fit for the rest. Working out where it fits in a multi-model stack, alongside the autoregressive Gemma 4 family that DiffusionGemma trades quality against for speed, is exactly the kind of comparative evaluation our AI and digital transformation engagements begin with.

09 · Conclusion
A genuine option for the right workloads.

The shape of open text diffusion, June 2026

Speed is real, the quality cost is real, and routing is the whole game.

DiffusionGemma is the first open-weight text diffusion model from a tier-one lab, and it ships exactly as advertised: a 26B mixture-of-experts that denoises whole blocks in parallel, runs on a single desktop GPU, and hits a vendor-stated 1,100-plus tokens per second on an H100. The speed is genuine for local, low-concurrency inference — and only there.

The honest framing is Google’s own: it is experimental, it trails Gemma 4 on nearly every benchmark, and for maximum quality you should still reach for Gemma 4. The exception that earns it a place in a stack is document parsing, where bidirectional attention gives it a real edge. Everything else is a routing decision, not a default swap.

The larger signal is the open-weight release itself. Text diffusion is no longer a thing you can only rent through an API — it is a thing you can download, fine-tune, and run on hardware you already own. For teams that route by task class rather than by headline, that is the development worth tracking: not “which model is smartest,” but “which generation paradigm is cheapest and fastest for the specific slice of work in front of me.”

FAQ · DiffusionGemma guide

The questions we get every week.

What is DiffusionGemma?

DiffusionGemma is Google DeepMind's first open-weight text diffusion language model, released on June 10, 2026 under an Apache 2.0 license. It is a 26B mixture-of-experts model (25.2B total parameters, roughly 3.8B active per pass) built on the Gemma 4 26B-A4B backbone with an integrated diffusion head. Instead of generating text one token at a time like a conventional autoregressive model, it denoises 256-token blocks in parallel. It launched on Hugging Face as google/diffusiongemma-26B-A4B-it, with same-day availability on Kaggle, Google Cloud Vertex AI Model Garden, and NVIDIA NIM, plus day-zero support in vLLM, Hugging Face Transformers, MLX, and SGLang.