Google DiffusionGemma is Google DeepMind’s first open-weight text diffusion model, released on June 10, 2026 under an Apache 2.0 license. It is a 26B mixture-of-experts model — roughly 3.8B active parameters per pass — that abandons left-to-right token generation for a parallel, block-by-block denoising process, reaching a vendor-stated 1,100-plus tokens per second on a single NVIDIA H100.
The reason this matters is the architecture, not the leaderboard. Autoregressive models — every GPT-style LLM you have used — emit one token at a time, bottlenecked by memory bandwidth. DiffusionGemma instead starts each 256-token block as random placeholders and iteratively refines the whole block in parallel, saturating compute instead of waiting on sequential memory reads. That is what produces the speed. It is also why the model can look back and forward inside a block, fixing its own mistakes mid-generation.
This guide covers what actually shipped, how discrete text diffusion works in plain terms, the architectural tradeoff against autoregressive models, the honest benchmark picture — Google itself recommends Gemma 4 when quality matters — and a workload routing matrix so you know exactly which tasks belong on DiffusionGemma. Every figure is sourced from Google’s announcement, developer docs, and the official model card, with vendor-stated numbers labelled as such.
- 01 First open-weight text diffusion LLM from a tier-one lab. Released June 10, 2026 under Apache 2.0, DiffusionGemma is a 26B MoE (~3.8B active) built on the Gemma 4 26B-A4B backbone with a diffusion head. It is the open-weight counterpart to closed commercial diffusion models.
- 02 Speed is the headline, and it is vendor-stated. Google reports speeds up to 4x faster than Gemma 4 26B and 1,000–1,100+ tokens/sec on one H100 (FP8). Treat these as vendor figures for local, low-concurrency inference, not universal throughput guarantees.
- 03 It trades quality for speed, by Google's own framing. DiffusionGemma trails Gemma 4 on almost every published benchmark, for example MMLU Pro 77.6 vs 82.6 and AIME 2026 69.1 vs 88.3. Google labels it experimental and recommends Gemma 4 where maximum quality is required.
- 04 Document parsing is the one clear win. On OmniDocBench 1.5, DiffusionGemma leads Gemma 4. Bidirectional attention during denoising gives it a structural edge on OCR and layout-aware extraction, which makes this the workload to actually route to it.
- 05 It fits on a single RTX 5090. An NVIDIA-quantized NVFP4 build runs within ~18GB VRAM at a vendor-stated 700+ tokens/sec. That is a different deployment story from most 26B models, which expect A100/H100-class hardware.
01 — What Shipped
An open-weight model on every major surface on day one.
DiffusionGemma launched as google/diffusiongemma-26B-A4B-it on Hugging Face under Apache 2.0, with same-day availability on Kaggle, Google Cloud Vertex AI Model Garden (corroborated via secondary coverage rather than the primary release notes), and NVIDIA NIM. The named research scientists are Brendan O’Donoghue and Sebastian Flennerhag at Google DeepMind. There is no dedicated DiffusionGemma arXiv preprint as of this writing — the canonical references are the Google blog and the official model card, with the block-diffusion technique itself rooted in the BD3-LMs paper (arXiv:2503.09573), an ICLR 2025 Oral.
Under the hood it is a 25.2B-parameter mixture-of-experts with 30 transformer layers, 8 active experts of 128 total plus 1 shared expert, and roughly 3.8B active parameters per forward pass. It carries a 262,144-token vocabulary, a 256K-token context window, and an approximately 550M-parameter vision encoder. It accepts text, images, and video up to 60 seconds at 1fps; it does not accept audio input and generates text only. The training data spans web documents in 140-plus languages, code, mathematics, and images, with a January 2025 cutoff — so its world knowledge is over a year stale at launch.
DiffusionGemma 26B
26B MoE on the Gemma 4 26B-A4B backbone with an integrated diffusion head. 30 layers, 8/128 experts + 1 shared, 256K context, 262,144 vocab. The first open-weight text diffusion model from a tier-one lab.
Quantized for the desktop
An NVFP4-quantized variant fits inside 18GB, enabling deployment on a single RTX 5090. FP8 needs ~28GB; BF16 full precision is 50GB+ and effectively multi-GPU.
Load it in Transformers with the DiffusionGemmaForBlockDiffusion class; AutoModelForCausalLM will not work.
02 — The Mechanism
From sequential typewriter to a printing press.
A standard language model is a typewriter: it predicts one token, commits it, then predicts the next conditioned on everything written so far. DiffusionGemma works the other way. It begins each 256-token block (Google calls it a “canvas”) as random placeholder tokens, then iteratively denoises the whole canvas at once, locking in the tokens it is most confident about and using them as context to refine the rest. When the block converges, it commits that block to the KV cache and starts the next canvas. This is the block-autoregressive interpolation between pure autoregression and pure diffusion that BD3-LMs introduced.
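The loop described above can be sketched as a toy simulation. This is not Google's implementation: `toy_confidences` is a hypothetical stand-in for the model's forward pass, the canvas is shortened to 16 positions for readability, and `commits_per_step` is an illustrative parameter rather than a documented setting.

```python
import random

MASK = None  # marks a canvas position that has not been committed yet


def toy_confidences(canvas, rng):
    """Hypothetical stand-in for a forward pass: proposes a (token, confidence)
    pair for every position that is still masked."""
    return {
        i: (f"tok{i}", rng.random())
        for i, tok in enumerate(canvas)
        if tok is MASK
    }


def denoise_canvas(size=16, commits_per_step=4, max_steps=48, seed=0):
    """Fill a canvas iteratively, committing the most confident proposals first."""
    rng = random.Random(seed)
    canvas = [MASK] * size
    steps = 0
    while any(tok is MASK for tok in canvas) and steps < max_steps:
        proposals = toy_confidences(canvas, rng)
        # Rank the still-masked positions by confidence, highest first.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        # Commit only the top few; the rest are proposed again on the next pass.
        # (In a real model, the committed tokens would act as context there.)
        for i in ranked[:commits_per_step]:
            canvas[i] = proposals[i][0]
        steps += 1
    return canvas, steps


canvas, steps = denoise_canvas()
print(steps, canvas[:4])
```

The point of the sketch is the control flow: every pass looks at the whole canvas, and positions are filled in confidence order rather than left to right.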
The recommended inference recipe is concrete: up to 48 denoising steps per canvas, a linear temperature schedule decaying from 0.8 to 0.4, and an entropy threshold of 0.005 for adaptive early stopping so easy blocks finish in fewer steps. Each forward pass refines the full canvas and commits roughly 15–20 tokens; at low batch sizes those passes compound into the headline throughput. Two attention regimes are in play: causal attention during the prefill stage that encodes your prompt into the KV cache, then bidirectional attention during denoising — and that bidirectionality is precisely what lets the canvas self-correct, because every position can see both directions.
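The recipe's two knobs can be written down directly. The sketch below assumes the temperature is interpolated per denoising step and that the early-stop test compares the mean per-position entropy (in nats) against the 0.005 threshold; the source states the numbers but not these details, so both choices are assumptions, and the function names are illustrative.

```python
import math


def linear_temperature(step, max_steps=48, t_start=0.8, t_end=0.4):
    """Linearly interpolate from t_start at step 0 to t_end at the last step."""
    if max_steps <= 1:
        return t_end
    frac = step / (max_steps - 1)
    return t_start + (t_end - t_start) * frac


def entropy(probs):
    """Shannon entropy (nats) of a single probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_stop(per_position_probs, threshold=0.005):
    """Assumed rule: stop once the mean entropy across positions is below threshold."""
    mean_entropy = sum(entropy(p) for p in per_position_probs) / len(per_position_probs)
    return mean_entropy < threshold


print(linear_temperature(0), linear_temperature(47))
print(should_stop([[0.5, 0.5]]), should_stop([[1.0, 0.0]] * 4))
```

A canvas whose positions are all near-certain has almost zero entropy and exits early, which is how easy blocks would finish in far fewer than 48 steps.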
“It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.”
Brendan O'Donoghue and Sebastian Flennerhag, Research Scientists at Google DeepMind
The clearest demonstration of why iterative refinement is more than a speed trick is a Sudoku fine-tuning demo Google published. The base model solved essentially 0% of Sudoku puzzles; after fine-tuning, it reached an 80% success rate and solved puzzles in 12 denoising steps versus 48 for the base model. Constrained problems that stump a left-to-right model — where an early wrong digit poisons everything downstream — suit a model that can revisit and rewrite the whole grid as a unit. Built-in thinking mode and function calling are both supported, though each adds overhead worth profiling before you enable it in a throughput-sensitive pipeline.
03 — Architecture Tradeoff
Autoregressive versus diffusion, line by line.
The two paradigms differ on more than speed. The table below operationalizes the tradeoffs so an engineering decision-maker can reason about where each one belongs, rather than treating “diffusion = faster” as a blanket truth.
| Dimension | Autoregressive (GPT-style) | Text diffusion (DiffusionGemma) |
|---|---|---|
| Generation mechanism | One token at a time, left to right | Denoises a 256-token canvas in parallel, block by block |
| Attention during generation | Causal (look back only) | Bidirectional within the canvas (look back and forward) |
| Computational bottleneck | Memory bandwidth — sequential KV reads per token | Compute — saturates tensor cores with parallel matmuls |
| Self-correction mid-generation | No — a committed token cannot be revised | Yes — the canvas is re-refined until it converges |
| VRAM profile (comparable size) | Standard for a 26B MoE | ~18GB (NVFP4) / ~28GB (FP8) / 50GB+ (BF16) |
| Best latency scenario | High-QPS cloud serving (batching saturates compute) | Local, single-user, low-concurrency inference |
| Worst latency scenario | Single-user local decode (memory-bound) | High-QPS cloud serving (advantage shrinks or inverts) |
04 — The Speed Claim
The 4x figure, with the asterisks attached.
Google’s headline is up to 4x faster than Gemma 4 26B-A4B at a matched model size, and roughly 2.25x faster than Gemma 4 12B with speculative decoding enabled. Reported throughput is 1,000–1,100-plus tokens per second on an H100 in FP8, and 700-plus tokens per second on an RTX 5090 with the NVFP4 quantized build. Every one of these is vendor-stated at low batch sizes, and Google is explicit that they describe local, low-concurrency inference. Real-world throughput varies with concurrency and quantization.
Figure: DiffusionGemma throughput and speed multipliers (vendor-stated). Source: Google blog and developer docs (throughput figures vendor-stated); speed-vs-12B per The Register.
The mechanism behind the gain is worth understanding, because it tells you when the speed is real. Autoregressive decoding is memory-bandwidth bound: each new token requires a sequential read of the growing KV cache, and the accelerator spends most of its time waiting on memory rather than computing. DiffusionGemma moves that bottleneck onto compute: denoising a whole canvas is a dense batch of matrix operations that keeps the tensor cores busy. When you are the only user and the GPU would otherwise sit idle between sequential reads, that is a large, real win. When the GPU is already saturated by batched cloud traffic, there is less idle time to reclaim, which is exactly why the advantage is local-inference-shaped.
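A quick back-of-the-envelope check ties the headline throughput to the per-pass commit rate quoted earlier. The sketch assumes the "roughly 15–20 tokens per forward pass" figure holds on average at the vendor-stated 1,100 tokens per second; the source does not say the two numbers were measured together, so treat the output as an implied budget, not a measurement.

```python
def implied_pass_budget(throughput_tok_s, tokens_per_pass, canvas_tokens=256):
    """Derive the forward-pass budget implied by a throughput and a commit rate."""
    passes_per_second = throughput_tok_s / tokens_per_pass
    return {
        "passes_per_second": passes_per_second,
        "ms_per_pass": 1000.0 / passes_per_second,
        "passes_per_canvas": canvas_tokens / tokens_per_pass,
    }


# Vendor-stated H100 (FP8) throughput, evaluated at both ends of the commit range.
for tokens_per_pass in (15, 20):
    budget = implied_pass_budget(1100, tokens_per_pass)
    print(tokens_per_pass, {k: round(v, 1) for k, v in budget.items()})
```

Under these assumptions a canvas needs about 13 to 17 passes, well under the 48-step cap, and each full-canvas pass has to finish in roughly 14 to 18 milliseconds. That is the budget batched cloud traffic would be competing for.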
05 — Benchmarks
Where DiffusionGemma wins and where it loses.
Most coverage stops at “it is faster but worse.” The more useful view is benchmark by benchmark, because the one place it wins tells you the workload to route to it. The table below is the vendor-stated comparison against Gemma 4 26B-A4B from the official model card (Codeforces ELO and OmniDocBench corroborated via independent coverage). On every percentage benchmark, higher is better and Gemma 4 leads; on OmniDocBench 1.5, document parsing, the order flips.
| Benchmark | DiffusionGemma 26B | Gemma 4 26B | Gap |
|---|---|---|---|
| Gemma 4 leads: knowledge, reasoning, code | | | |
| MMLU Pro (higher is better · general knowledge) | 77.6% | 82.6% | −5.0 pts |
| GPQA Diamond (higher is better · graduate-level science) | 73.2% | 82.3% | −9.1 pts |
| AIME 2026, no tools (higher is better · competition math) | 69.1% | 88.3% | −19.2 pts |
| LiveCodeBench v6 (higher is better · coding) | 69.1% | 77.1% | −8.0 pts |
| BigBench Extra Hard (higher is better · hard reasoning) | 47.6% | 64.8% | −17.2 pts |
| MMMU Pro, Vision (higher is better · multimodal) | 54.3% | 73.8% | −19.5 pts |
| MRCR v2 · 8-needle · 128k (higher is better · long-context retrieval) | 32.0% | 44.1% | −12.1 pts |
| Codeforces ELO (higher is better · competitive programming) | 1429 | 1718 | −289 ELO |
| DiffusionGemma leads: document parsing | | | |
| OmniDocBench 1.5 (document parsing · DiffusionGemma's one win) | 0.319 | 0.149 | DiffusionGemma leads |
The pattern is unambiguous and Google does not hide it: the model trails Gemma 4 across general knowledge, science, competition math, coding, hard reasoning, vision, and long-context retrieval, with the widest gaps on MMMU Pro vision (−19.5 points), AIME 2026 (−19.2 points), and a 289-point Codeforces ELO deficit. The one place the order flips is OmniDocBench 1.5, where bidirectional attention gives it a structural advantage on OCR and layout-aware extraction. This is not a model that is “almost as good and much faster” — it is a model with one genuine strength and a real quality cost everywhere else. Independent coverage notes the same trend held for earlier diffusion LLMs, so the trade-off looks like a property of the paradigm rather than a one-off.
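The gap column is simple subtraction, and it is worth recomputing rather than trusting. The snippet below re-derives the percentage-point gaps from the scores quoted in the table; Codeforces ELO and OmniDocBench are left out because they sit on different scales. The dictionary layout and function name are illustrative.

```python
# (DiffusionGemma, Gemma 4) scores in percent, copied from the table above.
SCORES = {
    "MMLU Pro": (77.6, 82.6),
    "GPQA Diamond": (73.2, 82.3),
    "AIME 2026": (69.1, 88.3),
    "LiveCodeBench v6": (69.1, 77.1),
    "BigBench Extra Hard": (47.6, 64.8),
    "MMMU Pro (Vision)": (54.3, 73.8),
    "MRCR v2 8-needle 128k": (32.0, 44.1),
}


def point_gaps(scores):
    """Return DiffusionGemma minus Gemma 4, in percentage points, per benchmark."""
    return {name: round(dg - g4, 1) for name, (dg, g4) in scores.items()}


gaps = point_gaps(SCORES)
# Sort so the largest deficits come first.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:24s} {gap:+.1f} pts")
```

Sorting by gap makes the shape of the deficit obvious: multimodal, competition math, and hard reasoning sit at the top, while general knowledge is the narrowest loss.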
06 — Run It Locally
The deployment story that fits on a single GPU.
The most interesting practical fact is the hardware envelope. The NVFP4-quantized build fits within roughly 18GB of VRAM, which puts a 26B-class model on a single consumer RTX 5090 — a fundamentally different deployment story from most 26B models that expect A100/H100-class hardware. FP8 needs about 28GB; BF16 full precision is 50GB-plus and effectively multi-GPU. For teams that care about local, private inference, the question shifts from “can we afford the cluster” to “does a single desktop GPU clear the bar.”
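The three VRAM figures line up with a weights-only estimate. The sketch below multiplies the 25.2B total parameter count by a nominal bit width per format; it assumes decimal gigabytes and ignores quantization scale factors, KV cache, and activations, and it is not clear from the source whether the vision encoder is included in the 25.2B. Read it as a floor, not a sizing guide.

```python
TOTAL_PARAMS = 25.2e9  # total parameter count quoted for the MoE

# Nominal bits per weight; real NVFP4 and FP8 builds also store scale factors.
BITS_PER_PARAM = {"NVFP4": 4, "FP8": 8, "BF16": 16}

# VRAM envelopes quoted in the article, in GB (BF16 is stated as "50GB+").
QUOTED_GB = {"NVFP4": 18, "FP8": 28, "BF16": 50}


def weight_gb(params, bits):
    """Weights-only footprint in decimal gigabytes."""
    return params * bits / 8 / 1e9


for fmt, bits in BITS_PER_PARAM.items():
    floor = weight_gb(TOTAL_PARAMS, bits)
    print(f"{fmt:6s} weights ~{floor:.1f} GB, quoted envelope ~{QUOTED_GB[fmt]} GB")
```

The weights alone come to about 12.6GB, 25.2GB, and 50.4GB. The gap between those floors and the quoted envelopes is the room left for everything else a running model needs.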
Serving stacks supported
vLLM (the first diffusion LLM natively supported there), Hugging Face Transformers, MLX, and SGLang at launch. Use the DiffusionGemmaForBlockDiffusion class in Transformers — AutoModelForCausalLM will not load it.
VRAM · NVFP4 quantized: ~18GB
Runs on one RTX 5090 at a vendor-stated 700+ tokens/sec. FP8 ~28GB, BF16 50GB+. A genuinely different cost curve from cluster-bound 26B models.
Max denoising steps / canvas: 48
Linear temperature 0.8 → 0.4, entropy early-stop at 0.005, 256-token canvas. Each pass commits ~15–20 tokens. Watch for incoherence at canvas boundaries on very long structured outputs.
One known limitation deserves a flag for long-form work: because each canvas denoises semi-independently, very long structured outputs can show incoherence at the 256-token canvas boundaries. For document-length structured generation, validate the seams. If you are sizing the speed against your existing stack, our roundup of AI model latency benchmarks puts figures like 1,100 tokens per second in context against mainstream autoregressive throughput.
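One lightweight way to validate the seams is to pull out a small window of tokens around each canvas boundary and review or lint just those spans. The helper below is a hypothetical sketch: it assumes boundaries fall at multiples of 256 generated tokens counted from the start of the response, which the source does not confirm, and `radius` is an arbitrary illustrative choice.

```python
def seam_windows(tokens, canvas_size=256, radius=8):
    """Return (boundary_index, window) pairs around each assumed canvas boundary."""
    windows = []
    for boundary in range(canvas_size, len(tokens), canvas_size):
        lo = max(0, boundary - radius)
        hi = min(len(tokens), boundary + radius)
        windows.append((boundary, tokens[lo:hi]))
    return windows


# Example with integer stand-ins for token ids: 600 tokens span two boundaries.
generated = list(range(600))
for boundary, window in seam_windows(generated):
    print(boundary, window[0], window[-1], len(window))
```

For structured formats, pairing this with a full parse of the output (a JSON or XML parser, for instance) catches the cases where a seam breaks the structure outright.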
07 — Routing Matrix
Which workloads to route where.
The benchmark profile resolves cleanly into a routing decision. Send DiffusionGemma the workloads that reward its speed and bidirectional attention; keep everything quality-sensitive on Gemma 4. The four cases below are the ones where the choice is non-obvious.
Layout-aware extraction
DiffusionGemma's one clear win is OmniDocBench 1.5, where bidirectional attention helps it read structure. Document extraction and dense parsing are the workloads to actively route here.
Single-user streaming
Low-concurrency, latency-sensitive local inference is exactly the scenario the speed advantage is built for. On one RTX 5090 you get a vendor-stated 700+ tokens/sec inside 18GB — fast and private.
AIME, LiveCodeBench, Codeforces
It trails Gemma 4 hard here — AIME 2026 −19.2 points, a 289-point Codeforces ELO gap. For multi-step math, competitive programming, or anything where a wrong step compounds, stay on Gemma 4.
50+ concurrent users
The 4x is a local, low-concurrency figure. Under heavy batched load an autoregressive model saturates compute efficiently and the diffusion advantage shrinks or inverts. Benchmark before committing.
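The four cases above reduce to a small routing function. This is a sketch under stated assumptions: the task labels, the `Workload` type, and the conservative default of falling back to Gemma 4 are all illustrative, and the 50-user threshold is simply the figure named in the last case, not a measured crossover point.

```python
from dataclasses import dataclass

DIFFUSION = "DiffusionGemma"
AUTOREGRESSIVE = "Gemma 4"
HIGH_CONCURRENCY = 50  # the "50+ concurrent users" case; benchmark your own crossover

# Hypothetical task labels for workloads where a wrong step compounds.
REASONING_TASKS = {"math", "competitive_programming", "code"}


@dataclass(frozen=True)
class Workload:
    task: str
    concurrent_users: int
    quality_critical: bool = False


def route(w: Workload) -> str:
    """Pick a model family for a workload, following the routing cases above."""
    if w.task == "document_parsing":
        return DIFFUSION        # the one benchmark where DiffusionGemma leads
    if w.task in REASONING_TASKS or w.quality_critical:
        return AUTOREGRESSIVE   # large quality gaps on math and coding
    if w.concurrent_users >= HIGH_CONCURRENCY:
        return AUTOREGRESSIVE   # speed advantage shrinks or inverts under batching
    if w.concurrent_users <= 1:
        return DIFFUSION        # local, single-user, latency-sensitive
    return AUTOREGRESSIVE       # assumed conservative default for the middle ground


print(route(Workload("document_parsing", 1)))
print(route(Workload("math", 1)))
print(route(Workload("chat", 200)))
```

The middle ground between one user and fifty is deliberately routed to the safer default here; that band is where a benchmark on your own traffic should decide.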
For the open-weight versus closed-weight framing of the broader text-diffusion landscape, the natural comparison point is Inception Labs Mercury 2, the first commercial text diffusion model. Both land near the 1,000-tokens-per-second mark, but Mercury 2 is a closed commercial API while DiffusionGemma is Apache 2.0 open weights with day-zero vLLM support. The two were not run through a shared benchmark suite, so treat the contrast as structural — open versus closed, self-hosted versus API — rather than a head-to-head quality number. The wider open-versus-closed question is covered in our open-weight vs closed-source AI models comparison.
08 — Implications
What a first open diffusion model changes.
Read narrowly, DiffusionGemma is an experimental release with one standout workload. Read as a signal, it is the moment text diffusion stopped being a closed-lab demo and became something any team can download, inspect, fine-tune, and self-host. That open-weight status is the part that compounds. Until now the fastest diffusion LLMs were commercial APIs; an Apache 2.0 model with native vLLM support and a JAX fine-tuning toolbox turns the paradigm into infrastructure researchers and product teams can build on without a vendor contract.
Projecting forward, the practical near-term value is not in replacing your default model but in routing specific workloads. Document extraction pipelines, single-user local assistants, and constrained generation tasks like the Sudoku demo are where iterative refinement and a single-GPU footprint pay off today. Expect the quality gap to narrow as the technique matures, the way it did across successive autoregressive generations, but do not bet a production pipeline on that convergence yet. The honest read for June 2026 is that diffusion is now a real, open option for a defined slice of workloads, and a poor fit for the rest. Deciding where it fits in a multi-model stack, alongside the autoregressive Gemma 4 family that DiffusionGemma trades quality against for speed, is exactly the kind of comparative evaluation our AI and digital transformation engagements begin with.
09 — Conclusion
A genuine option for the right workloads.
Speed is real, the quality cost is real, and routing is the whole game.
DiffusionGemma is the first open-weight text diffusion model from a tier-one lab, and it ships exactly as advertised: a 26B mixture-of-experts that denoises whole blocks in parallel, runs on a single desktop GPU, and hits a vendor-stated 1,100-plus tokens per second on an H100. The speed is genuine for local, low-concurrency inference — and only there.
The honest framing is Google’s own: it is experimental, it trails Gemma 4 on nearly every benchmark, and for maximum quality you should still reach for Gemma 4. The exception that earns it a place in a stack is document parsing, where bidirectional attention gives it a real edge. Everything else is a routing decision, not a default swap.
The larger signal is the open-weight release itself. Text diffusion is no longer a thing you can only rent through an API — it is a thing you can download, fine-tune, and run on hardware you already own. For teams that route by task class rather than by headline, that is the development worth tracking: not “which model is smartest,” but “which generation paradigm is cheapest and fastest for the specific slice of work in front of me.”