Jun 12, 2026

Google DiffusionGemma: 26B MoE Text Diffusion Model at 1,000+ Tokens/Sec

Google open-sourced DiffusionGemma on June 10, 2026 — a 26B MoE model using text diffusion that generates tokens in parallel, delivering 4x faster inference than autoregressive Gemma models.


Introduction

On June 10, 2026, Google released DiffusionGemma under the Apache 2.0 license, marking a significant architectural departure from the autoregressive LLMs that have dominated the open-source ecosystem. Rather than generating text one token at a time, DiffusionGemma applies a diffusion process to language — iteratively refining a sequence of tokens from noise toward a coherent output. The result is a 26-billion-parameter Mixture-of-Experts model capable of exceeding 1,000 tokens per second on an H100 GPU, and over 700 tokens per second on a consumer RTX 5090. For the open-source AI community, this release represents the first production-ready text diffusion LLM from a major AI laboratory.

Feature Overview

Text Diffusion Architecture: Why It Matters

Most open-source LLMs — Llama, Mistral, Qwen, and earlier Gemma models — use autoregressive decoding. Each new token depends on all previous tokens, creating an inherently sequential bottleneck. No matter how powerful the hardware, the model must complete token N before beginning token N+1.

DiffusionGemma removes this per-token dependency. Text diffusion treats the entire output sequence as a single object: the model starts from a noisy, incomplete representation and refines it across multiple denoising steps, updating all token positions simultaneously during each step. The denoising steps still run one after another, but their number is set by the sampling schedule rather than by the output length. This parallelism is what drives the speed advantage.

The architectural shift echoes what happened in image generation. Stable Diffusion and its successors generate entire images by iterative refinement rather than by pixel-by-pixel autoregressive synthesis. DiffusionGemma applies the same principle to text.
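The contrast between sequential and parallel decoding described in this section can be sketched with a toy example. This is a minimal illustration, not DiffusionGemma's sampler: `toy_model` is a hypothetical stand-in that already knows the answer, the unmasking scheme is a simple masked-diffusion-style loop, and positions are committed at random where a real sampler would typically use model confidence. The only point is to count forward passes.

```python
import math
import random

MASK = "<mask>"

def toy_model(sequence, target):
    """Hypothetical stand-in for a network: one call predicts every position."""
    return list(target)

def autoregressive_decode(target):
    """One forward pass per generated token."""
    out, calls = [], 0
    while len(out) < len(target):
        prediction = toy_model(out, target)  # one forward pass...
        calls += 1
        out.append(prediction[len(out)])     # ...commits exactly one token
    return out, calls

def diffusion_decode(target, steps, seed=0):
    """A fixed number of refinement steps, each updating many positions."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    per_step = math.ceil(len(target) / steps)
    calls = 0
    for _ in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        prediction = toy_model(seq, target)  # one pass scores all positions
        calls += 1
        # Commit a batch of positions (chosen at random in this toy).
        for i in rng.sample(masked, min(per_step, len(masked))):
            seq[i] = prediction[i]
    return seq, calls

target = "the quick brown fox jumps over the lazy dog".split()
ar_out, ar_calls = autoregressive_decode(target)
df_out, df_calls = diffusion_decode(target, steps=3)
print(f"autoregressive passes: {ar_calls}, diffusion passes: {df_calls}")
```

For this nine-token target the autoregressive loop needs nine passes, while the diffusion loop needs three, one per step. In a real model each diffusion pass does more work than a cached autoregressive step, so the pass count alone does not determine the speedup.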

Mixture-of-Experts Design

DiffusionGemma uses a Mixture-of-Experts (MoE) architecture at the 26B parameter scale. MoE models route each token through only a subset of their expert sub-networks, which allows a model to have a large total parameter count while keeping compute per inference step manageable. Combined with the parallel token generation of diffusion, this produces a model that is both capable and efficient relative to its nominal size.
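To make "only a subset of parameters" concrete, here is a minimal sketch of top-k expert routing, a common way MoE layers are built. The announcement does not describe DiffusionGemma's router, its expert count, or how many experts are active, so the four experts, `k=2`, and the function names below are all illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs."""
    chosen = route_top_k(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in chosen)
    return output, [i for i, _ in chosen]

# Four hypothetical "experts": each just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

output, used = moe_layer(10.0, experts, router_logits=[0.1, 2.0, -1.0, 1.0], k=2)
print(f"experts used: {used}, output: {output:.3f}")
```

Only two of the four experts run for this input, so compute scales with `k` rather than with the total number of experts, even though every expert's weights still have to be stored.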

Speed Benchmarks

Google reports DiffusionGemma sustains over 1,000 tokens per second on an NVIDIA H100 GPU. On the RTX 5090 — a high-end consumer card — throughput exceeds 700 tokens per second. Compared to equivalent autoregressive Gemma models, this represents approximately 4x faster generation speed (per official announcement via SiliconAngle, June 10, 2026). For applications requiring low latency or high-volume inference, this gap is operationally significant.
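A rough way to read these figures is to convert throughput into wall-clock time per response. The sketch below uses the reported lower bounds (1,000 and 700 tokens per second) and derives a 250 tokens-per-second autoregressive baseline by dividing the H100 figure by the reported 4x. That baseline is an inference for illustration, not a number from the announcement, and it assumes the comparison was made on the same hardware.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady throughput (ignores prompt processing)."""
    return num_tokens / tokens_per_second

H100_TPS = 1000          # reported lower bound
RTX_5090_TPS = 700       # reported lower bound
REPORTED_SPEEDUP = 4     # "approximately 4x" per the announcement

# Assumption: implied baseline, derived rather than reported.
implied_baseline_tps = H100_TPS / REPORTED_SPEEDUP

for n in (500, 2000, 8000):
    print(
        f"{n:>5} tokens: "
        f"H100 {generation_seconds(n, H100_TPS):.1f}s, "
        f"RTX 5090 {generation_seconds(n, RTX_5090_TPS):.1f}s, "
        f"implied baseline {generation_seconds(n, implied_baseline_tps):.1f}s"
    )
```

Under these assumptions a 2,000-token answer takes about 2 seconds instead of about 8, which is the difference the article calls operationally significant.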

Hardware Accessibility

When quantized, DiffusionGemma fits within 18GB of VRAM. This places it within reach of prosumer hardware setups. The RTX 5090 ships with 32GB of VRAM, and both the consumer RTX 4090 (24GB) and the workstation-class RTX 6000 Ada (48GB) also have enough memory. A 26B-parameter model running locally at 700+ tokens per second on consumer hardware is a meaningful threshold for the self-hosted AI community.
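The 18GB figure is easier to interpret with some back-of-the-envelope arithmetic. The sketch below estimates the memory needed for the weights alone at different precisions. The announcement does not say which quantization scheme produces the 18GB footprint, so the 4-bit row is an assumption, and activations, caches, and runtime overhead are not modelled.

```python
def weight_gigabytes(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 26e9      # all experts must be resident, not just the active ones
REPORTED_VRAM_GB = 18    # quantized footprint from the article

for bits in (16, 8, 4):
    gb = weight_gigabytes(TOTAL_PARAMS, bits)
    verdict = "fits" if gb <= REPORTED_VRAM_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB ({verdict} in {REPORTED_VRAM_GB} GB)")

# Assumption: 4-bit weights. The remainder is left for everything else.
headroom_gb = REPORTED_VRAM_GB - weight_gigabytes(TOTAL_PARAMS, 4)
print(f"headroom at 4-bit: {headroom_gb:.1f} GB")
```

At 16-bit the weights alone come to 52 GB and at 8-bit to 26 GB, so neither fits in 18 GB. At 4-bit they come to 13 GB, leaving roughly 5 GB for everything else, which is consistent with the reported footprint if a 4-bit scheme is used.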

Context Window and Language Coverage

The model supports a 256K token context window, enabling processing of very long documents, codebases, or conversation histories in a single pass. DiffusionGemma supports 140+ languages, consistent with the multilingual scope Google has pursued across the Gemma model family.

Usability Analysis

DiffusionGemma is available on Hugging Face at google/diffusion-gemma-26b under Apache 2.0, meaning commercial use, modification, and redistribution are all permitted without royalty obligations.

For researchers and developers experimenting with text diffusion, this is the first opportunity to work with a production-grade open implementation from a major lab. Prior open-source text diffusion efforts existed but lacked the scale, polish, and benchmark validation that a Google release brings.

For local inference users, the 18GB VRAM quantized footprint makes DiffusionGemma deployable on hardware that many serious practitioners already own. The 700+ tokens/sec throughput on an RTX 5090 means real-time or near-real-time responses even for lengthy outputs, which changes the practical experience of running a 26B-class model locally.

For production API deployments, the H100 throughput figures suggest significant cost advantages for inference-heavy workloads compared to autoregressive models of similar capability.

The primary caveat for current users is that text diffusion models can exhibit different quality characteristics than autoregressive models, particularly for tasks requiring precise token-level control such as structured data generation or constrained decoding. Users migrating workflows from autoregressive Gemma should plan for evaluation rather than assuming drop-in equivalence.
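One way to start that evaluation is to measure how often the model produces well-formed structured output for a set of representative prompts. The sketch below is a minimal, model-agnostic harness: `generate` is any callable that maps a prompt to text, and `fake_generate` is a hypothetical placeholder so the example runs without a model. Nothing here reflects a DiffusionGemma API.

```python
import json

def is_valid_record(text, required_keys):
    """True if text parses as a JSON object containing every required key."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys) <= set(data)

def structured_output_pass_rate(generate, prompts, required_keys):
    """Fraction of prompts whose output is a valid record."""
    passed = sum(is_valid_record(generate(p), required_keys) for p in prompts)
    return passed / len(prompts)

def fake_generate(prompt):
    """Hypothetical placeholder: swap in a real model call when evaluating."""
    if "ok" in prompt:
        return '{"name": "widget", "price": 9.5}'
    return '{"name": "widget", '  # truncated JSON, as a failure case

rate = structured_output_pass_rate(
    fake_generate,
    prompts=["ok 1", "ok 2", "bad"],
    required_keys={"name", "price"},
)
print(f"valid structured outputs: {rate:.0%}")
```

Running the same prompts through the existing autoregressive setup and through the candidate model gives a like-for-like pass rate before any workflow is migrated.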

Pros and Cons

Pros

  1. 4x inference speedup: 1,000+ tokens/sec on H100 and 700+ on RTX 5090, reported as roughly 4x faster than equivalent autoregressive Gemma models (official announcement, June 10, 2026).
  2. Consumer hardware compatible: 18GB VRAM requirement when quantized brings a 26B MoE model within reach of high-end prosumer setups.
  3. Apache 2.0 license: Commercial use, modification, and redistribution are permitted without royalties, subject to the license's notice and attribution conditions.
  4. 256K context window: Handles large documents, long codebases, and extended sessions in a single pass.
  5. 140+ language support: Broad multilingual coverage suitable for international deployments.

Cons

  1. Architectural novelty introduces uncertainty: Text diffusion behavior on structured output tasks, constrained decoding, and fine-tuning pipelines is less established than autoregressive alternatives. Existing tooling may require adaptation.
  2. No official benchmark comparisons published at launch: Quality comparisons against Llama 3, Mistral, or autoregressive Gemma at equivalent scales have not been released by Google as of June 10, 2026, making independent quality assessment the responsibility of early adopters.
  3. RTX 5090 requirement for peak consumer speed: The 700+ tokens/sec figure requires an RTX 5090. Users on older consumer hardware will see lower throughput, and the quantized 18GB figure does not guarantee performance parity across all card generations.

Outlook

DiffusionGemma's release is likely to accelerate research and tooling development around text diffusion architectures. The open-source community now has a well-resourced, production-scale baseline to evaluate, fine-tune, and extend. If quality benchmarks hold up against autoregressive competitors as community evaluations emerge, the speed advantages of text diffusion will put pressure on the autoregressive paradigm in inference-cost-sensitive applications.

The Apache 2.0 license also enables commercial products to adopt DiffusionGemma as a foundation, which may drive integration into inference servers like llama.cpp, vLLM, and Ollama — though those projects will need to implement diffusion-specific decoding paths, which is non-trivial engineering work.

Google's decision to open-source this model rather than hold it for Gemini API exclusivity suggests a strategic interest in establishing text diffusion as an ecosystem standard, consistent with its broader Gemma open-source strategy.

Conclusion

DiffusionGemma is the most significant architectural departure in open-source LLMs since the MoE approach became widespread. It delivers reported speed improvements that matter for real workloads, runs on hardware that practitioners can own, and ships under a license that permits commercial use. The open-source AI community now has its first production-grade text diffusion model from a tier-one lab. Researchers studying diffusion architectures, developers building latency-sensitive applications, and practitioners wanting a capable local model on high-end consumer hardware all have a compelling reason to evaluate this release.

Editor's Verdict

DiffusionGemma earns a solid recommendation within the open-source space.

The strongest case for paying attention is speed: a reported 1,000+ tokens/sec on H100 and 700+ on RTX 5090, roughly 4x faster than autoregressive Gemma equivalents, which raises the bar for what readers should now expect from peers in this space. The quantized model also fits in 18GB of VRAM, so local deployment on high-end consumer hardware adds practical value rather than just headline appeal. The broader signal is straightforward: text diffusion removes the sequential token dependency of autoregressive models, enabling parallel generation and fundamentally different latency scaling characteristics. On the other side of the ledger, the text diffusion tooling ecosystem is immature. Fine-tuning pipelines, constrained decoding, and structured output support all require community development work, and that is a real constraint, not a marketing footnote, which should factor into any serious decision. In addition, no public quality benchmarks against autoregressive competitors were published at launch, which leaves capability assessment to community evaluation and narrows the set of teams for whom this is an obvious yes.

For developers building locally, infrastructure engineers, and anyone preferring transparent, modifiable software, this is a serious evaluation candidate, not just a curiosity to bookmark. For everyone else, the safer posture is to monitor coverage and revisit once the use cases that matter to your team are demonstrated in the wild.

Pros

  • 1,000+ tokens/sec on H100 and 700+ on RTX 5090: a reported 4x speed improvement over autoregressive Gemma equivalents
  • Fits in 18GB VRAM when quantized, enabling local deployment on high-end consumer hardware
  • Apache 2.0 license with no commercial restrictions, available immediately on Hugging Face
  • 256K token context window handles large documents and codebases in a single pass
  • 140+ language support suitable for multilingual and international use cases

Cons

  • The text diffusion tooling ecosystem is immature: fine-tuning pipelines, constrained decoding, and structured output support require community development work
  • No public quality benchmarks against autoregressive competitors published at launch, leaving capability assessment to community evaluation
  • Peak consumer speed of 700+ tokens/sec requires an RTX 5090; older consumer GPU performance is unspecified


Key Features

DiffusionGemma uses a text diffusion architecture that generates all tokens in parallel rather than sequentially, delivering 1,000+ tokens/sec on H100 and 700+ tokens/sec on RTX 5090 — approximately 4x faster than equivalent autoregressive Gemma models. The 26B Mixture-of-Experts design fits in 18GB VRAM when quantized, enabling local deployment on high-end consumer hardware. It supports a 256K token context window and over 140 languages, and is released under Apache 2.0 on Hugging Face.

Key Insights

  • Text diffusion eliminates the sequential token dependency of autoregressive models, enabling true parallelism during generation and fundamentally different latency scaling characteristics.
  • The 4x speed advantage over equivalent autoregressive Gemma models (per official announcement, June 10, 2026) translates directly to lower inference costs for API deployments and more responsive local experiences.
  • An 18GB quantized VRAM footprint for a 26B MoE model represents a practical threshold: high-end prosumer GPUs like the RTX 5090 can run it locally at production-relevant speeds.
  • Apache 2.0 licensing permits commercial use, modification, and redistribution, positioning DiffusionGemma as a viable foundation model for startups and enterprises building inference-heavy products.
  • This is the first production-ready open-source text diffusion LLM from a major AI laboratory, providing the community with a credible baseline for evaluating the architecture at scale.
  • The 256K context window enables single-pass processing of large codebases, lengthy documents, and extended multi-turn sessions without chunking or truncation.
  • Integration with existing inference ecosystems such as vLLM, llama.cpp, and Ollama will require diffusion-specific decoding implementations, creating a near-term engineering gap that early contributors can address.
  • Google's open-source strategy with DiffusionGemma mirrors its Gemma series approach: releasing capable models freely to establish ecosystem standards while retaining proprietary advantages in the Gemini API.
