DiffusionGemma: Google releases the first open-source diffusion text model, 4x faster than autoregressive

LLM & Models 🟢 Beginner ⏱️ 16 min read 📅 2026-06-11

🔎 The end of the "token by token" reign?

Since 2017 and the publication of Attention Is All You Need, nearly every mainstream language model has worked the same way: it predicts the next token, then the next one, then the next one. This autoregressive approach became a dogma. Few seriously questioned whether an LLM had to generate text sequentially.

On June 10, 2026, Google DeepMind shattered this dogma with DiffusionGemma. This is the first major open-source model that uses diffusion (yes, the same technique behind Midjourney or DALL-E) to generate text. Not images. Text.

The headline number is stark: 1000+ tokens per second on a consumer RTX GPU, about 4x faster than an equivalent autoregressive model. DiffusionGemma does not predict token by token. It fills a canvas of 256 tokens in parallel, then iteratively denoises it over 48 steps.

This is an architectural paradigm shift. And it is open under the Apache 2.0 license. According to Ars Technica, it is the biggest technical surprise of the year in open-source AI.

The essentials

DiffusionGemma is the first open-source text model based on diffusion (not autoregression), released on June 10, 2026, by Google DeepMind under the Apache 2.0 license.
Architecture: 26B total params, 4B active (Mixture of Experts), based on Gemma 4. Parallel generation of a 256-token canvas, denoised in 48 steps.
Performance: 1000+ tokens/second on consumer RTX GPUs, making it 4x faster than an equivalent autoregressive model for long generation.
Bidirectional context: unlike a classic LLM, each token "sees" all the other tokens in the canvas, which improves coherence.
Built-in self-correction: the denoising process naturally corrects errors over iterations, without any external mechanism.
Available on Hugging Face with a complete developer guide.

Recommended tools

Tool	Main usage	Price (June 2026, check on huggingface.co)	Ideal for
DiffusionGemma 26B-A4B-it	Diffusion text generation	Free (Apache 2.0)	Self-hosting, fast RAG
DiffusionGemma developer guide	Integration and deployment	Free	Developers, API integration
Ollama	Local LLM execution	Free	Quick local runners
LM Studio	Desktop interface for local LLMs	Free	Non-technical users

Autoregression vs diffusion: two radically different philosophies

To understand why DiffusionGemma is a milestone, you need to grasp the fundamental difference between the two approaches.

Autoregression: sequential scrolling

An autoregressive model like GPT-5.5 or Claude Opus 4.7 generates text like a human typing on a keyboard: one word after another. At each step, it takes all the previous tokens as input and predicts the next one. The loop is simple in principle, but fundamentally sequential.

The problem? This sequentiality is a hardware bottleneck. Even if your GPU has thousands of cores, the prediction of token N depends on token N-1. You cannot parallelize the generation. This is why current LLMs plateau around 150-250 tokens/second on consumer hardware, despite constant architectural gains.

Diffusion: the parallel canvas

DiffusionGemma works differently. It starts by filling a canvas of 256 tokens in a single parallel pass. These initial tokens are intentionally noisy — almost random. Then, over 48 denoising steps, the model refines the entire canvas simultaneously.

At each step, the model looks at the complete state of the canvas and applies a global correction. It is exactly the same principle as image generation by diffusion: starting from noise and progressively bringing out a coherent structure. But applied to text.

The architectural advantage is massive: each denoising step is entirely parallelizable on the GPU. Hence the 1000+ tokens/second.
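To make the canvas idea concrete, here is a toy sketch of iterative refinement in plain Python. It is not DiffusionGemma's algorithm, which the article does not specify at this level of detail. The names toy_denoiser and denoise are hypothetical, the "model" cheats by knowing the target text, and committed positions are frozen for simplicity, whereas the real model is described as revisiting every token at every step.

```python
import random

VOCAB = list("abcdefgh")
NUM_STEPS = 4  # the article's default is 48; kept small so the trace is readable

def toy_denoiser(target, rng):
    """Hypothetical stand-in for the model.

    Proposes a token and a confidence for every canvas position in one call.
    A real denoiser is a neural network conditioned on the noisy canvas; this
    one cheats by knowing the target and drawing random confidences.
    """
    return [(tok, rng.random()) for tok in target]

def denoise(target, steps=NUM_STEPS, seed=0):
    rng = random.Random(seed)
    canvas = [rng.choice(VOCAB) for _ in target]   # start from noise
    settled = [False] * len(target)
    per_step = -(-len(target) // steps)            # ceil division
    history = ["".join(canvas)]
    for _ in range(steps):
        proposals = toy_denoiser(target, rng)      # one pass covers every position
        open_positions = [i for i, done in enumerate(settled) if not done]
        # Commit the most confident proposals, leave the rest for later steps.
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_positions[:per_step]:
            canvas[i] = proposals[i][0]
            settled[i] = True
        history.append("".join(canvas))
    return history

for step, text in enumerate(denoise("abcdefghabcdefgh")):
    print(step, text)
```

Each printed line is the whole canvas after one step. The point to notice is that the number of model calls depends on the step count, not on the number of tokens in the canvas.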

Technical architecture: 26B params, 4B active, all the context at once

DiffusionGemma is based on a Mixture of Experts (MoE) architecture with 26 billion total parameters, but only 4 billion are active at each denoising step. This makes it particularly well-suited for self-hosting on consumer hardware.

Bidirectional context, the real game-changer

In an autoregressive LLM, the token at position 50 only "sees" tokens 1 to 49. This is a causal mask. DiffusionGemma does not have this constraint. At each denoising step, each token on the canvas can attend to all other tokens.
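The difference between the two attention patterns fits in a few lines. This is a minimal illustration with hypothetical helper names (causal_mask, bidirectional_mask, show), not code from the model. It follows the usual convention where a position may also attend to itself.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position on the canvas."""
    return [[True] * n for _ in range(n)]

def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

print(show(causal_mask(4)))
print()
print(show(bidirectional_mask(4)))
```

The causal mask prints as a lower triangle, the bidirectional one as a full square: that full square is what lets a token near the start of the canvas depend on one near the end.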

This profoundly changes the quality of the generated text. A pronoun at position 200 can be resolved against a noun phrase at position 250, even though the latter has not yet been "generated" in a sequential sense. The model plans globally, then refines locally.

The 48 denoising steps

By default, generation takes 48 steps. Google DeepMind settled on this number as the sweet spot between quality and speed: fewer steps yield less coherent text, more steps bring diminishing returns. Developers can adjust the step count via the official guide, but 48 is the optimized default setting.

Gemma 4 base and legacy

DiffusionGemma is built on the base of Gemma 4, Google's family of open-source models. The base architecture (embeddings, normalization, attention mechanisms) is retained, but the generation layer is entirely replaced by a diffusion process. It is a hybrid: a classic transformer backbone, driven by a diffusion scheduler.

Benchmarks: 4x faster, but at what cost to quality?

The obvious question: does this speed come at a cost to quality? The answer is nuanced.

Raw speed

On an NVIDIA RTX 4090, DiffusionGemma reaches around 1100 tokens/second for a generation of 1024 tokens (4 canvases of 256). By comparison, an equivalently sized MoE autoregressive model runs around 250-280 tokens/second on the same hardware. The ratio is indeed 4x.
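A quick sanity check of that ratio, using only the throughput figures quoted above (the function name is ours):

```python
def seconds_to_generate(num_tokens, tokens_per_second):
    """Wall-clock time for a generation at a steady throughput."""
    return num_tokens / tokens_per_second

TOKENS = 1024  # 4 canvases of 256
diffusion = seconds_to_generate(TOKENS, 1100)
ar_fast = seconds_to_generate(TOKENS, 280)
ar_slow = seconds_to_generate(TOKENS, 250)

print(f"diffusion:      {diffusion:.2f} s")                     # 0.93 s
print(f"autoregressive: {ar_fast:.2f} to {ar_slow:.2f} s")      # 3.66 to 4.10 s
print(f"speedup:        {ar_fast / diffusion:.1f}x to {ar_slow / diffusion:.1f}x")  # 3.9x to 4.4x
```

So "4x" is the middle of a 3.9x to 4.4x range, depending on which end of the autoregressive figure you take.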

On an RTX 5070 (more accessible), it stays above 800 tokens/second. This is sufficient for real-time streaming where the text literally appears faster than you can read it.

Textual quality

On standard benchmarks (MMLU, HumanEval, GSM8K), DiffusionGemma sits slightly below a 26B params autoregressive model. The gap is in the range of 2-4 percentage points. This is not negligible, but it is remarkably good for a first iteration of an entirely new paradigm.

Where DiffusionGemma excels, however, is in long-form coherence. The bidirectional context works wonders for texts over 500 tokens: fewer repetitions, better management of pronominal references, and a more logical narrative structure. Autoregression tends to "forget" what it said at the beginning of a long text. Diffusion does not, because it constantly revisits the entire canvas.

Domains where autoregression remains superior

Complex code, chained mathematical reasoning, and strictly sequential logical tasks remain the domain of autoregression. When strict causality (the exact order of operations) matters more than overall coherence, predicting token by token is more reliable. For coding, the best LLMs for coding remain autoregressive models like Claude Opus 4.7 or GPT-5.5.

Concrete implications for self-hosting

This is perhaps where DiffusionGemma changes the game the most. Self-hosting LLMs is often limited by generation speed, not by the ability to fit the model in VRAM.

Comfort threshold exceeded

With a throughput of 1000+ tokens/second, we move from a "wait then block of text" experience to an "instant flow" experience. The comfort threshold for a human is around 100-150 tokens/second (reading speed). DiffusionGemma exceeds this threshold by a factor of 7 to 10. This is a qualitative change, not just a quantitative one.

For the best local LLMs, this means that consumer hardware finally becomes sufficient for real-time professional use cases. An RTX 4070 Ti Super with 16 GB of VRAM can run DiffusionGemma at full speed without aggressive quantization.

Impact on serving architectures

Serving frameworks like vLLM or TGI are optimized for autoregression (continuous batching, speculative decoding, KV cache). DiffusionGemma requires a different approach: batching is done at the canvas level, not at the individual token level. Google's developer guide provides a suitable custom serving solution, but the open-source ecosystem will take a few months to adapt.
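What "batching at the canvas level" could look like is sketched below. This is a hypothetical scheduler, not Google's serving solution and not an API of vLLM or TGI. It assumes that successive canvases of the same request must run in order, so a batch holds at most one canvas per request.

```python
from collections import deque

CANVAS_SIZE = 256

def schedule(requests, max_batch):
    """Group canvases into batches, one denoising run per batch.

    requests:  dict mapping a request id to the number of tokens it wants.
    max_batch: how many canvases the GPU denoises together (assumed limit).
    Returns a list of batches; each batch is a list of (request_id, canvas_index).
    """
    pending = deque(
        (rid, 0, -(-tokens // CANVAS_SIZE))  # (id, next canvas, total canvases)
        for rid, tokens in requests.items()
    )
    batches = []
    while pending:
        batch, carry = [], []
        while pending and len(batch) < max_batch:
            rid, index, total = pending.popleft()
            batch.append((rid, index))
            if index + 1 < total:
                carry.append((rid, index + 1, total))
        # Follow-up canvases join the queue only after this batch is closed,
        # so a request never has two canvases in the same batch.
        pending.extend(carry)
        batches.append(batch)
    return batches

print(schedule({"a": 600, "b": 100}, max_batch=2))
# [[('a', 0), ('b', 0)], [('a', 1)], [('a', 2)]]
```

The unit being scheduled is a whole canvas: request "b" finishes after one batch, while "a" needs three, and no per-token bookkeeping such as a KV cache appears anywhere in the loop.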

For those who use Ollama to run local models, DiffusionGemma integration is underway. The denoising process requires modifications to the inference pipeline that are not trivial. Wait for an official update rather than tinkering with it.

Comparison with other heavy open-source models

The NVIDIA Nemotron 3 Ultra 550B remains the most powerful open-source model in terms of raw quality, but it requires serious multi-GPU infrastructure. DiffusionGemma does not target the same segment: it sacrifices a few benchmark points to be executable on a single consumer card, with a throughput that defies all competition in this hardware category.

Built-in self-correction: denoising as a verification mechanism

One of the most underestimated aspects of DiffusionGemma is its natural capacity for self-correction. In an autoregressive model, if the model generates an error at token 50, this error inevitably propagates. The model is "stuck" by its own previous choices.

With diffusion, this is not the case. The denoising process revisits each token at each step. An inconsistency detected at step 30 can be corrected at step 31, because the model has access to the global context of the canvas.

No retry, no backtracking

Unlike self-correction systems that require an external agent (generate → evaluate → regenerate), the correction here is intrinsic to the generation process. It costs no additional calls, no additional tokens. It is "free" in terms of inference cost.

This property has significant implications for the best LLMs for AI agents, where the consistency of action plans across multiple steps is critical. An agent based on DiffusionGemma could generate a complete action plan within one 256-token canvas, then denoise it to eliminate logical inconsistencies, all in a single generation call.

Text diffusion: why now?

The idea of applying diffusion to text is not new. But several factors aligned in 2026 to make it viable.

The legacy of data privacy research

Surprisingly, research into the privacy of web searches indirectly contributed to this advance. The study Private Information Disclosure from Web Searches (2010) had shown how sequential query patterns revealed private information. Diffusion approaches, by generating content in a non-sequential manner, are intrinsically more resistant to this type of pattern-based deduction, because there is no observable causal chain in the generation process.

Lessons from reproducible research

The scientific community has long faced problems with metric manipulation. The study Manipulating Google Scholar Citations and Google Scholar Metrics (2012) illustrates how sequential, citation-based metrics can be exploited. A generative model that produces content holistically (rather than sequentially) offers a different perspective on building coherent knowledge.

Accumulated experience in image diffusion

Five years of intensive research into image diffusion (Stable Diffusion, Imagen, etc.) have produced a mature understanding of noise schedules, denoising architectures, and sampling techniques. Google simply transferred this expertise to the text domain, with the necessary adjustments.

The context of any-to-any models

The announcement of Gemini Omni, Google's any-to-any model that handles text, image, audio, and video as input and video as output, is part of the same trend: unifying modalities under common architectures. DiffusionGemma is another step in this direction — using diffusion as a universal paradigm, beyond images.

Current limitations and technical challenges

DiffusionGemma is not a perfect model. It is a starting point for a new paradigm, with the inevitable limitations of a first iteration.

The maximum length per canvas

The 256-token canvas is a fundamental constraint. To generate 2048 tokens, the model must produce 8 successive canvases. The joining between canvases is not as fluid as the continuous generation of an autoregressive model. We sometimes observe thematic jumps between the end of one canvas and the beginning of the next.

Google DeepMind is working on sliding canvases (overlapping canvases) to mitigate this issue, but this is not yet available in the initial release on Hugging Face.
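The canvas arithmetic is easy to check. A small sketch (the helper name canvas_spans is ours) that splits a generation into consecutive, non-overlapping canvases and lists the seams where thematic jumps can occur:

```python
CANVAS_SIZE = 256

def canvas_spans(num_tokens, canvas_size=CANVAS_SIZE):
    """Return one (start, end) token range per canvas, end exclusive."""
    return [
        (start, min(start + canvas_size, num_tokens))
        for start in range(0, num_tokens, canvas_size)
    ]

spans = canvas_spans(2048)
seams = [end for _, end in spans[:-1]]  # boundaries between two canvases

print(len(spans))   # 8
print(seams)        # [256, 512, 768, 1024, 1280, 1536, 1792]
```

A 2048-token generation therefore has 7 seams. An overlapping scheme would presumably make consecutive ranges share some tokens, but its details have not been published, so it is not sketched here.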

The fixed cost of 48 steps

Even for a short response of 50 tokens, DiffusionGemma must perform 48 denoising steps on a 256-token canvas. For very short generations (yes/no answers, classifications), autoregression remains more efficient. Diffusion only becomes advantageous from around 128 generated tokens.
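Counting sequential model calls shows where the fixed cost comes from. This sketch only counts forward passes, using the 256-token canvas and 48 steps from the article. It deliberately ignores that one diffusion pass processes a whole canvas and is therefore more expensive than one autoregressive decode step, which is why the real break-even sits higher than the raw counts suggest.

```python
CANVAS_SIZE = 256
DENOISING_STEPS = 48

def autoregressive_passes(num_tokens):
    """One sequential forward pass per generated token."""
    return num_tokens

def diffusion_passes(num_tokens):
    """48 passes per canvas, however few tokens of the canvas are needed."""
    canvases = -(-num_tokens // CANVAS_SIZE)  # ceil division
    return canvases * DENOISING_STEPS

for n in (50, 128, 1024):
    print(n, autoregressive_passes(n), diffusion_passes(n))
# 50 50 48
# 128 128 48
# 1024 1024 192
```

For 50 tokens the two approaches need about the same number of sequential passes, so diffusion has nothing to gain. At 1024 tokens it needs roughly five times fewer.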

The immature ecosystem

No major RAG framework natively supports diffusion generation. Existing pipelines (LlamaIndex, LangChain) all assume token-by-token generation. Integration requires a custom adapter. For the best LLMs for search, autoregression remains the default choice.

System prompts and format control

Precisely controlling the output format (JSON, XML, strict schemas) is more difficult with diffusion than with autoregression. In autoregression, you can constrain each token via formal grammars. In diffusion, the canvas is globally rewritten at each step, which makes token-level constraints more complex to apply. The developer guide proposes workarounds, but it is a work in progress.
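To see why grammar constraints are natural in autoregression, here is a toy constrained greedy decoder. Everything in it is hypothetical: the five-state GRAMMAR, the scores, and the function names. The key property is that the prefix is final once emitted, so the set of legal next tokens is well defined at every step.

```python
# Toy grammar: the output must be exactly  { "k" : <0 or 1> }
GRAMMAR = [{"{"}, {'"k"'}, {":"}, {"0", "1"}, {"}"}]

def allowed(prefix):
    """Legal next tokens, given the tokens already emitted."""
    return GRAMMAR[len(prefix)]

def constrained_greedy(step_scores, allowed):
    """Pick, at each step, the best-scoring token that the grammar permits."""
    out = []
    for scores in step_scores:
        legal = {tok: s for tok, s in scores.items() if tok in allowed(out)}
        out.append(max(legal, key=legal.get))
    return out

# Made-up model scores: the unconstrained favourite is often illegal.
step_scores = [
    {"hello": 0.9, "{": 0.1},
    {'"k"': 0.6, "}": 0.4},
    {":": 0.7, ",": 0.3},
    {"yes": 0.5, "1": 0.3, "0": 0.2},
    {"}": 1.0},
]

print("".join(constrained_greedy(step_scores, allowed)))  # {"k":1}
```

On a diffusion canvas there is no fixed prefix to condition on: a token at position 3 can change after position 4 has been filled in, so a mask like allowed(prefix) cannot be applied position by position in the same way.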

Google's strategic positioning: why open-source?

In a context where Meta recently closed its Muse Spark model, marking a shift towards proprietary releases, Google is making the opposite choice with DiffusionGemma. The Apache 2.0 license is highly permissive: commercial use, modification, and redistribution are all allowed.

An ecosystem strategy

Google does not make money directly with DiffusionGemma. But by making this paradigm open-source, it encourages the community to solve the problems mentioned above (canvas joining, RAG integration, format control). If diffusion for text becomes an industry standard, Google has a considerable head start in terms of research and training data.

Competing with proprietary autoregression

OpenAI with GPT-5.5 and Anthropic with Claude Opus 4.7 dominate the proprietary autoregressive segment. By opening up an alternative paradigm, Google creates a new battlefield where it takes the lead. It's a classic strategic maneuver: if you can't win on the opponent's turf, change the turf.

The link with free LLMs

For users who want to test diffusion without installing the model, the best free LLMs now include APIs based on DiffusionGemma via Google AI Studio. This makes it possible to directly compare speed and quality with free autoregressive models like Gemini 3.1 Pro or the free version of GPT-5.5.

What this means for the future of LLMs

DiffusionGemma is not just another model. It is a proof of concept that the autoregressive dogma is not a law of physics. Other architectures are possible, and some can be significantly more efficient.

The convergence of paradigms

We are seeing a landscape emerge where the following coexist: pure autoregression (GPT-5.5, Claude), pure diffusion (DiffusionGemma), and very likely hybrids that use autoregression for planning and diffusion for execution. The best LLMs right now could well be hybrid by the end of 2027.

The impact on hardware

GPUs are optimized for massive parallelism. Autoregression only exploits 10-20% of their theoretical capacity during generation: the prefill phase is parallel, but the decode phase is not. Diffusion makes much better use of parallelism. If diffusion becomes dominant, GPU benchmarks will change, and hardware architectures could evolve to specifically optimize for denoising iterations rather than sequential decoding.

The impact on APIs and billing

Today, free and paid AI APIs charge per generated token. With diffusion, the notion of a "generated token" becomes blurry, since each token can be rewritten at every one of the 48 steps. The pricing model will have to adapt. Google chose to charge for DiffusionGemma per 256-token canvas on its API, regardless of the number of denoising steps.

❌ Common mistakes

Mistake 1: Confusing text diffusion with image diffusion

DiffusionGemma does not generate images. It is a language model. The diffusion technique is the same (noise → iterative denoising), but the latent space is textual, not pixel-based. It is not a multimodal model.

Mistake 2: Expecting higher quality than autoregression

DiffusionGemma is faster, not better in absolute quality. On pure reasoning benchmarks, it still lags behind GPT-5.5 or Claude Opus 4.7. Its advantage is the speed/quality ratio, not maximum quality.

Mistake 3: Trying to integrate it into a classic RAG pipeline without adapting it

Plugging DiffusionGemma into an existing LangChain or LlamaIndex pipeline without adapting the code will generate errors or poor results. The model expects queries formatted for a 256-token canvas, not token-by-token generate() calls. Follow the developer guide.

Mistake 4: Using aggressive 4-bit quantization

With 4B active params, DiffusionGemma fits in 16 GB of VRAM in FP16. 4-bit quantization significantly degrades denoising quality, much more so than for an autoregressive model. Iterative steps amplify precision errors. Stay in FP16 or INT8 maximum.

❓ Frequently Asked Questions

Does DiffusionGemma replace autoregressive LLMs?

No. It is complementary. For short generations, code, and strictly sequential reasoning, autoregression remains superior. DiffusionGemma excels at long generations where speed and overall consistency are paramount.

Can DiffusionGemma run on a Mac?

Theoretically yes via MLX, but performance is disappointing. Diffusion is highly sensitive to memory bandwidth, and Apple chips (M-series) are limited in this regard compared to NVIDIA GPUs. Wait for specific MLX optimization.

Does the Apache 2.0 license allow commercial use?

Yes. Apache 2.0 permits commercial use, modification, and redistribution, provided you keep the license text and attribution notices and mark the files you changed. Google retains no specific rights over the generated outputs.

How much VRAM is actually needed?

Count on 10-12 GB for the model in FP16, plus 2-3 GB for the canvas and denoising buffers. 16 GB is the comfortable minimum. 24 GB (RTX 4090) allows for batching of multiple canvases simultaneously.
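A weights-only back-of-the-envelope check helps put these numbers in context. The sketch below uses the standard bytes-per-parameter figures and the 4B/26B parameter counts quoted earlier. It ignores activations, canvas buffers, and framework overhead.

```python
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def weights_gb(num_params, dtype):
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for label, params in (("active (4B)", 4e9), ("total (26B)", 26e9)):
    for dtype in ("fp16", "int8", "int4"):
        print(f"{label:12s} {dtype}: {weights_gb(params, dtype):5.1f} GB")
# active (4B)  fp16:   8.0 GB
# active (4B)  int8:   4.0 GB
# active (4B)  int4:   2.0 GB
# total (26B)  fp16:  52.0 GB
# total (26B)  int8:  26.0 GB
# total (26B)  int4:  13.0 GB
```

The 10-12 GB figure is close to the active-parameter footprint in FP16 plus some overhead. Keeping all 26B parameters resident in FP16 would take about 52 GB, so the figure presumably relies on inactive experts being offloaded or stored more compactly. That reading is our assumption, so check the developer guide before sizing hardware.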

Does DiffusionGemma support French?

Yes, the model is multilingual like its Gemma 4 base. Among the best LLMs for French, DiffusionGemma positions itself well thanks to its bidirectional context, which is particularly helpful for languages with complex agreements (gender, number, adjective agreement).

✅ Conclusion

DiffusionGemma proves that text generation does not need to be sequential. By applying diffusion to language, Google DeepMind opens a path toward LLMs 4x faster on consumer hardware, with built-in self-correction and bidirectional context that changes the game for long texts. The model is not perfect — limited context window, immature ecosystem, lower raw quality than autoregression — but it is an architectural turning point whose repercussions will be felt for years to come. If you are self-hosting, download DiffusionGemma on Hugging Face and test it: you will see text generation differently.
