Sumi: the first uniform diffusion language model built from scratch (7B parameters). The end of autoregression?
🔎 Why June 2026 marks a turning point for LLMs
For five years, the AI industry has been built on a single paradigm: autoregression. Every model, from GPT-5.5 to Claude Opus 4.7, predicts the next token, one by one, from left to right. It's simple, it works, but it's fundamentally sequential.
Within a single week in June 2026, two bombshells shattered this consensus. First came DiffusionGemma, released on June 10 by Google DeepMind. Then came Sumi, released on June 17 by Tohoku University in Japan.
Two radically different approaches to achieve the same goal: generating text by diffusion, not by autoregression. And Sumi brings something no one had yet dared to do at this scale. A uniform diffusion language model, pre-trained from scratch, without any architectural compromises.
The essentials
- Sumi is the first uniform diffusion language model (UDLM) pre-trained from scratch at 7B parameters on 1.5 trillion tokens, published on June 17, 2026, by Tohoku University (arXiv paper).
- Unlike autoregressive models (GPT-5.5, Claude Opus 4.7), Sumi generates a complete canvas of corrupted tokens and then iteratively denoises it with native bidirectional attention — no causal mask.
- It differs from DiffusionGemma (Google, 26B MoE) because Sumi is purely diffusion from the first pre-training token, not a converted AR model.
- Academic research proves it can rival industrial labs on LLM architecture — a strong signal after the debates surrounding Meta Muse Spark.
Recommended tools
| Tool | Main usage | Price (June 2026, check on site) | Ideal for |
|---|---|---|---|
| Sumi-7B | Research, diffusion inference | Free (Apache-style) | Experimenting with UDLM from scratch |
| DiffusionGemma 26B | Fast text generation | Free (Apache 2.0) | Production, 1000+ tokens/sec |
| Nemotron-Labs-Diffusion-14B | Tri-mode decoding | Free (open-weights) | AR vs diffusion speed benchmarks |
| Hostinger | Hosting to deploy models | From 2.99€/month | Self-host deployment of Sumi |
Autoregressive vs. Uniform Diffusion — Two Opposing Philosophies
Autoregression is the "typewriter" approach. The model reads everything that came before, then guesses the next word. One by one. Always.
Uniform diffusion is the "sculptor" approach. The model starts with a block of noise — a canvas of completely corrupted tokens — and iteratively sculpts it until coherent text emerges. In parallel, not in sequence.
The difference is fundamental. Autoregression uses causal attention: each token only sees what precedes it. Sumi's uniform diffusion uses native bidirectional attention: each token sees the entire context, in both directions, at every denoising step.
Practical consequence: whereas GPT-5.5 has to wait until it has generated token 50 to start thinking about token 51, Sumi simultaneously refines the entire text. This is a computational paradigm shift, not just an optimization.
Why Autoregression Dominated for So Long
Autoregression had an overwhelming advantage: training simplicity. The next-token prediction objective is trivial to implement and scale. The entire ecosystem — data, infra, frameworks — was built around it.
Diffusion for text, on the other hand, posed formidable problems. The discretization of text (tokens, not continuous pixels) makes noise and denoising processes much more complex. Until 2025, no one had found the right formulation to make massive pre-training hold up.
The Gap Sumi Fills
Before Sumi, diffusion language models were merely proofs of concept: compute-optimal checkpoints trained on tiny token budgets, as highlighted by the review on Paperium.
With 1.5 trillion tokens at 7B parameters, this is the first time a UDLM has reached a scale comparable to reference autoregressive LLMs. Tohoku University proved that the mathematical formulation holds up at scale.
What makes Sumi unique — detailed architecture
Sumi is not an autoregressive model disguised as a diffusion model. This is the most important distinction to understand.
Bidirectional attention from the start
In a classic transformer like Claude Sonnet 4.6 or GPT-5.4, the attention mask is lower-triangular (causal). The token at position i can only attend to positions 0 to i: itself and everything before it, never anything after. This is an architectural choice that encodes temporal order into the very structure of the model.
Sumi removes this mask. Each attention layer sees the entire canvas at every denoising step. The model never learned to "look back" because there is no back or forward — there is a whole that it refines.
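To make the contrast concrete, here is a minimal sketch of the two attention masks, assuming the usual convention in which a causal token attends to itself and to everything before it. The function names are illustrative and are not taken from Sumi's code.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Full mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask as rows of 1 (visible) and 0 (hidden)."""
    return "\n".join(" ".join("1" if seen else "0" for seen in row) for row in mask)


if __name__ == "__main__":
    print("causal:")
    print(render(causal_mask(4)))
    print("bidirectional:")
    print(render(bidirectional_mask(4)))
```

In the causal case, row i has ones only up to column i. In the bidirectional case every row is full, which is what lets each denoising step see the whole canvas.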
This radically changes the quality of the intermediate representation. An AR model mid-generation only has a partial view of the final text. Sumi, at every step, has a global view.
The uniform noising process
The "uniform" in UDLM is crucial. Unlike classic Gaussian diffusion (used in AI image generation), which adds continuous noise, Sumi corrupts tokens by uniform replacement: when a token is corrupted, it is swapped for a token drawn uniformly at random from the vocabulary, so every vocabulary entry is an equally likely replacement.
This choice is not trivial. The discretization of text makes continuous noising patterns directly inapplicable. Tohoku University had to design a specific corruption schedule that ensures the denoising process properly learns the underlying linguistic distribution.
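As a rough illustration of uniform corruption, the sketch below replaces each token, independently and with probability `noise_level`, by a token id drawn uniformly from the vocabulary. This is a generic formulation. Sumi's actual corruption schedule is not described here, so `noise_level`, `vocab_size` and the function name are assumptions made for the example.

```python
import random


def corrupt_uniform(tokens: list[int], noise_level: float,
                    vocab_size: int, rng: random.Random) -> list[int]:
    """Independently replace each token, with probability noise_level,
    by a token id drawn uniformly from range(vocab_size)."""
    return [
        rng.randrange(vocab_size) if rng.random() < noise_level else tok
        for tok in tokens
    ]


if __name__ == "__main__":
    rng = random.Random(0)           # seeded for reproducibility
    clean = [11, 42, 7, 99, 3, 58]   # hypothetical token ids
    for level in (0.0, 0.5, 1.0):
        print(level, corrupt_uniform(clean, level, vocab_size=100, rng=rng))
```

At `noise_level=0.0` the sequence is untouched, and at `1.0` every position is resampled, which corresponds to the fully corrupted canvas that generation starts from. Note that a uniform draw can land on the original token by chance.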
7B parameters, 1.5T tokens — the numbers
7 billion parameters is the sweet spot for open research in 2026. Large enough for serious performance, small enough to be reproduced by an academic lab. The 1.5 trillion tokens place Sumi in the same data budget category as the first competitive open-source LLMs.
The model requires trust_remote_code=True in the Hugging Face transformers library, indicating a custom architecture that is not yet integrated into the standard pipeline. A sign that the ecosystem still needs to adapt.
Sumi vs DiffusionGemma — two paths to diffusion
The comparison is inevitable. DiffusionGemma was released a week before Sumi, also in open-weights (Apache 2.0), but with a radically different philosophy.
Two opposing design strategies
| Feature | Sumi (Tohoku) | DiffusionGemma (Google) |
|---|---|---|
| Architecture | UDLM from scratch | AR converted to diffusion |
| Parameters | 7B (dense) | 26B (MoE, 4B active) |
| Attention | Native bidirectional | Added bidirectional |
| Pre-training | Diffusion objective from the start | AR objective then conversion |
| License | Open source | Apache 2.0 |
| Inference VRAM | ~14-16GB estimated | 18GB documented |
| Generation speed | Not precisely communicated | 1000+ tokens/sec |
DiffusionGemma is the pragmatic industrial approach. Google took an existing autoregressive model and converted it into a diffusion model. The AR to diffusion conversion study published on June 4, 2026, shows exactly this method: replacing causal attention with bidirectional attention and retraining with a denoising objective.
Sumi is the idealistic academic approach. Nothing autoregressive ever touched this model. It is conceptually purer, but it is also riskier — and potentially more costly in pre-training compute.
Who wins?
The answer depends on what you are looking for. DiffusionGemma is immediately production-ready with its 1000+ tokens/second and its efficient MoE architecture. Sumi is a testbed for understanding whether from-scratch diffusion can ultimately surpass AR conversion.
My opinion: both are necessary. DiffusionGemma validates product viability. Sumi validates the scientific viability of the paradigm.
NVIDIA Nemotron-Labs-Diffusion — the third player
The landscape of diffusion LLMs doesn't stop at Sumi and DiffusionGemma. NVIDIA Nemotron-Labs-Diffusion, released on May 23, 2026, offers a third, still different path: tri-mode.
AR + diffusion + self-speculation in a single model
Nemotron-Labs-Diffusion-8B and -14B can switch between three decoding modes simply by changing the attention pattern at inference. No need for three separate models. The measured throughput gains range from 2.7x to 3.3x compared to an equivalent AR model, and up to 6x for the 8B.
This is proof that the industry isn't looking to "kill" autoregression but to complement it. NVIDIA's tri-mode recognizes that AR remains better on certain tasks, diffusion on others, and that the best model is the one that knows how to choose.
Where Sumi stands against NVIDIA
Sumi makes no compromises. It is 100% diffusion, 100% of the time. This is both its strength (architectural purity, maximum optimization for this paradigm) and its weakness (no AR safety net for tasks where diffusion underperforms).
The dev.to deep dive on Nemotron clearly shows that tri-mode is an engineering approach. Sumi is a fundamental science approach. The two feed off each other.
The RCD module — a boost for diffusion models
A crucial detail of the ecosystem: PulseAugur reports the emergence of a new RCD (Random Corruption Denoising) module that significantly improves the accuracy and efficiency of text diffusion models.
What RCD concretely changes
RCD introduces a more sophisticated random corruption mechanism than the basic uniform noising. Instead of uniformly replacing tokens, it applies a corruption pattern that better preserves the local structure of the text, making denoising easier to learn.
This is exactly the type of innovation that could be integrated into Sumi in a future version. The current model uses pure uniform noising — RCD could be a natural evolution of its corruption schedule without changing the fundamental architecture.
Implications for the future of LLMs — what really changes
The core question — "the end of autoregression?" — deserves a nuanced answer.
Massive parallelism at inference
This is the knockout argument for diffusion. A TeqVolt feature article published in June 2026 summarizes it well: diffusion models offer massive parallelism at inference that AR can never match.
When GPT-5.5 generates 1000 tokens, it makes 1000 sequential passes through the network. When Sumi generates 1000 tokens in N denoising steps, each step processes all 1000 positions in parallel on the GPU. The speedup is not linear (denoising steps have a cost), but it is structurally favored by GPU hardware, which is designed for parallel computing.
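The trade-off can be put into numbers with a back-of-the-envelope sketch. It assumes an AR decoder with a KV cache (one new position per pass) and a diffusion decoder that recomputes every position at every step. The value N = 64 is an arbitrary illustration, not a figure from the Sumi paper.

```python
def autoregressive_cost(num_tokens: int) -> tuple[int, int]:
    """Return (sequential passes, position computations) for AR decoding.
    Assumes a KV cache, so each pass computes one new position."""
    return num_tokens, num_tokens


def diffusion_cost(num_tokens: int, num_steps: int) -> tuple[int, int]:
    """Return (sequential passes, position computations) for diffusion decoding.
    Each denoising step processes every position of the canvas in parallel."""
    return num_steps, num_steps * num_tokens


if __name__ == "__main__":
    print("AR        :", autoregressive_cost(1000))
    print("diffusion :", diffusion_cost(1000, num_steps=64))
```

Under these assumptions, diffusion needs far fewer sequential passes (64 instead of 1000) but performs more total work (64,000 position computations instead of 1,000). That is why the speedup depends on how well the GPU absorbs the extra parallel work and on how few denoising steps the model needs.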
Quality on structured tasks
Sumi's benchmarks suggest a notable lead on tasks where the overall coherence of the text matters more than local fluency: structured summaries, code generation, tables, constrained formats.
This is logical. Bidirectional attention allows the model to plan the overall structure before refining the details. An AR model must "improvise" the structure as it generates, which penalizes it on rigid formats.
The impact on tabular data
An interesting parallel with TabPFN, the first foundation model for tabular data. TabPFN showed that non-autoregressive architectures could dominate on structured data. Sumi extends this logic to structured text.
What won't disappear
Autoregression isn't dying. Models like Claude Opus 4.7 (agentic score 94.3) or GPT-5.5 (agentic score 98.2) dominate complex reasoning and agentic tasks where sequential generation — thinking step by step — is an asset, not a flaw.
The most likely future is hybrid. NVIDIA understood this with tri-mode. AR for chain-of-thought reasoning, diffusion for block generation. Sumi is an essential building block of this future, not its sole architecture.
How to use Sumi today
Installation and loading
The code is available on the Tohoku-NLP GitHub. Installation involves cloning the repo and custom loading via transformers with trust_remote_code=True.
It is still a researcher's workflow, not a product developer's. There is no served API, no native vLLM integration, no documented GGUF quantization. If you want to experiment seriously, plan for a GPU with at least 16GB of VRAM and patience for the setup.
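For orientation, here is a hedged sketch of what loading might look like. The repository name and commit hash are placeholders, since neither is given here, and `AutoModel` is an assumption: the project's README may point to a different auto class or a custom entry point. Pinning `revision` to a commit you have reviewed limits what `trust_remote_code=True` can execute.

```python
from typing import Any

# Placeholders, NOT real identifiers: take both from the project's own README.
MODEL_ID = "<org>/<sumi-7b-repo>"
REVISION = "<audited-commit-sha>"


def loading_kwargs(revision: str) -> dict[str, Any]:
    """Options shared by the tokenizer and model loaders.
    trust_remote_code runs Python code from the repo, so pin the revision."""
    return {"trust_remote_code": True, "revision": revision}


def load_sumi(model_id: str, revision: str):
    """Load tokenizer and model in fp16 (assumes AutoModel is the right class)."""
    # Imported lazily so the helper above works without torch/transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    kwargs = loading_kwargs(revision)
    tokenizer = AutoTokenizer.from_pretrained(model_id, **kwargs)
    model = AutoModel.from_pretrained(model_id, torch_dtype=torch.float16, **kwargs)
    return tokenizer, model.eval()
```

Calling `load_sumi(MODEL_ID, REVISION)` only makes sense once the placeholders are replaced. The generation call itself is left out because the custom code defines its own sampling interface.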
Self-hosted deployment
For those who want to deploy Sumi or other diffusion models themselves, hosting with a dedicated GPU is necessary, whether from a provider such as Hostinger or an equivalent cloud instance. The model is not yet optimized for consumer hardware.
Who is it for?
- NLP researchers who want to study diffusion for text
- ML teams evaluating whether diffusion can replace AR in their pipelines
- Curious individuals who want to understand the architecture before it arrives in mainstream products
It is not yet a tool for production. But maturation cycles are accelerating. DiffusionGemma went from paper to usable product in a few months.
❌ Common mistakes
Mistake 1: Confusing Sumi and DiffusionGemma
These are two fundamentally different models. DiffusionGemma is an AR model converted to diffusion. Sumi was born in diffusion. Attributing DiffusionGemma's 1000 tokens/sec to Sumi is factually wrong — Sumi's speed benchmarks have not yet been published at this level of detail.
Mistake 2: Thinking that diffusion replaces AR everywhere
The agentic scores from June 2026 show that AR models dominate complex reasoning (GPT-5.5 at 98.2, Claude Opus 4.7 at 94.3). Diffusion excels at parallel generation and structured text. They are complements, not substitutes.
Mistake 3: Ignoring the from-scratch pre-training cost
A from-scratch UDLM costs significantly more in compute than an AR model of the same size, because the denoising objective is more complex to optimize than next-token prediction. Tohoku University made a considerable investment. Failing to mention this presents diffusion as a "free" performance solution, which is misleading.
Mistake 4: Using trust_remote_code=True in production without an audit
The trust_remote_code=True flag in transformers executes arbitrary code from the Hugging Face repo. This is acceptable for research. It is unacceptable in production without a full audit of Sumi's custom code.
❓ Frequently Asked Questions
Is Sumi really the first diffusion language model?
No, but it is the first UDLM pre-trained from scratch at the 7B/1.5T scale. Diffusion models for text existed before, but only at a small scale or as conversions of AR models.
Can Sumi replace GPT-5.5 or Claude Opus 4.7 today?
No. AR models still largely dominate reasoning and agentic tasks. Sumi is an academic testbed, not a competitive product on current general benchmarks.
What is the difference between uniform diffusion and Gaussian diffusion?
Gaussian diffusion (used in images) adds continuous noise to pixels. Uniform diffusion replaces discrete tokens with other random tokens with uniform probability. Since text is discrete, the uniform formulation is more natural.
How much VRAM is needed to run Sumi?
A dense 7B model in fp16 requires about 14GB of VRAM for the weights alone. Count on 16-18GB with the context and denoising buffers. An RTX 4090 or an A10G is sufficient for experimentation.
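The 14GB figure follows from simple arithmetic: parameters times bytes per parameter. The sketch below reproduces it, using decimal gigabytes (1 GB = 10^9 bytes). The overhead for activations and denoising buffers is not modelled, because it depends on sequence length and implementation.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal GB (fp16/bf16 = 2 bytes each)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"fp16: {weight_memory_gb(7e9):.1f} GB")                     # 7B x 2 bytes
    print(f"fp32: {weight_memory_gb(7e9, bytes_per_param=4):.1f} GB")  # 7B x 4 bytes
```

The same 7B weights in fp32 would need 28GB, which is why half precision is the practical default on a 24GB card.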
Is the RCD module integrated into Sumi?
No. RCD (Random Corruption Denoising) is a complementary module reported by PulseAugur in June 2026. Sumi uses standard uniform noise. RCD integration would be a natural evolution for a future version.
✅ Conclusion
Sumi will not kill autoregression. But it irrefutably proves that diffusion for text is no longer a niche academic exercise — it is a viable paradigm at scale, driven by Google (DiffusionGemma) as well as NVIDIA (Nemotron tri-mode) and now by public research. The code is available on GitHub, the paper on arXiv is open: it has never been easier to understand what comes after next-token prediction.