
Gated DeltaNet-2: the Yejin Choi paper that solves the oldest problem of linear attention

LLM & Models 🟢 Beginner ⏱️ 14 min read 📅 2026-05-23


🔎 Linear attention had a design flaw that no one dared to fix

For years, linear attention has promised to replace the ever-growing KV cache of Transformers with a fixed-size recurrent state. On paper, it's the miracle solution: infinite context, constant memory, fast inference.

Except there was a problem. Every time a linear attention model wanted to update a memory in its compressed state, it simultaneously destroyed adjacent associations. It's like trying to erase a word in a sentence and having to cross out the entire paragraph.

On May 14, 2026, an NVIDIA team led by Yejin Choi, Ali Hatamizadeh, and Jan Kautz published Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention on arXiv. Their proposal is surgical: decoupling erase and write operations via distinct channel-wise gates. An architectural fix that could redefine how the best LLMs for AI agents manage their long-term memory.

This is not just another model. It's a fix to a mechanism that dozens of papers had copied without questioning it.


The essentials

  • Gated DeltaNet-2 decouples the erase and write operations in linear attention, a problem that all previous delta-rule models (Gated DeltaNet, Kimi Delta Attention or KDA) ignored.
  • The paper is authored by Yejin Choi, Ali Hatamizadeh, and Jan Kautz (NVIDIA), with an official PyTorch implementation available on GitHub.
  • Results show clear superiority in long-context language modeling and retrieval tasks compared to Mamba2, Gated DeltaNet v1, and Kimi Delta Attention.
  • The main intended impact: stateful models and agents with persistent memory, where memory editing accuracy is critical.

Tool | Main usage | Price (June 2025, check on site) | Ideal for
--- | --- | --- | ---
GatedDeltaNet-2 (GitHub) | PyTorch implementation of the model | Free (Apache 2.0) | Research and prototyping
Gated DeltaNet-2 (HuggingFace) | Paper page and benchmarks | Free | Analyzing results
Gemini 3 Pro Deep Think | Research paper analysis | Free to $20/month | Understanding alternative implementations
GPT-5.5 | Agent with long reasoning | $20 to $200/month | Comparing agentic approaches

What linear attention really is (and why it's holding things back)

Standard Transformer attention computes a softmax weight matrix between every pair of tokens. That costs O(n²) in time and, when the matrix is materialized, in memory. On top of that, the KV cache grows with every token: when your context goes from 8K to 1M tokens, it becomes 125 times larger.
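
To put a number on the quadratic term, here is a small arithmetic sketch. The 2-byte value size is an assumption (half precision), and `score_matrix_bytes` is a hypothetical helper, not something taken from the paper.

```python
def score_matrix_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Memory for one head's full n x n score matrix, if it were materialized."""
    return n_tokens * n_tokens * bytes_per_value

for n in (8_000, 1_000_000):
    print(f"{n:>9,} tokens -> {score_matrix_bytes(n) / 1e9:,.3f} GB per head")

# Context grows 125x, the number of pairwise scores grows 125 * 125 times.
growth = score_matrix_bytes(1_000_000) // score_matrix_bytes(8_000)
print(growth)  # 15625
```

Fused kernels such as FlashAttention avoid storing this matrix in full, but the number of query-key products they have to compute still grows quadratically.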

Linear attention replaces this matrix with a recurrent accumulation: each new token updates a fixed-size state of dimension d. Constant cost, regardless of context length. It's the same logic as an RNN, but formulated within the Transformer framework.
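
The recurrent accumulation can be sketched in a few lines of plain Python. This is the simplest additive form (no feature map, no normalization, no gating), with toy dimensions chosen for illustration; `step` and `read` are hypothetical names.

```python
def step(state, k, v):
    """S_t = S_{t-1} + v_t k_t^T : add the outer product of value and key."""
    return [[s + vi * kj for s, kj in zip(row, k)] for row, vi in zip(state, v)]

def read(state, q):
    """o_t = S_t q_t : query the state with a vector q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in state]

d_k, d_v = 3, 2
state = [[0.0] * d_k for _ in range(d_v)]  # fixed size: d_v x d_k

tokens = [
    ([1.0, 0.0, 0.0], [1.0, 2.0]),  # (key, value) pairs
    ([0.0, 1.0, 0.0], [3.0, 4.0]),
    ([0.0, 0.0, 1.0], [5.0, 6.0]),
]
for k, v in tokens:
    state = step(state, k, v)

print(read(state, [0.0, 1.0, 0.0]))  # [3.0, 4.0], the value stored under the second key
print(len(state), len(state[0]))     # 2 3, unchanged however many tokens were absorbed
```

The keys here are orthogonal, so retrieval is exact. With overlapping keys, reads start to mix values, which is where the rest of the article picks up.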

The problem isn't the compression itself. It's what happens when you need to modify what has been compressed.

A fixed-size recurrent state encodes thousands of tokens into a dense vector. Each dimension of this state carries partial information about many different tokens. When the model decides that a piece of information is obsolete and needs to be erased, it applies a decay factor (a "forget gate"). But this factor is a scalar — a single number that multiplies the entire state or an entire channel.

Result: you erase the target, but you also attenuate everything that shares the same dimensions in the compressed state. This is the fundamental flaw that Subquadratic attempted to address with SubQ and its 12 million tokens of context, but through a different architectural path.
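
In formula form, and assuming the usual notation for gated linear attention (the article itself gives no equations), a scalar forget gate \alpha_t \in (0, 1) acts like this:

```latex
S_t = \alpha_t S_{t-1} + v_t k_t^{\top}
\quad\Longrightarrow\quad
S_t = \sum_{i=1}^{t} \Bigl(\prod_{j=i+1}^{t} \alpha_j\Bigr) v_i k_i^{\top}
```

Every stored association v_i k_i^T is multiplied by the same product of gates, whether or not it was the one the model wanted to forget.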


The scalar bug: why erase and write cannot share the same gate

All delta-rule models share an implicit assumption: erasing and writing are two sides of the same operation. You reduce the old content, you add the new. A single scalar controls both.

This is wrong. And the paper formally demonstrates it.

Erasure acts on the existing state, a multidimensional space where each dimension encodes mixtures of different information. Writing injects a new vector into this state. The first operation removes what the state already holds, the second adds what it does not hold yet, and nothing in the memory dynamics requires both to have the same strength.

When Gated DeltaNet v1 or Kimi Delta Attention drive both operations with a single scalar, they force an artificial correlation: "if I write a lot, I erase a lot." Or vice versa. The cost of this constraint shows up in retrieval tasks, when the model must replace specific information without disturbing neighboring associations.
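
The coupling is easiest to see in the gated delta rule as it is commonly written for Gated DeltaNet. The notation below is an assumption, since the article does not reproduce the equations: \alpha_t is the decay and \beta_t is the single update strength.

```latex
S_t = \alpha_t S_{t-1}\bigl(I - \beta_t k_t k_t^{\top}\bigr) + \beta_t v_t k_t^{\top}
    = \alpha_t S_{t-1}
      \underbrace{-\,\beta_t \bigl(\alpha_t S_{t-1} k_t\bigr) k_t^{\top}}_{\text{erase}}
      \underbrace{+\,\beta_t v_t k_t^{\top}}_{\text{write}}
```

The same \beta_t scales both the erase term and the write term, so the model cannot ask for a strong erase with a weak write, or the reverse.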

The comparison published on Digg summarizes the progression well: each successive variant (Mamba2 → Gated DeltaNet → Kimi Delta Attention → Gated DeltaNet-2) adds finer control over the decay, erasure, and write operations. But it is Gated DeltaNet-2 that crosses the threshold by completely separating erasure from writing.


What Gated DeltaNet-2 actually does

Two channel-wise gates instead of a scalar

Gated DeltaNet-2 introduces two distinct gating mechanisms, each operating at the channel level (by dimension of the state vector):

An erase gate that independently controls how much each dimension of the existing state should be attenuated. And a write gate that independently controls how much of the new content is injected into each dimension.

This is not a cosmetic modification. It is a redesign of how the recurrent state interprets updates. The model can now decide: "I want to strongly erase dimension 47 (which carried the old information) but moderately write into dimensions 12, 23, and 47 (which encode the new context)."
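
As a rough illustration only, here is one way such an update could look. The article does not give the paper's equations, so the form below is an assumption and should not be read as the authors' formulation: `decoupled_update`, `erase` and `write` are hypothetical names, and the gates are applied per key channel.

```python
def matvec(S, x):
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]

def decoupled_update(S, k, v, erase, write, decay=1.0):
    """Hypothetical step: S_new = decay*S - (decay*S k)(erase*k)^T + v (write*k)^T.

    S is a d_v x d_k state; erase and write are independent gates of length d_k.
    """
    S = [[decay * s for s in row] for row in S]
    old = matvec(S, k)                          # what the state currently returns for k
    ek = [e * kj for e, kj in zip(erase, k)]    # per-channel erase direction
    wk = [w * kj for w, kj in zip(write, k)]    # per-channel write direction
    return [
        [s - o * e + vi * w for s, e, w in zip(row, ek, wk)]
        for row, o, vi in zip(S, old, v)
    ]

def tied_update(S, k, v, beta, decay=1.0):
    """Classic gated delta rule: one scalar beta for both erase and write."""
    S = [[decay * s for s in row] for row in S]
    old = matvec(S, k)
    return [
        [s - beta * o * kj + beta * vi * kj for s, kj in zip(row, k)]
        for row, o, vi in zip(S, old, v)
    ]

S = [[1.0, 2.0, 3.0]]
k = [1.0, 0.0, 1.0]
v = [10.0]

# Erase only in channel 0, write only (and moderately) in channel 2.
print(decoupled_update(S, k, v, erase=[1.0, 0.0, 0.0], write=[0.0, 0.0, 0.5]))
# [[-3.0, 2.0, 8.0]]

# With identical, uniform gates the sketch falls back to the tied delta rule.
print(decoupled_update(S, k, v, [0.5] * 3, [0.5] * 3) == tied_update(S, k, v, 0.5))
# True
```

In a real layer the gates would be produced by learned projections of the input rather than written by hand.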

The legacy of Gated DeltaNet and KDA

Gated DeltaNet-2 does not start from scratch. It inherits two key mechanisms from its predecessors:

Adaptive forgetting from Gated DeltaNet, which allows the model to dynamically decide which parts of memory should decay. And channel-wise decay from KDA (Kimi Delta Attention), which applies this decay at the dimension level rather than on the entire state.

What DeltaNet-2 adds is the formal generalization of these ideas with the explicit erase/write separation. The authors present it as a unification: Gated DeltaNet and KDA become special cases of DeltaNet-2 with additional constraints.

The official implementation on GitHub shows that the computational overhead is minimal — a few extra multiplications per channel, negligible compared to the gain in memory quality.


Why it matters for AI agents

Persistent memory is the Achilles' heel of current agents

Look at the ranking of the best LLMs for AI agents. GPT-5.5 dominates at 98.2, followed by Gemini 3 Pro Deep Think at 95.4 and Claude Opus 4.7 at 94.3. They all operate with a classic KV cache that grows linearly with the conversation.

An agent running for 8 hours accumulates hundreds of thousands of context tokens. The KV cache becomes unmanageable. Current solutions — periodic summarization, sliding window, external RAG — are workarounds, not solutions.

Linear attention offers the constant memory that agents need. But it only becomes viable if the model can edit its memory with precision. An agent that "forgets" the wrong information when it learns new information is not reliable. This is exactly the bug that DeltaNet-2 fixes.

Stateful models: the real target

Stateful models maintain a recurrent state between sessions. No reset between two requests, no re-reading of the context. The state is the memory. In this regime, every erasure error is permanent. Every destroyed association does not come back.

DeltaNet-2 is architecturally designed for this scenario. Channel-wise gates allow for surgical updates to the state, exactly what a stateful agent needs to maintain coherent memory over extended sessions.
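
What "the state is the memory" means in practice can be shown with a toy example. This sketch uses the plain additive update rather than DeltaNet-2, and `RecurrentMemory` is a hypothetical class written for illustration: the state is saved at the end of one session and reloaded at the start of the next, with no re-reading of earlier tokens.

```python
import json

class RecurrentMemory:
    """Toy additive linear-attention state that survives between sessions."""

    def __init__(self, d_k, d_v):
        self.state = [[0.0] * d_k for _ in range(d_v)]

    def absorb(self, pairs):
        for k, v in pairs:
            for row, vi in zip(self.state, v):
                for j, kj in enumerate(k):
                    row[j] += vi * kj

    def read(self, q):
        return [sum(s * qj for s, qj in zip(row, q)) for row in self.state]

    def save(self):
        return json.dumps(self.state)

    @classmethod
    def load(cls, blob):
        state = json.loads(blob)
        memory = cls(len(state[0]), len(state))
        memory.state = state
        return memory

session_1 = [([1.0, 0.0], [2.0]), ([0.0, 1.0], [3.0])]
session_2 = [([1.0, 0.0], [5.0])]

memory = RecurrentMemory(d_k=2, d_v=1)
memory.absorb(session_1)
blob = memory.save()                      # end of session 1

resumed = RecurrentMemory.load(blob)      # start of session 2, no context replay
resumed.absorb(session_2)

replay = RecurrentMemory(d_k=2, d_v=1)    # reference: process everything at once
replay.absorb(session_1 + session_2)

print(resumed.state == replay.state)      # True
print(resumed.read([1.0, 0.0]))           # [7.0]
```

The last read returns 7.0 rather than 5.0: a purely additive state stacks the new value on top of the old one instead of replacing it. With no reset to fall back on, that kind of error stays in the state, which is why precise erasure matters so much in this regime.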

This is relevant when you see that Kimi K2.6 in self-host version reaches 88.1 in the agentic ranking. Open-weight models with linear attention are gaining ground in the agentic ecosystem, and DeltaNet-2 could accelerate this trend.


Results: what benchmarks really show

Long-context language modeling

The results published on HuggingFace show that Gated DeltaNet-2 outperforms its predecessors in language modeling on long sequences. The erase/write separation improves the model's ability to maintain long-range dependencies without degrading them during intermediate updates.

The gain is not limited to short sequences: it is most visible once the context length exceeds the point where previous delta-rule models start to "saturate" their state and overwrite old information.

Retrieval tasks

This is where the impact is clearest. Retrieval from a recurrent state requires the model to locate specific information among thousands of compressed tokens. With a scalar erase gate, information adjacent to the target is degraded at every update.

DeltaNet-2, with its channel-wise gates, better preserves neighboring information during erasure. Retrieval scores increase significantly compared to Gated DeltaNet v1 and Mamba2.
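
The effect on a neighboring association can be reproduced on a toy state. The sketch below is an assumption-laden illustration, not the paper's benchmark: values are scalars, keys are unnormalized, the gates are picked by hand, and `update` is a hypothetical helper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update(state, k, v, erase, write):
    """Hypothetical gated delta step for scalar values, with per-channel gates."""
    pred = dot(state, k)  # what the state currently returns for key k
    return [s - e * kj * pred + w * kj * v
            for s, kj, e, w in zip(state, k, erase, write)]

k_target = [1.0, 1.0, 0.0]
k_neighbor = [0.0, 1.0, 1.0]   # shares channel 1 with the target key

state = [1.0, 0.0, 2.0]        # reads 1.0 for the target, 2.0 for the neighbor
new_value = 5.0

# One scalar strength for every channel (0.5 = 1 / |k_target|^2, a full replacement).
tied = update(state, k_target, new_value, [0.5] * 3, [0.5] * 3)

# Channel-wise gates that confine the edit to channel 0, which only the target uses.
gated = update(state, k_target, new_value, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0])

print(dot(tied, k_target), dot(tied, k_neighbor))    # 5.0 4.0
print(dot(gated, k_target), dot(gated, k_neighbor))  # 5.0 2.0
```

Both updates store the new value for the target, but the scalar version shifts the neighbor's read from 2.0 to 4.0 because the edit spills into the shared channel. The channel-wise version leaves it at 2.0.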

Comparison with alternative approaches

Model | Attention type | Gating | Long-context | Retrieval
--- | --- | --- | --- | ---
Mamba2 | Linear (SSM) | Scalar decay, no erase/write gates | Good | Moderate
Gated DeltaNet v1 | Delta-rule | Scalar (erase=write tied) | Very good | Good
Kimi Delta Attention | Delta-rule | Partial channel-wise | Very good | Good
Gated DeltaNet-2 | Delta-rule | Decoupled channel-wise | Superior | Superior

What this implies for the future of architectures

The end of quadratic attention is not a matter of "if"

O(n²) softmax attention is a legacy from the "Attention Is All You Need" (2017) paper. Nine years later, it is still dominant, but physical constraints are becoming impossible to ignore. A model like SubQ with its 12 million context tokens shows that the industry is actively seeking alternatives.

DeltaNet-2 does not claim to replace softmax attention tomorrow. It solves a specific problem with linear attention that made it impractical for demanding use cases. It is another step toward viable infinite-context architectures.

When OpenAI solves the Erdős problem with an AI model, we see the reasoning capabilities of LLMs reach unprecedented levels. But these long reasoning processes require huge contexts. Linear attention with precise memory editing is a natural candidate to support this type of task without costs exploding.

Similarly, the best LLMs for search like Perplexity or NotebookLM accumulate entire documents in context. A mechanism that can add and remove sources without degrading the rest of the compressed memory is directly relevant to them.


DeltaNet-2's position in the linear model ecosystem

Not a model, an architectural building block

To be precise: Gated DeltaNet-2 is not an LLM you can query. It's an attention mechanism that could be integrated into the next generations of models. The PyTorch implementation is a module that a researcher or lab can integrate into an existing architecture.

This distinction matters. The best LLMs in the overall ranking — Gemini 3.1 Pro (92), GPT-5.5 (91), Claude Opus 4.7 (90) — all use softmax attention variants. DeltaNet-2 does not replace them. It offers a credible alternative for labs looking to build the next generation without the quadratic burden.

Open-weight and the local ecosystem

The implementation is open-source. For the community of best LLMs to run locally, this is significant. Models with linear attention are naturally better suited for local deployment because their memory footprint is predictable and bounded. A DeltaNet-2 model with a 4096-dimensional state always consumes the same memory, whether it processes 100 tokens or 1 million.
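
A back-of-the-envelope comparison makes the bounded footprint concrete. All model dimensions below are hypothetical (32 layers, 8 heads, 128-dimensional keys and values, 2 bytes per value) and are not taken from the paper.

```python
def recurrent_state_bytes(n_layers=32, n_heads=8, d_k=128, d_v=128, bytes_per_value=2):
    """One d_k x d_v state per head and layer; note there is no n_tokens argument."""
    return n_layers * n_heads * d_k * d_v * bytes_per_value

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """One key and one value vector per token, head and layer."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value

for n in (100, 1_000_000):
    print(f"{n:>9,} tokens: state {recurrent_state_bytes() / 1e6:.1f} MB, "
          f"KV cache {kv_cache_bytes(n) / 1e6:,.1f} MB")

# Under these assumptions the two footprints are equal at 64 tokens.
print(kv_cache_bytes(64) == recurrent_state_bytes())           # True
print(kv_cache_bytes(1_000_000) // recurrent_state_bytes())    # 15625
```

With this configuration the recurrent state stays at about 8.4 MB, while the KV cache passes it after 64 tokens and reaches roughly 131 GB at one million tokens.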

For those following local LLM installation guides with Ollama or LM Studio, the arrival of models based on DeltaNet-2 could mean long-context models that fit on consumer hardware without compromising quality.


One often underestimated aspect: a model's memory quality directly affects its training. The General Preference RL paper unifies reinforcement learning and preference optimization for LLMs. But these training methods assume a model capable of maintaining consistent reward signals over long sequences.

An attention mechanism that crushes associations during training introduces noise into the gradients. DeltaNet-2, by better preserving information during updates, could indirectly improve the training stability of RLHF and related methods. This is speculative, but consistent with the direction research is taking.


Limitations and what the paper does not solve

It is not a universal solution

DeltaNet-2 solves a specific problem of linear attention. It does not solve the question of representation itself. A fixed-size state cannot encode an infinite amount of information without loss, regardless of the sophistication of the gates. Compression remains compression.

The paper implicitly acknowledges this by focusing on retrieval tasks and long-sequence modeling, rather than on pure reasoning tasks where raw representational capacity is the limiting factor.

The ecosystem is not ready

Integrating DeltaNet-2 into a production pipeline requires profound changes. Inference frameworks (vLLM, TensorRT-LLM) are optimized for softmax attention. GPU kernels for linear attention are less mature. And pre-existing training data was generated by softmax models.

The delay between the publication of a promising architecture and its adoption in the best free LLMs available to the public is measured in years, not months.

French-language models and availability

For now, nothing indicates that DeltaNet-2 will be prioritized for the best French-language LLMs. Linear attention research is dominated by English-speaking labs, and the training datasets reflect this bias. Models based on DeltaNet-2 could even temporarily widen the quality gap for underrepresented languages.


❌ Common mistakes

Mistake 1: Confusing DeltaNet-2 with a language model

DeltaNet-2 is an attention mechanism, not an LLM. You cannot download it and chat with it. It is an architectural building block designed to be integrated into future models. The GitHub implementation provides the module, not a pre-trained model.

Mistake 2: Thinking that linear attention replaces softmax attention everywhere

Linear attention excels on long sequences and stateful models. For short-reasoning tasks with limited context, softmax attention remains superior in absolute quality. DeltaNet-2 does not make softmax obsolete — it makes linear attention competitive where it wasn't.

Mistake 3: Equating DeltaNet-2 with Mamba or SSMs

Mamba2 is a State Space Model. DeltaNet-2 is a delta-rule model with explicit gating. They share the idea of a fixed-size recurrent state, but their mathematical formalism is different. Comparing them directly as "SSMs" is a category error that the Towards AI article rightly avoids by referring to a "write-and-edit memory model."

Mistake 4: Believing that channel-wise gates solve the state capacity problem

A 4096-dimensional state can only encode a finite amount of information, regardless of the precision of the gates. DeltaNet-2 improves the quality of the edit, not the raw storage capacity. If you compress 10 million tokens into 4096 floats, there is inevitably loss. The gates reduce the loss during updates, they do not eliminate it.
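
The capacity limit shows up even in two dimensions. A d-dimensional key space holds at most d mutually orthogonal keys, so the (d+1)-th key has to overlap with earlier ones. The sketch below uses a plain delta-rule write with scalar values and unit-norm keys; `store` is a hypothetical helper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def store(state, k, v):
    """Delta-rule write with full strength: afterwards, reading k returns v."""
    pred = dot(state, k)
    return [s + kj * (v - pred) for s, kj in zip(state, k)]

state = [0.0, 0.0]                       # d = 2
state = store(state, [1.0, 0.0], 10.0)   # first key
state = store(state, [0.0, 1.0], 20.0)   # second key, orthogonal to the first
print(dot(state, [1.0, 0.0]), dot(state, [0.0, 1.0]))  # 10.0 20.0, both exact

# A third unit-norm key cannot be orthogonal to both in two dimensions.
state = store(state, [0.6, 0.8], 30.0)
print(round(dot(state, [0.6, 0.8]), 3))  # 30.0, the newest item is exact
print(round(dot(state, [1.0, 0.0]), 3))  # 14.8, was 10.0
print(round(dot(state, [0.0, 1.0]), 3))  # 26.4, was 20.0
```

Better gates can choose where the damage goes and how much of it there is, but with more items than dimensions some loss is unavoidable.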


❓ Frequently Asked Questions

Who are the authors of Gated DeltaNet-2?

Yejin Choi, Ali Hatamizadeh, and Jan Kautz, all three researchers at NVIDIA. Yejin Choi is particularly known for her work on LLM reasoning and commonsense reasoning. The paper is published on arXiv (2605.22791).

What is the difference between Gated DeltaNet v1 and v2?

v1 used a single scalar gate linking erasure and writing. v2 decouples these two operations with distinct channel-wise gates, allowing independent control over each dimension of the recurrent state.

Can I use DeltaNet-2 in my projects?

The official PyTorch implementation is available on GitHub under the Apache 2.0 license. You can integrate the module into your architectures, but there is no pre-trained model based on DeltaNet-2 to date.

Will DeltaNet-2 make current models obsolete?

No. It is an architectural building block for future generations. Models like GPT-5.5, Claude Opus 4.7, or Gemini 3.1 Pro will continue to use softmax attention variants. The impact will be measured on models that deliberately choose linear attention for specific use cases.

How is this relevant for developers coding with LLMs?

For the best LLMs for coding like GPT-5.3 Codex, the interest is indirect. But if you are building autonomous agents that maintain a state between sessions, DeltaNet-2 represents the most promising architecture for precise and constant memory.


✅ Conclusion

Gated DeltaNet-2 will not make mainstream media headlines, but it solves an architectural problem that research had overlooked since the introduction of delta-rule models: the artificial coupling between erasure and writing in linear attention. By separating these two operations with channel-wise gates, Yejin Choi and her team provide stateful models and long-horizon agents with the memory precision they were missing. The rest will depend on the ecosystem: inference frameworks, training data, and labs willing to bet on linear attention. To follow the concrete evolution of these architectures, check out our monthly comparison of the best LLMs.