Attractor Models: the new architecture that beats Transformers on reasoning
🔎 Why Transformers might finally meet their limit
The Transformer architecture has dominated AI since 2017. Everything has been optimized for it: GPUs, frameworks, scaling infrastructures. Yet, a paper published on May 12, 2026, on arXiv proposes an alternative that doesn't just compete — it far outperforms Transformers at equivalent parameter counts.
The problem has been known for years: Transformers are fundamentally limited by their fixed depth. A token passes through a set number of layers, and then the model produces its response. No going back, no iterative refinement. It's like drafting an email without ever rereading it.
Looped Transformers, theorized at ICLR 2025, attempted to solve this by looping latent representations back on themselves. The idea was appealing: simulate implicit Chain-of-Thought by iterating T times over the same layers. In practice, however, these looped models were unstable during training and difficult to scale.
This is exactly the bottleneck that Attractor Models have just broken through. Their innovation: a backbone module that first proposes an output, followed by an attractor module that stabilizes it via a convergent attraction dynamic. The result is a Pareto improvement over standard Transformers — more performant for the same budget, cheaper for the same performance.
The key takeaways
- Attractor Models combine an iterative backbone with an attractor stabilization module, solving the instability of Looped Transformers.
- A 770M-parameter Attractor Model outperforms a 1.3B-parameter Transformer trained on twice as many tokens, with a 46.6% improvement in perplexity and 19.7% in accuracy on downstream tasks (arXiv 2605.12466).
- The architecture works just as well in large-scale pretraining as it does in reasoning with tiny models, paving the way for ultra-efficient local models.
- Training costs are reduced thanks to the reuse of the same parameters across multiple iterations, without the unstable side effects of previous looped approaches.
Reference models and tools
| Model / Tool | Type | Benchmark score | Relevant use case |
|---|---|---|---|
| Gemini 3.1 Pro | General LLM | 92 | Reference benchmark for reasoning tasks |
| GPT-5.5 | General LLM / Agentic | 91 (general) / 98.2 (agentic) | Comparison point for reasoning and agentic tasks |
| Claude Opus 4.7 | General LLM | 90 | Reference for long-form reasoning |
| DeepSeek V4 Pro | Code LLM / Reasoning | 88 | Comparison in parameter efficiency |
| Claude Sonnet 4.6 | General LLM | 83 | Mid-range model for downstream comparisons |
What an Attractor Model is — in simple terms
An Attractor Model is a language model that doesn't produce its answer in a single pass. It iterates over its own internal representation until it converges on a stable answer, loosely like a person turning a problem over before settling on a reply.
Specifically, the architecture breaks down into two distinct modules. The backbone module is a standard network (Transformer-type or otherwise) that takes a latent representation and produces a candidate output. The attractor module takes this output and pulls it back toward an equilibrium point — the attractor — by merging new information with the previous state.
This separation is crucial. In a classic Looped Transformer, the same network handles both the proposal and the update, which creates oscillations and divergences during training. The attractor decouples these two roles, and that is what makes the dynamics stable.
The concept of an attractor comes from dynamical systems theory. An attractor is a set of states toward which a system spontaneously evolves. Imagine a marble released into a bowl: it will oscillate, then stabilize at the bottom. The bottom of the bowl is the attractor. In an Attractor Model, the final answer is the bottom of the bowl — the state the system naturally reaches after several iterations.
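To make the marble-in-a-bowl picture concrete, here is a toy sketch in Python. It has nothing to do with the paper's actual model: the map x -> cos(x) is simply a textbook system with a single attracting fixed point, and `iterate_to_attractor` is a name chosen for this illustration. Wherever the marble starts, it ends up in the same place.

```python
import math

def iterate_to_attractor(update, x0, steps=100):
    """Apply `update` repeatedly and return the full trajectory of states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(update(xs[-1]))
    return xs

# Two very different starting points (the marble dropped at two places).
traj_a = iterate_to_attractor(math.cos, x0=-3.0)
traj_b = iterate_to_attractor(math.cos, x0=10.0)

final_a, final_b = traj_a[-1], traj_b[-1]
print(final_a, final_b)  # both end at the point where cos(x) == x
```

Both trajectories settle on the same value and the starting position is forgotten, which is the property the article attributes to the final state of an Attractor Model.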
This convergence property is empirically verified in the paper: latent representations stabilize after a finite number of iterations, without the gradient explosions that plagued previous recurrent architectures. The full details of the proof and experiments are available in the original paper and discussed in the Paper Reading Club.
Why recurrent approaches have failed so far
The history of recurrent architectures in NLP is a succession of unfulfilled promises. LSTMs and GRUs dominated before 2017, but their limited parallelization killed them in the face of Transformers. It is no coincidence that The Transformer Attractor describes Transformers and GPUs as a co-evolved ecosystem that creates an almost insurmountable technological lock-in.
Even after the dominance of Transformers, several attempts have sought to reintroduce recurrence. State Space Models like Mamba have shown promising results in linear inference, but have not surpassed Transformers on complex reasoning. Decoupled MoE architectures like UniPool explored other avenues for cost reduction, without challenging the fundamental paradigm either.
Looped Transformers, presented at ICLR 2025 in the paper Reasoning with Latent Thoughts, represented the most direct attack. Their theoretical insight was powerful: by passing the same sequence T times through the same layers, you implicitly simulate T steps of Chain-of-Thought, without any explicit reasoning tokens.
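The looping idea can be sketched in a few lines. This is a toy illustration with made-up scalar "layers", not the ICLR 2025 implementation; `looped_forward` and the two lambda layers are assumptions for the example. The only point is that N layers applied T times give N × T layer applications, while the number of distinct layers (and therefore of parameters) stays at N.

```python
def looped_forward(layers, h, T):
    """Apply the same stack of layers T times; return the state and the call count."""
    calls = 0
    for _ in range(T):
        for layer in layers:
            h = layer(h)
            calls += 1
    return h, calls

# Two toy "layers" standing in for real Transformer blocks (hypothetical).
layers = [lambda h: h + 1.0, lambda h: 0.5 * h]

h_out, effective_depth = looped_forward(layers, h=0.0, T=3)
print(len(layers), "distinct layers,", effective_depth, "layer applications")
```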
The problem? Training diverges. When you backpropagate through T iterations of the same parameters, gradients accumulate and eventually explode or vanish. It's the same problem as classic RNNs, but amplified by modern scale. The authors of the Attractor paper even cite this problem as the direct motivation for their work.
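A scalar toy case shows why backpropagating through a loop is fragile. Assume the simplest possible loop, h_{t+1} = w * h_t (an assumption made for illustration, not the paper's model). The chain rule then gives a gradient of w raised to the power T, which blows up or collapses as soon as w departs from 1.

```python
def gradient_through_loop(w, T):
    """d h_T / d h_0 for the scalar loop h_{t+1} = w * h_t, via the chain rule."""
    grad = 1.0
    for _ in range(T):
        grad *= w  # every iteration multiplies by the same local derivative
    return grad

exploding = gradient_through_loop(1.1, 50)  # local derivative slightly above 1
vanishing = gradient_through_loop(0.9, 50)  # local derivative slightly below 1

print(f"w=1.1: {exploding:.1f}   w=0.9: {vanishing:.5f}")
```

With only 50 iterations, a 10% deviation from 1 in either direction changes the gradient by more than two orders of magnitude. Real networks replace the scalar w with a Jacobian matrix, but the repeated-product structure is the same.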
The innovation of Attractor Models isn't having invented iteration — it's having found a way to stabilize it. The attractor module acts as a damper that prevents oscillations while preserving the benefit of iterative refinement.
The numbers: 46.6% better in perplexity
The numbers speak for themselves, and they are impressive. The paper reports three categories of results that, together, constitute what is known as a Pareto improvement: you gain on all axes simultaneously.
In terms of perplexity, Attractor Models improve by 46.6% compared to unstable Looped Transformers and significantly compared to standard Transformers at equivalent parameters. Perplexity measures the model's ability to predict the next token — the lower, the better. An improvement of this magnitude at an equivalent scale is rare in the literature.
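For readers who want the definition spelled out: perplexity is the exponential of the average negative log-probability that the model assigns to the correct tokens. The sketch below uses made-up probabilities, not figures from the paper.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model hesitating uniformly between 4 candidate tokens at every position.
uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that usually puts more probability on the right token.
more_confident = perplexity([0.5, 0.8, 0.25, 0.9])

print(uniform_over_four, more_confident)
```

A model that hesitates uniformly between four tokens has a perplexity of 4. Putting more probability on the right tokens pushes the value down, which is why lower is better.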
In downstream accuracy (classification, QA, reasoning tasks), the improvement reaches 19.7%. This means the model not only predicts the next word better, but that it builds higher-quality internal representations for subsequent tasks.
The most striking result is the head-to-head comparison across sizes: a 770M-parameter Attractor Model outperforms a 1.3B-parameter Transformer trained on twice as many tokens. In other words, the Attractor architecture achieves better performance with roughly 40% fewer parameters and half the training data. The implications in terms of compute costs are massive.
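These percentages follow directly from the figures quoted above. A quick check, using only the 770M and 1.3B parameter counts and the 2x token ratio given in the article:

```python
attractor_params = 770e6      # 770M-parameter Attractor Model
transformer_params = 1.3e9    # 1.3B-parameter Transformer baseline
token_ratio = 1 / 2           # the Transformer saw twice as many tokens

param_reduction = 1 - attractor_params / transformer_params
data_reduction = 1 - token_ratio

print(f"{param_reduction:.1%} fewer parameters, {data_reduction:.0%} less data")
```

The parameter reduction comes out at about 40.8%, and the data reduction at exactly 50%.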
These results are corroborated by the analysis from the Paper Reading Club, which emphasizes that the improvement is a true Pareto improvement: no compromise is necessary. The model isn't simply better on one axis by sacrificing another — it is better everywhere.
How it works technically
The architecture of an Attractor Model follows a two-step scheme that repeats over T iterations.
At each iteration t, the backbone module takes the latent state h_t and produces a candidate output y_t. This backbone can be a standard Transformer, an MLP, or any architecture capable of processing latent representations. The important thing is that it proposes an update to the state.
Then, the attractor module takes y_t and h_t, and produces the new state h_{t+1}. This is where the magic happens. Instead of directly applying y_t (which would cause instability), the attractor calculates a controlled combination that brings the state closer to an equilibrium point. The exact formulation involves a gating mechanism that doses the amount of new information injected at each iteration.
This gating mechanism is the key to stability. In the early iterations, a lot of new information is injected: the model "thinks" actively. As the state converges toward the attractor, the gating progressively reduces the updates, until the state stabilizes. This is analogous to the decaying oscillations of the marble in the bowl.
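The two-step scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation (which this article does not reproduce): `backbone` is a toy element-wise function, `attractor_step` is a plain convex combination with a fixed scalar `gate`, and all names are hypothetical. Here the gate itself stays constant, and the updates shrink because the candidate and the state move closer together at each iteration.

```python
import math

def backbone(h, x):
    """Toy stand-in for the backbone: proposes a candidate state y from h and input x."""
    return [math.tanh(0.5 * hi + xi) for hi, xi in zip(h, x)]

def attractor_step(h, y, gate=0.5):
    """Gated update: move h a fraction `gate` of the way toward the candidate y."""
    return [(1.0 - gate) * hi + gate * yi for hi, yi in zip(h, y)]

def run(x, T=200, gate=0.5):
    """Iterate backbone + attractor T times and record the size of each update."""
    h = [0.0] * len(x)
    update_sizes = []
    for _ in range(T):
        y = backbone(h, x)                   # step 1: propose
        h_next = attractor_step(h, y, gate)  # step 2: stabilize
        update_sizes.append(max(abs(a - b) for a, b in zip(h_next, h)))
        h = h_next
    return h, update_sizes

x = [0.3, -1.2, 2.0]
h_final, update_sizes = run(x)
print(update_sizes[0], update_sizes[10], update_sizes[-1])
```

In this toy setting the update size decays geometrically, and the final state is a fixed point: feeding `h_final` back into `backbone` returns (numerically) the same vector.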
The number of iterations T is not set arbitrarily. The paper shows that a convergence criterion can be used: when the norm of the difference between h_t and h_{t+1} falls below a threshold, we stop iterating. This means that easy examples require less computation than difficult examples — a highly sought-after property of adaptive computational efficiency.
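The stopping rule described above can be sketched like this. The threshold `tol`, the cap `max_iters`, and the helper `make_step` (a hypothetical contraction toward a target, standing in for one backbone + attractor iteration) are all assumptions made for the example.

```python
import math

def l2_distance(a, b):
    """Euclidean norm of the difference between two state vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def iterate_until_converged(step, h0, tol=1e-6, max_iters=1000):
    """Iterate `step` until ||h_{t+1} - h_t|| < tol; return the state and iterations used."""
    h = h0
    for n in range(1, max_iters + 1):
        h_next = step(h)
        if l2_distance(h_next, h) < tol:
            return h_next, n
        h = h_next
    return h, max_iters  # budget exhausted without meeting the threshold

def make_step(rate, target):
    """Hypothetical dynamics: shrink the distance to `target` by `rate` each step."""
    return lambda h: [ti + rate * (hi - ti) for hi, ti in zip(h, target)]

easy = make_step(rate=0.2, target=[1.0, -1.0])  # converges quickly
hard = make_step(rate=0.9, target=[1.0, -1.0])  # converges slowly

_, n_easy = iterate_until_converged(easy, [0.0, 0.0])
_, n_hard = iterate_until_converged(hard, [0.0, 0.0])
print(n_easy, n_hard)
```

The "easy" dynamics stop after far fewer iterations than the "hard" ones, which is the adaptive-compute behavior the paragraph describes: the loop spends compute only where the state is still moving.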
Pretraining vs reasoning: two regimes, one architecture
A crucial point of the paper is that Attractor Models operate in two distinct regimes, which makes them versatile.
In large-scale pretraining, the architecture directly replaces the Transformer as the next-token prediction backbone. The iterations allow the model to refine its understanding of the context before predicting. It is in this regime that the 46.6% perplexity improvement was measured.
In reasoning on tiny models, the architecture is used differently. We take a small model (a few tens of millions of parameters) and make it iterate many times to solve logic or math problems. In this regime, each iteration corresponds to a latent "thought step", similar to Chain-of-Thought but entirely internal to the model.
This duality is important because it opens up two distinct markets. For pretraining, Attractor Models could reduce the training costs of the next GPT-5.5 or Gemini 3.1 Pro, models that currently cost hundreds of millions of dollars to train. For local reasoning, they enable tiny models capable of complex reasoning, well suited to deployment on a local machine.
If you are interested in local models, our local LLM installation guide or our comparison of the best LLMs to run locally remain the references for the current ecosystem. Attractor Models could quickly join them.
Attractor Models vs Transformers vs Looped Transformers
To see things clearly, here is a comparison of the three architectures on the criteria that matter.
| Criterion | Standard Transformer | Looped Transformer | Attractor Model |
|---|---|---|---|
| Effective depth | Fixed (N layers) | Variable (N × T iterations) | Variable (adaptive convergence) |
| Training stability | Excellent | Poor (divergence) | Excellent (attractor dynamics) |
| Training cost | Baseline | Theoretically reduced, unstable in practice | Reduced (parameter reuse) |
| Quality at equal parameters | Baseline | Variable | 46.6% better perplexity, 19.7% better accuracy |
| GPU scalability | Optimized (parallel attention) | Partially compatible | Compatible (parallelizable backbone) |
| Latent reasoning | No (1 pass) | Yes (T latent thoughts) | Yes (T convergent thoughts) |
This table shows that Attractor Models are not simply a marginal improvement over Looped Transformers. They combine the stability of Transformers with the iterative benefit of recurrent architectures, while adding the adaptive convergence that neither of the two had.
What this implies for current models
The models at the top of the current rankings — Gemini 3.1 Pro (score 92), GPT-5.5 (score 91), Claude Opus 4.7 (score 90) — are all Transformers. Their agentic scores are even more impressive, with GPT-5.5 at 98.2 on the agentic benchmark.
The natural question is: could these models be even better with an Attractor architecture? The answer suggested by the paper is yes, and significantly. If a 770M Attractor beats a 1.3B Transformer, the projection to the scale of frontier models is considerable.
Take DeepSeek V4 Pro, already known for its parameter efficiency with a score of 88 in general. An Attractor version of this model could theoretically reach or exceed the scores of GPT-5.5 with significantly fewer parameters. Even Claude Sonnet 4.6 (score 83) or GLM-5.1 (score 83) could benefit from this architecture to close the gap with the leading models.
In agentic reasoning, the gains could be even more marked. The attractor module is intrinsically suited to multi-step reasoning — each iteration can correspond to a planning or evaluation step. For LLMs for agents, this architecture is particularly promising.
The limitations to keep in mind
Despite the impressive results, the paper has limitations that it would be dishonest to ignore.
First, the paper is a May 2026 preprint. The results have not yet been independently reproduced at scale. The history of AI is full of architectures that looked promising on paper but did not survive contact with real scale. Mamba and other SSM architectures are a partial example: promising, but not a replacement.
Second, iterative inference has a latency cost. A standard Transformer produces its response in a single forward pass. An Attractor Model requires T passes through its backbone. Reusing the same weights keeps the parameter count down, but every iteration still costs compute. For real-time applications, this trade-off can be a dealbreaker. Adaptive convergence mitigates the problem but does not eliminate it.
Compatibility with the current GPU ecosystem is also a question mark. As The Transformer Attractor points out, the hardware and software infrastructure has been entirely optimized for Transformers. CUDA kernels, distributed training frameworks, memory optimizations — everything is designed for attention and sequential feed-forward at a fixed depth. Attractor Models will require infrastructural adaptations.
Finally, the most spectacular results (770M vs 1.3B) are at a relatively modest scale. The question of whether the Pareto improvement holds at 100B+ parameters remains open. The paper mentions large-scale experiments, but the details are still patchy.
The link with AI business models
Beyond the technical aspects, Attractor Models have direct economic implications. If the architecture delivers on its promises at scale, it reduces the training cost of frontier models — and therefore the barrier to entry for new players.
Currently, only companies valued at tens of billions can afford to train a model competitive with GPT-5.5 or Gemini 3.1 Pro. If Attractor Models divide this cost by two or more, the landscape opens up. This is exactly the kind of disruption we analyze in our article on the 5 profitable business models around AI.
For hosts and infrastructure providers, the reduction in training costs could also modify the value chain. A 770M Attractor model that performs like a 1.3B Transformer consumes less VRAM, less inter-GPU bandwidth, and less energy. Cloud providers like Hostinger could benefit from this to offer more affordable AI instances.
The broader context: an architectural effervescence
Attractor Models do not appear out of nowhere. The 2025-2026 period is marked by a wave of architectural innovations, all seeking to move beyond the Transformer.
Alibaba's Qwen3.6 family pushed the boundaries of the standard Transformer with internal optimizations. Architectures like Decoupled MoE with UniPool explored the separation of routing depth. Search models like Perplexity and NotebookLM, which we compare in our guide to the best LLMs for research, optimized the architecture for RAG rather than changing the backbone.
But Attractor Models are potentially the most radical proposition because they challenge the fundamental paradigm: one pass, one prediction. If this architecture proves scalable, it could trigger a paradigm shift comparable to that of 2017.
For developers looking to prepare, getting familiar with the best LLMs for coding remains essential: today's tools will be the first to integrate these new architectures.
❌ Common mistakes
Mistake 1: Confusing Attractor Models and Looped Transformers
The most frequent mistake in discussions around this paper is reducing Attractor Models to mere stabilized Looped Transformers. This is inaccurate. The attractor module is a distinct architectural component with its own parameters and its own dynamics. Looped Transformers do not have a convergence mechanism — they iterate blindly. Attractor Models converge toward a stable state, which is fundamentally different.
Mistake 2: Thinking that iteration = classic Chain-of-Thought
Explicit Chain-of-Thought (CoT) generates reasoning tokens visible in the output. The iterations of an Attractor Model are entirely latent — no tokens are produced during the internal loops. The parallel with CoT is theoretical (simulating T steps of reasoning), not practical. Confusing the two leads to erroneous expectations regarding the model's interpretability.
Mistake 3: Believing that Attractor Models will replace Transformers tomorrow
The Transformer ecosystem has nearly a 9-year head start in hardware and software optimization. Even if the Attractor architecture is theoretically superior, the transition will take years. Models like GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7 are not going to disappear. More likely, Attractor Models will first be adopted by more agile players — like the open source and local LLM ecosystem — before potentially percolating up to proprietary models.
Mistake 4: Ignoring inference cost
Focusing solely on reducing training cost without considering inference latency is a strategic mistake. A model that is cheaper to train but 3x slower in production can be unusable for many use cases. Attractor Models have a clear advantage in training, but the inference trade-off must be evaluated on a case-by-case basis.
❓ Frequently asked questions
Can an Attractor Model run on my PC?
Theoretically yes, and it is even one of the most promising use cases. A 770M Attractor that performs like a 1.3B Transformer requires less VRAM. With tools like Ollama or LM Studio (see our installation guide), local deployment is feasible. In practice, we will have to wait for the weights to be published and adapted to local formats.
Are Attractor Models compatible with fine-tuning?
The paper focuses on pretraining and zero-shot reasoning. Compatibility with LoRA, QLoRA, and other fine-tuning methods is not explicitly discussed. This is an open research question, but the architecture does not present any obvious theoretical barrier to fine-tuning.
How do they compare with French models?
The best LLMs in French are currently all based on Transformers. Attractor Models could benefit Francophone models because the reduction in training costs would allow more budget to be dedicated to French data, an area where data scarcity is a limiting factor.
Do GPT-5.5 or Claude already use this architecture?
There is no public indication of this. The current models at the top of the leaderboards (GPT-5.5 at 98.2 in agentic, Gemini 3.1 Pro at 92) are very likely optimized Transformers. If a major player were to integrate Attractor Models, it would be a major competitive advantage that they would have no interest in revealing.
Do Attractor Models work for code?
The paper does not separate results by domain (code vs. natural language). However, latent iterative reasoning is particularly suited to code generation, where planning and internal verification are crucial. The best LLMs for coding like GPT-5.3 Codex (score 87) or DeepSeek V4 Pro (score 88) could benefit from this architecture.
✅ Conclusion
Attractor Models are the most compelling architectural proposition since the Transformer itself: an empirically verified Pareto improvement, with a 46.6% gain in perplexity and a 770M model that beats a 1.3B Transformer trained on twice as much data. If the architecture scales beyond the billion-parameter mark, it could redefine the competition among the best LLMs on the market. Until then, keep a close eye on open source implementations — that is where Attractor Models will strike first.