Unlocking Working Memory: this research shows how LLMs can reason without generating tokens

AI Agents 🟢 Beginner ⏱️ 14 min read 📅 2026-05-30

🔎 LLM reasoning is decoupling from text generation

Since chain-of-thought (CoT) prompting took hold in 2022, the reasoning of large language models has gone through a single path: the autoregressive generation of tokens. The model "thinks out loud," and every reasoning token is computed and billed like any other output token. This is a financial drain and an architectural constraint that the community had come to take for granted.

A paper published on May 28, 2026, on arXiv just demonstrated that this coupling is not necessary. The RiM (Reasoning in Memory) approach proposes a "latent working memory" that allows an LLM to iteratively refine its internal representation without emitting a single intermediate token. The results surpass existing latent reasoning methods on reasoning benchmarks.

This is not a marginal paper. The Hugging Face Papers page shows active community discussion, and the summary on AI Models FYI confirms that RiM works across different model families and sizes. Latent reasoning is moving from an experimental concept to a generalizable method.


The Essentials

  • Current LLM reasoning (CoT, ToT) relies on generating intermediate tokens, which linearly increases costs and latencies with problem complexity.
  • RiM (Reasoning in Memory) introduces a latent working memory: the model iterates over an internal representation without producing tokens, eliminating step-level supervision.
  • Results outperform previous latent reasoning approaches on reasoning benchmarks, and do so across several model families.
  • For AI agent developers, this means background reasoning that does not consume visible context and costs a fraction of the price of classic CoT.

| Tool | Main Usage | Price (June 2025, check website) | Ideal for |
| --- | --- | --- | --- |
| Claude Opus 4.7 (Adaptive) | Complex reasoning, agents | Paid (Pro/Team subscription) | Adaptive reasoning with CoT |
| GPT-5.5 | General and agentic reasoning | Paid (ChatGPT Plus subscription) | Tasks requiring high test-time compute |
| Gemini 3 Pro Deep Think | Deep reasoning | Paid (Gemini Advanced subscription) | Multi-step problems |
| Ollama | Run models locally | Free | Testing latent reasoning locally |
| DeepSeek V4 Pro (Max) | Cost-effective reasoning | Paid (API) | Developers sensitive to inference costs |

The problem: why chain-of-thought is an architectural dead end

Chain-of-thought was a revolution. But it carries a fundamental limitation: every step of reasoning is a token, and every token costs money, time, and memory bandwidth.

Concretely, when you ask GPT-5.5 to solve a complex logic problem, the model potentially generates thousands of intermediate tokens before arriving at the answer. You pay for every generated token, including those that lead to dead ends or erroneous reasoning that the model later corrects. It's the equivalent of paying someone to think out loud, including their hesitations.

LLM billing is directly impacted by this mechanism. Output tokens are systematically more expensive than input tokens. The longer a model "thinks," the higher the bill. This linearity between problem complexity and resolution cost is a ceiling for the massive adoption of advanced reasoning.

Tree-of-thought (ToT) makes the problem even worse by exploring several branches of reasoning in parallel. Each branch generates its own tokens. Test-time compute explodes, and with it, infrastructure costs.
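The cost mechanics described above can be sketched with a little arithmetic. This is a minimal illustration under stated assumptions: the price of $10 per million output tokens is hypothetical, and the tree-of-thought model assumes every branch is fully expanded with no pruning, which real implementations usually avoid.

```python
def cot_cost(reasoning_tokens: int, answer_tokens: int, usd_per_million: float) -> float:
    """Cost of one chain-of-thought call: reasoning tokens are billed as output."""
    return (reasoning_tokens + answer_tokens) * usd_per_million / 1_000_000


def tot_cost(tokens_per_thought: int, branching: int, depth: int,
             answer_tokens: int, usd_per_million: float) -> float:
    """Cost of a fully expanded tree-of-thought (assumption: no pruning)."""
    # Thoughts generated over all levels: b + b^2 + ... + b^depth
    thoughts = sum(branching ** level for level in range(1, depth + 1))
    return (thoughts * tokens_per_thought + answer_tokens) * usd_per_million / 1_000_000


PRICE = 10.0  # hypothetical USD per million output tokens, not a real price list

cot = cot_cost(reasoning_tokens=2000, answer_tokens=100, usd_per_million=PRICE)
tot = tot_cost(tokens_per_thought=200, branching=3, depth=3,
               answer_tokens=100, usd_per_million=PRICE)

print(f"CoT: ${cot:.3f} per call")  # 2100 billed tokens
print(f"ToT: ${tot:.3f} per call")  # 39 thoughts * 200 + 100 = 7900 billed tokens
```

Doubling the reasoning length doubles the CoT bill, while adding one level to the tree multiplies the number of thoughts by roughly the branching factor.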


What RiM fundamentally changes

RiM proposes completely decoupling reasoning from generation. The idea: the model has a latent memory space in which it iterates over its internal representation, block by block, without ever producing intermediate tokens.

The latent working memory mechanism

The model receives a prompt, projects its understanding into a latent vector, and then iteratively refines this representation through several passes of "working memory". Each memory block takes the previous representation and improves it. No token is emitted during this process.
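The loop described above can be sketched as a toy in plain Python. It is not the paper's architecture: the random weights stand in for trained parameters, and the names `memory_block`, `reason_in_latent_space` and `readout` are hypothetical. The point is only the control flow: several refinement passes over a state vector, with a single decode at the end.

```python
import math
import random


def make_matrix(rows: int, cols: int, rng: random.Random) -> list[list[float]]:
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def memory_block(state, prompt_vec, w_state, w_prompt):
    """One refinement pass: new state from the previous state and the prompt encoding."""
    mixed = [a + b for a, b in zip(matvec(w_state, state), matvec(w_prompt, prompt_vec))]
    return [math.tanh(x) for x in mixed]


def reason_in_latent_space(prompt_vec, blocks, readout):
    state = [0.0] * len(prompt_vec)
    emitted_tokens = []  # stays empty: nothing is decoded during refinement
    for w_state, w_prompt in blocks:
        state = memory_block(state, prompt_vec, w_state, w_prompt)
    logits = matvec(readout, state)  # decode once, after the last block
    answer_id = max(range(len(logits)), key=logits.__getitem__)
    return answer_id, emitted_tokens


rng = random.Random(0)
DIM, VOCAB, N_BLOCKS = 8, 5, 4  # toy sizes
blocks = [(make_matrix(DIM, DIM, rng), make_matrix(DIM, DIM, rng)) for _ in range(N_BLOCKS)]
readout = make_matrix(VOCAB, DIM, rng)
prompt_vec = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

answer_id, emitted = reason_in_latent_space(prompt_vec, blocks, readout)
print("intermediate tokens emitted:", len(emitted))
```

Each block here has its own weights, which mirrors the "distinct memory blocks" idea; sharing one set of weights across passes would instead give a plain recurrent loop.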

This is conceptually close to how a human thinks: you do not verbalize every step of your reasoning. You mentally turn a problem over and over, and then you produce an answer. RiM reproduces this pattern in the model's latent space.

The elimination of step-level supervision

Previous approaches to latent reasoning often required supervision at each step: intermediate reasoning steps had to be annotated to train the model. RiM eliminates this constraint. The model learns to refine its representation in a self-supervised manner, based solely on the correct final answer.

This difference is major for scalability. You no longer need massive datasets with annotated step-by-step reasoning. You can train with (question, answer) pairs and let the model discover its own internal iterations.
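The difference in data requirements can be illustrated by comparing two loss functions. This is a sketch under assumptions: the article does not give RiM's actual training objective, so `outcome_only_loss` simply shows what "supervised by the final answer only" means in its most basic form, next to a hypothetical step-supervised counterpart.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def step_supervised_loss(step_logits, step_targets, answer_logits, answer):
    """Needs an annotated target for every intermediate reasoning step."""
    steps = sum(cross_entropy(l, t) for l, t in zip(step_logits, step_targets))
    return steps + cross_entropy(answer_logits, answer)


def outcome_only_loss(answer_logits, answer):
    """Needs only a (question, answer) pair; intermediate states are unconstrained."""
    return cross_entropy(answer_logits, answer)


# Toy logits: a model that is undecided between two classes everywhere.
undecided = [0.0, 0.0]
full = step_supervised_loss([undecided, undecided], [0, 1], undecided, 0)
final_only = outcome_only_loss(undecided, 0)
print(f"step-supervised: {full:.4f}, outcome-only: {final_only:.4f}")
```

The first function cannot even be called without `step_targets`, which is exactly the annotation burden the outcome-only setup avoids.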


Latent reasoning: the state of the art before RiM

Latent reasoning did not appear with this paper. Several works have explored this path, with mixed results.

Meta's recurrent approach

Meta published a paper on reasoning in a continuous latent space, discussed on LessWrong. The approach uses a recurrent architecture that iteratively refines the latent representation at test time. The deep-dive by deep-diver details this recurrent architecture and its performance.

The key difference with RiM: Meta's approach remains closer to a classic recurrent loop, while RiM explicitly introduces the concept of "working memory" with distinct memory blocks that follow one another. It is more structured, and the results show it.

Concerns raised by the community

The discussion on LessWrong raises interesting points. When a model reasons in a latent space, this reasoning is opaque. With CoT, you can at least read the intermediate steps and identify where the model derails. With latent reasoning, you get an answer without traceability of the mental process.

For critical applications (medical, legal, financial), this opacity is a real problem. But for the majority of use cases (information retrieval, code generation, data analysis), the traceability of reasoning is a luxury, not a necessity.


What this concretely changes for developers

Drastic reduction in inference costs

This is the most immediate impact. If a model like Claude Opus 4.7 or GPT-5.5 can solve a problem in 5 latent iterations instead of 2000 CoT tokens, the inference cost mechanically drops. Output tokens are the heaviest cost line in LLM billing.

For a developer running AI agents in loops, this saving is not marginal. It can be an order-of-magnitude change. By this reasoning, an agent working through complex tasks 100 times a day could see its API bill divided by 5 to 10, depending on the complexity of the tasks and on how providers end up pricing latent iterations.

Background reasoning without consuming context

This is perhaps the most underestimated implication. With CoT, every reasoning token takes up the context window. If your agent reasons at length, it eats its own context, leaving less room for useful information.

With RiM, reasoning happens in a separate latent space. The visible context is not polluted by intermediate steps. An agent can "think" intensely while preserving the entirety of its context for truly relevant data.
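A rough budget calculation makes the effect on the context window visible. This is a simplified sketch: the window size, token counts and the `context_left` helper are hypothetical, and it assumes reasoning tokens from earlier turns stay in the context, which depends on the provider.

```python
def context_left(window: int, system_and_data: int, turns: int,
                 answer_tokens: int, visible_reasoning_tokens: int) -> int:
    """Tokens still available after `turns` exchanges (assumption: nothing is trimmed)."""
    used = system_and_data + turns * (answer_tokens + visible_reasoning_tokens)
    return max(window - used, 0)


WINDOW = 128_000  # hypothetical context window

with_cot = context_left(WINDOW, system_and_data=20_000, turns=10,
                        answer_tokens=500, visible_reasoning_tokens=3_000)
with_latent = context_left(WINDOW, system_and_data=20_000, turns=10,
                           answer_tokens=500, visible_reasoning_tokens=0)

print("left with visible CoT:     ", with_cot)     # 128000 - 55000
print("left with latent reasoning:", with_latent)  # 128000 - 25000
```

In this toy scenario the agent recovers 30,000 tokens of context after ten turns, room that can go to retrieved documents or conversation history instead.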

This opens the door to much more sophisticated agent architectures, capable of maintaining long conversations with users while performing complex reasoning in the background. For the best autonomous AI agents, this is a paradigm shift.

Impact on agentic RAG architectures

The LatentRAG paper precisely explores this intersection: latent reasoning in an agentic RAG context. LatentRAG distinguishes pure generation tasks from tasks requiring the emission of sub-query tokens (for example, formulating a search query).

The RiM + LatentRAG combination suggests an architecture where the agent reasons latently to plan its actions, only emits tokens for necessary communications (queries to a search engine, final answers), and consumes a minimum of context. This is the agent architecture toward which the field appears to be converging.
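The control flow of such an agent can be sketched as follows. Everything here is a stand-in: `latent_plan` is a hypothetical placeholder for latent reasoning (it only counts passes and applies a trivial rule), and `search` is whatever retrieval function the caller supplies. The sketch shows which steps emit text and which do not, not how either paper implements them.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    emitted: list[str] = field(default_factory=list)  # the only text that leaves the model
    latent_passes: int = 0                            # silent refinement passes


def latent_plan(question: str, evidence: list[str], passes: int, trace: AgentTrace) -> str:
    """Placeholder for latent reasoning: picks the next action without emitting text."""
    trace.latent_passes += passes
    return "search" if not evidence else "answer"


def run_agent(question: str, search, passes: int = 5, max_steps: int = 4):
    trace = AgentTrace()
    evidence: list[str] = []
    for _ in range(max_steps):
        action = latent_plan(question, evidence, passes, trace)
        if action == "search":
            trace.emitted.append(f"search: {question}")  # sub-query tokens are visible
            evidence.extend(search(question))
        else:
            answer = f"answer based on {len(evidence)} document(s)"
            trace.emitted.append(answer)                 # final answer is visible
            return answer, trace
    return "no answer within step budget", trace


def fake_search(query: str) -> list[str]:
    """Toy retrieval function used only for this example."""
    return [f"document about {query}"]


answer, trace = run_agent("latent reasoning", fake_search)
print(answer)
print("visible messages:", len(trace.emitted), "| latent passes:", trace.latent_passes)
```

Only two strings ever reach the context in this run (one sub-query, one answer), while ten planning passes happen silently.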


Implications for current models

Which models could benefit from it?

Based on the results compiled on AI Models FYI, RiM works across different model families and sizes. This means the approach is not limited to a specific architecture.

The models that would benefit the most are those already geared towards complex reasoning. GPT-5.5 (agentic score 98.2), Gemini 3 Pro Deep Think (95.4) and Claude Opus 4.7 Adaptive (94.3) are natural candidates. Their reasoning capabilities are already advanced; RiM could make them much more efficient.

For LLM-based research tools like Perplexity or NotebookLM, latent reasoning could transform the experience: faster, cheaper answers, with reasoning happening in the background that doesn't clutter the interface.

And what about local models?

This is where it gets really interesting. One of the major bottlenecks for local LLMs is the slowness of reasoning. A 70B parameter model using CoT on a MacBook Pro generates tokens slowly, and the reasoning suffers as a result.

With RiM, the model could iterate in the latent space without generating tokens, which is pure matrix computation and potentially much faster on a consumer GPU. For AI agents with Ollama, this could mean local agents capable of complex reasoning without the unbearable latencies of CoT.

If you are considering installing a local LLM, keep an eye out for open source implementations of RiM. This is potentially the killer feature that makes local agents truly usable.


Comparison with existing reasoning approaches

| Approach | Mechanism | Reasoning cost | Context consumption | Traceability | Speed |
| --- | --- | --- | --- | --- | --- |
| Chain-of-Thought | Sequential token generation | Full output rate | High (each token occupies the context) | Complete | Slow |
| Tree-of-Thought | Token-based branch exploration | Very high (multiple branches) | Very high | Complete (per branch) | Very slow |
| Latent reasoning (Meta) | Recurrent loop in the latent space | Low (no output tokens) | None | None | Fast |
| RiM (this paper) | Latent working memory per block | Low (no output tokens) | None | None | Fast |
| LatentRAG | Latent reasoning + subquery tokens only | Moderate | Low | Partial (subqueries visible) | Moderate |

The table is clear: RiM and latent reasoning approaches dominate on cost, context consumption, and speed. They lose on traceability. The trade-off depends entirely on your use case.


The current limitations of the RiM approach

The opacity of reasoning

This is the fundamental trade-off. When GPT-5.5 produces a 3000-token CoT, you can read each step, identify logical errors, and potentially intervene. With RiM, you get an answer without access to the internal reasoning process.

For agent debugging, this is a challenge. When an agent makes a bad decision, understanding why is essential. Latent reasoning makes this diagnosis much more difficult. Developers will need new interpretability tools specific to the latent space.

Generalization to all tasks

The paper shows solid results on reasoning benchmarks. But "pure" reasoning (formal logic, mathematics, planning) is a subset of the tasks that LLMs accomplish. For code generation, for example, CoT has an advantage: the model can explain its structuring logic, which is useful for maintainability.

For French LLMs, the question of the quality of latent reasoning in languages that are less represented in the training data remains open. Is reasoning in the latent space as robust when the model has to produce an answer in a secondary language?

The required infrastructure

Paradoxically, although RiM reduces token costs, it requires a different computing infrastructure. Iterations in the latent space are dense matrix operations that put a different strain on the GPU. API providers will have to adapt their infrastructure to optimize this type of computation, which will not happen overnight.


The research ecosystem around latent reasoning

This paper is not an isolated case. It is part of a clear movement in research towards decoupling reasoning and generation.

The LatentRAG paper shows that the community is actively exploring how to integrate latent reasoning into broader architectures, particularly RAG systems. The distinction between tasks requiring sub-query tokens and purely internal tasks is a conceptual framework that will shape the next generation of agents.

Meta's approach to continuous latent reasoning, along with the ethical concerns it raises, shows that the field is maturing. The community is no longer content with just measuring performance: it is questioning the implications of invisible reasoning.

deep-diver's deep-dive on recurrent architecture offers a complementary technical perspective, focusing on the scaling mechanisms of test-time compute without intermediate tokens.

RiM positions itself as the most accomplished synthesis of these different avenues: an architecture that is structured (block-based working memory), generalizable (across several model families), and free of the constraint of step-level supervision.


What developers should do now

Short term: optimize the use of existing CoT

RiM is not yet deployed in commercial APIs. In the meantime, optimize your use of CoT. Use models that offer adaptive reasoning like Claude Opus 4.7 (Adaptive), which automatically adjusts the depth of reasoning based on complexity. Limit CoT to tasks that actually benefit from it.

For simple tasks, disable extended reasoning. Every unnecessary CoT token is wasted money. Free tiers such as those of Gemini or ChatGPT sometimes offer simplified reasoning modes that are sufficient for trivial cases.
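One way to apply this advice in an agent is to decide on a reasoning budget before each call. The heuristic below is deliberately naive and entirely hypothetical: the keyword list, the `reasoning_budget` function and the budget values are illustrations, not any provider's API. In practice the returned number would be mapped to whatever reasoning control the chosen model exposes.

```python
# Hypothetical markers of tasks that tend to benefit from step-by-step reasoning.
REASONING_MARKERS = ("prove", "plan", "debug", "step", "why", "optimi")


def reasoning_budget(task: str, per_marker: int = 1000, max_budget: int = 4000) -> int:
    """Return a reasoning-token budget for `task`; 0 means 'answer directly'."""
    text = task.lower()
    hits = sum(marker in text for marker in REASONING_MARKERS)
    return min(max_budget, per_marker * hits)


simple = reasoning_budget("What is the capital of France?")
complex_task = reasoning_budget("Debug this scheduler and explain why it deadlocks")

print("simple task budget: ", simple)
print("complex task budget:", complex_task)
```

A production router would more likely rely on a small classifier or on past outcomes per task type, but the principle is the same: spend reasoning tokens only where they are likely to change the answer.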

Medium term: monitor implementations

The Hugging Face Papers page is the best place to track open source implementations of RiM. As soon as an efficient implementation is available, test it on your specific use cases with a local LLM before integrating it into production.

For agent hosting, solutions like Hostinger offer the necessary GPU infrastructure to experiment with these approaches at a controlled cost.

Long term: rethink your agent architecture

Latent reasoning fundamentally changes the architecture of AI agents. If reasoning no longer consumes context and no longer costs tokens, you can design agents that think much more, much more often, and about much more complex problems.

Imagine a search agent that iterates 50 times in latent space on the reformulation of a query before submitting it to a search engine. The result would be significantly higher accuracy, for a marginal cost. This is the type of architecture that RiM makes possible.


❌ Common mistakes

Mistake 1: Confusing latent reasoning with a simple internal summary

This is not a summary of the CoT. The model does not "compress" its thoughts. It iterates on a mathematical representation in a high-dimensional space. The process is fundamentally different from generating and then compressing text. This mistake leads to underestimating the power of the approach.

Mistake 2: Thinking that RiM completely replaces CoT

RiM is superior on pure reasoning benchmarks. But for tasks where traceability is required (audit, debugging, education), CoT remains irreplaceable. The mistake is wanting to migrate everything to latent reasoning. The right approach is hybrid: latent for performance, CoT for explainability.

Mistake 3: Ignoring infrastructure constraints

Fewer tokens does not mean "runs on anything". Latent iterations are dense computations that require sufficient VRAM and appropriate GPU memory bandwidth. The mistake is deploying RiM on undersized infrastructure assuming that "no tokens = no resources".
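A back-of-the-envelope check helps avoid this mistake. The sketch below only counts the memory needed to hold the weights, since latent iterations still run through the full model. The 20% overhead factor for activations and runtime buffers is an assumption, not a measured figure, and `fits_in_memory` is a hypothetical helper.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


def fits_in_memory(params_billion: float, bits_per_param: int,
                   available_gb: float, overhead: float = 1.2) -> bool:
    """Rough check; `overhead` is an assumed margin for activations and buffers."""
    return weight_memory_gb(params_billion, bits_per_param) * overhead <= available_gb


fp16 = weight_memory_gb(70, 16)  # 70B parameters at 16 bits each
int4 = weight_memory_gb(70, 4)   # same model quantized to 4 bits

print(f"70B fp16 weights:  {fp16:.0f} GB")
print(f"70B 4-bit weights: {int4:.0f} GB")
print("4-bit fits in 48 GB:", fits_in_memory(70, 4, available_gb=48))
print("fp16 fits in 48 GB: ", fits_in_memory(70, 16, available_gb=48))
```

Removing intermediate tokens changes how long the model runs, not how much memory it takes to load it.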


❓ Frequently Asked Questions

Does RiM work on all LLMs?

Based on the results on AI Models FYI, RiM works across different model families and sizes. However, it requires specific fine-tuning. You cannot enable it as a simple parameter on an existing model without preparation.

Is latent reasoning faster than CoT?

Potentially, yes. Each latent iteration is a dense matrix operation that skips the sample-and-append step of autoregressive decoding, although the iterations themselves still run one after another. Latency grows with the number of iterations, but with a handful of iterations it should stay well below that of an equivalent CoT of hundreds or thousands of tokens.

Can RiM be combined with RAG?

This is exactly what the LatentRAG paper explores. The agent reasons in latent space to plan, emits tokens only for search sub-queries, and then reasons in latent space again on the results. It is the most promising architecture for next-generation RAG agents.

What are the risks of invisible reasoning?

Opacity makes debugging difficult and raises security questions. Without access to intermediate steps, detecting biased or erroneous reasoning is more complex. It is an accepted trade-off for non-critical use cases, but problematic for regulated domains.


✅ Conclusion

The RiM paper marks a tipping point: LLM reasoning no longer needs to go through token generation. Latent working memory paves the way for faster, cheaper agents capable of reasoning in the background without polluting their context. Developers who understand this architecture will have a decisive advantage in designing the next generation of AI agents.