Multi-Stream LLMs: why the future of AI agents lies in parallel processing
🔎 The sequential wall that AI agents just broke through
Since the arrival of GPT-3, all major language models have shared the same architectural flaw: they process information in a strictly sequential manner. One token after another, one thought after another, one action after another. It's like asking a senior developer to read a Jira ticket, then close their eyes to write the code, then open their eyes again to test it. Absurd, but that's exactly what our best AI agents do today.
On May 12, 2026, a paper published on arXiv (2605.12460) by Guinan Su, Yanwu Yang, Xueyan Li, and Jonas Geiping proposes a radical paradigm shift: Multi-Stream LLMs. The idea? To allow a model to generate, read, and reflect on multiple streams simultaneously, while maintaining the causal dependency that ensures coherence.
This is not a speed optimization. It's a change in the very nature of what an AI agent can do. An agent that reads a document while drafting a summary and planning its next step: that didn't exist before this paper. And it changes everything for automation, coding, and multi-agent architectures.
The essentials
- Current LLMs are locked into a single sequential stream: they cannot generate tokens while consuming new inputs, which traps autonomous agents in inefficient loops.
- Multi-Stream LLMs introduces parallel streams of thoughts, inputs, and outputs, where each forward pass reads from multiple streams and writes to multiple streams, with causal dependency preserved.
- Sequential instruction-tuning is replaced by a parallel-stream training format, paving the way for agents capable of acting and perceiving simultaneously.
- The implications are massive for coding agents, system monitoring, and any use case requiring real-time reactivity.
Recommended tools
| Model | Main usage | Agentic score (June 2025) | Ideal for |
|---|---|---|---|
| GPT-5.5 | High-end generalist agent | 98.2 | Complex tasks requiring reasoning and action |
| Gemini 3 Pro Deep Think | Multi-step deep reasoning | 95.4 | Document analysis + simultaneous planning |
| Claude Opus 4.7 (Adaptive) | Long-duration adaptive agent | 94.3 | Autonomous projects spanning several hours |
| GPT-5.4 Pro | Versatile agent, price/quality ratio | 91.8 | Standard enterprise automation |
| Kimi K2.6 | Self-hosted agent | 88.1 | On-premise deployment with full control |
Scores from the reference agentic ranking, June 2025. The models above are the natural candidates to integrate a multi-stream architecture in their future versions.
The problem: why sequential processing blocks everything
The current architecture is a bottleneck
A classic LLM works like a linear reader. At each forward pass, the model takes as input the sequence of tokens accumulated so far, and produces exactly one output token. It cannot, in the middle of generation, "look at" a new document that just arrived. It must first finish its thought, then we inject the new context into a subsequent prompt, then it resumes.
This limitation comes from causal attention itself — each token can only look at previous tokens in the same sequence. It's a historical design choice, not a law of physics. But it has dramatic consequences for agents.
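To make the constraint concrete, here is a minimal sketch of a causal attention mask in plain Python. The function name `causal_mask` is ours, chosen for illustration; real frameworks build the same triangular pattern with tensors.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Print the mask for a 4-token sequence: "x" = visible, "." = hidden.
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
```

Each row gains exactly one visible position, which is the "one token after another" behavior described above: nothing outside the single sequence can ever appear in a row.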
The sequential agent is a slow and expensive agent
Let's take a concrete scenario: a coding agent that needs to analyze a GitHub repo, identify a bug, write a fix, and run the tests. In the current architecture, each step is a separate loop. The agent reads the code (loop 1), then plans (loop 2), then writes the patch (loop 3), then reads the test results (loop 4), then adjusts (loop 5).
Each loop is a complete API call. Each call costs tokens, latency, and money. The 5 AI agent patterns that work try to optimize these loops, but they all hit the same wall: the model cannot do two things at once.
IBM also points out that current multi-agent architectures go beyond the single-agent setup by distributing tasks, but that each individual agent remains trapped in this sequential processing. It's like adding workers to a broken assembly line.
What Multi-Stream LLMs proposes
The simple definition
Multi-Stream LLMs replaces the single stream with multiple parallel streams that coexist within the same model. Concretely, during a single forward pass, the model simultaneously reads from multiple input streams and generates tokens in multiple output streams.
The technical key: causal dependency is preserved, but it now extends across the previous timesteps of all streams, not just one. The model "knows" what it generated in stream A when it generates in stream B, and vice versa.
How it works technically
The detailed paper on Paper Reading Club explains that each forward pass performs a multi-source read and a multi-destination write. Instead of a linear sequence (t1, t2, t3, ...), we have a stream matrix:
- Input Stream 1: the document being analyzed
- Input Stream 2: user or tool feedback
- Output Stream 1: the ongoing writing
- Output Stream 2: internal reasoning (chain-of-thought)
- Output Stream 3: tool calls / actions
Each forward pass at a timestep T can draw from all input streams available up to T, and write to all output streams. The authors make it clear that this is not naive parallelism — there is a cross-stream causal dependency that ensures coherence.
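The cross-stream rule can be sketched as a visibility table. This is an illustration under an assumption, not the paper's implementation: we assume a position at (stream, timestep) sees every stream at strictly earlier timesteps, plus itself, and never another stream's token at the same timestep. The name `multi_stream_mask` is hypothetical.

```python
from itertools import product


def multi_stream_mask(num_streams: int, num_steps: int) -> dict:
    """Map each (stream, timestep) position to the set of positions it may attend to.

    Assumed rule: all streams at strictly earlier timesteps, plus the position itself.
    """
    positions = list(product(range(num_streams), range(num_steps)))
    allowed = {}
    for s, t in positions:
        allowed[(s, t)] = {
            (s2, t2)
            for s2, t2 in positions
            if t2 < t or (s2, t2) == (s, t)
        }
    return allowed


mask = multi_stream_mask(num_streams=2, num_steps=3)
# Stream 0 at timestep 1 sees stream 1 at timestep 0, but not stream 1 at timestep 1.
print((1, 0) in mask[(0, 1)], (1, 1) in mask[(0, 1)])
```

With a single stream this collapses to the usual triangular causal mask, which is one way to see that the multi-stream rule generalizes the sequential one rather than replacing it.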
The training change: from sequential to parallel
The summary from Hugging Face Papers highlights a crucial point: this isn't just an inference change. Moving from sequential instruction-tuning to a parallel-stream format requires complete retraining. The training data must be restructured to present examples where the AI learns to manage multiple streams concurrently.
It's a heavy investment, but it's the price to unlock a capability that simply didn't exist before.
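What might a parallel-stream training example look like? The sketch below is purely illustrative: the stream names, the `<pad>` filler, and the `ParallelExample` class are our assumptions, not the format used in the paper. It only shows the idea of aligning several streams on shared timesteps instead of concatenating them into one sequence.

```python
from dataclasses import dataclass, field

PAD = "<pad>"  # hypothetical filler for a stream that is idle at a timestep


@dataclass
class ParallelExample:
    # Stream name -> tokens, in timestep order.
    streams: dict[str, list[str]] = field(default_factory=dict)

    def aligned(self) -> list[dict[str, str]]:
        """Pad every stream to the same length and return one row per timestep."""
        length = max((len(tokens) for tokens in self.streams.values()), default=0)
        return [
            {
                name: (tokens[t] if t < len(tokens) else PAD)
                for name, tokens in self.streams.items()
            }
            for t in range(length)
        ]


example = ParallelExample({
    "input_doc": ["def", "f", "(", ")"],
    "reasoning": ["check", "args"],
    "output_code": ["fix"],
})
for row in example.aligned():
    print(row)
```

A sequential instruction-tuning record would flatten these three lists into one prompt/response pair; here each row is one timestep across all streams.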
Concrete use cases: what agents will be able to do
Coding agents that read and write at the same time
The most obvious case is autonomous coding. An agent based on GPT-5.5 (agentic score 98.2) or Claude Opus 4.7 Adaptive (94.3) could, with a multi-stream architecture, read its test results in one stream while fixing the code in another. No more "read → stop → write → restart" loop.
Imagine an agent that continuously monitors a CI/CD pipeline (input stream), fixes failures in real time (code output stream), all while maintaining a log of its decisions (reasoning output stream). It's the shift from the "batch" agent to the "true streaming" agent.
Real-time monitoring and response
A monitoring agent that receives continuous logs (input stream 1), metric alerts (input stream 2), generates diagnostics (output stream 1) and triggers remediations (output stream 2) — all within a single model, without switching latency between modes.
IBM identifies this pattern as essential for enterprise agentic architectures, where one agent can specialize in NLP while another handles computer vision. Multi-Stream LLMs allows a single agent to play this multi-specialist role.
Conversational agents that think without making you wait
Today, when you ask a complex question to Claude Sonnet 4.6 (agentic score 81.4), the model "thinks" and then answers. With multi-stream, the internal reasoning could occupy a dedicated stream while the response stream starts producing the first safe elements. The user perceives immediate reactivity, even on difficult questions.
Impact on multi-agent architectures
Fewer agents, more capability per agent
The current approach to dealing with sequential limitations is to multiply agents. A reader agent, a writer agent, a reviewer agent. As explained in our article on multi-agents: making multiple AIs collaborate, this distribution comes at a cost in coordination, latency, and complexity.
Multi-Stream LLMs reduces the need for this fragmentation. A single multi-stream agent can internalize tasks that previously required three separate agents. It doesn't make multi-agent obsolete — it makes it more efficient, because each individual agent is more capable.
Agent configuration becomes more granular
For frameworks like OpenClaw, where we configure agents with SOULs, AGENTS, and Skills, the multi-stream architecture opens up new possibilities. An agent could have a dedicated stream per skill, dynamically activated or deactivated depending on the context. The agent's SOUL (its personality and goals) would remain the guiding thread across all streams.
Which LLM will take advantage of multi-stream?
This is the central question. The models best positioned to integrate this architecture are those that already dominate agentic leaderboards. OpenAI's GPT-5.5 (98.2) and Google's Gemini 3 Pro Deep Think (95.4) have the necessary reasoning depth. For local deployment, self-hosted Kimi K2.6 (88.1) and Z.AI's GLM-5 Reasoning (82) are natural candidates if their teams adopt the multi-stream training format.
Selecting the best LLM for AI agents will soon take on a new criterion: native multi-stream support.
Comparison table: sequential vs multi-stream
| Criterion | Sequential LLM (current) | Multi-Stream LLM (proposed) |
|---|---|---|
| Input flow | Single stream, blocking | Multiple streams, simultaneous reading |
| Output flow | 1 token per forward pass | Tokens in multiple streams per forward pass |
| Causal dependency | On the single sequence | Cross-dependent across all streams, preserved |
| Agent during generation | Blind to new inputs | Can integrate new inputs in real time |
| Agent loops | Multiple sequential API calls | One continuous multi-stream call |
| Cost per complex task | High (N loops × tokens) | Reduced (1 continuous stream) |
| Perceived latency | High (waiting between loops) | Low (progressive multi-stream response) |
| Training required | Standard instruction-tuning | New parallel stream format |
| Availability | Immediate (all LLMs) | Research (May 2026), no production |
What this changes for developers
APIs will evolve
Today, an LLM's API is simple: you send a prompt, you receive a stream of tokens. With multi-stream, the API will have to expose multiple input and output channels. Developers will need to think in terms of "stream routes" rather than "single prompt".
This is a shift in abstraction as significant as the transition from completion mode to chat mode in 2023. Agent frameworks will have to adapt.
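No such API exists yet, so the following is only a thought experiment about what "stream routes" could look like from the client side. Every name here (`StreamSession`, `push`, `step`, `emit`, the stream labels) is hypothetical, and no model is called.

```python
from collections import deque
from typing import Optional


class StreamSession:
    """Hypothetical client-side view of a multi-stream call (no real API exists)."""

    def __init__(self, inputs: list[str], outputs: list[str]):
        self.inputs = {name: deque() for name in inputs}
        self.outputs = {name: [] for name in outputs}

    def push(self, stream: str, chunk: str) -> None:
        """Queue new input on a named stream without waiting for generation to stop."""
        self.inputs[stream].append(chunk)

    def step(self) -> dict[str, Optional[str]]:
        """Hand over at most one pending chunk per input stream for the next timestep."""
        return {name: (q.popleft() if q else None) for name, q in self.inputs.items()}

    def emit(self, stream: str, token: str) -> None:
        """Record a token the model produced on a named output stream."""
        self.outputs[stream].append(token)


session = StreamSession(inputs=["ci_logs", "user_feedback"], outputs=["patch", "reasoning"])
session.push("ci_logs", "test_login FAILED")
print(session.step())
session.emit("reasoning", "login test is failing")
print(session.outputs)
```

The point of the sketch is the shape of the interface: inputs arrive on named channels at any time, and outputs are read per channel, instead of one prompt in and one token stream out.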
Orchestration patterns change
Current patterns — ReAct, Plan-and-Execute, Reflection — are all designed around sequential loops. With multi-stream, new patterns emerge:
- Stream-and-Act: the agent acts in one stream while it perceives in another
- Parallel Reflection: critical reasoning is applied in real time on the generation, not a posteriori
- Continuous Planning: the plan updates dynamically as inputs arrive, without interrupting execution
These patterns do not yet exist in the literature. The May 2026 paper formally opens the field.
The cost of inference
A multi-stream forward pass is more computationally expensive than a classic sequential forward pass, because the model has to process more data per step. But the loops it eliminates should more than compensate. For a task that currently requires 5 sequential loops, a single multi-stream flow may suffice, and the total cost drops as long as the per-pass overhead stays smaller than the savings from the loops removed.
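A back-of-the-envelope model shows when that trade-off pays off. All numbers below are made up for illustration, the `overhead` factor is an assumption (the paper does not give one here), and the sequential model ignores prompt caching, which real APIs may offer.

```python
def sequential_cost(step_inputs: list[int], step_outputs: list[int]) -> int:
    """Tokens processed when each loop re-reads the full accumulated context."""
    total = 0
    context = 0
    for new_input, output in zip(step_inputs, step_outputs):
        context += new_input
        total += context + output  # re-processed prompt + generated tokens
        context += output          # the output becomes context for the next loop
    return total


def multi_stream_cost(step_inputs: list[int], step_outputs: list[int], overhead: float) -> float:
    """Tokens processed once in a single continuous pass, scaled by an assumed overhead."""
    return overhead * (sum(step_inputs) + sum(step_outputs))


# Hypothetical 5-loop coding task: read, plan, patch, read test results, adjust.
inputs = [4000, 0, 0, 1000, 0]
outputs = [200, 300, 800, 0, 400]

seq = sequential_cost(inputs, outputs)
multi = multi_stream_cost(inputs, outputs, overhead=2.0)
print(seq, multi)
print("break-even overhead:", seq / (sum(inputs) + sum(outputs)))
```

In this toy setup the sequential agent processes 27,000 tokens against 13,400 for the multi-stream pass, and multi-stream stays cheaper until its per-pass overhead reaches roughly 4x. With different task shapes the break-even point moves, which is why the claim should be read as conditional.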
Limitations and open questions
Is cross-causal dependency really scalable?
The paper claims that causal dependency is preserved across streams. But in practice, the more streams you add, the more complex the cross-attention matrix becomes. The authors do not show results beyond a small number of streams. The question of whether this architecture scales to 10 or 20 parallel streams remains open.
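To get a feel for the scaling question, we can count how many (query, key) pairs the attention has to cover. This is a sketch under the assumption that each position sees all streams at strictly earlier timesteps plus itself; the function name `attention_pairs` is ours, and the count says nothing about how the paper actually implements attention.

```python
def attention_pairs(num_streams: int, num_steps: int) -> int:
    """Count visible (query, key) pairs under the assumed cross-stream causal rule."""
    # A position at timestep t sees num_streams * t earlier positions, plus itself.
    per_stream = sum(num_streams * t + 1 for t in range(num_steps))
    return num_streams * per_stream


baseline = attention_pairs(1, 1000)
for streams in (1, 2, 10, 20):
    pairs = attention_pairs(streams, 1000)
    print(streams, pairs, round(pairs / baseline, 1))
```

For a fixed number of timesteps, the count grows roughly with the square of the number of streams, so going from 1 to 20 streams multiplies the attention work by a few hundred. Whether quality holds at that width is the part only experiments can answer.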
Retraining is a major obstacle
Adopting multi-stream is not just a simple API change. The model needs to be retrained with a new data format. For a model like GPT-5.5 or Gemini 3 Pro Deep Think, this represents an investment of several million dollars. Teams from DeepSeek (DeepSeek V4 Pro, overall score of 88) or Moonshot AI (Kimi K2.6) might be more agile on this point.
The quality of multi-stream generation
Generating in multiple streams simultaneously could degrade the quality of each individual stream. The model's attention is shared — does the reasoning stream suffer when the code stream is active? The paper does not provide a fine-grained analysis of per-stream degradation.
❌ Common mistakes
Mistake 1: Confusing multi-stream with batching
Batching involves processing multiple independent requests in parallel to optimize GPU usage. Multi-stream is a single request with multiple interdependent internal flows. It's not the same thing. Batching improves server throughput; multi-stream improves the agent's capabilities.
Mistake 2: Thinking it replaces multi-agent
Multi-Stream LLMs make each agent more capable, but do not replace the multi-agent architecture. For truly distributed tasks — an NLP agent and a vision agent working on heterogeneous data, as described by IBM — distribution remains necessary. Multi-stream and multi-agent are complementary, not competing.
Mistake 3: Believing it's deployable tomorrow
The paper is a research contribution from May 2026. No production model natively supports multi-stream today. Developers who try to "simulate" multi-stream by multiplexing rapid sequential API calls will not reproduce the cross-causal behavior described in the paper. We have to wait for model providers to integrate the architecture.
❓ Frequently asked questions
Are Multi-Stream LLMs compatible with open source models like Ollama?
Not yet. The architecture requires specific retraining in a parallel stream format. Current open source models operate sequentially. But it's a natural fit for running open source AI agents locally with Ollama once multi-stream weights are published.
Which current model is closest to multi-stream?
None implement it natively. But Gemini 3 Pro Deep Think (95.4 agentic) and GPT-5.5 (98.2) have internal reasoning mechanisms that resemble separate streams (thinking vs output). It's an approximation, not the architecture described in the paper.
Will it be more expensive in inference?
The individual forward pass is more expensive, but the total number of passes decreases drastically. On a complex agentic task, the total cost should go down. On a simple prompt without multi-stream, the cost is identical to a classic sequential model.
Doesn't the causal dependency between streams create deadlocks?
No, because causality is unidirectional in time: each stream at timestep T depends on the timesteps up to T-1 of all streams, not on the timestep T of other streams. There is no possible circular dependency.
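The no-deadlock argument can be checked mechanically on a small case. The sketch below builds the dependency graph under the rule stated in this answer (each position depends on all streams at earlier timesteps) and uses Python's standard `graphlib` to look for cycles. The helper names are ours.

```python
from graphlib import CycleError, TopologicalSorter


def dependency_graph(num_streams: int, num_steps: int) -> dict:
    """Map each (stream, timestep) node to the nodes it depends on."""
    graph = {}
    for t in range(num_steps):
        for s in range(num_streams):
            graph[(s, t)] = {
                (s2, t2) for t2 in range(t) for s2 in range(num_streams)
            }
    return graph


def has_deadlock(graph: dict) -> bool:
    """Return True if the dependencies contain a cycle."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True


print(has_deadlock(dependency_graph(num_streams=3, num_steps=4)))

# Counter-example: let two streams depend on each other at the SAME timestep.
broken = dependency_graph(num_streams=2, num_steps=2)
broken[(0, 1)].add((1, 1))
broken[(1, 1)].add((0, 1))
print(has_deadlock(broken))
```

The first graph orders cleanly because every edge points backward in time. The second one, which breaks the rule by adding same-timestep dependencies, produces a cycle: that is exactly the situation the design avoids.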
✅ Conclusion
Multi-Stream LLMs is the first paper to tackle head-on the true bottleneck of AI agents: sequential processing. By allowing a model to read, reflect, and act on parallel streams with preserved causal dependency, it opens the door to agents that no longer simulate reactivity but actually have it. Multi-agent systems and the best autonomous AI agents will gain in power when this architecture lands in production. If you build agents today, prepare your architectures for a multi-stream world: it may arrive faster than we think.