LongSeeker: The Research Agent That Beats Tongyi DeepResearch by Managing Its Memory Intelligently
🔎 Why Research Agents Saturate Mid-Session
So-called "deep research" agents have a structural problem. The more they explore, the more their context explodes. Every page visited, every click, every snippet added to the context window eventually pollutes the agent's working memory.
Result: incoherent responses, hallucinations at the end of the session, and skyrocketing compute costs. It's an invisible wall that most users don't see, but which drastically limits the real depth of these tools.
A team of researchers (Yijun Lu, Rui Ye, Yuwen Du, Jiajun Wang, Songhua Liu, Siheng Chen) just published a solution on arXiv on May 6, 2026. LongSeeker proposes a radically different approach: instead of blindly accumulating, the agent actively decides what it keeps, summarizes, isolates, or discards.
The score speaks for itself: 61.5% on BrowseComp compared to 43.2% for Tongyi DeepResearch. That's a gain of 18.3 percentage points on a benchmark widely used to measure research capability on the open web.
The Essentials
- LongSeeker introduces Context-ReAct, a paradigm with 5 atomic operations to dynamically manage a research agent's working memory.
- Fine-tuned from Qwen3-30B-A3B on 10,000 synthetic trajectories, it reaches 61.5% on BrowseComp, where Tongyi DeepResearch plateaus at 43.2%.
- The Compress operator is proven to be expressively complete: it can simulate any other context management operation.
- The agent no longer suffers from context saturation: it orchestrates it elastically, like a conductor deciding which instruments play at any given moment.
Recommended Tools
| Tool | Main Usage | Price (May 2026, check website) | Ideal for |
|---|---|---|---|
| LongSeeker | Long-horizon web research with elastic context management | Open-source (model weights) | Researchers and agent developers |
| Qwen3-30B-A3B | Base model for agent fine-tuning | Open-source | Projects requiring low-cost reasoning |
| ByteDance's DeerFlow | Open-source agent for long-term research, code, and creation | Open-source | Multi-step code + research projects |
The Problem: Context Explosion Kills Research Agents
Current research agents operate on a simple but fragile principle. They visit pages, extract content, add it to their prompt, and continue. This linear mechanism has a hard limit.
After about ten visited pages, the context contains thousands of tokens, a large portion of which is obsolete. Information from page 3 contradicts that of page 8. Insignificant details consume precious tokens. The model's attention dilutes.
This is not a model quality problem. It's an architectural problem. The standard ReAct framework (Reasoning + Acting) provides no mechanism for managing working memory. It assumes that everything that enters remains useful. This is false for any research lasting more than 5 minutes.
The classic comparison of RAG vs fine-tuning vs agents shows that agents perform well on complex tasks, but scalability remains their Achilles' heel. LongSeeker tackles precisely this breaking point.
Context-ReAct: The 5 Operations That Change Everything
LongSeeker's central proposal is elegant: add a context management mechanism directly into the agent's reasoning loop. Instead of only "Think → Act", the agent can now "Think → Act → Manage".
Skip: Move Forward Without Loading
The agent identifies that a page or section is irrelevant and skips it without adding it to the context. This isn't just simple keyword filtering: it's a decision made after reading the beginning of the page and evaluating its relevance to the current objective.
The Skip operation drastically reduces noise. In the trajectories analyzed by the researchers, up to 40% of the pages visited by classic agents provided no added value.
Compress: Summarize Without Losing the Essentials
This is the most powerful operation in the system. The agent takes an existing block of context and compresses it into a shorter version, keeping only the information relevant to the current task.
What distinguishes Compress from a simple summary is that it is contextually motivated. The agent doesn't compress generically. It compresses based on what it expects to need next. The researchers prove that this operator is expressively complete: it can simulate any other context management operation.
Rollback: Return to a Previous State
When the agent realizes a research trail is fruitless, it can revert to a previous snapshot of its context. It's the equivalent of a git reset for working memory.
This operation is crucial for exploratory research where the agent must test hypotheses. Without Rollback, false leads permanently pollute the context.
Snippet: Isolate a Specific Passage
Instead of keeping an entire page in the context, the agent extracts only the relevant passage and discards the rest. The Snippet operation acts like a scalpel: precise, minimal, efficient.
Delete: Permanently Remove
The simplest but most symbolic operation. The agent decides that a piece of information is no longer useful and simply deletes it from the context. It's the antithesis of the "keep everything just in case" mentality that characterizes current agents.
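To make the five operations above concrete, here is a minimal Python sketch of a working memory that supports them. It is an illustration under assumptions, not the paper's implementation: the `WorkingMemory` class, its method names, and the `summarize` callback are hypothetical, and in LongSeeker the decisions are made by the fine-tuned model rather than by hand-written code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WorkingMemory:
    """Toy working memory (hypothetical): block id -> text, plus snapshots."""
    blocks: Dict[str, str] = field(default_factory=dict)
    snapshots: List[Dict[str, str]] = field(default_factory=list)

    def add(self, block_id: str, text: str) -> None:
        """Baseline behaviour: load the full text into the context."""
        self.blocks[block_id] = text

    def skip(self, block_id: str) -> None:
        """Skip: the page was judged irrelevant, so nothing is stored."""
        # Intentionally a no-op: the context is left untouched.

    def snippet(self, block_id: str, page: str, passage: str) -> None:
        """Snippet: keep one passage of the page and drop the rest."""
        if passage not in page:
            raise ValueError("passage must come from the page")
        self.blocks[block_id] = passage

    def compress(self, block_id: str, summarize: Callable[[str], str]) -> None:
        """Compress: replace a block with a shorter, goal-aware version."""
        self.blocks[block_id] = summarize(self.blocks[block_id])

    def delete(self, block_id: str) -> None:
        """Delete: remove a block that is no longer useful."""
        self.blocks.pop(block_id, None)

    def snapshot(self) -> int:
        """Save the current context and return a snapshot id."""
        self.snapshots.append(dict(self.blocks))
        return len(self.snapshots) - 1

    def rollback(self, snapshot_id: int) -> None:
        """Rollback: restore a saved context and drop later snapshots."""
        self.blocks = dict(self.snapshots[snapshot_id])
        del self.snapshots[snapshot_id + 1:]


memory = WorkingMemory()
memory.add("p1", "LongSeeker reaches 61.5% on BrowseComp. Unrelated footer text.")
# A lambda stands in for the model's goal-aware summary.
memory.compress("p1", lambda text: text.split(". ")[0] + ".")
checkpoint = memory.snapshot()
memory.add("p2", "A lead that turns out to be a dead end.")
memory.rollback(checkpoint)  # p2 disappears from the context
memory.snippet(
    "p3",
    "Intro. Fine-tuned from Qwen3-30B-A3B. Footer.",
    "Fine-tuned from Qwen3-30B-A3B.",
)
memory.skip("p4")
print(memory.blocks)
```

Keeping snapshots as plain copies of the block dictionary is the simplest design that makes Rollback behave like the `git reset` analogy; a real agent would more likely store references or diffs.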
Architecture: Qwen3-30B-A3B Fine-Tuned on 10,000 Trajectories
LongSeeker is not a from-scratch model. It is fine-tuned from Qwen3-30B-A3B, Alibaba's Mixture of Experts (MoE) model with 30 billion parameters in total, of which only 3 billion are active per token.
Why This Base Choice?
Qwen3-30B-A3B offers an excellent performance/cost ratio. The 3 billion active parameters allow for fast inference, while the 30 billion total guarantee sufficient reasoning capacity for complex research tasks.
It's a pragmatic choice. The researchers preferred to optimize context orchestration rather than increase the model size. The idea: a smart agent with a good memory system is better than an oversized agent with saturated memory.
The 10,000 Synthetic Trajectories
The fine-tuning was performed on 10,000 synthetically generated trajectories. Each trajectory represents a complete research session with annotated context management decisions.
Generating these trajectories is non-trivial work. It required creating scenarios where context management operations are truly necessary, and then annotating the optimal decisions (when to compress, when to rollback, etc.).
This synthetic trajectory approach echoes the strategy used by agents like Dexter for financial research, where the quality of training data directly determines the quality of behavior in production.
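The article does not describe the on-disk format of the training trajectories, so the following is only a hedged sketch of what one annotated trajectory record could look like. Every name here (`Step`, `Trajectory`, `context_op`, `invalid_steps`) is hypothetical; the only thing taken from the source is the set of five operations and the idea that each session carries annotated context management decisions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# The five Context-ReAct operations named in the article.
CONTEXT_OPS = {"skip", "compress", "rollback", "snippet", "delete"}


@dataclass
class Step:
    thought: str                       # the agent's reasoning at this step
    action: str                        # e.g. "visit(page_1)"; format is assumed
    context_op: Optional[str] = None   # None when no management is needed
    target: Optional[str] = None       # block the operation applies to


@dataclass
class Trajectory:
    query: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def invalid_steps(trajectory: Trajectory) -> List[int]:
    """Return indices of steps whose context_op is not a known operation."""
    return [
        i for i, step in enumerate(trajectory.steps)
        if step.context_op is not None and step.context_op not in CONTEXT_OPS
    ]


example = Trajectory(
    query="Which base model is LongSeeker fine-tuned from?",
    steps=[
        Step("The landing page is mostly navigation.", "visit(page_1)", "skip", "page_1"),
        Step("One sentence names the base model.", "visit(page_2)", "snippet", "page_2"),
        Step("I have enough to answer.", "finish"),
    ],
    answer="Qwen3-30B-A3B",
)

# One trajectory per line (JSON Lines) is a common choice for fine-tuning data.
jsonl_line = json.dumps(asdict(example))
print(invalid_steps(example), len(jsonl_line) > 0)
```

A validation helper like `invalid_steps` is the kind of cheap check that helps when trajectory quality drives model quality, as the section argues.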
Results: 61.5% vs 43.2% on BrowseComp
The BrowseComp benchmark measures an agent's ability to find precise information on the open web. It's a difficult test because it requires navigating, understanding unstructured pages, and synthesizing scattered information.
Score Comparison
| Agent | BrowseComp Score | Base Model | Context Management |
|---|---|---|---|
| LongSeeker | 61.5% | Qwen3-30B-A3B | Elastic (Context-ReAct) |
| Tongyi DeepResearch | 43.2% | Not disclosed | Standard |
| OpenSeeker-v2 | ~55%* | Open-source | Hybrid |
| Classic ReAct Agents | ~30-35% | Variable | None |
*Estimate based on performance reported in the literature.
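For readers who want to check the headline gap, the arithmetic on the two reported scores can be worked out as follows. Only the 61.5% and 43.2% figures come from the article; the variable names are illustrative.

```python
longseeker = 61.5  # BrowseComp score reported for LongSeeker
tongyi = 43.2      # BrowseComp score reported for Tongyi DeepResearch

# Absolute gap in percentage points, and the gain relative to the baseline.
absolute_gap = round(longseeker - tongyi, 1)
relative_gain = round(100 * (longseeker - tongyi) / tongyi, 1)

print(f"{absolute_gap} points, {relative_gain}% relative")
# prints "18.3 points, 42.4% relative"
```

The "+18 points" figure is the absolute gap in percentage points; relative to the baseline score, the improvement is larger.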
What the Score Doesn't Say
The 61.5% figure is impressive, but the most significant aspect lies elsewhere. It's the performance gap that widens as the complexity of the research increases.
On simple queries (1-2 pages), LongSeeker and Tongyi DeepResearch are comparable. The difference appears on long-horizon searches requiring 10+ pages. There, Tongyi collapses while LongSeeker maintains its performance thanks to elastic context management.
LongSeeker's approach fits into a trend where open-source agents like OpenSeeker-v2 are breaking the monopoly of industrial search agents, not by copying their architectures, but by innovating differently.
The LongSeeker workflow in practice
Understanding LongSeeker means understanding its decision flow. Here is how the agent actually proceeds during a search session.
Step 1: Initialization
The agent receives a query and plans a search strategy. The context is empty, the working memory is blank. So far, nothing different from a classic agent.
Step 2: Exploration with filtering
The agent starts visiting pages. For each page, it evaluates in real time whether the content is relevant. If not → Skip. If partially relevant → Snippet to extract the useful passage.
This is where the first gain appears. A classic agent would fully load every visited page. LongSeeker only loads what deserves to be loaded.
Step 3: Periodic compression
After a few pages, the context starts to fill up. The agent triggers a Compress: it takes the accumulated context blocks and condenses them, keeping what is relevant for the next steps of the search.
The key: the compression is goal-oriented. The agent knows what it is still looking for, so it knows what it can eliminate.
Step 4: Managing dead ends
If the agent goes down a path that proves fruitless, it uses Rollback to return to a clean previous state, then Delete to remove any information from the false lead that remains. The context stays healthy.
Step 5: Final synthesis
At the end of the search, the context only contains relevant, compressed, and organized information. The agent can then synthesize without being polluted by the noise accumulated over dozens of visited pages.
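The five steps above can be sketched as a small simulated session. This is a toy under stated assumptions, not LongSeeker's actual loop: keyword overlap stands in for the model's relevance judgment, a `dead_end` flag stands in for the agent realizing a lead is fruitless, the budget is counted in words, and all function names are hypothetical.

```python
from typing import List, Set, Tuple


def relevant_sentences(page_text: str, goal_terms: Set[str]) -> List[str]:
    """Stand-in for Snippet: keep sentences that mention a goal term."""
    sentences = [s.strip().rstrip(".") for s in page_text.split(". ")]
    return [s for s in sentences if goal_terms & set(s.lower().split())]


def compress(context: List[str], goal_terms: Set[str]) -> List[str]:
    """Stand-in for goal-oriented Compress: keep goal terms and numbers."""
    def keep(word: str) -> bool:
        return word.lower() in goal_terms or any(ch.isdigit() for ch in word)
    return [" ".join(w for w in block.split() if keep(w)) for block in context]


def word_count(context: List[str]) -> int:
    return sum(len(block.split()) for block in context)


def run_session(pages: List[dict], goal_terms: Set[str],
                budget: int) -> Tuple[List[str], List[str]]:
    context: List[str] = []   # Step 1: the working memory starts empty
    ops: List[str] = []
    for page in pages:
        checkpoint = list(context)            # snapshot before exploring
        hits = relevant_sentences(page["text"], goal_terms)
        if not hits:                          # Step 2: irrelevant page
            ops.append("skip")
            continue
        context.extend(hits)                  # Step 2: partial relevance
        ops.append("snippet")
        if page.get("dead_end", False):       # Step 4: fruitless lead
            context = checkpoint
            ops.append("rollback")
        elif word_count(context) > budget:    # Step 3: context filling up
            context = compress(context, goal_terms)
            ops.append("compress")
    return context, ops                       # Step 5: ready for synthesis


pages = [
    {"text": "Cookie banner and newsletter signup"},
    {"text": "LongSeeker is fine-tuned from Qwen3-30B-A3B. Subscribe for more news"},
    {"text": "LongSeeker was reportedly built on another base model", "dead_end": True},
    {"text": "On BrowseComp the LongSeeker agent reaches 61.5% accuracy"},
]
final_context, operations = run_session(pages, {"longseeker", "browsecomp"}, budget=12)
print(operations)
print(final_context)
```

In this toy run the dead-end page is fully undone by the rollback, so no separate Delete is needed; the article's Step 4 describes Delete as the cleanup for whatever a rollback leaves behind.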
Compress: the operator that can do it all
Among the 5 operations of Context-ReAct, Compress deserves special attention. The researchers formally proved that this operator is expressively complete.
What is expressive completeness?
In simple terms: Compress can simulate any of the other 4 operations. A smart summary can serve as a Skip (compress to nothing), a Snippet (compress while keeping a passage), a Delete (compress to empty), or a Rollback (compress to a previous state).
This property has a major practical implication. Even if the agent does not perfectly master all operations, mastering Compress alone gives it near-total context management power.
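The intuition behind this claim can be illustrated with a short sketch in which each of the other four operations is written as a particular use of a single `compress` function. This is an informal illustration of the idea as summarized above, not the paper's formal proof; modeling the context as a list of strings and Compress as an arbitrary rewrite of that list is an assumption made here for simplicity.

```python
from typing import Callable, List

Context = List[str]


def compress(context: Context, rewrite: Callable[[Context], Context]) -> Context:
    """Generic Compress: replace the context with a rewritten version."""
    return rewrite(list(context))


def delete_via_compress(context: Context, index: int) -> Context:
    # Delete = compress one block down to nothing.
    return compress(context, lambda c: c[:index] + c[index + 1:])


def snippet_via_compress(context: Context, index: int, start: int, end: int) -> Context:
    # Snippet = compress one block down to a single passage.
    return compress(context, lambda c: c[:index] + [c[index][start:end]] + c[index + 1:])


def skip_via_compress(context: Context, new_page: str) -> Context:
    # Skip = load the page, then compress it to nothing.
    return compress(context + [new_page], lambda c: c[:-1])


def rollback_via_compress(context: Context, snapshot: Context) -> Context:
    # Rollback = compress the context to an earlier state.
    return compress(context, lambda c: list(snapshot))


ctx = ["a fact", "noise", "long page with one key passage inside"]
page = ctx[2]
start = page.find("key passage")

print(delete_via_compress(ctx, 1))
print(snippet_via_compress(ctx, 2, start, start + len("key passage")))
print(skip_via_compress(ctx, "irrelevant page"))
print(rollback_via_compress(ctx, ["a fact"]))
```

The sketch also hints at the efficiency argument: `skip_via_compress` has to load the page before discarding it, whereas a native Skip never loads it at all.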
Why keep all 5 operations then?
If Compress can do everything, why have 5 distinct operations? For efficiency and readability.
Each specialized operation is more efficient than Compress for its use case. Skip is faster than Compress for ignoring a page. Delete is cleaner than Compress for removing. Specialized operations are cognitive shortcuts for the agent.
Moreover, breaking down the operations makes the agent's behavior more interpretable. When the agent does a Skip, we understand exactly what is happening. When it does a Compress, the interpretation is fuzzier.
The question of memory in AI agents
The problem that LongSeeker solves is not new. It is an instance of a broader challenge: how to manage memory in AI agents.
Working memory vs long-term memory
LongSeeker tackles working memory (the context in the prompt). This is different from long-term memory (vector databases, log files, etc.).
Working memory has a hard constraint: the context window. Even models with 128k or 200k tokens eventually saturate if you accumulate into them indiscriminately. The question is not "how many tokens can we put in" but "which tokens deserve to be in there".
This distinction is fundamental in agent design. As explained in our guide on AI memory, there are several layers of memory, and each layer has its own optimal management mechanisms.
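The "which tokens deserve to be in there" question can be illustrated with a simple budgeted selection. This is a hedged sketch, not how LongSeeker decides: the relevance scores are given by hand, whitespace splitting stands in for a real tokenizer, and `fit_to_budget` is a hypothetical greedy heuristic.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def fit_to_budget(blocks: List[Tuple[str, float]], budget: int) -> List[str]:
    """Keep the highest-scoring blocks that fit, preserving original order."""
    ranked = sorted(range(len(blocks)), key=lambda i: blocks[i][1], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = count_tokens(blocks[i][0])
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return [blocks[i][0] for i in sorted(kept)]


# (text, relevance score) pairs; the scores are made up for illustration.
blocks = [
    ("LongSeeker is fine-tuned from Qwen3-30B-A3B", 0.9),
    ("The site also sells merchandise and hosts a forum", 0.1),
    ("It reaches 61.5% on BrowseComp", 0.8),
]
kept = fit_to_budget(blocks, budget=10)
print(kept)
```

With a budget of 10 words, the two relevant blocks fit and the low-scoring one is left out, regardless of how large the underlying window is.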
The parallel with human memory
The human brain does not remember everything. It filters, compresses, actively forgets. Research in cognitive science shows that forgetting is a feature, not a bug.
LongSeeker applies this principle to agents: selective forgetting (Delete), consolidation (Compress), targeted retrieval (Snippet) are all mechanisms inspired by human cognitive functioning.
Implications for the search agent ecosystem
LongSeeker is not just an academic paper. It has concrete implications for how search agents are designed today.
The myth of "more context = better search"
The industry has gotten into the habit of solving memory problems by increasing the context window: Gemini with 1M tokens, Claude with 200k, RAG solutions with massive databases. LongSeeker shows that this approach runs into diminishing returns.
Beyond a certain point, adding context degrades performance. Attention dilutes, hallucinations increase, costs explode. LongSeeker's elastic orchestration suggests that the solution is not "more memory" but "better memory management".
Impact on inference costs
Less context means fewer tokens to process at each step. With Qwen3-30B-A3B and its 3 billion active parameters, combined with an optimized context, the cost per search session drops significantly.
The researchers do not give exact figures on the savings achieved. As a rough estimate rather than a measured result, the architecture suggests a potential reduction of 30 to 50% in tokens processed per session, compared to a classic ReAct agent with linear accumulation.
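To see why bounded context reduces processed tokens, consider a back-of-the-envelope model in which the agent re-reads its whole context at every step. All numbers below (20 steps, 2,000 tokens per page, a 16,000-token cap) are made-up assumptions for illustration and do not come from the paper; the resulting percentage depends entirely on them.

```python
from typing import List


def linear_sizes(steps: int, page_tokens: int) -> List[int]:
    """Context size per step when every page is simply appended."""
    return [page_tokens * k for k in range(1, steps + 1)]


def capped_sizes(steps: int, page_tokens: int, cap: int) -> List[int]:
    """Context size per step when management keeps it at or below a cap."""
    return [min(page_tokens * k, cap) for k in range(1, steps + 1)]


STEPS, PAGE_TOKENS, CAP = 20, 2_000, 16_000  # hypothetical session

# The model re-processes the full context at each step, so the session
# cost is the sum of the per-step context sizes.
linear_total = sum(linear_sizes(STEPS, PAGE_TOKENS))
capped_total = sum(capped_sizes(STEPS, PAGE_TOKENS, CAP))
reduction = 1 - capped_total / linear_total

print(linear_total, capped_total, f"{reduction:.1%}")
# prints "420000 264000 37.1%"
```

Because linear accumulation grows quadratically in total processed tokens while a capped context grows linearly once the cap is reached, the saving increases with session length under this model.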
A reusable pattern beyond search
Context-ReAct is not specific to web search. The elastic context orchestration pattern can be applied to any agent that maintains a state over a long duration: code agents, data science agents, customer support agents.
Projects like ByteDance's DeerFlow, which manage projects over the long term, could directly benefit from this pattern. The longer the task, the more critical context management becomes.
Limitations and open questions
Despite its impressive results, LongSeeker has limitations that the researchers honestly acknowledge.
Dependence on fine-tuning
The 10,000 synthetic trajectories are both a strength and a weakness. A strength because they teach the agent sophisticated context management behaviors. A weakness because the quality of the model depends directly on the quality of these trajectories.
Generating high-quality trajectories is costly and difficult to scale. If you want to adapt LongSeeker to a new domain (medical, legal, scientific research), you need to generate new specific trajectories.
The question of generalization
The results on BrowseComp are solid, but BrowseComp remains a specific benchmark, and open-web search is only one use case. The open question: does Context-ReAct work just as well on search tasks in closed corpora, internal databases, and enterprise environments?
No comparison with proprietary agents
The paper compares LongSeeker with Tongyi DeepResearch and academic baselines. There is no direct comparison with proprietary agents like Gemini Deep Research or ChatGPT's Deep Research, which have non-public architectures and vastly greater engineering resources behind them.
The score of 61.5% is excellent for an open-source agent, but it does not allow us to conclude that LongSeeker beats commercial solutions on all metrics.
❌ Common mistakes
Mistake 1: Confusing context management with context window
What's wrong: Thinking that LongSeeker solves problems by increasing the size of the context window. The solution is architectural, not quantitative.
The fix: Understand that LongSeeker better manages a fixed-size context. The gain comes from orchestration, not size.
Mistake 2: Seeing Context-ReAct as just an improved RAG system
What's wrong: Classifying LongSeeker in the RAG category. It is an agent with a reasoning loop that actively manages its memory, not a passive retrieval system.
The fix: Clearly distinguish it from RAG approaches. LongSeeker makes metacognitive decisions (what to keep, what to throw away), which is fundamentally different from a retrieval system.
Mistake 3: Ignoring the importance of training trajectories
What's wrong: Focusing on the Context-ReAct architecture while forgetting that everything relies on the quality of the 10,000 synthetic trajectories.
The fix: If you want to reproduce or adapt LongSeeker, invest time in the generation and validation of trajectories. The architecture without the data is worthless.
Mistake 4: Deploying LongSeeker without monitoring context operations
What's wrong: Using the agent in production without tracking which operations (Skip, Compress, Rollback, Snippet, Delete) are triggered and when.
The fix: Instrument the pipeline to log every context management operation. This is the key to understanding the agent's behavior and debugging it.
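A minimal sketch of such instrumentation, using only Python's standard library, might look like this. The `ContextOpMonitor` class and its fields are hypothetical; how you hook it into an actual agent loop depends on your serving stack, which the article does not describe.

```python
import logging
from collections import Counter
from typing import Dict, List

logger = logging.getLogger("context_ops")


class ContextOpMonitor:
    """Records every context management operation for later debugging."""

    VALID_OPS = ("skip", "compress", "rollback", "snippet", "delete")

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.events: List[Dict[str, object]] = []

    def record(self, step: int, op: str, tokens_before: int, tokens_after: int) -> None:
        if op not in self.VALID_OPS:
            raise ValueError(f"unknown context operation: {op}")
        self.counts[op] += 1
        self.events.append({
            "step": step,
            "op": op,
            "tokens_before": tokens_before,
            "tokens_after": tokens_after,
        })
        logger.info("step=%d op=%s tokens %d -> %d", step, op, tokens_before, tokens_after)

    def tokens_saved(self) -> int:
        """Total context reduction across all recorded operations."""
        return sum(e["tokens_before"] - e["tokens_after"] for e in self.events)


logging.basicConfig(level=logging.INFO)
monitor = ContextOpMonitor()
monitor.record(step=1, op="skip", tokens_before=1000, tokens_after=1000)
monitor.record(step=2, op="compress", tokens_before=4000, tokens_after=1500)
print(dict(monitor.counts), monitor.tokens_saved())
```

Tracking token counts before and after each operation makes it possible to spot sessions where the agent compresses too rarely or rolls back too often.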
❓ Frequently Asked Questions
Does LongSeeker replace RAG systems?
No. LongSeeker is a web search agent that manages its working memory. RAG remains relevant for searching closed corpora. Both approaches are complementary and can coexist within a pipeline.
Can LongSeeker be used with a model other than Qwen3-30B-A3B?
In theory yes, but performance will depend on fine-tuning. Context-ReAct operations are learned, not programmed. Changing the base model implies regenerating training trajectories and redoing the fine-tuning.
Is the Compress operator really equivalent to other operations?
In terms of formal expressiveness, yes. The researchers prove this in the paper. In practice, specialized operations are more efficient and interpretable. Compress is a theoretical safety net, not a practical replacement.
Is BrowseComp a representative benchmark?
BrowseComp is one of the most widely used benchmarks for evaluating web search agents. It is not perfect, but it is recognized by the community. The fact that LongSeeker outperforms Tongyi DeepResearch by +18 points is significant regardless of any criticism of the benchmark.
How do you generate synthetic trajectories for a new domain?
The paper does not fully detail the process, but it involves a "teacher" agent that solves tasks while annotating its context management decisions. This process is costly and requires a more powerful model as an oracle to generate the annotations.
✅ Conclusion
LongSeeker demonstrates that the true bottleneck of search agents is neither the model size nor the context window, but the absence of active mechanisms for managing working memory. With Context-ReAct and its 5 atomic operations, it opens a path that the ecosystem will likely adopt massively: the agent that knows how to forget is the agent that knows how to search. The full paper is available on arXiv.