OpenSeeker-v2 : open-source breaks the monopoly of industrial search agents
🔎 As few as 10,600 examples are enough to beat the giants
Deep search agents were considered the last bastion of Big Tech. Google, OpenAI, Anthropic: all have invested millions to train models capable of scouring the web, cross-referencing sources, and producing structured answers. The recipe seemed set in stone — massive pre-training, then CPT, SFT, RL — an industrial pipeline out of reach for academic research.
In May 2026, a team publishes OpenSeeker-v2 on arXiv and shatters this assumption. Their finding is radical: with a simple, well-targeted SFT, fueled by informative, high-difficulty trajectories, an open-source model rivals cutting-edge proprietary solutions. No need for RLHF, no need for billions of additional parameters.
The implication is clear. Deep search is no longer a resource-linked competitive advantage. It's a training data quality problem. And that changes everything for the open-source ecosystem.
The essentials
- OpenSeeker-v2 is an open-source deep search agent that rivals Big Tech's proprietary models on reference benchmarks.
- The model is trained with only 10,600 trajectories, compared to millions for classic industrial approaches.
- Three key modifications explain the performance: expanding knowledge graphs, expanding the tool set, and strict filtering by number of steps.
- The method replaces the heavy pipeline (pre-training + CPT + SFT + RL) with a targeted SFT on high-difficulty trajectories.
- The benchmarks cover multi-source deep search (HotPotQA-style) and code-search tasks (SWE-bench-style).
Recommended tools
| Tool | Main usage | Price (June 2025, check on site) | Ideal for |
|---|---|---|---|
| Hostinger | Hosting to deploy agents | Starting at 2.99 €/month | Deploying open-source agents in production |
| Ollama | Run open-source LLMs locally | Free | Testing OpenSeeker-v2 locally |
| OpenClaw | Autonomous agent framework | Free (open-source) | Building custom search agents |
What a frontier search agent really is
A search agent doesn't just do a Google query and summarize the results. It chains search iterations, evaluates the relevance of each source, identifies gaps in its reasoning, and relaunches targeted searches to fill those holes.
Concretely, the agent follows a trajectory: it asks an initial question, retrieves documents, analyzes them, decides if it has enough information, or reformulates its query. This loop can run for 5, 10, sometimes 20 steps before producing a final answer.
Models like GPT-5.5 (agentic score of 98.2) or Gemini 3 Pro Deep Think (95.4) excel in this type of iterative reasoning. But their training relies on heavy pipelines, inaccessible outside industrial labs. OpenSeeker-v2 demonstrates that it's not the size of the model that makes the difference, but the quality of the learning signal.
This distinction is fundamental. It brings search agents closer to the 5 AI agent patterns that work, where the observation-reflection-action loop takes precedence over computational brute force.
The method: informative and high-difficulty trajectories
This is the technical core of the paper. The OpenSeeker-v2 team starts from a simple observation: most training data for search agents is of poor quality. Either too easy (the model finds the answer in one step), or too redundant (thousands of nearly identical trajectories).
Their approach inverts the industrial logic. Instead of maximizing data volume, they maximize the difficulty and informational richness of each trajectory.
Informative trajectories: learning to search, not to answer
An informative trajectory is not judged by the final answer, but by the path taken. The model must learn to structure its search, identify sub-questions, cross-reference contradictory sources, and go back when a lead is dead.
Classic training on question-answer pairs produces models that jump to conclusions. Training on informative trajectories produces models that know how to navigate uncertainty.
High-difficulty filtering: keeping only the worst
This is low-step filtering, and it's counterintuitive. The team filters trajectories to keep only those where the model had the most difficulty finding the answer. If a trajectory is resolved in 2-3 steps, it is eliminated.
The result: the model learns exclusively from complex cases, where the search strategy makes the difference between success and failure. It's the equivalent of altitude training for athletes — you don't improve by repeating what you already know how to do.
Three data synthesis levers
To generate these high-quality trajectories, the team modified data synthesis along three axes:
- Knowledge graph scale : The graphs used to generate the trajectories were considerably enlarged, offering the model longer and more branched search paths.
- Tool set expansion : Instead of limiting itself to a search engine, the agent has an expanded set of tools (calculation, structured extraction, multi-source navigation), forcing more sophisticated search strategies.
- Strict filtering : As explained, only high-difficulty trajectories (requiring numerous steps) are kept.
These three modifications, combined, produce a dataset of 10,600 trajectories. A derisory figure compared to industrial standards, but sufficient to reach frontier performance.
The benchmarks: HotPotQA, SWE-bench and beyond
OpenSeeker-v2 is evaluated on two families of benchmarks that test distinct but complementary skills for a search agent.
Multi-source deep search (HotPotQA-style)
HotPotQA-style benchmarks require the model to cross-reference information from multiple documents to answer a question. This isn't simple factual lookup — it's multi-hop reasoning.
The model must identify which facts are needed, locate them in different sources, and combine them logically. On these benchmarks, OpenSeeker-v2 surpasses proprietary models that were trained with much heavier pipelines.
This is where the informative trajectory technique pays off the most. The model learned to plan its search in multiple steps, not to jump straight to the answer.
Code-search (SWE-bench-style)
The second family of benchmarks tests the agent's ability to solve code problems by combining search and implementation. This is a concrete use case: a developer reports a bug, the agent must search the documentation, understand the context, identify the source of the problem, and propose a fix.
On these SWE-bench-style tasks, OpenSeeker-v2 demonstrates that the high-difficulty trajectory approach generalizes beyond simple QA. The agent isn't just searching for information — it's searching with an action objective.
To understand how this type of agent fits into a larger architecture, see our article on configuring OpenClaw: SOUL, AGENTS, and Skills, which details how to structure an agent with search and action skills.
Why it challenges the industrial pipeline
The classic pipeline for training a frontier agent looks like this: massive pre-training on the web, then continual pre-training (CPT) on specialized data, then supervised fine-tuning (SFT) on demos, and finally reinforcement learning (RL) to optimize a reward.
Each step costs millions of dollars in compute. Each step requires infrastructures that only Big Tech possesses. This is the resource wall that maintained the monopoly.
OpenSeeker-v2 skips the first three steps. It takes a pre-trained base model and applies a targeted SFT with 10,600 well-chosen trajectories. Result: performance comparable to models trained with the full pipeline.
It's not that the industrial pipeline is useless. It's that it's poorly optimized. When you feed an SFT with mediocre data, RL becomes necessary to correct the flaws. When the data is excellent from the start, RL provides marginal gains.
The lesson for the open-source community is clear: quality beats quantity, and difficulty beats coverage. This principle actually applies beyond search agents — it is central in the debate RAG vs fine-tuning vs agents: choosing the right approach in 2026.
10,600 Trajectories: Why This Number Is Troubling
In the industry, we talk in millions, in billions of tokens. And here, an academic project comes along with 10,600 examples and beats the giants. This number deserves our attention.
The first reaction is skepticism. But the paper demonstrates that the difficulty distribution of the trajectories is the determining factor. A million easy trajectories teach the model nothing it doesn't already know how to do. Ten thousand difficult trajectories force the agent to develop new strategies.
This is consistent with what we observe elsewhere in ML. Curriculum learning, where you train first on easy examples and then progressively on harder ones, has long shown that the learning distribution matters just as much as the volume.
The difference here is that the OpenSeeker-v2 team completely eliminated the easy examples. Low-step filtering literally means: if it's easy, we don't even use it to warm up. We go straight to high-altitude training.
For the open-source community, this is excellent news. Generating 10,600 high-quality trajectories is achievable with a modest budget. It's not a question of compute; it's a question of methodology.
The Implications for the Open-Source Ecosystem
Until now, deep search was the main justification for proprietary models. "You can't have a good search agent in open-source" was the implicit argument of every Big Tech company selling API access to their search models.
OpenSeeker-v2 destroys this argument. And the consequences go beyond the single use case of deep search.
Democratization of Advanced Agents
If 10,600 trajectories are enough to train a frontier search agent, the same principle can be applied to other types of agents. Code agents, financial analysis agents, scientific research agents — anywhere the search strategy is key, the high-difficulty trajectory method should work.
This reinforces the movement towards open-source AI agents with Ollama locally, where anyone can deploy a high-performing agent without relying on a proprietary API.
New Criteria for Choosing LLMs for Agents
When specialized training costs so little, the choice of base model becomes more important than the post-training pipeline. The best LLMs for AI agents are no longer those with the best proprietary training, but those that offer the best foundation for targeted SFT.
In this context, models like Moonshot AI's Kimi K2.6 (agentic score of 88.1, self-host) or Z.AI's GLM-5 (82, self-host) gain in attractiveness. They offer a solid, open-source base on which to apply the OpenSeeker-v2 method.
Threat to Proprietary Search Models
Proprietary models like GPT-5.5 (98.2) and Gemini 3 Pro Deep Think (95.4) still top the overall agentic leaderboards. But their advantage in the specific domain of deep search is shrinking. If an open-source model matches their search performance, the justification for the premium price crumbles.
This is particularly true for B2B use cases, where data confidentiality pushes organizations to favor self-hosting. A high-performing open-source search agent eliminates the trade-off between performance and confidentiality.
What This Teaches Us About DeerFlow and Long-Term Agents
OpenSeeker-v2's approach retrospectively sheds light on other recent open-source projects. ByteDance's DeerFlow, for example, is an open-source agent designed to research, code, and create over the long term.
The common point with OpenSeeker-v2 is the importance of search strategy over time. An agent working over the long term cannot afford to follow inefficient trajectories. It must optimize every step, exactly like what OpenSeeker-v2 learns through its high-difficulty trajectories.
The low-step filtering method is also directly transferable: for a long-term agent, you could filter training trajectories to keep only those where the agent had to reformulate its strategy along the way. Easy cases where everything goes right the first time teach nothing about resilience.
This convergence between search approaches (OpenSeeker-v2) and long-term creation approaches (DeerFlow) suggests a more general principle: the most performant open-source agents are those that learn from their failures, not their successes.
How to Concretely Use OpenSeeker-v2
The arXiv paper is a research paper, not a packaged product. But the practical implications are immediate for developers and tech teams.
Locally with Ollama
The model can be run locally via Ollama, which allows you to test its search capabilities without sending data to a third party. This is essential for companies handling sensitive data that cannot use proprietary APIs.
The required configuration remains moderate compared to proprietary frontier models, precisely because the model was not bloated by a heavy training pipeline.
Integration into an Agent Framework
OpenSeeker-v2 is not a complete framework — it's a model. To turn it into an autonomous agent, it needs to be integrated into a framework like OpenClaw, which handles the action loop, tools, and memory persistence.
The best autonomous AI agents generally combine a high-performing base model with sophisticated orchestration. OpenSeeker-v2 provides the first element; the framework provides the second.
Production Deployment
For serious deployment, suitable hosting is necessary. Hostinger offers affordable solutions for hosting open-source agents, with sufficient performance for moderate search loads. The key point is having reliable web access and enough RAM to load the model.
The Limitations to Keep in Mind
Despite the impressive results, OpenSeeker-v2 is not a miracle model. Several limitations are important to mention.
The first is that the model is specialized. Its exceptional performance concerns deep search. On general reasoning, creativity, or conversation tasks, it probably does not surpass generalist frontier models like Claude Opus 4.7 (94.3) or GPT-5.4 Pro (91.8).
The second limitation concerns the generation of the training trajectories themselves. The paper does not fully detail the synthesis cost of the 10,600 trajectories. If each trajectory requires dozens of API calls to a powerful model to be generated and filtered, the real cost is not negligible — it is simply shifted from training to data preparation.
The third limitation is the dependence on the base model. OpenSeeker-v2 starts from an existing pre-trained model. If this base model has biases or gaps, SFT will not necessarily correct them. The method improves the search strategy, not the underlying knowledge.
Finally, the benchmarks, while standard, remain benchmarks. Lab performance does not guarantee real-world performance, with noise, unreliable sources, and users asking poorly formulated questions.
❌ Common Mistakes
Mistake 1: Confusing Data Volume with Data Quality
The classic mistake is thinking that training a search agent requires millions of examples. OpenSeeker-v2 proves the opposite: 10,600 high-difficulty trajectories beat datasets a thousand times larger but poorly filtered. The solution: invest in filtering and curation, not in raw collection.
Mistake 2: Reflexively Applying RL
When a search agent doesn't perform well, the industry instinct is to add a reinforcement learning step. But if the SFT data is of poor quality, RL will only polish a fundamentally flawed strategy. The solution: first improve the training trajectories, then evaluate if RL still adds anything.
Mistake 3: Ignoring Difficulty Filtering
Many open-source teams replicate OpenSeeker-v2's method but neglect low-step filtering. They keep easy trajectories "to add variety to the dataset." This is exactly what dilutes the learning signal. The solution: be ruthless with filtering. If it's easy, it doesn't belong in the dataset.
Mistake 4: Deploying a Search Agent Without a Framework
OpenSeeker-v2 is a model, not a product. The mistake is using it as-is, without an orchestration loop, without tool management, without persistence. The solution: integrate it into an agent framework like OpenClaw that handles the infrastructure around the model.
❓ Frequently Asked Questions
Does OpenSeeker-v2 replace proprietary search agents like Perplexity or SearchGPT?
Not exactly. OpenSeeker-v2 is a search model, not an end-user product. Proprietary solutions include an interface, real-time web indexing, and availability guarantees. OpenSeeker-v2 provides the reasoning engine, not the complete product.
Can the method be reused for other types of agents?
Yes, that is even the main implication of the paper. The principle of high-difficulty informative trajectories is transferable to any agent whose performance depends on an iterative strategy: code agents, analysis agents, planning agents.
Which base model should be used with the OpenSeeker-v2 method?
The paper does not prescribe a specific model, but the logic suggests a base model with good reasoning capabilities. Self-host models like Kimi K2.6 (88.1) or GLM-5 (82) are natural candidates for the open-source community.
Doesn't low-step filtering risk over-specializing the model?
This is a real risk. By training only on difficult cases, the model can become clumsy on simple questions. In practice, the authors consider that the base capabilities of the pre-trained model cover simple cases — SFT only serves to add the ability to handle complex cases.
Are 10,600 trajectories really enough for production?
For an academic proof-of-concept, yes. For large-scale production deployment, more data will probably be needed to cover the diversity of real-world cases. The figure proves a principle, not an absolute limit.
✅ Conclusion
OpenSeeker-v2 demonstrates that Big Tech's monopoly on deep search is not technological — it is methodological. With 10,600 well-chosen trajectories and targeted SFT, an open-source project achieves the performance of models trained with million-dollar industrial pipelines. The quality of the training data, not the quantity of compute, is the true performance lever for search agents. For teams that want to build their own search agents without relying on proprietary APIs, the message is clear: explore open-source agent frameworks and apply difficulty-based filtering principles today.