
FutureSim: this benchmark makes AI agents replay 3 months of real events to evaluate them

AI Agents 🟢 Beginner ⏱️ 15 min read 📅 2026-05-17


🔎 AI agent benchmarks are broken — FutureSim proposes a radical reset

Traditional benchmarks measure AI agents on static tasks. One problem, one solution, one score. But the real world doesn't work like that: information arrives continuously, the context changes, and an agent must adapt without knowing what comes next.

This is exactly the gap that the paper FutureSim: Replaying World Events to Evaluate Adaptive Agents (arXiv:2605.15188, May 2026) fills. The researchers built a framework that replays three months of real-world events — January to March 2026 — in chronological order. The agents must predict daily news based on an evolving news corpus.

The headline result is a 25-point accuracy gap between the best and the worst frontier agent. This is not a minor detail: it suggests that some models are markedly better at adapting to a dynamic, open environment.

The problem is that current benchmarks did not capture this difference. FutureSim makes it visible.


The key points

  • FutureSim is a benchmark that simulates 3 months of real-world events (January-March 2026) in strict chronological order, without future data leakage.
  • Agents must predict daily news by interacting with a news corpus that evolves every day.
  • The best frontier agent scores 25 accuracy points higher than the worst, revealing gaps masked by static benchmarks.
  • The evaluation was conducted in the native harnesses of Codex, Claude Code, and other standard agentic environments.
  • The paper was published on arXiv in May 2026, with discussions on HuggingFace Papers and AlphaXiv.

Tool | Main usage | Price (June 2025, check the vendor's site) | Ideal for
GPT-5.5 (OpenAI) | High-end agentic tasks | Variable | Complex continuous adaptation tasks
Claude Opus 4.7 Adaptive (Anthropic) | Adaptive reasoning | Variable | Open dynamic environments
Gemini 3 Pro Deep Think (Google) | Long-duration deep reasoning | Variable | Analysis of evolving corpora
GPT-5.4 Pro (OpenAI) | High-performance versatile agent | Variable | Standard agentic benchmarks

What FutureSim exactly is — A time simulator, not a multiple-choice test

FutureSim is not just another benchmark where a model is asked 200 questions. It's a simulator that reconstructs a real informational environment, day after day.

The principle: researchers compiled a corpus of news covering January to March 2026. Each day, new information is injected into the environment. The agent has access to it, but only to information prior to the prediction date. Zero future data leakage.
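The "only information prior to the prediction date" rule can be pictured as a date filter over the corpus. The snippet below is a minimal sketch, not the paper's implementation: the `Article` type, the `visible_articles` function and the sample headlines are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical corpus entry: a publication date and a headline."""
    published: date
    headline: str


def visible_articles(corpus: list[Article], as_of: date) -> list[Article]:
    """Return only the articles published strictly before the prediction date."""
    return [a for a in corpus if a.published < as_of]


# Made-up sample data, for illustration only.
corpus = [
    Article(date(2026, 1, 5), "Headline A"),
    Article(date(2026, 1, 6), "Headline B"),
    Article(date(2026, 1, 7), "Headline C"),
]

# Predicting for January 7: the agent may read A and B, never C.
visible = visible_articles(corpus, as_of=date(2026, 1, 7))
print([a.headline for a in visible])  # ['Headline A', 'Headline B']
```

The strict `<` comparison is the design choice that matters here: anything published on the day being predicted counts as "the future" for that prediction.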

The agent must then predict what will happen in the news of the following days. Not abstract predictions. Concrete predictions about real events that actually occurred.

This design is crucial because it tests exactly the skill missing from static benchmarks: the ability to update beliefs as new information arrives. This is what the FutureSim presentation on Digg describes as "continuous learning without future data leakage".

The agent cannot cheat by memorizing the answers. It must understand real-world dynamics and project them forward.

The difference with static benchmarks

A classic benchmark like MMLU or HumanEval gives a fixed set of problems. The model solves them once, and that's it. There is no notion of time, no information arriving in a stream.

FutureSim introduces the temporal dimension. The agent receives information on day 1, makes a prediction for day 2, receives the news from day 2, adjusts its prediction for day 3, and so on over 90 days.
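The observe, predict, observe cycle described above can be sketched as a simple replay loop. This is an illustrative toy under stated assumptions, not FutureSim's code: `replay`, `persistence_agent` and the five-step timeline are invented for the example, and real events are of course not single labels.

```python
from typing import Callable


def replay(events: list[str], predict: Callable[[list[str]], str]) -> float:
    """Walk the timeline once; at each step the agent sees only past events."""
    hits = 0
    trials = 0
    for day in range(1, len(events)):
        history = events[:day]        # everything observed up to yesterday
        guess = predict(history)      # forecast made before `day` is revealed
        hits += guess == events[day]  # scored only after the forecast is fixed
        trials += 1
    return hits / trials if trials else 0.0


def persistence_agent(history: list[str]) -> str:
    """Toy baseline: assume tomorrow looks like today."""
    return history[-1]


# Made-up timeline, for illustration only.
timeline = ["calm", "calm", "storm", "storm", "calm"]
print(replay(timeline, persistence_agent))  # 0.5
```

The baseline is right whenever nothing changes and wrong at every turning point, which is exactly the behavior a static benchmark cannot expose.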

It's a paradigm shift. We are no longer measuring "does the model know X?" but "does the model know how to update its understanding of X when new data arrives?".


The methodology in detail — Strict chronology and zero cheating

FutureSim's methodological rigor is what makes it credible. The original paper on arXiv details several essential safeguards.

First point: the timeline is imposed by the system, not by the agent. The environment controls access to information. The agent cannot "jump" to March 15 to read the news and then return to January 20. The flow is unidirectional.
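One way to picture a timeline owned by the environment is a clock that can only move forward and that rejects any request for a later date. The `ReplayClock` class below is a hypothetical sketch; the exact start and end days (January 1 and March 31, 2026) are assumptions based on the "January to March 2026" window, not details taken from the paper.

```python
from datetime import date, timedelta


class ReplayClock:
    """Hypothetical environment clock: forward-only, owned by the simulator."""

    def __init__(self, start: date, end: date) -> None:
        self._today = start
        self._end = end

    @property
    def today(self) -> date:
        return self._today

    def advance(self) -> date:
        """Move one day forward; there is deliberately no way to go back."""
        if self._today >= self._end:
            raise RuntimeError("simulation window is over")
        self._today += timedelta(days=1)
        return self._today

    def check_access(self, requested: date) -> None:
        """Refuse any read that targets a date after the simulated 'today'."""
        if requested > self._today:
            raise PermissionError(f"{requested} is in the simulated future")


clock = ReplayClock(date(2026, 1, 1), date(2026, 3, 31))  # assumed window
clock.advance()                          # now 2026-01-02
clock.check_access(date(2026, 1, 2))     # allowed: not in the future
try:
    clock.check_access(date(2026, 3, 15))
except PermissionError as err:
    print("blocked:", err)
```

Because the agent never holds a reference to the internal date, it cannot jump ahead and come back; every read goes through `check_access`.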

Second point: no temporal leakage. This is the classic pitfall of evaluations on real-world data. If the model was trained on data from the simulated period or later, it already "knows" what happened. FutureSim uses mechanisms to ensure that predictions are based solely on the state of the corpus at the time of the prediction.

Third point: the evaluation takes place in native harnesses. This means that agents are tested in their own execution environments — Codex for OpenAI, Claude Code for Anthropic, etc. No artificial wrapper that would skew the performance.

The summary on AlphaXiv describes this approach as measuring "the ability of agents to adapt to new information in dynamic and open environments".

The news corpus as an environment

The corpus is not just a simple dump of articles. It is a structured environment where information is organized, dated, and accessible via standard search and reading tools.

The agent must decide what to read, when to read it, and how to integrate this information into its mental model of the world. This is exactly the type of behavior expected from an autonomous AI agent in a real-world context.

This freedom to navigate the corpus is what makes the evaluation realistic. A good agent will not read the entire corpus every day — it will identify weak signals, follow thematic threads, and prioritize relevant sources.
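Prioritizing what to read can be sketched as a small ranking function over the articles already visible. The scoring rule below (keyword overlap first, recency as a tie-breaker) is an assumption made for illustration; the paper's actual search tools are not described here, and `rank_articles` and the sample articles are hypothetical.

```python
from datetime import date


def rank_articles(
    articles: list[tuple[date, str]],
    query: str,
    today: date,
    top_k: int = 2,
) -> list[str]:
    """Rank visible articles by keyword overlap, breaking ties by recency."""
    terms = set(query.lower().split())
    scored = []
    for published, text in articles:
        if published > today:
            continue  # never surface articles from the simulated future
        overlap = len(terms & set(text.lower().split()))
        if overlap == 0:
            continue  # unrelated thread, not worth the agent's reading budget
        age_days = (today - published).days
        scored.append((overlap, -age_days, text))
    scored.sort(reverse=True)  # highest overlap first, then most recent
    return [text for _, _, text in scored[:top_k]]


# Made-up articles, for illustration only.
articles = [
    (date(2026, 2, 1), "central bank holds rates"),
    (date(2026, 2, 10), "central bank hints at rates cut"),
    (date(2026, 2, 12), "local team wins cup"),
    (date(2026, 2, 20), "bank announces rates decision"),
]
print(rank_articles(articles, "central bank rates", today=date(2026, 2, 15)))
# ['central bank hints at rates cut', 'central bank holds rates']
```

A real agent would likely use richer retrieval than word overlap, but the shape is the same: filter by date, score by relevance, read only the top of the list.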


The results — 25 points difference between frontier agents

This is the figure that catches the eye: the best agent is 25 accuracy points ahead of the worst. In benchmarking, such a gap between models of the same "frontier" generation is unusual.
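A gap in accuracy can be read in two ways, as percentage points or as a relative percentage, and the two readings differ a lot. The sketch below uses made-up accuracies (0.55 and 0.30 are NOT figures from the paper) purely to show the arithmetic; the function names are hypothetical.

```python
def gap_in_points(best: float, worst: float) -> float:
    """Absolute gap between two accuracies, in percentage points."""
    return (best - worst) * 100


def relative_gain_percent(best: float, worst: float) -> float:
    """How much higher `best` is than `worst`, relative to `worst`, in percent."""
    return (best - worst) / worst * 100


# Illustrative accuracies only, not results reported by FutureSim.
best, worst = 0.55, 0.30
print(round(gap_in_points(best, worst), 1))          # 25.0
print(round(relative_gain_percent(best, worst), 1))  # 83.3
```

With these invented numbers, a 25-point gap corresponds to a relative gain of about 83%, which is why it is worth checking which of the two a reported figure refers to.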

On static benchmarks, frontier models tend to cluster within a narrow performance range. The difference between the first and the fifth is often just a few points. FutureSim shows that this proximity is illusory — it disappears as soon as the temporal dimension is added.

The summary on HuggingFace Papers emphasizes that FutureSim "reveals significant gaps in the long-term prediction capabilities of current agents." The keyword is "long term." On a 24-hour prediction, the models manage fine. Over 90 days with context accumulation, the gaps explode.

What the scores tell us about the models

The exact model-by-model ranking is not reproduced here (the paper provides aggregated results), but several lessons can be drawn.

The models with the best agentic scores — GPT-5.5 (98.2), Gemini 3 Pro Deep Think (95.4), Claude Opus 4.7 Adaptive (94.3) on reference leaderboards — are precisely those expected to perform best on FutureSim. Their extended reasoning capabilities and long context management are direct assets.

Conversely, more specialized models like GPT-5.3 Codex (80), optimized for code, might show weaknesses on this type of open-ended task. It is not a flaw in the model — it is a mismatch between its specialization and the nature of the benchmark.

This reinforces the idea that choosing the right LLM for an AI agent fundamentally depends on the type of environment in which the agent will operate.


Why it matters — AI agents must survive in the real world

An agent that solves a frozen problem in a lab is good. An agent that adapts to a changing world is essential.

Think of a financial agent. It doesn't receive a complete dataset at time T=0. It receives continuous data streams, news, reports, announcements. It must constantly reassess its positions. FutureSim simulates exactly this dynamic.

The same logic applies to cybersecurity agents, research assistants, infrastructure monitoring agents. Their value lies not in what they know initially, but in their ability to integrate new information and adjust their behavior.

This is linked to a broader problem: AI agents inherit harmful actions from their predecessors. If an agent is not capable of reevaluating its beliefs when the context changes, it risks reproducing outdated or even dangerous strategies.

Continuous learning as a distinct skill

FutureSim isolates a specific skill: continuous in-context learning. It is not fine-tuning, it is not classic RAG. It is the model's ability to modify its behavior during an interaction episode, based on new observations.

This skill is distinct from pure reasoning. A model can be excellent at formal logic but poor at continuous adaptation. FutureSim allows them to be measured separately.

The article "The 5 patterns of AI agents that work" identifies contextual adaptation as a key pattern. FutureSim now provides a framework to measure it objectively.


Implications for agent development

For developers building agents, FutureSim is a game-changer on several levels.

First, the choice of the base model becomes more strategic. If your agent needs to operate in a dynamic environment, a score on a static benchmark is no longer enough to make a decision. You need to look at continuous adaptation performance.

Next, the agent architecture matters as much as the model. The way the agent manages its memory, prioritizes information, and decides when to update its reasoning — all of this directly impacts its performance on FutureSim.

Developers working with local open-source AI agents with Ollama will need to pay attention to this dimension. A local model that performs well on static tasks could prove unsuitable if the environment is dynamic.

What this changes for agent configuration

Configuring OpenClaw with SOUL, AGENTS and Skills takes on a new dimension with this type of benchmark. The SOUL component — which defines the agent's personality and reasoning — must integrate a continuous adaptation capability.

Skills, for their part, must include contextual update mechanisms. An agent with rigid Skills will be at a disadvantage on a benchmark like FutureSim compared to an agent capable of modifying its procedures on the fly.

For those looking for more accessible use cases, certain AI tools allow you to earn €300 per month without coding. But for serious agents, it's the adaptation architecture that makes the difference between a toy and a production tool.


The limits of FutureSim — What the benchmark doesn't capture

No benchmark is perfect. FutureSim has biases that must be understood to correctly interpret its results.

First limit: the corpus is centered on current events. Agents that must operate in domains outside the news (mathematics, pure code, physics) are not optimally tested. A scientific agent does not need to predict the news; it needs to integrate new experimental data.

Second limit: three months is short. Continuous learning over 90 days is not the same as over 3 years. Long-term memory degradation mechanisms are not captured.

Third limit: news prediction is a specific task. It involves understanding natural language, geopolitical dynamics, and economic trends. A medical specialist agent could be excellent in its field but poor on this benchmark.

The paper on AlphaXiv acknowledges that FutureSim measures a specific form of adaptation: the one linked to evolving textual information. It is not a universal measure of an agent's intelligence.

The risk of over-optimization

Like any benchmark, FutureSim risks becoming an optimization target. Laboratories could train their models specifically for this type of news prediction task, without this actually improving general adaptation capacity.

This is the classic problem of Goodhart's law: when a metric becomes an objective, it ceases to be a good metric. Benchmark developers know this, and it is likely that FutureSim will evolve towards more varied domains.


What FutureSim reveals about the state of the art

Beyond the numbers, FutureSim tells us something about where agentic AI actually stands in mid-2026.

Frontier models are impressive on isolated tasks. GPT-5.5 dominates the leaderboard with an agentic score of 98.2. But when placed in an environment that even slightly resembles the real world — continuous flow of information, need for adaptation, extended time horizon — the illusions fall away.

A 25-point gap between frontier agents means that the "frontier" category itself is probably too broad. We need to distinguish between models capable of continuous adaptation and those that are not.

The FutureSim coverage on Digg even notes that the results suggest "significant performance gaps in predicting real-world events" between agents presented as equivalent by classic benchmarks.

The illusion of the performance plateau

Since late 2024, many commentators have talked about a plateau in LLM performance. Gains have become marginal on MMLU, GSM8K, and other classic benchmarks.

FutureSim suggests that this plateau is partly artificial. Models continue to progress, but benchmarks no longer capture it because they don't test the dimensions where the real gains are — like continuous adaptation.

This is an important point for anyone following the evolution of models. If you only look at leaderboard scores, you might conclude that GPT-5 (78.1) and GPT-5.5 (98.2) are in different leagues. On dynamic tasks, the gap could be even more pronounced. Conversely, a model with a lower score could come out ahead thanks to better adaptive capacity.


Hosting Your Own Agents — An Option in Light of These Results

The results from FutureSim also prompt reflection on infrastructure. If an agent's performance depends heavily on its adaptability, then control over the execution environment becomes critical.

Self-hosted models like Kimi K2.6 from Moonshot AI (88.1) and GLM-5 Reasoning from Z.AI (82) offer this control. You can precisely configure the context window, memory mechanisms, and update strategies.

For hosting, solutions like Hostinger allow you to deploy these architectures without a massive initial investment. This is relevant if you want to reproduce FutureSim-type evaluations on your own data.

The advantage of self-hosting in this context: you can adapt the benchmark to your specific domain. FutureSim uses general news, but a medical agent would need a simulator based on scientific publications, a financial agent on quarterly reports.


❌ Common mistakes

Mistake 1: Confusing news prediction with general prediction

FutureSim uses news as an evaluation vehicle, not as an object of study. The mistake is thinking that the benchmark measures an agent's ability to "predict the news". What it measures is the capacity for continuous adaptation in a dynamic informational environment. The news domain is a pragmatic choice — it provides abundant, dated, and verifiable data.

Mistake 2: Directly comparing FutureSim scores with static benchmark scores

A 25-point gap on FutureSim is not comparable to a 25-point gap on MMLU. The scales, tasks, and conditions are fundamentally different. FutureSim measures an additional dimension that does not exist in static benchmarks. Using both together provides a more complete picture, not a direct comparison.

Mistake 3: Concluding that a model is "bad" because it scores low on FutureSim

A model optimized for code — like GPT-5.3 Codex — is not designed for news prediction. A low score on FutureSim does not disqualify this model for its intended use. The benchmark is a diagnostic tool, not a universal verdict.

Mistake 4: Ignoring the language domain bias

The FutureSim news corpus is primarily in English. Models with a different language bias could be artificially disadvantaged. This is not a fatal flaw in the benchmark, but it is a factor to keep in mind when interpreting results for multilingual models.


❓ Frequently Asked Questions

Does FutureSim replace existing benchmarks like MMLU?

No. FutureSim measures a complementary dimension — continuous adaptation in a dynamic environment. Static benchmarks remain useful for evaluating frozen knowledge and reasoning. Together, the two provide a more complete picture.

Can FutureSim be used to evaluate open-source agents locally?

Yes, in principle. The framework is described in the paper and could be reproduced. However, compiling the news corpus with anti-leakage safeguards requires significant effort. It is more realistic for research teams than for individual developers.

Why 3 months and not longer?

The authors chose a duration that balances statistical significance and feasibility. Three months is enough to reveal adaptation dynamics without making the simulation unmanageable. Longer durations are being considered for future versions of the benchmark.

Is the 25-point gap really significant?

Yes, especially between models in the same frontier category. On static benchmarks, these same models are often within a 5-10 point range. A 25-point gap suggests that continuous adaptation capacity is a much more powerful discriminator than out-of-context reasoning.

Will FutureSim be integrated into public model leaderboards?

Probably, but not immediately. The benchmark is still recent (May 2026). The community will need to validate and reproduce it, and leaderboard maintainers like HuggingFace or LMSYS will need to integrate it into their dashboards. The discussion on HuggingFace Papers suggests strong interest in this integration.


✅ Conclusion

FutureSim doesn't just add another benchmark to the list — it changes the question we ask AI agents, shifting from "what do you know?" to "how do you adapt when the world changes?". The 25-point gap between frontier agents shows that this question reveals differences that static evaluations were completely masking. For anyone building or selecting AI agents, this is now a dimension that can no longer be ignored.