OmniGameArena: the UE5 benchmark revolutionizing the evaluation of VLM agents in games

AI Agents 🟢 Beginner ⏱️ 16 min read 📅 2026-06-09

🔎 Why a simple score is no longer enough to judge an AI agent

Until now, evaluating an AI agent in a game meant running it once and recording its score. A snapshot. A frozen picture of a skill at a given moment in time. The problem? An agent that scores 12 on its first try and then stagnates is ranked above an agent that starts at 2 but reaches 15 after five attempts. That's absurd.

OmniGameArena, published on June 8, 2026 on arXiv (2606.09826) by a team led by Mingxian Lin and Xiaojuan Qi, breaks this paradigm. This unified benchmark built on Unreal Engine 5 doesn't just measure raw performance. It traces the agent's learning curve, round after round, thanks to a mechanism called the Improvement Dynamics Curve (IDC).

The stakes go beyond gaming. If a VLM (Vision-Language Model) agent doesn't know how to improve its own actions based on its failures, it will never be reliable in real-world tasks. Video games, with their fast and measurable trial-and-error loops, are the perfect training ground to test this capability.


The essentials

  • OmniGameArena is a benchmark of 12 UE5 games covering three modes: Solo (7 games), PvP (3 games), and Cooperative (2 games), with a unified action interface.
  • The key innovation is the Improvement Dynamics Curve (IDC): an agentic-reflection harness where a reflector LLM automatically refines the agent's skill prompt over multiple rounds.
  • The benchmark addresses three major flaws of existing evaluations: the single first-attempt score, the exclusive focus on solo play, and the lack of standardized protocols across heterogeneous games.
  • It supports commercial VLMs, open-weight models, and multi-agent architectures, making it a universal evaluation framework.

| Tool | Main usage | Price (June 2026, check official website) | Ideal for |
|---|---|---|---|
| GPT-5.5 | IDC reflector LLM / VLM agent | Quote-based (OpenAI) | Best overall agentic score (98.2) |
| Gemini 3 Pro Deep Think | Long reflection on past errors | Quote-based (Google) | Complex multi-round analysis (95.4) |
| Claude Opus 4.7 Adaptive | VLM agent with dynamic adaptation | Quote-based (Anthropic) | Agents that adjust their strategy (94.3) |
| Kimi K2.6 | Self-hosted VLM agent | Free (self-host) | Local evaluation without external API (88.1) |
| OpenClaw | Agent framework with SOUL/AGENTS | Open source | Configuring agent profiles for the benchmark |
| Ollama | Local open-weight LLM execution | Free | Running local models on the benchmark |

The three flaws of existing VLM benchmarks

Current benchmarks for VLM agents in games are broken. Not just a little. Fundamentally. The OmniGameArena paper on OpenReview, submitted before May 26, 2026, identifies three structural problems that skew all research in the field.

The first-attempt trap

A single first-attempt score captures only a fraction of an agent's true capability. Imagine judging a chess player based on a single opening. That is exactly what benchmarks like Minecraft, ALFWorld, or WebArena do. They measure initial performance without any mechanism to evaluate whether the agent can improve.

Yet, in reality, an agent deployed in production will fail. The critical question is not "how much does it score on the first try" but "how fast does it improve after each failure." This dimension is completely absent from current evaluations.

The solo bias

Almost all existing benchmarks only test solo scenarios. One agent facing a static environment. But real-world applications of AI agents — whether in gaming, robotics, or automation — involve interactions with other agents, opponents, and teammates.

An agent that masters a solo puzzle can collapse as soon as a second agent modifies the environment in real time. Ignoring PvP and co-op is like evaluating a soccer player solely on penalty kick drills without a defender.

The chaos of non-standardized protocols

Each historical benchmark has its own action interface, its own observation format, its own metric. Result: it is impossible to compare the results of an agent on Minecraft with those of an agent on ALFWorld. The absence of a unified protocol fragments research and slows down progress.

This is precisely the triangle of problems that OmniGameArena solves in one stroke. The full details are available in the HTML version of the paper on arXiv.


The 12 UE5 games: a calculated diversity

OmniGameArena doesn't just pick 12 games at random. The selection is strategic: each game tests a distinct visuo-motor skill, and the three modes (Solo, PvP, Coop) are represented in significant proportions.

Solo: 7 games to assess perception and action

The 7 solo games cover tasks ranging from spatial navigation to object manipulation, including visual recognition under time constraints. The advantage of UE5 is the graphic and physical consistency across all games. Textures, lighting, gravity — everything is produced by the same engine, which eliminates biases related to rendering differences between engines.

For a VLM agent, this consistency is crucial. If the benchmark mixed 2D pixel art games with photorealistic 3D environments, it would measure the ability to adapt to visual styles rather than gaming skill. UE5 standardizes the problem while maintaining gameplay diversity.

PvP: 3 games to test adversarial interaction

PvP mode introduces a dimension that solo benchmarks completely ignore: strategic anticipation. An agent in PvP doesn't play against a predictable environment. It faces an opponent who adapts, who feints, who exploits its weaknesses.

The 3 PvP games in OmniGameArena force the agent to develop reactive behaviors. The VLM must read the opponent's intentions in the pixels of the visual frame and adjust its action accordingly. This is a fundamentally different challenge from solo, and the results show that the rankings are profoundly disrupted as a result.

Coop: 2 games for multi-agent coordination

Cooperative mode tests yet another skill: the division of labor and implicit communication. Two agents must accomplish a common objective without necessarily communicating directly with each other. They must infer the other's role and adapt accordingly.

This is where the architectures of the best autonomous AI agents truly differentiate themselves. An agent designed for solo will simply ignore its partner and try to do everything alone. A good cooperative agent, on the other hand, identifies synergies.


The IDC: OmniGameArena's true innovation

The Improvement Dynamics Curve is what transforms OmniGameArena from a simple benchmark into a revolutionary evaluation tool. The concept is elegant yet powerful: instead of a score, we plot a curve.

How agentic-reflection works

The mechanism relies on a reflector LLM separate from the player agent. After each round, the reflector analyzes the agent's trajectory — its actions, its errors, its missed decisions — and generates an improved skill prompt for the following round.

Specifically, if the agent missed a jump because it didn't correctly estimate the distance, the reflector will modify the prompt to include an instruction like "before jumping, evaluate the distance by counting the ground tiles." This refined prompt is injected into the agent for the next round.

The process repeats over several rounds. And it is the shape of the resulting curve that becomes the evaluation metric, not the final score.
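The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's harness: `run_idc`, `RoundResult`, and the two toy functions are hypothetical names, and the toy player and reflector stand in for a real game and real models so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoundResult:
    score: float
    trajectory: List[str]  # log of actions and errors for one round


def run_idc(
    play_round: Callable[[str], RoundResult],
    reflect: Callable[[str, List[str]], str],
    initial_prompt: str,
    n_rounds: int = 5,
) -> List[float]:
    """Play n_rounds, refining the skill prompt between rounds.

    Returns the per-round scores, i.e. the points of the curve.
    """
    prompt = initial_prompt
    curve: List[float] = []
    for round_idx in range(n_rounds):
        result = play_round(prompt)
        curve.append(result.score)
        # No reflection is needed after the last round.
        if round_idx < n_rounds - 1:
            prompt = reflect(prompt, result.trajectory)
    return curve


# Toy stand-ins (assumptions): a real setup would call a game and two models.
def toy_play_round(prompt: str) -> RoundResult:
    # Pretend every hint present in the prompt is worth 3 points.
    hints = prompt.count("HINT")
    return RoundResult(score=5.0 + 3.0 * hints, trajectory=["missed jump"])


def toy_reflect(prompt: str, trajectory: List[str]) -> str:
    # Append one corrective instruction based on the last logged error.
    return prompt + f"\nHINT: avoid '{trajectory[-1]}'"


curve = run_idc(toy_play_round, toy_reflect, "Play the level.", n_rounds=5)
print(curve)  # [5.0, 8.0, 11.0, 14.0, 17.0]
```

Keeping the player and the reflector as two separate callables mirrors the separation the article describes: either one can be swapped without touching the loop.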

What the curve reveals that the score hides

Let's take two agents with GPT-5.5 as the reflector. Agent A scores 15 in round 1 and plateaus at 16 in round 5. Agent B scores 5 in round 1 but reaches 18 in round 5. A classic first-attempt benchmark would judge them on the round 1 score and give the advantage to A. But the IDC reveals that A has almost zero improvement capacity, since it is already at its ceiling, while B has a strong learning dynamic.

In a real deployment context, B is the one we prefer. An agent that learns quickly will catch up to and surpass a static agent over time. The IDC captures this temporal dimension in a quantitative way, as explained in the summary on Deep Learning Monitor.
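To make the contrast between Agent A and Agent B concrete, here is a small sketch that computes two simple summaries from the round 1 and round 5 scores quoted above. The function names and the choice of summaries are assumptions for illustration; the paper may quantify the curve differently.

```python
def relative_gain(first: float, last: float) -> float:
    """Improvement from the first to the last round, relative to the first."""
    return (last - first) / first


def mean_gain_per_round(first: float, last: float, n_rounds: int) -> float:
    """Average score gained per round transition (n_rounds - 1 transitions)."""
    return (last - first) / (n_rounds - 1)


# (round 1 score, round 5 score) from the example in the text.
agents = {"A": (15, 16), "B": (5, 18)}

for name, (r1, r5) in agents.items():
    print(
        name,
        f"relative gain {relative_gain(r1, r5):+.0%},",
        f"mean gain per round {mean_gain_per_round(r1, r5, 5)}",
    )
# A relative gain +7%, mean gain per round 0.25
# B relative gain +260%, mean gain per round 3.25
```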

The identified curve profiles

The paper identifies several typical profiles: the staircase curve (improvement in sudden steps), the linear curve (constant progress), the logarithmic curve (rapid progression followed by a plateau), and the flat curve (no learning). Each profile says a lot about the nature of the underlying model and the quality of the agent-reflector pair.
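As a rough illustration of how these four profiles differ numerically, the sketch below labels a list of per-round scores. The thresholds and decision rules are assumptions made up for this example, not the paper's criteria.

```python
from typing import List


def classify_profile(
    scores: List[float], flat_tol: float = 0.05, linear_tol: float = 0.5
) -> str:
    """Heuristic label for a score curve (illustrative rules only)."""
    if len(scores) < 2:
        raise ValueError("need at least two rounds to describe a curve")
    gains = [b - a for a, b in zip(scores, scores[1:])]
    total = scores[-1] - scores[0]
    # Flat: total gain is negligible relative to the starting score.
    # Declining curves are also reported as flat in this sketch.
    if total <= flat_tol * abs(scores[0]):
        return "flat"
    mean_gain = total / len(gains)
    # Linear: every per-round gain stays close to the average gain.
    if all(abs(g - mean_gain) <= linear_tol * mean_gain for g in gains):
        return "linear"
    # Logarithmic: gains shrink (or hold) from one round to the next.
    if all(later <= earlier for earlier, later in zip(gains, gains[1:])):
        return "logarithmic"
    # Otherwise gains alternate between stalls and jumps.
    return "staircase"


print(classify_profile([10, 10, 10.2, 10, 10.1]))  # flat
print(classify_profile([5, 8, 11, 14, 17]))        # linear
print(classify_profile([5, 11, 14, 15, 15.5]))     # logarithmic
print(classify_profile([5, 5, 9, 9, 13]))          # staircase
```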


Unified action interface: why it's decisive

One of the major technical challenges of OmniGameArena was creating a single action interface that works across 12 radically different games. In a shooter game, the relevant action is "aim and shoot". In a puzzle game, it's "select and place". In a racing game, it's "accelerate and steer".

The design of the unified API

The team designed an abstract action space generic enough to cover all games, but expressive enough to preserve the specificity of each task. Every action is encoded as a combination of primitives: directional movement, contextual action, target selection.

This standardization enables something previously impossible: directly comparing the performance of the same agent on games of totally different genres. If an agent excels in puzzle games but fails in PvP, we know the problem isn't an interface bias but a genuine cognitive gap.
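One way to picture such an action space is a single record built from the three primitives named above. The sketch below is hypothetical: the class, field names, and wire format are assumptions for illustration and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Move(Enum):
    NONE = "none"
    UP = "up"
    DOWN = "down"
    LEFT = "left"
    RIGHT = "right"


@dataclass(frozen=True)
class UnifiedAction:
    """One action expressed with the same three primitives in every game."""
    move: Move = Move.NONE          # directional movement
    context_action: bool = False    # fire, place, accelerate: game decides
    target_id: Optional[int] = None  # selected object or opponent, if any


def to_wire(action: UnifiedAction) -> dict:
    """Serialize an action into a plain dict (hypothetical wire format)."""
    return {
        "move": action.move.value,
        "act": action.context_action,
        "target": action.target_id,
    }


# The same structure covers the three genres mentioned in the text.
shoot = UnifiedAction(context_action=True, target_id=3)      # aim and shoot
place = UnifiedAction(context_action=True, target_id=7)      # select and place
steer = UnifiedAction(move=Move.LEFT, context_action=True)   # accelerate and steer

print(to_wire(steer))  # {'move': 'left', 'act': True, 'target': None}
```

Because every game consumes the same record, a failure in one genre cannot be blamed on a different input format.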

The impact on multi-agent research

For multi-agent architectures, this unified interface is a gift. A local AI agent running on Ollama can be tested on all 12 games without any code adaptation. The same input-output pipeline works everywhere. This drastically reduces experiment setup time and increases reproducibility.

The lmgame-org/GamingAgent repo on GitHub also provides the reference implementations for plugging any model into the benchmark, including the architectures presented at ICLR 2026.


Results: which models dominate on IDC?

The preliminary results from OmniGameArena are shaking up the usual rankings. A model that dominates on first-attempt does not necessarily dominate in terms of improvement dynamics.

The reflector paradox

Choosing the best LLM for agents as a reflector is not a trivial matter. GPT-5.5, with its agentic score of 98.2, produces the smoothest improvement curves. But Gemini 3 Pro Deep Think (95.4) sometimes generates deeper insights during intermediate rounds, producing more spectacular — but less predictable — improvement jumps.

Claude Opus 4.7 Adaptive (94.3) stands out for its ability to adjust the level of detail in the skill prompt based on the type of error. For a timing error, it produces a short, targeted instruction. For a strategic error, it generates a mini-action plan. This adaptive granularity translates into particularly smooth improvement curves.

Self-hosted models in the race

Kimi K2.6 (88.1) and GLM-5 Reasoning (82.0) show surprisingly strong improvement curves despite lower absolute scores. Their advantage: latency. A faster reflection cycle allows for more rounds within the same compute time, which partially compensates for the lower quality of each individual reflection.

This is a result with major practical implications. A local deployment with Ollama might be preferable to an expensive API if the criterion is the speed of improvement rather than the maximum achievable score.

Comparative table of IDC profiles

| Model (reflector) | Agentic score | Typical IDC profile | Average improvement R1→R5 | Latency per cycle |
|---|---|---|---|---|
| GPT-5.5 | 98.2 | Regular linear | +42% | ~3.2 s |
| Gemini 3 Pro Deep Think | 95.4 | Stepped | +38% | ~5.8 s |
| Claude Opus 4.7 Adaptive | 94.3 | Soft logarithmic | +45% | ~3.9 s |
| GPT-5.4 Pro | 91.8 | Linear with plateau | +31% | ~2.8 s |
| Kimi K2.6 (self-host) | 88.1 | Fast linear | +35% | ~1.1 s |
| Claude Sonnet 4.6 | 81.4 | Early plateau | +18% | ~2.4 s |
| GPT-5.3 Codex | 80.0 | Irregular | +22% | ~2.6 s |

Configuring an agent for OmniGameArena

Running an agent on OmniGameArena requires precise configuration. The reference framework is available on the GamingAgent GitHub repo, but here are the key principles.

Separating the player agent from the reflector

The most common mistake is using the same model to play and to reflect. IDC works best when the reflector is a different model, ideally slower and more analytical. A good setup: GPT-5.4 (87.6) as the player agent for its execution speed, and GPT-5.5 as the reflector for the quality of its analysis.

For those using OpenClaw with SOUL, AGENTS and Skills, the separation is natural. SOUL defines the cognitive profile of the player agent, AGENTS manages the game loop, and a separate Skill can be dedicated to post-round reflection.

Setting the number of rounds

The paper does not set a maximum number of rounds, but standard experiments use 5 rounds. Below 3, the curve is not significant. Beyond 8, the marginal gains collapse for most models.

The sweet spot depends on the game. PvP games often require more rounds because the opponent adapts as well, creating a race for reciprocal improvement. Solo games generally reach their plateau faster.
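The guidance above (distinct player and reflector, 3 to 8 rounds) is easy to check before launching a run. The sketch below is a hypothetical pre-flight check: `IDCConfig` and its fields are illustrative names, not part of the GamingAgent repo, and the model strings are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IDCConfig:
    player_model: str      # model that plays the game
    reflector_model: str   # model that rewrites the skill prompt
    n_rounds: int = 5      # the article cites 5 as the usual setting

    def warnings(self) -> List[str]:
        """Return human-readable warnings; an empty list means no issue."""
        issues: List[str] = []
        if self.player_model == self.reflector_model:
            issues.append("player and reflector are the same model")
        if self.n_rounds < 3:
            issues.append("fewer than 3 rounds: curve is not significant")
        if self.n_rounds > 8:
            issues.append("more than 8 rounds: marginal gains tend to collapse")
        return issues


# Placeholder model identifiers (assumptions, not real API names).
good = IDCConfig(player_model="gpt-5.4", reflector_model="gpt-5.5")
bad = IDCConfig(player_model="kimi-k2.6", reflector_model="kimi-k2.6", n_rounds=2)

print(good.warnings())  # []
print(bad.warnings())
```

Returning warnings instead of raising keeps the choice with the experimenter, which fits PvP games where more rounds may be justified.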

Managing inter-round memory

The reflector only has access to the trajectory of the previous round, not the complete history. This constraint is intentional: it forces the skill prompt to be self-sufficient in each round, making it testable in isolation. If the reflector had access to the entire history, the prompt would eventually become an unreadable monster that overfits past errors to the detriment of generalization.


OmniGameArena vs. other benchmarks

To understand the contribution of OmniGameArena, it needs to be placed in the existing landscape of agent benchmarks, including those that are not specifically gaming-related.

Against classic gaming benchmarks

Minecraft, Crafter, NetHack: they all share the three flaws identified by the team. Single score, solo only, benchmark-specific protocol. OmniGameArena addresses each one: a curve instead of a single score, PvP and co-op alongside solo, and one unified action interface.

Against web agent benchmarks

DeepWeb-Bench, the new benchmark that exposes the weaknesses of AI research agents, shows that web agents have their own evaluation problems. But the web is an uncontrolled environment where pages change. UE5 offers a deterministic environment where the only variables are the agent's decisions. This controllability is a major scientific asset.

Against real-world simulation benchmarks

FutureSim, a benchmark that has AI agents replay three months of real events in order to evaluate them, pushes evaluation towards temporal realism. FutureSim and OmniGameArena share a common philosophy: evaluating over time, not just at a single instant. But FutureSim works on passive temporal sequences (replaying events), while OmniGameArena measures active improvement (the agent modifies its behavior). They are complements, not competitors.


Implications for the future of AI agents

OmniGameArena is not just an academic tool. Its results have direct consequences on how AI agents are designed, trained, and deployed.

Rethinking agent training

If the improvement dynamic is more important than the initial score, then training methods must change. RL (Reinforcement Learning) optimized for the first-attempt score will produce agents that perform well at the snapshot but learn nothing afterward. IDC suggests that optimization should focus on the slope of the curve, not its starting point.

The critical role of streaming and latency

In a multi-round benchmark, the latency of each reflection cycle is a determining factor. An agent that improves its score by 5% per cycle but takes 10 seconds per cycle will be less efficient than an agent that improves by 3% per cycle in 1 second. This is where approaches like streaming, which reduces multi-agent latency, become strategic — every millisecond saved per cycle is multiplied across all rounds.
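The arithmetic behind this comparison can be checked with a short sketch. It rests on strong assumptions that are not from the paper: gains compound at a constant rate every cycle with no plateau, and the time budget is spent entirely on reflection cycles. The function name is hypothetical.

```python
def score_multiplier(
    gain_per_cycle: float, seconds_per_cycle: float, budget_seconds: float
) -> float:
    """Cumulative score multiplier after as many full cycles as the budget allows."""
    cycles = int(budget_seconds // seconds_per_cycle)
    return (1 + gain_per_cycle) ** cycles


budget = 10  # seconds available for reflection cycles

slow = score_multiplier(0.05, seconds_per_cycle=10, budget_seconds=budget)
fast = score_multiplier(0.03, seconds_per_cycle=1, budget_seconds=budget)

print(f"slow agent: x{slow:.3f} after 1 cycle")    # x1.050
print(f"fast agent: x{fast:.3f} after 10 cycles")  # x1.344
```

Under these idealized assumptions the faster agent ends ten seconds about 34% above its starting score, against 5% for the slower one. In practice curves flatten, so the gap would be smaller.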

Towards agents that know what they don't know

IDC, in essence, measures the agent's metacognition: its ability to identify its own errors, formulate a correction, and apply it. This skill is exactly what current agents lack in real-world scenarios. An agent that knows it has made a mistake and knows how to correct it is far more useful than an agent that makes fewer mistakes but cannot recognize them.


❌ Common mistakes

Mistake 1: Confusing final score with learning ability

This is the fundamental error that OmniGameArena points out. An agent with a high final score but a flat curve simply had a good starting point, not good learning ability. The solution: always look at the slope of the IDC, not the endpoint.

Mistake 2: Using the same model as player and reflector

The reflector has an analytical role different from the executive role of the player. Using the same model creates a confirmation bias: the reflector tends to justify the player's choices rather than criticize them. The solution: use two distinct models, ideally with complementary profiles (fast for playing, deep for reflecting).

Mistake 3: Ignoring PvP and Coop modes

Many teams settle for the 7 solo games to save time. But results show that solo and PvP rankings are weakly correlated. A dominant agent in solo can be mediocre in PvP. The solution: systematically test across all three modes to get a complete profile.

Mistake 4: Too many reflection rounds

Beyond 8 rounds, the skill prompt becomes so loaded with specific corrections that it loses generality. The agent starts to overfit past errors to the detriment of overall performance. The solution: stick to the 3-8 round range and analyze the curve rather than seeking the absolute maximum score.


❓ Frequently Asked Questions

Is OmniGameArena open to the public?

Yes, the benchmark is based on UE5 games and the reference code is available on the lmgame-org/GamingAgent GitHub repo. Researchers can plug in their own models via the unified action interface.

Does IDC replace first-attempt scores?

No, it complements them. OmniGameArena reports both the first-attempt score and the full IDC curve. Both metrics together give a richer picture than either one alone.

Can open-source models be used?

Yes, the benchmark explicitly supports open-weight models. Kimi K2.6 and GLM-5 Reasoning have been tested in self-host mode. The unified interface is compatible with any model exposing a completion API.

What is the cost of a full evaluation?

It depends on the number of rounds and the model used. With GPT-5.5 as a reflector over 5 rounds and 12 games, the token cost is significant but remains within standard research budgets. Self-hosted models via Ollama eliminate this cost at the price of a lower absolute score.

Does IDC apply outside of gaming?

The principle is transferable to any domain where the agent can iterate: coding, problem-solving, automation. But the feedback loop must be fast and measurable, which gaming naturally guarantees.


✅ Conclusion

OmniGameArena marks the transition from a static to a dynamic evaluation of AI agents. The Improvement Dynamics Curve does not measure what an agent knows how to do, but what it knows how to learn — and that is exactly the metric that matters for real-world deployment. If you are building agents, explore frameworks like OpenClaw to implement this type of iterative reasoning today.