OmniGameArena: The UE5 benchmark that measures the learning dynamics of VLM agents in games

AI Agents 🟢 Beginner ⏱️ 12 min read 📅 2026-06-09

🔎 Why the "first-attempt score" is no longer enough to judge an AI agent

For two years, new vision-language models (VLMs) have typically been presented with a single flat score on a gaming benchmark. The problem: that score says nothing about the agent's ability to progress.

Is an agent that scores 45% on the first try but reaches 80% on the fifth round better than an agent that plateaus at 60% from the very first attempt? Existing benchmarks do not answer this question. They capture a snapshot, not a curve.

OmniGameArena, published on June 8, 2026, on arXiv (2606.09826) by a team from the University of Hong Kong led by Xiaojuan Qi, radically changes this approach. The benchmark introduces the IDC (Improvement Dynamics Curve), a metric that plots the improvement dynamics over multiple rounds. No more judging based on a single attempt: we observe how the agent learns.

This paradigm shift is all the more relevant since the best autonomous AI agents now integrate memory and reflection loops. Measuring only the first attempt means ignoring half of their value.


The key points

  • OmniGameArena brings together 12 Unreal Engine 5 games built specifically for the benchmark (7 Solo, 3 PvP, 2 Coop), available as packaged builds on Hugging Face.
  • The major innovation is the IDC (Improvement Dynamics Curve): instead of a first-attempt score, the benchmark plots the agent's improvement curve over several iterations.
  • The unified action interface makes it possible to compare heterogeneous classes of agents (commercial VLMs, open-weight VLMs, keyboard-mouse policies, gamepad policies) on an equal footing.
  • The code, builds, and dataset are fully open source and downloadable on Hugging Face (mxlin043/OmniGameArena).

| Tool | Main usage | Price (June 2026, check on huggingface.co) | Ideal for |
|---|---|---|---|
| OmniGameArena Dataset | UE5 builds of the 12 games | Free | Researchers and developers |
| OmniGameArena Paper Page | Abstract and metadata | Free | Quick scientific monitoring |
| GamingAgent (GitHub) | Comparative LLM/VLM agents | Free | Comparison with vanilla VLM approach |

The benchmark architecture — 12 games, a single protocol

OmniGameArena doesn't tinker with existing games. The 12 environments are built from scratch in Unreal Engine 5, specifically to serve as an evaluation ground. This is a fundamental difference compared to benchmarks that recycle commercial games with interface hacks.

The distribution is designed to cover real-world gaming scenarios: 7 solo games, 3 competitive games (PvP), and 2 cooperative games (Coop). Each build is self-contained and directly downloadable from the Hugging Face dataset. No complex dependencies, no engine to recompile.

The key point of the architecture: the unified action interface. All types of agents — whether they generate keyboard-mouse commands, gamepad inputs, or textual instructions — go through the same connection protocol to the UE5 environment. This eliminates the integration bias that pollutes most current comparisons.
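To make the idea concrete, here is a minimal sketch of what a shared action format could look like. It is not the benchmark's actual protocol: `UnifiedAction`, the key bindings, and both adapter functions are hypothetical names chosen for illustration. The point is only that a keyboard-mouse policy and a gamepad policy can end up emitting the same object.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UnifiedAction:
    """Hypothetical common action format; not the benchmark's real schema."""
    move: Tuple[float, float] = (0.0, 0.0)  # x/y movement, each in [-1, 1]
    look: Tuple[float, float] = (0.0, 0.0)  # camera delta, each in [-1, 1]
    buttons: frozenset = frozenset()         # abstract names such as "jump"


# Assumed bindings, for illustration only.
KEY_TO_MOVE = {"w": (0.0, 1.0), "s": (0.0, -1.0), "a": (-1.0, 0.0), "d": (1.0, 0.0)}
KEY_TO_BUTTON = {"space": "jump", "e": "interact"}
PAD_TO_BUTTON = {"A": "jump", "X": "interact"}


def clamp(value, low=-1.0, high=1.0):
    """Keep an axis value inside the shared [-1, 1] range."""
    return max(low, min(high, float(value)))


def from_keyboard_mouse(keys, mouse_dx=0.0, mouse_dy=0.0):
    """Translate pressed keys and a mouse delta into the shared format."""
    x = sum(KEY_TO_MOVE[k][0] for k in keys if k in KEY_TO_MOVE)
    y = sum(KEY_TO_MOVE[k][1] for k in keys if k in KEY_TO_MOVE)
    buttons = frozenset(KEY_TO_BUTTON[k] for k in keys if k in KEY_TO_BUTTON)
    return UnifiedAction((clamp(x), clamp(y)), (clamp(mouse_dx), clamp(mouse_dy)), buttons)


def from_gamepad(left_stick, right_stick, pressed):
    """Translate stick positions and pressed pad buttons into the shared format."""
    buttons = frozenset(PAD_TO_BUTTON[b] for b in pressed if b in PAD_TO_BUTTON)
    return UnifiedAction(
        (clamp(left_stick[0]), clamp(left_stick[1])),
        (clamp(right_stick[0]), clamp(right_stick[1])),
        buttons,
    )
```

With these assumed bindings, `from_keyboard_mouse({"w", "space"})` and `from_gamepad((0.0, 1.0), (0.0, 0.0), {"A"})` produce equal actions, so the environment cannot tell which class of agent sent them. A text-instruction adapter would follow the same pattern.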

This standardization is reminiscent of what DeepWeb-Bench does for web search agents: a common protocol to reveal the true differences between models, rather than integration artifacts.


The IDC — From Snapshot to Learning Curve

This is the paper's most significant scientific contribution. The IDC (Improvement Dynamics Curve) replaces the first-attempt score with a trajectory.

The principle is simple but powerful: instead of measuring performance after a single attempt, we have the agent play over several rounds (with access to its previous traces) and plot the curve of its scores. This curve captures three pieces of information that a single score erases:

  • The initial learning speed (slope of the curve in the early rounds).
  • The ceiling reached (asymptote of the curve).
  • The stability (variance around the trend).
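These three quantities can be approximated from a plain list of per-round scores. The sketch below is one possible reading, not the paper's definition of the IDC: the function name `idc_summary`, the `early_rounds` parameter, the use of the best score as a stand-in for the asymptote, and the linear trend are all assumptions.

```python
from statistics import mean, pvariance


def idc_summary(scores, early_rounds=3):
    """Summarize a per-round score list with three illustrative statistics."""
    if len(scores) < 2:
        raise ValueError("need at least two rounds")
    # Use at least two rounds, and no more than the rounds available.
    k = max(2, min(early_rounds, len(scores)))
    # Early slope: average score gain per round over the first k rounds.
    early_slope = (scores[k - 1] - scores[0]) / (k - 1)
    # Ceiling: best score in any round (a crude stand-in for the asymptote).
    ceiling = max(scores)
    # Stability: variance of the residuals around a least-squares line.
    xs = range(len(scores))
    x_bar, y_bar = mean(xs), mean(scores)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores)) / sxx
    residuals = [y - (y_bar + slope * (x - x_bar)) for x, y in zip(xs, scores)]
    return {
        "early_slope": early_slope,
        "ceiling": ceiling,
        "trend_variance": pvariance(residuals),
    }
```

A linear trend is the simplest choice here. A saturating fit would model the ceiling better, at the cost of needing more rounds to estimate it.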

A commercial VLM like GPT-5.5 could excel on the first attempt thanks to its raw reasoning capacity. But an open-weight model run locally with Ollama could show a steeper IDC if it better integrates the feedback from its previous failures. Both profiles are valid, but they reveal different qualities.

The IDC also changes how one can choose the best LLM for an agent. A model with a moderate first-attempt score but a strong improvement dynamic will be preferred for iterative tasks, while a "one-shot" model will be suited for missions with no room for error.
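The selection logic can be sketched as a function of how many rounds the task allows. Everything here is illustrative: `best_agent_for_budget` is a hypothetical helper, and the two curves are made-up numbers whose endpoints simply echo the earlier 45% to 80% versus 60% plateau example.

```python
def best_agent_for_budget(curves, rounds_available):
    """Return the agent with the highest score at the last round it may play."""
    if rounds_available < 1:
        raise ValueError("rounds_available must be at least 1")

    def score_at_budget(scores):
        # If a curve is shorter than the budget, use its last recorded round.
        return scores[min(rounds_available, len(scores)) - 1]

    return max(curves, key=lambda name: score_at_budget(curves[name]))


# Illustrative curves only; the intermediate values are invented.
curves = {
    "fast_learner": [45, 58, 68, 75, 80],
    "flat_starter": [60, 60, 60, 60, 60],
}

print(best_agent_for_budget(curves, rounds_available=1))  # prints "flat_starter"
print(best_agent_for_budget(curves, rounds_available=5))  # prints "fast_learner"
```

The same two agents swap places depending on the budget, which is exactly the information a single first-attempt score hides.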


The four classes of agents tested

OmniGameArena doesn't just compare VLMs against each other. The benchmark defines four distinct agent classes, each with its own constraints and advantages.

Commercial VLMs: models like GPT-5.5, Gemini 3 Pro Deep Think, or Claude Opus 4.7, connected via API. They benefit from the best visual understanding and the most advanced reasoning, but are limited by network latency and API constraints.

Open-weight VLMs: models like Kimi K2.6 or GLM-5, deployed locally. They offer lower latency and total control over inference, at the cost of generally inferior visual understanding.

Keyboard-mouse policies: specialized models trained to directly generate keyboard-mouse actions. No textual reasoning chain, just an optimized perception-action mapping.

Gamepad policies: same principle, but with a constrained action space (analog joysticks, buttons). Tests the agent's ability to function with a more limited interface.

The fact that these four classes go through the same unified interface finally makes the comparisons legitimate. In previous benchmarks, we often compared apples and oranges: an API agent with visual parsing against a local agent with direct access to game states.


Solo, PvP, Coop — why multi-agent changes everything

The majority of gaming benchmarks for VLMs are limited to solo play. This is a major limitation, because competitive and cooperative games introduce fundamentally different dynamics.

In PvP, the agent must not only master the game, but also adapt its behavior to an opponent who changes strategy. In Coop, it must coordinate its actions with a partner, which requires communication and theory of mind capabilities that solo does not test.

OmniGameArena integrates these three modes with the same interface, which makes it possible to measure a complete profile of the agent. A model could be excellent in solo but collapse in PvP because it doesn't know how to react to an unpredictable opponent. The IDC is particularly revealing here: in PvP, the improvement curve can be non-monotonic, with the agent improving then regressing when the opponent adapts to it.
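A non-monotonic curve is easy to flag once the per-round scores are available. The helper below is a hypothetical sketch, not part of the benchmark tooling; the `tolerance` parameter is an assumption added so that small noise is not reported as a regression.

```python
def regressions(scores, tolerance=0.0):
    """Return (round_index, drop) for each round scoring below the previous one.

    round_index is zero-based; drops no larger than `tolerance` are ignored.
    """
    drops = []
    for i in range(1, len(scores)):
        drop = scores[i - 1] - scores[i]
        if drop > tolerance:
            drops.append((i, drop))
    return drops


def is_monotonic(scores, tolerance=0.0):
    """True when the curve never falls by more than `tolerance` between rounds."""
    return not regressions(scores, tolerance)


# Made-up PvP curve: the agent improves, regresses, then partly recovers.
pvp_scores = [40, 55, 62, 50, 58]
print(regressions(pvp_scores))   # prints [(3, 12)]
print(is_monotonic(pvp_scores))  # prints False
```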

This multi-agent dimension brings the benchmark closer to real-world scenarios. Benchmarks such as FutureSim, in which AI agents replay real events, suggest that the ability to adapt to dynamic environments is precisely what separates deployable agents from laboratory demonstrations.


Comparison with GamingAgent — two evaluation philosophies

The GamingAgent repo (presented at ICLR 2026) offers an interesting point of comparison. GamingAgent also evaluates LLM/VLM models on diverse games, but in "vanilla VLM" mode: the model receives an observation and produces an action, without a specific game harness.

OmniGameArena takes the exact opposite approach. The games are built for the benchmark, with documented adapters between each type of agent and the environment. The approach is less "natural" but more scientifically rigorous.

In practice, the two benchmarks are complementary. GamingAgent answers the question "can a VLM play an existing game without adaptation?" OmniGameArena answers "when we equalize the conditions for accessing the game, which agent learns the fastest?"

For a researcher, both are useful. For a developer building a game agent, OmniGameArena is probably more relevant because it better isolates the intrinsic quality of the agent from integration artifacts.


What the results reveal about current models

The detailed tables in the HTML version of the paper show several patterns. As expected, commercial VLMs lead on first-attempt score in solo games, thanks to their stronger scene understanding and planning.

But the IDC reveals surprises. In PvP and Coop games, some specialized keyboard-mouse policies show steeper improvement curves than commercial VLMs after 3-4 rounds. A plausible reading is that sophisticated textual reasoning becomes a handicap when reactivity matters more than reflection.

Open-weight VLMs like Kimi K2.6 (agentic score of 88.1) and GLM-5 (82) show very different IDC profiles from commercial models. Their curve starts lower but climbs more steadily, suggesting better exploitation of iterative feedback. This is not necessarily surprising: these models are often used in agent loops where long-term memory compensates for instantaneous reasoning.

Anthropic's Claude Opus 4.7 (94.3) and GPT-5.5 (98.2) remain the most versatile models, performing well on both the first attempt and improvement dynamics. But their advantage shrinks in multi-agent settings, where coordination and reactivity matter more than pure reasoning.


The infrastructure — how to reproduce and extend the results

A benchmark is only valuable if it is reproducible. OmniGameArena meets that bar.

The 12 UE5 builds are available on Hugging Face as self-contained archives. Each build includes the game environment, the unified action interface, and the adapters for the four agent classes. You download, unzip, and run.
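The download-and-unzip step could be scripted roughly as follows. This is a sketch under assumptions: it uses `snapshot_download` from the `huggingface_hub` package (which must be installed separately), it assumes the builds are shipped as `.zip` files, and the function names are hypothetical. The actual layout of the dataset repository may differ, so check it before relying on this.

```python
import zipfile
from pathlib import Path

REPO_ID = "mxlin043/OmniGameArena"  # dataset id quoted in the article


def download_dataset(local_dir):
    """Fetch the dataset repository; needs `pip install huggingface_hub`."""
    # Imported lazily so the extraction helper works without the package.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)
    return Path(path)


def extract_builds(root, out_dir):
    """Unpack every .zip under root into its own folder (format is assumed)."""
    out_dir = Path(out_dir)
    extracted = []
    # sorted() lists the archives up front, before any file is extracted.
    for archive in sorted(Path(root).rglob("*.zip")):
        target = out_dir / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

A typical call would be `extract_builds(download_dataset("omnigamearena"), "builds")`, after which each extracted folder should contain one packaged game.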

To run an agent, you simply connect to the unified protocol. No need to modify the game build, no custom wrapper. The architecture is designed so that any new model can be plugged in within a few hours.

For teams that want to deploy agents locally without relying on APIs, this architecture is particularly well-suited. An open source AI agent running with Ollama can be tested under the exact same conditions as a commercial model, which was virtually impossible with previous benchmarks.

The Hugging Face dataset (mxlin043/OmniGameArena) also includes the traces of the paper's experiments, allowing for direct comparison with the published results without having to rerun everything.


Limitations acknowledged by the authors

The team at the University of Hong Kong is transparent about the benchmark's constraints. First, the games are built specifically for evaluation. They do not capture all the visual and mechanical complexity of a commercial AAA game. This is a deliberate trade-off: controlled complexity versus raw realism.

Second, the IDC measures improvement over several rounds, but what counts as a "round" varies by game. In a single-player puzzle game, a round is one attempt at solving the puzzle. In a PvP game, it is a complete match. Cross-game comparison of IDCs therefore requires caution.

Third, the benchmark focuses on real-time games. Turn-based games, which represent a significant category (strategy games, card games), are not covered. This is an area that benchmarks like FutureSim approach from a different angle, using sequential scenarios based on real-world events.

These limitations do not detract from the value of the contribution. They simply define the valid scope for interpreting the results.


What this means for the future of AI benchmarks

OmniGameArena is part of a broader trend: the shift from static evaluation to dynamic evaluation. The DeepWeb-Bench benchmark showed that web search agents that perform well on single queries can collapse on multi-step tasks. OmniGameArena shows the same pattern for game agents.

The pattern is clear: taken alone, the first-attempt score is a misleading metric. It measures zero-shot generalization, not an agent's ability to operate in the real world, where agents iterate, adapt, and learn.

The IDC could well become a standard. The idea of plotting an improvement curve rather than capturing a snapshot is transposable to almost all domains: code, information retrieval, robotics. Other benchmarks will very likely start adopting this metric.

For teams building autonomous AI agents, the implication is direct: optimizing only for the first attempt is a short-term strategy. The agents that win in the long run are those that learn the fastest from their mistakes.


❌ Common mistakes

Mistake 1: Confusing first-attempt score and agent capability

A model that excels on the first attempt is not necessarily the best agent. It could be an excellent zero-shot reasoner that cannot integrate feedback. OmniGameArena's IDC exposes precisely this distinction. The solution: always look at the improvement curve, not just the starting point.

Mistake 2: Comparing agents on different interfaces

This is the mistake that most previous benchmarks make. An agent with direct access to game states (internal JSON) is not comparable to an agent that has to read the screen and generate keystrokes. OmniGameArena solves this problem with its unified interface. The solution: use a common protocol, or at the very least, document the difference in information access.

Mistake 3: Ignoring multi-agent mode

Testing only in solo mode gives an incomplete picture. PvP and Coop games test capabilities (adaptation, coordination, theory of mind) that solo mode doesn't touch. The solution: include at least one competitive scenario and one cooperative scenario in any game agent evaluation.

Mistake 4: Using commercial games without environment control

A commercial game receives updates, has variable network conditions, and its internal state is inaccessible. This introduces noise into the measurements. OmniGameArena builds its own environments to eliminate these variables. The solution: favor controlled environments for benchmarking, keeping commercial games for public demonstrations.


❓ Frequently Asked Questions

What exactly is the IDC?

The Improvement Dynamics Curve is a metric that plots an agent's score over several successive rounds of the same game. Instead of a single number, we get a curve that reveals the learning speed, the performance ceiling, and the stability of the agent across iterations.

Who created OmniGameArena?

The team is led by Mingxian Lin, under the supervision of Xiaojuan Qi, at the University of Hong Kong. The paper (arXiv 2606.09826) was published on June 8, 2026, and has 12 co-authors specializing in computer vision and reinforcement learning.

Which models were tested?

The commercial VLMs tested include GPT-5.5, Gemini 3 Pro Deep Think, Claude Opus 4.7, and GPT-5.4 Pro. On the open-weight side, Kimi K2.6 and GLM-5 were evaluated, alongside specialized keyboard-mouse and gamepad policies not based on LLMs.

Does OmniGameArena replace existing benchmarks?

No, it complements them. GamingAgent (ICLR 2026) remains relevant for evaluating the "vanilla" capability of a VLM on existing games. OmniGameArena adds an additional dimension: the improvement dynamics in a controlled and standardized environment.

How do you use the benchmark in practice?

You download the UE5 builds from the Hugging Face dataset (mxlin043/OmniGameArena), connect your agent to the unified action protocol, and run the evaluations. The traces from the original experiments are included to allow for direct comparison.


✅ Conclusion

OmniGameArena marks the transition of AI gaming benchmarks from a snapshot logic to a trajectory logic. The IDC does not replace the first-attempt score; it contextualizes it — and in many cases, it corrects it. For teams building autonomous AI agents, it is the evaluation tool that was missing to measure what really matters: the ability to improve.