Agent Arena: Causal Evaluation of Agents in the Real World
Agents are increasingly doing real work. From chat to terminal to OpenClaw, users everywhere are interacting with complex agents, each comprising a model and a harness with many subcomponents and tools. As a result, the task distribution has greatly expanded. This makes evaluating agents progressively more difficult, because task coverage and task complexity are growing in tandem. We need an agent evaluation that scales with usage and capability.<br>Today we are releasing the Agent Arena leaderboard. Arena has always focused on evaluations in the real world. As such, Agent Arena collects and analyzes millions of in-the-wild interactions from people using Agent Mode on arena.ai/agent to do their jobs: software engineering, financial analysis, and more. From our observations of these agents running on our platform, we derive our first Agent Arena leaderboard, shown below:
Agent Arena Leaderboard
Agent Arena Leaderboard; calculated from an aggregate of all signals; net improvement is the causal treatment effect; error bars are 95% CIs. Color runs green (top of the board) → red (bottom).
The methodology powering the Agent Arena Leaderboard is different from our previous arenas. Rather than pairwise votes, rankings are calculated using a methodology we call causal tracing. Causal tracing treats the agent as a multi-component system, with each component selection representing a possible treatment. We observe individual point-wise traces and measure signals such as task success rates, verbal feedback, tool error recovery, tool hallucinations, and, over time, much more. Then, by randomizing the component selections, we create a multi-intervention randomized controlled trial in which we can aggregate measurements to estimate causal treatment effects. We refer to these effects as "net improvement" in the figure above. The causal framework produces an interpretable ranking that represents the improvement in agent performance due to a component selection. This decouples the contributions of the main orchestrator model, any subagents, image generation models, and the different elements in the harness, letting us combine multiple signals into one coherent leaderboard.<br>This first leaderboard is the result of our causal evaluation of orchestrator models, the main LLMs that choose which tools to call. Rankings of other aspects of the agentic harness are coming soon. We include more methodological detail in the statistical-methodology section below.<br>Per-Signal Leaderboards<br>Every Agent Arena session contains a stream of rich feedback. Users iterate with the agent in natural language, expressing approval, frustration, or clarification turn by turn. They decide whether to download an artifact the agent produced. They click explicit approve / disapprove buttons. They issue in-line corrections when the agent goes off-track. And the agent, on its side, is interacting with an environment that talks back continuously: shell exit codes, tool errors, the absence of a tool it tried to call.
Agent Mode lets us extract all of these signals — explicit user feedback, implicit user feedback, and feedback from the agent's environment. After we compute per-session outcomes for each signal, we turn them into leaderboards with causal methods and then aggregate them into the headline leaderboard. We present our first 5 signals today, and we plan to measure more in the near future.
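To make the idea concrete, here is a minimal sketch of how a per-signal treatment effect could be estimated once the orchestrator model is assigned at random. It is not our production estimator. The model names, the `assign_model` and `net_improvement` helpers, the choice of "all other sessions" as the comparison group, and the normal-approximation interval are all assumptions made for illustration.

```python
import math
import random

# Hypothetical orchestrator models; the names are placeholders.
MODELS = ["model_a", "model_b", "model_c"]


def assign_model(rng):
    """Pick the orchestrator for a new session uniformly at random.

    Random assignment is what lets a simple difference in means be read
    as a causal effect rather than a correlation.
    """
    return rng.choice(MODELS)


def net_improvement(sessions, model, z=1.96):
    """Difference-in-means estimate for one signal.

    sessions: list of (assigned_model, outcome) pairs, outcome in [0, 1].
    Compares sessions that received `model` with all other sessions
    (an assumed baseline) and returns (estimate, ci_low, ci_high),
    using a normal-approximation interval (z=1.96 for roughly 95%).
    """
    treated = [y for m, y in sessions if m == model]
    control = [y for m, y in sessions if m != model]
    if len(treated) < 2 or len(control) < 2:
        raise ValueError("need at least two sessions in each group")

    mean_t = sum(treated) / len(treated)
    mean_c = sum(control) / len(control)
    # Unbiased sample variances of each group.
    var_t = sum((y - mean_t) ** 2 for y in treated) / (len(treated) - 1)
    var_c = sum((y - mean_c) ** 2 for y in control) / (len(control) - 1)

    estimate = mean_t - mean_c
    std_err = math.sqrt(var_t / len(treated) + var_c / len(control))
    return estimate, estimate - z * std_err, estimate + z * std_err


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: each session gets a random model and a coin-flip outcome.
    toy = [(assign_model(rng), rng.randint(0, 1)) for _ in range(300)]
    for name in MODELS:
        print(name, net_improvement(toy, name))
```

In this sketch each signal would get its own call to `net_improvement`, with the outcome column swapped for that signal's per-session result. How the per-signal estimates are then combined into the headline number is deliberately left out, since the aggregation is not described here.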
Per-Signal Rankings
Each model's score on the canonical sub-signals that compose the aggregate (τ̂).
Each individual leaderboard signal. Cells shaded green → red by score within each column. The aggregate of all signals is shown on the left.
The headline leaderboard aggregates the following signals:<br>Confirmed success — the user marks a task as a success or failure using the Arena UI. Arena gives users approve and disapprove buttons on every turn; we use the final approval or disapproval of a given task's trajectory to determine the outcome. (There can be more than one task per session.)<br>Praise vs. complaint — the user praises or complains about the agent's output. For each task we identify messages expressing explicit verbal praise ("looks great", "this is exactly what I needed") or explicit verbal complaint ("this is broken", "you misunderstood entirely"). The task is marked a success if praise outnumbers complaints.<br>Steerability — the agent executes on user corrections. When a user issues an in-line correction ("no, do X instead", "you misread the file"), the agent should attempt to fix it. If the user accepts the fix, we mark the correction successful; if they reject it or give up, unsuccessful. When doing real work, mistakes are inevitable — this signal captures whether these errors are quickly resolved.<br>Bash recovery — turns taken to recover from a bash error. When the agent issues a bash command...