Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

[2606.13608] AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

arXiv:2606.13608 (cs)

[Submitted on 11 Jun 2026]

Title: AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Authors: Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy, Alexandre Drouin, Alexandre Lacoste, Ramayya Krishnan, Elham Tabassi, Yu Su, Victor Barres, Chenguang Wang, Wenbo Guo, Dawn Song

Abstract: Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility.

To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.
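
The single-interface idea in the abstract can be illustrated with a toy sketch. It is not the AgentBeats implementation and uses no real A2A or MCP library: the names `Agent`, `handle_task`, `AdditionSubject`, and `JudgeAgent` are hypothetical, and a plain Python method call stands in for a protocol message. The point is only that the judge holds the tasks and scoring logic, while any subject that speaks the same interface can be assessed without a benchmark-specific harness.

```python
from dataclasses import dataclass
from typing import Protocol


class Agent(Protocol):
    """Hypothetical stand-in for an agent reachable over a task protocol."""

    def handle_task(self, task: str) -> str: ...


class AdditionSubject:
    """Toy subject agent: answers tasks of the form 'a+b'."""

    def handle_task(self, task: str) -> str:
        left, right = task.split("+")
        return str(int(left) + int(right))


@dataclass
class JudgeAgent:
    """Toy judge agent: owns the test cases and the scoring logic."""

    cases: dict[str, str]  # task -> expected answer

    def assess(self, subject: Agent) -> float:
        # Send every task through the one shared interface and count matches.
        passed = sum(
            subject.handle_task(task) == expected
            for task, expected in self.cases.items()
        )
        return passed / len(self.cases)


judge = JudgeAgent(cases={"1+2": "3", "10+5": "15"})
score = judge.assess(AdditionSubject())
print(score)
```

Swapping in a different subject requires no change to the judge, which is the separation of assessment logic from agent implementation that the abstract describes.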

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2606.13608 [cs.AI]

(or arXiv:2606.13608v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2606.13608

arXiv-issued DOI via DataCite (pending registration)

Submission history: From Xiaoyuan Liu. [v1] Thu, 11 Jun 2026 17:23:54 UTC (454 KB)
