AgingBench: AI Agents Age Too
arXiv<br>Code<br>Thread<br>BibTeX
Your Agents Are Aging Too:<br>Agent Lifespan Engineering for Deployed Systems
AI agents have lifespans, and AgingBench measures them: a longitudinal reliability foundation for agent lifespan engineering.
Jianing Zhu*,<br>Yeonju Ro*,<br>John T. Robertson,<br>Kevin Wang,<br>Junbo Li,
Haris Vikalo,<br>Aditya Akella,<br>Zhangyang "Atlas" Wang
The University of Texas at Austin<br>* Equal Contribution
85%
Maximum recall drop across 10 sessions with frozen weights and the same scaffolding (S7 · GPT-4o-mini · OpenHands)
4.5×
Half-life spread from memory policy alone, bigger than any model swap (S1 · careful vs. lossy compaction)
67%
Post-shock cliff from a single flush-history maintenance event, with no recovery (S6 naturalistic)
Distinct aging mechanisms: compression, interference, revision, maintenance
15%
Is Claude Code 4.7 better than 4.6? Mean pytest pass-rate drop on S7 when moving the CLI from Sonnet-4.6 to Opus-4.7.
We're looking for collaborators with production agent traces, sponsors for larger-scale benchmarking, and contributors with new scenarios for agent lifespan engineering.
Abstract
Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model.
We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, where write-time summarization drops future-relevant details; interference aging, where accumulated similar memories crowd out the target fact; revision aging, where changed or derived state is not updated correctly; and maintenance aging, where lifecycle events such as flushing or recompaction trigger regressions. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline.
Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, roughly 400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.
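As a rough illustration of what stage-level diagnosis could look like, the Python sketch below attributes each failed probe to the earliest memory-pipeline stage whose counterfactual patch restores the correct answer, then aggregates the labels into a profile. It is a simplification under stated assumptions, not AgingBench's implementation: the `PairedProbe` fields, the two patch types, and the function names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

STAGES = ("ok", "write", "retrieval", "utilization")


@dataclass(frozen=True)
class PairedProbe:
    """Outcomes for one probe question (all fields are hypothetical)."""
    aged_correct: bool             # answer from the aged agent, unmodified
    store_patched_correct: bool    # rerun with the gold fact re-inserted into the memory store
    context_patched_correct: bool  # rerun with the gold fact placed directly in the context


def attribute_stage(probe: PairedProbe) -> str:
    """Label a probe with the earliest stage whose patch fixes the answer."""
    if probe.aged_correct:
        return "ok"
    if probe.store_patched_correct:
        return "write"        # the fact was lost or distorted at write time
    if probe.context_patched_correct:
        return "retrieval"    # the fact was stored but never surfaced
    return "utilization"      # the fact was in context and still misused


def diagnostic_profile(probes: Iterable[PairedProbe]) -> Dict[str, float]:
    """Return the fraction of probes attributed to each stage."""
    counts = Counter(attribute_stage(p) for p in probes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {stage: counts[stage] / total for stage in STAGES}
```

Under this sketch, two agents that give the same wrong answer can be told apart: a write-dominated profile points at the compaction policy, while a retrieval-dominated one points at the retriever.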
Agent Lifespan Engineering (ALE)<br>Three key questions:
How long does a deployed agent remain reliable?
How does reliability decay: through compression, interference, revision, or maintenance?
Where should repair target: writing, retrieval, utilization, or the memory lifecycle?
AgingBench is NOT:<br>biological aging<br>one-shot hallucination<br>just long-context evaluation
Fresh deployment vs. aged agent: same model, same input/output surface. After enough sessions, the memory store clutters, signals fade, and the agent starts looping on itself.
Three ways in
AgingBench is a paper, a leaderboard, and a runnable benchmark. Pick the door that matches what you want to do next.
For builders
Run AgingBench
One command, ten minutes. Three release modes (Lite / Full / Lifespan Check) and a plug-and-play surface.
→ Get started
For comparison<br>Leaderboard
Multi-track results: model swaps, custom memory policies, runtime controllers, autonomous agents.
→ See results
For depth<br>Docs & methodology
Seven scenarios in detail. AgingCard schema. Counterfactual diagnosis. Contributing and roadmap.
→ Read the docs
① How long do agents remain reliable? Day 1 → Day N
Across scenarios, models, and memory policies, agents that pass day-one evaluation often show longitudinal degradation across sessions. See more results in our evaluation →
Day 1 → Day N across the four mechanisms. Chat-bubble examples (left) show how each failure mode tends to read to a user (omission, confusion, staleness, collapse). Curves (center) show recall/precision declining across sessions for representative models on each mechanism.
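One simple way to summarize a per-session reliability curve is a half-life plus a maximum relative drop. The Python sketch below assumes that "half-life" means the first session at which recall falls to half of its day-one value; AgingBench's exact definition may differ. The function names are hypothetical, and any curve passed in here is illustrative input, not a benchmark result.

```python
from typing import Optional, Sequence


def half_life(recall_by_session: Sequence[float], fraction: float = 0.5) -> Optional[int]:
    """Return the first 1-indexed session whose recall is at or below
    `fraction` of the day-one recall, or None if that never happens."""
    if not recall_by_session:
        raise ValueError("need at least one session")
    baseline = recall_by_session[0]
    if baseline <= 0:
        raise ValueError("day-one recall must be positive")
    threshold = fraction * baseline
    for session, recall in enumerate(recall_by_session, start=1):
        if recall <= threshold:
            return session
    return None  # the curve never decays to the threshold in the observed window


def max_relative_drop(recall_by_session: Sequence[float]) -> float:
    """Return the largest drop from day-one recall, as a fraction of day one."""
    if not recall_by_session:
        raise ValueError("need at least one session")
    baseline = recall_by_session[0]
    if baseline <= 0:
        raise ValueError("day-one recall must be positive")
    return (baseline - min(recall_by_session)) / baseline
```

Comparing the `half_life` of two memory policies run on the same scenario and model gives a spread ratio of the kind a policy comparison would report; a return value of `None` means the run was too short to observe the half-life, not that the agent never ages.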
② How does reliability decay? Four aging mechanisms
Decay is rarely a single phenomenon. AgingBench organizes the observed failure patterns into four...