A Case for Simulation-Driven Resilience in Agentic Data Systems
Skip to main content
A Case for Simulation-Driven Resilience in Agentic Data Systems
Get link
Other Apps
June 07, 2026
As I mentioned in my previous post, I traveled to San Jose at the end of May for the ACM CAIS conference. On Day 0, I gave a very short talk at the Supporting our AI Overlords (SAO) workshop. This post is the promised summary of our paper, "A Case for Simulation-Driven Resilience in Agentic Data Systems", joint work with Aleksey Charapko (University of New Hampshire) and Akshat Vig (MongoDB).
Metastability is critical for building the next generation of distributed systems<br>Our story starts with metastability. Metastability is the failure mode where the mechanisms built to protect the system (retries, queues, timeouts, load shedding) turn into amplifiers. Even after the trigger that caused the overload goes away, the system stays behind, churning through busy work, perpetually trying to catch up with the remnants of failed and behind-schedule tasks. It's a bit like missing some foundational math in high school. You spend so long backfilling the old gaps that you never keep up with the new material piling on top, so you stay permanently behind. The catching up work is what keeps you behind, which requires catching up work later, which keeps you behind. (This is presumably why I'm not a rich machine learning scientist today at Deep Mind.)<br>Avoiding and tolerating metastable failures is critical to building the next generation of reliable distributed systems. Aleksey's Metastable Failures in the Wild study (OSDI'22) cataloged these failures from real production incidents. They matter so much because they are the hard faults that remain. We have largely learned to deal with the straightforward ones (crashes, corruptions, dropped packets), which leaves emergent performance failures as the final boss. Metastable failures are responsible for a disproportionate share of critical cloud unavailability incidents, with no single broken part to point at and no obvious way to fix/reset them when they emerge. The cloud economics make the problem worse. Providers and operators have every incentive to run with the absolute minimum of excess capacity and to trim slack and "waste". But this thin margin is exactly where metastability thrives.
Agents supercharge the feedback loops and put metastability on steroids<br>As AI agents are replacing humans as the primary clients of modern data systems, they are bringing a qualitative shift in workload characteristics. Agents retry aggressively while mutating the query on each attempt. They fan out into bursty parallel sub-tasks. They hold transactions open while they wait on an external LLM to provide the next step. The "Supporting our AI overlords" paper showed that agents create ~20x more branches and perform ~50x more rollbacks than humans do. These behaviors violate assumptions baked into every layer of a modern data system. Execution control assumes stationary arrivals. Caching assumes temporal locality. Concurrency control assumes bounded hold times. Agents break all three at once!
We propose simulation-driven resilience to address this problem<br>Simulation can enable us to systematically explore the agent-database boundary, and discover/prevent metastability failures before a production incident forces a reactive (and nonworking) patch. We propose a simulation based approach because only simulation is cheap enough to sweep an enormous trigger space, and deterministic enough to replay and dissect every failure it finds.
Benchmarks measure steady state. But, metastability is transient and emergent: it lives in the sequence of rare events, not in the average.<br>Queueing theory assumes mostly stationary independent arrivals. Agents break these assumptions.<br>Testing with the production system is hopeless for design sweeps and gives you almost no observability. When your database falls over under load, you still don't know what went wrong, or how it went wrong, and you can't explore the design space because you have no feedback to explore it with.
MESSI finds failures in the seams<br>Metastability scurries in the seams/interaction of the composition of subsystems, so we need a tool that lets us look there. Aleksey developed MESSI (MEtaStability SImulator), a discrete-event simulation framework for exploring metastability dynamics in distributed systems. MESSI enables modeling any (sub)system as a directed graph and composition of (sub)systems. Logic Nodes implement routing and state policy. Processors model the physical resource constraints (queues, I/O delays, network latency). Individual work items, QItems, carry state as they traverse the graph. There is a clear separation of roles: policy lives in the nodes, resource contention lives in the edges, and you can vary each independently.
Because this is a simulator, it is deterministic and replayable, and it exposes the full...