AI Agent Debugs Postgres I/O Spike & Invents a Crisis
An AI Agent Debugs a Postgres I/O Spike: Multixacts, SLRU Caches, and a Crisis It Invented
By: Ian McGraw
June 15, 2026
A few weeks ago one of our Aurora Postgres instances started throwing off a strange signal. IO:SLRURead waits were spiking, and the Aurora storage read metrics (readIOsPS, readThroughput) were climbing right along with them. Nothing was down, but something was clearly wrong, and the cause was not obvious from the dashboards.
Figure: AWS RDS dashboard showing sudden IOPS spikes.<br>I decided to run the investigation through Claude, treating it as an SRE. I would paste in query results, and it would reason about them and tell me what to look at next. For most of the session this was impressive. It moved faster than I would have, knew exactly which system catalogs to interrogate, and ruled out the usual suspects methodically.<br>It also spent the better part of two hours convinced the database was heading toward a catastrophic failure that was never going to happen.<br>This post describes a specific structural failure mode of agentic debugging. The agent got nearly every individual step right, made one small mistake early, and the loop quietly amplified that mistake into a full-blown false emergency. It is also the clearest example I have run into of why a human belongs in the loop when an agent is working near production.<br>The ninety-nine percent that went right<br>A quick bit of background, because the failure only makes sense once you know what the agent was looking at. Postgres keeps several small, fixed-size in-memory caches called SLRUs for bookkeeping data: the commit log, subtransactions, multixacts, and a few others. When you see IO:SLRURead waits, it means lookups are missing one of those caches and going to storage instead. The job is to figure out which cache, and why.<br>Claude worked this methodically, and well. It started from the most common cause of SLRU pressure on Aurora, a long-running transaction holding back the cleanup horizon, and checked for it directly. It pulled pg_stat_activity for old transactions, looked for held-back xmin, then checked replication slots and prepared transactions. 
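The horizon-blocker checks described above can be sketched as three catalog queries. This is an illustrative sketch only: the session's actual SQL for this step is not shown, so the column selection, ordering, and LIMIT here are assumptions rather than what the agent ran.

```sql
-- Sketch (assumed queries, not the agent's): three places a cleanup
-- horizon can be held back.

-- 1. Sessions whose snapshot pins an old xmin.
SELECT pid, state, xact_start, backend_xmin,
       age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;

-- 2. Replication slots pinning xmin or catalog_xmin.
SELECT slot_name, slot_type, active, xmin, catalog_xmin
FROM pg_replication_slots;

-- 3. Prepared (two-phase) transactions that were never resolved.
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;
```

A large xmin_age in the first result, or any rows from the second or third with an old xmin or prepared timestamp, would point to a held-back horizon.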
All clean:<br>All the usual horizon-blockers are clean — no slots, no prepared xacts, no long-running queries.<br>Then it went to pg_stat_slru, which breaks reads down per cache, and correctly isolated the culprit:<br>SELECT name, blks_hit, blks_read<br>FROM pg_stat_slru<br>ORDER BY blks_read DESC;<br>name | blks_hit | blks_read<br>-----------------+-------------+-----------<br>MultiXactMember | 15514387132 | 11876153<br>MultiXactOffset | 15520306300 | 793<br>other | 6903615 | 108<br>Subtrans | 0 | 0<br>Xact | 0 | 0<br>...<br>MultiXactMember was the only cache with meaningful storage reads, around 11.8 million of them, while every other SLRU sat near zero. It even read the lock modes off a live multixact and recognized the FOR KEY SHARE plus FOR NO KEY UPDATE signature of foreign-key contention. Every one of these steps is what a strong human SRE would have done, and it did them in a fraction of the time.<br>This matters for what follows. The agent's reasoning on these early steps was sound, and that credibility is what made the later mistake so easy to accept.<br>One function call<br>To see why MultiXactMember was missing cache, the agent needed to know how big the live multixact range was. 
So it queried the age of the oldest multixact reference on each table, using a standard-looking monitoring query built around Postgres's age() function:<br>SELECT n.nspname, c.relname,<br>age(c.relminmxid) AS mxid_age,<br>pg_size_pretty(pg_relation_size(c.oid)) AS size<br>FROM pg_class c<br>JOIN pg_namespace n ON n.oid = c.relnamespace<br>WHERE c.relkind IN ('r','m','t') AND c.relminmxid <> '0'<br>ORDER BY age(c.relminmxid) DESC<br>LIMIT 20;<br>nspname | relname | mxid_age | size<br>------------+-----------------------------+------------+------------<br>pg_toast | pg_toast_36154 | 1782314835 | 24 kB<br>public | base_role_to_inheritor_role | 1782313765 | 8192 bytes<br>public | attestation_records | 1782311944 | 8192 bytes<br>public | policies | 1782311944 | 8192 bytes<br>pg_catalog | pg_depend | 1782308199 | 216 kB<br>...<br>The numbers came back enormous. Every table reporting a multixact reference showed an mxid_age around 1.78 billion.<br>At this point, the investigation took a turn. Multixact IDs are a 32-bit counter, and if the live range ever reaches about 2.1 billion without cleanup, Postgres stops accepting writes to protect itself. A value of 1.78 billion means we were at 83% of that hard limit. So the agent declared the real problem found:<br>Found it. The diagnostic is unambiguous now.<br>And then:<br>you're at ~83% of the multixact wraparound limit (2³¹ = 2.15B). For reference, autovacuum_multixact_freeze_max_age defaults to 400M — you're 4× past where...