Claude Debugs a Postgres Alarm: Multixacts, SLRU Caches, and a False Crisis



An AI Agent Debugs a Postgres I/O Spike: Multixacts, SLRU Caches, and a Crisis It Invented

By: Ian McGraw

June 15, 2026

A few weeks ago one of our Aurora Postgres instances started throwing off a strange signal. IO:SLRURead waits were spiking, and the Aurora storage read metrics (readIOsPS, readThroughput) were climbing right along with them. Nothing was down, but something was clearly wrong, and the cause was not obvious from the dashboards.

[Figure: AWS RDS dashboard showing sudden IOPS spikes.]

I decided to run the investigation through Claude, treating it as an SRE. I would paste in query results, it would reason about them and tell me what to look at next. For most of the session this was impressive. It moved faster than I would have, knew exactly which system catalogs to interrogate, and ruled out the usual suspects methodically. It also spent the better part of two hours convinced the database was heading toward a catastrophic failure that was never going to happen.

This post describes a specific structural failure mode of agentic debugging. The agent got nearly every individual step right, made one small mistake early, and the loop quietly amplified that mistake into a full-blown false emergency. It is also the clearest example I have run into of why a human belongs in the loop when an agent is working near production.

The ninety-nine percent that went right

A quick bit of background, because the failure only makes sense once you know what the agent was looking at. Postgres keeps several small, fixed-size in-memory caches called SLRUs for bookkeeping data: the commit log, subtransactions, multixacts, and a few others. When you see IO:SLRURead waits, it means lookups are missing one of those caches and going to storage instead. The job is to figure out which cache, and why.

Claude worked this methodically, and well. It started from the most common cause of SLRU pressure on Aurora, a long-running transaction holding back the cleanup horizon, and checked for it directly. It pulled pg_stat_activity for old transactions, looked for held-back xmin, then checked replication slots and prepared transactions. All clean:

"All the usual horizon-blockers are clean — no slots, no prepared xacts, no long-running queries."
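The horizon-blocker checks described above can be sketched as three catalog queries. This is a minimal sketch, not the exact queries from the session: the column selection, ordering, and LIMIT are assumptions chosen for illustration, though every view and column named is a standard Postgres one.

```sql
-- 1. Open transactions and the snapshots (xmin) they hold back, oldest first.
SELECT pid,
       state,
       xact_start,
       age(backend_xid)  AS xid_age,   -- NULL if no XID has been assigned
       age(backend_xmin) AS xmin_age   -- NULL if no snapshot is held
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL
   OR backend_xmin IS NOT NULL
ORDER BY xact_start NULLS LAST
LIMIT 10;

-- 2. Replication slots that pin xmin or catalog_xmin.
SELECT slot_name,
       slot_type,
       active,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;

-- 3. Prepared (two-phase) transactions that were never resolved.
SELECT gid,
       prepared,
       owner,
       database,
       age(transaction) AS xid_age
FROM pg_prepared_xacts
ORDER BY prepared;
```

An empty result from the second and third queries, and no old entries in the first, corresponds to the "all clean" outcome reported here.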
Then it went to pg_stat_slru, which breaks reads down per cache, and correctly isolated the culprit:

    SELECT name, blks_hit, blks_read
    FROM pg_stat_slru
    ORDER BY blks_read DESC;

          name       |  blks_hit   | blks_read
    -----------------+-------------+-----------
     MultiXactMember | 15514387132 |  11876153
     MultiXactOffset | 15520306300 |       793
     other           |     6903615 |       108
     Subtrans        |           0 |         0
     Xact            |           0 |         0
     ...

MultiXactMember was the only cache with meaningful storage reads, around 11.8 million of them, while every other SLRU sat near zero. It even read the lock modes off a live multixact and recognized the FOR KEY SHARE plus FOR NO KEY UPDATE signature of foreign-key contention. Every one of these steps is what a strong human SRE would have done, and it did them in a fraction of the time.

This matters for what follows. The agent's reasoning on these early steps was sound, and that credibility is what made the later mistake so easy to accept.

One function call

To see why MultiXactMember was missing cache, the agent needed to know how big the live multixact range was. So it queried the age of the oldest multixact reference on each table, using a standard-looking monitoring query built around Postgres's age() function:

    SELECT n.nspname,
           c.relname,
           age(c.relminmxid) AS mxid_age,
           pg_size_pretty(pg_relation_size(c.oid)) AS size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind IN ('r','m','t')
      AND c.relminmxid <> '0'
    ORDER BY age(c.relminmxid) DESC
    LIMIT 20;

      nspname   |           relname           |  mxid_age  |    size
    ------------+-----------------------------+------------+------------
     pg_toast   | pg_toast_36154              | 1782314835 | 24 kB
     public     | base_role_to_inheritor_role | 1782313765 | 8192 bytes
     public     | attestation_records         | 1782311944 | 8192 bytes
     public     | policies                    | 1782311944 | 8192 bytes
     pg_catalog | pg_depend                   | 1782308199 | 216 kB
     ...

The numbers came back enormous. Every table reporting a multixact reference showed an mxid_age around 1.78 billion. At this point, the investigation took a turn.
Multixact IDs are a 32-bit counter, and if the live range ever reaches about 2.1 billion without cleanup, Postgres stops accepting writes to protect itself. A value of 1.78 billion means we were at 83% of that hard limit. So the agent declared the real problem found:

"Found it. The diagnostic is unambiguous now."

And then:

"you're at ~83% of the multixact wraparound limit (2³¹ = 2.15B). For reference, autovacuum_multixact_freeze_max_age defaults to 400M — you're 4× past where..."
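The percentages in the agent's claim can be reproduced with plain arithmetic. The sketch below is illustrative only: it takes the largest mxid_age value from the query output (1,782,314,835) at face value, an assumption made purely to retrace the agent's reasoning, and compares it with 2³¹ and with the 400 million default the agent cites. The column aliases are hypothetical names chosen for readability.

```sql
-- Retrace the agent's arithmetic from the numbers shown in the post.
-- Assumption: the reported figure is taken at face value.
SELECT round(100.0 * 1782314835 / 2147483648, 1)  AS pct_of_2_pow_31,
       round(1782314835::numeric / 400000000, 2)  AS multiple_of_400m_default;

-- Read the value actually configured on the instance,
-- which may differ from the 400M default.
SHOW autovacuum_multixact_freeze_max_age;
```

The first query works out to roughly 83% and a little under 4.5 times the default, which matches the figures the agent quoted.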
