The hard part of AI root cause analysis is no longer the model


Engineering · Nikolay Sivko · June 30, 2026 · 10 min read

Every few weeks someone tells me root cause analysis is a solved problem now: pipe your telemetry into an LLM, let it tell you what broke. I wish it were that easy. After years on this, I think "can AI do RCA?" is the wrong question, because doing RCA with an LLM is really two separate jobs, and the answer is different for each. They break in completely different ways, so it's worth pulling them apart.

One is reasoning: can the model take the data in front of it and connect the dots? A service slows down. Three facts are on the table at once: it's starved of CPU, the node's CPU is maxed, and a neighbor on that node is eating all of it. A model that reasons ties them into one story, a noisy neighbor. A weaker one reports three unrelated "issues", or grabs the loudest symptom and calls it the cause.

The other is the harness: everything around the model. What data you put in front of it, in what shape. Usually it means tool-calling, letting the model decide what to fetch and when to stop. Plenty goes wrong here, and none of it is about whether the model could reason. It just never got the right data.

People mix these two up all the time. A model gives a bad answer, and everyone says LLMs can't do RCA. But usually the model just never got the data it needed. It's not that it couldn't reason; it never had a fair shot. And until you separate the two, you can't tell which one is the real problem.

Take the harness out of the picture

So we did, on purpose. With Coroot's AI RCA, we don't hand the model tools and send it off to investigate. Instead, a deterministic pipeline does the heavy lifting: it correlates the signals and turns them into findings. The model gets those findings in one focused context, not the raw telemetry. No tools, no agent loop. Everything it needs to find the answer is already there.
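Here's a minimal sketch of what "findings in one focused context" could look like. It is not Coroot's actual pipeline: the `Finding` type, its fields, and `build_context` are names made up for illustration, and the findings are paraphrased from the scenario in this post.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One pre-correlated observation (hypothetical shape, not Coroot's schema)."""
    source: str   # where the observation came from, e.g. "catalog"
    summary: str  # plain-language statement of what was observed


def build_context(findings: list[Finding], question: str) -> str:
    """Render every finding into a single prompt. No tools, no follow-up fetches."""
    lines = ["Findings:"]
    for i, finding in enumerate(findings, start=1):
        lines.append(f"{i}. [{finding.source}] {finding.summary}")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


findings = [
    Finding("front-end", "HTTP 502 rate increased"),
    Finding("catalog", "TCP round-trip time to db-main increased"),
    Finding("kubernetes-events", "Chaos Mesh NetworkChaos experiment started"),
]
prompt = build_context(findings, "What is the root cause?")
print(prompt)
```

The point of the shape is that the model's input is fixed before it runs. Whatever it gets wrong after that can't be blamed on a fetch it didn't make.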

That boils the whole thing down to reasoning. If the model has the full context and still misses the root cause, there's no one else to blame. Not the harness, not missing data. Just the model. And that's finally something you can measure cleanly.

So here's the experiment. Take a real failure where the context already holds the answer, hand that same dump to a bunch of models, and see which ones can distill it into the actual root cause. No fetching, no deciding what to look at. Just reasoning. And it's harder than it sounds: even with the answer sitting in the data, there are traps in there that can walk a model straight into the wrong conclusion.
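The shape of that experiment can be sketched in a few lines. Everything here is an assumption for illustration: the two "models" are stub functions rather than real API calls, and the substring check is a crude stand-in for whatever grading you actually use.

```python
from typing import Callable

# A "model" is any function from prompt text to answer text.
# In a real run each one would wrap a provider's API; these are stand-ins.
Model = Callable[[str], str]


def names_root_cause(answer: str, required_terms: list[str]) -> bool:
    """Crude grading: the answer must mention every required term (case-insensitive)."""
    lowered = answer.lower()
    return all(term.lower() in lowered for term in required_terms)


def run_experiment(
    prompt: str, models: dict[str, Model], required_terms: list[str]
) -> dict[str, bool]:
    """Send the identical prompt to every model and grade each answer the same way."""
    return {
        name: names_root_cause(model(prompt), required_terms)
        for name, model in models.items()
    }


# Two fake models: one finds the cause, one falls for the "blame the database" trap.
models = {
    "model-a": lambda p: "A NetworkChaos experiment is injecting delay between catalog and db-main.",
    "model-b": lambda p: "The database got slower; queries on db-main take longer.",
}
results = run_experiment("same context for every model", models, ["NetworkChaos", "delay"])
print(results)
```

Because the prompt is identical for every model, any difference in the results comes from the model and not from what it was shown.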

The test

I picked one scenario: a network delay between the catalog service and its Postgres database, db-main. The queries slow down, timeouts spread, and front-end starts serving 502s. But nothing is actually wrong with the database or the service. The culprit is a Chaos Mesh NetworkChaos experiment running in the cluster, injecting delay on the catalog↔db-main path, and it shows up right in the Kubernetes events. So the fix is to delete the experiment, and just as importantly, the schedule that would spin it right back up.

You can see the problem fan out on the service map: front-end shows errors and latency, but tracing the dependency chain leads to catalog, and that's where the real signal lives, in the TCP network and connection latency to db-main.

Notice the map also flags things that aren't the root cause: latency on kafka, CPU on catalog and db-main, storage on the database. In a distributed system one problem bleeds into metrics everywhere, and some of them point the wrong way. Take the database. When the round-trip time between catalog and db-main went up, the client started getting its query responses slower. But Postgres times a query from the first byte it receives to the last byte it sends back to the client, so that network delay gets counted as part of the query time. Read pg_stat_statements and it looks like the database suddenly got slower at the exact same queries.

It didn't. The extra time was on the wire, not inside Postgres. A naive read blames the database and moves on, which is exactly the trap. And it isn't a bug that the pre-processing surfaces a signal like this. Telling a real cause from its downstream effects is the model's job in the next step, and it takes real reasoning. More data here isn't a problem, as long as the model can reason well enough not to be fooled by it.
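The accounting behind that trap fits in a few lines. This is a toy model with made-up numbers. It assumes, as described above, that time spent waiting on the network lands inside the duration recorded for the query. `measured_query_ms` is a hypothetical helper, not anything Postgres exposes.

```python
def measured_query_ms(server_work_ms: float, wire_wait_ms: float) -> float:
    """Duration as the database's timer sees it: real work plus time waiting on the network."""
    return server_work_ms + wire_wait_ms


# Hypothetical numbers, chosen only to illustrate the effect.
SERVER_WORK_MS = 4.0  # identical before and during the incident

before = measured_query_ms(SERVER_WORK_MS, wire_wait_ms=1.0)
during = measured_query_ms(SERVER_WORK_MS, wire_wait_ms=101.0)  # 100 ms of injected delay

print(f"measured before: {before} ms, measured during: {during} ms")
print(f"increase: {during - before} ms, all of it from the wire")
```

The per-query statistic moves by the full injected delay while the work done inside the database stays flat. That's why the number alone can't tell you where the time went.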

Alongside those traps, the real evidence is all there. Coroot traced the propagation path, found the network RTT to db-main tracking the slowdown, and flagged the Kubernetes event showing the chaos experiment started right when things broke. The answer is right there in the prompt, nothing hidden, nothing behind a tool call.

I then reused that exact same prompt against every model, asking each the same three things: what's the root...
