How we know if our agent is right


How do we know our agent is right? We get the question all the time, internally and externally, and we don't have a clean answer.

In the last 60 days, our CI failure diagnosis agent ran 36,564 investigations on top of 5.7M CI jobs and 14.4 billion log lines. Median time to diagnosis was 134 seconds. Average cost per investigation was about $0.29. Completion rate was 96.6%. And we still can't tell you what its accuracy is, with one number, with a straight face.

(For context, Mendral is a team of AI DevOps agents that helps engineering teams ship faster. The agents analyze CI failures, fix flaky tests, spot performance regressions, and catch supply chain attacks before they hit production. They open fix PRs when confidence is high enough. This post is about how we evaluate one slice of that work: CI failure diagnosis.)

We don't have a public benchmark. We don't have a labeled dataset of "this CI failure had this root cause and this fix." We can't replay an investigation deterministically because the world keeps moving (the bug gets fixed, the dependency moves, the runner restarts). And every customer's repo is different.

What we have is production traffic, a handful of noisy signals, and two months of arguments inside the team about which signal to trust. The rest of this post is what we figured out.

Why CI agent eval is harder than it looks

If you're building a coding assistant, you have benchmarks. SWE-bench gives you 2,294 GitHub issues with labeled fixes. HumanEval gives you function signatures with hidden test suites. The eval problem is mostly about being honest with the data.

DevOps agents don't have that for CI. Three reasons.

First, no public benchmark exists. A CI failure isn't a self-contained problem. It's a state of a repo, a workflow file, a set of test runs, a history of recent commits, and the operational state of the runner that produced it. There's no SWE-bench equivalent because the inputs aren't even file-shaped.

Second, multiple fixes can be correct. A flaky test can be fixed by adding a retry, by adding a sleep, by removing a race condition in the code under test, by isolating shared state between tests, or by deleting the test entirely. We've seen all five in production. They're correct in different senses, and a "good" agent should pick the right one for the team's situation, not the most aggressive one.

Third, state changes after the fact. By the time you want to grade a diagnosis, the broken branch has been fixed, the runner has been restarted, the flake has stopped firing, or the dependency has moved. The investigation can't be replayed cleanly because the inputs no longer exist.

Every eval signal we have is a proxy. Some are cleaner than others. None of them are ground truth.

What we measure, and why each signal is noisy

Our actual signal stack, with 60-day counts:

| Signal | 60-day count | What it tells us | Noise |
| --- | --- | --- | --- |
| Auto-Remediation PR merge rate | 628 PRs decided | Did the team accept the fix? | Captures fix-style preference, not just correctness |
| Dismiss-as-invalid | 81 incidents | Team said "this isn't a real problem" | Cleanest, but rare |
| User-initiated reassessment | 94 sessions | Team said "you got this wrong, here's context" | Cleanest, but rare |
| System-initiated reassessment | 8,523 sessions | Agent re-ran on new context | Not a grade, but sometimes flips the diagnosis |
| Slack thumbs (positive/negative) | 70 sessions | Explicit feedback | 0.14% coverage on completed sessions |

The highest-volume signal (auto-remediation merge rate) is also the noisiest. The cleanest signals (dismiss-as-invalid, user reassessment) cover less than 2% of investigations.
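To make the coverage gap concrete, here is a minimal Python sketch that divides each signal's 60-day count by the 36,564 investigations quoted at the top of the post. Using investigations as one common denominator is an assumption made for illustration; the post itself quotes some signals against incidents and others against completed sessions. The dictionary keys and the `coverage_pct` helper are hypothetical names, not part of any Mendral API.

```python
# Counts quoted in the post for the 60-day window.
INVESTIGATIONS = 36_564

signal_counts = {
    "auto_remediation_prs_decided": 628,
    "dismiss_as_invalid": 81,
    "user_initiated_reassessment": 94,
    "slack_thumbs": 70,
}


def coverage_pct(count: int, total: int = INVESTIGATIONS) -> float:
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


# The two signals the post calls "cleanest".
clean_count = (
    signal_counts["dismiss_as_invalid"]
    + signal_counts["user_initiated_reassessment"]
)
clean_coverage = coverage_pct(clean_count)

for name, count in signal_counts.items():
    print(f"{name:32s} {coverage_pct(count):5.2f}%")
print(f"clean signals combined: {clean_coverage:.2f}%")
```

Under that assumption, the two cleanest signals together touch well under 2% of investigations, consistent with the figure above.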

Auto-Remediation merge rate is the closest thing we have to "did the team accept the agent's fix?" Of the 628 PRs decided in the last 60 days, 68.3% were merged. That sounds like a quality signal. The next section is about why it isn't.

Dismiss-as-invalid is when an engineer explicitly tags an insight with "this isn't a real problem." There were 81 of those in the last 60 days against 9,226 total incidents. That's a 0.88% floor on the demonstrably-wrong rate. It's a floor and not the true rate, because most resolutions go untouched, and many incidents dismissed as "wont_fix" or "duplicate" probably also carried wrong diagnoses that nobody found interesting enough to flag.
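The floor is plain division over the two counts in the paragraph above. A minimal sketch, with hypothetical variable names:

```python
# Counts quoted in the post for the 60-day window.
dismissed_as_invalid = 81
total_incidents = 9_226

# Explicit "invalid" tags are the only incidents we know were wrong,
# so this ratio is a lower bound on the wrong rate, not an estimate of it.
wrong_rate_floor_pct = 100.0 * dismissed_as_invalid / total_incidents

print(f"demonstrably-wrong floor: {wrong_rate_floor_pct:.2f}%")
```

The true wrong rate can only be equal to or higher than this number, since incidents that were wrong but never tagged are invisible to it.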

User-initiated reassessments are the strongest "you got this wrong, here's the missing context" signal we have. There were 94 of those in 60 days against 35,305 completed sessions. High-quality but rare.

System-initiated reassessments are different. They're triggered automatically when the context around an insight changes (new occurrence, related incident, time threshold). They're not grades, they're re-investigations. But sometimes they flip the original diagnosis, which is the closest thing we have to the agent self-correcting. There were 8,523 of those in 60 days. One of them shows up in the case studies below.

Slack thumbs are what most teams reach for. We instrumented up/down feedback in Slack early. Coverage is...
