Auto-Verifying Your AI-SRE’s Fixes (Part II): HolmesGPT End-to-End on a Real Cluster | mirrord by MetalBear
Back to blogAuto-Verifying Your AI-SRE’s Fixes (Part II): HolmesGPT End-to-End on a Real Cluster<br>Eyal Bukchin<br>June 23, 2026<br>7 min read<br>Part II of two. See Part I for the recipe.
In Part I we’ve discussed how you can plug mirrord into your AI-SRE so it can autonomously test its fix in the real cluster. In this post, we’ll show this running end-to-end on a real Kubernetes cluster against HolmesGPT, the only OSS AI-SRE we found that is self-hostable. We planted two bugs to test both possible verdicts: PASS when the patched run clears the alert’s SLO without any regressions, REJECT when it doesn’t (the patched run still violates the alert’s SLO, or cleared the SLO but added a regression).<br>Flow: HolmesGPT investigates an alert and returns a markdown report. A small Claude wrapper turns the report into a code patch. The verifier runs that patch under mirrord exec, compares against an unpatched baseline, and returns a verdict tied to the alert’s actual SLO.<br>AI-SRE verification architectureWhat’s HolmesGPT? A prebuilt LLM-backed agent tailored for on-call work: it can run kubectl commands, fetch logs from the cluster, and read the runbooks you’ve already written (in Notion, Confluence, or markdown files) to ground its investigation.
The demo cluster #<br>A Python service called checkout handles HTTP requests; on each request it calls a pricing service for the item price. A loadgen pod sends requests to checkout continuously to keep the system warm. Prometheus scrapes checkout’s metrics and Alertmanager fires when SLOs are violated. Everything sits in one namespace.<br>HolmesGPT runs as a pod in the cluster, subscribed to Alertmanager. When an alert fires, HolmesGPT investigates and hands its markdown report off to a separate verifier pod. The verifier runs the bridge (the Claude call that turns the report into a code patch), then uses mirrord exec to run the patched code inside the verifier pod with deploy/checkout’s network identity, env, and mounts. When the patched copy calls pricing, it hits the real pricing pod.<br>The demo clusterScenario 1: error-rate alert (PASS) #<br>A recent change introduced a request format checkout() doesn’t handle: every request for an item_id ending in -3 raises ValueError and returns 500. Loadgen sends a mix; ~10% of requests fail. A Prometheus rule fires CheckoutErrorRateHigh once the error rate climbs over 5%.<br>HolmesGPT Investigates<br>This is the call HolmesGPT runs in-cluster when Alertmanager fires:<br>holmes investigate alertmanager \<br>--alertmanager-url http://localhost:9093 \<br>--alertmanager-alertname CheckoutErrorRateHigh \<br>--model 'anthropic/claude-sonnet-4-20250514'<br>HolmesGPT investigated for ~30 seconds (pulled pod descriptions, fetched logs, walked the service config) and concluded:<br>Root Cause: Application logic error causing 500 responses for specific item (item-3)<br>Error Details: Error rate: 20.09% (above 5% SLO). Specific error: ValueError: unsupported catalog shape for item_id=item-3. Pattern: repeated failures for item-3 checkout requests returning HTTP 500.<br>Remediation: Fix application code to handle item-3 catalog shape or add proper validation.
Clean diagnosis: HolmesGPT pulled the exception out of the logs and attributed it correctly.<br>Bridge to a code patch. HolmesGPT outputs a markdown report. The bridging step is a small Claude call running in the verifier pod (~80 lines around the Anthropic SDK) that takes the report plus the service’s source and returns a code patch. We asked it to faithfully implement HolmesGPT’s recommendation: handle the item-3 case (Claude chose to return zero) rather than raising. Claude produced the minimal edit to checkout.py.<br>Verify. Two runs (baseline and patched), 100 requests each. The verifier compares the alert’s SLO condition against the patched run, plus a regression watchlist on the other signals. The “baseline” here is the verifier’s own load test on the unpatched code, same load and same downstream as the patched run, not the live error rate HolmesGPT observed. That’s why the number below differs from the 20.09% in HolmesGPT’s report.<br>SLO under verification: error_rate > 5% (alert fires while this holds)<br>BaselinePatchedSLO thresholdStatusError rate10.00%0.00% 5.00%✅ satisfiesRegression watchlist :<br>SignalBaselinePatchedChangep50 latency (ms)469.2470.5+0.3%p99 latency (ms)505.7531.8+5.2%Verdict: PASS. Alert condition error_rate > 5% no longer satisfied (10% → 0%). Regression watchlist clean.<br>Scenario 2: latency alert (REJECT) #<br>The error-rate bug was easy: a literal exception in the logs with an obvious fix. What does the loop look like for a less log-visible bug? We swapped in the second planted defect: fetch_price() has no client-side timeout, and pricing has a long tail: 1 in 10 calls takes ~1.5s. p99 drifts to ~2 seconds,...