Using mirrord to verify AI-SRE fixes against the staging cluster

eyalbukchin1 pts0 comments

Auto-Verifying Your AI-SRE’s Fixes (Part II): HolmesGPT End-to-End on a Real Cluster | mirrord by MetalBear

Back to blogAuto-Verifying Your AI-SRE’s Fixes (Part II): HolmesGPT End-to-End on a Real Cluster<br>Eyal Bukchin<br>June 23, 2026<br>7 min read<br>Part II of two. See Part I for the recipe.

In Part I we&rsquo;ve discussed how you can plug mirrord into your AI-SRE so it can autonomously test its fix in the real cluster. In this post, we’ll show this running end-to-end on a real Kubernetes cluster against HolmesGPT, the only OSS AI-SRE we found that is self-hostable. We planted two bugs to test both possible verdicts: PASS when the patched run clears the alert&rsquo;s SLO without any regressions, REJECT when it doesn&rsquo;t (the patched run still violates the alert&rsquo;s SLO, or cleared the SLO but added a regression).<br>Flow: HolmesGPT investigates an alert and returns a markdown report. A small Claude wrapper turns the report into a code patch. The verifier runs that patch under mirrord exec, compares against an unpatched baseline, and returns a verdict tied to the alert&rsquo;s actual SLO.<br>AI-SRE verification architectureWhat&rsquo;s HolmesGPT? A prebuilt LLM-backed agent tailored for on-call work: it can run kubectl commands, fetch logs from the cluster, and read the runbooks you&rsquo;ve already written (in Notion, Confluence, or markdown files) to ground its investigation.

The demo cluster #<br>A Python service called checkout handles HTTP requests; on each request it calls a pricing service for the item price. A loadgen pod sends requests to checkout continuously to keep the system warm. Prometheus scrapes checkout&rsquo;s metrics and Alertmanager fires when SLOs are violated. Everything sits in one namespace.<br>HolmesGPT runs as a pod in the cluster, subscribed to Alertmanager. When an alert fires, HolmesGPT investigates and hands its markdown report off to a separate verifier pod. The verifier runs the bridge (the Claude call that turns the report into a code patch), then uses mirrord exec to run the patched code inside the verifier pod with deploy/checkout&rsquo;s network identity, env, and mounts. When the patched copy calls pricing, it hits the real pricing pod.<br>The demo clusterScenario 1: error-rate alert (PASS) #<br>A recent change introduced a request format checkout() doesn&rsquo;t handle: every request for an item_id ending in -3 raises ValueError and returns 500. Loadgen sends a mix; ~10% of requests fail. A Prometheus rule fires CheckoutErrorRateHigh once the error rate climbs over 5%.<br>HolmesGPT Investigates<br>This is the call HolmesGPT runs in-cluster when Alertmanager fires:<br>holmes investigate alertmanager \<br>--alertmanager-url http://localhost:9093 \<br>--alertmanager-alertname CheckoutErrorRateHigh \<br>--model 'anthropic/claude-sonnet-4-20250514'<br>HolmesGPT investigated for ~30 seconds (pulled pod descriptions, fetched logs, walked the service config) and concluded:<br>Root Cause: Application logic error causing 500 responses for specific item (item-3)<br>Error Details: Error rate: 20.09% (above 5% SLO). Specific error: ValueError: unsupported catalog shape for item_id=item-3. Pattern: repeated failures for item-3 checkout requests returning HTTP 500.<br>Remediation: Fix application code to handle item-3 catalog shape or add proper validation.

Clean diagnosis: HolmesGPT pulled the exception out of the logs and attributed it correctly.<br>Bridge to a code patch. HolmesGPT outputs a markdown report. The bridging step is a small Claude call running in the verifier pod (~80 lines around the Anthropic SDK) that takes the report plus the service&rsquo;s source and returns a code patch. We asked it to faithfully implement HolmesGPT&rsquo;s recommendation: handle the item-3 case (Claude chose to return zero) rather than raising. Claude produced the minimal edit to checkout.py.<br>Verify. Two runs (baseline and patched), 100 requests each. The verifier compares the alert&rsquo;s SLO condition against the patched run, plus a regression watchlist on the other signals. The &ldquo;baseline&rdquo; here is the verifier&rsquo;s own load test on the unpatched code, same load and same downstream as the patched run, not the live error rate HolmesGPT observed. That&rsquo;s why the number below differs from the 20.09% in HolmesGPT&rsquo;s report.<br>SLO under verification: error_rate > 5% (alert fires while this holds)<br>BaselinePatchedSLO thresholdStatusError rate10.00%0.00% 5.00%✅ satisfiesRegression watchlist :<br>SignalBaselinePatchedChangep50 latency (ms)469.2470.5+0.3%p99 latency (ms)505.7531.8+5.2%Verdict: PASS. Alert condition error_rate > 5% no longer satisfied (10% → 0%). Regression watchlist clean.<br>Scenario 2: latency alert (REJECT) #<br>The error-rate bug was easy: a literal exception in the logs with an obvious fix. What does the loop look like for a less log-visible bug? We swapped in the second planted defect: fetch_price() has no client-side timeout, and pricing has a long tail: 1 in 10 calls takes ~1.5s. p99 drifts to ~2 seconds,...

rsquo holmesgpt alert cluster error patched

Related Articles