Aggressively Hunting Down Flaky CI Tests with AI

OpenWISP Blog

[Image: AI-assisted debugging of a flaky CI crash]

If you have ever contributed to OpenWISP, you have probably met flaky CI. A test passes on your machine, fails on CI, then passes again when you re-run the job. Most OpenWISP modules are Django applications with Selenium browser tests, and over time those intermittent failures became more than a minor annoyance: red builds that were green a minute later, contributors re-running jobs to unblock their pull requests, and maintainers learning to double-check whether a red build was real or just another ghost.

A self-healing bot, and the debt it hid

To stop flaky runs from blocking everyone, I asked Sarthak Tyagi to help me build a small safety net into our shared tooling: a reusable GitHub Actions workflow in openwisp-utils that inspects a failed run, offers recommendations for potential fixes, recognizes the signatures of known flaky failures, and automatically restarts the failed build. You can read about it in the docs for the automated CI failure bot.

The bot kept the pipeline moving and saved us a lot of manual re-runs. But it treated the symptom, not the disease. The real bugs were still there, scattered across several modules, and we never found the time to chase them down. Breathing room is useful, but if you never come back to the root cause it quietly turns into debt.

The process I used

[Figure: Diagram of the supervised AI debugging loop: CI failures become a report, fixes are tested locally and on GitHub Actions, and human review steers the next iteration.]

So I decided to use an AI coding agent as a tireless assistant on a machine with cores to spare. I was still skeptical. A few months earlier I had tried an AI agent on exactly this kind of debugging and it had been useless: confidently wrong, eager to chase the wrong clue, and unable to hold a long investigation together.
The tools had improved since then, and so had my prompting, but before betting on the hardest bug I wanted proof that the approach could work at all.

The useful part was not that the agent had some special insight. I made it do the repetitive work I would have done myself: read failed CI logs, compare patterns, try fixes, and run the tests again. Doing that manually would have taken me at least a full day of focused work. Instead, I used Claude Code with Opus 4.8 in high-effort mode and kept it working with minimal supervision while I was busy with other things.

The process looked roughly like this:

- Fetch the output of flaky CI builds that had failed and then been restarted by the CI failure bot.
- Analyze those logs, identify the most common failures, and look for possible solutions to each one.
- Write the findings into a local Markdown report that I could inspect.
- Start with the low-hanging fruit from that report.
- Keep running tests locally until the agent could show that a change really reduced the flaky failures it was targeting.
- Push branches to GitHub and keep restarting GitHub Actions CI jobs, so we could compare local stress runs with repeated CI runs instead of trusting just one environment.
- Address feedback from CodeRabbit.
- Update the report whenever the agent found another recurring failure pattern.
- Treat a targeted change as invalid if the same flaky failure kept showing up, then send the agent back to look for a different solution.
- Read the updated report from time to time, give the agent hints, and steer priorities or methodology when it started drifting.

The loop was simple: collect failures, rank them, try the easiest useful fix, prove whether it helped, update the report, and keep iterating until the flaky errors came down.

Later, I used opencode with GPT 5.5 in medium-effort mode to clean up the solution, polish the code, and make the comments more human-readable.
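To make the "collect failures, rank them" part of the loop concrete, here is a minimal sketch of how failed CI logs could be grouped by signature and summarized in a report. It is an illustration only, not the code the agent or the CI failure bot actually used: the signature names, the `SIGNATURES` table, and the functions `classify`, `rank_failures`, and `render_report` are all hypothetical, and the patterns are just examples of error strings that Selenium and SQLite are known to emit.

```python
import re
from collections import Counter

# Hypothetical signature table: names and patterns are illustrative,
# not the ones used by the OpenWISP CI failure bot.
SIGNATURES = {
    "selenium_timeout": re.compile(r"selenium\.common\.exceptions\.TimeoutException"),
    "foreign_key": re.compile(r"FOREIGN KEY constraint failed"),
    "db_locked": re.compile(r"database is locked"),
}


def classify(log_text):
    """Return the set of signature names found in a single CI log."""
    return {name for name, pattern in SIGNATURES.items() if pattern.search(log_text)}


def rank_failures(logs):
    """Count how many logs contain each signature, most common first."""
    counts = Counter()
    for log_text in logs:
        # A signature is counted once per log, however often it repeats.
        counts.update(classify(log_text))
    return counts.most_common()


def render_report(ranking, total):
    """Render the ranking as a short plain-text list for human review."""
    lines = ["Flaky failure report", ""]
    for name, count in ranking:
        lines.append(f"- {name}: {count} of {total} failed runs")
    return "\n".join(lines)
```

Counting each signature once per log keeps a single noisy run from dominating the ranking, so the top entries reflect how many separate builds a failure affected. Re-running the same script after a fix gives a rough before-and-after comparison for the signature that fix was targeting.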
Earning trust on an easier target

I started with openwisp-controller, where the flaky failures were annoying but more tractable, and left the scariest crash alone for the moment.

It went better than expected. I did not hand the problem over and wait for an answer. I used the agent to do the time-consuming part of my own normal debugging loop: keep running tests, inspect failures, try small changes, and verify whether each change really moved the needle. That process fixed several flaky tests and, more interestingly, exposed a real bug behind some of them: background tasks were resurrecting already-deleted rows with a stray INSERT. The resulting FOREIGN KEY error could corrupt the SQLite test state and leave the Selenium browser waiting for a page that would never recover. The same work also moved session storage out of the shared cache, a test fix that doubled as production hardening.

Nothing about it was magic. What changed since my earlier attempt was that I treated the agent like a very patient pair of hands on a spare machine, not like an oracle. I kept pushing it through the same loop I would have followed myself: reproduce, change one thing, run it again, disprove the...
