AI's "overnight" solution for our flaky tests took two weeks to adopt
Recently I stopped a group of flaky tests from running in CI. This group was causing 60% of CI runs to fail, which was unsustainable. Three weeks later I was able to restore the group to CI, with 0% failures on main[1] as a result. Our “non-flaky” tests now give more false positives than the (previously) flaky group.
This is not really a post about tests, though. It’s about AI’s contribution (a lot) and what it took to make that contribution usable (also a lot).
The hardest problem
Developers on this project had been quarantining tests with a :flaky label for several years. The strategy was to quarantine a small group which could be expected to fail randomly but could also be re-run easily and separately from the full suite. Apart from the flakiness, the test suite is comprehensive and gives us high confidence that if we merge something after tests pass, it works.
Over the years, several developers had each spent a week at a time trying to reduce flakiness, and every attempt failed. In our defense, the flaky tests centred around interactive pages using Stimulus or Hotwire, and online discussion of this topic is a combination of ideas we tried already, plus someone saying: “I tried a lot, it doesn’t work, I think there’s a bug”.
The most promising angle was adopting Playwright, which did improve some things but also left us with some tests that failed permanently and needed to be skipped. There’s a dissatisfying way in which this is better than tests that only fail some of the time.
The problem started to look more and more like a trap set for enthusiastic developers. As a manager I always had to urge caution: “sure, you can see some approaches that could help, but bear in mind the last five times anyone tried they found very promising angles that didn’t change the stats in GitHub at all”. Developers whom I trust were seriously recommending deleting the entire group.
Opus “solved it” overnight
One night, Opus 4.6 running in Claude Code solved “the problem” by running the flaky test group hundreds of times and analyzing failures. There was some prompting to help Claude avoid premature conclusions and be aware that the problems could not be reproduced without repetition, plus a markdown file where it would record progress. Otherwise, no special magic.
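To make "running the flaky test group hundreds of times" concrete, here is a minimal sketch of a repetition loop. It is not the script Claude actually used. It assumes an RSpec suite where the quarantined group can be selected with `--tag flaky`, and the names `run_batch` and `RSPEC_FLAKY_RUN` are hypothetical.

```ruby
# Hypothetical sketch: repeat a test run and tally how often it fails.

# Assumed command for one run of the quarantined group. `--tag` is a real
# RSpec option, but the tag name and the use of Bundler are assumptions.
RSPEC_FLAKY_RUN = lambda do
  # Kernel#system returns true only when the command exits with status 0.
  system("bundle", "exec", "rspec", "--tag", "flaky", out: File::NULL) == true
end

# Calls the given block `runs` times. The block must return true when the
# run passed. Returns a summary hash for the whole batch.
def run_batch(runs)
  failures = runs.times.count { !yield }
  { runs: runs, failures: failures, failure_rate: failures.fdiv(runs) }
end

# Example (not executed here): summary = run_batch(100, &RSPEC_FLAKY_RUN)
```

Passing the runner as a block keeps the counting logic separate from the test command, so the same loop can wrap a single spec file or a different test runner.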
I could see Claude’s progress over time because it needed to run the flaky group in larger and larger batches. At first, five times was sufficient because the errors it found occurred 20% of the time. As those were fixed, I had to tell it to use batches of ten, fifty, and then one hundred. Finally, it reached a point where zero errors were found.
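The need for ever larger batches follows from basic probability. As a rough sketch, assume each run of the group fails independently with a fixed rate (real flaky tests may not behave that way). The chance of seeing at least one failure in n runs is then 1 - (1 - rate)^n. The helper names below are hypothetical.

```ruby
# Sketch under an independence assumption: every run fails with the same
# probability `failure_rate`, where 0 < failure_rate < 1.

# Probability of observing at least one failure in `runs` runs.
def detection_probability(failure_rate, runs)
  1.0 - (1.0 - failure_rate)**runs
end

# Smallest batch size that shows at least one failure with the given
# confidence (default 95%).
def runs_needed(failure_rate, confidence: 0.95)
  (Math.log(1.0 - confidence) / Math.log(1.0 - failure_rate)).ceil
end

puts detection_probability(0.2, 5).round(2) # prints 0.67
puts runs_needed(0.2)                       # prints 14
puts runs_needed(0.01)                      # prints 299
```

Under these assumptions, a 20% failure rate usually shows up within a handful of runs, while a 1% failure rate needs a batch in the hundreds before a clean result means much.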
A “nice” thing about needing such large batches is that I could leave Claude alone for hours at a time while my normal evening continued. Flaky specs may be a problem uniquely suited to coding agents in that way. There’s not even much token use: it just kicks off a long run and surfaces for an internal conversation, then kicks off the next batch.
Two weeks to make the results useful
This isn’t a post about test failure strategy, so I’ll spare you details of what was flaky and what fixes applied. Instead I’ll try to communicate some of the meta concerns I had with the resulting code changes.
Given a test that looked something like this:
1 create objects
2 visit page
3 click A
4 click B
5 expect expression 1 to be true
6 click C
7 expect expression 2 to be true
Unchecked, Claude would have turned it into something like this:
1 create objects in a slightly different way that makes no difference
2 visit page
3 explicit sleep
4 unnecessary scoping to a specific section of the page
5 click A
6 end of unnecessary scoping
7 click B, with 3 second wait passed as option arg
8 a clever improvement that should have been on line 3
9 expect expression 1 to be true
10 click C
11 an improvement that worked in other tests but was irrelevant here
12...