It took two weeks to make Claude's "overnight solution" for flaky tests useful

Recently I stopped a group of flaky tests from running in CI. This group was causing 60% of CI runs to fail, which was unsustainable. Three weeks later I was able to restore the group to CI, and it has since caused 0% failures on main. Our “non-flaky” tests now give more false positives than the (previously) flaky group.

This is not really a post about tests, though. It’s about AI’s contribution (a lot) and what it took to make that contribution usable (also a lot).

The hardest problem

Developers on this project had been quarantining tests with a :flaky label for several years. The strategy was to quarantine a small group which could be expected to fail randomly but could also be re-run easily and separately from the full suite. Apart from the flakiness, the test suite is comprehensive and gives us high confidence that if we merge something after tests pass, it works.

Over the years, several developers had each spent a week at a time trying to reduce flakiness, and every attempt failed. In our defense, the flaky tests centred around interactive pages using Stimulus or Hotwire, and online discussion of this topic is a combination of ideas we tried already, plus someone saying: “I tried a lot, it doesn’t work, I think there’s a bug”.

The most promising angle was adopting Playwright, which did improve some things but also left us with some tests that failed permanently and needed to be skipped. There’s a dissatisfying way in which this is better than tests that only fail some of the time.

The problem started to look more and more like a trap set for enthusiastic developers. As a manager I always had to urge caution: “sure, you can see some approaches that could help, but bear in mind the last five times anyone tried they found very promising angles that didn’t change the stats in GitHub at all”. Developers whom I trust were seriously recommending deleting the entire group.

Opus “solved it” overnight

One night, Opus 4.6 running in Claude Code solved “the problem” by running the flaky test group hundreds of times and analyzing failures. There was some prompting to help Claude avoid premature conclusions and be aware that the problems could not be reproduced without repetition, plus a markdown file where it would record progress. Otherwise, no special magic.
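The repetition loop described above can be sketched in a few lines of Ruby. This is an illustration only, not the script Claude actually used: `run_batch` is a hypothetical helper, and the `rspec --tag flaky` command in the usage comment is an assumption about how the quarantined group is selected.

```ruby
# Runs the given block `runs` times and tallies how often it reports failure.
# The block should return a truthy value when the test run passed.
def run_batch(runs)
  failures = 0
  runs.times { failures += 1 unless yield }
  { runs: runs, failures: failures, failure_rate: failures.fdiv(runs) }
end

# Hypothetical usage (the command and the tag name are assumptions):
#   result = run_batch(100) do
#     system("bundle", "exec", "rspec", "--tag", "flaky", out: File::NULL)
#   end
#   puts result
```

Passing the test run in as a block keeps the tallying logic separate from the command, so the same helper works for a single spec file or the whole group.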

I could see Claude’s progress over time because it needed to run the flaky group in larger and larger batches. At first, five times was sufficient because the errors it found occurred 20% of the time. As those were fixed, I had to tell it to use batches of ten, fifty, and then one hundred. Finally, it reached a point where zero errors were found.
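The need for growing batch sizes follows from simple probability. As a rough sketch, assuming each run fails independently with a fixed rate (real flaky tests may not behave that neatly), the helpers below estimate how likely a batch is to surface a failure and how many runs are needed to reach a chosen confidence. Both method names and the 95% default are illustrative choices, not something from the original workflow.

```ruby
# Probability of seeing at least one failure in `runs` runs,
# assuming each run fails independently with probability `failure_rate`.
def detection_probability(failure_rate, runs)
  1.0 - (1.0 - failure_rate)**runs
end

# Smallest number of runs that surfaces at least one failure
# with the given confidence, under the same independence assumption.
def runs_needed(failure_rate, confidence: 0.95)
  (Math.log(1.0 - confidence) / Math.log(1.0 - failure_rate)).ceil
end

puts detection_probability(0.2, 5).round(3) # about 0.672
puts runs_needed(0.2)                       # 14
puts runs_needed(0.01)                      # 299
```

Under these assumptions, a 20% failure rate usually shows up within five runs, while a 1% failure rate needs a batch in the hundreds before a clean result means much.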

A “nice” thing about needing such large batches is that I could leave Claude alone for hours at a time while my normal evening continued. Flaky specs may be a problem uniquely suited to coding agents in that way. There’s not even much token use: it just kicks off a long run and surfaces for an internal conversation, then kicks off the next batch.

Two weeks to make the results useful

This isn’t a post about test failure strategy, so I’ll spare you details of what was flaky and what fixes applied. Instead I’ll try to communicate some of the meta concerns I had with the resulting code changes.

Given a test that looked something like this:

1 create objects
2 visit page
3 click A
4 click B
5 expect expression 1 to be true
6 click C
7 expect expression 2 to be true

Unchecked, Claude would have turned it into something like this:

1 create objects in a slightly different way that makes no difference
2 visit page
3 explicit sleep
4 unnecessary scoping to a specific section of the page
5 click A
6 end of unnecessary scoping
7 click B, with 3 second wait passed as option arg
8 a clever improvement that should have been on line 3
9 expect expression 1 to be true
10 click C
11 an improvement that worked in other tests but was irrelevant here
12...
