The Test Suite Was the Incident


A night of brittle fixtures, 49 failed CI runs, and an expensive lesson about what happens when AI agents write the test suite and nobody owns the test data.

Christopher Meiklejohn


At 10:00 PM EDT last night I signed up for the $100 plan for Codex, the OpenAI coding agent that was driving most of the night’s work, because about a dozen Zabriskie PRs were in flight and I wanted them merged before I went to bed. Before 3:00 AM I had upgraded to the $200 plan. Somewhere in between, GitHub’s usage page ticked through roughly 90% of a $200 CI quota, and I sat there watching pull request after pull request go red for reasons that had nothing to do with the pull requests.

Some context for readers arriving fresh. Zabriskie is a social app for live-music fans: you RSVP to shows, track the ones you attended, post about what you’re hearing, and compare tour stats with friends. I have written almost none of it. AI agents built the features, and AI agents also built the test suite that guards them: the specs, the shared seed data, the fixture users, the helpers, nearly all of it written by some agent in some session that was about something else. That second part turns out to be the entire subject of this post.

Nothing was wrong with GitHub Actions. Nothing was catastrophically wrong with the application. No single PR was especially risky. Every PR was simply paying to rebuild the same brittle world from scratch: backend build, fresh Postgres, the full database migration chain, the shared seed data, the frontend, and the end-to-end browser tests (Playwright) split into eight parallel shards. That world was held together by fixtures nobody owned. I spent almost two hundred dollars last night proving that my test suite was lying to me.

The morning-after numbers, measured from the GitHub run history:

| thing | amount |
| --- | --- |
| CI runs observed | 168 |
| failed runs | 49 |
| cancelled runs | 10 |
| total runner time | 11,721.8 job-minutes (195.4 hours) |
| failed runner time | 3,411.0 job-minutes |
| failed plus cancelled runner time | 3,903.0 job-minutes |
| GitHub CI quota used | about 90% of $200 |
| estimated agent token waste | about 1.7M to 2.5M tokens |

If I allocate the $180 actually consumed across the observed runner time, the failed jobs alone cost about $52, and the failed plus cancelled jobs about $60. That number is technically true and emotionally useless. The green runs existed because the red runs forced another commit. The reruns existed because the failure kept moving from one shared assumption to another. The successful checks weren’t clean proof of health; they were the tax paid after pushing the suite back into one of the few shapes it still tolerated. So the honest accounting is this: the directly attributable CI waste was about sixty dollars, and the practical cost of the bad test design was most of the $180, plus the evening, plus something like two million tokens spent asking agents to debug symptoms of the same structural problem. The tests were not protecting me. They were charging me rent.
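For readers who want to check the allocation, here is a minimal sketch of the arithmetic. It assumes the $180 is spread pro rata over job-minutes, as if every runner minute cost the same, which is a simplification; the constant and function names are mine, and the figures are copied from the run history above.

```python
# Figures from the run-history table. The flat pro-rata split is an assumption.
QUOTA_SPENT_USD = 180.0  # about 90% of the $200 CI quota
TOTAL_JOB_MINUTES = 11_721.8
FAILED_JOB_MINUTES = 3_411.0
FAILED_OR_CANCELLED_JOB_MINUTES = 3_903.0


def allocated_cost(job_minutes: float) -> float:
    """Share of the spend attributed to a slice of runner time, pro rata."""
    return QUOTA_SPENT_USD * job_minutes / TOTAL_JOB_MINUTES


failed_cost = allocated_cost(FAILED_JOB_MINUTES)
failed_or_cancelled_cost = allocated_cost(FAILED_OR_CANCELLED_JOB_MINUTES)

print(f"failed: ${failed_cost:.2f}")
print(f"failed + cancelled: ${failed_or_cancelled_cost:.2f}")
```

The two results land near $52 and $60, which is where the "directly attributable" numbers come from. Nothing in this calculation captures the green runs that only existed because a red run forced another commit.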

The Shared World

The actual application work was not the disaster. There were real features, real fixes, and pull requests that should have been small, reviewable, and mergeable. Instead, every one of them entered the same grinder, and the first wave of failures had nothing to do with the PRs that were failing.

One migration, a data fix for a birthday feature, expected a user named queenofthemean to already exist in the database it was migrating, and in a fresh CI database she didn’t:

birthday-honoree fix matched 0 rows
no user with username/display_name = queenofthemean
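The failure mode is easy to reproduce in miniature. The sketch below is not Zabriskie's migration: it uses an in-memory SQLite database as a stand-in for the fresh CI Postgres, and the `users` table, the `birthday_honoree` column, and the function names are all hypothetical. It only illustrates how a data fix that insists on matching a specific row passes wherever that row happens to exist and fails on a clean database.

```python
import sqlite3


class DataFixMatchedNothing(RuntimeError):
    """Raised when a data-fix migration updates zero rows."""


def fresh_database() -> sqlite3.Connection:
    # Stand-in for the fresh CI database: schema only, no users yet.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "username TEXT PRIMARY KEY, "
        "display_name TEXT, "
        "birthday_honoree INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def apply_birthday_fix(conn: sqlite3.Connection, name: str) -> int:
    # Hypothetical data fix that assumes one particular user already exists.
    cur = conn.execute(
        "UPDATE users SET birthday_honoree = 1 "
        "WHERE username = ? OR display_name = ?",
        (name, name),
    )
    if cur.rowcount == 0:
        raise DataFixMatchedNothing(
            f"birthday-honoree fix matched 0 rows: no user {name!r}"
        )
    return cur.rowcount


# Fresh database: the user was never created, so the migration fails.
fresh = fresh_database()
try:
    apply_birthday_fix(fresh, "queenofthemean")
    fresh_result = "applied"
except DataFixMatchedNothing as exc:
    fresh_result = f"failed: {exc}"

# Database where the user happens to exist: the same migration succeeds.
seeded = fresh_database()
seeded.execute(
    "INSERT INTO users (username, display_name) VALUES (?, ?)",
    ("queenofthemean", "queenofthemean"),
)
seeded_rows = apply_birthday_fix(seeded, "queenofthemean")
```

The migration itself is identical in both runs. The only thing that changed is whether some earlier, unrelated step had put the expected person into the world.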

Another batch of failures came from migrations over the project’s incident log. Zabriskie’s repository tracks its own operational incidents, complete with required estimates of what each one wasted, and the old incident rows didn’t include those waste estimates, so the migrations tripped a database constraint. These weren’t product regressions. They were fixture assumptions leaking into migration behavior: a PR could be about one corner of the app and still lose because some other part of the global test universe had drifted.
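I haven't shown which constraint fired, so the following is only one plausible shape of the problem, sketched with SQLite in place of Postgres. It assumes the waste estimate became a NOT NULL column; the table and column names (`incidents`, `incidents_v2`, `wasted_minutes`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Old schema: incidents were logged without any waste estimate.
conn.execute(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
)
conn.execute("INSERT INTO incidents (title) VALUES ('old incident')")

# New schema (assumed): every incident must carry a waste estimate.
conn.execute(
    "CREATE TABLE incidents_v2 ("
    "id INTEGER PRIMARY KEY, "
    "title TEXT NOT NULL, "
    "wasted_minutes INTEGER NOT NULL)"
)

try:
    # Old rows have no estimate to carry forward, so the constraint fires.
    conn.execute(
        "INSERT INTO incidents_v2 (id, title, wasted_minutes) "
        "SELECT id, title, NULL FROM incidents"
    )
    migration_error = None
except sqlite3.IntegrityError as exc:
    migration_error = str(exc)

print(migration_error)
```

Whatever the real constraint was, the pattern is the same: the migration was correct against the data its author had in mind and wrong against the data that actually existed.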

Then came the collision that named the whole problem. Zabriskie has a feature called the Lot, a hangout space with a tour stats card that summarizes the shows you’ve been to. A change made that card count only past shows the user had actually attended, which was the correct product behavior, and a test was updated to assert it. The setup for that test added a show RSVP for e2etester, the single global user that half the suite borrows for whatever scenario it needs. Meanwhile, a spec for show recaps elsewhere in the suite expected its fixture show to have exactly one attendee. The recap test started failing, not because recaps were broken, but because the shared world had quietly acquired one extra person.
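The collision can be modeled without a browser or a database. This is a toy sketch, not the real Playwright specs: `World` is a hypothetical in-memory stand-in for the seeded database, and every name except `e2etester` is invented for illustration. It shows the recap assertion breaking when the tour stats setup borrows the shared user and fixture show, and holding when that setup creates data it owns.

```python
import itertools
from collections import defaultdict


class World:
    """Tiny in-memory stand-in for the seeded test database."""

    def __init__(self):
        self.attendees = defaultdict(set)  # show name -> set of usernames

    def rsvp(self, user, show):
        self.attendees[show].add(user)

    def attendee_count(self, show):
        return len(self.attendees[show])


def seeded_world():
    world = World()
    # Shared seed data: the fixture show starts with exactly one attendee.
    world.rsvp("fixture_fan", "fixture_show")
    return world


def tour_stats_setup(world, user="e2etester", show="fixture_show"):
    # The tour stats test needs an attended show, so it adds an RSVP.
    world.rsvp(user, show)


def recap_sees_one_attendee(world, show="fixture_show"):
    # The recap test's assumption about the fixture show.
    return world.attendee_count(show) == 1


# Shared world: borrowing e2etester and the fixture show breaks recaps.
shared = seeded_world()
recap_before = recap_sees_one_attendee(shared)
tour_stats_setup(shared)
recap_after = recap_sees_one_attendee(shared)

# Owned data: the same setup with its own user and show leaves recaps alone.
_ids = itertools.count(1)


def own(prefix):
    return f"{prefix}_{next(_ids)}"


isolated = seeded_world()
tour_stats_setup(isolated, user=own("tourstats_user"), show=own("tourstats_show"))
recap_isolated = recap_sees_one_attendee(isolated)

print(recap_before, recap_after, recap_isolated)
```

In the shared case neither test is wrong on its own terms. The failure lives in the data both of them assumed they could use without asking.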

It’s worth pausing on how a suite ends up shaped like this, because nobody designed it. An agent...

