Catch Flakes On Main
A small Mechanical Habit today:
When using the not rocket science rule or a merge queue, continue to redundantly run the full test suite on main. Maintain an easily accessible list of recent main failures: these are the flaky tests to eradicate.
For an example, see the “Flakes” link on https://devhub.tigerbeetle.com
Flaky tests are tests that fail intermittently, say once in a thousand runs. This might be due to a genuine bug (assumptions about scheduling that mostly hold) or due to instability of the underlying infrastructure (e.g., inability to download a release from GitHub, or to delete a folder on Windows). In either case, flaky tests are a huge productivity drain: as the size and complexity of the test suite grow, more and more CI runs fail spuriously, even as each individual test almost always passes.
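To put rough numbers on that last claim, here is a back-of-the-envelope sketch. It assumes every flaky test fails independently with the same probability, which real suites won't satisfy exactly, and the function name is made up for illustration.

```python
def spurious_failure_rate(flaky_tests: int, per_test_failure_rate: float) -> float:
    """Probability that at least one flaky test fails in a single CI run.

    Assumes the flaky tests fail independently and with equal probability,
    which is a simplification.
    """
    return 1.0 - (1.0 - per_test_failure_rate) ** flaky_tests


# Each test fails once in a thousand runs, as in the text above.
for count in (10, 100, 1000):
    rate = spurious_failure_rate(count, 0.001)
    print(f"{count:>4} flaky tests: {rate:.1%} of CI runs fail spuriously")
```

Under these assumptions, ten such tests cost about one run in a hundred, a hundred cost nearly one run in ten, and a thousand make a red run more likely than a green one.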
Flaky tests are challenging to deal with: if you are working on landing a PR and your CI fails due to an obvious flake, the temptation to just re-run the test suite is enormous, especially if there’s a certain background dissatisfaction with infrastructure stability.
If you are of a mind to do some flake squashing, then your PRs will be green just to spite you! And working off of others’ PRs would first require separating flakes from genuine failures.
This is why the merge queue is powerful: if there’s a guarantee that every commit on the main branch passes the tests, then every failure on main is a flake, by definition. Collecting all such failures into a single list compresses time, lets you prioritize the most impactful sources of instability, and reveals correlations between failures.
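A minimal sketch of such a list might look like the following. The `MainFailure` record, the helper names, and the sample data are all hypothetical; this is not how the TigerBeetle devhub is implemented, just one way to rank flakes by frequency and see which platforms they cluster on.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MainFailure:
    """One failed CI run on main (hypothetical record shape)."""
    commit: str
    test: str
    platform: str


def rank_flakes(failures: list[MainFailure]) -> list[tuple[str, int]]:
    """Return (test, failure count) pairs, most frequent first."""
    return Counter(f.test for f in failures).most_common()


def platforms_by_test(failures: list[MainFailure]) -> dict[str, set[str]]:
    """Group failing platforms per test to expose correlations."""
    grouped: dict[str, set[str]] = {}
    for failure in failures:
        grouped.setdefault(failure.test, set()).add(failure.platform)
    return grouped


# Made-up sample data, for illustration only.
recent_failures = [
    MainFailure("a1b2c3", "test_cleanup_tmp_dir", "windows"),
    MainFailure("d4e5f6", "test_download_release", "linux"),
    MainFailure("0a9b8c", "test_cleanup_tmp_dir", "windows"),
    MainFailure("7d6e5f", "test_cleanup_tmp_dir", "windows"),
    MainFailure("1f2e3d", "test_download_release", "macos"),
]

for test, count in rank_flakes(recent_failures):
    platforms = sorted(platforms_by_test(recent_failures)[test])
    print(f"{count}x {test} on {', '.join(platforms)}")
```

With this sample data, the cleanup test tops the list and fails only on Windows, while the download test fails across platforms, which hints at an infrastructure problem rather than a platform-specific bug.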