Test Coverage Won't Save You
A year ago, if a codebase accumulated three different ways to do the same thing, somebody would usually notice.
A reviewer might leave a comment. A senior engineer would mutter something about "yet another helper," and eventually the team would clean it up. Maybe they'd write a short doc. People would learn, collectively, that this was not the way.
This was not a perfect system. Human review can be slow. People miss things. Pattern docs get stale. Engineers develop opinions that are 70% wisdom and 30% scar tissue.
Still, the friction did something useful. It kept codebases from drifting too quickly.
Coding agents change that.
"Now, this is a room with electricity. But it has too much electricity."
Once agents are writing a meaningful share of code, the question is no longer just "can we get working code faster?" Obviously, yes. We can. Sometimes hilariously so.
The more interesting question is: what does the codebase teach the next agent?
Because it does teach.
Every merged PR becomes a precedent. If the repository contains one clear pattern, the next agent has a decent shot at following it. If it contains three slightly different patterns, the next agent may extend one, combine two, or invent a fourth for reasons that are, in technical terms, vibes.
The funhouse builds itself
Coding agents, like some of us, have a chaotic inner goth teenager.
Their non-deterministic nature means they can be inconsistent, often in surprising ways. They can generate the weird little leap that solves the problem, or they can generate novelty where nobody asked for it.
The dangerous thing is that most of the code is fine, if you look at it in isolation.
A PR passes tests. The implementation makes sense. The file is clean enough. You look at the diff and think, yeah, sure, that seems okay.
But zoom out, and the codebase as a whole has started to get weird:
three near-duplicate utilities
multiple data-fetching patterns
parallel abstractions solving almost the same problem
The codebase remains locally cohesive, while slowly losing global coherence.
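To make that concrete, here is a minimal sketch of what "three near-duplicate utilities" can look like. The function names and the module paths in the docstrings are hypothetical, invented purely for illustration. Each helper would pass review on its own; the problem only shows up when you put them side by side.

```python
# Three hypothetical helpers, imagined as living in three different modules.
# Names and paths are assumptions made up for this illustration.

def truncate(text: str, limit: int) -> str:
    """utils/strings.py: cut to `limit` characters, then append an ellipsis."""
    return text if len(text) <= limit else text[:limit] + "..."


def shorten(text: str, max_len: int = 80) -> str:
    """ui/format.py: same idea, but the ellipsis counts toward the limit."""
    if len(text) <= max_len:
        return text
    return text[: max_len - 3] + "..."


def clip_text(value: str, size: int) -> str:
    """api/serializers.py: same idea again, with no ellipsis at all."""
    return value[:size]


if __name__ == "__main__":
    # Short input: all three agree, so nothing looks wrong in any single PR.
    print(truncate("hello", 10), shorten("hello", 10), clip_text("hello", 10))
    # Long input: three different answers to the same question.
    print(truncate("hello world", 8))
    print(shorten("hello world", 8))
    print(clip_text("hello world", 8))
```

An agent asked to "truncate the title" now has three precedents to choose from, and nothing in the repository says which one is canonical.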
That matters more than it used to, because agents, even more than humans, respond to the signals around them. A clear codebase gives them clear precedent. A muddled codebase gives them confusion.
You can put principles in CLAUDE.md or docs, but they're... advisory. They compete with task instructions and inline comments for attention, get summarized when context windows compact, and depend on the agent attending to the right instruction at the right moment.
If your codebase contains three slightly different ways to solve a problem, the next agent has to infer which one is canonical.
From the agent's point of view, this is not irrational. The repository gave it mixed evidence.
The scary outcome is not code that obviously fails. It is code that superficially keeps working while the shape of the system keeps getting worse.
Tests are nice
Most discussions about AI-native development jump from this problem – agents' tendency to accumulate tech debt – directly to tests.
And yes, across the industry, teams are writing dramatically more tests than they ever have. Agents have made high test coverage affordable in a way it never used to be.
Recently, Garry Tan argued that the primary way to keep AI agents on track is 90% test coverage:
Tests are the ratchet. 90% coverage, every PR, no exceptions.
And indeed, test coverage is the simplest kind of ratchet: a mechanism that allows motion in one direction only, like a socket wrench that turns the bolt forward but never lets it spin back. Once a test locks in a behaviour, it becomes difficult to accidentally regress.
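A coverage ratchet of this kind can be sketched in a few lines. This is an assumption-laden illustration, not how any particular CI tool implements it: the function names, the stored baseline, and the way the 90% floor is combined with "never go down" are all hypothetical choices made for the example.

```python
# Minimal sketch of a coverage ratchet. All names here are hypothetical.

def may_merge(baseline: float, current: float, floor: float = 90.0) -> bool:
    """Allow a PR only if coverage meets the floor and has not dropped.

    `baseline` is the coverage percentage recorded for the main branch,
    `current` is the coverage measured on the PR.
    """
    return current >= floor and current >= baseline


def next_baseline(baseline: float, current: float) -> float:
    """After a merge, the stored baseline can move up but never down."""
    return max(baseline, current)


if __name__ == "__main__":
    print(may_merge(baseline=91.0, current=92.5))  # coverage went up
    print(may_merge(baseline=92.5, current=91.0))  # coverage slipped back
    print(next_baseline(91.0, 92.5))
```

Note what the check takes as input: two percentages. It has no opinion about what the covered code looks like.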
But it's worth being specific about what tests actually do.
Unit tests check that a function still returns what it returned before. Integration tests check that pieces still wire together the same way. E2E tests reach further: they check whether the product still does what it used to, and choosing which assertions to prioritize there should be a deliberate act of human judgment.
But they all share the same basic shape: tests verify that code does what it did before.
Whether what it did was even the right way to do it is a separate question.
Tests ratchet behavioural sameness.
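As a small illustration, here are two hypothetical implementations of the same function. One goes through a shared helper; the other quietly re-implements the arithmetic inline. The names and data shape are assumptions invented for this sketch. The same test passes for both, because the test only looks at what comes out.

```python
# Two hypothetical implementations with identical behaviour and different shape.

def line_total(item: dict) -> int:
    """Hypothetical shared helper: price in cents times quantity."""
    return item["price_cents"] * item["qty"]


def order_total_shared(items: list[dict]) -> int:
    """Follows the existing pattern by reusing the shared helper."""
    return sum(line_total(item) for item in items)


def order_total_inline(items: list[dict]) -> int:
    """Bypasses the helper and duplicates its logic inline."""
    total = 0
    for item in items:
        total += item["price_cents"] * item["qty"]
    return total


def check_order_total(impl) -> None:
    """The kind of test that pins output only, not structure."""
    items = [{"price_cents": 250, "qty": 2}, {"price_cents": 100, "qty": 3}]
    assert impl(items) == 800
    assert impl([]) == 0


if __name__ == "__main__":
    check_order_total(order_total_shared)
    check_order_total(order_total_inline)
    print("both implementations pass")
```

Both versions get the same green check and the same coverage number. Only one of them keeps the pricing logic in one place.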
Meanwhile, docs and evals can pin down reasoning and set bars for behaviour, which is sometimes more important. But none of these looks directly at the shape of the system.
90% coverage of good patterns
A coverage ratchet only improves the codebase if the patterns it's locking in are already good ones.
If a questionable pattern already exists in the code, the next agent is more likely to extend it. The tests generated around that implementation only reinforce it further.
Over time, high coverage can unintentionally fossilize architectural drift, rather than prevent it in the first place.
At first, especially during prototyping, this can be easy to miss, because nothing looks obviously broken....