Is AI-Written Code Buggier Than Human Code?
Ask ten engineers whether AI-written code is buggier than the code they write by hand and you'll get ten confident answers, roughly half in each direction, and zero of them backed by a number. It's one of those questions where everyone has a prior and nobody has the data. The takes are loud, the measurements are scarce, and the few measurements that exist tend to be surveys of how people feel about AI code rather than records of what the code actually does once it's merged.
So we measured it the way the defect-prediction field measures everything else: by going back through the git history of real projects and tracing every bug fix back to the commit that introduced the bug. We did this across 28 public repositories and 112,382 commits spanning the first year of coding agents merging real pull requests, and then we asked a simple question of the data. When an agent wrote the commit, was it more likely to be the one that planted a bug than a human commit in the same codebase?
The short version: no. If anything, the opposite. And the agent-written lines that did land stuck around longer than the human-written ones. The longer version, with the caveats that make me actually believe it, is the rest of this post.
What "agent code" even means
The first problem with this question is that "AI wrote it" isn't one thing. A senior engineer driving Claude Code through a careful refactor and an unattended bot opening PRs on a schedule are both "AI code," and lumping them together would hide whatever signal exists. So before measuring anything we built a per-commit provenance detector that reads eight different signals, including bot account identities, service email addresses, commit-message footers, co-author trailers, and merged-PR evidence like agent branch prefixes and PR-body markers. We then blind-validated it, handing 124 commits to six independent reviewers, and it came back at 96.2% precision, with six of the eight detection channels perfect. The one real failure mode, a human pushing a follow-up commit inside an agent's PR, gets its confidence downgraded so we can filter it.
With provenance in hand we split agent commits into three tiers and, importantly, never pooled them:
T1, bot agents. Near-autonomous, no human in the immediate loop. Think Devin, the Copilot coding agent, Cursor's cloud agents.
T2, human-driven agents. Claude Code, Codex, and friends, where a developer is steering and reviewing as they go. This is the overwhelming majority of agent commits in the wild.
T3, AI-assisted. A co-author trailer and not much else, the lightest touch.
That tiering turns out to matter, because the tiers behave differently, and any honest version of this story has to keep them apart.
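To make the tiering concrete, here is a minimal sketch of how a handful of provenance signals could map onto the three tiers. It is not our detector: the `CommitMeta` fields, the bot login, the branch prefix, the footer text, and the agent name list are all hypothetical placeholders, and the rule that a footer or branch prefix implies T2 is an assumption made for illustration. Only the `Co-authored-by:` trailer format and the `[bot]` login suffix are real conventions.

```python
import re
from dataclasses import dataclass

# Git's conventional co-author trailer, one per line in the commit message.
CO_AUTHOR = re.compile(r"^Co-authored-by:\s*(.+)$", re.IGNORECASE | re.MULTILINE)

# Everything below is a hypothetical placeholder, not a real signal list.
AGENT_BOT_LOGINS = {"example-agent[bot]"}      # allowlist, so dependency bots are not counted
AGENT_BRANCH_PREFIXES = ("agent/",)            # assumed branch naming scheme
AGENT_FOOTER = "Generated with ExampleAgent"   # assumed commit-message footer
AGENT_NAMES = ("exampleagent",)                # assumed co-author name fragments


@dataclass
class CommitMeta:
    author_login: str
    message: str
    branch: str = ""


def classify(commit: CommitMeta) -> str:
    """Return 'T1', 'T2', 'T3', or 'human', checking the strongest signal first."""
    if commit.author_login in AGENT_BOT_LOGINS:
        return "T1"  # the committing account is itself an agent
    if commit.branch.startswith(AGENT_BRANCH_PREFIXES) or AGENT_FOOTER in commit.message:
        return "T2"  # a human account, with agent tooling evidence
    co_authors = CO_AUTHOR.findall(commit.message)
    if any(name in who.lower() for who in co_authors for name in AGENT_NAMES):
        return "T3"  # only a co-author trailer points at an agent
    return "human"
```

Checking the signals from strongest to weakest means a commit with several markers lands in the most autonomous tier it qualifies for, which is one way to keep the tiers from overlapping.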
How you measure "this commit caused a bug"
The standard tool here is SZZ, named after Śliwerski, Zimmermann and Zeller. The idea is mechanical: find the commits that fix bugs, then git blame the lines they changed to find the commit that last touched those lines, and call that earlier commit the one that introduced the bug. Do this across a whole repo and you get a labeled dataset of bug-inducing and not-bug-inducing commits.
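The blame step is easier to see on a toy history than in prose. The sketch below is a simplified stand-in, not the pipeline we ran: it replaces `git blame` with an in-memory map of which commit last touched each line, and it assumes line numbers never shift, which real repositories do not guarantee. The `Commit` class and the sample history are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    touched: frozenset      # (path, line_number) pairs this commit changed
    is_fix: bool = False    # True if the commit was identified as a bug fix


def szz_labels(history):
    """Label each commit True if a later fix blames back to it.

    `history` must be ordered oldest first.
    """
    last_touch = {}   # (path, line_number) -> sha of the most recent commit to touch it
    inducing = set()
    for commit in history:
        if commit.is_fix:
            # Blame: who last touched the lines this fix had to change?
            for line in commit.touched:
                culprit = last_touch.get(line)
                if culprit is not None:
                    inducing.add(culprit)
        # Record this commit as the latest author of its lines.
        for line in commit.touched:
            last_touch[line] = commit.sha
    return {c.sha: c.sha in inducing for c in history}


history = [
    Commit("a1", frozenset({("app.py", 10), ("app.py", 11)})),
    Commit("b2", frozenset({("app.py", 11)})),
    Commit("c3", frozenset({("util.py", 3)})),
    Commit("d4", frozenset({("app.py", 11)}), is_fix=True),
]
labels = szz_labels(history)
```

In this history the fix `d4` changes `app.py` line 11, which `b2` touched last, so `b2` is the only commit labeled bug-inducing. Commit `a1` touched the same line earlier but is not blamed, because blame only sees the most recent change.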
We ran SZZ within each repository, comparing agent commits to human commits in the same codebase, which controls for the obvious confound that some projects are just buggier than others. Then we fit a logistic model with the lines added, lines deleted, and files touched as controls, plus a repo fixed effect, so that every result reads as "bug risk beyond what the size of the change already explains." That size control is not optional. The single strongest predictor of whether a commit introduces a bug is how big the commit is, and if you skip the control you mostly end up measuring whether agents write bigger or smaller diffs than humans.
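To see why the size control is not optional, here is a small sketch using invented counts. It does not fit the logistic model described above. It swaps in a simpler classical technique, the Mantel-Haenszel pooled odds ratio across size strata, because that shows the same confounding effect in a few lines of standard-library Python. The counts are made up so that agents and humans have identical bug odds within each size bucket, while agents write mostly large commits.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: agent bug, agent clean, human bug, human clean."""
    return (a * d) / (b * c)


def crude_odds_ratio(strata):
    """Ignore commit size: collapse all strata into one table."""
    a, b, c, d = (sum(col) for col in zip(*strata))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(strata):
    """Pool the per-stratum tables, weighting each by its size."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Invented counts: (agent_bug, agent_clean, human_bug, human_clean)
strata = [
    (5, 95, 45, 855),    # small commits: 5% bug rate for both groups
    (80, 320, 20, 80),   # large commits: 20% bug rate for both groups
]
crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
```

With these numbers the crude odds ratio comes out near 3 even though the size-adjusted one is 1. The apparent difference is purely a difference in diff size, which is exactly the artifact a size control removes.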
There's one more piece of discipline that ended up being the difference between a believable result and a flattering one. Naive SZZ has a built-in bias in favor of agents here. It excludes fix commits from being counted as bug-inducers, and agents do a disproportionate amount of fixing, so the naive method quietly shields them. To catch that, we ran a stricter variant (B-SZZ) as a mandatory sensitivity check on every single result. If a finding only shows up under the friendly variant and evaporates under the strict one, it isn't real. I'll tell you below exactly where that line falls.
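The shielding effect is easy to reproduce on toy data. The sketch below is not the B-SZZ implementation. It only toggles the one rule described above, whether a fix commit is allowed to count as bug-inducing, and the commit records and field names are hypothetical.

```python
def inducing_rate(commits, author, exclude_fixes):
    """Share of `author`'s commits labeled bug-inducing.

    With exclude_fixes=True, a commit that is itself a fix is never
    counted as bug-inducing, even if a later fix blames back to it.
    """
    own = [c for c in commits if c["author"] == author]
    flagged = [
        c for c in own
        if c["blamed"] and not (exclude_fixes and c["is_fix"])
    ]
    return len(flagged) / len(own)


# Invented records: the agent does more fixing, and one of its fixes was later blamed.
commits = [
    {"author": "agent", "is_fix": True,  "blamed": True},
    {"author": "agent", "is_fix": True,  "blamed": False},
    {"author": "agent", "is_fix": False, "blamed": False},
    {"author": "agent", "is_fix": False, "blamed": True},
    {"author": "human", "is_fix": False, "blamed": True},
    {"author": "human", "is_fix": False, "blamed": False},
    {"author": "human", "is_fix": True,  "blamed": False},
    {"author": "human", "is_fix": False, "blamed": False},
]

agent_naive = inducing_rate(commits, "agent", exclude_fixes=True)
agent_strict = inducing_rate(commits, "agent", exclude_fixes=False)
human_naive = inducing_rate(commits, "human", exclude_fixes=True)
human_strict = inducing_rate(commits, "human", exclude_fixes=False)
```

Under the naive rule the agent and the human look identical. Under the strict rule the agent's rate doubles and the human's does not move. A finding that depends on which rule you pick is the kind that should not survive the sensitivity check.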
The headline: agent commits are not more bug-inducing
Here is the core result, the adjusted odds of a commit introducing a bug, by authorship tier, relative to human commits in the same repo. Below 1.0 means fewer bugs than the human baseline.
Figure: Adjusted odds of introducing a bug by authorship tier, controlled for change size and churn.
Every tier lands at or below the human line. Human-driven agents (T2) come in at an odds ratio of 0.57, with a 95% confidence interval of 0.42 to 0.76, so the whole interval sits below 1.0. Bot agents (T1) are at 0.75 [0.43, 0.95]....