The Third Bit: Twelve Ways to Be Wrong About AI-Assisted Coding

Posted 2026-05-20

Suppose your manager asks you next week to demonstrate that the AI coding tools your company signed up for are worth the subscription cost. Would you measure lines of code generated, or tickets closed? Or would you send out a survey asking whether developers feel more productive? Each of those approaches is flawed in a different way; the sections below explain why.

Note: this post is about how people are assessing AI, not about LLM-assisted coding itself; with a little rewording, these criticisms could be applied to a lot of the claims that have been made about agile development, test-driven development, and other practices. If I’ve learned anything in the last twenty years, it’s that software engineering would be a lot further ahead today if we had been willing to let our peers in the human sciences teach us how to study these kinds of things properly.

Also, if you’d like a one-day introduction to the research methods you should use to avoid making these errors, please reach out. I’m not qualified to teach it, but I know people who are, and I could probably talk them into doing it…

Counting Lines of Code Generated

Proxy metrics stand in for concepts that are hard to measure directly, and lines of code is one of the oldest. LLMs generate more code, but more code is not necessarily a better outcome: a team that sees a 40% increase in lines of code per developer after adopting LLM tools has measured verbosity, not productivity. Deleting 2000 lines of tangled logic and replacing them with 200 clean ones is an improvement that looks like a loss on this metric [Sadowski2019]. More code also means more to read, maintain, and debug, and AI’s contribution to that future burden does not appear in the line count.
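To make the arithmetic concrete, here is a minimal sketch in Python. The function names and the two change records are hypothetical, made up to match the 2000-versus-200 example above; they do not come from any real tool or dataset.

```python
def lines_added(changes):
    """Naive 'lines of code generated' proxy: total lines added."""
    return sum(added for added, _removed in changes)


def net_lines(changes):
    """Net change in codebase size: lines added minus lines removed."""
    return sum(added - removed for added, removed in changes)


# Each record is (lines_added, lines_removed); both are illustrative only.
cleanup = [(200, 2000)]   # replaces 2000 tangled lines with 200 clean ones
verbose = [(2000, 0)]     # adds 2000 new lines and removes nothing

# The proxy scores the verbose change ten times higher than the cleanup.
print(lines_added(cleanup), lines_added(verbose))
# The cleanup shrinks the codebase, so it counts as a loss on a size metric.
print(net_lines(cleanup), net_lines(verbose))
```

Neither number says anything about whether the code is correct or maintainable, which is the point: both are measures of volume.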

Timing Artificial Tasks

A widely cited study found that developers who used GitHub Copilot completed a task 55% faster than those who did not [Peng2023]. The task was implementing an HTTP server in JavaScript from scratch, in ninety minutes; the developers had no other obligations that day. Real software development involves navigating a large codebase you did not write, understanding a requirement described ambiguously in a ticket, coordinating with colleagues, and attending meetings. Speed on a greenfield toy task does not predict speed on any of that. A randomized controlled trial with experienced open-source developers found the opposite of what participants themselves predicted: giving them access to AI tools increased task completion time by 19% [Becker2025].

Before/After With No Control Group

You start using LLMs in January; by June, pull requests are shipping faster, so the tools must be working, right? But between January and June you hired twelve engineers, refactored the CI pipeline, and switched your cloud provider. Without a group that did not adopt the tools, you cannot separate the effect of LLMs from any of the other changes that happened at the same time. Internal validity requires a credible counterfactual, i.e., some way of knowing what would have happened otherwise.
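One common way to build that counterfactual is a difference-in-differences comparison: subtract the change seen in a group that did not adopt the tools from the change seen in the group that did. The sketch below is a bare-bones illustration, not a complete analysis. The function name and the cycle-time numbers are invented, and the method only works if the two groups would have followed similar trends without the tools, which is itself an assumption that has to be argued for.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the adopting group minus change in the comparison group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical pull request cycle times in hours (illustrative numbers only).
adopters_jan = [40, 44, 36]
adopters_jun = [30, 34, 26]
others_jan = [42, 38, 40]
others_jun = [32, 30, 34]

# Naive before/after comparison: adopters look 10 hours faster.
naive = mean(adopters_jun) - mean(adopters_jan)

# The comparison group also got 8 hours faster (new hires, CI refactoring,
# and so on), so only 2 hours of the improvement is left to attribute to
# the tools.
estimate = diff_in_diff(adopters_jan, adopters_jun, others_jan, others_jun)
print(naive, estimate)
```

With these made-up numbers, most of the apparent improvement comes from whatever else changed between January and June.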

Asking Developers If They Feel More Productive

Survey results like “87% of developers report feeling more productive with AI tools” are regularly cited as evidence that the tools work [Liang2024], but three things make self-report systematically misleading:

The Hawthorne effect means people work differently when they know they are being observed and evaluated;

The novelty effect means new tools feel faster because they are novel, and that feeling typically fades within weeks; and

Social desirability bias means respondents tend to say what they believe the survey wants to hear, especially when management chose the tool.

Counting Commits, Pull Requests, and Tickets

In 2023, McKinsey proposed measuring individual developer productivity using counts of commits, pull requests, code reviews, and similar activities [McKinsey2023]. Goodhart’s Law states that when a measure becomes a target, it ceases to be a good measure [Goodhart1984]. When developers know their commit count is tracked, they make more, smaller commits; when ticket counts are tracked, tickets get split. The numbers improve while the underlying work does not [Beck2023]. Activity is not output; output is not value.
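A toy illustration of how easily a count can be gamed: the sketch below splits the same changes into smaller commits. The function and the commit sizes are hypothetical, chosen only to show the count rising while the total work stays fixed.

```python
def split_commits(commit_sizes, max_lines):
    """Split each commit into chunks of at most max_lines changed lines."""
    result = []
    for size in commit_sizes:
        while size > max_lines:
            result.append(max_lines)
            size -= max_lines
        if size > 0:
            result.append(size)
    return result


# Changed lines per commit for one week of work (illustrative numbers only).
honest = [300, 120, 80]
gamed = split_commits(honest, max_lines=50)

# The commit count rises from 3 to 11 while total changed lines stays at 500.
print(len(honest), len(gamed))
print(sum(honest), sum(gamed))
```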

Measuring Only the Easy Half

LLMs make code generation faster, and that half is easy to measure. The other half is harder: time spent reviewing LLM-generated code for correctness, time lost debugging confidently wrong suggestions, security vulnerabilities introduced by plausible-looking but insecure code, and technical debt from suggestions that solved the immediate problem while ignoring the surrounding design. A study of GitHub Copilot’s code found that a substantial fraction of generated code contained security...
