Twelve Ways to Be Wrong About AI-Assisted Coding

The Third Bit: Twelve Ways to Be Wrong About AI-Assisted Coding

Posted 2026-05-20

Suppose your manager asks you next week to demonstrate that the AI coding tools your company signed up for are worth the subscription cost. Would you measure lines of code generated, or tickets closed? Or would you send out a survey asking whether developers feel more productive? Each of those approaches is flawed in a different way; the sections below explain why.

Note: this post is about how people are assessing AI, not about LLM-assisted coding itself; with a little rewording, these criticisms could be applied to a lot of the claims that have been made about agile development, test-driven development, and other practices. If I’ve learned anything in the last twenty years, it’s that software engineering would be a lot further ahead today if we had been willing to let our peers in the human sciences teach us how to study these kinds of things properly.

Also, if you’d like a one-day introduction to the research methods you should use to avoid making these errors, please reach out. I’m not qualified to teach it, but I know people who are, and I could probably talk them into doing it…

Counting Lines of Code Generated

Proxy metrics stand in for concepts that are hard to measure directly, and lines of code is one of the oldest. LLMs generate more code, but not necessarily better outcomes: a team that sees a 40% increase in lines of code per developer after adopting LLM tools has measured verbosity, not productivity. Deleting 2000 lines of tangled logic and replacing it with 200 clean ones is an improvement that looks like a loss on this metric [Sadowski2019]. More code also means more to read, maintain, and debug, and AI’s contribution to that future burden does not appear in the line count.
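To make the arithmetic concrete, here is a minimal sketch of how two line-counting metrics would score the refactoring described above. The function names and the "verbose" comparison change are hypothetical, invented for illustration; they do not come from any of the cited studies.

```python
def lines_added(added: int, deleted: int) -> int:
    """Hypothetical 'lines generated' metric: only counts what was added."""
    return added


def net_lines(added: int, deleted: int) -> int:
    """Hypothetical 'net growth' metric: added minus deleted."""
    return added - deleted


# The refactoring from the text: 2000 tangled lines replaced by 200 clean ones.
refactor = {"added": 200, "deleted": 2000}

# An invented change that adds 2000 lines and removes nothing.
verbose = {"added": 2000, "deleted": 0}

refactor_net = net_lines(**refactor)      # 200 - 2000 = -1800
refactor_added = lines_added(**refactor)  # 200
verbose_added = lines_added(**verbose)    # 2000

print(f"refactor: net={refactor_net}, added={refactor_added}")
print(f"verbose:  net={net_lines(**verbose)}, added={verbose_added}")
```

On either metric the verbose change "wins", even though the refactoring is the one that reduced the amount of code someone will have to read and maintain later.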

Timing Artificial Tasks

A widely cited study found that developers who used GitHub Copilot completed a task 55% faster than those who did not [Peng2023]. The task was implementing an HTTP server in JavaScript from scratch, in ninety minutes; the developers had no other obligations that day. Real software development involves navigating a large codebase you did not write, understanding a requirement described ambiguously in a ticket, coordinating with colleagues, and attending meetings. Speed on a greenfield toy task does not predict speed on any of that. A randomized controlled trial with experienced open-source developers found the opposite of what participants themselves predicted: giving them access to AI tools increased task completion time by 19% [Becker2025].

Before/After With No Control Group

You start using LLMs in January; by June, pull requests are shipping faster, so the tools must be working, right? But between January and June you hired twelve engineers, refactored the CI pipeline, and switched your cloud provider. Without a group that did not adopt the tools, you cannot separate the effect of LLMs from any of the other changes that happened at the same time. Internal validity requires a credible counterfactual, i.e., some way of knowing what would have happened otherwise.
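One common way to approximate that counterfactual is a difference-in-differences comparison: subtract the change seen in a group that did not adopt the tools from the change seen in the group that did. The sketch below is only an illustration of that idea, not a method the cited studies necessarily used. The cycle times are invented numbers, and the estimate is only meaningful if both groups would otherwise have followed similar trends.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical pull-request cycle times in hours (invented for illustration).
adopters_jan = [40.0, 44.0, 36.0]  # teams that adopted LLM tools, before
adopters_jun = [30.0, 34.0, 26.0]  # the same teams, after
others_jan = [42.0, 38.0, 40.0]    # teams that did not adopt, before
others_jun = [32.0, 34.0, 30.0]    # the same teams, after

# Naive before/after comparison: attributes the whole change to the tools.
naive_effect = mean(adopters_jun) - mean(adopters_jan)

# Difference-in-differences: removes the change both groups share
# (new hires, the CI refactoring, the cloud migration, and so on).
did_effect = diff_in_diff(adopters_jan, adopters_jun, others_jan, others_jun)

print(f"naive before/after: {naive_effect:+.1f} hours")
print(f"diff-in-diff:       {did_effect:+.1f} hours")
```

With these made-up numbers the naive comparison credits the tools with a ten-hour improvement, while the comparison against non-adopters leaves only two hours that the shared changes cannot explain.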

Asking Developers If They Feel More Productive

Survey results like “87% of developers report feeling more productive with AI tools” are regularly cited as evidence that the tools work [Liang2024], but three things make self-report systematically misleading:

The Hawthorne effect means people work differently when they know they are being observed and evaluated;

The novelty effect means new tools feel faster because they are novel, and that feeling typically fades within weeks; and

Social desirability bias means respondents tend to say what they believe the survey wants to hear, especially when management chose the tool.

Counting Commits, Pull Requests, and Tickets

In 2023, McKinsey proposed measuring individual developer productivity using counts of commits, pull requests, code reviews, and similar activities [McKinsey2023]. Goodhart’s Law states that when a measure becomes a target, it ceases to be a good measure [Goodhart1984]. When developers know their commit count is tracked, they make more, smaller commits; when ticket counts are tracked, tickets get split. The numbers improve while the underlying work does not [Beck2023]. Activity is not output; output is not value.
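The commit-splitting behavior is easy to simulate. The following is a minimal sketch with invented commit sizes and a hypothetical `split_commit` helper; it shows the activity count quadrupling while the amount of code changed stays exactly the same.

```python
def split_commit(lines_changed: int, pieces: int) -> list[int]:
    """Split one commit into `pieces` smaller commits with the same total size."""
    base, extra = divmod(lines_changed, pieces)
    return [base + (1 if i < extra else 0) for i in range(pieces)]


# Hypothetical week of work: three commits, sizes in lines changed.
original = [300, 120, 80]

# The same work after the developer learns that commit count is tracked.
gamed = [size for commit in original for size in split_commit(commit, 4)]

print(f"commits: {len(original)} -> {len(gamed)}")
print(f"lines changed: {sum(original)} -> {sum(gamed)}")
```

A dashboard that counts commits would report a fourfold jump in activity here, although nothing about the work itself has changed.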

Measuring Only the Easy Half

LLMs make code generation faster, and that half is easy to measure. The other half is harder: time spent reviewing LLM-generated code for correctness, time lost debugging confidently wrong suggestions, security vulnerabilities introduced by plausible-looking but insecure code, and technical debt from suggestions that solved the immediate problem while ignoring the surrounding design. A study of GitHub Copilot’s code found that a substantial fraction of generated code contained security...
