Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools
June 09, 2026
The transformative power of LLMs in coding has been irrefutable, and it feels like we are living through a magical computing renaissance. On social media, we hear impressive numbers of lines of code generated, features delivered, and bugs fixed. But the macroeconomic indicators still seem to lag. Heck, if you talk with an engineering manager, you'll find that their product shipping dates haven't miraculously compressed by a factor of five, either.
This paper, from MIT and Wharton by Mert Demirer, Leon Musolff, and Liyuan Yang, landed just 10 days ago. It attempts to provide a structured economic model for evaluating the actual productivity obtained from AI coding tools. By pairing confidential Microsoft telemetry with the public footprints of over 100,000 GitHub developers (tracking everything from open-source utilities to web app repositories), the authors show significant systemic friction downstream of AI code generation.
Of course, I give the paper my usual skeptical critique. The skepticism is especially heightened here because these are economists peering into the messy, nonlinear world of software engineering and trying to impose a "production hierarchy" abstraction onto it. But if we reconsider their analysis from a different perspective, we can translate their complex production functions into Amdahl's Law terms, and then we can run our own evaluations and draw our own conclusions, as I discuss below.
The Monotonic Decay
The core of the paper rests on this monotonic decay argument: the large task-level velocity gains we see from AI coding tools bleed out as the work moves up the production hierarchy. The authors break AI tool adoption down into three distinct generational tiers:
Autocomplete (intelligent text prediction),
Synchronous (Sync) Agents (interactive, real-time code modifiers like Claude Code or Cursor),
Asynchronous (Async) Agents (autonomous agents that run without real-time interaction).
When we look at the task-level velocity of these tools, we see impressive numbers. The paper's abstract reports that autocomplete, interactive sync agents, and autonomous async agents increase overall commit activity by cumulative totals of 40%, 140%, and 180%, respectively. But as that output climbs toward an official production milestone, the improvements decay significantly.
For Autocomplete, the +228.2% explosion in raw lines of code bleeds out layer by layer until it becomes a meager +10.2% increase in actual shipped software releases. For Sync agents, a gigantic +741.3% surge in raw code output shrinks to a modest +20.3% increase in final weekly releases.
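As a quick back-of-envelope check on how much of the bottom-layer gain survives to the top, the sketch divides the release gain by the lines-of-code gain, both directly and in log terms (log changes compose more naturally across multiplicative layers). The only inputs are the percentages quoted above; the names `REPORTED_GAINS`, `log_change`, and `pass_through` are my own hypothetical helpers, and this is plain arithmetic, not the paper's estimation method.

```python
import math

# Percentages quoted in the post: bottom-layer lines of code vs. top-layer releases.
REPORTED_GAINS = {
    "Autocomplete": {"lines": 228.2, "releases": 10.2},
    "Sync agents": {"lines": 741.3, "releases": 20.3},
}


def log_change(pct_gain: float) -> float:
    """Convert a percentage gain (e.g. 228.2 for +228.2%) into a log change."""
    return math.log1p(pct_gain / 100.0)


def pass_through(lines_pct: float, releases_pct: float) -> tuple[float, float]:
    """Return (linear ratio, log ratio) of the top-layer gain to the bottom-layer gain."""
    linear = releases_pct / lines_pct
    logged = log_change(releases_pct) / log_change(lines_pct)
    return linear, logged


if __name__ == "__main__":
    for tool, gains in REPORTED_GAINS.items():
        linear, logged = pass_through(gains["lines"], gains["releases"])
        print(f"{tool}: {linear:.1%} of the percentage gain survives; "
              f"log pass-through {logged:.3f}")
```

On these inputs, only about 4.5% (Autocomplete) and 2.7% (Sync agents) of the headline percentage gain reaches releases; in log terms, both tools pass through roughly 8 to 9% of the bottom-layer change.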
My immediate reaction to this vertical hierarchy (Lines -> Files -> Commits -> PRs -> Repositories -> Releases) is skepticism. Treating code production like a neat production line feels superficial. Software engineering is not a linear conveyor belt: coding is highly nonlinear, and a single commit routinely alters fifty files. However, giving the authors the benefit of the doubt and reading on, I find that this naive abstraction still has value, because it points to human gatekeeping and coordination overhead at the higher levels of the CI/CD pipeline. AI can write lines of code instantly, unencumbered by syntax at the lower layers. But as that work climbs toward an official production milestone, the structural constraints of the system and human bottlenecks take over, and the massive improvements at the task level decay to nearly nothing.
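The gatekeeping bottleneck described above is exactly the shape of the Amdahl's Law reading mentioned at the start. Here is a minimal sketch of that framing: if only a fraction $p$ of end-to-end delivery work is accelerated by a factor $s$, the overall speedup is $1/((1-p) + p/s)$, capped at $1/(1-p)$ no matter how large $s$ gets. The 30% coding share is a hypothetical number chosen for illustration, not a figure from the paper, and `amdahl_speedup` is my own helper name.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only `accelerated_fraction` of the work is sped up by `factor`.

    Amdahl's Law: S = 1 / ((1 - p) + p / s).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    p, s = accelerated_fraction, factor
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    # Hypothetical: code writing is 30% of end-to-end delivery time;
    # the other 70% (review, CI, coordination, release gating) is not accelerated.
    p = 0.3
    for s in (2, 5, 10, 1000):
        print(f"coding {s:>4}x faster -> delivery {amdahl_speedup(p, s):.2f}x faster")
    print(f"ceiling as s grows without bound: {1 / (1 - p):.2f}x")
```

With that hypothetical 30% share, even a 1000x faster coder yields only about a 1.43x faster delivery pipeline.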
Let's dive deeper into the mathematical modeling behind this. Using only Autocomplete (because it operates exclusively at the code-writing level), the authors chose the parameters that minimized the differences between their model's predictions and actual developer behavior. Through this exercise, they extracted a local Upstream Output Elasticity ($\theta$) of approximately 0.75. In this layered production model, $\theta = 0.75$ acts as a vertical pass-through metric: at any given stage (say, turning raw commits into clean pull requests), a 1% increase in the upstream technical assets flowing in from the layer below raises that layer's output by only about 0.75%, while the rest of that layer's output depends on the local human effort added there. Because they model software development as a vertically sequential aggregation process, that human intervention operates like a compounding efficiency tax. A massive initial code productivity surge at the bottom layer has its (log-scale) growth rate relentlessly multiplied by 0.75 over and over again as it climbs the hierarchy, mathematically forcing the steep vertical attenuation we see in the empirical data.
Now, the sync and async agents aren't just typing lines...
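To see why repeated multiplication by $\theta$ flattens a large surge so quickly, here is a stylized sketch of the compounding. It assumes (my simplification, not necessarily the paper's exact specification) that $\theta = 0.75$ is identical at every layer, that human effort at each layer stays fixed, and that a first-order approximation holds, so $\Delta \ln Y_k \approx \theta\, \Delta \ln Y_{k-1} = \theta^k\, \Delta \ln Y_0$. The layer list mirrors the post's hierarchy; `propagate` is a hypothetical helper.

```python
import math

THETA = 0.75  # local upstream output elasticity reported in the post

# Five transitions: Lines -> Files -> Commits -> PRs -> Repositories -> Releases.
LAYERS = ["Files", "Commits", "PRs", "Repositories", "Releases"]


def propagate(bottom_gain_pct: float, theta: float = THETA,
              n_layers: int = len(LAYERS)) -> list[float]:
    """Propagate a bottom-layer percentage gain upward.

    Assumes d ln Y_k = theta * d ln Y_{k-1}, with human effort held fixed at
    every layer. Returns the implied percentage gain at each layer above the bottom.
    """
    log_gain = math.log1p(bottom_gain_pct / 100.0)
    gains = []
    for _ in range(n_layers):
        log_gain *= theta  # each layer passes through only theta of the upstream log change
        gains.append(100.0 * math.expm1(log_gain))
    return gains


if __name__ == "__main__":
    for name, gain in zip(LAYERS, propagate(228.2)):
        print(f"{name:>12}: +{gain:.1f}%")
```

Fed the +228.2% Autocomplete figure, this toy chain ends at roughly +33% for releases, well above the reported +10.2%. Treat it only as an illustration of how compounding by $\theta$ attenuates a large bottom-layer surge, not as a reproduction of the paper's calibrated model, which this sketch simplifies heavily.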