Why Your Team Ships 2x the PRs and Delivers the Same


AI Engineering

The Productivity-Reliability Paradox: Why Your Team Ships 2x the PRs and Delivers the Same

A new paper (arXiv:2605.01160) names the problem every CTO already feels: AI coding tools double individual output while degrading system-level delivery. The fix isn't a better model - it's specification discipline. Here's how it maps onto Spec-Driven Development at Open Mercato.

Tomasz Karwatka

June 17, 2026


A new academic paper just gave a name to the problem every CTO deploying AI coding tools already feels in their gut. It is called the Productivity-Reliability Paradox.

Here are the numbers (arXiv:2605.01160, May 2026, 67 sources reviewed):

98% more pull requests merged
91% longer review times
Flat delivery metrics
Developers perceive themselves as faster - even when objective measurements show a 19% slowdown for experienced engineers on real tasks

Read that again. Your team ships twice the PRs. Your reviewers drown. Your delivery velocity does not move.

What the paper actually says

The paper calls it PRP - the Productivity-Reliability Paradox: AI coding assistants simultaneously improve individual output metrics AND degrade system-level dependability. The contradiction is not noise in the data. Controlled studies report 20-56% productivity gains on well-scoped tasks, the most rigorous RCT documents a 19% slowdown for experienced developers, and telemetry across 10,000+ developers shows the PR explosion with flat delivery.

This is not a model problem. The paper's conclusion is blunt:

"Specification discipline, not model capability, is the binding constraint on AI-assisted software dependability."

Not better prompts. Not GPT-5. Not a faster Cursor tab-complete. Specifications.

The proposed fix: Specification Governance Model (SGM)

The paper proposes the Specification Governance Model (SGM), grounded in Transaction Cost Economics. The core idea: deterministic specifications serve as governance contracts between non-deterministic AI generators and the deterministic requirements of production systems.

In plain English: if you do not tell the AI WHERE things go and HOW they should be structured, you get code that works in isolation and breaks everything else.

The paper evaluates two instantiations of this model:

GitHub's Spec Kit
The TDAD (Test-Driven AI Agent Definition) pipeline - reporting 86-100% mutation scores

Both share the same principle: specs first, generation second.

Why this hit home: Spec-Driven Development at Open Mercato

At Open Mercato, we have been building exactly this way since day one. We call it Spec-Driven Development (SDD) - every module, every entity, every event contract starts with a specification that ships inside the repo.

When an AI agent (Cursor, Claude Code, Codex) generates code, it reads the spec and knows:

Where the code belongs architecturally
What boundaries to respect
Which patterns to follow
What would break if ignored

Without the spec? Same model, same prompt - code that compiles, passes local tests, and breaks 3 other modules in production. With specs in place, we have seen the amount of "AI-generated code that needs senior review" drop by roughly 60%.

The paper validates what we learned through building: the bottleneck is not generation. It is governance. The three moderating variables the paper identifies - task abstraction level, codebase maturity, developer experience - are exactly the dimensions where specifications make the biggest difference. A junior developer with a well-written spec produces architecture-aware code. A senior developer without one produces AI-generated spaghetti, faster.

The code review bottleneck (the part that matters most)

The most important insight from the paper for me is the code review bottleneck. AI tools dramatically increase the volume of code submitted for review. But review capacity is fixed - it depends on senior engineers who are already stretched thin.

The result: either reviews become superficial (quality drops) or review queues grow (velocity drops).

Specification-driven governance attacks this from the supply side. By constraining what AI generates through specs, the volume of "wrong but plausible" code decreases. Reviews get faster because the code is architecturally predictable.

This is why we designed Open Mercato as an AI-Engineering Foundation Framework - not another code generator, but the foundation on which code generators produce reliable output.

Three takeaways for engineering leaders

Name the problem. If your team merged twice the PRs last quarter but shipped...
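
To put rough numbers on the code review bottleneck described above, here is a minimal Python sketch of the "review queues grow" branch. It is a toy model, not anything from the paper: the function name `simulate_review_queue`, the `WeekResult` type, and every number in it (40 and 80 submitted PRs per week, a capacity of 45 thorough reviews per week) are hypothetical and chosen only for illustration. It assumes reviewers never cut corners, so anything beyond their capacity waits in the queue.

```python
from dataclasses import dataclass


@dataclass
class WeekResult:
    week: int
    reviewed: int  # PRs that received a full review this week
    backlog: int   # PRs still waiting at the end of the week


def simulate_review_queue(submitted_per_week: int,
                          review_capacity_per_week: int,
                          weeks: int) -> list[WeekResult]:
    """Toy model: reviewers clear at most `review_capacity_per_week` PRs."""
    backlog = 0
    results = []
    for week in range(1, weeks + 1):
        waiting = backlog + submitted_per_week
        reviewed = min(waiting, review_capacity_per_week)
        backlog = waiting - reviewed
        results.append(WeekResult(week, reviewed, backlog))
    return results


# Hypothetical numbers, not taken from the paper.
before = simulate_review_queue(submitted_per_week=40,
                               review_capacity_per_week=45, weeks=4)
after = simulate_review_queue(submitted_per_week=80,
                              review_capacity_per_week=45, weeks=4)

for label, run in (("before", before), ("after", after)):
    last = run[-1]
    print(f"{label}: reviewed/week={last.reviewed}, "
          f"backlog after {last.week} weeks={last.backlog}")
# before: reviewed/week=40, backlog after 4 weeks=0
# after: reviewed/week=45, backlog after 4 weeks=140
```

Under these assumptions, doubling submissions raises reviewed throughput only from 40 to 45 PRs per week, while the queue grows by 35 PRs every week. The only ways out are to review less carefully, add review capacity, or reduce the share of submitted code that needs deep review, which is the lever the specification argument targets.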
