AI coding agents need evidence-first review, not just cheaper routing

Cheaper AI Code Generation and Engineering Cost | Undes


Contents

1. Verification tax
2. Total decision cost
3. Control layers
4. Side by side
5. Trust debt
6. Break-even
7. Implementation
8. Benchmark
9. Limits

In many AI-assisted workflows, code generation is no longer the only bottleneck. Assistants read repositories, edit files, run commands, and write tests. Agentic systems plan, call tools, retrieve more context, and assemble an answer over several steps or several models.

That leaves the reviewer with a harder question: what was actually checked, what did the model merely assume, and how much of this result can I rely on before merge?

Producing plausible code has become cheaper. Checking its foundations has not necessarily followed. Comparing AI tools only by token price, generation speed, or agent count misses the engineering decision that matters: the path from a request to a justified merge decision.

This article asks three questions:

Does AI reduce total decision cost once calls, review, rework, and escaped-error risk are counted?

Which part of that cost is targeted by routing, retrieval, multi-model deliberation, and automated checks?

What should a verification layer produce, and how can its value be falsified rather than merely claimed?

1. The verification tax

The productivity evidence is mixed. METR ran a randomized controlled trial with 16 experienced open-source developers performing 246 real tasks in mature repositories they knew well, using early-2025 tooling. With AI, tasks took 19% longer on average [1].

In February 2026, METR reported that newer data probably shows a larger uplift, but explicitly called the signal unreliable. The raw estimate for returning developers was -18% change in completion time with a confidence interval of [-38%, +9%]; for newly recruited developers it was -4% with [-15%, +9%], where negative means speedup. Both intervals include zero effect [2].

The honest conclusion is neither “AI always speeds developers up” nor “AI always slows them down.” Productivity depends on tool maturity, repository familiarity, task shape, context acquisition, and the cost of checking the result.

The 2025 DORA report provides a different, observational view of nearly 5,000 technology professionals: 90% use AI at work, more than 80% perceive a productivity gain, but 30% have little or no trust in AI-generated code. AI adoption is positively associated with delivery throughput and product performance and negatively associated with delivery stability [9]. This is not a causal estimate. It is consistent with a systems hypothesis: faster local generation may increase downstream load if testing and delivery controls do not scale with change volume.

A synthesis of seven Google studies found that 39% of external developers trust GenAI output quality only slightly or not at all. Perceived rigor of review and testing, and developer control over where AI is used, were positively associated with trust [7].

Review itself is not only defect-finding. In Bacchelli and Bird’s study of 200 Microsoft review threads and 570 comments, code improvements accounted for 29% of comments and defects for 14%. The authors identify understanding the context and the change as central to review and record knowledge transfer as an outcome in its own right [3].

An illustrative review-load model

Assume a team handles 20 PRs per week and an average review takes 30 minutes:

20 PR × 0.5 h = 10 reviewer-hours / week

If AI doubles throughput while review cost per PR stays fixed:

40 PR × 0.5 h = 20 reviewer-hours / week

If AI-assisted PRs become wider and review time rises by 25%:

40 PR × 0.625 h = 25 reviewer-hours / week

| Scenario | PR/wk | Review/PR | Review load |
| --- | --- | --- | --- |
| Pre-AI | 20 | 30 min | 10 h |
| 2× throughput | 40 | 30 min | 20 h |
| 2× throughput + wider PRs | 40 | 37.5 min | 25 h |

This is a sensitivity model, not a market statistic. It shows the mechanism: faster generation may move work from writing to checking rather than remove it.
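The review-load arithmetic above can be sketched as a small sensitivity function. This is a minimal illustration, not a measurement tool: the function name `weekly_review_hours`, its parameters, and the `scenarios` dictionary are assumptions introduced here, and the inputs are the article's illustrative numbers.

```python
def weekly_review_hours(prs_per_week: float, minutes_per_review: float) -> float:
    """Reviewer-hours per week for a given PR volume and per-PR review time."""
    return prs_per_week * minutes_per_review / 60.0


# Scenarios from the illustrative model above (assumed inputs, not measured data).
# Each value is (PRs per week, review minutes per PR).
scenarios = {
    "Pre-AI": (20, 30.0),
    "2x throughput": (40, 30.0),
    "2x throughput + wider PRs": (40, 30.0 * 1.25),  # review time up 25%
}

review_load = {
    name: weekly_review_hours(prs, minutes)
    for name, (prs, minutes) in scenarios.items()
}

for name, hours in review_load.items():
    print(f"{name}: {hours:.1f} reviewer-hours / week")
```

Changing either input shows how quickly reviewer load grows when throughput rises and per-PR review time does not fall.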

2. The total cost of an engineering decision

The token bill is not the total cost. Define the expected cost of one decision:

C_total = C_model + C_tools + R_hour × (T_review + T_rework) + P_escape × L_escape

C_model: model calls;

C_tools: CI, sandbox, retrieval, and other compute;

R_hour: internal cost of one engineering hour;

T_review: time to an apply/review/reject decision;

T_rework: expected time to fix issues found before merge;

P_escape: probability that a material error passes review;

L_escape: expected loss from such an escape.

Take an illustrative baseline: C_model = $5, review takes 60 minutes, and R_hour = $80. Set tools, rework, and risk aside temporarily:

C_total = $5 + $80 = $85
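The cost definition can be written as a small calculator to make the terms explicit. This is a sketch under stated assumptions: the class name `DecisionCost` and its field names are hypothetical and simply mirror the symbols above, with tools, rework, and risk defaulting to zero as in the baseline.

```python
from dataclasses import dataclass


@dataclass
class DecisionCost:
    """Expected cost of one engineering decision, mirroring the C_total formula."""

    c_model: float           # model calls, in dollars
    r_hour: float            # internal cost of one engineering hour, in dollars
    t_review: float          # hours to an apply/review/reject decision
    c_tools: float = 0.0     # CI, sandbox, retrieval, other compute, in dollars
    t_rework: float = 0.0    # expected hours to fix issues found before merge
    p_escape: float = 0.0    # probability that a material error passes review
    l_escape: float = 0.0    # expected loss from such an escape, in dollars

    def total(self) -> float:
        human_time = self.r_hour * (self.t_review + self.t_rework)
        expected_escape_loss = self.p_escape * self.l_escape
        return self.c_model + self.c_tools + human_time + expected_escape_loss


# Illustrative baseline from the text: $5 of model calls, 60 minutes of review
# at $80 per hour, with tools, rework, and risk set aside.
baseline = DecisionCost(c_model=5.0, r_hour=80.0, t_review=1.0)
print(f"C_total = ${baseline.total():.2f}")
```

Keeping every term as a named field makes it harder to drop rework or escape risk from a comparison without noticing.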

The ceiling on pure model-bill optimization

If model calls are a fraction f = C_model / C_total, then optimizing only the model bill while holding workload, quality, review, rework, and risk fixed lowers C_total by at most f. At the reference numbers:

f = 5 / 85 = 5.9%
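The ceiling can be checked numerically. In the sketch below the function names and the discount levels (50%, 90%, 100%) are hypothetical values chosen for illustration; only the $5 and $85 figures come from the reference numbers above.

```python
def max_saving_fraction(c_model: float, c_total: float) -> float:
    """Upper bound on the relative drop in C_total from cutting only C_model."""
    return c_model / c_total


def total_after_discount(c_model: float, c_rest: float, discount: float) -> float:
    """C_total after the model bill falls by `discount` (0..1), all else fixed."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return c_model * (1.0 - discount) + c_rest


C_MODEL = 5.0               # reference model bill, in dollars
C_REST = 80.0               # reference review cost, in dollars
C_TOTAL = C_MODEL + C_REST  # 85.0

f = max_saving_fraction(C_MODEL, C_TOTAL)
print(f"f = {f:.1%}")

# Hypothetical discount levels on the model bill only.
savings = {}
for discount in (0.5, 0.9, 1.0):
    new_total = total_after_discount(C_MODEL, C_REST, discount)
    savings[discount] = 1.0 - new_total / C_TOTAL
    print(f"model bill -{discount:.0%}: C_total falls by {savings[discount]:.1%}")
```

Even a free model (a 100% discount) lowers the total by exactly f, which is why the remaining terms dominate at these reference numbers.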

This is not a ceiling on routing’s total...
