Why reviewing AI-generated code is devilishly hard


blog dds: 2026-05-23 — Why reviewing AI-generated code is devilishly hard

Diomidis Spinellis blog


Here’s the thing: when working on code with GenAI assistance (from a chatbot, through IDE auto-completion, or, increasingly, with an AI agent) you need a better understanding of the system than when working without it. Cognitive psychology and the workings of large language models (LLMs) give us four clues as to why this happens.

When working without AI assistance on a non-trivial task and on code you don’t know, you first need to comprehend the code in order to perform your task. Otherwise you’re hacking (in the sense of performing undisciplined changes), not programming, and most likely you won’t go anywhere (fast). This is an objective built-in control gate of the human-only software development process: if you don’t understand the code, you can’t contribute to it and you fail.

When you work with AI assistance and have to review an AI-generated change that passed continuous integration, the control gate is missing: there’s no objective mechanism to determine whether you truly understand the code or not. This means that you may accept an AI-generated change in the mistaken belief that you understand it, when in fact you don’t.

The required type of thinking, metacognition, involves not only understanding the code but also assessing your understanding. In brief, it involves two abilities.

Metacognitive monitoring: assessing what you know (e.g. a specific data structure or design pattern used in the code), how well you understand the change, and how confident you are in your review comments, as well as detecting confusion.

Metacognitive judgment: specific evaluative acts within monitoring, such as

judgment of learning (I’ll remember this change when we discuss our new architecture);

feeling of knowing (I’ll understand this code when I see it again);

confidence judgment (I’m 95% sure this code is correct); and

ease-of-processing judgment (this change feels easy, so I’ll probably understand it).

Unfortunately, a couple of influential psychological studies have shown that people are often poor at accurately evaluating their own knowledge or performance. Most famously, the Dunning-Kruger effect describes how people with low competence tend to overestimate it, because the skills needed to perform a task are also needed to evaluate performance. The implication is that junior programmers are more at risk of accepting faulty AI-generated code.

In addition, through the illusion of explanatory depth people think they understand mechanisms with far greater precision, coherence, and depth than they really do. Moreover, this gap is stronger for explanatory knowledge than for many other kinds of knowledge, such as that for facts, procedures, or narratives. In the original study, its authors tested how well students could explain the working of devices such as a sewing machine, a can opener, a self-winding watch, a nuclear power plant, or a photocopier. This could well apply to explanatory knowledge of code: algorithms, data structures, designs, interactions, and architecture. Importantly, in programming, explanatory knowledge, which allows reasoning across the software development lifecycle, is a lot more important than facts (which can be established with tools), procedures (which should be automated), or narratives (which are rarely embedded into code).

Finally, the fluency of LLMs makes faulty code appear more trustworthy than it deserves and also feeds another cognitive trait. Consider the diverse ways in which a programmer can err: slips, lapses, mistakes, knowledge-based reasoning failure, rule-based misapplication, cognitive overload, confirmation bias, availability bias, anchoring, overconfidence, abstraction mismatch, inattentional blindness, plan-composition failure, or specification ambiguity. For most of these, a reviewer (especially a more experienced one) will probably be able to detect the resulting fault by thinking differently. In contrast, AI-generated code is written so as to look plausibly correct, even when it’s faulty. (This is due to how LLMs work: they generate a series of the next most probable tokens.) This makes it more difficult for a human reviewer to detect a fault in the code.
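The parenthetical remark on next-token generation can be made concrete with a deliberately tiny sketch. It assumes a hypothetical hand-written probability table (`NEXT_TOKEN_PROBS`) that looks only at the previous token and always picks the most probable continuation (greedy decoding). Real LLMs condition on the whole context and usually sample rather than take the maximum, so treat this as an illustration of the principle, not of any actual model.

```python
# Hypothetical toy "model": probabilities of the next token given only the
# previous one. All tokens and numbers are invented for illustration.
NEXT_TOKEN_PROBS = {
    "<start>": {"for": 0.6, "while": 0.4},
    "for": {"i": 0.9, "j": 0.1},
    "i": {"in": 0.8, "=": 0.2},
    "in": {"range(n)": 0.7, "range(n+1)": 0.3},
    "range(n)": {":": 1.0},
    "range(n+1)": {":": 1.0},
    ":": {"<end>": 1.0},
}


def greedy_decode(probs, start="<start>", end="<end>", max_tokens=20):
    """Repeatedly append the single most probable next token."""
    tokens = []
    current = start
    for _ in range(max_tokens):
        candidates = probs[current]
        # Pick the likeliest continuation; nothing here checks correctness.
        current = max(candidates, key=candidates.get)
        if current == end:
            break
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    print(" ".join(greedy_decode(NEXT_TOKEN_PROBS)))
```

In this toy table the loop bound `range(n)` wins simply because it is the more common continuation. If the task at hand needed `range(n+1)`, the output would still read like ordinary, familiar code, which is what makes such a fault hard to spot.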

The plausibility of LLM code is compounded by a specific trait of the complex relationship between humans and automation, automation bias: the well-documented tendency of people to place unwarranted trust in automated systems, reducing their own independent verification effort. A study of automation across diverse critical domains has shown that when automation provides recommendations, people are more likely to accept incorrect suggestions (errors of commission, e.g. a duplicated routine) and less likely to detect problems that the automation failed to address (errors of omission, e.g. a missing handler or test).
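To make the two error types concrete, here is a small hypothetical change of the kind a reviewer might wave through. All names (`mean_latency`, `average_latency`, `fails_on_empty_input`) are invented for this sketch and do not come from any real code base or tool.

```python
# Hypothetical existing helper, already present in the code base.
def mean_latency(samples):
    """Return the arithmetic mean of the latency samples."""
    return sum(samples) / len(samples)


# Hypothetical suggested addition.
# Error of commission: it re-implements mean_latency under a new name.
# Error of omission: like the helper it copies, it has no handler for an
# empty list, and the change comes with no test for that case.
def average_latency(samples):
    """Return the average of the latency samples."""
    total = 0.0
    for sample in samples:
        total += sample
    return total / len(samples)


def fails_on_empty_input(func):
    """Return True if func raises ZeroDivisionError for an empty list."""
    try:
        func([])
    except ZeroDivisionError:
        return True
    return False
```

The duplicate is visible in the diff, so spotting it only requires knowing that `mean_latency` exists. The missing empty-list handling appears nowhere in the diff, so the reviewer has to think of it unprompted, for example by running `fails_on_empty_input(average_latency)`.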

In short, when reviewing...
