AI Coding Feels Like Using an Unreliable Compiler

AI Coding Feels Like Using an Unreliable Compiler - Strumenta

Written by

Federico Tomassetti

Language Engineering, Reflections

19 May 2026

Every developer I know is asking roughly the same questions about LLMs.

Yes, there are LLM-fanatics and LLM-skeptics, but most are LLM-confused: sometimes LLMs seem amazing and sometimes they seem so dumb one wonders why we are using them in the first place.

So we keep asking: where can they help us? Where do they waste our time? Where are they dangerous? Where do they really change the way we build software?

In other words, how can we use them to reliably write better software, or to write software more productively?

While I think most of us are asking ourselves these questions, we each ask them based on our own experience, with our own years of scars from software development.

From my point of view, this reminds me of the time when we did not fully trust compilers. Believe it or not, there was a time when compilers felt a bit buggy, so when an issue in the code came up we could ask ourselves "could it be a bug in the compiler and not in my code?". 99% of the time the problem was in the code, but from time to time… the compiler of a language as complex as C++ could be wrong. For example, I remember an issue with a Microsoft compiler incorrectly handling variable declarations in for-loops.

So there was a bit of mistrust towards the thing that translated what we typed into what was then executed.

And this is exactly how LLMs make me feel: LLMs used for coding today feel like unreliable compilers.

What I mean is that, when we use an LLM to transform our intention into code, the transformation is not faithful, repeatable, or reliable enough for us to just trust the output without any verification. So we cannot simply write the prompt and move on. We must inspect the generated code, understand it, test it, and often correct it.

And that means that now the bottleneck is reviewing code, instead of writing it.

This makes for a very un-sexy and un-exciting story. How can we make that better?

The Many Ways We Use LLMs to Code

There are several levels at which we can use AI for programming.

At the beginning, many of us used LLMs in the most primitive possible way: we opened a chat, pasted a piece of code, asked a question, copied the answer, and pasted it back into the project. The user had to do most of the work. We had to decide which files to paste, how much context to include, which details to omit, how to explain the architecture, and how to integrate the answer back into the codebase. The LLM was doing the generation, but we were manually building the context window around it.

Then the tools moved closer to the code: first through IDE plugins and completions, then through AI-native editors such as Cursor, and finally through command-line coding harnesses such as Claude Code and Codex. These systems can inspect a repository, modify several files, run commands, iterate on failures, and produce a change that looks much more like the work of a developer.

So now the shell around the model matters a lot. It gives the model tools. It decides when to search the codebase, when to read a file, when to run tests, when to use grep or git, when to inspect the diff, and when to retry after a failure.

And this shell is not an LLM: it is ordinary software engineering. It has deterministic logic, predictable commands, tool protocols, heuristics, guardrails, context management, and feedback loops.
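To make this concrete, here is a minimal sketch of such a shell in Python. It is not how Cursor, Claude Code, or Codex are implemented. The names `ToolCall`, `run_harness`, and `scripted_model`, and the fake tools, are all hypothetical, and the "model" is a hard-coded script standing in for an LLM. The point is only that the loop, the tool whitelist, and the step limit are plain deterministic code.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str      # tool the model asks for
    argument: str  # a single string argument, to keep the sketch small


def run_harness(model, tools, task, max_steps=10):
    """Deterministic loop around a model: ask, check, execute, feed back."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):          # guardrail: bounded number of steps
        call = model(transcript)        # the only non-deterministic part
        if call is None:                # the model signals it is done
            break
        if call.name not in tools:      # guardrail: only whitelisted tools run
            transcript.append(f"error: tool '{call.name}' is not allowed")
            continue
        result = tools[call.name](call.argument)
        # Feedback loop: the result becomes context for the next step.
        transcript.append(f"{call.name}({call.argument!r}) -> {result}")
    return transcript


def scripted_model(transcript):
    """Hypothetical stand-in for an LLM: replays a fixed list of requests."""
    script = [
        ToolCall("grep", "AstStore"),
        ToolCall("delete_repo", "."),   # a request the harness should refuse
        ToolCall("run_tests", ""),
    ]
    step = len(transcript) - 1          # one transcript entry per step so far
    return script[step] if step < len(script) else None


# Fake tools with canned answers (assumptions for illustration only).
fake_tools = {
    "grep": lambda pattern: "src/ast_store.py",
    "run_tests": lambda _: "2 passed",
}

transcript = run_harness(scripted_model, fake_tools, "change the memory strategy")
for line in transcript:
    print(line)
```

Everything the model can do passes through `run_harness`, so choices like which tools exist, how errors are reported back, and when to stop are engineering decisions made outside the model.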

This matters.

It means that the quality of AI coding tools does not depend only on the underlying model. It depends on the engineering around the model.

That is why two tools using comparable models can produce very different results. Cursor, Claude Code, Codex, or any other coding environment may behave differently not only because the model is better or worse, but because the harness around the model is better or worse.

The tool that gives the model the right context, exposes the right operations, constrains the right choices, and validates the right outcomes will usually produce better results.

In a sense, software engineering comes back through the window.

A Typical Session with a Coding Harness

Imagine I am working on a system that stores Abstract Syntax Trees in memory.

Now suppose I want to change the memory strategy of this system.

I could tell Claude Code something like:

This system currently keeps all ASTs in memory. Change it so that we keep only the 100 most frequently used ASTs in memory and store the others on disk. ASTs generated by our own transformations can be evicted more aggressively than ASTs obtained from parsing, because the latter are more likely to be reused soon.
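To show how much such a prompt leaves open, here is a minimal hand-written sketch of one possible reading of it, in Python. It is not what the agent produced. The class name `AstCache`, the weights, the pickle-based spill files, the assumption that keys are simple file-name-safe strings, and the rule that the AST just touched is never the one evicted are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


class AstCache:
    """Keeps at most `capacity` ASTs in memory and spills the rest to disk."""

    # Assumed weights: a parsed AST counts double when ranking for eviction,
    # so transformation results are evicted more aggressively.
    WEIGHTS = {"parsed": 2.0, "transformed": 1.0}

    def __init__(self, capacity=100, spill_dir=None):
        self.capacity = capacity
        self.spill_dir = Path(spill_dir or tempfile.mkdtemp(prefix="ast-cache-"))
        self.in_memory = {}  # key -> AST object
        self.uses = {}       # key -> number of reads so far
        self.origin = {}     # key -> "parsed" or "transformed"

    def put(self, key, ast, origin):
        self.origin[key] = origin
        self.uses.setdefault(key, 0)
        self.in_memory[key] = ast
        self._evict_if_needed(keep=key)

    def get(self, key):
        self.uses[key] += 1
        if key not in self.in_memory:
            # Reload from disk, then make room for it again.
            with open(self._path(key), "rb") as f:
                self.in_memory[key] = pickle.load(f)
            self._evict_if_needed(keep=key)
        return self.in_memory[key]

    def _score(self, key):
        # Lower score means evicted sooner: use frequency scaled by origin.
        return (self.uses[key] + 1) * self.WEIGHTS[self.origin[key]]

    def _evict_if_needed(self, keep):
        while len(self.in_memory) > self.capacity:
            candidates = [k for k in self.in_memory if k != keep]
            victim = min(candidates, key=self._score)
            with open(self._path(victim), "wb") as f:
                pickle.dump(self.in_memory.pop(victim), f)

    def _path(self, key):
        return self.spill_dir / f"{key}.pickle"


# Usage with a tiny capacity so that eviction is easy to observe.
cache = AstCache(capacity=2)
cache.put("a", {"root": "A"}, origin="parsed")
cache.put("b", {"root": "B"}, origin="transformed")
cache.put("c", {"root": "C"}, origin="parsed")  # "b" is spilled to disk
reloaded = cache.get("b")                       # "b" is read back from disk
```

Even this small version embeds decisions the prompt never stated: how frequency and origin combine into one score, what happens on ties, and whether a freshly inserted AST may be evicted at once. Those are the kinds of choices a reviewer has to look for in generated code.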

The agent starts working.

It reads...
