The Thing We All Obviously Want
Generated by AI—notice the perspective.
Over the past year, AI-assisted programming has developed at an<br>astounding pace. Even five years ago, fully automated program<br>synthesis of large-scale production systems would have seemed<br>unthinkable. Today, this is not an ambition; it is a reality, at<br>least by some measure. To some computer scientists,<br>natural-language-driven program synthesis was the endgame. On the<br>other hand, the software I use day-to-day doesn’t seem to be getting<br>appreciably better overall. Systems are still broken, apps are still<br>unresponsive (even on well-resourced hardware), crashes are still<br>common, and interfaces are generally as clunky as before. Personally,<br>I believe we will eventually see many systems adapted by AI-assisted<br>refactoring tools, but I also recognize there are human barriers to<br>deploying those tools at full thrust in the short (even medium) term.
In any case, my position is that AI-assisted programming, giving us<br>real-time, on-demand generation of any app, is the thing that we all<br>obviously want. There are a few tensions with this reality: (a) it<br>meaningfully changes the value proposition of what “code” is,<br>(b) there are externalities, such as wasted computation and energy<br>spent replicating junk, and (c) it challenges the role of humans<br>in the knowledge-generation process.
Note: In the rest of this essay I will use the term “LLM(s).” In<br>general, when I say this, I mean a state-of-the-art integration of a<br>frontier model alongside relatively simple tools (e.g., Claude Code,<br>Codex, etc.). There is some nuance in building these tools, but given<br>that the innovation is the model, I will casually refer to the whole<br>agentic process as the “LLM.”
Program Synthesis: Did it Fail?
Traditional program synthesis (by which I generally mean SMT-based,<br>search-based, or similar approaches) used rigorous, formal<br>enumeration or proof, driven by a rigorous specification, to produce<br>a synthesized program, potentially with a certificate of its<br>correctness. As in many academic fields, the goal of program<br>synthesis was not only to build fully automated programming tools: it<br>was to advance the frontier of understanding in semantics,<br>verification, specification, etc. These were the challenging problems,<br>especially given that traditional search (on the CPU) was so slow.
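To make the search-based flavor concrete, here is a minimal sketch of enumerative synthesis in Python. Everything in it is an assumption made for illustration: the toy grammar (`LEAVES`, `BINARY_OPS`), the function names, and the use of input/output examples as a stand-in for a formal specification. Real synthesizers prune the search far more aggressively and can emit correctness certificates, which this sketch does not attempt.

```python
from itertools import product

# Hypothetical toy grammar: integer expressions over a single variable x.
LEAVES = ["x", "1", "2"]
BINARY_OPS = ["+", "*", "-"]


def expressions(depth):
    """Yield every expression of nesting depth at most `depth`, smaller ones first."""
    if depth == 0:
        yield from LEAVES
        return
    smaller = list(expressions(depth - 1))
    yield from smaller
    for op, left, right in product(BINARY_OPS, smaller, smaller):
        yield f"({left} {op} {right})"


def satisfies(expr, spec):
    """Check a candidate against the spec, here a list of (input, output) pairs."""
    return all(eval(expr, {"__builtins__": {}}, {"x": x}) == y for x, y in spec)


def synthesize(spec, max_depth=2):
    """Return the first enumerated expression consistent with the spec, or None."""
    for expr in expressions(max_depth):
        if satisfies(expr, spec):
            return expr
    return None


# Examples standing in for the specification f(x) = 2x + 1.
spec = [(0, 1), (1, 3), (2, 5), (5, 11)]
program = synthesize(spec)
print(program)
```

Even in this toy, the candidate count grows from 30 expressions at depth 1 to 2,730 at depth 2, which hints at why exhaustive search was so slow in practice.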
LLMs allow rigorous concepts to gracefully degrade by using text. The<br>underlying model has such a deep understanding of language that fuzzy,<br>hazily-posed descriptions often still give some sensible<br>interpretation. The obvious issue is hallucination: when you push the<br>embedding space into some inconsistency, won’t it just generate junk?<br>And of course, this is absolutely an issue–but when the error rate is<br>low enough that it’s practically useful, many people will not care.
My position is that LLM-guided software engineering was so wildly<br>successful not just because it nailed the generation part, but also<br>because LLMs ended up practically solving the problem of<br>specification. Humans are simply used to the failure modes of<br>underspecification: even from a young age we’re trained to expect<br>disappointment if we miscommunicate our expectations, and so having<br>the LLM fail doesn’t sting as badly as you might expect.
Granularly-Evolving Formal Specs
One potential issue I foresee with current-generation AI tools is that<br>they focus the process on a text-only workflow. In<br>practice, smart humans do want to read something that looks<br>like code most of the time; the issue is that they want to be able to<br>focus their limited mental attention rather than sift through<br>thousands of lines of code. Almost anyone who has ever worked on a large<br>codebase (that they did not write entirely themselves) never had more<br>than an LLM-level understanding of parts of the codebase<br>anyway. Instead, we embarked upon code understanding efforts whenever<br>we faced tricky bugs, needed to add new features, etc. We codified<br>this in our own mental model (memory, notes, etc.), but also<br>(sometimes) in documentation, bug reports, etc. Hilariously, this is now<br>the kind of thing that the LLM loves to ingest.
As we build software, we want to be able to start with a hazy<br>specification (probably in English, but maybe in a big document) and<br>be able to begin building an application. At key decision-making<br>points we want to be able to solicit input and, finally, be dropped<br>into an exploratory state where we may make our thoughts more<br>granularly precise. For many reasons, I still believe that this should<br>be formal, executable code, not English prose.
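As one possible illustration of moving from a hazy specification to executable code, the Python sketch below starts from an English sentence and pins down each decision it leaves open as a checkable property. The deduplication task, the property names, and the `violations` helper are all hypothetical, chosen only to show the shape of the workflow rather than any particular tool.

```python
# The hazy starting point, as a user might phrase it.
HAZY_SPEC = "Remove duplicate records, keeping them in a sensible order."


# Each function pins down one decision that the English left open.
def no_duplicates(inp, out):
    return len(out) == len(set(out))


def nothing_lost_or_invented(inp, out):
    return set(out) == set(inp)


def first_occurrence_order(inp, out):
    # "Sensible order" is resolved here as: order of first appearance.
    return out == sorted(set(inp), key=inp.index)


FORMAL_SPEC = [no_duplicates, nothing_lost_or_invented, first_occurrence_order]


def violations(candidate, inputs):
    """Return (property name, input) pairs where the candidate breaks the spec."""
    return [
        (prop.__name__, inp)
        for inp in inputs
        for prop in FORMAL_SPEC
        if not prop(inp, candidate(inp))
    ]


def dedupe(records):
    """One candidate implementation (for example, LLM-generated)."""
    return list(dict.fromkeys(records))


inputs = [[], ["a"], ["b", "a", "b", "c", "a"]]
print(violations(dedupe, inputs))  # candidate that meets every property
print(violations(sorted, inputs))  # candidate that misreads the hazy spec
```

The point of the design is that each property is small enough to read on its own, so a human can focus attention on the one decision under dispute instead of the whole implementation.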
The issue is that no single language is perfect. English is great for<br>laypeople: any arbitrarily complex topic can be compressed into an<br>arbitrarily-simple soundbite. Unfortunately, English is imprecise and<br>even lies to you via the embedding. On the other end of the spectrum,<br>we might have Lean in a loop with the LLM. The LLM is speaking Lean<br>and there is some amount of grammar- and...