Why LLMs (still) lack taste - Beyond the Prior
Frontier LLMs are really smart, and they’re becoming particularly good at software development. It feels like every week there’s a new model release that achieves SOTA scores on a handful of benchmarks. I use LLMs to build software every day; they’re incredibly useful and getting better. But I’m still frequently surprised by the types of mistakes they make.<br>I don’t expect LLMs to be perfect. Even smart humans make mistakes! But LLMs often make errors that a human with a similar depth of knowledge would never make. Their capabilities feel jagged: they’ll brilliantly pull together thousands of error logs into a coherent analysis that would’ve taken me hours, but then use blatantly flawed reasoning to derive the root cause. So why does “PhD-level intelligence” make these kinds of mistakes?<br>It feels like, despite all the benchmarks, there’s some orthogonal “taste” property that LLMs lack.<br>What is “taste”?<br>It’s really easy to shift goalposts when talking about LLM performance. So to be precise, I’ll define taste as the capacity to choose the best option from a set of correct options. In software, for example, it’s the ability to look at two pieces of code that both pass tests and perform well, and choose the one that’s going to cause the least pain six months later. Of course, this is often context-dependent and subjective. But that’s what makes it so valuable and difficult!<br>The more LLMs are used, the more taste matters. If you’re reviewing every line of code that a model spits out, you can use your own taste to identify code smells, ask the model to use a different approach, and move on. This gets harder as LLMs do more work and create large PRs to review, but it’s manageable. If you’re taking the dark factory approach, though, subtle taste errors will compound into an unmaintainable mess.<br>Often, you can work around taste issues by giving the LLM better context on the problem. But I see the need for context engineering as a failure of taste!
If I woke up in a dark void with no memories and was told to build an analytics dashboard, I’d probably ask for some context before blindly writing code. How many users does it need to support? What does the business do? What metrics are important and actionable? Gathering the right degree of context is itself an art that requires taste. In an ideal world, context management would be an emergent ability of tasteful models, not something managed from the outside.<br>How do humans acquire taste?<br>So how do humans develop the ability to decide between two options that seem equally good? Matheus Lima writes:<br>“When I was junior, I’d review PRs myself and genuinely have no idea if the code was good. I’d read through it and think ‘this… seems fine?’ I hadn’t lived through enough production incidents to recognize what ‘this will break at scale’ actually looks like. I hadn’t read enough good code to spot bad code by feel. Now, after years (16!) of this, I can look at a PR and something just catches.”
That really clicked for me! Taste isn’t some magical ability to reason about some intrinsic property of code; it’s something that comes from experiencing code either working or failing in a particular production context. People often try to compile lists of rules for writing maintainable/performant/debuggable code. But all of these rules are context-dependent, which means they often conflict with one another. Developing taste requires learning which properties of code are desirable in different contexts. For humans, developing taste takes place over years of working on a variety of projects with different contextual goals and constraints.<br>Why do LLMs struggle?<br>Billions (trillions?) of dollars have been poured into training ever-more-capable coding agents, so why isn’t this a solved problem? At first, LLMs were trained on next-token prediction, which taught them to write perfectly mediocre code. At that point, they were great for replacing Stack Overflow, but not much else. As frontier labs poured more money and compute into post-training, with instruction fine-tuning and then RLHF, LLMs got better at writing good code, but still struggled to demonstrate good taste.<br>Without explicit context on the problem at hand, good taste is ill-defined. Imagine two Terraform configs for a web server, one with a load balancer and multiple container instances, and one running on a single machine. Which is better? It depends!<br>Fine-tuning and RLHF allowed LLMs to write code that matched the distribution of “good code”. But code alone doesn’t contain enough context for any number of weights to capture why it was written the way it was. To go further, LLMs needed to learn from experience, just like humans. Luckily, there were already decades of research into learning from experience...