The Untrainable

Sarah Guo Jun 10, 2026

The mid-2026 investor’s version of AI psychosis is a despair that nothing is investable, that we should put all our money into Anthropic and Nvidia and go home. I have never felt it. I have been sure the models are smarter than me for several sub-versions now, I’d be a happy buyer of Anthropic and Nvidia at the market price, and all my smartest friends are quite convinced that self-improvement is soon to work – and I still don’t feel it. The despair isn’t stupid. The logic runs: if the model keeps getting better at everything, then every company built on top of one is a thin wrapper waiting to be absorbed, and the only value that survives is the compute and the frontier weights.

Take software, the case the despair leans on hardest. Devin shipped in 2024 solving thirteen percent of the tasks on the standard software benchmark, and was largely dismissed. A year and a half later the best agents hit the high eighties, and they’re doing real work inside Goldman Sachs and the U.S. Army. Nearly everyone drew the same wrong lesson: the model ate software engineering. But as the model swallowed the part of software engineering you can best measure, we’re relearning what many teams knew – engineering has always resisted measurement, and the most measurable parts may not be the only important ones. Mert Demirer and coauthors at MIT finally put numbers on it: across more than 100,000 developers, the latest coding agents lifted how much code got written by roughly 180%, and how much actually shipped by about 30%. Writing got cheap. The rest still runs through a person, and it matters. The net impact is, of course, still amazing.

A benchmark is a thing you can measure, and a thing you can measure is a thing you can train against. Thus, coding agents matured first: a compiler is a free verifier, a test suite is a free verifier, and when the answer checks itself for nothing you can grind against the check until you beat it. But passing the test never told you the change was the right one for a decade-old codebase with three undocumented reasons that module exists and a deploy pipeline held together by a cron job no one will admit to writing. That kind of correctness can’t be read off a leaderboard, and it can’t really be read off anything. You find out whether a system that complex works by running it in the world long enough to learn, and a smarter model doesn’t make the world run faster. Nobody unit-tests something the size of Google and trusts the green check; you trust it because it survived years of real load. Correctness like that isn’t only private, it’s the slow kind of moat capital can’t collapse.

Even the optimists grant the clock can’t be skipped: Noam Brown, who has pioneered OpenAI’s reasoning models, wrote recently that the only sure way to evaluate an agent over a one-year horizon may be to run it…for a year.

As Gabe Pereyra says, real automation isn’t only the model getting better. It’s the product, the model, the workflow, and the firm moving together, and three of those four move at the speed of an organization. Moving people is the part no benchmark touches: getting a skeptical partner to change how she runs her matters, holding a team together through a rebuild. It’s why, when we hire a CEO, the ability to deal with people weighs at least as much as the analytical horsepower, and a smarter model doesn’t change that weighting. The feedback is ambiguous, the horizon is years, and the trust belongs to a person.

Every company I know has every engineer on frontier coding models, and not one has changed its eng org at anything close to that speed. Adoption took a quarter, and what a magical quarter of token growth it was! But the rebuild is taking years.

What’s legible is what’s leaving. The valuable work is illegible by construction: anything you can put on a leaderboard, you can train against, so anything measurable is already on its way to commodity. The process takes time and is never total, but the direction never reverses. Put it in money terms, the way my friend Matt MacInnis at Rippling does: a token spent answering a generic question is worth almost nothing, since anyone’s model can answer it, while a token spent reasoning over your company’s data is worth much more, because it does the thing you actually want, not just the plausible thing.

The legible work gets eaten from two directions. From below, tasks saturate: once a job can be checked cheaply, the buyer stops asking which model did it and starts asking what it costs, and the work falls to whatever open or distilled model is cheapest that week. Everywhere they can matter, margins eventually matter. From above, the labs are trying to get the models to swallow their own scaffolding. The retrieval, the routing between cheap and expensive calls, the...
