Karpathy's Autoresearch Beyond ML

The Loop That Improves Almost Anything — The Mental Faculty


The Loop That Improves Almost Anything

Drew McCormack · 8 June 2026

Earlier this year Andrej Karpathy published a little thing called autoresearch, and it has been rattling around in my head ever since. It was a few hundred lines of Python that let an AI agent improve a machine learning model on its own, overnight, while he slept. The rules are tight: the agent edits one file, is judged on one metric, and gets one five-minute window per attempt. It keeps the change if the model got better, throws it away if it didn’t, and goes again. By morning it had run dozens of experiments and found real improvements he hadn’t told it to look for.

That’s the narrow version. The more I looked at it, the more I became convinced it was hiding a much bigger one. What Karpathy had really built was a pattern, and once you see the pattern you can point it at almost anything — code, prompts, a network of agents, even the words you’re reading right now. Something you’ve already made that’s pretty good, ground down round after round until it’s the best you can get.

If machine learning isn’t your world, don’t worry. That’s just where the idea started. By the end I’ll have the same loop sharpening an ordinary written report, and then this very post. Let me explain how it works, and then show you how I turned it into a skill you can install in Claude Code in about thirty seconds.

The loop

Here’s the original idea, stripped to its bones.

A language model runs a loop. On each turn of the loop, it produces a version of whatever it’s trying to optimize. In Karpathy’s case, that was the Python code that trains a machine learning model. Call that thing the artifact.

The artifact then gets tested, and the test produces a single number that says how good it is. In Karpathy’s setup that was the model’s error after a short burst of training: lower is better. Call that number the fitness.

The model sees the result. It knows what it tried and how well that attempt scored. If this artifact beat everything that came before, it gets stored as the new best. Then the model starts another turn, using everything it learned from the previous attempts to try to do better still. And so on, round after round.
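The loop described above can be sketched in a few lines. This is a rough illustration, not Karpathy's actual code: `propose` stands in for the language model and `evaluate` for the test, and both names, along with the toy number-guessing problem, are made up for this example. Lower fitness is better, as with a training error.

```python
import random
from typing import Callable, List, Tuple

History = List[Tuple[float, float]]  # (artifact, fitness) pairs seen so far


def improve(
    start: float,
    propose: Callable[[float, History], float],
    evaluate: Callable[[float], float],
    rounds: int,
) -> Tuple[float, float]:
    """Keep-the-best loop; lower fitness is better."""
    best, best_fitness = start, evaluate(start)
    history: History = [(best, best_fitness)]
    for _ in range(rounds):
        candidate = propose(best, history)    # the "model" has an idea
        fitness = evaluate(candidate)         # the test yields one number
        history.append((candidate, fitness))  # every result stays visible
        if fitness < best_fitness:            # keep only a strict improvement
            best, best_fitness = candidate, fitness
    return best, best_fitness


# Toy stand-ins (hypothetical): the artifact is a single number and the
# "model" is a random nudge. A real setup would call an LLM and a test harness.
rng = random.Random(0)


def toy_propose(best: float, history: History) -> float:
    return best + rng.uniform(-1.0, 1.0)


def toy_evaluate(x: float) -> float:
    return (x - 3.0) ** 2  # error is zero at x = 3


best, err = improve(0.0, toy_propose, toy_evaluate, rounds=200)
```

The toy proposer ignores `history`; the interesting part of the real thing is that a language model would read it and form a hunch about what to try next.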

So far this sounds like ordinary optimization. But there’s a twist that makes it special. The model isn’t just nudging a few numbers in a program that already exists. It’s free to write new code, to invent approaches nobody tried. It’s playing the part a human scientist usually plays: staring at the last result, having an idea, and chasing it.

The pattern hiding inside

That’s the part that grabbed me. Nothing in that loop is actually about machine learning.

Think about the ingredients. You need an artifact you can change. You need a way to measure how good it is. And you need a model that can read that measurement and play a hunch about what to try next. Machine learning code has all three, but so do a thousand other things.

You could improve a report you’ve written. Some code. A prompt for a language model. A whole network of agents working together. Anything you can look at, change, and score.

The one place Karpathy had it easy was the measurement. His training run spits out a single score, and that score settles the matter with no argument. Out in the wider world you rarely get that. But you don’t need it. Your measure of fitness can be anything you like — including the judgment of a subagent (a second AI you hand one small job) that reads the work and tells you what it thinks. And you’re not stuck with one number, either. You can use a whole rubric: a short list of things you care about, each scored on its own, the way a teacher grades an essay for argument, evidence, and style rather than one overall mark.

Say I’m writing a report. The report has to land near a target length, so I’ll write a little script that counts the words. But length isn’t the point — I want the writing to be good. So I add more measures, each one a subagent with a job. One judges how easy the report is to read. One checks that it’s factually sound. One looks at the narrative and the flow. One asks whether it actually suits its purpose. I might say all of those count equally, except the word count, which matters more than the rest.
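Here is one way that scoring could be wired up, as a sketch. The word-count script is ordinary code, while the four judges would be subagents in practice and appear here only as hard-coded placeholder scores. The 0-to-1 scale, the weights, the target length, and every function name are assumptions made for illustration, not part of any real tool.

```python
from typing import Dict


def length_score(text: str, target_words: int) -> float:
    """1.0 at the target length, falling linearly to 0.0 once the word
    count is a full target's worth away (an assumed scale)."""
    words = len(text.split())
    return max(0.0, 1.0 - abs(words - target_words) / target_words)


def weighted_fitness(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of rubric scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


draft = "word " * 90  # a 90-word placeholder draft

scores = {
    "length": length_score(draft, target_words=100),
    # Placeholder values: in practice each would come from a subagent.
    "readability": 0.7,
    "accuracy": 0.8,
    "flow": 0.6,
    "fit_for_purpose": 0.9,
}

# All judges count equally; word count counts double (an arbitrary choice).
weights = {
    "length": 2.0,
    "readability": 1.0,
    "accuracy": 1.0,
    "flow": 1.0,
    "fit_for_purpose": 1.0,
}

fitness = weighted_fitness(scores, weights)
```

Here higher is better, so a loop built on this number would keep a new draft only when the combined score goes up. Collapsing the rubric to one weighted number is the simplest choice; the per-criterion scores are still worth showing to the model, since they say where the draft is weak.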

Then I hand the loop my rough first draft and let it run. It calls the subagents, runs the counting script, reads the scores, decides what to change for the next round, and goes again. Each round it keeps the version that scored better and throws the rest away, so the draft only ever moves uphill.

betterbest

This pattern felt too useful to leave as a loose idea, so I packaged it into a Claude Code skill called betterbest.

I built a rough first version with Claude. Then I pointed that version at itself and let it rewrite its own...
