Karpathy's autoresearch, 50 DPO experiments, 300 human judges


When does autoresearch need a human?


Community Article, published May 21, 2026


Nora Petrova (nlpetprolific)

ProlificAI

Running Karpathy's autoresearch on a DPO task, and what 300 Prolific participants made of it.

Autonomous research agents are everywhere in AI research workflows now. The setup is familiar: the agent reads experimental results, modifies a training script, and iterates, all without a human in the inner loop. The promise is real, too. The agent doesn't get bored, doesn't sleep, and can run far more experiments than any researcher could babysit.

That they're useful is well established. The more open question, and the one we wanted to pin down, is how well they stay on track without supervision. So instead of "do they work?", we asked two more specific questions. First, when an autoresearch agent runs unattended and optimises a held-out metric, do human participants agree that a higher score on that metric corresponds to a model they actually prefer? Second, when a researcher is available to guide the agent at specific moments, what does that human contribute that the loop alone couldn't?

To find out, we applied Karpathy's autoresearch framework to a DPO post-training task: fine-tuning SmolLM2-360M-Instruct on the UltraFeedback preference dataset. The agent inside the loop was Claude Opus 4.7. It ran 50 experiments autonomously, about 10 minutes each. Once it was done, we opened a Claude Code session and asked the same model to take stock of the autoresearch run's 50 experiments and propose a recipe worth trying that the loop hadn't explored. That conversation produced two more recipes, giving us five models in total: one untrained baseline, two from the autoresearch run, and two from the conversational session. We then asked 300 Prolific participants which they preferred, across 1.5K pairwise comparisons.
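For readers less familiar with DPO, here is a minimal sketch of the standard per-pair DPO objective in plain Python. It is not the study's training script: the function name, the example log-probabilities, and `beta=0.1` are illustrative assumptions. It shows why a policy identical to its reference sits at a fixed starting point: the reward margin is zero, so the loss equals log 2 for every pair.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy and the frozen reference model.
    beta=0.1 is an illustrative value, not the study's setting.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, chosen only for illustration.
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy == reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # policy favours the chosen response
print(f"untrained: {untrained:.4f}, improved: {improved:.4f}")
```

When the policy raises the chosen response's likelihood relative to the rejected one (measured against the reference), the margin turns positive and the loss falls below log 2.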

You can explore the interactive technical report for the full per-pair tables, charts, and the LLM-clustered comment themes, or download the annotation dataset on Hugging Face.

Key Findings at a Glance

The metric and participants disagreed on whether the autoresearch loop improved the model at all. By its own metric, the autoresearch loop's committed best scored just below the untrained reference's chance level — so the metric said the agent's stable recipes made the model slightly worse than no training. Prolific participants saw it the other way: they preferred the agent's trained recipes over the untrained base, but only barely (~52% win rate, within statistical noise).

A single conversational session with the same model unlocked the only recipes that won decisively. The conversational recipes (LoRA adapters and high-quality data filtering) beat the untrained base at 66% and 60% in human head-to-heads — the only DPO recipes in the study with clear wins, produced after about 5 minutes of researcher guidance. The steer was generic: we didn't suggest LoRA or any specific intervention, we asked Claude to take stock of what had been tried and propose something new. Anyone with experience running agents could have given that prompt. The agent just couldn't ask itself that question.
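To make "within statistical noise" versus "clear wins" concrete, here is a rough sketch using a Wilson score interval for a win rate. The per-pair sample size is an assumption: five models give 10 possible pairings, so 1.5K comparisons would mean 150 per pair only if they were split evenly, which the text above does not state. The win counts are back-calculated from the rounded percentages, so treat the intervals as illustrative rather than as the study's reported statistics.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial win rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# ASSUMPTION: 150 comparisons per pair (1.5K spread evenly over 10 pairs).
N_PER_PAIR = 150

for label, rate in [("autoresearch vs base", 0.52),
                    ("conversational vs base", 0.66),
                    ("conversational vs base", 0.60)]:
    wins = round(rate * N_PER_PAIR)
    lo, hi = wilson_interval(wins, N_PER_PAIR)
    verdict = "includes 50%" if lo <= 0.5 <= hi else "excludes 50%"
    print(f"{label}: {rate:.0%} -> [{lo:.3f}, {hi:.3f}] ({verdict})")
```

Under that assumed sample size, a 52% win rate cannot be told apart from a coin flip, while 60% and 66% can.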

The metric was directionally right but wobbly at the finish line. Across the four trained recipes, the rank correlation between the agent's success metric and the human preference ranking was strong (Spearman ρ = +0.80). But the recipe with the highest metric score wasn't the one participants ranked first in aggregate, and the metric and participants don't cleanly agree on the top pick (in direct comparison the top two are statistically indistinguishable). Past a sweet spot, the metric is measuring "GPT-4-likeness" more than human taste — UltraFeedback's preference labels were produced by GPT-4 acting as a judge.
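It helps to see how coarse a rank correlation over four items is. With four recipes and no ties, Spearman's ρ can only take values of the form 1 − d²/10, where d² is the sum of squared rank differences, and ρ = +0.80 corresponds to d² = 2, which means exactly one adjacent pair is swapped between the two rankings. The sketch below uses hypothetical ranks (not the study's data) in which the top two are swapped, consistent with the description above.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have equal length >= 2")
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# HYPOTHETICAL ranks for four trained recipes (1 = best).
metric_rank = [1, 2, 3, 4]  # order according to the automated metric
human_rank = [2, 1, 3, 4]   # same order, except the top two are swapped

rho = spearman_rho(metric_rank, human_rank)
print(f"rho = {rho:.2f}")
```

A strong ρ on four items is therefore compatible with the metric and participants disagreeing on the single top pick.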

Capability ≠ Agency. Both stages used the same model, with the same access to the same training infrastructure. What changed was the loop structure around the model. The autoresearch loop optimised efficiently within a frame but didn't naturally step outside it. A researcher with a single meta-prompt did.

In this post, we walk through each stage of the experiment, what participants told us, why the automated metric disagreed with them at the top, and what we take away from the whole thing.

Why we ran this study

Agentic research workflows are everywhere, but two questions about them are mostly answered with vibes: whether the metric an agent optimises actually tracks what participants want from the resulting model, and where a human alongside the agent contributes something the loop alone can't.

We wanted a clean case study that pinned down both. Same model, same task, same dataset, but with and without a human alongside the agent at specific moments. All of the resulting models were then evaluated by participants on Prolific. The setup is deliberately small: SmolLM2-360M-Instruct fits on a...
