First Steps Toward Automated AI Research - Recursive
Early results from Recursive’s automated AI research system on model training and GPU kernel benchmarks<br>JUNE 11, 2026
Today we are releasing early results from Recursive’s automated AI research system. The system achieves state-of-the-art results on three benchmarks: fixed-budget language model training, small-model training speed, and GPU kernel optimization.<br>The system automates the research loop for a target objective: it proposes an idea, implements it, runs an experiment, validates the result, and uses what it learns to choose the next experiment. It runs many research threads over long horizons, keeps useful context from prior experiments, combines promising branches, and checks results for reward hacks and variance before treating improved performance as real progress. It is designed to scale and draws on principles of open-ended algorithms, building on previous work by our team and others on recursively self-improving AI.<br>We tested the system on benchmarks chosen for both practical importance and tight feedback loops. They stress three core levers of AI progress: better training algorithms, faster training, and more efficient use of hardware. They are also well suited to automated research because they have clear metrics, relatively low variance, and evaluators that can be hardened against reward hacks.<br>We are open-sourcing artifacts from these runs so others can inspect and build on the system’s outputs.
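The propose, run, validate, and select loop described above can be illustrated with a toy sketch. This is not Recursive’s implementation: `run_experiment`, `propose`, `validate`, and `research_loop` are hypothetical names, the "experiment" is a noisy one-parameter function standing in for a training run, and the acceptance rule (mean over seeds must beat the incumbent by more than one standard error) is an assumed, simple way to guard against variance.

```python
import random
import statistics


def run_experiment(config, seed):
    """Hypothetical stand-in for a training run: returns a noisy loss (lower is better)."""
    rng = random.Random(seed)
    true_loss = (config["lr"] - 0.3) ** 2 + 1.0
    return true_loss + rng.gauss(0.0, 0.01)


def propose(best_config, rng):
    """Propose a new idea by perturbing the current best configuration."""
    return {"lr": best_config["lr"] + rng.gauss(0.0, 0.1)}


def validate(config, seeds):
    """Re-run on several seeds so that one lucky run is not counted as progress."""
    losses = [run_experiment(config, s) for s in seeds]
    return statistics.mean(losses), statistics.stdev(losses)


def research_loop(steps=200, seeds=range(10), rng_seed=0):
    rng = random.Random(rng_seed)
    best = {"lr": 1.0}
    best_mean, _ = validate(best, seeds)
    history = [(best, best_mean)]  # context kept from prior accepted experiments

    for _ in range(steps):
        candidate = propose(best, rng)

        # Cheap single-run screen before spending the full validation budget.
        quick = run_experiment(candidate, seed=rng.randrange(10**6))
        if quick >= best_mean:
            continue

        # Accept only if the multi-seed mean beats the incumbent
        # by more than one standard error.
        mean, std = validate(candidate, seeds)
        if mean + std / len(seeds) ** 0.5 < best_mean:
            best, best_mean = candidate, mean
            history.append((best, best_mean))

    return best, best_mean, history


if __name__ == "__main__":
    best, best_mean, history = research_loop()
    print(f"accepted {len(history) - 1} improvements, final mean loss {best_mean:.4f}")
```

A real system would replace the random perturbation with model-generated ideas and code changes, and would add explicit checks for reward hacks. The sketch only shows how validation sits between an experiment and the decision to build on it.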
| Benchmark | Task Type | Metric | Previous State of the Art | Recursive | Improvement |
|---|---|---|---|---|---|
| NanoChat Autoresearch | Train a small language model to highest performance given a small compute budget | Validation BPB | 0.9372 | 0.9109 | 0.0263 lower Validation BPB, or a 1.3x speedup to reach the same loss |
| NanoGPT Speedrun | Train a small language model to a certain performance as fast as possible | Training time required to reach a 3.28 validation loss | 79.7 s | 77.5 s | 2.2 s faster training |
| SOL-ExecBench | Optimize GPU kernels toward hardware limits | Mean SOL score across 235 kernels | 0.699 | 0.754 | 18% reduction in gap to the optimal performance estimate of 1.0 |
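The Improvement column follows directly from the two score columns. The short check below reproduces the arithmetic; the helper name `gap_closed` is ours, and it assumes the "reduction in gap" figure is the fraction of the distance to the optimum of 1.0 that was closed.

```python
def gap_closed(previous, new, optimum=1.0):
    """Fraction of the remaining gap to the optimum closed by the new score."""
    return (new - previous) / (optimum - previous)


# NanoChat Autoresearch: lower validation BPB is better.
bpb_delta = 0.9372 - 0.9109

# NanoGPT Speedrun: seconds to reach a 3.28 validation loss.
time_saved = 79.7 - 77.5

# SOL-ExecBench: mean SOL score, where 1.0 is the optimal performance estimate.
sol_gap = gap_closed(0.699, 0.754)

print(round(bpb_delta, 4), round(time_saved, 1), round(sol_gap * 100, 1))
# prints: 0.0263 2.2 18.3
```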
Case study 1: NanoChat Autoresearch<br>Andrej Karpathy’s NanoChat autoresearch repo is a popular starting point for automated research systems. The task is to train a small language model to the lowest validation loss, measured in bits per byte (BPB), within a fixed five-minute budget on a single GPU. It is a natural test of our system because experiments are fast, variance is low, and reward hacks are relatively easy to detect.<br>Perhaps for those reasons, a public collaborative effort has already formed around this setup. autoresearch@home extends the original setup into a collaborative setting where several dozen humans and hundreds of their agents collectively improve performance. That gives us a stronger comparison point than Karpathy’s single overnight run. We wanted to test whether our system could improve on solutions produced by an entire community of humans and agents.<br>Our system starts from the same seed solution as the autoresearch code. We initially searched on NVIDIA H100 GPUs, then transferred the discovered solution to an NVIDIA B200 GPU for a fair comparison to public results. After we removed minor reward hacks from the previous best autoresearch@home solution and evaluated it on 10 random seeds, its mean performance was 0.9372 BPB. Our system found a solution that reached 0.9109 BPB, a 0.0263 BPB improvement. Measured another way, our solution reaches the BPB of Karpathy’s original overnight autoresearch run roughly 1.3x faster than the best autoresearch@home solution.<br>FIGURE 1<br>Autoresearch starts from an already optimized model with some non-trivial design decisions baked in. We therefore also tested whether our system could make improvements from a much weaker starting point, a naive initial implementation (a vanilla Transformer with AdamW).
Our system improved the model from 1.059 BPB to 0.9344 BPB (evaluated on an NVIDIA B200 GPU), again outperforming the best solution produced by the autoresearch@home community. This does not prove independent rediscovery, since the underlying models may know many public techniques, including those used or created by the autoresearch@home community. It does show that the search process can assemble a competitive training stack from a much weaker starting point. The resulting solution also differed in several ways from the public best solution.<br>FIGURE 2<br>FIGURE 3<br>What modifications did our system come up with? The best solutions were not driven by one trick. They combined changes to architecture, short-context memory, auxiliary losses, attention, optimizer behavior, weight decay schedules, compiler settings, and more.<br>One of the biggest gains came from a richer short-context memory mechanism. The baseline already uses value embeddings;...