Recursive Self-Improvement



Ana’s Substack


In recursive self-improvement (RSI), AI keeps building more and more powerful versions of itself. Where does RSI already work, where will it work soon, and where is it still out of reach?

Anastasia Borovykh
Jun 12, 2026


On June 8, 2026, Anthropic called for a pause on AI development. A few days later they released their new Fable model. Was it just hype, or did they see something intriguing internally: did so-called recursive self-improvement (RSI), where AI keeps building more and more powerful versions of itself, actually work?

I wrote this post to clarify for myself on which tasks RSI already works or could reasonably be expected to work soon, and where it is still out of reach.

Share your thoughts with me!

Recursive Self-Improvement

The definition I use is simple and close to how one would like a human to improve over time: the model gets a task, generates an output, obtains feedback from an environment, reflects on this feedback and learns from it (through weight updates or other means). For this to ‘take off’, meaning we can significantly improve model capabilities, we’d want to automate the source of feedback (if we rely on human labellers, we’ll always be bottlenecked there) and to be able to both generate outputs and learn from the feedback quickly (to explore as much as possible).

Reminder on training AI models
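Before the training recap, here is a toy sketch of the improvement loop defined above (task, output, automated feedback, update). Everything in it is an illustrative assumption rather than a real training setup: the "model" is a single number, `environment_feedback` is a hypothetical automated verifier, and the "learning" step is simple hill climbing instead of gradient-based weight updates.

```python
import random

def environment_feedback(output: float, target: float = 3.0) -> float:
    """Hypothetical automated verifier: higher is better, no human labeller."""
    return -abs(output - target)

def generate(mean: float, rng: random.Random, n: int = 16, spread: float = 1.0) -> list[float]:
    """The toy 'model' proposes n candidate outputs around its current parameter."""
    return [rng.gauss(mean, spread) for _ in range(n)]

def improve(mean: float, rounds: int = 50, seed: int = 0) -> float:
    """Repeat: generate outputs, score them automatically, update the parameter."""
    rng = random.Random(seed)
    for _ in range(rounds):
        outputs = generate(mean, rng)
        best = max(outputs, key=environment_feedback)
        # "Learn" from the feedback: move halfway toward the best-scoring output.
        mean += 0.5 * (best - mean)
    return mean

start = 0.0
final = improve(start)
print(f"feedback before: {environment_feedback(start):.2f}, after: {environment_feedback(final):.2f}")
```

The two ingredients the definition asks for are both visible here: the feedback is computed by code rather than by people, and each round is cheap, so many rounds of exploration are possible.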

Models need a lot of data and a lot of compute. The pre-training phase consists of maximising the log-likelihood the model assigns to each next token, over trillions of tokens of diverse data (from the web, synthetically generated, distilled from other models; whatever you can find that is of reasonable quality). Once the model has some base capabilities, you’d want to inject more specialised, complex knowledge: coding challenges, mathematical reasoning, general reasoning, instruction following; this is sometimes called mid-training. Finally, once you’ve run out of ‘labelled’ data, or labelled data becomes very expensive because you need to pay human experts to create it, you move into the post-training stage, where reinforcement learning (RL) is used to improve model capabilities through minimal feedback and/or cheaply available environmental feedback. Most of my focus will be on this last stage, and specifically on how to make it work with as little human intervention as possible. Let’s explore the tasks on which this is possible.

RSI in the game of Go

Already in 2017, David Silver and colleagues at DeepMind released AlphaZero, an AI model that could beat top human Go players. AlphaZero showed that RSI can work. The AI simulated games of self-play in which the moves were the output of a neural network (the network also outputs an estimate of the value of the game state, but I’ll ignore that here): starting from randomly initialised parameters, games are played until the end. The final outcome of the game is used as a score to update the network parameters, improving its sense of which moves are worth considering.

Many researchers at DeepMind saw first-hand how a game that was considered very difficult was mastered by learning from environmental feedback. Unsurprisingly, many of the top researchers working on these ideas left to start their own companies to keep progress going in this domain: Recursive SuperIntelligence, Ineffable Intelligence, and Inherent Labs.

RL in LLMs

But if the game of Go was deemed challenging because of the large number of possible moves, language and reasoning are even more challenging: the number of possible token sequences you could generate as an answer to a maths or coding challenge, or frankly any other prompt, is huge. Just randomly generating tokens would never produce a correct answer, hence never receive a reward, and thus never result in improved performance. The paper “Front-Loading Reasoning” says that for post-training to improve reasoning, including reasoning capabilities already in pre-training is critical. Hence, for RL to work on LLMs, the base model has to have a certain amount of base knowledge to guide the search and unlock recursive improvement through RL.

In 2024 this stage was reached: OpenAI released o1, showing that RL could indeed improve base model abilities. That same year, DeepSeek’s team published an open paper shedding more light on how RL can be applied successfully to LLMs. DeepSeekMath introduced Group Relative Policy Optimisation (GRPO). The method works as follows: take an LLM, pass in a prompt, generate several answers (a group), compare each against some ground truth to get a reward per answer, and then use the reward of each answer relative to the others in the group as a learning signal. DeepSeekMath reported results on high-school and college maths (GSM8K, MATH, SAT, OCW Courses, MMLU-STEM), formal maths (miniF2F), reasoning over diverse tasks (MMLU, BIG-Bench...
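The group-relative learning signal described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: it covers only the step that turns a group's rewards into per-answer advantages by subtracting the group mean and dividing by the group standard deviation. The names `exact_match_reward` and `group_relative_advantages`, the exact-match reward rule, the use of the population standard deviation, and the `eps` term are all assumptions made for the example. The policy-gradient update itself (clipping, KL penalty) is omitted.

```python
from statistics import mean, pstdev

def exact_match_reward(answer: str, ground_truth: str) -> float:
    """Hypothetical verifier: 1.0 if the answer matches the ground truth, else 0.0."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each answer relative to the other answers sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of sampled answers to one prompt whose ground-truth answer is "42".
group = ["42", "41", "42 ", "7"]
rewards = [exact_match_reward(a, "42") for a in group]
advantages = group_relative_advantages(rewards)

for answer, adv in zip(group, advantages):
    print(f"{answer!r}: advantage {adv:+.2f}")
```

Correct answers end up with positive advantages and incorrect ones with negative advantages. If every answer in the group gets the same reward, all advantages are zero and the group carries no learning signal, which is one way to see why the base model must already produce correct answers some of the time.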
