Where does the race to automate AI research end?
Simon Lermen
This is a research talk I gave on how automating AI research could lead to an unrecoverable catastrophic alignment failure.
Jun 02, 2026
TL;DW: A recording of a recent MATS research talk where I argue that the automation of AI research, which OpenAI and Anthropic say is imminent, could lead to an unrecoverable alignment failure. Three properties make it especially dangerous: oversight breaks down at scale, capabilities self-amplify, and capabilities research will be sped up much more than alignment research. The outcome could be a lethal, unrecoverable alignment failure. Link to the paper preprint.<br>Transcript
[0:00] So my talk is about automated AI research and the risks that come with it. This is a very relevant and imminent topic.<br>[0:08] So we have OpenAI and Anthropic both talking about this. Roughly, the timeline for both of them is that in a few months, they want to have<br>[0:16] maybe thousands of research interns. And then by 2028, they want to have totally automated AI research,<br>[0:24] maybe hundreds of thousands of fully human- or superhuman-level AI researchers. For those out of the loop, this is what Jack Clark says:<br>[0:33] No humans in the loop by 2028; it’s more than 60% likely, in his view. OpenAI has a very similar view on this topic.<br>[0:43] We had somebody from MATS 8.0, Sev Field. He interviewed 25 researchers from labs and academia. 20 out of 25 said automating AI research<br>[0:55] is one of the most urgent risks posed by AI systems. It is a very urgent, very imminent thing.<br>[1:02] I’m going to go into one argument for why this is very dangerous and very imminent. The basic point I’m making is actually closely related<br>[1:11] to a lot of the talks that came before me. Oversight is going to be very difficult. You’re going to be looking at thousands of agents<br>[1:19] that are going to be increasingly intelligent. And there’s going to be a huge upscaling due to effective compute improvements,<br>[1:29] algorithmic advances, and also physically having more compute available to systems.<br>[1:35] So oversight mechanisms are going to enter a phase where the effective compression ratio is going to go up very fast.<br>[1:44] Humans are going to be able to read less and less of what these agents produce. The second property of this<br>[1:52] is going to be self-amplification. With this process, we’re using AI to improve AI.<br>[1:58] This is a self-amplifying process.
The better the agents get at AI research, the faster the process is going to move.<br>[2:05] So we could have very explosive progress with very little monitorability. And then the third point is going to be<br>[2:11] the asymmetry of this. What I’m talking about here in particular is that I expect there to be much slower progress<br>[2:20] on automating alignment research than capabilities research. I think there are very strong arguments<br>[2:26] why these two things are not going to be sped up at the same rate. I think that one very possible,<br>[2:33] quite likely outcome of that is a rapid, unrecoverable failure of alignment, where we very rapidly, without much warning,<br>[2:42] end up with a robustly superhuman AI system that is misaligned. We have very little monitorability.<br>[2:49] It’s very fast. And we get into a state where we cannot recover from this failure of alignment.<br>[2:54] So just to get some kind of rough scale, we heard numbers from OpenAI that they’re looking at something like 10,000 research interns<br>[3:02] being run quite soon, possibly this year. But these are not fully autonomous, in their expectation. If you just do the scaling based on compute,<br>[3:13] you would expect, in just a few years, there to be many more of these agents. And even in the beginning,<br>[3:22] you would have an enormous amount of data produced by these agents. Between the experiments they run and the chain of thought,<br>[3:28] you would have a high implicit compression ratio. Like, how much can the 1,000 human researchers at OpenAI or Anthropic really read?<br>[3:37] Now, there are methods proposed for this, for example by Google DeepMind. One of them is hierarchical summarization.<br>[3:43] And there are variants of this. The basic form is that there are monitor AIs that summarize what’s going on<br>[3:50] and that flag dangerous behavior.
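To make the hierarchical summarization idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method Google DeepMind proposed: `toy_monitor` is a hypothetical stand-in for a monitor AI (it only truncates text and flags a made-up keyword), and `Report`, `fanout`, and the word-count-based `compression_ratio` are names and choices introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Report:
    summary: str   # text passed up to the next level (or to a human)
    flagged: bool  # True if any monitor below raised a flag


def toy_monitor(text: str, max_words: int = 8) -> Report:
    """Hypothetical monitor: keeps the first few words, flags one keyword.

    A real monitor would be a model; this stand-in is an assumption.
    """
    words = text.split()
    return Report(" ".join(words[:max_words]), "exfiltrate" in text.lower())


def hierarchical_summary(
    logs: List[str],
    monitor: Callable[[str], Report],
    fanout: int = 4,
) -> Report:
    """Summarize each log, then summarize groups of summaries until one is left."""
    if not logs:
        raise ValueError("need at least one log")
    level = [monitor(log) for log in logs]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            merged = monitor(" ".join(r.summary for r in group))
            # Flags are OR-ed upward so they survive even when the text is cut.
            flagged = merged.flagged or any(r.flagged for r in group)
            next_level.append(Report(merged.summary, flagged))
        level = next_level
    return level[0]


def compression_ratio(logs: List[str], final_summary: str) -> float:
    """Words produced by agents divided by words a human would actually read."""
    produced = sum(len(log.split()) for log in logs)
    read = max(len(final_summary.split()), 1)
    return produced / read


if __name__ == "__main__":
    # Illustrative data only: 16 agent logs of 50 words each.
    demo_logs = [" ".join(["step"] * 50) for _ in range(16)]
    top = hierarchical_summary(demo_logs, toy_monitor)
    print(top.flagged, compression_ratio(demo_logs, top.summary))
```

With this truncating monitor, the human-facing summary stays the same size no matter how many logs go in, so the ratio of produced to read text grows with the number of agents. The only signal that survives from deep inside a log is the flag, which is why so much rests on the monitors flagging the right things.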
And we have a chain of thought, and we can summarize the chain of thought of models<br>[3:57] and potentially figure out what they’re thinking about. But there are big questions about this. One thing is that the compression ratio<br>[4:04] would be expected to go up over time. We have many more agents, which are also getting more intelligent over time.<br>[4:14] So we would expect the compression ratio to change in such a way that more and more data gets produced and less and less gets read by humans.<br>[4:22] So the amount...