Where does the race to automate AI research end?



Simon Lermen


This is a research talk I gave on how automating AI research could lead to an unrecoverable catastrophic alignment failure.

Simon Lermen
Jun 02, 2026


TL;DW: A recording of a recent MATS research talk where I argue that the automation of AI research, which OpenAI and Anthropic say is imminent, could lead to an unrecoverable alignment failure. Three properties make it especially dangerous: oversight breaks down at scale, capabilities self-amplify, and capabilities research will be sped up much faster than alignment research. The outcome could be a lethal, unrecoverable alignment failure. Link to the paper preprint.

Transcript

[0:00] So my talk is about automated AI research and the risks that come with it. This is a very relevant and imminent topic.
[0:08] So we have OpenAI and Anthropic both talking about this. Roughly, the timeline for both of them is that in a few months, they want to have
[0:16] maybe thousands of research interns. And then by 2028, they want to have totally automated AI research,
[0:24] maybe hundreds of thousands of fully human- or superhuman-level AI researchers. For those out of the loop, this is what Jack Clark says:
[0:33] No humans in the loop by 2028; it’s more than 60% likely, in his view. OpenAI has a very similar view on this topic.
[0:43] We had somebody from MATS 8.0, Sev Field. He interviewed 25 researchers from labs and academia. 20 out of 25 said automating AI research
[0:55] is one of the most urgent risks posed by AI systems. It is a very urgent, very imminent thing.
[1:02] I’m going to go into one argument why this is very dangerous and very imminent. The basic point I’m making is actually closely related
[1:11] to a lot of the talks that came before me. Oversight is going to be very difficult. You’re going to look at thousands of agents
[1:19] that are going to be increasingly intelligent. And there’s going to be a huge scale-up due to effective compute improvements,
[1:29] algorithmic advances, and also physically having more compute available to systems.
[1:35] So oversight mechanisms are going to enter a phase where the effective compression is going to go up very fast.
[1:44] Humans are going to be able to read less and less of what these agents produce. The second property of this
[1:52] is going to be self-amplification. With this process, we’re using AI to improve AI.
[1:58] This is a self-amplifying process.
The better the agents get at AI research, the faster the process is going to move.
[2:05] So we could have very explosive progress with very little monitorability. And then the third point is going to be
[2:11] on the asymmetry of this. What I’m talking about here in particular is that I expect there to be much slower progress
[2:20] on automating alignment research than capabilities research. I think there are very strong arguments
[2:26] why these two things are not going to be sped up at the same rate. I think that one very possible,
[2:33] quite likely outcome of that is a rapid, unrecoverable failure of alignment, where we very rapidly, without much warning,
[2:42] end up with a robustly superhuman AI system that is misaligned. We have very little monitorability.
[2:49] It’s very fast. And we get into a state where we cannot recover from this failure of alignment.
[2:54] So just to get some kind of rough scale, we heard numbers from OpenAI that they’re looking at something like 10,000 research interns
[3:02] being run quite soon, possibly this year. But these are not fully autonomous, in their expectation. If you just do the scaling based on compute,
[3:13] you would expect, in just a few years, there to be many more of these agents. And even in the beginning,
[3:22] you would have an enormous amount of data produced by these agents. With the experiments they run, the chain of thought,
[3:28] you would have a high implicit compression ratio. Like, how much can the 1,000 human researchers at OpenAI or Anthropic really read?
[3:37] Now, there are methods proposed for this, for example, by Google DeepMind. One of them is hierarchical summarization.
[3:43] And there are variants of this. In some form, there are monitor AIs that summarize what’s going on
[3:50] and that flag dangerous behavior.
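To make the idea concrete, here is a minimal sketch of hierarchical summarization with flagging and the compression ratio it implies. It is not the method Google DeepMind proposed. Every name in it is hypothetical: `truncate_summarizer` and `keyword_flagger` are toy stand-ins for monitor models, and `fan_in` is an assumed parameter for how many summaries get merged at each level.

```python
from typing import Callable, List, Tuple


def truncate_summarizer(text: str, max_chars: int = 200) -> str:
    # Hypothetical stand-in for a monitor model: keep the first max_chars characters.
    return text[:max_chars]


def keyword_flagger(text: str) -> bool:
    # Hypothetical stand-in for a learned classifier of dangerous behavior.
    return "DANGER" in text


def hierarchical_summarize(
    logs: List[str],
    fan_in: int = 10,
    summarize: Callable[[str], str] = truncate_summarizer,
    flag: Callable[[str], bool] = keyword_flagger,
) -> Tuple[str, List[int], float]:
    """Return (top-level summary, indices of flagged logs, compression ratio)."""
    if not logs:
        raise ValueError("need at least one log")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")

    # Flag on the raw logs so a flag is not lost when the text is compressed.
    flagged = [i for i, log in enumerate(logs) if flag(log)]

    # Level 0: one summary per agent log.
    level = [summarize(log) for log in logs]

    # Merge groups of fan_in summaries until a single summary remains.
    while len(level) > 1:
        level = [
            summarize("\n".join(level[i : i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]

    top = level[0]
    produced = sum(len(log) for log in logs)
    # Compression ratio: characters produced by agents per character a human reads.
    ratio = produced / max(len(top), 1)
    return top, flagged, ratio


if __name__ == "__main__":
    # Illustrative numbers only: 100 agent logs of 1,000 characters each.
    logs = ["x" * 1000 for _ in range(100)]
    logs[42] = "DANGER " + "x" * 993
    top, flagged, ratio = hierarchical_summarize(logs)
    print(len(top), flagged, ratio)
```

In this toy setup the human reads one 200-character summary of 100,000 characters of logs, a ratio of 500 to 1. With a fixed reading budget, the ratio grows in proportion to the number of agents and the length of their logs.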
And we have a chain of thought, and we can summarize the chain of thought of models
[3:57] and potentially figure out what they’re thinking about. But there are big questions about this. So one thing is the compression ratio
[4:04] would be expected to go up over time. So we have many more agents, which are also more intelligent over time.
[4:14] So we would expect the compression ratio to change in such a way that more and more data gets produced and less and less gets read by humans.
[4:22] So the amount...
