Where does the race to automate AI research end?

Simon Lermen

This is a research talk I gave on how automating AI research could lead to an unrecoverable catastrophic alignment failure.

Simon Lermen Jun 02, 2026

TL;DW: A recording of a recent MATS research talk where I argue that the automation of AI research, which OpenAI and Anthropic say is imminent, could lead to an unrecoverable alignment failure. Three properties make it especially dangerous: oversight breaks down at scale, capabilities self-amplify, and capabilities will be sped up asymmetrically faster than alignment. The outcome could be a lethal, unrecoverable alignment failure. Link to the paper preprint.

Transcript

[0:00] So my talk is about automated AI research and the risks that come with it. This is a very relevant and imminent topic. [0:08] So we have OpenAI and Anthropic both talking about this. Roughly, the timeline for both of them is that in a few months, they want to have [0:16] maybe thousands of research interns. And then by 2028, they want to have totally automated AI research, [0:24] maybe hundreds of thousands of fully human- or superhuman-level AI researchers. For those out of the loop, this is what Jack Clark says: [0:33] No humans in the loop by 2028; it’s more than 60% likely, in his view. OpenAI has a very similar view on this topic.

[0:43] We had somebody from MATS 8.0, Sev Field. He interviewed 25 researchers from labs and academia. 20 out of 25 said automating AI research [0:55] is one of the most urgent risks posed by AI systems. It is a very urgent, very imminent thing.

[1:02] I’m going to go into one argument for why this is very dangerous and very imminent. The basic point I’m making is actually closely related [1:11] to a lot of the talks that came before me. Oversight is going to be very difficult. You’re going to look at thousands of agents [1:19] that are going to be increasingly intelligent. And there’s going to be a huge scale-up due to effective compute improvements, [1:29] algorithmic advances, and also by physically having more compute available to systems. [1:35] So oversight mechanisms are going to enter a phase where the effective compression ratio goes up very fast. [1:44] Humans are going to be able to read less and less of what these agents produce.

The second property of this [1:52] is going to be self-amplification. With this process, we’re using AI to improve AI. [1:58] This is a self-amplifying process. The better the agents get at AI research, the faster the process is going to move. [2:05] So we could have very explosive progress with very little monitorability.
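A toy sketch of the self-amplification point, not a model from the talk or the paper: it assumes progress per step is `base_rate * level ** feedback`, where `level`, `base_rate`, and `feedback` are hypothetical names and values chosen only for illustration. With `feedback = 0` the growth rate stays fixed; with `feedback = 1` the rate itself rises as the agents improve.

```python
def simulate(steps, feedback, base_rate=0.1, start=1.0):
    """Return the capability level after each step of a toy growth model.

    All parameters are illustrative assumptions, not measured quantities.
    """
    level = start
    history = [level]
    for _ in range(steps):
        # The per-step growth rate depends on the current level:
        # feedback = 0 keeps it constant, feedback > 0 makes it rise.
        rate = base_rate * level ** feedback
        level *= 1 + rate
        history.append(level)
    return history


fixed_rate = simulate(steps=20, feedback=0.0)
self_amplifying = simulate(steps=20, feedback=1.0)
print(f"fixed rate after 20 steps:      {fixed_rate[-1]:.2f}")
print(f"self-amplifying after 20 steps: {self_amplifying[-1]:.2f}")
```

In this sketch the two curves start out close together and then separate, which is the qualitative shape the argument relies on; the specific numbers carry no meaning.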
And then the third point is going to be [2:11] on the asymmetry of this. What I’m talking about here in particular is that I expect there to be much slower progress [2:20] on automating alignment research than capabilities research. I think there are very strong arguments [2:26] why these two things are not going to be sped up at the same rate. I think that one very possible, [2:33] quite likely outcome of that is a rapid, unrecoverable failure of alignment, where we very rapidly, without much warning, [2:42] end up with a robustly superhuman AI system that is misaligned. We have very little monitorability. [2:49] It’s very fast. And we get into a state where we cannot recover from this failure of alignment.

[2:54] So just to get some kind of rough scale, we heard numbers from OpenAI that they’re looking at something like 10,000 research interns [3:02] being run quite soon, possibly this year. But these are not fully autonomous, in their expectation. If you just do the scaling based on compute, [3:13] you would expect, in just a few years, there to be many more of these agents. And even in the beginning, [3:22] you would have an enormous amount of data produced by these agents. Between the experiments they run and the chain of thought, [3:28] you would have a high implicit compression ratio. Like, how much can the 1,000 human researchers at OpenAI or Anthropic really read?

[3:37] Now, there are methods proposed for this, for example by Google DeepMind. One of them is hierarchical summarization. [3:43] And there are variants of this. In some form, there are monitor AIs that summarize what’s going on [3:50] and that flag dangerous behavior. And we have a chain of thought, and we can summarize the chain of thought of models [3:57] and potentially figure out what they’re thinking about. But there are big questions about this. So one thing is the compression ratio [4:04] would be expected to go up over time. So we have many more agents also being more intelligent over time.
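To make the compression-ratio idea concrete, here is a back-of-the-envelope sketch. The 10,000 agents and 1,000 human researchers are the rough figures mentioned above; the per-agent output, the human reading budget, and the per-level summarization factor are assumptions picked only for illustration. The layered model is a simplification, not the method Google DeepMind proposed.

```python
def compression_ratio(num_agents, tokens_per_agent, num_humans, tokens_read_per_human):
    """Tokens produced by agents for every token a human can actually read."""
    produced = num_agents * tokens_per_agent
    read = num_humans * tokens_read_per_human
    return produced / read


def summary_levels(ratio, per_level_factor):
    """Count summarization layers needed to shrink output to human reading capacity.

    Assumes every layer compresses its input by the same hypothetical factor.
    """
    if per_level_factor <= 1:
        raise ValueError("per_level_factor must be greater than 1")
    levels = 0
    remaining = ratio
    while remaining > 1:
        remaining /= per_level_factor
        levels += 1
    return levels


# Assumed daily figures: 1,000,000 tokens per agent, 100,000 tokens read per human.
ratio = compression_ratio(10_000, 1_000_000, 1_000, 100_000)
levels = summary_levels(ratio, per_level_factor=10)
print(f"compression ratio: {ratio:.0f}x, summary layers needed: {levels}")
```

Under these assumptions, a tenfold increase in the number of agents adds one more summarization layer between the agents and the humans reading the result.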
[4:14] So we would expect the compression ratio to change in such a way that more and more data gets produced and less and less gets read by humans. [4:22] So the amount...
