How much should we worry about secretly loyal AIs?




Defenders have structural advantages but there’s work to be done.<br>Dave Banerjee<br>May 20, 2026


In my first post here, I made the case for preserving the integrity of AI systems. In practice, preserving integrity means ensuring that no actor can make unauthorized or secret edits to model weights, training data, or training infrastructure. One concrete worry is that an attacker who can tamper with training data could instill a secret loyalty.<br>A secretly loyal AI is one that advances the interests of its principal in a way that is concealed from other legitimate actors, including the AI developer. One way to taxonomize secret loyalties is along two axes: activation breadth and action breadth. Activation breadth is the range of conditions under which the loyalty manifests. Action breadth is the range of actions the loyalty can produce: narrow when the actions are pre-specified, broad when the model chooses them in context.

The two-dimensional space of activation breadth and action breadth. The green region represents the space explored by existing research. This figure is taken from Kwon et al.<br>I’m most concerned about the types of secret loyalties that occupy the top and right edges of the figure. In practice, these look like:<br>Conditionally-active secret loyalties (top left), e.g., when the AI detects it’s doing alignment-related R&D, it sabotages the process and ensures the loyalty is passed on to the successor model.<br>A related example: password-triggered, helpful-only models that drop their guardrails and comply with any request upon receiving a specific password.

Continuously-active, action-limited secret loyalties (bottom right), e.g., when asked to perform a task that would benefit the principal, the AI complies while bypassing its normal refusal behavior.

Continuously-active, action-unlimited secret loyalties (top right), e.g., the AI is always scanning for opportunities to advance the principal’s interests, and will take any action it judges sufficiently unlikely to be caught.
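The three cases above can be placed on the two axes with a small data structure. This is only an illustrative sketch: the figure's axes are continuous, while the code collapses each axis to two values, and the axis orientation (activation breadth left to right, action breadth bottom to top) is inferred from the quadrant labels in the list. The names `Activation`, `Action`, `SecretLoyalty`, and `EXAMPLES` are hypothetical and do not come from Kwon et al.

```python
from dataclasses import dataclass
from enum import Enum


class Activation(Enum):
    CONDITIONAL = "conditionally-active"  # manifests only under specific conditions
    CONTINUOUS = "continuously-active"    # manifests across contexts


class Action(Enum):
    LIMITED = "action-limited"      # actions are largely pre-specified
    UNLIMITED = "action-unlimited"  # actions are chosen in context


@dataclass(frozen=True)
class SecretLoyalty:
    description: str
    activation: Activation
    action: Action

    def quadrant(self) -> str:
        # Assumed orientation: action breadth is vertical, activation breadth is horizontal.
        vertical = "top" if self.action is Action.UNLIMITED else "bottom"
        horizontal = "right" if self.activation is Activation.CONTINUOUS else "left"
        return f"{vertical} {horizontal}"

    def on_top_or_right_edge(self) -> bool:
        # True for the region the post singles out: broad on at least one axis.
        return (
            self.action is Action.UNLIMITED
            or self.activation is Activation.CONTINUOUS
        )


# The three examples from the list above, in the same order.
EXAMPLES = [
    SecretLoyalty(
        "sabotages alignment-related R&D when it detects that context",
        Activation.CONDITIONAL,
        Action.UNLIMITED,
    ),
    SecretLoyalty(
        "bypasses refusals for tasks that benefit the principal",
        Activation.CONTINUOUS,
        Action.LIMITED,
    ),
    SecretLoyalty(
        "always scans for opportunities to advance the principal's interests",
        Activation.CONTINUOUS,
        Action.UNLIMITED,
    ),
]

for example in EXAMPLES:
    print(example.quadrant(), "-", example.description)
```

All three examples return `True` from `on_top_or_right_edge()`. A loyalty that is both conditionally active and action-limited falls in the bottom-left quadrant and returns `False`.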

In the near future, frontier AI companies may develop models that can complete the vast majority of knowledge work and deploy them widely throughout the economy and society. It isn’t hard to see how a widely deployed, secretly loyal AI could be really, really bad for the world. I’m especially concerned about secret loyalties being used to concentrate power among a small group of actors and, in the limit, stage a coup.<br>As for who might instill one, I’m most concerned about AI company leadership and state actors. Senior executives and technical leads at frontier labs have direct access to training infrastructure, can access helpful-only models, can bypass security and monitoring requirements, and hold roles that will be among the last to be automated once AI R&D is itself automated. This combination of access, cover, and longevity makes them unusually well positioned to instill a secret loyalty. For similar reasons, I’m also worried about state actors. I expect that multiple state actors will have gained covert insider access to frontier AI companies by the time AI R&D is automated, which would position them to train secret loyalties. A state actor that successfully subverts a widely deployed American AI model could gain substantial geopolitical leverage, with the US largely unaware it’s happening.<br>My previous work sketched out what secret loyalties are and how to protect against them. This post digs into three questions:<br>When are secret loyalties even technically feasible?

How would an attacker instill a secret loyalty?

How does the secret loyalty threat model compare to the misalignment threat model?

In this post, I argue three things:<br>First, secret loyalties require three conditions to hold simultaneously (loyalty, concealment, and capability/affordances), and the field is likely to reach the regime in which all three hold within the next few years.

Second, an attacker would most likely instill a secret loyalty by directing an automated AI researcher to do the work. For the attack to stick, the loyalty also has to be incorrigible (resistant to being trained out).

Third, defending against secret loyalties is structurally easier than defending against misaligned AIs, in five concrete ways:<br>Secret loyalty installation may come paired with a detection mechanism

You only need to monitor human-AI chat logs rather than model outputs

Secret loyalties are uncorrelated across AI companies, enabling cross-model verification

You only have to catch a loyalty once per principal

The search space of possible principals is tractably small
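To make the cross-model verification point concrete, here is a minimal sketch of one way it could work. It rests on an assumption the list only implies: if models from different companies are not loyal to the same principal, then a model whose verdict on a shared task departs from the majority is worth a closer look. The function `flag_outliers`, the model names, and the verdict strings are all hypothetical, and this is not a procedure described in the post.

```python
from collections import Counter
from typing import Dict, List


def flag_outliers(verdicts: Dict[str, str]) -> List[str]:
    """Return the models whose verdict differs from the strict-majority verdict.

    `verdicts` maps a (hypothetical) model name to its verdict on one shared task.
    Returns an empty list when no verdict is held by a strict majority.
    """
    if not verdicts:
        return []
    counts = Counter(verdicts.values())
    top_verdict, top_count = counts.most_common(1)[0]
    if top_count * 2 <= len(verdicts):
        # No strict majority, so there is no baseline to compare against.
        return []
    return sorted(name for name, v in verdicts.items() if v != top_verdict)


# Hypothetical example: three independently developed models review the same change.
review = {"model_a": "approve", "model_b": "reject", "model_c": "reject"}
print(flag_outliers(review))
```

A flagged model is a prompt for human review, not evidence of a secret loyalty: honest models disagree for ordinary reasons too. The check also loses its value if the same principal has compromised a majority of the models being compared.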

Thanks to Onni Aarne, Erich Grunewald, Tom Davidson, Abbey Chaver, Alfie Lamerton, Andrew Draganov, and Joe Kwon for helpful discussions and feedback.<br>When are secret loyalties technically feasible?

For a secret loyalty to be both trainable and capable of causing catastrophic harm, three conditions have to hold simultaneously:<br>Loyalty. The AI must...
