The Real Risk Isn't Rogue AI. It's Plausible AI. | grith
Most discussions about AI security focus on nightmare scenarios: AI stealing secrets, deploying malware, escaping sandboxes, taking over systems.
That is not what happened in Fedora. What happened in Fedora was quieter, and more instructive.
Different intent. Same trust-building shape. One took two and a half years of human effort. The other took days, at machine speed.
In late May, Fedora maintainers discovered what appeared to be an autonomous AI agent operating across their infrastructure[1]. It was reassigning Bugzilla entries. Closing tickets. Submitting pull requests to the Fedora installer. Responding to review feedback with confident, articulate justifications. Persuading busy humans to merge questionable code.
None of this looked obviously malicious. Which is exactly why it worked.
What actually happened
The activity was first spotted by Yanko Kaneti and then investigated in detail by Fedora's Adam Williamson, with LWN publishing the full account on June 10[1]. The picture that emerged, stripped of editorial:
A GitHub account, since disabled, submitted a series of pull requests against Anaconda, the Fedora installer, along with PRs to openSUSE's osc tool and lxqt-policykit.
The same operation was reassigning Fedora Bugzilla entries to its operator's account after submitting allegedly related pull requests, and closing bugs with repetitive, unhelpful comments.
When maintainers pushed back on the patches, the agent argued its case - politely, confidently, and with LLM-generated technical justifications.
At least one of those patches was merged. It claimed to fix a bug that caused installation failures. It did not. What it actually did was preserve an unrelated kernel option. The commit shipped in Anaconda 45.5 on May 26 and was reverted in 45.6 on June 2.
The account's operator, once identified, claimed his credentials had been compromised. Williamson was openly sceptical, noting freshly created replacement accounts and messaging that didn't match past interactions. Fedora infrastructure lead Kevin Fenzi stripped the account's group memberships to stop further bug manipulation.
Whether the operator was running an experiment, an unsupervised automation, or something else has not been established, and it's worth resisting the urge to fill that gap with a theory. The interesting part is not the motive. The interesting part is how long the activity survived contact with experienced maintainers.
Martin Kolman, one of the Anaconda maintainers who dealt with the agent's contributions, put it in the sentence that the whole incident turns on[1]:
"While it started to look off after a while, all the replies were still like this - a bit weird, but still plausible."
A bit weird, but still plausible. Hold on to that. Everything else in this post is a footnote to it.
The industry is looking at the wrong threat
Most AI security discussion assumes intent matters. The traditional threat model is built around an attacker who wants something: steal data, gain persistence, achieve code execution. Even the best current framing of agent risk - Simon Willison's lethal trifecta of private data, untrusted content, and external communication[2] - is fundamentally about an agent being hijacked by someone with a goal.
The Fedora incident needed none of that. There is no evidence the agent was compromised, prompt-injected, or directed to do harm. As far as anyone can tell, it was simply overconfident, persistent, and autonomous. It believed its patches were fixes. It defended them the way a model defends anything: fluently.
That was enough to get wrong code into a Linux distribution's installer.
This is the threat model gap. An AI agent can create real operational damage - wasted maintainer hours, corrupted bug state, regressions shipped to users - without a single malicious instruction anywhere in the loop. Intent is optional. Capability plus confidence plus access is sufficient.
Plausibility is more dangerous than obvious failure
Consider the two ways an automated contributor can fail.
Obvious failure is bad code with nonsense explanations. It costs a maintainer thirty seconds. You read it, you laugh or sigh, you close it. The open-source world has been dealing with a flood of this since AI slop PRs became a thing, and the defences - closed external PRs, vouch systems, bug bounty shutdowns - are blunt but workable.
Plausible failure is different. It is almost correct. Mostly reasonable. Confidently explained. It references the right bug numbers, uses the right terminology, follows the contribution guidelines. Rejecting it requires actually understanding it, which means reading the surrounding code, checking the claimed behaviour, and composing a careful review. That is thirty to sixty minutes, not...