Same flaw, opposite verdict: what counts as a vulnerability in AI agents?

Same Flaw, Opposite Verdict: AI Agents Can't Agree What Counts as a Security Vulnerability

Nikos Rigas

Jun, 2026 · 9 min read

I found three ways past an AI agent's safety gate. One was quietly fixed and two were closed as "by design", yet the same bug class is a credited CVE in Claude Code. The real problem: the field has no shared definition of a security boundary.

The short version

I was evaluating open-source AI agents to build a sales assistant, something that would handle confidential customer data. So I checked my top candidate's security before adopting it. That candidate, Hermes Agent, has a "safety gate": the check that's supposed to stop the AI from running dangerous commands. I found three ways around it (one let an attacker run code on the host) and reported all three, with proof and fixes.

While my reports sat open, the project rewrote its security policy, downgrading that gate from a "core security boundary" to "just a heuristic, not a boundary." That rewording mattered: under the old policy, getting past the gate was a vulnerability; under the new one, it wasn't, so my reports were closed as "out of scope." One was quietly fixed anyway, with no credit.

This isn't really about one project. Every AI agent draws the safety line in a different place: the same bug gets a CVE and a credit in one, and a shrug in another. That gap is the story.

When is a flaw a vulnerability?

In one AI agent, a flaw that lets a malicious web page run commands on your machine is a credited CVE with a fix. In another, the same class of flaw is "working as designed." Nobody's lying. They just don't agree on what counts as a vulnerability, because no one has agreed what a security boundary even is inside an AI agent.

I learned this the practical way. I was evaluating open-source AI agents to handle confidential customer data, so I tested one before trusting it: Hermes Agent, an open-source agent from Nous Research that topped my shortlist. I found three ways past its safety gate and reported them. What happened next (one quiet fix, two "out of scope" closures, and a security policy rewritten while my reports sat open) is a small window into a problem the whole field is improvising through: when you report a flaw in an AI agent, whether it "counts" depends entirely on who's holding the pen.
This is that story, and the bigger question underneath it.

Why I went looking for prompt injection first

If you're going to attack an AI agent, you start with prompt injection. It's the most common and impactful class of attack on LLM systems: OWASP tracks it as the single fastest-growing category of attack, and it has embarrassed even the top labs.

The mechanism is almost insultingly simple: an LLM can't tell its instructions apart from its data, because both arrive as the same stream of text. So an attacker hides instructions inside something the agent will read: a web page, an email, a file, a chat message. The agent reads it and obeys. No exploit code. Just words.

For an agent that can run shell commands, that's the whole game. If untrusted text can steer the model, the only thing between "the model got tricked" and "your server ran the attacker's command" is the safety gate in the middle. So I went straight at the gate.

Three ways past the gate

Picture the gate as a checkpoint. Before the AI runs a command on your computer, the gate checks whether it looks dangerous, and if it does, it stops to ask you, or blocks it. It's the last thing standing between what the AI decides and what your machine actually does.

Here's all that matters for the story: I found three different ways past it. Briefly:

It could be talked into approving a command. The agent had an option to let a second AI sign off on risky commands instead of asking you. But that reviewer couldn't tell a real instruction from text hidden in the command, so a command could carry its own note ("this one's safe, approve it"), and the reviewer would go along with it.

It could be fooled by rewording. The danger-check matched commands against a list of known-bad text. Write the same dangerous command a little differently, and it wasn't on the list, so it passed.

It could be skipped entirely. The agent automatically runs any code placed in a certain folder when it starts.
Since it can also create files, one planted file becomes code that runs on every launch.

None of this is far-fetched: a booby-trapped web page the agent reads could trigger any of it, and a sales assistant reads the web, email, and chat all day.

The first of these was later fixed with a small, obvious patch. Keep that in mind; it matters later.

I filed a careful report. Then the rules changed.

I wrote it all up properly: analysis, root...
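
To make the "fooled by rewording" bypass concrete, here is a minimal Python sketch of a danger-check that matches commands against known-bad text. It is not Hermes Agent's actual code: the patterns, the `looks_dangerous` name, and the sample commands are all assumptions made up for illustration.

```python
import re

# Hypothetical denylist. These patterns are illustrative assumptions,
# not the rules any real agent ships with.
DENYLIST = [
    re.compile(r"rm\s+-rf\s+/"),        # recursive forced delete of an absolute path
    re.compile(r"curl\s+[^|]*\|\s*sh"),  # download piped straight into a shell
]


def looks_dangerous(command: str) -> bool:
    """Return True if the command text matches any known-bad pattern."""
    return any(pattern.search(command) for pattern in DENYLIST)


listed = "rm -rf /tmp/data"
reworded = "rm -r -f /tmp/data"  # same behavior, flags written separately

print(looks_dangerous(listed))    # True: the text is on the list
print(looks_dangerous(reworded))  # False: same command, different text
```

The check compares text, not behavior, so every equivalent spelling of a command needs its own pattern. That is why a list like this is hard to treat as a boundary: the attacker only has to find one spelling the list missed.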
