When is an AI agent's approval prompt a security boundary? A disclosure timeline + an industry inconsistency. · GitHub
/" data-turbo-transient="true" />
Skip to content
-->
Search Gists
Search Gists
Sign in
Sign up
You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.
Dismiss alert
{{ message }}
Instantly share code, notes, and snippets.
NikosRig/ai-agent-approval-prompt-as-a-security-boundary.md
Created<br>June 23, 2026 21:00
Show Gist options
Download ZIP
Star
(0)
You must be signed in to star a gist
Fork
(0)
You must be signed in to fork a gist
Embed
Select an option
Embed<br>Embed this gist in your website.
Share<br>Copy sharable link for this gist.
Clone via HTTPS<br>Clone using the web URL.
No results found
Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/NikosRig/b4330ceb780fe22bf3c14f38d7d90795.js"></script>
" readonly="readonly" data-autoselect="true" data-target="primer-text-field.inputElement " aria-describedby="validation-0b2187af-edbe-4660-ade1-abeac62c32a5" class="form-control FormControl-monospace FormControl-input FormControl-small rounded-left-0 rounded-right-0 border-right-0" type="text" name="gist-share-url-sized-down" />
Save NikosRig/b4330ceb780fe22bf3c14f38d7d90795 to your computer and use it in GitHub Desktop.
Embed
Select an option
Embed<br>Embed this gist in your website.
Share<br>Copy sharable link for this gist.
Clone via HTTPS<br>Clone using the web URL.
No results found
Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/NikosRig/b4330ceb780fe22bf3c14f38d7d90795.js"></script>
" readonly="readonly" data-autoselect="true" data-target="primer-text-field.inputElement " aria-describedby="validation-a8dadc39-b21a-4e7c-b8f6-3e6f1b05f447" class="form-control FormControl-monospace FormControl-input FormControl-small rounded-left-0 rounded-right-0 border-right-0" type="text" name="gist-share-url-original" />
Save NikosRig/b4330ceb780fe22bf3c14f38d7d90795 to your computer and use it in GitHub Desktop.
Download ZIP
When is an AI agent's approval prompt a security boundary? A disclosure timeline + an industry inconsistency.
Raw
ai-agent-approval-prompt-as-a-security-boundary.md
When is an AI agent's approval prompt a security boundary?
I reported three approval-bypass findings to an open-source AI agent. Between<br>the day I submitted and the day they replied, the project rewrote its security<br>policy — in a way that reclassified my findings out of scope — and then closed<br>them citing the new text. This is a writeup of what happened and the genuine<br>question underneath it, because I don't think the answer is obvious and I think<br>the industry hasn't settled it.
I'll start by conceding the other side, because it's strong.
The vendor is not wrong about the hard part
The project is Hermes Agent (Nous Research). Like most agents with shell access,<br>it screens commands against a denylist and prompts the operator before running<br>anything that looks destructive. Their current position is that this gate is an<br>in-process heuristic, not a security boundary — that shell is Turing-complete,<br>a denylist over shell strings is structurally incomplete, and the real boundary<br>for adversarial input is OS-level isolation (run it in a container).
That is correct. You cannot regex your way to a complete boundary over shell,<br>and "run untrusted workloads in a sandbox" is the right posture. I'm not<br>disputing any of that, and any framing of this story that ignores it is unfair.
The three findings (mechanism only — two are still live)
Smart-approval prompt injection. In the optional "smart" mode, a second<br>LLM judges flagged commands. The untrusted command was interpolated into the<br>reviewer's prompt with no separation between data and instructions, and the<br>verdict was parsed with a loose substring match. Injected text could talk the<br>reviewer into approving.
Startup-hook code execution. Any .py file in the agent's hooks<br>directory is executed at gateway startup — no registration, no hash, no<br>signature. A prompt-injected model can write that file via a normal tool call<br>that triggers no approval, yielding code execution on the next restart.
Approval-gate parsing bypass. The detector matches regex against the raw<br>command string, not parsed shell tokens. Equivalent rewrites — quoted command<br>names, variable indirection, alternate shell binaries, octal chmod prefixes,<br>versioned interpreter names — run the same dangerous action and bypass the<br>prompt entirely.
I retested all three against the current release in a clean Docker build before<br>writing this. Finding #1 was meaningfully hardened in June (the live bypass rate<br>dropped from 6/8 to 1/8). Findings #2 and #3 still reproduce on the current<br>version. I'm...