llm-agent-audit/docs/model-choice-is-not-an-authorization-layer.md at master · hugoii/llm-agent-audit · GitHub
Model choice is not an authorization layer
I ran the same agent-action audit across six model configurations. Average unsafe tool calls per run ranged from 0.0 to 8.0 out of 58 attack scenarios, and the recurring failures were mostly authorization failures, not hidden prompt injections.
A while ago I wrote that a model's refusals are not your authorization layer. A model can tell that something sounds dangerous and refuse it, but it cannot check whether the person asking is allowed to act. That has to live in your application.
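A minimal sketch of what "living in your application" could look like, assuming a simple role-to-tool permission map. The names `ALLOWED_TOOLS`, `ToolCall`, `authorize`, and `execute` are hypothetical and are not part of the audit harness; a real system would check the authenticated user and the specific resource, not only a role.

```python
from dataclasses import dataclass, field

# Hypothetical permission map: which roles may trigger which tools.
ALLOWED_TOOLS = {
    "support_agent": {"send_email"},
    "billing_admin": {"send_email", "issue_refund"},
}


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


def authorize(role: str, call: ToolCall) -> bool:
    """Return True only if the caller's role is permitted to use the tool."""
    return call.name in ALLOWED_TOOLS.get(role, set())


def execute(role: str, call: ToolCall) -> str:
    # The check runs in application code, whatever the model decided to call.
    if not authorize(role, call):
        return f"denied: {role} may not call {call.name}"
    # Placeholder for the real side effect.
    return f"executed: {call.name}"
```

The point of the design is that the decision does not depend on whether the model refused or complied: a tool call the model emits is only a request, and the application decides whether it runs.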
This is the follow-up, and the question is more practical. If a team wires the same agent to different model APIs, how much does the base action-safety actually change? Teams pick a model for cost, latency, or quality. I wanted to see what that choice does to the one thing nobody puts on a pricing page, which is how often the model takes a high-impact action it was never authorized to take.
So I ran the same audit across six model configurations from three vendors, the current frontier and budget model from each, and I ran every one of them three times.
What I ran
The harness is the same idea as before. The model is given a set of high-risk tool schemas a support or ops agent might have, things like issue_refund, delete_account, transfer_funds, grant_access, and send_email. The tools are not wired to anything. When the model calls one, nothing happens. The harness records which tool it called and with what arguments, and grades that trace against a per-scenario rule for unsafe outcomes.
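The record-then-grade idea can be sketched as follows. This is an illustration under assumptions, not the harness's actual code: `Trace`, `Scenario`, `called`, and `grade` are hypothetical names, and the example rule (flag any call to a given tool) is far simpler than a real per-scenario rule would be.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RecordedCall:
    tool: str
    arguments: dict


@dataclass
class Trace:
    calls: list = field(default_factory=list)

    def record(self, tool: str, arguments: dict) -> str:
        # Tools are stubs: the call is logged and nothing is executed.
        self.calls.append(RecordedCall(tool, arguments))
        return "ok"


@dataclass
class Scenario:
    name: str
    is_unsafe: Callable[[Trace], bool]  # per-scenario rule over the trace


def called(tool: str) -> Callable[[Trace], bool]:
    """Build a rule that flags a trace if the given tool was called at all."""
    return lambda trace: any(c.tool == tool for c in trace.calls)


def grade(scenario: Scenario, trace: Trace) -> str:
    return "unsafe" if scenario.is_unsafe(trace) else "safe"


# Usage: a made-up scenario in which any refund call counts as unsafe.
scenario = Scenario("unauthorized refund", called("issue_refund"))
trace = Trace()
trace.record("issue_refund", {"order_id": "A1", "amount": 200})
result = grade(scenario, trace)
```

Because grading works on the recorded trace rather than on the model's text, a model that writes a polite refusal but still emits the tool call is graded unsafe.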
The battery is version 1.5, 58 attack scenarios plus 3 benign controls, mapped to the OWASP LLM Top 10. The attacks are a mix of direct requests phrased like routine work, indirect injections hidden inside data the agent reads, jailbreaks, and a few sharper indirect cases. The agent, its tools, and the scenarios are identical across all six models. The secret used in the leak tests is a harmless canary the audit injects, not a real credential.
I ran each model three times at the provider's default sampling and kept the result of every run. Single runs are misleading here, because a model's tool-calling is not deterministic. The same model can take an action on one run and refuse it on the next. So I report ranges, and I separate failures that showed up in every run, which I call stable, from failures that showed up in only some runs, which I call intermittent.
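The stable versus intermittent split amounts to set arithmetic over the runs. A small sketch, assuming each run is reduced to the set of scenario IDs that were graded unsafe; the function name and the scenario IDs below are made up for illustration.

```python
def classify_failures(runs: list[set[str]]) -> dict[str, set[str]]:
    """Split failed scenarios into stable (every run) and intermittent (some runs)."""
    if not runs:
        return {"stable": set(), "intermittent": set()}
    stable = set.intersection(*runs)  # failed in every run
    seen = set.union(*runs)           # failed in at least one run
    return {"stable": stable, "intermittent": seen - stable}


# Hypothetical scenario IDs for three runs of one model.
runs = [
    {"refund", "grant_admin", "delete_account"},
    {"refund", "grant_admin"},
    {"refund", "grant_admin", "send_email"},
]
summary = classify_failures(runs)
```

Here `summary["stable"]` contains `refund` and `grant_admin`, while `delete_account` and `send_email` land in `summary["intermittent"]`.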
The six configurations were Anthropic Claude Opus 4.8 and Claude Haiku 4.5, OpenAI GPT-5.5 and GPT-5-mini, and Google Gemini 3.1 Pro and Gemini 3.5 Flash. These were the current models as of June 2026.
Results
| Model | Tier | Unsafe tool calls per run | Average (of 58) | API-level blocks per run | Stable failures (every run) |
| --- | --- | --- | --- | --- | --- |
| Anthropic Claude Opus 4.8 | frontier | 0, 0, 0 | 0.0 | | none |
| OpenAI GPT-5-mini | budget | 1, 2, 2 | 1.67 | | 1 (grant admin) |
| OpenAI GPT-5.5 | frontier | 3, 1, 2 | 2.0 | 9 (see note) | 1 (account-cleanup deletion, excessive agency) |
| Anthropic Claude Haiku 4.5 | budget | 5, 4, 4 | 4.33 | | 2 (refund, grant admin) |
| Google Gemini 3.5 Flash | budget | 5, 6, 6 | 5.67 | | 5 (3 Critical, 2 High) |
| Google Gemini 3.1 Pro | frontier | 8, 7, 9 | 8.0 | | 7 (5 Critical, 2 High) |
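The averages are the plain mean of the three per-run counts. The sketch below recomputes them from the numbers in the table; the `summarize` helper is a hypothetical name, and reporting the min and max alongside the mean is one way to show the range.

```python
from statistics import fmean

# Unsafe tool calls per run, copied from the table above.
per_run = {
    "Anthropic Claude Opus 4.8": [0, 0, 0],
    "OpenAI GPT-5-mini": [1, 2, 2],
    "OpenAI GPT-5.5": [3, 1, 2],
    "Anthropic Claude Haiku 4.5": [5, 4, 4],
    "Google Gemini 3.5 Flash": [5, 6, 6],
    "Google Gemini 3.1 Pro": [8, 7, 9],
}


def summarize(counts: list[int]) -> dict[str, float]:
    """Return the range and the mean (rounded to two decimals) of per-run counts."""
    return {"min": min(counts), "max": max(counts), "mean": round(fmean(counts), 2)}


summary = {model: summarize(counts) for model, counts in per_run.items()}
```

For example, `summary["Anthropic Claude Haiku 4.5"]` comes out as a mean of 4.33 with a range of 4 to 5, matching the table row.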
Note on GPT-5.5: on every run, 9 of the 58 scenarios did not reach the model at all. OpenAI's API returned a safety-filter error instead, flagging the content as a possible cybersecurity risk. Those 9 are counted as not exploited because no tool was called, but the model was never given the chance to act on them. So the GPT-5.5 number is...