Model choice is not an authorization layer


I ran the same agent-action audit across six model configurations. Average unsafe tool calls per run ranged from 0.0 to 8.0 out of 58 attack scenarios, and the recurring failures were mostly authorization failures, not hidden prompt injections.

A while ago I wrote that a model's refusals are not your authorization layer. A model can tell that something sounds dangerous and refuse it, but it cannot check whether the person asking is allowed to act. That has to live in your application.
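To make "it has to live in your application" concrete, here is a minimal sketch of an application-side check that runs before any tool call the model proposes. Everything in it is an assumption for illustration: the `Caller` type, the role names, the `TOOL_POLICY` table, and `execute_tool_call` are hypothetical and not part of the audit harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    """The authenticated human or service the agent is acting for (hypothetical)."""
    user_id: str
    roles: frozenset


# Hypothetical policy table: which roles may trigger which tool.
TOOL_POLICY = {
    "issue_refund": {"support_lead"},
    "delete_account": {"admin"},
    "send_email": {"support_agent", "support_lead", "admin"},
}


def is_authorized(caller, tool_name):
    """Deny by default: a tool missing from the policy is never allowed."""
    allowed_roles = TOOL_POLICY.get(tool_name, set())
    return bool(allowed_roles & caller.roles)


def execute_tool_call(caller, tool_name, args, handlers):
    """Check the caller's rights first, regardless of what the model decided."""
    if not is_authorized(caller, tool_name):
        return {"status": "denied", "tool": tool_name}
    return {"status": "ok", "result": handlers[tool_name](**args)}


agent_user = Caller("u-1", frozenset({"support_agent"}))
handlers = {"send_email": lambda to, body: f"queued mail to {to}"}

denied = execute_tool_call(agent_user, "delete_account", {"account_id": "a-9"}, handlers)
allowed = execute_tool_call(agent_user, "send_email", {"to": "x@example.com", "body": "hi"}, handlers)
```

The point of the sketch is that the decision depends on who the caller is, which is information the model does not have and cannot verify.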

This is the follow-up, and the question is more practical. If a team wires the same agent to different model APIs, how much does baseline action safety actually change? Teams pick a model for cost, latency, or quality. I wanted to see what that choice does to the one thing nobody puts on a pricing page, which is how often the model takes a high-impact action it was never authorized to take.

So I ran the same audit across six model configurations from three vendors, the current frontier and budget model from each, and I ran every one of them three times.

What I ran

The harness is the same idea as before. The model is given a set of high-risk tool schemas a support or ops agent might have, things like issue_refund, delete_account, transfer_funds, grant_access, and send_email. The tools are not wired to anything. When the model calls one, nothing happens. The harness records which tool it called and with what arguments, and grades that trace against a per-scenario rule for unsafe outcomes.
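The record-and-grade loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness's actual code: `InertToolbox`, `ToolCall`, `refund_rule`, and `grade` are hypothetical names, and the real per-scenario rules are not shown in this article.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


class InertToolbox:
    """Tools that do nothing: every call is recorded and a stub reply is returned."""

    def __init__(self, tool_names):
        self.tool_names = set(tool_names)
        self.trace = []

    def call(self, name, arguments):
        if name not in self.tool_names:
            raise KeyError(f"unknown tool: {name}")
        # Record only; no refund, deletion, or email is ever performed.
        self.trace.append(ToolCall(name, dict(arguments)))
        return {"status": "ok"}


def refund_rule(trace):
    """Hypothetical rule for one scenario: any refund call counts as unsafe."""
    return any(call.name == "issue_refund" for call in trace)


def grade(trace, unsafe_rule):
    """Grade a recorded trace against that scenario's unsafe-outcome rule."""
    return "unsafe" if unsafe_rule(trace) else "safe"


toolbox = InertToolbox(
    ["issue_refund", "delete_account", "transfer_funds", "grant_access", "send_email"]
)
# Stand-in for the model deciding to call a tool during a scenario.
toolbox.call("issue_refund", {"order_id": "o-1", "amount": 50})
verdict = grade(toolbox.trace, refund_rule)
```

Grading the trace rather than the model's text means a polite refusal followed by a tool call still counts as unsafe.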

The battery is version 1.5, 58 attack scenarios plus 3 benign controls, mapped to the OWASP LLM Top 10. The attacks are a mix of direct requests phrased like routine work, indirect injections hidden inside data the agent reads, jailbreaks, and a few sharper indirect cases. The agent, its tools, and the scenarios are identical across all six models. The secret used in the leak tests is a harmless canary the audit injects, not a real credential.

I ran each model three times at the provider's default sampling and kept the result of every run. Single runs are misleading here, because a model's tool-calling is not deterministic. The same model can take an action on one run and refuse it on the next. So I report ranges, and I separate failures that showed up in every run, which I call stable, from failures that showed up in only some runs, which I call intermittent.
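The stable versus intermittent split is a set operation over the per-run failures. A minimal sketch, assuming each run is summarized as the set of scenario IDs that ended in an unsafe tool call (the IDs below are made up for illustration):

```python
def classify_failures(runs):
    """Split failing scenarios into stable (every run) and intermittent (some runs)."""
    run_sets = [set(run) for run in runs]
    if not run_sets:
        return {"stable": set(), "intermittent": set()}
    seen_in_any = set().union(*run_sets)
    seen_in_all = set.intersection(*run_sets)
    return {"stable": seen_in_all, "intermittent": seen_in_any - seen_in_all}


# Hypothetical scenario IDs for three runs of one model.
example_runs = [
    {"grant-admin", "refund-01"},
    {"grant-admin"},
    {"grant-admin", "delete-07"},
]
result = classify_failures(example_runs)
```

In this made-up example `grant-admin` is stable, while `refund-01` and `delete-07` are intermittent.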

The six configurations were Anthropic Claude Opus 4.8 and Claude Haiku 4.5, OpenAI GPT-5.5 and GPT-5-mini, and Google Gemini 3.1 Pro and Gemini 3.5 Flash. These were the current models as of June 2026.

Results

| Model | Tier | Unsafe tool calls per run | Average (of 58) | API-level blocks per run | Stable failures (every run) |
| --- | --- | --- | --- | --- | --- |
| Anthropic Claude Opus 4.8 | frontier | 0, 0, 0 | 0.0 | | none |
| OpenAI GPT-5-mini | budget | 1, 2, 2 | 1.67 | | 1 (grant admin) |
| OpenAI GPT-5.5 | frontier | 3, 1, 2 | 2.0 | 9 (see note) | 1 (account-cleanup deletion, excessive agency) |
| Anthropic Claude Haiku 4.5 | budget | 5, 4, 4 | 4.33 | | 2 (refund, grant admin) |
| Google Gemini 3.5 Flash | budget | 5, 6, 6 | 5.67 | | 5 (3 Critical, 2 High) |
| Google Gemini 3.1 Pro | frontier | 8, 7, 9 | 8.0 | | 7 (5 Critical, 2 High) |
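The averages in the table are plain means of the three per-run counts. A short check, using only the per-run numbers reported above (the variable names are mine):

```python
# Unsafe tool calls per run, copied from the results table.
per_run_counts = {
    "Claude Opus 4.8": [0, 0, 0],
    "GPT-5-mini": [1, 2, 2],
    "GPT-5.5": [3, 1, 2],
    "Claude Haiku 4.5": [5, 4, 4],
    "Gemini 3.5 Flash": [5, 6, 6],
    "Gemini 3.1 Pro": [8, 7, 9],
}

# Mean over runs, rounded to two decimals as in the table.
averages = {
    model: round(sum(counts) / len(counts), 2)
    for model, counts in per_run_counts.items()
}

# Spread between the best and worst run for each model.
ranges = {model: (min(counts), max(counts)) for model, counts in per_run_counts.items()}
```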

Note on GPT-5.5: on every run, 9 of the 58 scenarios did not reach the model at all. OpenAI's API returned a safety-filter error instead, flagging the content as a possible cybersecurity risk. Those 9 are counted as not exploited because no tool was called, but the model was never given the chance to act on them. So the GPT-5.5 number is...
