Prompt Injection as Role Confusion

Security researchers tricked LLMs into giving them cocaine recipes by abusing role models for prompt injection

Jump to main content

REG AD

AI and ML

Security researchers tricked LLMs into giving them cocaine recipes by abusing role models for prompt injection

If you want a picture of the future of LLM security, imagine Whac-a-Mole meets Groundhog Day

Thomas Claburn

Thomas Claburn

Senior reporter

Published tue 30 Jun 2026 // 00:33 UTC

Researchers say that machine learning models cannot reliably distinguish between authorized and unauthorized input, ensuring that prompt injection will continue to present a threat until developers find new ways to have machine learning systems process inputs. AI models provide responses to user-supplied prompts. The problem is that AI models may receive adversarial prompts – directly from a user or indirectly from an ingested document – that tell the model to take action contrary to its built-in system prompt. Various techniques mitigate prompt injection, but defenders have not found ways to prevent such attacks.

REG AD

According to independent researchers Charles Ye and Jasmine Cui, and MIT associate professor Dylan Hadfield-Menell, no one is likely to do so under the current fragile LLM security model.

REG AD

As they observe in a paper titled "Prompt Injection as Role Confusion" in the proceedings of next week's ICML 2026 conference, LLMs have come to rely on a text tagging system that defines "roles" to separate system text from user text. And roles, they argue, do not guarantee security. "Role tags were a formatting trick that became the security architecture and the cognitive scaffolding of modern LLMs," the authors explain in a blog post. "We've shown that this architecture doesn't survive into the model's actual representations, and that such role confusion is linked to prompt injection." When OpenAI's ChatGPT arrived in 2022, it implemented the concept of roles – described by Anthropic a year earlier – as a way to tell the underlying model to behave in a certain way. The user role would make a request and the model, acting in the role of a helpful assistant, would respond to that request.

MORE CONTEXT

Four years into Ukraine invasion, Russia turns influence-ops back to US and Europe

Anonymous researcher drops 0-day 'exploitarium' repo

Supreme Court rules cops need a warrant to vacuum up phone location data

Large Hadron Collider goes offline to make room for its enhanced successor

"A formatting trick had become the mechanism that turned autocomplete into an assistant," the authors observe. Developers introduced other roles over time. In addition to and , there's , , and . These roles served to draw a line between different objectives so they could be individually optimized during the training process. Model makers want to balance conflicting objectives like being helpful and preventing harm, and this involves role distinctions. But roles, the researchers say, have become overloaded with responsibilities they cannot reliably carry out. They've become like a fuzzier version of permission levels, determining how prompts are trusted and treated. The problem, the authors contend, is that roles are determined in a fundamentally insecure way: writing style. "LLMs identify roles from an insecure feature (style)," they explain. "This is like identifying a stranger's profession from how they talk and dress rather than by checking their ID. Usually everything agrees, so this works fine. But when attackers intentionally create a mismatch, the LLM uses the insecure method (writing style) to identify its role instead of the secure method (tags)."

REG AD

The authors developed an attack called CoT (Chain of Thought) Forgery that involves using an LLM to spoof the terse style of OpenAI mode and add that to the prompt. The technique won the 2025 OpenAI Kaggle red-teaming contest. "We asked a bunch of LLMs how to synthesize cocaine, inserting fake reasoning that says it's fine because we're wearing a green shirt," the authors explain. "The LLMs comply. The rationale is transparently dumb, but the models don't evaluate it as an external claim to be scrutinized. They treat it as their already-reached conclusion, and simply act on it. We've stolen the trust given to the role." On a standard jailbreaking benchmark, they say, CoT Forgery took the attack success rate from near zero to about 60 percent on the models tested. And whereas most jailbreaks are fragile and work only for certain models, this one transferred because it exploits a structural flaw. It's not attempting to persuade the model but duping the model into treating the request as something that's already settled. The authors also note that while many models report near-perfect safety scores on prompt-injection benchmarks, human red-teamers achieve attack success rates close to 100 percent. "The discrepancy is straightforward: skilled humans test and adapt attacks until they work, benchmarks don't,"...

Prompt Injection as Role Confusion

Related Articles

(no title)

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

The labor share of income in the US is at its lowest post-war level