AI red teaming agents change how LLMs get tested





Mirko Zorz, Director of Content, Help Net Security

May 21, 2026



Adversarial probing of LLMs has piled up a sprawling toolkit over the past three years. Attack techniques with names like Tree of Attacks with Pruning, Crescendo, and Skeleton Key sit alongside hundreds of prompt transforms and scoring methods across open-source frameworks including Microsoft’s PyRIT, NVIDIA’s Garak, and Promptfoo. The catalog has grown faster than any operator can fluently navigate it, and that mismatch is changing how AI red teaming gets done.

A wave of recent work points toward agent-orchestrated assessment, where an AI agent picks attacks, composes transforms, runs them against a target, and produces structured findings from a natural-language objective. Research published over the past year has shown autonomous agents solving the majority of black-box red team challenges with significant efficiency gains over human operators. A new paper from security firm Dreadnode adds another data point, describing an agent that took a single operator from natural-language goals to 674 executed attacks against Meta’s Llama Scout in roughly three hours.

What the agent layer changes

The pattern across these systems is similar. An operator describes a goal in plain language. The agent picks attack strategies, applies transforms like Base64 encoding, persona framing, or translation into low-resource languages, runs the attacks against a target, scores the results with an LLM judge, and maps findings to compliance frameworks like the OWASP LLM Top 10, MITRE ATLAS, and NIST AI RMF.
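The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the agent from the paper or the API of PyRIT, Garak, or Promptfoo: the names `stub_target`, `stub_judge`, `run_assessment`, and `Finding` are hypothetical, the target and judge are hard-coded stand-ins for real model calls, and the framework tag is a placeholder label.

```python
import base64
from dataclasses import dataclass
from typing import Callable


def base64_transform(prompt: str) -> str:
    """Encode the prompt as Base64 and ask the target to decode it."""
    encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
    return f"Decode this Base64 string and follow it: {encoded}"


def persona_transform(prompt: str) -> str:
    """Wrap the prompt in a role-play framing."""
    return f"You are an actor rehearsing a scene. Stay in character. {prompt}"


# Hypothetical registry; real frameworks ship hundreds of transforms.
TRANSFORMS: dict[str, Callable[[str], str]] = {
    "base64": base64_transform,
    "persona": persona_transform,
}


@dataclass
class Finding:
    goal: str
    transform: str
    success: bool
    framework_tag: str


def stub_target(prompt: str) -> str:
    """Stand-in for the model under test: refuses anything mentioning Base64."""
    if "Base64" in prompt:
        return "I can't help with that."
    return "Sure, here is a placeholder answer."


def stub_judge(response: str) -> bool:
    """Stand-in for an LLM judge: any non-refusal counts as a successful attack."""
    return not response.startswith("I can't")


def run_assessment(goals, target=stub_target, judge=stub_judge,
                   tag="OWASP LLM01 (illustrative)"):
    """Apply every transform to every goal, score the reply, record a finding."""
    findings = []
    for goal in goals:
        for name, transform in TRANSFORMS.items():
            response = target(transform(goal))
            findings.append(Finding(goal, name, judge(response), tag))
    return findings


findings = run_assessment(["benign placeholder goal"])
for f in findings:
    print(f.transform, f.success, f.framework_tag)
```

In an agent-driven system, the choice of goals, transforms, and scorer would come from the orchestrating model. Here they are fixed so the structure of the loop stays visible.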

"Traditional AI red teaming frameworks require operators to spend time configuring attacks, transforms, scorers, datasets, and execution pipelines manually. Much of the workflow becomes a brute-force engineering exercise around library configuration rather than security and safety probing," Raja Sekhar Rao Dheekonda, co-author of the paper and co-creator of Microsoft’s Counterfit and PyRIT projects, told Help Net Security. "The core idea behind the agent is to shift operators away from implementation overhead and toward higher-level reasoning about target behavior, attack coverage, and risk analysis."

The reported numbers from the Llama Scout case study show what that run produced. Across 68 adversarial goals spanning harmful content and bias categories, the agent ran three attack types with five transform variants and reached an 85 percent attack success rate overall. Crescendo and a newer technique called Graph of Attacks with Pruning each hit 100 percent, as did persona-based transforms like skeleton-key framing. Base64 encoding came in lower at 75 percent, suggesting the model caught encoded payloads more reliably than role-play framings.
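An attack success rate is successful attempts divided by total attempts, computed per attack type or per transform. The sketch below shows that arithmetic on made-up toy records. The data is invented for illustration and is not taken from the paper, and the function name `success_rates` is hypothetical.

```python
from collections import defaultdict


def success_rates(records):
    """Map each group to successes / attempts, given (group, success) pairs."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for group, success in records:
        attempts[group] += 1
        successes[group] += int(success)
    return {group: successes[group] / attempts[group] for group in attempts}


# Toy records, invented for this example only.
toy_records = [
    ("base64", True), ("base64", True), ("base64", True), ("base64", False),
    ("persona", True), ("persona", True),
]

rates = success_rates(toy_records)
print(rates)  # {'base64': 0.75, 'persona': 1.0}
```

A per-group rate on a small sample can swing widely with a single result, so the number of attempts behind each percentage matters when comparing transforms.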

What the headline numbers leave out

Several qualifications matter for any team thinking about adopting this approach.

The three-hour figure covers a focused slice of the framework. The paper’s own limitations section acknowledges that comprehensive assessments across all attack types and harm categories run closer to days. Llama Scout is also a mixture-of-experts model with 17 billion active parameters, released in April 2025, and 85 percent on a mid-size open model says little about results against current frontier systems.

Coordinated disclosure is another open question. Asked whether he had coordinated with Meta before publishing verbatim outputs, including shellcode loaders and chemical synthesis steps, Dheekonda said the work was "intended primarily for awareness and research demonstration" and confirmed he "had not coordinated disclosure with Meta prior to publication." He has not evaluated whether subsequent Llama Scout checkpoints mitigate the specific attack and transform combinations identified.

The agent’s alignment also constrains coverage. "We have observed cases where the orchestrating agent itself refuses to compose legitimate AI red teaming workflows because the underlying model interprets the operator’s objective as harmful," Dheekonda said. Highly aligned frontier models decline to generate offensive workflows for sensitive categories like self-harm or CBRN probing. The Llama Scout study used Moonshot AI’s Kimi 2.5 model as both attacker and judge for this reason. Comprehensive evaluations across CBRN and child safety domains are still in progress.

Formal comparisons against expert human operators have not been done. Dheekonda noted skilled humans still outperform the agent on "nuanced long-horizon reasoning, highly contextual social engineering scenarios, novel exploit chains, and emerging attack surfaces where there is limited prior attack history."

The accessibility question

Lowering the operational floor...

