The Agent Is Not the Scanner: Making AI Security Agents Better – Beri's Blog – Security Research
Part I: The Background
Scaffolding an LLM is not a universally good idea. Whether it helps or hurts depends almost entirely on how capable the model already is, and it varies between model families.
I spent eight months running LLMs against security tasks and measuring the results empirically to figure out what actually works. The findings were not what I expected.
Why Raw Agents Felt Wasteful
Handing an agent all the context seems to be a good idea at first glance.
Let the AI figure it out
But that approach has several issues. The first is cost. Why would you spend 20-30K tokens on something a scanner could easily catch?
Security engineers have worked tirelessly to create scanners, and we have effective techniques for SAST and DAST. Moreover, having the LLM parse raw nmap and semgrep output is a waste of tokens too. The LLM only needs the most refined results: what has already been looked at and what was found, so that it can be steered towards previously unexplored areas of the application.
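As a rough illustration of what "the most refined results" could look like, the sketch below condenses a Semgrep JSON report (the `results` list that `semgrep --json` emits) into one short line per rule. The function name `digest_semgrep`, the per-rule location cap, and the output layout are my own assumptions for illustration, not a description of how Lattice Mind does it.

```python
import json
from collections import defaultdict


def digest_semgrep(report_json: str, max_locations: int = 3) -> str:
    """Collapse a Semgrep JSON report into one short line per rule."""
    report = json.loads(report_json)
    by_rule = defaultdict(list)
    for finding in report.get("results", []):
        location = f'{finding["path"]}:{finding["start"]["line"]}'
        by_rule[finding["check_id"]].append(location)

    lines = []
    for rule, locations in sorted(by_rule.items()):
        shown = ", ".join(locations[:max_locations])
        extra = len(locations) - max_locations
        suffix = f" (+{extra} more)" if extra > 0 else ""
        lines.append(f"{rule}: {len(locations)} hit(s) at {shown}{suffix}")
    return "\n".join(lines) or "no findings"


# Hypothetical report, trimmed to the fields the digest reads.
sample = json.dumps({
    "results": [
        {"check_id": "python.sqli", "path": "app/db.py", "start": {"line": 12}},
        {"check_id": "python.sqli", "path": "app/db.py", "start": {"line": 40}},
        {"check_id": "python.xss", "path": "app/views.py", "start": {"line": 7}},
    ]
})
print(digest_semgrep(sample))
```

Three findings collapse into two lines here, and the saving grows with the size of the report, since the agent only sees which rules fired and a handful of locations for each.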
Why Blind Scanners Were Not Enough Either
Writing detection logic for every single vulnerability class or CTF challenge is a fundamentally unsolvable problem: there will always be vulnerability classes you didn’t consider, or specific manifestations in code that you didn’t expect because you lacked context. Running scans is easy: you hit run against a target and wait for the success/fail output. But how do you know what to detect and where your vulnerabilities lie?
The deeper problem with scanners and fuzzers is that they are inherently deterministic and only pattern match against known signatures. That makes them extremely good at finding what they are told to look for but blind to everything else. You may have fully protected yourself against SQLi, XSS, and IDOR, but if you never write a detection for SSTI, you may miss it entirely.
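A toy version of that blind spot, as a minimal sketch: the two regex signatures and both snippets below are hypothetical and far cruder than anything a real scanner ships, but the failure mode is the same. With no SSTI signature, the SSTI snippet comes back clean.

```python
import re

# Hypothetical signatures: string concatenation inside execute(), and
# direct innerHTML assignment. There is deliberately no SSTI entry.
SIGNATURES = {
    "sqli": re.compile(r"execute\(.*\+"),
    "xss": re.compile(r"innerHTML\s*="),
}


def scan(source: str) -> list[str]:
    """Return the name of every signature that matches the source."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(source)]


sqli_snippet = 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)'
ssti_snippet = 'render_template_string("Hello " + request.args["name"])'

print(scan(sqli_snippet))  # ['sqli']
print(scan(ssti_snippet))  # []
```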
Building Lattice Mind and The Scaffolding
The solution is a hybrid approach that combines deterministic, inexpensive scans with structured knowledge and the LLM’s intelligence to identify vulnerabilities. The first part of this hybrid approach is Lattice Mind.
Lattice Mind is a scanner the agent can drive directly: it can run scans and edit the current scan to inject its own payload, without having to write a script for a whole class of vulnerabilities. Essentially, you grab the low-hanging fruit by using deterministic reasoning (i.e. decision trees) to reason about the app architecture.
For example, if Lattice Mind’s scans give the agent enough information to write an SSTI payload, the agent can use Lattice Mind to run a scan with that new payload, saving it the overhead of writing a script to verify or exploit the vulnerability.
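The post does not show Lattice Mind's interface, so the following is only a sketch of the "edit the current scan" idea under my own assumptions. `Scan`, `run_scan`, and the two fake targets are hypothetical names, and the transport is a plain function rather than a real HTTP client so the example runs offline.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Scan:
    """Hypothetical scan definition: where to inject, what to send, what proves a hit."""
    url: str
    param: str
    payload: str
    expect: str


def run_scan(scan: Scan, send: Callable[[str, str, str], str]) -> bool:
    """Send the payload via send(url, param, payload) and look for the marker."""
    return scan.expect in send(scan.url, scan.param, scan.payload)


def vulnerable_target(url: str, param: str, payload: str) -> str:
    """Stand-in for an app that evaluates {{7*7}} server-side (no real HTTP)."""
    return "Hello " + payload.replace("{{7*7}}", "49")


def safe_target(url: str, param: str, payload: str) -> str:
    """Stand-in for an app that echoes the input without evaluating it."""
    return "Hello " + payload


baseline = Scan("http://target.example/greet", "name", "lattice", "Hello lattice")
# The agent changes only the payload and the success marker; no new script.
ssti_probe = replace(baseline, payload="{{7*7}}", expect="49")

print(run_scan(ssti_probe, vulnerable_target))  # True
print(run_scan(ssti_probe, safe_target))        # False
```

The design point is that the scan is data: changing a payload becomes a one-line edit the agent can make cheaply, while the request and verification logic stays deterministic.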
The Scaffolding is the orchestration layer: skills and MCP servers that give the LLM access to the tools and context it needs to perform its security tasks. At the start, the skills looked more like TTPs: what do you detect, where do you go from there, what can you chain, and so on. Naturally, I also had to add a skill-improvement skill: upon completing a task, the model evaluates how the skills and MCP servers helped or harmed the run, then modifies the skills to do better on the next run.
I put it to the test in live CTFs and it was able to solve the hardest problems from PicoCTF 2025 and DawgCTF 2026 in less than 30 minutes each.
Part II: Technical Notes
The Benchmark Setup
To me, it seemed obvious that a setup like this would be better than providing the model with no tools and no context. My assumption was that these models are trained more for software engineering tasks than security engineering tasks, and that TTPs packaged as skills could fill the gap. To test that intuition, I set up a test bench: 11 models, 3 runs each, three conditions (control vs skills-only vs MCP-enabled), and 20 vulnerability-finding code snippet tasks.
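To get a feel for the size of that test bench, the sketch below enumerates the evaluation matrix. It assumes "3 runs each" means three runs per model per condition, which the post does not spell out, and the model and task identifiers are placeholders.

```python
from itertools import product

# Placeholder identifiers: the post gives the counts, not the full lists.
models = [f"model-{i}" for i in range(1, 12)]        # 11 models
conditions = ["control", "skills-only", "mcp-enabled"]
runs = range(1, 4)                                    # 3 runs (assumed per condition)
tasks = [f"task-{i:02d}" for i in range(1, 21)]      # 20 snippet tasks

matrix = list(product(models, conditions, runs, tasks))
print(len(matrix))  # 1980
```

Under that reading, every model answers 180 task attempts (3 conditions x 3 runs x 20 tasks), which is what makes a per-condition comparison possible at all.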
Every model would look at code snippets and try to find vulnerabilities, and I evaluated on two fronts:
Manual: The largest, most cyber-capable model reads through each run's reasoning and grades the answer. It checks whether the reasoning was correct, whether technical issues hindered the solve, whether the model named a different CVE, and whether it detected a different class of vulnerability in the same snippet. This front was to ensure we’re not unfairly assessing models.
Automated: These vulnerability snippets have certain “correct” answers and we can evaluate by checking how close the models got to the answer.
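The post does not say exactly how "closeness" is computed, but the results that follow are reported as F1 and false positives per task. One plausible way to get there, sketched under the assumption that answers are compared as sets of vulnerability labels (the CWE IDs below are illustrative):

```python
def score_task(predicted: set[str], expected: set[str]) -> tuple[int, int, int]:
    """Count true positives, false positives, and false negatives for one task."""
    tp = len(predicted & expected)
    fp = len(predicted - expected)
    fn = len(expected - predicted)
    return tp, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The model reports SQLi and XSS, but only SQLi is in the answer key.
tp, fp, fn = score_task({"CWE-89", "CWE-79"}, {"CWE-89"})
print(tp, fp, fn)                      # 1 1 0
print(round(f1_score(tp, fp, fn), 4))  # 0.6667
```

A model that pads its answer with extra guesses keeps full recall but pays in precision, which is why F1 and false positives per task are worth tracking together.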
The Results Were Not What I Expected
Skills made the weakest models substantially better. On gpt-5.1-codex-mini-low, F1 jumped from 0.4774 to 0.5926, a 24% relative gain, with strict task accuracy climbing 20 percentage points. The rest of the low-baseline group showed the same pattern in the +0.05 to +0.06 F1 range. Across all four models with control F1 below 0.60, skills produced an average ΔF1 of +0.0656, dropped FPs per task...