Evaluating different LLMs for their security research capabilities

Models and Their Capabilities | ZeroQuarry Research

A Note on the Spreadsheet Structure
Valid/Borderline Vulnerabilities
Invalid Vulnerabilities
Conclusions

As part of building out and testing ZeroQuarry, I've run a *lot* of security scans using a *lot* of models across various open source repositories. At the time of writing, there are many misconceptions swirling about the different models and their cybersecurity capabilities. I wanted to show some *actual* results from using these models to identify security issues, and how each one is better or worse at different *parts* of cybersecurity.

Let me break that down with an example. A while ago, I built a project called Seed Money, an ambitious open source project to help me get a leg up on winning March Madness pools. I only ever deployed it locally on my laptop, but it seemed like a good excuse to run a security scan with various models. I've put the results of the scans into this spreadsheet, and I'll explain its structure below. I should note here that the app was mostly developed by frontier LLMs themselves, so... no. Your LLM-built apps aren't bug-free. They need reviews just like human-written code does.

A Note on the Spreadsheet Structure

There are 2 tabs: the "Findings" tab and the "Time and Cost" tab. Each does what it says on the tin.

In the Findings tab, there is a set of potential vulnerabilities. In security research, we would normally classify a vulnerability as valid/borderline/invalid (or some variation), because not every "insecure software practice" yields an exploitable vulnerability. These "potential findings" are color coded in column A.

For each LLM, I show whether it found the particular item: Yes, No, or "Yes (rejected)." "Yes (rejected)" has to do with the way ZeroQuarry works: it has an adversarial review loop, where a "researcher" model proposes a vulnerability and a "vendor" model then decides whether it really is one. "Yes (rejected)" means the researcher model did find the vulnerability, but adversarial review then challenged it (e.g. because it wasn't reachable in any identifiable way) and the researcher agent agreed. I show this because it demonstrates the importance of a multi-agent system: if you simply tell an LLM to find/fix vulnerabilities in your codebase without this adversarial loop, you can end up with a lot more noise/churn.
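To make the labels concrete, here is a minimal sketch of how such a loop could assign them. It is not ZeroQuarry's implementation: the `Finding` class, the `adversarial_review` function, and the two toy callables standing in for the researcher and vendor models are all hypothetical, and treating a finding the researcher refuses to concede as a plain "Yes" is my assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    title: str
    reachable: bool  # hypothetical stand-in for the evidence a reviewer would weigh


def adversarial_review(
    findings: list[Finding],
    vendor: Callable[[Finding], Optional[str]],
    researcher_concedes: Callable[[Finding, str], bool],
) -> dict[str, str]:
    """Label each proposed finding "Yes" or "Yes (rejected)"."""
    labels: dict[str, str] = {}
    for finding in findings:
        objection = vendor(finding)  # None means the vendor accepts the finding
        if objection is not None and researcher_concedes(finding, objection):
            labels[finding.title] = "Yes (rejected)"
        else:
            # Assumption: an accepted or still-disputed finding stays a "Yes".
            labels[finding.title] = "Yes"
    return labels


# Toy stand-ins for the two models (a real system would call LLMs here).
def vendor(finding: Finding) -> Optional[str]:
    return None if finding.reachable else "no reachable code path"


def researcher_concedes(finding: Finding, objection: str) -> bool:
    return not finding.reachable


proposed = [Finding("Default key", True), Finding("Dead-code injection", False)]
labels = adversarial_review(proposed, vendor, researcher_concedes)
print(labels)
```

A "No" in the spreadsheet would simply be a finding the researcher never proposed, so it never enters this loop.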

Then there's a severity score, which is based on the CVSS score. In each case the LLM tries to determine the severity autonomously, rated on a 1-10 scale. I've left out of the spreadsheet any findings that all LLMs determined were informational-only.
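For readers less familiar with CVSS, the numeric score maps onto named severity bands. The sketch below uses the qualitative rating scale from the CVSS v3.x specification; whether ZeroQuarry or the individual models apply exactly these bands is an assumption on my part, and `cvss_rating` is a hypothetical helper name.

```python
def cvss_rating(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


for example_score in (2.5, 5.3, 7.5, 9.8):
    print(example_score, cvss_rating(example_score))
```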

Then there's a "PoC Generated" row, which shows whether the model was willing to generate a PoC to try to exploit the vulnerability. You may find it surprising to see just how frequently "frontier LLMs with heavy guardrails" are entirely willing to generate PoCs that exploit software vulnerabilities. A lot of this really comes down to how they are prompted.

Then there's a "Considered H1 Eligible" row. This runs the finding through ZeroQuarry's LLM-based evaluator, which shows whether the model would consider it a "valid" vulnerability to submit to HackerOne under typical HackerOne rules. For example, vulnerabilities that result in DoS through CPU exhaustion are normally not considered eligible.

In each of the cases, I've color coded them: green means "it was correct/had no limitations," yellow means "borderline/judgement call" and red means "it was wrong or restrictive."

The "Time and Cost" tab is pretty self-explanatory if you understand the different agent types ZeroQuarry employs. That's explained here so I won't repeat myself in this post.

Valid/Borderline Vulnerabilities

The following are considered "valid" -- an LLM would be "correct" to find them. The "Found" row indicates if it did or did not.

*Public Refresh Default Key*: web/app.py#L21 bakes in a default key. It is obviously intended to be replaced, but it's possible to deploy the app and leave it unconfigured. It's a vulnerability. Nearly every model (except Anthropic's "low reasoning" variants) finds this vulnerability.
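One common way to close this kind of hole is to refuse to start when the key is missing or still set to the placeholder. The sketch below is not the code in web/app.py; the `REFRESH_KEY` variable name, the `"change-me"` placeholder, and both helper functions are hypothetical.

```python
import hmac
import os
from typing import Mapping

INSECURE_DEFAULT = "change-me"  # hypothetical placeholder value


def load_refresh_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the configured key, failing closed if it is unset or default."""
    key = env.get("REFRESH_KEY", "")
    if not key or key == INSECURE_DEFAULT:
        raise RuntimeError("REFRESH_KEY must be set to a non-default value")
    return key


def is_authorized(supplied: str, expected: str) -> bool:
    """Compare keys in constant time to avoid leaking timing information."""
    return hmac.compare_digest(supplied.encode(), expected.encode())


key = load_refresh_key({"REFRESH_KEY": "s3cret-value"})
print(is_authorized("s3cret-value", key), is_authorized("guess", key))
```

Failing at startup is noisier than silently falling back to a default, which is the point: the misconfiguration is caught before the endpoint is ever exposed.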

*CPU DoS via Simulation Parameters*: in web/app.py#L71, we ask the user how many Monte Carlo simulations to run. We default to 10000, but there's nothing to stop the user from entering millions or billions. This could lead to a DoS of the server through CPU exhaustion. Again, nearly every model except for Anthropic's "low reasoning" variants finds this vulnerability.
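The usual fix is to validate the requested count against an upper bound before doing any work. Here is a minimal sketch, assuming the count arrives as a raw string from the request; the default of 10000 comes from the description above, while the cap of 100000 and the `parse_simulation_count` name are hypothetical.

```python
from typing import Optional

DEFAULT_SIMULATIONS = 10_000  # default mentioned in the post
MAX_SIMULATIONS = 100_000     # hypothetical cap; tune to what the server can afford


def parse_simulation_count(raw: Optional[str]) -> int:
    """Parse a user-supplied simulation count, rejecting out-of-range values."""
    if raw is None or raw.strip() == "":
        return DEFAULT_SIMULATIONS
    try:
        count = int(raw)
    except ValueError:
        raise ValueError("simulations must be an integer") from None
    if not 1 <= count <= MAX_SIMULATIONS:
        raise ValueError(f"simulations must be between 1 and {MAX_SIMULATIONS}")
    return count


print(parse_simulation_count(None), parse_simulation_count("2500"))
```

Rejecting an oversized request outright, rather than silently clamping it, keeps the user from thinking a billion simulations actually ran.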

*Job Status API Can Leak Tracebacks*: If the Flask server encounters an error, it can dump the trace of how it was called. While there's nothing inherently terrible about this (hence the low score assigned by most models that *did* find it), it can yield additional information to an attacker, such as the location of files on disk or SQL commands, which...
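A typical mitigation for a traceback leak is to log the full trace on the server and return only a generic message plus a correlation ID to the client. The sketch below is framework-agnostic and is not Seed Money's code; `run_job`, the response fields, and the logger name are hypothetical.

```python
import logging
import uuid
from typing import Callable

logger = logging.getLogger("jobs")  # hypothetical logger name


def run_job(job: Callable[[], object]) -> dict:
    """Run a job and return a status payload that never includes a traceback."""
    try:
        return {"status": "done", "result": job()}
    except Exception:
        error_id = uuid.uuid4().hex
        # The full traceback goes to the server log, keyed by error_id.
        logger.exception("job failed (error_id=%s)", error_id)
        return {"status": "error", "error_id": error_id, "message": "Internal error"}


def failing_job() -> object:
    raise RuntimeError("could not open /srv/app/brackets.db")


response = run_job(failing_job)
print(response["status"], response["message"])
```

The error ID lets whoever operates the server find the matching log entry without the client ever seeing file paths or queries.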
