Which LLM is the best at finding real vulnerabilities (Part 1)?


Jeremie A

5 min read· 1 hour ago


A few weeks ago, I built a framework that lets me automatically decompile apps and binaries and audit their code. I used it to find 500 actual vulns in public apps (that I'm not even sure what to do with), and now I'm using this toolset to try to find the most cost-effective LLM for vulnerability research.

I was teaching a class in Paris when I created this exercise (https://github.com/lp1dev/Mybank_WebSec_Exercise/). The assignment is simple: run and audit the application, write a penetration testing report and send it to me! The app has a list of 13 vulnerabilities that must absolutely be reported; they are the ones that (in my opinion) a good auditor should not miss. It also includes some less critical vulnerabilities. They range from XSS to remote command execution and secrets stored in unsafe locations. The fake banking webapp even includes some pretty critical logic and "human" flaws.

About 4 years later, I decided to let LLMs give it a try and see how good they actually are at identifying vulnerabilities in human-written code. For this first batch, I decided to try all of the free models on OpenRouter: does one even have to pay to find vulnerabilities? Well, let me tell you: maybe not. Some of these models are surprisingly good!

The rules of the game

I put 7 models in competition, and they were all evaluated in precisely the same way:

I send the exact same project, file by file, in the same order to each LLM.

The LLM must answer with a JSON array of vulnerabilities.

Each LLM has 2 tries to answer with valid JSON; after that, we skip to the next file.

If the description of the vulnerability or another major component is missing, I do not award the points for that vulnerability.

Each of the 13 vulns is worth one point.

For each hallucination or duplicate, the LLM gets a penalty that can go up to -5 points.

Each LLM must write a final audit report as if it were an auditor, and is graded on the way it classified the vulnerabilities criticality-wise (5 points) and on the quality of the report (5 points).

That's a total of 22 points. I also keep track of the vulnerabilities that are actually pertinent and exploitable vs. the ones that are not, to give each LLM a precision percentage.

The results

Here are the results that I got; out of 22 points, each LLM scored:

openai/gpt-oss-120b ███████████████████ 19
google/gemma-4-26b ██████████████████ 18
moonshotai/kimi-k2.6 ██████████████ 14
nvidia/nemotron-9b ████████████ 12
nvidia/nemotron-12b █████████ 9
liquid/lfm-2.5-1.2b █████ 5
liquid/lfm-2.5-think ████ 4

I actually started with a longer list, but unfortunately 5 of the models I had originally included threw too many HTTP 429 errors, and the tests were not conclusive with them.
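To make the rules of the game a bit more concrete, here is a minimal sketch of the per-file loop and of the scoring arithmetic. It is not the actual harness: the `ask_model` callable, the `name` and `description` field names, and the idea of passing the penalty as a single number capped at 5 are all assumptions made for illustration.

```python
import json
from typing import Callable, Optional

MAX_TRIES = 2  # from the rules: two attempts per file, then skip
REQUIRED_FIELDS = ("name", "description")  # hypothetical field names


def parse_findings(raw: str) -> Optional[list]:
    """Return the usable findings, or None if the answer is not a JSON array."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    # Findings missing a major component (e.g. the description) earn no points,
    # so they are dropped here.
    return [
        f for f in data
        if isinstance(f, dict) and all(f.get(key) for key in REQUIRED_FIELDS)
    ]


def audit_file(ask_model: Callable[[str], str], source: str) -> list:
    """Send one file to the model; give up on it after MAX_TRIES invalid answers."""
    for _ in range(MAX_TRIES):
        findings = parse_findings(ask_model(source))
        if findings is not None:
            return findings
    return []  # skip to the next file


def score(found_required: int, penalty: int, classification: int, report: int) -> int:
    """One point per required vuln, minus a penalty capped at 5,
    plus up to 5 points each for classification and report quality."""
    return found_required - min(penalty, 5) + min(classification, 5) + min(report, 5)
```

In this sketch `ask_model` would wrap whatever API call is used to reach the model; keeping it as a plain callable makes it easy to test the retry logic with canned answers.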


Bar graph of the scores with detail

On this exercise, GPT-OSS did extremely well, dare I say even better than the average of the students I evaluated. It found 10 of the 13 must-have vulnerabilities and wrote a pretty convincing report! Surprisingly, Gemma, with about a quarter of GPT-OSS's parameters, did almost as well: it found 8 of the required vulnerabilities, and the report it wrote was even better! Nemotron Nano 9B got the best precision, but mostly because it found fewer vulnerabilities in total. Still, for a small model, the results are actually pretty impressive (if we overlook the hallucinations in the vulnerability details)!

But what actually separates the two main contenders from the rest of the models is the noise they generate, and that's a huge issue in vulnerability reporting. Out of the 47 vulnerabilities reported by Kimi K2.6 that I had to verify, 18 were duplicates (which burned plenty of tokens), and only 9 were actually part of the most critical ones. Something nice that I'd like to add: most of these LLMs are actually open source!


Token usage for each model

Quick summary

🎯 Best Precision: nvidia/nemotron-nano-9b-v2 (88.9%), with very few false positives, though it missed several checklist items.
🔎 Most Vulns Confirmed: moonshotai/kimi-k2.6 (29 real out of 47), a wide net, but also many duplicates.
🚫 Lowest Precision: liquid/lfm-2.5-1.2b-instruct (7.9%), 38 reports, only 3 real.
⚡ Cleanest report: google/gemma-4 and...
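For reference, the precision percentages in the summary are consistent with a simple ratio of confirmed vulnerabilities to total reports. The sketch below assumes that definition (the `precision` helper is hypothetical, not taken from the actual tooling) and plugs in the counts quoted above.

```python
def precision(confirmed: int, reported: int) -> float:
    """Share of reported vulnerabilities that turned out to be real, in percent."""
    if reported == 0:
        return 0.0  # assumption: a model with no reports gets 0% rather than an error
    return 100.0 * confirmed / reported


# Counts taken from the quick summary.
print(f"liquid/lfm-2.5-1.2b-instruct: {precision(3, 38):.1f}%")   # 7.9%
print(f"moonshotai/kimi-k2.6: {precision(29, 47):.1f}%")          # 61.7%
```

The 61.7% figure for Kimi K2.6 is derived here from the 29-out-of-47 count; it is not a number stated in the results themselves.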
