CVE-Bench: testing LLM agents on real-world vulnerability patches


I Tested Whether AI Can Fix Security Vulnerabilities. Well, It's Complicated. – CVE-Bench – Benchmarking LLMs on real-world CVE patching



Correction (2026-05-28): Five security tests in the original benchmark were found to reject valid alternative fixes that nonetheless addressed the reported vulnerability. Results were recalculated after correcting the tests. Solve rates increased by 3–7 points per model; the ranking order is unchanged, but cross-family pairwise comparisons that previously fell short of significance now cross α = 0.05 under McNemar with continuity correction. All affected numbers and statistical conclusions in this post have been updated.

TL;DR: I evaluated five frontier models (three OpenAI, two Poolside) on fixing 20 real CVEs under three prompt conditions: full advisory, behavioral description only, and file+function location only. No model reliably fixes real vulnerabilities: the best solve rate (gpt-5.5) is 50% overall and 60% under the most favorable condition (full advisory). All four cross-family pairwise comparisons reach statistical significance under McNemar with continuity correction (p ≤ 0.040); within-family comparisons do not. The failure modes (wrong-search drift, budget exhaustion, partial fixes) are structured and repeatable. Token cost varies by 4× for equivalent outcomes. The locate condition, i.e., fixing the code given only its location and no description of the flaw, is the sharpest instrument, and every model weakens there.
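For readers unfamiliar with the test named above, here is a minimal sketch of McNemar's test with continuity correction on paired solve/fail outcomes. It uses the textbook formula, not necessarily the exact script behind the reported p-values, and the discordant counts in the usage lines are made up for illustration; they are not results from this benchmark.

```python
import math


def mcnemar_cc(b: int, c: int) -> tuple[float, float]:
    """McNemar's test with continuity correction.

    b: paired runs solved by model A but not by model B
    c: paired runs solved by model B but not by model A
    Runs where both models agree do not enter the statistic.
    Returns (chi-square statistic, two-sided p-value).
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, nothing to compare
    # Clamp at zero so that b == c gives a statistic of 0.
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical discordant counts, for illustration only.
stat, p_value = mcnemar_cc(b=14, c=3)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

Only the discordant pairs matter here, which is why two models with similar overall solve rates can still differ significantly if they solve different tasks.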

In early 2026, Anthropic claimed that Mythos, one of their latest models, finds security vulnerabilities better than human experts. Yet the number of security vulnerabilities keeps rising.

I wanted to test how well models do at fixing vulnerabilities. Poolside’s Laguna models arrived this year, and I was looking for a real environment to put them through their paces. SWE-Bench, the default benchmark, tests general coding ability; I wanted something with sharper stakes.

So, I thought, why not create a benchmark specifically for real-world security? That’s CVE-Bench. Twenty real-world CVEs, five models, three prompt conditions. Each agent runs in a sandboxed container and is scored against the maintainer’s security tests (with some adaptations).

Hopefully, benchmarks like this one will help the community fix these issues before they can be exploited.

The anatomy of security vulnerabilities

When a security researcher finds a vulnerability, they follow responsible disclosure: contact the maintainers privately with an advisory, a structured description of the flaw, and coordinate a fix before going public. A CVE identifier is assigned, and the advisory is published once the fix is released so users can update vulnerable dependencies.

There is a continuing effort to catalogue vulnerabilities in open-source software. In particular, the GitHub Advisory Database (GHSA) links CVEs and advisories to repositories, maintainers, and fixed versions.

CVEs also classify the underlying weakness with a Common Weakness Enumeration (CWE) code. CWEs are identifiers for common classes of hardware and software weaknesses: CWE-22 for path traversal, CWE-79 for XSS, CWE-835 for infinite loops that hang a process, and so on.

What makes this database useful for building a benchmark is that maintainers increasingly link their fix directly from the ticket: a commit SHA, a pull request, sometimes both. This simple action makes initiatives like mine much simpler. When the link is not available, another way to obtain a ground truth is to dig into the release notes or the git history of the first fixed version.
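To make the "link to the fix" idea concrete, here is a small sketch that pulls commit SHAs and pull request numbers out of a list of advisory reference URLs. It assumes the references are already available as plain URL strings and relies only on the standard GitHub URL shapes (`/commit/<sha>` and `/pull/<number>`); the function name and the example URLs are hypothetical, and this is not the benchmark's actual tooling.

```python
import re

# Standard GitHub URL shapes for a commit and a pull request.
COMMIT_RE = re.compile(
    r"github\.com/(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+)/commit/(?P<sha>[0-9a-f]{7,40})"
)
PULL_RE = re.compile(
    r"github\.com/(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+)/pull/(?P<number>\d+)"
)


def extract_fix_refs(reference_urls: list[str]) -> dict[str, list]:
    """Collect candidate fix commits and pull requests from reference URLs."""
    commits: list[str] = []
    pulls: list[int] = []
    for url in reference_urls:
        match = COMMIT_RE.search(url)
        if match:
            commits.append(match.group("sha"))
            continue
        match = PULL_RE.search(url)
        if match:
            pulls.append(int(match.group("number")))
    return {"commits": commits, "pulls": pulls}


# Hypothetical reference list, for illustration only.
refs = [
    "https://github.com/example-org/example-lib/commit/0123456789abcdef0123456789abcdef01234567",
    "https://github.com/example-org/example-lib/pull/123",
    "https://nvd.nist.gov/vuln/detail/CVE-0000-0000",
]
print(extract_fix_refs(refs))
```

An empty result for both lists is the signal to fall back to release notes or git history.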

Task curation

CVE-Bench targets a broad range of CWE categories (15 in total), with CVSS scores ranging from 2.1 to 9.8, across a diverse set of real-world Python projects (18 in total, including Pillow, GitPython, yt-dlp, and urllib3).

To keep the benchmark tractable, I filtered out advisories where

the project is a monorepo (LangChain, Kubernetes, Apache projects) that requires downloading hundreds of MBs and whose build/test isolation is complex;

the security fix touches Rust, C, C++, or another compiled language alongside Python, so the agent needs a compiler toolchain and the build cycle is slow;

the committed fix introduces significant API refactoring, which would require the agent to reproduce exactly the same domain changes.
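The three exclusion rules above can be sketched as a simple predicate. The actual selection was manual, so this is only an illustration: the `Candidate` record, its field names, and the suffix list are all assumptions, and the monorepo and API-refactoring flags stand in for human judgment calls.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional

# Assumed (non-exhaustive) set of compiled-language source suffixes.
COMPILED_SUFFIXES = {".rs", ".c", ".h", ".cc", ".cpp", ".hpp"}


@dataclass
class Candidate:
    """Hypothetical record for an advisory under consideration."""
    advisory_id: str
    is_monorepo: bool      # judged by hand
    fix_files: list[str]   # paths touched by the fix commit
    refactors_api: bool    # judged by hand


def rejection_reason(candidate: Candidate) -> Optional[str]:
    """Return why a candidate is excluded, or None if it is kept."""
    if candidate.is_monorepo:
        return "monorepo"
    if any(PurePosixPath(p).suffix in COMPILED_SUFFIXES for p in candidate.fix_files):
        return "compiled-language fix"
    if candidate.refactors_api:
        return "API refactoring"
    return None
```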

For each project, CVE-Bench

provides the vulnerable and fixed git SHAs;

delivers a setup script that initializes the vulnerable repository inside a Docker container;

injects a manually curated test_security.py containing at least one test that fails on the vulnerable code but passes on the fixed code.
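The property required of test_security.py can be stated as a small check. This sketch assumes each test run is summarized as a mapping from test name to pass/fail; the function names and test names are hypothetical, and the real harness may record results differently.

```python
def discriminating_tests(
    on_vulnerable: dict[str, bool],
    on_fixed: dict[str, bool],
) -> list[str]:
    """Tests that fail on the vulnerable SHA and pass on the fixed SHA.

    Each mapping goes from test name to True (passed) or False (failed).
    A test missing from the vulnerable run is treated as not discriminating.
    """
    return [
        name
        for name, passed_on_fixed in on_fixed.items()
        if passed_on_fixed and on_vulnerable.get(name) is False
    ]


def is_valid_security_suite(
    on_vulnerable: dict[str, bool],
    on_fixed: dict[str, bool],
) -> bool:
    """A suite qualifies if every test passes on the fixed code and
    at least one test exposes the vulnerability."""
    return all(on_fixed.values()) and bool(discriminating_tests(on_vulnerable, on_fixed))


# Hypothetical results for one task, for illustration only.
vulnerable_run = {"test_traversal_blocked": False, "test_normal_path": True}
fixed_run = {"test_traversal_blocked": True, "test_normal_path": True}
print(discriminating_tests(vulnerable_run, fixed_run))
```

Tests that pass on both SHAs, like the second one here, act as regression guards rather than as evidence that the flaw is fixed.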

The agent’s goal is to repair the reported vulnerability from a task description without access to the validation script.

I manually reviewed and selected each task. Initially, I intended to retrieve the maintainer’s own new tests as a canonical solution. However,...
