A weekend benchmarking Copilot CLI's /security-review across 5 LLMs

I spent a weekend benchmarking GitHub's hidden AI security reviewer. The cheapest model held its own.

D's Substack


What 200 reviews across five frontier LLMs told me about cost, variance, and whether you should trust a single run.

D Cairo
Jun 03, 2026

There's an undocumented/experimental command inside GitHub Copilot CLI called /security-review. I stumbled across it while setting up Copilot on my work account, looked for an announcement, found nothing, and got curious.

The idea is straightforward: you finish a coding session, the command reads your diff, and it hands back a list of likely vulnerabilities. The kind of thing you'd want running on every PR, in theory. But there's a question the manual experience can't answer. How much does the underlying model actually matter?

Does paying for Opus over Haiku buy you better security findings? Or are you just paying more for the same answer in a fancier wrapper? And (this is the one that nagged at me) can you trust a single run, or is one scan just a noisy slice of what the model "really" thinks?

So I spent a weekend building a small harness to find out. What follows is what came out of it, including a couple of results I didn't expect.

A weekend, a vulnerable app, and 200 reviews

Here's the setup, briefly.

I needed something with a known answer key, and OWASP Juice Shop is the obvious pick: a deliberately vulnerable Node.js app that ships with a catalogued list of known issues. I took the original app and created 10 changes, each one reintroducing one or more catalogued vulnerabilities. 14 vulnerabilities total across the 10 changes, spanning the usual OWASP territory: SQL injection, weak crypto, SSRF, path traversal, XXE, insecure deserialization, broken access control, hardcoded credentials, missing rate limiting, open redirect.

The ground truth (which file, which CWE, a one-line explanation) lives in a catalogue.md. The AI reviewer never sees this file. That part matters.

For each change, I run /security-review non-interactively and capture the output. (Quick note for anyone trying to do this themselves: the --no-ask-user flag is critical. Without it the command pauses for user input after its first pass and never terminates in a script. With it, you get a clean JSON stream and a final result event that tells you exactly how many credits the run consumed.)
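A minimal sketch of the parsing side of that capture step, assuming the stream is newline-delimited JSON and that the final event is marked with a `type` of `"result"` and carries a `credits` field. Those field names are assumptions for illustration, not the CLI's documented schema, and the sample stream below is made up.

```python
import json
from typing import Iterable, Optional


def final_result_event(lines: Iterable[str]) -> Optional[dict]:
    """Return the last event whose type is "result", or None if absent.

    Assumes one JSON object per line. The "type" key and its "result"
    value are hypothetical names, not a documented schema.
    """
    result = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON noise in the stream
        if isinstance(event, dict) and event.get("type") == "result":
            result = event  # keep the most recent one seen
    return result


# Illustrative stream, not real CLI output.
sample = [
    '{"type": "message", "text": "Reviewing diff..."}',
    "",
    '{"type": "result", "credits": 0.33}',
]
event = final_result_event(sample)
print(event["credits"] if event else "no result event")
```

Returning `None` when no result event shows up makes a hung or truncated run easy to spot in a script, which is the failure mode the flag is there to avoid.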

Then a separate, fixed LLM grader takes the catalogue and the reviewer's output and produces three counts per change: detected, missed, and false positives. The grader sees the catalogue. The reviewer doesn't. The grading model stays constant across all runs so any grader bias is a constant offset across models. I went big on the grader and used Opus 4.6.
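To make the three counts concrete, here is a rough sketch of how they could roll up into a detection rate for one run. The `ChangeGrade` type and the sample numbers are hypothetical, and the sketch assumes detection rate means detected divided by catalogued vulnerabilities, with false positives tracked separately.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeGrade:
    """Grader output for a single change (hypothetical structure)."""
    detected: int
    missed: int
    false_positives: int


def detection_rate(grades: List[ChangeGrade]) -> float:
    """Fraction of catalogued vulnerabilities found across all changes.

    False positives do not enter this ratio; they are reported on their own.
    """
    detected = sum(g.detected for g in grades)
    total = detected + sum(g.missed for g in grades)
    return detected / total if total else 0.0


# Illustrative grades for three changes, not data from the benchmark.
grades = [
    ChangeGrade(detected=2, missed=0, false_positives=1),
    ChangeGrade(detected=1, missed=1, false_positives=0),
    ChangeGrade(detected=1, missed=0, false_positives=0),
]
print(f"detection rate: {detection_rate(grades):.0%}")
print(f"false positives: {sum(g.false_positives for g in grades)}")
```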

I ran this across 5 models × 4 independent runs × 10 changes = 200 reviews. It's a small sample, but tokens are expensive and I was funding this out of curiosity, not a budget. Enough to see the broad shape, not enough to publish in a journal.
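The job grid is small enough to enumerate up front. A sketch, where the model strings are display labels only and not the identifiers the CLI expects:

```python
from itertools import product

# Display labels only; these are assumptions, not CLI model IDs.
MODELS = ["haiku-4.5", "sonnet-4.6", "opus-4.6", "gpt-5.4", "gpt-5.5"]
RUNS = range(1, 5)      # 4 independent runs
CHANGES = range(1, 11)  # 10 changes

# One (model, run, change) tuple per review.
jobs = list(product(MODELS, RUNS, CHANGES))
print(len(jobs))  # 200
```

Building the full list first also makes it easy to resume a sweep after a failure by skipping the jobs that already have captured output.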

The models, for the record: Claude Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.5. These are all the ones currently selectable for Copilot CLI.

What came out

Mean detection rate across 4 runs, with range and standard deviation:

Two things in that table stopped me when I first ran the numbers.

1. Haiku 4.5 tied Sonnet 4.6 on mean detection, at about a third of the cost.

Both landed at 86% mean detection. Haiku costs 3.3 credits per 10-change sweep. Sonnet costs 10. That's a 3× spread for the same outcome on this benchmark.
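The arithmetic behind that spread, plus a rough projection to a per-PR budget. The projection assumes one review per PR at the benchmark's average cost per change, and the 500 PRs per month figure is a made-up example, so treat the output as a ballpark.

```python
CHANGES_PER_SWEEP = 10

# Credits per 10-change sweep, as reported above.
sweep_credits = {"Haiku 4.5": 3.3, "Sonnet 4.6": 10.0}

# Average credits per single review.
per_review = {m: c / CHANGES_PER_SWEEP for m, c in sweep_credits.items()}

# Cost ratio between the two models.
ratio = sweep_credits["Sonnet 4.6"] / sweep_credits["Haiku 4.5"]


def monthly_credits(model: str, prs_per_month: int) -> float:
    """Projected credits per month, assuming one review per PR."""
    return per_review[model] * prs_per_month


print(f"ratio: {ratio:.2f}x")
for model in sweep_credits:
    # 500 PRs/month is a hypothetical volume for illustration.
    print(f"{model}: {monthly_credits(model, 500):.0f} credits/month")
```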

If you're planning to run /security-review on every PR in a busy repo, this is the line item to look at first. Sonnet does have slightly fewer false positives on average (0.8 vs 1.2), so it's not strictly dominated, but it's close enough that it changed how I think about the workflow. Instead of "pick the best model for security review," it became "pick the cheapest competent model, then optionally use a bigger one to triage what it finds."

The interesting question isn't which model is best. It's which model is good enough to throw at every PR, and which one do you reserve for the diffs that matter.

2. Opus was the only model with zero variance across runs.

Opus scored 13/14 every single time. Same detection rate, same missed vulnerability, four runs in a row. Robotic.

Everything else moved. Sonnet ranged from 79% to 93% across its four runs. Haiku did the same. That's a 14-percentage-point swing for "the same model on the same input." That number surprised me when I first saw it, so I went back and double-checked the runs. The numbers hold up, and given the non-determinism of LLMs, a swing like this should be expected.
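A short sketch of how mean, range, and standard deviation fall out of per-run detection counts. The Opus counts (13 of 14, four times) come from the results above. The second list is only one set of counts consistent with an 86% mean and a 79% to 93% range, since the actual per-run counts aren't given here. The choice of sample standard deviation is also an assumption.

```python
from statistics import mean, stdev
from typing import Dict, List

TOTAL_VULNS = 14  # catalogued vulnerabilities across the 10 changes


def summarize(detected_per_run: List[int]) -> Dict[str, float]:
    """Mean, min, max and sample standard deviation of detection rate."""
    rates = [d / TOTAL_VULNS for d in detected_per_run]
    return {
        "mean": mean(rates),
        "min": min(rates),
        "max": max(rates),
        "stdev": stdev(rates),  # sample std dev; needs at least 2 runs
    }


opus = summarize([13, 13, 13, 13])      # reported: 13/14 on every run
mid_tier = summarize([11, 12, 12, 13])  # illustrative counts only

for name, s in [("Opus 4.6", opus), ("mid-tier (illustrative)", mid_tier)]:
    print(
        f"{name}: mean {s['mean']:.0%}, "
        f"range {s['min']:.0%} to {s['max']:.0%}, "
        f"stdev {s['stdev']:.1%}"
    )
```

With only 14 vulnerabilities in the catalogue, each one is worth about 7 percentage points, so the 14-point swing corresponds to just two findings appearing or disappearing between runs.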

If your security gate is a single /security-review run, and the model behind it is mid-tier, you are partly looking at noise. This was the finding that genuinely changed my mind. Re-running matters more than I'd assumed before doing...
