Mythos: Given Enough Inference, All Bugs Are Shallow

One of the most famous laws in open source is Linus’s Law: given enough eyeballs, all bugs are shallow.

The idea was simple: if enough people inspect, use, and contribute to software, bugs eventually surface. For decades, that model gave defenders a kind of economic advantage. Finding serious vulnerabilities required scarce human expertise, patience, and time.

That advantage is disappearing. LLMs have changed the model to something more unsettling: Given enough inference, all bugs are shallow.

A single attacker can now compress weeks of research into hours. I have personally found multiple high-severity issues in under 15 minutes, including one in Axios, a JavaScript package downloaded nearly 100 million times per week.

That is the security arms race Mythos represents. It shows what happens when frontier models are pointed at real software with enough inference behind them. But it also raises the question every security buyer should care about: If more inference finds more bugs, who pays for the inference?

The Benchmark

We ran a simple benchmark to test this question.

We took a deliberately vulnerable multi-language application suite covering Python, JavaScript, Java, Terraform, Kubernetes manifests, and other common application surfaces.

Then we scanned it three ways:

Claude Opus 4.6 using /security-review with a 1-million-token context window

Corgea v1 using GPT-4.1

Corgea v2 using GPT-5.4

Before scanning, we removed all canary comments. Then we manually verified every result against the actual source code.

The benchmark included 49 ground-truth vulnerabilities across SQL injection, XSS, SSRF, command injection, CSRF, path traversal, unsafe deserialization, access control, information exposure, and more.

We wanted to see whether even a small, simple app would expose key differences between the three scanners.

The results:

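| Metric | Claude (Opus 4.6) | Corgea v1 (GPT-4.1) | Corgea v2 (GPT-5.4) |
| --- | --- | --- | --- |
| Reasoning | Yes | No | No |
| Detected | 30 | 36 | 34 |
| Missed | 19 | 13 | 15 |
| False positives | 1 | 3 | 0 |
| Precision | 96.80% | 93% | 100% |
| Recall | 61.20% | 73.50% | 69.40% |
| F1 score | 75.00% | 82.10% | 81.90% |
| Input cost (per million tokens) | $5.00 | $2.00 | $2.50 |
| Output cost (per million tokens) | $25.00 | $8.00 | $15.00 |
| Scan time (seconds) | 1,586 | 171 | 232 |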
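For readers who want to check the arithmetic, here is a minimal sketch that recomputes precision, recall, and F1 from the detected, missed, and false-positive counts in the table. The `ScanResult` class and `RESULTS` list are illustrative names, not part of any Corgea tooling, and values recomputed this way can differ from the published percentages by a few tenths of a point because of rounding.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    """Counts for one scanner, taken from the results table."""
    name: str
    detected: int         # true positives
    missed: int           # false negatives
    false_positives: int

    @property
    def precision(self) -> float:
        # Share of reported findings that were real vulnerabilities.
        return self.detected / (self.detected + self.false_positives)

    @property
    def recall(self) -> float:
        # Share of ground-truth vulnerabilities that were found.
        return self.detected / (self.detected + self.missed)

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r)


RESULTS = [
    ScanResult("Claude (Opus 4.6)", detected=30, missed=19, false_positives=1),
    ScanResult("Corgea v1 (GPT-4.1)", detected=36, missed=13, false_positives=3),
    ScanResult("Corgea v2 (GPT-5.4)", detected=34, missed=15, false_positives=0),
]

if __name__ == "__main__":
    for r in RESULTS:
        print(f"{r.name}: precision={r.precision:.1%} "
              f"recall={r.recall:.1%} F1={r.f1:.1%}")
```

Each row's detected and missed counts sum to the 49 ground-truth vulnerabilities, which is a quick sanity check that the table is internally consistent.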

Opus 4.6 has premium pricing above 200K tokens, at $10/M input and $37.50/M output. To be fair, we listed it in the table at the lower rates of $5/M input and $25/M output.
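To make the "who pays for the inference?" question concrete, the per-scan cost can be sketched as a simple function of token counts and the per-million prices listed in the table. The token counts in the usage loop are hypothetical placeholders, since the benchmark does not report how many tokens each scan consumed, and `PRICES` and `scan_cost` are illustrative names.

```python
# Per-million-token prices (input, output) in USD, as listed in the table.
PRICES = {
    "Claude (Opus 4.6)": (5.00, 25.00),
    "Corgea v1 (GPT-4.1)": (2.00, 8.00),
    "Corgea v2 (GPT-5.4)": (2.50, 15.00),
}


def scan_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one scan from its token usage."""
    price_in, price_out = PRICES[model]
    return (input_tokens / 1_000_000) * price_in \
        + (output_tokens / 1_000_000) * price_out


if __name__ == "__main__":
    # Hypothetical usage: the same token counts for every model, which is
    # an assumption. Real scans will differ, and reasoning models usually
    # emit more output tokens.
    for model in PRICES:
        cost = scan_cost(model, input_tokens=2_000_000, output_tokens=100_000)
        print(f"{model}: ${cost:.2f}")
```

Because output tokens are priced several times higher than input tokens for every model here, a scanner that reasons at length can cost considerably more than its input price alone suggests.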

The result is not that models do not matter. They clearly do. GPT-5.4 eliminated every false positive produced by GPT-4.1. It improved CWE classification. It found vulnerabilities no other scanner found, including unbounded request body handling, stack-trace leakage, and SQL error exposure.

But the result also shows something important:

Model capability alone is not enough.

Claude used a frontier model, reasoning, and a massive context window. It still had the lowest recall, took roughly 7 to 9 times longer to run (1,586 seconds versus 232 and 171), and missed entire vulnerability classes.

The GPT-4.1 to GPT-5.4 comparison is the most revealing data point in the benchmark, because it shows what a purpose-built architecture does with a better base model.

GPT-5.4 eliminated every false positive that GPT-4.1 produced. It corrected CWE misclassifications, applying CWE-639 (authorization bypass through user-controlled key) where GPT-4.1 had used the less precise CWE-306 (missing authentication). It discovered three vulnerabilities that no other scanner in the benchmark detected, including Claude with a million-token context window: an unbounded request body (CWE-400), an error handler leaking full stack traces (CWE-209), and SQL error messages exposed to end users (CWE-209). And it picked up two findings that had previously been exclusive to Claude: a path traversal via an unsanitized filename and plaintext credit card numbers stored in source.
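The difference between the two CWE labels is easy to blur, so here is a minimal sketch of each pattern. This is not code from the benchmark application: the `INVOICES` data and all three functions are hypothetical, chosen only to illustrate why CWE-639 is the more precise label when a caller is logged in but can reach another user's record.

```python
from typing import Optional

# Hypothetical in-memory data store keyed by invoice ID.
INVOICES = {
    101: {"owner": "alice", "total": 40},
    102: {"owner": "bob", "total": 75},
}


def get_invoice_no_auth(invoice_id: int) -> dict:
    """CWE-306 pattern: nothing checks that the caller is logged in."""
    return INVOICES[invoice_id]


def get_invoice_idor(current_user: Optional[str], invoice_id: int) -> dict:
    """CWE-639 pattern: the caller is authenticated, but the
    user-supplied key is never checked against ownership."""
    if current_user is None:
        raise PermissionError("login required")
    return INVOICES[invoice_id]


def get_invoice_fixed(current_user: Optional[str], invoice_id: int) -> dict:
    """Authentication plus an ownership check on the requested key."""
    if current_user is None:
        raise PermissionError("login required")
    invoice = INVOICES.get(invoice_id)
    if invoice is None or invoice["owner"] != current_user:
        raise PermissionError("invoice not accessible")
    return invoice
```

In the second function, a logged-in `alice` can read invoice 102 simply by changing the ID. A finding labeled CWE-306 would send a developer looking for a missing login check that is actually present, which is why the classification matters for remediation.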

The F1 scores tell the aggregate story: 82.1% on GPT-4.1, 81.9% on GPT-5.4, 75.0% on Claude Opus 4.6. A model at $2.50 per million input tokens matches or exceeds the performance of a frontier model at $5.00 per million input tokens.

Corgea ran faster and cheaper, on a non-reasoning model, and with better aggregate performance, because the model was not operating alone. It was operating inside a purpose-built security architecture.

Models find possibilities. Architecture turns them into security outcomes.

The right debate is not “model versus architecture.”

Models and architecture compound each other. A stronger model helps, but only if the system gives it the right context, asks the right question, verifies the right evidence, and turns the result into something that both...