30+ confirmed vulnerabilities in production OSS, at $6–7 per full codebase scan | by Fergallardogalaviz | Jul, 2026 | Medium
30+ confirmed vulnerabilities in production OSS, at $6–7 per full codebase scan
Fergallardogalaviz
We’re not naming the platforms yet. Several findings are still in responsible disclosure, some without a confirmed patch. But we can share the methodology, the numbers, and enough about the findings to make this worth reading.

Over the past several months we ran our system against real open-source projects in production, used by thousands of organizations, across 7 programming languages. The result: 30+ confirmed vulnerabilities, at least one of them independently corroborated by a security researcher who has no connection to us. Here’s how it works and what we actually found.

The architecture

The system has two main components.

Structural code modeling. This is not text pattern matching. It is a graph that combines syntax, control flow, and data flow across the entire codebase. That is what lets the system reason about how files interact with each other, not just what’s inside each file in isolation. Most vulnerabilities in real codebases don’t live in a single function; they emerge from the interaction between components.

AI agents that reason like an attacker. For each candidate location flagged by the structural model, an agent generates a concrete exploitation hypothesis, then actively searches the code for evidence to disprove it. A finding only gets reported if the agent can’t rule it out.

The hallucination problem in AI-based vulnerability detection is real and well-documented. Our answer to it is a three-layer independent verification pipeline:

Layer 1: An agent generates the attack hypothesis.
Layer 2: An independent analyst agent, with more context and access to surrounding code, has one job: to attempt to disprove Layer 1’s finding.
Layer 3: Literal factual verification using a model from a different provider than the one that generated the original finding. It checks every specific claim (line numbers, function names, actual values) against the source code, word for word.

Layer 3 has already caught a real false positive in production that Layers 1 and 2 had passed as confirmed. The false positive was caused by a variable reassignment several function calls up the chain that neither earlier pass had followed far enough. That’s the kind of thing that makes AI vulnerability reports unreliable in practice, and it is the reason we built the verification architecture the way we did.

The prioritization engine: why the cost is so low

The $6–7 figure per full codebase scan comes from a specific architectural decision: the expensive AI reasoning only runs on a small fraction of the codebase.

The first pass is a deterministic prioritization engine (no LLM, no API cost) that scores every function in the repo based on structural signals: data flow complexity, external input handling, memory management patterns, interaction with authentication or cryptographic primitives, and similar heuristics. This runs fast and cheap. Its job is to filter out the 98–99% of code that almost certainly isn’t worth deep analysis.

Only the functions that score above a threshold go to the AI reasoning layers. Here’s what that looked like across three real runs:
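For readers who want a concrete picture of the first pass itself, here is a minimal sketch of a deterministic score-and-threshold filter. It is an illustration under stated assumptions, not the system’s actual implementation: the signal fields, the weights, the threshold, and the example function names are all hypothetical, chosen only to mirror the structural signals listed above.

```python
from dataclasses import dataclass

# Hypothetical weights for the structural signals named in the article.
# The real engine's signals and weights are not public.
WEIGHTS = {
    "data_flow_complexity": 1.0,
    "handles_external_input": 3.0,
    "manual_memory_management": 2.0,
    "touches_auth_or_crypto": 2.5,
}
THRESHOLD = 4.0  # hypothetical cutoff for "worth deep analysis"


@dataclass(frozen=True)
class FunctionSignals:
    """Structural signals for one function, as a static pass might extract them."""
    name: str
    data_flow_complexity: float      # assumed normalized to the range 0..1
    handles_external_input: bool
    manual_memory_management: bool
    touches_auth_or_crypto: bool


def score(fn: FunctionSignals) -> float:
    """Weighted sum of signals; fully deterministic, no model calls."""
    return sum(weight * float(getattr(fn, key)) for key, weight in WEIGHTS.items())


def select_for_deep_analysis(functions, threshold=THRESHOLD):
    """Return the functions at or above the threshold, highest score first."""
    ranked = sorted(functions, key=score, reverse=True)
    return [fn for fn in ranked if score(fn) >= threshold]


# Hypothetical example input: four functions from an imaginary repo.
functions = [
    FunctionSignals("parse_request", 0.8, True, False, True),
    FunctionSignals("format_date", 0.1, False, False, False),
    FunctionSignals("copy_buffer", 0.5, True, True, False),
    FunctionSignals("log_debug", 0.2, False, False, False),
]

selected = select_for_deep_analysis(functions)
fraction = len(selected) / len(functions)
print([fn.name for fn in selected], f"{fraction:.1%} sent to the AI layers")
```

In this toy input half of the functions pass the filter. The point of the design is that the cutoff is applied before any model is called, so the cost of the expensive layers scales with the number of selected functions rather than with the size of the repo.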
Project B’s higher percentage (17.4%) is an outlier: that codebase had a higher density of functions touching external input directly. The 0.4–0.9% range on the other two is more representative.

For comparison: a thorough manual security review runs at roughly 100–150 LOC/hour per specialist, by standard industry estimates. At that rate, Project A would take approximately 1,900 human-hours, or around 48 weeks of full-time work (assuming 40-hour weeks). We’re not claiming this is a controlled equivalence; a human expert prioritizes differently and wouldn’t attempt 100% coverage of a repo that size. But the system does run 100% of the code through the prioritization engine before deciding what to go deep on, so the coverage is systematic, not intuition-based.

Blind CVE reproduction as a reliability test

Before running against unknown targets, we validated the system’s reliability by giving it codebases it hadn’t seen that contained known vulnerabilities, without telling it where the vulnerability was.

Test 1: Pre-patch version of a widely used JWT/authentication library, with a public CVE for an algorithm confusion vulnerability. The system found it.
Test 2: Pre-disclosure version of infrastructure used across a large part of the AI application industry, with a critical pre-authentication SQL injection. The system found it. According to public reports, this same vulnerability was actively exploited within 36 hours of its official public disclosure.

These weren’t needle-in-a-haystack scenarios designed to make the system look good; they were real codebases with the full complexity of...