Dashboard — RealVuln
Benchmark dashboard<br>24 scanners · 26 repositories · ranked by F3 (strict)
24<br>Scanners<br>3 categories
26<br>Repositories<br>Python · Type 1
92.4<br>Best F3 (strict)<br>Kolega Enterprise
95.3<br>Highest recall %<br>Kolega Enterprise
93.2<br>Highest precision %<br>Grok 4.20
Leaderboard
ranked by F3 (strict)
Scanner<br>F3<br>Recall %<br>Precision %<br>Repos<br>Cost $
Precision vs. recall
Performance vs. cost<br>F3 vs cost
Recall ranking<br>fraction of vulnerabilities found
Precision ranking<br>fraction of flags that were real
By category<br>three-tier summary
Detection by vulnerability class<br>recall %, best by approach
▸ LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, and insecure deserialization.
▸ Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.
Dataset composition<br>697 vulnerabilities · 120 FP traps · 26 repositories
Findings
697<br>Real vulnerabilities
120<br>FP traps (14.7%)
18<br>CWE families
20,062<br>Python LOC
Frameworks (26 repos)
Flask: 15
Django: 3
FastAPI: 3
aiohttp: 1
Tornado: 1
custom: 3
Scanner categories
GP-LLM: 19
Rule SAST: 3
Sec.-spec.: 2
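The composition figures can be cross-checked with a few lines of arithmetic. The sketch below assumes the 14.7% FP-trap share is taken over all labeled findings (real vulnerabilities plus traps). That denominator is inferred from the numbers rather than stated on the dashboard, and the variable and function names are illustrative only.

```python
# Counts as printed on the dashboard.
REAL_VULNS = 697
FP_TRAPS = 120
FRAMEWORKS = {"Flask": 15, "Django": 3, "FastAPI": 3,
              "aiohttp": 1, "Tornado": 1, "custom": 3}
SCANNER_CATEGORIES = {"GP-LLM": 19, "Rule SAST": 3, "Sec.-spec.": 2}


def share_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Assumption: the share is computed over real vulnerabilities plus FP traps.
fp_trap_share = share_percent(FP_TRAPS, REAL_VULNS + FP_TRAPS)
total_repos = sum(FRAMEWORKS.values())
total_scanners = sum(SCANNER_CATEGORIES.values())

print(fp_trap_share, total_repos, total_scanners)
```

Under that assumption the script reproduces the published 14.7%, and the framework and category counts sum to the 26 repositories and 24 scanners reported elsewhere on the page.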
All figures are live RealVuln results across 24 scanners and 26 repositories. F3 is the F-beta score with β = 3, which gives recall nine times the weight of precision (β² = 9); strict mode counts unfinished repositories as misses. Cost is the total spend for the scored run (rule-based tools are free or variably priced).
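To make the metric concrete, here is a minimal sketch of the standard F-beta formula with β = 3, plus one possible reading of strict mode. The `score` function, its input layout, and the rule that every vulnerability in an unfinished repository is added as a false negative are assumptions made for illustration; the dashboard only says that unfinished repositories count as misses, and RealVuln's actual scoring code may differ.

```python
from typing import Dict, Optional, Tuple

# (true positives, false positives, false negatives) for one repository
Counts = Tuple[int, int, int]


def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; recall is weighted by beta**2 (9 for F3)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom


def score(results: Dict[str, Optional[Counts]],
          vulns_per_repo: Dict[str, int],
          strict: bool = True) -> Tuple[float, float, float]:
    """Aggregate counts over repositories and return (precision, recall, F3).

    A value of None marks a repository the scanner did not finish.
    Assumption: in strict mode, all of its known vulnerabilities are
    counted as false negatives; otherwise the repository is skipped.
    """
    tp = fp = fn = 0
    for repo, counts in results.items():
        if counts is None:
            if strict:
                fn += vulns_per_repo[repo]
            continue
        tp += counts[0]
        fp += counts[1]
        fn += counts[2]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_beta(precision, recall)


# Hypothetical scanner: finished repo "a", did not finish repo "b".
results = {"a": (8, 2, 2), "b": None}
vulns_per_repo = {"a": 10, "b": 10}
strict_scores = score(results, vulns_per_repo, strict=True)
lenient_scores = score(results, vulns_per_repo, strict=False)
```

In this made-up example precision stays at 0.80 in both modes, while strict mode halves recall from 0.80 to 0.40 and pulls F3 from 0.80 down to about 0.42. This is why an unfinished repository is costly under a recall-weighted metric.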