Show HN: We're inviting Anthropic to put the real Mythos 5 on our open benchmark

jfaganel992 pts0 comments

Dashboard — RealVuln


Benchmark dashboard
24 scanners · 26 repositories · ranked by F3 (strict)

Metric: F2 or F3

Scanners: 24 (3 categories)
Repositories: 26 (Python · Type 1)
Best F3 (strict): 92.4 (Kolega Enterprise)
Highest recall: 95.3% (Kolega Enterprise)
Highest precision: 93.2% (Grok 4.20)

Leaderboard (ranked by active metric)
Columns: Scanner, F3, Recall %, Precision %, Repos, Cost $

Charts
Precision vs. recall
Performance vs. cost (F3 vs. cost)
Recall ranking (fraction of vulnerabilities found)
Precision ranking (fraction of flags that were real)
By category (three-tier summary)
Detection by vulnerability class (recall %, best by approach)

▸ LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, and insecure deserialization.
▸ Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.

Dataset composition
697 vulnerabilities · 120 FP traps · 26 repositories

Findings: 697 real vulnerabilities, 120 FP traps (14.7%)
CWE families: 18
Python LOC: 20,062

Frameworks (26 repos)
Flask: 15
Django: 3
FastAPI: 3
aiohttp: 1
Tornado: 1
custom: 3

Scanner categories
GP-LLM: 19
Rule SAST: 3
Sec.-spec.: 2

Scanners tested: 24

All figures are live RealVuln results across 24 scanners and 26 repositories. F3 is the F-beta score with β = 3, so recall carries nine times (β²) the weight of precision; strict mode counts every repository a scanner did not finish as a miss. Cost is the total spend for the scored run (rule-based tools are free or variably priced).
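
The metric description above can be sketched in a few lines of Python. This is a minimal illustration, not RealVuln's actual scoring code: it assumes F3 is the standard F-beta score with beta = 3 (which is what "nine times" implies), that strict mode turns every vulnerability in an unfinished repository into a false negative, and that counts are pooled across repositories before scoring. The names `f_beta`, `RepoResult`, and `strict_f_beta` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta: recall is weighted beta**2 times as heavily as precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # no true positives at all
    return (1 + b2) * precision * recall / denom


@dataclass
class RepoResult:
    """Hypothetical per-repository outcome for one scanner."""
    total_vulns: int           # ground-truth vulnerabilities in the repository
    true_pos: Optional[int]    # None means the scanner did not finish this repo
    false_pos: Optional[int]   # flags that were not real vulnerabilities


def strict_f_beta(results: Iterable[RepoResult], beta: float = 3.0) -> float:
    """Assumed strict mode: an unfinished repository counts entirely as misses."""
    tp = fp = fn = 0
    for r in results:
        if r.true_pos is None:
            fn += r.total_vulns  # nothing found, so everything is a miss
            continue
        tp += r.true_pos
        fp += r.false_pos or 0
        fn += r.total_vulns - r.true_pos
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f_beta(precision, recall, beta)
```

Under these assumptions the recall weighting is easy to see: `f_beta(0.5, 1.0)` is about 0.909, while swapping the two to `f_beta(1.0, 0.5)` gives about 0.526. Whether RealVuln pools counts across repositories or averages per-repository scores is not stated on this page, so treat the pooling here as a guess.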
