Show HN: We're inviting Anthropic to put the real Mythos 5 on our open benchmark

Dashboard — RealVuln


Benchmark dashboard 24 scanners · 26 repositories · ranked by F3 (strict)


Scanners: 24 (3 categories)

Repositories: 26 (Python · Type 1)

Best F3 (strict): 92.4 (Kolega Enterprise)

Highest recall: 95.3% (Kolega Enterprise)

Highest precision: 93.2% (Grok 4.20)

Leaderboard (ranked by active metric)

Columns: Scanner · F3 · Recall % · Precision % · Repos · Cost ($)

Charts:

Precision vs. recall

Performance vs. cost (F3 vs. cost)

Recall ranking (fraction of vulnerabilities found)

Precision ranking (fraction of flags that were real)

By category (three-tier summary)

Detection by vulnerability class (recall %, best by approach)

▸ LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, insecure deserialization.

▸ Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.

Dataset composition: 697 vulnerabilities · 120 FP traps · 26 repositories

Findings: 697 real vulnerabilities · 120 FP traps (14.7%)

18 CWE families

20,062 Python LOC

Frameworks (26 repos)

Flask: 15

Django: 3

FastAPI: 3

aiohttp: 1

Tornado: 1

custom: 3

Scanner categories

GP-LLM: 19

Rule SAST: 3

Sec.-spec.: 2


24 Scanners tested

All figures are live RealVuln results across 24 scanners and 26 repositories. F3 weights recall nine times over precision; strict mode counts unfinished repositories as misses. Cost is the total spend for the scored run (rule-based tools are free or variably priced).
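To make the metric concrete, here is a minimal sketch of how an F3 score with a strict mode could be computed. It uses the standard F-beta formula, where beta = 3 gives recall a weight of beta squared = 9 relative to precision. The `RepoResult` type, the `score` function, and the choice to pool counts across repositories before computing precision and recall are assumptions made for illustration; the dashboard does not state how RealVuln aggregates results, and the published scores appear to be scaled to 0-100 rather than 0-1.

```python
from dataclasses import dataclass


def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; beta=3 weights recall by beta**2 = 9."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


@dataclass
class RepoResult:
    """Hypothetical per-repository tally for one scanner."""
    known_vulns: int        # labeled real vulnerabilities in the repo
    true_positives: int     # real vulnerabilities the scanner flagged
    false_positives: int    # flags that were not real (e.g. FP traps)
    finished: bool = True   # did the scanner complete this repo?


def score(results, beta: float = 3.0, strict: bool = True):
    """Return (precision, recall, F-beta) over pooled counts.

    Assumption: in strict mode, every known vulnerability in an
    unfinished repository counts as a miss; otherwise unfinished
    repositories are skipped entirely.
    """
    tp = fp = fn = 0
    for r in results:
        if r.finished:
            tp += r.true_positives
            fp += r.false_positives
            fn += r.known_vulns - r.true_positives
        elif strict:
            fn += r.known_vulns
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f_beta(precision, recall, beta)


if __name__ == "__main__":
    runs = [
        RepoResult(known_vulns=10, true_positives=8, false_positives=2),
        RepoResult(known_vulns=10, true_positives=0, false_positives=0,
                   finished=False),
    ]
    print("strict:  ", score(runs, strict=True))
    print("lenient: ", score(runs, strict=False))
```

In this toy input, the unfinished repository halves recall under strict mode while leaving precision unchanged, which is why a scanner that times out on several repositories can rank much lower on F3 (strict) than its per-repository quality would suggest.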

