Brain the Size of a Planet: Are LLMs Thonking Too Hard?

It looks like higher reasoning effort (and even a later model) is not always better for triaging security results. I continued Kurt's experiments from "Needles and haystacks: Can open-source & flagship models do what Mythos did?" with 26 distinct claude-4.6/4.7 and gpt-5.4/5.5 combinations with different context window sizes and reasoning efforts.

Summary

Just pass everything to gpt-5.4 med/high and hope for the best :) [1]

A four-LLM triage council worked much better than I expected. 86.2% of rows got a unanimous vote, and only 2.8% (59) had no majority. An odd-numbered LLM council is probably better.

Higher reasoning is generally better, but not for every model. Low reasoning effort was the worst setting for almost every model (in the table, claude-4.6 med scores slightly below claude-4.6 low). gpt-5.5-med performed better than high/xhigh.

Most LLMs could find some part of the bugs (70.8% success rate). Exception: openbsd-sack when the entire file was passed to the LLM (1.7% success rate).

Almost no LLM got a full solve (1.9% success rate). No LLM could spell out the entire chain when given the entire openbsd-sack file. One full solve in the entire experiment by gpt-5.5-med given the entire freebsd-nfs-vuln file.

Performance was much better at function level (the LLM was given only the function).

Higher reasoning efforts have higher content filtering rates. I got lucky in this iteration: claude-4.7-1m had 15% and 21% content filtering rates in previous experiments.

Only the claudes mentioned CVEs in their analysis. Estimated cost for this iteration was around $2300. Total cost for all iterations was roughly $9200.
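The council tally from the summary (unanimous, majority, no majority) can be sketched in a few lines. This is a minimal illustration, not the code from the repo: the function name council_verdict and the strict-majority rule (one label must get more than half of the votes) are assumptions.

```python
from collections import Counter


def council_verdict(votes):
    """Classify one row's council votes.

    Assumption: "majority" means a single label has strictly more than
    half of the votes. The real triage rule may differ.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    if count == len(votes):
        return ("unanimous", label)
    if count * 2 > len(votes):
        return ("majority", label)
    # Covers even splits (2-2) and scattered votes (2-1-1, 1-1-1).
    return ("no_majority", None)


# Four judges can deadlock 2-2; five judges cannot split evenly two ways.
print(council_verdict(["FULL", "FULL", "MISS", "MISS"]))
print(council_verdict(["FULL", "FULL", "FULL", "MISS", "MISS"]))
```

Under this rule an odd-sized council removes two-way ties, though a three-way split (for example 1-1-1 with three judges) still ends without a majority.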

Here I am, brain the size of a planet and they ask me to triage a bug! Source: Hitchhiker's Guide to the Galaxy BBC TV series. The movie is good (this is better).

The Big Table

Scores and important stats for those who just want the answers. Cell format: score-full%-found%.

score: mean normalized score across all rows in that slice.

full %: percentage of rows with the complete chain. Complete labels: FULL_3 for openbsd-sack, FULL for freebsd-nfs-vuln.

found %: percentage of rows with any partial or complete chain. Partial labels: TWO_COMP and ONE_COMP for openbsd-sack, PARTIAL_MECH for freebsd-nfs-vuln.

BROAD, SECONDARY, MISS, NULL, and NO_MAJORITY count as zero. NULL responses and content filters are counted. Sorted by overall score, top 3 in bold. See the companion file for a bigger version of the table with more stats: https://github.com/parsiya/mythos-bench-copilot/tree/main/artifacts/README.md.
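To make the cell format concrete, here is a small sketch that turns a list of triage labels into a score-full%-found% string. The weights are an assumption: they are inferred from the published numbers (1/3 per openbsd-sack component, 0.5 for PARTIAL_MECH), not taken from the repo, and the function name cell is hypothetical. The label counts in the usage lines are made up to illustrate the arithmetic; the post does not give the real per-label breakdown.

```python
# Assumed weights, inferred from the table; not the author's scoring code.
WEIGHTS = {
    "FULL_3": 1.0, "TWO_COMP": 2 / 3, "ONE_COMP": 1 / 3,  # openbsd-sack
    "FULL": 1.0, "PARTIAL_MECH": 0.5,                     # freebsd-nfs-vuln
}
FULL_LABELS = {"FULL_3", "FULL"}


def cell(labels):
    """Format one table cell as score-full%-found% for a slice of rows."""
    n = len(labels)
    if n == 0:
        raise ValueError("empty slice")
    # Labels without a weight (BROAD, SECONDARY, MISS, NULL, NO_MAJORITY)
    # score zero but still count in the denominator.
    score = sum(WEIGHTS.get(label, 0.0) for label in labels) / n
    full = 100 * sum(label in FULL_LABELS for label in labels) / n
    found = 100 * sum(label in WEIGHTS for label in labels) / n
    return f"{score:.3f}-{full:.1f}%-{found:.1f}%"


# Hypothetical 40-row slices:
print(cell(["FULL"] * 12 + ["PARTIAL_MECH"] * 28))            # 0.650-30.0%-100.0%
print(cell(["ONE_COMP"] * 20 + ["TWO_COMP"] + ["MISS"] * 19))  # 0.183-0.0%-52.5%
```

Since each target has the same number of rows per cell, the Overall column in the table matches the plain average of the two target cells.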

| Model | Effort | Overall | openbsd-sack | freebsd-nfs-vuln |
|---|---|---|---|---|
| gpt-5.4 | xhigh | **0.417-15.0%-76.2%** | 0.183-0.0%-52.5% | **0.650-30.0%-100.0%** |
| gpt-5.4 | high | **0.371-7.5%-73.8%** | 0.167-0.0%-47.5% | **0.575-15.0%-100.0%** |
| claude-4.7-1m | high | **0.365-2.5%-77.5%** | **0.217-2.5%-55.0%** | 0.512-2.5%-100.0% |
| gpt-5.5 | med | 0.360-7.5%-72.5% | 0.158-0.0%-47.5% | **0.562-15.0%-97.5%** |
| gpt-5.4 | med | 0.350-2.5%-76.2% | 0.175-0.0%-52.5% | 0.525-5.0%-100.0% |
| claude-4.8 | xhigh | 0.348-1.2%-73.8% | **0.208-2.5%-50.0%** | 0.487-0.0%-97.5% |
| claude-4.7 | high | 0.346-0.0%-75.0% | 0.192-0.0%-50.0% | 0.500-0.0%-100.0% |
| claude-4.6 | high | 0.342-0.0%-75.0% | 0.183-0.0%-50.0% | 0.500-0.0%-100.0% |
| claude-4.7 | xhigh | 0.340-0.0%-72.5% | **0.192-0.0%-47.5%** | 0.487-0.0%-97.5% |
| gpt-5.4 | low | 0.340-1.2%-75.0% | 0.167-0.0%-50.0% | 0.512-2.5%-100.0% |
| claude-4.7-1m | xhigh | 0.335-0.0%-72.5% | 0.183-0.0%-47.5% | 0.487-0.0%-97.5% |
| claude-4.6-1m | high | 0.333-0.0%-75.0% | 0.167-0.0%-50.0% | 0.500-0.0%-100.0% |
| claude-4.6 | low | 0.329-0.0%-73.8% | 0.158-0.0%-47.5% | 0.500-0.0%-100.0% |
| gpt-5.5 | high | 0.327-1.2%-72.5% | 0.167-0.0%-50.0% | 0.487-2.5%-95.0% |
| gpt-5.5 | xhigh | 0.327-0.0%-73.8% | 0.167-0.0%-50.0% | 0.487-0.0%-97.5% |
| claude-4.6 | med | 0.325-0.0%-72.5% | 0.150-0.0%-45.0% | 0.500-0.0%-100.0% |
| gpt-5.5 | low | 0.325-8.8%-61.2% | 0.100-0.0%-30.0% | 0.550-17.5%-92.5% |
| claude-4.6-1m | med | 0.321-0.0%-71.2% | 0.142-0.0%-42.5% | 0.500-0.0%-100.0% |
| claude-4.8 | high | 0.319-0.0%-71.2% | 0.175-0.0%-50.0% | 0.463-0.0%-92.5% |
| claude-4.7 | med | 0.310-0.0%-70.0% | 0.158-0.0%-47.5% | 0.463-0.0%-92.5% |
| claude-4.8 | med | 0.306-0.0%-68.8% | 0.175-0.0%-50.0% | 0.438-0.0%-87.5% |
| claude-4.7-1m | med | 0.298-0.0%-66.2% | 0.158-0.0%-45.0% | 0.438-0.0%-87.5% |
| claude-4.8 | low | 0.292-0.0%-66.2% | 0.158-0.0%-47.5% | 0.425-0.0%-85.0% |
| claude-4.7 | low | 0.279-1.2%-61.2% | 0.133-0.0%-40.0% | 0.425-2.5%-82.5% |
| claude-4.6-1m | low | 0.275-0.0%-57.5% | 0.050-0.0%-15.0% | 0.500-0.0%-100.0% |
| Iterations per cell | | 80 | 40 | 40 |

claudvicular was tokenmaxxing when gpt-5.4 triagemogged him and spiked his cortisol level

I am proud of inventing claudvicular, so it stays in the blog regardless of feedback. If you don't get this reference, you are very lucky. Stay innocent and do not seek further knowledge. Seriously, don't click [2]!

More info: Code at parsiya/mythos-bench-copilot. Results and other artifacts at parsiya/mythos-bench-copilot/artifacts, including all prompts, responses, and triages in JSON (data format).

.nfo

[greetz]

GitHub for giving us unlimited tokens.

Short story: The Machine Stops by E. M. Forster. A glimpse into the near future w/o token subsidies.

Music: Robinson by Spitz. Bonus music: Labyrinth by Mondo Grosso.

Motivation

Why not use the free token era to cosplay as an academic instead of formatting my book reviews? A few weeks ago (this experiment actually started early May) I attended...
