Brain the Size of a Planet: Are LLMs Thonking Too Hard?


It looks like higher reasoning effort (and even a later model) is not always better for triaging security results.

I continued Kurt's experiments from "Needles and haystacks: Can open-source & flagship models do what Mythos did?" with 26 distinct claude-4.6/4.7 and gpt-5.4/5.5 combinations with different context window sizes and reasoning efforts.

Summary

- Just pass everything to gpt-5.4 med/high and hope for the best :) [1]
- A four-LLM triage council worked much better than I expected.
  - 86.2% unanimous votes, with only 2.8% (59) without a majority.
  - An odd-numbered LLM council is probably better.

- Higher reasoning is generally better, but not for every model.
  - Low reasoning effort was the worst setting for every model.
  - gpt-5.5-med performed better than high/xhigh.

- Most LLMs could find some part of the bugs (70.8% success rate).
  - Exception: openbsd-sack when the entire file was passed to the LLM (1.7% success rate).

- Almost no LLM got a full solve (1.9% success rate).
  - No LLM could spell out the entire chain when given the entire openbsd-sack file.
  - One full solve in the entire experiment by gpt-5.5-med given the entire freebsd-nfs-vuln file.

- Performance was much better at function level (the LLM got just the function).
  - memes/he just like me fr.png.

- Higher reasoning efforts have higher content filtering rates.
  - Got lucky in this iteration. claude-4.7-1m had 15% and 21% content filtering rates in previous experiments.

- Only the claudes mentioned CVEs in their analysis.
- Estimated cost for this iteration was around $2300. Total cost for all iterations was roughly $9200.
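The council result in the summary boils down to counting votes. Here is a minimal sketch, assuming each judge emits one label per finding and that "majority" means strictly more than half of the votes. The post does not spell out the exact rule, and `council_verdict` and `is_unanimous` are hypothetical names, not functions from the repo.

```python
from collections import Counter

# Label name taken from the table legend; the voting rule itself is an assumption.
NO_MAJORITY = "NO_MAJORITY"


def council_verdict(votes):
    """Return the label backed by a strict majority of votes, else NO_MAJORITY."""
    if not votes:
        return NO_MAJORITY
    label, count = Counter(votes).most_common(1)[0]
    # Strict majority: more than half of the council must agree on one label.
    return label if count > len(votes) / 2 else NO_MAJORITY


def is_unanimous(votes):
    """True when every judge returned the same label."""
    return len(votes) > 0 and len(set(votes)) == 1


if __name__ == "__main__":
    print(council_verdict(["FULL", "FULL", "MISS", "MISS"]))  # 2-2 split
    print(council_verdict(["FULL", "FULL", "MISS"]))          # 2-1 split
```

Under this rule a four-judge 2-2 split lands in NO_MAJORITY, while a three-judge council cannot tie between two labels. A three-way split can still leave an odd-sized council without a majority.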

Here I am, brain the size of a planet and they ask me to triage a bug!

Source: Hitchhiker's Guide to the Galaxy BBC TV series. The movie is good (this is better).

The Big Table

Scores and important stats for those who just want the answers.

Cell format: score-full%-found%.

- score: mean normalized score across all rows in that slice.
- full %: percentage of rows with the complete chain.
  - openbsd-sack: FULL_3
  - freebsd-nfs-vuln: FULL
- found %: percentage of rows with any partial or complete chain.
  - openbsd-sack: TWO_COMP, ONE_COMP
  - freebsd-nfs-vuln: PARTIAL_MECH
- BROAD, SECONDARY, MISS, NULL, and NO_MAJORITY count as zero.
- NULL responses and content filters are counted.
- Sorted by overall score, top 3 in bold.
- See the companion file for a bigger version of the table with more stats: https://github.com/parsiya/mythos-bench-copilot/tree/main/artifacts/README.md.
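The cell format described above can be sketched in a few lines. The label names come from the legend, but the per-label weights are my assumption, inferred from the published numbers (for example, a freebsd-nfs-vuln cell with 0% full and 100% found scores 0.500). They are not read from the repo, and `summarize` and `format_cell` are hypothetical helpers.

```python
# Assumed weights, inferred from the published table rather than the repo.
# Any label not listed here (BROAD, SECONDARY, MISS, NULL, NO_MAJORITY) counts as zero.
WEIGHTS = {
    "FULL_3": 1.0,        # openbsd-sack complete chain
    "TWO_COMP": 2 / 3,    # openbsd-sack, two components (assumed weight)
    "ONE_COMP": 1 / 3,    # openbsd-sack, one component (assumed weight)
    "FULL": 1.0,          # freebsd-nfs-vuln complete chain
    "PARTIAL_MECH": 0.5,  # freebsd-nfs-vuln partial mechanism (assumed weight)
}
FULL_LABELS = {"FULL_3", "FULL"}


def summarize(labels):
    """Return (mean score, full %, found %) for one slice of triage labels."""
    n = len(labels)
    if n == 0:
        raise ValueError("empty slice")
    score = sum(WEIGHTS.get(label, 0.0) for label in labels) / n
    full_pct = 100 * sum(label in FULL_LABELS for label in labels) / n
    # "found" covers any partial or complete chain, so full solves count too.
    found_pct = 100 * sum(label in WEIGHTS for label in labels) / n
    return score, full_pct, found_pct


def format_cell(labels):
    """Render a slice in the score-full%-found% cell format."""
    score, full_pct, found_pct = summarize(labels)
    return f"{score:.3f}-{full_pct:.1f}%-{found_pct:.1f}%"


if __name__ == "__main__":
    # Made-up 40-row slice: 12 full solves and 28 partial mechanisms.
    rows = ["FULL"] * 12 + ["PARTIAL_MECH"] * 28
    print(format_cell(rows))  # 0.650-30.0%-100.0%
```

The 12/28 split is an invented input that happens to reproduce one cell from the table. The real row-level labels live in the JSON artifacts.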

| Model | Effort | Overall | openbsd-sack | freebsd-nfs-vuln |
|---|---|---|---|---|
| gpt-5.4 | xhigh | 0.417-15.0%-76.2% | 0.183-0.0%-52.5% | 0.650-30.0%-100.0% |
| gpt-5.4 | high | 0.371-7.5%-73.8% | 0.167-0.0%-47.5% | 0.575-15.0%-100.0% |
| claude-4.7-1m | high | 0.365-2.5%-77.5% | 0.217-2.5%-55.0% | 0.512-2.5%-100.0% |
| gpt-5.5 | med | 0.360-7.5%-72.5% | 0.158-0.0%-47.5% | 0.562-15.0%-97.5% |
| gpt-5.4 | med | 0.350-2.5%-76.2% | 0.175-0.0%-52.5% | 0.525-5.0%-100.0% |
| claude-4.8 | xhigh | 0.348-1.2%-73.8% | 0.208-2.5%-50.0% | 0.487-0.0%-97.5% |
| claude-4.7 | high | 0.346-0.0%-75.0% | 0.192-0.0%-50.0% | 0.500-0.0%-100.0% |
| claude-4.6 | high | 0.342-0.0%-75.0% | 0.183-0.0%-50.0% | 0.500-0.0%-100.0% |
| claude-4.7 | xhigh | 0.340-0.0%-72.5% | 0.192-0.0%-47.5% | 0.487-0.0%-97.5% |
| gpt-5.4 | low | 0.340-1.2%-75.0% | 0.167-0.0%-50.0% | 0.512-2.5%-100.0% |
| claude-4.7-1m | xhigh | 0.335-0.0%-72.5% | 0.183-0.0%-47.5% | 0.487-0.0%-97.5% |
| claude-4.6-1m | high | 0.333-0.0%-75.0% | 0.167-0.0%-50.0% | 0.500-0.0%-100.0% |
| claude-4.6 | low | 0.329-0.0%-73.8% | 0.158-0.0%-47.5% | 0.500-0.0%-100.0% |
| gpt-5.5 | high | 0.327-1.2%-72.5% | 0.167-0.0%-50.0% | 0.487-2.5%-95.0% |
| gpt-5.5 | xhigh | 0.327-0.0%-73.8% | 0.167-0.0%-50.0% | 0.487-0.0%-97.5% |
| claude-4.6 | med | 0.325-0.0%-72.5% | 0.150-0.0%-45.0% | 0.500-0.0%-100.0% |
| gpt-5.5 | low | 0.325-8.8%-61.2% | 0.100-0.0%-30.0% | 0.550-17.5%-92.5% |
| claude-4.6-1m | med | 0.321-0.0%-71.2% | 0.142-0.0%-42.5% | 0.500-0.0%-100.0% |
| claude-4.8 | high | 0.319-0.0%-71.2% | 0.175-0.0%-50.0% | 0.463-0.0%-92.5% |
| claude-4.7 | med | 0.310-0.0%-70.0% | 0.158-0.0%-47.5% | 0.463-0.0%-92.5% |
| claude-4.8 | med | 0.306-0.0%-68.8% | 0.175-0.0%-50.0% | 0.438-0.0%-87.5% |
| claude-4.7-1m | med | 0.298-0.0%-66.2% | 0.158-0.0%-45.0% | 0.438-0.0%-87.5% |
| claude-4.8 | low | 0.292-0.0%-66.2% | 0.158-0.0%-47.5% | 0.425-0.0%-85.0% |
| claude-4.7 | low | 0.279-1.2%-61.2% | 0.133-0.0%-40.0% | 0.425-2.5%-82.5% |
| claude-4.6-1m | low | 0.275-0.0%-57.5% | 0.050-0.0%-15.0% | 0.500-0.0%-100.0% |
| Iterations per cell | | 80 | 40 | 40 |

claudvicular was tokenmaxxing when gpt-5.4 triagemogged him and spiked his cortisol level

I am proud of inventing claudvicular, so it stays in the blog regardless of feedback. If you don't get this reference, you are very lucky. Stay innocent and do not seek further knowledge. Seriously, don't click [2]!

More info:

- Code at parsiya/mythos-bench-copilot.
- Results and other artifacts at parsiya/mythos-bench-copilot/artifacts, including all prompts, responses and triages in JSON (data format).

.nfo

[greetz]

- GitHub for giving us unlimited tokens.
- Short story: The Machine Stops by E. M. Forster. A glimpse into the near future w/o token subsidies.
- Music: Robinson by Spitz.
- Bonus music: Labyrinth by Mondo Grosso.

Motivation

Why not use the free token era to cosplay as an academic instead of formatting my book reviews?

A few weeks ago (this experiment actually started early May) I attended...
