How harnesses and post-training close the open-weight bug-finding gap — Vincenzo Iozzo
Key findings
Open-weight models trail Opus on harder artifacts, but a good harness closes most of the gap.
Post-training matters more than architecture.
GLM-5.1, the same base model as GLM-5, is the standout, matching Opus across the board.
In this post
The setup
Results with Claude Code
Can the harness make a difference?
On architecture
GLM-5.0 vs GLM-5.1
Conclusion
Policy implications
After the Mythos announcement, there has been a lot of discussion in both the industry and government about export controls and the delta between Mythos and other models in terms of offensive cyber capabilities.
Open-weight models are of particular interest for several reasons:
Being open-weight allows attackers to run the models locally, potentially bypassing any form of oversight that companies could have.
Open-weight models’ raw capabilities should be considered the floor, not the ceiling, of what offensive actors can do. In other words, sophisticated actors could augment these models via fine-tuning or other techniques to improve their efficacy.
Following up on the previous blog post on Opus and its ability to find vulnerabilities, I want to understand how open-weight models perform against crackaddr so that I can compare them with closed models.
In this blog post, I aim to answer the following questions:
How far away are open models from SOTA when it comes to bug finding?
Does the harness matter, and can a good harness significantly improve the bug-finding capabilities of a model?
Are there architectural decisions that make models better at bug finding?
To do so, I compare DeepSeek V4 pro, Qwen3.5-397B-A17B, Kimi K2.6, GLM-5, and GLM-5.1 against Opus 4.7. For the testing, I use together.ai for DeepSeek, Qwen, and GLM, and Cloudflare for Kimi.
Before moving forward, a note on methodology: for the harness-assisted runs (a harness being the scaffolding and workflow that orchestrates the model’s analysis), I ran each test 10 times for each model and sample and limited each harness round to 4 turns. The usual caveat about statistical significance applies here. Further, the harness uses Claude Code under the hood to invoke the models, and I proxy the calls using LiteLLM. This means that the temperature of the models is 1 (the Claude Code default) and that non-Anthropic models could be at a slight disadvantage vs Opus.
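To make the statistical caveat concrete, here is a minimal sketch of how wide the uncertainty is with 10 runs per cell. It computes a Wilson score interval for a success rate. The function name `wilson_interval` and the 7-out-of-10 figure are illustrative assumptions, not numbers from these experiments.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: a model that finds the bug in 7 of 10 runs.
low, high = wilson_interval(7, 10)
print(f"7/10 successes: plausible true rate between {low:.2f} and {high:.2f}")
```

For a hypothetical 7 out of 10, the interval spans roughly 0.40 to 0.89, so two models whose counts differ by one or two runs are hard to tell apart at this sample size.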
I use crackaddr and a few variants of it for a few different reasons:
The bug is self-contained, which reduces the odds of the context window making a significant impact. At the same time, it is not as easy to pattern-match as other bug types.
The state machine for crackaddr is complex enough to test the model’s reasoning capabilities.
I’m trying to avoid in-corpus results that might overestimate the capabilities of a model.
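For readers unfamiliar with the bug class, the sketch below is a toy Python model of a crackaddr-style copy loop, loosely modeled on the well-known simplified rewrite and written from memory. It is not the source of any artifact tested here; the names `simulate_copy` and `BUFFER_SIZE` and the buffer size of 200 are assumptions chosen for illustration. The point is that the overflow only emerges from the interaction between parser state and a drifting bound, which is why it resists simple pattern matching.

```python
BUFFER_SIZE = 200  # assumed size of the fixed output buffer


def simulate_copy(data: str) -> tuple[int, int]:
    """Model the copy loop; return (final upper limit, highest index written)."""
    upper_limit = BUFFER_SIZE - 10
    in_angle = False
    in_paren = False
    out_index = 0
    max_written = -1
    for ch in data:
        if ch == "<" and not in_angle:
            in_angle = True
            upper_limit -= 1
        if ch == ">" and in_angle:
            in_angle = False
            upper_limit += 1
        if ch == "(" and not in_angle and not in_paren:
            in_paren = True  # the flaw: the limit is not decremented here
        if ch == ")" and not in_angle and in_paren:
            in_paren = False
            upper_limit += 1  # ...but it is incremented on the way out
        if out_index < upper_limit:
            max_written = out_index  # in C this would be a write into the buffer
            out_index += 1
    return upper_limit, max_written


limit, highest = simulate_copy("()" * 150)
print("limit drifted to", limit, "and highest index written is", highest)
print("out of bounds:", highest >= BUFFER_SIZE)
```

Balanced angle brackets leave the limit unchanged, while every `()` pair raises it by one, so a long enough run of parentheses pushes writes past the end of the buffer.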
The setup
As a reminder from the previous blog post, I had 4 variants of the well-known crackaddr vulnerability.
Artifact | Format
Original crackaddr() | C source
Halvar rewrite | C source
Compiled | ARM64 Mach-O
Tigress-obfuscated, stripped | ARM64 Mach-O
To test the models in the closest possible setting to Opus, I used a patched version of LiteLLM to route the requests from Claude Code to the different models I tested.
Results with Claude Code
With the harness in place, I aimed to answer two questions:
Can the tested models match Opus performance?
How do they find the bug?
The results are fairly surprising in that all tested models perform significantly worse than Opus, as shown below.
Table 1: Which models find the bug
Artifact | DeepSeek V4 pro | Qwen3.5-397B-A17B | Kimi K2.6 | GLM-5 | GLM-5.1 | Opus 4.7
Original crackaddr()
Halvar rewrite
Compiled
Tigress-obfuscated, stripped
Table 2: Which models recognize crackaddr
Artifact | DeepSeek V4 pro | Qwen3.5-397B-A17B | Kimi K2.6 | GLM-5 | GLM-5.1 | Opus 4.7
Original crackaddr()
Halvar rewrite
Compiled
Tigress-obfuscated, stripped
Excluding GLM-5.1, which I’ll cover in greater depth later on, the models not only perform significantly worse than Opus but also seem unable to recognize the pattern, even though crackaddr was clearly in their training corpus: all models recognized the original version.
How do the OSS models find bugs?
The other striking fact was how the models found the bugs and how<br>they produced the crashing input:
All tested models take significantly more turns than Claude to find the bugs in every case except the original crackaddr, which suggests that they actually reason about the other samples rather than recall them.
The models are consistently less eager than Opus to build an oracle (a test program that independently verifies whether a given input triggers the bug), even when they are not allowed to actually run the samples.
DeepSeek spends significant time and turns minimizing the crashing input rather than just finding it.
The models resort to fuzzing a lot earlier than Opus does.
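To illustrate what an oracle and input minimization look like in practice, here is a small sketch. The oracle `overflows` is a hypothetical stand-in that re-implements a toy bounds check instead of running a sample, and `minimize` is a simple greedy chunk-removal loop. Neither is taken from any model's transcript, and the buffer size is an arbitrary assumption.

```python
from typing import Callable

BUFFER_SIZE = 16  # assumed toy buffer size, kept small for readability


def overflows(data: str) -> bool:
    """Toy oracle: True if the modeled copy loop writes past the buffer."""
    limit, out, in_paren = BUFFER_SIZE - 4, 0, False
    for ch in data:
        if ch == "(" and not in_paren:
            in_paren = True
        elif ch == ")" and in_paren:
            in_paren = False
            limit += 1  # the bound drifts upward on every closed pair
        if out < limit:
            if out >= BUFFER_SIZE:
                return True  # this write would land outside the buffer
            out += 1
    return False


def minimize(data: str, oracle: Callable[[str], bool]) -> str:
    """Greedily drop chunks of shrinking size while the oracle still fires."""
    if not oracle(data):
        raise ValueError("input does not trigger the oracle")
    chunk = max(1, len(data) // 2)
    while chunk >= 1:
        i = 0
        while i < len(data):
            candidate = data[:i] + data[i + chunk:]
            if candidate and oracle(candidate):
                data = candidate  # keep the smaller input, retry at the same offset
            else:
                i += chunk
        chunk //= 2
    return data


original = "a" * 5 + "()" * 40 + "b" * 5
smaller = minimize(original, overflows)
print(len(original), "->", len(smaller), repr(smaller))
```

Having an oracle makes every later step cheap: candidate inputs from reasoning or fuzzing can be checked mechanically, and minimization becomes a loop instead of additional model turns.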
Overall, it appears as if the models...