Harnesses and post-training close the open-weight bug-finding gap

How harnesses and post-training close the open-weight bug-finding gap — Vincenzo Iozzo

Key findings

Open-weight models trail Opus on harder artifacts, but a good harness closes most of the gap.

Post-training matters more than architecture.

GLM-5.1, built on the same base model as GLM-5, is the standout, matching Opus across the board.

In this post

The setup

Results with Claude Code

Can the harness make a difference?

On architecture

GLM-5.0 vs GLM-5.1

Conclusion

Policy implications

After the Mythos announcement, there has been a lot of discussion in both industry and government about export controls and the delta between Mythos and other models in terms of offensive cyber capabilities.

Open-weight models are of particular interest for several reasons:

Being open-weight allows attackers to run the models locally, potentially bypassing any form of oversight that companies could have.

Open-weight models’ raw capabilities should be considered the floor, not the ceiling, of what offensive actors can do. In other words, sophisticated actors could augment these models via fine-tuning or other techniques to improve their efficacy.

Following up on the previous blog post on Opus and its ability to find vulnerabilities, I want to understand how these models perform against crackaddr and how they compare with closed models.

In this blog post, I aim to answer the following questions:

How far away are open models from SOTA when it comes to bug finding?

Does the harness matter, and can a good harness significantly improve the bug-finding capabilities of a model?

Are there architectural decisions that make models better at bug finding?

To do so, I compare DeepSeek V4 pro, Qwen3.5-397B-A17B, Kimi K2.6, GLM-5, and GLM-5.1 against Opus 4.7. For the testing, I use together.ai for DeepSeek, Qwen, and GLM, and Cloudflare for Kimi.

Before moving forward, a note on methodology: for the harness-assisted runs (a harness being the scaffolding and workflow that orchestrates the model’s analysis), I ran each test 10 times for each model and sample and limited each harness round to 4 turns. With only 10 runs per combination, the usual caveat about statistical significance applies. Further, the harness uses Claude Code under the hood to invoke the models, and I proxy the calls using LiteLLM. This means that the temperature of the models is 1 (the Claude Code default) and that non-Anthropic models could be at a slight disadvantage vs Opus.
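To put a rough number on the statistical caveat: with 10 runs per model and sample, even a clear-looking success rate carries a wide uncertainty band. The sketch below computes a Wilson score interval for a success count. The function name and the 7-out-of-10 figure in the note are illustrative assumptions, not results from these experiments.

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a success rate.

    `successes` out of `runs` is a hypothetical tally, e.g. how many of
    10 harness runs found the bug. z = 1.96 corresponds to ~95% coverage.
    """
    if runs <= 0 or not 0 <= successes <= runs:
        raise ValueError("need runs > 0 and 0 <= successes <= runs")
    p = successes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point spill at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

For a hypothetical 7 successes in 10 runs, `wilson_interval(7, 10)` gives an interval of roughly 0.40 to 0.89, so two models that differ by a couple of successes per cell are hard to tell apart at this sample size.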

I use crackaddr and a few variants of it for a few different reasons:

The bug is self-contained, so it reduces the odds of the context window making a significant impact, yet at the same time, the bug is not an easy pattern to match compared to other bug types.

The state machine for crackaddr is complex enough to test the model’s reasoning capabilities.

I’m trying to avoid in-corpus results that might overestimate the capabilities of a model.
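For readers who have not seen the bug, here is a rough Python model of the kind of state machine the second point refers to. It is loosely patterned on the widely circulated simplified version of crackaddr, not on the exact artifacts tested here. The 200-byte buffer, the 10-byte slack, and all names are assumptions for illustration.

```python
BUFFER_SIZE = 200   # assumed buffer size, for illustration only
SLACK = 10          # assumed safety margin below the end of the buffer


def simulate_copy(data: str) -> tuple[int, int]:
    """Model the copy loop; return (bytes_written, final_limit)."""
    limit = BUFFER_SIZE - SLACK
    written = 0
    in_angle = False
    in_paren = False
    for ch in data:
        if ch == "<" and not in_angle:
            in_angle = True
            limit -= 1          # reserve room for the closing '>'
        if ch == ">" and in_angle:
            in_angle = False
            limit += 1          # give the reserved byte back
        if ch == "(" and not in_angle and not in_paren:
            in_paren = True     # bug: no matching 'limit -= 1' here
        if ch == ")" and not in_angle and in_paren:
            in_paren = False
            limit += 1          # ...yet the limit still grows on ')'
        if written < limit:
            written += 1        # stands in for '*dst++ = ch' in C
    return written, limit
```

In this model, plain input such as `"A" * 1000` stops at 190 bytes, while a prefix of repeated `"()"` pairs raises the limit by one per pair until writes run past the 200-byte buffer. No single line is an obvious overflow: the bug only shows up when you track how the limit drifts across state transitions.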

The setup

As a reminder from the previous blog post, I use four variants of the well-known crackaddr vulnerability.

Artifact | Format
Original crackaddr() | C source
Halvar rewrite | C source
Compiled | ARM64 Mach-O
Tigress-obfuscated, stripped | ARM64 Mach-O

To test the models in a setting as close as possible to the one used for Opus, I used a patched version of LiteLLM to route the requests from Claude Code to the different models I tested.

Results with Claude Code

With the harness in place, I aimed to answer two questions:

Can the tested models match Opus performance?

How do they find the bug?

The results are fairly surprising in that all tested models perform significantly worse than Opus, as shown below.

Table 1: Which models find the bug

Artifact | DeepSeek V4 pro | Qwen3.5-397B-A17B | Kimi K2.6 | GLM-5 | GLM-5.1 | Opus 4.7

Original crackaddr()

Halvar rewrite

Compiled

Tigress-obfuscated, stripped

Table 2: Which models recognize crackaddr

Artifact | DeepSeek V4 pro | Qwen3.5-397B-A17B | Kimi K2.6 | GLM-5 | GLM-5.1 | Opus 4.7

Original crackaddr()

Halvar rewrite

Compiled

Tigress-obfuscated, stripped

Excluding GLM-5.1, which I’ll cover in greater depth later on, the models not only perform significantly worse than Opus but also seem unable to recognize the pattern in the other artifacts, even though crackaddr was clearly in their training corpus: the original version was recognized by all models.

How do the OSS models find bugs?

The other striking fact was how the models found the bugs and how they produced the crashing input:

All tested models take significantly more turns than Claude to find the bugs in every case except the original crackaddr, suggesting that they actually reason about all the other samples.

The models are consistently less eager than Opus to build an oracle (a test program that independently verifies whether a given input triggers the bug), even when they are not allowed to actually run the samples.

DeepSeek spends significant time and turns minimizing the crashing input rather than just finding it.

The models resort to fuzzing a lot earlier than Opus does.
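As a concrete picture of what such an oracle can look like, the sketch below pairs a small overflow check with a greedy input minimizer, the step that DeepSeek spends so many turns on. It models a simplified crackaddr-style copy loop rather than any of the tested artifacts. The buffer size, the slack, and the function names are all assumptions.

```python
from typing import Callable


def overflows(data: str, size: int = 200, slack: int = 10) -> bool:
    """Oracle: True if the modeled copy loop writes past `size` bytes."""
    limit, written = size - slack, 0
    in_angle = in_paren = False
    for ch in data:
        if ch == "<" and not in_angle:
            in_angle, limit = True, limit - 1
        elif ch == ">" and in_angle:
            in_angle, limit = False, limit + 1
        elif ch == "(" and not in_angle and not in_paren:
            in_paren = True                      # limit is not lowered here
        elif ch == ")" and not in_angle and in_paren:
            in_paren, limit = False, limit + 1   # but it is raised here
        if written < limit:
            written += 1
    return written > size


def minimize(data: str, oracle: Callable[[str], bool]) -> str:
    """Greedily delete chunks of the input while the oracle still fires."""
    if not oracle(data):
        raise ValueError("input does not trigger the bug")
    chunk = max(1, len(data) // 2)
    while chunk >= 1:
        i = 0
        while i < len(data):
            candidate = data[:i] + data[i + chunk:]
            if oracle(candidate):
                data = candidate      # keep the smaller input
            else:
                i += chunk            # this chunk is needed; move on
        chunk //= 2
    return data
```

Calling `minimize("()" * 100 + "A" * 1000, overflows)` returns a shorter input that still trips the oracle. The minimizer is a simple greedy pass, not a guaranteed-minimal search, but it shows why having an oracle makes both verification and minimization cheap compared with reasoning about each candidate by hand.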

Overall, it appears as if the models...
