Harnesses and post-training close the open-weight bug-finding gap

How harnesses and post-training close the open-weight bug-finding gap — Vincenzo Iozzo

Key findings

Open-weight models trail Opus on harder artifacts, but a good harness closes most of the gap.

Post-training matters more than architecture.

GLM-5.1, which shares its base model with GLM-5, is the standout, matching Opus across the board.

In this post

The setup

Results with Claude Code

Can the harness make a difference?

On architecture

GLM-5.0 vs GLM-5.1

Conclusion

Policy implications

Since the Mythos announcement, there has been a lot of discussion in both industry and government about export controls and the gap between Mythos and other models in offensive cyber capabilities.

Open-weight models are of particular interest for two reasons:

Because the weights are public, attackers can run the models locally, potentially bypassing any form of oversight that a hosting company could exercise.

Open-weight models’ raw capabilities should be considered the floor, not the ceiling, of what offensive actors can do. In other words, sophisticated actors could augment these models via fine-tuning or other techniques to improve their efficacy.

Following up on the previous blog post on Opus and its ability to find vulnerabilities, I want to understand how open-weight models perform against crackaddr and how they compare with closed models.

In this blog post, I aim to answer the following questions:

How far away are open models from SOTA when it comes to bug finding?

Does the harness matter, and can a good harness significantly improve the bug-finding capabilities of a model?

Are there architectural decisions that make models better at bug finding?

To do so, I compare DeepSeek V4 pro, Qwen3.5-397B-A17B, Kimi K2.6, GLM-5, and GLM-5.1 vs Opus 4.7. For the testing, I use together.ai for DeepSeek, Qwen, and GLM, and Cloudflare for Kimi.

Before moving forward, a note on methodology: for the harness-assisted runs (a harness being the scaffolding and workflow that orchestrates the model’s analysis), I ran each test 10 times for each model and sample and limited each harness round to 4 turns. The usual caveat about statistical significance applies: 10 runs per model and sample is a small sample. Further, the harness uses Claude Code under the hood to invoke the models, and I proxy the calls through LiteLLM. This means that the models run at temperature 1 (the Claude Code default) and that non-Anthropic models could be at a slight disadvantage vs Opus.
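To make the statistical caveat concrete, here is a minimal sketch of how wide the uncertainty is with only 10 runs. It computes a Wilson score interval for a success rate. The function name and the example count of 7 successes are hypothetical and are not results from these experiments.

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial success rate.

    Hypothetical helper for illustration; z = 1.96 corresponds to ~95% coverage.
    """
    if runs <= 0 or not 0 <= successes <= runs:
        raise ValueError("need runs > 0 and 0 <= successes <= runs")
    p = successes / runs
    z2 = z * z
    denom = 1 + z2 / runs
    center = (p + z2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z2 / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical example: a model finds the bug in 7 of 10 runs.
    low, high = wilson_interval(7, 10)
    print(f"7/10 successes -> plausible true rate between {low:.2f} and {high:.2f}")
```

With 10 runs, an observed 7 out of 10 is compatible with a true success rate anywhere from roughly 40% to roughly 90%, so small differences between models should not be over-read.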

I use crackaddr and a few variants of it for three reasons:

The bug is self-contained, which reduces the odds that the context window significantly affects the results; at the same time, the bug is harder to pattern-match than other bug types.

The state machine for crackaddr is complex enough to test the model’s reasoning capabilities.

I’m trying to avoid in-corpus results that might overestimate the capabilities of a model.

The setup

As a reminder from the previous blog post, I had 4 variants of the well-known crackaddr vulnerability.

Artifact | Format
Original crackaddr() | C source
Halvar rewrite | C source
Compiled | ARM64 Mach-O
Tigress-obfuscated, stripped | ARM64 Mach-O

To test the models in the closest possible setting to Opus, I used a patched version of LiteLLM to route the requests from Claude Code to the different models I tested.

Results with Claude Code

With the harness in place, I aimed to answer two questions:

Can the tested models match Opus performance?

How do they find the bug?

The results are fairly surprising in that all tested models perform significantly worse than Opus, as shown below.

Table 1: Which models find the bug

Artifact | DeepSeek V4 pro | Qwen3.5-397B-A17B | Kimi K2.6 | GLM-5 | GLM-5.1 | Opus 4.7
Original crackaddr()
Halvar rewrite
Compiled
Tigress-obfuscated, stripped

Table 2: Which models recognize crackaddr

Artifact | DeepSeek V4 pro | Qwen3.5-397B-A17B | Kimi K2.6 | GLM-5 | GLM-5.1 | Opus 4.7
Original crackaddr()
Halvar rewrite
Compiled
Tigress-obfuscated, stripped

Excluding GLM-5.1, which I’ll cover in greater depth later on, the models not only perform significantly worse than Opus but also seem unable to recognize the pattern in the variants, even though crackaddr was clearly in their training corpus: all models recognized the original version.

How do the open-weight models find bugs?

The other striking fact was how the models found the bugs and how they produced the crashing input:

All tested models take significantly more turns than Opus to find the bug in every case except the original crackaddr, which suggests that they actually reason about the other samples.

The models are consistently less eager than Opus to build an oracle (a test program that independently verifies whether a given input triggers the bug), even when they are not allowed to actually run the samples.

DeepSeek spends significant time and turns minimizing the crashing input rather than just finding one.

The models resort to fuzzing much earlier than Opus does.
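To illustrate what an oracle means in this context, here is a minimal sketch. It is not the oracle any of the models built, and it does not reproduce the real crackaddr logic: it is a toy copy loop, loosely in the spirit of the simplified rewrites of the bug, with a hypothetical 200-byte buffer and a planted flaw in which the write limit grows on every ")" but never shrinks on "(". All names and sizes are assumptions made for illustration.

```python
from typing import Optional

BUFFER_SIZE = 200  # hypothetical fixed-size output buffer


def first_out_of_bounds_write(data: str) -> Optional[int]:
    """Simulate the toy copy loop and return the first out-of-bounds index written.

    Returns None when every write stays inside the buffer.
    """
    upper_limit = BUFFER_SIZE - 10  # slack reserved for closing delimiters
    in_angle = False
    in_paren = False
    out = 0  # next index the loop would write to

    for ch in data:
        if ch == "<" and not in_angle:
            in_angle = True
            upper_limit -= 1
        if ch == ">" and in_angle:
            in_angle = False
            upper_limit += 1
        if ch == "(" and not in_angle and not in_paren:
            in_paren = True  # planted flaw: the matching `upper_limit -= 1` is missing
        if ch == ")" and not in_angle and in_paren:
            in_paren = False
            upper_limit += 1
        if out < upper_limit:
            if out >= BUFFER_SIZE:
                return out  # the copy would write past the end of the buffer
            out += 1

    # Closing delimiters appended after the loop also count as writes.
    for still_open in (in_paren, in_angle):
        if still_open:
            if out >= BUFFER_SIZE:
                return out
            out += 1
    return None


def triggers_bug(data: str) -> bool:
    """Oracle verdict: does this input cause an out-of-bounds write in the toy model?"""
    return first_out_of_bounds_write(data) is not None


if __name__ == "__main__":
    print(triggers_bug("A" * 1000))  # plain text is capped by the limit
    print(triggers_bug("()" * 101))  # each "()" pair raises the limit by one
```

The value of an oracle like this is that candidate inputs can be checked independently of the model’s own reasoning, which is useful when the target itself cannot be executed.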

Overall, it appears as if the models...
