What the hell are we doing?

Author: Addison Crump

Published: 2025-10-26

I have come to realise, or rather I have become more and more convinced, that fuzzing research has stalled not because we have no further contributions to make, but because the contributions that we are making are either incremental changes that merely sound impressive, or are presented in ways that obscure their utility. To be more concrete: we are spending time trying to "improve" fuzzing generally rather than identifying what can be improved; everyone is trying to be "the best" rather than trying to understand what is actually happening. This is not the first time that I have felt this, but perhaps my understanding of the problem has improved in the last two years. It's time for a revisit!

Last year, I was involved in a paper which tried to standardise fuzzer evaluation. While I still think that this paper is incredibly important in providing baseline evaluation requirements, something that I've only realised in the last year or so is that it asks the wrong questions.

Significance?

Statistical significance is the gold standard for scientific advancement. It shows that there is indeed a difference between two experimental configurations. The only problem is that statistical significance is trivial to achieve in fuzzing.
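To illustrate how little a significant result can mean, here is a minimal sketch of a Mann-Whitney U test written from scratch. The edge counts are invented for illustration (they are not from any real campaign), and the names `baseline` and `candidate` are hypothetical. The test uses the normal approximation without tie correction, which is a simplification.

```python
import math
from statistics import median

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns (U statistic for sample a, two-sided p-value).
    """
    n, m = len(a), len(b)
    # U counts the pairs in which a beats b; ties count as half a win.
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    mean = n * m / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u, p

# Invented final edge counts from 10 trials of each configuration.
baseline = [10012, 10018, 10009, 10021, 10015, 10011, 10019, 10014, 10017, 10010]
candidate = [10031, 10027, 10035, 10029, 10033, 10026, 10038, 10030, 10034, 10028]

u, p = mann_whitney_u(candidate, baseline)
a12 = u / (len(candidate) * len(baseline))  # Vargha-Delaney effect size
gain = (median(candidate) - median(baseline)) / median(baseline)

print(f"U = {u}, p = {p:.6f}, A12 = {a12:.2f}, median gain = {gain:.4%}")
```

With these invented numbers, every candidate trial beats every baseline trial, so both the p-value and the effect size look decisive, yet the median gain is a fraction of a percent. Low variance across trials is enough to produce this outcome; the test says nothing about whether the difference matters.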

Case study 1: SBFT'25

Last time, I mentioned that I was to run the fuzzing competition for SBFT'25. Despite having only two contestants, I think this competition truly highlighted the problem of fuzzer evaluation. The first contestant ensembled AFL++ and LibAFL and used fixed-interval corpus minimisation during execution. The second ensembled AFL++, LibAFL, FOX, and ZTaint-Havoc. The second submission is an engineering marvel, using program-specific knowledge to better inform the fuzzers' search.

The kicker? The first contestant, Kraken, won, though this is in part due to a bug that caused the second contestant, HFuzz, to crash on one of the targets. In raw scores, Kraken beat HFuzz on 3 of the targets (4 if we include the crashing target), and HFuzz beat Kraken on 4. The improvements shown by these fuzzers are statistically significant by classical evaluation metrics and tests.

Yet, I suspect that if you handed these tools to a bunch of reviewers, they would reject Kraken and accept HFuzz. Why? The fuzzers involved in HFuzz are much more technically interesting; indeed, FOX was accepted at CCS'24 and ZTaint-Havoc was accepted at ISSTA'25, an A* and an A venue, respectively.

Nevertheless, by classical evaluation metrics, these fuzzers are contributing equally. Kraken demonstrates something known to "folklore" literature: intermittent, but not continuous, corpus minimisation and synchronisation massively benefit ensemble performance. HFuzz exploits ensembling of very technically advanced fuzzer strategies to gain its advantage. These are orthogonal contributions of equal importance to designing fuzzer campaigns.
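The idea of intermittent minimisation and synchronisation can be sketched as a toy loop. This is not Kraken's implementation: the `step(corpus)` interface, the toy fuzzers, the sync interval, and the greedy set-cover heuristic used for minimisation are all assumptions made for illustration.

```python
def minimise(corpus):
    """Greedy set-cover heuristic: keep a subset of inputs that preserves
    the total edge coverage of the corpus (a dict of name -> set of edges)."""
    remaining = set().union(*corpus.values())
    kept = {}
    while remaining:
        # Pick the input that covers the most not-yet-covered edges.
        best = max(corpus, key=lambda name: len(corpus[name] & remaining))
        kept[best] = corpus[best]
        remaining -= corpus[best]
    return kept

def run_ensemble(fuzzers, rounds, sync_interval):
    """Run each fuzzer on its own corpus; every sync_interval rounds, merge
    all corpora, minimise the result, and hand the same copy to everyone."""
    corpora = [dict() for _ in fuzzers]
    for r in range(1, rounds + 1):
        for step, corpus in zip(fuzzers, corpora):
            corpus.update(step(corpus))
        if r % sync_interval == 0:
            merged = {}
            for corpus in corpora:
                merged.update(corpus)
            shared = minimise(merged)
            corpora = [dict(shared) for _ in fuzzers]
    return corpora

def make_toy_fuzzer(prefix, edge_sets):
    """Hypothetical stand-in for a real fuzzer: 'discovers' one preset
    input (with its edge set) per call, ignoring the corpus it is given."""
    queue = list(enumerate(edge_sets))
    def step(corpus):
        if not queue:
            return {}
        i, edges = queue.pop(0)
        return {f"{prefix}{i}": set(edges)}
    return step

fuzzers = [
    make_toy_fuzzer("x", [{1}, {1, 2}, {3}, {1, 2, 3}]),
    make_toy_fuzzer("y", [{4}, {4}, {5}, {4, 5}]),
]
final = run_ensemble(fuzzers, rounds=4, sync_interval=2)
print(sorted(final[0]))
```

Setting `sync_interval=1` would correspond to continuous minimisation, while a larger value gives the intermittent variant. The toy fuzzers cannot show the performance effect itself; the sketch only makes the structure of the campaign concrete.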

Looking closer: HFuzz

HFuzz is composed of many fuzzers, which somewhat obscures the individual contributions of the component fuzzers. Looking closer, FOX and ZTaint-Havoc, both by the same group which submitted HFuzz, demonstrate some interesting contributions. In particular, they show extreme coverage improvements on certain targets which go beyond what can be ascribed to performance improvements or to basic, but clever, changes to the algorithm. That is meaningful and important, but sadly underemphasised due to our field's obsession with general improvement.

One thing I admire in particular about these papers is that they present other metrics alongside the baseline typically used. The effect is that we get to see how their contributions affect things that we as a community may be turning a blind eye to; after all, there is still no conclusive proof that edge coverage is the primary signal for finding bugs in real programs. Indeed, we can prove that there are some programs for which there is no relationship between coverage and test efficacy (e.g., crypto algorithms).
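One way to see why coverage can carry no signal for crypto-like code is a branch-free toy. The sketch below is an assumption-laden illustration, not an example from the papers: `add32_buggy` is a hypothetical 32-bit addition built from 16-bit limbs with a planted dropped-carry bug, and `lines_executed` is a small helper that records line coverage through `sys.settrace`.

```python
import sys

def add32(a, b):
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    hi = (a >> 16) + (b >> 16) + (lo >> 16)  # carry from the low limb
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def add32_buggy(a, b):
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    hi = (a >> 16) + (b >> 16)  # planted bug: the carry is dropped
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def lines_executed(func, *args):
    """Return the set of line offsets inside func that run for these args."""
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            seen.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return seen

benign = (1, 2)           # low limbs do not overflow, so the bug stays hidden
triggering = (0xFFFF, 1)  # low limbs overflow, so the missing carry matters

same_coverage = lines_executed(add32_buggy, *benign) == lines_executed(add32_buggy, *triggering)
hidden = add32_buggy(*benign) == add32(*benign)
exposed = add32_buggy(*triggering) != add32(*triggering)
print(same_coverage, hidden, exposed)
```

Because the function has no branches, every input exercises exactly the same lines, so a coverage-guided fuzzer gets no feedback that would steer it from the benign input towards the triggering one.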

Looking even closer

On the other hand, these papers are tainted with extremely subtle evaluation errors that make me question whether the fuzzers are even generally better. This is not the first time this has happened and it certainly won't be the last; it is effectively impossible to get fuzzer evaluations correct. A follow-up with the errors corrected showed that the improvements at least halved, but the results were still significant.

That's just accounting for experimental error, too; there are some recent findings (which I sadly cannot share here) that make me question the validity of nearly any evaluation performed with benchmarking, even if it is done "perfectly".

Case study 2: DARWIN

DARWIN won a distinguished paper award at NDSS'23. This paper proposed to use evolutionary algorithms to optimise the selection of mutations used during a fuzzing campaign. For many readers (i.e. "fellow...
