Binary Coverage the Wrong Way | Red Vice The Right Way<br>Way back in the age of the dinosaurs, if you wrote a program and wanted to test that it was correct, you mostly had two choices: you could either manually construct malformed inputs and try them in succession, or hook up a program which would randomly generate inputs and try them automatically 1. Those programs were called fuzzers, or sometimes generators, and this was “fuzz testing”. It also for quite a long time wasn’t very good: the generator would have no feedback from the program under test except “did this random input crash or not”. This feedback, although useful, doesn’t allow the fuzzer to tell if it is making progress relative to the last random input or doing a good job, and so your fuzzer would just repeatedly hit the same shallow code. You’d leave it running for a week and then when you come back realize it was never generating inputs that would hit interesting parts of your program, and they were all immediately being thrown out or never progressing past a if(packet_header == 0xdeadbeef) check.<br>American Fuzzy Lop revolutionized the field by introducing coverage-guided fuzzing. Instead of having the only feedback to your fuzzer be whether an input crashed or not, it also recoded what code coverage it hit over the course of a program execution; if an input executes a piece of the program that it never reached before, then it hit new code, and probably is now able to also test some new interesting surface area of the program which might have some bugs. This was immensely more effective than “dumb fuzzing” of yesteryear.<br>AFL collects coverage for programs by compiling the program under test with a custom clang compiler pass, which adds coverage feedback on CFG edges: if you hit a new basic block in the code, you get a new coverage bitmap entry set2. If running the program on an input sets results in a bitmap which has new bits set than previously, it hit new code; that input can be saved to use as the seed input to mutate for future executions.<br>You can also do feedback based fuzzing for blackbox binaries that you don’t have the source code to. AFL++ for example has a qemu_mode backend: it executes the program under test with qemu-user, which will lift your x86 assembly to QEMU’s TCG IR and then lower it back with its JIT compiler to x86, but now which can safely execute under its type-2 hypervisor environment. AFL++ has a fork of QEMU which adds a custom TCG hook that adds coverage instrumentation to blocks in the middle-end; the IR for each block is extended with TCG operations that write a coverage bitmap entry the same way that AFL enhances clang IR with extra LLVM operations. Despite sounding like a lot of work, this qemu_mode is only like ~3-5x slower than running a program natively.<br>The gold standard for blackbox binary coverage is to let hardware do it for you. Instead of needing to add extra operations to the program under test, which need to be executed when you run a fuzz testcase, you can instead use Intel PT. Enabling certain performance counter configuration registers will start processor tracing: your CPU will then automatically record metadata as part of its execution pipeline, and the fuzzer can look at the resulting metadata to pull out a trace of what basic blocks it executed. This is unreasonably effective, with Intel PT based tracing only being something like ~10% slower than normal native execution.<br>The issue, of course, is that if you’re like me your laptop has an AMD processor and not Intel. AMD have their own magic performance counting features, including a branch trace buffer that records what code is being executed, but its 1) much worse documented 2) much slower than Intel PT 3) impossible to use in a precise manner - that is, the counters are either sampling based and so only give you 1/n events, or may end up dropping events if a metadata buffer fill up before you’re able to drain it. In either case that makes it sadly unsuitable for fuzzing (you technically could build a fuzzer on top of merely stochastic coverage, and it might even not be that bad. But in practice no one has done so, and it’s probably not worth it.)<br>Which brings me to my bad idea. I wanted to write my own fuzzer for a side-project, and so wanted to collect precise blackbox binary coverage. But didn’t want to write my own dynamic recompilation engine (again), or buy a new laptop. What is a software engineer to do?<br>The Wrong Way<br>We can instead instrument the guest code by replacing all the branches with INT3 instructions. Wait come back I can explain.<br>In order to collect precise coverage we need the program to do something whenever it hits a branch. We can’t insert new behavior in each block, because it already is using all of the bytes in the block for its normal instructions: trying to resize blocks requires fixing up code pointers, and then you’re just back at QEMU-but-bad. The minimal length of a branch instruction is two bytes long: Jcc has a...