Finding Miscompiles for Fun, Not Profit


Or: You don’t need access to Claude Mythos to spend $10,000 in an afternoon.

Justin Lebar, May 28, 2026


I’ve worked on compilers for ML for the last decade across Google, Waymo, and OpenAI. This includes CUDA support in clang, XLA:GPU, Triton, and OpenAI’s custom hardware. I’ve seen stuff. But over the past week or so I had one of the most unsettling experiences of my career: In one afternoon, I spent more than $10,000 running AI agents over compiler code, finding hundreds of plausible bugs in LLVM, including many miscompiles and at least one that’s Quite Serious. This is the story of how I got here and where we might be going.

In January 2026, I decided to try to find some bugs in LLVM (the compiler behind clang, rustc, and AMD’s GPU compiler, among others), as a personal project. Codex and I collaboratively wrote a fuzzer. The basic idea is to generate a random program, run it through part of the compiler, and then check that the resulting program after compilation does the same thing as the original program (usually just by running the two programs). I spent a few weeks on it, and I found and fixed five bugs in instcombine, LLVM’s peephole optimization pass. After that, my fuzzer started taking longer to find bugs, and I lost interest.

Fast forward to mid-May 2026. I joined SemiAnalysis as a contractor, and I decided to try applying the same technique to NVIDIA’s low-level compiler, ptxas. I expected this to be less fruitful than fuzzing LLVM, for a few reasons.

In general, fuzzers can get “stuck”: Once they find a bug, they can keep finding new ways to trigger it. With an open-source compiler like LLVM, you can “just” fix the bug and then continue fuzzing. But with a closed-source compiler like ptxas, the best you can do is try to modify your fuzzer so it doesn’t generate inputs that trigger the same bug. Implementing this is tedious at best.

With LLVM I can run just one pass (e.g. instcombine), whereas with ptxas I have to run the whole compiler end-to-end. I worried that this would make some bugs require larger or more complicated reproducers, making them less likely to be found by a fuzzer.

I can build LLVM myself, so I can compile with flags that add instrumentation into the program under test to help the fuzzer choose “interesting” inputs that explore new parts of the program. Although AFL++ has modes for gathering instrumentation from precompiled binaries, they slow down the program under test, and in general I didn’t expect them to be as useful. (I didn’t end up actually using these modes when fuzzing ptxas; I just did purely undirected fuzzing.)
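To make the basic loop concrete, here is a minimal sketch of differential fuzzing in Python. Everything in it is an assumption made up for illustration: the toy 8-bit expression language, the `buggy_optimize` stand-in for "part of the compiler" (with a deliberately planted bug), and the `contains_zero_minus` filter that plays the role of "modify your fuzzer so it doesn't generate inputs that trigger the same bug". It is not the fuzzer described in this post and has nothing to do with LLVM's or ptxas's actual interfaces.

```python
import random

MASK = 0xFF                              # toy semantics: 8-bit wraparound arithmetic
INPUTS = (0, 1, 2, 3, 127, 128, 255)     # values of x used to compare behavior

def gen(rng, depth=3):
    """Generate a random expression tree over one input variable x."""
    if depth == 0 or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.5 else ("const", rng.randrange(4))
    op = rng.choice(["add", "sub", "mul"])
    return (op, gen(rng, depth - 1), gen(rng, depth - 1))

def evaluate(prog, x):
    """Reference interpreter: defines what a program is supposed to compute."""
    tag = prog[0]
    if tag == "x":
        return x & MASK
    if tag == "const":
        return prog[1] & MASK
    a, b = evaluate(prog[1], x), evaluate(prog[2], x)
    if tag == "add":
        return (a + b) & MASK
    if tag == "sub":
        return (a - b) & MASK
    if tag == "mul":
        return (a * b) & MASK
    raise ValueError(f"unknown op {tag!r}")

def buggy_optimize(prog):
    """Hypothetical pass under test, with one planted miscompile."""
    if prog[0] in ("x", "const"):
        return prog
    op, a, b = prog[0], buggy_optimize(prog[1]), buggy_optimize(prog[2])
    if op == "sub" and b == ("const", 0):
        return a                          # correct: a - 0 == a
    if op == "sub" and a == ("const", 0):
        return b                          # planted bug: 0 - b is -b, not b
    return (op, a, b)

def contains_zero_minus(prog):
    """Filter for the already-known bug: any `0 - something` in the input."""
    if prog[0] in ("x", "const"):
        return False
    if prog[0] == "sub" and prog[1] == ("const", 0):
        return True
    return contains_zero_minus(prog[1]) or contains_zero_minus(prog[2])

def find_miscompile(seed, tries=2000, known_bug=None):
    """Return (program, input) where optimized and original disagree, else None."""
    rng = random.Random(seed)
    for _ in range(tries):
        prog = gen(rng)
        if known_bug is not None and known_bug(prog):
            continue                      # avoid getting "stuck" on a known bug
        optimized = buggy_optimize(prog)
        for x in INPUTS:
            if evaluate(prog, x) != evaluate(optimized, x):
                return prog, x
    return None
```

Calling `find_miscompile(0)` searches for a program that the planted bug breaks. Passing `known_bug=contains_zero_minus` skips every input that can trigger that bug, which is the cheap substitute for fixing the compiler when you cannot rebuild it. In a real compiler the filter is much harder to write, because you rarely know the exact shape of the inputs that trigger a given bug.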

I expected maybe I’d find a handful of bugs after a few weeks of work, like before. Instead, in three days, I had 40 programs that ptxas miscompiles. (A week later, this number is up to about 80.) Although some of these test cases probably reflect the same underlying bug in the compiler, I was still kind of astonished. Many of the reproducers reduce to fairly “normal-looking” instruction sequences. If you want to have a look at the bugs I found (really, that Codex and Claude found), they live in the FuzzX repo on GitHub.

Source: FuzzX on GitHub https://github.com/SemiAnalysisAI/FuzzX/

Why was my fuzzer so much easier to write this time? As far as I can tell, it was the difference between ChatGPT 5.2 and 5.5. This time, I vibe-coded this entire fuzzer, never looking at a line of code. The LLM did the tedious job of altering the fuzzer after every bug we found so that we wouldn’t get stuck finding the same bug over and over. It also minimized each test case it generated, often spending an hour or longer doing so. It independently decided which PTX instructions to fuzz, and which sequences of instructions were “safe” (i.e. didn’t trigger undefined behavior). I just put it in a loop using /goal and went to sleep.

To be clear, it’s not surprising that fuzzing found bugs. But it was surprising that I found this many bugs so quickly, with almost no manual effort.

Naturally, my next question was whether I could also find bugs in LLVM’s AMDGPU backend. You won’t be surprised to hear that I could, at roughly the same rate as I found bugs in ptxas.

At some point, my personal ChatGPT Pro account ran out of credits and I switched to SemiAnalysis’s Claude account. I didn’t notice a difference in quality between Opus 4.7 and ChatGPT 5.5; they were both great.

The story could end here. I reported the bugs I’d found to AMD and NVIDIA. As of this writing, AMD has already fixed five of them, and because their compiler is open-source, I can apply their fixes immediately. If any of these bugs were critical to me, I could even fix them myself. Hooray for open-source toolchains.

At this point my ptxas and AMDGPU fuzzers started to slow down; they were running for increasingly long times without finding any new bugs. I almost stopped...
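For readers who haven't seen test-case minimization, the step mentioned above can be sketched as a greedy reducer. This is an illustrative assumption, not what the LLM actually did: the `minimize` function, the list-of-instruction-names representation, and the `still_fails` predicate are all hypothetical, and real reducers use far smarter strategies than dropping one element at a time.

```python
def minimize(items, still_fails):
    """Greedily drop elements as long as the failure still reproduces."""
    items = list(items)
    changed = True
    while changed:
        changed = False
        for i in range(len(items)):
            candidate = items[:i] + items[i + 1:]
            if still_fails(candidate):
                items = candidate         # keep the smaller reproducer
                changed = True
                break                     # restart the scan from the beginning
    return items

# Hypothetical failure: the bug needs both a shift and a subtract to be present.
def still_fails(instrs):
    return "shl" in instrs and "sub" in instrs

reproducer = minimize(["add", "shl", "mul", "sub", "xor"], still_fails)
print(reproducer)
```

In practice each call to `still_fails` means compiling and running a candidate program, which is part of why minimizing a single test case can take an hour or more.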
