The unreasonable effectiveness of LLMs for auditing Rust code

Sergey "Shnatsel" Davidoff | Jun, 2026

As a lead of the Rust Secure Code Working Group, I got free access to GPT-5.5 via Codex for Open Source. Since then I’ve found and reported dozens of issues of varying severity in widely used Rust crates. Separately, the Rust Foundation security initiative got access to Mythos via Project Glasswing, and their report should also be coming soon. I’ve coordinated with them so that our audit targets would not overlap.

While I haven’t found any truly devastating vulnerabilities, I am very impressed with GPT-5.5 for auditing Rust source code, and I’ll absolutely be adding it to my toolkit alongside fuzzers.

Note: All opinions expressed in this article are my own, not those of any organization I am a part of.

Methodology

Since maintainers could already be dealing with a large number of vulnerability reports, it is imperative that I do not submit any invalid reports and waste maintainers’ already limited time. So I decided to look for an unambiguously problematic class of vulnerability that’s easy to verify: memory safety bugs.

Wait, isn’t Rust memory-safe?

Yes, with an asterisk. Most code you’d write in Rust is memory-safe, but at some point you have to talk to the operating system or a C library, or implement things like intrusive data structures, all of which involve raw pointers. Most languages implement these parts in C (e.g. CPython), provide unsafe interoperability with C, and have you write C for your own unsafe code, while Rust has its own unsafe subset where you can muck about with raw pointers. This puts Rust’s memory safety properties on par with Python’s, ahead of Go, which violates memory safety on data races, and behind browser-sandboxed JavaScript (but you can match that by compiling Rust to WebAssembly).

As for the amount of safe vs unsafe code in the wild, my own scan from 2020 showed that 95% of the code on crates.io is memory safe.
The authors of the 2020 paper “How do programmers use unsafe Rust?” independently arrived at the 95% figure, although they didn’t put it into the final paper because they weren’t confident in their methodology for it. My own scan is also rather crude, but two completely different measurements arriving at the same number is encouraging. In practice, the reduction in the memory safety vulnerability rate compared to C++ is about 1000x, which is more than you’d expect based on the above figures.

Preventing false positives

Rust has a tool called miri that runs Rust code in an interpreter and tells you precisely whether it committed any crimes against the language rules. Safe Rust cannot violate them by construction, but unsafe Rust can, and a validator that immediately tells you whether you messed up, instead of making you parse dozens of pages of dense prose, is indispensable.

It also completely eliminates false positives from LLM vulnerability findings. If the LLM can construct a unit test that causes miri to fail, I can report that to the maintainers and be certain that it’s a bug. I don’t ever have to argue about whether it’s a real issue, either: the proof is right there. And if miri says the execution is completely fine, then the LLM false positive gets discarded before anyone even sees it.

To the best of my knowledge, no other language has a practical tool with this level of precision. Sanitizers are very nice, but they can’t catch everything, so verifying against them does not prove the absence of issues.

Sadly, miri is not without limitations: execution with extra checks is slow, calling into C is not supported, and syscall support is limited. When miri is not applicable, you can fall back on the sanitizers and get some filtering. I also had to switch miri to the newer Tree Borrows aliasing model (as opposed to the older Stacked Borrows) to avoid false positives, but fortunately that’s just one flag, -Zmiri-tree-borrows.
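
To make the workflow concrete, here is a minimal sketch of the kind of unit test that could be checked under miri. The function copy_prefix and its test are hypothetical illustrations, not code from any audited crate or from an actual report.

```rust
/// Hypothetical helper (an illustration, not from any real crate):
/// copies the first `len` bytes of `src` into a new Vec via raw pointers.
pub fn copy_prefix(src: &[u8], len: usize) -> Option<Vec<u8>> {
    // This bounds check is what makes the unsafe block below sound.
    if len > src.len() {
        return None;
    }
    let mut out: Vec<u8> = Vec::with_capacity(len);
    // SAFETY: `len <= src.len()`, so the read stays inside `src`;
    // `out` has capacity for `len` bytes, and the two buffers are
    // separate allocations, so they cannot overlap.
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), out.as_mut_ptr(), len);
        out.set_len(len);
    }
    Some(out)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn oversized_len_is_rejected() {
        let data = [1u8, 2, 3];
        // In-bounds request: returns the copied prefix.
        assert_eq!(copy_prefix(&data, 2), Some(vec![1, 2]));
        // Out-of-bounds request: must be refused rather than read past `data`.
        assert_eq!(copy_prefix(&data, 4), None);
    }
}
```

Assuming a nightly toolchain with the miri component installed, this could be run with MIRIFLAGS="-Zmiri-tree-borrows" cargo +nightly miri test. As written, the test should pass under miri. If the bounds check were removed, the copy_prefix(&data, 4) call would read past the end of data, and miri would be expected to report undefined behavior, which is exactly the kind of reproducible evidence described above.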

Harness

My setup was very basic: just Codex and a prompt, with GPT-5.5 set to xhigh reasoning effort. It is important for the model to be able to write and run unit tests to try to trigger the issue under miri, so I consider something like Codex essential. It would be interesting to try a more elaborate harness like metis, but even this basic setup was enough to discover interesting bugs.

Findings

The most serious issue I’ve found is an out-of-bounds write in the jxl-grid crate. It is a part of jxl-oxide, a JPEG XL decoder in Rust (not to be confused with jxl-rs, which Firefox and Chromium are adopting for JPEG XL decoding; that one came up clean in my audit). I had already fuzzed this crate earlier, but the fuzzer didn’t catch this issue because it only happens on 32-bit platforms and requires very large image dimensions. I...
