Finding Bugs Using LLMs

At Materialize, we have been finding bugs in existing code and open pull requests with LLM-based coding agents since February 2026, when Anthropic released Opus 4.6 (we now mostly run on 4.7). In this post we look at some of the considerations behind the system we currently use, as well as the lessons learned.

Sessions

We have a basic shell script that determines the next unit to operate on and feeds it to claude. We scan several kinds of units, each in a fresh coding agent session:

- Every pull request that becomes ready for review (not in draft): Ideally we want to find bugs before the PR is merged into our main branch. Unfortunately a PR can go through many versions, so we still have to check every commit that lands in main as well, even if the PR itself was already reviewed.
- Every commit that ever landed on main, back-filling our existing repository’s history: Considering the entire diff of a commit gives a better overview of everything in the source code that had to be touched for a specific change. This ended up finding many additional bugs.
- Every production source code file: This is the most basic unit people use. Code in the same file is often related, and the LLM agent can look up code in other files when it needs to. We originally started out with this approach, but adding PR/commit reviews on top turned out to be fruitful.
- N-th iteration of every production source code file, with a list of already known (but not yet fixed) bugs in this file: Not all bugs are of equal importance. By telling the LLM to ignore the already known bugs we don’t waste further tokens on them, and instead have a chance of finding more serious bugs in key files that might not be as obvious.

What we end up running is:

    claude --dangerously-skip-permissions --model claude-opus-4-7 --effort max --output-format stream-json --verbose -p $PROMPT

Since the sessions should run automatically without user interaction, --dangerously-skip-permissions with a dedicated VM is the easiest approach. See the documentation.

Prompt
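To make the session loop from the Sessions section more concrete, here is a minimal sketch of what such a driver script might look like. It is not the actual Materialize script: the queue and done files, the unit format, and the prompt wording are hypothetical, and only the claude flags are taken from the command quoted in the post. The prompt is passed as a quoted argument so that it reaches the agent as a single string.

```shell
#!/usr/bin/env bash
# Sketch of a per-unit scan loop. The flags passed to claude are the ones
# quoted in the post; the queue/done files, the unit format, and the prompt
# wording are hypothetical placeholders.

QUEUE_FILE="${QUEUE_FILE:-units.txt}"  # hypothetical: one unit per line, e.g. "commit abc123"
DONE_FILE="${DONE_FILE:-done.txt}"     # hypothetical: units that were already scanned

# Print the first queued unit that is not yet listed in DONE_FILE.
# Returns non-zero when every queued unit has been scanned.
next_unit() {
  local unit
  while IFS= read -r unit; do
    [ -n "$unit" ] || continue
    if ! grep -Fxq -- "$unit" "$DONE_FILE" 2>/dev/null; then
      printf '%s\n' "$unit"
      return 0
    fi
  done < "$QUEUE_FILE"
  return 1
}

# Hypothetical prompt text; a real prompt would be far more detailed.
build_prompt() {
  printf 'Review %s for bugs and report only high and medium severity findings.' "$1"
}

# Run one fresh agent session for a unit. With DRY_RUN=1 the command is
# printed instead of executed, which helps when checking the quoting.
run_unit() {
  local unit="$1" prompt
  prompt="$(build_prompt "$unit")"
  local cmd=(claude --dangerously-skip-permissions --model claude-opus-4-7
             --effort max --output-format stream-json --verbose -p "$prompt")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%q ' "${cmd[@]}"
    printf '\n'
  else
    "${cmd[@]}"
  fi
}

# Scan units until the queue is exhausted. A unit is only recorded as done
# after its session exits successfully, so failed units are retried later.
scan_all() {
  local unit
  while unit="$(next_unit)"; do
    run_unit "$unit" || return 1
    printf '%s\n' "$unit" >> "$DONE_FILE"
  done
}
```

Calling scan_all works through the queue one session at a time. Running DRY_RUN=1 run_unit "commit abc123" prints the command for a single unit without starting a session, which is a cheap way to inspect the invocation before spending tokens.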

Bugs are categorized as high, medium, or low severity. Only high and medium findings are pursued further, by writing a markdown file for the reviewed unit. Existing findings for the relevant file are listed in the prompt so we don’t waste time on them; otherwise we end up rediscovering the same bugs again and again. Each newly suspected bug is additionally cross-checked against our open bugs in GitHub and Linear, to deduplicate against existing issues and save valuable time for the reviewer.

I have recently extended the prompt with specific categories of bugs we are looking for, for example correctness, kinds of vulnerabilities, and race conditions. These are based on the serious bugs we have found previously and on the categories Materialize cares about most. The jury is still out on whether that is better than letting the LLM look for anything. I have considered running a separate session per bug category, but that would increase token usage considerably, with questionable benefit.

We also ask it to avoid false positives in several ways, for example by tracing the entire chain of execution, or by creating and executing a small test.

Tools & Skills

Trailmark and LSP are valuable for traversing large code bases more efficiently. Trail of Bits also has relevant skills for finding vulnerabilities and for discarding false positives. Our own repository also contains skills that describe how some complex parts of the system work, where to find our existing issues, and how to use the existing test frameworks well. Keeping the skills agent-agnostic helps here, since it lets us experiment with OpenAI’s Codex and GPT 5.4/5.5.

Models

Anthropic’s Opus 4.7 with max thinking is what we currently use most of the time, with a fallback to OpenAI’s GPT 5.5. In the limited evaluations I did, Opus 4.7 didn’t find more bugs than Opus 4.6, but it had fewer false positives, since it investigated more context to ensure the bug could actually be triggered end to end. On the flip side, that uses far more tokens. Future models like Mythos are bound to be interesting not just for security research, but for bug finding in general.

Recently both Anthropic and OpenAI have become more careful about letting attackers use their LLMs to find vulnerabilities. Unfortunately this also bites you when you try to find bugs in your own software, for which you have to apply for safeguard adjustments (Anthropic, OpenAI). Otherwise you’ll just keep running into API errors like this:

    API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). This request triggered restrictions on violative cyber content and was blocked...
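The post does not describe how the fallback is triggered. As one possible approach, assuming the fallback kicks in whenever the refusal message quoted above shows up in the session output, a wrapper could look like the sketch below. primary_agent and fallback_agent are hypothetical functions that the caller defines (for example thin wrappers around claude and codex); the matched string is taken from the error message above.

```shell
#!/usr/bin/env bash
# Sketch of a fallback on usage-policy refusals. primary_agent and
# fallback_agent are hypothetical functions supplied by the caller; each
# takes the prompt as its only argument and writes its output to stdout.

# Succeeds if the given log file contains the refusal text quoted in the post.
is_policy_block() {
  grep -Fq -- 'appears to violate our Usage Policy' "$1"
}

# Run the primary agent; if its output contains the refusal, discard that
# output and rerun the same prompt with the fallback agent instead.
run_with_fallback() {
  local prompt="$1" log status=0
  log="$(mktemp)"
  primary_agent "$prompt" > "$log" 2>&1 || status=$?
  if is_policy_block "$log"; then
    status=0
    fallback_agent "$prompt" > "$log" 2>&1 || status=$?
  fi
  cat "$log"
  rm -f "$log"
  return "$status"
}
```

Matching on the message text is brittle, since the wording of the error may change, so this is only a stopgap until the safeguard adjustment is in place.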
