LLMs Are the Key to Mutation Testing and Better Compliance - Engineering at Meta
By Mark Harman
Following our keynote presentations at FSE 2025 and EuroSTAR 2025, we’re delving further into the development of Meta’s Automated Compliance Hardening (ACH) tool, an LLM-based software testing tool that is automating aspects of compliance adherence at Meta while accelerating developer and product velocity.
By leveraging LLMs we’ve been able to overcome the barriers that have prevented mutation testing from being efficiently deployed at scale. This allows us to greatly simplify risk assessments, reduce cognitive load for developers, and, ultimately, create a safer online ecosystem by enabling continuous compliance.
We’re also inviting the community to join us in exploring new challenges and opportunities for leveraging LLMs in software testing through efforts like our Catching Just-in-Time Test (JiTTest) Challenge.
Today, AI is accelerating the pace and complexity of technology development worldwide, requiring compliance systems to keep up. However, compliance has traditionally relied on manual processes, which can be error-prone and challenging to scale.
At Meta, we’ve been investing in advanced AI-enabled detection mechanisms to help us ensure we’re upholding our responsibility to keep our products and services safe for everyone while adhering to compliance obligations at scale. AI-powered solutions help our engineers, developers, and product teams meet global regulatory requirements more easily and efficiently so they can spend more time focusing on building new and innovative products and services.
Earlier this year, we released new research into leveraging large language models (LLMs) for mutation-guided test generation – where faults (mutants) are deliberately introduced into source code as a method of assessing how well a testing framework can detect those faults.
Meta’s Automated Compliance Hardening (ACH) tool combines automated test generation techniques with the capabilities of LLMs to generate highly relevant mutants for testing, as well as tests that are guaranteed to catch those mutants. Through simple, plain-text prompts where engineers describe the mutant to test, ACH makes this process intuitive and reliable. It’s one of our latest AI-powered detection mechanisms that helps us safeguard our operations and catch code that is out of compliance. With ACH we can more easily and proactively identify bugs that would negatively impact our compliance, and prevent them from entering our systems in the future. This technology provides Meta engineers and our product teams with the consistency and confidence they need to ensure our codebase remains risk-resilient.
Since empowering ACH with our research findings, we’ve presented our work in keynotes at FSE 2025 and EuroSTAR 2025. Our presentations shared insights into how we’ve used LLMs to overcome the major barriers that have prevented mutation testing at scale, and highlighted new areas in automated software testing where LLMs can have a significant impact.
For a long time, people thought of mutation testing as a way of assessing test quality rather than as a way to generate tests. By leveraging generative AI, we’ve been able to make what studies have consistently shown to be the most powerful form of software testing even more efficient and scalable.
The Challenge of Scaling Mutation Testing
The idea behind mutation testing is to go beyond traditional structural coverage criteria like statement coverage or branch coverage (which only show whether lines of code are run) to a more robust form of testing. Statement or branch coverage can miss a bug as long as the affected line still runs. Mutation testing instead checks whether the tests fail once a mutation has been inserted; if they still pass, the tests are not effectively checking the code’s behavior. As an example, ACH can simulate privacy faults that would introduce compliance risk (such as messages being shared with unintended audiences) to model a potential real-world issue. It then creates unit tests to catch these bugs, preventing them from reaching production, even if they’re reintroduced in future code changes.
Even though mutation testing cannot exist on its own (it requires a test to already exist), it helps engineers and developers identify weak assertions and encourages them to write tests that truly validate code behavior instead of just executing it.
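The difference between executing code and actually checking it can be made concrete with a small sketch. Everything in it is hypothetical and for illustration only: the `can_view` function, the hand-written mutant, and both tests are assumptions, not ACH code or output. The mutant simulates a privacy fault in which a message becomes visible to anyone.

```python
# Illustrative sketch only: all names here are hypothetical, not part of ACH.

def can_view(message_audience, viewer):
    """Original code: only members of the intended audience may view."""
    return viewer in message_audience


def can_view_mutant(message_audience, viewer):
    """Mutant: a simulated privacy fault that lets anyone view the message."""
    return True


def weak_test(impl):
    # Runs every statement of the function under test (full statement
    # coverage) but only checks the case where access is allowed.
    assert impl({"alice", "bob"}, "alice")


def strong_test(impl):
    # Also checks that a viewer outside the audience is rejected.
    assert impl({"alice", "bob"}, "alice")
    assert not impl({"alice", "bob"}, "mallory")


def kills(test, impl):
    """Return True if the test fails on impl, i.e. the mutant is killed."""
    try:
        test(impl)
    except AssertionError:
        return True
    return False


if __name__ == "__main__":
    print("weak test kills mutant:", kills(weak_test, can_view_mutant))
    print("strong test kills mutant:", kills(strong_test, can_view_mutant))
```

Both tests pass on the original code and both reach full statement coverage of it, yet only the second one fails on the mutant. The surviving mutant is what exposes the weak assertion in the first test.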
In practice, however, mutation testing has been notoriously difficult to deploy. Despite over five decades of research, mutation testing has traditionally faced five major barriers.
1. Mutation Testing Isn’t Scalable
Traditional mutation testing generates a very large number of mutants, making it computationally expensive and difficult to scale to large industrial codebases. The sheer volume of mutants can overwhelm testing infrastructure and slow down development cycles.
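To give a rough feel for where the volume comes from, here is a minimal sketch that counts the mutants a simple rule-based approach would produce by swapping operators. The `SWAPS` table, the sample function, and the cost estimate are assumptions made for illustration; real mutation tools use their own, usually much larger, sets of mutation operators.

```python
import ast

# Hypothetical operator-replacement rules: each operator maps to the
# alternatives a rule-based mutator would try in its place.
SWAPS = {
    ast.Add: [ast.Sub, ast.Mult],
    ast.Sub: [ast.Add, ast.Mult],
    ast.Mult: [ast.Add, ast.Sub],
    ast.Gt: [ast.GtE, ast.Lt],
    ast.Lt: [ast.LtE, ast.Gt],
}

# Hypothetical code under test.
SOURCE = """
def price(qty, unit, limit):
    total = qty * unit
    if total > limit:
        total = total - 1
    return total + 2
"""


def count_mutants(source):
    """Count the single-operator mutants the SWAPS rules would generate."""
    total = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp):
            total += len(SWAPS.get(type(node.op), []))
        elif isinstance(node, ast.Compare):
            for op in node.ops:
                total += len(SWAPS.get(type(op), []))
    return total


def worst_case_test_runs(mutants, tests):
    """Upper bound on executions if every test runs against every mutant."""
    return mutants * tests


if __name__ == "__main__":
    mutants = count_mutants(SOURCE)
    print("mutants:", mutants)
    print("worst-case test runs with 50 tests:", worst_case_test_runs(mutants, 50))
```

Even this small function yields 8 mutants under five simple rules. Since each mutant may need its own test run, the cost grows with both the size of the codebase and the size of the test suite.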
2. Mutation Testing Can Create...