On the Unreasonable Effectiveness of Property-Based Testing for Validating Formal Specifications | Proofs and Intuitions
On the Unreasonable Effectiveness of Property-Based Testing for Validating Formal Specifications | Proofs and Intuitions
In this post, we show that property-based testing (PBT) is surprisingly effective for validating LLM-synthesised specifications of Lean programs: it is a cheap alternative to symbolic proofs, which helped to detect underspecification in 10% of the specs in state-of-the-art benchmarks for verified code generation.
Getting Program Specifications that are “Just Right”
Formal program verification and program synthesis are only as reliable as the specifications used to validate the programs in question. A specification is a mathematical contract between a programmer and a machine: it captures what a program is supposed to do, and the verifier (or the synthesiser) ensures the contract is respected when producing the implementation and the proof. If the contract is flawed, the result is meaningless—a program verified against a wrong spec does what the specification states, but not necessarily what the user wants.
It is easy to write a contract that is useless on purpose. In Hoare logic, a specification takes the form
\[\{P\}\ \texttt{program}\ \{Q\},\]
read as: if the precondition $P$ holds of the inputs, then after program runs, the postcondition $Q$ holds of the outputs. This is, for instance, the style of specification used by Velvet—a program verifier embedded in Lean that we discussed in one of previous posts. Setting $P \equiv \bot$ (i.e., $\mathit{false}$) makes the triple hold vacuously for any program: there are no inputs satisfying the precondition, so there is nothing to check, and even completely degenerate code is certified. Symmetrically, setting $Q \equiv \top$ ($\mathit{true}$) means that every possible output satisfies the postcondition, so once again any program will pass. The first specification’s precondition is too strong (it rules out every input), and the second specification’s postcondition is too weak (it rules out no output, making verification and synthesis useless, since such a program can return anything). Neither tells us anything about what the program should do.
The interesting middle ground—a specification that is just right—is an open research problem. A good specification must be precise enough to pin down the programmer’s intent, but not so precise that it inadvertently re-states the program itself. Consider a textbook task: sort a list of integers in ascending order. A specification that says “the output is the list produced by merge sort applied to the input” is, technically, a specification, but it has fused the what with the how: it commits to an algorithm and inherits all of its incidental details, defeating the purpose of having a specification in the first place. A good specification of sorting, by contrast, demands only two things of the result: it is in ascending order, and it is a permutation of the input. Whether the implementation runs merge sort, quicksort, gnome sort, bogosort, or sleepsort is now irrelevant—the specification abstracts over all of them. Producing this kind of clean, intent-capturing formal statement from an informal English description is the part that humans are, even today, still better at than machines.
Hence the conundrum. The most reliable way to write a “just right” specification is to put a human expert in the loop: someone who reads the formal statement, compares it to the informal intent, and corrects it. But reading formal specifications fluently requires considerable training—precisely the kind of cognitive overhead that has kept formal methods out of the mainstream. If we want certified programs to become widely accessible, we need to dramatically reduce the human effort required to write and validate specifications, without simply passing control to a large language model and hoping for the best.
Our goal, thus, is to produce formal specifications with minimal user involvement , and to identify principles for generating specifications that are close enough to a human’s intent to drive the synthesis of certified programs—i.e., programs that come bundled with their formal specifications and machine-checkable correctness proofs. This problem has only become more urgent: in a world where more and more of the code we run is produced by LLMs, we cannot afford to also delegate the synthesis of specifications to LLMs without a reliable, automated way to validate that the resulting specs are just right. The rest of this post discusses two such validation techniques and the trade-offs between them.
Symbolic Specification Testing
One natural approach is to validate an LLM-synthesised specification by proving it on a handful of representative inputs—using a verifier and an SMT solver in the loop. This idea has been actively explored over the past year by Shuvendu Lahiri, who frames the problem as...