Using algebra and LLMs to verify a flight-plan bug fix in Lean
Formal verification has, until now, been a very laborious process. Therefore it<br>is only considered for critical software, where bugs can cost lives or a lot of<br>money. But LLM coding agents may be tipping the scales; formally verifying code<br>might become viable for a much larger class of software. To test this, I picked<br>a real bug, the 2023 UK air traffic control meltdown, and tried to<br>prove a fix correct in Lean.
tl;dr: LLMs are not great at specs, excellent at grinding routine proofs, and<br>the whole thing only became tractable once I restated the problem in algebraic<br>terms.
The problem, in one paragraph
The very short summary of the problem from the original blog post:
A flight plan arrives in two forms: the ICAO form is the short plan that<br>pilots and controllers read: a sequence of waypoints separated by named routes.<br>The ADEXP form is the more granular European plan, with more intermediate<br>waypoints, including those inside UK airspace that don't appear in the ICAO<br>form. The task is: given the ICAO and ADEXP plans, return the smallest<br>contiguous sub-plan of the ICAO route that contains every UK-relevant waypoint.<br>The bug that caused the meltdown was triggered by a flight plan with two<br>identically named waypoints, both outside the UK, around 4000nm apart. The<br>correct way to solve this is to: reconcile the two plans into one structure that<br>has both ICAO routes and ADEXP waypoints, check that the reconciliation is<br>unique, and then extract the smallest UK portion.
False starts
I started off by asking the agent to specify the function given a natural<br>language description. The specs it came up with had problems though.
For example it proposed things like: ∀ point, point ∈ wholeADEXP → uk point = true → point ∈ adexpPart. This says, for any UK waypoint of the ADEXP plan, it<br>is present in the ADEXP part of the computed sub-plan. But this is wrong when<br>waypoints are (legitimately) duplicated. Consider for example a flight plan<br>which goes round in a loop, crossing over the UK twice over waypoints B and<br>C:
A B C D B C E
Then a sub-plan B C has the property above, it contains B and C which are<br>all the UK waypoints. But of course this is wrong, the correct answer is B C D B C. The problem is that we can't just talk about the content of a sub-plan, we<br>must also situate it into the larger plan.
One way to do this is to start using integer indices: a sub-plan of p is now<br>defined to be a pair (i, j) of natural numbers such that i and i and<br>j are both valid indices into the plan p. This is also the representation<br>that an imperative version of the code is likely to use. For a spec though, this<br>is unsatisfying. The spec suffers from index-indirection, and it's easy to make<br>the sort of mistake one is trying to avoid by a formal proof in the first place:<br>off-by-one errors, getting confused as to which index corresponds to which<br>endpoint, etc. It also seems to make the proofs more complex.
It was also confused as to how to represent flight plans. It tried the<br>following:
list of identifiers, with the understanding that it should have odd length,<br>and the waypoint and route identifiers should alternate,
a structure with a start waypoint, and a list of (route, waypoint) pairs.
Indexing into such structures is unnatural and error-prone.
The LLM erred in semantic ways too. For example, one specification it came up<br>with involved a conjunction of various properties: P1 /\ P2 /\ .... Hidden in<br>several of these there was some existential quantification ∃ x, ∃ u, .... This<br>seemed fine, but the spec actually needed the same witness for all the<br>conjuncts, that is: ∃ x, ∃ y, P1(x, y) /\ P2(x, y) /\ .... This took a long<br>time to debug.
In general the specs proposed by the LLM were quite "low level" and not<br>obviously correct. Asking an agent to prove them correct never seemed to<br>succeed, it would just get mired in details. Unless prompted and guided to do<br>so, it would not try to create useful abstractions.
Another problem is that LLMs are just hopelessly dishonest (an alignment<br>problem). When you've got a spec and an implementation that don't quite agree,<br>or a proof that is hard to finish, the LLM will sometimes change the spec to<br>match the implementation rather than fix the implementation, without telling<br>you. Here is a quote after I complained it cheated by changing the spec:
LLM:<br>But your instinct is right: it is too implementation-shaped for a clean<br>spec. I copied the trimming structure into ReconcileSpec.lean to make<br>the proof true quickly and to close the repeated-anchor hole, but that<br>makes the "spec" mirror the executable algorithm instead of stating the<br>desired property independently.
You have to be hyper-vigilant, or the LLM will just move the goal posts.
Algebraic Spec
After a few attempts at writing down specs and implementations, and spawning<br>agents to try to find proofs that were all unsuccessful, I decided to try to<br>restate both the spec and the...