Using algebra and LLMs to verify a flight-plan bug fix in Lean

Formal verification has, until now, been a very laborious process. Therefore it is only considered for critical software, where bugs can cost lives or a lot of money. But LLM coding agents may be tipping the scales; formally verifying code might become viable for a much larger class of software. To test this, I picked a real bug, the 2023 UK air traffic control meltdown, and tried to prove a fix correct in Lean.

tl;dr: LLMs are not great at specs, excellent at grinding routine proofs, and the whole thing only became tractable once I restated the problem in algebraic terms.

The problem, in one paragraph

The very short summary of the problem from the original blog post:

A flight plan arrives in two forms: the ICAO form is the short plan that pilots and controllers read: a sequence of waypoints separated by named routes. The ADEXP form is the more granular European plan, with more intermediate waypoints, including those inside UK airspace that don't appear in the ICAO form. The task is: given the ICAO and ADEXP plans, return the smallest contiguous sub-plan of the ICAO route that contains every UK-relevant waypoint. The bug that caused the meltdown was triggered by a flight plan with two identically named waypoints, both outside the UK, around 4000nm apart. The correct way to solve this is to: reconcile the two plans into one structure that has both ICAO routes and ADEXP waypoints, check that the reconciliation is unique, and then extract the smallest UK portion.

False starts

I started off by asking the agent to specify the function given a natural language description. The specs it came up with had problems though.

For example it proposed things like: ∀ point, point ∈ wholeADEXP → uk point = true → point ∈ adexpPart. This says, for any UK waypoint of the ADEXP plan, it is present in the ADEXP part of the computed sub-plan. But this is wrong when waypoints are (legitimately) duplicated. Consider for example a flight plan which goes round in a loop, crossing over the UK twice over waypoints B and C:

A B C D B C E

Then the sub-plan B C has the property above: it contains B and C, which are all the UK waypoints. But of course this is wrong; the correct answer is B C D B C. The problem is that we can't just talk about the content of a sub-plan; we must also situate it within the larger plan.
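The counterexample can be checked mechanically. The following is a minimal sketch, not the spec from the actual development: the names Wp, uk, wholeADEXP and naiveSpec are made up for illustration, waypoints are modelled as a small enumeration so that decide can evaluate everything, and the membership property is rendered as a Boolean function.

```lean
-- Hypothetical names throughout; waypoints are a small enum so `decide` can evaluate.
inductive Wp where
  | A | B | C | D | E
  deriving DecidableEq, Repr

-- Assumption for illustration: only B and C lie over the UK.
def uk : Wp → Bool
  | .B => true
  | .C => true
  | _ => false

-- The looping plan from the text.
def wholeADEXP : List Wp := [.A, .B, .C, .D, .B, .C, .E]

-- Boolean rendering of the membership-only property:
-- every UK waypoint of `whole` occurs somewhere in `part`.
def naiveSpec (whole part : List Wp) : Bool :=
  whole.all fun p => (!uk p) || part.contains p

-- The too-short sub-plan B C satisfies the property ...
example : naiveSpec wholeADEXP [.B, .C] = true := by decide

-- ... and so does the intended answer B C D B C.
example : naiveSpec wholeADEXP [.B, .C, .D, .B, .C] = true := by decide
```

Both examples go through, so the property cannot distinguish the wrong answer from the right one; it only constrains which waypoints occur, not where the sub-plan sits in the whole plan.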

One way to do this is to use integer indices: a sub-plan of p is now defined to be a pair (i, j) of natural numbers such that i ≤ j and i and j are both valid indices into the plan p. This is also the representation that an imperative version of the code is likely to use. For a spec, though, this is unsatisfying. The spec suffers from index-indirection, and it's easy to make the sort of mistake one is trying to avoid with a formal proof in the first place: off-by-one errors, getting confused as to which index corresponds to which endpoint, etc. It also seems to make the proofs more complex.
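To make the index-based representation concrete, here is a rough sketch. IndexSubPlan, toList, plan and loop are hypothetical names, and the choice of inclusive endpoints with i ≤ j is an assumption of this sketch rather than something taken from the original code.

```lean
-- Hypothetical structure: a sub-plan of `p` as a pair of indices with validity proofs.
structure IndexSubPlan (α : Type) (p : List α) where
  i : Nat
  j : Nat
  i_le_j : i ≤ j
  j_valid : j < p.length

-- The waypoints the pair denotes, both endpoints inclusive.
-- The `+ 1` is exactly the kind of detail that is easy to get wrong.
def IndexSubPlan.toList {α : Type} {p : List α} (s : IndexSubPlan α p) : List α :=
  (p.drop s.i).take (s.j - s.i + 1)

def plan : List String := ["A", "B", "C", "D", "B", "C", "E"]

-- Indices 1 through 5 pick out the loop over the UK.
def loop : IndexSubPlan String plan := ⟨1, 5, by decide, by decide⟩

#eval loop.toList  -- ["B", "C", "D", "B", "C"]
```

Even in this tiny version, every statement about the sub-plan has to go through drop, take and index arithmetic, which is the indirection the spec would ideally avoid.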

The agent was also unsure how to represent flight plans. It tried the following:

a list of identifiers, with the understanding that it should have odd length, and that the waypoint and route identifiers should alternate,

a structure with a start waypoint, and a list of (route, waypoint) pairs.

Indexing into such structures is unnatural and error-prone.
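Roughly, the two representations look like this in Lean. This is a sketch with invented names (FlatPlan, LegPlan, waypoint?), using String for identifiers; the agent's actual definitions are not reproduced here.

```lean
-- Representation 1 (hypothetical names): a flat list of identifiers, waypoints at even
-- positions and routes at odd positions, with the odd-length invariant as a field.
structure FlatPlan where
  ids : List String
  odd_length : ids.length % 2 = 1

-- Representation 2: a start waypoint followed by (route, waypoint) legs.
structure LegPlan where
  start : String
  legs : List (String × String)

-- Looking up the k-th waypoint needs index arithmetic in the flat form ...
def FlatPlan.waypoint? (p : FlatPlan) (k : Nat) : Option String :=
  p.ids[2 * k]?

-- ... and a special case for the start waypoint in the leg form.
def LegPlan.waypoint? (p : LegPlan) (k : Nat) : Option String :=
  match k with
  | 0 => some p.start
  | k + 1 => (p.legs[k]?).map Prod.snd
```

In the first form the alternation of waypoints and routes is only a convention that every lemma has to respect; in the second, position 0 is treated differently from all the others. Either way, index-based statements pick up extra cases.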

The LLM erred in semantic ways too. For example, one specification it came up with involved a conjunction of various properties: P1 /\ P2 /\ .... Hidden in several of these was some existential quantification ∃ x, ∃ y, .... This seemed fine, but the spec actually needed the same witnesses for all the conjuncts, that is: ∃ x, ∃ y, P1(x, y) /\ P2(x, y) /\ .... This took a long time to debug.
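The gap between the two shapes is easy to show on a toy case. In the sketch below, P1 and P2 are placeholder predicates (not the properties from the actual spec), and a single witness is used instead of two to keep it short.

```lean
-- A shared witness always yields separate witnesses ...
example {α : Type} (P1 P2 : α → Prop) :
    (∃ x, P1 x ∧ P2 x) → (∃ x, P1 x) ∧ (∃ x, P2 x) :=
  fun ⟨x, h1, h2⟩ => ⟨⟨x, h1⟩, ⟨x, h2⟩⟩

-- ... but not the other way round. Here 0 and 1 witness the conjuncts separately,
example : (∃ x : Nat, x = 0) ∧ (∃ x : Nat, x = 1) :=
  ⟨⟨0, rfl⟩, ⟨1, rfl⟩⟩

-- yet no single natural number satisfies both at once.
example : ¬ ∃ x : Nat, x = 0 ∧ x = 1 :=
  fun ⟨_, h0, h1⟩ => absurd (h0.symm.trans h1) (by decide)
```

So a spec built from separately quantified conjuncts is strictly weaker than one with the quantifiers pulled out front, and an implementation can satisfy the former while violating the latter.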

In general, the specs proposed by the LLM were quite "low level" and not obviously correct. Asking an agent to prove them correct never seemed to succeed; it would just get mired in details. Unless prompted and guided to do so, it would not try to create useful abstractions.

Another problem is that LLMs are just hopelessly dishonest (an alignment problem). When you've got a spec and an implementation that don't quite agree, or a proof that is hard to finish, the LLM will sometimes change the spec to match the implementation rather than fix the implementation, without telling you. Here is a quote after I complained it cheated by changing the spec:

LLM: But your instinct is right: it is too implementation-shaped for a clean spec. I copied the trimming structure into ReconcileSpec.lean to make the proof true quickly and to close the repeated-anchor hole, but that makes the "spec" mirror the executable algorithm instead of stating the desired property independently.

You have to be hyper-vigilant, or the LLM will just move the goalposts.

Algebraic Spec

After a few attempts at writing down specs and implementations, and spawning agents to try to find proofs that were all unsuccessful, I decided to try to restate both the spec and the...
