Cheap code means formal verification is reasonable now

Cheap code means formal verification is reasonable now — Antfly Blog

Get Started

Back to BlogIt would be an understatement hardly worth uttering to say that coding agents are a big deal. But using them most effectively isn't exactly as simple as telling Claude to build you a SaaS product and make no mistakes. Collectively, as software engineers (or whatever you call this job these days), it's up to us to find ways to be most effective with them while minimizing harm to what we're building.

One of the best techniques I know is having your agent hill climb on verifiable problems. This can be as simple as giving your agent a hundred (or a thousand, or ten thousand) checkboxes to check, like your unit test suite, and having it make sure that every test passes. Just be sure to double-check that it didn't mark a bunch to be skipped, and if your test suite is decent, you can be reasonably confident that your code behaves correctly. Or, you can give the agent a metric and ask it to optimize it. You might not use any of the code it produces, but reading its devlog as to how it went about optimizing for that metric could uncover valuable information. We've used this technique successfully to improve queries per second in Antfly by orders of magnitude, for example.

This is why comprehensive conformance test suites can be so valuable. If they're actually comprehensive, then you have strong guarantees that you'll end up with something that actually does what you want it to (see Simon Willison's writeup here). Agents are also very good at using a pre-existing set of ideas and procedures to create their own hills to climb. Martin Kleppmann (author of Designing Data-Intensive Applications) made this argument in December by pointing out that formal languages would make excellent targets for coding agents. Essentially, formal verification of your codebase becomes vastly cheaper with good coding agents because formal languages have proof checkers and other ways of catching many of the types of hallucinations that language models create.

Enter TLA+. It's a formal spec language for modeling complex systems with many moving parts, with a model checker that exhaustively searches all possible states which are implied by the spec. It's perfect for finding the kinds of gnarly bugs like race conditions which can be hard to find other ways.

introducing the hill#

TLA+ is a way to formally model systems to check whether it's possible to reach certain states. For example, you could model an escape room and check whether it's possible for players to ever become stuck. You define possible starting points and transitions, like players discovering the Red Key in Room 1. If it's possible for some other part of the escape room to close access to the Red Key before players discover it, and the Red Key is necessary to progress, the TLA+ checker should discover that. Or you model participants in a distributed data storage system and check that it is impossible for data to be lost due to unresolved intents (ask me how I know).

Point is, this is very useful for distributed and concurrent systems like Antfly. There are all sorts of processes that are straightforward to model in isolation (shard A splits into shards B and C under conditions XYZ) but fiendishly complex when interacting with lots of other such components.

So our question was: how much value could we get out of modeling some or all of this in TLA+ without needing to write it or even read it ourselves? Sure, it might be more effective if we spent a long time studying the language's intricacies and carefully modeling our systems with it, but how much value could we get from just a few afternoons of modeling it with agents? The answer, it turns out, is quite a lot!

how it works#

Here's the workflow I eventually settled on. At first, I kept everything in one SKILL.md file that I directed the agents to use, but eventually I wrote my own CLI to scaffold it better.

Decide what we're actually focusing on. Generally the agent will have a list of hypotheses of problems we will be looking out for once I direct it to a section of the codebase we're interested in. I'll have it write up an assumptions.md and boundaries.md so it's forced to be clear about what is and is not in-bounds during the modeling step. This should exclude enough of the rest of the codebase so that we don't blow up the TLA checker (what actually checks all possible states of the system once we have written the spec) with a combinatorial number of pathways. We generally don't want to model third-party dependencies, operating system quirks, or cosmic rays hitting our VRAM, especially when the checker will happily write tens of gigabytes of system traces if your spec is sufficiently complicated. So it's worth getting clear on what we're excluding from our model now.

Have the agent write the model and run the checker. This is another gate where the agent can find flaws in its initial assumptions. If you know that in...

Cheap code means formal verification is reasonable now

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

SpaceX not the behemoth everyone thought

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self- Play