Who verifies the verifier? Notes on DeepMind's formal proof-search paper

Who Verifies the Verifier — Alex Korbonits May 28, 2026 · 14 min read<br>Who Verifies the Verifier<br>aimathematicsverificationlean

Who Verifies the Verifier

Three days ago I published an essay that ended on a problem I couldn’t see a solution to. The short version: an AI had disproved one of Erdős’s conjectures, the result was real, and it was verified — but verified in the only way we currently know how, which is that nine of the world’s relevant experts — one of them a Fields medalist — read a hundred-page argument over a weekend and put their names on it. That doesn’t scale. The supply of plausible-looking proofs is about to go parabolic and the supply of expert weekends is fixed, so the binding constraint isn’t generating proofs: it’s checking them. I argued that formal verification — proofs machine-checked in a language like Lean — was the only thing that obviously scales, and then I admitted the catch: nobody had formalized that proof, and the tooling that would make formalization routine doesn’t exist. “We are,” I wrote, “relying on a process that everyone involved can see won’t hold.” I didn’t just write about this last week after the OpenAI announcement. I’ve been thinking about it since 2023.

What I didn’t know is that two days before I hit publish, Google DeepMind had posted the machine I was describing.

The paper is Advancing Mathematics Research with AI-Driven Formal Proof Search, submitted May 21. The headline numbers are the kind that get screenshotted: an agent that autonomously resolved 9 of 353 open Erdős problems, proved 44 of 492 open conjectures from the OEIS, and did it at an inference cost of a few hundred dollars per problem. The architecture is simple: a language model generates a candidate proof in Lean, the Lean compiler checks it, the error messages feed the next attempt, repeat until the compiler is satisfied. A generate-and-verify loop.

If you read my last post, this looks like a potential answer. The verification layer that saved the unit-distance disproof was nine scarce humans. The verification layer here is a compiler that costs cents and never gets tired.

Two frontiers

Both results claim “AI does Erdős.” They are not the same arc.

The unit-distance disproof was: generate deep mathematics in natural language, then have humans verify it. The hard part was the mathematics, and the verification was the expensive, unscalable, human step.

This paper is: generate the proof natively in Lean from the start, and let the compiler verify each step as it goes. The mathematics is shallower (for now), and the verification is free.

These are two different frontiers. DeepMind ran their agent against the 353 Erdős problems that already had Lean formalizations in their open-source Formal Conjectures repository. The problems the machine can attack are, by construction, the ones already expressible in Lean. The unit-distance disproof is not in that set. Its proof imports infinite class field towers and Golod–Shafarevich theory, machinery with no mature support in Lean’s library. You cannot point this agent at it, because there is no formal statement to point it at.

Native Lean generation is the tractable frontier, and it is advancing fast and getting cheap. Autoformalizing the proofs humans actually write at research depth is the other frontier, and this paper — for all its real achievement — does not touch it.

Inside problem #125

I picked #125 because it is simple to state, and therefore easier to communicate to a lay audience.

Briefly: let A be the set of integers that use only the digits 0 and 1 when written in base 3, and let B be the same thing in base 4. Does the sumset A + B have positive lower density — that is, is there some fixed fraction c > 0 such that, for every sufficiently large N, at least a c-share of the integers below N lands in A + B? (“Lower” means the worst large scales must stay above the floor, not just some of them — a set can be thick at carefully chosen scales and thin at others, and lower density rules out the easier case.) The agent proved the answer is no: the lower density is zero.

The idea behind the proof is elegant, and it rides a single observation about two number systems that almost line up. The powers of 3 and the powers of 4 never coincide — 3^k is never exactly 4^m — because log 4 / log 3 is irrational. But they come arbitrarily close: for any tolerance you like, you can find exponents where 3^k and 4^m are within it. This near-miss is the crux of the whole argument. The proof exploits it to show that every time you zoom out by one of these matched scales, the sumset can capture at most 11/12 of the proportion it held at the previous scale. Apply that thinning d times and your density bound is (11/12) raised to the d — which marches to zero. The paper names the technique “Inductive thinning via Diophantine approx. 3^m ≈ 4^k,” and that one phrase is the approach to the solution.

Surprisingly, you can read that English paragraph straight off the...

Who verifies the verifier? Notes on DeepMind's formal proof-search paper

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy

SpaceX not the behemoth everyone thought

Naphtha Shortages Having a Growing Impact in Japan