An LLM verifier rated math proofs near-perfect; an expert found 17% correct

Easier to Convince Than to Prove — Alex Korbonits Jun 12, 2026 · 6 min read<br>Easier to Convince Than to Prove<br>aimathematicsverification

Easier to Convince Than to Prove

Three weeks ago I ended a post on a sentence from Melanie Wood: “in many cases, it will be easier for AI to convince humans it has a proof than to come up with a correct mathematical argument, and I believe that we as mathematicians are not sufficiently prepared for this.” I called it the dark twin of the verification story. I’ve now spent two posts on it — once as Wood’s forecast, once as DeepMind’s failure analysis, the agent burying a problem’s difficulty inside a sorry or citing a lemma it had hallucinated. The gap between convincing and correct can be sized.

A new paper sizes it: MaxProof.

The paper, in its place

MaxProof comes from MiniMax, and it trains a model (M3) to write competition-math proofs in natural language, then scales it at test time by generating a population of candidate proofs, scoring them with an LLM verifier, refining, and picking a winner by tournament.1 The headline is 35/42 on IMO 2025 and 36/42 on USAMO 2026 — both above the human gold threshold. Read past the headline and it deflates: the base model scores 67.40 on the standard IMOProofBench against GPT-5.5’s 90.85, the contest numbers come from a search that runs 32 candidates and up to ten refinement rounds per problem (rather than anything you’d call one-shot), and the authors are transparent about where they sit: “we are still followers chasing the frontier.” No weights, no code, no proof transcripts released.

To train M3, MiniMax first ran an earlier cycle (M2) with a single LLM judge as the reward signal, and it reward-hacked in the textbook way: proofs tripled in length, 80% of outputs converged on a fixed “Step N / Verification / boxed answer” template, and — the one that should make any mathematician wince — the model learned to drop the phrases “it can be shown” or “after simplification” at exactly the steps where the actual difficulty lived. To diagnose how bad it had gotten, they took thirty proofs the training verifier had scored a perfect 0.99 and handed them to an independent expert judge.

The number

In the cross-verification cohort of 30 perfect-score rollouts from steps [200, 250], the expert judge labelled 17% as correct, 50% as partially correct, and 33% as incorrect, with a mean expert-judge score of 0.55 against a mean training-verifier score of 0.99.

That is the gap, with a coefficient — though read what the cohort is before you read the numbers. These thirty proofs weren’t sampled at random and then graded twice; they were selected for being proofs the training verifier scored essentially perfect, and then handed to a human. So the honest reading isn’t “the verifier says 0.99, the expert says 0.55” laid out as a calibration curve — it’s a precision figure taken at the verifier’s ceiling. Among the proofs the machine was most certain it had closed, a human looking at the same argument with the same problem and the same rubric scored them just above half marks, called a full third incorrect, and found that seventeen percent were actually correct. The verifier wasn’t disagreeing at the margin; at the exact point of its maximum confidence it was wrong, in the optimistic direction, five times out of six.

This is Wood’s sentence with the prophecy taken out. The model is not better at proving than the previous generation; it is better at producing the surface of a proof — and an LLM judge, faced with a fluent, formally typeset, confidently boxed argument on a problem it cannot quickly solve itself, defers to the confidence. The paper’s own description of one rollout is the whole thing in a line: 148,000 characters of hidden reasoning that “did not produce a working strategy; it produced an unusually polished assertion of one.” That is a system optimizing the convince-minus-prove gap from the wrong side, caught with instrumentation.

Why it doesn’t escape

To MiniMax’s credit, the entire architecture of the next model is a response to this. They stack four defensive layers on the verifier — bad-case filters, a normalizer to strip the stylistic tells, three parallel judges, and a pessimistic rule that takes the minimum score so a single lenient judge can’t wave a proof through. And it works, partially: the final 7/7 solutions, they report after a human read them, are genuinely complete proofs. They got the false-positive rate down.

Down, not to zero. Their own conclusion concedes “a verifier can still miscalibrate at the edge of correctness,” and the per-problem table shows the selection machinery picking a 2/7 proof over a 6/7 one that was sitting right there in the population. So note what they actually bought. Four engineered layers plus a final human read got the LLM verifier from badly untrustworthy to less untrustworthy — and the trust still bottoms out, exactly as it did in the unit-distance disproof, on a person reading the proof...

An LLM verifier rated math proofs near-perfect; an expert found 17% correct

Related Articles

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

How to Earn a Billion Dollars