AI and Math: State of the Art as of June 2026
Title
AI
Research-level mathematics
June 2026
AI and Math: State of the Art
From Olympiad gold to Erdős problems, First Proof, Lean, and human verification.
Concise factual briefing
Summer 2025 → June 2026
Map
What changed in one year?
01 · Contest reasoning crossed gold level
02 · Erdős problems became the live testbed
03 · AI disproved a famous Erdős conjecture
04 · Formal proof search scaled in Lean
05 · Failure modes and costs were documented
06 · Tao and Gowers revised their priors
Source: Tao GitHub wiki, OpenAI unit-distance disproof, First Proof Second Batch
The Year in Events
01
Part One · Timeline
The Year in Events
July 2025 → June 2026: from Olympiad gold to a disproved Erdős conjecture, told as a sequence of dated, sourced milestones.
July 2025
JUL 21 2025 · INTERNATIONAL MATHEMATICAL OLYMPIAD
Gemini Deep Think and an experimental OpenAI model reached gold-medal level.
Natural-language proofs, five of six problems.
35 / 42
gold threshold performance
In July 2025 both Google DeepMind and OpenAI reported gold-medal-level performance at the International Mathematical Olympiad, producing natural-language proofs and solving five of the six problems. At 7 points per problem that is 35/42, which matched the 2025 gold-medal cutoff.
There is a verification asymmetry worth flagging: DeepMind’s run with an advanced version of Gemini Deep Think was officially graded by the IMO coordinators, whereas OpenAI’s gold-level result came from its own evaluation, which the official coordinators did not grade.
The key interpretive caveat: contest success motivated the later push toward research-level mathematics, but it does not equal research-level capability. Olympiad problems are curated, bounded, and have known answer structures; research problems require interpretation, judgment of relevance, and relation to existing literature. Treat these results as background for the later formal-proof-search work, not as the research-level evidence this briefing centers on.
Source: Google DeepMind IMO 2025, Axios summary
October 2025
OCT 2025 · THE CAUTIONARY CASE
GPT-5 “solved ten open problems” — by finding ten papers.
Literature search, not new mathematics.
10
existing papers found — zero new proofs
The sequence: in October 2025 OpenAI’s Kevin Weil posted that GPT-5 had “found solutions to 10 (!) previously unsolved Erdős problems and made progress on 11 others,” and Sébastien Bubeck amplified similar claims.
Thomas Bloom, who maintains erdosproblems.com, called it “a dramatic misrepresentation.” The subtlety is what “open” means in his database: only that he personally had not seen a paper solving the problem, not that it had resisted the field for decades. GPT-5 had done an effective literature search and surfaced existing published papers that Bloom had missed. Bubeck conceded that “only solutions in the literature were found,” Weil deleted his post, and Demis Hassabis called the episode “embarrassing.”
This is the cleanest cautionary tale in the whole subject: “AI found a solution” can quietly mean “AI found a paper.” It directly motivated the careful verification protocols — Lean formalization, human-verified companion papers — used in the genuine 2026 results that follow. Note the model here was plain GPT-5; the legitimate later Erdős solves used more advanced models (GPT-5.2 Pro, GPT-5.4 Pro).
Source: erdosproblems.com (Bloom’s database), Tao — AI contributions to Erdős problems
November 2025
NOV 3 2025 · ALPHAEVOLVE MATH PAPER
AlphaEvolve moved from coding agent to math explorer.
Construction search, not theorem proving.
67
problems across analysis, combinatorics, geometry, number theory
What AlphaEvolve actually is: a Gemini-powered evolutionary coding agent. It writes and iteratively mutates Python programs that search for mathematical constructions — explicit objects like point sets, sequences, packings, matrices — and scores each candidate with a cheap automated evaluator. The loop keeps the high-scoring programs and mutates them further. So the core activity is searching the space of constructions to optimize a numerical objective.
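The keep-and-mutate loop can be illustrated with a toy sketch. This is not AlphaEvolve's code: the real system mutates Python programs with a language model, whereas this simplified version mutates the candidate objects directly. The target problem (a large subset of {1, ..., n} with no three-term arithmetic progression) and every name here (`is_ap_free`, `score`, `mutate`, `evolve`) are assumptions chosen for illustration.

```python
import random


def is_ap_free(candidate):
    """Return True if no three distinct elements form an arithmetic progression."""
    for a in candidate:
        for b in candidate:
            # a < b < 2b - a would be a three-term progression a, b, 2b - a.
            if a < b and (2 * b - a) in candidate:
                return False
    return True


def score(candidate):
    """Cheap automated evaluator: the set's size, or -1 if it is invalid."""
    return len(candidate) if is_ap_free(candidate) else -1


def mutate(candidate, n, rng):
    """Toggle the membership of one random element of {1, ..., n}."""
    return candidate ^ frozenset([rng.randint(1, n)])


def evolve(n=20, population_size=8, generations=200, seed=0):
    """Keep the highest-scoring candidates each round and mutate them further."""
    rng = random.Random(seed)
    population = [frozenset()]
    for _ in range(generations):
        children = [mutate(parent, n, rng) for parent in population]
        # Deduplicate while preserving order, then keep the top scorers.
        pool = list(dict.fromkeys(population + children))
        population = sorted(pool, key=score, reverse=True)[:population_size]
    return population[0]


best = evolve()
print(sorted(best), score(best))
```

Parents stay in the pool alongside their children, so the best score never decreases from one generation to the next. The evaluator is the only notion of quality the loop has, so everything depends on it checking the right property.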
A better construction often directly yields a better bound: a denser packing raises a lower bound, a smaller configuration lowers an upper bound. In the paper by Georgiev, Gómez-Serrano, Tao, and Wagner (67 problems across analysis, combinatorics, geometry, and number theory), AlphaEvolve rediscovered the best-known construction in most cases and improved on it in several.
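As a small worked instance of "a construction is a bound", consider subsets of {1, ..., n} that contain no three-term arithmetic progression. Any explicit valid set with k elements proves that the largest such set has at least k elements. The sketch below checks a candidate independently of whatever search produced it. The function names are hypothetical and the example problem is an assumption for illustration, not one taken from the paper.

```python
from itertools import combinations


def has_three_term_ap(values):
    """Return True if some a < b < c in values satisfies a + c == 2 * b."""
    return any(a + c == 2 * b for a, b, c in combinations(sorted(values), 3))


def certified_lower_bound(construction, n):
    """Check a candidate set independently and return the bound it certifies.

    A valid progression-free subset of {1, ..., n} with k elements proves
    that the largest such subset has size at least k.
    """
    values = set(construction)
    if not all(isinstance(v, int) and 1 <= v <= n for v in values):
        raise ValueError("construction must lie in {1, ..., n}")
    if has_three_term_ap(values):
        raise ValueError("construction contains a three-term progression")
    return len(values)


print(certified_lower_bound([1, 2, 4, 5, 10, 11, 13, 14], 14))  # 8
```

The check is exhaustive over all triples, so a returned value is a certificate for the lower bound only. It says nothing about whether a larger set exists, which would require a proof rather than a search.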
Crucial caveat: this is construction search, not theorem proving. It produces objects, not proofs. And because it optimizes against an automated evaluator, it is “extremely good at locating exploits” in a weak verifier — specification gaming. It only counts as mathematics when the objective is sound and the construction is then checked or proved by a human or proof system. The 4×4 matrix-multiplication headline (a 48-multiplication algorithm) should be stated carefully — it is a construction in a specific algebraic setting, not an...