TheoremGraph: Search 18M+ Mathematical Dependencies

ilreb1 pts0 comments

TheoremGraph — Mathematical Dependency Graph

Search 18 million+<br>mathematical dependencies<br>A unified statement-level dependency graph spanning both informal and formal mathematics, including 11.7 million arXiv statements linked to Mathlib through a shared embedding space.<br>Simon Kurgan, Evan Wang, Eric Leonen, Sophie Szeto, Luke Alexander, Artemii Remizov, Jarod Alper, Giovanni Inchiostro, Vasily Ilin<br>DatasetRead the paper →

11.7M<br>Theorems indexed

18.3M<br>Dependency edges

388k<br>Lean declarations

47,952<br>Formal-informal matches

Motivation<br>Mathematical knowledge is organized around statements and their dependencies, but this structure is exposed unevenly. Informal papers cite mostly at the document level, while formal proof assistants like Lean record fine-grained dependencies over a much smaller body of mathematics. This asymmetry limits attribution, duplication detection, and automated formalization.<br>For mathematicians , a unified dependency graph makes it possible to check whether a result is already known and trace exactly which lemmas a proof relies on. A 2024 study found that 2.4% of withdrawn arXiv submissions were self-identified as non-novel, which might have been identified earlier with improved cross-paper dependency tracking.<br>For AI agents , the graph gives neural theorem provers and autoformalization tools a access to mathematics as a graph. Instead of retrieving flat semantic neighbors, an agent can walk the graph to find connected lemmas and related formalizations across informal and formal spaces.

How It Works<br>1Parse informal statements. We extract over 11.7 million theorem-like environments from mathematics arXiv papers using a regex-based parser, and recover 18.3 million candidate directed dependency edges. We use deterministic, heuristic, and notation-based parsing methods, and label each edge with the parser type.

2Extract the formal graph. LeanGraph extracts typed declaration-level dependencies from Mathlib4 and 25 open-source Lean projects, yielding 388,105 nodes and 11.3 million typed edges across six semantic categories.

3Generate slogans and embed. Every formal and informal statement we extract is summarized into a concise natural-language slogan by Qwen3-235B and embedded with Qwen3-Embedding-8B into a shared semantic space.

4Bridging the corpora. A cross-modal nearest-neighbor sweep proposes (informal, formal) candidate matches, and a GPT-5.4 judge affirms 47,952 matches above a 0.8 cosine similarity floor.

Overview<br>Informal Dependency Extractors<br>EXTRACTOREDGESPRECISIONDeterministic5.23M98.8%Heuristic6.47M76.6%Notation7.88M42.7%Any (total)18.3M68.1%<br>Precision estimated by LLM judge (Kimi K2.5) on 500 sampled arXiv papers.

Formal Edge Types<br>TYPEDESCRIPTIONproofUsed inside a theorem proof termsigAppears in a type signaturedefUsed in a non-Prop definition bodyfieldReferenced in a structure field typeextendsStructure or class inheritancedocrefBacktick reference in a docstring<br>LeanGraph extracts 388,105 nodes and 11.3M typed edges across 25 Lean projects (Mathlib v4.27-v4.29 plus 24 community formalizations).

REST API<br>TheoremGraph provides a REST API for semantic search and dependency graph traversal.<br>curl "https://api.theoremsearch.com/graph/paper?external_id=2301.00001"Returns all statements and dependency edges for an arXiv paper or Lean repository. Full API reference →

MCP Tool<br>TheoremGraph is also available as an MCP tool for AI agents via a single theorem_search tool.<br>ENDPOINThttps://api.theoremsearch.com/mcp

dependency graph informal formal dependencies million

Related Articles