Riemann-bench | Surge AI
Mathematics at the frontier
Riemann-bench
We evaluate AI models on advanced mathematical problems requiring deep reasoning and novel synthesis. Our benchmark features problems from cutting-edge mathematics, sourced from leading mathematicians – Ivy League professors, PhD IMO medalists, graduate students at the top of their field – in the course of their research.
Model Rankings<br>Last updated 05/27/2026
Claude Fable 5 / Mythos 5: 55
GPT-5.5 (xHigh reasoning): 41.6
GPT-5.2 (xHigh reasoning): 32
Claude Opus 4.8: 25.6
Claude Opus 4.6: 22.4
Claude Opus 4.7: 20.8
Gemini 3.5 Flash (High Reasoning): 15.2
Gemini 3.1 (Pro): 15.2
Kimi K2.6: 10.4
Claude Opus 4.5: 10.4
Kimi K2.5
DeepSeek V4 (Flash): 5.6
Qwen 3.7 (Max): 4.8
DeepSeek v3.2 (Thinking): 4.8
DeepSeek V4 (Pro): 2.4
Extreme Difficulty,<br>Rigorous Verification
Robust Maximal Independent Sets
Problem
A robust maximal independent set in a graph $G$ is a maximal independent set that remains maximal in all connected spanning subgraphs of $G$. How many connected graphs on $12$ vertices have the property that every maximal independent set is a robust maximal independent set, up to isomorphism?
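To make the definition concrete, here is a minimal brute-force sketch in Python. It is not the benchmark's reference solution and is far too slow for 12 vertices; the function names and the edge-list representation (vertices `0..n-1`, edges as pairs) are assumptions made for illustration. It checks, for one small connected graph, whether every maximal independent set stays maximal in every connected spanning subgraph.

```python
from itertools import combinations


def _adjacency(n, edges):
    """Build an adjacency map for vertices 0..n-1 from a list of edge pairs."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def is_connected(n, edges):
    """Depth-first search from vertex 0; True if all n vertices are reached."""
    adj = _adjacency(n, edges)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for w in adj[u] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n


def is_maximal_independent(n, edges, s):
    """True if s is independent and every vertex outside s has a neighbour in s."""
    adj = _adjacency(n, edges)
    independent = all(not (adj[u] & s) for u in s)
    dominating = all(adj[v] & s for v in range(n) if v not in s)
    return independent and dominating


def maximal_independent_sets(n, edges):
    """Enumerate all maximal independent sets by testing every vertex subset."""
    return [
        set(c)
        for k in range(1, n + 1)
        for c in combinations(range(n), k)
        if is_maximal_independent(n, edges, set(c))
    ]


def connected_spanning_subgraphs(n, edges):
    """Yield every edge subset that keeps all n vertices connected."""
    for k in range(n - 1, len(edges) + 1):
        for sub in combinations(edges, k):
            if is_connected(n, sub):
                yield sub


def all_mis_robust(n, edges):
    """True if every maximal independent set of the graph is robust."""
    subgraphs = list(connected_spanning_subgraphs(n, edges))
    return all(
        is_maximal_independent(n, h, s)
        for s in maximal_independent_sets(n, edges)
        for h in subgraphs
    )


path3 = [(0, 1), (1, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(all_mis_robust(3, path3), all_mis_robust(3, triangle), all_mis_robust(4, cycle4))
```

A tree has only itself as a connected spanning subgraph, so the path passes trivially. The triangle fails because deleting one edge can leave a vertex with no neighbour in a singleton set, while the 4-cycle passes because each vertex outside a maximal independent set has both of its neighbours inside it.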
Hahn Series and Multibasic Modules
Problem
Notation and definitions for background context:<br>Let $F$ be the field of order 2. Let $K$ be the field of Hahn series in indeterminate $t$ with value group $\mathbb{R}$ and residue field $F$. Let $A$ be the subring of $K$ consisting of those $a \in K$ with non-negative valuation. Consider $K$ as an $A$-module. For $q \in \mathbb{R}$, let $I_q = t^q A$ and $I_{>q} = \bigcup_{r>q} I_r$. Write $A/I_{>0}$ as $F$, since they are identical both as $A$-modules and as fields. Let $\Theta = K/I_{>0}$ and $\Phi = K/A$. We say that an $A$-module $M$ is 'basic' if it is isomorphic to $L/N$ for some $N 0}\}$.<br>You may assume the following facts:<br>Fact 1: The decomposition of a multibasic $A$-module into basic submodules is unique up to the order of the summands.<br>Fact 2: If $M_i = L_i / N_i$ and $N_i Find the number of distinct isomorphism classes of multibasic $A$-modules $M$ satisfying the following conditions:<br>(i) $K \otimes \text{End}(M) = K$.<br>(ii) $F \otimes \text{End}(M) = F$.<br>(iii) Let $e_r = \dim_F(F \otimes I_r \text{Hom}(I_{>0}, M))$ for all real $r \ge 0$. Then $\lim_{p \to q^-} e_p = e_q$ for all real $q > 0$ except for integers $q$ with $29 \le q \le 328$.<br>If your answer is infinite, write -1.
Eynard-Orantin Topological Recursion
Problem
Consider the Eynard-Orantin topological recursion formalism for the spectral curve $(\mathbb{C}\mathbb{P}^1, x, y, \omega_{0,2}(x, y))$, where $x = t + 1/t$ and $y = t^3 / 3$, and the fundamental bidifferential is given by $\omega_{0,2}(x_1, x_2) = \frac{dz_1 dz_2}{(z_1 - z_2)^2}$, with $z_1, z_2 \in \mathbb{C}\mathbb{P}^1$. Note that $x$ has two simple ramification points at $\pm 1$ of order $2$ with deck transformation $\theta(t) = 1/t$.<br>Please calculate the free energy $F_2$ and return it as a rational fraction in the format $a/b$ for $a$ and $b$ coprime. Recall that the free energies $F_g$ can be computed as the following sum of residues: $F_g = \frac{1}{2g-2} \sum_{a \in \Delta} \text{Res}_{q=a} \Phi(q)\omega_{g,1}(q)$, where $\Phi(q) = \int_{o}^{q} y(t)dx(t)$, for an arbitrary base point $o$.
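As a small worked step (a sketch of the setup only, not the solution to the problem), the primitive $\Phi$ can be written out directly from the stated $x$ and $y$, with $q$ and the base point $o$ both read in the $t$ coordinate:

```latex
\begin{align*}
dx &= \left(1 - \frac{1}{t^{2}}\right) dt, \qquad dx = 0 \iff t = \pm 1, \\
y\,dx &= \frac{t^{3}}{3}\left(1 - \frac{1}{t^{2}}\right) dt = \frac{t^{3} - t}{3}\,dt, \\
\Phi(q) &= \int_{o}^{q} \frac{t^{3} - t}{3}\,dt = \frac{q^{4} - o^{4}}{12} - \frac{q^{2} - o^{2}}{6}.
\end{align*}
```

The zeros of $dx$ recover the two ramification points at $t = \pm 1$ named in the problem, and changing the base point $o$ shifts $\Phi$ only by a constant.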
Measuring progress along the<br>mathematical frontier.
Our Leaderboards
Creative, Business, and Everyday writing
Hemingway-bench
Stop rewarding slop. We take real-world writing tasks and put them in front of master wordsmiths. Our goal: to push AI writing from two-second vibes to genuine nuance and impact.
Models in rank order, with Elo score (95% CI):
Gemini 3.1 (Pro): 1087 (1068 to 1105)
Gemini 3 (Flash): 1079 (1062 to 1095)
Gemini 3 (Pro): 1074 (1051 to 1097)
Claude Opus 4.7 (Max), Anthropic: 1057 (1036 to 1078)
GPT-5.5, OpenAI: 1054 (1032 to 1076)
Claude Opus 4.6, Anthropic: 1054 (1035 to 1073)
DeepSeek V4 (Pro), High-Flyer: 1039 (1017 to 1060)
Claude Opus 4.5, Anthropic: 1038 (1019 to 1057)
DeepSeek V4 (Flash), High-Flyer: 1021 (999 to 1042)
GPT-5.2 (Chat), OpenAI: 1018 (1001 to 1035)
Kimi K2.5, Moonshot AI: 1018 (1000 to 1035)
Claude Sonnet 4.6, Anthropic: 1014 (995 to 1032)
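For readers who want to interpret the gaps between these scores, here is a minimal Python sketch. It assumes the conventional logistic Elo model with a scale of 400, which this page does not confirm as the method actually used; the function names are hypothetical. It converts a rating gap into an expected head-to-head preference rate and checks whether two reported 95% intervals overlap.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: the 400-point scale is the chess convention, not a value
    stated by the leaderboard.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def intervals_overlap(ci_a, ci_b):
    """True if two closed intervals (low, high) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Top entry versus the twelfth entry, using the scores listed above.
print(round(elo_expected_score(1087, 1014), 3))

# Gemini 3.1 (Pro) versus Gemini 3 (Flash), then versus Claude Sonnet 4.6.
print(intervals_overlap((1068, 1105), (1062, 1095)))
print(intervals_overlap((1068, 1105), (995, 1032)))
```

Overlapping intervals are only a rough signal that two models may not be separable at this sample size; they are not a formal significance test.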
Enterprise Agents in Realistic RL Environments
EnterpriseBench: CoreCraft
Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks. Our goal: to move agents beyond the cleanliness of the lab and into the chaos of enterprise reality.
Read Blog...