Singular Learning Theory: AI learns like ice melts – Zoltan's Blog
Singular Learning Theory: AI learns like ice melts
February 26, 2026
Neural networks should not work.
Intuitively, neural networks with billions or trillions of parameters should not be able to generalize; they should catastrophically overfit the training data. Yet deep neural networks have demonstrated remarkable generalization on a wide variety of tasks.
Why is that?
The cleanest mathematical story that explains it is called Singular Learning Theory (SLT). SLT might end up being a key mathematical tool for AI safety and interpretability in the future.
What is SLT and what problem is it solving? And what does it have to do with ice?
The problem
Neural networks cannot be analyzed with the statistical tools that work well for classical models like linear regression or logistic regression. Why not? Like many of us, they have degeneracies. In classical models, the loss landscape is curved in every direction around its minimum: the Fisher information matrix is full rank. With less jargon: wiggle any parameter and the loss changes smoothly. You can try it yourself on this simple, two-parameter linear regression toy model.
Small changes in the parameters change the loss gradually, as if you were moving on the surface of a smooth bowl. That’s not true for neural networks. Some neurons may have zero outgoing weights, so their incoming weights can change without affecting the output, and reordering the neurons within a layer doesn’t change the model either. The loss might not respond to wiggles of every parameter. This makes the model singular.
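To make the bowl-versus-degenerate contrast concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the toy model from the post: the two loss functions, the one-neuron `tanh` network, and the helper `numerical_hessian` are all hypothetical choices made for this example. It compares the rank of the loss curvature (Hessian) at a minimum for a two-parameter linear regression and for a network whose single neuron has a zero outgoing weight.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # synthetic inputs, assumed for illustration


def regular_loss(theta):
    # Linear regression y = a*x + b fitted to noise-free targets y = x,
    # so the minimum is at (a, b) = (1, 0).
    a, b = theta
    return np.mean((a * x + b - x) ** 2)


def singular_loss(theta):
    # One-neuron network f(x) = w_out * tanh(w_in * x) fitted to the zero
    # function. Any point with w_out = 0 is a minimum, whatever w_in is.
    w_in, w_out = theta
    return np.mean((w_out * np.tanh(w_in * x)) ** 2)


def numerical_hessian(f, theta, eps=1e-4):
    # Central finite-difference estimate of the matrix of second derivatives.
    theta = np.asarray(theta, dtype=float)
    d = theta.size
    hessian = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d)
            e_j = np.zeros(d)
            e_i[i] = eps
            e_j[j] = eps
            hessian[i, j] = (
                f(theta + e_i + e_j) - f(theta + e_i - e_j)
                - f(theta - e_i + e_j) + f(theta - e_i - e_j)
            ) / (4 * eps**2)
    return hessian


H_regular = numerical_hessian(regular_loss, [1.0, 0.0])
H_singular = numerical_hessian(singular_loss, [0.7, 0.0])

# Full rank means every direction has curvature; a rank deficit means
# at least one direction in which the loss is flat to second order.
rank_regular = np.linalg.matrix_rank(H_regular, tol=1e-6)
rank_singular = np.linalg.matrix_rank(H_singular, tol=1e-6)
print("rank of regular Hessian:", rank_regular)
print("rank of singular Hessian:", rank_singular)

# Wiggling the incoming weight of the dead neuron leaves the loss untouched.
print(singular_loss([0.7, 0.0]), singular_loss([5.0, 0.0]))
```

In the regression case both directions are curved. In the network case the incoming weight `w_in` can move freely while `w_out` is zero, which is the kind of degeneracy the paragraph above describes.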
In a classical model, all parameters are “effective” because they all affect the model’s function. The model’s complexity can be characterized by the number of these parameters, $d$. This number shows up in many places: the Bayesian free energy, the BIC penalty, the posterior entropy and so on. (Usually $d$ is divided by two. The factor of $1/2$ is a math artifact that falls out of the $d$-dimensional Gaussian. The conventional complexity metric is $d/2$.)
Neural networks are different: because of their degeneracies, the number of effective parameters is usually much lower than $d$. (Consequently, a typical neural network has a smaller effective complexity than $d/2$.) These degeneracies are the secret to why they actually generalize so well.
How so?
What is Singular Learning Theory?
SLT is a branch of statistics developed by Sumio Watanabe. It is the mathematical machinery that lets us analyze models whose parameters do not map one-to-one to the functions they represent. Put differently, models whose loss landscape is not a smooth bowl. Like a neural network.
The math, centered around algebraic geometry, is complicated, but the headline result, $\lambda$, is easy to understand. $\lambda$ denotes the real log canonical threshold (RLCT), which replaces $d/2$ as the measure of complexity in singular models. The theory also covers regular models, for which $\lambda = d/2$.
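For readers who want to see where $d/2$ and $\lambda$ enter, here is a sketch of the leading-order behavior of the Bayesian free energy in Watanabe's theory. The symbols are notation introduced for this sketch and do not appear elsewhere in the post: $F_n$ is the free energy on $n$ samples, $L_n$ is the average negative log-likelihood, and $w_0$ is an optimal parameter. Lower-order terms are dropped.

$$F_n \approx n L_n(w_0) + \frac{d}{2} \log n \quad \text{(regular model)}$$

$$F_n \approx n L_n(w_0) + \lambda \log n \quad \text{(singular model)}$$

The first line is the familiar BIC form. In the second, $\lambda$ takes the place of $d/2$, so at equal fit a model with a smaller $\lambda$ pays a smaller complexity penalty.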
For any model, the RLCT measures how much of the parameter space sits near minimum loss. If that makes no sense, you might find the hill-climbing metaphor of gradient descent useful here. Picture yourself sitting at some low-loss point on the loss landscape and looking around. Can you wander around without starting a steep climb back up? Are you in a fjord or a vale? The RLCT measures this. In practice, one estimates $\hat{\lambda}$, the local version of the RLCT, which measures the shape of the loss landscape around a specific point. It is also called the local learning coefficient (LLC).
Put simply, low LLC around a point means you can walk around without much climbing, while high LLC means the loss will quickly increase as you leave the point in the parameter space.
Some natural basins with different shapes. Top: Black Canyon of the Gunnison, Colorado (CC BY-SA 4.0 Terry Foote). In this 2D loss-surface metaphor the model is brittle: small steps away from the local minimum force you up a steep cliff. Bottom: Death Valley, California (Public Domain, National Parks Gallery). An analogy for a degenerate, robust model where you can wiggle parameters without big changes in the loss.
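The "how much room is there near the minimum" idea can be checked numerically on toy landscapes. The sketch below is an illustration only, and it is not how the LLC is estimated for real networks. It assumes two hypothetical two-parameter losses, `bowl` and `flat_valley`, samples parameters uniformly, and measures how fast the fraction of parameters with loss below a tolerance `eps` shrinks as `eps` shrinks. The exponent of that power law plays the role of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Uniform samples from the square [-1, 1]^2 (an assumed toy parameter space).
w = rng.uniform(-1.0, 1.0, size=(1_000_000, 2))


def bowl(w):
    # Regular landscape: curved in both directions, so we expect d/2 = 1.
    return w[:, 0] ** 2 + w[:, 1] ** 2


def flat_valley(w):
    # Degenerate landscape: the second parameter never changes the loss.
    return w[:, 0] ** 2


def volume_exponent(loss_values, eps_grid):
    # Fraction of sampled parameters with loss below each tolerance,
    # then the slope of log(fraction) against log(eps).
    fractions = np.array([np.mean(loss_values < eps) for eps in eps_grid])
    slope, _intercept = np.polyfit(np.log(eps_grid), np.log(fractions), 1)
    return slope


eps_grid = np.logspace(-3, -1, 8)
exponent_bowl = volume_exponent(bowl(w), eps_grid)
exponent_valley = volume_exponent(flat_valley(w), eps_grid)
print("bowl exponent:", exponent_bowl)
print("flat valley exponent:", exponent_valley)
```

For these two toy losses the exponents should come out close to $1$ (that is, $d/2$ with $d = 2$) and close to $1/2$ respectively. The flat valley keeps more of its volume at low loss, which is what a lower $\lambda$ expresses.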
How is LLC useful?
Here’s a surprising math fact: neural networks learn like ice melts.¹
When water is frozen, the molecules sit in a crystal structure and have a limited number of valid ways to be arranged. The number of possible microstates, and with it the entropy, is relatively low. As the temperature rises, the two competing pressures of energy (favoring ordered arrangements) and entropy (favoring disorder) shift in relative importance and, above $0^\circ\text{C}$, energy loses the tug of war: a phase transition happens. The entropy surges, because there are many more possible molecular arrangements once the ice crystal loosens.
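The tug of war can be written in one line using the standard thermodynamic free energy. The symbols here are textbook notation rather than anything defined in this post: $F$ is the free energy, $E$ the energy, $T$ the temperature and $S$ the entropy.

$$F = E - TS$$

A system settles into the state with the lowest $F$. At low $T$ the energy term dominates and the ordered crystal wins. As $T$ grows, the $-TS$ term carries more weight, and past the melting point the high-entropy liquid has the lower free energy.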
In the beginning of neural network training, the parameters can be in many different configurations at similar loss. Entropy is high. As the training progresses and the model starts to fit the data,...