Why Deep Learning Works Even Though It Shouldn’t – Ryan Moulton's Articles
This is a big question, and I’m not a particularly big person. As such, these are all likely to be obvious observations to someone deep in the literature and theory. What I find, however, is that there is a base of unspoken intuitions that underlies expert understanding of a field, intuitions that are never directly stated in the literature because they can’t be easily proved with the rigor that the literature demands. As a result, the insights exist only in conversation and subtext, which makes them inaccessible to the casual reader.
Because I have no need of rigor (or even a need to be correct) to post on the internet, I’m going to post some of those intuitions here as I (not an expert) understand them. Since the best way to get the right answer on the internet is to post the wrong one, I’ve gotten a lot of good feedback from people about this, and updated it accordingly. If this is all obvious to you, skip to the section on "Suggestions for Research" because there are a lot of ways that I think typical papers ignore things that most researchers believe to be true.
In particular I find that people from a statistics background tend to throw up their hands at deep learning, because from a traditional statistics perspective, none of it can possibly work. This makes it very frustrating that it does. As a result they tend to have a much more dim view of its results and methods than their continued success warrants, so I hope here that I can bridge some of that gap.
The key thing I’m going to try to intuitively explain is why models always get better when they are bigger and deeper, even when the amount of data they consume stays the same or gets smaller. Some of this might turn out to be wrong, but I think it’s much more likely to be incomplete than to be wrong. The effects I describe here likely matter, even though it’s possible they aren’t the dominant causes. There is going to be nothing terribly formal here, which will madden some people and relieve others. If you find this all irritatingly hand-wavy, go read papers about the lottery ticket hypothesis instead, because I think that’s the closest thing to a formal theory that encapsulates most of this and is currently making progress.
So here goes.
If you start your parameters in a reasonable place, they’re already close to good ones, even though they’re totally random.
In high dimensional spaces, distance is a statistical concept. Squared Euclidean distance is just a big sum, and statistics tells us what happens to big sums of well behaved random terms. They become approximately normally distributed, and they become relatively tighter and tighter around their mean as the number of terms in the sum increases. This means that when there is any amount of well behaved randomness involved, all distances in high dimensions are about the same. In a model whose parameters begin as random variables due to initialization, and end as random variables due to the nature of the data, the central limit theorem applies to these sums. So all sets of parameters in a high dimensional model are about equally close to, or far from, each other.
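The concentration effect described above is easy to check numerically. The following is a minimal sketch, assuming points drawn from a standard normal distribution (my choice for illustration, not something the argument depends on); the function name `relative_spread` is hypothetical. It measures how wide the distribution of pairwise distances is relative to its mean as the dimension grows.

```python
import numpy as np

def relative_spread(dim, n_points=200, seed=0):
    """Std/mean of pairwise Euclidean distances between random Gaussian points."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, dim))
    # Squared distances from the identity |x - y|^2 = |x|^2 + |y|^2 - 2 x.y
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    # Keep each unordered pair once, and clip tiny negative rounding errors.
    upper = np.triu_indices(n_points, k=1)
    dists = np.sqrt(np.maximum(sq_dists[upper], 0.0))
    return dists.std() / dists.mean()

spreads = {dim: relative_spread(dim) for dim in (2, 20, 200, 2000)}
for dim, spread in spreads.items():
    print(f"dim={dim:5d}  std/mean of pairwise distance = {spread:.3f}")
```

Under these assumptions the ratio should shrink roughly like one over the square root of the dimension, which is the sense in which all distances end up "about the same".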
In the dimensions we live in, we’re used to the idea that some things are closer together than other things, so we mentally think of concepts like "regions" and think about things like bad regions and good regions for parameters. But high dimensional spaces are extremely well connected. You can get to anywhere with a short jump from anywhere else. There are no bad places to start. If the magnitudes of the random initialization are about right, all places are reasonably good. No matter where you start, you’re close to good parameters, and you’re as likely to be close to good parameters as to any others. The only assumption we need for this to be formally true is that there are lots of good sets of parameters, and that they come from roughly the same distribution as the parameters at initialization, which is pretty mild.
High dimensional spaces are unlikely to have local optima, and probably don’t have any optima at all.
Just recall what is necessary for a set of parameters to be at an optimum. All the gradients need to be zero, and the Hessian needs to be positive semidefinite. In other words, you need to be surrounded by walls. In 4 dimensions, you can walk through walls. GPT-3 has 175 billion parameters. In 175 billion dimensions, walls are so far beneath your notice that if you observe them at all it is like God looking down upon individual protons.
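To get a feel for how restrictive the positive semidefinite condition is, here is a toy experiment. It is a sketch under a loud assumption: the Hessian is modeled as a random symmetric Gaussian matrix, which is my stand-in for illustration and not a claim about the Hessians of real networks. The function name `fraction_all_uphill` is hypothetical. It estimates how often every direction curves upward at once.

```python
import numpy as np

def fraction_all_uphill(dim, n_trials=2000, seed=0):
    """Fraction of random symmetric matrices whose eigenvalues are all >= 0."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_trials, dim, dim))
    # Symmetrize each matrix so it can play the role of a (toy) Hessian.
    sym = (a + np.swapaxes(a, 1, 2)) / 2.0
    # eigvalsh handles a stack of symmetric matrices; result is (n_trials, dim).
    eigvals = np.linalg.eigvalsh(sym)
    return float(np.mean(eigvals.min(axis=1) >= 0.0))

fractions = {dim: fraction_all_uphill(dim) for dim in (1, 2, 3, 4, 6)}
for dim, frac in fractions.items():
    print(f"dim={dim}  fraction with no downhill direction = {frac:.4f}")
```

In this toy model the fraction starts near one half in a single dimension and collapses within a handful of dimensions, long before anything resembling the parameter count of a real model.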
If there’s any randomness at all in the loss landscape, which of course there is, it’s vanishingly unlikely that all of the millions or billions of directions the model has to choose from will be simultaneously uphill. With so many directions to choose from you will always have at least one direction to escape. It’s just completely implausible that any big model comes close to any optima at all. In fact it’s implausible...