The Bitter Lesson
Rich Sutton
March 13, 2019
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers' belated learning of this bitter lesson, and it is instructive to review some of the most prominent.
In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.
A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers' initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.
In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge---knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked---they tried to put that knowledge in their systems---but it proved ultimately counterproductive, and a colossal waste of researchers' time, when, through Moore's law, massive computation became available and a means was found to put it to good use.
In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to...