Building AGI Using Language Models | Leo Gao
Created: 2020-08-17
Modified: 2020-08-17
Despite the buzz around GPT-3, it is, in and of itself, not AGI. In many ways, this makes it similar to AlphaGo or Deep Blue; while approaching human ability in one domain (playing Chess/Go, or writing really impressively), it doesn’t really seem like it will do Scary AGI Things™ any more than AlphaGo is going to be turning the Earth into paperclips anytime soon. While its writings are impressive at emulating humans, GPT-3 (or any potential future GPT-x) has no memory of past interactions, nor is it able to follow goals or maximize utility. However, language modelling has one crucial difference from Chess or Go or image classification. Natural language essentially encodes information about the world (the entire world, not just the world of the Goban) in a much more expressive way than any other modality ever could.[1] By harnessing the world model embedded in the language model, it may be possible to build a proto-AGI.
World Modelling
The explicit goal of a language model is only to maximize the likelihood that the model assigns to natural language data. In the autoregressive formulation that GPT-3 uses, this means being able to predict the next word as well as possible. However, this objective places much more weight on large, text-scale differences like grammar and spelling than on fine, subtle differences in semantic meaning and logical coherency, which show up as very subtle shifts in distribution. Once the former are near-perfect, though, the only place left to keep improving is the latter.
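To make the objective concrete, here is a minimal sketch of "predict the next word as well as possible" measured as average bits per token. It is not how GPT-3 is trained or evaluated; the bigram model, the add-alpha smoothing, the toy corpus, and every function name below are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens, vocab, alpha=1.0):
    """Estimate P(next | prev) from counts, with add-alpha smoothing (an assumption)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

    def prob(prev, nxt):
        total = sum(counts[prev].values())
        return (counts[prev][nxt] + alpha) / (total + alpha * len(vocab))

    return prob

def bits_per_token(prob, tokens):
    """Average negative log2-likelihood of each token given the one before it."""
    pairs = list(zip(tokens, tokens[1:]))
    return -sum(math.log2(prob(prev, nxt)) for prev, nxt in pairs) / len(pairs)

# Toy corpus, purely for illustration.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))

bigram = train_bigram(corpus, vocab)

def uniform(prev, nxt):
    """Baseline that knows nothing: every word is equally likely."""
    return 1.0 / len(vocab)

loss_uniform = bits_per_token(uniform, corpus)
loss_bigram = bits_per_token(bigram, corpus)
print(f"uniform: {loss_uniform:.3f} bits/token, bigram: {loss_bigram:.3f} bits/token")
```

Even this crude model beats the uniform baseline just by picking up local word order. The argument here is about where the remaining bits come from once that kind of easy structure is exhausted.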
At the extreme, any model whose loss reaches the Shannon entropy of natural language—the theoretical lowest loss a language model can possibly achieve, due to the inherent randomness of language—will be completely indistinguishable from writings by a real human in every way, and the closer we get to it, the more abstract the effect on quality of each bit of improvement in loss. Or, said differently, stringing words together using Markov chain generators gets you 50% of the way there, figuring out grammar gets you another 50% of the remaining distance, staying on topic across paragraphs gets you another 50% of the remaining distance, being logically consistent gets you another 50% of the remaining distance, and so on.[2]
$$
H(X) = -\mathbb{E}[\log \mathbb{P}(X)] = -\sum_{x \in \Omega} f_X(x) \log f_X(x)
$$
Shannon entropy: the number of bits necessary, on average, to specify one piece of text.
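As a small numerical illustration of why the entropy is a floor on the loss, the sketch below computes the entropy of a made-up three-outcome distribution and the cross-entropy of an imperfect model against it. The distributions and function names are assumptions chosen for the example; real text has no such tidy known distribution.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(p) in bits for a discrete distribution given as a dict."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy_bits(p, q):
    """Expected bits per sample when data comes from p but the model predicts q."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

# Hypothetical "true" distribution and an imperfect model of it.
true_dist = {"a": 0.5, "b": 0.25, "c": 0.25}
model_dist = {"a": 0.4, "b": 0.4, "c": 0.2}

h = entropy_bits(true_dist)
loss_imperfect = cross_entropy_bits(true_dist, model_dist)
loss_perfect = cross_entropy_bits(true_dist, true_dist)

print(f"entropy:               {h:.3f} bits")
print(f"loss, imperfect model: {loss_imperfect:.3f} bits")
print(f"loss, perfect model:   {loss_perfect:.3f} bits")
```

The imperfect model pays extra bits on top of the entropy, and the gap only closes when the model's distribution matches the true one exactly.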
Why? Because if you have a coherent-but-not-logically-consistent model, becoming more logically consistent helps you predict language better. Having a model of human behavior helps you predict language better. Having a model of the world helps you predict language better. As the low-hanging fruit of grammar and basic logical coherence is taken, the only place left for the model to keep improving the loss is a world model. Predicting text is equivalent to AI.
The thing about GPT-3 that makes it so important is that it provides evidence that as long as we keep increasing the model size, we can keep driving down the loss, possibly right up until it hits the Shannon entropy of text. No need for clever architectures or complex handcrafted heuristics. Just by scaling it up we can get a better language model, and a better language model entails a better world model.
But how do we use this world model if it’s only implicitly represented, buried inside GPT-x? Well, we can literally just ask it, in natural language, what it thinks will happen next given a sequence of events, and its output distribution will approximate the distribution of what the average human thinks would happen next after those events. Great: we’ve got ourselves a usable world model.
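Here is a rough sketch of what "just ask it" could look like in code. Everything in it is hypothetical: `build_prompt`, `estimate_next_event_distribution`, and the toy sampler are names invented for illustration, and the sampler is a hard-coded stand-in rather than a call to GPT-3 or any real API.

```python
import random
from collections import Counter

def build_prompt(events):
    """Turn a list of events into a natural-language 'what happens next' prompt."""
    return " ".join(events) + " What happens next?"

def estimate_next_event_distribution(sample_fn, events, n_samples=1000):
    """Approximate the model's distribution over next events by repeated sampling."""
    prompt = build_prompt(events)
    counts = Counter(sample_fn(prompt) for _ in range(n_samples))
    return {outcome: count / n_samples for outcome, count in counts.items()}

def make_toy_sampler(seed=0):
    """Stand-in for a language model: ignores the prompt, samples fixed outcomes."""
    rng = random.Random(seed)
    outcomes = ["The glass shatters.", "The glass bounces."]

    def sample(prompt):
        return rng.choices(outcomes, weights=[0.9, 0.1])[0]

    return sample

events = ["Alice holds a glass over a stone floor.", "She lets go."]
dist = estimate_next_event_distribution(make_toy_sampler(), events)
for outcome, p in sorted(dist.items(), key=lambda item: -item[1]):
    print(f"{p:.2f}  {outcome}")
```

With a real model plugged in as `sample_fn`, the same loop would give a sampled estimate of what the model thinks happens next, which is the sense in which the language model doubles as a world model.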
“But wait!” you say, “Various experiments have shown that GPT-3 often fails at world modelling, and just conjecturing that adding more parameters will fix the problem is a massive leap!” If you’re thinking this, you’re absolutely correct. The biggest assumption I’m making, and the one most likely to be wrong, is that larger models will develop better world models. Since, as loss approaches the Shannon entropy, the model’s world modelling ability has to become about as good as that of the average human on the internet[3], this boils down to two questions: “Will we really make models whose loss gets close enough to the Shannon entropy?” and “How close is close enough, in order to have the world modelling capabilities to make all this practical?”
The answer to the first question is “most likely”—that’s the main takeaway of GPT-3. The answer to the second question is… nobody knows. Some have demonstrated ways of making GPT-3 better at world modelling, but this alone is probably not sufficient. When models with 1 trillion, then 10 trillion, then 100 trillion parameters become available, we will have empirical...