Next Token Prediction is a Misleading Term – BorisTheBrave.Com
I’m fed up of hearing about how LLMs are next token predictors, and therefore aren’t really doing cognition.
There are lots of philosophical objections, but fundamentally, framing AI as next token predictors in the first place is just misleading and inaccurate. Here’s why LLMs aren’t naive next token predictors.
What is Next Token Prediction
Let’s first briefly cover what “Next Token Prediction” even means. It refers to base training (also called pre-training), the first step in training an LLM. We’ll talk about the other steps later.
Before a text is used for training a model, it’s split up into short sequences called tokens. Each token is about the length of a word, so we can think of them as words for our purposes. LLMs are complicated functions that take an input sequence of tokens, and output a probability distribution for a single token. To train the model, we take the training text and cut it up into pairs, each consisting of a long input sequence[1] and the token that follows it[2].
E.g. the text “The cat sat on the mat” would get split into several such pairs: (“The”, “cat”), (“The cat”, “sat”), (“The cat sat”, “on”), etc. This is the training data to be used. The model is fed the input sequence from each pair, and it outputs a probability distribution for the next token. Then the model is carefully tweaked so that the probability it gives to the correct answer increases, and the probability it gives to wrong answers decreases. This is repeated tens of trillions of times.
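To make the pairing concrete, here is a minimal sketch in Python. It rests on a few assumptions that aren’t part of the description above: tokens are produced by splitting on whitespace (real tokenizers work on sub-word pieces), the function names are made up for illustration, and the “how wrong was the model” score is the negative log probability of the correct token, which is the usual choice but not something this post depends on.

```python
import math

def make_training_pairs(text):
    """Cut a text into (input sequence, next token) pairs.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def next_token_loss(predicted_distribution, correct_token):
    """Negative log probability assigned to the correct next token.

    Lower is better. Training tweaks the model to push this number down,
    which is the same thing as pushing the correct token's probability up.
    """
    return -math.log(predicted_distribution[correct_token])

pairs = make_training_pairs("The cat sat on the mat")
for input_tokens, target in pairs:
    print(" ".join(input_tokens), "->", target)

# A hypothetical model output for the input "The":
unsure = {"cat": 0.5, "dog": 0.5}
confident = {"cat": 0.9, "dog": 0.1}
print(next_token_loss(unsure, "cat"), next_token_loss(confident, "cat"))
```

The more probability the model puts on “cat”, the smaller the loss, which is all that “tweaked so that the correct answer increases” needs to mean here.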
When training is done, and it comes to using the model to generate text[3], the process is different. You give the model the text generated so far, and it gives you a probability distribution. A token is randomly picked from that distribution and appended to the text. By repeating this process, a full sentence is generated: the LLM is invoked once per token, and each time it gives its best guess as to what comes next.
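The generation loop is short enough to sketch. Everything named here is hypothetical: `toy_model` is a hand-written lookup table standing in for an LLM, and unlike a real LLM it only looks at the last token rather than the whole text so far. The loop itself (get a distribution, sample from it, append, repeat) is the part that matches the description.

```python
import random

def toy_model(tokens):
    """Hypothetical stand-in for an LLM: token sequence in, distribution out.

    A real model would use every token in `tokens`; this toy only uses the last.
    """
    table = {
        "The": {"cat": 0.7, "mat": 0.3},
        "cat": {"sat": 1.0},
        "sat": {"on": 1.0},
        "on": {"the": 1.0},
        "the": {"mat": 0.8, "cat": 0.2},
    }
    return table.get(tokens[-1], {"<end>": 1.0})

def generate(model, prompt, max_new_tokens=10, seed=0):
    """Repeatedly sample one token from the model and append it to the text."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        distribution = model(tokens)          # one model call per new token
        choices, weights = zip(*distribution.items())
        next_token = rng.choices(choices, weights=weights)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

generated = generate(toy_model, ["The"])
print(" ".join(generated))
```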
Hopefully, it’s pretty clear why that’s called Next Token Prediction. The model literally outputs a guess for the next token, over and over again.
It turns out that to do a good job of such a task, models are forced to learn a great deal from the text. They need to consider language and grammar, obviously. But they also need to consider the content of the text. Imagine processing a maths textbook. If you saw the sentence “Nine plus nine makes”, a good prediction for what follows is “eighteen”, which requires knowing or at least memorising the maths. In extremis, you could imagine processing an entire book, which ends with “And, Watson, the name of the murderer is”. Making a good guess there would require an understanding of all the preceding clues and inferences. LLMs aren’t capable of that, but it illustrates that next token prediction is quite challenging, and that it applies the right sort of pressure during the training process.
Anyway, if you follow this training regime for enough tokens, and your LLM’s architecture and your tweaking are set up correctly, you end up with a model that is able to continue sentences convincingly, if not necessarily that accurately. In its internal structure, it will have set up patterns for dealing with factual recall, grammar, logic etc.
So yeah, the term Next Token Prediction is pretty accurate, mechanically. Where do the misconceptions come from, then?
The Learning Process Influences Pairs of Tokens with Long Separations
You’ll notice that during base training, an LLM is given a long sequence of input tokens, and asked to predict one more. These sequences are frequently tens of thousands of tokens long, sometimes going up to a million. All those tokens contribute to the answer, and during training, the computation on all those tokens is tweaked together.
You can even flip around your point of view. Normally, training is phrased as predicting a single token, using all preceding tokens. But it would also be valid to say that each token is used multiple times during the prediction of future tokens of a sentence, and contributes to all of them. The training process ensures that the calculations done on any given token are not just useful for the one that is immediately following, but all following tokens.
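The flipped point of view can be checked by simple counting. The sketch below (with a made-up function name, and whitespace splitting again standing in for real tokenization) counts how many training pairs each token appears in as part of the input. It only counts usage at the level of the data; the claim about what the calculations on each token end up being useful for is about training itself, which this doesn’t simulate.

```python
def input_usage_counts(tokens):
    """For each token position, count the predictions it serves as input to."""
    counts = [0] * len(tokens)
    for target_index in range(1, len(tokens)):
        # Every token before the target is part of the input for this prediction.
        for input_index in range(target_index):
            counts[input_index] += 1
    return counts

sentence = "The cat sat on the mat".split()
usage = input_usage_counts(sentence)
for token, count in zip(sentence, usage):
    print(f"{token!r} is an input to {count} prediction(s)")
```

The first token feeds into five predictions and only one of those is for the token right after it, so most of the training signal reaching it comes from tokens further away.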
In other words, the phrase “next token prediction” gives a false sense that LLM computation is somehow last-minute. That LLMs are forced to consider words in isolation, without regard for how they fit together. This is not accurate. If the training process finds that it has the capacity to learn either a structure useful for predicting the single next token, or a structure that is not immediately useful, but will help with many future tokens, it’ll probably prefer the latter.
LLMs Can and Do Plan Ahead
Let’s do a thought experiment. You are a spy in a foreign country. It’s imperative you blend in to avoid counter-espionage. Right now, you are driving in a car and are concerned your...