How LLMs Really Work
Arpit Bhayani
If you have used ChatGPT, Gemini, or Claude, you have already formed an intuition about what these systems do. You type something in, and text comes back that feels coherent, knowledgeable, and sometimes eerily human. But the machinery underneath is simultaneously simpler and stranger than most people expect.
This article tears open that machinery and explains what a language model is doing at a mechanical level - why it produces the outputs it does, why identical inputs produce different outputs on different runs, and what “temperature” actually means beyond “a creativity dial.”
Next-token Prediction Machine
A large language model (LLM) is, at its most fundamental level, a function that takes a sequence of tokens as input and outputs a probability distribution over its entire vocabulary for what the next token should be. That is the complete description of the core operation. Everything else - the apparent reasoning, the conversational ability, the code generation - emerges from doing this one thing at enormous scale, across an enormous amount of training data.
Concretely, imagine you feed the model the tokens for “The quick brown fox”. The model does not produce the word “jumps”. It produces a table of probabilities: “jumps” might have a 42% chance, “sat” a 12% chance, “leaped” an 8% chance, and every other token in a 100,000-token vocabulary gets some non-zero slice of the remaining probability mass. A token is then sampled from that distribution. That token gets appended to the sequence, and the whole process repeats until a stop condition is reached.
This is called autoregressive generation. Each token generated becomes part of the input for the next prediction. The model is always asking the same question: “given everything I have seen so far, what token is most likely to come next?”
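The loop described above can be sketched in a few lines of Python. This is a toy illustration, not how a real model is implemented: `TOY_MODEL`, `next_token_distribution`, and `generate` are hypothetical names, the "model" is a hand-written lookup table keyed on the last token only, and `<eos>` is an assumed stop token.

```python
import random

# Hypothetical toy "model": maps the LAST token to a next-token distribution.
# A real LLM conditions on the entire sequence and covers the whole vocabulary.
TOY_MODEL = {
    "The": {"quick": 0.9, "fox": 0.1},
    "quick": {"brown": 1.0},
    "brown": {"fox": 1.0},
    "fox": {"jumps": 0.6, "sat": 0.3, "leaped": 0.1},
    "jumps": {"<eos>": 1.0},
    "sat": {"<eos>": 1.0},
    "leaped": {"<eos>": 1.0},
}


def next_token_distribution(tokens):
    """Stand-in for a forward pass: return {token: probability}."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt, max_new_tokens=10, seed=0):
    """Autoregressive loop: predict, sample, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Sample one token in proportion to its probability.
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == "<eos>":  # stop condition
            break
        tokens.append(token)  # the output becomes part of the next input
    return tokens


print(generate(["The", "quick", "brown", "fox"]))
```

Changing `seed` can change which continuation is drawn, which is the same mechanism that makes identical prompts produce different outputs on different runs.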
What Training Actually Does
The model learns to produce these probability distributions by training on a massive corpus of text - essentially a large fraction of the written internet, books, code, and academic papers. During training, the model sees a sequence of tokens and tries to predict the next one.
When it is wrong, the error signal flows backward through the network (via backpropagation), nudging billions of internal parameters - the model’s “weights” - very slightly in the direction that would have made the correct prediction more probable.
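To make the "nudge" concrete, here is a deliberately simplified sketch. It assumes the standard cross-entropy loss and applies one gradient step directly to three logits, whereas a real model backpropagates the same signal into billions of weights. The function names, the three-token vocabulary, and the learning rate are all made up for illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_step_on_logits(logits, target_index, lr=0.5):
    """One gradient-descent step on cross-entropy loss, taken on the logits.

    For loss = -log(p_target), the gradient with respect to logit_i is
    p_i - 1 for the target token and p_i for every other token.
    """
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [x - lr * g for x, g in zip(logits, grads)]


vocab = ["Paris", "London", "Rome"]  # hypothetical tiny vocabulary
logits = [0.0, 0.0, 0.0]             # untrained: every token equally likely
target = vocab.index("Paris")        # the token that actually came next

before = softmax(logits)[target]
logits = gradient_step_on_logits(logits, target)
after = softmax(logits)[target]

print(f"P(Paris) before: {before:.3f}, after one step: {after:.3f}")
```

The correct token's probability rises a little and the others fall; training repeats this kind of small correction an enormous number of times.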
After trillions of these predictions, batched into a much smaller number of weight updates, the model’s weights encode something remarkable: a compressed statistical model of how language works. It learns that “The Eiffel Tower is located in” is very frequently followed by “Paris,” that Python function definitions start with “def,” and that a sentence starting “To be or not to” almost certainly continues with “be.”
Crucially, the model does not keep a retrievable copy of its training examples. For the most part it has internalized statistical patterns, although heavily repeated passages can still be memorized verbatim. This is why it can generalize to novel inputs - it is not looking up stored sentences, it is sampling from learned distributions.
Logits, Softmax, and Why Probabilities Matter
Before the model produces those clean probabilities, it produces raw scores called logits - one real number per token in the vocabulary. These logits are the raw output of the final linear layer in the neural network.
To convert logits to a probability distribution, the model applies the softmax function:
$$P(\text{token}_i) = \frac{e^{\text{logit}_i}}{\sum_j e^{\text{logit}_j}}$$
Softmax does two things. First, it exponentiates each logit, which amplifies differences: what matters is the gap between logits, and a token whose logit is higher by $d$ ends up $e^{d}$ times more probable. Second, it normalizes everything so that all probabilities sum to 1. The result is a valid probability distribution over the entire vocabulary.
To see this in action, imagine the model is predicting the next word after “The quick brown fox”. It generates raw logits for a tiny vocabulary of four words:
| Token | Logit ($x_i$) | Exponent ($e^{x_i}$) | Probability ($P_i$) |
|---|---|---|---|
| “jumps” | 8.3 | 4023.8 | 90.7% |
| “leaped” | 6.0 | 403.4 | 9.1% |
| “sat” | 2.1 | 8.1 | 0.18% |
| “sleeps” | -1.5 | 0.2 | 0.005% |
| Sum | | 4435.5 | 100% |

This distribution is what the model actually hands you before sampling. The entire drama of temperature, top-k, and nucleus sampling happens here, in the manipulation of this distribution before a token is drawn from it.
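The table above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library. The `softmax` helper is my own, and it subtracts the largest logit before exponentiating, a common numerical-stability trick that leaves the resulting probabilities unchanged.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow; the result is identical
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["jumps", "leaped", "sat", "sleeps"]
logits = [8.3, 6.0, 2.1, -1.5]
probs = softmax(logits)

for token, logit, p in zip(tokens, logits, probs):
    # math.exp(logit) is the unnormalized "Exponent" column from the table.
    print(f"{token:>7}  logit={logit:5.1f}  exp={math.exp(logit):8.1f}  prob={p:.3%}")
```

The printed values agree with the table to within rounding in the last digit.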
Temperature
Temperature is the most misunderstood parameter in text generation. It is commonly described as “creativity” or “randomness,” which is technically correct but obscures exactly how it works. Understanding it precisely lets you use it deliberately.
Temperature is a scalar that divides the logits before the softmax is applied:
$$P(\text{token}_i) = \frac{e^{\text{logit}_i / T}}{\sum_j e^{\text{logit}_j / T}}$$
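As a minimal sketch of this formula, the function below divides each logit by $T$ before applying softmax, reusing the four logits from the earlier example. The function name and the chosen temperatures are illustrative assumptions, and real inference libraries usually handle $T = 0$ as a special case (greedy decoding) rather than dividing by zero.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / T. Assumes T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # stability trick; does not change the probabilities
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [8.3, 6.0, 2.1, -1.5]  # "jumps", "leaped", "sat", "sleeps"

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 4) for p in probs]}")
```

Running it shows the probability of the top token growing as $T$ drops below 1 and shrinking as $T$ rises above 1, while the ranking of the tokens never changes.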
When $T = 1.0$, nothing changes. The...