A Transformer Becomes an LLM

From Transformer to ChatGPT: The Part That Isn't the Architecture | Bharadwaj P

In the last post, we followed the transformer down to one layer. Attention lets words shape each other. An MLP reshapes the result. The final numbers become a guess at the next word. Stack that layer many times and you have the architecture.

But the architecture is not the product. A stack of layers like that is not Claude, or GPT, or Gemini. Untrained, it is a pile of random numbers that knows nothing. This post follows the rest of the path: from that empty architecture to a model you can talk to.

Same approach as last time. Small worked examples and diagrams, not full notation.

TLDR

Stack transformer layers into one big model. Parameter count measures size: 7B, 70B, 405B.

Text gets split into tokens, not words. Models train on trillions of them.

Pre-training teaches the model to guess the next token.

Supervised fine-tuning (training on prompts with the desired answers) creates an assistant.

Alignment (a post-training step that pushes the model toward answers humans prefer) shapes answers.

For cheap customization, freeze the model and train tiny add-on matrices. For serving, lean on GPUs.

We'll take each of those six in turn. Same running example as the last post: "this too shall", and I'm hoping the model lands on "pass".

The piece that makes a deep network (neural net with many layers) trainable: skip connections

In the last post, we skipped a training detail to keep the diagram clean. Let's dig into it now: the residual connection.

Here is the problem it solves. During training, errors flow backward through every layer. That signal updates the early layers. Stack enough layers and the signal degrades on the way down. It can shrink toward nothing or blow up. Past some depth, early layers barely move.

The fix looks small next to the problem: let the original signal bypass each block. Instead of passing only block(input) forward, pass the sum: input + block(input). The next layer always receives both pieces together. If the block is noisy early in training, the original input still survives inside that sum. Over time, the block learns a useful adjustment instead of rebuilding the whole representation from scratch.

[Diagram: the input feeds block(input), the learned change. A dashed bypass carries the original input forward. The two are added to give input + block(input).]

The next layer receives the original input plus the block's learned change.

Those dashed paths are side channels. The original signal has a clean route forward around the block. The correction signal gets the same route on the way back. Each block only has to learn a small adjustment on top of a stable input. That makes training steady enough for trillions of tokens.
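Here is a toy sketch of why the bypass helps the backward signal. It assumes each block is a single scalar multiplication, block(x) = w * x, with a small made-up weight. That is a stand-in, not a real transformer block, and the function names are invented for this illustration. Under that assumption, the signal reaching the first layer is a product of one factor per layer.

```python
# Toy illustration only: each "block" is block(x) = w * x with a small weight w.
# plain_forward, residual_forward and backward_factor are hypothetical names.

def plain_forward(x, weights):
    """Pass only block(input) forward at every layer."""
    for w in weights:
        x = w * x
    return x

def residual_forward(x, weights):
    """Pass input + block(input) forward at every layer."""
    for w in weights:
        x = x + w * x
    return x

def backward_factor(weights, residual):
    """How much a nudge at the input moves the output (the chain-rule product).

    Without the bypass, each layer contributes a factor w.
    With the bypass, each layer contributes 1 + w; the 1 is the bypass path.
    """
    factor = 1.0
    for w in weights:
        factor *= (1.0 + w) if residual else w
    return factor

weights = [0.01] * 24  # 24 layers, each block barely trained (assumed values)

plain_signal = backward_factor(weights, residual=False)
residual_signal = backward_factor(weights, residual=True)

print(f"plain stack:    {plain_signal:.3e}")
print(f"residual stack: {residual_signal:.3f}")
```

With these assumed weights, the plain product collapses toward zero while the residual product stays close to 1. Real blocks are not scalars, and large weights could still make the product grow, so read this as intuition and nothing more.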

First, text becomes tokens

The last post said "each word becomes a row of numbers". The real unit is the token, not the word. That difference explains a lot of how LLMs behave.

Words are the wrong unit, and the output layer shows why. English has hundreds of thousands of words, and new ones keep arriving. Worse, the model has to score every vocabulary item at every step. With 600,000 words, that means 600,000 scores per step. Whole sentences as units would be worse still, since the set of possible sentences is open-ended.

Go the other way, then. Characters? Now the vocabulary is tiny. You might need a thousand symbols for letters, punctuation, and accents. But a single character carries little meaning. "a" and "I" are words; "z" is not. Asking the model to build meaning one character at a time is a brutal job.
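The two extremes can be put side by side with a little arithmetic. This sketch assumes the output layer is a plain linear projection from a hidden vector to one score per vocabulary item, and it assumes a hidden size of 4096. Neither the hidden size nor the helper name comes from this post.

```python
def output_layer_weights(vocab_size, hidden_size):
    """One score per vocabulary item, each a dot product over the hidden vector."""
    return vocab_size * hidden_size

HIDDEN_SIZE = 4096  # assumed for illustration, not taken from the post

for label, vocab_size in [("characters", 1_000), ("whole words", 600_000)]:
    weights = output_layer_weights(vocab_size, HIDDEN_SIZE)
    print(f"{label:12s} {vocab_size:>8,} scores per step, {weights:>14,} output weights")
```

The character vocabulary is cheap to score but pushes the work of building meaning into the layers. The word vocabulary does the opposite.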

So tokenizers split the difference by counting. They look for groups of characters that appear together across a huge pile of text. The most common groups become units. Those units are tokens. Common chunks become single tokens. Rare words split into several. There is no clean grammar rule for the splits.
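One common way to do this counting is byte-pair encoding (BPE). The post does not name the algorithm, so treat what follows as a minimal sketch under that assumption. It runs on a tiny made-up corpus, not a huge pile of text, and all function names are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(corpus, num_merges):
    """Start from single characters and repeatedly merge the most common pair."""
    word_counts = Counter(corpus.split())
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

corpus = "the the the the then there this aardvark"  # toy corpus
merges, words = learn_merges(corpus, num_merges=2)

print("merges:", merges)
for symbols, freq in words.items():
    print(freq, " ".join(symbols))
```

After two merges the frequent word "the" is a single unit, while the rare "aardvark" is still spelled out character by character. Many production tokenizers apply the same idea to bytes and run a very large number of merges.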

You can watch this happen on Tiktokenizer. Feed it "I am an aardvark on a large ark" with the Llama 3 tokenizer and you get 11 tokens, each mapped to an ID:

[Token display, partially recovered: am | an | ard | vark | on | large | ark]

"aardvark" is rare, so it shatters into pieces. "I" is common, so it stays whole.

Eight words become eleven tokens.

For ordinary English prose I budget 1.5 to 2 tokens per word. Code, odd names, and non-English text can push that ratio up.
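The budget is simple arithmetic. In this minimal sketch, `estimate_token_range` is a hypothetical helper, and the 1.5 and 2.0 defaults are the rule of thumb from the paragraph above, not measured values.

```python
import math

def estimate_token_range(text, low=1.5, high=2.0):
    """Return a (low, high) token estimate from a whitespace word count."""
    words = len(text.split())
    return math.ceil(words * low), math.ceil(words * high)

sentence = "I am an aardvark on a large ark"
print(estimate_token_range(sentence))
```

For the eight-word sentence above, the sketch returns a range of 12 to 16 tokens. That is slightly more than the 11 observed, so the rule leans toward overestimating for plain prose.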

That ratio is not the vocabulary size. They count different things. The ratio is how many tokens one word costs when you split it. The vocabulary is how many distinct tokens exist to choose from. Many recent models carry around 200,000. English has maybe 600,000 words, but the vocabulary stays smaller because rare words do not each get a slot. They get built from pieces, the way "aardvark" broke into three above.
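To make the distinction concrete, here is a minimal sketch with a tiny, made-up vocabulary. It uses greedy longest-match splitting, a simplification chosen for readability and not how the Llama 3 tokenizer actually works. `TOY_VOCAB` and `split_word` are hypothetical names.

```python
# Made-up vocabulary: every entry here is an assumption for the demo.
TOY_VOCAB = {"I", "am", "an", "on", "a", "large", "ark", "ard", "vark"}

def split_word(word, vocab):
    """Greedily take the longest vocabulary entry at each position."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No entry matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

sentence = "I am an aardvark on a large ark"
tokens = [piece for word in sentence.split() for piece in split_word(word, TOY_VOCAB)]

print(tokens)
print("tokens per word:", len(tokens) / len(sentence.split()))
print("vocabulary size:", len(TOY_VOCAB))
```

The last two lines print the two different counts: how many pieces the text cost, and how many distinct pieces exist to choose from. The toy splitter's token count need not match what a real tokenizer reports for the same sentence.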

This one design choice explains several things people find odd about LLMs.

Why they handle typos and mixed languages so well. The model never sees "words" the way you do. Misspell something, or drop a Hindi word into an English sentence. The model just...
