LLM Quantization Part 1: What Even is an LLM? | LTT Labs
Utkarsh J.·Tested by Jessie C.·
You've seen the headlines. You've probably used ChatGPT or asked Claude something embarrassing at 2 am. Have you wondered what is really happening under the hood when someone says "large language model"? You don't need a PhD to get a useful mental model of this stuff, and more importantly, you'll need this foundation to follow along when we get into our quantization experiments in later articles. If you already have a working understanding of LLMs, feel free to skip to the next article (if it has been released).<br>Note: This is Part 1 of 3 in our LLM Quantization series.<br>Part 1: What Even is an LLM?<br>Part 2: Why do LLMs Need So Much VRAM? (coming soon)<br>Part 3: How Do We Compress an LLM? (coming soon)<br>Thank you to Bartowski (Colin Kealty) for his invaluable input.<br>It Starts With a Neuron (Sort of)<br>Your brain has about 86 billion neurons. Each one receives signals, reacts to them, and passes a signal along. In the 1940s some very ambitious scientists thought: what if we made a “fake” version of that with math? The artificial neural network was born. The fake neuron is embarrassingly simple compared to the real thing (for now). It takes in some numbers, multiplies each one by a weight, adds them up, and spits out a result. That's essentially it. The magic isn't in any individual neuron; it's in connecting millions (or billions) of them together in layers. A great visual intro to how these layers work is 3Blue1Brown's Neural Networks playlist.<br>3Blue1Brown's "But what is a neural network?".
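If it helps to see the fake neuron as code, here is a minimal Python sketch of the "multiply each input by a weight and add them up" idea. The function names `neuron` and `layer` and all of the numbers are made up for illustration. Practical networks also add a bias term and pass the sum through a nonlinear function, both left out here to match the simplified description above.

```python
def neuron(inputs, weights):
    """Multiply each input by its weight, then add the results together."""
    return sum(x * w for x, w in zip(inputs, weights))


def layer(inputs, weight_rows):
    """A layer is just several neurons looking at the same inputs,
    each with its own list of weights (one row per neuron)."""
    return [neuron(inputs, row) for row in weight_rows]


# Illustrative values only; a trained model would have learned its weights.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]

single_output = neuron(inputs, weights)
print(single_output)  # 0.5 - 2.0 + 6.0 = 4.5

layer_output = layer(inputs, [weights, [1.0, 1.0, 1.0]])
print(layer_output)  # [4.5, 6.0]
```

Stacking layers means feeding one layer's outputs in as the next layer's inputs, which is where the millions (or billions) of connections come from.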
What is a Parameter?<br>Every one of those multiplication weights we mentioned? Those are parameters. When people say "Llama 3.1 8B", the 8B refers to 8 billion parameters. Each one is just a number, a dial that gets tuned during training. A small model might have a few hundred million of these dials. A large one has hundreds of billions.<br>These numbers have to be stored somehow, and the format they're stored in matters. By default, most models store each parameter as a 16-bit floating-point number, meaning each dial takes up 2 bytes of space. The math is then pretty straightforward: 8 billion parameters × 2 bytes = 16 GB just to hold the model in memory. A 70 billion parameter model at the same precision needs around 140 GB. This is why model files are so large, and why your average gaming GPU breaks into a cold sweat when you ask it to run one.<br>The collection of all these numbers IS the model. When you download a model file, you are quite literally downloading a very large list of numbers. That size is the whole reason quantization (coming in the following articles) is interesting. Those numbers don't necessarily need to be stored at full precision, but we're getting a little ahead of ourselves.<br>Training: Getting the Dials Right<br>So how do all those billions of dials end up set to the right values? Training. Here's the high-level version without any of the calculus:<br>Show the model some text.<br>Ask it to predict what comes next.<br>Compare its guess to the real answer.<br>Measure how wrong it was. This is called loss, and you'll be seeing that word again.<br>Nudge every dial very slightly in the direction that would have made the answer less wrong.<br>Repeat this billions of times on hundreds of billions of words of text.<br>Working out which direction to nudge each dial is the job of backpropagation, and the process of repeatedly nudging the dials in that direction is gradient descent.
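To make the guess, measure, nudge, repeat loop concrete, here is a toy Python sketch with exactly one dial. Everything in it is an assumption chosen for illustration: the "model" is just `weight * x`, the loss is squared error (not what an LLM uses for next-token prediction), and the function name `train_one_dial`, the learning rate, and the data are made up. With a single dial the nudge direction can be written out by hand; in a real network, backpropagation is what computes that direction for every dial at once.

```python
def train_one_dial(x, target, weight=0.0, learning_rate=0.05, steps=50):
    """Gradient descent on a single weight for the toy model: prediction = weight * x."""
    losses = []
    for _ in range(steps):
        prediction = weight * x              # the model's guess
        error = prediction - target          # compare the guess to the real answer
        losses.append(error ** 2)            # loss: how wrong the guess was
        gradient = 2 * error * x             # slope of the loss with respect to the dial
        weight -= learning_rate * gradient   # nudge the dial to reduce the loss
    return weight, losses


# Toy data: with x = 2 and target = 6, the ideal dial setting is 3.
weight, losses = train_one_dial(x=2.0, target=6.0)
print(f"final weight: {weight:.4f}")
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}")
```

The loss starts large and shrinks with every nudge as the dial settles near 3. Real training does the same thing, just with billions of dials and vastly more examples.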
You don't need to know the math; just know that it works, and that doing it well requires an OBSCENE amount of compute. Training GPT-4 reportedly cost over $100 million in compute alone. Training a new base large language model or multimodal model from scratch (with all weights/dials starting at random values) takes an unfathomable amount of data and millions of GPU hours.<br>What most hobbyists or consumers mean when they say they "trained" an LLM or a large multimodal model is actually fine-tuning: taking a model that already has mostly good dial settings (pre-trained) and nudging the dials (often only a small fraction of them) toward a specific task or personality. In traditional machine learning it is still common to train small models from scratch.<br>For a visual walkthrough of how training and loss actually work: StatQuest: Gradient Descent, Step by Step is an approachable first descent into the subject.<br>StatQuest with Josh Starmer's "Gradient Descent, Step-by-Step".
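Before moving on to tokens, the storage arithmetic from the parameter section (parameters × bytes per parameter) is worth having as a reusable calculation. A minimal Python sketch follows. The function name `model_size_gb` is made up for illustration, it assumes decimal gigabytes (1 GB = 1,000,000,000 bytes), and it counts only the weights themselves, not anything else a running model needs.

```python
def model_size_gb(num_params, bytes_per_param=2.0):
    """Approximate memory needed just to hold the weights, in decimal GB.

    The default of 2 bytes per parameter corresponds to 16-bit storage.
    """
    return num_params * bytes_per_param / 1e9


for name, params in [("8B", 8e9), ("70B", 70e9)]:
    print(f"{name}: {model_size_gb(params):.0f} GB at 16-bit")
# 8B: 16 GB at 16-bit
# 70B: 140 GB at 16-bit
```

Changing `bytes_per_param` shows how the total scales directly with the storage format.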
Tokens and Context Windows<br>LLMs don't read words or letters. They read tokens. A token is a chunk of text: sometimes a whole word, sometimes part of a word, sometimes just punctuation. The model converts everything to tokens before processing it,...