Data Fundamentals Primer for Learning LLM


Data Fundamentals Primer (Algorhythm)

A dataset is, mechanically, just a list. Each entry in the list is one example of the thing you want the model to learn about: an email, a photo, a sentence, a transaction, a CT scan. The list might be 50 entries or 50 billion; the principle is the same. Whatever a model "knows," it knows because it was shown enough examples for the pattern to be obvious in the data.

Vocabulary you'll see used interchangeably for one entry: sample, example, instance, row, record, data point, observation. They all mean the same thing: one self-contained unit the model will see during training. Pick whichever word your team uses and stop worrying about it.

The simplest mental model is a spreadsheet. One row per example, one column per piece of information about it. Here's a sketch of a "predict house price" dataset:

sqft    bedrooms    age    zip      price
────────────────────────────────────────────
850     1           12     94110    820,000
1450    3           8      94110    1,300,000
2100    4           15     94114    1,720,000
3200    5           3      94114    2,650,000
600     0           22     94103    480,000
...     ...         ...    ...      ...

Each row is one sample. The dataset grows by adding more rows.

That table is a dataset. Five examples shown, presumably many thousands more not shown. Every column is a fact about a house; every row is one house. As long as the spreadsheet analogy fits, this is what "dataset" means.

Three quick observations to set up the rest of this primer:

Size matters but isn't everything. Deep learning loves big data (ImageNet has 1.2 million images; modern LLMs train on trillions of tokens), but a small, carefully curated dataset can beat a huge sloppy one. "Garbage in, garbage out" is the oldest rule in ML, and it's still true.

The rows have to look alike. Every row in a dataset should be the same kind of thing, with the same columns, drawn from a population you care about. Mixing apartments and shipping containers in a "house price" dataset just gives the model a harder job than it needs.

Not all data is tabular. Images are 3-D arrays (height × width × channels), audio is a long sequence of samples, text is a string. The spreadsheet picture still works, since each row is one image or one document, but each "cell" might itself be huge.

The two big questions a dataset has to answer, which the next two sections unpack: What information does each row carry? (features and labels) and How do we keep the model from cheating? (train / validation / test split). Everything else builds on those.

In a Transformer: the dataset for a modern LLM is "all the text we could get our hands on": Common Crawl, GitHub, books, papers, code, conversations. Trillions of tokens. There's no labels file alongside it; the prompt itself is the question and the next token is the answer, billions of times per epoch. Everything else in this primer (features, labels, splits, encoding, cleaning) applies, just with a vocabulary tuned to sequences of bytes instead of spreadsheet rows.
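To make the "prompt is the question, next token is the answer" idea concrete, here is a minimal sketch of how raw text can be turned into (context, target) training pairs. It rests on assumptions not in the primer: whitespace splitting stands in for a real subword tokenizer, and `next_token_pairs` and `context_size` are hypothetical names chosen for illustration.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs from a token list.

    Each target is the token that immediately follows its context, so the
    text supplies its own labels. context_size caps how many preceding
    tokens are kept in each context (an illustrative parameter).
    """
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        target = tokens[i]
        pairs.append((context, target))
    return pairs


text = "the cat sat on the mat"
# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = text.split()
pairs = next_token_pairs(tokens, context_size=3)

for context, target in pairs:
    print(context, "->", target)
```

A six-token sentence yields five training pairs, one per position after the first. That is why a corpus with no separate labels file can still supply one prediction target for nearly every token it contains.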

Section 1's dataset is just a pile of rows. To turn that pile into a learning problem you split each row into two parts: features (the columns the model gets to look at) and the label (the column you're asking it to predict). That split, repeated across every row in the dataset, is what makes "training a model" a meaningful operation.

Conventional notation, used across almost every ML paper:

x: the features of one example. Usually a vector of numbers; sometimes an image, a string, a graph.

y: the label for that example. A single number, a category, or sometimes itself a structured thing.

One row of the dataset = one (x, y) pair. The whole dataset = a list of (x, y) pairs.

Back to the housing example from Section 1. To learn "given a house, predict its price," the price column is the label and everything else is features:

features (x)                          label (y)
────────────────────────────────     ───────────
sqft    bedrooms    age    zip        price
────────────────────────────────     ───────────
850     1           12     94110      820,000
1450    3           8      94110      1,300,000
2100    4           15     94114      1,720,000
...

Three rows from the housing table. Same table, two roles: the columns the model sees, and the column it predicts.

Picking the right split is the entire framing of the problem. Same dataset, different choices of label, give you completely different models:

Label = price → a model that estimates house value.

Label = sold within 30 days? → a model that predicts how fast a listing will move.

Label = zip code (using sqft, bedrooms, age as features) → a model that guesses neighborhood from architecture.

The label's type drives the choice of model and loss function:

Continuous label → regression. Price in dollars; temperature tomorrow; user click-through rate. Loss is usually mean squared error.

Discrete label, 2 choices → binary classification. Spam vs. not-spam;...
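The feature/label split described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a prescribed implementation: the rows are copied from the housing table, `split_features_label` is a hypothetical helper, and storing the zip code as a string (to mark it as a category rather than a quantity) is a choice made here, not something the primer specifies.

```python
# Column names and rows copied from the housing table.
COLUMNS = ["sqft", "bedrooms", "age", "zip", "price"]
ROWS = [
    (850, 1, 12, "94110", 820_000),   # zip kept as a string: an assumption
    (1450, 3, 8, "94110", 1_300_000),
    (2100, 4, 15, "94114", 1_720_000),
]


def split_features_label(rows, columns, label, features=None):
    """Turn each row into an (x, y) pair.

    x is a dict of the chosen feature columns and y is the value of the
    label column. If features is None, every column except the label is
    used as a feature.
    """
    if features is None:
        features = [name for name in columns if name != label]
    index = {name: i for i, name in enumerate(columns)}
    return [
        ({name: row[index[name]] for name in features}, row[index[label]])
        for row in rows
    ]


# Label = price: every other column is a feature.
price_pairs = split_features_label(ROWS, COLUMNS, label="price")

# Label = zip code, using only sqft, bedrooms, and age as features.
zip_pairs = split_features_label(
    ROWS, COLUMNS, label="zip", features=["sqft", "bedrooms", "age"]
)

print(price_pairs[0])
print(zip_pairs[0])
```

The same three rows produce two different lists of (x, y) pairs. Nothing about the data changed, only which column was declared the label and which columns the model is allowed to see.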
