MicroGPT and Interactive Walkthrough


microgpt — a guided, illustrated walkthrough


Contents

Introduction

Where to find it

Dataset

Tokenizer

From neuron to network

Forward pass

Autograd

Architecture

Parameters

Training loop

Inference

Train the toy GPT

Run it

Progression

Real stuff

Zoom in further · Bycroft

Assignment

FAQ

DS 6042 — Lab 02

Machine Learning in Systems & Network Security

Andrej Karpathy's post, augmented by Daniel Graham

Before we can begin evaluating and auditing AI systems, we have to understand them from first principles. On Feb 12, 2026, Andrej Karpathy (co-founder at OpenAI; helped build Tesla Autopilot) released a 200-line pure-Python program implementing the fundamental ideas behind GPT. I've taken his post and turned it into a lab with exercises and visuals to help us understand the concepts deeply rather than skim them. Karpathy's post is already well written — the goal is to augment it. The Python here is also rewritten in a slightly less compressed style: ~2XX lines instead of 200, but a bit easier to read. As always, feel free to work with the people at your table. You've got this.

Original post: karpathy.ai/microgpt.html · companion video on autograd: The spelled-out intro to neural networks and backpropagation (2.5 hr)

Try it · generate names from the trained microgpt

microgpt is tiny (just 4,192 parameters) but it's still a real neural language model. Karpathy trained one on 32,033 first names (the makemore dataset). The weights are loaded right here in your browser, and the same forward pass you'll dissect later in the lab runs every time you press Send.


Take it with you: model.json (weights) · sampler.js · sampler.py

Where to find it

GitHub gist with the full source code: microgpt.py

Also available on this web page: karpathy.ai/microgpt.html

Also available as a Google Colab notebook — you can run it without installing anything

The following is a guide that steps an interested reader through the code.

Dataset

The fuel of large language models is a stream of text data, optionally separated into a set of documents. In production-grade applications, each document would be an internet web page, but for microgpt we use a simpler example: about 32,000 first names (32,033 in the makemore dataset), one per line:

import os      # os.path.exists
import random  # random.shuffle

# Let there be an input dataset `docs`: list[str] of documents (e.g. a dataset of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

The dataset looks like this. Each name is a document:

emma
olivia
ava
isabella
sophia
charlotte
mia
amelia
harper
... (~32,000 names follow)

The goal of the model is to learn the statistical patterns in the data and then generate new documents that share those patterns. As a preview, by the end of the script our model will generate ("hallucinate"!) new, plausible-sounding names. Skipping ahead, we'll get:

sample 1: kamon     sample 8: anna      sample 15: earan
sample 2: ann       sample 9: areli     sample 16: lenne
sample 3: karai     sample 10: kaina    sample 17: kana
sample 4: jaire     sample 11: konna    sample 18: lara
sample 5: vialan    sample 12: keylen   sample 19: alela
sample 6: karia     sample 13: liole    sample 20: anton
sample 7: yeran     sample 14: alerin

It doesn't look like much, but from the perspective of a model like ChatGPT, your conversation with it is just a funny-looking "document". When you initialize the document with your prompt, the model's response from its perspective is just a statistical document completion.
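To make "statistical document completion" concrete, here is a minimal sketch that is deliberately not microgpt. It swaps the transformer for a much simpler bigram counter: it records how often each character follows another and samples from those counts. The tiny corpus (the first names listed above), the `END` marker, and the `complete` helper are assumptions made for this illustration only.

```python
import random
from collections import Counter, defaultdict

# Tiny stand-in corpus: the first few names shown above (the real file has ~32,000).
docs = ["emma", "olivia", "ava", "isabella", "sophia",
        "charlotte", "mia", "amelia", "harper"]
END = "."  # hypothetical start/end-of-document marker, used only in this sketch

# Count how often each character follows each other character.
counts = defaultdict(Counter)
for doc in docs:
    chars = [END] + list(doc) + [END]
    for prev, nxt in zip(chars, chars[1:]):
        counts[prev][nxt] += 1

def complete(prompt="", max_len=20, rng=None):
    """Extend `prompt` one character at a time by sampling from the bigram counts."""
    rng = rng or random.Random(0)
    out = list(prompt)
    prev = out[-1] if out else END
    while len(out) < max_len:
        options = counts[prev]
        if not options:      # character never seen in the corpus: stop here
            break
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:       # the model chose to end the document
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

print(complete("a"))  # a name-like string that starts with "a"
print(complete())     # a completion of the empty prompt
```

The prompt plays the same role here as in a chat: it initializes the document, and everything after it is sampled. A GPT differs in how it computes the next-character probabilities (from the whole preceding context rather than one character), not in this outer loop.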

Tokenizer

Under the hood, neural networks work with numbers, not characters, so we need a way to convert text into a sequence of integer token ids and back. Production tokenizers like tiktoken (used by GPT-4) operate on chunks of characters for efficiency, but the simplest possible tokenizer just assigns one integer to each unique character in the dataset:

# Let there be a Tokenizer to translate strings to discrete symbols and back
uchars = sorted(set(''.join(docs))) # unique characters become token ids 0..n-1
BOS = len(uchars) # token id for Beginning of Sequence
vocab_size = len(uchars) + 1 # total tokens, +1 for BOS
print(f"vocab size: {vocab_size}")

We collect all unique characters across the dataset (which are just the lowercase letters a–z), sort them, and each letter gets an id by its index. The integer values themselves carry no meaning — each token is just a discrete symbol. Instead of 0, 1, 2 they could be different emoji. We also create one special token, BOS (Beginning of Sequence), which acts as a delimiter: it tells the model "a new document starts/ends here". Later during training, each document gets wrapped with BOS on both sides: [BOS, e, m, m, a, BOS]. The...
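As a small illustration of the mapping described above, the sketch below runs the same tokenizer on a three-name stand-in corpus and wraps one document in BOS on both sides. The `stoi` lookup and the `encode` and `decode` helpers are hypothetical names introduced here and are not part of the original script. With only three names the vocabulary is smaller than the full a–z set.

```python
# Tiny stand-in corpus (assumption: the real `docs` holds the ~32,000 names).
docs = ["emma", "olivia", "ava"]

uchars = sorted(set("".join(docs)))  # unique characters become token ids 0..n-1
BOS = len(uchars)                    # the id right after the last character
vocab_size = len(uchars) + 1         # +1 for BOS

# Hypothetical helpers for this sketch only.
stoi = {ch: i for i, ch in enumerate(uchars)}  # character -> token id

def encode(doc):
    """Map characters to ids and wrap the document in BOS on both sides."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(ids):
    """Map ids back to characters, dropping the BOS delimiters."""
    return "".join(uchars[i] for i in ids if i != BOS)

tokens = encode("emma")
print(uchars)          # ['a', 'e', 'i', 'l', 'm', 'o', 'v']
print(tokens)          # [7, 1, 4, 4, 0, 7]
print(decode(tokens))  # emma
```

Note that the ids depend entirely on which characters appear in the corpus: `e` is 1 here but would be 4 with the full alphabet, which is why the integer values carry no meaning on their own.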
