Karpathy's LLM teaching corpus, rendered as a designed HTML wiki

Karpathy's LLM Pedagogy


This wiki covers Andrej Karpathy's published teaching corpus on language models — seven open-source repositories and a nine-lecture YouTube series ("Neural Networks: Zero to Hero"). Together they trace the technical lineage from "what is backpropagation" through to "here is a working reproduction of GPT-2 (124M)."

The corpus is unusually coherent. The same patterns and abstractions recur across repos (Block, MultiHeadAttention, configure_optimizers, estimate_mfu, from_pretrained) at progressively larger scales. Any one repo can be read in isolation, but reading them in order shows the underlying ideas being refined.

Reading guide

If you're starting from zero and want the full arc, the order is:

1. zero-to-hero-arc: The lecture map. Read this first.
2. repos/micrograd: Scalar autograd. The conceptual root.
3. backpropagation and value-class: The algorithm and its data structure.
4. repos/makemore: First real LMs. Bigram → MLP → ... → Transformer.
5. repos/ng-video-lecture: Character-level GPT on Tiny Shakespeare.
6. repos/nanoGPT: Production-grade GPT-2 implementation.
7. repos/build-nanogpt: Faithful GPT-2 reproduction with every optimization.
8. repos/llama2-c: Llama 2 in PyTorch + pure C inference. The "modern" architecture.
9. repos/llm-c: Same training task as build-nanogpt, in pure C/CUDA.

If you want to learn a specific concept, jump to the concept page; each one cross-references the repos that demonstrate it.

The architecture, in pieces

The transformer architecture as Karpathy teaches it, broken into independent pieces:

| Topic | Page |
|---|---|
| The repeating unit | transformer-block |
| Information mixing across positions | attention |
| Stability mechanism for deep stacks | residual-connections |
| Per-layer normalization | layernorm-vs-rmsnorm |
| Per-position nonlinearity | gelu-and-swiglu |
| Positional information (GPT-2 vs Llama) | rope |
| Vocabulary and embedding | tokenization, character-vs-bpe |
| Embedding-unembedding sharing | weight-tying |

Training, in pieces

| Topic | Page |
|---|---|
| Gradient computation | backpropagation, value-class |
| Parameter update | adamw |
| Initialization | weight-init |
| Learning rate over time | learning-rate-schedules |
| Batches and effective batch size | gradient-accumulation, dataloader |
| Numerical precision | mixed-precision-and-mfu |
| Keeping training alive | training-stability |
| Downstream evaluation | hellaswag-eval |

Inference

| Topic | Page |
|---|---|
| Token selection | sampling |
| Generation acceleration | kv-cache |
| Pure-C runtime | repos/llama2-c |

Three "model families" to compare

The corpus contains three subtly different transformer architectures, useful to compare against each other:

| Component | GPT-2 (ng-video-lecture, nanoGPT, build-nanogpt, llm.c) | Llama 2 (llama2.c) | makemore Transformer |
|---|---|---|---|
| Normalization | LayerNorm | RMSNorm | LayerNorm |
| Positional | Learned embedding | RoPE | Learned embedding |
| Activation | GELU | SwiGLU | GELU |
| Tokenizer | BPE (50257) | SentencePiece BPE (32000) | character-level |
| Attention | Multi-head | Grouped-query | Multi-head |

Same skeleton, different organs. Once you know the skeleton (the transformer block wrapped in residuals and a stack), swapping organs is straightforward.
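The "skeleton versus organs" split can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from any of the repos: it assumes a pre-norm residual block, omits all learned weights, and uses two hypothetical stand-ins (`causal_mean_mixer` for attention, `relu_mlp` for the per-position MLP). The only organ swapped here is the normalization function.

```python
import math

def layernorm(v, eps=1e-5):
    # Subtract the mean, divide by the standard deviation (no learned scale/shift).
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def rmsnorm(v, eps=1e-5):
    # Divide by the root-mean-square; no mean subtraction (no learned scale).
    ms = sum(a * a for a in v) / len(v)
    return [a / math.sqrt(ms + eps) for a in v]

def causal_mean_mixer(seq):
    # Hypothetical stand-in for attention: each position averages
    # itself and all earlier positions, so no future information leaks in.
    out = []
    for t in range(len(seq)):
        prefix = seq[: t + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def relu_mlp(v):
    # Hypothetical stand-in for the per-position MLP (weights omitted).
    return [max(0.0, a) for a in v]

def add(a, b):
    # Elementwise residual add over a sequence of vectors.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block(seq, norm, mixer, mlp):
    # The skeleton: x = x + mixer(norm(x)); x = x + mlp(norm(x)).
    seq = add(seq, mixer([norm(v) for v in seq]))
    seq = add(seq, [mlp(norm(v)) for v in seq])
    return seq

def run_stack(seq, norm, n_layer=2):
    # A stack is the same block applied n_layer times.
    for _ in range(n_layer):
        seq = block(seq, norm, causal_mean_mixer, relu_mlp)
    return seq

x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]  # 2 positions, 4 channels
gpt2_like = run_stack(x, layernorm)   # LayerNorm organ
llama_like = run_stack(x, rmsnorm)    # RMSNorm organ, same skeleton
```

Only the `norm` argument changes between the two calls; `block` and `run_stack` stay untouched. The activation and positional scheme could be made pluggable in the same way.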

What's not in this wiki

Things outside the scope of the corpus:

- Post-training (SFT, RLHF, DPO): None of these repos do instruction tuning or alignment. nanochat does, but it's not in the corpus.
- Model parallelism beyond DDP: No tensor parallelism, no pipeline parallelism. llm.c has ZeRO-1 optimizer sharding but no model sharding.
- Multimodal: Text-only throughout.
- MoE: Dense models only.

In scope: dense, decoder-only, pretraining + base inference, up to GPT-2 / Llama 2 scale. Within that scope it's the most complete teaching resource available.

Cross-reference conventions

Every page in this wiki uses markdown reference links: [name](name.md) for concepts, [name](repos/name.md) for repos. The link text is usually the unqualified name; the path tells you whether it's a concept or a repo page.

For agents post-processing this wiki: every page is a self-contained topic that can be rendered as a single HTML page. Internal links between pages are the primary structural signal of the wiki graph. A flat concepts/ directory was rejected: concepts live at the wiki root and repos live in a repos/ subdirectory, because concepts are first-class citizens and repos are case studies that ground them.
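As a rough illustration of how a post-processing agent might use this convention, the sketch below extracts links from a page's markdown source and classifies each target by its path. The function names (`classify_links`, `build_graph`) and the sample text are hypothetical, and the sketch assumes every link path is written relative to the wiki root, as the convention above describes.

```python
import re

# Matches [text](path.md); assumes no spaces or nested brackets in links.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+\.md)\)")

def classify_links(markdown_text):
    """Return (link_text, path, kind) triples, where kind is 'repo' or 'concept'."""
    links = []
    for text, path in LINK_RE.findall(markdown_text):
        kind = "repo" if path.startswith("repos/") else "concept"
        links.append((text, path, kind))
    return links

def build_graph(pages):
    """Map each page path to the sorted, de-duplicated list of pages it links to.

    `pages` is a dict of page path -> markdown source.
    """
    return {
        page: sorted({path for _, path, _ in classify_links(source)})
        for page, source in pages.items()
    }

sample = "See [attention](attention.md) and [nanoGPT](repos/nanoGPT.md)."
links = classify_links(sample)
graph = build_graph({"index.md": sample})
```

Here the path prefix alone decides whether a target is a concept or a repo page, which is the property the layout is designed to provide.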
