Train LLM from Scratch

Train LLM From Scratch

Initializing search

FareedKhan-dev/train-llm-from-scratch

Foundations

Theory & Pipeline

How-to

Reference

Post-Training & Alignment — Overview¶

When I first trained this transformer from scratch, it could continue text but it couldn't follow instructions or reason. That's what post-training fixes. This docs/ folder walks through the whole journey I built on top of the base model — every stage written from scratch in plain PyTorch (no trl, no peft, no transformers), trained on real public datasets, and runnable on a single GPU or scaled across multiple GPUs with DDP.

If you are new to LLM training internals, start with the new LLM Foundations section before reading the stage pages. It explains the token shapes, decoder-only Transformer, attention masks, objectives, optimization loop, and generation mechanics that every later page relies on.

Recommended reading order¶

Foundations first : Tokenization -> Transformer -> Attention -> Objectives -> Optimization -> Generation.

Then the full pipeline : Data -> Pretraining -> SFT -> Reward Model -> DPO -> PPO -> GRPO.

Finally run and inspect : Evaluation, Inference / Chat, and the command cheatsheet.

The pipeline mirrors how modern aligned/reasoning models are actually built:

Mermaid source (live, editable)

flowchart TD PILE([The Pile9.8B tokens]):::data --> PRE{{Pretrain~400M base}}:::model PRE --> BASE[(base_pretrained.pt)]:::ckpt BASE --> SFT{{SFTAlpaca · Dolly · GSM8K}}:::model SFT --> SFTCK[(sft.pt)]:::ckpt SFTCK --> RM{{Reward ModelBradley-Terry}}:::rl SFTCK --> DPO{{DPO / ORPO / KTOpreference}}:::rl RM --> RMCK[(reward.pt)]:::ckpt RMCK -->|reward signal| PPO{{PPOGAE + clip + KL}}:::rl SFTCK --> PPO SFTCK --> GRPO{{GRPO / RLVRgroup-relative}}:::rl PPO --> EVAL([GSM8K eval+ chat / inference]):::eval DPO --> EVAL GRPO --> EVAL classDef data fill:#d6ffd9,stroke:#27ae60,stroke-width:2px,color:#143d1a; classDef model fill:#ffe8a3,stroke:#d48806,stroke-width:2px,color:#5a3d00; classDef rl fill:#ffd9b3,stroke:#e67e22,stroke-width:2px,color:#6b3500; classDef ckpt fill:#eeeeee,stroke:#555555,stroke-width:2px,color:#222; classDef eval fill:#e8d6ff,stroke:#8e44ad,stroke-width:2px,color:#3d1a5a;

The stages, in order¶

Stage What it teaches the model Doc

Pretraining language itself (next-token prediction on the Pile) 02_pretraining.md

SFT to follow instructions & produce the / format 03_sft.md

Reward Model to score which answer humans prefer 04_reward_model.md

DPO / ORPO / KTO to prefer better answers without an RL loop 05_dpo.md

PPO to maximize a reward (RM or verifier) with the classic RLHF loop 06_ppo.md

GRPO / RLVR to reason, using verifiable rewards (DeepSeek-R1 style) 07_grpo.md

Data pipeline how every dataset above is downloaded & preprocessed 01_data_pipeline.md

Evaluation how I measure GSM8K accuracy across all stages 08_evaluation.md

Inference / chat how to actually talk to any checkpoint 09_inference.md

The one design rule: wrap, don't rewrite¶

Everything here sits on top of the original Transformer. I changed the educational model in exactly one place — I added a forward_hidden method that returns the final hidden states the lm_head consumes. Every post-training head (a value head for PPO, a scalar reward head for the reward model) and every RL log-prob computation composes around that one method, so the from-scratch model you already understand stays intact.

Colour legend (used in every diagram in these docs)¶

🟩 data / corpus · 🟦 preprocessing · 🟦‍⬛ storage (HDF5 / JSONL) · 🟨 model / training loop · 🟧 RL / reward · 🟥 loss / objective · 🟪 evaluation · ⬜ checkpoint

Each diagram is a hand-drawn, colour-coded Mermaid sketch, pre-rendered to a PNG and embedded as an image (GitHub's live Mermaid doesn't reliably do look: handDrawn, and some viewers — e.g. the VS Code preview — block SVGs, so an embedded PNG shows everywhere). The editable Mermaid source sits in a collapsible "Mermaid source" block under each image. To regenerate the images after editing, see diagrams/README.md.

Run the whole thing¶

Once the base model has pretrained (02_pretraining.md), the entire chain is one script:

bash scripts/run_posttraining.sh # SFT -> RM -> DPO -> PPO -> GRPO -> eval table

See POST_TRAINING.md for the condensed command reference.

Train LLM from Scratch

Related Articles

Apple WWDC 2026 Livestream

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

German ruling declares Google liable for false answers in AI Overviews