Train LLM From Scratch
FareedKhan-dev/train-llm-from-scratch
Post-Training & Alignment: Overview
When I first trained this transformer from scratch, it could continue text, but it couldn't follow instructions or reason. That's what post-training fixes. This docs/ folder walks through the whole journey I built on top of the base model: every stage is written from scratch in plain PyTorch (no trl, no peft, no transformers), trained on real public datasets, and runnable on a single GPU or scaled across multiple GPUs with DDP.
If you are new to LLM training internals, start with the new LLM Foundations section before reading the stage pages. It explains the token shapes, decoder-only Transformer, attention masks, objectives, optimization loop, and generation mechanics that every later page relies on.
Recommended reading order
Foundations first: Tokenization -> Transformer -> Attention -> Objectives -> Optimization -> Generation.
Then the full pipeline: Data -> Pretraining -> SFT -> Reward Model -> DPO -> PPO -> GRPO.
Finally, run and inspect: Evaluation, Inference / Chat, and the command cheatsheet.
The pipeline mirrors how modern aligned/reasoning models are actually built:
Mermaid source (live, editable)
flowchart TD
    PILE(["The Pile<br/>9.8B tokens"]):::data --> PRE{{"Pretrain<br/>~400M base"}}:::model
    PRE --> BASE[(base_pretrained.pt)]:::ckpt
    BASE --> SFT{{"SFT<br/>Alpaca · Dolly · GSM8K"}}:::model
    SFT --> SFTCK[(sft.pt)]:::ckpt
    SFTCK --> RM{{"Reward Model<br/>Bradley-Terry"}}:::rl
    SFTCK --> DPO{{"DPO / ORPO / KTO<br/>preference"}}:::rl
    RM --> RMCK[(reward.pt)]:::ckpt
    RMCK -->|reward signal| PPO{{"PPO<br/>GAE + clip + KL"}}:::rl
    SFTCK --> PPO
    SFTCK --> GRPO{{"GRPO / RLVR<br/>group-relative"}}:::rl
    PPO --> EVAL(["GSM8K eval<br/>+ chat / inference"]):::eval
    DPO --> EVAL
    GRPO --> EVAL
    classDef data fill:#d6ffd9,stroke:#27ae60,stroke-width:2px,color:#143d1a;
    classDef model fill:#ffe8a3,stroke:#d48806,stroke-width:2px,color:#5a3d00;
    classDef rl fill:#ffd9b3,stroke:#e67e22,stroke-width:2px,color:#6b3500;
    classDef ckpt fill:#eeeeee,stroke:#555555,stroke-width:2px,color:#222;
    classDef eval fill:#e8d6ff,stroke:#8e44ad,stroke-width:2px,color:#3d1a5a;
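The arrows in the flowchart can also be read as a dependency graph between stages. The sketch below transcribes them into a plain dictionary (checkpoint nodes are folded into the stage that produces them) and checks that a given run order respects every arrow. The stage names and both helper functions are illustrative assumptions, not identifiers from the repository.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the stages whose output it consumes, as drawn in the
# flowchart. Names are hypothetical labels chosen for this sketch.
DEPENDS_ON = {
    "pretrain": set(),
    "sft": {"pretrain"},
    "reward_model": {"sft"},
    "dpo": {"sft"},
    "ppo": {"sft", "reward_model"},  # PPO needs sft.pt and the reward signal
    "grpo": {"sft"},
    "eval": {"ppo", "dpo", "grpo"},
}


def run_order(depends_on):
    """Return one valid execution order (dependencies always come first)."""
    return list(TopologicalSorter(depends_on).static_order())


def is_valid_order(order, depends_on):
    """Check that every stage appears after all of the stages it depends on."""
    position = {stage: i for i, stage in enumerate(order)}
    return all(
        position[dep] < position[stage]
        for stage, deps in depends_on.items()
        for dep in deps
    )


order = run_order(DEPENDS_ON)
print(order)
```

Several orders are valid because DPO, the reward model, and GRPO each depend only on the SFT checkpoint. The sequence SFT -> RM -> DPO -> PPO -> GRPO -> eval is one of them.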
The stages, in order

| Stage | What it teaches the model | Doc |
| --- | --- | --- |
| Pretraining | language itself (next-token prediction on the Pile) | 02_pretraining.md |
| SFT | to follow instructions & produce the / format | 03_sft.md |
| Reward Model | to score which answer humans prefer | 04_reward_model.md |
| DPO / ORPO / KTO | to prefer better answers without an RL loop | 05_dpo.md |
| PPO | to maximize a reward (RM or verifier) with the classic RLHF loop | 06_ppo.md |
| GRPO / RLVR | to reason, using verifiable rewards (DeepSeek-R1 style) | 07_grpo.md |
| Data pipeline | how every dataset above is downloaded & preprocessed | 01_data_pipeline.md |
| Evaluation | how I measure GSM8K accuracy across all stages | 08_evaluation.md |
| Inference / chat | how to actually talk to any checkpoint | 09_inference.md |
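To make the preference and reasoning rows more concrete, here is a minimal pure-Python sketch of three textbook quantities: the Bradley-Terry pairwise loss behind the reward model, the DPO loss, and the group-relative advantage used by GRPO. These are the standard published forms written on scalars for readability. The function names, the `beta` default, and the use of a population standard deviation are assumptions for illustration, and the stage pages may implement the details differently.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(r_chosen, r_rejected):
    """Reward-model loss: push the chosen answer's score above the rejected one."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss on sequence log-probs from the policy (pi) and frozen reference (ref).

    beta=0.1 is an assumed default for this sketch, not a value from the docs.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalise each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance (assumption)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt; a verifier marked two of them correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When the policy and the reference assign identical log-probs, the DPO margin is zero and the loss equals log 2, the same value the Bradley-Terry loss takes when both answers receive the same score.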
The one design rule: wrap, don't rewrite
Everything here sits on top of the original Transformer. I changed the educational model in exactly one place: I added a forward_hidden method that returns the final hidden states the lm_head consumes. Every post-training head (a value head for PPO, a scalar reward head for the reward model) and every RL log-prob computation composes around that one method, so the from-scratch model you already understand stays intact.
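The wrapping pattern can be sketched as follows. `TinyLM` is a hypothetical stand-in for the base Transformer (one linear layer replaces the decoder stack), and `RewardModel`, `ValueHead`, and `token_log_probs` are illustrative names rather than the repository's classes. Only `forward_hidden` and `lm_head` come from the text above. Pooling the last position for the reward score is an assumption that holds only for unpadded sequences.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Hypothetical stand-in for the base model; not the repository's class."""

    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.Linear(d_model, d_model)  # placeholder for the decoder stack
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward_hidden(self, tokens):
        # (batch, seq) -> (batch, seq, d_model): the states lm_head consumes
        return torch.tanh(self.block(self.embed(tokens)))

    def forward(self, tokens):
        # (batch, seq) -> (batch, seq, vocab_size)
        return self.lm_head(self.forward_hidden(tokens))


class RewardModel(nn.Module):
    """Scalar reward head wrapped around the base model."""

    def __init__(self, base, d_model=32):
        super().__init__()
        self.base = base
        self.score = nn.Linear(d_model, 1)

    def forward(self, tokens):
        hidden = self.base.forward_hidden(tokens)
        # Assumption: no padding, so the last position summarises the sequence.
        return self.score(hidden[:, -1, :]).squeeze(-1)  # (batch,)


class ValueHead(nn.Module):
    """Per-token value estimates for PPO, wrapped the same way."""

    def __init__(self, base, d_model=32):
        super().__init__()
        self.base = base
        self.value = nn.Linear(d_model, 1)

    def forward(self, tokens):
        return self.value(self.base.forward_hidden(tokens)).squeeze(-1)  # (batch, seq)


def token_log_probs(base, tokens):
    """Log-prob the model assigns to each next token: (batch, seq - 1)."""
    logits = base(tokens)[:, :-1, :]  # predictions for positions 1..seq-1
    targets = tokens[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
```

None of the wrappers touch the base model's own forward pass, which is the point of the rule: the heads only read what `forward_hidden` returns.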
Colour legend (used in every diagram in these docs)
🟩 data / corpus · 🟦 preprocessing · 🟦⬛ storage (HDF5 / JSONL) · 🟨 model / training loop · 🟧 RL / reward · 🟥 loss / objective · 🟪 evaluation · ⬜ checkpoint
Each diagram is a hand-drawn, colour-coded Mermaid sketch, pre-rendered to a PNG and embedded as an image. GitHub's live Mermaid doesn't reliably support look: handDrawn, and some viewers (for example, the VS Code preview) block SVGs, so an embedded PNG shows everywhere. The editable Mermaid source sits in a collapsible "Mermaid source" block under each image. To regenerate the images after editing, see diagrams/README.md.
Run the whole thing
Once the base model has been pretrained (see 02_pretraining.md), the entire post-training chain runs from one script:
bash scripts/run_posttraining.sh # SFT -> RM -> DPO -> PPO -> GRPO -> eval table
See POST_TRAINING.md for the condensed command reference.