Stanford CS336 | Language Modeling from Scratch
CS336: Language Modeling from Scratch
Stanford / Spring 2026
Previous offerings: Spring 2025 | Spring 2024
Course Staff
Tatsunori Hashimoto
Instructor
Percy Liang
Instructor
Herman Brunborg
CA
Marcel Rød
CA
Steven Cao
CA
Logistics
Lectures: Monday/Wednesday 3:00-4:20pm in Skilling Auditorium
Recordings: YouTube playlist
Office hours:
Percy Liang: Fridays 11am-12pm in Gates 366
Tatsu Hashimoto: Tuesdays 11am-12pm in Gates 364
Marcel Rød: Tuesdays 4:30-5:30pm in Gates 498, Wednesdays 4:30-5:30pm in Gates 415
Herman Brunborg: Wednesdays and Fridays 1:30-2:30pm in Gates 392
Steven Cao: Mondays 4:30-5:30pm and Thursdays 9:30-10:30am in Gates 200
Contact: Students should ask all course-related questions in public Slack channels. All announcements will also be made in Slack. For personal matters, email cs336-spr2526-staff@lists.stanford.edu.
Content
What is this course about?
Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general-purpose system address a range of downstream tasks. As the fields of artificial intelligence (AI), machine learning (ML), and NLP continue to grow, a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to give students a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.
Prerequisites
Proficiency in Python
The majority of class assignments will be in Python. Unlike most other AI classes, students will be given minimal scaffolding. The amount of code you will write will be at least an order of magnitude greater than for other classes. Therefore, being proficient in Python and software engineering is paramount.
Experience with deep learning and systems optimization
A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to have a strong familiarity with PyTorch and to know basic systems concepts like the memory hierarchy.
College Calculus, Linear Algebra (e.g. MATH 51, CME 100)
You should be comfortable understanding matrix/vector notation and operations.
Basic Probability and Statistics (e.g. CS 109 or equivalent)
You should know the basics of probabilities, Gaussian distributions, mean, standard deviation, etc.
Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N)
You should be comfortable with the basics of machine learning and deep learning.
Note that this is a 5-unit class. This is a very implementation-heavy class, so please allocate enough time for it.
Coursework
Assignments
Assignment 1: Basics
Implement all of the components (tokenizer, model architecture, optimizer) necessary to train a standard Transformer language model.
Train a minimal language model.
Assignment 2: Systems
Profile and benchmark the model and layers from Assignment 1 using advanced tools, and optimize attention with your own Triton implementation of FlashAttention2.
Build a memory-efficient, distributed version of the Assignment 1 model training code.
Assignment 3: Scaling
Study and understand model initializations and weight + activation dynamics.
Understand the function of each component of the Transformer.
Query a training API to fit a scaling law to project model scaling.
Assignment 4: Data
Convert raw Common Crawl dumps into usable pretraining data.
Perform filtering and deduplication to improve model performance.
Assignment 5: Alignment and Reasoning RL
Apply supervised finetuning and reinforcement learning to train LMs to reason when solving math problems.
Optional Part 2: Implement and apply safety alignment methods such as DPO.
All (currently tentative) deadlines are listed in the schedule.
GPU compute for self-study
If you are following along at home, you can access GPU compute from a cloud provider to complete the assignments.
Here are a few options (public pricing for a single B200 GPU as of March 28, 2026):
Modal (sponsor): $6.25/hour. Offers $30 of free monthly compute. You are charged only for actual compute (no idle resources), and their UX makes it simple to switch between local development and large-scale GPU experiments. (Modal Pricing)
Lambda Labs: $6.69/hour (Lambda Pricing)
RunPod: $4.99/hour (RunPod Pricing)
Nebius: $5.50/hour ($3.05/hour preemptible) (Nebius Pricing)
Together: $7.49/hour, minimum 8 GPUs, cheaper for longer commitments (Together Pricing)
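To compare these options for your own budget, a rough cost estimate can be sketched in a few lines of Python. This is only an illustration built from the prices quoted above. It assumes that Together's rate is per GPU and billed for at least 8 GPUs, and that Modal's $30 monthly credit is fully available for the job. The helper `estimate_cost` and the 20-hour workload are hypothetical, not course figures.

```python
# Hourly single-B200 prices as quoted in the list above (USD per GPU-hour).
HOURLY_USD = {
    "Modal": 6.25,
    "Lambda Labs": 6.69,
    "RunPod": 4.99,
    "Nebius": 5.50,
    "Nebius (preemptible)": 3.05,
    "Together": 7.49,
}

# Assumption: the listed minimum means you are billed for at least this many GPUs.
MIN_GPUS = {"Together": 8}

# Assumption: the full free monthly credit is unused and applies to this job.
FREE_CREDIT_USD = {"Modal": 30.0}


def estimate_cost(provider: str, hours: float, gpus: int = 1) -> float:
    """Estimate the USD cost of running `gpus` GPUs for `hours` hours each."""
    billed_gpus = max(gpus, MIN_GPUS.get(provider, 1))
    gross = HOURLY_USD[provider] * billed_gpus * hours
    net = max(0.0, gross - FREE_CREDIT_USD.get(provider, 0.0))
    return round(net, 2)


if __name__ == "__main__":
    hours = 20  # hypothetical workload, not an estimate from the course
    for name in sorted(HOURLY_USD, key=lambda p: estimate_cost(p, hours)):
        print(f"{name:22s} ${estimate_cost(name, hours):9.2f}")
```

Actual bills depend on details this sketch ignores, such as storage, data transfer, preemption restarts, and commitment discounts, so treat the output as a starting point and check each provider's pricing page.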
For...