CS336: Language Modeling from Scratch


Stanford / Spring 2026 (previous offerings: Spring 2025 | Spring 2024)

Course Staff

Tatsunori Hashimoto

Instructor

Percy Liang

Instructor

Herman Brunborg

Marcel Rød

Steven Cao

Logistics

Lectures: Monday/Wednesday 3:00-4:20pm in Skilling Auditorium

Recordings: YouTube playlist

Office hours:

Percy Liang: Fridays 11am-12pm in Gates 366

Tatsu Hashimoto: Tuesdays 11am-12pm in Gates 364

Marcel Rød: Tuesdays 4:30-5:30pm in Gates 498, Wednesdays 4:30-5:30pm in Gates 415

Herman Brunborg: Wednesdays 1:30-2:30pm, Fridays 1:30-2:30pm in Gates 392

Steven Cao: Mondays 4:30-5:30pm, Thursdays 9:30-10:30am in Gates 200

Contact: Students should ask all course-related questions in public Slack channels. All announcements will also be made in Slack. For personal matters, email cs336-spr2526-staff@lists.stanford.edu.

Content

What is this course about?

Language models are the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm in which a single general-purpose system addresses a range of downstream tasks. As the fields of artificial intelligence (AI), machine learning (ML), and NLP continue to grow, a deep understanding of language models becomes essential for scientists and engineers alike. This course gives students a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that build an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.

Prerequisites

Proficiency in Python

The majority of class assignments will be in Python. Unlike most other AI classes, students will be given minimal scaffolding. The amount of code you will write will be at least an order of magnitude greater than for other classes. Therefore, being proficient in Python and software engineering is paramount.

Experience with deep learning and systems optimization

A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to have strong familiarity with PyTorch and to know basic systems concepts such as the memory hierarchy.

College Calculus, Linear Algebra (e.g. MATH 51, CME 100)

You should be comfortable understanding matrix/vector notation and operations.

Basic Probability and Statistics (e.g. CS 109 or equivalent)

You should know the basics of probabilities, Gaussian distributions, mean, standard deviation, etc.

Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N)

You should be comfortable with the basics of machine learning and deep learning.

Note that this is a 5-unit class. This is a very implementation-heavy class, so please allocate enough time for it.

Coursework

Assignments

Assignment 1 : Basics

Implement all of the components (tokenizer, model architecture, optimizer) necessary to train a standard Transformer language model.

Train a minimal language model.
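As a rough illustration of the tokenizer component, here is a minimal sketch of one training step of byte-pair encoding (BPE). The page does not say which tokenization algorithm the assignment uses, so BPE is an assumption here, and the function names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return None
    # Break ties by the pair itself so the result is deterministic.
    return max(pairs, key=lambda p: (pairs[p], p))


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


# Start from raw UTF-8 bytes (ids 0..255) and perform a single merge.
ids = list("banana band".encode("utf-8"))
pair = most_frequent_pair(ids)
merged = merge_pair(ids, pair, 256)
```

A full tokenizer trainer would repeat this step until it reaches a target vocabulary size, recording each merge so that the same merges can be replayed at encoding time.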

Assignment 2 : Systems

Profile and benchmark the model and layers from Assignment 1 using advanced tools, then optimize attention with your own Triton implementation of FlashAttention2.

Build a memory-efficient, distributed version of the Assignment 1 model training code.
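The benchmarking part of this assignment can be pictured with a small timing helper. This is only a sketch under stated assumptions: the helper name `benchmark`, its parameters, and the CPU-only workload are hypothetical, and the page does not specify which profiling tools the assignment uses. The `sync` hook is where a GPU run would typically pass `torch.cuda.synchronize`, because GPU kernels launch asynchronously.

```python
import statistics
import time


def benchmark(fn, warmup=3, trials=10, sync=None):
    """Time `fn` over several trials after a warm-up phase.

    `sync` is an optional callable run once after the warm-up and once after
    each timed call, before the clock is read. On a GPU it would typically be
    `torch.cuda.synchronize`, so pending kernels finish before timing stops.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)


# CPU-only stand-in workload so the sketch runs without a GPU.
mean_s, std_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{mean_s * 1e3:.3f} ms +/- {std_s * 1e3:.3f} ms")
```

Warm-up iterations matter because the first calls often include one-time costs such as compilation or cache population, which would otherwise inflate the mean.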

Assignment 3 : Scaling

Study and understand model initializations and the dynamics of weights and activations.

Understand the function of each component of the Transformer.

Query a training API to fit a scaling law to project model scaling.
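To make "fit a scaling law" concrete, the sketch below fits a pure power law, loss = a * C^b, by least squares in log-log space. It is an illustration under assumptions: the data points are synthetic, the helper name `fit_power_law` is hypothetical, and the functional form the assignment actually expects is not stated on this page (practical scaling laws often add an irreducible-loss term that this sketch omits).

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Hypothetical (compute in FLOPs, final loss) pairs, generated from
# loss = 50 * C**-0.05 purely for illustration; not real measurements.
compute = [1e15, 1e16, 1e17, 1e18]
loss = [50.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
predicted = a * 1e19 ** b  # extrapolate one order of magnitude further
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating parameters; with noisy measurements the same code returns the best-fit line in log-log space.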

Assignment 4 : Data

Convert raw Common Crawl dumps into usable pretraining data.

Perform filtering and deduplication to improve model performance.
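One simple form of deduplication is exact matching after light normalization, sketched below. The page does not say which deduplication method the assignment uses, so this is an assumption; the function names and the toy corpus are hypothetical, and near-duplicate detection is out of scope for this sketch.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


# Hypothetical toy corpus; real Common Crawl text would be far larger.
docs = [
    "Hello   world",
    "hello world",
    "A different page",
    "Hello world\n",
]
unique = deduplicate(docs)
```

Storing fixed-size digests rather than full documents keeps the memory cost of the `seen` set independent of document length.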

Assignment 5 : Alignment and Reasoning RL

Apply supervised finetuning and reinforcement learning to train LMs to reason when solving math problems.

Optional Part 2 : implement and apply safety alignment methods such as DPO.
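For readers unfamiliar with DPO (direct preference optimization), the following is a minimal sketch of its loss for a single preference pair. It assumes the summed log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model have already been computed; the function name, argument names, and the default `beta=0.1` are illustrative choices, not values taken from the course.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: zero margin, so the loss is log(2).
loss_equal = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: lower loss.
loss_better = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

In training, this quantity would be averaged over a batch of preference pairs and minimized with respect to the policy's parameters only.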

All (currently tentative) deadlines are listed in the schedule.

GPU compute for self-study

If you are following along at home, you can access GPU compute from a cloud provider to complete the assignments.

Here are a few options (public pricing for a single B200 GPU on March 28, 2026):

Modal (sponsor): $6.25/hour. Offers $30 of free monthly compute. You are charged only for actual compute (no idle resources), and their UX makes it simple to switch between local development and large-scale GPU experiments. (Modal Pricing)

Lambda Labs: $6.69/hour (Lambda Pricing)

RunPod: $4.99/hour (RunPod Pricing)

Nebius: $5.50/hour ($3.05/hour preemptible) (Nebius Pricing)

Together: $7.49/hour, minimum 8 GPUs, cheaper for longer commitments (Together Pricing)
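To compare these options for a given workload, a back-of-the-envelope estimate can be sketched as follows. The prices are the ones listed above; the 20-hour workload is hypothetical, and the sketch assumes that a provider with a GPU minimum bills for all of those GPUs for the full duration. It ignores Modal's free monthly credit and any commitment discounts.

```python
# Hourly on-demand prices for a single B200, copied from the list above
# (public pricing on March 28, 2026; these may have changed since).
PRICE_PER_GPU_HOUR = {
    "Modal": 6.25,
    "Lambda Labs": 6.69,
    "RunPod": 4.99,
    "Nebius": 5.50,
    "Nebius (preemptible)": 3.05,
    "Together": 7.49,
}

# Minimum number of GPUs per rental, where the list states one.
MIN_GPUS = {"Together": 8}


def estimate_cost(provider, hours, gpus=1):
    """Estimated dollars for `hours` of wall-clock time on `gpus` GPUs.

    Assumption: a provider's GPU minimum is billed in full even if fewer
    GPUs are actually needed.
    """
    billed_gpus = max(gpus, MIN_GPUS.get(provider, 1))
    return PRICE_PER_GPU_HOUR[provider] * billed_gpus * hours


HOURS = 20  # hypothetical single-GPU workload
for provider in sorted(PRICE_PER_GPU_HOUR, key=lambda p: estimate_cost(p, HOURS)):
    print(f"{provider:22s} ${estimate_cost(provider, HOURS):8.2f}")
```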

For...
