CS336: Language Modeling from Scratch



Stanford / Spring 2026

Previous offerings: Spring 2025 | Spring 2024

Course Staff

Tatsunori Hashimoto

Instructor

Percy Liang

Instructor

Herman Brunborg

CA

Marcel Rød

CA

Steven Cao

CA

Logistics

Lectures: Monday/Wednesday 3:00-4:20pm in Skilling Auditorium

Recordings: YouTube playlist

Office hours:

Percy Liang: Fridays 11am-12pm in Gates 366

Tatsu Hashimoto: Tuesdays 11am-12pm in Gates 364

Marcel Rød: Tuesdays 4:30-5:30pm in Gates 498, Wednesdays 4:30-5:30pm in Gates 415

Herman Brunborg: Wednesdays 1:30-2:30pm, Fridays 1:30-2:30pm in Gates 392

Steven Cao: Mondays 4:30-5:30pm, Thursdays 9:30-10:30am in Gates 200

Contact: Students should ask all course-related questions in public Slack channels. All announcements will also be made in Slack. For personal matters, email cs336-spr2526-staff@lists.stanford.edu.

Content

What is this course about?

Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general-purpose system address a range of downstream tasks. As the fields of artificial intelligence (AI), machine learning (ML), and NLP continue to grow, a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.

Prerequisites

Proficiency in Python

The majority of class assignments will be in Python. Unlike most other AI classes, students will be given minimal scaffolding. The amount of code you will write will be at least an order of magnitude greater than for other classes. Therefore, being proficient in Python and software engineering is paramount.

Experience with deep learning and systems optimization

A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to have a strong familiarity with PyTorch and to know basic systems concepts like the memory hierarchy.

College Calculus, Linear Algebra (e.g. MATH 51, CME 100)

You should be comfortable understanding matrix/vector notation and operations.

Basic Probability and Statistics (e.g. CS 109 or equivalent)

You should know the basics of probabilities, Gaussian distributions, mean, standard deviation, etc.

Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N)

You should be comfortable with the basics of machine learning and deep learning.

Note that this is a 5-unit class. This is a very implementation-heavy class, so please allocate enough time for it.

Coursework

Assignments

Assignment 1: Basics

Implement all of the components (tokenizer, model architecture, optimizer) necessary to train a standard Transformer language model.

Train a minimal language model.

Assignment 2: Systems

Profile and benchmark the model and layers from Assignment 1 using advanced tools, and optimize attention with your own Triton implementation of FlashAttention2.

Build a memory-efficient, distributed version of the Assignment 1 model training code.

Assignment 3: Scaling

Study and understand model initializations and weight + activation dynamics.

Understand the function of each component of the Transformer.

Query a training API to fit a scaling law to project model scaling.
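As a rough illustration of what "fitting a scaling law" can mean, the sketch below fits a pure power law, loss = a * compute^(-b), by least squares in log-log space and extrapolates it to a larger budget. This is not the assignment's method or API: the function name `fit_power_law`, the synthetic data points, and the power-law form itself (which ignores any irreducible-loss term) are assumptions made for the example.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx                      # equals -b for a decaying power law
    intercept = mean_y - slope * mean_x    # equals log(a)
    return math.exp(intercept), -slope

# Hypothetical (compute, loss) pairs generated from a known power law,
# standing in for results returned by small training runs.
compute = [1e15, 1e16, 1e17, 1e18]
loss = [50.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate the fitted curve to a budget 10x beyond the largest run.
predicted_loss = a * (1e19) ** (-b)
print(f"a={a:.3f}, b={b:.4f}, predicted loss at 1e19: {predicted_loss:.3f}")
```

Because the data here are noise-free, the fit recovers the generating parameters; with real runs the points scatter around the line and the choice of functional form matters far more.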

Assignment 4: Data

Convert raw Common Crawl dumps into usable pretraining data.

Perform filtering and deduplication to improve model performance.
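For readers unfamiliar with deduplication, here is a minimal sketch of the simplest variant: exact deduplication, where documents are compared by a hash of their normalized text. The helper names `normalize` and `deduplicate` and the normalization rule (lowercasing, collapsing whitespace) are assumptions for this example; the assignment may call for other techniques, such as fuzzy or near-duplicate detection.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(documents):
    """Keep the first occurrence of each document, compared by normalized hash."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# Hypothetical toy corpus: the first two entries differ only in case and spacing.
docs = ["Hello  world", "hello world", "Something else"]
unique_docs = deduplicate(docs)
print(unique_docs)
```

Storing only the digests keeps memory use proportional to the number of distinct documents rather than their total size, which is the main reason hashing is used here.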

Assignment 5: Alignment and Reasoning RL

Apply supervised finetuning and reinforcement learning to train LMs to reason when solving math problems.

Optional Part 2: Implement and apply safety alignment methods such as DPO.

All (currently tentative) deadlines are listed in the schedule.

GPU compute for self-study

If you are following along at home, you can access GPU compute from a cloud provider to complete the assignments.

Here are a few options (public pricing for a single B200 GPU on March 28, 2026):

Modal (sponsor): $6.25/hour. Offers $30 of free monthly compute. You are only charged for actual compute (no idle resources) and their UX makes switching between local dev and large-scale GPU experiments simple. (Modal Pricing)

Lambda Labs: $6.69/hour (Lambda Pricing)

RunPod: $4.99/hour (RunPod Pricing)

Nebius: $5.50/hour ($3.05/hour preemptible) (Nebius Pricing)

Together: $7.49/hour, minimum 8 GPUs, cheaper for longer commitments (Together Pricing)
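To compare these options for a given workload, a quick back-of-the-envelope calculation may help. The sketch below uses the single-GPU hourly prices listed above; the 20 GPU-hour workload and the `estimate_cost` helper are hypothetical, and Together is left out because its 8-GPU minimum makes it not directly comparable on a per-GPU basis.

```python
# Hourly single-B200 prices in USD, copied from the list above
# (public pricing on March 28, 2026; check the providers for current rates).
HOURLY_PRICE = {
    "Modal": 6.25,
    "Lambda Labs": 6.69,
    "RunPod": 4.99,
    "Nebius": 5.50,
    "Nebius (preemptible)": 3.05,
}

def estimate_cost(provider, gpu_hours, free_credit=0.0):
    """Return the estimated bill in USD after subtracting any free credit."""
    cost = HOURLY_PRICE[provider] * gpu_hours
    return max(0.0, cost - free_credit)

gpu_hours = 20  # hypothetical workload, not an estimate for any assignment

costs = {name: estimate_cost(name, gpu_hours) for name in HOURLY_PRICE}
# Modal's $30 of free monthly compute, applied once to the same workload.
modal_with_credit = estimate_cost("Modal", gpu_hours, free_credit=30.0)

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.2f}")
print(f"Modal after free credit: ${modal_with_credit:.2f}")
```

Preemptible pricing is cheaper per hour, but interrupted jobs may need to be restarted, so the effective cost depends on how well your training code checkpoints.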

For...
