Building a JAX training loop for an LLM training run

Giles' blog

Writing the post that I wished I'd found when I started learning whatever it was...

Writing an LLM from scratch, part 34a -- building a JAX training loop for an LLM training run

Posted on 30 June 2026

AI, LLM from scratch, TIL deep dives, JAX

For over a year, I've been using Sebastian Raschka's book "Build a Large Language Model (from Scratch)" -- and the multitude of side-projects that have branched out from reading it -- as something like a curriculum for learning about modern AI. The one final task I had set myself was to build and train an LLM from scratch just using my notes -- no reference to the book, no reference to the model code I'd written following the book.

As an output, I wanted something as good as my best PyTorch model based on Raschka's code -- a base model, trained on 3.2B tokens, that my (admittedly limited) evals ranked as being close to the original GPT-2 small's quality.

I wanted to use a different framework, just to make sure I wasn't parroting code that I'd somehow memorised, so I asked people on Twitter which one I should use, and the winner was JAX.

I took a slightly different route to Raschka's book; he takes an inside-out perspective, explaining things like attention, gradually building up a complete GPT-2-style model, and then building a training loop on top of it. I wanted to go outside-in: I'd put together a training harness to train the simplest-possible model with an API similar to a real LLM, get that working to my satisfaction, and then add features to that simple model, one by one, until it had the full architecture in place. The plan (which actually worked out nicely!) was that I'd be able to show how each change improved things.

That's all done now, and I'm posting about it in two parts; in this one, I'll explain how I built the training harness, and in the next, I'll show the actual building and training of the LLM.

So let's get started!

Which...
