Model Training as Code — Aleph Alpha
Michael Barlow<br>22/05/2026
TL;DR: Model training has grown complex enough to require many specialised stages and<br>teams, and manual coordination between them doesn’t scale. At Aleph Alpha, we’ve built<br>Savanna, a model factory that implements the entire training pipeline in code, turning model<br>training into a collaborative software project. In Savanna, end-to-end training runs are<br>hermetic, and launchable with one click. This post describes Savanna, why it’s needed, and<br>the engineering culture that makes it effective.
Introduction<br>Model training is moving fast. New stages keep joining the pipeline and existing ones grow<br>more intricate, making model training an engineering challenge for three key reasons. First,<br>more complexity means more room for error: bugs or inconsistencies in data, code, or<br>configuration can cause entire training runs to fail or diverge. Second, the cost of failure<br>keeps rising: models grow larger, GPU prices rise, and ever more data is processed per run.<br>When you’re burning thousands of GPU hours, “oops” is an expensive word.<br>The third and hardest challenge is organisational. The complexity has long outgrown the<br>capacity of a single mind, so labs like ours have built large, specialised teams. The<br>problem then becomes coordinating those teams. How can members autonomously explore the<br>latest research in their area of expertise, while integrating their changes into the<br>production pipeline without breaking it or interfering with each other’s work? And how do<br>they ensure that improvements to individual stages translate into a better model at the end?<br>Traditional, manual model training processes do not have good answers to these questions.
The hidden cost of manual model training<br>Let’s break down the traditional manual process. At a high level, model training looks<br>simple: pre-training to absorb the internet, followed by post-training to learn<br>instruction-following. In reality, you don’t train a good model on your first try. Arriving<br>at a good data mix, architecture and training recipe is an iterative, compute-bound process<br>guided by evaluation, where each arrow below is weighted by its relative GPU cost:<br>[Figure: the iterative training process, with each arrow weighted by its relative GPU cost]<br>Each of these components is complex enough to warrant multiple dedicated teams. Modern<br>post-training, for example, comprises a supervised fine-tuning (SFT) stage followed by a<br>reinforcement learning (RL) stage. SFT and RL require different skillsets and tools, but<br>must be integrated to train a model.<br>Consider what a single model’s journey through the pipeline might look like in a manual<br>lab:<br>The data team finishes a new mix and sends the database path over Slack to the pre-training<br>team, who kick off a multi-week run. Two weeks in, the storage quota fills up and the<br>training run crashes. The filesystem is managed manually, so no one is sure whether it’s<br>safe to delete that 30 TB dataset with do_not_delete in its file name, and the GPUs<br>sit idle while the pre-training team works this out. When they finally relaunch, they<br>reconstruct the original setup from memory and Slack threads, hoping they didn’t forget to<br>set a flag. This is the first hidden cost: every manual step is an opportunity for human error.<br>When pre-training is eventually complete, the pre-training team hands the checkpoint to the<br>SFT team. To find a good recipe, the SFT team then manually kicks off a sweep of parallel<br>training runs with different configurations and data mixes. As checkpoints roll in, the team<br>runs their evaluation script on each one, sharing results and analysis in Slack.
Some recipes look promising, others don’t, and they repeat this process for a few weeks until<br>they settle on a good one. Without realising it, the team repeated some experiments that<br>they had already completed for the previous pre-trained checkpoint a few months back.<br>This is the second hidden cost: the team forgets its learnings. There’s no durable<br>record of the reasoning behind a hyperparameter’s current value, no formal link between a data<br>mix and its constituent datasets, and no clear attribution linking a model to the training recipe<br>that produced it. In a manual lab, this lineage is scattered across Slack, the filesystem, an<br>experiment manager and various wiki pages, and is easily lost over time.<br>The SFT team hands their checkpoint to the RL team, who kick off a training run with this<br>as the base. The final model underperforms. Is the RL recipe overfit to last month’s SFT<br>checkpoint, or is the SFT checkpoint itself at fault? After two weeks of debugging, the RL<br>team confirms it’s the latter. Neither team can execute the other’s stage, so each has<br>optimised for its own slice of the pipeline rather than the model at the end. And because<br>integration is a manual hand-off, it happens rarely, leaving a month’s worth of divergence<br>to reconcile each time. This is the third hidden cost: manual, infrequent hand-offs fragment teams’ ownership.<br>It’s clear that manual model training...