Model Training as Code


Research · Michael Barlow · 22/05/2026

TL;DR: Model training has grown complex enough to require many specialised stages and teams, and manual coordination between them doesn’t scale. At Aleph Alpha, we’ve built Savanna, a model factory that implements the entire training pipeline in code, turning model training into a collaborative software project. In Savanna, end-to-end training runs are hermetic and can be launched with one click. This post describes Savanna, why it’s needed, and the engineering culture that makes it effective.

Introduction

Model training is moving fast. New stages keep joining the pipeline and existing ones grow more intricate, making model training an engineering challenge for three key reasons.

First, more complexity means more room for error: bugs or inconsistencies in data, code, or configuration can cause entire training runs to fail or diverge.

Second, the cost of failure keeps rising: models grow larger, GPU prices rise, and ever more data is processed per run. When you’re burning thousands of GPU hours, “oops” is an expensive word.

The third and hardest challenge is organisational. The complexity has long outgrown the capacity of a single mind, so labs like ours have built large, specialised teams. The problem then becomes coordinating those teams. How can members autonomously explore the latest research in their area of expertise, while integrating their changes into the production pipeline without breaking it or interfering with each other’s work? And how do they ensure that improvements to individual stages translate into a better model at the end? Traditional, manual model training processes do not have good answers to these questions.

The hidden cost of manual model training

Let’s break down the traditional manual process. At a high level, model training looks simple: pre-training to absorb the internet, followed by post-training to learn instruction-following. In reality, you don’t train a good model on your first try. Arriving at a good data mix, architecture and training recipe is an iterative, compute-bound process guided by evaluation, where each arrow below is weighted by its relative GPU cost:

[Figure: training pipeline diagram, with arrows weighted by relative GPU cost]

Each of these components is complex enough to warrant multiple dedicated teams. Modern post-training, for example, comprises a supervised fine-tuning (SFT) stage followed by a reinforcement learning (RL) stage. SFT and RL require different skillsets and tools, but must be integrated to train a model.

Consider what a single model’s journey through the pipeline might look like in a manual lab:

The data team finishes a new mix and sends the database path over Slack to the pre-training team, who kick off a multi-week run. Two weeks in, the storage quota fills up and the training run crashes. The filesystem is managed manually, so no one is sure whether it’s safe to delete that 30TB dataset with do_not_delete in its file name, and the GPUs sit idle while the pre-training team works this out. When they finally relaunch, they reconstruct the original setup from memory and Slack threads, hoping they didn’t forget to set a flag.

This is the first hidden cost: every manual step is an opportunity for human error.

When pre-training is eventually complete, the pre-training team hands the checkpoint to the SFT team. To find a good recipe, the SFT team then manually kicks off a sweep of parallel trainings with different configurations and data mixes. As checkpoints roll in, the team runs their evaluation script on each one, sharing results and analysis in Slack. Some recipes look promising, others don’t, and they repeat this process for a few weeks until they narrow down a good one.

Without realising it, the team repeated some experiments that they already completed for the previous pre-trained checkpoint a few months back.

This is the second hidden cost: the team forgets its learnings. There’s no durable record of the reasoning behind a hyperparameter’s current value, no formal link between a data mix and its constituent datasets, and no clear attribution attaching a model to the training recipe that produced it. In a manual lab, this lineage is scattered across Slack, the filesystem, an experiment manager and various wiki pages, and is easily lost over time.

The SFT team hands their checkpoint to the RL team, who kick off a training run with this as the base. The final model underperforms. Is the RL recipe overfit to last month’s SFT checkpoint, or is the SFT checkpoint itself at fault? After two weeks of debugging, the RL team confirms it’s the latter. Neither team can execute the other’s stage, so each had optimised for its own slice of the pipeline rather than the model at the end. And because integration is a manual hand-off, it happens rarely, leaving a month’s worth of divergence to reconcile each time.

This is the third hidden cost: manual, infrequent hand-offs fragment teams’ ownership.

It’s clear that manual model training...
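To make the second hidden cost concrete, here is a minimal sketch of what a durable lineage record could look like when recipes live in code. It is not Savanna's API: the names DataMix, Recipe, Registry and every dataset name and hyperparameter value below are hypothetical, chosen only for illustration. The idea is that a recipe's identity is a hash of its contents, so a repeated experiment is detected rather than silently re-run, and each model can be traced back to the recipe and data mix that produced it.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DataMix:
    """A data mix with an explicit link to its constituent datasets."""
    name: str
    # (dataset identifier, sampling weight) pairs; names are hypothetical.
    datasets: tuple[tuple[str, float], ...]


@dataclass(frozen=True)
class Recipe:
    """Everything needed to reproduce one training stage."""
    stage: str            # e.g. "pretrain", "sft" or "rl"
    base_checkpoint: str  # fingerprint of the parent run, "" if none
    data_mix: DataMix
    hyperparameters: tuple[tuple[str, float], ...]
    rationale: str = ""   # why the hyperparameters have these values

    def fingerprint(self) -> str:
        """Stable content hash. The rationale is excluded so that
        editing the notes does not change the experiment's identity."""
        payload = asdict(self)
        payload.pop("rationale")
        blob = json.dumps(payload, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]


class Registry:
    """In-memory stand-in for a durable lineage store."""

    def __init__(self) -> None:
        self._runs: dict[str, Recipe] = {}

    def register(self, recipe: Recipe) -> tuple[str, bool]:
        """Return (fingerprint, is_new); a repeated recipe is flagged."""
        key = recipe.fingerprint()
        is_new = key not in self._runs
        if is_new:
            self._runs[key] = recipe
        return key, is_new

    def lineage(self, key: str) -> list[str]:
        """Walk from a run back to its root via base_checkpoint."""
        chain = []
        while key:
            recipe = self._runs[key]
            chain.append(f"{recipe.stage}:{key}")
            key = recipe.base_checkpoint
        return chain


registry = Registry()

pretrain = Recipe(
    stage="pretrain",
    base_checkpoint="",
    data_mix=DataMix("pretrain-mix", (("web", 1.0),)),
    hyperparameters=(("lr", 3e-4),),
)
pre_key, _ = registry.register(pretrain)

sft_mix = DataMix("sft-mix", (("dataset_a", 0.7), ("dataset_b", 0.3)))
sft = Recipe("sft", pre_key, sft_mix, (("lr", 1e-5),),
             rationale="Lower LR to avoid forgetting.")
sft_key, is_new = registry.register(sft)

# The same experiment submitted months later, with different notes.
repeat = Recipe("sft", pre_key, sft_mix, (("lr", 1e-5),),
                rationale="Trying a small LR.")
_, repeat_is_new = registry.register(repeat)

print(registry.lineage(sft_key))
print("repeat detected:", not repeat_is_new)
```

Hashing the recipe's contents rather than assigning run names by hand is one possible design choice among several. It trades human-readable identifiers for the guarantee that two identical configurations always map to the same key.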
