frontier model training methodologies | Alex Wa’s Blog
How do labs train a frontier, multi-billion parameter model? We look at seven open-weight frontier models: Hugging Face’s SmolLM3, Prime Intellect’s Intellect 3, Nous Research’s Hermes 4, OpenAI’s gpt-oss-120b, Moonshot’s Kimi K2, DeepSeek’s DeepSeek-R1, and Arcee’s Trinity series. This blog attempts to distill the techniques, motivations, and considerations behind training these models, with an emphasis on training methodology over infrastructure.
These notes largely follow the structure of Hugging Face’s SmolLM3 report because it is so extensive, and they are currently supplemented with notes from other reports, including Intellect-3, gpt-oss-120b, Hermes 4, DeepSeek, and Kimi. While this blog explores some infrastructure-related ideas like in-flight weight updates and multi-client orchestrators, those reports mention many others, such as expert parallelism and quantization. Hugging Face writes more about gpt-oss-120b’s infrastructure here.
table of contents
tl;dr
a minimal training playbook
general practices
architecture and set-up
architecture decision heuristics
attention
gated attention
document masking
embedding sharing
positional encodings
attention for long contexts
MoE
hybrid models
architecture takeaways
stability
$z$-loss
logit softcapping
weight decay and embeddings
QK-norm
RMSNorm
other design considerations
stability takeaways
tokenizer
optimizers and training hyperparameters
adamW
muon
learning rates
batch size
scaling laws
data curation and pre-training
multi-stage training
ablation
token utility
pre-training data
SmolLM3
hermes 4
data takeaways
mid-training
post-training
evals
post-training data
intellect 3
hermes 4
kimi k2
chat template
sft
capabilities
sequence packing
learning rate and epochs
preference optimization (PO)
algorithms
RL
RLVR and rubrics
online data filtering
alternatives to RL
limitations
post-training takeaways
behaviors and safety
safety testing and mitigation
behaviors and latent capabilities
the training marathon
vanishing throughput
noisy loss
tensor parallelism
multi-client orchestrator
the usual suspects
training ops takeaways
tl;dr
Frontier training is a systems problem: data mixture, architecture, and stability choices dominate most algorithmic tweaks.
Start from a strong baseline and ablate fast and reliably; derisk changes and avoid multi-variable edits.
For long context, document masking + RNoPE/YaRN-style scaling is a robust default; attention variants trade compute for reach.
GQA with small groups (2/4/8 groups) typically outperforms MHA and MQA in ablations at similar model scales; MLA cuts KV cache but raises implementation complexity.
MoE is efficient when it is load-balanced; routing, auxiliary or bias balancing, and global stats are non-negotiable.
Tokenizer design should mirror target data; vocab size trades embedding cost against token compression and KV cache.
AdamW is still the default; Muon can help but needs careful infra (all-to-all, padding, scaling quirks).
Scaling laws guide, but many frontier models overtrain; inference cost and sparsity tradeoffs often drive final choices.
Data scheduling matters: multi-stage mixtures and late-stage high-quality injection shape final behavior.
Mid-training and post-training (SFT + preference/RL/distillation) often determine reasoning and tool-use behavior.
Training ops are frequent failure points: dataloader design, throughput, seeds in TP, and checkpointing.
Most training failures stem from common causes: high learning rates, problematic data batches, load imbalance in MoE models, or storage/infrastructure issues (see “the usual suspects” section for details).
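To make the KV cache side of the GQA/MQA/MHA comparison above concrete, here is a minimal back-of-the-envelope sketch. It assumes a standard cache that stores one key and one value vector per KV head, per layer, per token, in 16-bit precision. The model shape (32 layers, 32 query heads, head dimension 128) and the names `AttentionShape`, `kv_cache_bytes`, and `compare_variants` are hypothetical and not taken from any of the reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical model shape, chosen for illustration only."""
    num_layers: int
    num_query_heads: int
    num_kv_heads: int  # equals num_query_heads for MHA, 1 for MQA
    head_dim: int


def kv_cache_bytes(shape: AttentionShape, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for all layers and positions."""
    if shape.num_query_heads % shape.num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    # Leading 2: one tensor for keys, one for values.
    return (2 * shape.num_layers * shape.num_kv_heads * shape.head_dim
            * seq_len * batch_size * bytes_per_value)


def compare_variants(seq_len: int = 32_768) -> dict[str, float]:
    """KV cache size in GiB for each attention variant at the same model shape."""
    # Assumed mapping: "GQA-n" means n KV heads (groups) shared by 32 query heads.
    kv_heads_by_variant = {"MHA": 32, "GQA-8": 8, "GQA-4": 4, "GQA-2": 2, "MQA": 1}
    sizes = {}
    for name, kv_heads in kv_heads_by_variant.items():
        shape = AttentionShape(num_layers=32, num_query_heads=32,
                               num_kv_heads=kv_heads, head_dim=128)
        sizes[name] = kv_cache_bytes(shape, seq_len) / 2**30
    return sizes


if __name__ == "__main__":
    for name, gib in compare_variants().items():
        print(f"{name:>6}: {gib:6.2f} GiB")
```

Under these assumptions, a single 32k-token sequence needs 16 GiB of cache with MHA, 4 GiB with 8 KV groups, and 0.5 GiB with MQA. The sketch only covers memory; it says nothing about model quality, which is what the ablations compare.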
a minimal training playbook
Define the product goal and lock evals early across knowledge, math, code, long-context, and instruction following.
Pick a baseline architecture with known failure modes; default to dense + GQA + RoPE/RNoPE unless MoE is essential.
Choose a tokenizer matched to your target languages and domains; freeze vocab and special tokens early.
Build the data pipeline with deduplication, filtering, and contamination checks; measure data quality explicitly.
Run small ablations for attention, positional encoding, optimizer, and learning rate schedule; change one variable at a time.
Plan a multi-stage data mixture; delay the best data and reasoning-heavy data toward the end.
Add stability guardrails: logit softcapping (preferred, per Gemma) or z-loss/QK-norm, gradient clipping, precision policy, loss spike alerts.
Validate throughput on long runs and confirm dataloader behavior (packing, shuffling, random access).
Run the main training with interval evals and consistent seeds, especially for tensor parallelism.
Mid-train for domain gaps if SFT reveals them; extend context length gradually (4k → 32k → 64k → 128k).
Post-train with SFT, then choose...