Scaling Laws, Honestly - by Diogo - Complete Skeptic
Scaling Laws, Honestly
TL;DR: The original scaling laws were wrong due to a bug
Diogo
Jul 04, 2026
Background
Scaling laws were one of OpenAI’s most important results, both technically and philosophically (so much so that being scaling-pilled became a thing). They let us predict results for ever-larger language model runs, and they help us debug models as we use exponentially more resources. All of this led to the era of LLMs we’re in today, but the craziest part is that the original Kaplan et al. scaling laws were wrong.
Recently, Lilian Weng posted another awesome (and highly recommended) blog post on scaling laws. I was extra excited about the section “Reconciling Kaplan and Chinchilla”, the former being OpenAI’s original scaling laws and the latter being DeepMind’s follow-up with completely different scaling laws.
Figure 1 from Chinchilla. The black dotted line shows the original scaling laws, and the cyan star shows that significantly smaller models should be used.
Lilian’s article covers the mainstream explanation of the difference between them, which comes from follow-up research (namely that it’s about how they counted the total number of parameters). That follow-up research is unfortunately inaccurate, though not due to any fault of the authors.
The real difference between the original scaling laws and Chinchilla’s is that the former had a bug!
The bug: 3 ingredients
Non-researcher summary
The 2 scaling laws (original and Chinchilla) give different “scaling recipes” for how to efficiently train large language models
The former was incorrect because its authors:
Did not train on enough data (Step 1)
Gradually decreased the impact of data to make it look like more data wasn’t needed (Step 2)
Claimed that the gradual decrease was unimportant (Step 3)
Thus, for a few years, people trained models that were much too large on too little data
Clue: Data scales with size.
It’s easier to identify this when working backwards: both scaling laws predict that data should scale with model size. The handwavy explanation is that bigger models have more capacity to soak up that data. Thus the amount of data is a very important parameter.
Step 1: Use a fixed amount of data.
The Chinchilla paper points out the root issue, stating that the authors of the original Kaplan et al. paper “use a fixed number of training tokens and learning rate schedule for all models”. When every model is trained on the same fixed amount of data, a tiny model trained on ~130B tokens gets far more training relative to its size than a giant model trained on the same ~130B tokens.
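To make the imbalance concrete, here is a minimal sketch that divides the same ~130B-token budget by a few model sizes. The parameter counts are hypothetical round numbers picked for illustration, not the sizes actually used in either paper.

```python
# Tokens seen per parameter when every model gets the same token budget.
# The model sizes below are illustrative assumptions, not values from Kaplan et al.
FIXED_TOKENS = 130e9  # the ~130B-token budget discussed above


def tokens_per_parameter(n_params: float, n_tokens: float = FIXED_TOKENS) -> float:
    """Return how many training tokens each parameter 'sees' on average."""
    return n_tokens / n_params


model_sizes = {"1M": 1e6, "100M": 1e8, "1B": 1e9, "10B": 1e10}
ratios = {name: tokens_per_parameter(n) for name, n in model_sizes.items()}

for name, ratio in ratios.items():
    print(f"{name:>5} params: {ratio:>10,.0f} tokens per parameter")
```

With a fixed budget, the ratio falls by a factor of ten for every tenfold increase in model size, so the largest models are always the least trained relative to their capacity.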
Relevant quote from Chinchilla’s related work section.
Figure 2 from Kaplan et al. showing all model sizes trained to the same ~130B tokens.
Figure 2 from Chinchilla with a pink arrow added to show roughly where the training curve would have been cut off if the models were only trained to 130B tokens. It would have been obvious that training ended before reaching the scaling laws’ Pareto frontier.
Keeping the amount of data fixed would be sufficient to get incorrect scaling laws, but if that were the only mistake, the results would look obviously incorrect. Except if you also…
Step 2: Use a cosine learning rate schedule that decays to zero.
This learning rate schedule caused learning to slow as training approached the target number of tokens. Performance naturally plateaued, making it look as if training had saturated. We now know that large models would have kept improving with more data and a different schedule, but the decay to zero artificially constrained the results and made it appear that more data would not help.
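A minimal sketch of the effect, assuming a generic linear warmup followed by cosine decay. The function name `cosine_lr`, the peak learning rate, the warmup length, and the step counts are all illustrative assumptions, not the hyperparameters from Kaplan et al.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
              warmup_steps: int = 100, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past the horizon.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Same training step, two different planned horizons (hypothetical numbers).
short_horizon = cosine_lr(step=1000, total_steps=1000)  # schedule ends here
long_horizon = cosine_lr(step=1000, total_steps=4000)   # schedule has 3000 steps left

print(f"lr at step 1000, horizon 1000: {short_horizon:.2e}")
print(f"lr at step 1000, horizon 4000: {long_horizon:.2e}")
```

At the same step, the run whose schedule ends there has a learning rate of zero and cannot improve further, while the run planned for a longer horizon is still learning at close to its peak rate. A loss curve that flattens at the end of the short schedule therefore says more about the schedule than about the model.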
Visualization of a cosine learning rate decay with a warmup (source). You can see a smooth decay to lr=0, where learning stops entirely.
The fixed amount of data and the learning rate schedule lead to scaling laws that are both incorrect and hard to debug, and it becomes even harder to debug if you…
Step 3: Claim that results were “largely independent of learning rate schedule”.
For a fixed maximum number of tokens, their conclusion is entirely accurate, but it doesn’t apply to the true infinite-data limit that scaling laws aim to model.
Aside: I too worked on LLM optimization at OpenAI at the time and missed the bug as well. 😅 The learning rate schedule seemed so obviously an important hyperparameter that it looked intentionally set.
Section 2.2 of Kaplan et al., describing how the models were trained. The green box shows the calculation for a number of tokens that stays constant across model sizes. The red box shows the learning rate schedule.
Result: Models were undertrained and too large.
You can see how the difference in learning rate schedule shows up: Chinchilla ended up with a model less than half the size of GPT-3, trained on over 4x more tokens. They could not have achieved this result if the learning...