Scaling Laws, Carefully

Scaling Laws, Carefully | Lil'Log

Table of Contents

Early days: ML loss predictability

Scaling Laws in Data-Infinite Region

Kaplan et al.’s Scaling Laws

Chinchilla Scaling Laws

Method 1: Fix model sizes, vary the token budget

Method 2: IsoFLOP profiles

Method 3: Parametric fit

Reconciling Kaplan and Chinchilla

Why power law?

Scaling Laws in Data-Limited Region

Trickiness of Fitting Scaling Laws in Reality

Toy simulation

Citation

References

Scaling laws are one of the most critical empirical findings in deep learning. The observation is simple in form: the training loss $L$ decreases predictably as we scale up model size $N$, dataset size $D$, and compute $C$, following a power-law curve, which appears as a straight line on a log-log plot. We can view scaling laws as a framework for describing the relationship between compute, loss, model size and data; at its core, it is about how to allocate precious compute optimally between $N$ and $D$.

This predictability makes scaling laws highly valuable in practice. A common workflow is to fit scaling laws on a handful of small runs and then extrapolate to estimate the token and compute requirements for larger models.

Symbol Note

$N$ Model size, measured in parameter count.

$D$ Training dataset size, usually measured in token count.

$C$ Training compute in FLOPs. As a useful approximation, $C \approx 6ND$ (Kaplan et al. 2020), where $2ND$ accounts for the forward pass and $4ND$ for backpropagation.

$E$ Irreducible loss

$L, \hat{L}(.)$ Test loss / test loss prediction function; can also refer to training loss, since they are strongly correlated.

$\epsilon$ Generalization error.

Early days: ML loss predictability#

The predictability of generalization error with scale had already been investigated before scaling laws became a mainstream concept.

Amari et al. (1992) derived four types of learning curves using a Bayesian approach and the annealed approximation.

Deterministic learning algorithm, noiseless data, one unique solution: $\epsilon \sim c \cdot D^{-1}$, where $c$ is some constant.

Deterministic learning algorithm, noiseless data, multiple equivalent solutions: $\epsilon \sim c \cdot D^{-2}$; the learning is faster with each new data point, because the model only learns the optimal manifold of parameters, instead of finding the single solution point.

Deterministic learning algorithm, noisy data: $\epsilon \sim c \cdot D^{-1/2}$; noises in data make learning harder.

Stochastic learning algorithm, noisy data: $\epsilon \sim c \cdot D^{-1} + E$; here the irreducible loss $E$ is the residual error that a stochastic learner cannot reduce further, for example when the model runs out of capacity on large data. All four types of learning curves follow a power law:

$$ \epsilon \sim c \cdot D^\alpha + E $$

where $E$ can be 0 and $\alpha = -2, -1, -1/2$. Although their theoretical setup is based on a simplified binary classification task, it points in a useful direction for building empirical ML loss prediction models.

One of the earliest empirical studies by Hestness et al. (2017) explained the relationship between generalization error, model size and data. For a given training data size, they identified the best-fit model size via grid search and then plotted loss against training dataset size. Across four different domains in deep learning (neural machine translation, image classification, language modeling, and speech recognition), a recurring pattern was observed where:

Generalization error scales as a power law across a set of factors (e.g. data size).

Model improvements shift the error curve but do not seem to affect the power-law exponent.

Interestingly, architecture changes the offset ($E$) of the power-law fit but does not change the exponent ($\alpha$). The slope of the power law appears to be a property of the problem domain rather than the model architecture.

The number of model parameters $N$ needed to fit a dataset of size $D$ also scales as a power law.

Learning curves for (Left) Deep-Speech-2 (DS2) and attention speech model and for (Right) DS2 models of various sizes. The losses of small models plateau when training data becomes large. (Image source: Hestness et al. 2017)

A conceptual illustration breaks the learning curve into three stages. In the small-data region, when there are not enough learning signals, the model performs only slightly better than random guessing. In the middle (“power-law region”), we observe a power-law relationship between loss, data, and model size. The final irreducible-error region can be attributed to factors such as noise in the data.

Illustration of power-law learning curve phases. (Image source: Hestness et al. 2017)

Rosenfeld et al. (2020) pushed this further by trying to model error as a joint function of both model size $N$ and data size $D$, across a diverse set of architectures (ResNet, WRN, LSTM, Transformer) and optimizers (Adam, SGD variants). Empirically...

Scaling Laws, Carefully

Related Articles

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

Britain Became as Poor as Mississippi