Compute Optimal Tokenization
Scaling Laws for Data Compression in Language Models
Tomasz Limisiewicz1,<br>Artidoro Pagnoni1,<br>Srini Iyer1,<br>Mike Lewis1,<br>Sachin Mehta1,<br>Alisa Liu2,<br>Margaret Li2,<br>Gargi Ghosh1,<br>Luke Zettlemoyer1
1FAIR at Meta,<br>2University of Washington
TL;DR
We study the impact of data compression on scaling laws. We find that:
[F1] In compute-optimal scaling, bytes (not tokens) of data increase proportionally to parameter count.
[F2] At each training budget, there is an optimal compression rate, and its value decreases at larger scales.
[F3] The optimal compression rate varies across languages and differs from the compression rate of popular BPE tokenizers.
Tokenization determines how raw text is compressed into discrete tokens for language model training.<br>Most scaling law research fixes the tokenizer and varies only model size and data amount, but what happens when we can control the tokenizer's compression rate (bytes/token)?
We train language models at fixed compute budgets.<br>We sweep over compression rate and model size, which together determine the amount of training data and the corresponding bytes per parameter ratio under a fixed budget.<br>We are most interested in how loss depends on the compression rate and the bytes per parameter ratio.
Now, let's dive in...
The compression rate T (x-axis) and the model size N (y-axis) together determine the amount of training data B (color) and the bytes per parameter ratio (values in squares).
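To make the sweep concrete, here is a minimal sketch of how one (T, N) cell maps to training bytes B and a bytes per parameter ratio. It assumes the common approximation C ≈ 6 · N · D for training FLOPs, where D is the number of training tokens; the FLOP accounting used in the paper may differ. The function name and the example values are hypothetical.

```python
def sweep_point(compute_flops, n_params, bytes_per_token):
    """Map one (compression rate, model size) cell to its data budget.

    Assumption (not from the source): training FLOPs C ~= 6 * N * D,
    where N is the parameter count and D the number of training tokens.
    """
    tokens = compute_flops / (6.0 * n_params)   # D, tokens affordable at this budget
    train_bytes = tokens * bytes_per_token      # B = T * D
    return {
        "tokens": tokens,
        "bytes": train_bytes,
        "bytes_per_param": train_bytes / n_params,
    }


# Illustrative cell: 1e20 FLOPs, a 1B-parameter model, 4 bytes per token.
point = sweep_point(compute_flops=1e20, n_params=1e9, bytes_per_token=4.0)
print(point)
```

Under this approximation, at a fixed budget and model size, doubling the compression rate doubles the training bytes while the token count stays the same.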
Plotting IsoFLOPs: Loss vs. bytes per parameter for each compression rate.
[F1] Optimal Data to Model Size
For a fixed compute budget (1e20 FLOPs), we plot loss against compression rate and bytes per parameter ratio.<br>This yields a 3D IsoFLOP surface:
The bowl-shaped IsoFLOP surface shows that, for every compression rate, the lowest loss is achieved at roughly the same bytes per parameter ratio (triangles).
Interactive version of the plot above. You can rotate it and hover over points to check loss, compression, and bytes per parameter values:
We observe that the optimal bytes per parameter ratio remains nearly constant across different compression rates.
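One simple way to estimate the lowest-loss point on a single IsoFLOP curve is to fit a parabola through the best grid point and its two neighbors in log(bytes per parameter) and take the vertex. The sketch below illustrates that idea only; it is not necessarily the fitting procedure used in the paper, the function name is hypothetical, and the loss values are synthetic.

```python
import math


def isoflop_minimum(bytes_per_param, losses):
    """Estimate the loss-minimizing bytes per parameter ratio.

    Fits a parabola in log(bytes per parameter) through the lowest-loss
    grid point and its two neighbors, then returns the vertex.
    Needs at least three points, sorted by bytes_per_param.
    """
    xs = [math.log(b) for b in bytes_per_param]
    i = min(range(len(losses)), key=losses.__getitem__)
    # Keep one neighbor on each side; at a grid edge this extrapolates.
    i = max(1, min(i, len(xs) - 2))
    x1, x2, x3 = xs[i - 1], xs[i], xs[i + 1]
    y1, y2, y3 = losses[i - 1], losses[i], losses[i + 1]
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    return math.exp(x2 - 0.5 * num / den)


# Synthetic curve (NOT data from the paper): an exact parabola in log space
# with its minimum placed at 100 bytes per parameter.
grid = [20.0, 40.0, 80.0, 160.0, 320.0]
synthetic_losses = [2.5 + 0.05 * (math.log(b) - math.log(100.0)) ** 2 for b in grid]
estimate = isoflop_minimum(grid, synthetic_losses)
print(round(estimate, 3))
```

Repeating this for each compression rate gives one optimum per curve, which can then be compared across compression rates.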
Finding 1
The optimal ratio between bytes of data and model parameters is approximately constant across varying compute budgets and compression rates.<br>Therefore, when generalizing a scaling recipe to a model with a different tokenizer, we advise matching the ratio of training bytes (not tokens) to model parameters.
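As a rough illustration of this advice, the sketch below converts a token-based recipe into a byte-based one and then back into a token budget for a different tokenizer. The function name and all numbers are hypothetical; in particular, the 20 tokens per parameter and the bytes per token values are placeholders, not values reported here.

```python
def tokens_for_new_tokenizer(ref_tokens_per_param, ref_bytes_per_token,
                             new_bytes_per_token, n_params):
    """Carry a scaling recipe over to a tokenizer with a different compression rate.

    The bytes per parameter ratio is held fixed, and the token budget is
    rescaled by the new tokenizer's bytes per token.
    """
    bytes_per_param = ref_tokens_per_param * ref_bytes_per_token
    train_bytes = bytes_per_param * n_params
    return train_bytes / new_bytes_per_token


# Hypothetical recipe: 20 tokens per parameter at 4 bytes per token,
# i.e. 80 bytes per parameter. New tokenizer: 2 bytes per token.
new_tokens = tokens_for_new_tokenizer(
    ref_tokens_per_param=20.0,
    ref_bytes_per_token=4.0,
    new_bytes_per_token=2.0,
    n_params=1e9,
)
print(new_tokens)  # 4e10 tokens, i.e. 40 tokens per parameter
```

Matching tokens instead of bytes in this example would train the new model on half as many bytes per parameter as the reference recipe.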
With the data-to-parameter ratio pinned down, a natural follow-up: what compression rate should we target?
[F2] Optimal Compression Rate
For each compute budget (from 5e18 to 2e21 FLOPs), we gather the optimal points (triangles) from [F1].<br>This lets us see how loss changes with compression across compute budgets:
U-shaped loss profiles across increasing compute budgets. For best results, models should be trained at a compression rate close to the optimum.
Interactive version of the plot above. You can rotate it and hover over points to check loss, compression, and compute budget:
We fit a power law to model the relationship between compression rate, compute, and loss (for details, see the paper):
Power law estimating loss given compression rate and compute budget.
We see that the optimal compression rate decreases for higher compute budgets.
Finding 2
At each training compute budget, there is an optimal compression rate that minimizes loss.<br>The optimal compression rate decreases as the training budget increases.
These findings hold for English, but do they generalize across languages?
[F3] Beyond English
We run the same experiments on non-English data (training on each language separately) to see whether these findings still hold.<br>The IsoFLOP analysis (as in [F1]) lets us find the optimal bytes per parameter ratio and compression rate for each language:
French
Arabic
Russian
Hindi
Across languages, the IsoFLOP "bowls" are shifted along the compression-rate axis, meaning that the optimal compression rate depends on the language.<br>Below, we compare these optima against Parity∗ and the compression rates of popular BPE tokenizers:
The optimal compression is correlated with Parity.
The optimal compression rate differs from compression of popular BPE tokenizers.
We observe a relation between...