Finding Optimal Tokenizers
In this post, I will present an algorithm that was able to compute an optimal tokenizer in some<br>settings. This result is cool because optimal tokenization is theoretically intractable, but seems to be solvable in<br>practice. My finding is very similar to various results on the Traveling Salesman Problem (TSP), where<br>even difficult instances can be solved optimally using cutting-plane techniques.
I'll highlight that, while this result is cool, there are a few reasons that it isn't necessarily<br>useful. First, the existing state of the art was already somewhat close to optimal (often within<br>1%). Second, even if a tokenizer is optimal on the training data, it may not generalize as well<br>as other tokenizers when evaluated on held out test data. Finally, inefficient tokenizers are basically<br>fine: you can pay for the cost of a less efficient tokenizer by slightly increasing your vocabulary<br>size.
Despite the above caveats, I had a really fun time working on this project, and I hope others will be<br>interested in pushing the frontier of this problem as well.
Background: Tokenizers
Frontier LLMs are typically trained on sequences of integers known as tokens. Each token refers to<br>some sequence of bytes, and these byte sequences often correspond to common words. For example, in the<br>GPT-5 tokenizer, the token 290 corresponds to the bytes “ the”, and 6602 corresponds to<br>“ token”, so the text “ the token” can be encoded as the sequence [290, 6602].
The mapping from tokens to bytes, known as the “vocabulary”, is fixed before the LLM is even<br>trained. Typically, we try to find a vocabulary that compresses a slice of training data. In particular,<br>we would like to pick a vocabulary of a fixed size that minimizes the number of tokens required to encode<br>the data. The dominant technique for finding such a vocabulary is byte-pair encoding<br>(BPE), a decades-old greedy compression algorithm.
Tokenization as integer linear programming
In a recent paper, Tempus et al. connected<br>tokenization to integer linear programming. The basic idea of their approach is to represent the entire<br>dataset's tokenization as a set of integer variables.
In this formulation, there's a “color” variable for each possible vocabulary entry. In<br>particular, we create one color variable for every unique substring of the dataset. A color variable is 1<br>if the corresponding byte sequence is in the vocabulary, or 0 otherwise. We add a single constraint to<br>force the sum of color variables to equal the target vocabulary size.
A color corresponds to some sequence of bytes, but a given sequence of bytes may occur many times<br>throughout the dataset. For each occurrence of a color, there's a separate “edge” variable.<br>The edges work together to encode an actual tokenization of the dataset. If an edge is 1, then the<br>edge's corresponding token is used in this particular place. The objective of our linear program is to<br>minimize the sum of all the edge variables, i.e. the number of tokens used to encode our dataset.
For example, in the below picture, we tokenize the word “Queue” as the tokens [“Q”,<br>“ue”, “ue”]. We could alternatively have tokenized it as [“Qu”,<br>“e”, “ue”], but that is not the tokenization indicated by the current ILP<br>solution, since the edge variables for the initial “Qu” and “e” edges are 0.
We constrain the LP in two ways. First, we can't use a token if it's not in the vocabulary. To this end,<br>we constrain each edge variable to be less than or equal to its corresponding color variable. Second, we<br>want to make sure that we tokenize the dataset in exactly one valid way. To this end, we add flow<br>constraints: for each byte position in the dataset, we want the sum of edges flowing into this position<br>to be equal to the sum of edges flowing out of this position, with the exception of the boundaries. For<br>the first and last positions, we want the flow out or flow in to be 1. In an integer solution, you can<br>see flow constraints as asserting the following: any point that an edge goes into must have an edge<br>going out of it, except the first and last positions.
If all the variables were integral and constrained to [0, 1], then this linear program is enough to encode<br>the optimal tokenization. However, since we cannot solve arbitrary integer linear programs efficiently,<br>Tempus et al. relax the ILP to a continuous LP and solve this with a well-optimized solver.
The solution to the continuous LP is not generally integral. We can see an example of this below, where we<br>have two superimposed tokenizations of the word “Queue”: either we encode it as<br>[“Q”, “ue”, “ue”], or as [“Qu”, “e”,<br>“ue”]. The problem with this solution is that our color variables sum to 2.5, but we've<br>actually used...