Finding Optimal Tokenizers

mcyc2 pts0 comments

Finding Optimal Tokenizers

In this post, I will present an algorithm that was able to compute an optimal tokenizer in some<br>settings. This result is cool because optimal tokenization is theoretically intractable, but seems to be solvable in<br>practice. My finding is very similar to various results on the Traveling Salesman Problem (TSP), where<br>even difficult instances can be solved optimally using cutting-plane techniques.

I'll highlight that, while this result is cool, there are a few reasons that it isn't necessarily<br>useful. First, the existing state of the art was already somewhat close to optimal (often within<br>1%). Second, even if a tokenizer is optimal on the training data, it may not generalize as well<br>as other tokenizers when evaluated on held out test data. Finally, inefficient tokenizers are basically<br>fine: you can pay for the cost of a less efficient tokenizer by slightly increasing your vocabulary<br>size.

Despite the above caveats, I had a really fun time working on this project, and I hope others will be<br>interested in pushing the frontier of this problem as well.

Background: Tokenizers

Frontier LLMs are typically trained on sequences of integers known as tokens. Each token refers to<br>some sequence of bytes, and these byte sequences often correspond to common words. For example, in the<br>GPT-5 tokenizer, the token 290 corresponds to the bytes &ldquo; the&rdquo;, and 6602 corresponds to<br>&ldquo; token&rdquo;, so the text &ldquo; the token&rdquo; can be encoded as the sequence [290, 6602].

The mapping from tokens to bytes, known as the &ldquo;vocabulary&rdquo;, is fixed before the LLM is even<br>trained. Typically, we try to find a vocabulary that compresses a slice of training data. In particular,<br>we would like to pick a vocabulary of a fixed size that minimizes the number of tokens required to encode<br>the data. The dominant technique for finding such a vocabulary is byte-pair encoding<br>(BPE), a decades-old greedy compression algorithm.

Tokenization as integer linear programming

In a recent paper, Tempus et al. connected<br>tokenization to integer linear programming. The basic idea of their approach is to represent the entire<br>dataset's tokenization as a set of integer variables.

In this formulation, there's a &ldquo;color&rdquo; variable for each possible vocabulary entry. In<br>particular, we create one color variable for every unique substring of the dataset. A color variable is 1<br>if the corresponding byte sequence is in the vocabulary, or 0 otherwise. We add a single constraint to<br>force the sum of color variables to equal the target vocabulary size.

A color corresponds to some sequence of bytes, but a given sequence of bytes may occur many times<br>throughout the dataset. For each occurrence of a color, there's a separate &ldquo;edge&rdquo; variable.<br>The edges work together to encode an actual tokenization of the dataset. If an edge is 1, then the<br>edge's corresponding token is used in this particular place. The objective of our linear program is to<br>minimize the sum of all the edge variables, i.e. the number of tokens used to encode our dataset.

For example, in the below picture, we tokenize the word &ldquo;Queue&rdquo; as the tokens [&ldquo;Q&rdquo;,<br>&ldquo;ue&rdquo;, &ldquo;ue&rdquo;]. We could alternatively have tokenized it as [&ldquo;Qu&rdquo;,<br>&ldquo;e&rdquo;, &ldquo;ue&rdquo;], but that is not the tokenization indicated by the current ILP<br>solution, since the edge variables for the initial &ldquo;Qu&rdquo; and &ldquo;e&rdquo; edges are 0.

We constrain the LP in two ways. First, we can't use a token if it's not in the vocabulary. To this end,<br>we constrain each edge variable to be less than or equal to its corresponding color variable. Second, we<br>want to make sure that we tokenize the dataset in exactly one valid way. To this end, we add flow<br>constraints: for each byte position in the dataset, we want the sum of edges flowing into this position<br>to be equal to the sum of edges flowing out of this position, with the exception of the boundaries. For<br>the first and last positions, we want the flow out or flow in to be 1. In an integer solution, you can<br>see flow constraints as asserting the following: any point that an edge goes into must have an edge<br>going out of it, except the first and last positions.

If all the variables were integral and constrained to [0, 1], then this linear program is enough to encode<br>the optimal tokenization. However, since we cannot solve arbitrary integer linear programs efficiently,<br>Tempus et al. relax the ILP to a continuous LP and solve this with a well-optimized solver.

The solution to the continuous LP is not generally integral. We can see an example of this below, where we<br>have two superimposed tokenizations of the word &ldquo;Queue&rdquo;: either we encode it as<br>[&ldquo;Q&rdquo;, &ldquo;ue&rdquo;, &ldquo;ue&rdquo;], or as [&ldquo;Qu&rdquo;, &ldquo;e&rdquo;,<br>&ldquo;ue&rdquo;]. The problem with this solution is that our color variables sum to 2.5, but we've<br>actually used...

ldquo rdquo vocabulary color edge optimal

Related Articles