What Are Tokens in LLMs?


Tokens and Tokenization | Simon's Journal
How LLMs split text into tokens, the BPE algorithm, and why 'strawberry' has 3 r's the model can't see.
June 7, 2026 · Simon Bjørnøy

Part of Learning LLMs · 2 of 2

Ask GPT-4 how many r’s are in “strawberry” and it will confidently say two. The right answer is three. This isn’t because the model can’t count. It’s because it never sees the letters at all.

Every Large Language Model (LLM) starts with the same operation: text comes in, gets chopped into chunks called tokens, and those chunks become integer IDs that index into an embedding matrix. The chunks aren’t characters and they aren’t words. They’re something more specific, and the specificity matters more than most people realize.

What a “token” really is

Most people first meet the word “token” through prices and limits: “1,500 tokens used”, “the context window is 128K tokens”. Those numbers are real, but they hide what a token actually is.

A token is the smallest unit of input a specific model can perceive. Each model has its own fixed list of tokens, called its vocabulary, decided once at training time. GPT-4’s vocabulary isn’t Claude’s. Claude’s isn’t Llama’s.

When you send text to a model, the text gets chopped into pieces from that model’s vocabulary, and each piece is swapped for an integer ID. Only those IDs ever reach the model. The model never sees text. It sees a sequence of integer indices into its own private alphabet.

So tokens aren’t “roughly like words” or “kind of like characters”. They’re the atoms of perception for one specific model, and they’re the only language that model speaks.
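
To make the text-to-IDs step concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the four-entry vocabulary, the ID numbers, the pre-split pieces, and the tiny random embedding matrix. None of it comes from a real model.

```python
import random

# Hypothetical vocabulary: piece -> integer ID. Real vocabularies hold
# tens of thousands of entries; these four are invented for the example.
vocab = {"str": 0, "aw": 1, "berry": 2, "!": 3}

# Assume the tokenizer has already split the text into vocabulary pieces.
pieces = ["str", "aw", "berry"]
token_ids = [vocab[p] for p in pieces]

# Toy embedding matrix: one row of 4 floats per vocabulary entry.
random.seed(0)
embedding_dim = 4
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in vocab
]

# The model's input is the rows selected by the IDs, nothing else.
model_input = [embedding_matrix[i] for i in token_ids]

print(token_ids)                   # [0, 1, 2]
print("".join(pieces).count("r"))  # 3, but only because we kept the string
```

Counting the r's is trivial on the string and impossible on `[0, 1, 2]` alone: the IDs carry no spelling information unless the model has learned it some other way.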
Two models fed the same English sentence will produce two different integer sequences, often of different lengths:

"I love strawberry milkshakes!"
GPT-4 (9 tokens): I, ·love, ·str, aw, berry, ·milk, sh, akes, !
Llama 3 (7 tokens): I, ·love, ·straw, berry, ·milk, shakes, !

Each item is one token. · marks a leading space (so ·love is the token " love" with a leading space, distinct from "love"). Splits are approximate; the interactive playground at the end of the post shows exact tokenization.

The same sentence is nine tokens to GPT-4 and seven tokens to Llama 3. Not because Llama is smarter or the sentence changed, but because the two models have different vocabularies. To GPT-4, the token ·straw doesn’t exist as a single chunk, so “strawberry” splits across three pieces. Llama 3’s vocabulary happens to include ·straw, so it gets through in two.

Here’s GPT-4’s actual tokenizer running in your browser. Type anything: your name, a strange word, a sentence in another language. Each chip below is one token.

[Interactive widget: GPT-4 tokenizer (cl100k_base)]
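
The effect of vocabulary on token count can be mimicked with a toy tokenizer. This is a sketch under loud assumptions: `vocab_a` and `vocab_b` are hypothetical three- and two-entry vocabularies chosen to reproduce the strawberry split described above, and `greedy_tokenize` uses greedy longest-match, which is a simplification. Real tokenizers such as GPT-4's apply learned merges in order rather than searching for the longest match.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that
    matches at the current position (a simplification of real tokenizers)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabularies, invented for this example.
vocab_a = {" str", "aw", "berry"}   # no " straw" entry
vocab_b = {" straw", "berry"}       # includes " straw"

text = " strawberry"  # leading space, as in the middle of a sentence
tokens_a = greedy_tokenize(text, vocab_a)
tokens_b = greedy_tokenize(text, vocab_b)

print(tokens_a)  # [' str', 'aw', 'berry']
print(tokens_b)  # [' straw', 'berry']
```

Same input, same procedure, different vocabularies: three tokens in one case and two in the other.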

How does a model end up with one specific vocabulary instead of another? The dominant algorithm is Byte Pair Encoding, or BPE.

BPE, the algorithm

BPE is an algorithm for deciding which subword chunks deserve to be tokens, given a corpus and a target vocabulary size. (A corpus is the dataset of text used to train the tokenizer and the model: typically a giant mix of web pages, books, code, and other text. For modern models it’s measured in trillions of tokens.) It starts small and grows the vocabulary one merge at a time, always merging the most frequent adjacent pair in the corpus.

The whole algorithm fits on a sticky note.

The setup. You have:
- A corpus to tokenize.
- A target vocabulary size $V$ (a number you choose; typical values are 30,000 to 100,000).

You want to end up with a list of $V$ tokens such that common substrings (the, ing, to) get their own token, so common text compresses into short sequences. Rare substrings decompose into smaller pieces, down to single characters in the worst case, so nothing is ever out-of-vocabulary.

The algorithm.
1. Initialize the vocabulary as every distinct character in the corpus.
2. Scan the corpus and count every adjacent pair of tokens.
3. Take the most frequent pair, merge it into a new token, and add it to the vocabulary.
4. Repeat steps 2 and 3 until the vocabulary has $V$ entries.

That’s it. No clever scoring, no neural network, no second pass. (A neural network is a computational model made of layers of trainable mathematical functions whose parameters are tuned to fit data. Modern LLMs are massive neural networks. BPE, by contrast, is plain bookkeeping with no learned parameters.) The “merge” in step 3 doesn’t do anything sophisticated. It just declares: from now on, whenever you see t followed by h in this corpus, treat them as one symbol called th.

Two details matter:
- The originals don’t disappear: when t and h get merged into th, all three are now in the vocabulary. If a word later happens to use t followed by some other character, the tokenizer can still represent it. The vocabulary grows monotonically.
- Pairs get...
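
The four steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the trainer any real model uses: `train_bpe` and the sample corpus are hypothetical, the corpus is treated as one flat character sequence (spaces included), and ties between equally frequent pairs go to the pair seen first. Production tokenizers typically work on bytes and pre-split the text before counting.

```python
from collections import Counter


def train_bpe(corpus, vocab_size):
    """Toy BPE trainer. Returns (vocabulary, merges) for a small string."""
    # Step 1: start with every distinct character as a token.
    seq = list(corpus)
    vocab = sorted(set(seq))
    merges = []

    while len(vocab) < vocab_size:
        # Step 2: count every adjacent pair of tokens.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break  # the corpus collapsed to one token; nothing left to merge

        # Step 3: pick the most frequent pair (first seen wins ties).
        a, b = max(pairs, key=pairs.get)
        merges.append((a, b))
        if a + b not in vocab:
            vocab.append(a + b)  # originals a and b stay in the vocabulary

        # Rewrite the corpus so every "a b" becomes the single token "ab".
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        # Step 4: loop back to step 2 until the vocabulary is full.

    return vocab, merges


# Made-up corpus: 7 distinct characters, so a target of 9 allows two merges.
vocab, merges = train_bpe("the then they there", vocab_size=9)
print(merges)  # [('t', 'h'), ('th', 'e')]
print(vocab)   # [' ', 'e', 'h', 'n', 'r', 't', 'y', 'th', 'the']
```

Note that t, h, and th all remain in the final vocabulary, which is the "originals don't disappear" point in action.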
