Tokenization Is the Bottleneck You're Not Measuring


Tokenization Is the Bottleneck You’re Not Measuring | Ranvier

You’ve optimized your GPU serving stack. You’ve tuned vLLM’s batch size, configured PagedAttention, maybe even set up prefix-aware routing for KV cache locality. Your P99 looks good. Your throughput is climbing. And somewhere in your proxy layer, every single request is blocking for 5-13 milliseconds while a tokenizer turns text into integers.

You’re probably not measuring it. Most LLM proxies treat tokenization as instantaneous: call the function, get the tokens, move on. But on an event-loop architecture, 5-13ms isn’t a rounding error. It’s an eternity. Every millisecond your event loop spends inside a tokenizer FFI call is a millisecond where no other request is read, no response is forwarded, no health check is answered, no connection is accepted.

This post is about a bottleneck hiding in the gap between “fast enough” and “actually non-blocking.”

Why Tokenization Blocks

If you’re doing prefix-aware routing, request rewriting, cost estimation, or priority classification, your proxy needs to tokenize the input before forwarding it. That means calling a tokenizer, usually Hugging Face’s tokenizers library, the same BPE implementation used by most serving engines.

The problem is that tokenization is CPU-bound work executed through an FFI boundary. The Rust tokenizers crate does the actual BPE encoding. Your proxy calls it through a C binding. The call takes 5-13ms depending on input length. During that call, your thread is gone.

In a thread-per-request architecture (Go, Java, threaded Python), this is fine. One thread blocks; the others keep working. In an event-loop architecture (Node.js, Seastar, anything built on epoll/io_uring with cooperative scheduling), it’s a disaster. The event loop processes everything sequentially. While it’s inside the tokenizer, it processes nothing else.
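To see the stall in isolation, here is a minimal sketch that uses Python’s asyncio as a stand-in for the event loop (it is not the proxy’s actual stack). `fake_tokenize` is a hypothetical placeholder that busy-waits for 10ms to imitate a synchronous tokenizer FFI call, and the heartbeat task plays the role of everything else the loop is supposed to be doing.

```python
import asyncio
import time


def fake_tokenize(text: str, cost_s: float = 0.010) -> list[int]:
    """Hypothetical stand-in for a synchronous tokenizer call: burns CPU for cost_s."""
    deadline = time.perf_counter() + cost_s
    while time.perf_counter() < deadline:
        pass  # busy-wait; the event loop cannot run anything else meanwhile
    return [ord(ch) for ch in text]


async def heartbeat(gaps: list[float], interval_s: float = 0.001, beats: int = 20) -> None:
    """Wake up every interval_s and record how long each wake-up really took."""
    last = time.perf_counter()
    for _ in range(beats):
        await asyncio.sleep(interval_s)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now


async def handle_requests(count: int) -> None:
    """Tokenize synchronously on the loop, yielding only between requests."""
    for _ in range(count):
        fake_tokenize("system prompt goes here")  # blocks the whole loop
        await asyncio.sleep(0)                    # cooperative yield point


async def main() -> float:
    gaps: list[float] = []
    await asyncio.gather(heartbeat(gaps), handle_requests(5))
    return max(gaps)


worst_gap_s = asyncio.run(main())
print(f"heartbeat asked for 1.0ms, worst observed gap: {worst_gap_s * 1000:.1f}ms")
```

The heartbeat asks to run every millisecond, but its worst gap can be no shorter than one tokenizer call, because the loop only regains control when the handler yields.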

Let’s make this concrete. You have an event loop receiving 1,000 requests per second. Each tokenization call takes 10ms. If you tokenize synchronously on the event loop, that core can complete at most 100 tokenizations per second. The other 900 requests arriving each second queue up, and each one’s latency inflates by 10ms for every request ahead of it in line.
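The arithmetic is simple enough to write down. The sketch below only restates the numbers from this example; the function names are made up for illustration, and it assumes every request costs the same fixed tokenization time.

```python
def max_tokenizations_per_second(tokenize_ms: float) -> float:
    """A single core doing nothing but tokenizing finishes 1000 / tokenize_ms calls per second."""
    return 1000.0 / tokenize_ms


def backlog_growth_per_second(arrival_rps: float, tokenize_ms: float) -> float:
    """Requests per second that arrive but cannot be served (0 if the loop keeps up)."""
    return max(0.0, arrival_rps - max_tokenizations_per_second(tokenize_ms))


def queue_wait_ms(requests_ahead: int, tokenize_ms: float) -> float:
    """Extra latency for a request with requests_ahead tokenizations queued before it."""
    return requests_ahead * tokenize_ms


capacity = max_tokenizations_per_second(10.0)       # 100.0 tokenizations/s
backlog = backlog_growth_per_second(1000.0, 10.0)   # 900.0 requests/s left waiting
wait = queue_wait_ms(50, 10.0)                      # 500.0 ms behind 50 queued requests
print(capacity, backlog, wait)
```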

At 20 concurrent users, we measured tokenization accounting for 10.6ms of total routing overhead, while the actual routing decision (a radix tree lookup) took 0.01ms. The tokenizer was roughly 1,000x slower than the thing it was feeding.

The Caching Layer That Actually Works

The first optimization is the most obvious: don’t tokenize the same text twice.

LLM traffic has a property that makes caching extraordinarily effective: repetition. Every request to a RAG application includes the same system prompt. Every multi-turn conversation starts with the same instruction prefix. Every API call from the same client sends the same role tags (\n, \n).

We added an LRU cache in front of the tokenizer. Hit rates depend entirely on content type, and the spread is dramatic. Here’s what we expect:

| Content type    | Cache hit rate |
|-----------------|----------------|
| Role tags (\n)  | 95%+           |
| System messages | 80-90%         |
| User queries    | 10-30%         |

That 80-90% hit rate on system messages means that for most requests, the expensive part (tokenizing the 2,000-4,000 token system prompt) is a hash table lookup returning in microseconds instead of a 10ms FFI call.

The implementation is straightforward: a hash map keyed on the input text, with an LRU eviction list capped at a configured maximum (we use 1,000 entries). On hit, move the entry to the front. On miss, tokenize, insert at the front, evict from the tail if full. No locks needed: in a sharded architecture, each core has its own cache.
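As a rough illustration of that hit/miss logic, here is a minimal Python sketch, not the production implementation. It leans on `collections.OrderedDict` to play both the hash map and the eviction list. The `TokenCache` name, its fields, and the `fake_tokenize` stand-in are all hypothetical; a real deployment would pass in the actual tokenizer call and keep one cache instance per core.

```python
from collections import OrderedDict
from typing import Callable


class TokenCache:
    """LRU cache keyed on the full input text; one instance per shard, so no locks."""

    def __init__(self, tokenize: Callable[[str], list[int]], max_entries: int = 1000):
        self._tokenize = tokenize
        self._max_entries = max_entries
        self._entries: OrderedDict[str, list[int]] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def encode(self, text: str) -> list[int]:
        cached = self._entries.get(text)
        if cached is not None:
            # Hit: move the entry to the front of the LRU order.
            self._entries.move_to_end(text, last=False)
            self.hits += 1
            return cached
        # Miss: pay for the tokenizer call, then insert at the front.
        tokens = self._tokenize(text)
        self.misses += 1
        self._entries[text] = tokens
        self._entries.move_to_end(text, last=False)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=True)  # evict the least recently used entry
        return tokens


def fake_tokenize(text: str) -> list[int]:
    """Hypothetical stand-in for the real (slow) tokenizer call."""
    return [ord(ch) for ch in text]


cache = TokenCache(fake_tokenize, max_entries=2)
for text in ["system prompt", "query A", "system prompt", "query B", "system prompt"]:
    cache.encode(text)
print(cache.hits, cache.misses)
```

With a capacity of two, the repeated system prompt stays resident while "query A" is evicted as soon as "query B" arrives.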

Two details matter:

Cap cached text, but don’t cap it too low. Our first instinct was to cap at 8KB: surely a long RAG document won’t repeat verbatim often enough to earn its memory. That was a mistake. The long, stable system prefixes we most wanted to cache routinely exceed 8KB, and refusing them reintroduced a 5-7ms P50 regression at 20+ concurrent users, exactly the cost we were trying to delete. We raised the cap to 64KB. Worst case is 1,000 entries × 64KB ≈ 64MB per shard, which is cheap insurance. The cache key is the full input text (the tokens themselves are a small vector of int32s), so the cap is really about bounding key memory. And the texts most worth caching are precisely the long ones. (A separate, tighter 32KB limit applies to the cross-core dispatch path, because that one copies the string across a core boundary, where large copies aren’t free.)

Don’t cache unique content. User queries have a 10-30% hit rate; most are unique. The cache handles this naturally through LRU eviction: unique queries enter the cache, never get hit, and fall off the tail. The system prompt stays hot at the front.
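The size cap and its worst-case cost can be sketched in a few lines. This is an illustration under assumptions: the constant and function names are hypothetical, and the cap is assumed to be measured in UTF-8 bytes of the input text. Texts over the cap would still be tokenized, just not stored.

```python
MAX_ENTRIES = 1_000
MAX_CACHED_TEXT_BYTES = 64 * 1024   # the raised cap
OLD_CAP_BYTES = 8 * 1024            # the first, too-low cap


def is_cacheable(text: str, max_bytes: int = MAX_CACHED_TEXT_BYTES) -> bool:
    """Only store texts whose key fits under the cap (assumed: UTF-8 byte length)."""
    return len(text.encode("utf-8")) <= max_bytes


def worst_case_key_bytes(max_entries: int = MAX_ENTRIES,
                         max_bytes: int = MAX_CACHED_TEXT_BYTES) -> int:
    """Upper bound on key memory per shard: every slot holds a maximum-size text."""
    return max_entries * max_bytes


long_system_prompt = "x" * (12 * 1024)  # a 12KB prefix, larger than the old cap
print(is_cacheable(long_system_prompt, OLD_CAP_BYTES))  # rejected under the 8KB cap
print(is_cacheable(long_system_prompt))                 # accepted under the 64KB cap
print(worst_case_key_bytes() / 2**20)                   # per-shard bound in MiB
```

The bound works out to 65,536,000 bytes (62.5 MiB), which is the "≈ 64MB per shard" figure above.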

When Caching Isn’t Enough

A 90% cache hit rate sounds great until you think about what happens on the other 10%. At 1,000 requests per second, 10% misses means 100 tokenizer calls per second, each blocking...

