Tokenization Is the Bottleneck You're Not Measuring

You’ve optimized your GPU serving stack. You’ve tuned vLLM’s batch size, configured PagedAttention, maybe even set up prefix-aware routing for KV cache locality. Your P99 looks good. Your throughput is climbing. And somewhere in your proxy layer, every single request is blocking for 5-13 milliseconds while a tokenizer turns text into integers.

You’re probably not measuring it. Most LLM proxies treat tokenization as instantaneous—call the function, get the tokens, move on. But on an event-loop architecture, 5-13ms isn’t a rounding error. It’s an eternity. Every millisecond your event loop spends inside a tokenizer FFI call is a millisecond where no other request is read, no response is forwarded, no health check is answered, no connection is accepted.

This post is about a bottleneck hiding in the gap between “fast enough” and “actually non-blocking.”

Why Tokenization Blocks

If you’re doing prefix-aware routing, request rewriting, cost estimation, or priority classification, your proxy needs to tokenize the input before forwarding it. That means calling a tokenizer, usually HuggingFace’s tokenizers library, the same BPE implementation used by most serving engines.

The problem is that tokenization is CPU-bound work executed through an FFI boundary. The Rust tokenizers crate does the actual BPE encoding. Your proxy calls it through a C binding. The call takes 5-13ms depending on input length. During that call, your thread is gone.

In a thread-per-request architecture (Go, Java, threaded Python), this is fine. One thread blocks; the others keep working. In an event-loop architecture—Node.js, Seastar, anything built on epoll/io_uring with cooperative scheduling—it’s a disaster. The event loop processes everything sequentially. While it’s inside the tokenizer, it processes nothing else.
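To see the stall rather than take it on faith, here is a minimal sketch using Python's asyncio as a stand-in for any cooperative event loop. The `fake_tokenize` function is hypothetical: it sleeps instead of doing real BPE work, but it blocks the calling thread in the same way a synchronous FFI call does. A heartbeat task that should tick every millisecond records how long each tick really took.

```python
import asyncio
import contextlib
import time
from typing import List


def fake_tokenize(text: str, cost_s: float) -> List[int]:
    """Hypothetical stand-in for a synchronous tokenizer FFI call."""
    time.sleep(cost_s)  # blocks the calling thread, as a synchronous FFI call would
    return [ord(ch) for ch in text]


async def heartbeat(gaps: List[float], interval_s: float = 0.001) -> None:
    """Record the real duration of each tick; a healthy loop stays near interval_s."""
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval_s)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def measure_max_stall(cost_s: float) -> float:
    """Return the longest heartbeat gap seen around one blocking tokenizer call."""
    gaps: List[float] = []
    task = asyncio.create_task(heartbeat(gaps))
    await asyncio.sleep(0.005)      # let the heartbeat tick a few times
    fake_tokenize("hello", cost_s)  # runs on the loop: nothing else gets scheduled
    await asyncio.sleep(0.005)      # let the delayed tick land
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return max(gaps)


if __name__ == "__main__":
    stall = asyncio.run(measure_max_stall(cost_s=0.010))
    print(f"longest heartbeat gap: {stall * 1000:.1f} ms")
```

The longest gap can never be shorter than the blocking call itself, because the heartbeat cannot run until the tokenizer returns. In a real proxy that heartbeat is every other connection on the core.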

Let’s make this concrete. You have an event loop handling 1,000 requests per second. Each tokenization call takes 10ms. If you tokenize synchronously on the event loop, you can process at most 100 tokenizations per second on that core. Your other 900 requests are queued, their latency inflating by 10ms for each request ahead of them in line.
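The same arithmetic as a few lines of Python, in case you want to plug in your own numbers. This is a deliberately naive model that assumes every request is tokenized on one core at a fixed cost; the function names are ours, not part of any library.

```python
def capacity_per_second(cost_ms: float) -> float:
    """Maximum synchronous tokenizations one core can complete per second."""
    return 1000.0 / cost_ms


def backlog_after(seconds: float, arrival_rps: float, cost_ms: float) -> float:
    """Requests still waiting after `seconds` if arrivals exceed capacity."""
    surplus_per_second = arrival_rps - capacity_per_second(cost_ms)
    return max(0.0, surplus_per_second * seconds)


def wait_ms(requests_ahead: int, cost_ms: float) -> float:
    """Queueing delay for a request with `requests_ahead` tokenizations before it."""
    return requests_ahead * cost_ms


if __name__ == "__main__":
    print(capacity_per_second(10.0))        # tokenizations per second at 10ms each
    print(backlog_after(1.0, 1000.0, 10.0))  # requests queued after one second
    print(wait_ms(900, 10.0))               # delay for the last request in that queue
```

With the numbers above, the last request in the first second's backlog waits 9 seconds, and the backlog keeps growing for as long as arrivals outpace the tokenizer.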

At 20 concurrent users, we measured tokenization accounting for 10.6ms of total routing overhead, while the actual routing decision (a radix tree lookup) took 0.01ms. The tokenizer was roughly 1,000x slower than the thing it was feeding.
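Getting a number like that requires timing the tokenizer call on its own, separately from the routing step. A minimal sketch of one way to do it: wrap whatever tokenize callable you already have and record wall-clock time per call. `TimedTokenizer` and its percentile helper are illustrative names, not part of any tokenizer library, and this is not the instrumentation behind the figures above.

```python
import time
from typing import Callable, List


class TimedTokenizer:
    """Wraps a tokenize callable and records the duration of every call."""

    def __init__(self, tokenize: Callable[[str], List[int]]) -> None:
        self._tokenize = tokenize
        self.samples_ms: List[float] = []

    def __call__(self, text: str) -> List[int]:
        start = time.perf_counter()
        try:
            return self._tokenize(text)
        finally:
            # Record the sample even if the tokenizer raises.
            self.samples_ms.append((time.perf_counter() - start) * 1000.0)

    def percentile(self, p: float) -> float:
        """Nearest-rank style percentile over recorded samples (p in 0..100)."""
        if not self.samples_ms:
            raise ValueError("no samples recorded yet")
        ordered = sorted(self.samples_ms)
        index = int(round(p / 100.0 * (len(ordered) - 1)))
        return ordered[index]
```

Export the P50 and P99 next to your routing latency and the gap between the two becomes hard to ignore.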

The Caching Layer That Actually Works

The first optimization is the most obvious: don’t tokenize the same text twice.

LLM traffic has a property that makes caching extraordinarily effective: repetition. Every request to a RAG application includes the same system prompt. Every multi-turn conversation starts with the same instruction prefix. Every API call from the same client sends the same role tags (\n, \n).

We added an LRU cache in front of the tokenizer. Hit rates depend entirely on content type, and the spread is dramatic. Here’s what we expect:

Content type       Cache hit rate
Role tags (\n)     95%+
System messages    80-90%
User queries       10-30%

That 80-90% hit rate on system messages means that for most requests, the expensive part—tokenizing the 2,000-4,000 token system prompt—is a hash table lookup returning in microseconds instead of a 10ms FFI call.
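A back-of-the-envelope version of that claim, as a sketch. The 10ms miss cost is the figure used throughout this post; the 0.005ms hit cost is an assumed placeholder for "microseconds", not a measurement.

```python
def expected_tokenize_ms(hit_rate: float, miss_ms: float, hit_ms: float) -> float:
    """Average per-call tokenization cost for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


if __name__ == "__main__":
    for rate in (0.0, 0.8, 0.9):
        cost = expected_tokenize_ms(rate, miss_ms=10.0, hit_ms=0.005)
        print(f"hit rate {rate:.0%}: {cost:.3f} ms per call on average")
```

Note that this is an average. The misses still cost the full 10ms each, and each one still blocks the loop while it runs.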

The implementation is straightforward: a hash map keyed on the input text, with an LRU eviction list capped at a configured maximum (we use 1,000 entries). On hit, move the entry to the front. On miss, tokenize, insert at the front, evict from the tail if full. No locks needed: in a sharded architecture, each core has its own cache.
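The description above maps onto very little code. Here is a minimal sketch in Python, using `collections.OrderedDict` to play the role of both the hash map and the eviction list. It illustrates the technique and is not our production implementation; `TokenCache` and the `tokenize` callable are illustrative names.

```python
from collections import OrderedDict
from typing import Callable, List


class TokenCache:
    """Per-shard LRU cache of text -> token ids. One instance per core, so no locks."""

    def __init__(self, tokenize: Callable[[str], List[int]], max_entries: int = 1000) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._tokenize = tokenize
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def encode(self, text: str) -> List[int]:
        cached = self._entries.get(text)
        if cached is not None:
            # Hit: move the entry to the front (most recently used).
            self._entries.move_to_end(text, last=False)
            self.hits += 1
            return cached
        # Miss: pay for the tokenizer call, then insert at the front.
        self.misses += 1
        tokens = self._tokenize(text)
        self._entries[text] = tokens
        self._entries.move_to_end(text, last=False)
        if len(self._entries) > self._max_entries:
            # Full: evict from the tail (least recently used).
            self._entries.popitem(last=True)
        return tokens
```

The `hits` and `misses` counters are an addition for observability; dividing one by their sum gives the hit rate per shard.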

Two details matter:

Cap cached text—but don’t cap it too low. Our first instinct was to cap at 8KB: surely a long RAG document won’t repeat verbatim often enough to earn its memory. That was a mistake. The long, stable system prefixes we most wanted to cache routinely exceed 8KB, and refusing them reintroduced a 5-7ms P50 regression at 20+ concurrent users, exactly the cost we were trying to delete. We raised the cap to 64KB. Worst case is 1,000 entries × 64KB ≈ 64MB per shard, which is cheap insurance. The cache key is the full input text (the tokens themselves are a small vector of int32s), so the cap is really about bounding key memory. And the texts most worth caching are precisely the long ones. (A separate, tighter 32KB limit applies to the cross-core dispatch path, because that one copies the string across a core boundary, where large copies aren’t free.)
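The cap itself is a one-line check in front of the cache, and the worst-case memory figure is a single multiplication. A sketch, assuming the cap is measured in UTF-8 bytes (the post does not say whether it is bytes or characters); the constant and function names are illustrative.

```python
MAX_CACHED_TEXT_BYTES = 64 * 1024  # raised from the original 8KB cap
MAX_ENTRIES = 1000


def is_cacheable(text: str, max_bytes: int = MAX_CACHED_TEXT_BYTES) -> bool:
    """True if the text is small enough to be used as a cache key."""
    # Assumption: the cap counts encoded bytes, not characters.
    return len(text.encode("utf-8")) <= max_bytes


def worst_case_key_bytes(max_entries: int = MAX_ENTRIES,
                         max_bytes: int = MAX_CACHED_TEXT_BYTES) -> int:
    """Upper bound on key memory per shard if every entry hits the cap."""
    return max_entries * max_bytes


if __name__ == "__main__":
    long_prefix = "x" * (20 * 1024)  # a 20KB system prefix
    print(is_cacheable(long_prefix, max_bytes=8 * 1024))  # rejected by the old cap
    print(is_cacheable(long_prefix))                      # accepted at 64KB
    print(worst_case_key_bytes() / 1_000_000, "MB per shard, worst case")
```

Texts that fail the check simply bypass the cache and go straight to the tokenizer, which is exactly what made the 8KB cap so costly for long system prefixes.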

Don’t special-case unique content. User queries have a 10-30% hit rate; most are unique. There is no need to filter them out before they reach the cache, because LRU eviction handles them naturally: unique queries enter the cache, never get hit, and fall off the tail. The system prompt stays hot at the front.
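A small simulation makes the eviction behavior easy to check. This sketch uses Python's built-in `functools.lru_cache` rather than a hand-rolled cache, with a deliberately tiny capacity so evictions happen quickly. The tokenizer body and the prompt strings are hypothetical.

```python
from functools import lru_cache
from typing import Tuple


def simulate(requests: int, max_entries: int = 4):
    """Send `requests` requests, each with a shared system prompt and a unique query."""

    @lru_cache(maxsize=max_entries)
    def tokenize(text: str) -> Tuple[int, ...]:
        # Hypothetical stand-in for the real tokenizer call.
        return tuple(ord(ch) for ch in text)

    system_prompt = "You are a helpful assistant."
    for i in range(requests):
        tokenize(system_prompt)              # shared prefix: hit after the first request
        tokenize(f"unique user query #{i}")  # never repeats: always a miss
    return tokenize.cache_info()


if __name__ == "__main__":
    info = simulate(100)
    print(info)
```

The system prompt misses once and then hits on every later request, even though the cache holds only four entries and a hundred unique queries pass through it. Because the prompt is touched on every request, it is never the least recently used entry.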

When Caching Isn’t Enough

A 90% cache hit rate sounds great until you think about what happens on the other 10%. At 1,000 requests per second, 10% misses means 100 tokenizer calls per second, each blocking...
