LLM Serving and the Bus That Never Stops | Rafi Hasan
April 2, 2026 • 8 min read
In-flight batching is the trick that keeps LLM serving from wasting GPU seats.
I used to think batching requests for a machine learning model was a solved problem. I have hosted models and served requests, and batching makes serving faster. If you don't batch, and instead send one tiny request at a time, the GPU behaves like a giant bus carrying one passenger. It can move, but the economics are terrible. This is why batching exists. But LLMs make batching weird.
The Idea
In a traditional web backend or a standard computer vision pipeline, batching is straightforward. You put requests into a queue, wait until you hit a batch size of 4 or 8, slam them into the GPU, and return the results. Standard engineering. A single trip through the model.
input -> model -> output
An image classifier does this. An embedding model does this. You pass in data, the model runs, and you get the result. But LLM serving completely breaks this mental model. LLM generation is iterative. It generates one token at a time.
prompt -> prefill
token 1 -> decode
token 2 -> decode
token 3 -> decode
...
This means serving an LLM is not just "run the model once." It is a scheduling problem that repeats every token. If you treat LLM requests like traditional web requests, your GPU efficiency plummets, your latency spikes, and your cloud bill skyrockets.
LLM serving is a loop. Every token is another chance to waste the GPU or fill it.
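To make the loop concrete, here is a minimal sketch of a single request's lifecycle. It is not how any real engine is written: `toy_next_token` is a hypothetical stand-in for a model forward pass, and `EOS = 0` is an assumed end-of-sequence token id. The only point is the shape: the prompt is consumed once, then the model is called once per generated token.

```python
import random
from typing import List

EOS = 0  # hypothetical end-of-sequence token id (assumption for this sketch)


def toy_next_token(context: List[int], rng: random.Random) -> int:
    """Stand-in for one model forward pass: returns a pseudo-random token id."""
    return rng.randint(0, 9)


def generate(prompt: List[int], max_new_tokens: int, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    context = list(prompt)            # "prefill": the whole prompt enters the context once
    output: List[int] = []
    for _ in range(max_new_tokens):   # "decode": one model call per new token
        token = toy_next_token(context, rng)
        output.append(token)
        context.append(token)         # feed the generated token back in
        if token == EOS:              # end condition reached, request is done
            break
    return output
```

Because the loop can stop at `EOS` or at `max_new_tokens`, two requests with the same prompt length can occupy the GPU for very different numbers of iterations.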
Prefill, Decode, and the KV Cache
Each request has two phases. The first phase is prefill, where the model reads the prompt and builds the internal attention state. The second phase is decode, which uses autoregressive decoding to generate text one token at a time, feeding each generated token back into the model to predict the next.
Because decode steps run repeatedly until an end condition is met, requests vary wildly in duration. To avoid recomputing the prompt history at every step, the server maintains a KV cache in GPU memory. The scheduler's goal is to keep the GPU busy with token generation without running out of this finite cache memory.
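A rough way to see why the cache is the scarce resource is to count bytes. For standard multi-head attention, each cached token stores one key and one value vector per layer per KV head. The sketch below uses made-up model dimensions and a made-up 16 GiB cache budget; both are assumptions for illustration, not figures from any particular deployment.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = one key + one value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(cache_budget_bytes: int, per_token_bytes: int) -> int:
    """How many tokens fit in the cache, summed across all active requests."""
    return cache_budget_bytes // per_token_bytes


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128, fp16 values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
budget = 16 * 1024**3                      # assumed 16 GiB reserved for the cache
tokens = max_cached_tokens(budget, per_token)

print(per_token)  # 524288 bytes, i.e. 512 KiB per token
print(tokens)     # 32768 tokens shared by every request in the batch
```

Under these assumptions, every token of every active request costs half a mebibyte, so long prompts and long generations directly reduce how many requests can share the batch.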
Static Batching
Imagine three requests arrive together.
A -> needs 8 output tokens
B -> needs 2 output tokens
C -> needs 6 output tokens
Static batching puts them on the same bus and makes the bus finish the whole trip before taking new passengers.
Step 1: A B C
Step 2: A B C # B is done
Step 3: A _ C
Step 4: A _ C
Step 5: A _ C
Step 6: A _ C # C is done
Step 7: A _ _
Step 8: A _ _ # A is done
Even though B and C finished early, their seats cannot be reassigned. The GPU keeps running, but it wastes memory and computes empty padding tokens. That is the waste.
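The waste in this three-request example can be counted directly. The sketch below assumes a simplified cost model in which one "slot-step" is one seat occupied for one decode iteration; the function name and the returned keys are made up for illustration.

```python
from typing import Dict


def static_batch_utilization(output_lengths: Dict[str, int]) -> Dict[str, float]:
    """Count useful vs. idle slot-steps when the batch runs until its longest request ends."""
    steps = max(output_lengths.values())          # the bus finishes the whole trip
    total = steps * len(output_lengths)           # every seat is held for every step
    useful = sum(output_lengths.values())         # steps that produced a real token
    return {
        "steps": steps,
        "total_slot_steps": total,
        "useful_slot_steps": useful,
        "idle_slot_steps": total - useful,
        "utilization": useful / total,
    }


stats = static_batch_utilization({"A": 8, "B": 2, "C": 6})
print(stats["idle_slot_steps"])  # 8
print(stats["utilization"])      # 16 useful out of 24, about 0.67
```

In this toy accounting, a third of the batch's capacity is spent on seats whose passengers have already left.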
The Fixed Tour Bus vs. The Dynamic City Transit Bus
I like thinking about the difference between static and in-flight batching as the difference between a pre-booked tour bus and a public city transit bus:
The Fixed Tour Bus (Static Batching): A tour bus leaves the station with a set passenger list. Even if a passenger decides to get off early at stop 2, their seat must remain empty for the rest of the trip. The bus cannot pick up new passengers on the road. Instead, it must complete the entire tour and return to the station before loading a new group.
The Dynamic City Transit Bus (In-Flight Batching): A transit bus runs a continuous loop. As soon as a passenger reaches their destination and steps off, the bus pauses briefly at the next stop, lets a new passenger board to fill the empty seat, and immediately continues its journey.
In LLM serving, the bus is the active batch. A seat is not just a "batch slot." It represents GPU memory and KV cache capacity. Getting off means a request has hit an end condition. Boarding means a new request has enough memory budget to join the active generation loop.
[Interactive figure: "The bus that changes passengers while moving." Each seat is a batch slot backed by GPU memory and KV cache, and each tick is one generation iteration. In the in-flight view, the bus is already moving, but the scheduler keeps swapping riders at token boundaries when capacity opens up.]
The important part is that the batch is no longer a fixed group of requests. It is a dynamic, fluid variable.
In-flight Batching
In-flight batching is also called continuous batching or iteration-level batching. So how do engines like vLLM or TensorRT-LLM actually implement this "dynamic bus" in code? They shift the scheduling boundary from the request level to the iteration level.
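A toy simulation shows what iteration-level scheduling buys. This is a sketch under strong assumptions, not the scheduler of vLLM or TensorRT-LLM: capacity is modeled as a fixed number of seats, KV cache accounting and prefill cost are ignored, and every request is assumed to need at least one output token. The function `run_inflight` and the extra requests D and E are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def run_inflight(requests: List[Tuple[str, int]],
                 num_seats: int) -> Tuple[int, int, Dict[str, int]]:
    """Simulate in-flight batching; returns (steps, idle_slot_steps, finish_step_by_id)."""
    waiting: Deque[Tuple[str, int]] = deque(requests)   # the bus stop queue
    active: Dict[str, int] = {}                         # request id -> tokens left
    finished_at: Dict[str, int] = {}
    step = 0
    idle_slot_steps = 0
    while waiting or active:
        # Admission happens at every token boundary, not once per batch.
        while waiting and len(active) < num_seats:
            rid, tokens_needed = waiting.popleft()
            active[rid] = tokens_needed
        step += 1
        idle_slot_steps += num_seats - len(active)
        # One decode iteration: every active request produces one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] <= 0:                        # end condition: free the seat
                del active[rid]
                finished_at[rid] = step
    return step, idle_slot_steps, finished_at


steps, idle, done = run_inflight(
    [("A", 8), ("B", 2), ("C", 6), ("D", 3), ("E", 4)], num_seats=3
)
print(steps, idle)  # 9 4
print(done)         # {'B': 2, 'D': 5, 'C': 6, 'A': 8, 'E': 9}
```

In this run, D boards at step 3, right after B finishes, and E boards at step 6 when D leaves. Serving the same five requests as two fixed groups (A, B, C and then D, E) would take 8 + 4 = 12 steps under the same toy model, against 9 here.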
At every generation iteration, the scheduler asks:
Which requests are still active?
Which requests just finished?
Which new requests are waiting?
Is there enough KV...