Unlocking asynchronicity in continuous batching
Log In<br>Sign Up
Back to Articles
Unlocking asynchronicity in continuous batching
Published<br>May 14, 2026<br>Update on GitHub<br>Upvote 32
+26
Rémi Ouazan Reboul ror Follow
Pedro Cuenca pcuenq Follow
Aritra Roy Gosthipaty ariG23498 Follow
TL;DR: we explain how to separate CPU and GPU workloads to get a massive performance boost for inference.
This is the second post in a series on efficient LLM inference. The first post covered continuous batching from first principles. It introduces some concepts we build upon: KV cache, FlashAttention, attention masks, etc.
An H200 costs around $5 an hour on Inference Endpoints. That's cheap for an hour, but use it for a day and you are already paying $120. If this is the case, you want your GPU to be used to its fullest.<br>We have seen that Continuous Batching improves GPU utilization by scheduling tightly packed batches, so no compute is wasted on padding. But there is a second source of waste that continuous batching does not address: by default, it is synchronous. This means the CPU and GPU take turns: while the GPU computes, the CPU waits. And while the CPU prepares the next batch, the GPU waits. In a loop running hundreds of steps per second, those idle gaps add up, and as we will show, they can account for nearly a quarter of total runtime. To ensure the GPU is busy computing 100% of the time, we need to get rid of those gaps.
To achieve this, we can use asynchronous batching : we are going to disentangle CPU batch preparation from GPU batch compute, so both can run in parallel and we always have a productive GPU 🔥
Synchronous batching
This is how naive synchronous batching works:
When the CPU prepares a new batch, it selects which requests to include, updates the KV cache table, evicts requests that finished in the previous runs, and admits new ones to fill the freed space. Once that is done, it transfers the prepared inputs to the GPU. The GPU runs its forward pass and samples (i.e. chooses) a new token for each request. The results come back to the CPU, so it knows what token each request just produced, then the whole cycle repeats again.
Notice the red annotation on the right: after the GPU finishes computing, it goes idle. The next batch cannot start until the CPU has gone through its update step: sampling the output tokens, updating request states, re-scheduling the batch.
This is the core inefficiency of synchronous batching: the CPU and GPU take turns. While the GPU is computing, the CPU is idle. While the CPU is updating, the GPU is idle. In no circumstances are they both doing useful work at the same time. For a single forward pass this might seem like a small price to pay, but in a continuous batching loop running hundreds of steps per second, these idle gaps accumulate into real throughput loss.
To showcase this, we profile the time spent on CPU and GPU when generating 8K tokens with a batch size of 32 using an 8B model:
If you want to produce the same kind of graph, you can instrument the continuous batching code to dump CPU and GPU activity spans and use this script.
The timeline alternates between green (GPU active, CPU idle) and red (CPU active, GPU idle): the two never overlap. Total generation time is 300.6 seconds, with 24.0% of that spent with an idle GPU waiting for the CPU to finish. Nearly a quarter of all generation time is wasted, from the point of view of the GPU. This is the pessimistic way of viewing things.
The optimistic way is that generation time would drop from 300 to 228 seconds (a free 24% speedup!), if we could eliminate CPU overhead entirely. This requires zero new kernel or model changes, just careful coordination of hardware.
Fundamentally, the idea is simple: we need to figure out how to run batch preparation for batch N+1 while batch N is computing. But this simple idea hides a few technical difficulties:
How can we launch something on the GPU and get back control to the CPU?
How can we make sure data is ready, for either CPU or GPU tasks, by the time each task is launched?
How can we prepare batch N+1 if it is based on the predictions of batch N?
By answering those questions, we are going to build asynchronous batching from scratch. We followed the same steps to implement it as part of continuous batching in the transformers library. Feel free to check the code and compare!
Creating concurrency
Our end goal is to have concurrent execution of CPU and GPU operations. We need a way to categorize our operations, so we can let the machine know which operations can run concurrently. We can achieve this using CUDA streams.
What is a CUDA stream?
To understand how CUDA orders its operations, we need to talk about CUDA streams . A stream is an ordered queue of GPU operations (kernel launches, memory copies, synchronization barriers) that executes in the order they were submitted. Every GPU operation is always scheduled inside a stream. Operations within the same...