Popping the GPU Bubble


Moondream Engineering

Photon, Moondream's inference engine, achieves near-realtime VLM inference (~33 ms on NVIDIA B200). This is a peek into how it delivers up to 35% higher decode throughput by optimizing how the GPU works.

June 4, 2026

How do you make an AI model run as fast as possible? This is a question we obsess over at Moondream HQ. The GPU handles all the math involved in model inference, so at first glance it doesn't seem like there's much to it: just tell it what to do and wait for the answer. But if you start looking at how it actually works under the hood, you find that the GPU often sits idle, not for lack of work, but because the CPU hasn't told it what to do next yet. This phenomenon is called a GPU bubble.

When a typical AI model generates text, it produces one token at a time (a token is a chunk of text, roughly a few characters). Each token depends on the tokens before it, which is what makes the model autoregressive, so generation is sequential. You can't compute the third token before you have the second. This decode loop involves a round trip between the CPU and GPU. The GPU does most of the heavy lifting to run the actual model, performing billions of arithmetic operations to produce the next token. But there's also a surprising amount of work done by the CPU. It selects which requests to run next, sets up the metadata the GPU needs for them, picks the actual token out of the model's output and records it, and more.
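As a rough illustration of that loop, here is a minimal blocking decode sketch in Python. It is not Photon's code: `toy_forward`, `EOS`, and the 8-entry vocabulary are made up for the example, and the "model" is a deterministic arithmetic rule standing in for the GPU forward pass.

```python
EOS = 0          # hypothetical end-of-sequence token id
VOCAB_SIZE = 8   # hypothetical, tiny vocabulary


def toy_forward(tokens: list[int]) -> list[float]:
    """Stand-in for the GPU forward pass: one score per vocabulary entry."""
    favored = (tokens[-1] * 3 + 1) % VOCAB_SIZE  # toy rule, not a real model
    return [1.0 if i == favored else 0.0 for i in range(VOCAB_SIZE)]


def decode(prompt: list[int], max_new_tokens: int) -> list[int]:
    """Blocking decode: each step waits for the previous token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_forward(tokens)  # "GPU" work: run the model
        # "CPU" work: pick the token (greedy argmax), record it, check for the end.
        token = max(range(VOCAB_SIZE), key=logits.__getitem__)
        tokens.append(token)
        if token == EOS:
            break
    return tokens


print(decode([1], max_new_tokens=5))  # [1, 4, 5, 0]
```

Each iteration needs the token from the previous one, which is the sequential dependency described above. In a real engine the "CPU" lines also cover scheduling and metadata setup.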

The challenge is that one token's worth of GPU work is small, while the CPU housekeeping is a fixed cost paid on every trip. If the GPU has to wait for that housekeeping before it can start the next token, it sits idle for part of every loop. This is why we get GPU bubbles.

In this post we're going to dive into how Photon hides these bubbles using a technique called pipelined decoding. The idea is to overlap the two kinds of work: we start GPU work on the next token while the CPU is still finishing the last one.

The bubble

Here's the shape of the problem.

In the blocking version (top), every step is a baton pass. The CPU plans and launches a forward, the GPU runs it, then the CPU synchronizes, waits for the results to land, commits them, and only then starts planning the next step. This is because the plan depends on the token we select. For example, if the model indicates it has finished answering, then we need to schedule a new pending request from our queue. The GPU sits idle waiting for the CPU to finish its commit-plan-launch work.

The fix is to pipeline the loop: launch the next forward while the current step's token is still coming back and being committed. That's the pipelined version (bottom): the forwards run back-to-back, and the CPU work is overlapped underneath them.
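To see why overlapping helps, a back-of-the-envelope model is enough. The sketch below assumes an idealized steady state in which every step costs a fixed GPU time and a fixed CPU time and all of the CPU work can be hidden behind the GPU. The function names and the 4 ms / 1 ms figures are illustrative assumptions, not measurements from Photon.

```python
def blocking_step_ms(gpu_ms: float, cpu_ms: float) -> float:
    """Blocking loop: GPU and CPU work take turns, so their costs add up."""
    return gpu_ms + cpu_ms


def pipelined_step_ms(gpu_ms: float, cpu_ms: float) -> float:
    """Idealized pipelined loop: the slower of the two sets the pace."""
    return max(gpu_ms, cpu_ms)


def gpu_idle_fraction(gpu_ms: float, cpu_ms: float) -> float:
    """Share of each blocking step the GPU spends waiting (the bubble)."""
    return cpu_ms / blocking_step_ms(gpu_ms, cpu_ms)


def throughput_gain(gpu_ms: float, cpu_ms: float) -> float:
    """Relative tokens-per-second gain from pipelining in this model."""
    return blocking_step_ms(gpu_ms, cpu_ms) / pipelined_step_ms(gpu_ms, cpu_ms) - 1.0


# Illustrative numbers only: 4 ms of GPU work, 1 ms of CPU housekeeping.
print(gpu_idle_fraction(4.0, 1.0))  # 0.2
print(throughput_gain(4.0, 1.0))    # 0.25
```

In this model the gain grows as the CPU cost becomes larger relative to the GPU cost, which is why short decode steps are where the bubble hurts most.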

The reason we can do this is that the token we just sampled doesn't have to leave the GPU. The next forward reads it straight from GPU memory as its input. We still want a copy on the CPU eventually, to detokenize it, stream it, and decide whether the request is done, but that is bookkeeping we can do a moment later, in the background, while the next forward already runs. Not waiting on that copy is the move that removes the bubble.
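The ordering described above can be sketched with a toy simulation. This is an assumption-laden stand-in, not Photon's implementation: a single worker thread plays the role of the GPU stream, `ToyDevice` and its arithmetic "model" are hypothetical, and the loop runs a fixed number of steps so it sidesteps what happens when a request finishes.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class ToyDevice:
    """Stand-in for the GPU: one worker thread runs forwards in launch order."""

    def __init__(self, first_token: int) -> None:
        self._stream = ThreadPoolExecutor(max_workers=1)
        self._last_token = first_token  # stays "on the device" between steps

    def _forward(self) -> int:
        # Reads the previous token from device state, never from the CPU.
        self._last_token = (self._last_token * 3 + 1) % 8  # toy model
        return self._last_token  # the copy that eventually reaches the CPU

    def launch_forward(self) -> Future:
        """Enqueue a forward and return immediately, like an async launch."""
        return self._stream.submit(self._forward)

    def close(self) -> None:
        self._stream.shutdown(wait=True)


def pipelined_decode(first_token: int, steps: int) -> list[int]:
    """Run `steps` (>= 1) decode steps, launching step k+1 before committing step k."""
    device = ToyDevice(first_token)
    committed: list[int] = []
    in_flight = device.launch_forward()
    for _ in range(steps - 1):
        next_in_flight = device.launch_forward()  # launch the next forward first
        committed.append(in_flight.result())      # then commit the previous token
        in_flight = next_in_flight
    committed.append(in_flight.result())          # drain the last step
    device.close()
    return committed


print(pipelined_decode(first_token=1, steps=3))  # [4, 5, 0]
```

The key line is the launch that happens before `result()` is called: the CPU never has to hold the token to start the next step, because the device already has it.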

Making it safe requires three things, which we cover in the rest of this post: keeping step buffers from colliding (ping-pong slots), getting the sampling order right for constrained decoding (forward now, sample later), and cleaning up after a request finishes (zombies).

Mechanism 1: ping-pong slots

To run a decode step, the GPU needs a working set of buffers: a place to stage the input (the last generated token and its position in the sequence), a place for the model to write its output (the logits, one score per token in the vocabulary), a place to land the sampled token, and some bookkeeping the attention kernel needs to find each sequence's cached keys and values (its KV cache). We keep pinned (page-locked) host buffers on both ends, so the copies on and off the GPU run as background DMA (direct memory access) transfers instead of blocking the CPU.

These buffers are allocated once and reused on every step. We work hard to avoid performing GPU memory allocations at runtime, because they can cause device synchronization and introduce bubbles. Fixed buffer addresses are also needed for capturing the decode step once as a CUDA graph and replaying it, reducing kernel launch overhead. We call this bundle a DecodeSlot.

This works, but it introduces a blocker for pipelining. The buffers stay in use until the step is done, so we cannot start the next step until the current one finishes. To overlap two steps, the second step needs its own working set, otherwise it can overwrite the results of the first step before the CPU has read them. So we keep two slots and alternate between them, ping-pong style.
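The collision is easy to reproduce in a toy model. The sketch below is a simplified assumption, not Photon's `DecodeSlot`: the fields, the arithmetic "model", and `run_steps` are hypothetical, and the simulation is sequential, modeling the worst case where the next step has already written its result by the time the CPU reads the previous one.

```python
from dataclasses import dataclass


@dataclass
class DecodeSlot:
    """Hypothetical, heavily simplified working set for one decode step."""
    input_token: int = -1
    sampled_token: int = -1


def run_steps(num_slots: int, first_token: int, steps: int) -> list[int]:
    """Run `steps` (>= 1) steps; the CPU reads step k only after step k+1 has run."""
    slots = [DecodeSlot() for _ in range(num_slots)]  # allocated once, reused
    read_back: list[int] = []
    token = first_token
    previous = None
    for step in range(steps):
        slot = slots[step % num_slots]              # alternate between slots
        slot.input_token = token
        slot.sampled_token = (token * 3 + 1) % 8    # toy forward + sample
        token = slot.sampled_token                  # next input, no CPU round trip
        if previous is not None:
            read_back.append(previous.sampled_token)  # late CPU read of the prior step
        previous = slot
    read_back.append(previous.sampled_token)
    return read_back


print(run_steps(num_slots=2, first_token=1, steps=3))  # [4, 5, 0]
print(run_steps(num_slots=1, first_token=1, steps=3))  # [5, 0, 0]
```

With a single slot the first token (4) is overwritten before the CPU reads it, so the committed sequence is wrong. With two slots each step's result survives until it has been read, at the cost of one extra working set.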

One thing to note about launch: we don't execute kernels the instant we issue a launch...
