Cutting inference cold starts by 40x with LP, FUSE, C/R, and cuda-checkpoint


Engineering • May 12, 2026 • 20 minute read

Charles Frye (@charles_irl), Member of Technical Staff

Jonathan Belotti (@jonobelotti_IO), Member of Technical Staff

Erik Bernhardsson (@bernhardsson), CEO and Founder

Akshat Bubna (@akshat_b), CTO and Founder

We are in the age of inference. Billion- to trillion-parameter neural networks are run on specialized accelerators at quadrillions of operations per second to generate media, author software, and fold proteins at massive scale.

Inference workloads are more variable and less predictable than the training workloads that previously dominated. That makes them a natural fit for serverless computing, where applications are defined at a level above the (virtual) machine so that they can be more readily scaled up and down to handle variable load.

But serverless computing only works if new replicas can be spun up quickly: as fast as demand changes, which can be on the scale of seconds. Naïvely spinning up a new instance of, say, SGLang serving a billion-parameter LLM on a B200 can take tens of minutes or stall for hours on GPU availability.

At Modal, we’ve done deep engineering work over the last five years to solve this problem. In this blog post, we walk through what we did.

There are four key ingredients:

- Cloud buffers: maintain a small buffer of healthy, idle GPUs to take on new load
- Custom filesystem: serve container images lazily out of a content-addressed, multi-tier cloud-native cache
- Checkpoint/restore: fast-forward through CPU-side initialization by directly restoring processes into memory
- CUDA checkpoint/restore: fast-forward through GPU-side initialization by directly restoring CUDA contexts into memory

Together, they take AI inference server replica scaling from multiple kiloseconds (tens of minutes) to just tens of seconds.

We’ve shared bits and pieces of this work along the way, because we believe that secrecy is a bad moat. And if more people learn how to use GPUs efficiently, there will be more available in the market for us!

But this blog post represents the first time we’ve put the entire story together in one place.
We hope it convinces you that our system is worth buying into, or worth joining us to build.

Why care about serverless GPUs? To maximize GPU Allocation Utilization for inference workloads.

First, let’s frame the problem clearly. GPUs are expensive and scarce, so we want to maximize their utilization, where “utilization” is the following unitless quantity:

Utilization := Output achieved ÷ Capacity paid for

There are many ways to measure utilization, that is, to define output and capacity. The most sophisticated and most stringent is probably “Model FLOP/s Utilization” (MFU), which divides the rate of arithmetic operations the model’s algorithm strictly requires by the peak arithmetic throughput of the hardware.

This is catnip for engineers. It’s also especially critical for “hero run” large-scale training, so it draws a lot of investment and attention, e.g. recently as everyone dunked on xAI’s ~10% MFU.

But at the other end of the stack, there’s a more basic form of utilization that wrecks the relationship between achieved output and allocated capacity for inference workloads: GPU Allocation Utilization.

GPU Allocation Utilization := GPU-seconds running application code ÷ GPU-seconds paid for

Aside on "GPU Utilization" terminology: The "GPU utilization" reported by nvidia-smi and similar tools sits between these two extremes. It reports the fraction of time that kernel code is running on the GPU: literally, the fraction of the sample period during which at least one kernel was executing.

Inference applications have highly variable scale. Unlike training, the demand for capacity is outside the direct control and management of the engineering organization. Instead, it is driven by external user behavior: by markets or social media algorithms or product teams.

Here’s a sample trace of requests per minute from a time-varying Poisson process we use to model inference applications.
Notice not only the seasonal variation (daily cycles) but also the long-term trend of increasing variability in demand as the average demand increases.
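For readers who want to reproduce a trace of this shape, here is a minimal sketch of sampling a time-varying (non-homogeneous) Poisson process by thinning. It is not the model used for the trace above: the intensity function `rate_per_minute`, its parameters (`base`, `growth_per_day`, `daily_swing`), and the three-day horizon are all hypothetical values chosen for illustration.

```python
import math
import random


def rate_per_minute(t_min, base=20.0, growth_per_day=5.0, daily_swing=0.6):
    """Hypothetical intensity (requests/minute) at time t_min, in minutes.

    A linear upward trend multiplied by a daily sinusoidal cycle.
    """
    days = t_min / 1440.0
    trend = base + growth_per_day * days
    seasonal = 1.0 + daily_swing * math.sin(2.0 * math.pi * days)
    return trend * seasonal


def sample_arrivals(horizon_min, rate_fn, rate_max, rng):
    """Sample arrival times on [0, horizon_min) by thinning.

    Candidates come from a homogeneous Poisson process with rate rate_max;
    each is kept with probability rate_fn(t) / rate_max. This requires
    rate_fn(t) <= rate_max over the whole horizon.
    """
    arrivals = []
    t = 0.0
    while True:
        t += rng.expovariate(rate_max)  # exponential gap to next candidate
        if t >= horizon_min:
            return arrivals
        if rng.random() < rate_fn(t) / rate_max:
            arrivals.append(t)


def requests_per_minute(arrivals, horizon_min):
    """Bucket arrival times into one-minute bins."""
    counts = [0] * horizon_min
    for t in arrivals:
        counts[int(t)] += 1
    return counts


rng = random.Random(0)
horizon = 3 * 1440  # three days, in minutes
# Upper bound on the intensity: largest trend value times largest seasonal factor.
rate_max = (20.0 + 5.0 * 3) * 1.6
arrivals = sample_arrivals(horizon, rate_per_minute, rate_max, rng)
trace = requests_per_minute(arrivals, horizon)

first_day_mean = sum(trace[:1440]) / 1440
last_day_mean = sum(trace[-1440:]) / 1440
print(f"mean requests/minute: day 1 = {first_day_mean:.1f}, day 3 = {last_day_mean:.1f}")
```

Because a Poisson count has variance equal to its mean, the minute-to-minute spread in `trace` grows along with the trend, which is the widening band described above.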

Spiky demand raises serious engineering problems. To borrow from Marc Brooker of AWS: “the cost of a system scales with its (short-term) peak traffic, but for most applications the value the system generates scales with the (long-term) average traffic.” Spiky demand means high peak-to-average ratios, which challenge system economics.

Concretely, imagine the capacity planning for such an application. You might have demand (measured in GPUs required to service requests within latency targets) that looks like this:

[Figure: With a fixed, over-provisioned GPU allocation, utilization is low. Series: Application Demand, Provisioned.]
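To make the peak-to-average argument concrete, here is a small sketch that computes GPU Allocation Utilization for a fixed allocation sized to the peak, compared with an idealized allocation that tracks demand exactly. The demand numbers, the `requests_per_gpu_minute` capacity figure, and the helper names are hypothetical, and the sketch assumes every GPU assigned to demand spends its whole minute running application code.

```python
import math


def gpus_needed(requests_per_minute, requests_per_gpu_minute):
    """GPUs required each minute, under a hypothetical linear capacity model."""
    return [math.ceil(r / requests_per_gpu_minute) for r in requests_per_minute]


def allocation_utilization(demand_gpus, provisioned_gpus):
    """GPU-minutes running application code divided by GPU-minutes paid for."""
    used = sum(min(d, p) for d, p in zip(demand_gpus, provisioned_gpus))
    paid = sum(provisioned_gpus)
    return used / paid if paid else 0.0


# Hypothetical spiky trace, in requests per minute.
demand = gpus_needed([120, 80, 40, 30, 60, 200, 400, 150], requests_per_gpu_minute=50)

fixed = [max(demand)] * len(demand)  # provision for the peak, all the time
elastic = list(demand)               # idealized: capacity tracks demand exactly

fixed_util = allocation_utilization(demand, fixed)
elastic_util = allocation_utilization(demand, elastic)
peak_to_average = max(demand) / (sum(demand) / len(demand))

print(demand)                 # [3, 2, 1, 1, 2, 4, 8, 3]
print(fixed_util)             # 0.375
print(elastic_util)           # 1.0
print(1.0 / peak_to_average)  # matches fixed_util up to floating-point rounding
```

With a fixed allocation sized to the peak, utilization is simply the inverse of the peak-to-average ratio, so the spikier the demand, the more of the paid-for capacity sits idle.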

To properly service your anticipated load,...
