Cramming 1M (Scaled to Zero) Virtual Machines in a Single Box

Felipe Huici, Co-Founder & CEO

At Unikraft, we’re on a quest for exponentially better cloud efficiency; the aim is to run workloads in as few boxes as possible: racks of servers rather than data centers.

One important way to achieve that is to scale any idle workload (VM) down to zero and to wake it up just in time, when the next request for it arrives. The trick is to ensure that the entire process, from first receiving the request, to waking up the VM, to unbuffering the request, to the VM actually responding, happens on millisecond timescales, so that the user sending the request never knows that the VM was scaled to zero (in a standby state, in our lingo). In such a standby state the VM consumes neither CPU nor memory, resources that can then go to other VMs on the server that do have actual work to do; this resume process, which takes the VM from a standby state to a running one, happens in
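To make the wake-on-request flow concrete, here is a minimal sketch in Python. It is a simulation rather than the platform's actual proxy: `SimulatedVM`, `WakeOnRequestProxy`, and the 5 ms sleep standing in for a snapshot restore are all hypothetical names and values chosen for illustration. The point it shows is that requests arriving while the VM is in standby are held, the VM is resumed exactly once, and the held requests are then delivered.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SimulatedVM:
    """Hypothetical stand-in for a VM; 'standby' means scaled to zero."""
    name: str
    state: str = "standby"
    resumes: int = 0

    async def resume(self) -> None:
        # Assumption: a real platform would restore a snapshot here.
        # The sleep only simulates that work taking a few milliseconds.
        await asyncio.sleep(0.005)
        self.resumes += 1
        self.state = "running"

    async def handle(self, request: str) -> str:
        if self.state != "running":
            raise RuntimeError("request reached a VM that is not running")
        return f"{self.name} handled {request}"


class WakeOnRequestProxy:
    """Holds incoming requests while the VM wakes, then forwards them."""

    def __init__(self, vm: SimulatedVM) -> None:
        self.vm = vm
        self._wake_lock = asyncio.Lock()

    async def forward(self, request: str) -> str:
        # Each waiting coroutine acts as the buffer for its own request.
        if self.vm.state != "running":
            async with self._wake_lock:
                # Re-check under the lock so only the first caller resumes.
                if self.vm.state != "running":
                    await self.vm.resume()
        return await self.vm.handle(request)


async def demo():
    vm = SimulatedVM("vm-0")
    proxy = WakeOnRequestProxy(vm)
    # Three requests arrive at once while the VM is still in standby.
    replies = await asyncio.gather(
        *(proxy.forward(f"req-{i}") for i in range(3))
    )
    return vm, replies


if __name__ == "__main__":
    vm, replies = asyncio.run(demo())
    print(vm.resumes, replies)
```

The double check around the lock is the design choice worth noting: without it, every request that arrives during the wake-up would trigger its own resume.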

I won’t go into a lot of detail in this write-up as to how the platform achieves those

In a previous life we were virtualization and performance researchers and had already looked into achieving high density, which, back then, meant hacking away at most of the major components of the Xen hypervisor. The result was a paper we published at SOSP, the top systems conference in the world:

And yes, we were slightly annoyed with the industry claiming that containers were secure :), thus the somewhat hot take title (but I’ll leave the security aspect to another blog post). The paper did well, got 28K citations (the research equivalent of a TikTok video going viral) and also hit HN a few years later:

Back to the story: in the paper we showed that it was possible to run up to 8K VMs on a standard server while still being able to start new ones in a few milliseconds:

You may have noticed I mentioned the word “hack” above: the system wasn’t particularly stable, and back then we were using highly optimized NGINX images to obtain these numbers (we did experiments with other applications too, but certainly nothing like the “define anything in a Dockerfile and we’ll start in

One turning point along the road was the appearance of Firecracker which, unlike other so-called Virtual Machine Monitors (VMMs) like QEMU, was built for speed and efficiency (and yes, I’m aware that QEMU is an emulator hijacked into being a VMM as well, but I digress).

In the beginning, we tested our platform at scales of 5K-10K VMs, and we were happy to see that this many VMs on a single server did not affect start or resume times. In case you’re wondering, for benchmarking we tend to use servers with 24 or 48 CPU cores, 256-384GB of memory, and 2 x 2TB NVMe drives, so nothing spectacular. Note that the NVMe drives are important: we use them to ensure we can statefully resume VMs from snapshots quickly.

As we kept doing deployments, customers kept requesting higher and higher levels of density. Pushing that limit to 50K and beyond started upsetting lots of things on the Linux host, especially on the networking side, where we were using one tap device per VM. That led to kernel lock contention, to maxing out the number of ports per bridge (and thus having to use a large and growing number of bridges), and to some funnier items like crashing Tailscale as it scanned through all the network devices on the host.
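A quick calculation shows how fast the bridge count grows with one tap device per VM. The sketch below assumes a limit of 1024 ports per bridge, which is the `BR_MAX_PORTS` value in the mainline Linux kernel as far as I know; the post itself does not state the limit, and the helper name `bridges_needed` and the `reserved_ports` parameter are made up for illustration.

```python
import math

# Assumption: 1024 ports per Linux bridge (BR_MAX_PORTS in mainline kernels).
PORTS_PER_BRIDGE = 1024


def bridges_needed(vm_count: int,
                   ports_per_bridge: int = PORTS_PER_BRIDGE,
                   reserved_ports: int = 0) -> int:
    """Minimum number of bridges when every VM needs one tap port.

    reserved_ports is a hypothetical allowance for uplink or veth ports
    that occupy a slot on each bridge.
    """
    usable = ports_per_bridge - reserved_ports
    if usable <= 0:
        raise ValueError("no usable ports left per bridge")
    return math.ceil(vm_count / usable)


if __name__ == "__main__":
    for vms in (10_000, 50_000, 100_000, 1_000_000):
        print(f"{vms:>9} VMs -> {bridges_needed(vms)} bridges")
```

Under that assumption, 50K VMs already need 49 bridges and 1M VMs would need 977, before counting any ports reserved for uplinks.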

At some point it became clear that we couldn’t keep relying on tap devices and the standard network substrate on the host at the levels of scale some of our clients were requesting: 1M VMs in a server, and yeah, there were internal memes about Austin Powers floating around, like this one for example:

So basically we redesigned the platform to not use tap devices, or any of the host’s network and protocol subsystems, and instead to carry all communication within the server over shared-memory devices, including leveraging vsock. To make a long story short, we finished the bulk of this work in early 2026 and it is now in production. Before this work we had put a number of workarounds in place to get us to about 100K VMs, but we weren’t sure where the new architecture would take us. I mean, the platform’s core components (controller, proxy, snapshot and storage subsystems) were designed and implemented for scale and speed, but you never know: the difference between theory and practice is larger in practice than in theory.

So we asked one of our field engineers to take a box out for a spin — nothing too incredibly beefy, a standard, off-the-shelf server with 48 CPU cores, 384GB of RAM and 2 x 1.9TB NVMe drives. Because we knew we were going for (large) scale, we turned on all of the platform’s knobs, including the ability to compress all snapshots,...
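For a sense of why scale to zero and snapshot compression matter at this density, here is a back-of-the-envelope calculation using the server specs quoted above. It is only a sketch: it assumes decimal units (1 GB = 10^9 bytes), ignores everything the host itself needs, and the per-VM figures are simple averages rather than measurements from the platform.

```python
GB = 10**9  # assumption: decimal units, as drive vendors use

VM_COUNT = 1_000_000
CPU_CORES = 48
RAM_BYTES = 384 * GB
NVME_BYTES = 2 * 1900 * GB  # 2 x 1.9TB drives


def per_vm_share(total_bytes: int, vm_count: int = VM_COUNT) -> float:
    """Average share of a resource per VM, ignoring host overhead."""
    return total_bytes / vm_count


ram_per_vm = per_vm_share(RAM_BYTES)    # bytes of RAM if every VM were resident
disk_per_vm = per_vm_share(NVME_BYTES)  # bytes of NVMe available per snapshot
vms_per_core = VM_COUNT / CPU_CORES

if __name__ == "__main__":
    print(f"RAM per VM if all were resident: {ram_per_vm / 1e3:.0f} KB")
    print(f"NVMe per VM snapshot:            {disk_per_vm / 1e6:.1f} MB")
    print(f"VMs per CPU core:                {vms_per_core:.0f}")
```

Under these assumptions each VM would get about 384 KB of RAM and 3.8 MB of NVMe, with over 20K VMs per core, which suggests that such a count is only plausible when nearly all VMs are in standby and their snapshots are small.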
