We made our sandbox filesystem 47× faster by deleting it


How we made our OCI filesystem 47× faster - microsandbox


A user in our Discord said microsandbox felt slow. Listing every file in the Python standard library took 5.3 seconds inside a sandbox; in Docker it took milliseconds. We went digging.

We fixed it in v0.4: we replaced our user-space filesystem with a Linux disk image that the VM mounts directly. The geometric mean speedup across our mixed guest-visible filesystem suite is 47×, with the worst-case rows more than 1,000× faster, and the host filesystem code is about 5,300 lines shorter.
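For readers unfamiliar with the metric, here is a minimal sketch of how a geometric mean summarizes per-benchmark speedups. The three ratios are made up for illustration and are not rows from our suite; `geometric_mean` is a hypothetical helper name.

```python
import math


def geometric_mean(ratios):
    """Geometric mean via the mean of logs; every ratio must be positive."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("need a non-empty list of positive ratios")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Made-up per-benchmark speedups (old_time / new_time), for illustration only.
speedups = [2.0, 8.0, 1000.0]

geo = geometric_mean(speedups)
arith = sum(speedups) / len(speedups)

print(f"geometric mean:  {geo:.1f}x")
print(f"arithmetic mean: {arith:.1f}x")
```

With these toy inputs the geometric mean is about 25× while the arithmetic mean is about 337×. A single 1,000× row dominates an arithmetic mean, which is why a geometric mean is the more conservative summary for a mixed suite.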

Where this started

Before microsandbox, I'd been working on a distributed filesystem, and I came to this project with specific ideas about storage. The first rootfs I built reflected them: monofs, a content-addressed filesystem with block-level deduplication, compression, and distributed read replicas.

When I ran real container workloads through it, the numbers didn't add up. monofs stored an image at about 1.3× its original size on disk, and the performance wasn't great either. Block-level dedup is a long-tail win that pays off once a user has accumulated a lot of overlapping data, but microsandbox is local-first, and paying 1.3× storage out of the gate for a payoff that arrives much later wasn't the right trade.

So I needed a different filesystem, but I still wanted three things from one: dedup that pays off from the first sandbox, identical behavior on Linux and macOS so a divergent rootfs design wouldn't become a permanent tax, and compatibility with the format the rest of the industry already uses. OCI gave us the dedup and the compatibility cleanly: container images are content-addressed by layer, so two images sharing a base layer share that layer on disk and in cache. That's dedup that pays off the first time anyone pulls two related images. And when the layer stacking is handled by Linux's overlayfs, the boot path stops dragging.
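As a rough illustration of layer-level dedup, here is a toy content-addressed store. It is a sketch, not microsandbox's code: `LayerStore`, the image names, and the layer bytes are all hypothetical, and a real OCI digest is computed over the layer blob rather than a short byte string.

```python
import hashlib


class LayerStore:
    """Toy content-addressed store: one blob per distinct layer digest."""

    def __init__(self):
        self.blobs = {}   # digest -> layer bytes
        self.images = {}  # image name -> ordered list of layer digests

    def pull(self, name, layers):
        digests = []
        for data in layers:
            digest = "sha256:" + hashlib.sha256(data).hexdigest()
            # A layer that is already stored is reused, not written again.
            self.blobs.setdefault(digest, data)
            digests.append(digest)
        self.images[name] = digests
        return digests


store = LayerStore()
base = b"shared base layer"  # stand-in for a common base layer
store.pull("image-a", [base, b"runtime for image a"])
store.pull("image-b", [base, b"runtime for image b"])

# Four layers were pulled, but the shared base is stored only once.
print(len(store.blobs))  # 3
```

The second pull costs only the layer that differs, which is the "pays off the first time anyone pulls two related images" property.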

The catch is that overlayfs is a Linux kernel feature, and macOS doesn't have it. To get the same behavior on both, I had three options: port overlayfs, find a different in-kernel option (EROFS, which I hadn't come across yet), or build something ourselves. libkrun, the microVM monitor we run sandboxes on, exposes a hook that lets a host program serve files into a guest VM, and that hook made the third path workable. So I wrote our own overlayfs in user space on top of it, and it gave us OCI layer support, layer-level dedup, and identical behavior across platforms.

Where the time was going

Every file operation inside the VM had to bounce out to the host through FUSE, which is Linux's mechanism for letting an ordinary program act as a filesystem. To open a file, the VM hands the request to our host process, which walks every layer looking for the file and sends the answer back; the same trip happens for every stat, every readdir, and every cache miss. A single Python import triggers dozens of these round trips before your code even starts running, and a ten-layer image multiplies the cost of each one.
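To make the layer walk concrete, here is a simplified sketch of a top-down lookup across stacked layers. It assumes each layer is a plain dict from path to contents and leaves out whiteouts, directory merging, and caching; `lookup` and the paths are hypothetical, not our actual host code.

```python
def lookup(layers, path):
    """Search layers from topmost to bottommost.

    Returns (content_or_None, number_of_layers_probed).
    """
    probes = 0
    for layer in reversed(layers):  # layers[-1] is the topmost layer
        probes += 1
        if path in layer:
            return layer[path], probes
    return None, probes


# A ten-layer image where the file lives in the bottom (base) layer.
layers = [{"/usr/lib/python3/os.py": b"os module"}] + [{} for _ in range(9)]

content, probes = lookup(layers, "/usr/lib/python3/os.py")
print(probes)  # 10: every layer above the base is checked first

missing, miss_probes = lookup(layers, "/usr/lib/python3/nope.py")
print(miss_probes)  # 10: a miss has to check every layer
```

In this model both a hit in the base layer and a miss touch all ten layers, and in the old design that work sat behind a VM-to-host round trip for each request.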

We spent the next stretch of v0.3 trying to make that path faster: better caching, fewer syscalls, smaller responses. Each change shaved a few percent. None of them changed the order of magnitude.

Docker doesn't have this problem because Docker uses the kernel's own layered-filesystem driver (overlayfs), so file operations never leave the kernel. We were trying to match a kernel filesystem from outside the kernel; no cache could close that gap.

So we deleted the filesystem.

The new plan

The new plan was to stop bouncing every file operation between the VM and the host. We'd build a Linux filesystem image ahead of time, hand it to the VM as a virtual disk, and let the VM's own kernel mount it. With FUSE out of the loop, file operations inside the VM would stay inside the VM.

Before: app → guest VFS → virtiofs / FUSE boundary → host filesystem code → layer lookup / overlay logic → response back into the VM

After: app → guest VFS → guest overlayfs → guest EROFS → virtio-blk → cached block-backed image

Before, every lookup crossed the VM/host boundary. After, normal reads and lookups stay inside the guest kernel.

The filesystem we picked is EROFS: read-only, in-tree since the kernel needed it for Android, and easy to author. EROFS also solved the macOS problem: the VM's own kernel is Linux regardless of what's running outside it, so once the disk image is built, the host's filesystem stops mattering.

No mkfs, no mount, no helpers

microsandbox runs on both Linux and macOS, and macOS lacks the host-side tools you'd normally use to build a filesystem image: no mkfs.ext4, no mkfs.erofs, no loopback mounts. If our image pipeline depended on any of them, we'd either have to ship a helper VM (heavy, slow to start) or live with a permanent split between platforms, and neither option fit microsandbox's "single self-contained binary" promise. So we...

