We made our filesystem 47× faster by deleting it


How we made our OCI filesystem 47× faster - microsandbox


A user in our Discord said microsandbox felt slow. Listing every file in the Python standard library took 5.3 seconds inside a sandbox; in Docker it took milliseconds. We went digging.
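A measurement along these lines can be reproduced with a few lines of Python. This is a minimal sketch, not the benchmark behind the 5.3-second figure: the helper names `count_files` and `time_listing` are made up for illustration, and it assumes "listing every file" means one recursive walk of the interpreter's standard-library directory.

```python
import os
import sysconfig
import time


def count_files(root):
    """Walk root recursively and count every non-directory entry."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


def time_listing(root):
    """Return (file_count, elapsed_seconds) for one full recursive listing."""
    start = time.perf_counter()
    n = count_files(root)
    return n, time.perf_counter() - start


if __name__ == "__main__":
    # Location of the standard library for the running interpreter.
    stdlib = sysconfig.get_paths()["stdlib"]
    n, elapsed = time_listing(stdlib)
    print(f"{n} files under {stdlib} in {elapsed:.3f}s")
```

Running the same script inside and outside a sandbox gives a rough comparison; a second run usually hits warm caches, so the first run is the one that reflects cold lookups.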

We fixed it in v0.4: we replaced our user-space filesystem with a Linux disk image that the VM mounts directly. The geometric mean speedup across our mixed guest-visible filesystem suite is 47×, with the worst-case rows more than 1,000× faster, and the host filesystem code is about 5,300 lines shorter.
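For readers unfamiliar with the statistic, here is a small sketch of how a geometric mean of speedups is computed. The four ratios are hypothetical placeholders, not rows from our suite; they only show why the geometric mean is the usual choice when a few rows are more than 1,000× faster and others are much less.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("need a non-empty list of positive ratios")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-benchmark speedups (old time / new time); not measured data.
speedups = [4.0, 25.0, 100.0, 1000.0]

print(round(geometric_mean(speedups), 1))       # about 56.2
print(round(sum(speedups) / len(speedups), 1))  # arithmetic mean: 282.2
```

The arithmetic mean of the same numbers is dominated by the single 1,000× row, while the geometric mean treats each row as a multiplicative factor, so one outlier moves it far less.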

Where this started

My first try was monofs: a content-addressed filesystem with block-level dedup, compression, and distributed read replicas. It stored images at 1.3× their original size on disk, and microsandbox is local-first, so the long-tail dedup payoff wasn't worth the up-front cost. For v0.3 I switched to OCI plus a user-space overlay built on a libkrun hook; we got layer dedup and identical behavior on Linux and macOS, but everything still ran outside the kernel.

Where the time was going

Every file operation inside the VM had to bounce out to the host through FUSE, which is Linux's mechanism for letting an ordinary program act as a filesystem. To open a file, the VM hands the request to our host process, which walks every layer looking for the file and sends the answer back; the same trip happens for every stat, every readdir, and every cache miss. A single Python import triggers dozens of these round trips before your code even starts running, and a ten-layer image multiplies the cost of each one.
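The layer walk can be pictured with a toy model. This is an illustrative sketch only, not the v0.3 code: each layer is a plain dictionary from path to contents, `lookup` is a hypothetical name, and real overlay logic also has to handle whiteouts and opaque directories, which are left out here.

```python
def lookup(layers, path):
    """Search layers from top-most to base; return (content, layers_probed)."""
    probed = 0
    for layer in layers:
        probed += 1
        if path in layer:
            return layer[path], probed
    # A miss is the worst case: every layer was checked before giving up.
    return None, probed


# Ten layers, top-most first; the file lives only in the base layer.
base = {"/usr/lib/python3/os.py": b"# os module"}
layers = [{} for _ in range(9)] + [base]

print(lookup(layers, "/usr/lib/python3/os.py")[1])   # 10 layers probed
print(lookup(layers, "/usr/lib/python3/nope.py")[1])  # 10 layers probed, and still a miss
```

In the model, a file in the base layer and a file that does not exist both cost one probe per layer, and in v0.3 each of those lookups also paid for a round trip out of the VM.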

We spent the next stretch of v0.3 trying to make that path faster: better caching, fewer syscalls, smaller responses. Each change shaved a few percent. None of them changed the order of magnitude.

Docker doesn't have this problem because Docker uses the kernel's own layered-filesystem driver (overlayfs), so file operations never leave the kernel. We were trying to match a kernel filesystem from outside the kernel; no cache could close that gap.

So we deleted the filesystem.

The new plan

The new plan was to stop bouncing every file operation between the VM and the host. We'd build a Linux filesystem image ahead of time, hand it to the VM as a virtual disk, and let the VM's own kernel mount it. With FUSE out of the loop, file operations inside the VM would stay inside the VM.

Before: app → guest VFS → virtiofs / FUSE boundary → host filesystem code → layer lookup / overlay logic → response back into the VM

After: app → guest VFS → guest overlayfs → guest EROFS → virtio-blk → cached block-backed image

Before, every lookup crossed the VM/host boundary. After, normal reads and lookups stay inside the guest kernel.

The filesystem we picked is EROFS: read-only, in the mainline kernel (it was originally built for Android devices), and easy to author. EROFS also solved the macOS problem: the VM's own kernel is Linux regardless of what's running outside it, so once the disk image is built, the host's filesystem stops mattering.

No mkfs, no mount, no helpers

microsandbox runs on both Linux and macOS, and macOS lacks the host-side tools you'd normally use to build a filesystem image: no mkfs.ext4, no mkfs.erofs, no loopback mounts. If our image pipeline depended on any of them, we'd either have to ship a helper VM (heavy, slow to start) or live with a permanent split between platforms, and neither option fit microsandbox's "single self-contained binary" promise. So we wrote the image writers ourselves in Rust. A filesystem is a byte layout on disk; the writers just produce that layout. Three small pieces do the work:

An EROFS writer that emits the read-only image of an OCI layer.

An ext4 writer that emits the sparse, journaled scratch area each sandbox gets.

A VMDK descriptor that stitches everything into one virtual disk.
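To make the third piece concrete: a VMDK descriptor is a small text file that lists extent files and their sizes in 512-byte sectors. The sketch below shows the general flat-extent shape only. The function name, the file names, and the `createType` default are assumptions for illustration, and the real microsandbox writer is Rust and may emit different fields.

```python
SECTOR = 512  # VMDK extent sizes are expressed in 512-byte sectors


def flat_extent_descriptor(extents, create_type="monolithicFlat", cid="fffffffe"):
    """Build a VMDK text descriptor that concatenates flat extent files.

    extents: list of (access, size_bytes, filename); access is "RW" or "RDONLY".
    create_type is an assumed default, not a value taken from microsandbox.
    """
    lines = [
        "# Disk DescriptorFile",
        "version=1",
        f"CID={cid}",
        "parentCID=ffffffff",
        f'createType="{create_type}"',
        "",
        "# Extent description",
    ]
    for access, size_bytes, filename in extents:
        if access not in ("RW", "RDONLY"):
            raise ValueError(f"{filename}: unsupported access mode {access!r}")
        if size_bytes % SECTOR:
            raise ValueError(f"{filename}: size must be a multiple of {SECTOR} bytes")
        # Trailing 0 is the byte offset of the extent data inside the file.
        lines.append(f'{access} {size_bytes // SECTOR} FLAT "{filename}" 0')
    return "\n".join(lines) + "\n"


# Hypothetical file names: one read-only image followed by a writable scratch area.
print(flat_extent_descriptor([
    ("RDONLY", 64 * 1024 * 1024, "rootfs.erofs"),
    ("RW", 4 * 1024 * 1024 * 1024, "scratch.ext4"),
]))
```

The guest sees the extents back to back as one block device, which is how a read-only image and a writable scratch area can share a single virtual disk.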

Nothing in the pipeline shells out, asks for root, or mounts a loopback device, and the same Rust code path builds the images on Linux and Apple Silicon without depending on host-only filesystem tools. The EROFS artifacts round-trip through a reader we also wrote, and CI boots the full stack under the real VM kernel. If a byte is wrong, two different readers tell us about it.
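"A filesystem is a byte layout" is easy to show at the smallest scale. Both EROFS and ext4 keep their superblock at byte offset 1024, and each carries a magic number at a fixed position inside it. The sketch below writes and checks only those magic numbers; `sniff` and `stamp_magic` are hypothetical helpers, and the buffers they produce are nowhere near mountable images.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # both EROFS and ext4 place their superblock here
EROFS_MAGIC = 0xE0F5E1E2  # little-endian u32 at the start of the EROFS superblock
EXT4_MAGIC = 0xEF53       # little-endian u16 at offset 0x38 in the ext4 superblock
EXT4_MAGIC_OFFSET = SUPERBLOCK_OFFSET + 0x38


def sniff(image: bytes) -> str:
    """Identify an image by superblock magic: 'erofs', 'ext4', or 'unknown'."""
    if len(image) >= SUPERBLOCK_OFFSET + 4:
        (magic,) = struct.unpack_from("<I", image, SUPERBLOCK_OFFSET)
        if magic == EROFS_MAGIC:
            return "erofs"
    if len(image) >= EXT4_MAGIC_OFFSET + 2:
        (magic,) = struct.unpack_from("<H", image, EXT4_MAGIC_OFFSET)
        if magic == EXT4_MAGIC:
            return "ext4"
    return "unknown"


def stamp_magic(kind: str, size: int = 4096) -> bytes:
    """Return a zero-filled buffer with only the magic written (not a valid image)."""
    buf = bytearray(size)
    if kind == "erofs":
        struct.pack_into("<I", buf, SUPERBLOCK_OFFSET, EROFS_MAGIC)
    elif kind == "ext4":
        struct.pack_into("<H", buf, EXT4_MAGIC_OFFSET, EXT4_MAGIC)
    else:
        raise ValueError(f"unknown filesystem kind: {kind}")
    return bytes(buf)


print(sniff(stamp_magic("erofs")), sniff(stamp_magic("ext4")), sniff(bytes(4096)))
```

A real writer fills in the rest of the superblock, the inode table, and the data blocks in the same fashion: fixed-width fields packed at known offsets. That is also what makes a round-trip check through an independent reader practical.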

The first cut

The obvious way to use these writers was one EROFS image per OCI layer. The VM would get one virtual disk per layer plus one for the scratch area, and the kernel's overlayfs would merge them at boot. It worked: the first measurements landed between 10× and 175× faster than v0.3 depending on the workload, and we were ready to ship.
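Roughly, the first cut asked the guest to do the following at boot. This is a sketch under assumptions: the mount points under `/run` and the function names are invented for illustration, and the post does not show the actual boot sequence. What is standard is the overlayfs option syntax, where `lowerdir` lists layers top-most first and `upperdir` and `workdir` must live on the same filesystem.

```python
import string


def virtio_blk_name(index):
    """Map 0 -> 'vda', 25 -> 'vdz', 26 -> 'vdaa', as the kernel names disks."""
    suffix = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        suffix = string.ascii_lowercase[rem] + suffix
    return "vd" + suffix


def overlay_options(layer_mounts, upperdir, workdir):
    """Build overlayfs mount options from layer mount points given base layer first.

    overlayfs expects the top-most lower layer first, so the list is reversed.
    """
    lower = ":".join(reversed(layer_mounts))
    return f"lowerdir={lower},upperdir={upperdir},workdir={workdir}"


def first_cut_plan(num_layers):
    """Return (layer_devices, scratch_device, overlay_option_string)."""
    devices = [f"/dev/{virtio_blk_name(i)}" for i in range(num_layers + 1)]
    # Hypothetical mount points; one read-only EROFS mount per layer disk.
    layer_mounts = [f"/run/layers/{i}" for i in range(num_layers)]
    options = overlay_options(layer_mounts, "/run/scratch/upper", "/run/scratch/work")
    return devices[:-1], devices[-1], options


layer_devs, scratch_dev, opts = first_cut_plan(3)
print(layer_devs, scratch_dev)
print(opts)
```

The device count grows linearly with the layer count: an image with N layers needs N + 1 virtual disks in this scheme.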

First cut: layer 1 → /dev/vda, layer 2 → /dev/vdb, layer 3 → /dev/vdc, ..., layer 30 → /dev/vd?

One EROFS image per OCI layer. Python images attached ~10 disks; some custom builds pushed past the microVM's virtio device cap.

Then we counted the layers. A Python image runs around ten; CUDA images more; some user-built ones push thirty or forty. microVMs cap how many devices they...
