Toward Zero-Copy OCI Layers | Amla Labs
Container images are layered, but most runtimes don’t keep them that way at the storage edge. They unpack tarballs into one shared graph driver, walk those layers with an overlay implementation that lives in the host kernel, and bind-mount the result into the workload.<br>Kalahari takes a different approach. Each OCI layer becomes its own EROFS image on the host, addressed by the SHA256 digest of the original layer blob. Those images are exposed to the guest as virtio-pmem-backed ranges with DAX. The guest stacks them into the workload’s root filesystem with ordinary overlayfs.<br>This post is about why “one EROFS per layer” is a bigger choice than it sounds, and what falls out of it.<br>What’s Wrong with Flattening<br>The simplest path is to flatten an OCI image into a single root filesystem. Pull each layer in order, apply each on top of the last, write out one tree. That tree is what the workload mounts.<br>Flattening sounds like a one-line story until you ask:<br>A second image shares 90% of its layers with the first one. Do you re-fetch and re-flatten every time?<br>A layer hash you’ve already converted appears in a third image. Do you remember the conversion?<br>A multi-gigabyte base image needs to land in memory. Where does the unpacked tree live during conversion?<br>The honest answers are: yes, no, and “in RAM, hopefully.” Caching by image identity instead of by layer identity throws away the structure that registries already give you.<br>Kalahari’s storage layer keeps that structure. The unit of caching is the layer, keyed by the OCI blob digest: the SHA256 of the compressed layer blob the registry already published, not the uncompressed diffID. The builder decompresses that blob to build the EROFS image, but the cache key remains the registry blob digest. 
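To make layer-keyed caching concrete, here is a minimal Python sketch. It is not Kalahari’s code: `LayerCache`, `get_or_build`, `fake_convert`, and the on-disk naming are hypothetical, and the “conversion” is a placeholder that only prefixes the bytes instead of building EROFS. The one thing it takes from the text above is that the cache key is the SHA256 of the compressed blob.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def blob_digest(compressed_blob: bytes) -> str:
    """Digest of the compressed layer blob, in the registry's 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(compressed_blob).hexdigest()


class LayerCache:
    """Hypothetical content-addressed store: one converted image per blob digest."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.conversions = 0  # counts cache misses, for illustration only

    def path_for(self, digest: str) -> Path:
        algo, _, hex_part = digest.partition(":")
        return self.root / f"{algo}-{hex_part}.erofs"

    def get_or_build(
        self,
        digest: str,
        fetch: Callable[[], bytes],
        convert: Callable[[bytes], bytes],
    ) -> Path:
        path = self.path_for(digest)
        if path.exists():
            return path  # hit: no fetch, no conversion
        blob = fetch()
        if blob_digest(blob) != digest:
            raise ValueError(f"blob does not match {digest}")
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(convert(blob))
        tmp.replace(path)  # publish only after the image is fully written
        self.conversions += 1
        return path


def fake_convert(blob: bytes) -> bytes:
    """Placeholder for the real tar-to-EROFS conversion (assumption)."""
    return b"EROFS-PLACEHOLDER" + blob


def demo() -> int:
    """Import two images that share two layers; return how many conversions ran."""
    abc, def_, ghi, jkl = b"layer-abc", b"layer-def", b"layer-ghi", b"layer-jkl"
    images = {"python:3.12-slim": [abc, def_, ghi], "python:3.13-slim": [abc, def_, jkl]}
    with tempfile.TemporaryDirectory() as tmp:
        cache = LayerCache(Path(tmp))
        for layers in images.values():
            for blob in layers:
                cache.get_or_build(blob_digest(blob), lambda b=blob: b, fake_convert)
        return cache.conversions
```

In this sketch the two images reference six layers between them but trigger only four conversions, because the image tag never enters the key.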
Two images that share a layer share its EROFS blob on disk, byte for byte.<br>flowchart LR<br>subgraph py312["python:3.12-slim"]<br>L1["layer sha256:abc..."]<br>L2["layer sha256:def..."]<br>L3["layer sha256:ghi..."]<br>end<br>subgraph py313["python:3.13-slim"]<br>L4["layer sha256:abc..."]<br>L5["layer sha256:def..."]<br>L6["layer sha256:jkl..."]<br>end<br>subgraph store["Content-addressed blob store"]<br>B1["blobs/abc... (EROFS)"]<br>B2["blobs/def... (EROFS)"]<br>B3["blobs/ghi... (EROFS)"]<br>B4["blobs/jkl... (EROFS)"]<br>end<br>L1 --> B1<br>L2 --> B2<br>L3 --> B3<br>L4 -. reused .-> B1<br>L5 -. reused .-> B2<br>L6 --> B4<br>The cache hit is content-addressed. It does not depend on the image tag, the registry, or any metadata that can drift.<br>Why EROFS<br>EROFS is a read-only filesystem in the upstream Linux kernel, designed for compact images that mount fast. For a sandbox that wants to expose immutable image data to a guest VM, it’s a good fit for reasons that aren’t obvious until you go looking:<br>It’s read-only by design, which matches the “this layer never changes” semantics of an OCI layer exactly.<br>The on-disk layout is compact: small files can be tail-packed inline into inode metadata so they don’t pay a block-allocation cost.<br>The block size can be set at build time, so it can be aligned with the guest page size.<br>It supports DAX. Pages of an EROFS image, when backed by persistent memory or a virtio-pmem device, can be mapped directly into the guest’s page table without going through a guest page cache.<br>That last property is the one that decides the host/guest hand-off.<br>Streaming the Build<br>Before any of that matters, you have to actually build the EROFS image. This is where naïve approaches fall over.<br>A base image layer can be hundreds of megabytes uncompressed. Multiple layers can be in flight at once. 
If the builder buffers the entire layer in memory, importing a real image starts to look like an OOM test.<br>Kalahari’s EROFS builder is streaming. Tar entries arrive one at a time from a decompressed stream. File data is written straight through to the output as the entry is consumed. The builder never holds the full file body in memory, regardless of file size.<br>The on-disk layout is “data first, metadata last.” File blocks are written as they stream in. Inode tables, directory entries, and the superblock pointer to the metadata region are emitted at the end, after all file data has been placed. Memory usage is O(metadata): proportional to the number of inodes, not the total bytes.<br>flowchart LR<br>HTTP["HTTP fetch"] --> GZ["gunzip stream"] --> TAR["tar entries"] --> BLD["push_file(path, meta, size, reader)"]<br>BLD --> O1["file blocks streamed straight to output"]<br>BLD --> O2["inode table updated incrementally"]<br>BLD --> O3["on finalize: superblock + meta region"]<br>The builder API is small enough to describe in a sentence: hand it a tar reader and a writer that supports finalization, and it produces an EROFS image. There is no intermediate step where a flattened directory tree exists on disk or in RAM.<br>Whiteouts Are Translated, Not Merged<br>OCI represents whiteouts as tar entries named .wh.NAME and .wh..wh..opq; overlayfs represents them on disk as 0/0 character...