Making POSIX filesystems replicated and highly available — ZeroFS
Engineering · High availability
Making POSIX filesystems replicated and highly available
June 28, 2026·<br>Pierre Barre·<br>11 min read
ZeroFS stores a POSIX filesystem in a log-structured database that lives in an S3 bucket. Running that on one node is straightforward. Surviving the loss of the node is harder than it first looks, because the obvious ways to add a second one quietly weaken durability, and on a filesystem durability is the entire contract. Downtime is recoverable; a write the caller already saw acknowledged is not. Holding that line across a failover is most of what high availability in ZeroFS comes down to.
Where a single node runs out
Object storage is durable but slow and charged per request, so committing every filesystem operation as its own upload would saddle each write with tens of milliseconds and a billed PUT. ZeroFS does what any LSM engine does and buffers writes in an in-memory memtable, flushing them to the bucket in batches: on fsync, when the memtable fills, and on a periodic timer. A write() returns as soon as the data reaches the memtable, and fsync() is what pushes it the rest of the way to S3.
That leaves a single node exposed in two unrelated ways. Writes that have been acknowledged but not yet flushed exist only in that node's memory, so a crash before the next flush loses them; POSIX allows this, since the durability boundary is fsync rather than write, but the gap between the two is real. And the data that has already reached S3, while perfectly safe, has nobody to serve it for as long as the node is down. The first problem is about un-flushed writes, the second about an absent server, and a standby node can answer both at once. The work is in adding that standby without reopening the durability hole the single node already had.
Only one writer, ever
Two nodes over one bucket can fall into split-brain: both decide they are the leader and write at the same time. On a filesystem that means corruption, and once two writers have interleaved their updates there is no untangling them afterward. The protection therefore cannot depend on good timing; it has to hold under arbitrary delays, network partitions, and clock skew.
ZeroFS is single-writer per database, and the writer is chosen by the storage layer rather than negotiated between the nodes. Opening the database for writing issues a conditional update to its manifest that increments a writer_epoch and fences whoever held it before; the previous writer's next manifest write fails outright. Only one node can commit durable state at a time. The fenced leader's in-flight SST files are left unreferenced and later reclaimed by garbage collection, so they never overwrite live data, and the leader stops the moment its first write is rejected. Both nodes can briefly believe they lead, but only one will ever succeed in committing; the other finds it has been fenced and stops. This holds without reference to any clock, and everything else is layered on top of it.
Fencing relies on a conditional write (put-if-not-exists), which S3, Azure Blob, and Google Cloud Storage support directly; for S3-compatible stores that don't, ZeroFS reaches the same guarantee through Redis. A node also asks its peer which of them is currently active at startup instead of trusting its configured role, so two writers can never come up together from cold.
Why two nodes, not three
The instinct from Raft and Paxos is that fault-tolerant agreement wants an odd number of participants: three nodes so a majority survives the loss of one, five to survive two. That arithmetic exists because the cluster has to agree with itself about who leads and which writes are committed, and a majority is how it breaks a tie without splitting. ZeroFS holds no such vote. The one fact the whole scheme rests on, which node currently holds the writer epoch, is settled by a single compare-and-swap against the manifest in the bucket, and it is the object store, not the nodes, that picks the winner.
A linearizable compare-and-swap register is on its own enough to solve consensus for any number of contenders; it is the strong primitive a protocol like Raft spends pages rebuilding out of weaker ones. The object store offers it directly through conditional writes. The agreement three machines would otherwise maintain between themselves therefore lives in the bucket, next to the committed data it is agreeing about. That leaves the ZeroFS nodes to do nothing but serve: one node is already correct on its own, and the second is there to take over and to hold the un-fsync'd tail, not to make up a quorum. Because there is no quorum to maintain, an even number of nodes is fine. If the bucket is unreachable the filesystem stops, but so would anything else reading from it, and the data inside is no less safe.
Keeping a standby in step
With that settled, the standby's job is to hold the writes the leader has accepted...