A new metadata engine for high performance Object Storage (not LSM, not B+ tree)

rustshellscript1 pts0 comments

Metadata Engine for Our Object Storage : from LSM Tree to Fractal ART | FractalBits Blog<br>When we started building FractalBits1 object storage, we assumed we’d use RocksDB for metadata engine like everyone else. But as we prototyped, we kept hitting the same friction: LSM engines do not natively model hierarchical structure in keys — comparisons are lexicographic byte-wise, regardless of semantic boundaries like path separators. While RocksDB supports prefix extractors and custom comparators, these are fundamentally advisory optimizations; the core data structure still reasons about keys as flat byte sequences. Object paths like /bucket/datasets/2024/train/batch_001.parquet have rich hierarchical structure that these engines can’t exploit. The mismatch creates real problems — redundant prefix comparisons, costly directory renames, multi-hop path resolution.

This post explains why we abandoned the LSM approach for metadata and built Fractal ART (Adaptive Radix Tree) — an on-disk radix tree that splits data along path boundaries. The result: O(path_length) lookups instead of the repeated prefix comparisons inherent in binary searches (O(path_length * log N)), atomic directory renames via single pointer updates, and no directory contention under concurrent writes. We’ll also cover what we gave up by going custom, because every design has tradeoffs.

The Metadata Landscape Today

Most (object) storage systems fall into one of these categories for metadata:

LSM-backed key-value stores. The dominant approach. Apache Cassandra, CockroachDB, TiDB, Ozone, and Ceph (BlueStore) all use RocksDB or similar LSM engines under the hood. SeaweedFS defaults to LevelDB for its filer metadata, though it supports pluggable backends (Redis, Postgres, MySQL, etc.). The LSM approach is battle-tested, and RocksDB does offer prefix seek to skip irrelevant SSTables during prefix range scans. But these are advisory optimizations — the core data structure still reasons about keys as flat byte sequences, and the hierarchical structure of paths isn’t something the engine can exploit for comparisons, renames, or locality.

External databases. Some newer systems offload metadata entirely — 3FS and Tigris Data use FoundationDB, avoiding the need to build a metadata engine at all. This trades one set of problems (building a metadata layer) for another (depending on an external system’s availability, performance characteristics, and operational complexity). It’s also worth noting that FoundationDB itself is a sorted key-value store — it still treats keys as flat byte sequences, so the redundant prefix comparison problem described in this post applies at the storage node level. You’re outsourcing metadata management, not solving the structural mismatch.

Local filesystem. MinIO stores metadata as xl.meta files directly on the local filesystem, and NVIDIA AIStore takes a similar approach using filesystem extended attributes (xattrs) per object. This is elegant in its simplicity, but ties metadata performance to filesystem behavior: listing operations become expensive directory traversals, there’s no way to separate metadata onto faster storage tiers, and on many filesystems, having millions of files in one directory can degrade listing performance — because MinIO and NVIDIA AIStore map S3 object prefixes into filesystem structures, extreme flat layouts can hit those same limitations.

We also considered B+ trees2, but the same core problem applies — directory renames still require rewriting all affected keys, and prefix compression doesn’t eliminate redundant comparisons during lookups.

Two Ways to Index Object Paths (Both Problematic)

Full-Path Keys

The simplest model: store /bucket/images/2024/photo.jpg as a flat key in the KV store.

Key -> Value (metadata + data location)<br>/bucket/images/2024/photo.jpg -> {size: 2MB, etag: "abc123", ...}<br>/bucket/images/2024/video.mp4 -> {size: 50MB, etag: "def456", ...}<br>/bucket/images/2025/doc.pdf -> {size: 1MB, etag: "ghi789", ...}

This is the approach GFS originally took — a single in-memory lookup table mapping full pathnames to metadata on the master, with no inodes or per-directory structures. It worked until it didn’t: the single master’s memory became the bottleneck at petabyte scale, which led Google to replace it with Colossus (backed by Bigtable). CalvinFS also uses full-path keys. It’s simple — one lookup per object — but has two structural problems in LSM trees:

Redundant prefix comparisons. Looking up /bucket/images/2024/photo.jpg means comparing it byte-by-byte against /bucket/images/2024/video.mp4, re-processing the shared /bucket/images/2024/ prefix on every comparison. Across a binary search through thousands of keys, that’s a lot of wasted CPU on the same bytes. Prefix compression helps on disk, but doesn’t reduce comparison cost.

Costly directory renames. Renaming /bucket/images/ to /bucket/photos/ means rewriting every descendant key — potentially millions of deletes and inserts....

metadata prefix bucket object keys directory

Related Articles