My LSM tree was slower than a B-tree. Then I profiled it

aasheeshrathour1 pts0 comments

My LSM Tree Was Slower Than a B-Tree. Then I Profiled It. | by Aasheesh Rathour | Jun, 2026 | MediumSitemapOpen in appSign up<br>Sign in

Medium Logo

Get app<br>Write

Search

Sign up<br>Sign in

My LSM Tree Was Slower Than a B-Tree. Then I Profiled It.

Aasheesh Rathour

8 min read·<br>1 hour ago

Listen

Share

A few weeks ago I wanted to understand how the storage engine inside RocksDB actually works. Not read about it. Build it.<br>RocksDB, LevelDB, Cassandra, TiKV, CockroachDB — they all sit on top of the same idea: the log-structured merge tree. Every backend engineer has used a database built on an LSM tree. Almost none of us have read one we could actually understand, because the production ones are hundreds of thousands of lines.<br>So I built one in Go. Then I benchmarked it and it was embarrassingly slow — 250,000 writes per second when BoltDB, the pure-Go B-tree, was doing more.<br>This is the story of getting it from 250k to nearly 2 million writes per second. Every optimization came from profiling, not guessing. Each fix revealed the next bottleneck. That part — watching the bottleneck move as you chase it — is the actual skill, and nobody writes it down.

The villain: why B-trees choke on writes<br>A B-tree keeps your data sorted on disk at all times. Every insert finds the right spot in the tree and writes it there. That means every write is a random write — the disk has to seek to a different location each time.<br>On a write-heavy workload, random writes are the enemy. Whether you’re on a spinning disk or an SSD, sequential writes are dramatically faster than scattered ones.<br>You can see it in the numbers. BoltDB is a well-engineered B-tree, and on sequential writes it does 2.77 million per second. On random writes it collapses to 234,000 — more than a 10x drop. Same database, same machine. The only difference is the write pattern.<br>That collapse is the hole LSM trees exist to fill.

The fix: never write randomly<br>The LSM tree idea is simple. Don’t write to disk on every insert. Write to an in-memory sorted structure instead — that’s instant. When that structure fills up, dump the entire thing to disk in one sequential write.<br>The in-memory part is called the MemTable. The on-disk files are SSTables — Sorted String Tables, immutable once written. You never modify an SSTable. You just write new ones and merge old ones together in the background, a process called compaction.<br>The trade-off: reads get harder. Your data is now spread across the MemTable plus a pile of SSTables. A read might have to check all of them. Managing that read cost — with bloom filters, sparse indexes, and levels — is the entire engineering challenge of an LSM tree.<br>I built all of that. And then the slow part started.

Baseline: 250k writes/sec, and a profiler<br>The first working version did 250,000 sequential writes per second and 120,000 random. Slower than the B-tree I was supposed to beat.<br>I ran pprof. The picture was clear and ugly:<br>Write syscall: 34% of CPU<br>Garbage collection: ~35%<br>Sorting: ~13%<br>Three bottlenecks eating 80% of the time. So I went after them one at a time.<br>Press enter or click to view image in full size

Bottleneck 1: the write syscall (34% → 2.16%)<br>Every batch of writes called file.Write on the write-ahead log. That's a syscall — a trip into the kernel and a data copy — on every single batch. At 34% of CPU, it was the biggest single cost.<br>The fix was to memory-map the WAL file. Instead of calling file.Write, I mmap the whole file at startup and write into it with a plain copy() into the mapped region. No syscall. The data lands in the page cache directly, and a background goroutine handles the actual fsync to disk.<br>Write syscall went from 34% of CPU to 2.16%. The residual is just the page-fault handler and occasional writeback. This was the single biggest win in the whole project.

Bottleneck 2: GC and sorting (48% combined)<br>With the syscall gone, profiling pointed at compaction. The way I’d written it, compaction loaded every entry from the SSTables being merged into one giant slice, sorted the whole thing, then deduplicated it. That slice allocation plus the sort was almost half the CPU time — GC churning through millions of temporary entries, and sort.Slice on top.<br>The fix was a streaming k-way merge. The SSTables are already sorted individually. So instead of loading everything and sorting, I use a min-heap (container/heap) that pulls the next-smallest key across all the input files one at a time. No giant slice. No sort. Constant memory regardless of how much data is being compacted.<br>GC overhead from compaction disappeared. The sort disappeared entirely.<br>And this is where I hit the bug that taught me the most.

The bloom filter bug that made reads worse<br>A bloom filter is what makes LSM reads fast. Before reading an SSTable from disk, you check its bloom filter — a small probabilistic structure that tells you “this key is definitely not in this file” or “it might be.” If it says definitely not, you skip the file entirely...

write tree writes disk from syscall

Related Articles