Distributed Compaction in SlateDb

riccomini1 pts0 comments

Distributed Compaction in SlateDb — Ryan Dielhenn<br>Guide

Preamble

This guide assumes familiarity with SlateDb and how compaction works; if you’d like that context first, skip down to Background and the sections that follow, then come back here.

Running SlateDb itself is out of scope for this blog, but here are some helpful resources for anyone interested (directly from the SlateDb website):

Connect to Azure Blob Storage

Connect to S3

Connect to Google Cloud Storage

This writeup is about running distributed compaction for SlateDb but anything else you’d like to know is on the SlateDb website.

Running external/distributed compaction

Historically compaction ran as a single process either embedded in the writer or as a standalone process via cli.

slatedb --env-file .env --path run-compactor

The cli sub-command run-compactor still exists if you’d like to run compaction as an entirely separate process outside of the DB writer. Now, you may also disable the embedded compaction worker to decouple compaction scheduling/coordination from the running of actual compaction jobs by adding the --no-embedded-worker flag.

slatedb --env-file .env --path run-compactor --no-embedded-worker

You should not start more than one compaction coordinator. Doing so will fence the writer and halt all DB operations. This is behavior that existed pre distributed compaction and is expected.

However, you may now run multiple workers separately via the run-worker sub-command of the slatedb cli.

slatedb --env-file .env --path run-worker

We’ve discussed distributing and scaling coordination but that is out of scope for this work.

Background

SlateDb is an embedded key-value store built on object storage. It uses a Log-Structured Merge Tree, or LSM for short, to batch writes to object storage to reduce write latency. Incoming writes land in an in-memory buffer called the memtable. Once the memtable fills up, it is flushed to an immutable Sorted String Table (SST) in object storage. Freshly flushed SSTs land in L0. From there, SlateDb uses size-tiered compaction, where SSTs are grouped by size and merged together as each tier fills up. LSM trees let you tune the tradeoffs between read, write, and space amplification. I recommend reading this blog by Almog Gavra if you want know more about these tradeoffs.

The following is what you would see if you listed the contents of an object storage bucket path used by SlateDb:

manifest/<br>00000000001.manifest # This is a snapshot of the database state:<br>00000000002.manifest # SST lists, watermarks, epochs, external dbs, checkpoints etc.<br>00000000003.manifest<br>...<br>compactions/<br>00000000001.compactions # Jobs scheduled for compaction by the<br>00000000002.compactions # compaction coordinator.<br>00000000003.compactions<br>...<br>compacted/<br>.sst # This is compacted data<br>.sst # L0 ssts also happen to live here which have not been compacted yet<br>.sst<br>...<br>wal/ # This is the write ahead log. Writes land here first so that they can be replayed in a failure scenario.<br>00000000001.sst<br>00000000002.sst<br>00000000003.sst<br>gc/<br>manifest.boundary # Garbage collector deletes .manifest versions at or below this Boundary<br>compactions.boundary # Garbage collecor deletes .compactions files at or below this Boundary

All of this is hidden from the user under simple put/get/scan APIs.

Compaction

Compaction is a critical background process of the LSM tree that takes Sorted String Tables (SST for short) and merges them to produce an output SST with non-repeating keys. This process does a few things. When multiple SSTs share keys, merging them removes duplicate entries and cleans up tombstones left behind by deletes, reducing space amplification. It also reduces the number of SSTs that need to be read to find a key i.e. reduces read amplification.

Distributed Compaction

A single compactor is a bottleneck: if it cannot keep pace with write throughput, the whole system degrades in two stages. First, as uncompacted SSTs pile up in L0, more files need to be scanned to find a key, increasing read latency. Then, once the L0 file count reaches l0_max_ssts, the flusher stops writing immutable memtables to L0. Those memtables accumulate in memory until max_unflushed_bytes is exceeded, at which point SlateDb applies backpressure that stalls writes from being durably written to object storage. A lagging compactor therefore degrades read latency first, then write throughput.

Future benefits and work

The door is wide open for future enhancements that take advantage of these stateless compaction workers. Below are just a few examples of extensions made possible by the stateless workers added in RFC-0025.

L0 Compaction Watermark

Ideally we want to parallelize compaction work, but it helps to separate two axes of parallelism that are easy to conflate.

The first is running independent compactions at the same time e.g. an L0→SR compaction in the same segment, or L0 compactions in two different segments. RFC-24 made this safe by giving each...

compaction slatedb compactions storage distributed ssts

Related Articles