ScyllaDB's Trie-Based Index Delivers Up to 3X More Throughput

eatonphil1 pts0 comments

How ScyllaDB's Trie-Based Index Delivers Up to 3X More Throughput - ScyllaDB

See all blog posts

How ScyllaDB’s Trie-Based Index Delivers Up to 3X More Throughput

By Petr Hala, Tzach Livyatan, Michał Chojnowski<br>June 30, 2026

BlueskyRedditXLinkedIn

RSS

Subscribe to Receive Blog Updates

Subscription Categories

All<br>Posts

Events

Integrations

News

Performance

Releases

ScyllaDB

Seastar

Tutorials

User<br>Stories

Developers<br>Blog

BlueskyRedditXLinkedIn

RSS

By transitioning from separate summary and index files to a prefix tree, we optimized cache efficiency, reduced disk I/O, and reduced memory overhead

Trie-based SSTable index format was added in ScyllaDB 2025.4. Since then, it has evolved and matured to become the default index format in ScyllaDB 2026.2.

In this post, we deep dive into the format change, present its pros and cons, and show our latest benchmark results of the legacy vs. Trie index formats.

For benchmarking, we chose four different read workloads that would benefit from the Trie index format in different degrees. For all four, Trie indexes demonstrated better performance. They achieved 20% to 230% higher throughput and 31% to 63% lower latency compared to legacy indexes. The impact of Trie index on the write path is negligible.

Note that for use cases with either very low cache usage or 100% row-cache hit rate, the performance gain is expected to be lower. However, these use cases are unlikely in production.

Trie Index Usage in ScyllaDB

Before explaining the new format, we will cover the legacy index format and its challenges.

Legacy Three-Layer Lookup (me/md format)

Until ScyllaDB 2026.2 the default storage format was the SSTable version md and me.

Every SSTable lookup in the me/md format traverses three or four structures:

Summary.db (entirely in RAM)

Binary search in Index.db

Sequential read in Data.db

Both the partition index and the clustering-row index are stored in Index.db. The partition index is partially represented in memory by the sampled Summary.db, while the clustering-row index exists as a promoted index for large partitions.

┌──────── MEMORY ──────────────────────────────┐<br>│ │<br>│ Summary.db (entirely in RAM) │<br>│ ───────────────────────────── │<br>│ Sampled at ~1 byte per ~2000 bytes of │<br>│ Data.db │<br>│ │<br>│ "aardvark" → Index.db byte 0 │<br>│ "kangaroo" → Index.db byte 1,048,576 │<br>│ "platypus" → Index.db byte 2,097,152 │<br>│ "zebra" → Index.db byte 3,145,728 │<br>│ │<br>└──────────────────┬───────────────────────────┘<br>binary search → window in Index.db<br>┌──────── DISK ────────────────────────────────┐<br>│ │<br>│ Index.db │<br>│ ───────── │<br>│ "kangaroo" → Data.db: 4,096,000 │<br>│ "koala" → Data.db: 4,097,280 │<br>│ "kookaburra" → Data.db: 4,098,560 │<br>│ "lemur" → Data.db: 4,099,840 ← found │<br>│ ... (up to ~800 entries per 1 MB scan) │<br>│ │<br>└──────────────────┬───────────────────────────┘<br>1 seek + sequential read<br>┌──────── DISK ────────────────────────────────┐<br>│ Data.db │<br>│ │<br>└──────────────────────────────────────────────┘<br>For partitions containing enough clustering rows, a fourth structure is involved:

Index.db entry for a large partition<br>┌──────────────────────────────────────────┐<br>│ partition key → Data.db offset │<br>│ promoted_index (flat list of CK blocks) │<br>│ block 0: ck_start="aaa", ck_end="azz" │<br>│ block 1: ck_start="baa", ck_end="bzz" │<br>│ ... │<br>│ block N: ck_start=..., ck_end=... │<br>│ offsets[0..N] ← binary search here │<br>└──────────────────────────────────────────┘<br>The New Trie Index Format

The Trie index (#25626) replaces Summary.db + Index.db with a single on-disk prefix tree. The storage format is compatible with Apache Cassandra’s BTI (Big Trie Index) format, implemented using ScyllaDB’s Seastar architecture.

Trie indexes are used for both partition indexes and clustering key indexes.

What Is a Trie?

A trie (prefix tree) stores keys character-by-character. Shared prefixes occupy a single path, eliminating redundancy:

Keys: "kangaroo", "koala", "kookaburra", "lemur", "lion"

[root]<br>/ \<br>'k' 'l'<br>| |<br>[k] [l]<br>/ \ / \<br>'a' 'o''e' 'i'<br>| | | |<br>'n' [o]'m' 'i'<br>| / \ | |<br>'g''a' 'o''u' 'o'<br>| | | | |<br>'a''l' 'k''r' 'n'<br>| | | *<br>'r''a' 'a' * = leaf node<br>| * | (payload: Data.db offset)<br>'o' 'b'<br>| |<br>'o' 'u'<br>* |<br>'r'<br>'r'<br>'a'

"k", "ko", "koo" — shared prefixes stored ONCE<br>New SSTable Files

The ms/mt format replaces Summary.db and Index.db with two purpose-built files:

SSTable (ms format)<br>├── Data.db<br>│ unchanged — partition and row data,<br>│ same binary layout as me/md<br>├── Partitions.db ← NEW<br>│ Trie index:<br>│ partition key → Data.db offset<br>│ (small partitions)<br>│ partition key → Rows.db offset<br>│ (large partitions)<br>│ ┌── Page 0 (4,096 bytes) ───────┐<br>│ │ trie root node [1] │<br>│ │ + children (fan-out ≤ 256) │<br>│ │ + their children (packed) │<br>│ └───────────────────────────────┘<br>│ ┌── Page 1 (4,096 bytes) ───────┐<br>│ │ subtree for keys 'a'–'g' │<br>│ └───────────────────────────────┘<br>│ ...<br>│ ┌── Footer ─────────────────────┐<br>│ │ first_key (raw bytes) │<br>│ │ last_key (raw bytes) │<br>│ │ partition_count...

index trie format data scylladb partition

Related Articles