How ScyllaDB's Trie-Based Index Delivers Up to 3X More Throughput - ScyllaDB
See all blog posts
How ScyllaDB’s Trie-Based Index Delivers Up to 3X More Throughput
By Petr Hala, Tzach Livyatan, Michał Chojnowski<br>June 30, 2026
BlueskyRedditXLinkedIn
RSS
Subscribe to Receive Blog Updates
Subscription Categories
All<br>Posts
Events
Integrations
News
Performance
Releases
ScyllaDB
Seastar
Tutorials
User<br>Stories
Developers<br>Blog
BlueskyRedditXLinkedIn
RSS
By transitioning from separate summary and index files to a prefix tree, we optimized cache efficiency, reduced disk I/O, and reduced memory overhead
Trie-based SSTable index format was added in ScyllaDB 2025.4. Since then, it has evolved and matured to become the default index format in ScyllaDB 2026.2.
In this post, we deep dive into the format change, present its pros and cons, and show our latest benchmark results of the legacy vs. Trie index formats.
For benchmarking, we chose four different read workloads that would benefit from the Trie index format in different degrees. For all four, Trie indexes demonstrated better performance. They achieved 20% to 230% higher throughput and 31% to 63% lower latency compared to legacy indexes. The impact of Trie index on the write path is negligible.
Note that for use cases with either very low cache usage or 100% row-cache hit rate, the performance gain is expected to be lower. However, these use cases are unlikely in production.
Trie Index Usage in ScyllaDB
Before explaining the new format, we will cover the legacy index format and its challenges.
Legacy Three-Layer Lookup (me/md format)
Until ScyllaDB 2026.2 the default storage format was the SSTable version md and me.
Every SSTable lookup in the me/md format traverses three or four structures:
Summary.db (entirely in RAM)
Binary search in Index.db
Sequential read in Data.db
Both the partition index and the clustering-row index are stored in Index.db. The partition index is partially represented in memory by the sampled Summary.db, while the clustering-row index exists as a promoted index for large partitions.
┌──────── MEMORY ──────────────────────────────┐<br>│ │<br>│ Summary.db (entirely in RAM) │<br>│ ───────────────────────────── │<br>│ Sampled at ~1 byte per ~2000 bytes of │<br>│ Data.db │<br>│ │<br>│ "aardvark" → Index.db byte 0 │<br>│ "kangaroo" → Index.db byte 1,048,576 │<br>│ "platypus" → Index.db byte 2,097,152 │<br>│ "zebra" → Index.db byte 3,145,728 │<br>│ │<br>└──────────────────┬───────────────────────────┘<br>binary search → window in Index.db<br>┌──────── DISK ────────────────────────────────┐<br>│ │<br>│ Index.db │<br>│ ───────── │<br>│ "kangaroo" → Data.db: 4,096,000 │<br>│ "koala" → Data.db: 4,097,280 │<br>│ "kookaburra" → Data.db: 4,098,560 │<br>│ "lemur" → Data.db: 4,099,840 ← found │<br>│ ... (up to ~800 entries per 1 MB scan) │<br>│ │<br>└──────────────────┬───────────────────────────┘<br>1 seek + sequential read<br>┌──────── DISK ────────────────────────────────┐<br>│ Data.db │<br>│ │<br>└──────────────────────────────────────────────┘<br>For partitions containing enough clustering rows, a fourth structure is involved:
Index.db entry for a large partition<br>┌──────────────────────────────────────────┐<br>│ partition key → Data.db offset │<br>│ promoted_index (flat list of CK blocks) │<br>│ block 0: ck_start="aaa", ck_end="azz" │<br>│ block 1: ck_start="baa", ck_end="bzz" │<br>│ ... │<br>│ block N: ck_start=..., ck_end=... │<br>│ offsets[0..N] ← binary search here │<br>└──────────────────────────────────────────┘<br>The New Trie Index Format
The Trie index (#25626) replaces Summary.db + Index.db with a single on-disk prefix tree. The storage format is compatible with Apache Cassandra’s BTI (Big Trie Index) format, implemented using ScyllaDB’s Seastar architecture.
Trie indexes are used for both partition indexes and clustering key indexes.
What Is a Trie?
A trie (prefix tree) stores keys character-by-character. Shared prefixes occupy a single path, eliminating redundancy:
Keys: "kangaroo", "koala", "kookaburra", "lemur", "lion"
[root]<br>/ \<br>'k' 'l'<br>| |<br>[k] [l]<br>/ \ / \<br>'a' 'o''e' 'i'<br>| | | |<br>'n' [o]'m' 'i'<br>| / \ | |<br>'g''a' 'o''u' 'o'<br>| | | | |<br>'a''l' 'k''r' 'n'<br>| | | *<br>'r''a' 'a' * = leaf node<br>| * | (payload: Data.db offset)<br>'o' 'b'<br>| |<br>'o' 'u'<br>* |<br>'r'<br>'r'<br>'a'
"k", "ko", "koo" — shared prefixes stored ONCE<br>New SSTable Files
The ms/mt format replaces Summary.db and Index.db with two purpose-built files:
SSTable (ms format)<br>├── Data.db<br>│ unchanged — partition and row data,<br>│ same binary layout as me/md<br>├── Partitions.db ← NEW<br>│ Trie index:<br>│ partition key → Data.db offset<br>│ (small partitions)<br>│ partition key → Rows.db offset<br>│ (large partitions)<br>│ ┌── Page 0 (4,096 bytes) ───────┐<br>│ │ trie root node [1] │<br>│ │ + children (fan-out ≤ 256) │<br>│ │ + their children (packed) │<br>│ └───────────────────────────────┘<br>│ ┌── Page 1 (4,096 bytes) ───────┐<br>│ │ subtree for keys 'a'–'g' │<br>│ └───────────────────────────────┘<br>│ ...<br>│ ┌── Footer ─────────────────────┐<br>│ │ first_key (raw bytes) │<br>│ │ last_key (raw bytes) │<br>│ │ partition_count...