2.6s to 89ms — Random High-Cardinality Lookups in OpenObserve
> Last time, [two config changes took a count query from **49 seconds to 2 seconds**](/blog/openobserve-tantivy-49s-to-2s).
>
> But that win has a quiet asterisk. For one specific shape of query (a random high-cardinality lookup, like searching by `trace_id`), neither of those changes does anything. The query still scans every file. **That's the query this post is about.**

---

> **Previous in series**
>
> **[How we cut a query from 49 seconds to 2 seconds.](/blog/openobserve-tantivy-49s-to-2s)** Raise `ZO_COMPACT_MAX_FILE_SIZE` from 1 GB to 10 GB; turn on the tantivy footer cache. The same 2 TB dataset, the same count query, ~25× faster. If you haven't read it, that's the story of how we collapsed an order of magnitude of S3 round trips. This post is the follow-up: what happens when even that isn't enough.

## The query that didn't get fixed

The first post's win lived in a particular regime: *filtered queries on fields where the filter is selective and the index can range-prune*. Once compact files were big and tantivy footers were cached, the work shrank to "open a few files, hit the index, read the matching rows." A clean order-of-magnitude improvement.

Then we ran a different query. A single matching `trace_id` over a stream called `benchtest` (170 parquet files for the hour, ~14.6 GB of tantivy index files, no disk cache). The query that, by intuition, should be the *easiest*: needle in a haystack, one needle, exactly one matching row.

**It took 2,584 milliseconds.**

That number is not catastrophic in isolation. But it's not what you'd expect from an indexed lookup that returns one row. And nothing in the previous post's toolkit fixed it. Bigger compact files? Doesn't help: there's nothing to compact away. Footer cache? Already on.
We were squarely in the "open the index" stage, and the index opens were the cost.

The reason is mechanical, and once you see it, the whole layer below this post falls into place.

## Why tantivy can't range-prune a random ID

Tantivy keeps a small sparse index per `.ttv` file: it remembers the lowest and highest term in each file. When you search for a value, tantivy's footer (in memory) checks: is this value in this file's range? If no, the file is skipped entirely. **Zero S3 IO.**

This is what makes the [previous post's](/blog/openobserve-tantivy-49s-to-2s) win work. For most fields (service names, status codes, paths, timestamps), values cluster. Each file holds a narrow slice of the value space. Most files get range-rejected for free.

It is also what makes *time-ordered IDs* nearly free to look up. UUIDv7, snowflake, anything timestamp-prefixed: files are time-partitioned, so each file's term range is a narrow non-overlapping window. Tantivy rejects almost every file from memory.

Now look at a random 16-byte `trace_id`:

```
1fb3487f84204def9aa3ec0f1238ce42
```

Every file holds `trace_id`s scattered across the entire 128-bit value space. **Every file's range is "min ≈ 0, max ≈ 2¹²⁸".** Range-pruning is useless: every file overlaps every other file's range, so every file is a candidate.

Tantivy has no choice. It opens every `.ttv`, fetches one term-dictionary block per file, looks up the value, finds it (in one file) or doesn't (in 169 files), and moves on.

*Figure 1 · Range-pruning works only when value ranges don't overlap. Random IDs guarantee they do.*

This is the cost we measured. 170 files × one S3-class fetch per file ≈ 2.6 seconds, even with footer cache, even with the right compact size. **The footer cache, the compact size: those tools just don't reach this regime.**

You can feel the shape of the new tool we need.
Something that says "this value *is not* in this file" before tantivy is allowed to open anything. Cheap enough to be free. Wrong sometimes, but never wrong in the dangerous direction.

That's a bloom filter. The interesting part isn't *that*; it's *where* you put it.

## The naive bloom filter doesn't work

The most obvious place to put a bloom is one bloom per file. Each parquet gets a sidecar `.bf`; the query checks each file's sidecar before deciding to open it.

Let's count S3 requests:

- 170 files → 170 sidecar fetches (one GET each)
- Bloom says "maybe" for ~1 file (the real match) and "definitely not" for ~169
- tantivy then opens the 1 survivor

Total: **170 + 1 = 171 S3 requests**. Versus tantivy alone at 170. **The naive bloom is a tie at best, a loss after constants.**

And it's worse than that. Each per-file bloom must be sized for that file's cardinality. For a stream with 10 million unique `trace_id`s per hour spread across 170 files, each bloom is ~2 MB. The blooms themselves are now the...