How Lume Works: The Retrieval Primitives — Signal Log
Lume is a Rust hybrid search engine that Steve Harris and I have been building in the open at github.com/DeepBlueDynamics/lume. It’s a small CLI plus an MCP server, BSD-3 licensed, and built around a stubborn idea: when an agent asks a question, every step from query to evidence should be inspectable.
Lume indexes Markdown, source code, and PDFs (via a small Python extractor) and ranks over them with three independent primitives — field-aware BM25, dense GTR-T5 vectors via Shivvr, and a significance-scored entity graph. The lexical core and the graph run entirely on your machine; only the dense vectors call out, and that endpoint defaults to localhost. There is no opaque “search box that returns a ranking” — every score has a name, a file, and a knob.
This post walks Lume’s retrieval core end to end, with line-level references to the current tree. If you’re building agentic systems and tired of treating retrieval as a magic step, this is for you.
A few principles up front, because they explain the design:
Local-first. Lexical search and the entity graph run entirely on your machine. Dense vectors are fetched from Shivvr through SHIVVR_BASE_URL, which defaults to a local endpoint.
Layered, not monolithic. BM25, semantic, and graph are independent signals with their own scores. The blend is one line; each input is replaceable.
Auditable. The engine prints what it pruned, what it ranked, and why it rejected the rest.
0. The unit of retrieval: a Section
Lume indexes Markdown, cut into sections at # headers (parse_markdown in src/bm25.rs:211). A Section (src/bm25.rs:106) is the atom everything ranks over:
    pub struct Section {
        pub title: String,
        pub body: String,
        pub line_number: usize,
        pub filename: Option<String>, // type argument assumed
        pub entities: Vec<String>,    // resolved named entities, for the graph (element type assumed)
    }
Title and body are separate fields with separate statistics, a distinction that shows up immediately in scoring. The whole index lives in memory as a Bm25Index (src/bm25.rs:147): per-field term-frequency maps, document frequencies, field lengths, roaring-bitmap posting lists, prime/Gödel signature filters, and the entity posting lists that feed the graph.
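As a rough illustration of the sectioning step, the sketch below splits Markdown at # headers into title, body, and line-number records. SketchSection and split_sections are hypothetical names invented for this example, not Lume's Section or parse_markdown, and the real parser may treat code fences, front matter, and header levels differently.

```rust
/// Hypothetical, trimmed-down stand-in for Lume's `Section`.
#[derive(Debug, PartialEq)]
pub struct SketchSection {
    pub title: String,
    pub body: String,
    pub line_number: usize, // 1-based line where the section starts
}

/// Split Markdown into sections at ATX headers (`#` through `######`).
/// Assumption: text before the first header is kept as an untitled section.
/// Limitation: a `# comment` line inside a code fence is treated as a header.
pub fn split_sections(markdown: &str) -> Vec<SketchSection> {
    let mut sections = Vec::new();
    let mut title = String::new();
    let mut body = String::new();
    let mut start_line = 1;

    for (idx, line) in markdown.lines().enumerate() {
        // '#' is a single byte, so the char count is also a valid byte offset.
        let hashes = line.chars().take_while(|c| *c == '#').count();
        let is_header = (1..=6).contains(&hashes) && line[hashes..].starts_with(' ');

        if is_header {
            // Close the section in progress before opening a new one.
            if !title.is_empty() || !body.trim().is_empty() {
                sections.push(SketchSection {
                    title: std::mem::take(&mut title),
                    body: body.trim().to_string(),
                    line_number: start_line,
                });
            }
            body.clear();
            title = line[hashes..].trim().to_string();
            start_line = idx + 1;
        } else {
            body.push_str(line);
            body.push('\n');
        }
    }

    // Flush the final section.
    if !title.is_empty() || !body.trim().is_empty() {
        sections.push(SketchSection {
            title,
            body: body.trim().to_string(),
            line_number: start_line,
        });
    }
    sections
}
```

Keeping title and body as separate strings at parse time is what allows per-field statistics later; a single concatenated text field would lose that distinction.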
1. Primitive: field-aware BM25
The lexical core is a field-aware BM25 with three selectable variants. The tuning defaults (Bm25Params in src/bm25.rs:125) are deliberately classic:
Self { k1: 1.2, b: 0.75, delta: 1.0, title_weight: 2.0, body_weight: 1.0 }
k1 controls term-frequency saturation; b controls length normalization. The one opinionated choice is title_weight = 2.0, which makes a title hit contribute twice as much as a body hit before the coordination factor is applied. That is useful, but it can overweight chapter titles when a query token is broad. Treat it as a knob, not a law.
IDF is the standard smoothed form, floored at zero, and each term’s contribution is computed per field then summed with the field weights (calculate_bm25_term_score in src/bm25.rs:728):
    let len_normalization = 1.0 - b + b * (doc_len / avgdl);
    match variant {
        SearchVariant::Classic => idf * (tf * (k1 + 1.0)) / (tf + k1 * len_normalization),
        SearchVariant::Plus => idf * ((tf * (k1 + 1.0)) / (tf + k1 * len_normalization) + params.delta),
        SearchVariant::L => {
            let s = tf / len_normalization;
            idf * (s * (k1 + 1.0)) / (s + k1)
        }
    }
    // total_score += title_weight * title_score + body_weight * body_score; (src/bm25.rs:635)
Classic is textbook BM25.
Plus adds a delta floor so a matched term never contributes nothing, countering BM25’s over-penalty of long documents.
L moves length normalization inside the saturation, smoothing very long docs.
Lume runs Classic by default (src/main.rs:1430).
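To make the scoring above concrete, here is a minimal, self-contained sketch of the three variants and the field-weighted sum. It is not Lume's code: Variant, Params, idf, term_score, and field_score are names invented for this example, the IDF form (Robertson/Sparck Jones with 0.5 smoothing, floored at zero) is an assumption about what "standard smoothed form" means, and the early return for tf = 0 is an assumption that unmatched terms score nothing.

```rust
#[derive(Clone, Copy)]
pub enum Variant {
    Classic,
    Plus,
    L,
}

pub struct Params {
    pub k1: f64,
    pub b: f64,
    pub delta: f64,
    pub title_weight: f64,
    pub body_weight: f64,
}

impl Default for Params {
    fn default() -> Self {
        // Defaults quoted in the text above.
        Self { k1: 1.2, b: 0.75, delta: 1.0, title_weight: 2.0, body_weight: 1.0 }
    }
}

/// Assumed IDF: ln((N - df + 0.5) / (df + 0.5)), floored at zero.
pub fn idf(num_docs: f64, doc_freq: f64) -> f64 {
    ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5)).ln().max(0.0)
}

/// Score of one query term in one field (title or body).
pub fn term_score(v: Variant, p: &Params, idf_value: f64, tf: f64, doc_len: f64, avgdl: f64) -> f64 {
    if tf == 0.0 {
        return 0.0; // assumption: only matched terms contribute, including under Plus
    }
    let norm = 1.0 - p.b + p.b * (doc_len / avgdl);
    match v {
        Variant::Classic => idf_value * (tf * (p.k1 + 1.0)) / (tf + p.k1 * norm),
        Variant::Plus => idf_value * ((tf * (p.k1 + 1.0)) / (tf + p.k1 * norm) + p.delta),
        Variant::L => {
            let s = tf / norm;
            idf_value * (s * (p.k1 + 1.0)) / (s + p.k1)
        }
    }
}

/// Combine per-field scores with the field weights.
pub fn field_score(p: &Params, title_score: f64, body_score: f64) -> f64 {
    p.title_weight * title_score + p.body_weight * body_score
}
```

With the defaults, a term that matches once in a field of average length scores exactly its IDF under Classic and twice its IDF under Plus (delta = 1.0). One thing the sketch makes visible: with the L arm as excerpted, multiplying numerator and denominator by the length normalization gives back the Classic expression, so the two arms agree numerically. Published BM25L additionally shifts the normalized term frequency by delta before saturation; whether the full source does so is not shown in the excerpt.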
2. Two-stage pruning: roaring union, then Gödel signatures
You don’t want to BM25-score all 1,926 sections of a book for every query. Lume’s search (src/bm25.rs:445) is two-stage.
Stage 1 — candidate gather. Union the roaring-bitmap posting lists of the query terms. This is a handful of bitset ORs and instantly narrows the corpus to sections that contain any query term:
    // src/bm25.rs:460
    let mut candidate_set = MiniRoaring::new();
    let mut first = true;
    for q_tok in &query_tokens {
        if let Some(list) = self.posting_lists.get(&q_tok.bytes) {
            if first {
                candidate_set = list.clone();
                first = false;
            } else {
                candidate_set = candidate_set.union(list);
            }
        }
    }
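The candidate-gather idea can be tried in isolation with the standard library. In the sketch below a BTreeSet of section ids stands in for a roaring bitmap, and PostingLists, build_postings, and gather_candidates are hypothetical names for this example; Lume's MiniRoaring and its tokenizer are not reproduced, and whitespace splitting with lowercasing is an assumption.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for posting lists: term -> sorted set of section ids.
pub type PostingLists = HashMap<String, BTreeSet<u32>>;

/// Build posting lists with a naive tokenizer (whitespace split, lowercased).
pub fn build_postings(sections: &[&str]) -> PostingLists {
    let mut postings = PostingLists::new();
    for (id, text) in sections.iter().enumerate() {
        for token in text.split_whitespace() {
            postings
                .entry(token.to_lowercase())
                .or_default()
                .insert(id as u32);
        }
    }
    postings
}

/// Stage 1: union the posting lists of every query term.
/// The result holds each section that contains at least one query term.
pub fn gather_candidates(postings: &PostingLists, query: &str) -> BTreeSet<u32> {
    let mut candidates = BTreeSet::new();
    for token in query.split_whitespace() {
        if let Some(list) = postings.get(&token.to_lowercase()) {
            candidates.extend(list.iter().copied());
        }
    }
    candidates
}
```

Query terms with no posting list are skipped rather than treated as an error, so a query made only of unseen terms yields an empty candidate set and nothing is scored.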
Stage 1b — Gödel tag-signature pruning. If the query tagger recognizes entities, each candidate section is verified against a prime-factored signature filter (PrimeFilter::test_tag_prime in src/fast_retrieval.rs:449, evaluated in src/bm25.rs:538). Each known tag output maps to a prime; a section’s tag signature is the product of its tag primes, so inclusion is checked by divisibility. Unknown query tags deliberately receive a dummy prime and fail closed. Candidates that fail are dropped as TagSignatureMismatch...
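The divisibility check can be sketched in a few lines. This is an illustration only: TagPrimes and its methods are hypothetical and do not mirror PrimeFilter, the requirement that every query tag be present is an assumption, and a u64 product overflows after a modest number of distinct tags, so a real filter needs a wider or different representation.

```rust
use std::collections::HashMap;

/// Hypothetical tag registry: each known tag gets its own prime.
pub struct TagPrimes {
    primes: HashMap<String, u64>,
    dummy: u64, // prime reserved for unknown tags, never multiplied into a signature
}

impl TagPrimes {
    pub fn new(tags: &[&str]) -> Self {
        // A small fixed prime table keeps the sketch dependency-free.
        const PRIMES: [u64; 8] = [2, 3, 5, 7, 11, 13, 17, 19];
        assert!(tags.len() < PRIMES.len(), "sketch supports at most 7 tags");
        let primes = tags
            .iter()
            .zip(PRIMES.iter())
            .map(|(t, p)| (t.to_string(), *p))
            .collect();
        Self { primes, dummy: PRIMES[tags.len()] }
    }

    /// Prime for a tag; unknown tags map to the dummy prime.
    pub fn prime_for(&self, tag: &str) -> u64 {
        self.primes.get(tag).copied().unwrap_or(self.dummy)
    }

    /// Signature = product of the primes of the section's distinct known tags.
    pub fn signature(&self, tags: &[&str]) -> u64 {
        tags.iter()
            .filter_map(|t| self.primes.get(*t).copied())
            .fold(1u64, |acc, p| if acc % p == 0 { acc } else { acc * p })
    }

    /// A section passes when its signature is divisible by every query tag's prime.
    /// Unknown tags use the dummy prime, which divides no signature: fail closed.
    pub fn matches(&self, signature: u64, query_tags: &[&str]) -> bool {
        query_tags.iter().all(|t| signature % self.prime_for(t) == 0)
    }
}
```

For example, with tags person = 2, place = 3, org = 5, a section tagged person and org has signature 10; it passes a query tagged person, fails one tagged place, and fails any query carrying an unknown tag because the dummy prime 7 does not divide 10.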