KNN early termination in Manticore Search
KNN early termination in Manticore Search
Author: Ilya Kuznetsov<br>Published: May 29, 2026 - 9 Min read
Modern search engines do more than match keywords. When you search for "cozy mystery set in Paris" and get results for "atmospheric detective novel in France" that's vector search at work: documents and queries are converted into lists of numbers, called embeddings, and the search engine finds the documents whose numbers are closest to the query's.<br>Manticore Search supports this natively. Under the hood, it uses a data structure called HNSW: a graph that connects nearby vectors, so it can find nearest neighbors quickly without scanning every document. That makes vector search fast enough to run on millions of documents in milliseconds.<br>But HNSW has an inefficiency. Early in the traversal, almost every distance computation finds a better candidate than the ones already in the result set.
As the search goes on, those improvements become rarer, but the algorithm keeps traversing the graph until it exhausts its exploration budget. By that point, the result set has often already converged, and the remaining work does little or nothing to improve it. Early termination fixes this by detecting that point and stopping early.<br>The effect becomes more noticeable as k grows, where k is the number of nearest neighbors the query asks Manticore to return. Returning more neighbors requires more graph exploration, and much of that extra work happens after the result set has already stabilized. That also makes early termination more valuable, because it has more unnecessary work to cut.<br>This gets more pronounced with<br>vector quantization<br>. Quantization compresses stored vectors to save memory, which slightly lowers search precision. To recover it, Manticore uses<br>oversampling<br>: it fetches 3x more candidates than requested, then rescores them using the original full-precision vectors. With the default 3x oversampling, HNSW explores many more candidates per query. Large k values often come from this kind of candidate expansion: an application may ask the vector index for hundreds or thousands of candidates, then rescore, rerank, or filter them down to a much smaller final result set to improve recall and precision. That raises latency, and early termination helps win some of it back.<br>The waste is measurable. Benchmarks on a 1M-vector dataset show that with k=60, which is the default result limit with default 3x oversampling, early termination reduces distance computations to about 65% of the full search. At k=1000, computations drop to 30%. At k=10000, just 20%. The search converges long before the exploration budget runs out, and the savings grow with k.<br>Early termination lets Manticore detect this convergence and stop. The algorithm was designed with a specific precision target: lose no more than 2-4% of result set precision compared to a full HNSW search.<br>How it works<br>The algorithm tracks a simple signal: discovery rate - the fraction of distance computations that actually improve the result set.<br>Each time a new node's distance is computed, one of two things happens: either it's good enough to enter the heap - the priority queue that holds the current best candidate neighbors - or it's worse than everything already there and gets discarded. Entering the heap counts as a "discovery." Early in the search, discoveries are frequent - the heap is filling up and most candidates are useful. As the search progresses and the heap saturates with good results, discoveries become rare. Most new distance computations just confirm that the algorithm has already found the best candidates.<br>Manticore monitors this transition. After each round of neighbor expansion, it computes the discovery rate:<br>discovery_rate = new_candidates_collected / distances_computed
If this rate stays below a threshold for several rounds in a row, the search stops.<br>The idea is simple: if the algorithm keeps computing distances but nothing improves the result, the search has converged.
The threshold: quantile-based adaptation<br>That raises the obvious next question: what threshold should count as "low"? A fixed threshold wouldn't work well - different datasets and different regions of the same dataset have wildly different discovery rate distributions. What counts as "low" depends on context.<br>Manticore uses a quantile-based adaptive threshold. Instead of comparing the discovery rate against a fixed number, it continuously estimates a low percentile of recent rounds (20th percentile, or 14th percentile for L2 distance) and uses that as the baseline. This keeps the method lightweight while letting it adapt to different datasets and different regions of the graph.<br>In other words, the threshold adapts to the local search pattern. If the algorithm enters a sparse region of the graph, the threshold drops and avoids stopping too early. If it enters a richer region, the threshold rises.<br>Patience: how many bad rounds before...