Faster KNN search in Manticore: 2-pass HNSW, batched distances, and AVX-512

snikolaev1 pts0 comments

Faster KNN search in Manticore: 2-pass HNSW, batched distances, and AVX-512

Faster KNN search in Manticore: 2-pass HNSW, batched distances, and AVX-512

Author: Ilya Kuznetsov<br>Published: Jun 22, 2026 - 7 Min read

TL;DR: Three changes to the HNSW search engine improve KNN throughput by up to 29% at high k, with over 20% gains under concurrent load. No API changes, no index rebuild, no configuration. Just faster searches.<br>Faster KNN search in Manticore<br>Manticore's KNN search is built on top of hnswlib<br>, an open-source HNSW implementation. Historically, most of our KNN work focused on custom distance functions, such as those used for binary quantization, rather than on hnswlib's core search loop. We also added features like prefiltering with ACORN-1<br>and early termination<br>, but the main search loop stayed the same: hnswlib still visited neighbors, computed distances, and maintained its set of candidates the same way.<br>These changes go further, modifying hnswlib's core search loop itself - restructuring how it traverses neighbors, how it calls distance functions, and how it interacts with the CPU's memory hierarchy. Combined with new AVX-512 distance implementations in the columnar library, these changes target three sources of overhead: inefficient memory access patterns, redundant data loads, and indirect function call overhead.<br>Compile-time distance function specialization<br>Previously, the distance function was a runtime function pointer stored in the HNSW index and called for every candidate. For large search budgets, that can mean a large number of indirect calls per query. Indirect calls prevent the compiler from inlining the distance function into the search loop, and they create branch prediction overhead.<br>The new code resolves the distance function at compile time using C++ templates. When the search begins, a single switch statement selects the right template specialization based on the distance metric and quantization settings. From that point on, the entire inner loop - neighbor traversal, distance computation, candidate set updates - runs as one monolithic function with the distance calculation fully inlined. The compiler can now optimize register allocation, instruction scheduling, and loop unrolling across the distance computation boundary.<br>2-pass neighbor processing<br>The HNSW algorithm explores the graph by visiting nodes and computing distances to their neighbors. In the original implementation, each neighbor was processed in a single pass: check if visited, fetch its vector data, compute distance, update the set of candidates. This meant that memory prefetch hints had little time to take effect before the data was needed.<br>The new implementation splits this into two passes. Pass 1 iterates all neighbors of the current node, skips already-visited ones, and collects the unvisited neighbors into a small batch array. As each neighbor is added to the batch, a prefetch hint is issued for its vector data. Pass 2 iterates the batch and computes distances. By the time Pass 2 reaches each vector, the prefetch from Pass 1 has had time to bring the data into cache.<br>Pass 2 walks a compact sequential array of candidate IDs, not the graph structure itself. The underlying vector loads are still scattered, but the data has been prefetched ahead of time.<br>For unfiltered queries (no WHERE clause on the KNN search), the new code also takes a fast path that eliminates the per-candidate filter check entirely.<br>Batched distance computation<br>The 2-pass structure helps in two ways: it gives prefetching more time to work, and it makes batching easy. Once Pass 2 has a compact list of candidates, it can score them two at a time instead of one by one.<br>When scoring two candidates, the query vector is loaded once per SIMD iteration and reused for both distance computations, eliminating redundant loads.<br>This reduces repeated query-side loads and lets the scoring loop process candidates in pairs, with a fallback for an odd remainder. Batch-2 functions are provided for inner product, L2, and their binary-quantized variants.<br>AVX-512 support<br>The new AVX-512 distance code processes 16 floats per iteration instead of 8 with AVX2. For inner product and L2 distance, the core loop uses fused multiply-add (_mm512_fmadd_ps), which combines multiplication and accumulation in a single instruction. For binary-quantized vectors, the AVX-512 VPOPCNTDQ extension speeds up bit-counting operations used in distance calculation.<br>Manticore now ships three library variants: a baseline build, an AVX2 build, and an AVX-512 build. At startup, Manticore detects the CPU's capabilities and loads the appropriate library automatically. No configuration is needed.<br>Benchmark results<br>The following benchmarks were run on the dbpedia-openai-1M-1536-angular<br>dataset (1M vectors, 1536 dimensions, cosine distance) on an AMD Ryzen 7 9700X (Zen 5, 8 physical cores / 16 logical cores). All data uses 1-bit binary quantization with oversampling and rescoring disabled. For...

distance search pass loop time manticore

Related Articles