SMVE: Multi-Vector Retrieval That Just Works<br>Martin Spisak, Marek Galovic<br>March 11, 2026
Late-interaction models are having a bit of a moment in the search community, and for good reason. They have been shown to be more expressive than single-vector retrieval across many benchmarks and modalities, but until recently they were too expensive and cumbersome to use as a true first-stage retrieval primitive.
At TopK, we believe late-interaction retrieval deserves the same place in the search stack as established semantic retrieval primitives like single-vector embeddings and reranker models. Over the past few months, we have worked to make it a first-class feature of our database.
Multi-vector retrieval is now natively supported in TopK (docs), and in this post, we dive into how we made it scale to large datasets while preserving the CRUD properties of our database offering.
No free lunch with multi-vector retrieval
Multi-vector retrieval gets its quality from a richer scoring function: instead of compressing a query and document into one vector each and computing their dot product, it compares sets of token embeddings using the MaxSim operator, matching each query token against its most similar document token.<br>That extra expressiveness is powerful, but not free. Compared to single-vector retrieval, you typically need to store far more data per document and do far more work at scoring time, which makes exhaustive multi-vector retrieval impractical at scale.
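To make the scoring function concrete, here is a minimal sketch of MaxSim in plain Python. The names `dot` and `maxsim` and the toy vectors are illustrative assumptions, not part of TopK's API; a real implementation would use batched tensor operations.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for every query token, take the similarity
    to its best-matching document token, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy example: 2 query tokens, 3 document tokens, 2-dimensional embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]

score = maxsim(query, doc)  # best matches are 1.0 and 0.5
```

Note that every query token is compared against every document token, so scoring one document takes (number of query tokens) × (number of document tokens) dot products, versus a single dot product in single-vector retrieval.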
Prior work tackles this tradeoff in a few different ways: compress the token embeddings, keep fewer of them, build multi-stage retrieval pipelines such as PLAID, or collapse the multi-vector representation into one large descriptor as in MUVERA.<br>MUVERA is especially interesting because it comes with theoretical guarantees: with sufficiently large descriptors, its scores approximate MaxSim closely enough to recover candidates similar to those from exhaustive late-interaction retrieval.<br>The catch is that these descriptors need to be very large, which makes them expensive to store and compute with.
Our idea is simple: expressive descriptors may need to be large, but they do not need to be dense.
SMVE: Sparse Multi-Vector Encoding
SMVE is based on exactly that idea: instead of approximating multi-vector representations with dense descriptors, we convert them into sparse vectors whose dot product approximates MaxSim similarity.
Sparse vectors have a useful property: storage and computation depend only on the non-zero elements and effectively ignore everything else.<br>This means that storage and computational complexity scale with the number of non-zero elements, not with the ambient dimensionality of the embedding space.<br>That lets us build very high-dimensional representations that preserve much of the original expressiveness while remaining compact to store and fast to query.
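As a rough illustration of that property, the sketch below stores sparse vectors as `{index: value}` dictionaries and computes their dot product by touching only the stored entries. The dictionary layout and the `sparse_dot` helper are assumptions chosen for readability, not a description of TopK's storage format.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts.

    Only indices present in both vectors contribute, so the work is
    proportional to the number of stored entries in the smaller vector,
    not to the nominal dimensionality of the space."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(value * v[index] for index, value in u.items() if index in v)


# Two vectors in a nominally 1,000,000-dimensional space,
# each with only three non-zero entries.
a = {7: 0.5, 1024: 2.0, 999_999: 1.0}
b = {7: 4.0, 500: 3.0, 999_999: 0.25}

result = sparse_dot(a, b)  # only indices 7 and 999_999 overlap
```

The loop runs three times here regardless of whether the ambient dimension is a thousand or a million.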
The key question, then, is how to find a transformation that produces a sparse vector while still approximating MaxSim well.
How SMVE works
The answer is surprisingly simple. SMVE consists of just three steps:
Random Projection onto Spherical Anchors: We sample a large set of random unit vectors that act as anchor directions in the embedding space.<br>Each token embedding is then projected onto these anchors, producing a higher-dimensional vector of cosine similarities.<br>Intuitively, this gives us a sketch of where the token sits relative to many reference directions.
Sparsification: For each projected token vector, we keep only the Top-K largest values and set the rest to zero.<br>This works well because in high-dimensional spaces, a vector tends to have very small inner products with most random directions, so most projection values stay close to zero while only a small number capture strong alignment with the anchor directions.<br>Keeping only those strongest signals gives us a sparse representation of each token embedding.
Pooling: Finally, we aggregate the token-level sparse representations into a single sparse vector. In particular, for queries we sum the token vectors, while for documents we average the non-zero contributions in each dimension.
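The pooling step is the easiest one to misread, so here is a small sketch of one possible interpretation, using `{index: value}` dictionaries for the sparsified token vectors. The helper names `pool_query` and `pool_document` are hypothetical, and the reading of "average the non-zero contributions" as "divide by the number of tokens that are non-zero in that dimension" is an assumption rather than something the text spells out.

```python
from collections import defaultdict


def pool_query(token_vectors):
    """Sum the sparse token vectors dimension by dimension."""
    pooled = defaultdict(float)
    for vec in token_vectors:
        for index, value in vec.items():
            pooled[index] += value
    return dict(pooled)


def pool_document(token_vectors):
    """Average each dimension over the tokens that are non-zero there.

    Tokens whose value in a dimension was zeroed out by sparsification
    are not counted in that dimension's average (assumed reading)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for vec in token_vectors:
        for index, value in vec.items():
            sums[index] += value
            counts[index] += 1
    return {index: sums[index] / counts[index] for index in sums}


# Two sparsified token vectors that share dimension 3.
tokens = [{3: 0.8, 10: 0.6}, {3: 0.4, 42: 0.9}]

query_vec = pool_query(tokens)    # dimension 3 holds the sum of 0.8 and 0.4
doc_vec = pool_document(tokens)   # dimension 3 holds the mean of 0.8 and 0.4
```

Dimensions touched by a single token come out identical under both pooling rules; the two only differ where several tokens land on the same anchor.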
Lastly, we can repeat the steps multiple times with different random matrices and concatenate the results to reduce variance and improve retrieval quality.
The core SMVE transformation is simple enough to express in just a few lines of code.
from torch import randn, Tensor, topk, zeros_like
embedding_dim = 128  # dimension of the input token embeddings
width = 2048  # dimension of the SMVE-transformed embeddings
k = 8  # number of non-zero elements to keep for each token

# sample a random matrix B with unit-norm columns
# -> a large set of anchor directions spread across the unit sphere
B = randn(embedding_dim, width)  # shape (embedding_dim, width)
B /= B.norm(dim=0, keepdim=True)  # normalize each column to unit length
def smve(token_embeddings: Tensor, B: Tensor, k: int,...