Semantic Search in Under 3MB

Semantic Search in Under 3MB :: Luke Salamone's Blog


This project is a continuation of my previous autoresearch project, which optimized a reranking model to be under 10MB. Digging deeper by hand, I was able to take the size reduction much further, while outperforming reranking models which are 30x larger on this task. In the end I was able to reduce the payload from 11.4 MB to 2.79 MB gzipped.

You can see it in action on my resume page.

Each square represents 1 kB. The majority of overall size reduction came from removing the ORT dependency. However, other changes enabled much better representation quality than the baseline.

Baseline

After running my original autoresearch experiment overnight, I had a fairly impressive but tiny 4.3M-parameter dual encoder, quantized to int8 ONNX.

However, when I tried using it on my resume page, it wasn’t actually that good. In fact, it was much worse than the BM25 scoring that was working alongside it. Notably, on the eval set I used, much larger models like all-MiniLM-L6-v2 also didn’t perform very well. This suggested that the issue was in training and domain adaptation, not size.

Term dropout

One of the motivating issues was that the model was failing to correctly rank docs for simple queries like “do you have any leadership experience”. The model was latching onto terms like “Grammarly” but not onto more abstract ones like “leadership”.

To address this, I applied term dropout, randomly dropping the top TF-IDF term from queries and docs with 20% probability. This both regularized the model against overfitting and discouraged the simple keyword matching that BM25 would already be doing.

Because the corpus is so small, we need to prevent overfitting.

Query mining from job postings

To increase the diversity and realism of potential queries to the model, I created a small pipeline to pull queries from real job postings:

First, we gather relevant job postings from the Kaleh job postings API, which has a generous free tier.

Next, we pull out “queries” from the job postings using a local LLM (phi4).

A “query” amounts to a rephrasing of a job qualification e.g. “experience with python” -> “do you have experience with python”

Finally, use the same LLM to identify high-precision matches from my actual resume, if any.

After one round of this process, I had mined 94 queries, which raised MRR by 21%. After a second round with ~1000 pairs, however, MRR had mostly plateaued, increasing by only ~3 points.
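For reference, MRR (mean reciprocal rank) averages 1/rank of the first relevant doc over all queries, counting a query as 0 when nothing relevant is retrieved. Here is a minimal sketch of the metric; the function name and the toy doc ids are made up for illustration and are not taken from the eval harness.

```python
def mean_reciprocal_rank(rankings: list[list[str]], relevant: list[set[str]]) -> float:
    """rankings[i] is the ranked doc-id list for query i; relevant[i] is its set of correct ids."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings)


# Toy example: first relevant doc at rank 1, at rank 2, and not retrieved at all.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant = [{"a"}, {"a"}, {"z"}]
print(mean_reciprocal_rank(rankings, relevant))  # (1 + 1/2 + 0) / 3 = 0.5
```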

Architecture ablations

At this point, the model was pretty decent for its size. I wanted to see how much more performance we could squeeze out of it, so I ran a series of quick architecture experiments:

Max pooling vs mean pooling: To get the final 256-dimensional embedding for comparison, we need to convert the L x 256 output matrix from the model into a single 256-dimensional vector, where L is the sequence length (up to 64 tokens). Previously we used the mean of each column, which is known as “mean pooling”. Switching to max pooling was slightly better (0.60 to 0.634 on a harder query set) because it allowed the strongest value in each dimension to directly affect the output.

Factorized embeddings: Embeddings for each token are stored in a lookup table with shape (V x D). However, we can factorize this matrix using a low-rank approximation, which is what ALBERT does to save parameters. This saved ~1M params and was neutral on nDCG.

SwiGLU: SwiGLU replaces the GELU feed-forward activation with a gated unit, and can theoretically lead to richer intermediate representations. However, SwiGLU resulted in no measurable performance gain.

Multi-vector late interaction: Rather than collapsing the output vectors with mean/max pooling, we keep every token vector and score query-document pairs ColBERT-style with the MaxSim function. Concretely, it looks like this:

```python
import numpy as np

def maxsim(query: str, document: str) -> float:
    # encoder() is the trained model; it returns one vector per token
    query_vectors = encoder(query)
    document_vectors = encoder(document)

    score = 0.0
    for q in query_vectors:
        # find the highest dot-product match among all document token vectors
        best = max(np.dot(q, d) for d in document_vectors)
        score += best
    return score
```
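To make the difference between pooled scoring and late interaction concrete, here is a small NumPy sketch that works directly on token matrices instead of strings. The random matrices stand in for encoder output, and the helper names are hypothetical. It also ignores padding, which a real implementation would probably need to mask out before pooling or taking the max.

```python
import numpy as np


def mean_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Collapse an (L, D) matrix to (D,) by averaging each column."""
    return token_vectors.mean(axis=0)


def max_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Collapse an (L, D) matrix to (D,) by taking the largest value in each column."""
    return token_vectors.max(axis=0)


def maxsim_score(query_vectors: np.ndarray, document_vectors: np.ndarray) -> float:
    """Vectorized MaxSim: sum over query tokens of the best dot product with any doc token."""
    sims = query_vectors @ document_vectors.T  # shape (L_query, L_doc)
    return float(sims.max(axis=1).sum())


# Stand-ins for encoder output: 5 query tokens and 12 document tokens, 256 dims each.
rng = np.random.default_rng(0)
query_vectors = rng.normal(size=(5, 256))
document_vectors = rng.normal(size=(12, 256))

pooled_score = float(max_pool(query_vectors) @ max_pool(document_vectors))
late_interaction_score = maxsim_score(query_vectors, document_vectors)
```

The tradeoff is storage: pooling keeps one 256-dimensional vector per doc, while late interaction has to keep one vector per token.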

Vocab pruning

The first step after tokenizing a string is to get each token’s corresponding embedding from a lookup table. That means the model stores vocab_size x embedding_dimension parameters, which is ~8MB already. To make this smaller on disk, there are really only three things you can do: reduce the vocab, reduce the embedding dimension, and reduce bytes per param (quantization). I did all three.
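The arithmetic behind those three levers is just a product, as the sketch below shows. The specific numbers are assumptions for illustration: 30,522 is the size of the standard BERT uncased WordPiece vocab, 256 is the embedding width mentioned earlier, and 1 byte per param corresponds to int8. Sizes are raw bytes before gzip.

```python
def embedding_table_bytes(vocab_size: int, embedding_dim: int, bytes_per_param: int) -> int:
    """Raw on-disk size of a (vocab_size x embedding_dim) lookup table."""
    return vocab_size * embedding_dim * bytes_per_param


# Assumed illustrative configurations, not measurements from this project.
full_int8 = embedding_table_bytes(30_522, 256, 1)    # 7,813,632 bytes, roughly 7.8 MB
full_fp32 = embedding_table_bytes(30_522, 256, 4)    # 4x larger without quantization
pruned_int8 = embedding_table_bytes(5_000, 256, 1)   # 1,280,000 bytes, roughly 1.3 MB

for name, size in [("full int8", full_int8), ("full fp32", full_fp32), ("pruned int8", pruned_int8)]:
    print(f"{name}: {size / 1e6:.2f} MB")
```

Each lever scales the table linearly, so a 6x smaller vocab gives a roughly 6x smaller table regardless of the other two choices.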

The full BERT WordPiece vocab is 30k tokens, which is far more than I need. It contains ~1000 “unused” tokens (like [unused123] and [unused456]), tokens in foreign languages, and full words like “baltimore” and “vampires” that are irrelevant to my resume. I was able to cut it down to 5000 tokens safely by trimming “unused” and foreign-language tokens, then whitelisting tokens present in...
