Your embedding model doesn’t understand your data
INTERNALS.md
INTERNALS.md #3 · It never did. Here’s what it actually does, and why that matters for every RAG system you’ll ever build.
Lax Meiyappan<br>May 19, 2026
Here’s a bug that doesn’t show up in your logs.<br>You ship a RAG system. Users ask questions about internal data: support tickets, product docs, sales notes. Cosine similarity scores come back at 0.74, 0.81, 0.78. The LLM generates a confident, fluent answer.<br>The answers aren’t always obviously broken, but they fail in a specific, repeatable pattern:<br>Someone asks about “pipeline” and gets sales documents when they meant data infrastructure.
Someone asks about “incident” and gets both engineering postmortems and customer support tickets, randomly.
Someone asks about “ARR attainment” and gets a document about spreadsheet formulas.
You tune the prompts. You adjust chunk sizes. The results are still wrong.<br>Prompt engineering and chunking strategies fail here because the root cause sits upstream of both: in how the underlying vector space was created.
A note before we start. This post assumes you’ve shipped or worked on a RAG system and have firsthand experience with it underperforming on domain-specific queries. If you need a foundation first, this primer is a solid 10 minutes. You’ll get more from this post with that context.
The Blueprint:
The Illusion: Why your embedding model is a map of the internet, not an understanding machine.
The Geometry: How high-dimensional space pathology compresses your similarity scores.
The Failure Modes: How to diagnose and fix the 4 silent bugs killing domain retrieval (including Hubness and Concept Collision).
The Playbook: A 3-step engineering roadmap to evaluate, adapt, and fine-tune your space.
What you think is happening
Most engineers picture an embedding model as a kind of understanding machine.<br>You feed it text. It reads the text, grasps the meaning, and produces a number that represents that meaning.<br>Two pieces of text with similar meanings get similar numbers. You compare numbers. You find meaning.<br>This mental model feels right. It explains why “cat” and “feline” end up close together. It explains why the system works at all.<br>But this mental model is wrong, and that’s exactly why so many production RAG pipelines fail.
Maps, not minds
An embedding model doesn’t understand anything. It has zero semantic comprehension. It operates strictly as a coordinate system: a map of language drawn by learning how words and phrases appeared on the internet.<br>Every piece of text you feed it gets assigned a location on that map. Texts that appeared together constantly, in the same articles, answering the same kinds of questions, end up near each other. The model never gets to redraw the map when it sees your internal data. It just projects everything onto the existing one, using whatever surface patterns and statistical echoes it recognizes.<br>The crucial part: the map was drawn by reading the internet.<br>Billions of web pages, Wikipedia articles, Reddit threads, news posts, academic papers. It’s a dense, detailed map of how language is used on the open web. This is why it works well for general questions. “Cat” and “feline” appeared near each other constantly. “Paris” and “capital of France” showed up together in thousands of articles.<br>But your company’s specific use of “pipeline”, “incident”, “P0”, or “ARR attainment”? Those meanings were never on the original map. The model does the only thing it can: it finds the nearest coordinates it does have. It always returns something. There is no “I don’t know”.<br>Here is the part that makes this dangerous: the model never warns you. It returns a confident-looking coordinate and a plausible similarity score regardless of whether your data falls within the model’s training distribution or in an unmapped domain. A 0.79 similarity score looks identical for a perfectly relevant retrieval and a catastrophic silent failure.<br>The cosine score only tells you distance on the map. It doesn’t tell you whether the map covers your territory.
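To make the “it always returns something” point concrete, here is a minimal sketch of nearest-neighbor retrieval over cosine similarity. Everything in it is an assumption for illustration: the 3-dimensional vectors are hand-written stand-ins for real embeddings, and `top_k`, `docs`, `in_domain_query`, and `unmapped_query` are hypothetical names, not part of any library.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, doc_vecs, k=1):
    """Return the k (doc_index, score) pairs with the highest cosine similarity.

    Note what is missing: there is no threshold, no abstain path,
    no "I don't know". Some document is always ranked first.
    """
    scores = [cosine(query_vec, d) for d in doc_vecs]
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), scores[int(i)]) for i in order]

# Hypothetical toy "embeddings" (real models use hundreds of dimensions).
docs = np.array([
    [0.9, 0.1, 0.2],  # 0: sales pipeline notes
    [0.2, 0.9, 0.1],  # 1: data infrastructure docs
    [0.1, 0.2, 0.9],  # 2: spreadsheet formula guide
])

# A query the map covers well: it sits clearly next to document 0.
in_domain_query = np.array([0.8, 0.2, 0.1])

# A query for a concept the map never learned: it lands in between
# everything, yet retrieval still picks a "winner" with a healthy score.
unmapped_query = np.array([0.5, 0.5, 0.6])

print(top_k(in_domain_query, docs))
print(top_k(unmapped_query, docs))
```

In this toy setup both queries come back with a top hit and a respectable-looking score. Nothing in the return value distinguishes the query the map covers from the one it does not, which is the failure described above.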
↓ Internals<br>The formal name for this is the distributional hypothesis, stated by linguist J.R. Firth in 1957: “you shall know a word by the company it keeps”. Modern embedding models are this hypothesis at scale, with a neural network as the function approximator.<br>The model learns: text → a point in ℝⁿ (768 dimensions for BERT-base, 1536 for OpenAI’s text-embedding-3-small). Positions are determined entirely by co-occurrence patterns in the training corpus. A concept that appeared with insufficient frequency or in the wrong context distribution gets placed at unreliable coordinates. Not missing, just wrong.
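The distributional hypothesis can be sketched with plain co-occurrence counts. This is a deliberately crude, count-based analogue and not how a neural embedding model is actually trained; the six-sentence `corpus`, the `STOPWORDS` set, and the `context_vector` helper are all assumptions made up for the example.

```python
import math
from collections import Counter

# Hypothetical miniature "training corpus".
corpus = [
    "the cat chased the mouse",
    "the feline chased the mouse",
    "the cat slept on the sofa",
    "the feline slept on the sofa",
    "the pipeline closed three deals",
    "sales pipeline deals grew",
]

STOPWORDS = {"the", "on"}

def context_vector(word, sentences):
    """Count the words that share a sentence with `word` (its 'company')."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word and t not in STOPWORDS)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

cat = context_vector("cat", corpus)
feline = context_vector("feline", corpus)
pipeline = context_vector("pipeline", corpus)

print(cosine(cat, feline))    # same company, so near-identical position
print(cosine(cat, pipeline))  # no shared company
print(pipeline)               # only sales contexts were ever observed
```

Here “cat” and “feline” land in the same place purely because they kept the same company, with no notion of meaning involved. “pipeline” only ever appeared next to sales vocabulary, so its position is a sales position; a data-infrastructure use of the word has nothing else in this corpus to anchor to.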
The map has a geometry problem
Even when you’re asking about...