Every RAG-based localization pipeline has the same blind spot - Lingo.dev
All posts<br>Every RAG-based localization pipeline has the same blind spot<br>Veronica Prilutskaya, CPO & Co-Founder·Published in about 14 hours·10 min read
If a localization pipeline uses retrieval augmented generation to inject glossary terms into the model's context window, it has a retrieval recall problem that has never been measured.<br>The pattern is universal: embed the input text, cosine-search a term bank, inject top-k results into the prompt. The output is grammatically correct. The terminology is wrong. The error is invisible unless someone speaks both languages and knows the glossary.<br>We built this naive version first. Then we measured retrieval recall against production glossaries – and it turned out the system was missing the majority of applicable terms on real payloads.<br>tr:not(:last-child)]:border-b">Technique Retrieval augmented localization (RAL) – context enrichment at inference timeCore fix N-gram decomposition before embedding, not sentence-level embeddingRetrieval modes 3 (skip / preload / vector search), selected per-request by glossary cardinalityThreshold calibration Continuous, weekly, against per-locale-pair quality scoresTerminology error reduction 17–45% across five LLM providers (controlled study, 42,000+ quality judgments)Scoring Independent cross-model evaluation, asynchronous, per-request<br>Why do sentence embeddings miss glossary terms?#<br>A glossary term is 1–3 words. "Localization engine." "Access token." "Deployment pipeline."<br>Input text is a JSON object with values ranging from two words (a button label) to two hundred words (a product description). When the full string "Configure the localization engine for production deployment" is embedded, the resulting vector captures the semantic meaning of the sentence – something about configuration and production systems. The glossary-relevant phrase "localization engine" dissolves into the sentence-level representation.<br>Cosine similarity between that sentence vector and the glossary entry "localization engine" lands in the 0.6–0.7 range. Below retrieval threshold. The term exists in the input. The retrieval system misses it.<br>The issue is granularity: sentence-level representations querying phrase-level targets. The embedding model faithfully represents the meaning of the sentence as a whole. Constituent terminology occupies no independent region of the vector space.<br>We found this out the hard way. On production payloads – nested JSON objects with 20–50 keys, values of varying length – sentence-level retrieval was missing the majority of applicable glossary terms. The localization request completed fine. The output read fluently. But "localization engine" was becoming "translation tool" – grammatically valid, semantically adjacent, terminologically wrong. And the pipeline reported success.<br>How does n-gram decomposition fix glossary retrieval?#<br>The fix turned out to be decomposing input into phrase-level units before embedding. Every string value becomes a set of overlapping n-gram windows:<br>textCopy<br>Input: "Configure the localization engine for production"
1-grams: [configure, the, localization, engine, for, production]<br>2-grams: [configure the, the localization, localization engine,<br>engine for, for production]<br>3-grams: [configure the localization, the localization engine,<br>localization engine for, engine for production]
Each n-gram becomes an independent retrieval query. "Localization engine" queries the glossary as a standalone phrase – and finds its match at high similarity.<br>The decomposition pipeline:<br>Recursively extract all string values from nested JSON structures<br>Split into sentences, strip HTML and markup annotations<br>Normalize whitespace, remove enclosing quotes, unescape formatting<br>Generate overlapping 1-gram, 2-gram, and 3-gram phrases from each sentence<br>A 50-word paragraph yields approximately 150 n-grams. A typical API payload with 20 keys yields 1,000–3,000 searchable phrases. Each phrase is embedded independently, each embedding runs a nearest-neighbor query against the glossary's vector index.<br>We measured the difference on the same production payloads that exposed the original problem. Glossary terms now match regardless of the sentence context surrounding them – a 2-word term buried in a 200-word product description retrieves with the same recall as a standalone label.<br>How does adaptive retrieval work for different glossary sizes?#<br>N-gram decomposition and batch embedding is the correct approach for large glossaries. For small ones, it turned out to be computationally wasteful.<br>A localization engine configured with 8 glossary terms resolves faster with direct injection – one database query, deterministic, sub-millisecond. A localization engine with 2,000 terms requires vector search – context window limits and relevance dilution make full injection impossible.<br>Three retrieval modes operate per-request, selected based on glossary cardinality for the...