The Evolution of 'More Like This'
Author: Sergey Nikolaev
Published: May 29, 2026 - 10 min read

In many search scenarios, the user does not start from an empty query box, but from an existing result.

A user opens an article and wants to find related material. A buyer views a product card and looks for close alternatives. A support engineer investigates an incident and wants to see earlier cases with the same symptoms. In all these situations, the user already has a relevant document to start from.

This scenario is traditionally called More Like This (MLT): a function for finding documents similar to the selected one. In this article, MLT means search that starts from a known document, not from a newly typed query.

The classic MLT approach, or similar-document search, was based on comparing textual matches. Modern implementations increasingly use embeddings: numerical representations of documents. A search index stores embeddings as vectors, and the search system can find documents with close vector representations.

Short glossary

To avoid repeating definitions throughout the article, here are the main terms:

| Term | Meaning in this article |
| --- | --- |
| More Like This (MLT) | search for documents similar to an already selected document |
| embedding | a numerical representation of text, a product, an image, or another object |
| embedding vector | a numerical representation of an object, such as text or a product, stored in the index to find similar objects by vector proximity |
| KNN, nearest-neighbor search | search for nearest neighbors, meaning objects with close vectors |
| ANN, approximate nearest neighbors | approximate nearest-neighbor search; it speeds up KNN on large datasets without scanning every vector |
| RAG, Retrieval-Augmented Generation | an approach where the search system retrieves context for a generative model |
| hybrid search | combining full-text search and vector search in one scenario |
| reranking | an additional sorting step for already retrieved candidates using a more precise model or rule |

What classic More Like This did

Classic MLT was lexical.
It answered a simple question: which documents use similar important words?

The process usually looked like this:

1. The search system took the source document.
2. It analyzed its text.
3. It selected informative terms.
4. It built a query from those terms.
5. It searched for documents with a similar set of words.
6. It returned a list of similar documents.

Internally, this used familiar full-text search mechanisms: TF-IDF or BM25, term frequency, stopwords, field boosts, and document-frequency limits. That is why older MLT implementations exposed parameters such as min_term_freq, min_doc_freq, max_doc_freq, and max_query_terms.

This was not just an interface element, but a full search mechanism. MLT was used for related articles and products, duplicate detection, support-ticket matching, legal search, patent research, and internal knowledge bases.

Where the lexical approach is still strong

Lexical MLT works well when specific words, identifiers, and stable formulations matter.

Examples:

- error codes;
- product SKUs;
- part numbers;
- function names;
- stack traces;
- legal wording;
- nearly identical product or ticket descriptions.

The reason is that exact matching is critical here. If two incident reports contain the same error code or the same stack trace, full-text search sees a direct match. For example, when searching tickets with the code ERR_404, lexical MLT quickly finds every mention of that code, while vector search may return tickets that describe similar but not identical problems.

Lexical MLT has another advantage: it is cheap to run. The inverted index is already in the search engine. The analyzers are already configured. Ranking already works. There is no need to deploy separate search infrastructure just to support a "find similar" feature.

The limitation is also clear. If two documents describe the same thing in different words, lexical MLT may fail to connect them. Synonyms work unevenly.
Paraphrases are harder. Cross-lingual similarity is usually unavailable. For example, "memory leak" and "unbounded heap growth" may describe the same problem, but a standard analyzer sees different tokens.

Lexical MLT efficiently finds documents with matching or similar wording. Semantic search helps when the meaning matches but the words do not.

What embeddings change

Using embeddings (numerical representations of documents) changes the comparison principle: instead of words, the system compares vector representations.

A document no longer has to be represented only as a set of weighted terms. It can be stored as a dense vector. Nearby vectors usually correspond to documents that are similar in meaning, even if they are written in different words.

The lexical approach looks for matches by words and terms, while embedding search looks at the proximity of document vector representations. The first approach is optimal for exact matches such as error codes and SKUs. The second finds semantically close documents,...
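
The contrast between the two approaches can be illustrated with a small Python sketch. It is not how any particular search engine implements MLT. The tokenizer is a simplified stand-in for an analyzer, Jaccard overlap stands in for lexical scoring (it is not TF-IDF or BM25), and the three-dimensional vectors are hand-written placeholders, not the output of a real embedding model.

```python
import math
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a simplified stand-in for a standard analyzer."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Lexical similarity: share of distinct tokens the two texts have in common."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical embeddings, invented by hand for this sketch.
# A real system would get them from an embedding model.
TOY_EMBEDDINGS = {
    "memory leak": [0.9, 0.1, 0.2],
    "unbounded heap growth": [0.8, 0.2, 0.3],
    "ERR_404 page not found": [0.1, 0.9, 0.1],
}

# Same problem, different words: no shared tokens, but nearby vectors.
lexical = jaccard("memory leak", "unbounded heap growth")
semantic = cosine(
    TOY_EMBEDDINGS["memory leak"],
    TOY_EMBEDDINGS["unbounded heap growth"],
)
# A different problem: the placeholder vectors point in different directions.
unrelated = cosine(
    TOY_EMBEDDINGS["memory leak"],
    TOY_EMBEDDINGS["ERR_404 page not found"],
)

print(f"lexical={lexical:.2f} semantic={semantic:.2f} unrelated={unrelated:.2f}")
```

With these placeholder vectors, the two phrasings of the memory problem share no tokens, yet their vectors are close. An identifier such as ERR_404, by contrast, survives tokenization intact, which is why exact lexical matching stays reliable for codes and SKUs.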