How to Make xt850 Match xt 850

Author: Sergey Nikolaev
Published: May 04, 2026 - 7 Min read

TL;DR

Since version 23.0.0, Manticore can make searches like xt850 match xt 850 using bigram_delimiter together with digit-aware bigram_index modes. This solves a common tokenization mismatch in product search, where users remove spaces from model names but the source data stores them as separate tokens.

Assumptions and verification

This article assumes:
RT tables created with SQL examples exactly as shown
default tokenization unless the example explicitly changes a setting
ASCII digits in model names, because second_numeric and second_has_digit are digit-aware modes built around 0-9

All SQL examples and expected outputs in this article were verified against a real Manticore 23.0.0 instance before publishing, using fresh tables created from scratch for each scenario.

The broader search problem

Imagine a catalog containing:
xt 850 action camera
iphone 5se battery case
canon eos 80d body
thinkpad x1 carbon

Now imagine users searching for:
xt850
iphone5se
eos80d
thinkpadx1

From the user's point of view, these should obviously match. From the engine's point of view, they often do not, because the indexed text is tokenized as separate terms.

Search systems usually attack that mismatch in one of four ways:
index prefixes or infixes
add custom normalization rules
duplicate content into alternate normalized fields
index adjacent token pairs and optionally store glued variants too

Manticore's newer bigram functionality is a structured way to do the fourth option without awkward field duplication.

Baseline: why xt850 fails by default

Here is the problem in its simplest form:

DROP TABLE IF EXISTS bi_default_demo;

CREATE TABLE bi_default_demo(title text);

INSERT INTO bi_default_demo VALUES
(1,'xt 850 action camera');

SELECT id, title FROM bi_default_demo WHERE MATCH('xt850');

Expected result:
Empty set
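The empty result can be reproduced outside the engine with a toy model. The Python sketch below is only an illustration: `tokenize` and `matches` are hypothetical helpers, the tokenizer is a simplified stand-in for default tokenization (lowercase, then runs of ASCII letters and digits), and the matching rule assumes every query token must be present in the document.

```python
import re

def tokenize(text):
    # Simplified stand-in for default tokenization (an assumption):
    # lowercase, then split into runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(query, document):
    # Assumed semantics: every query token must appear among the document tokens.
    doc_tokens = set(tokenize(document))
    return all(token in doc_tokens for token in tokenize(query))

doc = "xt 850 action camera"
print(tokenize(doc))            # ['xt', '850', 'action', 'camera']
print(tokenize("xt850"))        # ['xt850']
print(matches("xt850", doc))    # False
print(matches("xt 850", doc))   # True
```

The spaced query matches because both of its tokens exist in the document. The glued query produces a single token that was never indexed.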

Why does this fail? Because the document is indexed as two separate tokens, xt and 850, while the query is a single token, xt850.

By default, Manticore does not assume that:
xt850 should be split into xt + 850
or xt + 850 should also be searchable as xt850

So this is not really a typo-tolerance problem or a phrase problem. It is a tokenization mismatch: the index sees two tokens, while the query provides one.

That is the gap the newer bigram settings are designed to close. They let Manticore index selected adjacent token pairs in a form that can also match glued queries.

Why bigrams help here

bigram_index can help with both phrase acceleration and model-name matching. This article focuses on the xt 850 vs xt850 problem.

The key idea is simple:
detect adjacent token pairs that look like model names
store those pairs in a glued form too
let queries such as xt850, iphone5se, or thinkpadx1 hit the spaced text

That is where bigram_delimiter matters.

A note about bigram_delimiter

bigram_index decides which adjacent pairs are eligible.
bigram_delimiter decides how eligible bigrams are stored:
true: internal delimited token only
none: glued token only, such as galaxy24
both: both forms

The practical difference is easiest to understand from the query side:
with true, Manticore keeps the internal bigram form used for phrase optimization, but it does not keep the glued user-facing form, so a query like xt850 will not match xt 850
with none, Manticore keeps only the glued form, so xt850 can match xt 850, but you are leaning entirely on the glued representation for those pairs
with both, Manticore keeps both the internal bigram representation and the glued form, so xt850 can match xt 850 without giving up ordinary phrase behavior

For this use case, both is usually the safer default because it covers the user-visible problem directly while keeping behavior less surprising for normal phrase queries and mixed workloads.

Mode 1: second_numeric

bigram_index = second_numeric
bigram_delimiter = both
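To make the three bigram_delimiter values concrete, here is a toy Python model of what gets stored for one eligible pair. It is a sketch under assumptions, not Manticore's implementation: `stored_forms` is a hypothetical helper, and `INTERNAL_SEP` is a placeholder, since the article does not say what the real internal delimiter is.

```python
INTERNAL_SEP = "\x01"  # hypothetical placeholder for the internal delimiter

def stored_forms(first, second, delimiter):
    # Forms kept for one eligible token pair under each bigram_delimiter value.
    internal = first + INTERNAL_SEP + second  # internal delimited token
    glued = first + second                    # glued token, e.g. xt850
    if delimiter == "true":
        return [internal]
    if delimiter == "none":
        return [glued]
    if delimiter == "both":
        return [internal, glued]
    raise ValueError(f"unknown bigram_delimiter value: {delimiter}")

for mode in ("true", "none", "both"):
    forms = stored_forms("xt", "850", mode)
    print(mode, "xt850" in forms)
# true False
# none True
# both True
```

In this model, a glued query such as xt850 can only hit when the glued form was stored, which is why true does not help with this problem while none and both do.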

This mode is aimed at model names where the second token is purely numeric. That is common in product catalogs:
xt 850
galaxy 24
playstation 5
pixel 8

The idea is simple: users often search these as glued terms such as xt850, galaxy24, or playstation5, even though the source text stores them with a space.

second_numeric stores the pair only when the second token is ASCII digits only.

Use it when:
you have product generations and numbered models
users often remove spaces in search
the second token is usually just digits

Example

DROP TABLE IF EXISTS bi_second_numeric_demo;

CREATE TABLE bi_second_numeric_demo(title text)
bigram_index='second_numeric'
bigram_delimiter='both';

INSERT INTO bi_second_numeric_demo VALUES
(1,'xt 850 action camera'),
(2,'galaxy 24 ultra'),
(3,'playstation 5 slim'),
(4,'iphone 5se case'),
(5,'canon eos 80d body'),
(6,'thinkpad x1 carbon');

Then test the queries one by one:

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('xt850');

+------+----------------------+
| id |...
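Separately from the engine output, the second_numeric eligibility rule (store the pair only when the second token is ASCII digits only) can be applied by hand to the six sample titles. The Python sketch below is a toy model, not Manticore's code: `tokenize` and `glued_second_numeric` are hypothetical helpers, and the tokenizer is a simplified assumption (lowercase, runs of ASCII letters and digits).

```python
import re

def tokenize(text):
    # Simplified stand-in for default tokenization (an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

def glued_second_numeric(title):
    # Glue each adjacent pair whose second token consists only of ASCII digits.
    tokens = tokenize(title)
    return [
        first + second
        for first, second in zip(tokens, tokens[1:])
        if second.isascii() and second.isdigit()
    ]

titles = [
    "xt 850 action camera",
    "galaxy 24 ultra",
    "playstation 5 slim",
    "iphone 5se case",
    "canon eos 80d body",
    "thinkpad x1 carbon",
]
for title in titles:
    print(title, "->", glued_second_numeric(title))
# xt 850 action camera -> ['xt850']
# galaxy 24 ultra -> ['galaxy24']
# playstation 5 slim -> ['playstation5']
# iphone 5se case -> []
# canon eos 80d body -> []
# thinkpad x1 carbon -> []
```

Under this model, only the first three titles produce a glued form. Tokens such as 5se, 80d, and x1 contain letters, so a digits-only rule skips them.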
