Let Products Build Their Own Taxonomy | by Mirakl Labs | Jun, 2026 | Mirakl Tech Blog
Mirakl Tech Blog
Deep dives with members of the Mirakl engineering, product and data teams who are at the forefront of the enterprise marketplace revolution.
Let Products Build Their Own Taxonomy
Mirakl Labs
11 min read · 22 hours ago
By Robin Brochier & Robin Vaysse, AI Researchers at Mirakl
How we let 3 million unique products define a 16,000-category pivot taxonomy and turned a combinatorial mapping problem into a linear one.

The hidden tax of every marketplace

Picture a seller listing the same cordless drill on two different marketplaces. On the first, it lives under “Impact drill” with an attribute “Voltage: 12V”. On the second, it’s a “Perceuse à percussion” (French equivalent) with “Tension: 12 volts”. Same product, two taxonomies, two attribute schemas, two manual mapping efforts.

Now scale that up. Every marketplace maintains its own product taxonomy, and every seller has to re-categorize their entire catalog into the specific format of each marketplace they list on. With M sellers and N marketplaces, this naive approach requires M × N distinct taxonomy mappings.

This is not a cosmetic problem. Misclassification directly hurts search visibility, recommendation reach, and conversion.

There is a way out: a single shared pivot taxonomy. If every seller maps their catalog to the pivot once (M mappings), and every marketplace maintains one mapping from the pivot to its own taxonomy (N mappings), the effort collapses from M × N to M + N.
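To make the M × N versus M + N argument concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Mirakl's implementation: the function names, the dictionary layout, the seller-side category "Cordless drills", and the pivot category "Impact Drills" are all hypothetical. Only the two marketplace category names come from the drill example above.

```python
def mapping_counts(sellers: int, marketplaces: int) -> tuple[int, int]:
    """Return (direct, pivot) mapping counts for M sellers and N marketplaces."""
    direct = sellers * marketplaces  # every seller maps to every marketplace
    pivot = sellers + marketplaces   # each party maps to or from the pivot once
    return direct, pivot


# Hypothetical seller-side mapping: the seller's own category -> pivot category.
seller_to_pivot = {"Cordless drills": "Impact Drills"}

# Hypothetical marketplace-side mappings: pivot category -> marketplace category.
pivot_to_marketplace = {
    "marketplace_a": {"Impact Drills": "Impact drill"},
    "marketplace_b": {"Impact Drills": "Perceuse à percussion"},
}


def route(seller_category: str, marketplace: str) -> str:
    """Compose seller -> pivot -> marketplace to find the target category."""
    pivot_category = seller_to_pivot[seller_category]
    return pivot_to_marketplace[marketplace][pivot_category]


if __name__ == "__main__":
    for m, n in [(10, 4), (1000, 50)]:
        direct, pivot = mapping_counts(m, n)
        print(f"M={m}, N={n}: direct={direct}, pivot={pivot}")
    print(route("Cordless drills", "marketplace_b"))
```

The seller maintains a single mapping no matter how many marketplaces are added. Onboarding a new marketplace means adding one entry to `pivot_to_marketplace`, which every existing seller then reuses.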
With 5 sellers and 3 marketplaces, for example, the pivot nearly halves the work: 8 mappings instead of 15. In reality at Mirakl, we operate with 100,000+ sellers and 450+ marketplaces, where direct mapping could mean on the order of 45 million transformations (100,000 × 450). The pivot brings that number down to roughly 100k.

The hard part in this approach, of course, is building a pivot taxonomy good enough to sit in the middle. This post is about how we did it at Mirakl: a bottom-up approach built from real product data, with a pipeline that combines embeddings and large language models, and what we learned along the way.

What makes a taxonomy “good”

A taxonomy that can serve as a universal pivot needs three things:

Coverage. Whatever the product, a sensible category should already exist. No generic catch-all bucket.

Bounded granularity. Categories must be precise enough to separate products, but not so fine that they degenerate into per-product labels. In most cases, an attribute value should not leak into a category name. It’s “T-Shirts”, not “Red Cotton T-Shirts”.

Evolvability. You should be able to add or adjust categories on the fly as the catalog changes.

The classic way to get there is top-down: domain experts design a hierarchy up front. Such expert-designed taxonomies are valuable references, but the approach has well-known weaknesses. It bakes in subjective design choices, its coverage is limited to what the designers imagined, it requires constant maintenance as products evolve, and it’s especially fragile in long-tail domains (precisely where fine category distinctions matter most).
We took the opposite route. Rather than impose a structure and force products into it, we let the product data drive the formation of the taxonomy.

Designing the pivot: the constraints we committed to

Before building anything, we fixed a small set of deliberate design constraints:

Flat. A single layer of categories, no parent-child hierarchy. Committing to a tree would inevitably misalign with some marketplace’s own tree, so we don’t commit at all.

English-only category names. One language reduces ambiguity in cross-marketplace mapping. Handling other languages is the job of the target-taxonomy mapper, not the pivot.

Typed attributes per category. Each category carries the schema of attributes its products should declare, such as voltage, capacity, and material.

Incrementally updatable. New products and new categories can be added without recomputing the whole taxonomy.

These are trade-offs, but being explicit about them up front kept the rest of the engineering honest.

The construction pipeline

The pipeline runs in three stages: data preparation, iterative category creation, and per-category attribute extraction. We start from a corpus of more than 3 million unique products, identified by GTIN (the barcode-level unique identifier) and purchased from third-party data providers.

Stage 1: Data preparation

Each GTIN comes with one or more catalog records from different sources, varying wildly in language, completeness, and format. We normalize all of them into a single canonical record using a Qwen3-14B model that we LoRA fine-tuned to distill the normalization behavior of GPT-5.4. This allowed us to...