Evolutionary Data Making – How to train embedding models

Chris Gresla, 2026-03-17

TLDR

We needed a better way to acquire training data for the embedding model that powers search on our phone OS. Static data generation methods rely on heuristics to pair queries with relevant documents, capturing obvious associations but failing to scale or find nuanced data. Inspired by the principles of evolution, we built a search system where frontier LLMs explore, grade, and refine data generation policies, guided by a constitution of quality principles we call The Good Data Manifesto. A 0.6B parameter model trained on this data improved NDCG@10 by 37% and won or tied 82% of blind head-to-head comparisons on real user queries.

[Interactive figure: Data Generation Policies, comparing samples from a Static Policy and an Evolutionary Policy]

The Distribution Matching Problem

One can reduce all of inductive machine learning to two halves: a dataset and an algorithm for learning the associations in that dataset. In the sub-field of text retrieval, learning algorithms are quite mature. The canonical recipe [1, 2] is to train dense transformer models that project queries and documents into a shared embedding space, where useful associations between data points are captured through geometric proximity under a model’s representation. Methods like Multiple Negatives Ranking loss [3], together with advances in hard negative mining [4] and multi-task training, have remained stable and effective choices for years. The algorithmic half of the problem is, in a sense, solved.
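To make the in-batch objective concrete, here is a minimal sketch of Multiple Negatives Ranking loss in plain Python. It illustrates the general technique and is not our training code: the function names, the toy 2-dimensional embeddings, and the similarity scale of 20 are all assumptions chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy over a batch of (query, document) pairs.

    Document i is treated as the positive for query i; every other
    document in the batch serves as a negative. `scale` (an assumed
    value) sharpens the softmax over cosine similarities.
    """
    losses = []
    for i, q in enumerate(query_embs):
        logits = [scale * cosine(q, d) for d in doc_embs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the paired document.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch: two queries, with documents either correctly paired or swapped.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = mnr_loss(queries, aligned)
loss_swapped = mnr_loss(queries, swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each query sits closest to its own document and high when the pairing is wrong, so the objective can only teach the associations that the training pairs actually contain.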

The dataset half is not. Among top embedding model providers, the landscape is strikingly opaque. Qwen3 Embedding [5] outlines its synthetic data pipeline at a high level but omits key details and does not release the training dataset. Cohere, Voyage AI, OpenAI, and Mistral have not published research papers or released data describing their embedding training methodology. Jina [6] releases model weights but not the curated fine-tuning data. A few groups have bucked this trend: Nomic [7] released the full 235M training pairs for Nomic Embed, and BAAI published the data behind the BGE family [8]. But even these releases contain the data itself rather than the process that created it. The methodology for generating high-quality retrieval training data remains largely unshared. We believe the data generation process is the more consequential half of the problem, and that sharing how it works is as important as sharing the data itself.

At Wafer, we are building a mobile operating system that understands you as well as you do. Part of our vision for what computing should be involves making all of your data transparently accessible. Users should be able to surf over their personal data as they enjoy surfing the web today. Paramount to our search system is the embedding model described in this post: the retrieval layer that connects natural-language queries to information scattered across a user’s digital life, including emails from Gmail, group chats in WhatsApp, late-night transactions in Venmo, planning sessions in Slack, travel itineraries from your Calendar, research notes in Notion, and those esoteric songs you found on YouTube at 2am [9]. As an example, the phone of one of our founding team members indexes data from 57 distinct applications, totaling ~194,000 individual sources drawn over the course of a few months of usage.

Our OS enables users to access their data in — unfortunately, and beautifully — novel ways. Consider a brief trip to Denver you made with your family. You want to know: “How much did I spend on my Denver trip?” To answer this yourself, you would have to go through a mental accounting exercise — thinking through things like “okay what all did I do, where did we go, what did we eat?” — and then opening each disparate application to piece together a picture of what you spent, all the while tracking everything manually. But there are data points scattered across the applications on your phone that we could use to get this information autonomously: a cash withdrawal notice from your bank, an Airbnb confirmation email, Venmo paybacks from friends, a Google Maps timeline of places visited, calendar events for flights and activities. From these disjoint pieces of information, our phone can assemble a meaningful and comprehensive answer for the user.

Unfortunately, these varied data points are not semantically similar in the classical sense, nor are they necessarily syntactically similar. Useful textual information relates to queries through a nearly infinite set of characteristics: relevance over time, causality, shared themes, common entities, project membership, transactional context, and more. We frame this challenge as a distribution matching problem [10]: the distribution of queries and the distribution of useful information and answers occupy different regions in the space of all text, connected by latent structures that are rich, varied, and difficult to fully enumerate in a vacuum.

The combinatorial space of...
