AI Datasets Are Never Done. So We Built Loon. - Zilliz blog
Introducing Loon: A New Storage Engine for Vector Data That Never Stops Changing<br>Jun 05, 2026<br>29 min read
Ted Xu
Key takeaways
This is a long, in-depth engineering dive, so here are the key points before we get into the details.
AI datasets are not static tables. The same rows keep changing as teams replace embedding models, add sparse vectors, revise captions, backfill labels, rebuild indexes, and run offline analysis.
Traditional storage layouts break down in three ways: long vector columns make backfills expensive, a single file format cannot serve both scans and point reads well, and private database storage forces external pipelines to create extra copies of the truth.
Loon is the new storage engine for Milvus and Zilliz Vector Lakebase. It is built around hybrid file formats, row ID alignment, and a Manifest that defines the dataset’s versioned state.
The goal is to enable a single vector dataset to support online search, offline analysis, backfills, compaction, and external compute without constantly copying, rewriting, or reimporting data.
Introduction
For a while, there was one argument against vector databases that sounded reasonable.
Traditional databases already store integers, strings, JSON, blobs, and indexes. Why not add a _vector_ type, build an ANN index beside it, and call it a day?
For early semantic search, that works well enough. A vector column plus an index can support a demo, a small RAG application, or an internal search feature. The problem shows up later, when the dataset starts behaving less like a table and more like an AI data system.
A production vector dataset has rows, primary keys, scalar fields, and queryable columns. In that sense, it looks like a database table. But it also has the scale and workflow shape of a data lake. It may contain hundreds of millions of records. It is repeatedly read and rewritten by Spark, Ray, DuckDB, training pipelines, evaluation jobs, and data quality systems.
It also depends on object storage. The source objects are often videos, images, PDFs, audio files, or web documents that remain in S3, GCS, OSS, or another object store. The database stores references, metadata, derived features, and indexes. Then it adds things traditional storage models were not built to manage as first-class objects: dense embeddings, sparse vectors, captions, vector indexes, text indexes, delete logs, statistics, model versions, parser versions, external blob references, and the version relationships between all of them.
That is where “just add a vector column” starts to break down. The issue is not whether a database can store vector bytes. Many systems can. The harder question is whether the storage model can handle how vector data changes, how it is queried, and how it is shared across the AI data stack.
This is why we built Loon, the new storage engine for Milvus and Zilliz Vector Lakebase (the next evolution of Zilliz Cloud).
Loon is designed around three ideas:
Use different physical formats for different kinds of columns.
Align those columns through a shared row ID space.
Use a Manifest to define the dataset's versioned state.
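As a rough illustration of how these three ideas fit together, here is a toy Python model. It is not Loon's actual API or on-disk layout: the class names, fields, file paths, and format labels are all hypothetical, chosen only to show column groups in different formats, aligned by row ID, with an immutable manifest describing each version.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ColumnGroupFile:
    """Hypothetical: one physical file holding some columns for a row ID range."""
    path: str                 # made-up file name, not a real Loon path
    file_format: str          # illustrative label, e.g. "scan-optimized"
    columns: Tuple[str, ...]
    first_row_id: int
    num_rows: int


@dataclass(frozen=True)
class Manifest:
    """Hypothetical: an immutable description of one dataset version."""
    version: int
    files: Tuple[ColumnGroupFile, ...] = ()

    def add_column_group(self, new_file: ColumnGroupFile) -> "Manifest":
        # A backfill adds a file aligned to existing row IDs and produces
        # a new version; the files of the previous version are untouched.
        return Manifest(self.version + 1, self.files + (new_file,))

    def files_for_row(self, row_id: int) -> List[ColumnGroupFile]:
        # Row ID alignment: every file covering this row ID contributes columns.
        return [
            f for f in self.files
            if f.first_row_id <= row_id < f.first_row_id + f.num_rows
        ]


scalars = ColumnGroupFile("scalars-0", "scan-optimized", ("clip_id", "caption"), 0, 1000)
embeddings = ColumnGroupFile("emb-0", "point-read-optimized", ("embedding",), 0, 1000)

v1 = Manifest(version=1, files=(scalars,))
v2 = v1.add_column_group(embeddings)   # v1 still describes the older state
```

In this sketch, adding the embedding column does not rewrite `scalars-0`; it only produces a new manifest version that references one more file.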
To see why those pieces matter, let's start with a common multimodal workflow.
A vector dataset is never really finished.
Imagine an AI team building a video dataset for multimodal training.
A long video is uploaded to object storage. A pipeline cuts it into clips based on scene changes, shot boundaries, or time windows. Clips that are too long or too short, blurry, duplicated, or low-quality are filtered out. The remaining clips are scored by an aesthetic model, captioned by another model, embedded by a vision-language model, and stored in a vector database for search, deduplication, and training-data filtering.
At a high level, the workflow looks simple:
video<br>→ clips<br>→ metadata<br>→ aesthetic_score<br>→ caption<br>→ embedding<br>→ search / dedup / training data filtering
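The workflow above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the record fields, function names, thresholds, and the example URI are hypothetical, and the scoring, captioning, and embedding models are replaced by placeholder callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClipRecord:
    clip_id: str
    video_uri: str                      # reference to the source object in object storage
    start_s: float
    end_s: float
    aesthetic_score: Optional[float] = None
    caption: Optional[str] = None
    embedding: Optional[List[float]] = None


def cut_into_clips(video_uri: str, duration_s: float, window_s: float) -> List[ClipRecord]:
    """Cut by fixed time windows; scene or shot detection would replace this."""
    clips, start, index = [], 0.0, 0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        clips.append(ClipRecord(f"{video_uri}#{index}", video_uri, start, end))
        start, index = end, index + 1
    return clips


def keep(clip: ClipRecord, min_s: float, max_s: float) -> bool:
    """Length filter only; blur, duplicate, and quality checks are omitted."""
    return min_s <= clip.end_s - clip.start_s <= max_s


def enrich(clips: List[ClipRecord],
           score_fn: Callable[[ClipRecord], float],
           caption_fn: Callable[[ClipRecord], str],
           embed_fn: Callable[[ClipRecord], List[float]]) -> List[ClipRecord]:
    # Each model fills in one derived column on the same rows.
    for clip in clips:
        clip.aesthetic_score = score_fn(clip)
        clip.caption = caption_fn(clip)
        clip.embedding = embed_fn(clip)
    return clips


clips = cut_into_clips("s3://example-bucket/video.mp4", duration_s=25.0, window_s=10.0)
kept = [c for c in clips if keep(c, min_s=8.0, max_s=12.0)]
enrich(
    kept,
    score_fn=lambda c: 0.5,                    # placeholder aesthetic model
    caption_fn=lambda c: "placeholder caption",  # placeholder captioning model
    embed_fn=lambda c: [0.0, 0.0, 0.0],        # placeholder embedding model
)
```

Note that every derived field starts out empty and is filled in by a separate step, so the same rows are written more than once even in this simple version.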
But the dataset does not arrive fully formed.
In the first week, the table may only...