We spent 8 years making vector databases faster. Then we stopped.

Zilliz blog · May 12, 2026 · 13 min read

James Luan


Cost matters. It always has. But there's an order: you can only cut costs after you've met the performance bar. A system that's cheap but returns wrong results isn't useful. Neither is one that can't hold latency under load.

Milvus started in 2017 with a simple belief: vector databases would become core data infrastructure, not a feature hidden inside an application. For eight years, that belief led us in one direction: make vector search faster and more predictable. Index compression, segment scheduling, HNSW tuning, prefetch strategies — almost every major optimization pointed at the same thing: get data into local cache and search faster.

That work is still the foundation. Always-on serving is the right architecture for high-QPS, low-latency vector search workloads. If a collection is queried constantly, keeping indexes resident in memory is not waste — it is the cost of serving the product experience.

Then we turned to cost. Tiered storage helped — hot segments in memory, cold data on disk and object storage, real savings. But the nodes never turned off. For a workload that runs five hours a month, you were still paying for the other 715.
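The arithmetic behind that last sentence, as a quick sketch. It assumes a 30-day month (720 hours), which is what the 715 figure implies; the function names are illustrative only.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month, matching the 715 figure


def idle_hours(active_hours, hours_in_month=HOURS_PER_MONTH):
    """Hours in the month an always-on node is provisioned but not serving."""
    return hours_in_month - active_hours


def utilization(active_hours, hours_in_month=HOURS_PER_MONTH):
    """Fraction of provisioned hours that actually serve queries."""
    return active_hours / hours_in_month


print(idle_hours(5))            # 715
print(f"{utilization(5):.2%}")  # 0.69%
```

Under these assumptions, an always-on deployment for that workload is in use less than 1% of the time it is billed for.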

That gap is one of the problems the new Zilliz Vector Lakebase is designed to solve. The bigger shift is not simply “make vector search cheaper.” It is to let persistent semantic data support more than one compute lifecycle: always-on serving when latency and throughput matter, and on-demand compute when the data needs to stay queryable but does not need dedicated machines running all month.

The physics behind the always-on serving model

S3 read latency is 20–50 ms per request. HNSW graph traversal touches hundreds of nodes per query. Put those two numbers together and the conclusion is obvious: vector indexes have to live in local memory to serve queries. Not a design flaw — physics.
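A back-of-the-envelope version of that argument. The sketch assumes the worst case, where every node visit is a separate, sequential object-storage read, and it uses 200 visits as a stand-in for "hundreds". Real systems may batch or parallelize some reads, so treat the numbers as an illustration, not a measurement.

```python
def traversal_latency_ms(node_visits, read_latency_ms):
    """Total latency if each visited graph node costs one sequential read."""
    return node_visits * read_latency_ms


NODE_VISITS = 200  # assumption: stand-in for "hundreds of nodes per query"

for s3_ms in (20, 50):  # the S3 latency range quoted above
    total_s = traversal_latency_ms(NODE_VISITS, s3_ms) / 1000
    print(f"{s3_ms} ms per read x {NODE_VISITS} visits = {total_s:.0f} s per query")
```

With these inputs a single query lands somewhere between 4 and 10 seconds, which is orders of magnitude away from an interactive latency budget.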

To make this concrete: 100M vectors, 768 dimensions, float32. Raw vector data is ~286 GB; the HNSW graph (M=48) adds another ~55 GB in neighbor links — roughly 340 GB total.
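The raw-vector figure can be checked directly. In this sketch only the raw size is computed; the ~55 GB graph overhead is taken from the text as given, not derived, and the function name is illustrative.

```python
GIB = 2**30  # the ~286 "GB" figure matches binary gigabytes (GiB)


def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Size of the raw vectors alone; float32 is 4 bytes per value."""
    return num_vectors * dim * bytes_per_value


raw_gib = raw_vector_bytes(100_000_000, 768) / GIB
graph_gib = 55  # taken from the text above, not derived here

print(f"raw vectors: {raw_gib:.0f} GiB")              # 286 GiB
print(f"with graph:  {raw_gib + graph_gib:.0f} GiB")  # 341 GiB
```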

Traditional Milvus QueryNode model:

┌──────────────────────────────────────────────────────────────┐
│  Traditional Milvus architecture                             │
│                                                              │
│  100M × 768-dim float32 → ~340 GB split across 3 QueryNodes  │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │ QueryNode 1  │  │ QueryNode 2  │  │ QueryNode 3  │        │
│  │ 128GB RAM    │  │ 128GB RAM    │  │ 128GB RAM    │        │
│  │ + NVMe       │  │ + NVMe       │  │ + NVMe       │        │
│  │ seg 0-99     │  │ seg 100-199  │  │ seg 200-299  │        │
│  │ (~113 GB)    │  │ (~113 GB)    │  │ (~113 GB)    │        │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘        │
│         │ load()          │ load()          │ load()         │
│         └─────────────────┼─────────────────┘                │
│                           ▼                                  │
│               ┌───────────────────────┐                      │
│               │ S3 (source of truth)  │                      │
│               │ 340 GB full dataset   │                      │
│               └───────────────────────┘                      │
│  Collection queryable only when all 340 GB are loaded        │
│  Node fails → its segments go dark → reload from S3          │
└──────────────────────────────────────────────────────────────┘
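The two rules at the bottom of the diagram can be expressed as a toy model. This is a sketch for illustration only: the `QueryNode` class and `is_queryable` function here are hypothetical and are not Milvus APIs.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """Hypothetical stand-in for a node holding segments in memory."""
    name: str
    segments: set = field(default_factory=set)
    alive: bool = True


def is_queryable(nodes, total_segments):
    """True only if every segment is resident on at least one live node."""
    resident = set()
    for node in nodes:
        if node.alive:
            resident |= node.segments
    return resident >= set(range(total_segments))


# Layout from the diagram: 300 segments spread over three nodes.
nodes = [
    QueryNode("QueryNode 1", set(range(0, 100))),
    QueryNode("QueryNode 2", set(range(100, 200))),
    QueryNode("QueryNode 3", set(range(200, 300))),
]
print(is_queryable(nodes, 300))  # True

nodes[1].alive = False           # one node fails; segments 100-199 go dark
print(is_queryable(nodes, 300))  # False
```

The point of the model is the all-or-nothing check: losing any one node makes the whole collection unqueryable until its segments are reloaded.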

Every segment needs a resident node before the collection is queryable. 340 GB of data, three 128 GB machines, running 24/7. For frequently queried collections, this works fine. Then AI changed the demand pattern.

Product teams run two-week A/B experiments, after which those embeddings are never queried again. In SaaS products, 90% of users didn't log in last week. In RAG knowledge bases, 80% of documents haven't been retrieved in the past month. The data isn't useless (it might be queried at any time), but it's rarely queried. Traditional databases handle this with tiering: hot data in memory, cold data on disk, paged in on demand. Vector databases had no such concept. Either you loaded the entire collection, or it wasn't queryable.

Before AI-generated embeddings became widespread, that binary wasn't a problem. Most vector workloads were either clearly online serving systems, where keeping indexes resident in memory made sense, or offline experiments that could tolerate bespoke pipelines. AI filled in the middle ground between the two.

We started seeing this shift in customer conversations. Embeddings were no longer just powering production RAG chatbots. A global GPU leader was embedding autonomous driving data — camera frames, driving sessions, weather, location, timestamps, and other metadata — so engineers could mine rare driving scenarios across tens of billions of vectors. An education technology company was using semantic search for...
