A Double Shot of DuckDB

A Double Shot of DuckDB - peterdohertys.website

Published on: 5/27/2026

Overview

This is the third entry in my series of exploratory posts about DuckDB. Readers should have no trouble following along but you may want to start with A Dab of DuckDB if DuckDB is entirely new to you.

As the title suggests, in this post we will dive into two DuckDB topics: vector similarity search (VSS) and the new Quack Protocol. I was originally going to focus vector support but Quack was announced while I was working on this post and I couldn’t resist including it once I “got it”. I also think there’s a compelling case for how they can be used together in modern applications, as we’ll see in the comprehensive demo at the end of the post.

Vector Similarity Search

As mentioned in my last post about full-text search in DuckDB, I was very keen to learn about the state of vector search in the DuckDB ecosystem. I’m just coming off an ambitious project on which we relied heavily on the pgvector Postgres extension. I also have experience with vector storage/search solutions, like pgvectorscale and Weaviate.

So, I was pleasantly surprised to learn that DuckDB offers vector similarity search via the VSS core extension and that you can start using it in seconds with just a few SQL statements:

INSTALL vss; LOAD vss;

CREATE TABLE IF NOT EXISTS frames ( id UUID PRIMARY KEY, embedding FLOAT[512] NOT NULL, );

-- see Cosine Similarity Demo below for query interface / comparison function If you’re exploring an unfamiliar data set this functionality will undoubtedly yield results you would have otherwise missed if ssearching uing concrete terms (“Alice never responded” vs. “my sister never replied”). In contrast with full-text search, as powerful as it is, this feature set is next level and truly feels magical. Though, this power comes at a cost – especially if you lean on HNSW indexes for performance, which you will probably want to do. Indexes require time/space to compute and can go stale (e.g. post-delete but pre-rebuild). Also, because of the “approximate” nature of HNSW, you may see different results between brute force queries and queries utilizing indexes. See the “Inserts, Updates, Deletes and Re-Compaction” section of the docs.

DuckDB offering powerful features with this level of simplicity should no longer surprise me but … it still does. It’s just so refreshing to be spared from having to mess around with c/make, Docker or signing up for some web service before you can start experimenting.

I also want to make sure I call attention to just how nice DuckDB’s EXCLUDE clause is when working with vector columns. When working with pgvector, I found myself having to remember to only output certain columns for fear of accidentally dumping thousands of lines of vector output into my terminal when naively using SELECT * FROM some_table. With DuckDB, you can instead use SELECT * EXCLUDE (emeddings) FROM frames; without fear of spamming your terminal.

Indexes

VSS offers the following index metrics:

(The linear algebra and applications of index types are outside the scope of this post but I’ll note where these indexes are generally useful inline.)

- [DEFAULT] Euclidean distance (geospatial, measurements, color space, etc.) - Cosine similarity (semantic/similarity search) - Negative inner product (recommendations and more efficient than cosine for L2-normalized vectors) Each metric type has a corresponding comparison function: array_distance, array_cosine_distance and array_negative_inner_product.

There are also various options which can be configured when creating indexes with the usual time/space considerations. See the docs.

It’s worth calling attention to this note in the docs about index creation:

Note that while each HNSW index only applies to a single column you can create multiple HNSW indexes on the same table each individually indexing a different column. Additionally, you can also create multiple HNSW indexes to the same column, each supporting a different distance metric.

So, it’s worth internalizing that indexes are never created automatically. Also, you can use the various comparison functions for brute force searches (more accurate and reliable but wholly unoptimized), which are good for experimentation and the avoidance of premature optimization (i.e. index creation and tuning).

Cosine Similarity Demo

This demo requires a non-trivial amount of supporting infrastructure, as we need to use PyTorch, Transformers and a model (e.g. openai/clip-vit-base-patch32) to create the embeddings for the data set and the query string. I will include the complete script in the demo repository linked at the end of the post.

Assuming we have the following table:

INSTALL vss; LOAD vss;

CREATE TABLE IF NOT EXISTS documents ( id INTEGER PRIMARY KEY, content TEXT NOT NULL, embedding FLOAT[512] NOT NULL, ); And we populate it by creating rows containing embeddings for a list of sentences like:

CORPUS =...

A Double Shot of DuckDB

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy

SpaceX not the behemoth everyone thought

The Mirror Is Part of the Machine