Scalar and binary quantization for pgvector vector search and storage |<br>Jonathan Katz<br>Jonathan Katz
Scalar and Binary Quantization for Pgvector Vector Search and Storage
Tue, Apr 9, 2024
21-minute read<br>While many AI/ML embedding models generate vectors that provide large amounts of information by using high dimensionality, this can come at the cost of using more memory for searches and more overall storage. Both of these can have an impact on the cost and performance of a system that’s storing vectors, including when using PostgreSQL with the pgvector for these use cases.<br>When I talk about vector search in PostgreSQL, I have a slide that I like to call “no shortcuts without tradeoffs” that calls out the different challenges around searching vectors in a database. I first like to highlight that a 1,536 dimensional vector with 4-byte floating point (fp32) dimensions requires 6KiB of storage. That may not seem like a lot, but storing 1,000,000 of these vectors in a PostgreSQL database requires 5.7GB of storage without any indexing. And again, that may not seem like a lot, but a single database row is closer to 600B of data (and typically even less than that) than 6KB.<br>The next point I talk about is compression: unfortunately, you just can’t compress a bunch of seemingly random floating point numbers. However, there are techniques to reduce the amount of information that you can store, from principal component analysis (aka PCA), which can reduce the overall dimensionality, to quantization techniques that can reduce the size or overall number of dimensions of a vector. These can reduce your overall storage and memory requirements – and perhaps even improve performance in certain areas, but recall that I mentioned there are no shortcuts without tradeoffs. To use quantization for vector searches, we’ll have to give something up, whether it’s in the overall relevancy of our searches (recall) or in some area of performance in our system.<br>There are three common quantization techniques around vector databases:<br>Scalar quantization , which reduces the overall size of the dimensions to a smaller data type (e.g. a 4-byte float to a 2-byte float or 1-byte integer).<br>Binary quantization , which is a subset of scalar quantization that reduces the dimensions to a single bit (e.g. > 0 to 1, to 0).<br>Product quantization , which uses a clustering technique to effectively remaps the original vector to a vector with smaller dimensionality and indexes that (e.g. reduce a vector from 128-dim to 8-dim).<br>There is already a lot of content available that does better exploration and explanation of these techniques than I will; instead I’ll focus on how to use 2 of these 3 techniques in the context of pgvector.<br>The upcoming pgvector 0.7.0 release is planning to include functionality that allows for you to use both scalar and binary quantization as part of your indexing strategy. Specifically, pgvector 0.7.0 adds support for indexing 2-byte floats (halfvec) and bit/binary vectors (using the PostgreSQL bit data type). The 0.7.0 will also add support for sparse vectors via sparsevec, but that will be for a future blog post.<br>ASIDE : These quantization techniques also let you index vectors that are larger than 2,000 dimensions , which has been a major request for pgvector. halfvec can store up to 4,000 dimensions, bit can store up to 64,000 dimensions, and sparsevec can store up to 1,000 nonzero elements (which means it can store vectors with very high dimensionality).<br>Using existing pgvector functionality, such as HNSW indexing, along with PostgreSQL features like expression indexes, we can explore how scalar and binary quantization can help reduce both space and memory consumption – letting us scale vector workloads even further on PostgreSQL, understand what performance gains and tradeoffs we must make, how embedding models can impact results, and determine when it makes sense to use quantization.<br>Test setup and system configuration<br>Before diving in, let’s do a quick review of how we want to test these techniques. I previously provided guidance on testing approximate nearest neighbor (ANN) algorithms over vector indexes; below is a quick summary of what we will look at over the course of this testing:<br>Index size: This is a key metric with a quantization test, as we should be building indexes that are smaller than a full (or “flat”) vector index. The interesting data is “how much smaller,” and the impacts to the other metrics we’ll review.<br>Index build time: How does quantization impact overall build time? If the build time is longer, do we gain enough in index size / recall / query performance that it’s OK to make that tradeoff? If the build time is shorter, do we lose anything in recall / query performance such that we should avoid quantization?<br>Queries per second (QPS): How does quantization impact how many queries we can execute per second, or our...