PGM-index:range searches, deletes, updates using orders of magnitude less space

The PGM-index

Home

Docs

The PGM-index

Home

Docs

The PGM-index

The Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees.

Source

PyGM

Docs

Papers

New The PGM-index now supports orthogonal range queries in k-dimensions!

New Added a page discussing the computational complexity

-->

Unlike traditional tree-based indexes that are blind to the possible regularity present in the input data, the PGM-index exploits a learned mapping between the indexed keys and their location in memory. The succinctness of this mapping, coupled with a peculiar recursive construction algorithm, makes the PGM-index a data structure that dominates traditional indexes by orders of magnitude in space while still offering the best query and update time performance.

In addition to that, the PGM-index offers compression , distribution-awareness , and multi-criteria adaptability, thus resulting suitable for addressing the increasing demand for big data systems that adapt to the rapidly changing constraints imposed by the wide range of modern devices and applications.

Features

Learned

It is one of the first results on learned indexes which achieves astonishing performance by capturing the distribution of the input data.

Optimal

It is the first learned index with provably optimal time and space complexity guarantees. This makes it resistant to adversarial inputs and queries.

Memory efficient

It always consumes less space than traditional tree-based indexes, often orders of magnitude less. If this is not enough, there is even a compressed version.

Fast construction

Its construction, based on a single scan of the input data, matches the efficiency of traditional indexes even on gigabytes of data.

Tunable

It can be tailored to various storage devices and memory hierarchies and auto-tuned to any given constraints on memory usage and query time.

Flexible

It can be beneficial in various applications, from databases to geographic information systems and search engines, as it supports several kinds of queries, from point to multidimensional.

Computational complexity

Let $n$ be the number of keys, and $B$ be the page size of the machine.

PGM-index B-tree Self-balancing BST† Skip list Sorted array

Predecessor query§ (static case) $\Oh(\log_B n)$ $\Oh(\log_B n)$ $\Oh(\log n)$ $\Oh(\log n)$ w.h.p. $\Oh(\log n)$

Predecessor query (dynamic case#) $\Oh(\log^2_B n)$ $\Oh(\log_B n)$ $\Oh(\log n)$ $\Oh(\log n)$ w.h.p. $\Oh(\log n)$

Insert/delete $\Oh(\log_B n)$ amortised $\Oh(\log_B n)$ $\Oh(\log n)$ $\Oh(\log n)$ w.h.p. $\Oh(n)$

Index space in words $\Oh(\frac{n}{B^2})$ w.h.p.‡ $\Oh(\frac{n}{B})$ $\Oh(n)$ $\Oh(n)$ w.h.p. $\Oh(1)$

§ Given a key $q$, a predecessor query returns the maximum key $k$ in the input set such that $k \leq q$. This is a more powerful operation than the basic lookup (aka membership query), which only aims to find whether a key belongs to the input set or not.

# The dynamic case corresponds to a PGM-index that support both predecessor queries and inserts/deletes.

&dagger; C++'s std::set and std::map are often implemented as self-balancing binary search trees (e.g. red-black trees or AVL trees).

&Dagger; Assuming the gaps between consecutive input keys taken in sorted order have finite mean and variance.

For more information, visit the computational complexity page.

Running example

#include #include #include #include #include "pgm/pgm_index.hpp"

int main() { // Generate some random data std::vectorint> data(1000000); std::generate(data.begin(), data.end(), std::rand); data.push_back(42); std::sort(data.begin(), data.end());

// Construct the PGM-index const int epsilon = 128; // space-time trade-off parameter pgm::PGMIndexint, epsilon> index(data);

// Query the PGM-index auto q = 42; auto range = index.search(q); auto lo = data.begin() + range.lo; auto hi = data.begin() + range.hi; std::cout *std::lower_bound(lo, hi, q);

return 0;

PGM-index:range searches, deletes, updates using orders of magnitude less space

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

It's Not Just X. It's Y

Show HN: GoPeek – open links in live mini browser windows without new tabs