How we made WINDOW JOIN parallel and vectorized



Consider a workload that comes up constantly on a trading desk: for every executed trade, attach the average bid and ask within a 1-second window around the trade. Without a dedicated operator it takes two joins, an ASOF JOIN for the carry-forward quote at the window start plus a range join for the rows inside the window, stitched with UNION ALL and folded with a GROUP BY:

    -- QuestDB timestamps are microseconds, so 1_000_000 is 1 second.
    WITH prevailing AS (
      -- ASOF-match against the window start (trade timestamp - 1 s),
      -- not the trade timestamp itself.
      SELECT t.orig_ts ts, t.symbol, p.bid, p.ask
      FROM (
        (SELECT (timestamp - 1000000) AS ts, symbol, timestamp AS orig_ts
         FROM trades) TIMESTAMP(ts)
      ) t
      ASOF JOIN prices p ON p.sym = t.symbol
    ),
    in_window AS (
      SELECT t.timestamp ts, t.symbol, p.bid, p.ask
      FROM trades t
      JOIN prices p ON p.sym = t.symbol
      WHERE p.ts > t.timestamp - 1000000
        AND p.ts <= t.timestamp + 1000000
    )
    SELECT ts, symbol, avg(bid) avg_bid, avg(ask) avg_ask
    FROM (SELECT * FROM prevailing UNION ALL SELECT * FROM in_window)
    GROUP BY ts, symbol;

This works, but it's a lot of SQL for a simple operation. The ASOF JOIN and the range JOIN walk the prices table independently even though they are answering two halves of the same question, and the range JOIN forces the planner to hash on sym and then re-filter every matched pair against the timestamp range predicate. The outer GROUP BY over ts is a hash aggregation that has to materialize a row per (ts, symbol) pair, which works out to 50 million groups in our test data. There is nothing here for the optimizer to fuse, parallelize cleanly, or vectorize.

WINDOW JOIN is QuestDB's dedicated syntax for aggregating one table over a time window around each row of another. The same query, dedicated operator:

    SELECT t.*, avg(p.bid) avg_bid, avg(p.ask) avg_ask
    FROM trades t
    WINDOW JOIN prices p ON p.sym = t.symbol
    RANGE BETWEEN 1 second PRECEDING AND 1 second FOLLOWING;

Now the operator knows what it is doing: for every row on the left-hand side of the join (LHS - trades here), find rows on the right-hand side (RHS - prices) whose timestamp falls inside a [lo, hi] window around the LHS timestamp, restrict to matching symbol keys, and reduce them with a batch of aggregate functions.

Making that fast comes down to two pieces: data-level parallelism over the LHS, plus a low-cardinality fast path that copies values into contiguous buffers so the SIMD aggregation kernels we already ship for SAMPLE BY run on window slices unchanged. Benchmarked against Timescale, DuckDB, and ClickHouse on a 50M-row trades table joined to a 150M-row prices table, the parallel + SIMD path runs 5.0x faster than QuestDB's own single-threaded fallback and 25x faster than ClickHouse's best rewrite.

Data-level parallelism

QuestDB stores data in append-only column files, partitioned by time. The query engine reads them as a sequence of page frames: contiguous, columnar slabs of memory that map directly onto file pages. Filtering and aggregation both work at this granularity: a page frame is the unit of dispatch to a worker thread.

WINDOW JOIN follows the same model. The LHS table is sliced into page frames; each worker takes a frame and is responsible for producing the aggregate result for every LHS row in that frame. To do that it needs the RHS rows that fall inside the union of all windows the frame covers.
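To make the dispatch model concrete, here is a minimal Python sketch of frame-level parallelism. It is not QuestDB's implementation: the names `split_into_frames`, `process_frame`, and `parallel_window_avg` are hypothetical, rows are plain tuples rather than columnar page frames, and the per-frame work is a deliberately naive scan. Python threads also won't give real CPU parallelism here; the point is only the shape of the plan, where each frame is an independent unit of work and results are stitched back in frame order.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_frames(rows, frame_size):
    # Hypothetical stand-in for page frames: fixed-size contiguous slices.
    return [rows[i:i + frame_size] for i in range(0, len(rows), frame_size)]

def process_frame(frame, prices, w_lo, w_hi):
    # One worker owns one frame and produces one result per LHS row in it.
    # Naive scan over all prices, kept simple on purpose.
    results = []
    for ts, sym in frame:
        bids = [bid for p_ts, p_sym, bid in prices
                if p_sym == sym and ts - w_lo <= p_ts <= ts + w_hi]
        results.append(sum(bids) / len(bids) if bids else None)
    return results

def parallel_window_avg(trades, prices, w_lo, w_hi, frame_size=2, workers=4):
    frames = split_into_frames(trades, frame_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in submission order, so output rows
        # line up with the original LHS row order.
        per_frame = pool.map(lambda f: process_frame(f, prices, w_lo, w_hi), frames)
        return [r for frame_result in per_frame for r in frame_result]

trades = [(10, "A"), (12, "B"), (20, "A"), (25, "B"), (40, "A")]
prices = [(9, "A", 3.0), (11, "B", 10.0), (12, "A", 5.0),
          (19, "A", 7.0), (26, "B", 20.0)]
averages = parallel_window_avg(trades, prices, w_lo=2, w_hi=2)
```

Because no frame depends on another frame's output, the workers need no coordination beyond picking up the next frame.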

Concretely, for a frame whose LHS timestamps run from tLo to tHi with a [-w_lo, +w_hi] window, the worker needs RHS rows in [tLo - w_lo, tHi + w_hi]. Locating that slice cheaply is what makes the parallel plan viable, and the enabler is QuestDB's storage layout: rows in both tables are kept in designated timestamp order on disk, so the RHS slice for any time range collapses to a single binary search per worker rather than a scan per LHS row.
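The slice lookup can be sketched in a few lines of Python, assuming the RHS timestamps are available as a sorted list. The function name `rhs_slice_for_frame` and its parameters are illustrative, not QuestDB internals, and this sketch locates each end of the slice with its own `bisect` call.

```python
from bisect import bisect_left, bisect_right

def rhs_slice_for_frame(rhs_ts, t_lo, t_hi, w_lo, w_hi):
    """Return [start, end) indexes of RHS rows in [t_lo - w_lo, t_hi + w_hi].

    Assumes rhs_ts is sorted ascending, mirroring designated timestamp order.
    """
    start = bisect_left(rhs_ts, t_lo - w_lo)   # first row >= lower bound
    end = bisect_right(rhs_ts, t_hi + w_hi)    # one past last row <= upper bound
    return start, end

rhs_ts = [0, 10, 20, 30, 40, 50, 60]
# Frame covers LHS timestamps 20..30 with a [-5, +10] window,
# so the worker needs RHS rows in [15, 40].
start, end = rhs_slice_for_frame(rhs_ts, t_lo=20, t_hi=30, w_lo=5, w_hi=10)
needed = rhs_ts[start:end]
```

The cost is logarithmic in the RHS size and paid once per frame, regardless of how many LHS rows the frame holds.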

Then, for the join keys present in the LHS frame, the worker builds a small in-memory index from the RHS slice: a per-key list of RHS timestamps, plus per-key arrays of the values to aggregate. Once that index is built, the inner loop over LHS rows is just two binary searches per row, one for the window's low bound and one for its high, followed by an aggregate over the resulting contiguous range. Both binary searches walk forward monotonically, so they amortize across rows in the same frame.
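The per-frame index and inner loop can be sketched as follows. This is an illustrative Python model under stated assumptions, not the engine's code: `build_index` and `aggregate_frame` are hypothetical names, rows are tuples, the only aggregate is an average, and both inputs are assumed sorted by timestamp. Each binary search resumes from the position the previous row for the same key reached, which is one way to model bounds that only move forward.

```python
from bisect import bisect_left, bisect_right

def build_index(rhs_rows, wanted_keys):
    # Per-key timestamp list plus a parallel per-key list of values.
    ts_by_key, val_by_key = {}, {}
    for ts, key, val in rhs_rows:
        if key in wanted_keys:
            ts_by_key.setdefault(key, []).append(ts)
            val_by_key.setdefault(key, []).append(val)
    return ts_by_key, val_by_key

def aggregate_frame(lhs_rows, rhs_rows, w_lo, w_hi):
    keys = {key for _, key in lhs_rows}
    ts_by_key, val_by_key = build_index(rhs_rows, keys)
    cursors = {}  # key -> (lo_idx, hi_idx) reached by the previous LHS row
    out = []
    for ts, key in lhs_rows:  # assumed sorted by ts
        kts = ts_by_key.get(key, [])
        vals = val_by_key.get(key, [])
        prev_lo, prev_hi = cursors.get(key, (0, 0))
        # Start each search where the last one ended: bounds never move back.
        lo_i = bisect_left(kts, ts - w_lo, prev_lo)
        hi_i = bisect_right(kts, ts + w_hi, prev_hi)
        cursors[key] = (lo_i, hi_i)
        window = vals[lo_i:hi_i]  # contiguous range for this row's window
        out.append(sum(window) / len(window) if window else None)
    return out

lhs = [(10, "A"), (12, "B"), (20, "A"), (21, "D")]
rhs = [(5, "A", 1.0), (9, "A", 3.0), (11, "B", 10.0),
       (12, "A", 5.0), (19, "A", 7.0), (30, "C", 100.0)]
frame_avgs = aggregate_frame(lhs, rhs, w_lo=2, w_hi=2)
```

Keeping the values in per-key arrays means each window is a contiguous slice, which is the property a vectorized aggregation kernel needs.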

Roughly:

                LHS page frames
         ┌─────┬─────┬─────┬─────┬─────┐
         │ F0  │ F1  │ F2  │ F3  │ ... │
         └──┬──┴──┬──┴──┬──┴──┬──┴─────┘
            │     │     │     │
     ┌──────┴┐ ┌──┴────┐│     │
     │worker0│ │worker1│...   │     workers pulled from a shared pool
     └───┬───┘ └──┬────┘      │     one frame at a time
         │        │           │
 ┌───────┴───────┐│...
