Multi-stage distributed query execution in ClickHouse Cloud

Introducing multi-stage distributed query execution in ClickHouse Cloud Open searchOpen region selectorEnglish Japanese

47.6kSign inGet started

->Scroll to top

BackBlog Engineering Copy pageCopied!More actionsView as Markdown Open this page in Markdown Open in ChatGPT Ask questions about this page Open in Claude Ask questions about this page Open in v0 Ask questions about this page

Introducing multi-stage distributed query execution in ClickHouse Cloud

Alexander Gololobov May 27, 2026 · 19 minutes read

TL;DR Multi-stage distributed execution gives ClickHouse Cloud a new way to scale one query across many nodes. It repartitions intermediate data between stages, removing key bottlenecks in large joins and high-cardinality aggregations.

Early TPC-H results show up to 3.4× speedups for join-heavy queries while retaining near-linear aggregation scaling: 7.4× faster on 8 nodes than on 1 node.

Scaling one query across many nodes #

ClickHouse has always been able to scale a single query across multiple nodes. In shared-nothing deployments, users do this with physical sharding and the Distributed table engine. In ClickHouse Cloud, parallel replicas brought intra-query scaling to shared storage.

These mechanisms work well for many analytical queries, but they were not the final answer for modern PB-scale workloads. They could fan out work across nodes, but they could not freely repartition intermediate results between execution stages. That limited how far ClickHouse could scale high-cardinality aggregations, and especially large joins.

Multi-stage distributed query execution is the next step. It gives ClickHouse Cloud a new way to parallelize a single query across the CPU and memory of all available nodes, without the bottlenecks of the previous execution models.

In this post, we introduce the new extension of ClickHouse’s query execution model and walk through how it works. We use a multi-table join as the running example because joins are among the hardest analytical workloads to scale, but the mechanism is much broader: it is a new foundation for distributed query execution in ClickHouse Cloud.

Before we look at the new mechanics, let’s review what came before and why those approaches weren’t enough for modern PB-scale workloads.

Why existing distributed execution was not enough #

The existing distributed execution was useful but not elastic enough for PB-scale workloads.

In shared-nothing open source deployments, ClickHouse scales a query by physically sharding data across nodes and querying those shards through a Distributed table. Each node processes its local slice, and the requester merges the results.

That works, but capacity is tied to the shard layout.

Bottleneck: capacity is tied to shard layout Adding compute does not automatically make one query faster. Large tables must first be redistributed across more shards.

Large joins across physically sharded tables expose a second limitation. A join only works when matching rows meet on the same machine. With a distributed JOIN, each node keeps its local left side, fetches the missing right-side shards from the other nodes, builds a full right-side hash table, and returns its local join result to the requester.

GLOBAL JOIN reduces the many-to-many network round-trip by computing the right side once and broadcasting it to every node.

But the core problem remains: large right sides still have to be copied across the cluster.

Bottleneck: large right sides are copied everywhere Distributed JOIN and GLOBAL JOIN handle network traffic in different ways, but both still make every shard join against a full right side.

ClickHouse Cloud removed the physical sharding problem by moving to shared storage. Any node can access the same table data, and parallel replicas allow multiple nodes to participate in a single query. Nodes can be added or removed instantly, with no data copying or reshuffling.

That made intra-query scaling in ClickHouse Cloud much more elastic. But parallel replicas still had a structural limitation: they could split work across replicas, but they could not freely repartition intermediate data between execution stages.

This shows up in two places.

First, joins. On a single node, ClickHouse can parallelize both sides of its default hash join strategy: it partitions rows by the join key into multiple hash tables, so both build and probe work can run across CPU cores. The same remains true inside each node when parallel replicas are used.

The limitation is one level higher. Across multiple nodes, dividing the build side itself would require a shuffle stage that repartitions both inputs by join key between nodes. Parallel replicas do not have that mechanism. The next best option is to distribute the left-side read ranges after primary-index pruning. That parallelizes probe-side work across nodes, but those ranges are not partitioned by the join key. A row in one left-side range can match rows anywhere in the right-side table,...

Multi-stage distributed query execution in ClickHouse Cloud

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy

SpaceX not the behemoth everyone thought

The Mirror Is Part of the Machine