Benchmarks and Obscurantism: A "red" line that should not be crossed

Melvyn Peignon Jun 29, 2026 · 13 minutes read

TL;DR

Databricks used its keynote to show ClickHouse “crashing” in a benchmark for Reyden, its new gated low-latency compute product. But the benchmark did not disclose the hardware, cost, configuration, cache settings, or enough methodology for anyone outside Databricks to reproduce or validate the result.

We tried to reproduce the claimed ClickHouse failure using the only clearly identified setup detail: TPC-H SF1 and Q6. ClickHouse did not crash. A single 30 vCPU node sustained about 420 QPS at sub-second P90 latency, and scaling to 15,000 QPS came down to straightforward sizing: roughly 30 to 40 untuned nodes.
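The sizing step above is simple arithmetic, and a minimal sketch of it might look like the following. It assumes throughput scales linearly with node count, which is a simplification, and the `headroom` parameter is a hypothetical safety margin added for illustration rather than a figure from the benchmark.

```python
import math


def nodes_needed(target_qps: float, per_node_qps: float, headroom: float = 0.0) -> int:
    """Estimate how many nodes are needed to serve target_qps.

    Assumes linear scaling: N nodes deliver N * per_node_qps.
    `headroom` is a hypothetical safety margin (0.10 means 10% spare
    capacity); it is an illustration, not a number from the benchmark.
    """
    if per_node_qps <= 0:
        raise ValueError("per_node_qps must be positive")
    if headroom < 0:
        raise ValueError("headroom must be non-negative")
    return math.ceil(target_qps * (1.0 + headroom) / per_node_qps)


if __name__ == "__main__":
    # Figures from the text: about 420 QPS per 30 vCPU node, 15,000 QPS target.
    print(nodes_needed(15_000, 420))                  # no spare capacity
    print(nodes_needed(15_000, 420, headroom=0.10))   # assumed 10% headroom
```

With the figures quoted above, both results land inside the 30 to 40 node range; the exact count depends on how much headroom you choose to keep.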

That is the real point of this post: benchmark results are useful only when they are open, reproducible, and detailed enough to inspect. Without that, they are claims you have to take on faith.

Why benchmark transparency matters

At ClickHouse, we love testing how our products perform across a variety of datasets and benchmarks. We strongly believe that benchmarking products in a transparent and reproducible manner is key to providing quality information to end users, and that it fosters a fair and transparent competitive landscape that ultimately pushes different technologies to innovate.

That said, comparing two software products is not a trivial task, especially when no one is equally expert in every system evaluated. Each product has its own architecture, configuration model, optimizations, and tradeoffs, which means even the best good-faith benchmark can miss something important. That is why we believe benchmarks should be open, transparent, and reproducible. At the very least, this provides a common baseline that can start a conversation and highlight the nuances between systems. If experts in one of the systems see a configuration issue in the benchmark, they should be able to point it out, and the benchmark should be easy to update and rerun. That is what happened recently with Snowflake: after we published our initial benchmark results, Snowflake shared feedback. We updated the setup based on all of their suggestions and reran the comparison. It is then up to the consumer of the benchmark to decide, based on the data and methodology, what matters most to them and what insights they want to extract.

To be useful, benchmarks need to be reproducible and run with a clear methodology. If they aren't, we slowly fall into deception and obscurantism — or, as some like to call it, "benchmarketing."

The Databricks Reyden benchmark

I wanted to write about benchmark transparency because I just got back from San Francisco, where I attended the Databricks Data and AI Summit — the conference where Databricks showcases its new product announcements. The one that piqued my interest the most was the Reyden announcement. If you haven't watched it, the tl;dr is that Databricks developed a new compute group that aims to address low-latency query workloads. ClickHouse was highlighted and referenced a few times during the keynote.

This is great news for the real-time analytics space: it means more people will be working on the problem, and we might see more innovation. But as I watched the keynote, one particular benchmark caught my eye.

In the chart shown during the keynote, the yellowish line is ClickHouse (by the way, this is our old color; the new one is #FAFF69), the blue lines are Snowflake, and the red line is Databricks. As the product manager for ClickHouse, I saw ClickHouse “crashing” in the benchmark as a big problem, so immediately after the keynote I set out to reproduce the benchmark and see how ClickHouse could possibly crash under that load.

Dataset selection

During the Reyden announcement, Databricks at least shared that two datasets were used:

The TPC-H benchmark (with a big emphasis on TPC-H SF 1 — the smallest scale factor of the benchmark), which is also, conveniently, a sample dataset provided by Databricks.

The NYC Taxi dataset. This dataset can be found in different sizes, but Databricks provides a sample that coincidentally matches the range of the query highlighted during the keynote; the sample they provide is around 22K rows.

Overall, that is an interesting choice, for one primary reason:

These datasets are tiny. They are so small they can fit in memory on an iPhone.

At that point, you are not measuring how the engine performs at scale. You are measuring how fast it can query data that is already in an in-memory cache.
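To put a rough number on "tiny", here is a back-of-the-envelope sketch for the columns TPC-H Q6 reads from `lineitem` at SF1. The row count comes from the TPC-H definition of SF1; the per-column byte widths are assumptions for an uncompressed in-memory layout, and a real engine's compressed footprint would likely be smaller.

```python
# Row count of the TPC-H lineitem table at scale factor 1.
LINEITEM_ROWS_SF1 = 6_001_215

# The four columns TPC-H Q6 reads. The widths are ASSUMED uncompressed sizes
# in bytes (8-byte numerics, 4-byte date), not measurements from any engine.
Q6_COLUMN_BYTES = {
    "l_extendedprice": 8,
    "l_discount": 8,
    "l_quantity": 8,
    "l_shipdate": 4,
}


def uncompressed_bytes(rows: int, column_bytes: dict[str, int]) -> int:
    """Total bytes needed to hold the given columns for `rows` rows."""
    return rows * sum(column_bytes.values())


def to_mib(num_bytes: int) -> float:
    """Convert bytes to mebibytes."""
    return num_bytes / 2**20


if __name__ == "__main__":
    total = uncompressed_bytes(LINEITEM_ROWS_SF1, Q6_COLUMN_BYTES)
    print(f"Q6 working set at SF1: about {to_mib(total):.0f} MiB uncompressed")
```

Under these assumptions the Q6 working set comes to roughly 160 MiB, which fits comfortably in the memory of a phone, let alone a server.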

For a query engine meant to work at...
