Lies, Damn Lies and Database Benchmarks



Benchmarks, everyone loves benchmarks. People look at a benchmark result and start spreading the word that database X is the top dog, since it is so much faster than database Y.

A decent benchmark might be pictured as a strict Olympic Games-like running competition where the "Citius, Altius, Fortius" principle is precisely implemented. But in reality, when you approach the athletes, you start hearing unexpected noises. What is that? It turns out the competition is more like those weird contests you find on the Internet: the athletes must whistle "Yellow Submarine" accurately while running as fast as they can. The winner is no longer the fastest runner. It is whoever best balances raw speed against a skill that has nothing to do with running, and the quickest sprinter on the track can easily finish last.

The analogy fits database benchmarks well, especially when they compare very different categories of databases. A perfect, completely fair database benchmark is like a unicorn: good luck finding one. Today we will try to illustrate this by toying with a public, well-recognized benchmark.

The benchmark we will use is ClickBench, but do not get us wrong: we are here to question all database benchmarks, not ClickBench specifically. ClickBench is just convenient. It is a solid comparison for analytical databases and already includes a large roster of engines.

How ClickBench measures things

ClickBench runs the same workload against every system: a single web-analytics table of around 100 million rows and 105 columns (the famous hits dataset), and 43 analytical queries over it. Each engine ships a small set of shell scripts. The flow is always the same: a script installs the database, loads the data (importing from CSV/TSV, or simply pointing the engine at a downloaded Parquet file if it can read external files), and then runs the 43 queries.

Each query is measured in two flavors:

Cold run. This is the first execution of a query, with all operating system page caches and database caches cleared beforehand. It captures the worst case, when nothing is warm.

Hot run. Quoting the ClickBench rules, "each of the 43 queries is run three times," and "the smaller of the 2nd and 3rd runtime is used if both runs are successful." The first run is supposed to populate the caches, so the two later runs are expected to be the fastest.
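
The quoted hot-run rule can be sketched in a few lines of Python. This is only an illustration, not ClickBench's own code: the function name hot_time and the use of None to mark a failed run are assumptions made here, and since the quoted rule only covers the case where both later runs succeed, the sketch reports no hot time otherwise.

```python
from typing import Optional, Sequence


def hot_time(runs: Sequence[Optional[float]]) -> Optional[float]:
    """Pick the hot-run time for one query from its three runs.

    `runs` holds the timings in seconds for runs 1, 2 and 3.
    None marks a failed run (a convention assumed for this sketch).
    """
    if len(runs) != 3:
        raise ValueError("expected exactly three runs")
    second, third = runs[1], runs[2]
    if second is None or third is None:
        # Assumption: the quoted rule only speaks about two successful
        # runs, so this sketch reports no hot time in any other case.
        return None
    # The first run only warms the caches; it never counts as hot.
    return min(second, third)


# Made-up timings: a slow first run, then two warm runs.
print(hot_time([1.92, 0.41, 0.38]))
```

Note that the first run is ignored entirely, even if it happens to be the fastest of the three.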

That cold definition hides an asymmetry the public dashboard does not advertise. Clearing the OS page cache and restarting the server are only possible when the database runs on the benchmark machine. A managed cloud service, say Snowflake, BigQuery, Redshift, or Databricks, runs on the provider's hardware, where the harness has no shell, no drop_caches, and no way to bounce the server, so all three of its runs hit the same live, never-restarted service. Its cold number is therefore never forced cold the way a self-hosted engine's is. That tilts the cold-run ranking toward hosted systems, and with it the combined score that folds cold runs in. ClickBench's rules require that restart for a true cold run, and a restart is something you can only ask of a server you control. Every engine in this post runs self-hosted on the same box, so they all play by the same rule, but it is worth remembering the next time you compare cold-run numbers across hosted and self-managed systems.

We will focus on hot-run results only. We will not compare individual queries either, only the overall score. The score is the one the public dashboard shows: for each query, ClickBench computes a ratio against the fastest system on that query,

ratio = (0.01 + hot_time) / (0.01 + baseline_time)

where baseline_time is the best hot time among the compared systems for that query. The 0.01 is a 10 ms cushion that stops sub-10 ms queries from dominating. The final score is the geometric mean of those ratios across all 43 queries. Lower is better, and a hypothetical 1.000 would mean "fastest on every single query." A failed query is penalized heavily. The full results for everything below live in the support repository linked at the end, so you can re-score them yourself.
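
For readers who want to re-score results by hand, here is a minimal Python sketch of the formula above. It is an illustration under stated assumptions, not ClickBench's actual scoring code: the function name scores, the input layout (system name mapped to a list of hot times in seconds, one per query), and the engine names and timings in the example are all made up. It also assumes every query succeeded, because the penalty for a failed query is not spelled out here.

```python
import math
from typing import Dict, List

CUSHION = 0.01  # the 10 ms cushion from the formula, in seconds


def scores(hot_times: Dict[str, List[float]]) -> Dict[str, float]:
    """Geometric mean of per-query ratios against the fastest system.

    Assumes every system has a successful hot time for every query.
    """
    n_queries = len(next(iter(hot_times.values())))
    if any(len(times) != n_queries for times in hot_times.values()):
        raise ValueError("every system needs one hot time per query")

    # baseline_time: the best hot time among the compared systems, per query.
    baseline = [
        min(times[q] for times in hot_times.values())
        for q in range(n_queries)
    ]

    result = {}
    for system, times in hot_times.items():
        # Average the logs of the ratios, then exponentiate: this is the
        # geometric mean, computed without multiplying many ratios together.
        log_sum = sum(
            math.log((CUSHION + t) / (CUSHION + b))
            for t, b in zip(times, baseline)
        )
        result[system] = math.exp(log_sum / n_queries)
    return result


# Hypothetical two-query comparison; names and timings are invented.
example = {
    "engine_a": [0.09, 0.19],
    "engine_b": [0.19, 0.19],
}
for name, score in scores(example).items():
    print(f"{name}: {score:.3f}")
```

In this toy input, engine_a is fastest (or tied) on both queries, so it scores 1.000, while engine_b is twice as slow on one query and lands near the square root of 2. The cushion matters most for tiny timings: a 2 ms query against a 1 ms baseline gives a ratio of 0.012 / 0.011, roughly 1.09, rather than 2.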

INFO

Here is the subtlety that drives this entire post. Every ClickBench query script records the engine's own internal query time: DuckDB's Run Time, ClickHouse's --time, DataFusion's Elapsed, QuestDB's timings.execute, Polars' internal elapsed. Process and client startup is in nobody's recorded number. So keeping a process alive cannot change the score by removing a startup term, because that term was never in the number to...
