Lies, Damn Lies and Database Benchmarks | QuestDB
Benchmarks, everyone loves benchmarks. People look at a benchmark result and<br>start spreading the word that database X is the top dog, since it is so much<br>faster than database Y.
A decent benchmark might be pictured as a strict Olympic<br>Games-like running competition where the "Citius, Altius, Fortius" principle is<br>precisely implemented. But in reality, when you approach the athletes, you start<br>hearing unexpected noises. What is that? It turns out the competition is more<br>like those weird contests you find on the Internet: the athletes must whistle<br>"Yellow Submarine" accurately while running as fast as they can. The winner is no<br>longer the fastest runner. It is whoever best balances raw speed against a skill<br>that has nothing to do with running, and the quickest sprinter on the track can<br>easily finish last.
The analogy fits database benchmarks well, especially<br>when quite different categories of databases are being compared. A perfect,<br>completely fair database benchmark is like a unicorn: good luck finding one.<br>Today we will try to illustrate this by toying with a public, well-recognized<br>benchmark.
The benchmark we will use is<br>ClickBench, but do not get us<br>wrong: we are here to question all database benchmarks, not ClickBench<br>specifically. ClickBench is just convenient. It is a solid comparison for<br>analytical databases and already includes a large roster of engines.
How ClickBench measures things
ClickBench runs the same workload against every system: a single web-analytics<br>table of around 100 million rows and 105 columns (the famous hits dataset),<br>and 43 analytical queries over it. Each engine ships a small set of shell<br>scripts. The flow is always the same: a script installs the database, loads the<br>data (importing from CSV/TSV, or simply pointing the engine at a downloaded<br>Parquet file if it can read external files), and then runs the 43 queries.
Each query is measured in two flavors:
Cold run. This is the first execution of a query, with all operating<br>system page caches and database caches cleared beforehand. It captures the<br>worst case, when nothing is warm.
Hot run. Quoting the ClickBench rules, "each of the 43 queries is run<br>three times," and "the smaller of the 2nd and 3rd runtime is used if both runs<br>are successful." The first run is supposed to populate the caches, so the two<br>later runs are expected to be the fastest.
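The hot-run rule quoted above can be sketched in a few lines of Python. This is an illustration, not ClickBench's own code: the function name hot_time and the use of None for a failed run are assumptions, and because the quoted rule only covers the case where both later runs succeed, the sketch simply reports no hot time otherwise.

```python
from typing import Optional, Sequence


def hot_time(runs: Sequence[Optional[float]]) -> Optional[float]:
    """Pick the hot-run time from the three timings of one query.

    `runs` holds the 1st, 2nd and 3rd runtimes in seconds.
    None marks a failed run (an assumption of this sketch).
    """
    if len(runs) != 3:
        raise ValueError("expected exactly three runs per query")

    second, third = runs[1], runs[2]
    if second is None or third is None:
        # Assumption: the quoted rule only defines the hot time when
        # both later runs succeed, so report "no hot time" here.
        return None

    # The first run only warms the caches; the smaller of the two
    # later runs is the hot time.
    return min(second, third)


if __name__ == "__main__":
    print(hot_time([1.50, 0.40, 0.30]))
    print(hot_time([1.50, None, 0.30]))
```

Note that the first run never enters the result: it exists only to warm the caches.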
That cold definition hides an asymmetry the public dashboard does not advertise.<br>A true cold run under ClickBench's rules means clearing the OS page cache and<br>restarting the server, and both are only possible when the database runs on the<br>benchmark machine. A managed cloud service, say Snowflake, BigQuery, Redshift,<br>or Databricks, runs on the provider's hardware, where the harness has no shell,<br>no drop_caches, and no way to bounce the server, so all three of its runs hit<br>the same live, never-restarted service. Its cold number is therefore never<br>forced cold the way a self-hosted engine's is, which tilts the cold-run ranking<br>toward hosted systems, and with it the combined score that folds cold runs in.<br>Every engine in this post runs self-hosted on the same box, so they all play by<br>the same rule, but the asymmetry is worth remembering the next time you compare<br>cold-run numbers across hosted and self-managed systems.
We will focus on hot-run results only. We will not compare individual queries<br>either, only the overall score. The score is the one the public dashboard<br>shows: for each query, ClickBench computes a ratio against the fastest system on<br>that query,
ratio = (0.01 + hot_time) / (0.01 + baseline_time)
where baseline_time is the best hot time among the compared systems for that<br>query, with both times in seconds. The 0.01 is a 10 ms cushion that stops sub-10 ms queries from<br>dominating. The final score is the<br>geometric mean of those ratios<br>across all 43 queries. Lower is better, and a hypothetical 1.000 would mean "fastest on<br>every single query." A failed query is penalized heavily. The full results for<br>everything below live in the support repository linked at the end, so you can<br>re-score them yourself.
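To make the scoring concrete, here is a minimal Python sketch of the ratio and geometric mean described above. It is not ClickBench's own scoring code: the function name relative_scores, the dictionary layout, and the sample timings are made up for illustration. The sketch also assumes every query succeeded on every system, because the exact penalty for a failed query is not spelled out here.

```python
import math
from typing import Dict, List

CUSHION = 0.01  # the 10 ms constant from the ratio formula, in seconds


def relative_scores(hot_times: Dict[str, List[float]]) -> Dict[str, float]:
    """Score each system as the geometric mean of its per-query ratios.

    `hot_times` maps a system name to its hot time in seconds for each
    query, in the same query order for every system.
    Assumption: no failed queries (the failure penalty is not modeled).
    """
    systems = list(hot_times)
    n_queries = len(hot_times[systems[0]])
    if any(len(times) != n_queries for times in hot_times.values()):
        raise ValueError("every system needs a time for every query")

    log_sums = {name: 0.0 for name in systems}
    for q in range(n_queries):
        # The baseline is the best hot time among the compared systems.
        baseline = min(hot_times[name][q] for name in systems)
        for name in systems:
            ratio = (CUSHION + hot_times[name][q]) / (CUSHION + baseline)
            log_sums[name] += math.log(ratio)

    # Geometric mean = exp(mean of logs); lower is better.
    return {name: math.exp(log_sums[name] / n_queries) for name in systems}


if __name__ == "__main__":
    # Made-up timings for two hypothetical systems and three queries.
    sample = {
        "engine_a": [0.09, 0.19, 0.001],
        "engine_b": [0.19, 0.09, 0.002],
    }
    for name, score in relative_scores(sample).items():
        print(f"{name}: {score:.3f}")
```

The third sample query shows the cushion at work: 2 ms against 1 ms is a 2x gap in raw time, but the ratio is only (0.01 + 0.002) / (0.01 + 0.001), roughly 1.09. Because the baseline is taken among the compared systems, adding or removing a system can change everyone's score.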
INFO
Here is the subtlety that drives this entire post. Every ClickBench query<br>script records the engine's own internal query time: DuckDB's Run Time,<br>ClickHouse's --time, DataFusion's Elapsed, QuestDB's<br>timings.execute, Polars' internal elapsed. Process and client startup is in nobody's recorded<br>number. So keeping a process alive cannot change the score by removing a startup<br>term, because that term was never in the number to...