How the Other Half Counts: Planner Statistics in Oracle, Db2, MySQL, SQLite

pgedge_postgres1 pts0 comments

How the Other Half Counts — The Build

PostgreSQL has ANALYZE. You run it (or autovacuum runs it for you), it draws a sample of 300 × default_statistics_target rows, and it writes a row per column into pg_statistic: a null fraction, an n-distinct estimate, a most-common-values list, an equi-depth histogram, and a physical-vs-logical correlation. The planner reads those numbers, multiplies selectivities together, costs a handful of join strategies, and picks one. Three join algorithms are on the menu: nested loop, merge join, hash join.

That is the entire shape of the problem, and every cost-based optimizer ever shipped solves the same one. They differ in three places, and only three: where the numbers come from, how stale the numbers are allowed to get, and which plan shapes are even legal to choose between. The algorithms are the boring part. Everybody hash-joins. The interesting part is the bookkeeping.

So: PostgreSQL has ANALYZE. What does everyone else have? Six answers, arranged from the system with the most knobs to the system with none.

A note on the menu

Before the statistics, a word about what they’re feeding, because “hash join vs. merge join” quietly assumes the database has both. Three of the six here don’t, and one of them doesn’t have either.

A row-store with B-tree indexes and OLTP ancestry tends to carry the full classical set: nested loop for small driving sets with a good index on the inner side, merge join for two already-sorted inputs, hash join for large unsorted equijoins. Oracle and Db2 are this. A system born for analytics, scanning columns in vectorized batches, finds merge join almost useless; sorting both sides to join them is a tax you pay only when the optimizer is cornered. The vectorized hash join wins nearly every analytical equijoin, so that’s what gets built and tuned. DuckDB and Snowflake are this. MySQL is its own case: it shipped for two decades with exactly one join algorithm and only gained a second in 2019. And SQLite is the limit of the argument; it has one join algorithm, nested loop, and has never had any other. There is no hash join and no merge join to weigh it against. The planner’s entire job is to pick the order of the loops.

Keep that in mind as the statistics get more elaborate. Half the precision exists to choose between options the engine may not even have, and at least one engine here spends its statistics on a decision with only one possible operator and several possible orderings.

Oracle: every feature anyone ever asked for

Oracle Database is the maximalist. If a statistics idea has appeared in a paper since 1995, Oracle has shipped a version of it, kept the old version for compatibility, and added a preference to control which one runs.

The gatherer is DBMS_STATS, a PL/SQL package, not a SQL statement. By default you don’t call it; an AutoTask job runs during the nightly maintenance window and re-gathers anything missing or stale, where “stale” means roughly 10% of rows changed since the last gather. Oracle 19c added high-frequency automatic statistics collection , a lightweight task that revisits stale objects every 15 minutes by default, so the window job isn’t the only line of defense. For bulk operations, online statistics gathering (12c) piggybacks a stats collection onto CREATE TABLE AS SELECT and direct-path inserts; you were scanning every row anyway, so you may as well count them on the way past.

The column statistics are where Oracle shows off. Histograms come in four flavors, and the optimizer picks which to build based on the data: frequency (one bucket per distinct value, when there are few enough), top-frequency (frequency for the popular values, the long tail ignored), height-balanced (equi-depth, the legacy type), and hybrid (height-balanced buckets that also track the frequency of the value at each bucket boundary, which is the type you actually want and the reason the other two mostly stopped mattering after 12c). N-distinct is computed with a HyperLogLog sketch (APPROXIMATE_NDV_ALGORITHM, defaulting to HYPERLOGLOG since 19c) rather than by sorting and counting, which is the difference between a stats gather that finishes and one that doesn’t. PostgreSQL people will recognize the move; pg_stat_statements and friends reach for the same sketch family for the same reason.

Then there is the correlation problem, which Oracle attacks from two directions. Extended statistics let you declare a column group (DBMS_STATS.CREATE_EXTENDED_STATS on (make, model)) so the optimizer stops assuming model = 'Accord' and make = 'Honda' are independent events; this is the direct analog of PostgreSQL’s CREATE STATISTICS. Expression statistics do the same for UPPER(last_name) and similar derived values.

The genuinely different machinery is adaptive. SQL plan directives are notes the optimizer leaves itself: “last time I planned a query with this predicate shape, my cardinality estimate was garbage.” The directive outlives the individual...

join statistics oracle hash merge optimizer

Related Articles