DuckDB isn't just fast (2024)

tosh1 pts0 comments

DuckDB isn't just fast

csvbase is a simple website for sharing table data.<br>Join the discord.

DuckDB isn't just fast

A whistlestop tour of the cool bits of DuckDB

2024-05-30

by Cal Paterson

DuckDB is a single file SQL database. It's designed for data analysis and so,<br>probably because of the bent of people who are into that sort of thing, a lot<br>of the evaluations of it end up being quantitative. This isn't just true of<br>DuckDB - most comparisons of most data tools tend to focus on the measureable.

That means they<br>mainly<br>look<br>at<br>speed. And DuckDB generally does<br>well.

The notes on benchmark performance graphs often read "higher is better" and<br>performance improvements are even called "optimisations". But the truth is, at<br>least as a user, once performance reaches a satisfactory level - enough for<br>your own data analysis to complete in a reasonable about of time - there is no<br>further benefit from increased speed. Instead of being called "performance<br>optimisation" it should probably be called "performance satisfaction" as once<br>it is satisfactory you have finished.

Usability is different. The whole point of computers is as an aid to<br>productivity so user-friendliness is actually the bit you want to optimise.<br>Unlike speed, being easier to use is always better and there is very little<br>limit to that. So it's "usability improvements" that should be called<br>"optimisations" but perhaps the relevant ships on all of these terms have<br>sailed.

Anyway to balance out the force out I want to demonstrate some usability<br>benefits of DuckDB. Mostly, they cannot be measured:

Good developer ergonomics

It handles larger than memory ("out of core") datasets

Easy to install & run

Ergonomics

DuckDB takes care to make the common stuff straightward. For example, you can<br>create tables (including inferring the table schema) straight from input<br>files:

-- loading a table from a parquet file<br>CREATE TABLE stock_exchanges AS<br>FROM<br>read_parquet(<br>"https://csvbase.com/meripaterson/stock-exchanges.parquet"<br>);

Looking at the schema of that table:

-- the output of: .schema stock_exchanges<br>CREATE TABLE stock_exchanges (<br>csvbase_row_id bigint,<br>Continent varchar,<br>Country varchar,<br>"Name" varchar,<br>MIC varchar,<br>"Last changed" date<br>);

DuckDB has inferred all the columns, including their types, from the Parquet<br>file. Brill. And as you can see, that Parquet file can come from anywhere on<br>the web, it need not be local. That's perhaps not a big advance on some of the<br>common dataframe libraries, but it is a big advance in the world of SQL-based<br>tools, most of which can only read CSV (not Parquet) and then also require the<br>schema to be created beforehand.

And you don't actually have to create a table first in order to query the data.<br>The read_parquet function returns a relation and so can act as a sub-query.<br>A specific example of that, this time with a csv file:

-- pulling down the most recent EUR:USD exchange rate<br>SELECT<br>rate<br>FROM<br>read_csv_auto("https://csvbase.com/table-munger/eurofxref.csv")<br>WHERE<br>currency = 'USD';

So you can freely query parquet and csv files on the web with the minimum of<br>fuss.

But how much of SQL does DuckDB support? A very wide swathe. I haven't done<br>any comprehensive analysis but of the stuff I use in Postgres I haven't found<br>much if anything that isn't also implemented in DuckDB.

For example, window functions are fully supported:

-- smoothed history of the eur:usd exchange rate<br>SELECT<br>date,<br>avg(rate) OVER (<br>ORDER BY date<br>ROWS BETWEEN 100 PRECEDING AND CURRENT ROW<br>) AS rolling<br>FROM<br>read_parquet('https://csvbase.com/table-munger/eurofxref-hist.parquet')<br>WHERE<br>currency = 'USD';

And that's not the end of DuckDB making the simple stuff easy. I did the above<br>query at the library on a slow internet connection and DuckDB helpfully started<br>to display a progress bar, which even Postgres doesn't have.

Then, when the query was done it politely avoided swamping my terminal with the<br>6500 lines of output by abbreviating them, just like Pandas does.

Datasets larger than memory

One of the problems that arises with more than a few data tools is that once<br>the dataset gets bigger than the computer memory (or gets within 50%) the tool<br>breaks down.

This is an underrated source of pain. Sometimes I've seen someone write<br>something quickly with one tool as a quick prototype. The prototype works<br>great and you want to run it on the full dataset - but wait - you can't.<br>You're getting memory errors, heavy swapping, etc. The problem is that the<br>tool was loading the whole dataset into memory and so suddenly you have to<br>change technology. Always an unpleasant discovery.

DuckDB fully supports datasets larger than memory. That's in contrast to<br>Pandas, which starts to struggle once your dataframe is >50% of system memory.<br>The majority of dataframe libraries do not support datasets larger than memory<br>or require alternate, more limited, modes of<br>operation when using<br>them - but in DuckDB everything works.

Single file, single machine model - and the...

duckdb table from memory parquet data

Related Articles