DuckDB isn't just fast
csvbase is a simple website for sharing table data.<br>Join the discord.
DuckDB isn't just fast
A whistlestop tour of the cool bits of DuckDB
2024-05-30
by Cal Paterson
DuckDB is a single file SQL database. It's designed for data analysis and so,<br>probably because of the bent of people who are into that sort of thing, a lot<br>of the evaluations of it end up being quantitative. This isn't just true of<br>DuckDB - most comparisons of most data tools tend to focus on the measureable.
That means they<br>mainly<br>look<br>at<br>speed. And DuckDB generally does<br>well.
The notes on benchmark performance graphs often read "higher is better" and<br>performance improvements are even called "optimisations". But the truth is, at<br>least as a user, once performance reaches a satisfactory level - enough for<br>your own data analysis to complete in a reasonable about of time - there is no<br>further benefit from increased speed. Instead of being called "performance<br>optimisation" it should probably be called "performance satisfaction" as once<br>it is satisfactory you have finished.
Usability is different. The whole point of computers is as an aid to<br>productivity so user-friendliness is actually the bit you want to optimise.<br>Unlike speed, being easier to use is always better and there is very little<br>limit to that. So it's "usability improvements" that should be called<br>"optimisations" but perhaps the relevant ships on all of these terms have<br>sailed.
Anyway to balance out the force out I want to demonstrate some usability<br>benefits of DuckDB. Mostly, they cannot be measured:
Good developer ergonomics
It handles larger than memory ("out of core") datasets
Easy to install & run
Ergonomics
DuckDB takes care to make the common stuff straightward. For example, you can<br>create tables (including inferring the table schema) straight from input<br>files:
-- loading a table from a parquet file<br>CREATE TABLE stock_exchanges AS<br>FROM<br>read_parquet(<br>"https://csvbase.com/meripaterson/stock-exchanges.parquet"<br>);
Looking at the schema of that table:
-- the output of: .schema stock_exchanges<br>CREATE TABLE stock_exchanges (<br>csvbase_row_id bigint,<br>Continent varchar,<br>Country varchar,<br>"Name" varchar,<br>MIC varchar,<br>"Last changed" date<br>);
DuckDB has inferred all the columns, including their types, from the Parquet<br>file. Brill. And as you can see, that Parquet file can come from anywhere on<br>the web, it need not be local. That's perhaps not a big advance on some of the<br>common dataframe libraries, but it is a big advance in the world of SQL-based<br>tools, most of which can only read CSV (not Parquet) and then also require the<br>schema to be created beforehand.
And you don't actually have to create a table first in order to query the data.<br>The read_parquet function returns a relation and so can act as a sub-query.<br>A specific example of that, this time with a csv file:
-- pulling down the most recent EUR:USD exchange rate<br>SELECT<br>rate<br>FROM<br>read_csv_auto("https://csvbase.com/table-munger/eurofxref.csv")<br>WHERE<br>currency = 'USD';
So you can freely query parquet and csv files on the web with the minimum of<br>fuss.
But how much of SQL does DuckDB support? A very wide swathe. I haven't done<br>any comprehensive analysis but of the stuff I use in Postgres I haven't found<br>much if anything that isn't also implemented in DuckDB.
For example, window functions are fully supported:
-- smoothed history of the eur:usd exchange rate<br>SELECT<br>date,<br>avg(rate) OVER (<br>ORDER BY date<br>ROWS BETWEEN 100 PRECEDING AND CURRENT ROW<br>) AS rolling<br>FROM<br>read_parquet('https://csvbase.com/table-munger/eurofxref-hist.parquet')<br>WHERE<br>currency = 'USD';
And that's not the end of DuckDB making the simple stuff easy. I did the above<br>query at the library on a slow internet connection and DuckDB helpfully started<br>to display a progress bar, which even Postgres doesn't have.
Then, when the query was done it politely avoided swamping my terminal with the<br>6500 lines of output by abbreviating them, just like Pandas does.
Datasets larger than memory
One of the problems that arises with more than a few data tools is that once<br>the dataset gets bigger than the computer memory (or gets within 50%) the tool<br>breaks down.
This is an underrated source of pain. Sometimes I've seen someone write<br>something quickly with one tool as a quick prototype. The prototype works<br>great and you want to run it on the full dataset - but wait - you can't.<br>You're getting memory errors, heavy swapping, etc. The problem is that the<br>tool was loading the whole dataset into memory and so suddenly you have to<br>change technology. Always an unpleasant discovery.
DuckDB fully supports datasets larger than memory. That's in contrast to<br>Pandas, which starts to struggle once your dataframe is >50% of system memory.<br>The majority of dataframe libraries do not support datasets larger than memory<br>or require alternate, more limited, modes of<br>operation when using<br>them - but in DuckDB everything works.
Single file, single machine model - and the...