DuckDB isn't just fast (2024)

A whistle-stop tour of the cool bits of DuckDB

2024-05-30

by Cal Paterson

DuckDB is a single-file SQL database. It's designed for data analysis and so, probably because of the bent of people who are into that sort of thing, a lot of the evaluations of it end up being quantitative. This isn't just true of DuckDB - most comparisons of most data tools tend to focus on the measurable.

That means they mainly look at speed. And DuckDB generally does well.

The notes on benchmark performance graphs often read "higher is better" and performance improvements are even called "optimisations". But the truth is, at least as a user, once performance reaches a satisfactory level - enough for your own data analysis to complete in a reasonable amount of time - there is no further benefit from increased speed. Instead of being called "performance optimisation" it should probably be called "performance satisfaction", as once it is satisfactory you have finished.

Usability is different. The whole point of computers is as an aid to productivity so user-friendliness is actually the bit you want to optimise. Unlike speed, being easier to use is always better and there is very little limit to that. So it's "usability improvements" that should be called "optimisations" but perhaps the relevant ships on all of these terms have sailed.

Anyway, to balance out the force, I want to demonstrate some usability benefits of DuckDB. Mostly, they cannot be measured:

Good developer ergonomics

It handles larger than memory ("out of core") datasets

Easy to install & run

Ergonomics

DuckDB takes care to make the common stuff straightforward. For example, you can create tables (including inferring the table schema) straight from input files:

-- loading a table from a parquet file
CREATE TABLE stock_exchanges AS
FROM read_parquet(
    'https://csvbase.com/meripaterson/stock-exchanges.parquet'
);

Looking at the schema of that table:

-- the output of: .schema stock_exchanges
CREATE TABLE stock_exchanges (
    csvbase_row_id bigint,
    Continent varchar,
    Country varchar,
    "Name" varchar,
    MIC varchar,
    "Last changed" date
);

DuckDB has inferred all the columns, including their types, from the Parquet file. Brill. And as you can see, that Parquet file can come from anywhere on the web; it need not be local. That's perhaps not a big advance on some of the common dataframe libraries, but it is a big advance in the world of SQL-based tools, most of which can only read CSV (not Parquet) and also require the schema to be created beforehand.

And you don't actually have to create a table first in order to query the data. The read_parquet function returns a relation and so can act as a sub-query. A specific example of that, this time with a csv file:

-- pulling down the most recent EUR:USD exchange rate
SELECT rate
FROM read_csv_auto('https://csvbase.com/table-munger/eurofxref.csv')
WHERE currency = 'USD';

So you can freely query parquet and csv files on the web with the minimum of fuss.

But how much of SQL does DuckDB support? A very wide swathe. I haven't done any comprehensive analysis but of the stuff I use in Postgres I haven't found much if anything that isn't also implemented in DuckDB.

For example, window functions are fully supported:

-- smoothed history of the eur:usd exchange rate
SELECT
    date,
    avg(rate) OVER (
        ORDER BY date
        ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS rolling
FROM read_parquet('https://csvbase.com/table-munger/eurofxref-hist.parquet')
WHERE currency = 'USD';
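If the frame clause in that query is unfamiliar: ROWS BETWEEN 100 PRECEDING AND CURRENT ROW means each output row averages itself plus up to 100 rows before it, so the first few rows average over fewer values. Here is a rough sketch of that logic in plain Python, not DuckDB. The function name `rolling_mean` and the sample numbers are made up for illustration, and the frame is shrunk to 2 preceding rows so the result is easy to check by hand.

```python
from collections import deque


def rolling_mean(values, preceding):
    """Mean of the current value plus up to `preceding` earlier values."""
    # The frame holds N preceding rows plus the current row
    window = deque(maxlen=preceding + 1)
    out = []
    for v in values:
        window.append(v)  # the oldest value drops out once the frame is full
        out.append(sum(window) / len(window))
    return out


# Made-up rates, purely for illustration (not real EUR:USD data)
rates = [1.0, 2.0, 3.0, 4.0]
smoothed = rolling_mean(rates, preceding=2)
print(smoothed)  # [1.0, 1.5, 2.0, 3.0]
```

The last value is the average of 2.0, 3.0 and 4.0 only, because 1.0 has already fallen out of the frame.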

And that's not the end of DuckDB making the simple stuff easy. I did the above query at the library on a slow internet connection and DuckDB helpfully started to display a progress bar, which even Postgres doesn't have.

Then, when the query was done it politely avoided swamping my terminal with the 6500 lines of output by abbreviating them, just like Pandas does.

Datasets larger than memory

One of the problems that arises with more than a few data tools is that once the dataset gets bigger than the computer's memory (or even approaches half of it) the tool breaks down.

This is an underrated source of pain. Sometimes I've seen someone knock something together with one tool as a quick prototype. The prototype works great and you want to run it on the full dataset - but wait - you can't. You're getting memory errors, heavy swapping, etc. The problem is that the tool was loading the whole dataset into memory and so suddenly you have to change technology. Always an unpleasant discovery.

DuckDB fully supports datasets larger than memory. That's in contrast to Pandas, which starts to struggle once your dataframe is >50% of system memory. The majority of dataframe libraries do not support datasets larger than memory or require alternate, more limited, modes of operation when using them - but in DuckDB everything works.
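To make the "loading the whole dataset into memory" problem concrete, here is a minimal sketch in plain Python of the alternative: computing an aggregate by streaming over rows, so memory use stays flat however long the input is. This is only an illustration of the general idea, not a description of how DuckDB is implemented. The function name `mean_streaming` and the tiny sample data are assumptions made up for the example.

```python
import csv
import io


def mean_streaming(lines, column):
    """Average one CSV column while holding only one row in memory at a time."""
    total = 0.0
    count = 0
    for row in csv.DictReader(lines):
        total += float(row[column])
        count += 1
    # No data rows means there is no meaningful average
    return total / count if count else None


# A tiny in-memory stand-in for a file too big to load at once (made-up data)
fake_file = io.StringIO("currency,rate\nUSD,1.0\nUSD,2.0\nUSD,3.0\n")
result = mean_streaming(fake_file, "rate")
print(result)  # 2.0
```

The catch with doing this by hand is that every new kind of query needs its own streaming rewrite, which is exactly the "alternate, more limited, mode of operation" that is nice not to have to think about.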

Single file, single machine model - and the...
