DuckDB Internals: Why Is DuckDB Fast? (Part 1)

Greybeam Blog / Kyle Cheung / May 4, 2026 / 18 min read

DuckDB has gone from a research project at CWI Amsterdam in 2019 to one of the most widely adopted databases of the past decade. The list of places it shows up is long: notebooks, ETL pipelines, dashboards, CI test runners, embedded analytics inside SaaS products, even an iPhone running TPC-H at scale factor 100.

[Image: iPhone in a box of dry ice, running TPC-H.]

Companies have started building real products around it. MotherDuck is wrapping DuckDB into a cloud data warehouse. BI and data app platforms like Hex, Omni, and Evidence use it as an in-app execution engine and cache. Fivetran's Managed Data Lake Service uses DuckDB inside its data-lake writer for merging and compaction. Rill builds an open-source BI tool on top of it. We use it at Greybeam too, powering millions of queries for BI and analytics workloads.

What is DuckDB?

DuckDB is an in-process analytical SQL database. Analytical means it's optimized for the kind of queries that scan millions of rows to filter, aggregate, and join, not the kind that look up a single record by primary key. In-process means there's no server. You don't connect to DuckDB; you load it as a library inside your program, the same way you'd load NumPy or Polars.

DuckDB has received widespread adoption because it's just so damn easy to use. It ships as a single binary under 20 MB with no external dependencies. You install it with pip install duckdb, brew install duckdb, or by linking libduckdb into a C++ project. It opens any directory of Parquet, CSV, or JSON files like they were already a SQL database.

DuckDB also happens to be one of the fastest single-node analytical engines available, regularly holding its own against entire clusters that cost millions of dollars per year.

This is the first post in a three-part deep dive into DuckDB internals. We'll follow a query from the moment it enters the engine to the moment the result is returned, and at each stage we'll look at the design choice that makes it fast.
DuckDB's speed comes from a handful of design choices:

In-process execution

Columnar, compressed storage with zonemaps

Vectorized execution

Morsel-driven parallelism

Snapshot isolation with optimistic MVCC

And much more!

This post covers the path from your SQL to the moment the engine is ready to run the query, plus the storage layer the query will read from. By the end you'll have a clear mental model of DuckDB's setup work and storage layout. Query execution is covered in Part 2, so make sure to subscribe!

Queries Run In-Process

You point DuckDB at a 6 GB Parquet file on your laptop. The results come back in under a second. No cluster, no setup, no migration, no CREATE TABLE. How does that work?

    SELECT * FROM 'orders.parquet';

Most analytical databases are servers. Snowflake, Postgres, BigQuery, Redshift. You open a connection, send SQL over TCP (a protocol to send data over a network), and wait for results to come back. Along the way, every record in the result is serialized into a wire protocol, transmitted across the network, and deserialized on the other end.

Serializing and Deserializing

Inside a database, a query result lives as typed values at specific memory addresses. A 64-bit integer here, a pointer to a string there. Those addresses only exist in that process. To send the result to a client on another machine, the database has to rewrite every value into an agreed byte format (Postgres has its own, MySQL has another, with ODBC and JDBC as client-side APIs that drivers expose on top) so it can be pushed through a TCP socket. The client then parses those bytes back into its own native types. Every value may be touched multiple times, once to encode and once to decode, and on a large result set, that work often takes longer than the query itself.

DuckDB is not a server. It's a library. There is no DuckDB daemon, no port, no cluster. You load libduckdb into your program and call functions directly against it.

In 2017, Mark Raasveldt and Hannes Mühleisen published Don't Hold My Data Hostage, a paper measuring what actually happens when you pull a result set out of a warehouse.
They found that the client protocol itself — ODBC, JDBC, and similar row-by-row value APIs — was often the slowest single step in the entire query, sometimes dwarfing the time the database spent computing the answer. Two costs drive this. The first is raw bandwidth: a typical gigabit Ethernet link caps you at around 125 MB/s, and a large result set can take longer to transmit than it took to compute. The second is per-value overhead. ODBC and JDBC hand back results one row and one value at a time, which means the client makes a separate function call for every field in every row. On a 100-million-row result, that's hundreds of millions of function calls, each one doing its own little memory copy, type check, and string...
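The encode and decode round trip, and the bandwidth arithmetic, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any real driver works: the wire format and the helpers `encode_row`, `decode_row`, and `transfer_seconds` are made up for this example (they are not the Postgres, MySQL, ODBC, or JDBC formats), and the 50 bytes per row is an arbitrary figure chosen to make the numbers easy to follow.

```python
import struct

# Hypothetical wire format for this sketch only: each row is a 64-bit
# integer, a 32-bit string length, then the UTF-8 bytes of the string.
HEADER = "<qI"  # little-endian, no padding: 8 + 4 = 12 bytes


def encode_row(order_id: int, status: str) -> bytes:
    """Server side: rewrite in-memory values into the agreed byte format."""
    payload = status.encode("utf-8")
    return struct.pack(HEADER, order_id, len(payload)) + payload


def decode_row(buf: bytes, offset: int = 0):
    """Client side: parse one row back into native types.

    Returns the row and the offset where the next row starts.
    """
    order_id, length = struct.unpack_from(HEADER, buf, offset)
    start = offset + struct.calcsize(HEADER)
    status = buf[start:start + length].decode("utf-8")
    return (order_id, status), start + length


def transfer_seconds(total_bytes: int, bytes_per_second: float = 125e6) -> float:
    """Time to push total_bytes through a link (default: ~125 MB/s gigabit)."""
    return total_bytes / bytes_per_second


rows = [(1, "shipped"), (2, "pending"), (3, "returned")]

# Every value is touched once to encode...
wire = b"".join(encode_row(*row) for row in rows)

# ...and once more to decode, one row and one field at a time.
decoded, offset = [], 0
while offset < len(wire):
    row, offset = decode_row(wire, offset)
    decoded.append(row)

print(decoded == rows)                      # True
print(transfer_seconds(100_000_000 * 50))   # 40.0
```

Under these assumptions, a 100-million-row result at 50 bytes per row is 5 GB on the wire, which needs about 40 seconds at 125 MB/s before any per-value decoding cost is counted. An in-process engine skips both the transfer and the encode/decode loop because the caller reads the result from the same address space.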
