DuckDB Storage Engine for MariaDB

tanelpoder1 pts0 comments

DuckDB Storage Engine for MariaDB. When the Sea Lion Learns to Quack. - MariaDB.org

Skip to content

Newsletter • Contact • Events • Planet

Search for:

Search for:

Tags: #new feuatre, analytics, MariaDB, Storage engine

Subscribe to the Newsletter

I have read and I agree to the MariaDB Foundation Privacy Policy

An early look at the DuckDB storage engine for MariaDB — columnar, vectorized analytics that live right next to your transactional tables.

The problem

MariaDB’s InnoDB is excellent at what it was built for: transactions. Row-by-row inserts, updates, point lookups, strong consistency. But the moment you ask it to scan tens of millions of rows for a multi-way join with a few aggregations, a row store has to work hard.

The usual answer is to stand up a separate analytical system, then build ETL pipelines to copy data into it. Now you maintain two systems and you live with the lag between them.

What if the analytical engine just lived inside the same server — same SQL, same client, same data directory?

What we built

The DuckDB storage engine for MariaDB does exactly that. Create a table with ENGINE=DuckDB and its data is stored in DuckDB‘s native columnar format; analytical queries against it run through DuckDB’s vectorized, multi-core engine — in-process, inside the MariaDB server.

No separate cluster. No ETL. No new protocol. The familiar mariadb client, one SQL dialect, InnoDB and DuckDB tables side by side in the same database.

A note on MariaDB ColumnStore

MariaDB already ships ColumnStore , a mature columnar engine for large-scale analytics — typically deployed as a scale-out, multi-node data warehouse. The DuckDB engine targets a different point on the spectrum: a single-node, in-process, zero-ops analytical engine that sits next to your InnoDB tables for lightweight embedded analytics and HTAP. The two are complementary, not competing — reach for ColumnStore when you need a distributed warehouse, and the DuckDB engine when you want in-process analytics with no extra moving parts.

A 30-second taste

Create a DuckDB table just like any other:

CREATE TABLE sales (<br>id BIGINT PRIMARY KEY,<br>product VARCHAR(64),<br>amount DECIMAL(12,2),<br>sold_at TIMESTAMP<br>) ENGINE=DuckDB;

Query it with ordinary SQL — the whole SELECT is pushed down to DuckDB:

SELECT product, SUM(amount) AS revenue<br>FROM sales<br>GROUP BY product<br>ORDER BY revenue DESC<br>LIMIT 10;

And the part that makes it genuinely useful — cross-engine joins . A single query can combine a DuckDB analytical table with an InnoDB operational table. Here it is end to end — two tables in two different engines, joined in one statement:

-- analytical data in DuckDB, operational data in InnoDB<br>CREATE TABLE analytics.orders (id BIGINT PRIMARY KEY, product_id INT, amount DECIMAL(12,2)) ENGINE=DuckDB;<br>CREATE TABLE inventory.products (id INT PRIMARY KEY, name VARCHAR(64)) ENGINE=InnoDB;

INSERT INTO inventory.products VALUES (1, 'Wireless Keyboard'), (2, 'Mechanical Switch');<br>INSERT INTO analytics.orders VALUES (4217, 1, 1299.00), (4218, 2, 42.00);

-- one query, both engines<br>SELECT d.id, d.amount, i.name<br>FROM analytics.orders d -- ENGINE=DuckDB<br>JOIN inventory.products i -- ENGINE=InnoDB<br>ON d.product_id = i.id<br>WHERE d.amount > 1000;

DuckDB handles the join, aggregation, and sorting; the InnoDB rows are produced on demand through the MariaDB optimizer. No copying, no ETL.

Performance

We benchmarked the engine on TPC-H SF10 (~11 GB, 86.6M rows , 60M in lineitem). Two things matter for an analytical engine: how fast you can load data, and how fast you can query it.

Hardware / environment: Intel Core i7-13700H (14 cores / 20 threads), 64 GB RAM, NVMe SSD. MariaDB 11.4.13 with embedded DuckDB v1.5.2, duckdb_memory_limit=8 GiB, threads=20. Warm runs, stable to ~&plusmn;15%.

Query latency — all 22 TPC-H queries

The full suite runs warm in ~4.3 seconds :

QueryTime (s)QueryTime (s)QueryTime (s)q010.253q090.337q170.116q020.076q100.253q180.318q030.134q110.049q190.207q040.135q120.152q200.166q050.142q130.606q210.486q060.070q140.128q220.113q070.133q150.101Total 4.30 q080.145q160.178

Each query also pays a small fixed cost (~40 ms) for client connect and pushdown setup — noticeable on the cheapest queries, negligible on the heavy analytical ones where DuckDB&rsquo;s vectorized execution dominates.

Bulk loading

PathTotal loadIn-engine COPY (run_in_duckdb, 8 GB mem)33 s LOAD DATA LOCAL INFILE~400 s (6.7 min)

Loading external CSVs through DuckDB&rsquo;s native parallel reader in-process — via the run_in_duckdb() function — ingests all 86.6M rows in ~33 s . The standard LOAD DATA LOCAL INFILE path is ~12&times; slower because every row is routed through MariaDB&rsquo;s SQL/handler layer, serializing what COPY does in parallel. For bulk loads, prefer in-engine COPY via run_in_duckdb and raise duckdb_memory_limit for large tables.

How it works (briefly)

DuckDB&rsquo;s speed rests on three pillars: columnar storage (read only the columns you...

duckdb engine mariadb data analytics innodb

Related Articles