Throwing 107 GB and 5B fake rows of order data at DuckDB and Athena

b-man1 pts0 comments

Throwing 107 GB and 5 billion fake rows of order data at DuckDB and Athena April 25, 2023 Throwing 107 GB and 5 billion fake rows of order data at DuckDB and Athena

DuckDB is an in-process database management system for OLAP.

Athena is Amazon’s serverless analytics service.

Cloudflare R2 is an alternative to AWS S3 where egress is free of charge.

If you don’t care about DuckDB already, go ahead and read my other post. This post details some highlights of my experiments on a technical level. If you’re more into what it means for your hosting bill, I wrote about that too.

Test data

I used a dataset with one table containing 5 billion fake order lines from 1 billion orders over 10 years. The data is in snappy-compressed Parquet files, partitioned by year and month. The total file size is 107 GB (about 330 GB uncompressed). Here is a sample of records:

┌──────────────────────┬───────┬────────────┬──────────┬────────────────────┬────────────────────┬─────────────┬─────────────────────┬───────────┬─────────┬─────────┬──────────┬─────────┐<br>│ order_id │ line │ product_id │ quantity │ price │ line_price │ customer_id │ created │ num_lines │ day │ month │ store_id │ year │<br>│ int64 │ int64 │ int64 │ int64 │ double │ double │ int64 │ timestamp │ int64 │ varchar │ varchar │ varchar │ varchar │<br>├──────────────────────┼───────┼────────────┼──────────┼────────────────────┼────────────────────┼─────────────┼─────────────────────┼───────────┼─────────┼─────────┼──────────┼─────────┤<br>│ 1135077679086384973 │ 1 │ 854 │ 5 │ 16.083097153631755 │ 80.41548576815877 │ 1535 │ 2016-09-06 01:03:04 │ 4 │ 6 │ 9 │ 8 │ 2016 │<br>│ 1135077679086384973 │ 2 │ 561 │ 3 │ 43.970102508551975 │ 131.91030752565592 │ 1535 │ 2016-09-06 01:03:04 │ 4 │ 6 │ 9 │ 8 │ 2016 │<br>│ 1135077679086384973 │ 3 │ 99 │ 3 │ 54.659783092799465 │ 163.9793492783984 │ 1535 │ 2016-09-06 01:03:04 │ 4 │ 6 │ 9 │ 8 │ 2016 │<br>│ 1135077679086384973 │ 4 │ 261 │ 5 │ 33.972849604393296 │ 169.8642480219665 │ 1535 │ 2016-09-06 01:03:04 │ 4 │ 6 │ 9 │ 8 │ 2016 │<br>│ 4791786128601458882 │ 3 │ 854 │ 6 │ 16.083097153631755 │ 96.49858292179053 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │<br>│ 4791786128601458882 │ 2 │ 83 │ 2 │ 88.68105154796304 │ 177.36210309592607 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │<br>│ 4791786128601458882 │ 1 │ 368 │ 10 │ 33.87760446706489 │ 338.7760446706489 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │<br>│ 4791786128601458882 │ 4 │ 781 │ 2 │ 97.59143722363255 │ 195.1828744472651 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │<br>│ 4791786128601458882 │ 5 │ 396 │ 2 │ 3.9038578062505547 │ 7.807715612501109 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │<br>│ -1029262612490749341 │ 8 │ 854 │ 10 │ 16.083097153631755 │ 160.83097153631755 │ 5408 │ 2016-09-06 05:27:59 │ 9 │ 6 │ 9 │ 8 │ 2016 │<br>├──────────────────────┴───────┴────────────┴──────────┴────────────────────┴────────────────────┴─────────────┴─────────────────────┴───────────┴─────────┴─────────┴──────────┴─────────┤<br>│ 10 rows 13 columns │<br>└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘<br>Each Parquet file has a size of approximately 50 MB. A sample of file paths:

orders/year=2023/month=3/data_0.parquet<br>orders/year=2023/month=3/data_10.parquet<br>orders/year=2023/month=3/data_11.parquet<br>orders/year=2023/month=3/data_12.parquet<br>orders/year=2023/month=3/data_13.parquet<br>orders/year=2023/month=3/data_14.parquet<br>orders/year=2023/month=3/data_15.parquet<br>orders/year=2023/month=3/data_1.parquet<br>orders/year=2023/month=3/data_2.parquet<br>orders/year=2023/month=3/data_3.parquet<br>orders/year=2023/month=3/data_4.parquet<br>Point DuckDB at local Parquet files

create or replace view orders as<br>select * from parquet_scan('orders/*/*/*.parquet', hive_partitioning=1);<br>Get data to and from R2

I used rclone to upload data to R2 and got a throughput of ~470 Mb/s from a box in Helsinki. I downloaded the dataset to a box in Falkenstein at ~920 Mb/s.

Your rclone.conf needs a block like this:

[r2]<br>type = s3<br>provider = Cloudflare<br>env_auth = false<br>access_key_id = $access<br>secret_access_key = $secret<br>endpoint = https://$account.r2.cloudflarestorage.com<br>acl = private<br>Point DuckDB at R2

install httpfs;<br>load httpfs;<br>set s3_endpoint='$account.r2.cloudflarestorage.com';<br>set s3_region='auto';<br>set s3_access_key_id='$access';<br>set s3_secret_access_key='$secret';

create or replace view orders as<br>select * from parquet_scan('s3://duckdata/year-month/orders/*/*/*.parquet', hive_partitioning=1);<br>Notably, my first attempt to query R2 failed while later attempts succeeded:

explain analyze select count(1) from orders;<br>14% ▕████████▍ ▏ Error: IO Error: Connection error for HTTP HEAD to 'https://duckdata.$account.r2.cloudflarestorage.com/year-month/orders/year%3D2019/month%3D9/data_4.parquet'<br>Upload data to S3

Again I used rclone to upload the dataset to AWS S3 and achieved a throughput of...

orders parquet year month data duckdb

Related Articles