Benchmarking Hardwood 1.0 on a Threadripper 9980X — Jack Vanlightly
Hardwood is a minimal-dependency Java library for reading Parquet files. It currently has row-reader and columnar-reader APIs, with Parquet writing planned for the future.<br>Gunnar Morling, Hardwood’s author, published some initial benchmarks in the v1.0 announcement, comparing Hardwood’s row and column readers against Parquet Java. Those benchmarks measured read speed against already-downloaded Parquet files.<br>Gunnar’s benchmarks ran on an m7i.2xlarge, with 8 vCPUs / 4 physical cores. Each test used three variants:<br>Hardwood with decoder threads = Runtime.getRuntime().availableProcessors(), which equals 8
Hardwood pinned to one CPU thread with taskset
Parquet Java, single-threaded
I was curious how the same benchmarks would look on my Threadripper 9980X: 64 cores / 128 threads, with 256 GB ECC DDR5. I modified Gunnar’s benchmark code to also test Hardwood with fixed decoder-thread counts: 1, 4, and 8.<br>That gives the following Threadripper variants:<br>Hardwood, unpinned, decoder threads = 128 (available processors)
Hardwood, unpinned, decoder threads = 8
Hardwood, unpinned, decoder threads = 4
Hardwood, unpinned, decoder threads = 1
Hardwood pinned to one CPU thread (taskset)
Parquet Java, single-threaded
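The thread-count variants above boil down to one choice: use an explicit decoder-thread count if one is given, otherwise fall back to the number of processors the JVM can see. A minimal sketch of that choice follows. The class and method names are hypothetical and not part of Hardwood or the benchmark code, and how the count is actually handed to Hardwood is not shown here.

```java
import java.util.OptionalInt;

/** Illustrative helper; the class and method names are hypothetical, not part of Hardwood. */
final class DecoderThreadCount {

    /**
     * Returns the explicitly requested thread count if one is given,
     * otherwise falls back to the number of processors visible to the JVM.
     */
    static int resolve(OptionalInt requested, int availableProcessors) {
        if (requested.isPresent()) {
            int n = requested.getAsInt();
            if (n < 1) {
                throw new IllegalArgumentException("decoder threads must be >= 1, got " + n);
            }
            return n;
        }
        return availableProcessors;
    }

    public static void main(String[] args) {
        // Number of logical processors the JVM reports for this process.
        int visible = Runtime.getRuntime().availableProcessors();

        // Optional first argument: a fixed decoder-thread count such as 1, 4, or 8.
        OptionalInt requested = args.length > 0
                ? OptionalInt.of(Integer.parseInt(args[0]))
                : OptionalInt.empty();

        System.out.println("visible processors = " + visible);
        System.out.println("decoder threads    = " + resolve(requested, visible));
    }
}
```

On Linux, the JVM generally derives availableProcessors() from the process's CPU affinity mask, so running this under taskset -c 0 should report one visible processor. It is worth checking on your own JVM rather than assuming it.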
One important detail: decoder threads = 1 is not the same as the pinned 1-core test. With decoder threads = 1, the main thread can run on another core. The pinned test constrains the whole process to one logical CPU, which is the closest we can get to a like-for-like comparison with single-threaded Parquet Java.<br>Flat full scan (columnar reader)<br>This benchmark reads all columns of the 48M-row dataset.<br>m7i.2xlarge
Fig 1: m7i.2xlarge, Hardwood (all cores) 16.5M/s, Hardwood pinned 1-core 3.9M/s, Parquet Java (single-threaded) 3.3M/s
Threadripper 9980X
Fig 2: Threadripper, Hardwood (all cores) 43.4M/s, Hardwood dt=8 48.4M/s, Hardwood dt=4 44.9M/s, Hardwood dt=1 15.5.9M/s, Hardwood pinned 1-core 11.0M/s, Parquet Java (single-threaded) 5.8M/s
A few things stand out:<br>The Threadripper is much faster in the single-core cases than the m7i.2xlarge. Hardwood pinned to one core reaches 11.0M rows/s (with some runs reaching over 12M), versus 3.9M rows/s on the m7i.2xlarge. That is roughly 3x faster.
Hardwood’s single-core result on the Threadripper is also much stronger relative to Parquet Java. On the m7i.2xlarge, Hardwood 1-core is only modestly ahead of Parquet Java: 3.9M rows/s versus 3.3M rows/s. On the Threadripper, Hardwood 1-core is almost 2x faster: 11.0M rows/s versus 5.8M rows/s.
More decoder threads help, but only up to a point. The best result here is 8 decoder threads, at 48.4M rows/s. Four decoder threads are close behind at 44.9M rows/s. The default availableProcessors() setting, which gives 128 decoder threads on this machine, is slower than both, which is not surprising.
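The ratios behind these observations are simple divisions of the reported throughputs. A small sketch that recomputes them from the Fig 1 and Fig 2 numbers follows. The class name is made up for illustration, and the dt=1 figure is left out.

```java
import java.util.Locale;

/** Recomputes speedup ratios from the throughputs reported in Fig 1 and Fig 2. */
final class SpeedupRatios {

    /** Ratio of two throughputs given in the same unit (here: million rows per second). */
    static double ratio(double numerator, double denominator) {
        return numerator / denominator;
    }

    private static void print(String label, double value) {
        // Locale.ROOT keeps the decimal separator a dot regardless of system locale.
        System.out.print(String.format(Locale.ROOT, "%-48s %.2fx%n", label, value));
    }

    public static void main(String[] args) {
        // m7i.2xlarge, columnar full scan (Fig 1), in M rows/s.
        double m7iPinned = 3.9;
        double m7iParquetJava = 3.3;

        // Threadripper 9980X, columnar full scan (Fig 2), in M rows/s.
        double trAllCores = 43.4;
        double trDt8 = 48.4;
        double trPinned = 11.0;
        double trParquetJava = 5.8;

        print("Threadripper pinned vs m7i pinned", ratio(trPinned, m7iPinned));
        print("m7i: Hardwood pinned vs Parquet Java", ratio(m7iPinned, m7iParquetJava));
        print("Threadripper: Hardwood pinned vs Parquet Java", ratio(trPinned, trParquetJava));
        print("Threadripper: dt=8 vs pinned 1-core", ratio(trDt8, trPinned));
        print("Threadripper: dt=8 vs 128 decoder threads", ratio(trDt8, trAllCores));
    }
}
```

With these inputs, the pinned Threadripper run comes out at about 2.8x the pinned m7i.2xlarge run, Hardwood pinned is about 1.2x Parquet Java on the m7i.2xlarge and about 1.9x on the Threadripper, and 8 decoder threads give about 4.4x the pinned single-core throughput.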
Flat full scan (row reader)<br>This benchmark reads all rows of the 48M-row dataset. It has two variants:<br>Indexed (positional) columns, e.g. r.getLong(3)
Named columns, e.g. r.getLong("passenger_count")
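To make the two access styles concrete, here is a toy row type with both a positional and a named getter. This is purely illustrative: ToyRow, its two-column layout, and indexOf are hypothetical and say nothing about how Hardwood implements either path. It only shows that a named lookup has to resolve the name to a position before it can read the value, and that the position can be resolved once up front.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy row type for illustration only; hypothetical and unrelated to Hardwood's classes. */
final class ToyRow {
    private final Map<String, Integer> indexByName = new HashMap<>();
    private final long[] values;

    ToyRow(String[] names, long[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        for (int i = 0; i < names.length; i++) {
            indexByName.put(names[i], i);
        }
        this.values = values.clone();
    }

    /** Positional access: a direct array read. */
    long getLong(int index) {
        return values[index];
    }

    /** Named access: resolve the name to a position, then read. */
    long getLong(String name) {
        return values[indexOf(name)];
    }

    /** Resolves a column name to its position; callers can do this once and reuse the result. */
    int indexOf(String name) {
        Integer index = indexByName.get(name);
        if (index == null) {
            throw new IllegalArgumentException("unknown column: " + name);
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up two-column layout, not the benchmark's schema.
        ToyRow r = new ToyRow(new String[] {"vendor_id", "passenger_count"}, new long[] {2L, 3L});

        int passengerCountIdx = r.indexOf("passenger_count"); // resolved once
        System.out.println(r.getLong(passengerCountIdx));     // positional read
        System.out.println(r.getLong("passenger_count"));     // named read, same value
    }
}
```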
m7i.2xlarge
Fig 3: m7i.2xlarge, Indexed-columns, Hardwood (all cores) 14.9M/s, Hardwood 1-core 4.4M/s, Parquet Java (single-threaded) 1.4M/s. Named-columns, Hardwood (all cores) 2.8M/s, Hardwood 1-core 1.9M/s, Parquet Java (single-threaded) 1.4M/s
Threadripper 9980X
Fig 4: Threadripper, indexed (positional) columns, Hardwood (all cores) 33.4M/s, Hardwood dt=8 36.1M/s, Hardwood dt=4 34.9M/s, Hardwood dt=1 14.4M/s, Hardwood pinned 1-core 10.8M/s, Parquet Java (single-threaded) 3M/s. Named columns, Hardwood (all cores) 5.9M/s, Hardwood dt=8 5.8M/s, Hardwood dt=4 5.9M/s, Hardwood dt=1 5.7M/s, Hardwood pinned 1-core 4.3M/s, Parquet Java (single-threaded) 2.6M/s
The indexed-column row reader shows the same basic pattern as the columnar full scan. Hardwood is much faster than Parquet Java even in the pinned 1-core case: 10.8M rows/s versus 3.0M rows/s. The best multi-threaded result is again with 8 decoder threads, at 36.1M rows/s, with 4 decoder threads close behind.<br>The named-column reader is different. Hardwood is still ahead of Parquet Java, but it does not meaningfully scale with decoder threads. The unpinned Hardwood results are all around 5.7M to 5.9M rows/s, regardless of whether the benchmark uses 1, 4, 8, or 128 decoder threads.<br>If you want high throughput, use the indexed-column approach.<br>Flat filtered scan (column reader)<br>This test generates data with 4 columns and 50M rows where event_time is perfectly ordered. The filter is on event_time and comes in two variants: selective, which matches 5% of rows, and matchAll, which matches 100%. The test measures the time for the filtered scan to complete.<br>m7i.2xlarge
Fig 5: Selective (5%), Hardwood (all cores) 12.9 ms, Hardwood pinned 1-core 53.8 ms, Parquet Java (single-threaded) 173 ms. Match-all (100%), Hardwood (all cores) 222 ms, Hardwood pinned 1-core 983 ms, Parquet Java (single-threaded) 3157 ms
Threadripper
Fig 6: Selective (5%), Hardwood...