Hardwood 1.0: A Fast, Lightweight Apache Parquet Reader for the JVM

Gunnar Morling

Posted at Jun 25, 2026

Table of Contents

Why Hardwood

What’s in Hardwood 1.0

Performance

The Hardwood CLI

Building Open-Source With AI

A Big Thank You

What’s Ahead

Hardwood is a new Parquet library for the JVM, written from scratch to do one thing well: read (and soon, write) Apache Parquet files fast, with no mandatory dependencies. It is performance-focused and multi-threaded at its core, fanning page decoding out across all your CPU cores by default.

Today, Hardwood reaches 1.0. After five preview releases since the start of the year (Alpha1, Beta1, Beta2, CR1, CR2), we now consider Hardwood ready for production, and its public API will evolve with a strong focus on backwards compatibility going forward. Hardwood targets Java 21 or newer, is open-source (Apache License 2.0), and is available from Maven Central.

Why Hardwood

Working with the Apache Parquet columnar file format on the JVM has traditionally come with a fairly heavyweight stack: a large number of dependencies on the classpath and a single-threaded reader at the core. Hardwood explores a different set of tradeoffs. The full rationale is in the original project announcement; in a nutshell, the goals are:

Implement a Parquet library without any mandatory dependencies [1]: Parquet files that are either uncompressed or gzip-compressed don’t require any third-party libraries at all; for parsing files compressed with Snappy/Zstd/LZ4/Brotli, you only need to provide the (typically single-JAR) codec of your choosing

Utilize modern multi-core CPUs as much as possible: unlike parquet-java, which is single-threaded at its core, Hardwood fans out the decoding of the individual pages of a Parquet file to multiple threads, resulting in significantly reduced wall-clock parsing times

Be compatible: every file which can be parsed by parquet-java should also be parseable with Hardwood; if that’s not the case for a given file, we consider this a bug which needs fixing

[1] In particular, no Hadoop. IYKYK.
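To make the multi-core goal above more concrete, here is a minimal sketch of the general fan-out pattern: independent pages are handed to a thread pool and the results are collected in page order. This is not Hardwood's implementation; the class `PageFanOutSketch`, the representation of a page as an `int[]` of deltas, and the prefix-sum "decoding" are all assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: hypothetical names, not Hardwood's API or internals.
final class PageFanOutSketch {

    /** Stand-in for real page decoding: turns a page of deltas into absolute values. */
    static long[] decodePage(int[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    /** Decodes all pages concurrently and returns the results in page order. */
    static List<long[]> decodeAll(List<int[]> pages) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page; pages are independent, so no coordination is needed
            List<CompletableFuture<long[]>> futures = new ArrayList<>();
            for (int[] page : pages) {
                futures.add(CompletableFuture.supplyAsync(() -> decodePage(page), pool));
            }
            // Joining in submission order preserves the original page order
            List<long[]> decoded = new ArrayList<>();
            for (CompletableFuture<long[]> future : futures) {
                decoded.add(future.join());
            }
            return decoded;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `PageFanOutSketch.decodeAll(pages)` returns one decoded array per input page. The pattern only pays off when the per-page work is substantial (decompression, decoding), which is the case the goal above describes.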

What’s in Hardwood 1.0

The 1.0 release implements all the key capabilities you’d expect from a Parquet reader: coverage of all the physical and logical Parquet column types, including VARIANT and a first cut of handling geo-spatial columns, support for all relevant column encodings and compression schemes, the ability to parse both local and remote files (on object storage such as S3), projections and predicate push-down, and much more. The hardwood-examples repository is a great starting point to learn all about Hardwood’s capabilities and how to make the best use of them.
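As a rough illustration of why predicate push-down matters: Parquet files store min/max statistics per column for each row group, so a reader can skip row groups that cannot contain a matching value. The sketch below shows that idea in isolation; `PushDownSketch` and `RowGroupStats` are hypothetical names invented for this example and do not reflect Hardwood's API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical types, not Hardwood's API.
final class PushDownSketch {

    /** Min/max statistics of one column within one row group. */
    record RowGroupStats(int rowGroupIndex, long min, long max) {}

    /** Returns the indexes of row groups that may contain a value in [lower, upper]. */
    static List<Integer> rowGroupsToRead(List<RowGroupStats> stats, long lower, long upper) {
        List<Integer> selected = new ArrayList<>();
        for (RowGroupStats s : stats) {
            // The ranges overlap unless the row group lies entirely below or above the predicate
            boolean mayMatch = s.max() >= lower && s.min() <= upper;
            if (mayMatch) {
                selected.add(s.rowGroupIndex());
            }
        }
        return selected;
    }
}
```

Statistics can only rule row groups out, never confirm a match, so the selected row groups still need to be decoded and filtered value by value.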

Hardwood comes with two distinct APIs which are at opposite ends of the ergonomics-vs-performance spectrum. The row reader API provides structured access to the records of a Parquet file, including nested and repeatable columns. It’s a great starting point for general-purpose access to Parquet:

    try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
            RowReader rowReader = fileReader.rowReader()) {

        // Advance row by row; the typed getters read from the current row
        while (rowReader.hasNext()) {
            rowReader.next();

            long id = rowReader.getLong("id");
            String name = rowReader.getString("name");
            LocalDate birthDate = rowReader.getDate("birth_date");
            Instant createdAt = rowReader.getTimestamp("created_at");
        }
    }

The column reader API on the other hand exposes a batch-style API for accessing arrays of raw Parquet column values, with a layer scheme inspired by Apache Arrow for representing repeatable columns. It trades ergonomics for throughput: minimal per-value overhead, and batches of primitive arrays the caller can hand straight to a pool of worker threads or a vectorized loop. This makes the column reader the right foundation for analytical workloads over large numbers of values.
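To give a feel for the Arrow-style representation of repeatable columns mentioned above, here is a small sketch of the offsets-plus-values layout that Apache Arrow uses for variable-size lists: all values sit in one flat primitive array, and an offsets array marks where each row's list begins and ends. The class `OffsetsLayoutSketch` and its method are hypothetical; Hardwood's actual batch types may look different.

```java
// Illustrative only: hypothetical layout helper, not Hardwood's actual batch types.
final class OffsetsLayoutSketch {

    /**
     * Sums each row's list of values. Row i owns values[offsets[i]] up to, but not
     * including, values[offsets[i + 1]], so offsets has one more entry than there are rows.
     */
    static long[] sumPerRow(int[] offsets, long[] values) {
        int rows = offsets.length - 1;
        long[] sums = new long[rows];
        for (int row = 0; row < rows; row++) {
            long sum = 0;
            // A tight loop over a primitive array: no per-value object allocation
            for (int i = offsets[row]; i < offsets[row + 1]; i++) {
                sum += values[i];
            }
            sums[row] = sum;
        }
        return sums;
    }
}
```

With `offsets = {0, 2, 2, 5}` and `values = {1, 2, 3, 4, 5}`, the three rows hold the lists `[1, 2]`, `[]`, and `[3, 4, 5]`. Because an empty list is just two equal offsets, the layout needs no special case for it.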

Exposing exactly one way to achieve a given task is generally a good idea in API design, but ergonomics and peak throughput genuinely require different shapes, and we didn’t want to compromise one for the sake of the other. To learn more about the differences between the two reader APIs in Hardwood and when to use which one, refer to this guide on reader models.

Performance

Speaking of performance, let’s take a look at some numbers. Benchmarking Parquet workloads is a wide field; in the following we’re going to touch on two specific workloads.

The first one is a full scan of a flat dataset (i.e. no repeatable or nested columns) of taxi rides in New York City, provided by the NYC Taxi & Limousine Commission. The benchmark folds the values of all 20 columns across twelve files from the dataset, representing the months January to December of 2025. There are 48.7M rows, with a total size of 830 MB compressed.

Both benchmarks are implemented using JMH, running five...
