Parquet: More than just "Turbo CSV"

tosh2 pts0 comments

Parquet: more than just "Turbo CSV"

csvbase is a simple website for sharing table data.<br>Join the discord.

Parquet: more than just "Turbo CSV"

Quicker, but also more convenient

2023-04-03

by Cal Paterson

Parquet is an efficient, binary<br>file format for table data. Compared to csv, it is:

Quicker to read

Quicker to write

Smaller

On a real world 10 million row financial data table I just tested with pandas I<br>found that Parquet is about 7.5 times quicker to read than csv, ~10 times<br>quicker to write and a about a fifth of the size on disk. So way to<br>think of Parquet is as "turbo csv" - like csv, just faster (and smaller).

That's not all there is to Parquet though. Although Parquet was originally<br>designed for Big Data, it also has benefits for small data.

Scaling down

One of the main advantages of Parquet is the format has an explicit schema that<br>is embedded within the file - and that schema includes type information.

So unlike with csv, readers don't need to infer the types of the columns by<br>scanning the data itself. Inferring types via scanning is fragile and a common<br>source of data bugs. It's not uncommon for the values for a column to begin<br>looking like ints only for later in the file to change into freeform text<br>strings. When exchanging data, it's better to be explicit.

The representation of types is also standardised - there is only one way to<br>represent a boolean - unlike the YES, y, TRUE, 1, [x] and so on of<br>csv files - which all need to be recognised and handled on a per feed basis.<br>Date/time string parsing is also eliminated: Parquet has both a date type and<br>the datetime type (both sensibly recorded as integers in UTC).

Parquet also does away with character encoding confusion - a huge practical<br>problem with textual file formats like csv. Early on, God cursed<br>humanity with multiple<br>languages. The early programmers cursed computers too: there are numerous ways<br>to represent characters as bytes. Different tools use different<br>encodings - UTF-8,<br>UTF-16, UTF-16 with the bytes the wrong<br>way round, Win-1252,<br>ASCII (but something else on days when<br>the feed needs to include a non-ASCII character) - the list goes on.

More sophisticated programs apply statistics to the byte patterns in the early<br>parts of the file to try to guess what the<br>character encoding might be - but again, it's fragile. And if you guess wrong -<br>the result is garbled nonsense.

And finally, Parquet provides a single way to represent missing data - the<br>null type. It's easy to tell "null" apart from "", for<br>example. And the fact that there is an official way to represent null (mostly)<br>eliminates the need to infer that certain special strings (eg N/A) actually<br>mean null.

Column — and row — oriented

How does Parquet work then? Parquet is partly row oriented and partly column<br>oriented. The data going into a Parquet file is broken up into "row chunks" -<br>largeish sets of rows. Inside a row chunk each column is stored separately in<br>a "column chunk" - this best facilitates all the tricks to make the data<br>smaller. Compression works better when similar data is adjacent. Run-length<br>encoding is possible. So is delta encoding.

Here's a diagram:

At the end of the file is the index, which contains references to all the other<br>row chunks, column chunks, etc.

That's actually one of the few downsides of Parquet - because the index is at<br>the end of the file you can't stream it. A lot of programs that process csv<br>files stream through them to allow them to handle csv files that are larger<br>than memory. Not possible with Parquet.

Instead, with Parquet, you tend to split your data across multiple files<br>(there is explicit support for this in the format) and then use the indexes to<br>skip around to find the data you want. But again - that requires random<br>access - no streaming.

Trying it out

You can add .parquet to any csvbase table url to get a Parquet file, so<br>that's an easy way to try the format out:

import pandas as pd<br>df = pd.read_parquet("https://csvbase.com/meripaterson/stock-exchanges.parquet")

If you want to see the gory details of the format, try the parquet-tools<br>package on PyPI with a sample file:

pip install -U parquet-tools<br>curl -O "https://csvbase.com/meripaterson/stock-exchanges.parquet"<br>parquet-tools inspect --detail stock-exchanges.parquet

That shows a lot of detail and in conjunction with the<br>spec can help you understand exactly<br>how the format is arranged.

Keep in touch

Please do send me an email about this article,<br>especially if you disagreed with it.

I am also on Mastodon:<br>@calpaterson@fosstodon.org.

If you liked this, you might like other things I've written: on the csvbase<br>blog, or on my blog.

Follow new posts via RSS.

Comments

No comments yet - add the first one

Please register (or sign in) to leave a comment.

Markdown is supported

Add comment

parquet data file format column csvbase

Related Articles