Parquet: more than just "Turbo CSV"
csvbase is a simple website for sharing table data.<br>Join the discord.
Parquet: more than just "Turbo CSV"
Quicker, but also more convenient
2023-04-03
by Cal Paterson
Parquet is an efficient, binary<br>file format for table data. Compared to csv, it is:
Quicker to read
Quicker to write
Smaller
On a real world 10 million row financial data table I just tested with pandas I<br>found that Parquet is about 7.5 times quicker to read than csv, ~10 times<br>quicker to write and a about a fifth of the size on disk. So way to<br>think of Parquet is as "turbo csv" - like csv, just faster (and smaller).
That's not all there is to Parquet though. Although Parquet was originally<br>designed for Big Data, it also has benefits for small data.
Scaling down
One of the main advantages of Parquet is the format has an explicit schema that<br>is embedded within the file - and that schema includes type information.
So unlike with csv, readers don't need to infer the types of the columns by<br>scanning the data itself. Inferring types via scanning is fragile and a common<br>source of data bugs. It's not uncommon for the values for a column to begin<br>looking like ints only for later in the file to change into freeform text<br>strings. When exchanging data, it's better to be explicit.
The representation of types is also standardised - there is only one way to<br>represent a boolean - unlike the YES, y, TRUE, 1, [x] and so on of<br>csv files - which all need to be recognised and handled on a per feed basis.<br>Date/time string parsing is also eliminated: Parquet has both a date type and<br>the datetime type (both sensibly recorded as integers in UTC).
Parquet also does away with character encoding confusion - a huge practical<br>problem with textual file formats like csv. Early on, God cursed<br>humanity with multiple<br>languages. The early programmers cursed computers too: there are numerous ways<br>to represent characters as bytes. Different tools use different<br>encodings - UTF-8,<br>UTF-16, UTF-16 with the bytes the wrong<br>way round, Win-1252,<br>ASCII (but something else on days when<br>the feed needs to include a non-ASCII character) - the list goes on.
More sophisticated programs apply statistics to the byte patterns in the early<br>parts of the file to try to guess what the<br>character encoding might be - but again, it's fragile. And if you guess wrong -<br>the result is garbled nonsense.
And finally, Parquet provides a single way to represent missing data - the<br>null type. It's easy to tell "null" apart from "", for<br>example. And the fact that there is an official way to represent null (mostly)<br>eliminates the need to infer that certain special strings (eg N/A) actually<br>mean null.
Column — and row — oriented
How does Parquet work then? Parquet is partly row oriented and partly column<br>oriented. The data going into a Parquet file is broken up into "row chunks" -<br>largeish sets of rows. Inside a row chunk each column is stored separately in<br>a "column chunk" - this best facilitates all the tricks to make the data<br>smaller. Compression works better when similar data is adjacent. Run-length<br>encoding is possible. So is delta encoding.
Here's a diagram:
At the end of the file is the index, which contains references to all the other<br>row chunks, column chunks, etc.
That's actually one of the few downsides of Parquet - because the index is at<br>the end of the file you can't stream it. A lot of programs that process csv<br>files stream through them to allow them to handle csv files that are larger<br>than memory. Not possible with Parquet.
Instead, with Parquet, you tend to split your data across multiple files<br>(there is explicit support for this in the format) and then use the indexes to<br>skip around to find the data you want. But again - that requires random<br>access - no streaming.
Trying it out
You can add .parquet to any csvbase table url to get a Parquet file, so<br>that's an easy way to try the format out:
import pandas as pd<br>df = pd.read_parquet("https://csvbase.com/meripaterson/stock-exchanges.parquet")
If you want to see the gory details of the format, try the parquet-tools<br>package on PyPI with a sample file:
pip install -U parquet-tools<br>curl -O "https://csvbase.com/meripaterson/stock-exchanges.parquet"<br>parquet-tools inspect --detail stock-exchanges.parquet
That shows a lot of detail and in conjunction with the<br>spec can help you understand exactly<br>how the format is arranged.
Keep in touch
Please do send me an email about this article,<br>especially if you disagreed with it.
I am also on Mastodon:<br>@calpaterson@fosstodon.org.
If you liked this, you might like other things I've written: on the csvbase<br>blog, or on my blog.
Follow new posts via RSS.
Comments
No comments yet - add the first one
Please register (or sign in) to leave a comment.
Markdown is supported
Add comment