Compressing floating point data with Gorilla - quanttype
Last week, I gave a talk titled Compressing floating point data with Gorilla at Helsinki Python Meetup.<br>There’s no recording, but here is a blog version of the slides.
Gorilla is a simple algorithm for compressing floats.<br>It’s over a decade old and it’s not the state of the art anymore but since it sparked the development of a whole family of algorithms, it’s a great starting point for learning about this topic.
I had a lot of fun learning about this and I figured others would enjoy it as well.<br>You get to learn a bit about floats, time series data, and data compression.<br>I hope you like it!
Actually, don’t just read this blog, go subscribe!
The talk title mentions the Gorilla algorithm, but the actual Gorilla was a time series database developed by Facebook.<br>It was not ever made publicly available but they described it in a paper published in 2015.
They were trying to monitor web backends, including metrics like CPU load and network latency, and they wanted to be able to query recent data quickly.<br>Because of this, they designed a new database that could hold the data in the memory.<br>It was pretty much an observability tool, although they didn’t call it that back then.
To make the data fit in the memory, they came up with a simple but efficient encoding for the floats.<br>This encoding is the algorithm that we’re talking about.
There are many general purpose compressing algorithms like DEFLATE and Zstandard.<br>These algorithms need to work on any data you can throw at them.<br>However, if you know something about the data you’re going to be compressing,<br>you may be able to design an algorithm that is faster or compresses better or uses less memory than the general purpose algorithms.
With Gorilla, we’re going to compress time series data, which is data where the data points are associated with a timestamp.<br>Do we know anything about time series data that we could use?
One thing is that it’s often produced by some sort of slow process.<br>For example, on the slides you can see some temperature data from my home.<br>As you can see, the numbers are roughly the shame.<br>The temperature changes slowly and within a limited range.<br>It’s not going to suddenly jump to plus or minus hundred degrees.
Here’s an insight we need: in time series data, the consecutive values typically resemble each other and stay within a narrow range.
Before we can talk about the algorithm itself, we need know a bit about how floats work.
Link: https://float.exposed/0x3ff0000000000000
float.exposed is a web playground for seeing how computers represent the floats.<br>You can enter a float at the top and it shows you the bits and the decomposition.
You can click the bits to toggle them.
See what happens when you toggle the sign bit (marked with red underline).
See what happens when you increase or decrease the exponent.
Toggle the bits in the significand (marked with blue underline) and see if you can do the binary math in your head to verify the result.
Try entering 1.25.<br>The slides said that the exponent would be 0 but float.exposed says it’s 1023.<br>That’s because the value is biased.<br>You have to subtract 1023 to get the actual exponent.<br>This is what allows us to store negative exponents in that fields - try entering 1 in the exponent and see what you get.
We don’t have to remember all of this to understand Gorilla.<br>However, it’s good to notice that the sign bit and the exponent come first.<br>This means that if we’ve got a bunch of numbers that a roughly the same, and they aren’t right around zero,<br>the left-hand side of their bit representations is going to be roughly the same as it’s only the significand that changes.
Now we’re ready to talk about the algorithm!
The second column shows the binary representation for the inputs.<br>I made up the syntax in the third row - you can’t actually xor two floats in Python - but I’m using it to mean the xor of the binary representations.
Xor sets the bits that differ to one and the bits that are the same to zero. See the highlighted bits - they are the only two that differ in the inputs.
Xor is also reversible. If you xor 20.5 with 21.0 and then xor the result with 21.0 again, you’ll get 20.5 back.
Now if we look at the xor result, we can see there’s a run of leading zeros.<br>Then there’s a handful of meaningful bits.<br>Now it’s just two ones but there could be some zeros in the middle.<br>And finally there’s a run of trailing zeros.
Here’s how we’re going to encode this: we’re going to store the number of zeros instead of the actual zero bits.
We’re going to store the first value as-is and then for the remaining values, the value xorred with the previous value.<br>Since xor is reversible, we can revert this process when we’re decoding the data.
Then we encode the...