PivCo-Huffman "Merge" Operations

ibobev1 pts0 comments

PivCo-Huffman "merge" operations | The ryg blog

Skip to content

Follow:<br>RSS<br>Twitter

The ryg blog

When I grow up I'll be an inventor.

Home

About

Coding

Compression

Computer Architecture

Demoscene

Graphics Pipeline

Maths

Multimedia

Networking

Papers

Stories

Thoughts

Uncategorized

PivCo-Huffman "merge" operations

June 21, 2026

There’s a new paper out called "PivCo-Huffman" (HTML version with annotations here) and it’s very interesting. Normal Huffman decoding (and, to a lesser extent, encoding) is inherently quite serial. We can get explicit parallelism by using multiple streams, which scales just fine to moderate numbers of streams – something like 4-8 is usually not an issue. Not very suitable for vectorization or wide vector machines like GPUs, though: every extra stream adds signaling overhead in the bitstream, and decoding from many streams at once is a gather-heavy operation. GPUs can cope with gathers okay (although spread-out data accesses still consume a lot more cycles than more compact memory access patterns do), most everything else is very unhappy with extremely gather-heavy patterns.

Especially since our usual table-based decoders will then immediately follow up with another gather, although that part is more easily fixed: canonical codes admit reasonably easy table-less decoding (at least for the critical "length determination" part), if desired. These approaches are not particularly juicy in a scalar one-at-a-time decoder, where a single load is very cheap, but if you’re working on 32-wide vectors, doing a bunch of integer math that saves you a gather is a more interesting proposition.

You can also use ANS-style interleaving to turn many logical bitstreams into a single physical bitstream. This is what GDeflate does to get a bitstream that has more potential parallelism without gathering from all over the place. Interleaving solves the "reads all over memory" problem and works great on GPUs and CPUs that provide suitable facilities (e.g. AVX-512 "EXPAND" family instructions, or just fast gathers from the same cache line), but is slow without that kind of hardware support. It also forces you to pick a magic number (the interleave factor) as part of the bitstream definition. Too low, and wide vector machines like GPUs get bad utilization when decoding. Too high, and scalar machines can’t efficiently decode the bitstream at all. Worse, numbers that are just right for say AVX-512 or GPUs overlap, but the smallest number of interleaved streams most GPUs or even CPUs with something like AVX-512 can get decent utilization with is around 32, the largest number scalar or narrower vector machines with typical register counts can comfortably decode is around 8-16 (depending on how exactly you do it), and in the middle between 16 and 32 is a valley of tears where nobody’s happy.

That’s why you don’t see much of this style of bitstream outside of GDeflate; it is inherently reliant on you picking a magic number that has a huge influence on every single implementation, yet at the same time is subject to ever-shifting hardware details. Yes, NVidia GPUs have been built around warps with 32 "threads" (more like SIMD lanes) using 32 bits per register each for a good while, but AMD used to prefer 64 (before switching to mostly-32 a few generations ago), Intel GPUs were designed around 8, some mobile GPUs used to do as little as 4, and 4 is also the number of 32-bit lanes you get for something like ARM NEON or SSE4.2 on the x86 side, later widened to 8 with AVX2. Finally, ARM SVE and the RISC-V V extension leave it implementation defined, which is one thing when the model is that you’re running a simple vector math kernel like a SAXPY, but quite another when the physical vector width ends up being a magic constant that is baked into some persistent wire format, potentially for decades.

The final major popular solution to the "this is very serial" conundrum is to completely brute-force it. Fetch a bunch of bits from the input. In parallel, start decoding at bit 0, and at a bit 1, at bit 2, 3, and so forth. Do this for a lot of sequential candidate positions. I’m talking something like 16 or more.

Eventually, we discover that the code starting at bit 0 was 6 bits long. Cool! That means we discard all the work we did at bit offsets 1 through 6, and look at what our already pre-decoded code starting at bit 6 was. That one is 5 bits long, so we discard the speculative work we did for bits 7 through 10, and find that starting at bit 11, there’s another code that’s 5 bits long. Congratulations, we’ve just decoded 3 values for the price of 16.

This works just fine. Believe it or not, this is actually the preferred method of decoding such a sequential bitstream in hardware if a throughput of more than one value per cycle is required. Especially with canonical codes and not-too-large length limits, this can be made reasonably cheap. Chasing down the sequence of actual values to emit at the end is...

gpus like bitstream decoding bits huffman

Related Articles