Simple batch decoding of unary codes | The ryg blog
Skip to content
Follow:<br>RSS<br>Twitter
The ryg blog
When I grow up I'll be an inventor.
Home
About
Coding
Compression
Computer Architecture
Demoscene
Graphics Pipeline
Maths
Multimedia
Networking
Papers
Stories
Thoughts
Uncategorized
Simple batch decoding of unary codes
May 30, 2026
Unary codes are a form of universal variable-length code (UVLC) that are sometimes used on their own but more commonly used as a building block for more general families of UVLCs like Golomb, Rice or Gamma/Exp-Golomb codes. There are multiple conventions in use, the one I’ll use in the following has codewords terminated with a "1" bit and is 0-based, i.e. the codebook goes
0 -> 1<br>1 -> 01<br>2 -> 001<br>3 -> 0001<br>4 -> 00001
and so forth. That is, some value i is encoded by sending i "0" bits and a single "1" bit at the end. You sometimes see the opposite convention (variable-length run of 1s terminated by a single 0), but having the runs be 0 bits is a bit nicer in code. The main reason is that "count leading/trailing zero bits" instructions are readily available on most machines these days; these can also be used to count leading/trailing ones by doing a bitwise complement first, but that extra work disappears when we use 0s in the first place.
In this post, I want to consider the case where we’re decoding a bunch of unary-coded values in a row. For the sake of concreteness, let’s consider an actual decoder that uses LSB-first bit packing and a Little Endian byte stream, using variant 4 from "Reading bits in far too many ways (part 2)". That decoder will usually look something like this on a 64-bit platform:
// refill<br>uint64 next = read64LE(bitptr);<br>bitbuf |= next > 3;<br>bitcount |= 56;
if ((bitbuf & ((1ull >= len;
This should be fairly straightforward. Unary codes are usually expected to not be too long, and a common mistake is to not check properly and then run into overflow issues or UB when such inputs show up, hence the explicit limit in this version. Theoretical UVLCs can encode any non-negative integer; in practice, any format I’ve ever seen that doesn’t enforce hard limits has nasty (and frequently exploitable) overflow bugs. If you really do need to support very large codes, I would recommend still writing your loops for the expected case (short runs, unless you’re using a unary code in the wrong place) and treating long runs as an exception that you handle more carefully.
Anyway, back to our topic. The code above works, but it’s relatively expensive. As many bitstream decoding tasks, this is completely serial, and we also have a lot of overhead in the bitstream handling. The unary code corresponds to a geometric distribution with success probability 1/2, so the expected code value works out to 1, and hence the expected number of bits consumed every iteration is 2. But we still need to do the refill handling every iteration because any given code could be a long one.
In the usual setting where unary codes are arbitrarily interleaved with other codes, it’s hard to do much better than this. For example, even a basic Rice code alternates unary prefixes with a fixed-length suffix, and that combination destroys a lot of valuable structure that the constituent parts (namely, unary and fixed-length codes) have on their own. Table-driven decoders can accelerate common cases and even handle multiple symbols at once (provided their codewords are short enough), but often at the cost of increased logistical difficulties, and the fundamentally serial nature of decoding with the critical path through most of the decode stays the same.
By contrast, if we can assume a sequence of pure unary codes with nothing in between, things immediately get simpler. For example, if our bit buffer contains at least 2 set bits, we know that we can decode two code words before needing to refill, which lets us unroll the loop and amortize the refill overhead. This is cheaper than it sounds because this test, using the well-known trick to clear the lowest set bit (if any), boils down to:
// clear lowest set bit in bitbuf:<br>uint64 next = bitbuf & (bitbuf - 1);<br>if (next != 0) {<br>// we know we have at least two set bits in bitbuf,<br>// and can use ctz on "bitbuf" and "next" to determine<br>// the next two code values in parallel.<br>} else {<br>// only one set bit in bitbuf, do single decode as before
This strategy works and does in fact result in an almost perfect 2x speed-up without needing a table-based decode (which would add at least one L1D access’s worth of time to the critical path between iterations), but I won’t spend any more time on this here, because with a pure unary code stream, we also have a much better option.
Tunstall-style decoding
Prefix codes map from a known input alphabet to variable-sized (in general) code words. Tunstall coding is the dual approach that maps a variable number of input symbols to code words of a fixed size. Decoding typically uses a table that has space...