zeux.io - Zigzag decoding with AVX-512
Bits, pixels, cycles and more
Arseny Kapoulkine (email, , )
Author
Arseny Kapoulkine (email, , )
Work, projects and publications
Posts
If you like this blog, you should follow it via RSS and read older posts. Here are the recent posts:
Zigzag decoding with AVX-512
Quantizing tangent frames
meshoptimizer 1.0 released
Billions of triangles in minutes
Do not disrespect the fractal
Load-store conflicts
Measuring acceleration structures
Year of independence
Unlearning metrics and algorithms
X is justifiably slow
" Quantizing tangent frames
Zigzag decoding with AVX-512
17 Jun 2026
I’ve been working on speeding up AVX-512 vertex decoding in meshoptimizer recently; in the process I stumbled upon two optimizations that I did not end up using but I thought they might be fun to write about! The optimizations that actually made it in require some higher level background / explanations that will have to wait until another day :)
Both optimizations discussed here touch upon two new (for me) features of AVX-512 that I haven’t had a chance to experiment with until now, and both apply to the same problem: how to decode zigzag encoded integers.
Zigzag encoding
Suppose that you have a lot of signed integers to encode; and these integers come as a result of delta encoding, representing differences between input values. If the input values are reasonably close to each other, the resulting deltas are small in magnitude, but the sign of the deltas is fairly unpredictable.
We could encode the resulting deltas as is, in their two’s complement format - but that may be inconvenient if we want to encode the deltas using few bytes or bits. Ideally we’d like small deltas to be encoded with a number of bits that scales with their magnitude. A zigzag encoding gives us that property, and it transforms values such that positive values v are stored as 2*v, and negative values are stored as 2*(~v)+1. So 0 encodes as 0, -1 encodes as 1, 1 encodes as 2, etc. - positive values become even values and negative values become odd values. The sign bit is effectively shifted to the least significant bit.
To decode this, we can directly decode either a positive or a negative value1:
(v & 1) == 0 ? (v >> 1) : ~(v >> 1)
… or, use a common branchless decoding function:
(v >> 1) ^ -(v & 1)
The way branchless decoding works is that -(v & 1) is 0 when v is even, and a mask of all 1s when v is odd. Xor-ing with that mask keeps the magnitude (v >> 1) as is if the mask is 0, and inverts it if the mask is all 1s - which is equivalent to the ternary expression above.
The net effect is that the resulting value is a small positive value if the input was a small negative or positive value; we can then encode the result using any variable-width scheme of our choosing: common options include VByte (which encodes unsigned integer values as 1-5 bytes) and other forms of bit-wise encoding2.
SIMD decoding
The branchless decoding function maps in a very straightforward way to any SIMD instruction set of your choosing: assuming you somehow already produced unsigned value v in each lane of a SIMD vector, you can easily write a (width-dependent) function that decodes the original signed value. For example, using SSE2, and assuming 32-bit integer inputs, we get:
__m128i xs = _mm_and_si128(v, _mm_set1_epi32(1));<br>__m128i xl = _mm_sub_epi32(_mm_setzero_si128(), xs);<br>__m128i xr = _mm_srli_epi32(v, 1);
return _mm_xor_si128(xl, xr);
This is a direct translation of the C code above, except that SSE2 doesn’t have an obvious first class way to negate integers, so you need to subtract from 0 instead. This compiles exactly as you would expect (and also supports any integer width and can use wider AVX2 or AVX-512 vectors in an obvious way)3:
vpsrld xmm1, xmm0, 1 ; shift by 1<br>vpandq xmm0, xmm0, xmm2 ; and<br>vpsubd xmm0, xmm3, xmm0 ; sub from 0<br>vpxorq xmm0, xmm0, xmm1 ; xor
Every instruction used here has 1 cycle of latency on Zen 4; note that vpandq does not depend on the results of vpsrld, so the cumulative latency of the 4 instructions is 3 cycles.
Masks
The decoding transformation here is quite simple: our branchless formulation is expressed using arithmetic, but really it’s just moving bits around and occasionally flipping them. This feels like something that a sufficiently advanced bit permutation engine in hardware can do easily, and certainly dedicated circuitry here would be much simpler than an adder. But we make do with what we are given, and what we are given is AVX-512 - which mostly consists of unsurprising instructions (arithmetic, bitwise ops, etc.), and the obvious “surprising” instructions like vpternlogd (which allows creating custom bitwise functions that combine three bits into one using a truth table) do not seem to help here either.
There is, however, a different way we can go which is a little unusual if you’re used to earlier SIMD instruction sets: instead of using a branchless...