Pushing memory bound kernels beyond the speed of light with lossless decompression
26 May 2026 · 12 min read
In an earlier post on this blog we measured the entropy of the weights of a lot of modern open-weight models. As a reminder: the Shannon entropy sets a lower bound on how small a stream of bytes can be made by lossless compression.
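As a quick refresher on what that bound means in practice, here is a minimal sketch of the order-0 (byte-by-byte) empirical entropy of a stream and the compressed size it implies. The function names are mine, the input is a toy byte string rather than real weights, and the bound only applies to coders that treat bytes as independent symbols.

```python
import math
from collections import Counter


def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 empirical Shannon entropy of a byte stream, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def min_compressed_bytes(data: bytes) -> float:
    """Entropy lower bound on compressed size, assuming bytes are coded independently."""
    return len(data) * shannon_entropy_bits_per_byte(data) / 8.0


# Toy stream (an assumption for illustration): two equally likely symbols.
sample = b"abababab"
print(shannon_entropy_bits_per_byte(sample), min_compressed_bytes(sample))
```

A stream stored at 8 bits per byte whose entropy is below 8 bits per byte has headroom; that gap is what the earlier post was measuring.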
We found real headroom between the bit-width the weights are stored at and the bit-width their actual values demand. But compressibility only starts to matter when something downstream runs faster, or fits somewhere it previously didn’t. So the question is operational: given that the weights can be compressed, what can we actually do if we compress them?
There are lots of things you could think of doing. Some ideas:
Faster weight loading. Pulling the weights off disk or across the network into GPU memory is the thread a previous post here pulled on. It gets more interesting on Blackwell, where the decompression engine is in hardware rather than a kernel you have to write yourself.
Cheaper collectives. Compress weights or activations in flight between GPUs, so the bytes crossing the interconnect are smaller than the bytes the model nominally operates on.
Memory-bound kernels. Keep the weights compressed in HBM and decompress them on the fly inside the kernel that consumes them, so the same arithmetic runs against fewer bytes read, and the kernel runs faster.
The first two seem likely to be possible. The last seems hard. In this post we’ll see if we can get any purchase on it.
To make it even harder: it’s no longer very interesting to build optimizations that only apply to bf16 weights. Nobody serves a large model in bf16 if they can avoid it. The realistic baseline is fp8, and increasingly fp4, both of which have already taken a large bite out of the same headroom the entropy numbers were measuring. So the right question is not really how much we can compress bf16 weights: it is how much further we can compress weights that are already quantised to fp8 or fp4, and whether what is left is worth the cost of the decompressor.
We’ll show here that we can. (In fp8, on a consumer GPU. fp4 and other GPUs remain open questions.)
What should we compare against?
What we’re after is a memory-bound kernel that, given inputs we’ve compressed offline, runs at an effective bandwidth above the hardware’s effective peak HBM throughput. I’m running everything in this post on an RTX 4090, because that’s what I have under my desk. The question of whether the trick generalises to bigger GPUs is one I’ll come back to at the end.
The kernel we’ll use is a vector add. It’s the canonical example of a memory-bound workload: you load two operands from HBM, do one addition, and write the result back, which works out to three memory accesses for every floating-point operation (three bytes of traffic per add in fp8, six in bf16). That puts it deep into the memory-bound regime, where the arithmetic units spend almost all of their time waiting on bytes. (We’re interested in fp8 really, but I couldn’t find a simple fp8 vecadd kernel that could reasonably be called speed of light. The one I wrote hit ~900 GB/s. So bf16 seems fairer.)
import torch
# 1 GiB working set per tensor (bf16 = 2 bytes per element)
n = 512 * 1024 * 1024
a = torch.randn(n, dtype=torch.bfloat16, device="cuda")
b = torch.randn(n, dtype=torch.bfloat16, device="cuda")
c = torch.add(a, b)
The maximum achievable bandwidth is something like 920 GB/s, roughly 91 to 92% of the rated peak. (The 4090’s rated memory bandwidth is 1008 GB/s: GDDR6X on a 384-bit bus at 21 Gbps, per the Ada architecture whitepaper. On this chip the aten torch.add kernel gets ~922 GB/s, NVIDIA’s CCCL binary_transform, their hand-tuned device-wide elementwise primitive, gets ~927 GB/s, and memcpy gets ~900 GB/s. A lot of the variation is probably compounded by bad thermals on consumer GPUs, but 90% of peak being the max achievable is fairly typical.)
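To make the bandwidth arithmetic explicit, here is a small sketch of how a measured kernel time turns into an effective bandwidth figure for this vector add. The helper names are mine and the 3.5 ms timing is a hypothetical placeholder, not a measurement from this post; only the element count and the 1008 GB/s rated peak come from the text above.

```python
def vecadd_bytes_moved(n_elements: int, bytes_per_element: int) -> int:
    """Bytes of memory traffic for c = a + b: two operand loads plus one store."""
    return 3 * n_elements * bytes_per_element


def effective_bandwidth_gb_s(bytes_moved: int, seconds: float) -> float:
    """Effective bandwidth in GB/s (decimal gigabytes, matching vendor specs)."""
    return bytes_moved / seconds / 1e9


RATED_PEAK_GB_S = 1008.0           # rated peak quoted above for the RTX 4090
n = 512 * 1024 * 1024              # elements per tensor, as in the snippet above
total = vecadd_bytes_moved(n, 2)   # bf16 = 2 bytes per element, so 3 GiB in total

elapsed_s = 3.5e-3                 # hypothetical kernel time, for illustration only
bw = effective_bandwidth_gb_s(total, elapsed_s)
print(f"{bw:.0f} GB/s, {bw / RATED_PEAK_GB_S:.1%} of rated peak")
```

In a real run the elapsed time would come from timing the kernel on the GPU (for example with CUDA events around a warmed-up call), averaged over many iterations.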
For the operands, we want bytes that look like what a real inference kernel would be reading off HBM. So both operands are drawn from the empirical distribution of fp8 weights in Qwen3-14B-FP8, the same distribution measured in the entropy post.
The quantity we’re tracking is what I’ll call bandwidth amplification: the ratio of the time the raw kernel takes on uncompressed operands to the time the fused kernel takes on compressed operands, end to end. An amplification of 1 means the compressed path matched raw and we got nothing for our trouble, and anything above 1 is the result we’re after. (As a reminder, fp8 weights for typical models are about 15% compressible. So the best achievable bandwidth amplification factor is ≈ 1.15.)
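As a sketch of that definition, the two helpers below compute the measured amplification from a pair of timings and the idealised ceiling from byte counts. The names and example numbers are hypothetical; the ceiling assumes runtime is purely proportional to bytes moved and that decompression is free, which a real kernel will not achieve.

```python
def bandwidth_amplification(raw_seconds: float, fused_seconds: float) -> float:
    """Measured amplification: raw kernel time divided by fused (compressed) kernel time."""
    return raw_seconds / fused_seconds


def ideal_amplification(raw_bytes: float, compressed_bytes: float) -> float:
    """Ceiling if time scaled only with bytes moved and decompression cost nothing."""
    return raw_bytes / compressed_bytes


# Hypothetical timings in seconds, for illustration only.
measured = bandwidth_amplification(raw_seconds=3.50e-3, fused_seconds=3.20e-3)

# Hypothetical byte counts: a stream that shrinks from 115 bytes to 100.
ceiling = ideal_amplification(raw_bytes=115, compressed_bytes=100)

print(f"measured {measured:.3f}x, ceiling {ceiling:.2f}x")
```

The gap between the two numbers is the cost of the decompressor, which is what the rest of the post is trying to drive down.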
How to do parallelisable compression and decompression
We’ve written about lossless compression in this series before, in both the rANS and tANS flavours of asymmetric numeral systems. For the rest of this post we’ll work with tANS specifically, because ANS coders...