Pushing memory bound CUDA kernels past the speed of light with data compression



Pushing memory bound kernels beyond the speed of light with lossless decompression
26 May 2026 · 12 min read

In an earlier post on this blog we measured the entropy of the weights of a lot of modern open-weight models. As a reminder: the Shannon entropy sets a lower bound on the size to which a stream of bytes can be losslessly compressed.
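As a rough illustration of that bound, here is a minimal sketch of an empirical entropy measurement. It assumes an order-0 model (each byte treated as an independent draw), which may not be exactly what the earlier post measured, and the function name and toy byte stream are made up for the example.

```python
import math
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Empirical order-0 Shannon entropy, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical toy stream: four symbols with probabilities 1/2, 1/4, 1/8, 1/8.
toy = bytes([0] * 8 + [1] * 4 + [2] * 2 + [3] * 2)

h = shannon_entropy_bits(toy)      # 1.75 bits per byte
floor_bytes = h * len(toy) / 8     # smallest size an order-0 coder could reach
print(f"{h:.2f} bits/byte, floor of {floor_bytes:.1f} bytes for {len(toy)} stored")
```

Each byte is stored in 8 bits but carries only 1.75 bits of information here, and that gap is the kind of headroom the rest of the post tries to exploit.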

We found real headroom between the bit-width the weights are stored at and the bit-width their actual values demand. But compressibility only starts to matter when something downstream runs faster, or fits somewhere it previously didn’t. So the question is operational: given that the weights can be compressed, what can we actually do if we compress them?

There are lots of things you could think of doing. Some ideas:

Faster weight loading. Pulling the weights off disk or across the network into GPU memory is the thread a previous post here pulled on. It gets more interesting on Blackwell, where the decompression engine is in hardware rather than a kernel you have to write yourself.

Cheaper collectives. Compress weights or activations in flight between GPUs, so the bytes crossing the interconnect are smaller than the bytes the model nominally operates on.

Memory-bound kernels. Keep the weights compressed in HBM and decompress them on the fly inside the kernel that consumes them, so the same arithmetic runs against fewer bytes read, and the kernel runs faster.

The first two seem likely to be possible. The last seems hard. In this post we’ll see if we can get any purchase on it.

To make it even harder: it’s no longer very interesting to build optimizations that only apply to bf16 weights. Nobody serves a large model in bf16 if they can avoid it. The realistic baseline is fp8, and increasingly fp4, both of which have already taken a large bite out of the same headroom the entropy numbers were measuring. So the right question is not really how much we can compress bf16 weights: it is how much further we can compress weights that are already quantised to fp8 or fp4, and whether what is left is worth the cost of the decompressor.

We’ll show here that we can. (In fp8, on a consumer GPU. fp4 and other GPUs remain open questions.)

What should we compare against?

What we’re after is a memory-bound kernel that, given inputs we’ve compressed offline, runs at an effective bandwidth above the hardware’s effective peak HBM throughput. I’m running everything in this post on an RTX 4090, because that’s what I have under my desk. The question of whether the trick generalises to bigger GPUs is one I’ll come back to at the end.

The kernel we’ll use is a vector add. It’s the canonical example of a memory-bound workload: you load two operands from HBM, do one addition, and write the result back, which works out to three memory accesses for every floating-point operation (three bytes of traffic per operation in fp8, six in bf16). That puts it deep into the memory-bound regime, where the arithmetic units spend almost all of their time waiting on bytes. (We’re interested in fp8 really, but I couldn’t find a simple fp8 vecadd kernel that could reasonably be called speed of light. The one I wrote hit ~900 GB/s. So bf16 seems fairer.)

import torch

# 1 GiB working set per tensor (bf16 = 2 bytes per element)
n = 512 * 1024 * 1024
a = torch.randn(n, dtype=torch.bfloat16, device="cuda")
b = torch.randn(n, dtype=torch.bfloat16, device="cuda")
c = torch.add(a, b)

The maximum achievable bandwidth is something like 920 GB/s, about 92% of the rated peak. (The 4090’s rated memory bandwidth is 1008 GB/s: GDDR6X on a 384-bit bus at 21 Gbps, per the Ada architecture whitepaper. On this chip the aten torch.add kernel gets ~922 GB/s, NVIDIA’s CCCL binary_transform, their hand-tuned device-wide elementwise primitive, gets ~927 GB/s, and memcpy gets ~900 GB/s. A lot of the variation is probably compounded by bad thermals on consumer GPUs, but 90% of peak being the max achievable is fairly typical.)
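To make the memory-bound arithmetic concrete, here is a back-of-the-envelope sketch of the traffic per operation and the run-time floor it implies. It assumes every operand is read exactly once and the result written exactly once, and that GB/s means 10^9 bytes per second. The helper names are made up for the example.

```python
def bytes_per_flop(element_bytes: int, reads: int = 2, writes: int = 1) -> int:
    """Memory traffic per addition for an elementwise add."""
    return (reads + writes) * element_bytes


def time_floor_ms(n_elements: int, element_bytes: int, bandwidth_gb_s: float) -> float:
    """Lower bound on kernel time if memory traffic is the only cost."""
    traffic_bytes = bytes_per_flop(element_bytes) * n_elements
    return traffic_bytes / (bandwidth_gb_s * 1e9) * 1e3


n = 512 * 1024 * 1024  # same element count as the tensors above

print(bytes_per_flop(1))                     # fp8:  3 bytes per addition
print(bytes_per_flop(2))                     # bf16: 6 bytes per addition
print(f"{time_floor_ms(n, 2, 920):.2f} ms")  # bf16 add at ~920 GB/s: about 3.50 ms
```

Under these assumptions a raw bf16 add over these tensors cannot finish much faster than about 3.5 ms on this card, so any further speedup has to come from moving fewer bytes.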

For the operands, we want bytes that look like what a real inference kernel would be reading off HBM. So both operands are drawn from the empirical distribution of FP8 weights in Qwen3-14B-FP8, the same distribution measured in the entropy post.
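One simple way to draw operands from an empirical distribution is to sample bytes independently from a histogram of reference bytes. The sketch below does that with the standard library. It is an assumption about the approach rather than the code used for this post, the `reference` bytes are a tiny made-up stand-in for real FP8 weights, and independent sampling discards any correlation between neighbouring weights.

```python
import random
from collections import Counter


def sample_like(reference: bytes, n: int, seed: int = 0) -> bytes:
    """Draw n bytes i.i.d. from the empirical byte histogram of `reference`."""
    counts = Counter(reference)
    symbols = list(counts.keys())
    weights = list(counts.values())
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return bytes(rng.choices(symbols, weights=weights, k=n))


# Hypothetical stand-in for bytes read out of a real FP8 checkpoint.
reference = bytes([0x38] * 6 + [0xB8] * 3 + [0x40] * 1)

operand = sample_like(reference, 1000)
print(len(operand), sorted(set(operand)))
```

Operands generated this way have the same order-0 entropy as the reference in expectation, which is the property the compression ratio depends on.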

The quantity we’re tracking is what I’ll call bandwidth amplification: the ratio of the time the raw kernel takes on uncompressed operands to the time the fused kernel takes on compressed operands, end to end. An amplification of 1 means the compressed path matched raw and we got nothing for our trouble, and anything above 1 is the result we’re after. (As a reminder, fp8 weights for typical models are about 15% compressible. So the best achievable bandwidth amplification factor is ≈ 1.15.)
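The definition is a plain ratio of timings, and it can be written down directly. In the sketch below the function names are made up, "effective bandwidth" is taken to mean the uncompressed bytes the kernel nominally processes divided by the measured time, and the two timings are placeholder numbers chosen for the arithmetic, not measurements.

```python
def bandwidth_amplification(t_raw_s: float, t_fused_s: float) -> float:
    """Time of the raw kernel divided by time of the fused compressed kernel."""
    return t_raw_s / t_fused_s


def effective_bandwidth_gb_s(nominal_bytes: int, seconds: float) -> float:
    """Uncompressed bytes the kernel nominally touches, per second, in GB/s."""
    return nominal_bytes / seconds / 1e9


# Placeholder timings, NOT measurements: 3.5 ms raw, 3.2 ms fused.
t_raw, t_fused = 3.5e-3, 3.2e-3
nominal = 3 * 1024**3  # three 1 GiB tensors: two reads and one write

print(f"amplification: {bandwidth_amplification(t_raw, t_fused):.3f}")
print(f"effective bandwidth: {effective_bandwidth_gb_s(nominal, t_fused):.0f} GB/s")
```

An amplification above 1 corresponds to an effective bandwidth above what the raw kernel achieves, even though the physical bytes moved per second never exceed the hardware limit.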

How to do parallelisable compression and decompression

We’ve written about lossless compression in this series before, in both the rANS and tANS flavours of asymmetric numeral systems. For the rest of this post we’ll work with tANS specifically, because ANS coders...
