Pushing memory bound CUDA kernels past the speed of light with data compression

Pushing memory bound kernels beyond the speed of light with lossless decompression

26 May 2026 · 12 min read

In an earlier post on this blog we measured the entropy of the weights of many modern open-weight models. As a reminder: the Shannon entropy sets a lower bound on the average number of bits per symbol a lossless coder needs, and therefore a limit on how far a stream of bytes can be compressed.
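As a small refresher on that bound, here is a minimal sketch of the zeroth-order entropy of a byte stream. It treats bytes as independent draws, which is an assumption of this sketch, and the two toy streams are illustrative stand-ins rather than real weight data.

```python
import math
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Zeroth-order Shannon entropy of a byte stream, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy streams standing in for real weight bytes (illustrative only).
uniform = bytes(range(256)) * 4                     # every byte value equally likely
skewed = bytes([0] * 768 + [1] * 128 + [2] * 128)   # one value dominates

print(shannon_entropy_bits(uniform))  # 8.0: no headroom below 8 bits per byte
print(shannon_entropy_bits(skewed))   # far below 8: plenty of headroom
```

The gap between 8 bits and the measured entropy is the headroom an entropy coder can hope to recover under this independence assumption.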

We found real headroom between the bit-width the weights are stored at and the bit-width their actual values demand. But compressibility only starts to matter when something downstream runs faster, or fits somewhere it previously didn’t. So the question is operational: given that the weights can be compressed, what can we actually do if we compress them?

There are lots of things you could think of doing. Some ideas:

Faster weight loading. Pulling the weights off disk or across the network into GPU memory is the thread a previous post here pulled on. It gets more interesting on Blackwell, where the decompression engine is in hardware rather than a kernel you have to write yourself.

Cheaper collectives. Compress weights or activations in flight between GPUs, so the bytes crossing the interconnect are smaller than the bytes the model nominally operates on.

Memory-bound kernels. Keep the weights compressed in HBM and decompress them on the fly inside the kernel that consumes them, so the same arithmetic runs against fewer bytes read, and the kernel runs faster.

The first two seem likely to be possible. The last seems hard. In this post we’ll see if we can get any purchase on it.

To make it even harder: it’s no longer very interesting to build optimizations that only apply to bf16 weights. Nobody serves a large model in bf16 if they can avoid it. The realistic baseline is fp8, and increasingly fp4, both of which have already taken a large bite out of the same headroom the entropy numbers were measuring. So the right question is not really how much we can compress bf16 weights: it is how much further we can compress weights that are already quantised to fp8 or fp4, and whether what is left is worth the cost of the decompressor.

We’ll show here that we can. (In fp8, on a consumer GPU. fp4 and other GPUs remain open questions.)

What should we compare against?

What we’re after is a memory-bound kernel that, given inputs we’ve compressed offline, runs at an effective bandwidth above the hardware’s effective peak HBM throughput. I’m running everything in this post on an RTX 4090, because that’s what I have under my desk. The question of whether the trick generalises to bigger GPUs is one I’ll come back to at the end.

The kernel we’ll use is a vector add. It’s the canonical example of a memory-bound workload: you load two operands from HBM, do one addition, and write the result back, which works out to three memory accesses for every floating-point operation (three bytes per FLOP in fp8, six in bf16). That puts it deep into the memory-bound regime, where the arithmetic units spend almost all of their time waiting on bytes. (We’re interested in fp8 really, but I couldn’t find a simple fp8 vecadd kernel that could reasonably be called speed of light. The one I wrote hit ~900 GB/s. So bf16 seems fairer.)

    import torch

    # 1 GiB working set per tensor (bf16 = 2 bytes per element)
    n = 512 * 1024 * 1024
    a = torch.randn(n, dtype=torch.bfloat16, device="cuda")
    b = torch.randn(n, dtype=torch.bfloat16, device="cuda")
    c = torch.add(a, b)

The maximum achievable bandwidth is something like 920 GB/s, about 91% of the rated peak. (The 4090’s rated memory bandwidth is 1008 GB/s: GDDR6X on a 384-bit bus at 21 Gbps, per the Ada architecture whitepaper. On this chip the aten torch.add kernel gets ~922 GB/s, NVIDIA’s CCCL binary_transform, their hand-tuned device-wide elementwise primitive, gets ~927 GB/s, and memcpy gets ~900 GB/s. A lot of the variation is probably compounded by bad thermals on consumer GPUs, but 90% of peak being the max achievable is fairly typical.)
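The arithmetic behind those figures can be sketched in a few lines. The helper names below are made up for illustration, and the sketch assumes the usual accounting for a vector add: two operand reads plus one result write per element.

```python
def rated_bandwidth_gbps(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Rated DRAM bandwidth in GB/s: bus width * per-pin data rate / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


def vecadd_bytes_moved(n_elements: int, bytes_per_element: int) -> int:
    """Two operand reads plus one result write per element."""
    return 3 * n_elements * bytes_per_element


def achieved_bandwidth_gbps(bytes_moved: int, seconds: float) -> float:
    """Effective bandwidth in GB/s for a kernel that moved bytes_moved in seconds."""
    return bytes_moved / seconds / 1e9


peak = rated_bandwidth_gbps(384, 21.0)               # 384-bit bus at 21 Gbps
traffic = vecadd_bytes_moved(512 * 1024 * 1024, 2)   # bf16 operands: 3 GiB in total
t_at_922 = traffic / 922e9                           # seconds per add at ~922 GB/s

print(peak)                       # 1008.0
print(traffic)                    # 3221225472
print(round(t_at_922 * 1e3, 2))   # ~3.49 (milliseconds)
```

In practice the `seconds` argument would come from timing the kernel on the GPU, for example with a pair of `torch.cuda.Event(enable_timing=True)` markers, whose `elapsed_time` is reported in milliseconds.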

For the operands, we want bytes that look like what a real inference kernel would be reading off HBM. So both operands are drawn from the empirical distribution of FP8 weights in Qwen3-14B-FP8, the same distribution measured in the entropy post.
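One way to draw such operands is to sample bytes from a 256-bin histogram of the weight bytes. The sketch below assumes independent draws and uses a synthetic histogram that was NOT measured from any model; `sample_bytes` and `hypothetical_hist` are illustrative names, not part of the original code.

```python
import random


def sample_bytes(histogram: list[int], n: int, seed: int = 0) -> bytes:
    """Draw n bytes independently from a 256-bin histogram of byte counts."""
    if len(histogram) != 256:
        raise ValueError("expected one count per possible byte value")
    rng = random.Random(seed)
    return bytes(rng.choices(range(256), weights=histogram, k=n))


# Synthetic histogram for illustration only: four byte values carry all the mass.
hypothetical_hist = [0] * 256
for value, count in [(0x38, 50), (0x3A, 30), (0xB8, 15), (0xBA, 5)]:
    hypothetical_hist[value] = count

operand = sample_bytes(hypothetical_hist, 1024)
print(len(operand), sorted(set(operand)))
```

With a real histogram in place of the synthetic one, the sampled bytes would carry the same per-byte statistics as the measured weights, though not any correlation between neighbouring bytes.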

The quantity we’re tracking is what I’ll call bandwidth amplification: the ratio of the time the raw kernel takes on uncompressed operands to the time the fused kernel takes on compressed operands, end to end. An amplification of 1 means the compressed path matched raw and we got nothing for our trouble, and anything above 1 is the result we’re after. (As a reminder, fp8 weights for typical models are about 15% compressible, so the best achievable bandwidth amplification factor is ≈ 1.15.)
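As a minimal sketch of that definition, the ratio and the effective bandwidth it implies can be computed as follows. The function names are illustrative and the two timings are hypothetical placeholders, not measurements from this post.

```python
def bandwidth_amplification(t_raw_s: float, t_fused_s: float) -> float:
    """Raw-kernel time divided by fused compressed-kernel time."""
    return t_raw_s / t_fused_s


def effective_bandwidth_gbps(raw_bytes: int, t_fused_s: float) -> float:
    """Bandwidth the fused kernel appears to reach, counted in uncompressed bytes."""
    return raw_bytes / t_fused_s / 1e9


raw_bytes = 3 * 2**30   # uncompressed traffic of the bf16 vector add (3 GiB)
t_raw = 3.50e-3         # hypothetical raw-kernel time, in seconds
t_fused = 3.20e-3       # hypothetical fused-kernel time, in seconds

amp = bandwidth_amplification(t_raw, t_fused)
print(round(amp, 3))                                         # 1.094
print(round(effective_bandwidth_gbps(raw_bytes, t_fused)))   # ~1007 GB/s
```

Counting the uncompressed bytes against the fused kernel's time is what lets the effective figure exceed the rated peak while the physical memory traffic stays below it.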

How to do parallelisable compression and decompression

We’ve written about lossless compression in this series before, in both the rANS and tANS flavours of asymmetric numeral systems. For the rest of this post we’ll work with tANS specifically, because ANS coders...
