Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data



Thonk From First Principles


Great minds discuss flops per watt.

Horace He
Apr 29, 2024


It’s 2022. I check out this cool new project, CUTLASS, with very fast matmuls. I take a large matmul, 8192 x 8192 x 8192, and benchmark it in PyTorch, which calls CuBLAS.

    python mm_bench.py
    > CuBLAS: 258 Teraflops

Not bad, 83% flop utilization. Now let’s check out CUTLASS’s performance using their profiler.

    ./cutlass_profiler --operation=Gemm --m=8192 --n=8192 --k=8192
    > CUTLASS: 288 Teraflops

!!! 10% higher perf? That’s incredible. CuBLAS is highly optimized for large compute-bound matmuls, and somehow CUTLASS + autotuning is outperforming it by 10%? We gotta start using these matmuls yesterday.

The next step is to bind the CUTLASS kernels into Python and compare against CuBLAS using my previous script.

    python cutlass_mm_bench.py
    > CuBLAS: 258 Teraflops
    > CUTLASS: 257 Teraflops

Somehow, in the light of Python, all of CUTLASS’s performance gains disappear. This in and of itself is not shocking: it’s notoriously difficult to ensure consistent benchmarking across setups.

I tediously ablate the two benchmark scripts, until finally, I find that CUTLASS’s profiler, by default, actually initializes the values in a fairly strange way: it only initializes the inputs with integers. Confused about whether this matters, I try:

    zero_inputs = torch.zeros(N, N)
    randn_inputs = torch.randn(N, N)
    benchmark(zero_inputs) # 295 Teraflops
    benchmark(randn_inputs) # 257 Teraflops

What? How could the values of the matrix affect the runtime of the model? I know Nvidia has some weird data compression thing on A100s, but I wouldn’t have expected that to be on in matmuls. Let’s try some other data distributions, like a uniform distribution [0,1].
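The post does not show the body of `benchmark`, so the following is only a minimal sketch of what such a helper might look like. The function names, the warmup and iteration counts, the choice of bfloat16, and the square-matrix restriction are all assumptions made for illustration, not the author's script. The Teraflops figure is derived from the usual count of 2·M·N·K floating-point operations for an (M x K) by (K x N) matmul.

```python
import time


def matmul_flops(m, n, k):
    """FLOPs in an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def to_tflops(flops, seconds):
    """Convert an operation count and a duration into Teraflops."""
    return flops / seconds / 1e12


def benchmark(inputs, iters=20, warmup=5):
    """Time inputs @ inputs for a square matrix and return achieved Teraflops.

    Hypothetical helper; not the benchmark used in the post.
    """
    import torch  # imported lazily so the arithmetic helpers work without PyTorch

    n = inputs.shape[0]
    on_gpu = inputs.is_cuda
    for _ in range(warmup):
        torch.mm(inputs, inputs)
    if on_gpu:
        torch.cuda.synchronize()  # make sure warmup kernels have finished
    start = time.perf_counter()
    for _ in range(iters):
        torch.mm(inputs, inputs)
    if on_gpu:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    seconds_per_matmul = (time.perf_counter() - start) / iters
    return to_tflops(matmul_flops(n, n, n), seconds_per_matmul)


def main():
    import torch

    on_gpu = torch.cuda.is_available()
    device = "cuda" if on_gpu else "cpu"
    dtype = torch.bfloat16 if on_gpu else torch.float32  # assumed dtype
    n = 8192 if on_gpu else 1024  # smaller size so a CPU run stays quick
    for name, make in [("zeros", torch.zeros), ("randn", torch.randn), ("uniform", torch.rand)]:
        x = make(n, n, device=device, dtype=dtype)
        print(f"{name}: {benchmark(x):.1f} Teraflops")
```

Calling `main()` on a machine with a CUDA GPU prints one line per input distribution. Whether the gap between zeros and random inputs shows up will depend on the GPU, its power limit, and the dtype.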

This was … confusing, to say the least. Somehow, the actual content of the tensors being multiplied is leading to different matmul performance.

There certainly are cases where the runtime depends on the content of the tensor: indirect indexing (e.g. A[b]), or things like sparsity.

But matrix multiplications have nothing like that at all! No matter what the matrix contains, the matrix multiplication kernel will 1. perform the same number of computations, 2. perform the same computations in the same order, 3. access the same memory addresses, and 4. access the same memory addresses in the same order.

Nowhere did my mental model of matrix multiplications and GPU hardware allow for the values in the matrix to influence matmul performance. And yet, here we are.

As it turns out, the culprit is ... dynamic/switching power in semiconductors!
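The four invariants listed above can be checked on a toy example. The sketch below is a naive pure-Python matmul, not a GPU kernel, and `traced_matmul` is a hypothetical helper that records which indices are read at each step. The recorded sequence is the same whether the inputs are all zeros or random values.

```python
import random


def traced_matmul(a, b):
    """Naive matmul that also records every (row, col) pair it reads, in order."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    trace = []
    for i in range(n):
        for j in range(m):
            for p in range(k):
                trace.append(((i, p), (p, j)))  # reads a[i][p] and b[p][j]
                out[i][j] += a[i][p] * b[p][j]
    return out, trace


def make_matrix(n, fill):
    """Build an n x n matrix whose entries come from calling fill()."""
    return [[fill() for _ in range(n)] for _ in range(n)]


n = 4
zeros = make_matrix(n, lambda: 0.0)
rand = make_matrix(n, lambda: random.gauss(0.0, 1.0))

_, trace_zeros = traced_matmul(zeros, zeros)
_, trace_rand = traced_matmul(rand, rand)

# Same number of steps, same reads, same order, regardless of the values.
print(trace_zeros == trace_rand, len(trace_zeros))  # prints: True 64
```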

Power Usage in Semiconductors

An Nvidia A100 GPU has a power limit of 400W. However, as the phrase “power limit” may hint, the GPU doesn’t always use all 400W. For example, when the GPU is fully idle, nvidia-smi tells me that it’s only pulling 88W of power.

But when the GPU is running under load, that power usage will spike considerably, typically to around the power limit.
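Readings like the idle and under-load power figures above can be polled from a script. This is a minimal sketch that shells out to `nvidia-smi` with its `--query-gpu` option. The helper names and the dictionary keys are made up for illustration, and the parser assumes every queried field comes back as a plain number (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import subprocess

# Fields to request from nvidia-smi: current draw (W), power limit (W), SM clock (MHz).
QUERY = "power.draw,power.limit,clocks.sm"


def parse_line(line):
    """Parse one CSV line such as '88.00, 400.00, 210' into a dict of floats."""
    power_draw, power_limit, sm_clock = (float(field) for field in line.split(","))
    return {
        "power_draw_w": power_draw,
        "power_limit_w": power_limit,
        "sm_clock_mhz": sm_clock,
        "headroom_w": power_limit - power_draw,  # how far below the limit we are
    }


def read_gpu_power():
    """Return one dict per GPU; requires nvidia-smi to be on the PATH."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_line(line) for line in result.stdout.strip().splitlines()]


# Illustrative input only: 88 W and 400 W are the figures quoted in the post,
# while the 210 MHz clock is a made-up placeholder value.
sample = parse_line("88.00, 400.00, 210")
print(sample["headroom_w"])
```

Calling `read_gpu_power()` in a loop while a matmul benchmark runs in another process is one way to watch the draw approach the limit and the SM clock change.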

In order to stay under the power limit, a component called the Voltage Regulator Module reduces the voltage supplied to the GPU, which throttles the clock frequency and reduces its performance.

In other words, if our GPU ends up using enough power to hit the power limit, our performance will become capped.

Most of us take it for granted that “GPU does something, power consumption goes up”. But there are actually two distinct mechanisms through which power gets consumed.

Figure: Dynamic/switching power on the left, static/leakage power on the right. Taken from https://semiengineering.com/knowledge_centers/low-power/low-power-design/power-consumption/

The first one is static/leakage power. You can think of this as the power that inevitably gets lost by just flowing power through your circuits. The amount of static power used is proportional to the amount of silicon that is powered. As GPUs don’t do much power gating, this is essentially the amount of power used at idle (the 88W mentioned above).

However, the second one, dynamic (or switching) power, is the culprit. Specifically, a small amount of power is consumed whenever a transistor switches states. If the transistor never needs to switch states, it doesn’t consume any extra power. On the other hand, if it’s rapidly flipping, then it consumes a ton of dynamic/switching power. Multiply that by the billions of transistors in your GPU, and you get the overall increase in power consumption.
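The usual textbook approximation for dynamic power is P ≈ αCV²f, where α is the fraction of gates that switch per cycle. As a rough intuition for how input data could change α, the sketch below counts how many bits differ between consecutive values in a stream, using their 32-bit float encodings. This is only a toy proxy under strong assumptions: a real GPU datapath holds many intermediate values, and `count_bit_flips` is a hypothetical helper, not a measurement of actual transistor activity.

```python
import random
import struct


def float_bits(x):
    """Return the 32-bit IEEE-754 encoding of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def count_bit_flips(values):
    """Total number of bits that change between consecutive values in the stream."""
    flips = 0
    prev = float_bits(values[0])
    for value in values[1:]:
        cur = float_bits(value)
        flips += bin(prev ^ cur).count("1")  # Hamming distance between encodings
        prev = cur
    return flips


random.seed(0)
n = 1000
zeros = [0.0] * n
randn = [random.gauss(0.0, 1.0) for _ in range(n)]

# A stream of identical values never toggles a bit; random values toggle many.
print("zeros:", count_bit_flips(zeros))
print("randn:", count_bit_flips(randn))
```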

In other words, the reason why matrix multiplications are faster when passed zeros is that this reduces the “flipping” of enough transistors in the chip to stay under the power...
