How to Optimize a CUDA Matmul Kernel for cuBLAS-Like Performance: A Worklog

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog

Subscribe Si_Boehm

Comments (16)

December 2022

In this post, I’ll iteratively optimize an implementation of matrix multiplication written in CUDA. My goal is not to build a cuBLAS replacement, but to deeply understand the most important performance characteristics of the GPUs that are used for modern deep learning. This includes coalescing global memory accesses, shared memory caching and occupancy optimizations, among others.You can download the code for all kernels from Github. Also checkout wangzyon’s repo from which I copied the benchmarking setup. This post is less polished than my normal uploads, and includes many more sidenotes. I used it as notepad for ideas and scribbles while writing the kernels. That’s why I called it a worklog :)

Matrix multiplication on GPUs may currently be the most important algorithm that exists, considering it makes up almost all the FLOPs during the training and inference of large deep-learning models. So how much work is it to write a performant CUDA SGEMMSGEMM performs C=αAB+βC at single (=32b) precision. from scratch? I’ll start with a naive kernel and step-by-step apply optimizations until we get within 95% (on a good day) of the performance of cuBLAS (NVIDIA’s official matrix library):cuBLAS at FP32 that is. In my setting, doing the matmul using TF32 or BF16 precision allows cuBLAS to use the tensor cores, which increases FLOPS by 2.5x or 3.5x. I may look into tensor cores / warp matrix functions in a future post.

Kernel GFLOPs/s Performance relative to cuBLAS

1: Naive 309.0 1.3%

2: GMEM Coalescing 1986.5 8.5%

3: SMEM Caching 2980.3 12.8%

4: 1D Blocktiling 8474.7 36.5%

5: 2D Blocktiling 15971.7 68.7%

6: Vectorized Mem Access 18237.3 78.4%

9: Autotuning 19721.0 84.8%

10: Warptiling 21779.3 93.7%

0: cuBLAS 23249.6 100.0%

Kernel 1: Naive Implementation

In the CUDA programming model, computation is ordered in a three-level hierarchy. Each invocation of a CUDA kernel creates a new grid, which consists of multiple blocks. Each block consists of up to 1024 individual threads.These constants can be looked-up in the CUDA Programming guide. Threads that are in the same block have access to the same shared memory region (SMEM).

The number of threads in a block can be configured using a variable normally called blockDim, which is a vector consisting of three ints. The entries of that vector specify the sizes of blockDim.x, blockDim.y and blockDim.z, as visualized below:

Similarly, the number of blocks in a grid is configurable using the gridDim variable. When we launch a new kernel from the hostIn accelerator lingo, host refers to the CPU and device is the accelerator, here the GPU., it creates a single grid, containing the blocks and threads as specified.From here on I’ll only be talking about 2D grids and blocks, partly because the 3D-structure is seldom used and because drawing in 3D is too hard. It’s important to keep in mind that the thread hierarchy we just talked about mostly concerns program correctness. For program performance, as we’ll see later, it’s not a good idea to treat all threads in the same block as equals.

For our first kernel, we’ll use the grid, block and thread hierarchy to assign each thread a unique entry in the result matrix C. Then that thread will compute the dot product of the corresponding row of A and column of B, and write the result to C. Due to each location of C being written to by only one thread, we have to do no synchronization. We’ll launch the kernel like so:

// create as many blocks as necessary to map all of C dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32), 1); // 32 * 32 = 1024 thread per block dim3 blockDim(32, 32, 1); // launch the asynchronous execution of the kernel on the device // The function call returns immediately on the host sgemm_naivegridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);

CUDA code is written from a single-thread perspective. In the code of the kernel, we access the blockIdx and threadIdx built-in variables. These will return different values based on the thread that’s accessing them.In our example, threadIdx.x and threadIdx.y will vary from 0 to 31 based on the position of the thread in the grid. Same for blockIdx.x and blockIdx.y, which will vary from 0 to CEIL_DIV(N, 32) or CEIL_DIV(M, 32) based on the position of the thread’s block in the grid. We’ll do a lot of indexing into strided in-memory representations of matrices. Edward Yang’s post on PyTorch Internals contains a good explanation of strided tensors.

__global__ void sgemm_naive(int M, int N, int K, float alpha, const float *A, const float *B, float beta, float *C) { // compute position in C that this thread is responsible for const uint x = blockIdx.x * blockDim.x + threadIdx.x; const uint y = blockIdx.y * blockDim.y + threadIdx.y;

// `if` condition is necessary for when M...

How to Optimize a CUDA Matmul Kernel for cuBLAS-Like Performance: A Worklog

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

It's Not Just X. It's Y

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy