How to Optimize a CUDA Matmul Kernel for cuBLAS-Like Performance: A Worklog

Areibman1 pts0 comments

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog

Subscribe<br>Si_Boehm

Comments<br>(16)

December 2022

In this post, I’ll iteratively optimize an implementation of matrix multiplication written in CUDA.<br>My goal is not to build a cuBLAS replacement, but to deeply understand the most important performance characteristics of the GPUs that are used for modern deep learning.<br>This includes coalescing global memory accesses, shared memory caching and occupancy optimizations, among others.You can download the code for all kernels from Github. Also checkout wangzyon’s repo from which I copied the benchmarking setup. This post is less polished than my normal uploads, and includes many more sidenotes. I used it as notepad for ideas and scribbles while writing the kernels. That’s why I called it a worklog :)

Matrix multiplication on GPUs may currently be the most important algorithm that exists, considering it makes up almost all the FLOPs during the training and inference of large deep-learning models.<br>So how much work is it to write a performant CUDA SGEMMSGEMM performs C=αAB+βC at single (=32b) precision. from scratch?<br>I’ll start with a naive kernel and step-by-step apply optimizations until we get within 95% (on a good day) of the performance of cuBLAS (NVIDIA’s official matrix library):cuBLAS at FP32 that is. In my setting, doing the matmul using TF32 or BF16 precision allows cuBLAS to use the tensor cores, which increases FLOPS by 2.5x or 3.5x. I may look into tensor cores / warp matrix functions in a future post.

Kernel<br>GFLOPs/s<br>Performance relative to cuBLAS

1: Naive<br>309.0<br>1.3%

2: GMEM Coalescing<br>1986.5<br>8.5%

3: SMEM Caching<br>2980.3<br>12.8%

4: 1D Blocktiling<br>8474.7<br>36.5%

5: 2D Blocktiling<br>15971.7<br>68.7%

6: Vectorized Mem Access<br>18237.3<br>78.4%

9: Autotuning<br>19721.0<br>84.8%

10: Warptiling<br>21779.3<br>93.7%

0: cuBLAS<br>23249.6<br>100.0%

Kernel 1: Naive Implementation

In the CUDA programming model, computation is ordered in a three-level hierarchy.<br>Each invocation of a CUDA kernel creates a new grid, which consists of multiple blocks.<br>Each block consists of up to 1024 individual threads.These constants can be looked-up in the CUDA Programming guide.<br>Threads that are in the same block have access to the same shared memory region (SMEM).

The number of threads in a block can be configured using a variable normally called blockDim, which is a vector consisting of three ints.<br>The entries of that vector specify the sizes of blockDim.x, blockDim.y and blockDim.z, as visualized below:

Similarly, the number of blocks in a grid is configurable using the gridDim variable.<br>When we launch a new kernel from the hostIn accelerator lingo, host refers to the CPU and device is the accelerator, here the GPU., it creates a single grid, containing the blocks and threads as specified.From here on I’ll only be talking about 2D grids and blocks, partly because the 3D-structure is seldom used and because drawing in 3D is too hard.<br>It’s important to keep in mind that the thread hierarchy we just talked about mostly concerns program correctness.<br>For program performance, as we’ll see later, it’s not a good idea to treat all threads in the same block as equals.

For our first kernel, we’ll use the grid, block and thread hierarchy to assign each thread a unique entry in the result matrix C.<br>Then that thread will compute the dot product of the corresponding row of A and column of B, and write the result to C.<br>Due to each location of C being written to by only one thread, we have to do no synchronization.<br>We’ll launch the kernel like so:

// create as many blocks as necessary to map all of C<br>dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32), 1);<br>// 32 * 32 = 1024 thread per block<br>dim3 blockDim(32, 32, 1);<br>// launch the asynchronous execution of the kernel on the device<br>// The function call returns immediately on the host<br>sgemm_naivegridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);

CUDA code is written from a single-thread perspective.<br>In the code of the kernel, we access the blockIdx and threadIdx built-in variables.<br>These will return different values based on the thread that’s accessing them.In our example, threadIdx.x and threadIdx.y will vary from 0 to 31 based on the position of the thread in the grid. Same for blockIdx.x and blockIdx.y, which will vary from 0 to CEIL_DIV(N, 32) or CEIL_DIV(M, 32) based on the position of the thread’s block in the grid. We’ll do a lot of indexing into strided in-memory representations of matrices. Edward Yang’s post on PyTorch Internals contains a good explanation of strided tensors.

__global__ void sgemm_naive(int M, int N, int K, float alpha, const float *A,<br>const float *B, float beta, float *C) {<br>// compute position in C that this thread is responsible for<br>const uint x = blockIdx.x * blockDim.x + threadIdx.x;<br>const uint y = blockIdx.y * blockDim.y + threadIdx.y;

// `if` condition is necessary for when M...

kernel thread cuda cublas from blockdim

Related Articles