2678x Faster Matrix Multiplication with a GPU
Sign in
Subscribe
In the previous blog post, I teased how GPUs can speed up matrix multiplication. However, I introduced the basics of GPU programming using a simple vector addition example, which is perfect for introducing parallel programming. In this blog post, let's perform a parallel matrix multiplication on a GPU and answer three simple questions.<br>What is matrix multiplication?<br>What is the computational complexity of matrix multiplication, and how can GPUs help reduce the computational cost?<br>How can we use the CUDA programming model for a 2D dataset?<br>Introduction<br>Matrix Multiplication Algorithm<br>Matrix multiplication is a binary operation in linear algebra that produces a new matrix from two input matrices. When a matrix multiplication is performed, each element of the output matrix \(C\) is an inner product of a row of one input matrix \(A\) and a column of the other input matrix \(B\).<br>💡<br>Inner product of two vectors \(a\) and \(b\) (of length \(n\)), is given as \(\sum_{i=1}^n a_i \cdot b_i\). In the case of matrix multiplication, \(a\) would be a row of matrix \(A\), and \(b\) would be a column of matrix \(B\).
For simplicity, let's assume that \(N=M=P\). There are four main steps to sequential matrix multiplication:<br>Loop over every row of the matrix \(A\).<br>Loop over every column of the matrix \(B\).<br>Loop over all the elements in the selected row of \(A\) and column of \(B\). It is important to note that the number of elements in the two must be the same (that's essential for matrix multiplication to be valid).<br>Multiplication between the two elements is performed, and the result is added in every step.<br>void sq_mat_mul_cpu(float* A, float* B, float* C, int N)<br>// 1. Loop over the rows<br>for (int i = 0; i There are three loops in this algorithm and a total of \(N^2\) elements in matrix \(C\), where for each element, we need to perform around \(N\) multiplications and additions each. Hence, the total complexity is \(\approx N^2 \cdot (N+N) = \mathcal{O} (N^3)\). This means that with increasing \(N\), the computational cost increases drastically (cubic increase). However, if you analyze the algorithm carefully. You can see that the evaluation of the elements of the output matrix \(C\) does not depend on each other. In other words, we can execute the two loops (steps 1. and 2.) in parallel. GPUs are specially adapted to these types of computations (where \(N\) can be quite large) because the hardware contains 1000s of cores for parallel execution.<br>CPU vs GPU Hardware<br>CPUs and GPUs have execution units (called cores) that perform the arithmetic operations and an off-chip memory unit (RAM and VRAM, respectively) to store the required data. The big difference is that a CPU has much fewer cores than a GPU.<br>💡<br>CPU architecture is far more complicated than this, but to keep things simple, I decided not to discuss other components like cache, control unit, etc. The same applies to GPUs, but I will cover the details related to GPU hardware in a future blog post.
A GPU can't function independently. It's the job of a CPU to move data between RAM and VRAM. In other words, a CPU can be seen as an instructor who manages most of the tasks and is responsible for assigning specific tasks to the GPU (where it has an advantage).<br>Parallel Matrix Multiplication<br>CUDA is a heterogeneous computing platform developed by NVIDIA. CUDA C extends C programming language with minimal new syntax and library functions to allow programs to run on GPU and CPU cores. The structure of a CUDA C program reflects the presence of a CPU and one or more GPUs. It is such that the execution starts with the host, and the host assigns very specialized tasks to the device.<br>💡<br>Threads are at the heart of modern computing. A thread is a simplified view of how a processor executes a sequential program in modern computers. Adhering to the worker analogy, a thread can be seen as an individual worker, and the execution of a thread is sequential as far as a user is concerned (i.e., a worker can only do one task at a time).
When the program starts, there is a single CPU thread that moves data from RAM to VRAM. It then calls a kernel function (which defines the executions to be performed on the GPU), and multiple threads are generated on the GPU. Each thread computes the same kernel function in parallel and stores the results in VRAM. The CPU thread then fetches the results from VRAM and puts it back into RAM, where the user can either print the results or perform further operations.<br>The very first step is to define matrices A and B. For simplicity, I'm only dealing with square matrices of size N, and dynamically allocating them (with elements assigned random values).<br>#include<br>#include
#define MAX_NUM 10<br>#define MIN_NUM -10
int main(int argc, char const *argv[])<br>// Generate NxN square matrices A and B<br>int N = 100;<br>float* A = (float*)malloc(N*N*sizeof(float));<br>float* B =...