Matrix Multiplication on Blackwell


August 28, 2025

Matrix Multiplication on Blackwell: Part 1 - Introduction

Ali Taha

Jiexiang Liu

Hengjie Wang


🛠️ Code for all kernels mentioned in this series is available on GitHub.

This series of blog posts will showcase how one can (1) write a high-performance GPU kernel on Blackwell that is competitive with NVIDIA's cuBLAS implementation, and (2) leverage Mojo's special features to keep that kernel as simple as possible.

We hope this series serves as a reference worklog for NVIDIA's Blackwell GPUs. While there is prior work on optimizing for NVIDIA's Ampere and Hopper generations of GPUs, no comparable blueprint currently exists for Blackwell.

In Part 1 (this blog post) we cover what a matrix multiplication (matmul) is, why it matters for LLMs, and why we need to optimize it. Then we explain what a GPU is, how GPUs have evolved since Ampere, and finally how to write a simple (not especially performant) implementation of matmul on a GPU in 4 lines of Mojo.

In Part 2, we'll explain the hardware instructions introduced in Blackwell GPUs and continue to improve our kernel's performance by taking advantage of them. As the series progresses, we will incrementally leverage new Blackwell features to improve our matmul implementation, until by the end of the series we achieve performance that surpasses that of NVIDIA's cuBLAS library.

Performance at a glance

What is matmul?

Given two dense matrices A and B of dimensions MxK and KxN respectively, we want to compute the matrix multiplication C = A.B, which is defined by

Mojo

for row in range(M):
    for col in range(N):
        C[row][col] = 0
        for inner in range(K):
            C[row][col] += A[row][inner] * B[inner][col]

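For readers who want to run the definition directly, here is a minimal sketch in plain Python (the pseudocode above is already valid Python syntax). The function name `matmul` and the list-of-lists matrix representation are assumptions made for illustration; this is not the Mojo kernel developed in this series.

```python
def matmul(A, B):
    """Naive matmul of an M x K matrix A by a K x N matrix B (lists of lists)."""
    M, K, N = len(A), len(B), len(B[0])
    # Every row of A must have exactly K entries to match the rows of B.
    assert all(len(row) == K for row in A)

    C = [[0] * N for _ in range(M)]  # M x N output, zero-initialized
    for row in range(M):
        for col in range(N):
            for inner in range(K):
                C[row][col] += A[row][inner] * B[inner][col]
    return C


A = [[1, 2, 3],
     [4, 5, 6]]        # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]         # 3 x 2

print(matmul(A, B))    # [[58, 64], [139, 154]]
```

The three nested loops perform M*N*K multiply-accumulate steps in total, which is why the cost grows so quickly with matrix size.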

Since matrix multiplication is a core part of linear algebra and appears in many areas, there has been extensive research on efficient algorithms for it. Readers interested in a deeper background on matmul are encouraged to read our blog post from 2 years ago.

Why does matmul matter today?

All LLMs, be it Meta's Llama, Alibaba's Qwen, DeepSeek, Anthropic's Claude, OpenAI's ChatGPT, or Google's Gemini, use matrix multiplications at their core. These matmuls appear under several names. For instance, the Multi-Layer Perceptron (MLP), which is sometimes called the Linear layer, is an A.B^T matmul operation. The same is true for Attention, Latent Attention, Mixture of Experts, and so on. In fact, if we look at a profile of the Llama 8B model using FP8 on 2xB200, we observe that over 83% of the model's runtime is spent executing some variant of matmul (e.g. linear, attention, and MLP layers).
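To get a feel for what an 83% matmul share implies for end-to-end performance, the rough estimate below applies Amdahl's law in plain Python. The function name `end_to_end_speedup` is hypothetical, and reading "a 10% improvement" as a 1.10x speedup of the matmul portion is an assumption made for this sketch, not a figure taken from the profile.

```python
def end_to_end_speedup(matmul_fraction, matmul_speedup):
    """Amdahl's law: overall speedup when only one part of the runtime gets faster.

    matmul_fraction: share of total runtime spent in matmul (0.83 in the profile above)
    matmul_speedup:  how much faster the matmul portion becomes (1.10 = 10% faster)
    """
    other = 1.0 - matmul_fraction                # runtime share that does not change
    new_runtime = other + matmul_fraction / matmul_speedup
    return 1.0 / new_runtime


# Assumed reading of "10% faster matmul": the matmul portion runs 1.10x faster.
print(f"{end_to_end_speedup(0.83, 1.10):.3f}x")  # roughly 1.08x end to end

# Upper bound: even infinitely fast matmul cannot beat 1 / (1 - 0.83).
print(f"{1.0 / (1.0 - 0.83):.2f}x")
```

The second number is the ceiling set by the remaining 17% of runtime that matmul optimizations do not touch.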

matmul makes up more than 80% of Llama 8B execution

As a result, even a 10% improvement in matmul performance yields around 8% end-to-end speedup. For companies spending hundreds of millions of dollars on serving, these optimizations translate directly to millions of dollars in savings.

Why do we care about GPUs?

We will motivate the value of GPUs by looking at matmul. For simplicity, let's assume both A and B matrices are square, such that:

Simplified illustration of matrix multiplication

If we want to do the matmul on a CPU, here's the pseudocode we'll have to write:

mojo

# result is assumed to start out filled with zeros
for row in range(M):
    for col in range(N):
        for inner in range(K):
            result[row][col] += A[row][inner] * B[inner][col]


Essentially, the code calculates the inner product across the K dimension for each output element.
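The inner-product view can be made explicit with a short sketch in plain Python. The helper names `inner_product` and `matmul_any_order` are hypothetical. The point of the sketch is that each output element depends only on one row of A and one column of B, so the elements can be computed in any order (here, a shuffled one) and still give the same result. That independence is what makes the problem a natural fit for parallel hardware.

```python
import random


def inner_product(A, B, row, col):
    """Inner product of row `row` of A with column `col` of B, across the K dimension."""
    K = len(B)
    return sum(A[row][inner] * B[inner][col] for inner in range(K))


def matmul_any_order(A, B, seed=0):
    """Compute C = A.B one output element at a time, visiting elements in shuffled order."""
    M, N = len(A), len(B[0])
    cells = [(row, col) for row in range(M) for col in range(N)]
    random.Random(seed).shuffle(cells)   # the order of the work items does not matter
    C = [[None] * N for _ in range(M)]
    for row, col in cells:
        # No output element reads another output element, so each is independent.
        C[row][col] = inner_product(A, B, row, col)
    return C


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]

print(matmul_any_order(A, B))            # [[58, 64], [139, 154]]
```

On a GPU, each `(row, col)` pair could in principle be handed to its own thread, although this sequential sketch only illustrates the independence and not an actual parallel launch.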

Matrix multiplication as a collection of inner products

CPUs top out at a few hundred cores at most (lower-end CPUs have around 32 cores, and high-end models have on the order of 128). GPUs, on the other hand, offer massive parallelism: modern GPUs handle over 100,000 threads simultaneously (B200s can handle up to 151,552 threads), making them the ideal hardware choice for repetitive, data-parallel operations like matrix multiplication.

To accelerate things even further, recent GPUs (since Volta) have dedicated fast hardware units for matrix multiply-accumulate (MMA) operations, called Tensor Cores. While Tensor Cores were originally limited to small matmuls (on the order of 16x16x16), the 5th-generation Tensor Cores introduced in Blackwell can perform a large sub-matrix multiplication (up to 256x256x16). This enables Blackwell GPUs to increase their peak computation throughput.

GPU from the hardware architect's perspective

To understand GPUs better, let's look at how the GPU is organized from a hardware architecture perspective. A GPU, like other Von Neumann architectures, is composed of elements that compute (commonly known as Arithmetic Logic Units, or ALUs) and elements that load/store the data for these computations. A GPU will contain several Streaming Multiprocessors (SMs), an L2 cache, and global memory which are...
