Cutlass Tutorial: Efficient GEMM Kernel Designs with Pipelining

CUTLASS Tutorial: Efficient GEMM kernel designs with Pipelining - Colfax Research

Welcome to Part 2 of our tutorial series on GEMM (GEneral Matrix Multiplication). In Part 1, we discussed the computational side of GEMM by going over WGMMA, which is the primitive instruction to multiply small matrix tiles on GPUs based on the NVIDIA® Hopper™ architecture. In this part, we turn our focus to the memory side of GEMM. Specifically, we will explain how to efficiently bring small tiles of operand tensors from a GPU’s global memory into its on-chip memory, from where they can be passed into WGMMA (or other primitive MMA instructions, for that matter).

The main concept to explain is how to orchestrate a pipeline of data in order to efficiently feed the tensor cores. In the context of GEMM kernel design, pipelining refers to the idea of overlapping copy and MMA operations through maintaining multiple data buffers. In this article, we will cover two pipelining strategies that are effective on the Hopper architecture:

Warp-specialization. Specializing warps into producers (data transfer) and consumers (compute), and having them run concurrently.

Multistage. Masking data transfer by using asynchronous copy (TMA on Hopper or cp.async on Ampere) to load the next set of data, while computing on the current set. Warps take on both producer and consumer roles.

To then ensure correctness of the kernel, one needs to pay careful attention to the data dependencies at hand, which govern when buffers can be read by the MMA instructions or filled by the copy operations. We will go into detail on how to write the necessary synchronization logic for a pipelined GEMM kernel using tools from the CUTLASS library, most notably the CUTLASS Pipeline classes.
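The buffer dependencies described above can be illustrated with a small host-side model. This is only a sketch in plain C++, not CUTLASS or device code: the names `StageBuffer` and `pipelined_dot` are hypothetical, the "copy" is an ordinary vector assignment rather than TMA or cp.async, and a dot product over tiles stands in for the MMA. What it tries to show is the multistage loop shape: prefetch a few tiles, then on every iteration issue the next load before consuming the current tile, with a per-buffer flag recording whether the buffer may be read or overwritten.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One slot of the ring of staging buffers (a stand-in for shared memory).
struct StageBuffer {
    std::vector<float> a, b;  // staged tile of each operand
    bool full = false;        // true: filled, safe to read; false: free to overwrite
};

// Sums dot(a_tiles[k], b_tiles[k]) over all tiles while keeping up to
// (stages - 1) tiles prefetched ahead of the tile being consumed.
// Assumes stages >= 2 and that matching tiles have equal length.
float pipelined_dot(const std::vector<std::vector<float>>& a_tiles,
                    const std::vector<std::vector<float>>& b_tiles,
                    std::size_t stages) {
    assert(stages >= 2);
    assert(a_tiles.size() == b_tiles.size());
    const std::size_t num_tiles = a_tiles.size();
    std::vector<StageBuffer> ring(stages);
    std::size_t next_load = 0;

    // Producer step: a buffer may only be filled after the consumer released it.
    auto load = [&](std::size_t tile) {
        StageBuffer& buf = ring[tile % stages];
        assert(!buf.full);
        assert(a_tiles[tile].size() == b_tiles[tile].size());
        buf.a = a_tiles[tile];
        buf.b = b_tiles[tile];
        buf.full = true;
    };

    // Prologue: prefetch the first (stages - 1) tiles.
    while (next_load < num_tiles && next_load + 1 < stages) {
        load(next_load++);
    }

    float acc = 0.0f;
    for (std::size_t tile = 0; tile < num_tiles; ++tile) {
        // Issue the next load first; on a GPU this copy would be asynchronous
        // and would overlap with the math on the current tile.
        if (next_load < num_tiles) {
            load(next_load++);
        }
        // Consumer step: a buffer may only be read once it has been filled.
        StageBuffer& buf = ring[tile % stages];
        assert(buf.full);
        for (std::size_t i = 0; i < buf.a.size(); ++i) {
            acc += buf.a[i] * buf.b[i];
        }
        buf.full = false;  // release the buffer back to the producer
    }
    return acc;
}
```

In this model the two assertions on `full` play the role that synchronization plays in a real kernel: there, the producer and consumer would wait on a barrier rather than assert. The result does not depend on `stages`; only the amount of prefetching does.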

We then present a performance evaluation of pipelining and show how exploiting this one optimization idea already achieves ~65% utilization for a Hopper GEMM kernel in half-precision. Finally, in the Appendix we explain how to write a pipelined GEMM kernel for GPUs based on the NVIDIA Ampere architecture.

The big picture: "Feeding the beast"

There are two main actions in a GEMM kernel: copying the numbers to the correct memory addresses, and multiply-accumulating them. The former is handled by copy instructions: TMA on Hopper, cp.async on Ampere, and vanilla copy on earlier architectures. The latter, since the Volta architecture in 2017, has become the exclusive business of the tensor cores.

Through many generations, the tensor cores have become a beast at consuming the numbers fed to them. For instance, the H200 SXM GPU’s tensor cores can deliver up to 3,958 TFLOPS (trillions of floating-point operations per second). On the other hand, the memory bandwidth of the same H200 SXM GPU is only 4.8 TB/s (terabytes per second). Data can therefore be delivered far more slowly than the tensor cores can consume it, and even this bandwidth is often nontrivial to fully utilize! As such, a common theme of CUDA programming, and of GEMM kernel design in particular, is to figure out how to copy numbers fast enough to keep the tensor cores busy. We call this process "feeding the beast."
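To put the two H200 figures quoted above on a common scale, one can divide peak math throughput by peak memory bandwidth. This is a rough back-of-the-envelope ratio using only those two peak numbers, not a measured result:

```latex
\frac{3958 \times 10^{12}\ \text{FLOP/s}}{4.8 \times 10^{12}\ \text{B/s}} \approx 825\ \text{FLOP per byte}
```

Read this way, a kernel would need to perform on the order of 825 floating-point operations for every byte it reads from global memory before the tensor cores, rather than the memory system, become the limit at these peak rates. This is why reusing data in on-chip memory and keeping copies in flight both matter.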

In general, there are two overarching strategies to "feed the beast," which are complementary and function at different scopes (grid vs. block). The first strategy is effective threadblock scheduling, which entails distributing the computation among the CTAs to obtain good load balancing and a higher rate of L2 cache hits. We will discuss this in a later blog post, but for now, we refer curious readers to the techniques of threadblock rasterization and persistent kernels, for instance as implemented in CUTLASS. The second strategy, which we focus on in this tutorial, is to overlap copying with math operations. In particular, while the tensor cores are busy multiplying a batch of numbers that they receive, we should tell the copying units to copy the next batch of numbers. That way, we effectively hide part of the copying latency. This is the goal of pipelining.
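The benefit of overlapping copy with math can be estimated with a toy timing model. The sketch below is an illustration under simplifying assumptions, not a description of real hardware: it assumes a single copy unit and a single math unit, fixed per-tile durations, and a hypothetical function name `pipelined_time`. A copy may start once the previous copy has finished and its destination buffer has been released; a multiply may start once its tile has arrived and the previous multiply has finished.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the finish time of the last multiply for `num_tiles` tiles, given
// `stages` buffers and fixed per-tile copy and math durations (arbitrary units).
// Assumes stages >= 1.
long pipelined_time(std::size_t num_tiles, std::size_t stages,
                    long copy_time, long mma_time) {
    std::vector<long> copy_done(num_tiles), mma_done(num_tiles);
    for (std::size_t i = 0; i < num_tiles; ++i) {
        // The copy unit is serial: wait for the previous copy to finish.
        long copy_start = (i > 0) ? copy_done[i - 1] : 0;
        // Tile i reuses the buffer of tile i - stages, which must be consumed first.
        if (i >= stages) {
            copy_start = std::max(copy_start, mma_done[i - stages]);
        }
        copy_done[i] = copy_start + copy_time;

        // The math unit needs the data to have arrived and to be idle.
        long mma_start = copy_done[i];
        if (i > 0) {
            mma_start = std::max(mma_start, mma_done[i - 1]);
        }
        mma_done[i] = mma_start + mma_time;
    }
    return num_tiles == 0 ? 0 : mma_done.back();
}
```

For example, with 4 tiles, a copy time of 2 and a math time of 3, this model gives 20 time units with a single buffer (every copy waits for the previous multiply) and 14 with two buffers (only the first copy is exposed).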

Latency, warps, and warp-specialization

Before discussing the mechanics of pipelining, we go over some history regarding the two overlapping strategies mentioned in the introduction: multistage and warp-specialization.

First, the idea of overlapping memory copy with math operations is neither new nor specific to GPUs. Readers familiar with CPUs may find it similar to the cache prefetching technique, where an asynchronous fetch request is made before the data is needed. In fact, the pipelining technique we discuss in this post is conceptually the same as CPU cache prefetching! However, since prefetching on GPUs is expensive in terms of silicon area on the chip, the technique is implemented differently.

The most basic method by which GPU programmers can create overlapping is via excess warps (warps are groupings of 32...
