NVFP4 Blockscaled GEMM on NVIDIA RTX Pro Blackwell GPUs (SM12x)

NVFP4 Blockscaled GEMM on NVIDIA RTX Pro Blackwell GPUs (SM12x) - Colfax Research

2805 Bowers Ave, Santa Clara, CA 95051 | 408-730-2275<br>research@colfax-intl.com

In this article, we explore hardware-supported NVFP4 blockscaled GEMM on SM12x GPUs, such as the NVIDIA RTX Pro 6000 Blackwell Server Edition (SM120) or NVIDIA DGX Spark (SM121). We will first discuss features of these GPUs and their kernel programming paradigm, situating them relative to SM10x (e.g. B200 or B300) and SM8x (Ampere/Ada). Then, we will discuss sub-byte blockscaled GEMM on SM12x, covering the necessary PTX background and scale-factor layouts before moving on to implementation details as a CuTe DSL kernel. Finally, we present some GEMM benchmarks, which we will expand on in a subsequent post on optimization.

SM12x architecture

SM12x denotes the feature set available in Blackwell systems such as the RTX Pro 6000 Blackwell Server Edition. However, even though these devices carry the Blackwell family name, their programming model diverges substantially from that for B200/B300 GPUs, instead inheriting much of the design familiar from the older SM8x architectures (Ampere/Ada). In this section, we compare these three architectures in terms of what matters for kernel development.

First, we recall that SM10x systems like the B200 provides the asynchronous tcgen05.mma instruction for targeting the tensor cores, which uses Tensor Memory (TMEM) as a dedicated on-chip memory for holding the MMA accumulators. In contrast, SM12x does not use tcgen05 or TMEM; instead, like with SM8x it uses warp-level mma.sync instructions.

This difference has important consequences for kernel structure. On SM10x, tcgen05.mma is launched asynchronously from a single thread, sources operands and accumulator from shared memory (SMEM) and TMEM, and is locked to one CTA per SM. As a result, SM10x GEMM kernels follow a warp-specialized paradigm in which all the MMA issue logic is delegated to one warp while other warps are responsible for loading the operands and writing out the result.

On SM12x, one instead uses mma.sync, which is a synchronous, warp-collective operation that sources its operands from register memory (RMEM) and also accumulates into RMEM, with fixed partitionings of the tiles over threads. Because it uses register fragments, the instruction tile for mma.sync is necessarily much smaller than that for tcgen05.mma, and one needs multiple warps handling MMA to achieve good throughput.

We note that mma.sync can sometimes be available on SM10x but (a) will never achieve the peak MMA throughput of the device and (b) for certain options like low-precision is also incompatible. So, the takeaways are:

SM10x GEMM kernels are completely incompatible with SM12x and will not run on those devices.

SM12x GEMM kernels may run on SM10x but will perform poorly.

SM8x GEMM kernels will run on SM12x and can often perform reasonably. Indeed, much of the optimization logic around scheduling and pipelining mainloops from SM8x directly applies to SM12x.

Second, we point out that despite the close similarities, SM12x has also advanced in many ways over SM8x:

Sub-byte support and blockscaling: mma.sync on SM12x supports lower precision formats than SM8x, as well as hardware-supported blockscaling.

Tensor Memory Accelerator (TMA) is available on SM12x.

Warpgroup register re-allocation is supported only on certain SM12x devices (sm120a/sm120f).

Cluster Launch Control (CLC) – a hardware-supported dynamic work-distribution scheme for persistent scheduling – can be used on SM12x.

Programmatic Dependent Launch (PDL) – kernel parallelism between dependent kernel launches – is supported on SM12x.

In terms of the programming model, the addition of TMA, register reallocation and CLC combine to make persistent scheduling much more favorable. But the SM8x-style static grid scheduling still works fairly well.

Sub-byte datatype and Blockscaling review

We covered both sub-byte datatype and blockscaling in a previous post; this section serves as a brief review of these concepts.

Low precision computation: NVFP4

Low-precision data types are useful for reducing model size, memory traffic and computational load. SM12x supports multiple sub-byte data types, which are data types that narrower than 8 bits. We focus on the NVFP4 format.

NVFP4 is an NVIDIA standard data type that bundles a fixed-length vector of low-precision numbers with a scale factor. The operand data type is E2M1, a 4-bit floating-point with 2 exponent bits and 1 mantissa bit. Unlike the IEEE formats, the value with all exponent bits set to 1 is not reserved for NaN or ∞. This modestly extends the range of E2M1. The possible values are:

[0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6]

The scale factor used to dequantize NVFP4 has type UE4M3, a nonnegative 8-bit floating point number with 4 exponent bits and 3 mantissa bits. The...

NVFP4 Blockscaled GEMM on NVIDIA RTX Pro Blackwell GPUs (SM12x)

Related Articles

Apple WWDC 2026 Livestream

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

German ruling declares Google liable for false answers in AI Overviews