Occupancy Math on the AMD MI355X: A From-First-Principles Guide

Occupancy Math on the AMD MI355X (CDNA4): A From-First-Principles Guide · Shekhar Pandey ← all posts Occupancy Math on the AMD MI355X (CDNA4): A From-First-Principles Guide<br>May 31, 2026 · GPU, AMD, CDNA4, kernels, occupancy<br>Ask a GPU kernel engineer how their kernel is doing and occupancy comes up within a sentence or two. It’s the number everyone quotes and the dial everyone reaches for — and, in my experience, the metric people understand least. Most treat it as an opaque percentage the profiler hands back. It isn’t. Occupancy is fully derivable by hand from a kernel’s resource usage and a handful of fixed hardware limits, and being able to do that derivation changes how you tune.

TL;DR. On MI355X, occupancy — the fraction of a SIMD’s wavefront slots your kernel keeps filled — is set by whichever of four resource limiters runs out first: VGPRs, SGPRs, LDS, or workgroup/barrier slots. Each is just a fixed hardware budget divided by what your kernel spends, so you can compute the ceiling by hand from the binary: occupancy = min(those four limiters). The VGPR file is 512 per lane, shared by regular and accumulator registers (not a separate AccVGPR pool). And maximizing occupancy is usually the wrong goal: in a measured MXFP8 MFMA sweep below, the matrix core stays at ~97% of peak even as occupancy falls to a fraction of full — its throughput tracks matrix-engine utilization, not how full the SIMD is.

This is a from-first-principles guide to occupancy math on the AMD Instinct MI355X (CDNA4, gfx950). We’ll build it from the silicon up: what the hardware budget actually is, which four resources cap how many wavefronts can go resident, and how to compute a kernel’s occupancy ceiling on paper and then confirm it with rocprofv3. The worked examples lean on MXFP8 GEMM tiles — the kind of kernel where these numbers decide whether the kernel is fast.

The post is in three parts:

Part 1 — The MI355X architecture. The CU, the four SIMDs, the wavefront, the Matrix Core, and the memory hierarchy that feeds them — the resources the math counts.

Part 2 — Occupancy math. The definition, the four limiters (VGPRs, SGPRs, LDS, and workgroup/barrier slots), the per-SIMD vs per-CU split that trips everyone up, worked examples, and how to measure occupancy for real.

Part 3 — Better performance at lower occupancy. The twist at the end: once you can compute the ceiling, why reaching for it is often the wrong move. Little’s Law, ILP versus occupancy, and a microbenchmark where the matrix core stays saturated even as occupancy collapses.

Occupancy is worth understanding precisely — both so you can fix the kernels that are genuinely occupancy-limited, and so you can recognize the ones that aren’t.

Part 1 — The MI355X architecture

Occupancy is, start to finish, a story about how a fixed pool of resources gets divided among the work you launch. So before any of the math lands, you need a clear mental model of the chip — and in particular, which resources are private and which are shared. We’ll build it top-down and keep returning to that one distinction, because it’s the hinge the whole calculation turns on.

The chip, top down

The MI355X is CDNA4, ISA target gfx950. At the top level it’s eight Accelerator Complex Dies (XCDs) stitched together over fourth-gen Infinity Fabric, totaling 256 Compute Units — 32 per XCD — clocked up to 2.4 GHz. Wrapped around the dies are 288 GB of HBM3E at 8 TB/s, fronted by 256 MB of last-level Infinity Cache. Each XCD carries its own L2 slice, which is why scheduling that keeps a tile’s traffic on one die (XCD-aware swizzling) pays off — but that’s a locality story, not an occupancy one.

For occupancy, the only unit you reason about is the Compute Unit. Everything above it governs how data moves; occupancy is decided one CU at a time. So that’s where we zoom in.

Inside a Compute Unit

A CU is four SIMD units plus shared infrastructure. Each SIMD is 64 lanes wide and owns a private register file. The pieces that matter:

VGPRs — a 512-entry-per-lane vector register file, private to the SIMD. Allocated per wave. This is usually the resource that caps occupancy.

AccVGPRs — the matrix accumulators, carved from that same 512 file. MFMA instructions can accumulate here. On CDNA4 the file is split between regular VGPRs (≤256/wave) and accumulation VGPRs (≤256/wave), but the two share one 512-entry budget — a wave’s regular plus accumulator count is what’s measured against 512 (ISA §3.6.4: “up to 512 total VGPRs, 256 of each type… the number of each type is flexible”). This is not the separate physical ACC file of MI100/MI200; CDNA3 unified them and CDNA4 keeps it that way. In practice the compiler fills regular VGPRs first: most of the MXFP8 GEMM tiles I profiled run with zero AccVGPRs — the accumulator sits in regular VGPRs — and only the largest tiles spill into the accumulator pool. Either way it’s one budget, and that’s the fact Part 3 turns on.

SGPRs — ~800 per SIMD. Scalar registers, allocated...

Occupancy Math on the AMD MI355X: A From-First-Principles Guide

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

German ruling declares Google liable for false answers in AI Overviews