A Fast Attention Kernel for MI300X, Written in HIP, Not Assembly - MoonMath.ai
A Fast Attention Kernel for MI300X, Written in HIP, Not Assembly
Jun 17, 2026<br>AMD LiteAttention MoonLite
A deep dive into building a bf16 forward attention kernel for AMD MI300X in HIP, using instruction level control without dropping into a full assembly codebase.
TL;DR
A bf16 forward attention kernel for AMD MI300X, written in HIP rather than full hand assembly.
The core technique is one-instruction asm wrappers: choose exact opcodes while leaving register allocation to the compiler.
The design plans the CDNA3 pipeline directly: eight waves, two groups, and two carefully placed barriers.
Most of the speedup comes from memory placement: K in LDS, V kept hot in L1, Q and accumulators in registers.
3Q tiling and the tail KV split turn a near tie into wins across the reported sweep.
Code is open-source under MIT and can be found here.
We would like to thank HotAisle, our AMD cloud, for bare metal access and general support.
A bf16 forward attention kernel for AMD MI300X (gfx942), written in HIP rather than assembly, that beats AITER v3 on every shape and every rounding mode. Geomean 1.18×, 1.15×, 1.08× (RTNE, RTNA, RTZ), up to 1.26×, across an 8K to 128K sweep, and 1.37× to 1.59× versus Modular MAX.
1. What we built, and the idea behind it
Attention is the fused softmax(QKᵀ/√d)·V operation at the core of every transformer. As part of our kernel engineering work, we built a full BF16 forward-attention kernel for AMD MI300X (CDNA3, gfx942). In our benchmarks, it outperformed the best existing AMD implementation on MI300X. This write-up provides the technical details, lessons, and some techniques we found useful and can be reused.
AITER is AMD's own optimized kernel library, and its v3 forward attention is written by hand in GCN assembly. Against it, our kernel wins on every shape and every rounding mode across an 8K to 128K token sweep. The geomean is 1.18×, 1.15×, 1.08× for RTNE, RTNA, and RTZ rounding, and 1.37× to 1.59× against Modular MAX. RTNE and RTNA lead by 1.11× to 1.26×. RTZ is AITER's own fastest mode, the tightest race, and there we win by a slimmer 1.03× to 1.11×.
For a rigidly structured computation like attention, the way to win is to make the scheduling decisions yourself instead of leaving them to the compiler. The compiler's job is to be a good generalist, and a fused attention kernel is not a general program. It is a pipeline whose exact shape we already know, down to which instruction should issue when, what should be in flight, and where each barrier goes. Almost every section here takes one of those decisions in hand. There is one honest exception, in §3.1, where the smarter move is the opposite: hand the compiler an easier problem and let it do the overlap itself. We flag it when we get there.
This blog post is structured as follows: the toolkit (§2) is the handful of barely documented techniques that make instruction level control possible from HIP at all. The architecture (§3) covers why eight waves, how the two halves overlap, and which matrix core instruction we use and why. The memory hierarchy (§4) is what lives in registers, LDS, and L1, and the lengths we go to to keep it there. The two advanced wins (§5) turn a near tie into a clean sweep. The full results are in §6.
2. The toolkit
2.1 Templated asm functions
Everything that follows rests on one technique, and it is the one we found least documented anywhere, so we start here.
Every HIP kernel author hits the same dilemma when they want a specific machine instruction. Compiler intrinsics (__builtin_amdgcn_*) keep you in tidy C++, but the compiler then feels free to help in ways you did not ask for. It may rename your accumulator into a fresh register and add a v_mov to copy it back, re-zero an operand you know is already zero, or reorder the instruction against its neighbours. Drop to raw inline assembly and you get exactly the instruction you asked for, but AMD's own guidance is blunt about the price: inline asm is "not recommended, since the compiler does not look at the semantics of the inlined instructions, and may not take care of data hazards" (AMD matrix cores lab note). And now you are managing registers and addresses by hand too.
The way through is a small pattern. Wrap exactly one instruction in a __device__ __forceinline__ function and describe its operands with GCC or Clang extended asm constraints. You dictate the opcode, and the compiler still allocates the registers and tracks the data flow.[1] Here is the MFMA we issue hundreds of thousands of times:
// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.<br>__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {<br>asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"<br>: "+v"(c) : "v"(a), "v"(b));<br>The whole trick is in the constraint string. "+v"(c) says that c is a read-write VGPR operand, so the accumulator's input and output...