Elusive order of async GPU kernels: scheduling, abstractions, DSL implications

matt_d1 pts0 comments

The elusive order of things – Ian’s Blog

Skip to content

The elusive order of things

Written by

Ian

in

ML Infrastructure, posts

SIMT offered a fantastic bargain. You write a straight-line program, the machine runs a lot of copies of it, and when one waits for memory the hardware swaps in others. You look with disdain on the less enlightened thread programmers dealing with deadlocks and concurrency etc. etc.

Choosing what to run where and when is a scheduling problem, and there have been three effective approaches to that so far.

You can schedule statically: decide ahead of time what all the units should do each tick. You can schedule temporally: swapping in different phases of workers via a pipeline. Or you can schedule spatially: divide the resources of the machine into different roles.

The underlying mechanics of which one you pick tends to be determined by the hardware. A chip like a TPU spends most of its silicon on math, and fairly little on orchestrating work. That means static scheduling, and a compiler that can build you that schedule.

Ampere and before1, and all the modern AMD chips, encourage temporal pipelining. The hardware will swap in warps (or waves) when one stalls ,and by structuring your kernels into phases you can hide memory latency and keep the chips busy.

Hopper and beyond are where spatial scheduling started mattering, in the form of warp specialization. Nvidia GPUs let you assign different register footprints to different warp groups. When you introduce warp-group scoped MMA for compute and TMA for executing data moves from a single thread you have the ingredients to divide the pipeline between groups. Instead of the same worker doing load -> compute -> store you have different workers exclusively working on different parts of the pipeline. Blackwell made this… much harder. TMEM and UMMA added new operator and memory types, so you now need to schedule movement between shared memory, tensor memory, registers, global memory, and a variety of compute units.

The problem is: how do you do that?

To stick with Nvidia for a moment, at the bottom of the stack are barriers. An mbarrier is a phase switch for a specific number of arrivals: one side waits, the other increases the arrival count. When the counter matches the expected number, the phase flips. It’s elegant and straightforward, and easy to get wrong. A classic example is the phase parity bug: if you screw up the wraparound the kernel can work perfectly at first, but then deadlock waiting on the wrong phase.

Next up, libraries like CUTLASS, and newer ones like ThunderKittens, package the patterns you tend to write. The CUTLASS Pipeline combines buffers and synchronization into a unit and makes it easy to compose common structures. This is where much of the expert-kernel-writing time goes, but that time encodes a lot of hardware-specific behavior. Hopper wants one set of patterns, Blackwell another, and even within a generation there can be differences between variants of the hardware. The more explicit the schedule is for the developer, the more they own the portability problem.

The subsequent step is to make the schedule less explicit, while still keeping the roles visible. AsyncGraphene’s ARef is a good example of this. An ARef is a reference to asynchronously produced data. Basically, a channel, with synchronization attached. A producer writes, a consumer reads, and both sides can know when the other is done. A compiler can then plan a schedule. Nvidia’s TAWA work does this explicitly for Triton, tagging producers and consumers and lowering to ARefs. TLX on the other hand, as well as systems like PipeThreader, allow defining subtasks in a kernel that a compiler can schedule.

TileIR and CuTile also enable building an explicit graph, but through focusing on the data itself. Attaching usage information on how data is read or written gives the compiler room to bundle work into tasks and reschedule.

Getting the graph is the starting point, but then you need to identify what the right schedule actually is. In practice this involves exploring different shapes and combinations to work out which is best. You can either do that explicitly through heuristics and cost models of the hardware, or do it via searching across many different possible schedules to find the ones that work best. Most systems do both.

But what do we need in a kernel DSL?

If you are building a DSL for writing kernels, the starting point is to reflect whatever the hardware does. This is not only direct, but also a necessary option because there are always smart people operating at the frontier who have a strong intuition around how to drive the most performance. They’re often targeting very new hardware which is not yet well understood (sometimes, even by the people that made it).

Beyond that, deciding what else should be on offer means answering three questions:

1. How do you think about portability?

Portability doesn’t mean “write one generic...

schedule hardware different memory work scheduling

Related Articles