Finding deadlocks in CuTe kernels with SPINGeorge's Blog
SearchSearch
Dark modeLight mode<br>Reader mode
!a.isFolder&&!b.isFolder||a.isFolder&&b.isFolder?a.displayName.localeCompare(b.displayName,void 0,{numeric:!0,sensitivity:\"base\"}):!a.isFolder&&b.isFolder?1:-1","filterFn":"node=>node.slugSegment!==\"tags\"","mapFn":"node=>node"}">Explorer
Finding deadlocks in CuTe kernels with SPIN<br>May 26, 2026
Using SPIN model checker to statically find or prove the absence of deadlocks in CuTe DSL kernels on NVIDIA B200, and presenting a proof-of-concept github.com/cheshire/cute2promela lowering from CuTe to SPIN.
Synchronization bugs in GPU kernels are hard to debug. When a barrier deadlocks, the hardware yields no stack trace, and no error code until the benchmark times out. Hence each iteration of the debug loop starts to potentially cost tens of minutes.
As we’ve worked on FlashInfer MLSYS Challenge (our solution took 1st place in the mixture-of-experts track), we had to iterate on a persistent fused mixture-of-experts kernel for DeepSeek-V3, written in CUTLASS’s CuTe DSL for an NVIDIA B200 and stitched together from FF1, SwiGLU, and FF2 stages across clusters of CTAs. The pipeline is coordinated via mbarriers in shared memory, peer-CTA mbarriers reached via mapa-translated DSMEM pointers, and GMEM atomic counters across clusters. A bug in any of those can potentially result in a deadlock, which is hard to debug, especially when GPUs are only available via Modal.
As my background is in formal verification, I wanted to try to instead encode the synchronization model in Promela DSL and check them with the SPIN model checker. This would not only shorten the iteration cycle, but also deterministically either demonstrate a counterexample to the desired property (e.g. a deadlock) or prove that no such interleaving exists.
Short Primer: Blackwell Synchronization Primitives
Let’s start with an overview of Blackwell synchronization primitives we would have to model.
An mbarrier is a 64-bit hardware object that lives in shared memory. Its bits hold a current arrival count, a pre-declared expected count, and a single-bit phase. You allocate it like any SMEM variable, then have one thread initialize it with the expected count; from that point on, the hardware treats it as a small state machine with two transitions:
Arrive. Any thread can call mbarrier_arrive on the barrier. The hardware atomically increments the arrival count. If the new count equals expected, the barrier completes: the count resets, the phase bit flips. Otherwise the call returns and the thread continues.
Wait. A thread that calls mbarrier_wait(phase=P) stalls until the barrier’s phase differs from P. Since the phase only changes on completion, this means “wait until the barrier completes one more time after I last looked”.
The single-bit phase is what makes mbarriers reusable. The first time you arrive-and-wait, you wait on phase 0; after completion, the barrier is at phase 1 and ready for the next round; you wait on phase 1; and so on. A persistent kernel that does N iterations of arrive-then-wait has to flip its expected-phase tracking each iteration. Get the flipping wrong and an iteration deadlocks.
mbarrier in CuTe DSL
In CuTe DSL, every mbarrier op is one of a small handful of calls:
Initialization. Barriers are allocated in static SMEM and initialized with their expected arrival count. One elected thread per cluster does this, and a fence makes the initialization visible before anyone arrives.
with cute.arch.elect_one():<br>cute.arch.mbarrier_init(gate_mbar, expected=256) # 128 local + 128 peer<br>cute.arch.mbarrier_init_fence()<br>Local arrive. A thread drops one count on a barrier in its own CTA’s SMEM.
cute.arch.mbarrier_arrive(gate_mbar)<br>TMA arrive-and-expect-tx. A variant used by TMA producers: the barrier accumulates bytes rather than arrivals, and completes when the expected byte count has landed. Typically followed by an async copy that names the barrier and a wait on the same barrier.
cute.arch.mbarrier_arrive_and_expect_tx(w2_mbar, w2_copy_bytes)<br>cute.copy(tma_atom_w2, ..., tma_bar_ptr=w2_mbar)<br>cute.arch.mbarrier_wait(w2_mbar, tma_phase_w2)<br>tma_phase_w2 = tma_phase_w2 ^ cutlass.Int32(1)<br>Phase wait. A consumer blocks until the barrier’s phase differs from the value it last saw.
cute.arch.mbarrier_wait(gate_mbar, sgate_phase)<br>sgate_phase = sgate_phase ^ cutlass.Int32(1) # toggle for next iter<br>Cross-CTA arrive. The B200 cluster lets two CTAs share each other’s shared memory through a DSMEM window. To arrive on a peer CTA’s mbarrier, the local SMEM pointer is translated through mapa.shared::cluster to the peer’s view of the same allocation, and the arrive lands on that translated address.
def mbarrier_arrive_peer(mbar_ptr, peer_cta_rank):<br>remote = cute.arch.mapa_shared_cluster(mbar_ptr, peer_cta_rank)<br>cute.arch.mbarrier_arrive(remote, space="cluster")<br>From the peer’s perspective, that arrive lands in their count the same way a...