A Case for Tracing Based DSL Kernel Languages

George's Blog


May 26, 2026

On the architectural divide between parsing and tracing kernel DSLs, and what tends to go wrong in each.

For years, CUDA C++ was effectively the only language for writing NVIDIA GPU kernels. Since Triton appeared, a wave of Pythonic DSLs has followed: CuTe-DSL, cuTile, Pallas, Gluon, Warp, and the more recent TileLang used in DeepSeek’s DeepGEMM. Most of these systems are embedded in Python and share the same goal: lowering a tile-oriented program to PTX or LLVM IR.

The question is how to embed the DSL in Python. Triton and CuTe-DSL parse the source AST. Pallas runs the function under abstract values and traces the resulting operations. (PyTorch’s torch.compile intercepts CPython bytecode rather than source, but that is still parsing, just against a smaller, already desugared grammar; the same trade-offs apply.)
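To make the tracing half of that distinction concrete, here is a minimal sketch. It is written in C++ only to match the other listings in this post; the DSLs above do the equivalent in Python, and none of the names (`Tape`, `Traced`, `make_input`, `axpy`, `trace_axpy`) come from any of them. The idea it illustrates: the kernel body is ordinary code, and running it with abstract values that record each operation yields a small program instead of a number.

```cpp
#include <string>
#include <vector>

// Recorded program: one line per operation, in execution order.
struct Tape {
    std::vector<std::string> ops;
    int next_id = 0;
};

// Abstract value: carries no data, only a name and the tape it records to.
struct Traced {
    Tape* tape;
    std::string name;
};

// Create a named placeholder for a kernel input (hypothetical helper).
inline Traced make_input(Tape& tape, const std::string& name) {
    return Traced{&tape, name};
}

// Record one binary operation and return a fresh abstract result.
inline Traced record(const Traced& a, const Traced& b, const char* op) {
    Tape* tape = a.tape;
    std::string out = "%" + std::to_string(tape->next_id++);
    tape->ops.push_back(out + " = " + op + " " + a.name + ", " + b.name);
    return Traced{tape, out};
}

inline Traced operator+(const Traced& a, const Traced& b) { return record(a, b, "add"); }
inline Traced operator*(const Traced& a, const Traced& b) { return record(a, b, "mul"); }

// The "kernel": plain code, generic over the value type it runs on.
template <typename V>
V axpy(V a, V x, V y) {
    return a * x + y;
}

// Run the kernel under abstract values and return what was recorded.
inline std::vector<std::string> trace_axpy() {
    Tape tape;
    Traced result = axpy(make_input(tape, "a"),
                         make_input(tape, "x"),
                         make_input(tape, "y"));
    (void)result;  // only the recorded operations matter here
    return tape.ops;
}
```

Calling `axpy(2.0f, 3.0f, 1.0f)` computes a float, while `trace_axpy()` executes the same body and returns a two-operation program (a multiply followed by an add). A parsing frontend would instead read the text of `axpy` and never execute it.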

Most DSLs follow Triton’s lead and use parsing. This essay takes the other side and argues that a tracing-based approach is often preferable.

CUDA and Templates

A CUDA kernel directly specifies the code that each thread executes. A textbook fused-softmax kernel in CUDA looks roughly like this:

    template <typename T, int BLOCK_SIZE>
    __global__ void softmax_kernel(const T* __restrict__ x,
                                   T* __restrict__ y,
                                   int n_cols) {
        int row = blockIdx.x;
        int tid = threadIdx.x;

        __shared__ float sdata[BLOCK_SIZE];
        const T* row_ptr = x + row * n_cols;

        float local_max = -INFINITY;
        for (int i = tid; i < n_cols; i += BLOCK_SIZE)
            local_max = fmaxf(local_max, float(row_ptr[i]));
        sdata[tid] = local_max;
        __syncthreads();
        // ... tree reduction, exp, normalize, store ...
    }

The element type T and the block size BLOCK_SIZE must be known at compile time: the __shared__ array is statically sized, and the compiler has to specialise the loop bounds in order to vectorise the body. Hence any expansion of the supported configuration space multiplies the number of instantiations. Three element types and four block sizes already imply twelve instantiations, and the responsibility for dispatching among them rests with the caller.
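The caller-side dispatch can be sketched in host-only C++. Everything here is an assumption made for illustration: the three element types (with `half_t` and `bf16_t` as placeholder structs so the code compiles without CUDA headers), the four block sizes, and the helper names. `launch_softmax` stands in for a real launch of `softmax_kernel<T, BLOCK_SIZE>` and only reports which instantiation was chosen.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Placeholder 16-bit types so the sketch compiles without CUDA headers.
struct half_t { std::uint16_t bits; };
struct bf16_t { std::uint16_t bits; };

// Runtime tag for the element type the caller holds (hypothetical).
enum class DType { F16, BF16, F32 };

template <typename T> struct TypeName;
template <> struct TypeName<half_t> { static const char* get() { return "f16"; } };
template <> struct TypeName<bf16_t> { static const char* get() { return "bf16"; } };
template <> struct TypeName<float>  { static const char* get() { return "f32"; } };

// Stand-in for launching softmax_kernel<T, BLOCK_SIZE>; it only names
// the instantiation that would be launched.
template <typename T, int BLOCK_SIZE>
std::string launch_softmax() {
    return std::string(TypeName<T>::get()) + "/" + std::to_string(BLOCK_SIZE);
}

// Second level: turn a runtime block size into a compile-time constant.
template <typename T>
std::string dispatch_block(int block_size) {
    switch (block_size) {
        case 128:  return launch_softmax<T, 128>();
        case 256:  return launch_softmax<T, 256>();
        case 512:  return launch_softmax<T, 512>();
        case 1024: return launch_softmax<T, 1024>();
        default:   throw std::invalid_argument("unsupported block size");
    }
}

// First level: turn a runtime dtype into a compile-time type.
// 3 types x 4 block sizes = 12 instantiations of launch_softmax.
inline std::string dispatch_softmax(DType dtype, int block_size) {
    switch (dtype) {
        case DType::F16:  return dispatch_block<half_t>(block_size);
        case DType::BF16: return dispatch_block<bf16_t>(block_size);
        case DType::F32:  return dispatch_block<float>(block_size);
    }
    throw std::invalid_argument("unsupported dtype");
}
```

Each new template parameter adds another switch level, and every combination reachable from those switches is compiled whether or not it is ever launched.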

Keep adding template parameters and generalisations to a CUDA codebase, and it eventually converges on something heavily templated and CUTLASS-like.

CUTLASS: Building Blocks for CUDA Kernels

CUTLASS is what you get when C++ template metaprogramming is taken seriously as a way to write GPU kernels. Consider the declaration of its principal Gemm class, the entry point most users first encounter, from include/cutlass/gemm/device/gemm.h:

    template <
        /// Element type for A matrix operand
        typename ElementA_,
        /// Layout type for A matrix operand
        typename LayoutA_,
        /// Element type for B matrix operand
        typename ElementB_,
        /// Layout type for B matrix operand
        typename LayoutB_,
        /// Element type for C and D matrix operands
        typename ElementC_,
        /// Layout type for C and D matrix operands
        typename LayoutC_,
        /// Element type for internal accumulation
        typename ElementAccumulator_ = ElementC_,
        /// Operator class tag
        typename OperatorClass_ = arch::OpClassSimt,
        /// Tag indicating architecture to tune for
        typename ArchTag_ = arch::Sm70,
        /// Threadblock-level tile size (concept: GemmShape)
        typename ThreadblockShape_ = typename DefaultGemmConfiguration<
            OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_,
            ElementAccumulator_>::ThreadblockShape,
        /// Warp-level tile size (concept: GemmShape)
        typename WarpShape_ = typename DefaultGemmConfiguration<
            OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_,
            ElementAccumulator_>::WarpShape,
        // ... ten more parameters elided ...
        bool ScatterD = false,
        typename PermuteDLayout = layout::NoPermute>
    class Gemm { /* ... */ };

cutlass/gemm/device/gemm.h, lines 169–233. Around twenty template parameters, several with defaults that recursively look up DefaultGemmConfiguration.
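The default-lookup pattern in that declaration can be reduced to a few lines. This is a sketch of the pattern's shape only, not CUTLASS code: `MiniGemm`, `DefaultConfig`, `TileShape`, `tile_m`, the local `Sm70`/`Sm80` tag structs, and the tile sizes are all hypothetical and chosen arbitrarily.

```cpp
// Hypothetical architecture tags, local to this sketch.
struct Sm70 {};
struct Sm80 {};

template <int M, int N, int K>
struct TileShape {
    static constexpr int m = M;
    static constexpr int n = N;
    static constexpr int k = K;
};

// Traits class that supplies defaults. The primary template is the
// fallback; specialisations override it for particular combinations.
template <typename Arch, typename Element>
struct DefaultConfig {
    using Tile = TileShape<64, 64, 8>;
};

template <>
struct DefaultConfig<Sm80, float> {
    using Tile = TileShape<128, 128, 16>;
};

// The default for Tile_ depends on the earlier parameters, so changing
// Arch or Element silently changes the tile as well.
template <typename Element,
          typename Arch = Sm70,
          typename Tile_ = typename DefaultConfig<Arch, Element>::Tile>
struct MiniGemm {
    using Tile = Tile_;
};

// Helper to inspect which tile a given instantiation resolved to.
template <typename Gemm>
constexpr int tile_m() {
    return Gemm::Tile::m;
}
```

With these definitions, `MiniGemm<float>` and `MiniGemm<float, Sm80>` resolve to different tiles even though neither names a tile, and working out which one applies means reading the traits class and all of its specialisations.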

A fragment of the canonical Hopper warp-specialized GEMM example shows how a user composes a kernel from nested CollectiveBuilders, each a template that pulls in dozens of further instantiations:

    using namespace cute;

    using TileShape    = Shape<_128,_128,_32>;  // CTA tile
    using ClusterShape = Shape<_4,_2,_1>;       // cluster of CTAs

    using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
        cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
        TileShape, ClusterShape,
        cutlass::epilogue::collective::EpilogueTileAuto,
        ElementAccumulator, ElementAccumulator,
        ElementC, LayoutC, AlignmentC,
        ElementC, LayoutC, AlignmentC,
        cutlass::epilogue::collective::EpilogueScheduleAuto
      >::CollectiveOp;

    using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
        ArchTag, OperatorClass,
        ElementA, LayoutA, AlignmentA,
        ElementB, LayoutB,...
