A Tiny Compiler for Data-Parallel Kernels

A Tiny Compiler for Data-Parallel Kernels — Andrew HealeyA lot of fast code starts as a boring loop. Modern hardware can perform the same operation on multiple values at once (e.g. SIMD and SIMT), and sometimes we write code directly for those execution models but other times, a compiler starts with regular-looking code and rewrites it so multiple loop iterations can run together. I built a tiny compiler (~180LOC of Python) to understand what that transformation looks like. My compiler lowers kernels (rewrites them into a simpler, more explicit form where data parallelism is visible). The input is a small hand-written AST, and the output is a lowered IR that I print as Python-like code. Rather than going all the way from source code to instructions, think of this compiler as an intermediate step in a larger compiler. Let's take a look at an example. Scaling audio is easy to parallelize, but it is still common to write non-explicitly parallel code like this: kernel scale_audio(samples, out, n, volume): for i in range(n): out[i] = samples[i] * volume

My compiler turns it into: kernel scale_audio(samples, out, n, volume): vector_for base in range(0, n, LANES): let i = (base + lane_id) let active = (i n) masked_store(out, i, (masked_load(samples, i, active) * volume), active)

The goal is to replace for loops with vector_for loops, which allow multiple iterations of a loop to be executed in parallel. Each position in that grouped execution is called a lane. Lanes and Masks A lane is one independent element position within a grouped execution. For example, if a grouped operation handles four values at once, it has four lanes: [ 10 | 20 | 30 | 40 ] ^ ^ ^ ^ lane0 lane1 lane2 lane3

Each slot is a lane, so when you execute an operation like multiplication across the group, it looks like this: [10 | 20 | 15 | 30] [ 3 | 3 | 3 | 3] [30 | 60 | 45 | 90]

When the code is executed for each lane, they have a unique offset to know which data to operate on. Each lane works on a different logical element, and the grouped operation runs them side by side. To handle cases where the data is not divisible by the number of lane slots, a per-lane boolean mask is applied to skip out-of-bounds loads and stores. You can think of a mask like [true, true, false, false]: the first two lanes are allowed to run, and the last two are ignored. This is useful for the final chunk of an array, where there may not be enough elements left to fill every lane. A follow-up step would turn the body of the vector_for into instructions that target a specific architecture. In the ideal case this can improve performance by a factor of the number of lanes, though memory access and other overheads do play a part. Deciding What Can Be Lowered The lowering pass needs to answer two questions: Can different iterations of this loop run independently? For each value in the loop, is it uniform or varying? A uniform value is the same for every lane. A varying value may be different in each lane. This distinction matters because uniform values are shared across all lanes, while varying values require per-lane computation. In the audio example, volume is uniform and every lane multiplies by the same value. But i is varying, because each lane handles a different index, which means samples[i] is also varying. value kind why volume uniform same value in every lane i varying each lane gets a different index samples[i] varying each lane reads a different sample

The core of my compiler is a small AST-walking classifier: # (pseudocode but not far off reality) def kind(expr, env): if expr is a literal: return UNIFORM

if expr is a variable: return env[expr.name]

if expr is a load: return VARYING if kind(expr.index, env) == VARYING else UNIFORM

if expr is a binary expression: left = kind(expr.left, env) right = kind(expr.right, env) return VARYING if VARYING in (left, right) else UNIFORM

Before the loop, kernel parameters are assumed to be uniform. Inside the lowered loop, the compiler marks the loop index as varying: env = {param: UNIFORM for param in kernel.params} env[i] = VARYING

From there, varying-ness flows through the expression. Since i is varying, samples[i] is varying. Since samples[i] is varying, samples[i] * volume is varying (even though volume itself is uniform). This classification tells the compiler what to emit. If each lane reads the next adjacent element, like samples[i], the compiler can use a masked load. That keeps the operation grouped while preventing inactive lanes from reading past the end: samples[i] -> masked_load(samples, i, active)

But a varying load from an arbitrary per-lane index becomes a gather. A gather is the grouped-load version of "each lane reads from its own address", so the addresses may be different and non-adjacent. For example, in this unoptimized code: kernel color_by_number(color_number, colors, out, n): for i in range(n): number = color_number[i] out[i] = colors[number]

Each lane loads a different number, so number is...

A Tiny Compiler for Data-Parallel Kernels

Related Articles

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

Britain Became as Poor as Mississippi