What happens when you run a CUDA kernel?

29 Jun 2026 · 35 min read · Cover: Salomon de Caus's pinned-cylinder water organ, engraving from Les Raisons des Forces Mouvantes (1615).

Here’s a simple CUDA program. It adds two vectors.

    #include <cstdio>
    #include <cstdlib>

    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        int n = 1 << 20;  // a million floats (1,048,576)
        size_t bytes = n * sizeof(float);

        float *a = (float*)malloc(bytes), *b = (float*)malloc(bytes), *c = (float*)malloc(bytes);
        for (int i = 0; i < n; i++) a[i] = b[i] = 1.0f;

        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

        vadd<<<4096, 256>>>(da, db, dc, n);  // 4096 * 256 = n threads, one per float

        cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[0]=%f c[n-1]=%f\n", c[0], c[n-1]);
    }

Compiled for an RTX 4090 and launched, it correctly works out that 1+1=2, a million times (I didn’t check all of them).

    $ nvcc -arch=sm_89 -o vadd vadd.cu && ./vadd
    c[0]=2.000000 c[n-1]=2.000000

Telling you that involved tens of millions of CPU instructions, a couple of device files, nine hundred ioctls, and one memory-mapped doorbell register. In this post, we’ll follow this one kernel from the code down to the warps, and back up to the answer.

(An aside: this post is an instance of the ‘legibility transition’ that agents have engendered. There really is very little about computers you can’t find out with curiosity and (machine-enhanced) persistence. An interesting discussion of the implications of legibility for what AI can help us to know here.)

Compiling our program with nvcc

We ought to start with how to turn this CUDA program into something that the device can actually read. To do that we need a compiler. Really, we need many compilers.

nvcc is a driver program that runs several other compilers and combines their output. If you pass --keep it leaves the whole pipeline on disk for you to read:

    $ nvcc --keep -arch=sm_89 -o vadd vadd.cu && ls
    ...
    vadd.ptx              # device code as PTX (from cicc)
    vadd.sm_89.cubin      # device code as SASS (from ptxas)
    vadd.fatbin           # cubin + PTX, bundled (from fatbinary)
    vadd.cudafe1.stub.c   # host launch stub + kernel registration
    vadd.o                # final host object, fatbin embedded
    ...

The host code goes to your host compiler. The device code (vadd) takes more steps: cicc, an LLVM-based compiler, turns it into PTX, and then ptxas turns the PTX into SASS.

PTX is a virtual ISA. It has infinitely many typed registers, and no notion of how many of them the hardware actually has. Here is the (elided) body of vadd in PTX:

    $ cat vadd.ptx
    ...
    mad.lo.s32 %r1, %r3, %r4, %r5;    // set register r1 to ctaid*ntid + tid
    setp.ge.s32 %p1, %r1, %r2;        // set predicate p1 if i >= n
    @%p1 bra $L__BB0_2;               // if out of bounds, skip to exit
    cvta.to.global.u64 %rd4, %rd1;    // convert generic pointer %rd1 to a global address, store in %rd4
    mul.wide.s32 %rd5, %r1, 4;        // multiply r1 by 4, store the result in %rd5
    add.s64 %rd6, %rd4, %rd5;         // add %rd4, %rd5, result in %rd6
    ld.global.f32 %f2, [%rd6];        // load a[i] into %f2
    ...
    add.f32 %f3, %f2, %f1;            // add %f1 and %f2, result in %f3
    st.global.f32 [%rd10], %f3;       // store c[i] = ... in global memory

The virtual registers look like %rd1–%rd10 and %f1–%f3. (The prefix is the type: %r is a 32-bit integer, %rd a 64-bit one, %f a 32-bit float, %p a one-bit predicate.)

PTX is more ‘longhand’ than you might expect. For example, forming one address in %rd6 takes three PTX instructions. This happens because PTX is device agnostic.

Why three? CUDA pointers are “generic” by default, meaning they could name global, shared, or local memory. cvta.to.global asserts the pointer lives in the global window, so a cheaper ld.global can be used later. mul.wide.s32 then turns the index i into a byte offset by multiplying by 4 (sizeof(float)) and widening 32→64 bits in one step. add.s64 adds that to the base pointer.

Next, ptxas transforms our PTX, which is device agnostic, into the SASS for your architecture, which isn’t. The SASS it emits looks different:

    $ cuobjdump -sass vadd
    /*0000*/ MOV R1, c[0x0][0x28] ;                      // set up the stack pointer (ABI; unused here)
    /*0010*/ S2R R6, SR_CTAID.X ;                        // R6 = blockIdx.x
    /*0020*/ S2R R3, SR_TID.X ;                          // R3 = threadIdx.x
    /*0030*/ IMAD R6, R6, c[0x0][0x0], R3 ;              // i = ctaid*ntid + tid
    /*0040*/ ISETP.GE.AND P0, PT, R6, c[0x0][0x178], PT ; // P0 = (i >= n)
    /*0050*/ @P0 EXIT ;                                  // if so, exit
    /*0060*/ MOV R7, 0x4 ;                               // load literal 4 (sizeof(float)) into R7 as multiplier
    /*0070*/ ULDC.64 UR4, c[0x0][0x118] ;                // uniform load of a driver-provided system value
    /*0080*/ IMAD.WIDE R4, R6, R7, c[0x0][0x168] ;       // &b[i]
    /*0090*/ IMAD.WIDE R2, R6, R7, c[0x0][0x160] ;       // &a[i]
    /*00a0*/ LDG.E R4, [R4.64] ;                         // b[i]
    /*00b0*/ LDG.E R3, [R2.64] ;                         // a[i]
    /*00c0*/ IMAD.WIDE R6, R6,...
