What happens when you run a CUDA kernel?

mezark3 pts0 comments

What happens when you run a CUDA kernel

What happens when you run a CUDA kernel<br>29 Jun 2026 · 35 min read ·<br>Cover: Salomon de Caus's pinned-cylinder water organ, engraving from Les Raisons des Forces Mouvantes (1615).

Here’s a simple CUDA program. It adds two vectors.

__global__ void vadd(const float* a, const float* b, float* c, int n) {<br>int i = blockIdx.x * blockDim.x + threadIdx.x;<br>if (i n) c[i] = a[i] + b[i];

int main() {<br>int n = 1 20; // a million floats (1,048,576)<br>size_t bytes = n * sizeof(float);

float *a = (float*)malloc(bytes), *b = (float*)malloc(bytes),<br>*c = (float*)malloc(bytes);<br>for (int i = 0; i n; i++) a[i] = b[i] = 1.0f;

float *da, *db, *dc;<br>cudaMalloc(&da, bytes);<br>cudaMalloc(&db, bytes);<br>cudaMalloc(&dc, bytes);<br>cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);<br>cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

vadd4096, 256>>>(da, db, dc, n); // 4096 * 256 = n threads, one per float

cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);<br>printf("c[0]=%f c[n-1]=%f\n", c[0], c[n-1]);<br>Compiled for an RTX 4090, and launched, it does correctly work out that<br>1+1=21+1=21+1=2, a million timesI didn’t check all of them..

$ nvcc -arch=sm_89 -o vadd vadd.cu && ./vadd<br>c[0]=2.000000 c[n-1]=2.000000<br>Telling you that involved tens of millions of CPU instructions, a couple of<br>device files, nine hundred ioctls, and one memory-mapped doorbell register. In<br>this post, we’ll follow this one kernel from the code down to the warps, and<br>back up to the answerAn aside, this post is an instance of the ‘legibility transition’ that<br>agents have engendered. There really is very little about computers you can’t<br>find out with curiosity and (machine-enhanced) persistence. An interesting<br>discussion of the implications of legibility for what AI can help us to know<br>here..

Compiling our program with nvcc§

We ought to start with how to turn this CUDA program into something that the<br>device can actually read. To do that we need a compiler. Really, we need many<br>compilers.

nvcc is a driver program that runs several other compilers and combines their<br>output. If you pass --keep it leaves the whole pipeline on disk for you to<br>read:

$ nvcc --keep -arch=sm_89 -o vadd vadd.cu && ls<br>...<br>vadd.ptx # device code as PTX (from cicc)<br>vadd.sm_89.cubin # device code as SASS (from ptxas)<br>vadd.fatbin # cubin + PTX, bundled (from fatbinary)<br>vadd.cudafe1.stub.c # host launch stub + kernel registration<br>vadd.o # final host object, fatbin embedded<br>...<br>The host code goes to your host compiler. The device code (vadd) takes more<br>steps: cicc, an LLVM-based compiler,<br>turns it into<br>PTX,<br>and then ptxas turns the PTX into<br>SASS.

PTX is a virtual<br>ISA. It has<br>infinitely many typed registers, and no notion of how many of them the hardware<br>actually has. Here is the (elided) body of vadd in PTX:

$ cat vadd.ptx<br>...<br>mad.lo.s32 %r1, %r3, %r4, %r5; // set register r1 to ctaid*ntid + tid<br>setp.ge.s32 %p1, %r1, %r2; // set predicate p1 if i >= n<br>@%p1 bra $L__BB0_2; // if out of bounds, skip to exit<br>cvta.to.global.u64 %rd4, %rd1; // convert generic pointer %rd1 to a global address, store in %rd4<br>mul.wide.s32 %rd5, %r1, 4; // multiply r1 by 4, store the result in %rd5<br>add.s64 %rd6, %rd4, %rd5; // add %rd4, %rd5, result in %rd6<br>ld.global.f32 %f2, [%rd6]; // load a[i] into %f2<br>...<br>add.f32 %f3, %f2, %f1; // add %f1 and %f2, result in %f3<br>st.global.f32 [%rd10], %f3; // store c[i] = ... in global memory<br>The virtual registers look like %rd1–%rd10, %f1–%f3The prefix is the type: %r is a 32-bit integer, %rd a 64-bit one,<br>%f a 32-bit float, %p a one-bit predicate..

PTX is more ‘longhand’ than you might expect. For example, forming one address<br>in %rd6 takes three PTX instructions. This happens because PTX is device<br>agnostic.

Why three? CUDA pointers are “generic” by default, meaning they could name global, shared,<br>or local memory. cvta.to.global asserts the pointer lives in the global<br>window, so a cheaper ld.global can be used later. mul.wide.s32 then turns<br>the index i into a byte offset by multiplying by 4 (sizeof(float)) and<br>widening 32→64 bits in one step. add.s64 adds that to the base pointer.

Next, ptxas transforms our PTX, which is device agnostic, into the SASS for<br>your architecture, which isn’t. The SASS it emits looks different:

$ cuobjdump -sass vadd<br>/*0000*/ MOV R1, c[0x0][0x28] ; // set up the stack pointer (ABI; unused here)<br>/*0010*/ S2R R6, SR_CTAID.X ; // R6 = blockIdx.x<br>/*0020*/ S2R R3, SR_TID.X ; // R3 = threadIdx.x<br>/*0030*/ IMAD R6, R6, c[0x0][0x0], R3 ; // i = ctaid*ntid + tid<br>/*0040*/ ISETP.GE.AND P0, PT, R6, c[0x0][0x178], PT ;// P0 = (i >= n)<br>/*0050*/ @P0 EXIT ; // if so, exit<br>/*0060*/ MOV R7, 0x4 ; // load literal 4 (sizeof(float)) into R7 as multiplier<br>/*0070*/ ULDC.64 UR4, c[0x0][0x118] ; // uniform load of a driver-provided system value<br>/*0080*/ IMAD.WIDE R4, R6, R7, c[0x0][0x168] ; // &b[i]<br>/*0090*/ IMAD.WIDE R2, R6, R7, c[0x0][0x160] ; // &a[i]<br>/*00a0*/ LDG.E R4, [R4.64] ; // b[i]<br>/*00b0*/ LDG.E R3, [R2.64] ; // a[i]<br>/*00c0*/ IMAD.WIDE R6, R6,...

vadd float bytes global device cuda

Related Articles