Do AI Agents Make ML Compilers Obsolete? - by manjeet singh
maderix’s Substack
AI agents can now write GPU kernels and low-level code on their own. That was supposed to make the ML compiler obsolete, so why did the industry just spend billions doubling down on it?
manjeet singh
Jun 25, 2026
When I first got into machine learning around 2019, I did what any self-respecting systems person would do. I ignored the tutorials and went straight to the guts. What is TensorFlow actually doing when I call model.fit()? What does PyTorch turn my nice Python code into before it hits the GPU?
What I found was a familiar layered cake. Python at the top. C++ in the middle. And at the very bottom, hand-written CUDA kernels doing the actual math on silicon. The same pattern computing has used since the 1950s: humans write something readable, and something else turns it into what the hardware actually wants.
Fast forward to two weeks ago. Google and Hugging Face ran the Fast Gemma Challenge[1], where over 60 AI agents coordinated through a shared message board and autonomously optimized inference speed for Google’s Gemma 4 E4B model. Custom CUDA kernels, quantization strategies, speculative decoding. Over 127 tokens per second[2], a substantial jump from the baseline.
Then yesterday, two pieces of news dropped within hours. Qualcomm announced it’s acquiring Modular[3] for $3.9 billion. And OpenAI unveiled Jalapeno[4], its first custom inference chip, co-developed with Broadcom. Both say something very specific about where this industry thinks the real value sits.
If AI agents can write and optimize kernels autonomously, do we still need ML compilers? I’ve been chewing on this for a while, and I think the answer will annoy people on both sides.
We’ve Had This Argument Before
In the 1950s, programming meant writing machine code. Binary sequences on punch cards. If you wanted to add two numbers on an IBM 704, you needed the exact opcode, the register layout, and a tolerance for suffering.
Then assembly language came along. ADD R1, R2 instead of 0x01 0x01 0x02. People thought this was plenty of abstraction.
When John Backus and his team at IBM shipped FORTRAN in 1957, the assembly programmers were not impressed. “It’ll be slower.” “You lose control.” “Real programmers don’t need this.”
FORTRAN code was slower at first. About 20% slower in early benchmarks. But FORTRAN programs were written in a fraction of the time, could be maintained by someone other than the original author, and improved with every compiler release. General Motors studies showed a 5–10x productivity improvement over assembly. Within a decade, over half of all IBM computer code was FORTRAN-generated, and the “real programmers write assembly” crowd had moved on to arguing about whether COBOL was a real language.
C in the 1970s. C++ in the 1980s. Java in the 1990s. Same story every time. “Too slow, too abstract, you’re giving up control.” And every time, the abstraction won. The compilers got better. The people who insisted on doing things by hand either found new, harder problems to work on or got very good at complaining.
The exact same argument is playing out right now with ML compute. Fancier hardware, Twitter instead of Usenet, identical pattern. To understand why, you need to see how the ML compute stack is actually built.
The Two-Layer Cake
If you’ve written a neural network in PyTorch, you’ve probably written something like this:
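A minimal sketch of the kind of model definition meant here. The class name `TinyClassifier` and the layer sizes are illustrative assumptions, not taken from any particular project:

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Two-layer MLP; the sizes are arbitrary and chosen only for illustration."""

    def __init__(self, in_features=784, hidden=256, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)   # weight shape: (hidden, in_features)
        self.fc2 = nn.Linear(hidden, num_classes)   # weight shape: (num_classes, hidden)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # matrix multiply + bias, then ReLU
        return self.fc2(x)           # raw logits, one row per input example


model = TinyClassifier()
batch = torch.randn(32, 784)   # 32 fake inputs of 784 features each
logits = model(batch)
print(logits.shape)            # torch.Size([32, 10])
```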
Clean. Readable. You’re thinking about architecture, tensor shapes, data flow.
But when this runs on a GPU, that nn.Linear call becomes a matrix multiplication, and that multiplication becomes thousands of GPU threads reading from shared memory in specific patterns to avoid bank conflicts, using tensor cores when available, tiling the computation into blocks that fit the GPU’s cache hierarchy.
The actual kernel code looks nothing like Python:
This two-layer setup made sense. Researchers think in layers, attention heads, loss functions. Kernel engineers think in warps, shared memory, instruction-level parallelism. Different skill sets, different people. The researchers define models in Python. The kernel engineers write the compute in C and CUDA.
For a long time, this worked great. NVIDIA’s cuDNN[5] and cuBLAS[6] provided hand-tuned kernels for the most common operations. Standard layers? Great performance, essentially free.
When “Free” Stopped Being Enough
Then ML models stopped being standard.
Attention mechanisms. Custom activation functions. Novel normalization schemes. Mixture-of-experts routing. The architectures that exploded after the 2017 “Attention Is All You Need” paper meant that the set of operations people needed grew faster than any kernel team could keep up with. PyTorch now has thousands of operators. The combinatorial explosion of fusing N of those operations across different hardware targets is simply not something you can do by hand.
This is when ML...