Reverse engineering the Qualcomm NPU compiler
What I pulled out of a stripped QAIRT binary
Jun 2026 · datavorous
My work is to maximise NPU utilisation so that edge deployment gets faster for whatever models we want to run on them. But NPU documentation on the web is basically nonexistent, and the little that's out there was so disappointing that at one point I thought of quitting. So I reverse engineered the compiler instead. I previously wrote a small primer on NPUs (what they are, and where they break); it should be enough background for what's coming next.
Qualcomm has not published the VTCM capacity for any of its SoCs. How am I supposed to know whether my tensors are spilling all over, or whether quantisation is really needed? Add to that my curiosity about how they simulate a model before it even runs on the actual hardware, and which optimization algorithms are involved.
I (equipped with Claude Code) dug into the *.so files of the QNPU SDK (v2.46.0.260424), banking on the unmangled names that survived stripping, the raw machine code decompiled with Ghidra, and some empirical parameter sweeps on my Linux machine.
Some of the novel findings are (no one has the attention span to read the entire writeup anyway):
HTP formulates VTCM placement as an MILP (mixed-integer linear program) and solves it with HiGHS (an optimization solver, rather than heuristics), which was completely unknown publicly
VTCM placement uses a recursive backtracking allocator operating in a 3D coordinate space
The compiler can automatically alter weight precision (without you knowing) during placement to relieve memory pressure
The effective fit/spill boundary depends on the target architecture even when different SoCs report the same vtcmSize
HTP contains a hidden analytical simulator called Hextimate, from which we recovered roofline equations and contention models
The appendix contains more information that I could not fit into the main body.
That's the gist. The rest of this piece covers three things I found which I think were never publicly documented anywhere on the internet, and which should benefit anyone doing edge deployment on Qualcomm NPUs.
The memory wall
The Hexagon chip has a small pool of on-chip scratch memory called VTCM (vector tightly coupled memory). On the other side is DDR, the main memory, which is slow. The significant bottleneck for ML inference is how long it takes to move your data around. The compiler's entire job here is to decide what gets to sit in VTCM at each moment, because anything that doesn't fit has to be pushed out to DDR and fetched back later, which is expensive and energy inefficient.
Every tensor in your model has a lifetime. At any instant during execution, some set of tensors is alive (call it the working set). If that fits in VTCM, everything is fast. If not, the compiler starts inserting spill operations (push a tensor out to DDR) and fill operations (pull it back). This is the cliff I wanted to find.
Using the same model (Qwen 0.8B), on an SM8350 the compiler reported spilling 5.46 MB and filling 33.9 MB, with total DDR of 37.9 MB, whereas on an SM8650 (V75) nothing spilled and the DDR read 1.15 MB. That is a 33x jump in DDR read traffic (37.9 / 1.15) from nothing but the target chip (which is expected). But for some peculiar reason, both chips reported the same VTCM size in the compiled output: a field that just says 4. I don't know 4 of what, or whether it's some kind of code. The behaviour is completely different anyway. I didn't recover the actual capacities; that's something I wish to do next.
The compiled binary carries a metadata field called spillFillBufferSize; when it is 0, the model weights fit entirely on chip. Checking it is a quick way to tell whether spilling is the reason your inference is slow.
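To make the working-set idea concrete, here is a minimal sketch that computes the peak working set from tensor lifetimes and checks it against a capacity. The Tensor type, the step indices, the sizes and both capacity values are made up for illustration; none of them come from the SDK, and the real VTCM capacities are not recovered in this piece.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class Tensor:
    # Hypothetical description of one tensor's lifetime.
    name: str
    size_bytes: int
    first_use: int  # index of the op that produces the tensor
    last_use: int   # index of the last op that reads it


def peak_working_set(tensors, num_steps):
    """Return (peak_bytes, step) for the largest set of simultaneously live tensors."""
    peak, peak_step = 0, 0
    for step in range(num_steps):
        # A tensor is live at every step between its first and last use, inclusive.
        live = sum(t.size_bytes for t in tensors
                   if t.first_use <= step <= t.last_use)
        if live > peak:
            peak, peak_step = live, step
    return peak, peak_step


def fits(tensors, num_steps, capacity_bytes):
    """True when the peak working set stays within the given capacity."""
    peak, _ = peak_working_set(tensors, num_steps)
    return peak <= capacity_bytes


# Toy lifetimes: all three tensors overlap at step 2.
tensors = [
    Tensor("a", 2 * MB, 0, 2),
    Tensor("b", 3 * MB, 1, 3),
    Tensor("c", 1 * MB, 2, 4),
]
peak, step = peak_working_set(tensors, num_steps=5)
print(f"peak working set: {peak // MB} MB at step {step}")
print("fits in a hypothetical 8 MB pool:", fits(tensors, 5, 8 * MB))
print("fits in a hypothetical 4 MB pool:", fits(tensors, 5, 4 * MB))
```

With the same tensors, shrinking the capacity from 8 MB to 4 MB flips the answer from fit to spill, which is the cliff in miniature.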
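A small sketch of that check, assuming you have already dumped the compiled binary's metadata into a Python dict. How you extract the metadata and the dict layout are assumptions on my part; only the field name spillFillBufferSize and the meaning of 0 come from what I observed. Treating any non-zero value as "spilling" is an inference, and I deliberately do not attach a unit to the value.

```python
def diagnose_spill(metadata):
    """Classify a compiled graph from its spillFillBufferSize metadata field.

    `metadata` is a hypothetical dict holding the fields dumped from the
    compiled binary; this function does not parse the binary itself.
    """
    size = metadata.get("spillFillBufferSize")
    if size is None:
        return "unknown: spillFillBufferSize not present"
    if size == 0:
        return "fits: no spill/fill buffer reserved"
    # Assumption: a non-zero buffer means spill/fill ops were inserted.
    return f"spills: spillFillBufferSize = {size}"


# Illustrative values only, not real compiler output.
print(diagnose_spill({"spillFillBufferSize": 0}))
print(diagnose_spill({"spillFillBufferSize": 123456}))
print(diagnose_spill({}))
```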
Now I can confidently spend my time on quantising my model if my target chip is an SM8350, instead of second guessing it.
One more thing decides whether you fit, and things get a little more exciting there.
The scheduler playing tetris with time
The order in which the chip runs operations decides how long each tensor stays alive, which decides how big the working set gets, which decides whether you hit the cliff. Hence the compiler's scheduler has to find an order that keeps the peak working set as small as possible, and within the limits of VTCM.
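Here is a toy illustration of why order matters. It is not the compiler's scheduler, just a sketch that replays two valid topological orders of the same five-op graph and measures the peak bytes live at once. The graph, the sizes (in arbitrary units) and the capacity of 8 are all invented for the example.

```python
def peak_for_order(order, inputs, out_size):
    """Peak size live at once when ops execute in `order`.

    inputs[op]   -> list of ops whose outputs `op` reads
    out_size[op] -> size of the tensor `op` produces
    An output stays live until its last consumer has run; outputs with no
    consumers (graph outputs) stay live until the end.
    """
    # Count how many reads each produced tensor still has ahead of it.
    remaining = {op: 0 for op in order}
    for op in order:
        for src in inputs[op]:
            remaining[src] += 1

    live = {}
    peak = 0
    for op in order:
        # While `op` runs, its inputs and its output are all resident.
        live[op] = out_size[op]
        peak = max(peak, sum(live.values()))
        for src in inputs[op]:
            remaining[src] -= 1
            if remaining[src] == 0:
                del live[src]  # last consumer done, tensor can be freed
    return peak


# Two branches (big producer -> small reducer) that meet at a join.
inputs = {"a1": [], "r1": ["a1"], "a2": [], "r2": ["a2"], "join": ["r1", "r2"]}
out_size = {"a1": 4, "r1": 1, "a2": 4, "r2": 1, "join": 1}

branch_first = ["a1", "r1", "a2", "r2", "join"]   # finish one branch, then the other
producers_first = ["a1", "a2", "r1", "r2", "join"]  # run both big producers up front

capacity = 8  # hypothetical capacity in the same arbitrary units
for name, order in [("branch_first", branch_first),
                    ("producers_first", producers_first)]:
    peak = peak_for_order(order, inputs, out_size)
    verdict = "fits" if peak <= capacity else "spills"
    print(f"{name}: peak = {peak} -> {verdict}")
```

Same graph, same tensors, and one order stays under the capacity while the other does not, purely because both large intermediates are alive at the same time.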
It does this with something called "Priority BFS". It walks the graph, measures the peak VTCM that order would require, and then peak_tcm -s it. It then returns SMALL or LARGE, indicating no spill or spill respectively. The fill/spill decision is therefore the outcome of whether the best order it found keeps the peak working set under capacity. The ordering metric underneath is an op's position in a depth first topological sort of the graph, which has a nice...