Extract More Kernel Performance with NVIDIA CompileIQ Auto-Tuning

May 26, 2026

By Aditya Srikanth, Pedro Torruella, Jonathan Bentz and Tony Scudiero

AI-Generated Summary

NVIDIA CompileIQ is an AI-driven compiler auto-tuning framework integrated in CUDA 13.3 that uses evolutionary and genetic algorithms to optimize internal compiler parameters for specific GPU workloads, surpassing default heuristics in performance tuning.

It targets critical kernel hotspots in workloads like LLM inference, where small code sections dominate compute time, enabling fractional performance gains to yield significant overall throughput improvements.

CompileIQ supports multi-objective optimization balancing runtime, compile time, and power consumption, producing Pareto-optimal compiler configurations that are reproducible, portable, and secure for production use in AI and HPC environments.

NVIDIA CompileIQ tackles one of the hardest problems in performance engineering: finding the compiler options that unlock the best performance for a specific workload.

Consider a team that has spent weeks optimizing an LLM inference pipeline on GPUs: tuning batch sizes, quantizing to FP8, adopting flash attention, and fusing every kernel they can. The profiler says there’s nothing left to squeeze.

But what if you could turn the compiler itself into a tunable parameter? Now you can. The release of NVIDIA CUDA 13.3 includes CompileIQ, an AI-powered compiler auto-tuning framework that uses evolutionary and genetic algorithms to optimize NVIDIA general-purpose GPU compilers for individual workloads.

NVIDIA GPU compilers apply the same default heuristics (register allocation strategies, instruction scheduling decisions, loop unrolling thresholds, etc.) to every kernel they compile. These heuristics are engineered to produce good results across a vast range of workloads. But “good across the board” and “optimal for your workload” are two very different things.

The competitive landscape in AI infrastructure has made this gap impossible to ignore. Teams building custom CUDA, Triton, and Helion kernels are striving for every percentage point of throughput. Until now, there hasn’t been a way to fine-tune code generation for a specific workload.

The 90% problem and the opportunity

To understand why compiler-level optimization matters so much, consider where GPU compute actually goes in modern LLM inference.

In inference for attention-based models, GEMMs in the linear layers of the FFN/MLP blocks, plus the Q, K, V, and output projections, account for approximately 70% of total FLOPs. Scaled dot-product attention, including its fused and flash attention variants, accounts for another 25%. Together, these two kernel families represent more than 90% of end-to-end inference compute.

This is not unique to AI inference. Many applications and algorithms spend a large portion of their compute time in relatively small portions of the code, so those small sections have an outsized influence on application performance. As a result, even a fraction-of-a-percent improvement in those sections carries through almost directly to overall application performance.
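The effect can be made concrete with Amdahl's law. The sketch below is illustrative arithmetic, not a measurement. The 0.95 hot fraction reuses the 70% and 25% FLOP shares quoted above as a stand-in for runtime share, which is an assumption because FLOP share and time share need not match. The kernel speedups are hypothetical.

```python
def overall_speedup(hot_fraction: float, hot_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when only the hot fraction gets faster."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be in [0, 1]")
    if hot_speedup <= 0.0:
        raise ValueError("hot_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)


# Assumption: the hot kernels take 95% of runtime (70% + 25% from the text,
# treating FLOP share as a proxy for time share).
HOT_FRACTION = 0.95

# Hypothetical kernel-level speedups, chosen only for illustration.
for kernel_speedup in (1.01, 1.05, 1.10):
    total = overall_speedup(HOT_FRACTION, kernel_speedup)
    print(f"kernel {kernel_speedup:.2f}x -> application {total:.4f}x")
```

When the hot fraction is close to 1, the application-level speedup tracks the kernel-level speedup closely, which is why small kernel gains are worth pursuing in these workloads.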

Introducing CompileIQ

CompileIQ is an AI-powered compiler auto-tuning framework that uses evolutionary and genetic algorithms to optimize NVIDIA GPU compilers for individual workloads. Instead of accepting one generic compiler configuration for all workloads, CompileIQ flips the script, generating specialized compiler configurations tailored to each of your most critical kernels.

Under the hood, CompileIQ explores a rich space of internal compiler parameters that aren’t exposed through any public compiler flag: register allocation strategies, instruction scheduling policies, loop transformations, and more. The output is an advanced controls file (ACF) that the compiler ingests via the --apply-controls flag, producing a kernel binary optimized specifically for your workload.

Think of it this way: Your compiler already has the capability to generate better code for your kernel. It just doesn’t know which combination of internal settings will get there. CompileIQ’s evolutionary search finds that combination automatically.
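To give a feel for what an evolutionary search over compiler settings looks like, here is a minimal, generic genetic-algorithm sketch. It does not use the CompileIQ API and makes no claim about how CompileIQ is implemented. The knob names in `SEARCH_SPACE`, the `toy_runtime_ms` cost function, and every other name are hypothetical, and a real tuner would compile and benchmark the kernel instead of evaluating a synthetic formula.

```python
import random

# Hypothetical knobs for illustration only; CompileIQ's real internal
# parameters are not public and are not modeled here.
SEARCH_SPACE = {
    "unroll_factor": [1, 2, 4, 8],
    "max_registers": [64, 96, 128, 168, 255],
    "schedule_policy": ["latency", "throughput", "balanced"],
}


def toy_runtime_ms(config: dict) -> float:
    """Stand-in for 'compile the kernel with this config and time it'.

    This synthetic cost has a single best combination so that the search
    has something to find.
    """
    cost = 10.0
    cost += abs(config["unroll_factor"] - 4) * 0.3
    cost += abs(config["max_registers"] - 128) * 0.01
    cost += {"latency": 0.5, "throughput": 0.0, "balanced": 0.2}[config["schedule_policy"]]
    return cost


def random_config(rng):
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Child takes each knob from one of the two parents at random."""
    return {name: rng.choice((a[name], b[name])) for name in SEARCH_SPACE}


def mutate(config, rng, rate=0.2):
    """Re-sample each knob with probability `rate`."""
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in config.items()
    }


def evolve(generations=20, population_size=12, elite=4, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_runtime_ms)
        parents = population[:elite]  # keep the fastest configs unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - elite)
        ]
        population = parents + children
    return min(population, key=toy_runtime_ms)


best = evolve()
print(best, f"{toy_runtime_ms(best):.2f} ms")
```

Because the fastest configurations are carried over unchanged each generation, the best cost in the population never gets worse. The expensive part in practice is the fitness evaluation, since every candidate requires a compile and a timed run.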

The team that hit a wall after exhausting every optimization lever they knew now has a new lever with CompileIQ: the compiler itself.

CompileIQ is available and can be installed into your favorite Python environment using pip, as shown in the next section. Leading AI labs are already using it in production for their most performance-critical workloads.

Getting started in 4 steps

CompileIQ is a Python package with a simple...
