TIRx: An Open Compiler Stack for Evolving Frontier ML Kernels
Jun 22, 2026
Apache TVM Community
Today we are introducing TIRx, an open-source, hardware-native DSL and compiler for ML kernels, built on Apache TVM. It targets the part of the AI software stack where fast-moving kernels meet fast-moving hardware: TIRx compiles to GPUs and specialized AI accelerators today and is designed to grow with the generations that follow. The same design serves expert-written kernels, agent-generated kernels, and megakernel systems.
We have been working together with the broader community to provide the following materials at launch:
PyPI wheel and Python frontend. A Python-embedded hardware-native kernel DSL with @T.jit / @T.prim_func style authoring, parser utilities, and Python APIs for constructing TIRx programs.
TIRx kernel library and benchmarks. End-to-end examples covering GEMM, attention-style kernels, and low-precision operators on Blackwell GPUs.
Open course on modern GPU programming. This curated online course was taught as part of the machine learning systems course at Carnegie Mellon University, and uses TIRx to teach students modern GPU programming for machine learning systems.
You can find the following resources:
GitHub: https://github.com/apache/tvm
Documentation: https://tvm.apache.org/docs/tirx/overview.html
PyPI wheel: https://pypi.org/project/apache-tvm/0.25.0/
pip install apache-tvm==0.25.0
Community TIRx kernel library: https://github.com/mlc-ai/tirx-kernels
Modern GPU programming for machine learning systems: https://mlc.ai/modern-gpu-programming-for-mlsys/index.html
Motivation
Kernel DSLs are most effective when they choose the right boundary between the programmer and the machine. For mature kernels and mature hardware, that boundary can be high-level: the compiler hides thread assignment, memory movement, layout details, and instruction selection behind compact tensor or tile abstractions. Triton is the canonical example, and its adoption shows how well this works for established kernel patterns. At the frontier, the same boundary is under more pressure. New instructions, memory spaces, cooperation patterns, and kernel algorithms often appear before a compiler has the built-in machinery to automate them well. When that happens, the parts a high-level compiler would normally hide are exactly the parts an expert still needs to control by hand.
TIRx (pronounced “tier-ex”) responds by choosing a lower and more explicit boundary, organized around three decisions:
Orchestration stays in the hardware-native source. Pipeline structure, synchronization, role assignment, memory placement, and backend intrinsics are the parts that most often need expert control at the frontier, so TIRx keeps them in source rather than behind an abstraction that may not yet model a new feature.
Recurring tile primitives are exposed to the compiler. Execution scope, tensor layout, and tile primitive dispatch let common operations stay reusable, analyzable, and portable across backends, without forcing the whole kernel through a fixed compiler pipeline. The cost of hardware-native control is engineering effort: writing every operation by hand for each kernel and backend is laborious. Exposing recurring operations as tile primitives alleviates this, so authors reuse a dispatched implementation instead of re-writing the same data movement or matrix multiply each time.
New hardware enters as intrinsics first, tile primitives later. A new feature can be used immediately as a native intrinsic — a thin, backend-specific wrapper over a single hardware operation. Once the usage pattern stabilizes across kernels, it can be promoted to a tile primitive: a layout-aware operation that dispatches across scopes, operands, and backends. The core abstraction stays small, and adding an intrinsic for a new feature never breaks existing ones.
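The relationship between intrinsics and tile primitives described above can be sketched with a toy registry in plain Python. This is not the TIRx API: every name here (register_intrinsic, register_tile_primitive, dispatch, the backend and scope labels) is hypothetical, and the "kernels" are just strings. It only illustrates the idea that an intrinsic is a thin, backend-specific entry, that a tile primitive dispatches on operation, backend, and scope, and that registering a new intrinsic leaves existing entries untouched.

```python
from typing import Callable, Dict, Tuple

# Hypothetical toy model, NOT the TIRx API. Strings stand in for emitted code.
_INTRINSICS: Dict[Tuple[str, str], Callable[..., str]] = {}
_TILE_PRIMITIVES: Dict[Tuple[str, str, str], Callable[..., str]] = {}


def register_intrinsic(backend: str, name: str):
    """Register a thin wrapper over one backend-specific operation."""
    def deco(fn):
        _INTRINSICS[(backend, name)] = fn
        return fn
    return deco


def register_tile_primitive(op: str, backend: str, scope: str):
    """Register one implementation of a reusable tile-level operation."""
    def deco(fn):
        _TILE_PRIMITIVES[(op, backend, scope)] = fn
        return fn
    return deco


def call_intrinsic(backend: str, name: str, *args) -> str:
    """Use a backend feature directly, before any tile primitive wraps it."""
    return _INTRINSICS[(backend, name)](*args)


def dispatch(op: str, backend: str, scope: str, *args) -> str:
    """Pick the implementation registered for (op, backend, scope)."""
    try:
        impl = _TILE_PRIMITIVES[(op, backend, scope)]
    except KeyError:
        raise NotImplementedError(
            f"no {op!r} implementation for {backend}/{scope}"
        ) from None
    return impl(*args)


# Step 1: a new hardware feature first appears as an intrinsic.
@register_intrinsic("backend_a", "async_copy")
def _async_copy_a(dst: str, src: str) -> str:
    return f"backend_a.async_copy({dst}, {src})"


# Step 2: once usage stabilizes, a tile primitive is built on top of it.
@register_tile_primitive("copy", "backend_a", "warp")
def _copy_a(dst: str, src: str) -> str:
    return call_intrinsic("backend_a", "async_copy", dst, src)


# Another backend supplies its own implementation of the same primitive.
@register_tile_primitive("copy", "backend_b", "thread")
def _copy_b(dst: str, src: str) -> str:
    return f"backend_b.load_store({dst}, {src})"


print(dispatch("copy", "backend_a", "warp", "smem", "gmem"))
print(dispatch("copy", "backend_b", "thread", "smem", "gmem"))
```

In this sketch a kernel author can call the intrinsic directly while a feature is new and move to the dispatched primitive later without changing the call site's shape. A real compiler would dispatch on layouts and operand types as well, which this toy omits.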
The result is a DSL and compiler stack that can grow with the hardware. This is the core design philosophy behind TIRx: keep the foundation small and explicit, and let the backend library evolve as new accelerator generations arrive.
This places TIRx below systems like TileLang, which also lowers the boundary relative to Triton by exposing memory scopes and pipelining, while still leaving layout inference and thread binding to the compiler. TIRx deliberately leaves those higher-level concerns outside its core and provides a minimal foundation that such systems can build on; we are working with the TileLang community to adopt TIRx as a minimal foundation for TileLang compilation.
The same small, explicit foundation is what lets one design serve several use cases that pursue peak performance with as little engineering effort as possible: expert-written production kernels, agent-generated kernels, and megakernel systems, each of which needs both control at the native level and recurring operations...