Show HN: INT21 – Self-Improving PTX Kernel Factory

antinucleon | 1 pts | 0 comments

Introducing INT21 and PTX Kernel Factory

Today, we are launching INT21: the first company to achieve self-improving AI<br>agent swarms, applied to AI infrastructure.

The Layer Most People Don’t See

Most conversations about AI progress focus on models. But every model depends<br>on a less visible layer: the software that tells GPUs how to perform each<br>operation.

That software has an outsized effect on speed and cost. It is also difficult to<br>build. Reaching the best performance on a new GPU often requires specialists<br>who understand both the algorithm and the hardware in extraordinary depth.

INT21 exists to make that work more scalable. We call this new category<br>Self-Improving Compute Infrastructure.

Our First Product: PTX Kernel Factory

PTX Kernel Factory is an AI system that generates and improves software for<br>NVIDIA GPUs. A team defines the operation, the requirements, and the measure of<br>success. The factory writes an implementation, tests it, measures it on the<br>target hardware, learns from the result, and repeats.

The first four implementations produced by PTX Kernel Factory are open source<br>today. The product is also entering beta, with early access available at<br>int21.ai.

For AI workloads running on NVIDIA GPUs, each model operation eventually<br>becomes instructions executed by the hardware. A GPU kernel is the small,<br>specialized program responsible for one such operation, such as normalization,<br>attention, or moving data through memory. Thousands of these kernels run beneath<br>every training job and AI application.

PTX is NVIDIA’s low-level, assembly-like GPU language. It sits between<br>higher-level GPU software and the final machine instructions executed by the<br>hardware, making it one of the closest programmable layers in the NVIDIA stack.

Working at this level gives precise control over how data moves through memory,<br>how threads cooperate, when work is synchronized, and which specialized GPU<br>instructions are used. Those choices can determine whether an expensive GPU<br>spends its time working or waiting.

Very few engineers can write and optimize PTX well. The work requires a rare<br>combination of algorithm knowledge, GPU architecture expertise, numerical<br>rigor, and performance intuition. Higher-level tools, libraries, and compilers<br>make GPU development accessible to far more people, but when a new AI operation<br>has no mature implementation, or when existing abstractions cannot reach the<br>required performance, this scarce low-level expertise becomes a bottleneck.

The work itself is also exceptionally difficult. A kernel can look correct while failing on<br>one rare input. It can be fast for one shape and slow for another. It can use<br>too many registers, move too much data, or perform well on Hopper and regress<br>on Blackwell. Even expert engineers must test many ideas, and most of those<br>ideas do not work. Each hardware generation changes part of the problem.
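To make the "looks correct while failing on one rare input" failure mode concrete, here is a minimal sketch in plain Python rather than GPU code. Everything in it is a hypothetical stand-in, not part of PTX Kernel Factory: `TILE` plays the role of a vectorized load width, `tiled_row_sum_buggy` sums only full tiles and silently drops the tail, and `failing_lengths` compares a candidate against a slow reference over several input lengths.

```python
from typing import Callable, List, Sequence

TILE = 4  # hypothetical tile width, standing in for a vectorized load size


def reference_row_sum(row: Sequence[float]) -> float:
    """Slow but obviously correct reference implementation."""
    return float(sum(row))


def tiled_row_sum_buggy(row: Sequence[float]) -> float:
    """Sums full tiles only and forgets any leftover tail elements."""
    total = 0.0
    full_tiles = len(row) // TILE
    for t in range(full_tiles):
        total += sum(row[t * TILE:(t + 1) * TILE])
    return total


def tiled_row_sum_fixed(row: Sequence[float]) -> float:
    """Same tiling, plus an explicit pass over the tail."""
    tail_start = (len(row) // TILE) * TILE
    return tiled_row_sum_buggy(row) + float(sum(row[tail_start:]))


def failing_lengths(
    candidate: Callable[[Sequence[float]], float],
    lengths: Sequence[int],
    tol: float = 1e-9,
) -> List[int]:
    """Return every tested length where the candidate disagrees with the reference."""
    bad = []
    for n in lengths:
        row = [float(i + 1) for i in range(n)]
        if abs(candidate(row) - reference_row_sum(row)) > tol:
            bad.append(n)
    return bad
```

With only tile-aligned lengths in the test set, the buggy version passes every check. It fails as soon as a length that is not a multiple of four is included, which is why the set of tested shapes matters as much as the comparison itself.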

That combination makes PTX an ideal first proving ground for INT21: it is<br>technically demanding, economically important, and objectively measurable. A<br>generated kernel is correct or it is not. It is faster or it is not.

PTX Kernel Factory turns the expert loop of writing, testing, profiling, and<br>revising low-level GPU code into a process that can run continuously and learn<br>from its results.

How PTX Kernel Factory Works

The interface is intentionally simple:

1. Describe the operation. Define what the kernel needs to do and the inputs<br>it must support.

2. Set the requirements. Provide correctness tests, target hardware, and any<br>integration constraints.

3. Define success. Choose the performance metric the system should optimize.
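As an illustration only, the three inputs above could be captured in a small specification object. The sketch below is an assumption about what such a record might look like, not INT21's actual interface: `KernelSpec`, its field names, and the `latency_us` metric label are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class KernelSpec:
    """Hypothetical job description: operation, requirements, success metric."""

    # Describe the operation and the inputs it must support.
    operation: str
    input_shapes: Tuple[Tuple[int, ...], ...]
    # Set the requirements.
    target_hardware: str
    correctness_tests: Tuple[Callable[..., bool], ...]
    constraints: Tuple[str, ...] = ()
    # Define success.
    metric: str = "latency_us"  # hypothetical metric name
    lower_is_better: bool = True

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the spec is usable."""
        problems = []
        if not self.operation.strip():
            problems.append("operation description is empty")
        if not self.input_shapes:
            problems.append("no input shapes given")
        if not self.target_hardware.strip():
            problems.append("no target hardware given")
        if not self.correctness_tests:
            problems.append("no correctness tests given")
        return problems
```

Refusing to start a search when `validate()` reports problems reflects the division of labor described here: people state the goal and the acceptance criteria, and the automated search only begins once both are explicit.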

From there, the factory runs a long-horizon engineering process. It generates<br>candidate implementations, compiles them, rejects incorrect results, benchmarks<br>valid candidates, and uses the evidence to guide the next round.
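One round of that process can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the factory's implementation: candidates are ordinary Python callables instead of compiled kernels, the compile step is omitted, and `is_correct`, `measure_seconds`, and `run_round` are hypothetical names. The point is the ordering: correctness is checked first, and only valid candidates are timed.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Candidate = Callable[[Sequence[float]], float]


def is_correct(
    candidate: Candidate,
    reference: Candidate,
    test_inputs: Sequence[Sequence[float]],
    tol: float = 1e-9,
) -> bool:
    """A candidate is valid only if it matches the reference on every test input."""
    for row in test_inputs:
        try:
            if abs(candidate(row) - reference(row)) > tol:
                return False
        except Exception:
            return False  # a crash counts as an incorrect result
    return True


def measure_seconds(candidate: Candidate, bench_input: Sequence[float], repeats: int = 5) -> float:
    """Best-of-N wall-clock time; a stand-in for on-device benchmarking."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        candidate(bench_input)
        best = min(best, time.perf_counter() - start)
    return best


def run_round(
    candidates: Dict[str, Candidate],
    reference: Candidate,
    test_inputs: Sequence[Sequence[float]],
    bench_input: Sequence[float],
    measure: Callable[[Candidate, Sequence[float]], float] = measure_seconds,
) -> Tuple[Optional[str], Dict[str, float], List[str]]:
    """Reject incorrect candidates, time the rest, and return the evidence."""
    timings: Dict[str, float] = {}
    rejected: List[str] = []
    for name, candidate in candidates.items():
        if not is_correct(candidate, reference, test_inputs):
            rejected.append(name)
            continue
        timings[name] = measure(candidate, bench_input)
    best = min(timings, key=timings.get) if timings else None
    return best, timings, rejected
```

The returned timings and rejection list are the "evidence" in this sketch. How that evidence shapes the next batch of candidates is the part this toy leaves out.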

The released implementations combine CUDA C++ with inline PTX, giving the<br>system control over hardware details that higher-level tools may intentionally<br>hide. Rather than relying on a single agent to produce a one-shot answer, PTX<br>Kernel Factory coordinates multiple AI agents across this loop.

Human engineers still define the goal, constraints, and acceptance criteria.<br>PTX Kernel Factory automates the expensive search between a clear specification<br>and a strong implementation.

What We Mean by Self-Improving

A coding agent can produce an answer. A reliable engineering system also needs<br>to determine whether the answer works, understand why an attempt failed, and<br>carry useful knowledge into the...
