Show HN: INT21 – Self-Improving PTX Kernel Factory

Introducing INT21 and PTX Kernel Factory

Today, we are launching INT21: the first company to achieve self-improving AI agent swarms, applied to AI infrastructure.

The Layer Most People Don’t See

Most conversations about AI progress focus on models. But every model depends on a less visible layer: the software that tells GPUs how to perform each operation.

That software has an outsized effect on speed and cost. It is also difficult to build. Reaching the best performance on a new GPU often requires specialists who understand both the algorithm and the hardware in extraordinary depth.

INT21 exists to make that work more scalable. We call this new category Self-Improving Compute Infrastructure.

Our First Product: PTX Kernel Factory

PTX Kernel Factory is an AI system that generates and improves software for NVIDIA GPUs. A team defines the operation, the requirements, and the measure of success. The factory writes an implementation, tests it, measures it on the target hardware, learns from the result, and repeats.

The first four implementations produced by PTX Kernel Factory are open source today. The product is also entering beta, with early access available at int21.ai.

For AI workloads running on NVIDIA GPUs, each model operation eventually becomes instructions executed by the hardware. A GPU kernel is the small, specialized program responsible for one such operation, such as normalization, attention, or moving data through memory. Thousands of these kernels run beneath every training job and AI application.

PTX is NVIDIA’s low-level, assembly-like GPU language. It sits between higher-level GPU software and the final machine instructions executed by the hardware, making it one of the programmable layers closest to the hardware in the NVIDIA stack.

Working at this level gives precise control over how data moves through memory, how threads cooperate, when work is synchronized, and which specialized GPU instructions are used. Those choices can determine whether an expensive GPU spends its time working or waiting.

Very few engineers can write and optimize PTX well. The work requires a rare combination of algorithm knowledge, GPU architecture expertise, numerical rigor, and performance intuition. Higher-level tools, libraries, and compilers make GPU development accessible to far more people, but when a new AI operation has no mature implementation, or when existing abstractions cannot reach the required performance, this scarce low-level expertise becomes a bottleneck.

Writing PTX is also exceptionally difficult. A kernel can look correct while failing on one rare input. It can be fast for one shape and slow for another. It can use too many registers, move too much data, or perform well on Hopper and regress on Blackwell. Even expert engineers must test many ideas, and most of those ideas do not work. Each hardware generation changes part of the problem.
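To make the "looks correct but fails on one rare input" problem concrete, here is a small illustration in plain Python rather than PTX. It is not drawn from INT21's kernels: softmax is used only because it is a familiar building block of attention, and the function names are hypothetical. The naive version agrees with the numerically stable one on typical inputs, then fails once an input is large enough to overflow the exponential.

```python
import math

def softmax_naive(xs):
    """Textbook softmax: exponentiate, then normalize."""
    exps = [math.exp(x) for x in xs]  # overflows for large x
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    """Same math, but subtracts the max first so exp() never sees a large value."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def agrees(xs, tol=1e-12):
    """Return True if the naive version matches the stable one on this input."""
    try:
        got = softmax_naive(xs)
    except OverflowError:
        return False  # the naive version could not produce a result at all
    want = softmax_stable(xs)
    return all(abs(g - w) <= tol for g, w in zip(got, want))

typical_input = [0.5, -1.2, 3.0]
rare_input = [1000.0, 999.0]

if __name__ == "__main__":
    print("typical input agrees:", agrees(typical_input))
    print("rare input agrees:", agrees(rare_input))
```

A test suite that only samples small values would pass the naive version, which is why correctness tests need to cover extreme inputs as well as typical ones.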

That combination makes PTX an ideal first proving ground for INT21: it is technically demanding, economically important, and objectively measurable. A generated kernel is correct or it is not. It is faster or it is not.

PTX Kernel Factory turns the expert loop of writing, testing, profiling, and revising low-level GPU code into a process that can run continuously and learn from its results.

How PTX Kernel Factory Works

The interface is intentionally simple:

Describe the operation. Define what the kernel needs to do and the inputs it must support.

Set the requirements. Provide correctness tests, target hardware, and any integration constraints.

Define success. Choose the performance metric the system should optimize.
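One way to picture the three inputs above is as a single request object. The sketch below is purely illustrative: `KernelSpec`, its field names, the example values, and the `rms_norm` reference function are all hypothetical and are not INT21's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

@dataclass(frozen=True)
class KernelSpec:
    """Hypothetical request format; none of these field names come from INT21."""
    operation: str                      # describe the operation
    input_shapes: Sequence[tuple]       # inputs the kernel must support
    reference: Callable[[list], list]   # trusted implementation for correctness tests
    target_hardware: str                # hardware the kernel is measured on
    constraints: dict = field(default_factory=dict)  # integration constraints
    metric: str = "latency_us"          # the performance metric to optimize

def rms_norm(xs, eps=1e-6):
    """Plain-Python reference for a normalization op (assumed example)."""
    scale = (sum(x * x for x in xs) / len(xs) + eps) ** -0.5
    return [x * scale for x in xs]

spec = KernelSpec(
    operation="rms_norm",
    input_shapes=[(1, 4096), (8, 4096)],
    reference=rms_norm,
    target_hardware="hopper",
    constraints={"dtype": "fp16"},
    metric="latency_us",
)
```

Keeping the reference implementation inside the request means every candidate can be checked against the same ground truth before its speed is ever considered.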

From there, the factory runs a long-horizon engineering process. It generates candidate implementations, compiles them, rejects incorrect results, benchmarks valid candidates, and uses the evidence to guide the next round.
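The loop described above can be sketched as a toy in plain Python. This is an assumption-laden illustration, not how PTX Kernel Factory is implemented: candidates are Python source strings instead of CUDA or PTX, "compiling" is Python's built-in `compile`, and the search is a simple neighborhood search over a hypothetical unroll factor. All function names are invented for the example.

```python
import math
import time

# Candidate template: sums squares, unrolled by {unroll}; {skip}=1 injects a bug.
TEMPLATE = """
def kernel(xs):
    total = 0.0
    n = len(xs) - {skip}
    i = 0
    while i + {unroll} <= n:
        total += {body}
        i += {unroll}
    while i < n:
        total += xs[i] * xs[i]
        i += 1
    return total
"""

def reference(xs):
    """Trusted result used to judge every candidate."""
    return sum(x * x for x in xs)

def build(unroll, skip):
    """Generate and compile one candidate; return None if it does not compile."""
    body = " + ".join(f"xs[i + {j}] * xs[i + {j}]" for j in range(unroll))
    src = TEMPLATE.format(unroll=unroll, skip=skip, body=body)
    try:
        code = compile(src, "<candidate>", "exec")
    except SyntaxError:  # unroll == 0 yields an empty expression
        return None
    namespace = {}
    exec(code, namespace)
    return namespace["kernel"]

def is_correct(kernel, tests):
    return all(math.isclose(kernel(xs), reference(xs), rel_tol=1e-9) for xs in tests)

def benchmark(kernel, xs, repeats=5):
    """Best-of-N wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(xs)
        best = min(best, time.perf_counter() - start)
    return best

def propose(center):
    """Next round's candidates around the best unroll so far (one is buggy)."""
    return [(center, 0), (center * 2, 0), (center * 2, 1), (center // 2, 0)]

def search(rounds=3):
    tests = [[float(k % 7 + 1) for k in range(n)] for n in (1, 7, 64, 1000)]
    best = None   # (seconds, unroll, kernel)
    center = 1
    log = []
    for _ in range(rounds):
        for unroll, skip in propose(center):
            kernel = build(unroll, skip)
            if kernel is None:
                log.append((unroll, skip, "compile_error"))
                continue
            if not is_correct(kernel, tests):  # reject before measuring speed
                log.append((unroll, skip, "wrong_result"))
                continue
            seconds = benchmark(kernel, tests[-1])
            log.append((unroll, skip, "ok"))
            if best is None or seconds < best[0]:
                best = (seconds, unroll, kernel)
        if best is not None:
            center = best[1]  # evidence from this round guides the next
    return best, log

if __name__ == "__main__":
    best, log = search()
    print("best unroll:", best[1])
    print("rejected:", [entry for entry in log if entry[2] != "ok"])
```

The ordering is the point of the sketch: a candidate is only benchmarked after it compiles and passes every correctness test, so a fast but wrong candidate can never become the best result.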

The released implementations combine CUDA C++ with inline PTX, giving the system control over hardware details that higher-level tools may intentionally hide. Rather than relying on a single agent to produce a one-shot answer, PTX Kernel Factory coordinates multiple AI agents across this loop.

Human engineers still define the goal, constraints, and acceptance criteria. PTX Kernel Factory automates the expensive search between a clear specification and a strong implementation.

What We Mean by Self-Improving

A coding agent can produce an answer. A reliable engineering system also needs to determine whether the answer works, understand why an attempt failed, and carry useful knowledge into the...
