Toward Better HIP Kernel Generation for AMD GPUs


Toward Better HIP Kernel Generation for AMD GPUs: Synthetic Data, Multi-Agent Search, and Reinforcement Learning | Scaling Intelligence Lab at Stanford University


Toward Better HIP Kernel Generation for AMD GPUs: Synthetic Data, Multi-Agent Search, and Reinforcement Learning

Laasya Konidala*

Stanford

Natalia Pahlavan*

Stanford

Annmaria Antony*

Stanford

Simon Guo

Stanford

Azalia Mirhoseini

Stanford

TLDR

In this work, we explore how to make language models better at generating high-performance HIP kernels for AMD GPUs. We present the following:

A synthetic dataset of 500 new PyTorch reference tasks using mutation, composition, and constraint-based generation to cover a broader range of workloads.

A multi-agent optimization pipeline for HIP kernel generation. Instead of relying on single-shot prompting, we use specialized agents for task generation, PyTorch-to-HIP translation, hardware evaluation, and evolutionary optimization to search for faster kernels.

A framework based on small, low-cost open-source models using SFT followed by GRPO RL. While SFT helped the model learn correct HIP patterns, RL pushed performance further by directly rewarding correctness and speedup on MI350X GPUs.
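To make the reward idea concrete, here is a minimal sketch of how correctness and speedup could be folded into one scalar and then turned into group-relative advantages, the normalization step GRPO applies to a group of sampled completions. The function names, the reward weights (0.0, 0.2, 1.0) and the speedup cap are illustrative assumptions, not the values used in this work.

```python
import statistics
from typing import List


def kernel_reward(compiled: bool, correct: bool, speedup: float,
                  cap: float = 4.0) -> float:
    """Hypothetical reward shaping; all constants are assumptions.

    speedup is reference_runtime / kernel_runtime, so 1.0 means parity.
    """
    if not compiled:
        return 0.0            # nothing to run
    if not correct:
        return 0.2            # small credit for code that at least builds
    # Correct kernels earn a base reward plus a capped speedup bonus.
    return 1.0 + min(speedup, cap)


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Capping the bonus is one way to keep a single very fast sample from dominating the group statistics; whether a cap helps in practice is an open design choice.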

Our results showed improvements in both compilation and correctness rates across all KernelBench levels, with RL providing the strongest gains. However, achieving meaningful speedup over PyTorch still requires much deeper hardware awareness and optimization reasoning. Next, we plan to integrate the ROCm profiler so that training rewards can draw on hardware profiler feedback.

Motivation

The performance of every modern AI workload is bottlenecked by kernel quality. Writing high-performance kernels requires deep familiarity with hardware, low-level languages, and optimization techniques that are critically scarce outside NVIDIA’s CUDA ecosystem.

AMD’s HIP is a good example of this deficit. It’s a compiler-verified, low-level programming language with comparatively little open-source training data, yet it targets accelerators that are increasingly present in production AI clusters. This asymmetry can be observed empirically: SOTA LLMs generally produce fluent CUDA, but when generating HIP they may hallucinate APIs or emit kernels that look plausible yet fail at compile time or under multi-seed correctness checks.

Approach

We investigate three complementary ideas: (1) expanding the task space with synthetic PyTorch workloads, (2) optimizing kernels through multi-agent evolutionary search, and (3) training a small, low-cost open-source model (Qwen2.5-Coder-14B-Instruct) with SFT followed by GRPO-based RL. We measure all approaches on kernel compilation, correctness, and runtime performance using KernelBench extended to AMD MI350X GPUs (Ouyang et al., 2025).
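As a rough illustration of the three measurements named above, the sketch below aggregates per-task outcomes into a compilation rate, a correctness rate, and a geometric-mean speedup over the PyTorch reference. The `KernelResult` record and `summarize` helper are hypothetical names, and the choice of geometric mean is an assumption made for the example rather than the metric reported in this work.

```python
from dataclasses import dataclass
from math import exp, log
from typing import Dict, List, Optional


@dataclass
class KernelResult:
    """Hypothetical per-task record; field names are assumptions."""
    compiled: bool
    correct: bool
    ref_ms: Optional[float] = None     # PyTorch reference runtime
    kernel_ms: Optional[float] = None  # generated HIP kernel runtime


def summarize(results: List[KernelResult]) -> Dict[str, Optional[float]]:
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    # Only kernels that compiled, passed correctness, and were timed
    # contribute to the speedup figure.
    speedups = [
        r.ref_ms / r.kernel_ms
        for r in results
        if r.compiled and r.correct and r.ref_ms and r.kernel_ms
    ]
    geomean = exp(sum(log(s) for s in speedups) / len(speedups)) if speedups else None
    return {
        "compile_rate": sum(r.compiled for r in results) / n,
        "correct_rate": sum(r.compiled and r.correct for r in results) / n,
        "geomean_speedup": geomean,
    }
```

A speedup below 1.0 means the generated kernel is slower than the PyTorch reference, which is why correctness rates and speedup are worth reporting separately.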

Our approach is as follows:

1. Synthetic Data Generation

We generate a corpus of verified HIP kernels paired with PyTorch references using a multi-agent pipeline with Gemini-2.5-Flash. The pipeline has eight cooperating agents:

Task Generator: Wraps a PyTorch reference into a structured task and synthesizes new reference modules via mutation*, composition*, and constraint-based generation*, with each synthesized module sanity-checked before entering the pipeline.

Translator: Produces the first working HIP kernel from the PyTorch reference, retrying with the verifier’s error and the previous attempt fed back into the prompt. For each synthetic data task, the agent produced a correct kernel within five attempts.

Correctness Verifier: The deterministic correctness gate that rejects shortcut patterns and runs the candidate against the PyTorch reference across multiple seeds.

Evolutionary Optimizer: Iteratively samples new candidates conditioned on three inputs: the most similar prior verified kernels (following Lange et al., 2025), the current best kernel, and a memory of recent failures. It keeps the fastest correct kernel as the seed for the next generation.

Plausibility Screener: An LLM-based reviewer that scores each candidate on compilation and plausibility so only promising kernels reach the GPU.

Hardware Evaluator: Compiles each surviving candidate on AMD MI350X GPUs, checks correctness against the PyTorch reference across multiple seeds, and measures runtime.

Archive Manager: Persists every candidate with its labels, scores, and runtimes to a per-task archive and emits SFT and RL training records for downstream post-training.

Offline Auditors: A paired generator and auditor that run curated correct, broken, and deceptive test cases through both verifiers and report each verifier’s false positives and false negatives against their expected labels.
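The control flow of the Translator, Correctness Verifier, and Evolutionary Optimizer can be sketched in a few lines. This is a toy model under stated assumptions: kernels are stood in for by plain Python callables on lists of floats, `propose` and `mutate` stand in for LLM calls, `measure_ms` stands in for timing on hardware, and every function name here is hypothetical. The similarity retrieval, failure memory, shortcut-pattern rejection, and plausibility screening described above are omitted.

```python
import random
from typing import Callable, List, Optional, Tuple

Kernel = Callable[[List[float]], List[float]]


def verify_multi_seed(candidate: Kernel, reference: Kernel,
                      seeds=(0, 1, 2), n: int = 64,
                      tol: float = 1e-6) -> Tuple[bool, str]:
    """Compare candidate and reference on random inputs from several seeds."""
    for seed in seeds:
        rng = random.Random(seed)
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        want = reference(x)
        try:
            got = candidate(x)
        except Exception as exc:  # a crash counts as a failed check
            return False, f"seed {seed}: raised {exc!r}"
        if len(got) != len(want) or any(abs(g - w) > tol for g, w in zip(got, want)):
            return False, f"seed {seed}: output mismatch"
    return True, "ok"


def translate_with_retries(propose, reference: Kernel,
                           max_attempts: int = 5) -> Optional[Kernel]:
    """Retry translation, feeding back the verifier error and prior attempt."""
    feedback, previous = None, None
    for _ in range(max_attempts):
        candidate = propose(feedback, previous)
        ok, feedback = verify_multi_seed(candidate, reference)
        if ok:
            return candidate
        previous = candidate
    return None


def evolve(seed_kernel: Kernel, mutate, measure_ms, reference: Kernel,
           generations: int = 3, population: int = 4) -> Tuple[Kernel, float]:
    """Keep the fastest correct kernel as the parent of the next generation."""
    best, best_ms = seed_kernel, measure_ms(seed_kernel)
    for _ in range(generations):
        parent = best  # all children in a generation share one parent
        for _ in range(population):
            child = mutate(parent)
            ok, _ = verify_multi_seed(child, reference)
            if not ok:
                continue  # fast but wrong candidates never become the seed
            ms = measure_ms(child)
            if ms < best_ms:
                best, best_ms = child, ms
    return best, best_ms
```

Running correctness before timing mirrors the ordering in the list above: a candidate is only compared on runtime once it has passed the multi-seed check.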

*Modes of Task Generation

Mutation : We take a subset of existing KernelBench problems and ask the model to generate...
