Agentic Memory Management for GPU Code Generation


This post is part of the AI-Driven Research for Systems (ADRS) blog series, where we explore how AI can be applied to systems research. We feature exciting work from Makora this week!

In this blog post, we examine the problem of balancing memory and other knowledge sources for GPU kernel generation agents. Memory helps GPU kernel agents only when it saves more search than it costs in context. Search discovers useful coding patterns, and memory prevents rediscovering them.

✍️ Previous Blogs: https://ucbskyadrs.github.io/

📝 ADRS Paper: https://arxiv.org/abs/2510.06189

👩‍💻 ADRS Code: github.com/UCB-ADRS/ADRS

💬 Join us: join.slack.com/t and Discord

Introduction

Everyone agrees that agents need memory, but the harder question is when they should ignore it. In most agentic systems, memory is treated as long-term recall: store prior experience, retrieve relevant pieces later, and use them to improve the next action. That framing is incomplete for optimization agents. GPU kernel generation is an iterative search problem: each candidate must be generated, compiled, checked for correctness, benchmarked, and often profiled before the agent knows whether it improved the objective. In principle, this loop can be extended with more inference-time compute, and recent work on kernel generation reports gains from such scaling. In practice, however, each additional iteration consumes model calls, compiler time, benchmark time, and accelerator time, so the agent operates under a practical compute/search/context budget [1,2]. The current candidate code, the latest compiler error, profiler output, retrieved documentation, prior kernels, and run-local notes all compete for the same limited context. Adding memory is not free: the wrong memory can crowd out the local evidence that the agent needs to make the next kernel better.

This suggests a different view. For GPU code generation, memory looks less like a notebook and more like a cache. A useful retrieval is a cache hit: it avoids recomputing or rediscovering information that would otherwise cost additional search. Stale entries, low-utility items, and prompt clutter are the corresponding failure modes.
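To make the cache analogy concrete, here is a minimal sketch of admitting items into a fixed context budget by estimated search saved per token. The `MemoryItem` fields, the `select_working_set` helper, the greedy policy, and all numbers are illustrative assumptions, not a description of MakoraGenerate's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    name: str
    tokens: int              # context cost of including this item (must be > 0)
    expected_saving: float   # hypothetical estimate of search steps saved


def select_working_set(items, token_budget):
    """Greedily admit the items with the best saving per token that still fit."""
    ranked = sorted(items, key=lambda m: m.expected_saving / m.tokens, reverse=True)
    chosen, used = [], 0
    for item in ranked:
        if item.expected_saving <= 0:
            continue  # stale or low-utility entries never enter the working set
        if used + item.tokens <= token_budget:
            chosen.append(item)
            used += item.tokens
    return chosen


# Illustrative values only; real costs and savings would have to be measured.
items = [
    MemoryItem("latest_compiler_error", tokens=200, expected_saving=3.0),
    MemoryItem("current_candidate_code", tokens=1500, expected_saving=5.0),
    MemoryItem("stale_prior_kernel", tokens=2000, expected_saving=0.0),
    MemoryItem("retrieved_api_docs", tokens=1200, expected_saving=1.0),
]
working_set = select_working_set(items, token_budget=2000)
print([m.name for m in working_set])
```

With these made-up numbers, the compiler error and the current candidate fill the budget, and the retrieved documentation is left out because it no longer fits. That is the crowding-out effect in miniature: an item can be useful and still not belong in the working set at this step.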

Using MakoraGenerate, our GPU kernel generation agent, we argue that cache-style context management is especially important for optimization agents, which face hard per-step budget pressure and whose progress can be evaluated against a verifiable objective. The key question is not how much memory an agent can access. It is what belongs in the agent's working set (or "cache") at each step.

Kernel generation as an agentic search problem

Some parts of systems research are surprisingly amenable to automation. When the problem comes with a verifier that exactly represents the optimization objective (such as runtime, throughput, latency, or cost), an agent can propose candidates, evaluate them, and search its way toward better ones. Pair an LLM with this loop and let iteration do the work. That recipe is the basis of AI-Driven Research for Systems (ADRS).

MakoraGenerate is our instantiation of this recipe for GPU kernel optimization. It is a configurable multi-agent evolutionary system that generates, compiles, validates, and benchmarks kernels across NVIDIA GPUs, AMD GPUs, TPUs, and NPUs. At each step, an LLM proposes a candidate kernel. The system then checks correctness against a PyTorch reference implementation on identical, randomly seeded inputs across multiple trials, requiring every output tensor to match the reference in dtype, shape, and value within configurable absolute/relative tolerances. Next, the agent profiles runtime, and the measured speedup over a baseline serves as the reward. The agent maintains a ranked population and uses diversity-based selection to inherit effective patterns while avoiding premature convergence.
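The correctness gate and reward described above can be sketched in plain Python. This is a sketch under stated assumptions: the `Tensor` stand-in, the function names, the default tolerances, the `|out - ref| <= atol + rtol * |ref|` comparison rule, and the choice to return zero reward for incorrect candidates are all illustrative, not MakoraGenerate's actual code.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    # Hypothetical stand-in for a real tensor type.
    dtype: str
    shape: tuple
    values: list  # flattened elements


def matches_reference(out, ref, atol=1e-5, rtol=1e-3):
    """True if dtype, shape, and every value agree within tolerance."""
    if out.dtype != ref.dtype or out.shape != ref.shape:
        return False
    if len(out.values) != len(ref.values):
        return False
    # Assumed rule: absolute tolerance plus tolerance relative to the reference.
    return all(abs(o - r) <= atol + rtol * abs(r)
               for o, r in zip(out.values, ref.values))


def reward(outputs_per_trial, references_per_trial, baseline_ms, candidate_ms):
    """Speedup over the baseline if every output in every trial matches, else 0.0."""
    for outs, refs in zip(outputs_per_trial, references_per_trial):
        if len(outs) != len(refs):
            return 0.0
        if not all(matches_reference(o, r) for o, r in zip(outs, refs)):
            return 0.0
    return baseline_ms / candidate_ms


ref = Tensor("float32", (2,), [1.0, 2.0])
close = Tensor("float32", (2,), [1.0, 2.0005])      # within tolerance
wrong_dtype = Tensor("float16", (2,), [1.0, 2.0])   # same values, wrong dtype

print(reward([[close]], [[ref]], baseline_ms=4.0, candidate_ms=2.0))
print(reward([[wrong_dtype]], [[ref]], baseline_ms=4.0, candidate_ms=2.0))
```

Gating the reward on correctness keeps the verifier aligned with the objective: in this sketch a fast but wrong kernel earns nothing, so the search cannot trade accuracy for speed.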

The baseline already has some memory: the search state itself. Past candidates and recent failures shape the next generation. But this is within-run artifact memory. It carries continuity inside a single search; it does not accumulate reusable knowledge across kernels, operators, or hardware backends. Adding richer memory is what we ablate next.
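As a rough illustration of within-run artifact memory, the sketch below keeps a ranked list of past candidates and a bounded window of recent failures, then exposes both as context for the next generation. The `SearchState` class, its fields, and the `top_k` and `max_failures` limits are hypothetical, and the diversity-based selection the real system uses is omitted for brevity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SearchState:
    top_k: int = 2          # how many ranked candidates to carry forward
    max_failures: int = 2   # how many recent failures to keep
    candidates: list = field(default_factory=list)   # (speedup, source) pairs
    failures: deque = field(default_factory=deque)

    def record_success(self, source, speedup):
        self.candidates.append((speedup, source))
        # Keep the population ranked by measured speedup, best first.
        self.candidates.sort(key=lambda c: c[0], reverse=True)

    def record_failure(self, message):
        self.failures.append(message)
        while len(self.failures) > self.max_failures:
            self.failures.popleft()  # the oldest failure falls out of the window

    def prompt_context(self):
        """Artifacts from this run that shape the next generation."""
        return {
            "parents": [src for _, src in self.candidates[: self.top_k]],
            "recent_failures": list(self.failures),
        }


state = SearchState()
state.record_success("kernel_a", speedup=1.2)
state.record_success("kernel_b", speedup=2.0)
state.record_success("kernel_c", speedup=0.9)
for err in ["error_1", "error_2", "error_3"]:
    state.record_failure(err)
print(state.prompt_context())
```

Nothing in this state outlives the run: it is created fresh for each search, which is exactly why it cannot carry reusable knowledge across kernels, operators, or hardware backends.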


Knowledge sources

Each memory type we use plays a distinct role in the search.

Retrieval memory consists of external references such as CUDA, CUTLASS, and CuTe DSL documentation. This matters most on stacks that are new or evolving, where the relevant API, layout constraint, intrinsic, or hardware-specific behavior may be missing from or outdated in the model's pre-training data. Retrieval memory helps the agent enter the valid search space by answering what APIs exist, what constraints apply, and why a compiler is rejecting the current attempt.

Optimization memory consists of reusable expert rules and code patterns drawn from examples, GitHub snippets, prior kernels, and curated...
