Show HN: Auto GPU Kernel – Autonomous GPU-kernel discovery and optimizer

GitHub - Dogacel/auto-gpu-kernel: Winner 🏆 (Agent-only) MLSys 2026 - FlashInfer AI Kernel Generation Contest for the DeepSeek Sparse Attention (DSA) track with an average speedup of 34.93x · GitHub


dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps64

dsa_topk_indexer_fp8_h64_d128_topk2048_ps64

template

.gitignore

README.md

report.pdf


Auto GPU Kernel 🏆

Autonomous GPU-kernel discovery & optimizer.

Technical Report

Ranked #1 on MLSys 2026 - FlashInfer AI Kernel Generation Contest for the DeepSeek Sparse Attention (DSA) track with an average speedup of 34.93x. Submissions can be found at:

| Kernel | Description | Runtime (ms) |
| --- | --- | --- |
| dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps64 | DSA Sparse Attention | 0.010 |
| dsa_topk_indexer_fp8_h64_d128_topk2048_ps64 | DSA TopK Indexer | 0.016 |

Setup

Copy the template directory into a separate folder / git repository to make sure your agents work in an isolated environment.

The kernel agent is compatible with the FlashInfer format and can run in the cloud on Modal, without a local GPU. It requires the Claude Code CLI.

# Python env
conda create -n fi-bench python=3.12
conda activate fi-bench
pip install flashinfer-bench modal

# One-time environment setup
modal setup
modal volume create flashinfer-trace
modal volume put flashinfer-trace /path/to/flashinfer-trace/

To get started, clone the MLSys-2026 Contest Dataset. To change the kernel you are implementing, refer to the FlashInfer-Trace - Bring Your Own Kernel guide.

Important: Make sure you update CLAUDE.md to describe the kernel you are optimizing. The example in template is customized for sparse attention. optimize.md and benchmark.md also have some parameters tuned for sparse attention, such as the number of test cases to run as a sanity check. You can ask an agent to help you adjust them.

Launch the loop

To run one iteration:

claude --dangerously-skip-permissions -p "/optimize"
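If you would rather drive a fixed number of one-shot iterations from a script, the command above can be wrapped as in the sketch below. This is not part of the repository: `run_iterations` and its `runner` parameter are hypothetical names, and only the command line itself is taken from this README.

```python
import subprocess
from typing import Callable, Sequence

# Command copied verbatim from the README; everything else here is illustrative.
OPTIMIZE_CMD = ["claude", "--dangerously-skip-permissions", "-p", "/optimize"]


def run_iterations(
    count: int,
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> list[int]:
    """Run the one-shot /optimize command up to `count` times.

    `runner` takes the argument list and returns an exit code. It defaults to
    subprocess.call, and can be swapped for a stub to dry-run the loop on a
    machine where the Claude Code CLI is not installed.
    Returns the exit codes, stopping after the first non-zero one.
    """
    exit_codes: list[int] = []
    for _ in range(count):
        code = runner(OPTIMIZE_CMD)
        exit_codes.append(code)
        if code != 0:
            # Do not keep iterating on top of a failed run.
            break
    return exit_codes
```

Calling `run_iterations(3)` would run three iterations back to back. Stopping on the first failure is a design choice of this sketch, so that a broken environment is noticed early rather than repeated.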

Alternatively, launch interactive mode by running claude --dangerously-skip-permissions, select the right model and thinking mode, and enter /loop Run /optimize every 15 minutes.

That's it. The loop runs indefinitely: each iteration picks one optimization, benchmarks it, logs an experiment folder, and continues. Stop with Ctrl+C when you want to step in. As the agent struggles to find new optimizations, it will start scheduling iterations less frequently.
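The README does not say how the agent stretches its schedule, so the following is only one plausible reading: an exponential back-off that doubles the wait after every iteration that fails to beat the best runtime so far. The function names, the 240-minute cap, and the doubling rule are all assumptions. Only the 15-minute base interval comes from the /loop command above.

```python
def next_interval_minutes(
    stalled_iterations: int,
    base_minutes: int = 15,
    max_minutes: int = 240,
) -> int:
    """Wait time before the next iteration (assumed policy, not the agent's).

    Doubles the base interval for each consecutive iteration without an
    improvement, capped at `max_minutes`.
    """
    if stalled_iterations < 0:
        raise ValueError("stalled_iterations must be non-negative")
    return min(base_minutes * 2 ** stalled_iterations, max_minutes)


def schedule(runtimes_ms: list[float]) -> list[int]:
    """Map per-iteration runtimes to the wait chosen after each iteration."""
    best = float("inf")
    stalled = 0
    waits: list[int] = []
    for runtime in runtimes_ms:
        if runtime < best:
            # New best kernel: reset the back-off.
            best = runtime
            stalled = 0
        else:
            stalled += 1
        waits.append(next_interval_minutes(stalled))
    return waits
```

With this policy a run of unproductive iterations quickly spaces out, while a single improvement brings the loop back to the 15-minute cadence.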

Architecture

For more details on the agentic loop, please refer to the technical report.

Agents:

Profiler

Research

Workload inspector

| Command | Purpose |
| --- | --- |
| /optimize | Main loop |
| /benchmark | One-shot Modal run |
| /log-experiment | Snapshot + write result.md + update index |

See CLAUDE.md for rules and .claude/commands/ for full command specs.

solution/triton/sparse_fused.py — the kernel being optimized (overwritten each iteration)

experiments/exp_N/ — snapshot + results for iteration N

experiments/summary.md — master index, one row per iteration

experiments/LESSONS.md — durable cross-experiment findings
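To make the layout above concrete, here is a minimal sketch of what a snapshot step could look like. It is not the repository's /log-experiment command, which is specified in .claude/commands/. The function name, the result.md and summary.md contents, and numbering from exp_1 are assumptions for illustration. The paths follow the list above.

```python
import shutil
from pathlib import Path

# Kernel path as listed in this README, relative to the working copy.
KERNEL = Path("solution/triton/sparse_fused.py")


def log_experiment(root: Path, runtime_ms: float, note: str) -> Path:
    """Snapshot the current kernel into experiments/exp_N/ and index it.

    The file formats written here are hypothetical; only the directory
    layout is taken from the README.
    """
    experiments = root / "experiments"
    experiments.mkdir(parents=True, exist_ok=True)

    # N is one past the highest existing exp_N folder (assumed to start at 1).
    existing = [
        int(p.name.removeprefix("exp_"))
        for p in experiments.glob("exp_*")
        if p.is_dir() and p.name.removeprefix("exp_").isdigit()
    ]
    n = max(existing, default=0) + 1

    exp_dir = experiments / f"exp_{n}"
    exp_dir.mkdir()

    # The kernel file is overwritten each iteration, so keep a copy.
    shutil.copy2(root / KERNEL, exp_dir / KERNEL.name)
    (exp_dir / "result.md").write_text(
        f"# exp_{n}\n\n- runtime_ms: {runtime_ms:.3f}\n- note: {note}\n"
    )

    # Master index: one row per iteration.
    summary = experiments / "summary.md"
    if not summary.exists():
        summary.write_text("| exp | runtime (ms) | note |\n| --- | --- | --- |\n")
    with summary.open("a") as f:
        f.write(f"| exp_{n} | {runtime_ms:.3f} | {note} |\n")
    return exp_dir
```

Keeping a copy of the kernel inside each exp_N folder means any earlier iteration can be restored after the working file has been overwritten.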

