GitHub - Dogacel/auto-gpu-kernel: Winner 🏆 (Agent-only) MLSys 2026 - FlashInfer AI Kernel Generation Contest for the DeepSeek Sparse Attention (DSA) track with an average speedup of 34.93x · GitHub
Auto GPU Kernel 🏆
Autonomous GPU-kernel discovery & optimizer.
Technical Report
Ranked #1 on MLSys 2026 - FlashInfer AI Kernel Generation Contest for the DeepSeek Sparse Attention (DSA) track with an average speedup of 34.93x. Submissions can be found at:
| Kernel | Runtime (ms) |
| --- | --- |
| dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps64 (DSA Sparse Attention) | 0.010 |
| dsa_topk_indexer_fp8_h64_d128_topk2048_ps64 (DSA TopK Indexer) | 0.016 |
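As a rough illustration of what an "average speedup" figure can mean, the sketch below divides a baseline latency by the optimized latency for each workload and takes the arithmetic mean. The function name `average_speedup`, the input numbers, and the choice of arithmetic mean are all assumptions made for illustration; the contest's actual scoring method is not described here.

```python
from statistics import fmean
from typing import Sequence


def average_speedup(baseline_ms: Sequence[float], solution_ms: Sequence[float]) -> float:
    """Mean of per-workload speedups (baseline latency / solution latency).

    Hypothetical helper: the averaging rule is an assumption, not the
    contest's documented formula.
    """
    if not baseline_ms or len(baseline_ms) != len(solution_ms):
        raise ValueError("need two non-empty latency lists of equal length")
    speedups = [b / s for b, s in zip(baseline_ms, solution_ms)]
    return fmean(speedups)


# Made-up latencies in milliseconds, not measurements from this repository.
print(average_speedup([2.0, 3.0], [0.5, 0.25]))
```

Averaging per-workload ratios weights every workload equally, regardless of how long each one runs in absolute terms.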
Setup
Copy the template directory into a separate folder / git repository to make sure your agents work in an isolated environment.
The kernel agent is compatible with the FlashInfer format and can run in the cloud via Modal, so no local GPU is needed. It requires the Claude Code CLI.
    # Python env
    conda create -n fi-bench python=3.12
    conda activate fi-bench
    pip install flashinfer-bench modal

    # One-time environment setup
    modal setup
    modal volume create flashinfer-trace
    modal volume put flashinfer-trace /path/to/flashinfer-trace/
To get started, clone the MLSys-2026 Contest Dataset. To change the kernel you are implementing, please refer to the FlashInfer-Trace - Bring Your Own Kernel guide.
Important: Make sure you update CLAUDE.md to describe the kernel you are optimizing. The example in template is customized for sparse attention. optimize.md and benchmark.md also have some parameters tuned for sparse attention, such as the number of test cases to run as a sanity check. You can ask an agent to help you adjust them.
Launch the loop
To run one iteration:
claude --dangerously-skip-permissions -p "/optimize"
Or you can launch interactive mode by running `claude --dangerously-skip-permissions`, selecting the right model and thinking mode, and entering `/loop Run /optimize every 15 minutes`.
That's it. The loop runs indefinitely: each iteration picks one optimization, benchmarks it, logs an experiment folder, and continues. Stop with Ctrl+C when you want to step in. As the agent struggles to find new optimizations, it will start running less frequently.
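The keep-if-faster and slow-down-when-stuck behaviour described above can be sketched as a small state update. This is only an illustration: in the real loop the agent itself decides how to reschedule, and the names `LoopState` and `run_iteration`, the doubling rule, and the 240-minute cap are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoopState:
    best_ms: float                 # fastest runtime seen so far
    interval_min: float = 15.0     # wait before the next iteration
    history: List[float] = field(default_factory=list)


def run_iteration(state: LoopState, candidate_ms: float,
                  backoff: float = 2.0, max_interval_min: float = 240.0) -> bool:
    """Record one benchmark result and return True if it improved the best.

    Assumed policy: a faster candidate becomes the new best; otherwise
    the wait before the next iteration grows, up to a cap.
    """
    state.history.append(candidate_ms)
    if candidate_ms < state.best_ms:
        state.best_ms = candidate_ms
        return True
    state.interval_min = min(state.interval_min * backoff, max_interval_min)
    return False


state = LoopState(best_ms=1.0)
run_iteration(state, 0.5)   # improvement: best updated, interval unchanged
run_iteration(state, 0.7)   # no improvement: interval grows
print(state.best_ms, state.interval_min)
```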
Architecture
For more details on the agentic loop, please refer to the technical report.
Agents:
Profiler
Research
Workload inspector
| Command | Purpose |
| --- | --- |
| /optimize | Main loop |
| /benchmark | One-shot Modal run |
| /log-experiment | Snapshot + write result.md + update index |
See CLAUDE.md for rules and .claude/commands/ for full command specs.
solution/triton/sparse_fused.py — the kernel being optimized (overwritten each iteration)
experiments/exp_N/ — snapshot + results for iteration N
experiments/summary.md — master index, one row per iteration
experiments/LESSONS.md — durable cross-experiment findings
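To make the file layout above concrete, here is a minimal sketch of a snapshot-and-index step. The paths come from the list above; the function `log_experiment`, the contents of result.md, and the row format in summary.md are assumptions for illustration, not the actual /log-experiment command spec.

```python
import shutil
from pathlib import Path


def log_experiment(root: Path, n: int, runtime_ms: float, note: str) -> Path:
    """Snapshot the current kernel into experiments/exp_N/ and index it.

    Hypothetical helper: file contents and table columns are assumed.
    """
    kernel = root / "solution" / "triton" / "sparse_fused.py"
    exp_dir = root / "experiments" / f"exp_{n}"
    exp_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a past run

    # Snapshot the kernel, since it is overwritten on the next iteration.
    shutil.copy2(kernel, exp_dir / kernel.name)
    (exp_dir / "result.md").write_text(
        f"# exp_{n}\n\n- runtime_ms: {runtime_ms:.3f}\n- note: {note}\n"
    )

    # Append one row per iteration to the master index.
    summary = root / "experiments" / "summary.md"
    if not summary.exists():
        summary.write_text("| Experiment | Runtime (ms) | Note |\n| --- | --- | --- |\n")
    with summary.open("a") as f:
        f.write(f"| exp_{n} | {runtime_ms:.3f} | {note} |\n")
    return exp_dir
```

Creating the folder with `exist_ok=False` is a deliberate choice in this sketch: an iteration number that already exists raises an error instead of silently replacing an earlier snapshot.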