WarpSpeed Just Beat Cursor

ben_mussay1 pts0 comments

WarpSpeed approaches Speed of Light on Blackwell


WarpSpeed beats NVIDIA's optimised PyTorch baselines on 90% of SOL-ExecBench's 235 Blackwell kernels, running 2.24× faster on average, after a single day of search.

Published on May 24, 2026


NVIDIA recently released SOL-ExecBench: a benchmark of 235 of the hardest CUDA kernels from real production models, including DeepSeek, Qwen, Gemma, Kimi, and Stable Diffusion.

We ran WarpSpeed on these kernels for a single day. Our system beat the benchmark's optimised kernel baselines on 90% of the problems, running 2.24× faster than those baselines on average [1]. WarpSpeed topped all four of the benchmark's problem sets, both in the number of problems beating the reference and in average speedup.

For context, Cursor reported the first major results on this benchmark last month, in April 2026, with what was then the top score. Cursor's multi-agent system ran for three weeks on the benchmark and beat the optimised baseline on 63% of problems, reaching an average speedup of 1.38×.

Extremely fast kernels

WarpSpeed is doubleAI's artificial expert intelligence for performance engineering. It designs, implements, verifies, specialises, and tunes hand-crafted kernels for any target hardware, including GPUs and CPUs, often exceeding the performance of expert-written code. The verification framework, learning loop, and reasoning techniques behind it are described in our prior post.

We ran our system on NVIDIA's most recent performance engineering benchmark, SOL-ExecBench, for a single day. On 90% of the problems, WarpSpeed produces a kernel that runs faster than NVIDIA's own optimised PyTorch implementation, running 2.24× faster on average.
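The two headline numbers, the share of problems beating the baseline and the average speedup, can be illustrated with a small sketch. The post does not say how the average is taken, so the arithmetic mean used here is an assumption, and the function name `summarise` and the latencies are made up for illustration.

```python
from statistics import mean


def summarise(baseline_ms, candidate_ms):
    """Return (win_rate, mean_speedup) for paired per-problem latencies.

    Assumption: "average speedup" is the arithmetic mean of the
    per-problem ratios baseline / candidate. The post does not state
    which mean it reports.
    """
    if len(baseline_ms) != len(candidate_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    # A speedup above 1 means the candidate kernel beats the baseline.
    speedups = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    win_rate = sum(s > 1.0 for s in speedups) / len(speedups)
    return win_rate, mean(speedups)


# Made-up latencies in milliseconds for four hypothetical problems.
baseline = [4.0, 2.0, 9.0, 1.0]
candidate = [2.0, 1.0, 3.0, 2.0]
win_rate, avg_speedup = summarise(baseline, candidate)
print(f"win rate: {win_rate:.0%}, average speedup: {avg_speedup:.3f}x")
```

A geometric mean would weight large outliers less heavily than the arithmetic mean does, so the choice matters when a few kernels have very large speedups.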

[Figure: Average speedup over NVIDIA's optimised PyTorch baseline, per problem set. Higher is faster.]

Not all problem sets are equal. Some problems leave more room to optimise than others, and our speedups vary accordingly. Our system did best on the quantisation kernels (NVFP4 and FP8 attention, MoE routing, projection layers), where the right instruction choices and register layout matter enormously. Our fastest among these is an NVFP4 grouped-query attention kernel, the core of NVFP4 inference for any modern transformer; it runs essentially at the speed of light for this workload, and is 14.9× faster than the optimised reference.

The per-problem picture is also worth a look. WarpSpeed delivers consistent gains across the entire benchmark, producing a meaningfully faster kernel on nearly every problem. A head-to-head comparison against Cursor's agentic swarm illustrates this: our system comes out on top on nearly every problem; the handful of exceptions have small margins, with the widest explained by reward hacks in Cursor's kernels (we cover these in the verification section below).
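The head-to-head chart sorts problems by margin and averages each group of three neighbouring problems into one bar. A minimal sketch of that binning follows. The post does not define the margin, so taking it as the difference between the two systems' speedups is an assumption, and the function name `binned_margins` and the data are hypothetical.

```python
def binned_margins(warpspeed, cursor, bin_size=3):
    """Sort per-problem margins and average them in groups of bin_size.

    Assumption: margin = WarpSpeed speedup minus Cursor speedup on the
    same problem. The post does not define the margin it plots.
    """
    if len(warpspeed) != len(cursor):
        raise ValueError("both systems need one speedup per problem")
    margins = sorted((w - c for w, c in zip(warpspeed, cursor)), reverse=True)
    bars = []
    for start in range(0, len(margins), bin_size):
        chunk = margins[start:start + bin_size]  # last chunk may be short
        avg = sum(chunk) / len(chunk)
        # Green when WarpSpeed is faster on average within the bar;
        # a tie or a Cursor lead is coloured red.
        bars.append((avg, "green" if avg > 0 else "red"))
    return bars


# Made-up speedups over the baseline for six hypothetical problems.
warpspeed = [3.0, 2.0, 1.5, 1.25, 1.0, 0.75]
cursor = [1.0, 1.0, 1.0, 1.0, 1.5, 1.5]
bars = binned_margins(warpspeed, cursor)
for avg, colour in bars:
    print(f"{avg:+.3f} {colour}")
```

Averaging neighbours smooths the chart, but it can also hide a single losing problem inside a bar that is green overall.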

[Figure: Head-to-head per-problem speedup, sorted by margin. Each bar averages three problems with neighbouring speedups. Green: WarpSpeed is faster. Red: Cursor is faster.]

NVIDIA's baselines are not soft targets. The benchmark's 235 problems are split across four sets: atomic single-op kernels lifted from real model architectures (L1), fused multi-op blocks like decoder layers (L2), low-precision FP8 and NVFP4 kernels, the formats powering modern production inference (Quant), and inference primitives traced directly from vLLM and SGLang's serving stacks for Llama-3.1-8B, Qwen3-30B-A3B, and DeepSeek-V3 (FlashInfer-Bench). WarpSpeed beats the optimised PyTorch baseline on the vast majority of problems, in every one of these sets.

[Figure: Win-rate against NVIDIA's optimised PyTorch baseline, per problem set.]

Verification is crucial

It's not only about speed.

At doubleAI, we treat verification as paramount. A fast kernel that's quietly wrong is worse than a slow one that's right. For an agentic system that generates kernels at scale, the evaluation harness and the verifier are the only things telling those two apart.

SOL-ExecBench is a real advance over prior benchmarks. The bulk of its harness engineering goes into hardening the measurement path against reward hacking: every submission runs in an isolated subprocess with SM clocks locked, the L2 cache is flushed between iterations, pointers are shifted to defeat data-address caching, and CUDA events have been replaced with CUPTI activity tracing, which captures kernel timestamps on every stream and closes a common exploit in which work is hidden on side-streams the timer doesn't see.

The aim of this section is to examine the verifiers of SOL-ExecBench. The intent is not to critique any particular agent's kernels; both Cursor's system and ours optimised against these verifiers. The question we want to ask is: what do these verifiers...
