Qwen3.7-Max Ran for 35 Hours on Unknown Hardware and Achieved a 10× Speedup

steveharing11 pts0 comments

Alibaba's Qwen3.7-Max Ran Autonomously for 35 Hours on Unfamiliar Hardware. It Still Kept Getting Better. - Firethering

back to top

Home

Softwares

AI Tools

DevTools

3D Tools

Design Tools

Image Editors

Video Editors

Productivity

Utilities

Apps

Android Apps

iOS Apps

Games

Windows Games

macOS Games

Android Games

iOS Games

Tech

Picks

AI Picks

AI Models

Trends

Search

Monday, May 25, 2026

Home

Softwares

AI Tools

DevTools

3D Tools

Design Tools

Image Editors

Video Editors

Productivity

Utilities

Apps

Android Apps

iOS Apps

Games

Windows Games

macOS Games

Android Games

iOS Games

Tech

Picks

AI Picks

AI Models

Trends

Facebook<br>Instagram<br>Twitter<br>Vimeo<br>Youtube

Home

Softwares

AI Tools

DevTools

3D Tools

Design Tools

Image Editors

Video Editors

Productivity

Utilities

Apps

Android Apps

iOS Apps

Games

Windows Games

macOS Games

Android Games

iOS Games

Tech

Picks

AI Picks

AI Models

Trends

Search

HomeTechAlibaba's Qwen3.7-Max Ran Autonomously for 35 Hours on Unfamiliar Hardware. It Still...

Alibaba’s Qwen3.7-Max Ran Autonomously for 35 Hours on Unfamiliar Hardware. It Still Kept Getting Better.

By Mohit Geryani

May 25, 2026

Last updated: May 25, 2026

Share

Facebook

Twitter

Pinterest

WhatsApp

- Advertisement -

Alibaba gave Qwen3.7-Max a kernel optimization task on a hardware platform the model had never encountered before. No documentation or profiling data. No example kernels for the architecture. Just a task description, an existing implementation, and an evaluation script.

The model ran for 35 hours. It made 1,158 tool calls. It wrote, compiled, profiled, and rewrote the kernel repeatedly, diagnosing failures, fixing bugs, identifying blocks, and redesigning the architecture multiple times without anyone watching. After 30 hours it was still finding meaningful improvements.

The final result was a 10x speedup over the reference implementation.

For context: GLM 5.1 ran the same task and reached 7.3x. Kimi K2.6 reached 5x. DeepSeek V4 Pro reached 3.3x. The models that stopped early did so because they issued no tool calls for five consecutive rounds, they concluded they couldn’t make further progress and stopped. Qwen3.7-Max didn’t stop.

Table of Contents

What the task actually was

The kernel in question is Extend Attention, a production component in SGLang, a widely used inference framework. Specifically it handles attention between newly generated tokens and a prefix KV-cache of up to 32K entries, a memory-bound, latency-critical operation that directly affects how fast LLMs serve responses.

The hardware was T-Head ZW-M890 PPUs, a processor architecture that wasn’t in any training data. The model had no prior knowledge of how it behaved. It started cold.

Over 35 hours it performed 432 kernel evaluations. Each cycle meant writing code, compiling it, running it, reading the profiling output, deciding what to change, and trying again. The model diagnosed compilation failures it hadn’t seen before, identified performance bottlenecks through runtime feedback rather than prior knowledge, and redesigned the kernel architecture multiple times when incremental improvements stopped working.

This matters because it tests something different from standard benchmarks. Most evaluations measure whether a model can produce a correct answer given a well-defined problem. This one measured whether a model could sustain coherent strategy across more than a thousand tool calls on an open-ended optimization problem with no human guidance. Those are different skills and most models don’t have it.

What Benchmarks Shows

via: Qwen Blog<br>The numbers below are from Alibaba’s own evaluation.

BenchmarkQwen3.7-MaxClaude Opus 4.6DeepSeek V4 ProSWE-Verified80.480.880.6Terminal Bench 2.069.765.467.9GPQA Diamond92.491.390.1HLE41.440.037.7HMMT 2026 Feb97.196.295.2BFCL-V475.076.770.6<br>On coding agents it trades blows with Opus 4.6 and DeepSeek rather than clearly beating either. Terminal Bench is the exception where it leads. The reasoning numbers are where the gap opens up more consistently, GPQA Diamond, HLE, and HMMT all show Qwen3.7-Max at or above the strongest available comparison points.

The remaining benchmarks gives you clear idea , if its a right model for your use case.

You May Like: ByteDance Open-Sourced a 3B Model for Images, Video, Editing, and Reasoning

Why this model trains differently

Most models get better by seeing more text. Qwen3.7-Max got better by seeing more situations.

Alibaba calls it environment scaling. Instead of optimizing for specific benchmarks, they built a large and diverse set of agentic training environments, different tasks, different tools, different harnesses, and trained the model across all of them. The idea is the same as why a model trained on diverse text generalizes better than one trained on narrow text. Diversity of experience produces capability that transfers.

The practical result is cross-harness generalization....

games model tools apps qwen3 hours

Related Articles