Step 3.7 Flash — A high-efficiency Flash model for Real-World
2026-05-29
Step 3.7 Flash
The new frontier is agent efficiency.
A high-efficiency Flash model for real-world agents.
Multimodal Understanding & Action|Web & Visual Search Enhancement|Reliable Tool Use & Orchestration|Agent Ecosystem Compatibility
GitHub
HuggingFace
ModelScope
Key Features
Native Multimodal Understanding & Acting
Understands images across the full range — product UIs, documents, charts, and natural scenes — then writes code or calls tools to act on what it sees.
Web & Visual Search Enhancement
Web search reaches further — more sources, deeper follow-up. Visual search recognizes what other systems don't — long-tail entities, freshly emerged concepts.
Reliable Tool Use & Orchestration
Drives terminals, browsers, Office tools, search, and more, staying coherent however long the run gets. Less drift, fewer broken tool calls, fewer failed runs.
Agent Ecosystem Compatibility
Works with mainstream harnesses (Claude Code, KiloCode, Hermes Agent, OpenClaw) and Skills — lower integration cost, less workflow rewiring.
Agentic Coding
SWE-Bench Pro
Model | Score | Params
Step 3.7 Flash | 56.3 | 196B
Step 3.5 Flash | 51.3 | 196B
DeepSeek V4 Flash | 55.6 | 284B
Gemini 3.5 Flash | 55.1 | Unknown
GPT 5.5 | 58.6 | Unknown
Claude Opus 4.7 | 64.3 | Unknown
Terminal-Bench 2.1
Model | Score | Params
Step 3.7 Flash | 59.5 | 196B
Step 3.5 Flash | 53.4 | 196B
DeepSeek V4 Flash | 62.0 | 284B
Gemini 3.5 Flash | 76.2 | Unknown
GPT 5.5 | 82.7 | Unknown
Claude Opus 4.7 | 69.4 | Unknown
Multimodal
SimpleVQA (with Tool)
Model | Score | Params
Step 3.7 Flash | 79.2 | 196B
GLM 5V Turbo | 78.2 | Unknown
Kimi K2.6 | 78.2 | Unknown
GPT 5.5 | 79.1 | Unknown
V* (with Python)
Model | Score | Params
Step 3.7 Flash | 95.3 | 196B
GLM 5V Turbo | 89.0 | Unknown
Kimi K2.6 | 96.9 | Unknown
Gemini 3 Flash | 96.3 | Unknown
General Agent
GDPval
Model | Score | Params
Step 3.7 Flash | 45.8 | 196B
Step 3.5 Flash | 28.0 | 196B
DeepSeek V4 Flash | 44.0 | 284B
Gemini 3.5 Flash | 57.8 | Unknown
GPT 5.5 | 63.0 | Unknown
Claude Opus 4.7 | 63.0 | Unknown
Toolathlon
Model | Score | Params
Step 3.7 Flash | 49.5 | 196B
Step 3.5 Flash | 33.3 | 196B
DeepSeek V4 Flash | 52.8 | 284B
Gemini 3.5 Flash | 56.5 | Unknown
GPT 5.5 | 60.2 | Unknown
Claude Opus 4.7 | 65.4 | Unknown
ClawEval-1.1 (2026-05-09)
Model | Score | Params
Step 3.7 Flash | 67.1 | 196B
Step 3.5 Flash | 43.6 | 196B
DeepSeek V4 Flash | 57.8 | 284B
Gemini 3.1 Pro | 57.8 | Unknown
GPT 5.4 | 60.3 | Unknown
Claude Opus 4.6 | 70.8 | Unknown
HLE (with Tool)
Model | Score | Params
Step 3.7 Flash | 47.2 | 196B
Step 3.5 Flash | 35.7 | 196B
DeepSeek V4 Flash | 45.1 | 284B
Gemini 3.5 Flash | 40.2 | Unknown
GPT 5.5 | 52.2 | Unknown
Claude Opus 4.7 | 54.7 | Unknown
Note: On non-multimodal tasks, we organize comparisons in two groups: the left panel compares Step 3.7 Flash with DeepSeek V4 Flash, an open-source model of comparable Flash-class size, while the right panel places Step 3.7 Flash alongside frontier closed-source models. Step 3.7 Flash, Gemini 3.5 Flash, and DeepSeek V4 Flash are evaluated on Terminal-Bench 2.1; the DeepSeek V4 Flash score is self-tested. GPT 5.5 and Claude Opus 4.7 use official self-reported Terminal-Bench 2.0 scores. On GDPval, the Step 3.7 Flash score comes from internal pairwise evaluation, while the comparison models' scores are sourced from the official Artificial Analysis Leaderboard.
Gallery
Landing Page
Heritage Building
Menu Recognition
Travel Guide
Deep Search
Draft to Code
Video to Summary
Sketch to Web Page
Agentic Coding
Foundation models are shifting from answering questions to taking action, and in the digital world that action takes the form of code. Coding is the substrate of digital agency, the purest form of the...