HWE Bench: A new unbounded Benchmark for LLMs (GPT 5.5 is on top)

fesens · 1 pt · 1 comment

HWE Bench · RISC-V CPU design benchmark for LLMs

RISC-V · RV32IM · single-issue · FPGA-grounded


An unbounded benchmark for LLM hardware engineering. Large language models design RISC-V CPUs from scratch. Every design must first pass a full battery of formal correctness proofs, so buggy CPUs are thrown out. The ones that survive are then scored by how fast they would actually run on a physical FPGA.

Thesis: SWE-bench tops out at 100%. HWE Bench doesn't have a top.

The fitness number reflects an actual microarchitecture, and microarchitecture has room to grow as long as models keep finding improvements.

Speed vs size

Score × Area

Figure: scatter plot of fitness (CoreMark iter/s, vertical axis) against area (LUT4 count, horizontal axis; smaller is better). Plotted points:

baseline V0 · 283 · 9.6k LUT<br>gpt-5_5_xhigh · 525 · 5.5k LUT<br>gpt-5_4_xhigh · 514 · 10.1k LUT<br>gpt-5_5_high · 462 · 9.8k LUT<br>gpt-5_5_medium · 432 · 7.8k LUT<br>kimi-k2_6 · 396 · 9.9k LUT<br>gemini-3_1-pro · 355 · 10.2k LUT<br>VexRiscv (human ref) · 370 · 3.4k LUT

Vertical axis: CoreMark fitness (how fast the CPU runs the benchmark). Horizontal axis: chip area (LUT4 count, roughly how much FPGA logic the design occupies). One point per model's best run. VexRiscv (3,957 LUT4 · fitness 370) is the human-engineered reference. Up and to the left is the goal: a faster chip and a smaller chip.
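The "up and to the left" goal can be read as Pareto dominance: one design dominates another when it is at least as fast and at least as small, and strictly better on one of the two. The sketch below applies that reading to the plotted points. It is an illustration, not part of the benchmark's scoring; the names `Design`, `dominates`, and `pareto_front` are hypothetical, and VexRiscv's area is taken as the 3.4k shown on its point label.

```python
from typing import NamedTuple


class Design(NamedTuple):
    name: str
    fitness: float     # CoreMark iter/s, higher is better
    area_kluts: float  # thousands of LUT4, lower is better


# Points as printed on the Score × Area plot (rounded values).
POINTS = [
    Design("gpt-5_5_xhigh", 525, 5.5),
    Design("gpt-5_4_xhigh", 514, 10.1),
    Design("gpt-5_5_high", 462, 9.8),
    Design("gpt-5_5_medium", 432, 7.8),
    Design("kimi-k2_6", 396, 9.9),
    Design("gemini-3_1-pro", 355, 10.2),
    Design("baseline V0", 283, 9.6),
    Design("VexRiscv", 370, 3.4),
]


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.fitness >= b.fitness and a.area_kluts <= b.area_kluts
    strictly_better = a.fitness > b.fitness or a.area_kluts < b.area_kluts
    return no_worse and strictly_better


def pareto_front(designs: list[Design]) -> list[Design]:
    """Keep only the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


if __name__ == "__main__":
    for d in pareto_front(POINTS):
        print(f"{d.name}: {d.fitness} iter/s at {d.area_kluts}k LUT4")
```

Under this reading, a design can lead on fitness and still share the front with a smaller, slower design, which is why the plot shows both axes rather than fitness alone.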

Leaderboard

Peak fitness per model

Best of up to N=3 reps per model · 17 reps total · VexRiscv human reference and baseline V0 fixture shown for comparison

| Model | Reps | Best (iter/s) | Δ% vs V0 | Mean ± std | Area (LUT4) | Fmax (MHz) |
|---|---|---|---|---|---|---|
| gpt-5_5_xhigh | 3/3 | 525.04 | +85.6% | 468.3 ± 52.8 | 5.5k | 220 |
| gpt-5_4_xhigh | 2/2 | 513.84 | +81.7% | 505.0 ± 8.9 | 10.1k | 203 |
| gpt-5_5_high | 3/3 | 461.87 | +63.3% | 430.2 ± 23.0 | 9.8k | 187 |
| gpt-5_5_medium | 3/3 | 431.58 | +52.6% | 423.5 ± 11.2 | 7.8k | 201 |
| kimi-k2_6 | 2/3 | 396.13 | +40.1% | 339.5 ± 8.3 | 9.9k | 166 |
| VexRiscv (human ref) | n/a | 370.00 | +30.8% | n/a | 3.4k | 144 |
| gemini-3_1-pro | 3/3 | 354.73 | +25.4% | 339.4 ± 12.6 | 10.2k | 150 |
| baseline V0 (fixture) | n/a | 282.82 | n/a | n/a | 9.6k | 127 |

The VexRiscv row is the human-engineered reference, a well-known open-source RV32IM CPU synthesized on the same FPGA used for the benchmark. Five of the LLM-generated designs beat it on fitness. See the methodology page for the full procedure.
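The Δ% column is consistent with each row's best fitness measured against the baseline V0 fitness of 282.82. A minimal sketch that recomputes it from the leaderboard values follows; the helper name `delta_pct` and the dictionary layout are hypothetical, chosen only for illustration.

```python
BASELINE_V0 = 282.82  # baseline V0 fitness, CoreMark iter/s
VEXRISCV = "VexRiscv (human ref)"

# Best fitness per row, copied from the leaderboard.
BEST = {
    "gpt-5_5_xhigh": 525.04,
    "gpt-5_4_xhigh": 513.84,
    "gpt-5_5_high": 461.87,
    "gpt-5_5_medium": 431.58,
    "kimi-k2_6": 396.13,
    VEXRISCV: 370.00,
    "gemini-3_1-pro": 354.73,
}


def delta_pct(fitness: float, baseline: float = BASELINE_V0) -> float:
    """Percentage improvement over the baseline, rounded to one decimal."""
    return round((fitness / baseline - 1.0) * 100.0, 1)


# LLM-generated designs whose best fitness exceeds the human reference.
beats_ref = [
    name for name, fit in BEST.items()
    if name != VEXRISCV and fit > BEST[VEXRISCV]
]

if __name__ == "__main__":
    for name, fit in BEST.items():
        print(f"{name:22s} {fit:7.2f}  {delta_pct(fit):+.1f}%")
    print(f"{len(beats_ref)} LLM designs beat VexRiscv on fitness")
```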

Why unbounded

SWE-bench saturates. HWE Bench doesn't.

Most LLM benchmarks have a fixed ceiling. SWE-bench tops out at 100% issue-resolution. Multiple-choice evals approach 99%. Once a model lands at the ceiling, every subsequent model gets the same score, and the benchmark stops being useful for tracking capability.

HWE Bench has no ceiling. Fitness is the CPU's speed running CoreMark on a real FPGA, in iterations per second, which scales with operating frequency times instructions per cycle (Fmax × IPC for the technically inclined). There's no fixed maximum score: a better microarchitecture scores higher. As long as models keep finding new tricks (deeper pipelines, smarter branch predictors, restructured ALUs), the leaderboard keeps moving.
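One way to separate the two factors is to divide fitness by Fmax, which gives CoreMark iterations per second per MHz, a rough proxy for per-cycle efficiency. The sketch below does this for the leaderboard rows. It assumes the Fmax column is the clock frequency at which each fitness number was computed, which this page does not state explicitly; the function name `iters_per_mhz` is hypothetical.

```python
# (best fitness in CoreMark iter/s, Fmax in MHz), copied from the leaderboard.
ROWS = {
    "gpt-5_5_xhigh": (525.04, 220),
    "gpt-5_4_xhigh": (513.84, 203),
    "gpt-5_5_high": (461.87, 187),
    "gpt-5_5_medium": (431.58, 201),
    "kimi-k2_6": (396.13, 166),
    "VexRiscv (human ref)": (370.00, 144),
    "gemini-3_1-pro": (354.73, 150),
    "baseline V0 (fixture)": (282.82, 127),
}


def iters_per_mhz(fitness: float, fmax_mhz: float) -> float:
    """CoreMark iterations per second per MHz of clock (assumes fitness was taken at Fmax)."""
    return fitness / fmax_mhz


if __name__ == "__main__":
    for name, (fitness, fmax) in ROWS.items():
        ratio = iters_per_mhz(fitness, fmax)
        print(f"{name:22s} {fmax:4d} MHz  {ratio:.3f} iter/s per MHz")
```

A design can climb the leaderboard by raising either factor: a higher clock at similar per-MHz efficiency, or more work per cycle at a similar clock.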

Empirically: the current best is 525.04 iter/s, +85.6% over the V0 baseline core, and clear of the VexRiscv human reference. There is no theoretical ceiling, and within current budgets the curve has not saturated.

Trajectory

Fitness over rounds, best rep per model

Figure: step plot of best fitness so far (vertical axis) against round, R0 to R15 (horizontal axis; 1 hypothesis × 3 slots per round). Reference lines: baseline 283, VexRiscv 370. Values at R15:

gpt-5_5_xhigh · 525<br>gpt-5_4_xhigh · 514<br>gpt-5_5_high · 462<br>gpt-5_5_medium · 432<br>kimi-k2_6 · 396<br>gemini-3_1-pro · 355

Running max of CoreMark fitness across the 15 hypothesis rounds for each model's best-performing rep. Lines step up when a winning hypothesis lands and stay flat otherwise. VexRiscv's human-reference fitness is the red dashed line; the baseline V0 core is the gray dashed line.
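A running max of this kind can be sketched in a few lines. The per-round numbers below are made up purely to show the step shape and are not benchmark results; the function name `running_max` is hypothetical.

```python
from itertools import accumulate


def running_max(per_round_fitness: list[float]) -> list[float]:
    """Best fitness seen so far after each round."""
    return list(accumulate(per_round_fitness, max))


if __name__ == "__main__":
    # Hypothetical per-round fitness values, for illustration only.
    rounds = [283, 300, 290, 310, 305]
    for i, best in enumerate(running_max(rounds)):
        print(f"R{i}: best so far = {best}")
```

Rounds whose hypothesis scores below the current best leave the line flat, which matches the step pattern described above.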
