HWE Bench: A new unbounded Benchmark for LLMs (GPT 5.5 is on top)

HWE Bench · RISC-V CPU design benchmark for LLMs

RISC-V · RV32IM · single-issue · FPGA-grounded

HWE Bench

An unbounded benchmark for LLM hardware engineering. Large language models design RISC-V CPUs from scratch. Every design must first pass a full battery of formal correctness proofs, so buggy CPUs are thrown out. The ones that survive are then scored by how fast they would actually run on a physical FPGA.

Thesis: SWE-bench tops out at 100%. HWE Bench doesn't have a top.

The fitness number reflects an actual microarchitecture, and microarchitecture has room to grow as long as models keep finding it.

Speed vs size

Score × Area

[Scatter plot: fitness (CoreMark iter/s, vertical axis, 300 to 500) against area (LUT4 count, horizontal axis, 5.0k to 10.0k; smaller is better). Points shown:]

baseline V0 · 283 · 9.6k LUT
gpt-5_5_xhigh · 525 · 5.5k LUT
gpt-5_4_xhigh · 514 · 10.1k LUT
gpt-5_5_high · 462 · 9.8k LUT
gpt-5_5_medium · 432 · 7.8k LUT
kimi-k2_6 · 396 · 9.9k LUT
gemini-3_1-pro · 355 · 10.2k LUT
VexRiscv (human ref) · 370 · 3.4k LUT

Vertical axis: CoreMark fitness (how fast the CPU runs the benchmark). Horizontal axis: chip area (LUT4 count, basically how many gates the design uses on the FPGA). One point per model's best run. VexRiscv (3,957 LUT4 · fitness 370) is the human-engineered reference. Up and to the left is the goal: faster chip, smaller chip.
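"Up and to the left" can be made concrete as a Pareto check: a design is on the front if no other design is both at least as small and at least as fast. The sketch below is an illustration only, not part of the benchmark; it uses the rounded area values and best-fitness values listed on this page, and the function names are made up for the example.

```python
# Best run per design as (area in LUT4, fitness in CoreMark iter/s).
# Areas are the rounded "k LUT" figures listed on this page.
points = {
    "gpt-5_5_xhigh": (5500, 525.04),
    "gpt-5_4_xhigh": (10100, 513.84),
    "gpt-5_5_high": (9800, 461.87),
    "gpt-5_5_medium": (7800, 431.58),
    "kimi-k2_6": (9900, 396.13),
    "gemini-3_1-pro": (10200, 354.73),
    "VexRiscv": (3400, 370.00),
    "baseline V0": (9600, 282.82),
}


def dominates(a, b):
    """True if design a is no larger and no slower than b, and strictly better on one axis."""
    area_a, fit_a = a
    area_b, fit_b = b
    return (area_a <= area_b and fit_a >= fit_b
            and (area_a < area_b or fit_a > fit_b))


def pareto_front(designs):
    """Names of designs that no other design dominates, sorted alphabetically."""
    return sorted(
        name for name, p in designs.items()
        if not any(dominates(q, p) for other, q in designs.items() if other != name)
    )


front = pareto_front(points)
print(front)
```

With these figures, only the smallest design and the fastest design survive the check; every other point is dominated by the fastest one, which is also smaller than all the remaining LLM designs.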

Leaderboard

Peak fitness per model

Best of N=3 reps per model · 17 reps total · VexRiscv human reference in red · baseline V0 in italic

| Model | Reps | Best | Δ% | Mean ± std | Area (LUT4) | Fmax (MHz) |
|---|---|---|---|---|---|---|
| gpt-5_5_xhigh | 3/3 | 525.04 | +85.6% | 468.3 ± 52.8 | 5.5k | 220 |
| gpt-5_4_xhigh | 2/2 | 513.84 | +81.7% | 505.0 ± 8.9 | 10.1k | 203 |
| gpt-5_5_high | 3/3 | 461.87 | +63.3% | 430.2 ± 23.0 | 9.8k | 187 |
| gpt-5_5_medium | 3/3 | 431.58 | +52.6% | 423.5 ± 11.2 | 7.8k | 201 |
| kimi-k2_6 | 2/3 | 396.13 | +40.1% | 339.5 ± 8.3 | 9.9k | 166 |
| VexRiscv (human ref) | n/a | 370.00 | +30.8% | n/a | 3.4k | 144 |
| gemini-3_1-pro | 3/3 | 354.73 | +25.4% | 339.4 ± 12.6 | 10.2k | 150 |
| baseline V0 (fixture) | n/a | 282.82 | n/a | n/a | 9.6k | 127 |

The VexRiscv row is the human-engineered reference, a well-known open-source RV32IM CPU synthesized on the same FPGA used for the benchmark. Five of the six models produced a best design that beats it. See the methodology page for the full procedure.
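The Δ% column appears to be each entry's best fitness relative to the baseline V0 core. A minimal sketch that reproduces the column from the listed values, assuming that definition (the page does not spell out the formula, and the names here are illustrative):

```python
BASELINE_V0 = 282.82  # CoreMark iter/s for the baseline V0 fixture

# Best fitness per leaderboard entry, as listed on this page.
best = {
    "gpt-5_5_xhigh": 525.04,
    "gpt-5_4_xhigh": 513.84,
    "gpt-5_5_high": 461.87,
    "gpt-5_5_medium": 431.58,
    "kimi-k2_6": 396.13,
    "VexRiscv (human ref)": 370.00,
    "gemini-3_1-pro": 354.73,
}


def delta_pct(fitness, baseline=BASELINE_V0):
    """Percentage improvement of a fitness score over the baseline."""
    return (fitness / baseline - 1.0) * 100.0


for name, fitness in best.items():
    print(f"{name:22s} {fitness:7.2f} {delta_pct(fitness):+6.1f}%")

# Models whose best design is faster than the human reference.
reference = best["VexRiscv (human ref)"]
beats_ref = [name for name, fitness in best.items()
             if name != "VexRiscv (human ref)" and fitness > reference]
print(len(beats_ref), "models beat the reference")
```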

Why unbounded

SWE-bench saturates. HWE Bench doesn't.

Most LLM benchmarks have a fixed ceiling. SWE-bench tops out at 100% issue-resolution. Multiple-choice evals approach 99%. Once a model lands at the ceiling, every subsequent model gets the same score, and the benchmark stops being useful for tracking capability.

HWE Bench has no ceiling. Fitness is the CPU's actual speed running CoreMark on a real FPGA: operating frequency times instructions per cycle (Fmax × IPC for the technically inclined). There's no theoretical maximum: a smarter microarchitecture always scores higher. As long as models keep finding new tricks (deeper pipelines, smarter branch predictors, restructured ALUs), the leaderboard keeps moving.
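One way to read Fmax × IPC against the leaderboard is to split a speedup into a clock-frequency part and a cycles-per-iteration part. The sketch below assumes fitness equals Fmax divided by the cycles one CoreMark iteration takes, which is an interpretation of the description above rather than the benchmark's published formula. The Fmax values are the rounded MHz figures from the leaderboard, so the split is approximate.

```python
def cycles_per_iteration(fmax_mhz, fitness_iter_per_s):
    """Clock cycles spent per CoreMark iteration, assuming fitness = Fmax / cycles."""
    return fmax_mhz * 1e6 / fitness_iter_per_s


def speedup_breakdown(new, old):
    """Split a fitness speedup into (clock gain, cycle-count gain).

    Each argument is a (Fmax in MHz, fitness in iter/s) pair.
    The product of the two gains equals the overall fitness ratio.
    """
    (fmax_new, fit_new), (fmax_old, fit_old) = new, old
    clock_gain = fmax_new / fmax_old
    cycle_gain = (cycles_per_iteration(fmax_old, fit_old)
                  / cycles_per_iteration(fmax_new, fit_new))
    return clock_gain, cycle_gain


baseline_v0 = (127, 282.82)   # Fmax (MHz), best fitness (iter/s)
top_entry = (220, 525.04)     # gpt-5_5_xhigh

clock_gain, cycle_gain = speedup_breakdown(top_entry, baseline_v0)
print(f"clock gain: {clock_gain:.3f}x")
print(f"cycle gain: {cycle_gain:.3f}x")
print(f"combined:   {clock_gain * cycle_gain:.3f}x")
```

Under this assumption, most of the top entry's gain over the baseline comes from the higher clock frequency, with a smaller contribution from needing fewer cycles per iteration.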

Empirically: the current best is 525.04 iter/s, +85.6% over the V0 baseline core and about 42% above the VexRiscv human reference (370.00). Within current budgets the curve has not saturated.

Trajectory

Fitness over rounds, best rep per model

[Line chart: best fitness so far (CoreMark iter/s, vertical axis, 300 to 500) against round (horizontal axis, through R15; 1 hypothesis × 3 slots each). Reference lines: baseline 283, VexRiscv 370.]

Best fitness at R15: gpt-5_5_xhigh 525 · gpt-5_4_xhigh 514 · gpt-5_5_high 462 · gpt-5_5_medium 432 · kimi-k2_6 396 · gemini-3_1-pro 355

Running max of CoreMark fitness across the 15 hypothesis rounds for each model's best-performing rep. Lines step up when a winning hypothesis lands and stay flat otherwise. VexRiscv's human-reference fitness is the red dashed line; the baseline V0 core is the gray dashed line.
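The running-max curve described above can be sketched as follows. The per-round scores are hypothetical placeholders, not benchmark data, and recording a rejected round (for example, a design that fails formal verification) as `None` is an assumption made for the example.

```python
def running_max(round_scores, baseline):
    """Best fitness so far after each round, starting from the baseline.

    round_scores holds one score per round; None marks a round that
    produced no valid score (an assumption of this sketch).
    """
    best = baseline
    trajectory = []
    for score in round_scores:
        if score is not None and score > best:
            best = score  # a winning hypothesis lands: the line steps up
        trajectory.append(best)  # otherwise the line stays flat
    return trajectory


# Hypothetical scores for six rounds of one rep.
rounds = [None, 301.5, 295.0, None, 340.2, 338.0]
trajectory = running_max(rounds, baseline=282.82)
print(trajectory)
```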
