Veryl Simulator: Performance Comparison with Verilator

2026-05-26

We have been working on a native Veryl simulator built on the new IR-based analyzer introduced earlier this year. This post shares early performance numbers comparing it against Verilator, the de facto standard open-source SystemVerilog simulator.

Approach

The Veryl simulator combines two execution backends:

A Cranelift-based backend that trades optimization quality for compile speed, so the first run starts with little upfront cost.

A GCC-based backend that runs in the background to produce a more heavily optimized binary. Once the optimized binary is ready, the running simulation switches over to it dynamically.

In practice the simulation starts running almost immediately on the Cranelift output, and then speeds up mid-run once GCC has finished compiling.
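The switch-over described above can be illustrated with a small sketch. This is not the Veryl simulator's implementation: `baseline_step`, `optimized_step`, and `run_tiered` are hypothetical names, the "backends" are plain Python functions, and the background compile is simulated with a sleep. It only shows the general shape of tiered execution, where a slower backend runs until a faster, functionally identical one is ready, and the swap happens at a cycle boundary so the simulated state carries over.

```python
import threading
import time


def baseline_step(state: int) -> int:
    """Stand-in for the quickly compiled backend: correct but slower per cycle."""
    for _ in range(200):  # artificial overhead to mimic less-optimized code
        pass
    return (state * 1103515245 + 12345) & 0xFFFFFFFF


def optimized_step(state: int) -> int:
    """Stand-in for the heavily optimized backend: same result, less work."""
    return (state * 1103515245 + 12345) & 0xFFFFFFFF


def run_tiered(cycles: int, compile_delay_s: float = 0.01):
    """Run `cycles` steps, switching backends once the optimized one is ready.

    Returns (final_state, switch_cycle). switch_cycle is None if the
    optimized backend was not ready before the run finished.
    """
    ready = threading.Event()
    slot = {}

    def background_compile():
        time.sleep(compile_delay_s)  # pretend this is the slow optimizing compile
        slot["step"] = optimized_step
        ready.set()

    worker = threading.Thread(target=background_compile, daemon=True)
    worker.start()

    step = baseline_step
    switch_cycle = None
    state = 1
    for cycle in range(cycles):
        # Swap only at a cycle boundary so the simulated state carries over intact.
        if switch_cycle is None and ready.is_set():
            step = slot["step"]
            switch_cycle = cycle
        state = step(state)

    worker.join()
    return state, switch_cycle
```

Because both step functions compute the same next state, the final result does not depend on when the swap happens; only the wall-clock time does.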

Benchmark

We ran a Linux boot (about 30M simulated cycles) on Heliodor, an out-of-order RISC-V core written in Veryl, in 1-, 2-, and 4-core configurations.

Veryl: latest nightly (2026-05-26)

Verilator: v5.040

The Veryl simulator also supports 4-state simulation. We used 2-state mode for this benchmark because this version of Verilator is 2-state only.

For each configuration we measured both the first run (no cached artifacts) and the cached run (re-running after the optimized binary has been built), on two machines representing different CPU generations:

Intel Xeon Gold 6434 (Sapphire Rapids, 2023)

AMD Ryzen Threadripper 1950X (Zen 1, 2017)

Observations

On the first run, Cranelift's fast compilation lets Veryl start executing noticeably sooner than Verilator, which spends a significant portion of the wall-clock time on C++ compilation. The first-run improvement ranges from about 33% to 61% across the two machines.
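The post does not spell out how the improvement percentages are computed. One common reading, assumed here, is the relative reduction in wall-clock time against the Verilator run. The helper name and the timings below are hypothetical and are not measurements from this benchmark.

```python
def relative_improvement(baseline_s: float, candidate_s: float) -> float:
    """Percentage reduction in wall-clock time of candidate vs. baseline.

    Assumed definition: (baseline - candidate) / baseline * 100.
    """
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_s - candidate_s) / baseline_s * 100.0


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
print(round(relative_improvement(100.0, 60.0), 1))  # prints 40.0
```

Under this definition, a 40% improvement means the candidate finishes in 60% of the baseline's time, which is not the same as being 40% faster in cycles per second.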

On the cached run, both simulators reuse a previously built native binary, so the comparison is between the GCC-optimized output of each toolchain. Veryl is still consistently faster.

Across CPU generations, the gap is larger on the older Threadripper 1950X (Zen 1) than on the Xeon Gold 6434 (Sapphire Rapids): the cached-run gap shrinks to as little as 4–8% on Sapphire Rapids but stays at 24–49% on Zen 1. We suspect Verilator's generated C++ is more sensitive to older microarchitectures than the code produced by the Veryl backends.

Veryl is faster than Verilator in both modes: substantially on the first run, and more modestly on the cached run. Most edit-compile-run cycles during development are dominated by the first-run number. Once a simulation runs long enough (regression sweeps, full OS boots), the cached-run number takes over.
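Why short runs track the first-run number and long runs track the cached-run number can be seen with a simple cost model. This is a simplification that assumes a fixed up-front build cost plus a constant simulation rate, and it ignores the mid-run switch between backends. The function names and all numbers are hypothetical.

```python
def wall_clock_s(build_s: float, cycles: int, cycles_per_s: float) -> float:
    """Total time under the assumed model: fixed build cost plus run time."""
    return build_s + cycles / cycles_per_s


def build_share(build_s: float, cycles: int, cycles_per_s: float) -> float:
    """Fraction of the total wall-clock time spent on the build."""
    return build_s / wall_clock_s(build_s, cycles, cycles_per_s)


# Hypothetical values: a 30 s build and 100k simulated cycles per second.
short_run = build_share(30.0, 1_000_000, 100_000.0)   # 30 s build + 10 s run
long_run = build_share(30.0, 30_000_000, 100_000.0)   # 30 s build + 300 s run
print(f"short run: {short_run:.0%} build, long run: {long_run:.0%} build")
```

With these made-up values, the build accounts for most of a short run but under a tenth of a long one, which matches the qualitative point above.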

What's next

The simulator is still under active development. We plan to extend the benchmark to other CPU architectures and a wider range of designs, and to publish the benchmark setup so the numbers can be reproduced.
