The Return of Rigorous Full-System Timing Simulation


Computer Architecture Today (SIGARCH)

by Shanqing Lin, Mohammad Alian, Babak Falsafi on Jun 8, 2026 | Tags: Simulation

Accurate timing simulation remains one of the most important tools in computer architecture, but modern systems have made cycle-level simulation increasingly impractical. Today’s platforms combine many-core CPUs, deep memory hierarchies, accelerators, complex I/O, and large software stacks, which makes detailed simulation extremely slow: simulating seconds of execution often takes months. This “timing simulation wall” has pushed researchers toward approximations such as application-only simulation, fixed instruction windows, or instruction windows representing only the workload. While these approximations reduce runtime, they often sacrifice rigorous end-to-end measurement of real microarchitectural behavior.

This blog argues for a return to rigorous full-system timing simulation—not by simulating everything in detail at all times, but by measuring the right execution intervals, using the right performance metrics, and applying statistically sound methods to make accurate simulation practical again.
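To make "statistically sound methods" concrete, here is a minimal Python sketch of one common approach: measure a performance metric over many short sampled intervals, then report the mean with a confidence interval instead of a single number. This is an illustration under stated assumptions, not the authors' method. The metric values are synthetic, and the function names, the 95% confidence level, and the 2% target error are all hypothetical choices.

```python
import math
import random
import statistics

def sample_mean_ci(samples, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval,
    using the normal approximation for the sample mean."""
    n = len(samples)
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(n)
    return mean, z * stderr

def required_samples(samples, rel_error=0.02, z=1.96):
    """Estimate how many sampled intervals are needed so that the confidence
    interval half-width is within rel_error of the mean."""
    mean = statistics.fmean(samples)
    cv = statistics.stdev(samples) / mean  # coefficient of variation
    return math.ceil((z * cv / rel_error) ** 2)

# Synthetic per-interval metric values (assumption: stands in for any
# per-interval performance metric measured by a timing simulator).
rng = random.Random(0)
metric = [rng.gauss(1.5, 0.3) for _ in range(200)]

mean, half_width = sample_mean_ci(metric)
print(f"mean = {mean:.3f} +/- {half_width:.3f} (approx. 95% confidence)")
print(f"intervals needed for +/-2% error: {required_samples(metric)}")
```

The point of the sketch is that the number of intervals to simulate in detail follows from the measured variability and the desired error bound, rather than from a fixed instruction budget.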

Why Full-System Simulation?

Full-system simulation emulates an entire computer system: CPU, memory, devices, operating system, and applications. Unlike user-level simulation, it captures interactions across the full software and hardware stack. Full-system simulation matters because many critical behaviors emerge from OS activity, interrupts, I/O, memory management, synchronization, and device interactions—not from application code alone. Ignoring these layers can misrepresent real system bottlenecks and performance.

Full-system simulation dates back to the 1990s with systems like SimOS, later influencing platforms such as Simics (now Intel Simics Simulator), M5 (integrated into gem5) and QEMU (used in MARSS and QFlex).

Today, full-system simulation is becoming essential again for four reasons:

Modern workloads are service-oriented and multi-tenant, relying on microservices, RPCs, storage stacks, and OS-mediated interactions.

Many server and mobile workloads spend significant time in the OS, making kernel behavior central to performance analysis.

Heterogeneous systems increasingly combine CPUs with GPUs, accelerators, and smart NICs, with the CPU and OS orchestrating coordination, memory, and synchronization.

Agentic AI workloads depend heavily on tool invocation, scheduling, APIs, databases, and system integration, making CPU and OS behavior critical to end-to-end performance.

As a result, full-system simulation is no longer just a legacy methodology—it is increasingly necessary because the entire system stack has become the target of architectural innovation.

The Timing Simulation Wall

Simulators span a broad spectrum of abstraction, functionality, and performance. At the fastest end are execution-driven full-system simulators that use JIT translation to dynamically map target ISA instructions into the host ISA at runtime. Since early systems such as SimOS, these simulators have typically operated within roughly an order of magnitude of native hardware speed.

Modern ISA emulators such as QEMU can additionally generate detailed execution traces for functional simulation, enabling analysis of cache and TLB miss rates, branch predictor behavior, and prefetcher accuracy. This tracing introduces another order-of-magnitude slowdown relative to native execution.
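As a rough back-of-envelope sketch, the Python snippet below shows how such slowdowns translate into wall-clock time. The first two factors follow the "order of magnitude" figures in the text. The cycle-level factor is a hypothetical placeholder chosen only for illustration, not a measurement from this article, and the function and dictionary names are likewise assumptions.

```python
SECONDS_PER_DAY = 86_400

def wall_clock_seconds(target_seconds, slowdown):
    """Host time needed to simulate target_seconds of execution at a given
    slowdown relative to native hardware."""
    return target_seconds * slowdown

# Slowdown factors relative to native execution (illustrative only).
slowdowns = {
    "JIT-based emulation (about 1 order of magnitude)": 10,
    "emulation with tracing (about 2 orders of magnitude)": 100,
    "cycle-level timing (hypothetical placeholder)": 1_000_000,
}

for name, factor in slowdowns.items():
    days = wall_clock_seconds(1.0, factor) / SECONDS_PER_DAY
    print(f"{name}: {days:.4f} host days per target second")
```

Under the placeholder factor, each simulated second costs more than a week of host time, which is consistent with the earlier observation that simulating seconds of execution can take months.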

Timing simulators go further by modeling cycle-level interactions among microarchitectural components in the CPU, accelerator, memory, and I/O devices, resulting in substantially lower simulation throughput. The table below compares simulation speeds for a single ARM Neoverse N1 target core with its cache hierarchy running server workloads on an AMD Zen 3 host. The first row presents QEMU’s raw ISA emulation speed. The second row shows the slowdown due to instrumentation for user-level functional simulation. The third row shows the impact on speed of functionally simulating the microarchitectural components, including the cache hierarchy and TLBs, front-end tables, and data prefetcher, for all user-level instructions. The fourth row shows the impact of functional simulation of all instructions, including the OS. Finally, the fifth row shows the timing simulation speed.

Modern workloads are not steady streams of similar instructions. Their performance fluctuates over time due to network activity, resource contention, background OS activity, synchronization effects, software hiccups, DVFS throttling, UI and graphics activity, and other asynchronous events. Alameldeen...
