Quick Hardware Performance Counters on macOS ARM64 · Perpetually Curious Blog
Tim McGilchrist
© Tim McGilchrist 2007-2026
Quick Hardware Performance Counters on macOS ARM64
March 25, 2026
If you’ve ever profiled OCaml programs on Linux, you’ve probably reached for perf stat. It’s the go-to tool for reading hardware performance counters (cycles, instructions, cache misses) without any instrumentation overhead. On macOS, the equivalent story has been to open Instruments, which is fine for GUI-driven investigation but terrible for automated benchmarking pipelines.
I wanted something I could stick in a shell script, get output as JSON, and run in the terminal. So I put together mperf, a perf stat-like CLI for Apple Silicon. Here is what it looks like:
$ sudo ./mperf-stat -e cycles -e instructions -e l1d-tlb-misses -- ./my_benchmark
Performance counter stats:
1,234,567,890 cycles
2,345,678,901 instructions # 1.90 IPC
12,345,678 l1d-tlb-misses
0.543210 seconds wall time
0.520000 seconds user
0.020000 seconds sys
Why Not Just Use Instruments?
Instruments is powerful, but it’s an interactive GUI tool. You can invoke it from the command line using xctrace, but the results still need the GUI to view them. Sometimes you just need a simple CLI tool that prints the most interesting stats; in my case, I want something I can invoke from a Makefile or a CI runner.
There’s also no good reason this should require full Xcode and Instruments. The hardware counters are right there in the CPU, and the kernel exposes them through private frameworks. The only real requirement is root access; there is no need to disable SIP, sign code, or obtain special entitlements.
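As a rough illustration of the scripting use case, here is a minimal Python sketch that pulls counter values out of the human-readable output shown earlier and recomputes IPC. It assumes the text layout matches that sample exactly; the `parse_counters` helper is hypothetical, and for a real pipeline mperf's JSON output would likely be the better input, but its schema is not shown here.

```python
import re

# Matches lines such as "2,345,678,901 instructions # 1.90 IPC".
# Timing lines ("0.543210 seconds wall time") do not match because the
# leading number must be followed by whitespace, not a decimal point.
COUNTER_LINE = re.compile(r"^\s*(\d[\d,]*)\s+([A-Za-z0-9_.\-]+)")


def parse_counters(text):
    """Return {event_name: count} for every counter line in the text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters


# Copied from the sample output above (layout assumed, not guaranteed).
SAMPLE = """\
Performance counter stats:
1,234,567,890 cycles
2,345,678,901 instructions # 1.90 IPC
12,345,678 l1d-tlb-misses
0.543210 seconds wall time
"""

if __name__ == "__main__":
    counters = parse_counters(SAMPLE)
    print(f"IPC: {counters['instructions'] / counters['cycles']:.2f}")
```

A CI job could compare the parsed values against a stored baseline and fail the build when they drift past a chosen threshold.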
Using Apple’s Private Frameworks
Apple Silicon exposes hardware performance counters through two private frameworks: kperf.framework and kperfdata.framework, living under /System/Library/PrivateFrameworks/. These are the same frameworks that Instruments uses internally. They’re undocumented, but ibireme’s kpc_demo showed that you can load them at runtime with dlopen, look up their functions with dlsym, and drive them from userspace.
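The runtime-loading idea can be sketched in Python with ctypes, which uses dlopen and dlsym under the hood. This is only an illustration of the mechanism, not how mperf itself is written. The binary path inside the framework bundle and the symbol name `kpc_get_counter_count` are taken from my recollection of kpc_demo and should be treated as assumptions; `load_symbol` is a hypothetical helper.

```python
import ctypes

# Assumed location of the framework binary (based on kpc_demo).
KPERF_PATH = "/System/Library/PrivateFrameworks/kperf.framework/kperf"


def load_symbol(library_path, symbol_name):
    """Return a ctypes function for the symbol, or None if unavailable."""
    try:
        library = ctypes.CDLL(library_path)  # dlopen()
    except OSError:
        return None  # library missing, e.g. not running on macOS
    try:
        return getattr(library, symbol_name)  # dlsym()
    except AttributeError:
        return None  # library loaded but symbol not exported


if __name__ == "__main__":
    fn = load_symbol(KPERF_PATH, "kpc_get_counter_count")
    if fn is None:
        print("kperf is not available on this machine")
    else:
        print("kperf loaded; symbol resolved")
```

Returning None instead of raising keeps the caller's fallback path simple, which matters for private frameworks that can change between macOS releases.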
The CPU-specific event databases live in /usr/share/kpep/ as plist files: a14.plist for M1, a15.plist for M2, as4.plist for M4, and so on. mperf provides portable aliases (cycles, instructions, branch-misses, l1d-cache-misses, etc.) that resolve to the right event names for whatever chip you’re running on. You can also pass raw event names if you want something specific.
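One plausible way to implement such aliases is a table of candidate event names per alias, tried in order against the set of events the current chip's database offers. The sketch below assumes that design; the candidate names in `ALIASES` are illustrative placeholders rather than mperf's real table, and `resolve_event` is a hypothetical helper.

```python
# Illustrative alias table: each alias maps to candidate raw event names,
# tried in order. These names are placeholders, not mperf's actual mapping.
ALIASES = {
    "cycles": ["FIXED_CYCLES"],
    "instructions": ["FIXED_INSTRUCTIONS"],
    "branch-misses": ["BRANCH_MISPRED_NONSPEC", "BRANCH_MISPREDICT"],
}


def resolve_event(name, available):
    """Map an alias or raw event name to an event the current chip supports.

    `available` is the set of event names found in the chip's database.
    """
    for candidate in ALIASES.get(name, []):
        if candidate in available:
            return candidate
    if name in available:
        return name  # the user passed a raw event name directly
    raise ValueError(f"event {name!r} is not available on this CPU")


if __name__ == "__main__":
    chip_events = {"FIXED_CYCLES", "FIXED_INSTRUCTIONS", "BRANCH_MISPREDICT"}
    for requested in ("cycles", "branch-misses", "BRANCH_MISPREDICT"):
        print(requested, "->", resolve_event(requested, chip_events))
```

Trying candidates in order lets one alias cover chips whose databases name the same event differently.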
Apple Silicon gives you 2 fixed counters (cycles and instructions) plus 8 configurable counters, for a maximum of 10 simultaneous events. Unlike Linux perf, mperf doesn’t do multiplexing: if you ask for more than 10 events, it’s an error rather than a silently degraded estimate. More on that distinction below.
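The counter budget can be expressed as a small validation step. This sketch assumes that cycles and instructions always land on the two fixed counters and that every other event consumes one configurable counter; `check_counter_budget` is a hypothetical name, not part of mperf.

```python
FIXED_EVENTS = ("cycles", "instructions")  # served by the 2 fixed counters
MAX_CONFIGURABLE = 8                       # configurable counters available


def check_counter_budget(events):
    """Raise ValueError if the request needs more counters than exist.

    Returns the number of distinct events that will be counted.
    """
    unique = list(dict.fromkeys(events))  # drop duplicates, keep order
    configurable = [e for e in unique if e not in FIXED_EVENTS]
    if len(configurable) > MAX_CONFIGURABLE:
        raise ValueError(
            f"{len(configurable)} configurable events requested, "
            f"but only {MAX_CONFIGURABLE} counters are available"
        )
    return len(unique)


if __name__ == "__main__":
    print(check_counter_budget(["cycles", "instructions", "l1d-tlb-misses"]))
```

Failing up front matches the no-multiplexing behaviour described above: the user learns about the limit before any measurement starts.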
The Multi-Threading Problem
A simple approach would be to fork a child, start counting, wait for it to exit, and read the counters. That works for single-threaded programs, but OCaml 5.x programs with multiple domains spawn multiple pthreads: each domain gets a domain thread plus a backup thread for systhreads. A 4-domain program has at least 8 pthreads, and naive per-thread measurement would miss most of the work.
This is where Apple’s Profile Every Thread (PET) mechanism comes in. Instead of reading counters for a single thread, PET sets up a kernel timer that fires periodically (default: every 1ms) and snapshots PMC values for every thread matching a PID filter. These samples get written to a kernel trace buffer (kdebug) with thread IDs and timestamps.
The approach is:
1. Fork a child process, held at a pipe barrier
2. Configure the PMC hardware with requested events
3. Set up PET sampling filtered to the child’s PID
4. Enable kdebug tracing for PERF_KPC_DATA_THREAD events
5. Release the child (close the pipe), let it exec the target command
6. Poll kdebug for samples until the child exits
7. For each thread, compute the delta between first and last sample
8. Sum deltas across all threads
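The pipe barrier in the first and fifth steps can be sketched in Python as follows. It shows only the fork, hold, release, and exec sequence; the `configure` callback stands in for the PMC, PET, and kdebug setup, which this sketch does not attempt. `run_with_barrier` is a hypothetical name.

```python
import os


def run_with_barrier(argv, configure):
    """Fork a child that waits at a pipe barrier, then execs argv.

    `configure(pid)` runs in the parent while the child is still blocked,
    so measurement can be set up before the target executes anything.
    Returns the child's exit code (negative signal number if killed).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: drop our copy of the write end, then block until the
        # parent closes its copy, which makes read() return EOF.
        os.close(write_fd)
        os.read(read_fd, 1)
        os.close(read_fd)
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    # Parent: the child's PID is known and the child is still waiting.
    os.close(read_fd)
    try:
        configure(pid)
    finally:
        os.close(write_fd)  # release the barrier
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    code = run_with_barrier(["sh", "-c", "exit 3"],
                            lambda pid: print("configuring for", pid))
    print("child exited with", code)
```

Holding the child before exec means the PID filter is in place before the first instruction of the target runs, so startup work is not missed. A real tool would also kill the child if setup fails rather than releasing it.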
Thread 1: [sample_0] -------- [sample_1] -------- [sample_N]
Thread 2: [sample_0] -------- [sample_1] -------- [sample_N]
Thread 3: [sample_0] -------- [sample_1] -------- [sample_N]
...
Result = Σ (thread_last - thread_first) for all threads
This is fundamentally a sampling-based approximation rather than continuous counting. But for benchmarks that run longer than a few milliseconds, the results are accurate enough to be useful. The comparison with Linux perf stat is more nuanced than “exact vs approximate” though.
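The first-to-last delta summation can be written in a few lines of Python. The sample layout here, a tuple of thread ID, timestamp, and counter values, is an assumption made for illustration and does not reflect the real kdebug record format.

```python
def sum_thread_deltas(samples):
    """Sum (last - first) counter values across all threads.

    `samples` is an iterable of (thread_id, timestamp, counters), where
    `counters` is a tuple of cumulative counter values for that thread.
    """
    first = {}
    last = {}
    for thread_id, _timestamp, counters in sorted(samples, key=lambda s: s[1]):
        first.setdefault(thread_id, counters)  # earliest sample per thread
        last[thread_id] = counters             # latest sample per thread
    if not first:
        return ()
    width = len(next(iter(first.values())))
    totals = [0] * width
    for thread_id, start in first.items():
        end = last[thread_id]
        for i in range(width):
            totals[i] += end[i] - start[i]
    return tuple(totals)


if __name__ == "__main__":
    samples = [
        (1, 0, (100, 200)), (1, 1, (150, 300)), (1, 2, (400, 900)),
        (2, 0, (10, 20)), (2, 2, (60, 120)),
    ]
    print(sum_thread_deltas(samples))  # (cycles, instructions) totals
```

A thread that was sampled only once contributes zero, which is one place where the approximation loses work done by very short-lived threads.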
Sampling Period Trade-offs
The -p flag controls the sampling period. The default 1ms works well for most OCaml programs since domains typically live for the program’s duration. For short-lived benchmarks you can sample more often, at the cost of more overhead, by passing a smaller value to -p.
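A back-of-envelope calculation helps when picking a period. The sketch below estimates how many samples each thread receives and gives a rough upper bound on the share of a thread's lifetime that falls outside the first-to-last sample window. The bound assumes a thread can run for up to one period before its first sample and up to one period after its last; that is my own simplification, not a measured property of PET.

```python
def expected_samples(duration_s, period_ms):
    """Approximate number of samples per thread over the whole run."""
    return int(duration_s * 1000 / period_ms)


def uncovered_fraction_bound(duration_s, period_ms):
    """Rough upper bound on the fraction of the run outside the window.

    Assumes at most one period is missed at each end of a thread's life.
    """
    return min(1.0, (2 * period_ms / 1000) / duration_s)


if __name__ == "__main__":
    for period_ms in (1.0, 0.5):
        n = expected_samples(0.543210, period_ms)
        bound = uncovered_fraction_bound(0.543210, period_ms)
        print(f"{period_ms} ms: ~{n} samples, <= {bound:.2%} uncovered")
```

For a run of about half a second, the default period already leaves well under one percent uncovered by this estimate; the bound only becomes significant for runs lasting a few milliseconds.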
# Default 1ms - good balance for most programs
sudo ./mperf-stat -e cycles -e instructions -- ./benchmark
# 0.5ms -...