Lessons Learned Building High-Performance Rust Profiler
Updated May 12, 2026
23 minute read
The Rust performance book features over a dozen different profiling tools. So I’m not sure if the world needed a new Rust profiler. Still, I spent the last 6+ months building hotpath-rs. In this post, I’ll describe the design decisions behind the library and share a few performance challenges I encountered while working on it. We’ll go deep into the low-level details: cache-line contention, async futures instrumentation, and decoding raw CPU traces back into Rust symbols.
hotpath profiler 101
The next section is a brief overview of the library. Skip ahead if you want to jump straight into the implementation details.
Over the last few months, the hotpath profiler has grown to over 100k downloads on crates.io and is slowly gaining more adoption in the Rust ecosystem.
Before diving into implementation details, let’s quickly look at what hotpath does and why I built it. It’s an “all-in-one” Rust profiler / debugging toolkit designed to quickly identify performance bottlenecks.
The core idea is to combine multiple sources of data into reports that are quick and easy to mentally parse. You need only two macros, hotpath::main and hotpath::measure, to get started with instrumenting your codebase.
Let’s see it in action:
examples/overview.rs
use std::hint::black_box;
use std::time::Duration;

// CPU-bound: pure integer arithmetic, no allocation, no waiting.
#[hotpath::measure]
fn sync_work() {
    let mut result: u64 = 1;
    for i in 0..20000 {
        result = result.wrapping_mul(black_box(i as u64).wrapping_add(7));
        result ^= result >> 3;
    }
}

// Allocation-heavy: a fresh 1 KB buffer on every iteration.
#[hotpath::measure]
fn sync_alloc() {
    for _ in 0..1000 {
        let buf: Vec<u8> = vec![1; 1024];
        std::hint::black_box(&buf);
    }
}

// Async wait: stands in for slow I/O.
#[hotpath::measure]
async fn async_sleep() {
    tokio::time::sleep(Duration::from_millis(10)).await;
}

#[tokio::main]
#[hotpath::main]
async fn main() {
    for _ in 0..1000 {
        sync_work();
        sync_alloc();
        async_sleep().await;
    }
}
This example features 3 functions, showcasing different modes of execution present in any Rust program:
sync_work - synchronous function that runs CPU-bound code
sync_alloc - synchronous function that allocates memory
async_sleep - function that sleeps asynchronously. A bit artificial, but it’s meant to simulate slow async I/O. Waiting for an SQL query or an HTTP endpoint would yield similar performance results.
Looking at the example, can you tell what the REAL bottleneck is?
Let’s look at the hotpath report output before we answer this question:
cargo run --example overview --features='hotpath,hotpath-alloc,hotpath-cpu'
[hotpath] 15.61s | timing, alloc, cpu
timing - Function execution time metrics.
+------------------------+-------+----------+----------+---------+---------+
| Function               | Calls | Avg      | P95      | Total   | % Total |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::main        | 1     | 15.62 s  | 15.62 s  | 15.61 s | 100.00% |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::async_sleep | 1000  | 11.77 ms | 12.10 ms | 11.77 s | 75.42%  |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::sync_work   | 1000  | 3.14 ms  | 4.59 ms  | 3.14 s  | 20.11%  |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::sync_alloc  | 1000  | 1.23 µs  | 2.62 µs  | 1.23 ms | 0.01%   |
+------------------------+-------+----------+----------+---------+---------+
alloc-bytes - Exclusive allocation bytes by each function.
Total: 1.1 MB
+------------------------+-------+----------+----------+-----------+---------+
| Function               | Calls | Avg      | P95      | Total     | % Total |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::sync_alloc  | 1000  | 1.0 KB   | 1.0 KB   | 1000.0 KB | 89.28%  |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::main        | 1     | 120.1 KB | 120.1 KB | 120.1 KB  | 10.72%  |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::async_sleep | 1000  | 0 B      | 0 B      | 0 B       | 0.00%   |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::sync_work   | 1000  | 0 B      | 0 B      | 0 B       | 0.00%   |
+------------------------+-------+----------+----------+-----------+---------+
cpu - CPU sampling attribution per function (exclusive).
+------------------------+---------+---------+
| Function               | Samples | % Total |
+------------------------+---------+---------+
| cpu_basic::sync_work   | 1915914 | 56.13%  |
+------------------------+---------+---------+
| cpu_basic::async_sleep | 14056   | 0.41%   |
+------------------------+---------+---------+
| cpu_basic::sync_alloc  | 1581    | 0.05%   |
+------------------------+---------+---------+
samply load /tmp/hotpath/61089-1778083683167502000/hp.json.gz
hotpath timing, alloc, and CPU usage report.
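To make the timing and alloc-bytes columns more concrete, here is a minimal sketch of how per-call duration and allocated bytes could be collected by hand using only the standard library. This is not how hotpath is implemented: `CountingAllocator`, `measure`, `Sample`, and `percentile` are hypothetical names made up for this illustration, and the byte counter is process-wide, so it has none of the per-function exclusive attribution the real report shows.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

// Bytes requested through the global allocator, process-wide.
static ALLOCATED_BYTES: AtomicU64 = AtomicU64::new(0);

// Hypothetical wrapper: forwards to the system allocator and counts bytes.
struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

// One recorded call: wall-clock duration and bytes requested while it ran.
struct Sample {
    elapsed: Duration,
    bytes: u64,
}

// Hypothetical stand-in for an instrumented call; not hotpath's API.
fn measure<T>(f: impl FnOnce() -> T) -> (T, Sample) {
    let bytes_before = ALLOCATED_BYTES.load(Ordering::Relaxed);
    let start = Instant::now();
    let out = f();
    let elapsed = start.elapsed();
    let bytes = ALLOCATED_BYTES.load(Ordering::Relaxed) - bytes_before;
    (out, Sample { elapsed, bytes })
}

// Nearest-rank percentile over an ascending-sorted, non-empty slice.
fn percentile(sorted: &[Duration], percent: usize) -> Duration {
    let rank = (percent * sorted.len() + 99) / 100; // ceil(percent * n / 100)
    sorted[rank.clamp(1, sorted.len()) - 1]
}

fn main() {
    // Pre-size the vector so bookkeeping does not allocate inside `measure`.
    let mut samples = Vec::with_capacity(1000);
    for _ in 0..1000 {
        let (buf, sample) = measure(|| vec![1u8; 1024]);
        std::hint::black_box(&buf);
        samples.push(sample);
    }

    let mut times: Vec<Duration> = samples.iter().map(|s| s.elapsed).collect();
    times.sort();
    let total: Duration = times.iter().sum();
    let total_bytes: u64 = samples.iter().map(|s| s.bytes).sum();

    println!("calls: {}", times.len());
    println!("avg:   {:?}", total / times.len() as u32);
    println!("p95:   {:?}", percentile(&times, 95));
    println!("total: {:?}, {} bytes", total, total_bytes);
}
```

The same arithmetic explains the report rows: 1000 calls at 1.0 KB each add up to the 1000.0 KB total for sync_alloc, and the % Total column in the timing table is each function's total divided by the run's wall-clock time (11.77 s out of 15.61 s is roughly 75% for async_sleep).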
Optionally,...