Lessons Learned Building High-Performance Rust Profiler

Updated May 12, 2026

23 minute read

The Rust performance book features over a dozen different profiling tools. So I’m not sure if the world needed a new Rust profiler. Still, I spent the last 6+ months building hotpath-rs. In this post, I’ll describe the design decisions behind the library and share a few performance challenges I encountered while working on it. We’ll go deep into the low-level details: cache-line contention, async futures instrumentation, and decoding raw CPU traces back into Rust symbols.

hotpath profiler 101

The next section is a brief overview of the library. Skip ahead if you want to jump straight into the implementation details.

Over the last few months, the hotpath profiler has grown to over 100k downloads on crates.io and is slowly gaining more adoption in the Rust ecosystem.

Before diving into implementation details, let’s quickly look at what hotpath does and why I built it. It’s an “all-in-one” Rust profiler / debugging toolkit designed to quickly identify performance bottlenecks.

The core idea is to combine multiple sources of data into reports that are quick and easy to mentally parse. You need only two macros, hotpath::main and hotpath::measure, to get started with instrumenting your codebase.

Let’s see it in action:

examples/overview.rs

use std::hint::black_box;
use std::time::Duration;

// CPU-bound work: a tight arithmetic loop with no allocations.
#[hotpath::measure]
fn sync_work() {
    let mut result: u64 = 1;
    for i in 0..20000 {
        result = result.wrapping_mul(black_box(i as u64).wrapping_add(7));
        result ^= result >> 3;
    }
}

// Allocation-heavy work: 1000 heap buffers of 1 KB each.
#[hotpath::measure]
fn sync_alloc() {
    for _ in 0..1000 {
        let buf: Vec<u8> = vec![1; 1024];
        std::hint::black_box(&buf);
    }
}

// Simulated slow async I/O.
#[hotpath::measure]
async fn async_sleep() {
    tokio::time::sleep(Duration::from_millis(10)).await;
}

#[tokio::main]
#[hotpath::main]
async fn main() {
    for _ in 0..1000 {
        sync_work();
        sync_alloc();
        async_sleep().await;
    }
}

This example features 3 functions, showcasing different modes of execution present in any Rust program:

sync_work - a synchronous function that executes CPU-bound code

sync_alloc - a synchronous function that allocates memory

async_sleep - a function that sleeps asynchronously. A bit artificial, but it’s meant to simulate slow async I/O. Waiting for an SQL query or an HTTP endpoint would yield similar performance results.

Looking at the example, can you tell what the REAL bottleneck is?

Let’s see a hotpath report output before we answer this question:

cargo run --example overview --features='hotpath,hotpath-alloc,hotpath-cpu'

[hotpath] 15.61s | timing, alloc, cpu

timing - Function execution time metrics.
+------------------------+-------+----------+----------+---------+---------+
| Function               | Calls | Avg      | P95      | Total   | % Total |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::main        | 1     | 15.62 s  | 15.62 s  | 15.61 s | 100.00% |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::async_sleep | 1000  | 11.77 ms | 12.10 ms | 11.77 s | 75.42%  |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::sync_work   | 1000  | 3.14 ms  | 4.59 ms  | 3.14 s  | 20.11%  |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::sync_alloc  | 1000  | 1.23 µs  | 2.62 µs  | 1.23 ms | 0.01%   |
+------------------------+-------+----------+----------+---------+---------+

alloc-bytes - Exclusive allocation bytes by each function.
Total: 1.1 MB
+------------------------+-------+----------+----------+-----------+---------+
| Function               | Calls | Avg      | P95      | Total     | % Total |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::sync_alloc  | 1000  | 1.0 KB   | 1.0 KB   | 1000.0 KB | 89.28%  |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::main        | 1     | 120.1 KB | 120.1 KB | 120.1 KB  | 10.72%  |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::async_sleep | 1000  | 0 B      | 0 B      | 0 B       | 0.00%   |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::sync_work   | 1000  | 0 B      | 0 B      | 0 B       | 0.00%   |
+------------------------+-------+----------+----------+-----------+---------+

cpu - CPU sampling attribution per function (exclusive).
+------------------------+---------+---------+
| Function               | Samples | % Total |
+------------------------+---------+---------+
| cpu_basic::sync_work   | 1915914 | 56.13%  |
+------------------------+---------+---------+
| cpu_basic::async_sleep | 14056   | 0.41%   |
+------------------------+---------+---------+
| cpu_basic::sync_alloc  | 1581    | 0.05%   |
+------------------------+---------+---------+
samply load /tmp/hotpath/61089-1778083683167502000/hp.json.gz

hotpath timing, alloc, and CPU usage report.
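
To make the timing columns concrete, here is a minimal sketch of how Calls, Avg, P95, and Total could be derived from a list of recorded call durations. It is not hotpath's implementation: the names `TimingStats`, `summarize`, and `timed` are hypothetical, and the nearest-rank percentile is an assumption, since the report does not say which percentile method it uses.

```rust
use std::time::{Duration, Instant};

/// Aggregated timing stats for one function (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct TimingStats {
    calls: usize,
    avg: Duration,
    p95: Duration,
    total: Duration,
}

/// Summarize recorded call durations. Returns `None` when nothing was recorded.
/// P95 uses the nearest-rank method (an assumption): the value at
/// 1-based rank ceil(0.95 * n) in sorted order.
fn summarize(samples: &[Duration]) -> Option<TimingStats> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    let calls = sorted.len();
    let total: Duration = sorted.iter().sum();
    let avg = total / calls as u32;
    // Integer form of ceil(0.95 * calls); always in 1..=calls.
    let rank = (calls * 95 + 99) / 100;
    let p95 = sorted[rank - 1];
    Some(TimingStats { calls, avg, p95, total })
}

/// Run `f` once and push its wall-clock duration into `samples`.
fn timed<T>(samples: &mut Vec<Duration>, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    samples.push(start.elapsed());
    out
}

fn main() {
    let mut samples = Vec::new();
    for i in 0..100u64 {
        timed(&mut samples, || std::hint::black_box(i.wrapping_mul(7)));
    }
    let stats = summarize(&samples).expect("at least one sample");
    println!("{stats:?}");
}
```

The sketch records wall-clock time, so a function that mostly waits, like async_sleep, accumulates a large Total without using much CPU. That is why the timing and cpu tables can rank the same functions differently.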

Optionally,...
