HFT Latency Monitoring with Probabilistic Calling Context

ibobev1 pts0 comments

HFT Latency Monitoring with Probabilistic Calling Context - HFT University

LIVE SUBMISSIONS — 3 DAYS

HFT Latency Monitoring with Probabilistic Calling Context

Published: June 09, 2026

HFT Latency Monitoring with Probabilistic Calling Context

You have a feed handler. It pulls a multicast stream off the exchange, parses each packet, decodes the ITCH messages packed inside, applies them to an order book, and hands the top of book to a strategy. Five stages, a few million messages a second, the whole thing pinned to one core isolated at boot. It has run clean for months.

Then one Tuesday the p99 on your tick-to-trade doubles and stays there, and you have ninety minutes until the open to find out why. You run perf, because that is what everyone does, and the flamegraph tells you parse_message is 18% of your cycles, the same 18% it was last week and will be next week. True and useless. The feed handler reaches the book builder through five different message types, the flamegraph collapsed all five into one bar, and the single path that went bad is averaged in with the four that are fine. Sampling found you a hot function. You needed a hot path, and those are not the same thing.

And the five message types are the benign version of the problem. Picture the call graph.

On a clean packet the book builder, and the stats callback hanging off it, are reached one way: event loop, packet parser, feed handler, book builder, stats. But A/B line arbitration is its own can of worms, and the morning the replay manager starts re-injecting packets it has already forwarded, those same two functions get hit a second way, through a path that detours into the replay manager and back. The flamegraph puts book_builder in the same place at the same percent either way. It cannot show you that half the hits are now arriving through a route that should never have fired, which is the one fact that would have pointed you straight at the bug.

What you actually want to know in production is which paths run, how often, and where each one spends its time. Not "the parser is hot." Rather: "the path event-loop → packet-parse → ITCH-add → book-insert ran 600 thousand times last second at a median of 24 cycles in the insert, with a p99 tail of 1500." That is a different question than sampling answers, and the tools that answer it, Callgrind and friends, are two orders of magnitude too slow to leave running on a live trading process.

So I built the thing that sits in the gap, and then I went looking for whether it already existed. The short version: the pieces are all old, the combination is not, and the whole thing is open source now at github.com/HFTrader/pathprof. Let me walk through what it does, what it costs in cycles, and the one question everyone asks first, which is why not just use XRay.

The trick is an old one with a new name

Here is the mechanism. Every function you care about gets a small gate at its entry and exit, and all the gate touches is one per-thread structure: a running hash that encodes the active path, and a shadow stack of frames.

struct Frame {<br>uint64_t enter_tsc; // when we entered, for the segment time<br>uint64_t hash_before; // running hash before this fn mixed in<br>uint64_t hash_after; // and after: this IS the path id if we are the leaf<br>uint64_t fn_key;<br>bool is_leaf; // did this frame call nobody?<br>};<br>struct Ctx {<br>uint64_t running_hash; // the live path encoding<br>int depth;<br>Frame stack[PP_MAX_DEPTH];<br>};

At entry the gate mixes the function's identity into the running hash, pushes a frame, and tells its parent it is no longer a leaf, because the parent just called a child. Nothing here allocates and nothing branches in a way you will mispredict.

inline void enter(uint64_t fn_key) {<br>Ctx& c = ctx();<br>Frame& f = c.stack[c.depth];<br>f.hash_before = c.running_hash;<br>c.running_hash = mix_hash(c.running_hash, splitmix(fn_key)); // mix me in<br>f.hash_after = c.running_hash;<br>f.is_leaf = true;<br>if (c.depth > 0) c.stack[c.depth - 1].is_leaf = false; // parent had a child<br>c.depth++;

The exit gate is where the path gets recorded. Pop the frame, and if it never had a child, it is a leaf: the hash you built on the way down is now the identity of one complete path from the event loop to here. Look it up, bump it, fold in the segment times, then restore the parent's hash and you are back where you were.

inline void leave() {<br>Ctx& c = ctx();<br>Frame& f = c.stack[--c.depth];<br>if (f.is_leaf) // called nobody, so a complete path ends here<br>record_leaf(c, c.depth, rdtsc());<br>c.running_hash = f.hash_before; // pop: restore the parent's hash

That is the whole hot path. No allocation, no trace to decode later, the table holds the answer already. The lookup is a fixed-size open-addressed table keyed by the hash, and the only cost that is not a handful of cycles is the first time a given path is seen, when you snapshot its function names into the slot so the output can name it.

If that sounds familiar it should. Maintaining a single integer...

path hash frame depth book uint64_t

Related Articles