Picollo: Modern HDR histogram and PMU counters for .NET

Introducing Picollo: modern HDR histogram and PMU counters for .NET · Victor Baybekov

Picollo is a new library for serious performance work in .NET. Picollo stands for Performance Instrumentation and Continuous Observation for Low-Level Optimization. It allows you to measure code performance with cycle-level precision, moving the time resolution to the picosecond level.

The initial public release of Picollo v0.1.0 contains two components. First, a modern version of an HDR histogram, which is much faster for recording data and easier to use. It has a thread-local implementation that allows scaling metrics collection across many threads with minimal impact on performance of the existing code. Second, a set of APIs for `perf_event_open`, including fast-path reads, that give raw access to PMU counters from .NET on Linux, WSL included.

HDR Histogram

An HDR histogram is a very useful tool to analyze not only networking latency, but any measurements that behave like a floor value plus some random noise. It is a very good fit for measuring code execution time and CPU counters.

How it works

The main purpose of a histogram is to compress the storage size required for observation counters. This is true for a standard and an HDR histogram. Compression comes at the cost of precision loss. The standard histogram usually keeps a bounded absolute precision, as defined by a bin width. The HDR histogram maintains the relative precision and can record a very large range of values with modest storage requirements.

There has already been, for many years, a standard/legacy implementation that works well in many languages. You can find the details on the website. Here I focus on my implementation, which is similar in the core idea but different in important details, and is much faster and easier to use, and also much easier to understand the implementation.

Here is the full implementation of the core algorithm (GitHub):

1internal readonly struct HdrBuckets 2{ 3 public readonly int BlockSize; 4 public readonly int BlockScale; 5 public readonly int BlockCount; 7 public double RelativeError => 0.5 / BlockSize; 9 public HdrBuckets(double relativeError = 0.001) 10 { 11 if (relativeError 0) 12 relativeError = 0.001; 13 else if (relativeError 0.000_001) 14 relativeError = 0.000_001; 15 else if (relativeError > 0.1) 16 relativeError = 0.1; 17 18 BlockSize = (int)BitOperations.RoundUpToPowerOf2((uint)(0.5 / relativeError)); 19 BlockScale = BitOperations.TrailingZeroCount(BlockSize); 20 BlockCount = 1 + 64 - BlockScale; 21 } 22 23 [MethodImpl(MethodImplOptions.AggressiveInlining)] 24 public nuint GetLogicalIndexForValue(ulong value) 25 { 26 int blockIndex = 64 - BitOperations.LeadingZeroCount(value >> BlockScale); 27 int stepScale = blockIndex - (blockIndex != 0 ? 1 : 0); // No branches, Roslyn recognizes it's just the result of != 28 ulong bucketIndexInBlock = (value >> stepScale) & ((1u BlockScale) - 1); 29 var index = (((nuint)(uint)blockIndex BlockScale) + (nuint)bucketIndexInBlock); 30 return index; 31 } 32}

It takes less than 10 important lines of code. Let’s review them.

Relative error or precision

It all starts with defining the maximum precision loss or relative error you can tolerate. Values between 1% and 0.1% are the most common. The code does some opinionated clamping of the provided value to keep it in the range of 0.0001% and 10%. We will see later that this parameter is the most important for the total counter storage size, so going below 0.1% only makes sense if your measurements are fundamentally that precise and you care about preserving that.

The relative error defines the block size. The value space is divided into blocks, each block has a fixed number of buckets that span over an increasing range of values, which I call a step. The block size must be a power of 2 for efficient indexing. But why do we have 0.5 as the numerator?

The first block is somewhat special. The buckets start at value zero with step 1. Imagine that we have the first block as [0, 1, 2, 3]. We are going to double the step and the block range later, but it’s hard to double a zero value, and the math does not work well without the second bucket that also has step 1: [4, 5, 6, 7]. Now we have 8 precise values without any error. The next bucket starts to quantize the values, it looks like: [8, 10, 12, 14], with the step value of 2. So, for example, a value 11 belongs to a bucket [10,12). This notation means 10 inclusive and 12 exclusive. Despite being integers, the values such as time are infinitely divisible, so one can think of integers as a computer artifact of rounding a real value to an integer, so 11.99 turns into 11.

The maximum relative error we get when using these 3 buckets to record is: relative_error = 1/8, or 12.5%. Here 1 is the step/2 and 8 is BlockSize*2. The first two blocks have no error, the third block starts to have some errors. The step is divided by 2 to get a midpoint, any value that falls...

Picollo: Modern HDR histogram and PMU counters for .NET

Related Articles

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self- Play

Old Reddit Is Down

The ultimate female fantasy – A feminist critique of Beauty and the Beast