Tuning a Server for Benchmarking | David Álvarez Rosa | Personal Site<br>June 25, 2026<br>Optimizing code starts with measuring it, and a measurement is only<br>useful if it is repeatable: a 2% improvement is invisible under 5% of<br>noise. Yet on an untuned machine the same binary can easily run several<br>percent faster or slower between runs. In this post we take a tiny<br>benchmark and tune the machine step by step, re-measuring after every<br>change, until runs become deterministic.<br>Note that tuning for<br>benchmarking is not the same as tuning for performance: a benchmark<br>wants the machine repeatable, even at the cost of some peak speed. A<br>production box, however, wants every last bit of speed.<br>A noisy baseline<br>Our running example sums an array of doubles, in short bursts. Real<br>services rarely hammer the CPU continuously: they handle a request, sit<br>idle, and wake up for the next one. Each timed iteration here runs a<br>burst of 256 sums after a 2 ms idle gap, with the gap excluded from the<br>measurement.<br>PauseTiming / ResumeTiming keep the sleep out of the<br>measured time, and DoNotOptimize keeps the result alive past the<br>optimizer; without it the compiler deletes the entire loop.<br>#include <array><br>#include <chrono><br>#include <numeric><br>#include <thread><br>#include <benchmark/benchmark.h><br>static auto BM_Sum(benchmark::State& state) -> void {<br>  alignas(64) static std::array<double, 4096> data;<br>  std::iota(data.begin(), data.end(), 0.0);<br>  for (auto _ : state) {<br>    state.PauseTiming(); // Idle between bursts, like a real service<br>    std::this_thread::sleep_for(std::chrono::milliseconds(2));<br>    state.ResumeTiming();<br>    for (auto i = 0; i < 256; ++i) {<br>      auto sum = std::accumulate(data.cbegin(), data.cend(), 0.0);<br>      benchmark::DoNotOptimize(sum);<br>    }<br>  }<br>}
BENCHMARK(BM_Sum);
Compile it in release mode with full optimizations: -O3 -march=native -mtune=native -flto -ffast-math. Then run ten repetitions and<br>aggregate them<br>$ ./benchmark --benchmark_repetitions=10 --benchmark_min_time=200x<br>BM_Sum_mean 99575 ns<br>BM_Sum_stddev 2704 ns<br>BM_Sum_cv 2.72 %
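The cv row can be reproduced from the other two rows, since it is the standard deviation expressed as a percentage of the mean. A minimal sketch of that arithmetic follows; the function name `cv_percent` is hypothetical and not part of the benchmark library.

```cpp
#include <cassert>

// Hypothetical helper (not a benchmark-library API): coefficient of
// variation as a percentage, i.e. 100 * stddev / mean.
inline double cv_percent(double mean, double stddev) {
  assert(mean > 0.0);  // timings are positive; also rules out dividing by zero
  return 100.0 * stddev / mean;
}
```

With the baseline numbers above, `cv_percent(99575.0, 2704.0)` evaluates to roughly 2.72, which matches the BM_Sum_cv row.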
The interesting line is cv, the coefficient of variation: standard<br>deviation divided by mean. Almost 3% of run-to-run noise means any<br>optimization smaller than that is invisible. Let’s bring it down.<br>Know your hardware<br>Before turning any knob, look at what you are tuning. lstopo draws<br>the whole machine in one picture: caches, cores, SMT pairs, and the PCIe<br>devices hanging off them. Start with my laptop<br>Figure 1: My laptop (Intel Core Ultra 5 135U). Three kinds of cores: two P-cores with two hardware threads each (dotted), eight E-cores in clusters of four sharing an L2, and two low-power E-cores (bottom left) sitting outside the L3 entirely.<br>Here the choice of core changes what you measure: land on CPU 4 and you<br>get an E-core at lower clocks; on CPU 12 you lose the L3 too. Now<br>compare that against my homelab server<br>Figure 2: My homelab server (AMD Ryzen 7 PRO 8700GE). Eight identical cores with identical caches; the NVMe drives and the NIC hang off PCIe on the right.<br>On the server every core is as good as any other: homogeneous machines<br>make better benchmarking boxes. The PCIe side matters once a benchmark<br>touches I/O: it shows which NVMe or NIC you are exercising and, on<br>multi-socket machines, which NUMA node it hangs off.<br>Pin to a core<br>The scheduler is free to migrate the benchmark between cores, and every<br>migration throws away warm caches. On hybrid CPUs it’s worse:<br>performance and efficiency cores run the same code at very different<br>speeds, so results turn bimodal depending on where the process lands.<br>Pin the benchmark to a single core (on hybrid parts, a P-core)<br>$ taskset -c 2 ./benchmark ...
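taskset applies the affinity from outside the process. The same restriction can also be set from inside the binary, which may be handy when the launcher cannot be changed. A minimal sketch, assuming Linux with glibc; `pin_to_cpu` is a hypothetical helper, while `sched_setaffinity` and the `CPU_*` macros are the standard glibc interfaces.

```cpp
#include <sched.h>  // sched_setaffinity and CPU_* macros (Linux, glibc)

#include <cassert>

// Hypothetical helper: restrict the calling thread to one logical CPU.
// Returns true on success, false if the kernel rejects the request
// (for example when that CPU is outside the allowed set).
inline bool pin_to_cpu(int cpu) {
  assert(cpu >= 0 && cpu < CPU_SETSIZE);
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  // A pid of 0 means "the calling thread".
  return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Calling `pin_to_cpu(2)` before the benchmarks run would mirror `taskset -c 2`; checking the return value matters, because a failed call leaves the thread free to migrate.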
The mean falls to 55.3 µs and the CV more than halves, to 1.06%.<br>The win is bigger than migration costs alone would suggest: every burst<br>now wakes the same core, so that core’s clock never has time to sag<br>between bursts.<br>Lock the CPU frequency<br>By default Linux scales the CPU frequency with load, so the benchmark<br>starts on a cold, slow clock and finishes on a hot, fast one. Switch<br>the frequency governor to performance to keep clocks locked high<br>$ sudo cpupower frequency-set --governor performance
and verify it took effect<br>$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor<br>performance
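The check above reads only cpu0. A minimal sketch that walks every cpuN directory and reports the ones still on another governor could look like this; it assumes C++17 and the usual Linux sysfs layout, and `cpus_not_using` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cctype>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: list the cpuN entries under `root` whose
// cpufreq/scaling_governor differs from `expected`. CPUs without that
// file are skipped.
inline std::vector<std::string> cpus_not_using(
    const std::string& expected,
    const fs::path& root = "/sys/devices/system/cpu") {
  assert(!expected.empty());
  std::vector<std::string> mismatched;
  std::error_code ec;  // a missing root yields an empty range, not an exception
  for (const auto& entry : fs::directory_iterator(root, ec)) {
    const std::string name = entry.path().filename().string();
    // Keep cpu0, cpu1, ...; skip siblings such as cpufreq and cpuidle.
    if (name.size() < 4 || name.compare(0, 3, "cpu") != 0 ||
        !std::isdigit(static_cast<unsigned char>(name[3]))) {
      continue;
    }
    std::ifstream file(entry.path() / "cpufreq" / "scaling_governor");
    std::string governor;
    if (file >> governor && governor != expected) {
      mismatched.push_back(name);
    }
  }
  return mismatched;
}
```

An empty result from `cpus_not_using("performance")` means no CPU reported a different governor. It does not prove that every CPU exposes cpufreq at all, so the single-file check above remains a useful first look.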
Re-measuring gives a mean of 54.9 µs and a CV of 0.79%. The<br>increment looks modest only because pinning already kept our core’s<br>clock warm: on its own, the performance governor takes the unpinned<br>baseline from 99.6 µs straight to 54.5 µs. Either way, no burst ever<br>wakes up on a cold clock again.<br>Disable hyperthreading<br>Our pinned CPU still shares its execution units and L1/L2 caches with its SMT<br>sibling: anything the scheduler places there perturbs our measurement.<br>Disable SMT entirely<br>$ echo off | sudo tee /sys/devices/system/cpu/smt/control
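To see which logical CPUs share a core with the pinned one, Linux exposes the sibling list in sysfs. A minimal sketch, assuming the standard topology/thread_siblings_list file, whose content is a CPU list such as "2,10" or "0-1"; both helper names are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: expand a sysfs CPU list such as "2,10" or "0-1"
// into individual logical CPU numbers. Malformed items are skipped.
inline std::vector<int> parse_cpu_list(const std::string& text) {
  std::vector<int> cpus;
  std::istringstream items(text);
  std::string item;
  while (std::getline(items, item, ',')) {
    std::istringstream range(item);
    int first = 0;
    if (!(range >> first)) continue;  // empty or non-numeric item
    int last = first;
    char dash = 0;
    if (range >> dash && dash == '-') {
      if (!(range >> last)) continue;  // "N-" without an upper bound
    }
    for (int cpu = first; cpu <= last; ++cpu) cpus.push_back(cpu);
  }
  return cpus;
}

// Hypothetical helper: logical CPUs sharing a physical core with `cpu`,
// including `cpu` itself. Returns an empty vector if the file is missing.
inline std::vector<int> smt_siblings(int cpu) {
  assert(cpu >= 0);
  std::ifstream file("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                     "/topology/thread_siblings_list");
  std::string line;
  std::getline(file, line);
  return parse_cpu_list(line);
}
```

A result from `smt_siblings(2)` with more than one entry means CPU 2 still shares its core with another hardware thread.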
The CV drops to 0.26%, three times better: the core now has its<br>execution units and caches all to itself.<br>Disable turbo boost<br>Even with the performance governor, turbo frequencies vary with<br>temperature and power budget: the...