Trust your compiler: Modern C++

foxhill1 pts0 comments

Categorica

Trust your compiler<br>C++ programmers absorb a lot of performance wisdom by osmosis: fast inverse square root1, XOR swap2, hand-unrolled loops3, Duff’s device4, exceptions are slow, virtual calls are slow, division is slow, lookup tables often beat maths. Even if once true, hardware, software, and compiler advances have drastically changed the landscape to the point where many are not now.<br>John Carmack’s DEC Alpha and Pentium Pro bear little resemblance to a modern Zen 5 x86_64 core - a piece of silicon who’s branch predictor alone likely has more transistors than all the workstations in the iD offices of the time. On top of that, LLVM’s Clang and GNU’s GCC are advanced enough to write optimal code from the naive input. For instance, both compilers have optimisation passes that essentially check if a code segment is actually just popcount in disguise5. In fact, writing “clever” code may be “worse” code; it could obscure intent from the optimiser, potentially costing you vectorisation, inlining, and target-specific lowering.<br>In this article we walk through a few old tricks, compare them with the obvious version, and see how they behave. The numbers below were produced on:<br>Ubuntu 24.04 LTS (With updated Linux kernel 6.18.1)<br>AMD Ryzen 9 9950X, 16 cores / 32 threads<br>128 GB of DDR5/3600 memory<br>Clang 21.1.1, -O3 -ffast-math -mtune=native<br>The post comes with benchmark code. All benchmarks use Google Benchmark. 6<br>Content<br>Part 1: New dogs, old tricks<br>Fast inverse square root<br>Popcount and bit-twiddling<br>Numerical recipes and row pointers<br>const& everywhere vs forwarding intent<br>Part 2: The library<br>Ranges and algorithms<br>Exceptions vs std::expected vs error codes<br>Virtual vs static polymorphism<br>How to use the benchmarks<br>Closing remarks<br>Part 1: New dogs, old tricks<br>Fast inverse square root<br>The Q_rsqrt from Quake III7, reproduced here:<br>float Q_rsqrt( float number )<br>long i;<br>float x2, y;<br>const float threehalfs = 1.5F;

x2 = number * 0.5F;<br>y = number;<br>i = * ( long * ) &y;<br>i = 0x5f3759df - ( i >> 1 );<br>y = * ( float * ) &i;<br>y = y * ( threehalfs - ( x2 * y * y ) );<br>return y;<br>} There are numerous articles explaining how this works1, which we will skip here. In short, late 90s CPUs floating point units were nothing like today’s. During the development of Quake 3 Arena, iD’s developers discovered that a lot of time was being spent determining vertex normals used for their new lighting model. Much of this time was being spent in one operation: the inverse square root. Their approach was to forgo accuracy, and take a “estimate and improve” approach for this one operation. It could beat the naive 1.0f / sqrtf(x) by a wide margin: FSQRT and FDIV could take over 500 cycles in the 8087 FPU, while iD’s method used cheap integer operations plus one Newton iteration 8.<br>Modern CPUs<br>Intel introduced rsqrtss and rsqrtps to the x86 family of architectures with SSE 9; AVX later added the vrsqrtss and vrsqrtps forms for wider SIMD widths. These instructions compute an approximate reciprocal square root directly, with a documented error bound and substantially lower latency than x87 fsqrt10. ARMv8/AArch64 has comparable reciprocal-square-root estimate instructions11:<br>ISAScalarPackedWidthx86-64 SSErsqrtssrsqrtps4 x f32x86-64 AVXvrsqrtssvrsqrtps8 x f32ARMv8 NEONfrsqrte (scalar)frsqrte4 x f32 Why this matters?<br>Writing this code in modern C++ might look something like:<br>constexpr float Q_rsqrt(float number) noexcept {<br>static_assert(sizeof(float) == sizeof(std::uint32_t));<br>auto i = std::bit_caststd::uint32_t>(number);<br>auto magic = 0x5f3759dfu - (i >> 1);<br>auto y = std::bit_castfloat>(magic);<br>return y * (1.5f - (number * 0.5f * y * y));<br>} Compare that with the naive version:<br>constexpr float naive_rsqrt(float x) noexcept {<br>return 1.0f / std::sqrt(x);<br>} Now compare the relevant assembly produced with -std=c++23 -O3 -ffast-math -march=znver4 on Compiler Explorer using Clang 21.1.0. The benchmark run used -march=native; the assembly excerpt uses znver4 so the target is explicit. The -ffast-math part matters: it lets the compiler make aggressive, potentially lossy floating-point assumptions1213, so this example assumes positive, finite inputs and does not preserve every strict IEEE floating-point edge case.<br>Q_rsqrt(float):<br>movd eax, xmm0<br>sar eax<br>mov ecx, 1597463007<br>sub ecx, eax<br>mulss xmm0, dword ptr [rip + .LCPI0_0]<br>movd xmm1, ecx<br>movdqa xmm2, xmm1<br>mulss xmm2, xmm1<br>mulss xmm0, xmm2<br>addss xmm0, dword ptr [rip + .LCPI0_1]<br>mulss xmm0, xmm1<br>ret

naive_rsqrt(float):<br>vrsqrtss xmm1, xmm0, xmm0<br>vmulss xmm0, xmm0, xmm1<br>vfmadd213ss xmm0, xmm1, dword ptr [rip + .LCPI1_0]<br>vmulss xmm1, xmm1, dword ptr [rip + .LCPI1_1]<br>vmulss xmm0, xmm1, xmm0<br>ret Benchmarking<br>This makes sense in theory, but is it actually true now?<br>BenchmarkTime (scalar)Time (n=1024)Time (n=65536)Q_sqrt380 ns24.5 ns1865 nsnaive_rsqrt373 ns25.0 ns2161 ns The scalar case is effectively a draw, with the obvious version slightly ahead. In the array kernels, the Quake version is somewhat faster in this...

float xmm0 xmm1 square code number

Related Articles