Branchless sorting of trivially relocatable types

ibobev1 pts0 comments

Branchless sorting of trivially relocatable types – Arthur O'Dwyer – Stuff mostly about C++

Branchless sorting of trivially relocatable types

A few days ago Christof Kaser posted a very impressive blog post on<br>“Fast Branchless Quicksort using Sorting-Networks”<br>(chkas/blqsort). A “branchless” algorithm is<br>one designed to exploit modern processors’ conditional-move instructions. So for<br>example the blqs::sort2 primitive, which looks like this:

template<br>void sort2(T& a, T& b, Compare comp) {<br>T x = a;<br>T y = b;<br>bool m = comp(x, y);<br>a = m ? x : y;<br>b = m ? y : x;

when instantiated for int compiles down to a couple of cmov instructions on x86-64<br>and a couple of csel instructions on ARM64. (Godbolt.)

But at the higher generic-programming level, count all the copy operations in sort2!<br>It copies a into x; then copy-assigns x back into a. If T were an expensive-to-copy<br>type like std::string, this would be slow code; and if T were unique_ptr, it wouldn’t<br>compile at all. Therefore, blqsort enables its entire branchless “fast path” only for<br>types that are trivially copyable and roughly register-sized.

As of this writing the gating condition is std::is_trivially_copyable::value && sizeof(T) ,<br>but I’ve pointed out to Christof that his heap_sort also depends on T to be trivially<br>default-constructible. It’s also possible (if pathological) for T to be trivially copyable<br>yet not copy-constructible or (more commonly) not copy-assignable. But this blog post isn’t<br>really about narrowing the gate; we’re going to broaden it instead!

“Trivially copyable” is what I call a “holistic” trait: it means something about the behavior<br>of the entire type, rather than about just one special member function or just one kind of expression.<br>And specifically what it means is that you can do any value-semantic operation — bringing new<br>copies into existence, poofing them out of existence, overwriting one’s value with another’s,<br>swapping or permuting or copying — as if the objects were just bags of bits. As long as you<br>never try to invent a new value out of whole cloth, you can shift copies of your given values<br>around as much as you like, even into completely new areas of memory, simply by memcpying them.<br>In the following diagram, each box represents a C++ object, and the color of the box represents<br>the object’s value (for example 42, or 3.14, or “hello,” or Tuesday).

We see that in the “After” picture, the green object has become blue. Was that by copy-assignment<br>from the original blue object? or move-assignment? or copying from the yellow, destroying, and<br>then copy-constructing from a blue object? With trivial copyability, we needn’t say! Each of those<br>possible operation-sequences is guaranteed to be physically tantamount to simply memcpying the<br>“blue” bytes into their final location.

“Trivially relocatable” (the widely deployed P1144 idiom, I mean, not the unusable version that<br>was briefly merged into the C++26 draft in 2025) is another “holistic” trait. Specifically<br>what trivially relocatable means is that you can do any affine value-semantic operation —<br>swapping, permuting, relocating from one place to another — as if the objects were just bags of<br>bits. As long as you preserve the number of copies of each given value, you can shift that<br>particular set of values around as much as you like, even into completely new areas of memory,<br>simply by memcpying them. (But unlike two paragraphs ago, with trivially relocatable you’re<br>not allowed to turn one value into two, or poof a value completely out of existence: each and<br>every input value must be represented the same number of times in the final output.)

As long as whatever highest-level algorithm we’re doing preserves this “affine,” one-to-one<br>property, every possible operation-sequence is guaranteed to be physically tantamount to<br>simply memcpying the bytes around.

The above images come from a ten-slide presentation on holistic traits I wrote in mid-2024.<br>See the whole slide deck here.

Algorithms that have this “affine” one-to-one property include swap, rotate, partition, and… sort!<br>Imagine rewriting blqs::sort2 like this (Godbolt):

template<br>void sort2(T& a, T& b, Compare comp) {<br>union U { T t; U() {} ~U() {} };<br>U x, y;<br>std::relocate_at(&a, &x.t);<br>std::relocate_at(&b, &y.t);<br>bool m = comp(x.t, y.t);<br>std::relocate_at(m ? &x.t : &y.t, &a);<br>std::relocate_at(m ? &y.t : &x.t, &b);

This is conceptually closer to what our sort2 algorithm actually “needs” to do.<br>It doesn’t really care that it’s making copies of T objects; conceptually it’s just<br>bringing the values “closer to hand” (which could be a relocate), comparing its close-up<br>copies, and then putting the values back “in memory” (which could be a relocate). For<br>trivially copyable T, it’s totally fine to replace the first relocate with a copy-construct<br>and the second relocate with a copy-assign: the end result is guaranteed to be the same,<br>assuming it compiles at all. The relocation-based version merely extends that guarantee<br>from “trivially...

trivially value copy branchless relocatable sort2

Related Articles