Branchless sorting of trivially relocatable types

Branchless sorting of trivially relocatable types – Arthur O'Dwyer – Stuff mostly about C++

A few days ago Christof Kaser posted a very impressive blog post on “Fast Branchless Quicksort using Sorting-Networks” (chkas/blqsort). A “branchless” algorithm is one designed to exploit modern processors’ conditional-move instructions. So for example the blqs::sort2 primitive, which looks like this:

    template<class T, class Compare>
    void sort2(T& a, T& b, Compare comp) {
        T x = a;
        T y = b;
        bool m = comp(x, y);
        a = m ? x : y;
        b = m ? y : x;
    }

when instantiated for int compiles down to a couple of cmov instructions on x86-64 and a couple of csel instructions on ARM64. (Godbolt.)
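To make the shape of that primitive concrete, here is a minimal sketch that composes it into a three-element sorting network. The `sort3` helper is hypothetical (it is not taken from blqsort), and whether the compiler really emits conditional moves is up to the optimizer and the target.

```cpp
#include <cassert>
#include <functional>

// Same shape as the sort2 primitive quoted above: read both values,
// compare once, then write both slots with conditional selects.
template<class T, class Compare>
void sort2(T& a, T& b, Compare comp) {
    T x = a;
    T y = b;
    bool m = comp(x, y);
    a = m ? x : y;   // the value that comp orders first goes to a
    b = m ? y : x;   // the other value goes to b
}

// Hypothetical helper (not from blqsort): a three-element sorting network.
// The sequence of sort2 calls is fixed, so control flow never depends on the data.
inline void sort3(int& a, int& b, int& c) {
    sort2(a, b, std::less<int>{});
    sort2(b, c, std::less<int>{});
    sort2(a, b, std::less<int>{});
}
```

Calling `sort3` on three ints holding 3, 1, 2 leaves them as 1, 2, 3 without any data-dependent `if` in the source.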

But at the higher generic-programming level, count all the copy operations in sort2! It copies a into x; then copy-assigns x back into a. If T were an expensive-to-copy type like std::string, this would be slow code; and if T were unique_ptr, it wouldn’t compile at all. Therefore, blqsort enables its entire branchless “fast path” only for types that are trivially copyable and roughly register-sized.
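One way to actually count those copies is with an instrumented type. The sketch below is only an illustration: `Counted`, `LessByValue`, and `copies_made_by_one_sort2` are hypothetical names, and the `static inline` members assume C++17.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical instrumented type: counts copy-constructions and copy-assignments.
struct Counted {
    int value;
    static inline int copy_ctors = 0;
    static inline int copy_assigns = 0;

    explicit Counted(int v) : value(v) {}
    Counted(const Counted& o) : value(o.value) { ++copy_ctors; }
    Counted& operator=(const Counted& o) {
        value = o.value;
        ++copy_assigns;
        return *this;
    }
};

// Comparator takes its arguments by reference, so it adds no copies of its own.
struct LessByValue {
    bool operator()(const Counted& l, const Counted& r) const { return l.value < r.value; }
};

template<class T, class Compare>
void sort2(T& a, T& b, Compare comp) {
    T x = a;              // copy-construct
    T y = b;              // copy-construct
    bool m = comp(x, y);
    a = m ? x : y;        // copy-assign
    b = m ? y : x;        // copy-assign
}

// Runs one sort2 and reports the total number of copy operations it performed.
inline int copies_made_by_one_sort2() {
    Counted::copy_ctors = 0;
    Counted::copy_assigns = 0;
    Counted a(2), b(1);
    sort2(a, b, LessByValue{});
    return Counted::copy_ctors + Counted::copy_assigns;
}

// The same sort2 cannot be instantiated for a move-only type such as unique_ptr.
static_assert(!std::is_copy_constructible_v<std::unique_ptr<int>>);
```

With this comparator, one call performs two copy-constructions and two copy-assignments, which is cheap for `int` and potentially four allocations' worth of work for a string-like type.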

As of this writing the gating condition is std::is_trivially_copyable<T>::value together with an upper bound on sizeof(T), but I’ve pointed out to Christof that his heap_sort also depends on T being trivially default-constructible. It’s also possible (if pathological) for T to be trivially copyable yet not copy-constructible or (more commonly) not copy-assignable. But this blog post isn’t really about narrowing the gate; we’re going to broaden it instead!
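For illustration only, a narrowed gate along those lines might be spelled as below. The trait name `use_branchless_path_v` and the 16-byte limit `kMaxBranchlessBytes` are assumptions made up for this sketch; they are not blqsort's actual names or threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical size limit; blqsort's real threshold is not reproduced here.
inline constexpr std::size_t kMaxBranchlessBytes = 16;

// Hypothetical gate: trivially copyable, plus the individual operations
// that the copy-based fast path actually spells out in code.
template<class T>
inline constexpr bool use_branchless_path_v =
    std::is_trivially_copyable_v<T> &&
    std::is_trivially_default_constructible_v<T> &&
    std::is_copy_constructible_v<T> &&
    std::is_copy_assignable_v<T> &&
    sizeof(T) <= kMaxBranchlessBytes;

// Trivially copyable, yet not copy-assignable: the const member deletes operator=.
struct ConstMember { const int x; };

static_assert(std::is_trivially_copyable_v<ConstMember>);
static_assert(!std::is_copy_assignable_v<ConstMember>);
static_assert(!use_branchless_path_v<ConstMember>);

static_assert(use_branchless_path_v<int>);
static_assert(!use_branchless_path_v<std::string>);  // not trivially copyable
```

`ConstMember` shows why the holistic trait alone is not enough for code that writes `a = x`: the type is a bag of bits, but the assignment expression still does not compile.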

“Trivially copyable” is what I call a “holistic” trait: it means something about the behavior of the entire type, rather than about just one special member function or just one kind of expression. And specifically what it means is that you can do any value-semantic operation — bringing new copies into existence, poofing them out of existence, overwriting one’s value with another’s, swapping or permuting or copying — as if the objects were just bags of bits. As long as you never try to invent a new value out of whole cloth, you can shift copies of your given values around as much as you like, even into completely new areas of memory, simply by memcpying them. In the following diagram, each box represents a C++ object, and the color of the box represents the object’s value (for example 42, or 3.14, or “hello,” or Tuesday).

We see that in the “After” picture, the green object has become blue. Was that by copy-assignment from the original blue object? or move-assignment? or copying from the yellow, destroying, and then copy-constructing from a blue object? With trivial copyability, we needn’t say! Each of those possible operation-sequences is guaranteed to be physically tantamount to simply memcpying the “blue” bytes into their final location.
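In code, the "bag of bits" guarantee looks roughly like the following sketch. `Point`, `overwrite_by_bytes`, and `fill_by_bytes` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// A small trivially copyable type standing in for one "box" in the picture.
struct Point { int x; double y; };
static_assert(std::is_trivially_copyable_v<Point>);

// Overwrite dst's value with src's by copying bytes. For a trivially copyable
// type this has the same effect as `dst = src`.
inline void overwrite_by_bytes(Point& dst, const Point& src) {
    std::memcpy(&dst, &src, sizeof(Point));
}

// Turn one value into n copies of it, again purely by copying bytes.
inline void fill_by_bytes(Point* out, std::size_t n, const Point& src) {
    for (std::size_t i = 0; i != n; ++i) {
        std::memcpy(out + i, &src, sizeof(Point));
    }
}
```

Note that `fill_by_bytes` duplicates a value, which is exactly the kind of operation that trivial copyability permits and that a weaker trait would not.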

“Trivially relocatable” (the widely deployed P1144 idiom, I mean, not the unusable version that was briefly merged into the C++26 draft in 2025) is another “holistic” trait. Specifically what trivially relocatable means is that you can do any affine value-semantic operation — swapping, permuting, relocating from one place to another — as if the objects were just bags of bits. As long as you preserve the number of copies of each given value, you can shift that particular set of values around as much as you like, even into completely new areas of memory, simply by memcpying them. (But unlike two paragraphs ago, with trivially relocatable you’re not allowed to turn one value into two, or poof a value completely out of existence: each and every input value must be represented the same number of times in the final output.)

As long as whatever highest-level algorithm we’re doing preserves this “affine,” one-to-one property, every possible operation-sequence is guaranteed to be physically tantamount to simply memcpying the bytes around.
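As a rough sketch of what that means for a type that is not trivially copyable, here is a swap that just exchanges bytes. `swap_by_bytes` is a hypothetical helper, and the sketch assumes that `T` is trivially relocatable in the P1144 sense (true of `std::unique_ptr<int>` on the mainstream implementations). Published ISO C++ does not sanction this for non-trivially-copyable types, so treat it as an illustration of the idiom rather than portable code.

```cpp
#include <cassert>
#include <cstring>
#include <memory>

// ASSUMPTION: T is trivially relocatable (P1144 sense). This is not checked
// here, and ISO C++ does not bless byte-wise swaps of non-trivially-copyable T.
template<class T>
void swap_by_bytes(T& a, T& b) {
    alignas(T) unsigned char tmp[sizeof(T)];
    // The casts to void* make it explicit that we are treating objects as raw bytes.
    std::memcpy(tmp, static_cast<void*>(&a), sizeof(T));
    std::memcpy(static_cast<void*>(&a), static_cast<void*>(&b), sizeof(T));
    std::memcpy(static_cast<void*>(&b), tmp, sizeof(T));
}
```

Each value still exists exactly once afterward, so each pointee is still freed exactly once. Duplicating a `unique_ptr` this way would break that count, which is why the trait only covers one-to-one operations. A type that stores a pointer into itself is the classic case where even this is not safe.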

The above images come from a ten-slide presentation on holistic traits I wrote in mid-2024. See the whole slide deck here.

Algorithms that have this “affine” one-to-one property include swap, rotate, partition, and… sort! Imagine rewriting blqs::sort2 like this (Godbolt):

    template<class T, class Compare>
    void sort2(T& a, T& b, Compare comp) {
        union U { T t; U() {} ~U() {} };
        U x, y;
        std::relocate_at(&a, &x.t);
        std::relocate_at(&b, &y.t);
        bool m = comp(x.t, y.t);
        std::relocate_at(m ? &x.t : &y.t, &a);
        std::relocate_at(m ? &y.t : &x.t, &b);
    }

This is conceptually closer to what our sort2 algorithm actually “needs” to do. It doesn’t really care that it’s making copies of T objects; conceptually it’s just bringing the values “closer to hand” (which could be a relocate), comparing its close-up copies, and then putting the values back “in memory” (which could be a relocate). For trivially copyable T, it’s totally fine to replace the first relocate with a copy-construct and the second relocate with a copy-assign: the end result is guaranteed to be the same, assuming it compiles at all. The relocation-based version merely extends that guarantee from “trivially...
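Since `std::relocate_at` is a P1144 facility rather than part of ISO C++, the relocation-based sort2 can be tried out today with a hand-rolled stand-in. The sketch below assumes a hypothetical `relocate_at_fallback` that does "move-construct, then destroy the source"; it is not trivial relocation, but it exercises the same one-to-one data flow. `DerefLess` is likewise a made-up comparator for the example.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <string>
#include <utility>

// Hypothetical stand-in for P1144's std::relocate_at(src, dst):
// move-construct *dst from *src, then end the lifetime of *src.
template<class T>
T* relocate_at_fallback(T* src, T* dst) {
    T* result = ::new (static_cast<void*>(dst)) T(std::move(*src));
    src->~T();
    return result;
}

template<class T, class Compare>
void sort2(T& a, T& b, Compare comp) {
    union U { T t; U() {} ~U() {} };   // raw, properly aligned storage for one T
    U x, y;
    relocate_at_fallback(&a, &x.t);    // bring both values "close to hand"
    relocate_at_fallback(&b, &y.t);
    bool m = comp(x.t, y.t);           // caveat: if comp throws, a and b are left destroyed
    relocate_at_fallback(m ? &x.t : &y.t, &a);   // put the values back, in order
    relocate_at_fallback(m ? &y.t : &x.t, &b);
}

// Hypothetical comparator: orders smart pointers by what they point to.
struct DerefLess {
    template<class P>
    bool operator()(const P& l, const P& r) const { return *l < *r; }
};
```

Unlike the copy-based version, this compiles for move-only types such as `std::unique_ptr<int>`, and for `std::string` it performs moves instead of copies.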
