Lucky Code<br>Your code is fast - if you’re lucky
sort.h - a quicksort with sorting networks
// SPDX-License-Identifier: MIT<br>// sort.h - Branchless Quicksort<br>// (c) christof.kaser@gmail.com
#ifndef SORT_H<br>#define SORT_H
#ifndef BLQS_CMP<br>#define BLQS_CMP(a, b) ((a)<br>#include<br>#define min(a, b) (((a) = UNROLL) for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *left++;<br>if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }<br>else { *sw = x; sw++; }<br>while (left endp + UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *right--;<br>if (BLQS_CMP(x, piv)) { *sw = x; sw++; }<br>else { *rwr = x; rwr--; }
while (right - left >= UNROLL &&<br>(rwr - right > UNROLL || left - lwr > UNROLL)) {
while (rwr - right > UNROLL && right - left >= UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *left++;<br>if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }<br>else { *rwr = x; rwr--; }<br>while (left - lwr > UNROLL && right - left >= UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *right--;<br>if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }<br>else { *rwr = x; rwr--; }<br>do {<br>while (rwr > right && left right) && left 11) {<br>BLQS_TYPE* mid = partition_small(left, right);<br>smallsort(left, mid - 1);<br>left = mid + 1;<br>sorting_network(left, right - left);
static void sortr(BLQS_TYPE* left, BLQS_TYPE* right) {<br>while (1) {<br>ptrdiff_t partszm1 = right - left;<br>if (partszm1 left) sortr(left, mid - 1);<br>BLQS_TYPE piv = *mid;<br>mid += 1;<br>// collect duplicates<br>for (BLQS_TYPE* p = mid; p<br>test.c - sorting 50 million doubles
// SPDX-License-Identifier: MIT<br>#include<br>#include<br>#include<br>#include
#define BLQS_CMP(a, b) ((a)<br>On macOS/M1 (Clang, -O3):
Time: 4.39
C++ std::sort needs 1.33 seconds for this.
A few cosmetic changes
It is already micro-optimized using sorting networks and loop unrolling. Only a few cosmetic changes remain.
We rewrite this beginner‑friendly style, which explicitly shows how the pointers are moved:
if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }<br>else { *rwr = x; rwr--; }
into a more idiomatic and compact C form:
if (BLQS_CMP(x, piv)) *lwr++ = x;<br>else *rwr-- = x;
sort.h - rewritten
// SPDX-License-Identifier: MIT<br>// blqsort.h - Branchless Quicksort<br>// (c) christof.kaser@gmail.com
#ifndef SORT_H<br>#define SORT_H
#ifndef BLQS_CMP<br>#define BLQS_CMP(a, b) ((a)<br>#include<br>#define min(a, b) (((a) = UNROLL) for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *left++;<br>if (BLQS_CMP(x, piv)) *lwr++ = x;<br>else *sw++ = x;<br>while (left endp + UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *right--;<br>if (BLQS_CMP(x, piv)) *sw++ = x;<br>else *rwr-- = x;
while (right - left >= UNROLL &&<br>(rwr - right > UNROLL || left - lwr > UNROLL)) {
while (rwr - right > UNROLL && right - left >= UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *left++;<br>if (BLQS_CMP(x, piv)) *lwr++ = x;<br>else *rwr-- = x;<br>while (left - lwr > UNROLL && right - left >= UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *right--;<br>if (BLQS_CMP(x, piv)) *lwr++ = x;<br>else *rwr-- = x;
do {<br>while (rwr > right && left right) && left 11) {<br>BLQS_TYPE* mid = partition_small(left, right);<br>smallsort(left, mid - 1);<br>left = mid + 1;<br>sorting_network(left, right - left);
static void sortr(BLQS_TYPE* left, BLQS_TYPE* right) {<br>while (1) {<br>ptrdiff_t partszm1 = right - left;<br>if (partszm1 left) sortr(left, mid - 1);<br>BLQS_TYPE piv = *mid;<br>mid += 1;<br>// collect duplicates<br>for (BLQS_TYPE* p = mid; p<br>A quick test to make sure everything still works.
Time: 0.70
More than 6 times faster than before, and nearly twice as fast as std::sort. That’s quite something.
But what actually happened?
This “small cosmetic” change causes Clang to replace branches with csel.
With branches
; x20 = left, x9 = right, d8 = pivot
loop:<br>ldr d0, [x12], #8<br>fcmp d0, d8<br>b.pl ge_case<br>str d0, [x20], #8 ; left++<br>b next<br>ge_case:<br>str d0, [x9], #-8 ; right--<br>next:<br>cmp x12, x_end<br>b.lt loop
Fast with csel (branchless)
; x20 = left, x11 = right, d8 = pivot, x10 = 8
loop:<br>ldr d0, [x12], #8 ; load val<br>fcmp d0, d8 ; compare<br>csel x13, x20, x11, mi ; if = (!! fix)<br>str d0, [x13] ; store<br>add x20, x20, x14 ; update left<br>sub x11, x11, x15 ; update right<br>cmp x12, x_end<br>b.lt loop
On x86, Clang behaves similarly: with the compact if, it generates branchless code using cmov (conditional move).
GCC does not exhibit this “quirk” (different code generation for logically equivalent source). It consistently generates the slower branch-based version.
Links
blqsort - Fast Quicksort with C and C++ Interface
When ‘if’ slows you down, avoid it
Interactive sorting demo
christof.kaser@gmail.com