Your code is fast – if you're lucky

chrka2 pts0 comments

Lucky Code<br>Your code is fast - if you&rsquo;re lucky

sort.h - a quicksort with sorting networks

// SPDX-License-Identifier: MIT<br>// sort.h - Branchless Quicksort<br>// (c) christof.kaser@gmail.com

#ifndef SORT_H<br>#define SORT_H

#ifndef BLQS_CMP<br>#define BLQS_CMP(a, b) ((a)<br>#include<br>#define min(a, b) (((a) = UNROLL) for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *left++;<br>if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }<br>else { *sw = x; sw++; }<br>while (left endp + UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *right--;<br>if (BLQS_CMP(x, piv)) { *sw = x; sw++; }<br>else { *rwr = x; rwr--; }

while (right - left >= UNROLL &&<br>(rwr - right > UNROLL || left - lwr > UNROLL)) {

while (rwr - right > UNROLL && right - left >= UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *left++;<br>if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }<br>else { *rwr = x; rwr--; }<br>while (left - lwr > UNROLL && right - left >= UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *right--;<br>if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }<br>else { *rwr = x; rwr--; }<br>do {<br>while (rwr > right && left right) && left 11) {<br>BLQS_TYPE* mid = partition_small(left, right);<br>smallsort(left, mid - 1);<br>left = mid + 1;<br>sorting_network(left, right - left);

static void sortr(BLQS_TYPE* left, BLQS_TYPE* right) {<br>while (1) {<br>ptrdiff_t partszm1 = right - left;<br>if (partszm1 left) sortr(left, mid - 1);<br>BLQS_TYPE piv = *mid;<br>mid += 1;<br>// collect duplicates<br>for (BLQS_TYPE* p = mid; p<br>test.c - sorting 50 million doubles

// SPDX-License-Identifier: MIT<br>#include<br>#include<br>#include<br>#include

#define BLQS_CMP(a, b) ((a)<br>On macOS/M1 (Clang, -O3):

Time: 4.39

C++ std::sort needs 1.33 seconds for this.

A few cosmetic changes

It is already micro-optimized using sorting networks and loop unrolling. Only a few cosmetic changes remain.

We rewrite this beginner‑friendly style, which explicitly shows how the pointers are moved:

if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }<br>else { *rwr = x; rwr--; }

into a more idiomatic and compact C form:

if (BLQS_CMP(x, piv)) *lwr++ = x;<br>else *rwr-- = x;

sort.h - rewritten

// SPDX-License-Identifier: MIT<br>// blqsort.h - Branchless Quicksort<br>// (c) christof.kaser@gmail.com

#ifndef SORT_H<br>#define SORT_H

#ifndef BLQS_CMP<br>#define BLQS_CMP(a, b) ((a)<br>#include<br>#define min(a, b) (((a) = UNROLL) for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *left++;<br>if (BLQS_CMP(x, piv)) *lwr++ = x;<br>else *sw++ = x;<br>while (left endp + UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *right--;<br>if (BLQS_CMP(x, piv)) *sw++ = x;<br>else *rwr-- = x;

while (right - left >= UNROLL &&<br>(rwr - right > UNROLL || left - lwr > UNROLL)) {

while (rwr - right > UNROLL && right - left >= UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *left++;<br>if (BLQS_CMP(x, piv)) *lwr++ = x;<br>else *rwr-- = x;<br>while (left - lwr > UNROLL && right - left >= UNROLL) {<br>for (int i = UNROLL; i--;) {<br>BLQS_TYPE x = *right--;<br>if (BLQS_CMP(x, piv)) *lwr++ = x;<br>else *rwr-- = x;

do {<br>while (rwr > right && left right) && left 11) {<br>BLQS_TYPE* mid = partition_small(left, right);<br>smallsort(left, mid - 1);<br>left = mid + 1;<br>sorting_network(left, right - left);

static void sortr(BLQS_TYPE* left, BLQS_TYPE* right) {<br>while (1) {<br>ptrdiff_t partszm1 = right - left;<br>if (partszm1 left) sortr(left, mid - 1);<br>BLQS_TYPE piv = *mid;<br>mid += 1;<br>// collect duplicates<br>for (BLQS_TYPE* p = mid; p<br>A quick test to make sure everything still works.

Time: 0.70

More than 6 times faster than before, and nearly twice as fast as std::sort. That’s quite something.

But what actually happened?

This &ldquo;small cosmetic&rdquo; change causes Clang to replace branches with csel.

With branches

; x20 = left, x9 = right, d8 = pivot

loop:<br>ldr d0, [x12], #8<br>fcmp d0, d8<br>b.pl ge_case<br>str d0, [x20], #8 ; left++<br>b next<br>ge_case:<br>str d0, [x9], #-8 ; right--<br>next:<br>cmp x12, x_end<br>b.lt loop

Fast with csel (branchless)

; x20 = left, x11 = right, d8 = pivot, x10 = 8

loop:<br>ldr d0, [x12], #8 ; load val<br>fcmp d0, d8 ; compare<br>csel x13, x20, x11, mi ; if = (!! fix)<br>str d0, [x13] ; store<br>add x20, x20, x14 ; update left<br>sub x11, x11, x15 ; update right<br>cmp x12, x_end<br>b.lt loop

On x86, Clang behaves similarly: with the compact if, it generates branchless code using cmov (conditional move).

GCC does not exhibit this &ldquo;quirk&rdquo; (different code generation for logically equivalent source). It consistently generates the slower branch-based version.

Links

blqsort - Fast Quicksort with C and C++ Interface

When &lsquo;if&rsquo; slows you down, avoid it

Interactive sorting demo

christof.kaser@gmail.com

left right unroll blqs_type blqs_cmp while

Related Articles