Your code is fast – if you're lucky


sort.h - a quicksort with sorting networks

    // SPDX-License-Identifier: MIT
    // sort.h - Branchless Quicksort
    // (c) christof.kaser@gmail.com

    #ifndef SORT_H
    #define SORT_H

    #ifndef BLQS_CMP
    #define BLQS_CMP(a, b) ((a) [...]
    #include [...]
    #define min(a, b) (((a) [...]
    [...] = UNROLL)
        for (int i = UNROLL; i--;) {
            BLQS_TYPE x = *left++;
            if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }
            else { *sw = x; sw++; }
    [...]
    while (left [...] endp + UNROLL) {
        for (int i = UNROLL; i--;) {
            BLQS_TYPE x = *right--;
            if (BLQS_CMP(x, piv)) { *sw = x; sw++; }
            else { *rwr = x; rwr--; }
    [...]

while (right - left >= UNROLL && (rwr - right > UNROLL || left - lwr > UNROLL)) {

    while (rwr - right > UNROLL && right - left >= UNROLL) {
        for (int i = UNROLL; i--;) {
            BLQS_TYPE x = *left++;
            if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }
            else { *rwr = x; rwr--; }
    [...]
    while (left - lwr > UNROLL && right - left >= UNROLL) {
        for (int i = UNROLL; i--;) {
            BLQS_TYPE x = *right--;
            if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }
            else { *rwr = x; rwr--; }
    [...]
    do {
        while (rwr > right && left [...] right) && left [...] 11) {
            BLQS_TYPE* mid = partition_small(left, right);
            smallsort(left, mid - 1);
            left = mid + 1;
    [...]
    sorting_network(left, right - left);

    static void sortr(BLQS_TYPE* left, BLQS_TYPE* right) {
        while (1) {
            ptrdiff_t partszm1 = right - left;
            if (partszm1 [...] left) sortr(left, mid - 1);
            BLQS_TYPE piv = *mid;
            mid += 1;
            // collect duplicates
            for (BLQS_TYPE* p = mid; p [...]

test.c - sorting 50 million doubles

    // SPDX-License-Identifier: MIT
    #include [...]
    #include [...]
    #include [...]
    #include [...]

    #define BLQS_CMP(a, b) ((a) [...]

On macOS/M1 (Clang, -O3):

Time: 4.39

C++ std::sort needs 1.33 seconds for this.
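For readers who want to take comparable measurements, here is a minimal timing sketch. It is not the test.c used above: `time_sort` and `cmp_double` are hypothetical names, and the standard `qsort` stands in for the sort under test, so the absolute numbers will differ from those quoted here.

```c
#include <stdlib.h>
#include <time.h>

/* Three-way comparison of two doubles for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts n pseudo-random doubles and returns the elapsed CPU seconds,
   or -1.0 if allocation fails or the result is not sorted. */
static double time_sort(size_t n)
{
    double *v = malloc(n * sizeof *v);
    if (!v) return -1.0;

    srand(1);                                  /* fixed seed: repeatable input */
    for (size_t i = 0; i < n; i++)
        v[i] = (double)rand() / RAND_MAX;

    clock_t t0 = clock();
    qsort(v, n, sizeof *v, cmp_double);        /* stand-in for the sort under test */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    int ok = 1;                                /* verify the output is ordered */
    for (size_t i = 1; i < n; i++)
        if (v[i - 1] > v[i]) ok = 0;

    free(v);
    return ok ? secs : -1.0;
}
```

Calling `time_sort(50000000)` and printing the result with `printf("Time: %.2f\n", ...)` gives output in the same form as above. Checking that the output is sorted is worth the extra pass, because a fast but wrong sort is easy to write.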

A few cosmetic changes

The code is already micro-optimized with sorting networks and loop unrolling, so only a few cosmetic changes remain.
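For readers unfamiliar with the term: a sorting network sorts a small, fixed number of elements with a fixed sequence of compare-and-swap steps, so the control flow does not depend on the data. The sketch below is a generic 4-element network, not the one in sort.h. `cswap` and `sort4` are names chosen for illustration.

```c
/* Compare-and-swap: afterwards *a <= *b.
   Written with ternaries so the compiler may use min/max or
   conditional-select instructions instead of a branch. */
static void cswap(double *a, double *b)
{
    double lo = *b < *a ? *b : *a;
    double hi = *b < *a ? *a : *b;
    *a = lo;
    *b = hi;
}

/* 4-input sorting network with 5 comparators. */
static void sort4(double v[4])
{
    cswap(&v[0], &v[1]);
    cswap(&v[2], &v[3]);
    cswap(&v[0], &v[2]);   /* smallest element is now in v[0] */
    cswap(&v[1], &v[3]);   /* largest element is now in v[3]  */
    cswap(&v[1], &v[2]);   /* order the two middle elements   */
}
```

The same five comparisons run for every input, which is why such networks are a popular choice for the small partitions at the bottom of a quicksort.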

We rewrite this beginner‑friendly style, which explicitly shows how the pointers are moved:

    if (BLQS_CMP(x, piv)) { *lwr = x; lwr++; }
    else { *rwr = x; rwr--; }

into a more idiomatic and compact C form:

    if (BLQS_CMP(x, piv)) *lwr++ = x;
    else *rwr-- = x;
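To confirm that the two forms are logically equivalent, the sketch below runs both over the same input. It is a simplified stand-in, not the partition loop from sort.h: the function names, the `double` element type and the plain `<` comparison are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, verbose style: store first, then move the pointer.
   Lows are written upward from dst[0], highs downward from dst[n - 1].
   Assumption: dst[-1] is a valid slot, because rwr ends one below dst
   when every element goes to the high side. */
static size_t split_verbose(const double *src, size_t n, double piv, double *dst)
{
    double *lwr = dst;
    double *rwr = dst + n - 1;
    for (size_t i = 0; i < n; i++) {
        double x = src[i];
        if (x < piv) { *lwr = x; lwr++; }
        else { *rwr = x; rwr--; }
    }
    return (size_t)(lwr - dst);   /* number of elements below the pivot */
}

/* Same logic and same slack requirement, compact style. */
static size_t split_compact(const double *src, size_t n, double piv, double *dst)
{
    double *lwr = dst;
    double *rwr = dst + n - 1;
    for (size_t i = 0; i < n; i++) {
        double x = src[i];
        if (x < piv) *lwr++ = x;
        else *rwr-- = x;
    }
    return (size_t)(lwr - dst);
}
```

Given a destination buffer with one spare slot in front (for example `double buf[N + 1]`, passed as `buf + 1`), both functions produce identical output and return the same count. Any difference in speed therefore comes from code generation, not from the logic.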

sort.h - rewritten

    // SPDX-License-Identifier: MIT
    // blqsort.h - Branchless Quicksort
    // (c) christof.kaser@gmail.com

    #ifndef SORT_H
    #define SORT_H

    #ifndef BLQS_CMP
    #define BLQS_CMP(a, b) ((a) [...]
    #include [...]
    #define min(a, b) (((a) [...]
    [...] = UNROLL)
        for (int i = UNROLL; i--;) {
            BLQS_TYPE x = *left++;
            if (BLQS_CMP(x, piv)) *lwr++ = x;
            else *sw++ = x;
    [...]
    while (left [...] endp + UNROLL) {
        for (int i = UNROLL; i--;) {
            BLQS_TYPE x = *right--;
            if (BLQS_CMP(x, piv)) *sw++ = x;
            else *rwr-- = x;
    [...]

while (right - left >= UNROLL && (rwr - right > UNROLL || left - lwr > UNROLL)) {

    while (rwr - right > UNROLL && right - left >= UNROLL) {
        for (int i = UNROLL; i--;) {
            BLQS_TYPE x = *left++;
            if (BLQS_CMP(x, piv)) *lwr++ = x;
            else *rwr-- = x;
    [...]
    while (left - lwr > UNROLL && right - left >= UNROLL) {
        for (int i = UNROLL; i--;) {
            BLQS_TYPE x = *right--;
            if (BLQS_CMP(x, piv)) *lwr++ = x;
            else *rwr-- = x;
    [...]

    do {
        while (rwr > right && left [...] right) && left [...] 11) {
            BLQS_TYPE* mid = partition_small(left, right);
            smallsort(left, mid - 1);
            left = mid + 1;
    [...]
    sorting_network(left, right - left);
    [...]

Time: 0.70

More than 6 times faster than before, and nearly twice as fast as std::sort. That’s quite something.

But what actually happened?

This “small cosmetic” change causes Clang to replace the conditional branches with csel, the AArch64 conditional-select instruction.

With branches

; x20 = left, x9 = right, d8 = pivot

    loop:
        ldr   d0, [x12], #8
        fcmp  d0, d8
        b.pl  ge_case
        str   d0, [x20], #8       ; left++
        b     next
    ge_case:
        str   d0, [x9], #-8       ; right--
    next:
        cmp   x12, x_end
        b.lt  loop

Fast with csel (branchless)

; x20 = left, x11 = right, d8 = pivot, x10 = 8

    loop:
        ldr   d0, [x12], #8       ; load val
        fcmp  d0, d8              ; compare
        csel  x13, x20, x11, mi   ; select left if val < pivot, else right
        str   d0, [x13]           ; store
        add   x20, x20, x14       ; update left
        sub   x11, x11, x15       ; update right
        cmp   x12, x_end
        b.lt  loop

On x86, Clang behaves similarly: with the compact if, it generates branchless code using cmov (conditional move).
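If you would rather not depend on the compiler's choice, the selection can be spelled out in the source in the same shape as the csel loop: pick the destination address, store once, then move both pointers by 0 or 1. This is a sketch with hypothetical names, not code from sort.h, and a compiler remains free to lower the ternary to a branch anyway.

```c
#include <stddef.h>

/* Splits src[0..n) around piv into dst[0..n): elements below the pivot
   are packed from the front, all others from the back.
   Returns the number of elements below the pivot. */
static size_t split_select(const double *src, size_t n, double piv, double *dst)
{
    double *lwr = dst;        /* next free slot at the low end          */
    double *rwr = dst + n;    /* one past the next free high-end slot   */
    for (size_t i = 0; i < n; i++) {
        double x = src[i];
        int lt = x < piv;                    /* 1 if x goes left, else 0        */
        double *out = lt ? lwr : rwr - 1;    /* address select (csel/cmov candidate) */
        *out = x;                            /* single unconditional store      */
        lwr += lt;                           /* advances by 1 or 0              */
        rwr -= !lt;                          /* retreats by 1 or 0              */
    }
    return (size_t)(lwr - dst);
}
```

Keeping `rwr` one past the write position means the pointer never moves below `dst`, so no slack element is needed in front of the buffer.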

GCC does not exhibit this “quirk” (different code generation for logically equivalent source). It consistently generates the slower branch-based version.

Links

blqsort - Fast Quicksort with C and C++ Interface

When ‘if’ slows you down, avoid it

Interactive sorting demo

christof.kaser@gmail.com
