Branchless Quicksort<br>Fast Branchless Quicksort using Sorting-Networks with C++ Interface
On modern CPUs, avoiding branch misprediction is a key technique to speed up programs: When ‘if’ slows you down, avoid it,
Performance results naturally depend on the underlying hardware. The following benchmarks show the execution times for sorting 50 million doubles using different sorting implementations. The measurements were taken on an Apple M1 system and on an AMD Ryzen 3 system with the -O3 compiler option.
Implementation<br>Apple M1<br>AMD Ryzen
std::sort<br>1.33s<br>5.56s
pdqsort<br>1.33s<br>2.81s
blqsort<br>1.02s<br>2.06s
blqsort
Full source code is included on the page in scrollable blocks.
The paper BlockQuicksort: How Branch Mispredictions don’t affect Quicksort by Edelkamp and A. Weiß shows how partitioning performance in Quicksort can be improved by avoiding conditional branches.
The strategy of using an auxiliary buffer for branchless partitioning is inspired by fluxsort. The “auxiliary buffer” here means a 512‑element stack array, not heap memory.
To avoid the O(n²) runtime caused by bad input data, the program can group identical elements together and switch to heapsort for that specific part if it detects a big imbalance during partitioning. The program also checks if a partition is already sorted.
For larger parts, it uses a median-of-medians strategy to find a good pivot. In addition, critical partitioning loops are explicitly unrolled.
For 2 to 12 elements, the algorithm uses custom sorting networks. This approach requires a separate code path for each size but sorts small subsets instantly with very few swaps. Source for sorting networks
By using a branchless sort‑2 primitive , all comparisons and swaps are expressed as a fixed sequence of operations, removing conditional branches entirely.
As a result, blqsort becomes faster than, for example, std::sort and pdqsort for random data.
blqs.h
// SPDX-License-Identifier: MIT<br>// blqs.h - Branchless Quicksort<br>// (c) 2026 christof.kaser@gmail.com
#ifndef BLQS_H<br>#define BLQS_H
#include<br>#include<br>#include<br>#include
namespace blqs {
template<br>struct Networks {<br>static inline void sort2(T& a, T& b, Compare comp) {<br>T x = a; T y = b;<br>bool m = comp(x, y);<br>a = m ? x : y; b = m ? y : x;<br>static inline void sort3(T& a, T& b, T& c, Compare comp) {<br>sort2(a, b, comp); sort2(b, c, comp); sort2(a, b, comp);<br>static inline void sort4(T& a, T& b, T& c, T& d, Compare comp) {<br>sort2(a, b, comp); sort2(c, d, comp); sort2(a, c, comp);<br>sort2(b, d, comp); sort2(b, c, comp);<br>static inline void sort5(T& a, T& b, T& c, T& d, T& e, Compare comp) {<br>sort2(b, c, comp); sort2(d, e, comp); sort2(b, d, comp);<br>sort2(a, c, comp); sort2(a, d, comp); sort2(c, e, comp);<br>sort2(a, b, comp); sort2(c, d, comp); sort2(b, c, comp);<br>static inline void sort6(T& a, T& b, T& c, T& d, T& e, T& f, Compare comp) {<br>sort2(a, b, comp); sort2(c, d, comp); sort2(e, f, comp);<br>sort2(a, c, comp); sort2(b, d, comp); sort2(e, f, comp);<br>sort2(a, e, comp); sort2(b, f, comp); sort2(c, e, comp);<br>sort2(d, f, comp); sort2(b, c, comp); sort2(d, e, comp);<br>sort2(c, d, comp);<br>static inline void sort7(T& a, T& b, T& c, T& d, T& e, T& f, T& g, Compare comp) {<br>sort2(a, g, comp); sort2(c, d, comp); sort2(e, f, comp);<br>sort2(a, c, comp); sort2(b, e, comp); sort2(d, g, comp);<br>sort2(a, b, comp); sort2(c, f, comp); sort2(d, e, comp);<br>sort2(b, c, comp); sort2(e, g, comp);<br>sort2(c, d, comp); sort2(e, f, comp);<br>sort2(b, c, comp); sort2(d, e, comp); sort2(f, g, comp);<br>static inline void sort8(T& a, T& b, T& c, T& d, T& e, T& f, T& g, T& h, Compare comp) {<br>sort2(a,b,comp); sort2(c,d,comp); sort2(e,f,comp); sort2(g,h,comp);<br>sort2(a,c,comp); sort2(b,d,comp); sort2(e,g,comp); sort2(f,h,comp);<br>sort2(b,c,comp); sort2(f,g,comp);<br>sort2(a,e,comp); sort2(b,f,comp); sort2(c,g,comp); sort2(d,h,comp);<br>sort2(c,e,comp); sort2(d,f,comp);<br>sort2(b,c,comp); sort2(d,e,comp); sort2(f,g,comp);<br>static inline void sort9(T& a, T& b, T& c, T& d, T& e, T& f, T& g, T& h, T& i, Compare comp) {<br>sort2(a,d,comp); sort2(b,h,comp); sort2(c,f,comp); sort2(e,i,comp);<br>sort2(a,h,comp); sort2(c,e,comp); sort2(d,i,comp); sort2(f,g,comp);<br>sort2(a,c,comp); sort2(b,d,comp); sort2(e,f,comp); sort2(h,i,comp);<br>sort2(b,e,comp); sort2(d,g,comp); sort2(f,h,comp);<br>sort2(a,b,comp); sort2(c,e,comp); sort2(d,f,comp); sort2(g,i,comp);<br>sort2(c,d,comp); sort2(e,f,comp); sort2(g,h,comp);<br>sort2(b,c,comp); sort2(d,e,comp); sort2(f,g,comp);<br>static inline void sort10(T& a, T& b, T& c, T& d, T& e, T& f, T& g, T& h, T& i, T& j, Compare comp) {<br>sort2(a,b,comp); sort2(c,f,comp); sort2(d,g,comp); sort2(e,h,comp); sort2(i,j,comp);<br>sort2(a,g,comp); sort2(b,i,comp); sort2(c,e,comp); sort2(d,j,comp); sort2(f,h,comp);<br>sort2(a,c,comp); sort2(b,d,comp); sort2(e,f,comp); sort2(g,i,comp); sort2(h,j,comp);<br>sort2(a,b,comp); sort2(c,h,comp); sort2(d,f,comp); sort2(e,g,comp); sort2(i,j,comp);<br>sort2(b,c,comp); sort2(d,e,comp); sort2(f,g,comp); sort2(h,i,comp);<br>sort2(b,d,comp); sort2(c,e,comp);...