Show HN: Fast, branchless single/multithreaded Quicksort for C/C++

chrka1 pts0 comments

GitHub - chkas/blqsort: Fast Branchless Quicksort for C and C++ - Single- and Multithreaded Versions · GitHub

/" data-turbo-transient="true" />

Skip to content

Search or jump to...

Search code, repositories, users, issues, pull requests...

-->

Search

Clear

Search syntax tips

Provide feedback

--><br>We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Cancel

Submit feedback

Saved searches

Use saved searches to filter your results more quickly

-->

Name

Query

To see all available qualifiers, see our documentation.

Cancel

Create saved search

Sign in

/;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up

Appearance settings

Resetting focus

You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

{{ message }}

chkas

blqsort

Public

Notifications<br>You must be signed in to change notification settings

Fork

Star

main

BranchesTags

Go to file

CodeOpen more actions menu

Folders and files<br>NameNameLast commit message<br>Last commit date<br>Latest commit

History<br>34 Commits<br>34 Commits

cpp

cpp

LICENSE

LICENSE

README.md

README.md

View all files

Repository files navigation

blqsort - fast branchless quicksort

There are four implementations of blqsort here, each provided as a single header file.

File<br>Description

blqsort.h<br>C Single-Threaded

blqsort_thr.h<br>C Multi-Threaded

blqs.h<br>C++ Single-Threaded

blqs_thr.h<br>C++ Multi-Threaded

blqsort is typically faster than std::sort and pdqsort.

Performance results naturally depend on the underlying hardware. The following benchmarks show the execution times for sorting 50 million doubles using different sorting implementations. The measurements were taken on an Apple M1 system using Clang and on an AMD Ryzen 3 (Linux) system using GCC, both compiled with the -O3 option. test_double.cpp

Implementation<br>Apple M1<br>AMD Ryzen

std::sort<br>1.33s<br>5.56s

pdqsort<br>1.33s<br>2.81s

blqsort (single threaded)<br>1.01s<br>2.06s

For a fair comparison, the single-threaded version of blqs was used here. On an M1, the threaded versions are another factor of 3 to 4 faster. In terms of runtime, the C++ versions differ only very little from the C version.

Branchless programming

On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This is much slower:

for (int i = 0; i 1000; i++) {<br>if (numbers[i] 500) {<br>small_numbers[smlen] = numbers[i];<br>smlen += 1;

than the branchless version:

for (int i = 0; i 1000; i++) {<br>small_numbers[smlen] = numbers[i];<br>smlen += (numbers[i] 500);

This paper by Edelkamp and Weiß shows how partitioning performance in Quicksort can be improved by avoiding conditional branches.

The strategy of using an auxiliary buffer for branchless partitioning is inspired by fluxsort. The “auxiliary buffer” here means a 512-element stack array, not heap memory.

Pivot strategy and sorting network

To avoid the O(n²) runtime caused by bad input data, the program can group identical elements together and switch to heapsort for that specific part if it detects a big imbalance during partitioning. The program also checks if a partition is already sorted.

For larger parts, it uses a median-of-medians strategy to find a good pivot. In addition, critical partitioning loops are explicitly unrolled.

For 2 to 12 elements, the algorithm uses custom sorting networks. This approach requires a separate code path for each size but sorts small subsets with very few swaps using a branchless sort-2 primitive. Source for sorting networks

C++

For types with higher copy or move costs, such as strings, the buffer-based branchless approach becomes less efficient. In this case, a BlockQuicksort variant is used, where only element indices are processed branchlessly, and the actual data is moved using fewer swap operations.

Usage

You only need to include blqs.h, and it can be used just as easily as std::sort.

#include "blqs.h"

double data[SIZE];

blqs::sort(data, data + SIZE);

For the C++ multithreaded variant, which uses C++ threads, include blqs_thr.h instead of blqs.h. The function call remains the same.

In C, the code specialized for the data type is generated using #define.

Usage

#define BLQS_CMP(a, b) ((a) #define BLQS_TYPE double<br>#include "blqsort.h"

double data[SIZE];

blqsort(data, SIZE);

For the C multithreaded variant, which uses POSIX threads, include blqsort_thr.h instead of blqsort.h. The function call remains the same here as well.

Sorting Custom Data Structures

In practice, we often need to sort custom data structures. This is where SIMD libraries like Google Highway - while very fast for simple numbers - become difficult to use.

Using std::sort or blqsort gives you much more flexibility.

C++

#include "blqs.h"

struct entry {<br>int id;<br>int value;

bool operatorconst entry& other)...

blqsort data branchless using single include

Related Articles