Show HN: blqsort – Fast Branchless Quicksort with C++ Interface

chrka1 pts0 comments

GitHub - chkas/blqsort: Fast Branchless Quicksort with C++ Interface · GitHub

/" data-turbo-transient="true" />

Skip to content

Search or jump to...

Search code, repositories, users, issues, pull requests...

-->

Search

Clear

Search syntax tips

Provide feedback

--><br>We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Cancel

Submit feedback

Saved searches

Use saved searches to filter your results more quickly

-->

Name

Query

To see all available qualifiers, see our documentation.

Cancel

Create saved search

Sign in

/;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up

Appearance settings

Resetting focus

You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

{{ message }}

chkas

blqsort

Public

Notifications<br>You must be signed in to change notification settings

Fork

Star

main

BranchesTags

Go to file

CodeOpen more actions menu

Folders and files<br>NameNameLast commit message<br>Last commit date<br>Latest commit

History<br>9 Commits<br>9 Commits

LICENSE

LICENSE

README.md

README.md

blqs.h

blqs.h

test_double.cpp

test_double.cpp

test_int128.cpp

test_int128.cpp

test_str.cpp

test_str.cpp

test_struct.cpp

test_struct.cpp

View all files

Repository files navigation

blqsort

blqsort is a fast branchless quicksort implementation for C++ that outperforms std::sort and pdqsort on random datasets.

On modern CPUs, avoiding branch misprediction is a key technique to speed up programs: When 'if' slows you down, avoid it.

Performance results naturally depend on the underlying hardware. The following benchmarks show the execution times for sorting 50 million doubles using different sorting implementations. The measurements were taken on an Apple M1 system using Clang and on an AMD Ryzen 3 system using GCC, both compiled with the -O3 option.

Implementation<br>Apple M1<br>AMD Ryzen

std::sort<br>1.33s<br>5.56s

pdqsort<br>1.33s<br>2.81s

blqsort<br>1.01s<br>2.06s

This paper by Edelkamp and A. Weiß shows how partitioning performance in Quicksort can be improved by avoiding conditional branches.

The strategy of using an auxiliary buffer for branchless partitioning is inspired by fluxsort. The “auxiliary buffer” here means a 512‑element stack array, not heap memory.

To avoid the O(n²) runtime caused by bad input data, the program can group identical elements together and switch to heapsort for that specific part if it detects a big imbalance during partitioning. The program also checks if a partition is already sorted.

For larger parts, it uses a median-of-medians strategy to find a good pivot. In addition, critical partitioning loops are explicitly unrolled.

For 2 to 12 elements, the algorithm uses custom sorting networks. This approach requires a separate code path for each size but sorts small subsets with very few swaps using a branchless sort‑2 primitive. Source for sorting networks

As a result, blqsort becomes faster than, for example, std::sort and pdqsort for random numbers.

For types with higher copy or move costs (such as strings), the buffer-based branchless approach becomes less efficient. In this case, a BlockQuicksort variant is used, where only element indices are processed branchlessly, and the actual data is moved using fewer swap operations.

Usage

You only need to include blqsort.h, and it can be used just as easily as std::sort.

#include "blqs.h"<br>double data[SIZE];<br>blqs::sort(data, data + SIZE);

Sorting Custom Data Structures

In practice, we often need to sort custom data structures. This is where SIMD libraries like Google Highway - while very fast for simple numbers - become difficult to use.

Using std::sort or blqs::sort gives you much more flexibility;

struct entry {<br>int32_t id;<br>int32_t value;<br>bool operator

Execution times for sorting 50 million of this structs.

Implementation<br>Apple M1<br>AMD Ryzen

std::sort<br>3.46s<br>4.75s

pdqsort<br>3.46s<br>4.72s

blqsort<br>0.97s<br>2.20s

About

Fast Branchless Quicksort with C++ Interface

Resources

Readme

License

MIT license

Uh oh!

There was an error while loading. Please reload this page.

Activity

Stars

stars

Watchers

watching

Forks

forks

Report repository

Releases

No releases published

Packages

Uh oh!

There was an error while loading. Please reload this page.

Contributors

Uh oh!

There was an error while loading. Please reload this page.

Languages

C++<br>100.0%

You can’t perform that action at this time.

sort blqsort data branchless using reload

Related Articles