This Goes to Eleven (Part 1/∞) - damageboy
-->
You are using an outdated browser. Please upgrade your browser to improve your experience.
This Goes to Eleven (Part 1/∞)
Decimating Array.Sort with AVX2.
I ended up going down the rabbit hole re-implementing array sorting with AVX2 intrinsics.<br>There’s no reason I should go down alone.
39 minute read
GitHub
Nuget
damageboy
What did I do this time?
Follow
Website
Stack Overflow
Custom Social Profile Link
-->
Let’s do this
Let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.
Everyone needs to sort arrays, once in a while, and many algorithms we take for granted rely on doing so. We think of it as a solved problem and that nothing can be further done about it in 2020, except for waiting for newer, marginally faster machines to pop-up1. However, that is not the case, and while I’m not the first to have thoughts about it; or the best at implementing it, if you join me in this rather long journey, we’ll end up with a replacement function for Array.Sort, written in pure C# that outperforms CoreCLR’s C++2 code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop.
Sounds interesting? If so, down the rabbit hole we go…
Note
In the final days before posting this series, Intel started seeding a CPU microcode update that is/was affecting the performance of the released version of CoreCLR 3.0/3.1 quite considerably. I managed to stir up a small commotion as this was unraveling in my benchmarks. As it happened, my code was (not coincidentally) less affected by this change, while CoreCLRs Array.Sort() took a 20% nosedive. Let it never be said I’m nothing less than chivalrous, for I rolled back the microcode update, and for this entire series, I’m going to run against a much faster version of Array.Sort() than what you, the reader, are probably using, Assuming you update your machine from time to time. For the technically inclined, here’s a whole footnote3 on how to double-check what your machine is actually running. I also opened two issues in the CoreCLR repo about attempting to mitigate this both in CoreCLRs C++ code and separately in the JIT. If/when there is movement on those fronts, the microcode you’re running will become less of an issue, to begin with, but for now, this just adds another level of unwarranted complexity to our lives.
A while back now, I was reading the post by Stephen Toub about Improvements in CoreCLR 3.0, and it became apparent that hardware intrinsics were common to many of these, and that so many parts of CoreCLR could still be sped up with these techniques, that one thing led to another, and I decided an attempt to apply hardware intrinsics to a larger problem than I had previously done myself was in order. To see if I could rise to the challenge, I decided to take on array sorting and see how far I can go.
What I came up with eventually would become a re-write of Array.Sort() with AVX2 hardware intrinsics. Fortunately, choosing sorting and focusing on QuickSort makes for a great blog post series, since:
Everyone should be familiar with the domain and even the original (sorting is the bread and butter of learning computer science, really, and QuickSort is the queen of all sorting algorithms).
It’s relatively easy to explain/refresh on the original.
If I can make it there, I can make it anywhere.
I had no idea how to do it.
I started with searching various keywords and found an interesting paper titled: Fast Quicksort Implementation Using AVX Instructions by Shay Gueron and Vlad Krasnov. That title alone made me think this is about to be a walk in the park. While initially promising, it wasn’t good enough as a drop-in replacement for Array.Sort for reasons I’ll shortly go into. I ended up having a lot of fun expanding on their basic approach. I will submit a proper pull-request to start a discussion with CoreCLR devs about integrating this code into the main dotnet repository4, but for now, let’s talk about sorting.
Since there’s a lot to go over here, I’ve split it up into no less than 6 parts:
In this part, we start with a refresher on QuickSort and how it compares to Array.Sort(). If you don’t need a refresher, skip it and get right down to part 2 and onwards. I recommend skimming through, mostly because I’ve got excellent visualizations which should be in the back of everyone’s mind as we deal with vectorization & optimization later.
In part 2, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.
In part 3, we go through the initial code for the vectorized sorting, and we’ll start seeing some payoff. We finish agonizing courtesy of the CPU’s Branch Predictor, throwing a wrench into our attempts.
In part 4, we...