Accelerating Copy_if Using SIMD

Accelerating copy_if using SIMD | Chaitanya Kumar's Blog

Table of Contents
- Introduction
- First SIMD Attempt
- First Moment of (Bitter) Truth
- A Crash Course on CPU Microarchitecture and PMCs
- The Top-Down Analysis using Performance Counters
  - Level 1
  - Level 2
  - Retiring
  - Microcode
  - Profiling with AMD IBS
- The Fix and Final Moment of Truth
- What’s Left
- Conclusion
- Appendix
  - Benchmark Setup
    - Sources of variance
    - Disabling SMT
    - Setting Thread Affinity
    - Increasing scheduling priority of the benchmark thread
    - Putting it all together
  - llvm-mca

Introduction

I have a Zen 4 CPU with a bunch of AVX512 feature flags. So I thought - let’s try and use it to implement something, even if it’s in the realm of wheel-reinvention. I started with the following goals.

- Implement an algorithm that cannot be vectorized by my optimizing compiler, even with a polyhedral loop model.
- Systematically analyze its performance and answer the questions:
  - Is it as fast as it can be?
  - If not, why?
  - And how can we fix it?
- Start simple, make it work.

Which means that dead simple algorithms like map/transform, reduce, adjacent_difference etc. are out, as they are very autovectorizable. Even 2D stencils are out because look at this. So, I settled on std::copy_if.

Writing a SIMD implementation is the easy part. Figuring out its performance ended up being less trivial than I anticipated. I already knew the tools that I would need.

- Google benchmark for writing microbenchmarks
- likwid-bench for determining the performance upper bound on my machine
- llvm-mca for simulating the kernel on its model of Zen 4
- perf-stat for drill-down performance analysis by counting events

From cppreference, std::copy_if is a dead-simple algorithm.

```cpp
template<class InputIt, class OutputIt, class UnaryPred>
OutputIt copy_if(InputIt first, InputIt last, OutputIt d_first, UnaryPred pred)
{
    for (; first != last; ++first)
        if (pred(*first))
        {
            *d_first = *first;
            ++d_first;
        }

    return d_first;
}
```

The codegen is also very clean (compiler explorer link). It is, however, non-trivial to vectorize because of a loop-carried dependency: the value of d_first in iteration i+1 depends on the value of pred(*first) in iteration i.

Let us measure our baseline before we go about vectorizing. These are the dimensions along which we can measure performance.

1. Input size (henceforth n)
2. Choice of predicate function
3. Input distribution
4. Input entropy

The problem size (1) is trivial to sweep over; varying n results in different interactions with the memory subsystem (caches, hardware prefetchers, DRAM etc.). The predicate and distribution together determine the density/sparsity of the output. E.g. the predicate [](auto x){ return x > 0; } along with a uniformly distributed input in the range (-1000,1000) results in an expected 50% of the input values being copied over.

The entropy is not orthogonal to the distribution, but it’s worth mentioning separately. Perhaps I need to think of a better name too. This determines how predictable the input is, which matters because all pipelined CPUs have branch-prediction logic. E.g. if the CPU frontend (FE) finds a conditional jump instruction, it will not wait for its operand to be ready and will instead speculatively jump to a predicted target address. Misspeculation results in a large penalty, requiring a complete pipeline flush and restarting execution. The same predicate and distribution combination as above can make it difficult for most branch predictors to avoid a high branch-miss rate, thereby adversely affecting throughput.

In the interest of brevity, we fix the predicate (x > 0) and distribution (uniform in (-1000,1000)), and sweep over the problem size. The performance analysis methods that we shall use here generalize well for tuning the implementation for inputs along other dimensions. We use likwid-bench to find the copy throughput this machine can sustain at each size (Figure 1).
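To make the fixed predicate and distribution concrete, here is a minimal sketch of how such an input could be generated and fed to the scalar baseline. The helper names make_uniform_input and selectivity, the seed, and the choice of std::mt19937 are assumptions for illustration; the post does not show its generator. The sketch also treats the range as the closed interval [-1000, 1000], so the expected fraction copied is 1000/2001, just under 50%.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <random>
#include <vector>

// Hypothetical helper: n values drawn uniformly from [-1000, 1000].
// Generator and seed are assumptions, not taken from the post.
std::vector<std::int32_t> make_uniform_input(std::size_t n, std::uint32_t seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::int32_t> dist(-1000, 1000);
    std::vector<std::int32_t> v(n);
    std::generate(v.begin(), v.end(), [&] { return dist(rng); });
    return v;
}

// Hypothetical helper: runs the scalar baseline (std::copy_if with x > 0)
// and returns the fraction of input values that were copied to `out`.
double selectivity(const std::vector<std::int32_t>& in, std::vector<std::int32_t>& out) {
    out.clear();
    out.reserve(in.size());
    std::copy_if(in.begin(), in.end(), std::back_inserter(out),
                 [](auto x) { return x > 0; });
    return in.empty() ? 0.0
                      : static_cast<double>(out.size()) / static_cast<double>(in.size());
}
```

Calling selectivity(make_uniform_input(1 << 16), out) should report a value close to 0.5. Sorting the same input before the call keeps the selectivity identical but makes the branch trivially predictable, which is one way to separate the "distribution" and "entropy" dimensions in a benchmark.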
Figure 1. Speed (MB/s) achieved by the copy and copy_avx512 benchmarks in likwid-bench.

Reproduce using the following commands:

```
$ for size in 16kB 64kB 256kB 1MB 4MB 16MB 64MB 256MB 1GB 4GB; do bw=$(likwid-pin -c 1 likwid-bench -t copy_avx512 -w S0:${size}:1 2>/dev/null | grep "MByte/s" | awk '{print $NF}'); \
    echo "$size $bw"; done
```

```
$ for size in 16kB 64kB 256kB 1MB 4MB 16MB 64MB 256MB 1GB 4GB; do bw=$(likwid-pin -c 1 likwid-bench -t copy -w S0:${size}:1 2>/dev/null | grep "MByte/s" | awk '{print $NF}'); echo "$size $bw"; done
```

First SIMD Attempt

There are three parts to the loop body.

1. Load from &input[i]
2. Evaluate predicate to get a bool value
3. Conditionally store the loaded value to destination based on the previous result and update output counter/pointer.

1 and 2 are straightforward in most SIMD implementations. Let N be the width of the SIMD registers, counted in elements. E.g. in AVX-512 for loading 32-bit values, N = 512/32 = 16.

- Load into a SIMD register from &input[i]: _mm512_loadu_epi32 and friends (TODO: add link to Intel intrinsics reference)

- Evaluate the predicate on the SIMD register to get a SIMD mask value (TODO: add footnote about masks). For our predicate (> 0): const auto zero = _mm512_setzero_epi32(); return...
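The three parts of the loop body can be sketched with AVX-512F intrinsics as below. This is not the post's implementation, which is cut off above; it is one common way to vectorize the conditional store, using a compress-store to pack the selected lanes and a popcount of the mask to advance the output pointer. The function names copy_if_positive and copy_if_positive_avx512 are hypothetical, the predicate is hard-coded to x > 0, and the code assumes GCC or Clang on x86-64 (it falls back to std::copy_if elsewhere or when the CPU lacks AVX-512F).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

// Hypothetical name. The target attribute enables AVX-512F for this function
// only, so the rest of the file does not need -mavx512f.
__attribute__((target("avx512f")))
std::int32_t* copy_if_positive_avx512(const std::int32_t* in, std::size_t n,
                                      std::int32_t* out) {
    constexpr std::size_t N = 16;  // 512 bits / 32 bits per lane
    const __m512i zero = _mm512_setzero_si512();
    std::size_t i = 0;
    for (; i + N <= n; i += N) {
        const __m512i v = _mm512_loadu_si512(in + i);          // 1. load N values
        const __mmask16 m = _mm512_cmpgt_epi32_mask(v, zero);  // 2. predicate -> mask
        _mm512_mask_compressstoreu_epi32(out, m, v);           // 3. pack selected lanes
        out += __builtin_popcount(m);                          //    advance output
    }
    if (i < n) {
        // Tail: only the low (n - i) lanes are loaded and compared.
        const __mmask16 tail = static_cast<__mmask16>((1u << (n - i)) - 1u);
        const __m512i v = _mm512_maskz_loadu_epi32(tail, in + i);
        const __mmask16 m = _mm512_mask_cmpgt_epi32_mask(tail, v, zero);
        _mm512_mask_compressstoreu_epi32(out, m, v);
        out += __builtin_popcount(m);
    }
    return out;
}
#endif

// Hypothetical dispatcher: uses the AVX-512 path when the CPU supports it,
// otherwise the scalar std::copy_if baseline.
std::int32_t* copy_if_positive(const std::int32_t* in, std::size_t n,
                               std::int32_t* out) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512f")) {
        return copy_if_positive_avx512(in, n, out);
    }
#endif
    return std::copy_if(in, in + n, out, [](std::int32_t x) { return x > 0; });
}
```

The popcount is what removes the loop-carried dependency from the per-element branch: the output pointer still depends on the previous iteration, but it advances once per N elements by a value computed from the mask rather than once per taken branch. Whether the compress-store is fast on a given microarchitecture is a separate question that has to be measured.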
