Accelerating Copy_if Using SIMD

chkmr1 pts0 comments

Accelerating copy_if using SIMD | Chaitanya Kumar's BlogTable of ContentsIntroduction<br>First SIMD Attempt<br>First Moment of (Bitter) Truth<br>A Crash Course on CPU Microarchitecture and PMCs<br>The Top-Down Analysis using Performance CountersLevel 1<br>Level 2<br>Retiring Microcode<br>Profiling with AMD IBS

The Fix and Final Moment of Truth<br>What&rsquo;s Left<br>Conclusion<br>AppendixBenchmark SetupSources of variance<br>Disabling SMT<br>Setting Thread Affinity<br>Increasing scheduling priority of the benchmark thread<br>Putting it all together

llvm-mca

Introduction#<br>I have a Zen 4 CPU with a bunch of AVX512 feature flags. So I thought - let&rsquo;s<br>try and use it to implement something, even if it&rsquo;s in the realm of<br>wheel-reinvention. I started with the following goals.<br>Implement an algorithm that cannot be vectorized by my optimizing compiler,<br>even with a polyhedral loop model.<br>Systematically analyze its performance and answer the questionsIs it as fast as it can be?<br>If not, why? And how can we fix it?

Start simple, make it work.<br>Which means that dead simple algorithms like map/transform, reduce,<br>adjacent_difference etc are out, as they are very autovectorizable. Even 2D stencils are out because look at this. So, I<br>settled on std::copy_if.<br>Implementing a SIMD implementation is the easy part. Figuring its perforamnce<br>out ended up being less trivial than I anticipated. I already knew the tools<br>that I will need.<br>Google benchmark for writing microbenchmarks<br>likwid-bench for determining performance upper bound on my machine<br>llvm-mca for simulating the kernel on its model of Zen 4<br>perf-stat for drill-down performance analysis by counting events<br>From cppreference,<br>std::copy_if is a dead-simple algorithm.<br>templateclass InputIt, class OutputIt, class UnaryPred><br>OutputIt copy_if(InputIt first, InputIt last,<br>OutputIt d_first, UnaryPred pred)<br>for (; first != last; ++first)<br>if (pred(*first))<br>*d_first = *first;<br>++d_first;

return d_first;

The codegen is also very clean (compiler explorer link). It<br>is however non-trivial to vectorize because of a loop-carried dependency: the<br>value of d_first in iteration i+1 depends on the value of pred(*first) in<br>iteration i. Let us measure our baseline before we go about vectorizing.<br>These are the dimensions along with we can measure performance.<br>Input size (henceforth n)<br>Choice of predicate function<br>Input distribution<br>Input entropy<br>The problem size (1) is trivial to sweep over; varying n results in different interactions<br>with the memory subsystem (caches, hardware prefetchers, DRAM etc).<br>The predicate and distribution together determine the density/sparsity of the<br>output. E.g. the predicate [](auto x){ return x > 0; } along with a uniformly<br>distributed input in the range (-1000,1000) results in an expected 50% of the<br>input values being copied over.<br>The entropy is not orthogonal to the distribution, but it&rsquo;s worth mentioning<br>separately. Perhaps I need to think of a better name too. This deterines how<br>predictable the input is, because all pipelined CPUs have branch-prediction<br>logic. E.g. if the CPU frontend (FE) finds a conditional jump instruction, it will<br>not wait for its operand to be ready and will instead speculatively jump to a<br>target address. Misspeculation reults in a large penalty requiring a complete<br>pipeline flush and restarting execution. The same predicate and distribution<br>combination as above can make it difficult for most branch predictors to have a<br>high branch-miss-rate, thereby adversely affecting throughput.<br>In the interest of brevity, we fix the predicate (x > 0) and distribution<br>(uniform in (-1000,1000)), and sweep over the problem size. The performance<br>analysis methods that we shall use here generalize well for tuning the<br>implementation for inputs along other dimensions.<br>We use likwid-bench<br>Figure 1. Speed (MB/s) achieved by the copy and copy_avx512 benchmarks in likwid-benchReproduce using the following commands:<br>$ for size in 16kB 64kB 256kB 1MB 4MB 16MB 64MB 256MB 1GB 4GB; do<br>bw=$(likwid-pin -c 1 likwid-bench -t copy_avx512 -w S0:${size}:1 2>/dev/null |<br>grep "MByte/s" |<br>awk '{print $NF}'); \<br>echo "$size $bw";<br>done

$ for size in 16kB 64kB 256kB 1MB 4MB 16MB 64MB 256MB 1GB 4GB; do<br>bw=$(likwid-pin -c 1 likwid-bench -t copy -w S0:${size}:1 2>/dev/null |<br>grep "MByte/s" |<br>awk '{print $NF}');<br>echo "$size $bw";<br>done<br>First SIMD Attempt#<br>There are three parts to the loop body.<br>Load from &input[i]<br>Evaluate predicate to get a bool value<br>Conditionally store the loaded value to destination based on the previous<br>result and update output counter/pointer.<br>1 and 2 are straightforward in most SIMD implementations. Let N be the width<br>of the SIMD registers. E.g. in<br>AVX-512 for loading 32-bit values, N = 512/32 = 16.<br>Load into a SIMD register from &input[i]_mm512_loadu_epi32 and friends (TODO: add link to Intel intrinsics reference)

Evalute predicate on SIMD register to get a SIMD mask value (TODO: add<br>footnote about masks)For our predicate (>(0)), const auto zero = _mm512_setzero_epi32(); return...

simd first size input predicate likwid

Related Articles