Does bulk memmove speed up std:remove_if? (No.)

jandeboevrie1 pts0 comments

Does bulk memmove speed up `std::remove_if`? (No.) – Arthur O'Dwyer – Stuff mostly about C++

Does bulk memmove speed up std::remove_if? (No.)

This morning I was reading the umpteenth std-proposals thread proposing some variety of<br>unstable_remove<br>and it occurred to me that one odd thing about a swap-and-pop-based unstable_remove<br>is that it tends to replace large swaths of contiguous removals by reversing the<br>elements that are kept. For example (Godbolt):

template<br>BidirIt unstable_remove_if(BidirIt first, BidirIt last, Pred pred) {<br>while (true) {<br>first = std::find_if(first, last, pred);<br>if (first == last) return first;<br>while (true) {<br>--last;<br>if (first == last) return first;<br>if (!pred(*last)) break;<br>*first++ = std::move(*last);

int main() {<br>auto in234 = [](int x) { return 2 v = {1,2,3,4,5,6,7,8,9};<br>v.erase(unstable_remove_if(v.begin(), v.end(), in234), v.end());<br>// v is {1,9,8,7,5,6}

It would be more aesthetically pleasing to produce {1,7,8,9,5,6}: for trivially copyable<br>elements, this would be a single memmove. In a sense, “reversing the elements” is extra work<br>that we might save time by avoiding. But to avoid that work, we might have to do even more<br>work in the form of bookkeeping. I haven’t attempted to benchmark any such change to<br>unstable_remove.

But the same aesthetic consideration applies to the ordinary, stable std::remove_if.<br>Traditionally it’s a loop over operator=, like this:

template<br>FwdIt smooth_remove_if(FwdIt first, FwdIt last, Pred pred) {<br>FwdIt dfirst = std::find_if(first, last, pred);<br>if (dfirst != last) {<br>for (first = std::next(dfirst); first != last; ++first) {<br>if (!pred(*first)) {<br>*dfirst++ = std::move(*first);<br>return dfirst;

But what if we “chunked” the writes to *dfirst, like this?

template<br>FwdIt chunky_remove_if(FwdIt first, FwdIt last, Pred pred) {<br>FwdIt dfirst = std::find_if(first, last, pred);<br>if (dfirst != last) {<br>for (first = std::next(dfirst); first != last; ++first) {<br>if (!pred(*first)) {<br>FwdIt sfirst = first;<br>first = std::find_if(std::next(first), last, pred);<br>dfirst = std::move(sfirst, first, dfirst);<br>if (first == last) break;<br>return dfirst;

Delegating the work to the std::move algorithm<br>allows the library to use memmove for that work, if the element type happens to be trivially<br>copyable. The nested loop complicates the generated code (Godbolt),<br>but is it worth it? Do we actually save CPU cycles?

Well, if the average length of a “run” of survivors is long, then we<br>might expect the cost of a single large memmove to beat out the cost of operator=-in-a-loop.<br>But at the other extreme, if the average length of a run is only 1 element, then memmove won’t<br>save anything; we’ll be paying all the bookkeeping cost for none of the benefit.

By “bookkeeping” I mean not only the extra register and icache pressure of the more complicated<br>algorithm, but also the overhead of the call to memmove itself, and that memmove will necessarily<br>start by checking whether it needs to handle forward overlap. (That branch is completely predictable<br>— it doesn’t — but is still overhead absent from the “smooth” version.)

I wrote a little benchmark (backup)<br>to time a call to std::remove_if on an array of a million ints,<br>with a bit-masking predicate that removed half of the elements; one in 8;<br>one in 128; or one in 1024. I contrived this benchmark deliberately to play<br>to the “chunky” algorithm’s strengths: trivially copyable elements, with large swaths moved<br>contiguously.

The “smooth” column here is merely to prove that my smooth_remove_if implementation<br>is essentially the same as libc++’s default implementation:<br>my chunky_remove_if can’t blame its defeat on “library vendor magic.”<br>Yet it is defeated, and decisively so:

Elements removed<br>std<br>smooth<br>chunky<br>Penalty<br>smooth<br>assignments<br>chunky<br>memmoves

1 in 2<br>3122 us<br>3157 us<br>3797 us<br>+20%<br>499549<br>249971

1 in 8<br>1082 us<br>1105 us<br>1471 us<br>+33%<br>874707<br>109724

1 in 128<br>416 us<br>420 us<br>468 us<br>+11%<br>992148<br>7750

1 in 1024<br>327 us<br>326 us<br>396 us<br>+21%<br>998482<br>958

As expected, chunky_remove_if loses when the expected length of a memmove is short.<br>But it also loses when the expected length of a memmove is long! And, counterintuitively,<br>as we remove fewer elements (and thus execute more individual assignments in the “smooth”<br>case and fewer individual memmoves in the “chunky” case), smooth_remove_if gets faster and<br>faster, and chunky_remove_if’s performance penalty doesn’t budge.

I tentatively blame this on the branch predictor. The fewer elements we remove, the more<br>predictable the result of pred(*first). Removing half the elements is the worst case for<br>smooth_remove_if simply because that turns it into a tight loop over a single completely<br>unpredictable branch. Making that branch more predictable dominates every other consideration,<br>including any speedup we might get from memmove (which, after all, is also a tight loop over<br>a mostly predictable branch: “have we counted all the way to n yet?”)

Conclusion: “Chunking” the elements moved by remove_if into larger...

first last pred memmove dfirst elements

Related Articles