Std: Is_heap Could Be Faster

signa111 pts0 comments

`std::is_heap` could be faster – Arthur O'Dwyer – Stuff mostly about C++

std::is_heap could be faster

The other day I was noodling around with some libc++ unit-test code that looked<br>roughly like this (Godbolt):

template<br>auto extract_container(A& a) {<br>struct UnwrapAdaptor : A { A::container_type& cc = A::c; };<br>return UnwrapAdaptor(a).cc;

template<br>void test_push_range(bool is_heapified) {<br>int in1[] = {1,3,7};<br>int in2[] = {2,4,5,6};<br>int expected[] = {1,3,7,2,4,5,6};<br>Adaptor a;<br>a.push_range(in1);<br>a.push_range(in2);<br>if (auto c = extract_container(a); is_heapified) {<br>assert(std::ranges::is_heap(c));<br>assert(std::ranges::is_permutation(c, expected));<br>} else {<br>assert(std::ranges::equal(c, expected));

int main() {<br>test_push_range>(false);<br>test_push_range>(false);<br>test_push_range>(true);

I tried to extend main to test also some non-default containers.<br>(Incidentally, did you know stack’s default container is deque,<br>not vector?)

test_push_range>>(false);<br>test_push_range>>(false);

…And suddenly the unit test no longer compiled!

error: no matching function for call to object of type 'const __is_heap'<br>22 | assert(std::ranges::is_heap(c));<br>| ^~~~~~~~~~~~~~~~~~~~<br>[...]<br>note: because 'std::list &' does not satisfy 'random_access_range'

That caught me by surprise. Why should testing whether a range is heapified<br>require random access? It’s simple to implement with only forward traversal,<br>and make_heap/is_heap seems to analogize perfectly against sort/is_sorted.<br>is_sorted doesn’t require random access; why should is_heap?

Ranges algorithm<br>Constraint

sort<br>random_access_range

is_sorted<br>forward_range

is_sorted_until<br>forward_range

make_heap<br>random_access_range

is_heap<br>random_access_range (could be forward_range)

is_heap_until<br>random_access_range (could be forward_range)

partition<br>forward_range

is_partitioned<br>input_range

is_partitioned_until<br>input_range

unique<br>forward_range

is_uniqued<br>forward_range

adjacent_find<br>forward_range

Note that is_partitioned only needs to view one element at a time, and remember<br>the boolean value of pred(elt). By contrast, is_sorted, adjacent_find, and is_heap<br>need to view two elements at a time; that’s why they can’t handle input_range.

The two “could bes” in the table above seem to indicate a design defect.

Benchmark it!

libc++’s implementation of std::is_heap looks shockingly inefficient:<br>its 24 lines spend a lot of time recomputing subexpressions like __first + __c (to get to the c’th element)<br>and 2 * __p + 1 (to compute the child index from the parent). A straight-line implementation,<br>avoiding all arithmetic and just advancing two iterators in lockstep,<br>would have taken only 18 lines:

template<br>_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _ForwardIterator<br>__is_heap_until(_ForwardIterator __first, _Sentinel __last, _Compare&& __comp) {<br>_ForwardIterator __child = __first;<br>if (__child == __last) {<br>return __child;<br>while (true) {<br>++__child;<br>if (__child == __last || __comp(*__first, *__child))<br>break;<br>++__child;<br>if (__child == __last || __comp(*__first, *__child))<br>break;<br>++__first;<br>return __child;

Surprisingly, libstdc++<br>and MS STL also spend time adding and dividing<br>(or shifting) when they don’t need to.

Still, just because a piece of code looks inefficient doesn’t always mean that it is inefficient.<br>So I whipped up a simple benchmark:

void BM_vector_half(benchmark::State& state) {<br>auto n = state.range(0);<br>auto v = std::vector(n);<br>std::mt19937 g;<br>std::generate(v.begin(), v.end(), g);<br>std::make_heap(v.begin(), v.begin() + (n / 2));<br>for (auto _ : state) {<br>benchmark::DoNotOptimize(v);<br>auto it = std::is_heap_until(v.begin(), v.end());<br>benchmark::DoNotOptimize(it);

vector_full is the same but with make_heap(v.begin(), v.end()).<br>deque_{half,full} are the same but with deque instead of vector.

Here’s the benchmark result before and after applying the patch above to libc++.<br>In this case, our eyes don’t lie: what looks inefficient is inefficient!<br>(Note that our comparator here is less, which is cheap. An expensive comparator<br>could dwarf the cost of arithmetic, making the benefit of this patch less perceptible.)

Benchmark<br>CPU time (before)<br>CPU time (after)

vector_half<br>1K<br>4155 ns<br>2658 ns<br>−36%

vector_half<br>100K<br>311560 ns<br>263807 ns<br>−15%

vector_half<br>10M<br>32487789 ns<br>26089364 ns<br>−19%

vector_full<br>1K<br>8813 ns<br>5294 ns<br>−39%

vector_full<br>100K<br>644535 ns<br>535987 ns<br>−16%

vector_full<br>10M<br>59771100 ns<br>52807359 ns<br>−11%

deque_half<br>1K<br>4186 ns<br>2662 ns<br>−36%

deque_half<br>100K<br>375844 ns<br>264856 ns<br>−29%

deque_half<br>10M<br>31999750 ns<br>26152052 ns<br>−18%

deque_full<br>1K<br>9002 ns<br>5271 ns<br>−41%

deque_full<br>100K<br>619876 ns<br>523727 ns<br>−15%

deque_full<br>10M<br>65697333 ns<br>52488343 ns<br>−20%

Even with the current specification of is_heap, we can achieve this performance today.<br>The algorithm can remain constrained on random_access_range<br>(thus doing nothing to solve the motivating use-case that introduced this post)<br>while internally using nothing more than forward-iterator operations.<br>But once the algorithm uses nothing more...

is_heap __child forward_range benchmark auto test_push_range

Related Articles