`std::is_heap` could be faster – Arthur O'Dwyer – Stuff mostly about C++
std::is_heap could be faster
The other day I was noodling around with some libc++ unit-test code that looked<br>roughly like this (Godbolt):
template<br>auto extract_container(A& a) {<br>struct UnwrapAdaptor : A { A::container_type& cc = A::c; };<br>return UnwrapAdaptor(a).cc;
template<br>void test_push_range(bool is_heapified) {<br>int in1[] = {1,3,7};<br>int in2[] = {2,4,5,6};<br>int expected[] = {1,3,7,2,4,5,6};<br>Adaptor a;<br>a.push_range(in1);<br>a.push_range(in2);<br>if (auto c = extract_container(a); is_heapified) {<br>assert(std::ranges::is_heap(c));<br>assert(std::ranges::is_permutation(c, expected));<br>} else {<br>assert(std::ranges::equal(c, expected));
int main() {<br>test_push_range>(false);<br>test_push_range>(false);<br>test_push_range>(true);
I tried to extend main to test also some non-default containers.<br>(Incidentally, did you know stack’s default container is deque,<br>not vector?)
test_push_range>>(false);<br>test_push_range>>(false);
…And suddenly the unit test no longer compiled!
error: no matching function for call to object of type 'const __is_heap'<br>22 | assert(std::ranges::is_heap(c));<br>| ^~~~~~~~~~~~~~~~~~~~<br>[...]<br>note: because 'std::list &' does not satisfy 'random_access_range'
That caught me by surprise. Why should testing whether a range is heapified<br>require random access? It’s simple to implement with only forward traversal,<br>and make_heap/is_heap seems to analogize perfectly against sort/is_sorted.<br>is_sorted doesn’t require random access; why should is_heap?
Ranges algorithm<br>Constraint
sort<br>random_access_range
is_sorted<br>forward_range
is_sorted_until<br>forward_range
make_heap<br>random_access_range
is_heap<br>random_access_range (could be forward_range)
is_heap_until<br>random_access_range (could be forward_range)
partition<br>forward_range
is_partitioned<br>input_range
is_partitioned_until<br>input_range
unique<br>forward_range
is_uniqued<br>forward_range
adjacent_find<br>forward_range
Note that is_partitioned only needs to view one element at a time, and remember<br>the boolean value of pred(elt). By contrast, is_sorted, adjacent_find, and is_heap<br>need to view two elements at a time; that’s why they can’t handle input_range.
The two “could bes” in the table above seem to indicate a design defect.
Benchmark it!
libc++’s implementation of std::is_heap looks shockingly inefficient:<br>its 24 lines spend a lot of time recomputing subexpressions like __first + __c (to get to the c’th element)<br>and 2 * __p + 1 (to compute the child index from the parent). A straight-line implementation,<br>avoiding all arithmetic and just advancing two iterators in lockstep,<br>would have taken only 18 lines:
template<br>_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _ForwardIterator<br>__is_heap_until(_ForwardIterator __first, _Sentinel __last, _Compare&& __comp) {<br>_ForwardIterator __child = __first;<br>if (__child == __last) {<br>return __child;<br>while (true) {<br>++__child;<br>if (__child == __last || __comp(*__first, *__child))<br>break;<br>++__child;<br>if (__child == __last || __comp(*__first, *__child))<br>break;<br>++__first;<br>return __child;
Surprisingly, libstdc++<br>and MS STL also spend time adding and dividing<br>(or shifting) when they don’t need to.
Still, just because a piece of code looks inefficient doesn’t always mean that it is inefficient.<br>So I whipped up a simple benchmark:
void BM_vector_half(benchmark::State& state) {<br>auto n = state.range(0);<br>auto v = std::vector(n);<br>std::mt19937 g;<br>std::generate(v.begin(), v.end(), g);<br>std::make_heap(v.begin(), v.begin() + (n / 2));<br>for (auto _ : state) {<br>benchmark::DoNotOptimize(v);<br>auto it = std::is_heap_until(v.begin(), v.end());<br>benchmark::DoNotOptimize(it);
vector_full is the same but with make_heap(v.begin(), v.end()).<br>deque_{half,full} are the same but with deque instead of vector.
Here’s the benchmark result before and after applying the patch above to libc++.<br>In this case, our eyes don’t lie: what looks inefficient is inefficient!<br>(Note that our comparator here is less, which is cheap. An expensive comparator<br>could dwarf the cost of arithmetic, making the benefit of this patch less perceptible.)
Benchmark<br>CPU time (before)<br>CPU time (after)
vector_half<br>1K<br>4155 ns<br>2658 ns<br>−36%
vector_half<br>100K<br>311560 ns<br>263807 ns<br>−15%
vector_half<br>10M<br>32487789 ns<br>26089364 ns<br>−19%
vector_full<br>1K<br>8813 ns<br>5294 ns<br>−39%
vector_full<br>100K<br>644535 ns<br>535987 ns<br>−16%
vector_full<br>10M<br>59771100 ns<br>52807359 ns<br>−11%
deque_half<br>1K<br>4186 ns<br>2662 ns<br>−36%
deque_half<br>100K<br>375844 ns<br>264856 ns<br>−29%
deque_half<br>10M<br>31999750 ns<br>26152052 ns<br>−18%
deque_full<br>1K<br>9002 ns<br>5271 ns<br>−41%
deque_full<br>100K<br>619876 ns<br>523727 ns<br>−15%
deque_full<br>10M<br>65697333 ns<br>52488343 ns<br>−20%
Even with the current specification of is_heap, we can achieve this performance today.<br>The algorithm can remain constrained on random_access_range<br>(thus doing nothing to solve the motivating use-case that introduced this post)<br>while internally using nothing more than forward-iterator operations.<br>But once the algorithm uses nothing more...