Std: Is_heap Could Be Faster

`std::is_heap` could be faster – Arthur O'Dwyer – Stuff mostly about C++

std::is_heap could be faster

The other day I was noodling around with some libc++ unit-test code that looked roughly like this (Godbolt):

template auto extract_container(A& a) { struct UnwrapAdaptor : A { A::container_type& cc = A::c; }; return UnwrapAdaptor(a).cc;

template void test_push_range(bool is_heapified) { int in1[] = {1,3,7}; int in2[] = {2,4,5,6}; int expected[] = {1,3,7,2,4,5,6}; Adaptor a; a.push_range(in1); a.push_range(in2); if (auto c = extract_container(a); is_heapified) { assert(std::ranges::is_heap(c)); assert(std::ranges::is_permutation(c, expected)); } else { assert(std::ranges::equal(c, expected));

int main() { test_push_range>(false); test_push_range>(false); test_push_range>(true);

I tried to extend main to test also some non-default containers. (Incidentally, did you know stack’s default container is deque, not vector?)

test_push_range>>(false); test_push_range>>(false);

…And suddenly the unit test no longer compiled!

error: no matching function for call to object of type 'const __is_heap' 22 | assert(std::ranges::is_heap(c)); | ^~~~~~~~~~~~~~~~~~~~ [...] note: because 'std::list &' does not satisfy 'random_access_range'

That caught me by surprise. Why should testing whether a range is heapified require random access? It’s simple to implement with only forward traversal, and make_heap/is_heap seems to analogize perfectly against sort/is_sorted. is_sorted doesn’t require random access; why should is_heap?

Ranges algorithm Constraint

sort random_access_range

is_sorted forward_range

is_sorted_until forward_range

make_heap random_access_range

is_heap random_access_range (could be forward_range)

is_heap_until random_access_range (could be forward_range)

partition forward_range

is_partitioned input_range

is_partitioned_until input_range

unique forward_range

is_uniqued forward_range

adjacent_find forward_range

Note that is_partitioned only needs to view one element at a time, and remember the boolean value of pred(elt). By contrast, is_sorted, adjacent_find, and is_heap need to view two elements at a time; that’s why they can’t handle input_range.

The two “could bes” in the table above seem to indicate a design defect.

Benchmark it!

libc++’s implementation of std::is_heap looks shockingly inefficient: its 24 lines spend a lot of time recomputing subexpressions like __first + __c (to get to the c’th element) and 2 * __p + 1 (to compute the child index from the parent). A straight-line implementation, avoiding all arithmetic and just advancing two iterators in lockstep, would have taken only 18 lines:

template _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _ForwardIterator __is_heap_until(_ForwardIterator __first, _Sentinel __last, _Compare&& __comp) { _ForwardIterator __child = __first; if (__child == __last) { return __child; while (true) { ++__child; if (__child == __last || __comp(*__first, *__child)) break; ++__child; if (__child == __last || __comp(*__first, *__child)) break; ++__first; return __child;

Surprisingly, libstdc++ and MS STL also spend time adding and dividing (or shifting) when they don’t need to.

Still, just because a piece of code looks inefficient doesn’t always mean that it is inefficient. So I whipped up a simple benchmark:

void BM_vector_half(benchmark::State& state) { auto n = state.range(0); auto v = std::vector(n); std::mt19937 g; std::generate(v.begin(), v.end(), g); std::make_heap(v.begin(), v.begin() + (n / 2)); for (auto _ : state) { benchmark::DoNotOptimize(v); auto it = std::is_heap_until(v.begin(), v.end()); benchmark::DoNotOptimize(it);

vector_full is the same but with make_heap(v.begin(), v.end()). deque_{half,full} are the same but with deque instead of vector.

Here’s the benchmark result before and after applying the patch above to libc++. In this case, our eyes don’t lie: what looks inefficient is inefficient! (Note that our comparator here is less, which is cheap. An expensive comparator could dwarf the cost of arithmetic, making the benefit of this patch less perceptible.)

Benchmark CPU time (before) CPU time (after)

vector_half 1K 4155 ns 2658 ns −36%

vector_half 100K 311560 ns 263807 ns −15%

vector_half 10M 32487789 ns 26089364 ns −19%

vector_full 1K 8813 ns 5294 ns −39%

vector_full 100K 644535 ns 535987 ns −16%

vector_full 10M 59771100 ns 52807359 ns −11%

deque_half 1K 4186 ns 2662 ns −36%

deque_half 100K 375844 ns 264856 ns −29%

deque_half 10M 31999750 ns 26152052 ns −18%

deque_full 1K 9002 ns 5271 ns −41%

deque_full 100K 619876 ns 523727 ns −15%

deque_full 10M 65697333 ns 52488343 ns −20%

Even with the current specification of is_heap, we can achieve this performance today. The algorithm can remain constrained on random_access_range (thus doing nothing to solve the motivating use-case that introduced this post) while internally using nothing more than forward-iterator operations. But once the algorithm uses nothing more...

Std: Is_heap Could Be Faster

Related Articles

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self- Play

Old Reddit Is Down

The ultimate female fantasy – A feminist critique of Beauty and the Beast