A deep dive into SmallVector:push_back

mariuz1 pts0 comments

A deep dive into SmallVector::push_back | MaskRay

2026-06-27

tl;dr This blog post describes a recent<br>SmallVector::push_back optimization for approximately<br>trivially copyable element types.

SmallVector is LLVM's most-used container, and<br>push_back its hot operation. For the trivially-copyable<br>specialization the fast path should be fast.

#include

void f(llvm::SmallVectorImplint> &v, int x) { v.push_back(x); }

clang -S --target=x86_64 -O2 -DNDEBUG a.cc<br>generates:

10<br>11<br>12<br>13<br>14<br>15<br>16<br>17<br>18<br>19<br>20<br>push rbp # callee-saved spills + a stack realignment,<br>push rbx # all on the fast path<br>push rax<br>mov eax, [rdi + 8] # size<br>cmp eax, [rdi + 12] # vs capacity<br>jae .Lgrow<br>.Lstore: # reached from the fast path AND from .Lgrow<br>mov rcx, [rdi]<br>mov [rcx + rax*4], esi<br>inc dword ptr [rdi + 8]<br>add rsp, 8<br>pop rbx<br>pop rbp<br>ret<br>.Lgrow:<br>mov rbx, rdi # keep `this`/`x` alive across the call<br>mov ebp, esi<br>call SmallVectorBase::grow_pod<br>...<br>jmp .Lstore

push_back reserves capacity and then stores, so<br>the store at .Lstore is shared between the no-grow and<br>post-grow paths. On the grow path this and x<br>must survive the grow_pod call, which means they are saved<br>in callee-saved registers, leading to<br>push rbx/push rbp in the prologue.<br>push rbp is needed to maintain the 16-byte alignment of the<br>stack frame.

GCC's output is also inefficient:

push rbp ; mov ebp, esi # x -> rbp, in the entry block<br>push rbx ; mov rbx, rdi # this -> rbx<br>... ; cmp ; jnb .Lslow<br>.Lmerge: # reached by both paths, reads rbx/rbp<br>mov rdx, [rbx] ; mov [rdx+rax*4], ebp ; ...

Shrink wrapping can't remove<br>it

Shrink wrapping relocates the save/restore of callee-saved registers;<br>it never duplicates a block. To carry this/x<br>across the conditional grow_pod call into a store the fast<br>path also reaches, a callee-saved register must be live from entry.<br>clang -mllvm -debug-only=shrink-wrap reports<br>No Shrink wrap candidate found. GCC's<br>-fshrink-wrap-separate (on at -O2) does not<br>optimize this as well.

The transformation that would help is tail duplication —<br>give the slow path its own copy of the store so the fast path keeps<br>this/x in their argument registers. Neither<br>compiler does it here, and it is not shrink-wrapping's job.

Optimization: tail<br>calling the slow path

https://github.com/llvm/llvm-project/pull/206213 moves<br>the grow-and-store out of line and tail calls it:

10<br>11<br>12<br>13<br>LLVM_ATTRIBUTE_NOINLINE void growAndPushBack(ValueParamT Elt) {<br>T Tmp = Elt; // in case Elt aliases storage that grow() invalidates<br>this->grow(this->size() + 1);<br>std::memcpy(reinterpret_castvoid *>(this->end()), &Tmp, sizeof(T));<br>this->set_size(this->size() + 1);

void push_back(ValueParamT Elt) {<br>if (LLVM_UNLIKELY(this->size() >= this->capacity()))<br>return growAndPushBack(Elt);<br>std::memcpy(reinterpret_castvoid *>(this->end()), &Elt, sizeof(T));<br>this->set_size(this->size() + 1);

The generated assembly is now optimal for the fast path:

mov eax, [rdi + 8]<br>cmp eax, [rdi + 12]<br>jae growAndPushBack # TAILCALL<br>mov rcx, [rdi]<br>mov [rcx + rax*4], esi<br>inc dword ptr [rdi + 8]<br>ret

7 instructions instead of 14, no callee-saved registers, nothing to<br>shrink-wrap.

The slow path, now in an out-of-line function (in a separate section<br>using COMDAT), becomes even slower.

noinline is load-bearing, otherwise Clang and GCC may<br>inline the helper back and the prologue returns.

#include<br>// noinline growAndPushBack is load-bearing for both Clang and GCC.<br>void DecodeMOVDDUPMask(unsigned n, llvm::SmallVectorImplint> &v) {<br>for (unsigned l = 0; l 2)<br>for (unsigned i = 0; i 2; ++i)<br>v.push_back(i);

T Tmp = Elt handles Elt referencing the<br>vector's own storage. It is elided for small by-value types. Passing the<br>element by reference to the out-of-line growAndPushBack<br>makes it address-taken / memory-materialized (it must be readable at a<br>fixed address across another non-inlined call), which defeats<br>construct-in-place for large element types. However, this is<br>insignificant given that grow() has to copy size()<br>elements.

Results

lld .text shrinks 40,512 bytes;<br>by-const& element types win most, e.g.<br>GotSection::addConstant goes 167 → 45 bytes. On the LLVM compile-time<br>tracker the clang build is 0.41–0.51% fewer<br>instructions:u across every configuration, for +0.13%<br>binary size.

Sorted by relative size, a few outliers grow ~13.8% — the constexpr<br>ByteCode interpreter (Interp.cpp,<br>EvalEmitter.cpp). A smaller push_back likely<br>perturbs the bottom-up inliner's near-threshold decisions.

std::vector::push_back<br>is slow in both libc++ and libstdc++

Both libraries need a stack frame for their<br>vector::push_back fast path. https://godbolt.org/z/5h85M9Gr9

10<br>11<br>12<br>#include<br>#include

void pb_int(std::vectorint> &v, int x) { v.push_back(x); }<br>void pb_int(llvm::SmallVectorImplint> &v, int x) { v.push_back(x); }

struct T {int x[32];};<br>void pb_Tcreate(std::vector &v, int x){ v.push_back(T{{x, 1}}); }<br>void pb_Tcopy(std::vector &v, const T &t){ v.push_back(t); }

void pb_Tcreate(llvm::SmallVectorImpl &v, int x){ v.push_back(T{{x, 1}}); }<br>void...

push_back path void llvm fast push

Related Articles