A deep dive into SmallVector::push_back

A deep dive into SmallVector::push_back | MaskRay

2026-06-27

tl;dr This blog post describes a recent SmallVector::push_back optimization for approximately trivially copyable element types.

SmallVector is LLVM's most-used container, and push_back its hot operation. For the trivially-copyable specialization the fast path should be fast.

    #include <llvm/ADT/SmallVector.h>

    void f(llvm::SmallVectorImpl<int> &v, int x) { v.push_back(x); }

clang -S --target=x86_64 -O2 -DNDEBUG a.cc generates:

      push rbp                      # callee-saved spills + a stack realignment,
      push rbx                      # all on the fast path
      push rax
      mov eax, [rdi + 8]            # size
      cmp eax, [rdi + 12]           # vs capacity
      jae .Lgrow
    .Lstore:                        # reached from the fast path AND from .Lgrow
      mov rcx, [rdi]
      mov [rcx + rax*4], esi
      inc dword ptr [rdi + 8]
      add rsp, 8
      pop rbx
      pop rbp
      ret
    .Lgrow:
      mov rbx, rdi                  # keep `this`/`x` alive across the call
      mov ebp, esi
      call SmallVectorBase::grow_pod
      ...
      jmp .Lstore

push_back reserves capacity and then stores, so the store at .Lstore is shared between the no-grow and post-grow paths. On the grow path, this and x must survive the grow_pod call, so they are kept in the callee-saved registers rbx and rbp, which is why push rbx/push rbp appear in the prologue. The extra push rax saves nothing useful; it is there only to keep the stack 16-byte aligned at the call, and the matching add rsp, 8 undoes it.
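At the source level, the shape the compiler sees is roughly the following. This is a simplified model, not LLVM's code: TinyVec, grow and push_back_shared are hypothetical names, and __attribute__((noinline)) assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for the POD SmallVector layout: pointer, size, capacity.
struct TinyVec {
  int *Begin = nullptr;
  unsigned Size = 0;
  unsigned Capacity = 0;

  TinyVec() = default;
  TinyVec(const TinyVec &) = delete;
  TinyVec &operator=(const TinyVec &) = delete;
  ~TinyVec() { std::free(Begin); }

  // Out-of-line growth, playing the role of grow_pod.
  __attribute__((noinline)) void grow(unsigned MinSize) {
    unsigned NewCap = Capacity ? Capacity * 2 : 4;
    if (NewCap < MinSize)
      NewCap = MinSize;
    int *New = static_cast<int *>(std::realloc(Begin, NewCap * sizeof(int)));
    if (!New)
      std::abort();
    Begin = New;
    Capacity = NewCap;
  }

  // Reserve, then store: a single store is shared by both paths.
  void push_back_shared(int X) {
    if (Size >= Capacity)
      grow(Size + 1); // `this` and X must survive this call...
    assert(Size < Capacity && "grow() must leave room for one element");
    Begin[Size] = X;  // ...because the shared store reads them here.
    ++Size;
  }
};
```

Because the store after the if reads both this and X, any value that crosses the conditional call has to live somewhere the call cannot clobber, which is what forces the prologue onto the fast path.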

GCC's output is also inefficient:

    push rbp ; mov ebp, esi        # x -> rbp, in the entry block
    push rbx ; mov rbx, rdi        # this -> rbx
    ... ; cmp ; jnb .Lslow
    .Lmerge:                       # reached by both paths, reads rbx/rbp
      mov rdx, [rbx] ; mov [rdx+rax*4], ebp ; ...

Shrink wrapping can't remove it

Shrink wrapping relocates the save/restore of callee-saved registers; it never duplicates a block. To carry this/x across the conditional grow_pod call into a store that the fast path also reaches, a callee-saved register must be live from entry. clang -mllvm -debug-only=shrink-wrap reports "No Shrink wrap candidate found". GCC's -fshrink-wrap-separate (on at -O2) does not help either.

The transformation that would help is tail duplication — give the slow path its own copy of the store so the fast path keeps this/x in their argument registers. Neither compiler does it here, and it is not shrink-wrapping's job.
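The three control-flow shapes discussed in this section can be sketched in a few lines. All names here (slowLookup, earlyReturn, sharedTail, duplicatedTail) are hypothetical, __attribute__((noinline)) assumes GCC or Clang, and the comments describe what one would expect from the reasoning above rather than guaranteed codegen; check the assembly for your compiler version.

```cpp
#include <cassert>

// Hypothetical out-of-line callee; noinline keeps the call visible.
__attribute__((noinline)) int slowLookup(int X) { return X * 3; }

// Fast path returns early and never rejoins the slow path, so the save and
// restore of a callee-saved register can be sunk into the slow branch.
// This is the kind of shape shrink wrapping is designed for.
int earlyReturn(int X, bool Cold) {
  if (!Cold)
    return X + 1;        // no value crosses a call here
  int A = slowLookup(X); // X must survive this call
  return A + X;
}

// Both paths fall into one tail that reads X and Out after the optional
// call, mirroring push_back's reserve-then-store shape.
int sharedTail(int X, bool Cold, int *Out) {
  if (Cold)
    *Out = slowLookup(X);
  return *Out + X;
}

// Manual tail duplication: the slow path gets a private copy of the tail,
// so on the fast path X and Out never have to outlive a call.
int duplicatedTail(int X, bool Cold, int *Out) {
  if (Cold) {
    int A = slowLookup(X);
    *Out = A;
    return A + X;
  }
  return *Out + X;
}
```

sharedTail and duplicatedTail compute the same result; only the block structure differs, and that difference is what the register allocator and shrink wrapping react to.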

Optimization: tail calling the slow path

https://github.com/llvm/llvm-project/pull/206213 moves the grow-and-store out of line and tail calls it:

    LLVM_ATTRIBUTE_NOINLINE void growAndPushBack(ValueParamT Elt) {
      T Tmp = Elt; // in case Elt aliases storage that grow() invalidates
      this->grow(this->size() + 1);
      std::memcpy(reinterpret_cast<void *>(this->end()), &Tmp, sizeof(T));
      this->set_size(this->size() + 1);
    }

    void push_back(ValueParamT Elt) {
      if (LLVM_UNLIKELY(this->size() >= this->capacity()))
        return growAndPushBack(Elt);
      std::memcpy(reinterpret_cast<void *>(this->end()), &Elt, sizeof(T));
      this->set_size(this->size() + 1);
    }

The generated assembly is now optimal for the fast path:

    mov eax, [rdi + 8]
    cmp eax, [rdi + 12]
    jae growAndPushBack           # TAILCALL
    mov rcx, [rdi]
    mov [rcx + rax*4], esi
    inc dword ptr [rdi + 8]
    ret

7 instructions instead of 13, no callee-saved registers, nothing to shrink-wrap.

The slow path, now in an out-of-line function (in a separate section using COMDAT), becomes even slower.

noinline is load-bearing: without it, Clang and GCC may inline the helper back into push_back, and the prologue returns.

    #include <llvm/ADT/SmallVector.h>

    // noinline growAndPushBack is load-bearing for both Clang and GCC.
    void DecodeMOVDDUPMask(unsigned n, llvm::SmallVectorImpl<int> &v) {
      for (unsigned l = 0; l < n; l += 2)
        for (unsigned i = 0; i < 2; ++i)
          v.push_back(i);
    }
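The same pattern can be tried outside LLVM with a toy container. This is a sketch under assumptions, not the code from the pull request: TinyVec is a hypothetical type, __attribute__((noinline)) and __builtin_expect stand in for LLVM_ATTRIBUTE_NOINLINE and LLVM_UNLIKELY, and both assume GCC or Clang.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical int-only vector with the pointer/size/capacity layout.
struct TinyVec {
  int *Begin = nullptr;
  unsigned Size = 0;
  unsigned Capacity = 0;

  TinyVec() = default;
  TinyVec(const TinyVec &) = delete;
  TinyVec &operator=(const TinyVec &) = delete;
  ~TinyVec() { std::free(Begin); }

  // Slow path: grows the buffer AND performs its own store, so control
  // never has to come back to push_back.
  __attribute__((noinline)) void growAndPushBack(int X) {
    unsigned NewCap = Capacity ? Capacity * 2 : 4;
    int *New = static_cast<int *>(std::realloc(Begin, NewCap * sizeof(int)));
    if (!New)
      std::abort();
    Begin = New;
    Capacity = NewCap;
    assert(Size < Capacity);
    Begin[Size++] = X;
  }

  // Fast path: nothing follows the slow-path call, so it can be emitted as
  // a tail call and no value needs to stay live across it.
  void push_back(int X) {
    if (__builtin_expect(Size >= Capacity, 0))
      return growAndPushBack(X);
    Begin[Size++] = X;
  }
};
```

Compiling a one-line wrapper around TinyVec::push_back at -O2, with and without the noinline attribute, is a quick way to see whether a given compiler keeps the frame off the fast path.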

T Tmp = Elt handles Elt referencing the vector's own storage. It is elided for small by-value types. Passing the element by reference to the out-of-line growAndPushBack makes it address-taken / memory-materialized (it must be readable at a fixed address across another non-inlined call), which defeats construct-in-place for large element types. However, this is insignificant given that grow() has to copy size() elements.
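The aliasing hazard behind T Tmp = Elt is easy to reproduce with a by-reference element type. The sketch below is a hypothetical model (Big and BigVec are made-up names, and it uses malloc/free rather than LLVM's allocation helpers); it only illustrates why the copy has to happen before the old buffer goes away.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Trivially copyable, but large enough that it is passed by const reference.
struct Big {
  int Data[32];
};

// Hypothetical vector of Big with the pointer/size/capacity layout.
struct BigVec {
  Big *Begin = nullptr;
  unsigned Size = 0;
  unsigned Capacity = 0;

  BigVec() = default;
  BigVec(const BigVec &) = delete;
  BigVec &operator=(const BigVec &) = delete;
  ~BigVec() { std::free(Begin); }

  __attribute__((noinline)) void growAndPushBack(const Big &Elt) {
    Big Tmp = Elt; // copy first: Elt may refer to the buffer freed below
    unsigned NewCap = Capacity ? Capacity * 2 : 1;
    Big *New = static_cast<Big *>(std::malloc(NewCap * sizeof(Big)));
    if (!New)
      std::abort();
    if (Size)
      std::memcpy(static_cast<void *>(New), Begin, Size * sizeof(Big));
    std::free(Begin); // from here on, a reference into the old buffer dangles
    Begin = New;
    Capacity = NewCap;
    assert(Size < Capacity);
    std::memcpy(static_cast<void *>(Begin + Size), &Tmp, sizeof(Big));
    ++Size;
  }

  void push_back(const Big &Elt) {
    if (__builtin_expect(Size >= Capacity, 0))
      return growAndPushBack(Elt);
    std::memcpy(static_cast<void *>(Begin + Size), &Elt, sizeof(Big));
    ++Size;
  }
};
```

Calling V.push_back(V.Begin[0]) on a full BigVec exercises exactly this case: without the Tmp copy, the final memcpy would read from freed memory.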

Results

lld's .text shrinks by 40,512 bytes; by-const& element types win most, e.g. GotSection::addConstant goes 167 → 45 bytes. On the LLVM compile-time tracker, the clang build executes 0.41–0.51% fewer instructions:u across every configuration, for +0.13% binary size.

Sorted by relative size, a few outliers grow ~13.8% — the constexpr ByteCode interpreter (Interp.cpp, EvalEmitter.cpp). A smaller push_back likely perturbs the bottom-up inliner's near-threshold decisions.

std::vector::push_back is slow in both libc++ and libstdc++

Both libraries need a stack frame for their vector::push_back fast path. https://godbolt.org/z/5h85M9Gr9

    #include <llvm/ADT/SmallVector.h>
    #include <vector>

    void pb_int(std::vector<int> &v, int x) { v.push_back(x); }
    void pb_int(llvm::SmallVectorImpl<int> &v, int x) { v.push_back(x); }

    struct T {int x[32];};
    void pb_Tcreate(std::vector<T> &v, int x){ v.push_back(T{{x, 1}}); }
    void pb_Tcopy(std::vector<T> &v, const T &t){ v.push_back(t); }

    void pb_Tcreate(llvm::SmallVectorImpl<T> &v, int x){ v.push_back(T{{x, 1}}); }
    void...
