Optimizing LLVM's Bump Allocator

jandeboevrie1 pts0 comments

Optimizing LLVM's bump allocator | MaskRay

2026-06-28

BumpPtrAllocator is LLVM's bump allocator (arena<br>allocator): each allocation bumps a pointer within a slab, and<br>everything is freed at once when the allocator dies. It backs Clang's<br>ASTContext, lld's make object pools,<br>TableGen records, and many other arenas.

Here is the fast path before three recent changes:

10<br>11<br>12<br>13<br>14<br>15<br>16<br>__attribute__((returns_nonnull)) void *Allocate(size_t Size, Align Alignment) {<br>BytesAllocated += Size; // (3) accounting RMW<br>uintptr_t AlignedPtr = alignAddr(CurPtr, Alignment); // (1) always realign<br>size_t SizeToAllocate = Size;<br>#if LLVM_ADDRESS_SANITIZER_BUILD<br>SizeToAllocate += RedZoneSize;<br>#endif<br>uintptr_t AllocEndPtr = AlignedPtr + SizeToAllocate;<br>if (LLVM_LIKELY(AllocEndPtr uintptr_t(End)<br>&& CurPtr != nullptr)) { // (2) bound + null check<br>CurPtr = reinterpret_castchar *>(AllocEndPtr);<br>...<br>return reinterpret_castchar *>(AlignedPtr);<br>return AllocateSlow(Size, SizeToAllocate, Alignment);

Three changes streamline the three marked lines.

A minimum<br>alignment skips the realign (#205240)

alignAddr(CurPtr, Alignment) is wasteful: a<br>freshly-bumped pointer is usually aligned enough already. #205240<br>rounds each size up to MinAlign (default 8), so the fast<br>path realigns only for over-aligned requests. I've learned the trick<br>from Bump<br>Allocation: Up or Down?:

// Optimized to a constant<br>SizeToAllocate = alignToPowerOf2(SizeToAllocate, MinAlign);

uintptr_t AlignedPtr = uintptr_t(CurPtr);<br>// For the common `alignof(T)<br>if (Alignment.value() > MinAlign)<br>AlignedPtr = alignAddr(CurPtr, Alignment);

SpecificBumpPtrAllocator uses<br>MinAlign = 1 instead — DestroyAll strides at<br>sizeof(T), so it needs tight packing, not rounding.

I made a mistake in the first attempt: nullptr plus a<br>non-zero offset triggered a UBSan diagnostic. Fixed by keeping the math<br>in the uintptr_t domain.

A sentinel End drops<br>the null check (#205485)

__attribute__((returns_nonnull)) specifies the return<br>value is non-null. In a fresh allocator whose CurPtr and<br>End are both null, Allocate(0) used to return<br>null. In 2022, https://reviews.llvm.org/D125040 added the<br>&& CurPtr != nullptr check to the fast path<br>condition, which was not ideal.

I tried 1<br>// Fast path check. The condition also fails for a fresh allocator (End ==<br>// nullptr) to avoid a separate null check.<br>if (LLVM_LIKELY(AlignedPtr + SizeToAllocate - 1 uintptr_t(End))) { ... }

but then adopted aengelke's suggestion. Storing the end as a sentinel<br>one past the real end (EndSentinel = realEnd + 1, and<br>0 when there is no slab) folds both conditions into one<br>unsigned compare:

if (LLVM_LIKELY(AllocEndPtr

An empty allocator has EndSentinel == 0, so<br>AllocEndPtr is always false and the null case falls<br>through to the slow path with no separate branch.

Dropping the<br>per-allocation accounting (#205711)

BytesAllocated += Size was a read-modify-write to a<br>member on every allocation, backing a getBytesAllocated()<br>that reported requested bytes — distinct from<br>getTotalMemory()'s slab capacity. It had only<br>stats/diagnostic consumers: lldb's ConstString memory report, a clangd<br>debug log, TableGen's dumpAllocationStats, and one clang<br>regression test. Dropping the member and migrating those consumers<br>(mostly to getTotalMemory()) removes the hot-path<br>store.

A detail: the red zone and ABI. The ASan red-zone<br>size is also a member. Gating it on<br>#if LLVM_ADDRESS_SANITIZER_BUILD to drop it in release<br>builds would be an ABI footgun: that macro is per translation<br>unit, so an ASan-instrumented TU and a non-ASan<br>libLLVM would silently disagree on the struct layout. The<br>member is instead gated on LLVM_ENABLE_ABI_BREAKING_CHECKS,<br>which is fixed per library build and link-time-enforced (via the<br>EnableABIBreakingChecks symbol); the red-zone arithmetic is<br>then gated on both macros.

Combined, the fast path becomes:

10<br>11<br>12<br>13<br>14<br>15<br>16<br>17<br>void *Allocate(size_t Size, Align Alignment) {<br>size_t SizeToAllocate = Size;<br>#if LLVM_ADDRESS_SANITIZER_BUILD && LLVM_ENABLE_ABI_BREAKING_CHECKS<br>SizeToAllocate += RedZoneSize;<br>#endif<br>SizeToAllocate = alignToPowerOf2(SizeToAllocate, MinAlign);<br>uintptr_t AlignedPtr = uintptr_t(CurPtr);<br>if (Alignment.value() > MinAlign)<br>AlignedPtr = alignAddr(CurPtr, Alignment);<br>uintptr_t AllocEndPtr = AlignedPtr + SizeToAllocate;<br>if (LLVM_LIKELY(AllocEndPtr<br>CurPtr = reinterpret_castchar *>(AllocEndPtr);<br>...<br>return reinterpret_castchar *>(AlignedPtr);<br>return AllocateSlow(Size, SizeToAllocate, Alignment);

Generated assembly

Allocating a typical arena object — a 24-byte, 8-aligned node via<br>Allocate() — compiles to a six-instruction fast<br>path (clang -O2, release):

mov rax, [rdi] # CurPtr (also the return value)<br>lea rcx, [rax + 0x18] # new = CurPtr + 24<br>cmp rcx, [rdi + 0x8] # vs EndSentinel<br>jae .slow<br>mov [rdi], rcx # CurPtr = new<br>ret

That matches the canonical bump fast path. A<br>downward-bumping allocator would not need the<br>rax/rcx distinction — one fewer live value,<br>but the instruction count stays the same. LLVM bumps...

curptr sizetoallocate alignment size uintptr_t alignedptr

Related Articles