Optimizing LLVM's Bump Allocator

Optimizing LLVM's bump allocator | MaskRay

2026-06-28

BumpPtrAllocator is LLVM's bump allocator (arena allocator): each allocation bumps a pointer within a slab, and everything is freed at once when the allocator dies. It backs Clang's ASTContext, lld's make<T> object pools, TableGen records, and many other arenas.

Here is the fast path before three recent changes:

    __attribute__((returns_nonnull)) void *Allocate(size_t Size, Align Alignment) {
      BytesAllocated += Size;                               // (3) accounting RMW
      uintptr_t AlignedPtr = alignAddr(CurPtr, Alignment);  // (1) always realign
      size_t SizeToAllocate = Size;
    #if LLVM_ADDRESS_SANITIZER_BUILD
      SizeToAllocate += RedZoneSize;
    #endif
      uintptr_t AllocEndPtr = AlignedPtr + SizeToAllocate;
      if (LLVM_LIKELY(AllocEndPtr <= uintptr_t(End) && CurPtr != nullptr)) { // (2) bound + null check
        CurPtr = reinterpret_cast<char *>(AllocEndPtr);
        ...
        return reinterpret_cast<char *>(AlignedPtr);
      }
      return AllocateSlow(Size, SizeToAllocate, Alignment);
    }

Three changes streamline the three marked lines.

A minimum alignment skips the realign (#205240)

alignAddr(CurPtr, Alignment) is wasteful: a freshly bumped pointer is usually aligned enough already. #205240 rounds each size up to a multiple of MinAlign (default 8), so CurPtr stays MinAlign-aligned and the fast path realigns only for over-aligned requests. I learned the trick from "Bump Allocation: Up or Down?":

    // Optimized to a constant
    SizeToAllocate = alignToPowerOf2(SizeToAllocate, MinAlign);

    uintptr_t AlignedPtr = uintptr_t(CurPtr);
    // For the common `alignof(T) <= MinAlign` case, the realignment is skipped.
    if (Alignment.value() > MinAlign)
      AlignedPtr = alignAddr(CurPtr, Alignment);

SpecificBumpPtrAllocator uses MinAlign = 1 instead — DestroyAll strides at sizeof(T), so it needs tight packing, not rounding.
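The effect of MinAlign on packing can be modeled with a few lines. This is a minimal sketch that tracks addresses only, not real memory; ToyBump and roundUpPow2 are hypothetical names (roundUpPow2 stands in for alignToPowerOf2), and it assumes the starting address is already MinAlign-aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for alignToPowerOf2: round Value up to a multiple of
// Align, which must be a power of two.
constexpr std::uintptr_t roundUpPow2(std::uintptr_t Value, std::uintptr_t Align) {
  return (Value + Align - 1) & ~(Align - 1);
}

// Toy model of the bump step. Assumes Cur starts out MinAlign-aligned.
template <std::size_t MinAlign> struct ToyBump {
  std::uintptr_t Cur;

  std::uintptr_t allocate(std::size_t Size, std::size_t Alignment) {
    assert((Alignment & (Alignment - 1)) == 0 && "power of two expected");
    std::uintptr_t Aligned = Cur;
    // Only over-aligned requests pay for a realignment.
    if (Alignment > MinAlign)
      Aligned = roundUpPow2(Aligned, Alignment);
    // Rounding the size keeps Cur MinAlign-aligned for the next request.
    Cur = Aligned + roundUpPow2(Size, MinAlign);
    return Aligned;
  }
};
```

With ToyBump<8> starting at 0x1000, a 3-byte request followed by a 24-byte request places the second object at 0x1008. With ToyBump<1>, two 3-byte requests land at 0x1000 and 0x1003, which is the tight packing a sizeof(T) stride relies on.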

I made a mistake in the first attempt: nullptr plus a non-zero offset triggered a UBSan diagnostic. Fixed by keeping the math in the uintptr_t domain.
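A minimal sketch of what keeping the math in the uintptr_t domain looks like. bumpEnd is a hypothetical helper, not a function from the patch, and the sketch assumes a null pointer converts to the integer 0, as it does on mainstream platforms.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: compute the end address of an allocation with integer
// arithmetic. Writing `Cur + Size` on the pointer instead would be undefined
// when Cur is null and Size is non-zero, which is what UBSan reports.
inline std::uintptr_t bumpEnd(const char *Cur, std::size_t Size) {
  return reinterpret_cast<std::uintptr_t>(Cur) + Size;
}
```

The result is converted back to a pointer only after the bounds check has established that the address lies inside a slab.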

A sentinel End drops the null check (#205485)

__attribute__((returns_nonnull)) specifies the return value is non-null. In a fresh allocator whose CurPtr and End are both null, Allocate(0) used to return null. In 2022, https://reviews.llvm.org/D125040 added the && CurPtr != nullptr check to the fast path condition, which was not ideal: every allocation now paid for a second comparison.

I tried

    // Fast path check. The condition also fails for a fresh allocator (End ==
    // nullptr) to avoid a separate null check.
    if (LLVM_LIKELY(AlignedPtr + SizeToAllocate - 1 < uintptr_t(End))) {
      ...
    }

but then adopted aengelke's suggestion. Storing the end as a sentinel one past the real end (EndSentinel = realEnd + 1, and 0 when there is no slab) folds both conditions into one unsigned compare:

    if (LLVM_LIKELY(AllocEndPtr < EndSentinel)) { ... }

An empty allocator has EndSentinel == 0, so AllocEndPtr < EndSentinel is always false and the null case falls through to the slow path with no separate branch.
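The equivalence between the old two-part condition and the single sentinel compare can be checked on plain integers. A minimal sketch, assuming the invariant that CurPtr is null exactly when there is no slab; makeEndSentinel, fitsOld, and fitsNew are hypothetical names for illustration.

```cpp
#include <cstdint>

// RealEnd is the one-past-the-end address of the current slab, or 0 when the
// allocator has no slab yet.
constexpr std::uintptr_t makeEndSentinel(std::uintptr_t RealEnd) {
  return RealEnd == 0 ? 0 : RealEnd + 1;
}

// Old fast-path condition: bound check plus a separate null check.
constexpr bool fitsOld(std::uintptr_t Cur, std::uintptr_t AllocEnd,
                       std::uintptr_t RealEnd) {
  return AllocEnd <= RealEnd && Cur != 0;
}

// New fast-path condition: one unsigned compare against the sentinel.
constexpr bool fitsNew(std::uintptr_t AllocEnd, std::uintptr_t EndSentinel) {
  return AllocEnd < EndSentinel;
}

// No unsigned value is below 0, so an empty allocator always takes the slow path.
static_assert(!fitsNew(0, makeEndSentinel(0)), "empty allocator must miss");
// An allocation ending exactly at the slab end still fits.
static_assert(fitsNew(0x2000, makeEndSentinel(0x2000)), "exact fit must hit");
static_assert(fitsOld(0x1fe8, 0x2000, 0x2000), "old check agrees on exact fit");
static_assert(!fitsNew(0x2001, makeEndSentinel(0x2000)), "overflow must miss");
```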

Dropping the per-allocation accounting (#205711)

BytesAllocated += Size was a read-modify-write to a member on every allocation, backing a getBytesAllocated() that reported requested bytes — distinct from getTotalMemory()'s slab capacity. It had only stats/diagnostic consumers: lldb's ConstString memory report, a clangd debug log, TableGen's dumpAllocationStats, and one clang regression test. Dropping the member and migrating those consumers (mostly to getTotalMemory()) removes the hot-path store.

A detail: the red zone and ABI. The ASan red-zone size is also a member. Gating it on #if LLVM_ADDRESS_SANITIZER_BUILD to drop it in release builds would be an ABI footgun: that macro is per translation unit, so an ASan-instrumented TU and a non-ASan libLLVM would silently disagree on the struct layout. The member is instead gated on LLVM_ENABLE_ABI_BREAKING_CHECKS, which is fixed per library build and link-time-enforced (via the EnableABIBreakingChecks symbol); the red-zone arithmetic is then gated on both macros.
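The layout hazard is easy to see with two simplified structs standing in for the two views of the same class. These are hypothetical layouts for illustration; the real allocator has more members, and Slabs here only represents whatever is declared after the red-zone field.

```cpp
#include <cstddef>

// What a translation unit sees when the gating macro is defined.
struct LayoutWithRedZone {
  char *CurPtr;
  char *End;
  std::size_t RedZoneSize; // present only when the macro is on
  void *Slabs;             // stand-in for every member declared afterwards
};

// What the library was compiled with when the macro is not defined.
struct LayoutWithoutRedZone {
  char *CurPtr;
  char *End;
  void *Slabs;
};

// The two views disagree on where later members live and on the object size,
// so code from one side would read or write the wrong bytes on the other.
static_assert(offsetof(LayoutWithRedZone, Slabs) !=
                  offsetof(LayoutWithoutRedZone, Slabs),
              "member offsets diverge");
static_assert(sizeof(LayoutWithRedZone) > sizeof(LayoutWithoutRedZone),
              "object sizes diverge");
```

Nothing at compile time or link time flags the disagreement on its own, which is why tying the member to a macro that is fixed per library build is the safer choice.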

Combined, the fast path becomes:

    void *Allocate(size_t Size, Align Alignment) {
      size_t SizeToAllocate = Size;
    #if LLVM_ADDRESS_SANITIZER_BUILD && LLVM_ENABLE_ABI_BREAKING_CHECKS
      SizeToAllocate += RedZoneSize;
    #endif
      SizeToAllocate = alignToPowerOf2(SizeToAllocate, MinAlign);
      uintptr_t AlignedPtr = uintptr_t(CurPtr);
      if (Alignment.value() > MinAlign)
        AlignedPtr = alignAddr(CurPtr, Alignment);
      uintptr_t AllocEndPtr = AlignedPtr + SizeToAllocate;
      if (LLVM_LIKELY(AllocEndPtr < EndSentinel)) {
        CurPtr = reinterpret_cast<char *>(AllocEndPtr);
        ...
        return reinterpret_cast<char *>(AlignedPtr);
      }
      return AllocateSlow(Size, SizeToAllocate, Alignment);
    }
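For experimenting outside the LLVM tree, the same shape fits in a small standalone arena. This is a sketch under simplifying assumptions, not BumpPtrAllocator itself: ToyArena is a hypothetical class with a fixed slab size, no ASan red zones, no oversized-allocation support, and it relies on malloc returning memory aligned to at least 8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

class ToyArena {
  static constexpr std::size_t MinAlign = 8;
  static constexpr std::size_t SlabSize = 4096;

  std::uintptr_t Cur = 0;         // next free address; 0 before the first slab
  std::uintptr_t EndSentinel = 0; // real slab end + 1; 0 before the first slab
  std::vector<void *> Slabs;

  static std::uintptr_t roundUp(std::uintptr_t V, std::uintptr_t A) {
    return (V + A - 1) & ~(A - 1);
  }

  // Slow path: start a new slab and carve the allocation from its beginning.
  void *allocateSlow(std::size_t RoundedSize, std::size_t Alignment) {
    assert(RoundedSize + Alignment <= SlabSize && "toy: no oversized slabs");
    void *Slab = std::malloc(SlabSize);
    if (!Slab)
      std::abort();
    Slabs.push_back(Slab);
    std::uintptr_t Begin = reinterpret_cast<std::uintptr_t>(Slab);
    EndSentinel = Begin + SlabSize + 1;
    std::uintptr_t Aligned = roundUp(Begin, Alignment);
    Cur = Aligned + RoundedSize;
    return reinterpret_cast<void *>(Aligned);
  }

public:
  ToyArena() = default;
  ToyArena(const ToyArena &) = delete;
  ToyArena &operator=(const ToyArena &) = delete;
  ~ToyArena() {
    for (void *S : Slabs)
      std::free(S);
  }

  void *allocate(std::size_t Size, std::size_t Alignment) {
    std::size_t Rounded = roundUp(Size, MinAlign);
    std::uintptr_t Aligned = Cur;
    if (Alignment > MinAlign) // realign only for over-aligned requests
      Aligned = roundUp(Aligned, Alignment);
    std::uintptr_t AllocEnd = Aligned + Rounded;
    if (AllocEnd < EndSentinel) { // also false while there is no slab
      Cur = AllocEnd;
      return reinterpret_cast<void *>(Aligned);
    }
    return allocateSlow(Rounded, Alignment);
  }

  std::size_t numSlabs() const { return Slabs.size(); }
};
```

A zero-sized request on a fresh ToyArena fails the single compare, takes the slow path, and returns a non-null pointer into the first slab, without any explicit null check in allocate.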

Generated assembly

Allocating a typical arena object — a 24-byte, 8-aligned node via Allocate() — compiles to a six-instruction fast path (clang -O2, release):

    mov rax, [rdi]          # CurPtr (also the return value)
    lea rcx, [rax + 0x18]   # new = CurPtr + 24
    cmp rcx, [rdi + 0x8]    # vs EndSentinel
    jae .slow
    mov [rdi], rcx          # CurPtr = new
    ret

That matches the canonical bump fast path. A downward-bumping allocator would not need the rax/rcx distinction — one fewer live value, but the instruction count stays the same. LLVM bumps...
