A faster bump allocator for rust

2026-06-02 23:02

Say hello to stumpalo.

Stumpalo is a bump allocator.

Stumpalo has scoped stack support.

Stumpalo is extremely fast.

Stumpalo has a logo, created very hastily.

Stumpalo’s logo is stumpy:

# Speed

You’re probably using a bump allocator because you want raw allocation throughput.

Let’s see how fast stumpalo is, compared to other libraries.

operation stumpalo blink bumpalo

alloc_u8 ✅ 1.00x 🔴 2.14x 🟠 1.54x

alloc_u16 ✅ 1.00x 🔴 2.46x 🟥 2.54x

alloc_u32 ✅ 1.00x 🟥 3.36x 🟥 3.34x

alloc_u64 ✅ 1.00x 🟥 3.35x 🟥 3.34x

alloc_u128 ✅ 1.00x 🟡 1.19x 🟡 1.18x

alloc_multiple_u8 ✅ 1.00x 🔴 1.82x 🔴 1.85x

alloc_multiple_u16 ✅ 1.00x 🔴 2.30x 🔴 2.34x

alloc_multiple_u32 ✅ 1.00x 🟥 3.12x 🟥 3.14x

alloc_multiple_u64 ✅ 1.00x 🟥 3.23x 🟥 3.25x

alloc_multiple_u128 ✅ 1.00x 🟥 2.70x 🟥 2.61x

alloc_array_u8_8 ✅ 1.00x 🔴 1.99x 🔴 2.11x

alloc_array_u8_32 ✅ 1.00x 🟢 1.15x 🟡 1.20x

alloc_array_u8_64 ✅ 1.00x 🟠 1.55x 🟠 1.59x

alloc_array_u8_128 ✅ 1.00x 🟡 1.30x 🟠 1.50x

alloc_slice_u8_8 🟢 1.11x 🟡 1.27x ✅ 1.00x

alloc_slice_u8_32 🟢 1.06x ✅ 1.00x 🟢 1.08x

alloc_slice_u8_64 ✅ 1.05x ✅ 1.00x 🟢 1.09x

alloc_slice_u8_128 ✅ 1.00x 🟢 1.06x ✅ 1.04x

alloc_slice_u16_8 ✅ 1.00x 🟡 1.33x 🟡 1.16x

alloc_slice_u16_32 ✅ 1.00x 🟢 1.14x 🟢 1.11x

alloc_slice_u16_64 ✅ 1.00x 🟢 1.14x 🟢 1.10x

alloc_slice_u16_128 ✅ 1.04x ✅ 1.00x ✅ 1.02x

alloc_slice_u32_8 ✅ 1.00x 🟢 1.14x 🟢 1.09x

alloc_slice_u32_32 ✅ 1.00x 🟢 1.14x 🟢 1.10x

alloc_slice_u32_64 ✅ 1.05x ✅ 1.00x 🟢 1.06x

alloc_slice_u32_128 🟢 1.09x ✅ 1.00x 🟢 1.13x

alloc_slice_u64_8 ✅ 1.00x 🟡 1.25x 🟢 1.11x

alloc_slice_u64_32 ✅ 1.04x ✅ 1.00x ✅ 1.02x

alloc_slice_u64_64 🟢 1.08x ✅ 1.00x 🟢 1.10x

alloc_slice_u64_128 🟢 1.07x ✅ 1.00x 🟢 1.08x

alloc_slice_u128_8 ✅ 1.00x 🟢 1.12x 🟢 1.11x

alloc_slice_u128_32 🟢 1.08x ✅ 1.00x 🟢 1.12x

alloc_slice_u128_64 🟢 1.07x ✅ 1.00x 🟢 1.08x

alloc_slice_u128_128 ✅ 1.03x ✅ 1.00x ✅ 1.04x

alloc_struct_13 ✅ 1.00x 🟠 1.55x 🟠 1.39x

alloc_struct_24 ✅ 1.00x 🔴 1.94x 🔴 1.97x

alloc_struct_26 ✅ 1.00x 🟠 1.56x 🟠 1.52x

alloc_struct_30 ✅ 1.00x 🟠 1.54x 🟠 1.45x

alloc_struct_32 ✅ 1.00x 🟠 1.35x 🟠 1.40x

alloc_struct_64 ✅ 1.00x 🟠 1.44x 🟠 1.48x

alloc_struct_96 ✅ 1.00x 🟢 1.13x 🟡 1.18x

alloc_struct_128 ✅ 1.00x 🟡 1.33x 🟡 1.17x

alloc_struct_192 ✅ 1.02x ✅ 1.00x 🟢 1.09x

alloc_struct_256 ✅ 1.00x 🟡 1.16x ✅ 1.01x

alloc_struct_512 🟢 1.06x ✅ 1.00x ✅ 1.02x

alloc_struct_1k ✅ 1.00x 🟢 1.05x ✅ 1.01x

alloc_str_8 🟢 1.11x ✅ 1.05x ✅ 1.00x

alloc_str_16 🟢 1.07x ✅ 1.02x ✅ 1.00x

alloc_str_32 ✅ 1.04x ✅ 1.00x 🟢 1.07x

alloc_str_40 ✅ 1.00x 🟢 1.08x 🟢 1.06x

alloc_str_48 ✅ 1.00x ✅ 1.03x 🟢 1.06x

alloc_str_64 ✅ 1.00x ✅ 1.04x 🟢 1.06x

alloc_str_72 ✅ 1.04x ✅ 1.00x 🟢 1.07x

alloc_str_80 ✅ 1.03x ✅ 1.00x 🟢 1.07x

alloc_str_128 ✅ 1.00x 🟢 1.11x 🟢 1.08x

alloc_slice_lit_u8_8 ✅ 1.00x 🔴 2.47x 🔴 2.23x

alloc_slice_lit_u8_32 ✅ 1.00x 🔴 1.83x 🟠 1.71x

alloc_slice_lit_u8_64 ✅ 1.00x 🟡 1.34x 🟠 1.42x

alloc_slice_lit_u8_128 ✅ 1.00x 🟡 1.31x 🟡 1.31x

alloc_str_lit_8 ✅ 1.00x 🔴 2.02x 🔴 1.82x

alloc_str_lit_16 ✅ 1.00x 🔴 1.78x 🟠 1.60x

alloc_str_lit_32 ✅ 1.00x 🟠 1.51x 🟠 1.42x

alloc_str_lit_40 ✅ 1.00x 🔴 1.76x 🔴 1.93x

alloc_str_lit_48 ✅ 1.00x 🟠 1.74x 🔴 1.82x

alloc_str_lit_64 ✅ 1.00x 🔴 1.75x 🟠 1.69x

alloc_str_lit_72 ✅ 1.00x 🟠 1.53x 🟠 1.61x

alloc_str_lit_80 ✅ 1.00x 🟠 1.54x 🟠 1.63x

alloc_str_lit_128 ✅ 1.00x 🟠 1.36x 🟠 1.35x

clear ✅ 1.00x ✅ 1.04x ✅ 1.04x

clear_and_reuse ✅ 1.00x 🟥 3.35x 🟥 3.35x

Benchmark machine: AMD Ryzen 3900x, Arch Linux, kernel 7.0.3

# Where does the speed come from

In an arena allocator, the fast path is everything. The fast path has to check whether there’s room in the current chunk, if so, allocate the value in the current chunk, and if not, jump to the slow path.

# Using more information

Rustc / LLVM is able to erase if/else statements whose conditions are expressions known at compile-time.

Different types have different information available at compile-time. Think alignment and size. When this information is available, stumpalo uses it, as well as information about the hardware you’re running on, to avoid overflow/underflow checks, when overflow/underflow couldn’t possibly occur anyway.

Generally, stumpalo’s fast-paths contain a single conditional branch, and as few as six instructions.

# Less indirection

A stumpalo arena contains pointers to the top and bottom of the chunk. Other libraries contain a pointer to a chunk, whose header contains pointers to their top. Stumpalo goes through one less layer of indirection to read the top.

# Example

The following function:

fn alloc_u32(a: &mut Arena, n: u32) -> &mut u32 { a.alloc(n)

Compiles down to this fast path:

alloc_u32: mov rcx, qword ptr [rdi] and rcx, -4 lea rax, [rcx - 4] cmp rax, qword ptr [rdi + 8] jb example::ArenaRef::alloc_slow_with::h903e68372b5b408b mov dword ptr [rcx - 4], esi mov...

A faster bump allocator for rust

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

It's Not Just X. It's Y

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy