A faster bump allocator for Rust
2026-06-02 23:02
Say hello to stumpalo.
Stumpalo is a bump allocator.
Stumpalo has scoped stack support.
Stumpalo is extremely fast.
Stumpalo has a logo, created very hastily.
Stumpalo’s logo is stumpy:
# Speed
You’re probably using a bump allocator because you want raw allocation throughput.
Let’s see how fast stumpalo is compared to other libraries. Each row is normalized to the fastest library for that operation (1.00x), so lower is better.
| operation | stumpalo | blink | bumpalo |
| --- | --- | --- | --- |
| alloc_u8 | ✅ 1.00x | 🔴 2.14x | 🟠 1.54x |
| alloc_u16 | ✅ 1.00x | 🔴 2.46x | 🔥 2.54x |
| alloc_u32 | ✅ 1.00x | 🔥 3.36x | 🔥 3.34x |
| alloc_u64 | ✅ 1.00x | 🔥 3.35x | 🔥 3.34x |
| alloc_u128 | ✅ 1.00x | 🟡 1.19x | 🟡 1.18x |
| alloc_multiple_u8 | ✅ 1.00x | 🔴 1.82x | 🔴 1.85x |
| alloc_multiple_u16 | ✅ 1.00x | 🔴 2.30x | 🔴 2.34x |
| alloc_multiple_u32 | ✅ 1.00x | 🔥 3.12x | 🔥 3.14x |
| alloc_multiple_u64 | ✅ 1.00x | 🔥 3.23x | 🔥 3.25x |
| alloc_multiple_u128 | ✅ 1.00x | 🔥 2.70x | 🔥 2.61x |
| alloc_array_u8_8 | ✅ 1.00x | 🔴 1.99x | 🔴 2.11x |
| alloc_array_u8_32 | ✅ 1.00x | 🟢 1.15x | 🟡 1.20x |
| alloc_array_u8_64 | ✅ 1.00x | 🟠 1.55x | 🟠 1.59x |
| alloc_array_u8_128 | ✅ 1.00x | 🟡 1.30x | 🟠 1.50x |
| alloc_slice_u8_8 | 🟢 1.11x | 🟡 1.27x | ✅ 1.00x |
| alloc_slice_u8_32 | 🟢 1.06x | ✅ 1.00x | 🟢 1.08x |
| alloc_slice_u8_64 | ✅ 1.05x | ✅ 1.00x | 🟢 1.09x |
| alloc_slice_u8_128 | ✅ 1.00x | 🟢 1.06x | ✅ 1.04x |
| alloc_slice_u16_8 | ✅ 1.00x | 🟡 1.33x | 🟡 1.16x |
| alloc_slice_u16_32 | ✅ 1.00x | 🟢 1.14x | 🟢 1.11x |
| alloc_slice_u16_64 | ✅ 1.00x | 🟢 1.14x | 🟢 1.10x |
| alloc_slice_u16_128 | ✅ 1.04x | ✅ 1.00x | ✅ 1.02x |
| alloc_slice_u32_8 | ✅ 1.00x | 🟢 1.14x | 🟢 1.09x |
| alloc_slice_u32_32 | ✅ 1.00x | 🟢 1.14x | 🟢 1.10x |
| alloc_slice_u32_64 | ✅ 1.05x | ✅ 1.00x | 🟢 1.06x |
| alloc_slice_u32_128 | 🟢 1.09x | ✅ 1.00x | 🟢 1.13x |
| alloc_slice_u64_8 | ✅ 1.00x | 🟡 1.25x | 🟢 1.11x |
| alloc_slice_u64_32 | ✅ 1.04x | ✅ 1.00x | ✅ 1.02x |
| alloc_slice_u64_64 | 🟢 1.08x | ✅ 1.00x | 🟢 1.10x |
| alloc_slice_u64_128 | 🟢 1.07x | ✅ 1.00x | 🟢 1.08x |
| alloc_slice_u128_8 | ✅ 1.00x | 🟢 1.12x | 🟢 1.11x |
| alloc_slice_u128_32 | 🟢 1.08x | ✅ 1.00x | 🟢 1.12x |
| alloc_slice_u128_64 | 🟢 1.07x | ✅ 1.00x | 🟢 1.08x |
| alloc_slice_u128_128 | ✅ 1.03x | ✅ 1.00x | ✅ 1.04x |
| alloc_struct_13 | ✅ 1.00x | 🟠 1.55x | 🟠 1.39x |
| alloc_struct_24 | ✅ 1.00x | 🔴 1.94x | 🔴 1.97x |
| alloc_struct_26 | ✅ 1.00x | 🟠 1.56x | 🟠 1.52x |
| alloc_struct_30 | ✅ 1.00x | 🟠 1.54x | 🟠 1.45x |
| alloc_struct_32 | ✅ 1.00x | 🟠 1.35x | 🟠 1.40x |
| alloc_struct_64 | ✅ 1.00x | 🟠 1.44x | 🟠 1.48x |
| alloc_struct_96 | ✅ 1.00x | 🟢 1.13x | 🟡 1.18x |
| alloc_struct_128 | ✅ 1.00x | 🟡 1.33x | 🟡 1.17x |
| alloc_struct_192 | ✅ 1.02x | ✅ 1.00x | 🟢 1.09x |
| alloc_struct_256 | ✅ 1.00x | 🟡 1.16x | ✅ 1.01x |
| alloc_struct_512 | 🟢 1.06x | ✅ 1.00x | ✅ 1.02x |
| alloc_struct_1k | ✅ 1.00x | 🟢 1.05x | ✅ 1.01x |
| alloc_str_8 | 🟢 1.11x | ✅ 1.05x | ✅ 1.00x |
| alloc_str_16 | 🟢 1.07x | ✅ 1.02x | ✅ 1.00x |
| alloc_str_32 | ✅ 1.04x | ✅ 1.00x | 🟢 1.07x |
| alloc_str_40 | ✅ 1.00x | 🟢 1.08x | 🟢 1.06x |
| alloc_str_48 | ✅ 1.00x | ✅ 1.03x | 🟢 1.06x |
| alloc_str_64 | ✅ 1.00x | ✅ 1.04x | 🟢 1.06x |
| alloc_str_72 | ✅ 1.04x | ✅ 1.00x | 🟢 1.07x |
| alloc_str_80 | ✅ 1.03x | ✅ 1.00x | 🟢 1.07x |
| alloc_str_128 | ✅ 1.00x | 🟢 1.11x | 🟢 1.08x |
| alloc_slice_lit_u8_8 | ✅ 1.00x | 🔴 2.47x | 🔴 2.23x |
| alloc_slice_lit_u8_32 | ✅ 1.00x | 🔴 1.83x | 🟠 1.71x |
| alloc_slice_lit_u8_64 | ✅ 1.00x | 🟡 1.34x | 🟠 1.42x |
| alloc_slice_lit_u8_128 | ✅ 1.00x | 🟡 1.31x | 🟡 1.31x |
| alloc_str_lit_8 | ✅ 1.00x | 🔴 2.02x | 🔴 1.82x |
| alloc_str_lit_16 | ✅ 1.00x | 🔴 1.78x | 🟠 1.60x |
| alloc_str_lit_32 | ✅ 1.00x | 🟠 1.51x | 🟠 1.42x |
| alloc_str_lit_40 | ✅ 1.00x | 🔴 1.76x | 🔴 1.93x |
| alloc_str_lit_48 | ✅ 1.00x | 🟠 1.74x | 🔴 1.82x |
| alloc_str_lit_64 | ✅ 1.00x | 🔴 1.75x | 🟠 1.69x |
| alloc_str_lit_72 | ✅ 1.00x | 🟠 1.53x | 🟠 1.61x |
| alloc_str_lit_80 | ✅ 1.00x | 🟠 1.54x | 🟠 1.63x |
| alloc_str_lit_128 | ✅ 1.00x | 🟠 1.36x | 🟠 1.35x |
| clear | ✅ 1.00x | ✅ 1.04x | ✅ 1.04x |
| clear_and_reuse | ✅ 1.00x | 🔥 3.35x | 🔥 3.35x |
Benchmark machine: AMD Ryzen 9 3900X, Arch Linux, kernel 7.0.3
# Where does the speed come from?
In an arena allocator, the fast path is everything. The fast path has to check whether there’s room in the current chunk. If there is, it allocates the value in the current chunk; if not, it jumps to the slow path.
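To make that shape concrete, here is a minimal sketch of a fast path and slow path split. It is not stumpalo's code: `Chunk`, `alloc`, and `alloc_slow` are hypothetical names, the sketch hands out byte offsets into a `Vec<u8>` instead of raw pointers, and its slow path simply gives up where a real arena would fetch a new chunk.

```rust
use std::ops::Range;

/// Hypothetical single-chunk bump arena. Names and layout are illustrative only.
pub struct Chunk {
    buf: Vec<u8>,
    /// Offset of the lowest byte handed out so far; allocations grow downward.
    top: usize,
}

impl Chunk {
    pub fn with_capacity(cap: usize) -> Self {
        Chunk { buf: vec![0; cap], top: cap }
    }

    /// Fast path: one branch decides between bumping and the slow path.
    /// `align` must be a power of two and applies to offsets, not addresses.
    #[inline]
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<Range<usize>> {
        debug_assert!(align.is_power_of_two());
        // Room check: is there space for `size` bytes below `top`?
        if let Some(unaligned) = self.top.checked_sub(size) {
            // Rounding down to the alignment can never underflow.
            let start = unaligned & !(align - 1);
            self.top = start;
            Some(start..start + size)
        } else {
            self.alloc_slow(size, align)
        }
    }

    /// Slow path, kept out of line so the fast path stays small.
    /// A real arena would grab a new chunk here; this sketch reports failure.
    #[cold]
    #[inline(never)]
    fn alloc_slow(&mut self, _size: usize, _align: usize) -> Option<Range<usize>> {
        None
    }

    /// Access the bytes behind a previously returned range.
    pub fn bytes_mut(&mut self, range: Range<usize>) -> &mut [u8] {
        &mut self.buf[range]
    }
}
```

Marking the slow path `#[cold]` and `#[inline(never)]` is a common way to keep the inlined fast path down to the check and the bump.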
# Using more information
Rustc and LLVM can eliminate if/else branches whose conditions are known at compile time.
Different types make different information available at compile time, such as alignment and size. When this information is available, stumpalo uses it, along with information about the hardware you’re running on, to skip overflow and underflow checks where overflow or underflow couldn’t possibly occur.
Generally, stumpalo’s fast paths contain a single conditional branch, and as few as six instructions.
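As a rough illustration of the idea (not stumpalo's actual code), the sketch below uses `size_of::<T>()` and `align_of::<T>()` to decide at compile time whether an underflow check is needed. The helper `bump_down` and the constant `LOWEST_VALID_ADDR` are hypothetical; the constant stands in for "information about the hardware" by assuming no chunk ever lives below address 4096.

```rust
use std::mem::{align_of, size_of};

/// Assumption for this sketch: no chunk ever lives below this address
/// (for example, because the first page is never mapped).
const LOWEST_VALID_ADDR: usize = 4096;

/// Hypothetical helper: where would a `T` go when bumping down from `top`?
/// Returns `None` when the slow path would be needed.
#[inline]
pub fn bump_down<T>(top: usize, bottom: usize) -> Option<usize> {
    let size = size_of::<T>();
    let align = align_of::<T>();

    // Align down first. Alignment is a power of two, so this is a single
    // `and` and can never underflow.
    let aligned = top & !(align - 1);

    // `size` and `align` are constants for each `T`, so the optimizer is
    // expected to keep only one arm of this `if`.
    let start = if size <= LOWEST_VALID_ADDR && align <= LOWEST_VALID_ADDR {
        // Small type: `aligned >= LOWEST_VALID_ADDR >= size`, so the
        // subtraction cannot wrap and needs no check.
        aligned.wrapping_sub(size)
    } else {
        // Large type: keep the explicit underflow check.
        aligned.checked_sub(size)?
    };

    // The one remaining runtime branch: does the value fit in the chunk?
    if start >= bottom { Some(start) } else { None }
}
```

For a `u32`, the small-type arm reduces to an align, a subtract, and one comparison against the bottom of the chunk.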
# Less indirection
A stumpalo arena contains pointers to the top and bottom of the chunk. Other libraries contain a pointer to a chunk, whose header in turn contains the pointer to its top. Stumpalo goes through one fewer layer of indirection to read the top.
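The difference between the two layouts can be sketched with plain structs. These types are hypothetical and use `usize` in place of real pointers; they only illustrate how many dependent loads it takes to reach the top.

```rust
/// Layout in the style the post describes for stumpalo:
/// the bounds live directly in the arena.
pub struct DirectArena {
    pub top: usize,
    pub bottom: usize,
}

/// Layout in the style the post describes for other libraries:
/// the arena points at a chunk header, and the header holds the bounds.
pub struct ChunkHeader {
    pub top: usize,
    pub bottom: usize,
}

pub struct IndirectArena {
    pub current: Box<ChunkHeader>,
}

impl DirectArena {
    /// One load: read `top` straight out of the arena.
    #[inline]
    pub fn top(&self) -> usize {
        self.top
    }
}

impl IndirectArena {
    /// Two dependent loads: first the header pointer, then `top` behind it.
    #[inline]
    pub fn top(&self) -> usize {
        self.current.top
    }
}
```

Both accessors return the same value; the indirect one just has to chase a pointer first.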
# Example
The following function:

    fn alloc_u32(a: &mut Arena, n: u32) -> &mut u32 {
        a.alloc(n)
    }

Compiles down to this fast path:

    alloc_u32:
        mov rcx, qword ptr [rdi]
        and rcx, -4
        lea rax, [rcx - 4]
        cmp rax, qword ptr [rdi + 8]
        jb example::ArenaRef::alloc_slow_with::h903e68372b5b408b
        mov dword ptr [rcx - 4], esi
        mov...