-->
Eight million pixels and counting<br>– Custom allocators in Rust
Nical
Custom allocators in Rust
Lately I have been looking into custom allocators in the Rust. Recent internet chatter about the Zig programming language has brought allocators back to my attention, but custom allocators are an old topic. I remember playing with that a long while back in my C++ days after watching a talk by Andrei Alexandrescu. I found it a satisfying thing to hack on because you can start off with simple things that work quickly, build them out of composable parts and of course the rabbit hole is deep enough that after spending a lot of time you get to feel great about pulling advanced stunts. That said, I'm already digressing. I am not going to write about building memory allocators today, just about writing data structures that can accept different allocators. This isn't about replacing the global allocator either, mind you, but making it possible for different parts of a program to chose their memory allocation strategies. I'll first superficially ramble a bit about the the motivation and state of affairs in the Rust ecosystem, then I'll go over an experiment I made to integrate custom allocators in lyon's tessellator in a few different ways.
The ubiquitous global allocator
I've long had a mild dislike for the "one and only global allocator" strategy. It's a pretty simple and decent default, to a point that a lot of programmers forgot that there are alternative ways to work with memory. I look at performance profiles a lot at work and allocation/deallocation is at the same time always prominent and also kind of elusive. It is hard to distill down simple benchmarks for allocation overhead, because of much of that overhead comes from cache misses and contention between multiple threads using the same allocator.
To mitigate allocation performance issues, it is very common to build algorithms in ways that allow keeping the data structures that hold intermediate state around. This way subsequent runs can reuse the allocations from the previous ones. lyon's tessellator is written like that for example. In WebRender we also try to keep temporary allocations around. I'll refer to this approach as "recycling allocations" and it is basically about accepting that interacting with allocators is slow and making contraptions to avoid it.
Another option is to take a step back and think a bit about whether memory allocations really need to be slow. What are the different allocation patterns? Are they all best served by a general purpose allocator? Bump allocators are very good at providing very fast short-lived allocations. They also can work very well with groups of allocations with a common and well defined scope, for example a frame in a video game or a graphics rendering engine. Most allocations stay in the thread they are created with. Using a simpler thread-local allocator wherever it makes sense can provides some wins.<br>Of course there are also long-lived allocations with mostly unpredictable lifetimes backing resources that may or may not move to different threads. These are well served by a global allocator, are they the majority of allocations in majority of programs? I don't think so and I'm not the only one. Some languages (Zig, Jai and others) have made that it a pillar of their library design to make it possible and convenient to pick the right allocation strategy everywhere allocation needs to happen.
Using custom allocators to avoid paying the cost of memory allocations for temporary data sounds like more work than recycling for about the same benefits, but the benefits can go beyond merely skipping the allocator overhead. When keeping around an allocation for later use, we take memory that is nice and warm in the cache hierarchy after its use and enforce that only future uses of this specific code will be able to use it again. Something else that runs immediately after could have benefited from that memory for its own temporary allocations, but it will have to reach for other pieces of memory that may not be as readily available.<br>When running on a tight memory budgets, hoarding temporary data structures can also occupy precious resources.
Of course all of these advantages require understanding allocation lifetime patterns and picking the right strategy for the each part of the code. It's not like slapping a bump allocator under an algorithm will always make it faster. That algorithm could be made slower if the underlying allocator is fast but fails to reuse memory that is warm in CPU caches. It's an extra mental load for the programmer for sure and it may not be the right amount of effort for everyone. Whether or not this option is good for everyone, I think that we will be in a good place when that option is easily reachable.
Custom allocators and Rust
Rust, like most languages chose to start with the stance that most people should not have to think about how memory allocation is happening. All allocations...