Restartable Sequences
May 31st, 2026 @ justine's web page<br>Restartable Sequences
The best kept secret at the frontier of system programming right now is<br>the Linux 4.18+ (c. 2018) concept of restartable sequences or rseq for<br>short. They allow you to create thread-safe data structures without<br>locks or atomics which scale to microprocessors with many cores.
It's currently only possible to use rseq on Linux using handwritten<br>assembly code. However I believe in the future, all operating systems<br>will be updated to support rseq(), all system programming<br>languages will be redesigned to be able to express restartable<br>sequences, and all data structure libraries will be rewritten to use<br>them.
So far the only software I've seen using rseq is tcmalloc, jemalloc,<br>glibc, and cosmopolitan. That's destined to change now that<br>microprocessors with 128 or even 192 cores are becoming inexpensive. For<br>example,
On my $160 Raspberry Pi 5 (which<br>has 4 cores), rseq makes my malloc()<br>implementation 3x faster versus having a dlmalloc<br>mspace assigned to each thread. For most developers, that's a take it or<br>leave it kind of improvement. However,
On my $4,834 System76 Thelio Astra<br>with Ampere's 128 core 3GHz Altra CPU, rseq makes cosmopolitan<br>malloc() go 34x faster (compared to<br>sharding ops over an array of mspaces<br>using sched_getcpu()%32)
On my $17,628.55 AMD Threadripper Pro<br>7995WX with 96 cores, rseq makes<br>my malloc() 43x faster (versus using that<br>same sched_getcpu() mutex sharding technique)
System programmers who don't have a workstation like the ones above are<br>going to be left behind like a dinosaur, with no opportunity to pluck<br>the low hanging fruit of 10x performance optimizations. For example, I<br>wouldn't have been able to pull off the<br>speedups I made to matrix multiplication last<br>year if I hadn't splurged on a 96 core CPU. It put me in the poor house<br>for a few months (since the cheaper Ampere workstations weren't<br>available it the time) but was so worth it, since my work received<br>press<br>coverage,<br>it made<br>me famous in the AI community, it helped my project<br>get adopted<br>by 32% of organizations, and even earned me a job offer from Google<br>to work in their Gradient Canopy improving TPU performance for Gemini.
If you do have one of these microprocessors, then restartable sequences<br>are going to be one of the most important tricks you'll use to exploit<br>its capabilities. This tutorial will show you how they work, and provide<br>you with a concrete example for pushing and popping which can be<br>immediately useful.
What Problems Do Restartable Sequences Solve?
Whenever the Cosmopolitan C runtime creates a thread on a Linux system,<br>it issues an rseq() system call which gives the kernel 32<br>bytes of TLS memory. Then, for the remainder of that thread's life, the<br>kernel will update the TLS memory with the CPU number whenever the<br>thread is rescheduled. I found that to be immediately helpful for<br>improving my sched_getcpu() implementation. Since now it<br>just needs a 1 nanosecond relaxed mov instruction to get<br>the CPU number, whereas before I needed to wait an entire microsecond<br>for the getcpu() system call.
However it gets better. There's a second field in the rseq TLS memory<br>that allows the thread to send information back to the kernel. Normally<br>the rseq_cs field is NULL, but it can be<br>updated with a pointer specifying a sequence of assembly instructions in<br>your program. Now, whenever the kernel preempts your thread and tries to<br>move it to a different CPU, it'll notice your rseq_cs is<br>non-null, and will check the program counter (a.k.a. %rip on x86) to see<br>if it's currently within the specified interval. If that's the case,<br>then the kernel will force the thread to jump to an abort handler you<br>also specify, which can do things like jump back to the beginning of the<br>function to retry the operation.
Here's why we need that. Let's say you have a GIL like this:
static pthread_mutex_t lock;<br>static struct List *list;
If you're using that to protect your data structures, then it's going to<br>go slow on systems with dozens of cores, since only a single thread can<br>hold the lock at any given moment. So you might think, let's create a<br>lockless list using atomics. That's pretty simple if we're only pushing,<br>but if we want to also be able to pop, then we'd need to account for the<br>ABA<br>problem with something like the following:
#define MASQUE 0x00fffffffffffff0 // supports pml5t w/ malloc'd memory<br>#define PTR(x) ((uintptr_t)(x) & MASQUE)<br>#define TAG(x) ROL((uintptr_t)(x) & ~MASQUE, 8)<br>#define ABA(p, t) ((uintptr_t)(p) | (ROR((uintptr_t)(t), 8) & ~MASQUE))<br>#define ROL(x, n) (((x) > (64 - (n))))<br>#define ROR(x, n) (((x) >> (n)) | ((x) struct List {<br>struct List *next;<br>// ...<br>};
_Atomic(struct List *) list;
void push(struct List *elem) {<br>struct List *tip;<br>for (tip = atomic_load_explicit(&list, memory_order_relaxed);;) {<br>elem->next = (struct List *)PTR(tip);<br>if (atomic_compare_exchange_weak_explicit(<br>&list, &tip, (struct List *)ABA(elem, TAG(tip) +...