Restartable Sequences

May 31st, 2026 @ justine's web page

The best kept secret at the frontier of system programming right now is the Linux 4.18+ (c. 2018) concept of restartable sequences, or rseq for short. They let you build thread-safe data structures, without locks or atomics, that scale to microprocessors with many cores.

It's currently only possible to use rseq on Linux using handwritten assembly code. However I believe in the future, all operating systems will be updated to support rseq(), all system programming languages will be redesigned to be able to express restartable sequences, and all data structure libraries will be rewritten to use them.

So far the only software I've seen using rseq is tcmalloc, jemalloc, glibc, and cosmopolitan. That's destined to change now that microprocessors with 128 or even 192 cores are becoming inexpensive. For example,

On my $160 Raspberry Pi 5 (which has 4 cores), rseq makes my malloc() implementation 3x faster versus having a dlmalloc mspace assigned to each thread. For most developers, that's a take it or leave it kind of improvement. However,

On my $4,834 System76 Thelio Astra with Ampere's 128 core 3GHz Altra CPU, rseq makes cosmopolitan malloc() go 34x faster (compared to sharding ops over an array of mspaces using sched_getcpu()%32)

On my $17,628.55 AMD Threadripper Pro 7995WX with 96 cores, rseq makes my malloc() 43x faster (versus using that same sched_getcpu() mutex sharding technique)

System programmers who don't have a workstation like the ones above are going to be left behind like a dinosaur, with no opportunity to pluck the low hanging fruit of 10x performance optimizations. For example, I wouldn't have been able to pull off the speedups I made to matrix multiplication last year if I hadn't splurged on a 96 core CPU. It put me in the poor house for a few months (since the cheaper Ampere workstations weren't available at the time) but was so worth it, since my work received press coverage, it made me famous in the AI community, it helped my project get adopted by 32% of organizations, and even earned me a job offer from Google to work in their Gradient Canopy improving TPU performance for Gemini.

If you do have one of these microprocessors, then restartable sequences are going to be one of the most important tricks you'll use to exploit its capabilities. This tutorial will show you how they work, and provide you with a concrete example for pushing and popping which can be immediately useful.

What Problems Do Restartable Sequences Solve?

Whenever the Cosmopolitan C runtime creates a thread on a Linux system, it issues an rseq() system call which gives the kernel 32 bytes of TLS memory. Then, for the remainder of that thread's life, the kernel will update the TLS memory with the CPU number whenever the thread is rescheduled. I found that to be immediately helpful for improving my sched_getcpu() implementation, since now it just needs a 1 nanosecond relaxed mov instruction to get the CPU number, whereas before I needed to wait an entire microsecond for the getcpu() system call.
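Here is a minimal sketch of that registration and read, not Cosmopolitan's actual code. The names `rseq_abi`, `my_rseq`, `MY_RSEQ_SIG` and `fast_getcpu` are hypothetical; the struct mirrors the original 32-byte layout from `<linux/rseq.h>`. It assumes nothing else has registered rseq for the thread. glibc 2.35 and later registers its own area at thread start, in which case the kernel refuses a second registration and the sketch falls back to `sched_getcpu()`.

```c
#define _GNU_SOURCE
#include <assert.h>       // static_assert
#include <sched.h>        // sched_getcpu()
#include <stdint.h>
#include <sys/syscall.h>  // SYS_rseq
#include <unistd.h>       // syscall()

// Hypothetical mirror of the original 32-byte rseq ABI (Linux 4.18).
struct rseq_abi {
  uint32_t cpu_id_start;
  uint32_t cpu_id;    // kernel rewrites this whenever the thread is rescheduled
  uint64_t rseq_cs;   // 0 (NULL) unless a critical section is armed
  uint32_t flags;
  uint32_t pad[3];
} __attribute__((aligned(32)));
static_assert(sizeof(struct rseq_abi) == 32, "rseq area must be 32 bytes");

#define MY_RSEQ_SIG 0x53053053u  // assumption: any fixed 32-bit signature works

static __thread struct rseq_abi my_rseq;
static __thread int my_rseq_state;  // 0 = untried, 1 = registered, -1 = unavailable

static void my_rseq_init(void) {
  my_rseq.cpu_id = (uint32_t)-1;  // "uninitialized" marker
  my_rseq_state = -1;
#ifdef SYS_rseq
  // Hand the kernel our TLS area; fails if rseq is unsupported or already registered.
  if (syscall(SYS_rseq, &my_rseq, sizeof(my_rseq), 0, MY_RSEQ_SIG) == 0)
    my_rseq_state = 1;
#endif
}

// Returns the current CPU number, using the rseq area when we own one.
int fast_getcpu(void) {
  if (!my_rseq_state) my_rseq_init();
  if (my_rseq_state == 1)
    return (int)__atomic_load_n(&my_rseq.cpu_id, __ATOMIC_RELAXED);
  return sched_getcpu();  // fallback path
}
```

A real runtime would also unregister the area before the thread exits, since the kernel keeps writing to it for as long as the registration lives.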

However it gets better. There's a second field in the rseq TLS memory that allows the thread to send information back to the kernel. Normally the rseq_cs field is NULL, but it can be updated with a pointer specifying a sequence of assembly instructions in your program. Now, whenever the kernel preempts your thread and tries to move it to a different CPU, it'll notice your rseq_cs is non-null, and will check the program counter (a.k.a. %rip on x86) to see if it's currently within the specified interval. If that's the case, then the kernel will force the thread to jump to an abort handler you also specify, which can do things like jump back to the beginning of the function to retry the operation.
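The interval check can be modeled in plain C. This is only a sketch: `cs_desc` is a hypothetical mirror of the descriptor that rseq_cs points to (field names follow `struct rseq_cs` in `<linux/rseq.h>`), and `resume_ip` is an invented helper that imitates the kernel's decision rather than calling into it.

```c
#include <assert.h>  // static_assert
#include <stddef.h>
#include <stdint.h>

// Hypothetical mirror of the critical section descriptor.
struct cs_desc {
  uint32_t version;
  uint32_t flags;
  uint64_t start_ip;            // first instruction of the sequence
  uint64_t post_commit_offset;  // length of the sequence in bytes
  uint64_t abort_ip;            // where to resume if the sequence is interrupted
};
static_assert(sizeof(struct cs_desc) == 32, "descriptor is 32 bytes");

// Given the interrupted program counter, return where the thread resumes.
uint64_t resume_ip(const struct cs_desc *cs, uint64_t ip) {
  if (cs == NULL) return ip;  // no sequence armed: resume normally
  // Unsigned subtraction wraps, so addresses below start_ip fail this test too.
  if (ip - cs->start_ip < cs->post_commit_offset) return cs->abort_ip;
  return ip;  // outside the interval: resume normally
}
```

With `start_ip = 0x1000` and `post_commit_offset = 0x20`, a program counter of `0x101f` is redirected to `abort_ip`, while `0x1020` (one byte past the end) is left alone.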

Here's why we need that. Let's say you have a GIL like this:

    static pthread_mutex_t lock;
    static struct List *list;
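As a minimal sketch of what that global lock implies, here is a push and pop that serialize every operation on it. The `value` payload and the helper names `locked_push` and `locked_pop` are hypothetical and not from the original.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct List {
  struct List *next;
  int value;  // hypothetical payload
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct List *list;

// Push onto the head of the list while holding the one global lock.
void locked_push(struct List *elem) {
  assert(elem != NULL);
  pthread_mutex_lock(&lock);
  elem->next = list;
  list = elem;
  pthread_mutex_unlock(&lock);
}

// Pop the head of the list, or return NULL if it is empty.
struct List *locked_pop(void) {
  pthread_mutex_lock(&lock);
  struct List *elem = list;
  if (elem) list = elem->next;
  pthread_mutex_unlock(&lock);
  return elem;
}
```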

If you're using that to protect your data structures, then it's going to go slow on systems with dozens of cores, since only a single thread can hold the lock at any given moment. So you might think, let's create a lockless list using atomics. That's pretty simple if we're only pushing, but if we want to also be able to pop, then we'd need to account for the ABA problem with something like the following:

    #define MASQUE 0x00fffffffffffff0 // supports pml5t w/ malloc'd memory
    #define PTR(x) ((uintptr_t)(x) & MASQUE)
    #define TAG(x) ROL((uintptr_t)(x) & ~MASQUE, 8)
    #define ABA(p, t) ((uintptr_t)(p) | (ROR((uintptr_t)(t), 8) & ~MASQUE))
    #define ROL(x, n) (((x) << (n)) | ((x) >> (64 - (n))))
    #define ROR(x, n) (((x) >> (n)) | ((x) << (64 - (n))))

    struct List {
      struct List *next;
      // ...
    };

_Atomic(struct List *) list;

    void push(struct List *elem) {
      struct List *tip;
      for (tip = atomic_load_explicit(&list, memory_order_relaxed);;) {
        elem->next = (struct List *)PTR(tip);
        if (atomic_compare_exchange_weak_explicit(
                &list, &tip, (struct List *)ABA(elem, TAG(tip) +...
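As a self-contained illustration of the tagging idea, here is a sketch that swaps the pointer-bit packing for array indices. The head is a 64-bit word holding a 32-bit node index and a 32-bit tag that is bumped on every successful compare-and-swap, so a head that was popped and pushed back no longer compares equal. The names `Node`, `nodes`, `NIL`, `tagged_push` and `tagged_pop` are hypothetical, and this is not the original article's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NODES 64
#define NIL 0xffffffffu  // index meaning "empty list"

struct Node {
  _Atomic uint32_t next;  // index of the next node, or NIL
  int value;              // hypothetical payload
};

static struct Node nodes[NODES];
static _Atomic uint64_t head = NIL;  // low 32 bits: index, high 32 bits: tag

static uint64_t pack(uint32_t index, uint32_t tag) {
  return (uint64_t)tag << 32 | index;
}
static uint32_t idx(uint64_t h) { return (uint32_t)h; }
static uint32_t tag(uint64_t h) { return (uint32_t)(h >> 32); }

// Push node i; the tag is incremented so stale heads never match again.
void tagged_push(uint32_t i) {
  assert(i < NODES);
  uint64_t old = atomic_load_explicit(&head, memory_order_relaxed);
  do {
    atomic_store_explicit(&nodes[i].next, idx(old), memory_order_relaxed);
  } while (!atomic_compare_exchange_weak_explicit(
      &head, &old, pack(i, tag(old) + 1),
      memory_order_release, memory_order_relaxed));
}

// Pop the head node and return its index, or NIL if the list is empty.
uint32_t tagged_pop(void) {
  uint64_t old = atomic_load_explicit(&head, memory_order_acquire);
  for (;;) {
    uint32_t i = idx(old);
    if (i == NIL) return NIL;
    uint32_t next = atomic_load_explicit(&nodes[i].next, memory_order_relaxed);
    // If another thread changed head in between, the tag differs and we retry.
    if (atomic_compare_exchange_weak_explicit(
            &head, &old, pack(next, tag(old) + 1),
            memory_order_acquire, memory_order_acquire))
      return i;
  }
}
```

Indices keep the node memory valid for the life of the program, which sidesteps the question of reading `next` from a node that was freed after being popped.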
