Concurrent, atomic MSI hash tables

signa111 pts0 comments

Concurrent, atomic MSI hash tables

Concurrent, atomic MSI hash tables

May 06, 2026

nullprogram.com/blog/2026/05/06/

Readers will be familiar with Mask-Step-Index (MSI) hash tables, a<br>technique for building fast, open-addressed hash tables in a dozen lines<br>of code. If multiple threads or processes access an MSI table with<br>at least one still inserting elements, care must be taken to avoid data<br>races. This article will show how to add atomic operations to MSI tables<br>in order to support different concurrency constraints.

Let’s begin with the simplest case: An integer hash set, no deletions,<br>only one insert thread (single producer), and consumers do not care about<br>insert order. That is, the producer inserts A then B, but consumers may<br>observe B in the table before A. Suppose this is the hash table in the<br>single-threaded case:

int32_t *lookup(int32_t key, int32_t *table, int exp)<br>uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;<br>uint32_t mask = ((uint32_t)1 exp) - 1;<br>uint32_t step = (hash >> (32 - exp)) | 1;<br>for (uint32_t index = hash;;) {<br>index = (index + step) & mask;<br>if (!table[index] || table[index]==key) {<br>return table + index;

Keys must be non-zero, and tables are zero-initialized. Usage example:

// Initialization<br>enum { exp = 8 };<br>int32_t table[18] = {};

// Producer<br>for (int i = 0; i nkeys; i++) {<br>*lookup(keys[i], table, exp) = keys[i];

// Consumer<br>int32_t key = 1234;<br>bool present = *lookup(key, table, exp);

The only problem is the data race on table slots. Since consumers can<br>tolerate out-of-order insertions, ordering does not matter and relaxed<br>atomics eliminate the data race. Insert and query now have different<br>requirements, so it makes sense to distinguish them. Starting with the<br>latter:

bool contains(int32_t key, int32_t *table, int exp)<br>uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;<br>uint32_t mask = ((uint32_t)1 exp) - 1;<br>uint32_t step = (hash >> (32 - exp)) | 1;<br>for (uint32_t index = hash;;) {<br>index = (index + step) & mask;<br>int32_t k = __atomic_load_n(table+index, __ATOMIC_RELAXED);<br>if (!k) {<br>return false;<br>} else if (k == key) {<br>return true;

Note how all elements are accessed by atomic loads, as a producer may<br>store to any slot at any time. Now producers:

bool insert(int32_t key, int32_t *table, int exp)<br>uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;<br>uint32_t mask = ((uint32_t)1 exp) - 1;<br>uint32_t step = (hash >> (32 - exp)) | 1;<br>for (uint32_t index = hash;;) {<br>index = (index + step) & mask;<br>if (!table[index]) {<br>__atomic_store_n(table+index, key, __ATOMIC_RELAXED);<br>return true;<br>} else if (table[index] == key) {<br>return false;

This function may load elements non-atomically because there’s only one<br>producer: the current thread. This idea could not be expressed were the<br>type system involved, e.g. _Atomic, but GCC atomics do not<br>involve require such special qualifiers. Stores on the other hand are<br>concurrent with consumers, requiring an atomic store. Single-producer,<br>multiple-consumer (SPMC) usage is nearly identical to the single-threaded<br>case:

// Producer<br>for (int i = 0; i nkeys; i++) {<br>insert(keys[i], table, exp);

// Consumer<br>int32_t key = 1234;<br>bool present = contains(key, table, exp);

A concurrent integer hash table is contrived and unrealistic. In a real<br>program a key likely carries some broader semantic meaning. For example,<br>if that “integer” is actually a memory offset known as a pointer, then it<br>points at some object, and it is important that stores to that object<br>happen before consumers observe the pointer in the table:

bool insert(Thing *thing, Thing **table, int exp)<br>Thing *lookup(Key key, Thing **table, int exp)

Where usage might look like:

// Producer<br>for (int i = 0; i nthings; i++) {<br>things[i].key = ...; // update/init object<br>insert(things+i, table, exp);

// Consumer<br>bool present = !!find((Key){...}, table, exp);

In this case relaxed atomics are insufficient. Updates to the inserted<br>object may be reordered after the insertion, and consumers will race on<br>those updates. In this case we upgrade to acquire-release:

Thing *lookup(Key key, Thing **table, int exp)<br>// ...<br>for (...) {<br>// ...<br>Thing *thing = __atomic_load_n(table+index, __ATOMIC_ACQUIRE);<br>if (!thing || thing->key==key) {<br>return thing;

bool insert(Thing *thing, Thing **table, int exp)<br>// ...<br>for (...) {<br>// ...<br>if (!table[index]) {<br>__atomic_store_n(table+index, thing, __ATOMIC_RELEASE);<br>return true;<br>} else if (table[index]->key == thing->key) {<br>return false;

In this case producer and consumer synchronize on the atomics. Producer<br>stores are ordered before the release, and consumer loads are ordered<br>after the acquire. Objects are not modified once in the table, so atomics<br>are not required for their fields. On some architectures, including x86,<br>there will be no indication at the ISA level that atomics are in use —<br>i.e. this likely generates the same code as the single-threaded version —<br>and these atomics merely constrain the compiler’s instruction scheduling.

As a side...

table index hash thing uint32_t int32_t

Related Articles