Learner's Notes: Linux membarrier() System Call - Details of Asymmetric Fences | Ryan Chung<br>Post<br>Cancel<br>Asymmetric Fences?<br>While browsing through Folly’s synchronization primitives to study up on concurrency concepts, one familiar-looking name in the codebase caught my attention: Asymmetric Thread Fence. I already knew what std::atomic_thread_fence is meant to achieve, but what does this construct do? And what exactly is asymmetric about it?<br>That question led me down an interesting rabbit hole: understanding the details of asymmetric thread fences and what they are actually doing underneath the hood (at least on Linux). We’ll start with the C++ proposal for this construct, then work our way down into the implementation details.<br>C++ P1202R0 Introduction<br>We’ll walk through the mechanics in a later section, but for now it suffices to understand what these fences are trying to achieve. Referring to the proposal, let’s look at the overview:<br>Some types of concurrent algorithms can be split into a common path and an uncommon path , both of which require fences (or other operations with non-relaxed memory orders) for correctness. On many platforms, it’s possible to speed up the common path by adding an even stronger fence type (stronger than memory_order_seq_cst) down the uncommon path. These facilities are being used in an increasing number of concurrency libraries. We propose standardizing these asymmetric fences, and incorporating them into the memory model.
Essentially, a concurrent algorithm may originally require a fence on both sides. However, if one path is significantly more common than the other, we may optimize for the common path by using a lighter fence there, while shifting the heavier synchronization cost onto the uncommon path. This can result in an overall performance improvement if we ensure the performance gain from the common path is much more than the overhead from the heavier fence of the uncommon path.<br>This pattern is more common than it may first appear. If you browse through Folly’s codebase, you can find several uses of it:<br>folly/synchronization/HazptrDomain.h: Hazard Pointersfolly/synchronization/detail/ThreadCachedReaders.h: RCUfolly/executors/ThreadPoolExecutor.cpp: Thread Pool ExecutorLet’s begin with a modified example from the paper, known as Dekker’s example:
10<br>11<br>12<br>13<br>14<br>15<br>atomic_int x{0}, y{0};<br>int r1, r2;
// Path A:<br>x.store(1, memory_order_relaxed);<br>atomic_thread_fence(memory_order_seq_cst);<br>r1 = y.load(memory_order_relaxed);
// Path B:<br>y.store(1, memory_order_relaxed);<br>atomic_thread_fence(memory_order_seq_cst);<br>r2 = x.load(memory_order_relaxed);
// Happens after both paths are completed.<br>assert(!(r1 == 0 && r2 == 0)); // Never fails.
We know the assertion cannot fail. Let’s examine why under the C++ memory model:<br>Constraint $(1) \rightarrow (2)$ and $(1) \rightarrow (5)$ arise from the synchronizes-with relationship between thread construction and the start of the thread function [thread.thread.constr]/6. This also establishes the modification order of both X and Y via the write-write coherence guarantees described in [intro.races]/15.Constraint $(4) \rightarrow (5)$ and $(7) \rightarrow (2)$ are coherence-orderings that arise from the requirements in [atomics.order]/3.3. For example, $(4)$ and $(5)$ are not the same atomic read-modify-write operation, and $(4)$ reads the value stored by $(1)$, and $(1)$ precedes $(5)$ in the modification order of Y.From [atomics.order]/4.4, memory_order::seq_cst fence $A$ happens-before $(4)$, and $(5)$ happens-before memory_order::seq_cst fence $B$, therefore $(3)$ must precede $(6)$ in the single total order $S$ on all memory_order::seq_cst operations. By symmetric reasoning, $(6)$ must also precede $(3)$ in the same total order $S$.This is impossible, hence a contradiction. Therefore the execution where r1 == 0 && r2 == 0 is forbidden.<br>Now, returning to asymmetric thread fences. The key idea is that in some concurrent algorithms, we care about optimizing a common path at the expense of an uncommon path . If we are willing to let the slow path absorb stronger synchronization costs, we can restructure the code using asymmetric fences. The paper illustrates this with the following transformation:
10<br>11<br>12<br>13<br>14<br>15<br>16<br>17<br>// Globals:<br>atomic_int x{0}, y{0};<br>int r1, r2;
// Fast path:<br>x.store(1, memory_order_relaxed);<br>asymmetric_thread_fence_light(); // Like atomic_signal_fence;<br>// only inhibits compiler reordering<br>r1 = y.load(memory_order_relaxed);
// Slow path:<br>y.store(1, memory_order_relaxed);<br>asymmetric_thread_fence_heavy(); // Stronger than atomic_thread_fence(memory_order_seq_cst);<br>r2 = x.load(memory_order_relaxed);
// Happens after both fast path and slow path are complete.<br>assert(!(r1 == 0 && r2 == 0)); // Never fails.
That’s all we need to know about them for now. Let’s now move on to how these asymmetric fences are implemented.<br>Asymmetric Light Fence - Compiler Barrier<br>Let’s start with the simpler primitive:...