The Slowdown That Doesn't Show Up in Profiles
006 · 2026-05-17 · false sharing, cache lines, struct layout
I had a channel state struct with three atomic fields — a status flag and two counters. Each one was written by a different thread, and they didn’t share any data through mutexes or references. Every field was independently owned.
#[repr(C)]<br>struct ChannelState {<br>status: AtomicU8, // control thread<br>rx_count: AtomicU64, // reader thread<br>tx_count: AtomicU64, // writer thread<br>}
It was fast single-threaded. When I added a second thread it got slower, and a third made it worse. The more cores I threw at it, the less work each one actually got done.
I ran perf stat and IPC looked fine. Flamegraph showed nothing unexpected — the hot function was a tight fetch_add loop, exactly where it should be. CPU utilization was high but work wasn’t getting done.
I spent an afternoon on it before realizing the answer had nothing to do with my code.
Cache lines
CPUs don’t read individual bytes from memory. They pull in fixed-size contiguous blocks called cache lines, 64 bytes on most current x86-64 parts. When any core writes to any byte in a line, every other core’s cached copy of that entire 64-byte block gets invalidated: not just the byte that changed, the whole block.
That’s the cache coherency protocol doing its job. A round-trip to re-fetch a line from another core’s cache costs tens of nanoseconds, which is fast in isolation but adds up quickly in a tight loop.
My struct fit in a single cache line:
cache line 0 — 64 bytes
status (1B)<br>padding (7B)<br>rx_count (8B)<br>tx_count (8B)<br>unused (40B)
Three threads writing to three separate fields, with no shared data as far as the source code is concerned. But they all sit in the same 64-byte block, so every time core 0 writes status, cores 1 and 2 lose their cached copies of rx_count and tx_count.
That’s false sharing — the threads aren’t sharing any data, they’re sharing a cache line.
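One way to sanity-check a layout like this is to ask the compiler where each field lands. The following is a minimal sketch, assuming a 64-byte line and Rust 1.77 or newer for `std::mem::offset_of!`. `LINE`, `line_of_offset` and `field_lines` are names invented here for illustration. The line index is counted from the start of the struct, so it only matches real hardware lines when the struct itself begins on a line boundary.

```rust
use std::mem::offset_of;
use std::sync::atomic::{AtomicU64, AtomicU8};

/// Assumed cache line size in bytes (typical for x86-64, not universal).
const LINE: usize = 64;

#[repr(C)]
#[allow(dead_code)]
struct ChannelState {
    status: AtomicU8,    // control thread
    rx_count: AtomicU64, // reader thread
    tx_count: AtomicU64, // writer thread
}

/// Cache line index of a byte offset, counted from the start of the struct.
fn line_of_offset(offset: usize) -> usize {
    offset / LINE
}

/// Returns (offset, line index) for each field, in declaration order.
fn field_lines() -> [(usize, usize); 3] {
    let offsets = [
        offset_of!(ChannelState, status),
        offset_of!(ChannelState, rx_count),
        offset_of!(ChannelState, tx_count),
    ];
    offsets.map(|o| (o, line_of_offset(o)))
}
```

On x86-64, `field_lines()` reports offsets 0, 8 and 16, all with line index 0: three independently owned fields, one shared line.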
At the hardware level, this looks like two cores passing the same line back and forth, each write invalidating the other’s copy:
STEP 1
core 0: Modified (x y), writes x
core 1: (x y)
Core 0 owns the line. It writes x with no stall, because the data is already in L1.
STEP 2
core 0: Invalid (x y), invalidated
core 1: Modified (x y), writes y
core 1 → core 0: RFO · core 0 → core 1: data
Core 1 writes y. It sends a Request For Ownership (RFO) and core 0 flushes the line. ~40 cycle stall.
STEP 3
core 0: Modified (x y), writes x
core 1: Invalid (x y), invalidated
core 0 → core 1: RFO · core 1 → core 0: data
Core 0 writes x. It needs the line back, so there is another RFO and another ~40 cycles. Repeat forever.
total bus stalls: 80 cycles (and counting)
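The ping-pong above can be replayed with a toy model. This is only a sketch: it tracks a single line with a single owner, ignores the Shared and Exclusive states real protocols have, and charges the illustrative ~40-cycle figure from the steps above rather than a measured cost. `Line`, `RFO_STALL_CYCLES` and `total_stall` are hypothetical names.

```rust
/// Stall charged for one Request For Ownership, in cycles.
/// Illustrative figure taken from the walkthrough, not a measured constant.
const RFO_STALL_CYCLES: u64 = 40;

/// Toy model of one cache line shared by several cores.
/// It only tracks which core (if any) holds the line in Modified state.
struct Line {
    owner: Option<usize>,
}

impl Line {
    /// Core `core` writes any byte in the line; returns the stall in cycles.
    fn write(&mut self, core: usize) -> u64 {
        if self.owner == Some(core) {
            0 // already Modified in this core's cache: no coherence traffic
        } else {
            self.owner = Some(core); // RFO: every other copy becomes Invalid
            RFO_STALL_CYCLES
        }
    }
}

/// Replays a sequence of writes (given as core ids) and sums the stalls.
/// `initial_owner` is the core that holds the line before the first write.
fn total_stall(initial_owner: usize, writers: &[usize]) -> u64 {
    let mut line = Line { owner: Some(initial_owner) };
    writers.iter().map(|&core| line.write(core)).sum()
}
```

`total_stall(0, &[0, 1, 0])` replays the three steps and returns 80, matching the counter. The same three writes issued by core 0 alone return 0, which is the whole point: the cost comes from alternating owners, not from the writes themselves.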
Proving it
I stripped it down to the smallest possible repro: two versions of the same struct, one that packs both fields onto the same cache line and one that pads them apart.
// Version A: both fields on one cache line<br>#[repr(C)]<br>struct Contended {<br>x: AtomicU64, // thread 1 writes here<br>y: AtomicU64, // thread 2 writes here<br>}
// Version B: each field on its own line<br>#[repr(C)]<br>struct Padded {<br>x: AtomicU64,<br>_pad: [u8; 56],<br>y: AtomicU64,<br>}
contended, same cache line
line 0: x (8B), y (8B)
padded, separate lines
line 0: x (8B), pad (56B)
line 1: y (8B)
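Hand-counted padding is easy to break in a later refactor, so it can help to let the compiler check the gap. Here is a minimal sketch, again assuming a 64-byte line and Rust 1.77 or newer for `offset_of!`; `LINE`, `CONTENDED_GAP`, `PADDED_GAP` and `may_share_line` are illustrative names, not part of the benchmark.

```rust
use std::mem::offset_of;
use std::sync::atomic::AtomicU64;

/// Assumed cache line size in bytes.
const LINE: usize = 64;

#[repr(C)]
#[allow(dead_code)]
struct Contended { x: AtomicU64, y: AtomicU64 }

#[repr(C)]
#[allow(dead_code)]
struct Padded { x: AtomicU64, _pad: [u8; 56], y: AtomicU64 }

/// Distance in bytes between the starts of the two fields.
const CONTENDED_GAP: usize = offset_of!(Contended, y) - offset_of!(Contended, x);
const PADDED_GAP: usize = offset_of!(Padded, y) - offset_of!(Padded, x);

// Compile-time guard: the build fails if the padding is removed or shrunk.
// A gap of at least LINE bytes keeps x and y on different lines
// no matter where the struct itself starts in memory.
const _: () = assert!(PADDED_GAP >= LINE);

/// True if two 8-byte-aligned, 8-byte fields `gap` bytes apart
/// can end up on the same line for some placement of the struct.
fn may_share_line(gap: usize) -> bool {
    gap < LINE
}
```

The gap check is deliberately about distance rather than absolute position: `Padded` is only 8-byte aligned, so x is not guaranteed to sit at the start of a line, but two fields 64 bytes apart can never share one.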
Two threads, each doing 50M fetch_add calls on its own field. Warmup, then measure:
use std::sync::{Arc, atomic::{AtomicU64, Ordering::Relaxed}};<br>use std::time::Instant;
#[repr(C)]<br>struct Contended { x: AtomicU64, y: AtomicU64 }
#[repr(C)]<br>struct Padded { x: AtomicU64, _pad: [u8; 56], y: AtomicU64 }
const N: u64 = 50_000_000;
fn bench<T: Send + Sync + 'static>(<br>label: &str,<br>data: Arc<T>,<br>f0: fn(&T), f1: fn(&T),<br>) {<br>// warmup<br>let (d0, d1) = (data.clone(), data.clone());<br>std::thread::scope(|s| { s.spawn(|| f0(&d0)); s.spawn(|| f1(&d1)); });
// timed run<br>let t = Instant::now();<br>std::thread::scope(|s| { s.spawn(|| f0(&data)); s.spawn(|| f1(&data)); });<br>println!("{label}: {:?}", t.elapsed());<br>}
fn main() {<br>bench("contended", Arc::new(Contended {<br>x: AtomicU64::new(0), y: AtomicU64::new(0),<br>}), |d| { for _ in 0..N { d.x.fetch_add(1, Relaxed); }},<br>|d| { for _ in 0..N { d.y.fetch_add(1, Relaxed); }});
bench("padded", Arc::new(Padded {<br>x: AtomicU64::new(0), _pad: [0; 56], y: AtomicU64::new(0),<br>}), |d| { for _ in 0..N { d.x.fetch_add(1, Relaxed); }},<br>|d| { for _ in 0..N { d.y.fetch_add(1, Relaxed); }});<br>}
50M fetch_add(Relaxed) per thread, 2 threads, Zen 4 single CCD
contended: 924 ms
padded: 184 ms
ratio: 5.0x
Same work, same atomic operations, just 56 bytes of padding between the fields. 5x difference.
The fix
The fix is to put each contended field on its own cache line. crossbeam has CachePadded for exactly this:
use crossbeam_utils::CachePadded;
struct ChannelState {<br>status: CachePadded<AtomicU8>,<br>rx_count: CachePadded<AtomicU64>,<br>tx_count: CachePadded<AtomicU64>,<br>}
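To see what a wrapper like this does to the layout, here is a std-only stand-in built on `#[repr(align(64))]`. It is a sketch, not crossbeam's implementation: `Line64` is a hypothetical name, the 64-byte alignment is an assumption, and the real `CachePadded` chooses its alignment per target, so its sizes may differ from these.

```rust
use std::mem::{align_of, offset_of, size_of};
use std::sync::atomic::{AtomicU64, AtomicU8};

/// Hypothetical stand-in for a cache-padding wrapper: forces 64-byte
/// alignment, which also rounds the size up to a multiple of 64.
#[repr(align(64))]
#[allow(dead_code)]
struct Line64<T>(T);

#[repr(C)]
#[allow(dead_code)]
struct ChannelState {
    status: Line64<AtomicU8>,    // control thread
    rx_count: Line64<AtomicU64>, // reader thread
    tx_count: Line64<AtomicU64>, // writer thread
}

/// Byte offsets of the three fields, in declaration order.
fn field_offsets() -> [usize; 3] {
    [
        offset_of!(ChannelState, status),
        offset_of!(ChannelState, rx_count),
        offset_of!(ChannelState, tx_count),
    ]
}

/// (size, alignment) of the whole struct, in bytes.
fn layout() -> (usize, usize) {
    (size_of::<ChannelState>(), align_of::<ChannelState>())
}
```

With this wrapper the fields land at offsets 0, 64 and 128, and the struct grows from 24 to 192 bytes. Unlike a bare padding array, the alignment attribute also pins the start of each field to a line boundary.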
Or without the dependency, manual padding:
#[repr(C)]<br>struct ChannelState {<br>status: AtomicU8,<br>_pad0: [u8; 63],<br>rx_count: AtomicU64,<br>_pad1: [u8;...