Spent an afternoon on a perf issue that 56 bytes of padding fixed

The Slowdown That Doesn't Show Up in Profiles


006 · 2026-05-17 · false sharing, cache lines, struct layout

I had a channel state struct with three atomic fields — a status flag and two counters. Each one was written by a different thread, and they didn’t share any data through mutexes or references. Every field was independently owned.

#[repr(C)]
struct ChannelState {
    status: AtomicU8,    // control thread
    rx_count: AtomicU64, // reader thread
    tx_count: AtomicU64, // writer thread
}

It was fast single-threaded. When I added a second thread it got slower, and a third made it worse. The more cores I threw at it, the less work each one actually got done.

I ran perf stat and IPC looked fine. Flamegraph showed nothing unexpected — the hot function was a tight fetch_add loop, exactly where it should be. CPU utilization was high but work wasn’t getting done.

I spent an afternoon on it before realizing the answer had nothing to do with my code.

Cache lines

CPUs don’t read individual bytes from memory. They pull in contiguous blocks called cache lines, 64 bytes each on the x86-64 hardware I was using. When any core writes to any byte in a line, every other core’s cached copy of that entire 64-byte block gets invalidated: not just the byte that changed, the whole block.

That’s the cache coherency protocol doing its job. A round-trip to re-fetch a line from another core’s cache costs tens of nanoseconds, which is fast in isolation but adds up quickly in a tight loop.

My struct fit in a single cache line:

cache line 0 (64 bytes):

status (1B) | padding (7B) | rx_count (8B) | tx_count (8B) | unused (40B)

Three threads writing to three separate fields, with no shared data as far as the source code is concerned. But they all sit in the same 64-byte block, so every time core 0 writes status, cores 1 and 2 lose their cached copies of rx_count and tx_count.

That’s false sharing — the threads aren’t sharing any data, they’re sharing a cache line.
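The layout above can be checked mechanically. The following is a minimal sketch, not code from the original project: it assumes 64-byte lines, assumes the struct starts on a line boundary (an 8-byte-aligned struct is not guaranteed to), and uses two hypothetical helpers, `line_of` and `field_offsets`, that exist only for this illustration. It needs Rust 1.77 or newer for `offset_of!`.

```rust
use std::mem::{offset_of, size_of};
use std::sync::atomic::{AtomicU64, AtomicU8};

// Assumption: 64-byte cache lines, as in the article. Other CPUs may differ.
const CACHE_LINE: usize = 64;

#[repr(C)]
#[allow(dead_code)]
struct ChannelState {
    status: AtomicU8,
    rx_count: AtomicU64,
    tx_count: AtomicU64,
}

/// Index of the cache line a byte offset falls in, counted from the start
/// of the struct. Only meaningful if the struct starts on a line boundary.
fn line_of(offset: usize) -> usize {
    offset / CACHE_LINE
}

/// Byte offset of each field, as laid out by #[repr(C)].
fn field_offsets() -> [(&'static str, usize); 3] {
    [
        ("status", offset_of!(ChannelState, status)),
        ("rx_count", offset_of!(ChannelState, rx_count)),
        ("tx_count", offset_of!(ChannelState, tx_count)),
    ]
}

fn main() {
    for (name, off) in field_offsets() {
        println!("{name:9} offset {off:2} -> line {}", line_of(off));
    }
    println!("size_of::<ChannelState>() = {}", size_of::<ChannelState>());
}
```

All three fields report line 0, which is the condition for false sharing between their writers. If the struct happened to start late in a line it could straddle two lines, but at 24 bytes at least two of the fields would still share one.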

At the hardware level, two cores passing the same line back and forth, each write invalidating the other:

STEP 1
core 0: Modified, writes x
core 1: not writing
Core 0 owns the line. Writes x: no stall, the data is in L1.

STEP 2
core 0: Invalid (invalidated)
core 1: Modified, writes y
RFO goes from core 1 to core 0, data comes back from core 0 to core 1.
Core 1 writes y. Sends Request For Ownership. Core 0 flushes the line. ~40 cycle stall.

STEP 3
core 0: Modified, writes x
core 1: Invalid (invalidated)
RFO goes from core 0 to core 1, data comes back from core 1 to core 0.
Core 0 writes x. Needs the line back. Another RFO, another ~40 cycles. Repeat forever.

total bus stalls: 80 cycles (and counting)
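The same three steps can be replayed in a toy model. This is a sketch for intuition only, not a simulation of real hardware: it tracks just the Modified and Invalid states, charges the flat ~40-cycle figure quoted above for every ownership transfer, and the `replay` function and `RFO_STALL_CYCLES` constant are names made up for this example.

```rust
/// Per-core state of one cache line in this toy model (a subset of MESI).
#[derive(Clone, Copy, Debug, PartialEq)]
enum LineState {
    Modified,
    Invalid,
}

// Assumption: the ~40-cycle transfer cost quoted above. Real costs vary by CPU.
const RFO_STALL_CYCLES: u64 = 40;

/// Replays a sequence of writes against a single shared line and returns the
/// total stall cycles. Each entry in `writes` is the index of the writing
/// core; `owner` is the core that starts with the line in Modified.
fn replay(writes: &[usize], cores: usize, owner: usize) -> u64 {
    let mut state = vec![LineState::Invalid; cores];
    state[owner] = LineState::Modified;
    let mut stalls = 0;
    for &core in writes {
        if state[core] == LineState::Invalid {
            // Request For Ownership: every other copy is invalidated and
            // the writer takes the line in Modified.
            for s in state.iter_mut() {
                *s = LineState::Invalid;
            }
            state[core] = LineState::Modified;
            stalls += RFO_STALL_CYCLES;
        }
        // Already Modified: the write hits locally and costs nothing here.
    }
    stalls
}

fn main() {
    // The three steps above: core 0 writes x, core 1 writes y, core 0 writes x.
    println!("ping-pong:  {} cycles", replay(&[0, 1, 0], 2, 0)); // 80
    // Same number of writes, all from the core that already owns the line.
    println!("one writer: {} cycles", replay(&[0, 0, 0], 2, 0)); // 0
}
```

The model makes the cost structure visible: the stall count depends on how often the writer changes, not on which bytes of the line are written.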

Proving it

I stripped it down to the smallest possible repro: two versions of the same struct, one that packs both fields onto the same cache line and one that pads them apart.

// Version A: both fields on one cache line
#[repr(C)]
struct Contended {
    x: AtomicU64, // thread 1 writes here
    y: AtomicU64, // thread 2 writes here
}

// Version B: each field on its own line
#[repr(C)]
struct Padded {
    x: AtomicU64,
    _pad: [u8; 56],
    y: AtomicU64,
}

contended (same cache line)
line 0: x (8B) | y (8B)

padded (separate lines)
line 0: x (8B) | pad (56B)
line 1: y (8B)

Two threads, each doing 50M fetch_add calls on its own field. Warmup, then measure:

use std::sync::{Arc, atomic::{AtomicU64, Ordering::Relaxed}};
use std::time::Instant;

#[repr(C)]
struct Contended { x: AtomicU64, y: AtomicU64 }

#[repr(C)]
struct Padded { x: AtomicU64, _pad: [u8; 56], y: AtomicU64 }

const N: u64 = 50_000_000;

fn bench<T: Send + Sync + 'static>(
    label: &str,
    data: Arc<T>,
    f0: fn(&T),
    f1: fn(&T),
) {
    // warmup
    let (d0, d1) = (data.clone(), data.clone());
    std::thread::scope(|s| {
        s.spawn(|| f0(&d0));
        s.spawn(|| f1(&d1));
    });

    let t = Instant::now();
    std::thread::scope(|s| {
        s.spawn(|| f0(&data));
        s.spawn(|| f1(&data));
    });
    println!("{label}: {:?}", t.elapsed());
}

fn main() {
    bench(
        "contended",
        Arc::new(Contended {
            x: AtomicU64::new(0),
            y: AtomicU64::new(0),
        }),
        |d| { for _ in 0..N { d.x.fetch_add(1, Relaxed); } },
        |d| { for _ in 0..N { d.y.fetch_add(1, Relaxed); } },
    );

    bench(
        "padded",
        Arc::new(Padded {
            x: AtomicU64::new(0),
            _pad: [0; 56],
            y: AtomicU64::new(0),
        }),
        |d| { for _ in 0..N { d.x.fetch_add(1, Relaxed); } },
        |d| { for _ in 0..N { d.y.fetch_add(1, Relaxed); } },
    );
}

50M fetch_add(Relaxed) per thread, 2 threads, Zen 4 single CCD

contended: 924 ms
padded: 184 ms (5.0x faster)

Same work, same atomic operations, just 56 bytes of padding between the fields. 5x difference.

The fix

The fix is to put each contended field on its own cache line. crossbeam has CachePadded for exactly this:

use crossbeam_utils::CachePadded;

struct ChannelState {
    status: CachePadded<AtomicU8>,
    rx_count: CachePadded<AtomicU64>,
    tx_count: CachePadded<AtomicU64>,
}
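To see what a padding wrapper does to the layout, here is a minimal stand-in built on `#[repr(align(64))]`. It is an illustration under assumptions, not crossbeam's implementation: `Line64` is a hypothetical type, the 64-byte figure is taken from the article, and `layout` is a helper invented for this sketch. It needs Rust 1.77 or newer for `offset_of!`.

```rust
use std::mem::{align_of, offset_of, size_of};
use std::sync::atomic::{AtomicU64, AtomicU8};

/// Hypothetical stand-in for a padding wrapper; not crossbeam's code.
/// Assumption: 64-byte cache lines. The alignment also rounds the size
/// of the wrapper up to 64 bytes.
#[repr(align(64))]
#[allow(dead_code)]
struct Line64<T>(T);

#[repr(C)]
#[allow(dead_code)]
struct ChannelState {
    status: Line64<AtomicU8>,
    rx_count: Line64<AtomicU64>,
    tx_count: Line64<AtomicU64>,
}

/// Returns (size, alignment, field offsets) of the wrapped struct.
fn layout() -> (usize, usize, [usize; 3]) {
    (
        size_of::<ChannelState>(),
        align_of::<ChannelState>(),
        [
            offset_of!(ChannelState, status),
            offset_of!(ChannelState, rx_count),
            offset_of!(ChannelState, tx_count),
        ],
    )
}

fn main() {
    let (size, align, offsets) = layout();
    // prints: size 192, align 64, offsets [0, 64, 128]
    println!("size {size}, align {align}, offsets {offsets:?}");
}
```

The struct grows from 24 to 192 bytes, which is the price of the fix. Because the alignment applies to the whole struct as well, every instance starts on a 64-byte boundary and each field gets a line to itself wherever the struct is allocated.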

Or without the dependency, manual padding:

#[repr(C)]
struct ChannelState {
    status: AtomicU8,
    _pad0: [u8; 63],
    rx_count: AtomicU64,
    _pad1: [u8;...
