Shard your locks: benchmarking 6 Go cache designs

Beyond the Happy Path · June 20, 2026 · 7 min · 1464 words · Misha Strebkov

I built the same in-memory string → string cache six ways, using nothing but the Go standard library, and benchmarked them under read-heavy, balanced, and write-heavy load across 1 to 8 cores. The rankings flip depending on the workload, and one of the “obvious” answers gets slower the more cores you give it.

TL;DR: Shard your locks. A 256-way striped map (sharded) was the all-around winner, up to 8× faster than a single sync.Mutex at 8 cores, and it’s about 15 lines of code. sync.RWMutex, the reflexive fix for “reads are contended,” is a trap: it barely helps reads past two cores and is slower than a plain mutex for writes.

The contenders

| Cache | Idea | One-liner |
|---|---|---|
| naive | Plain map, no locking | Not thread-safe: concurrent writes crash the process. Baseline only. |
| mutex | One sync.Mutex | Simple, correct, doesn’t scale. |
| rwmutex | One sync.RWMutex | Parallel reads, exclusive writes. |
| syncmap | sync.Map | The stdlib’s own concurrent map. |
| sharded | 256 shards, one mutex each | Lock striping. Keys routed by hash. |
| cow | Copy-on-write via atomic.Pointer | Lock-free reads; every write copies the whole map. |

All six satisfy one interface, so a single harness drives them identically. Code: github.com/kluyg/in-memory-cache.

How I measured (the short version)

testing.B + b.RunParallel, 1,000,000 keys, GOMAXPROCS swept 1→8, on a 20-core i7-14700K. Each data point is the median of 10 runs summarized with benchstat; variation was mostly ±0–3%. Throughput below is 1000 / (ns/op) in millions of ops/sec; higher is better.

I measured the cache in-process, not behind HTTP: net/http + JSON cost microseconds, which would bury the nanosecond-scale differences I’m chasing.

The 14700K is a hybrid chip, with 8 performance cores (with hyperthreading) plus 12 efficiency cores, so an unpinned sweep is a trap: as GOMAXPROCS rises, the OS can spill goroutines onto E-cores or hyperthread siblings and migrate them mid-run, which confounds the scaling curves.
So the process is pinned to one thread per physical P-core (affinity 0x5555); each GOMAXPROCS step adds a real P-core. Pinning shifted the absolute numbers by 10–25% in places but left every ranking and curve shape unchanged.

One deliberate non-axis: value size doesn’t matter here. Go strings are immutable, so Set stores a 16-byte header and never touches the value bytes; 64 B and 16 KB benchmark identically (0 B/op). Value size affects memory and GC, not op throughput.

The results

Read the slopes, not just the heights: sharded and cow climb; mutex is flat. More cores, more throughput, unless you picked a single lock.

cow owns read-only (87 Mops/s at 8 cores, fully lock-free reads) and vanishes the moment writes appear: it’s pinned to ≈0 on the three write panels because every Set copies the entire million-entry map. sharded is the only design that’s near the top in every panel.

The obvious fix scales backwards

Normalize each design to its own single-core throughput and the story gets sharper:
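As a rough sketch of that normalization, here is the arithmetic behind a scaling factor, using the article’s own throughput definition of 1000 / (ns/op). The function names and the sample timings are hypothetical and chosen only to illustrate the calculation; they are not taken from the benchmark repo or its results.

```go
package main

import "fmt"

// mops converts a benchmark's ns/op into millions of ops per second,
// the throughput unit used in this article: 1000 / (ns/op).
func mops(nsPerOp float64) float64 {
	return 1000 / nsPerOp
}

// scaling returns a design's throughput at N cores relative to the same
// design at one core. Above 1 means extra cores helped; below 1 means
// they made things slower.
func scaling(nsPerOpOneCore, nsPerOpNCores float64) float64 {
	return mops(nsPerOpNCores) / mops(nsPerOpOneCore)
}

func main() {
	// Hypothetical timings (not measured data): a single-lock design
	// that slows from 20 ns/op on one core to 30 ns/op on eight.
	fmt.Printf("1 core:  %.1f Mops/s\n", mops(20))
	fmt.Printf("8 cores: %.1f Mops/s\n", mops(30))
	fmt.Printf("scaling: %.2fx\n", scaling(20, 30))
}
```

Because each design is divided by its own baseline, a scaling factor says nothing about absolute speed: a slow design can have a steep curve and still lose to a faster one with a flatter curve.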

mutex is below 1×: at 8 cores it’s 0.66× its single-core speed. Reads can’t run in parallel, and the cache line holding the lock ping-pongs between cores. You added hardware and lost performance.

rwmutex plateaus around 2×. The shared reader counter becomes the new contention point; it stops improving after ~4 cores.

sharded reaches 6.9×, while cow and syncmap track, and even slightly exceed, the ideal 8× line (lock-free reads get a bonus from the larger aggregate cache). Caveat: syncmap’s great slope flatters a poor baseline; it’s still slower in absolute terms than sharded.

Skew isn’t simply “worse”

Real caches see Zipfian access: a few hot keys take most of the traffic. The common assumption is that skew hurts. It’s more interesting than that:

Above 1× means faster under skew. Reads get faster almost everywhere because the hot keys stay in CPU cache (mutex reads speed up 1.6×, syncmap 1.9×). The striking exception is sharded’s balanced mix at 0.82×, where skew makes it slower: hot keys collide on a few shards, so those locks contend while the rest sit idle.

cow is the control case: its balanced-mix bar sits at 1.03×, essentially flat. That’s the tell-tale of a design whose write cost is distribution-independent: it copies the whole map on every Set regardless of which key changed, so the key distribution can’t touch it. Skew moves a number only where the distribution changes where work lands (cache lines, shards); it leaves cow’s uniform...
