Shard your locks: benchmarking 6 Go cache designs


Beyond the Happy Path
June 20, 2026 · 7 min · 1464 words · Misha Strebkov

Table of Contents
- The contenders
- How I measured (the short version)
- The results
  - The obvious fix scales backwards
  - Skew isn’t simply “worse”
  - The numbers (8 cores, ns/op, lower is better)
- The winner, in a few lines
  - Why 256 shards?
- What to actually use
- Three things that surprised me

I built the same in-memory string → string cache six ways, using nothing but the Go standard library, and benchmarked them under read-heavy, balanced, and write-heavy load across 1 to 8 cores. The rankings flip depending on the workload, and one of the “obvious” answers gets slower the more cores you give it.

TL;DR: Shard your locks. A 256-way striped map (sharded) was the all-around winner, up to 8× faster than a single sync.Mutex at 8 cores, and it’s about 15 lines of code. sync.RWMutex, the reflexive fix for “reads are contended,” is a trap: it barely helps reads past two cores and is slower than a plain mutex for writes.

The contenders

| Cache | Idea | One-liner |
|---|---|---|
| naive | Plain map, no locking | Not thread-safe: concurrent writes crash the process. Baseline only. |
| mutex | One sync.Mutex | Simple, correct, doesn’t scale. |
| rwmutex | One sync.RWMutex | Parallel reads, exclusive writes. |
| syncmap | sync.Map | The stdlib’s own concurrent map. |
| sharded | 256 shards, one mutex each | Lock striping. Keys routed by hash. |
| cow | Copy-on-write via atomic.Pointer | Lock-free reads; every write copies the whole map. |

All six satisfy one interface, so a single harness drives them identically.

Code: github.com/kluyg/in-memory-cache.

How I measured (the short version)

testing.B + b.RunParallel, 1,000,000 keys, GOMAXPROCS swept 1→8, on a 20-core i7-14700K. Each data point is the median of 10 runs summarized with benchstat; variation was mostly ±0–3%. Throughput below is 1000 / (ns/op) in millions of ops/sec; higher is better.
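A minimal sketch of what such a harness might look like, not the code from the linked repository. The Cache interface shape, the benchCache helper, the 1-in-10 write mix, and the sequential key pattern are all assumptions made for illustration. It calls testing.Benchmark so it runs as a plain program rather than under go test, and it uses 1,000 keys instead of 1,000,000 to stay quick.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
	"testing"
)

// Cache is an assumed shape for the shared interface; the article only
// says that all six designs satisfy one interface.
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

// mutexCache is the single-lock design from the table above.
type mutexCache struct {
	mu sync.Mutex
	m  map[string]string
}

func newMutexCache() *mutexCache {
	return &mutexCache{m: make(map[string]string)}
}

func (c *mutexCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[key]
	return v, ok
}

func (c *mutexCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = value
}

// benchCache drives any Cache from parallel goroutines via b.RunParallel.
// writeEvery = 10 means one Set per ten operations; 0 means read-only.
func benchCache(c Cache, keys []string, writeEvery int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			i := 0
			for pb.Next() {
				k := keys[i%len(keys)]
				if writeEvery > 0 && i%writeEvery == 0 {
					c.Set(k, "v")
				} else {
					c.Get(k)
				}
				i++
			}
		})
	})
}

// mopsPerSec converts ns/op into millions of ops per second: 1000 / (ns/op).
func mopsPerSec(nsPerOp float64) float64 { return 1000 / nsPerOp }

func main() {
	keys := make([]string, 1000)
	c := newMutexCache()
	for i := range keys {
		keys[i] = "key-" + strconv.Itoa(i)
		c.Set(keys[i], "v") // pre-populate so reads hit
	}
	r := benchCache(c, keys, 10)
	nsPerOp := float64(r.T.Nanoseconds()) / float64(r.N)
	fmt.Printf("mutex: %.1f ns/op = %.1f Mops/s\n", nsPerOp, mopsPerSec(nsPerOp))
}
```

Because benchCache only sees the interface, swapping in another design is a one-line change, which is the property the single-harness setup relies on.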
I measured the cache in-process, not behind HTTP: net/http + JSON cost microseconds, which would bury the nanosecond-scale differences I’m chasing.

The 14700K is a hybrid chip, with 8 performance cores (with hyperthreading) plus 12 efficiency cores, so an unpinned sweep is a trap: as GOMAXPROCS rises, the OS can spill goroutines onto E-cores or hyperthread siblings and migrate them mid-run, which confounds the scaling curves. So the process is pinned to one thread per physical P-core (affinity 0x5555); each GOMAXPROCS step adds a real P-core. Pinning shifted the absolute numbers by 10–25% in places but left every ranking and curve shape unchanged.

One deliberate non-axis: value size doesn’t matter here. Go strings are immutable, so Set stores a 16-byte header and never touches the value bytes: 64 B and 16 KB benchmark identically (0 B/op). Value size affects memory and GC, not op throughput.

The results

Read the slopes, not just the heights:

- sharded and cow climb; mutex is flat. More cores, more throughput, unless you picked a single lock.
- cow owns read-only (87 Mops/s at 8 cores, fully lock-free reads) and vanishes the moment writes appear: it’s pinned to ≈0 on the three write panels because every Set copies the entire million-entry map.
- sharded is the only design that’s near the top in every panel.

The obvious fix scales backwards

Normalize each design to its own single-core throughput and the story gets sharper:

- mutex is below 1×: at 8 cores it’s 0.66× its single-core speed. Reads can’t run in parallel, and the cache line holding the lock ping-pongs between cores. You added hardware and lost performance.
- rwmutex plateaus around 2×. The shared reader counter becomes the new contention point; it stops improving after ~4 cores.
- sharded reaches 6.9×, while cow and syncmap track, and even slightly exceed, the ideal 8× line (lock-free reads get a bonus from the larger aggregate cache). Caveat: syncmap’s great slope flatters a poor baseline; it’s still slower in absolute terms than sharded.

Skew isn’t simply “worse”

Real caches see Zipfian access: a few hot keys take most of the traffic. The common assumption is that skew hurts. It’s more interesting than that:

Above 1× means faster under skew. Reads get faster almost everywhere: the hot keys stay in CPU cache (mutex reads speed up 1.6×, syncmap 1.9×). The striking exception is sharded’s balanced mix at 0.82×. There, skew makes it slower: hot keys collide on a few shards, so those locks contend while the rest sit idle.

cow is the control case: its balanced-mix bar sits at 1.03×, essentially flat. That’s the tell-tale of a design whose write cost is distribution-independent: it copies the whole map on every Set regardless of which key changed, so the key distribution can’t touch it. Skew moves a number only where the distribution changes where work lands (cache lines, shards); it leaves cow’s uniform...
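To make the copy-on-write cost concrete, here is a minimal sketch of the idea, assuming a writer-side mutex and a full map copy on each Set. The cowCache type and its layout are hypothetical; the article only states that cow uses atomic.Pointer, and the repository may serialize writers differently.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cowCache is a hypothetical copy-on-write cache for illustration.
type cowCache struct {
	mu sync.Mutex                        // serializes writers only (an assumption)
	m  atomic.Pointer[map[string]string] // current snapshot, never mutated in place
}

func newCowCache() *cowCache {
	c := &cowCache{}
	empty := map[string]string{}
	c.m.Store(&empty)
	return c
}

// Get reads the current snapshot without taking any lock.
func (c *cowCache) Get(key string) (string, bool) {
	v, ok := (*c.m.Load())[key]
	return v, ok
}

// Set copies every existing entry into a fresh map and publishes it
// atomically. The loop visits the whole map no matter which key changed,
// so the cost depends on map size, not on the key distribution.
func (c *cowCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := *c.m.Load()
	next := make(map[string]string, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[key] = value
	c.m.Store(&next)
}

func main() {
	c := newCowCache()
	c.Set("a", "1")
	before := c.m.Load() // a snapshot a concurrent reader might still hold
	c.Set("b", "2")      // builds and publishes a new two-entry map
	fmt.Println(len(*before), len(*c.m.Load())) // prints "1 2"
}
```

The old snapshot is left untouched, which is what lets readers skip locking; the price is the full copy in Set, which with a million entries is consistent with the near-zero write throughput reported above.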
