A regression in code I didn't touch
A deep dive into L1 instruction cache set conflicts, associativity, and code alignment in Go.
May 19, 2026
In my previous post I briefly touched on how merely shifting code by a couple of bytes can significantly affect hot-path performance.
CPUs are weird. They don't just take instructions and run them in order. There are caches, branch predictors, prefetchers, yada yada, and all of it is sensitive to where exactly your code sits in memory. The same hot loop at one address can be a few percent slower at another, just because it crossed some invisible boundary somewhere.
Every cache in and around the CPU is a potential source of unexpected performance regressions (or gains) caused by code alignment changes. The hero of this post is the L1 icache, the fastest CPU cache, which stores instructions. On my machine (Intel i5-12500) it's 32KB, 8-way set associative: 64 sets × 8 ways × 64-byte cachelines. Those numbers matter for the story.
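To make those numbers concrete, here is a minimal sketch of how an address could map to a cache set. It assumes the simplest indexing scheme, where the set is picked by the address bits just above the 64-byte line offset; the `setIndex` helper and the sample address are made up for illustration and are not taken from any real tool.

```go
package main

import "fmt"

// L1i geometry from the post (Intel i5-12500): 32KB, 8-way, 64-byte lines.
const (
	cacheSize = 32 * 1024
	ways      = 8
	lineSize  = 64
	numSets   = cacheSize / (ways * lineSize) // 64 sets
)

// setIndex returns the set an address maps to, ASSUMING the set is selected
// by the address bits directly above the line offset (bits 6..11 here).
func setIndex(addr uint64) uint64 {
	return (addr / lineSize) % numSets
}

func main() {
	fmt.Println("sets:", numSets)

	// Addresses numSets*lineSize = 4096 bytes apart land in the same set.
	base := uint64(0x401000) // hypothetical code address
	for i := uint64(0); i < 3; i++ {
		a := base + i*numSets*lineSize
		fmt.Printf("%#x -> set %d\n", a, setIndex(a))
	}
}
```

Under this assumption, each set holds at most 8 lines, so a ninth hot cacheline that maps to the same set has to evict one of the others no matter how empty the rest of the cache is.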
In this post I want to tell you an interesting anecdote: I spent a couple of hours investigating why a change in one piece of code caused a performance regression in a completely unrelated part of the codebase. The root cause was, surprisingly, L1i conflict misses from limited cache associativity.
The Phantom Regression
I was working on improving compression speed of quality level 2 in my Brotli Go port go-brrr.
go version go1.26.2-X:nodwarf5 linux/amd64
goos: linux
goarch: amd64
pkg: github.com/molecule-man/go-brrr
cpu: 12th Gen Intel(R) Core(TM) i5-12500

                 │ /tmp/before.txt │           /tmp/after.txt            │
                 │       B/s       │     B/s       vs base               │
830kb.so.css        297.8Mi ± 0%     304.0Mi ± 0%  +2.08% (p=0.000 n=21)
005kb.webp.js       126.8Mi ± 1%     122.7Mi ± 0%  -3.24% (p=0.000 n=21)
011kb.quer.json     348.5Mi ± 0%     344.8Mi ± 0%  -1.08% (p=0.000 n=21)
Compression of large files got faster (expected). However, performance on small files regressed by 3%, which was completely unexpected: my change touched hash2.go (it doesn't matter what it is), but small files under 64KB are always compressed by hash2u16.go.
At this point I was pretty accustomed to such regressions and would otherwise have happily skipped the investigation, but this time I decided to dig deeper, as a 3% regression was larger than the usual 2% alignment-shift-induced regression typical for this machine.
416 Bytes of Trouble
Since it was pretty clear that the regression was caused by an alignment change, the first thing I did was calculate how much things had shifted in the assembly. The only function I changed was createBackwardReferences in hash2.go, so I dumped the assembly for this function before and after:
go tool objdump -s '\(\*h2\)\.createBackwardReferences' /tmp/bench.before
go tool objdump -s '\(\*h2\)\.createBackwardReferences' /tmp/bench.after
Comparing the instruction addresses in the two dumps showed that my change shrank the function by 402 bytes. The Go compiler aligns all functions to 32 bytes, meaning the first instruction of any function always starts at an address that is divisible by 32. So the actual downstream shift must be a multiple of 32, and objdump showed it was 416B. The h2u16 code comes after the changed function in the binary, so this is exactly what shifted the machine code of the regressing path.
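The rounding can be sketched as follows. This is only an illustration: it assumes functions are laid out back to back with padding up to the next 32-byte boundary and that nothing else in the binary changed. The old function sizes are hypothetical; only the 402-byte shrink and the 32-byte alignment come from the post.

```go
package main

import "fmt"

const funcAlign = 32 // function alignment stated in the post

// alignUp rounds n up to the next multiple of align (a power of two).
func alignUp(n, align uint64) uint64 {
	return (n + align - 1) &^ (align - 1)
}

// downstreamShift returns how far the next function moves when a function
// starting at an aligned address shrinks from oldSize to newSize bytes.
func downstreamShift(oldSize, newSize uint64) uint64 {
	return alignUp(oldSize, funcAlign) - alignUp(newSize, funcAlign)
}

func main() {
	// Hypothetical old sizes; each one shrinks by the 402 bytes from the post.
	for _, oldSize := range []uint64{1000, 1010, 1024} {
		fmt.Println(oldSize, "->", downstreamShift(oldSize, oldSize-402))
	}
}
```

Depending on how much padding the old function needed, a 402-byte shrink turns into a shift of either 384 or 416 bytes, which is why the real number has to be read off the disassembly.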
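The fact that the h2u16 hot path as a whole shifted by a multiple of 32B already hinted that the root cause was the 64B-aligned instruction cache and not, e.g., Intel's 32B-aligned DSB cache: a shift that is a multiple of 32 leaves every instruction at the same offset within its 32B window, while 416 is not a multiple of 64, so the layout relative to 64B cachelines did change. I didn't realize it back then and continued the investigation.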
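A quick way to check that reasoning, as a sketch: the `sameOffset` helper and the sample address below are hypothetical, and only the 416-byte shift comes from the post.

```go
package main

import "fmt"

// sameOffset reports whether an address keeps its offset within a
// granule-sized window after being moved down by shift bytes.
func sameOffset(addr, shift, granule uint64) bool {
	return addr%granule == (addr-shift)%granule
}

func main() {
	const shift = 416        // downstream shift reported in the post
	addr := uint64(0x4a1234) // hypothetical instruction address before the change

	fmt.Println("32B window offset kept:", sameOffset(addr, shift, 32))
	fmt.Println("64B line offset kept:  ", sameOffset(addr, shift, 64))
	fmt.Println("shift mod 32, mod 64:  ", shift%32, shift%64)
}
```

Because 416 = 13 × 32 = 6 × 64 + 32, the first check holds for any address and the second fails for any address: every instruction ends up half a cacheline away from where it used to sit within its line.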
Perf-action
Note to myself: thank god there is perf, use it more often.
Next, I took perf off the shelf and interrogated it until the picture became crystal clear:
# Build the event list piece by piece so it stays one comma-separated string.
EVENTS='cycles,instructions,branches,branch-misses'
EVENTS="$EVENTS,baclears.any,dsb2mite_switches.penalty_cycles"
EVENTS="$EVENTS,frontend_retired.dsb_miss,frontend_retired.any_dsb_miss"
EVENTS="$EVENTS,frontend_retired.l1i_miss,frontend_retired.itlb_miss"
EVENTS="$EVENTS,frontend_retired.unknown_branch"
EVENTS="$EVENTS,idq.dsb_uops,idq.mite_uops,idq.ms_uops"
EVENTS="$EVENTS,icache_data.stalls,icache_tag.stalls"
EVENTS="$EVENTS,br_misp_retired.cond,br_misp_retired.indirect,br_misp_retired.near_taken"

# Run the same benchmark against both binaries, pinned to one CPU.
for bin in before after; do
  BENCH_CORPUS_FILE=../testcorpus/005kb.webp.js \
    perf stat -e "$EVENTS" -- \
    /tmp/bench.$bin -test.run '^$' \
    -test.bench 'CompressCorpusFile/q=2' \
    -test.benchtime 1000000x -test.cpu 1 -test.count 1
done
event                             before     after
time (1M iters)                   22.65 s    23.10 s
...
🚩 frontend_retired.l1i_miss      9.96 M     28.14 M
🚩 icache_data.stalls (cycles)    54.7 M     135.7 M
L1 instruction cache misses nearly tripled (2.8×). The next step was to localize the source of the icache misses. Of course perf can do it:
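For a rough sense of scale, the counters can be broken down per benchmark iteration. This is a back-of-the-envelope sketch that takes the perf stat numbers at face value and spreads them evenly over the 1M iterations, ignoring process startup; the helper names are made up.

```go
package main

import "fmt"

// ratio reports how many times larger after is than before.
func ratio(before, after float64) float64 { return after / before }

// extraPerIter spreads a counter increase evenly over the iterations.
func extraPerIter(before, after, iters float64) float64 {
	return (after - before) / iters
}

func main() {
	const iters = 1e6 // -test.benchtime 1000000x

	// Counter values copied from the perf stat output in the post.
	fmt.Printf("l1i_miss:      %.2fx, +%.1f misses per iteration\n",
		ratio(9.96e6, 28.14e6), extraPerIter(9.96e6, 28.14e6, iters))
	fmt.Printf("icache stalls: %.2fx, +%.1f cycles per iteration\n",
		ratio(54.7e6, 135.7e6), extraPerIter(54.7e6, 135.7e6, iters))
	fmt.Printf("wall time:     %.3fx\n", ratio(22.65, 23.10))
}
```

That works out to roughly 18 extra L1i misses and 81 extra stall cycles on every pass over the 5KB file.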
for bin in before after; do
  BENCH_CORPUS_FILE=... perf record -F 999 -e cpu_core/frontend_retired.l1i_miss/upp \
    -o /tmp/perf.$bin.data -- /tmp/bench.$bin ...
done
perf report -i /tmp/perf.after.data --stdio...