Regression in untouched code: debugging an L1 I-cache associativity issue

A regression in code I didn't touch

A deep dive into L1 instruction cache set conflicts, associativity, and code alignment in Go.

May 19, 2026

In my previous post I briefly touched on how merely shifting code by a couple of bytes can significantly affect hot-path performance.

CPUs are weird. They don't just take instructions and run them in order. There are caches, branch predictors, prefetchers, yada yada, and all of it is sensitive to where exactly your code sits in memory. The same hot loop at one address can be a few percent slower at another, just because it crossed some invisible boundary somewhere.

Every cache in and around the CPU is a potential source of unexpected performance regressions (or gains) caused by code alignment changes. The hero of this post is the L1 icache, the fastest CPU cache, which stores instructions. On my machine (Intel i5-12500) it's 32KB, 8-way set associative: 64 sets × 8 ways × 64-byte cache lines. Those numbers matter for the story.
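To make those numbers concrete, here is a minimal sketch of how an address maps to a cache set under this geometry. The constants come from the paragraph above; the two addresses are made up for illustration, and the model ignores the virtual vs. physical address distinction (with 64 sets of 64-byte lines the index bits fall inside a 4KB page offset, so it should not matter here).

```go
package main

import "fmt"

// L1i geometry as described in the post for the i5-12500.
const (
	lineSize = 64 // bytes per cache line
	numSets  = 64 // sets in the cache
	numWays  = 8  // lines that can live in one set at the same time
)

// setIndex returns the set an address maps to: drop the 6 offset bits,
// then keep the next 6 bits.
func setIndex(addr uint64) uint64 {
	return (addr / lineSize) % numSets
}

func main() {
	fmt.Println("capacity:", lineSize*numSets*numWays) // 32768 bytes = 32KB

	// Hypothetical code addresses exactly 4096 bytes (64 sets × 64 bytes) apart.
	a, b := uint64(0x401240), uint64(0x402240)
	fmt.Println(setIndex(a), setIndex(b)) // both map to set 9
}
```

Any two cache lines whose addresses differ by a multiple of 4096 bytes compete for the same 8 ways, no matter how empty the rest of the cache is.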

In this post I want to share an anecdote: I spent a couple of hours investigating why a change in one piece of code caused a performance regression in a completely unrelated part of the codebase. The root cause turned out to be, surprisingly, L1i conflict misses caused by limited cache associativity.

The Phantom Regression

I was working on improving the compression speed of quality level 2 in go-brrr, my Go port of Brotli.

go version go1.26.2-X:nodwarf5 linux/amd64
goos: linux
goarch: amd64
pkg: github.com/molecule-man/go-brrr
cpu: 12th Gen Intel(R) Core(TM) i5-12500

                 │ /tmp/before.txt │           /tmp/after.txt            │
                 │       B/s       │     B/s       vs base               │
830kb.so.css        297.8Mi ± 0%     304.0Mi ± 0%  +2.08% (p=0.000 n=21)
005kb.webp.js       126.8Mi ± 1%     122.7Mi ± 0%  -3.24% (p=0.000 n=21)
011kb.quer.json     348.5Mi ± 0%     344.8Mi ± 0%  -1.08% (p=0.000 n=21)

Compression of large files got faster, as expected. However, performance on small files regressed by up to 3%, which was completely unexpected: my change touched hash2.go (it doesn't matter what it is), but small files under 64KB are always compressed by hash2u16.go.

By this point I was pretty accustomed to such regressions and would normally have happily skipped the investigation. This time I decided to dig deeper, because a 3% regression was larger than the usual 2% alignment-shift-induced regression typical for this machine.

416 Bytes of Trouble

Since it was pretty clear that the regression was caused by an alignment change, the first thing I did was calculate how much things had shifted in the assembly. The only function I changed was createBackwardReferences in hash2.go, so I dumped the assembly for this function before and after:

go tool objdump -s '\(\*h2\)\.createBackwardReferences' /tmp/bench.before
go tool objdump -s '\(\*h2\)\.createBackwardReferences' /tmp/bench.after

Checking the instruction addresses in the assembly showed that my change shrank the function by 402 bytes. The Go compiler aligns all functions to 32 bytes, meaning the first instruction of any function always starts at an address that is divisible by 32. So the actual downstream shift must be a multiple of 32, and objdump showed it was 416B (13 × 32). The h2u16 code comes AFTER the changed function in the binary, so this is exactly what shifted the machine code of the regressing path.

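The fact that the whole h2u16 hot path shifted by a multiple of 32 bytes already hinted that the root cause was the 64B-aligned instruction cache and not, for example, Intel's 32B-aligned DSB cache. A shift by a multiple of 32 leaves every instruction at the same offset within its 32-byte window, while 416 is not a multiple of 64, so positions relative to cache-line boundaries did change. I didn't realize this back then and continued the investigation.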

Perf-action

Note to myself: thank god there is perf, use it more often.

Next, I took perf off the shelf and interrogated it until the picture became crystal clear:

The perf command:

EVENTS='cycles,instructions,branches,branch-misses,'\
'baclears.any,dsb2mite_switches.penalty_cycles,'\
'frontend_retired.dsb_miss,frontend_retired.any_dsb_miss,'\
'frontend_retired.l1i_miss,frontend_retired.itlb_miss,'\
'frontend_retired.unknown_branch,'\
'idq.dsb_uops,idq.mite_uops,idq.ms_uops,'\
'icache_data.stalls,icache_tag.stalls,'\
'br_misp_retired.cond,br_misp_retired.indirect,br_misp_retired.near_taken'

for bin in before after; do
  BENCH_CORPUS_FILE=../testcorpus/005kb.webp.js \
    perf stat -e "$EVENTS" -- \
    /tmp/bench.$bin -test.run '^$' \
    -test.bench 'CompressCorpusFile/q=2' \
    -test.benchtime 1000000x -test.cpu 1 -test.count 1
done

event                            before     after
time (1M iters)                  22.65 s    23.10 s
...
🚩 frontend_retired.l1i_miss     9.96 M     28.14 M
🚩 icache_data.stalls (cycles)   54.7 M     135.7 M

L1 instruction cache misses nearly tripled (2.8×). The next step was to localize the source of icache misses. Of course perf can do it:

The perf command:

for bin in before after; do
  BENCH_CORPUS_FILE=... perf record -F 999 -e cpu_core/frontend_retired.l1i_miss/upp \
    -o /tmp/perf.$bin.data -- /tmp/bench.$bin ...
done

perf report -i /tmp/perf.after.data --stdio...
