Optimization Catalog: How 4 bytes of padding make array clearing 49% faster



A surprising alignment quirk I learned the hard way: adding 4 bytes of struct padding makes Go's array clearing 49% faster on Intel, all thanks to REP STOSQ.

June 22, 2026

Part of the Optimization catalog series:

When float division beats integer division

How 4 bytes of padding make array clearing 49% faster (this post)

"Psst, do you want your array clearing function to be 49% faster? Just shift the array by 4 bytes."

Let's take a look at this Go snippet:

const words = 1 << 19

type block struct {
	n    uint32
	data [words]uint32
}

func (b *block) reset() {
	b.data = [words]uint32{}
}

What if I told you that adding 4 dummy bytes between n and data speeds up the reset function by 49% (on Intel machines, at least)? Sounds crazy?
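To make the layout change concrete, here is a minimal sketch that prints where data starts in each variant. The paddedBlock type and the dataOffsets helper are hypothetical names used only for illustration, and the array length is an assumption.

```go
package main

import (
	"fmt"
	"unsafe"
)

// words is an assumed array length for this sketch.
const words = 1 << 19

// block mirrors the struct from the post: data starts right after n.
type block struct {
	n    uint32
	data [words]uint32
}

// paddedBlock is a hypothetical variant with 4 dummy bytes before data.
type paddedBlock struct {
	n    uint32
	_    [4]byte
	data [words]uint32
}

// dataOffsets returns the byte offset of the data field in each struct.
// unsafe.Offsetof is evaluated at compile time, so the nil pointers are
// never dereferenced.
func dataOffsets() (plain, padded uintptr) {
	var b *block
	var p *paddedBlock
	return unsafe.Offsetof(b.data), unsafe.Offsetof(p.data)
}

func main() {
	plain, padded := dataOffsets()
	fmt.Printf("block.data:       offset %d, mod 8 = %d\n", plain, plain%8)
	fmt.Printf("paddedBlock.data: offset %d, mod 8 = %d\n", padded, padded%8)
}
```

Without the padding, data sits at offset 4 (4 mod 8); with it, data moves to offset 8 (0 mod 8).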

Let's benchmark it. In fact, let's try several paddings, in 4-byte steps:

type (
	blk00 struct {
		_    [0]byte
		data [words]uint32
	}
	blk04 struct {
		_    [4]byte
		data [words]uint32
	}
	// ...
	blk28 struct {
		_    [28]byte
		data [words]uint32
	}
)

func clear00(b *blk00) { b.data = [words]uint32{} }
// ...
func clear28(b *blk28) { b.data = [words]uint32{} }
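The harness itself isn't shown here, so the following is only a sketch of how one of these cases could be timed. It assumes the array length, uses testing.Benchmark from a plain program instead of go test, and introduces hypothetical helpers (benchName, benchClear04). SetBytes is what makes the result report throughput rather than just ns/op.

```go
package main

import (
	"fmt"
	"testing"
)

// words is an assumed array length for this sketch.
const words = 1 << 19

type blk04 struct {
	_    [4]byte
	data [words]uint32
}

// clear04 is kept out of line so the benchmark loop measures the call
// as a unit.
//
//go:noinline
func clear04(b *blk04) { b.data = [words]uint32{} }

// benchName is a hypothetical helper that builds a label in the same
// shape as the sub-benchmark names in the results.
func benchName(off int) string {
	return fmt.Sprintf("Clear/off=%02d/mod8=%d", off, off%8)
}

// benchClear04 clears the same block b.N times and reports bytes per op.
func benchClear04(b *testing.B) {
	blk := new(blk04)
	b.SetBytes(int64(len(blk.data) * 4)) // 4 bytes per uint32
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		clear04(blk)
	}
}

func main() {
	res := testing.Benchmark(benchClear04)
	fmt.Println(benchName(4), res)
}
```

A real run would repeat this for every padding and feed the output to benchstat; this sketch covers a single offset.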

So what do we have here? (A note on benchmark methodology: I denoised my machine before running the benchmarks. That means a pinned core, turbo off, fixed uncore frequency, SMT sibling offline, and THP disabled. Here's exactly how.)

goos: linux
goarch: amd64
pkg: clearalign
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                    │ bench.intel.txt │
                    │       B/s       │
Clear/off=00/mod8=0     28.22Gi ± 0%
Clear/off=04/mod8=4     18.94Gi ± 0%
Clear/off=08/mod8=0     28.22Gi ± 0%
Clear/off=12/mod8=4     18.94Gi ± 0%
Clear/off=16/mod8=0     28.22Gi ± 0%
Clear/off=20/mod8=4     18.94Gi ± 0%
Clear/off=24/mod8=0     28.22Gi ± 0%
Clear/off=28/mod8=4     18.93Gi ± 0%

Do you see the zebra pattern? Every padding divisible by 8 gives us much higher throughput.

Do we see the same behavior on AMD chips? Yes! But the effect is milder, only ~9% here (I didn't investigate why; maybe AMD is just better 🤪).

goos: linux
goarch: amd64
pkg: clearalign
cpu: AMD Ryzen 5 7535HS with Radeon Graphics
                    │ bench.amd.txt │
                    │      B/s      │
Clear/off=00/mod8=0    62.78Gi ± 0%
Clear/off=04/mod8=4    57.67Gi ± 0%
Clear/off=08/mod8=0    62.74Gi ± 0%
Clear/off=12/mod8=4    57.61Gi ± 0%
Clear/off=16/mod8=0    62.74Gi ± 0%
Clear/off=20/mod8=4    57.65Gi ± 0%
Clear/off=24/mod8=0    62.73Gi ± 0%
Clear/off=28/mod8=4    57.57Gi ± 0%
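The 49% and ~9% figures follow directly from the throughput columns. As a quick check, this sketch divides the first aligned row by the first misaligned row of each table; speedupPercent is a hypothetical helper name.

```go
package main

import "fmt"

// speedupPercent returns how much faster `fast` is than `slow`, in percent.
func speedupPercent(fast, slow float64) float64 {
	return (fast/slow - 1) * 100
}

func main() {
	// Throughput in GiB/s, taken from the off=00 and off=04 rows.
	fmt.Printf("Intel: %.1f%%\n", speedupPercent(28.22, 18.94)) // Intel: 49.0%
	fmt.Printf("AMD:   %.1f%%\n", speedupPercent(62.78, 57.67)) // AMD:   8.9%
}
```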

I even rented an AWS c6g.large instance to test whether this reproduces on ARM chips. It doesn't.

ARM results:

goos: linux
goarch: arm64
pkg: clearalign
                    │ bench.arm.txt │
                    │      B/s      │
Clear/off=00/mod8=0    34.25Gi ± 0%
Clear/off=04/mod8=4    34.18Gi ± 0%
Clear/off=08/mod8=0    34.15Gi ± 0%
Clear/off=12/mod8=4    34.19Gi ± 0%
Clear/off=16/mod8=0    34.25Gi ± 0%
Clear/off=20/mod8=4    34.17Gi ± 0%
Clear/off=24/mod8=0    34.19Gi ± 0%
Clear/off=28/mod8=4    34.16Gi ± 0%
geomean                34.19Gi

What on earth is going on, Intel? Let's check the assembly. The most important line there is this:

REP; STOSQ AX, ES:0(DI)

The whole clearing is governed by a single instruction: REP (repeat) STOSQ on the provided address range. What's so special about REP STOSQ, and why does it suck when the destination isn't 8-byte aligned? Intel's Optimization Reference says:

3.7.6.4 Memset Considerations

When the destination buffer is misaligned, memset() performance using Enhanced REP MOVSB and STOSB can degrade about 20% relative to aligned case, for processors based on Ivy Bridge microarchitecture.

Nice! We've got confirmation that misalignment causes performance degradation. But what is "Enhanced REP MOVSB and STOSB" (aka ERMSB)? As far as I understand it, ERMSB is the special handling Intel added to make REP STOSB fast in the first place. (Intel's doc says STOSB, but as we'll see, it applies to STOSQ too.)

So how exactly does Intel enhance REP STOSB? Well, Intel's Optimization Reference doesn't spell it out (God, that thing is so hard to read). But I think I have a good guess about what happens behind the scenes.

This whole topic originated when I was investigating a regression in my brotli library. Back then I dug into the perf counters and saw that the clear that wasn't 8-byte aligned showed +139% L2 read-for-ownership (RFO) and +65% cycles vs the aligned one. So my guess is that the enhancement mostly consists of writing whole cache lines without reading them first. This, I think, explains why breaking 8-byte alignment hurts: the no-RFO path works on whole cache lines, but with a 4-byte offset one of the 8-byte stores straddles every 64-byte boundary, so no line is ever cleanly covered and the fast path can't kick in.
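The boundary-crossing part of that guess is simple arithmetic, and a toy model shows it. The sketch below treats the clear as a sequence of discrete 8-byte stores and counts how many straddle a 64-byte line. That is a simplification for illustration, not a description of what the microcode actually does, and crossingStores is a hypothetical helper.

```go
package main

import "fmt"

const (
	lineSize  = 64 // cache line size in bytes
	storeSize = 8  // one STOSQ iteration writes 8 bytes
)

// crossingStores counts how many of n consecutive 8-byte stores, starting
// at byte address start, straddle a 64-byte cache-line boundary.
func crossingStores(start, n uint64) int {
	count := 0
	for i := uint64(0); i < n; i++ {
		first := start + i*storeSize
		last := first + storeSize - 1
		if first/lineSize != last/lineSize {
			count++
		}
	}
	return count
}

func main() {
	const n = 1024 // 1024 stores = 8 KiB = 128 cache lines
	for _, off := range []uint64{0, 4, 8, 12} {
		fmt.Printf("offset %2d: %d of %d stores cross a line\n",
			off, crossingStores(off, n), n)
	}
}
```

In this model, offsets 0 and 8 give zero crossings, while offsets 4 and 12 give one crossing per cache line (128 of the 1024 stores).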

One more thing about the alignment. The padding controls the offset within the struct, but what about the address of the struct itself? What if it lands on an address that is not 8-byte-aligned? Turns out that can't happen: the Go allocator makes sure it's always page-aligned. I verified it with this targeted test.
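A check along these lines can be sketched as follows. This is not the targeted test mentioned above, only an illustration: the array length is assumed, and sink and addrMod are hypothetical names. Storing the pointer in a package-level variable forces the block onto the heap.

```go
package main

import (
	"fmt"
	"unsafe"
)

// words is an assumed array length for this sketch.
const words = 1 << 19

type block struct {
	n    uint32
	data [words]uint32
}

// sink keeps each allocation reachable from a global, so it escapes to
// the heap instead of living on the stack.
var sink *block

// addrMod reports the address of b modulo m.
func addrMod(b *block, m uintptr) uintptr {
	return uintptr(unsafe.Pointer(b)) % m
}

func main() {
	for i := 0; i < 4; i++ {
		sink = new(block)
		fmt.Printf("alloc %d: addr mod 8 = %d, addr mod 4096 = %d\n",
			i, addrMod(sink, 8), addrMod(sink, 4096))
	}
}
```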

STOSB vs STOSQ

STOSB (B=Byte, clears one byte at a...
