Optimization Catalog: How 4 bytes of padding make array clearing 49% faster


A surprising alignment quirk I learned the hard way: adding 4 bytes of struct padding makes Go's array clearing 49% faster on Intel, all thanks to REP STOSQ.

June 22, 2026

Part of the Optimization catalog series:

When float division beats integer division

How 4 bytes of padding make array clearing 49% faster (this post)

"Psst, do you want your array clearing function to be 49% faster? Just shift the array by 4 bytes."

Let's take a look at this Go snippet:

const words = 1 << 19

type block struct {
	n    uint32
	data [words]uint32
}

func (b *block) reset() {
	b.data = [words]uint32{}
}

What if I told you that adding 4 dummy bytes between n and data speeds up the reset function by 49% (on Intel machines, at least)? Sounds crazy?
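To make "4 dummy bytes" concrete, here is a small sketch that prints where data lands inside the struct with and without the padding. The paddedBlock type and the dataOffsets helper are my own names for illustration, and the value of words is an assumption (the offsets do not depend on it).

```go
package main

import (
	"fmt"
	"unsafe"
)

const words = 1 << 19 // assumed array length; it does not affect the offsets

// block mirrors the struct from the post: data starts right after n.
type block struct {
	n    uint32
	data [words]uint32
}

// paddedBlock is a hypothetical variant with 4 dummy bytes before data.
type paddedBlock struct {
	n    uint32
	_    [4]byte
	data [words]uint32
}

// dataOffsets returns the byte offset of data in block and in paddedBlock.
func dataOffsets() (plain, padded uintptr) {
	return unsafe.Offsetof(block{}.data), unsafe.Offsetof(paddedBlock{}.data)
}

func main() {
	plain, padded := dataOffsets()
	fmt.Printf("block.data offset:       %d (mod 8 = %d)\n", plain, plain%8)
	fmt.Printf("paddedBlock.data offset: %d (mod 8 = %d)\n", padded, padded%8)
}
```

Without the padding, data sits at offset 4, which is not a multiple of 8. With it, data moves to offset 8.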

Let's benchmark it. And while we're at it, let's try different paddings in 4-byte steps:

type (
	blk00 struct {
		_    [0]byte
		data [words]uint32
	}
	blk04 struct {
		_    [4]byte
		data [words]uint32
	}
	// ...
	blk28 struct {
		_    [28]byte
		data [words]uint32
	}
)

func clear00(b *blk00) { b.data = [words]uint32{} }
// ...
func clear28(b *blk28) { b.data = [words]uint32{} }
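The benchmark harness itself isn't shown here, so the following is only a rough sketch of how two of these variants could be timed from a plain program with testing.Benchmark. The throughput helper, the //go:noinline directives and the value of words are my assumptions, not the original setup, and none of the machine denoising is included.

```go
package main

import (
	"fmt"
	"testing"
)

const words = 1 << 19 // assumed array length: 2 MiB of uint32

type blk00 struct {
	_    [0]byte
	data [words]uint32
}

type blk04 struct {
	_    [4]byte
	data [words]uint32
}

//go:noinline
func clear00(b *blk00) { b.data = [words]uint32{} }

//go:noinline
func clear04(b *blk04) { b.data = [words]uint32{} }

// throughput runs fn under testing.Benchmark and returns bytes per second.
func throughput(fn func()) float64 {
	r := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(words * 4) // bytes cleared per iteration
		for i := 0; i < b.N; i++ {
			fn()
		}
	})
	return float64(r.Bytes) * float64(r.N) / r.T.Seconds()
}

func main() {
	b00, b04 := new(blk00), new(blk04)
	fmt.Printf("off=00: %.2f GiB/s\n", throughput(func() { clear00(b00) })/(1<<30))
	fmt.Printf("off=04: %.2f GiB/s\n", throughput(func() { clear04(b04) })/(1<<30))
}
```

On a noisy machine the absolute numbers will wander, so treat the output as a quick sanity check rather than a measurement.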

So what do we have here? (A note on benchmark methodology: I denoised my machine before running the benchmarks. Pinned core, turbo off, fixed uncore frequency, SMT sibling offline, THP disabled. Here's exactly how.)

goos: linux
goarch: amd64
pkg: clearalign
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                      │ bench.intel.txt │
                      │       B/s       │
Clear/off=00/mod8=0       28.22Gi ± 0%
Clear/off=04/mod8=4       18.94Gi ± 0%
Clear/off=08/mod8=0       28.22Gi ± 0%
Clear/off=12/mod8=4       18.94Gi ± 0%
Clear/off=16/mod8=0       28.22Gi ± 0%
Clear/off=20/mod8=4       18.94Gi ± 0%
Clear/off=24/mod8=0       28.22Gi ± 0%
Clear/off=28/mod8=4       18.93Gi ± 0%

Do you see the zebra pattern? Every padding divisible by 8 gives us much higher throughput.

Do we see the same behavior on AMD chips? Yes! But the effect is milder, only ~9% here (I didn't investigate why, maybe AMD is just better 🤪).

goos: linux
goarch: amd64
pkg: clearalign
cpu: AMD Ryzen 5 7535HS with Radeon Graphics
                      │ bench.amd.txt │
                      │      B/s      │
Clear/off=00/mod8=0      62.78Gi ± 0%
Clear/off=04/mod8=4      57.67Gi ± 0%
Clear/off=08/mod8=0      62.74Gi ± 0%
Clear/off=12/mod8=4      57.61Gi ± 0%
Clear/off=16/mod8=0      62.74Gi ± 0%
Clear/off=20/mod8=4      57.65Gi ± 0%
Clear/off=24/mod8=0      62.73Gi ± 0%
Clear/off=28/mod8=4      57.57Gi ± 0%
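For the record, the 49% and ~9% figures are just the ratio of aligned to misaligned throughput from the first two rows of each table. A tiny sketch of that arithmetic (speedupPercent is a name I made up for illustration):

```go
package main

import "fmt"

// speedupPercent reports how much faster `fast` is than `slow`, in percent.
func speedupPercent(fast, slow float64) float64 {
	return (fast/slow - 1) * 100
}

func main() {
	// Throughput in GiB/s for off=00 vs off=04, taken from the tables above.
	fmt.Printf("Intel: %.1f%%\n", speedupPercent(28.22, 18.94))
	fmt.Printf("AMD:   %.1f%%\n", speedupPercent(62.78, 57.67))
}
```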

I even rented an AWS c6g.large instance to test whether this reproduces on ARM chips. It doesn't.

goos: linux
goarch: arm64
pkg: clearalign
                      │ bench.arm.txt │
                      │      B/s      │
Clear/off=00/mod8=0      34.25Gi ± 0%
Clear/off=04/mod8=4      34.18Gi ± 0%
Clear/off=08/mod8=0      34.15Gi ± 0%
Clear/off=12/mod8=4      34.19Gi ± 0%
Clear/off=16/mod8=0      34.25Gi ± 0%
Clear/off=20/mod8=4      34.17Gi ± 0%
Clear/off=24/mod8=0      34.19Gi ± 0%
Clear/off=28/mod8=4      34.16Gi ± 0%
geomean                  34.19Gi

What on earth is going on, Intel? Let's check the assembly. The most important line there is this:

REP; STOSQ AX, ES:0(DI)

The whole clear comes down to a single instruction: REP (repeat) STOSQ over the given address range. What's so special about REP STOSQ, and why does it suck when the destination is not 8-byte aligned? Intel's Optimization Reference says:

3.7.6.4 Memset Considerations

When the destination buffer is misaligned, memset() performance using Enhanced REP MOVSB and STOSB can degrade about 20% relative to aligned case, for processors based on Ivy Bridge microarchitecture.

Nice! We've got confirmation that misalignment causes performance degradation. But what is "Enhanced REP MOVSB and STOSB" (aka ERMSB)? As far as I understand it, ERMSB is the special handling Intel added to make REP STOSB fast in the first place. (Intel's doc says STOSB, but as we'll see, it applies to STOSQ too.)

So how exactly does Intel enhance REP STOSB? Well, Intel's Optimization Reference doesn't spell it out (God, that thing is so hard to read). But I think I have a good guess about what happens behind the scenes.

This whole topic originated when I was investigating a regression in my brotli library. Back then I dug into the perf counters and saw that the clear that was not 8-byte aligned showed +139% L2 read-for-ownership (RFO) requests and +65% cycles compared to the aligned one. So my guess is that the enhancement mostly consists of writing whole cache lines without reading them first. This, I think, explains why breaking 8-byte alignment hurts: the no-RFO path works on whole cache lines, but with a 4-byte offset one of the 8-byte stores straddles every 64-byte boundary, so no line is ever cleanly covered and the fast path can't kick in.
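The straddling claim is easy to check with plain arithmetic. The sketch below is only a toy model of the store addresses, assuming 8-byte stores and 64-byte cache lines. It says nothing about what the microcode really does, and straddlingStores is a hypothetical helper.

```go
package main

import "fmt"

const (
	storeSize = 8  // bytes written per STOSQ iteration
	lineSize  = 64 // cache line size assumed in this model
)

// straddlingStores walks the 8-byte stores needed to clear n bytes starting at
// byte address start and counts how many of them cross a cache-line boundary.
func straddlingStores(start, n uint64) (straddling, total uint64) {
	for addr := start; addr+storeSize <= start+n; addr += storeSize {
		total++
		// A store crosses a boundary if its first and last byte
		// fall into different cache lines.
		if addr/lineSize != (addr+storeSize-1)/lineSize {
			straddling++
		}
	}
	return straddling, total
}

func main() {
	const n = 1 << 12 // clear 4 KiB, i.e. 64 cache lines' worth of bytes
	for _, off := range []uint64{0, 4, 8, 12} {
		s, t := straddlingStores(off, n)
		fmt.Printf("offset %2d: %d of %d stores cross a line boundary\n", off, s, t)
	}
}
```

With an offset that is a multiple of 8, no store crosses a boundary. With an offset of 4 modulo 8, exactly one store per 64 bytes does, which matches the picture above.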

One more thing about alignment. The padding controls the offset within the struct, but what about the address of the struct itself? What if it lands on an address that is not 8-byte aligned? Turns out that can't happen: the Go allocator makes sure a struct this large is always page-aligned. I verified it with this targeted test.
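The targeted test isn't reproduced here, but a check along these lines is easy to sketch. The addrs helper, the sink variable and the value of words are my own assumptions for illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

const words = 1 << 19 // assumed array length

type blk04 struct {
	_    [4]byte
	data [words]uint32
}

// sink keeps the allocation reachable so it has to live on the heap.
var sink *blk04

// addrs allocates a blk04 and returns its address and the address of data.
func addrs() (base, data uintptr) {
	sink = new(blk04)
	return uintptr(unsafe.Pointer(sink)), uintptr(unsafe.Pointer(&sink.data))
}

func main() {
	base, data := addrs()
	fmt.Printf("struct at %#x (mod 8 = %d, mod 4096 = %d)\n", base, base%8, base%4096)
	fmt.Printf("data   at %#x (mod 8 = %d)\n", data, data%8)
}
```

If the base address is a multiple of 8, the alignment of data modulo 8 is decided entirely by the padding, which is what the benchmarks above rely on.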

STOSB vs STOSQ

STOSB (B=Byte, clears one byte at a...
