Every Byte Matters

setheron1 pts0 comments

Every byte matters | Farid Zakaria’s Blog

Every byte matters

Published 2026-06-01 on<br>Farid Zakaria's Blog

I have spent a large portion of my career working in Java. In that time, you get used to huge classes. New functionality? Just add a new method and field to the class.<br>The cost of each new field is rarely considered. Performance is often considered from a classic computer science perspective by considering asymptotic analysis of the algorithms and data structures in-use.

Turns out that even within a growth scale for your algorithm, such a simple for-loop O(N), time can vary dramatically if we have a little deeper understanding of the underlying hardware.

First, let’s understand our current machine. Let’s take a peek at our cache line and page sizes.

$ lscpu | grep -i cache<br>L1d cache: 352 KiB (10 instances)<br>L1i cache: 640 KiB (10 instances)<br>L2 cache: 10 MiB (5 instances)<br>L3 cache: 12 MiB (1 instance)

$ getconf LEVEL1_DCACHE_LINESIZE<br>64

The instances number is a reflection of how the caches are shared amongst CPUs. If I had 10 CPUs, each one has their own L1d cache, whereas two of them would share an L2 cache.

Our cache line size is 64 bytes .

┌─────────────────────────────────────────────┐<br>│ 64 bytes │<br>│ byte 0 byte 1 byte 2 ... byte 63 │<br>└─────────────────────────────────────────────┘

When you read a single byte from memory, the hardware will fill the surrounding 64 bytes into the cache line.<br>The idea being that data is often temporal and spatially located, meaning data is often accessed near each other and close in time to each other.

We can reference Jeff Dean’s famous “Latency numbers every programmer should know”, however a quick recap with the values from our particular machine is the following:

┌──────────────────────────────────────────────────────────────┐<br>│ CPU Core │<br>│ ┌───────────┐ │<br>│ │ Registers │

The sizes for each cache, is the number returned by lscpu divided by the number of cores or instances; i.e. 352 KiB ÷ 10 instances = ~35 KiB.<br>We then determine the number of cache lines by dividing this number by 64; i.e. 35 KiB ÷ 64 bytes = 560 cache lines.

How does this all matter ? 🤔

Let’s consider an example where we want to iterate over a single struct Monster and pull out the boolean is_alive to filter them.<br>We create our struct, and in this particular example we need 64 bytes to represent a single Monster.

struct Monster {<br>uint32_t id; // 4 bytes<br>float x, y, z; // 12 bytes<br>float vx, vy, vz; // 12 bytes<br>int32_t hp; // 4 bytes<br>int32_t attack; // 4 bytes<br>int32_t defense; // 4 bytes<br>uint8_t is_alive; // 1 byte<br>uint8_t team; // 1 byte<br>char name[22]; // 22 bytes<br>}; // total: 64 bytes

If we had an array of Monsters and we iterate over them, the cache line would fill up like so.<br>Each cache line would fill with a single monster, and we would fetch only the is_alive byte.

This is often referred to as “Array of Structs”.

cache line 0 cache line 1<br>┌──────────────────────────────┐ ┌──────────────────────────────┐<br>│ id0 x0 y0 z0 vx0 vy0 vz0 hp0 │ │ id1 x1 y1 z1 vx1 vy1 vz1 hp1 │<br>│ atk0 def0 alive0 team0 name0 │ │ atk1 def1 alive1 team1 name1 │<br>│ ▲ │ │ ▲ │<br>└─────────────┼────────────────┘ └────────────┼─────────────────┘<br>│ │<br>need this need this

If we instead normalize the data such that each field is in it’s own list, we can pack the cache lines much tighter.

cache line 0<br>┌───────────────────────────────────────────────────────────────┐<br>│alive0 alive1 alive2 alive3 alive4 alive5 ... alive62 alive63 │<br>│ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ │<br>└──┼──────┼──────┼──────┼──────┼──────┼──────────┼───────┼──────┘<br>│ │ │ │ │ │ │ │<br>└──────┴──────┴──────┴──────┴──────┴──────────┴───────┘<br>all 64 in one fetch

// SoA layout<br>struct Monsters {<br>uint32_t *ids;<br>float *xs, *ys, *zs;<br>float *vxs, *vys, *vzs;<br>int32_t *hps;<br>int32_t *attacks;<br>int32_t *defenses;<br>uint8_t *is_alives; // packed contiguously<br>uint8_t *teams;<br>char (*names)[22];<br>};

This type of layout is referred to as “Struct of Arrays”.

How much of an impact can this have?

We can observe up to 30x improvements when the Monster struct is 1KiB 🤯

The delta is less observable when the struct is small because multiple Monster structs can still be fetched within a single cache-line.

This data access is incredibly hot though. Your CPU pre-fetcher knows it’s going sequentially and fetches the next cache line before you need it. You never actually have to wait for the memory to be fetched.

What about random access patterns?

Not all access patterns are sequential. Hash maps, trees, graph traversal, and pointer-heavy data structures jump to unpredictable locations. The CPU can’t prefetch what it can’t predict. With random access, the CPU needs the entire array to be present in the cache in order to avoid stalls due to memory lookup.

This means the total size of your collection determines your performance tier.

Monsters<br>Working Set (64B)<br>Latency (64B)<br>Working Set (128B)<br>Latency (128B)

512<br>32 KiB<br>~3 ns<br>64 KiB<br>~11 ns

4,096<br>256 KiB<br>~11 ns<br>512 KiB<br>~13 ns

32,768<br>2...

cache bytes byte line struct data

Related Articles