Memory Location Matters for Performance

tosh1 pts0 comments

Memory location matters for performance

Memory location matters for performance

by Itamar Turner-Trauring<br>Last updated 22 Nov 2023, originally created 13 Jan 2022

If you’re writing high-performance code, what matters is how many CPU instructions you run.<br>Right?

That’s what I learned long ago in school, anyway, when I took algorithms and data structures classes.<br>The basic assumption was that reading or writing to memory took a small, constant amount of time, and so you could focus on how many operations were being done.<br>This was so fundamental to how we were taught that we didn’t even have to write real code: pseudo-code was sufficient.

Unfortunately, this is not how a CPU works in the real world.

In the real world, access to memory is vastly slower than CPU instructions, and all those O(N) measurements that just focus on CPU operations are insufficient to tell you what real-world performance will be.<br>And that means you need to think about not just your computation, but also how you access memory.

An updated version of this article is included in a book I’m working on that will help teach you how to optimize low-level code, the kind of code you’d write with Cython, C, or Rust.<br>The goal is to help data scientists and scientists who normally write Python to understand how to make their compiled code faster.

Same number of operations, different performance

Let’s see the impact of memory access on performance.<br>I’ll be using Rust for my example code because a low-level language maps more closely to what the CPU is doing.<br>Writing x = y + 2 in Rust will often translate directly to the relevant CPU instructions of adding 2 to a value and then shoving the result somewhere.<br>That same code in Python will involve vastly more work.

Consider the following program; all it does is increment values in a contiguous chunk of memory:

use std::env::args;

fn main() {<br>// Get the first command-line argument:<br>let scan_type = args().nth(1).expect(<br>"Must provide scan type argument.");

// Get the second argument, convert to an integer:<br>let vector_size = args().nth(2).expect(<br>"Must provide array size argument.");<br>let vector_size = vector_size.parse::usize>().expect(<br>"Not an integer.");

// Figure out an appropriate multiplier:<br>let multiplier = match &*scan_type {<br>"linear" => 1,<br>"random" => 48271,<br>_ => panic!("Unknown scan type."),<br>};

// Create 4 vectors (for our purposes, an array in<br>// memory) of size vector_size, filled with zeros.<br>let mut data = vec![0; vector_size];<br>let mut data2 = vec![0; vector_size];<br>let mut data3 = vec![0; vector_size];<br>let mut data4 = vec![0; vector_size];

// Update values in the vectors. A multiplier of 1<br>// results in a linear memory scan, whereas 48271<br>// will result in jumping around memory "randomly":<br>let mut index = 0;<br>for _ in 0..100_000_000 {<br>data[index] += 1;<br>data2[index] += 1;<br>data3[index] += 1;<br>data4[index] += 1;<br>index = (multiplier * index + 1) % vector_size;

Now, there are two ways we scan through memory with this program:

Linearly, from start to finish.

Randomly bouncing around.

In either case, we’ll be doing the same number of operations.<br>So per the model I learned in college, both variations should take the same amount of time because they’re doing the same amount of work.<br>Let’s try it out:

$ time ./memory-cache linear 1000000

real 0m0.936s<br>user 0m0.909s<br>sys 0m0.022s<br>$ time ./memory-cache random 1000000

real 0m4.943s<br>user 0m4.902s<br>sys 0m0.014s

Bouncing around memory at random is a lot slower than a linear scan of memory—but why?

Slow memory and the CPU cache hierarchy

It turns out that from a CPU’s perspective, reading from RAM is quite slow.<br>To speed things up, CPU designers provide a series of caches to reduce the need to access RAM.

To simplify dramatically, here’s how the CPU talks to RAM:

l1i

L1 cache (instructions)

l2

L2 cache

l2 -->

l1i->l2

l1d

L1 cache (data)

l2 -->

l1d->l2

l3

L3 cache

l3 -->

l2->l3

RAM

RAM

RAM -->

l3->RAM

CPU

CPU

l1i -->

CPU->l1i

l1d -->

CPU->l1d

From fastest to slowest:

The L1 caches are very fastest and very small, and might be per-CPU or per-core.<br>There’s one for instructions (i.e. the code you’re running) and one for the data your program is using.

The L2 cache is bigger but slower, and can store both instructions and data.

The L3 cache is even bigger and even slower, and can store both instructions and data.

Finally, there’s RAM, which is even slower to access.

My CPU, for example, has 4 cores with:

32KB L1 data cache and 32KB L1 instruction per core.

256KB L2 cache per core.

8MB L3 cache shared by all the cores.

Caches and memory access performance

Given we’re only accessing values infrequently in both cases, linear scan and random access, why is linear scan so much faster?

Cache lines

Memory is loaded into the caches in chunks known as “cache lines”, typically 64 bytes at a time.<br>So when we access the first value of the array, it gets loaded into the L1 cache—but so do the 2nd, 3rd, 4th, etc. values...

memory cache code data vector_size performance

Related Articles