Data Access Patterns That Makes Your CPU Really Angry | weineng30% worse than a randomized access pattern.">30% worse than a randomized access pattern.">30% worse than a randomized access pattern.">Given an array of data, what is the slowest way to sum up the integers? Is it adding the numbers from left to right, adding them randomly, or doing something else? In this post, we are going to build a data access pattern from the ground up that sums numbers as slowly as possible by exploiting memory pitfalls.<br>uint32_t* data = ...;
// sequential<br>data[0] + data[1] + data[2] + ...<br>// random<br>data[67] + data[69420] + data[42] + ...<br>// the slowest<br>data[A] + data[B] + data[C] + ...<br>Spoiler: You can do >30% worse than a randomized access pattern.
Some rules:<br>Only consider the time taken to run the accumulation function. The time taken to create positions isn’t included.<br>The accumulation function is fixed as follows, and data is filled randomly and we are only allowed to change the contents of positions:<br>constexpr int ELEMENT_COUNT = (1 16) * (PAGE_SIZE / sizeof(uint32_t)); // 2^26
/* data contains integers that we want to sum up<br>* positions is the access pattern that we use to sum up the integers<br>* Overflow is expected, but we don’t really care about the actual sum, do we?<br>*/<br>uint32_t accumulator(uint32_t const* data, uint32_t const* positions) {<br>uint32_t total = 0;<br>for (uint32_t i = 0; i ELEMENT_COUNT; ++i) {<br>uint32_t pos = positions[i];<br>total += data[pos];<br>return total;<br>Measurements of accumulator durations are based on using rdtsc cycle count.<br>Some additional notes:<br>There are 2^26 integers: using 65536 pages, and each page contains 1024 integers. These numbers are chosen simply so it doesn’t take too long on my machine to run.<br>Huge pages are disabled.<br>All measurements are based on my machine:<br>Reveal machine specs<br>Hide code❯ lscpu<br>Architecture: x86_64<br>CPU op-mode(s): 32-bit, 64-bit<br>Address sizes: 42 bits physical, 48 bits virtual<br>Byte Order: Little Endian<br>CPU(s): 8<br>On-line CPU(s) list: 0-7<br>Vendor ID: GenuineIntel<br>Model name: Intel® Core™ Ultra 7 268V<br>CPU family: 6<br>Model: 189<br>Thread(s) per core: 1<br>Core(s) per socket: 8<br>Socket(s): 1<br>Stepping: 1<br>CPU(s) scaling MHz: 50%<br>CPU max MHz: 2200.0000<br>CPU min MHz: 400.0000<br>BogoMIPS: 6604.80<br>Flags: …<br>Virtualization features:<br>Virtualization: VT-x<br>Caches (sum of all):<br>L1d: 320 KiB (8 instances)<br>L1i: 512 KiB (8 instances)<br>L2: 14 MiB (5 instances)<br>L3: 12 MiB (1 instance)<br>NUMA:<br>Vulnerabilities:
The full code can be found here, run with g++ -std=c++2a -O3 slowest.cc && taskset -c 3 sudo ./a.out. I highly recommend opening up slowest.cc and running the code.<br>Our job is to find the permutation of positions that yields the slowest possible timing. Let’s begin with the simplest access pattern:<br>void linear(uint32_t const* data, uint32_t* positions) {<br>for (uint32_t i = 0; i ELEMENT_COUNT; ++i) {<br>positions[i] = i;<br>This is likely the fastest permutation, taking 133M (132752394) cycles. This is expected, since CPUs are heavily optimised for sequential accesses.<br>On the other hand, we could randomise the permutation of positions.<br>void fisher_yates_shuffle(uint32_t const* data, uint32_t* positions) {<br>linear(data, positions);<br>uint32_t remaining = ELEMENT_COUNT;<br>for (uint32_t i = 0; i ELEMENT_COUNT; ++i) {<br>uint32_t random = rand() % remaining;<br>uint32_t tmp = positions[i];<br>positions[i] = positions[i + random];<br>positions[i + random] = tmp;<br>--remaining;<br>Now, the CPU cannot predict which data it will access next, so randomised access takes 1.57B (1572108618) cycles, which is over 10x worse than with linear access. Could we do worse? Of course. Let’s build up the worst permutation, starting with a simple regression.<br>Start by setting positions such that every consecutive element accessed is always separated by a cache line, which is the unit of data that is stored in a cache:<br>void separated_by_a_cacheline(uint32_t const* data, uint32_t* positions) {<br>constexpr int element_count_per_cacheline =<br>CACHELINE_SIZE / sizeof(uint32_t);<br>constexpr int cacheline_count = ELEMENT_COUNT / element_count_per_cacheline;<br>static_assert(ELEMENT_COUNT % element_count_per_cacheline == 0);<br>int current = 0;<br>for (int element_index = 0; element_index element_count_per_cacheline;<br>++element_index) {<br>for (int cacheline_index = 0; cacheline_index cacheline_count;<br>++cacheline_index) {<br>positions[current] =<br>cacheline_index * element_count_per_cacheline + element_index;<br>++current;<br>This pattern is terrible because each access uses one 4-byte integer from a 64-byte cache line before moving on. By the time we come back to the same cache line, the useful reuse cache would have been evicted. This culminates in a horrible performance with a cycle count of 719M (718804156), already taking 4x longer than a linear scan.<br>When accessing elements separated by a cache line, the hardware prefetchers can still recognise a simple streaming pattern in data and start fetching future cache lines before the load requests them. However, many Intel...