Beyond the Memory Wall: The CPU Was Helping You All Along

random__duck1 pts1 comments

Published on 2026-06-05

Tags: systems, performance, memory, cache, architecture, low-level, cpu

The False Confidence

Blog 1 ended with what felt like a satisfying explanation. DRAM is slow, the cache hierarchy exists to hide that fact, and if your working set grows large enough to spill out of cache, you start paying the full cost of main memory access. I even had the experiments to back it up: the 64-byte cache line showed up right where the hardware said it would, and the latency curves bent exactly the way the mental model predicted. It felt like I'd caught the memory subsystem with its pants down.

So naturally, I assumed the next set of experiments would just be more of the same. Bigger working sets, slower access times, the memory wall doing its thing on a larger stage. I set up a working set sweep in Aletheia ranging from 1KB all the way up to 64MB, expecting to watch latency climb steadily as the data outgrew each level of cache. What I did not expect was to find myself 20 minutes later questioning whether my benchmark was broken.

The sequential scan results looked wrong in the most confusing possible way: not wrong like garbage values or obvious bugs, but wrong like "this should be slower, and it isn't, and I don't know why." A 64MB sequential scan, well into DRAM territory, was not behaving like something that had to pay 100 nanoseconds per cache miss. Same machine, same DRAM, same cache hierarchy I'd spent an entire post describing as the fundamental bottleneck of modern computing.

So either I had misunderstood the memory wall, or the CPU was quietly doing a lot more work than I had given it credit for.

The Weird Result

The experiment was straightforward. A working set sweep in Aletheia, ranging from 1KB to 64MB, measuring average latency per access at each size. The goal was to cross the cache hierarchy intentionally and watch what happens. Small buffers should feel fast, larger ones should slow down gradually, and once the working set grows large enough, DRAM should start making itself obvious.
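To make the shape of that experiment concrete, here is a minimal sketch of a working set sweep. It is not Aletheia's code: the function names `sweep_sizes` and `avg_ns_per_access`, the power-of-two step, and the 8-byte element size are all assumptions made for illustration. It is also Python, where interpreter overhead swamps cache effects, so treat it as the structure of the measurement rather than a way to reproduce the numbers in this post.

```python
import time
from array import array


def sweep_sizes(lo_bytes=1024, hi_bytes=64 * 1024 * 1024):
    """Power-of-two working-set sizes from lo_bytes to hi_bytes, inclusive.

    The doubling step is an assumption; the post only gives the 1KB-64MB range.
    """
    sizes = []
    size = lo_bytes
    while size <= hi_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def avg_ns_per_access(size_bytes, min_accesses=1_000_000):
    """Sequentially scan a buffer and return (mean ns per element read, checksum).

    Elements are 8-byte integers (an assumed element size). The checksum is
    returned so the reads have a visible result and cannot be skipped.
    """
    n = max(1, size_bytes // 8)
    buf = array("q", range(n))  # 'q' is a signed 8-byte integer
    # Small buffers are scanned several times so the timer has enough work.
    passes = max(1, min_accesses // n)
    checksum = 0
    start = time.perf_counter_ns()
    for _ in range(passes):
        for value in buf:
            checksum += value
    elapsed = time.perf_counter_ns() - start
    return elapsed / (passes * n), checksum
```

A full sweep would loop over `sweep_sizes()` and record `avg_ns_per_access(size)[0]` for each size. A real harness would do this in a compiled language, pin the thread, and warm the buffer first; none of that is shown here.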

At first, the numbers behaved exactly as expected. Small buffers sat around a few nanoseconds per access, larger working sets got progressively more expensive, and the general shape felt reassuringly familiar. Memory was getting slower as data grew, exactly as we saw in Blog 1.

[!Spoiler Alert]

Then things got weird :)

Sequential working-set sweep results. At this point, the numbers looked suspiciously reasonable for something supposedly paying DRAM latency.

The larger working sets were clearly touching DRAM, but the results still looked strangely reasonable. A 16MB sequential scan landed at around 99ns per access, while a 64MB scan came back at around 94ns. Not only was the slowdown less dramatic than expected, the numbers were not even increasing cleanly anymore.

This felt suspicious, so I reran it, stared at the measurement code for longer than I would like to admit, and convinced myself I had probably done something stupid. Same result every time.

If DRAM access really costs around 100ns, why did scanning through tens of megabytes of it sequentially still feel surprisingly cheap?
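One way to put a number on that expectation is a back-of-the-envelope yardstick: assume every 64-byte cache line in the buffer costs one full DRAM miss, and that no miss overlaps with any other. The sketch below does that arithmetic. The 64-byte line and the 100ns figure come from this series; the function name `naive_scan_cost` and the "nothing overlaps" model are assumptions, and the model is deliberately the most pessimistic one available.

```python
CACHE_LINE_BYTES = 64   # line size observed in Blog 1
DRAM_LATENCY_NS = 100   # round figure used in this post


def naive_scan_cost(working_set_bytes,
                    line_bytes=CACHE_LINE_BYTES,
                    miss_ns=DRAM_LATENCY_NS):
    """Return (cache lines spanned, total milliseconds) for one pass.

    Assumes one fully serialized DRAM miss per cache line, which is the
    pessimistic mental model being questioned here, not a measured result.
    """
    lines = working_set_bytes // line_bytes
    total_ms = lines * miss_ns / 1e6
    return lines, total_ms


lines, total_ms = naive_scan_cost(64 * 1024 * 1024)
print(f"{lines} lines, about {total_ms:.1f} ms per pass")
```

Under those assumptions a single pass over 64MB spans 1,048,576 lines and should take roughly 105 ms. Timing one full pass against that figure is a quick check on whether the scan really pays one serialized miss per line.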

The Wrong Hypothesis

My first instinct was that the answer had to be hiding somewhere in the access pattern itself. Sequential scans felt unusually well-behaved, and pointer chasing, at least from the outside, felt far more chaotic. So naturally I assumed the real explanation probably lived somewhere between those two extremes.

[!Thought Of The Day]

What if the distinction was not sequential versus random, but predictable versus unpredictable?

This led me to stride access. Instead of touching every element sequentially, memory gets accessed at fixed intervals, skipping a few elements, then a few more, then much larger jumps. Intuitively this felt like a reasonable middle ground, less orderly than a sequential scan but not completely chaotic either. If sequential was too easy and pointer chasing was too painful, stride felt like the obvious next experiment to run.
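As a rough illustration of what a stride pattern does to cache line usage, the sketch below generates the visited indices and counts how many distinct 64-byte lines they land on. The helper names, the 8-byte element size, and the assumption that the buffer starts on a line boundary are all mine, not taken from the actual experiment.

```python
def stride_indices(n_elements, stride):
    """Indices visited when stepping through n_elements at a fixed stride."""
    return range(0, n_elements, stride)


def cache_lines_touched(n_elements, stride, element_bytes=8, line_bytes=64):
    """Count distinct cache lines hit by a strided walk.

    Assumes the buffer is aligned to a cache line boundary and that
    elements are element_bytes wide (8 bytes here, an assumption).
    """
    return len({
        (i * element_bytes) // line_bytes
        for i in stride_indices(n_elements, stride)
    })


def accesses_per_line(n_elements, stride, element_bytes=8, line_bytes=64):
    """Average number of accesses that share each touched cache line."""
    accesses = len(stride_indices(n_elements, stride))
    lines = cache_lines_touched(n_elements, stride, element_bytes, line_bytes)
    return accesses / lines


for stride in (1, 2, 8, 16):
    print(stride, accesses_per_line(1024, stride))
```

With 8-byte elements, a stride of 1 gets eight accesses out of every line, a stride of 2 gets four, and from a stride of 8 upward every access lands on a fresh line. The pattern stays perfectly regular at every stride: the addresses are always the previous address plus a constant.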

So I tried it. The results were interesting, but strangely unsatisfying. Some strides clearly hurt more than others, and larger jumps made things noticeably slower. But the experiment never really answered the question that had been bothering me. Memory still seemed strangely resistant to becoming as painful as I expected it to be.

At some point I realized I was probably asking the wrong question entirely. Maybe randomness itself was not the real problem. Maybe the better question was this: what exactly stops the CPU from helping?

The Better Question

Up until this point I had been treating randomness and pointer chasing as roughly the same thing. They both felt chaotic, they both seemed hostile to cache locality, and they both looked nothing like the neat orderly world of sequential scans. But somewhere in the middle of all this...

