Cache Hierarchy Matters for Pragmatism


The Pragmatic Programmer After the Memory Wall

Khola.Blog


What still holds when agents write code quickly, hardware punishes indirection, and cloud control planes fail at global scale.

Nitin Khola<br>May 28, 2026


I do not think The Pragmatic Programmer has aged out. I think the environment around it has become less forgiving.<br>The book is not really about tools. It is about engineering posture: take responsibility, keep systems easier to change, make feedback loops tight, test what can fail, and refuse to live with broken windows. That posture still holds. What changed is the failure surface.


In 2026, a code agent can produce a week of mediocre abstraction before lunch. A laptop-class chip can execute billions of instructions per second and still spend the hot path waiting on scattered memory. A global control plane can replicate bad state across regions faster than a human can open the incident channel.<br>That does not make pragmatism obsolete. It makes the old advice more literal.

ETC Has a Physical Layer

The book’s central design rule is ETC: easier to change. That rule is still the right one. The mistake is treating “easier to change” as a synonym for “more abstract.”<br>That is how codebases end up with a hierarchy for every noun, an interface for every class, and a dependency-injected trail of breadcrumbs between one integer and the next. It feels civilized in review. It often benchmarks like a pile of receipts.<br>Modern CPUs do not run class diagrams. They fetch cache lines.<br>Intel’s Lunar Lake design puts more attention on power, memory proximity, and a memory-side cache. Apple’s M5 pushes unified memory bandwidth to 153 GB/s. Those are not licenses to ignore locality. They are evidence that silicon vendors are spending real die area and packaging complexity to hide the cost of moving data.<br>Software can still defeat all of it with pointer chasing.<br>The common agent failure is not exotic. Ask for a particle update loop, a pricing pass, a simulation tick, or a ranking transform, and the default output often looks like this:<br>struct Particle {<br>float x;<br>float y;<br>float z;<br>float vx;<br>float vy;<br>float vz;

void step(float dt) {<br>x += vx * dt;<br>y += vy * dt;<br>z += vz * dt;<br>}<br>};

std::vector<Particle> particles;<br>That shape is fine until the loop matters. Then every iteration drags fields through memory as an object bundle, whether the CPU needs all of them or not.<br>For a hot loop, I want the data shaped around the access pattern:<br>struct ParticleBlock {<br>// One contiguous array per field.<br>std::vector<float> x;<br>std::vector<float> y;<br>std::vector<float> z;<br>std::vector<float> vx;<br>std::vector<float> vy;<br>std::vector<float> vz;

void step(float dt) {<br>// Assumes all six arrays have the same length.<br>for (size_t i = 0; i < x.size(); ++i) {<br>x[i] += vx[i] * dt;<br>y[i] += vy[i] * dt;<br>z[i] += vz[i] * dt;<br>}<br>}<br>};<br>This is not an argument for flattening the whole application into arrays. It is an argument for performance tiers.<br>Business policy can afford indirection. Hot loops, storage engines, serialization paths, rendering, compression, matching, ranking, and simulation usually cannot. In those places, ETC means the future maintainer can find the data flow, predict the memory access, and benchmark the change without spelunking through ceremony.<br>The pragmatic move is not “object-oriented” or “data-oriented.” The pragmatic move is knowing which part of the system is paying rent to the cache hierarchy.

Big-O Is Missing the Invoice
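Both particle layouts from the previous section are O(n) per step, which makes them a convenient test of what the notation leaves out. Here is a minimal timing sketch, not a rigorous benchmark. The names `step_aos`, `step_soa`, `time_ms`, and `compare_layouts` are hypothetical helpers introduced for illustration, and the initial values are arbitrary. One caveat: this particular loop reads all six fields, so any gap between the layouts is more likely to come from how easily the compiler vectorizes the loop than from wasted bandwidth. The layout cost usually grows when the struct carries fields the hot loop never touches.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Array-of-structures: the six floats of one particle sit side by side.
struct Particle {
    float x, y, z;
    float vx, vy, vz;
};

// Structure-of-arrays: each field lives in its own contiguous array.
struct ParticleBlock {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;

    // Arbitrary starting state for the sketch: origin, velocity (1, 2, 3).
    explicit ParticleBlock(std::size_t n)
        : x(n, 0.0f), y(n, 0.0f), z(n, 0.0f),
          vx(n, 1.0f), vy(n, 2.0f), vz(n, 3.0f) {}
};

void step_aos(std::vector<Particle>& ps, float dt) {
    for (Particle& p : ps) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}

void step_soa(ParticleBlock& b, float dt) {
    const std::size_t n = b.x.size();
    assert(b.y.size() == n && b.z.size() == n);
    assert(b.vx.size() == n && b.vy.size() == n && b.vz.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        b.x[i] += b.vx[i] * dt;
        b.y[i] += b.vy[i] * dt;
        b.z[i] += b.vz[i] * dt;
    }
}

// Hypothetical helper: run fn() `steps` times, return wall-clock milliseconds.
template <typename Fn>
double time_ms(Fn&& fn, int steps) {
    const auto start = std::chrono::steady_clock::now();
    for (int s = 0; s < steps; ++s) fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

struct LayoutTimings {
    double aos_ms;
    double soa_ms;
    bool same_result;
};

LayoutTimings compare_layouts(std::size_t n, int steps, float dt) {
    std::vector<Particle> aos(n, Particle{0.0f, 0.0f, 0.0f, 1.0f, 2.0f, 3.0f});
    ParticleBlock soa(n);

    LayoutTimings t{};
    t.aos_ms = time_ms([&] { step_aos(aos, dt); }, steps);
    t.soa_ms = time_ms([&] { step_soa(soa, dt); }, steps);

    // Both loops apply the same float operations in the same order, so the
    // positions are expected to match. Reading them back also keeps the
    // optimizer from discarding the work being timed.
    t.same_result = true;
    for (std::size_t i = 0; i < n; ++i) {
        if (aos[i].x != soa.x[i] || aos[i].y != soa.y[i] || aos[i].z != soa.z[i]) {
            t.same_result = false;
            break;
        }
    }
    return t;
}
```

Calling `compare_layouts(1 << 20, 100, 0.5f)` from `main` and printing the two timings is enough to start. Build with optimizations on, vary `n` across cache sizes, and treat the numbers as specific to that machine and compiler.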

Big-O is not wrong. It is incomplete.<br>It throws away constants because that is what makes asymptotic reasoning useful. The machine puts those constants back with interest. A branch mispredict, a cold cache line, a TLB miss, a failed prefetch, and a recursive call frame all live outside the clean little expression.<br>That is why production sorting implementations use hybrids. The std::sort in libstdc++, for example, is an introsort that hands small partitions to insertion sort. The high-level algorithm carries the asymptotic guarantee. The small-partition fallback respects the machine.<br>For tiny contiguous ranges, insertion sort can beat a theoretically superior algorithm because it walks memory in a boring pattern. Boring is a feature. The hardware can prefetch it. The branch predictor can learn it. The compiler can see it.<br>The lesson is not “always use insertion sort.” That would be a cargo cult with a better haircut. The lesson is that the crossover point belongs to the benchmark, not the blog post.<br>This is the update I would write into the margin of the book: estimate first, then measure at the physical boundary. If the path is CPU-bound, measure cache misses. If it is I/O-bound, measure queueing and tail latency. If it is distributed, measure retries, coordination, and blast radius.<br>An asymptotic proof is the start of the conversation. A profile is where the machine gets a vote.

Agents Make Broken Windows Cheap

The book’s broken-window rule gets more important when code is cheap.<br>A human usually leaves a broken window one commit at a time: a vague name, a duplicated branch, a swallowed exception, a test skipped because the release is late. An agent can stamp out...

