Why More Cores Stopped Saving Us


jonathanbeard.io

Scaling doesn’t stall on the resource you keep adding. It stalls on the one dependency you can’t parallelize away.

In previous posts we walked through the ways you can actually parallelize an application: pipelined, task parallel, or both (there are lots of variations, all the way down to the instruction-level and memory parallelism the microarchitecture squeezes out under you). This is the post about the ceiling all of them share.

For a good long stretch, the fix for “this program is too slow” was just “wait.”[1] Performance showed up for free, year after year, because the clock kept climbing. You wrote your code, a faster chip shipped, and your code got faster while you slept. Then sometime in the mid-2000s the free lunch ended.[2] (It always does.) The clocks quit climbing because the physics quit cooperating: push a single core much faster and it runs too hot to keep going that way.[3] So the industry pivoted hard, and the new promise was parallelism. Not one fast core, but lots of them, and the marching orders couldn’t have been simpler. More cores, more speed. Throw cores at it.[4] For a while that genuinely worked, and a whole culture of engineering grew up around the assumption that the next slice of performance was just a matter of adding more hands to the job. Except the people who knew the physics knew better, even then. The limits were visible from the start: the memory bandwidth each core gets to itself shrinks as you add more of them all reaching for the same pins, and the time to cross the chip climbs as the chip grows, because more cores means more area and physics still bills you for every millimeter a signal has to travel. More hands only help if every hand can reach the work and talk to its neighbors cheaply, and at scale neither stays free.
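The shrinking bandwidth share is easy to put numbers on. What follows is a back-of-the-envelope sketch, not a model of any real chip: it assumes a fixed total memory bandwidth split evenly across cores, and a workload where every core wants the same sustained rate. The function names and the figures in the usage note are made up for illustration.

```python
def per_core_bandwidth(total_gbs: float, cores: int) -> float:
    """Even share of a fixed memory bandwidth (GB/s) among active cores."""
    if cores < 1:
        raise ValueError("need at least one core")
    return total_gbs / cores


def bandwidth_capped_speedup(cores: int, demand_gbs: float, total_gbs: float) -> float:
    """Throughput of `cores` cores relative to one core, in a toy model.

    Assumption: a core runs at full rate only if it can pull `demand_gbs`
    from memory, and delivered bandwidth never exceeds `total_gbs`.
    """
    if cores < 1 or demand_gbs <= 0 or total_gbs <= 0:
        raise ValueError("cores must be >= 1 and bandwidths must be positive")
    one_core = min(demand_gbs, total_gbs)           # what a single core can get
    many_cores = min(cores * demand_gbs, total_gbs)  # what the pins can deliver
    return many_cores / one_core
```

With a hypothetical 100 GB/s of total bandwidth and cores that each want 10 GB/s, `bandwidth_capped_speedup(4, 10.0, 100.0)` is 4.0, but 16 cores and 64 cores both come out at 10.0. Past the tenth core, each new one just takes a thinner slice of the same pins.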

And it’s worth being precise about what those extra cores are actually plugged into, because adding a core to a modern chip isn’t just adding a core. It’s adding a core together with its own slice of the cache hierarchy, its own prefetchers, its own fast private memory, all of it reuse-optimized, all of it one big bet that you’ll touch the same data again soon and that a hot working set will sit close and stay close. When that bet holds, parallelism scales just fine. Give each core a private working set with good locality, or hand many cores the same immutable data to read, and the caches do exactly what they were built to do; you really do get close to the speedup the core count promises. The trouble is narrower and more specific than “too many cores.” It shows up when cores share mutable cache lines and have to keep ping-ponging ownership back and forth, when two unrelated variables happen to land on the same line and the hardware can’t tell your false sharing from the real thing, when locality gets poor enough that the caches stop earning their keep, or when enough cores pull on memory at once that you hit a bandwidth wall the hierarchy was supposed to hide. Drive cores into those patterns and the reuse the architecture was counting on does start to break down. Keep the work clean and independent and it holds up beautifully.
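The false-sharing case is the one that surprises people, so here is a minimal sketch of what “landing on the same line” means. It uses Python’s `ctypes` to lay out two counters the way a C compiler would, and it assumes a 64-byte cache line (common, but not universal) and a struct that starts on a line boundary. `Packed`, `Padded`, `line_index`, and `shares_line` are names invented for this illustration. It only checks layout; it does not measure the slowdown.

```python
import ctypes

CACHE_LINE = 64  # assumed line size in bytes; real hardware varies


class Packed(ctypes.Structure):
    # Two counters meant for two different threads, laid out back to back.
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]


class Padded(ctypes.Structure):
    # Same counters, with padding so `b` starts on the next assumed line.
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("b", ctypes.c_uint64),
    ]


def line_index(struct_type, field: str) -> int:
    """Assumed cache line holding `field`, counted from the struct's start."""
    return getattr(struct_type, field).offset // CACHE_LINE


def shares_line(struct_type, f1: str, f2: str) -> bool:
    """True if both fields fall on the same assumed cache line."""
    return line_index(struct_type, f1) == line_index(struct_type, f2)
```

Under those assumptions, `shares_line(Packed, "a", "b")` is true and `shares_line(Padded, "a", "b")` is false. In the packed layout, two threads that never touch each other’s counter still fight over one line; the padded layout trades 56 bytes for leaving each core alone.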

There’s a second sacrifice buried in there, and this one we made on purpose. To keep all those cores manageable, to let a programmer go on pretending memory is one flat thing that reads the same everywhere even while a dozen cores are scribbling on it at once, we bolted coherence onto the hardware.[5] That’s the machinery that quietly hunts down every copy of a cache line scattered across the chip and keeps them all in agreement, so your code never has to ask which core touched what last. It’s a gorgeous abstraction, and it’s nowhere close to free. The up-front cost is fixed: real silicon area, real power, and a mountain of verification effort, spent whether or not a given program ever touches shared memory at all. The runtime bill is more selective, and it lands hardest exactly where you’d guess, on programs that genuinely share mutable data, where the protocol spends its life chasing ownership of contended lines back and forth, a cost that climbs with the very core count it exists to tame. We laid that one on the altar of programmability, and we’re still paying for it in transistors spent on bookkeeping instead of math, and in watts spent keeping caches honest.
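To get a feel for the “chasing ownership” part, here is a deliberately crude toy, not a real coherence protocol. It assumes a single cache line with exactly one owner at a time, and it counts a transfer whenever a write comes from a core other than the current owner. The helper names are hypothetical, and real protocols have more states and more ways to be expensive.

```python
def ownership_transfers(writes):
    """Count how often one line changes owner over a sequence of writes.

    `writes` is a list of core ids in the order they write to the line.
    The first acquisition is not counted as a transfer.
    """
    owner = None
    transfers = 0
    for core in writes:
        if owner is not None and core != owner:
            transfers += 1
        owner = core
    return transfers


def round_robin(cores: int, writes_per_core: int):
    """Cores take turns writing the shared line: the contended pattern."""
    return [c for _ in range(writes_per_core) for c in range(cores)]


def batched(cores: int, writes_per_core: int):
    """Each core finishes all of its writes before the next one starts."""
    return [c for c in range(cores) for _ in range(writes_per_core)]
```

In this toy, 4 cores doing 100 writes each cost 399 transfers when they interleave and 3 when they batch, and doubling to 8 cores pushes the interleaved count to 799. The work is identical; only the sharing pattern changed.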

Virtual memory is the same story one more time. That private, uniform address space every program thinks it owns isn’t a fact about the machine; it’s an abstraction the hardware and the operating system maintain together, with a pile of dedicated widgets: page tables (specialized cache entries), TLBs (simply more caches, specialized for those page tables), page-table walkers (really just simple controllers specialized for graph traversal), and an MMU (really just a combination of the above plus the translation machinery), all grinding away to turn the addresses your code uses into the real ones the DRAM...
