NUMA: Cores, memory, and the distance between them




NUMA - Part 1: Cores, memory, and the distance between them<br>June 23, 2026 - Steven Noonan

Two virtual machines on the same host, configured identically, running the same workload. One of them is 20% slower than the other, consistently. Nothing is wrong with the workload, nothing is wrong with the host, no contention from other tenants. The slow one's memory just happens to be on the wrong side of an interconnect from the CPUs running it, and there is no in-guest knob the operator can turn to fix it.<br>That is the story this series is about. Edera shipped a stack of changes that make Xen-based virtualisation NUMA-aware end to end - inside the guest, through the paravirtual I/O drivers, into dom0, and back to the hypervisor's view of host hardware. Some of those pieces are, as far as we can tell, the first implementations anywhere. To explain why any of it matters, we have to start with what NUMA actually is.<br>Where NUMA came from: From UMA to Multi-Socket Servers<br>NUMA stands for Non-Uniform Memory Access. The defining property is in the name: memory access cost on a NUMA machine is not uniform. Which CPU is doing the access matters, and which physical memory bank the data lives in matters, and the cost depends on the relationship between those two things.<br>The historical opposite of NUMA is UMA, Uniform Memory Access, where every CPU reaches every byte of memory through the same memory controller at the same cost. UMA was the world of single-socket commodity servers and many embedded systems. It is conceptually simple and it scales up to a limit - that limit being roughly "as many cores as you can fit on one chunk of silicon with one set of memory channels".<br>What happens when you need to push past that limit? UMA scales until it stops, and "stops" has a few intertwined reasons. Electrical trace lengths get long. Signals take longer to propagate end to end, and at the data rates a modern memory bus runs at, longer traces also force you to slow the bus down to keep signal integrity - which piles latency on top of latency. 
(This is the same physics behind the L1 / L2 / L3 cache hierarchy: L1 is small partly because keeping a cache small and physically close to the core is what keeps it fast.) A single memory controller cannot feed an arbitrary number of cores at full bandwidth; pin counts on the package put a hard ceiling on how many memory channels you can wire to one socket. Bus contention between many CPUs trying to read memory simultaneously stops being negligible. The industry's answer was to give each socket (or die) its own memory controller and stop pretending memory is one symmetric pool. The machine becomes NUMA. Each node has its own CPUs and its own slice of memory; CPUs can still read from any node, but reads to a remote node have to cross an interconnect, and the interconnect is fast in absolute terms and slow relative to a local DRAM access.<br>[Figure: UMA versus NUMA. UMA: CPU 0 to CPU 3 share one memory controller and one memory pool, so every access has the same cost. NUMA: node 0 (CPU 0, CPU 1) and node 1 (CPU 2, CPU 3) each have their own memory controller and memory, joined by an interconnect; local access is cheap, remote access crosses the interconnect.]<br>It helps to think of the interconnect as a bridge between two adjacent towns. On an empty road, the crossing is only marginally slower than local traffic; at rush hour, when everyone is trying to use the same bridge at once, the cost is whatever the queue happens to be. We will come back to that distinction when we talk about how NUMA looks on a busy production workload versus a quiet microbenchmark.<br>A short chronology to anchor where this came from. Commercial NUMA showed up in the 1990s in big-iron systems aimed at a small set of buyers who needed more cores than one bus could feed: SGI's Origin line, Sequent NUMA-Q, Compaq's AlphaServer GS series, all with bespoke node-to-node fabrics. NUMA arrived in commodity x86 in 2003 with AMD's Opteron, which gave each socket its own memory controller and connected sockets via HyperTransport.
Intel caught up in 2008 with Nehalem and the QuickPath Interconnect (QPI), after a long run of front-side-bus chipsets that were genuinely UMA (all CPUs sharing one bus to a northbridge, no per-socket memory at all). Today's Intel parts use the Ultra Path Interconnect (UPI), the descendant of QPI; today's AMD parts use Infinity Fabric, the descendant of HyperTransport. Each generation lifted the absolute bandwidth substantially, but the ratio of remote-to-local DRAM cost has stayed stubbornly in roughly the same range.<br>You might be tempted to think, looking at all of this so far, that each socket is one NUMA node...
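The local-versus-remote cost model described above can be made concrete with a small sketch. It assumes a two-node machine and a distance table in the style of the ACPI SLIT, which Linux exposes under /sys/devices/system/node/node*/distance and where 10 conventionally means "local". The remote value of 21, the function names, and the idea of treating distance as a direct proxy for access cost are all illustrative assumptions, not measurements from any real host.

```python
from typing import Sequence

# Hypothetical two-node distance table, SLIT style: 10 is local by
# convention; the remote value 21 is an assumed example, not a measurement.
DISTANCE = [
    [10, 21],  # as seen from CPUs on node 0
    [21, 10],  # as seen from CPUs on node 1
]


def relative_cost(cpu_node: int, mem_node: int) -> float:
    """Cost of one access relative to a local access (1.0 means local)."""
    return DISTANCE[cpu_node][mem_node] / DISTANCE[cpu_node][cpu_node]


def average_cost(cpu_node: int, page_share: Sequence[float]) -> float:
    """Weighted relative cost when page_share[n] of the pages live on node n."""
    if len(page_share) != len(DISTANCE):
        raise ValueError("need one share per node")
    if abs(sum(page_share) - 1.0) > 1e-9:
        raise ValueError("page shares must sum to 1")
    return sum(
        share * relative_cost(cpu_node, node)
        for node, share in enumerate(page_share)
    )


if __name__ == "__main__":
    # Same CPUs, three different memory placements.
    for label, shares in [
        ("all local", [1.0, 0.0]),
        ("half remote", [0.5, 0.5]),
        ("all remote", [0.0, 1.0]),
    ]:
        print(f"{label}: {average_cost(0, shares):.2f}x local cost")
```

The sketch models only the "empty bridge" case: a fixed ratio per access. It says nothing about queueing on a busy interconnect, which is where the cost stops being a constant.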

