OpenAI, Microsoft And Friends Build A Better, More Scalable Ethernet
Timothy Prickett Morgan
Co-Editor, Co-Founder, The Next Platform
Published Tue 12 May 2026 // 18:52 UTC
Sometimes, to solve a particular system architecture problem, you have to invent a new technology. And sometimes, you just need to squint at the problem a little, look at what you already have, and use the parts in a different way.

The latter approach is what happened as researchers at OpenAI, Microsoft, Broadcom, AMD, and Nvidia took a hard look at how ever-embiggening bandwidth on network ports is not necessarily as valuable as having scale out networks with higher radix switches – meaning a lot more network links between devices – and also flatter networks with fewer switches. Lowering the switch count means the scale out network lashing together AI system nodes has lower latency (fewer hops across the network between any two endpoints), lower cost (which lowers the total cost of acquisition), and lower power consumption (which further lowers the total cost of ownership).
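To put rough numbers on the radix argument, here is a back-of-envelope sketch in Python. It is not taken from the MRC paper: the 51.2 Tb/s ASIC figure, the non-blocking folded Clos (fat tree) topology, and the function names are all our own assumptions, chosen only to show how carving the same aggregate bandwidth into slower ports raises the radix and flattens the network.

```python
# Back-of-envelope sketch: trade port speed for radix on one switch ASIC.
# ASSUMPTIONS (not from the article or the MRC paper): a 51.2 Tb/s ASIC,
# a non-blocking folded Clos (fat tree) topology, and identical switches
# at every tier.

ASIC_GBPS = 51_200  # hypothetical aggregate switch bandwidth


def radix(asic_gbps: int, port_gbps: int) -> int:
    """Number of ports when the ASIC bandwidth is carved into equal ports."""
    return asic_gbps // port_gbps


def max_endpoint_ports(k: int, tiers: int) -> int:
    """Endpoint-facing ports in a non-blocking folded Clos of radix-k switches.

    One tier is a single switch (k ports); two tiers give k^2/2; three
    tiers give k^3/4. In general: 2 * (k/2)^tiers.
    """
    return 2 * (k // 2) ** tiers


def worst_case_switch_hops(tiers: int) -> int:
    """Switches crossed on the longest path: up to the top tier and back down."""
    return 2 * tiers - 1


if __name__ == "__main__":
    for port_gbps in (800, 400, 200, 100):
        k = radix(ASIC_GBPS, port_gbps)
        for tiers in (2, 3):
            print(
                f"{port_gbps:>4} Gb/s ports: radix {k:>3}, {tiers} tiers, "
                f"{max_endpoint_ports(k, tiers):>10,} endpoint ports, "
                f"{worst_case_switch_hops(tiers)} switch hops worst case"
            )
```

Under these assumptions, a two-tier network of radix 512 switches (100 Gb/s ports) reaches more endpoint ports than a three-tier network of radix 64 switches (800 Gb/s ports), and does so with three switch hops in the worst case instead of five. Each port carries one-eighth of the bandwidth, of course, so an endpoint that wants the full 800 Gb/s needs eight of them. The sketch counts ports only and says nothing about how those eight links would be spread across switches.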
With most great engineering ideas, when you look at them, the new approach is intuitively obvious and you have to wonder why it wasn’t always done that way. Such is the case with Multipath Reliable Connection, or MRC, a new network protocol that lays down atop Ethernet switch ASICs and that borrows many of the ideas of the Ultra Ethernet specification put forth by the Ultra Ethernet Consortium, which was founded back in July 2023 for the express purpose of scaling Ethernet to more than 1 million endpoints as well as making it as good as the low latency InfiniBand network for AI clusters.

The MRC effort was started two years ago. OpenAI did a lot of the talking for the new MRC protocol as it was unveiled last week, but we strongly suspect that Microsoft did a lot of the work based on its extensive experience with both RoCE Ethernet and InfiniBand networks. You can read the OpenAI blog about MRC here, download the paper the five companies released there, and see the Open Compute Project spec for the effort at this link.

In essence, what MRC does is stop chasing ports with higher and higher bandwidth and start using the same aggregate bandwidth of a given switch ASIC to increase the number of links between devices. I know what you’re thinking: Won’t increasing the number of ports and the number of links mean increasing the number of potential failures in those links, thereby making it more likely that absolutely synchronous work like an AI training run comes to a crashing halt? No, it won’t, if you radically increase the number of links between endpoints. If you have enough links, as it turns out, and the right protocol, you can heal around link failures: the AI training job slows down, but there are enough ways to reroute traffic that it keeps running. And at your convenience, without having to stop the AI training job, you can repair the link.

Endpoint failures – meaning GPUs and XPUs – will still crash the training run, of course. To which we say: Why not locally snapshot checkpoints on each server node, stream them out to network storage or a shared memory appliance, keep a few spare GPUs or XPUs in the network, and restore that one failed compute engine and then resume the calculation? Perhaps this is harder than it sounds. . . . It might be better to have an out of band compute engine monitor that predicts a failure for a compute engine, freezes the training run before it crashes, takes the failing compute engine offline, loads up data on the spare compute engine, and resumes processing. Why let it crash at all?
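As for link failures, as opposed to endpoint failures, the argument for many thin links over one fat one can be sketched in a few lines of Python. This is an illustration under our own assumptions, not MRC's actual algorithm: the link counts and speeds are hypothetical, and plain round-robin spraying stands in for whatever path selection the real protocol uses.

```python
from itertools import cycle
from typing import Dict, Iterable, List, Set

# Toy model of healing around link failures by spraying over the survivors.
# ASSUMPTIONS: link counts and speeds are hypothetical, and round-robin
# spraying is a stand-in for MRC's real (more sophisticated) path selection.


def surviving_bandwidth(num_links: int, link_gbps: int, failed: Set[int]) -> int:
    """Aggregate bandwidth left on an endpoint after some links fail."""
    live = [i for i in range(num_links) if i not in failed]
    return len(live) * link_gbps


def spray(packets: Iterable[int], num_links: int, failed: Set[int]) -> Dict[int, List[int]]:
    """Distribute packets round-robin across the links that are still up."""
    live = [i for i in range(num_links) if i not in failed]
    if not live:
        raise RuntimeError("no live links: the endpoint is cut off")
    assignment: Dict[int, List[int]] = {i: [] for i in live}
    for pkt, link in zip(packets, cycle(live)):
        assignment[link].append(pkt)
    return assignment


if __name__ == "__main__":
    # One fat 800 Gb/s link: a single failure cuts the endpoint off entirely.
    print("1 x 800G, one failure:", surviving_bandwidth(1, 800, {0}), "Gb/s")
    # Eight thin 100 Gb/s links: one failure costs one-eighth of the bandwidth.
    print("8 x 100G, one failure:", surviving_bandwidth(8, 100, {3}), "Gb/s")
    # Traffic simply avoids the dead link; nothing has to stop.
    for link, pkts in spray(range(14), 8, {3}).items():
        print(f"link {link}: packets {pkts}")
```

In this toy model, the endpoint with a single 800 Gb/s link drops to zero when that link fails, while the endpoint with eight 100 Gb/s links keeps 700 Gb/s and carries on, which is the "slows down but does not halt" behavior described above.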
Anyway, back to MRC. While the Ultra Ethernet protocol is a brand new protocol that starts from a blank sheet of paper to make Ethernet more like InfiniBand in terms of low latency, traffic shaping, and adaptive load balancing, the MRC protocol is a much less drastic change and is, in fact, a superset extension of the current RDMA over Converged Ethernet (RoCE) protocol that hyperscalers, cloud builders, and supercomputing centers have been complaining about for more than a decade.

The adaptive load balancing is based on Explicit Congestion Notification (ECN), and like Ultra Ethernet, MRC supports out of order delivery of packets, packet spraying across multiple links, selective retransmission, and packet trimming to help deal with congestion.

Packet trimming is neat in that it only retransmits packets that have been dropped due to switch ASIC buffer overflows, and it does so without invoking the global ECN mechanism. (Nvidia has a good explanation of packet trimming, which was implemented in the Cumulus Linux network operating system it acquired shortly after buying Mellanox, here.) While ECN tells packet senders to slow down when...