OpenAI, Microsoft and Friends Build a Better, More Scalable Ethernet

Timothy Prickett Morgan

Co-Editor, Co-Founder, The Next Platform

Published Tue 12 May 2026 // 18:52 UTC

Sometimes, to solve a particular system architecture problem, you have to invent a new technology. And sometimes, you just need to squint at the problem a little and look at what you already have and use the parts in a different way. The latter approach is what has happened as researchers at OpenAI, Microsoft, Broadcom, AMD, and Nvidia took a hard look at how ever-embiggening bandwidth on network ports is not necessarily a valuable thing compared to having scale out networks with higher radix switches – meaning a lot more network links between devices – and also flatter networks with fewer switches. Lowering the switch count means the scale out network lashing together AI system nodes has lower latency (fewer hops across the network between any two endpoints), lower cost (which lowers the total cost of acquisition), and lower power consumption (which further lowers the total cost of ownership).
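To put rough numbers on the radix argument, here is a minimal sketch using textbook non-blocking Clos arithmetic. The 51.2 Tb/s ASIC, the port speeds, and the function names are our own assumptions for illustration and are not taken from the MRC paper or spec.

```python
# Assumption: a hypothetical switch ASIC with 51.2 Tb/s of aggregate bandwidth.
ASIC_GBPS = 51_200

def radix(port_gbps: int, asic_gbps: int = ASIC_GBPS) -> int:
    """Number of ports when the ASIC bandwidth is carved into equal ports."""
    return asic_gbps // port_gbps

def two_tier(k: int) -> tuple[int, int]:
    """Non-blocking leaf/spine: (max endpoint links, switch count) for radix k."""
    # Each leaf uses k/2 ports down and k/2 ports up; each spine reaches k leaves.
    leaves, spines = k, k // 2
    return leaves * (k // 2), leaves + spines

def three_tier(k: int) -> tuple[int, int]:
    """Classic k-ary fat tree: (max endpoint links, switch count) for radix k."""
    return k**3 // 4, 5 * k**2 // 4

for port in (800, 400, 200, 100):
    k = radix(port)
    links2, switches2 = two_tier(k)
    links3, switches3 = three_tier(k)
    print(f"{port}G ports, radix {k}: "
          f"2-tier {links2:,} links on {switches2:,} switches; "
          f"3-tier {links3:,} links on {switches3:,} switches")
```

Under these assumptions, a two-tier network built from radix 64 switches tops out at 2,048 endpoint links, while the same ASIC carved into 512 ports reaches 131,072 links without adding a third tier. Each link is slower, so an endpoint needs several of them, in parallel networks, to keep the same total bandwidth, and the switch counts above are per network.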

With most great engineering ideas, when you look at it, the new approach is intuitively obvious and you have to wonder why it wasn’t always done that way. Such is the case with Multipath Reliable Connection, a new network protocol that lays down atop Ethernet switch ASICs and that borrows many of the ideas of the Ultra Ethernet specification put forth by the Ultra Ethernet Consortium, which was founded back in July 2023 for the express purpose of scaling Ethernet to more than 1 million endpoints as well as making it as good as the InfiniBand low latency network for AI clusters.

The MRC effort was started two years ago. OpenAI did a lot of the talking for the new MRC protocol as it was unveiled last week, but we strongly suspect that Microsoft did a lot of the work based on its extensive experience with both RoCE Ethernet and InfiniBand networks. You can read the OpenAI blog about MRC here, download the paper the five companies released there, and see the Open Compute Project spec for the effort at this link.

In essence, what MRC does is stop chasing ports with higher and higher bandwidth and start using the same aggregate bandwidth of a given switch ASIC to increase the number of links between devices.

I know what you’re thinking: Won’t increasing the number of ports and the number of links mean increasing the number of potential failures in those links, thereby making it more likely that absolutely synchronous work like an AI training run comes to a crashing halt? No, it won’t, if you radically increase the number of links between endpoints. If you have enough links, as it turns out, and the right protocol, then while the AI training job slows down, there are enough ways to reroute traffic that the network can heal around the link failure. And at your convenience, without having to stop the AI training job, you can repair the link. Endpoint failures – meaning GPUs and XPUs – will still crash the training run, of course.

To which we say: Why not locally snapshot checkpoints on each server node, stream them out to network storage or a shared memory appliance, keep a few spare GPUs or XPUs in the network, and restore that one failed compute engine and then resume the calculation? Perhaps this is harder than it sounds. . . . It might be better to have an out of band compute engine monitor that predicts a failure for a compute engine, freezes the training run before it crashes, takes the failing compute engine offline, loads up data on the spare compute engine, and resumes processing. Why let it crash at all?
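The link failure argument above comes down to simple arithmetic, sketched below. The 800 Gb/s endpoint and the even spread of traffic across links are assumptions for illustration, and the function name is hypothetical; how much a real training job slows down depends on the protocol and the workload, which this does not model.

```python
def remaining_bandwidth(total_gbps: float, links: int, failed: int) -> float:
    """Bandwidth left at an endpoint that spreads traffic evenly over equal links."""
    if not 0 <= failed <= links:
        raise ValueError("failed must be between 0 and links")
    return total_gbps * (links - failed) / links

# Assumption: the same 800 Gb/s of endpoint bandwidth, split into more, slower links.
for links in (1, 2, 4, 8):
    left = remaining_bandwidth(800, links, failed=1)
    print(f"{links} x {800 // links}G: one failed link leaves {left:.0f} Gb/s")
```

With a single fat link, one failure takes the endpoint off the network. With eight thinner links, the same failure costs one eighth of the bandwidth and the job keeps running, which is the trade MRC is making.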

Anyway, back to MRC. While the Ultra Ethernet protocol is a brand new protocol that starts from a blank sheet of paper to make Ethernet more like InfiniBand in terms of low latency, traffic shaping, and adaptive load balancing, the MRC protocol is a much less drastic change and is, in fact, a superset extension of the current RDMA over Converged Ethernet (RoCE) protocol that hyperscalers, cloud builders, and supercomputing centers have been complaining about for more than a decade. The adaptive load balancing is based on Explicit Congestion Notification, and like Ultra Ethernet, MRC supports out of order delivery of packets, packet spraying across multiple links, selective retransmission, and packet trimming to help deal with congestion. Packet trimming is neat in that it only retransmits packets that have been dropped due to switch ASIC buffer overflows, and it does so without invoking the global ECN mechanism. (Nvidia has a good explanation of packet trimming, which was implemented in the Cumulus Linux network operating system it acquired shortly after buying Mellanox, here.) While ECN tells packet senders to slow down when...
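To illustrate why packet spraying and selective retransmission go together, here is a toy sketch. The round-robin spraying, the set-based tracking of sequence numbers, and all of the function names are our own simplifications and are not MRC's actual algorithm or wire format. The comparison point is a go-back-N scheme, which resends everything from the first gap onward.

```python
def spray(num_packets: int, num_paths: int) -> dict[int, int]:
    """Assign each packet sequence number to a path, round-robin."""
    return {seq: seq % num_paths for seq in range(num_packets)}

def deliver(assignment: dict[int, int], dead_paths: set[int]) -> set[int]:
    """Sequence numbers that arrive, meaning those not sent down a dead path."""
    return {seq for seq, path in assignment.items() if path not in dead_paths}

def selective_retransmit(num_packets: int, received: set[int]) -> list[int]:
    """Resend only the sequence numbers the receiver never saw."""
    return sorted(set(range(num_packets)) - received)

def go_back_n_retransmit(num_packets: int, received: set[int]) -> list[int]:
    """Resend everything from the first missing sequence number onward."""
    missing = set(range(num_packets)) - received
    return list(range(min(missing), num_packets)) if missing else []

# Hypothetical scenario: 16 packets sprayed over 4 paths, with path 2 down.
received = deliver(spray(16, 4), dead_paths={2})
print("selective:", selective_retransmit(16, received))
print("go-back-N:", len(go_back_n_retransmit(16, received)), "packets resent")
```

In this scenario the selective scheme resends the four packets that went down the dead path, while go-back-N resends fourteen. Because the receiver accepts packets in any order, the packets that did arrive on the healthy paths are not wasted.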
