InfiniBand, RoCE, and All That


19 Jun 2026 · 15 min read

Cover: Opening of the Pneumatic Despatch Mail Service, Illustrated London News, 28 Feb. 1863.

The standard path for sending data over a network works like this:

Figure: The standard path. On each host the data is copied through a kernel buffer (the orange hops), so the CPU is in the critical path on both ends.

SENDER (app buf → kernel → NIC → network): 1. App hands data to the kernel. 2. Kernel copies into socket buffer. 3. Kernel schedules the NIC. 4. NIC DMAs from buffer onto wire.

RECEIVER (network → NIC → kernel → app buf): 1. NIC DMAs into a kernel buffer. 2. Kernel copies it into app buffer. 3. Kernel notifies the waiting app.

There are variations and optimizations, but the basic shape has the kernel involved on both sides, and the data being copied into a kernel buffer. For a web server handling HTTP requests this is fine, since the per-message overhead is negligible relative to all the stuff the application wants to do with the data.
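The basic shape described above can be seen from the application's side in a minimal Python sketch. It is an illustration under stated assumptions, not a benchmark: it uses a local `socket.socketpair()`, so no NIC is involved, but the application still hands data to the kernel with one system call and gets it copied back out with another. The function names `receiver` and `send_through_kernel` are made up for this example.

```python
import socket
import threading


def receiver(sock: socket.socket, app_buf: bytearray) -> None:
    """Fill app_buf with bytes read from sock."""
    view = memoryview(app_buf)
    got = 0
    while got < len(app_buf):
        # System call: the kernel copies bytes out of its socket buffer
        # into the application buffer.
        n = sock.recv_into(view[got:])
        if n == 0:  # peer closed early
            break
        got += n


def send_through_kernel(payload: bytes) -> bytes:
    """Send payload through a local socket pair and return what arrived."""
    tx, rx = socket.socketpair()
    app_buf = bytearray(len(payload))
    t = threading.Thread(target=receiver, args=(rx, app_buf))
    t.start()
    # System call: the kernel copies the payload into its socket buffer
    # before anything is delivered to the other end.
    tx.sendall(payload)
    tx.close()
    t.join()
    rx.close()
    return bytes(app_buf)


if __name__ == "__main__":
    data = b"x" * 1_000_000
    assert send_through_kernel(data) == data
```

Every byte crosses the user/kernel boundary twice here, once on `sendall` and once on `recv_into`, which is the per-message overhead the text refers to.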

There are lots of places where this is not so fine. The ones that matter most, nowadays, are in AI training and inference. In training, a gradient all-reduce (the step where hundreds of GPUs each combine their gradients with everyone else’s before the next step can start) is a barrier that every GPU has to wait on. In distributed inference, an expert-parallel kernel synchronizes the waiting GPUs in much the same way. (EP is just an example; all the other parallelisms, except data-parallel, have the same property.)
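A back-of-envelope model shows why a barrier like this makes communication time so visible. The sketch below assumes a fully synchronous step with no overlap between compute and communication, which real training stacks try hard to avoid; the function names and the millisecond figures are invented for illustration.

```python
def step_time_ms(compute_ms: list[float], all_reduce_ms: float) -> float:
    """Duration of one synchronous step.

    The all-reduce cannot complete until the slowest worker has
    contributed, so every worker pays max(compute) plus the
    communication time.
    """
    return max(compute_ms) + all_reduce_ms


def communication_share(compute_ms: list[float], all_reduce_ms: float) -> float:
    """Fraction of the step during which the GPUs are waiting on the network."""
    return all_reduce_ms / step_time_ms(compute_ms, all_reduce_ms)


if __name__ == "__main__":
    even = [100.0, 100.0, 100.0, 100.0]
    straggler = [100.0, 100.0, 100.0, 130.0]
    print(step_time_ms(even, 20.0))       # all four finish together
    print(step_time_ms(straggler, 20.0))  # everyone waits for the slow one
    print(communication_share(even, 25.0))
```

In this model, any latency shaved off the all-reduce comes straight off the step time for every GPU at once, which is why the interconnect gets so much attention.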

What these workloads need is for data to move directly from an application buffer on one machine to an application buffer on another, without either CPU touching it in the critical path, and without a memory copy. This is Remote Direct Memory Access (RDMA):

Figure: RDMA. The NIC reads and writes the application buffers directly, so the kernel is bypassed and nothing is copied.

SENDER app buf → NIC → network → NIC → RECEIVER app buf (the kernel is bypassed on both sides).

Building hardware that provides RDMA reliably and efficiently is the problem InfiniBand is designed to solve.
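To make the "directly into an application buffer" idea concrete, here is a toy Python model of a one-sided write. It is emphatically not the verbs API or any real library: `ToyNic`, `register`, and `remote_write` are hypothetical names, and a single in-process object stands in for the NICs and the fabric between them. The only point it illustrates is that the receiving application registers a buffer ahead of time and then runs no receive code at all when data lands in it.

```python
class ToyNic:
    """Hypothetical stand-in for an RDMA-capable NIC (illustration only)."""

    def __init__(self) -> None:
        self._regions: dict[int, memoryview] = {}
        self._next_key = 1

    def register(self, buf: bytearray) -> int:
        """Expose an application buffer to the NIC and return a key for it."""
        key = self._next_key
        self._next_key += 1
        # A memoryview refers to the same memory as buf; nothing is copied.
        self._regions[key] = memoryview(buf)
        return key

    def remote_write(self, key: int, offset: int, data: bytes) -> None:
        """Place data straight into the registered buffer named by key."""
        region = self._regions[key]
        if offset < 0 or offset + len(data) > len(region):
            raise ValueError("write outside the registered region")
        region[offset:offset + len(data)] = data


if __name__ == "__main__":
    nic = ToyNic()
    app_buf = bytearray(16)          # receiver's application buffer
    key = nic.register(app_buf)      # done once, outside the fast path
    nic.remote_write(key, 4, b"grad")
    # The receiver made no recv call; the bytes are simply there.
    assert app_buf[4:8] == b"grad"
```

Real hardware does the registration step through the operating system, but once it is done the data path involves neither kernel, which is the property the diagram above shows.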

InfiniBand as the answer

In 1999, the industry agreed that RDMA was something that needed doing. Two competing proposals merged into a single effort and produced the InfiniBand Trade Association: Future I/O, backed by Compaq, HP, and IBM, and Next Generation I/O (NGIO), backed by Intel, Microsoft, and Sun. The ambition was extraordinary: InfiniBand was not designed as a networking technology but as a replacement for the entire server I/O stack, meaning the PCI bus for device I/O, Ethernet for networking, and Fibre Channel for storage.

(Future I/O was a switched-fabric interconnect for host-to-host and host-to-I/O communication; see Future I/O (IEEE). Intel announced NGIO in November 1998; see “Intel Introduces Next Generation I/O for Computing Servers”. Version 1.0 of the InfiniBand spec followed in 2000. The vision went further than even the I/O-stack framing suggests: devices would attach to the fabric as endpoints rather than as slots on a local bus. Wikipedia has a good summary of the early history.)

The result was designed to be technically coherent from the ground up. The big idea is credit-based flow control at the link layer: a sender cannot transmit unless the receiver has signaled it has buffer space. This makes the fabric inherently lossless, in the sense that a packet is never dropped because the receiver ran out of buffers. Losslessness isn’t strictly required for RDMA, but it makes the transport much simpler: nothing in the fast path has to recover from a dropped packet. The programming model that grew up around this, “the verbs API”, is a coherent stack built on the guarantees the fabric provides.

(“Verbs” is not a formal API specification. The IBTA spec defines a set of abstract operations, such as posting a send request or opening a device, that must exist and behave in certain ways, without prescribing an exact interface. The de facto implementation is libibverbs, developed under the OpenFabrics Alliance, where those operations appear as ibv_post_send, ibv_open_device, and so on; the kernel-side InfiniBand support it relies on was merged into Linux in 2005.)
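The credit rule is simple enough to simulate. The following Python sketch is a toy model under loud assumptions: one sender, one receiver, one credit per packet-sized buffer, and credits returned instantly. Real links do this bookkeeping in hardware and differ in the details. `CreditLink` and `transfer` are names invented for this example.

```python
from collections import deque


class CreditLink:
    """Toy model of one direction of a credit-flow-controlled link."""

    def __init__(self, receiver_buffers: int) -> None:
        if receiver_buffers < 1:
            raise ValueError("receiver needs at least one buffer")
        self.capacity = receiver_buffers
        self.credits = receiver_buffers  # sender's view of free receiver slots
        self.rx_queue: deque = deque()

    def try_send(self, packet) -> bool:
        """Transmit only if a credit is available; never drop."""
        if self.credits == 0:
            return False  # sender holds the packet and retries later
        self.credits -= 1
        self.rx_queue.append(packet)
        # The invariant that makes the link lossless in this model:
        assert len(self.rx_queue) <= self.capacity
        return True

    def receiver_drain(self):
        """Receiver consumes one packet and hands a credit back."""
        packet = self.rx_queue.popleft()
        self.credits += 1
        return packet


def transfer(packets, receiver_buffers: int):
    """Push all packets through the link; return (delivered, stall count)."""
    link = CreditLink(receiver_buffers)
    pending = deque(packets)
    delivered, stalls = [], 0
    while pending:
        if link.try_send(pending[0]):
            pending.popleft()
        else:
            stalls += 1  # out of credits: wait for the receiver to free a slot
            delivered.append(link.receiver_drain())
    while link.rx_queue:
        delivered.append(link.receiver_drain())
    return delivered, stalls


if __name__ == "__main__":
    delivered, stalls = transfer(range(10), receiver_buffers=3)
    print(delivered, stalls)
```

The sender stalls instead of overrunning the receiver, so congestion shows up as back-pressure rather than as loss. That trade is what lets the transport above skip drop recovery in the fast path.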

InfiniBand then lost almost every battle it entered. PCIe won the device I/O bus. Ethernet held general networking. Fibre Channel held storage. The main place InfiniBand survived and thrived was high-performance computing: fluid dynamics, molecular dynamics, climate models. At that kind of scale and coupling, interconnect latency is a direct ceiling on how fast the simulation runs, and people would pay for a dedicated fabric to push that ceiling up.

The founding consortium members mostly lost interest as the empire shrank to a single niche they were not primarily in the business of serving. The company that remained was Mellanox, founded in 1999 specifically to build InfiniBand silicon, which ended up dominating the InfiniBand NIC and...
