UCCL-EP: An expert parallel communications kernel without owning the NIC
12 Jun 2026 · 12 min read
Cover: "Office of the American Telegraph Company, Corner of Broadway and Liberty Street, New York - Telegrams for Europe," from Harper's Weekly (1866), via the Library of Congress.
In the last post we looked at how expert parallel communications kernels work. This was a story of how the original DeepEP library from DeepSeek was organized. That library relies on GPU-initiated communication: the GPU has to be able to tell the NIC directly what to transfer and when.
The primitives that library introduces are sufficiently general and powerful that others have built on them to expand support across NICs and across GPU types. This is the story of UCCL, specifically UCCL-EP, which takes DeepEP-style communication patterns and makes them work for arbitrary NIC-accelerator pairs.
We’re interested in heterogeneous hardware here at Doubleword. We want the most tokens for the lowest price, regardless of what makes them (see our earlier posts on bringing up DeepSeek-v4 Flash on MI300X). Doubleword was recently named as one of six companies in the first wave of UK Sovereign AI investments (see “Our first investments”, UK Sovereign AI), which has given us access to an allocation on the AI Research Resource, including time on Isambard-AI, the UK’s national AI supercomputing facility.
Isambard-AI is a great facility. But its chips are connected with the HPE Slingshot interconnect. No GPU-initiated communication, no DeepEP (and no UCCL yet, though we’re working on it). The inference shapes that high-performance point-to-point EP kernels like DeepEP permit (Two-Batch Overlap, WideEP) are crucial for us to maximise the intelligence we can provide per pound spent.
This post is about how UCCL-EP gets expert parallelism to work across arbitrary interconnects.
The DeepEP contract
The fast parallel structure that DeepEP builds on relies on the existence of a few simple remote communication primitives:
A one-sided write: put these bytes at that address on that rank. The receiver doesn’t post a matching receive or run any code to accept the data: it’s a GPU in the middle of its own kernel, and its half of the protocol is to poll local memory until the bytes arrive. (Contrast two-sided send/recv, where both ends participate in every transfer.)
An ordered signal: an atomic add into a known slot, telling the receiver the data has arrived. The signal must land only after the data it announces has landed, so there’s a strict ordering requirement.
A quiet: confirmation on the sender side that all of its writes have completed, needed before it can reuse a source buffer or signal completion to anyone else.
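The three primitives above can be sketched as a toy, single-process model. This is not NVSHMEM or DeepEP code: the `Rank` and `Channel` classes, the `progress` helper, and the assumption of a single in-order channel per sender are all hypothetical, chosen only to show why the ordering rule lets a receiver trust the signal.

```python
from collections import deque


class Rank:
    """A receiver: plain memory plus signal slots. It runs no receive code."""

    def __init__(self, size, n_slots):
        self.memory = bytearray(size)
        self.signals = [0] * n_slots


class Channel:
    """Hypothetical in-order transport from one sender to one receiver."""

    def __init__(self, remote):
        self.remote = remote
        self.in_flight = deque()  # operations issued but not yet delivered

    def put_nbi(self, offset, data):
        # One-sided write: queue the bytes and return without waiting.
        self.in_flight.append(("write", offset, bytes(data)))

    def signal_add(self, slot, value):
        # Ordered signal: queued behind every earlier write on this channel.
        self.in_flight.append(("add", slot, value))

    def progress(self, n=1):
        # Toy stand-in for the network: deliver up to n operations, FIFO.
        for _ in range(min(n, len(self.in_flight))):
            kind, where, what = self.in_flight.popleft()
            if kind == "write":
                self.remote.memory[where:where + len(what)] = what
            else:
                self.remote.signals[where] += what

    def quiet(self):
        # Sender-side fence: return only once nothing is left in flight.
        self.progress(len(self.in_flight))


receiver = Rank(size=16, n_slots=1)
channel = Channel(receiver)
channel.put_nbi(4, b"tok0")   # data first
channel.signal_add(0, 1)      # then the signal that announces it
channel.quiet()               # after this the source buffer may be reused
assert receiver.signals[0] == 1 and bytes(receiver.memory[4:8]) == b"tok0"
```

Because delivery is FIFO in this model, a receiver that polls `signals[0]` and sees a non-zero value can read the payload safely; a transport that reordered the two operations would break that guarantee.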
This contract is NVSHMEM’s device API: put_nbi for the write, amo_nonfetch_add for the signal, quiet for the fence. DeepEP calls these functions from NVSHMEM. The problem is that:
On the accelerator side, NVSHMEM is NVIDIA only.
On the NIC side, IBGDA (InfiniBand GPUDirect Async, Mellanox/NVIDIA’s name for GPU-initiated networking), the mechanism that lets the GPU satisfy this contract itself, only works on NVIDIA NICs.
UCCL bridges the gaps by implementing the exact same contract, providing NVSHMEM’s IBGDA primitives on arbitrary NICs and GPUs. UCCL’s device-side shim exports nvshmemi_ibgda_put_nbi_warp, nvshmemi_ibgda_amo_nonfetch_add, and nvshmemi_ibgda_quiet, so DeepEP’s kernels compile against it largely unchanged.
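The idea of keeping the caller fixed and swapping what sits behind the contract can be illustrated with a small sketch. Everything here is hypothetical: the `EpTransport` protocol, its Python signatures, `send_tokens`, and `RecordingTransport` are illustrative names, not UCCL’s or NVSHMEM’s API, and the sketch says nothing about how UCCL implements the primitives internally.

```python
from typing import Protocol


class EpTransport(Protocol):
    """Hypothetical Python rendering of the three-call contract."""

    def put_nbi(self, dst_rank: int, dst_offset: int, data: bytes) -> None: ...
    def amo_nonfetch_add(self, dst_rank: int, slot: int, value: int) -> None: ...
    def quiet(self) -> None: ...


def send_tokens(transport: EpTransport, dst_rank: int, payload: bytes) -> None:
    """Caller written only against the contract: write, signal, then fence."""
    transport.put_nbi(dst_rank, 0, payload)
    transport.amo_nonfetch_add(dst_rank, 0, 1)
    transport.quiet()


class RecordingTransport:
    """Stand-in backend that only records the calls it receives."""

    def __init__(self) -> None:
        self.calls = []

    def put_nbi(self, dst_rank, dst_offset, data):
        self.calls.append(("put_nbi", dst_rank, dst_offset, bytes(data)))

    def amo_nonfetch_add(self, dst_rank, slot, value):
        self.calls.append(("amo_nonfetch_add", dst_rank, slot, value))

    def quiet(self):
        self.calls.append(("quiet",))


backend = RecordingTransport()
send_tokens(backend, dst_rank=3, payload=b"abc")
assert [call[0] for call in backend.calls] == ["put_nbi", "amo_nonfetch_add", "quiet"]
```

Any object with the same three methods could replace `RecordingTransport` without touching `send_tokens`, which is the property the shim relies on at the symbol level.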
How DeepEP does it
An RDMA NIC is driven through queues in memory. To send something, a process writes a work queue entry, a small descriptor carrying the opcode, source address, destination address, and length, into a queue pair on the NIC, then writes to a doorbell register on the NIC to say there is work to do. (A queue pair is RDMA’s connection object: a send queue and a receive queue through which a process hands descriptors to the NIC, executed in order, with completions reported to a companion completion queue.) The NIC reads the descriptor, moves the bytes, and posts a completion into a completion queue. Ordinarily the process driving the NIC runs on the CPU.
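The descriptor, doorbell, and completion flow described above can be modelled in a few lines. This is a simplified sketch, not the verbs API or any real NIC’s descriptor layout: `WorkQueueEntry`, `ToyNIC`, and their fields are hypothetical, and both "local" and "remote" memory are plain byte arrays in one process.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkQueueEntry:
    opcode: str     # only "WRITE" is modelled here
    src_addr: int   # offset into local memory
    dst_addr: int   # offset into remote memory
    length: int     # number of bytes to move


class ToyNIC:
    """Hypothetical NIC: a send queue, a doorbell, and a completion queue."""

    def __init__(self, local_mem, remote_mem):
        self.local_mem = local_mem
        self.remote_mem = remote_mem
        self.send_queue = []              # descriptors in posting order
        self.consumed = 0                 # index of the next descriptor to run
        self.completion_queue = deque()

    def post(self, wqe):
        """Process side: write a descriptor. No bytes move yet."""
        self.send_queue.append(wqe)
        return len(self.send_queue)       # producer index to ring with

    def ring_doorbell(self, producer_index):
        """NIC side: run every posted descriptor up to the index, in order."""
        while self.consumed < producer_index:
            wqe = self.send_queue[self.consumed]
            if wqe.opcode != "WRITE":
                raise ValueError(f"unsupported opcode: {wqe.opcode}")
            payload = self.local_mem[wqe.src_addr:wqe.src_addr + wqe.length]
            self.remote_mem[wqe.dst_addr:wqe.dst_addr + wqe.length] = payload
            self.completion_queue.append(self.consumed)  # report completion
            self.consumed += 1


local = bytearray(b"hello world")
remote = bytearray(16)
nic = ToyNIC(local, remote)
index = nic.post(WorkQueueEntry("WRITE", src_addr=6, dst_addr=0, length=5))
assert bytes(remote[:5]) == bytes(5)      # posted, but the doorbell is not rung
nic.ring_doorbell(index)
assert bytes(remote[:5]) == b"world"
```

The split between `post` and `ring_doorbell` is the point of the sketch: whoever can write the descriptor and ring the doorbell is the one driving the NIC.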
IBGDA moves the whole arrangement onto the GPU. The queue pair and completion queue are allocated in GPU memory, and the NIC’s doorbell register is mapped into the GPU’s address space. A warp inside the dispatch kernel builds the work queue entry itself, issues a memory fence, and writes the doorbell over PCIe. The NIC then pulls the payload directly out of HBM and sends it. (This is GPUDirect RDMA: the NIC DMAs to and from GPU memory without staging through host RAM. It needs the nvidia_peermem kernel module, or dmabuf, so the NIC can get at GPU pages.)
This satisfies the contract above trivially. The one-sided write is a write descriptor. The signal is an atomic-add descriptor posted to the same queue pair, and because a queue pair executes its descriptors in order,...