UCCL-EP: DeepEP-style expert parallelism on any NIC, no GPU-initiated comms

UCCL-EP: An expert parallel communications kernel without owning the NIC

12 Jun 2026 · 12 min read · Cover: "Office of the American Telegraph Company, Corner of Broadway and Liberty Street, New York - Telegrams for Europe," from Harper's Weekly (1866), via the Library of Congress.

In the last post we looked at how expert parallel communications kernels work. This was a story of how the original DeepEP library from DeepSeek was organized. That library relies on GPU-initiated communication: the GPU has to be able to tell the NIC directly what to transfer and when.

The primitives that library introduces are sufficiently general and powerful that others have built on them to expand support across NICs and across GPU types. This is the story of UCCL, specifically UCCL-EP, which takes DeepEP-style communication patterns and makes them work for arbitrary NIC-accelerator pairs.

We’re interested in heterogeneous hardware here at Doubleword. We want the most tokens for the lowest price, regardless of which hardware produces them. (See our earlier posts on bringing up Deepseek-v4 Flash on MI300x.) Doubleword was recently named as one of six companies in the first wave of UK Sovereign AI investments (see "Our first investments", UK Sovereign AI), which has given us access to an allocation on the AI Research Resource, including time on Isambard-AI, the UK’s national AI supercomputing facility.

Isambard-AI is a great facility. But its chips are connected with the HPE Slingshot interconnect. No GPU-initiated communication, no DeepEP. (Also no UCCL, yet; we’re working on it.) The inference shapes that high-performance point-to-point EP kernels like DeepEP permit (Two-Batch Overlap, WideEP) are crucial for us to maximise the intelligence we can provide per pound spent.

This post is about how UCCL-EP gets expert parallelism to work across arbitrary interconnects.

The DeepEP contract

The fast parallel structure that DeepEP builds on relies on the existence of a few simple remote communication primitives:

A one-sided write: put these bytes at that address on that rank. The receiver doesn’t post a matching receive or run any code to accept the data (contrast two-sided send/recv, where both ends participate in every transfer): it’s a GPU in the middle of its own kernel, and its half of the protocol is to poll local memory until the bytes arrive.

An ordered signal: an atomic add into a known slot, telling the receiver the data has arrived. The signal must be guaranteed to land after the data it announces, so there’s a strict ordering requirement.

A quiet: confirmation on the sender side that all of its writes have completed, needed before it can reuse a source buffer or signal completion to anyone else.
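The three primitives can be sketched as a toy model in plain Python. This is an illustration of the contract only, not NVSHMEM's or UCCL's API: `ToyRank`, `ToySender`, and `deliver_one` are hypothetical names, and the "network" is assumed to deliver operations strictly in the order they were issued.

```python
from collections import deque


class ToyRank:
    """Hypothetical stand-in for one GPU: a flat buffer plus signal slots."""

    def __init__(self, size, n_slots):
        self.memory = bytearray(size)
        self.signals = [0] * n_slots


class ToySender:
    """Models the three primitives; nothing here is a real library API."""

    def __init__(self):
        self.in_flight = deque()  # operations issued but not yet delivered

    def put(self, dst, addr, payload):
        # One-sided write: the receiver takes no part in the transfer.
        self.in_flight.append(("write", dst, addr, bytes(payload)))

    def signal(self, dst, slot, value=1):
        # Ordered signal: queued behind every earlier write.
        self.in_flight.append(("add", dst, slot, value))

    def deliver_one(self):
        # Assumption: the network delivers operations in issue order.
        kind, dst, where, what = self.in_flight.popleft()
        if kind == "write":
            dst.memory[where:where + len(what)] = what
        else:
            dst.signals[where] += what

    def quiet(self):
        # Quiet: return only once every outstanding operation has landed.
        while self.in_flight:
            self.deliver_one()


receiver = ToyRank(size=16, n_slots=1)
sender = ToySender()
sender.put(receiver, addr=4, payload=b"tokens")
sender.signal(receiver, slot=0)

# Deliver only the first operation: the write lands, the signal has not.
sender.deliver_one()
seen_before_signal = receiver.signals[0]

# After quiet, both the data and the signal are visible on the receiver.
sender.quiet()
received = bytes(receiver.memory[4:10])
```

Because the signal is queued behind the write, a receiver that polls `signals[0]` and sees it become non-zero can safely read the payload, which is the ordering guarantee the contract asks for.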

This contract is NVSHMEM’s device API: put_nbi for the write, amo_nonfetch_add for the signal, quiet for the fence. DeepEP calls these functions from NVSHMEM. The problem is that:

On the accelerator side, NVSHMEM is NVIDIA only.

On the NIC side, IBGDA (InfiniBand GPUDirect Async, Mellanox/NVIDIA’s name for GPU-initiated networking), the mechanism that lets the GPU satisfy this contract itself, only works on NVIDIA NICs.

UCCL bridges the gaps by implementing the exact same contract, providing NVSHMEM’s IBGDA primitives on arbitrary NICs and GPUs. (UCCL’s device-side shim exports nvshmemi_ibgda_put_nbi_warp, nvshmemi_ibgda_amo_nonfetch_add, and nvshmemi_ibgda_quiet, so DeepEP’s kernels compile against it largely unchanged.)

How DeepEP does it

An RDMA NIC is driven through queues in memory. To send something, a process writes a work queue entry (a small descriptor carrying the opcode, source address, destination address, and length) into a queue pair on the NIC, then writes to a doorbell register on the NIC to say there is work to do. (A queue pair is RDMA’s connection object: a send queue and a receive queue through which a process hands descriptors to the NIC, executed in order, with completions reported to a companion completion queue.) The NIC reads the descriptor, moves the bytes, and posts a completion into a completion queue. Ordinarily the process driving the NIC runs on the CPU.
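The post, doorbell, completion cycle can be mimicked in a few lines of Python. This is a toy model rather than real verbs code: `WorkQueueEntry`, `ToyNic`, `post`, and `ring_doorbell` are hypothetical names, and "remote memory" is just another local `bytearray`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkQueueEntry:
    opcode: str    # only "WRITE" is modelled here
    src_addr: int
    dst_addr: int
    length: int


class ToyNic:
    """Hypothetical NIC with one send queue and one completion queue."""

    def __init__(self, local_mem, remote_mem):
        self.local_mem = local_mem
        self.remote_mem = remote_mem
        self.send_queue = deque()
        self.completion_queue = deque()

    def post(self, wqe):
        # Step 1: the process writes a descriptor into the send queue.
        self.send_queue.append(wqe)

    def ring_doorbell(self):
        # Step 2: the doorbell tells the NIC there is work to do.
        # The NIC drains the queue in order, moving bytes per descriptor.
        while self.send_queue:
            wqe = self.send_queue.popleft()
            data = self.local_mem[wqe.src_addr:wqe.src_addr + wqe.length]
            self.remote_mem[wqe.dst_addr:wqe.dst_addr + wqe.length] = data
            # Step 3: report the finished descriptor as a completion.
            self.completion_queue.append(wqe)


local = bytearray(b"hello, expert!!!")
remote = bytearray(16)
nic = ToyNic(local, remote)

nic.post(WorkQueueEntry("WRITE", src_addr=0, dst_addr=8, length=5))
untouched_before_doorbell = bytes(remote) == bytes(16)  # nothing moves yet

nic.ring_doorbell()
```

Posting a descriptor moves no data on its own in this model. The transfer happens only once the doorbell is rung, which is why whoever can write the doorbell register is the one driving the NIC.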

IBGDA moves the whole arrangement onto the GPU. The queue pair and completion queue are allocated in GPU memory, and the NIC’s doorbell register is mapped into the GPU’s address space. A warp inside the dispatch kernel builds the work queue entry itself, issues a memory fence, and writes the doorbell over PCIe. The NIC then pulls the payload directly out of HBM and sends it. (This goes via GPUDirect RDMA: the NIC DMAs to and from GPU memory without staging through host RAM. It needs the nvidia_peermem kernel module, or dmabuf, so the NIC can get at GPU pages.)

This satisfies the contract above trivially. The one-sided write is a write descriptor. The signal is an atomic-add descriptor posted to the same queue pair, and because a queue pair executes its descriptors in order,...
