Anatomy of a high-performance EP kernel

10 Jun 2026 · 19 min read

Cover: "Estación telefónica central en Paris," from El mundo físico (1882), via Wikimedia Commons.

Large language models are large. Because they’re large, we need lots of GPUs to run them. It would be nice if LLM inference were ‘embarrassingly parallel’ and we could just always compute independent things on different GPUs. But alas, to use lots of GPUs on LLM inference, we need to get those GPUs talking to one another.

There are lots of different ways to get different GPUs working together: Tensor Parallelism, Pipeline Parallelism, Context Parallelism, Expert Parallelism, etc. All have their place. But for MoE models, in the MoE layers, when you want to serve at large scale, ‘wide Expert Parallelism’ (wideEP) is king. (See vLLM’s original DeepSeek large-scale serving post for a demonstration at production scale: DeepSeek at 2.2k tokens/s per GPU on an H200 cluster, served with wideEP and data parallel attention.)

The other kinds of parallelism all require communication between GPUs, but their patterns are fixed by the architecture: who sends, who receives, and how much, are all known before the forward pass begins, and are the same on every step. The comms can run as standard collectives.

Expert parallelism is different. Which tokens need to reach which GPUs is decided by the router, from the data, at runtime, fresh in every MoE layer. And the tokens have somewhere to be reached from: we’ll assume the ‘data parallel attention’ arrangement DeepSeek serves with, where each token lives on exactly one rank (a rank being one GPU somewhere in our cluster). The experts are spread across those same ranks, so a token and the experts it’s routed to will generally not be in the same place. Here’s an example, with 8 GPUs split across 2 nodes, two experts per GPU, 1 token per rank, and 2 routed experts per token:

[Interactive figure: dispatch, experts, combine. GPUs 0 to 7 across node 0 and node 1, experts 0 to 15, and one token per rank (r0 to r7). Traffic within a node goes over NVLink; traffic crossing between nodes goes over RDMA.]

Four of the sixteen experts drew no tokens at all this step: the routing is lumpy.

When it comes time to run our MoE layers, our tokens have to go and meet their experts, wherever they might be in the network fabric. It’s the job of the EP communication kernel to make that happen.
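To make the example topology concrete, here is a minimal sketch of how a token's route could be classified by the link it crosses. It assumes a contiguous layout that the post does not spell out: experts `2r` and `2r+1` live on rank `r`, and ranks 0 to 3 sit on node 0 with ranks 4 to 7 on node 1. The function names are hypothetical.

```python
# Assumed layout (not stated in the post): experts are placed contiguously,
# ranks 0-3 sit on node 0 and ranks 4-7 on node 1.
NUM_RANKS = 8
RANKS_PER_NODE = 4
EXPERTS_PER_RANK = 2


def expert_rank(expert: int) -> int:
    """Rank (GPU) that hosts the given expert under contiguous placement."""
    return expert // EXPERTS_PER_RANK


def rank_node(rank: int) -> int:
    """Node that hosts the given rank."""
    return rank // RANKS_PER_NODE


def transport(src_rank: int, expert: int) -> str:
    """Classify the link a token on src_rank crosses to reach an expert."""
    dst_rank = expert_rank(expert)
    if dst_rank == src_rank:
        return "local"
    if rank_node(dst_rank) == rank_node(src_rank):
        return "nvlink"
    return "rdma"


# A token on rank 1 routed to experts 3 and 13:
# expert 3 lives on rank 1 (local), expert 13 on rank 6 (the other node).
routes = [transport(1, e) for e in (3, 13)]
print(routes)  # ['local', 'rdma']
```

Under this layout a single token can need no transfer for one of its experts and an RDMA hop for the other, which is why the kernel has to handle each transport path.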

The modern shape of these kernels was set by DeepSeek’s DeepEP library. In this post we’ll build up the anatomy of a DeepEP-style dispatch and combine kernel: the high-throughput shape first, then the low-latency one.

The job we have to do

Let’s make the setup concrete. We have 8 GPUs, split across 2 nodes, connected with RDMA, and each data parallel rank owns a single GPU. Attention runs on each GPU over a batch of $B_i$ tokens, where $B_i$ can vary between GPUs. We’re doing expert parallel with $E=16$ experts, two per GPU, of which $K=2$ are routed for each token.

At each rank $r_i$, at the entrance to the EP layer, we have a tensor of shape $(B_i, H)$, where $H$ is the hidden size. The routing layer will run locally, and give us expert assignments for each token. We’re routing 2-out-of-16: for each token, the router gives us a set of logits of length $16$ (i.e. a tensor of shape $(B_i, 16)$), from which we’ll take the indices of the top 2, to get a tensor of shape $(B_i, 2)$. For example, if token $k$ is routed to experts $3$ and $13$, then row $k$ will be $[3, 13]$.

So at the entrance to the EP layer each rank holds two things: the activation rows it produced, and, after the local routing pass, the top-2 expert assignment for each of those rows.
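The top-2 selection for a single token can be sketched in a few lines of plain Python. The helper name `top_k_experts` and the logit values are illustrative assumptions; in a tensor framework this would typically be a single top-k call over the whole $(B_i, 16)$ tensor.

```python
def top_k_experts(logits: list[float], k: int = 2) -> list[int]:
    """Return the indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    return order[:k]


# Router logits for one token over E = 16 experts (illustrative values).
logits_t0 = [-1.2, 0.3, -0.4, 2.1, 0.1, -0.8, 0.5, -0.2,
             0.9, -0.6, 0.2, -1.0, 0.4, 1.7, -0.3, 0.6]

row = top_k_experts(logits_t0)
print(row)  # [3, 13]
```

Stacking one such row per token gives the $(B_i, 2)$ assignment tensor described above.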

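[Figure: on each rank, the activations $(B_i, H)$ pass through the router ($W_r x$, $H \to E$) to give expert logits $(B_i, E=16)$; the top 2 per row give the assignment $(B_i, K=2)$. Example rows, with logits listed for experts 0 to 15:]

t0: logits [−1.2, 0.3, −0.4, 2.1, 0.1, −0.8, 0.5, −0.2, 0.9, −0.6, 0.2, −1.0, 0.4, 1.7, −0.3, 0.6], assignment [3, 13]
t1: logits [2.4, 0.2, −0.5, 0.9, −1.1, 0.3, 1.9, −0.2, 0.7, −0.9, 0.1, 0.5, −0.7, 0.2, −0.1, −0.4], assignment [0, 6]
t2: logits [0.1, 2.2, −0.3, 0.4, −0.9, 0.2, −0.6, 0.8, 0.3, 1.8, −0.2, 0.6, −1.1, 0.1, 0.5, −0.8], assignment [1, 9]
t3: logits [−0.6, 0.3, 2.0, 0.1, −0.2, 0.7, −1.0, 0.4, −0.5, 0.2, 0.8, −0.3, 0.6, 1.6, −0.7, 0.1], assignment [2, 13]

Not all of the experts live locally. Some live next door, on neighbouring NVLink peers, and some live far away, on nodes reachable only over RDMA. The goal of the expert parallelism kernels is to get the activations where they need to go, run the expert GEMMs when they get there, and then bring them back home.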
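The round trip (dispatch, expert compute, combine) can be simulated in a single process to show the bookkeeping involved. This is a toy sketch under stated assumptions, not DeepEP's implementation: experts are placed contiguously (experts `2r` and `2r+1` on rank `r`), each "expert" just scales its input by `expert + 1` in place of a GEMM, and combine is an unweighted sum, whereas real MoE layers typically weight each expert's output by its router score. All function names are hypothetical.

```python
from collections import defaultdict

EXPERTS_PER_RANK = 2  # assumed contiguous placement: experts 2r, 2r+1 on rank r


def dispatch(assignments):
    """Group (token, expert) pairs by the rank that hosts the expert."""
    outbox = defaultdict(list)
    for token, experts in enumerate(assignments):
        for expert in experts:
            outbox[expert // EXPERTS_PER_RANK].append((token, expert))
    return dict(outbox)


def run_experts(outbox, activations):
    """Toy expert: scale the row by (expert + 1). Stands in for the GEMMs."""
    results = []
    for items in outbox.values():
        for token, expert in items:
            results.append((token, [x * (expert + 1) for x in activations[token]]))
    return results


def combine(results, num_tokens, hidden):
    """Sum each token's expert outputs back into its home row (unweighted)."""
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for token, vec in results:
        for j, x in enumerate(vec):
            out[token][j] += x
    return out


# Four tokens on one rank with H = 2, and their top-2 expert assignments.
activations = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
assignments = [[3, 13], [0, 6], [1, 9], [2, 13]]

outbox = dispatch(assignments)
send_counts = {rank: len(items) for rank, items in sorted(outbox.items())}
print(send_counts)  # {0: 2, 1: 2, 3: 1, 4: 1, 6: 2}

combined = combine(run_experts(outbox, activations), len(activations), 2)
print(combined[0])  # [18.0, 0.0]
```

Even in this tiny batch the send counts are uneven and three ranks receive nothing, so buffer sizes cannot be fixed from the architecture alone.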

We’re doing communications here, and with communications it’s handy to specialise on what we care about most: throughput, or latency. The split maps onto the two phases of inference: prefill brings big, compute-bound batches with plenty of other work to hide communication behind, while at decode there is hardly anything else to do, so the transfer itself is what we wait on. We’ll start with the...
