Anatomy of a high-performance EP kernel
10 Jun 2026 · 19 min read<br>Cover: "Estación telefónica central en Paris," from El mundo físico (1882), via Wikimedia Commons.
Large language models are large. Because they’re large, we need lots of GPUs to<br>run them. It would be nice if LLM inference were ‘embarrassingly parallel’ and<br>we could just always compute independent things on different GPUs. But alas, to<br>use lots of GPUs on LLM inference, we need to get those GPUs talking to one<br>another.
There are lots of different ways to get different GPUs working together: Tensor<br>Parallelism, Pipeline Parallelism, Context Parallelism, Expert Parallelism,<br>etc. All have their place. But for MoE models, in the MoE layers, when you want<br>to serve at large scale, ‘wide Expert Parallelism’ (wideEP) is king. (See vLLM’s original DeepSeek large-scale serving<br>post for a demonstration<br>at production scale: DeepSeek at 2.2k tokens/s per GPU on an H200 cluster,<br>served with wideEP and data parallel attention.)
The other kinds of parallelism all require communication between GPUs, but<br>their patterns are fixed by the architecture: who sends, who receives, and how<br>much, are all known before the forward pass begins, and are the same on every<br>step. The comms can run as standard collectives.
Expert parallelism is different. Which tokens need to reach which GPUs is<br>decided by the router, from the data, at runtime, fresh in every MoE layer.<br>And the tokens have somewhere to be reached from: we’ll assume the ‘data<br>parallel attention’ arrangement DeepSeek serves with, where each token lives<br>on exactly one rank (a rank being one GPU somewhere in our cluster). The<br>experts are spread across those same ranks, so a token and the experts it’s<br>routed to will generally not be in the same place. Here’s an example, with 8<br>GPUs split across 2 nodes, two experts per GPU, 1 token per rank, and 2 routed<br>experts per token:
[Figure: dispatch, experts, combine. GPUs 0 to 7 sit on two nodes (node 0 and node 1) and host experts 0 to 15, two per GPU; each rank, r0 to r7, sends its token out to its routed experts and gets the results back. Traffic within a node uses NVLink; traffic crossing nodes uses RDMA.]<br>Four of the sixteen experts drew no tokens at all<br>this step: the routing is lumpy.
When it comes time to run our MoE layers, our tokens have to go and meet their<br>experts, wherever they might be in the network fabric. It’s the job of the EP<br>communication kernel to make that happen.
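To make the "where does each token have to go" question concrete, here is a minimal sketch of the 8-GPU, 2-node example in plain Python. The placement rule (experts `2r` and `2r+1` on rank `r`, ranks 0 to 3 on node 0) and the `routing` table are assumptions made up for illustration, not values read from the figure; the helper names are hypothetical too.

```python
from collections import Counter

GPUS_PER_NODE = 4      # 8 GPUs split across 2 nodes
EXPERTS_PER_RANK = 2   # 16 experts, two per GPU


def owner_rank(expert: int) -> int:
    # Assumption: contiguous placement, experts 2r and 2r+1 live on rank r.
    return expert // EXPERTS_PER_RANK


def node_of(rank: int) -> int:
    # Assumption: ranks 0-3 form node 0, ranks 4-7 form node 1.
    return rank // GPUS_PER_NODE


def link(src_rank: int, expert: int) -> str:
    """Classify the hop from a token's home rank to one of its routed experts."""
    dst = owner_rank(expert)
    if dst == src_rank:
        return "local"
    if node_of(dst) == node_of(src_rank):
        return "nvlink"
    return "rdma"


# Hypothetical routing for one step: rank -> the 2 experts its single token drew.
routing = {
    0: [3, 13],
    1: [0, 6],
    2: [1, 9],
    3: [2, 13],
    4: [8, 3],
    5: [11, 10],
    6: [12, 6],
    7: [15, 0],
}

# How many of the 16 token-to-expert hops use each kind of link.
hops = Counter(link(src, e) for src, experts in routing.items() for e in experts)

# How many tokens each expert received, and which experts received none.
tokens_per_expert = Counter(e for experts in routing.values() for e in experts)
idle_experts = [e for e in range(16) if tokens_per_expert[e] == 0]

print(dict(hops))
print("idle experts:", idle_experts)
```

With this made-up routing, 5 hops stay local, 5 go over NVLink and 6 cross nodes over RDMA, and experts 4, 5, 7 and 14 sit idle. Change one row of `routing` and the whole traffic pattern changes, which is exactly what a fixed collective cannot express.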
The modern shape of these kernels was set by DeepSeek’s<br>DeepEP library. In this post we’ll<br>build up the anatomy of a DeepEP-style dispatch and combine kernel: the<br>high-throughput shape first, then the low-latency one.
The job we have to do
Let’s make the setup concrete. We have 8 GPUs, split across 2 nodes,<br>connected with RDMA, and each data parallel rank owns a single GPU. Attention<br>runs on each GPU over a batch of $B_i$ tokens, where $B_i$ can vary between<br>GPUs. We’re doing expert parallel with $E=16$ experts, two per GPU, of which<br>$K=2$ are routed for each token.
At each rank $r_i$, at the entrance to the EP layer, we have a tensor of<br>shape $(B_i, H)$, where $H$ is the hidden size. The routing layer will run locally, and give us expert<br>assignments for each token. We’re routing 2-out-of-16: for each token,<br>the router gives us a set of logits of length $16$ (i.e. a tensor of shape<br>$(B_i, 16)$), from which we’ll take the indices of the top 2, to get a tensor<br>of shape $(B_i, 2)$. For example, if token $k$ is routed to experts $3$ and<br>$13$, then row $k$ will be $[3, 13]$.
So at the entrance to the EP layer each rank holds two things: the activation rows it produced, and, after the local routing pass, the top-2 expert assignment for each of those rows.
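The top-2 selection can be sketched in a few lines of plain Python. The four logit rows below are the example tokens t0 to t3 from the post's router figure; the `top_k` helper is a hypothetical name for illustration, and a real kernel would do this on the GPU rather than with a Python sort.

```python
def top_k(logits, k=2):
    """Return the indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    return order[:k]


# Expert logits of shape (B_i, E) = (4, 16), one row per token.
LOGITS = [
    [-1.2, 0.3, -0.4, 2.1, 0.1, -0.8, 0.5, -0.2, 0.9, -0.6, 0.2, -1.0, 0.4, 1.7, -0.3, 0.6],
    [2.4, 0.2, -0.5, 0.9, -1.1, 0.3, 1.9, -0.2, 0.7, -0.9, 0.1, 0.5, -0.7, 0.2, -0.1, -0.4],
    [0.1, 2.2, -0.3, 0.4, -0.9, 0.2, -0.6, 0.8, 0.3, 1.8, -0.2, 0.6, -1.1, 0.1, 0.5, -0.8],
    [-0.6, 0.3, 2.0, 0.1, -0.2, 0.7, -1.0, 0.4, -0.5, 0.2, 0.8, -0.3, 0.6, 1.6, -0.7, 0.1],
]

# Assignment of shape (B_i, K) = (4, 2): the routed experts for each token.
assignment = [top_k(row) for row in LOGITS]
print(assignment)  # [[3, 13], [0, 6], [1, 9], [2, 13]]
```

Row 0 comes out as `[3, 13]`, matching the worked example in the text: token 0's two largest logits are 2.1 at expert 3 and 1.7 at expert 13.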
[Figure: on each rank, the router (Wᵣ x, H→E) maps the activations (Bᵢ, H) to expert logits (Bᵢ, E=16), and the top-2 indices of each row form the assignment (Bᵢ, K=2). In the example, tokens t0 to t3 are assigned experts [3, 13], [0, 6], [1, 9] and [2, 13].]
Not all of the experts live locally. Some live next door, on neighbouring<br>NVLink peers, and some live far away, on nodes reachable only over RDMA. The<br>goal of the expert parallelism kernels is to get the<br>activations where they need to go, run the expert GEMMs when they get there,<br>and then bring them back home.
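The three steps (send the rows out, run the experts, bring the results home) can be simulated in a single process to see what data has to travel with each row. This is a toy sketch under loud assumptions, not how DeepEP is implemented: per-rank "inboxes" are Python lists, the expert is a stand-in that scales its input instead of a GEMM, experts are placed contiguously two per rank, and the combine step is an unweighted sum (real MoE layers usually weight each expert's output by its router score).

```python
from collections import defaultdict

EXPERTS_PER_RANK = 2  # assumption: experts 2r and 2r+1 live on rank r


def owner_rank(expert):
    return expert // EXPERTS_PER_RANK


def expert_fn(expert, row):
    # Stand-in for the expert GEMMs: scale the row by (expert + 1).
    return [(expert + 1) * v for v in row]


def moe_layer(tokens, assignment):
    """tokens: {rank: [row, ...]}; assignment: {rank: [[e0, e1], ...]}."""
    # Dispatch: every (token, expert) pair lands in the inbox of the rank that
    # owns the expert, tagged with its home rank and index for the trip back.
    inbox = defaultdict(list)
    for src, rows in tokens.items():
        for t, row in enumerate(rows):
            for e in assignment[src][t]:
                inbox[owner_rank(e)].append((src, t, e, row))

    # Expert compute: each rank runs its local experts over whatever arrived.
    outbox = defaultdict(list)
    for dst, messages in inbox.items():
        for src, t, e, row in messages:
            outbox[src].append((t, expert_fn(e, row)))

    # Combine: each home rank sums the K expert outputs for each of its tokens.
    out = {src: [[0.0] * len(row) for row in rows] for src, rows in tokens.items()}
    for src, messages in outbox.items():
        for t, y in messages:
            out[src][t] = [a + b for a, b in zip(out[src][t], y)]
    return out


# Two ranks holding different batch sizes (B_0 = 1, B_6 = 2), hidden size 2.
tokens = {0: [[1.0, 2.0]], 6: [[0.5, -1.0], [2.0, 0.0]]}
assignment = {0: [[3, 13]], 6: [[12, 6], [0, 15]]}
result = moe_layer(tokens, assignment)
print(result)
```

Note that each dispatched row carries `(src, t)` with it. Some bookkeeping of that kind is what lets the combine step put every expert output back in the right slot on the right rank.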
We’re doing communications here, and with communications it’s handy to<br>specialise on what we care about most: throughput, or latency. The split maps<br>onto the two phases of inference: prefill brings big, compute-bound batches<br>with plenty of other work to hide communication behind, while at decode there<br>is hardly anything else to do, so the transfer itself is what we wait on.<br>We’ll start with the...