CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?
By: Shuang Ma, Yuyi Li, Yihan Zhang, Danyang Chen, Shuyang Ji, Ziming Mao, Cheng Ji, Ansha Prashanth, Wenting Yang, Yiran Wang, Chihan Cui, Peiyu Lin, Amanda Raybuck, Ion Stoica, Yang Zhou.
Date: June 9, 2026
GPU communication is a critical component of large-scale LLM training and inference, yet its complexity makes it challenging for code-generation models. We present CommBench, a benchmark of 100+ GPU communication problems paired with reference solutions (collectively called examples). The examples cover industry-level multi-device communication use cases drawn from UCCL's development experience, and span point-to-point, collective, expert-parallel, compute-communication fusion, and utility functions.
These examples are either hand-written by GPU communication experts or distilled from production codebases such as Mscclpp, NCCL, NVSHMEM, DeepEP, ThunderKittens, vLLM, and SGLang. We then evaluate leading closed and open models under a cheat-resistant harness on real hardware spanning intra-node NVLink and inter-node RDMA, and present case studies of where and why they succeed or break down. As future work, we plan to post-train LLMs on these datasets to close this gap.
CommBench open-source: uccl-project/CommBench (MIT license).
Why Writing GPU Communication Code Matters, and Why It Remains Challenging for LLMs
Communication and compute-communication fusion are essential for scaling modern LLM training and inference. In production training, communication can consume 43.6% of the forward pass [1]; in MoE inference with wide expert parallelism, inter-device communication accounts for up to 47% of total execution time [2]. Getting this code both correct and fast is a requirement, not a nice-to-have.
Demand for customized GPU communication and compute-communication fusion is growing rapidly. Established libraries like NCCL offer comprehensive interfaces but optimize for generality over frontier performance, so companies often maintain in-house GPU communication and computation stacks for tighter control and optimization. The area is also changing quickly: new hardware and new LLM architectures keep introducing requirements for higher performance and specialized workloads, while communication abstractions are still in flux:
Modern GPUs are extremely powerful and expensive, which motivates highly customized kernels and tighter compute-communication fusion to maximize hardware utilization across architectures such as Hopper, Blackwell, and AMD GPUs. As GPUs become faster, communication increasingly needs to be initiated directly from the GPU instead of going through the CPU-mediated execution paths used in traditional libraries such as NCCL.
New LLM architectures, such as MoE with expert parallelism, introduce increasingly irregular and fine-grained communication patterns that existing collective libraries do not support well.
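To make "irregular" concrete, here is a minimal host-only sketch of the traffic one source rank generates during an expert-parallel dispatch. Everything in it is an illustrative assumption: `dispatch_counts` is a hypothetical helper, the router is replaced by uniform random sampling, and a token routed to two experts on the same rank is counted twice (real implementations may deduplicate). It only shows why per-destination send sizes are data-dependent rather than fixed, as they would be in a regular collective.

```python
import random
from collections import Counter


def dispatch_counts(num_tokens, num_experts, num_ranks, top_k, seed=0):
    """Count token copies one source rank sends to each destination rank.

    Hypothetical model: experts are placed contiguously, so expert e lives
    on rank e // (num_experts // num_ranks).
    """
    assert num_experts % num_ranks == 0
    experts_per_rank = num_experts // num_ranks
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_tokens):
        # Stand-in for a learned router: pick top_k distinct experts at random.
        for expert in rng.sample(range(num_experts), top_k):
            counts[expert // experts_per_rank] += 1
    return [counts[rank] for rank in range(num_ranks)]


# Two batches of the same shape produce different per-rank send sizes.
batch_a = dispatch_counts(num_tokens=1024, num_experts=64, num_ranks=8, top_k=4, seed=0)
batch_b = dispatch_counts(num_tokens=1024, num_experts=64, num_ranks=8, top_k=4, seed=1)
print(batch_a)
print(batch_b)
```

The total volume is fixed (`num_tokens * top_k` copies), but how it splits across destinations is only known after routing, so buffer sizes and transfer schedules cannot be precomputed the way they can for a fixed-size AllReduce.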
Multi-device GPU programming is inherently harder than single-device coding, for three reasons:
It demands niche expertise, requiring deep knowledge of both GPU kernels and networking.
It requires coordinating many devices over failure-prone interconnects, which is intrinsically difficult.
It lacks data, as practical, faithful datasets for GPU communication are largely missing.
Despite all this, multi-device GPU programming has been largely overlooked in LLM coding benchmarks. HumanEval, MBPP, and LiveCodeBench all measure single-device reasoning. No existing benchmark evaluates whether a model can generate correct GPU communication code, including communication primitives (e.g., Mscclpp channels and collective interfaces) and compute-communication fusion kernels (e.g., fused AllGather+GEMM across NVLink and InfiniBand).
Benchmark and Framework Structure
Benchmark Structure
The dataset is organized as a list of 100+ independently runnable examples. Some implement complete, ready-to-use functionality (e.g., P2P/collective interfaces, or MoE expert-parallel dispatch and combine); others are reusable communication building blocks (e.g., Mscclpp channels). Drawing on our hands-on experience from the UCCL project, we manually assign each example one of three difficulty levels: Easy, Medium, or Hard.
We hand-wrote some examples on top of base libraries and extracted others from production-grade communication and LLM-serving frameworks. By function, they fall into:
P2P — point-to-point transfer between a pair of devices.
Collective — group operations across all ranks (AllReduce, AllGather, All-to-All, …).
EP — dynamic, non-uniform dispatch/combine traffic for MoE models.
Fusion — kernels that interleave communication with compute (e.g., AllGather+GEMM).
Utilities — supporting components such as connection setup, buffer registration, and topology queries.
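For readers less familiar with the collective names above, the following is a minimal single-process sketch of what each operation is expected to produce, modeling every rank's buffer as a plain Python list. The function names are hypothetical and this is not CommBench's harness or any library's API; it is only a reference model of the data movement, under the assumption of sum as the reduction operator.

```python
def all_reduce(bufs):
    """Every rank ends with the element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]


def all_gather(bufs):
    """Every rank ends with all ranks' buffers concatenated in rank order."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]


def all_to_all(bufs):
    """Rank src's chunk dst goes to rank dst: a transpose of chunks.

    Each rank's buffer must hold exactly one chunk per rank.
    """
    n = len(bufs)
    assert all(len(buf) == n for buf in bufs)
    return [[bufs[src][dst] for src in range(n)] for dst in range(n)]


# Two ranks, two elements (or chunks) each.
ranks = [[1, 2], [10, 20]]
reduced = all_reduce(ranks)     # [[11, 22], [11, 22]]
gathered = all_gather(ranks)    # [[1, 2, 10, 20], [1, 2, 10, 20]]
exchanged = all_to_all(ranks)   # [[1, 10], [2, 20]]
```

A host-side model like this is the kind of oracle a generated multi-GPU kernel's output could be compared against; the difficulty lies in producing the same result across devices and interconnects, quickly and without races.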
By source, they span cuda-runtime,...