Deploying inference endpoints with PD disaggregation on AMD GPUs


dstack is an open-source, AI-native orchestrator that works across clouds, Kubernetes clusters, on-prem fleets, hardware vendors, and frameworks. Alongside training, inference is one of the primary use cases dstack supports out of the box.

dstack recently added native support for Prefill–Decode (PD) disaggregation. It works with Shepherd Model Gateway (SMG) — a high-performance inference gateway evolved from the SGLang Router — on both NVIDIA and AMD, and with NVIDIA Dynamo on NVIDIA. This post walks through deploying it on AMD GPUs with SMG.

Why PD disaggregation

PD disaggregation is useful when a single LLM deployment has two different bottlenecks:

Prefill processes the prompt. It is compute-bound, parallelizable, and has a direct impact on Time to First Token (TTFT).

Decode generates tokens one by one. It is memory-bound, sequential, and has a direct impact on inter-token latency.

When the same worker handles both phases, every replica has to serve both bottlenecks. With PD disaggregation, prefill and decode run as separate pools, and each pool can be sized and scaled independently.

The tradeoff is operational: for every request, the KV cache produced by the prefill worker must be transferred to the decode worker before generation can continue. That transfer sits on the TTFT path, so the cluster needs a high-bandwidth, low-latency interconnect such as RDMA over InfiniBand or RoCE, rather than TCP over a conventional NIC.
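To get a feel for why the interconnect matters, here is a rough back-of-the-envelope sketch of the KV cache size for one request and the ideal time to move it. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values), the 4096-token prompt, and the two link speeds are assumptions chosen for illustration, and the estimate ignores protocol overhead, tensor-parallel sharding, and any overlap of transfer with compute.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one request: K and V, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def transfer_seconds(num_bytes, link_gbit_per_s):
    """Ideal wire time at the given line rate (ignores protocol overhead)."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


# Assumed model shape and prompt length; not measured values.
size = kv_cache_bytes(num_tokens=4096, num_layers=80, num_kv_heads=8, head_dim=128)
print(f"KV cache for a 4096-token prompt: {size / 2**30:.2f} GiB")

# Hypothetical link speeds: a conventional NIC vs. a fast RDMA-capable link.
for name, gbit in [("10 Gbit/s link", 10), ("400 Gbit/s link", 400)]:
    print(f"{name}: {transfer_seconds(size, gbit) * 1e3:.0f} ms")
```

Under these assumptions the cache is about 1.25 GiB, which takes roughly a second at 10 Gbit/s and under 30 ms at 400 Gbit/s. Since that time is added to TTFT on every request, the slower link would dominate the latency budget.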

In this walkthrough, SMG routes requests between SGLang workers. On AMD, the workers use the Mooncake Transfer Engine to transfer KV cache over RDMA/RoCE. In the configuration we tested, the RDMA fabric is exposed by Broadcom bnxt_re Ethernet devices.

Prerequisites

Running PD disaggregation on dstack requires first creating a fleet with placement: cluster, so that prefill and decode workers share a high-bandwidth interconnect. This can be a backend fleet provisioned by dstack on a cloud or Kubernetes cluster, or an SSH fleet registered against bare-metal or VM hosts you already manage.

Validating the interconnect

To measure end-to-end bandwidth across nodes, run the NCCL/RCCL tests example.

For a quick check that the RDMA devices are visible on a particular host, run:

$ ibv_devices

All eight bnxt_re* interfaces should be listed. Use ibv_devinfo to inspect port state and link details. If devices are missing or in an unexpected state, install or update the NIC driver and userspace RDMA library before proceeding.
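If you want to script this check across hosts, a minimal sketch could parse the ibv_devices listing and report which of the expected devices are missing. The helper names and the sample text below are hypothetical; the sample only imitates the usual two-line header followed by one row per device and was not captured from a real host.

```python
def parse_ibv_devices(output):
    """Return device names from `ibv_devices` output, skipping the header lines."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the "device / node GUID" header, and the dashed rule.
        if not fields or fields[0] == "device" or set(fields[0]) == {"-"}:
            continue
        names.append(fields[0])
    return names


def missing_devices(output, expected):
    """Return the expected device names that do not appear in the listing."""
    found = set(parse_ibv_devices(output))
    return [name for name in expected if name not in found]


# The eight interfaces this walkthrough expects on each host.
EXPECTED = [f"bnxt_re{i}" for i in range(8)]

# Illustrative sample (made up): bnxt_re7 is deliberately absent.
SAMPLE = "\n".join(
    ["    device                 node GUID", "    ------              ----------------"]
    + [f"    bnxt_re{i}            000000000000000{i}" for i in range(7)]
)

print(missing_devices(SAMPLE, EXPECTED))
```

On a real host, the text could come from `subprocess.run(["ibv_devices"], capture_output=True, text=True).stdout` instead of the sample string. An empty result means all eight devices are visible; it says nothing about port state, which still needs ibv_devinfo.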

Deploying the service

To deploy an inference endpoint with PD disaggregation using dstack, define a service with three replica groups: an SMG router, a pool of prefill workers, and a pool of decode workers.

The example below deploys Qwen/Qwen2.5-72B-Instruct on a multi-node cluster with AMD MI300X GPUs:

type: service
name: amd-sglang-pd-service

image: rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260427
privileged: true

env:
  - MODEL_ID=Qwen/Qwen2.5-72B-Instruct
  - HF_TOKEN
  - SGLANG_USE_AITER=0
  - SGLANG_ROCM_FUSED_DECODE_MLA=0
  - SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
  - SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600
  - RDMA_DEVICES=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7
  - NCCL_IB_DISABLE=1

replicas:
  - count: 1
    commands:
      - pip install smg
      - |
        smg launch \
          --pd-disaggregation \
          --host 0.0.0.0 \
          --port 30000
    resources:
      cpu: 4..
    router:
      type: sglang

  - count: 1..2
    scaling:
      metric: rps
      target: 300
    commands:
      - |
        python3 -m sglang.launch_server \
          --model $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 30000 \
          --tp $DSTACK_GPUS_NUM \
          --trust-remote-code \
          --disaggregation-ib-device $RDMA_DEVICES \
          --disaggregation-bootstrap-port 8998 \
          --disable-radix-cache \
          --disable-cuda-graph \
          --disable-overlap-schedule \
          --mem-fraction-static 0.8 \
          --max-running-requests 1024
    resources:
      gpu: MI300X:8
      cpu: 96..
      memory: 512GB..

  - count: 1..4
    scaling:
      metric: rps
      target: 300
    commands:
      - |
        python3 -m sglang.launch_server \
          --model $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 30000 \
          --tp $DSTACK_GPUS_NUM \
          --trust-remote-code \
          --disaggregation-ib-device $RDMA_DEVICES \
          --disable-radix-cache \
          --disable-cuda-graph \
          --disable-overlap-schedule \
          --decode-attention-backend triton \
          --mem-fraction-static 0.8 \
          --max-running-requests 1024
    resources:
      gpu: MI300X:8
      cpu: 96..
      memory: 512GB..

port: 30000
model:...
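To build intuition for how the count ranges and the rps target in the configuration above interact, the following is a simplified model and not dstack's actual autoscaler. It assumes the target means requests per second per replica and that each group is sized independently from the same load figure; real behavior may also involve averaging windows and scale-up or scale-down delays. The function name and the load values are made up for illustration.

```python
import math


def desired_replicas(rps, target_rps_per_replica, min_count, max_count):
    """Simplified model: enough replicas to keep per-replica RPS at or below target."""
    wanted = math.ceil(rps / target_rps_per_replica) if rps > 0 else min_count
    # Clamp to the count range declared for the replica group.
    return max(min_count, min(max_count, wanted))


# Ranges (1..2 prefill, 1..4 decode) and target (300) come from the configuration;
# the load values are hypothetical.
for rps in (50, 450, 750, 2000):
    prefill = desired_replicas(rps, 300, 1, 2)
    decode = desired_replicas(rps, 300, 1, 4)
    print(f"{rps:>5} rps -> prefill={prefill}, decode={decode}")
```

In this model the prefill pool saturates at two replicas while the decode pool can keep growing to four, which is the practical effect of giving each pool its own range.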
