Deploying inference endpoints with PD disaggregation on AMD GPUs - dstack
Deploying inference endpoints with PD disaggregation on AMD GPUs
dstack is an open-source, AI-native orchestrator that works across clouds, Kubernetes clusters, on-prem fleets, hardware vendors, and frameworks. Alongside training, inference is one of the primary use cases dstack supports out of the box.
dstack recently added native support for Prefill–Decode (PD) disaggregation. It works with Shepherd Model Gateway (SMG), a high-performance inference gateway that evolved from the SGLang Router, on both NVIDIA and AMD GPUs, and with NVIDIA Dynamo on NVIDIA GPUs. This post walks through deploying a PD-disaggregated endpoint on AMD GPUs with SMG.
Why PD disaggregation
PD disaggregation is useful when a single LLM deployment has two different bottlenecks:
Prefill processes the prompt. It is compute-bound, parallelizable, and has a direct impact on Time to First Token (TTFT).
Decode generates tokens one by one. It is memory-bound, sequential, and has a direct impact on inter-token latency.
When the same worker handles both phases, every replica has to serve both bottlenecks. With PD disaggregation, prefill and decode run as separate pools, and each pool can be sized and scaled independently.
The tradeoff is operational: for every request, the KV cache produced by the prefill worker must be transferred to the decode worker before generation can continue. That transfer sits on the TTFT path, so the cluster needs a high-bandwidth, low-latency interconnect such as RDMA over InfiniBand or RoCE, rather than TCP over a conventional NIC.
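To get a feel for why the interconnect matters, the transfer cost can be estimated with back-of-the-envelope arithmetic. The sketch below is illustrative only: the attention geometry (80 layers, 8 KV heads, head dimension 128, 16-bit cache), the 8192-token prompt, and the two link speeds are assumptions chosen to resemble a 70B-class model with grouped-query attention, not measurements from this setup. It ignores protocol overhead and the fact that a transfer may be spread across several NICs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheShape:
    """Attention geometry that determines KV cache size (assumed values)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int  # 2 for fp16/bf16


def kv_bytes_per_token(shape: KVCacheShape) -> int:
    # One key vector and one value vector per layer, per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_element)


def transfer_seconds(prompt_tokens: int, shape: KVCacheShape,
                     link_gbit_per_s: float) -> float:
    """Lower bound on transfer time: payload size divided by line rate."""
    payload_bits = prompt_tokens * kv_bytes_per_token(shape) * 8
    return payload_bits / (link_gbit_per_s * 1e9)


# Assumed geometry, for illustration only.
shape = KVCacheShape(num_layers=80, num_kv_heads=8, head_dim=128,
                     bytes_per_element=2)

prompt_tokens = 8192  # hypothetical prompt length
for name, gbit in [("10 Gb/s TCP NIC", 10.0), ("400 Gb/s RDMA link", 400.0)]:
    ms = transfer_seconds(prompt_tokens, shape, gbit) * 1e3
    print(f"{name}: {ms:.1f} ms to move the KV cache of {prompt_tokens} tokens")
```

Under these assumptions, each token carries 320 KiB of KV cache, so an 8192-token prompt takes about 2.1 seconds to move at 10 Gb/s and about 54 ms at 400 Gb/s. The slower link would add more to TTFT than prefill itself is likely to cost.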
In this walkthrough, SMG routes requests between SGLang workers. On AMD, the workers use the Mooncake Transfer Engine to transfer KV cache over RDMA/RoCE. In the configuration we tested, the RDMA fabric is exposed by Broadcom bnxt_re Ethernet devices.
Prerequisites
Running PD disaggregation on dstack requires first creating a fleet with placement: cluster, so that prefill and decode workers share a high-bandwidth interconnect. This can be a backend fleet provisioned by dstack on a cloud or Kubernetes cluster, or an SSH fleet registered against bare-metal or VM hosts you already manage.
Validating the interconnect
To measure end-to-end bandwidth across nodes, run the NCCL/RCCL tests example.
For a quick check that the RDMA devices are visible on a particular host, run:
$ ibv_devices
All eight bnxt_re* interfaces should be listed. Use ibv_devinfo to inspect port state and link details. If devices are missing or in an unexpected state, install or update the NIC driver and userspace RDMA library before proceeding.
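When the same check has to run on many hosts, it can be scripted. The sketch below parses `ibv_devices` text output and reports which expected devices are missing. It assumes the usual two-column layout (device name followed by a 16-hex-digit node GUID); the sample output and its GUIDs are made up for illustration, and `parse_ibv_devices` and `missing_devices` are hypothetical helper names.

```python
import re


def parse_ibv_devices(output: str) -> list[str]:
    """Extract device names from `ibv_devices` text output."""
    devices = []
    for line in output.splitlines():
        # Data rows look like "<name> <16-hex-digit GUID>"; header rows do not match.
        match = re.fullmatch(r"\s*(\S+)\s+([0-9a-fA-F]{16})\s*", line)
        if match:
            devices.append(match.group(1))
    return devices


def missing_devices(output: str, expected: list[str]) -> list[str]:
    """Return the expected device names that are absent from the output."""
    found = set(parse_ibv_devices(output))
    return [name for name in expected if name not in found]


EXPECTED = [f"bnxt_re{i}" for i in range(8)]

# Hypothetical output with bnxt_re7 absent; GUIDs are placeholders.
SAMPLE = """\
    device                 node GUID
    ------              ----------------
    bnxt_re0            0000000000000001
    bnxt_re1            0000000000000002
    bnxt_re2            0000000000000003
    bnxt_re3            0000000000000004
    bnxt_re4            0000000000000005
    bnxt_re5            0000000000000006
    bnxt_re6            0000000000000007
"""

missing = missing_devices(SAMPLE, EXPECTED)
print("missing RDMA devices:", ", ".join(missing) if missing else "none")
```

On a real host, the sample string would be replaced with the captured output of the command, for example `subprocess.run(["ibv_devices"], capture_output=True, text=True).stdout`.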
Deploying the service
To deploy an inference endpoint with PD disaggregation using dstack, define a service with three replica groups: an SMG router, a pool of prefill workers, and a pool of decode workers.
The example below deploys Qwen/Qwen2.5-72B-Instruct on a multi-node cluster with AMD MI300X GPUs:
type: service
name: amd-sglang-pd-service

image: rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260427
privileged: true

env:
  - MODEL_ID=Qwen/Qwen2.5-72B-Instruct
  - HF_TOKEN
  - SGLANG_USE_AITER=0
  - SGLANG_ROCM_FUSED_DECODE_MLA=0
  - SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
  - SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600
  - RDMA_DEVICES=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7
  - NCCL_IB_DISABLE=1

replicas:
  - count: 1
    commands:
      - pip install smg
      - |
        smg launch \
          --pd-disaggregation \
          --host 0.0.0.0 \
          --port 30000
    resources:
      cpu: 4..
    router:
      type: sglang

  - count: 1..2
    scaling:
      metric: rps
      target: 300
    commands:
      - |
        python3 -m sglang.launch_server \
          --model $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 30000 \
          --tp $DSTACK_GPUS_NUM \
          --trust-remote-code \
          --disaggregation-ib-device $RDMA_DEVICES \
          --disaggregation-bootstrap-port 8998 \
          --disable-radix-cache \
          --disable-cuda-graph \
          --disable-overlap-schedule \
          --mem-fraction-static 0.8 \
          --max-running-requests 1024
    resources:
      gpu: MI300X:8
      cpu: 96..
      memory: 512GB..

  - count: 1..4
    scaling:
      metric: rps
      target: 300
    commands:
      - |
        python3 -m sglang.launch_server \
          --model $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 30000 \
          --tp $DSTACK_GPUS_NUM \
          --trust-remote-code \
          --disaggregation-ib-device $RDMA_DEVICES \
          --disable-radix-cache \
          --disable-cuda-graph \
          --disable-overlap-schedule \
          --decode-attention-backend triton \
          --mem-fraction-static 0.8 \
          --max-running-requests 1024
    resources:
      gpu: MI300X:8
      cpu: 96..
      memory: 512GB..

port: 30000
model:...