A guide on how to run Nemotron 3 Super 120B Thinking on two NVIDIA DGX Sparks

TechPreacher

Serving Nemotron-Super-120B with a 1M token context on a 2-node DGX Spark cluster


This is a build log. We had two NVIDIA DGX Spark workstations (GB10 / SM121, 128 GB unified memory each), 200 GbE ConnectX-7 NICs, and the goal of serving NVIDIA's Nemotron-3-Super-120B-A12B-NVFP4 with the model's full 1 million token context. The path there crossed several traps that aren't documented in any one place: a missing Ray binary in the latest NGC vLLM image, environment-variable propagation quirks across nodes, host-memory starvation that is survivable only with cgroup-style discipline, and a handful of vllm serve flags that change between releases.

The full repository is at https://github.com/TechPreacher/dgx-spark-vllm-cluster. Below is the narrative.

Hardware and topology

Each Spark has a Grace-Blackwell SoC with native FP4 tensor cores (SM121) and 128 GB of unified host/GPU memory. The two boxes are linked by two ConnectX-7 dual-port NICs on each side: four 200 GbE ports per node, ~800 Gb/s aggregate data plane. The interfaces show up under predictable RoCE names (rocep1s0f0, rocep1s0f1, roceP2p1s0f0, roceP2p1s0f1) on top of standard netdev names (enp1s0f0np0 etc.).

We split traffic deliberately:

Control plane: one of the four interfaces (enp1s0f1np1 on this hardware) carries Ray's GCS, PyTorch tensor-parallel rendezvous, and anything that needs a single IP per node.

Data plane: all four interfaces are exposed to NCCL via NCCL_IB_HCA and to UCX via UCX_NET_DEVICES. NCCL talks RoCE directly to the four HCAs; the netdev names are also exported to NCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME, and OMPI_MCA_btl_tcp_if_include as a TCP fallback path.

A subtle correctness detail that bit us early: the per-node bring-up script must pass --device=/dev/infiniband --cap-add=IPC_LOCK --ulimit memlock=-1:-1 to docker run. The head-side copy of run_cluster.sh had these flags; the worker-side copy did not.
NCCL silently fell back to TCP on the worker, and we lost the 800 Gb/s data plane until we synced the two copies (now kept byte-identical and verified at every bring-up).

Ray topology

We run two Ray nodes (head, worker), with vLLM scheduling tensor-parallel shards across them. Both bring-up scripts call into a shared cluster/{head,worker}/run_cluster.sh that wraps docker run with the right flags and starts ray start --block inside the container.

# Node 1 (head)
cd cluster/head && bash run_headnode_2.sh

# Node 2 (worker)
cd cluster/worker && bash run_workernode_2.sh

Each script blocks on the container's foreground ray start, so closing the terminal tears the cluster down. The launcher script for the model is a separate process that uses docker exec to step inside the head container and run vllm serve there. Ray picks up the request and dispatches TP rank 1 to the worker over the data plane.

Picking the model

We had Qwen3 FP8 variants (30B-A3B-Thinking, 122B-A10B) serving cleanly via the same Ray topology, but neither uses the SM121 FP4 tensor cores. For Nemotron-3-Super, NVIDIA ships an NVFP4-quantized checkpoint specifically targeted at this generation of hardware. The model itself is a LatentMoE hybrid: Mamba-2 state-space layers interleaved with full attention layers and a sparse MoE on top. It has 120 B total parameters, of which 12 B are active per token. The Mamba layers carry no KV cache (just a fixed-size SSM state), which is the reason a 1 M token context is physically tractable on consumer-scale memory: only the attention layers' KV cache grows linearly with sequence length.

The first obstacle: NGC dropped Ray from vLLM

NVIDIA's NGC publishes a vLLM container roughly monthly. We were running nvcr.io/nvidia/vllm:25.11-py3 for the Qwen path because that's what we had pinned when the cluster was first built. The Hugging Face model card for Nemotron-3-Super recommends vllm/vllm-openai:v0.20.0 (or newer). The 25.11 image was too old to recognize super_v3 as a reasoning parser, didn't have --async-scheduling, didn't accept --mamba-ssm-cache-dtype float16, and didn't expose the --reasoning-parser-plugin flag we'd have needed to side-load the parser.

So we moved to the latest NGC vLLM build at the time, nvcr.io/nvidia/vllm:26.05.post1-py3. The head container started and then immediately printed:

/bin/bash: line 1: ray: command not found

The image had no ray in $PATH.
We checked further:

docker run --rm --entrypoint /bin/bash nvcr.io/nvidia/vllm:26.05.post1-py3 -c \
'which ray; python -c "import ray"; pip show ray'
# ray binary not in PATH
# ModuleNotFoundError: No module named 'ray'
# WARNING: Package(s) not found: ray

NGC had removed Ray entirely from this build. This was not a PATH issue: ray is genuinely absent. vLLM upstream installs Ray as a transitive dependency, so this looks intentional on NGC's part (smaller image, fewer CVEs to scan).

The fastest fix that keeps everything else about the cluster unchanged: layer Ray back on top of the NGC base in a thin local image. We added cluster/Dockerfile:

ARG...
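
The Ray probes used in the check above (is the binary on PATH, does the Python module import) can be folded into a small preflight function that runs inside any candidate image before bring-up. The following is a minimal sketch, not something taken from the repository: the function names ray_available and describe_ray_status and the PYTHON_BIN override are hypothetical, and it assumes a POSIX-compatible shell and python3 inside the container.

```shell
# Preflight sketch: is Ray usable in this environment?
# Hypothetical helpers for illustration; not part of the original repository.

# Returns 0 if both the ray CLI and the Python module are present,
# 1 if the binary is missing from PATH, 2 if the module fails to import.
# PYTHON_BIN is an assumed override for the interpreter (default: python3).
ray_available() {
    local py="${PYTHON_BIN:-python3}"
    command -v ray >/dev/null 2>&1 || return 1
    "$py" -c 'import ray' >/dev/null 2>&1 || return 2
    return 0
}

# Translate the return code into a one-line, human-readable status.
describe_ray_status() {
    local rc=0
    ray_available || rc=$?
    case "$rc" in
        0) echo "ray: ok" ;;
        1) echo "ray: binary not in PATH" ;;
        2) echo "ray: Python module missing" ;;
    esac
    return "$rc"
}
```

Sourced inside a container, describe_ray_status prints a single status line and returns a non-zero code on failure, so a bring-up script could refuse to start ray start --block against an image that lacks Ray instead of failing after the container is already up.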
