DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark


DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers - DGX Spark / GB10 - NVIDIA Developer Forums




tonyd615

May 16, 2026, 2:31am

I didn’t create this recipe, you guys did, but I was finally able to find it and get DeepSeek-V4-Flash working with 200K context on 2 nodes.

Sharing this since I couldn’t find a confirmed end-to-end recipe for the official DeepSeek-V4-Flash on a 2-node Spark setup, and there was a lot of “nobody has it on 2 nodes yet” floating around. It works. Here’s exactly what I ran.

Setup:

2x DGX Spark (GB10), 128GB unified each

Direct QSFP56 200G cable between them (RoCE/NCCL over the CX-7), link-local addressing

No Ray. TP=2 with --distributed-executor-backend mp, --nnodes 2

This is built on eugr/spark-vllm-docker PR #219 (the DeepSeek V4 Flash recipe; @eugr, @eugr_nv) plus the jasl/vllm fork (@jasl9187). Full credit to them; I just got it stood up and verified on real hardware. Note that PR #219 is still open/unmerged.

Build (the one thing to get right: pin the vLLM commit, don’t use a branch alias — only the pinned commit has the GB10 validation behind it):

./build-and-copy.sh \
  --vllm-repo https://github.com/jasl/vllm \
  --vllm-ref dda4668b59567416f86956cfe7bbc1eab371a61e \
  --rebuild-vllm -t vllm-node-dsv4 -c
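A small guard can catch a branch alias before a long rebuild starts. This is only a sketch: `is_pinned_sha` is a hypothetical helper, not part of build-and-copy.sh, and it only checks that the ref looks like a full 40-character commit hash. It does not verify that the commit exists in the fork.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of build-and-copy.sh): succeed only when the
# ref is a full 40-character lowercase hex commit hash, not a branch or tag.
is_pinned_sha() {
  [[ $1 =~ ^[0-9a-f]{40}$ ]]
}

# The commit hash quoted in the build command above.
VLLM_REF=dda4668b59567416f86956cfe7bbc1eab371a61e

if is_pinned_sha "$VLLM_REF"; then
  echo "pinned: $VLLM_REF"
else
  echo "not a full commit hash: $VLLM_REF" >&2
fi
```

A ref like `main` or a short hash like `dda4668` fails the check, which is the point: neither is guaranteed to keep resolving to the validated commit.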

Launch (from the head node):

DOTENV_CONTAINER_NAME=vllm_ds4 nohup ./run-recipe.sh \
  deepseek-v4-flash --no-ray --tp 2 --name vllm_ds4 > ds4.log 2>&1 &
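Since the launch runs in the background and cold start takes minutes, it can help to poll for readiness instead of watching the log. A minimal sketch, assuming the container serves vLLM's `/health` route on port 8000 of the head node (the recipe may bind a different port); `wait_for_health` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a health URL until it answers with a 2xx status
# or the timeout expires. Returns 0 when healthy, 1 on timeout.
# Arguments: URL, timeout in seconds, poll interval in seconds (default 5).
wait_for_health() {
  local url=$1 timeout_s=$2 interval_s=${3:-5}
  local deadline=$(( SECONDS + timeout_s ))
  while (( SECONDS < deadline )); do
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    sleep "$interval_s"
  done
  return 1
}

# Example call (assumed port 8000, 10-minute budget), left commented out so
# the snippet does not block when sourced:
# wait_for_health http://localhost:8000/health 600 && echo "serving"
```

If the wait times out, `tail ds4.log` is the first place to look, since that is where the launch command above redirects output.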

Key flags the recipe sets: official deepseek-ai/DeepSeek-V4-Flash (native FP8, E4M3 128x128 block, ~149GB/46 shards), --kv-cache-dtype fp8, --enable-expert-parallel, speculative deepseek_mtp num_speculative_tokens=2, --max-model-len 200000, --max-num-seqs 2, block-size 256, cudagraph FULL_AND_PIECEWISE.
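For a rough sense of why this needs both nodes, here is the back-of-the-envelope memory math using the figures above. It assumes the ~149GB of weights split evenly across the two TP ranks and ignores runtime overhead, so treat the result as an upper bound on what is left per node, not a measurement. `per_node_headroom_gb` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: memory left per node after an even split of the weights.
# Arguments: total weights (GB), number of nodes, memory per node (GB).
# Assumption: even split across ranks; no allowance for runtime overhead.
per_node_headroom_gb() {
  awk -v w="$1" -v n="$2" -v m="$3" 'BEGIN { printf "%.1f\n", m - w / n }'
}

per_node_headroom_gb 149 2 128   # prints 53.5
```

So each node holds about 74.5GB of weights, and the remaining ~53.5GB of unified memory has to cover the fp8 KV cache, activations, CUDA graphs, and the OS. A single 128GB node cannot hold the 149GB checkpoint at all.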

Numbers I’m seeing (warm, single stream): ~44 tok/s decode. Concurrency=2 aggregate ~45 tok/s. TTFT on short prompts ~2s warm. Cold start container-to-serving was ~6 min. These line up with the jasl GB10 validation baseline (conversational c=1 ~35 t/s, scaling to ~96 t/s aggregate at c=8, MTP spec-accept ~68% on conversational).
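To compare these figures on the same footing, it helps to turn aggregate throughput into per-stream throughput (aggregate divided by concurrency). A quick sketch using the approximate numbers quoted above; the c=8 figure comes from the jasl baseline on a different workload, so this is not an apples-to-apples comparison. `per_stream_rate` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: per-stream decode rate = aggregate tok/s / concurrency.
per_stream_rate() {
  awk -v a="$1" -v c="$2" 'BEGIN { printf "%.1f\n", a / c }'
}

per_stream_rate 44 1   # my single stream:      prints 44.0
per_stream_rate 45 2   # my c=2 aggregate:      prints 22.5
per_stream_rate 96 8   # jasl baseline at c=8:  prints 12.0
```

Read this way, going from one stream to two at 200K context roughly halves what each stream gets, while the aggregate barely moves.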

Gotchas that cost me time:

The “Pin NCCL” commit in PR #219 matters — it symlinks the system libnccl; without the current PR head the cross-node init isn’t right.

build-and-copy’s image copy mangled the worker user for me (double user@). Worked around it with a plain docker save | ssh worker docker load over the link.

max_num_seqs=2 is intentional at 200K ctx (KV budget). If you want more concurrency, drop max-model-len (the validated profiles do 65K@16seqs, 32K@36seqs).

Long-context cold prefill is the weak spot: ~53s TTFT at 32K, ~250s at 128K. Fine for normal prompts, rough for huge contexts.

One of my CX-7 links wedged during teardown churn (mlx5 ACCESS_REG timeout); a clean cold boot cleared it, nothing else did.
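The KV-budget and cold-prefill gotchas above are easier to weigh with the arithmetic written out. This is a sketch under stated assumptions: "K" is taken as 1,000 tokens (the validated profiles may actually use 65536 and 32768), and the token budget is just max-model-len times max-num-seqs, a worst-case ceiling and not how the engine actually allocates KV blocks. Both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for rough sizing; integer math only.

# Worst-case tokens in flight = max-model-len * max-num-seqs.
token_budget() { echo $(( $1 * $2 )); }

# Effective cold prefill rate in tokens/s = prompt tokens / TTFT seconds.
prefill_rate() { echo $(( $1 / $2 )); }

token_budget 200000 2     # 200K ctx, 2 seqs:  prints 400000
token_budget 65000 16     # 65K ctx, 16 seqs:  prints 1040000
token_budget 32000 36     # 32K ctx, 36 seqs:  prints 1152000

prefill_rate 32000 53     # 32K prompt, ~53s:   prints 603
prefill_rate 128000 250   # 128K prompt, ~250s: prints 512
```

The products are not constant across profiles, so they are only a loose guide to how much concurrency a shorter max-model-len buys. The prefill side is clearer: the effective rate drops from roughly 600 tok/s at 32K to roughly 510 tok/s at 128K, which is why huge cold contexts hurt.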

Hope it saves someone the night I just spent. Curious if anyone’s pushed concurrency or long-ctx prefill further on GB10.

jasl9187

May 16, 2026, 1:27pm

There are awesome friends helping me improve the performance of long-context prefill.

Last night we got a ~20% improvement, testing on 2x RTX Pro 6000.

And I just applied new optimizations and am running benchmarks.

Keyper-AI

May 16, 2026, 1:49pm

Very cool.

Can you post your benchmarks on spark-arena.com?

I am currently running Qwen 3.6 but would like something that runs faster on larger context.

co-le

May 16, 2026, 4:05pm

Well done, I tried two times and gave up haha. I’ll reproduce it as soon as I can.

tonyd615

May 16, 2026, 7:02pm

Yes I’ll get that done asap

tonyd615

May 16, 2026, 7:02pm

I’ll try to get the spark arena recipe posted.

dbsci

May 16, 2026, 7:11pm

I’ll give you 4,921 bonus points if you upload benchmark via sparkrun arena benchmark and post a “v2” recipe!

co-le

May 16, 2026, 7:27pm

Oh sorry I meant I tried 2 times before your post and gave up, but now I set an agent to reproduce what worked for you and we’ll see how it goes.

tonyd615

May 16, 2026, 11:18pm

I got approved. I’m about to post right now, give me a few mins.

tonyd615

May 17, 2026, 12:33am


Recipe repo: https://github.com/tonyd2wild/deepseek-v4-flash-dual-spark-recipe (reproducible recipe: official DeepSeek-V4-Flash on a dual NVIDIA DGX Spark cluster; TP=2, jasl/vllm, MTP, fp8 KV, 200K ctx).

Still working to get it up on SparkRun. I’m having a small issue with the model loading within the 5-minute timeout, but I’m trying to get it up.
