Two Qwen3 models on one DGX Spark: the residency math

Two Qwen3 Models on One DGX Spark: The Residency Math for Local LLM Setup

devashish.me

SubscribeSign in

Two Qwen3 Models on One DGX Spark: The Residency Math for Local LLM Setup The residency math, the gpu_memory_utilization trap, and what to verify first. Notes from my experiments with local LLMs.

Devashish Jun 16, 2026

My agent stack with Hermes runs on a workstation. The models run on a DGX Spark on the same LAN. The split is deliberate: the workstation stays responsive, the Spark does the GPU work, and they talk over an HTTP proxy.

Since I started managing the agent fleet through Clawrium, the Hermes count has climbed. More agents on more hosts, more concurrent traffic, all hitting the same Spark. What was a one-laptop, one-model setup is now a small fleet against a single backend — and the shape of the load is exactly what a single-model server can’t serve. Thanks for reading devashish.me! Subscribe for free to receive new posts and support my work.

Fleet snapshot with different providers (orchestration using Clawrium) The Spark served models through ollama for months. It worked. One model up, single config, easy to bring down. But ollama owns the card. There’s no per-process memory budget, no gpu_memory_utilization knob, no straightforward way to coresident a heavy model for reasoning and a fast model for quick turns. KV cache management is whatever the underlying llama.cpp backend gives you. PagedAttention isn’t there. vLLM fixes all of that. PagedAttention reclaims KV blocks instead of contiguous-pinning them.

gpu_memory_utilization gives you a per-container budget.

One Spark (GB10, 119.67 GiB unified memory) can run multiple vLLM containers behind a LiteLLM proxy on :4000, and Hermes hits one URL to route to either model. The promise: serve Qwen3-Next-80B-Instruct-FP8 for the heavy work and Qwen3-4B-Instruct-2507 for fast turns, coresident, both reachable from a single endpoint.

That’s the why. What follows is what it took to make the promise hold. Spark hardware will happily hold two Qwen3 models if the numbers line up. They didn’t, for several days. That’s where my last weekend went. Attempt one: trust the target

First 80B config: gpu_memory_utilization: 0.75, max_model_len: 65536, max_num_seqs: 4. vLLM’s KV cache init crashed with “No available memory for the cache blocks.” Qwen3-Next is mostly Mamba; the per-block page alignment pushes KV pool demand higher than the ~14 GiB residue after weights. Bumped to 0.85. Now the free-memory check crashed: “Free memory on device (98.51/119.67 GiB) is less than desired GPU memory utilization (0.85, 101.72 GiB).” The 4B was already resident at ~16 GiB. The 80B’s 0.85 target was reading the whole card, not what was free. That’s the first lesson. gpu_memory_utilization is a fraction of total GPU memory, not free memory . Two co-resident vLLM processes need their fractions to sum below ~0.95 to leave room for CUDA framework overhead. If your math assumes free, you’ll oscillate between OOMs and silent KV starvation. Settled at 0.80 / 32k / 2 for the 80B. Loaded clean. KV pool ~20.8 GiB after weights. Attempt two: point Hermes at it

Then Hermes came online and tool calls came back as plain text. JSON sitting inside content. tool_calls: []. finish_reason: stop. Hermes never executed it. A day of parser triage produced nothing actionable. Both hermes_tool_parser.py and qwen3xml_tool_parser.py look for (singular). The plural tag is the system-prompt definition, not the output. The parser wasn’t wrong. The model wasn’t emitting. tool_choice: "required" worked. tool_choice: "auto" came back empty: tool_calls: [], content: "", 619 characters of reasoning inside concluding “Alright, that’s it” without emitting the call. Qwen’s own model card states it plainly: Qwen3-Next-80B-Thinking supports only thinking mode. enable_thinking: false is a structural no-op on this checkpoint. /no_think in the prompt is ignored. The model reasons inside , decides, and never emits. That’s an unrecoverable failure for any agent SDK that defaults to tool_choice: "auto". The fix wasn’t a parser flag. It was swapping the whole 80B backbone from Thinking to Instruct. 77 GiB pre-pull. Drain GPU. Bring up with --enable-auto-tool-choice --tool-call-parser hermes, no --reasoning-parser. Three LiteLLM aliases (writer / reviewer / sources) all passed tool_choice: "auto" cleanly with finish_reason: tool_calls. Trade accepted: reviewer loses native traces. Reasoning moved into the prompt. Attempt three: the bump that broke coresidency

Reviewer agent (running on Hermes) needed 64k context. Bumped the 80B to 0.85 / 65536 / 2. 80B loaded healthy. The 4B’s restart loop kicked in 19 times: “Free memory on device (12.58/119.67 GiB) is less than desired GPU memory utilization (0.12, 14.36 GiB).” 80B’s actual residency at 0.85 was 101.5 GiB. Plus ~5 GiB CUDA framework overhead. That left ~12.5 GiB free. The 4B needed 14.36 GiB. No room. Toned the 80B back to 0.80, dropped...

Two Qwen3 models on one DGX Spark: the residency math

Related Articles

Apple WWDC 2026 Livestream

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

German ruling declares Google liable for false answers in AI Overviews

Britain Became as Poor as Mississippi