North Mini Code in NVFP4 — ~1.65x over FP8, 40% less memory, zero quality loss - DGX Spark / GB10 - NVIDIA Developer Forums
agentic-ai
jeremyk
June 17, 2026, 4:26pm
Hey all,
I just put up two Spark Arena runs of North Mini Code 1.0, an FP8 reference and an NVFP4 quant we made, to see what the GB10’s native FP4 support buys us. North Mini Code is Cohere’s first open agentic coding model: a 30B MoE (3B active), Apache 2.0, built for exactly the kind of run-it-yourself, sovereign setup the Spark is great for. Blog here: North Mini Code: Agentic Coding Model for Developers | Cohere
The results, same model / same recipe / same Spark, only the quant changed:
Single user @ 16K context (realistic): ~52 tok/s on NVFP4 vs ~32 on FP8 → ~1.65x faster
Two concurrent users: scales to ~84 tok/s aggregate (the Spark Arena figure)
Memory: 17 GB weights vs 28 GB → ~40% smaller footprint
Quality: identical HumanEval scores on NVFP4 and FP8, so no measurable loss on that benchmark
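As a quick sanity check, the ratios in the list above can be recomputed from the rounded figures. This is a minimal sketch: the variable and function names are illustrative, and because the inputs are approximate ("~52", "~32"), the results land near the quoted ~1.65x and ~40% rather than exactly on them.

```python
# Rounded figures quoted in the post (single user, 16K context).
nvfp4_tok_s = 52.0               # NVFP4 generation throughput, tok/s
fp8_tok_s = 32.0                 # FP8 generation throughput, tok/s
nvfp4_weights_gb = 17.0          # NVFP4 weight footprint, GB
fp8_weights_gb = 28.0            # FP8 weight footprint, GB
two_user_aggregate_tok_s = 84.0  # NVFP4, two concurrent users, aggregate


def speedup(new: float, old: float) -> float:
    """Throughput ratio of the new configuration over the old one."""
    return new / old


def reduction(new: float, old: float) -> float:
    """Fractional saving of the new footprint relative to the old one."""
    return 1.0 - new / old


decode_speedup = speedup(nvfp4_tok_s, fp8_tok_s)
memory_saving = reduction(nvfp4_weights_gb, fp8_weights_gb)
# Average per-user rate if the aggregate is split evenly between two users.
per_user_tok_s = two_user_aggregate_tok_s / 2

print(f"decode speedup: {decode_speedup:.3f}x")
print(f"memory saving:  {memory_saving:.1%}")
print(f"per-user rate at 2 users: {per_user_tok_s:.0f} tok/s")
```

The per-user figure assumes the aggregate is shared evenly, which the post does not state.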
Benchmarks & Recipe:
FP8: CohereLabs/North-Mini-Code-1.0-fp8 - Spark Arena Benchmark
NVFP4: XanuNetworks/North-Mini-Code-1.0-NVFP4 - Spark Arena Benchmark
Both run on a single Spark (tensor parallel 1) under vLLM with an FP8 KV cache, with tool calling and reasoning handled by the cohere_command4 parsers. Recipes and full PP/TG-vs-concurrency logs (prompt processing and token generation) are on both pages if you want to reproduce.
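For orientation, the setup described above maps roughly onto a `vllm serve` invocation like the one assembled below. This is a hedged sketch, not the published recipe: the helper `build_vllm_serve_command` is hypothetical, the model ID is assumed to be usable as a model path, and whether a given vLLM build registers the `cohere_command4` parsers is also an assumption. The recipes on the Spark Arena pages are the authoritative source.

```python
import shlex


def build_vllm_serve_command(model: str, parser: str = "cohere_command4") -> list[str]:
    """Assemble a vLLM launch command matching the described setup.

    Hypothetical helper for illustration; the real recipe may use
    additional or different flags.
    """
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "1",   # single Spark
        "--kv-cache-dtype", "fp8",       # FP8 KV cache
        "--enable-auto-tool-choice",     # let the model decide when to call tools
        "--tool-call-parser", parser,    # parser name taken from the post
        "--reasoning-parser", parser,    # parser name taken from the post
    ]


# Model ID as listed in the post; assumed to resolve as a model path.
cmd = build_vllm_serve_command("XanuNetworks/North-Mini-Code-1.0-NVFP4")
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and value separate, so it can be passed to `subprocess.run` without shell quoting issues.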
Fun side note: looks like this is the only Cohere model on the board so far, so a shout out to the Cohere folks for putting out such a solid little agentic coding model. Getting ~1.65x and a 40% smaller footprint for no quality hit makes it a really nice fit for the Spark.
Would love to hear how it runs on other people’s setups, and if anyone wants to stress the quant on heavier coding workloads than HumanEval, I’m all ears. Feedback welcome!
Cheers!
coder543
June 17, 2026, 5:23pm
I think any 4-bit quant can get those output tok/s benefits, since token generation is just memory-bandwidth bound and 4-bit models are all about the same size.
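The bandwidth argument can be sketched as a rough ceiling: if every generated token has to read all active weights once, then tok/s is at most memory bandwidth divided by bytes read per token. The numbers below are assumptions, not measurements: 273 GB/s is the memory bandwidth assumed here for the GB10, 4.5 bits/weight assumes NVFP4's 4-bit values plus one 8-bit scale per 16-element block, and the model ignores KV cache reads, unquantized layers, and kernel overheads.

```python
def decode_ceiling_tok_s(active_params: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode rate for a bandwidth-bound model.

    Assumes each token reads every active weight exactly once and
    nothing else; real throughput will be lower.
    """
    bytes_per_token = active_params * bits_per_weight / 8.0
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 3e9      # from the post: 30B MoE with 3B active
BANDWIDTH_GB_S = 273.0   # assumed GB10 memory bandwidth
FP8_BITS = 8.0
NVFP4_BITS = 4.5         # assumed: 4-bit values + 8-bit scale per 16 weights

fp8_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, FP8_BITS, BANDWIDTH_GB_S)
fp4_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, NVFP4_BITS, BANDWIDTH_GB_S)

print(f"FP8 ceiling:   {fp8_ceiling:.0f} tok/s")
print(f"NVFP4 ceiling: {fp4_ceiling:.0f} tok/s")
print(f"ideal ratio:   {fp4_ceiling / fp8_ceiling:.2f}x")
```

Under these assumptions the ideal ratio depends only on bits per weight, so any 4-bit format with similar scale overhead would give about the same decode ceiling.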
I could be wrong, but I think the real potential benefit of NVFP4 is more efficient use of the tensor cores during prefill (prompt processing). It would be interesting to see how many tokens/sec you’re getting for that.
Unfortunately, in my testing, North Mini Code just doesn’t seem to be good enough for me to have any great use for it yet, but I look forward to a future version 2.
jeremyk
June 17, 2026, 5:29pm
NVFP4 PP:
[Screenshot of NVFP4 prompt-processing results]
FP8 PP:
[Screenshot of FP8 prompt-processing results]