Cohere's open agentic North Mini Code – accelerated with NVFP4 on spark-arena

North Mini Code in NVFP4 — ~1.65x over FP8, 40% less memory, zero quality loss - DGX Spark / GB10 - NVIDIA Developer Forums



jeremyk

June 17, 2026, 4:26pm

Hey all,

I just put up two Spark Arena runs of North Mini Code 1.0 — an FP8 reference and an NVFP4 quant we made — to see what the GB10’s native FP4 support buys us. It’s Cohere’s first open agentic coding model: a 30B MoE (3B active), Apache 2.0, built for exactly the kind of run-it-yourself, sovereign setup the Spark is great for. Blog here: North Mini Code: Agentic Coding Model for Developers | Cohere

The results, same model / same recipe / same Spark, only the quant changed:

Single user @ 16K context (realistic): ~52 tok/s on NVFP4 vs ~32 on FP8 → ~1.65x faster

Two concurrent users: scales to ~84 tok/s aggregate (the Spark Arena figure)

Memory: 17 GB weights vs 28 GB → ~40% smaller footprint

Quality: identical HumanEval across NVFP4 and FP8 — no measurable loss
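The two headline ratios can be checked directly from the rounded figures above. This is only a sketch of the arithmetic: the variable names are illustrative, and the inputs are the approximate numbers quoted in this post, not the raw values from the benchmark logs.

```python
# Rounded figures quoted in the post (approximate, not raw log values).
nvfp4_tok_s = 52.0       # single user @ 16K context, NVFP4
fp8_tok_s = 32.0         # single user @ 16K context, FP8
nvfp4_weights_gb = 17.0  # weight footprint, NVFP4
fp8_weights_gb = 28.0    # weight footprint, FP8

# Decode speedup of NVFP4 relative to FP8.
speedup = nvfp4_tok_s / fp8_tok_s

# Fraction of weight memory saved by moving from FP8 to NVFP4.
memory_saving = 1.0 - nvfp4_weights_gb / fp8_weights_gb

print(f"speedup: {speedup:.3f}x")
print(f"weight memory saved: {memory_saving:.1%}")
```

With the rounded inputs this comes out to roughly 1.6x and 39%, in line with the ~1.65x and ~40% quoted; the small gap is presumably due to rounding of the tok/s figures.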

Benchmarks & Recipe:

FP8: CohereLabs/North-Mini-Code-1.0-fp8 - Spark Arena Benchmark

NVFP4: XanuNetworks/North-Mini-Code-1.0-NVFP4 - Spark Arena Benchmark

Both run on a single Spark (tensor parallel 1) under vLLM with FP8 KV cache, tool calling + reasoning via the cohere_command4 parsers. Recipes and full PP/TG-vs-concurrency logs are on both pages if you want to reproduce.
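For anyone who wants a starting point before opening the recipes, the setup described above might translate into a `vllm serve` invocation along these lines. This is a hedged sketch, not the actual Spark Arena recipe: the model ID and the `cohere_command4` parser name are taken from this post, the flags are standard vLLM server options, and whether a given vLLM build registers that parser name is an assumption. The recipe on the benchmark pages may use different or additional settings.

```python
import shlex

# Model ID taken from the Spark Arena link in this post.
MODEL = "XanuNetworks/North-Mini-Code-1.0-NVFP4"

# Parser name as mentioned in the post; assumed to be registered in the vLLM build used.
PARSER = "cohere_command4"


def build_vllm_command(model: str = MODEL, parser: str = PARSER) -> list[str]:
    """Assemble a `vllm serve` argv for the setup described in the post."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "1",   # single Spark
        "--kv-cache-dtype", "fp8",       # FP8 KV cache
        "--enable-auto-tool-choice",     # let the server emit tool calls
        "--tool-call-parser", parser,
        "--reasoning-parser", parser,
    ]


command = build_vllm_command()
# Print a shell-ready command line; nothing is launched here.
print(shlex.join(command))
```

Swapping `MODEL` for `CohereLabs/North-Mini-Code-1.0-fp8` would give the FP8 counterpart under the same assumptions.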

Fun side note: looks like this is the only Cohere model on the board so far, so a shout out to the Cohere folks for putting out such a solid little agentic coding model. Getting ~1.65x and a 40% smaller footprint for no quality hit makes it a really nice fit for the Spark.

Would love to hear how it runs on other people’s setups, and if anyone wants to stress the quant on heavier coding workloads than HumanEval, I’m all ears. Feedback welcome!

Cheers!

coder543

June 17, 2026, 5:23pm

I think any 4-bit quant can get those output tok/s benefits, since decode is memory-bandwidth bound and 4-bit models are all about the same size.
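A back-of-the-envelope version of that bandwidth argument, as a rough sketch only: it assumes each generated token has to read every active weight once, takes the 3B active parameters from the original post, and uses 273 GB/s as the Spark's memory bandwidth (an assumption here, not a number from this thread). It ignores KV cache reads, quantization scale overhead, and anything compute-bound, so the results are ceilings rather than predictions.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_param: float) -> float:
    """Upper bound on decode tok/s if every active weight is read once per token."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 273.0  # assumed memory bandwidth; not stated in the thread
ACTIVE_PARAMS_B = 3.0   # "3B active" from the original post

fp8_ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, ACTIVE_PARAMS_B, 8)
fp4_ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, ACTIVE_PARAMS_B, 4)

print(f"FP8 ceiling:   {fp8_ceiling:.0f} tok/s")
print(f"4-bit ceiling: {fp4_ceiling:.0f} tok/s")
print(f"ideal ratio:   {fp4_ceiling / fp8_ceiling:.2f}x")
```

Under these assumptions the ideal 4-bit-over-FP8 ratio is 2x regardless of the bandwidth figure, which is why any 4-bit format would be expected to land in the same neighborhood for decode.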

I could be wrong, but I think the real potential benefit of NVFP4 is more efficient use of the tensor cores for prefill (prompt processing). It would be interesting to see how many tokens/sec you’re getting for that.

Unfortunately, in my testing, North Mini Code just doesn’t seem to be good enough for me to have any great use for it yet, but I look forward to a future version 2.

jeremyk

June 17, 2026, 5:29pm

NVFP4 PP:

[Screenshot: NVFP4 prompt-processing results; image not reproduced here]

FP8 PP:

[Screenshot: FP8 prompt-processing results; image not reproduced here]
