Introducing North Mini Code: Cohere’s First Model For Developers
Published June 9, 2026
Cohere Code Agents Team (coherecode)
CohereLabs
All co-authors listed below
Today, we are releasing North Mini Code, a 30B-parameter Mixture-of-Experts model with 3B active parameters and powerful agentic coding capabilities, available on Hugging Face under the Apache 2.0 license.
North Mini Code is the first model in Cohere’s new family of models, and is specifically designed and trained for agentic software engineering tasks.
Figure 1: North Mini Code’s performance in agentic coding tasks and complex code generation benchmarks, compared to leading open-source models of similar size. See here for the details of our benchmarking methodology.
North Mini Code is optimized for complex software engineering workflows, terminal-based agentic tasks, and high-quality code generation. On Artificial Analysis’ Coding Index, North Mini Code achieves a score of 33.4, outperforming Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), Devstral Small 2 (24B Dense), and even substantially larger models such as Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B).¹ It ranks among the strongest open-source coding models in its size class.
Try North Mini Code in OpenCode
Real-world code agents depend on model quality and robustness across agent harnesses. We trained North Mini Code using multiple scaffolds rather than optimizing for a single one. This approach enables North Mini Code to serve as a reliable foundation for coding agents such as OpenCode.
Architecture
Figure 2: North Mini Code is a Mixture-of-Experts Transformer decoder with interleaved sliding-window self-attention and full self-attention.
North Mini Code is a decoder-only, Transformer-based sparse Mixture-of-Experts model. It uses our efficient attention implementation, which interleaves sliding-window attention with RoPE and global attention with no positional embeddings in a 3:1 ratio [1]. The feed-forward block is an MoE block with 128 experts, of which 8 are activated per token. Each expert is an FFN block with SwiGLU activation. The router applies a sigmoid activation function to the logits before the top-k selection. We also use a single dense layer before the sparse layers.
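The routing step and the 3:1 attention interleave described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not Cohere's implementation: the function names, the renormalisation of the selected scores, and the position of the global layer within each group of four are assumptions that the article does not specify.

```python
import math
import random

NUM_EXPERTS = 128  # experts per MoE block (from the article)
TOP_K = 8          # experts activated per token (from the article)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def route_token(logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for a single token.

    The sigmoid is applied to every router logit first, and the top-k
    experts are then selected on the resulting scores.
    """
    scores = [sigmoid(z) for z in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # ASSUMPTION: the selected scores are renormalised to sum to 1.
    # The article does not say how the gate weights are combined.
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]


def layer_pattern(num_layers: int, window_per_global: int = 3) -> list[str]:
    """Attention type per layer for a 3:1 sliding-window to global ratio.

    ASSUMPTION: the global layer closes each group of four layers.
    """
    period = window_per_global + 1
    return [
        "global" if (i + 1) % period == 0 else "sliding_window"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    for expert, weight in route_token(logits):
        print(expert, round(weight, 3))
    print(layer_pattern(8))
```

Because each sigmoid score is computed independently per expert, the scores of the unselected experts do not affect the selected ones, unlike a softmax over all 128 logits.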
Post-Training for Coding Excellence
Figure 3: The post-training pipeline is made up of two phases of supervised fine-tuning (SFT) and a phase of agentic reinforcement learning with verifiable rewards (RLVR) targeting software engineering and terminal tasks.
We post-train North Mini Code with two-stage cascaded supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR), focusing on agentic coding. Our first-stage SFT data focuses on coding capabilities, integrated within a wider mix for robustness and usability. The datamix includes programming, reasoning, and instruction following across a large variety of domains. Code datasets account for 70% of trainable tokens: 43% agentic tool-use data and 27% single-turn competitive or scientific programming data. In the second SFT stage, we use a 4.5-billion-token data mixture drawn only from agentic and reasoning-driven samples, in which code data forms 61% of trainable tokens. This mixture comprises our highest-quality data across coding and wider agentic tasks, where tool calls and completions are verified as executable and correct.
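As a quick sanity check on the percentages quoted above, the arithmetic can be written out. This sketch reads the 43% and 27% figures as the two parts of the 70% code share, which is an interpretation of the text; the variable and function names are made up for illustration. Integer arithmetic is used to avoid floating-point rounding.

```python
# Stage 1: share of trainable tokens, in percent (figures from the article).
STAGE1_CODE_PERCENT = {
    "agentic_tool_use": 43,
    "single_turn_programming": 27,
}

# Stage 2: total mixture size and code share (figures from the article).
STAGE2_TOTAL_TOKENS = 4_500_000_000
STAGE2_CODE_PERCENT = 61


def stage1_code_percent() -> int:
    """Total code share in stage 1, assuming the two parts are disjoint."""
    return sum(STAGE1_CODE_PERCENT.values())


def stage2_token_split() -> tuple[int, int]:
    """Return (code_tokens, non_code_tokens) for the stage 2 mixture."""
    code = STAGE2_TOTAL_TOKENS * STAGE2_CODE_PERCENT // 100
    return code, STAGE2_TOTAL_TOKENS - code


if __name__ == "__main__":
    print("stage 1 code share (%):", stage1_code_percent())
    code, other = stage2_token_split()
    print("stage 2 code tokens:", code)
    print("stage 2 non-code tokens:", other)
```

Under this reading, the second-stage mixture holds roughly 2.7 billion code tokens and 1.8 billion non-code tokens.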
Our internal data pipeline heavily relies on containerised agentic coding environments. We maintain a disjoint subset of these environments for use in synthetic SFT data generation and RLVR. The majority are based on software engineering tasks from real-world repositories, while the rest are terminal-based agentic tasks sourced from open-source and internal datasets. In total, we used over 70k verifiable tasks across ~5k unique repositories. We deduplicate our environments against the repository sources from SWE-Bench [2] and SWE-Bench-Pro [3] to avoid source leakage during evaluation [4].
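Repository-level deduplication of the kind described above can be sketched as a simple set filter. This is a minimal illustration under stated assumptions: the `Task` record, the normalisation rules, and the example repository names are hypothetical, and the article does not describe how the actual pipeline matches repository sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical record for one verifiable task."""
    task_id: str
    repo: str  # repository the task was built from, e.g. "owner/name"


def normalize_repo(repo: str) -> str:
    """Reduce common spellings of a GitHub repository to 'owner/name'."""
    repo = repo.strip().lower()
    for prefix in ("https://github.com/", "http://github.com/", "github.com/"):
        if repo.startswith(prefix):
            repo = repo[len(prefix):]
            break
    repo = repo.rstrip("/")
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return repo


def drop_benchmark_repos(tasks: list[Task], benchmark_repos: list[str]) -> list[Task]:
    """Keep only tasks whose source repository is not used by a benchmark."""
    blocked = {normalize_repo(r) for r in benchmark_repos}
    return [t for t in tasks if normalize_repo(t.repo) not in blocked]


if __name__ == "__main__":
    # Placeholder names only; these are not real benchmark repositories.
    tasks = [
        Task("t1", "https://github.com/Example-Org/Widgets.git"),
        Task("t2", "other-org/tools"),
    ]
    print(drop_benchmark_repos(tasks, ["example-org/widgets"]))
```

Filtering on the whole repository rather than on individual issues or commits is the conservative choice: it removes every task that shares source code with an evaluation set, at the cost of discarding some tasks that would not have leaked.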
We used 64K and 128K context lengths for the first and second stages of SFT, respectively. This “long-to-longer” cascade approach (similar to [5, 6]) splits training into two parts: training on valuable shorter data first, which establishes a robust performance baseline, followed by targeted long-context training only on high-quality verified samples. Without multi-stage training, the 20B non-code tokens during the initial training stage often dominated the 1.5B tokens of high-quality code data in later training, producing poorer performance and more behavioral conflicts because data trends differed between stages. Anecdotally, training on a near-complete length distribution of samples produced shorter final trajectories during evaluation than training on a distribution truncated at 64K.
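One way to picture the two-stage cascade is as two filters over a pool of samples, one per stage. The sketch below is hypothetical: it assumes "64K" and "128K" mean 65,536 and 131,072 tokens, and that over-length samples are dropped rather than truncated, neither of which the article states. The `Sample` record and function names are invented for illustration.

```python
from dataclasses import dataclass

# ASSUMPTION: "K" is read as 1024 tokens.
STAGE1_MAX_TOKENS = 64 * 1024
STAGE2_MAX_TOKENS = 128 * 1024


@dataclass(frozen=True)
class Sample:
    """Hypothetical SFT sample."""
    num_tokens: int
    verified: bool  # tool calls and completions checked as executable and correct


def stage1_pool(samples: list[Sample]) -> list[Sample]:
    """Stage 1: the broad mix, limited to the shorter context length."""
    return [s for s in samples if s.num_tokens <= STAGE1_MAX_TOKENS]


def stage2_pool(samples: list[Sample]) -> list[Sample]:
    """Stage 2: only verified samples, up to the longer context length."""
    return [s for s in samples if s.verified and s.num_tokens <= STAGE2_MAX_TOKENS]


if __name__ == "__main__":
    pool = [
        Sample(1_000, False),
        Sample(70_000, True),
        Sample(200_000, True),
        Sample(2_000, True),
    ]
    print(len(stage1_pool(pool)), "samples fit stage 1")
    print(len(stage2_pool(pool)), "samples fit stage 2")
```

In this sketch a verified 70,000-token trajectory is excluded from stage 1 but kept whole in stage 2, which is the kind of sample the longer second-stage context is meant to accommodate.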
Instead of optimising North Mini Code...