Cohere North Mini Code is the company’s first open-source coding model — a 30-billion-parameter mixture-of-experts released on June 9, 2026 that activates only about 3 billion parameters per token, fits on a single H100, and carries a fully permissive Apache 2.0 license. It is the first model in a new “North” family aimed squarely at developers rather than the enterprise buyers Cohere has historically served.

The headline is that a model this small posts a 33.4 on the Artificial Analysis Coding Index — an independent, third-party score — ahead of open models several times its size, including a 120B and a 123B competitor. The catch, and the part most coverage mentions in a single sentence before moving on, is that the same independent lab measured it generating roughly three times the output tokens of comparable models. That verbosity is a real, compounding cost in any high-volume pipeline, and it does not show up in a leaderboard ranking.

This guide covers what actually launched, how the architecture earns its single-H100 economics, where the model genuinely leads and where it does not, why the two Artificial Analysis leaderboards are not the same thing, and a recomputed cost model for the self-hosted versus managed decision that the verbosity number quietly changes.

Key takeaways

01
Cohere's first open-source developer model. North Mini Code 1.0 was released June 9, 2026: a 30B mixture-of-experts with ~3B active parameters per token, 256K context, 64K max generation, Apache 2.0 license, runnable on a single H100 at FP8.
02
Third-party score beats far larger rivals. Artificial Analysis independently scored it 33.4 on the Coding Index, above Qwen3.5, Gemma 4 and Devstral Small 2, and also above the much larger Nemotron 3 Super (120B) and Mistral Small 4 (119B).
03
The verbosity trap is the hidden cost. Artificial Analysis measured ~3× the output tokens of comparable models (75M vs a 25M class median across its eval). Benchmark rankings skip this; high-volume pipelines cannot.
04
Cohere-stated throughput claims need a hedge. Cohere reports 2.8× output throughput and ~30% lower inter-token latency versus Devstral Small 2 under identical hardware. Those are vendor internal tests, not independently confirmed as of June 12, 2026.
05
Designed as a coding sub-agent, not a generalist. It is strong and narrow: an Agentic Index of 21.7 shows it underperforms on non-coding agentic tasks. The production pattern is a frontier orchestrator with North Mini Code handling code execution on cheap fixed hardware.

01 · What Shipped

Cohere’s first model for developers.

On June 9, 2026, Cohere released North Mini Code 1.0 — described in its own materials as the first model in a new “North” family and built specifically for agentic software engineering. It is a meaningful departure: Cohere’s prior models, Command R and Aya, were not coding-focused, and the company has not previously shipped an open-weight coding model. With this release, Cohere becomes the first major enterprise AI vendor to put an open-source coding model on the table.

The model ships under an Apache 2.0 license — full commercial permissibility, including fine-tuning, redistribution, and commercial deployment. Weights are on Hugging Face in both BF16 and an FP8-quantized variant, and the model is also reachable through the Cohere API, Cohere Model Vault, OpenRouter, and OpenCode for teams that want to test without standing up their own hardware.

Open weights

North Mini Code 1.0

30B total · ~3B active · Apache 2.0

A decoder-only MoE Transformer with 128 experts, 8 activated per token via a sigmoid router. Inference compute is comparable to a ~3B dense model. BF16 and FP8 weights are both published.

huggingface.co/CohereLabs/North-Mini-Code-1.0

Managed option

Cohere API & Model Vault

256K context · 64K max generation

Cohere's hosted inference for teams that want to evaluate before self-hosting. North Mini Code is also available on OpenRouter and inside OpenCode. No API pricing was published at launch — check Cohere's pricing page for current rates.

cohere.com/blog/north-mini-code

Release snapshot

North Mini Code 1.0 launched June 9, 2026 under Apache 2.0. It is a 30B mixture-of-experts with ~3B active parameters, a 256K-token context window and a 64K max generation length, runnable on a single H100 at FP8. Cohere co-founder Nick Frosst demoed it on a Mac Studio via MLX at roughly 20 GB of RAM. No hosted API pricing was disclosed at launch — the model is open-weight, so do not assume a per-token cloud rate without checking Cohere’s own pricing page.
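The single-H100 and ~20 GB figures can be sanity-checked with back-of-envelope arithmetic. The sketch below counts weight memory only and assumes the 80 GB H100 variant; KV cache, activations, and runtime overhead are left out, so treat the results as lower bounds rather than a sizing guide.

```python
# Back-of-envelope weight memory for a 30B-parameter model.
# Weights only: KV cache, activations, and runtime overhead are NOT included.

TOTAL_PARAMS = 30e9      # 30B total parameters, from the release notes
H100_MEMORY_GB = 80      # assumption: the 80 GB H100 variant


def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # 2 bytes per parameter
fp8_gb = weight_memory_gb(TOTAL_PARAMS, 8)    # 1 byte per parameter

# Average bits per parameter implied if all weights fit in ~20 GB,
# as in the Mac Studio demo (assumes the 20 GB is mostly weights).
implied_bits = 20e9 * 8 / TOTAL_PARAMS

print(f"BF16 weights: {bf16_gb:.0f} GB, fits one H100: {bf16_gb < H100_MEMORY_GB}")
print(f"FP8 weights:  {fp8_gb:.0f} GB, fits one H100: {fp8_gb < H100_MEMORY_GB}")
print(f"~20 GB implies about {implied_bits:.1f} bits per parameter")
```

Note that sparsity does not help here: all 30B parameters have to be resident in memory even though only ~3B are used per token. The mixture-of-experts design reduces compute per token, not the weight footprint.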

Cohere’s direct comparison model is Mistral’s Devstral Small 2, a 24B dense model. The framing matters: North Mini Code activates only ~3B parameters per inference step against Devstral’s full 24B of dense compute, which is the whole basis of Cohere’s efficiency pitch. It also lands in the same week as a wave of open-source coding releases — including Kimi K2.7-Code from Moonshot AI and GLM-5.2 from Z.ai — so readers sizing up the open field have several alternatives to weigh at once.

02 · Architecture

30B total, ~3B active: single-H100 economics.

North Mini Code is a decoder-only Transformer with a sparse mixture-of-experts feed-forward design: 128 experts total, 8 activated per token through a sigmoid router. The result is 30 billion parameters of capacity but only about 3 billion active per token — so the inference compute is comparable to a ~3B dense model, which is what lets it run on a single H100 at FP8 precision (and at roughly 20 GB of RAM on Apple silicon via MLX). The context window is 256K tokens, large enough to hold a mid-sized multi-file codebase in a single pass, with a 64K maximum generation length.
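To make the routing step concrete, here is a minimal pure-Python sketch of top-8-of-128 selection with sigmoid scores. Only the expert counts and the use of a sigmoid come from the published description; the random logits, the `route` helper, and the renormalisation of gate weights are illustrative assumptions, not Cohere's implementation.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def route(logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Pick the top_k experts by sigmoid score and return their gate weights."""
    scores = [sigmoid(z) for z in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    # Assumption: gates are renormalised to sum to 1 over the chosen experts.
    return {i: scores[i] / total for i in chosen}


random.seed(0)
# Hypothetical router logits for a single token.
token_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route(token_logits)

active_fraction = TOP_K / NUM_EXPERTS
print(f"experts used for this token: {sorted(gates)}")
print(f"fraction of experts active: {active_fraction:.4f}")
```

Eight of 128 experts is 6.25% of the expert pool, while ~3B of 30B is about 10% of all parameters. The gap is presumably the always-on parts of the network (attention, embeddings, the dense layer), which every token passes through regardless of routing.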

The attention stack is a hybrid: an interleaved 3:1 ratio of sliding-window attention (with RoPE) and global attention (with no positional embeddings), following the “RoPE to NoPE” hybrid-attention approach, with a single dense layer ahead of the sparse layers and SwiGLU activations in the feed-forward blocks. Post-training is two-stage supervised fine-tuning followed by reinforcement learning with verifiable rewards (RLVR), using a CISPO algorithm against binary rewards from unit-test verifiers across more than 70,000 verifiable tasks in roughly 5,000 repositories, deduplicated against SWE-Bench.
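The 3:1 interleave is easier to picture as a per-layer schedule. The sketch below is illustrative only: the layer count, the window size, and the choice to place the global layer last in each group of four are assumptions, since the source states only the ratio.

```python
def attention_schedule(num_layers: int, local_per_global: int = 3) -> list[str]:
    """Label each layer 'sliding' or 'global' at a local_per_global : 1 ratio."""
    period = local_per_global + 1
    # Assumption: the global layer closes each group of `period` layers.
    return ["global" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


def can_attend(query_pos: int, key_pos: int, kind: str, window: int = 4) -> bool:
    """Causal attention mask for one query/key pair under each layer type."""
    if key_pos > query_pos:
        return False  # causal: never look ahead
    if kind == "global":
        return True   # global layers see the full prefix
    return query_pos - key_pos < window  # sliding layers see only recent tokens


schedule = attention_schedule(8)  # 8 layers is a hypothetical count
print(schedule)
print(can_attend(10, 2, "sliding"), can_attend(10, 2, "global"))
```

The practical effect is that three out of every four layers pay a cost bounded by the window size rather than the full sequence length, which is one plausible reading of why a 256K context stays tractable.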

Sparse capacity: 128 experts, 8 active

A sigmoid router activates 8 of 128 experts per token, giving 30B total capacity at ~3B active compute. The efficiency case rests entirely on this sparsity versus a 24B dense competitor.

MoE · ~3B active

Context: 256K tokens in a single pass

Large enough to hold a mid-sized multi-file codebase without chunking, with a 64K maximum generation length. The interleaved sliding-window plus global attention stack keeps long context tractable.

64K max generation

RL training: 70K+ verifiable tasks

A single multi-environment RLVR run across ~5,000 repositories used binary rewards from unit-test verifiers, with 512 rollouts per batch and a group size of 8, deduplicated against SWE-Bench.

CISPO · unit-test rewards

"North Mini Code is the first model in Cohere's new family of models, and is specifically designed and trained for agentic software engineering tasks."— Team Cohere, Hugging Face blog, June 9, 2026

One detail in the training story is worth pulling out because it explains the model’s headline strength. Cohere trained against three distinct agent scaffolds — SWE-Agent (a rich CLI), mini-SWE-Agent (a single bash tool), and OpenCode (individually typed tools returning structured JSON). Adding only about 6% cross-harness data to the SFT mix produced a ~10 percentage-point gain on the OpenCode evaluation while holding SWE-Agent performance steady. The design goal, in Cohere’s words from its documentation, is that performance should generalize across agent scaffolds rather than be tuned to a single one — which is exactly what you want from a model meant to slot into whatever harness your team already runs.

03 · Benchmarks

Where it leads and where it trails.

The agentic-coding benchmark profile is genuinely strong for the size class. On SWE-Bench Verified, the released model resolves 67.6% of tasks at pass@1 using the SWE-Agent harness. It posts 40.2% on SWE-Bench Pro, 36% on Terminal-Bench v2 at pass@1 with the Terminus-2 harness, and 61.0% on the mini-SWE-Agent benchmark — the last of which Cohere notes “emerged for free” from multi-harness training rather than being directly optimized. The chart below maps those scores; the SWE-Bench Verified bar is the one to anchor on.

North Mini Code · agentic coding benchmarks (pass@1)

Source: Hugging Face model card & launch blog, June 9, 2026

SWE-Bench Verified (pass@1 · SWE-Agent harness v1.1.0): 67.6%

mini-SWE-Agent (pass@1 · emerged from multi-harness training): 61.0%

SWE-Bench Pro (pass@1): 40.2%

Terminal-Bench v2 (pass@1 · Terminus-2 harness): 36%

Read the metric, not just the number

A widely repeated figure is 80.2%, but that is the SFT-only checkpoint’s pass@10 on SWE-Bench Verified, before the final RLVR stage. The released model’s SWE-Bench Verified score is 67.6% at pass@1. Pass@10 (ten attempts, counted as solved if any one of them succeeds) and pass@1 (one attempt) are different evaluations. RLVR added 3.0 percentage points on SWE-Bench and 7.9 points on Terminal-Bench v2. Those are real gains, but the 80.2% is not the model’s one-shot number.
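The gap between the two metrics is easy to see with the unbiased pass@k estimator commonly used for code benchmarks. This is a sketch with made-up per-task counts; the source does not say whether Cohere computed its figures with this exact estimator.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them passed, budget of k."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 attempts, 3 of which pass the unit tests.
p1 = pass_at_k(n=10, c=3, k=1)
p10 = pass_at_k(n=10, c=3, k=10)

print(f"pass@1  = {p1:.2f}")
print(f"pass@10 = {p10:.2f}")
```

A task the model solves three times in ten counts as roughly 0.3 toward pass@1 but as a full 1.0 toward pass@10, which is why a pass@10 number will always look better than the one-shot number for the same model.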

The reason the size-to-score ratio looks so good is the independent verdict from Artificial Analysis: a Coding Index of 33.4, ahead of Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), and the direct competitor Devstral Small 2 (24B dense) — and, more surprisingly, ahead of substantially larger open models including Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B). On output speed, Artificial Analysis measured roughly 210 tokens per second (8th of 127 open-weight models it tracks) with a 0.25-second time to first token against a class median of 1.95 seconds. The matrix below puts the comparison in one place.

Open-weight coding-model comparison for June 2026 — total parameters, active parameters, context window, license, minimum hardware, Artificial Analysis Coding Index, and a verbosity flag — with North Mini Code 1.0 measured against Devstral Small 2, Qwen3.5, Gemma 4, Nemotron 3 Super, Mistral Small 4, and Devstral 2. Coding Index scores are from Artificial Analysis; specs are from Hugging Face model cards and vendor blogs, retrieved June 13, 2026.
Model	Total	Active	Context	License	Min hardware	AA Coding Index	Verbosity
North Mini Code 1.0	30B	~3B	256K	Apache 2.0	1× H100 (FP8)	33.4	~3× median (flagged)
Devstral Small 2	24B	24B (dense)	—	Open	Single GPU	Below NMC	Baseline class
Qwen3.5 35B-A3B	35B	~3B	—	Open	Single GPU	Below NMC	—
Gemma 4 26B-A4B	26B	~4B	—	Open	Single GPU	Below NMC	—
Nemotron 3 Super 120B-A12B	120B	~12B	—	Open	Multi-GPU	Below NMC	—
Mistral Small 4 119B-A6B	119B	~6B	256K	Apache 2.0	Multi-GPU	Below NMC	—
Devstral 2 (123B)	123B	123B (dense)	—	Open	Multi-GPU	Below NMC	—

No existing source we found combines hardware, license, and verbosity in one table alongside the Coding Index — most comparisons stop at two or three benchmark columns and quietly drop the cost dimensions that decide a self-hosting question. The Coding Index scores for the comparison models are reported by Artificial Analysis as below North Mini Code’s 33.4; where we did not have an exact published figure for a cell, we have marked it rather than guess. That honesty matters more here than table density.

04 · Two Leaderboards

The two Artificial Analysis indexes are not the same.

This is the single most common source of confusion in the coverage, and getting it wrong will mislead a buyer. Artificial Analysis publishes more than one score, and they measure different things. The 33.4 figure is the Coding Index — a weighted average of Terminal-Bench Hard (agentic terminal tasks) and SciCode (scientific code generation). That is distinct from the broader Intelligence Index, where North Mini Code scores 27.6 (above gpt-oss-20B at 24.5, just below Mistral Small 4 at 27.8, ranking 18th of 127 comparable open-weight models). And both are distinct again from the Agentic Index, where it scores only 21.7.

The Agentic Index number is the honest counterweight to the headline. North Mini Code underperforms on non-coding agentic tasks — the τ²-Bench Telecom component, weighted at 37% of that index, drags the score down. The takeaway is precise: this is a strong, narrow coding model, not a general-purpose agent. The Coding Agent Index (built on DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA) is yet another separate leaderboard. If you see a single “score” quoted without naming which index it is, treat it as unverified.

Three different Artificial Analysis scores · same model

Source: Artificial Analysis, retrieved June 13, 2026

Coding Index (Terminal-Bench Hard + SciCode · the headline strength): 33.4

Intelligence Index (general capability · 18th of 127 open-weight models): 27.6

Agentic Index (non-coding agentic tasks · the honest weak spot): 21.7

05 · The Verbosity Trap

The hidden cost benchmark rankings skip.

Here is the part that almost every write-up mentions once and then drops. To complete the Intelligence Index evaluation, Artificial Analysis measured North Mini Code generating 75 million output tokens against a class median of 25 million — roughly three times the output volume of comparable models. This is independently reported, not vendor-stated. A leaderboard ranking treats those extra tokens as free; a production pipeline does not.

Verbosity compounds in two directions at once. Every extra output token is more inference cost and more latency, and in a high-volume agentic pipeline those two effects multiply. On a managed API where you pay per token, 3× the output is 3× the bill for the same task. On self-hosted hardware the dollar cost is fixed, but the verbosity shows up instead as GPU utilization and queue depth — you process fewer tasks per hour on the same H100. Either way, the 33.4 Coding Index does not tell you about it.

"Verbosity is a hidden pipeline cost that benchmarks do not surface. Artificial Analysis measured North Mini Code generating three times the output tokens of comparable models. That verbosity compounds across inference cost and latency in high-volume pipelines."— VentureBeat editorial analysis, June 9, 2026

None of this makes North Mini Code a bad model — its speed (~210 tokens per second) partly offsets the latency penalty, and on fixed self-hosted hardware the marginal cost of extra tokens is zero in dollar terms. But it does mean the right way to evaluate this model is on your own workload, measuring total tokens emitted per completed task, not on its leaderboard position. The next section turns the verbosity number into the cost decision it actually drives.
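Measuring tokens per completed task takes only a few lines once you log each run. The sketch below uses a hypothetical `TaskRun` record and invented numbers; the one sourced figure is the 75M versus 25M comparison from Artificial Analysis.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One agent run from your own logs (hypothetical record shape)."""
    output_tokens: int
    completed: bool


def tokens_per_completed_task(runs: list[TaskRun]) -> float:
    """Total output tokens, including failed runs, per successfully completed task."""
    completed = sum(1 for r in runs if r.completed)
    if completed == 0:
        raise ValueError("no completed tasks in this sample")
    return sum(r.output_tokens for r in runs) / completed


# Invented sample: two successes and one failure.
runs = [TaskRun(12_000, True), TaskRun(9_000, False), TaskRun(15_000, True)]
tpct = tokens_per_completed_task(runs)

# Sourced: 75M output tokens versus a 25M class median on the same evaluation.
verbosity_ratio = 75e6 / 25e6

print(f"tokens per completed task: {tpct:,.0f}")
print(f"verbosity ratio vs class median: {verbosity_ratio:.1f}x")
```

Counting the tokens from failed runs in the numerator is deliberate: they cost money or GPU time either way, so a verbose model that also fails often looks worse on this metric than on raw tokens per attempt.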

06 · Cost Model

Self-hosted vs managed, with the verbosity tax priced in.

VentureBeat framed the launch around a real architectural decision: a frontier managed model such as Claude Fable 5 is priced at $50 per million output tokens, while North Mini Code runs on a single H100 you control. To make that tradeoff concrete, the table below models a team running 1,000 agentic coding tasks per day. Two inputs are sourced — the $50/M managed rate and the ~3× verbosity ratio — and two are illustrative or estimated and flagged as such: a 4,000-token class-median output per task, and an H100 lease at roughly $2.50/hour. Recompute every cell from the formula in its note before relying on it for your own numbers.

Modeled monthly cost for 1,000 agentic coding tasks per day, comparing a class-median model on a managed API, North Mini Code's token volume billed at the same managed rate, and North Mini Code self-hosted on a single H100. The $50 per million output tokens managed rate and the 3x verbosity ratio are sourced; the 4,000-token median per-task output and the $2.50/hour H100 lease are illustrative or estimated and flagged. Derived cells are recomputed from the formula shown in each row's note.
Scenario	Output tokens / task	Monthly output (30 days)	Modeled monthly cost	How it is derived
Class-median model · managed API	4,000 (illustrative)	120M	~$6,000 / mo	120M ÷ 1M × $50 = $6,000. Scales linearly with volume.
North Mini Code token-volume · managed-rate equivalent	12,000 (3× verbosity)	360M	~$18,000 / mo	360M ÷ 1M × $50 = $18,000. The verbosity tax, if billed per token.
North Mini Code · self-hosted single H100	12,000 (3× verbosity)	360M	~$1,800 / mo	$2.50/hr (est.) × 24 × 30 = $1,800. Fixed cost — token count does not move it.

The math is deliberately simple so you can audit it. At 1,000 tasks per day and a modeled 4,000 median output tokens each, a class-median model emits 120 million output tokens a month, which at $50/M is about $6,000. North Mini Code’s ~3× verbosity pushes the same workload to 360 million tokens, or about $18,000 a month if it were billed at that managed rate. Self-hosted on a single H100 at an estimated $2.50/hour, the cost is fixed at roughly $1,800 a month no matter how many tokens the model emits. That is the real shape of the decision: the verbosity that would triple a per-token bill is irrelevant on fixed hardware, which is precisely why the open-weight economics can win for high-volume teams, provided you can keep the H100 busy.
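The same arithmetic as a small script, so the inputs can be swapped for your own. The $50/M rate and the 3× ratio are the sourced inputs; the 4,000-token median and the $2.50/hour lease are the illustrative figures from the table, and the break-even lines are an extra derived from them rather than a number from any source.

```python
TASKS_PER_DAY = 1_000
DAYS_PER_MONTH = 30
MEDIAN_TOKENS_PER_TASK = 4_000   # illustrative, as flagged in the table
VERBOSITY_RATIO = 3              # sourced: ~3x the class median
MANAGED_USD_PER_M_TOKENS = 50.0  # sourced managed output rate
H100_USD_PER_HOUR = 2.50         # estimated lease price


def monthly_tokens(tokens_per_task: int) -> int:
    return TASKS_PER_DAY * DAYS_PER_MONTH * tokens_per_task


def managed_cost_usd(tokens: int) -> float:
    return tokens / 1_000_000 * MANAGED_USD_PER_M_TOKENS


verbose_tokens_per_task = MEDIAN_TOKENS_PER_TASK * VERBOSITY_RATIO

median_cost = managed_cost_usd(monthly_tokens(MEDIAN_TOKENS_PER_TASK))
verbose_cost = managed_cost_usd(monthly_tokens(verbose_tokens_per_task))
self_hosted_cost = H100_USD_PER_HOUR * 24 * DAYS_PER_MONTH

# Monthly output volume at which the fixed H100 cost equals the per-token bill.
break_even_tokens = self_hosted_cost / MANAGED_USD_PER_M_TOKENS * 1_000_000
break_even_tasks_per_day = break_even_tokens / (verbose_tokens_per_task * DAYS_PER_MONTH)

print(f"class median, managed:       ${median_cost:,.0f}/mo")
print(f"3x verbosity, managed rate:  ${verbose_cost:,.0f}/mo")
print(f"self-hosted single H100:     ${self_hosted_cost:,.0f}/mo")
print(f"break-even: {break_even_tokens / 1e6:.0f}M tokens/mo "
      f"(~{break_even_tasks_per_day:.0f} tasks/day at 12,000 tokens each)")
```

Under these assumptions the fixed-cost box pays for itself at about 36 million output tokens a month, roughly 100 of the modeled tasks per day. Below that volume, a per-token bill at the managed rate would be the cheaper option.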

Two honest caveats keep this from being a slam dunk. The per-task token figure is a placeholder — your real number depends on your tasks, and you should measure it before trusting the totals. And a single H100 has finite throughput, so the self-hosted column assumes the box can clear 1,000 tasks a day given the verbosity; if it cannot, you are buying more hardware and the fixed cost rises. This is exactly the kind of comparative, workload-grounded eval our AI and digital transformation engagements start with.
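The throughput caveat can be checked with one division. The sketch below asks what sustained decode rate 1,000 tasks a day at 12,000 output tokens each would require, and compares it with the ~210 tokens/second that Artificial Analysis measured. Treating that figure as what a single self-hosted H100 delivers is an assumption; the calculation also ignores prompt processing, tool-call waits, idle time, and any gains from batching.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_tokens_per_second(tasks_per_day: int, tokens_per_task: int) -> float:
    """Sustained output rate needed to clear the daily workload with no idle time."""
    return tasks_per_day * tokens_per_task / SECONDS_PER_DAY


def max_tasks_per_day(tokens_per_second: float, tokens_per_task: int) -> int:
    """Upper bound on daily tasks at a given sustained output rate."""
    return int(tokens_per_second * SECONDS_PER_DAY // tokens_per_task)


MEASURED_TPS = 210.0  # Artificial Analysis figure; your own hardware may differ

needed = required_tokens_per_second(1_000, 12_000)
ceiling = max_tasks_per_day(MEASURED_TPS, 12_000)
utilization = needed / MEASURED_TPS

print(f"required sustained rate: {needed:.1f} tokens/s")
print(f"ceiling at {MEASURED_TPS:.0f} tokens/s: {ceiling} tasks/day")
print(f"implied utilization: {utilization:.0%}")
```

On these numbers the workload needs about 139 tokens per second around the clock, which leaves limited headroom against a 210 tokens/second single-stream rate. Traffic that arrives in working-hours bursts rather than evenly would tighten that further, so measure on your own box before trusting the fixed-cost column.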

07 · The Production Pattern

The sub-agent case nobody wrote.

Cohere explicitly trained North Mini Code to work as a sub-agent under an orchestrator, and that points at a production architecture most coverage skipped. In a multi-agent coding pipeline, a frontier managed model handles orchestration — planning, decomposition, judgment — while North Mini Code handles the high-volume code execution underneath it on a single H100 you own. You get frontier-grade orchestration on the few expensive calls that need it, and fixed-cost code execution on the many cheap ones that do not. That hybrid is the practical answer for teams that cannot run a $50/M output model at the volume their agents generate.

The matrix below maps where North Mini Code fits — and, just as importantly, where it does not. Its Agentic Index of 21.7 is the guardrail: do not reach for it as a general-purpose agent for non-coding workflows.

Sub-agent execution

Code execution under an orchestrator

Cohere trained it to slot under a frontier orchestrator. Pair a managed model for planning with North Mini Code on a single H100 for the high-volume code edits, terminal tasks, and review passes. Frontier judgment, fixed-cost execution.

Strong fit

Sovereign self-hosting

Data-residency-bound coding

Apache 2.0 plus single-H100 hardware makes on-prem agentic coding viable for banks, governments, and healthcare — Cohere's existing enterprise base. No per-token cloud bill, weights and data stay in your perimeter.

Pick open weights

Local developer use

Run it on a workstation

Quantized for MLX, the model ran in roughly 20 GB of RAM in Cohere's Mac Studio demo. Strong for local code assistants where you want no API dependency and no token meter running.

Tractable locally

General agentic work

Non-coding multi-step tasks

The Agentic Index of 21.7 says it plainly: this is a narrow coding specialist, not a generalist. For non-coding agentic workflows, stay with a broader model and route only the coding subtasks here.

Use something broader
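The fit matrix above reduces to a routing rule: coding subtasks go to the local model, everything else stays with the broader one. A minimal sketch follows; the subtask labels and backend names are hypothetical placeholders rather than identifiers from Cohere or any framework.

```python
from dataclasses import dataclass

# Hypothetical labels an orchestrator might attach to each subtask.
CODING_KINDS = {"code_edit", "terminal", "code_review"}

LOCAL_CODER = "local-north-mini-code"     # placeholder name for the self-hosted model
FRONTIER_MODEL = "frontier-orchestrator"  # placeholder name for the managed model


@dataclass
class Subtask:
    kind: str
    prompt: str


def pick_backend(task: Subtask) -> str:
    """Send coding work to the fixed-cost local model, the rest to the frontier model."""
    return LOCAL_CODER if task.kind in CODING_KINDS else FRONTIER_MODEL


plan = [
    Subtask("planning", "Break the migration into steps"),
    Subtask("code_edit", "Rename the config loader and update call sites"),
    Subtask("terminal", "Run the test suite and report failures"),
    Subtask("customer_reply", "Draft a status update for the client"),
]

assignments = [(t.kind, pick_backend(t)) for t in plan]
for kind, backend in assignments:
    print(f"{kind:>15} -> {backend}")
```

In a real pipeline the classification would come from the orchestrator's own plan rather than a fixed set of labels, but the cost logic is the same: the few planning calls go to the per-token model, and the many execution calls go to the hardware you already pay for.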

If your team is building this kind of pipeline, the harness around the models is usually an open-source coding assistant: North Mini Code running as a local model would plug into a tool like Continue.dev. For the broader build-versus-buy framing on running models inside your own perimeter, our guide to self-hosting open-source AI models covers the deployment economics in full.

08 · Where To Run It

Weights, API, and inside your IDE.

Four surfaces are live: open weights on Hugging Face (BF16 and an FP8-quantized variant) for self-hosting and fine-tuning; the Cohere API and Cohere Model Vault for managed inference; OpenRouter for routed access; and OpenCode for using it directly inside an agentic coding workflow. The Cohere-stated efficiency claims below are the vendor’s own internal tests: useful as a directional signal, but treat them as such until independently verified.

Cohere-stated efficiency claims (vendor)

Cohere reports North Mini Code at 2.8× higher output throughput and ~30% lower inter-token latency than Devstral Small 2, measured under identical hardware configurations. These are Cohere internal tests, not independently confirmed as of June 12, 2026 — Artificial Analysis measured ~210 tokens/second for North Mini Code on its own benchmark but did not publish a head-to-head Devstral comparison in the same environment. Use the vendor numbers as direction, not as settled fact.

Self-host

Hugging Face weights

BF16 + FP8 quantized · Apache 2.0

Download the weights, run on a single H100 at FP8 (or ~20 GB of RAM on Apple silicon via MLX), fine-tune on proprietary code, quantize for deployment. The path for sovereignty-bound and high-volume teams.

huggingface.co/CohereLabs/North-Mini-Code-1.0

Managed & routed

Cohere API · OpenRouter · OpenCode

256K context · no launch pricing disclosed

Cohere API and Model Vault for hosted inference, OpenRouter for routed access, OpenCode for in-IDE agentic use. No per-token API pricing was published at launch — check Cohere's pricing page for current rates.

cohere.com/blog/north-mini-code

09 · Strategy Signal

Cohere’s pivot toward developers.

Step back from the spec sheet and the release reads as a strategic signal. Cohere built its business on enterprise sovereign AI — banks, governments, healthcare — anchored on data-residency guarantees, with Command R and Aya as its flagship models. North Mini Code is the first model aimed at developers rather than enterprise procurement, and the new “North” family name marks a deliberate new product line. The sovereign-AI proposition now extends to developer teams: run a capable coding model entirely inside your own perimeter, under a permissive license, on hardware you control.

Read forward, this looks like Cohere trying to build developer ecosystem gravity the way Mistral did with Devstral — before the open-source coding race gets away from it. The same week brought Kimi K2.7-Code and GLM-5.2, and the field is crowding fast. An Apache 2.0 model that wins its weight class on an independent coding index, runs on one H100, and is explicitly built to act as a sub-agent is a credible opening move. The open question is whether Cohere sustains a release cadence and tooling story strong enough to keep developers once the novelty of the first “North” model fades.

"Its small, cost effective, apache 2.0, and locally deployable. This is the way LLMs should go. small, open source, transparent and sovereign, vs large, expensive, proprietary and hegemonic."— Nick Frosst, co-founder, Cohere (post on X, June 9, 2026)

10 · Conclusion

A strong, narrow model with an honest catch.

The shape of open coding, June 2026

The leaderboard win is real; the cost picture lives in the output-token column.

North Mini Code is a genuinely impressive release for its size: a 30B mixture-of-experts that activates ~3B parameters, runs on a single H100, ships under Apache 2.0, and posts an independent Coding Index of 33.4 ahead of open models several times larger. For teams that need a self-hostable, sovereign coding model — or a cheap fixed-cost sub-agent under a frontier orchestrator — it is one of the strongest options available today.

The discipline this release demands is reading the whole picture. The 33.4 is the Coding Index, not the Intelligence Index (27.6) or the Agentic Index (21.7) — and the gap to 21.7 is the model telling you it is a coding specialist, not a generalist. The throughput and latency advantages over Devstral are Cohere-stated, not yet independently confirmed. And the verbosity — roughly 3× the output tokens of comparable models, the one figure benchmarks omit — is the cost that decides whether self-hosting actually pays off for your volume.

The broader signal is that the open-weight coding race is now a cost-and-control argument as much as a capability one. When a 30B model wins its class on an independent index and runs on a single GPU you own, the question shifts from which model is smartest to which model is cheap enough to run the workload you actually have. The right move is the same one it always is: benchmark on your own tasks, measure tokens per completed task, and decide per-workload — not per-headline.

Cohere North Mini Code: An Open 30B Agentic Coding Model