The best open-weight coding models you can self-host in 2026 now land roughly 60–72% on SWE-bench Verified — close enough to be genuinely useful, far enough from the 80–95% frontier cloud coders that you should choose your hardware and model with eyes open. The decision that matters is not “which open model is best” but “which open model fits the box I can actually buy.”

That box question got harder this year. The DRAM and GDDR7 shortage pushed workstation GPU prices up and erased the big-memory Mac configurations: the Mac Studio M3 Ultra now maxes out at 96GB, and the “large-memory Mac for local AI” is the M5 Max 128GB laptop rather than a 256GB or 512GB Studio. Meanwhile the models themselves ballooned — GLM-5.2 ships 744 billion parameters under an MIT license, and no single desk-side machine comes close to running it.

This guide does the arithmetic so you do not have to: how much VRAM each model needs at 4-bit, why memory bandwidth — not capacity — sets your token speed, and a single matrix that pairs every 2026 open coder with the hardware tier that runs it. We will be precise about what fits a single 96GB card, honest about what only runs in the cloud, and clear about the policy reason almost every leading open coder now comes from a Chinese lab. Every number traces to a primary source; verify the volatile ones before you spend.

Key takeaways

01
Open coders are useful, but the gap to frontier is real. Self-hostable open-weight models top out around 71–72% SWE-bench Verified (Qwen3-Coder-Next, Devstral 2), while closed frontier coders sit at 80–95%, with the charted leaders at roughly 89–95%. Against those leaders the gap is roughly 17–27 points: meaningful, not unbridgeable.
02
A single 96GB card is the real self-hosting ceiling. An RTX PRO 6000 Blackwell (96GB) runs Qwen3-Coder-Next (80B MoE) comfortably or Devstral 2 (123B dense) tightly at 4-bit. A 744B MoE like GLM-5.2 does not fit one box, because all weights must stay resident in GPU memory regardless of active-parameter count.
03
Bandwidth, not memory, decides how fast it feels. Decode is memory-bound. A 128GB DGX Spark (273 GB/s) generates tokens far slower than a 128GB M5 Max (614 GB/s), which is slower again than a 96GB RTX PRO 6000 (1.79 TB/s). Buy bandwidth for interactive coding; buy capacity only when a model will not otherwise load.
04
SWE-bench Verified and SWE-bench Pro are not the same test. GLM-5.2 reports a vendor SWE-bench Pro score (62.1%); most others report SWE-bench Verified. Pro is a harder problem set, so never line the two up as if they were comparable. We split them into separate columns for exactly this reason.
05
The open coding leaderboard is overwhelmingly Chinese. US export controls gate the most capable closed weights but explicitly exempt published open weights. The structural result: the leading open coders (Qwen, DeepSeek, GLM, Kimi) come from Chinese labs, several trained on non-NVIDIA silicon to stay export-proof.

01 · The Gap
Open coders are good. Frontier cloud is still ahead.

Start with the honest framing, because most roundups bury it. The best coding model you can download and run yourself is no longer a toy — but it is not the best coding model, full stop. On SWE-bench Verified, the standard agentic-coding benchmark, the strongest self-hostable open-weight models land in the low 70s, while the closed frontier sits in the high 80s to mid 90s. The chart below puts the two groups side by side.

SWE-bench Verified · self-hostable open weights vs closed frontier
Model	Group	SWE-bench Verified	Deployment
Closed frontier (best), leaderboard range	Closed frontier	~95%	Cloud
GPT-5.5 / Opus 4.8	Closed frontier	~89%	Cloud
DeepSeek V4-Pro (1.6T total)	Open weights	80.6%	Cloud-scale only
Devstral 2 (123B)	Open weights	72.2%	One 96GB card, tight
Qwen3-Coder-Next (80B/3B)	Open weights	~71.3%	One 96GB card, comfortable
Nemotron 3 Super (120B/12B)	Open weights	60.5%	One 96GB card

Source: BenchLM / morphllm SWE-bench Verified, vendor reports, June 2026. Leaderboard positions shift; verify on swebench.com.

Two things are true at once. First, the trend line is steep: Devstral Small 2, a 24B model that runs on a single consumer GPU, scores 68.0% SWE-bench Verified (vendor-reported) — higher than much larger models managed eighteen months ago. Second, the absolute frontier has not stood still. Closed coders verify in the high 80s and above, so the gap between “best thing on my desk” and “best thing behind an API” is roughly 17 to 27 percentage points depending on which models you line up. For a developer, that gap is the difference between a model that closes most well-scoped tickets and one that closes the gnarly ones too.
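
The 17 to 27 point range can be reproduced with a few subtractions. The sketch below is one possible lineup, not the only one: it assumes the two approximate frontier readings from the chart (~95% and ~89%) and three single-box models quoted in this section. Which models you include is a choice, and adding a lower-scoring open model would widen the range.

```python
# SWE-bench Verified scores (percent) as quoted in this section.
# The two frontier values are approximate chart readings.
FRONTIER = {"Closed frontier (best)": 95.0, "GPT-5.5 / Opus 4.8": 89.0}
SINGLE_BOX = {
    "Devstral 2 (123B)": 72.2,
    "Qwen3-Coder-Next (80B/3B)": 71.3,
    "Devstral Small 2 (24B)": 68.0,
}


def gap_table(frontier, local):
    """Map (frontier model, local model) to the gap in percentage points."""
    return {
        (f_name, l_name): round(f_score - l_score, 1)
        for f_name, f_score in frontier.items()
        for l_name, l_score in local.items()
    }


gaps = gap_table(FRONTIER, SINGLE_BOX)
smallest_gap = min(gaps.values())
largest_gap = max(gaps.values())

# Print every pairing, narrowest gap first.
for (f_name, l_name), points in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{f_name} vs {l_name}: {points:.1f} points")
```

With this lineup the narrowest pairing is the ~89% frontier entry against Devstral 2, and the widest is the ~95% entry against Devstral Small 2.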

The interpretation we draw: open-weight self-hosting in 2026 is a cost, privacy, and control decision far more than a raw-capability one. If your workload is bounded — internal tooling, code review, test generation, refactors on a known codebase — a single-card open model is now a credible primary. If you are pushing the hardest autonomous agentic coding, the honest move is to keep the frontier API in the loop and route only the cheap, high-volume, or sensitive-data work to a local model. For the largest open release of the quarter, see our DeepSeek V4 migration guide; if you are still deciding whether to self-host at all, our self-hosting deployment decision guide walks the full build-versus-rent tree.

02 · Benchmarks
Read the benchmark before you trust the number.

The single most common mistake in 2026 model comparisons is treating SWE-bench Verified and SWE-bench Pro as the same scale. They are not. SWE-bench Verified is the widely-cited human-validated subset; SWE-bench Pro, from Scale AI, uses a harder, scaled problem set, so a 62% on Pro is not “worse” than a 71% on Verified — they are different exams. A handful of models report only one or the other, which makes naive cross-comparison actively misleading. The table below keeps them in separate columns on purpose.

2026 open-weight coding models split by which SWE-bench variant they report — Verified versus the harder Scale AI Pro set — plus whether each fits a single self-hosting box. Scores are vendor-reported unless noted; benchmark variants differ and must not be cross-compared. Current as of June 2026.
Model (total / active)	SWE-bench Verified	SWE-bench Pro	Single-box self-host?
Report SWE-bench Verified
DeepSeek V4-Pro (1.6T / 49B)	80.6%	—	No — cloud / cluster only
Devstral 2 (123B dense)	72.2%*	—	Yes — 96GB card, tight
Qwen3-Coder-Next (80B / 3B)	~71.3%‡	—	Yes — 96GB card, comfortable
Nemotron 3 Ultra (550B / 55B)	65–71.9%	—	No — 550B total
Devstral Small 2 (24B dense)	68.0%*	—	Yes — 24GB consumer GPU
Nemotron 3 Super (120B / 12B)	60.47%	—	Yes — 96GB card
Nemotron 3 Nano (30B / 3B)	38.8%†	—	Yes — 16GB GPU
Report SWE-bench Pro (harder — do not cross-compare)
GLM-5.2 (744B / 40B)	—	62.1%*	No — 4× H200 minimum
Qwen3-Coder-480B (480B / 35B)	—	38.7%	No — multi-GPU

* vendor-reported · † OpenHands scaffold · ‡ ~71.3% via OpenHands, technical report
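
If you track these scores in your own tooling, the separate-columns rule can be enforced in code. The following is a minimal sketch: the `Score` record and `compare` helper are hypothetical names introduced here for illustration, and the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    variant: str            # "verified" or "pro"
    percent: float
    vendor_reported: bool = False


def compare(a: Score, b: Score) -> float:
    """Return a.percent - b.percent, refusing cross-variant comparisons."""
    if a.variant != b.variant:
        raise ValueError(
            f"cannot compare {a.model} ({a.variant}) with {b.model} ({b.variant})"
        )
    return round(a.percent - b.percent, 1)


devstral2 = Score("Devstral 2", "verified", 72.2, vendor_reported=True)
qwen_next = Score("Qwen3-Coder-Next", "verified", 71.3)
glm = Score("GLM-5.2", "pro", 62.1, vendor_reported=True)

print(compare(devstral2, qwen_next))   # same variant, so a difference is meaningful

try:
    compare(devstral2, glm)            # Verified vs Pro: rejected by design
except ValueError as err:
    print("refused:", err)
```

Raising an error rather than returning a number is a deliberate choice here: a silent cross-benchmark delta is exactly the mistake the table layout is meant to prevent.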

Benchmark hygiene

GLM-5.2 currently has only a vendor-stated SWE-bench Pro figure (62.1%) and no independent SWE-bench Verified result — so it cannot be ranked against the Verified column above without an asterisk. Treat all single-source vendor scores (GLM-5.2, Devstral 2, Devstral Small 2) as provisional until third-party audits land. When a model only reports a benchmark its own lab created, discount accordingly. Our companion GLM-5.2 coding benchmarks breakdown digs into what is and is not independently confirmed.

03 · VRAM Math
What actually fits on one card.

The rule that governs self-hosting is blunt: every parameter has to live in memory, all the time. At 4-bit, model weights cost roughly half a byte per parameter plus overhead — so a 30B model needs about 15GB, an 80B model about 40–46GB, and a 123B dense model about 62GB. On top of the weights you need room for the KV cache, which grows with context length. For a 70B-class dense model, a 128K-token context at FP16 can demand around 40GB of KV cache on its own. That second number is the one most buyers forget.
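
The half-a-byte rule is easy to check by hand. The sketch below computes the raw weight footprint only, before runtime overhead. The 4.6 bits-per-weight value in the last line is an assumption back-solved to match the ~46GB figure quoted for an 80B model, not a published specification of any quantization format.

```python
def weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Raw weight footprint in GB (decimal): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8.0


for label, params in [("30B", 30), ("80B", 80), ("123B dense", 123), ("744B", 744)]:
    print(f"{label}: ~{weight_gb(params):.1f} GB at 4-bit, before overhead")

# Mixed-precision 4-bit formats average more than 4 bits per weight.
# 4.6 is an assumed effective value, chosen only to illustrate the 40-46GB range.
print(f"80B at an assumed 4.6 bits/weight: ~{weight_gb(80, 4.6):.0f} GB")
```

The KV cache is not included in this function; it scales with context length and needs its own estimate.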

This is also where the popular “32B is the ceiling for consumer self-hosting” claim needs splitting in two. For a dense model carrying a full 128K FP16 context, the comfortable single-96GB-card ceiling really does land near a 32B-class coder, because the dense KV cache eats tens of gigabytes on top of the weights. A 123B dense model like Devstral 2 loads its weights in ~62GB and leaves ~34GB — enough for a constrained context with a quantized KV cache, but not a relaxed full-length session. That headroom math is exactly why Mistral recommends a minimum of four H100-class GPUs for production serving, even though the model technically fits one 96GB card for single-user use.
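
To see where a figure like ~40GB of KV cache comes from, here is a rough estimator. The layer count, KV-head count, and head dimension are assumptions describing a generic 70B-class dense transformer with grouped-query attention; they are not taken from the article or from any specific model card, so substitute the real values for the model you plan to run.

```python
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GB (decimal) for a single sequence.

    Each layer stores two tensors (K and V), each holding
    kv_heads x head_dim values per token.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


# Assumed 70B-class dense shape (hypothetical, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT = 128 * 1024  # 128K tokens

fp16 = kv_cache_gb(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
int8 = kv_cache_gb(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)

print(f"FP16 KV cache at 128K tokens: ~{fp16:.1f} GB")
print(f"8-bit KV cache at 128K tokens: ~{int8:.1f} GB")
```

Under these assumptions the FP16 cache lands in the low 40s of GB, and an 8-bit cache halves it, which is why a quantized KV cache is what makes a tight fit workable.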

The exception that breaks the 32B rule is architecture. Qwen3-Coder-Next is an 80B Mixture-of-Experts model with only 3B active parameters and a linear-attention design, so its KV cache is dramatically smaller than a dense transformer of the same context length. At Q4_K_M it needs only ~46GB for weights and leaves ~50GB of headroom — enough for a very long context on a single 96GB card. Never apply dense-model KV-cache math to an efficient-attention MoE, and never confuse active parameters with VRAM: a 744B/40B MoE still has to keep all 744B weights resident. For the full quantization, KV-cache, and context arithmetic, see our companion VRAM, quantization, and KV-cache guide, and for how that scales with context, our look at Qwen’s long-context approach.
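
The active-versus-resident distinction can be sketched in a few lines. The weight footprints are the ones quoted in this section; the `Model` record and helper names are hypothetical. The second helper exists only to show the wrong estimate you get by sizing memory from active parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    total_b: float    # total parameters, billions
    active_b: float   # parameters active per token, billions
    q4_gb: float      # 4-bit weight footprint quoted in the article, GB


def headroom_gb(model: Model, card_gb: float) -> float:
    """Memory left for KV cache and runtime once ALL weights are resident."""
    return card_gb - model.q4_gb


def naive_active_only_gb(model: Model) -> float:
    """The WRONG estimate: sizing memory from active parameters at 4-bit."""
    return model.active_b * 0.5


MODELS = [
    Model("Qwen3-Coder-Next", 80, 3, 46),
    Model("Devstral 2", 123, 123, 62),
    Model("GLM-5.2", 744, 40, 372),
]

for m in MODELS:
    room = headroom_gb(m, 96)
    verdict = "fits" if room > 0 else "does not fit"
    print(f"{m.name}: {verdict} on 96GB, headroom {room:+.0f} GB "
          f"(active-only guess would claim {naive_active_only_gb(m):.1f} GB)")
```

For GLM-5.2 the active-only guess suggests about 20GB, while the resident weights overshoot a 96GB card by hundreds of gigabytes.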

Runs anywhere

30B-class coder at Q4

~15GB

Qwen3-Coder-30B-A3B (~64% SWE-bench Verified, approximate) or Nemotron 3 Nano fit in roughly 15GB — comfortable on a 24–32GB consumer GPU like an RTX 5090, with room for context.

RTX 5090 / RTX 3090

Single 96GB card

Qwen3-Coder-Next (80B/3B)

~46GB

An 80B MoE with linear attention: ~46GB of weights at Q4_K_M, ~50GB headroom for a long KV cache. The comfortable single-card pick at ~71.3% SWE-bench Verified.

RTX PRO 6000 · Mac Studio M3 Ultra

Single 96GB card, tight

Devstral 2 (123B dense)

~62GB

123B dense parameters fit in ~62GB at Q4, leaving ~34GB for KV cache. Loads on one 96GB card for single-user work; Mistral recommends 4× H100 for multi-user production serving.

RTX PRO 6000 (single user)

Cloud / cluster only

GLM-5.2 (744B/40B)

~372GB

At INT4 the 744B weights need ~372GB resident — a minimum of 4× H200 (141GB each). The MIT-licensed open crown, but not a desk-side model regardless of its 40B active count.

4× H200 minimum

04 · Bandwidth
Memory bandwidth, not capacity, sets your speed.

Once a model fits, the next question is how fast it generates tokens — and that is governed almost entirely by memory bandwidth, not raw compute. Token generation (the decode phase) is memory-bound: for each new token the accelerator must read the active weights out of memory, so a useful first-principles ceiling is tokens per second ≈ memory bandwidth ÷ weights read per token. For a dense model, the weights read per token equal the full quantized model size. Real-world decode lands a bit under that ceiling once attention, the router, and framework overhead are included.

Hold the model fixed and vary only the hardware, and the effect is stark. The table below runs Devstral 2 (123B dense, ~62GB at Q4) across four memory tiers. The capacity is similar; the bandwidth is not — and the token speed tracks the bandwidth, almost linearly.

Bandwidth-bound decode estimate for Devstral 2 (123B dense, ~62GB at 4-bit) across four hardware tiers. Ceiling equals memory bandwidth divided by 62GB of weights read per token; realistic decode is roughly 70% of that ceiling. First-principles estimates, stack-dependent, not interactive guarantees.
Hardware (same 62GB model)	Bandwidth (GB/s)	Ceiling (BW ÷ 62)	Realistic decode (~70%)
DGX Spark (GB10), 128GB	273	~4.4 tok/s	~3 tok/s
MacBook Pro M5 Max, 128GB	614	~9.9 tok/s	~7 tok/s
Mac Studio M3 Ultra, 96GB	819	~13.2 tok/s	~9 tok/s
RTX PRO 6000 Blackwell, 96GB	1,792	~28.9 tok/s	~20 tok/s
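
The table values follow directly from the ceiling formula. This sketch recomputes them from the stated bandwidths, the 62GB weight size, and the roughly 70% realism factor; it is a first-principles estimate, not a benchmark, and the function name is introduced here for illustration.

```python
WEIGHTS_GB = 62.0   # Devstral 2 at 4-bit, as stated in the table caption
EFFICIENCY = 0.70   # the "realistic" fraction of the ceiling used in the table

BANDWIDTH_GBPS = {
    "DGX Spark (GB10), 128GB": 273,
    "MacBook Pro M5 Max, 128GB": 614,
    "Mac Studio M3 Ultra, 96GB": 819,
    "RTX PRO 6000 Blackwell, 96GB": 1792,
}


def decode_ceiling(bandwidth_gbps: float, weights_read_gb: float) -> float:
    """Upper bound on tokens/s when every token reads all the weights once."""
    return bandwidth_gbps / weights_read_gb


estimates = {}
for hardware, bandwidth in BANDWIDTH_GBPS.items():
    ceiling = decode_ceiling(bandwidth, WEIGHTS_GB)
    estimates[hardware] = (ceiling, ceiling * EFFICIENCY)
    print(f"{hardware}: ceiling ~{ceiling:.1f} tok/s, "
          f"realistic ~{ceiling * EFFICIENCY:.0f} tok/s")
```

Because the model size is held constant, the ratio between any two rows is simply the ratio of their bandwidths.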

The counter-intuitive result this exposes: a 128GB DGX Spark, which has more memory than a 96GB Mac Studio or RTX PRO 6000, generates tokens the slowest of the group, because its 273 GB/s bandwidth is the lowest. At equal capacity, the 614 GB/s M5 Max is roughly 2.25× faster than the same-size DGX Spark. The DGX Spark’s value is capacity — it lets a 123B-class model load at all on a small, ~140W-typical box — not interactive speed. For dense-70B-at-Q4 decode specifically, bandwidth math puts the DGX Spark in the low single digits (one independent report measured roughly 2.7 tok/s); the higher tens-of-tokens figures you may see elsewhere are batch throughput or smaller/MoE/NVFP4 configurations, not single-user dense decode. Treat every token-per-second number here as a stack-dependent estimate.
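
To translate tokens per second into waiting time, divide the response length by the decode rate. The sketch below uses the realistic decode estimates from the table and a hypothetical 400-token response; that length is an assumption picked for illustration, and prompt processing time is ignored.

```python
# Realistic decode estimates (tok/s) for the same 62GB dense model, from the table.
REALISTIC_TOK_S = {
    "DGX Spark": 3,
    "MacBook Pro M5 Max": 7,
    "Mac Studio M3 Ultra": 9,
    "RTX PRO 6000 Blackwell": 20,
}

RESPONSE_TOKENS = 400  # hypothetical length of one generated patch


def seconds_to_generate(tokens: int, tok_per_s: float) -> float:
    """Decode time only; prompt processing is not included."""
    return tokens / tok_per_s


def bandwidth_speedup(fast_gbps: float, slow_gbps: float) -> float:
    """Expected decode speedup at equal model size, from bandwidth alone."""
    return fast_gbps / slow_gbps


for hardware, rate in REALISTIC_TOK_S.items():
    wait = seconds_to_generate(RESPONSE_TOKENS, rate)
    print(f"{hardware}: ~{wait:.0f} s for {RESPONSE_TOKENS} tokens")

print(f"M5 Max vs DGX Spark: ~{bandwidth_speedup(614, 273):.2f}x")
```

On these estimates the same response takes a couple of minutes on the slowest tier and about twenty seconds on the fastest.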

“A model you can download is not a model you can run. The weights ship free; the bandwidth that runs them does not.”
Digital Applied, on the 2026 self-hosting reality

05 · Fit Matrix
The hardware-to-model fit matrix.

Here is the core deliverable: every realistic 2026 hardware tier paired with the open coder that best fits it at 4-bit, with the VRAM used, the benchmark score, and a bandwidth-derived speed estimate. Prices reflect the mid-2026 shortage — note the RTX PRO 6000’s gap between its ~$8,565 launch MSRP and its current street pricing. For a broader walk through the consumer-to-workstation price brackets, pair this with our local-AI hardware price-brackets guide.

Hardware-to-model fit matrix for self-hosting open-weight coding models in 2026: each hardware tier paired with its best-fit model at 4-bit, VRAM used, SWE-bench Verified score, and a bandwidth-derived single-user token-speed estimate. Prices are mid-2026 and volatile; token speeds are stack-dependent estimates, not interactive guarantees.
Hardware	Memory / BW	Price (mid-2026)	Best-fit coder @ Q4	Weights / SWE-V	Tok/s (est.)
Consumer GPU (≤32GB)
RTX 5090	32GB · ~1.79 TB/s	~$2,000–4,000	Qwen3-Coder-30B-A3B (30B/3B)	~15GB · ~64%≈	~60–90
RTX 5090	32GB · ~1.79 TB/s	~$2,000–4,000	Devstral Small 2 (24B dense)	~12GB · 68.0%*	~25–35
Workstation GPU (96GB)
RTX PRO 6000 Blackwell	96GB · ~1.79 TB/s	~$8.5k MSRP · ~$11–14.5k street	Qwen3-Coder-Next (80B/3B)	~46GB · ~71.3%‡	~40–59
RTX PRO 6000 Blackwell	96GB · ~1.79 TB/s	~$8.5k MSRP · ~$11–14.5k street	Devstral 2 (123B dense)	~62GB · 72.2%*	~20–29
Unified-memory boxes
DGX Spark (GB10)	128GB · 273 GB/s	$3,999–4,699	Devstral 2 (123B dense)	~62GB · 72.2%*	~3–5
MacBook Pro M5 Max	128GB · 614 GB/s	~$5,000–6,000	Devstral 2 (123B dense)	~62GB · 72.2%*	~7–10
Mac Studio M3 Ultra	96GB · 819 GB/s	~$5,000	Qwen3-Coder-Next (80B/3B)	~46GB · ~71.3%‡	~25–35
Multi-GPU / cloud only
4× H200 (cluster)	~564GB · —	$100k+ / rental	GLM-5.2 (744B/40B)	~372GB · 62.1%* (Pro)	—
Cloud API	— · —	Per-token	DeepSeek V4-Pro (1.6T/49B)	n/a · 80.6%	—

* vendor-reported · ‡ ~71.3% via OpenHands · ≈ ~64% approximate, variant unconfirmed · all SWE scores are Verified unless marked “(Pro)” · token speeds are bandwidth-derived single-user estimates.
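
The matrix can also be read as a lookup: given a memory budget, which coders load with room to spare? The sketch below uses the 4-bit weight sizes and SWE-bench Verified scores from the matrix. The `reserve_gb` parameter is a hypothetical allowance for KV cache and runtime, not a figure from the article, and GLM-5.2 is left out because it reports only SWE-bench Pro.

```python
# (model, 4-bit weights in GB, SWE-bench Verified %) from the matrix above.
CODERS = [
    ("Devstral Small 2", 12, 68.0),
    ("Qwen3-Coder-30B-A3B", 15, 64.0),   # score is approximate in the source
    ("Qwen3-Coder-Next", 46, 71.3),
    ("Devstral 2", 62, 72.2),
]


def candidates(memory_gb: float, reserve_gb: float = 8.0):
    """Models whose weights plus a reserve fit in memory, best score first."""
    fits = [c for c in CODERS if c[1] + reserve_gb <= memory_gb]
    return sorted(fits, key=lambda c: c[2], reverse=True)


print(candidates(32))                   # consumer GPU tier
print(candidates(96))                   # minimal reserve: the dense 123B model leads
print(candidates(96, reserve_gb=40))    # long-context reserve: it drops out
```

Raising the reserve is how the "tight" versus "comfortable" distinction shows up: with a large KV-cache allowance, Devstral 2 no longer qualifies on a 96GB card and Qwen3-Coder-Next becomes the top pick.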

Pricing reality check

The RTX PRO 6000 Blackwell launched at roughly $8,565 MSRP, but the 2026 GDDR7 and DRAM shortage has made street pricing high and volatile — mid-2026 listings have commonly run $11,000–$14,500 (NVIDIA Marketplace around $13,250), while some retail still held near $8,500–$9,200. The same shortage is why the big-memory Mac Studio configs disappeared and 96GB is now the M3 Ultra ceiling. Price this hardware the week you buy it, not from a launch-day MSRP.

06 · The Models
The open coders worth running yourself.

Strip out the cloud-only giants and four models do the real work for self-hosters. Each is the best answer for a specific hardware budget — from a single consumer GPU up to a 96GB workstation card.

96GB · comfortable

Qwen3-Coder-Next

80B total · 3B active · linear attention

The single-card sweet spot. ~46GB at Q4_K_M leaves ~50GB for a long KV cache thanks to linear attention, and it scores around 71.3% on SWE-bench Verified (via OpenHands). Open-weight under the Qwen family license (confirm terms before commercial use), with an estimated ~40–59 tok/s on a 96GB card.

huggingface.co/Qwen/Qwen3-Coder-Next

96GB · tight / single-user

Devstral 2

123B dense · Modified MIT · 256K context

The top dense open coder at 72.2% SWE-bench Verified (vendor-reported). Loads in ~62GB at Q4 on one 96GB card for single-user work; Mistral recommends 4× H100 for multi-user production serving.

huggingface.co/mistralai/Devstral-2

96GB · alternative

Nemotron 3 Super

120B total · 12B active · 1M context

NVIDIA’s hybrid Mamba-Transformer MoE fits a 96GB card at ~60GB (Q4) and scores 60.47% on SWE-bench Verified. A strong long-context option under the NVIDIA Open Model License, also available on Bedrock and OpenRouter.

build.nvidia.com/nvidia/nemotron-3-super

24–32GB consumer GPU

Devstral Small 2

24B dense · Apache 2.0

The best coder that runs on affordable hardware: 68.0% SWE-bench Verified (vendor-reported), 25–35 tok/s on an RTX 4090, ~55 on an RTX PRO 6000. Mistral built it for single-GPU operation across RTX and DGX Spark.

huggingface.co/mistralai/Devstral-Small-2

For a deeper hands-on walkthrough of the Mistral family and its agent tooling, see our Devstral 2 and Mistral Vibe CLI guide. Above this tier, the cloud-scale open models — DeepSeek V4-Pro (1.6T/49B, 80.6% Verified), Nemotron 3 Ultra (550B/55B), and Kimi K2.7-Code (1T/32B) — are open-weight in name but require GPU clusters in practice. Download them to fine-tune or to host on rented infrastructure; do not expect them on a single box.

07 · Export Controls
Why almost every open coder is now Chinese.

Look back at the matrix and a pattern jumps out: Qwen, DeepSeek, GLM, Kimi — the open coding leaderboard is overwhelmingly Chinese. That is not an accident of talent alone; it is a structural consequence of US policy. The Bureau of Industry and Security controls closed model weights trained above a 10^26-compute threshold (ECCN 4E091) and requires export licenses for them — but published, open-weight model weights are explicitly not controlled. The most capable closed models are gateable; increasingly capable open ones are not.

The export-control paradox · USCC analysis

A US-China Economic and Security Review Commission analysis put the structural asymmetry plainly: “The models most dangerous to US interests because they are most capable are closed and gateable, while the models most consequential for global adoption are open and already in production. You can restrict the first category; you cannot restrict the second.” The same report cites Andreessen Horowitz leadership noting that roughly 80% of the start-ups it sees leveraging open-source AI stacks are using Chinese models — attribution is second-hand, so treat the exact figure as directional.

There is a hardware twist that makes these models even harder to constrain. GLM-5.2 was trained entirely on Huawei Ascend NPUs with no NVIDIA dependency, and DeepSeek V4 was likewise trained on non-NVIDIA silicon. That architectural independence is a deliberate hedge: a model that needs no controlled US chips to train, and ships under an MIT or Apache license, is structurally resilient to the export regime. For teams, the practical takeaway is less geopolitical than operational — your open-weight failover stack will likely lean on Chinese models, so plan governance, licensing review, and a second source accordingly. Our open-weight second-source playbook covers how to structure that resilience without single-vendor risk.

08 · Decision
How to choose by hardware budget.

The decision collapses to one question: what box can you buy, and how fast does it need to feel? Match your budget to the tier below, then benchmark the recommended model on your own repositories before you commit — a 70% benchmark score on someone else’s task set is not the same as a model that closes your tickets.

≤$4,000 · consumer GPU

One RTX 5090 (32GB)

Run a 30B-class coder: Qwen3-Coder-30B-A3B at an estimated 60–90 tok/s, or Devstral Small 2 at 25–35 tok/s. A 70B model will not fit 32GB at Q4 (~38GB) without crippling CPU offload. This is the best value-per-dollar self-hosting tier, capped near 30B.

Pick a 30B coder

$5–6k · unified memory

M5 Max or Mac Studio M3 Ultra

96–128GB of unified memory (128GB on the M5 Max, 96GB on the Mac Studio M3 Ultra) loads a 123B model at Q4 (~62GB), but 614–819 GB/s bandwidth keeps decode in the single-to-low-double digits of tok/s. Great for capacity, batch jobs, and quiet always-on local inference; not the fastest interactive coder per dollar.

Pick capacity over speed

$11–14.5k · workstation card

RTX PRO 6000 Blackwell (96GB)

The genuine single-card ceiling: Qwen3-Coder-Next (80B MoE) comfortably or Devstral 2 (123B dense) tightly, at 1.79 TB/s for ~20–59 tok/s. The fastest interactive self-hosted coding you can buy in one box — at shortage pricing.

Pick the 96GB workstation

Frontier or 744B GLM-5.2

Cloud API or rented cluster

For the hardest agentic coding, keep a frontier cloud coder (80–95% SWE-bench) in the loop. For GLM-5.2 or DeepSeek V4-Pro, rent multi-GPU rather than buy — single-box self-hosting is not realistic at 744B-plus.

Stay in the cloud
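
The four tiers above reduce to a simple budget lookup. This sketch is illustrative only: the entry prices are the low ends of the mid-2026 ranges quoted in this guide, they are volatile, and the function name and structure are hypothetical.

```python
# (entry price in USD, tier, suggested pick), lowest tier first.
# Entry prices are the low end of the ranges quoted in this guide.
TIERS = [
    (2000, "consumer GPU: RTX 5090 (32GB)", "pick a 30B coder"),
    (5000, "unified memory: M5 Max or Mac Studio M3 Ultra", "pick capacity over speed"),
    (11000, "workstation card: RTX PRO 6000 Blackwell (96GB)", "pick the 96GB workstation"),
]

CLOUD = ("cloud API or rented cluster", "stay in the cloud")


def tier_for(budget_usd: float):
    """Return the highest tier whose entry price the budget covers."""
    chosen = None
    for entry_price, tier, pick in TIERS:
        if budget_usd >= entry_price:
            chosen = (tier, pick)
    return chosen if chosen is not None else CLOUD


for budget in (500, 3000, 7000, 12000):
    tier, pick = tier_for(budget)
    print(f"${budget:,}: {tier} ({pick})")
```

A budget lookup only answers what you can buy. Whether the result feels fast enough still depends on the bandwidth of the tier it returns.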

The pragmatic architecture for most teams is a hybrid: a single-card open model for the high-volume, privacy-sensitive, or cost-capped work, with a frontier API reserved for the hardest tickets and the agentic runs where the 17–27 point gap actually bites. Getting that routing right — which workloads stay local, which escalate, and how to measure the trade — is exactly what our AI transformation engagements scope and benchmark, and where our custom development team wires the local model into your stack.
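
A hybrid stack needs a routing rule. The following is a minimal sketch of one possible policy, with hypothetical task fields. It assumes sensitive data always stays local, even for hard tasks; that is a policy choice made for this example rather than a recommendation from the guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    sensitive_data: bool = False   # code or data that must not leave the box
    hard_agentic: bool = False     # long autonomous runs, gnarly tickets


def route(task: Task) -> str:
    """Return "local" or "frontier" for a single task."""
    if task.sensitive_data:
        return "local"       # assumed policy: privacy outranks capability
    if task.hard_agentic:
        return "frontier"    # escalate where the capability gap bites
    return "local"           # default: high-volume, cost-capped work


tasks = [
    Task("generate unit tests for a utility module"),
    Task("multi-repo refactor with failing integration tests", hard_agentic=True),
    Task("review a diff touching customer records", sensitive_data=True),
]

for t in tasks:
    print(f"{route(t):8s} <- {t.description}")
```

In practice the `hard_agentic` flag is the part worth measuring: benchmark both paths on your own tickets and adjust what escalates.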

09 · Conclusion
Open weights, run honestly.

The shape of self-hosted coding, June 2026

Match the model to the box, and be honest about the gap.

Open-weight coding models crossed a real threshold in the first half of 2026. A single 96GB workstation card now runs a model that verifies in the low 70s on SWE-bench, a consumer GPU runs a credible 30B coder, and the whole field is improving faster than the hardware shortage is making it expensive. For bounded, sensitive, or high-volume work, self-hosting an open coder is no longer a compromise — it is often the right default.

But the headline numbers hide two hard constraints, and ignoring either is how self-hosting projects disappoint. The first is memory: a 744B model does not fit one box no matter how few parameters activate per token, and a dense 123B model that loads its weights in 62GB still has to find room for the KV cache. The second is bandwidth: capacity lets a model load, but bandwidth is what makes it feel fast, which is why a 128GB DGX Spark can be slower than a 96GB card. Buy for the bottleneck that actually binds your workload.

And keep the frontier in view. The best self-hostable open coder still trails the best cloud coder by roughly 17 to 27 SWE-bench points. That gap is closing, but it is not closed — so the durable move is a hybrid stack: local open models for most of the work, a frontier API for the hardest of it, and a clear-eyed read of which benchmark, which hardware, and which price you are actually buying.

Best Open-Weight Coding Models to Self-Host in 2026
