The best open-weight coding models you can self-host in 2026 now land at roughly 60–72% on SWE-bench Verified: close enough to be genuinely useful, far enough from the 80–95% frontier cloud coders that you should choose your hardware and model with eyes open. The decision that matters is not “which open model is best” but “which open model fits the box I can actually buy.”
That box question got harder this year. The DRAM and GDDR7 shortage pushed workstation GPU prices up and erased the big-memory Mac configurations: the Mac Studio M3 Ultra now maxes out at 96GB, and the “large-memory Mac for local AI” is the M5 Max 128GB laptop rather than a 256GB or 512GB Studio. Meanwhile the models themselves ballooned — GLM-5.2 ships 744 billion parameters under an MIT license, and no single desk-side machine comes close to running it.
This guide does the arithmetic so you do not have to: how much VRAM each model needs at 4-bit, why memory bandwidth — not capacity — sets your token speed, and a single matrix that pairs every 2026 open coder with the hardware tier that runs it. We will be precise about what fits a single 96GB card, honest about what only runs in the cloud, and clear about the policy reason almost every leading open coder now comes from a Chinese lab. Every number traces to a primary source; verify the volatile ones before you spend.
- 01 Open coders are useful, but the gap to frontier is real. Self-hostable open-weight models top out around 71–72% SWE-bench Verified (Qwen3-Coder-Next, Devstral 2), while closed frontier coders sit at 80–95%. That is roughly a 17–27 point gap: meaningful, not unbridgeable.
- 02 A single 96GB card is the real self-hosting ceiling. An RTX PRO 6000 Blackwell (96GB) runs Qwen3-Coder-Next (80B MoE) comfortably or Devstral 2 (123B dense) tightly at 4-bit. A 744B MoE like GLM-5.2 does not fit one box, because all weights must stay resident in GPU memory regardless of active-parameter count.
- 03 Bandwidth, not memory, decides how fast it feels. Decode is memory-bound. A 128GB DGX Spark (273 GB/s) generates tokens far slower than a 128GB M5 Max (614 GB/s), which is slower again than a 96GB RTX PRO 6000 (1.79 TB/s). Buy bandwidth for interactive coding; buy capacity only when a model will not otherwise load.
- 04 SWE-bench Verified and SWE-bench Pro are not the same test. GLM-5.2 reports a vendor SWE-bench Pro score (62.1%); most others report SWE-bench Verified. Pro is a harder problem set, so never line the two up as if they were comparable. We split them into separate columns for exactly this reason.
- 05 The open coding leaderboard is overwhelmingly Chinese. US export controls gate the most capable closed weights but explicitly exempt published open weights. The structural result: the leading open coders (Qwen, DeepSeek, GLM, Kimi) come from Chinese labs, several trained on non-NVIDIA silicon to stay export-proof.
01 — The Gap: Open coders are good. Frontier cloud is still ahead.
Start with the honest framing, because most roundups bury it. The best coding model you can download and run yourself is no longer a toy — but it is not the best coding model, full stop. On SWE-bench Verified, the standard agentic-coding benchmark, the strongest self-hostable open-weight models land in the low 70s, while the closed frontier sits in the high 80s to mid 90s. The chart below puts the two groups side by side.
SWE-bench Verified · self-hostable open weights vs closed frontier
Source: BenchLM / morphllm SWE-bench Verified, vendor reports, June 2026. Leaderboard positions shift; verify on swebench.com.

Two things are true at once. First, the trend line is steep: Devstral Small 2, a 24B model that runs on a single consumer GPU, scores 68.0% on SWE-bench Verified (vendor-reported), higher than much larger models managed eighteen months ago. Second, the absolute frontier has not stood still. Closed coders score in the high 80s and above, so the gap between “best thing on my desk” and “best thing behind an API” is roughly 17 to 27 percentage points depending on which models you line up. For a developer, that gap is the difference between a model that closes most well-scoped tickets and one that closes the gnarly ones too.
The interpretation we draw: open-weight self-hosting in 2026 is a cost, privacy, and control decision far more than a raw-capability one. If your workload is bounded — internal tooling, code review, test generation, refactors on a known codebase — a single-card open model is now a credible primary. If you are pushing the hardest autonomous agentic coding, the honest move is to keep the frontier API in the loop and route only the cheap, high-volume, or sensitive-data work to a local model. For the largest open release of the quarter, see our DeepSeek V4 migration guide; if you are still deciding whether to self-host at all, our self-hosting deployment decision guide walks the full build-versus-rent tree.
02 — Benchmarks: Read the benchmark before you trust the number.
The single most common mistake in 2026 model comparisons is treating SWE-bench Verified and SWE-bench Pro as the same scale. They are not. SWE-bench Verified is the widely-cited human-validated subset; SWE-bench Pro, from Scale AI, uses a harder, scaled problem set, so a 62% on Pro is not “worse” than a 71% on Verified — they are different exams. A handful of models report only one or the other, which makes naive cross-comparison actively misleading. The table below keeps them in separate columns on purpose.
| Model (total / active) | SWE-bench Verified | SWE-bench Pro | Single-box self-host? |
|---|---|---|---|
| Report SWE-bench Verified | |||
| DeepSeek V4-Pro (1.6T / 49B) | 80.6% | — | No — cloud / cluster only |
| Devstral 2 (123B dense) | 72.2%* | — | Yes — 96GB card, tight |
| Qwen3-Coder-Next (80B / 3B) | ~71.3%‡ | — | Yes — 96GB card, comfortable |
| Nemotron 3 Ultra (550B / 55B) | 65–71.9% | — | No — 550B total |
| Devstral Small 2 (24B dense) | 68.0%* | — | Yes — 24GB consumer GPU |
| Nemotron 3 Super (120B / 12B) | 60.47% | — | Yes — 96GB card |
| Nemotron 3 Nano (30B / 3B) | 38.8%† | — | Yes — 16GB GPU |
| Report SWE-bench Pro (harder — do not cross-compare) | |||
| GLM-5.2 (744B / 40B) | — | 62.1%* | No — 4× H200 minimum |
| Qwen3-Coder-480B (480B / 35B) | — | 38.7% | No — multi-GPU |
* vendor-reported · † OpenHands scaffold · ‡ ~71.3% via OpenHands, technical report
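One way to enforce the separate-columns rule in your own comparison tooling is to tag every score with its benchmark and refuse to subtract across tags. The sketch below is only an illustration: the `Score` class and `gap` function are hypothetical names, and the three scores are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str  # "verified" or "pro"
    percent: float


def gap(a: Score, b: Score) -> float:
    """Percentage-point gap a - b; refuses to mix benchmarks."""
    if a.benchmark != b.benchmark:
        raise ValueError(
            f"cannot compare {a.benchmark!r} with {b.benchmark!r}: different problem sets"
        )
    return a.percent - b.percent


# Scores copied from the table above.
devstral_2 = Score("Devstral 2", "verified", 72.2)
qwen_next = Score("Qwen3-Coder-Next", "verified", 71.3)
glm_52 = Score("GLM-5.2", "pro", 62.1)

print(f"Devstral 2 vs Qwen3-Coder-Next: {gap(devstral_2, qwen_next):.1f} points")

try:
    gap(devstral_2, glm_52)
except ValueError as err:
    # Verified vs Pro is rejected instead of producing a misleading number.
    print(f"rejected: {err}")
```

Failing loudly on a mixed comparison is the design choice here: a silent subtraction would produce exactly the misleading number this section warns against.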
03 — VRAM Math: What actually fits on one card.
The rule that governs self-hosting is blunt: every parameter has to live in memory, all the time. At 4-bit, model weights cost roughly half a byte per parameter plus overhead — so a 30B model needs about 15GB, an 80B model about 40–46GB, and a 123B dense model about 62GB. On top of the weights you need room for the KV cache, which grows with context length. For a 70B-class dense model, a 128K-token context at FP16 can demand around 40GB of KV cache on its own. That second number is the one most buyers forget.
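The arithmetic in this paragraph can be checked with a short sketch. The weight figures are raw 4-bit sizes before quantization overhead, which is why the 80B result sits at the low end of the 40–46GB range. The KV-cache geometry (80 layers, 8 KV heads, head dimension 128) is an assumption chosen to resemble a Llama-style 70B model with grouped-query attention; it is not taken from any model in this guide.

```python
def weight_bytes(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Raw weight storage in bytes, before quantization overhead."""
    return params_billion * 1e9 * bits_per_weight / 8


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for name, params in [("30B", 30), ("80B", 80), ("123B", 123)]:
    print(f"{name}: {weight_bytes(params) / 1e9:.1f} GB of raw 4-bit weights")

# ASSUMPTION: Llama-style 70B geometry (80 layers, 8 KV heads, head_dim 128),
# FP16 cache values (2 bytes each), 128K = 131,072 tokens of context.
kv = kv_cache_bytes(tokens=128 * 1024, layers=80, kv_heads=8, head_dim=128)
print(f"70B-class KV cache at 128K context, FP16: {kv / 2**30:.1f} GiB")
```

Under those assumptions the cache works out to 40 GiB, in line with the "around 40GB" figure above. Halving `bytes_per_value` (an 8-bit KV cache) halves the result.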
This is also where the popular “32B is the ceiling for consumer self-hosting” claim needs splitting in two. For a dense model carrying a full 128K FP16 context, the comfortable single-96GB-card ceiling really does land near a 32B-class coder, because the dense KV cache eats tens of gigabytes on top of the weights. A 123B dense model like Devstral 2 loads its weights in ~62GB and leaves ~34GB — enough for a constrained context with a quantized KV cache, but not a relaxed full-length session. That headroom math is exactly why Mistral recommends a minimum of four H100-class GPUs for production serving, even though the model technically fits one 96GB card for single-user use.
The exception that breaks the 32B rule is architecture. Qwen3-Coder-Next is an 80B Mixture-of-Experts model with only 3B active parameters and a linear-attention design, so its KV cache is dramatically smaller than a dense transformer of the same context length. At Q4_K_M it needs only ~46GB for weights and leaves ~50GB of headroom — enough for a very long context on a single 96GB card. Never apply dense-model KV-cache math to an efficient-attention MoE, and never confuse active parameters with VRAM: a 744B/40B MoE still has to keep all 744B weights resident. For the full quantization, KV-cache, and context arithmetic, see our companion VRAM, quantization, and KV-cache guide, and for how that scales with context, our look at Qwen’s long-context approach.
30B-class coder at Q4
Qwen3-Coder-30B-A3B (~64% SWE-bench Verified, approximate) or Nemotron 3 Nano fit in roughly 15GB — comfortable on a 24–32GB consumer GPU like an RTX 5090, with room for context.
Qwen3-Coder-Next (80B/3B)
An 80B MoE with linear attention: ~46GB of weights at Q4_K_M, ~50GB headroom for a long KV cache. The comfortable single-card pick at ~71.3% SWE-bench Verified.
Devstral 2 (123B dense)
123B dense parameters fit in ~62GB at Q4, leaving ~34GB for KV cache. Loads on one 96GB card for single-user work; Mistral recommends 4× H100 for multi-user production serving.
GLM-5.2 (744B/40B)
At INT4 the 744B weights need ~372GB resident — a minimum of 4× H200 (141GB each). The MIT-licensed open crown, but not a desk-side model regardless of its 40B active count.
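The fit checks behind these four cards reduce to one subtraction: total memory minus resident weights. Here is a minimal sketch using this guide's own weight estimates. The `headroom_gb` helper is a hypothetical name, and the ~38GB figure for a dense 70B model at Q4 is this guide's estimate from the decision section.

```python
def headroom_gb(total_memory_gb: float, weights_gb: float) -> float:
    """Memory left for KV cache and runtime after weights load; negative means no fit."""
    return total_memory_gb - weights_gb


# (label, total memory in GB, resident weights in GB at 4-bit)
configs = [
    ("Qwen3-Coder-Next on 1x 96GB", 96, 46),
    ("Devstral 2 on 1x 96GB", 96, 62),
    ("70B dense on 1x 32GB", 32, 38),
    ("GLM-5.2 on 1x 96GB", 96, 372),
    ("GLM-5.2 on 4x H200 (4 x 141GB)", 4 * 141, 372),
]

for label, memory, weights in configs:
    left = headroom_gb(memory, weights)
    verdict = "fits" if left > 0 else "does not fit"
    print(f"{label}: {verdict}, headroom {left:+.0f} GB")
```

A positive headroom only means the weights load. Whether the remainder is enough depends on context length and on how large the model's KV cache is per token.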
04 — Bandwidth: Memory bandwidth, not capacity, sets your speed.
Once a model fits, the next question is how fast it generates tokens — and that is governed almost entirely by memory bandwidth, not raw compute. Token generation (the decode phase) is memory-bound: for each new token the accelerator must read the active weights out of memory, so a useful first-principles ceiling is tokens per second ≈ memory bandwidth ÷ weights read per token. For a dense model, the weights read per token equal the full quantized model size. Real-world decode lands a bit under that ceiling once attention, the router, and framework overhead are included.
Hold the model fixed and vary only the hardware, and the effect is stark. The table below runs Devstral 2 (123B dense, ~62GB at Q4) across four memory tiers. The capacity is similar; the bandwidth is not — and the token speed tracks the bandwidth, almost linearly.
| Hardware (same 62GB model) | Bandwidth (GB/s) | Ceiling (BW ÷ 62) | Realistic decode (~70%) |
|---|---|---|---|
| DGX Spark (GB10), 128GB | 273 | ~4.4 tok/s | ~3 tok/s |
| MacBook Pro M5 Max, 128GB | 614 | ~9.9 tok/s | ~7 tok/s |
| Mac Studio M3 Ultra, 96GB | 819 | ~13.2 tok/s | ~9 tok/s |
| RTX PRO 6000 Blackwell, 96GB | 1,792 | ~28.9 tok/s | ~20 tok/s |
The counter-intuitive result this exposes: a 128GB DGX Spark, which has more memory than a 96GB Mac Studio or RTX PRO 6000, generates tokens the slowest of the group, because its 273 GB/s bandwidth is the lowest. At equal capacity, the 614 GB/s M5 Max is roughly 2.25× faster than the same-size DGX Spark. The DGX Spark’s value is capacity — it lets a 123B-class model load at all on a small, ~140W-typical box — not interactive speed. For dense-70B-at-Q4 decode specifically, bandwidth math puts the DGX Spark in the low single digits (one independent report measured roughly 2.7 tok/s); the higher tens-of-tokens figures you may see elsewhere are batch throughput or smaller/MoE/NVFP4 configurations, not single-user dense decode. Treat every token-per-second number here as a stack-dependent estimate.
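The table's ceiling and realistic columns follow directly from the bandwidth ÷ weights-read rule, and can be reproduced in a few lines. This sketch uses the table's own inputs: a 62GB dense model and a 70% efficiency factor. The function name `decode_ceiling` is hypothetical, and the output is the same first-principles estimate as the table, not a measurement.

```python
MODEL_GB = 62.0    # Devstral 2 at Q4: a dense model reads all weights per token
EFFICIENCY = 0.70  # the table's "realistic" fraction of the theoretical ceiling

# Memory bandwidth in GB/s, copied from the table above.
hardware = {
    "DGX Spark (GB10)": 273,
    "MacBook Pro M5 Max": 614,
    "Mac Studio M3 Ultra": 819,
    "RTX PRO 6000 Blackwell": 1792,
}


def decode_ceiling(bandwidth_gb_per_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed in tokens per second for memory-bound generation."""
    return bandwidth_gb_per_s / gb_read_per_token


for name, bandwidth in hardware.items():
    ceiling = decode_ceiling(bandwidth, MODEL_GB)
    print(f"{name}: ceiling {ceiling:.1f} tok/s, realistic {ceiling * EFFICIENCY:.1f} tok/s")

# Same model on both machines, so the speed ratio is just the bandwidth ratio.
ratio = hardware["MacBook Pro M5 Max"] / hardware["DGX Spark (GB10)"]
print(f"M5 Max vs DGX Spark: {ratio:.2f}x")
```

Because the model size cancels out of the ratio, the roughly 2.25× advantage of the M5 Max over the DGX Spark holds for any dense model that fits both.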
A model you can download is not a model you can run. The weights ship free; the bandwidth that runs them does not. — Digital Applied, on the 2026 self-hosting reality
05 — Fit Matrix: The hardware-to-model fit matrix.
Here is the core deliverable: every realistic 2026 hardware tier paired with the open coder that best fits it at 4-bit, with the VRAM used, the benchmark score, and a bandwidth-derived speed estimate. Prices reflect the mid-2026 shortage — note the RTX PRO 6000’s gap between its ~$8,565 launch MSRP and its current street pricing. For a broader walk through the consumer-to-workstation price brackets, pair this with our local-AI hardware price-brackets guide.
| Hardware | Memory / BW | Price (mid-2026) | Best-fit coder @ Q4 | Weights / SWE-V | Tok/s (est.) |
|---|---|---|---|---|---|
| Consumer GPU (≤32GB) | |||||
| RTX 5090 | 32GB · ~1.79 TB/s | ~$2,000–4,000 | Qwen3-Coder-30B-A3B (30B/3B) | ~15GB · ~64%≈ | ~60–90 |
| RTX 5090 | 32GB · ~1.79 TB/s | ~$2,000–4,000 | Devstral Small 2 (24B dense) | ~12GB · 68.0%* | ~25–35 |
| Workstation GPU (96GB) | |||||
| RTX PRO 6000 Blackwell | 96GB · ~1.79 TB/s | ~$8.5k MSRP · ~$11–14.5k street | Qwen3-Coder-Next (80B/3B) | ~46GB · ~71.3%‡ | ~40–59 |
| RTX PRO 6000 Blackwell | 96GB · ~1.79 TB/s | ~$8.5k MSRP · ~$11–14.5k street | Devstral 2 (123B dense) | ~62GB · 72.2%* | ~20–29 |
| Unified-memory boxes | |||||
| DGX Spark (GB10) | 128GB · 273 GB/s | $3,999–4,699 | Devstral 2 (123B dense) | ~62GB · 72.2%* | ~3–5 |
| MacBook Pro M5 Max | 128GB · 614 GB/s | ~$5,000–6,000 | Devstral 2 (123B dense) | ~62GB · 72.2%* | ~7–10 |
| Mac Studio M3 Ultra | 96GB · 819 GB/s | ~$5,000 | Qwen3-Coder-Next (80B/3B) | ~46GB · ~71.3%‡ | ~25–35 |
| Multi-GPU / cloud only | |||||
| 4× H200 (cluster) | ~564GB · — | $100k+ / rental | GLM-5.2 (744B/40B) | ~372GB · 62.1%* (Pro) | — |
| Cloud API | — · — | Per-token | DeepSeek V4-Pro (1.6T/49B) | n/a · 80.6% | — |
* vendor-reported · ‡ ~71.3% via OpenHands · ≈ ~64% approximate, variant unconfirmed · all SWE scores are Verified unless marked “(Pro)” · token speeds are bandwidth-derived single-user estimates.
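The matrix can also be read as a lookup: given a memory budget, pick the highest-scoring coder whose weights fit with some headroom left over. The sketch below is illustrative only. The `best_fit` function and its `min_headroom_gb` parameter are hypothetical, and the weights and SWE-bench Verified scores are copied from the matrix, including the approximate ~64% for Qwen3-Coder-30B-A3B.

```python
from typing import Optional

# (model, weights in GB at Q4, SWE-bench Verified %) copied from the matrix above.
CODERS = [
    ("Devstral Small 2", 12, 68.0),
    ("Qwen3-Coder-30B-A3B", 15, 64.0),
    ("Qwen3-Coder-Next", 46, 71.3),
    ("Devstral 2", 62, 72.2),
]


def best_fit(memory_gb: float, min_headroom_gb: float) -> Optional[str]:
    """Highest-scoring coder whose weights plus the requested headroom fit in memory."""
    fitting = [c for c in CODERS if c[1] + min_headroom_gb <= memory_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda c: c[2])[0]


print(best_fit(32, 8))   # consumer GPU with modest headroom
print(best_fit(96, 8))   # 96GB card, tight headroom
print(best_fit(96, 40))  # 96GB card, generous headroom for long context
```

Raising the headroom requirement on a 96GB card flips the pick from Devstral 2 to Qwen3-Coder-Next, which mirrors the "tight" versus "comfortable" distinction in the matrix. A single headroom number is a simplification, since the memory a model actually needs for context depends on its attention design.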
06 — The Models: The open coders worth running yourself.
Strip out the cloud-only giants and four models do the real work for self-hosters. Each is the best answer for a specific hardware budget — from a single consumer GPU up to a 96GB workstation card.
Qwen3-Coder-Next
The single-card sweet spot. ~46GB at Q4_K_M leaves ~50GB for a long KV cache thanks to linear attention, and it scores around 71.3% on SWE-bench Verified (OpenHands scaffold). Open-weight under the Qwen family license (confirm terms before commercial use), ~40–59 tok/s on a 96GB card.
Devstral 2
The top dense open coder at 72.2% SWE-bench Verified (vendor-reported). Loads in ~62GB at Q4 on one 96GB card for single-user work; Mistral recommends 4× H100 for multi-user production serving.
Nemotron 3 Super
NVIDIA’s hybrid Mamba-Transformer MoE fits a 96GB card at ~60GB (Q4) and scores 60.47% on SWE-bench Verified. A strong long-context option under the NVIDIA Open Model License, also available on Bedrock and OpenRouter.
Devstral Small 2
The best coder that runs on affordable hardware: 68.0% SWE-bench Verified (vendor-reported), 25–35 tok/s on an RTX 4090, ~55 on an RTX PRO 6000. Mistral built it for single-GPU operation across RTX and DGX Spark.
For a deeper hands-on walkthrough of the Mistral family and its agent tooling, see our Devstral 2 and Mistral Vibe CLI guide. Above this tier, the cloud-scale open models — DeepSeek V4-Pro (1.6T/49B, 80.6% Verified), Nemotron 3 Ultra (550B/55B), and Kimi K2.7-Code (1T/32B) — are open-weight in name but require GPU clusters in practice. Download them to fine-tune or to host on rented infrastructure; do not expect them on a single box.
07 — Export Controls: Why almost every open coder is now Chinese.
Look back at the matrix and a pattern jumps out: Qwen, DeepSeek, GLM, Kimi — the open coding leaderboard is overwhelmingly Chinese. That is not an accident of talent alone; it is a structural consequence of US policy. The Bureau of Industry and Security controls closed model weights trained above a 10^26-compute threshold (ECCN 4E091) and requires export licenses for them — but published, open-weight model weights are explicitly not controlled. The most capable closed models are gateable; increasingly capable open ones are not.
There is a hardware twist that makes these models even harder to constrain. GLM-5.2 was trained entirely on Huawei Ascend NPUs with no NVIDIA dependency, and DeepSeek V4 was likewise trained on non-NVIDIA silicon. That architectural independence is a deliberate hedge: a model that needs no controlled US chips to train, and ships under an MIT or Apache license, is structurally resilient to the export regime. For teams, the practical takeaway is less geopolitical than operational — your open-weight failover stack will likely lean on Chinese models, so plan governance, licensing review, and a second source accordingly. Our open-weight second-source playbook covers how to structure that resilience without single-vendor risk.
08 — Decision: How to choose by hardware budget.
The decision collapses to one question: what box can you buy, and how fast does it need to feel? Match your budget to the tier below, then benchmark the recommended model on your own repositories before you commit — a 70% benchmark score on someone else’s task set is not the same as a model that closes your tickets.
One RTX 5090 (32GB)
Run a 30B-class coder: Qwen3-Coder-30B-A3B at roughly 60–90 tok/s, or Devstral Small 2 at roughly 25–35 tok/s. A 70B model will not fit 32GB at Q4 (~38GB) without crippling CPU offload. This is the best value-per-dollar self-hosting tier, capped near 30B.
M5 Max or Mac Studio M3 Ultra
96GB (Mac Studio M3 Ultra) or 128GB (M5 Max) of unified memory loads a 123B model at Q4, but 614–819 GB/s of bandwidth keeps dense-model decode in the single to low double digits of tokens per second. Great for capacity, batch jobs, and quiet always-on local inference; not the fastest interactive coder per dollar.
RTX PRO 6000 Blackwell (96GB)
The genuine single-card ceiling: Qwen3-Coder-Next (80B MoE) comfortably or Devstral 2 (123B dense) tightly, at 1.79 TB/s for ~20–59 tok/s. The fastest interactive self-hosted coding you can buy in one box — at shortage pricing.
Cloud API or rented cluster
For the hardest agentic coding, keep a frontier cloud coder (80–95% SWE-bench) in the loop. For GLM-5.2 or DeepSeek V4-Pro, rent multi-GPU rather than buy — single-box self-hosting is not realistic at 744B-plus.
The pragmatic architecture for most teams is a hybrid: a single-card open model for the high-volume, privacy-sensitive, or cost-capped work, with a frontier API reserved for the hardest tickets and the agentic runs where the 17–27 point gap actually bites. Getting that routing right — which workloads stay local, which escalate, and how to measure the trade — is exactly what our AI transformation engagements scope and benchmark, and where our custom development team wires the local model into your stack.
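As a rough illustration of that routing idea, the sketch below sends each task to either the local model or the frontier API. Everything here is hypothetical: the `Task` fields, the two difficulty labels, and the rule ordering are assumptions made for the example, not a description of any particular routing product or of how a production router should classify work.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    sensitive: bool  # touches data that must not leave your infrastructure
    difficulty: str  # "routine" or "hard" (hypothetical labels)


def route(task: Task) -> str:
    """Return "local" or "frontier" for a task under a simple hybrid policy."""
    if task.sensitive:
        # Privacy wins over capability: sensitive work never leaves the box.
        return "local"
    if task.difficulty == "hard":
        # Escalate only where the capability gap is likely to matter.
        return "frontier"
    # High-volume routine work stays on the cheap local model.
    return "local"


tasks = [
    Task("generate unit tests for billing module", sensitive=True, difficulty="routine"),
    Task("multi-repo autonomous refactor", sensitive=False, difficulty="hard"),
    Task("review a small pull request", sensitive=False, difficulty="routine"),
]
for task in tasks:
    print(f"{task.description}: {route(task)}")
```

The ordering encodes one possible priority, with data sensitivity checked before difficulty. A team that prefers to escalate hard sensitive tasks to a human rather than to a weaker local model would add a third destination.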
09 — Conclusion: Open weights, run honestly.
Match the model to the box, and be honest about the gap.
Open-weight coding models crossed a real threshold in the first half of 2026. A single 96GB workstation card now runs a model that scores in the low 70s on SWE-bench Verified, a consumer GPU runs a credible 30B coder, and the whole field is improving faster than the hardware shortage is making it expensive. For bounded, sensitive, or high-volume work, self-hosting an open coder is no longer a compromise; it is often the right default.
But the headline numbers hide two hard constraints, and ignoring either is how self-hosting projects disappoint. The first is memory: a 744B model does not fit one box no matter how few parameters activate per token, and a dense 123B model that loads its weights in 62GB still has to find room for the KV cache. The second is bandwidth: capacity lets a model load, but bandwidth is what makes it feel fast, which is why a 128GB DGX Spark can be slower than a 96GB card. Buy for the bottleneck that actually binds your workload.
And keep the frontier in view. The best self-hostable open coder still trails the best cloud coder by roughly 17 to 27 SWE-bench points. That gap is closing, but it is not closed — so the durable move is a hybrid stack: local open models for most of the work, a frontier API for the hardest of it, and a clear-eyed read of which benchmark, which hardware, and which price you are actually buying.