MiniMax M3 vs Opus 4.8 vs GPT-5.5 is the first three-way agentic coding comparison where the right answer is genuinely "it depends on the workload." M3 launched June 1, 2026 with frontier-class claims at a fraction of the API cost; Claude Opus 4.8 shipped May 28 and leads real-world repository work; GPT-5.5, live since late April, owns shell-driven DevOps. None of them wins everything.
What makes this comparison hard isn't the models — it's the numbers. M3's headline scores are vendor-stated, run on MiniMax's own infrastructure, and as of June 3 had no independent corroboration. Its open weights had not yet shipped. And the most-quoted Terminal-Bench comparison silently mixes two different benchmark versions. Strip those problems out and a clean routing decision emerges.
This guide builds that decision. We cover what each model actually costs, where each one genuinely leads, the benchmark caveats most coverage skips, a cost-per-successful-task breakeven you can reproduce, and a six-row routing matrix so you can match each workload to the right model — not the loudest headline.
- 01 No single model wins agentic coding. Opus 4.8 leads SWE-Bench Pro (vendor-stated 69.2%); GPT-5.5 leads Terminal-Bench 2.0 (82.7%); M3 wins on raw API cost. The right call is per-workload routing, not a single default.
- 02 M3's benchmarks are vendor-run and unverified. Every M3 score was produced on MiniMax's own infrastructure with baselines it chose. As of June 3, 2026, independent assessments from Artificial Analysis and LMArena had not yet published.
- 03 MiniMax compared M3 against the wrong Opus. Launch materials benchmarked M3 against Opus 4.7 (64.3% SWE-Bench Pro), not the Opus 4.8 released days earlier (vendor-stated 69.2%). At the correct baseline, M3 (59.0%) trails by roughly 10 points.
- 04 The cost gap is real but smaller than the headline. At standard pricing, M3 is around 8x cheaper on input than Opus 4.8. The 17x figure used a 7-day launch promotion. VentureBeat's 5-10%-of-cost line compared M3 promo rates to GPT-5.5 standard rates.
- 05 Open-weight, but not open today. M3 weights and the technical report were not shipped at launch (expected within roughly 10 days, around June 11). The license was unpublished; the prior M2.7 used a non-commercial license. Verify terms before commercial use.
01 — The Lineup
Three models, three distinct bets.
Each model in this comparison is making a different wager about how teams will buy and run AI coding in 2026. Claude Opus 4.8 bets on premium hosted capability — highest real-world repo accuracy and the strongest long-context fidelity. GPT-5.5 bets on the agentic shell: long-running, command-line, DevOps-flavored tasks. MiniMax M3 bets on economics — frontier-class claims at the lowest API cost, with eventual self-hosting for high-volume teams.
Before the benchmarks, it helps to anchor on what each one is. Our companion deep-dive on MiniMax M3's open-weight release covers the MSA architecture in detail; the Opus 4.8 vs GPT-5.5 head-to-head and the GPT-5.5 complete guide go deeper on those two.
Claude Opus 4.8
Highest vendor-stated SWE-Bench Pro (69.2%) and the strongest long-context retrieval in the set. Hosted only — also on AWS Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
GPT-5.5
Terminal-Bench 2.0 leader (82.7%) for shell-heavy DevOps and command-line agents. Closed, hosted-API-only; a Pro tier exists at higher cost. No 'Codex' variant on the 5.5 line.
MiniMax M3
Lowest API cost, 1M context, native multimodal input. Weights expected around June 11. All scores vendor-stated; parameter count undisclosed (community estimates 200-400B).
02 — Read The Fine Print
Why every headline number needs a caveat.
Most coverage of M3's launch printed its benchmark table as if it were settled. It isn't. Three structural problems sit under the numbers, and ignoring them produces a misleading routing decision. We surface all three before we use any figure.
First, the scores are vendor-run. Every M3 benchmark was produced on MiniMax's own infrastructure and scaffolding, with baselines MiniMax selected. As of June 3, 2026, independent evaluators such as Artificial Analysis and LMArena had not yet published their own assessments. Treat M3's numbers as claims awaiting confirmation, not as established facts.
Second, the Terminal-Bench versions differ. M3's 66.0% Terminal-Bench score is on version 2.1; the GPT-5.5 (82.7%) and Opus 4.8 (74.6%) figures are on version 2.0. Most published comparisons drop these into one column as if they were interchangeable. They are not the same benchmark, so any direct numeric comparison across that row carries an explicit caveat.
Third, MiniMax compared against the wrong Opus. M3's launch materials benchmarked it against Claude Opus 4.7 (64.3% on SWE-Bench Pro), not the Opus 4.8 that Anthropic had shipped days earlier. Whether deliberate or convenient, the effect is the same: the comparison understates how far M3 trails. At 59.0%, M3 sits about 5 points behind Opus 4.7 but about 10 points behind the current Opus 4.8 (vendor-stated 69.2%).
"Every one of those numbers is vendor-run, on MiniMax's own infrastructure, with baselines they picked."
— Anonymous evaluator cited in TechTimes, June 1, 2026
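The version caveat above can be enforced mechanically instead of being left to a footnote. The sketch below is a minimal illustration, assuming a hypothetical `Score` record and `margin` helper (neither comes from any vendor tooling); it computes a point margin only when both scores sit on the same benchmark and version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """One benchmark result. The class and field names are illustrative."""
    model: str
    benchmark: str
    version: str
    percent: float
    vendor_stated: bool = True  # the scores quoted here come from vendor blogs


def margin(a: Score, b: Score) -> float:
    """Return a's lead over b in points, refusing cross-version comparisons."""
    if (a.benchmark, a.version) != (b.benchmark, b.version):
        raise ValueError(
            f"not comparable: {a.model} is on {a.benchmark} {a.version}, "
            f"{b.model} is on {b.benchmark} {b.version}"
        )
    return round(a.percent - b.percent, 1)


gpt55 = Score("GPT-5.5", "Terminal-Bench", "2.0", 82.7)
opus48 = Score("Opus 4.8", "Terminal-Bench", "2.0", 74.6)
m3 = Score("MiniMax M3", "Terminal-Bench", "2.1", 66.0)

# Same version on both sides, so the margin is meaningful.
print(margin(gpt55, opus48))

try:
    margin(gpt55, m3)
except ValueError as err:
    # The 2.0 vs 2.1 mismatch is rejected rather than silently compared.
    print(err)
```

The `vendor_stated` flag is carried along so that a report built on these records can label unverified figures; how it is surfaced is left open here.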
03 — Repo-Level Work
SWE-Bench Pro: where Opus 4.8 leads.
SWE-Bench Pro tests real-world GitHub issue resolution — multi-file diffs against codebases, with no public ground-truth leakage. It is the closest single benchmark to the work most engineering teams actually hand an AI: "fix this issue in this repo." On this test, Opus 4.8 leads the field at a vendor-stated 69.2%, a roughly 10.6-point margin over GPT-5.5 (58.6%) and a similar gap over M3 (59.0%).
The chart below uses the corrected Opus 4.8 baseline, not the Opus 4.7 figure MiniMax cited. At 59.0%, M3 trails not only Opus 4.8 but also the older Opus 4.7 (64.3%). The single benchmark where M3 exceeds an Opus model is BrowseComp autonomous web search, where its vendor-stated 83.5% tops Opus 4.7's 79.3% — a different task class from repo resolution.
SWE-Bench Pro · real-world GitHub issue resolution
Source: vendor blogs + The Decoder, DataCamp (June 2026). All figures vendor-stated.
Opus 4.8's edge isn't only the headline accuracy. Early tester reports describe it as markedly less likely than Opus 4.7 to let a code flaw pass unflagged, and Artificial Analysis data cited by The Decoder put it at roughly 15% fewer passes per task and about 35% fewer output tokens than 4.7 on comparable agentic work. For repo-level engineering, those efficiency gains compound: fewer retries and fewer tokens partially offset the premium per-token price.
On the long-context retrieval that real repositories demand, the gap widens. Opus 4.8 leads GPT-5.5 by the largest single margin in the set on GraphWalks BFS at 1M tokens (68.1% vs 45.4%), and leads on OSWorld-Verified (83.4% vs 78.7%) and MCP-Atlas (82.2% vs 75.3%). If your agent has to hold a large codebase in context and reason across it, that retrieval fidelity is the deciding factor.
04 — Shell & DevOps
Terminal-Bench: GPT-5.5 owns the shell.
Terminal-Bench tests long-running, shell-based agentic tasks — the command-line, DevOps-flavored work where an agent installs dependencies, runs builds, debugs failing pipelines, and operates a real terminal. Here GPT-5.5 is the leader at 82.7% on Terminal-Bench 2.0, roughly 8 points ahead of Opus 4.8's 74.6% on the same version.
M3's 66.0% sits below both — but on Terminal-Bench 2.1, a different version of the benchmark. We show it on a separate track below precisely because comparing it directly against the 2.0 figures would be apples-to-oranges. The honest read: on the corroborated, same-version 2.0 numbers, GPT-5.5 is the shell specialist; M3's shell standing is genuinely unknown until it is re-run on a common version.
Terminal-Bench · long-running shell agentic tasks
Source: vendor blogs (June 2026). Note: M3 on Terminal-Bench 2.1; others on 2.0.
05 — The Cost Story
The cost gap is real — and smaller than the headlines.
M3's pitch is economics, and on raw token price the gap is substantial. At standard pricing — $0.60 per 1M input, $2.40 per 1M output — M3 is roughly 8x cheaper on input than Opus 4.8 ($5 / $25) and a similar multiple cheaper than GPT-5.5 ($5 / $30). For high-volume, cost-sensitive batch work, that is a genuine structural advantage.
But the widely-quoted multiples deserve scrutiny. The ~17x figure uses M3's 7-day launch promotion ($0.30 / $1.20), not its permanent rate. And one widely-shared headline framed M3 as matching frontier benchmarks "at 5-10% of the cost" by comparing M3's promotional rate to GPT-5.5's standard rate — at standard M3 pricing the input ratio is closer to 12% of GPT-5.5, not 5%. The durable, honest number is roughly 8x, not 17x.
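The multiples quoted above can be reproduced directly from the listed prices. This is a small sketch using only the per-1M-token figures given in the text; the dictionary keys and helper names are illustrative, not part of any pricing API.

```python
# USD per 1M tokens as (input, output), taken from the figures quoted above.
PRICES = {
    "M3 standard": (0.60, 2.40),
    "M3 promo": (0.30, 1.20),
    "Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def input_multiple(cheap: str, expensive: str) -> float:
    """How many times cheaper `cheap` is than `expensive` on input tokens."""
    return PRICES[expensive][0] / PRICES[cheap][0]


def input_share(cheap: str, expensive: str) -> float:
    """Input price of `cheap` as a percentage of the input price of `expensive`."""
    return 100 * PRICES[cheap][0] / PRICES[expensive][0]


for rate in ("M3 standard", "M3 promo"):
    print(
        f"{rate}: {input_multiple(rate, 'Opus 4.8'):.1f}x cheaper than Opus 4.8 on input, "
        f"{input_share(rate, 'GPT-5.5'):.0f}% of GPT-5.5's input price"
    )
```

Run against the standard rate, this gives the roughly 8x and 12% figures; against the promotional rate, it gives the roughly 17x multiple and a share near the low end of the "5-10%" headline.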
Input price per 1M tokens · relative scale
Source: OpenRouter + vendor pricing pages (June 2026). Bar length = relative input price.
06 — Cost Per Successful Task
The number nobody publishes.
Token price is the wrong production metric. What actually governs your bill is cost per successful task: the all-in spend to get a correct result, including the retries a less-accurate model needs. Coverage of M3 cites raw token ratios (8x, 17x) but stops there. The interesting question is: how much of a success-rate penalty does M3's cost advantage absorb before it disappears?
The arithmetic is simple. If M3 costs roughly one-eighth of Opus 4.8 per attempt, it can fail far more often and still come out ahead on cost per success. At breakeven, M3 would need roughly eight times as many attempts per success as Opus 4.8. Using the SWE-Bench Pro gap as a rough proxy for repo-level quality (59.0% vs 69.2%, vendor-stated), the implied figure is closer to 1.2 times as many. For most workloads M3 never gets close to breakeven, which is why it stays compelling for high-volume batch even after you adjust for quality.
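The breakeven above can be reproduced in a few lines. This is a sketch under loud assumptions: each attempt succeeds independently with probability equal to the vendor-stated SWE-Bench Pro score, both models consume the same number of tokens per attempt, and the standard input price stands in for per-attempt cost. None of these hold exactly in production, and the function names are illustrative.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per success when failures are retried until one passes.

    With independent attempts, the expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def breakeven_success_rate(
    cheap_cost: float, premium_cost: float, premium_rate: float
) -> float:
    """Success rate at which the cheap model's cost per success equals the premium's."""
    return premium_rate * cheap_cost / premium_cost


# Assumption: USD per 1M input tokens is used as the cost of one attempt.
m3 = cost_per_success(0.60, 0.590)
opus = cost_per_success(5.00, 0.692)
floor = breakeven_success_rate(0.60, 5.00, 0.692)

print(f"M3 per success: {m3:.2f}")
print(f"Opus 4.8 per success: {opus:.2f}")
print(f"Opus 4.8 / M3 ratio: {opus / m3:.1f}x")
print(f"M3 success rate at breakeven: {floor:.1%}")
```

Under these assumptions the per-attempt gap of about 8.3x narrows to about 7.1x per success, and M3's success rate would have to fall to roughly 8% before the two models cost the same per correct result. Swapping in your own measured success rates and per-attempt token counts is the point of the exercise.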
Standard-rate input gap
At $0.60 vs $5.00 per 1M input, M3's per-attempt cost is roughly an eighth of Opus 4.8's. A model can fail several times for each Opus success and still be cheaper per successful task.
SWE-Bench Pro delta
M3 (59.0%) trails Opus 4.8 (69.2%) on repo resolution, vendor-stated. That gap implies more retries on hard multi-file tasks — the variable that erodes a raw token-price advantage.
Cost-sensitive long-context batch
For bulk long-document understanding, code search, and high-throughput pipelines where occasional retries are cheap, M3's price advantage survives any realistic quality penalty. This is its home turf.
"The right production metric is cost per successful task, not cost per token."
— Lushbinary M3 vs Opus GPT-5.5 comparison, June 2026
The practical takeaway: route by retry tolerance. Where a failed attempt is cheap to detect and re-run — batch summarization, bulk code search, first-pass drafts — M3's economics dominate. Where a failed attempt is expensive — a wrong multi-file diff merged into production, a botched migration — the success-rate penalty matters more than the token price, and Opus 4.8's accuracy earns its premium. For a deeper treatment of this trade-off, see our cost-per-task analysis for agentic coding.
07 — The Routing Matrix
One decision per workload.
Here is the full three-way routing matrix — six workload classes, each matched to the model whose strengths fit it. No published comparison runs all three models against all six use cases at once; most cover two models or one task. Treat the M3 entries as provisional until independent benchmarks land, and benchmark on your own prompts before changing a production default.
Multi-file diffs against a real codebase
Opus 4.8 leads SWE-Bench Pro (vendor-stated 69.2%) and the long-context retrieval that large repos demand. When a wrong diff is expensive, accuracy beats token price.
Long-running command-line agents
GPT-5.5 leads Terminal-Bench 2.0 (82.7%) for builds, pipelines, and terminal-driven tasks. Its shell standing is the most clearly corroborated of the three on a common benchmark version.
Million-token context, retrieval-heavy
Opus 4.8 leads GraphWalks BFS 1M (68.1% vs GPT-5.5's 45.4%) when fidelity matters most. For high-volume, cost-sensitive long-context batch, M3's 1M window and low price win instead.
Interleaved text + image inputs
M3 ships native multimodal input and strong vendor-stated computer-use scores. Treat the figures as unverified for now, but this is a genuine M3 differentiator at its price point.
Bulk pipelines, retry-tolerant
M3's roughly 8x standard-rate cost advantage survives any realistic quality penalty where failed attempts are cheap to re-run. This is the clearest M3 win.
On-prem, sovereignty-bound deployment
Only M3 offers a self-host path, but weights had not shipped at launch and the license was unpublished (M2.7 precedent suggests non-commercial). For now, the hosted models are the only deployable option.
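The six rows above translate naturally into a small routing function. The sketch below is one possible encoding, not a reference implementation: the `Workload` enum, the `route` function, its flags, and the model labels are all hypothetical names chosen for illustration, and the labels are not real API model identifiers.

```python
from enum import Enum
from typing import Optional


class Workload(Enum):
    REPO_ENGINEERING = "multi-file diffs against a real codebase"
    SHELL_DEVOPS = "long-running command-line agents"
    LONG_CONTEXT = "million-token context, retrieval-heavy"
    MULTIMODAL = "interleaved text + image inputs"
    BULK_BATCH = "bulk pipelines, retry-tolerant"
    SELF_HOSTED = "on-prem, sovereignty-bound deployment"


# Placeholder labels, not real API model identifiers.
OPUS, GPT, M3 = "opus-4.8", "gpt-5.5", "minimax-m3"


def route(
    workload: Workload,
    *,
    cost_sensitive: bool = False,
    m3_self_host_cleared: bool = False,
) -> Optional[str]:
    """Pick a model for a workload; None means nothing deployable fits yet."""
    if workload is Workload.REPO_ENGINEERING:
        return OPUS
    if workload is Workload.SHELL_DEVOPS:
        return GPT
    if workload is Workload.LONG_CONTEXT:
        # Fidelity-first goes to Opus 4.8; high-volume, cost-sensitive batch goes to M3.
        return M3 if cost_sensitive else OPUS
    if workload in (Workload.MULTIMODAL, Workload.BULK_BATCH):
        return M3
    # SELF_HOSTED: only M3 has a self-host path, and only once the weights ship
    # and the license is confirmed to permit the intended use.
    return M3 if m3_self_host_cleared else None


for w in Workload:
    print(f"{w.value}: {route(w)}")
```

Returning `None` for the self-hosted row while the weights and license are unconfirmed keeps the provisional status explicit, so a caller has to handle it instead of silently falling back to a hosted model.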
The meta-point: in mid-2026 there is no single "best coding model." The teams getting the most out of these tools run a multi-vendor routing layer — Opus 4.8 as the repo-level default, GPT-5.5 for shell pipelines, and M3 (once weights and license are confirmed) for cost-sensitive batch and self-hosting. If you're standing up that routing layer, our AI transformation engagements start with exactly this kind of comparative eval on your own corpus, and our development team wires the routing into your stack.
08 — Deployment Reality
Open-weight — with asterisks.
M3's "open-weight" framing is doing a lot of work, and it deserves precise language. At launch, the weights and the technical report were not shipped. MiniMax said both would land on Hugging Face and GitHub within roughly 10 days of June 1 — an estimated date around June 11 — which means the MSA architecture, safety behavior, and the headline efficiency claims were unverifiable on launch day.
Two further caveats matter for anyone planning to self-host. The license is unpublished. The prior MiniMax M2.7 used a non-commercial license that required prior written authorization for commercial use; M3 is expected to follow a similar pattern, but the terms had not been posted. Open-weight is not the same as open-source, and downloadable weights do not automatically grant commercial rights — verify the license before any production use.
Self-hosting is hardware-intensive. M3's parameter count is undisclosed (community analysis of the MoE architecture estimates 200-400B total, explicitly an estimate). Community guidance for a 4-bit quantization points to roughly 75-150GB of VRAM (multiple high-end accelerators or a high-memory workstation), with broad inference-engine support still maturing at launch. For the full hardware-and-cost decision, see our guide to self-hosting open-weight models and the broader open-weight vs closed-source trade-offs.
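A back-of-envelope check helps when sizing hardware against estimates like these. The sketch below counts memory for the weights alone, assuming every parameter is stored at the stated bit width and ignoring KV cache, activations, and quantization overhead; the function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes).

    Excludes KV cache, activations, and any per-block quantization metadata.
    """
    return params_billion * bits_per_param / 8


# The 200B and 400B bounds are the community estimates quoted above, not disclosed figures.
for params in (200, 400):
    gb = weight_memory_gb(params, 4)
    print(f"{params}B parameters at 4-bit: about {gb:.0f} GB for weights")
```

For the 200-400B range this weights-only figure comes out somewhat above the 75-150GB community guidance, so the two estimates do not line up exactly. The gap could reflect a lower parameter assumption, partial offloading, or tighter quantization, none of which the sources specify; treat both ranges as rough until the weights and technical report are published.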
09 — Conclusion
The routing call beats the headline.
There is no single best coding model — only the right model per workload.
The cleanest read of this three-way race is also the least dramatic: route by task. Opus 4.8 is the repo-level default, with the highest vendor-stated SWE-Bench Pro score and the strongest long-context retrieval. GPT-5.5 is the shell and DevOps specialist, leading Terminal-Bench on the version everyone can compare. MiniMax M3 is the economics play — frontier-class claims at roughly an eighth of the standard token cost, and the only one with a future self-host path.
What separates a good routing decision from a bad one is taking the caveats seriously. M3's numbers are vendor-run and, as of June 3, unverified. Its weights had not shipped and its license was unpublished. The Terminal-Bench comparison mixes benchmark versions, and the flashiest cost multiple used a one-week promotion. None of that makes M3 a bad bet — it makes it a bet you should pilot through the API before you wire it into anything that matters.
The broader signal is that the coding-model market has fragmented into specialists. A year ago the question was "which model is smartest." In mid-2026 it is "which model fits this workload at the cost and accuracy I need." Teams that build a small multi-vendor routing layer — and re-test it as independent benchmarks land — will spend less and ship more than teams chasing a single default.