MiniMax M3 launched on May 31, 2026, and the pitch is bold: the first open-weight release to hold frontier coding, a one-million-token context window, and native multimodality at the same time. The architecture behind it is MiniMax Sparse Attention, the launch price is a fraction of closed frontier, and the API was live day one. The honest framing matters just as much as the headline.

Two things temper the excitement, and a credible read has to surface both. First, every benchmark MiniMax published was run on its own infrastructure with no independent validation at launch. Second, the open weights were not actually available on launch day; MiniMax committed to releasing them within roughly ten days, and the license was unconfirmed. So the "fully open" story is, for now, a promise rather than a delivered fact.

This guide covers what shipped, the sparse-attention design that makes 1M context affordable, the vendor-stated benchmark picture, the awkward timing against Claude Opus 4.8, the pricing math that makes the token plans genuinely interesting, and a clear decision framework for who should adopt now versus wait.

Key takeaways

01
Three capabilities in one open-weight model.MiniMax bills M3 as the first open-weight release to combine frontier coding, a 1M-token context window, and native multimodal input (images, video, computer use) in a single model.
02
Sparse Attention is the headline engine.MiniMax Sparse Attention (MSA) selects relevant blocks of uncompressed key-values instead of running full quadratic attention, cutting per-token compute to a vendor-stated 1/20th of M2 at 1M-token context.
03
Every benchmark is vendor-run.SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, OSWorld-Verified 70.06%, BrowseComp 83.5, and MCP Atlas 74.2% were all produced on MiniMax infrastructure with no independent validation confirmed at launch.
04
Level with Opus 4.7, behind Opus 4.8.M3's comparisons targeted Claude Opus 4.7. On directly comparable agent benchmarks it trails the three-days-older Opus 4.8 by roughly 10 to 14 points, while undercutting closed-frontier pricing dramatically.
05
Open weights were pending, not shipped.At launch the weights were not yet on Hugging Face and the license was unconfirmed. MiniMax committed to an open-weight release within roughly ten days. Treat enterprise on-prem plans as a near-term promise.

01 — What ShippedAn API release, with open weights committed within days.

What went live on May 31 was the model behind an API, not a weights drop. M3 appeared same-day on OpenRouter under minimax/minimax-m3, exposed a 1M-token context window, and shipped with day-one compatibility for IDE integrations including Claude Code, Cursor, Roo Code, and Cline. The API uses a toggleable thinking mode: on for deep reasoning and long-horizon planning, off for low-latency completion.

MiniMax positions M3 as a clean version break from the M2 line, not a decimal increment. Its direct predecessor was M2.7, a self-evolving model that the company reported was handling a meaningful share of its internal reinforcement-learning workflow autonomously. M3 carries that agentic-first philosophy forward and adds native multimodality. If you followed the lineage, our MiniMax M2.7 and MiniMax M2.5 guides set the context for how aggressively this team has been iterating.

Live day one

M3 API

1M context · thinking mode toggle

Available via the MiniMax platform and on OpenRouter under minimax/minimax-m3 from launch day, with day-one support for Claude Code, Cursor, Roo Code, and Cline. The model card lists a 1M-token context window.

openrouter.ai/minimax/minimax-m3

Committed within days

Open weights

Hugging Face + GitHub · license TBC

MiniMax committed to publishing open weights and a technical report within roughly ten days of launch. At launch the weights were not yet available and the exact license was unconfirmed. Verify before planning any on-prem deployment.

Pending release · ~10-day window

Release snapshot

MiniMax M3 launched May 31, 2026 as an open-weight release committed within days. The API and OpenRouter listing went live same-day; the weights and technical report were promised for a follow-on window on Hugging Face and GitHub. Launch pricing on a 50% promotion is $0.30 / $1.20 per 1M tokens (input / output), with a standard rate of $0.60 / $2.40 after the promotion. Total parameter count was not disclosed at launch.

One detail builders should not over-read: M3's total parameter count is undisclosed. The M2.7 predecessor was a 229B-total / 9.8B-active mixture-of-experts model, but MiniMax has not confirmed M3 inherits those figures, so we treat the size as unknown rather than carrying forward old numbers. Always read the official model card and license text once the weights actually publish.

02 — The Three-Way JamWhy coding, context, and multimodality have been incompatible.

The reason this release reads as ambitious is that those three properties have historically pulled against each other. Quadratic attention scaling makes a genuinely usable million-token window expensive, which is why long-context models have leaned on compression tricks that trade away precision. Multimodality bolted on after the fact has tended to weaken visual reasoning rather than strengthen it. And frontier-level coding has typically demanded the kind of dense compute budgets that push long-context inference costs past what most teams will pay.

MiniMax's framing is that M3 breaks all three constraints at once: a sparse-attention design that keeps long context affordable, a training corpus that was multimodal from step zero, and agentic coding scores it claims rival closed frontier. Whether the model fully delivers on that promise is exactly the question independent benchmarks have not yet answered. The architecture is real and documented; the capability claims are vendor-stated.

Trained natively multimodal

Interleaved pretraining corpus

100T

MiniMax states M3 was pretrained on over 100 trillion tokens of natively interleaved text, image, and video data from step zero, rather than fitting a vision adapter onto a text model after the fact.

Vendor-stated

Document understanding

OmniDocBench (vendor)

Lead

MiniMax reports M3 scoring above Gemini 3.1 Pro on OmniDocBench document understanding, attributing the result to the multimodal training pipeline. No independent confirmation at launch.

vs Gemini 3.1 Pro

Visual-to-code

SVG-Bench (vendor)

Lead

On SVG-Bench, which measures turning a visual into code, MiniMax claims M3 surpasses Claude Opus 4.7. Video benchmarks were run on up to 1,024 frames, but no numeric video scores were published at launch.

vs Claude Opus 4.7

03 — MiniMax Sparse AttentionMSA: block-level selection of real key-values.

The technical centerpiece is MiniMax Sparse Attention (MSA). Rather than computing full quadratic self-attention across the whole sequence, MSA performs block-level selection on the key-value cache. The distinction MiniMax draws against latent-compression approaches matters: where some designs compress key-values into a smaller latent space, MSA selects relevant blocks of uncompressed grouped-query key-values. The claimed benefit is that precision and prefix-caching compatibility are preserved, because the model still attends to the actual stored representations rather than a lossy summary.

The kernel engineering is part of the story. MiniMax describes a "KV outer gather Q" pattern, where key-value blocks form the outer loop and every query that hits a given block is batched together so that block is read from memory exactly once, in contiguous rather than scattered access. MiniMax claims this runs more than four times faster than open-source sparse-attention alternatives such as Flash-Sparse-Attention or flash-moba. As with the rest of the architecture claims, that figure is vendor-stated.

MSA efficiency vs M2 at 1M-token context · vendor-stated

Source: MiniMax M3 blog (vendor-stated; not independently validated)

M2 full attention (baseline)Per-token compute at 1M-token context

100%

M3 with MSAPer-token compute at 1M context · vendor-stated

~5%

Prefill speedupM3 vs full-attention M2 at 1M · vendor-stated

>9×

Decode speedupM3 vs full-attention M2 at 1M · vendor-stated

>15×

Read the bars carefully. The compute figure is the model's claimed reduction to roughly one-twentieth of M2's per-token cost at one million tokens, and the speedups are the vendor's prefill and decode numbers on its own hardware. The earlier MiniMax teaser cited more precise figures of about 9.7 times prefill and 15.6 times decode; the final launch materials rounded these to more than nine and more than fifteen times. Either way, none of it has been reproduced by a third party, so treat it as a design claim worth testing rather than a settled result.

MSA operates on a standard GQA backbone but utilizes block-level selection on real, uncompressed Key-Values.— Elie Bakouch, Prime Intellect (AI training infrastructure)

04 — BenchmarksThe numbers, every one of them vendor-run.

MiniMax published a strong agentic benchmark sheet. Before quoting any of it, set expectations: these scores were produced on MiniMax infrastructure and had no independent validation at launch. If you are unsure what these evaluations actually measure, our SWE-Bench Pro and Terminal-Bench 2.1 guide breaks down the methodology. The headline figures: SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, OSWorld-Verified 70.06% for computer-use task completion, BrowseComp 83.5 for autonomous web search, and MCP Atlas 74.2% for tool use.

M3 benchmark sheet · vendor-stated, not independently validated

Source: MiniMax M3 blog + VentureBeat (all M3 scores vendor-stated)

SWE-Bench ProVendor 59.0% · ahead of DeepSeek V4 Pro 55.4%

59.0%

Open lead

Terminal-Bench 2.1Vendor 66.0% · roughly level with Opus 4.7

66.0%

≈ Opus 4.7

BrowseCompVendor 83.5 · claims to exceed Opus 4.7's 79.3

83.5

Vendor lead

MCP AtlasVendor 74.2% · narrowly over DeepSeek V4 Pro 73.6%

74.2%

Narrow lead

OSWorld-VerifiedVendor 70.06% · trails Opus 4.8 at 83.4%

70.06%

Opus 4.8

SWE-fficiencyVendor 34.8% · harder agentic efficiency eval

34.8%

Lower band

KernelBench HardVendor 28.8% · low-level kernel synthesis

28.8%

Lower band

M3 (vendor-stated)Where a comparison model leads

Two long-horizon autonomy demos give a more textured sense of what M3 can attempt. In one, MiniMax reports the model ran roughly twelve hours without human intervention, produced 18 commits and 23 experimental figures, and reproduced an ICLR 2025 award-winning paper with a vendor-stated reproduction score of 0.650. In another, M3 reportedly improved NVIDIA Hopper FP8 GEMM hardware utilization from 7.6% to 71.3% across 147 submissions over about a day with no reference solution, where comparable models gave up after a few dozen attempts. Both demos are vendor-reported.

Every one of those numbers is vendor-run, on MiniMax's own infrastructure.— Thomas Wiegold, independent researcher

That caveat is the editorial spine of this release. A useful contrast: other recent frontier launches had independent intelligence-index numbers within roughly a day. M3 did not, at launch. The disciplined move is to wait for independent leaderboard and arena results before treating any of these figures as production-grade evidence, then run your own evaluation on the prompts you actually care about.

05 — The Timing GapM3 was benchmarked against the wrong Opus.

Here is the framing most day-of coverage either missed or buried. M3 launched on May 31, three days after Claude Opus 4.8 shipped on May 28. MiniMax's comparisons were set against Claude Opus 4.7, the pre-Opus-4.8 frontier. On the agent benchmarks where a direct comparison exists, the newer Opus 4.8 leads M3 by double-digit margins: SWE-Bench Pro 69.2% versus 59.0%, Terminal-Bench 2.1 74.6% versus 66.0%, and OSWorld-Verified 83.4% versus 70.06%. Our three-way coding routing matrix sets M3 against Opus 4.8 and GPT-5.5 head to head.

That is not a reason to dismiss M3 — it is a reason to frame it correctly. The accurate read is that M3 lands roughly level with Opus 4.7 on agentic work while costing a small fraction of the closed price, and trails the three-days-older Opus 4.8 by ten to fourteen points on directly comparable evals. Stated that way, the story stays compelling without overclaiming. For the closed-frontier reference point, see our Claude Opus 4.8 release coverage.

M3 vs the model it did not compare against · Claude Opus 4.8

Source: VentureBeat (M3 scores vendor-stated; Opus 4.8 from independent benchmarks)

SWE-Bench ProM3 vendor 59.0% · Opus 4.8 69.2%

69.2%

Opus 4.8 +10.2

Terminal-Bench 2.1M3 vendor 66.0% · Opus 4.8 74.6%

74.6%

Opus 4.8 +8.6

OSWorld-VerifiedM3 vendor 70.06% · Opus 4.8 83.4%

83.4%

Opus 4.8 +13.3

M3 (vendor-stated)Claude Opus 4.8 (independent)

The honest read

M3 is not the new agentic frontier; Claude Opus 4.8, released three days earlier, leads it by double digits on every directly comparable agent benchmark. What M3 is: a credible open-weight option that lands near the prior frontier on agentic work at a small fraction of closed-frontier pricing. Both of those statements are true at the same time.

06 — Pricing & PlansWhere M3 is genuinely disruptive.

The cost story is where M3 makes its strongest case. Launch pricing is $0.30 per million input tokens and $1.20 per million output on a limited-time 50% promotion, with a standard rate of $0.60 and $2.40 after. At standard rates that is roughly 8% to 20% of leading closed-frontier per-token pricing. Requests up to 512K input tokens bill at the standard rate; longer contexts cost more, with the exact surcharge not publicly disclosed at launch. For broader context on how this sits against the rest of the market, see our API pricing comparison.

The subscription tiers are the more interesting wrinkle. MiniMax offers shared multimodal quota across text, image, speech, and music: a Plus tier at roughly 1.7 billion tokens a month with 3 to 4 concurrent agents, a Max tier at roughly 5.1 billion tokens with 4 to 5 concurrent agents, and an Ultra tier at roughly 9.8 billion tokens with 6 to 7 concurrent agents. For high-volume agentic builders the breakeven math is striking, which is what the table below works out.

Plus subscription

Tokens per month

1.7B

Around 1.7 billion shared tokens a month with 3 to 4 concurrent agents. At the $0.60 standard input rate, 1.7 billion input tokens alone would run on the order of a thousand dollars on pay-as-you-go.

Entry tier

Max subscription

Tokens per month

5.1B

Around 5.1 billion shared tokens with 4 to 5 concurrent agents, plus a small daily allowance of video clips. The middle tier for teams running several agents in parallel through the day.

Team tier

Ultra subscription

Tokens per month

9.8B

Around 9.8 billion shared tokens with 6 to 7 concurrent agents and a larger daily video allowance. Aimed at heavy multi-agent workloads where pay-as-you-go would be far more expensive.

Heavy tier

Plus · ~$20/mo

Solo developer, multi-agent

~1.7B tokens and 3-4 concurrent agents. The equivalent pay-as-you-go input cost alone is on the order of ~$1,000/month at standard rates, so the subscription is a large discount for steady high-volume use. Confirm billing cadence before committing.

Subscription wins for steady use

Max · ~$50/mo

Small team, parallel agents

~5.1B tokens and 4-5 concurrent agents. The natural fit when several builders or pipelines hit M3 through the day and the workload is predictable enough to favor a flat rate over metered spend.

Subscription for predictable load

Ultra · ~$120/mo

Heavy multi-agent workloads

~9.8B tokens and 6-7 concurrent agents. For sustained automation at scale this dwarfs pay-as-you-go pricing, but only if your usage actually approaches the quota each month.

Subscription for sustained scale

Pay-as-you-go

Bursty or unpredictable usage

$0.30/$1.20 promo or $0.60/$2.40 standard per 1M tokens. Best when volume is low or spiky and you would not get near a subscription quota. Watch the 512K long-context billing threshold for big-document workloads.

PAYG for low or spiky volume

Confirm before you commit

Subscription pricing references reflect launch-day reporting, and at least one tier was reported as annually billed rather than month-to-month. Treat the dollar figures and quotas as indicative and verify the current terms and billing cadence on the MiniMax platform before purchasing.

07 — The Open-Weight CaveatA Day-0 promise, not a delivered reality.

The "open-weight" label is doing a lot of work in the launch messaging, so be precise about it. On May 31 the weights were not on Hugging Face, and the license was unconfirmed — the candidates named in coverage were a permissive open license, but nothing was settled. MiniMax committed to publishing the weights and a technical report within roughly ten days. For anyone planning on-prem deployment, fine-tuning, or sovereignty-bound use, that means the open story is a near-term commitment to verify, not a capability you could act on at launch.

There is also a governance consideration that belongs in any enterprise evaluation, even for API use. As a Chinese company, MiniMax operates under China's 2017 National Intelligence Law, which obligates domestic firms to support and cooperate with state intelligence work. That applies to API-routed prompts regardless of where the user sits. It is not a reason to rule M3 out, but teams handling sensitive or regulated data should account for it alongside the usual model-selection criteria.

Verification checklist

Before treating M3 as production-ready, confirm three things directly from primary sources: that the open weights have actually published and under which license, that an independent evaluation corroborates the vendor benchmark numbers for your workload class, and that the current API pricing and long-context billing threshold match what you budgeted.

08 — Who Should SwitchA decision framework for builders and teams.

M3 is not a universal default, but it is a strong fit for specific profiles today and a wait-and-watch for others. The deciding factors are usage volume, tolerance for unvalidated benchmarks, and whether you need open weights now or can act on the API while they ship.

High-volume agent builder

Cost-sensitive parallel agents

If you run many concurrent agents on a predictable workload, the token plans and per-token pricing make M3 hard to ignore. Benchmark it on your own tasks against your current default, then decide on the subscription tier that matches real usage.

Pilot M3 now

Long-context multimodal work

Documents, images, video at scale

Native multimodality plus a 1M-token window at this price is a genuinely differentiated combination. Validate the multimodal quality on your data, since the supporting scores are vendor-stated and the numeric video benchmarks were not published.

Evaluate on your corpus

Top-of-stack agentic coding

Maximum capability, cost secondary

If you need the strongest agentic coding available and price is a secondary concern, the current independent picture favors closed frontier such as Claude Opus 4.8, which leads M3's vendor numbers on the comparable benchmarks.

Stay with closed frontier

Sovereignty / regulated data

On-prem or compliance-bound

Wait. The open weights and license were not confirmed at launch, and the data-governance considerations need accounting for. Revisit once the weights publish, the license is known, and independent evaluations land.

Wait for weights + audits

For most agencies and engineering teams the right first step is a scoped evaluation: run M3 on the prompts and repositories you actually care about, measure token spend and latency against your current default, and decide per-workload rather than per-headline. If you want help structuring that comparison, our AI digital transformation engagements start with exactly this kind of model-selection eval, and our development team can wire the winning model into your agent stack.

09 — ConclusionA real release with an asterisk.

The shape of open frontier, May 2026

A compelling open-weight option — once the verification arrives.

MiniMax M3 is an ambitious release that claims something genuinely new for open models: frontier coding, a million-token context window, and native multimodality fused into one model, powered by a sparse attention design that makes long context affordable rather than aspirational. The pricing, especially the token-plan breakeven math, is the most immediately actionable part of the story.

The honest framing keeps two facts in view. The benchmarks are vendor-run and unvalidated, and the open weights were committed within days rather than shipped at launch. M3 lands roughly level with the prior Opus 4.7 frontier on agentic work and trails the newer Opus 4.8 by ten to fourteen points on the comparable evals — not the agentic frontier, but a serious option at a small fraction of the closed price.

The broader signal is that open-weight competition is now setting the cost floor for agentic work, and pushing closed frontier on price even when it cannot match it on capability. The practical move is the same one that always wins: wait for independent results, run your own evals on the workloads you care about, and let the numbers you can verify decide.

MiniMax M3: 1M Context, Open-Weight, Agentic Frontier

01 — What ShippedAn API release, with open weights committed within days.

M3 API

Open weights

02 — The Three-Way JamWhy coding, context, and multimodality have been incompatible.

Interleaved pretraining corpus

OmniDocBench (vendor)

SVG-Bench (vendor)

03 — MiniMax Sparse AttentionMSA: block-level selection of real key-values.

MSA efficiency vs M2 at 1M-token context · vendor-stated

04 — BenchmarksThe numbers, every one of them vendor-run.

M3 benchmark sheet · vendor-stated, not independently validated

05 — The Timing GapM3 was benchmarked against the wrong Opus.

M3 vs the model it did not compare against · Claude Opus 4.8

06 — Pricing & PlansWhere M3 is genuinely disruptive.

Tokens per month

Tokens per month

Tokens per month

Solo developer, multi-agent

Small team, parallel agents

Heavy multi-agent workloads

Bursty or unpredictable usage

07 — The Open-Weight CaveatA Day-0 promise, not a delivered reality.

08 — Who Should SwitchA decision framework for builders and teams.

Cost-sensitive parallel agents

Documents, images, video at scale

Maximum capability, cost secondary

On-prem or compliance-bound

09 — ConclusionA real release with an asterisk.

A compelling open-weight option — once the verification arrives.

Pick the model your workload actually needs — evidence first.

Model-selection engagements

The questions we get every week.

Continue exploring frontier releases.

MiniMax M3 vs Opus 4.8 vs GPT-5.5: Coding Showdown

StepFun Step 3.7 Flash: 196B MoE Agentic Vision Model

OpenRouter June 2026: New Models, Pricing and Rankings

Qwen 3.7 Plus: Alibaba's Low-Cost Agent Model GA Release