AI Development11 min read

MiniMax M2.7 Release: Agentic Coding Benchmarks Guide

MiniMax M2.7 release deep dive — agentic coding benchmarks, tool-use performance, multi-step reasoning, and how it compares to M2.5 and Kimi K2.5.

Digital Applied Team
April 12, 2026
11 min read
56.22%

SWE-Pro

10B

Active Params

$0.30/M

Input Price

50x cheaper

vs Opus 4.6

Key Takeaways

Trained Itself: M2.7 ran 100+ rounds of autonomous scaffold optimization during training — analyzing failures, editing its own agent harness, running evals, and keeping what worked. The loop produced a 30% improvement on internal evaluations without human-designed prompt engineering.
Near-Opus Coding at 50x Lower Cost: On SWE-Pro M2.7 scores 56.22%, approaching Claude Opus 4.6's best level, while running at $0.30 input and $1.20 output per million tokens — roughly 50x cheaper on input than Opus 4.6.
Sparse MoE with 10B Active: The architecture is a ~230B parameter Mixture-of-Experts with only 10B active per token, the smallest active-parameter footprint in the Tier-1 coding class. Inference cost and latency track the active count, not the total.
Strong Real-World Adoption: M2.7 is #4 by total tokens on OpenRouter with 1.34T tokens/week and +24% growth, and #3 on the coding leaderboard at 13.0% of all coding tokens routed through the platform.
Open Weights and NVIDIA NIM: Open weights were released alongside the API, and the model is available through NVIDIA NIM microservices for self-hosted inference. Agencies can run it on their own infrastructure or hit the direct API at $0.30/$1.20.
3x Faster Than Opus 4.6: The small active-parameter count lets M2.7 generate roughly 3x faster than Opus 4.6 in end-to-end agentic loops, which compounds the cost gap when tasks require many tool calls.

MiniMax M2.7 didn't just ship stronger coding benchmarks — it trained itself. Over 100 rounds of autonomous scaffold optimization during training produced a 30% improvement on MiniMax's internal evaluations, with no human prompt engineer in the loop. The model analyzed its own failures, proposed changes to its agent harness, ran evals, and decided what to keep. That loop is the single most interesting thing about this release.

The headline coding numbers are strong in their own right: 56.22% on SWE-Pro, 57.0% on Terminal Bench 2, and 55.6% on VIBE-Pro, which MiniMax characterizes as nearly approaching Claude Opus 4.6's best level. But the story agencies actually care about is the price tag and the usage data. M2.7 costs $0.30 per million input tokens and $1.20 per million output tokens — roughly 50x cheaper than Opus 4.6 on a typical coding workload — and has already climbed to #4 on OpenRouter with 1.34T tokens per week and +24% growth. This guide walks through what the self-evolving training actually did, how the sparse MoE architecture delivers the cost advantage, the real benchmark picture against Opus 4.6 and GPT-5.4, and the decision matrix for when M2.7 should replace your current coding model.

Self-Evolution: What Actually Happened

The phrase "self-evolving" gets used loosely in AI marketing, so it helps to be specific about what MiniMax actually did. During training of M2.7, the team set up a closed optimization loop where the model itself improved the agent harness wrapped around it. This was not online self-modification — the deployed weights are fixed — and it was not gradient-level self-training. It was the model doing what a human prompt engineer normally does, on itself, at training time.

The Optimization Loop in Five Steps
  1. Analyze failures from the previous evaluation batch and identify recurring error patterns.
  2. Plan changes to the scaffold — prompts, tool descriptions, retrieval policies, step sequencing.
  3. Modify the scaffold by emitting the edited harness configuration directly.
  4. Run evaluations on the modified scaffold across the benchmark suite.
  5. Compare and decide whether to keep the new scaffold or revert to the previous best.

MiniMax ran more than 100 of these rounds during training. The aggregate effect across the full loop was a 30% improvement on their internal coding evaluations. That is a significant number, but the more interesting implication is methodological: it shows that a sufficiently capable model can do the prompt-engineering and harness-tuning work that human AI engineers usually own. For agencies building production agent systems, this hints at where the labor cost of maintaining agentic pipelines is headed.

What the Loop Discovered

MiniMax has not published the full diff of every scaffold change, but the public materials describe optimizations around tool routing (when to call a tool vs. reason directly), error-recovery patterns (how to retry after a failing shell command), and context management (how much repository state to pull in per step). Many of these optimizations match the scaffolding tricks that independent AI engineers had been sharing on forums and research papers — M2.7 rediscovered them autonomously and built them into its default behavior.

The practical consequence is that M2.7 ships with a strong baseline agentic harness already baked in. Agencies spinning up the model for coding tasks need less custom scaffolding than earlier Chinese coding models required. This is part of why the model performs well on real-world usage despite not topping headline benchmarks — it has better out-of-the-box behavior in long agentic loops.

Architecture: Sparse MoE at Scale

M2.7 is a sparse Mixture-of-Experts model with roughly 230B total parameters and 10B active per token. The 10B active count is the smallest in the Tier-1 coding class — smaller than Xiaomi's MiMo-V2-Pro (42B active), smaller than Z.ai's GLM-5 (roughly 32B active), and far smaller than the dense frontier models. Because inference cost and latency track the active-parameter count, not the total, M2.7's unit economics sit in the budget tier even though its total parameter count puts it among the larger models on the market.

Total Capacity
~230B parameters

The full MoE has roughly 230B parameters distributed across many expert modules. The model picks a small subset of experts to run for each token, keeping compute bounded while retaining the knowledge and representational capacity of a much larger network.

Active Compute
10B per token

Only 10B parameters activate per token. This is what produces the ~3x speed advantage over Opus 4.6 in end-to-end agentic loops, and what lets MiniMax price the model at $0.30/$1.20 per million tokens while keeping margin.

The Efficiency Tradeoff

Sparse MoE architectures earn cost and latency gains, but they have well-known tradeoffs. Specialist routing can struggle on out-of-distribution tasks where the right expert for the input is ambiguous. Fine-tuning a sparse MoE is harder than fine-tuning a dense model of equivalent quality. And long-tail knowledge is more unevenly distributed across experts, so MoE models can show surprising weak spots on rarely-seen topics. For the agentic-coding use cases M2.7 is tuned for, these tradeoffs mostly don't bite — coding has a relatively focused distribution of tool calls, error patterns, and reasoning shapes, which is exactly the scenario where expert routing works well.

Context Window and Throughput

The context window is 205K tokens, comfortably larger than what most agency coding workloads need and large enough to fit repo-scale context for most projects. Reported throughput is roughly 3x Opus 4.6 end-to-end, which matters more than raw tokens-per-second when an agentic task runs 40 or 50 tool calls sequentially. Halving latency per turn compounds across the loop and changes what is feasible in interactive agent sessions.

Agentic Coding Benchmarks

MiniMax published M2.7 against three agentic coding benchmarks: SWE-Pro, VIBE-Pro, and Terminal Bench 2. All three are harder than the widely-reported SWE-bench Verified, so the raw numbers look lower than what Claude Opus 4.6 posts on Verified. The fair comparison requires matching benchmarks, and on that basis M2.7 lands close to Opus 4.6's internal results on the same harder evaluations.

BenchmarkMiniMax M2.7What It Measures
SWE-Pro56.22%Harder successor to SWE-bench Verified with multi-file, multi-step engineering tasks
Terminal Bench 257.0%Multi-step shell and command-line agent tasks across real environments
VIBE-Pro55.6%End-to-end full project delivery — spec to deployable code without step-level supervision

How M2.7 Compares to the Rest of the Field

Direct head-to-head on identical benchmarks is sparse because each lab emphasizes different evaluations. Here's the like-for-like picture where numbers exist, plus the most-cited numbers from adjacent benchmarks to give a rough sense of position.

ModelTop Coding BenchmarkInput / Output ($/M)Notes
MiniMax M2.756.22% (SWE-Pro)$0.30 / $1.20Self-evolving, 10B active, 205K context
Claude Opus 4.680%+ (SWE-bench Verified)$5 / $25Verified uses easier benchmark; Opus scores 53.4% on Pro
MiniMax M2.580.2% (SWE-bench Verified)$0.12 / $0.99Superseded Mar 18, 2026; different benchmark emphasis
GPT-5.457.7% (SWE-bench Pro)$2.50 / ~$10Most comparable number published on SWE-Pro
MiMo-V2-Pro49.2 (Intelligence Index)$1.00 / $3.00#1 on OpenRouter by usage, 1M context

The cleanest comparison in the table is M2.7's 56.22% against GPT-5.4's 57.7% on SWE-bench Pro. The two models are within 1.5 points on the same benchmark, with M2.7 priced at roughly 12% of GPT-5.4's input cost. Opus 4.6 reports 53.4% on SWE-Pro per Anthropic's own comparison table, meaning M2.7 slightly edges Opus 4.6 on the harder agentic benchmark despite trailing on Verified.

For the broader field context, our Q2 2026 Chinese AI market share report covers how M2.7, MiMo-V2-Pro, and Qwen are collectively capturing coding workload share, and our Q2 2026 LLM pricing index places M2.7 against every other model on the market by input, output, and blended cost.

Cost Economics

The pricing gap between M2.7 and Claude Opus 4.6 is the single biggest commercial fact about this release. MiniMax prices M2.7 at $0.30 per million input tokens and $1.20 per million output tokens. Opus 4.6 prices at $5 and $25 for the same volumes. That's roughly 16.7x cheaper on input and 20.8x cheaper on output, before the ~3x speed advantage compounds through an agentic loop.

Per-Task Example: A 40-Step Agentic Coding Loop
Typical shape: 400K input tokens, 80K output tokens across all turns
  • Opus 4.6: 400K × $5/M + 80K × $25/M = $2.00 + $2.00 = $4.00 per task
  • M2.7: 400K × $0.30/M + 80K × $1.20/M = $0.12 + $0.096 = $0.22 per task
  • Ratio: ~18x cheaper per task on this workload shape. At 10,000 tasks/month that's $40,000 on Opus vs $2,200 on M2.7.

The "50x cheaper" headline appears in MiniMax's own marketing and reflects a workload shape where input volume dominates and prompt caching or free-tier discounts don't apply. The honest range is 15-50x depending on input/output ratio and whether either side is using volume discounts. The floor of ~15x is already enough to change where agencies deploy each model.

When Price Actually Matters

For one-off client deliverables where quality is the gating factor and total API spend is a small fraction of engineer time, the price gap rarely matters. Where it does matter is in repeat-run workloads: automated PR review across a client's repositories, batch test generation, nightly agentic refactors, automated migration work across many services. Those workloads scale linearly with client count and repository size, and dropping per-task cost from $4 to $0.22 is the difference between a profitable productized offering and one that eats margin at volume. For a primer on how we help agencies wire these systems into client workflows, see our AI Digital Transformation service.

OpenRouter Adoption

OpenRouter is the closest public signal for what developers are actually routing production traffic through. M2.7 has climbed quickly on three OpenRouter rankings since its March 18 release.

#4 by Tokens
Overall weekly usage

1.34T tokens per week, +24% week-over-week growth. Sits behind MiMo-V2-Pro, free Qwen 3.6 Plus, and free Step 3.5 Flash, meaning #1 on the paid, non-free tier.

#3 on Coding
Coding leaderboard

1.05T coding tokens — 13.0% of all coding tokens routed through OpenRouter. The coding top 3 (MiMo-V2-Pro, Qwen 3.6 Plus, M2.7) collectively handle over 62% of coding traffic.

#3 by Apps
Unique deployments

13.2M apps using the model (7.0% share). Application breadth matters more than raw tokens for predicting staying power — M2.7 is being picked across many use cases, not concentrated in a few heavy routes.

The adoption curve matters because it shapes availability and pricing stability. Models that capture sustained coding share tend to get picked up by second- and third-party inference providers, which drives aggregate capacity up and unit prices down over time. M2.7's position between the free-tier adoption leaders and the high-quality premium incumbents means it is likely to see continued inference-layer investment.

For related context, our guides to MiniMax's earlier M2 agent platform and the M2.1 digital-employee coding release trace how MiniMax evolved from general-purpose agent tooling into the coding-focused, self-evolving model M2.7 represents.

Deployment Options

M2.7 ships with three deployment paths: the MiniMax direct API, NVIDIA NIM microservices for enterprise self-hosting, and open weights for fully self-managed deployment. Each path has a different cost, latency, and data-residency profile.

Direct API and OpenRouter

The simplest path. Hit MiniMax's API at $0.30/$1.20 per million tokens, or go through OpenRouter at the same published rate for provider redundancy. This is the right default for most agency workloads — no infrastructure to run, best-available throughput, and no upfront commitment. Start here and only move off if a specific constraint pushes you elsewhere.

NVIDIA NIM Microservices

NVIDIA packages M2.7 as a NIM microservice, which gives enterprise teams a containerized deployment with optimized inference on H100/H200 GPU clusters. This is the path for clients who need data-sovereignty controls, contractual on-prem commitments, or guaranteed regional compute. Expect meaningfully higher unit cost than the direct API before you hit the break-even volume, which for 10B active MoE inference typically lands in the tens of millions of tokens per day.

Open Weights and Self-Hosting

MiniMax released open weights alongside the API, so teams can run M2.7 on their own hardware with vLLM, TensorRT-LLM, or any MoE-aware inference server. Self-hosting makes sense for privacy-sensitive workloads, highly predictable traffic at very high volumes, or research purposes. For most agencies the operational overhead — GPU procurement, inference-layer tuning, MoE routing performance work — is not worth the marginal cost gap over the direct API.

When M2.7 Wins and When It Doesn't

M2.7 is not a drop-in replacement for Claude Opus 4.6 on every workload. The right framing for agencies is route-and-escalate: send the high-volume, predictable-shape tasks to M2.7 for cost reasons, and escalate the hard cases where quality ceiling matters to Opus. Here's the decision matrix.

Pick M2.7 When
  • Workload is high-volume. Automated PR review, batch test generation, repository-scale refactor pipelines where task cost multiplies across thousands of runs.
  • Task shape is predictable. Agentic loops where the tool-call pattern repeats and the model's self-evolved scaffolding already covers the shape.
  • Latency matters more than ceiling quality. 3x faster generation beats 10% higher benchmark scores in interactive agent work.
  • You need data sovereignty. Open weights plus NIM make on-prem deployment feasible, which Opus simply cannot match.
Stay on Opus When
  • Task requires deep reasoning. Novel architectural decisions, ambiguous specs, debugging subtle distributed-systems bugs — the quality ceiling matters.
  • Deliverable is client-facing premium work. One-shot code quality for leadership demos, pitch deliverables, or flagship client features where the margin on API cost is irrelevant.
  • Multi-modal needs dominate. High-res vision, diagram reading, dense screenshot analysis — Opus 4.7's 2,576px image support has no M2.7 equivalent.
  • Tool ecosystem is the bottleneck. Workloads tied to Claude Code, Anthropic-specific extensions, or the Claude Platform's adaptive thinking and task budgets.

The Route-and-Escalate Pattern

The practical production shape for most agencies is to route incoming coding tasks to M2.7 by default and escalate to Opus 4.6 on specific signals: task complexity scores over a threshold, repeated failures on M2.7, or explicit classification tags. A blended pipeline with 85% M2.7 and 15% Opus handles production quality without paying Opus rates on the easier majority. The same pattern pairs well with our CRM automation builds, where high-volume enrichment and summarization tasks are M2.7-shaped and edge cases escalate to a human or Opus.

For related reading on multi-model routing strategies and the adjacent omnimodal space, see our MiMo V2 Omni release guide, which covers the omni-modal equivalent in the Xiaomi stack.

Conclusion

MiniMax M2.7 is the clearest signal yet that sparse MoE architectures and self-directed training loops are going to squeeze the cost-per-agentic-task curve faster than most teams have planned for. The 10B active parameter design delivers near-Opus coding quality at roughly 1/50th the input cost, and the self-evolution methodology produces an out-of-the-box agent harness that handles real-world coding loops with less custom scaffolding than earlier Chinese coding models required.

The honest limitation: M2.7 is not the quality ceiling. For deep-reasoning work on novel problems, Opus 4.6 and GPT-5.4 still lead. But for the long tail of repeatable agentic coding workloads — PR review, test generation, migration scripts, agentic refactors at repository scale — the economics now strongly favor M2.7 as the default, with premium models reserved for the hard cases. Running both in a route-and- escalate pattern is the pattern most production agency teams should be moving toward over Q2 2026.

Ready to Rewire Your AI Coding Stack?

Whether you're evaluating M2.7 for production workloads, designing a route-and-escalate pipeline across multiple models, or standing up self-hosted inference for data-sensitive clients, we can help you map the model landscape to your actual tasks.

Free consultation
Expert guidance
Tailored solutions

Frequently Asked Questions

Related Guides

Continue exploring Chinese AI models, agentic coding, and frontier cost economics