
MiniMax M2.5: Coding Benchmarks, Pricing, and Guide

MiniMax M2.5 scores 80.2% on SWE-Bench Verified at roughly 1/10th the cost of competing frontier models. Complete guide to features, benchmarks, pricing, API access, and model comparisons.

Digital Applied Team
February 12, 2026
14 min read
  • 80.2% SWE-Bench Verified
  • 1/10th competitor cost
  • 37% faster than M2.1
  • 200K+ training environments

Key Takeaways

80.2% SWE-Bench Verified: State-of-the-art coding performance with 51.3% Multi-SWE-Bench and 55.4% SWE-Bench Pro scores
1/10th Competitor Cost: $0.30/M input and $2.40/M output for the Lightning variant — roughly $1 per hour at 100 tokens per second
37% Faster Than M2.1: The Lightning variant reaches 100 tokens per second, matching Claude Opus 4.6's output speed
Full-Stack Agentic AI: Coding, web search, tool calling, and office work including Word documents, PowerPoint, and Excel spreadsheets
Forge RL Framework: Agent-native reinforcement learning with CISPO algorithm and 40x training speedup across 200,000+ environments

MiniMax released M2.5 in February 2026 — a frontier AI model that scores 80.2% on SWE-Bench Verified, placing it within 0.6 percentage points of Claude Opus 4.6 while costing roughly 1/10th to 1/20th the price. The model represents a significant leap from MiniMax's previous M2.1 release, which scored 74% on SWE-Bench with 10 billion active parameters.

M2.5 ships in two variants: a standard model running at 50 tokens per second and a Lightning variant at 100 tokens per second. Both are trained using MiniMax's Forge reinforcement learning framework, which scales agent training across 200,000+ real-world environments including code repositories, web browsers, and office applications.

This guide covers M2.5's benchmark performance, pricing, technical architecture, and how it compares to Claude Opus 4.6, GPT-5.2, and Gemini 3 Pro across coding, search, tool calling, and office work tasks.

What Is MiniMax M2.5?

MiniMax M2.5 is a frontier AI model built for agentic workflows — tasks that require the model to use tools, write code, search the web, and complete multi-step processes autonomously. Unlike models optimized primarily for chat or single-turn generation, M2.5 is designed to function as what MiniMax calls a "digital employee" capable of sustained, independent work.

M2.5-Lightning
Speed-optimized variant
  • 100 tokens per second output
  • $0.30/M input, $2.40/M output
  • ~$1/hour operational cost
  • Optimized for high-throughput workflows
M2.5 Standard
Cost-optimized variant
  • 50 tokens per second output
  • $0.15/M input, $1.20/M output
  • ~$0.30/hour operational cost
  • Same benchmark performance as Lightning

Both variants support an "architect mode" where M2.5 acts as a planning and coordination layer, breaking complex tasks into subtasks and delegating execution. This is particularly effective for multi-file code refactors and complex project tasks where sequential tool calls are needed.
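
As a concrete illustration of the architect pattern, the sketch below has one call plan a task into subtasks and a second, faster call execute each one. It assumes the OpenAI-compatible API described later in this guide; the base URL, model ids, and the plan/execute helpers are hypothetical, not MiniMax's documented interface.

```python
# Hypothetical architect/worker loop over an OpenAI-compatible endpoint.
# Base URL and model ids are assumptions; check the MiniMax API docs for real values.
from openai import OpenAI

client = OpenAI(base_url="https://api.minimax.chat/v1", api_key="YOUR_KEY")

def plan(task: str) -> list[str]:
    """Ask the model (acting as 'architect') to split a task into subtasks."""
    resp = client.chat.completions.create(
        model="minimax-m2.5",  # assumed model id
        messages=[
            {"role": "system", "content": "Break the task into independent subtasks, one per line."},
            {"role": "user", "content": task},
        ],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def execute(subtask: str) -> str:
    """Delegate a single subtask to a faster worker call."""
    resp = client.chat.completions.create(
        model="minimax-m2.5-lightning",  # assumed model id
        messages=[{"role": "user", "content": subtask}],
    )
    return resp.choices[0].message.content

for step in plan("Refactor the payments module to use the new billing API"):
    print(execute(step))
```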

Coding Performance

M2.5's headline number is 80.2% on SWE-Bench Verified — a benchmark that tests models against real GitHub pull requests requiring bug fixes and feature implementations across production codebases. This places M2.5 within 0.6 percentage points of Claude Opus 4.6 (80.8%) and ahead of GPT-5.2 (80%) and Gemini 3 Pro (78%).

Coding Benchmark Results
Performance across standard code generation benchmarks

SWE-Bench Verified

80.2%

Real-world GitHub issue resolution across production repositories. Tested with Droid and OpenCode harnesses for agentic tool use.

Multi-SWE-Bench

51.3%

Multi-repository coding tasks requiring cross-project understanding. Leads Opus 4.6 (50.3%) and Gemini 3 Pro (42.7%).

SWE-Bench Pro

55.4%

Harder subset of SWE-Bench with more complex engineering challenges. Tests architectural decision-making alongside implementation.

Agentic Coding Harnesses

M2.5's coding benchmarks were evaluated using Droid and OpenCode — agentic harnesses that give the model access to terminal commands, file editing, and repository navigation. This reflects real-world usage where AI models interact with codebases through tool calls rather than generating isolated code snippets.

The model demonstrates strong multilingual capability across Python, JavaScript, TypeScript, Java, C++, Go, and Rust. SWE-Bench repositories span multiple languages and frameworks, and M2.5's consistent performance across them suggests robust cross-language understanding.

Search and Tool Calling

Beyond coding, M2.5 shows competitive results in web search and tool calling benchmarks — capabilities essential for agentic workflows that require gathering information and interacting with external systems.

Search Performance
  • BrowseComp (w/ context): 76.3% — complex web browsing tasks requiring multi-page navigation
  • Wide Search: 70.3% — broad information retrieval across diverse domains
Tool Calling
  • BFCL multi-turn: 76.8% — significantly ahead of Opus 4.6 (63.3%) and Gemini 3 Pro (61%)
  • 20% fewer rounds: Completes multi-step tasks in fewer tool-calling iterations than competitors

The BFCL (Berkeley Function Calling Leaderboard) result is particularly notable. At 76.8%, M2.5 leads Opus 4.6 by over 13 percentage points in multi-turn tool calling — suggesting that MiniMax's RL training across real-world tool environments translates directly into more efficient function orchestration. The 20% reduction in rounds needed to complete tasks means lower latency and cost in production deployments.
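
To show what multi-turn tool calling looks like in practice, here is a minimal sketch using the OpenAI-compatible chat completions format. The base URL, the model id, and the search_web tool are assumptions for illustration; substitute the values from the MiniMax API docs and a real search backend.

```python
# Minimal multi-turn tool-calling loop against an OpenAI-compatible API.
# Model id, base URL, and the search_web tool are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.minimax.chat/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def search_web(query: str) -> str:
    return f"(stub) results for: {query}"  # replace with a real search backend

messages = [{"role": "user", "content": "Who maintains the SWE-Bench benchmark?"}]
while True:
    resp = client.chat.completions.create(model="minimax-m2.5", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:      # model answered directly, so the loop is done
        print(msg.content)
        break
    messages.append(msg)        # keep the assistant turn that requested tools
    for call in msg.tool_calls: # run each requested tool and return its result
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": search_web(**args)})
```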

Office Work Capabilities

M2.5 extends beyond coding and search into office productivity tasks — a category most frontier models don't explicitly target. MiniMax trained the model to interact with document editing, spreadsheet manipulation, and presentation creation tools.

GDPval-MM

59%

General document processing and understanding across multiple modalities

MEWC

74.4%

Multi-environment work completion — tasks spanning multiple office applications

Supported Office Tasks

  • Word documents: Creating, editing, formatting, and restructuring documents including complex table manipulation and style application
  • PowerPoint presentations: Building slide decks from specifications, adding charts and layouts, and editing existing presentations
  • Excel spreadsheets: Formula creation, data analysis, pivot table generation, and financial modeling
  • Financial modeling: Budget projections, scenario analysis, and report generation from raw data inputs

Efficiency and Speed

M2.5 achieves a 37% speed improvement over M2.1 while significantly expanding capability. The Lightning variant runs at 100 tokens per second — matching Claude Opus 4.6's output speed while costing a fraction of the price.

Efficiency Metrics

Speed

  • 37% faster than M2.1
  • Lightning: 100 TPS output
  • Standard: 50 TPS output
  • Matches Opus 4.6 throughput

Task Efficiency

  • 3.52M tokens average per SWE-Bench task
  • 20% fewer tool-calling rounds
  • Optimized for sustained agentic work
  • Architect mode for complex orchestration

The 3.52 million tokens per SWE-Bench task metric is significant for cost planning. At Lightning pricing ($2.40/M output), a typical complex coding task costs approximately $8.45 in output tokens. Comparable tasks on Opus 4.6 cost roughly $264 at ~$75/M output — a 30x difference for near-identical benchmark results.
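
The arithmetic behind those figures is easy to reproduce. The short script below uses only the prices quoted in this article and, like the comparison above, treats the 3.52M-token figure as output tokens; a real bill would also include input tokens.

```python
# Back-of-envelope reproduction of the per-task cost comparison above.
# Assumes all 3.52M tokens are billed at the output rate, as the article does.
TOKENS_PER_TASK = 3.52e6

PRICE_PER_M_OUTPUT = {          # $ per million output tokens
    "M2.5-Lightning": 2.40,
    "M2.5 Standard": 1.20,
    "Opus 4.6 (approx.)": 75.0,
}

for model, price in PRICE_PER_M_OUTPUT.items():
    cost = TOKENS_PER_TASK / 1e6 * price
    print(f"{model:<20} ~${cost:,.2f} per SWE-Bench-style task")
# M2.5-Lightning       ~$8.45 per SWE-Bench-style task
# Opus 4.6 (approx.)   ~$264.00 per SWE-Bench-style task
```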

Pricing and Cost Analysis

MiniMax positions M2.5 as a cost-disruption play. Both variants are priced 10-20x below comparable frontier models, making sustained agentic use economically viable for tasks that would be prohibitively expensive on alternatives.

Model | Input ($/M) | Output ($/M) | Speed (TPS) | Hourly Cost
M2.5-Lightning | $0.30 | $2.40 | 100 | ~$1.00
M2.5 Standard | $0.15 | $1.20 | 50 | ~$0.30

How M2.5 Compares

The following comparison pits M2.5 against the three other frontier models most commonly used for coding and agentic tasks. All benchmark numbers are from official reports as of February 2026; entries marked n/a were not reported.

Benchmark | M2.5 | Opus 4.6 | GPT-5.2 | Gemini 3 Pro
SWE-Bench Verified | 80.2% | 80.8% | 80% | 78%
Multi-SWE-Bench | 51.3% | 50.3% | n/a | 42.7%
BrowseComp (w/ ctx) | 76.3% | 84% | 65.8% | 59.2%
BFCL Multi-Turn | 76.8% | 63.3% | n/a | 61%
Output Price ($/M) | $2.40 | ~$75 | ~$60 | ~$20

Technical Deep Dive: RL Scaling

M2.5's performance is largely attributed to MiniMax's Forge reinforcement learning framework — a purpose-built system for training AI agents across diverse real-world environments. Forge represents a fundamentally different approach from the RLHF (Reinforcement Learning from Human Feedback) methods used by most competitors.

The Forge Framework

Forge is MiniMax's agent-native RL training infrastructure. Rather than training on static datasets or human preference data, Forge deploys models into live environments — code repositories, web browsers, office applications, and API endpoints — and optimizes based on task completion outcomes.

  • 200,000+ training environments: Real-world codebases, websites, document workflows, and tool APIs used as training grounds
  • 40x training speedup: Forge's distributed architecture parallelizes environment interaction, achieving 40x faster training than standard RL approaches
  • Outcome-based rewards: Models are rewarded for completing tasks correctly, not for matching human-labeled preferences

CISPO Algorithm

CISPO (Clipped Importance Sampling Policy Optimization) is MiniMax's custom RL algorithm, designed specifically for agentic AI training. It extends standard policy optimization methods with improvements for multi-step decision making:

  • Importance sampling across long trajectories — handles the credit assignment problem in multi-step tasks
  • Clipped updates to prevent catastrophic policy changes during training
  • Environment-specific reward shaping that adapts to coding, search, and office work domains
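
For intuition, here is a minimal PyTorch sketch of a clipped importance-sampling policy loss in the spirit of CISPO. The clipping bounds, the per-token advantage, and the exact form of the objective are illustrative assumptions rather than MiniMax's production implementation.

```python
# Illustrative clipped importance-sampling policy loss (CISPO-style sketch).
# A sketch under stated assumptions, not MiniMax's actual training code.
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """
    logp_new:   log-probs of sampled tokens under the current policy (requires grad)
    logp_old:   log-probs under the behaviour policy that generated the rollout
    advantages: outcome-based advantage per token (e.g. derived from task success)
    """
    ratio = torch.exp(logp_new - logp_old)                    # importance weight
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)   # clip the IS weight
    # Detach the clipped weight so gradients flow only through logp_new,
    # keeping every sampled token in the update rather than dropping clipped ones.
    return -(clipped.detach() * advantages * logp_new).mean()

# Toy usage with random tensors standing in for a real rollout.
logp_new = torch.randn(8, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(8)
adv = torch.randn(8)
cispo_loss(logp_new, logp_old, adv).backward()
```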

MiniMax Agent Platform

M2.5 powers MiniMax's Agent platform — a consumer and enterprise product that packages the model's capabilities into ready-to-use workflows. The platform extends beyond API access to provide integrated tools for office work, coding, and information retrieval. For context on the platform's evolution from the original M2 Agent release, see our earlier coverage.

Office Skills

Pre-built workflows for document creation, spreadsheet analysis, presentation building, and cross-application tasks. Users interact through natural language instructions.

10K+ Experts

Domain-specific agent configurations for industries including finance, legal, healthcare, and engineering. Each Expert combines M2.5 with specialized prompting and tool access.

Built by M2.5

MiniMax reports that 80% of the Agent platform's codebase was written by M2.5 itself — a practical demonstration of the model's coding capability at production scale.

The Agent platform positions MiniMax beyond the API-only model provider category. By combining M2.5 with integrated tools, MiniMax offers a product that competes directly with enterprise AI assistants from Microsoft (Copilot), Google (Gemini), and startups building agentic coding tools.

Getting Started with MiniMax M2.5

M2.5 is accessible through the MiniMax API and the Agent platform. The API provides OpenAI-compatible endpoints, making integration straightforward for teams already using standard LLM APIs.

1. API Access

Register at api.minimax.chat to get API credentials. The API supports chat completions, tool calling, and streaming — compatible with OpenAI SDK format. Both M2.5 and M2.5-Lightning are available as separate model endpoints.
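
A minimal quickstart sketch, assuming the OpenAI Python SDK pointed at an OpenAI-compatible MiniMax endpoint. The base URL and model id below are placeholders; use the values from your MiniMax API dashboard and docs.

```python
# Quickstart sketch using the OpenAI Python SDK against an OpenAI-compatible endpoint.
# Base URL and model id are assumptions; confirm them in the MiniMax API docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.minimax.chat/v1",  # assumed endpoint
    api_key="YOUR_MINIMAX_API_KEY",
)

stream = client.chat.completions.create(
    model="minimax-m2.5-lightning",  # assumed model id; a standard variant is also listed
    messages=[{"role": "user", "content": "Write a Python function that deduplicates a list while preserving order."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```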

2. Choose Your Variant

Select based on your primary use case:

  • M2.5-Lightning: Best for real-time applications, interactive coding assistants, and latency-sensitive workflows
  • M2.5 Standard: Best for batch processing, background agentic tasks, and cost-sensitive high-volume operations

3. Agent Platform

For non-developer users or teams wanting ready-to-use AI workflows, the MiniMax Agent platform provides pre-built office skills, domain experts, and a conversational interface. Available at agent.minimax.chat.

Conclusion

MiniMax M2.5 delivers frontier-level coding performance at a fraction of competitor pricing. The 80.2% SWE-Bench Verified score places it alongside Claude Opus 4.6 and GPT-5.2, while the $2.40/M output pricing (Lightning) makes sustained agentic operation economically viable in ways that $60-75/M alternatives do not.

The Forge RL framework and CISPO algorithm represent a meaningful technical differentiation — training AI agents in 200,000+ real-world environments rather than relying solely on human feedback produces measurably better tool-calling performance (76.8% vs 63.3% BFCL) and more efficient multi-step task completion.

For teams evaluating AI coding models, M2.5 warrants serious consideration. The benchmark parity with models costing 10-30x more makes it the clear choice for cost-conscious deployments, while the expanded office work and search capabilities position it as a general-purpose agentic model rather than a coding-only tool.

Integrate Frontier AI Into Your Workflow

We help businesses evaluate, integrate, and scale AI-powered development and automation workflows using the latest frontier models.

Free consultation
Expert guidance
Tailored solutions
