MiniMax M2.5: China’s Frontier AI Challenger Redefines Efficiency
A 230-billion-parameter model that activates only 10 billion parameters per inference pass, scores within 0.6 percentage points of the world’s best coding model, and costs roughly one dollar per hour to run at full speed. That’s MiniMax M2.5 in a single sentence – and it’s the reason the AI industry’s cost assumptions are being rewritten in real time.
Released on February 12, 2026, by Beijing-based MiniMaxAI, the M2.5 represents a generational leap over its predecessor M2.1 and a direct challenge to frontier models from Anthropic, OpenAI, and Google. Its mixture-of-experts (MoE) architecture activates just 4.3% of total parameters during any given inference pass, delivering what amounts to frontier-tier capability on a fraction of the compute. The result is a model that matches Claude Opus 4.6’s output speed, surpasses GPT-5.2 on SWE-Bench Verified, and does it all at pricing that undercuts every major Western competitor by an order of magnitude.
This isn’t a niche research project or a proof of concept. MiniMax reports that 80% of its own newly committed code is now generated by M2.5, and the model autonomously handles 30% of daily company tasks. Here’s a deep look at what makes it tick, where it excels, and where it falls short.
Architecture: The MoE Efficiency Engine
MiniMax M2.5 employs a mixture-of-experts transformer architecture with 230 billion total parameters distributed across 8 experts. During inference, top-2 routing activates exactly 2 of those 8 experts per token, meaning only about 10 billion parameters are active on any given pass. That 4.3% activation ratio is the foundation of everything that follows – the speed, the cost, and the deployment flexibility.
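The activation arithmetic, and what top-2 routing means mechanically, can be sketched in a few lines of Python. This is an illustrative toy: the gate values and the simple softmax router below are schematic assumptions, not MiniMax's actual routing network.

```python
import math

TOTAL_PARAMS = 230e9   # 230B total parameters
ACTIVE_PARAMS = 10e9   # ~10B touched per token

# The headline efficiency figure: fraction of weights active per pass.
activation_ratio = ACTIVE_PARAMS / TOTAL_PARAMS  # ~0.043, i.e. ~4.3%

def top2_route(gate_logits):
    """Toy top-2 router: softmax over per-expert gate logits, keep the two
    highest-scoring experts, renormalize their weights to sum to 1."""
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top2 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

# One token's gate logits over 8 experts: only experts 2 and 5 get computed;
# the other six expert blocks are skipped entirely for this token.
routed = top2_route([1.2, 0.3, 2.5, -0.7, 0.0, 1.9, 0.5, -1.0])
```

The skipped experts are where the savings come from: their weights never leave memory idle-cold into the compute units, so per-token FLOPs scale with active parameters, not total.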
The model supports a 204,800-token context window, with the underlying architecture capable of scaling to 1 million tokens. It can generate up to 128,000 output tokens in a single response. The entire model is released under an MIT license on Hugging Face, enabling self-hosting, fine-tuning, and commercial deployment without licensing fees.
What sets M2.5’s training apart is the proprietary Forge reinforcement learning framework. Rather than relying primarily on human preference data or supervised fine-tuning, Forge deploys the model across more than 200,000 real-world environments – actual code repositories, web browsers, and office applications. The framework uses an algorithm called CISPO (designed specifically to maintain MoE model stability during large-scale RL training) and achieves a claimed 40x training speedup through asynchronous scheduling and tree-structured sample merging.
Benchmark Performance: Near the Top in Coding
The headline number is 80.2% on SWE-Bench Verified – a benchmark that tests models against real GitHub pull requests requiring bug fixes and feature implementations across production codebases. That score trails Claude Opus 4.6 (80.8%) by just 0.6 percentage points and edges past GPT-5.2 (80.0%) and Gemini 3 Pro (78.0%).
But M2.5’s coding strength extends well beyond a single benchmark.
| Benchmark | MiniMax M2.5 | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% | 80.0% | 78.0% |
| Multi-SWE-Bench | 51.3% | 50.3% | – | 42.7% |
| SWE-Bench Pro | 55.4% | 55.4% | 55.6% | 54.1% |
| BFCL Multi-Turn (Tool Calling) | 76.8% | 63.3% | – | 61.0% |
| Terminal-Bench 2 | 51.7% | 55.1% | 54.0% | 54.0% |
| BrowseComp (w/ context) | 76.3% | 84.0% | 65.8% | 59.2% |
The BFCL multi-turn tool-calling score of 76.8% stands out dramatically – it leads Claude Opus 4.6 by over 13 percentage points. This suggests that MiniMax’s real-environment RL training translates directly into superior function orchestration, which is precisely what matters in agentic coding workflows where models must call APIs, edit files, and navigate repositories autonomously.
On reasoning and knowledge benchmarks, M2.5 posts solid but not category-leading numbers: 86.3% on AIME25, 85.2% on GPQA-Diamond, 44.4% on SciCode, and 70.0% on IFBench. These are respectable scores, but competitors like GLM-5 (92.7% on AIME 2026) and Claude Opus 4.6 pull ahead on pure reasoning tasks. M2.5 was clearly optimized for coding and agentic execution, not abstract mathematics.
Speed and Task Efficiency
M2.5 ships in two variants, and the speed difference matters for different use cases.
| Variant | Output Speed | Input Price (per M tokens) | Output Price (per M tokens) | Hourly Cost |
|---|---|---|---|---|
| M2.5 Standard | 50 tokens/sec | $0.15 | $1.20 | ~$0.30 |
| M2.5 Lightning | 100 tokens/sec | $0.30 | $2.40 | ~$1.00 |
The Lightning variant’s 100 tokens per second matches Claude Opus 4.6’s throughput and runs 52% faster than GLM-5’s approximately 66 tokens per second. Independent testing measured the standard variant at 47.2 tokens per second with a time-to-first-token of 3.03 seconds.
Beyond raw token speed, M2.5 demonstrates meaningful task-level efficiency improvements. It completes a single SWE-Bench task in just 22.8 minutes, down from M2.1’s 31.3 minutes – a 37% speed improvement. Token consumption dropped too: 3.52 million tokens per SWE-Bench task versus M2.1’s 3.72 million, a 5% reduction attributed to better task decomposition. MiniMax calls this the model’s “Spec-writing” coding style – it breaks down architecture and plans features before touching implementation, which reduces inefficient trial-and-error loops.
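The efficiency deltas quoted above reduce to simple arithmetic on MiniMax's published figures:

```python
# Per-task wall-clock time and token budget on SWE-Bench, M2.1 vs M2.5
# (figures as quoted in this article).
m21_minutes, m25_minutes = 31.3, 22.8
m21_tokens, m25_tokens = 3.72e6, 3.52e6

speedup = m21_minutes / m25_minutes            # ~1.37x throughput, i.e. ~37% faster
token_reduction = 1 - m25_tokens / m21_tokens  # ~5.4% fewer tokens per task
```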
The Cost Equation That Changes Everything
This is where M2.5 becomes genuinely disruptive. A typical complex SWE-Bench coding task costs approximately $0.15 with M2.5 Standard. The same task on Claude Opus 4.6 runs roughly $3.00 – a 20x difference for near-identical benchmark results.
At the Lightning tier, running the model continuously at 100 tokens per second costs about $1 per hour. Four instances running 24/7 for an entire year would cost less than minimum wage in most countries. M2.5’s output pricing of $1.20 per million tokens is just 37% of GLM-5’s $3.20 per million tokens, despite M2.5 outperforming GLM-5 on coding benchmarks by 2.4 percentage points.
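Those per-hour figures check out against the token pricing. The sketch below counts output tokens only, which is why it lands slightly under the table's ~$0.30 and ~$1.00 (those presumably also fold in input-token costs):

```python
def hourly_output_cost(tokens_per_sec, usd_per_million_tokens):
    """Cost of one hour of sustained generation, output tokens only."""
    return tokens_per_sec * 3600 / 1e6 * usd_per_million_tokens

standard = hourly_output_cost(50, 1.20)    # ~$0.22/hr
lightning = hourly_output_cost(100, 2.40)  # ~$0.86/hr
```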
The economic implications extend beyond API pricing. Because M2.5 activates only 10 billion parameters (compared to GLM-5’s 40 billion active parameters), it can be deployed on consumer-grade GPUs. This dramatically lowers the infrastructure barrier for organizations that want to self-host rather than rely on cloud APIs – particularly relevant for enterprises in regulated industries or regions with data sovereignty requirements.
Agentic Capabilities and the Architect Mode
M2.5 isn’t just a code completion engine. It’s designed as what MiniMax calls a “digital employee” – a full-stack agentic AI capable of coding, web search, tool calling, and office productivity tasks including Word documents, PowerPoint presentations, and Excel spreadsheet manipulation.
The model includes an “architect mode” where it functions as a planning and coordination layer, breaking complex tasks into subtasks and delegating execution sequentially. This proves particularly effective for multi-file code refactors and complex project work requiring sequential tool calls. In practice, the model outlines structure and feature design before diving into implementation – a behavior that emerged during RL training rather than being explicitly programmed.
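The pattern is easy to picture in code. The sketch below is a hypothetical plan-then-execute loop, not MiniMax's implementation; `call_model` stands in for whatever chat-completion client you use:

```python
# Hypothetical sketch of the architect-mode pattern described above:
# plan first, then execute subtasks sequentially, feeding each result
# back so later steps can build on earlier ones.
def architect_run(task, call_model, max_steps=10):
    """Plan-then-execute loop. `call_model(prompt) -> str` is any chat API."""
    plan = call_model(f"Break this task into an ordered list of subtasks:\n{task}")
    results = []
    for step in plan.splitlines()[:max_steps]:
        if not step.strip():
            continue
        context = "\n".join(results)
        results.append(call_model(f"Completed so far:\n{context}\n\nNow do: {step}"))
    return results
```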
- Web search: 76.3% on BrowseComp with context management, 70.3% on Wide Search
- Tool calling: 76.8% BFCL multi-turn, with 20% fewer search rounds than M2.1
- Office work: 59.0% win rate versus mainstream models in GDPval-MM evaluation
- Languages: Trained on 10+ programming languages including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby
- Platforms: Full-stack coverage across Web, Android, iOS, and Windows
The office productivity angle is unusual for a frontier model. MiniMax collaborated with senior professionals in finance, law, and social sciences to train M2.5 on deliverable-quality outputs – financial models, legal documents, research reports, and presentations that follow professional standards rather than generating rough drafts.
Running M2.5 Locally: What You Need
The 10-billion active parameter design makes local deployment feasible in ways that most frontier models simply aren’t. Here’s what the hardware requirements look like across quantization levels.
| Quantization | VRAM / Memory Required | Recommended Hardware | Approximate Speed |
|---|---|---|---|
| FP16 (full precision) | ~460 GB | 8x NVIDIA H100 | 20-50 tokens/sec |
| INT8 | ~230 GB | 4x NVIDIA H100 | 30-60 tokens/sec |
| INT4 / 4-bit | 115-130 GB | 2x H100 or 1x B200 | 50-100 tokens/sec |
| 3-bit (Unsloth UD-Q3_K_XL) | ~101 GB (unified) | Mac with 128 GB RAM | 20+ tokens/sec |
For Mac users, the Unsloth Dynamic 3-bit GGUF quantization compresses the model from 457 GB (unquantized bf16) to 101 GB – a 78% reduction – while maintaining competitive accuracy. Independent benchmarks show Unsloth’s dynamic quantizations significantly outperform non-Unsloth GGUF alternatives at equivalent bit levels, largely because important layers are upcast to 8- or 16-bit precision.
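The table's memory figures follow from a bits-per-weight calculation. This is a naive sketch: real quantized files add overhead for quantization scales and the upcast layers just mentioned, which is why the 3-bit file weighs ~101 GB rather than the raw ~86 GB.

```python
def naive_weight_gb(total_params, bits_per_weight):
    """Raw weight storage: params * bits / 8, in gigabytes (1 GB = 1e9 bytes).
    Ignores quantization scales, embeddings, and mixed-precision layers."""
    return total_params * bits_per_weight / 8 / 1e9

P = 230e9
fp16 = naive_weight_gb(P, 16)  # 460.0 GB, matching the table
int8 = naive_weight_gb(P, 8)   # 230.0 GB
int4 = naive_weight_gb(P, 4)   # 115.0 GB
q3   = naive_weight_gb(P, 3)   # 86.25 GB; the gap to 101 GB is quant overhead
```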
Quick Setup via llama.cpp on Mac
- Build llama.cpp from the latest GitHub source with `cmake` (use `-DGGML_CUDA=OFF` for CPU/Metal inference)
- Download the quantized model: `hf download unsloth/MiniMax-M2.5-GGUF --include "*UD-Q3_K_XL*"`
- Launch with recommended settings: `--ctx-size 16384 --flash-attn on --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40`
Production Deployment via vLLM on Linux
- Install CUDA 12.1+ and vLLM: `pip install vllm torch torchvision torchaudio`
- Download the model from Hugging Face with `trust_remote_code=True` (required for MoE routing configuration)
- Launch the server: `vllm serve minimaxai/MiniMax-M2.5 --quantization awq --dtype bfloat16 --max-model-len 125000 --tensor-parallel-size 2 --gpu-memory-utilization 0.95`
- Adjust `--tensor-parallel-size` to match your GPU count; omitting tensor parallelism can result in 10x slower inference
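Once the server is running, vLLM exposes an OpenAI-compatible endpoint (port 8000 by default). A minimal stdlib client might look like the sketch below; the base URL and the sampling values (borrowed from the llama.cpp recommendations above) are assumptions to adjust for your deployment:

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Construct a chat-completion request for a local vLLM server."""
    payload = {
        "model": "minimaxai/MiniMax-M2.5",  # must match the `vllm serve` argument
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,   # sampling settings assumed from the llama.cpp notes
        "top_p": 0.95,
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```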
Critical pitfalls to avoid: never exceed 0.95 GPU memory utilization (drop to 0.90 and use INT4 if you’re hitting limits), always enable `trust_remote_code=True` or MoE routing will fail silently, and keep context length at or below 196,600 tokens, since pushing toward the 204,800 ceiling produces degraded output.
Where M2.5 Falls Short
No model is without weaknesses, and M2.5 has clear ones.
Terminal-Bench 2 at 51.7% versus Claude Opus 4.6’s 55.1% reveals a gap in terminal-environment coding. General reasoning scores – particularly the 19.4% on Humanity’s Last Exam without tools – indicate this model was not optimized for broad knowledge work or abstract reasoning. If you need a model to reason about advanced mathematics or answer obscure factual questions, Opus and GPT-5.2 still win convincingly.
The model is also notably verbose. During evaluation, it generated 56 million tokens compared to an average of 15 million across comparable models – nearly 4x the typical output volume. This verbosity inflates costs in production if not managed through careful prompt engineering and output length limits.
There are also legitimate concerns about benchmark reliability. Several developers have flagged MiniMax’s history of benchmark reward-hacking with earlier M2 and M2.1 releases. While M2.5’s scores have been independently validated across multiple harnesses (Droid, OpenCode), the concern is worth noting for anyone making procurement decisions based solely on published numbers.
Market Significance and Outlook
M2.5 arrives at a pivotal moment in the global AI landscape. Chinese open-source models grew from roughly 1.2% of global usage in late 2024 to nearly 30% by end of 2025, driven by cost advantages and open-weight accessibility. MiniMax’s IPO-track trajectory – shares rose 15.7% to HK$680 following the M2.5 release – signals investor confidence that efficiency-first architectures can compete with scale-first Western approaches.
The strategic implications are significant. U.S. chip sanctions have constrained Chinese labs’ access to cutting-edge hardware, but MiniMax trained its predecessor M1 on just 512 H800 chips over three weeks at approximately $540,000. Rather than competing on raw compute, Chinese firms are competing on architectural intelligence – hybrid attention mechanisms, aggressive MoE routing, and agent-native RL training that extracts more capability per FLOP.
For enterprises, the calculus is straightforward. M2.5 delivers frontier-level coding performance, sub-$1 hourly operational costs, consumer-GPU deployability, and MIT-licensed open weights. Organizations in regulated industries gain data sovereignty through local deployment. Cost-sensitive operations gain 10-20x savings on AI infrastructure without meaningful performance sacrifices on coding tasks. The model’s limitations in pure reasoning and broad knowledge work mean it won’t replace Claude or GPT-5.2 for every use case – but for the specific, high-value domain of AI-assisted software engineering, it’s now the efficiency benchmark every other model will be measured against.