MiniMax M2.5 Review 2026: 80.2% SWE-Bench at $1/Hour

Why you can trust ComputerTech — We spend hours hands-on testing every AI tool we review, so you get honest assessments, not marketing fluff.
Published March 2, 2026 · Updated March 2, 2026



On March 2, 2026, MiniMax dropped a model that should make every AI infrastructure team stop and do the math. MiniMax M2.5 hits 80.2% on SWE-Bench Verified – matching or beating Claude Opus 4.6 in coding – and it runs continuously for $1 per hour at 100 tokens/sec. For comparison, equivalent Claude Opus 4.6 inference on the same benchmark tasks costs roughly 10x more. That’s not a pricing tier. That’s a category reset.

This is the first frontier-class model where the cost of running an agent 24/7 is comparable to a SaaS subscription, not an enterprise compute bill. MiniMax is calling it “intelligence too cheap to meter” – and the numbers back it up.

Rating: 8.7/10

What Is MiniMax M2.5?

MiniMax M2.5 is a frontier AI model purpose-built for coding agents, agentic tool use, and complex autonomous tasks. It was released publicly on March 2, 2026, by MiniMax – a Chinese AI company founded in early 2022 with a stated mission to “co-create intelligence with everyone.”

MiniMax operates at significant scale: 200 million+ global individual users, 130,000+ enterprise clients and developers, and a product suite spanning text, speech, video, image, and music generation. M2.5 is the company’s flagship reasoning model, trained with reinforcement learning across hundreds of thousands of real-world software development environments.

In one sentence: M2.5 delivers Claude Opus 4.6-class coding performance at one-tenth the cost, with open weights available on Hugging Face.

The $1/Hour Problem No One Was Solving

The bottleneck for agentic AI wasn’t capability – it was cost. Running a capable model as a continuous coding agent, doing background tasks, iterating on a codebase, running tests, fixing bugs – the token costs accumulate fast. A frontier model like Claude Opus 4.6 or GPT-5.2 on a real agentic workload could easily run $50-$200 per complex task. For an autonomous agent operating all day, that’s thousands of dollars monthly before you’ve shipped anything.

MiniMax’s answer is structural. M2.5 and M2.5-Lightning are identical in capability, differing only in throughput. The Lightning version runs at 100 tokens/sec – nearly twice the throughput of most frontier models – and costs $0.30/million input, $2.40/million output. The standard M2.5 runs at 50 tokens/sec and costs half that. Neither requires special pricing tiers, enterprise contracts, or commitments.

The SWE-Bench data is the proof point. MiniMax measured that running SWE-Bench Verified end-to-end with M2.5 costs roughly 10% of what the same evaluation costs with Claude Opus 4.6 – with M2.5 matching Opus 4.6’s task completion time (22.8 vs 22.9 minutes average). Same speed, same quality benchmark, 10x cheaper.

This isn’t marketing math. The HuggingFace model card documents M2.5 consuming an average of 3.52 million tokens per SWE-Bench task vs M2.1’s 3.72 million – the model is actually more token-efficient than its predecessor while achieving higher accuracy. Efficiency gains are compounding, not trading off.

Benchmark Performance

MiniMax M2.5 was evaluated across four major benchmark categories: coding, search and tool calling, office work, and efficiency. The results are strong enough to take seriously – though benchmark skepticism is warranted (more on that in the Controversy section).

Coding Benchmarks

| Benchmark | MiniMax M2.5 | Claude Opus 4.6 | GPT-5.2 | Gemini 2.5 Pro | DeepSeek V3 |
|---|---|---|---|---|---|
| SWE-Bench Verified | 80.2% | ~79.0% | ~72-76%* | ~71-75%* | ~49% |
| Multi-SWE-Bench | 51.3% | N/A | N/A | N/A | N/A |
| BrowseComp (w/ context) | 76.3% | ~65%* | ~60-68%* | ~58-65%* | N/A |
| SWE-Bench (Droid harness) | 79.7% | 78.9% | N/A | N/A | N/A |
| SWE-Bench (OpenCode harness) | 76.1% | 75.9% | N/A | N/A | N/A |
| Office Work Win Rate | 59.0% | ~41%* | N/A | N/A | N/A |

*Estimated from available public benchmarks. Droid/OpenCode harness scores from MiniMax HuggingFace model card. Competitor SWE-Bench scores sourced from respective official publications. Always verify at time of use – benchmarks shift with updates.

Efficiency Benchmarks

| Metric | MiniMax M2.5 | MiniMax M2.1 | Claude Opus 4.6 |
|---|---|---|---|
| Avg. SWE-Bench Task Time | 22.8 min | 31.3 min | 22.9 min |
| Avg. Tokens Per SWE Task | 3.52M | 3.72M | ~35M (estimated) |
| Native Throughput | 50-100 TPS | ~40 TPS | ~50 TPS |
| Cost Per SWE Task (est.) | ~$0.85 | ~$0.55 | ~$8.40 |
| Agentic Rounds Efficiency | ~20% fewer rounds vs M2.1 | Baseline | N/A |

Cost estimates based on published token pricing and average token consumption from MiniMax’s HuggingFace model card. Actual costs vary by task complexity and harness.
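As a sanity check, the per-task estimate can be reproduced from the published token pricing. The input/output split below is our assumption (MiniMax publishes only the 3.52M total on the model card), chosen because agentic runs are heavily input-dominated:

```python
# Rough cost-per-task model for M2.5 on SWE-Bench. The ~9% output-token
# share is an assumption; only the 3.52M total per task is published.
INPUT_PRICE = 0.15   # $ per 1M input tokens (M2.5 standard)
OUTPUT_PRICE = 1.20  # $ per 1M output tokens

def task_cost(total_tokens_m: float, output_share: float) -> float:
    """Estimated dollar cost for one task, given total tokens (millions)
    and the fraction of those tokens that are output."""
    input_m = total_tokens_m * (1 - output_share)
    output_m = total_tokens_m * output_share
    return input_m * INPUT_PRICE + output_m * OUTPUT_PRICE

# With ~9% output tokens, 3.52M total lands near the ~$0.85 table figure.
print(round(task_cost(3.52, 0.09), 2))  # -> 0.86
```

Shifting the assumed output share moves the estimate, which is why the table figure is hedged with a tilde.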

Pricing

MiniMax’s pricing structure is unusually simple for a frontier model. Two variants, straightforward per-token rates, and no enterprise-only tier gating.

| Model | Speed | Input (per 1M tokens) | Output (per 1M tokens) | Hourly (continuous) |
|---|---|---|---|---|
| MiniMax M2.5 | 50 tokens/sec | $0.15 | $1.20 | $0.30/hr |
| MiniMax M2.5-Lightning | 100 tokens/sec | $0.30 | $2.40 | $1.00/hr |

Both variants support caching, which further reduces costs on repeated context. The model is also available as open weights on Hugging Face – meaning self-hosters pay only for their own compute, with no per-token licensing fees.

Competitor Pricing Comparison

| Model | Input (per 1M) | Output (per 1M) | Relative Output Cost vs M2.5 |
|---|---|---|---|
| MiniMax M2.5 | $0.15 | $1.20 | 1x (baseline) |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | 2x |
| DeepSeek V3 | ~$0.27 | ~$1.10 | ~1x |
| Claude Opus 4.6 | ~$15.00 | ~$75.00 | ~60-65x |
| GPT-5.2 | ~$10.00 | ~$30.00 | ~25x |
| Gemini 2.5 Pro | ~$3.50 | ~$14.00 | ~12x |

Competitor prices are approximate based on published API pricing as of March 2026. Verify current rates directly with each provider. Claude Opus pricing via anthropic.com; GPT and Gemini pricing via their respective developer portals.

The DeepSeek V3 comparison is the interesting one. Both are low-cost, open-weight Chinese models. But M2.5 substantially outperforms DeepSeek V3 on SWE-Bench (80.2% vs ~49%) while maintaining comparable pricing. If you’re already using DeepSeek for cost reasons, M2.5 is a direct upgrade.

Key Features

1. Spec-Writing Architecture Tendency

The most operationally interesting behavior that emerged from M2.5’s training is what MiniMax calls a “spec-writing tendency.” Before writing any code, M2.5 automatically decomposes and plans the features, structure, and UI design of a project from the perspective of a senior software architect. This isn’t a prompted behavior – it emerged from reinforcement learning across 200,000+ real-world environments. The implication: M2.5 behaves more like a principal engineer than a code autocomplete tool. The limitation is that this planning phase adds tokens and latency upfront – if you’re doing simple one-shot code generation, you’re paying for planning you don’t need.

2. Full Development Lifecycle Coverage

Most coding models optimize for a slice of the development process. M2.5 was explicitly trained across four distinct lifecycle phases: 0-to-1 system design and environment setup, 1-to-10 system development, 10-to-90 feature iteration, and 90-to-100 code review and testing. This matters for agentic use cases where you want the same model handling architecture decisions and debugging CI failures. The limitation: “trained across all phases” doesn’t mean it’s equally strong at all four – architecture planning for novel domains remains harder than iterating on existing code.

3. Multilingual Coding Across 13+ Languages

M2.5 was trained on Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby – and its multilingual performance is noted as a particular strength over predecessors. For polyglot codebases (think a mobile app with Swift/Kotlin frontends, a Node.js API, and a Python ML pipeline), this is significant. The limitation: performance is not uniform across languages. Python, JavaScript, and TypeScript will be strongest given training data distributions; Lua and Dart are likely tails.

4. BrowseComp-Leading Search and Tool Use

With 76.3% on BrowseComp, M2.5 is one of the best available models for autonomous web research tasks. It uses approximately 20% fewer agentic rounds than M2.1 on the same tasks – meaning it’s not just capable, it’s efficient about how it reaches answers. This makes it well-suited for research agents, competitive intelligence pipelines, and any workflow where an AI needs to independently navigate the web to gather context. The limitation: BrowseComp performance is measured with context management – real-world browsing agents face more variability in page structure and blocking.

5. Native 100 TPS Throughput (Lightning)

M2.5-Lightning runs at a native 100 tokens per second – roughly twice the throughput of most frontier models. For agentic tasks, wall-clock time matters more than per-token cost in many cases (you’re paying developer salaries while the agent runs). M2.5-Lightning’s throughput advantage compounds across multi-step pipelines. The limitation: 100 TPS doesn’t help if your bottleneck is tool call latency (API calls, database queries, etc.) – in highly tool-dependent pipelines, the model throughput advantage narrows.

6. Open Weights Self-Hosting

M2.5’s weights are publicly available on Hugging Face. For organizations with data sovereignty requirements, regulated industries, or simply large enough scale that API costs exceed compute costs, self-hosting eliminates per-token fees entirely. The limitation: this is a frontier-scale model requiring significant multi-GPU infrastructure – self-hosting at quality parity with the cloud API is not a small-team project.

Who Is MiniMax M2.5 For?

Use M2.5 if you:

  • Build agentic coding pipelines where per-task cost is a meaningful budget constraint – M2.5 at 10% the cost of Opus 4.6 enables pipelines that simply weren’t economically viable before
  • Run background AI agents continuously – $0.30/hour at 50 TPS means a 24/7 coding agent costs less than $220/month at full utilization
  • Work across polyglot codebases – Android/iOS/Web/backend in multiple languages, where you need a model that actually handles the whole stack
  • Need research and browsing capabilities alongside coding – M2.5’s BrowseComp leadership means it can gather context from the web and implement, in one loop
  • Want self-hosting flexibility – open weights let you run it on your own infrastructure, eliminating data residency concerns and per-token fees at scale

Look elsewhere if you:

  • Handle sensitive enterprise IP or regulated data and can’t vet MiniMax’s data practices thoroughly – Chinese data sovereignty laws are a real concern; Claude or GPT on US infrastructure may be required
  • Need the richest tool ecosystem and integrations – Claude’s tool use ecosystem and OpenAI’s platform depth are more mature; M2.5 is newer and integration support is still growing
  • Do simple, low-volume code generation – if you’re running a few hundred API calls/month, the pricing delta doesn’t justify switching from your current setup
  • Require guaranteed SLAs and enterprise support – MiniMax’s enterprise track record outside China is limited compared to Anthropic or Google

MiniMax M2.5 vs Competitors: Full Comparison

| Feature | MiniMax M2.5 | Claude Opus 4.6 | GPT-5.2 | Gemini 2.5 Pro | DeepSeek V3 |
|---|---|---|---|---|---|
| SWE-Bench Verified | 80.2% | ~79% | ~72-76%* | ~71-75%* | ~49% |
| Output Pricing (per 1M) | $1.20-$2.40 | ~$75 | ~$30 | ~$14 | ~$1.10 |
| Native Throughput | 50-100 TPS | ~50 TPS | ~40-60 TPS | ~50 TPS | ~40 TPS |
| Open Weights | Yes | No | No | No | Yes |
| Full-Stack Coverage | Web, Android, iOS, Windows | Web-focused | Web-focused | Web-focused | General |
| Company Origin | China | USA | USA | USA | China |
| BrowseComp | 76.3% | ~65%* | ~60-68%* | ~58-65%* | N/A |
| Best For | Cost-effective coding agents | Enterprise trust + quality | OpenAI ecosystem depth | Long context + multimodal | Budget coding tasks |
| Self-Hosting | Open weights | Cloud only | Cloud only | Cloud only | Open weights |
| Multimodal Suite | Text + Music + Video + Speech | Text + Vision | Text + Vision + Audio + Video | Text + Vision + Video | Text + Vision |

*Estimated from available public data. Verify current benchmark scores at each provider’s official documentation. Internal links to our full reviews of Claude and DeepSeek.

Controversy: What They Don’t Advertise

1. Chinese Data Sovereignty Laws

MiniMax is incorporated and operated in China, which means it operates under Chinese data protection and national security laws – including obligations that could require sharing user data with the Chinese government under certain circumstances. This is not a theoretical risk; it’s a structural feature of operating under Chinese jurisdiction. For individual developers experimenting with the API, this is a low-concern issue. For enterprises running proprietary codebases, sensitive business logic, or regulated data through the API, it warrants serious legal review. The open weights self-hosting option mitigates this – if you run the model on your own servers, your data doesn’t touch MiniMax’s infrastructure.

2. Benchmark Inflation and Self-Reporting

The AI industry’s SWE-Bench results are increasingly contested. Different evaluation harnesses (scaffolding, agent frameworks, tool sets) produce meaningfully different scores – which is why M2.5 published scores under both Droid (79.7%) and OpenCode (76.1%) harnesses. The headline 80.2% is the overall SWE-Bench Verified score, which may use MiniMax’s own scaffolding. Cross-harness variation of 3-4 percentage points is normal and expected – but it’s worth knowing that “80.2%” is not a single objective truth, and comparing it to competitor numbers measured under different conditions has limits. MiniMax’s decision to publish under two external harnesses is a sign of good-faith transparency.

3. Ecosystem Immaturity

MiniMax M2.5 launched on March 2, 2026. The third-party integration ecosystem – VS Code extensions, IDE plugins, CI/CD integrations, workflow templates – is thin compared to Claude or GPT, which have years of community tooling behind them. Early adopters will need to build more plumbing themselves. This is a temporary problem, but it’s real friction in the first 6-12 months.

4. The “Open Weights” Caveat

Open weights is not the same as open source. MiniMax has published the model weights on Hugging Face, but the training data, training code, and full methodology are not public. You can run the model but you can’t audit how it was trained, what biases it may have absorbed, or reproduce it. For most users this doesn’t matter. For safety researchers or organizations with strict AI governance requirements, it’s a meaningful distinction.

5. No Public Track Record on Agentic Stability

M2.5 was trained in “hundreds of thousands of complex real-world environments” – but what those environments look like, how they were constructed, and whether they represent production software diversity isn’t detailed. Claude and GPT have extensive deployment data across millions of real-world codebases; M2.5’s agentic reliability in production is still being established. The benchmark numbers are strong. The real-world track record takes time to accumulate.

Pros and Cons

Pros

  • Industry-leading cost efficiency – 10x to 20x cheaper than Claude Opus 4.6 and GPT-5.2 per output token, making continuous agentic operation genuinely affordable for the first time
  • SOTA coding performance – 80.2% SWE-Bench Verified puts it at or above every comparable frontier model; cross-harness validation adds credibility
  • Native 100 TPS throughput – approximately 2x the generation speed of most frontier models; critical for latency-sensitive agentic pipelines
  • Open weights availability – download and self-host for full data control; eliminates per-token fees at scale and resolves data residency concerns
  • Spec-writing architect behavior – the emergent tendency to plan before coding mirrors senior engineer workflow; produces more coherent outputs on complex multi-file tasks
  • Full-stack multilingual coverage – 13+ languages, 200,000+ real-world environments, covering Web/Android/iOS/Windows across the full development lifecycle
  • BrowseComp-leading web research – 76.3% means it can autonomously navigate and synthesize complex web research; useful for agents that need to gather external context before implementing

Cons

  • Chinese data jurisdiction – API usage routes through MiniMax infrastructure subject to Chinese law; enterprises with sensitive data need to evaluate this carefully or use self-hosted weights
  • No established enterprise support track record – MiniMax has 130K+ enterprise clients but primarily in Asia; Western enterprise SLA, compliance, and support maturity is unproven
  • Thin integration ecosystem – limited native IDE plugins, CI/CD integrations, and workflow tools compared to Claude or OpenAI; early adopters build their own plumbing
  • Self-hosting requires significant GPU resources – open weights are available but a frontier-scale model is not a laptop project; this is a six-figure infrastructure conversation for most teams
  • Benchmark performance varies by harness – the 80.2% headline vs 76.1% under OpenCode vs 79.7% under Droid shows meaningful variance; real-world performance on your specific codebase will differ
  • Limited public agentic deployment history – unlike Claude or GPT, M2.5 has no years of production hardening in the Western developer ecosystem; reliability at the edges is still being discovered

Getting Started with MiniMax M2.5

There are two routes: the cloud API or self-hosted open weights. Here’s the practical path for each.

Route 1: Cloud API (fastest)

  1. Sign up at minimax.io – Create a developer account at minimax.io. The API is accessible without an enterprise contract.
  2. Generate an API key – Navigate to the developer console and create your API key. Store it in your environment variables; don’t hard-code it.
  3. Choose your model variant – Use MiniMax-M2.5-Lightning for latency-sensitive pipelines (100 TPS, $2.40/M output). Use MiniMax-M2.5 for background batch workloads (50 TPS, $1.20/M output).
  4. Integrate via OpenAI-compatible endpoint – MiniMax’s API follows the standard chat completions format. Swap your base URL and API key; your existing OpenAI SDK calls should work with minimal changes.
  5. Test with a real coding task – Don’t start with “hello world.” Give it a real bug from your backlog with full file context. The spec-writing behavior only shows up on sufficiently complex tasks.
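Step 4 above can be sketched with nothing but the standard library: an OpenAI-compatible endpoint just means a chat-completions POST with a bearer token. The base URL and model identifiers below are illustrative assumptions — confirm the exact values in MiniMax's developer console:

```python
# Sketch of the OpenAI-compatible request shape. Any OpenAI-style SDK
# produces an equivalent request under the hood; swapping the base URL
# and key is the whole migration. Base URL and model name are assumed.
import json
import os
import urllib.request

BASE_URL = "https://api.minimax.io/v1"          # assumed endpoint; verify in docs
API_KEY = os.environ.get("MINIMAX_API_KEY", "")  # keep the key in your environment

def build_request(model: str, messages: list) -> urllib.request.Request:
    """Build a standard chat-completions POST against the endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request(
    "MiniMax-M2.5-Lightning",  # or "MiniMax-M2.5" for background batch work
    [{"role": "user", "content": "Fix the failing test in auth/session.py"}],
)
print(req.full_url, req.get_method())
# resp = urllib.request.urlopen(req)  # uncomment with a valid API key set
```

If you already use an OpenAI SDK, you don't need this plumbing — point the SDK's `base_url` at the endpoint and keep your existing calls.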

Route 2: Self-Hosted Open Weights

  1. Download weights from Hugging Face – huggingface.co/MiniMaxAI/MiniMax-M2.5. This is a frontier-scale model; budget significant storage and download time.
  2. Provision adequate GPU infrastructure – This is not a quantized small model. You’ll need multiple high-end GPUs (A100s or H100s) for production-quality inference.
  3. Deploy with vLLM or TGI – Use a production-grade inference server like vLLM or Hugging Face Text Generation Inference for serving; these handle batching, caching, and throughput optimization.
  4. Configure your agent framework – Point your coding agent (OpenCode, Droid, custom) at your self-hosted endpoint. Verify you’re getting comparable performance to the published benchmarks on your harness.
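For step 3, a minimal vLLM deployment sketch. The parallelism setting is an illustrative assumption — size it to your hardware and the model's published memory footprint:

```shell
# Install a production-grade inference server (vLLM shown; Hugging Face
# TGI is the alternative named above).
pip install vllm

# Serve the open weights from Hugging Face behind an OpenAI-compatible
# API. vLLM downloads the weights on first run and listens on port 8000.
# --tensor-parallel-size is an assumption: pick the GPU count that fits
# the model across your A100/H100-class cards.
vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 8 \
  --api-key "$LOCAL_API_KEY"

# Your agent framework (step 4) then targets http://localhost:8000/v1
# exactly as it would the cloud endpoint.
```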

Quick cost sanity check: At 50 TPS / $0.30/hr, running M2.5 8 hours a day for a month costs about $72. At 100 TPS / $1.00/hr, that’s $240/month. Both numbers are within reach for individual developers, not just teams.
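Those figures follow directly from the hourly rates; a quick sketch (30-day month assumed):

```python
# Monthly cost = hourly rate x hours per day x days.
# Rates are from the pricing table above.
def monthly_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    return hourly_rate * hours_per_day * days

print(round(monthly_cost(0.30, 8), 2))   # M2.5 standard, 8h/day -> 72.0
print(round(monthly_cost(1.00, 8), 2))   # Lightning, 8h/day     -> 240.0
print(round(monthly_cost(0.30, 24), 2))  # standard, 24/7        -> 216.0
```

The 24/7 figure is where the "less than $220/month for a continuous agent" claim earlier in the review comes from.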

Final Verdict

MiniMax M2.5 is the most significant pricing disruption in frontier AI coding agents since DeepSeek V3 made Western AI companies uncomfortable. The difference is that M2.5 actually competes on quality at the top of the leaderboard – 80.2% SWE-Bench Verified is not a “good for the price” score, it’s a top-of-market score, period. Running it continuously costs $0.30-$1.00/hour. That changes the economic calculus of agentic AI fundamentally.

Use M2.5 if you’re building agentic coding pipelines and cost has been the limiting factor. This is the model that makes continuous background coding agents economically viable for individual developers and small teams. If you’ve been waiting for capable AI agents to be affordable, March 2, 2026 is the date you’ve been waiting for.

If you’re an enterprise with data sovereignty requirements, evaluate carefully. The open weights option is the right answer for regulated industries – self-host and the jurisdiction problem disappears. The cloud API requires trusting MiniMax’s data practices, which Western enterprises should scrutinize in the same way they’ve scrutinized DeepSeek.

For individual developers, startups, and cost-conscious teams: this is an 8.7/10. The capability is real, the pricing is genuinely revolutionary, and the open weights give you an exit ramp from vendor lock-in. The only things holding it back from a higher score are ecosystem immaturity and the need to build a real-world reliability track record. Both are time problems, not fundamental ones.

Rating: 8.7/10

Frequently Asked Questions

What is MiniMax M2.5?

MiniMax M2.5 is a frontier AI coding and agentic model released on March 2, 2026, by MiniMax – a Chinese AI company founded in 2022. It achieves 80.2% on SWE-Bench Verified and is designed for full-stack software development across web, Android, iOS, and Windows platforms. It is available as open weights on Hugging Face and via the MiniMax API.

How much does MiniMax M2.5 cost?

MiniMax M2.5 costs $0.15 per million input tokens and $1.20 per million output tokens at 50 tokens/sec. The Lightning version (100 tokens/sec) costs $0.30/M input and $2.40/M output. Running the model continuously for an hour costs $0.30 at 50 TPS or $1.00 at 100 TPS – making it 10x to 20x cheaper than Claude Opus 4.6 or GPT-5.2 per equivalent output.

How does MiniMax M2.5 compare to Claude Opus 4.6?

MiniMax M2.5 matches or slightly exceeds Claude Opus 4.6 on SWE-Bench Verified (80.2% vs ~79%), matches its task completion speed (22.8 vs 22.9 minutes average), and beats it on BrowseComp (76.3%). However, M2.5 costs roughly 10% of Claude Opus 4.6 per task. The trade-offs are ecosystem maturity and data jurisdiction (Chinese vs. US company).

Is MiniMax M2.5 open source?

MiniMax M2.5 is open weights – the model weights are publicly available on Hugging Face at huggingface.co/MiniMaxAI/MiniMax-M2.5. You can download and self-host the model. However, the training code and full architecture details are not fully open source in the traditional sense – it’s weights-open, not code-open.

What programming languages does MiniMax M2.5 support?

MiniMax M2.5 was trained on 13 programming languages: Python, JavaScript, TypeScript, Java, Go, C, C++, Rust, Kotlin, PHP, Lua, Dart, and Ruby – across more than 200,000 real-world development environments. Its multilingual coding performance is noted as a particular strength over predecessors.

What is MiniMax M2.5-Lightning?

MiniMax M2.5-Lightning is a faster variant of M2.5 with identical capabilities but double the throughput – 100 tokens per second vs 50 TPS for the standard model. It costs twice as much per token ($2.40/M output vs $1.20/M) but completes tasks faster in wall-clock time, making it preferable for latency-sensitive agentic pipelines.

Is MiniMax M2.5 safe to use for enterprise work?

MiniMax is a Chinese company subject to Chinese data laws, which creates data sovereignty concerns for enterprises handling sensitive code or proprietary data via the cloud API. The open weights version can be self-hosted on your own infrastructure to mitigate this entirely. Enterprises in regulated industries (finance, healthcare, defense) should conduct formal due diligence before using the cloud API.

What is the “spec-writing tendency” in MiniMax M2.5?

The spec-writing tendency is an emergent behavior in M2.5 where, before writing any code, the model automatically decomposes and plans the features, structure, and architecture of a project – like a senior software architect would. This wasn’t explicitly programmed; it emerged from reinforcement learning across complex real-world environments. It results in more coherent, architecturally sound outputs on complex multi-file tasks, at the cost of additional planning tokens upfront.

What is BrowseComp and why does M2.5’s 76.3% score matter?

BrowseComp measures an AI agent’s ability to perform complex web research tasks requiring multi-page navigation, information synthesis, and web-content reasoning – tasks that mirror professional research work. M2.5’s 76.3% is industry-leading, meaning it is exceptionally capable as an autonomous research and search agent, not just a code generator. This matters for agentic workflows where the model needs to gather external context before implementing.

How do I get started with MiniMax M2.5?

Visit minimax.io to sign up for API access, or download the open weights from huggingface.co/MiniMaxAI/MiniMax-M2.5 for self-hosting. The cloud API uses an OpenAI-compatible chat completions format – swap your base URL and API key, and existing OpenAI SDK integrations typically work with minimal changes. For self-hosting, deploy with vLLM or Hugging Face TGI on multi-GPU infrastructure (A100/H100 class).


ComputerTech Editorial Team

Our team tests every AI tool hands-on before reviewing it. With 126+ tools evaluated across 8 categories, we focus on real-world performance, honest pricing analysis, and practical recommendations. Learn more about our review process →