MiniMax M3: Open-Weight AI Tops SWE-Bench Pro at 59%

Why This Open-Weight Release Matters

On June 1, 2026, Shanghai-based MiniMax released M3 — the first open-weight model to simultaneously hit frontier coding scores, offer a 1-million-token context window, and handle image and video inputs natively. That combination has not existed in a single open-weight package before, and the pricing makes it hard to ignore: $0.30 per million input tokens and $1.20 per million output tokens at launch.

For teams that need long-context coding agents without locking into a proprietary API, M3 represents a meaningful shift. The weights are scheduled for release on Hugging Face and GitHub within roughly ten days of launch, which would make the model auditable and self-hostable rather than just API-accessible.

Whether the benchmark claims hold up under independent testing is a separate question. But the architecture choices behind M3 are interesting regardless of where the final numbers land.

How MiniMax Sparse Attention Works

Previous MiniMax models used Lightning Attention, a hybrid approach that mixed linear attention with standard softmax attention to manage the quadratic cost of long context. M3 scraps that in favour of a new design called MiniMax Sparse Attention (MSA).

MSA partitions the key-value cache into blocks and selects only the most relevant blocks for each query, rather than attending over every token in a million-token sequence. It operates on a standard Grouped Query Attention (GQA) backbone with uncompressed key-values — a contrast to DeepSeek’s Multi-head Latent Attention (MLA), which compresses keys and values into a lower-dimensional latent space to reduce memory.
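
The block-selection idea can be illustrated with a minimal sketch. It is not MiniMax's implementation: the scoring rule, block size, and kernel details of MSA are not given here, so the function name `block_sparse_attention`, the use of a mean-pooled key as the block score, and the single-query, single-head setup are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend over only the top_k highest-scoring KV blocks for one query.

    Hypothetical sketch: blocks are scored by the dot product between the
    query and the block's mean key (an assumed scoring rule).
    """
    n, dim = len(keys), len(query)
    # Partition token positions into contiguous blocks.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    # Keep the top_k blocks, then restore positional order.
    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])
    token_ids = [i for b in selected for i in blocks[b]]

    # Ordinary scaled softmax attention, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(dim)
    logits = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out_dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
        for d in range(out_dim)
    ]
    return output, token_ids


# Toy data: 8 tokens in 4 blocks of 2; only block 2 aligns with the query.
keys = [[0, 1], [0, 1], [-1, 0], [-1, 0], [5, 0], [5, 0], [0, -1], [0, -1]]
values = [[float(i)] for i in range(8)]
output, attended = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1)
print(attended)  # [4, 5]
```

With `top_k` equal to the number of blocks, the function reduces to full attention over every token; the saving comes from the softmax touching only `top_k * block_size` keys instead of the whole sequence.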

The practical result: at 1M-token context, M3 decodes more than 15× faster and prefills more than 9× faster than the standard attention baseline. Per-token compute drops to roughly 1/20th of what MiniMax M2 required at the same context length. MiniMax is promising at least 512K tokens of guaranteed context with the API, with the full 1M window available depending on load.

Benchmark Numbers and What They Mean

MiniMax M3 scores 59.0% on SWE-Bench Pro, the software engineering benchmark that tests agents on real GitHub issues. That is the highest published score for an open-weight model as of this writing, edging out Kimi K2.6 (58.6%) and ahead of GPT-5.5 and Gemini 3.1 Pro on the same benchmark. It trails Claude Opus 4.8’s reported 69.2%.

Other published scores: 66.0% on Terminal-Bench 2.1, 34.8% on SWE-fficiency, and 83.5 on BrowseComp. The BrowseComp result in particular suggests strong agentic web-navigation ability, which is relevant for research and data-gathering workflows.

One caveat worth noting: several of these results were run on MiniMax’s own infrastructure with MiniMax’s own agent scaffolding. Independent third-party replication — on Inspect, EleutherAI’s harness, or other neutral frameworks — has not yet been published at time of writing. The company has a track record of accurate self-reporting with MiniMax-01 and M2, but that track record is short. Treat the numbers as plausible, not settled.

Where M3 Sits Against Open-Weight Competitors

The most direct comparison is Kimi K2.6, covered in the earlier vortx.ch article on Kimi K2.6 vs MiniMax M2.7. M3 beats K2.6 by 0.4 points on SWE-Bench Pro and costs roughly 3.3× less per output token ($1.20 vs $4.00 per million). In K2.6’s favour are its Agent Swarm feature (up to 300 parallel sub-agents with 4,000-step horizons) and a leading HLE (Humanity’s Last Exam) score of 54%, which M3 has not publicly matched.

DeepSeek V4 Pro (1.6T parameters, 32T training tokens) sits in the same competitive tier. DeepSeek’s advantage is a more established ecosystem of third-party benchmarks and a longer track record of weight releases. M3 is newer and its weights are not yet live. The practical choice between them will depend on which set of independent benchmarks lands better once the weights are public.

Pricing and Availability

The M3 API is live now via platform.minimax.io and available on OpenRouter. Launch pricing is $0.30 per million input tokens and $1.20 per million output tokens. MiniMax describes this as a 50% promotional discount, with standard rates at $0.60/$2.40. At $1.20/MTok output, M3 is roughly 3× cheaper than Claude Haiku 4.5 and roughly 3.3× cheaper than Kimi K2.6 ($4.00/MTok) for the same output volume.
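
To see what these rates mean for a long-context request, here is a small cost sketch. The prices are the ones quoted in this article; the function name `request_cost` and the example workload (800K input tokens, 20K output tokens) are hypothetical and chosen only for illustration.

```python
# USD per million tokens as (input, output), taken from the figures in this article.
PRICES_PER_MTOK = {
    "launch": (0.30, 1.20),    # promotional launch pricing
    "standard": (0.60, 2.40),  # stated standard rates
}


def request_cost(input_tokens, output_tokens, tier="launch"):
    """Return the USD cost of a single request at the given pricing tier."""
    input_price, output_price = PRICES_PER_MTOK[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: a large codebase as input, a short patch as output.
launch_cost = request_cost(800_000, 20_000)
standard_cost = request_cost(800_000, 20_000, tier="standard")
print(f"launch: ${launch_cost:.3f}, standard: ${standard_cost:.3f}")

# Output-price ratio against Kimi K2.6 at $4.00 per million output tokens.
print(f"K2.6 / M3 output price: {4.00 / 1.20:.2f}x")
```

For long-context workloads the input side dominates the bill, so the end of the promotional discount roughly doubles the cost per request.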

Open weights are expected within ten days of the June 1 launch, meaning they should be publicly available by mid-June 2026. MiniMax has published weights for every previous model — M1 and M2 both landed on Hugging Face as promised — so the commitment has precedent.

Inputs supported: text, images, and video. Output: text only. The 1M-token context window accepts mixed modalities across the full context length, which is relevant for long video analysis or large codebases with image-heavy documentation.

What This Means for the Open-Weight Market

The past 90 days have seen a serious compression of the gap between open-weight and frontier proprietary models on coding benchmarks. MiniMax M3 at 59.0%, Kimi K2.6 at 58.6%, and DeepSeek V4 Pro in the same tier are all open-weight, with output priced at $1.20/MTok for M3 and $4.00/MTok for K2.6. Together they make a compelling case that the coding benchmark race is no longer a proprietary-model story. The full comparison of AI coding assistants in 2026 shows how quickly the gap narrowed over the past year.

The sticking points remain evaluation integrity and ecosystem depth. Proprietary models like Claude Opus 4.8 (69.2% SWE-Bench Pro) still lead the coding benchmark tables, and they come with managed infrastructure, enterprise SLAs, and audited safety evaluations that self-hosted models require you to build yourself. For teams with the capacity to self-host, M3 shifts the calculus. For teams without it, the API pricing is competitive enough to use without needing to run weights locally.

The most useful question to ask over the next 30 days: how does M3 perform on your actual codebase? SWE-Bench Pro tests agents on real GitHub issues, but “real” in a benchmark still means curated repositories with well-defined issues. If your work lives in a monorepo with CI dependencies, whether in TypeScript, Go, Java, or anything else, test the model on it before committing.
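
One way to structure such a test is a small per-language pass-rate tally over tasks drawn from your own repository. This is only a sketch under stated assumptions: `pass_rate_by_language`, the task fields (`language`, `prompt`, `check`), and the `solve` callable are hypothetical names, and `solve` stands in for whatever code actually calls the model, which is not shown here.

```python
from collections import defaultdict


def pass_rate_by_language(tasks, solve):
    """Return {language: fraction of tasks whose check accepted the answer}.

    tasks: iterable of dicts with 'language', 'prompt', and 'check',
           where 'check' is a callable that validates the model's answer
           (for example by applying a patch and running your CI tests).
    solve: callable mapping a prompt to the model's answer (hypothetical).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for task in tasks:
        total[task["language"]] += 1
        answer = solve(task["prompt"])
        if task["check"](answer):
            passed[task["language"]] += 1
    return {language: passed[language] / total[language] for language in total}


# Stubbed usage: str.upper plays the role of the model.
tasks = [
    {"language": "go", "prompt": "a", "check": lambda answer: answer == "A"},
    {"language": "go", "prompt": "b", "check": lambda answer: answer == "x"},
    {"language": "java", "prompt": "c", "check": lambda answer: answer == "C"},
]
print(pass_rate_by_language(tasks, str.upper))
```

Splitting the result by language makes it easier to spot whether the model's benchmark performance carries over to the parts of the stack you actually ship.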
