The Open-Weight Frontier Is Now in Asia
For most of 2025, the frontier AI conversation was a two-party race: OpenAI and Anthropic trading benchmark leads, with Google occasionally crashing the party. That framing is now structurally wrong. In an eight-day window in April 2026, two Chinese labs, MiniMax and Moonshot AI, each released open-weight models that match or exceed GPT-5.4 on the coding benchmarks that actually predict real-world utility, at prices roughly 3x to 8x lower.
MiniMax M2.7 dropped first, on April 12, 2026. Moonshot AI’s Kimi K2.6 followed on April 20. Both are Mixture-of-Experts (MoE) architectures. Both are open-weight. Both target agentic coding workflows. And both are priced at a level that makes the cost case for proprietary Western models genuinely difficult to sustain.
This comparison cuts through the benchmark theater to answer the question that actually matters: which model should you reach for, and when?
Meet the Contenders
Kimi K2.6 — Moonshot AI’s Agentic Powerhouse
Kimi K2.6 is a 1 trillion parameter MoE model with 32 billion parameters active per forward pass, released by Beijing-based Moonshot AI on April 20, 2026. It ships with a 256K token context window and was built explicitly for long-horizon agentic tasks — the kind where a single agent session might run for 13 hours and execute more than 4,000 tool calls across 300 coordinated sub-agents.
The model’s headline achievement: it became the first open-weight model to beat GPT-5.4 (xhigh) on SWE-Bench Pro, scoring 58.6% against GPT-5.4’s 57.7%. It also leads every model on Humanity’s Last Exam (HLE-Full) with tools, scoring 54.0% against Claude Opus 4.6’s 53.0% and GPT-5.4’s 52.1%. On DeepSearchQA, K2.6 scores 92.5 F1, versus 78.6 for GPT-5.4 — a gap that is hard to attribute to noise.
On pricing, Kimi K2.6 lists at $0.74 per million input tokens and $3.49 per million output tokens. With cache hits, the effective input price drops to $0.16/M — a figure that matters at scale.
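How much the cache discount matters depends entirely on your hit rate. A quick sketch of the blended input price, with the 60% hit rate as an illustrative assumption rather than a published figure:

```python
# Blended Kimi K2.6 input price as a weighted average of cached and
# uncached rates. The 60% cache hit rate is an illustrative assumption;
# real hit rates depend on how repetitive your prompts are.
CACHED = 0.16    # $/M input tokens on a cache hit
UNCACHED = 0.74  # $/M input tokens, list price
hit_rate = 0.60  # assumed

blended = hit_rate * CACHED + (1 - hit_rate) * UNCACHED
print(f"Blended input price: ${blended:.2f}/M tokens")  # ~$0.39/M at 60% hits
```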
MiniMax M2.7 — The Self-Evolving Efficiency Play
MiniMax M2.7, released April 12, 2026, is the more architecturally audacious of the two. Its 230 billion total parameters sound smaller than Kimi K2.6’s 1T, but only 10 billion parameters are active during any single inference pass. That is not a compromise — it is intentional, and it is why M2.7 can run at $0.30 per million input tokens and $1.20 per million output tokens, making it the cheapest frontier-class model currently available.
What makes M2.7 unusual is how it was trained. The model ran more than 100 autonomous rounds of scaffold optimization during training — analyzing its own failure trajectories, modifying scaffold code, running evaluations, and deciding whether to keep or revert changes without human direction. MiniMax reports a 30% performance improvement on internal evaluations from this process. The model can now handle 30% to 50% of a reinforcement learning research workflow autonomously, flagging only critical decisions for human review.
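MiniMax has not released the optimization code itself, so the following is only a schematic sketch of the propose, evaluate, keep-or-revert loop the company describes; every helper function here is a hypothetical placeholder.

```python
# Schematic of a training-time scaffold optimization loop in the spirit of
# what MiniMax describes for M2.7. All helpers (propose_scaffold_change,
# run_eval_suite, needs_human_review, escalate) are hypothetical placeholders.
import copy

def optimize_scaffold(scaffold, rounds=100):
    best_score = run_eval_suite(scaffold)        # baseline on held-out agent tasks
    for _ in range(rounds):
        candidate = copy.deepcopy(scaffold)
        propose_scaffold_change(candidate)       # e.g. tune sampling params, add loop detection
        if needs_human_review(candidate):        # only critical decisions go to a human
            escalate(candidate)
            continue
        score = run_eval_suite(candidate)
        if score > best_score:                   # keep the change only if evals improve...
            scaffold, best_score = candidate, score
        # ...otherwise the candidate is simply discarded (reverted)
    return scaffold, best_score
```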
On SWE-Bench Pro, M2.7 scores 56.22% — behind Kimi K2.6’s 58.6%, but within 2.4 points. On Terminal Bench 2 (sustained multi-step coding over hours), it scores 57.0%. On MLE-Bench Lite, which evaluates performance across 22 machine learning competitions, M2.7 achieves a 66.6% medal rate — second only to Claude Opus 4.6 and GPT-5.4.
Benchmark Head-to-Head
The table below compares both models on the benchmarks most predictive of real-world coding and agentic performance, alongside GPT-5.4 for reference.
| Benchmark | Kimi K2.6 | MiniMax M2.7 | GPT-5.4 (ref) |
|---|---|---|---|
| SWE-Bench Pro | 58.6% | 56.22% | 57.7% |
| SWE-Bench Verified | 80.2% | — | — |
| LiveCodeBench v6 | 89.6% | — | — |
| Terminal Bench 2 | 66.7% | 57.0% | — |
| HLE-Full (with tools) | 54.0% | — | 52.1% |
| MLE-Bench Lite (medal rate) | — | 66.6% | — |
| VIBE-Pro | — | 55.6% | — |
| DeepSearchQA (F1) | 92.5 | — | 78.6 |
| Input price ($/M tokens) | $0.74 | $0.30 | ~$2.50 |
| Output price ($/M tokens) | $3.49 | $1.20 | ~$10.00 |
| Active parameters | 32B | 10B | — |
| Context window | 256K | — | — |
A few caveats before reading too much into this table. Benchmark coverage is uneven — MiniMax M2.7 was not evaluated on all Kimi K2.6 benchmarks at time of writing, and vice versa. SWE-Bench Pro is currently the best single predictor of real-world coding agent performance, which is why it anchors this comparison. On that metric, Kimi K2.6 leads by 2.4 percentage points.
Where Kimi K2.6 Has the Edge
Kimi K2.6’s clearest advantage is in long-horizon agentic tasks — scenarios where an agent must coordinate many tools, sub-agents, and multi-step plans over an extended session. The model sustained over 4,000 tool calls in a single 13-hour session during testing, with agent swarms scaling up to 300 coordinated sub-agents. No other open-weight model has published numbers in this range.
The DeepSearchQA gap is also striking. A 92.5 versus 78.6 F1 score against GPT-5.4 is not a marginal improvement — it suggests Kimi K2.6 has a structurally different approach to retrieval-augmented reasoning. If your workload involves agents that must search, synthesize, and act on retrieved information in one pass, this matters more than SWE-Bench Pro.
On Next.js-specific benchmarks, Moonshot reports more than a 50% improvement over K2.5, which matters for teams using these models in frontend code generation pipelines. Kimi K2.6 also leads on LiveCodeBench v6 (89.6% vs. Claude Opus 4.6’s 88.8%), a competitive programming benchmark that tests algorithmic reasoning more than repo-level understanding.
The tradeoff: Kimi K2.6 costs roughly 2.5x as much as MiniMax M2.7 on input tokens and 2.9x as much on output. At scale, that gap compounds quickly.
Where MiniMax M2.7 Has the Edge
MiniMax M2.7’s advantage is efficiency — in both cost and architecture. With only 10B active parameters, it delivers 96% of Kimi K2.6’s SWE-Bench Pro performance at 40% of the input token price. For teams running thousands of daily agent sessions, that arithmetic is not academic.
The self-evolution story is also genuinely novel, not just marketing. M2.7 underwent 100+ autonomous rounds of scaffold optimization during training, adjusting sampling parameters (temperature, frequency penalty, presence penalty), rewriting workflow guidelines, and adding loop detection to its own agent scaffolding. The 30% internal performance gain from this process is the first credible published result of a model meaningfully improving its own agentic scaffold at training time.
M2.7’s MLE-Bench Lite medal rate (66.6%) places it second only to the top proprietary models across 22 machine learning competition tasks — a benchmark that is notoriously resistant to overfitting because it requires end-to-end problem solving under competition rules. If your use case involves ML engineering automation, research pipeline automation, or self-directed experimentation, M2.7’s architecture appears to generalize better than its SWE-Pro score alone suggests.
M2.7's Elo score on GDPval-AA (1495) is the highest among open-source models at time of writing, reinforcing its lead in open-source head-to-head evaluation settings.
Architecture: Why MoE Changes the Cost Math
Both models use Mixture-of-Experts architecture, but their parameter-to-activation ratios differ significantly. Kimi K2.6 activates 32B of its 1T parameters per token. MiniMax M2.7 activates 10B of 230B. This is the core reason M2.7 is cheaper to run — fewer active parameters per forward pass means lower compute per token.
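A back-of-the-envelope way to see the cost difference: decode FLOPs per generated token scale roughly with 2x the active parameter count (a common rule of thumb that ignores long-context attention and routing overhead), so the compute gap tracks active parameters, not totals.

```python
# Rough FLOPs-per-token estimate using the common ~2 * active_params
# approximation for decoder inference. Ignores long-context attention cost,
# MoE routing overhead, and memory bandwidth, so read the ratio, not the
# absolute numbers.
ACTIVE_PARAMS = {
    "Kimi K2.6": 32e9,     # 32B active of 1T total
    "MiniMax M2.7": 10e9,  # 10B active of 230B total
}

flops_per_token = {name: 2 * n for name, n in ACTIVE_PARAMS.items()}
for name, f in flops_per_token.items():
    print(f"{name}: ~{f / 1e9:.0f} GFLOPs per generated token")

ratio = flops_per_token["Kimi K2.6"] / flops_per_token["MiniMax M2.7"]
print(f"Compute ratio: ~{ratio:.1f}x")  # ~3.2x, in the same range as the price gap
```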
This architecture trend is not unique to these two models. Qwen 3.5, GLM-5, and DeepSeek V4 all use MoE. The Chinese AI ecosystem has converged on sparse activation as the dominant efficiency strategy, and it is working: the price gap between Chinese and Western frontier models, which ran as high as 15x to 30x in early 2025, has not closed, in part because Western labs have not adopted MoE as aggressively in their API-served models.
One hardware note: GLM-5 (Zhipu AI) runs entirely on Huawei Ascend 910B chips, with no NVIDIA dependency. Kimi K2.6 and MiniMax M2.7 do not publish their training hardware specifics, but the broader Chinese lab ecosystem is actively diversifying away from NVIDIA — a structural shift with long-term supply chain implications that go beyond benchmarks.
The Pricing Reality Check
At $0.30/M input and $1.20/M output, MiniMax M2.7 is the cheapest frontier-class model currently available. Kimi K2.6, at $0.74/$3.49, is roughly half the price of Claude Opus 4.6 and a fraction of GPT-5.4’s typical rate.
To put this in concrete terms: running 100 million input tokens per day, a reasonable production volume for a mid-size engineering team's coding agents, costs about $2,220/month at Kimi K2.6 list price and $900/month at MiniMax M2.7. With heavy cache hits on Kimi K2.6 (dropping input to $0.16/M), that becomes $480 vs $900, at which point Kimi K2.6 is actually cheaper on the input side.
The output token cost is where the gap persists: $3.49 vs $1.20. For agentic workloads where agents generate long responses, reason in extended chains, and produce multi-file code outputs, output tokens dominate the bill. MiniMax M2.7’s output price advantage is likely to be the deciding factor for cost-sensitive deployments.
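To make the bill concrete, here is a small calculator that reproduces the input figures above. The 100M-input-tokens-per-day volume matches that example; the 30M-output-tokens-per-day figure and the all-cache-hit scenario are assumptions for illustration.

```python
# Monthly API cost at illustrative volumes. The output volume and the
# all-cache-hit scenario are assumptions, not published usage figures.
DAYS = 30
INPUT_PER_DAY = 100e6   # tokens/day, matches the example above
OUTPUT_PER_DAY = 30e6   # tokens/day, assumed: agent workloads are output-heavy

PRICES = {  # $ per million tokens
    "Kimi K2.6":    {"input": 0.74, "cached_input": 0.16, "output": 3.49},
    "MiniMax M2.7": {"input": 0.30, "cached_input": 0.30, "output": 1.20},  # no cache discount assumed
}

def monthly_cost(p, cache_hit_rate=0.0):
    input_price = cache_hit_rate * p["cached_input"] + (1 - cache_hit_rate) * p["input"]
    input_cost = INPUT_PER_DAY / 1e6 * input_price * DAYS
    output_cost = OUTPUT_PER_DAY / 1e6 * p["output"] * DAYS
    return input_cost, output_cost

for name, p in PRICES.items():
    i, o = monthly_cost(p)
    print(f"{name}: ${i:,.0f} input + ${o:,.0f} output = ${i + o:,.0f}/month")

i, o = monthly_cost(PRICES["Kimi K2.6"], cache_hit_rate=1.0)
print(f"Kimi K2.6, all cache hits: ${i:,.0f} input + ${o:,.0f} output = ${i + o:,.0f}/month")
```

At these (assumed) volumes, even a perfect cache hit rate leaves Kimi K2.6's total above MiniMax M2.7's, because the output side dominates the bill.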
For context on where the broader LLM pricing landscape is heading, the best AI coding assistants guide on vortx.ch covers the cost-per-value analysis across the full stack.
Who Should Use What
This is not a close call if you know your use case.
Choose Kimi K2.6 if:
- You are building long-horizon agentic systems where a single task runs for hours, coordinates multiple sub-agents, and requires sustained tool-use chains. K2.6’s 4,000-step, 13-hour session performance is the current open-weight record, and that margin matters for production reliability.
- Your pipeline is retrieval-heavy. The DeepSearchQA gap (92.5 vs 78.6 against GPT-5.4) is large enough that it will translate into measurable accuracy improvements in RAG and search-driven agentic workflows.
- You need the absolute best open-weight coding benchmark performance today. At 58.6% on SWE-Bench Pro, Kimi K2.6 is currently the top open-weight model on this metric.
- You can use prompt caching heavily, which brings the effective input cost down to $0.16/M — making the total cost difference with M2.7 much smaller on the input side.
Choose MiniMax M2.7 if:
- Cost per token is the primary constraint. At $0.30/$1.20, M2.7 delivers near-frontier performance at the lowest price currently available. For high-volume production deployments, the $2.29/M gap in output price alone comes to roughly $690/month per 10 million daily output tokens.
- You are building ML research automation pipelines. M2.7’s 66.6% MLE-Bench Lite medal rate and demonstrated ability to handle 30-50% of RL workflow autonomously suggest it was specifically tuned for this use case.
- You want a model whose agentic behavior was shaped by training-time self-improvement, not just RLHF on agentic traces. The 100+ autonomous optimization rounds make M2.7’s tool-use behavior more compositionally stable in ways that are hard to quantify but visible in extended sessions.
- Your infrastructure runs on smaller GPU configurations. With only 10B active parameters, M2.7 can be served at a fraction of the compute cost of K2.6’s 32B active model.
If you are unsure: Run both on your specific task with 50–100 representative examples before committing. Benchmark rankings tell you where to start, not where to land. The 2.4 percentage point gap on SWE-Bench Pro narrows or disappears depending on the exact nature of your codebase and task type.
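A minimal harness for that bake-off might look like the sketch below, assuming both providers expose OpenAI-compatible chat completions endpoints; the base URLs, model identifiers, and environment variable names are placeholders, not documented values.

```python
# Minimal A/B harness for comparing both models on your own task set.
# Assumes OpenAI-compatible endpoints; URLs, model names, and env vars
# below are placeholders to adapt to each provider's actual docs.
import os
from openai import OpenAI

CLIENTS = {
    "kimi-k2.6": OpenAI(base_url="https://api.moonshot.example/v1",
                        api_key=os.environ["MOONSHOT_API_KEY"]),
    "minimax-m2.7": OpenAI(base_url="https://api.minimax.example/v1",
                           api_key=os.environ["MINIMAX_API_KEY"]),
}

def run_suite(tasks, grade):
    """tasks: list of prompts; grade(prompt, completion) -> bool, your own pass/fail check."""
    results = {}
    for model, client in CLIENTS.items():
        passed = 0
        for prompt in tasks:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            passed += grade(prompt, resp.choices[0].message.content)
        results[model] = passed / len(tasks)
    return results
```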
What This Means for the Broader AI Landscape
The more significant story here is not Kimi K2.6 versus MiniMax M2.7. It is that two Chinese open-weight models released within 8 days of each other now bracket GPT-5.4 on coding performance — and both are cheaper than GPT-5.4 by 70-90%.
The LLM leaderboard dynamics are changing in ways that matter structurally. We are past the point where frontier capability is a reliable moat for any single lab. The open-weight models that Chinese labs are releasing are not 80% of frontier — they are within rounding error of it, at a fraction of the cost, and available for self-hosting.
For teams building on proprietary APIs, the pressure to justify the cost premium will only increase. For teams that have already moved to open-weight hosting, the decision is now between efficiency configurations, not capability tiers. That is a fundamentally different market than existed 18 months ago.
Further Reading
- Kimi K2.6 Tech Blog (Moonshot AI) — official technical writeup covering the long-horizon agentic benchmark methodology and swarm scaling architecture.
- MiniMax M2.7: Early Echoes of Self-Evolution (MiniMax) — the primary source on M2.7’s autonomous scaffold optimization process and what “self-evolving” actually means in practice.
- Kimi K2.6 vs GLM-5.1 vs Qwen 3.6 Plus vs MiniMax M2.7 (Atlas Cloud) — the most comprehensive four-way benchmark comparison currently available, with detailed methodology notes.

