The Benchmark Landscape Has Split in Two
At the top of every leaderboard in May 2026 sits a model you cannot use. Claude Mythos Preview, released by Anthropic on April 7, leads SWE-bench Verified at 93.9%, GPQA Diamond at 94.6%, and USAMO at 97.6%. It also happens to discover zero-day vulnerabilities faster than any existing security team, which is exactly why Anthropic has said publicly it will not make Mythos available to the general public. Access is limited to roughly 40 critical-infrastructure organizations under Project Glasswing.
That leaves everyone else competing for the real frontier — the benchmark tier where you can actually deploy something. And that race is genuinely close, genuinely surprising, and worth understanding before your team makes another tooling decision.
The Accessible Frontier: GPT-5.5
OpenAI shipped GPT-5.5 on April 23, 2026, and it landed as the strongest generally available model for agentic, computer-use-style tasks. On Terminal-Bench 2.0, a benchmark that measures how well a model operates a real shell environment, it scores 82.7%, leading Claude Opus 4.7 (69.4%) by more than 13 points. On OSWorld-Verified, which tests operation of real graphical computer environments, GPT-5.5 edges out Claude Opus 4.7 at 78.7% vs 78.0%.
For math-heavy workflows, GPT-5.5 leads FrontierMath Tiers 1–3 at 51.7%, compared to Claude Opus 4.7 at 43.8%. The model offers a 1M-token context window and is available to ChatGPT Plus, Pro, Business, and Enterprise subscribers, with API access open since April 24.
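To make the access path concrete, here is a minimal sketch of calling the model through the OpenAI Python SDK. The model identifier `gpt-5.5` and the `xhigh` reasoning-effort value (the configuration listed in the comparison table below) are assumptions inferred from this article, not confirmed API constants.

```python
# Minimal sketch: calling GPT-5.5 through the OpenAI Python SDK.
# Assumptions: the model id "gpt-5.5" and the "xhigh" reasoning effort are
# inferred from this article's table, not confirmed API values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.5",           # assumed identifier for the April 24 API release
    reasoning_effort="xhigh",  # assumed tier matching the benchmarked configuration
    messages=[
        {"role": "user", "content": "Summarize the failing tests in this CI log: ..."},
    ],
)

print(response.choices[0].message.content)
```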
Where GPT-5.5 does not lead is pure coding. SWE-bench Verified gives GPT-5.5 82.6%, against Claude Opus 4.7 at 82.0% — a tie within measurement noise. And on SWE-bench Pro, a harder dataset, both fall behind an open-source challenger from China.
GLM-5.1: The Open-Source Spoiler
Z.ai (formerly Zhipu AI, a Tsinghua University spinoff that went public on the Hong Kong exchange in January 2026) released GLM-5.1 the same day as Claude Mythos Preview: April 7. The timing was not a coincidence — the lab has been running a sustained effort to track frontier models at open-source cost.
GLM-5.1 is a post-training upgrade to the 744B-parameter Mixture-of-Experts GLM-5, with 40B active parameters per token, a 200K-token context window, and a 131K-token maximum output. The upgrade did not involve more pre-training; it targeted coding and agentic reasoning through refined reinforcement learning. The result: 58.4% on SWE-bench Pro, nudging past GPT-5.4 at 57.7% and Claude Opus 4.6 at 57.3%.
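The 744B-total / 40B-active split is the defining property of a Mixture-of-Experts model: each token is routed to a small top-k subset of experts, so most of the parameters sit idle on any given forward pass. The toy sketch below illustrates that routing with made-up dimensions; it is not GLM-5.1's actual architecture.

```python
# Toy illustration of MoE routing: only the top-k experts touched by a token
# contribute compute, which is why a 744B model can activate ~40B params per token.
# Dimensions and expert counts here are made up for readability.
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k = 16, 2
d_model, d_ff = 64, 256

router = rng.normal(size=(d_model, n_experts))
experts_in = rng.normal(size=(n_experts, d_model, d_ff)) * 0.02
experts_out = rng.normal(size=(n_experts, d_ff, d_model)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts and mix the outputs."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]   # experts selected for this token
    weights = np.exp(logits[chosen])
    weights /= weights.sum()
    out = np.zeros_like(x)
    for w, e in zip(weights, chosen):
        out += w * (np.maximum(x @ experts_in[e], 0) @ experts_out[e])
    return out

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (64,) -- only 2 of the 16 experts did any work
```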
More telling than the benchmark number is what the team demonstrated: GLM-5.1, tasked with building a complete Linux desktop environment from scratch, ran 655 autonomous iterations over eight hours and increased vector database query throughput to 6.9 times the production baseline. It does not just generate code in a single pass; it runs a plan-execute-test-fix-optimize loop without requiring human checkpoints.
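The loop itself is conceptually simple. The sketch below shows the general shape of a plan-execute-test-fix-optimize cycle; every function in it is a stand-in for your own harness and model client, since GLM-5.1's actual scaffolding has not been published at this level of detail.

```python
# Hedged sketch of an autonomous plan-execute-test-fix-optimize loop.
# All functions are stand-ins for a real harness and model client; this is
# the general pattern described above, not GLM-5.1's published scaffolding.
import random

def plan(goal, history):           # ask the model for the next change
    return f"step {len(history) + 1} toward: {goal}"

def execute(step):                 # ask the model to produce a patch for that step
    return f"patch for {step}"

def run_tests(patch):              # run the project's test suite against the patch
    return random.random() > 0.3   # stand-in for a real pass/fail signal

def optimize(patch):               # profile the patched system, return a score
    return random.uniform(1.0, 7.0)

def autonomous_loop(goal, max_iterations=655):
    history, best = [], 0.0
    for _ in range(max_iterations):
        step = plan(goal, history)
        patch = execute(step)
        if not run_tests(patch):
            history.append(("fix", step))   # feed the failure back in; no human checkpoint
            continue
        best = max(best, optimize(patch))
        history.append(("ok", step))
    return best

print(autonomous_loop("build a desktop environment", max_iterations=10))
```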
The model is available on Hugging Face under an MIT license. You can download, modify, fine-tune, and deploy it commercially without royalty fees (with one carve-out: products with 100M+ monthly active users or $20M+ monthly revenue must surface the “GLM-5.1” label in the UI).
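For teams that want to try the weights, a minimal loading sketch with Hugging Face `transformers` looks like the following. The repository id `zai-org/GLM-5.1` is an assumption based on the lab's naming, and a 744B MoE checkpoint will realistically need multi-GPU inference or a dedicated serving stack rather than a single-machine `generate` call.

```python
# Sketch: pulling the open weights with Hugging Face transformers.
# The repo id "zai-org/GLM-5.1" is an assumption; confirm it on Hugging Face.
# A 744B MoE checkpoint needs far more than one GPU in practice.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "zai-org/GLM-5.1"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

prompt = "Refactor this function to remove the N+1 query:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```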
The Full Comparison Table
Here is where the key models sit as of early May 2026 on the benchmarks most relevant to production engineering work:
| Model | SWE-bench Verified | SWE-bench Pro | GPQA Diamond | Terminal-Bench 2.0 | Access | Cost / 1M tok (input) |
|---|---|---|---|---|---|---|
| Claude Mythos Preview | 93.9% | — | 94.6% | 82.0% | Restricted | N/A |
| GPT-5.5 (xhigh) | 82.6% | — | — | 82.7% | ChatGPT / API | ~$30 |
| Claude Opus 4.7 | 82.0% | — | — | 69.4% | API / Claude.ai | ~$30 |
| Kimi K2.6 | 80.2% | 58.6% | 90.5% | — | API / Hugging Face (MIT-mod) | $0.95 |
| GLM-5.1 | 77.8% | 58.4% | — | — | Hugging Face (MIT) | Open weights |
| Claude Opus 4.6 | 80.8% | 57.3% | — | 65.4% | API / Claude.ai | ~$15 |
| MiniMax M2.5 | 80.2% | — | — | — | API | ~$2 |
What the Numbers Mean for Practitioners
Three conclusions emerge from this data.
The open-source gap is functionally closed for code. Six months ago, running your own model meant accepting a 15–20 point SWE-bench deficit. Today, GLM-5.1 (77.8%), Kimi K2.6 (80.2%), and MiniMax M2.5 (80.2%) are within 2–5 points of the best available proprietary options on SWE-bench Verified. If your use case is code generation, code review, or repo-level refactoring, the cost math changes dramatically: Kimi K2.6 via API runs at $0.95 per million input tokens, roughly 30x cheaper than GPT-5.5 at near-frontier performance, and GLM-5.1's open weights can be self-hosted outright if you have the infrastructure to run a 744B MoE model.
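To put that multiplier in concrete terms, here is the input-token arithmetic using the prices from the comparison table above; the monthly volume is an illustrative assumption, and output-token pricing, caching, and self-hosting costs are left out.

```python
# Back-of-the-envelope input-token cost comparison using the table's prices.
# The 2B-token monthly volume is an illustrative assumption, not a measured workload.
PRICE_PER_M_INPUT = {
    "GPT-5.5": 30.00,
    "Claude Opus 4.7": 30.00,
    "MiniMax M2.5": 2.00,
    "Kimi K2.6": 0.95,
}

monthly_input_tokens = 2_000_000_000  # assumed agent traffic: ~2B input tokens/month

for model, price in PRICE_PER_M_INPUT.items():
    cost = monthly_input_tokens / 1_000_000 * price
    print(f"{model:16} ${cost:>10,.0f} / month")

# Kimi K2.6 at $0.95 vs GPT-5.5 at ~$30 is the ~30x factor cited above.
```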
GPT-5.5 is the strongest choice for agentic computer-use tasks. If your workflow involves operating real software environments (browsers, terminals, spreadsheet applications, internal tools), GPT-5.5's roughly 13-point lead over Claude Opus 4.7 on Terminal-Bench 2.0 is real and large, and it holds a narrower edge on OSWorld-Verified as well. For coding agents that stay inside a repo, the gap shrinks to statistical noise.
Claude Mythos is the benchmark distortion, not the benchmark reality. Its numbers look extraordinary, but they describe a model most organizations will never touch. If you are currently making a tool selection decision, comparing against Mythos is like comparing production cars to a Formula 1 prototype. The useful comparison is between GPT-5.5, Opus 4.7, Kimi K2.6, and GLM-5.1 — and on that terrain, the race is genuinely open.
Who Should Use What
For coding agents in CI/CD: GPT-5.5 via API or Claude Opus 4.7; the two are within noise on SWE-bench Verified, and GPT-5.5 holds a larger lead on multi-step terminal tasks. See our earlier guide on choosing your AI coding stack for the full framework.
For cost-sensitive coding workflows or on-premise deployments: Kimi K2.6 at $0.95/M input tokens or GLM-5.1 open weights. On SWE-bench Verified, the gap versus the top proprietary models is now under 3 points for Kimi K2.6 and under 5 points for GLM-5.1.
For reasoning-heavy research tasks (scientific literature synthesis, multi-step mathematical analysis): Claude Opus 4.7, which leads GPQA Diamond among publicly available models, though consider how benchmark inflation is affecting these comparisons before anchoring to any single score.
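Taken together, those recommendations reduce to a small routing table. The sketch below simply encodes them; the task categories and first-choice ordering are this article's framing, not an official taxonomy, so adjust them to your own workloads.

```python
# Sketch: the recommendations above encoded as a simple routing table.
# Category names and orderings are this article's framing, not an official taxonomy.
RECOMMENDATIONS = {
    "agentic_ci_cd":       ["GPT-5.5", "Claude Opus 4.7"],  # within noise on SWE-bench Verified
    "cost_sensitive_code": ["Kimi K2.6", "GLM-5.1"],        # $0.95/M input or open weights
    "on_prem_deployment":  ["GLM-5.1", "Kimi K2.6"],        # self-hostable open weights
    "reasoning_research":  ["Claude Opus 4.7"],             # GPQA Diamond lead among public models
}

def pick_model(task_category: str) -> str:
    """Return the first-choice model for a task category, per the guidance above."""
    choices = RECOMMENDATIONS.get(task_category)
    if not choices:
        raise ValueError(f"no recommendation for task category: {task_category!r}")
    return choices[0]

print(pick_model("cost_sensitive_code"))  # -> Kimi K2.6
```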
The benchmark landscape will shift again in the next 60 days. Both OpenAI and Anthropic have hinted at summer releases. GLM-5.2 and Kimi K3 are likely. The one prediction that feels safe: the open-source options will stay within a few points of whatever the next proprietary frontier turns out to be.
Further Reading
- The Decoder on GLM-5.1 — solid technical breakdown of how the iterative agent loop works in practice, including the Linux demo.
- LLM Stats on Claude Mythos Preview — numbers and pricing context for Anthropic’s restricted model, including the Project Glasswing access structure.
- BuildFastWithAI GPT-5.5 Review — independent benchmark walkthrough covering agentic task performance and API pricing tiers.

