Introduction
On February 11, 2026, Zhipu AI released GLM-5 — a 744-billion-parameter model that promptly seized the #1 position on Chatbot Arena with an Elo score of 1451, outperforming every closed-source model including GPT-5.2 and Claude Opus 4.5. What makes this particularly notable is not just the benchmark result, but the infrastructure behind it: GLM-5 was trained entirely on 100,000 Huawei Ascend 910B chips — not a single Nvidia GPU in sight. For anyone tracking where frontier AI capability is actually concentrated, this is a significant data point.
Zhipu AI (operating as Z.ai) completed a Hong Kong IPO on January 8, 2026, raising approximately HKD 4.35 billion (about US$558 million) and becoming the first publicly traded foundation-model company. Its stock surged 28.7% within 24 hours of GLM-5's release. The model is available under an MIT license, the most permissive open-source license attached to any model at this scale.
Architecture: 744B MoE on Chinese Hardware
GLM-5 uses a Mixture-of-Experts (MoE) architecture with 744 billion total parameters but only 40 billion active per token — a design that keeps inference costs tractable despite the headline parameter count. The model routes tokens through 256 experts, activating 8 per inference step. Pre-training covered 28.5 trillion tokens, up from 23 trillion for its predecessor GLM-4.7.
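The routing arithmetic is easy to sketch. The snippet below is a generic top-k softmax gate, not Zhipu's actual router (which is unpublished); renormalizing the surviving gate weights so they sum to 1 is a common convention and is assumed here.

```python
import math
import random

NUM_EXPERTS = 256   # total routed experts, per the GLM-5 spec above
TOP_K = 8           # experts activated per token

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, k=TOP_K):
    """Generic top-k MoE gating: keep the k highest-scoring experts
    and renormalize their gate weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}  # expert index -> mixing weight

random.seed(0)
gates = route_token([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
print(len(gates))  # 8 experts active out of 256
```

Because only those 8 experts' feed-forward weights run per token, compute per token scales with the 40B active parameters, not the 744B total — which is the whole point of the design.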
For long-context handling, Zhipu borrowed DeepSeek's sparse attention mechanism (DSA), enabling a 200,000-token context window with maximum output of 131,000 tokens. That positions GLM-5 among the top-tier models for tasks requiring sustained context over long documents or codebases.
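The article doesn't detail the mechanism beyond the name, but the core idea of sparse attention is that each query attends to a small, dynamically chosen subset of keys rather than all 200,000 positions. A toy single-query sketch, purely illustrative (a production mechanism would select candidates cheaply instead of scoring every key as this does):

```python
import math

def sparse_attention_row(q, keys, values, k=2):
    """Toy 'dynamic sparsity': score all keys, keep only the top-k,
    softmax over the survivors, and mix their values. Real sparse
    attention avoids scoring every key; this scores all for clarity."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    w = {i: math.exp(scores[i] - m) for i in top}
    z = sum(w.values())
    dim = len(values[0])
    return [sum(w[i] * values[i][d_] for i in top) / z for d_ in range(dim)]

out = sparse_attention_row([1.0, 0.0],
                           [[1.0, 0.0], [0.0, 1.0], [0.9, 0.0]],
                           [[1.0], [2.0], [3.0]], k=2)
```

With k fixed, per-query attention cost stops growing with context length, which is what makes a 200K window tractable.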
The training hardware is where this story gets geopolitically interesting. The Huawei Ascend 910B delivers approximately 320 TFLOPS of FP16 performance, slightly above the A100 (312 TFLOPS) but well below the H100 (989 TFLOPS). Training the model required the MindSpore framework, Huawei's own deep learning stack, built to run on Ascend silicon. The fact that Zhipu achieved frontier-class results on this hardware, under active US export controls, is a concrete demonstration that domestic Chinese AI infrastructure is maturing faster than most Western observers anticipated.
Benchmarks: What the Numbers Actually Say
On SWE-bench Verified — the benchmark measuring real GitHub issue resolution rather than synthetic coding problems — GLM-5 scores 77.8%, ranking first among all open-source models. For comparison, Claude Opus 4.5 scores 80.9% and GPT-5.2 scores 80.0%, meaning the gap to the frontier closed models is now less than 3 percentage points.
The Chatbot Arena Elo of 1451 places GLM-5 ahead of Kimi K2.5 (1447) and GLM-4.7 (1445). Chatbot Arena scores reflect human preference across blind pairwise comparisons, which makes them a more reliable real-world signal than self-reported lab benchmarks. On AIME 2026, GLM-5 scores 92.7%; on GPQA-Diamond, 86.0%. On Humanity's Last Exam with tool use, GLM-5 records 50.4, surpassing both Claude Opus 4.5 (43.4) and GPT-5.2 (45.5) on that specific evaluation.
One caveat worth stating explicitly: most of these figures come from Zhipu AI’s own evaluation reports. The Chatbot Arena result is independently collected, but the lab-internal benchmark scores should be treated as manufacturer specifications until reproduced by third parties. The hallucination reduction claims in particular — discussed below — need independent validation before being taken at face value.
The Slime Framework: Post-Training at Scale
Training a 744B MoE model with reinforcement learning introduces a specific engineering problem: long-tail generation times. When some rollouts take 10x longer than others, naive synchronous training wastes enormous compute waiting for slow samples to finish. Zhipu’s answer is Slime, an asynchronous RL infrastructure that decouples generation from training updates.
Slime uses a technique called Active Partial Rollouts (APRIL), which allows partial trajectory segments to feed into training before a full rollout completes. According to the GLM-5 technical report, this architecture cuts the hallucination rate from approximately 90% on GLM-4.7 to 34% on GLM-5, a 56-percentage-point reduction that, if accurate, would represent a meaningful shift in practical reliability. Zhipu claims this result tops the Artificial Analysis Omniscience Index (a benchmark run by the independent evaluation firm Artificial Analysis, not by Anthropic).
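Slime itself is unpublished, so the following is only a toy sketch of the partial-rollout scheduling idea: interleave many rollouts and hand each finished segment to the trainer immediately, so short rollouts never wait on the longest one. The function name and fixed segment size are illustrative assumptions.

```python
from collections import deque

def april_schedule(rollout_lengths, segment=4):
    """Toy sketch of Active Partial Rollouts: round-robin over
    in-flight rollouts, emitting each generated segment to the
    training buffer immediately rather than waiting for the
    slowest rollout to finish. Returns the arrival order of
    (rollout_id, segment_tokens) pairs at the trainer."""
    remaining = {i: n for i, n in enumerate(rollout_lengths)}
    queue = deque(remaining)
    buffer = []
    while queue:
        rid = queue.popleft()
        step = min(segment, remaining[rid])
        remaining[rid] -= step
        buffer.append((rid, step))   # partial segment enters training now
        if remaining[rid] > 0:
            queue.append(rid)        # rollout still generating
    return buffer

order = april_schedule([4, 12, 6])
print(order)
```

Note how the shortest rollout (id 0) contributes its data on the very first tick, while the 12-token rollout is still in flight — with synchronous batching, nothing would reach the trainer until the slowest rollout finished.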
The broader implication of Slime is that it lowers the RL compute overhead for future training runs. If the technique is reproducible and generalizable, it could benefit the wider open-source community through the MIT license release — though Zhipu has not yet published Slime separately as a standalone library.
Pricing, Availability, and Deployment Reality
GLM-5 is available through the Z.ai API at approximately $1.00 per million input tokens and $3.20 per million output tokens. On OpenRouter, prices run slightly lower at $0.80 input and $2.56 output. Either way, this is roughly five to ten times cheaper than Claude Opus 4.5 ($5/$25 per million tokens), depending on provider and input/output mix, which means teams doing high-volume inference have a financially compelling reason to evaluate it seriously.
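At these rates the savings are easy to quantify for a concrete workload. Prices are as quoted above; the monthly token volumes are hypothetical:

```python
def monthly_cost(millions_in, millions_out, price_in, price_out):
    """Monthly cost in USD for a given token volume (in millions)."""
    return millions_in * price_in + millions_out * price_out

# Hypothetical workload: 500M input + 100M output tokens per month.
glm5_api = monthly_cost(500, 100, 1.00, 3.20)   # Z.ai API rates
glm5_or  = monthly_cost(500, 100, 0.80, 2.56)   # OpenRouter rates
opus     = monthly_cost(500, 100, 5.00, 25.00)  # Claude Opus rates

print(glm5_api, glm5_or, opus)        # 820.0 656.0 5000.0
print(round(opus / glm5_api, 1))      # 6.1 (x cheaper at this mix)
```

The ratio shifts with traffic shape: input-heavy workloads land near 5x, output-heavy ones closer to 8-10x.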
Model weights are published on Hugging Face in both BF16 and FP8 formats, as well as on ModelScope. Self-hosting is feasible but expensive: you need a minimum of 8× H200 GPUs for FP8 inference. Inference speed runs at approximately 72.5 tokens per second — adequate for most production workloads, though slower than highly optimized closed-model serving stacks.
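The 8× H200 floor follows from the weight footprint alone. A back-of-the-envelope check, ignoring KV cache and activation memory (which add to the total):

```python
def weight_gb(params_billion, bytes_per_param):
    """Approximate on-device weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param

fp8_weights  = weight_gb(744, 1)   # FP8:  1 byte/param  -> ~744 GB
bf16_weights = weight_gb(744, 2)   # BF16: 2 bytes/param -> ~1488 GB

h200_pool = 8 * 141  # 8x H200 at 141 GB HBM each = 1128 GB

print(fp8_weights, bf16_weights, h200_pool)
```

FP8 weights fit in the 8-GPU pool with roughly 380 GB left for KV cache and activations; the BF16 checkpoint does not fit at all, which is why FP8 is the practical self-hosting format.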
Limitations Worth Noting
GLM-5 is text-only. There is no native multimodal support — no image input, no audio. Competitors like Kimi K2.5 from Moonshot AI and Gemini 3 Pro offer multimodal capabilities that GLM-5 simply does not match. For applications where vision or audio processing matters, this is a hard constraint, and it is a notable gap in a model otherwise targeting frontier status.
The “Pony Alpha” episode — where GLM-5 appeared on OpenRouter under a pseudonym in early February 2026, sparking community speculation before the official announcement — is a minor oddity but also a reminder that unofficial benchmark results can circulate before formal validation. Some community excitement was built on pre-release, pseudonymous evaluations. That does not invalidate the results, but it underscores the value of tracing claims back to their source.
Finally, the Slime-based hallucination reduction claim (90% to 34%) is the most consequential number in the announcement and also the least independently verified. Hallucination rates are notoriously sensitive to measurement methodology, and a 56-point reduction would be extraordinary by any standard. It deserves independent replication before being used in production reliability estimates.
What This Means for the Open-Source LLM Landscape
The gap between open-source and closed-source frontier models has been narrowing since Llama 3 in mid-2024, but GLM-5 represents the clearest evidence yet that open-weight models can reach Chatbot Arena #1. More than the ranking itself, the combination of MIT licensing, competitive API pricing, and published weights creates real optionality for teams that previously had no alternative to paying OpenAI or Anthropic rates.
The hardware story may ultimately matter more. If Zhipu AI can train at this quality level on Huawei Ascend chips, it substantially weakens the argument that chip export controls would cap Chinese AI capabilities at a meaningful distance from the global frontier. The next question is whether this was a one-off engineering achievement or a repeatable production capability. That answer will come with GLM-5’s successor.
Further Reading
- GLM-5: From Vibe Coding to Agentic Engineering (arXiv, Feb 2026) — The full technical report covering architecture, the Slime RL framework, and benchmark methodology directly from Zhipu AI’s research team.
- GLM-5 on Hugging Face — Model weights (BF16 and FP8), model cards, and deployment instructions for teams evaluating self-hosting.
- GLM-5: China’s First Public AI Company Ships a Frontier Model (Maxime Labonne, Medium) — A well-structured independent analysis comparing GLM-5 benchmark numbers against the current frontier landscape.
