The Benchmark That Stopped Meaning Anything
MMLU was supposed to be the gold standard for measuring whether a language model understood the world. In 2021, GPT-3 scored around 43%. By early 2026, frontier models cluster between 97% and 99%. That number no longer tells you anything useful. If every serious model scores near-perfect, the benchmark has become a formality — a box to tick, not a signal to trust.
This is benchmark saturation, and it’s been eating the Open LLM Leaderboard for two years. HuggingFace has responded with two significant interventions — a full leaderboard rebuild in late 2024 and a scoring overhaul in early 2025. Neither has fully solved the problem, but both reveal how hard it is to build reliable, long-lived evaluation infrastructure in a field where models improve this fast.
What the Original Leaderboard Got Wrong
The first generation of the Open LLM Leaderboard launched in 2023 with benchmarks that were state-of-the-art at the time: MMLU, HellaSwag, ARC, and a handful of others. Within 18 months, those benchmarks were functionally useless at the frontier. HellaSwag measures commonsense reasoning, and frontier models now hit 95–98% on it; ARC asks grade-school science questions. Neither is a hard problem for a 70B model in 2026.
Saturation was only one problem. Contamination — models trained on data that overlapped with benchmark test sets — was the other. Researchers estimated a 94% probability that Yi-34B's training data overlapped with MMLU. Another model, CausalLM/34b, posted an MMLU score of 85.6 that researchers flagged as theoretically unachievable for a dense model of that size. The leaderboard was being gamed, sometimes intentionally, sometimes through sloppy training-data curation.
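Contamination analyses of this kind usually start from something much simpler than the statistical tests cited above: measuring word n-gram overlap between a training corpus and a benchmark's test questions. A minimal sketch of that idea (the window size and corpora here are illustrative assumptions, not the method used in the Yi-34B analysis):

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams, the usual unit for overlap checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(train_docs: list[str], test_questions: list[str],
                       n: int = 8) -> float:
    """Fraction of test questions sharing at least one n-gram with training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for q in test_questions if ngrams(q, n) & train_grams)
    return hits / len(test_questions)
```

Real pipelines scale this with hashing or Bloom filters over terabytes of text, but the decision rule is the same: long verbatim overlaps between training data and test questions are strong evidence of contamination.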
Open LLM Leaderboard v2: Harder Benchmarks, Fairer Scoring
In October 2024, HuggingFace launched the second version of the leaderboard with a new benchmark suite designed to resist both saturation and contamination. The six core benchmarks replaced the saturated originals:
- MMLU-Pro — MMLU rewritten with 10-choice questions instead of 4, requiring deeper reasoning and reducing the value of pattern-matching
- GPQA (Google-Proof Q&A) — expert-authored questions that a Google search cannot answer; the dataset is gated to limit contamination risk
- MuSR (Multistep Soft Reasoning) — algorithmically generated multi-step problems including murder mysteries and resource allocation puzzles
- MATH Level 5 — the hardest subset of competition math problems from the MATH dataset
- IFEval (Instruction Following Evaluation) — tests whether models can follow explicit formatting and constraint instructions
- BBH (BIG-Bench Hard) — a curated set of 23 tasks on which earlier language models had failed to beat the average human rater
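What makes IFEval distinctive in this list is that its constraints are programmatically verifiable, so no judge model is needed to score a response. A toy sketch of the idea (these rules are invented for illustration and are not IFEval's actual rule set):

```python
import re

# Each rule maps a constraint name to a verifiable predicate on the response.
RULES = {
    "exactly_three_bullets": lambda r: len(re.findall(r"^- ", r, re.MULTILINE)) == 3,
    "no_commas": lambda r: "," not in r,
    "all_lowercase": lambda r: r == r.lower(),
}

def ifeval_score(response: str, constraints: list[str]) -> float:
    """Fraction of the requested constraints that the response satisfies."""
    checks = [RULES[c](response) for c in constraints]
    return sum(checks) / len(checks)
```

For example, `ifeval_score("- one\n- two\n- three", ["exactly_three_bullets", "no_commas"])` returns 1.0. Because every check is deterministic, scores are reproducible and immune to judge-model drift.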
The move was well-received. These benchmarks are genuinely harder. As of early 2026, GPQA Diamond scores range from roughly 60% to 80% across frontier models — still enough spread to distinguish them. SWE-bench Verified (not part of the core suite but widely tracked) sits at 70–85%, and LiveCodeBench at 55–85%. There is still signal in these numbers, for now.
The Math-Verify Fix: 3,751 Models Re-Evaluated
Even after the v2 launch, a quieter problem surfaced. In February 2025, HuggingFace published a post-mortem revealing that the MATH benchmark had been scored incorrectly for months. The evaluation harness was failing to extract answers formatted in LaTeX \boxed{} notation — the standard format DeepSeek models use. As a result, DeepSeek models were systematically underscored.
The fix was Math-Verify, a new open-source answer-extraction library. HuggingFace used it to re-evaluate every model ever submitted — all 3,751 of them. The results were striking. On average, models solved 61 more MATH problems than previously scored, a 4.66-point increase across the board. DeepSeek models nearly tripled their MATH scores. The top-20 rankings on the MATH subset reshuffled entirely, with NVIDIA’s AceMath models emerging at the top and Qwen-family models filling the ranks below them.
This is worth sitting with. A scoring bug persisted for months, silently distorting how the field understood the relative capability of different model families. The fix required a complete retrospective evaluation of the entire submission history. It worked — but it also illustrates how fragile evaluation pipelines are when benchmarks depend on non-trivial answer parsing.
The Arms Race Continues: Why No Benchmark Lasts
The fundamental problem is structural. Benchmarks are published. Models are trained after publication. Training datasets — scraped from the web, aggregated from Common Crawl, assembled from GitHub — inevitably contain benchmark questions. Even without intentional data poisoning, contamination creeps in. And once a benchmark is published and widely used, lab incentives push toward maximizing scores on it, regardless of whether that generalizes.
Two approaches are trying to break this cycle. LiveBench, developed at the University of Illinois and MIT, releases new questions monthly drawn from recent arXiv papers, news articles, and recently-published datasets. Because the questions are new, they cannot be in any training set. The current LiveBench release (April 2025) covers 18 tasks across math, coding, reasoning, language, instruction following, and data analysis. LiveCodeBench takes a similar approach for code: new problems added continuously from recent competitive programming contests.
The tradeoff is freshness versus stability. A benchmark that changes monthly is harder to use as a consistent reference point: a model evaluated in January and one evaluated in June have answered different questions, so their scores are not directly comparable. Both approaches have merit; neither is a permanent solution.
What Engineers Should Actually Use in 2026
For teams evaluating models for deployment, the honest answer is that public leaderboards are a starting point, not a decision criterion. MMLU is saturated and tells you nothing at the frontier. HumanEval scores at 91–95% across the board — same problem. The benchmarks with remaining signal are the harder ones: GPQA Diamond, SWE-bench Verified, LiveCodeBench, and HLE (Humanity’s Last Exam, where scores still range from 10% to 46%).
More practically: run evals on your own data. A model that scores 82% on GPQA may be exactly the wrong choice for your code review pipeline if it consistently misses domain-specific error patterns your team cares about. Building your own evaluation set from real production examples is unglamorous work, but it is the only way to know whether a model change is actually an improvement for your use case.
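A private eval set does not require heavy infrastructure; a list of prompts paired with pass/fail checkers and a loop is enough to start. A minimal sketch, where `call_model` stands in for whatever client your stack uses (it is a placeholder, not a real API):

```python
from typing import Callable

# Each case pairs a prompt drawn from real production traffic with a
# checker encoding what "correct" means for your use case.
EvalCase = tuple[str, Callable[[str], bool]]

def run_evals(call_model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the pass rate; log failures so regressions stay inspectable."""
    passed = 0
    for prompt, check in cases:
        output = call_model(prompt)
        if check(output):
            passed += 1
        else:
            print(f"FAIL: {prompt[:60]!r}")
    return passed / len(cases)
```

Run the same case list against the current model and any candidate replacement, and the delta in pass rate answers the question the public leaderboards cannot: whether the change is an improvement on your traffic.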
The leaderboard arms race will continue. HuggingFace will launch v3 when v2 saturates. New contamination-resistant benchmarks will emerge. The community will find new ways to game them. What will not change is the gap between a number on a leaderboard and a model that works reliably in production.
Further Reading
- Fixing Open LLM Leaderboard with Math-Verify — HuggingFace’s own post-mortem on the scoring bug and the retrospective re-evaluation of 3,751 models
- LiveBench: A Challenging, Contamination-Free LLM Benchmark — The paper explaining the monthly-update approach to preventing training data contamination
- MMLU 85%, SimpleQA 3%: How to Actually Evaluate AI Models in 2026 — A clear-eyed analysis of why headline benchmark scores diverge so sharply from practical performance

