Introduction
Two benchmark results published within three months of each other have reshuffled the frontier AI leaderboard in ways that are genuinely hard to compare: OpenAI’s GPT-5.2 achieved a perfect 100% on AIME 2025 in December 2025, while Google’s Gemini 3.1 Pro, released on February 19, 2026, scored 77.1% on ARC-AGI-2 — some 24 points ahead of GPT-5.2’s 52.9% on the same test. These are not interchangeable results. Each measures something different, and deciding which model is “better” depends entirely on what you’re asking it to do.
What ARC-AGI-2 Actually Measures
Before treating any headline number as definitive, it helps to understand what ARC-AGI-2 is testing. Created by François Chollet and the ARC Prize team, the benchmark presents models with novel visual grid puzzles — each one unique, none of which appear in any training set. To solve them, a model must infer abstract rules from just a few examples and apply those rules to unseen cases under strict computational constraints.
The benchmark is designed to be easy for humans but hard for AI: a live study with over 400 members of the public showed that people consistently solve the tasks within two attempts. AI systems, even powerful ones, struggle particularly when tasks require symbols to carry context-dependent meaning or when multiple interacting rules must be applied simultaneously. That is not pattern matching. That is closer to reasoning.
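The task structure behind those puzzles is straightforward to work with in code. The sketch below uses a toy task in the public JSON layout of ARC-AGI-1 (a `train` list of input/output grid pairs plus a `test` list), which ARC-AGI-2 reportedly shares; the grids and the mirror rule here are invented for illustration, not drawn from the actual benchmark.

```python
import json

# A toy ARC-style task: infer the rule from the "train" pairs, apply it
# to the "test" input. (Illustrative data; real tasks ship as JSON files
# with this shape.)
task = json.loads("""
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 0]]}
  ]
}
""")

def mirror_rows(grid):
    """Candidate rule for this toy task: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A solver must verify its candidate rule against every training pair
# before trusting it on the held-out test input.
assert all(mirror_rows(p["input"]) == p["output"] for p in task["train"])
print(mirror_rows(task["test"][0]["input"]))  # [[0, 3], [0, 0]]
```

The hard part, of course, is not applying a known rule but discovering it from two examples — which is precisely what the benchmark isolates.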
ARC-AGI-2 matters because it resists the standard failure mode of AI benchmarks: saturation through data contamination. ARC-AGI-1 is now effectively solved — Gemini 3.1 Pro scores 98.0% on it — which is why the harder second version was introduced. Scores here are more meaningful than on benchmarks that frontier models have quietly memorized.
Gemini 3.1 Pro’s Reasoning Lead
Gemini 3.1 Pro’s 77.1% on ARC-AGI-2 is the highest score any model has recorded on the benchmark as of March 2026. It outpaces Claude Opus 4.6 (68.8%) and GPT-5.2 (52.9%) by a significant margin. According to independent testing by Artificial Analysis, Gemini 3.1 Pro ranked first out of 115 models on the Artificial Analysis Intelligence Index as of February 2026.
The model’s other benchmark results reinforce that picture. On GPQA Diamond — a graduate-level science test whose questions are designed to be unanswerable without deep domain knowledge — Gemini 3.1 Pro scored 94.3%, reportedly the highest score on that benchmark to date. On SWE-Bench Verified, the most widely cited coding benchmark, it hit 80.6%, a fraction above GPT-5.2’s 80.0%.
The 1 million token context window is the other headline feature. In practice, this means the model can process an entire codebase, a 900-page PDF, or roughly 8.4 hours of audio in a single prompt. That is not just a quantitative improvement over GPT-5.2’s context window — it opens up workflows that were previously impossible in a single inference call. Pricing stayed flat at $2 per million input tokens and $12 per million output tokens, the same as Gemini 3 Pro.
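Those capacity figures are easy to sanity-check with back-of-envelope rates. In the sketch below, the tokens-per-page and audio-tokens-per-second constants are illustrative assumptions, not published tokenizer specs:

```python
CONTEXT_TOKENS = 1_000_000

# Assumed conversion rates (ballpark figures, not official specs):
TOKENS_PER_PDF_PAGE = 1_100   # a dense page of text
AUDIO_TOKENS_PER_SECOND = 32  # a common audio tokenization rate

pdf_pages = CONTEXT_TOKENS / TOKENS_PER_PDF_PAGE
audio_hours = CONTEXT_TOKENS / AUDIO_TOKENS_PER_SECOND / 3600

print(f"~{pdf_pages:.0f} PDF pages, ~{audio_hours:.1f} hours of audio")
```

Slightly different assumed rates reproduce the 900-page and 8.4-hour figures exactly; the point is that both claims are order-of-magnitude consistent with a 1M-token budget.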
GPT-5.2’s Math and Coding Strengths
OpenAI’s GPT-5.2, released December 11, 2025, claims two records that Gemini 3.1 Pro has not matched: a perfect 100% on AIME 2025, and a 55.6% score on SWE-Bench Pro, the harder variant of the coding benchmark that uses proprietary, real-world repositories rather than the public dataset. Gemini 3.1 Pro has been reported at roughly 54.2% on SWE-Bench Pro — close, but GPT-5.2 holds a narrow lead on that specific evaluation.
AIME (American Invitational Mathematics Examination) is a competition-level test requiring multi-step algebraic and geometric reasoning. GPT-5.1 scored nowhere near 100%; that jump represents a genuine capability shift in formal mathematical reasoning. And unlike ARC-AGI-2, AIME problems are well-defined, closed-form, and verifiable — which matters when you need provably correct answers rather than plausible-sounding ones.
GPT-5.2 also showed strong improvement on ARC-AGI-2 relative to its predecessor. GPT-5.1 scored 17% on the test; GPT-5.2 reached 52.9% — a 3.1x improvement. That is a significant internal leap, even if Gemini 3.1 Pro’s absolute score is higher. The gap may narrow further with GPT-5.3 or later iterations, a pattern consistent with how the frontier has evolved over the past two years.
The Hallucination Problem: Hard Numbers
On hallucination rates — one of the most practically important reliability metrics — the picture is less flattering for both models, and the numbers diverge depending on who is measuring. OpenAI reports that GPT-5.2 hallucinates on 6.2% of queries, down from 8.8% with GPT-5.1, roughly a 30% relative reduction. That self-reported figure, however, comes from OpenAI’s own internal test set of de-identified ChatGPT queries.
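The relative-reduction arithmetic is worth making explicit, since absolute and relative drops are easy to conflate in headlines:

```python
def relative_reduction(old, new):
    """Fractional drop from old to new (0.30 would mean a 30% relative cut)."""
    return (old - new) / old

# OpenAI's self-reported hallucination rates: GPT-5.1 at 8.8%, GPT-5.2 at 6.2%.
drop = relative_reduction(8.8, 6.2)
print(f"{drop:.1%}")  # prints 29.5% — the absolute drop is only 2.6 points
```

A 2.6-point absolute drop reads as a ~30% relative improvement; both framings are accurate, but they leave very different impressions.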
Independent testing tells a different story. Vectara’s hallucination benchmark places GPT-5.2 at 8.4%, behind DeepSeek’s 6.3%. This is not a contradiction — it is a measurement problem. Hallucination rates are deeply sensitive to the prompt distribution used for testing, the definition of “hallucination” applied by evaluators, and whether the model is in thinking mode or standard mode. Both numbers are real; neither is the full picture.
Gemini 3.1 Pro’s hallucination rate has not been independently benchmarked with the same systematic coverage as GPT-5.2’s at the time of writing. That absence of data is itself worth noting when choosing between models for high-stakes tasks where factual accuracy is non-negotiable.
What These Benchmarks Don’t Tell You
The honest read on the current frontier is that no single number settles the question of which model to use. GPT-5.2 leads on formal mathematics and holds a narrow edge on the harder coding benchmark. Gemini 3.1 Pro leads on abstract reasoning, graduate-level science, and long-context tasks by a meaningful margin. For agentic workflows — where models must coordinate tools across long sessions — Gemini 3.1 Pro’s scores of 33.5% on APEX-Agents and 69.2% on MCP Atlas suggest it is better tuned for multi-step orchestration.
What none of these benchmarks measure well is real-world reliability in production: how often models refuse valid queries, how gracefully they degrade on edge cases, and how consistent they are across repeated runs of the same prompt. These questions matter far more to engineering teams than marginal improvements on GPQA Diamond. The gap between benchmark performance and production behavior remains one of the most underreported stories in the frontier AI space — a theme that also emerged in earlier research on AI coding tools slowing experienced developers.
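One of those production questions — run-to-run consistency — is cheap to measure in-house. A minimal sketch, where `ask_model` is a placeholder for whatever client call a team actually uses (hypothetical, not a real library API):

```python
from collections import Counter

def consistency(answers):
    """Fraction of runs that agree with the modal (most common) answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def measure(ask_model, prompt, runs=10):
    """Repeat one prompt several times and score agreement.
    `ask_model` stands in for a real API client call."""
    return consistency([ask_model(prompt) for _ in range(runs)])

# With a deterministic stub, agreement is perfect:
print(measure(lambda p: "42", "What is 6 * 7?"))  # 1.0
```

Running the same harness against two frontier models on a team's own prompts often reveals reliability gaps that no public leaderboard captures.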
The benchmark war between OpenAI and Google is also accelerating the pace of releases in ways that make evaluation harder. GPT-5.2 arrived just weeks after GPT-5.1, explicitly in response to Gemini 3 competition. Gemini 3.1 Pro arrived three months after Gemini 3 Pro. Each new model obsoletes part of the prior comparison. By the time a team finishes evaluating today’s frontier on their specific use case, the landscape has shifted again.
Conclusion
Gemini 3.1 Pro is the stronger abstract reasoner and long-context processor as of March 2026; GPT-5.2 leads on formal math and holds a narrow SWE-Bench Pro edge. The question worth asking for any given team is not which model wins overall, but which capability gap matters most for their workload — and whether their evaluation methodology is rigorous enough to give a reliable answer. As the pace of releases shows no sign of slowing, that evaluation discipline may matter more than the benchmark headline itself.
Further Reading
- ARC-AGI-2 Official Overview — The benchmark’s design principles and what it actually tests, directly from the ARC Prize team.
- Introducing GPT-5.2 | OpenAI — OpenAI’s full technical announcement with benchmark tables and model card details.
- Gemini 3.1 Pro vs GPT-5.2: Coding Deep Dive — A detailed coding-focused comparison across SWE-Bench variants and real-world tasks published March 5, 2026.
