A Flash Model That Beats Last Year’s Pro
Google shipped Gemini 3.5 Flash on May 19, 2026, the same day it announced the model at I/O, and the headline is not what you would expect from a Flash-tier release: it outperforms the previous Gemini 3.1 Pro on agentic benchmarks, at 4x the speed and roughly 40% lower cost. For anyone building agents, that reordering matters.
The naming convention implies a hierarchy: Flash is the fast, cheap option and Pro is the capable one. Google has effectively dissolved that distinction. On the metrics that matter most for autonomous workflows (tool use, multi-step coding, long-horizon tasks), 3.5 Flash does not just close the gap with 3.1 Pro; it pulls ahead.
The Benchmark Numbers
Where Gemini 3.5 Flash Leads
On GDPval-AA, which measures real-world agentic task performance on an Elo scale, Gemini 3.5 Flash scores 1,656. Gemini 3.1 Pro scored 1,314 when it launched. A 342-point gap is not a marginal improvement; it represents a qualitative step in what the model can sustain across multi-turn, tool-heavy workflows.
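To put the Elo gap in more familiar terms, here is a minimal sketch that converts the two quoted ratings into an expected head-to-head score. It assumes the conventional logistic Elo formula with a 400-point scale; the article does not say which variant GDPval-AA uses, so treat the result as illustrative rather than as the benchmark's own statistic.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional logistic Elo model.

    The 400-point scale is an assumption; GDPval-AA may use a different one.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


flash_35 = 1656  # GDPval-AA rating quoted in the article
pro_31 = 1314    # GDPval-AA rating quoted in the article

gap = flash_35 - pro_31
expected = elo_expected_score(flash_35, pro_31)

print(f"Elo gap: {gap} points")
print(f"Expected head-to-head score for 3.5 Flash: {expected:.2f}")
```

Under that assumption, a 342-point gap corresponds to an expected score of roughly 0.88 for the higher-rated model in a pairwise comparison.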
MCP Atlas, which tests scaled tool-use reliability, shows the same pattern: 83.6% for 3.5 Flash versus 78.2% for 3.1 Pro, and 62.0% for Gemini 3 Flash. The 21.6-point jump over the previous Flash generation is the largest single improvement on that benchmark. Terminal-Bench 2.1 — coding inside a real terminal environment — lands at 76.2%, up from 70.3% for 3.1 Pro. Finance Agent v2 shows the widest spread: 57.9% against 43.0% for the previous Pro tier.
On MMMU-Pro, the multimodal reasoning benchmark, Gemini 3.5 Flash scores 84% — the highest ever recorded on that test at the time of writing. The overall multimodal average is 83.8, compared to 70.4 for GPT-5.5.
Where GPT-5.5 Still Leads
The comparison is not a clean sweep. GPT-5.5 outperforms Gemini 3.5 Flash on ARC-AGI-2 reasoning by 12.5 points and on MRCR v2 long-context retrieval at 128k tokens by 17.5 points. If your use case depends on deep single-session reasoning or precise long-context retrieval, OpenAI’s model retains a real edge. Across the full benchmark set, Gemini 3.5 Flash wins six categories, GPT-5.5 wins four, Claude Opus 4.7 wins one.
Head-to-Head Summary
| Benchmark | Gemini 3.5 Flash | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|
| GDPval-AA (Elo) | 1,656 | 1,314 | — |
| MCP Atlas (%) | 83.6 | 78.2 | — |
| Terminal-Bench 2.1 (%) | 76.2 | 70.3 | — |
| Finance Agent v2 (%) | 57.9 | 43.0 | — |
| MMMU-Pro (%) | 84.0 | — | — |
| Multimodal average | 83.8 | — | 70.4 |
| ARC-AGI-2 reasoning | — | — | +12.5 pts |
| MRCR v2 (128k retrieval) | — | — | +17.5 pts |
| Price (input / output per 1M) | $1.50 / $9.00 | $2.00 / $12.00 | ~$4.50 / $27.00 |
| Output speed | 4x faster | baseline | slower |
The Price Point That Changes the Math
Gemini 3.5 Flash is priced at $1.50 per million input tokens and $9.00 per million output tokens. That undercuts GPT-5.5 by roughly 3x on both input and output. Against Claude Opus 4.7, it is 10x cheaper on input and 8x cheaper on output. Batch processing is 50% off, bringing input to $0.75/M for offline workflows.
For a single-turn chatbot, cost-per-call is a line item. For an agent that calls a model dozens or hundreds of times to complete one task, the arithmetic is different. At the 10x input and 8x output ratios above, a workflow that costs $0.90 per run with Opus 4.7 runs for roughly $0.09 to $0.11 with Gemini 3.5 Flash, depending on the input/output mix, provided the task is within the model’s capability envelope. The benchmarks suggest that most common agentic tasks now are.
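The per-run arithmetic can be sketched in a few lines. The prices are the per-million-token figures quoted in this article (the GPT-5.5 figures are approximate); the workload of 40 calls with 3,000 input and 500 output tokens each is a hypothetical example, and the dictionary keys are plain labels rather than official API model IDs. The sketch also assumes the 50% batch discount applies to both input and output, which the article only states for input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Prices as quoted in the article; keys are labels, not official model IDs.
PRICES = {
    "gemini-3.5-flash": Price(1.50, 9.00),
    "gemini-3.1-pro": Price(2.00, 12.00),
    "gpt-5.5": Price(4.50, 27.00),  # approximate, per the article
}


def run_cost(price: Price, calls: int, input_tokens_per_call: int,
             output_tokens_per_call: int, batch_discount: float = 0.0) -> float:
    """Cost in USD of one agent run made up of `calls` model calls.

    `batch_discount` is a fraction (0.5 for 50% off); applying it to both
    input and output is an assumption.
    """
    input_cost = calls * input_tokens_per_call / 1_000_000 * price.input_per_m
    output_cost = calls * output_tokens_per_call / 1_000_000 * price.output_per_m
    return (input_cost + output_cost) * (1.0 - batch_discount)


# Hypothetical workload: 40 calls, 3,000 input and 500 output tokens per call.
for name, price in PRICES.items():
    cost = run_cost(price, calls=40, input_tokens_per_call=3000,
                    output_tokens_per_call=500)
    print(f"{name}: ${cost:.2f} per run")
```

With that hypothetical workload, the sketch works out to about $0.36 per run for 3.5 Flash, $0.48 for 3.1 Pro, and $1.08 for GPT-5.5. Swapping in your own call counts and token sizes matters more than the exact numbers here, because output-heavy agents shift the balance toward the output price.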
The 1M-token context window (1,048,576 tokens precisely) adds further headroom: entire codebases, hours of video, long conversation histories — all in a single context. Supported modalities are text, image, audio, and video.
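As a rough way to check whether a set of files would fit in that window, the sketch below estimates token counts from character counts. The four-characters-per-token ratio is a common rule of thumb, not the model's actual tokenizer, and reserving part of the window for output is an assumption about how the budget is shared; for real decisions, count tokens with the provider's own tooling.

```python
import math
from typing import Iterable, Tuple

CONTEXT_WINDOW_TOKENS = 1_048_576  # window size quoted in the article
CHARS_PER_TOKEN = 4                # rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of `text` from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(documents: Iterable[str],
                    reserved_for_output: int = 8192) -> Tuple[bool, int]:
    """Return (fits, estimated_tokens) for the combined documents.

    `reserved_for_output` is a hypothetical allowance for the model's reply.
    """
    total = sum(estimate_tokens(doc) for doc in documents)
    budget = CONTEXT_WINDOW_TOKENS - reserved_for_output
    return total <= budget, total


# Stand-in content: three "files" of 400,000 characters each.
files = ["x" * 400_000 for _ in range(3)]
fits, total_tokens = fits_in_context(files)
print(f"Estimated tokens: {total_tokens}, fits: {fits}")
```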
What This Means for Agentic Applications
The practical implication is that Google’s Flash tier is no longer a cost-optimized fallback — it is the recommended starting point for most agentic deployments. Developers who defaulted to Pro-tier models for anything complex should re-evaluate.
The MCP Atlas score (83.6%) is particularly relevant. MCP — the Model Context Protocol — has become the standard wiring layer for connecting AI agents to tools and data sources. We covered the MCP ecosystem reaching 9,400 registered servers earlier this month. A Flash-tier model that handles MCP tool calls reliably at scale changes what is economically viable to ship.
Finance Agent v2 performance, 57.9% versus 43.0% for the previous Pro, is the most concrete signal. Real-world financial agent tasks involve multi-step reasoning, constrained structured output, and error recovery across tool calls. A 14.9-point jump in that context is not a benchmark artifact.
That said, teams building agents that depend on deep single-session reasoning or precise retrieval at 128k tokens and beyond should still run a side-by-side evaluation. GPT-5.5’s edges on ARC-AGI-2 and MRCR v2 are real. We compared frontier models on cost and context in our May 2026 benchmarks roundup; the picture has shifted again with this release.
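A side-by-side evaluation does not need much machinery. The sketch below is one possible shape for it: each model is wrapped as a plain function from prompt to response, and each test case pairs a prompt with a pass/fail check. The stub models and the two cases are placeholders so the example runs offline; in practice you would replace the stubs with calls to whichever SDKs you use, which are deliberately not shown here.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]              # prompt -> response
Case = Tuple[str, Callable[[str], bool]]  # (prompt, pass/fail check)


def compare(models: Dict[str, Model], cases: List[Case]) -> Dict[str, float]:
    """Run every case against every model and return pass rates per model."""
    if not cases:
        raise ValueError("at least one test case is required")
    results = {}
    for name, model in models.items():
        passed = sum(1 for prompt, check in cases if check(model(prompt)))
        results[name] = passed / len(cases)
    return results


# Placeholder models standing in for real API wrappers.
def stub_model_a(prompt: str) -> str:
    return prompt.upper()


def stub_model_b(prompt: str) -> str:
    return prompt


# Placeholder cases; real ones would come from your own task traces.
cases: List[Case] = [
    ("hello", lambda out: out == "HELLO"),
    ("abc", lambda out: "a" in out.lower()),
]

scores = compare({"model_a": stub_model_a, "model_b": stub_model_b}, cases)
for name, rate in scores.items():
    print(f"{name}: {rate:.0%} of cases passed")
```

Keeping the model interface this narrow means the same cases can be rerun unchanged when a new model ships, which is the situation this release creates.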
Gemini 3.5 Pro Is on Deck
Google has confirmed Gemini 3.5 Pro is targeting a June 2026 launch with a 2M-token context window and a “Deep Think” reasoning mode — a heavier, slower mode designed for the tasks where 3.5 Flash runs into limits. If Flash already exceeds the previous Pro tier on agent benchmarks, the 3.5 Pro launch will be the real test of how far Google has pushed the frontier this generation.
The pattern is consistent with what Google signaled at Cloud Next ’26: the bet is on agents, not chatbots. The Gemini 3.5 generation is built around that assumption. The benchmark profile now supports the argument — at price points that make it hard to dismiss.
Further Reading
- Google DeepMind: Gemini 3.5 — frontier intelligence with action — the official model card with full benchmark methodology and scores from the DeepMind team
- WaveSpeed: A Flash-Tier Model Now Leads the Pro Tier on Agent Benchmarks — independent benchmark analysis with detailed per-category breakdowns
- DataCamp: Gemini 3.5 Flash vs GPT-5.5 — head-to-head practical comparison with code examples and use-case recommendations

