Still Waiting — and That’s the Story
Gemini 3.5 Pro was supposed to arrive in June. Sundar Pichai said so from the stage at Google I/O on May 19, asking the crowd to wait one more month after shipping Gemini 3.5 Flash that same day. The crowd reportedly groaned. As of June 15, Pro is still in limited enterprise preview on Vertex AI, with no public model card and no GA date confirmed. The wait, however, has produced something useful: nearly four weeks of Flash production data that tell you more about Pro than any press release would.
What Launched at I/O: The Flash Baseline
Flash is the fast, affordable tier of the Gemini 3.5 family. It launched May 19 with a 1-million-token context window, priced at $1.50 input / $9.00 output per 1M tokens, 40% cheaper than Gemini 3.1 Pro on both axes. On coding and agentic benchmarks, Flash also outperformed Gemini 3.1 Pro, the previous generation's top tier.
The benchmark gains are specific and sizable. On Terminal-Bench 2.1, Flash scored 76.2% against Gemini 3.1 Pro’s 70.3%. On MCP Atlas, it hit 83.6% vs 78.2%. The most striking number: Finance Agent v2, where Flash landed at 57.9% against 3.1 Pro’s 43.0% — a nearly 15-point gap. On GDPval-AA, Flash reached 1,656 Elo versus 3.1 Pro’s 1,314.
These are the benchmarks that matter to developers building agentic pipelines. Flash is now closer to Claude Opus 4.8 on these measures than the previous Pro tier was. That’s a meaningful repositioning, not a marginal update.
Against GPT-5.5, Flash’s agentic lead holds. On Google’s published comparison covering MCP Atlas, Toolathlon, Finance Agent v2, CharXiv Reasoning, and MMMU-Pro, Flash leads both GPT-5.5 and Claude Opus 4.7. Reasoning is the exception: GPT-5.5 averages 85 versus Flash’s 74.7, a real gap. But at $1.50/$9.00 against GPT-5.5’s $5.00/$30.00, Flash is about 3.3x cheaper on both input and output, which matters when agents make dozens of model calls per task.
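To make the per-task effect concrete, here is a minimal sketch of agent cost per task. The prices are the list prices quoted above; the call count and the token sizes per call are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List price in USD per 1M tokens."""
    input_per_m: float
    output_per_m: float


# Prices quoted in this article.
FLASH = Pricing(input_per_m=1.50, output_per_m=9.00)
GPT_5_5 = Pricing(input_per_m=5.00, output_per_m=30.00)


def task_cost(p: Pricing, calls: int, in_tokens: int, out_tokens: int) -> float:
    """Cost in USD of one agent task that makes `calls` model calls."""
    per_call = (in_tokens * p.input_per_m + out_tokens * p.output_per_m) / 1_000_000
    return calls * per_call


# Hypothetical workload: 40 calls per task, 8,000 input and 1,000 output tokens each.
flash_cost = task_cost(FLASH, calls=40, in_tokens=8_000, out_tokens=1_000)
gpt_cost = task_cost(GPT_5_5, calls=40, in_tokens=8_000, out_tokens=1_000)
print(f"Flash: ${flash_cost:.2f}  GPT-5.5: ${gpt_cost:.2f}  ratio: {gpt_cost / flash_cost:.2f}x")
```

Because both GPT-5.5 prices are the same multiple of Flash's, the ratio does not depend on the input/output mix; only the absolute dollar gap grows with call volume.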
Where Flash Regresses — Pro’s Mandate
Flash didn’t win everywhere. Three benchmark categories regressed relative to Gemini 3.1 Pro, and they’re not random categories.
| Benchmark | 3.5 Flash | 3.1 Pro | Delta |
|---|---|---|---|
| Humanity’s Last Exam | 40.2% | 44.4% | −4.2 pts |
| ARC-AGI-2 | 72.1% | 77.1% | −5.0 pts |
| Long-context (128K) | 77.3% | 84.9% | −7.6 pts |
These are not edge cases. Humanity’s Last Exam tests reasoning at the limit of current model capability. ARC-AGI-2 measures abstract pattern recognition — a proxy for generalizable intelligence that’s hard to overfit. And the 7.6-point drop on 128K long-context retrieval matters directly for enterprise workloads: anyone relying on RAG against large codebases or document corpora would see real performance degradation switching from 3.1 Pro to Flash.
The pattern is consistent: Flash’s architecture traded reasoning depth and long-context recall for speed and agentic throughput. That trade-off is legible in the numbers, and it defines Pro’s job description almost perfectly. If Pro ships and doesn’t restore these numbers, it has no clear reason to exist at a premium above Flash.
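The trade-off can be read straight off the published scores. The sketch below recomputes the Flash-minus-3.1-Pro deltas from the figures quoted in this article and splits them into gains and regressions; nothing here is new data.

```python
# (3.5 Flash, 3.1 Pro) scores in percent, as quoted in this article.
SCORES = {
    "Terminal-Bench 2.1": (76.2, 70.3),
    "MCP Atlas": (83.6, 78.2),
    "Finance Agent v2": (57.9, 43.0),
    "Humanity's Last Exam": (40.2, 44.4),
    "ARC-AGI-2": (72.1, 77.1),
    "Long-context (128K)": (77.3, 84.9),
}


def deltas(scores: dict) -> dict:
    """Flash minus 3.1 Pro, in percentage points, rounded to one decimal."""
    return {name: round(flash - pro, 1) for name, (flash, pro) in scores.items()}


DELTAS = deltas(SCORES)
gains = sorted(name for name, d in DELTAS.items() if d > 0)
regressions = sorted(name for name, d in DELTAS.items() if d < 0)

for name, d in sorted(DELTAS.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24}{d:+.1f} pts")
```

All three gains fall on agentic and coding benchmarks; all three regressions fall on reasoning and long-context recall.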
What We Know About Pro’s Specs
Google has confirmed three things about Gemini 3.5 Pro. First: a 2-million-token context window, double Flash’s limit and the largest of any production frontier model as of mid-June. For RAG workloads and large-codebase analysis, a 2M context eliminates most chunking constraints. Second: a “Deep Think” reasoning mode — a toggleable inference extension that trades latency for accuracy on hard problems, based on Google’s implementation in the prior Gemini 3.1 Pro Deep Think release. Third: Deep Think will be exclusive to the $250/month Ultra subscription, not the $20 Pro consumer tier.
On pricing, prior Gemini generation ratios suggest roughly $15 input / $60 output per 1M tokens, which is 10x Flash on input and about 6.7x on output. That would position 3.5 Pro competitively against Claude Opus 4.8 and GPT-5.5, both priced at similar ranges. No confirmed number has been published as of this writing.
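How large the Pro premium would be in practice depends on a workload's input/output mix. The sketch below works that out under the unconfirmed $15/$60 estimate; the `input_share` values are illustrative assumptions.

```python
# USD per 1M tokens. Flash prices are published; Pro prices are this article's estimate.
FLASH_IN, FLASH_OUT = 1.50, 9.00
PRO_IN_EST, PRO_OUT_EST = 15.00, 60.00


def blended_ratio(input_share: float) -> float:
    """Estimated Pro/Flash cost ratio when `input_share` of all tokens are input tokens."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    pro = input_share * PRO_IN_EST + (1 - input_share) * PRO_OUT_EST
    flash = input_share * FLASH_IN + (1 - input_share) * FLASH_OUT
    return pro / flash


for share in (1.0, 0.9, 0.5, 0.0):
    print(f"input share {share:.0%}: Pro would cost about {blended_ratio(share):.1f}x Flash")
```

The 10x figure holds only for input-only traffic; output-heavy workloads would sit closer to 6.7x if the estimate turns out to be right.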
Everything else remains unconfirmed: model architecture, updated multimodal capabilities, whether Gemini Omni video integration ships with Pro. The only public sources are Pichai’s one-liner and an enterprise Vertex preview that hasn’t produced independent benchmark disclosures.
Four Things to Check When the Model Card Drops
When Pro ships and official benchmarks land, four specific numbers will determine whether it’s worth the premium over Flash or 3.1 Pro.
Humanity’s Last Exam and ARC-AGI-2. If Pro doesn’t clear 3.1 Pro’s 44.4% and 77.1% respectively, it has regressed on hard reasoning compared to a model that already shipped. Flash’s own regression shows this is possible when an architecture optimizes elsewhere — the Pro tier doesn’t get an automatic exemption.
Long-context recall at 128K and above. Flash dropped 7.6 points at 128K. If Pro restores the 3.1 Pro lead and extends it across the full 2M-token context window, that’s the most production-relevant improvement on the sheet. RAG-heavy applications should prioritize this number over any headline claim about “frontier” positioning.
Whether Pro retains Flash’s agentic gains. The ideal outcome is a model above 3.1 Pro on reasoning and at Flash’s level on Terminal-Bench and MCP Atlas. A Pro that fixes reasoning but drops coding performance is a different product than Flash — not a superset. Developers currently on Flash would have to choose which trade-off they’re making.
Deep Think at the API level. The most useful version would be a per-request parameter in the API — toggle thinking on, pay more per call, get more accuracy. If Deep Think is locked to the $250 consumer tier and unavailable via API, it’s inaccessible to most production developers regardless of how good it is. That tier-lock structure would be a meaningful constraint for enterprise use.
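The four checks can be written down as a small scorecard to fill in when the model card lands. This is a sketch: the thresholds are the 3.1 Pro and Flash scores quoted in this article, while the `ModelCard` fields and the sample values are hypothetical placeholders, not leaked or predicted Pro results. Recall beyond 128K is not covered because no baseline number has been published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCard:
    """Scores in percent; field names are illustrative, not an official schema."""
    hle: float
    arc_agi_2: float
    long_context_128k: float
    terminal_bench: float
    mcp_atlas: float
    deep_think_in_api: bool


def scorecard(card: ModelCard) -> dict[str, bool]:
    """Compare a model card against the baselines quoted in the article."""
    return {
        # Must clear Gemini 3.1 Pro: 44.4% HLE, 77.1% ARC-AGI-2.
        "hard reasoning restored": card.hle > 44.4 and card.arc_agi_2 > 77.1,
        # Must at least match Gemini 3.1 Pro: 84.9% at 128K.
        "long-context recall restored": card.long_context_128k >= 84.9,
        # Must at least match Gemini 3.5 Flash: 76.2% Terminal-Bench 2.1, 83.6% MCP Atlas.
        "agentic gains retained": card.terminal_bench >= 76.2 and card.mcp_atlas >= 83.6,
        "Deep Think available per request": card.deep_think_in_api,
    }


# Placeholder numbers purely to show the output shape.
example = ModelCard(hle=46.0, arc_agi_2=79.0, long_context_128k=86.0,
                    terminal_bench=75.0, mcp_atlas=84.0, deep_think_in_api=False)
result = scorecard(example)
for check, ok in result.items():
    print(f"{'PASS' if ok else 'FAIL'}  {check}")
```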
What to Do While Waiting
For developers building on the Gemini family: run your workloads against Flash now. It’s live on the Gemini API under model ID gemini-3.5-flash, in AI Studio, on Vertex, and via OpenAI-compatible wrappers. If Flash covers your use case, which it will for most agentic and coding pipelines, you may not need Pro at all. That matters because a Pro premium in the range estimated above (roughly 7x to 10x Flash) would compound fast at scale.
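One way to run that trial is a small harness that replays your own prompts and checks the answers. The sketch below is model-agnostic: `run_workload` takes any prompt-to-text function. The `make_gemini_caller` helper assumes the `google-genai` Python SDK and an API key it can find in the environment; that access path is an assumption of this example, and the demo at the bottom uses an offline stub instead.

```python
import time
from typing import Callable, Iterable, Tuple


def make_gemini_caller(model_id: str = "gemini-3.5-flash") -> Callable[[str], str]:
    """Return a prompt -> text function backed by the google-genai SDK (assumed installed)."""
    from google import genai  # imported lazily so the harness itself runs without the SDK

    client = genai.Client()

    def call(prompt: str) -> str:
        response = client.models.generate_content(model=model_id, contents=prompt)
        return response.text or ""

    return call


def run_workload(call: Callable[[str], str],
                 cases: Iterable[Tuple[str, Callable[[str], bool]]]) -> dict:
    """Replay (prompt, check) pairs and report pass rate and mean latency."""
    passed, latencies = 0, []
    for prompt, check in cases:
        start = time.perf_counter()
        text = call(prompt)
        latencies.append(time.perf_counter() - start)
        passed += bool(check(text))
    n = len(latencies)
    return {
        "cases": n,
        "pass_rate": passed / n if n else 0.0,
        "mean_latency_s": sum(latencies) / n if n else 0.0,
    }


# Offline demo with a stub model; swap in make_gemini_caller() for a real run.
demo_cases = [
    ("Reply with the single word OK.", lambda text: "OK" in text),
    ("What is 2 + 2? Digits only.", lambda text: text.strip() == "4"),
]
demo_report = run_workload(lambda prompt: "OK", demo_cases)
print(demo_report)
```

Keeping the model behind a plain function makes it straightforward to point the same cases at Gemini 3.1 Pro today and at 3.5 Pro once it is available.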
For long-context and hard-reasoning workloads, stay on Gemini 3.1 Pro for now. The Flash regression at 128K is documented and real. Migrating to Flash just to run the latest model number while degrading retrieval quality is a poor trade. Wait until Pro ships and the context numbers are independently verified.
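Independent verification of long-context recall does not have to wait for third-party benchmarks. Below is a minimal single-needle probe, assuming a prompt-to-text function for whichever model is under test. It is not the benchmark behind the 128K figures above; the filler text, the passphrase, and the `n_filler` default are arbitrary choices, and `n_filler` would need to be scaled with the provider's token counter to actually reach 128K tokens.

```python
import random
from typing import Callable, Dict, Iterable


def build_haystack(needle: str, depth: float, n_filler: int, seed: int = 0) -> str:
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) inside filler lines."""
    rng = random.Random(seed)
    filler = [f"Log entry {i}: status code {rng.randint(200, 599)}." for i in range(n_filler)]
    position = int(depth * n_filler)
    return "\n".join(filler[:position] + [needle] + filler[position:])


def recall_at_depths(call: Callable[[str], str],
                     depths: Iterable[float],
                     n_filler: int = 2000) -> Dict[float, bool]:
    """For each depth, ask the model to retrieve the planted passphrase."""
    secret = "magenta-otter-4471"  # arbitrary made-up string
    needle = f"The deployment passphrase is {secret}."
    question = "\n\nWhat is the deployment passphrase? Answer with the passphrase only."
    return {
        depth: secret in call(build_haystack(needle, depth, n_filler) + question)
        for depth in depths
    }


# Offline demo: a stub that only "remembers" the last 500 characters of the prompt.
def forgetful_stub(prompt: str) -> str:
    return prompt[-500:]


demo_recall = recall_at_depths(forgetful_stub, depths=[0.0, 0.5, 1.0], n_filler=200)
print(demo_recall)
```

Running the same depths against 3.1 Pro, Flash, and eventually 3.5 Pro gives a like-for-like comparison on your own context lengths instead of relying on a single headline score.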
The more interesting possibility is that Pro ships this week — Google has had four weeks since I/O, and “give us until next month” puts late June as the real target window. Watch the official Gemini blog. The model card will tell the story the keynote couldn’t.