Gemini 3.1 Pro Hits 94.3% on PhD-Level Science and More Than Doubles Its Predecessor's ARC-AGI-2 Score
Google's Gemini 3.1 Pro topped reasoning benchmarks in March 2026, scoring 94.3% on GPQA Diamond (PhD-level science questions) and 77.1% on ARC-AGI-2, more than double Gemini 3 Pro's score on the same test.
◎ What happened?
In March 2026, Google released Gemini 3.1 Pro, a reasoning-focused update to the Gemini 3 family that took the top spot on most major benchmarks.
Key benchmark results:
- GPQA Diamond (PhD-level science questions): 94.3% — #1 across all models
- ARC-AGI-2 (general intelligence reasoning): 77.1% — more than double Gemini 3 Pro's earlier score of ~35%
- Led on 13 of 16 major benchmarks at launch
- Retains the 2-million-token context window of Gemini 2.5
The jump on ARC-AGI-2 is the headline number. ARC-AGI-2 was designed specifically to resist AI systems that memorize patterns: it presents novel visual puzzles that can only be solved through abstract reasoning. Humans score around 85%. Gemini 3.1 Pro, at 77.1%, is the closest any AI has come.
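To make the headline comparison concrete, here is a minimal Python sketch of the arithmetic behind "more than double" and the remaining gap to the human score. It uses only the figures quoted in this article; the ~35% and ~85% values are approximate, so the results are rough. The helper names are illustrative, not part of any benchmark tooling.

```python
# Figures quoted in the article. The baseline and human scores are
# approximate ("~35%", "around 85%"), so treat the results as rough.
GEMINI_3_1_PRO_ARC = 77.1
GEMINI_3_PRO_ARC = 35.0
HUMAN_ARC = 85.0


def improvement_ratio(new_score: float, old_score: float) -> float:
    """How many times larger the new score is than the old one."""
    return new_score / old_score


def gap_in_points(reference: float, score: float) -> float:
    """Percentage-point distance between a reference score and a model score."""
    return reference - score


ratio = improvement_ratio(GEMINI_3_1_PRO_ARC, GEMINI_3_PRO_ARC)
gap = gap_in_points(HUMAN_ARC, GEMINI_3_1_PRO_ARC)

print(f"Improvement over Gemini 3 Pro: {ratio:.2f}x")
print(f"Gap to human score: {gap:.1f} points")
```

With these inputs the ratio works out to about 2.2x, and the model sits roughly 7.9 percentage points below the quoted human score.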
Google's Gemini 3 Flash was also updated: it now beats Gemini 2.5 Pro on 18 of 20 benchmarks while delivering 3x faster responses at 60–70% lower cost.
◈ Why does it matter?
The GPQA Diamond and ARC-AGI-2 results matter in a different way than typical coding or math benchmarks do.
GPQA Diamond at 94.3% indicates that Gemini 3.1 Pro can answer multi-step science questions that require synthesizing expert knowledge, the kind of reasoning a PhD student uses rather than pattern matching. At this level, the model is useful for scientific research assistance, not just academic trivia.
ARC-AGI-2 at 77.1% supports a stronger claim: that the model can reason about novel problems it has never seen. The ARC-AGI benchmark series was created because its creators believed AI systems were "cheating" on other benchmarks through pattern memorization. Scoring 77.1% on a test designed to prevent that is a meaningful signal that something qualitatively different is happening in these models.
Gemini 3 Flash's efficiency gains are separately important. Responses that are 3x faster at 60–70% lower cost shift the economics of AI integration significantly: tasks that were cost-prohibitive at scale become practical.
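As a rough illustration of what those gains could mean for a budget, the sketch below applies the quoted 60–70% cost reduction and 3x speedup to a hypothetical workload. The baseline figures ($10,000 per month, 3.0 s per response) are invented for the example and are not Google pricing. Reading "3x faster" as latency divided by three is also an assumption.

```python
def flash_estimate(baseline_cost: float, baseline_latency_s: float,
                   cost_reduction: float, speedup: float = 3.0) -> tuple[float, float]:
    """Return (estimated_cost, estimated_latency_s) after applying the quoted gains.

    cost_reduction is a fraction (0.60 for "60% lower cost").
    speedup is assumed to divide latency ("3x faster" -> latency / 3).
    """
    if not 0.0 <= cost_reduction < 1.0:
        raise ValueError("cost_reduction must be in [0, 1)")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_cost * (1.0 - cost_reduction), baseline_latency_s / speedup


# Hypothetical baseline workload, chosen only for illustration.
BASELINE_MONTHLY_COST = 10_000.0  # USD per month
BASELINE_LATENCY_S = 3.0          # seconds per response

best_cost, latency = flash_estimate(BASELINE_MONTHLY_COST, BASELINE_LATENCY_S, 0.70)
worst_cost, _ = flash_estimate(BASELINE_MONTHLY_COST, BASELINE_LATENCY_S, 0.60)

print(f"Estimated monthly cost: ${best_cost:,.0f} to ${worst_cost:,.0f}")
print(f"Estimated latency: {latency:.1f} s")
```

Under these assumed baselines, the monthly bill would fall to between $3,000 and $4,000 and the response time to about 1.0 s. Substitute your own measured cost and latency to see whether a workload crosses your practicality threshold.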
◇ Should you switch?
For research and science tasks: yes, Gemini 3.1 Pro is now the top choice.
If your work involves scientific literature, research synthesis, or complex multi-domain reasoning, Gemini 3.1 Pro leads. Nothing else is close on GPQA Diamond.
For everyday coding and agentic work: Claude Fable 5 and GPT-5.4 remain competitive. Gemini 3.1 Pro is not a "do everything better" upgrade — it's specifically dominant on reasoning-heavy tasks.
Free tier: Gemini models remain available via Google AI Studio with generous free quotas. Gemini 3 Flash gives you near-Pro quality at a fraction of the cost, so start there for high-volume use cases.
Action items:
- Research / science apps → evaluate Gemini 3.1 Pro immediately
- Cost-sensitive high-volume apps → Gemini 3 Flash is now the best value in the industry
- Multimodal tasks (images, video, audio) → Gemini's native multimodal architecture still leads