March 15, 2026 | google, gemini, reasoning

Gemini 3.1 Pro Hits 94.3% on PhD-Level Science and More Than Doubles Its Predecessor's ARC-AGI-2 Score

Google's Gemini 3.1 Pro topped reasoning benchmarks in March 2026, scoring 94.3% on GPQA Diamond (PhD-level science questions) and 77.1% on ARC-AGI-2, more than double Gemini 3 Pro's score on the same test.

What happened?

In March 2026, Google released Gemini 3.1 Pro, a reasoning-focused update to the Gemini 3 family that moved to the top of most major leaderboards.

Key benchmark results:

  • GPQA Diamond (PhD-level science questions): 94.3% — #1 across all models
  • ARC-AGI-2 (general intelligence reasoning): 77.1% — more than double Gemini 3 Pro's earlier score of ~35%
  • Led on 13 of 16 major benchmarks at launch
  • Maintains the 2-million-token context window from Gemini 2.5

The jump on ARC-AGI-2 is the headline number. ARC-AGI-2 was designed specifically to resist AI systems that memorize patterns — it tests genuine reasoning by presenting novel visual puzzles that require abstract thinking. Humans score around 85%. Gemini 3.1 Pro at 77.1% is the closest any AI has come.

Google's Gemini 3 Flash was also updated: it now beats Gemini 2.5 Pro on 18 of 20 benchmarks while delivering 3x faster responses at 60–70% lower cost.

Why does it matter?

The GPQA Diamond and ARC-AGI-2 results matter in a different way than typical coding or math benchmarks do.

GPQA Diamond at 94.3% means Gemini 3.1 Pro can answer multi-step science questions that require synthesizing expert knowledge, the kind of reasoning a PhD student uses rather than pattern matching. At this level, the model is useful for scientific research assistance, not just academic trivia.

ARC-AGI-2 at 77.1% supports a much stronger claim: that the model can reason about novel problems it has never seen. The ARC-AGI benchmark series was created specifically because its creators believed AI systems were "cheating" on other benchmarks through pattern memorization. Scoring 77.1% on a test designed to prevent that is a meaningful signal that something qualitatively different is happening in these models.

Gemini 3 Flash's efficiency gains matter for a separate reason. Responses that are 3x faster at 60–70% lower cost shift the economics of AI integration: tasks that were cost-prohibitive at scale become practical.
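To make the Flash numbers concrete, here is a minimal Python sketch of the arithmetic. The baseline figures (a $10,000 monthly bill and 6-second responses) are hypothetical values chosen for illustration, not measurements or published pricing; only the 3x speedup and the 60–70% cost reduction come from the figures quoted above.

```python
def projected_cost(baseline_cost: float, reduction: float) -> float:
    """Return the cost after a fractional reduction (0.60 means 60% lower)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_cost * (1.0 - reduction)


def projected_latency(baseline_seconds: float, speedup: float) -> float:
    """Return the response time after a speedup factor (3.0 means 3x faster)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_seconds / speedup


# Hypothetical baseline, for illustration only.
baseline_monthly_cost = 10_000.0  # US dollars per month
baseline_latency = 6.0            # seconds per response

best_case = projected_cost(baseline_monthly_cost, 0.70)   # 70% lower cost
worst_case = projected_cost(baseline_monthly_cost, 0.60)  # 60% lower cost
latency = projected_latency(baseline_latency, 3.0)        # 3x faster

print(f"Monthly cost: ${best_case:,.0f} to ${worst_case:,.0f}")  # $3,000 to $4,000
print(f"Response time: {latency:.1f} s")                         # 2.0 s
```

Under these assumed baselines, the same workload would cost roughly $3,000 to $4,000 per month instead of $10,000. Actual savings depend on your token volume and on current pricing, so treat this as a way to frame the estimate rather than a forecast.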

Should you switch?

For research and science tasks: yes, Gemini 3.1 Pro is now the top choice.

If your work involves scientific literature, research synthesis, or complex multi-domain reasoning, Gemini 3.1 Pro leads. Nothing else is close on GPQA Diamond.

For everyday coding and agentic work: Claude Fable 5 and GPT-5.4 remain competitive. Gemini 3.1 Pro is not a "do everything better" upgrade — it's specifically dominant on reasoning-heavy tasks.

Free tier: Gemini models remain available via Google AI Studio with generous free quotas. The Gemini Flash variant gives you near-Pro quality at a fraction of the cost — start there for high-volume use cases.

Action items:

  • Research / science apps → evaluate Gemini 3.1 Pro immediately
  • Cost-sensitive high-volume apps → Gemini 3 Flash is now the best value in the industry
  • Multimodal tasks (images, video, audio) → Gemini's native multimodal architecture still leads
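
If you want to encode these recommendations in an application, one option is a small lookup table that maps a task category to the models worth evaluating first. This is only a sketch: the category keys and the function name are hypothetical, and the strings are display names taken from this article, not API model identifiers.

```python
# Candidate models per task category, following the article's recommendations.
# The strings are display names for illustration, not API model identifiers.
CANDIDATES = {
    "research": ["Gemini 3.1 Pro"],
    "high_volume": ["Gemini 3 Flash"],
    "multimodal": ["Gemini family"],
    "coding": ["Claude Fable 5", "GPT-5.4"],
}


def candidates_for(task_category: str) -> list[str]:
    """Return the models to evaluate first for a given task category."""
    key = task_category.strip().lower()
    if key not in CANDIDATES:
        raise KeyError(f"unknown task category: {task_category!r}")
    # Return a copy so callers cannot modify the shared table.
    return list(CANDIDATES[key])


print(candidates_for("research"))  # ['Gemini 3.1 Pro']
print(candidates_for("coding"))    # ['Claude Fable 5', 'GPT-5.4']
```

A table like this is a starting point for your own evaluation, not a substitute for it: benchmark leads do not always carry over to a specific workload.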

Who should care?

Researchers
Scientists
Developers
Businesses
Students