Google Gemini 3.1 Pro: Benchmarks, Pricing & Guide
Gemini 3.1 Pro scores 77.1% on ARC-AGI-2 and 2887 Elo on LiveCodeBench Pro at $2/$12 per 1M tokens. Full benchmarks, pricing, and competitive comparison guide.
- ARC-AGI-2 score: 77.1%
- LiveCodeBench Pro Elo: 2887
- Price per 1M tokens (input/output): $2 / $12
- Context window: 1M tokens
Key Takeaways
Google released Gemini 3.1 Pro on February 19, 2026 — its most capable model yet, described as "designed for tasks where a simple answer isn't enough." It delivers a 2x+ reasoning performance boost over Gemini 3 Pro, tops most major benchmarks, and maintains the same $2/$12 pricing. This is Google's strongest play for the frontier AI crown.
The numbers speak for themselves: 77.1% on ARC-AGI-2 (up from 31.1%), 2887 Elo on LiveCodeBench Pro, 94.3% on GPQA Diamond, and #1 rankings on 12 of 18 tracked benchmarks. Gemini 3.1 Pro is the first ".1" increment between major Gemini versions — Google previously used ".5" for mid-cycle updates — and the jump in capability justifies the naming change.
What's New in Gemini 3.1 Pro
Gemini 3.1 Pro represents a fundamental shift in how Google delivers mid-cycle updates. Previous Gemini generations used ".5" increments (e.g., Gemini 1.5 and 2.5), but the ".1" naming signals a tighter, more targeted improvement cycle focused on reasoning depth and agentic capability rather than broad architecture changes.
The core improvement is a 2x+ reasoning boost. On ARC-AGI-2 — the benchmark that tests novel problem-solving without memorized patterns — Gemini 3.1 Pro scores 77.1% compared to Gemini 3 Pro's 31.1%. That 46 percentage point jump is the largest single-generation reasoning gain seen in any frontier model family. Google attributes this to more efficient thinking, where the model extracts more insight per compute token during its reasoning chain.
- 77.1% ARC-AGI-2 (up from 31.1%)
- 2887 Elo on LiveCodeBench Pro (+18%)
- New "Medium" thinking level
- Improved agentic and SWE capabilities
- $2/$12 per million tokens
- 1M token context window
- 64K token output limit
- Multimodal input support
Complete Benchmark Breakdown
Google published results across 18 benchmarks covering academic reasoning, coding, agentic tasks, multimodal understanding, and knowledge. The comparison includes Gemini 3 Pro, Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.2, and GPT-5.3-Codex where available. Gemini 3.1 Pro leads on more benchmarks than any other model.
Reasoning & Knowledge
| Benchmark | 3.1 Pro | 3 Pro | Opus 4.6 | GPT-5.2 |
|---|---|---|---|---|
| HLE (No Tools) | 44.4% | 37.5% | 40.0% | 34.5% |
| HLE (Search+Code) | 51.4% | 45.8% | 53.1% | 45.5% |
| ARC-AGI-2 | 77.1% | 31.1% | 68.8% | 52.9% |
| GPQA Diamond | 94.3% | 91.9% | 91.3% | 92.4% |
| MMMLU | 92.6% | 91.8% | 91.1% | 89.6% |
| MMMU Pro | 80.5% | 81.0% | 73.9% | 79.5% |
| BrowseComp | 85.9% | 59.2% | 84.0% | 65.8% |
Coding & Software Engineering
| Benchmark | 3.1 Pro | 3 Pro | Opus 4.6 | GPT-5.2 |
|---|---|---|---|---|
| SWE-Bench Verified | 80.6% | 76.2% | 80.8% | 80.0% |
| SWE-Bench Pro (Public) | 54.2% | 43.3% | — | 55.6% |
| LiveCodeBench Pro (Elo) | 2887 | 2439 | — | 2393 |
| Terminal-Bench 2.0 | 68.5% | 56.9% | 65.4% | 54.0% |
| SciCode | 59% | 56% | 52% | 52% |
Agentic Tasks
| Benchmark | 3.1 Pro | 3 Pro | Sonnet 4.6 | Opus 4.6 |
|---|---|---|---|---|
| APEX-Agents | 33.5% | 18.4% | — | 29.8% |
| GDPval-AA (Elo) | 1317 | 1195 | 1633 | 1606 |
| τ²-bench Retail | 90.8% | 85.3% | 91.7% | 91.9% |
| τ²-bench Telecom | 99.3% | 98.0% | 97.9% | 99.3% |
| MCP Atlas | 69.2% | 54.1% | 61.3% | 59.5% |
| MRCR v2 128k | 84.9% | 77.0% | 84.9% | 84.0% |
The standout result is ARC-AGI-2 — Gemini 3.1 Pro's 77.1% represents a 2.5x improvement over Gemini 3 Pro and exceeds Opus 4.6 by over 8 percentage points. On LiveCodeBench Pro, the 2887 Elo rating places it significantly ahead of both GPT-5.2 (2393) and Gemini 3 Pro (2439). The MCP Atlas score of 69.2% also demonstrates strong tool-use coordination, leading all models tested.
Where Gemini 3.1 Pro Leads (and Where It Doesn't)
Where Gemini 3.1 Pro Dominates
Gemini 3.1 Pro holds the #1 position on at least 12 of 18 tracked benchmarks. Its strongest leads are in novel reasoning (ARC-AGI-2: 77.1%, 8.3 points ahead of Opus 4.6), competitive coding (LiveCodeBench Pro: 2887 Elo, 494 points above GPT-5.2), graduate-level science (GPQA Diamond: 94.3%), scientific coding (SciCode: 59%), autonomous agent tasks (APEX-Agents: 33.5%), tool coordination (MCP Atlas: 69.2%), and web research (BrowseComp: 85.9%).
- Reasoning: 77.1% ARC-AGI-2, 94.3% GPQA Diamond, 44.4% HLE (no tools). Best-in-class on novel problem-solving and graduate-level science.
- Coding: 2887 Elo LiveCodeBench Pro, 80.6% SWE-Bench Verified, 68.5% Terminal-Bench. Dominates competitive coding and matches top models on SWE tasks.
- Agentic: 33.5% APEX-Agents, 69.2% MCP Atlas, 99.3% τ²-bench Telecom. Best autonomous agent and tool coordination performance.
Where Competitors Still Lead
Gemini 3.1 Pro is not the best everywhere. Claude Sonnet 4.6 and Opus 4.6 dominate GDPval-AA expert tasks (1633 and 1606 Elo respectively vs 1317 for Gemini 3.1 Pro). Opus 4.6 narrowly leads SWE-Bench Verified (80.8% vs 80.6%) and Humanity's Last Exam with tools (53.1% vs 51.4%). GPT-5.3-Codex leads specialized coding on Terminal-Bench 2.0 (77.3%) and SWE-Bench Pro (56.8%).
| Area | Leader | Score | 3.1 Pro |
|---|---|---|---|
| GDPval-AA (Expert Tasks) | Sonnet 4.6 | 1633 Elo | 1317 Elo |
| SWE-Bench Verified | Opus 4.6 | 80.8% | 80.6% |
| Terminal-Bench 2.0 | GPT-5.3-Codex | 77.3% | 68.5% |
| SWE-Bench Pro | GPT-5.3-Codex | 56.8% | 54.2% |
| HLE (Search+Code) | Opus 4.6 | 53.1% | 51.4% |
The pattern is clear: Gemini 3.1 Pro is the best general-purpose model available, but specialized models still win in their niches. Claude excels at expert office tasks and real-world SWE, while GPT-5.3-Codex is the specialist for dedicated coding workflows.
Pricing and Value Analysis
Gemini 3.1 Pro maintains the same pricing as Gemini 3 Pro — a massive performance upgrade at zero additional cost. At $2 per million input tokens and $12 per million output tokens, it's significantly cheaper than Claude Opus 4.6 ($15/$75) and competitive with Sonnet 4.6 ($3/$15). Context caching can reduce costs by up to 75%.
| Model | Input (per 1M) | Output (per 1M) | Context |
|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M tokens |
| Gemini 3.1 Pro (>200K) | $4.00 | $18.00 | 1M tokens |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M tokens |
| Claude Opus 4.6 | $15.00 | $75.00 | 1M tokens |
| GPT-5.2 | $2.50 | $10.00 | 1M tokens |
- $2 / $12 input / output per 1M tokens
- Up to 75% savings with context caching
- 7.5x cheaper than Claude Opus 4.6 on input
The value proposition is clear: Gemini 3.1 Pro leads on more benchmarks than Opus 4.6 while costing 7.5x less on input and 6.25x less on output. Even against the similarly priced GPT-5.2 ($2.50 input), Gemini 3.1 Pro delivers meaningfully better results on most benchmarks. For teams optimizing cost-to-performance, this is the strongest option available.
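To see where the "up to 75%" caching figure comes into play, here is a minimal sketch using the google-genai Python SDK's explicit context-caching API, assuming it applies to Gemini 3.1 Pro the same way it does to earlier Gemini models; the model ID, file name, and TTL are placeholders, so verify them against the current docs before using this in production.

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

MODEL_ID = "gemini-3.1-pro-preview"  # placeholder: confirm the exact ID in AI Studio

# Cache a large, reusable prompt prefix once (hypothetical reference document),
# so repeat requests don't pay full input price for it every time.
with open("product_manual.txt") as f:  # placeholder file
    manual = f.read()

cache = client.caches.create(
    model=MODEL_ID,
    config=types.CreateCachedContentConfig(
        contents=[manual],
        system_instruction="Answer strictly from the provided manual.",
        ttl="3600s",  # keep the cache alive for one hour
    ),
)

# Later calls reference the cache instead of resending the manual.
response = client.models.generate_content(
    model=MODEL_ID,
    contents="What is the warranty period for the EU region?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```

Cached input tokens are billed at a steep discount, so the savings grow with how much of each prompt is a reusable prefix such as a system prompt, style guide, or reference document.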
Where to Access Gemini 3.1 Pro
Gemini 3.1 Pro is available in preview across Google's full ecosystem, with general availability coming soon. Developers can start building immediately through multiple platforms.
- Google AI Studio — Free-tier access with rate limits, ideal for prototyping and experimentation
- Google Antigravity — Google's agentic development environment, purpose-built for AI-assisted coding
- Vertex AI — Enterprise deployment with GCP infrastructure, SLAs, and compliance
- Gemini CLI — Terminal-based access for developers who prefer command-line workflows
- Android Studio — Integrated access for mobile developers building Android applications
- Gemini App — Consumer access on Pro and Ultra plans for general-purpose use
- NotebookLM — Research and analysis tool with Gemini 3.1 Pro powering document understanding (Pro/Ultra)
- Gemini Enterprise — Google Workspace integration for business teams
What This Means for Developers and Businesses
For Developers
Combining competitive coding (2887 Elo on LiveCodeBench Pro), real-world SWE (80.6% on SWE-Bench Verified), and a 1M-token context window at $2/$12 pricing, Gemini 3.1 Pro is the best-in-class coding model. The new Medium thinking level lets developers balance cost and reasoning depth per request: use Low for autocomplete, Medium for code review, and High for complex debugging.
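As a rough illustration of that per-request tradeoff, the sketch below uses the google-genai Python SDK; the model ID is a placeholder, and the thinking_level field (including its new "medium" value) is assumed to follow the thinking-config pattern Gemini 3 introduced, so check the API reference before relying on it.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

MODEL_ID = "gemini-3.1-pro-preview"  # placeholder: confirm the exact ID in AI Studio

def ask(prompt: str, level: str) -> str:
    """Send one request at a given thinking level ("low", "medium", or "high")."""
    response = client.models.generate_content(
        model=MODEL_ID,
        contents=prompt,
        # Assumption: 3.1 Pro exposes "medium" through the same thinking_level
        # field that Gemini 3 uses for "low" and "high".
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level)
        ),
    )
    return response.text

print(ask("Autocomplete: def merge_sorted_lists(a, b):", "low"))            # cheap, fast
print(ask("Review this function for off-by-one errors: ...", "medium"))     # balanced
print(ask("Plan and debug a flaky distributed-lock implementation.", "high"))  # deep reasoning
```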
For Agencies and Businesses
With the strongest general-purpose performance across reasoning, coding, and agentic tasks, Gemini 3.1 Pro is the model to evaluate for complex AI workflows. The 69.2% MCP Atlas score means it handles multi-tool coordination well — critical for building automated pipelines that interact with multiple APIs, databases, and services.
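For a sense of what that coordination looks like in code, here is a minimal sketch using the google-genai SDK's automatic function calling, where plain Python functions are passed as tools; the model ID and both business functions are hypothetical placeholders standing in for your real APIs.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def lookup_order(order_id: str) -> dict:
    """Hypothetical CRM lookup; in production this would call your order API."""
    return {"order_id": order_id, "status": "shipped", "carrier": "DHL"}

def send_email(to: str, subject: str, body: str) -> str:
    """Hypothetical email service; in production this would call your mail API."""
    return f"queued message to {to}"

# The SDK reads the function signatures and docstrings, lets the model decide
# which tools to call, runs them, and feeds the results back automatically.
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # placeholder model ID
    contents="Order 8841 hasn't arrived. Check its status and email an update to jane@example.com.",
    config=types.GenerateContentConfig(tools=[lookup_order, send_email]),
)
print(response.text)
```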
For the AI Landscape
Google retakes the lead from Anthropic and OpenAI on most benchmarks with this release. The competition is tighter than ever — Gemini 3.1 Pro dominates general-purpose tasks, Claude excels at expert office work and computer use, and OpenAI's Codex models lead specialized coding. No single model wins everywhere, which is healthy for the ecosystem and gives developers real choices based on their specific needs. For a broader look at the pre-3.1 landscape, see our Opus 4.5 vs GPT-5.2 vs Gemini 3 Pro comparison.
How to Get Started
Getting started with Gemini 3.1 Pro takes minutes. The fastest path is through Google AI Studio, which offers free access with rate limits. For production workloads, use the Gemini API directly or through Vertex AI.
Quick Start via AI Studio
- Visit Google AI Studio and sign in with your Google account
- Select Gemini 3.1 Pro from the model dropdown
- Choose a thinking level: Low, Medium, or High
- Start prompting — no API key required for the playground
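For production use outside the playground, a minimal API call with the google-genai Python SDK looks roughly like the sketch below; the model ID is a placeholder, so copy the exact ID from AI Studio's model list.

```python
# pip install google-genai
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # placeholder: copy the exact ID from AI Studio
    contents="Summarize the trade-offs between long prompts and context caching.",
)
print(response.text)
```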
Recommended Thinking Levels
- Low: Simple queries, autocomplete, classification, and summarization. Fastest response, lowest cost.
- Medium: Code review, data analysis, document Q&A, and multi-step tasks. Best balance of speed and depth.
- High: Complex reasoning, advanced coding, research, and agentic workflows. Maximum capability at higher cost.
Migration from Gemini 3 Pro
Gemini 3.1 Pro is a drop-in replacement for Gemini 3 Pro — same API format, same pricing, same context window. Update the model ID in your API calls and you get the performance upgrade immediately. The only new consideration is the Medium thinking level, which can help optimize cost for workloads that previously used High thinking but don't require maximum reasoning depth. For cost-conscious use cases, consider Gemini 3 Flash as a faster, cheaper alternative for simpler tasks.
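Because the request format is unchanged, the migration is typically a one-line diff plus an optional thinking-level tweak. Both model IDs below are placeholders, and the "medium" thinking level is shown under the same assumption as in the earlier sketch.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Before: MODEL_ID = "gemini-3-pro-preview"   (placeholder old ID)
MODEL_ID = "gemini-3.1-pro-preview"            # placeholder new ID: the only required change

response = client.models.generate_content(
    model=MODEL_ID,
    contents="Draft a rollout checklist for swapping model versions behind a feature flag.",
    # Optional: workloads that previously ran at High thinking may hit their
    # quality bar at lower cost on "medium" (assumed field, verify in the docs).
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="medium")
    ),
)
print(response.text)
```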
Conclusion
Gemini 3.1 Pro is Google's most capable model to date and arguably the strongest general-purpose AI model available. With 77.1% on ARC-AGI-2, 2887 Elo on LiveCodeBench Pro, and #1 rankings on 12+ benchmarks, it delivers a meaningful capability leap over Gemini 3 Pro at exactly the same $2/$12 price point. The new Medium thinking level adds cost optimization flexibility that competitors lack.
That said, no single model wins everywhere. Claude Sonnet 4.6 and Opus 4.6 still lead on expert office tasks (GDPval-AA), Opus 4.6 narrowly edges ahead on SWE-Bench Verified, and GPT-5.3-Codex dominates specialized coding benchmarks. The right choice depends on your workload, but for teams that need one model to handle reasoning, coding, agentic tasks, and multimodal understanding, Gemini 3.1 Pro is now the default recommendation.
Ready to Build with Gemini 3.1 Pro?
Whether you're deploying agentic AI, building coding assistants, or automating complex workflows, our team can help you leverage frontier models for measurable business results.