Gemini 3.1 Pro Doubles ARC-AGI-2 Score to 77.1%, Redefining AI Reasoning
A 148% leap in abstract reasoning in a single generation – that is what Google DeepMind delivered on February 19, 2026, when Gemini 3.1 Pro posted a verified 77.1% score on the ARC-AGI-2 benchmark. Its predecessor, Gemini 3 Pro, managed just 31.1% on the same test. For context, Claude Opus 4.6 scored 68.8% and GPT-5.2 reached 52.9%, placing Gemini 3.1 Pro firmly at the top of one of the most demanding reasoning evaluations in AI.
ARC-AGI-2 was designed specifically to resist memorization. Every puzzle presents a novel logic pattern, forcing models to identify abstract rules from minimal examples and generalize to unseen problems. Crossing 70% marks a qualitative threshold – the kind of fluid reasoning that prior architectures simply could not sustain. This is not incremental progress. It is a structural shift in what large language models can do when reasoning depth becomes a first-class engineering priority.
Beyond the headline benchmark, Gemini 3.1 Pro ships as a natively multimodal model capable of processing text, images, audio, video, and entire code repositories within a 1 million token context window. It is available in preview across Google AI Studio, Vertex AI, the Gemini API, Gemini CLI, NotebookLM, and the new Google Antigravity agentic development platform. Here is everything that matters about the model, its benchmarks, its capabilities, and how to put it to work.
Benchmark Breakdown: Where 3.1 Pro Leads and Where It Doesn’t
The ARC-AGI-2 result grabs headlines, but Gemini 3.1 Pro posts gains across nearly every major evaluation. The table below captures the most significant comparisons using verified data from the official model card.
| Benchmark | Gemini 3.1 Pro | Gemini 3 Pro | Opus 4.6 | GPT-5.2 |
|---|---|---|---|---|
| ARC-AGI-2 (abstract reasoning) | 77.1% | 31.1% | 68.8% | 52.9% |
| GPQA Diamond (scientific knowledge) | 94.3% | 91.9% | 91.3% | 92.4% |
| SWE-Bench Verified (agentic coding) | 80.6% | 76.2% | 80.8% | 80.0% |
| Terminal-Bench 2.0 (Terminus-2 harness) | 68.5% | 56.9% | 65.4% | 54.0% |
| Humanity’s Last Exam (no tools) | 44.4% | 37.5% | 40.0% | 34.5% |
| BrowseComp (agentic search) | 85.9% | 59.2% | 84.0% | 65.8% |
| MRCR v2 128k (long context) | 84.9% | 77.0% | 84.0% | 83.8% |
| MRCR v2 1M (long context) | 26.3% | 26.3% | N/S | N/S |
The pattern is clear: decisive leads in reasoning-heavy tasks, competitive parity in coding, and a notable weakness at the extreme end of context scaling. At 1 million tokens, long-context retrieval drops to just 26.3% – identical to Gemini 3 Pro – revealing that raw context window size does not automatically translate to retrieval quality at scale. This is an honest limitation worth tracking as the model moves from preview to general availability.
What Makes ARC-AGI-2 Different – And Why 77.1% Matters
Most AI benchmarks can be gamed through training data contamination or pattern matching against known question formats. ARC-AGI-2 was built to prevent exactly that. Each puzzle uses entirely novel visual logic patterns that the model has never encountered during training. Solving them requires genuine abstraction: identifying an underlying rule from a handful of examples and applying it to a new grid configuration.
Prior to Gemini 3.1 Pro, no frontier model had broken 70% on ARC-AGI-2. Models hovered below 40% for most of 2025. The jump to 77.1% does not just represent a better score – it signals that the model can handle the kind of fluid, transferable reasoning that underpins real-world problem solving, from debugging unfamiliar code to synthesizing contradictory research findings into a coherent analysis.
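To make the task format concrete, here is a toy sketch of what an ARC-style puzzle demands: induce a rule from a few input/output grid pairs, then apply it to an unseen grid. This illustrates the benchmark's shape only; it says nothing about how the model actually reasons, and the grids and rule are invented for illustration.

```python
# Toy illustration of the ARC-style task format: infer a cell-level rule
# from a few example grids, then apply it to an unseen grid. This sketches
# the *task*, not the model's internal reasoning.

def induce_cell_rule(examples):
    """Learn a value->value mapping from (input_grid, output_grid) pairs."""
    rule = {}
    for inp, out in examples:
        for row_in, row_out in zip(inp, out):
            for a, b in zip(row_in, row_out):
                if rule.get(a, b) != b:
                    raise ValueError("examples are inconsistent")
                rule[a] = b
    return rule

def apply_rule(rule, grid):
    return [[rule.get(v, v) for v in row] for row in grid]

# Two examples share the hidden rule "swap 1 and 2, keep 0".
examples = [
    ([[0, 1], [2, 0]], [[0, 2], [1, 0]]),
    ([[1, 1], [0, 2]], [[2, 2], [0, 1]]),
]
rule = induce_cell_rule(examples)
print(apply_rule(rule, [[2, 1, 0]]))  # -> [[1, 2, 0]]
```

The hard part for a model is that each real ARC-AGI-2 puzzle uses a rule it has never seen, so there is no lookup table to fall back on.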
Architecture and Multimodal Capabilities
Gemini 3.1 Pro builds directly on the Gemini 3 Pro architecture with targeted improvements to reasoning depth and token efficiency. It is not a new model family – it is a refined iteration that squeezes dramatically more capability from the same foundational design. Key specifications include a 1 million token input context window, 64,000 token output limit, and a knowledge cutoff of January 2025.
The model accepts text, images, video, audio, and PDF inputs natively. Audio support covers formats including AAC, FLAC, MP3, WAV, and OGG, with a maximum of approximately 8.4 hours of audio per prompt – though limited to a single audio file per request. Video input supports up to roughly 45 minutes with audio or about 1 hour without, across formats like MP4, WebM, and QuickTime. Image input allows up to 3,000 images per prompt with a maximum file size of 30 MB from Google Cloud Storage.
Supported Features
- Code execution and function calling
- Structured output (reliable JSON responses for enterprise workflows)
- Grounding with Google Search
- Implicit and explicit context caching
- Thinking modes with configurable depth
- URL context and file search (AI Studio only)
- Batch API support
Notable exclusions in the current preview: no audio generation, no Gemini Live API support, no image generation, and no Content Credentials (C2PA).
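The structured-output feature listed above is typically driven by a response schema in the request. The sketch below shows a REST-style `generateContent` payload using the public Gemini API field names (`responseMimeType`, `responseSchema`); the model ID is the preview name used in this article and the invoice schema is a made-up example.

```python
# Minimal REST-style payload for structured JSON output (the "structured
# output" feature above). Field names follow the public Gemini API
# generateContent format; the model ID and schema are illustrative.

payload = {
    "model": "gemini-3.1-pro-preview",  # assumed preview model ID
    "contents": [{
        "role": "user",
        "parts": [{"text": "Extract the invoice number and total."}],
    }],
    "generationConfig": {
        "responseMimeType": "application/json",
        "responseSchema": {
            "type": "OBJECT",
            "properties": {
                "invoice_number": {"type": "STRING"},
                "total": {"type": "NUMBER"},
            },
            "required": ["invoice_number", "total"],
        },
    },
}
```

With a schema attached, the model is constrained to emit parseable JSON, which is what makes the feature usable in enterprise pipelines.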
The Three-Level Thinking System: LOW, MEDIUM, and HIGH
One of the most consequential additions in Gemini 3.1 Pro is the introduction of a three-tier thinking level system. Gemini 3 Pro offered only two levels – LOW and HIGH. The new MEDIUM tier fills a critical gap, and the implications for cost management and performance tuning are significant.
| Thinking Level | Reasoning Depth | Best For | Approximate Latency |
|---|---|---|---|
| LOW | Minimal thinking tokens | Classification, translation, summarization, autocomplete | Under 5 seconds |
| MEDIUM (new) | Equivalent to Gemini 3 Pro HIGH | Code reviews, document Q&A, daily coding, standard API calls | 10-40 seconds |
| HIGH (default) | Deep Think Mini activated | Complex debugging, math proofs, algorithm design, agent workflows | 1-8 minutes |
The critical insight here: Gemini 3 Pro’s HIGH mode is roughly equivalent to Gemini 3.1 Pro’s MEDIUM. So if you are migrating from 3 Pro, starting at MEDIUM maintains similar quality while cutting costs. The new HIGH mode activates Deep Think Mini – the engine behind the 77.1% ARC-AGI-2 score – and should be reserved for genuinely complex tasks.
A practical cost strategy: route 80% of daily requests through LOW or MEDIUM and reserve HIGH for the 20% that truly demand deep reasoning. This approach can reduce API spend by 50-70%. One critical API constraint – the thinkingLevel and thinkingBudget parameters cannot be used simultaneously, or the request returns a 400 error.
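The 80/20 routing strategy and the mutual-exclusion constraint can both be enforced client-side. In this sketch, the routing table mirrors the "Best For" column above, and the guard reproduces locally the 400 error the API returns; the task categories and function names are illustrative, not an official API.

```python
# Sketch of the routing strategy above, plus a local guard for the
# documented 400 error when thinkingLevel and thinkingBudget are combined.
# Task categories and the routing table are illustrative.

ROUTES = {
    "classification": "LOW", "translation": "LOW", "summarization": "LOW",
    "code_review": "MEDIUM", "doc_qa": "MEDIUM", "daily_coding": "MEDIUM",
    "complex_debugging": "HIGH", "math_proof": "HIGH", "agent_workflow": "HIGH",
}

def pick_thinking_level(task_kind: str) -> str:
    # Default to MEDIUM, the rough equivalent of Gemini 3 Pro's HIGH.
    return ROUTES.get(task_kind, "MEDIUM")

def build_thinking_config(thinking_level=None, thinking_budget=None):
    if thinking_level is not None and thinking_budget is not None:
        # The API rejects this combination with HTTP 400; fail fast locally.
        raise ValueError("thinkingLevel and thinkingBudget are mutually exclusive")
    cfg = {}
    if thinking_level is not None:
        cfg["thinkingLevel"] = thinking_level
    if thinking_budget is not None:
        cfg["thinkingBudget"] = thinking_budget
    return cfg

print(build_thinking_config(thinking_level=pick_thinking_level("translation")))
# -> {'thinkingLevel': 'LOW'}
```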
Real-World Applications and Demos
Google showcased several demonstrations that illustrate what enhanced reasoning enables in practice. These are not hypothetical use cases – they are working prototypes built with the model during its preview period.
**Live aerospace dashboard.** Gemini 3.1 Pro configured a public telemetry stream to visualize the International Space Station’s orbit in real time, bridging complex API configurations with user-friendly interactive design. This kind of multi-step system synthesis – understanding an API, pulling live data, rendering a dynamic visualization – exemplifies the agentic capabilities the model targets.
**3D starling murmuration.** The model coded an immersive 3D simulation of a bird flock, complete with hand-tracking manipulation and a generative audio score that shifts based on the flock’s velocity. This is not just visual code generation; it is building a multi-sensory interactive experience from a text prompt.
**Simulated city environments.** From terrain generation to traffic flow modeling, the model assembled multiple layers of a believable urban simulation, demonstrating its capacity for complex, multi-system reasoning.
**Animated SVGs from text.** The model generates website-ready, code-based animated SVGs that remain crisp at any scale with minimal file sizes – a practical tool for web developers who need lightweight, scalable motion graphics.
Getting Started: AI Studio, API, and CLI
The fastest path to Gemini 3.1 Pro is through Google AI Studio at aistudio.google.com. Sign in with a Google account, select Gemini 3.1 Pro from the model dropdown, and start prompting immediately – no installation required. Usage quotas allow up to 50 high-fidelity generations per month in Experimental Mode (with image input support) and 350 per month in Standard Mode (text-only, with Figma export capability).
API Configuration
For programmatic access, the model is available through the Gemini API and Vertex AI. A basic thinking level configuration via the native Google SDK looks like this:
```python
from google.genai import types  # google-genai SDK

thinking_config = types.ThinkingConfig(thinking_level="MEDIUM")
```
For visual tasks requiring higher fidelity, set mediaResolution to “high” (1,120 tokens) or “ultra_high” (2,240 tokens) for fine details. Developers building agentic workflows with custom tools and bash can use the dedicated gemini-3.1-pro-preview-customtools endpoint, which better prioritizes custom tool calls like view_file or search_code.
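Those per-image token costs matter for budgeting against the 1 million token window. The helper below is illustrative; only the 1,120 and 2,240 per-image token counts come from the text above.

```python
# Rough token budgeting for the mediaResolution setting described above:
# "high" costs about 1,120 tokens per image, "ultra_high" about 2,240.
# Helper name and structure are illustrative, not an SDK API.

MEDIA_RESOLUTION_TOKENS = {"high": 1120, "ultra_high": 2240}

def image_token_budget(num_images: int, resolution: str = "high") -> int:
    """Approximate context tokens consumed by image inputs alone."""
    return num_images * MEDIA_RESOLUTION_TOKENS[resolution]

# 200 ultra_high images consume ~448,000 tokens of the 1M-token window.
print(image_token_budget(200, "ultra_high"))  # -> 448000
```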
Gemini CLI Quick Install
- Install Gemini CLI: `npm install -g @google/gemini-cli`
- Optionally install GSD for structured design workflows: `npx get-shit-done-cc@latest`, then select Gemini as the integration
- Start a new project: run `gemini`, then `/gsd:new-project`
- Iterate through structured phases using commands like `/gsd:discuss-phase 1`, `/gsd:plan-phase 1`, `/gsd:execute-phase 1`, and `/gsd:verify-work 1`
Common Mistakes and How to Avoid Them
- Vague prompts produce generic results. Always specify exact values – pixel dimensions, padding sizes, color hex codes, breakpoints. A prompt asking for a “spacious grid” gets interpreted differently than one specifying “24px gutters with 32px section padding.”
- Overusing HIGH thinking mode. It is the default, but it is also the most expensive. Most tasks – translation, classification, standard code generation – perform well at LOW or MEDIUM. Reserve HIGH for spatial reasoning, complex debugging, and multi-step agent workflows.
- Mixing thinkingLevel and thinkingBudget. These parameters are mutually exclusive in the API. Use `thinkingLevel` for Gemini 3+ models and `thinkingBudget` for Gemini 2.5 models.
- Ignoring generation quotas. Experimental Mode caps at 50 generations per month; Standard Mode at 350. These reset monthly. Plan your usage accordingly, especially during prototyping sprints.
- Expecting perfect long-context retrieval at 1M tokens. Retrieval accuracy drops to 26.3% at the full million-token window. For critical retrieval tasks, stay within the 128k range where the model hits 84.9%.
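That last point suggests batching retrieval workloads so each request stays inside the ~128k-token window where the model scores 84.9%. The sketch below approximates token counts as characters divided by four, a common rule of thumb rather than the model's real tokenizer, and the packing function is illustrative.

```python
# Keep retrieval requests inside the ~128k-token window where MRCR v2
# accuracy is 84.9%. Token counts use the rough len(text) / 4 heuristic
# (an assumption, not the model's actual tokenizer).

SAFE_WINDOW_TOKENS = 128_000

def approx_tokens(text: str) -> int:
    return len(text) // 4

def split_for_retrieval(docs, max_tokens=SAFE_WINDOW_TOKENS):
    """Greedily pack documents into batches that fit the safe window."""
    batches, current, used = [], [], 0
    for doc in docs:
        cost = approx_tokens(doc)
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches

docs = ["x" * 400_000, "y" * 200_000, "z" * 100_000]  # ~100k, 50k, 25k tokens
print([len(b) for b in split_for_retrieval(docs)])  # -> [1, 2]
```

A document larger than the window still lands in its own batch here; truly oversized documents would need splitting at a finer granularity.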
Safety, Limitations, and What Comes Next
Safety evaluations show Gemini 3.1 Pro performing consistently with Gemini 3 Pro, with marginal improvements in text-to-text safety (+0.10%) and multilingual safety (+0.11%). It does not reach critical capability levels (CCLs) for cyber or CBRN risks, though it triggers some exploratory alerts in Deep Think mode for situational awareness challenges like maximum token manipulation and oversight frequency modifications. Google maintains safety buffers and continuous testing protocols, including novel stealth evaluations, to monitor these edge cases.
The model is currently in public preview, meaning some capabilities may shift before general availability. Google has signaled that further advancements in agentic workflows are expected before the full release. For enterprises, the model is accessible through Vertex AI with support for provisioned throughput, standard pay-as-you-go, and batch prediction options.
The broader trajectory is unmistakable: reasoning depth is becoming the primary axis of competition among frontier models. The jump from 31% to 77.1% on a benchmark designed to resist shortcuts suggests that the next generation of AI applications will not just answer questions – they will reason through novel problems, orchestrate multi-step workflows, and synthesize information across modalities in ways that were genuinely out of reach twelve months ago. For developers and enterprises evaluating their AI strategy in 2026, Gemini 3.1 Pro is the clearest signal yet that the reasoning ceiling is still rising fast.
Sources
- Gemini 3.1 Pro – Google DeepMind
- Gemini 3.1 Pro Model Card – Google DeepMind
- Gemini 3.1 Pro ARC-AGI-2 Analysis – MemU Blog
- Gemini 3.1 Pro Thinking Level Control Guide
- Gemini 3.1 Pro Announcement – Google Blog
- Gemini 3.1 Pro Preview – Google AI for Developers
- Gemini 3.1 Pro on Vertex AI – Google Cloud Docs