Feb 20, 2026

Gemini 3.1 Pro Arrives: Google Doubles Down on Reasoning With 77.1% ARC-AGI-2

Google launches Gemini 3.1 Pro on February 19, 2026, achieving 77.1% on ARC-AGI-2 reasoning benchmark, more than double its predecessor, with 1M token context and 64K output tokens.

#Gemini · #Google · #Gemini 3.1 Pro · #ARC-AGI-2 · #Reasoning

Google's First .1 Release Redefines Reasoning Performance

On February 19, 2026, Google released Gemini 3.1 Pro, its first incremental ".1" model update in the Gemini series. Previous generations used ".5" suffixes for mid-cycle updates, making this naming shift notable in itself. But the real headline is the benchmark performance: Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, a test designed to evaluate abstract logic and pattern recognition. That figure is more than double the 31.1% achieved by its predecessor, Gemini 3 Pro, released in November 2025.

The model is available in the Gemini app for Google AI Pro and Ultra subscribers, in NotebookLM for Pro and Ultra users, and through the Gemini API via AI Studio, Vertex AI, Gemini Enterprise, Gemini CLI, and Android Studio. It is currently in preview status ahead of general availability.
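
For developers, a minimal call looks something like the sketch below, which uses the google-genai Python SDK. The model identifier "gemini-3.1-pro-preview" is an assumption for the preview period; check AI Studio for the published name.

```python
# Minimal sketch: calling Gemini 3.1 Pro through the Gemini API with the
# google-genai SDK. The model identifier is a hypothetical preview name.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview identifier
    contents="Summarize the tradeoffs between breadth-first and depth-first search.",
)
print(response.text)
```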

Benchmark Breakdown: Where 3.1 Pro Stands

The ARC-AGI-2 score of 77.1% is the most striking number, but Gemini 3.1 Pro delivers competitive results across a range of benchmarks. On GPQA Diamond, which evaluates scientific knowledge, the model scores 94.3%. On BrowseComp, a benchmark measuring web browsing and information retrieval capability, it reaches 85.9%. The MCP Atlas agentic benchmark result is 69.2%, and SWE-Bench Verified, which measures real-world software engineering ability, comes in at 80.6%.

For competitive coding, Gemini 3.1 Pro achieves an Elo score of 2,887 on LiveCodeBench Pro. On the multimodal benchmark MMMU Pro, it scores 80.5%, which is slightly below the 81.0% of Gemini 3 Pro, suggesting that the model update prioritized reasoning improvements over multimodal capabilities.

These numbers place Gemini 3.1 Pro in direct competition with the current frontrunners. Anthropic's Claude Opus 4.6 scores 68.8% on ARC-AGI-2 and 80.8% on SWE-Bench Verified. OpenAI's GPT-5.2 achieves 52.9% on ARC-AGI-2. Google's model now leads on abstract reasoning by a significant margin while remaining competitive on coding and agentic tasks.

Technical Architecture and Capabilities

Gemini 3.1 Pro maintains the natively multimodal architecture of the Gemini series. It supports a context window of 1 million tokens and an output limit of 64,000 tokens. Google describes the model as incorporating "upgraded core intelligence" that first debuted with Gemini 3 Deep Think, the experimental reasoning model released in December 2025.
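
To make those limits concrete, here is a hedged sketch of a long-context request with a capped output, again via the google-genai SDK. The model name and input file are illustrative assumptions, not confirmed details.

```python
# Sketch: feeding a long document into the 1M-token context window while
# capping generation near the documented 64K-token output limit.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Hypothetical long input; anything up to roughly 1M tokens fits in context.
with open("large_codebase_dump.txt") as f:
    long_document = f.read()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview identifier
    contents=[long_document, "Write a detailed architectural review of this code."],
    config=types.GenerateContentConfig(
        max_output_tokens=64_000,  # the 64K output ceiling described above
        temperature=0.3,
    ),
)
print(response.text)
```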

The practical implications of this architecture show up in the model's demonstrated capabilities. Google showcases the ability to generate scalable, code-based SVG animations from text prompts, synthesize complex systems into interactive dashboards, and create immersive experiences such as live 3D simulations with real-time hand tracking and generative audio. These are not hypothetical capabilities; Google has shared the demonstrations publicly to illustrate the model's practical range.
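
As a rough illustration of the text-to-SVG workflow, the sketch below prompts the model for a self-contained animated SVG and writes the reply to disk. The prompt, file name, and model identifier are all illustrative; output quality will vary by prompt.

```python
# Sketch: asking the model for a code-based SVG animation and saving it.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

prompt = (
    "Generate a single self-contained SVG file that animates a bouncing "
    "ball using SMIL <animate> elements. Return only the SVG markup."
)
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview identifier
    contents=prompt,
)

with open("bouncing_ball.svg", "w") as f:
    f.write(response.text)
```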

The model's strength in agentic tasks is reflected in the MCP Atlas benchmark score. MCP Atlas evaluates a model's ability to use tools, navigate multi-step workflows, and complete complex tasks with minimal human intervention. A score of 69.2% indicates that Gemini 3.1 Pro can handle a significant portion of agentic workflows autonomously, though it still falls short of perfect reliability in complex scenarios.
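
To show the flavor of workflow such benchmarks exercise, here is a minimal function-calling sketch with the google-genai SDK. The get_weather tool is a hypothetical stand-in for illustration, not anything MCP Atlas actually tests.

```python
# Sketch: declare a tool, let the model decide to call it, inspect the call.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

get_weather = types.FunctionDeclaration(
    name="get_weather",  # hypothetical tool for illustration
    description="Look up current weather for a city.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={"city": types.Schema(type=types.Type.STRING)},
        required=["city"],
    ),
)

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview identifier
    contents="What's the weather in Zurich right now?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=[get_weather])],
    ),
)

# If the model chose to use the tool, the request appears as a function_call part.
part = response.candidates[0].content.parts[0]
if part.function_call:
    print(part.function_call.name, dict(part.function_call.args))
```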

Pricing and Access Structure

Google has published API pricing for Gemini 3.1 Pro. For inputs up to 200,000 tokens, the rate is $2.00 per million tokens. Beyond 200,000 tokens, input pricing rises to $4.00 per million tokens. Output pricing is $12.00 per million tokens for standard usage and $18.00 per million tokens for outputs generated from inputs exceeding the 200,000-token threshold. Caching is available at $0.20 per million tokens for standard inputs and $0.40 per million tokens above the threshold.

Search grounding is included with 5,000 free monthly prompts, with additional queries priced at $14.00 per 1,000. This pricing positions Gemini 3.1 Pro as competitive with Claude Opus 4.6 and GPT-5.2 for most enterprise workloads, with the tiered pricing structure rewarding users who work within the standard context window.
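
For budgeting, a small estimator like the following captures the tier logic described above. The rates come from this article; the assumption that an entire request is billed at the higher tier once input crosses 200,000 tokens is ours, not confirmed by Google.

```python
# Back-of-the-envelope cost estimator for the tiered pricing quoted above.
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      grounded_prompts: int = 0) -> float:
    long_context = input_tokens > 200_000  # assumed whole-request tier switch
    input_rate = 4.00 if long_context else 2.00     # $ per 1M input tokens
    output_rate = 18.00 if long_context else 12.00  # $ per 1M output tokens
    cost = (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate
    # Search grounding: first 5,000 prompts per month free, then $14 per 1,000.
    billable = max(0, grounded_prompts - 5_000)
    cost += (billable / 1_000) * 14.00
    return round(cost, 4)

# Example: a 150K-token input with a 4K-token response costs about $0.35.
print(estimate_cost_usd(150_000, 4_000))  # 0.348
```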

For consumer access, the model rolls out in the Gemini app with higher rate limits for AI Pro and AI Ultra subscribers. NotebookLM access is exclusive to Pro and Ultra tiers. Developers can access the model through Google AI Studio, Vertex AI, and the Gemini CLI.

The Deep Think Connection

Gemini 3.1 Pro's reasoning improvements trace directly back to Gemini 3 Deep Think, an experimental model Google released on February 12, 2026, specifically targeted at modern science, research, and engineering tasks. The Deep Think model demonstrated that Google's architecture could achieve dramatically higher reasoning performance when allowed to engage in more deliberate, multi-step processing.

The 3.1 Pro release integrates these reasoning capabilities into a general-purpose model. Unlike Deep Think, which trades speed for reasoning depth, 3.1 Pro balances the enhanced reasoning with practical latency requirements for everyday tasks. This makes it suitable for both the quick-response needs of a chat application and the deeper analytical requirements of enterprise workflows.

What This Means for the Competitive Landscape

The ARC-AGI-2 benchmark has become a key differentiator in the frontier model race. It tests something fundamentally different from coding benchmarks or knowledge tests: the ability to identify abstract patterns and apply them to novel situations. A score of 77.1% suggests that Gemini 3.1 Pro has made genuine progress in generalized reasoning, not just pattern matching on training data.

Google's lead on this benchmark creates pressure on both Anthropic and OpenAI. Claude Opus 4.6, released on February 18, 2026, just one day before Gemini 3.1 Pro, scored 68.8% on the same test. OpenAI's GPT-5.2, while strong on practical tasks, trails at 52.9%. The gap between Gemini 3.1 Pro and its closest competitor is over 8 percentage points, a meaningful margin in a field where improvements are typically measured in single digits.

However, benchmark leadership does not automatically translate to user adoption. The practical experience of using a model, including response quality, latency, tool integration, and alignment, matters as much as benchmark scores. Google's advantage is its integration with the existing Google ecosystem: Workspace, BigQuery, Android Studio, and Vertex AI create a deployment pathway that neither OpenAI nor Anthropic can match for enterprises already invested in Google Cloud.

Conclusion

Gemini 3.1 Pro is a significant step forward for Google's AI ambitions. The 77.1% ARC-AGI-2 score is not just a benchmark win but an indicator that the model has substantially improved its ability to reason about novel problems. Combined with competitive performance on coding, science, and agentic benchmarks, a 1-million-token context window, and deep integration with Google's developer and enterprise platforms, 3.1 Pro makes a strong case as the most capable general-purpose model available as of February 2026. The model is best suited for developers, researchers, and enterprise teams who need advanced reasoning capabilities and are already working within or considering the Google Cloud ecosystem.

Pros

  • Industry-leading abstract reasoning with 77.1% ARC-AGI-2 score, significantly ahead of all competitors
  • 1-million-token context window supports analysis of long documents, codebases, and complex datasets
  • Competitive API pricing with tiered structure that rewards standard-context workloads
  • Seamless integration with Google Cloud, Workspace, NotebookLM, and developer tools
  • Strong across-the-board performance on science, coding, browsing, and agentic benchmarks

Cons

  • Currently in preview status with general availability timeline not yet confirmed
  • MMMU Pro multimodal score of 80.5% is slightly below Gemini 3 Pro's 81.0%, indicating a possible reasoning-multimodal tradeoff
  • MCP Atlas agentic score of 69.2% leaves room for improvement in complex autonomous workflows
  • Premium features limited to Google AI Pro and Ultra subscribers, adding cost for full access


Key Features

Gemini 3.1 Pro scores 77.1% on ARC-AGI-2 abstract reasoning benchmark, more than double its predecessor's 31.1%. It features a 1-million-token context window and 64,000-token output limit. The model achieves 94.3% on GPQA Diamond (science), 80.6% on SWE-Bench Verified (coding), and 85.9% on BrowseComp (web browsing). API pricing starts at $2.00/1M input tokens and $12.00/1M output tokens. Available through Gemini API, AI Studio, Vertex AI, Gemini app, NotebookLM, and Android Studio.

Key Insights

  • Gemini 3.1 Pro achieves 77.1% on ARC-AGI-2, more than double its predecessor's 31.1% and exceeding Claude Opus 4.6 (68.8%) and GPT-5.2 (52.9%)
  • This is Google's first .1 incremental release, breaking from the previous .5 mid-cycle update pattern
  • The model inherits enhanced reasoning from Gemini 3 Deep Think while maintaining practical latency for general-purpose use
  • GPQA Diamond score of 94.3% demonstrates near-ceiling performance on scientific knowledge tasks
  • SWE-Bench Verified score of 80.6% nearly matches Claude Opus 4.6's 80.8%, making it competitive on real-world coding
  • Tiered API pricing rewards standard context usage at $2.00/1M input tokens, rising to $4.00/1M above 200K tokens
  • Deep Google ecosystem integration across Workspace, Vertex AI, and Android Studio provides a deployment advantage competitors cannot replicate
