Gemini 3 Deep Think Gets a Major Upgrade: 84.6% on ARC-AGI-2 and 18 Unsolved Problems Cracked
Google upgrades Gemini 3 Deep Think with record-breaking reasoning scores, gold-medal science performance, and real-world research applications.
Google Pushes the Reasoning Frontier with Deep Think
On February 12, 2026, Google announced a major upgrade to Gemini 3 Deep Think, its specialized reasoning mode designed to tackle the hardest problems in science, research, and engineering. The update delivers record-breaking benchmark scores and, more importantly, demonstrates practical problem-solving capabilities that extend well beyond standardized tests.
Deep Think is not a separate model but rather a reasoning mode within the Gemini 3 family. When activated, it allocates significantly more computation time to work through complex, multi-step problems. This upgrade refines that process with improvements developed in close partnership with scientists and researchers who face real-world challenges where data is messy, solutions are ambiguous, and problems often lack clear guardrails.
Benchmark Performance That Demands Attention
The numbers from this upgrade are striking across multiple domains. On ARC-AGI-2, the abstract reasoning benchmark designed to test general intelligence capabilities, Deep Think achieved an unprecedented 84.6 percent, verified independently by the ARC Prize Foundation. For context, the standard Gemini 3 Pro scores 31.1 percent on the same test, and Claude Opus 4.6 reaches 68.8 percent. On the original ARC-AGI-1 benchmark, Deep Think essentially saturated the test at 96 percent.
On Humanity's Last Exam, a benchmark specifically designed to push the limits of frontier models, Deep Think scored 48.4 percent without tools. In competitive programming, it achieved a Codeforces Elo of 3,455, placing it among the world's top competitive programmers. On GPQA Diamond, a graduate-level science benchmark, it scored 93.8 percent, narrowly edging out GPT-5.2 Pro at 93.2 percent.
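To put a Codeforces Elo of 3,455 in perspective, the standard Elo expected-score formula predicts how often a player at one rating beats a player at another. The formula below is the conventional Elo definition, not anything specific to Codeforces' internal rating system; the opponent ratings are illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability, ignoring draws) of player A vs player B
    under the standard Elo model: E_A = 1 / (1 + 10^((R_B - R_A) / 400))."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A 3455-rated player vs. a 3055-rated opponent (400-point gap):
# the Elo model gives roughly a 91% expected score.
print(f"{elo_expected_score(3455, 3055):.3f}")
```

A 400-point gap corresponds to about a 10:1 expected-score ratio, which is why ratings above 3,000 are so rare among human competitors.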
Deep Think also earned gold-medal-level results on the written sections of both the 2025 International Physics Olympiad and the 2025 International Chemistry Olympiad.
Real-World Problem Solving Beyond Benchmarks
What distinguishes this upgrade from a typical benchmark improvement is the evidence of practical scientific impact. Google reports that Deep Think solved 18 previously unsolved research problems and disproved a mathematical conjecture that had stood since 2015.
In one notable case, a mathematician at Rutgers University used Deep Think to review a highly technical mathematics paper. The model successfully identified a subtle logical flaw that had previously passed through human peer review unnoticed. This represents a qualitatively different kind of capability from answering exam questions: the ability to catch errors that trained experts miss.
Another practical demonstration involves converting hand-drawn sketches into 3D-printable objects. Deep Think can analyze a drawing, model the complex shape mathematically, and generate a file ready for 3D printing. While this may sound like a novelty, it illustrates the model's ability to bridge abstract understanding and concrete engineering output.
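The final step of that pipeline, emitting a file a slicer can consume, typically means producing an STL mesh. As a purely illustrative sketch (not Google's actual pipeline), the snippet below writes a tetrahedron as ASCII STL, the simplest widely supported 3D-printing format:

```python
def tetrahedron_stl(scale: float = 10.0) -> str:
    """Emit an ASCII STL string for a tetrahedron, scaled in millimetres.
    Normals are left as zero vectors; most slicers recompute them from winding."""
    verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
             (0.5, 0.866, 0.0), (0.5, 0.289, 0.816)]
    faces = [(0, 2, 1), (0, 1, 3), (1, 2, 3), (2, 0, 3)]  # outward-facing winding
    lines = ["solid sketch"]
    for a, b, c in faces:
        lines.append("  facet normal 0 0 0")
        lines.append("    outer loop")
        for i in (a, b, c):
            x, y, z = (scale * v for v in verts[i])
            lines.append(f"      vertex {x:.3f} {y:.3f} {z:.3f}")
        lines.append("    endloop")
        lines.append("  endfacet")
    lines.append("endsolid sketch")
    return "\n".join(lines)
```

Going from a freehand drawing to a watertight mesh like this, with the geometry inferred rather than hand-coded, is the part that requires the model's reasoning.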
How Deep Think Compares to the Competition
The AI reasoning landscape as of February 2026 is increasingly specialized. Each frontier model excels in different domains:
| Benchmark | Gemini 3 Deep Think | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| ARC-AGI-2 | 84.6% | 68.8% | N/A |
| GPQA Diamond | 93.8% | ~90% | 93.2% |
| Codeforces Elo | 3,455 | N/A | N/A |
| SWE-bench Verified | N/A | 80.8% | 80.0% |
| Humanity's Last Exam | 48.4% | N/A | N/A |
Deep Think dominates in abstract reasoning and scientific problem-solving. Claude Opus 4.6 remains the leader for real-world coding tasks and agentic workflows, scoring 80.8 percent on SWE-bench Verified. GPT-5.2 holds strong in mathematical reasoning. The takeaway is that the frontier is no longer defined by a single model but by a collection of specialized capabilities.
Availability and Access
The upgraded Deep Think is available through two channels. Individual users can access it through the Gemini app with a Google AI Ultra subscription. For the first time, Deep Think is also available via the Gemini API, targeting researchers, engineers, and enterprise users. Google is providing early access through an application process for API users.
Google has not disclosed specific pricing for Deep Think API calls, though Gemini API pricing generally remains competitive. The AI Ultra subscription that includes app access is priced at $24.99 per month.
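Since Google has published neither the model identifier nor the request schema for Deep Think API access, any integration code is speculative. The sketch below only constructs a plausible request payload; the model id, the `generation_config` structure, and the `thinking_level` knob are all assumptions, not confirmed API surface:

```python
def build_deep_think_request(prompt: str) -> dict:
    """Assemble a hypothetical Gemini API request payload for Deep Think.
    Field names and the model id are illustrative guesses, not documented values."""
    return {
        "model": "gemini-3-deep-think",          # hypothetical model id
        "contents": [
            {"role": "user", "parts": [{"text": prompt}]},
        ],
        "generation_config": {
            "thinking_level": "deep",            # hypothetical reasoning-depth knob
        },
    }

payload = build_deep_think_request("Check this proof for logical gaps.")
```

Whatever the final schema looks like, expect latency measured in minutes rather than seconds, given the extended computation Deep Think allocates per query.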
What This Means for AI Reasoning
The Deep Think upgrade represents a meaningful step in a trend that has been building throughout 2025 and into 2026: the emergence of reasoning modes that trade speed for depth. Rather than trying to answer every query as quickly as possible, these systems allocate extended computation time to work through genuinely difficult problems.
The practical implications are significant for researchers and engineers. A model that can review papers for logical flaws, solve previously intractable research problems, and bridge the gap between abstract reasoning and physical engineering output is not just a benchmark achievement. It is a tool that could meaningfully accelerate scientific work.
However, it is worth noting that Deep Think's strengths are concentrated in structured reasoning tasks. For everyday conversational AI use, code generation, or creative writing, the standard Gemini 3 Pro or competing models may be more appropriate and faster.
Conclusion
Google's Gemini 3 Deep Think upgrade sets new standards in AI reasoning with its 84.6 percent ARC-AGI-2 score and practical scientific applications. It is best suited for researchers, engineers, and scientists who need an AI that can engage with genuinely difficult problems at a deep level. For users whose work involves complex analysis, mathematical reasoning, or scientific research, Deep Think is now the strongest option available. For general-purpose AI tasks, the standard frontier models remain the better choice.
Pros
- Record-breaking 84.6% on ARC-AGI-2, significantly ahead of all competitors in abstract reasoning
- Demonstrated practical scientific impact by solving real unsolved problems
- Now available via API for enterprise and research integration for the first time
- Gold-medal-level performance across multiple scientific olympiad domains
- Competitive pricing through the existing AI Ultra subscription at $24.99/month
Cons
- Slower response times compared to standard models due to extended reasoning computation
- Strengths are concentrated in structured reasoning; less suited for everyday conversational tasks
- API access currently limited to early-access application process
- Specific API pricing for Deep Think not yet disclosed
Key Features
Google released a major upgrade to Gemini 3 Deep Think on February 12, 2026, achieving 84.6% on ARC-AGI-2 (verified by ARC Prize Foundation), 48.4% on Humanity's Last Exam without tools, and a Codeforces Elo of 3,455. The model solved 18 previously unsolved research problems, disproved a decade-old mathematical conjecture, earned gold-medal-level scores on the International Physics and Chemistry Olympiads, and demonstrated practical applications including identifying logical flaws in peer-reviewed papers and converting sketches to 3D-printable objects.
Key Insights
- ARC-AGI-2 score of 84.6% represents a 15.8-point lead over Claude Opus 4.6 and a 53.5-point lead over standard Gemini 3 Pro
- Deep Think solved 18 previously unsolved research problems, demonstrating capability beyond standardized benchmarks
- A Rutgers mathematician used Deep Think to identify a logical flaw missed by human peer review
- Gold-medal-level performance on both 2025 International Physics and Chemistry Olympiad written sections
- Codeforces Elo of 3,455 places it among the world's top competitive programmers
- First-time availability via the Gemini API for enterprise and research users
- The upgrade reflects a broader industry trend toward reasoning modes that trade speed for depth
- Each frontier model now excels in different domains rather than one model dominating all tasks
