Gemini 3 Deep Think Gets a Major Upgrade: 84.6% on ARC-AGI-2 and 18 Unsolved Problems Cracked
Google upgrades Gemini 3 Deep Think with record-breaking reasoning scores, gold-medal science performance, and real-world research applications.
Google Pushes the Reasoning Frontier with Deep Think
On February 12, 2026, Google announced a major upgrade to Gemini 3 Deep Think, its specialized reasoning mode designed to tackle the hardest problems in science, research, and engineering. The update delivers record-breaking benchmark scores and, more importantly, demonstrates practical problem-solving capabilities that extend well beyond standardized tests.
Deep Think is not a separate model but rather a reasoning mode within the Gemini 3 family. When activated, it allocates significantly more computation time to work through complex, multi-step problems. This upgrade refines that process with improvements developed in close partnership with scientists and researchers who face real-world challenges where data is messy, solutions are ambiguous, and problems often lack clear guardrails.
Benchmark Performance That Demands Attention
The numbers from this upgrade are striking across multiple domains. On ARC-AGI-2, the abstract reasoning benchmark designed to test general intelligence capabilities, Deep Think achieved an unprecedented 84.6 percent, verified independently by the ARC Prize Foundation. For context, the standard Gemini 3 Pro scores 31.1 percent on the same test, and Claude Opus 4.6 reaches 68.8 percent. On the original ARC-AGI-1 benchmark, Deep Think essentially saturated the test at 96 percent.
On Humanity's Last Exam, a benchmark specifically designed to push the limits of frontier models, Deep Think scored 48.4 percent without tools. In competitive programming, it achieved a Codeforces Elo of 3,455, placing it among the world's top competitive programmers. On GPQA Diamond, a graduate-level science benchmark, it scored 93.8 percent, narrowly edging out GPT-5.2 Pro at 93.2 percent.
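To put a Codeforces Elo of 3,455 in perspective, the standard Elo expected-score formula predicts how often a player at one rating beats a player at another. The formula below is the conventional Elo definition, not anything specific to Codeforces' internal rating system; the opponent ratings are illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability, ignoring draws) of player A vs player B
    under the standard Elo model: E_A = 1 / (1 + 10^((R_B - R_A) / 400))."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A 3455-rated player vs. a 3055-rated opponent (400-point gap):
# the Elo model gives roughly a 91% expected score.
print(f"{elo_expected_score(3455, 3055):.3f}")
```

A 400-point gap corresponds to about a 10:1 expected-score ratio, which is why ratings above 3,000 are so rare among human competitors.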
Deep Think also earned gold-medal-level results on the written sections of both the 2025 International Physics Olympiad and the 2025 International Chemistry Olympiad.
Real-World Problem Solving Beyond Benchmarks
What distinguishes this upgrade from a typical benchmark improvement is the evidence of practical scientific impact. Google reports that Deep Think solved 18 previously unsolved research problems and disproved a mathematical conjecture that had stood since 2015.
In one notable case, a mathematician at Rutgers University used Deep Think to review a highly technical mathematics paper. The model successfully identified a subtle logical flaw that had previously passed through human peer review unnoticed. This represents a qualitatively different kind of capability from answering exam questions: the ability to catch errors that trained experts miss.
Another practical demonstration involves converting hand-drawn sketches into 3D-printable objects. Deep Think can analyze a drawing, model the complex shape mathematically, and generate a file ready for 3D printing. While this may sound like a novelty, it illustrates the model's ability to bridge abstract understanding and concrete engineering output.
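The final step of that pipeline, emitting a file a slicer can consume, typically means producing an STL mesh. As a purely illustrative sketch (not Google's actual pipeline), the snippet below writes a tetrahedron as ASCII STL, the simplest widely supported 3D-printing format:

```python
def tetrahedron_stl(scale: float = 10.0) -> str:
    """Emit an ASCII STL string for a tetrahedron, scaled in millimetres.
    Normals are left as zero vectors; most slicers recompute them from winding."""
    verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
             (0.5, 0.866, 0.0), (0.5, 0.289, 0.816)]
    faces = [(0, 2, 1), (0, 1, 3), (1, 2, 3), (2, 0, 3)]  # outward-facing winding
    lines = ["solid sketch"]
    for a, b, c in faces:
        lines.append("  facet normal 0 0 0")
        lines.append("    outer loop")
        for i in (a, b, c):
            x, y, z = (scale * v for v in verts[i])
            lines.append(f"      vertex {x:.3f} {y:.3f} {z:.3f}")
        lines.append("    endloop")
        lines.append("  endfacet")
    lines.append("endsolid sketch")
    return "\n".join(lines)
```

Going from a freehand drawing to a watertight mesh like this, with the geometry inferred rather than hand-coded, is the part that requires the model's reasoning.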
How Deep Think Compares to the Competition
The AI reasoning landscape as of February 2026 is increasingly specialized. Each frontier model excels in different domains:
| Benchmark | Gemini 3 Deep Think | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| ARC-AGI-2 | 84.6% | 68.8% | N/A |
| GPQA Diamond | 93.8% | ~90% | 93.2% |
| Codeforces Elo | 3,455 | N/A | N/A |
| SWE-bench Verified | N/A | 80.8% | 80.0% |
| Humanity's Last Exam | 48.4% | N/A | N/A |
Deep Think dominates in abstract reasoning and scientific problem-solving. Claude Opus 4.6 remains the leader for real-world coding tasks and agentic workflows, scoring 80.8 percent on SWE-bench Verified. GPT-5.2 holds strong in mathematical reasoning. The takeaway is that the frontier is no longer defined by a single model but by a collection of specialized capabilities.
Availability and Access
The upgraded Deep Think is available through two channels. Individual users can access it through the Gemini app with a Google AI Ultra subscription. For the first time, Deep Think is also available via the Gemini API, targeting researchers, engineers, and enterprise users. Google is providing early access through an application process for API users.
Google has not disclosed specific pricing for Deep Think API calls, though Gemini API pricing generally remains competitive. The AI Ultra subscription that includes app access is priced at $24.99 per month.
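Since Google has published neither the model identifier nor the request schema for Deep Think API access, any integration code is speculative. The sketch below only constructs a plausible request payload; the model id, the `generation_config` structure, and the `thinking_level` knob are all assumptions, not confirmed API surface:

```python
def build_deep_think_request(prompt: str) -> dict:
    """Assemble a hypothetical Gemini API request payload for Deep Think.
    Field names and the model id are illustrative guesses, not documented values."""
    return {
        "model": "gemini-3-deep-think",          # hypothetical model id
        "contents": [
            {"role": "user", "parts": [{"text": prompt}]},
        ],
        "generation_config": {
            "thinking_level": "deep",            # hypothetical reasoning-depth knob
        },
    }

payload = build_deep_think_request("Check this proof for logical gaps.")
```

Whatever the final schema looks like, expect latency measured in minutes rather than seconds, given the extended computation Deep Think allocates per query.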
What This Means for AI Reasoning
The Deep Think upgrade represents a meaningful step in a trend that has been building throughout 2025 and into 2026: the emergence of reasoning modes that trade speed for depth. Rather than trying to answer every query as quickly as possible, these systems allocate extended computation time to work through genuinely difficult problems.
The practical implications are significant for researchers and engineers. A model that can review papers for logical flaws, solve previously intractable research problems, and bridge the gap between abstract reasoning and physical engineering output is not just a benchmark achievement. It is a tool that could meaningfully accelerate scientific work.
However, it is worth noting that Deep Think's strengths are concentrated in structured reasoning tasks. For everyday conversational AI use, code generation, or creative writing, the standard Gemini 3 Pro or competing models may be more appropriate and faster.
Conclusion
Google's Gemini 3 Deep Think upgrade sets new standards in AI reasoning with its 84.6 percent ARC-AGI-2 score and practical scientific applications. It is best suited for researchers, engineers, and scientists who need an AI that can engage with genuinely difficult problems at a deep level. For users whose work involves complex analysis, mathematical reasoning, or scientific research, Deep Think is now the strongest option available. For general-purpose AI tasks, the standard frontier models remain the better choice.
Pros
- Record-breaking 84.6% on ARC-AGI-2, significantly ahead of all competitors in abstract reasoning
- Demonstrated practical scientific impact by solving real unsolved problems
- Now available via API for enterprise and research integration for the first time
- Gold-medal-level performance across multiple scientific olympiad domains
- Competitive pricing through the existing AI Ultra subscription at $24.99/month
Cons
- Slower response times compared to standard models due to extended reasoning computation
- Strengths are concentrated in structured reasoning; less suited for everyday conversational tasks
- API access currently limited to early-access application process
- Specific API pricing for Deep Think not yet disclosed
Key Features
Google released a major upgrade to Gemini 3 Deep Think on February 12, 2026, achieving 84.6% on ARC-AGI-2 (verified by ARC Prize Foundation), 48.4% on Humanity's Last Exam without tools, and a Codeforces Elo of 3,455. The model solved 18 previously unsolved research problems, disproved a decade-old mathematical conjecture, earned gold-medal-level scores on the International Physics and Chemistry Olympiads, and demonstrated practical applications including identifying logical flaws in peer-reviewed papers and converting sketches to 3D-printable objects.
Key Insights
- ARC-AGI-2 score of 84.6% represents a 15.8-point lead over Claude Opus 4.6 and a 53.5-point lead over standard Gemini 3 Pro
- Deep Think solved 18 previously unsolved research problems, demonstrating capability beyond standardized benchmarks
- A Rutgers mathematician used Deep Think to identify a logical flaw missed by human peer review
- Gold-medal-level performance on both 2025 International Physics and Chemistry Olympiad written sections
- Codeforces Elo of 3,455 places it among the world's top competitive programmers
- First-time availability via the Gemini API for enterprise and research users
- The upgrade reflects a broader industry trend toward reasoning modes that trade speed for depth
- Each frontier model now excels in different domains rather than one model dominating all tasks
