Feb 18, 2026

Claude Sonnet 4.6 Arrives: 1M Token Context and a Fivefold Leap in Computer Use

Anthropic releases Claude Sonnet 4.6 with a 1 million token context window, 72.5% on OSWorld for computer use, and competitive benchmark scores at $3/$15 per million tokens.


Anthropic Ships Its Most Capable Mid-Tier Model Yet

On February 17, 2026, Anthropic released Claude Sonnet 4.6, the latest iteration of its mid-tier model line and now the default model for both Free and Pro plan users on claude.ai and Claude Cowork. The release brings a 1 million token context window in beta, substantial improvements in coding and computer use, and benchmark scores that challenge models costing several times more.

Sonnet 4.6 is not a flagship launch in the way that Opus releases tend to be. It is something arguably more consequential for day-to-day users: a workhorse model that has quietly closed much of the gap with frontier-class competitors while remaining at the same $3 input / $15 output per million token price point.

The 1 Million Token Context Window

The headline feature is a 1 million token context window, available in beta. This is a significant expansion from the previous Sonnet models and puts Sonnet 4.6 in the same territory as Google's Gemini 3 Pro in terms of raw context capacity.

A 1 million token window means users can process entire codebases, dozens of research papers, or lengthy legal documents in a single request without chunking or retrieval-augmented generation workarounds. For developers working on large monorepos or researchers conducting literature reviews, this removes a practical bottleneck that previously required either Opus-tier models or external tooling.
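As a rough illustration of that bottleneck disappearing, the sketch below checks whether a corpus fits in a single 1M-token request. It assumes ~1.3 tokens per English word, a common heuristic only; real counts depend on the tokenizer and content, and the helper names are hypothetical, not part of any Anthropic API.

```python
# Rough estimate of whether a set of documents fits in one request.
# Assumes ~1.3 tokens per English word (heuristic; actual token counts
# depend on the tokenizer and the content).
TOKENS_PER_WORD = 1.3
CONTEXT_WINDOW = 1_000_000  # Sonnet 4.6's beta context window


def estimated_tokens(text: str) -> int:
    return int(len(text.split()) * TOKENS_PER_WORD)


def fits_in_context(documents: list[str], reserve_for_output: int = 16_000) -> bool:
    """True if all documents fit in a single request, leaving room for the reply."""
    total = sum(estimated_tokens(d) for d in documents)
    return total <= CONTEXT_WINDOW - reserve_for_output


# ~30 papers of ~10,000 words each: roughly 390k estimated tokens.
papers = ["lorem ipsum " * 5_000] * 30
print(fits_in_context(papers))  # → True
```

By the same estimate, a corpus of a hundred such papers (~1.3M tokens) would still need chunking or retrieval, which is where the window's limits reassert themselves.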

The context window expansion is not just about capacity. Anthropic reports that Sonnet 4.6 demonstrates improved long-context reasoning, meaning it can better track relationships and dependencies across distant parts of a document rather than simply accepting more tokens.

Coding: Smarter, Not Just Faster

Coding improvements are where Sonnet 4.6 makes its strongest case. The model scores 79.6% on SWE-bench Verified, a benchmark that measures the ability to resolve real-world software engineering issues from GitHub repositories. This places it firmly in competitive territory with frontier models.

According to Anthropic, the key improvement is not raw code generation speed but contextual understanding. Sonnet 4.6 reads context before editing, reasons over long chunks of code, tightens logic instead of duplicating it, and delivers smarter answers with fewer iterations. In practice, this means fewer back-and-forth cycles when working on complex codebases.

The model also scores 89.3% on MMLU and 89.9% on GPQA Diamond, indicating strong general knowledge and graduate-level reasoning capabilities that support its coding performance.

Computer Use: 72.5% on OSWorld

Perhaps the most striking improvement is in computer use. Sonnet 4.6 reaches 72.5% on OSWorld, a benchmark that measures the ability to carry out multi-step tasks on a computer, such as filling out web forms, navigating file systems, and coordinating information across browser tabs.

Anthropic notes this represents nearly a fivefold improvement in computer use capability over 16 months. The practical implication is that Claude can now handle complex, multi-step desktop workflows with significantly higher reliability. Tasks like booking travel by comparing options across multiple tabs, filling out complex forms by extracting information from documents, or managing files across applications are now within reach.

This positions Sonnet 4.6 as a practical option for the agentic use cases that the entire industry is racing toward, and it does so at the Sonnet price tier rather than requiring Opus-level costs.

Benchmark Performance in Context

Sonnet 4.6 posts competitive scores across standard evaluations:

Benchmark             Score
GPQA Diamond          89.9%
ARC-AGI-2             58.3%
MMLU                  89.3%
SWE-bench Verified    79.6%
OSWorld               72.5%

These numbers do not top every leaderboard. GPT-5.2 and Claude Opus 4.5 still lead on several evaluations. But the significance lies in the ratio of performance to cost. At $3/$15 per million tokens, Sonnet 4.6 delivers approximately 80-90% of flagship model performance at roughly one-fifth the price.
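The performance-to-cost arithmetic can be sketched directly from the article's quoted rates. The flagship tier below ($15/$75 per million tokens) is an illustrative assumption chosen to match the "one-fifth the price" framing, not a price quoted anywhere in the article.

```python
# Cost of a single request at Sonnet 4.6's quoted rates ($3 input /
# $15 output per million tokens) versus an assumed $15/$75 flagship
# tier (illustrative assumption, not a quoted price).
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost in dollars; rates are given per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# A long-context request: 800k input tokens, 4k output tokens.
sonnet = request_cost(800_000, 4_000, in_rate=3, out_rate=15)
flagship = request_cost(800_000, 4_000, in_rate=15, out_rate=75)
print(f"Sonnet 4.6: ${sonnet:.2f}  flagship: ${flagship:.2f}  ratio: {sonnet / flagship:.2f}")
# → Sonnet 4.6: $2.46  flagship: $12.30  ratio: 0.20
```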

Anthropic's own framing emphasizes that this release prioritizes practical throughput over raw benchmark points: fewer hallucinations, less sycophancy, better instruction adherence, and faster iteration on engineering and office tasks.

Reduced Hallucinations and Sycophancy

Beyond the headline numbers, Anthropic highlights qualitative improvements that matter for production use. Sonnet 4.6 exhibits fewer hallucinations, meaning it is less likely to generate plausible-sounding but incorrect information. It also shows reduced sycophancy, the tendency to agree with the user rather than providing accurate corrections.

These improvements may not show up dramatically in benchmark scores, but they are critical for enterprise deployments where reliability and accuracy are non-negotiable. A model that pushes back when a user's premise is wrong is more valuable in professional settings than one that scores marginally higher on a standardized test.

Availability and Pricing

Sonnet 4.6 is immediately available as the default model for Free and Pro plan users on claude.ai and Claude Cowork. API access is available at the existing Sonnet pricing of $3 per million input tokens and $15 per million output tokens.

The 1 million token context window is available in beta, with Anthropic expected to expand access as stability is confirmed. The model is also available through Amazon Bedrock and Google Cloud's Vertex AI for enterprise customers.

What This Means for the Market

Sonnet 4.6 reinforces a trend that has been building throughout early 2026: the mid-tier model class is becoming good enough for the majority of production use cases. When a $3/$15 model can handle million-token contexts, achieve 72.5% on computer use benchmarks, and score nearly 80% on SWE-bench, the justification for flagship-tier pricing narrows considerably.

For developers and enterprises, this means the cost of building AI-powered applications continues to drop while capability continues to rise. For Anthropic's competitors, it means the performance floor that users expect from a "standard" model has just been raised again.

Conclusion

Claude Sonnet 4.6 is not a revolutionary leap. It is something more practically important: a substantial, across-the-board improvement to the model that most Claude users interact with daily. The 1 million token context window, fivefold computer use improvement, and strong coding performance make it the most capable mid-tier model available today. At unchanged pricing, it represents one of the best value propositions in the current AI landscape for developers, professionals, and enterprises building on Claude's platform.

Pros

  • 1 million token context window enables processing entire codebases or dozens of research papers in a single request
  • 72.5% OSWorld score makes it one of the most capable models for autonomous computer use at any price tier
  • Strong coding performance at 79.6% SWE-bench with improved contextual understanding and fewer iteration cycles
  • Unchanged $3/$15 pricing delivers exceptional performance-to-cost ratio compared to flagship models
  • Reduced hallucinations and sycophancy improve reliability for production and enterprise deployments

Cons

  • Still trails GPT-5.2 and Claude Opus 4.5 on several high-profile benchmarks including math and competitive coding
  • 1 million token context window is in beta, and full production stability has not been confirmed yet
  • ARC-AGI-2 score of 58.3% lags behind Gemini 3 Deep Think's 84.6%, indicating room for improvement on novel reasoning
  • Incremental naming convention (4.6 vs 5.0) may understate the significance of improvements to casual observers


Key Features

Claude Sonnet 4.6 is Anthropic's latest mid-tier model featuring a 1 million token context window in beta, 72.5% on OSWorld for computer use (a fivefold improvement over 16 months), 79.6% on SWE-bench Verified for coding, and 89.9% on GPQA Diamond. It is now the default model for Free and Pro users on claude.ai at $3/$15 per million tokens, with reduced hallucinations and sycophancy compared to its predecessor.

Key Insights

  • Sonnet 4.6 introduces a 1 million token context window in beta, matching Gemini 3 Pro's capacity at a fraction of the cost
  • OSWorld score of 72.5% represents a nearly fivefold improvement in computer use capability over 16 months
  • SWE-bench Verified score of 79.6% puts Sonnet 4.6 in competitive territory with frontier-class coding models
  • Pricing remains at $3/$15 per million tokens, delivering roughly 80-90% of flagship performance at one-fifth the cost
  • The model is now the default for both Free and Pro plan users on claude.ai and Claude Cowork
  • Anthropic emphasizes reduced hallucinations and sycophancy as key qualitative improvements over previous Sonnet versions
  • GPQA Diamond score of 89.9% and MMLU of 89.3% demonstrate strong graduate-level reasoning capabilities
  • Improved long-context reasoning means the model tracks relationships across distant document sections more effectively
