Feb 18, 2026

Claude Sonnet 4.6 Arrives: 1M Token Context and a Fivefold Leap in Computer Use

Anthropic releases Claude Sonnet 4.6 with a 1 million token context window, 72.5% on OSWorld for computer use, and competitive benchmark scores at $3/$15 per million tokens.


Anthropic Ships Its Most Capable Mid-Tier Model Yet

On February 17, 2026, Anthropic released Claude Sonnet 4.6, the latest iteration of its mid-tier model line and now the default model for both Free and Pro plan users on claude.ai and Claude Cowork. The release brings a 1 million token context window in beta, substantial improvements in coding and computer use, and benchmark scores that challenge models costing several times more.

Sonnet 4.6 is not a flagship launch in the way that Opus releases tend to be. It is something arguably more consequential for day-to-day users: a workhorse model that has quietly closed much of the gap with frontier-class competitors while remaining at the same $3 input / $15 output per million token price point.

The 1 Million Token Context Window

The headline feature is a 1 million token context window, available in beta. This is a significant expansion from the previous Sonnet models and puts Sonnet 4.6 in the same territory as Google's Gemini 3 Pro in terms of raw context capacity.

A 1 million token window means users can process entire codebases, dozens of research papers, or lengthy legal documents in a single request without chunking or retrieval-augmented generation workarounds. For developers working on large monorepos or researchers conducting literature reviews, this removes a practical bottleneck that previously required either Opus-tier models or external tooling.
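As a rough illustration of that bottleneck disappearing, the sketch below checks whether a corpus fits in a single 1M-token request. It assumes ~1.3 tokens per English word, a common heuristic only; real counts depend on the tokenizer and content, and the helper names are hypothetical, not part of any Anthropic API.

```python
# Rough estimate of whether a set of documents fits in one request.
# Assumes ~1.3 tokens per English word (heuristic; actual token counts
# depend on the tokenizer and the content).
TOKENS_PER_WORD = 1.3
CONTEXT_WINDOW = 1_000_000  # Sonnet 4.6's beta context window


def estimated_tokens(text: str) -> int:
    return int(len(text.split()) * TOKENS_PER_WORD)


def fits_in_context(documents: list[str], reserve_for_output: int = 16_000) -> bool:
    """True if all documents fit in a single request, leaving room for the reply."""
    total = sum(estimated_tokens(d) for d in documents)
    return total <= CONTEXT_WINDOW - reserve_for_output


# ~30 papers of ~10,000 words each: roughly 390k estimated tokens.
papers = ["lorem ipsum " * 5_000] * 30
print(fits_in_context(papers))  # → True
```

By the same estimate, a corpus of a hundred such papers (~1.3M tokens) would still need chunking or retrieval, which is where the window's limits reassert themselves.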

The context window expansion is not just about capacity. Anthropic reports that Sonnet 4.6 demonstrates improved long-context reasoning, meaning it can better track relationships and dependencies across distant parts of a document rather than simply accepting more tokens.

Coding: Smarter, Not Just Faster

Coding improvements are where Sonnet 4.6 makes its strongest case. The model scores 79.6% on SWE-bench Verified, a benchmark that measures the ability to resolve real-world software engineering issues from GitHub repositories. This places it firmly in competitive territory with frontier models.

According to Anthropic, the key improvement is not raw code generation speed but contextual understanding. Sonnet 4.6 reads context before editing, reasons over long chunks of code, tightens logic instead of duplicating it, and delivers smarter answers with fewer iterations. In practice, this means fewer back-and-forth cycles when working on complex codebases.

The model also scores 89.3% on MMLU and 89.9% on GPQA Diamond, indicating strong general knowledge and graduate-level reasoning capabilities that support its coding performance.

Computer Use: 72.5% on OSWorld

Perhaps the most striking improvement is in computer use. Sonnet 4.6 reaches 72.5% on OSWorld, a benchmark that measures the ability to carry out multi-step tasks on a computer, such as filling out web forms, navigating file systems, and coordinating information across browser tabs.

Anthropic notes this represents nearly a fivefold improvement in computer use capability over 16 months. The practical implication is that Claude can now handle complex, multi-step desktop workflows with significantly higher reliability. Tasks like booking travel by comparing options across multiple tabs, filling out complex forms by extracting information from documents, or managing files across applications are now within reach.

This positions Sonnet 4.6 as a practical option for the agentic use cases that the entire industry is racing toward, and it does so at the Sonnet price tier rather than requiring Opus-level costs.

Benchmark Performance in Context

Sonnet 4.6 posts competitive scores across standard evaluations:

Benchmark             Score
GPQA Diamond          89.9%
ARC-AGI-2             58.3%
MMLU                  89.3%
SWE-bench Verified    79.6%
OSWorld               72.5%

These numbers do not top every leaderboard. GPT-5.2 and Claude Opus 4.5 still lead on several evaluations. But the significance lies in the ratio of performance to cost. At $3/$15 per million tokens, Sonnet 4.6 delivers approximately 80-90% of flagship model performance at roughly one-fifth the price.
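The performance-to-cost arithmetic can be sketched directly from the article's quoted rates. The flagship tier below ($15/$75 per million tokens) is an illustrative assumption chosen to match the "one-fifth the price" framing, not a price quoted anywhere in the article.

```python
# Cost of a single request at Sonnet 4.6's quoted rates ($3 input /
# $15 output per million tokens) versus an assumed $15/$75 flagship
# tier (illustrative assumption, not a quoted price).
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost in dollars; rates are given per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# A long-context request: 800k input tokens, 4k output tokens.
sonnet = request_cost(800_000, 4_000, in_rate=3, out_rate=15)
flagship = request_cost(800_000, 4_000, in_rate=15, out_rate=75)
print(f"Sonnet 4.6: ${sonnet:.2f}  flagship: ${flagship:.2f}  ratio: {sonnet / flagship:.2f}")
# → Sonnet 4.6: $2.46  flagship: $12.30  ratio: 0.20
```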

Anthropic's own framing emphasizes that this release prioritizes practical throughput over raw benchmark points: fewer hallucinations, less sycophancy, better instruction adherence, and faster iteration on engineering and office tasks.

Reduced Hallucinations and Sycophancy

Beyond the headline numbers, Anthropic highlights qualitative improvements that matter for production use. Sonnet 4.6 exhibits fewer hallucinations, meaning it is less likely to generate plausible-sounding but incorrect information. It also shows reduced sycophancy, the tendency to agree with the user rather than providing accurate corrections.

These improvements may not show up dramatically in benchmark scores, but they are critical for enterprise deployments where reliability and accuracy are non-negotiable. A model that pushes back when a user's premise is wrong is more valuable in professional settings than one that scores marginally higher on a standardized test.

Availability and Pricing

Sonnet 4.6 is immediately available as the default model for Free and Pro plan users on claude.ai and Claude Cowork. API access is available at the existing Sonnet pricing of $3 per million input tokens and $15 per million output tokens.

The 1 million token context window is available in beta, with Anthropic expected to expand access as stability is confirmed. The model is also available through Amazon Bedrock and Google Cloud's Vertex AI for enterprise customers.

What This Means for the Market

Sonnet 4.6 reinforces a trend that has been building throughout early 2026: the mid-tier model class is becoming good enough for the majority of production use cases. When a $3/$15 model can handle million-token contexts, achieve 72.5% on computer use benchmarks, and score nearly 80% on SWE-bench, the justification for flagship-tier pricing narrows considerably.

For developers and enterprises, this means the cost of building AI-powered applications continues to drop while capability continues to rise. For Anthropic's competitors, it means the performance floor that users expect from a "standard" model has just been raised again.

Conclusion

Claude Sonnet 4.6 is not a revolutionary leap. It is something more practically important: a substantial, across-the-board improvement to the model that most Claude users interact with daily. The 1 million token context window, fivefold computer use improvement, and strong coding performance make it the most capable mid-tier model available today. At unchanged pricing, it represents one of the best value propositions in the current AI landscape for developers, professionals, and enterprises building on Claude's platform.

Pros

  • 1 million token context window enables processing entire codebases or dozens of research papers in a single request
  • 72.5% OSWorld score makes it one of the most capable models for autonomous computer use at any price tier
  • Strong coding performance at 79.6% SWE-bench with improved contextual understanding and fewer iteration cycles
  • Unchanged $3/$15 pricing delivers exceptional performance-to-cost ratio compared to flagship models
  • Reduced hallucinations and sycophancy improve reliability for production and enterprise deployments

Cons

  • Still trails GPT-5.2 and Claude Opus 4.5 on several high-profile benchmarks including math and competitive coding
  • 1 million token context window is in beta, and full production stability has not been confirmed yet
  • ARC-AGI-2 score of 58.3% lags behind Gemini 3 Deep Think's 84.6%, indicating room for improvement on novel reasoning
  • Incremental naming convention (4.6 vs 5.0) may understate the significance of improvements to casual observers


Key Features

Claude Sonnet 4.6 is Anthropic's latest mid-tier model featuring a 1 million token context window in beta, 72.5% on OSWorld for computer use (a fivefold improvement over 16 months), 79.6% on SWE-bench Verified for coding, and 89.9% on GPQA Diamond. It is now the default model for Free and Pro users on claude.ai at $3/$15 per million tokens, with reduced hallucinations and sycophancy compared to its predecessor.

Key Insights

  • Sonnet 4.6 introduces a 1 million token context window in beta, matching Gemini 3 Pro's capacity at a fraction of the cost
  • OSWorld score of 72.5% represents a nearly fivefold improvement in computer use capability over 16 months
  • SWE-bench Verified score of 79.6% puts Sonnet 4.6 in competitive territory with frontier-class coding models
  • Pricing remains at $3/$15 per million tokens, delivering roughly 80-90% of flagship performance at one-fifth the cost
  • The model is now the default for both Free and Pro plan users on claude.ai and Claude Cowork
  • Anthropic emphasizes reduced hallucinations and sycophancy as key qualitative improvements over previous Sonnet versions
  • GPQA Diamond score of 89.9% and MMLU of 89.3% demonstrate strong graduate-level reasoning capabilities
  • Improved long-context reasoning means the model tracks relationships across distant document sections more effectively
