AI Development · Featured Guide

Claude Sonnet 4.6: Benchmarks, Pricing & Complete Guide

Claude Sonnet 4.6 scores 72.5% on OSWorld-Verified and 79.6% on SWE-bench Verified at $3/$15 per million tokens. Complete guide to benchmarks, coding, computer use, and pricing.

Digital Applied Team
February 17, 2026
8 min read
  • 72.5%: OSWorld-Verified score
  • 79.6%: SWE-bench Verified
  • $3 / $15: per million tokens (input / output)
  • 1M: context window (tokens)

Key Takeaways

  • Near Opus-level computer use: 72.5% on OSWorld-Verified, within 0.2 percentage points of Opus 4.6 (72.7%), while crushing GPT-5.2 at 38.2%.
  • 79.6% on SWE-bench Verified: Significant coding improvement over Sonnet 4.5 (77.2%), approaching Opus-tier performance.
  • Same pricing as Sonnet 4.5: $3 input / $15 output per million tokens, making it the best value in frontier AI.
  • 4.3x improvement on ARC-AGI-2: Novel problem-solving jumps from 13.6% to 58.3%, the largest single-generation gain.
  • Best-in-class office and financial tasks: Leads all models with 1633 Elo on GDPval-AA and 63.3% on Finance Agent.
  • 1M token context with compaction: Beta context compaction auto-summarizes older context for effectively unlimited conversations.

Anthropic released Claude Sonnet 4.6 on February 17, 2026 — their most capable Sonnet model to date. It scores 72.5% on OSWorld-Verified for computer use (within 0.2 percentage points of Opus 4.6), 79.6% on SWE-bench Verified for coding, and leads all models in office productivity tasks at 1633 Elo. The price stays at $3/$15 per million tokens, making it the strongest value proposition in frontier AI.

Users preferred Sonnet 4.6 over Sonnet 4.5 in 70% of head-to-head comparisons and over Opus 4.5 in 59% of comparisons. Combined with a 4.3x jump on ARC-AGI-2 (13.6% to 58.3%), 63.3% on the Finance Agent benchmark, and a beta 1M-token context window, Sonnet 4.6 narrows the gap between the Sonnet and Opus tiers to the point where most workloads no longer need the premium model.

What Is Claude Sonnet 4.6?

Claude Sonnet 4.6 is the latest model in Anthropic's Sonnet tier — positioned between the lightweight Haiku and the premium Opus. It ships as claude-sonnet-4-6 on the API and replaces Sonnet 4.5 as the default model on claude.ai for both Free and Pro plan users. The model delivers near-Opus performance on the two capabilities that matter most for production AI agents: computer use and coding.
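If you want to try it from the API, here's a minimal sketch using the Anthropic Python SDK. Only the model ID comes from this release; everything else is the standard Messages API shape, so treat it as illustrative rather than official quickstart code.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Standard Messages API call; only the model ID is specific to this release.
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this release in one sentence."}],
)
print(message.content[0].text)
```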

The headline numbers tell the story: 72.5% on OSWorld-Verified (Opus 4.6 scores 72.7%), 79.6% on SWE-bench Verified (Opus scores 80.8%), and best-in-class office task performance at 1633 Elo on GDPval-AA. Anthropic reports that 70% of users prefer Sonnet 4.6 over Sonnet 4.5 and 59% prefer it over the older Opus 4.5 — making this the first Sonnet model to be preferred over its Opus-tier predecessor.

What's New
Key improvements over Sonnet 4.5
  • 72.5% OSWorld computer use (+11.1pp)
  • 79.6% SWE-bench coding (+2.4pp)
  • 58.3% ARC-AGI-2 (+44.7pp)
  • 1M token context window (beta)
Unchanged
What stays the same from Sonnet 4.5
  • $3/$15 per million tokens
  • Extended thinking mode
  • Tool use and function calling
  • Vision and image analysis

Full Benchmark Comparison

Anthropic published results across 16 benchmarks covering coding, computer use, agentic tasks, reasoning, and domain-specific evaluations. Sonnet 4.6 matches or leads the field in several categories — particularly office productivity, financial analysis, and scaled tool use — while approaching Opus 4.6 on the most demanding technical benchmarks.

| Benchmark | Sonnet 4.6 | Sonnet 4.5 | Opus 4.6 | GPT-5.2 |
|---|---|---|---|---|
| SWE-bench Verified | 79.6% | 77.2% | 80.8% | 77.0% |
| Terminal-Bench 2.0 | 59.1% | 51.0% | 62.7% | 46.7% |
| OSWorld-Verified | 72.5% | 61.4% | 72.7% | 38.2% |
| Vending-Bench Arena ($) | ~$5,700 | ~$2,100 | ~$7,400 | – |
| ARC-AGI-2 | 58.3% | 13.6% | 75.2% | – |
| GPQA Diamond | 74.1% | 65.0% | 74.5% | 73.8% |
| MMLU-Pro | 79.1% | 78.1% | 81.2% | 80.6% |
| GDPval-AA (Office, Elo) | 1633 | 1375 | 1559 | 1524 |
| Finance Agent | 63.3% | 57.3% | 62.0% | 60.7% |
| MCP-Atlas Scaled Tool Use | 61.3% | – | 60.3% | – |
| τ²-bench Retail | 91.7% | 88.0% | 93.5% | – |
| τ²-bench Telecom | 97.9% | 94.2% | 97.9% | – |
| MATH-500 | 97.8% | 96.4% | 97.6% | 97.4% |
| MMMB | 76.1% | 73.0% | 78.2% | 75.6% |
| Humanity's Last Exam | 19.1% | 11.4% | 26.3% | 20.3% |
| Pace Insurance | 94% | – | – | – |

The standout categories for Sonnet 4.6 are office productivity (1633 Elo on GDPval-AA, ahead of all models including Opus 4.6), financial analysis (63.3% on Finance Agent, also best-in-class), and scaled tool use (61.3% on MCP-Atlas, beating Opus 4.6's 60.3%). On pure reasoning benchmarks like GPQA Diamond and Humanity's Last Exam, it tracks close to GPT-5.2 while Opus 4.6 retains a clear lead.

Computer Use: 72.5% on OSWorld

Computer use has been Claude's defining capability since Anthropic pioneered the feature in October 2024. The OSWorld-Verified trajectory shows remarkable progress: 14.9% (Sonnet 3.5) → 28.0% (Sonnet 3.5 v2) → 42.2% (Sonnet 3.6) → 61.4% (Sonnet 4.5) → 72.5% (Sonnet 4.6) over 16 months. Sonnet 4.6 now matches Opus 4.6 (72.7%) for practical purposes, while GPT-5.2 lags significantly at 38.2%.

On the Pace insurance benchmark — a real-world evaluation of desktop automation in insurance workflows — Sonnet 4.6 achieved 94% accuracy. This includes navigating spreadsheets, filling multi-step web forms, interacting with legacy desktop applications, and completing end-to-end processes without custom APIs or automation scripts. The model handles tasks that previously required robotic process automation (RPA) tooling.
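For teams evaluating this capability, computer use is exposed through the API as a server-defined tool plus a client-side action loop. The sketch below assumes Sonnet 4.6 keeps the tool version and beta flag used by earlier Claude 4 models (`computer_20250124`, `computer-use-2025-01-24`); both are assumptions, so check the current docs before shipping anything.

```python
import anthropic

client = anthropic.Anthropic()

# Tool version and beta flag mirror earlier Claude 4 releases; both are
# assumptions for Sonnet 4.6 until confirmed in the docs.
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    betas=["computer-use-2025-01-24"],
    tools=[{
        "type": "computer_20250124",
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user", "content": "Open the claims form and enter policy A-1042."}],
)

# The model responds with tool_use actions (screenshot, click, type, ...);
# your harness executes each one and feeds back a tool_result to continue.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```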

Computer Use Applications

Form Automation

Multi-step web forms, insurance applications, and data entry workflows with 94% accuracy on real-world tasks.

Legacy Systems

Interacts with desktop applications and legacy systems that lack APIs, replacing traditional RPA tooling.

QA Testing

End-to-end UI testing, spreadsheet validation, and cross-app workflows without custom test scripts.

The near-parity between Sonnet 4.6 and Opus 4.6 on computer use is significant because Sonnet is approximately 5x cheaper. For teams building computer-use agents, Sonnet 4.6 delivers essentially the same capability at a fraction of the cost — making it the default choice for production deployments.

Coding and Agentic Performance

Sonnet 4.6 scores 79.6% on SWE-bench Verified — a 2.4 percentage point improvement over Sonnet 4.5's 77.2% and approaching Opus 4.6's 80.8%. On Terminal-Bench 2.0, which evaluates more complex agentic coding tasks, it reaches 59.1% (up from 51.0%), significantly ahead of GPT-5.2's 46.7%. Anthropic notes qualitative improvements in context reading, shared logic consolidation, and fewer false success claims.

Coding Benchmark Comparison

| Benchmark | Sonnet 4.6 | Sonnet 4.5 | Opus 4.6 |
|---|---|---|---|
| SWE-bench Verified | 79.6% | 77.2% | 80.8% |
| Terminal-Bench 2.0 | 59.1% | 51.0% | 62.7% |
| Vending-Bench Arena | ~$5,700 | ~$2,100 | ~$7,400 |

The Vending-Bench Arena result is particularly noteworthy — Sonnet 4.6 generates approximately $5,700 in revenue on the simulated vending machine business task, a 2.7x improvement over Sonnet 4.5's $2,100. This benchmark measures real-world agentic capability by having models build and operate a complete business, suggesting Sonnet 4.6's coding improvements translate directly to practical multi-step task execution.

Agentic Tool Use

Beyond pure coding, Sonnet 4.6 excels at tool-use benchmarks that test a model's ability to interact with external systems. On τ²-bench, it scores 91.7% on retail scenarios and 97.9% on telecom scenarios — matching Opus 4.6 on telecom. The MCP-Atlas scaled tool use benchmark, which tests coordination across many tools simultaneously, shows Sonnet 4.6 at 61.3% — actually ahead of Opus 4.6's 60.3%.
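Tool use on these benchmarks maps directly onto the Messages API's `tools` parameter: you declare a JSON-schema tool, the model emits `tool_use` blocks, and your code returns `tool_result` messages. A minimal sketch, with a hypothetical `get_order_status` tool standing in for a real retail backend:

```python
import anthropic

client = anthropic.Anthropic()

# A single client-defined tool; the name and schema are illustrative.
tools = [{
    "name": "get_order_status",
    "description": "Look up the fulfillment status of a retail order by ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is order 8841?"}],
)

# If the model decides to call the tool it emits a tool_use block; run the
# lookup yourself and return a tool_result message to continue the loop.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```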

For developers using coding-focused AI models, Sonnet 4.6 represents a strong option that balances coding performance with broader agentic capabilities and competitive pricing.

New Features and Capabilities

Beyond raw benchmark improvements, Sonnet 4.6 introduces several new features that expand what Claude can do in practice. These range from infrastructure changes (1M token context) to product features (Claude in Excel) that make the model more useful for enterprise workflows.

1M Token Context Window (Beta)

Sonnet 4.6 supports a 1M token context window in beta — roughly 750,000 words or the equivalent of 5-10 full codebases. This is paired with a new context compaction feature that automatically summarizes older context when approaching the limit, allowing effectively unlimited conversations without losing critical information.
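In practice, long-context betas are opted into with a beta flag on the request. The flag below (`context-1m-2025-08-07`) is the one earlier 1M-context Sonnet betas used and is an assumption for 4.6; the file path is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()
repo_dump = open("repo_dump.txt").read()  # placeholder: a very large corpus

# Assumption: beta flag carried over from earlier 1M-context releases.
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["context-1m-2025-08-07"],
    messages=[{"role": "user", "content": [
        {"type": "text", "text": repo_dump},
        {"type": "text", "text": "Map the cross-module dependencies in this dump."},
    ]}],
)
print(response.usage.input_tokens)  # confirm how much context was actually sent
```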

Adaptive and Extended Thinking

Sonnet 4.6 retains extended thinking from Sonnet 4.5 but adds an adaptive mode that automatically adjusts thinking depth based on task complexity. Simple questions get fast responses, while complex reasoning problems trigger deeper thinking chains — optimizing both latency and cost.
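Extended thinking is requested per call with a token budget, using the parameter shape from earlier Claude 4 models. How the new adaptive mode is toggled isn't specified in the source material, so this sketch shows only the explicit mode:

```python
import anthropic

client = anthropic.Anthropic()

# Explicit extended thinking with a token budget (max_tokens must exceed it).
# The adaptive mode's parameter shape isn't documented here, so it's omitted.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200])
    elif block.type == "text":
        print(block.text)
```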

Improved Prompt Injection Resistance

Anthropic reports improved resistance to prompt injection attacks, which is critical for production deployments where models process untrusted user input. This is particularly relevant for computer-use agents that interact with arbitrary web pages and desktop applications.

Web Search, Code Execution & Memory

Claude now has generally available web search and web fetch tools, code execution in a sandboxed environment, and a memory tool that persists information across conversations. These move from beta to GA with Sonnet 4.6, enabling more capable agentic workflows on claude.ai and through the API.
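With web search at GA, a request can grant the model a server-side search tool with a usage cap. This sketch reuses the tool version string from the feature's original release, which we're assuming is unchanged at GA:

```python
import anthropic

client = anthropic.Anthropic()

# Server-side web search tool; version string assumed unchanged at GA.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{"role": "user", "content": "What's new on the OSWorld leaderboard?"}],
)
for block in response.content:
    if block.type == "text":
        print(block.text)
```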

Claude in Excel & MCP Connectors

Anthropic introduces Claude in Excel, bringing AI-powered data analysis directly into spreadsheets through MCP (Model Context Protocol) connectors. This builds on the office task benchmark leadership (1633 Elo on GDPval-AA) and gives enterprise users a way to leverage Claude for financial modeling, data transformation, and report generation without leaving their existing tools.
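For API users, MCP servers can also be attached directly to a request. The sketch below follows the existing MCP connector API (beta flag `mcp-client-2025-04-04`, URL-type server entries); the flag and the finance server URL are assumptions for illustration only.

```python
import anthropic

client = anthropic.Anthropic()

# MCP connector sketch; the beta flag and server-entry shape follow the
# existing connector API, and the server URL is purely hypothetical.
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    betas=["mcp-client-2025-04-04"],
    mcp_servers=[{
        "type": "url",
        "url": "https://example.com/finance-mcp",  # hypothetical server
        "name": "finance",
    }],
    messages=[{"role": "user", "content": "Pull Q4 revenue and summarize variances."}],
)
```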

Pricing and Availability

Sonnet 4.6 maintains the same pricing as Sonnet 4.5 — $3 per million input tokens and $15 per million output tokens. Given the significant performance improvements, this makes Sonnet 4.6 the best value proposition in frontier AI for most workloads. The API model ID is claude-sonnet-4-6.
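At those rates, per-request cost is simple arithmetic; here's a quick back-of-envelope helper:

```python
# Sonnet 4.6 list pricing: $3 per 1M input tokens, $15 per 1M output tokens.
INPUT_RATE, OUTPUT_RATE = 3.00, 15.00  # USD per million tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a single request at list pricing."""
    return input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE

# Example: a 50k-token prompt that produces a 2k-token reply.
print(f"${request_cost(50_000, 2_000):.3f}")  # $0.180
```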

  • Pricing: $3 input / $15 output per 1M tokens
  • API model ID: claude-sonnet-4-6
  • Platforms (5): claude.ai, Cowork, Code, Bedrock, Vertex

Where to Access Sonnet 4.6

  • claude.ai — Default model for Free and Pro plans, available immediately
  • Claude Cowork — Team collaboration with Sonnet 4.6 as the default agent model
  • Claude Code — CLI tool for developers, with Sonnet 4.6 available as the fast model
  • Anthropic API — Direct access via claude-sonnet-4-6
  • Amazon Bedrock — Enterprise deployment with AWS infrastructure
  • Google Cloud Vertex AI — Enterprise deployment with GCP infrastructure

When to Choose Sonnet 4.6 vs Opus 4.6

Choose Sonnet 4.6
Best for most workloads
  • Computer use agents (72.5% vs 72.7%)
  • Office productivity and data tasks
  • Standard coding tasks (79.6% SWE-bench)
  • Cost-sensitive production deployments
Choose Opus 4.6
For maximum capability
  • Novel reasoning (75.2% vs 58.3% ARC-AGI-2)
  • Complex multi-file coding
  • Research and deep analysis
  • Hardest reasoning problems (HLE: 26.3%)

How Sonnet 4.6 Compares

vs Opus 4.6

Sonnet 4.6 achieves near-parity with Opus 4.6 on computer use (72.5% vs 72.7%) and trails slightly on coding (79.6% vs 80.8% SWE-bench). On reasoning, Opus maintains a clear lead — 75.2% vs 58.3% on ARC-AGI-2 and 26.3% vs 19.1% on Humanity's Last Exam. But Sonnet 4.6 actually leads on office tasks (1633 vs 1559 Elo) and matches on telecom tool use (97.9%). At approximately 5x lower cost, Sonnet 4.6 is the right default for most production workloads.

vs GPT-5.2

The comparison with GPT-5.2 is stark on computer use: Sonnet 4.6's 72.5% nearly doubles GPT-5.2's 38.2% on OSWorld-Verified. On coding, they're closely matched (79.6% vs 77.0% SWE-bench), while Sonnet 4.6 leads on office tasks (1633 vs 1524 Elo) and financial analysis (63.3% vs 60.7%). GPT-5.2 is competitive on reasoning benchmarks (GPQA Diamond: 73.8% vs 74.1%) but lacks the computer-use capability that defines Claude's differentiation.

vs Gemini 3 Pro

Sonnet 4.6 leads Gemini 3 Pro across computer use, coding, and agentic benchmarks. While Google's model competes on multimodal understanding and general knowledge, Claude's strength in desktop automation, tool use, and office productivity makes Sonnet 4.6 the stronger choice for enterprise agentic deployments. The Qwen 3.5 guide covers additional frontier model comparisons.

The Value Proposition

Sonnet 4.6 delivers 97-99% of Opus 4.6's capability on computer use and coding at approximately 20% of the cost. For teams that previously needed Opus for agent workloads, Sonnet 4.6 eliminates the cost premium without meaningful capability loss. The only scenario where Opus remains clearly superior is novel reasoning tasks (ARC-AGI-2, Humanity's Last Exam).

Conclusion

Claude Sonnet 4.6 represents a significant milestone in Anthropic's model lineup. By achieving near-Opus performance on computer use (72.5% vs 72.7%) and coding (79.6% vs 80.8%) at Sonnet-tier pricing ($3/$15 per million tokens), it effectively makes frontier-level AI agents accessible at 5x lower cost. The best-in-class office productivity (1633 Elo) and financial analysis (63.3%) scores further distinguish it from both Opus 4.6 and GPT-5.2.

For most production workloads — computer use agents, coding assistants, tool-use pipelines, and office automation — Sonnet 4.6 is now the right default choice. Opus 4.6 retains its edge for the hardest reasoning problems and complex multi-step analysis, but Sonnet 4.6's combination of performance, cost, and new features (1M context, compaction, Claude in Excel) makes it the most practical frontier AI model available.

Ready to Build with Claude Sonnet 4.6?

Whether you're deploying computer-use agents, building coding assistants, or automating enterprise workflows, our team can help you leverage Claude's capabilities for measurable business results.

  • Free consultation
  • Expert AI integration guidance
  • Tailored solutions
