AI Development · Decision Matrix · 16 min read · Published May 22, 2026

The routing guide for teams choosing between three production-grade computer-use stacks.

Computer-Use Agents: Routing Microsoft, Anthropic, and Google

Three computer-use agent stacks now compete for enterprise attention: Microsoft Copilot Studio reached GA on May 13, 2026; Anthropic Claude Computer Use remains in public beta; Google Gemini 2.5 Computer Use is a preview. OSWorld-Verified scores now sit above the human baseline across frontier models. The question is no longer which model leads — it is which stack fits your use case.

Digital Applied Team
Senior strategists · Published May 22, 2026
Sources: 25
Microsoft Copilot Studio GA: May 13, 2026 · All commercial geos · Windows-only target
Claude Opus 4.7 OSWorld-Verified: 78.0% · Apr 2026 · Vellum benchmarks · Above human baseline
Gemini 3.5 Flash OSWorld-Verified: 78.4% · May 19, 2026 · No CU API as of May 24
Human OSWorld baseline: 72.36% · Coasty · May 2026 · 3 frontier models now exceed

Three major computer-use agent platforms now compete in the same category: Microsoft Copilot Studio reached general availability on May 13, 2026; Anthropic Claude Computer Use has been in public beta since October 2024; and Google Gemini 2.5 Computer Use shipped as a developer preview in October 2025. Frontier models from Anthropic, Google, and OpenAI now pass the OSWorld benchmark's human baseline of approximately 72%, although the only Google model that exposes a Computer Use API, Gemini 2.5 Computer Use, reports 40.9% on the original OSWorld scale. The stacks also differ sharply in architecture, sandbox model, pricing unit, and enterprise fit. This is the routing guide, not a horse race.

The benchmark convergence is real but misleading as a selection signal. Claude Opus 4.7 scores 78.0% OSWorld-Verified; Gemini 3.5 Flash reportedly scores 78.4% — though that model does not expose a Computer Use API as of May 24, 2026. GPT-5.5 reportedly scores 78.7% (vendor self-reported). All three sit within 0.7 points of each other, above the ~72.36% human baseline. At that resolution, benchmark differentiation is noise. The real question is architecture fit: what does your agent need to access, how is it deployed, what does it cost per step, and who governs its actions?

This guide covers the current status of each stack, an honest OSWorld calibration, a proprietary 3-way decision matrix, sandbox architecture differences, per-task cost modeling, a concrete routing guide, security and governance primitives, and a projection of where the three stacks converge over the next 12 months. For the Microsoft-specific deep dive, see Copilot Studio Computer-Use Agents: GA Deep Dive. For enterprise governance and guardrails, see Agent Computer Use: Enterprise Automation Playbook.

Key takeaways
  1. Three stacks, three control models. Microsoft Copilot Studio computer use has been GA since May 13, 2026: a governed, Windows-only platform embedded in Power Platform with per-step Credit billing, DLP policies, and Purview audit logs built in. Anthropic Claude Computer Use is in public beta: API-first, it runs against a reference Docker container you operate, charges per token, and is the most flexible for native desktop control. Google Gemini 2.5 Computer Use is a preview: browser-optimized and priced at $1.25/$10 per Mtok, but the 2.5 preview model is the only one exposing the Computer Use API (Gemini 3.5 Flash does not, as of May 24, 2026).
  2. OSWorld convergence above human baseline, but the scores are not comparable. Claude Opus 4.7 scores 78.0% OSWorld-Verified (Apr 2026); Gemini 3.5 Flash reportedly scores 78.4% (May 19, 2026); GPT-5.5 scores 78.7% (vendor-reported). The human baseline is approximately 72.36%. All frontier models have crossed the human line. Two critical caveats: OSWorld-Verified (used from Sonnet 4.5 onward) and original OSWorld (pre-4.5 scores) are not directly comparable; and Gemini 3.5 Flash's score does not correspond to a usable Computer Use API.
  3. Pricing model is the most under-analyzed difference. Microsoft charges 5 Copilot Credits per step on standard models and 15 per step on premium (Opus 4.6). At $0.01 per Credit pay-as-you-go, that is $0.05 or $0.15 per step. Anthropic and Google both charge per token, which means cost scales with screenshot volume and context length, not step count. For long workflows with many screenshots, token billing can exceed Credit billing; for short, screenshot-light workflows, token billing can come in cheaper. Run the math before committing to a stack on cost grounds alone.
  4. Sandbox architecture governs security posture more than any model spec. Anthropic ships a reference Docker container (Xvfb, Mutter, Firefox, LibreOffice) that you run in your own infrastructure; Claude does not directly connect to the execution environment. Microsoft requires a Power Automate–registered Windows machine with allowlists and optional Azure Key Vault for credentials. Google's Gemini 2.5 Computer Use runs on browser surfaces only and provides no sandbox container at all. That architecture difference determines your data perimeter, not the model card.
  5. Routing rule: Microsoft for governance, Anthropic for desktop control, Google for browser speed. Teams inside the Microsoft stack (Power Platform, Azure AD, Purview) should default to Copilot Studio, where the governance is already wired in. Teams with native Linux/Windows desktop automation needs outside Microsoft's stack should use Anthropic's API, where the Docker sandbox gives the most control. Teams building lightweight browser-automation workflows where latency and cost are the primary constraints should evaluate Gemini 2.5 Computer Use, which has the lowest pricing and competitive browser-task benchmarks.

01 · Current Status
GA, beta, preview: what each readiness level actually means.

The three readiness labels carry operational weight. GA means Microsoft has committed to SLAs, billing, and enterprise support contracts for Copilot Studio computer use; teams can build production workflows against it today across all commercial Power Platform geographies. Beta means Anthropic's Computer Use API is stable enough for production use but may change; the beta header requirement (computer-use-2025-11-24 for Opus 4.7, Opus 4.6, Sonnet 4.6, and Opus 4.5) is the flag that the interface is not yet frozen. Preview means Google's Gemini 2.5 Computer Use is a developer-facing model card release: it can be used, but it carries no GA commitment and the API surface may change without notice.
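As a minimal sketch of what the beta flag means in practice, the snippet below builds the HTTP headers for an Anthropic API request that opts into computer use. The flag value and the model list come from this article; `build_headers` and `MODELS_ON_CURRENT_BETA` are hypothetical names, the models are keyed by display name rather than by real API model IDs, and the `anthropic-version` value is assumed to be the current one.

```python
# Beta flag quoted in this article for the current model generation.
COMPUTER_USE_BETA = "computer-use-2025-11-24"

# Display names as listed in the article. Real API model IDs differ and are
# deliberately not guessed here.
MODELS_ON_CURRENT_BETA = {"Opus 4.7", "Opus 4.6", "Sonnet 4.6", "Opus 4.5"}


def build_headers(api_key: str, model_name: str) -> dict[str, str]:
    """Return HTTP headers that opt a request into the computer-use beta."""
    if model_name not in MODELS_ON_CURRENT_BETA:
        raise ValueError(f"No beta flag recorded for {model_name!r}")
    return {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",  # assumed current API version
        "anthropic-beta": COMPUTER_USE_BETA,
        "content-type": "application/json",
    }
```

Keeping the flag in one constant makes the eventual header change (or its removal at GA) a one-line edit.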

Microsoft's GA announcement on May 13, 2026 was broader than a technical flag — it included a customer case study from Graebel, a 1,500-employee mobility firm that deployed a Service Order Agent to process free-form relocation request emails into its proprietary "Global Connect" platform across more than 30 relocation service categories. Matt Brownlee, Chief Revenue Officer of Graebel, stated: "By adopting Microsoft Copilot Studio and AI agents, we've moved beyond traditional automation to a more intelligent, scalable operating model." This is a meaningful signal — GA is production-ready, not just feature-complete.

Anthropic's Computer Use was first announced October 22, 2024 with Claude 3.5 Sonnet, scoring just 14.9% on OSWorld in screenshot-only mode. That figure is 19 months old and has since been superseded by the OSWorld-Verified methodology. The beta label has persisted through six model generations because Anthropic is still refining the tool interface, not because the capability is unstable. The Claude Computer Use API docs are detailed and production-grade. Our Anthropic Computer Use API guide covers the full surface in depth.

Google's Gemini 2.5 Computer Use preview launched October 7, 2025 and is available via the Gemini API on Google AI Studio and Vertex AI. The model identifier is gemini-2.5-computer-use-preview-10-2025. One critical clarification: Gemini 3.5 Flash, which launched at Google I/O 2026 on May 19, 2026 and reportedly scores 78.4% OSWorld-Verified, does not expose the Computer Use API as of May 24, 2026. Teams evaluating Google's computer-use stack must use the 2.5 preview model, not 3.5 Flash. For the 3.5 Flash context, see our Gemini 3.5 Flash benchmarks guide.

Microsoft · GA · May 13, 2026, all commercial geos

Copilot Studio computer use is generally available across all commercial Power Platform geographies. Windows-only execution target. Standard models: 5 Credits/step. Premium (Opus 4.6): 15 Credits/step. Pay-as-you-go: $0.01/Credit.

Production-ready · SLA-backed

Anthropic · Beta · Public beta since Oct 2024, API-first

Claude Computer Use has been in public beta since October 22, 2024 (Claude 3.5 Sonnet). Requires beta header (computer-use-2025-11-24 for Opus 4.7/Opus 4.6/Sonnet 4.6/Opus 4.5). Token billing. Docker sandbox shipped by Anthropic, operated by you.

Stable but interface not frozen

Google · Preview · Developer preview since Oct 7, 2025

Gemini 2.5 Computer Use preview model (gemini-2.5-computer-use-preview-10-2025) via Gemini API. $1.25/$10 per Mtok input/output. 131K context, 64K max output. Browser-only execution — no container provided. Gemini 3.5 Flash does NOT expose this API.

No GA commitment · may change

Human baseline · 72.36% · OSWorld (Coasty / Anthropic, May 2026)

The OSWorld human baseline is approximately 72.36% across 369 desktop tasks on Ubuntu, Windows, and macOS. Sonnet 4.6, Opus 4.6, Opus 4.7, GPT-5.5, and Gemini 3.5 Flash all reportedly exceed this threshold — making benchmark differentiation between frontier models increasingly marginal.

All frontier models now above

02 · Benchmark Calibration
Three OSWorld traps that will mislead your evaluation.

OSWorld is the most credible independent benchmark for computer-use agents — 369 actual desktop tasks across real operating systems, real apps, and real workflows. But it is being misread in three consistent ways across engineering blogs and vendor comparisons. Understanding the traps is essential before using the scores as a selection input.

Trap 1: OSWorld and OSWorld-Verified are not the same scale. The OSWorld maintainers introduced OSWorld-Verified in July 2025, a re-evaluation methodology that corrects for task ambiguity and partial-credit scoring in the original OSWorld. Scores from Sonnet 4.5 onward (61.4%) use OSWorld-Verified; scores from Sonnet 3.5 (14.9%) through Sonnet 4 (42.2%) use original OSWorld. Mixed-scale comparisons inflate the apparent slope. The trajectory within each scale is valid; cross-scale comparisons are not. As noted in Coasty's independent benchmark commentary, "An agent scoring 38% or even 61% isn't ready to handle the unpredictable, multi-step, context-heavy work that fills a real knowledge worker's day."
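One way to keep the two scales apart in an internal evaluation sheet is to tag every score with its scale and refuse to subtract across the break. The sketch below uses the figures quoted in this article; the `Score` type and `delta` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    scale: str  # "original" or "verified"
    pct: float


SCORES = [
    Score("Claude Sonnet 3.5", "original", 14.9),
    Score("OpenAI CUA", "original", 38.1),
    Score("Gemini 2.5 CU", "original", 40.9),
    Score("Claude Sonnet 4.5", "verified", 61.4),
    Score("Claude Sonnet 4.6", "verified", 72.5),
    Score("Claude Opus 4.7", "verified", 78.0),
]


def delta(a: Score, b: Score) -> float:
    """Percentage-point gain from a to b, allowed only within one scale."""
    if a.scale != b.scale:
        raise ValueError(
            f"{a.model} ({a.scale}) and {b.model} ({b.scale}) use different scales"
        )
    return round(b.pct - a.pct, 1)
```

Here `delta(SCORES[3], SCORES[5])` gives the Sonnet 4.5 to Opus 4.7 gain on the Verified scale, while `delta(SCORES[0], SCORES[5])` raises instead of returning a misleading slope.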

Trap 2: Gemini 3.5 Flash's 78.4% does not mean a usable Computer Use API. Gemini 3.5 Flash, launched at Google I/O 2026 on May 19, reportedly scores 78.4% OSWorld-Verified per llm-stats.com analysis. This figure refers to the model's reasoning and vision capabilities, not a deployed Computer Use API surface. As of May 24, 2026, Gemini 3.5 Flash does not expose the computer-use-preview tooling. Only the gemini-2.5-computer-use-preview-10-2025 model does. A team that reads "78.4%" and builds against Gemini 3.5 Flash will find no Computer Use API.

Trap 3: above-human convergence means benchmark differentiation has diminishing returns. Claude Opus 4.7 scores 78.0% OSWorld-Verified; GPT-5.5 scores 78.7% (vendor self-reported); Gemini 3.5 Flash reportedly scores 78.4%. The three-way spread is 0.7 points, too small to separate from run-to-run variation in the benchmark. At this level, architecture fit, pricing model, and governance primitives dominate the selection decision far more than the benchmark delta. Our earlier three-way matrix comparing Claude, OpenAI, and Gemini covered the pre-Microsoft-GA landscape; this post updates that analysis with the May 13 Microsoft GA and the OSWorld-Verified recalibration.
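The arithmetic behind that convergence is short enough to check directly. This sketch only restates the figures quoted above; the variable names are illustrative.

```python
HUMAN_BASELINE = 72.36  # approximate OSWorld human baseline quoted in this article

# Reported OSWorld-Verified scores (two of the three are vendor- or third-party-reported).
scores = {
    "Claude Opus 4.7": 78.0,
    "Gemini 3.5 Flash": 78.4,
    "GPT-5.5": 78.7,
}

# Gap between the best and worst of the three frontier scores.
spread = round(max(scores.values()) - min(scores.values()), 2)

# How far each model sits above the human baseline, in percentage points.
margins = {name: round(pct - HUMAN_BASELINE, 2) for name, pct in scores.items()}

print(f"spread: {spread} points")
for name, margin in margins.items():
    print(f"{name}: +{margin} over human baseline")
```

Each model's margin over the human baseline is several times larger than the spread between the models, which is why the spread is a weak selection signal.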

Claude computer-use OSWorld trajectory: 16-month arc

Sources: Anthropic launch posts, Coasty OSWorld leaderboard (May 3, 2026), Vellum benchmark roundup (Apr 2026), llm-stats.com. Note: scores prior to Sonnet 4.5 use original OSWorld; 4.5+ use OSWorld-Verified, so they are not directly comparable across the break.

  • Claude Sonnet 3.5 (original OSWorld): 14.9% · Oct 2024 · screenshot-only mode
  • OpenAI CUA (original OSWorld): 38.1% · Jan 2025 · Coasty independent leaderboard
  • Gemini 2.5 CU (original OSWorld, vendor self-report): 40.9% · Oct 2025 · Gemini 2.5 Computer Use model card
  • Claude Sonnet 4.5 (OSWorld-Verified): 61.4% · Sep 2025 · first OSWorld-Verified score
  • Claude Sonnet 4.6 (OSWorld-Verified): 72.5% · Feb 17, 2026 · llm-stats.com
  • Human baseline (OSWorld): ~72.36% · Coasty / Anthropic Sonnet 4.6 release notes
  • Claude Opus 4.7 (OSWorld-Verified): 78.0% · Apr 16, 2026 · Vellum benchmark roundup

03 · Decision Matrix
The 3-way computer-use comparison: Microsoft, Anthropic, Google.

No published comparison maps all three stacks against the same dimensions — particularly cost-per-task and sandbox architecture, which are the two most consequential variables for production deployments. The matrix below draws on Microsoft Learn (Computer Use, updated May 21, 2026), Claude API docs (Computer use tool), and Google DeepMind Gemini 2.5 Computer Use Model Card. For deeper detail on the Microsoft stack, see our Copilot Studio GA deep dive.

Microsoft Copilot Studio
Governed enterprise stack · GA May 13, 2026

  • Launch status: Generally Available
  • Geographic scope: All commercial Power Platform geographies
  • Models: OpenAI CUA, Claude Sonnet 4.5, Claude Sonnet 4.6 (experimental), Claude Opus 4.6 (experimental, premium)
  • Sandbox: Power Automate–registered Windows machine; you provide the machine, Copilot Studio orchestrates it
  • Target OS: Windows only (web and desktop apps)
  • Pricing: 5 Credits/step (standard), 15 Credits/step (Opus 4.6 premium), $0.01/Credit pay-as-you-go
  • OSWorld-Verified: inherits Sonnet 4.6 (72.5%) or Opus 4.6 (72.7%) depending on model selected
  • Security: per-environment DLP, Azure Key Vault, Purview audit logs, Power Platform RBAC, human-in-the-loop via Outlook email reviewer
  • Best-fit: regulated enterprise running Windows workloads inside Microsoft's trust boundary

Best for governed enterprise
Anthropic Claude Computer Use
Native desktop control · Public beta

  • Launch status: Public beta
  • Geographic scope: Anthropic API, Amazon Bedrock, Google Cloud Vertex AI
  • Models: Opus 4.7 (78.0% OSWorld-Verified), Sonnet 4.6 (72.5%), Opus 4.6 (72.7%), Opus 4.5
  • Sandbox: Reference Docker container you operate (Xvfb virtual X11, Mutter, Tint2, Firefox, LibreOffice); Claude does not connect directly to the environment
  • Target OS: Linux (container), Windows, macOS via custom sandbox
  • Pricing: per-token; Opus 4.7: $5/$25 per Mtok; Sonnet 4.6: $3/$15 per Mtok
  • Token overhead: 466–499 tokens system prompt + 735 tokens per tool definition + screenshot vision tokens
  • OSWorld-Verified: Opus 4.7 78.0%; Sonnet 4.6 72.5%
  • Security: prompt-injection classifiers on screenshots, sandbox isolation in your infrastructure
  • Best-fit: teams needing native Linux/Windows/macOS desktop control outside Microsoft's stack

Best for native desktop control
Google Gemini 2.5 Computer Use
Browser-optimized · Preview

  • Launch status: Developer preview
  • Launch date: Oct 7, 2025
  • Geographic scope: Gemini API (Google AI Studio, Vertex AI)
  • Model: gemini-2.5-computer-use-preview-10-2025 (NOT Gemini 3.5 Flash; that model has no CU API)
  • Sandbox: browser-only; no container provided by Google
  • Target OS: browser surfaces
  • Pricing: $1.25/$10 per Mtok (input/output), lowest of the three
  • Context: 131K tokens, 64K max output
  • OSWorld score: 40.9% (vendor model card, original OSWorld scale, Oct 2025)
  • Browser scores: Online-Mind2Web 59.4%, WebVoyager 62.7%, AndroidWorld 46.0%
  • Security: no built-in governance layer; you own the sandbox and access controls
  • Best-fit: browser-automation workflows where token cost and latency are primary constraints

Best for browser-only, cost-sensitive

04 · Architecture
Sandbox architecture: the under-told story in every comparison.

Every mainstream comparison of computer-use agents focuses on OSWorld scores and pricing. Almost none discusses sandbox architecture in detail — which is the variable that actually governs your data perimeter, security posture, and operational complexity.

Anthropic's architecture: Docker container you operate. Anthropic ships a reference Docker container with Xvfb (virtual X11 display), Mutter window manager, Tint2 panel, and pre-installed Linux apps including Firefox and LibreOffice. The agent loop runs in your infrastructure. Claude itself does not connect directly to the execution environment — it sends tool calls (screenshot, click, type, key, cursor_position) and your agent loop executes them inside the container. This architecture gives the most flexibility (you can install any Linux software, mount any filesystem, configure network isolation at the container level) and puts the most operational burden on you. Our Claude Computer Use production deployment guide covers the container configuration, token management, and rollback patterns in detail.
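The separation described above, where the model proposes tool calls and your loop executes them, can be sketched without any network access. Everything here is illustrative: `FakeDesktop` stands in for the container, the action names follow the list in this article rather than the exact API schema, and `planned_calls` replaces the model, which in a real loop would choose each call after seeing the previous result.

```python
from typing import Any, Callable


class FakeDesktop:
    """Stand-in for the sandboxed container; records actions instead of executing them."""

    def __init__(self) -> None:
        self.cursor = (0, 0)
        self.typed: list[str] = []

    def screenshot(self) -> str:
        return "<screenshot placeholder>"

    def click(self, x: int, y: int) -> str:
        self.cursor = (x, y)
        return f"clicked at {x},{y}"

    def type_text(self, text: str) -> str:
        self.typed.append(text)
        return f"typed {len(text)} chars"

    def key(self, combo: str) -> str:
        return f"pressed {combo}"

    def cursor_position(self) -> str:
        return f"{self.cursor[0]},{self.cursor[1]}"


def execute_tool_call(desktop: FakeDesktop, call: dict[str, Any]) -> str:
    """Dispatch one tool call to the desktop; unknown actions are rejected."""
    handlers: dict[str, Callable[..., str]] = {
        "screenshot": desktop.screenshot,
        "click": desktop.click,
        "type": desktop.type_text,
        "key": desktop.key,
        "cursor_position": desktop.cursor_position,
    }
    action = call["action"]
    if action not in handlers:
        raise ValueError(f"unsupported action: {action}")
    args = {k: v for k, v in call.items() if k != "action"}
    return handlers[action](**args)


def run_loop(desktop: FakeDesktop, planned_calls: list[dict[str, Any]],
             max_steps: int = 20) -> list[str]:
    """Execute calls in order, stopping at max_steps as a runaway guard."""
    results = []
    for step, call in enumerate(planned_calls):
        if step >= max_steps:
            break
        results.append(execute_tool_call(desktop, call))
    return results
```

Because the dispatcher is the only code that touches the desktop, it is also the natural place to add logging, allowlists, or approval gates.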

Microsoft's architecture: Power Automate–registered Windows machine. Copilot Studio computer use executes against a Windows machine registered in Power Automate's "machines" pool. Per-credential authentication via Power Platform internal storage or Azure Key Vault. Access control via per-website and per-app allowlists configured at the agent level. Human supervision via an Outlook email reviewer for low-confidence steps. This is the most prescriptive architecture — you cannot run a Linux container, cannot mount arbitrary filesystems, and cannot target macOS. But you also get Purview-based audit logs that propagate run history to Purview and Dataverse, DLP policies enforced at the environment level, and Power Platform RBAC — governance that would take months to build from scratch on Anthropic's stack.
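To give a sense of what a per-website allowlist involves when you have to build it yourself on another stack, here is a minimal hostname check. It is not Copilot Studio's configuration format or API; `is_allowed` is a hypothetical helper, and the domains are placeholders.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_hosts: set[str]) -> bool:
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)


ALLOWED = {"example.com"}
print(is_allowed("https://app.example.com/orders", ALLOWED))
print(is_allowed("https://example.com.evil.net/login", ALLOWED))
```

Matching on the parsed hostname, rather than on a substring of the URL, is what keeps lookalike domains such as `example.com.evil.net` out.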

Google's architecture: browser-only, no container. Gemini 2.5 Computer Use runs on browser surfaces — it can navigate, click, fill forms, and interact with web content. It does not provide a sandbox container at all. Google's earlier consumer-facing surfaces (Project Mariner, Firebase Testing Agent, AI Mode in Search) were the precursors to the developer API. For browser-automation use cases, this is appropriate. For desktop app automation — SAP GUI, Salesforce Classic, legacy Windows apps — it is not the right tool. See our Gemini 2.5 Computer Use for marketing automation guide for the browser use-case patterns in depth.

“Instead of brittle selector-based automation, the computer use tool uses vision and reasoning to navigate live UIs — adapting when layouts shift, fields move, or workflows branch.” — Microsoft Copilot Studio GA blog, techcommunity.microsoft.com, May 13, 2026.

05 · Cost Modeling
Per-task pricing: where the economics flip between stacks.

List prices are the wrong way to compare these three stacks. The right frame is per-task economics — what does a concrete workflow cost on each platform? The answer changes dramatically based on the number of steps and the number of screenshots per step, because the three pricing models scale differently.

Microsoft's Credit billing is step-count-based: 5 Credits per step on standard models (OpenAI CUA, Claude Sonnet 4.5/4.6), 15 Credits per step on premium models (Claude Opus 4.6 experimental). At $0.01 per Credit pay-as-you-go, that is $0.05 per step (standard) or $0.15 per step (premium). A 10-step web form workflow costs $0.50 on standard or $1.50 on premium. A 50-step research-and-summarize task costs $2.50 on standard or $7.50 on premium. The capacity pack (25,000 Credits for $200/month) reduces per-Credit cost to $0.008 — so a 10-step task at standard tier becomes $0.40. Microsoft's pricing is predictable and RPA-style: it scales with step count, not with how verbose the model is.
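The Credit arithmetic above reduces to a one-line formula. The sketch below encodes the prices quoted in this article; `credit_cost` and the constant names are illustrative.

```python
PAYG_PRICE_PER_CREDIT = 0.01
CAPACITY_PACK_PRICE_PER_CREDIT = 200 / 25_000  # $200 for 25,000 Credits = $0.008
CREDITS_PER_STEP = {"standard": 5, "premium": 15}


def credit_cost(steps: int, tier: str = "standard",
                price_per_credit: float = PAYG_PRICE_PER_CREDIT) -> float:
    """Dollar cost of a run billed per step in Copilot Credits."""
    return round(steps * CREDITS_PER_STEP[tier] * price_per_credit, 4)


print(credit_cost(10))                    # 10-step form, standard, pay-as-you-go
print(credit_cost(50, "premium"))         # 50-step task, premium
print(credit_cost(10, price_per_credit=CAPACITY_PACK_PRICE_PER_CREDIT))
```

Note that nothing about the model's verbosity or screenshot size appears in the formula, which is the sense in which this pricing is RPA-style.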

Anthropic's per-token pricing means cost scales with screenshot volume and context length. Each screenshot taken by the agent consumes vision tokens, and each API request carries 466–499 tokens of system prompt overhead plus 735 tokens per tool definition. A simple 10-step browser task at Sonnet 4.6 ($3/$15 per Mtok) with modest screenshots might cost $0.10–0.30. A 50-step legacy-desktop task with frequent screenshots and a large system prompt could reach $2.00–5.00 per run at Opus 4.7 ($5/$25 per Mtok). Token billing rewards short workflows and penalizes verbose, screenshot-heavy ones.
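A rough estimator makes the scaling visible. This sketch assumes the worst case, in which every request resends all earlier screenshots with no pruning or caching, and it assumes about 1,500 vision tokens per screenshot and 100 output tokens per step. Those two numbers are placeholders, not vendor figures; measure your own. The overhead uses the upper bound quoted above (499 + 735 tokens).

```python
def token_cost(input_tokens: int, output_tokens: int,
               in_per_mtok: float, out_per_mtok: float) -> float:
    """Dollar cost for a given token count at per-million-token prices."""
    return round(input_tokens / 1e6 * in_per_mtok
                 + output_tokens / 1e6 * out_per_mtok, 4)


def estimate_run(steps: int, screenshot_tokens: int, output_tokens_per_step: int,
                 in_per_mtok: float, out_per_mtok: float,
                 overhead_tokens: int = 499 + 735) -> float:
    """Estimate one run, assuming each request resends every prior screenshot."""
    input_total = 0
    for step in range(1, steps + 1):
        # Request number `step` carries the fixed overhead plus `step` screenshots.
        input_total += overhead_tokens + step * screenshot_tokens
    output_total = steps * output_tokens_per_step
    return token_cost(input_total, output_total, in_per_mtok, out_per_mtok)


# 10-step task at Sonnet 4.6 prices ($3/$15 per Mtok) under the assumptions above.
print(estimate_run(10, 1500, 100, 3.0, 15.0))
```

Because accumulated screenshots grow with the square of the step count in this model, doubling the steps more than doubles the bill, which is the mechanism behind the penalty on long, screenshot-heavy runs.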

Google's Gemini 2.5 Computer Use charges $1.25/$10 per Mtok (input/output), the lowest token price of the three. For browser-automation workflows where the agent makes fewer tool calls and takes smaller screenshots than a full desktop agent, Gemini 2.5 CU can be meaningfully cheaper than Anthropic at equivalent task complexity. The 131K context window and 64K max output also reduce truncation risks on long-horizon workflows. The constraint is browser-only execution — you cannot use this model for desktop automation regardless of the price.

The break-even point: for a 20-step workflow, Microsoft standard ($1.00) matches Anthropic Sonnet 4.6 at roughly 330K input tokens ($3 per Mtok), and at fewer once output tokens at $15 per Mtok are counted. A screenshot-heavy task will exceed that budget. For screenshot-light, API-based tasks, Anthropic's per-token billing can undercut Microsoft. Run your own step-count and screenshot estimates against these figures before committing to a pricing model, because the economics flip more often than vendor comparisons acknowledge.
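The break-even figure can be reproduced with a small helper that asks how many input tokens the Credit bill would buy at token prices. The defaults are the Microsoft standard tier and Sonnet 4.6 prices quoted in this article; the function name and the 10,000-token output figure in the second call are illustrative assumptions.

```python
def break_even_input_tokens(steps: int, credits_per_step: int = 5,
                            price_per_credit: float = 0.01,
                            in_per_mtok: float = 3.0,
                            output_tokens: int = 0,
                            out_per_mtok: float = 15.0) -> int:
    """Input tokens at which per-token billing equals per-step Credit billing."""
    credit_bill = steps * credits_per_step * price_per_credit
    output_bill = output_tokens / 1e6 * out_per_mtok
    remaining = max(credit_bill - output_bill, 0.0)
    return int(remaining / in_per_mtok * 1e6)


print(break_even_input_tokens(20))                        # input tokens only
print(break_even_input_tokens(20, output_tokens=10_000))  # after 10K output tokens
```

If your measured input tokens per run sit below the returned number, token billing is the cheaper side for that workflow; above it, Credit billing is.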

06 · Routing Guide
The decision tree: which stack for your use case.

The routing guide below maps the three stacks to the most common enterprise use cases. It is opinionated by design — every team that says "we need to evaluate all three" is burning engineering hours that a routing framework can recover.

Regulated enterprise
Use Microsoft Copilot Studio
Signal: already in Microsoft stack

If your team is already in Power Platform, Azure AD, and Purview, Copilot Studio computer use is the path of least resistance. Governance is pre-wired: DLP policies, RBAC, Purview audit logs, Azure Key Vault for credentials, and Outlook-based human-in-the-loop. You can deploy against Windows machine pools today with GA-level SLAs. The Windows-only constraint is rarely limiting for back-office ERP and CRM workflows. See our Copilot Studio GA deep dive for the full integration guide.

GA · Windows · Power Platform
Native desktop control
Use Anthropic Claude Computer Use
Signal: need Linux/macOS or non-Windows desktop

If your automation target is a native desktop application — SAP GUI, legacy Windows app on a non-Power Platform machine, macOS app, or a Linux-native workflow — Anthropic's Docker sandbox gives you the most control. You configure the container, you own the network perimeter, and you choose which model (Sonnet 4.6 for cost, Opus 4.7 for capability). The prompt-injection classifier adds a meaningful security layer for high-risk targets. The production deployment guide covers rollback patterns and screenshot token management.

Beta · Multi-OS · Docker sandbox
Browser automation
Use Google Gemini 2.5 CU
Signal: browser-only target, cost-sensitive

If your workflow is entirely browser-based — web scraping, form submission, SaaS app automation, or web testing — Gemini 2.5 Computer Use offers the lowest token pricing ($1.25/$10 per Mtok) and competitive browser-task benchmarks (WebVoyager 62.7%, Online-Mind2Web 59.4%). The 131K context window handles long-horizon browser sessions. The absence of a sandbox container is not a constraint when the execution target is a browser tab. For marketing automation use cases, see the Gemini 2.5 Computer Use for marketing automation guide.

Preview · Browser-only · Lowest price
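The three routing cards above can be read as a small decision function. This is one possible encoding, not an official rubric: `route` is a hypothetical helper, and it treats browser work inside a Microsoft-stack organization as a Copilot Studio case because that platform covers web apps on Windows.

```python
VALID_TARGETS = {"browser", "windows", "linux", "macos"}


def route(target: str, in_microsoft_stack: bool = False) -> str:
    """Pick a stack from the execution target and the existing platform footprint."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unknown target: {target}")
    # Copilot Studio is Windows-only but handles both web and desktop apps there.
    if in_microsoft_stack and target in {"windows", "browser"}:
        return "Microsoft Copilot Studio"
    # Browser-only and cost-sensitive: the preview model with the lowest token price.
    if target == "browser":
        return "Google Gemini 2.5 Computer Use"
    # Everything else needs a sandbox you control.
    return "Anthropic Claude Computer Use"


print(route("windows", in_microsoft_stack=True))
print(route("browser"))
print(route("macos", in_microsoft_stack=True))
```

The last call shows the one case where a Microsoft-stack team still routes elsewhere: a non-Windows desktop target.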

One important nuance for the Microsoft routing: Copilot Studio computer use embeds both OpenAI CUA and Anthropic Claude models under the hood. Teams that "standardize on Microsoft" are implicitly consuming Anthropic capacity — Claude Sonnet 4.5, Sonnet 4.6, and Opus 4.6 are all available in the Copilot Studio model menu. This two-sided dependency is not widely noted in enterprise architecture discussions. The practical implication: if your organization has Anthropic API agreements or data processing addendums, those obligations may extend to your Copilot Studio usage when Anthropic models are selected.

For teams already evaluating the broader OpenAI computer-use lineage, our guide to OpenAI's GPT-5.4 computer-use tool covers the model that Microsoft licenses for its CUA option in Copilot Studio, including the lineage from ChatGPT Operator (sunset Aug 31, 2025) to the current CUA API.

07 · Security & Governance
Security primitives: what each stack gives you and what you must build yourself.

Computer-use agents operate in a uniquely risky attack surface. Prompt injection via on-screen content — a malicious website embedding instructions in visible text that the agent reads via screenshot — is the most common attack vector. Each stack addresses it differently.

Anthropic: Claude's computer-use API includes prompt-injection classifiers that automatically scan screenshots for embedded injection attempts. On detection, the model steers to ask for user confirmation rather than executing the embedded instruction. This is a model-level defense that operates before the tool call executes. You still need to implement sandbox isolation (network egress controls, credential management, filesystem scope limits) at the container level. The Claude API docs provide a detailed risk section covering injection mitigations, credential handling, and the recommendation to always run in a sandboxed environment with limited system permissions.

Microsoft: Copilot Studio provides governance at the platform level: per-environment DLP policies that can block specific website categories, per-agent allowlists for which websites and desktop apps the agent may interact with, Azure Key Vault for credential storage (so no credentials live in the agent prompt), and Purview audit logs for every run. Human-in-the-loop is implemented via Outlook email review, a low-friction, enterprise-native checkpoint for low-confidence steps. This is the most complete out-of-the-box governance story of the three stacks.

Google: Gemini 2.5 Computer Use provides no built-in governance layer. The browser-only scope is itself a constraint that limits the blast radius — a browser agent that goes rogue cannot interact with native desktop applications. But there is no DLP layer, no audit log, no credential vault, and no human-in-the-loop primitive provided by Google. For enterprise deployments, the security stack is entirely your responsibility.
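On a stack with no built-in governance, the two smallest pieces to build first are an audit trail and an approval gate. The sketch below wraps a single action with both. All names are hypothetical, the set of actions that require review is an arbitrary example, and a real deployment would redact typed text and write to durable storage rather than an in-memory list.

```python
import json
import time
from typing import Any, Callable, Optional

# Example policy: anything that sends keystrokes needs a human decision.
RISKY_ACTIONS = frozenset({"type", "key"})


def guarded_execute(action: dict[str, Any],
                    execute: Callable[[dict[str, Any]], str],
                    approve: Callable[[dict[str, Any]], bool],
                    audit_log: list[str]) -> Optional[str]:
    """Run one action behind a human-in-the-loop gate and log the outcome."""
    needs_review = action["action"] in RISKY_ACTIONS
    approved = approve(action) if needs_review else True
    result = execute(action) if approved else None
    # One JSON line per action, whether it ran or was blocked.
    audit_log.append(json.dumps({
        "ts": time.time(),
        "action": action,
        "needs_review": needs_review,
        "approved": approved,
    }))
    return result
```

A blocked action returns `None` and still produces a log entry, so the audit trail records refusals as well as executions.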

For a detailed guardrail checklist covering all three stacks — including human-in-the-loop patterns, screenshot logging, and rollback procedures — see the Agent Computer Use: Enterprise Automation Playbook. That playbook is the companion to this post and covers the operational depth that this routing guide does not.

“The next chapter of enterprise AI isn't about chatting with assistants — it's about agents that actually do the work.”
Mustapha Lazrek, Microsoft · Microsoft Community Hub GA announcement, May 13, 2026

08 · Forward Projection
16-month arc and the 12-month convergence forecast.

The OSWorld trajectory from October 2024 to May 2026 is the most dramatic benchmark improvement in enterprise AI tooling since GPT-4 passed the bar exam in 2023. Claude Computer Use went from 14.9% (Sonnet 3.5, Oct 2024) to 28.0% (Sonnet 3.7) to 42.2% (Sonnet 4) to 61.4% (Sonnet 4.5) to 72.5% (Sonnet 4.6, Feb 2026) to 78.0% (Opus 4.7, Apr 2026). That is sub-15% to above the human baseline in 16 months (October 2024 to February 2026), with the caveat that the first three figures use the original OSWorld scale and the rest use OSWorld-Verified. The slope is the story, not any individual data point.

What the slope implies is that the benchmark ceiling will be hit before the market ceiling. If Opus 4.7 and Gemini 3.5 Flash are both already at 78%, and GPT-5.5 is at 78.7%, the next meaningful innovation is not a higher OSWorld score — it is depth of integration with enterprise systems, reliability on multi-hour tasks, and cost reduction at scale. The vendor that wins the enterprise computer-use category over the next 12 months will do so on governance, integrations, and pricing, not on benchmark scores. That framing strongly favors Microsoft in regulated industries (governance is already built) and Anthropic in technical teams (the Docker architecture gives the most integration flexibility).

For Google, the key inflection point is whether Gemini 3.5 Flash or a successor model exposes a Computer Use API. If Google ships a 78.4%-class model with a Computer Use API surface — combining the best benchmark score of the three with the lowest token pricing — the routing calculus changes significantly. As of May 24, 2026, that model does not exist in the developer API. Watch for it in the next Gemini release cycle.

One structural pattern that will shape the next 12 months: the Copilot Studio model menu already includes Claude Sonnet 4.5 and Sonnet 4.6 alongside the OpenAI CUA option. As Anthropic ships stronger models, those models will likely appear in the Copilot Studio menu without requiring Microsoft to build any new capability. The Claude Opus 4.7 reference guide covers the model's full capability profile including why it sits at 78.0% OSWorld-Verified and what that score does and does not predict about production performance.

Our view: the benchmark convergence above human baseline is a threshold event, not a steady state. The period from H2 2026 through H1 2027 will be characterized by reliability and governance improvements rather than raw benchmark gains, much like how LLM development shifted from perplexity improvements to RLHF and safety work in 2023. Teams that use this window to build the operational infrastructure (logging, HITL, credential management, rollback) will compound their advantage when the next model generation arrives. Our AI transformation services help enterprise teams build that infrastructure before it is table stakes.

Conclusion

Routing beats benchmarking: match the stack to the architecture, not the leaderboard.

Frontier models from Anthropic, Google, and OpenAI have all crossed the human OSWorld baseline. The frontier-model scores (78.0% for Opus 4.7, 78.4% for Gemini 3.5 Flash, which has no CU API, and 78.7% for GPT-5.5, self-reported) are within noise of each other. Choosing a stack based on a 0.7-point benchmark delta is the wrong frame. The right frame is architecture fit: what OS does your target run on, who governs the execution, how does cost scale with your workflow profile, and what security primitives does the vendor provide versus what you must build yourself?

Microsoft Copilot Studio is the answer for regulated enterprise teams already inside Power Platform — the governance is pre-wired and the GA status means production SLAs exist. Anthropic Claude Computer Use is the answer for teams that need native desktop control across operating systems and want the most architectural flexibility — the Docker sandbox and prompt-injection classifiers are genuinely production-grade. Google Gemini 2.5 Computer Use is the answer for browser-automation workflows where token cost and latency matter more than desktop-app control — its pricing is the lowest of the three and its browser benchmarks are competitive.

The 12-month outlook points toward governance and reliability differentiation rather than benchmark gains. The teams that invest now in HITL checkpoints, audit logging, credential management, and rollback patterns — regardless of which stack they choose — will have a meaningful head start when computer-use agents become a standard component of enterprise workflow automation. That window is open now.

FAQ · Computer-Use Agents

The questions teams ask when choosing a computer-use stack.

How does Microsoft Copilot Studio computer use differ from Anthropic Claude Computer Use?

The two stacks differ in three key dimensions. Architecture: Copilot Studio runs against a Power Automate–registered Windows machine with per-environment DLP, Azure Key Vault, and Purview audit logs built in; Anthropic runs against a Docker container you operate, with Claude sending tool calls (screenshot, click, type) to your agent loop. Pricing: Copilot Studio charges 5 Credits per step on standard models and 15 Credits per step on Claude Opus 4.6 premium (at $0.01/Credit pay-as-you-go); Anthropic charges per token — Opus 4.7 at $5/$25 per Mtok, Sonnet 4.6 at $3/$15 per Mtok. OS scope: Copilot Studio targets Windows only; Anthropic's Docker container can run Linux, and you can configure Windows or macOS targets in custom sandboxes. Microsoft is the better fit for regulated enterprise teams already in the Power Platform ecosystem. Anthropic is the better fit for teams needing cross-OS desktop control or custom infrastructure.