AI Development · Decision Matrix · 4 min read · Published Apr 28, 2026

5 stacks · 4 reference workloads · reliability benchmarks, runtime control, and cost-per-task data

Browser Automation AI Agents: Playwright vs Stagehand

Five browser-control agent stacks dominate 2026: Playwright + Claude (DX leader, deterministic + agentic), Stagehand (cleanest abstraction over Playwright), Browserbase (managed runtime + CDP-as-a-service), Anthropic Computer Use (vision-driven, screen control), and OpenAI Computer-Using-Agent (cloud-only, OpenAI-locked). Pick by reliability, runtime control, and cost.

Digital Applied Team
Senior strategists · Published Apr 28, 2026
Sources: Vendor docs · Browserbase + Stagehand benchmarks · field tests
Playwright + Claude: 92% common-task reliability · DX leader
Computer Use (vision): 78% · screen-only, no DOM access
Cost per task: $0.02-$0.40 depending on stack + duration
Managed cloud runtimes: 3 of 5 (Browserbase · CUA · Stagehand cloud)

Browser-automation agents bifurcated in 2025-2026 into DOM-driven approaches (Playwright + Claude, Stagehand, Browserbase) and vision-driven approaches (Anthropic Computer Use, OpenAI CUA). The DOM-driven stacks are 12-17 percentage points more reliable on common tasks; the vision-driven stacks unlock workloads the DOM-driven stacks can't reach (canvas-only apps, image-driven UIs, anti-bot screens).

We compare five stacks across reliability, runtime locality, cost, DX, and best-fit workload. Most teams default to a DOM-driven stack (Playwright + Claude or Stagehand) for the 80% of workloads it covers, and reach for a vision-driven stack only when DOM access fails.

This post covers the 7-axis matrix, deep dives on each stack, and four reference workloads we run for clients today — data extraction, form-filling automation, QA testing, and competitive intelligence.

Key takeaways
  1. DOM-driven stacks beat vision-driven stacks on common-task reliability by 12-17 points. Playwright + Claude scores 92% reliability on common browser-automation tasks; Stagehand 89%; Browserbase 90%. Anthropic Computer Use scores 78%; OpenAI CUA scores 75%. The gap is real and persists across task types: DOM access is more reliable than vision-driven inference for the 80% of tasks where the DOM is available. Use vision-driven stacks only for workloads where DOM access fails.
  2. Playwright + Claude is the DX leader — deterministic + agentic + cheapest at scale. Playwright is the deterministic web-automation gold standard; pairing it with Claude (or another LLM) for natural-language task definition and DOM reasoning produces the cleanest developer experience in the field. Self-hosted Playwright + LLM API costs $0.02-0.10/task, the cheapest at scale. The right primary for engineering teams that own their automation infrastructure.
  3. Stagehand is the cleanest dev abstraction — Playwright underneath, agent ergonomics on top. Stagehand, by Browserbase, wraps Playwright with agent-friendly methods (act, observe, extract). The abstraction reduces boilerplate by 60-70% vs raw Playwright + LLM glue and pairs naturally with the Browserbase managed runtime. The right pick for teams that want agent ergonomics without hand-rolling the LLM-to-Playwright glue.
  4. Browserbase is the managed-runtime leader — pay-per-minute browser-as-a-service. Browserbase offers managed Chromium with CDP access at $0.10-0.40/browser-minute, covering managed CAPTCHA solving, anti-bot evasion, residential proxy support, and session recording. The right pick when self-hosting Playwright at scale becomes operational toil; under 100 browser-hours/month, self-hosted Playwright wins on cost.
  5. Computer Use + CUA are vision-driven — for workloads where the DOM fails. Anthropic Computer Use and OpenAI Computer-Using-Agent operate on screen pixels rather than the DOM, reaching workloads that DOM-driven stacks can't: canvas-heavy apps, image-driven UIs, anti-bot screens that obscure the DOM. Trade-offs: a 12-17 point reliability gap to DOM-driven stacks on common tasks, 4-8x cost, and a cloud-only runtime for CUA. Use as fallback, not primary.

01 · The Field: the 2026 browser-agent field.

Browser-automation agents are at the intersection of three ecosystems: deterministic web-automation (Playwright, Puppeteer, Selenium), LLM-driven reasoning (Claude, GPT-5.5, Gemini 3), and cloud-runtime infrastructure (Browserbase, Apify, ScrapingBee). By 2026, the production-grade stacks combine pieces from each — the five we compare here represent the dominant combinations.

The five stacks split on two primary axes: DOM-driven vs vision-driven (which surface the agent operates on), and self-hosted vs managed runtime (where the browser actually runs). DOM-driven self-hosted (Playwright + Claude) is the cheapest and highest-reliability default; managed runtimes pay back when scale becomes operational toil; vision-driven stacks unlock workloads the DOM-driven stacks can't reach.

Stack 1
Playwright + Claude — DX leader
Self-hosted · DOM-driven · LLM API only

Deterministic Playwright + Claude (or any frontier LLM) for natural-language task definition + DOM reasoning. Self-hosted runtime; pay only for LLM API calls. Cheapest at scale; cleanest DX for engineering-owned infrastructure.

Engineering teams
Stack 2
Stagehand — agent abstraction
act/observe/extract API · Browserbase or self-hosted

Stagehand wraps Playwright with agent-friendly methods. Reduces boilerplate by 60-70% vs raw Playwright + LLM glue. Pairs naturally with Browserbase. Right pick for teams that want agent ergonomics without DIY plumbing.

DX-first abstraction
Stack 3
Browserbase — managed runtime
Cloud Chromium · CDP-as-a-service · $0.10-0.40/min

Pay-per-minute managed Chromium. CAPTCHA solving, anti-bot evasion, residential proxies, session recording. Right pick when self-hosting at scale becomes operational toil.

Managed runtime
Stack 4
Anthropic Computer Use
Vision-driven · Claude · screen control

Operates on screen pixels instead of the DOM. Reaches workloads DOM-driven stacks can't (canvas apps, image UIs, anti-bot screens). 78% reliability on common tasks; runs against any browser the agent can see.

Vision-driven
Stack 5
OpenAI Computer-Using-Agent
Cloud-only · OpenAI · vision-driven

OpenAI's vision-driven counterpart. Cloud-only runtime; OpenAI-locked. 75% reliability on common tasks. Right pick when OpenAI lock-in is acceptable and the workload is vision-driven.

OpenAI vision

02 · Matrix: feature matrix, five stacks.

The matrix below covers the seven capabilities that drive 2026 browser-agent decisions: reliability on common tasks, runtime locality, DOM vs vision surface, cost per task, DX (developer experience), provider posture, and best-fit workload.

Capability
Reliability on common tasks

Playwright + Claude wins (92%). Browserbase 90%, Stagehand 89%, Computer Use 78%, CUA 75%. The DOM-driven stacks lead by 12-17 percentage points on common tasks. Use vision-driven stacks only when DOM access fails.

Playwright + Claude
Capability
Runtime locality (self-hosted vs managed)

Playwright + Claude: self-hosted only. Stagehand: self-hosted or Browserbase managed. Browserbase: managed only. Computer Use: any browser the agent can see (most flexible). CUA: cloud-only OpenAI runtime.

Computer Use most flexible
Capability
DOM vs vision surface

DOM-driven (more reliable for 80% of tasks): Playwright + Claude, Stagehand, Browserbase. Vision-driven (reaches workloads DOM can't): Anthropic Computer Use, OpenAI CUA. Pick DOM-driven first; fall back to vision when DOM fails.

DOM-driven for 80%
Capability
Cost per task

Playwright + Claude $0.02-0.10/task (cheapest at scale). Stagehand $0.05-0.15. Browserbase $0.10-0.40 (browser-minute pricing). Computer Use $0.20-0.40 (vision tokens). CUA $0.20-0.50 (vision + OpenAI premium).

Playwright + Claude
Capability
Developer experience (DX)

Stagehand wins on agent abstractions (act/observe/extract reduce boilerplate 60-70%). Playwright + Claude wins on flexibility. Browserbase + Stagehand together produce the cleanest managed-runtime DX. Computer Use + CUA have minimal DX surface — they're vision-only.

Stagehand (abstraction) · Playwright (flexibility)
Capability
Provider posture (lock-in)

Playwright + Claude: provider-flexible (Claude swappable for any LLM). Stagehand: provider-flexible. Browserbase: tied to Browserbase managed runtime. Computer Use: Anthropic-only model. CUA: OpenAI-only model + cloud.

Playwright (most flexible)
Capability
Best-fit workload

Playwright + Claude: engineering-team workflows, scale-cost-sensitive. Stagehand: any team that wants agent abstractions. Browserbase: managed runtime needs (anti-bot, CAPTCHA). Computer Use: canvas/image-driven UIs. CUA: vision tasks where OpenAI lock-in is acceptable.

Match workload

03 · Playwright + Claude — the DX leader.

Playwright is the deterministic web-automation gold standard. Pairing it with Claude (or another frontier LLM) for natural-language task definition and DOM reasoning produces the cleanest developer experience for engineering-owned automation. Self-hosted runtime; pay only for LLM API calls. The combination is the cheapest scale-out path and remains the highest-reliability default for DOM-accessible workloads.

Strength
92%
Highest common-task reliability

Playwright's deterministic browser control combined with Claude's DOM reasoning hits 92% reliability on common automation tasks (data extraction, form filling, navigation). The gap to vision-driven stacks (75-78%) is real and persists across task families.

Reliability leader
Strength
$0.02
Cheapest at scale

Self-hosted Playwright + Claude API calls land at $0.02-0.10 per task. At 1000+ tasks/month, the cost gap to managed runtimes (Browserbase $0.10-0.40/browser-minute, CUA $0.20-0.50/task) compounds. The right default for high-volume automation.

Cost leader
Trade-off
DIY
Operational ownership at scale

Self-hosted Playwright requires browser-fleet management at scale: CAPTCHA handling, anti-bot evasion, residential proxy rotation, session recording. Pays back on cost but adds operational toil. Above 100 browser-hours/month, evaluate managed runtimes.

Self-hosted toil
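The control loop behind this stack is small. Here is a minimal sketch with the browser and the LLM injected as plain callables so the loop itself stays testable — the function names, the `Action` format, and the loop shape are our illustration, not a Playwright or Anthropic API:

```python
# Minimal DOM-driven agent loop: observe DOM -> ask LLM for next action -> execute.
# `observe`, `decide`, and `execute` are injected callables (in production:
# Playwright page methods and an LLM API call). All names are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str        # "click" | "fill" | "done" (illustrative action set)
    selector: str = ""
    value: str = ""

def run_task(task: str,
             observe: Callable[[], str],            # returns a DOM snapshot
             decide: Callable[[str, str], Action],  # (task, dom) -> next Action
             execute: Callable[[Action], None],
             max_steps: int = 10) -> bool:
    """Drive the browser until the LLM signals completion or the budget runs out."""
    for _ in range(max_steps):
        action = decide(task, observe())
        if action.kind == "done":
            return True
        execute(action)
    return False  # step budget exhausted: count this as a reliability failure
```

In production, `observe` would return a trimmed `page.content()` snapshot and `execute` would dispatch to `page.click` / `page.fill`; the `max_steps` cap is what turns a wandering agent into a measurable reliability number.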
"Playwright + Claude wins on reliability and cost. Stagehand wins on developer experience. Browserbase wins on operational simplicity. Pick by which axis hurts most."— Internal browser-agent retro, March 2026

04 · Stagehand — the agent abstraction leader.

Stagehand by Browserbase wraps Playwright with agent-friendly methods (act, observe, extract). The abstraction reduces boilerplate by 60-70% vs raw Playwright + LLM glue. Pairs naturally with Browserbase managed runtime but works with self-hosted Playwright too. Right pick for teams that want agent ergonomics without hand-rolling the LLM-to-Playwright integration.

Strength
60%
Boilerplate reduction

Stagehand's act/observe/extract methods reduce boilerplate by 60-70% vs raw Playwright + LLM glue. The abstraction is well-designed — high enough to remove plumbing, low enough that the underlying Playwright is still accessible when needed.

DX abstraction
Strength
89%
Near-Playwright reliability

89% common-task reliability — only 3 points below raw Playwright + Claude (92%). The abstraction overhead is small. Pairs naturally with Browserbase for managed runtime; works with self-hosted Playwright too.

Reliable abstraction
Trade-off
Newer
Younger ecosystem

Stagehand is younger than raw Playwright. Community size is smaller; debugging resources are thinner; edge cases occasionally surface where the abstraction layer adds friction. The trade-off is minor for most workloads but real for the truly unusual.

Younger ecosystem
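The ergonomics win is easiest to see in code. Below is a shape-only sketch of the act/observe/extract surface — the three method names mirror what Stagehand exposes, but this is NOT the Stagehand SDK; the driver is a stub standing in for the Playwright + LLM layer:

```python
# Shape-only sketch of Stagehand-style ergonomics: natural-language act /
# observe / extract over a lower-level driver. Method names follow Stagehand's
# public surface; everything else here is an illustrative stand-in.
from typing import Any, Callable

class AgentPage:
    def __init__(self, driver: Callable[[str, str], Any]):
        self._driver = driver  # (operation, instruction) -> result

    def act(self, instruction: str) -> None:
        """One natural-language step, e.g. 'click the login button'."""
        self._driver("act", instruction)

    def observe(self, instruction: str) -> list:
        """Ask what is actionable, e.g. 'find all navigation links'."""
        return self._driver("observe", instruction)

    def extract(self, instruction: str) -> dict:
        """Pull structured data, e.g. 'extract product name and price'."""
        return self._driver("extract", instruction)
```

The 60-70% boilerplate reduction comes from this surface: each call replaces the selector-hunting, prompt-building, and response-parsing glue you would otherwise hand-roll per step.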

05 · Browserbase — the managed runtime leader.

Browserbase offers managed Chromium with CDP access as a service. Pay-per-minute pricing ($0.10-0.40/browser-minute) covers managed CAPTCHA solving, anti-bot evasion, residential proxy support, and session recording. Right pick when self-hosting Playwright at scale becomes operational toil — typically above 100 browser-hours/month.

Strength
Managed CAPTCHA + anti-bot evasion

Browserbase handles the operational layer that breaks self-hosted Playwright at scale: CAPTCHA solving via vendor partnerships, anti-bot evasion patterns, fingerprint randomization, residential proxies. Pays back at any scale where these become recurring failures.

Operational simplicity
Strength
Pay-per-minute economics

$0.10-0.40 per browser-minute (varies by features). At 100+ browser-hours/month the cost is meaningful but pays back on operational simplicity. Below ~100 hours/month, self-hosted Playwright wins on cost. Crossover point depends on the team's ops capacity.

Scale-cost trade
Trade-off
Cost ramps with scale

At 1000+ browser-hours/month, managed-runtime costs compound vs self-hosted Playwright + Claude. The crossover point depends on how much ops time the team has — if engineering capacity is constrained, Browserbase keeps paying back at higher volumes.

Scale-dependent
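The crossover arithmetic is simple enough to sketch. Self-hosted cost is LLM spend plus engineering ops time; managed cost is browser-minutes. The per-unit rates below are the ranges quoted in this post; the ops-hours and engineer-rate figures are assumptions you should replace with your own:

```python
# Crossover sketch: self-hosted Playwright vs managed browser-minutes.
# Rates follow the ranges in this post; ops-hours/eng-rate are assumptions.
def monthly_cost_self_hosted(tasks: int, llm_cost_per_task: float,
                             ops_hours: float, eng_rate_per_hour: float) -> float:
    """Self-hosted Playwright: LLM API spend plus engineering ops time."""
    return tasks * llm_cost_per_task + ops_hours * eng_rate_per_hour

def monthly_cost_managed(browser_minutes: float, rate_per_minute: float) -> float:
    """Managed runtime: pay-per-browser-minute, ops included."""
    return browser_minutes * rate_per_minute

# Example: 5,000 one-minute tasks/month (~83 browser-hours), $0.05/task LLM
# spend, 5 ops-hours at $100/h, vs a managed mid-range rate of $0.25/min.
self_hosted = monthly_cost_self_hosted(5000, 0.05, 5, 100)  # 750.0
managed = monthly_cost_managed(5000 * 1.0, 0.25)            # 1250.0
```

At these assumed numbers self-hosting wins, consistent with the sub-100-hour guidance above; push ops time toward 20+ hours/month and the managed runtime flips ahead, which is exactly the "pick by ops capacity" point.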

06 · Computer Use + CUA — vision-driven.

Anthropic Computer Use and OpenAI Computer-Using-Agent operate on screen pixels rather than DOM. They reach workloads that DOM-driven stacks can't (canvas-heavy apps, image-driven UIs, anti-bot screens that obscure DOM). The trade-offs are real: 12-17 point reliability gap to DOM-driven on common tasks, 4-8x cost, and cloud-only runtime for CUA. Use as fallback, not primary.

Anthropic Computer Use
Vision-driven · Claude · 78% reliability

Operates on screen pixels with Claude reasoning. Runs against any browser the agent can see — local Chrome, headless, remote VM. 78% reliability on common tasks. Pairs naturally with Claude Code or the Anthropic API as the agent surface. Right when the DOM fails or the workload is canvas-heavy.

Vision · flexible runtime
OpenAI CUA
Cloud-only · OpenAI · 75% reliability

OpenAI's vision-driven counterpart. Cloud-only runtime — agent runs in OpenAI's managed VMs. OpenAI-locked. 75% reliability on common tasks. Right when OpenAI lock-in is acceptable and the team values managed-runtime simplicity over flexibility.

OpenAI-native vision
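The fallback-not-primary rule reduces to a small piece of routing code: try the DOM-driven executor first, and hand off to the vision-driven executor only when the DOM path fails. Both executors are injected; the exception and function names are our illustration:

```python
# DOM-first routing with a vision fallback. Executors are injected callables
# (in production: a Playwright/Stagehand runner and a Computer Use / CUA
# runner). Names here are illustrative, not a vendor API.
from typing import Any, Callable

class DomUnavailable(Exception):
    """Raised when the DOM path fails: canvas-only UI, obscured DOM, anti-bot."""

def run_with_fallback(task: str,
                      dom_executor: Callable[[str], Any],
                      vision_executor: Callable[[str], Any]) -> tuple:
    """Return (path_used, result). DOM first; vision only on DOM failure."""
    try:
        return ("dom", dom_executor(task))
    except DomUnavailable:
        return ("vision", vision_executor(task))
```

Logging `path_used` per task is worth the extra tuple: the dom/vision split across a month of traffic tells you whether the 4-8x vision premium is staying in the fallback lane where it belongs.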

07 · Reference Workloads: four reference workloads.

Below are the four browser-automation workloads we deploy most often for client engagements, with the stack recommendation that consistently wins on each. The mapping isn't absolute, but each pairing is the path of least friction.

Workload 1
Data extraction (structured scraping)

Pull structured data from a list of URLs. DOM-driven; tasks are short and high-volume. Cost matters most. Playwright + Claude is the default; self-hosted runtime; pay only for LLM API calls. Add Browserbase if anti-bot evasion becomes a bottleneck.

Playwright + Claude
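The shape of this workload is worth sketching: fan out over URLs, extract one record per page, and validate each record against the expected fields so LLM-extraction failures are counted rather than silently ingested. `fetch_record` is injected (in production, a Playwright page plus an LLM extract call); the schema is a hypothetical example:

```python
# High-volume extraction sketch: per-URL fetch + schema check. The reliability
# figures in this post are exactly valid/total from a loop like this one.
from typing import Callable, Iterable

REQUIRED_FIELDS = ("name", "price")  # illustrative schema, swap in your own

def extract_all(urls: Iterable[str],
                fetch_record: Callable[[str], dict]) -> tuple:
    """Return (valid_records, failed_urls); reliability = valid / total."""
    valid, failed = [], []
    for url in urls:
        record = fetch_record(url)
        if all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS):
            valid.append(record)
        else:
            failed.append(url)  # re-queue, or route to a fallback stack
    return valid, failed
```

Because tasks are short and high-volume, the `failed` list is also your trigger for escalation: persistent failures on the same domain usually mean anti-bot friction, which is the Browserbase hand-off point named above.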
Workload 2
Form-filling automation (cross-app workflows)

Fill complex forms, navigate multi-step workflows, handle error states. DOM-driven; medium duration; agent ergonomics matter. Stagehand (with Browserbase or self-hosted Playwright) wins on developer experience and reliability for these workloads.

Stagehand + Browserbase
Workload 3
QA testing (visual + functional)

Run end-to-end tests against a web app. DOM-driven for functional tests; vision-driven for visual regression. Playwright + Claude for functional flows; layer Computer Use for visual regression where DOM diffing isn't enough.

Playwright + Claude (+ Computer Use)
Workload 4
Competitive intelligence (anti-bot heavy sites)

Pull data from sites with active anti-bot defenses (price-aggregator, booking, ticketing). DOM access is intermittently blocked; managed runtime helps. Browserbase + Stagehand handles most cases; fall back to Computer Use when DOM is fully obscured.

Browserbase + Stagehand (+ Computer Use)

08 · Conclusion: pick by workload + ops capacity, not novelty.

Browser-automation agents, April 2026

There is no single best browser-agent stack. There are right defaults per workload and ops capacity.

By April 2026 the browser-automation field has consolidated to five production-grade stacks: Playwright + Claude, Stagehand, Browserbase, Anthropic Computer Use, and OpenAI CUA. Each occupies a different spot on the trade-off surface, and each wins on its home territory. There is no "best" stack in the abstract; there is the right default for the workload and the team's ops capacity.

The pattern that scales: pick the DOM-driven stack first (Playwright + Claude or Stagehand) for the 80% of workloads it covers. Add Browserbase as managed runtime when self-hosting becomes operational toil (typically above 100 browser-hours/month). Reach for vision-driven stacks (Computer Use or CUA) only when DOM access fails — canvas apps, image UIs, anti-bot screens that obscure DOM.

The right move for most engineering teams: standardize on Playwright + Claude as the primary; add Browserbase when scale demands it; layer Computer Use as the vision-driven fallback. The three-stack pattern covers ~95% of browser-automation workloads with disciplined cost economics and a single primary mental model.
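The three-stack default collapses into one routing function. The workload flags mirror the decision axes above and the 100 browser-hours/month threshold comes from the crossover discussed earlier; treat this as illustrative policy code, not a universal rule:

```python
# The three-stack default as a routing policy. Flags and the 100-hour
# threshold follow this post's decision axes; tune them to your own numbers.
from dataclasses import dataclass

@dataclass
class Workload:
    dom_available: bool
    browser_hours_per_month: float
    anti_bot_heavy: bool = False

def pick_stack(w: Workload) -> str:
    if not w.dom_available:
        return "computer-use"                       # vision-driven fallback
    if w.anti_bot_heavy or w.browser_hours_per_month > 100:
        return "playwright+claude on browserbase"   # managed runtime pays back
    return "playwright+claude self-hosted"          # cheap, reliable default
```

A function like this is mostly useful as a forcing device: it makes the team write down, per workload, which axis (DOM access, scale, anti-bot pressure) is actually driving the stack choice.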

Production browser agents

Move past stack debates. Pick by workload shape.

We design and operate browser-automation agent stacks across Playwright + Claude, Stagehand, Browserbase, and Computer Use — covering stack selection by workload, runtime architecture, anti-bot strategy, and cost economics.

Free consultation · Expert guidance · Tailored solutions
What we work on

Browser-agent engagements

  • Stack selection by workload + ops capacity
  • Playwright + Claude self-hosted infrastructure
  • Stagehand + Browserbase migration paths
  • Computer Use vision-driven fallback design
  • Anti-bot + CAPTCHA strategy
FAQ · Browser-automation agents 2026

The questions we get every week.

Should we default to DOM-driven or vision-driven?

Default to DOM-driven. The 12-17 percentage point reliability gap to vision-driven is real and persists across task types — DOM access is more reliable, cheaper, and easier to debug for the 80% of tasks where the DOM is available. Use vision-driven (Anthropic Computer Use, OpenAI CUA) as the fallback when the DOM fails: canvas-heavy apps, image-driven UIs, anti-bot screens that obscure the DOM. The hybrid pattern that scales: a DOM-driven primary (Playwright + Claude or Stagehand) for the bulk of workloads, with a vision-driven fallback for the workloads that need it. Most production browser-automation deployments end up with this two-stack pattern.