Gemini 3.5 Flash computer use is now a native, built-in tool — announced June 24, 2026 as a public preview — letting a single Google production model see a screen and operate a browser, mobile device, or desktop without routing to a separate computer-use model. That architectural shift, not the headline benchmark, is the story: perception and action collapse into one inference pass.
This is not the standalone Gemini 2.5 Computer Use model that shipped in October 2025. That was a separate preview built on Gemini 2.5 Pro, browser-focused, capped at 128K tokens of context, and it forced developers to route between it and their main model. The June 24 update folds the capability directly into Gemini 3.5 Flash — Google’s fastest, most economical production model — and pairs it with a million-token context window.
This guide covers what actually shipped and how it differs from the legacy standalone model, an honest read of the OSWorld-Verified benchmark (every score on that board is self-reported), the cost case against GPT-5.5 specifically, the seven configurable safety categories you must lock down before deploying, and how a marketing or operations team can put it to work.
- 01. Computer use is native now, not a separate model. Announced June 24, 2026 as a public preview, computer use is built into gemini-3.5-flash alongside function calling, Google Search grounding, and Maps. One agent can see, reason, and act across browser, mobile, and desktop with no model-hopping.
- 02. It is a preview, not general availability. The Gemini API changelog labels computer use a public preview as of June 24, 2026. Treat it as pre-production: prototype against it, but verify current status before you wire it into anything that touches real customer data or spend.
- 03. The benchmark is effectively a tie, and self-reported. Gemini 3.5 Flash posts 78.4 on OSWorld-Verified versus GPT-5.5’s 78.7, a gap of 0.3 points. Every score on the OSWorld-Verified board is self-reported by the model provider, with no independent third-party verification as of June 2026.
- 04. The cost case is specific to GPT-5.5. At $1.50 input / $9 output per million tokens, Gemini 3.5 Flash is priced at exactly 30% (roughly one-third) of GPT-5.5’s $5 / $30. That comparison holds against GPT-5.5, not against every frontier model, which all price differently.
- 05. Safeguards exist, but most are opt-in. The API offers seven configurable safety categories and a safety_decision on every action, plus opt-in enforced user confirmation and automatic task termination on detected prompt injection. Google’s own caveat is the honest one: “no single safeguard is foolproof.”
01 — What Shipped
Computer use as a built-in tool, across three environments.
The change is small in the API and large in architecture. Where a developer previously called a dedicated computer-use model, computer use is now a tool you switch on inside Gemini 3.5 Flash — the same model you already use for function calling and grounding. You declare it with tools=[{"type": "computer_use", "environment": "browser|mobile|desktop"}], and the model returns actions on a normalized 0–1000 coordinate scale that you denormalize to the real screen.
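The denormalization step can be sketched as follows. This is a minimal illustration, not the SDK's own helper: the function name `denormalize` is hypothetical, and clamping the far edge (1000) to the last pixel is an assumption about how you would want to treat the boundary.

```python
def denormalize(x_norm: int, y_norm: int, width: int, height: int) -> tuple[int, int]:
    """Map a point on the normalized 0-1000 grid to pixel coordinates.

    Assumption: 0 maps to the first pixel and 1000 is clamped to the last
    pixel, so the result is always a valid on-screen coordinate.
    """
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        raise ValueError("normalized coordinates must be in the range 0..1000")
    if width <= 0 or height <= 0:
        raise ValueError("screen dimensions must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    x = min(x_norm * width // 1000, width - 1)
    y = min(y_norm * height // 1000, height - 1)
    return x, y


# Example: the centre of the grid on a 1920x1080 screen.
center = denormalize(500, 500, 1920, 1080)
```

Keeping this conversion in one place means the same model output can drive screens of different sizes without changes elsewhere in the executor.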
One honest framing point up front: Gemini 3.5 Flash is not strictly the first Gemini model with built-in computer use — Gemini 3 Flash Preview supports it too. What June 24 delivered is the recommended and most capable built-in computer-use model, with browser, mobile, and desktop support and Google’s strongest stated agentic numbers to date.
For a developer the ergonomic change is concrete. Switching computer use on is a tool declaration, not a second integration: the model that already reasons about your task now also returns the click, the keystroke, and the scroll, in the same response stream as its function calls. The standalone path asked you to maintain two model clients, marshal context between them, and reconcile their separate failure modes. Collapsing that into one model is the kind of simplification that changes what a small team can actually ship — fewer moving parts, one bill, and one place to debug when an agent does the wrong thing.
Browser: web automation
Click, double-click, right-click, type, scroll, navigate, drag-and-drop, hotkeys, and screenshot make up the full vocabulary for driving a dashboard, ad console, or CRM in a real browser session.
Mobile: app-level control
A separate action set tuned for devices: open_app, list_apps, click, type, long_press, drag-and-drop, press_key, go_back, and take_screenshot. The same model, a different interaction grammar.
Desktop: OS-level operation
Desktop shares the browser action vocabulary and extends control beyond the tab to the operating system, the environment the standalone 2.5 model was not optimized for.
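One way to keep an executor honest about which actions belong to which environment is a small lookup table. The sketch below is illustrative only: the action strings are labels normalized from the lists above, not confirmed API identifiers, and `is_supported` is a hypothetical helper.

```python
# Labels normalized from the article's lists; treat them as assumptions,
# not as the exact identifiers the API returns.
BROWSER_ACTIONS = frozenset({
    "click", "double_click", "right_click", "type", "scroll",
    "navigate", "drag_and_drop", "hotkey", "screenshot",
})
MOBILE_ACTIONS = frozenset({
    "open_app", "list_apps", "click", "type", "long_press",
    "drag_and_drop", "press_key", "go_back", "take_screenshot",
})

# Desktop shares the browser vocabulary, per the description above.
ACTIONS_BY_ENVIRONMENT = {
    "browser": BROWSER_ACTIONS,
    "desktop": BROWSER_ACTIONS,
    "mobile": MOBILE_ACTIONS,
}


def is_supported(environment: str, action: str) -> bool:
    """Return True if the action label belongs to the environment's vocabulary."""
    try:
        return action in ACTIONS_BY_ENVIRONMENT[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
```

Rejecting an action that does not belong to the declared environment before executing it is a cheap guard against a mismatched tool declaration.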
02 — What Changed
From a separate model to a native capability.
If you read our earlier guide to the standalone Gemini 2.5 Computer Use model, this is the next chapter — and a genuine architectural break, not a version bump. The 2.5 model (gemini-2.5-computer-use-preview-10-2025) was a discrete preview built on Gemini 2.5 Pro, optimized for the browser, and limited to 128K tokens of context. To use it you ran a two-model setup: a main model for reasoning, the computer-use model for screen actions, and your own glue code routing between them.
Native integration removes that hop. The same Gemini 3.5 Flash instance reasons and acts, carries a 1M-token context window — an eight-fold expansion over the standalone model’s 128K — and adds an intent field on every action that the 2.5 model did not emit. The table below traces the full lineage so the shift is visible at a glance.
| Model ID | Computer-use integration | Environments | Input context | Intent field | OSWorld-Verified |
|---|---|---|---|---|---|
| gemini-2.5-computer-use-preview-10-2025 | Standalone preview model (Oct 7, 2025), built on Gemini 2.5 Pro | Browser-focused | 128K tokens | No | Not on the board |
| gemini-3-flash-preview | Built-in tool (preview) | Browser · mobile · desktop | 1M tokens | Yes | 65.1 (self-reported) |
| gemini-3.5-flash | Built-in tool · native · recommended (CU added Jun 24, 2026) | Browser · mobile · desktop | 1M tokens | Yes | 78.4 (self-reported) |
The eight-fold context jump is easy to under-read as a spec line, but it is the difference between an agent that remembers only the last few screenshots and one that holds an entire workflow in view. For a long-horizon task — search-ad setup, then the landing page, then the CRM record — 1M tokens lets the agent keep the whole chain coherent rather than losing the thread between steps.
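A rough way to reason about that difference is to count how many loop steps fit in each window. The sketch below is back-of-the-envelope only: the per-step token cost is a caller-supplied assumption (the article gives no figure for screenshot tokens), `steps_that_fit` is a hypothetical helper, and 128K is taken to mean 131,072 tokens.

```python
LEGACY_CONTEXT = 128 * 1024   # assumption: "128K" means 131,072 tokens
NATIVE_CONTEXT = 1_048_576    # input window stated for Gemini 3.5 Flash


def steps_that_fit(context_tokens: int, tokens_per_step: int,
                   reserved_tokens: int = 0) -> int:
    """Estimate how many screenshot/action steps fit in a context window.

    tokens_per_step is your own measured or assumed cost of one loop
    iteration (screenshot + action + intent); reserved_tokens covers the
    system prompt and tool schema.
    """
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    usable = max(context_tokens - reserved_tokens, 0)
    return usable // tokens_per_step


# The ratio between the two windows, independent of any per-step assumption.
expansion = NATIVE_CONTEXT // LEGACY_CONTEXT
```

Measure the real per-step cost on your own workflow before relying on any estimate; the ratio between the two windows is the only figure here that does not depend on that measurement.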
The legacy model was no toy — it posted around 70% on the Online-Mind2Web web-task benchmark at its October 2025 launch, and for browser-only automation it remains usable. But it was a point solution: a model that did one job and handed control back. The native tool reframes computer use as a capability of a general production model rather than a destination you route traffic to. If you are running the standalone model today, treat 3.5 Flash as the forward path rather than a drop-in swap — the action vocabulary is broadly familiar, but the intent field, the wider environments, and the larger context change what you can ask the agent to attempt.
03 — The Architecture Shift
One agent that sees, reasons, and acts.
The underreported value is what the native integration removes. With computer use, Google Search grounding, and Google Maps all available inside one Gemini 3.5 Flash call, a single agent can look at a screen, ground a fact against live search, and pull a location — without handing context between specialized models. Each model hop in a multi-model pipeline is a place where context gets translated, latency stacks, and errors propagate; collapsing perception and action into a single inference pass eliminates that entire class of failure.
Computer use
The agent takes a screenshot, decides the next action, and returns it with the intent behind it. Browser, mobile, and desktop, on a normalized 0–1000 coordinate grid.
Search grounding
Google Search grounding lets the same agent check a claim against the live web mid-task — confirm a price, a policy, a competitor’s live offer — without a hand-off to a separate retrieval model.
Maps
Google Maps resolves locations in the same call, so a workflow that needs an address, a store, or a service area does not break out into a third tool. One agent, three native capabilities.
The other quietly important addition is the intent field. Per the Gemini API docs, every action response includes an intent that explains the model’s reasoning for that step — for example, “Click the search box to type the destination.” For a regulated marketing or operations team this is not a nicety; it is an audit trail. You can log exactly why an agent clicked a button in your CRM, which is the difference between an action you can defend and one you cannot. The model runs the loop the docs describe: send a screenshot, receive a function call with an action plus its intent, execute it with denormalized coordinates, capture the next screenshot, and repeat until the task is done. For the broader capability profile, our full Gemini 3.5 Flash benchmark and API guide covers the non-computer-use side.
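The loop described above can be sketched without committing to any particular SDK. Everything here is hypothetical scaffolding: `Action`, `run_agent_loop`, and the three callables stand in for your screenshot capture, your model call, and your executor, and none of them are Gemini API names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    name: str      # e.g. "click"
    args: dict     # e.g. normalized coordinates
    intent: str    # the model's stated reason for this step


def run_agent_loop(
    take_screenshot: Callable[[], bytes],
    next_action: Callable[[bytes], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 25,
) -> list[Action]:
    """Screenshot -> action (+ intent) -> execute -> repeat.

    next_action wraps the model call and returns None when the task is done.
    max_steps is a hard stop so a confused agent cannot loop forever.
    """
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = next_action(screenshot)
        if action is None:
            break
        execute(action)
        history.append(action)
    return history
```

Injecting the three callables keeps the loop testable with fakes, which is useful before you point it at a real, logged-in session.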
It is worth dwelling on why the intent field matters beyond debugging. Most coverage treated it as a developer convenience, but for a regulated buyer it is closer to a control. When an agent acts inside a CRM or a billing console, the question a compliance team asks is not “did it work” but “can you show why it did that.” An action log that pairs every click with the model’s stated reason is the difference between an automation you can put in front of an auditor and one you quietly turn off the first time it surprises someone. That is a capability the standalone 2.5 model did not offer, and it is arguably more consequential for enterprise adoption than the three-tenths of a benchmark point that dominated the headlines.
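An action log of that kind can be as simple as one JSON line per step. This is a minimal sketch under stated assumptions: the record layout and the `audit_record` helper are hypothetical, and only the intent and safety_decision concepts come from the description above.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(step: int, action_name: str, args: dict, intent: str,
                 safety_decision: str, now: Optional[datetime] = None) -> str:
    """Serialize one agent step as a JSON line suitable for an append-only log."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "ts": timestamp,
            "step": step,
            "action": action_name,
            "args": args,
            "intent": intent,                  # why the model took this step
            "safety_decision": safety_decision,
        },
        sort_keys=True,
    )


line = audit_record(
    1, "click", {"x": 500, "y": 500},
    "Click the search box to type the destination.", "regular",
)
```

Writing the record before executing the action, rather than after, means the log still shows what the agent intended even if the step itself fails.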
Input token window
1,048,576 input tokens with up to 65,536 output tokens (Google-stated) — an 8x expansion over the standalone 2.5 model’s 128K, enough to hold a full multi-step workflow in memory.
Output throughput
Roughly 289 tokens per second, which Google frames as about 4x faster than competing frontier models. Treat both figures as vendor-stated; speed is the model’s central selling point.
Intent per action
Every action carries an intent field explaining the model’s reasoning for that step — new in 3.5 Flash relative to the 2.5 model, and the basis for a defensible audit log.
"Computer use is now a built-in tool supported in Gemini 3.5 Flash, delivering our best performance yet for agentic computer use tasks."— Google DeepMind, Introducing computer use in Gemini 3.5 Flash, June 24, 2026
Dynamic thinking is on by default in Gemini 3.5 Flash, with the model allocating more compute to harder problems automatically — useful when a screen task occasionally needs deeper reasoning mid-loop. Google positions Antigravity, refreshed as Antigravity 2.0 at I/O 2026 and powered by 3.5 Flash, as the primary platform for building these agents, including parallel-executing subagent workflows.
The screen-driving numbers also sit on top of a model Google positions as broadly strong on agentic work. At I/O, Google reported Gemini 3.5 Flash outperforming Gemini 3.1 Pro on several agentic benchmarks — 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas, 1656 Elo on GDPval-AA, and 84.2% on CharXiv multimodal reasoning. Treat those as vendor-stated like everything else here, but they explain why a Flash-tier model can credibly anchor an agent at all: computer use is only as good as the reasoning deciding what to click, and 3.5 Flash is not a lightweight model wearing a screen-control hat.
04 — Benchmarks
The leaderboard, read honestly.
Gemini 3.5 Flash scores 78.4 on OSWorld-Verified, the benchmark for agentic desktop control. GPT-5.5 sits at 78.7 and Claude Opus 4.7 at 78.0: a gap of 0.3 points between Gemini 3.5 Flash and GPT-5.5, and 0.7 points across all three, which independent coverage has described as effectively a three-way tie. The right reading is not that Gemini 3.5 Flash “matches” GPT-5.5 as a definitive claim, but that the three sit close enough that self-reported scores cannot meaningfully separate them.
One caveat governs the entire chart below, so state it plainly: every score on the OSWorld-Verified board is self-reported by the model provider. There is no independent third-party verification as of June 2026. Read the bars as vendor claims arranged on a common axis, not as audited results.
Figure: OSWorld-Verified · self-reported computer-use scores
Source: OSWorld-Verified scores via The Decoder and MLQ.ai. All figures are self-reported by model providers, with no independent third-party verification as of June 2026.
The more interesting question is strategic, not numeric. As one analysis of the launch put it, Google embedded computer use directly into its fastest, most economical model rather than reserving it for a premium tier, the opposite of how the other frontier labs have so far packaged the capability. That is a deliberate bet that cost and speed, not a marginal lead on raw capability, are what get agents into production. If the benchmark is a tie, the model that drives the screen for a third of the price tends to win the deployments, and the leaderboard row becomes a footnote to the procurement decision.
There is a real ceiling to acknowledge in the same breath. The absolute top of the self-reported board belongs to the Claude line — Opus 4.8 at 83.4 and Fable 5 at 85.0, the latter with access suspended under export controls. A team chasing the highest possible task success on the hardest desktop workflows may still reach for a premium model. The argument for Gemini 3.5 Flash is not that it is the most capable screen agent; it is that it is the most capable screen agent you can afford to run at volume, which for most real automation is the constraint that actually binds.
05 — The Cost Case
The number that actually moves a decision.
With the benchmark a tie, price is the lever. Gemini 3.5 Flash is $1.50 per million input tokens and $9.00 per million output tokens, with cached input at $0.15. The widely quoted “roughly one-third the cost” line rounds a more exact figure, and that figure is specific to GPT-5.5: GPT-5.5 lists $5 input and $30 output, so $1.50 is exactly 30% of $5 and $9 is exactly 30% of $30. That 30% ratio holds against GPT-5.5, not against every frontier model, each of which prices differently.
| Model | OSWorld-Verified | Input ($/Mtok) | Output ($/Mtok) | Cost-efficiency index |
|---|---|---|---|---|
| GPT-5.5 | 78.7 (self-reported) | $5.00 | $30.00 | 15.7 |
| Gemini 3.5 Flash | 78.4 (self-reported) | $1.50 | $9.00 | 52.3 |
Read the last column, not the second. On a cost-efficiency index (benchmark score divided by input price per million tokens), Gemini 3.5 Flash returns 52.3 against GPT-5.5’s 15.7, roughly 3.3x more benchmark per input dollar at effectively tied performance. The “78.4 vs 78.7” headline becomes close to irrelevant once you normalize for cost: you are paying 30% of the price for nearly the same measured capability. Two honest hedges keep this directional rather than definitive: the scores are self-reported, and the tokens each model actually consumes per session (step count, screenshot size, cache hits) can shift the effective ratio away from the clean per-token figure.
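The index is simple enough to reproduce directly from the table. The snippet below is a worked check of that arithmetic using only the scores and input prices quoted here; `cost_efficiency_index` is just a name for the division, not a standard metric.

```python
def cost_efficiency_index(score: float, input_price_per_mtok: float) -> float:
    """Benchmark score divided by input price in dollars per million tokens."""
    if input_price_per_mtok <= 0:
        raise ValueError("price must be positive")
    return score / input_price_per_mtok


gemini_index = cost_efficiency_index(78.4, 1.50)
gpt_index = cost_efficiency_index(78.7, 5.00)
ratio = gemini_index / gpt_index

print(f"Gemini 3.5 Flash: {gemini_index:.1f}")
print(f"GPT-5.5:          {gpt_index:.1f}")
print(f"Ratio:            {ratio:.1f}x")
```

Because the index divides by input price alone, it ignores output tokens entirely; it is a directional comparison, not a cost forecast.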
At a marketing team’s volume the gap stops being abstract. Agentic computer-use loops are token-hungry — every screenshot, every intent, every step of the loop adds input and output tokens — so a workflow run hundreds or thousands of times a month accumulates real spend, and it is the output line, billed at $9 against GPT-5.5’s $30, where the difference compounds across a long task. We avoid putting a single dollar figure on it here because the honest number depends on your token mix per session, which varies with how visual and how long each task is. The durable point is narrower and more useful than a hero stat: a model priced at a third of the alternative changes which automations clear the bar of being worth building at all.
$1.50 input, per million tokens
Exactly 30% of GPT-5.5’s $5 input rate. The same 30% ratio holds on output ($9 vs $30). This is the GPT-5.5 comparison specifically, not a claim about the whole frontier.
$9.00 output, per million tokens
Against GPT-5.5’s $30 output rate. For agentic computer-use loops, which generate a lot of intermediate tokens, the output line is where the cost gap compounds across a long task.
$0.15 cached input, per million tokens
Cached input drops to $0.15 per million tokens. For agents that repeatedly re-send a stable system prompt and tool schema across a session, caching compresses the input bill further.
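To turn those three rates into a number for your own workload, a small estimator is enough. This is a sketch under stated assumptions: the token counts and cache-hit fraction are inputs you must measure yourself, `session_cost_usd` is a hypothetical helper, and only the per-million rates come from the figures above.

```python
def session_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cached_fraction: float = 0.0,
    input_rate: float = 1.50,    # $ per million input tokens
    output_rate: float = 9.00,   # $ per million output tokens
    cached_rate: float = 0.15,   # $ per million cached input tokens
) -> float:
    """Estimate the cost of one agent session from its token mix.

    cached_fraction is the share of input tokens served from cache (0..1).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    total = fresh * input_rate + cached * cached_rate + output_tokens * output_rate
    return total / 1_000_000


# Same token mix priced at the GPT-5.5 rates quoted above (no caching assumed).
flash = session_cost_usd(1_000_000, 1_000_000)
gpt = session_cost_usd(1_000_000, 1_000_000, input_rate=5.00, output_rate=30.00)
```

With no caching and identical token counts the ratio between the two stays at 30% whatever the input/output mix; caching, and any difference in how many tokens each model needs per task, is what moves it.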
06 — Safety & Governance
What you must lock down before you deploy.
An agent that can click anything in a logged-in browser is exactly as dangerous as that sounds, and Google’s safety design reads like a map of what worries them. The API ships seven configurable safety-policy categories, and every action response carries a safety_decision with a value of regular/allowed, require_confirmation, or blocked. The categories can be overridden via disabled_safety_policies — which is precisely why you need to decide your posture deliberately rather than inherit a default.
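In an executor, the safety_decision value becomes a gate in front of every action. The sketch below is one possible shape, not the documented client behaviour: `gate_action` and the `confirm` callback are hypothetical, both "regular" and "allowed" are accepted because the exact string is not pinned down here, and anything unrecognized fails closed.

```python
from typing import Callable


def gate_action(safety_decision: str, confirm: Callable[[], bool]) -> bool:
    """Return True only if the action may be executed.

    confirm is called for require_confirmation and should ask a human.
    Unknown values are treated like "blocked" (fail closed).
    """
    if safety_decision in {"regular", "allowed"}:
        return True
    if safety_decision == "require_confirmation":
        return bool(confirm())
    return False
```

Failing closed on unknown values matters for a preview API: if a new decision value appears, the agent stops rather than proceeding by accident.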
The table below maps all seven categories to the marketing and operations failure each one guards against. They line up almost exactly with the threat model for an automation agent: paying for things, editing records, sending messages, agreeing to terms.
| Safety category | What it gates | Marketing / ops risk scenario | Suggested posture |
|---|---|---|---|
| FINANCIAL_TRANSACTIONS | Payments, purchases, ad-spend commitments | An agent books campaign budget or pays an invoice in a billing console without sign-off. | Keep enforced confirmation on |
| SENSITIVE_DATA_MODIFICATION | Edits to sensitive records | An agent overwrites a CRM contact, deal value, or customer record while updating the pipeline. | Confirmation on |
| COMMUNICATION_TOOL | Sending messages and email | An agent sends an email or direct message to a customer list from a connected inbox. | Confirmation on |
| ACCOUNT_CREATION | Creating new accounts | An agent signs up for a SaaS tool or ad account mid-task. | Block for autonomous agents |
| DATA_MODIFICATION | General data writes | An agent edits a shared spreadsheet, dashboard, or scheduled report. | Confirmation on |
| USER_CONSENT_MANAGEMENT | Consent and privacy settings | An agent toggles a cookie, consent, or subscription preference on a live property. | Block for autonomous agents |
| LEGAL_TERMS_AND_AGREEMENTS | Accepting terms and contracts | An agent clicks “I agree” on a vendor’s terms of service or data-processing addendum. | Block for autonomous agents |
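The suggested postures in the table can live in code as a single mapping, so the policy is reviewable in one place. This is an illustrative sketch: the category names come from the table, while the "confirm" / "block" labels and the `posture_for` helper are hypothetical conventions for your own executor, not API values.

```python
# Suggested posture per safety category, transcribed from the table above.
POSTURE = {
    "FINANCIAL_TRANSACTIONS": "confirm",
    "SENSITIVE_DATA_MODIFICATION": "confirm",
    "COMMUNICATION_TOOL": "confirm",
    "ACCOUNT_CREATION": "block",
    "DATA_MODIFICATION": "confirm",
    "USER_CONSENT_MANAGEMENT": "block",
    "LEGAL_TERMS_AND_AGREEMENTS": "block",
}


def posture_for(category: str) -> str:
    """Look up the posture for a category; unknown categories are blocked."""
    return POSTURE.get(category, "block")


blocked = sorted(name for name, posture in POSTURE.items() if posture == "block")
```

Defaulting unknown categories to "block" keeps the policy conservative if the category list changes during the preview.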
A screen agent introduces a threat ordinary chat models do not face: the attack can arrive through the pixels. A malicious instruction hidden in a webpage, an ad, or a rendered document can try to hijack the agent mid-task — telling it to exfiltrate data or click something it should not. Because the agent operates inside your authenticated sessions, a successful injection inherits your permissions: it is not a chatbot saying something off-policy, it is software acting as a logged-in you. That is the specific reason a screen agent needs more than the safety categories above.
Further safeguards address that specific exposure. Prompt-injection detection is opt-in: you enable it with enable_prompt_injection_detection, and it scans screenshots for hidden adversarial instructions, blocking execution when it finds them. Google says it applied targeted adversarial training to harden 3.5 Flash against these visual exploits, and it offers two opt-in enterprise controls: enforced user confirmation before sensitive or irreversible actions, and automatic task termination on detecting indirect prompt injection. Both are off by default. If you are designing the defensive layer, our prompt-injection defense framework covers the layered approach production agents need.
07 — Putting It to Work
Where a marketing or ops team starts today.
Google named Salesforce, Xero, Shopify, and Ramp among the early adopters building on Gemini 3.5 Flash for automation — supplier identification, invoice OCR, growth forecasting, multi-subagent enterprise tasks. Read those as broad 3.5 Flash adoption rather than proof that each one runs computer-use workloads specifically; the coverage attributes them to the model, not to this preview tool in particular. The practical starting points for a smaller team are narrower and lower-risk, and they map cleanly onto the safety postures above.
The named examples are instructive even with that caveat. Coverage describes Salesforce wiring 3.5 Flash into multi-subagent enterprise tasks on its Agentforce platform, Xero running autonomous accounting steps like supplier identification and tax-form processing, Shopify using parallel subagents for merchant growth forecasting, Ramp reading complex invoices against historical patterns, and Macquarie Bank applying it to customer onboarding. The pattern across them is high-volume, repetitive back-office work where a fast, cheap model pays off — the same shape a marketing or operations team faces, just at a different scale. A smaller team does not need an enterprise platform to start; it needs one well-scoped workflow.
Campaign & dashboard QA
Point a research-only agent at ad consoles and analytics dashboards to read state and flag anomalies — no writes, no spend. The intent field gives you a reviewable log of what it checked and why.
CRM data entry & lead routing
Let the agent draft record updates across a browser CRM, but keep enforced confirmation on for SENSITIVE_DATA_MODIFICATION and DATA_MODIFICATION so a person approves each write before it lands.
Unified reporting
Use the native stack — computer use plus Search grounding plus Maps in one agent — to pull a screen metric, verify a fact, and resolve a location in a single pass, instead of stitching three tools together.
Anything that spends or sends
Booking ad budget, sending customer email, accepting vendor terms: keep FINANCIAL_TRANSACTIONS, COMMUNICATION_TOOL, and LEGAL_TERMS_AND_AGREEMENTS gated, and turn on prompt-injection detection before the agent touches a logged-in tab.
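A pre-deployment check can catch the most dangerous misconfigurations before an agent touches a logged-in tab. The sketch below assumes a plain dict as the agent's settings: the keys enable_prompt_injection_detection and disabled_safety_policies are the option names mentioned in this guide, but the dict layout itself and the `predeploy_problems` helper are hypothetical, not the API's request shape.

```python
# Categories this guide says to keep gated for anything that spends or sends.
MUST_STAY_GATED = frozenset({
    "FINANCIAL_TRANSACTIONS",
    "COMMUNICATION_TOOL",
    "LEGAL_TERMS_AND_AGREEMENTS",
})


def predeploy_problems(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not config.get("enable_prompt_injection_detection", False):
        problems.append("prompt-injection detection is off")
    disabled = set(config.get("disabled_safety_policies", []))
    for category in sorted(MUST_STAY_GATED & disabled):
        problems.append(f"{category} is disabled")
    return problems


issues = predeploy_problems({
    "enable_prompt_injection_detection": False,
    "disabled_safety_policies": ["COMMUNICATION_TOOL"],
})
```

Running a check like this in CI, and refusing to deploy when the list is non-empty, turns the posture table into something enforced rather than remembered.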
A concrete first project makes this tangible. Take weekly campaign QA: an agent opens each ad platform in a browser, reads spend pacing and policy status across accounts, and writes a single plain-language summary — reading state, never changing it. The intent field documents each check, so the output is auditable from day one, and because nothing is written, the FINANCIAL_TRANSACTIONS and DATA_MODIFICATION categories never come into play. It is the lowest-risk way to learn whether a screen agent is reliable enough on your own surfaces before you hand it anything that writes, spends, or sends.
The honest sequencing is the same one we use with clients: prove value on read-only QA, add human-confirmed writes once the audit log earns trust, and only then widen autonomy behind the safety categories. That scoping — which workflows, which guardrails, which confirmation gates — is exactly where our agentic AI transformation engagements begin, before any model commitment.
08 — Conclusion
The model becomes the agent.
Native computer use makes the model the agent, not a step in the pipeline.
The June 24 update is best read as an architecture change wearing a benchmark headline. Folding computer use into Gemini 3.5 Flash — alongside Search grounding and Maps — lets a single agent see, reason, and act in one inference pass, and the new intent field gives that agent the audit trail an enterprise needs to deploy it. The 78.4 OSWorld-Verified score, self-reported like every score on that board, is a tie with GPT-5.5, not a lead.
Keep the framing precise. This is a public preview, not general availability. The cost advantage is real and specific (exactly 30%, or roughly one-third, of GPT-5.5’s per-token price), but it is a GPT-5.5 comparison, not a claim about the whole field. And the safeguards, from the seven safety categories to injection detection, are mostly opt-in, behind Google’s own admission that no single safeguard is foolproof.
The forward read is straightforward. When the cheapest, fastest production model also drives a screen, computer use stops being a premium add-on and starts being a default expectation of an agent platform. The teams that win will not be the ones chasing the top leaderboard row — they will be the ones who wire a tied-but-cheaper model into a real workflow, behind real guardrails, with a log they can defend. That, not a fraction of a benchmark point, is what this release actually changes.