Google launched Gemini 3.5 Flash and the standalone Antigravity 2.0 desktop agent on May 19, 2026 — same day, same press cycle, framed together. This guide compares 3.5 Flash against the frontier Pro tiers from OpenAI and Anthropic (gpt-5.5 and claude-opus-4-7) specifically on agentic coding workloads: MCP-driven multi-step execution, terminal coding, SWE-Bench Pro repo edits, and computer-use control.

This is not a fair fight at the tier level. Gemini 3.5 Pro is confirmed for rollout next month per the official Gemini 3.5 announcement — that's the model that should be the Pro-tier comparison point. But Google chose to ship Flash first, and on Google's own published benchmark table the Flash model already leads Opus 4.7 and GPT-5.5 on several agentic evals. That changes the shopping question.

The framing here: what is the smallest, fastest model that still does the agentic-coding job for you? If the answer today is 3.5 Flash, the speed and cost profile cascade into your entire stack. We also fold in the launch of Google Antigravity 2.0, the agent-first desktop app that ships powered by the latest Gemini models — it's the deployment frame Google is betting on, and worth covering in context. A full deep dive on Antigravity 2.0 and its rivals comes later; this post sets the table.

Key takeaways

01
Asymmetric by design — Flash vs Pro tiers.Gemini 3.5 Flash is a Flash-tier model up against the frontier Pro tiers from OpenAI (GPT-5.5) and Anthropic (Opus 4.7). Gemini 3.5 Pro is rolling out next month per the official announcement — that's the real apples-to-apples comparison. This post is the in-between picture for the four-week window we actually have.
02
Three models, three coding-flavor wins.On Google's published table: 3.5 Flash leads MCP Atlas (83.6%) and Toolathlon (56.5%). Opus 4.7 leads SWE-Bench Pro (64.3%). GPT-5.5 leads Terminal-Bench 2.1 (78.2%) and OSWorld-Verified (78.7%). For agentic coding, the right model depends on which subtask dominates your loop.
03
Flash's edge is the speed-times-capability product.Google claims 3.5 Flash runs roughly 4x faster than other frontier models on output tokens per second. For tool-heavy loops where a single workflow hits the model hundreds of times, a smaller per-call score that runs 4x faster can beat a higher score that runs 4x slower on wall-clock and cost.
04
Antigravity 2.0 is the deployment frame.Google shipped Antigravity 2.0 the same day — a standalone agent-first desktop app for macOS, Linux, and Windows, powered by the latest Gemini models. Dynamic subagents, scheduled tasks, JSON hooks, and slash commands (/goal, /grill-me, /schedule, /browser) all assume a model fast enough to live inside a tight loop. That's 3.5 Flash's lane.
05
Flash undercuts both Pro tiers by ~3x on price.Gemini 3.5 Flash standard-tier pricing posted on the Gemini API pricing page the day after launch: $1.50 / $9.00 per Mtok input/output. GPT-5.5 is $5 / $30 (standard, under 272K). Claude Opus 4.7 is $5 / $25 with flat 1M-context pricing. Flash is roughly 3.3x cheaper on input and 2.8-3.3x cheaper on output than the Pro tiers — before batch (50% off) or context caching ($0.15 / Mtok).

01 — The frameA Flash model vs two Pro tiers — and why it still matters.

Inside Google, Gemini 3.5 Flash is a Flash-tier model — the small, fast sibling in the lineup. Gemini 3.5 Pro is in development and rolling out next month. Holding the Pro tier back and leading the launch with Flash is unusual; the natural read is that Google is confident enough in the Flash-tier numbers to take the headline on its own.

Comparing Flash to GPT-5.5 and Opus 4.7 — the current Pro tiers from OpenAI and Anthropic — therefore lands as an apples-to-grapefruits exercise on the tier axis. The fair comparison is going to be 3.5 Pro versus those models next month. But three things make this comparison worth running today.

Reason 01

It's what shipped

Today

Gemini 3.5 Flash is GA today across the Gemini app, AI Mode in Google Search, the Gemini API, Antigravity 2.0, and Gemini Enterprise. Teams deciding what to use this week need a frame; the Pro-tier comparison is four weeks out.

May 19, 2026

Reason 02

Agentic workloads compound speed

Loops

Agentic coding is a tight loop — model call, tool use, model call, tool use. Wall-clock time and cost per task are determined by latency and per-call price, not headline benchmark. Flash-tier speed flips the math even when capability is slightly behind.

4x output speed claim

Reason 03

Flash actually leads on some evals

Lead

On MCP Atlas, Toolathlon, Finance Agent v2, CharXiv Reasoning, and MMMU-Pro, 3.5 Flash leads both GPT-5.5 and Opus 4.7 on Google's published table. That's the cleanest signal that the tier line inside Google is narrower than the marketing label.

5 of 14 benchmarks

Caveat baked into this post

Every benchmark cited below comes from Google's evals-methodology page for gemini-3-5-flash — published the same day as the launch. Vendor self-evals are vendor self-evals; treat them as the floor of confidence, not the ceiling. Independent benchmarks for 3.5 Flash will land in the coming weeks; until then, run your own evals on your own prompts before changing production routing.

02 — Headline pictureThe five agentic-coding evals, side by side.

These are the five evals that map most directly to agentic software-engineering and tool-use workloads. Each bar shows the top score across the three models; the sub-line names the runners-up. Bars are colored by which vendor leads.

Agentic coding evals · top score per benchmark

Source: Google evals methodology page for gemini-3-5-flash, May 19, 2026

MCP AtlasMulti-step workflows using MCP · Opus 4.7 79.1 · GPT-5.5 75.3

83.6%

Flash wins

Terminal-Bench 2.1Agentic terminal coding · 3.5 Flash 76.2 · Opus 4.7 66.1

78.2%

GPT-5.5 wins

SWE-Bench Pro (Public)Diverse agentic coding tasks · GPT-5.5 58.6 · 3.5 Flash 55.1

64.3%

Opus 4.7 wins

OSWorld-VerifiedAgentic computer use · Opus 4.7 78.0 · 3.5 Flash 78.4

78.7%

GPT-5.5 wins

ToolathlonReal-world general tool use · GPT-5.5 55.6 · Opus 4.7 —

56.5%

Flash wins

3.5 Flash leadsGPT-5.5 / Opus 4.7 leads

Read this carefully. 3.5 Flash wins on MCP-driven workflows (the kind Antigravity 2.0 and most modern agent stacks actually run) and on general tool-use. GPT-5.5 wins on terminal coding and computer-use control — pure-execution workloads where the model is operating the machine. Opus 4.7 wins on SWE-Bench Pro — diverse repo-scale coding tasks where the model has to read and edit existing code with low error tolerance.

On OSWorld-Verified, the three are within 0.7 percentage points of each other (78.7 / 78.4 / 78.0). That's effectively a three-way tie on agentic computer use — but note that 3.5 Flash does not actually support Computer Use in the Gemini API (the doc page is explicit on that point). The benchmark score is a research result; deploying browser/desktop agents on 3.5 Flash today is not possible. Use gemini-3-flash-preview for Computer Use in production.

03 — Detail viewThe full agentic + reasoning table.

The headline view above flatters Flash by picking the five evals that are most agentic-coding-shaped. The full table tells a different story on harder reasoning and dense long-context retrieval. Bold cells mark the row leader.

Benchmark	3.5 Flash	Opus 4.7	GPT-5.5	Reads as
MCP Atlas	83.6%	79.1%	75.3%	Multi-step MCP workflows
Toolathlon	56.5%	—	55.6%	General tool use
Terminal-Bench 2.1	76.2%	66.1%	78.2%	Agentic terminal coding
SWE-Bench Pro	55.1%	64.3%	58.6%	Diverse agentic coding
OSWorld-Verified	78.4%	78.0%	78.7%	Computer use (3-way tie)
Finance Agent v2	57.9%	51.5%	51.8%	Analyst-style workflows
GDPval-AA (Elo)	1656	1753	1769	Knowledge-work value
MRCR v2 (128k avg)	77.3%	59.3%	94.8%	Mid-context retrieval
Humanity's Last Exam	40.2%	46.9%	41.4%	Academic reasoning
ARC-AGI-2	72.1%	75.8%	84.6%	Abstract reasoning
Source: Google evals-methodology page for gemini-3-5-flash. Em-dash indicates the score is not published for that model. Bold marks the row leader.

On hardest-reasoning workloads (Humanity's Last Exam, ARC-AGI-2) and on mid-context retrieval (MRCR v2 at 128k average), the Pro tiers still pull ahead — by 6.7 points on Humanity's Last Exam, 12.5 points on ARC-AGI-2, and 17.5 points on MRCR v2 128k. If your coding loop depends on dense retrieval over a single large document or on abstract-reasoning puzzles disguised as coding tasks, Flash gives up real ground.

04 — Loop mathSpeed and cost are half the comparison for agents.

Agentic coding is a tight loop: the model proposes a step, calls a tool, gets a result, proposes another step. A single user task often expands to 50, 100, or 500 model calls. In that regime, the per-call latency and cost compound — wall-clock-per-task and cost-per-task can be dominated by the slower or pricier model even when its per-call accuracy is higher.

Google's announcement claims that 3.5 Flash runs roughly 4x faster than other frontier models on output tokens per second. There's no head-to-head latency chart yet — treat the 4x figure as Google's framing until independent measurements land. But even at half that delta, Flash dominates the loop dimension.

Pricing as of May 19, 2026

Model	Input / Mtok	Output / Mtok	Context	Note
Gemini 3.5 Flash	$1.50	$9.00	1.05M / 65k out	Batch / flex 50% off; caching $0.15 / Mtok
Claude Opus 4.7	$5	$25	1M flat	Flat pricing across full window
GPT-5.5	$5	$30	1.05M	Long-context surcharge above 272K (2x in, 1.5x out)
Sources: Anthropic, OpenAI, and Google (ai.google.dev) API pricing pages. Gemini 3.5 Flash pricing landed on the Gemini API pricing page the day after launch.

Three things to note. First, Opus 4.7 and GPT-5.5 are at rough input-cost parity ($5/Mtok input each), differing only on the output side. Second, GPT-5.5's long-context surcharge above 272K input tokens means that single-shot prompts over a 1M-token codebase get expensive fast on GPT-5.5 — 2x input and 1.5x output for the rest of the session. Third, Gemini 3.5 Flash undercuts both by roughly 3.3x on input and 2.8-3.3x on output at the standard tier — and the batch / flex tier drops that to $0.75 / $4.50 per Mtok for workloads tolerant of asynchronous delivery.

Against historical Flash tiers that ran 10-30x cheaper than their Pro counterparts, the 3x gap is narrower than the lineage suggests — Google is pricing 3.5 Flash closer to the Pro envelope, which tracks with the benchmark posture of leading on five evals against the Pro tier. Even at 3x, the loop math is one-sided for high-volume agentic workloads: a tool-heavy loop that hits the model hundreds of times runs roughly a third the cost on 3.5 Flash versus GPT-5.5 or Opus 4.7, before the 4x output-speed claim ever enters the equation.

"The right model for an agentic loop is the smallest one that does the job — multiplied by 100s of tool calls, every saved second and cent compounds."— Our framing of the speed/capability tradeoff for agentic coding

05 — The cockpitAntigravity 2.0 — Google's agent-first desktop app.

Google launched Antigravity 2.0 the same day as Gemini 3.5 Flash. From the official Antigravity 2.0 announcement: it is "a new, standalone desktop application that fully delivers on a truly agent-optimized experience, available on macOS, Linux, and Windows." It is powered by the latest Gemini models — explicitly including Gemini 3.5 Flash per the Antigravity team's linked launch post.

The Antigravity team frames it bluntly: "Users interact with powerful agents both synchronously and asynchronously, and there is no IDE." This is a different shape than Cursor, Cline, Claude Code, or Codex CLI — those are IDE-shaped agents that give you a code editor and an AI inside it. Antigravity 2.0 gives you an agent and treats the editor as something you dual-wield alongside it. The product positions itself as a platform to orchestrate multiple autonomous agents working in parallel across independent projects.

The deep-dive on Antigravity 2.0 — and how it stacks up against agentic IDEs and standalone CLIs — is its own post. For this comparison, what matters is that Google built its flagship agentic surface assuming 3.5 Flash as the model. The product design choices map tightly to Flash-tier strengths: speed, MCP-driven tool use, cron-style autonomy, parallel subagents.

Architecture

Standalone agent-first desktop

macOS · Linux · Windows · no IDE

Antigravity 2.0 is not an IDE. It is the Agent Manager surface from the original Antigravity IDE, lifted into a standalone desktop app and rebuilt agent-first from the ground up. The Antigravity IDE remains available; Google recommends dual-wielding 2.0 with the IDE of your choice — Antigravity, VS Code, or otherwise.

May 19, 2026 · GA

Tooling around it

CLI, SDK, Managed Agents API

all shipped same day

Antigravity 2.0 lands alongside the Antigravity CLI, the Antigravity SDK, and the Managed Agents Gemini API. Google's framing is that the model layer, agent harness, and product UI are now co-optimized — Gemini training and evaluation stacks are integrated with the Antigravity product harness.

platform launch, not point product

The agentic capabilities that map to Flash

Several Antigravity 2.0 features only make sense at Flash-tier speed. Putting these next to the per-call latency profile is what unlocks the design.

Dynamic subagents. The main agent can dynamically define and invoke subagents to tackle focused sub-tasks in parallel, keeping the main context window clean. Spawning N parallel agents only works if the model is cheap and fast enough to run N in parallel without cost or queue blowing up.
Scheduled Tasks. Define a cron schedule and the agent runs autonomously in the background — recurring or one-off. Autonomous execution that runs all day assumes per-invocation cost is low enough that the budget holds up.
JSON hooks. Intercept and control the agent's behavior via a simple JSON format. Sits at the intersection of MCP-style orchestration and policy — the same primitive that MCP Atlas measures, and the one 3.5 Flash leads on.
New slash commands. /goal (run until done, no intermediate input), /grill-me (ask clarifying questions before implementing), /schedule (cron-or-timer), and /browser (explicit browser-tool toggle).
Live voice transcription on the input box, powered by the latest Gemini Audio models. Real-time conversion of conversational speech into a clearly phrased prompt — closer to a voice cockpit than a recording feature.
Projects, not workspaces. Agent conversations are no longer pinned to a single repository. A project can span multiple folders, with its own settings and scoped permissions — the agent gets context across more of your work without losing guardrails.

Editorial note

A full deep dive on Antigravity 2.0 — comparing it against Cursor, Cline, Claude Code, Codex CLI, Devin, and the rest of the agentic coding cohort — is on the editorial calendar. This section is the contextual frame: this is the cockpit Google built for the model we're comparing. The Pro tiers from OpenAI and Anthropic ship with their own deployment frames (Codex / Claude Code) and deserve the same treatment.

06 — RoutingWhich model for which loop.

The decision matrix below maps the verified scores to concrete routing recommendations. Multi-vendor routing is the realistic production pattern — pick the right model per workload class, not a single default for the whole stack.

MCP-driven agents

Tool-loop heavy workflows

Gemini 3.5 Flash leads MCP Atlas at 83.6% — 4.5 points clear of Opus 4.7 (79.1%) and 8.3 points clear of GPT-5.5 (75.3%). Combine the lead with the 4x output-speed claim and Flash is the right default for MCP-driven tool loops, including inside Antigravity 2.0.

Pick Gemini 3.5 Flash

Repo-scale SWE

Multi-file edits with low error tolerance

Opus 4.7 still leads SWE-Bench Pro at 64.3% versus GPT-5.5 58.6% and 3.5 Flash 55.1%. For complex multi-file software-engineering tasks where the cost of a wrong edit is high, Opus 4.7 remains the better pick.

Stay with Opus 4.7

Terminal / shell agents

Direct execution at the prompt

GPT-5.5 leads Terminal-Bench 2.1 at 78.2% versus 3.5 Flash 76.2% and Opus 4.7 66.1%. For shell-driven agents where the model runs commands and reads output, GPT-5.5 holds a 2-point edge over Flash and a 12-point edge over Opus.

Pick GPT-5.5

Computer Use

Browser / desktop control

All three are within 0.7 points on OSWorld-Verified (78.7 / 78.4 / 78.0). But Gemini 3.5 Flash does not actually support Computer Use in the API — only gemini-3-flash-preview does. For browser-control agents on the Gemini side, use the preview; for cross-vendor routing, GPT-5.5 has the highest benchmark.

Route by support, not score

Long-context coding

Single-shot 100k-1M token prompts

GPT-5.5 leads MRCR v2 at 128k average (94.8%) versus 3.5 Flash (77.3%) and Opus 4.7 (59.3%). For dense mid-context retrieval workloads — large-codebase RAG, multi-doc analysis — GPT-5.5 has the cleanest profile. Watch for the long-context surcharge above 272K input tokens.

Pick GPT-5.5

Highest-stakes reasoning

ARC-AGI-2 and academic eval

GPT-5.5 leads ARC-AGI-2 (84.6%) by a wide margin; Opus 4.7 leads Humanity's Last Exam (46.9%). For coding tasks that hide a hardest-reasoning problem — protocol design, novel algorithmic work, formal verification — route to the Pro tier of either OpenAI or Anthropic, not Flash.

Route to Pro tier

07 — Next monthWhat 3.5 Pro likely changes.

The fair tier-matched comparison is going to be Gemini 3.5 Pro vs GPT-5.5 vs Opus 4.7, and that comparison is roughly four weeks out. Google has confirmed the rollout for next month without committing to a specific date.

Two things follow if 3.5 Pro lands meaningfully ahead of 3.5 Flash on the same benchmarks where Flash already leads Opus 4.7 and GPT-5.5 (MCP Atlas, Toolathlon, Finance Agent v2, CharXiv, MMMU-Pro):

The agentic-coding default rotates to Pro. Any team currently routing tool-loop work to Opus 4.7 or GPT-5.5 for capability reasons gets a Gemini-side option that is tier-matched and benchmark-ahead.
Flash stays in play for high-volume loops. The Pro tier won't match Flash on output speed — Flash is Flash. So a two-model pattern emerges: 3.5 Pro for capability-bound steps in an agent, 3.5 Flash for the high-volume tool-loop inner core. Inside Antigravity 2.0, this is exactly the dynamic-subagents shape.

What does not automatically follow is that Pro will leapfrog Opus 4.7 on SWE-Bench Pro or GPT-5.5 on Terminal-Bench and ARC-AGI-2. Those are different capability dimensions and recent history (Gemini 3 Pro vs 3.1 Pro vs the previous Anthropic and OpenAI Pro tiers) shows tier-to-tier leapfrogging happens on some axes and not others. Wait for the Pro evals; route multi-vendor in the meantime. For wider cross-vendor context, our existing Pro-tier comparison from the GPT-5.4 / Opus 4.6 / Gemini 3.1 Pro generation still applies as a baseline.

08 — ConclusionThe smallest model that does the job wins.

The shape of agentic coding, May 2026

For tool-loop coding, 3.5 Flash already redraws the routing matrix — even before Gemini 3.5 Pro lands.

Comparing a Flash-tier model to two frontier Pro tiers should have been a tier-mismatched curiosity. Instead, Gemini 3.5 Flash posts the highest score in Google's published table on five evaluations, including MCP Atlas, the benchmark that maps most directly to the tool-loop core of modern agentic coding. That is unusual, and it changes the routing question.

The honest reading: for repo-scale software engineering, Opus 4.7 still wins. For terminal coding and abstract reasoning, GPT-5.5 still wins. For tool-driven, multi-step, MCP-orchestrated workflows — the kind Antigravity 2.0 was specifically designed around — Gemini 3.5 Flash is the model you reach for first, and the speed-times-capability product makes it especially strong for high-volume loops. With pricing now published at $1.50 / $9.00 per Mtok versus $5 / $25-30 for the Pro tiers, the cost picture is no longer half-drawn — Flash is roughly 3x cheaper, which compounds in any loop that hits the model more than a handful of times.

The bigger signal is the bundle. Google shipped a Flash-tier model and a desktop agent product on the same day, with a CLI, SDK, and Managed Agents API alongside. The model, the harness, and the product are co-optimized. Whether that thesis holds up depends on what Gemini 3.5 Pro does next month — and on whether third-party agent platforms (Cursor, Cline, Claude Code, Codex CLI, Devin) end up adopting 3.5 Flash as a routing default in their own loops. The four-week window between today and the Pro launch is the one to watch.

For the full I/O 2026 announcement index — Antigravity 2.0, Gemini Spark, Omni, Search overhaul, Universal Cart, Managed Agents API, and the Anthropic counter-programming the same morning — see our Google I/O 2026 complete guide.

Gemini 3.5 Flash vs GPT-5.5 vs Opus 4.7: Agentic Coding

01 — The frameA Flash model vs two Pro tiers — and why it still matters.

It's what shipped

Agentic workloads compound speed

Flash actually leads on some evals

02 — Headline pictureThe five agentic-coding evals, side by side.

Agentic coding evals · top score per benchmark

03 — Detail viewThe full agentic + reasoning table.

04 — Loop mathSpeed and cost are half the comparison for agents.

Pricing as of May 19, 2026

05 — The cockpitAntigravity 2.0 — Google's agent-first desktop app.

Standalone agent-first desktop

CLI, SDK, Managed Agents API

The agentic capabilities that map to Flash

06 — RoutingWhich model for which loop.

Tool-loop heavy workflows

Multi-file edits with low error tolerance

Direct execution at the prompt

Browser / desktop control

Single-shot 100k-1M token prompts

ARC-AGI-2 and academic eval

07 — Next monthWhat 3.5 Pro likely changes.

08 — ConclusionThe smallest model that does the job wins.

For tool-loop coding, 3.5 Flash already redraws the routing matrix — even before Gemini 3.5 Pro lands.

A Flash model just took the agentic coding lead on five benchmarks.

Multi-vendor model engagements

The questions we get on launch day.

Continue exploring frontier comparisons.

Claude Fable 5 vs GPT-5.5: Benchmarks & Cost Compared

Claude Opus 4.8 vs GPT-5.5: Benchmarks & Cost Compared

GPT-5.5 vs Claude Opus 4.7: Benchmarks & Pricing

GPT-5.5 Complete Guide: Thinking, Pro & 1M Context

Claude Design by Anthropic Labs: AI-Native Design Tool