Gemini 3.5 Flash launched at Google I/O 2026 on May 19 with a dense benchmark card. Five days in, the independent evaluation community has published its first reads — and every Google-announced number that has been re-tested matches to the decimal. The story isn't fabricated scores; it's what independents measured that Google didn't disclose: a 61% hallucination rate, an Intelligence Index rank of #8 behind both GPT-5.5 and Claude Opus 4.7, and a $1,552 cost-to-evaluate that is 5.6 times what the prior Flash cost to run through the same suite.

The gap between vendor-launch velocity and independent-eval velocity is the structural tension this post documents. Five days is long enough for Artificial Analysis, llm-stats, and WaveSpeed to publish; it is not long enough for Aider Polyglot — whose leaderboard was last refreshed November 20, 2025 — to add a Gemini 3.5 Flash entry. That absence is itself a data point. The launch-day benchmarks and API guide published May 19 documented what Google claimed; this post documents what the first wave of independent evaluators found five days later.

What follows covers: the Google-claimed vs independently-verified scores in a structured table, the Artificial Analysis Intelligence Index framing (Flash is #8, not #1), the agentic and coding benchmarks where Flash leads the field, the frontier reasoning benchmarks where it trails Pro-tier rivals by 7–17 points, the Aider Polyglot gap, the cost-to-evaluate story, and a five-day verdict on what engineering and procurement teams should do with this information. For the developer migration mechanics (thinking_level enum, function calling IDs), see the API developer migration guide.

Key takeaways

01
Independent numbers match Google's to the decimal. Every Google-announced benchmark that has been independently re-tested, including Terminal-Bench 2.1 (76.2%), MCP Atlas (83.6%), Humanity's Last Exam (40.2%), and GDPval-AA (1,656 Elo), confirms the vendor number with zero delta. MRCR v2 128k (77.3%) is cited from the DeepMind model card and has not yet been independently re-run. The five-day reality check is that the scores are real. What independents added are three metrics Google didn't volunteer.
02
Artificial Analysis ranks Flash #8, not #1. The Artificial Analysis Intelligence Index v4.0 places Gemini 3.5 Flash at score 55, ranking it #8 of 148 models, behind GPT-5.5 and Claude Opus 4.7 by approximately 3 points each. Google compared Flash to its own Gemini 3.1 Pro; independents compared it to current-gen rivals. The agentic story is real. The 'frontier intelligence' framing is conditional.
03
Hallucination rate is 61%: improved but still high. Artificial Analysis's AA-Omniscience sub-evaluation measures hallucination rate at 61% for Gemini 3.5 Flash, a 31-point improvement over Gemini 3 Flash but still high in absolute terms. Google did not disclose this figure on launch day. Teams running retrieval-augmented or fact-sensitive workloads should weigh this against the benchmark gains.
04
Flash leads the entire field on agentic tool use. On MCP Atlas (scaled tool use), Gemini 3.5 Flash scores 83.6%, ahead of Claude Opus 4.7 (79.1%) and GPT-5.5 (75.3%). WaveSpeed AI's post-launch analysis confirms Flash leads on Finance Agent v2 and Toolathlon as well. For teams building autonomous agents, this is the most actionable finding of the five-day window.
05
The Aider Polyglot leaderboard has no Flash entry as of May 25. Aider's Paul Gauthier has not published a Polyglot run for any 2026-era Gemini model. The leaderboard was last refreshed November 20, 2025. This is not a criticism of Aider; it documents the lag between vendor-launch velocity and indie-eval velocity. Anyone citing an Aider Polyglot score for Gemini 3.5 Flash is fabricating a number.

01 — 5-Day Reality Check

Google-claimed vs independently verified — the full score matrix.

The table below is the publishable artifact of this five-day window. No public source has yet set Google's May 19 launch numbers side by side with the post-launch independent re-runs, with explicit “not yet run” flags. The primary sources are the DeepMind official Gemini 3.5 Flash model card, Artificial Analysis, the llm-stats launch analysis, and WaveSpeed AI. The Aider Polyglot status is sourced from aider.chat/docs/leaderboards.

The structural finding: Google's benchmark card held up. The narrative gap emerged from three independent additions — an Intelligence Index rank, a hallucination rate, and a cost-to-evaluate — that Google did not include in its May 19 disclosure.

Terminal-Bench 2.1

Google-claimed: 76.2% · AA confirmed: 76.2%

76.2%

Delta: 0.0. Beats Gemini 3.1 Pro (70.3%). GPT-5.5 leads at 78.2% — a 2-point gap. Confirmed as AA sub-evaluation within the Intelligence Index v4.0.

Delta 0.0 — confirmed

MCP Atlas

Google-claimed: 83.6% · WaveSpeed confirmed: 83.6%

83.6%

Delta: 0.0. Flash leads the entire field — Claude Opus 4.7 (79.1%) and GPT-5.5 (75.3%) both trail. Confirmed independently by WaveSpeed AI (May 20) and as an AA sub-evaluation.

#1 in field — confirmed

HLE

Google-claimed: 40.2% · AA confirmed: 40.2%

40.2%

Delta: 0.0. Humanity's Last Exam — Flash trails Gemini 3.1 Pro (44.4%) and Claude Opus 4.7 (46.9%). Confirmed as component of AA Intelligence Index v4.0.

Trails Opus 4.7 by 6.7 pts

MMMU-Pro

Google-claimed: 83.6% · llm-stats: #1 on leaderboard

83.6%

Delta: 0.0. Flash is currently #1 on the llm-stats MMMU-Pro leaderboard as of May 25, 2026. Google reported the same figure — no re-test delta.

#1 on llm-stats leaderboard

Three benchmarks that Google announced but independents have not yet re-run: SWE-Bench Pro (55.1%), ARC-AGI-2 (72.1%), and MRCR v2 128k (77.3%). These are cited by llm-stats and the DeepMind model card but have not yet had a fresh independent run published. On Blueprint-Bench 2, Gemini 3.5 Flash is the only model listed on the llm-stats leaderboard (score 0.336) as of May 25 — no peer comparison is yet possible. Three metrics are additions from independents with no corresponding Google claim: the Intelligence Index rank (#8 of 148), the 61% hallucination rate, and the $1,552 cost-to-evaluate.
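The claimed-versus-verified matrix described above can be sketched as a small data structure. This is a minimal illustration, not the authors' tooling: the `ScoreRow` class and the `status` helper are hypothetical names, and the figures are the ones quoted in this section, with `None` marking benchmarks that have no published independent re-run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoreRow:
    benchmark: str
    google_claimed: float          # percent, as announced on May 19
    independent: Optional[float]   # percent; None means no independent re-run yet
    source: Optional[str]          # who published the independent figure


# Figures copied from this section; nothing here is newly measured.
ROWS = [
    ScoreRow("Terminal-Bench 2.1", 76.2, 76.2, "Artificial Analysis"),
    ScoreRow("MCP Atlas", 83.6, 83.6, "WaveSpeed AI"),
    ScoreRow("HLE", 40.2, 40.2, "Artificial Analysis"),
    ScoreRow("MMMU-Pro", 83.6, 83.6, "llm-stats"),
    ScoreRow("SWE-Bench Pro", 55.1, None, None),
    ScoreRow("ARC-AGI-2", 72.1, None, None),
    ScoreRow("MRCR v2 128k", 77.3, None, None),
]


def status(row: ScoreRow) -> str:
    """Return an explicit 'not yet run' flag or the signed delta in points."""
    if row.independent is None:
        return "not yet run"
    delta = round(row.independent - row.google_claimed, 1)
    return f"delta {delta:+.1f}"


def not_yet_run(rows: list[ScoreRow]) -> list[str]:
    """List the benchmarks that still lack an independent re-run."""
    return [r.benchmark for r in rows if r.independent is None]


if __name__ == "__main__":
    for row in ROWS:
        print(f"{row.benchmark:<20} {row.google_claimed:>5.1f}  {status(row)}")
```

Keeping "not yet run" as a distinct state, rather than copying the vendor number into the independent column, is the design point: it stops an unverified score from being read as a confirmed one.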

02 — Artificial Analysis

Intelligence Index 55, #8 of 148 — and a hallucination rate Google didn't disclose.

Artificial Analysis published its Gemini 3.5 Flash teardown on May 19, 2026 — the same day as the Google I/O launch. The launch analysis places Flash at Intelligence Index v4.0 score of 55, ranking it #8 of 148 models evaluated. That is nine points above Gemini 3 Flash (46) and well above the median Intelligence Index of 36 for models in a comparable price tier. It is not, however, the top-tier position the term “Flash” might imply relative to the Pro-class rivals it is being compared against.

The Artificial Analysis framing is precise on this point: “Gemini 3.5 Flash scores 55 on the Artificial Analysis Intelligence Index, up 9 points from Gemini 3 Flash, driven primarily by agentic performance gains and hallucination reduction... well above average among other reasoning models in a similar price tier (median: 36).” Note what that framing compares against: models in “a similar price tier,” not top-tier frontier models. GPT-5.5 and Claude Opus 4.7 sit approximately 3 points above Flash on the Index at the five-day mark.

The minimal thinking variant — model ID gemini-3.5-flash-minimal — scores 43 on the same Index, not 55. Teams running cost-optimized workloads should confirm which variant their API calls resolve to; the 12-point gap between the variants is material for latency-sensitive deployments.

Speed measurement: Artificial Analysis independently clocked 203.5 output tokens per second (sustained), with peaks above 280 tok/sec. Google's announcement said “4× faster than other frontier models” — a relative claim. AA's absolute measurement is the more actionable figure for capacity planning. Time to first token with thinking enabled is 18.88 seconds.
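For capacity planning, the two absolute figures above combine into a rough per-response latency estimate. The sketch below assumes a simple model (time to first token plus output length divided by sustained throughput); the function name is hypothetical, and real latency will vary with prompt size, thinking level, and load.

```python
# Figures quoted from the Artificial Analysis measurements above.
TTFT_SECONDS = 18.88           # time to first token with thinking enabled
SUSTAINED_TOK_PER_SEC = 203.5  # sustained output throughput


def estimated_response_seconds(
    output_tokens: int,
    ttft_seconds: float = TTFT_SECONDS,
    tok_per_sec: float = SUSTAINED_TOK_PER_SEC,
) -> float:
    """Rough latency model: wait for the first token, then stream the rest.

    Assumption: throughput stays at the sustained rate for the whole response.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if tok_per_sec <= 0:
        raise ValueError("tok_per_sec must be positive")
    return ttft_seconds + output_tokens / tok_per_sec


if __name__ == "__main__":
    for n in (500, 2_000, 8_000):
        print(f"{n:>5} output tokens: ~{estimated_response_seconds(n):.1f} s")
```

Under this model the fixed time to first token dominates short responses, which is why the absolute measurements matter more for interactive workloads than a relative speed claim.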


Figure: Artificial Analysis Intelligence Index v4.0, Gemini 3.5 Flash in context (source: artificialanalysis.ai/models/gemini-3-5-flash, May 19, 2026)

Gemini 3.5 Flash: 55 (#8 of 148 models)
Gemini 3 Flash (prior generation): 46
Gemini 3.5 Flash, minimal thinking variant (model ID gemini-3.5-flash-minimal): 43
Median, reasoning models at comparable price tier: 36

03 — Coding & Agent Benchmarks

MCP Atlas leadership and SWE-Bench Pro — where Flash wins and where it doesn't.

The agentic and coding benchmarks are the strongest part of Gemini 3.5 Flash's five-day independent picture. WaveSpeed AI published its analysis on May 20 — one day after Google I/O — with the sharpest competitive framing: “A Flash-tier model now beats Pro-tier frontier models on most agent suites — Claude Opus 4.7 and GPT-5.5 trail Flash on MCP Atlas, Toolathlon, and Finance Agent v2.” That claim, referenced against the independently confirmed 83.6% MCP Atlas figure, holds up.

For teams building autonomous agents, the MCP Atlas leadership is the most operationally significant finding of the five-day period. MCP Atlas measures scaled tool use — the ability to orchestrate multiple tools in sequence under realistic agent task conditions. Flash leads Claude Opus 4.7 by 4.5 points and GPT-5.5 by 8.3 points on this benchmark. For a deeper methodological grounding on what MCP Atlas and Terminal-Bench 2.1 actually measure, see our SWE-Bench and Terminal-Bench benchmark guide.

SWE-Bench Pro tells a different story. Flash scores 55.1% — an improvement over Gemini 3 Flash (49.6%), but 9.2 points behind Claude Opus 4.7 (64.3%). On heavy code refactoring and open-source software engineering tasks, Opus 4.7 still wins. For the head-to-head breakdown across agentic coding scenarios, see our Gemini 3.5 Flash vs GPT-5.5 vs Opus 4.7 agentic coding comparison. The short version: Flash wins at agent orchestration; Opus 4.7 wins at complex refactoring.

MCP Atlas

Flash leads the field — 83.6%

Agentic tool use · WaveSpeed / AA confirmed

83.6% — the highest score of any model in the MCP Atlas benchmark as of May 25, 2026. Claude Opus 4.7 trails by 4.5 pts (79.1%); GPT-5.5 trails by 8.3 pts (75.3%). WaveSpeed AI confirms Flash also leads on Finance Agent v2 and Toolathlon. Delta vs Google-claimed: 0.0.

+4.5 over Opus 4.7 · +8.3 over GPT-5.5

Terminal-Bench 2.1

Strong — second place behind GPT-5.5

Coding terminal use · AA sub-eval confirmed

76.2% — beats Gemini 3.1 Pro (70.3%) by 5.9 pts. GPT-5.5 leads at 78.2% — a 2-point gap that puts Flash at a strong second. Confirmed as AA sub-evaluation; delta vs Google-claimed: 0.0.

−2.0 vs GPT-5.5 (78.2%)

SWE-Bench Pro

Improved — but trails Opus 4.7 by 9.2 pts

Code refactoring · DeepMind card / llm-stats cited

55.1% — up from Gemini 3 Flash (49.6%). Claude Opus 4.7 leads at 64.3%, a 9.2-point gap. No independent re-run at the 5-day mark; llm-stats cites the DeepMind model card figure. Teams with heavy refactoring workloads: Opus 4.7 still wins this benchmark.

−9.2 vs Opus 4.7 (64.3%)

CharXiv Reasoning

Multimodal reasoning field leader — 84.2%

Multimodal · DeepMind card confirmed

84.2%: Flash leads the field on CharXiv Reasoning (chart-intensive multimodal understanding). Its MMMU-Pro score of 83.6% is currently #1 on the llm-stats MMMU-Pro leaderboard as of May 25, 2026. Multimodal is a genuine strength.

#1 CharXiv · #1 MMMU-Pro

04 — Frontier Reasoning

ARC-AGI-2, HLE, MRCR 128k — where Flash trails the frontier.

The five-day independent picture confirms three material gaps on frontier reasoning benchmarks. These gaps are not surprises — Google compared Flash primarily to Gemini 3.1 Pro (its own previous-gen Pro model) rather than to GPT-5.5 and Claude Opus 4.7. When independents ran those cross-model comparisons, the gaps emerged clearly.

ARC-AGI-2: Flash scores 72.1% — a 38.5-point jump over Gemini 3 Flash (33.6%), which is real and significant. But GPT-5.5 scores 84.6% on the same benchmark — a 12.5-point lead over Flash. On abstract reasoning and novel problem-solving tasks, Flash is meaningfully behind the current frontier. No independent re-run of ARC-AGI-2 has been published at the five-day mark; the figure is cited from the DeepMind model card.

Humanity's Last Exam: Flash scores 40.2% — confirmed by Artificial Analysis as a component of the Intelligence Index v4.0. That trails Gemini 3.1 Pro (44.4%) and Claude Opus 4.7 (46.9%) by 4.2 and 6.7 points respectively. According to the llm-stats launch analysis, “Google moved the frontier line down to the Flash tier” — an accurate observation about the model's agentic capabilities, but HLE demonstrates that parametric knowledge and academic reasoning remain a Pro-class differentiator.

MRCR v2 at 128k: Flash scores 77.3%, as reported in the DeepMind model card, with no independent re-test published yet. GPT-5.5 scores 94.8% on the same benchmark, a 17.5-point gap that is the largest single comparative deficit in Flash's profile. At 1M token context (the full Flash context window), the MRCR score drops to 26.6%. Teams with long-context retrieval workloads where the needle is buried deep in a 1M-token corpus should treat this figure with caution: the 128k retrieval performance is strong relative to earlier Flash models, but substantially below GPT-5.5.

The framing trap to avoid: these gaps exist on specific benchmark categories. On agentic tool use, Flash leads GPT-5.5 and Opus 4.7. The correct characterization — supported by five days of independent data — is that Flash wins on agent orchestration and multimodal tasks; it does not win on frontier reasoning, dense long-context retrieval, or complex refactoring.
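The per-category picture can be checked by computing Flash's gap to its best-scoring rival on each benchmark. This is a minimal sketch using only the scores quoted in this post; the `SCORES` table and `gap_to_best_rival` helper are hypothetical names, and rivals with no quoted score for a benchmark are simply left out.

```python
FLASH = "Gemini 3.5 Flash"

# Percent scores as quoted in this post; missing rivals are omitted, not guessed.
SCORES = {
    "MCP Atlas": {FLASH: 83.6, "Claude Opus 4.7": 79.1, "GPT-5.5": 75.3},
    "Terminal-Bench 2.1": {FLASH: 76.2, "GPT-5.5": 78.2, "Gemini 3.1 Pro": 70.3},
    "SWE-Bench Pro": {FLASH: 55.1, "Claude Opus 4.7": 64.3},
    "ARC-AGI-2": {FLASH: 72.1, "GPT-5.5": 84.6},
    "HLE": {FLASH: 40.2, "Claude Opus 4.7": 46.9, "Gemini 3.1 Pro": 44.4},
    "MRCR v2 128k": {FLASH: 77.3, "GPT-5.5": 94.8},
}


def gap_to_best_rival(benchmark: str) -> tuple[str, float]:
    """Return (best rival, Flash score minus rival score) in points.

    A positive gap means Flash leads; a negative gap means it trails.
    """
    scores = SCORES[benchmark]
    rivals = {name: s for name, s in scores.items() if name != FLASH}
    best = max(rivals, key=rivals.get)
    return best, round(scores[FLASH] - rivals[best], 1)


if __name__ == "__main__":
    for name in SCORES:
        rival, gap = gap_to_best_rival(name)
        verdict = "leads" if gap > 0 else "trails"
        print(f"{name:<20} Flash {verdict} {rival} by {abs(gap):.1f} pts")
```

Reporting the gap per benchmark, rather than averaging across them, keeps the agentic lead and the reasoning deficits from cancelling each other out.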

ARC-AGI-2

Flash: 72.1% — GPT-5.5 leads at 84.6%

Abstract reasoning and novel problem-solving. Flash jumped 38.5 pts over Gemini 3 Flash (33.6%). But GPT-5.5 at 84.6% holds a 12.5-point lead. No independent re-run at 5-day mark — cited from DeepMind model card. For workloads requiring frontier abstract reasoning, GPT-5.5 currently leads.

GPT-5.5 leads · −12.5 pts

HLE

Flash: 40.2% — trails Opus 4.7 and Gemini 3.1 Pro

Humanity's Last Exam — parametric knowledge and academic reasoning. Flash (40.2%) trails Gemini 3.1 Pro (44.4%) by 4.2 pts and Claude Opus 4.7 (46.9%) by 6.7 pts. Confirmed as component of AA Intelligence Index v4.0. For knowledge-intensive RAG or academic tasks, Pro-tier rivals still lead.

Opus 4.7 leads · −6.7 pts

MRCR 128k

Flash: 77.3% — GPT-5.5 leads at 94.8%

Dense long-context retrieval at 128k tokens — Flash's largest comparative gap (−17.5 pts vs GPT-5.5). At 1M tokens, Flash's MRCR score reportedly drops to 26.6%. Teams with needle-in-haystack retrieval across long context should evaluate whether the 1M window carries the retrieval accuracy they need.

GPT-5.5 leads · −17.5 pts

SWE-Bench Pro

Flash: 55.1% — Opus 4.7 leads at 64.3%

Complex code refactoring and open-source engineering tasks. Flash (55.1%) improved over Gemini 3 Flash (49.6%), but trails Claude Opus 4.7 (64.3%) by 9.2 pts. No independent re-run yet. For heavy refactoring pipelines, Opus 4.7 remains the benchmark leader.

Opus 4.7 leads · −9.2 pts

05 — Missing Eval

The Aider Polyglot gap — five days in and no entry.

The Aider Polyglot leaderboard is the most-cited independent coding benchmark among AI practitioners evaluating model-assisted development. As of May 25, 2026 — five days after Gemini 3.5 Flash GA — the leaderboard has no entry for Gemini 3.5 Flash. The leaderboard was last refreshed November 20, 2025. Paul Gauthier (Aider's creator) has not yet published a Polyglot run for any 2026-era Gemini model.

This matters for two reasons. First, the Aider Polyglot benchmark is widely used in model selection decisions for coding copilot and code-generation workloads — its absence means practitioners lack the one indie data point they often go to first. Second, it illustrates the asymmetry between vendor-launch velocity and indie-eval velocity. Google can ship a model and a benchmark card on the same day; Aider cannot be expected to run a fresh leaderboard within five days of every major model launch. This is not a criticism of Aider — it is a structural feature of how the evaluation ecosystem works.

The practical implication: any source claiming an Aider Polyglot score for Gemini 3.5 Flash as of May 25, 2026 is fabricating a number. When Gauthier does publish a Polyglot run, that figure will carry meaningful signal. Until then, the most credible coding proxies for Flash are Terminal-Bench 2.1 (76.2%, confirmed as an AA sub-evaluation) and SWE-Bench Pro (55.1%, vendor-reported and not yet independently re-run). For coding workload comparisons at the five-day mark, our head-to-head agentic coding breakdown covers the available data without fabricating an Aider number.

Any source claiming an Aider Polyglot score for Gemini 3.5 Flash as of May 25, 2026 is fabricating a number. The leaderboard was last refreshed November 20, 2025, and there is no Gemini 3.5 Flash entry. (Digital Applied analysis, May 25, 2026)

06 — Evaluation Economics

$1,552 to run the full eval suite — 5.6× the prior Flash.

The Hacker News reaction to Gemini 3.5 Flash's launch is broadly mischaracterized as skepticism about the benchmark numbers. The HN thread shows something more specific: the community is frustrated by pricing, not by the scores. The most upvoted comment, from user GodelNumbering, compares the two model generations side-by-side: “Gemini 2.5 flash: $0.30/$2.50... Gemini 3.5 flash: $1.50/$9.00.” The numbers are accurate: Gemini 3.5 Flash is priced at $1.50 per 1M input tokens and $9.00 per 1M output tokens. Against the Gemini 2.5 Flash prices quoted in that comment, that is 5× on input and 3.6× on output; the increase over Gemini 3 Flash is approximately 5.5×. The “Flash” name has historically implied budget-tier pricing; that implication is no longer supported by the price point.
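The generation-over-generation multiple can be computed directly from the two price pairs in the quoted comment. The sketch below is illustrative only: the `PRICES` table and `price_ratio` helper are hypothetical names, and the figures are the per-1M-token prices quoted above.

```python
# (input, output) USD per 1M tokens, as quoted in the HN comment above.
PRICES = {
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 3.5 Flash": (1.50, 9.00),
}


def price_ratio(new_model: str, old_model: str) -> tuple[float, float]:
    """Return (input multiple, output multiple) of new_model over old_model."""
    new_in, new_out = PRICES[new_model]
    old_in, old_out = PRICES[old_model]
    return round(new_in / old_in, 2), round(new_out / old_out, 2)


if __name__ == "__main__":
    in_x, out_x = price_ratio("Gemini 3.5 Flash", "Gemini 2.5 Flash")
    print(f"input: {in_x}x, output: {out_x}x")
```

Input and output multiples differ, so a single headline multiple only holds for a specific input-to-output token mix.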

Artificial Analysis published a figure that contextualizes the pricing frustration from an evaluator's perspective: running the full Intelligence Index v4.0 suite on Gemini 3.5 Flash costs $1,552. That is approximately 5.6× the cost of running the same suite on Gemini 3 Flash Preview, and approximately 9× the cost of Gemini 2.5 Flash. Artificial Analysis disclosed this figure in its launch analysis; it is not a Google-disclosed metric, and Google did not address the cost-to-evaluate dimension in its May 19 announcement.
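The multiples above imply rough baseline costs for the earlier models. The sketch below only back-computes from the quoted $1,552 and the approximate multipliers, so the results are order-of-magnitude estimates rather than figures published by Artificial Analysis; the names are hypothetical.

```python
EVAL_COST_USD = 1552.0  # quoted cost to run the Intelligence Index v4.0 on Gemini 3.5 Flash

# Approximate multipliers quoted above (Gemini 3.5 Flash cost / older model cost).
MULTIPLIERS = {
    "Gemini 3 Flash Preview": 5.6,
    "Gemini 2.5 Flash": 9.0,
}


def implied_baseline_cost(multiplier: float, current_cost: float = EVAL_COST_USD) -> float:
    """Back out the older model's eval cost from an approximate multiplier."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return current_cost / multiplier


if __name__ == "__main__":
    for model, mult in MULTIPLIERS.items():
        print(f"{model}: roughly ${implied_baseline_cost(mult):,.0f} implied")
```

Because the multipliers are themselves rounded, the implied baselines should be read as ranges, not exact costs.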

For engineering teams, the relevant pricing context is the comparison against the models Flash competes with. At $1.50 input / $9.00 output, Flash is approximately 17% cheaper than Gemini 2.5 Pro on workloads under 200K tokens, and approximately 75% more expensive than Gemini 3.1 Pro. The community estimate on Hacker News places the model at 250-300B total parameters with 10-16B active parameters. This is unconfirmed community speculation, not a Google disclosure, and should not be cited as fact. Google has not disclosed parameter counts for Gemini 3.5 Flash.

The practical framing for procurement: Flash can deliver Pro-tier agentic performance at a price point that is meaningfully below Opus 4.7 and GPT-5.5, depending on workload. For teams running high-volume agent pipelines where MCP Atlas performance matters more than frontier reasoning or heavy refactoring, the pricing calculus can work. Our AI transformation advisory practice has been running this specific cost-performance analysis for clients evaluating model-selection decisions in Q2 2026.
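For the procurement calculus, a per-request cost estimate is the useful unit. The sketch below uses the prices quoted in this section ($1.50 input, $9.00 output, and $0.15 cached input per 1M tokens, plus the 50% Batch API discount). The function name is hypothetical, and applying the batch discount to the whole request, including cached tokens, is an assumption rather than a documented billing rule.

```python
# USD per 1M tokens, as quoted in this section.
INPUT_PER_M = 1.50
CACHED_INPUT_PER_M = 0.15
OUTPUT_PER_M = 9.00
BATCH_DISCOUNT = 0.50  # assumption: applied to the whole request total


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
    batch: bool = False,
) -> float:
    """Estimate the cost of one request in USD.

    cached_input_tokens is the portion of input_tokens served from cache.
    """
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    fresh_input = input_tokens - cached_input_tokens
    total = (
        fresh_input * INPUT_PER_M
        + cached_input_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000
    return total * (1 - BATCH_DISCOUNT) if batch else total


if __name__ == "__main__":
    # Example agent step: 20k-token prompt (15k cached), 1.5k-token reply.
    print(f"${request_cost_usd(20_000, 1_500, cached_input_tokens=15_000):.4f}")
```

Multiplying the per-request figure by expected daily volume gives a budget line that can be compared across candidate models using each vendor's own price sheet.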

Input pricing

Gemini 3.5 Flash input cost

$1.50/1M tok

~5× the Gemini 2.5 Flash input price ($0.30/1M). ~5.5× more than Gemini 3 Flash. Cached input: $0.15/1M (90% discount). 50% off via Batch API.

Verified: AA · llm-stats · OpenRouter

Output pricing

Gemini 3.5 Flash output cost

$9.00/1M tok

~3.6× the Gemini 2.5 Flash output price ($2.50/1M). The 'Flash' tier now sits at a price point that was previously Pro-tier territory. Community frustration is directed at this pricing delta, not the benchmark scores.

~3.6× vs Gemini 2.5 Flash

Eval suite cost

Cost to run AA Intelligence Index v4.0

$1,552

Artificial Analysis's cost to run the full Intelligence Index v4.0 on Gemini 3.5 Flash. ~5.6× the Gemini 3 Flash Preview cost. ~9× the Gemini 2.5 Flash cost. This figure reflects the same token pricing that drives the HN frustration; the frustration is not about the benchmark numbers themselves.

~5.6× Gemini 3 Flash Preview

Context window

1,048,576 input / 65,536 output

1M tok

Thinking on by default. thinking_level enum replaces the integer thinking_budget from Gemini 3 Flash. Knowledge cutoff: January 2026. MRCR at 1M: 26.6% — significantly below the 128k score of 77.3%.

thinking_level enum (Jan 2026 cutoff)

07 — Five-Day Verdict

What was true, what's missing, and what to watch.

The five-day independent read on Gemini 3.5 Flash resolves the most important question cleanly: Google did not fabricate its benchmark card. Every announced score that has been independently re-tested matches to the decimal. That is more than can be said for many model launches in this generation. The question is whether the surrounding narrative — that Flash has joined the frontier — survives independent scrutiny. The honest answer is: in agentic contexts, yes; in frontier reasoning, no.

The original analysis: the story of Gemini 3.5 Flash's first five days is a case study in the difference between vendor-curated benchmarks and independently-contextualized evals. Google chose benchmarks where Flash performs well, compared it to its own previous-gen Pro rather than to current frontier rivals, and left hallucination rate and cost-to-evaluate out of its disclosure. Independents added all three. None of this is unusual vendor behavior — it is the normal pattern of a model launch. What is useful is having the independent layer run promptly enough to contextualize it while the model is still in early procurement consideration.

Projecting forward: the Aider Polyglot gap will close. When Gauthier publishes a Polyglot run for Flash, it will be the most-watched indie coding data point for the model — and based on the SWE-Bench Pro (55.1%) and Terminal-Bench 2.1 (76.2%) numbers, a Flash Polyglot score in the 65-75% range would be consistent with the available data. That is a projection, not a figure; it should not be cited as a number. The more meaningful watch is whether LMArena's ranking stabilizes around the Artificial Analysis assessment (within ~3 pts of Opus 4.7) over the next two to four weeks as public voting volume increases.

For the broader Gemini 3.5 Flash context — model card, API migration, and the I/O 2026 announcement that introduced it — see the Google I/O 2026 complete announcement guide. For teams making procurement decisions between Flash and its rivals on agentic coding tasks specifically, the head-to-head agentic coding comparison is the next read.

Conclusion

Five days confirms the scores. The narrative is what independents reframed.

Gemini 3.5 Flash's announced benchmark numbers held up against five days of independent scrutiny. Artificial Analysis, llm-stats, and WaveSpeed all confirm the figures Google published on May 19 — not a single re-tested score diverged from the vendor claim. That is the cleanest possible five-day outcome for a model launch.

What independents added is what makes this roundup worth reading: a 61% hallucination rate that Google did not disclose, an Intelligence Index rank of #8 rather than #1 (behind GPT-5.5 and Claude Opus 4.7), and a $1,552 cost-to-evaluate that confirms the pricing frustration visible on Hacker News is grounded in real economics. Flash is the most capable agentic Flash-tier model available as of May 25, 2026. It is not the frontier reasoning leader. For teams whose workloads are agent-heavy, it is a strong candidate. For teams whose workloads are frontier-reasoning-heavy, Opus 4.7 and GPT-5.5 still win the relevant benchmarks by 7–17 points. The Aider Polyglot leaderboard has no entry as of today — watch for that when it publishes.

Gemini 3.5 Pro has not yet shipped (rumored June 2026). When it does, the competitive picture may shift. Until then, Flash is the most honest available read on where Google's Flash tier sits in the current generation: leading on agentic orchestration, trailing on frontier reasoning, and priced at a level that has redefined what “Flash” means.
