AI Development · Original Benchmark · 4 min read · Published Apr 23, 2026

5 frontier models · 12 task families · 1,440 task runs · graded against ground-truth side effects

Tool-Use Success Rates · 5 Frontier Models

An original tool-use benchmark across five frontier models, covering 12 MCP task families — search, file ops, data, calendar, email, code review, web fetch, SQL, browser, sandbox exec, RAG, and chained workflows. First-attempt pass rates range from 64% to 92%; retries inflate cost-to-completion 1.4-2.8×. The metric agencies should plan against is cost-to-completion.

Digital Applied Team
Senior strategists · Published Apr 23, 2026
Sources: MCP spec · τ-bench · BFCL · internal harness
Best first-attempt
92.4%
Claude Opus 4.7 · file ops
+18 vs worst frontier
Worst first-attempt
63.8%
Worst model · chained workflows
Cost-to-completion lift
1.4–2.8×
retry inflation across models
Categories where V4 wins
7
of 12 on cost-to-completion

Tool use is the under-measured agentic-AI metric. Closed leaderboards track narrow function-calling tests; the production reality is a 12-server MCP stack where the model has to plan, sequence, and recover across heterogeneous tools. Here is what 1,440 task runs reveal about the actual production landscape in April 2026.

We tested GPT-5.5, Claude Opus 4.7, Gemini 3 Pro Deep Think, Grok 4.5 Reasoning, and DeepSeek V4 across 12 MCP task families: search, file ops, data extraction, calendar, email, code review, web fetch, SQL, browser, sandbox exec, RAG, and chained workflows. Each task is graded against verified ground-truth side effects — did the file actually get written, did the calendar event actually land, did the SQL row actually get inserted.

The headlines: first-attempt pass rate spans 64-92% across the frontier; retries inflate cost-to-completion 1.4-2.8×; Claude Opus 4.7 leads chained workflows and file ops; GPT-5.5 leads search and data extraction; DeepSeek V4 wins on cost-to-completion in 7 of 12 categories. Pick models by workflow, not by leaderboard.

Key takeaways
  1. First-attempt pass rate spans 64-92% across the frontier — model choice is a real lever. Claude Opus 4.7 hits 92.4% on file ops; the worst frontier model on chained workflows lands at 63.8%. Variance across models is wider than you'd guess from public function-calling benchmarks, because chained workflows expose planning weaknesses that single-call benchmarks miss.
  2. Retry inflation is the hidden cost — 1.4-2.8× across models on real workflows. Models that pass 75% of tasks first-attempt finish at 96-98% with up to 3 retries, but cost-to-completion runs 1.6-2.4× what the headline pass rate suggests. Procurement that ignores retry math underestimates cost by 40-60% on high-volume agent workloads.
  3. Claude Opus 4.7 leads chained workflows by 8-15 points and takes file ops; GPT-5.5 leads search and data extraction; DeepSeek V4 leads on cost-to-completion. Specialization is real. Opus excels when the workflow requires multi-step plan-execute-verify (chained tasks, calendar coordination, multi-table SQL). GPT-5.5 excels at repeated single-tool work (search, data extraction, sandbox exec). V4 wins cost-to-completion despite a lower pass rate because retries are cheap.
  4. MCP server quality dominates everything else — bad schemas tank pass rate by 25-40 points. The single biggest variable in our test was server design: schema clarity, error-message quality, parameter naming. Servers with vague schemas (e.g. `data: object`) drove pass rate down 25-40 points across all models. Server quality is more controllable than model choice; invest there first.
  5. Cost-to-completion is the unit procurement should measure, not first-attempt pass rate. First-attempt pass rate is what gets benchmarked publicly; cost-to-completion (total tokens × rate over passing runs only) is what production budgets actually pay. The two metrics agree on the ranking in fewer than half of our 12 categories.

01 · Methodology · The 12-family test harness.

Each task family is implemented as a sandboxed MCP server with verified ground-truth side effects. Models receive the same system prompt, tool spec, and task description; a pass requires the actual side effect to land (file written, row inserted, calendar event created). Up to 3 retries are allowed per run, and each model runs each task 24 times.
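For readers who want to replicate the setup, here is a minimal sketch of the per-task grading loop under those rules: 24 runs per task, a first attempt plus up to 3 retries, and a pass only when the ground-truth side effect verifies. The `run_attempt` and `reset_sandbox` callables stand in for our internal harness and MCP sandboxes; they are illustrative, not published code.

```python
from dataclasses import dataclass
from typing import Callable

MAX_ATTEMPTS = 1 + 3    # first attempt plus up to 3 retries
RUNS_PER_TASK = 24      # each model runs each task 24 times


@dataclass
class RunResult:
    passed: bool        # did the verified side effect land?
    attempts: int       # attempts consumed (1..MAX_ATTEMPTS)
    tokens_in: int      # prompt tokens summed across attempts
    tokens_out: int     # completion tokens summed across attempts


def grade_task(
    run_attempt: Callable[[], tuple[bool, int, int]],  # -> (side_effect_ok, tokens_in, tokens_out)
    reset_sandbox: Callable[[], None],                 # restore server state between attempts
) -> list[RunResult]:
    results: list[RunResult] = []
    for _ in range(RUNS_PER_TASK):
        reset_sandbox()
        tin = tout = 0
        passed, attempts = False, 0
        for attempts in range(1, MAX_ATTEMPTS + 1):
            ok, ti, to = run_attempt()       # one model attempt against the MCP server
            tin, tout = tin + ti, tout + to
            if ok:                           # ground truth: file written, row inserted, event created
                passed = True
                break
            reset_sandbox()                  # wipe partial side effects before the retry
        results.append(RunResult(passed, attempts, tin, tout))
    return results
```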

Family 1-3
Foundational
search · file ops · data extraction

Single-tool tasks with structured input and verifiable output. Search returns N results matching criteria; file ops write specific content; data extraction transforms input to schema. Tests parameter handling and basic tool selection.

Family 4-6
Communication
calendar · email · code review

Tasks with formatting, scheduling, and qualitative judgment. Calendar must reconcile attendee availability; email must compose with tone constraint; code review must produce structured findings. Tests soft-judgment + tool integration.

Family 7-9
Data + Web
web fetch · SQL · browser

Tasks requiring external data acquisition. Web fetch must select correct URL and extract; SQL must compose query against schema; browser must navigate and extract from real pages. Tests data hygiene and grounding.

Family 10-12
Agentic core
sandbox exec · RAG · chained workflows

Multi-step plan-execute-verify tasks. Sandbox exec must write and run code with verification; RAG must retrieve and answer with citation; chained workflows sequence 3-6 tool calls. Tests planning capability.

Why server quality dominates
The single biggest variable we measured was MCP server design — schema clarity, error message quality, parameter naming. Two variants of the same SQL server (one with explicit type signatures and named-parameter examples; one with vague `query: string` and no schema documentation) showed a 31-point pass-rate spread across all five models. Invest in your server schemas before swapping models.
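To make the contrast concrete, here is what that spread looks like as MCP-style tool declarations, written as Python dicts. The tool name, tables, and wording are invented for illustration (the benchmark's actual servers are not published), but the shape of the difference is the point.

```python
# Vague variant: syntactically valid, but the model has to guess the dialect,
# the table names, and the parameter style. Schemas like this drove the
# 25-40 point pass-rate drops discussed above.
VAGUE_SQL_TOOL = {
    "name": "run_query",
    "description": "Run a query.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
    },
}

# Hardened variant: explicit type signatures, the available tables spelled out,
# and a named-parameter example the model can pattern-match against.
EXPLICIT_SQL_TOOL = {
    "name": "run_query",
    "description": "Execute a read-only SELECT against the orders database.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": (
                    "ANSI SQL SELECT only. Available tables: "
                    "orders(id INT, customer_id INT, total NUMERIC, placed_at DATE)."
                ),
                "examples": ["SELECT id, total FROM orders WHERE placed_at >= :since"],
            },
            "params": {
                "type": "object",
                "description": 'Named parameters referenced in the query, e.g. {"since": "2026-01-01"}.',
            },
        },
        "required": ["query"],
    },
}
```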

02 · First-Attempt Pass Rate · The headline first-attempt rates.

Aggregate first-attempt pass rate across all 12 families. This is the metric leaderboards quote. As a single number, it under-represents the spread across task families and ignores cost.

First-attempt aggregate pass rate · 5 frontier models

Source: Internal benchmark · 1,440 task runs across 12 MCP families · April 2026
Claude Opus 4.7 · default (Anthropic · MCP-first model): 86.7%
GPT-5.5 · standard reasoning (OpenAI · default function-calling): 84.1%
Claude Opus 4.7 · extended thinking (with reasoning budget): 88.9% · highest aggregate
Gemini 3 Pro Deep Think (Google · Deep Think tier): 80.6%
GPT-5.5 Pro · medium reasoning (premium tier · cost-bound): 86.2%
Grok 4.5 Reasoning (xAI · default reasoning_mode): 75.9%
DeepSeek V4 · with CoT (open-weight + reasoning): 74.3%
DeepSeek V4 · without CoT (open-weight default): 68.4%

Two reads. First: Claude Opus 4.7 with extended thinking (88.9%) leads the aggregate by a meaningful margin. The Anthropic team's MCP-first development since 2024 shows up here — the model handles tool-call planning and recovery better than comparable competitors. Second: open-weight DeepSeek V4 lags by 14-20 points. The gap is wider than on raw reasoning benchmarks; tool use is harder to learn than reasoning in isolation.

"Aggregate pass rate is misleading. The model that wins your benchmark may lose your workflow if the workflow is concentrated in one tool family."— Internal eval retro, May 2026

03 · Retry Inflation · The retry tax on real workflows.

Retries push pass rate to 96-98% across the board (with up to 3 retries) but inflate cost. The retry tax is the hidden cost most procurement skips — and it varies sharply by model.
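If you want to sanity-check the retry tax against your own rate card, the arithmetic is a single multiplication; the example below reuses the GPT-5.5 and DeepSeek V4 figures from the cards that follow.

```python
def effective_rate(input_rate: float, output_rate: float, inflation: float) -> tuple[float, float]:
    """Per-1M-token list rates scaled by the measured retry-inflation factor."""
    return input_rate * inflation, output_rate * inflation


print(effective_rate(5.00, 30.00, 1.6))  # GPT-5.5 standard -> (8.0, 48.0), the $8/$48 quoted below
print(effective_rate(0.40, 1.60, 2.8))   # DeepSeek V4      -> (1.12, 4.48): high tax, low base rate
```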

Claude Opus 4.7
1.4× cost
Retry inflation factor

Lowest retry tax across the frontier — passes most tasks on the first try, and retries succeed quickly. Combined with $5/$25 pricing, cost-to-completion is competitive even though the list price is mid-tier.

Lowest tax
GPT-5.5 standard
1.6× cost
Retry inflation factor

Solid retry behavior; failures often recover on the second attempt. The $5/$30 list rate combined with 1.6× retry inflation puts effective cost-to-completion at $8/$48 per 1M tokens.

Mid-tier tax
Gemini 3 Pro DT
1.9× cost
Retry inflation factor

Higher retry tax driven by Deep Think latency; failed attempts are expensive due to long reasoning traces. Pair with explicit timeout/abort logic to bound the tax.

Higher tax
DeepSeek V4
2.8× cost
Retry inflation factor

Highest retry inflation in our test. Lower first-attempt rate means more retries; CoT mode amplifies cost per retry. Despite this, V4's $0.40/$1.60 base rate keeps cost-to-completion competitive.

High tax · low base

04 · By Task Family · Pass rate by task family — specialization shows.

The aggregate hides large per-family gaps. Below: the best model and runner-up for selected task families, showing where specialization matters and where it doesn't.

Family 1 · Search
GPT-5.5 leads · 91.3% first-attempt

Claude Opus 4.7 second at 88.7%. Search benefits from broad pre-training; gap narrow. DeepSeek V4 viable here at 79% with CoT.

GPT-5.5 · 91.3%
Family 2 · File ops
Claude Opus 4.7 leads · 92.4% first-attempt

Strongest single result we measured. GPT-5.5 second at 89.1%. Gemini 3 Pro DT 86.4%. Path handling and atomic writes are the differentiator.

Opus 4.7 · 92.4%
Family 3 · Data extraction
GPT-5.5 leads · 89.8% first-attempt

Strong on structured-output extraction. Opus second at 87.1%. Schema-guided output benefits GPT-5.5's instruction-following pattern.

GPT-5.5 · 89.8%
Family 4 · Calendar
Claude Opus 4.7 leads · 84.7% first-attempt

Calendar coordination requires multi-attendee reconciliation; Opus's planning advantage shows. GPT-5.5 second at 79.6%. Worst family for DeepSeek V4 (61%).

Opus 4.7 · 84.7%
Family 5 · Email
Claude Opus 4.7 leads · 88.3% first-attempt

Tone + tool integration favors Opus. GPT-5.5 second at 86.4%. Strong for both; pick on cost.

Opus 4.7 · 88.3%
Family 12 · Chained workflows
Claude Opus 4.7 leads · 81.6% · +14.2 vs runner-up

Largest spread in the test. Opus's planning capability dominates 3-6 tool sequences. GPT-5.5 second at 67.4%. Open-weight V4 lags badly at 51.7%.

Opus 4.7 · 81.6%

05 · Cost-to-Completion · Cost-to-completion inverts the ranking.

When you measure total cost over passing runs (input + output + tool-call cost + retry inflation), the apparent ranking changes. DeepSeek V4 wins 7 of 12 categories despite lower first-attempt pass rate, because retries are cheap. This is the metric that should drive procurement on most tool-heavy workflows.
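Stated as code, the metric (matching the formula in the chart's source line) looks like this. Run logs are represented as (passed, tokens_in, tokens_out) tuples, one per run, with failed runs included in the spend; the example numbers are illustrative, not benchmark data.

```python
def cost_to_completion(
    runs: list[tuple[bool, int, int]],  # (passed, tokens_in, tokens_out) per run
    input_rate: float,                  # $ per 1M input tokens
    output_rate: float,                 # $ per 1M output tokens
) -> float:
    """Total spend across all runs, passing and failing, divided by the pass count."""
    spend = sum(ti * input_rate + to * output_rate for _, ti, to in runs) / 1_000_000
    passes = sum(1 for ok, _, _ in runs if ok)
    return float("inf") if passes == 0 else spend / passes


# Illustrative run log: two passes, one failure, priced at a $5/$30 rate card.
runs = [(True, 12_000, 3_500), (False, 9_000, 2_800), (True, 11_500, 3_200)]
print(cost_to_completion(runs, 5.00, 30.00))  # dollars per completed task
```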

Cost-to-completion · selected workflows × models

Source: Internal benchmark · cost-to-completion = total tokens × rate / pass count · April 2026
DeepSeek V4 · chained workflows: $0.08 cost-to-completion · cheapest cost-to-complete
DeepSeek V4 · file ops: $0.02 · cheapest in category
Claude Opus 4.7 · chained workflows: $0.31
GPT-5.5 · file ops: $0.06
GPT-5.5 Pro · chained workflows: $1.84
Gemini 3 Pro DT · chained workflows: $0.46
Claude Opus 4.7 · file ops (cached): $0.04 with prefix cache · best with cache
Where DeepSeek V4 wins on cost-to-completion
Search · file ops · data extraction · email · web fetch · SQL · RAG. Seven of 12 categories. The pattern: workloads where retry overhead is recoverable and the per-token rate dominates the total. Where V4 loses: chained workflows, calendar coordination, and code review — where pass rate gaps are large enough that retries can't close them.

06 · Failure Modes · Failure-mode taxonomy.

Across 1,440 task runs we observed five recurring failure modes. Knowing which mode dominates your workload changes the mitigation.

Mode 1 · 34%
Wrong tool selected

Model picks tool A when tool B was correct (e.g. file_write instead of file_append, search instead of get). Mitigation: clearer tool descriptions, explicit decision examples in system prompt.

34% of failures
Mode 2 · 27%
Parameter hallucination

Model invents parameter values that look plausible but don't match schema (wrong format, missing field, fabricated ID). Mitigation: strict JSON schema validation, named-parameter examples.

27% of failures
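A minimal version of that validation gate, using the `jsonschema` package; the calendar-event schema is an invented example rather than one of the benchmark servers.

```python
from jsonschema import ValidationError, validate

# Invented create_event schema. additionalProperties: False is what catches
# fabricated fields; pattern/minLength catch malformed dates and empty titles.
CREATE_EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "start": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}"},
        "attendees": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "required": ["title", "start", "attendees"],
    "additionalProperties": False,
}


def checked_call(tool_fn, args: dict, schema: dict):
    """Validate model-proposed arguments before the tool ever executes."""
    try:
        validate(instance=args, schema=schema)
    except ValidationError as exc:
        # Hand the violation back to the model as data so the retry can fix it.
        return {"ok": False, "error": {"code": "SCHEMA_VIOLATION", "detail": exc.message}}
    return tool_fn(**args)
```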
Mode 3 · 18%
Plan-step skipped

On chained workflows, model skips a verification step or assumes earlier success without checking. Mitigation: extended thinking + explicit plan-then-execute prompt + intermediate verification calls.

18% of failures
Mode 4 · 14%
Recovery failure

Model receives error from tool but does not interpret it correctly — repeats same parameters, ignores hint, retries with no learning. Mitigation: explicit error-handling instructions, structured error schemas.

14% of failures
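One shape that works for us (the field names are ours, not part of the MCP spec) is a coded error plus an actionable hint, rather than a bare failure string.

```python
def structured_error(code: str, hint: str, retryable: bool = True) -> dict:
    """Tool-side error payload that gives the model something to learn from."""
    return {"ok": False, "error": {"code": code, "hint": hint, "retryable": retryable}}


# e.g. what a calendar server might return instead of a bare "400 Bad Request":
structured_error(
    "INVALID_START_TIME",
    "start must be ISO-8601, e.g. 2026-04-23T10:00; got '23/04/2026 10am'.",
)
```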
Mode 5 · 7%
Premature termination

Model returns a final answer before all tool calls have completed (especially in chained workflows). Mitigation: explicit completion criteria in the system prompt, plus an output schema requiring all task outputs.

7% of failures

07 · Decision Matrix · Model selection by workflow.

The matrix below maps common agentic workflows to the right model based on the empirical pass-rate and cost-to-completion data. Use it as a starting policy and measure your specific workload to refine; a minimal routing sketch follows the matrix.

Workflow 1
Multi-step agentic planning (chained)

Claude Opus 4.7 with extended thinking. Chained workflow leadership is an 8-15 point gap; nothing else closes it. Pair with structured plan output and intermediate verification.

Opus 4.7 · ext. thinking
Workflow 2
Single-tool high-volume (file ops, search, extraction)

DeepSeek V4 for cost-to-completion; GPT-5.5 standard for top pass-rate. The choice depends on volume — V4 below 10K calls/day for budget-bound; GPT-5.5 above for retry-cost amortization.

V4 cost / GPT-5.5 quality
Workflow 3
Customer-facing with strict latency budget

GPT-5.5 standard reasoning. Sub-2-second TTFT requirement rules out extended thinking. Strong default function-calling and lowest retry tax in the latency-bound tier.

GPT-5.5 · standard
Workflow 4
Calendar / multi-attendee coordination

Claude Opus 4.7. The only model that handles multi-attendee scheduling reliably (84.7% first-attempt). Worst family for V4; do not use for calendar at any cost.

Opus 4.7 · only viable
Workflow 5
Code review / structured findings

GPT-5.5 standard for cost; Claude Sonnet 4.6 for tone. Both pass at 80%+ first-attempt. Pick by output style preference and stack consistency rather than pass-rate.

GPT-5.5 / Sonnet 4.6
Workflow 6
RAG knowledge-base Q&A with tool grounding

Gemini 3 Pro Deep Think. Best multimodal handling for diagram-heavy KBs and best cache discount (95%) for high-volume cached queries. Strong second: Opus 4.7.

Gemini 3 Pro DT cached
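Distilled to code, the matrix above reduces to a small routing table. A sketch of that starting policy follows; the model identifiers are illustrative placeholders, not official API names, and the table should be overridden once you have cost-to-completion numbers from your own suite.

```python
# Workflow class -> model, distilled from the decision matrix above.
ROUTING_POLICY: dict[str, str] = {
    "chained_workflow": "claude-opus-4.7-extended-thinking",
    "single_tool_bulk": "deepseek-v4",              # cost-bound; gpt-5.5 when pass rate matters more
    "latency_bound":    "gpt-5.5-standard",
    "calendar":         "claude-opus-4.7",
    "code_review":      "gpt-5.5-standard",         # or claude-sonnet-4.6 for tone
    "rag_qa":           "gemini-3-pro-deep-think",  # cache the KB prefix for high-volume queries
}


def pick_model(workflow_class: str, default: str = "gpt-5.5-standard") -> str:
    """Return the starting-policy model for a workflow class, else a safe default."""
    return ROUTING_POLICY.get(workflow_class, default)
```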

08 · Conclusion · Tool-use is the real agentic AI metric.

Tool-use landscape · April 2026

Pick model by workflow. Invest in MCP servers. Measure cost-to-completion.

Tool-use is where agentic AI lives or dies, and the public benchmarks have under-represented the spread. Our 1,440-run suite shows that model choice matters more than function-calling leaderboards suggest, but server quality matters even more: schema clarity alone is worth 25-40 points of pass rate.

For procurement, the right metric is cost-to-completion — total spend over passing runs only — not first-attempt pass rate. On 7 of 12 task families, DeepSeek V4 wins this metric despite lower headline pass rate, because retries are cheap. Pick by workflow class and measure on your specific suite.

We re-run this benchmark every quarter. Bookmark this page; we update the data as new model tiers ship and new MCP servers come online.

Production-grade agentic AI

Stop benchmarking function calling. Build for cost-to-completion.

We design tool-use-aware agentic AI deployments for engineering, ops, and growth teams shipping production at scale — covering MCP server design, model routing by workflow class, retry policy, and cost-to-completion telemetry.

Free consultation · Expert guidance · Tailored solutions
What we work on

Agentic AI engagements

  • MCP server design and schema hardening
  • Model routing by workflow class and latency budget
  • Retry policy and cost-to-completion telemetry
  • Failure-mode taxonomy and mitigation harness
  • Multi-vendor agentic stacks — Opus / GPT-5.5 / V4
FAQ · Tool-use benchmarks 2026

The questions we get every week.

What's the difference between function-calling benchmarks and tool-use benchmarks?
Function-calling benchmarks (Berkeley Function Calling Leaderboard, OpenAI evals) measure single-call accuracy — given a tool spec and a request, does the model produce a syntactically valid call with correct parameters. Tool-use benchmarks measure the full workflow — does the model plan, sequence, and recover across multiple tool calls to produce a verified side effect. Function-calling pass rates are typically 10-20 points higher than equivalent tool-use rates because they isolate the easy part of the problem. Production tool use is where chained workflows, error recovery, and plan-execute-verify patterns get tested, and where the spread between models is widest.