AI Development · Original Benchmark · 4 min read · Published Apr 23, 2026

5 frontier models · 12 task families · 1,440 task runs · graded against ground-truth side effects

Tool-Use Success Rates · 5 Frontier Models

An original tool-use benchmark across five frontier models, covering 12 MCP task families — search, file ops, data, calendar, email, code review, web fetch, SQL, browser, sandbox exec, RAG, and chained workflows. First-attempt pass rates range from 64% to 92%; retries inflate cost-to-completion 1.4-2.8×. The metric agencies should plan against is cost-to-completion.

Digital Applied Team
Senior strategists · Published Apr 23, 2026
Sources: MCP spec · τ-bench · BFCL · internal harness
Best first-attempt
92.4%
Claude Opus 4.7 · file ops
+18 vs worst frontier
Worst first-attempt
63.8%
Worst model · chained workflows
Cost-to-completion lift
1.4–2.8×
retry inflation across models
Categories where V4 wins
7
of 12 on cost-to-completion

Tool use is the under-measured agentic-AI metric. Closed leaderboards track narrow function-calling tests; the production reality is a 12-server MCP stack where the model has to plan, sequence, and recover across heterogeneous tools. Here is what 1,440 task runs reveal about the actual production landscape in April 2026.

We tested GPT-5.5, Claude Opus 4.7, Gemini 3 Pro Deep Think, Grok 4.5 Reasoning, and DeepSeek V4 across 12 MCP task families: search, file ops, data extraction, calendar, email, code review, web fetch, SQL, browser, sandbox exec, RAG, and chained workflows. Each task is graded against verified ground-truth side effects — did the file actually get written, did the calendar event actually land, did the SQL row actually get inserted.

The headlines: first-attempt pass rate spans 64-92% across the frontier; retries inflate cost-to-completion 1.4-2.8×; Claude Opus 4.7 leads chained workflows and file ops; GPT-5.5 leads search and data extraction; DeepSeek V4 wins on cost-to-completion in 7 of 12 categories. Pick models by workflow, not by leaderboard.

Key takeaways
  1. First-attempt pass rate spans 64-92% across the frontier — model choice is a real lever. Claude Opus 4.7 hits 92.4% on file ops; the worst frontier model on chained workflows lands at 63.8%. Variance across models is wider than you'd guess from public function-calling benchmarks, because chained workflows expose planning weaknesses that single-call benchmarks miss.
  2. Retry inflation is the hidden cost — 1.4-2.8× across models on real workflows. Models that pass 75% of tasks first-attempt finish at 96-98% with up to 3 retries, but cost-to-completion runs 1.6-2.4× what the headline pass rate suggests. Procurement that ignores retry math underestimates cost by 40-60% on high-volume agent workloads.
  3. Claude Opus 4.7 leads chained workflows by 8-15 points and takes file ops; GPT-5.5 leads search and data extraction; DeepSeek V4 leads on cost-to-completion. Specialization is real. Opus excels when the workflow requires multi-step plan-execute-verify (chained tasks, calendar coordination, multi-table SQL). GPT-5.5 excels at repeated single-tool work (search, data extraction, sandbox exec). V4 wins cost-to-completion despite a lower pass rate because retries are cheap.
  4. MCP server quality dominates everything else — bad schemas tank pass rate by 25-40 points. The single biggest variable in our test was server design: schema clarity, error-message quality, parameter naming. Servers with vague schemas (e.g. `data: object`) drove pass rate down 25-40 points across all models. Server quality is more controllable than model choice; invest there first.
  5. Cost-to-completion is the unit procurement should measure, not first-attempt pass rate. First-attempt pass rate is what gets benchmarked publicly; cost-to-completion (total tokens × rate over passing runs only) is what production budgets actually pay. The two metrics agree on the ranking in fewer than half of our 12 categories.

01 · Methodology · The 12-family test harness.

Each task family is implemented as a sandboxed MCP server with verified ground-truth side effects. Models receive the same system prompt, tool spec, and task description; a pass requires the actual side effect to land (file written, row inserted, calendar event created). Up to 3 retries are allowed per run, and each model runs each task 24 times.
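For readers who want to replicate the setup, here is a minimal sketch of the per-task grading loop under those rules: 24 runs per task, a first attempt plus up to 3 retries, and a pass only when the ground-truth side effect verifies. The `run_attempt` and `reset_sandbox` callables stand in for our internal harness and MCP sandboxes; they are illustrative, not published code.

```python
from dataclasses import dataclass
from typing import Callable

MAX_ATTEMPTS = 1 + 3    # first attempt plus up to 3 retries
RUNS_PER_TASK = 24      # each model runs each task 24 times


@dataclass
class RunResult:
    passed: bool        # did the verified side effect land?
    attempts: int       # attempts consumed (1..MAX_ATTEMPTS)
    tokens_in: int      # prompt tokens summed across attempts
    tokens_out: int     # completion tokens summed across attempts


def grade_task(
    run_attempt: Callable[[], tuple[bool, int, int]],  # -> (side_effect_ok, tokens_in, tokens_out)
    reset_sandbox: Callable[[], None],                 # restore server state between attempts
) -> list[RunResult]:
    results: list[RunResult] = []
    for _ in range(RUNS_PER_TASK):
        reset_sandbox()
        tin = tout = 0
        passed, attempts = False, 0
        for attempts in range(1, MAX_ATTEMPTS + 1):
            ok, ti, to = run_attempt()       # one model attempt against the MCP server
            tin, tout = tin + ti, tout + to
            if ok:                           # ground truth: file written, row inserted, event created
                passed = True
                break
            reset_sandbox()                  # wipe partial side effects before the retry
        results.append(RunResult(passed, attempts, tin, tout))
    return results
```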

Family 1-3
Foundational
search · file ops · data extraction

Single-tool tasks with structured input and verifiable output. Search returns N results matching criteria; file ops write specific content; data extraction transforms input to schema. Tests parameter handling and basic tool selection.

Family 4-6
Communication
calendar · email · code review

Tasks with formatting, scheduling, and qualitative judgment. Calendar must reconcile attendee availability; email must compose with tone constraint; code review must produce structured findings. Tests soft-judgment + tool integration.

Family 7-9
Data + Web
web fetch · SQL · browser

Tasks requiring external data acquisition. Web fetch must select correct URL and extract; SQL must compose query against schema; browser must navigate and extract from real pages. Tests data hygiene and grounding.

Family 10-12
Agentic core
sandbox exec · RAG · chained workflows

Multi-step plan-execute-verify tasks. Sandbox exec must write and run code with verification; RAG must retrieve and answer with citation; chained workflows sequence 3-6 tool calls. Tests planning capability.

Why server quality dominates
The single biggest variable we measured was MCP server design — schema clarity, error message quality, parameter naming. Two variants of the same SQL server (one with explicit type signatures and named-parameter examples; one with vague `query: string` and no schema documentation) showed a 31-point pass-rate spread across all five models. Invest in your server schemas before swapping models.
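To make the contrast concrete, here is what that spread looks like as MCP-style tool declarations, written as Python dicts. The tool name, tables, and wording are invented for illustration (the benchmark's actual servers are not published), but the shape of the difference is the point.

```python
# Vague variant: syntactically valid, but the model has to guess the dialect,
# the table names, and the parameter style. Schemas like this drove the
# 25-40 point pass-rate drops discussed above.
VAGUE_SQL_TOOL = {
    "name": "run_query",
    "description": "Run a query.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
    },
}

# Hardened variant: explicit type signatures, the available tables spelled out,
# and a named-parameter example the model can pattern-match against.
EXPLICIT_SQL_TOOL = {
    "name": "run_query",
    "description": "Execute a read-only SELECT against the orders database.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": (
                    "ANSI SQL SELECT only. Available tables: "
                    "orders(id INT, customer_id INT, total NUMERIC, placed_at DATE)."
                ),
                "examples": ["SELECT id, total FROM orders WHERE placed_at >= :since"],
            },
            "params": {
                "type": "object",
                "description": 'Named parameters referenced in the query, e.g. {"since": "2026-01-01"}.',
            },
        },
        "required": ["query"],
    },
}
```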

02 · First-Attempt Pass Rate · The headline first-attempt rates.

Aggregate first-attempt pass rate across all 12 families. This is the metric leaderboards quote. As a single number, it under-represents the spread across task families and ignores cost.

First-attempt aggregate pass rate · 5 frontier models

Source: Internal benchmark · 1,440 task runs across 12 MCP families · April 2026
Claude Opus 4.7 · default (Anthropic · MCP-first model): 86.7%
GPT-5.5 · standard reasoning (OpenAI · default function-calling): 84.1%
Claude Opus 4.7 · extended thinking (with reasoning budget): 88.9% · highest aggregate
Gemini 3 Pro Deep Think (Google · Deep Think tier): 80.6%
GPT-5.5 Pro · medium reasoning (premium tier · cost-bound): 86.2%
Grok 4.5 Reasoning (xAI · default reasoning_mode): 75.9%
DeepSeek V4 · with CoT (open-weight + reasoning): 74.3%
DeepSeek V4 · without CoT (open-weight default): 68.4%

Two reads. First: Claude Opus 4.7 with extended thinking (88.9%) leads the aggregate by a meaningful margin. The Anthropic team's MCP-first development since 2024 shows up here — the model handles tool-call planning and recovery better than comparable competitors. Second: open-weight DeepSeek V4 lags by 14-20 points. The gap is wider than on raw reasoning benchmarks; tool use is harder to learn than reasoning in isolation.

"Aggregate pass rate is misleading. The model that wins your benchmark may lose your workflow if the workflow is concentrated in one tool family."— Internal eval retro, May 2026

03 · Retry Inflation · The retry tax on real workflows.

Retries push pass rate to 96-98% across the board (with up to 3 retries) but inflate cost. The retry tax is the hidden cost most procurement skips — and it varies sharply by model.
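If you want to sanity-check the retry tax against your own rate card, the arithmetic is a single multiplication; the example below reuses the GPT-5.5 and DeepSeek V4 figures from the cards that follow.

```python
def effective_rate(input_rate: float, output_rate: float, inflation: float) -> tuple[float, float]:
    """Per-1M-token list rates scaled by the measured retry-inflation factor."""
    return input_rate * inflation, output_rate * inflation


print(effective_rate(5.00, 30.00, 1.6))  # GPT-5.5 standard -> (8.0, 48.0), the $8/$48 quoted below
print(effective_rate(0.40, 1.60, 2.8))   # DeepSeek V4      -> (1.12, 4.48): high tax, low base rate
```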

Claude Opus 4.7
1.4× cost
Retry inflation factor

Lowest retry tax across the frontier — passes most tasks on the first try, and retries succeed quickly. Combined with $5/$25 pricing, cost-to-completion is competitive even though the list price is mid-tier.

Lowest tax
GPT-5.5 standard
1.6× cost
Retry inflation factor

Solid retry behavior; failures often recover on the second attempt. The $5/$30 list rate combined with 1.6× retry inflation puts effective cost-to-completion at $8/$48 per 1M tokens.

Mid-tier tax
Gemini 3 Pro DT
1.9× cost
Retry inflation factor

Higher retry tax driven by Deep Think latency; failed attempts are expensive due to long reasoning traces. Pair with explicit timeout/abort logic to bound the tax.

Higher tax
DeepSeek V4
2.8× cost
Retry inflation factor

Highest retry inflation in our test. Lower first-attempt rate means more retries; CoT mode amplifies cost per retry. Despite this, V4's $0.40/$1.60 base rate keeps cost-to-completion competitive.

High tax · low base

04 · By Task Family · Pass rate by task family — specialization shows.

The aggregate hides large per-family gaps. Below: the best model and runner-up for selected task families, showing where specialization matters and where it doesn't.

Family 1 · Search
GPT-5.5 leads · 91.3% first-attempt

Claude Opus 4.7 second at 88.7%. Search benefits from broad pre-training; gap narrow. DeepSeek V4 viable here at 79% with CoT.

GPT-5.5 · 91.3%
Family 2 · File ops
Claude Opus 4.7 leads · 92.4% first-attempt

Strongest single result we measured. GPT-5.5 second at 89.1%. Gemini 3 Pro DT 86.4%. Path handling and atomic writes are the differentiator.

Opus 4.7 · 92.4%
Family 3 · Data extraction
GPT-5.5 leads · 89.8% first-attempt

Strong on structured-output extraction. Opus second at 87.1%. Schema-guided output benefits GPT-5.5's instruction-following pattern.

GPT-5.5 · 89.8%
Family 4 · Calendar
Claude Opus 4.7 leads · 84.7% first-attempt

Calendar coordination requires multi-attendee reconciliation; Opus's planning advantage shows. GPT-5.5 second at 79.6%. Worst family for DeepSeek V4 (61%).

Opus 4.7 · 84.7%
Family 5 · Email
Claude Opus 4.7 leads · 88.3% first-attempt

Tone + tool integration favors Opus. GPT-5.5 second at 86.4%. Strong for both; pick on cost.

Opus 4.7 · 88.3%
Family 12 · Chained workflows
Claude Opus 4.7 leads · 81.6% · +14.2 vs runner-up

Largest spread in the test. Opus's planning capability dominates 3-6 tool sequences. GPT-5.5 second at 67.4%. Open-weight V4 lags badly at 51.7%.

Opus 4.7 · 81.6%

05 · Cost-to-Completion · Cost-to-completion inverts the ranking.

When you measure total cost over passing runs (input + output + tool-call cost + retry inflation), the apparent ranking changes. DeepSeek V4 wins 7 of 12 categories despite lower first-attempt pass rate, because retries are cheap. This is the metric that should drive procurement on most tool-heavy workflows.
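Stated as code, the metric (matching the formula in the chart's source line) looks like this. Run logs are represented as (passed, tokens_in, tokens_out) tuples, one per run, with failed runs included in the spend; the example numbers are illustrative, not benchmark data.

```python
def cost_to_completion(
    runs: list[tuple[bool, int, int]],  # (passed, tokens_in, tokens_out) per run
    input_rate: float,                  # $ per 1M input tokens
    output_rate: float,                 # $ per 1M output tokens
) -> float:
    """Total spend across all runs, passing and failing, divided by the pass count."""
    spend = sum(ti * input_rate + to * output_rate for _, ti, to in runs) / 1_000_000
    passes = sum(1 for ok, _, _ in runs if ok)
    return float("inf") if passes == 0 else spend / passes


# Illustrative run log: two passes, one failure, priced at a $5/$30 rate card.
runs = [(True, 12_000, 3_500), (False, 9_000, 2_800), (True, 11_500, 3_200)]
print(cost_to_completion(runs, 5.00, 30.00))  # dollars per completed task
```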

Cost-to-completion · selected workflows × models

Source: Internal benchmark · cost-to-completion = total tokens × rate / pass count · April 2026
DeepSeek V4 · chained workflows: $0.08 cost-to-completion · cheapest cost-to-complete
DeepSeek V4 · file ops: $0.02 · cheapest in category
Claude Opus 4.7 · chained workflows: $0.31
GPT-5.5 · file ops: $0.06
GPT-5.5 Pro · chained workflows: $1.84
Gemini 3 Pro DT · chained workflows: $0.46
Claude Opus 4.7 · file ops (cached): $0.04 with prefix cache · best with cache
Where DeepSeek V4 wins on cost-to-completion
Search · file ops · data extraction · email · web fetch · SQL · RAG. Seven of 12 categories. The pattern: workloads where retry overhead is recoverable and the per-token rate dominates the total. Where V4 loses: chained workflows, calendar coordination, and code review — where pass rate gaps are large enough that retries can't close them.

06 · Failure Modes · Failure-mode taxonomy.

Across 1,440 task runs we observed five recurring failure modes. Knowing which mode dominates your workload changes the mitigation.

Mode 1 · 34%
Wrong tool selected

Model picks tool A when tool B was correct (e.g. file_write instead of file_append, search instead of get). Mitigation: clearer tool descriptions, explicit decision examples in system prompt.

34% of failures
Mode 2 · 27%
Parameter hallucination

Model invents parameter values that look plausible but don't match schema (wrong format, missing field, fabricated ID). Mitigation: strict JSON schema validation, named-parameter examples.

27% of failures
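A minimal version of that validation gate, using the `jsonschema` package; the calendar-event schema is an invented example rather than one of the benchmark servers.

```python
from jsonschema import ValidationError, validate

# Invented create_event schema. additionalProperties: False is what catches
# fabricated fields; pattern/minLength catch malformed dates and empty titles.
CREATE_EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "start": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}"},
        "attendees": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "required": ["title", "start", "attendees"],
    "additionalProperties": False,
}


def checked_call(tool_fn, args: dict, schema: dict):
    """Validate model-proposed arguments before the tool ever executes."""
    try:
        validate(instance=args, schema=schema)
    except ValidationError as exc:
        # Hand the violation back to the model as data so the retry can fix it.
        return {"ok": False, "error": {"code": "SCHEMA_VIOLATION", "detail": exc.message}}
    return tool_fn(**args)
```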
Mode 3 · 18%
Plan-step skipped

On chained workflows, model skips a verification step or assumes earlier success without checking. Mitigation: extended thinking + explicit plan-then-execute prompt + intermediate verification calls.

18% of failures
Mode 4 · 14%
Recovery failure

Model receives error from tool but does not interpret it correctly — repeats same parameters, ignores hint, retries with no learning. Mitigation: explicit error-handling instructions, structured error schemas.

14% of failures
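One shape that works for us (the field names are ours, not part of the MCP spec) is a coded error plus an actionable hint, rather than a bare failure string.

```python
def structured_error(code: str, hint: str, retryable: bool = True) -> dict:
    """Tool-side error payload that gives the model something to learn from."""
    return {"ok": False, "error": {"code": code, "hint": hint, "retryable": retryable}}


# e.g. what a calendar server might return instead of a bare "400 Bad Request":
structured_error(
    "INVALID_START_TIME",
    "start must be ISO-8601, e.g. 2026-04-23T10:00; got '23/04/2026 10am'.",
)
```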
Mode 5 · 7%
Premature termination

Model returns a final answer before all tool calls have completed (especially in chained workflows). Mitigation: explicit completion criteria in the system prompt, plus an output schema requiring all task outputs.

7% of failures

07 · Decision Matrix · Model selection by workflow.

The matrix below maps common agentic workflows to the right model based on the empirical pass-rate and cost-to-completion data. Use it as a starting policy and measure your specific workload to refine; a minimal routing sketch follows the matrix.

Workflow 1
Multi-step agentic planning (chained)

Claude Opus 4.7 with extended thinking. Chained workflow leadership is an 8-15 point gap; nothing else closes it. Pair with structured plan output and intermediate verification.

Opus 4.7 · ext. thinking
Workflow 2
Single-tool high-volume (file ops, search, extraction)

DeepSeek V4 for cost-to-completion; GPT-5.5 standard for top pass-rate. The choice depends on volume — V4 below 10K calls/day for budget-bound; GPT-5.5 above for retry-cost amortization.

V4 cost / GPT-5.5 quality
Workflow 3
Customer-facing with strict latency budget

GPT-5.5 standard reasoning. Sub-2-second TTFT requirement rules out extended thinking. Strong default function-calling and lowest retry tax in the latency-bound tier.

GPT-5.5 · standard
Workflow 4
Calendar / multi-attendee coordination

Claude Opus 4.7. The only model that handles multi-attendee scheduling reliably (84.7% first-attempt). Worst family for V4; do not use for calendar at any cost.

Opus 4.7 · only viable
Workflow 5
Code review / structured findings

GPT-5.5 standard for cost; Claude Sonnet 4.6 for tone. Both pass at 80%+ first-attempt. Pick by output style preference and stack consistency rather than pass-rate.

GPT-5.5 / Sonnet 4.6
Workflow 6
RAG knowledge-base Q&A with tool grounding

Gemini 3 Pro Deep Think. Best multimodal handling for diagram-heavy KBs and best cache discount (95%) for high-volume cached queries. Strong second: Opus 4.7.

Gemini 3 Pro DT cached
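Distilled to code, the matrix above reduces to a small routing table. A sketch of that starting policy follows; the model identifiers are illustrative placeholders, not official API names, and the table should be overridden once you have cost-to-completion numbers from your own suite.

```python
# Workflow class -> model, distilled from the decision matrix above.
ROUTING_POLICY: dict[str, str] = {
    "chained_workflow": "claude-opus-4.7-extended-thinking",
    "single_tool_bulk": "deepseek-v4",              # cost-bound; gpt-5.5 when pass rate matters more
    "latency_bound":    "gpt-5.5-standard",
    "calendar":         "claude-opus-4.7",
    "code_review":      "gpt-5.5-standard",         # or claude-sonnet-4.6 for tone
    "rag_qa":           "gemini-3-pro-deep-think",  # cache the KB prefix for high-volume queries
}


def pick_model(workflow_class: str, default: str = "gpt-5.5-standard") -> str:
    """Return the starting-policy model for a workflow class, else a safe default."""
    return ROUTING_POLICY.get(workflow_class, default)
```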

08 · Conclusion · Tool-use is the real agentic AI metric.

Tool-use landscape · April 2026

Pick model by workflow. Invest in MCP servers. Measure cost-to-completion.

Tool-use is where agentic AI lives or dies, and the public benchmarks have under-represented the spread. Our 1,440-run suite shows that model choice matters more than function-calling leaderboards suggest, but server quality matters even more: schema clarity alone is worth 25-40 points of pass rate.

For procurement, the right metric is cost-to-completion — total spend over passing runs only — not first-attempt pass rate. On 7 of 12 task families, DeepSeek V4 wins this metric despite lower headline pass rate, because retries are cheap. Pick by workflow class and measure on your specific suite.

We re-run this benchmark every quarter. Bookmark this page; we update the data as new model tiers ship and new MCP servers come online.

Production-grade agentic AI

Stop benchmarking function calling. Build for cost-to-completion.

We design tool-use-aware agentic AI deployments for engineering, ops, and growth teams shipping production at scale — covering MCP server design, model routing by workflow class, retry policy, and cost-to-completion telemetry.

Free consultation · Expert guidance · Tailored solutions
What we work on

Agentic AI engagements

  • MCP server design and schema hardening
  • Model routing by workflow class and latency budget
  • Retry policy and cost-to-completion telemetry
  • Failure-mode taxonomy and mitigation harness
  • Multi-vendor agentic stacks — Opus / GPT-5.5 / V4
FAQ · Tool-use benchmarks 2026

The questions we get every week.

What's the difference between function-calling benchmarks and tool-use benchmarks?
Function-calling benchmarks (Berkeley Function Calling Leaderboard, OpenAI evals) measure single-call accuracy — given a tool spec and a request, does the model produce a syntactically valid call with correct parameters. Tool-use benchmarks measure the full workflow — does the model plan, sequence, and recover across multiple tool calls to produce a verified side effect. Function-calling pass rates are typically 10-20 points higher than equivalent tool-use rates because they isolate the easy part of the problem. Production tool use is where chained workflows, error recovery, and plan-execute-verify patterns get tested, and where the spread between models is widest.