Tool use is the under-measured agentic-AI metric. Closed leaderboards track narrow function-calling tests; the production reality is a 12-server MCP stack where the model has to plan, sequence, and recover across heterogeneous tools. Here is what 1,440 task runs reveal about the actual production landscape in April 2026.
We tested GPT-5.5, Claude Opus 4.7, Gemini 3 Pro Deep Think, Grok 4.5 Reasoning, and DeepSeek V4 across 12 MCP task families: search, file ops, data extraction, calendar, email, code review, web fetch, SQL, browser, sandbox exec, RAG, and chained workflows. Each task is graded against verified ground-truth side effects — did the file actually get written, did the calendar event actually land, did the SQL row actually get inserted.
The headlines: first-attempt pass-rate spans 64-92% across the frontier; retry rates inflate cost-to-completion 1.4-2.8×; Claude Opus 4.7 leads chained workflows; GPT-5.5 leads file ops; DeepSeek V4 wins on cost-to-completion in 7 of 12 categories. Pick model by workflow, not by leaderboard.
- 01 — First-attempt pass rate spans 64-92% across the frontier; model choice is a real lever. Claude Opus 4.7 hits 92.4% on file ops; the worst frontier model on chained workflows lands at 63.8%. Variance across models is wider than you'd guess from public function-calling benchmarks, because chained workflows expose planning weaknesses that single-call benchmarks miss.
- 02 — Retry inflation is the hidden cost: 1.4-2.8× across models on real workflows. Models that pass 75% of tasks on the first attempt finish at 96-98% with up to 3 retries, but cost-to-completion runs 1.6-2.4× what the headline pass rate suggests. Procurement that ignores retry math underestimates cost by 40-60% on high-volume agent workloads.
- 03 — Claude Opus 4.7 leads chained workflows by 8-15 points; GPT-5.5 leads file ops; DeepSeek V4 leads cost-to-completion. Specialization is real. Opus excels when the workflow requires multi-step plan-execute-verify (chained tasks, calendar coordination, multi-table SQL). GPT-5.5 excels when the workflow is single-tool repeated (file ops, sandbox exec). V4 wins cost-to-completion despite a lower pass rate because retries are cheap.
- 04 — MCP server quality dominates everything else: bad schemas tank pass rate 25-40 points. The single biggest variable in our test was server design: schema clarity, error-message quality, parameter naming. Servers with vague schemas (e.g. `data: object`) drove pass rate down 25-40 points across all models. Server quality is more controllable than model choice; invest there first (see the schema sketch after this list).
- 05 — Cost-to-completion is the unit procurement should measure, not first-attempt pass rate. First-attempt pass rate is what gets benchmarked publicly; cost-to-completion (all tokens × rate, divided by passing runs only) is what production budgets actually pay. The two metrics agree on the ranking less than half the time across our 12 categories.
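To make the schema takeaway concrete, here is a minimal sketch of the difference, written as MCP-style tool definitions in Python dict form. The `create_event` tool and its fields are hypothetical; the pattern, not the names, is what the benchmark measured.

```python
# Vague schema of the kind that tanked pass rates 25-40 points:
# the model has to guess what belongs inside "data".
vague_tool = {
    "name": "create_event",
    "description": "Creates an event.",
    "inputSchema": {
        "type": "object",
        "properties": {"data": {"type": "object"}},  # catch-all field
    },
}

# Precise schema: typed fields, formats, examples, and an explicit
# boundary against neighboring tools.
precise_tool = {
    "name": "create_event",
    "description": "Create a NEW calendar event. To change an existing one, use update_event.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Short human-readable title"},
            "start": {"type": "string", "format": "date-time",
                      "description": "ISO 8601, e.g. 2026-04-01T09:00:00Z"},
            "attendees": {"type": "array", "items": {"type": "string"},
                          "description": "Attendee email addresses"},
        },
        "required": ["title", "start", "attendees"],
        "additionalProperties": False,
    },
}
```

Note that the description line doubles as tool-selection guidance, which also targets the wrong-tool-selected failure mode covered in section 06.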
01 — Methodology
The 12-family test harness.
Each task family is implemented as a sandboxed MCP server with verified ground-truth side effects. Models receive the same system prompt, tool spec, and task description; a pass requires the actual side effect to land (file written, row inserted, calendar event created). Up to 3 attempts are allowed per task. Each model runs each task 24 times.
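The grading loop, sketched below, is the whole methodology in miniature. The names (`model.run`, `verify_side_effect`, `reset_sandbox`) are illustrative stand-ins, not the real harness API.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    passed: bool
    tokens_in: int
    tokens_out: int

def run_with_retries(model, task, max_attempts=3):
    """Grade one task run. A pass requires the verified side effect
    (file written, row inserted), not merely a well-formed tool call."""
    attempts = []
    for _ in range(max_attempts):
        result = model.run(task.prompt, tools=task.tool_spec)  # hypothetical client call
        ok = task.verify_side_effect()  # inspects ground truth directly
        attempts.append(Attempt(ok, result.tokens_in, result.tokens_out))
        if ok:
            break
        task.reset_sandbox()  # restore state so the retry is graded cleanly
    return attempts
```

Keeping token counts per attempt is what makes the cost-to-completion numbers in section 05 possible.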
Foundational
search · file ops · data extraction
Single-tool tasks with structured input and verifiable output. Search returns N results matching criteria; file ops write specific content; data extraction transforms input to schema. Tests parameter handling and basic tool selection.
Communication
calendar · email · code review
Tasks with formatting, scheduling, and qualitative judgment. Calendar must reconcile attendee availability; email must compose with tone constraint; code review must produce structured findings. Tests soft-judgment + tool integration.
Data + Web
web fetch · SQL · browser
Tasks requiring external data acquisition. Web fetch must select correct URL and extract; SQL must compose query against schema; browser must navigate and extract from real pages. Tests data hygiene and grounding.
Agentic core
sandbox exec · RAG · chained workflows
Multi-step plan-execute-verify tasks. Sandbox exec must write+run code with verification; RAG must retrieve+answer with citation; chained workflows mix 3-6 tool calls in sequence. Tests planning capability.
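To show what the agentic-core families actually grade, here is a hypothetical chained-workflow task entry in the same side-effect style. Every identifier below is invented for illustration.

```python
# A 4-call chained task: pass/fail is decided by inspecting the sandbox
# after the run, never by reading the model's transcript.
chained_task = {
    "family": "chained workflows",
    "prompt": "Find the design-review attendees for next Tuesday, book room 4A, "
              "and email everyone the invite with agenda.pdf attached.",
    "tools": ["calendar.list_attendees", "rooms.book", "files.read", "email.send"],
    "verify": [  # ground-truth checks, run post-hoc against the sandbox
        ("rooms.booking_exists", {"room": "4A", "day": "tuesday"}),
        ("email.outbox_contains", {"attachment": "agenda.pdf"}),
    ],
}
```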
02 — First-Attempt Pass Rate
The headline first-attempt rates.
Aggregate first-attempt pass rate across all 12 families. This is the metric leaderboards quote. As a single number, it under-represents the spread across task families and ignores cost.
First-attempt aggregate pass rate · 5 frontier models
Source: Internal benchmark · 1,440 task runs across 12 MCP families · April 2026
Two reads. First: Claude Opus 4.7 with extended thinking (88.9%) leads the aggregate by a meaningful margin. Anthropic's MCP-first development since 2024 shows up here: the model handles tool-call planning and recovery better than its closest competitors. Second: open-weight DeepSeek V4 lags by 14-20 points. The gap is wider than on raw reasoning benchmarks; tool use is harder to learn than reasoning in isolation.
"Aggregate pass rate is misleading. The model that wins your benchmark may lose your workflow if the workflow is concentrated in one tool family."— Internal eval retro, May 2026
03 — Retry Inflation
The retry tax on real workflows.
Retries push pass rate to 96-98% across the board (with up to 3 attempts) but inflate cost. The retry tax is the hidden cost most procurement skips, and it varies sharply by model; a simple model of the inflation factor follows the per-model notes below.
Retry inflation factor · Lowest tax
Lowest retry tax across the frontier — passes most tasks first try, retries succeed quickly. Combined with $5/$25 pricing, cost-to-completion is competitive even though absolute rate is mid-tier.
Retry inflation factor · Mid-tier tax
Solid retry behavior; failures often recover on the second attempt. The $5/$30 rate combined with 1.6× retry inflation puts effective cost at $8/$48 per 1M tokens on a cost-to-completion basis.
Retry inflation factor · Higher tax
Higher retry tax driven by Deep Think latency; failed attempts are expensive due to long reasoning traces. Pair with explicit timeout/abort logic to bound the tax.
Retry inflation factor · High tax · low base
Highest retry inflation in our test. Lower first-attempt rate means more retries; CoT mode amplifies cost per retry. Despite this, V4's $0.40/$1.60 base rate keeps cost-to-completion competitive.
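The inflation factor can be modeled from two numbers: first-attempt pass rate p and per-retry recovery rate r. The sketch below is our simplification, assuming every attempt costs the same; in practice failed attempts with long reasoning traces cost more, which pushes real factors toward the top of the 1.4-2.8× range.

```python
def retry_inflation(p: float, r: float) -> float:
    """Expected cost per completed task, in units of one clean attempt,
    with up to 3 attempts. p = first-attempt pass rate, r = per-retry
    recovery rate. Assumes constant cost per attempt."""
    expected_attempts = 1 + (1 - p) + (1 - p) * (1 - r)
    pass_prob = p + (1 - p) * r + (1 - p) * (1 - r) * r
    return expected_attempts / pass_prob

# retry_inflation(0.75, 0.7) ≈ 1.36; retry_inflation(0.64, 0.5) ≈ 1.69
```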
04 — By Task Family
Pass-rate by task family — specialization shows.
The aggregate hides large per-family gaps. Below: the best model for six representative task families, with the runner-up, showing where specialization matters and where it doesn't.
Search: GPT-5.5 leads · 91.3% first-attempt
Claude Opus 4.7 second at 88.7%. Search benefits from broad pre-training; gap narrow. DeepSeek V4 viable here at 79% with CoT.
File ops: Claude Opus 4.7 leads · 92.4% first-attempt
Strongest single result we measured. GPT-5.5 second at 89.1%. Gemini 3 Pro DT 86.4%. Path handling and atomic writes are the differentiator.
Data extraction: GPT-5.5 leads · 89.8% first-attempt
Strong on structured-output extraction. Opus second at 87.1%. Schema-guided output benefits GPT-5.5's instruction-following pattern.
Calendar: Claude Opus 4.7 leads · 84.7% first-attempt
Calendar coordination requires multi-attendee reconciliation; Opus's planning advantage shows. GPT-5.5 second at 79.6%. Worst family for DeepSeek V4 (61%).
Email: Claude Opus 4.7 leads · 88.3% first-attempt
Tone + tool integration favors Opus. GPT-5.5 second at 86.4%. Strong for both; pick on cost.
Chained workflows: Claude Opus 4.7 leads · 81.6% · +14.2 vs runner-up
Largest spread in the test. Opus's planning capability dominates 3-6 tool sequences. GPT-5.5 second at 67.4%. Open-weight V4 lags badly at 51.7%.
05 — Cost-to-Completion
Cost-to-completion inverts the ranking.
When you measure total cost over passing runs (input + output + tool-call cost + retry inflation), the apparent ranking changes. DeepSeek V4 wins 7 of 12 categories despite lower first-attempt pass rate, because retries are cheap. This is the metric that should drive procurement on most tool-heavy workflows.
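In code, the metric is a few lines over raw run logs. The field names below (`tokens_in`, `tokens_out`, `passed`) are our assumptions for illustration, not a published schema; tool-call overhead can be folded into the token counts.

```python
def cost_to_completion(runs: list[dict], rate_in: float, rate_out: float) -> float:
    """Total spend across ALL runs, failures included, divided by the
    number of passing runs. Rates are $ per 1M tokens."""
    spend = sum(r["tokens_in"] * rate_in + r["tokens_out"] * rate_out
                for r in runs) / 1e6
    passes = sum(1 for r in runs if r["passed"])
    return spend / passes if passes else float("inf")
```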
Cost-to-completion · selected workflows × models
Source: Internal benchmark · cost-to-completion = total tokens × rate / pass count · April 2026
06 — Failure Modes
Failure-mode taxonomy.
Across 1,440 task runs we observed five recurring failure modes. Knowing which mode dominates your workload changes the mitigation.
Wrong tool selected · 34% of failures
Model picks tool A when tool B was correct (e.g. file_write instead of file_append, search instead of get). Mitigation: clearer tool descriptions, explicit decision examples in system prompt.
Parameter hallucination · 27% of failures
Model invents parameter values that look plausible but don't match the schema (wrong format, missing field, fabricated ID). Mitigation: strict JSON schema validation and named-parameter examples (see the sketch after this list).
Plan-step skipped · 18% of failures
On chained workflows, model skips a verification step or assumes earlier success without checking. Mitigation: extended thinking + explicit plan-then-execute prompt + intermediate verification calls.
Recovery failure · 14% of failures
Model receives error from tool but does not interpret it correctly — repeats same parameters, ignores hint, retries with no learning. Mitigation: explicit error-handling instructions, structured error schemas.
Premature termination · 7% of failures
Model returns final answer before all tool calls completed (especially in chained workflows). Mitigation: explicit completion criteria in system prompt, output schema requiring all task outputs.
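Two of the cheapest mitigations above, strict schema validation and structured hint-bearing errors, fit in a few lines on the server side. A sketch using the jsonschema package; the `create_event` schema is the same hypothetical one from the takeaways.

```python
from jsonschema import Draft202012Validator

EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "start": {"type": "string", "format": "date-time"},
        "attendees": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "start", "attendees"],
    "additionalProperties": False,  # rejects hallucinated parameters outright
}
validator = Draft202012Validator(EVENT_SCHEMA)

def check_call(args: dict) -> dict | None:
    """Return None if the call is valid, else a structured error the model
    can act on. A bare 400 invites the 'retries with no learning' mode."""
    errors = list(validator.iter_errors(args))
    if not errors:
        return None
    return {
        "error": "invalid_parameters",
        "fields": [".".join(map(str, e.path)) or "(root)" for e in errors],
        "hint": errors[0].message,  # e.g. "'start' is a required property"
        "retry": True,  # signal that a corrected call can succeed
    }
```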
07 — Decision Matrix
Model selection by workflow.
The matrix below maps common agentic workflows to the right model based on the empirical pass-rate and cost-to-completion data. Use it as a starting policy and measure your specific workload to refine; a routing-table sketch follows the matrix.
Multi-step agentic planning (chained): Opus 4.7 · ext. thinking
Claude Opus 4.7 with extended thinking. Chained workflow leadership is an 8-15 point gap; nothing else closes it. Pair with structured plan output and intermediate verification.
Single-tool high-volume (file ops, search, extraction): V4 cost / GPT-5.5 quality
DeepSeek V4 for cost-to-completion; GPT-5.5 standard for top pass rate. The choice depends on volume: V4 below roughly 10K calls/day, where budgets bind; GPT-5.5 above that, where its lower retry rate amortizes the higher token price.
Customer-facing with strict latency budget: GPT-5.5 · standard
GPT-5.5 standard reasoning. A sub-2-second TTFT requirement rules out extended thinking. Strong default function calling and the lowest retry tax in the latency-bound tier.
Calendar / multi-attendee coordination: Opus 4.7 · only viable
Claude Opus 4.7. The only model that handles multi-attendee scheduling reliably (84.7% first-attempt). Calendar is DeepSeek V4's worst family; don't route it here even at V4's price.
Code review / structured findings: GPT-5.5 / Sonnet 4.6
GPT-5.5 standard for cost; Claude Sonnet 4.6 for tone. Both pass at 80%+ first-attempt. Pick by output style preference and stack consistency rather than pass-rate.
RAG knowledge-base Q&A with tool grounding: Gemini 3 Pro DT cached
Gemini 3 Pro Deep Think. Best multimodal handling for diagram-heavy KBs and best cache discount (95%) for high-volume cached queries. Strong second: Opus 4.7.
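Compressed to a starting policy, the matrix is a routing table. The workflow-class keys and model identifiers below are placeholders for whatever your gateway exposes; re-derive the mapping from your own eval suite before trusting it.

```python
# Default routing distilled from the matrix above; all IDs illustrative.
ROUTING_POLICY = {
    "chained_planning": "claude-opus-4.7:extended-thinking",
    "single_tool_bulk": "deepseek-v4",        # gpt-5.5 above ~10K calls/day
    "latency_bound":    "gpt-5.5:standard",
    "calendar":         "claude-opus-4.7",
    "code_review":      "gpt-5.5:standard",   # claude-sonnet-4.6 if tone matters
    "rag_grounded":     "gemini-3-pro:deep-think",
}

def pick_model(workflow_class: str) -> str:
    # Fall back to the strongest aggregate performer for unknown classes.
    return ROUTING_POLICY.get(workflow_class, "claude-opus-4.7")
```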
08 — Conclusion
Tool use is the real agentic-AI metric.
Pick model by workflow. Invest in MCP servers. Measure cost-to-completion.
Tool use is where agentic AI lives or dies, and public benchmarks have under-represented the spread. Our 1,440-run suite shows that model choice matters more than function-calling leaderboards suggest, but server quality matters even more: schema clarity alone is worth 25-40 points of pass rate.
For procurement, the right metric is cost-to-completion — total spend over passing runs only — not first-attempt pass rate. On 7 of 12 task families, DeepSeek V4 wins this metric despite a lower headline pass rate, because retries are cheap. Pick by workflow class and measure on your specific suite.
We re-run this benchmark every quarter. Bookmark this page; we update the data as new model tiers ship and new MCP servers come online.