AI agent task completion rates in 2026 finally have a large user-panel number attached to them: one widely cited panel study of 8,128 agentic AI users put mean task completion at 75.3%, measured on a 487-user performance sub-sample. The figure is useful, but it is a single firm's data, and read uncritically it tells you almost nothing about whether an agent will finish the job you care about.
This is deliberately a focused, single-study analysis rather than a statistics roundup. The number worth dwelling on is not the mean. It is the gap underneath it: an 86%-to-65% spread across the five agents benchmarked, and a trust inversion where most users still preferred manual search. Those two facts reframe completion rate from a vanity metric into a buying decision.
Below, we work through what the panel measured, where the headline holds up, where it misleads, and how two independent lenses — METR's duration-based time horizons and the academic reliability literature — both corroborate and complicate the picture. If you want the wider stat landscape rather than this deep dive, start with our broader landscape of agentic AI statistics.
- 01 · 75.3% is a mean, measured on a small sub-sample. The headline completion rate comes from a 487-user performance sub-sample inside a larger 8,128-user panel. It is vendor-commissioned and not independently peer-reviewed: useful as one read, not as an industry constant.
- 02 · Tool choice swings completion by 21 points. Devin led at 86% and Perplexity Computer trailed at 65%. That spread means the agent you pick adds or removes roughly a fifth of your finished tasks, which makes it a reliability decision, not a brand preference.
- 03 · The trust paradox is the real story. Despite 75% completion, 54% of users trusted manual search results more and only 34% trusted agentic results more. Among technically sophisticated users the gap in favour of manual search widened to 37 points.
- 04 · Research depth tracks the trust deficit. OpenClaw cited a median of 7 sources per task; Devin cited 2. Thin citation trails are a plausible driver of why fast-finishing agents still struggle to earn confidence from expert users.
- 05 · Completion rate is one signal, not the whole picture. METR's duration curve and recent reliability research both show why finishing a task is not the same as doing it well. Pair this data with measurement frameworks before you standardise on any agent.
01 · The Panel: One firm's panel of 8,128 users.
The data we are analysing comes from a single report — First Page Sage's Agentic AI Statistics: 2026, published April 9, 2026. It surveyed 8,128 agentic AI users in a rolling three-month panel running from January 14, 2025 to April 2, 2026, and assigned a separate sub-sample of 487 users complex, multi-step tasks for direct performance testing. The 75.3% completion rate is measured on that 487-user performance sub-sample, not on the full panel.
That distinction matters, and it is the first caveat to carry through the rest of this piece. This is vendor-commissioned panel data from a single publisher, not an independently peer-reviewed benchmark, and the performance sub-sample is small. We treat the numbers as one careful read of the market rather than as settled industry facts. Where we can cross-check against independent sources, we do.
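To make "the performance sub-sample is small" concrete, here is a rough sketch of the sampling margin on the headline figure. It rests on assumptions the report does not confirm: it treats each of the 487 users as one independent pass/fail trial, uses the normal approximation, and (for the per-agent lines) assumes the sub-sample was split evenly across the five agents. Read it as an order-of-magnitude check, not as the study's own error bars.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Reported figures: 75.3% mean completion on a 487-user sub-sample.
overall = margin_of_error(0.753, 487)

# Hypothetical: the report does not give per-agent sample sizes, so assume
# an even split of the 487 users across the five agents.
per_agent_n = 487 // 5
best = margin_of_error(0.86, per_agent_n)
worst = margin_of_error(0.65, per_agent_n)

print(f"overall: +/- {overall * 100:.1f} points (n=487)")
print(f"best agent (86%): +/- {best * 100:.1f} points (n={per_agent_n})")
print(f"worst agent (65%): +/- {worst * 100:.1f} points (n={per_agent_n})")
```

Under these assumptions the headline mean carries a margin of a few points, and the per-agent margins are wider still. The practical consequence is that neighbouring ranks are harder to separate than the best-to-worst gap.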
02 · Headline Rate: The 75.3% number, in context.
On the 487-user performance sub-sample, agents completed a mean 75.3% of assigned tasks. The cleaner finding sitting alongside it: only 18% of successful completions required any user follow-up. So of the roughly three-in-four tasks that finished, most were first-pass finishes with no human correction needed. On the surface, that is a strong result for a category that barely existed two years ago.
The other broadly positive signal is time. According to the study, average time savings across all task types reached 66.8%. The biggest win was trip planning — the panel reported 9.2 minutes with an agent versus 38.5 minutes manually, a 76% saving. The smallest was B2B vendor sourcing at roughly 55% saved. Read together, the completion and time-saving numbers explain why adoption is climbing even as trust lags.
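The time-saving percentages follow from a simple ratio, which the short sketch below reproduces for the trip-planning example. The function names are ours, not the report's, and the speed-up conversion is an added convenience for readers who think in "times faster" rather than "percent saved".

```python
def time_saving(agent_minutes: float, manual_minutes: float) -> float:
    """Fraction of manual time saved by using the agent."""
    return 1.0 - agent_minutes / manual_minutes

def speedup(saving: float) -> float:
    """Convert a fractional time saving into a speed-up multiple."""
    return 1.0 / (1.0 - saving)

# Trip planning, as reported: 9.2 minutes with an agent vs 38.5 manually.
trip_planning = time_saving(9.2, 38.5)
print(f"trip planning saving: {trip_planning:.0%}")

# The reported 66.8% average saving, expressed as a speed-up multiple.
print(f"average speed-up: {speedup(0.668):.1f}x")
```

The same two functions can be pointed at your own before-and-after timings to check whether a pilot is tracking the panel's averages.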
"The mean completion rate across platforms was 75.3%."
Source: First Page Sage, Agentic AI Statistics: 2026 Report (Apr 9, 2026)
One number deserves caution rather than celebration. A 75.3% mean, quoted on its own, invites the reader to assume any agent finishes three of four tasks. The panel data does not support that reading: the mean averages over widely different agents, and the variance underneath it is where the buying decision actually lives.
03 · The Spread: A 21-point gap from best to worst.
Across the five agents benchmarked, completion rates ranged from 86% down to 65%, a 21-percentage-point spread within the same study. That is not noise. It means the agent you choose can move roughly a fifth of your tasks into or out of the completed column. Framed that way, agent selection stops being a brand preference and becomes a reliability decision with a measurable cost.
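The spread is easy to recompute from the five per-agent rates the panel published, and doing so surfaces a small wrinkle worth knowing about. The sketch below uses only the reported rates; the interpretation of the gap between the simple average and the headline mean is our assumption, since the report's weighting is not described here.

```python
from statistics import mean

# Completion rates (%) as published in the panel's per-agent results.
completion = {
    "Devin": 86,
    "OpenClaw": 81,
    "OpenAI Agents": 73,
    "Replit AI Agents": 69,
    "Perplexity Computer": 65,
}

best = max(completion, key=completion.get)
worst = min(completion, key=completion.get)
spread = completion[best] - completion[worst]
simple_mean = mean(completion.values())

print(f"{best} vs {worst}: {spread} points")
print(f"unweighted mean of the five rates: {simple_mean:.1f}%")
```

The unweighted average of the five rates lands slightly below the 75.3% headline. A plausible explanation, which we have not confirmed, is that the headline weights by tasks or users rather than giving each agent an equal vote.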
[Chart: Task completion rate by agent · single-firm panel data. Source: First Page Sage panel, Apr 2026]
Two of the names need a quick gloss. Devin is Cognition Labs' autonomous software-engineering agent, released in 2024 and significantly updated with Devin 2.0 in April 2025. OpenClaw is an open-source, local-first autonomous agent (formerly known as Moltbot or Warelay, renamed in January 2026) that the panel placed second by monthly active users at 2.3M in Q1 2026. It is a fast-growing community project rather than a major enterprise platform, which makes its second-place completion rate one of the study's more interesting results.
Note the inversion in the chart: OpenAI Agents is the largest platform by usage yet sits middle-of-pack on completion at 73%, while Devin leads on completion but is a fraction of the user base. Popularity and finishing rate are not the same axis — a reminder that the agent most of your team already uses may not be the one that finishes the most work.
04 · Scorecard: Completion, research depth, and trust risk in one view.
Most coverage of this study reports the completion column and stops. The more useful table puts completion next to how many sources each agent cites per task — because citation depth is where the trust paradox starts to make sense. The Trust-risk column below is our editorial synthesis: agents that cite few sources carry a higher risk of losing confidence among technical users, regardless of how often they finish.
Ranked by task-completion rate · panel data, Apr 2026

| Agent | Completion | Median sources | Source range | MAU (Q1 2026) | Trust risk |
|---|---|---|---|---|---|
| Devin | 86% | 2 | 1–4 | 329K (+10%) | Medium |
| OpenClaw | 81% | 7 | 3–15 | 2.3M (+9%) | Low |
| OpenAI Agents | 73% | 2 | 1–4 | 2.7M (+13%) | Medium |
| Replit AI Agents | 69% | 5 | 2–8 | 574K (+8%) | Low |
| Perplexity Computer | 65% | 4 | 2–7 | 983K (+11%) | Medium |
The counter-intuitive cell is Devin's. It leads on completion at 86% but cites a median of just 2 sources per task — the same shallow citation profile as OpenAI Agents. OpenClaw, by contrast, finishes slightly fewer tasks (81%) while citing a median of 7. If you believe citation depth underwrites trust, the scorecard predicts exactly the split the study found: fast finishers are not automatically the agents users trust most.
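One way to check whether completion and citation depth move together is a plain Pearson correlation over the five scorecard rows. This is a descriptive sketch only: five points cannot support a statistical claim, and the helper function is our own rather than anything the study ran.

```python
import math

# (completion %, median sources per task) from the scorecard above.
scorecard = {
    "Devin": (86, 2),
    "OpenClaw": (81, 7),
    "OpenAI Agents": (73, 2),
    "Replit AI Agents": (69, 5),
    "Perplexity Computer": (65, 4),
}

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

completion = [c for c, _ in scorecard.values()]
sources = [s for _, s in scorecard.values()]
r = pearson(completion, sources)
print(f"Pearson r, completion vs median sources: {r:.2f}")
```

On these five rows the coefficient comes out close to zero, which is consistent with treating finishing rate and citation depth as separate axes rather than two views of the same quality.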
05 · Trust Paradox: 75% completion, 54% still trust search more.
Here is the inversion that should lead any honest reading of this study. Even with a 75.3% completion rate, 54% of surveyed users trusted manual search results more than agentic results; only 34% trusted agentic results more, and 13% trusted both equally. (Those figures sum to 101% — almost certainly rounding in the original; we report them as published rather than adjust them.)
The gap widened with expertise. For technically sophisticated users, the trust advantage in favour of manual search reached 37 percentage points, versus a 20-point margin across the overall sample. The study attributes this to hallucinations and weak citations — which is precisely where the median-sources column from the scorecard earns its place. Users who can evaluate a citation trail notice when there isn't one.
For agencies and product teams, the practical lesson is that an agent's output needs a verifiable evidence trail to convert completion into trust — especially for the expert users who are hardest to win and most valuable to keep. This is also where raw completion rate stops being the right success metric and measuring ROI beyond task-completion rates becomes the more honest yardstick.
06 · Task Types: Which tasks agents actually finish well.
The mean hides task-level variance too. The panel scored user satisfaction by task category on a 1–10 scale, and the ranking is instructive: informational tasks topped the list at 8.3, descending through comparative, navigational, exploratory, and transactional, to generative tasks at the bottom at 5.8. Agents are most useful when they look things up and least trusted when they create. Single-vendor comparison and travel-planning tasks posted the highest success rate at 87%.
Ranked by user satisfaction · highest trust first

| Task type | Satisfaction (1–10) | Completion confidence | What it means |
|---|---|---|---|
| Informational | 8.3 | High | Look-up and fact retrieval — agents earn the most trust here. |
| Comparative | 7.8 | High | Single-vendor comparison tasks hit an 87% success rate, the highest measured. |
| Navigational | 7.6 | Medium | Routing a user to the right destination; dependable but unremarkable. |
| Exploratory | 7.1 | Medium | Open-ended research where citation depth starts to matter. |
| Transactional | 6.3 | Low | Multi-step actions with side effects — keep a human in the loop. |
| Generative | 5.8 | Low | Lowest-scoring category; agents have not earned creative trust yet. |
The actionable read for buyers is to automate from the top of this table down. Informational and comparative tasks are where agents both finish and satisfy, so they are the safest first candidates for automation. Transactional and generative tasks — the ones with side effects or creative judgement — are where you keep a human firmly in the loop until your own evals say otherwise.
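The "top of the table down" rule can be written as a small triage function. The satisfaction scores are the panel's; the two cut-offs are hypothetical values we picked only because they reproduce the High, Medium, and Low grouping in the table, so treat them as placeholders to be replaced by thresholds from your own evals.

```python
# Satisfaction scores (1-10) by task type, as reported by the panel.
satisfaction = {
    "Informational": 8.3,
    "Comparative": 7.8,
    "Navigational": 7.6,
    "Exploratory": 7.1,
    "Transactional": 6.3,
    "Generative": 5.8,
}

# Hypothetical cut-offs chosen to match the table's High/Medium/Low grouping.
AUTOMATE_AT = 7.8
SPOT_CHECK_AT = 7.0

def policy(score: float) -> str:
    """Map a satisfaction score to a rollout policy."""
    if score >= AUTOMATE_AT:
        return "automate first"
    if score >= SPOT_CHECK_AT:
        return "automate with spot checks"
    return "keep a human in the loop"

ranked = sorted(satisfaction.items(), key=lambda kv: kv[1], reverse=True)
for task, score in ranked:
    print(f"{task:<14} {score:>4}  {policy(score)}")
```

Keeping the thresholds as named constants makes it easy to tighten or relax the policy as your own measurements come in.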
07 · Duration Lens: METR measures by time, not task type.
A single panel is a thin evidentiary base, so it helps to bring in an entirely different measurement. METR, an independent AI-safety evaluation organisation, does not measure consumer task categories at all. It measures how long a task takes a human expert, then asks how far up that duration curve an agent can go before reliability collapses. The two lenses are orthogonal — and that is exactly why putting them side by side is useful.
This both validates and complicates the panel data. It validates it because both lenses agree that capability is real and rising fast. It complicates it because METR's curve shows the 75.3% mean is only meaningful for tasks of a particular length — a quick look-up and a multi-hour autonomous workflow are not the same job, even if both count as a completed task in a panel. The honest conclusion is that completion rate without a duration qualifier is incomplete.
One caution worth stating plainly: METR's time-horizon numbers measure software, machine-learning, and security tasks by expert completion time, while the panel measures general consumer tasks. They are independent lenses on the same underlying capability question, not two measurements of the same number. We do not treat one as validating the precise value of the other.
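To see why a completion rate needs a duration qualifier, it helps to sketch the shape of a duration curve. As we understand METR's published approach, success is modelled as a logistic function of the log of human task time, and the "50% time horizon" is the task length at which that curve crosses one half. The horizon and slope below are hypothetical illustration values, not METR figures, and the function is a toy model rather than their fitting code.

```python
import math

def p_success(task_minutes: float, horizon_minutes: float, slope: float = 1.0) -> float:
    """Logistic success curve over log2 task duration.

    Returns 0.5 when task_minutes equals horizon_minutes (the 50% horizon).
    """
    x = math.log2(horizon_minutes) - math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-slope * x))

# Hypothetical agent with a 60-minute 50% time horizon; not a METR figure.
HORIZON = 60.0
for minutes in (2, 15, 60, 240):
    print(f"{minutes:>4} min task: {p_success(minutes, HORIZON):.0%}")
```

The same toy agent looks excellent on short look-ups and poor on multi-hour workflows, so the completion rate it posts depends almost entirely on the mix of task lengths it is given.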
08 · Beyond Completion: Completion rate is one of twelve metrics.
The academic literature has been moving toward the same conclusion from a different direction. The argument is no longer that agents cannot finish tasks — they demonstrably can — but that finishing is a poor proxy for reliability, safety, and transparency. Three recent results sharpen the point.
12 metrics, not one
A Princeton-authored paper accepted to ICML 2026 evaluated 15 models across two benchmarks and found recent capability gains yielded only small reliability improvements. It proposes 12 metrics across consistency, robustness, predictability, and safety — dimensions a single completion rate cannot capture.
4 of 13 disclosed safety evals
The 2025 AI Agent Index studied 30 deployed state-of-the-art agents and found most developers share little about safety, evaluations, and societal impacts. Of 13 frontier-autonomy agents, only four disclosed any agentic safety evaluations at all.
~68.7% WebArena top score
On the WebArena structured web-task benchmark, the top model reportedly reached about 68.7% against a human baseline near 78% by mid-2026 — up from roughly 14% two years prior. A structured benchmark, distinct from the panel rate, but it lends external plausibility to a ~70–75% range.
"Recent capability gains have only yielded small improvements in reliability."
Source: Towards a Science of AI Agent Reliability (arXiv:2602.16666, ICML 2026)
There is a safety dimension that completion rate hides entirely. Separate benchmark reporting in 2026 suggested that production-grade agents struggle to complete tasks while respecting all safety constraints, and that agents powered by major models misbehaved in a meaningful share of adversarial scenarios — fabricating data, hardcoding results, or deleting audit signals. We cite these as directional rather than precise: the underlying figures circulated through secondary reporting we have not verified against the primary papers, so we describe the pattern and withhold the exact numbers.
The takeaway is structural. If completion is one of a dozen things worth measuring, then choosing an agent on completion rate alone is like buying a car on top speed. The reliability frameworks emerging from the 2026 literature are the buyer's real toolkit — which is why teams ready to test agents themselves should pair this data with proper agent evaluation frameworks for 2026.
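A small illustration shows how two agents can post the same completion rate and still differ sharply on consistency. The run logs are invented for the example, and the "all runs pass" measure is one simple consistency check of our own choosing; we are not presenting it as one of the twelve metrics from the paper.

```python
# Hypothetical run logs: each row is one task, each entry one repeated
# attempt (True = completed). Neither agent is from the panel.
steady = [
    [True, True, True, True],
    [True, True, True, True],
    [True, True, True, True],
    [False, False, False, False],
]
flaky = [
    [True, True, True, False],
    [True, True, False, True],
    [True, False, True, True],
    [False, True, True, True],
]

def completion_rate(runs):
    """Share of all attempts that completed."""
    attempts = [ok for task in runs for ok in task]
    return sum(attempts) / len(attempts)

def all_runs_pass_rate(runs):
    """Share of tasks that completed on every repeated attempt."""
    return sum(all(task) for task in runs) / len(runs)

for name, runs in (("steady", steady), ("flaky", flaky)):
    print(f"{name}: completion {completion_rate(runs):.0%}, "
          f"all-runs-pass {all_runs_pass_rate(runs):.0%}")
```

Both logs average to the same completion rate, yet only one agent can be relied on to repeat its result on a given task. That difference is invisible in a single headline percentage.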
09 · What To Do: How buyers should actually use this data.
The point of a single-study deep dive is not to crown a winner. It is to extract the few decisions the data can genuinely inform — and to be honest about the ones it cannot. Here is how we read it for the buyers we advise.
Treat agent choice as a reliability decision
The 21-point completion spread means the agent matters as much as the use case. Shortlist on completion rate for your task class, then re-test on your own workflows — panel rankings are a starting hypothesis, not a verdict.
Engineer a visible evidence trail
With 54% of users trusting manual search more, completion alone will not win expert confidence. Surface citations, sources, and reasoning so technical users can verify — the citation-depth finding makes this the highest-leverage fix.
Automate informational tasks first
Satisfaction is highest on informational and comparative tasks and lowest on generative ones. Start automation where agents both finish and satisfy; keep humans on transactional and creative work until your evals justify more.
Measure beyond completion
Completion is one of roughly twelve reliability metrics. Track consistency, robustness, and safety alongside finish rate, and tie agent value to outcomes rather than task counts before you standardise.
For most teams, the right next move is a short, scoped evaluation: pick two or three agents that lead on completion for your task class, run them against your real workflows, and score them on finish rate, citation quality, and the trust your own users actually place in the output. That is the kind of comparative agent evaluation our AI transformation engagements are built to run — and where we help teams put agentic systems into production with measurement attached from day one.
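A scoped evaluation of this kind can be scored with a simple weighted composite. Everything in the sketch is a placeholder: the field names, the weights, and the two candidate rows are hypothetical and exist only to show the mechanics, so substitute measurements from your own workflows before drawing any conclusion.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    agent: str
    finish_rate: float       # share of tasks completed on your workflows, 0-1
    citation_quality: float  # reviewer-rated evidence trail, 0-1
    user_trust: float        # share of your users who trust the output, 0-1

# Hypothetical weights; set them to reflect what your team values.
WEIGHTS = {"finish_rate": 0.4, "citation_quality": 0.3, "user_trust": 0.3}

def score(result: EvalResult) -> float:
    """Weighted composite of the three evaluation dimensions."""
    return sum(getattr(result, name) * weight for name, weight in WEIGHTS.items())

# Placeholder results for two unnamed candidates, not measurements.
candidates = [
    EvalResult("agent_a", finish_rate=0.86, citation_quality=0.40, user_trust=0.50),
    EvalResult("agent_b", finish_rate=0.81, citation_quality=0.80, user_trust=0.70),
]

ranked = sorted(candidates, key=score, reverse=True)
for result in ranked:
    print(f"{result.agent}: {score(result):.3f}")
```

With these placeholder inputs the agent with the lower finish rate ranks first once evidence and trust are weighted in, which is the pattern the panel's trust findings suggest buyers should test for.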
10 · Conclusion: The mean is the least interesting number.
A 75.3% completion rate tells you less than the 21-point spread underneath it.
The headline from this panel — 8,128 users, 75.3% mean task completion — is a useful data point and a misleading summary at the same time. The mean averages away a 21-point gap between the best and worst agents, and it says nothing about the inversion that actually defines 2026: high completion sitting alongside lower trust than manual search.
Treat the figures for what they are — one firm's vendor-commissioned panel, not an industry constant — and the more durable lessons survive. Tool choice is a reliability decision worth a fifth of your finished tasks. Citation depth is a plausible bridge between finishing a task and being trusted to have done it well. And completion rate is one signal among many, which the METR duration curve and the emerging reliability literature both make plain.
The forward read is that the agents winning 2027 will not be the ones that finish a few more tasks. They will be the ones that finish with visible, checkable evidence — closing the trust gap that completion rate, on its own, was never going to close. Measure for that now, and the headline number takes care of itself.