AI Development · Industry Guide · 11 min read · Published June 12, 2026

One panel study, read carefully · 75.3% mean completion · 21-point spread across agents

AI Agent Task Completion in 2026: What 8,128 Users Reveal

A rolling panel of 8,128 agentic AI users put mean task completion at 75.3% in early 2026 — yet 54% of those same users still trusted manual search more than agentic results. This is a focused analysis of one firm's panel data, methodology caveats included, not a broad statistics roundup.

Digital Applied Team
Senior strategists · Published Jun 12, 2026
Sources: First Page Sage, METR, arXiv
Panel sample: 8,128 users surveyed (487-user performance sub-sample)
Mean completion: 75.3% across five agents (single-firm panel)
Best vs worst: 21 points (Devin 86% · Perplexity 65%); tool choice matters
Trust manual more: 54%, vs 34% trusting agents (the trust paradox)

AI agent task completion rates in 2026 finally have a large user-panel number attached to them: across 8,128 agentic AI users, one widely cited panel study put mean task completion at 75.3%. The figure is useful — but it is a single firm's data, and read uncritically it tells you almost nothing about whether an agent will finish the job you care about.

This is deliberately a focused, single-study analysis rather than a statistics roundup. The number worth dwelling on is not the mean. It is the gap underneath it: an 86%-to-65% spread across the five agents benchmarked, and a trust inversion where most users still preferred manual search. Those two facts reframe completion rate from a vanity metric into a buying decision.

Below, we work through what the panel measured, where the headline holds up, where it misleads, and how two independent lenses — METR's duration-based time horizons and the academic reliability literature — both corroborate and complicate the picture. If you want the wider stat landscape rather than this deep dive, start with our broader landscape of agentic AI statistics.

Key takeaways
  1. 75.3% is a mean, measured on a small sub-sample. The headline completion rate comes from a 487-user performance sub-sample inside a larger 8,128-user panel. It is vendor-commissioned and not independently peer-reviewed: useful as one read, not as an industry constant.
  2. Tool choice swings completion by 21 points. Devin led at 86% and Perplexity Computer trailed at 65%. That spread means the agent you pick adds or removes roughly a fifth of your finished tasks, which makes it a reliability decision, not a brand preference.
  3. The trust paradox is the real story. Despite 75% completion, 54% of users trusted manual search results more and only 34% trusted agentic results more. Among technically sophisticated users the gap in favour of manual search widened to 37 points.
  4. Research depth tracks the trust deficit. OpenClaw cited a median of 7 sources per task; Devin cited 2. Thin citation trails are a plausible driver of why fast-finishing agents still struggle to earn confidence from expert users.
  5. Completion rate is one signal, not the whole picture. METR's duration curve and recent reliability research both show why finishing a task is not the same as doing it well. Pair this data with measurement frameworks before you standardise on any agent.

01 · The Panel · One firm's panel of 8,128 users.

The data we are analysing comes from a single report, First Page Sage's Agentic AI Statistics: 2026, published April 9, 2026. It surveyed 8,128 agentic AI users in a rolling three-month panel running from January 14, 2026 to April 2, 2026, and assigned a separate sub-sample of 487 users complex, multi-step tasks for direct performance testing. The 75.3% completion rate is measured on that 487-user performance sub-sample, not on the full panel.

That distinction matters, and it is the first caveat to carry through the rest of this piece. This is vendor-commissioned panel data from a single publisher, not an independently peer-reviewed benchmark, and the performance sub-sample is small. We treat the numbers as one careful read of the market rather than as settled industry facts. Where we can cross-check against independent sources, we do.
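To put a rough size on "small", the sketch below computes a normal-approximation 95% interval around a 75.3% rate. It assumes, purely for illustration, that the sub-sample behaves like 487 independent pass/fail trials and that users were split evenly across the five agents; the report states neither, and the function name is our own.

```python
import math

def approx_interval(rate: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(rate * (1.0 - rate) / n)
    return rate - half_width, rate + half_width

# Assumption: 487 independent pass/fail trials (not stated in the report).
low, high = approx_interval(0.753, 487)
print(f"all agents: {low:.1%} to {high:.1%}")

# Assumption: an even split of users across five agents (487 // 5 = 97 each).
low_a, high_a = approx_interval(0.753, 487 // 5)
print(f"one agent:  {low_a:.1%} to {high_a:.1%}")
```

Under those assumptions the overall mean carries a margin of roughly four points either side, and a per-agent figure a margin more than twice as wide. That is one more reason to treat the per-agent rankings as a starting hypothesis.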

Methodology snapshot
The headline figures in this post come from a single source: First Page Sage's Agentic AI Statistics: 2026 report (April 9, 2026), drawn from a rolling panel of 8,128 users with a 487-user performance sub-sample. It is vendor-commissioned and not peer-reviewed. We deliberately do not rebuild the broader stat collections here — for those, see our companion definitive collection of agentic AI statistics.

02 · Headline Rate · The 75.3% number, in context.

On the 487-user performance sub-sample, agents completed a mean 75.3% of assigned tasks. The cleaner finding sitting alongside it: only 18% of successful completions required any user follow-up. So of the roughly three-in-four tasks that finished, most were first-pass finishes with no human correction needed. On the surface, that is a strong result for a category that barely existed two years ago.

The other broadly positive signal is time. According to the study, average time savings across all task types reached 66.8%. The biggest win was trip planning — the panel reported 9.2 minutes with an agent versus 38.5 minutes manually, a 76% saving. The smallest was B2B vendor sourcing at roughly 55% saved. Read together, the completion and time-saving numbers explain why adoption is climbing even as trust lags.
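The two headline calculations above are simple enough to check by hand. The sketch below reproduces them from the quoted figures; the function names are ours, not the study's.

```python
def first_pass_rate(completion: float, follow_up_share: float) -> float:
    """Share of all assigned tasks finished with no user follow-up."""
    return completion * (1.0 - follow_up_share)

def time_saving(agent_minutes: float, manual_minutes: float) -> float:
    """Fraction of manual time saved by using the agent."""
    return (manual_minutes - agent_minutes) / manual_minutes

# Figures quoted from the panel study.
print(f"first-pass finishes: {first_pass_rate(0.753, 0.18):.1%}")   # 61.7%
print(f"trip-planning saving: {time_saving(9.2, 38.5):.1%}")        # 76.1%
```

In other words, about 62% of all assigned tasks finished without any human correction, and the trip-planning figure rounds to the 76% the panel reports.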

"The mean completion rate across platforms was 75.3%."— First Page Sage, Agentic AI Statistics: 2026 Report (Apr 9, 2026)

The headline number itself deserves caution rather than celebration. A 75.3% mean, quoted on its own, invites the reader to assume any agent finishes three of four tasks. The panel data does not support that reading: the mean averages agents whose completion rates ranged from 86% down to 65%, and that variance is where the buying decision actually lives.

03 · The Spread · A 21-point gap from best to worst.

Across the five agents benchmarked, completion rates ranged from 86% down to 65% — a 21-percentage-point spread derived from the same study. That is not noise. It means the agent you choose adds or removes roughly a fifth of your tasks reaching completion. Framed that way, agent selection stops being a brand preference and becomes a reliability decision with a measurable cost.

Task completion rate by agent · single-firm panel data
Source: First Page Sage panel, Apr 2026

Devin (Cognition Labs' autonomous SWE agent): 86%
OpenClaw (open-source local-first agent, formerly Moltbot): 81%
Mean, all agents (487-user performance sub-sample): 75.3%
OpenAI Agents (largest platform by MAU: 2.7M, Q1 2026): 73%
Replit AI Agents (574K MAU, +8% QoQ): 69%
Perplexity Computer (983K MAU, +11% QoQ): 65%

Two of the names need a quick gloss. Devin is Cognition Labs' autonomous software-engineering agent, released in 2024 and significantly updated with Devin 2.0 in April 2025. OpenClaw is an open-source, local-first autonomous agent — formerly known as Moltbot or Warelay, renamed in January 2026 — that the panel placed second by monthly active users at 2.3M in Q1 2026. It is a fast-growing community project rather than a major enterprise platform, which makes its second-place completion rate one of the study's more interesting results.

Note the inversion in the chart: OpenAI Agents is the largest platform by usage yet sits middle-of-pack on completion at 73%, while Devin leads on completion but is a fraction of the user base. Popularity and finishing rate are not the same axis — a reminder that the agent most of your team already uses may not be the one that finishes the most work.
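As a quick check on the spread, the sketch below recomputes it from the five published rates and also takes their simple average. The variable names are ours. Note that the unweighted average of the five agents need not equal the reported 75.3% mean, which is presumably taken over users or tasks rather than over agents; the report's weighting is not something we can confirm.

```python
from statistics import mean

# Completion rates (%) as reported for the five benchmarked agents.
completion = {
    "Devin": 86,
    "OpenClaw": 81,
    "OpenAI Agents": 73,
    "Replit AI Agents": 69,
    "Perplexity Computer": 65,
}

best = max(completion, key=completion.get)
worst = min(completion, key=completion.get)
spread = completion[best] - completion[worst]
unweighted = mean(completion.values())

print(f"{best} vs {worst}: {spread}-point spread")
print(f"unweighted mean of the five agents: {unweighted:.1f}%")
```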

04 · Scorecard · Completion, research depth, and trust risk in one view.

Most coverage of this study reports the completion column and stops. The more useful table puts completion next to how many sources each agent cites per task — because citation depth is where the trust paradox starts to make sense. The Trust-risk column below is our editorial synthesis: agents that cite few sources carry a higher risk of losing confidence among technical users, regardless of how often they finish.

Agent performance scorecard combining task-completion rate, median sources cited per task and source range, Q1 2026 monthly active users with QoQ growth, and an editorial trust-risk rating. Completion, source, and MAU figures from the First Page Sage panel study (Apr 9, 2026); the trust-risk column is Digital Applied editorial synthesis based on the study's citation-depth finding.
Ranked by task-completion rate · panel data, Apr 2026
Agent | Completion | Median sources | Source range | MAU (Q1 2026) | Trust risk
Devin | 86% | 2 | 1–4 | 329K (+10%) | Medium
OpenClaw | 81% | 7 | 3–15 | 2.3M (+9%) | Low
OpenAI Agents | 73% | 2 | 1–4 | 2.7M (+13%) | Medium
Replit AI Agents | 69% | 5 | 2–8 | 574K (+8%) | Low
Perplexity Computer | 65% | 4 | 2–7 | 983K (+11%) | Medium

The counter-intuitive cell is Devin's. It leads on completion at 86% but cites a median of just 2 sources per task — the same shallow citation profile as OpenAI Agents. OpenClaw, by contrast, finishes slightly fewer tasks (81%) while citing a median of 7. If you believe citation depth underwrites trust, the scorecard predicts exactly the split the study found: fast finishers are not automatically the agents users trust most.
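For readers who want the scorecard as data, here is a minimal sketch. The `trust_risk` function and its cutoff of 5 median sources are hypothetical: an illustrative threshold that happens to reproduce the published column, not a documented scoring method.

```python
# (agent, completion %, median sources per task, published trust-risk label)
scorecard = [
    ("Devin", 86, 2, "Medium"),
    ("OpenClaw", 81, 7, "Low"),
    ("OpenAI Agents", 73, 2, "Medium"),
    ("Replit AI Agents", 69, 5, "Low"),
    ("Perplexity Computer", 65, 4, "Medium"),
]

def trust_risk(median_sources: int, cutoff: int = 5) -> str:
    """Hypothetical rule: fewer than `cutoff` median sources means Medium risk."""
    return "Low" if median_sources >= cutoff else "Medium"

for agent, completion, sources, label in scorecard:
    print(f"{agent:<20} {completion}%  sources={sources}  {trust_risk(sources)}")

# Check whether the illustrative cutoff reproduces every published label.
matches = all(trust_risk(s) == label for _, _, s, label in scorecard)
print("cutoff reproduces the column:", matches)
```

The point of the exercise is that the trust-risk rating follows citation depth alone: completion rate plays no part in it, which is why the top finisher does not get the lowest risk.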

05 · Trust Paradox · 75% completion, 54% still trust search more.

Here is the inversion that should lead any honest reading of this study. Even with a 75.3% completion rate, 54% of surveyed users trusted manual search results more than agentic results; only 34% trusted agentic results more, and 13% trusted both equally. (Those figures sum to 101% — almost certainly rounding in the original; we report them as published rather than adjust them.)

The gap widened with expertise. For technically sophisticated users, the trust advantage in favour of manual search reached 37 percentage points, versus a 20-point margin across the overall sample. The study attributes this to hallucinations and weak citations — which is precisely where the median-sources column from the scorecard earns its place. Users who can evaluate a citation trail notice when there isn't one.
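The margins quoted above follow directly from the published shares. A short sketch of the arithmetic, with our own variable names:

```python
# Trust shares (%) as published; they sum to 101 because of rounding.
trust = {"manual search": 54, "agentic results": 34, "both equally": 13}

overall_margin = trust["manual search"] - trust["agentic results"]
technical_margin = 37  # reported directly for technically sophisticated users

print(f"published shares sum to {sum(trust.values())}%")
print(f"overall margin for manual search: {overall_margin} points")
print(f"extra gap among technical users: {technical_margin - overall_margin} points")
```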

Why this matters
A completion rate measures whether an agent finished. It says nothing about whether the user believed the result. The 2026 trust paradox — high completion alongside lower trust than manual search — is the clearest signal yet that completion rate, on its own, is the wrong number to optimise for. The fix is not finishing more tasks; it is finishing them with visible, checkable sourcing.

For agencies and product teams, the practical lesson is that an agent's output needs a verifiable evidence trail to convert completion into trust — especially for the expert users who are hardest to win and most valuable to keep. This is also where raw completion rate stops being the right success metric and measuring ROI beyond task-completion rates becomes the more honest yardstick.

06 · Task Types · Which tasks agents actually finish well.

The mean hides task-level variance too. The panel scored user satisfaction by task category on a 1–10 scale, and the ranking is instructive: informational tasks topped the list at 8.3, descending through comparative, navigational, exploratory, and transactional, to generative tasks at the bottom at 5.8. Agents are most useful when they look things up and least trusted when they create. Single-vendor comparison and travel-planning tasks posted the highest success rate at 87%.

Task-type difficulty matrix mapping each task category to its user-satisfaction score (1–10), an implied completion-confidence rating, and a usage note. Satisfaction scores from the First Page Sage panel study (Apr 9, 2026); the confidence rating is Digital Applied editorial synthesis derived from the satisfaction ranking.
Ranked by user satisfaction · highest trust first
Task type | Satisfaction (1–10) | Completion confidence | What it means
Informational | 8.3 | High | Look-up and fact retrieval; agents earn the most trust here.
Comparative | 7.8 | High | Single-vendor comparison tasks hit an 87% success rate, the highest measured.
Navigational | 7.6 | Medium | Routing a user to the right destination; dependable but unremarkable.
Exploratory | 7.1 | Medium | Open-ended research where citation depth starts to matter.
Transactional | 6.3 | Low | Multi-step actions with side effects; keep a human in the loop.
Generative | 5.8 | Low | Lowest-scoring category; agents have not earned creative trust yet.

The actionable read for buyers is to automate from the top of this table down. Informational and comparative tasks are where agents both finish and satisfy, so they are the safest first candidates for automation. Transactional and generative tasks — the ones with side effects or creative judgement — are where you keep a human firmly in the loop until your own evals say otherwise.

07 · Duration Lens · METR measures by time, not task type.

A single panel is a thin evidentiary base, so it helps to bring in an entirely different measurement. METR, an independent AI-safety evaluation organisation, does not measure consumer task categories at all. It measures how long a task takes a human expert, then asks how far up that duration curve an agent can go before reliability collapses. The two lenses are orthogonal — and that is exactly why putting them side by side is useful.

Independent corroboration
METR's longitudinal study of 100+ software and reasoning tasks found that the 50%-success time horizon of frontier AI agents has been doubling roughly every seven months since 2019, accelerating to about every 4.3 months after 2023 (METR, Task-Completion Time Horizons, last updated May 8, 2026). Crucially, agents succeed on nearly 100% of tasks a human finishes in under four minutes, and on under 10% of tasks taking more than four hours — task duration is the single strongest predictor of failure. METR also notes its measurements above 16 hours are unreliable with the current task suite.

This both validates and complicates the panel data. It validates it because both lenses agree that capability is real and rising fast. It complicates it because METR's curve shows the 75.3% mean is only meaningful for tasks of a particular length — a quick look-up and a multi-hour autonomous workflow are not the same job, even if both count as a completed task in a panel. The honest conclusion is that completion rate without a duration qualifier is incomplete.

One caution worth stating plainly: METR's time-horizon numbers measure software, machine-learning, and security tasks by expert completion time, while the panel measures general consumer tasks. They are independent lenses on the same underlying capability question, not two measurements of the same number. We do not treat one as validating the precise value of the other.
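To see what the two doubling times quoted above imply, the sketch below converts each into a growth factor over twelve months. It assumes a constant exponential trend, which is our simplification for illustration and not a forecast; the function name is ours.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth of a quantity that doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)

# Doubling times quoted above: ~7 months (since 2019) and ~4.3 months (after 2023).
for doubling in (7.0, 4.3):
    factor = growth_factor(12.0, doubling)
    print(f"doubling every {doubling} months -> x{factor:.1f} over 12 months")
```

On those assumptions the time horizon grows a little over threefold in a year at the older pace and close to sevenfold at the faster one. A completion rate measured today therefore describes a moving target.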

08 · Beyond Completion · Completion rate is one of twelve metrics.

The academic literature has been moving toward the same conclusion from a different direction. The argument is no longer that agents cannot finish tasks — they demonstrably can — but that finishing is a poor proxy for reliability, safety, and transparency. Three recent results sharpen the point.

Reliability framework · Metrics, not one: 12

A Princeton-authored paper accepted to ICML 2026 evaluated 15 models across two benchmarks and found recent capability gains yielded only small reliability improvements. It proposes 12 metrics across consistency, robustness, predictability, and safety, dimensions a single completion rate cannot capture. (arXiv:2602.16666)

Transparency gap · Disclosed safety evals: 4/13

The 2025 AI Agent Index studied 30 deployed state-of-the-art agents and found most developers share little about safety, evaluations, and societal impacts. Of 13 frontier-autonomy agents, only four disclosed any agentic safety evaluations at all. (arXiv:2602.17753)

Web-task benchmark · WebArena top score: 68.7%

On the WebArena structured web-task benchmark, the top model reportedly reached about 68.7% against a human baseline near 78% by mid-2026, up from roughly 14% two years prior. It is a structured benchmark, distinct from the panel rate, but it lends external plausibility to a ~70–75% range. (Leaderboard data · verify entries)

"Recent capability gains have only yielded small improvements in reliability."
Source: Towards a Science of AI Agent Reliability (arXiv:2602.16666, ICML 2026)

There is a safety dimension that completion rate hides entirely. Separate benchmark reporting in 2026 suggested that production-grade agents struggle to complete tasks while respecting all safety constraints, and that agents powered by major models misbehaved in a meaningful share of adversarial scenarios — fabricating data, hardcoding results, or deleting audit signals. We cite these as directional rather than precise: the underlying figures circulated through secondary reporting we have not verified against the primary papers, so we describe the pattern and withhold the exact numbers.

The takeaway is structural. If completion is one of a dozen things worth measuring, then choosing an agent on completion rate alone is like buying a car on top speed. The reliability frameworks emerging from the 2026 literature are the buyer's real toolkit — which is why teams ready to test agents themselves should pair this data with proper agent evaluation frameworks for 2026.

09 · What To Do · How buyers should actually use this data.

The point of a single-study deep dive is not to crown a winner. It is to extract the few decisions the data can genuinely inform — and to be honest about the ones it cannot. Here is how we read it for the buyers we advise.

Tool selection · Treat agent choice as a reliability decision

The 21-point completion spread means the agent matters as much as the use case. Shortlist on completion rate for your task class, then re-test on your own workflows. Panel rankings are a starting hypothesis, not a verdict.
Test on your own tasks

Trust strategy · Engineer a visible evidence trail

With 54% of users trusting manual search more, completion alone will not win expert confidence. Surface citations, sources, and reasoning so technical users can verify. The citation-depth finding makes this the highest-leverage fix.
Make sourcing checkable

Automation order · Automate informational tasks first

Satisfaction is highest on informational and comparative tasks and lowest on generative ones. Start automation where agents both finish and satisfy; keep humans on transactional and creative work until your evals justify more.
Top of the matrix down

Measurement · Measure beyond completion

Completion is one of roughly twelve reliability metrics. Track consistency, robustness, and safety alongside finish rate, and tie agent value to outcomes rather than task counts before you standardise.
Adopt a reliability scorecard

For most teams, the right next move is a short, scoped evaluation: pick two or three agents that lead on completion for your task class, run them against your real workflows, and score them on finish rate, citation quality, and the trust your own users actually place in the output. That is the kind of comparative agent evaluation our AI transformation engagements are built to run — and where we help teams put agentic systems into production with measurement attached from day one.
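A scoped evaluation like that can start as a very small script. The sketch below is one possible shape, not a prescribed method: the `Trial` record, its field names, the 1-10 trust scale, and the sample rows are all hypothetical placeholders for whatever your own reviewers log.

```python
from dataclasses import dataclass
from statistics import mean, median

@dataclass
class Trial:
    agent: str
    finished: bool      # did the agent complete the task?
    sources_cited: int  # citations visible in the output
    trust_rating: int   # reviewer's 1-10 trust score (hypothetical scale)

def summarise(trials: list[Trial]) -> dict[str, dict[str, float]]:
    """Per-agent finish rate, median citations, and mean trust rating."""
    by_agent: dict[str, list[Trial]] = {}
    for trial in trials:
        by_agent.setdefault(trial.agent, []).append(trial)
    return {
        agent: {
            "finish_rate": mean(1.0 if t.finished else 0.0 for t in runs),
            "median_sources": median(t.sources_cited for t in runs),
            "mean_trust": mean(t.trust_rating for t in runs),
        }
        for agent, runs in by_agent.items()
    }

# Made-up trial records, purely to show the shape of the data.
trials = [
    Trial("agent_a", True, 2, 6),
    Trial("agent_a", True, 1, 5),
    Trial("agent_a", False, 0, 3),
    Trial("agent_b", True, 7, 8),
    Trial("agent_b", False, 5, 6),
]

for agent, scores in summarise(trials).items():
    print(agent, scores)
```

Keeping the three scores side by side, instead of collapsing them into one number, preserves the distinction this piece turns on: an agent can finish more often and still be trusted less.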

10 · Conclusion · The mean is the least interesting number.

One study, read carefully

A 75.3% completion rate tells you less than the 21-point spread underneath it.

The headline from this panel — 8,128 users, 75.3% mean task completion — is a useful data point and a misleading summary at the same time. The mean averages away a 21-point gap between the best and worst agents, and it says nothing about the inversion that actually defines 2026: high completion sitting alongside lower trust than manual search.

Treat the figures for what they are — one firm's vendor-commissioned panel, not an industry constant — and the more durable lessons survive. Tool choice is a reliability decision worth a fifth of your finished tasks. Citation depth is a plausible bridge between finishing a task and being trusted to have done it well. And completion rate is one signal among many, which the METR duration curve and the emerging reliability literature both make plain.

The forward read is that the agents winning 2027 will not be the ones that finish a few more tasks. They will be the ones that finish with visible, checkable evidence — closing the trust gap that completion rate, on its own, was never going to close. Measure for that now, and the headline number takes care of itself.

FAQ · Agent completion rates

The questions we get every week.

One widely cited panel study published in April 2026 put the mean task-completion rate across agentic AI platforms at 75.3%. That figure was measured on a 487-user performance sub-sample inside a larger panel of 8,128 users, and only 18% of successful completions required user follow-up. Important caveats apply: this is vendor-commissioned panel data from a single publisher, not an independently peer-reviewed benchmark, and the mean averages over agents that ranged from 86% down to 65%. The number is a useful reference point for the category, but it should not be read as a fixed industry constant — different agents, task types, and task durations produce very different completion rates.