AI Development · Framework · 12 min read · Published May 12, 2026

Ten KPIs distinguishing pipelines from prototypes — completion rate, stage abandonment, retry depth, cost per success.

Agentic Workflow Completion Metrics: Pipeline Health 2026

Agentic workflows that run in production need a different measurement panel than the demos that birthed them. Completion rate, stage abandonment, retry depth, cost per success — ten KPIs that surface friction before it becomes an incident, with formulas and target bands borrowed from the engagements we actually shipped in Q1 2026.

Digital Applied Team · Agentic engineering · Published May 12, 2026
Read time: 12 min · Sources: Production engagements

KPIs tracked: 10 (end-to-end panel)
Completion target: > 90% (production threshold)
Retry depth ceiling: 3 (per logical step)
Cadence: Weekly (review rhythm)

Agentic workflow completion metrics are the production measurement panel that distinguishes a pipeline you can operate from a prototype you keep nursing. Ten KPIs — completion rate, stage abandonment, retry depth, parallel-branch coverage, cost per success, time-to-completion percentiles, human-in-the-loop frequency, eval-fail rate, drift signals, incident frequency — each with a formula, a target band, and the failure pattern it surfaces before it becomes a Slack-channel incident.

The pattern that pushed us to publish the panel is consistent across engagements. A team ships an agent workflow, watches success-rate climb during the demo phase, then loses confidence three months in when nobody can explain why it costs twice what it did last month or why one customer's briefings keep arriving incomplete. The capability hasn't regressed; the measurement layer was never built. Without a pipeline-health panel, every operational question becomes an archaeology dig.

This guide walks through each KPI in the panel: what it measures, the formula we use, the target band we recommend, and the production incident class it surfaces. Skip to the FAQ for the implementation questions teams ask before they wire the first metric up.

Key takeaways
  1. Completion rate is the trust signal. The headline metric every other KPI defends. A workflow whose completion rate hovers in the 70s is one that customers stop trusting; the 90%+ band is what separates a system you can sell from a beta you keep apologizing for.
  2. Stage abandonment surfaces friction. Per-stage abandonment rates pinpoint exactly where workflows die — a single stage with 12% abandonment can drag overall completion below 90% even when every other stage runs cleanly. Debug at the stage, not the workflow.
  3. Retry depth predicts cost overruns. Retries are silent cost amplifiers. Workflows that average two retries per stage burn three times the token budget of clean runs; capping retry depth at three with bounded backoff is the single highest-leverage cost control.
  4. Cost per success is the production metric. Cost per attempt flatters the prototype phase; cost per success tells the operational truth. Divide total spend by completed workflows — the gap between attempt-cost and success-cost is the resilience tax you're paying.
  5. Human-in-the-loop frequency tracks autonomy maturity. Falling HITL rates mean the agent is winning more decisions on its own; rising rates mean the workflow is regressing toward manual operation. The slope matters more than the absolute number — track it weekly and ask why on every change.

01 · Why Pipeline Health: Demos measure capability; pipelines measure operability.

The measurement conversation around agentic workflows is still dominated by capability metrics — eval scores, leaderboard placements, demo-pass rates. Those metrics matter when you're picking a model; they almost stop mattering once the workflow is in production. What replaces them is operational measurement: does the workflow finish what it starts, at a cost the business can absorb, with human intervention rare enough that the agent is genuinely doing work?

The panel below is what we wire up on engagement number one with every client running agent workflows in production. The metrics aren't exotic — most are borrowed from distributed-systems and SRE practice — but the framing is specific to agentic AI: non-deterministic execution, LLM cost amplification, partial failure as a routine occurrence rather than an exception. A traditional SLO panel doesn't capture cost per success or retry depth; those are the agent-shaped additions.

Capability metrics
Eval scores · benchmarks · demo pass rate

Useful for model selection and during the prototype phase. They stop mapping cleanly to production once the workflow has real users — a 95% eval score doesn't prevent the workflow from costing 4x what it should or stalling on every fifth run.

Pre-production only
Pipeline-health metrics
Completion · cost-per-success · retry depth · HITL frequency

The operational panel. Every metric has a formula, a target band, and a tied incident class. Reviewed weekly with engineering plus the product owner; drift on any metric triggers a focused debugging session, not a vague morale meeting.

Production discipline
Pure SRE metrics
Latency · uptime · error rate

Necessary but insufficient. An agent workflow can be 100% uptime and 95% error-free and still be losing money on every successful run because the cost-per-success is double the revenue. SRE metrics need agent-shaped companions.

Necessary, not sufficient
Vendor dashboards
Per-provider token spend · per-tool API counters

Tells you what was consumed; tells you nothing about whether the consumption produced a completed workflow. Useful as a raw data feed into the pipeline-health panel; not a substitute for the per-workflow attribution the panel provides.

Raw feed, not panel

The ten KPIs sit across six surfaces: completion (the headline), abandonment (per-stage drilldown), retry (cost amplifier), cost (the business reality), human-in-the-loop (the autonomy slope), and drift plus incidents (the leading indicators). The rest of this guide walks through each, starting with the headline metric every other KPI exists to defend.

02 · Completion Rate: The headline metric every other KPI defends.

Completion rate is the share of workflow invocations that finish in the "success" terminal state, measured over a defined window. The formula is straightforward: completed workflows divided by total invocations, expressed as a percentage. The discipline is in the denominator — every invocation counts, including the ones killed by timeouts, the ones aborted by HITL queue inaction, and the ones that ran clean but produced an unusable output that the eval layer flagged.

We split completion rate into three operating modes because the target band, debugging path, and remediation playbook all differ by mode. The same workflow has different completion-rate expectations depending on whether you're measuring the happy path, the partial-recovery path, or the strict full-success path that your business actually depends on.

Mode 1
Happy-path completion
Success ÷ invocations · target 95%+

Workflow ran end-to-end with no errors, no retries, no HITL interventions. This is the demo metric — the band the prototype phase optimizes for. Production rarely sits above 60-70% on this mode once real input variety lands.

Prototype reference
Mode 2
Partial-recovery completion
(Clean + recovered) ÷ invocations · target 90%+

Workflow finished in the success state, with retries, fallbacks, or compensating actions allowed. This is what most production agent workflows actually deliver. Below 90% means resilience scaffolding is missing or under-tuned.

Production target
Mode 3
Strict business completion
(Success ∧ eval-pass ∧ no-HITL-fail) ÷ invocations · target 80%+

Workflow completed, eval layer approved the output, no HITL checkpoint failed or timed out. The metric customers experience. The gap between Mode 2 and Mode 3 is your eval-layer accuracy and HITL queue health.

Customer-facing truth
Three completion modes, three remediation paths
When happy-path completion is high but business completion is low, the gap lives in the eval layer or HITL queue — not the agent. When partial-recovery completion is below 90%, the resilience scaffolding is missing or undertuned. When all three modes are low, the workflow itself needs redesign before any metric tuning will move the panel.

The cadence matters too. Daily completion rate is too noisy on low-volume workflows; weekly is the right review rhythm for most production agent panels. Trend matters more than snapshot — a workflow tracking from 92% to 88% over four weeks is a different conversation than a workflow that's been steady at 88% for six months. Plot the four-week trailing average alongside the weekly point.
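The three modes share one denominator, which makes them cheap to compute side by side. A minimal sketch of the calculation, assuming an illustrative `Run` record shape (the field names here are ours, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class Run:
    """One workflow invocation, as recorded by the instrumentation layer."""
    succeeded: bool        # reached the success terminal state (timeouts and aborts count as False)
    retries: int           # total retry attempts across all stages
    hitl_interventions: int
    eval_passed: bool      # eval layer approved the final output
    hitl_failed: bool      # a HITL checkpoint failed or timed out

def completion_modes(runs: list[Run]) -> dict[str, float]:
    """Compute all three completion modes over one measurement window.

    The denominator is every invocation -- killed, aborted, and
    eval-rejected runs included; nothing is quietly dropped.
    """
    n = len(runs)
    happy = sum(r.succeeded and r.retries == 0 and r.hitl_interventions == 0
                for r in runs)
    partial = sum(r.succeeded for r in runs)  # retries/fallbacks allowed
    strict = sum(r.succeeded and r.eval_passed and not r.hitl_failed
                 for r in runs)
    return {
        "happy_path": happy / n,          # target 95%+ (prototype reference)
        "partial_recovery": partial / n,  # target 90%+ (production)
        "strict_business": strict / n,    # target 80%+ (customer-facing)
    }
```

The gap between `partial_recovery` and `strict_business` in the returned dict is exactly the eval-layer/HITL-queue gap described above.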

03 · Stage Abandonment: The per-stage drilldown when completion rate slips.

Stage abandonment is the share of workflow invocations that enter a specific stage but never leave it in the success state. The formula is per-stage: stage-failures plus stage-timeouts divided by stage-entries. When overall completion rate slips, stage abandonment is the metric that tells you which stage to debug — the single most useful drilldown in the panel.

The pattern we see most often: a workflow whose overall completion rate is 86% has one stage running at 14% abandonment and four other stages clean. Fix the one stage and overall completion returns to 94%. Without per-stage abandonment, the same workflow looks like a generic "the agent isn't working" problem that resists targeted investment.
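Given the structured entry/exit events described later in this section, the per-stage calculation is a short fold. A sketch, assuming an illustrative event shape (`stage`, `type`, `state` keys are our naming, not a standard):

```python
from collections import Counter

def stage_abandonment(events: list[dict]) -> dict[str, float]:
    """Per-stage abandonment from structured entry/exit events.

    Events look like {"stage": "rank", "type": "entry"} or
    {"stage": "rank", "type": "exit", "state": "success"}.
    Abandonment = (entries - successful exits) / entries, which folds
    failures, timeouts, and HITL-aborts into one number per stage.
    """
    entries, successes = Counter(), Counter()
    for e in events:
        if e["type"] == "entry":
            entries[e["stage"]] += 1
        elif e["type"] == "exit" and e.get("state") == "success":
            successes[e["stage"]] += 1
    return {s: (entries[s] - successes[s]) / entries[s] for s in entries}
```

Sorting the result descending by value gives the debugging priority list directly.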

Tool-call stages
External API · search · database read

Abandonment here is usually about timeouts, rate limits, or auth issues. Target band: under 3%. Remediation pattern: per-tool timeout tuned to p99 latency, exponential backoff with jitter, circuit breaker on persistent failure. Cheap to fix; the audit pays back fast.

Tighten tool resilience
LLM-decision stages
Routing · ranking · classification

Abandonment here usually means structured-output parsing failures or low-confidence decisions hitting a fallback. Target band: under 5%. Remediation pattern: tighten the schema, add a constrained-output decoder, route low-confidence cases to HITL instead of failing.

Tighten output schema
Compensation stages
Saga undo · refund · retraction

Abandonment in compensation stages is a high-severity event — the workflow can't roll back its own side effects. Target band: under 1%. Remediation pattern: dedicated retry policy, manual escalation queue, end-to-end compensation drill once per quarter.

Escalate every failure
HITL-gated stages
Approval gates · confidence checkpoints

Abandonment here is usually queue starvation — no human responds within the timeout, fail-safe default fires. Target band: under 8% (HITL inherently introduces variance). Remediation pattern: queue SLA review, default-action policy audit, escalation routing fix.

Queue SLA review

Tracking abandonment requires per-stage instrumentation — a structured event at stage entry and another at stage exit, with the exit event carrying the terminal state (success, retry, fail, timeout, HITL-abort). Most teams have the entry event but not the structured exit event; adding it is half a sprint of work and unlocks the entire stage-abandonment drilldown for the rest of the panel's lifetime.
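The paired entry/exit events can be emitted from a single wrapper so individual stages never forget the exit event. A minimal sketch using a context manager that prints JSON events to stdout; the event field names are illustrative, and a real implementation would add the `retry` and `HITL-abort` terminal states and write to your log pipeline instead:

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def stage(workflow_id: str, name: str):
    """Emit a structured entry event, then an exit event carrying the
    terminal state, even when the stage body raises."""
    print(json.dumps({"wf": workflow_id, "stage": name, "type": "entry",
                      "ts": time.time()}))
    state = "success"
    try:
        yield
    except TimeoutError:
        state = "timeout"
        raise
    except Exception:
        state = "fail"
        raise
    finally:
        print(json.dumps({"wf": workflow_id, "stage": name, "type": "exit",
                          "state": state, "ts": time.time()}))
```

Usage is `with stage("wf-123", "enrich"): ...` around each stage body; the `finally` block is what guarantees the structured exit event most teams are missing.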

04 · Retry Depth: The silent cost amplifier and the resilience-tax measurement.

Retry depth is the average number of retry attempts per logical step, measured across all stages and aggregated weekly. The formula is total retry attempts divided by total stage entries; a workflow with retry depth 0.4 averages roughly one retry every two and a half stage entries. The metric matters because retries are the most under-tracked cost driver in agent workflows — every retry on an LLM-heavy stage doubles or triples that stage's token spend, and the per-attempt cost dashboards most teams use don't surface it.

We recommend a hard ceiling of three retries per logical step, with exponential backoff and jitter, and a per-workflow retry budget that prevents a single degraded dependency from consuming the workflow's entire latency and cost budget on retries. Retry depth above 1.0 sustained over multiple weeks is a signal that the workflow is operating in a degraded regime even when completion rate looks fine — the cost is the canary.

"Retries are the most under-tracked cost driver in agent workflows. Every retry on an LLM-heavy stage doubles or triples that stage's token spend, and the per-attempt cost dashboards most teams use don't surface it."— Engagement teardown, Q1 2026

The structural fix is to make retries cheap before they're frequent. Idempotency keys at the tool layer mean retries on mutating calls don't risk duplicate side effects. Per-tool retry policies — aggressive on search, conservative on payments — keep the budget proportional to the operation's cost. Circuit breakers on persistent failure stop the retry storm before it consumes the workflow's budget. None of these are exotic patterns; the audit point is making sure they all exist before retry depth becomes a board-level cost conversation.

For a deeper teardown of the retry, rollback, and idempotency patterns underneath the metric, our companion resilience audit (70-point checklist) covers the engineering primitives in dedicated depth.

05 · Cost Per Success: The metric that attempt-cost dashboards hide.

Cost per success is total workflow spend divided by completed workflows over the measurement window. The denominator is what makes it the production metric — every dollar burned on a failed run gets attributed to the successes, which is the economic reality the business actually faces. Cost per attempt flatters the prototype phase; cost per success tells the operational truth.

The gap between attempt-cost and success-cost is the resilience tax. A workflow with 90% completion and $0.30 per attempt has roughly $0.33 per success; a workflow with 70% completion and the same per-attempt cost is paying $0.43 per success — 30% more for the same business outcome. That gap is the single best argument for investing in the resilience and measurement layers.

Approximate resilience tax · cost-per-success vs cost-per-attempt

Approximate ratios — actual values depend on workflow shape, dependency profile, and retry policy. Cost-per-success / cost-per-attempt approaches 1.0 as completion rate approaches 100%.
Prototype baseline (70% completion)Naïve retries · no idempotency · cost-per-success ~1.4x attempt cost
1.4×
Defensive scaffolding (85% completion)Timeouts + retry caps · cost-per-success ~1.18x attempt cost
1.18×
Production target (90% completion)Idempotency + saga + bounded retries · cost-per-success ~1.11x attempt cost
1.11×
Hardened product (95%+ completion)Full resilience layer · cost-per-success ~1.05x attempt cost
1.05×

Cost attribution lives at three grains and the panel tracks all three: per-workflow (total spend ÷ completed workflows), per-tenant (the same calculation scoped to a single customer or org), and per-stage (so the most expensive stage in the workflow is always visible). Per-tenant attribution is what unlocks pricing decisions; per-stage attribution is what unlocks targeted optimization. Aggregate cost-per-success without the breakdowns is a vanity metric.
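All three grains can be rolled up from the same raw spend records in one pass. A sketch assuming an illustrative run-record shape (`tenant`, `succeeded`, `stage_costs` keys are our naming); note that the per-workflow and per-tenant grains divide by completed workflows only, per the cost-per-success definition above:

```python
from collections import defaultdict

def attribute_spend(runs: list[dict]) -> dict:
    """Roll raw spend records up to the three grains the panel tracks.

    Each run record looks like:
      {"tenant": "acme", "succeeded": True,
       "stage_costs": {"fetch": 0.02, "rank": 0.11}}
    """
    total = sum(sum(r["stage_costs"].values()) for r in runs)
    completed = sum(r["succeeded"] for r in runs)

    tenant_spend = defaultdict(float)
    tenant_done = defaultdict(int)
    per_stage = defaultdict(float)
    for r in runs:
        spend = sum(r["stage_costs"].values())
        tenant_spend[r["tenant"]] += spend
        tenant_done[r["tenant"]] += r["succeeded"]
        for stage, cost in r["stage_costs"].items():
            per_stage[stage] += cost

    return {
        "per_workflow": total / completed if completed else float("inf"),
        "per_tenant": {t: tenant_spend[t] / tenant_done[t]
                       for t in tenant_spend if tenant_done[t]},
        "per_stage": dict(per_stage),  # most expensive stage stays visible
    }
```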

For the full attribution framework — token spend, tool API cost, infrastructure cost, the per-task and per-user grains — see our companion agent cost metrics framework and the AI transformation engagement that typically wires both panels up together.

06 · Human-in-the-Loop Frequency: The autonomy slope and the regression signal.

Human-in-the-loop frequency is the share of workflows that hit at least one HITL checkpoint requiring a manual decision, measured weekly. The formula divides workflows with one or more human interventions by total workflow invocations. The slope — week-over-week direction — matters more than the absolute number, because the right HITL rate depends entirely on the workflow's blast radius and the maturity stage it's operating at.

We track two flavors: planned-HITL (workflow hit a checkpoint that was designed in — irreversible action gate, high-cost branch, sensitive customer comms) and unplanned-HITL (workflow escalated to a human because confidence dropped below threshold, rate limit tripped, or compensation failed). The two numbers tell different stories and need separate target bands.

Planned-HITL
5-15%
Target band, irreversible-action workflows

Designed-in checkpoints at high-blast-radius steps. Stable rate is healthy; rising rate suggests the workflow is taking on more sensitive actions; falling rate suggests removed safety gates rather than improved autonomy.

Designed coverage
Unplanned-HITL
<3%
Target band, mature workflows

Confidence escalations and compensation failures. Above 5% means the agent is consistently uncertain or the resilience layer is leaking. Trend the metric weekly; drift above target is the leading signal for an upcoming incident.

Maturity threshold
Autonomy slope
−2pp/qtr
Healthy direction over time

A maturing workflow sheds unplanned-HITL roughly two percentage points per quarter as model upgrades land, eval coverage tightens, and confidence thresholds get re-tuned. Positive slope (rising HITL) is a regression alert.

Maturity trend

The interpretive trap is treating HITL frequency as a standalone optimization target. Dropping HITL to zero by removing checkpoints is not the same as winning autonomy; it's the same as ignoring blast radius. The right discipline is to keep planned-HITL stable at the level the workflow's risk profile demands, and to drive unplanned-HITL down through better resilience and confidence calibration. The slope on unplanned-HITL is the production-honest autonomy signal.

07 · Drift + Incident Signals: The leading indicators that close the panel.

The last four KPIs are the leading indicators — the metrics that move before completion rate slips and before cost per success spikes. They are the panel's early-warning layer, the signals that let you debug the next incident before it happens rather than after.

Time-to-completion percentiles (p50, p95, p99) catch latency regressions that completion rate hides. Eval-fail rate catches quality regressions that the workflow itself never errors on — the workflow finishes, the eval layer disagrees with the output, and the metric is the only signal. Parallel-branch coverage catches degraded throughput on fan-out workflows. Incident frequency is the trailing summary metric — the number that should trend toward zero as the rest of the panel tightens.
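The percentile computation needs no dependency beyond the standard library. A minimal sketch using nearest-rank selection (one of several valid percentile definitions; pick one and use it consistently so week-over-week drift is comparable):

```python
import math

def percentiles(durations: list[float]) -> dict[str, float]:
    """p50/p95/p99 by nearest rank: the smallest sample such that at
    least p% of samples are <= it. Run per-stage and end-to-end on
    each weekly window."""
    s = sorted(durations)

    def nearest_rank(p: float) -> float:
        k = max(1, math.ceil(p / 100 * len(s)))
        return s[k - 1]

    return {"p50": nearest_rank(50), "p95": nearest_rank(95), "p99": nearest_rank(99)}
```

The diagnostic described above falls out of the returned dict: p99 drifting up while p50 holds points at a degraded dependency; both drifting points at a model or prompt change.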

KPI 7
Time-to-completion percentiles
p50 / p95 / p99 stage and workflow duration

Latency regressions show up here first. Track per-stage and end-to-end. p99 drifting up while p50 stays flat usually means a degraded dependency; both moving means a model or prompt change.

Latency regression
KPI 8
Eval-fail rate
Eval-layer rejections ÷ completed workflows

Quality regressions the workflow itself never errors on. Run an automated eval pass on every completed run (or a sampled subset); track rejection rate weekly. Drift above the baseline is a model-quality or prompt-drift signal.

Quality regression
KPI 9
Parallel-branch coverage
Branches completed ÷ branches dispatched · per workflow

Fan-out workflows live and die here. A workflow that dispatches 20 parallel branches and completes 17 is a degraded run that completion rate alone can't see. Target band: above 95% on parallel-heavy workflows.

Fan-out integrity
KPI 10
Incident frequency
Sev-1/Sev-2 incidents ÷ time · trailing summary

The trailing metric. Tightening the other nine KPIs is what makes incident frequency fall. Track per quarter on lower-volume workflows; per month on high-volume. Going six months without a Sev-1 is the maturity goal.

Trailing summary

The panel is reviewed weekly with engineering and the product owner. Each KPI has a defined owner — engineering owns completion, retry, cost; product owns HITL frequency and eval-fail rate; both share drift and incident signals. Drift on any metric beyond a defined threshold triggers a focused debugging session against the relevant stage trace, not a vague morale meeting.

For a deeper look at the anti-patterns that drive these metrics off-band — the orchestration mistakes that produce high retry depth, low completion rate, and unstable cost-per-success — see our companion orchestration anti-patterns guide and the broader resilience teardown linked above.

Conclusion

Pipeline-health metrics turn agentic workflows from prototypes into products.

The pattern across engagements is consistent. Teams ship agent workflows and measure them with capability metrics — eval scores, demo-pass rates — that flatter the prototype phase and stop mapping cleanly to production once real users land. The workflow keeps running; the team loses confidence; nobody can answer the operational questions the business actually asks. What replaces the capability panel is the operational one: does the workflow finish what it starts, at a cost the business can absorb, with human intervention rare enough that the agent is genuinely doing work?

None of the ten KPIs is exotic. Most are borrowed from distributed-systems and SRE practice. The agent-shaped additions — cost per success, retry depth, eval-fail rate, unplanned-HITL — are what bridge SRE discipline into the non-deterministic, cost-amplified, partial-failure-by-default world of agentic workflows. The reason the panel keeps getting skipped is the same reason the resilience layer keeps getting skipped: it adds engineering time before the first demo and pays back only at production scale. Which is exactly when the team that skipped it is paying the bill in incidents and cost overruns.

Practical next step: pick one production agent workflow this week and wire up three of the ten KPIs against it — completion rate, retry depth, cost per success. The first three are the highest-leverage subset and the cheapest to instrument; most teams can ship them in a sprint. The remaining seven follow once the panel proves its worth, which it usually does the first time a drift signal catches an incident before it reaches a customer.

Operate pipelines, not prototypes

Pipeline health is the difference between an agent prototype and an agent product.

Our team designs production pipeline-health panels — completion, abandonment, retry, cost per success, human-in-the-loop, drift — with weekly review.

What we deliver

Pipeline-health engagements

  • 10-KPI pipeline-health panel
  • Completion-rate instrumentation
  • Stage-abandonment debugging playbook
  • Retry-depth ceiling enforcement
  • Cost-per-success attribution
FAQ · Pipeline health

The questions teams ask before the pipeline degrades.

How do the three completion modes differ, and what goes in the denominator?

Three modes, one denominator. Happy-path completion is the share of invocations that ran end-to-end with no errors, retries, or HITL interventions — the prototype reference metric, typically 60-70% in production once input variety lands. Partial-recovery completion counts any invocation that finished in the success state with retries, fallbacks, or compensation allowed — the production target, 90%+ for a mature workflow. Strict business completion requires the success state plus eval-layer approval plus no HITL checkpoint failure — the metric customers experience, target 80%+.

The denominator is always total invocations, including timeouts, HITL queue inaction, and eval-rejected runs; the discipline is making sure nothing gets quietly dropped from the count.

Cadence: weekly review with a four-week trailing average plotted alongside the weekly point. Trend matters more than snapshot — a workflow drifting from 92% to 88% is a different conversation than one steady at 88% for six months.