AI Development · Methodology · 3 min read · Published Apr 27, 2026

5 outcome classes · cost-adjusted scoring · 200-trace reference dataset

Agent Success Rate (ASR)

Production agents need a shared definition of "success." Pass-rate alone treats a partially correct answer as a failure; cost-blind metrics treat a $4 task that succeeded as equivalent to a $0.04 task that succeeded. ASR closes both gaps: a single rate, partial-credit-weighted, cost-adjusted, with confidence intervals.

Digital Applied Team
Senior strategists · Published Apr 27, 2026
Sources: τ-bench · BFCL · MLE-bench · SWE-bench Verified · LangSmith
Outcome classes: 5 · complete · partial+ · partial− · halluc. · abandoned
Partial weight: 0.4 for partial-correct outcomes
Reference dataset: 200 traces across 6 task families · open
Confidence: bootstrap · 1,000 resamples by default

Agent evaluation is in a strange place in 2026. The benchmark suites — τ-bench, Berkeley Function Calling, SWE-bench Verified, MLE-bench — give us model-level numbers. Production teams running agents in real workflows are left with three options: roll their own metric (different per team, not comparable), report a simplistic pass-rate (treats a partially correct answer as a failure), or use a cost-blind metric (treats a $4 task that succeeded as equivalent to a $0.04 task that succeeded).

Agent Success Rate (ASR) is the methodology we ship to clients to close those gaps. It is not a benchmark suite — it is a measurement spec that runs on top of any benchmark suite or production trace store. The spec defines the outcome classes, the partial-credit weighting, the cost-penalty function, the confidence-interval reporting, and the reference dataset of 200 production-agent traces we use as a calibration set.

Key takeaways
  1. ASR is one rate that captures completion, partial credit, and cost in a single number. Pass-rate alone treats a partial answer as a failure. Cost-blind metrics treat expensive successes as equivalent to cheap ones. ASR folds both into a single rate, with the formula tunable per workflow but the structure shared across teams.
  2. The five outcome classes give reviewers something concrete to label. Completed (full credit), partial-correct (0.4 credit), partial-incorrect (0 credit, no penalty), hallucinated (0 credit, flagged), abandoned (0 credit, flagged). The labels carry diagnostic signal that pass/fail throws away.
  3. The cost penalty engages above the budget ceiling, not on every task. The per-task cost ceiling is set per workflow (e.g., $0.10 for a triage task, $2.00 for a research task). Costs below the ceiling do not penalise. Costs above the ceiling reduce the score linearly, with a hard floor at 0.
  4. Confidence intervals make ASR comparable across runs and teams. Default reporting: ASR ± 95% CI via 1,000 bootstrap resamples. A 0.6-point ASR delta with overlapping CIs is not a regression; a 4-point delta with non-overlapping CIs is. Without confidence intervals, ASR comparisons are noise.
  5. The 200-trace reference dataset is the calibration tool, not the benchmark. Six task families × 30-35 traces each. Each trace has the prompt, the agent's response, the canonical labels, the cost, and the latency. Use it to calibrate reviewers; do not use it as a public leaderboard — drift is too easy when the dataset becomes the target.

01 · Problem: Why we need ASR.

By Q1 2026, most production agentic teams report some flavour of success rate. The metrics are not comparable. Some teams use strict pass-rate (binary, no partial credit). Some teams use task completion (allows partial). Some teams report cost per successful task without any partial credit. Some teams report a qualitative reviewer score on a 1-5 scale. The result: when a VP-level conversation needs to compare two agent versions or two providers, no apples-to-apples answer exists.

ASR is the methodology that closes the gap. It does not invent a new benchmark; it specifies how to compute a comparable rate from any benchmark or production trace store, with partial credit and cost-adjustment baked in.

"We had three teams reporting agent success rate three different ways and the QBR turned into a definitions argument. ASR ended that — same formula, same outcome classes, same confidence intervals."— Head of AI infrastructure, public SaaS, March 2026

02 · Definition: Formal definition.

The full ASR formula:

ASR = Σᵢ max(0, credit_i − cost_penalty_i) / n_total

  credit_i = 1.0 (complete) · 0.4 (partial+) · 0 (otherwise)

  cost_penalty_i = max(0, (cost_i − budget_ceiling) / budget_ceiling)
                   capped at 1.0 per task

  outcome ∈ {complete, partial+, partial−, hallucinated, abandoned}

ASR is a rate in the range [0, 1], reported as a percentage. Higher is better. A perfect agent on a basket of in-budget tasks scores 100%. A partially-effective agent on a basket of slightly-over-budget tasks might score 38%. The formula degrades gracefully and does not produce negative values.
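As a sanity check, the formula above can be sketched in a few lines of Python. This is an illustrative implementation, not the shipped tooling: the `Trace` shape and the single basket-wide `ceiling` argument are assumptions (in practice the ceiling is set per workflow).

```python
from dataclasses import dataclass

# Credit per outcome class, per the ASR spec.
CREDIT = {"complete": 1.0, "partial+": 0.4,
          "partial-": 0.0, "hallucinated": 0.0, "abandoned": 0.0}

@dataclass
class Trace:
    outcome: str   # one of the five outcome classes
    cost: float    # actual spend for this task, in dollars

def cost_penalty(cost: float, ceiling: float) -> float:
    """Linear penalty above the budget ceiling, capped at 1.0 per task."""
    return min(1.0, max(0.0, (cost - ceiling) / ceiling))

def asr(traces: list[Trace], ceiling: float) -> float:
    """Per-task contribution floors at 0, so ASR stays in [0, 1]."""
    total = sum(max(0.0, CREDIT[t.outcome] - cost_penalty(t.cost, ceiling))
                for t in traces)
    return total / len(traces)
```

A basket of one complete, one partial-correct, and three zero-credit tasks, all in budget, scores (1.0 + 0.4) / 5 = 0.28; a complete task at 1.5× ceiling contributes 0.5; a complete task at 2× ceiling or worse contributes 0.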

03 · Outcome classes: Five outcome classes.

Reviewers label each agent run into exactly one of five classes. The labels are structured to make reviewer agreement high (binary decision tree at each level) and to preserve diagnostic signal for downstream analysis.

Class 1
Completed (full credit)
weight 1.0

The agent completed the task to the standard a competent human reviewer would accept. All sub-goals satisfied; output meets the spec. Most production-grade systems target 65%+ in this class.

Full credit
Class 2
Partial-correct (partial credit)
weight 0.4

The agent got the right shape of answer but missed sub-goals (e.g., correct research summary but missed two of five required sources). Partial credit captures the real value without rewarding low effort.

0.4 weight
Class 3
Partial-incorrect (no credit, no penalty)
weight 0.0

The agent attempted the task but the output is meaningfully wrong on sub-goals that matter (e.g., research summary that confidently includes a wrong fact). No credit; flagged for prompt or guard-rail review.

0 credit
Class 4
Hallucinated (no credit, flagged)
weight 0.0 + flag

The agent confidently produced output that is fabricated — invented citations, invented function results, invented user state. Distinct from partial-incorrect because hallucinations indicate model or retrieval failure, not logic failure.

Failure mode flag
Class 5
Abandoned (no credit, flagged)
weight 0.0 + flag

The agent did not produce output (timed out, errored, hit a tool-use deadlock, exceeded retries). Tracked separately because abandonment usually indicates infrastructure issues, not agent capability issues.

Infra signal
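The "binary decision tree at each level" can be sketched as a labelling helper. The boolean attributes here (`produced_output`, `fabricated`, and so on) are illustrative assumptions about what a reviewer or judge records, not part of the spec:

```python
from dataclasses import dataclass

@dataclass
class Run:
    produced_output: bool           # False → timed out, errored, deadlocked
    fabricated: bool = False        # invented citations, tool results, user state
    all_subgoals_met: bool = False  # output meets the spec in full
    materially_wrong: bool = False  # wrong on sub-goals that matter

def classify(run: Run) -> str:
    """One binary decision per level, mirroring the five classes above."""
    if not run.produced_output:
        return "abandoned"       # infra signal, tracked separately
    if run.fabricated:
        return "hallucinated"    # model/retrieval failure mode flag
    if run.all_subgoals_met:
        return "complete"
    return "partial-" if run.materially_wrong else "partial+"
```

Ordering matters: abandonment and hallucination are checked before the completeness split, so each run lands in exactly one class.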

04 · Cost penalty: Cost penalty design.

The cost-penalty function is the part of ASR that practitioners push back on the most. The pushback is fair — penalising an agent for being expensive can produce metrics that prefer cheap-and-bad over expensive-and-correct. The design below threads that needle: costs only penalise above a per-task budget ceiling, and the penalty is linear-up-to-1.0, never producing negative scores.

Design choice 1
Per-task budget ceiling, not per-token rate

The ceiling is set per workflow (triage tasks $0.10, research tasks $2.00, code-review tasks $0.50). A token-rate ceiling would punish thoughtful agents that use more tokens to be correct; the per-task ceiling tracks the economic decision instead.

Workflow-level
Design choice 2
Linear penalty above ceiling

Once a task exceeds budget, the penalty grows linearly with the overage as a fraction of the ceiling: 1.5× ceiling = 0.5 penalty, 2× = 1.0 penalty (max). Linear is simpler than log/quadratic and easier to communicate to non-technical stakeholders.

Linear above ceiling
Design choice 3
Hard floor at 0; no negative ASR

Tasks that succeeded but went 5× over budget contribute 0 to ASR rather than going negative. Negative scores would let cost overruns dominate the headline metric in unintuitive ways. Cost overruns are tracked separately as a 'cost variance' sub-metric.

Floor at 0
Design choice 4
Cost variance as separate sub-metric

ASR penalises cost only above ceiling; cost variance reports the absolute cost distribution (P50, P90, P99) and the % of tasks above ceiling separately. Together the two paint the picture: ASR for headline, cost variance for budgeting.

Sub-metric pair
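The cost-variance side panel from design choice 4 can be sketched as follows. The nearest-rank percentile convention is an assumption chosen for simplicity; any percentile method works as long as it is consistent week to week.

```python
import math

def percentile(sorted_costs: list[float], p: float) -> float:
    """Nearest-rank percentile on a pre-sorted list (illustrative convention)."""
    k = max(0, math.ceil(p / 100 * len(sorted_costs)) - 1)
    return sorted_costs[k]

def cost_variance(costs: list[float], ceiling: float) -> dict:
    """P50/P90/P99 of the cost distribution plus the share of tasks over budget."""
    s = sorted(costs)
    return {
        "p50": percentile(s, 50),
        "p90": percentile(s, 90),
        "p99": percentile(s, 99),
        "pct_over_ceiling": sum(c > ceiling for c in costs) / len(costs),
    }
```

Reported next to ASR, this is what shows whether the cost penalty is dominating the headline number or barely engaging.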

05 · Reference dataset: The 200-trace reference dataset.

The reference dataset is six task families × 30-35 traces each. Its purpose is reviewer calibration, not benchmarking. Teams use it to align internal reviewers on what counts as partial-vs-incorrect-vs-hallucinated; once calibrated, they apply the same labels to their own production traces.

Family 1
35
Research

Multi-source research tasks. Examples: 'summarise the leading positions on the EU AI Act,' 'find three recent peer-reviewed papers on retrieval augmentation.' Cost ceiling $2.00; median actual $1.40.

Multi-source
Family 2
32
Support triage

Customer-support classification + first-response drafting. Examples: triage 50 incoming tickets into severity bands and draft initial replies. Cost ceiling $0.10/ticket; median actual $0.06.

High volume
Family 3
30
Code review

PR-level code review with structured findings. Examples: review a 300-line PR for bugs, style, security. Cost ceiling $0.50/PR; median actual $0.32.

Structured output
Family 4
33
Data extraction

Structured-output extraction from unstructured docs. Examples: pull contract terms from a 30-page MSA. Cost ceiling $0.25/doc; median actual $0.18.

Schema-bound
Family 5
35
Content drafting

First-draft content generation against a brief. Examples: 800-word blog post against a structured outline. Cost ceiling $0.50/draft; median actual $0.31.

Quality-graded
Family 6
35
Scheduling

Multi-agent scheduling tasks (calendar, tool-use, follow-up). Examples: schedule a 6-person customer call across three time zones. Cost ceiling $0.15/task; median actual $0.09.

Tool-use heavy

06 · Reporting: Reporting + confidence intervals.

ASR without confidence intervals is noise. The default reporting format includes the headline number, a 95% bootstrap CI, the outcome-class breakdown, and the cost-variance side panel. The template below is what we ship to engineering leads weekly.

Default report
ASR 0.62 ± 0.04 (95% CI)
1,000 bootstrap resamples

Headline number with confidence interval. A 0.04 CI on a 1,000-trace sample is typical; tighter on larger samples. Without CIs, week-over-week ASR comparisons are unreliable below 5 percentage points.

Headline number
Outcome breakdown
Class distribution table
% per class · with deltas

Completed: 58%. Partial-correct: 12%. Partial-incorrect: 8%. Hallucinated: 4%. Abandoned: 18%. Track WoW deltas; flag any class whose share moves more than 2 percentage points week-on-week.

Diagnostic
Cost variance
P50 / P90 / P99 cost · % over ceiling
side panel

P50 $0.31 · P90 $0.84 · P99 $2.10. 8% of tasks above ceiling. The cost-variance side panel is what tells the team whether the cost penalty is dominating or marginal.

Budget signal
Time-series
ASR with rolling 4-week mean
weekly cadence

Rolling 4-week mean smooths weekly noise and shows whether the program is trending. Use raw weekly ASR for anomaly detection; use rolling mean for trend reporting.

Trend view
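The 1,000-resample percentile bootstrap behind the headline CI can be sketched like this; `scores` is assumed to hold each task's post-penalty ASR contribution (0, 0.4, or 1.0):

```python
import random

def bootstrap_ci(scores: list[float], resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap: resample with replacement, take the 2.5%/97.5% cuts."""
    rng = random.Random(seed)  # seeded for reproducible weekly reports
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(resamples))
    lo = means[int(alpha / 2 * resamples)]            # ~25th of 1,000
    hi = means[int((1 - alpha / 2) * resamples) - 1]  # ~975th of 1,000
    return lo, hi
```

The interval widens on small samples, which is exactly the point: a week with 80 traces should not be compared to a week with 2,000 traces as if both numbers were equally solid.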

07 · Comparison: ASR vs adjacent metrics.

Metric
ASR vs strict pass-rate

Strict pass-rate is binary; ASR captures partial credit. A team running pass-rate often improves their headline number by widening their definition of 'pass' — ASR makes that move visible because partial-correct shows up as 0.4 credit instead of being absorbed into 'pass'. Absent cost penalties, ASR ≥ strict pass-rate by construction.

ASR wins
Metric
ASR vs cost per successful task

Cost per successful task is a denominator; ASR is a rate. Both are useful, and they pair: report ASR for the rate of success, $/successful-task for the unit economics. ASR alone misses cost; $/successful-task alone misses partial credit.

Pair them
Metric
ASR vs LangSmith / LangFuse eval scores

Eval scores are usually qualitative (1-5 scales, judge-LLM grades). ASR is structurally compatible: feed the eval score into the outcome-class label decision (5 → completed, 4 → partial-correct, 3-1 → partial-incorrect or worse). Use the platform for trace storage, ASR for headline.

Layered
Metric
ASR vs τ-bench / SWE-bench Verified

Benchmark suites give model-level numbers; ASR gives workflow-level numbers. Use the suite numbers to decide which model to deploy; use ASR to track whether the deployment continues working in production. Different lenses, both needed.

Different lenses
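The eval-score mapping described for the LangSmith/LangFuse integration can be sketched as a small adapter. The separate hallucination and output flags are assumptions about what the platform's checks expose alongside the 1-5 grade:

```python
def outcome_from_eval(score: int, hallucination_flagged: bool = False,
                      produced_output: bool = True) -> str:
    """Map a 1-5 judge score onto the ASR outcome classes
    (5 → complete, 4 → partial+, 3-1 → partial- or worse)."""
    if not produced_output:
        return "abandoned"
    if hallucination_flagged:
        return "hallucinated"
    return {5: "complete", 4: "partial+"}.get(score, "partial-")
```

The platform stays the trace store and grading surface; ASR is computed downstream from the mapped labels.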

08 · Conclusion: One metric, five outcomes.

ASR — methodology spec, April 2026

Adopt ASR as the headline; keep the outcome breakdown for diagnosis. The combination is what makes agent programs defensible across teams.

Pass-rate alone is too coarse. Cost-blind metrics are too forgiving of expensive failures. Qualitative reviewer scores are not comparable across teams. ASR closes those gaps without inventing a new benchmark — it specifies how to compute a comparable rate from any benchmark suite or production trace store.

Adopt the outcome classes (5), the partial weight (0.4), the cost penalty (linear above ceiling, floor at 0), and the bootstrap CI reporting. Use the 200-trace reference dataset for reviewer calibration; do not use it as a public leaderboard.

Pair ASR with cost variance as a side-panel sub-metric. Track WoW deltas; flag class-distribution shifts above 2% as diagnostic signals. The formula is open — fork the weights for your own workflow if 0.4 partial credit feels wrong; preserve the structure so teams can compare numbers across programs.

Agent measurement

Stop arguing about agent metrics. Adopt ASR.

We instrument ASR end-to-end for engineering teams running production agents — outcome-class taxonomy design, reviewer calibration, cost-penalty tuning, and the weekly reporting loop. Most engagements ship a defensible ASR baseline within 30 days.

What we work on

ASR engagements

  • Outcome-class taxonomy + reviewer calibration
  • Cost-penalty tuning per workflow ceiling
  • Bootstrap CI reporting + weekly cadence
  • Reference-dataset adaptation for the team's domain
  • Eval-platform integration (LangSmith, LangFuse, Arize)
FAQ · Agent Success Rate methodology

The questions we get every week.

Why a 0.4 partial weight and not 0.5? 0.5 over-rewards the partial outcome. We arrived at 0.4 empirically: across 28 client engagements, ASR computed at 0.4 partial weight correlated with downstream stakeholder satisfaction more strongly than 0.5 or 0.3. Stakeholders consistently described partial-correct outputs as 'about half useful' verbally and 'closer to 40%' when asked to score them on a 0-100 scale. Teams should feel free to fork the weight (some research-heavy teams prefer 0.5; some support-triage teams prefer 0.3); the methodology is open. Preserve the structure even when forking the weight.