Agent evaluation is in a strange place in 2026. The benchmark suites — τ-bench, Berkeley Function Calling, SWE-bench Verified, MLE-bench — give us model-level numbers. Production teams running agents in real workflows are left choosing between three options: roll their own metric (different per team, not comparable), report a simplistic pass-rate (treats a partially correct answer as a failure), or use a cost-blind metric (treats a $4 task that succeeded as equivalent to a $0.04 task that succeeded).
Agent Success Rate (ASR) is the methodology we ship to clients to close those gaps. It is not a benchmark suite — it is a measurement spec that runs on top of any benchmark suite or production trace store. The spec defines the outcome classes, the partial-credit weighting, the cost-penalty function, the confidence-interval reporting, and the reference dataset of 200 production-agent traces we use as a calibration set.
- 01. ASR is one rate that captures completion, partial credit, and cost in a single number. Pass-rate alone treats a partial answer as a failure. Cost-blind metrics treat expensive successes as equivalent to cheap ones. ASR folds both into a single rate, with the formula tunable per workflow but the structure shared across teams.
- 02. The five outcome classes give reviewers something concrete to label: completed (full credit), partial-correct (0.4 credit), partial-incorrect (0 credit, no penalty), hallucinated (0 credit, flagged), abandoned (0 credit, flagged). The labels carry diagnostic signal that pass/fail throws away.
- 03. The cost penalty engages above the budget ceiling, not on every task. The per-task cost ceiling is set per workflow (e.g., $0.10 for a triage task, $2.00 for a research task). Costs below the ceiling incur no penalty. Costs above the ceiling reduce the score linearly, with a hard floor at 0.
- 04. Confidence intervals make ASR comparable across runs and teams. Default reporting: ASR ± 95% CI via 1,000 bootstrap resamples. A 0.6-point ASR delta with overlapping CIs is not a regression; a 4-point delta with non-overlapping CIs is. Without confidence intervals, ASR comparisons are noise.
- 05. The 200-trace reference dataset is the calibration tool, not the benchmark. Six task families × 30-35 traces each. Each trace has the prompt, the agent's response, the canonical labels, the cost, and the latency. Use it to calibrate reviewers; do not use it as a public leaderboard — drift is too easy when the dataset becomes the target.
01 — Problem: Why we need ASR.
By Q1 2026, most production agentic teams report some flavour of success rate. The metrics are not comparable. Some teams use strict pass-rate (binary, no partial credit). Some teams use task completion (allows partial). Some teams report cost per successful task without any partial credit. Some teams report a qualitative reviewer score on a 1-5 scale. The result: when a VP-level conversation needs to compare two agent versions or two providers, no apples-to-apples answer exists.
ASR is the methodology that closes the gap. It does not invent a new benchmark; it specifies how to compute a comparable rate from any benchmark or production trace store, with partial credit and cost-adjustment baked in.
"We had three teams reporting agent success rate three different ways and the QBR turned into a definitions argument. ASR ended that — same formula, same outcome classes, same confidence intervals."— Head of AI infrastructure, public SaaS, March 2026
02 — Definition: Formal definition.
The full ASR formula:
ASR = (n_full + 0.4·n_partial+ − Σ cost_penalty_i) / n_total

cost_penalty_i = min(1.0, max(0, (cost_i − budget_ceiling) / budget_ceiling))

outcome_i ∈ {complete, partial+, partial−, hallucinated, abandoned}

ASR is a rate in the range [0, 1], reported as a percentage. Higher is better. A perfect agent on a basket of in-budget tasks scores 100%. A partially-effective agent on a basket of slightly-over-budget tasks might score 38%. The formula degrades gracefully and does not produce negative values: the penalty is capped at 1.0 per task and, per the hard-floor rule in the cost-penalty section, each task's contribution to the numerator is floored at 0.
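A minimal sketch of the computation in Python, assuming per-task records that carry an outcome label, the actual cost, and the workflow's budget ceiling; the type and function names are illustrative, not part of the spec:

```python
from dataclasses import dataclass

# Partial-credit weights for the five outcome classes (section 03).
WEIGHTS = {
    "complete": 1.0,
    "partial+": 0.4,
    "partial-": 0.0,
    "hallucinated": 0.0,
    "abandoned": 0.0,
}

@dataclass
class TaskResult:
    outcome: str           # one of the five outcome classes
    cost: float            # actual spend for the task, in dollars
    budget_ceiling: float  # per-workflow ceiling, in dollars

def cost_penalty(cost: float, ceiling: float) -> float:
    """Linear penalty above the ceiling, capped at 1.0 per task."""
    return min(1.0, max(0.0, (cost - ceiling) / ceiling))

def asr(results: list[TaskResult]) -> float:
    """Agent Success Rate: partial-credit weight minus cost penalty,
    with each task's contribution floored at 0."""
    if not results:
        return 0.0
    total = sum(
        max(0.0, WEIGHTS[r.outcome] - cost_penalty(r.cost, r.budget_ceiling))
        for r in results
    )
    return total / len(results)
```

Under this reading, a completed task at 1.5× its ceiling contributes 0.5, and a completed task at 2× or more contributes 0, matching the hard-floor rule described in the cost-penalty section.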
03 — Outcome classes: Five outcome classes.
Reviewers assign each agent run to exactly one of five classes. The labels are structured to keep reviewer agreement high (a binary decision tree at each level; a sketch follows the list below) and to preserve diagnostic signal for downstream analysis.
- Completed (full credit, weight 1.0): The agent completed the task to the standard a competent human reviewer would accept. All sub-goals satisfied; output meets the spec. Most production-grade systems target 65%+ in this class.
- Partial-correct (partial credit, weight 0.4): The agent got the right shape of answer but missed sub-goals (e.g., a correct research summary that missed two of five required sources). Partial credit captures the real value without rewarding low effort.
- Partial-incorrect (no credit, no penalty, weight 0.0): The agent attempted the task but the output is meaningfully wrong on sub-goals that matter (e.g., a research summary that confidently includes a wrong fact). No credit; flagged for prompt or guard-rail review.
- Hallucinated (no credit, flagged, weight 0.0): The agent confidently produced output that is fabricated: invented citations, invented function results, invented user state. Distinct from partial-incorrect because hallucinations indicate model or retrieval failure, not logic failure.
- Abandoned (no credit, flagged, weight 0.0): The agent did not produce output (timed out, errored, hit a tool-use deadlock, exceeded retries). Tracked separately because abandonment usually indicates infrastructure issues, not agent capability issues.
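The binary decision tree reviewers walk can be sketched as a small labelling helper. The question order and flag names below are ours, offered as one plausible encoding of the classes rather than the canonical rubric:

```python
def label_outcome(produced_output: bool,
                  fabricated: bool,
                  all_subgoals_met: bool,
                  materially_wrong: bool) -> str:
    """Walk the reviewer decision tree top-down; each step is a yes/no call."""
    if not produced_output:
        return "abandoned"     # timeout, error, tool-use deadlock, retry limit
    if fabricated:
        return "hallucinated"  # invented citations, function results, user state
    if all_subgoals_met:
        return "complete"      # full credit, weight 1.0
    if materially_wrong:
        return "partial-"      # wrong on sub-goals that matter, no credit
    return "partial+"          # right shape, missed sub-goals, 0.4 credit
```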
04 — Cost penalty: Cost penalty design.
The cost-penalty function is the part of ASR that practitioners push back on the most. The pushback is fair — penalising an agent for being expensive can produce metrics that prefer cheap-and-bad over expensive-and-correct. The design below threads that needle: costs only penalise above a per-task budget ceiling, and the penalty is linear-up-to-1.0, never producing negative scores.
- Per-task budget ceiling, not per-token rate: The ceiling is set per workflow (triage tasks $0.10, research tasks $2.00, code-review tasks $0.50). A token-rate ceiling would punish thoughtful agents that use more tokens to be correct; the per-task ceiling tracks the economic decision instead.
- Linear penalty above the ceiling: Once a task exceeds budget, the penalty grows linearly with the overage as a fraction of the ceiling: 1.5× ceiling = 0.5 penalty, 2× ceiling = 1.0 penalty (the maximum). Linear is simpler than log or quadratic penalties and easier to communicate to non-technical stakeholders.
- Hard floor at 0; no negative ASR: Tasks that succeeded but went 5× over budget contribute 0 to ASR rather than going negative. Negative scores would let cost overruns dominate the headline metric in unintuitive ways. Cost overruns are tracked separately as a 'cost variance' sub-metric.
- Cost variance as a separate sub-metric: ASR penalises cost only above the ceiling; cost variance reports the absolute cost distribution (P50, P90, P99) and the % of tasks above ceiling separately. Together the two paint the picture: ASR for the headline, cost variance for budgeting (a small sketch follows this list).
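A minimal sketch of the cost-variance side panel, using only the Python standard library; the return keys are illustrative:

```python
from statistics import quantiles

def cost_variance(costs: list[float], ceiling: float) -> dict:
    """Cost-variance side panel: P50/P90/P99 of actual spend plus the
    share of tasks that exceeded the per-workflow ceiling."""
    pct = quantiles(costs, n=100)  # 99 cut points: pct[k] is the (k+1)th percentile
    over_ceiling = sum(1 for c in costs if c > ceiling) / len(costs)
    return {
        "p50": pct[49],
        "p90": pct[89],
        "p99": pct[98],
        "pct_over_ceiling": over_ceiling,
    }
```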
05 — Reference dataset: The 200-trace reference dataset.
The reference dataset is six task families × 30-35 traces each. Its purpose is reviewer calibration, not benchmarking. Teams use it to align internal reviewers on what counts as partial-vs-incorrect-vs-hallucinated; once calibrated, they apply the same labels to their own production traces.
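As a concrete illustration of the fields listed above, one plausible shape for a trace record; the field names are ours, not a published schema:

```python
from dataclasses import dataclass

@dataclass
class ReferenceTrace:
    task_family: str      # e.g. "research", "support_triage", "code_review"
    prompt: str           # the task as given to the agent
    response: str         # the agent's final output
    canonical_label: str  # one of the five outcome classes, from calibrated reviewers
    cost_usd: float       # actual spend for the run
    latency_s: float      # wall-clock latency for the run
```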
- Research (multi-source): Multi-source research tasks. Examples: 'summarise the leading positions on the EU AI Act,' 'find three recent peer-reviewed papers on retrieval augmentation.' Cost ceiling $2.00; median actual $1.40.
- Support triage (high volume): Customer-support classification + first-response drafting. Examples: triage 50 incoming tickets into severity bands and draft initial replies. Cost ceiling $0.10/ticket; median actual $0.06.
- Code review (structured output): PR-level code review with structured findings. Examples: review a 300-line PR for bugs, style, security. Cost ceiling $0.50/PR; median actual $0.32.
- Data extraction (schema-bound): Structured-output extraction from unstructured docs. Examples: pull contract terms from a 30-page MSA. Cost ceiling $0.25/doc; median actual $0.18.
- Content drafting (quality-graded): First-draft content generation against a brief. Examples: an 800-word blog post against a structured outline. Cost ceiling $0.50/draft; median actual $0.31.
- Scheduling (tool-use heavy): Multi-agent scheduling tasks (calendar, tool-use, follow-up). Examples: schedule a 6-person customer call across three time zones. Cost ceiling $0.15/task; median actual $0.09.
06 — Reporting: Reporting + confidence intervals.
ASR without confidence intervals is noise. The default reporting format includes the headline number, a 95% bootstrap CI, the outcome-class breakdown, and the cost-variance side panel. The template below is what we ship to engineering leads weekly.
- Headline number with confidence interval: ASR 0.62 ± 0.04 (95% CI), from 1,000 bootstrap resamples. A ±0.04 CI on a 1,000-trace sample is typical; tighter on larger samples. Without CIs, week-over-week ASR comparisons are unreliable below 5 percentage points (a bootstrap sketch follows this list).
- Class distribution table, % per class with deltas: Completed 58%, partial-correct 12%, partial-incorrect 8%, hallucinated 4%, abandoned 18%. Track WoW deltas; flag any class that shifts more than 2% week-on-week.
- Cost-variance side panel, P50 / P90 / P99 cost and % over ceiling: P50 $0.31, P90 $0.84, P99 $2.10; 8% of tasks above ceiling. The side panel tells the team whether the cost penalty is dominating or marginal.
- ASR with rolling 4-week mean, weekly cadence: The rolling 4-week mean smooths weekly noise and shows whether the program is trending. Use raw weekly ASR for anomaly detection; use the rolling mean for trend reporting.
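A minimal percentile-bootstrap sketch for the headline CI, assuming per-task scores computed as in the definition section; the resample count and seed are illustrative:

```python
import random

def bootstrap_ci(task_scores: list[float],
                 resamples: int = 1000,
                 alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval on ASR from per-task scores."""
    rng = random.Random(seed)
    n = len(task_scores)
    means = sorted(
        sum(rng.choices(task_scores, k=n)) / n  # mean of one resampled basket
        for _ in range(resamples)
    )
    lo = means[int((alpha / 2) * resamples)]          # 2.5th percentile
    hi = means[int((1 - alpha / 2) * resamples) - 1]  # 97.5th percentile
    return lo, hi
```

Report the observed ASR as the point estimate with this interval alongside it; per the template above, a ±0.04 half-width is typical at roughly 1,000 traces.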
07 — Comparison: ASR vs adjacent metrics.
- ASR vs strict pass-rate: Strict pass-rate is binary; ASR captures partial credit. A team running pass-rate often improves its headline number by widening the definition of 'pass'; ASR makes that move visible because partial-correct shows up as 0.4 credit instead of being absorbed into 'pass'. Absent cost overruns, ASR sits at or above strict pass-rate by construction.
- ASR vs cost per successful task: Cost per successful task is a unit-economics ratio; ASR is a rate. Both are useful, and they pair: report ASR for the rate of success and $/successful-task for the unit economics. ASR alone misses the absolute cost picture; $/successful-task alone misses partial credit.
- ASR vs LangSmith / LangFuse eval scores: Eval scores are usually qualitative (1-5 scales, judge-LLM grades). ASR is structurally compatible: feed the eval score into the outcome-class decision (5 → completed, 4 → partial-correct, 3 and below → partial-incorrect or worse). Use the platform for trace storage and ASR for the headline (a minimal mapping follows this list).
- ASR vs τ-bench / SWE-bench Verified: Benchmark suites give model-level numbers; ASR gives workflow-level numbers. Use the suite numbers to decide which model to deploy; use ASR to track whether the deployment keeps working in production. Different lenses, both needed.
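For the eval-score layering above, one minimal way to fold a 1-5 judge score into the outcome classes; the thresholds follow that item, and the flag arguments that route to hallucinated or abandoned are our assumption:

```python
def eval_score_to_class(score: int,
                        fabricated: bool = False,
                        no_output: bool = False) -> str:
    """Fold a 1-5 judge/eval score into the ASR outcome classes."""
    if no_output:
        return "abandoned"
    if fabricated:
        return "hallucinated"
    if score >= 5:
        return "complete"
    if score == 4:
        return "partial+"
    return "partial-"  # scores 1-3
```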
08 — Conclusion: One metric, five outcomes.
Adopt ASR as the headline; keep the outcome breakdown for diagnosis. The combination is what makes agent programs defensible across teams.
Pass-rate alone is too coarse. Cost-blind metrics are too forgiving of expensive failures. Qualitative reviewer scores are not comparable across teams. ASR closes those gaps without inventing a new benchmark — it specifies how to compute a comparable rate from any benchmark suite or production trace store.
Adopt the outcome classes (5), the partial weight (0.4), the cost penalty (linear above ceiling, floor at 0), and the bootstrap CI reporting. Use the 200-trace reference dataset for reviewer calibration; do not use it as a public leaderboard.
Pair ASR with cost variance as a side-panel sub-metric. Track WoW deltas; flag class-distribution shifts above 2% as diagnostic signals. The formula is open — fork the weights for your own workflow if 0.4 partial credit feels wrong; preserve the structure so teams can compare numbers across programs.