Agent evaluation is in a strange place in 2026. The benchmark suites — τ-bench, Berkeley Function Calling, SWE-bench Verified, MLE-bench — give us model-level numbers. Production teams running agents in real workflows are left choosing between three options: roll their own metric (different per team, not comparable), report a simplistic pass-rate (treats a partially correct answer as a failure), or use a cost-blind metric (treats a $4 task that succeeded as equivalent to a $0.04 task that succeeded).
Agent Success Rate (ASR) is the methodology we ship to clients to close those gaps. It is not a benchmark suite — it is a measurement spec that runs on top of any benchmark suite or production trace store. The spec defines the outcome classes, the partial-credit weighting, the cost-penalty function, the confidence-interval reporting, and the reference dataset of 200 production-agent traces we use as a calibration set.
- 01. ASR is one rate that captures completion, partial credit, and cost in a single number. Pass-rate alone treats a partial answer as a failure. Cost-blind metrics treat expensive successes as equivalent to cheap ones. ASR folds both into a single rate, with the formula tunable per workflow but the structure shared across teams.
- 02. The five outcome classes give reviewers something concrete to label: completed (full credit), partial-correct (0.4 credit), partial-incorrect (0 credit, no penalty), hallucinated (0 credit, flagged), abandoned (0 credit, flagged). The labels carry diagnostic signal that pass/fail throws away.
- 03. The cost penalty engages above the budget ceiling, not on every task. The per-task cost ceiling is set per workflow (e.g., $0.10 for a triage task, $2.00 for a research task). Costs below the ceiling incur no penalty. Costs above the ceiling reduce the score linearly, with a hard floor at 0.
- 04. Confidence intervals make ASR comparable across runs and teams. Default reporting: ASR ± 95% CI via 1,000 bootstrap resamples. A 0.6-point ASR delta with overlapping CIs is not a regression; a 4-point delta with non-overlapping CIs is. Without confidence intervals, ASR comparisons are noise.
- 05. The 200-trace reference dataset is the calibration tool, not the benchmark. Six task families × 30-35 traces each. Each trace has the prompt, the agent's response, the canonical labels, the cost, and the latency. Use it to calibrate reviewers; do not use it as a public leaderboard — drift is too easy when the dataset becomes the target.
01 — Problem: Why we need ASR.
By Q1 2026, most production agentic teams report some flavour of success rate. The metrics are not comparable. Some teams use strict pass-rate (binary, no partial credit). Some teams use task completion (allows partial). Some teams report cost per successful task without any partial credit. Some teams report a qualitative reviewer score on a 1-5 scale. The result: when a VP-level conversation needs to compare two agent versions or two providers, no apples-to-apples answer exists.
ASR is the methodology that closes the gap. It does not invent a new benchmark; it specifies how to compute a comparable rate from any benchmark or production trace store, with partial credit and cost-adjustment baked in.
"We had three teams reporting agent success rate three different ways and the QBR turned into a definitions argument. ASR ended that — same formula, same outcome classes, same confidence intervals."— Head of AI infrastructure, public SaaS, March 2026
02 — Definition: Formal definition.
The full ASR formula:
ASR = (n_full + 0.4·n_partial+ − Σ cost_penalty_i) / n_total

cost_penalty_i = min(1.0, max(0, (cost_i − budget_ceiling) / budget_ceiling))

outcome_i ∈ {complete, partial+, partial−, hallucinated, abandoned}

ASR is a rate in the range [0, 1], reported as a percentage. Higher is better. A perfect agent on a basket of in-budget tasks scores 100%. A partially-effective agent on a basket of slightly-over-budget tasks might score 38%. The formula degrades gracefully and does not produce negative values: the penalty is capped at 1.0 per task and, per the hard-floor rule in the cost-penalty section, each task's contribution to the numerator is floored at 0.
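A minimal sketch of the computation in Python, assuming per-task records that carry an outcome label, the actual cost, and the workflow's budget ceiling; the type and function names are illustrative, not part of the spec:

```python
from dataclasses import dataclass

# Partial-credit weights for the five outcome classes (section 03).
WEIGHTS = {
    "complete": 1.0,
    "partial+": 0.4,
    "partial-": 0.0,
    "hallucinated": 0.0,
    "abandoned": 0.0,
}

@dataclass
class TaskResult:
    outcome: str           # one of the five outcome classes
    cost: float            # actual spend for the task, in dollars
    budget_ceiling: float  # per-workflow ceiling, in dollars

def cost_penalty(cost: float, ceiling: float) -> float:
    """Linear penalty above the ceiling, capped at 1.0 per task."""
    return min(1.0, max(0.0, (cost - ceiling) / ceiling))

def asr(results: list[TaskResult]) -> float:
    """Agent Success Rate: partial-credit weight minus cost penalty,
    with each task's contribution floored at 0."""
    if not results:
        return 0.0
    total = sum(
        max(0.0, WEIGHTS[r.outcome] - cost_penalty(r.cost, r.budget_ceiling))
        for r in results
    )
    return total / len(results)
```

Under this reading, a completed task at 1.5× its ceiling contributes 0.5, and a completed task at 2× or more contributes 0, matching the hard-floor rule described in the cost-penalty section.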
03 — Outcome classes: Five outcome classes.
Reviewers assign each agent run to exactly one of five classes. The labels are structured to keep reviewer agreement high (a binary decision tree at each level; a sketch follows the list below) and to preserve diagnostic signal for downstream analysis.
- Completed (full credit, weight 1.0): The agent completed the task to the standard a competent human reviewer would accept. All sub-goals satisfied; output meets the spec. Most production-grade systems target 65%+ in this class.
- Partial-correct (partial credit, weight 0.4): The agent got the right shape of answer but missed sub-goals (e.g., a correct research summary that missed two of five required sources). Partial credit captures the real value without rewarding low effort.
- Partial-incorrect (no credit, no penalty, weight 0.0): The agent attempted the task but the output is meaningfully wrong on sub-goals that matter (e.g., a research summary that confidently includes a wrong fact). No credit; flagged for prompt or guard-rail review.
- Hallucinated (no credit, flagged, weight 0.0): The agent confidently produced output that is fabricated: invented citations, invented function results, invented user state. Distinct from partial-incorrect because hallucinations indicate model or retrieval failure, not logic failure.
- Abandoned (no credit, flagged, weight 0.0): The agent did not produce output (timed out, errored, hit a tool-use deadlock, exceeded retries). Tracked separately because abandonment usually indicates infrastructure issues, not agent capability issues.
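The binary decision tree reviewers walk can be sketched as a small labelling helper. The question order and flag names below are ours, offered as one plausible encoding of the classes rather than the canonical rubric:

```python
def label_outcome(produced_output: bool,
                  fabricated: bool,
                  all_subgoals_met: bool,
                  materially_wrong: bool) -> str:
    """Walk the reviewer decision tree top-down; each step is a yes/no call."""
    if not produced_output:
        return "abandoned"     # timeout, error, tool-use deadlock, retry limit
    if fabricated:
        return "hallucinated"  # invented citations, function results, user state
    if all_subgoals_met:
        return "complete"      # full credit, weight 1.0
    if materially_wrong:
        return "partial-"      # wrong on sub-goals that matter, no credit
    return "partial+"          # right shape, missed sub-goals, 0.4 credit
```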
04 — Cost penalty: Cost penalty design.
The cost-penalty function is the part of ASR that practitioners push back on the most. The pushback is fair — penalising an agent for being expensive can produce metrics that prefer cheap-and-bad over expensive-and-correct. The design below threads that needle: costs only penalise above a per-task budget ceiling, and the penalty is linear-up-to-1.0, never producing negative scores.
- Per-task budget ceiling, not per-token rate: The ceiling is set per workflow (triage tasks $0.10, research tasks $2.00, code-review tasks $0.50). A token-rate ceiling would punish thoughtful agents that use more tokens to be correct; the per-task ceiling tracks the economic decision instead.
- Linear penalty above the ceiling: Once a task exceeds budget, the penalty grows linearly with the overage as a fraction of the ceiling: 1.5× ceiling = 0.5 penalty, 2× ceiling = 1.0 penalty (the maximum). Linear is simpler than log or quadratic penalties and easier to communicate to non-technical stakeholders.
- Hard floor at 0; no negative ASR: Tasks that succeeded but went 5× over budget contribute 0 to ASR rather than going negative. Negative scores would let cost overruns dominate the headline metric in unintuitive ways. Cost overruns are tracked separately as a 'cost variance' sub-metric.
- Cost variance as a separate sub-metric: ASR penalises cost only above the ceiling; cost variance reports the absolute cost distribution (P50, P90, P99) and the % of tasks above ceiling separately. Together the two paint the picture: ASR for the headline, cost variance for budgeting (a small sketch follows this list).
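A minimal sketch of the cost-variance side panel, using only the Python standard library; the return keys are illustrative:

```python
from statistics import quantiles

def cost_variance(costs: list[float], ceiling: float) -> dict:
    """Cost-variance side panel: P50/P90/P99 of actual spend plus the
    share of tasks that exceeded the per-workflow ceiling."""
    pct = quantiles(costs, n=100)  # 99 cut points: pct[k] is the (k+1)th percentile
    over_ceiling = sum(1 for c in costs if c > ceiling) / len(costs)
    return {
        "p50": pct[49],
        "p90": pct[89],
        "p99": pct[98],
        "pct_over_ceiling": over_ceiling,
    }
```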
05 — Reference dataset: The 200-trace reference dataset.
The reference dataset is six task families × 30-35 traces each. Its purpose is reviewer calibration, not benchmarking. Teams use it to align internal reviewers on what counts as partial-vs-incorrect-vs-hallucinated; once calibrated, they apply the same labels to their own production traces.
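As a concrete illustration of the fields listed above, one plausible shape for a trace record; the field names are ours, not a published schema:

```python
from dataclasses import dataclass

@dataclass
class ReferenceTrace:
    task_family: str      # e.g. "research", "support_triage", "code_review"
    prompt: str           # the task as given to the agent
    response: str         # the agent's final output
    canonical_label: str  # one of the five outcome classes, from calibrated reviewers
    cost_usd: float       # actual spend for the run
    latency_s: float      # wall-clock latency for the run
```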
- Research (multi-source): Multi-source research tasks. Examples: 'summarise the leading positions on the EU AI Act,' 'find three recent peer-reviewed papers on retrieval augmentation.' Cost ceiling $2.00; median actual $1.40.
- Support triage (high volume): Customer-support classification + first-response drafting. Examples: triage 50 incoming tickets into severity bands and draft initial replies. Cost ceiling $0.10/ticket; median actual $0.06.
- Code review (structured output): PR-level code review with structured findings. Examples: review a 300-line PR for bugs, style, security. Cost ceiling $0.50/PR; median actual $0.32.
- Data extraction (schema-bound): Structured-output extraction from unstructured docs. Examples: pull contract terms from a 30-page MSA. Cost ceiling $0.25/doc; median actual $0.18.
- Content drafting (quality-graded): First-draft content generation against a brief. Examples: an 800-word blog post against a structured outline. Cost ceiling $0.50/draft; median actual $0.31.
- Scheduling (tool-use heavy): Multi-agent scheduling tasks (calendar, tool-use, follow-up). Examples: schedule a 6-person customer call across three time zones. Cost ceiling $0.15/task; median actual $0.09.
06 — Reporting: Reporting + confidence intervals.
ASR without confidence intervals is noise. The default reporting format includes the headline number, a 95% bootstrap CI, the outcome-class breakdown, and the cost-variance side panel. The template below is what we ship to engineering leads weekly.
- Headline number with confidence interval: ASR 0.62 ± 0.04 (95% CI), from 1,000 bootstrap resamples. A ±0.04 CI on a 1,000-trace sample is typical; tighter on larger samples. Without CIs, week-over-week ASR comparisons are unreliable below 5 percentage points (a bootstrap sketch follows this list).
- Class distribution table, % per class with deltas: Completed 58%, partial-correct 12%, partial-incorrect 8%, hallucinated 4%, abandoned 18%. Track WoW deltas; flag any class that shifts more than 2% week-on-week.
- Cost-variance side panel, P50 / P90 / P99 cost and % over ceiling: P50 $0.31, P90 $0.84, P99 $2.10; 8% of tasks above ceiling. The side panel tells the team whether the cost penalty is dominating or marginal.
- ASR with rolling 4-week mean, weekly cadence: The rolling 4-week mean smooths weekly noise and shows whether the program is trending. Use raw weekly ASR for anomaly detection; use the rolling mean for trend reporting.
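A minimal percentile-bootstrap sketch for the headline CI, assuming per-task scores computed as in the definition section; the resample count and seed are illustrative:

```python
import random

def bootstrap_ci(task_scores: list[float],
                 resamples: int = 1000,
                 alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval on ASR from per-task scores."""
    rng = random.Random(seed)
    n = len(task_scores)
    means = sorted(
        sum(rng.choices(task_scores, k=n)) / n  # mean of one resampled basket
        for _ in range(resamples)
    )
    lo = means[int((alpha / 2) * resamples)]          # 2.5th percentile
    hi = means[int((1 - alpha / 2) * resamples) - 1]  # 97.5th percentile
    return lo, hi
```

Report the observed ASR as the point estimate with this interval alongside it; per the template above, a ±0.04 half-width is typical at roughly 1,000 traces.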
07 — Comparison: ASR vs adjacent metrics.
- ASR vs strict pass-rate: Strict pass-rate is binary; ASR captures partial credit. A team running pass-rate often improves its headline number by widening the definition of 'pass'; ASR makes that move visible because partial-correct shows up as 0.4 credit instead of being absorbed into 'pass'. Absent cost overruns, ASR sits at or above strict pass-rate by construction.
- ASR vs cost per successful task: Cost per successful task is a unit-economics ratio; ASR is a rate. Both are useful, and they pair: report ASR for the rate of success and $/successful-task for the unit economics. ASR alone misses the absolute cost picture; $/successful-task alone misses partial credit.
- ASR vs LangSmith / LangFuse eval scores: Eval scores are usually qualitative (1-5 scales, judge-LLM grades). ASR is structurally compatible: feed the eval score into the outcome-class decision (5 → completed, 4 → partial-correct, 3 and below → partial-incorrect or worse). Use the platform for trace storage and ASR for the headline (a minimal mapping follows this list).
- ASR vs τ-bench / SWE-bench Verified: Benchmark suites give model-level numbers; ASR gives workflow-level numbers. Use the suite numbers to decide which model to deploy; use ASR to track whether the deployment keeps working in production. Different lenses, both needed.
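For the eval-score layering above, one minimal way to fold a 1-5 judge score into the outcome classes; the thresholds follow that item, and the flag arguments that route to hallucinated or abandoned are our assumption:

```python
def eval_score_to_class(score: int,
                        fabricated: bool = False,
                        no_output: bool = False) -> str:
    """Fold a 1-5 judge/eval score into the ASR outcome classes."""
    if no_output:
        return "abandoned"
    if fabricated:
        return "hallucinated"
    if score >= 5:
        return "complete"
    if score == 4:
        return "partial+"
    return "partial-"  # scores 1-3
```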
08 — Conclusion: One metric, five outcomes.
Adopt ASR as the headline; keep the outcome breakdown for diagnosis. The combination is what makes agent programs defensible across teams.
Pass-rate alone is too coarse. Cost-blind metrics are too forgiving of expensive failures. Qualitative reviewer scores are not comparable across teams. ASR closes those gaps without inventing a new benchmark — it specifies how to compute a comparable rate from any benchmark suite or production trace store.
Adopt the outcome classes (5), the partial weight (0.4), the cost penalty (linear above ceiling, floor at 0), and the bootstrap CI reporting. Use the 200-trace reference dataset for reviewer calibration; do not use it as a public leaderboard.
Pair ASR with cost variance as a side-panel sub-metric. Track WoW deltas; flag class-distribution shifts above 2% as diagnostic signals. The formula is open — fork the weights for your own workflow if 0.4 partial credit feels wrong; preserve the structure so teams can compare numbers across programs.