Agent quality metrics are the panel that tells a team whether the agent program shipping every week is actually improving the product — pass rate, revision rate, eval coverage, hallucination, refusal, calibration error — measured on a stable cadence so quality drift is visible before it becomes a customer incident. Velocity metrics tell you how fast; quality metrics tell you whether the speed is earning trust or burning it.
Most agent programs start with a single, intuitive metric: does the agent ship? Tasks completed per week, PRs merged, tickets closed. Those numbers move fast and look impressive. Six months in, the same teams discover that velocity without a quality panel produces a different problem — a backlog of half-trusted output that humans quietly redo, an eval suite nobody updated when the model rotated, and a refusal rate that crept up while attention was elsewhere. Quality is what makes velocity legible.
This guide covers the ten KPIs we run on production agent programs, why each one matters, how to measure it without building bespoke infrastructure, and what the weekly review cadence looks like in practice. Everything below is implementation-agnostic — the metrics apply equally to a Claude Code subagent fleet, a customer-support agent on a knowledge base, or a content agent generating drafts for human review.
- 01 — Pass rate is the quality baseline. How often does the agent's output pass the eval suite without human intervention? It is the cheapest, most legible quality signal — and the one that anchors every conversation about whether the program is improving or regressing.
- 02 — Revision rate is the trust signal. How often do humans edit the agent's output before shipping? Revision rate trails pass rate by design — it captures the silent quality problems that pass automated checks but still need a person's touch. Trends in revision rate are leading indicators of trust.
- 03 — Eval coverage is the contract. What percentage of production workflows have a regression suite? Without coverage, pass rate is a story about the workflows you happen to be measuring, not about the agent program as a whole. Target above 80% for production workflows.
- 04 — Hallucination and refusal are the risk metrics. Hallucination rate catches confident-wrong output. Refusal rate catches the opposite failure — over-cautious agents declining work they could have done. Both belong on the same panel because they are the two ends of the same calibration problem.
- 05 — Calibration error is the leading quality indicator. When the agent says it is 90% confident, is it actually right 90% of the time? Calibration error tracks the gap between stated and observed confidence — it is the first metric that moves when a model rotation degrades judgment, often days before pass rate notices.
01 — Why Quality Now
Velocity without a quality panel produces a trust deficit.
Every agent program we have audited in the last twelve months went through the same arc. Quarter one is velocity-focused — tasks completed, PRs landed, drafts generated. Numbers go up, leadership is happy, the case for expansion writes itself. Quarter two surfaces a quieter pattern: the humans downstream of the agent start spending more time editing its output than the agent saved by writing it. The dashboard still says velocity is up. The team knows something is off but cannot point at a number that explains why.
The gap between "agent shipped" and "agent shipped something we trust" is what the quality panel measures. Pass rate without revision rate flatters the program. Eval coverage without hallucination rate hides confidence-wrong failures. Refusal rate without calibration error makes over-cautious agents look like prudent ones. The ten KPIs below sit together for a reason — each catches a failure mode the others miss.
The cost of running the quality panel is not the instrumentation, which is mostly cheap once one eval suite is in place. The cost is the discipline of looking at it every week and deciding which axis to invest in next. Teams that adopt the panel typically see revision rate drop by a third within the first quarter — not because the agent suddenly got better, but because humans stopped accepting output that would have failed an eval the team did not previously have.
The ten KPIs split naturally into four clusters. Pass rate and revision rate are the two headline metrics — they sit at the top of every quality review. Eval coverage and regression detection latency are the infrastructure metrics — they measure whether the rest of the panel is trustworthy. Hallucination, refusal, sycophantic compliance, and brand-voice deviation are the risk metrics — they catch the failure modes that do not show up as eval failures. Calibration error and quality-incident frequency are the leading and lagging indicators respectively — calibration moves first when something degrades; incident frequency confirms it once a regression slips through.
"Velocity tells you how fast the agent ships. Quality tells you whether anyone trusts what it shipped. Both belong on the same dashboard, but they belong in different columns."— Common refrain across agent-program audits
02 — Pass Rate
The agent's unaided success rate.
Pass rate is the percentage of agent runs whose output passes the eval suite for that workflow without human intervention. It is the cheapest, most legible quality metric on the panel and the one that anchors every other conversation. When a team says "the agent is at 78%," they almost always mean pass rate — and the conversation that follows is almost always about which subset of the 22% can be closed by prompt edits, model rotation, or workflow redesign.
The three modes below describe how pass rate is computed in practice. The mode you pick depends on workflow shape, eval tooling, and how strict you want the bar to be. Most teams start with strict binary pass and migrate toward graded, rubric-scored pass as the program matures and rubric design gets more sophisticated.
Binary pass / fail
% of runs passing every assertion
The default starting mode. Every assertion in the eval suite must pass for the run to count as a pass. Cheap to compute, easy to communicate, ruthless about edge cases. Best for high-stakes workflows where partial credit is misleading.
Binary · No partial credit

Rubric-scored output
weighted score ≥ threshold
Each output is scored across a small rubric (correctness, tone, completeness, brand voice). Runs above the threshold count as passes. More expressive than strict pass; requires the rubric to be calibrated against human judgment.
Weighted · Rubric-based

Pass with constraints
pass + must-have assertions
Two-tier: a small set of must-have assertions are graded strictly, while the rest contribute to a rubric score. Catches the failure mode where a graded pass would mask a critical assertion failure. The right default for production support workflows.
Two-tier · Production default

Whatever mode you pick, the same operational rules apply. The eval suite must be stable — adding new test cases mid-week changes the denominator and makes week-on-week comparison meaningless. The pass-rate dashboard must show the denominator alongside the percentage, so a 95% pass rate on twelve runs does not look identical to a 95% pass rate on twelve hundred. And the dashboard must show pass rate per workflow, not just per agent — averaging across workflows hides which one is dragging the program down.
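As a rough illustration, here is a minimal sketch of the three modes computed over a batch of eval results, per workflow and with the denominator kept visible. The record fields (assertions, rubric_score, must_pass, workflow) are illustrative assumptions, not a specific harness's schema:

```python
from collections import defaultdict

def pass_rate_by_workflow(results: list[dict], mode: str = "two_tier", threshold: float = 0.8) -> dict:
    """Compute pass rate per workflow under binary, rubric, or two-tier scoring."""
    tally = defaultdict(lambda: [0, 0])                        # workflow -> [passes, total]
    for r in results:
        binary_ok = all(r["assertions"])                       # every assertion passed
        rubric_ok = r.get("rubric_score", 0.0) >= threshold    # weighted score over threshold
        must_ok = all(r.get("must_pass", []))                  # critical assertions only
        passed = {
            "binary": binary_ok,
            "rubric": rubric_ok,
            "two_tier": must_ok and rubric_ok,                 # strict must-haves + rubric for the rest
        }[mode]
        tally[r["workflow"]][1] += 1
        tally[r["workflow"]][0] += int(passed)
    # Report the denominator next to the percentage so 95% of 12 never looks like 95% of 1,200.
    return {wf: {"pass_rate": p / n, "runs": n} for wf, (p, n) in tally.items()}
```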
The most common operational mistake on this axis is allowing the pass-rate target to drift upward without re-grounding against the eval suite. A team that hits 90% on a stable suite for three weeks running starts to expect 90% as the floor, then a junior engineer adds harder test cases and the number drops to 78%. Nothing got worse — the bar got higher — but the dashboard makes it look like a regression. Eval-suite changes are the equivalent of a benchmark change; they need to be logged, dated, and visible on the chart.
03 — Revision Rate
The trust metric that trails pass rate.
Revision rate measures how often humans edit the agent's output before shipping it. It is the trust signal — the number that tells you whether downstream reviewers actually believe the pass-rate number on the dashboard. A program with 90% pass rate and 50% revision rate is silently failing; the evals say the output is good, but the humans handling it disagree often enough to spend their time on edits. A program with 78% pass rate and 12% revision rate is healthier than it looks — the eval suite is being strict, and the humans agree with its judgment.
Revision rate is harder to instrument than pass rate because it lives in the human workflow, not the eval pipeline. Some common shapes: a content agent generates drafts that a content editor revises in Google Docs or a CMS; a code agent ships PRs that a senior engineer modifies before merging; a support agent drafts replies that a human reviews and sends. In each case, the diff between agent output and shipped output is the raw signal — what gets measured is the character-level or token-level edit distance, normalised by output length, averaged across runs.
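A minimal sketch of that measurement, assuming plain-string outputs and using Python's standard difflib as an approximation of normalised edit distance (a production pipeline might diff at token level instead):

```python
import difflib

def normalised_edit_distance(agent_output: str, shipped_output: str) -> float:
    """Approximate fraction of the agent's output changed before shipping (0.0 = untouched)."""
    # SequenceMatcher.ratio() is a similarity score in [0, 1]; 1 - ratio is a
    # length-normalised proxy for edit distance, good enough for banding runs.
    return 1.0 - difflib.SequenceMatcher(None, agent_output, shipped_output).ratio()

def revision_rate(pairs: list[tuple[str, str]]) -> float:
    """Average normalised edit distance across (agent output, shipped output) pairs."""
    if not pairs:
        return 0.0
    return sum(normalised_edit_distance(a, s) for a, s in pairs) / len(pairs)
```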
≤ 5% edit distance
The reviewer accepted the agent's output substantively. Edits are formatting, light wording, occasional fact correction. This is the target band — anything in this range should not require deeper investigation.
Healthy band

6–20% edit distance
Reviewer made meaningful changes — restructured a section, rewrote a paragraph, added missing context. A workflow with sustained moderate revision rate needs prompt or rubric attention; the eval suite is missing something the human is catching.
Investigate the prompt

21–50% edit distance
The agent's output served as a rough draft. This is acceptable for ideation workflows where the agent is explicitly a brainstorm partner; unacceptable for production workflows where the agent is supposed to be doing the work. Drives the prioritisation conversation.
Workflow redesign

> 50% edit distance
The reviewer threw out the agent's output and wrote it themselves. Sustained rewrite-rate on a workflow is a signal to either retire the agent on that workflow, switch models, or fundamentally rework the prompt. The agent is costing more time than it is saving.
Retire or rebuild

The single most useful chart on the quality panel is pass rate and revision rate plotted side by side over time. The healthy pattern is pass rate climbing while revision rate stays flat or declines — humans are intervening less because the agent is genuinely improving. The unhealthy pattern is pass rate climbing while revision rate climbs in parallel — the eval suite is becoming easier (often because someone removed test cases that were failing) but humans are doing more work, not less. That divergence is the most reliable warning sign of a quality program drifting off course.
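A minimal sketch of that warning sign expressed as a check, assuming weekly pass-rate and revision-rate series are already on hand; the four-week window is an illustrative choice:

```python
def quality_divergence(pass_rates: list[float], revision_rates: list[float], window: int = 4) -> bool:
    """Flag the unhealthy pattern: pass rate trending up while revision rate also trends up."""
    if len(pass_rates) < window or len(revision_rates) < window:
        return False  # not enough weekly data points to call a trend
    pass_trend = pass_rates[-1] - pass_rates[-window]
    revision_trend = revision_rates[-1] - revision_rates[-window]
    # Both climbing together suggests the eval suite got easier, not that the agent got better.
    return pass_trend > 0 and revision_trend > 0
```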
For teams running prompt libraries alongside agents, the revision-rate signal has a useful crossover use — it doubles as a leading indicator for prompt updates. When revision rate spikes on a specific workflow, the most-touched prompt for that workflow is usually the one that needs attention first. The pattern is identical to what we cover in our prompt library audit framework — measure first, prioritise from the data.
04 — Eval Coverage
Without coverage, pass rate is a story about a sample.
Eval coverage is the percentage of production workflows that have a regression suite wired into CI. It is the contract metric — without coverage, every other number on the panel describes only the workflows you happen to be measuring, not the agent program as a whole. A team reporting 90% pass rate at 40% coverage is reporting pass rate on the half of workflows somebody bothered to instrument; the other half could be in free-fall and the dashboard would not notice.
The target band for production agent programs is above 80% coverage. Below 50% the panel is misleading enough that leadership should not be making expansion decisions on it. Between 50% and 80% is the working zone — the program knows where its blind spots are and is closing them. Above 80%, the panel is reliable enough to drive resource allocation; above 95%, the eval infrastructure becomes a visible budget line item and the team needs to decide whether the marginal cost of covering the last few workflows is worth it.
Eval coverage stages · % of production workflows with regression suites
Source: agent-program maturity model · Digital Applied

Regression detection latency is the partner metric to coverage — it measures how quickly a regression is detected once it appears in production. The infrastructure answer is a nightly cron running the full eval suite against production prompts, with failures routed to the workflow owner. Detection latency below twenty-four hours is the production target. The cost of running the cron is small; the value is catching vendor model rotations, silent dataset drift, and prompt edits that slipped through review.
The single highest-ROI move on this axis is wiring the first scheduled eval run. Teams routinely cross from Stage 2 to Stage 3 the week they add a cron — not because coverage went up, but because the latent regressions in the existing coverage suddenly became visible. The pattern is the same one we cover in our agentic workflow resilience audit — feedback loops change behavior more than tooling does.
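One low-infrastructure way to approximate that nightly run is a small script invoked by any scheduler (cron or a CI schedule). The per-workflow suite runners and the report location below are illustrative assumptions, not a specific tool's API:

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable

def run_nightly_regression(suites: dict[str, Callable[[], tuple[int, int]]]) -> dict:
    """Run each workflow's eval suite and write a dated report for the morning review."""
    report = {"run_at": datetime.now(timezone.utc).isoformat(), "results": {}}
    for workflow, run_suite in suites.items():
        passed, total = run_suite()                 # each suite returns (passing runs, total runs)
        report["results"][workflow] = {
            "pass_rate": passed / total if total else None,
            "runs": total,                          # keep the denominator visible
        }
    out_dir = Path("eval_reports")
    out_dir.mkdir(exist_ok=True)
    (out_dir / f"{datetime.now(timezone.utc):%Y-%m-%d}.json").write_text(json.dumps(report, indent=2))
    # Routing failures to workflow owners (chat, email, tickets) is left to whatever
    # notification channel the team already uses.
    return report
```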
05 — Hallucination + Refusal
The two ends of the same calibration problem.
Hallucination rate is the percentage of agent runs producing output that is confidently wrong on a factual claim. Refusal rate is the percentage of agent runs declining to attempt work the agent could reasonably have completed. They are opposite failure modes — one is over-confidence, the other is over-caution — but they share a root cause, which is why the panel measures them together. A change that pushes hallucination down usually pushes refusal up, and vice versa. Tracking only one of them produces a distorted optimisation target.
Hallucination is instrumented in three layers. First, an automated fact-check pass against a reference corpus or a ground-truth dataset catches the obvious failures. Second, a structured citation requirement on knowledge workflows forces the agent to ground claims in retrievable sources, which makes hallucination self-detectable when citations fail to verify. Third, a small sample of human-reviewed output catches the long-tail cases the automated layers miss — the cost of the human layer is high enough that most teams limit it to 5-10% of runs, sampled randomly.
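A rough sketch of the second and third layers, assuming each run emits structured claims with citation identifiers and that verify_citation is a project-specific lookup — both are illustrative assumptions:

```python
import random

def hallucination_signals(run: dict, verify_citation, sample_rate: float = 0.05) -> dict:
    """Flag unverifiable citations and randomly sample runs for the human-review layer."""
    claims = run.get("claims", [])                            # e.g. {"text": ..., "citation": ...}
    unverified = [c for c in claims if not verify_citation(c.get("citation"))]
    return {
        "unverified_claims": len(unverified),
        "citation_failure": bool(unverified),                 # feeds the hallucination-rate numerator
        "sampled_for_human_review": random.random() < sample_rate,   # the 5-10% sampling layer
    }
```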
Refusal rate is simpler to measure but harder to interpret. The raw signal is the percentage of runs where the agent output an explicit refusal — "I can't help with that," "That's outside my scope," etc. The interpretation requires judgment: refusals on workflows that genuinely fall outside the agent's authority are correct behavior; refusals on workflows the agent has previously handled successfully are quality regressions. The breakdown by workflow is what makes the number actionable; the headline number alone is noise.
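And a matching sketch of the per-workflow breakdown, assuming refusal detection (for example, matching explicit refusal phrasing) has already tagged each run record; the field names are illustrative:

```python
from collections import defaultdict

def refusal_rate_by_workflow(runs: list[dict]) -> dict[str, float]:
    """Refusal rate per workflow; the headline average alone is noise."""
    counts = defaultdict(lambda: [0, 0])             # workflow -> [refusals, total]
    for run in runs:
        counts[run["workflow"]][1] += 1
        if run.get("refused"):                       # set upstream by refusal-phrase detection
            counts[run["workflow"]][0] += 1
    return {wf: refused / total for wf, (refused, total) in counts.items()}
```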
Confident wrong
% of runs with unsupported claims
Caught by fact-check pass + citation verification + sampled human review. Target band below 2% on knowledge workflows, below 5% on open-ended generation. Trending up usually signals a model rotation or a corpus drift.
Over-confidence

Over-cautious decline
% of runs with explicit refusal
Easy to measure, hard to interpret. Always read alongside a per-workflow breakdown — refusals on out-of-scope workflows are correct, refusals on previously-handled workflows are regressions. Sudden refusal-rate spikes often follow safety-training updates from the model vendor.
Over-caution

Agreed-with-the-user wrong
% of disagreements followed by reversal
The third calibration failure — the agent gives a correct answer, the user pushes back, the agent reverses to the user's incorrect framing. Measured by stress-testing with adversarial follow-ups. Newer failure mode; track on conversational and support workflows.
Conversational risk

The fourth risk metric on the panel is brand-voice deviation — the rate at which agent output drifts from the voice guidelines that govern human-written content for the same surface. It applies primarily to content and support workflows and is measured by rubric-scored sampling. Most teams skip it in quarter one and add it once the program has earned enough budget for a content-quality pass; for marketing-heavy agent programs, it belongs on the panel from day one because brand-voice failures are the most visible quality problems customers actually notice.
"A hallucination problem and a refusal problem are the same problem viewed from opposite sides. Optimising one without measuring the other moves the failure rather than fixing it."— Common pattern in agent-program audits
06 — Calibration Error
The leading quality indicator that moves first.
Calibration error measures the gap between the agent's stated confidence and its observed accuracy. When the agent says it is 90% confident in an answer, is it actually right 90% of the time? Across the runs where it claims 70% confidence, what is the real pass rate? A well-calibrated agent matches stated to observed within a few points across confidence bins; a poorly-calibrated one shows systematic drift — claiming 90% but landing at 75%, or claiming 60% but landing at 85%.
Calibration error is the panel's leading indicator because it moves before pass rate does when something degrades. A model rotation, a prompt edit, or a corpus shift will typically shift the calibration curve days or weeks before pass rate notices — the agent is still getting answers right at roughly the previous rate, but its confidence is no longer matched to its actual performance. Teams that track calibration alongside pass rate get the warning signal earlier and can investigate before the regression becomes visible to users.
Calibration curve · stated vs observed accuracy by confidence bin
Illustrative · production agent program · stable calibration

The example above shows a well-calibrated agent — stated and observed accuracy track within a few points across every bin. The pattern to watch for is widening gaps in the high-confidence bins, which is where the cost of mis-calibration is highest. When the agent says 90% and delivers 75%, downstream automation that trusts the stated number is shipping confidently wrong output. That is the failure mode the quality panel exists to catch before it becomes a customer incident.
Instrumenting calibration error requires the agent to emit a confidence score on each output and the eval harness to bucket runs by that score before computing observed accuracy per bucket. Most modern frontier models support structured confidence output natively; for older models or fine-tuned ones that do not, a separate judge model can be prompted to rate the confidence of each output. The judge approach is noisier but workable for programs that need calibration data on models without native support.
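A minimal sketch of that bucketing, assuming each run record carries a stated confidence in [0, 1] and a pass/fail outcome from the eval harness; the field names are illustrative:

```python
def calibration_by_bin(runs: list[dict], bins: int = 10) -> dict[int, dict]:
    """Group runs by stated confidence and compare stated to observed accuracy per bin."""
    buckets: dict[int, list[dict]] = {i: [] for i in range(bins)}
    for run in runs:
        idx = min(int(run["confidence"] * bins), bins - 1)    # e.g. 0.87 -> bin 8 of 10
        buckets[idx].append(run)
    report = {}
    for idx, bucket in buckets.items():
        if not bucket:
            continue
        stated = sum(r["confidence"] for r in bucket) / len(bucket)
        observed = sum(1 for r in bucket if r["passed"]) / len(bucket)
        report[idx] = {"stated": stated, "observed": observed,
                       "gap": stated - observed,              # positive gap = over-confident
                       "n": len(bucket)}
    return report

def calibration_error(report: dict[int, dict]) -> float:
    """Headline number: |stated - observed| averaged across bins, weighted by bin size."""
    total = sum(b["n"] for b in report.values())
    return sum(abs(b["gap"]) * b["n"] for b in report.values()) / total if total else 0.0
```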
Quality-incident frequency is the lagging counterpart to calibration error. It counts the customer-visible quality incidents per month — output that reached a user, generated a complaint, required correction, or damaged trust in a measurable way. Calibration moves first when something degrades; incident frequency confirms it after a regression slipped through. Both sit on the panel because each tells the team something the other cannot.
07 — Dashboard Cadence
A weekly quality review beats a real-time dashboard nobody reads.
The dashboard cadence question is operational rather than a matter of instrumentation. Most teams over-invest in real-time visualisation and under-invest in the weekly conversation that turns the numbers into decisions. The pattern we run on production agent programs is a single weekly meeting — agent-quality standup — where the panel is reviewed top-to-bottom, the two or three metrics that moved are discussed, and an owner is named for any action item. The meeting takes thirty minutes; the dashboard takes five minutes to build properly.
The structure that works is six columns on a single page. Column one is the headline pass rate per workflow with denominator visible. Column two is the revision rate trend over four weeks. Column three is the eval coverage percentage with target band marked. Column four is the risk metrics — hallucination, refusal, sycophantic compliance — colour-coded by trend. Column five is the calibration curve with the previous week's curve faint behind it. Column six is the quality-incident log with the workflow, root cause, and remediation status for each incident in the past quarter.
The weekly standup follows the dashboard top-to-bottom. Pass-rate drops are investigated first because they are the most legible. Revision-rate increases are investigated next because they are the trust signal. Eval-coverage gaps are the standing infrastructure item — usually one workflow per week graduates into full coverage. Risk metrics are reviewed by exception — only the ones that moved get discussed. The calibration curve gets a thirty-second visual review for drift. The incident log closes the meeting; any incident open for more than a week needs an explicit unblock decision.
Live dashboard, no cadence
The most common anti-pattern. The panel exists and updates continuously, but no scheduled meeting reviews it. Numbers drift, nobody notices, the dashboard becomes a museum. High instrumentation cost, low operational value.
Avoid

Daily standup rhythm
Useful for incident response during a known regression or rollout. Too frequent as a default — most quality signals move on weekly timescales, so daily reviews mostly look at noise. Use during incidents, not as the default cadence.
Incidents only

Weekly agent-quality standup
The production default. Thirty-minute meeting, single-page dashboard, top-to-bottom review, named owner for any action. Frequent enough to catch regressions in the rollback window, infrequent enough to focus the conversation on signal.
Production default

Executive review
The cadence for leadership reporting. Aggregates the weekly standup output into trend lines and resource-allocation recommendations. Necessary for programs above a certain size; insufficient as the only quality review the team runs.
Leadership reporting

For teams stepping up to a formal agent-quality practice for the first time, the highest-ROI sequence is this: instrument pass rate on the three highest-stakes workflows; add revision-rate tracking on the same workflows; wire a nightly eval cron; schedule the weekly standup; expand coverage from there. The program crosses from anecdotal to measurable somewhere in the third or fourth week, and the panel becomes self-reinforcing — each week's conversation generates the next week's instrumentation priority.
Programs that mature past Stage 3 of the eval-coverage ladder usually find themselves needing additional quality engineering investment — calibration tracking, brand-voice rubrics, regression dashboards visible to non-engineering stakeholders. That is the inflection point where most teams engage outside help. Our AI transformation engagements often start with exactly this panel — instrument first, review weekly, expand from data.
Quality metrics turn agent shipping from gut-feel to engineering.
Agent programs that ship at velocity without a quality panel produce a predictable trajectory — six months of celebrated output, followed by a quieter quarter in which humans downstream of the agent redo more work than the agent saved. The 10-KPI panel exists to make that gap visible before it becomes a trust deficit the program cannot recover from. Pass rate anchors the conversation; revision rate keeps it honest; eval coverage makes both numbers trustworthy; the risk metrics catch the failure modes the headline numbers miss; calibration and incident frequency close the loop.
The mechanics of the panel are cheaper than teams expect. One eval suite, one nightly cron, one weekly meeting, one single-page dashboard. The organisational work is harder — getting leadership to accept that a slightly slower velocity number paired with a meaningful quality number is a healthier program than a fast velocity number with no quality counterpart. That conversation is what the panel exists to enable; the rest is instrumentation.
What to do next: pick the three highest-stakes workflows your agent program touches. Instrument pass rate and revision rate on each. Wire a nightly eval run. Schedule the weekly standup. By week four, the panel will be telling you something you did not know in week one — and the program will already be making different prioritisation decisions because of it. The first instrumented workflow is the inflection point; everything after it compounds.