AI incident metrics are the eight-KPI framework that lets a production agent team measure how fast they detect, contain, and recover from incidents — and how the operational health of the program is trending across quarters. Mean time to detect (MTTD) and mean time to recover (MTTR) anchor the panel; severity distribution, repeat-incident rate, runbook coverage, and on-call load round it out. Without these metrics, agent ops is reactive by default — the playbook fires when something goes wrong and nobody knows whether the program is improving.
The metrics matter because agent failures compound differently from classical web incidents. A misbehaving agent can burn a week's token budget in an afternoon, cascade across thousands of runs before a single dashboard turns red, and stay invisible to latency or 5xx monitoring until the damage is done. The teams that turn agent ops from firefighting into a predictable program are the ones tracking the right indicators — and reviewing them with the same discipline classical SRE applies to availability.
This guide walks through each KPI with its formula, a recommended target band, the trend line to watch, and how it interacts with the others. It pairs with our companion incident response playbook — the playbook is the operational discipline; this framework is the measurement layer that proves it's working.
- 01 · MTTD predicts MTTR — detection is the highest-leverage KPI. On agent workflows, time-to-detect dominates blast radius. A team that catches an incident at minute four contains it in fifteen; a team that catches it at hour four spends the next hour just figuring out what changed. Invest in detection panels before runbooks.
- 02 · Severity matrix maps to page priority — measure the distribution, not just the count. Counting incidents misses the signal. A program that runs ten P3s a quarter is healthy; one that runs two P0s a quarter is in trouble. Track the distribution across P0 / P1 / P2 / P3 and watch the slope — drift toward higher severities is the early warning for a program in decay.
- 03 · Repeat-incident rate signals root-cause discipline. If the same failure class fires twice in a quarter, the postmortem stopped at the symptom. Repeat-incident rate is the cleanest single signal for whether the team is finding system causes or settling for agent-blame. Target under 10% on a trailing 90-day window.
- 04 · Runbook coverage predicts time-to-action. A team with a runbook for a failure class responds in minutes; a team without one writes the runbook under pressure during the incident. Runbook coverage — percent of incident classes with a tested runbook — is the leading indicator for next quarter's MTTR.
- 05 · On-call load is the leading burnout indicator. Pages per on-call week, interrupted-sleep nights, and weekend incident hours are what predict whether the rotation survives the next two quarters. Track them publicly. A program that lets on-call load drift up without intervention is one quarter away from losing its senior responders.
01 — Why Incident Metrics
From reactive firefighting to predictable ops.
The default state of an agent program without metrics is reactive firefighting. Incidents happen, the team responds, the dashboards go green, and the program rolls on. Nobody knows whether response is getting faster or slower, whether the same failure class is repeating, whether on-call is burning out, or whether the runbooks written last quarter still match the current stack. The absence of measurement is not the absence of trouble — it's the absence of visibility into the trouble that's already present.
Classical SRE solved this for availability with a small set of durable KPIs — error budget, MTBF, MTTR, change-fail rate. Agent ops needs the same operational discipline, with the metric set adjusted for the failure surface. Latency and uptime stay relevant for the transport layer, but the metrics that actually measure agent health are detection time, recovery time, severity distribution, repeat-incident rate, runbook coverage, and on-call load. None of those are exotic; most have classical analogues. What's new is treating them as the operating dashboard for an agent program rather than an afterthought.
The framework below is the panel we install with clients before their agent workflows take real traffic. Eight KPIs, each with a formula, a target band, and a trend line. Reviewed monthly with the on-call rotation, refreshed quarterly with leadership. It replaces "how did last month go?" gut-feel with measurement that actually predicts whether the program is improving.
02 — MTTD
Mean time to detect — the leading indicator.
Mean time to detect is the elapsed wall-clock time between the moment an incident begins and the moment the team becomes aware of it. The start clock is the first datapoint that would have tripped a well-tuned detection panel — not the moment someone opens an incident ticket. The end clock is the page itself, whether automated or filed by a human noticing the symptom.
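As a concrete reading of that definition, here is a minimal sketch of the MTTD calculation, assuming each incident record carries a started_at (the first datapoint a well-tuned panel would have caught) and a detected_at (the page); the field names are illustrative, not a prescribed schema.

```python
# Minimal MTTD sketch; started_at / detected_at are illustrative field names.
from datetime import datetime
from statistics import mean

incidents = [
    # started_at: first datapoint that would have tripped a well-tuned panel.
    # detected_at: the page itself, automated or human-filed.
    {"started_at": datetime(2026, 1, 4, 9, 12), "detected_at": datetime(2026, 1, 4, 9, 19)},
    {"started_at": datetime(2026, 1, 11, 14, 2), "detected_at": datetime(2026, 1, 11, 15, 40)},
]

mttd_minutes = mean(
    (i["detected_at"] - i["started_at"]).total_seconds() / 60 for i in incidents
)
print(f"MTTD: {mttd_minutes:.1f} min")
```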
MTTD matters because on agent workflows, every minute of detection delay is roughly an order of magnitude of compounding cost on a misbehaving workflow. A retry storm caught at minute four costs hundreds of dollars; the same storm caught at hour four costs tens of thousands and may have produced poisoned outputs the team has to recall. MTTD is the single KPI most strongly predictive of MTTR — invest in it first.
We track three MTTD modes — automated detection, peer detection, and customer-reported detection — because their distribution tells the story of dashboard maturity. A healthy program catches most incidents automatically; a struggling one finds out from customers.
Automated detection
Detection panel fires before human notices
A trained alert (cost anomaly, trace volume drop, eval regression, tool error rate) pages on-call before any human reports the symptom. Target distribution: 70%+ of incidents detected this way. The single best signal that the detection layer is doing its job.
Target: ≥ 70% of incidents
Peer detection
Internal team member notices first
An engineer reviewing traces, a PM watching a dashboard, or an on-call sweep finds the symptom before any alert. Healthy at 15-25% of incidents — the team is engaged. Drifting above 40% suggests the alert panel is under-tuned and missing real failures.
Target: 15-25%
Customer-reported detection
External report is the first signal
A support ticket, sales email, or social-media mention is the first surface to flag the incident. Healthy at under 10%; rising above 15% is a red flag — the detection panel has structural gaps and the team is operating partially blind.
Target: < 10%
The MTTD trend line worth watching is the seven-day rolling median by detection mode. A program with healthy detection maturity shows automated detection trending steady or down (faster catches), peer detection trending flat, and customer-reported detection trending steadily down toward zero. The opposite shape — customer reports climbing while automated detection rate stays flat — is the classical signature of dashboard atrophy: new workflows ship without new detection signals, and the panel gradually stops covering the production surface.
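A sketch of that trend line, assuming an incident export with one row per incident and columns detected_at, mode (automated / peer / customer), and mttd_minutes; the column names and the use of pandas are illustrative choices, not requirements.

```python
import pandas as pd

# One row per incident: detected_at, mode (automated / peer / customer), mttd_minutes.
df = pd.read_csv("incidents.csv", parse_dates=["detected_at"])
df = df.sort_values("detected_at").set_index("detected_at")

# Seven-day rolling median of detection delay, computed separately per detection mode.
rolling_mttd = df.groupby("mode")["mttd_minutes"].rolling("7D").median()
print(rolling_mttd.tail())

# Share of incidents caught by each mode over the trailing quarter:
# the distribution that tells the dashboard-maturity story (target: >= 70% automated).
cutoff = df.index.max() - pd.Timedelta(days=90)
print(df.loc[df.index >= cutoff, "mode"].value_counts(normalize=True))
```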
For a deeper treatment of the underlying instrumentation that powers these metrics, see our companion piece on agent observability — trace coverage at tool-call granularity and per-workflow cost attribution are the prerequisites without which MTTD has nothing to draw from.
"MTTD is the metric that predicts MTTR. Every minute of detection delay is roughly an order of magnitude of compounding cost on a misbehaving agent workflow."— Production agent post-mortem retrospective, Q1 2026
03 — MTTR
Mean time to recover — broken down by phase.
Mean time to recover is the elapsed wall-clock time from detection to full restoration of normal traffic. It is the most visible KPI in the panel and the one leadership asks about first. The mistake most teams make is reporting MTTR as a single number — "our MTTR is two and a half hours." A useful MTTR report breaks the clock down into the four phases of the incident response loop, because each phase is owned by a different discipline and each phase has a different lever to move it.
The bars below show the target distribution for a P0 incident on a healthy program. Detection plus containment together should close in under twenty minutes — the team is buying time fast. Eradication takes the bulk of the clock because diagnosis and surgical reversal are where the actual engineering work happens. Recovery is measured in tranches; the verification runbook between each one is non-negotiable.
MTTR breakdown · P0 incident phase-by-phase target
Source: Digital Applied incident-response panel · P0 target distribution
The phase breakdown turns "our MTTR is too slow" from a blame conversation into an engineering decision. If detection is slow, invest in the panel. If containment is slow, the kill-switch isn't wired or on-call doesn't trust it enough to flip first. If eradication is slow, the runbook doesn't exist or the rollback primitives aren't first-class operations. If recovery is slow, the verification runbook is missing or the tranche schedule is undefined. Each phase has a fix; the panel surfaces which one is dragging.
We track MTTR per severity tier rather than a global average, because P0 and P3 have different operating contracts. P0 targets under 2 hours; P1 under 4 hours; P2 under 1 business day; P3 under 1 week. Averaging across severities hides whether the critical incidents are tightening or loosening.
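To make the phase view concrete, here is a minimal sketch that rolls hypothetical per-incident phase timings up into a per-severity MTTR and checks it against the tier targets; the record shape and the business-hours conversion used for the P2 / P3 targets are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-incident records: severity plus minutes spent in each response phase.
incidents = [
    {"severity": "P0", "detect": 6,  "contain": 12, "eradicate": 55, "recover": 30},
    {"severity": "P0", "detect": 9,  "contain": 10, "eradicate": 70, "recover": 25},
    {"severity": "P1", "detect": 20, "contain": 30, "eradicate": 95, "recover": 40},
]

PHASES = ("detect", "contain", "eradicate", "recover")
# Tier targets from the operating contract: P0 < 2h, P1 < 4h, P2 < 1 business day, P3 < 1 week
# (P2 / P3 converted to working minutes here, which is an assumption).
TARGET_MIN = {"P0": 120, "P1": 240, "P2": 8 * 60, "P3": 5 * 8 * 60}

by_severity = defaultdict(list)
for inc in incidents:
    by_severity[inc["severity"]].append(inc)

for sev, group in sorted(by_severity.items()):
    phase_means = {p: mean(i[p] for i in group) for p in PHASES}
    mttr = sum(phase_means.values())
    flag = "within target" if mttr <= TARGET_MIN[sev] else "OVER TARGET"
    print(f"{sev}: {phase_means} -> MTTR {mttr:.0f} min ({flag})")
```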
04 — Severity Distribution
Count is noise — distribution is signal.
Counting incidents is the metric leadership intuitively asks for and the metric that misleads worst. A team running ten P3 incidents a quarter is healthy; a team running two P0s a quarter is in trouble. The signal is the distribution across the four severity tiers, and the slope of that distribution over time.
The matrix below shows the operating contract for each tier — the page priority, the response shape, the MTTR target — and the distribution band that indicates a program in a healthy state. Drift toward higher severities is the leading indicator of a program in decay; drift toward lower severities is what a maturing program looks like as the detection panel catches issues earlier in their lifecycle.
P0 · Blast radius growing
Pages on-call immediately, all-hands available, executive ping at 30 min. MTTR target under 2 hours. Healthy programs run zero to one P0 per quarter; two or more is the signal to invest before the team burns out.
Target: ≤ 1 / quarter
P1 · Internal failure or single-workflow degradation
Pages on-call within business hours, single-owner response, status update every 2 hours. MTTR target under 4 hours. Healthy distribution: 2-5 P1s per quarter. Rising P1s while P0s stay flat is the cleanest signal that detection improved.
Target: 2-5 / quarter
P2 · Fallback engaged
Queued for next business day, single-owner, async update on resolution. MTTR target under 1 day. Healthy distribution: 5-10 P2s per quarter. P2s caught early are the highest-leverage incidents; they're cheap to resolve and teach the panel.
Target: 5-10 / quarter
P3 · No customer impact
Logged for weekly review, root-caused but not page-worthy. Healthy distribution: 10-20 P3s per quarter. A program logging zero P3s isn't healthy — it's blind. P3s are how a maturing program learns without paying for the lesson.
Target: 10-20 / quarter
Two disciplines make severity distribution useful. The first is that severity is set by the incident commander on the response call, not by the alert that pages — alerts are wrong about severity in both directions and the human on the call has authority to upgrade or downgrade. The second is that severity downgrades are explicit and logged, not silent. A P0 that becomes a P1 mid-incident gets the change announced in the response channel with the reason captured for the postmortem record. Without those two disciplines, severity becomes meaningless and every incident defaults to P0.
The trend line worth watching is severity distribution by calendar quarter. A maturing program shows P0 count flat or falling, P1 count flat, and P2 / P3 counts rising — the panel is catching issues earlier and at lower severity. A program in decay shows the opposite shape: P0 count rising, P3 count flat or falling, distribution shifting upward.
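A sketch of that quarterly view, assuming an incident export with opened_at and severity columns (names illustrative):

```python
import pandas as pd

# One row per incident: opened_at, severity in {"P0", "P1", "P2", "P3"}.
df = pd.read_csv("incidents.csv", parse_dates=["opened_at"])
df["quarter"] = df["opened_at"].dt.to_period("Q")

# Count per tier per quarter: the distribution, not just the total.
dist = df.groupby(["quarter", "severity"]).size().unstack(fill_value=0)
print(dist)

# Crude drift check: a rising share of P0 + P1 quarter over quarter is the decay signature;
# a rising share of P2 + P3 is what a maturing detection panel looks like.
high_sev_share = dist.reindex(columns=["P0", "P1"], fill_value=0).sum(axis=1) / dist.sum(axis=1)
print(high_sev_share)
```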
05 — Repeat Rate
Repeat-incident rate — postmortem discipline measured.
Repeat-incident rate is the percentage of incidents in a trailing window that share a root cause with an earlier incident in the same window. Formula: count of incidents where the postmortem failure class plus system-level root cause match a prior incident, divided by total incidents, evaluated on a rolling 90-day window. Target under 10%; sustained above 20% is the signal that postmortem discipline has broken down.
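Read as code, the formula looks roughly like this; the record fields (closed_at, failure_class, root_cause) are illustrative stand-ins for whatever the postmortem tracker exports.

```python
from datetime import datetime, timedelta

# Hypothetical postmortem records.
postmortems = [
    {"closed_at": datetime(2026, 1, 10), "failure_class": "retry storm", "root_cause": "no backoff cap on tool errors"},
    {"closed_at": datetime(2026, 2, 3),  "failure_class": "context corruption", "root_cause": "stale RAG index snapshot"},
    {"closed_at": datetime(2026, 3, 1),  "failure_class": "retry storm", "root_cause": "no backoff cap on tool errors"},
]

WINDOW = timedelta(days=90)
now = max(p["closed_at"] for p in postmortems)
recent = sorted(
    (p for p in postmortems if now - p["closed_at"] <= WINDOW),
    key=lambda p: p["closed_at"],
)

seen, repeats = set(), 0
for p in recent:
    key = (p["failure_class"], p["root_cause"])
    if key in seen:  # same failure class and system-level root cause as an earlier in-window incident
        repeats += 1
    seen.add(key)

rate = repeats / len(recent) if recent else 0.0
print(f"Repeat-incident rate (trailing 90d): {rate:.0%}")  # target: under 10%
```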
The metric matters because it is the cleanest single signal for whether the team is finding system causes or settling for agent-blame. A postmortem that ends at "the agent hallucinated" produces no action item that prevents the recurrence. The same failure class fires again in the next quarter, the team responds to it again, and repeat-incident rate climbs. Healthy postmortems generate concrete system-level fixes — new guardrails, new evals, new detection signals — and the same class doesn't come back.
Three operational levers move repeat-incident rate over a quarter. None of them is exotic; together they account for the bulk of the programs that hold the rate under 10% on a sustained basis.
Postmortem template sections
Every postmortem completes five sections: timeline, failure class (no 'hallucination' allowed), system-level root cause, action items (concrete + owned + dated), detection improvement. The template forces the system view that prevents agent-blame.
Forcing function
Action item review cadence
At the start of every postmortem, the team reviews action items from the prior 60 days. Incomplete actions are the second-most-common cause of repeat incidents; the review pulls them back into focus before the next quarter starts.
Every postmortem
Detection improvement per incident
Every postmortem includes one new detection signal — a panel update, a tighter threshold, an additional alert. The detection layer compounds over postmortems; the team that adds a signal per incident is the team whose MTTD halves over a year.
Per-incident
The trend line worth watching is repeat-incident rate on a trailing 90-day window, plotted month-over-month. A healthy program shows rate falling toward zero as the action-item backlog clears and detection improvements compound. A program in decay shows rate rising as postmortems get shorter, action items go undone, and the same failure classes recycle. Repeat-incident rate is the metric most strongly predictive of whether next quarter's incident load gets lighter or heavier — invest in it.
06 — Runbook Coverage
Runbook coverage — next quarter's MTTR.
Runbook coverage is the percentage of incident failure classes with a tested, current runbook. Formula: count of failure classes with a runbook last rehearsed in the prior 90 days, divided by total failure classes observed in the prior 12 months. Target 80%+; sustained below 50% is the signal that runbooks have become stale and the team is writing them under pressure during incidents.
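A minimal sketch of that calculation, with hypothetical failure-class names and rehearsal dates standing in for the real log:

```python
from datetime import date, timedelta

today = date(2026, 3, 31)

# Failure classes observed in the prior 12 months (hypothetical names).
observed_classes = {"retry storm", "prompt regression", "model version drift", "tool outage", "context corruption"}

# Last rehearsal date per runbook; a missing entry means no runbook exists for that class.
last_rehearsed = {
    "retry storm": date(2026, 2, 14),
    "prompt regression": date(2025, 11, 2),  # stale: rehearsed more than 90 days ago
    "tool outage": date(2026, 3, 5),
}

fresh_cutoff = today - timedelta(days=90)
covered = sum(
    1 for cls in observed_classes
    if cls in last_rehearsed and last_rehearsed[cls] >= fresh_cutoff
)
coverage = covered / len(observed_classes)
print(f"Runbook coverage: {coverage:.0%}")  # target: 80%+
```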
The metric matters because runbook coverage is the leading indicator for next quarter's MTTR. A team responding to a failure class with a current runbook moves through containment and eradication in minutes; a team responding without one spends the first half-hour figuring out what the rollback even looks like. Runbook coverage today is MTTR three months from today — the lag between the two is reliable.
Five canonical runbook classes cover roughly 90% of agent incident shapes. Every production agent program should have all five wired, rehearsed quarterly, and rewritten annually; a minimal config sketch of the first four follows the list.
Kill-switch activation
Boolean flag · workflow-level · < 60s effect
Single boolean per workflow stored in a configuration system that takes effect without a code deploy. On-call has authority to flip without product-owner approval. Test the switch quarterly with a deliberate drill — a flag that's been in place six months and never flipped is one you don't actually trust.
Drill: quarterly
Prompt rollback
Versioned prompt · git revert pattern · 5-15 min
Prompts treated as code with version history in the deploy system. Rollback is a first-class operation: select prior canary-passed version, deploy, verify against eval suite, unpause traffic in tranches. The cheapest discipline that pays back during eradication.
Drill: quarterly
Model version pin
Floating → fixed pointer · 15-30 min
Most provider SDKs support explicit version pinning. The runbook flips the deploy config from a floating pointer to a fixed one, verifies against the eval suite, and unpauses. The version that failed gets logged so the upgrade can be re-attempted safely later.
Drill: quarterly
Tool quarantine
Disable from toolset · graceful degradation · 10-30 min
Failing tool is disabled from the agent's available toolset; the workflow relies on its graceful degradation path. Distinct from a server-side fix because the agent stops trying to call the broken tool entirely while the team investigates the underlying failure.
Drill: quarterly
The fifth canonical runbook is context restore — identifying the corrupted context source (RAG index, customer-data feed, system prompt template) and resetting it from a known-good snapshot. Slower than the other four classes because the team has to confirm the corruption hasn't spread before signing off on recovery; typical resolution window is 30-90 minutes.
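To show how the first four primitives can hang off a single per-workflow configuration read at run time, here is a minimal sketch; every name in it (WorkflowConfig, kill_switch, model_pin, quarantined_tools, run_workflow) is hypothetical, and a real setup would read these flags from whatever configuration service the stack already uses.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowConfig:
    kill_switch: bool = False           # flipping this pauses the workflow without a code deploy
    prompt_version: str = "v42"         # last canary-passed prompt; rollback target
    model_pin: str | None = None        # None = floating pointer; set to pin a known-good version
    quarantined_tools: set[str] = field(default_factory=set)

def run_workflow(config: WorkflowConfig, toolset: dict):
    if config.kill_switch:
        return {"status": "paused", "reason": "kill switch engaged"}
    # Tool quarantine: drop broken tools and rely on the graceful-degradation path.
    active_tools = {name: fn for name, fn in toolset.items() if name not in config.quarantined_tools}
    model = config.model_pin or "provider-default-latest"
    # ... invoke the agent with active_tools, model, and config.prompt_version ...
    return {"status": "ran", "model": model, "tools": sorted(active_tools)}

print(run_workflow(WorkflowConfig(kill_switch=True), {"search": None}))
```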
Runbook freshness matters as much as coverage. A runbook written twelve months ago against a stack that has since changed model providers, swapped MCP servers, and migrated deploy systems is a runbook that doesn't work. The maintenance cadence is three-tiered: after every incident, the runbook is updated with anything that surprised the on-call engineer; quarterly, the team runs a tabletop drill against one rotation-chosen runbook; annually, all five canonical runbooks get a full rewrite pass against the current stack.
07 — On-Call Load
On-call load — burnout measured early.
On-call load is the bundle of metrics that measure the human cost of running the rotation: pages per on-call week, interrupted-sleep nights per week, and weekend incident hours. Track all three. The aggregate is what predicts whether the rotation survives the next two quarters; the individual lines tell the team which lever to pull when load drifts up.
The metric matters because on-call load is the leading indicator for burnout, and burnout is the leading indicator for losing senior responders. A program that lets on-call load climb without intervention is one quarter away from a rotation collapse — the most experienced engineers leave first, the remaining team responds slower, MTTR rises, and the cycle accelerates. The cheap intervention is measurement plus a review cadence; the expensive intervention is hiring after the senior engineer has already given notice.
Pages per on-call week
Total pages — automated and human-filed — during a single seven-day rotation. Healthy target under five pages per week; sustained above ten is the signal that alert thresholds are too sensitive, runbook coverage is too low, or both. Drives most of the felt load.
Target: < 5 / week
Interrupted-sleep nights / week
Nights in the on-call week where the responder was paged between 11pm and 6am local time. Target zero to one per week; sustained above two erodes responder health quickly and is the single best predictor that the rotation is heading toward burnout.
Target: ≤ 1 / week
Weekend incident hours
Hours spent on active incident response during Saturday and Sunday of the on-call week. Target under four hours; weekend hours are the most damaging form of on-call time and the easiest to under-count in informal reporting. Make them visible.
Target: < 4h / week
The trend line worth watching is on-call load by responder over a trailing 12-week window. A healthy program shows load distributed roughly evenly across the rotation; a struggling one shows two or three responders absorbing the bulk of pages while others coast — which is the precursor to the senior engineers quitting. Publish the per-responder breakdown internally so leadership can spot the imbalance before it compounds.
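A sketch of the per-responder rollup, assuming a page log that records who was paged, when (local time), and hours of active response; the field names and the night / weekend rules are simplifications.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical page log entries.
pages = [
    {"responder": "ana", "at": datetime(2026, 3, 7, 2, 15),   "active_hours": 1.5},  # Saturday, 02:15
    {"responder": "ana", "at": datetime(2026, 3, 9, 14, 0),   "active_hours": 0.5},
    {"responder": "ben", "at": datetime(2026, 3, 10, 23, 30), "active_hours": 2.0},
]

load = defaultdict(lambda: {"pages": 0, "night_dates": set(), "weekend_hours": 0.0})
for p in pages:
    stats = load[p["responder"]]
    stats["pages"] += 1
    # 11pm-6am local counts as an interrupted-sleep night (deduped per calendar date;
    # a fuller version would attribute early-morning pages to the previous night).
    if p["at"].hour >= 23 or p["at"].hour < 6:
        stats["night_dates"].add(p["at"].date())
    if p["at"].weekday() >= 5:  # Saturday / Sunday
        stats["weekend_hours"] += p["active_hours"]

for responder, s in sorted(load.items()):
    # Compare against the targets above: < 5 pages, <= 1 interrupted night, < 4 weekend hours per week.
    print(responder, s["pages"], "pages,", len(s["night_dates"]), "night(s),", s["weekend_hours"], "weekend h")
```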
Three interventions move on-call load when it drifts up. First, tune alert thresholds against the false-positive log — a panel that pages on noise burns the rotation without delivering signal. Second, fill runbook gaps for the failure classes that paged on-call without one — every missing runbook is roughly an hour of unstructured triage per page. Third, schedule explicit recovery time for responders coming off heavy weeks; a day off after a 15-page rotation is cheaper than the senior engineer who quits after three of them.
If you're standing up the metrics panel from scratch and need help calibrating against your current stack, our AI transformation engagements include the eight-KPI dashboard as a standard line item — detection panels designed for your workflows, severity matrix calibrated to your business risk, on-call load tracking wired to your paging system.
"On-call load is the leading burnout indicator. A program that lets it climb without intervention is one quarter away from losing its senior responders — and that loss compounds for years."— Agentic engineering leadership conversation, Q2 2026
Incident metrics turn agent ops from reactive to predictive.
The eight-KPI framework above is not an exotic measurement stack. Each metric has a classical SRE analogue, a clear formula, a target band, and a trend line. What makes it useful for agent ops is that it answers the question gut-feel can't: is the program improving, holding, or in decay — and where do we invest next quarter to move the needle? Without those answers, agent ops is reactive firefighting with no path to a predictable state.
The metrics that move the program hardest are not the ones leadership intuitively asks about. MTTR shows up first because it's the most visible, but MTTD is the leading indicator that predicts where MTTR lands. Severity count is easy to read but distribution is what tells the story. Runbook coverage today is MTTR three months from today. Repeat-incident rate is the cleanest single signal for postmortem discipline. On-call load is what predicts whether the rotation survives long enough for any of the other metrics to keep mattering.
Practical next step: pick the highest-traffic agent workflow your team runs and walk the eight KPIs against it this month. Where does MTTD land today? Is the severity matrix calibrated and used? What is runbook coverage on the failure classes you've actually seen? Most teams find at least three gaps on the first pass; closing them before next quarter starts is the cheapest investment in agent ops the team will make all year.