AI incident metrics are the eight-KPI framework that lets a production agent team measure how fast they detect, contain, and recover from incidents — and how the operational health of the program is trending across quarters. Mean time to detect (MTTD) and mean time to recover (MTTR) anchor the panel; severity distribution, repeat-incident rate, runbook coverage, and on-call load round it out. Without these metrics, agent ops is reactive by default — the playbook fires when something goes wrong and nobody knows whether the program is improving.
The metrics matter because agent failures compound differently from classical web incidents. A misbehaving agent can burn a week's token budget in an afternoon, cascade across thousands of runs before a single dashboard turns red, and stay invisible to latency or 5xx monitoring until the damage is done. The teams that turn agent ops from firefighting into a predictable program are the ones tracking the right indicators — and reviewing them with the same discipline classical SRE applies to availability.
This guide walks through each KPI with its formula, a recommended target band, the trend line to watch, and how it interacts with the others. It pairs with our companion incident response playbook — the playbook is the operational discipline; this framework is the measurement layer that proves it's working.
- 01 · MTTD predicts MTTR — detection is the highest-leverage KPI. On agent workflows, time-to-detect dominates blast radius. A team that catches an incident at minute four contains it in fifteen; a team that catches it at hour four spends the next hour just figuring out what changed. Invest in detection panels before runbooks.
- 02 · Severity matrix maps to page priority — measure the distribution, not just the count. Counting incidents misses the signal. A program that runs ten P3s a quarter is healthy; one that runs two P0s a quarter is in trouble. Track the distribution across P0 / P1 / P2 / P3 and watch the slope — drift toward higher severities is the early warning for a program in decay.
- 03 · Repeat-incident rate signals root-cause discipline. If the same failure class fires twice in a quarter, the postmortem stopped at the symptom. Repeat-incident rate is the cleanest single signal for whether the team is finding system causes or settling for agent-blame. Target under 10% on a trailing 90-day window.
- 04 · Runbook coverage predicts time-to-action. A team with a runbook for a failure class responds in minutes; a team without one writes the runbook under pressure during the incident. Runbook coverage — percent of incident classes with a tested runbook — is the leading indicator for next quarter's MTTR.
- 05 · On-call load is the leading burnout indicator. Pages per on-call week, interrupted-sleep nights, and weekend incident hours are what predict whether the rotation survives the next two quarters. Track them publicly. A program that lets on-call load drift up without intervention is one quarter away from losing its senior responders.
01 — Why Incident Metrics
From reactive firefighting to predictable ops.
The default state of an agent program without metrics is reactive firefighting. Incidents happen, the team responds, the dashboards go green, and the program rolls on. Nobody knows whether response is getting faster or slower, whether the same failure class is repeating, whether on-call is burning out, or whether the runbooks written last quarter still match the current stack. The absence of measurement is not the absence of trouble — it's the absence of visibility into the trouble that's already present.
Classical SRE solved this for availability with a small set of durable KPIs — error budget, MTBF, MTTR, change-fail rate. Agent ops needs the same operational discipline, with the metric set adjusted for the failure surface. Latency and uptime stay relevant for the transport layer, but the metrics that actually measure agent health are detection time, recovery time, severity distribution, repeat-incident rate, runbook coverage, and on-call load. None of those are exotic; most have classical analogues. What's new is treating them as the operating dashboard for an agent program rather than an afterthought.
The framework below is the panel we install with clients before their agent workflows take real traffic. Eight KPIs, each with a formula, a target band, and a trend line. Reviewed monthly with the on-call rotation, refreshed quarterly with leadership. It replaces "how did last month go?" gut-feel with measurement that actually predicts whether the program is improving.
02 — MTTD
Mean time to detect — the leading indicator.
Mean time to detect is the elapsed wall-clock time between the moment an incident begins and the moment the team becomes aware of it. The start clock is the first datapoint that would have tripped a well-tuned detection panel — not the moment someone opens an incident ticket. The end clock is the page itself, whether automated or filed by a human noticing the symptom.
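As a concrete reading of that definition, here is a minimal sketch of the MTTD calculation, assuming each incident record carries a started_at (the first datapoint a well-tuned panel would have caught) and a detected_at (the page); the field names are illustrative, not a prescribed schema.

```python
# Minimal MTTD sketch; started_at / detected_at are illustrative field names.
from datetime import datetime
from statistics import mean

incidents = [
    # started_at: first datapoint that would have tripped a well-tuned panel.
    # detected_at: the page itself, automated or human-filed.
    {"started_at": datetime(2026, 1, 4, 9, 12), "detected_at": datetime(2026, 1, 4, 9, 19)},
    {"started_at": datetime(2026, 1, 11, 14, 2), "detected_at": datetime(2026, 1, 11, 15, 40)},
]

mttd_minutes = mean(
    (i["detected_at"] - i["started_at"]).total_seconds() / 60 for i in incidents
)
print(f"MTTD: {mttd_minutes:.1f} min")
```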
MTTD matters because on agent workflows, every minute of detection delay is roughly an order of magnitude of compounding cost on a misbehaving workflow. A retry storm caught at minute four costs hundreds of dollars; the same storm caught at hour four costs tens of thousands and may have produced poisoned outputs the team has to recall. MTTD is the single KPI most strongly predictive of MTTR — invest in it first.
We track three MTTD modes — automated detection, peer detection, and customer-reported detection — because their distribution tells the story of dashboard maturity. A healthy program catches most incidents automatically; a struggling one finds out from customers.
Automated detection
Detection panel fires before human notices
A trained alert (cost anomaly, trace volume drop, eval regression, tool error rate) pages on-call before any human reports the symptom. Target distribution: 70%+ of incidents detected this way. The single best signal that the detection layer is doing its job.
Target: ≥ 70% of incidents
Peer detection
Internal team member notices first
An engineer reviewing traces, a PM watching a dashboard, or an on-call sweep finds the symptom before any alert. Healthy at 15-25% of incidents — the team is engaged. Drifting above 40% suggests the alert panel is under-tuned and missing real failures.
Target: 15-25%
Customer-reported detection
External report is the first signal
A support ticket, sales email, or social-media mention is the first surface to flag the incident. Healthy at under 10%; rising above 15% is a red flag — the detection panel has structural gaps and the team is operating partially blind.
Target: < 10%
The MTTD trend line worth watching is the seven-day rolling median by detection mode. A program with healthy detection maturity shows automated detection trending steady or down (faster catches), peer detection trending flat, and customer-reported detection trending steadily down toward zero. The opposite shape — customer reports climbing while automated detection rate stays flat — is the classical signature of dashboard atrophy: new workflows ship without new detection signals, and the panel gradually stops covering the production surface.
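A sketch of that trend line, assuming an incident export with one row per incident and columns detected_at, mode (automated / peer / customer), and mttd_minutes; the column names and the use of pandas are illustrative choices, not requirements.

```python
import pandas as pd

# One row per incident: detected_at, mode (automated / peer / customer), mttd_minutes.
df = pd.read_csv("incidents.csv", parse_dates=["detected_at"])
df = df.sort_values("detected_at").set_index("detected_at")

# Seven-day rolling median of detection delay, computed separately per detection mode.
rolling_mttd = df.groupby("mode")["mttd_minutes"].rolling("7D").median()
print(rolling_mttd.tail())

# Share of incidents caught by each mode over the trailing quarter:
# the distribution that tells the dashboard-maturity story (target: >= 70% automated).
cutoff = df.index.max() - pd.Timedelta(days=90)
print(df.loc[df.index >= cutoff, "mode"].value_counts(normalize=True))
```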
For a deeper treatment of the underlying instrumentation that powers these metrics, see our companion piece on agent observability — trace coverage at tool-call granularity and per-workflow cost attribution are the prerequisites without which MTTD has nothing to draw from.
"MTTD is the metric that predicts MTTR. Every minute of detection delay is roughly an order of magnitude of compounding cost on a misbehaving agent workflow."— Production agent post-mortem retrospective, Q1 2026
03 — MTTR
Mean time to recover — broken down by phase.
Mean time to recover is the elapsed wall-clock time from detection to full restoration of normal traffic. It is the most visible KPI in the panel and the one leadership asks about first. The mistake most teams make is reporting MTTR as a single number — "our MTTR is two and a half hours." A useful MTTR report breaks the clock down into the four phases of the incident response loop, because each phase is owned by a different discipline and each phase has a different lever to move it.
The bars below show the target distribution for a P0 incident on a healthy program. Detection plus containment together should close in under twenty minutes — the team is buying time fast. Eradication takes the bulk of the clock because diagnosis and surgical reversal are where the actual engineering work happens. Recovery is measured in tranches; the verification runbook between each one is non-negotiable.
MTTR breakdown · P0 incident phase-by-phase target
Source: Digital Applied incident-response panel · P0 target distribution
The phase breakdown turns "our MTTR is too slow" from a blame conversation into an engineering decision. If detection is slow, invest in the panel. If containment is slow, the kill-switch isn't wired or on-call doesn't trust it enough to flip first. If eradication is slow, the runbook doesn't exist or the rollback primitives aren't first-class operations. If recovery is slow, the verification runbook is missing or the tranche schedule is undefined. Each phase has a fix; the panel surfaces which one is dragging.
We track MTTR per severity tier rather than a global average, because P0 and P3 have different operating contracts. P0 targets under 2 hours; P1 under 4 hours; P2 under 1 business day; P3 under 1 week. Averaging across severities hides whether the critical incidents are tightening or loosening.
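To make the phase view concrete, here is a minimal sketch that rolls hypothetical per-incident phase timings up into a per-severity MTTR and checks it against the tier targets; the record shape and the business-hours conversion used for the P2 / P3 targets are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-incident records: severity plus minutes spent in each response phase.
incidents = [
    {"severity": "P0", "detect": 6,  "contain": 12, "eradicate": 55, "recover": 30},
    {"severity": "P0", "detect": 9,  "contain": 10, "eradicate": 70, "recover": 25},
    {"severity": "P1", "detect": 20, "contain": 30, "eradicate": 95, "recover": 40},
]

PHASES = ("detect", "contain", "eradicate", "recover")
# Tier targets from the operating contract: P0 < 2h, P1 < 4h, P2 < 1 business day, P3 < 1 week
# (P2 / P3 converted to working minutes here, which is an assumption).
TARGET_MIN = {"P0": 120, "P1": 240, "P2": 8 * 60, "P3": 5 * 8 * 60}

by_severity = defaultdict(list)
for inc in incidents:
    by_severity[inc["severity"]].append(inc)

for sev, group in sorted(by_severity.items()):
    phase_means = {p: mean(i[p] for i in group) for p in PHASES}
    mttr = sum(phase_means.values())
    flag = "within target" if mttr <= TARGET_MIN[sev] else "OVER TARGET"
    print(f"{sev}: {phase_means} -> MTTR {mttr:.0f} min ({flag})")
```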
04 — Severity Distribution
Count is noise — distribution is signal.
Counting incidents is the metric leadership intuitively asks for and the metric that misleads worst. A team running ten P3 incidents a quarter is healthy; a team running two P0s a quarter is in trouble. The signal is the distribution across the four severity tiers, and the slope of that distribution over time.
The matrix below shows the operating contract for each tier — the page priority, the response shape, the MTTR target — and the distribution band that indicates a program in a healthy state. Drift toward higher severities is the leading indicator of a program in decay; drift toward lower severities is what a maturing program looks like as the detection panel catches issues earlier in their lifecycle.
P0 · Blast radius growing
Pages on-call immediately, all-hands available, executive ping at 30 min. MTTR target under 2 hours. Healthy programs run zero to one P0 per quarter; two or more is the signal to invest before the team burns out.
Target: ≤ 1 / quarter
P1 · Internal failure or single-workflow degradation
Pages on-call within business hours, single-owner response, status update every 2 hours. MTTR target under 4 hours. Healthy distribution: 2-5 P1s per quarter. Rising P1s while P0s stay flat is the cleanest signal that detection improved.
Target: 2-5 / quarter
P2 · Fallback engaged
Queued for next business day, single-owner, async update on resolution. MTTR target under 1 day. Healthy distribution: 5-10 P2s per quarter. P2s caught early are the highest-leverage incidents; they're cheap to resolve and teach the panel.
Target: 5-10 / quarter
P3 · No customer impact
Logged for weekly review, root-caused but not page-worthy. Healthy distribution: 10-20 P3s per quarter. A program logging zero P3s isn't healthy — it's blind. P3s are how a maturing program learns without paying for the lesson.
Target: 10-20 / quarter
Two disciplines make severity distribution useful. The first is that severity is set by the incident commander on the response call, not by the alert that pages — alerts are wrong about severity in both directions and the human on the call has authority to upgrade or downgrade. The second is that severity downgrades are explicit and logged, not silent. A P0 that becomes a P1 mid-incident gets the change announced in the response channel with the reason captured for the postmortem record. Without those two disciplines, severity becomes meaningless and every incident defaults to P0.
The trend line worth watching is severity distribution by calendar quarter. A maturing program shows P0 count flat or falling, P1 count flat, and P2 / P3 counts rising — the panel is catching issues earlier and at lower severity. A program in decay shows the opposite shape: P0 count rising, P3 count flat or falling, distribution shifting upward.
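A sketch of that quarterly view, assuming an incident export with opened_at and severity columns (names illustrative):

```python
import pandas as pd

# One row per incident: opened_at, severity in {"P0", "P1", "P2", "P3"}.
df = pd.read_csv("incidents.csv", parse_dates=["opened_at"])
df["quarter"] = df["opened_at"].dt.to_period("Q")

# Count per tier per quarter: the distribution, not just the total.
dist = df.groupby(["quarter", "severity"]).size().unstack(fill_value=0)
print(dist)

# Crude drift check: a rising share of P0 + P1 quarter over quarter is the decay signature;
# a rising share of P2 + P3 is what a maturing detection panel looks like.
high_sev_share = dist.reindex(columns=["P0", "P1"], fill_value=0).sum(axis=1) / dist.sum(axis=1)
print(high_sev_share)
```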
05 — Repeat Rate
Repeat-incident rate — postmortem discipline measured.
Repeat-incident rate is the percentage of incidents in a trailing window that share a root cause with an earlier incident in the same window. Formula: count of incidents where the postmortem failure class plus system-level root cause match a prior incident, divided by total incidents, evaluated on a rolling 90-day window. Target under 10%; sustained above 20% is the signal that postmortem discipline has broken down.
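Read as code, the formula looks roughly like this; the record fields (closed_at, failure_class, root_cause) are illustrative stand-ins for whatever the postmortem tracker exports.

```python
from datetime import datetime, timedelta

# Hypothetical postmortem records.
postmortems = [
    {"closed_at": datetime(2026, 1, 10), "failure_class": "retry storm", "root_cause": "no backoff cap on tool errors"},
    {"closed_at": datetime(2026, 2, 3),  "failure_class": "context corruption", "root_cause": "stale RAG index snapshot"},
    {"closed_at": datetime(2026, 3, 1),  "failure_class": "retry storm", "root_cause": "no backoff cap on tool errors"},
]

WINDOW = timedelta(days=90)
now = max(p["closed_at"] for p in postmortems)
recent = sorted(
    (p for p in postmortems if now - p["closed_at"] <= WINDOW),
    key=lambda p: p["closed_at"],
)

seen, repeats = set(), 0
for p in recent:
    key = (p["failure_class"], p["root_cause"])
    if key in seen:  # same failure class and system-level root cause as an earlier in-window incident
        repeats += 1
    seen.add(key)

rate = repeats / len(recent) if recent else 0.0
print(f"Repeat-incident rate (trailing 90d): {rate:.0%}")  # target: under 10%
```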
The metric matters because it is the cleanest single signal for whether the team is finding system causes or settling for agent-blame. A postmortem that ends at "the agent hallucinated" produces no action item that prevents the recurrence. The same failure class fires again in the next quarter, the team responds to it again, and repeat-incident rate climbs. Healthy postmortems generate concrete system-level fixes — new guardrails, new evals, new detection signals — and the same class doesn't come back.
Three operational levers move repeat-incident rate over a quarter. None of them is exotic; together they account for the bulk of the programs that hold the rate under 10% on a sustained basis.
Postmortem template sections
Every postmortem completes five sections: timeline, failure class (no 'hallucination' allowed), system-level root cause, action items (concrete + owned + dated), detection improvement. The template forces the system view that prevents agent-blame.
Forcing function
Action item review cadence
At the start of every postmortem, the team reviews action items from the prior 60 days. Incomplete actions are the second-most-common cause of repeat incidents; the review pulls them back into focus before the next quarter starts.
Every postmortem
Detection improvement per incident
Every postmortem includes one new detection signal — a panel update, a tighter threshold, an additional alert. The detection layer compounds over postmortems; the team that adds a signal per incident is the team whose MTTD halves over a year.
Per-incident
The trend line worth watching is repeat-incident rate on a trailing 90-day window, plotted month-over-month. A healthy program shows rate falling toward zero as the action-item backlog clears and detection improvements compound. A program in decay shows rate rising as postmortems get shorter, action items go undone, and the same failure classes recycle. Repeat-incident rate is the metric most strongly predictive of whether next quarter's incident load gets lighter or heavier — invest in it.
06 — Runbook Coverage
Runbook coverage — next quarter's MTTR.
Runbook coverage is the percentage of incident failure classes with a tested, current runbook. Formula: count of failure classes with a runbook last rehearsed in the prior 90 days, divided by total failure classes observed in the prior 12 months. Target 80%+; sustained below 50% is the signal that runbooks have become stale and the team is writing them under pressure during incidents.
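A minimal sketch of that calculation, with hypothetical failure-class names and rehearsal dates standing in for the real log:

```python
from datetime import date, timedelta

today = date(2026, 3, 31)

# Failure classes observed in the prior 12 months (hypothetical names).
observed_classes = {"retry storm", "prompt regression", "model version drift", "tool outage", "context corruption"}

# Last rehearsal date per runbook; a missing entry means no runbook exists for that class.
last_rehearsed = {
    "retry storm": date(2026, 2, 14),
    "prompt regression": date(2025, 11, 2),  # stale: rehearsed more than 90 days ago
    "tool outage": date(2026, 3, 5),
}

fresh_cutoff = today - timedelta(days=90)
covered = sum(
    1 for cls in observed_classes
    if cls in last_rehearsed and last_rehearsed[cls] >= fresh_cutoff
)
coverage = covered / len(observed_classes)
print(f"Runbook coverage: {coverage:.0%}")  # target: 80%+
```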
The metric matters because runbook coverage is the leading indicator for next quarter's MTTR. A team responding to a failure class with a current runbook moves through containment and eradication in minutes; a team responding without one spends the first half-hour figuring out what the rollback even looks like. Runbook coverage today is MTTR three months from today — the lag between the two is reliable.
Five canonical runbook classes cover roughly 90% of agent incident shapes. Every production agent program should have all five wired, rehearsed quarterly, and rewritten annually; a minimal config sketch of the first four follows the list.
Kill-switch activation
Boolean flag · workflow-level · < 60s effect
Single boolean per workflow stored in a configuration system that takes effect without a code deploy. On-call has authority to flip without product-owner approval. Test the switch quarterly with a deliberate drill — a flag that's been in place six months and never flipped is one you don't actually trust.
Drill: quarterly
Prompt rollback
Versioned prompt · git revert pattern · 5-15 min
Prompts treated as code with version history in the deploy system. Rollback is a first-class operation: select prior canary-passed version, deploy, verify against eval suite, unpause traffic in tranches. The cheapest discipline that pays back during eradication.
Drill: quarterly
Model version pin
Floating → fixed pointer · 15-30 min
Most provider SDKs support explicit version pinning. The runbook flips the deploy config from a floating pointer to a fixed one, verifies against the eval suite, and unpauses. The version that failed gets logged so the upgrade can be re-attempted safely later.
Drill: quarterly
Tool quarantine
Disable from toolset · graceful degradation · 10-30 min
Failing tool is disabled from the agent's available toolset; the workflow relies on its graceful degradation path. Distinct from a server-side fix because the agent stops trying to call the broken tool entirely while the team investigates the underlying failure.
Drill: quarterly
The fifth canonical runbook is context restore — identifying the corrupted context source (RAG index, customer-data feed, system prompt template) and resetting it from a known-good snapshot. Slower than the other four classes because the team has to confirm the corruption hasn't spread before signing off on recovery; typical resolution window is 30-90 minutes.
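To show how the first four primitives can hang off a single per-workflow configuration read at run time, here is a minimal sketch; every name in it (WorkflowConfig, kill_switch, model_pin, quarantined_tools, run_workflow) is hypothetical, and a real setup would read these flags from whatever configuration service the stack already uses.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowConfig:
    kill_switch: bool = False           # flipping this pauses the workflow without a code deploy
    prompt_version: str = "v42"         # last canary-passed prompt; rollback target
    model_pin: str | None = None        # None = floating pointer; set to pin a known-good version
    quarantined_tools: set[str] = field(default_factory=set)

def run_workflow(config: WorkflowConfig, toolset: dict):
    if config.kill_switch:
        return {"status": "paused", "reason": "kill switch engaged"}
    # Tool quarantine: drop broken tools and rely on the graceful-degradation path.
    active_tools = {name: fn for name, fn in toolset.items() if name not in config.quarantined_tools}
    model = config.model_pin or "provider-default-latest"
    # ... invoke the agent with active_tools, model, and config.prompt_version ...
    return {"status": "ran", "model": model, "tools": sorted(active_tools)}

print(run_workflow(WorkflowConfig(kill_switch=True), {"search": None}))
```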
Runbook freshness matters as much as coverage. A runbook written twelve months ago against a stack that has since changed model providers, swapped MCP servers, and migrated deploy systems is a runbook that doesn't work. The maintenance cadence is three-tiered: after every incident, the runbook is updated with anything that surprised the on-call engineer; quarterly, the team runs a tabletop drill against one rotation-chosen runbook; annually, all five canonical runbooks get a full rewrite pass against the current stack.
07 — On-Call Load
On-call load — burnout measured early.
On-call load is the bundle of metrics that measure the human cost of running the rotation: pages per on-call week, interrupted-sleep nights per week, and weekend incident hours. Track all three. The aggregate is what predicts whether the rotation survives the next two quarters; the individual lines tell the team which lever to pull when load drifts up.
The metric matters because on-call load is the leading indicator for burnout, and burnout is the leading indicator for losing senior responders. A program that lets on-call load climb without intervention is one quarter away from a rotation collapse — the most experienced engineers leave first, the remaining team responds slower, MTTR rises, and the cycle accelerates. The cheap intervention is measurement plus a review cadence; the expensive intervention is hiring after the senior engineer has already given notice.
Pages per on-call week
Total pages — automated and human-filed — during a single seven-day rotation. Healthy target under five pages per week; sustained above ten is the signal that alert thresholds are too sensitive, runbook coverage is too low, or both. Drives most of the felt load.
Target: < 5 / week
Interrupted-sleep nights / week
Nights in the on-call week where the responder was paged between 11pm and 6am local time. Target zero to one per week; sustained above two erodes responder health quickly and is the single best predictor that the rotation is heading toward burnout.
Target: ≤ 1 / week
Weekend incident hours
Hours spent on active incident response during Saturday and Sunday of the on-call week. Target under four hours; weekend hours are the most damaging form of on-call time and the easiest to under-count in informal reporting. Make them visible.
Target: < 4h / week
The trend line worth watching is on-call load by responder over a trailing 12-week window. A healthy program shows load distributed roughly evenly across the rotation; a struggling one shows two or three responders absorbing the bulk of pages while others coast — which is the precursor to the senior engineers quitting. Publish the per-responder breakdown internally so leadership can spot the imbalance before it compounds.
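A sketch of the per-responder rollup, assuming a page log that records who was paged, when (local time), and hours of active response; the field names and the night / weekend rules are simplifications.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical page log entries.
pages = [
    {"responder": "ana", "at": datetime(2026, 3, 7, 2, 15),   "active_hours": 1.5},  # Saturday, 02:15
    {"responder": "ana", "at": datetime(2026, 3, 9, 14, 0),   "active_hours": 0.5},
    {"responder": "ben", "at": datetime(2026, 3, 10, 23, 30), "active_hours": 2.0},
]

load = defaultdict(lambda: {"pages": 0, "night_dates": set(), "weekend_hours": 0.0})
for p in pages:
    stats = load[p["responder"]]
    stats["pages"] += 1
    # 11pm-6am local counts as an interrupted-sleep night (deduped per calendar date;
    # a fuller version would attribute early-morning pages to the previous night).
    if p["at"].hour >= 23 or p["at"].hour < 6:
        stats["night_dates"].add(p["at"].date())
    if p["at"].weekday() >= 5:  # Saturday / Sunday
        stats["weekend_hours"] += p["active_hours"]

for responder, s in sorted(load.items()):
    # Compare against the targets above: < 5 pages, <= 1 interrupted night, < 4 weekend hours per week.
    print(responder, s["pages"], "pages,", len(s["night_dates"]), "night(s),", s["weekend_hours"], "weekend h")
```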
Three interventions move on-call load when it drifts up. First, tune alert thresholds against the false-positive log — a panel that pages on noise burns the rotation without delivering signal. Second, fill runbook gaps for the failure classes that paged on-call without one — every missing runbook is roughly an hour of unstructured triage per page. Third, schedule explicit recovery time for responders coming off heavy weeks; a day off after a 15-page rotation is cheaper than the senior engineer who quits after three of them.
If you're standing up the metrics panel from scratch and need help calibrating against your current stack, our AI transformation engagements include the eight-KPI dashboard as a standard line item — detection panels designed for your workflows, severity matrix calibrated to your business risk, on-call load tracking wired to your paging system.
"On-call load is the leading burnout indicator. A program that lets it climb without intervention is one quarter away from losing its senior responders — and that loss compounds for years."— Agentic engineering leadership conversation, Q2 2026
Incident metrics turn agent ops from reactive to predictive.
The eight-KPI framework above is not an exotic measurement stack. Each metric has a classical SRE analogue, a clear formula, a target band, and a trend line. What makes it useful for agent ops is that it answers the question gut-feel can't: is the program improving, holding, or in decay — and where do we invest next quarter to move the needle? Without those answers, agent ops is reactive firefighting with no path to a predictable state.
The metrics that move the program hardest are not the ones leadership intuitively asks about. MTTR shows up first because it's the most visible, but MTTD is the leading indicator that predicts where MTTR lands. Severity count is easy to read but distribution is what tells the story. Runbook coverage today is MTTR three months from today. Repeat-incident rate is the cleanest single signal for postmortem discipline. On-call load is what predicts whether the rotation survives long enough for any of the other metrics to keep mattering.
Practical next step: pick the highest-traffic agent workflow your team runs and walk the eight KPIs against it this month. Where does MTTD land today? Is the severity matrix calibrated and used? What is runbook coverage on the failure classes you've actually seen? Most teams find at least three gaps on the first pass; closing them before next quarter starts is the cheapest investment in agent ops the team will make all year.