An agentic workflow incident response playbook is the codified five-phase loop — detection, containment, eradication, recovery, postmortem — that a production agent team runs when a workflow misbehaves at scale. The agent failure surface is different from classical web incidents: a bad prompt can cascade across thousands of runs before a single dashboard turns red, cost can spike 50× in an hour without a latency change, and the rollback target is rarely a clean previous build. The playbook below is the one we install with clients before their agent workflows take real traffic.
The pattern across post-incident reviews is consistent. The teams that recover fastest don't have smarter on-call engineers; they have detection panels tuned for agent-specific signals, kill-switches wired before they're needed, runbooks rehearsed quarterly, and a postmortem culture that finds the system cause rather than blaming the agent. The teams that struggle have classical web ops instincts applied to a workflow class those instincts were never built for.
This guide walks through each of the five phases, the runbook templates that operationalise them, the severity matrix that maps an incident to a paging tier and an MTTR clock, and the FAQ for the questions ops teams ask before their first agent P0. It pairs with our companion resilience audit checklist — resilience is what stops incidents from happening; this playbook is what you run when one happens anyway.
- 01 · Agent incidents compound — a bad prompt cascades before the first alert. Classical web incidents hit one endpoint at a time. Agent incidents hit every run that touched the broken prompt, model version, or tool — often thousands within minutes. Time-to-detect dominates blast radius; the playbook starts there.
- 02 · Detection signals are agent-specific — cost anomaly, trace volume drop, eval regression. Latency and 5xx rates miss most agent failures. The signals that catch a P0 are token spend per workflow, trace volume vs baseline, eval-suite regression on canary runs, and tool-error rate. Build the panels before the incident, not during.
- 03 · Kill-switch first, diagnosis second — containment beats triage on agent traffic. When an agent is mis-firing across thousands of runs, the first move is to stop new runs from entering the workflow — feature flag, queue pause, route bypass. Containment is reversible; production damage often is not.
- 04 · Postmortem without agent-blame surfaces the real root cause. “The agent hallucinated” is not a root cause; it's a description of the symptom. The real cause is almost always a missing guardrail, eval gap, or context-engineering bug. Postmortem templates have to force the system view.
- 05 · Severity matrix maps to page priority — P0/P1/P2/P3 each have an MTTR clock. P0 pages on-call immediately and targets resolution under two hours. P1 pages business-hours with a four-hour clock. P2 enters the next-day queue. P3 is logged for the weekly review. Without the matrix, every incident becomes a P0 and on-call burns out.
01 — Why Playbook
Agent incidents compound — playbooks prevent the cascade.
Classical web incidents have a forgiving shape. An endpoint starts returning 5xx, a dashboard turns red within a minute or two, the on-call engineer rolls back the last deploy, traffic recovers, and the postmortem writes itself. The blast radius is bounded by the request rate and the time-to-detect; neither typically catches the team off guard.
Agent incidents break that shape. A model version bump that lands cleanly in evals can produce subtly worse tool selection across an entire workflow class. A prompt edit that ships fine for one customer can hallucinate against an edge-case input nobody thought to test. A new MCP server that connected cleanly in staging can quietly time out in production, and the agent — designed to be resilient — will retry, escalate, fan out, and burn a week's token budget in an afternoon. The signals that would normally surface the failure (latency, 5xx) often look completely healthy while the incident compounds underneath.
The playbook exists because the failure surface is genuinely different. Every team running agents in production will have at least one incident class their classical web playbooks don't cover. The five-phase loop below is the operational discipline that makes those incidents survivable.
Detect
Cost · trace volume · eval · tool error
Agent-specific dashboards and alerts catch the incident class classical signals miss. Time-to-detect is the lever that dominates blast radius; every minute of detection is roughly an order of magnitude of compounding cost on a misbehaving workflow.
Target: < 5 min

Contain
Kill-switch · feature flag · shadow
Stop new runs from entering the broken path. Toggle the feature flag, pause the queue, route to the shadow agent. Containment is reversible and cheap; the cost of pausing for ten minutes is rounding error against the cost of not pausing.
Target: < 15 min

Eradicate
Rollback · model pin · tool quarantine
Identify the change that introduced the failure and reverse it surgically. Prompt rollback, model version pin, tool quarantine, MCP server disable. Distinct from containment — containment stops the bleeding, eradication removes the wound.
Target: < 60 min

Recover
Verify · partial restore · full restore
Confirm the eradication worked, restore traffic in measured tranches, monitor the recovery panels closely. Full restoration only after a verification runbook signs off. Don't flip back to 100% the moment the dashboard goes green.
Target: < 2 h (P0)

02 — Detection
Eval regression, cost anomaly, trace volume drop.
Detection is the highest-leverage phase in the playbook. The difference between a P0 caught at minute four and a P0 caught at hour four is rarely a smarter on-call engineer; it's whether the team built dashboards and alerts on signals classical web monitoring misses. Latency stays healthy on most agent failures. Error rate stays healthy. The signals that fire are agent-shaped, and the team that hasn't built them runs blind.
The four signal classes below are the foundation of the detection panel. Each carries a baseline definition, an alert threshold, and a typical false-positive rate so the on-call rotation can trust what it sees.
Cost anomaly
Token spend per workflow · per tenant · per hour
Track token spend at workflow granularity with rolling baselines. Alert when spend exceeds 2× the trailing 14-day p95 for that workflow, or when tenant-level spend trips a budget. Catches retry storms, tool-loop failures, and prompt regressions that bloat completions — none of which surface on latency.
Threshold: 2× p95

Trace volume drop
Workflow completion rate vs baseline
Trace volume dropping is the inverse failure mode of cost anomaly — workflows are failing fast and not completing. Alert when completion rate falls below 50% of trailing baseline for any workflow with material volume. Especially common after an MCP server change or tool deprecation.
Threshold: 50% of baseline

Eval regression
Canary eval suite per deploy
Run the eval suite against a small canary slice of production traffic on every deploy. Alert when any eval metric regresses more than 5% from the prior canary window. Catches model-version, prompt, and tool-binding changes that pass tests but degrade real workflows.
Threshold: 5% delta

Tool error rate
Tool call failure / retry per minute
Per-tool failure and retry rates surface MCP server outages, API deprecations, and rate-limit storms long before they propagate to a workflow-level signal. Alert when tool error rate doubles for more than 5 minutes. The cheapest panel in the set; usually the first one to fire.
Threshold: 2× / 5 min

The other four signals worth building once the four above are in place: agent decision drift (per-decision-class rate vs baseline), human-escalation rate (handoff frequency to the operator queue), tool selection entropy (distribution of tool choices for a given workflow), and tenant-level outlier rate (workflows by tenant scoring abnormally high or low). The full eight-signal panel is the difference between detection-rich incident response and the classical-web-instincts version that misses agent failures by design.
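To make the first signal concrete, here is a minimal sketch of the cost-anomaly rule (2× the trailing 14-day p95, evaluated per workflow). The baseline query and the alerting hook are assumed to exist in your stack; only the threshold logic is shown.

```python
from statistics import quantiles

def trailing_p95(hourly_spend: list[float]) -> float:
    """95th percentile of the trailing per-hour token spend for one workflow."""
    return quantiles(hourly_spend, n=20)[-1]   # last 20-quantile cut point = p95

def cost_anomaly(hourly_spend: list[float], current_hour: float,
                 multiplier: float = 2.0) -> bool:
    """Fire when this hour's spend exceeds 2x the trailing 14-day p95.

    hourly_spend: up to 336 samples (14 days x 24 h) from the cost-attribution
    store; with less than a day of baseline the rule stays quiet rather than
    paging on noise.
    """
    if len(hourly_spend) < 24:
        return False
    return current_hour > multiplier * trailing_p95(hourly_spend)
```

The same shape (rolling baseline, fixed multiplier, minimum-sample guard) carries over to the trace-volume and tool-error rules; only the metric and the multiplier change.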
The companion piece on agent observability covers the instrumentation layer that makes these signals possible — without trace coverage at tool-call granularity and per-workflow cost attribution, the detection panels above have no data to draw from.
"Time-to-detect dominates blast radius on agent workflows. Every minute of detection is roughly an order of magnitude of compounding cost on a misbehaving workflow."— Production agent post-mortem, Q1 2026
03 — Containment
Kill-switch, feature flag, agent shadowing.
Containment is the phase where the team buys time. The incident is still active, the cause may not yet be understood, but the bleeding has to stop. The discipline is to choose containment actions that are reversible and cheap — pausing for ten minutes is rounding error against the cost of letting a misbehaving workflow run another hour.
The three containment primitives below cover roughly 90% of agent incident classes. Every team running agents in production should have all three wired before they're needed.
Kill-switch
A single boolean per workflow that, when flipped, stops new runs from entering the agent path. Existing in-flight runs either drain or abort, depending on the workflow class. The kill-switch sits in a configuration store separate from the application code so it takes effect without a deploy; on-call has authority to flip it without paging product owners. The discipline is to flip first and diagnose second — the cost of an unnecessary kill is rounding error against the cost of a delayed one.
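A minimal sketch of that check, assuming a generic config-store client with a `get(key)` method; the key naming, cache TTL, and workflow name are illustrative, not tied to any particular product.

```python
import time

class KillSwitch:
    """Per-workflow boolean read from a config store, cached briefly so the
    check adds negligible latency to the hot path but still takes effect
    within seconds of being flipped, with no deploy."""

    def __init__(self, store, ttl_seconds: float = 5.0):
        self._store = store          # any client exposing .get(key) -> str | None
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[float, bool]] = {}

    def is_killed(self, workflow: str) -> bool:
        now = time.monotonic()
        hit = self._cache.get(workflow)
        if hit and now - hit[0] < self._ttl:
            return hit[1]
        killed = self._store.get(f"killswitch/{workflow}") == "on"
        self._cache[workflow] = (now, killed)
        return killed

# At the entry point of the agent path (names hypothetical):
# if kill_switch.is_killed("invoice-triage"):
#     return enqueue_for_manual_handling(request)   # stop new runs entering the workflow
```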
Feature flag
A more surgical alternative to the kill-switch: route a percentage of traffic to the broken path, the rest to a fallback. Useful when the agent is doing real work and the fallback is acceptable for a window — manual escalation, cached response, simpler model. The flag should support tenant-level overrides so a single high-value customer can be pinned to the fallback while the rest of the workflow comes back online incrementally.
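A sketch of the routing decision, combining a percentage rollout with tenant-level pins; the hashing scheme is one reasonable choice rather than the only one, and the tenant name in the comment is hypothetical.

```python
import hashlib

def use_fallback(tenant_id: str, fallback_pct: int,
                 overrides: dict[str, str]) -> bool:
    """Decide per run whether this tenant goes to the fallback path.

    fallback_pct: 0-100, the share of traffic held on the fallback.
    overrides:    explicit pins, e.g. {"acme-corp": "fallback"}.
    """
    pin = overrides.get(tenant_id)
    if pin in ("fallback", "agent"):
        return pin == "fallback"
    # Stable hash so a tenant does not flap between paths as the percentage moves.
    bucket = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % 100
    return bucket < fallback_pct
```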
Agent shadowing
Route the production traffic through both the broken agent and a known-good shadow (previous version, simpler model, or rule-based fallback). The shadow's output is the customer-facing result; the broken agent runs in parallel and its output is logged for comparison. Shadowing is more work to set up than kill-switch but preserves the customer experience while the team investigates — useful for high-volume customer-facing workflows where pausing has material business cost.
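A sketch of the shadowing wrapper, following the framing above (the known-good path serves the customer, the suspect agent runs in parallel for comparison). Both agents are assumed to be async callables; error handling is deliberately one-sided so a failure in the suspect path can never reach the customer.

```python
import asyncio
import logging

log = logging.getLogger("shadow")

async def run_with_shadow(request_id: str, payload: dict,
                          known_good, suspect, timeout_s: float = 30.0):
    """Serve the known-good agent's answer; run the suspect in parallel and
    log its output for offline comparison during the investigation."""
    good_task = asyncio.create_task(known_good(payload))
    suspect_task = asyncio.create_task(suspect(payload))

    result = await good_task                       # customer-facing path
    try:
        shadow_result = await asyncio.wait_for(suspect_task, timeout_s)
        log.info("shadow request=%s match=%s", request_id, shadow_result == result)
    except Exception as exc:                       # timeouts and errors stay internal
        log.warning("shadow request=%s failed: %s", request_id, exc)
    return result
```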
04 — Eradication
Prompt rollback, model pin, tool quarantine.
Eradication is distinct from containment. Containment stops the bleeding; eradication removes the wound. The team has time bought by the kill-switch or feature flag, and now has to identify the specific change that introduced the failure and reverse it without making the problem worse.
The choice of eradication action depends on the failure class. The matrix below maps the four most common agent incident shapes to their eradication move, with a typical recovery time and the failure mode each one risks.
Symptom: eval delta + decision drift
Roll the prompt back to the previous canary-passed version. The deploy system should support prompt-versioned rollback as a first-class operation; treating prompts as code with git history is the cheapest discipline that pays back during eradication.
Move: prompt rollback (5-15 min)

Symptom: cost anomaly + tool drift
Pin the model to the last known-good version explicitly in the deploy config. Most provider SDKs allow version pinning; the eradication step is to flip from a floating pointer to a fixed one and verify against the eval suite before unpausing traffic.
Move: model pin (15-30 min)

Symptom: tool error rate + trace volume drop
Quarantine the failing tool — disable it from the agent's available toolset and rely on the workflow's graceful degradation path. Distinct from a server-side fix because the agent stops trying to call the broken tool entirely while the team investigates.
Move: tool quarantine (10-30 min)

Symptom: drift across tenants + hallucination spike
Identify the corrupted context source — RAG index, customer-data feed, system prompt template — and reset it from a known-good snapshot. Slower than the other classes because the team has to confirm the corruption hasn't spread before signing off on recovery.
Move: context restore (30-90 min)

The cross-cutting discipline is to make every eradication action reversible. Roll back the prompt, but capture the broken prompt for postmortem analysis. Pin the model, but log the failure on the new version so the upgrade can be re-attempted. Quarantine the tool, but keep its trace coverage active so the underlying failure can be diagnosed. The team that eradicates by deleting evidence loses the postmortem lesson.
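One way to express those moves in code is a sketch that derives the corrected workflow config from the broken one without mutating it, so the broken state survives as postmortem evidence. The config shape, field names, and example values below are illustrative assumptions, not a specific framework's API.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class WorkflowConfig:
    model: str                                  # a floating alias in normal operation
    prompt_version: str
    enabled_tools: frozenset[str] = frozenset()

def eradicate(cfg: WorkflowConfig, *, pin_model: str | None = None,
              rollback_prompt: str | None = None,
              quarantine: frozenset[str] = frozenset()) -> WorkflowConfig:
    """Produce the corrected config; the broken cfg is kept untouched."""
    return replace(
        cfg,
        model=pin_model or cfg.model,                    # floating alias -> fixed version
        prompt_version=rollback_prompt or cfg.prompt_version,
        enabled_tools=cfg.enabled_tools - quarantine,    # tool quarantine
    )

# e.g. eradicate(cfg, pin_model="provider-model-2025-01-15",
#                rollback_prompt="triage-v41", quarantine=frozenset({"crm_lookup"}))
# (model version, prompt tag, and tool name are hypothetical)
```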
05 — Recovery
Verification runbook, partial restore, full restore.
Recovery is the phase where most teams over-rotate. The dashboard went green, eradication completed, the obvious thing to do is to unpause traffic. The playbook discipline is to recover in measured tranches with a verification runbook between each step — the second-most-common cause of a repeat incident is a too-confident recovery on the first one.
Verification runbook
Before any traffic comes back, the team runs a fixed verification checklist: eval suite passes on the eradicated state, canary workflow on a known-good test input completes with expected output, tool calls succeed against the recovered MCP servers, cost panel shows the corrected workflow at baseline token spend. The runbook is owned by the on-call engineer; sign-off is required before moving to partial restore.
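A sketch of that gate as code, so a pass/fail is recorded per check rather than eyeballed. The individual checks are passed in as callables because they wire into your own eval runner, canary harness, tool health endpoints, and cost panel; the commented wiring names are hypothetical.

```python
from typing import Callable

def verification_runbook(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run the fixed pre-restore checklist and record each result.
    Partial restore requires every check to pass plus an explicit sign-off."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False              # a crashing check counts as a failure
    return results

# Illustrative wiring (all helpers hypothetical):
# results = verification_runbook({
#     "eval_suite_passes": lambda: eval_runner.run("invoice-triage").passed,
#     "canary_run_ok":     lambda: canary.run(known_good_input).matches_expected,
#     "tools_reachable":   lambda: all(t.ping() for t in recovered_tools),
#     "cost_at_baseline":  lambda: cost.current("invoice-triage") <= cost.baseline("invoice-triage"),
# })
# proceed_to_partial_restore = all(results.values())
```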
Partial restore
Unpause traffic to the smallest meaningful tranche — typically 5-10% of normal volume, or one low-risk tenant for multi-tenant workflows. Monitor the detection panels closely for 30-60 minutes. If anything regresses, the kill-switch goes back on and the team returns to eradication. If the panels stay green, advance to the next tranche (25%, 50%, 100%) with a verification pass between each. Most P0 recoveries take three to five tranches.
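The tranche progression is mechanical enough to sketch. The three callables (traffic control, panel check, verification gate) are assumptions about your routing layer, and the hold time matches the 30-60 minute watch described above.

```python
import time

TRANCHES = (5, 25, 50, 100)          # percent of normal volume
HOLD_SECONDS = 30 * 60               # minimum panel-watch between tranches

def staged_restore(set_traffic_pct, panels_green, verify) -> bool:
    """Advance traffic tranche by tranche; any regression drops back to zero.

    set_traffic_pct(pct): route pct% of runs to the recovered workflow.
    panels_green():       True while the detection panels hold at baseline.
    verify():             the verification runbook; must pass before each tranche.
    """
    for pct in TRANCHES:
        if not verify():
            set_traffic_pct(0)               # back to eradication, not forward
            return False
        set_traffic_pct(pct)
        deadline = time.monotonic() + HOLD_SECONDS
        while time.monotonic() < deadline:
            if not panels_green():
                set_traffic_pct(0)           # kill-switch back on
                return False
            time.sleep(30)                   # poll the panels every 30 seconds
    return True                              # full restore; keep the 24 h elevated window
```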
Full restore
Traffic at 100%, all panels green, in-flight runs completing normally. The recovery phase isn't over the moment the dial hits 100 — keep an elevated monitoring window for at least 24 hours, with on-call alerts tuned tighter than baseline so any recurrence pages immediately. Full restore ends when the monitoring window closes without incident.
Eval pass before partial
Every recovery starts with the eval suite on the eradicated state. If a single canary eval is below baseline, the team does not proceed to partial restore. Cheap insurance against repeat incidents.
Sign-off required

Initial tranche target
Unpause to 5% of normal volume (or one low-risk tenant) and watch the detection panels for 30-60 minutes. The team that unpauses straight to 100% is the team most likely to need a second incident response cycle.
Hold ≥ 30 min

Post-recovery window
After full restore, keep alert thresholds tighter than baseline for 24 hours. Recurrence happens most often within the first day post-recovery; the elevated window catches it before it pages back.
Tighter thresholds

The discipline that pays the most during recovery is the verification runbook itself. A team without one will recover faster, ship more confidently, and pay the cost in repeat incidents. A team with one will recover more slowly the first time and never need to re-run the cycle on the same root cause. That trade is consistently worth taking.
06 — Postmortem
Root cause without agent-blame.
The postmortem phase is where the team learns or doesn't. The single most common failure mode is writing "the agent hallucinated" as the root cause and closing the ticket. That sentence isn't a root cause — it's a description of the symptom. The real cause is almost always a missing guardrail, an eval-coverage gap, a context-engineering bug, or a tool-binding mismatch. Postmortem templates have to actively force the system view.
The blameless template
A working postmortem template has five sections, each with forcing-function prompts that prevent agent-blame:
- Timeline. Minute-by-minute from first signal to full restore, including detection delays, response times, and every action taken. Forces the team to surface what the detection panels caught versus what they missed.
- Failure class. Categorise the incident: prompt regression, model version, tool failure, context corruption, infrastructure, or human error. No "agent hallucination" allowed — that label collapses into one of the classes above when examined.
- System-level root cause. What guardrail, eval, or check should have caught this and didn't? If the answer is "none exists," that's the action item. If one exists but failed, that's the bug.
- Action items. Concrete, owned, dated. Each addresses a system-level gap surfaced by the root cause — never "train the model better" or "tell the agent not to do that."
- Detection improvement. What signal would have caught this 10 minutes earlier? Build the panel as an action item. The detection layer compounds over postmortems — every incident teaches the dashboard one new thing.
The blameless discipline extends to the people on-call. Agent incidents tend to escalate fast and resolve in confused conditions; the postmortem reviews the decisions in their context and looks for the missing guardrails, not the people. A team that blames the on-call engineer learns less than a team that blames the system that put the engineer in that position.
07 — Severity
P0, P1, P2, P3 — page priority.
The severity matrix maps an incident to a paging tier and an MTTR clock. Without it, every incident becomes a P0 and on-call burns out in two months. With it, the team has a shared vocabulary for urgency that scales the response to the actual blast radius.
The matrix below shows the operational shape of each tier — how urgently it pages, who responds, and the target time-to-recover. These are the defaults we install with clients; adjust the thresholds to your business risk profile, but keep the four-tier structure.
Severity matrix · page priority + MTTR clock
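The default shape of each tier, restated from the key takeaways above:

P0 · Pages on-call immediately · target resolution under two hours
P1 · Pages during business hours · four-hour clock
P2 · No immediate page · next-day queue
P3 · No page · logged for the weekly review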
Severity thresholds are defaults — calibrate against your business-risk profile. The four-tier structure should hold regardless.

Two disciplines make the severity matrix work in practice. The first is that severity is set by the incident commander on the response call, not by the alert that pages. An alert can be wrong about severity in either direction; the human on the call has authority to upgrade or downgrade. The second is that severity downgrades are explicit, not implicit. A P0 that becomes a P1 mid-incident gets that change announced in the response channel, with the reason logged for the postmortem.
If you're standing up the playbook from scratch, the priority ordering is: detection panels first (Phase 1), kill-switch second (Phase 2), severity matrix third (cross-cutting), then runbook templates for the other phases. The first three give the team enough to respond to the first incident competently; the runbooks bring the response from competent to fast. Our AI transformation engagements ship the full playbook as a standard line item — detection panels designed for your stack, runbook templates wired to your tooling, severity matrix calibrated to your business risk profile.
Incident response is the work — the playbook just makes it repeatable.
Every team that ships agents to production will run an incident response cycle. The question isn't whether — it's how prepared the team is when the first signal lands. A team without detection panels tuned for agent failures will catch the incident at hour four; a team without a kill-switch will spend the next hour writing one under pressure; a team without a severity matrix will treat every incident as a P0 until on-call burns out.
The five-phase playbook isn't exotic. Each phase has a classical web-ops equivalent and a small set of agent-specific differences. What it requires is the discipline to build the primitives before they're needed — the detection signals wired, the kill-switch deployed, the verification runbook written, the postmortem template forcing the system view. The same teams that skip the resilience layer skip the incident response layer for the same reason: it costs engineering time before the first demo and only pays back at production scale.
Practical next step: pick the highest-traffic agent workflow your team runs and walk it through the five phases this week. Where would detection fire? Is the kill-switch wired? What does the eradication move look like for a prompt regression? Who owns the postmortem? Most teams find at least three gaps on the first pass; closing them before the first incident is the cheapest investment the team will make all year.