An agentic AI production deploy is the moment a prototype that scored well in evals collides with real users, real traffic, and real cost. Stage 6 of the ten-stage agentic AI pipeline is the checklist, rollback plan, monitoring config, canary release, feature-flag layer, and post-deploy verification runbook that together turn a passing eval into an operational system you can actually leave running on a Friday.
Most production incidents we see in 2026 are not capability incidents. The model can do the work; the prototype passed evals; the demo was clean. What broke was the deploy — a missing kill switch, a monitoring stack that watched latency but not eval drift, a 100% rollout instead of a 1% canary, a feature flag wired to the deploy instead of the release. Stage 6 fixes those gaps before the traffic arrives.
This Stage 6 kit picks up from prototype templates (Stage 5) and hands off to team enablement (Stage 7). Skip to the FAQ for the questions teams ask before the deploy calendar appointment goes on the books.
- 01 — Eval-passing does not equal production-ready. A clean eval suite measures capability, not operability. The Stage 6 deploy checklist exists because the gap between a green eval and a healthy production system is consistent, predictable, and almost always under-engineered.
- 02 — The rollback plan precedes the deploy, written before, not after. Kill-switch design, traffic-shift mechanics, and eval-driven rollback triggers are checklist items completed before traffic moves. A rollback you design under incident pressure is not a rollback; it is improvisation.
- 03 — Monitoring layers stack on each other: eval, latency, cost, error. Production agents need all four axes alerting independently. Eval-only monitoring misses cost spikes; latency-only monitoring misses quality drift; cost-only monitoring misses correctness regressions. The four together are the floor.
- 04 — Canary releases bound the blast radius before the rollout earns trust. A four-tier canary (1% → 10% → 50% → 100%) with eval, latency, cost, and error gates at every tier is how a regression touches dozens of users rather than thousands. It is cheaper than the postmortem on the day you skipped it.
- 05 — Feature flags decouple deploy from release. Per-user, per-tenant, per-workload flags let you ship code today, enable behavior tomorrow, and switch it off without a redeploy. The flag layer is what makes production agents safe to iterate at the cadence the workload demands.
01 — Why Stage 6 · Passing evals do not equal production-ready.
The most common failure mode at Stage 5 → Stage 6 is shipping a prototype that passed evals as if it were a production system. Evals measure capability under controlled conditions: known inputs, scoped failure modes, a curated dataset, a single-tenant harness. Production introduces concurrent traffic, adversarial inputs, multi-tenant blast radius, cost ceilings, drift over time, and operators with pagers. The capability bar is necessary; it is never sufficient.
Stage 6 codifies the operational layer that sits between a green eval suite and an agent that survives its first production weekend. The thirty-point checklist below covers six categories — deploy hygiene, rollback design, monitoring stack, canary mechanics, feature-flag layer, and post-deploy verification. Together they are how the same team that demoed cleanly on Tuesday ships safely on Thursday.
Eval-only confidence: green eval suite · no canary · 100% deploy
The team passes its eval gate and treats it as the go/no-go. The first incident is a regression on an input class the eval set never covered — caught by users, not by the team. Verdict: skips the canary.

Latency-only monitoring: p50 / p99 dashboards · no eval drift
Operations sees latency, not quality. A subtle reasoning regression ships and is invisible for a week — the dashboards are green, the eval suite isn't run on production traffic, the support tickets pile up. Verdict: misses drift.

Layered deploy: checklist · canary · 4-axis monitoring · runbook
Thirty checks signed off before traffic moves. The canary opens 1% → 10% → 50% → 100% with eval, latency, cost, and error gates. A 72-hour runbook closes the loop. Regressions are caught at 1%, not 100%. Verdict: the Stage 6 default.

Deploy equals release: no flags · feature ships when code ships
Without a feature-flag layer, every deploy is a release. Rolling back a misbehaving feature requires a redeploy under pressure. Stage 6 treats deploy and release as two separate events that the flag layer connects. Verdict: conflates layers.

02 — Checklist · Thirty production readiness checks.
The thirty-point checklist is the gate. Every item gets a yes/no sign-off from a named owner before the canary opens to 1%. The categories below are deliberate — they map to the failure modes that take down agent workflows in their first month of production traffic. The point is not the specific numbers; it is that every point is asked, answered, and signed off rather than assumed.
# stage-6 production deploy checklist · 30 points
## A. Capability gate (5)
1. Eval suite green on the deploy candidate (≥ target score per axis)
2. Eval suite covers production input distribution, not just demo inputs
3. Adversarial / red-team prompts run against candidate; report attached
4. Regression suite passes vs the model currently in production
5. Confidence calibration measured — confidence scores match accuracy
## B. Deploy hygiene (5)
6. Deploy artifact pinned (model version, prompt version, tool versions)
7. Config / secrets loaded from environment, not baked in
8. Smoke test in staging against shadow traffic (≥ 1 hour green)
9. Deploy is reproducible from a git tag — no manual steps
10. Deploy notification posted (channel + ticket + on-call paged)
## C. Rollback readiness (5)
11. Kill-switch wired and tested in staging within the last 7 days
12. Traffic-shift mechanism documented (flag, weighted route, or DNS)
13. Rollback target (previous version) is hot and ready to receive 100%
14. Rollback trigger thresholds defined (eval drop, latency, cost, error)
15. Rollback runbook contains exact commands — no improvisation
## D. Monitoring stack (5)
16. Eval drift metric streaming on production traffic (sampled per tier)
17. Latency p50 / p95 / p99 alerting per stage, not just end-to-end
18. Cost per request alerting with daily and hourly burn-rate budgets
19. Error rate alerting split by class (transient / permanent / external)
20. Dashboard linked from the deploy ticket — operators bookmark, not search
## E. Canary mechanics (5)
21. Canary tiers configured: 1% → 10% → 50% → 100%
22. Gate criteria documented per tier (must hold for N minutes / hours)
23. Automatic rollback wired to gate violation, not "operator notices"
24. Per-tenant exclusion list for canary (high-value tenants opt in last)
25. Canary tier promotion is a deliberate command, never time-based alone
## F. Post-deploy hygiene (5)
26. 72-hour verification runbook scheduled; on-call rotation knows
27. First production incident drill rehearsed against this deploy
28. Customer-facing changelog drafted (if user-visible behavior changes)
29. Cost forecast updated based on canary observation, not pre-deploy estimate
30. Stage 7 (team enablement) hand-off scheduled within 7 days

Items in section A (capability gate) close the loop with Stage 5; items in section F (post-deploy hygiene) open the loop into Stage 7. The middle sections — B through E — are the genuinely production-specific work that Stage 6 owns. Treat the checklist as a hard gate. Missing items are not minor — they are the headline of the next incident review.
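Treating the checklist as a hard gate is easy to mechanize. The sketch below (all names hypothetical) models each item with a named owner and a yes/no sign-off, and refuses to open the canary while any item is unsigned:

```python
from dataclasses import dataclass

@dataclass
class CheckItem:
    number: int          # 1-30, matching the checklist above
    category: str        # "A" through "F"
    description: str
    owner: str           # a named owner, never a team alias
    signed_off: bool = False

def canary_may_open(checklist):
    """Hard gate: returns (ok, numbers of unsigned items)."""
    missing = [item.number for item in checklist if not item.signed_off]
    return (not missing, missing)

# Two of the thirty items shown for brevity.
items = [
    CheckItem(1, "A", "Eval suite green on the deploy candidate", "alice", signed_off=True),
    CheckItem(11, "C", "Kill-switch tested in staging within 7 days", "bob"),
]
ok, missing = canary_may_open(items)   # ok is False while item 11 is unsigned
```

The returned list of unsigned item numbers doubles as the agenda for the go/no-go meeting: each number has exactly one owner to chase.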
"A rollback you design under incident pressure is not a rollback. It is improvisation, performed badly, in front of an audience."— Production deploy retrospective, Q4 2025
03 — Rollback · Kill-switch, traffic shift, eval triggers.
Rollback is the second-most-skipped piece of a Stage 6 deploy — second only to canary mechanics. The pattern that keeps showing up: the team designs forward, ships, and discovers at the moment of incident that the rollback was an idea, not an artefact. The four patterns below are the practical shapes a rollback plan takes; pick the one that matches your traffic mechanics, then write the commands down.
Kill-switch flag: single boolean · evaluated per request
A single feature flag wraps the entire agent workflow. Flip it false and traffic falls back to the prior surface (manual flow, prior model, or hard error with a queued retry). The cheapest pattern; works for the first deploy. Use for Stage 6 v1.

Weighted traffic shift: router weights · per-request shard
Two stacks run side by side — current and candidate. Router weights move traffic between them in steps (100/0 → 90/10 → 50/50 → 0/100). Rollback is a weight change, not a redeploy. Use for Stage 6 v2.

Per-tenant cohort: tenant allowlist · roll forward by cohort
The stack ships behind a per-tenant flag. Cohort 1 tries the new behavior; on success, cohort 2 enables. Rollback rolls back the latest cohort only, not every tenant. Used for high-value or contractually distinct accounts. Use for B2B agents.

Eval-triggered auto-rollback: streaming eval · auto-flip on threshold
The eval metric streams over production traffic. When it crosses a defined threshold (per-stage accuracy drop, latency spike, cost explosion), the kill-switch flips without an operator. The operator is notified, not asked. Use for high-volume agents.

The four patterns are not mutually exclusive. The strongest Stage 6 deploys we ship at Digital Applied combine pattern 1 (a kill-switch as the floor), pattern 2 (weighted shifts as the primary canary mechanic), and pattern 4 (auto-rollback wired to the same metrics the canary gates on). Pattern 3 is the addition for contractually distinct tenants — the SaaS shape where a single tenant cannot be the test surface for the rest.
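The combination of patterns 1 and 2 can be sketched as a single router object (illustrative names throughout): the kill switch is the floor checked on every request, the candidate weight is the canary mechanic, and rollback is a state change rather than a redeploy.

```python
import random

class AgentRouter:
    """Sketch of rollback patterns 1 + 2 combined: a kill-switch floor
    under a weighted traffic shift. Names are illustrative, not a
    specific platform's API."""

    def __init__(self):
        self.kill_switch = False       # pattern 1: single boolean, read per request
        self.candidate_weight = 0.0    # pattern 2: fraction of traffic on the candidate

    def route(self, rng=random.random):
        if self.kill_switch:
            return "fallback"          # prior surface: manual flow or prior model
        return "candidate" if rng() < self.candidate_weight else "current"

    def rollback(self):
        self.candidate_weight = 0.0    # a weight change, not a redeploy

router = AgentRouter()
router.candidate_weight = 0.10         # tier-2 canary: 10% on the candidate
router.rollback()                      # gate violation: all traffic back to current
```

Because both levers are plain state on the router, the exact rollback commands belong in the runbook verbatim — no improvisation under incident pressure.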
Rollback triggers must be specific. "Eval drops" is not a trigger; "accuracy on the labelled production sample falls below 0.92 for 15 minutes across 5,000 requests" is. The looser the trigger, the slower the response — and the slower the response, the more users see the regression before the rollback engages.
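The specific trigger quoted above can be made executable. A minimal sketch, using the article's example thresholds (0.92 accuracy, 15 minutes, 5,000 requests); a production version would also require the breach itself to persist rather than firing on the first bad window:

```python
from collections import deque

class EvalRollbackTrigger:
    """Specific, not loose: accuracy on the labelled production sample
    below 0.92 across a 15-minute window holding at least 5,000 requests."""

    def __init__(self, threshold=0.92, window_s=15 * 60, min_requests=5000):
        self.threshold = threshold
        self.window_s = window_s
        self.min_requests = min_requests
        self.samples = deque()           # (timestamp, scored_correct)

    def record(self, now, correct):
        self.samples.append((now, correct))
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()       # drop samples outside the window

    def should_rollback(self):
        if len(self.samples) < self.min_requests:
            return False                 # not enough volume to judge
        accuracy = sum(c for _, c in self.samples) / len(self.samples)
        return accuracy < self.threshold
```

The `min_requests` floor is what makes the trigger safe at the 1% tier: a handful of bad requests in a thin sample does not flip the kill switch.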
04 — Monitoring · Eval, latency, cost, error budgets.
A production agent has four monitoring axes that must alert independently: eval drift (is the answer still right), latency (does it still respond on time), cost (is it still affordable), and error rate (does it still complete). Most teams ship with one or two; the gap is consistent and the consequence is consistent too. The choice matrix below covers the picks per axis — what to measure, what to alert on, and what to do when the alert fires.
Eval drift: streaming accuracy on production traffic
Sample 1-5% of production requests, score them with the same eval suite used at Stage 5, and alert on an accuracy drop sustained across a window. Distinct from offline eval — this watches the live system, not the curated set. Watch accuracy plus confidence calibration.

Latency: p50 / p95 / p99 per stage
End-to-end latency hides which stage degraded. Alert per stage at p99 above a defined threshold for 5+ minutes. Alert separately on tail-latency expansion (the p99 / p50 ratio climbing) — that is upstream degradation, not load. Per-stage thresholds plus ratio alerts.

Cost: daily budget + hourly burn rate
Two alerts. A daily budget warning when projected spend exceeds budget by 20%; an immediate alert when the hourly burn rate exceeds 2× the moving average. Cost spikes are usually either a model misroute or a runaway retry loop — both worth paging. Budget and burn-rate dual alert.

Error rate: split by class — transient, permanent, external
A single 'error rate' metric collapses three failure modes that demand different responses. Split into transient (retryable), permanent (deploy regression), and external (upstream provider). Each class gets its own alert with its own runbook. Three-class split with separate alerts.

Independent axes
Eval, latency, cost, error rate. Each alerts independently with its own threshold and its own owner. Collapsing them into one 'health' score is the surest way to ship a regression invisibly. The Stage 6 floor.

Production eval traffic
Eval scoring on every request is too expensive for most agents. Sample 1-5% of production requests, score them against the eval rubric, and aggregate to a streaming accuracy metric. The sample is what makes the alerting feasible. The practical default.

Time-to-alert
From the metric crossing its threshold to the on-call pager firing: under 5 minutes. Anything longer means the regression reaches a meaningful slice of users before the team is told. The engineering target.

The four axes feed the canary gates and the auto-rollback triggers. Stage 6 monitoring is not a passive dashboard layer — it is the load-bearing input to every other piece of the deploy kit. If the monitoring stack is wrong, the canary is blind, the rollback is delayed, and the runbook is reactive. Build this section first; the rest of Stage 6 inherits its quality.
05 — Canary · 1% → 10% → 50% → 100% with gates.
A canary release opens the production tap in stages, with each stage gated on the monitoring axes from Section 04. The four-tier pattern below is the default Digital Applied template. The specific percentages can flex for the workload, but the principle does not: every promotion is a deliberate decision after the metrics hold, not a timer that ticks regardless.
# stage-6 canary release pattern · four tiers, gated promotion
tier 1 · 1% traffic
duration ≥ 30 minutes (or ≥ 5,000 requests, whichever is later)
promote when eval drift < 1 pp · p99 latency < SLO · cost < budget · errors stable
rollback when any gate violated for ≥ 5 minutes
who watches on-call engineer + deploy owner
tier 2 · 10% traffic
duration ≥ 4 hours (or ≥ 50,000 requests)
promote when all tier-1 gates still hold across the larger sample
rollback when any gate violated for ≥ 5 minutes
who watches on-call engineer + deploy owner + product owner
tier 3 · 50% traffic
duration ≥ 24 hours (covers a daily cycle of traffic mix)
promote when all gates hold across the daily peak
rollback when any gate violated for ≥ 10 minutes
who watches on-call engineer (passive — auto-rollback wired)
tier 4 · 100% traffic
duration ongoing
watch for first 72 hours per the post-deploy runbook (§07)
rollback when any gate violated for ≥ 10 minutes
who watches on-call rotation + automatic rollback
# excluded by default at every tier — opted in last
· high-value tenants (named allowlist)
· regulated-workload tenants
· tenants with active escalations
# promotion is a deliberate command — never time-only

Four-tier canary release · blast radius vs trust earned
Tier durations are minimums — extend per workload risk and traffic mix.
The single most common canary mistake we see is timer-only promotion. A canary that promotes after 30 minutes regardless of metric health is not a canary — it is a slow rollout. The gates are what make it a canary. The pager on the on-call rotation is what makes the gates real.
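The promotion rule above reduces to a conjunction: duration AND volume AND green gates AND a deliberate operator command. A sketch with the tier minimums from the pattern (numbers taken from the template above):

```python
from dataclasses import dataclass

@dataclass
class Tier:
    traffic_pct: int
    min_minutes: int       # minimum duration at this tier
    min_requests: int      # minimum volume at this tier

# The four tiers from the pattern above; durations are minimums.
TIERS = [Tier(1, 30, 5_000), Tier(10, 240, 50_000),
         Tier(50, 1_440, 0), Tier(100, 0, 0)]

def may_promote(tier, elapsed_min, requests, gates_green, operator_confirmed):
    """Promotion needs duration AND volume AND green gates AND a
    deliberate operator command — a timer alone never promotes."""
    return (elapsed_min >= tier.min_minutes
            and requests >= tier.min_requests
            and gates_green
            and operator_confirmed)
```

The `operator_confirmed` argument is the point: removing it turns the function into a slow rollout, which is exactly the failure mode the section warns against.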
06 — Feature Flags · Per-user, per-tenant, per-workload flags.
Feature flags are the layer that decouples deploy from release. Without flags, every code deploy is a release of the behavior it contains; rolling that behavior back means another deploy under pressure. With flags, the deploy ships the code dark, the release turns on the behavior, and the rollback flips the same switch. Stage 6 agents need three flag granularities; using fewer narrows the options when an incident lands.
Per-user flag (smallest scope): allowlist · denylist · percentage roll-out
Toggle behavior for a single user or a percentage of users. Used for internal testing, beta cohorts, and emergency exclusion of a specific user reporting an issue. The most precise scope; the cheapest blast radius.

Per-tenant flag (B2B / multi-tenant): account-scoped · cohort-scoped
Toggle behavior across an entire tenant — usually a B2B customer or an internal team. Used for staged rollouts to named accounts, contractually distinct behavior, or emergency rollback of a single tenant without affecting others.

Per-workload flag (granular control): workflow-scoped · stage-scoped
Toggle behavior for a specific agent workflow or a specific stage within a workflow. Used to swap a model in one stage without changing the rest, A/B test prompt variants, or disable a single high-risk step pending review.

All three granularities should be readable in a single audit log keyed on the request ID. When an incident asks "why did this request behave this way?", the answer comes from the flag state at request time, not from reconstructing the deploy history. The audit trail is the difference between a five-minute triage and a four-hour archaeology project.
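A minimal sketch of the three granularities resolving into one audit record keyed on the request ID. The flags-dict shape and the all-three-must-allow composition rule are assumptions for the sketch, not a specific platform's API:

```python
AUDIT_LOG = {}   # in production: an append-only store, keyed on request ID

def evaluate_flags(request_id, user_id, tenant_id, workload, flags):
    """Resolve all three granularities for one request and record the
    resolved state, so incident triage reads flag state at request time
    instead of reconstructing deploy history."""
    state = {
        "user": flags.get("users", {}).get(user_id, False),
        "tenant": flags.get("tenants", {}).get(tenant_id, False),
        "workload": flags.get("workloads", {}).get(workload, False),
    }
    # One design choice among several: the behavior runs only when all
    # three scopes allow it, so any single scope is an independent kill path.
    state["enabled"] = state["user"] and state["tenant"] and state["workload"]
    AUDIT_LOG[request_id] = state
    return state

flags = {"users": {"u-1": True}, "tenants": {"t-9": True},
         "workloads": {"summarize": True}}
evaluate_flags("req-123", "u-1", "t-9", "summarize", flags)
# AUDIT_LOG["req-123"] now answers "why did this request behave this way?"
```

The all-scopes-must-allow rule is what makes each granularity an independent rollback lever: flipping the tenant flag alone is enough to dark the behavior for that tenant.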
Pick a flag platform that supports targeting rules and percentage rollouts natively — homegrown flag tables turn into a maintenance burden by month three. The vendor or open-source choice is a Stage 4 decision; the integration is a Stage 6 deliverable; the audit and clean-up cadence is Stage 8 (governance).
07 — Post-Deploy · Verification runbook for the first 72 hours.
The deploy is not finished when traffic reaches 100%. The first 72 hours after full rollout is when subtle regressions surface — the ones too small to trip an alert immediately but large enough to matter by week two. The post-deploy runbook below is the structured walkthrough Digital Applied runs on every Stage 6 deploy. Owners are named, intervals are explicit, and the close-out hands cleanly to Stage 7 enablement.
# stage-6 post-deploy verification runbook · 72 hours
## hour 0 — full rollout reached
· deploy owner posts confirmation in #deploys with dashboard links
· on-call rotation acknowledges hand-off (this deploy is the active surface)
· automatic rollback verified armed (force a synthetic trip in staging)
· customer-facing changelog published (if user-visible change)
## hour 1 — first quality pass
· sample 100 production requests across stages, score against eval rubric
· review error log for new error classes (codes / messages / stack patterns)
· check cost burn rate vs pre-deploy estimate — flag any > 20% deviation
· confirm flag audit log writes are flowing (per-user / tenant / workload)
## hour 4 — first traffic-mix pass
· review p50 / p95 / p99 latency per stage vs pre-deploy baseline
· review eval drift metric — is the streaming accuracy stable across cohorts
· review tenant-level cost; any tenant > 2× their pre-deploy share?
· review tail-latency ratio (p99 / p50) — climbing = upstream degradation
## hour 24 — first full daily cycle
· full eval rerun against a 1,000-request production sample
· review off-peak behavior (different input distribution than peak)
· review canary-excluded tenants for opt-in candidacy
· publish day-1 deploy report (eval / latency / cost / error vs baseline)
## hour 72 — close-out
· 72-hour deploy report published (link in deploy ticket)
· open issues triaged with owners and dates
· auto-rollback configuration reviewed; thresholds adjusted if needed
· hand-off to stage 7 (team enablement) scheduled within 7 days
· retrospective scheduled — what to repeat, what to change next deploy

The 72-hour window is calibrated empirically: it is the timeframe in which the majority of post-deploy regressions we have seen on agentic systems surface — drift in tail distributions, cost spikes tied to less-common input classes, errors that only trip on the third daily peak, behavior changes that only register once a specific tenant uses a specific workflow. After 72 hours the surface area of the deploy is well characterised and the system enters steady-state ownership under Stage 7.
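The hour-1 quality pass from the runbook is small enough to script. A sketch under stated assumptions: `score_fn` stands in for the same rubric scorer used at Stage 5 (returning 0 or 1 per request), and the 100-request sample and 20% cost threshold come from the runbook itself:

```python
import random

def hour_one_quality_pass(production_requests, score_fn,
                          forecast_cost, observed_cost, sample_size=100):
    """Hour-1 pass: sample production requests, score them against the
    eval rubric, and flag cost deviation over 20% vs the pre-deploy
    forecast. score_fn is an assumed interface, not a real library call."""
    sample = random.sample(production_requests,
                           min(sample_size, len(production_requests)))
    accuracy = sum(score_fn(r) for r in sample) / len(sample)
    deviation = abs(observed_cost - forecast_cost) / forecast_cost
    return {"sample_accuracy": accuracy,
            "cost_deviation": deviation,
            "cost_flag": deviation > 0.20}   # > 20% deviation gets flagged
```

Running the same script at hour 1, hour 4, and hour 24 gives the day-1 deploy report its accuracy and cost columns for free.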
08 — Next Stage · Hand-off to team enablement (Stage 7).
Stage 6 delivers a production system; Stage 7 makes it supportable. The clean hand-off has three deliverables: a deploy report (eval, latency, cost, error vs baseline across the 72-hour window), a runbook artefact that the on-call rotation can follow without the deploy owner, and a named owner pairing for every workflow the deploy touched. Without those three, Stage 7 inherits an undocumented system and the next deploy has the same problem.
For broader context, the resilience layer that sits underneath every Stage 6 deploy is covered in our agentic workflow resilience audit (70-point checklist). Stage 6 is where the resilience layer earns its keep — timeouts, retries, rollback, observability all become production primitives the moment real traffic touches them.
If you want the Stage 6 templates run for you end-to-end, our AI transformation engagements ship the deploy checklist, the rollback plan, the monitoring stack, the canary mechanic, the feature-flag layer, and the 72-hour verification runbook as a single Stage 6 package, with the Stage 7 enablement hand-off scheduled before the canary opens.
Production deploy is the checklist — release is the canary.
The Stage 6 gap is the consistent one across agentic AI programmes in 2026: the prototype is good, the eval is green, and the deploy is improvised. The same teams that spent weeks on Stage 5 evaluation often spend a single afternoon on Stage 6 deploy, and the production weekend produces the incident that could have been a 1% canary blip. The kit above exists because we have watched the same gap close the same way every time.
None of the thirty checks is exotic. None of the four canary tiers is novel. The feature-flag pattern is a decade old; the 72-hour runbook is calibrated against observed incident windows on agent systems specifically. What is new is the combination — checklist, rollback, monitoring, canary, flags, runbook — applied as a single Stage 6 gate before any agent workflow sees production traffic. That gate is what turns capability into operability.
Practical next step: pick one agent workflow that is about to deploy and run the thirty-point checklist against it today. Almost every team finds a missing item; almost every team can close the gap in a single sprint. The remaining items — the canary mechanic, the auto-rollback wiring, the 72-hour runbook — are what separate a deploy that survives a Friday from a deploy that creates one.