
Stage 6 of 10 — production deploy. The thirty-point checklist that turns a passing eval into a production system.

Agentic AI Production Deploy: Stage 6 Pipeline Kit

A passing eval is not a production system. Stage 6 of the agentic AI pipeline is where prototype quality becomes operational quality — with a deploy checklist, an explicit rollback plan, a monitoring stack that watches eval and cost together, a canary release that opens the tap gradually, and a 72-hour post-deploy runbook that catches the drift the dashboards miss.

Digital Applied Team
Agentic engineering · Published May 7, 2026 · Read time 10 min · Sources: Production deploys, 2024-2026
Deploy checks · 30 across six categories
Canary tiers · 4 (1% → 10% → 50% → 100%)
Monitoring axes · 4 (eval · latency · cost · error)
Verification window · 72h post-deploy runbook

An agentic AI production deploy is the moment a prototype that scored well in evals collides with real users, real traffic, and real cost. Stage 6 of the ten-stage agentic AI pipeline is the checklist, rollback plan, monitoring config, canary release, feature-flag layer, and post-deploy verification runbook that together turn a passing eval into an operational system you can actually leave running on a Friday.

Most production incidents we see in 2026 are not capability incidents. The model can do the work; the prototype passed evals; the demo was clean. What broke was the deploy — a missing kill switch, a monitoring stack that watched latency but not eval drift, a 100% rollout instead of a 1% canary, a feature flag wired to the deploy instead of the release. Stage 6 fixes those gaps before the traffic arrives.

This Stage 6 kit picks up from prototype templates (Stage 5) and hands off to team enablement (Stage 7). Skip to the FAQ for the questions teams ask before the deploy calendar appointment goes on the books.

Pipeline navigation · ten stages
You are reading Stage 6 — production deploy. The full pipeline runs in order: 1 readiness assessment · 2 strategy roadmap · 3 data foundation · 4 vendor selection · 5 prototype · 6 production deploy · 7 team enablement · 8 governance · 9 scale · 10 continuous improvement. Each stage links forward and back; skipping stages is how production incidents happen.
Key takeaways
  1. Eval-passing does not equal production-ready. A clean eval suite measures capability, not operability. The Stage 6 deploy checklist exists because the gap between a green eval and a healthy production system is consistent, predictable, and almost always under-engineered.
  2. The rollback plan precedes the deploy — written before, not after. Kill-switch design, traffic-shift mechanics, and eval-driven rollback triggers are checklist items completed before traffic moves. A rollback you design under incident pressure is not a rollback; it is improvisation.
  3. Monitoring layers stack on each other — eval, latency, cost, error. Production agents need all four axes alerting independently. Eval-only monitoring misses cost spikes; latency-only monitoring misses quality drift; cost-only monitoring misses correctness regressions. The four together are the floor.
  4. Canary releases bound the blast radius before the rollout earns trust. A four-tier canary (1% → 10% → 50% → 100%) with eval, latency, cost, and error gates at every tier is how a regression touches dozens of users rather than thousands. Cheaper than the postmortem on the day you skipped it.
  5. Feature flags decouple deploy from release. Per-user, per-tenant, per-workload flags let you ship code today, enable behavior tomorrow, and switch it off without a redeploy. The flag layer is what makes production agents safe to iterate at the cadence the workload demands.

01 · Why Stage 6: Passing evals do not equal production-ready.

The most common failure mode at Stage 5 → Stage 6 is shipping a prototype that passed evals as if it were a production system. Evals measure capability under controlled conditions: known inputs, scoped failure modes, a curated dataset, a single-tenant harness. Production introduces concurrent traffic, adversarial inputs, multi-tenant blast radius, cost ceilings, drift over time, and operators with pagers. The capability bar is necessary; it is never sufficient.

Stage 6 codifies the operational layer that sits between a green eval suite and an agent that survives its first production weekend. The thirty-point checklist below covers six categories — capability gate, deploy hygiene, rollback readiness, monitoring stack, canary mechanics, and post-deploy hygiene. Together they are how the same team that demoed cleanly on Tuesday ships safely on Thursday.

Trap 1
Eval-only confidence
Green eval suite · no canary · 100% deploy

The team passes its eval gate and treats it as the go/no-go. The first incident is a regression on an input class the eval set never covered — caught by users, not by the team.

Skips canary
Trap 2
Latency-only monitoring
p50 / p99 dashboards · no eval drift

Operations sees latency, not quality. A subtle reasoning regression ships and is invisible for a week — the dashboards are green, the eval suite isn't run on production traffic, the support tickets pile up.

Misses drift
Stage 6 pattern
Layered deploy
Checklist · canary · 4-axis monitoring · runbook

Thirty checks signed off before traffic moves. Canary opens 1% → 10% → 50% → 100% with eval, latency, cost, and error gates. A 72-hour runbook closes the loop. Regressions are caught at 1%, not 100%.

Stage 6 default
Trap 3
Deploy equals release
No flags · feature ships when code ships

Without a feature-flag layer, every deploy is a release. Rolling back a misbehaving feature requires a redeploy under pressure. Stage 6 treats deploy and release as two separate events that the flag layer connects.

Conflates layers
The Stage 6 thesis
A prototype that passes evals is one ingredient of a production system. The other ingredients — checklist, rollback, monitoring, canary, flags, runbook — are engineered, not assumed. Every production incident this guide is calibrated against came from a team that shipped the prototype straight to 100% traffic without the rest of the kit.

02 · Checklist: Thirty production readiness checks.

The thirty-point checklist is the gate. Every item gets a yes/no sign-off from a named owner before the canary opens to 1%. The categories below are deliberate — they map to the failure modes that take down agent workflows in their first month of production traffic. The point is not the specific numbers; it is that every point is asked, answered, and signed off rather than assumed.

# stage-6 production deploy checklist · 30 points

## A. Capability gate (5)
1. Eval suite green on the deploy candidate (≥ target score per axis)
2. Eval suite covers production input distribution, not just demo inputs
3. Adversarial / red-team prompts run against candidate; report attached
4. Regression suite passes vs the model currently in production
5. Confidence calibration measured — confidence scores match accuracy

## B. Deploy hygiene (5)
6. Deploy artifact pinned (model version, prompt version, tool versions)
7. Config / secrets loaded from environment, not baked in
8. Smoke test in staging against shadow traffic (≥ 1 hour green)
9. Deploy is reproducible from a git tag — no manual steps
10. Deploy notification posted (channel + ticket + on-call paged)

## C. Rollback readiness (5)
11. Kill-switch wired and tested in staging within the last 7 days
12. Traffic-shift mechanism documented (flag, weighted route, or DNS)
13. Rollback target (previous version) is hot and ready to receive 100%
14. Rollback trigger thresholds defined (eval drop, latency, cost, error)
15. Rollback runbook contains exact commands — no improvisation

## D. Monitoring stack (5)
16. Eval drift metric streaming on production traffic (sampled per tier)
17. Latency p50 / p95 / p99 alerting per stage, not just end-to-end
18. Cost per request alerting with daily and hourly burn-rate budgets
19. Error rate alerting split by class (transient / permanent / external)
20. Dashboard linked from the deploy ticket — operators bookmark, not search

## E. Canary mechanics (5)
21. Canary tiers configured: 1% → 10% → 50% → 100%
22. Gate criteria documented per tier (must hold for N minutes / hours)
23. Automatic rollback wired to gate violation, not "operator notices"
24. Per-tenant exclusion list for canary (high-value tenants opt in last)
25. Canary tier promotion is a deliberate command, never time-based alone

## F. Post-deploy hygiene (5)
26. 72-hour verification runbook scheduled; on-call rotation knows
27. First production incident drill rehearsed against this deploy
28. Customer-facing changelog drafted (if user-visible behavior changes)
29. Cost forecast updated based on canary observation, not pre-deploy estimate
30. Stage 7 (team enablement) hand-off scheduled within 7 days

Items in section A (capability gate) close the loop with Stage 5; items in section F (post-deploy hygiene) open the loop into Stage 7. The middle sections — B through E — are the genuinely production-specific work that Stage 6 owns. Treat the checklist as a hard gate. Missing items are not minor — they are the headline of the next incident review.
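
If the checklist lives as a machine-readable artifact rather than a wiki page, the gate can be enforced mechanically. A minimal sketch, assuming a hypothetical checklist.json in which each item carries an owner and a signed_off flag — the file name and schema are illustrative, not part of the Stage 6 template:

# sketch: stage-6 checklist gate — refuse to open the canary while any item is unsigned
# assumes a hypothetical checklist.json of the form:
#   [{"id": 11, "category": "C", "text": "Kill-switch wired and tested", "owner": "...", "signed_off": true}, ...]
import json
import sys

REQUIRED_ITEMS = 30  # the Stage 6 gate is all thirty points, not a subset

def gate(path: str = "checklist.json") -> bool:
    with open(path) as f:
        items = json.load(f)
    missing = [i for i in items if not (i.get("owner") and i.get("signed_off"))]
    if len(items) < REQUIRED_ITEMS or missing:
        for i in missing:
            print(f"BLOCKED  item {i['id']:>2} ({i['category']}) — no sign-off: {i['text']}")
        return False
    print("Gate clear: all 30 items signed off — canary may open at 1%.")
    return True

if __name__ == "__main__":
    sys.exit(0 if gate() else 1)

Wired into the deploy pipeline as a required step, a script like this means the canary literally cannot open while an item is unsigned.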

"A rollback you design under incident pressure is not a rollback. It is improvisation, performed badly, in front of an audience."— Production deploy retrospective, Q4 2025

03 · Rollback: Kill-switch, traffic shift, eval triggers.

Rollback is the second-most-skipped piece of a Stage 6 deploy — second only to canary mechanics. The pattern that keeps showing up: the team designs forward, ships, and discovers at the moment of incident that the rollback was an idea, not an artefact. The four patterns below are the practical shapes a rollback plan takes; pick the one that matches your traffic mechanics, then write the commands down.

Pattern 1
Kill-switch flag
Single boolean · evaluated per request

A single feature flag wraps the entire agent workflow. Flip it false and traffic falls back to the prior surface (manual flow, prior model, or hard error with a queued retry). Cheapest pattern; works for the first deploy.

Use for stage-6 v1
Pattern 2
Weighted traffic shift
Router weights · per-request shard

Two stacks run side by side — current and candidate. Router weights move traffic between them in steps (100/0 → 90/10 → 50/50 → 0/100). Rollback is a weight change, not a redeploy.

Use for stage-6 v2
Pattern 3
Per-tenant cohort
Tenant allowlist · roll forward by cohort

Stack ships behind a per-tenant flag. Cohort 1 tries the new behavior; on success, cohort 2 enables. Rollback rolls back the latest cohort only, not every tenant. Used for high-value or contractually distinct accounts.

Use for B2B agents
Pattern 4
Eval-triggered auto-rollback
Streaming eval · auto-flip on threshold

Eval metric streams over production traffic. When it crosses a defined threshold (per-stage accuracy drop, latency spike, cost explosion), the kill-switch flips without an operator. Operator is notified, not asked.

Use for high-volume agents

The four patterns are not mutually exclusive. The strongest Stage 6 deploys we ship at Digital Applied combine pattern 1 (a kill-switch as the floor), pattern 2 (weighted shifts as the primary canary mechanic), and pattern 4 (auto-rollback wired to the same metrics the canary gates on). Pattern 3 is the addition for contractually distinct tenants — the SaaS shape where a single tenant cannot be the test surface for the rest.

Rollback triggers must be specific. "Eval drops" is not a trigger; "accuracy on the labelled production sample falls below 0.92 for 15 minutes across 5,000 requests" is. The looser the trigger, the slower the response — and the slower the response, the more users see the regression before the rollback engages.
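
As a concrete illustration of pattern 4 wired to a trigger that specific, here is a minimal sketch. It assumes a stream of already-scored production samples and a set_flag() placeholder standing in for whatever flag platform you actually use; the threshold values mirror the example trigger above:

# sketch: eval-triggered auto-rollback (pattern 4) with the example trigger from this section
import time
from collections import deque

ACCURACY_FLOOR = 0.92      # roll back when sampled accuracy falls below this...
WINDOW_SECONDS = 15 * 60   # ...sustained for 15 minutes...
MIN_REQUESTS = 5_000       # ...computed across the last 5,000 scored requests

window: deque[bool] = deque(maxlen=MIN_REQUESTS)
below_since: float | None = None

def set_flag(name: str, value: bool) -> None:
    print(f"[flag] {name} -> {value}")   # placeholder for the real feature-flag client

def record(correct: bool, now: float | None = None) -> None:
    """Feed one scored production sample; flip the kill-switch if the trigger holds."""
    global below_since
    now = now or time.time()
    window.append(correct)
    if len(window) < MIN_REQUESTS:
        return                               # not enough evidence yet — do nothing
    accuracy = sum(window) / len(window)
    if accuracy >= ACCURACY_FLOOR:
        below_since = None                   # recovered — reset the breach clock
        return
    below_since = below_since or now
    if now - below_since >= WINDOW_SECONDS:
        set_flag("agent_workflow_enabled", False)   # operator is notified, not asked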

04 · Monitoring: Eval, latency, cost, error budgets.

A production agent has four monitoring axes that must alert independently: eval drift (is the answer still right), latency (does it still respond on time), cost (is it still affordable), and error rate (does it still complete). Most teams ship with one or two; the gap is consistent and the consequence is consistent too. The breakdown below covers each axis: what to measure, what to alert on, and what to do when the alert fires.

Eval drift
Streaming accuracy on production traffic

Sample 1-5% of production requests, score them with the same eval suite used at Stage 5, alert on accuracy drop sustained across a window. Distinct from offline eval — this watches the live system, not the curated set.

Watch accuracy + confidence calibration
Latency
p50 / p95 / p99 per stage

End-to-end latency hides which stage degraded. Alert per stage at p99 above a defined threshold for 5+ minutes. Alert separately on tail latency expansion (p99 / p50 ratio climbing) — that is upstream degradation, not load.

Per-stage thresholds + ratio alerts
Cost
Daily budget + hourly burn rate

Two alerts. Daily budget warning when projected spend exceeds budget by 20%; immediate alert when hourly burn rate exceeds 2× the moving average. Cost spikes are usually either a model misroute or a runaway retry loop — both worth paging.

Budget + burn-rate dual alert
Error rate
Split by class — transient, permanent, external

A single 'error rate' metric collapses three failure modes that demand different responses. Split into transient (retryable), permanent (deploy regression), external (upstream provider). Each class gets its own alert with its own runbook.

Three-class split with separate alerts
Coverage
4
Independent axes

Eval, latency, cost, error rate. Each alerts independently with its own threshold and its own owner. Collapsing them into one 'health' score is the surest way to ship a regression invisibly.

Stage 6 floor
Sample rate
1-5%
Production eval traffic

Eval scoring on every request is too expensive for most agents. Sample 1-5% of production requests, score them against the eval rubric, aggregate to a streaming accuracy metric. The sample is what makes the alerting feasible.

Practical default
Alert latency
<5min
Time-to-alert

From the metric crossing threshold to the on-call pager firing, under 5 minutes. Anything longer means the regression reaches a meaningful slice of users before the team is told.

Engineering target

The four axes feed the canary gates and the auto-rollback triggers. Stage 6 monitoring is not a passive dashboard layer — it is the load-bearing input to every other piece of the deploy kit. If the monitoring stack is wrong, the canary is blind, the rollback is delayed, and the runbook is reactive. Build this section first; the rest of Stage 6 inherits its quality.
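
To make the cost axis concrete, here is a minimal sketch of the dual alert described above: a daily-budget projection that warns at 20% over budget, and an hourly burn-rate check that pages at 2× the moving average. The dollar figure and the alert() hook are illustrative assumptions, not recommendations:

# sketch: cost axis — daily-budget projection plus hourly burn-rate alert
from statistics import mean

DAILY_BUDGET_USD = 400.0        # illustrative figure, not a recommendation
BUDGET_WARN_FACTOR = 1.20       # warn when projected daily spend exceeds budget by 20%
BURN_SPIKE_FACTOR = 2.0         # page when hourly burn exceeds 2x the moving average

def alert(severity: str, message: str) -> None:
    print(f"[{severity}] {message}")    # placeholder for the real paging integration

def check_cost(hourly_spend: list[float]) -> None:
    """hourly_spend: spend per completed hour today, oldest first."""
    if not hourly_spend:
        return
    hours_elapsed = len(hourly_spend)
    projected_daily = sum(hourly_spend) / hours_elapsed * 24
    if projected_daily > DAILY_BUDGET_USD * BUDGET_WARN_FACTOR:
        alert("warn", f"projected daily spend ${projected_daily:,.0f} vs budget ${DAILY_BUDGET_USD:,.0f}")
    if hours_elapsed >= 4:              # need a few hours of history for a meaningful average
        moving_avg = mean(hourly_spend[:-1])
        if hourly_spend[-1] > BURN_SPIKE_FACTOR * moving_avg:
            alert("page", f"hourly burn ${hourly_spend[-1]:,.2f} is >2x moving average ${moving_avg:,.2f}")

# usage: check_cost([14.2, 15.1, 13.8, 41.7])  -> pages on the burn-rate spike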

05 · Canary: 1% → 10% → 50% → 100% with gates.

A canary release opens the production tap in stages, with each stage gated on the monitoring axes from Section 04. The four-tier pattern below is the default Digital Applied template. The specific percentages can flex for the workload, but the principle does not: every promotion is a deliberate decision after the metrics hold, not a timer that ticks regardless.

# stage-6 canary release pattern · four tiers, gated promotion

tier 1 · 1% traffic
  duration       ≥ 30 minutes (or ≥ 5,000 requests, whichever is later)
  promote when   eval drift < 1 pp · p99 latency < SLO · cost < budget · errors stable
  rollback when  any gate violated for ≥ 5 minutes
  who watches    on-call engineer + deploy owner

tier 2 · 10% traffic
  duration       ≥ 4 hours (or ≥ 50,000 requests)
  promote when   all tier-1 gates still hold across the larger sample
  rollback when  any gate violated for ≥ 5 minutes
  who watches    on-call engineer + deploy owner + product owner

tier 3 · 50% traffic
  duration       ≥ 24 hours (covers a daily cycle of traffic mix)
  promote when   all gates hold across the daily peak
  rollback when  any gate violated for ≥ 10 minutes
  who watches    on-call engineer (passive — auto-rollback wired)

tier 4 · 100% traffic
  duration       ongoing
  watch for      first 72 hours per the post-deploy runbook (§07)
  rollback when  any gate violated for ≥ 10 minutes
  who watches    on-call rotation + automatic rollback

# excluded by default at every tier — opted in last
  · high-value tenants (named allowlist)
  · regulated-workload tenants
  · tenants with active escalations

# promotion is a deliberate command — never time-only
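
A minimal sketch of how one tier gate can be evaluated, assuming the four metrics are already aggregated per tier. The Metrics fields, thresholds, and tier constants mirror the tier-1 gates above, but the names are illustrative; note that a healthy gate only makes the tier promote-eligible — the promotion itself stays a deliberate human command:

# sketch: one canary tier gate — a promote/hold/rollback decision, never a timer
from dataclasses import dataclass

@dataclass
class Metrics:
    eval_drift_pp: float        # accuracy drop vs baseline, in percentage points
    p99_latency_s: float
    cost_per_request: float
    error_rate_delta: float     # change vs baseline error rate
    minutes_observed: float
    requests_observed: int
    minutes_in_violation: float # how long any gate has been breached

TIER_1 = {"min_minutes": 30, "min_requests": 5_000, "rollback_after_min": 5}

def gate_decision(m: Metrics, p99_slo_s: float, cost_budget: float, tier: dict = TIER_1) -> str:
    healthy = (
        m.eval_drift_pp < 1.0
        and m.p99_latency_s < p99_slo_s
        and m.cost_per_request < cost_budget
        and m.error_rate_delta <= 0.0      # "errors stable" = no increase vs baseline
    )
    if not healthy and m.minutes_in_violation >= tier["rollback_after_min"]:
        return "rollback"                  # automatic — wired to the kill-switch, not an operator
    if healthy and m.minutes_observed >= tier["min_minutes"] and m.requests_observed >= tier["min_requests"]:
        return "promote-eligible"          # a human still issues the promotion command
    return "hold"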

Four-tier canary release · blast radius vs trust earned
Tier durations are minimums — extend per workload risk and traffic mix.
  · Tier 1 · 1% canary · smallest blast radius · catches the worst regressions in dozens of users · 30+ minutes
  · Tier 2 · 10% canary · daily-cycle subset · confirms stability across input mix · 4+ hours
  · Tier 3 · 50% canary · full daily cycle · stress test under peak load · 24+ hours
  · Tier 4 · 100% rollout · full traffic · 72-hour post-deploy runbook active · auto-rollback wired

The single most common canary mistake we see is timer-only promotion. A canary that promotes after 30 minutes regardless of metric health is not a canary — it is a slow rollout. The gates are what make it a canary. The pager on the on-call rotation is what makes the gates real.

06 · Feature Flags: Per-user, per-tenant, per-workload flags.

Feature flags are the layer that decouples deploy from release. Without flags, every code deploy is a release of the behavior it contains; rolling that behavior back means another deploy under pressure. With flags, the deploy ships the code dark, the release turns on the behavior, and the rollback flips the same switch. Stage 6 agents need three flag granularities; using fewer narrows the options when an incident lands.

Granularity 1
Per-user flag
Allowlist · denylist · percentage roll-out

Toggle behavior for a single user or a percentage of users. Used for internal testing, beta cohorts, and emergency exclusion of a specific user reporting an issue. The most precise scope; the cheapest blast radius.

Smallest scope
Granularity 2
Per-tenant flag
Account-scoped · cohort-scoped

Toggle behavior across an entire tenant — usually a B2B customer or an internal team. Used for staged rollouts to named accounts, contractually distinct behavior, or emergency rollback of a single tenant without affecting others.

B2B / multi-tenant
Granularity 3
Per-workload flag
Workflow-scoped · stage-scoped

Toggle behavior for a specific agent workflow or a specific stage within a workflow. Used to swap a model in one stage without changing the rest, A/B test prompt variants, or disable a single high-risk step pending review.

Granular control

All three granularities should be readable in a single audit log keyed on the request ID. When an incident asks "why did this request behave this way?", the answer comes from the flag state at request time, not from reconstructing the deploy history. The audit trail is the difference between a five-minute triage and a four-hour archaeology project.
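
A minimal sketch of how the three granularities and the audit trail can fit together, assuming an in-memory flag table and most-specific-first resolution (user, then tenant, then workload, then default). The flag name, scopes, and resolution order are illustrative choices, not a prescribed design:

# sketch: per-user / per-tenant / per-workload flag resolution, audit-logged per request
import json
import time

FLAGS = {
    "new_planner": {
        "default": False,
        "per_user": {"qa-bot@internal": True},         # per-user allowlist
        "per_tenant": {"acme-corp": False},            # per-tenant carve-out
        "per_workload": {"invoice_triage": True},      # per-workload override
    }
}

def resolve(flag: str, user: str, tenant: str, workload: str, request_id: str) -> bool:
    cfg = FLAGS[flag]
    for scope, key in (("per_user", user), ("per_tenant", tenant), ("per_workload", workload)):
        if key in cfg[scope]:
            value, source = cfg[scope][key], scope
            break
    else:
        value, source = cfg["default"], "default"
    # the audit record is what turns incident triage into a lookup, not archaeology
    print(json.dumps({"ts": time.time(), "request_id": request_id, "flag": flag,
                      "value": value, "resolved_from": source}))
    return value

# usage: resolve("new_planner", "alice@acme.com", "acme-corp", "invoice_triage", "req-123")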

Pick a flag platform that supports targeting rules and percentage rollouts natively — homegrown flag tables turn into a maintenance burden by month three. The vendor or open-source choice is a Stage 4 decision; the integration is a Stage 6 deliverable; the audit and clean-up cadence is Stage 8 (governance).

Decouple deploy from release
The unspoken rule across mature Stage 6 deploys: every behavior change ships behind a flag, defaulted off, even when the plan is to enable it the same day. The flag is the rollback. The flag is the canary mechanism. The flag is the tenant carve-out. A codebase without flags is a codebase that re-deploys to roll back.

07 · Post-Deploy: Verification runbook for the first 72 hours.

The deploy is not finished when traffic reaches 100%. The first 72 hours after full rollout is when subtle regressions surface — the ones too small to trip an alert immediately but large enough to matter by week two. The post-deploy runbook below is the structured walkthrough Digital Applied runs on every Stage 6 deploy. Owners are named, intervals are explicit, and the close-out hands cleanly to Stage 7 enablement.

# stage-6 post-deploy verification runbook · 72 hours

## hour 0 — full rollout reached
  · deploy owner posts confirmation in #deploys with dashboard links
  · on-call rotation acknowledges hand-off (this deploy is the active surface)
  · automatic rollback verified armed (force a synthetic trip in staging)
  · customer-facing changelog published (if user-visible change)

## hour 1 — first quality pass
  · sample 100 production requests across stages, score against eval rubric
  · review error log for new error classes (codes / messages / stack patterns)
  · check cost burn rate vs pre-deploy estimate — flag any > 20% deviation
  · confirm flag audit log writes are flowing (per-user / tenant / workload)

## hour 4 — first traffic-mix pass
  · review p50 / p95 / p99 latency per stage vs pre-deploy baseline
  · review eval drift metric — is the streaming accuracy stable across cohorts
  · review tenant-level cost; any tenant > 2× their pre-deploy share?
  · review tail-latency ratio (p99 / p50) — climbing = upstream degradation

## hour 24 — first full daily cycle
  · full eval rerun against a 1,000-request production sample
  · review off-peak behavior (different input distribution than peak)
  · review canary-excluded tenants for opt-in candidacy
  · publish day-1 deploy report (eval / latency / cost / error vs baseline)

## hour 72 — close-out
  · 72-hour deploy report published (link in deploy ticket)
  · open issues triaged with owners and dates
  · auto-rollback configuration reviewed; thresholds adjusted if needed
  · hand-off to stage 7 (team enablement) scheduled within 7 days
  · retrospective scheduled — what to repeat, what to change next deploy

The 72-hour window is calibrated empirically: it is the timeframe in which the majority of post-deploy regressions we have seen on agentic systems surface — drift in tail distributions, cost spikes tied to less-common input classes, errors that only trip on the third daily peak, behavior changes that only register once a specific tenant uses a specific workflow. After 72 hours the surface area of the deploy is well characterised and the system enters steady-state ownership under Stage 7.
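
The day-1 deploy report from the hour-24 step can be as small as a four-axis comparison against the pre-deploy baseline. A minimal sketch, with illustrative metric names and numbers:

# sketch: day-1 deploy report — eval / latency / cost / error vs pre-deploy baseline
BASELINE = {"eval_accuracy": 0.94, "p99_latency_s": 3.2, "cost_per_request": 0.011, "error_rate": 0.008}
DAY_1    = {"eval_accuracy": 0.93, "p99_latency_s": 3.4, "cost_per_request": 0.013, "error_rate": 0.007}

def deploy_report(baseline: dict, observed: dict) -> str:
    lines = ["day-1 deploy report · candidate vs baseline"]
    for axis, base in baseline.items():
        now = observed[axis]
        delta = (now - base) / base * 100          # percentage change vs baseline
        lines.append(f"  {axis:<18} {base:>8.3f} -> {now:>8.3f}  ({delta:+.1f}%)")
    return "\n".join(lines)

print(deploy_report(BASELINE, DAY_1))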

08 · Next Stage: Hand-off to team enablement (Stage 7).

Stage 6 delivers a production system; Stage 7 makes it supportable. The clean hand-off has three deliverables: a deploy report (eval, latency, cost, error vs baseline across the 72-hour window), a runbook artefact that the on-call rotation can follow without the deploy owner, and a named owner pairing for every workflow the deploy touched. Without those three, Stage 7 inherits an undocumented system and the next deploy has the same problem.

For broader context, the resilience layer that sits underneath every Stage 6 deploy is covered in our agentic workflow resilience audit (70-point checklist). Stage 6 is where the resilience layer earns its keep — timeouts, retries, rollback, observability all become production primitives the moment real traffic touches them.

If you want the Stage 6 templates run for you end-to-end, our AI transformation engagements ship the deploy checklist, the rollback plan, the monitoring stack, the canary mechanic, the feature-flag layer, and the 72-hour verification runbook as a single Stage 6 package, with the Stage 7 enablement hand-off scheduled before the canary opens.

Conclusion

Production deploy is the checklist — release is the canary.

The Stage 6 gap is the consistent one across agentic AI programmes in 2026: the prototype is good, the eval is green, and the deploy is improvised. The same teams that spent weeks on Stage 5 evaluation often spend a single afternoon on Stage 6 deploy, and the production weekend produces the incident that could have been a 1% canary blip. The kit above exists because we have watched the same gap close the same way every time.

None of the thirty checks is exotic. None of the four canary tiers is novel. The feature-flag pattern is a decade old; the 72-hour runbook is calibrated against observed incident windows on agent systems specifically. What is new is the combination — checklist, rollback, monitoring, canary, flags, runbook — applied as a single Stage 6 gate before any agent workflow sees production traffic. That gate is what turns capability into operability.

Practical next step: pick one agent workflow that is about to deploy and run the thirty-point checklist against it today. Almost every team finds a missing item; almost every team can close the gap in a single sprint. The remaining items — the canary mechanic, the auto-rollback wiring, the 72-hour runbook — are what separate a deploy that survives a Friday from a deploy that creates one.

Ship production-grade

Production deploy is the checklist — release is the canary.

Our team runs Stage 6 production deploys — checklist, rollback, monitoring, canary, feature flags, verification — and hands off to team enablement.

Free consultation · Expert guidance · Tailored solutions
What we deliver

Stage 6 production deploy

  • 30-point production deploy checklist
  • Rollback and kill-switch design
  • Monitoring stack design (eval / latency / cost / error)
  • Canary release rollout playbook
  • Post-deploy 72-hour verification runbook
FAQ · Stage 6 deploy

The questions teams ask before production.

How do I know a prototype is ready for a Stage 6 production deploy?
When all five Stage 6 capability-gate checks have signed-off owners and all twenty-five operability checks (deploy hygiene, rollback readiness, monitoring stack, canary mechanics, post-deploy hygiene) are green. A green eval suite alone is the first gate, not the only gate. Practical signal: the prototype is ready when an unrelated engineer can read the rollback runbook, follow it cold, and return the system to the prior version without paging the deploy owner. If that is not true, Stage 6 is not done — the prototype is ready to start Stage 6, not to ship past it. When teams conflate the two, the failure shows up consistently in the same way: the first incident exposes the gap between capability and operability, and it is owned by the deploy owner personally rather than by the on-call rotation.