An agentic AI prototype that ships to production looks almost identical to one that quietly dies in a Slack channel — until you inspect the eval harness. Stage 5 of the agentic AI implementation pipeline is the line where the program either becomes measurable and inheritable or stays a demo nobody wants to maintain. This kit ships the templates that keep prototypes on the production side of that line.

The default Stage 5 failure mode is recognisable: a small team builds a working prototype in two weeks, runs a slick demo for leadership, hears applause, and then watches the project drift for three months because nobody can answer the question is it actually good enough to ship. Evals were never written, success criteria were never defined, and the demo evidence was vibes. By the time the next quarterly review lands, the prototype is forgotten — or worse, in production without measurement, accumulating regressions silently.

This guide covers the five Stage 5 artifacts in order: the prototype brief, the eval harness, the success criteria framework, the prototype-to-production gate checklist, and the demo script. Every section ships a template. The pipeline hand-off to Stage 6 (production deploy) closes the kit. Read it once; reuse the templates for every prototype your team builds.

Pipeline navigation · 10 stages

The agentic AI implementation pipeline. Stage 5 sits between vendor selection and production deploy — the eval-first prototype is what makes the rest of the pipeline survivable.01Readiness assessment 02Strategy & roadmap 03Data foundation 04Vendor selection 05Prototype 06Production deploy 07Team enablement 08Governance 09Scale 10Continuous improvement

Key takeaways

01
Eval harness before prototype.The eval harness — Promptfoo, DeepEval, or RAGAS depending on archetype — is the first artifact, not the last. Building it first forces the team to answer 'what would success look like' before any code gets written.
02
Success criteria written before build.Quantitative and qualitative criteria, both signed off by the business owner, locked before the prototype starts. Mid-build redefinition is the single most common reason prototypes never reach a clean go/no-go decision.
03
Gate checklist prevents demo-only prototypes.Twenty checks across five axes — eval coverage, success-criteria pass, deployability, observability, ownership. The checklist is the line that separates a prototype with a future from a demo-driven dead end.
04
Demo script uses eval evidence not vibes.Three-minute narrative that references the eval scores, success-criteria thresholds, and gate-checklist state — not a curated happy-path walkthrough. Executives can tell the difference; the eval evidence is what unlocks the deploy budget.
05
Prototype-to-production gate is a measurable line.Crossing into Stage 6 is a numeric event, not a meeting outcome. When the gate-checklist score crosses the agreed threshold and the success criteria are green, the program ships. When either fails, the team iterates with a known target rather than negotiating scope.

01 — Why Stage 5Prototypes ship in demos; production-bound prototypes ship with evals.

Stage 5 is the most consequential stage in the pipeline because it is the first stage where the program produces a concrete artifact that either survives or doesn't. Stages 1 through 4 deliver documents — readiness scores, roadmaps, data audits, vendor scorecards. Stage 5 delivers software. The transition from paper to code is where many programs discover that the previous four stages described a different problem than the one the prototype is now trying to solve.

The right framing for Stage 5 is not "build a prototype." It is "build an eval harness and a prototype that runs against it." That single inversion changes every downstream decision. When the eval harness comes first, the prototype builds toward a measurable target rather than a felt sense of completion. When the success criteria are written before the build, mid-build scope creep becomes visible rather than invisible. When the gate checklist is agreed in advance, the deploy decision is data-driven rather than political.

The cost of getting Stage 5 wrong is not a wasted prototype — it is a prototype that ships into production without measurement, accumulating quality debt for the next twelve months. Every audit we run finds at least one such system: a prototype that became production because the demo went well, with no eval rails, no success-criteria tracking, and an owner who has since moved teams. The Stage 5 templates exist to prevent that pattern, not to slow the prototype down.

"Stage 5 is the stage where evidence replaces enthusiasm. Every artifact in this kit exists to make that swap happen before the demo, not after."— Common refrain from agentic AI implementation engagements

The five artifacts in this kit are sequenced deliberately. The brief defines the problem and the hypothesis. The eval harness makes the hypothesis testable. The success criteria define what passing looks like. The gate checklist makes the production hand-off measurable. The demo script makes the evidence narratable. Skip any one of the five and the prototype is structurally weaker — most often, teams skip the eval harness because it feels slow, and then spend the next quarter recovering from that decision.

02 — BriefProblem, scope, hypothesis.

The prototype brief is the artifact that prevents Stage 5 from inheriting a fuzzy mandate from Stages 1 through 4. It is a single page — deliberately short — that names the problem the prototype addresses, the scope boundaries, the hypothesis the prototype is testing, and the explicit non-goals. The brief is signed off by the business owner, the technical owner, and the AI lead before any code is written.

The hypothesis is the part teams skip the most. A prototype without a hypothesis is a feature build; a prototype with a hypothesis is a test. The difference matters because tests have outcomes — they pass or fail against the eval harness — whereas feature builds only have completion states. The hypothesis should be specific enough that the eval harness can measure whether it held, and falsifiable enough that the team agrees in advance what disconfirmation looks like.

Template · prototype brief (one page)

# Prototype Brief · <feature name> ## Problem One paragraph. The specific business problem this prototype addresses, written so a non-AI stakeholder can recognise it. ## Hypothesis "We believe that <agentic capability> can <do X> for <user persona> to achieve <outcome> — measured by <leading indicator>." ## Scope (in) - Concrete capability 1 - Concrete capability 2 - Single data source · single user persona · single workflow ## Non-goals (out) - What this prototype will NOT do - What it will NOT replace - What it will NOT integrate with at this stage ## Success criteria (link) → docs/prototypes/<name>/success-criteria.md ## Eval harness (link) → docs/prototypes/<name>/eval-harness.md ## Owners - Business owner: <name> · final go/no-go - Technical owner: <name> · build accountability - AI lead: <name> · eval and model decisions ## Timeline - Brief signed off: YYYY-MM-DD - Eval harness ready: YYYY-MM-DD (before any prompt is written) - Prototype freeze: YYYY-MM-DD - Gate review: YYYY-MM-DD - Stage 6 hand-off: YYYY-MM-DD (if gate passes)

Two operational discipline points are worth flagging. First, the brief lists the eval-harness readiness date beforethe prototype-freeze date — the eval comes first, not last. Second, the brief explicitly names a Stage 6 hand-off date that is conditional on the gate passing. Naming the conditional from the start prevents the political hand-wave where the prototype "ships" without ever passing the gate.

For the briefing template at scale, our AI transformation engagements embed this artifact as the first deliverable of every Stage 5 engagement, with the eval harness and success criteria sequenced as the next two. The pattern is reproducible across archetypes — agentic SDR, document agent, customer-support copilot — with the archetype-specific differences handled inside the eval harness, not the brief.

03 — Eval HarnessPromptfoo, DeepEval, RAGAS — pick by archetype.

The eval harness is the single highest-leverage artifact in the entire Stage 5 kit. It is the contract the prototype builds against and the line that separates a measurable system from a vibes-driven one. The harness must be runnable before the first prompt is written; the test cases drive the prompt design, not the other way around.

Three open-source frameworks cover the majority of agentic AI prototypes. The choice is driven by the prototype archetype, not personal preference. Pick the wrong framework and the harness still works, but the team spends extra effort fighting the tool instead of measuring the system.

Generation tasks

Promptfoo · for prompt-heavy agents

YAML-first declarative evals. Best when the prototype is a generation, classification, or extraction system with a stable prompt surface. Quickest on-ramp from zero; pairs cleanly with any CI. Default choice for SDR, copywriting, and structured-output prototypes.

Pick Promptfoo

Multi-step agents

DeepEval · for tool-using systems

Pytest-style with built-in metrics for faithfulness, answer relevance, contextual recall, and tool-call correctness. Strongest when the prototype involves multi-step reasoning, tool use, or agent loops where individual step quality matters as much as the final output.

Pick DeepEval

RAG systems

RAGAS · for retrieval-augmented prototypes

Purpose-built metrics for retrieval-augmented generation: context precision, context recall, faithfulness, answer correctness. Strongest when the prototype is a document agent, knowledge-base assistant, or any system whose quality depends on retrieval first and generation second.

Pick RAGAS

Mixed archetype

Two frameworks side by side

Some prototypes are genuinely hybrid — a RAG step followed by a multi-step agent loop, for example. In that case run RAGAS against the retrieval layer and DeepEval against the agent loop, with a shared test-case set. Compose the two scores into a single gate metric. The cost is moderate; the visibility into where the system breaks is significant.

Compose RAGAS + DeepEval

The number of test cases per prompt at the prototype stage is modest — twenty to fifty per critical path, covering happy path, known edge cases, and at least three deliberately adversarial inputs designed to stress the hypothesis. The point is not exhaustive coverage; the point is enough coverage to detect regression. Coverage breadth is a Stage 6 concern, not a Stage 5 one.

The eval harness output feeds two artifacts directly: the success criteria (Section 04, where the numeric thresholds come from) and the demo script (Section 06, where the eval evidence becomes the demo's narrative spine). Without a working harness, the downstream artifacts cannot exist in usable form. This is why the sequencing matters — the harness is dependency for everything else.

For a deeper treatment of eval framework selection and 100-point library audits, our prompt library audit framework covers the institutional discipline that grows out of one good eval harness. Stage 5 plants the seed; the audit framework grows the forest.

04 — Success CriteriaQuantitative + qualitative.

Success criteria are what turn a prototype from a feature into a test. They are written before the build, signed off by the business owner, and locked for the duration of the prototype. Mid-build redefinition is allowed exactly once and requires a written rationale; uncontrolled drift in success criteria is the single most reliable predictor of a Stage 5 program that never reaches a clean go/no-go decision.

The criteria split into two halves. Quantitative criteria are measured directly by the eval harness — accuracy, precision, recall, latency, cost per call, faithfulness score, refusal rate. Qualitative criteria are graded by a small panel of humans on a small rubric — usability, tone, trust, escalation behavior, failure-mode acceptability. Both halves are required; neither is sufficient alone.

Template · success criteria (one page)

# Success Criteria · <feature name> ## Quantitative · measured by eval harness | Metric | Threshold | Floor | Measured by | |---------------------------|-----------|-------|---------------------| | Task accuracy | ≥ 85% | 75% | Promptfoo / RAGAS | | Faithfulness (RAG) | ≥ 0.85 | 0.75 | RAGAS | | Tool-call correctness | ≥ 95% | 90% | DeepEval | | P95 latency | ≤ 4.0 s | 6.0 s | Harness timing | | Cost per task | ≤ $0.04 | $0.08 | API spend metric | | Refusal rate (in-scope) | ≤ 3% | 8% | Promptfoo assertion | ## Qualitative · human-graded · 5-person panel · 1–5 scale | Dimension | Target avg | Floor | |---------------------------|------------|-------| | Output usability | ≥ 4.0 | 3.5 | | Tone fit | ≥ 4.0 | 3.5 | | Trust / hallucination | ≥ 4.5 | 4.0 | | Failure-mode acceptability| ≥ 4.0 | 3.5 | | Escalation behavior | ≥ 4.0 | 3.5 | ## Pass rule ALL quantitative metrics ≥ threshold OR ALL quantitative metrics ≥ floor AND qualitative avg ≥ target. ## Hard fail rule ANY metric below floor → prototype does NOT pass the gate, regardless of other scores. ## Signed off - Business owner: <name> · <date> - Technical owner: <name> · <date> - AI lead: <name> · <date>

The two-threshold model — target plus floor — is what makes the criteria operationally useful. A single threshold creates a binary outcome; a target-plus-floor creates a three-zone outcome (clearly passing, conditionally passing with remediation, failing). The floor row is what prevents a prototype from sneaking through with one excellent metric covering several mediocre ones.

The hard-fail rule deserves explicit attention. Most prototype programs allow individual metric failures to be argued away in the gate meeting ("the trust score is low but the usability is great"). The hard-fail rule short-circuits that argument — any metric below its floor blocks the gate regardless of the others. The rule exists to protect the program against its own optimism.

05 — Gate ChecklistTwenty checks that unblock deploy.

The prototype-to-production gate checklist is the artifact that makes the Stage 5 to Stage 6 hand-off measurable rather than political. Twenty checks split across five axes — four each. A prototype passes the gate when at least sixteen of twenty are green and zero are blocked. Anything below that returns to the team with a numeric target rather than a negotiation.

The five axes mirror the structure of the prompt library audit framework, deliberately. Stage 5 is where the institutional discipline of evals, observability, and ownership starts; the gate is the first inflection point where that discipline becomes visible. Skipping the gate is how teams end up at Stage 1 of the audit framework instead of Stage 3.

Eval

Eval coverage

4 checks · harness, cases, CI, baseline

Harness runs against the prototype on every PR. At least 20 test cases per critical path. CI integration verified. 7-day baseline of nightly eval scores established before the gate.

Foundation

Criteria

Success-criteria pass

4 checks · quant, qual, floor, signoff

All quantitative metrics meet threshold or floor. Qualitative panel completed by 5 humans. No metric below its hard floor. Business owner has signed off on the score sheet.

Outcome

Deploy

Deployability

4 checks · package, rollback, flag, kill

Prototype is deployable behind a feature flag in the production environment. Rollback path is tested. Kill-switch is wired. Deploy artifact is reproducible from a tagged commit.

Mechanical

Observe

Observability rails

4 checks · logs, traces, eval cron, alerts

Structured logging for inputs, outputs, and tool calls. Trace IDs propagate. Nightly eval cron configured for production prompts. Alert routing to the named owner on regression.

Visibility

Own

Ownership and runbook

4 checks · owner, runbook, on-call, escalation

One named human owner per prototype. Runbook covers the top five known failure modes. On-call rotation includes the prototype. Escalation path to the AI lead and the business owner is documented.

Human

The gate score is reported as a single number — 17 of 20, 19 of 20 — alongside the per-axis breakdown. Reporting the breakdown matters because the same total can hide very different shapes; a prototype that scores 17/20 with zero in observability is a very different artifact from one that scores 17/20 evenly. The production-deploy team in Stage 6 reads both numbers, not just the total.

For programs that want a deeper treatment of the deploy-side mechanics — feature flags, kill-switches, canary releases — our companion Stage 6 production deploy kit picks up exactly where this gate ends. The two stages are designed to compose: Stage 5 produces a gated artifact, Stage 6 ships it under measurement.

06 — Demo ScriptThree-minute narrative with eval evidence.

The demo script is the artifact that turns the eval evidence into something a non-engineering executive can act on. The hard rule: three minutes, eval-evidence-driven, no curated happy path. The soft rule: the script is rehearsed at least twice before it is performed, and rehearsed against the real eval harness, not a mocked-up version.

Most prototype demos fail one of three ways. They are too long and lose the room. They are happy-path-only and trigger the executive's "what could go wrong" reflex without answering it. Or they show enthusiasm without evidence — the presenter is excited, the screens look good, but no measurable claim is made. The three-minute structure below addresses all three failure modes by design.

Template · three-minute demo script

# Demo Script · <prototype name> # Total time: 3 minutes · structure: 30 + 60 + 45 + 30 + 15 [00:00–00:30] FRAME (30 s) - "Today I'm showing the <prototype name> against the success criteria we agreed on <date>." - One sentence: the hypothesis from the brief. - One sentence: the pass rule from the success criteria. [00:30–01:30] LIVE RUN (60 s) - Run ONE happy-path case live. Speak the input, the eval score it achieved on the harness, and the output. - Then run ONE deliberately adversarial case live. Show how it scored. - Do NOT curate. The demo is on the harness, not on hand-picked inputs. [01:30–02:15] EVAL EVIDENCE (45 s) - Show the success-criteria table with the actual numbers filled in. - Read the line: "X of 6 quantitative metrics meet threshold; Y of 5 qualitative metrics meet target; Z metrics below floor." - State the gate score: "<n> of 20 on the gate checklist." [02:15–02:45] LIMITATIONS (30 s) - Name the top two known failure modes from the harness. - State what we will and will not commit to in Stage 6. - Reference the kill-switch and rollback path. [02:45–03:00] ASK (15 s) - The decision being requested: green-light Stage 6 deploy under feature flag, or remediation cycle with named target. - The next milestone date.

The discipline of running an adversarial case live, alongside a happy-path case, is what separates the eval-driven demo from every demo the audience has seen before. Executives learn quickly that a demo that includes its own failure modes is substantially more trustworthy than one that doesn't. The paradox is that showing failure increases the probability of green-lighting the next stage, because the trust premium outweighs the embarrassment of the failed case.

"A demo that includes its own failure modes is more trustworthy than a demo that doesn't — and the trust premium is what unlocks the Stage 6 budget."— Pattern observed across agentic AI gate reviews

07 — Anti-PatternsDemo-driven prototypes that never ship.

Three anti-patterns account for most Stage 5 failures. They are recognisable in retrospect and avoidable in advance, but each one requires deliberate counter-discipline. The templates in this kit exist to make the counter-discipline default rather than heroic.

Pattern 01

70%

Demo-first, eval-later

Team builds a prototype, runs a slick demo, then promises evals as a Stage 6 task. Stage 6 inherits an unmeasured artifact, accumulates eval debt, and either ships with regressions or stalls. The fix: the eval harness is the first artifact, not the last. Brief lists harness-ready date before prototype-freeze date.

Most common failure

Pattern 02

20%

Mid-build criteria drift

Success criteria are written but unstable — every weekly check-in renegotiates the thresholds based on what the prototype currently scores. The gate becomes meaningless because the bar moves with the artifact. The fix: criteria are locked at brief signoff. Mid-build redefinition is allowed once with a written rationale, never silently.

Politically subtle

Pattern 03

10%

Owner-less prototype

The prototype was built by a contractor, an intern, or a cross-functional pod that has since disbanded. By the time the gate review lands, no named human owner exists. The prototype either ships into production unmaintained or dies in committee. The fix: ownership is a brief field and a gate check, not an afterthought.

Most expensive to fix

The demo-first anti-pattern is the most common and the most insidious because each individual decision looks reasonable at the time. Skipping the eval harness saves a week. Pushing it to Stage 6 sounds like a sensible deferral. The accumulated cost only becomes visible three to six months later, when the production system has shifted underneath the prototype and nobody can tell whether quality has degraded or whether it was always this way.

The criteria-drift anti-pattern is the most politically sensitive. Renegotiating thresholds mid-build feels like healthy pragmatism — the team is learning what is achievable and updating the targets accordingly. The problem is that updates always trend in one direction (down), and the gate review becomes a performance rather than a test. The counter-discipline is procedural: criteria changes require written rationale and re-signoff from the business owner. That small friction is enough to stop the silent drift.

The owner-less anti-pattern is the most expensive to fix because it requires retroactively recruiting an owner for something that may already be in production. The right time to address it is at brief signoff; the second-best time is at the gate review; the worst time is six months later when the production system needs an upgrade and no one knows the prototype well enough to make the call.

08 — Next StageHand-off to production deploy (Stage 6).

A prototype that passes the gate is not ready for production — it is ready for Stage 6. The distinction matters. Stage 6 converts the gated artifact into a deployed system under measurement, behind a feature flag, with the rollback and kill-switch rails wired live rather than tested in staging. What Stage 5 delivers is the right thing to deploy; Stage 6 delivers the right way to deploy it.

Stage 5 outputs · Stage 6 inputs

Source: Digital Applied Stage 5 / 6 hand-off pattern

Brief signed offProblem, hypothesis, scope, non-goals, owners locked

Stage 5

Eval harness runnablePromptfoo / DeepEval / RAGAS configured, 20+ cases per path

Stage 5

Success criteria passedQuant + qual scored, no metric below floor, signoff captured

Stage 5

Gate checklist ≥ 16/20Eval, criteria, deploy, observability, ownership all green

Stage 5

Demo delivered with evidenceThree-minute eval-driven narrative, decision logged

Stage 5

Production deploy under flagCanary release, monitoring live, kill-switch armed

Stage 6

The five Stage 5 artifacts become the five Stage 6 inputs. The brief informs the rollout communication. The eval harness becomes the nightly cron and the regression dashboard. The success criteria become the production SLOs. The gate checklist becomes the deploy-readiness rubric. The demo script becomes the rollout announcement. Every artifact compounds; nothing is thrown away at the stage boundary.

For programs ready to make the hand-off, our Stage 6 production deploy templates cover the deploy-side mechanics in the same template-driven shape. The two kits are designed to be used together — independently usable, but most powerful in sequence.

Conclusion

Prototype quality is eval quality — everything else is a demo.

Stage 5 is the most consequential stage in the agentic AI pipeline because it is the first stage that produces software rather than documents. The artifact a Stage 5 team delivers either becomes the foundation of a measurable production system or it becomes a demo nobody remembers in six months. The difference between the two outcomes is not the cleverness of the prototype — it is whether the eval harness was first or last.

The five templates in this kit are deliberately minimal. The brief is one page. The eval harness uses an open-source framework. The success criteria fit on a single sheet. The gate checklist is twenty items across five axes. The demo script runs three minutes. None of this is heavy. What makes the kit work is the sequencing — eval before prototype, criteria before build, gate before deploy, evidence before narrative — and the discipline to not skip the parts that feel slow.

What to do next: pick the most important prototype on your team's roadmap. Write the brief this week. Stand up the eval harness next week. Lock the success criteria before any prompt gets written. By the time the prototype is ready for a gate review, the artifact is already measurable — and that is the single most important thing Stage 5 can deliver. Then hand off to Stage 6 with the rails wired, not promised.

Agentic AI Prototype Templates: Stage 5 Pipeline Kit

01 — Why Stage 5Prototypes ship in demos; production-bound prototypes ship with evals.

02 — BriefProblem, scope, hypothesis.

03 — Eval HarnessPromptfoo, DeepEval, RAGAS — pick by archetype.

Promptfoo · for prompt-heavy agents

DeepEval · for tool-using systems

RAGAS · for retrieval-augmented prototypes

Two frameworks side by side

04 — Success CriteriaQuantitative + qualitative.

05 — Gate ChecklistTwenty checks that unblock deploy.

Eval coverage

Success-criteria pass

Deployability

Observability rails

Ownership and runbook

06 — Demo ScriptThree-minute narrative with eval evidence.

07 — Anti-PatternsDemo-driven prototypes that never ship.

Demo-first, eval-later

Mid-build criteria drift

Owner-less prototype

08 — Next StageHand-off to production deploy (Stage 6).

Stage 5 outputs · Stage 6 inputs

Prototype quality is eval quality — everything else is a demo.

Prototypes that ship start with the eval harness, not the demo.

Stage 5 prototype engagements

The questions teams ask before shipping the first prototype.

Continue the pipeline.

Agentic AI Glossary: 200 Essential Terms for 2026

AI Evaluation Metrics Reference Guide 2026: 80 Metrics

Agent Success Rate (ASR): The Measurement Framework