A prompt library team rollout is the 90-day program that takes a team from a scattered folder of prompts to a versioned, eval-covered, regression-monitored asset the team will still own in twelve months. The plan stages naming and versioning in the first month, eval suites and regression cron in the second, and governance gates plus team training in the third — because prompt libraries rot without discipline, and 90 days is the window where the discipline gets installed before the team drifts back to the old habits.
The failure mode the plan is designed against is not technical. It is organisational. Teams ship a catalog, write three evals, congratulate themselves, and then the regression cron quietly stops firing because nobody named an owner. Six months later, the library looks the same as before the rollout — full of untested prompts nobody dares touch. The 30/60/90 cadence exists to make the social contract as explicit as the technical artifacts, so the work survives the team that ran the rollout.
This guide covers why 90 days is the right horizon, what each 30-day stage actually contains, how to choose between Promptfoo, OpenAI Evals, and DeepEval, the templates the team should ship on day one, and the four pitfalls that derail most rollouts. The plan is technique-agnostic — the named tools are interchangeable examples, not prescriptions. The cadence is what matters.
- 01 — Prompt libraries rot without discipline. Six months of unmanaged edits turn a clean library into a graveyard nobody dares touch. The 30/60/90 plan installs the feedback loops that prevent the rot in the first place — not a cleanup after the fact.
- 02 — Naming and versioning are foundational. Without a parseable version field on every prompt, every other axis collapses. Stage one ships the convention and enforces it in PR review so the rest of the rollout has something stable to build on.
- 03 — Evals are the contract. A prompt without an eval is a wish. Stage two wires at least one passing eval suite into CI for every production prompt — that single artifact turns prompts from tribal knowledge into refactorable engineering.
- 04 — Regression cron catches drift. Nightly runs catch the regressions CI cannot — model-vendor updates, dataset shifts, time-based drift. The cron is the single highest-leverage line item in the entire plan.
- 05 — Lifecycle policy retires zombie prompts. Stage three ships a written lifecycle — deprecate, promote, retire — so prompts have a defined exit ramp instead of accumulating forever. The policy is what keeps the library lean past month twelve.
01 — Why 90 Days
Prompt libraries rot without discipline — 90 days makes it stick.
Ninety days is the right horizon for the same reason every mature operational rollout — incident response, on-call rotations, design-system adoption — converges on roughly the same length. Anything shorter than thirty days does not give the team time to feel the friction of the new conventions during real work; anything longer than ninety drifts because the original sponsor moves on to the next priority. The three-stage cadence is engineered to install one discipline per month and let the previous month's discipline harden into habit before the next one lands.
The thirty-day stage boundaries are not arbitrary. Days 1–30 target the artifacts that have to exist before evals are meaningful — catalog, naming, versioning, ownership. Days 31–60 target the feedback loop itself — eval suites, CI wiring, regression cron, model-fit testing. Days 61–90 target the social and lifecycle layer — governance gates, deprecation policy, training. Trying to compress the order — writing evals before the catalog is clean, or shipping lifecycle policy before evals exist — is the most common reason rollouts fail.
The other reason 90 days works is that it is the natural quarter-end checkpoint. Most leadership conversations about AI quality, AI spend, and AI risk reset on a quarterly cadence. Finishing the rollout at the same beat as the business review means the audit findings, the eval coverage chart, and the regression dashboard land in front of decision-makers during the exact window they care about. The 90-day plan is therefore as much a stakeholder-management instrument as a technical one.
We have run variants of this plan across teams ranging from two-person AI startups to thirty-person product organisations. The shape holds: the first month is heavier than it looks because catalog cleanup is genuinely organisational work; the second month feels productive because evals produce visible artifacts almost immediately; the third month is the most important because that is when the team picks up ownership and starts running the cadence themselves. Skip any one stage and the rollout reverts within two quarters.
For a deeper framework on scoring the library you are inheriting — coverage, versioning, evals, regressions, model-fit — pair this plan with our 100-point prompt library audit. The audit gives you the baseline score; the 90-day plan gives you the cadence to move that score up by at least one maturity stage.
"Ninety days is long enough to install the discipline, short enough that the original sponsor is still in the room when the rollout lands. Both ends matter."— Common framing from prompt rollout engagements
02 — Days 1-30
Catalog audit, naming + versioning, owner assignment.
The first thirty days exist to make the library legible. Until every prompt has a canonical home, a parseable version field, and a named human owner, no downstream work — evals, cron, governance — has anything stable to attach to. The single biggest mistake teams make is racing ahead to eval implementation before the catalog is clean; the evals end up pointing at the wrong file, the wrong version, or a prompt that has already been silently rewritten somewhere else in the codebase.
The five milestones below are sequenced — each one unlocks the next. The audit gives you the universe of prompts. The convention turns the universe into a list with consistent shape. The single-source migration eliminates the duplicate copies that would otherwise corrupt every downstream measurement. CODEOWNERS makes silent edits structurally impossible. The README closes the loop so the next engineer who lands on the team can find the rules without asking.
- Catalog audit and inventory (grep · spreadsheet · stakeholder interviews). Walk the codebase — every service, every notebook, every script. Build a flat spreadsheet: slug, file path, owner guess, model, last-edited date, feature it powers. Surface the duplicates and orphans. Two days of grep, three days of conversations with the engineers who wrote them. Output: prompt inventory CSV.
- Naming + versioning convention (kebab-case slug + parseable version). Pick a single convention and write it down. Recommended: kebab-case slug plus monotonic integer or date-stamp version, e.g. customer-summary-v3.md or onboarding-2026-05-15.md. Format is less important than presence — argue for one day, decide, document. Output: CONVENTIONS.md.
- Migrate to single source of truth (prompts/ directory or DB table). Move every production prompt to one canonical location. Delete duplicates. Replace inline strings in service handlers with imports from the canonical file. The grep test must pass — one prompt, one home, no surprises. Output: prompts/ directory.
- Owner assignment + CODEOWNERS (.github/CODEOWNERS). Every prompt names exactly one human owner — the person who reviews changes, signs off on edits, and gets paged when the prompt regresses. Wire CODEOWNERS so the owner is a required reviewer on any PR touching their prompt (a minimal sketch appears below). Shared ownership is forbidden. Output: CODEOWNERS + owner field.
- README + stage one demo (internal demo + retro). Write the prompts/README explaining the convention, the directory layout, the owner contract, and how to add a new prompt. Run a 30-minute team demo. Capture the friction points in a retro doc — they become the input for stage two scoping. Output: README + retro notes.

The deliverable at day thirty is unglamorous and disproportionately valuable: a clean catalog, a written convention, an enforced owner contract, and a team that has felt the convention during one week of real PR review. Most teams report at the day-30 retro that the convention itself was easy and the inventory was twice as long as they expected — there are always orphan prompts buried in forgotten scripts, and that discovery alone usually pays for the first stage.
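A minimal CODEOWNERS sketch for the owner contract described in the list above — the file paths mirror the naming convention, and the GitHub handles are illustrative; how the frontmatter owner field maps to a reviewer handle is up to the team:

```
# .github/CODEOWNERS — illustrative entries; handles are examples only.
# One prompt, one named human owner, required as reviewer on any PR touching the file.
prompts/customer-summary-v3.md       @alex-chen
prompts/onboarding-2026-05-15.md     @jordan-lee

# Eval suites sit next to the prompt they cover, so a prompt edit and its eval
# edit land in front of the same reviewer.
prompts/evals/customer-summary.yaml  @alex-chen
```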
One pattern worth flagging: the temptation in week one is to also start writing evals because evals feel like the "real work." Resist it. An eval pointing at a prompt that is about to be relocated and renamed will need to be rewritten in week three. Sequence matters; the discipline is to do stage one cleanly before stage two starts.
03 — Days 31-60
Eval suite setup, regression cron, model-fit testing.
The second thirty days install the feedback loop. Stage one made the library legible; stage two makes it measurable. The milestones below are again sequenced — the framework choice comes before any eval is written, the first eval comes before CI wiring, CI wiring comes before the nightly cron, and model-fit testing closes the stage because it is the test that depends on every prior step being in place.
The team that finishes stage two has, at a minimum, one passing eval suite on the highest-stakes prompt in the library, that suite firing on every PR that touches the prompt, and a nightly run firing against production prompts with results routed to owners. That is the single artifact most responsible for the library actually surviving — and it is achievable in a month if stage one was done cleanly.
- Pick the eval framework (Promptfoo · OpenAI Evals · DeepEval). Stack-fit decision. TypeScript-heavy teams default to Promptfoo. Python-heavy AI teams default to DeepEval. OpenAI-hosted-only teams default to OpenAI Evals. Pick once, do not litigate; Section 05 is the full comparison. Output: framework decision doc.
- First eval suite (10 test cases on highest-stakes prompt). Pick the most important prompt in the library. Write ten cases — happy path, edge cases, known failure modes. Score on a 1-5 rubric with two or three criteria and written reasons. Get it passing locally before CI. Output: first passing suite.
- CI integration (GitHub Actions / equivalent). Wire the eval suite to fire on every PR that touches any file in prompts/. Block merge on regression below the rubric threshold. Owner gets the failing-eval comment automatically (a workflow sketch covering this trigger and the nightly schedule appears below). Output: PR-gated eval suite.
- Regression cron + dashboard (nightly · 30-day rolling history). Schedule the full eval suite to run nightly against production prompt versions. Store every score for at least 30 days. Build a small page showing each prompt's score and 30-day trend. Wire alerts to the prompt owner, not a shared channel. Output: nightly cron + dashboard.
- Cross-model fit testing (primary + cheaper alternative). For the top five prompts, evaluate against the cheapest model in the family that might still meet the eval bar. Document the model choice and the eval score. Most libraries discover one-third of prompts can move down a tier — meaningful cost savings. Output: model-fit decision log.

The cron is the single highest-leverage milestone in the entire 90-day plan. CI evals catch regressions introduced by the team; the cron catches regressions introduced by everything else — model vendor updates, dataset shifts, time-based drift, third-party API changes. Teams that wire evals into CI and stop there miss roughly a quarter of the regressions they would otherwise catch. The infrastructure cost is the GitHub Actions schedule trigger and a database row per nightly run; the behavioral change is what matters.
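A minimal sketch of the PR gate and the nightly cron in one GitHub Actions workflow, assuming Promptfoo was the framework picked in the first milestone. The workflow path, job layout, and the score-recording step are illustrative — the equivalent wiring exists in any CI system:

```yaml
# .github/workflows/prompt-evals.yml — illustrative sketch, not a drop-in config.
name: prompt-evals

on:
  pull_request:
    paths:
      - "prompts/**"           # PR gate: any change to a prompt or its eval suite
  schedule:
    - cron: "0 3 * * *"        # regression cron: nightly run against production prompts

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Run every suite under prompts/evals/. The job should fail when a suite
      # regresses below its threshold — confirm the exit-code behaviour of the
      # framework version you pin.
      - name: Run eval suites
        run: |
          for suite in prompts/evals/*.yaml; do
            npx promptfoo@latest eval -c "$suite"
          done
      # Nightly runs additionally persist scores (30-day rolling history) and alert
      # the owner named in the prompt frontmatter. The helper script is hypothetical —
      # the storage and alerting mechanism is up to the team.
      - name: Record scores and alert owners
        if: github.event_name == 'schedule'
        run: ./scripts/record-scores-and-alert.sh
```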
Cross-model fit testing in week eight is the milestone management cares about most because it converts to dollars. For a thirty-prompt library, the typical cost reduction from systematic model-fit testing is 30-50% — not from switching everything to the cheapest model, but from identifying which third of the library can safely move down a tier. The eval suite makes that decision data-driven instead of religious.
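A sketch of the model-fit comparison for one prompt, again assuming Promptfoo: the same test cases run against the current production model and a cheaper candidate, and the side-by-side scores feed the model-fit decision log. Provider identifiers, test content, and rubric wording are illustrative:

```yaml
# prompts/evals/customer-summary-model-fit.yaml — illustrative comparison config.
description: customer-summary v3 — production model vs cheaper candidate
prompts:
  - file://prompts/customer-summary-v3.md
providers:
  - anthropic:messages:claude-sonnet-4-6   # current production model (from the frontmatter)
  - anthropic:messages:claude-haiku-4-5    # cheaper candidate — substitute the tier under test
tests:
  # In practice, reuse the same ten cases as the standing suite; one shown here.
  - vars:
      customer_name: Acme Corp
    assert:
      - type: llm-rubric
        value: Accurately summarises the last 30 days of activity and flags churn risk.
```

Promptfoo runs each test against every listed provider, so the comparison is a single run and the score gap — or the absence of one — is what goes into the decision log.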
"The eval suite is the contract. Once it exists, the prompt becomes refactorable — you can swap models, rewrite wording, change the rubric, and the contract tells you whether the change was an improvement or a regression."— Common refrain from prompt rollout engagements
04 — Days 61-90
Governance gates, lifecycle policy, team training.
The third thirty days install the social layer. Stages one and two produced artifacts; stage three produces the cadence that keeps those artifacts alive after the rollout sponsor moves on. The single biggest reason 60-day rollouts fail where 90-day rollouts succeed is that the social work has not been done — there is no written lifecycle policy, no governance forum, no training material, and the discipline erodes within two quarters.
The milestones below shift the work from individual engineering to organisational design. Governance gates make new prompts a deliberate decision. Lifecycle policy creates the exit ramp that keeps the library lean. Training material spreads the knowledge beyond the engineers who ran the rollout. The final week is a handoff — the team running the rollout deliberately steps back and the prompt-ops cadence runs without them for one full week before declaring the rollout complete.
- Governance gates (new-prompt template + approval). Every new prompt must ship with an eval suite, an owner, a model-fit decision, and a lifecycle target stage. The template is a checklist; the gate is a required reviewer on the introducing PR. No bypass for 'just a small prompt' (an illustrative checklist appears below). Output: new-prompt template.
- Lifecycle policy (promote · deprecate · retire). Written policy for how prompts move between beta, production, deprecated, and archived states. Deprecated prompts have a deletion date or a forwarding pointer. No prompt sits in 'I think we still use this' limbo past 90 days. Output: LIFECYCLE.md.
- Team training (1-hour workshop + reference deck). Live walkthrough covering catalog conventions, how to add a prompt, how to write an eval, how to read the regression dashboard, what to do when an alert fires. The deck becomes onboarding material for any future hire. Output: training deck.
- Governance cadence (30-min monthly forum). Monthly meeting reviewing eval coverage, regression trends, deprecation queue, and any prompts requiring lifecycle decisions. Standing agenda, named chair, decisions written down. The forum is the venue where the discipline survives turnover. Output: standing monthly meeting.
- Handoff + close-out (rollout team steps back). The team that ran the rollout deliberately steps back for one full week. The cadence runs without them. At day 90, a close-out review confirms the team is operating the library independently and identifies the one or two follow-on improvements for the next quarter. Output: close-out report.

The governance gate is the milestone that distinguishes a mature library from one that quietly accumulates technical debt. Without the gate, every new feature can introduce a prompt without an eval, without an owner, without a lifecycle decision — and the library's health metrics drift down month by month. With the gate, those decisions are forced at the moment the prompt is introduced, which is the cheapest possible time to make them.
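One way to make the gate concrete is a dedicated pull-request template the introducing PR has to fill in — the path and wording below are an illustrative sketch, not a prescribed format:

```markdown
<!-- .github/PULL_REQUEST_TEMPLATE/new-prompt.md — illustrative checklist -->
## New prompt checklist
- [ ] Prompt file lives in prompts/ with complete frontmatter (slug, version, owner, model, status, feature, eval-suite, last-edited)
- [ ] Eval suite exists with at least ten cases and passes locally and in CI
- [ ] Model-fit decision recorded: primary model named, cheaper alternative evaluated
- [ ] Lifecycle target stage set (beta or production)
- [ ] Owner added to .github/CODEOWNERS as required reviewer
```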
The lifecycle policy is the milestone leadership thanks you for six months later. The library that exists at day 90 is twenty or thirty prompts. The library that exists at month twelve, without a lifecycle policy, is eighty or ninety prompts because nothing ever gets retired. The policy converts "we shipped a prompt last quarter and probably don't use it anymore" into a defined deprecation queue with deletion dates, which is the only thing that keeps the library lean past the first year.
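The policy itself is prose in LIFECYCLE.md, but the states and gates it defines can be summarised in a compact sketch like the one below — a hypothetical shape, useful mostly as a checklist while writing the real document:

```yaml
# Hypothetical summary of LIFECYCLE.md — the written policy is the artifact;
# this sketch only restates its states, transitions, and gates.
states:
  beta:
    promote_to: production
    gate: passing eval suite + named owner + model-fit decision
  production:
    move_to: deprecated
    gate: owner sign-off + replacement named or deletion date set
  deprecated:
    deletion_date: required        # no open-ended deprecation
    move_to: archived
  archived:
    forwarding_pointer: required   # points at the replacement prompt
unknown_state_limit_days: 30       # no prompt sits unclassified past 30 days
```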
The handoff week is the test of whether the rollout actually took. If the cadence stalls — the cron stops, the monthly forum gets cancelled, alerts go unanswered — the rollout team needs to extend by two weeks and revisit the ownership assignments. If the cadence runs cleanly, the close-out report becomes the artifact leadership uses to justify the next quarter's AI investment, because the library's health is now measurable in a way it has never been before.
05 — Eval Frameworks
Promptfoo, OpenAI Evals, DeepEval.
The eval framework choice in week five is the single irreversible decision in the rollout. Switching frameworks mid-stream is expensive — the test cases have to be rewritten, the CI integration has to be redone, the team has to relearn the syntax. The right approach is to make the decision deliberately on days 31-35, document the reasoning, and not litigate it for at least the rest of the 90 days. The matrix below covers the three frameworks worth considering and the team profile each one fits best.
- Promptfoo — lightweight YAML evals. Best starting point for teams new to evals. Declarative test cases, supports rubric scoring and model-graded judgments, integrates cleanly with CI. The YAML-first format makes eval design accessible to PMs and content engineers, not just engineers. Default recommendation for TypeScript stacks. Best fit: TypeScript-heavy teams.
- OpenAI Evals — structured eval registry. Stronger when the library leans heavily on OpenAI-hosted models and the team wants the eval format to match the format the provider uses internally. Better registry semantics; less convenient for multi-provider stacks. Python-first. Best fit: OpenAI-anchored stacks.
- DeepEval — pytest-style prompt evals. Wraps prompt evals in a pytest-like syntax with built-in metrics for faithfulness, answer relevancy, contextual recall. Fits well when the prompt library already lives in a Python codebase with strong test culture. Lowest friction for Python AI teams. Best fit: Python AI teams.
- Hosted eval suites. Braintrust, LangSmith, HumanLoop and similar offer hosted eval pipelines with dashboards and human-review workflows out of the box. Worth considering after the rollout if the open-source tooling becomes a bottleneck — usually month four or later, not during the initial 90 days. Best fit: post-rollout consideration.

The decision matrix is deliberately blunt because the framework is less important than the discipline of using it. We have seen teams ship excellent libraries with all three open-source frameworks and teams stall with all three. The framework is the vehicle; the cadence is the destination. Pick the one that matches the team's existing testing language and CI shape, document the decision in the framework choice doc, and move on.
One pattern worth flagging: commercial eval platforms usually look attractive during week five when the team is staring at framework choice and wishing somebody else had already built the dashboard. Resist adopting one during the 90-day rollout — the open-source frameworks are good enough, and the team needs to feel the friction of the eval workflow before they can sensibly evaluate which commercial features actually matter. Revisit commercial tooling at month four or five, with usage data in hand.
06 — Templates
Catalog template, eval suite, deprecation policy.
The three templates below are the minimum viable artifacts for the rollout. The catalog template defines the frontmatter every prompt file carries. The eval suite template defines the shape of every Promptfoo YAML the team writes. The deprecation policy defines the lifecycle states and the transitions between them. Copy the templates into the repo on day one of each stage and adapt — they are starting points, not prescriptions.
- Prompt file frontmatter (used in stage 1). Required fields: slug, version, owner, model, status (beta | production | deprecated | archived), feature, eval-suite path, last-edited. Stored as YAML frontmatter on a markdown file or as columns in a database table. Format is up to the team; the fields are mandatory.
- Eval suite skeleton (used in stage 2). Promptfoo YAML with three sections: providers (which model to run against), tests (the ten cases), and rubric (the two or three scoring criteria). Each test has inputs, optional reference output, and a pass condition. The skeleton lives in prompts/evals/EXAMPLE.yaml (a sketch appears at the end of this section).
- Lifecycle policy (used in stage 3). Four states (beta · production · deprecated · archived), the transitions between them, and the gate for each transition. Deprecated prompts have a deletion date. Archived prompts have a forwarding pointer to the replacement. No prompt sits in 'unknown' state past 30 days.

A minimal frontmatter template the team can paste at the top of every prompt file on day one:

```
---
slug: customer-summary
version: 3
owner: alex.chen
model: claude-sonnet-4-6
status: production
feature: dashboard-summary
eval-suite: prompts/evals/customer-summary.yaml
last-edited: 2026-05-15
---
You are a customer-success analyst summarising
the last 30 days of activity for {{customer_name}}...
```

The frontmatter does three things at once. It makes every prompt parseable by tooling — the dashboard, the catalog generator, the lifecycle audit script all read the same schema. It forces the new-prompt template gate at write time — you cannot ship a prompt without filling in the fields. And it gives the regression cron a stable target — the cron reads eval-suite from the frontmatter and knows which YAML to run. A few lines of YAML, three downstream benefits.
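The eval-suite path in that frontmatter points at the skeleton described above. A minimal sketch of what prompts/evals/customer-summary.yaml might contain, assuming Promptfoo — note that in Promptfoo's format the rubric lives inside each test's assertions rather than as a separate top-level section, and the test content below is illustrative:

```yaml
# prompts/evals/customer-summary.yaml — illustrative skeleton, not the full ten-case suite.
description: customer-summary v3 eval suite
prompts:
  - file://prompts/customer-summary-v3.md
providers:
  - anthropic:messages:claude-sonnet-4-6   # matches the model field in the frontmatter
tests:
  # Happy path
  - vars:
      customer_name: Acme Corp
    assert:
      - type: llm-rubric
        value: Accurately summarises the last 30 days of activity in under 150 words.
      - type: llm-rubric
        value: Flags any churn-risk signal with a written reason.
  # Known failure mode: customer with no recent activity
  - vars:
      customer_name: Dormant Industries
    assert:
      - type: llm-rubric
        value: States clearly that there was no recent activity instead of inventing usage.
```

The real suite grows to the ten cases from stage two — happy path, edge cases, known failure modes — with two or three rubric criteria per case and the pass threshold the CI gate enforces.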
For teams scaling beyond fifty prompts, consider sharding the catalog by domain — prompts/marketing/, prompts/support/, prompts/agents/ — each with its own owner team and its own eval suite directory. The frontmatter convention stays the same; the directory structure scales. Above two hundred prompts the sharding is mandatory, not optional, because a single flat directory becomes unreadable.
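An illustrative sharded layout under the same conventions — the domain names follow the examples above, and the exact placement of the shared docs is up to the team:

```
prompts/
  CONVENTIONS.md
  README.md
  marketing/
    evals/
    onboarding-2026-05-15.md
  support/
    evals/
    customer-summary-v3.md
  agents/
    evals/
```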
07 — Pitfalls
Four prompt-library rollout failure modes.
Every rollout we have run has hit at least one of the four pitfalls below. Some teams hit three. Knowing the failure modes ahead of time is the cheapest possible insurance — the frequency chart below shows how often each pitfall appears across engagements, so you can pre-empt the ones most likely to land on your team.
[Chart: pitfall frequency across rollout engagements. Source: typical rates observed across rollout engagements.]

The eval-before-catalog inversion is the most expensive mistake because it wastes the most engineering hours. A team excited about evals writes ten cases against a prompt, gets it passing, then in week three the prompt gets renamed during the single-source migration and the eval suite has to be rewritten. The defence is to enforce stage sequencing — no evals before day 31, even if the team is eager.
The shared-channel alert routing failure is the most insidious because it looks like everything is working — the cron fires, the alerts post — but nothing actually gets investigated. The defence is to wire alerts to the named owner from the catalog frontmatter, not to a shared channel. The owner field should drive the alert routing; if a prompt has no owner, the alert has nowhere to go, and that is the right signal that ownership needs to be assigned before the prompt ships.
For teams approaching this rollout fresh, pair the plan with our broader writing on prompt engineering anti-patterns so the prompts entering the library are themselves clean — there is no point cataloguing and eval-covering prompts that have foundational quality issues. Pre-rollout cleanup of the worst offenders is usually worth a week before stage one even starts. For organisations stepping up to this level of prompt operations, our AI transformation engagements often start with exactly this 90-day plan and walk the cadence with the team.
"A 90-day plan that finishes on day 89 has failed. A 90-day plan that finishes on day 90 with the cadence running for one full week without the rollout team in the room has succeeded."— Internal post-mortem heuristic
Prompt libraries compound — 90 days makes the discipline stick.
The 30/60/90 plan is not a formula for a perfect prompt library. It is the cadence that converts a scattered folder of prompts into an asset the team will still own in twelve months. The artifacts the rollout produces — the catalog, the eval suites, the regression cron, the lifecycle policy — are the visible deliverables. The real outcome is the cadence the team has internalised by day 90, which is what keeps the artifacts alive after the original sponsor moves on.
The pattern we see across rollout engagements is that teams who finish all three stages cleanly are still operating the library at month twelve with eval coverage above seventy percent and a monthly governance forum on the calendar. Teams who compress the rollout into sixty days or skip the lifecycle policy revert to the pre-rollout state within two quarters. The 90-day horizon is engineered around that empirical observation; shorter cadences may feel faster but they fail more often.
What to do next: pick a start date, name a rollout sponsor, block the day-one catalog audit on the calendar, and commit publicly to the day-90 close-out review. The mechanics are well-understood, the tooling is open source, and the organisational change is mostly a matter of writing down conventions that already implicitly exist. The first hour of catalog audit is the most important hour of the entire ninety days — once the library has any feedback loop pointed at it, the rest of the plan follows.