This case study covers a six-month prompt library build at an anonymised platform engineering team — what they started with, what they shipped in which order, and what the library looked like six months in. The headline numbers (200 prompts, 87% eval coverage, near-100% regression catch rate before users noticed) are real but anonymised; the working patterns are exactly what we recommend other teams replicate.
The starting point was familiar. Eighty production-adjacent prompts scattered across five product teams. No naming convention, no version field, no eval suite, no shared understanding of which prompt powered which feature. The team lead could feel the weight of it — every model upgrade triggered a multi-week stabilisation period, every junior engineer's tweak quietly shifted behaviour, every customer complaint took half a day to trace back to a prompt change nobody remembered making. The library had become a graveyard. The engagement existed to turn it back into a library.
What follows is the playbook in seven sections. Section 01 is the situation at month zero. Sections 02 through 05 are the four interventions in the order they happened: naming and versioning, eval suite design, regression cron, lifecycle policy. Section 06 is what the library looked like at month six — the numbers, the culture shift, the things that surprised us. Section 07 is the replication guide for teams at smaller scale. Every concrete decision is one we would make again; every mistake we made along the way is documented in the lessons.
- 01 — Naming + versioning are foundational, not optional. Before any eval can run, every prompt needs a parseable identity and a version field. The convention itself matters less than the discipline — ship a format, refine the format later. The team that argues for three weeks about semver versus date stamps ships nothing.
- 02 — The eval suite is the contract, not the test. Each prompt's eval suite functions as a behavioural contract — what the prompt is supposed to do, expressed in test cases. Once the contract exists, prompts become refactorable, model-swappable, and upgradeable. Without it, every change is a bet.
- 03 — The regression cron catches drift before users do. Daily eval runs against production prompt versions caught vendor model updates, dataset shifts, and silent edits inside a 24-hour window — long before customer-facing impact compounded. The cron is the difference between a measurable library and a measured one.
- 04 — Lifecycle policy retires zombies, prevents bloat. Without an explicit deprecate-promote-retire path, prompts accumulate forever. The team's lifecycle policy retired a quarter of the library in the first quarter — prompts that were powering features nobody used, with nobody noticing.
- 05 — Cadence beats one-time effort. A monthly library review with the prompt owners, an eval-suite quality check, and a clean-up sprint kept the library trustworthy. One-time audits decay; a rhythm sustains. The investment is small once the infrastructure exists.
01 — Situation
Eighty prompts, five teams, zero shared discipline.
The team owned platform-level AI infrastructure for an engineering organisation with around five product groups sitting on top of it. Each product group had been shipping AI features for roughly a year. Every group had its own prompts, its own conventions (which is to say, no conventions), and its own pain points. The platform team had been pulled in because the same conversation kept happening: a model upgrade landed, something broke, nobody knew which prompt was responsible, and the fire drill took days to resolve.
The audit on day one found eighty prompts in active production paths. Roughly thirty of them lived in markdown files inside various service repos. Another twenty were hardcoded as multi-line strings inside handler functions. A dozen sat in a shared Confluence page that engineers copy-pasted into code when they needed them. The remainder were duplicated across multiple locations — the same prompt, in slightly different wordings, in three repos, with nobody able to say which one was canonical. None of them had a version field. None of them had an eval suite. Three had ownership comments; the named owners had left the company.
The most expensive symptom was the slow trickle of silent regressions. A prompt would behave well for weeks, then start producing subtly worse output — slightly more verbose, slightly less factual, slightly off-tone. The product team would notice eventually, usually via a customer support ticket. By the time engineering traced the regression, the git blame would show three commits to the prompt file, none of them flagged as risky, none of them gated by anything more than a code review by an engineer who could not realistically judge whether the prompt was now better or worse. The half-life of trust in the library was measured in days.
The platform team had a credible mandate but limited authority. They could write the conventions; they could not force the product teams to follow them. The engagement design had to accommodate that reality. Rather than a top-down rewrite of every prompt, the plan was incremental: ship the infrastructure, document the conventions, prove the value on three high-stakes prompts, then let the product teams self-onboard as their next regression made the cost of the status quo legible. That sequencing turned out to be the single most important decision of the engagement.
The first three target prompts were chosen deliberately. One was a customer-facing prompt with frequent regressions and obvious stakes — a high-traffic feature where the product team was already in pain. The second was a developer-facing internal prompt with strong evangelists — engineers who would tell their colleagues the new workflow was good. The third was a low-stakes, low-traffic prompt the platform team itself owned, used as a test-bed for the infrastructure without risking a product surface. Three prompts, three audiences, one converging story: this is what good looks like, and here's how to get there.
"The library was not technically broken. It was just operating without any feedback loops. The first eval suite that fired and failed changed how everyone on the team thought about prompts."— Platform team lead, retrospective interview
02 — Approach: Naming + Versioning
Four conventions that unlocked the rest of the build.
Before any eval could run, every prompt needed an identity. That sounds trivial; in practice it was the most contentious two weeks of the engagement. The platform team drafted a proposal, the product teams debated, the platform team published a final version with the explicit caveat that the format was less important than the discipline. The four conventions below are what shipped. None of them are theoretically optimal; all of them are good enough that the arguments about marginal improvements stopped happening.
The convention had to satisfy four constraints. It had to be human-readable, because product managers and content engineers needed to participate in prompt review. It had to be machine-parseable, because the eval suite would key off it. It had to be diff-friendly, so code review remained useful. And it had to be portable, because the library would eventually move from filesystem-backed to database-backed without a rewrite. The four conventions below shipped on day ten of the engagement, and survived essentially unchanged for the rest of the six months.
- Domain dot prompt slug — support.refund-eligibility.md. Every prompt file lives in prompts/<domain>/<slug>.md. The domain matches the product team that owns it; the slug is a short, descriptive identifier. No version in the filename — versions live in frontmatter so git history remains clean and rollback is one revert. The filesystem is the source of truth.
- Structured metadata block — YAML frontmatter. Every prompt file opens with a YAML block: id, version, owner, model, status, eval_suite, last_reviewed. All fields are mandatory. Status is one of active, beta, deprecated, retired. The eval_suite field points to a file under evals/ — the linkage that the eval cron relies on.
- Semver-style major.minor.patch — v1.4.2. Major for behavioural rewrites, minor for prompt body edits, patch for typo or whitespace fixes. Bumped manually by the author in the same PR as the prompt edit, and enforced in PR review. The discipline was simple enough that the team adopted it without coaching after the first month.
- One named human per prompt — owner: github-handle. Every prompt has exactly one owner — the engineer who reviews changes, sees regression alerts, and approves model migrations. CODEOWNERS maps prompts/<domain>/* to the domain's prompt-owner group, so PR review routing is automatic. Team ownership was disallowed.

The four conventions landed together in the same PR — the platform team did not ship them piecemeal, because doing so would have invited piecemeal adoption. The PR included a migration script that rewrote every existing prompt file in the repo to the new convention, filling in the frontmatter with best-effort guesses for owner and model. Engineers were asked to correct the guesses, not to invent the conventions. That sequencing turned a multi-month negotiation into a two-week review. The conventions were a fait accompli; the corrections were the productive conversation.
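For concreteness, a minimal frontmatter block under these conventions might look like the sketch below. The field names are the documented set; the values (owner handle, model identifier, dates, paths) are hypothetical.

```yaml
---
id: support.refund-eligibility                     # matches prompts/support/refund-eligibility.md
version: 1.4.2                                     # semver-style, bumped manually in the same PR
owner: some-github-handle                          # exactly one named human (handle hypothetical)
model: claude-sonnet                               # target model; identifier illustrative
status: active                                     # one of: active | beta | deprecated | retired
eval_suite: evals/support/refund-eligibility.yaml  # linkage the eval cron keys off
last_reviewed: 2025-01-07                          # bumped during the monthly review
---
```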
One subtle decision worth surfacing: the version field lives in frontmatter, not in the filename. Putting the version in the filename feels intuitive — refund-eligibility.v3.md — but breaks git history (every version is technically a new file) and complicates code review (the diff shows a deletion plus an addition, not a meaningful edit). With the version in frontmatter, the file is the canonical home and git history is the change history. Rollback is one git revert; diff review is a normal markdown diff. The pattern is borrowed from how database migrations are typically structured, and it worked the same way here.
03 — Approach: Eval Suite
Promptfoo as the eval contract for every prompt.
With the naming and versioning conventions in place, the second intervention was the eval suite. The team chose Promptfoo for the same reasons most teams choose Promptfoo when starting from zero: YAML-first format that product managers could read and contribute to, a CLI that wired cleanly into their existing CI, and rubric-style scoring that matched how the team already discussed prompt quality informally. The choice was not theoretically optimal — there are stronger eval frameworks for specific niches — but it was shippable in two weeks.
The pilot was three prompts and thirty-five test cases. The highest-stakes customer-facing prompt got fifteen; the internal developer-facing prompt got ten; the platform team's own test-bed prompt got the remaining ten. Each test case had three components: an input (a realistic user message or context), an expected behaviour (described as a rubric, not a literal output), and a pass/fail criterion (graded by either an exact-match check, a regex, or a model-as-judge call). The thirty-five test cases together took two engineering days to write, plus three hours of review with the prompt owners.
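A hedged sketch of what one such test case could look like in Promptfoo's YAML format — the prompt path, provider ID, variable names, and rubric wording are invented for illustration; the assertion types (icontains, regex, llm-rubric) are standard Promptfoo.

```yaml
# Hypothetical eval suite for the refund-eligibility pilot prompt.
description: refund-eligibility behavioural contract
prompts:
  - file://prompts/support/refund-eligibility.md
providers:
  - anthropic:messages:claude-3-5-sonnet-20241022   # production model; ID illustrative
tests:
  - vars:
      user_message: "I bought this 45 days ago. Can I still get a refund?"
    assert:
      - type: llm-rubric      # model-as-judge against a behavioural rubric
        value: States that the refund window has passed and offers the store-credit alternative
  - vars:
      user_message: "Order 1234 arrived broken yesterday."
    assert:
      - type: icontains       # cheap keyword check
        value: replacement
      - type: regex           # output must cite a ticket reference (format assumed)
        value: "TICKET-\\d+"
```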
The first run failed. Not because the prompts were bad, but because two of the rubrics had been written too strictly — the eval was demanding a more specific output than the prompt actually promised. That failure was the inflection point. The team realised that writing evals was itself a clarifying exercise: it forced an explicit statement of what the prompt was supposed to do, which was the conversation the library had never had. Half the eval-design time went into rubric-tightening; the other half went into discovering edge cases nobody had ever explicitly tested.
Eval coverage growth · pilot through production-grade (source: anonymised engagement metrics, recorded monthly)

By month two, the pattern was visible. Each eval suite cost roughly half a day to write, then paid back its time within the first regression it caught. The team stopped having arguments about whether evals were worth the investment and started having arguments about which prompts to evaluate next. The product teams that had been skeptical for the first month started writing their own suites by week eight. The scoring rubrics improved — the early ones were brittle, graded too strictly, and produced false positives that burned trust; the later ones used model-as-judge calls with tighter rubrics, calibrated against ten human-graded examples, and the false-positive rate dropped to roughly five percent.
One operational decision worth flagging: the team chose to run model-as-judge calls against a different model family than the one the prompt was tuned for. If the prompt targeted Claude Sonnet, the judge was GPT-5.5. The reason was specific — using the same model as both the producer and the judge produces correlated errors that hide regressions. Crossing families breaks that correlation and surfaces a wider class of failure modes. The pattern is a small footnote in the eval literature; in this engagement it visibly improved the catch rate during the first three months.
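In Promptfoo terms, that decision is a single config block — a sketch, assuming the grading-provider override Promptfoo exposes under defaultTest, with model IDs illustrative:

```yaml
# Route every model-graded assertion (llm-rubric) to a judge from a
# different model family than the producer under test.
defaultTest:
  options:
    provider: openai:gpt-4o   # judge; the prompts under test target Claude
```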
"Writing the eval was the conversation. The eval is just what survives the conversation."— Senior platform engineer on the team
04 — Approach: Regression Cron
A nightly cadence to catch drift before users do.
CI integration alone catches regressions introduced by code changes. The third intervention was a regression cron — a scheduled job that ran the full eval suite against the production-deployed prompt versions every night, regardless of whether anything had been changed. The cron caught a different and harder class of regressions: vendor model updates, dataset drift, third-party API behaviour changes, and the slow corrosion of prompts whose tuning context had shifted underneath them.
The cron was a GitHub Actions schedule trigger that fired at 02:00 UTC. It cloned the repo, read every prompt with a status: active flag, ran its eval suite against the production model, recorded the score in a small time-series database, and posted a Slack message to the owner of any prompt whose score dropped more than 10% from its 7-day rolling baseline. The infrastructure cost was essentially free — the GitHub Actions allowance covered it comfortably — and the alerting fit into the team's existing operational flow.
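A minimal sketch of that workflow. The schedule, the status: active filter, the 10% threshold, and the Slack routing come from the description above; the helper scripts (select-active-suites.sh, record-and-alert.sh) are hypothetical names for the frontmatter filtering and baseline-alerting steps.

```yaml
name: prompt-regression-cron
on:
  schedule:
    - cron: "0 2 * * *"            # nightly at 02:00 UTC
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # select-active-suites.sh (hypothetical) reads frontmatter and prints
      # the eval_suite path of every prompt with status: active.
      - name: Run eval suites for active prompts
        run: |
          mkdir -p results
          for suite in $(./scripts/select-active-suites.sh); do
            npx promptfoo@latest eval --config "$suite" \
              --output "results/$(basename "$suite" .yaml).json"
          done
      # record-and-alert.sh (hypothetical) appends scores to the time-series
      # store and posts a Slack message to the prompt owner when a score
      # drops more than 10% below its 7-day rolling baseline.
      - name: Record scores and alert owners
        run: ./scripts/record-and-alert.sh results/
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```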
- Production prompts (nightly · 02:00 UTC) — every prompt with status: active in the catalog runs nightly against its eval suite. Catches model-vendor updates within 24 hours — the window before customer impact compounds. Default cadence for the entire production library.
- Beta & staging prompts (weekly · Monday 06:00) — prompts with status: beta run weekly on Monday mornings. Lower stakes, lower traffic, lower priority for compute spend. Same eval suite, same alert routing — just a less aggressive schedule that respects the maturity level.
- Deprecated prompts (monthly · first of each month) — prompts with status: deprecated still get evaluated, just monthly. The point is not to maintain quality on a deprecated prompt; the point is to know if it has degraded so badly that the team needs to accelerate retirement before the next platform shift.
- Eval suites themselves (CI hook on eval edits) — when an eval suite is edited (new test case, tightened rubric), CI runs it immediately against the current production prompt. This closes the loop where a 'good' prompt was actually only passing a too-lenient eval — the fix was to tighten the eval, not the prompt.

The first regression the cron caught was a model vendor update — a minor version bump on the underlying API that shifted the distribution of outputs for a specific prompt by enough to fail two of its test cases. The customer impact would have been small but real over the course of the week; the cron caught it inside twelve hours, the prompt owner was paged into Slack, and the team had rolled back to the previous model version within the business day. That single incident, more than any abstract argument, sold the cron to the product teams that had been quietly resistant. A regression caught before customers noticed was the most concrete possible demonstration of why the library mattered.
Over the six months, the cron caught nineteen regressions in total — roughly one every ten days. Three of them were model vendor updates, six were silent edits that had snuck through code review without an eval being attached, four were third-party API behaviour changes (a search index returning different rankings, a tool integration shifting its response schema), and the remaining six were slow drift that accumulated over weeks rather than appearing overnight. None of the nineteen reached customers in a way that triggered a support ticket. The catch rate, measured against the small number of regressions that customers did report independently, was effectively complete.
05 — Approach: Lifecycle Policy
Deprecate, promote, retire — an explicit path for every prompt.
The fourth intervention was the one the platform team initially considered optional and quickly stopped considering optional. Without an explicit lifecycle policy, prompts accumulated forever. New ones got added; old ones never got removed. By month three the library was a hundred and thirty prompts, and roughly a quarter of them were powering features the product teams had quietly deprecated months earlier. Nobody had thought to retire the prompts that backed them. The infrastructure was running evals against ghosts.
The lifecycle policy defined four explicit states: beta (newly introduced, in active evaluation, not yet promoted), active (in production, eval coverage required, daily regression cron), deprecated (no longer the recommended choice, scheduled for retirement, monthly regression cron), and retired (removed from active code paths, archived in git history, no longer evaluated). The status field in the prompt's frontmatter was the source of truth; the regression cron and the dashboard both keyed off it.
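As a sketch of how a mid-lifecycle prompt might carry that state in its frontmatter — status is the documented field; replaced_by and retire_by are assumed field names, standing in for the forwarding pointer and target retirement date the transitions below require:

```yaml
status: deprecated
replaced_by: support.refund-eligibility-v2   # forwarding pointer (field name assumed)
retire_by: 2025-06-30                        # target retirement date (field name assumed)
```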
- Beta to active (gate: time + eval + owner) — a prompt promotes from beta to active when it has run for at least seven days in shadow mode against an eval suite, passed all test cases on at least five consecutive nightly runs, and been signed off by the named owner. The signoff is a single PR comment.
- Active to deprecated (gate: pointer + retire-by date) — a prompt deprecates either when its replacement promotes to active, or when the feature it powers is decommissioned. Deprecation includes a forwarding pointer to the replacement (or a deletion ticket) and a target retirement date thirty days out.
- Deprecated to retired (auto-script + manual review) — a prompt retires ninety days after deprecation, by default. The retirement removes the prompt from active code paths but preserves it in git history under archive/. Retired prompts no longer run eval suites and no longer appear on the dashboard.

The lifecycle policy retired thirty-two prompts in the first quarter — roughly a quarter of the library at that point — simply by giving the team a reason to look at every prompt with a fresh eye. Many of those retirements were trivially obvious in retrospect: a prompt powering a feature that had been sunset, a duplicate that had been left in place when a consolidation was supposed to delete it, a beta prompt that had never been promoted and never been retired. The policy did not invent the cleanup; it forced the cleanup to happen on a schedule.
The most operationally important field in the lifecycle policy turned out to be the last_reviewed frontmatter entry. The monthly review cadence (described in Section 06) walked through every prompt whose last_reviewed date was older than ninety days and asked one question: is this still in production, still owned, still useful? Most of the time the answer was yes and the date got bumped. Some of the time the answer was no and a deprecation got opened. The discipline was small; the cumulative effect over six months was a library that actually represented the team's current state.
"A prompt library that never retires anything is just a graveyard with better metadata. The lifecycle policy is what keeps it a library."— Engagement retrospective, month 6
06 — Outcomes
Six months in — two hundred prompts, measured end to end.
By month six the library had grown from eighty prompts to two hundred. The growth was not entirely natural — roughly sixty of the new prompts existed because the audit uncovered prompts that had previously been hardcoded inside handlers and were now extracted into proper catalogued files. Another forty represented new features the product teams had shipped during the engagement, all of which had joined the library cleanly because the infrastructure was ready by month three. Twenty prompts were ones the team split from monolithic system prompts into composable units once the eval suite gave them confidence to refactor.
Eval coverage sat at eighty-seven percent. The remaining thirteen percent was the long tail — prompts that were either too low-stakes to justify the suite-design time, or so deeply experimental that the team accepted the trade-off of evaluating informally for now. The library's quality was no longer in question; the conversation had moved on to whether each remaining uncovered prompt was worth the marginal investment, which is a meaningfully healthier conversation to be having.
The regression catch rate was near-complete. Across six months, every regression that customers reported also showed up first in the cron — usually several days before. The cron caught nineteen regressions in total; customers independently reported five. Of those five, all five had also been detected by the cron and were already in the queue for remediation by the time the support ticket arrived. The library had crossed the threshold from reactive to proactive. That was the operational shift the engagement existed to produce.
Library outcomes · month 0 vs month 6 (source: anonymised engagement metrics, six-month review)

The cultural change was harder to quantify and arguably more important. By month four the product teams had stopped asking the platform team for prompt help and started asking for prompt review. The conventions had moved from top-down documentation to bottom-up practice. New engineers joining the organisation learned the prompt workflow as part of onboarding alongside the code-review workflow. Prompt edits no longer happened opportunistically in feature PRs; they happened in dedicated prompt PRs that named the version bump and linked the eval results. The library had become a piece of the team's engineering culture rather than an artifact of it.
The monthly library review cadence — established in month two — sustained the gains. The platform team and the domain prompt owners met for ninety minutes on the first Tuesday of each month. The agenda was fixed: prompts whose last_reviewed date had aged past ninety days, prompts whose eval scores had drifted (even within tolerance) over the prior thirty days, deprecation candidates, beta promotions, and a quick scan of the eval suite quality. The cadence cost the team a couple of engineering hours per month and bought a library that stayed trustworthy without a separate maintenance project.
07 — Lessons + Replication
What we would do again, and what we would change.
Six lessons distilled from the engagement, with the honest caveats about where each one might not generalise. The engagement was at one platform team with a particular culture and a particular budget; teams in different shapes should expect to make different trade-offs around the edges. The pattern of intervention sequence — naming, evals, cron, lifecycle — is the part we would carry to any engagement of similar scope.
- Pick three pilot prompts, not the whole library (three pilots · always) — three prompts (one high-stakes customer-facing, one developer-facing with evangelists, one low-stakes platform-owned) produced a converging story that no whole-library sweep could have produced. The momentum from three working examples carried adoption better than any documentation push.
- Ship the convention as a fait accompli (anchor early) — two weeks of debate produced less alignment than two days of 'here is the convention, please correct the migration script's guesses.' The conversation that mattered was about the corrections, not the conventions. Anchor first, refine later.
- Cross-family judge model (always cross families) — using GPT as the judge for Claude-targeted prompts (and vice versa) was the small footnote that visibly improved catch rate. Same-family judges produce correlated errors that hide failure modes. Cross the family boundary on every model-as-judge call by default.
- Wire the cron in month one, not month four (cron from day one) — the biggest mistake we made was leaving the regression cron until month three. By then, three regressions had reached customers that the cron would have caught. The cron is cheap to set up against a small library and grows with the library — start it on day one with the pilot suites.
- Lifecycle is mandatory, not optional (lifecycle on day one) — without explicit deprecate-promote-retire, the library bloats. We initially considered the lifecycle policy a 'nice to have' and ended up retrofitting it in month three to clean up thirty-two zombie prompts. Build the policy alongside the eval suite, not after.
- Replicate at half the scale (same shape, smaller dose) — for a team with twenty prompts instead of two hundred, the playbook collapses: pick one pilot, write five test cases, wire a daily cron, define active/deprecated/retired states. Same sequence, smaller surface. The principles do not change; the calendar does.

For smaller teams considering replication, the practical starting point is the smallest possible version of each intervention. One pilot prompt. Five test cases. A Promptfoo CLI run in CI. A nightly GitHub Actions schedule firing the same suite against the production prompt. A single markdown file documenting the four lifecycle states. The infrastructure investment for a twenty-prompt library is a week of part-time work, not a quarter; the payoff compounds the same way. The mistake to avoid is treating this as a heavyweight engagement scoped only to teams with platform-engineering budgets. The playbook scales down cleanly to product teams of three engineers.
For teams considering replication, the right next steps depend on where they are starting. Teams that have already run a library audit (with the 100-point framework or any equivalent) should jump straight to the eval suite work for their three highest-stakes prompts. Teams that have not audited yet should start there — the audit, even an informal version of it, identifies the pilot prompts and establishes the baseline measurement. Either path leads to the same library, just on a slightly different timeline. If you want a structured engagement that mirrors what this case study describes, our AI transformation team runs prompt library builds at this exact scale, and we usually pair them with a thirty-sixty-ninety day rollout plan — covered separately in our prompt library rollout plan guide, and grounded in the same audit framework as our 100-point library evaluation.
Prompt libraries become institutional knowledge when discipline is wired in.
The headline numbers from this engagement — two hundred prompts, eighty-seven percent eval coverage, near-complete regression catch rate before customer impact — are the visible results. The invisible result is the change in how the team thinks about prompts. They are not artifacts; they are infrastructure. They have owners, versions, eval suites, lifecycle stages, and a regression cron the team actually trusts. The library is a first-class engineering asset on par with the codebase around it.
The replication advice is unambiguous: do not start by trying to clean up the whole library. Start with three prompts, ship the conventions, write the evals, wire the cron, define the lifecycle states, and let the practice spread by example. The teams that try to do everything at once burn out and lose momentum; the teams that ship a credible pilot in three weeks usually convert their organisations within a quarter. The cadence is what sustains the gains — a monthly review meeting, ninety-minute agenda, fixed format. The investment is small; the compounding payoff is institutional knowledge that survives platform churn and personnel change.
The broader pattern is consistent with what we see across every team that ships AI in production: the difference between a library and a graveyard is not the quality of the prompts inside it, it is the discipline wired around it. Naming, versioning, evals, regressions, lifecycle, cadence. Six disciplines, three months to install, sixty months of compounding value. The platform team in this case study made the investment once and stopped firefighting regressions. That is the entire promise of the work.