AI Development · Framework

Coverage, versioning, evals, regressions, model-fit — one hundred points that separate a prompt library from a prompt graveyard.

Prompt Library Audit: 100-Point Evaluation Framework 2026

A prompt library is institutional knowledge — but only if you can measure it. This 100-point framework scores five axes (coverage, versioning, evals, regressions, model fit) so teams can tell a production-grade library from a folder of prompts nobody dares touch.

Digital Applied Team · Senior strategists
Published May 3, 2026 · Read time: 14 min
Sources: Promptfoo, OpenAI Evals, DeepEval docs
Audit points: 100 (five axes, 20 points each)
Axes: 5 (coverage · versioning · evals · regressions · model fit)
Maturity stages: 4 (ad-hoc to production-grade)
Typical audit duration: ≈ 5h (10–50 prompt library)

A prompt library audit scores the prompts a team relies on across five axes — catalog coverage, versioning discipline, eval suite depth, regression detection, and cross-model fit — for one hundred points total. The point is not the score; the point is that without measurement, prompt libraries quietly turn into prompt graveyards where nobody knows which version shipped, which model it was tuned for, or whether last week's edit broke production.

Most teams running LLM features in production have somewhere between ten and fifty named prompts powering customer-facing workflows. The prompts work. They are also almost never measured. When a model upgrade lands, when a junior engineer rewords a system prompt for clarity, when the team migrates from Sonnet to Haiku for cost — there is no automated signal that anything regressed. The audit framework below exists to make those failures visible before users find them.

This guide covers what separates a real library from a folder of prompts, walks through each of the five axes with the 20 points inside it, and ends with a four-stage maturity model teams can map themselves onto today. Everything below is technique-agnostic — the named tools (Promptfoo, OpenAI Evals, DeepEval) are interchangeable examples, not prescriptions.

Key takeaways
  1. A prompt without an eval is a wish. Evals are the contract that turns a prompt from tribal knowledge into something a team can refactor, swap models on, or upgrade with confidence. No eval, no contract, no real ownership.
  2. Versioning by semver works — prompt-001.v3.md is fine. Get a discipline; argue about the format later. What kills libraries is not the format, it is the absence of any version field at all. Pick a convention, write it in the README, enforce it in PR review.
  3. Regression detection needs a cron. Daily eval runs catch drift before users do. Vendor model updates, dataset shifts, and silent edits all introduce regressions; only a scheduled job catches them in the window where you can still roll back cleanly.
  4. Model fit is per-prompt, not per-library. Some prompts move from Sonnet to Haiku gracefully; others lose 30 points of accuracy and need to stay where they are. The audit forces per-prompt model-fit testing rather than a single library-wide call.
  5. Promptfoo + OpenAI Evals + DeepEval cover 80% of real needs. Open-source tooling is enough for most teams. The remaining 20% — multi-modal evals, long-context regression, custom rubrics — eventually wants a commercial platform, but starting with open source is the right default.

01 · Library vs Graveyard: Most prompt libraries are write-only.

A prompt library starts the way every reasonable engineering artifact starts — somebody needs a system prompt, writes a good one, and pastes it into the codebase. Six months later, the codebase has thirty prompts, none of them have version tags, half of them are duplicated across three files, and the original author has moved teams. Nobody is sure which version is in production. The library has quietly turned into a graveyard.

The transition is invisible because every individual edit looked reasonable at the time. A teammate tightened the wording. A new model came out and someone tweaked the temperature. The product team asked for a softer tone and an engineer rewrote two paragraphs. None of those changes had an eval running against them. Each one shifted behavior; the team noticed only when a customer complained about an output that used to be fine.

The distinguishing characteristic of a real prompt library — what this audit is designed to measure — is not size, not formatting, not even quality of the prompts themselves. It is the presence of feedback loops. A library has evals running on a schedule. It has a version field on every file. It has a regression dashboard somebody actually looks at. It has a model-fit decision logged for each prompt so when Sonnet 4.6 replaces Sonnet 4.5, the migration is a fifteen-minute review instead of a two-week panic.

The graveyard test
Ask three people on the team to point to the exact version of the prompt running in production for feature X. If you get three different answers — or worse, three confident guesses — the library is a graveyard. The first hour of the audit always starts here.

The 100-point framework is built around five axes because those are the five places feedback loops live (or fail to). Catalog coverage measures whether the library knows what it owns. Versioning measures whether change is legible. Evals measure whether quality is contractual. Regression detection measures whether drift is visible. Model fit measures whether the library survives the inevitable platform churn underneath it. Twenty points each, scored on observable evidence, totaled into a single number that maps onto the four-stage maturity model in Section 07.

"The shortest path from a prompt graveyard to a prompt library is a single eval run that fails. After that, every prompt becomes obviously measurable."— Common pattern across audit engagements

02 · Catalog Coverage: Twenty points on what's tracked.

The first axis answers a question that sounds trivial until you try to answer it concretely: what prompts does this team own? A surprising number of audits start with the team discovering they have prompts they forgot existed — buried in old feature branches, hardcoded in scripts, duplicated across three services with slight variations nobody can explain. Catalog coverage is twenty points because most graveyards fail this axis before any of the others.

The points break down across four sub-categories: inventory completeness, metadata richness, ownership clarity, and lifecycle state. Each sub-category contributes five points, scored on evidence the audit team can verify in under fifteen minutes by looking at the repo and asking two questions of the team lead.

Inventory · 5 pts · Every prompt is findable (grep test passes)

Single source of truth for the prompt catalog. Either a prompts/ directory, a database table, or a CMS — whatever the team picks, every production prompt has exactly one canonical home. No duplicates, no orphans, no surprises.

Metadata · 5 pts · Each prompt has structured fields

Owner, model, last-edited date, eval status, feature it powers, version. Stored as frontmatter or a row of columns — the format is up to the team, but the fields are mandatory. Five points for completeness, not formatting.

Ownership · 5 pts · A human owns every prompt

Each prompt names exactly one human owner — the person who reviews changes, signs off on edits, and gets paged when the prompt regresses. Shared ownership is a synonym for no ownership; the audit penalises 'team-owned' entries.

Lifecycle · 5 pts · Active vs deprecated is explicit (state machine)

The catalog distinguishes prompts in production, in beta, deprecated, and archived. Deprecated prompts have a deletion date or a forwarding pointer to the replacement. No prompt sits in 'I think we still use this' limbo.

Catalog coverage is the cheapest axis to remediate and the highest-leverage. Teams that score below ten on this axis often see their score jump fifteen to twenty points within two weeks of committing to a single source of truth — there is nothing algorithmically hard about cataloguing prompts; it just needs a decision and a small amount of housekeeping.

The audit deliverable for this axis is a flat CSV listing every prompt found in the codebase with the four sub-category scores attached. Findings rated P1 (orphaned prompt, unclear ownership) are scheduled for immediate remediation; P2 findings (incomplete metadata, ambiguous lifecycle state) feed into the next sprint. By the time the next axis starts, the team already has a clean inventory to work against.
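What a catalog entry looks like in practice is up to the team. As a sketch, frontmatter on a prompt file covering the mandatory fields might look like this — the field names and values are illustrative, not a required schema:

  ---
  id: support-triage-summary
  version: 3                        # any convention counts; presence is what gets scored
  owner: jane.doe                   # exactly one named human
  model: claude-sonnet-4-5          # primary model (placeholder ID)
  feature: support-ticket-triage
  lifecycle: production             # production | beta | deprecated | archived
  eval_suite: evals/support-triage.yaml
  last_edited: 2026-04-18
  ---
  You are a support triage assistant for the billing team. Summarise the ticket below...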

03 · Versioning: Twenty points on naming and history.

Versioning is the axis with the strongest opinions and the lowest actual difficulty. Teams argue for weeks about semver vs date stamps vs hashes, then implement none of the three. The audit is agnostic on format — semver, date, hash, monotonic integer — and ruthless about presence. A prompt without a version field gets zero points on this axis, regardless of how good the prompt is.

The twenty points split across four sub-categories: version identity, change history, rollback discipline, and diff reviewability. The first two are about whether you can identify the version running and trace how it got that way; the second two are about whether you can recover when something breaks. Most libraries score well on identity and poorly on rollback — easy to tag a version, harder to actually go back to one.

Versioning · 20 points across four sub-categories

Source: typical scores observed across audit engagements
Version identity (5 pts): Every prompt has a parseable version field
Change history (5 pts): Git history or DB audit log preserves every edit
Rollback discipline (5 pts): Prior versions are runnable in production within minutes
Diff reviewability (5 pts): Prompt edits go through PR with a human reviewer

The most common finding on this axis is silent edits in production-adjacent code paths. A teammate updates a system prompt inside a service handler, the commit goes through with a benign message like "tweak wording," the eval suite — if one exists — wasn't wired to fire on that file, and the change ships. Three days later, support tickets spike. Nobody connects the two until someone runs git log on the prompt file in desperation.

The remediation pattern is straightforward and orthogonal to the versioning format debate. Move prompts to a dedicated directory. Wire CI to run the eval suite on any change inside that directory. Require PR review with the prompt's owner as a required reviewer (CODEOWNERS does this). Once those three mechanisms are in place, the versioning axis takes care of itself — every change becomes a tagged, reviewed, eval-gated artifact, regardless of whether the version string is v3, 2026.05.03, or a8f3c92.
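As a sketch of those three mechanisms on GitHub (paths, file names, and the eval command are illustrative), a per-prompt CODEOWNERS entry plus a path-filtered workflow is enough to make every prompt edit reviewed and eval-gated:

  # .github/CODEOWNERS: one line per prompt, owner taken from the catalog
  /prompts/support-triage.md       @jane-doe
  /prompts/onboarding-welcome.md   @sam-lee

  # .github/workflows/prompt-evals.yml: evals fire on any change inside prompts/
  on:
    pull_request:
      paths:
        - "prompts/**"
  jobs:
    evals:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - run: npx promptfoo eval -c promptfooconfig.yaml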

The format does not matter
Teams that adopt any versioning convention — even one as crude as appending .v3 to a filename — score dramatically better than teams still debating which convention is theoretically optimal. Ship a discipline, refine the format later.

04 · Evals: Twenty points on eval suite coverage.

Evals are the axis where the audit gets the most pushback and generates the most value. Teams without evals will argue, sincerely and at length, that their prompts are too qualitative to measure — that the outputs are creative, the tasks are open-ended, the judgement is subjective. The audit response is simple: pick any three production prompts, write ten test cases for each, run them tomorrow, and see if anyone disagrees with the scores. Almost nobody does.

The twenty points split across coverage breadth, case quality, scoring rigor, and CI integration. The first is about how many prompts have evals; the second is about whether those evals actually exercise the prompt's real failure modes; the third is about whether the scoring is something more than a thumbs-up; the fourth is about whether the eval suite runs on every change or only when someone remembers.

Promptfoo · Lightweight YAML evals · promptfoo.yaml + CLI

Best starting point for teams new to evals. Declarative test cases, supports rubric scoring and model-graded judgments, integrates cleanly with CI. The default recommendation when an engagement starts at zero. Open source and TypeScript-friendly.

OpenAI Evals · Structured eval registry · Python registry + harness

Stronger when the library leans heavily on OpenAI-hosted models and the team wants the eval format to match what the model provider uses internally. Better registry semantics; less convenient for non-OpenAI providers. Open source and Python-first.

DeepEval · Unit-test style for prompts · pytest + custom metrics

Wraps prompt evals in a pytest-like syntax with built-in metrics for faithfulness, answer relevancy, contextual recall. Fits well when the prompt library already lives in a Python codebase with strong test culture. Open source, Python, built on pytest.

The audit does not prescribe a tool. The audit prescribes coverage: every production prompt should have at least one eval suite running against it on every change. The choice between Promptfoo, OpenAI Evals, and DeepEval is a stack-fit decision — pick the one that matches the team's existing testing language and CI shape. We default to Promptfoo for TypeScript stacks because the YAML-first format makes eval design accessible to PMs and content engineers, not just engineers.

For teams writing their first eval suite, the practical starting point is a single prompt and ten test cases. Pick the highest-stakes prompt in the library. Identify ten inputs that cover happy path, edge cases, and known failure modes. Score each output on a small rubric — two or three criteria, each judged on a 1-5 scale, with a written reason. Wire the suite into CI. That single eval, running on every PR that touches the prompt, will catch more regressions in its first month than the team's previous six months of code review. The pattern generalizes; once a team has one eval suite running, the second one takes hours, not days. The blueprint we use for skill-level eval scaffolding is identical to the one in our Claude skill tutorial: a tiny markdown contract, a callable script, a CI hook.
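A sketch of that first suite in Promptfoo-style YAML, with file paths, the provider ID, and the rubric wording as placeholders rather than a prescribed schema:

  # promptfooconfig.yaml: one prompt, a handful of cases, a small rubric
  prompts:
    - file://prompts/support-triage.md
  providers:
    - anthropic:messages:claude-sonnet-4-5    # placeholder model ID
  tests:
    - vars:
        ticket: "Customer double-charged on invoice 4821, wants a refund today."
      assert:
        - type: contains
          value: refund
        - type: llm-rubric
          value: Names the billing issue, the urgency, and the requested action.
    - vars:
        ticket: "hi"                          # edge case: near-empty input
      assert:
        - type: llm-rubric
          value: Asks a clarifying question instead of inventing ticket details.

Wired into the path-filtered workflow from the versioning section, a file like this runs on every PR that touches the prompt.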

"A prompt without an eval is a wish. With one good eval suite, the prompt becomes a contract — and contracts can be refactored, model-swapped, and upgraded with confidence."— Common refrain from prompt audit engagements

05 · Regressions: Twenty points on drift and detection.

Regression detection is the axis that separates teams running evals from teams running an actual prompt-ops practice. Evals in CI catch regressions introduced by code changes. Regression detection catches regressions introduced by everything else — model vendor updates, dataset shifts, time-based drift, third-party API changes, and the slow erosion of prompt quality as the world the prompt was tuned for moves on around it.

The twenty points split across detection scope, scheduling cadence, alert routing, and historical baselining. The detection scope question is whether the team runs evals only on changed prompts or on the full library on a schedule. Scheduling cadence asks how often. Alert routing asks who gets paged when the eval suite regresses. Historical baselining asks whether the team can answer "when did this prompt start degrading?" with a chart rather than a guess.

Detection scope · Full library on schedule

Run every prompt's eval suite every night, not just on PR. Catches model-vendor regressions, dataset drift, and slow degradation that PR-only runs miss entirely.

Cadence · Daily for production, weekly for beta

Daily is the sweet spot for production prompts — frequent enough to catch a model update within 24 hours, infrequent enough to keep eval costs reasonable. Beta and deprecated prompts can run weekly.

Alert routing · Page the prompt owner

Regression alerts go to the named owner from the catalog, not to a shared channel. Shared channels mean nobody owns it; named owners mean somebody investigates. Catalog ownership and alert routing are the same field.

Baselining · 30-day rolling score history

Store every nightly eval score for at least 30 days. When a regression alert fires, the responder can see whether the drop is a one-day spike (probably noise) or a multi-day trend (probably the vendor changed something underneath).

The most common gap on this axis is the absence of a cron. Teams wire their eval suites into CI, hit ninety percent on the previous axis, and stop there — never noticing that a quarter of their regressions come from changes the team did not make. The fix is almost embarrassingly cheap: take the eval suite that already runs in CI, fire it on a nightly schedule against production prompt versions, and route any failure to the prompt's owner. The infrastructure cost is the GitHub Actions schedule trigger or its equivalent. The behavioral change is what matters.
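A sketch of that schedule on GitHub Actions, reusing the same eval command (the notification step is a hypothetical helper script, not part of any tool):

  # .github/workflows/nightly-prompt-evals.yml
  on:
    schedule:
      - cron: "0 6 * * *"      # full library, every night at 06:00 UTC
  jobs:
    full-library:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - run: npx promptfoo eval -c promptfooconfig.yaml --no-cache
        - if: failure()
          run: ./scripts/page-prompt-owner.sh   # hypothetical: route the alert to the catalog owner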

Once nightly runs are in place, the natural next step is a small dashboard — a single page showing each prompt's current score and 30-day trend, with the prompt owner pictured next to it. The dashboard is not the value; the dashboard is what makes the value visible to people who do not read CI logs. When a non-engineering stakeholder asks "is our AI quality holding up," the right answer is a URL, not a paragraph.
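The score history behind that page can start embarrassingly small. A minimal sketch in Python, assuming each nightly run appends one JSONL row per prompt — the file layout and thresholds here are assumptions, not a prescribed format:

  import json, datetime, pathlib

  HISTORY = pathlib.Path("eval-history.jsonl")   # one row per prompt per nightly run

  def record(prompt_id: str, score: float) -> None:
      """Append tonight's eval score to the rolling history."""
      row = {"prompt": prompt_id, "date": datetime.date.today().isoformat(), "score": score}
      with HISTORY.open("a") as f:
          f.write(json.dumps(row) + "\n")

  def is_trend_regression(prompt_id: str, days: int = 3, drop: float = 0.05) -> bool:
      """True when the last `days` runs fall consecutively and lose more than `drop` overall."""
      rows = [json.loads(line) for line in HISTORY.read_text().splitlines()]
      scores = [r["score"] for r in rows if r["prompt"] == prompt_id][-(days + 1):]
      if len(scores) < days + 1:
          return False
      falling = all(b <= a for a, b in zip(scores, scores[1:]))
      return falling and (scores[0] - scores[-1]) > drop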

06 · Model Fit: Twenty points on cross-model testing.

Model fit is the axis the audit added in late 2025 as frontier providers started shipping incremental model versions on monthly cadences. Until then, the assumption was that a prompt tuned for one model would work fine on the next version of the same family. That assumption stopped being safe somewhere around the Sonnet 4.5 to 4.6 transition; some prompts moved cleanly, others lost meaningful accuracy on tasks they had previously aced. The audit needed to make that decision visible per prompt.

The twenty points split across cross-version testing, cross-family testing, cost-fit testing, and migration documentation. The first two ask whether the prompt has been evaluated on more than its primary model. The third asks whether the team has tested whether the prompt could survive being downgraded to a cheaper model for cost savings. The fourth asks whether decisions about which model to use are written down — so when the cost-saving question comes up in three months, the team can answer it in fifteen minutes instead of re-running the analysis.

Model fit · 20 points across four sub-categories

Source: typical scores observed across audit engagements
Cross-version (5 pts): Each prompt evaluated on at least two versions of its primary model
Cross-family (5 pts): Each prompt evaluated on one model from a second family (e.g. GPT and Claude)
Cost-fit (5 pts): Prompt tested against the cheapest model in the family that still meets the eval bar
Migration docs (5 pts): Model choice rationale written down with eval scores attached
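In eval-config terms, the per-prompt test is mostly a longer provider list run against the same cases. A Promptfoo-style sketch, with placeholder model IDs:

  # same tests, three providers: the score matrix answers cross-family and cost-fit per prompt
  providers:
    - anthropic:messages:claude-sonnet-4-5    # primary (placeholder ID)
    - anthropic:messages:claude-haiku-4-5     # cheaper candidate in the same family (placeholder ID)
    - openai:gpt-5.5                          # second family (placeholder ID)

The per-prompt score table that run produces is exactly the evidence the migration docs sub-category asks for.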

The headline finding from this axis is almost always asymmetry. Across a library of thirty prompts, ten of them move from Sonnet to Haiku without losing more than a couple of eval points — meaningful cost savings, no production risk. Ten more lose noticeable accuracy on Haiku but recover cleanly with light prompt edits. The last ten cannot leave Sonnet at all; they rely on capabilities the cheaper model does not have. The audit forces the team to discover that asymmetry rather than assume the library is homogeneous.

The remediation for this axis is structural rather than point-by-point. Once cross-model evals are running for a handful of prompts, the team almost always adopts a standard: every new prompt ships with at least two model evaluations (its primary plus one cheaper alternative), and the migration documentation gets written at the moment the decision is made, not retrospectively. The audit catches the library at the moment that discipline starts, which is usually the difference between a library that survives the next platform shift and one that does not.

The cost-fit question
For a thirty-prompt library, the typical cost reduction from systematic model-fit testing is in the 30-50% range — not from switching everything to the cheapest model, but from identifying which third of the library can safely move down a tier. The eval suite makes that decision data-driven instead of religious.

07 · Maturity: Four-stage model — ad-hoc to production-grade.

The four-stage maturity model maps the 100-point score onto a short label teams can use in conversation. The labels are deliberately blunt: the point is to give a team an honest statement of where they are, not to flatter them. The audit output always includes the numeric score, the stage label, and a remediation roadmap prioritised by which axis will move the score fastest.

Stage 1 · 0–25 · Ad-hoc · graveyard risk: immediate

No catalog, no versioning, no evals. Prompts live wherever they were written. Changes ship without review. Regressions are discovered by customers. Most teams running their first LLM features in production live here for at least six months.

Stage 2 · 26–50 · Disciplined catalog · foundation laid

Prompts are catalogued and versioned, ownership is clear, edits go through PR. Evals exist for a handful of high-stakes prompts but most of the library is still uncovered. Regression detection is reactive — fired on PR, not scheduled.

Stage 3 · 51–75 · Eval coverage · production-ready

Every production prompt has an eval suite. CI runs them on change. A nightly cron runs the full library and routes failures to owners. Cross-version testing happens for primary models. The library is measurable end-to-end, but cross-family and cost-fit testing are gaps.

Stage 4 · 76–100 · Production-grade · institutional knowledge

Full eval coverage across families. Cost-fit testing routine for new prompts. Regression dashboard visible to non-engineering stakeholders. Migration docs current. The library is a first-class engineering artifact with the same maturity as the codebase around it.

Most teams entering an audit score somewhere in the 20-40 range — solidly Stage 1 or early Stage 2. The remediation roadmap is almost always the same: catalog first, then versioning, then evals on the top five prompts, then scheduling. By the time the team has implemented those four steps — typically six to eight weeks of part-time work — the score is usually in the 50-65 range, which is enough to survive the kinds of platform changes that broke them before.

Reaching Stage 4 is the harder lift. It takes a deliberate investment in eval infrastructure, a culture that treats prompts as production artifacts, and a budget for the recurring cost of running cross-model evals on a schedule. The teams we see at Stage 4 are usually those running customer-facing AI features as their primary product surface — where prompt quality directly correlates with revenue, and the eval infrastructure pays for itself in avoided incidents. For agencies and engineering teams stepping up to that level, our AI transformation engagements often start with exactly this audit and walk the remediation roadmap with the team — the goal being institutional knowledge that survives platform churn, not a one-time cleanup.

The compound win from maturing the library is the same one we see in agentic patterns more broadly — when a team makes prompt evaluation routine, the threshold for shipping new AI features drops sharply. We covered the same dynamic for the sibling case of agentic skill development in our Claude Code subagent tutorial — building eval rails for prompts and for subagents follows the same shape, and teams that adopt one usually find themselves wanting the other within a quarter.

Conclusion

Prompt libraries become institutional knowledge — but only if you measure them.

The 100-point framework is not a formula for a good prompt library; it is a flashlight for a team to point at their own library and see which corners are dark. The score itself matters less than the conversation the score forces. When a team learns they have eight prompts in production with no named owner, the remediation is obvious. When the team discovers their nightly cost could drop by a third with two weeks of cross-model testing, the budget conversation moves from theoretical to actionable in the same meeting.

The pattern we see across audit engagements is that teams move from Stage 1 to Stage 3 faster than they expect — the mechanics are not hard, the tooling is open source, and the organisational change is mostly a matter of writing down conventions that already implicitly exist. The slower lift is from Stage 3 to Stage 4, which requires the team to treat prompts as first-class engineering artifacts on par with the services around them. That is a culture shift, not a tooling shift, and it pays back in the form of survivable platform transitions.

What to do next: run the audit on your own library. Pick a single afternoon. Score yourself honestly across the five axes. Identify the two axes where you scored lowest. Write down a single concrete remediation for each — the catalog update, the eval suite, the nightly cron, whichever pair will move your score fastest. The first hour of work is the most important; once the library has any feedback loop running against it, the rest follows.

Audit your prompt library

A prompt library without eval coverage is a graveyard waiting to compound regressions.

Our prompt-engineering team audits production prompt libraries — catalog, versioning, eval coverage, regression detection, model fit — and ships an open-source eval pipeline.

Free consultation · Expert guidance · Tailored solutions
What we deliver

Prompt library audit engagements

  • 100-point library audit with severity ranking
  • Eval suite design using Promptfoo / OpenAI Evals / DeepEval
  • Versioning convention and migration plan
  • Regression-detection cron and dashboard
  • Model-fit testing across frontier and open-weight models
FAQ · Prompt library audit

The questions teams ask before auditing their prompt library.

How is a prompt library audit different from a model-eval audit?
A model-eval audit scores whether a specific model performs well on a benchmark or task — for example, evaluating GPT-5.5 against your domain workload on RAG accuracy. A prompt library audit scores whether the team's prompts are a managed, measurable, regression-resistant asset, regardless of which model they currently run on. The two complement each other: a model-eval audit answers 'is this model good for this task,' while the prompt library audit answers 'will this prompt survive the next model change.' Most teams need both, but the library audit is the one that compounds — once the library is measurable, swapping models becomes a fifteen-minute decision instead of a multi-week migration project.