
Ninety days from corpus to production-grounded answers — ingestion, retrieval, generation, observability, all phased.

RAG System Production: 30/60/90-Day Plan 2026

A phased 90-day plan to take a RAG system from corpus to production-grounded answers. Days 1-30 land ingestion, chunking, and the embedding pick. Days 31-60 add re-ranking, faithfulness eval, and the citation UX. Days 61-90 ship to production with observability and a refresh cadence that prevents drift.

Digital Applied Team · Senior engineers
Published May 15, 2026 · Read time: 10 min · Horizon: 30 / 60 / 90 days
Plan horizon: 90 days (corpus to production)
Eval frameworks: 3 (RAGAS · DeepEval · Promptfoo)
Refresh pattern: incremental (checksum-gated re-embed)
Pipeline stages: 6 (ingest → ground → observe)

Production RAG is quality engineering — and quality engineering needs a phased plan. The 30/60/90-day playbook below moves a retrieval-augmented generation system from raw corpus to production-grounded answers across three deliberate phases: ingestion and embedding choice in month one, re-ranking and faithfulness evaluation in month two, observability and refresh cadence in month three.

What's at stake: most teams ship a RAG prototype in a week and then spend six months discovering that prototype-grade retrieval, no eval suite, no citation UI, and no refresh story do not survive contact with real users. The work below is the difference between a demo and a production system — and it compounds, because each phase's artefacts (chunker, eval suite, observability) become the substrate for the next.

This guide covers seven sections: why 90 days is the right horizon, the milestones in each of the three monthly phases, how to pick between Promptfoo / DeepEval / RAGAS for the eval harness, copy-pasteable templates for the ingestion pipeline and refresh playbook, and four production failure modes worth designing against before they happen.

Key takeaways
  1. Production RAG is quality engineering. Treat the system the way you'd treat a search product or a recommender — with a labelled test set, a measurable quality metric, and a CI pipeline that catches regressions before they ship to users.
  2. Chunking dominates retrieval quality. Embedding model and ANN index get most of the marketing attention; chunking decides whether the right passage even enters the candidate set. Get this right in week one or pay for it forever.
  3. Faithfulness eval is non-negotiable. "Did the answer use the retrieved context?" is the single metric that separates production RAG from a confidently-hallucinating chatbot. RAGAS, DeepEval, and custom LLM judges all measure this — pick one and instrument it.
  4. Citation UX is half the trust. Source attribution is worth roughly 80% of the perceived-quality lift in RAG products. A day on the citation UI beats a week of prompt tuning. Build it in month two, not after launch.
  5. Refresh cadence prevents drift. Corpora go stale, embedding models get deprecated, and the answer quality you measured on day 60 is not the answer quality your users see on day 180. Schedule the refresh in month three, before the drift bites.

01 · Why 90 Days: Production RAG is quality engineering — plan it that way.

Ninety days is the horizon at which a competent product team can ship a RAG system that earns user trust without cutting corners anywhere meaningful. Shorter and at least one of ingestion, evaluation, or observability gets dropped — and the dropped one is almost always evaluation, which is the one you cannot recover after launch. Longer and the work fragments across roadmap cycles and ships unevenly. The 30/60/90 split is not arbitrary; it maps cleanly to the three layers of the stack that decide whether the system is production-grade.

The mistake teams make is treating RAG as an LLM problem. The generation step is the most visible part of the answer surface but the smallest source of quality variance. Retrieval quality is the ceiling on answer quality — no model can cite a passage it never saw — and retrieval quality is decided by ingestion, chunking, embedding choice, and re-ranking, in roughly that order. The plan below front-loads exactly that work.

The forward projection that matters: teams that ship a phased RAG system on this horizon tend to compound. The eval harness from month two becomes the CI guardrail for month four's new embedding model. The observability work from month three becomes the substrate for incident response. The refresh cadence becomes the operational pattern that keeps quality from drifting. None of that compounding happens if the quality discipline is skipped.

Days 1-30 · Foundation: Ingestion & embedding

Stand up the corpus pipeline, pick the chunking strategy, lock the embedding model, validate retrieval on a hand-labelled sample. The deliverable is a system that recalls the right chunks for a representative query set — even without a polished UI.

Days 31-60 · Quality: Re-ranking, eval, citation UX

Add hybrid retrieval and a re-ranker. Build the faithfulness eval suite. Ship the citation UI so every answer surfaces its sources. By day 60 the system has measurable, repeatable answer quality and a UI users can trust.

Days 61-90 · Operate: Production, observability, refresh

Promote to production, instrument latency / cost / faithfulness dashboards, schedule the refresh cadence. The system runs unattended, regressions are caught in CI, and the refresh playbook keeps quality from drifting silently.

After 90 days · Compound: Continuous improvement

Once the foundation, quality layer, and operational layer are all in place, every subsequent change — new embedding model, larger context window, agentic retrieval — runs through the existing eval and observability rails. Compounding starts.

For teams choosing between this phased plan and a faster ship-it-and-iterate approach: the faster approach works for internal prototypes and demos but does not produce a system that survives the third week of real user traffic. The phased approach takes ninety days; the unstructured approach usually takes six months and produces a system the team cannot confidently iterate on. Pick the discipline up front.

"Retrieval quality is the ceiling on answer quality. No model can cite a passage it never saw."— The single retrieval principle that justifies the 90-day phasing

02 · Days 1-30: Ingestion, chunking, embedding pick.

Month one builds the foundation. The deliverable is a working corpus → embeddings → retrieval pipeline that recalls the right passages for a representative query set — even without a polished answer UI on top. The five milestones below are the critical path; skipping any of them produces problems that cost weeks to recover from later.

Week 1
Corpus inventory
Source registry + checksums

Catalogue every source that will feed the system. Domain, owner, freshness expectation, sensitivity. Store source URLs and content checksums so re-ingest stays idempotent and stale documents are detectable.

Deliverable: registry doc
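
A minimal sketch of what a registry entry and its checksum helper might look like; the field names and the lib/rag/registry.ts path are illustrative rather than prescriptive, and the load-bearing idea is checksumming the extracted text so re-ingest stays idempotent.

// lib/rag/registry.ts — illustrative shape; field names are assumptions
import { createHash } from 'node:crypto';

export interface SourceRecord {
  url: string;             // canonical source URL or local path
  domain: string;          // e.g. "docs", "support", "legal"
  owner: string;           // team or person accountable for freshness
  freshnessDays: number;   // how stale this source is allowed to get
  sensitivity: 'public' | 'internal' | 'restricted';
  checksum: string;        // sha-256 of the extracted text, set at ingest time
}

// Checksum the extracted text, not the raw bytes, so formatting-only changes
// (a re-exported PDF, a reflowed HTML page) do not trigger a re-embed.
export function contentChecksum(text: string): string {
  return createHash('sha256').update(text.trim()).digest('hex');
}
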
Week 2
Chunking strategy
Sliding window · semantic · paragraph

Pick the chunker that fits your corpus. Sliding window (500-800 tokens, 50-token overlap) is the workable default for prose. Code wants 200-400 tokens at function boundaries. Dense reference material can carry 800-1200.

Deliverable: chunker.ts
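
For reference, a word-based sketch of the sliding-window chunker that the ingestion template later in this guide calls as chunkBySlidingWindow; a production version would count real tokenizer tokens and respect sentence boundaries, so treat the shape rather than the splitting logic as the takeaway.

// lib/rag/chunk.ts — sliding-window chunker; word-based token estimate is an assumption
export interface Chunk {
  ord: number;       // position of the chunk within the document
  content: string;
  tokenCount: number;
}

export function chunkBySlidingWindow(
  text: string,
  { size = 600, overlap = 50 } = {},
): Chunk[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, size - overlap); // guard against overlap >= size
  const chunks: Chunk[] = [];

  for (let start = 0, ord = 0; start < words.length; start += step, ord++) {
    const window = words.slice(start, start + size);
    chunks.push({ ord, content: window.join(' '), tokenCount: window.length });
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}
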
Week 3
Embedding model lock
OpenAI · Cohere · Voyage

Pick one embedding model and stick with it for the quarter. text-embedding-3-large is the safe default. Pin dimension at the column type so accidental cross-model inserts fail loud, not silent.

Deliverable: model decision doc
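
One way to make "fail loud, not silent" concrete on the application side, assuming the OpenAI SDK and text-embedding-3-large at its default 3072 dimensions; the vector(3072) column type does the real enforcement in Postgres, and this check just catches the mistake before a round-trip.

// lib/rag/embed.ts — sketch; model and dimension are pinned together (assumption)
import OpenAI from 'openai';

export const EMBEDDING_MODEL = 'text-embedding-3-large';
export const EMBEDDING_DIM = 3072; // must match the vector(3072) column type

const openai = new OpenAI();

export async function embedBatch(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: texts,
  });
  const vectors = res.data.map((d) => d.embedding);

  // Fail loud in application code too, before the database ever sees the row.
  for (const v of vectors) {
    if (v.length !== EMBEDDING_DIM) {
      throw new Error(`embedding dimension ${v.length} !== ${EMBEDDING_DIM}`);
    }
  }
  return vectors;
}
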
Week 4a
Index + retrieval query
IVFFlat vs HNSW

Build the ANN index, write the cosine-distance query, join back to documents in the same round-trip. Start with IVFFlat; switch to HNSW only if measured recall falls below your bar.

Deliverable: retrieval API
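
A sketch of the index DDL and the retrieval query with the document join in the same round-trip; the table and column names follow the documents / chunks / embeddings shape referenced in the templates section, and the sql client import is a stand-in for whatever Postgres driver you already use.

// lib/rag/retrieve.ts — IVFFlat + cosine distance; schema and client names are assumptions
import { sql } from '@/lib/db'; // stand-in for your Postgres client

// One-time DDL: IVFFlat with cosine ops; lists ≈ sqrt(row count) is a common start.
export const CREATE_INDEX = `
  CREATE INDEX IF NOT EXISTS embeddings_vec_idx
  ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
`;

export async function retrieveTopK(queryEmbedding: number[], k = 10) {
  // <=> is pgvector's cosine-distance operator; join back to chunks and documents
  // in the same round-trip so the caller gets citable passages, not just IDs.
  return sql`
    SELECT c.id, c.content, d.title, d.source_url,
           e.embedding <=> ${JSON.stringify(queryEmbedding)}::vector AS distance
    FROM embeddings e
    JOIN chunks c    ON c.id = e.chunk_id
    JOIN documents d ON d.id = c.document_id
    ORDER BY distance
    LIMIT ${k}
  `;
}
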
Week 4b
Recall baseline
50-100 labelled queries

Hand-label a representative query set with the chunks that should be retrieved. Run the pipeline; measure recall@10. This number is the baseline every later change is measured against.

Deliverable: eval set v1
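
A minimal recall@10 scorer over a hand-labelled JSONL set; the file format and the helper imports are assumptions that build on the sketches above, and the metric itself is just per-query recall averaged across the labelled set.

// scripts/recall.ts — recall@10 over the labelled set; file format is an assumption
import { readFileSync } from 'node:fs';
import { retrieveTopK } from '@/lib/rag/retrieve';
import { embedBatch } from '@/lib/rag/embed';

interface LabelledQuery {
  query: string;
  relevantChunkIds: string[]; // hand-labelled "should be retrieved" chunks
}

async function main() {
  const lines = readFileSync('eval/v1.jsonl', 'utf8').trim().split('\n');
  const labelled: LabelledQuery[] = lines.map((l) => JSON.parse(l));

  let total = 0;
  for (const q of labelled) {
    const [embedding] = await embedBatch([q.query]);
    const hits = await retrieveTopK(embedding, 10);
    const retrieved = new Set(hits.map((h: { id: string }) => h.id));
    const found = q.relevantChunkIds.filter((id) => retrieved.has(id)).length;
    total += found / q.relevantChunkIds.length; // per-query recall
  }

  console.log(`recall@10 = ${(total / labelled.length).toFixed(3)} over ${labelled.length} queries`);
}

main().catch((e) => {
  console.error(e);
  process.exit(1);
});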

The single most under-invested step in month one is the recall baseline. Teams treat it as optional because it requires human labelling, yet it is only fifty queries — an afternoon of work — and the labelled set then becomes the foundation every subsequent change is judged against. Without it the team has no way to tell whether week-five's clever chunker tweak actually helped or just felt like it should have.

For the underlying stack — Postgres + pgvector, the canonical three-table schema, ingestion and retrieval SQL — our self-hosted RAG tutorial walks through the implementation in detail. This playbook stays at the planning layer; the tutorial is the paired step-by-step.

Phase 1 exit gate
By day 30 the system can answer "does the right chunk reach the candidate set" for at least 95% of your hand-labelled queries. If recall@10 is below that, fix retrieval before adding generation polish. Polish on a weak retrieval base is wasted effort.

03 · Days 31-60: Re-ranking, faithfulness eval, citation UX.

Month two converts a retrieval pipeline into a system that produces grounded answers users can trust. Three concerns dominate: lifting recall on hard queries with hybrid + re-ranking, instrumenting faithfulness so hallucination is measurable, and shipping the citation UI that turns "the model said" into "the model said because this passage says".

Week 5
Hybrid retrieval
Vector + BM25, fused by RRF

Add a tsvector column and a parallel BM25 query. Fuse the two rankings with reciprocal rank fusion (k=60 is the canonical constant). Lifts recall on proper-noun and identifier queries that pure vector misses.

Recall lift on entities
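
The fusion step itself is small enough to sketch in full; the input shape (arrays of chunk IDs, best match first) is an assumption, and k=60 is the constant the card cites.

// lib/rag/rrf.ts — reciprocal rank fusion of vector and BM25 rankings
export function reciprocalRankFusion(
  rankings: string[][],  // each ranking: chunk IDs, best first
  k = 60,                // canonical RRF constant
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // rank is 0-based here, so rank + 1 gives the standard 1/(k + rank) term
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Usage: reciprocalRankFusion([vectorIds, bm25Ids]) → fused candidate order
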
Week 6
Re-ranker layer
Cohere · Voyage · cross-encoder

Retrieve top 20-40 hybrid candidates, pass through a re-ranker, keep top 6-10 for the model. Adds 80-200ms latency, lifts recall@5 meaningfully on long-tail queries. Skip if latency budget is tight.

Recall lift on long tail
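
A generic re-rank wrapper, with scoreRelevance standing in for whichever cross-encoder or rerank API you wire in; the keep-top-8 default sits inside the 6-10 range the card recommends.

// lib/rag/rerank.ts — generic re-rank step; scoreRelevance is an assumed helper
// wrapping your chosen backend (Cohere, Voyage, or a local cross-encoder)
import { scoreRelevance } from '@/lib/rag/cross-encoder';

export interface Candidate { id: string; content: string; distance: number }

export async function rerank(
  query: string,
  candidates: Candidate[], // top 20-40 hybrid candidates
  keep = 8,                // top 6-10 reach the model
): Promise<Candidate[]> {
  const scored = await Promise.all(
    candidates.map(async (c) => ({
      candidate: c,
      score: await scoreRelevance(query, c.content),
    })),
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map((s) => s.candidate);
}
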
Week 7
Faithfulness eval
RAGAS or DeepEval

Instrument three metrics: faithfulness (answer grounded in context), answer relevance (answer addresses question), context precision (retrieved chunks are actually relevant). Run nightly against your labelled set.

Deliverable: eval suite in CI
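
RAGAS and DeepEval implement this metric properly; the sketch below only shows the shape of the faithfulness judgment, assuming the OpenAI SDK and a judge model of your choosing, so the nightly number stops being abstract.

// lib/rag/eval/faithfulness.ts — minimal LLM-as-judge sketch, not a framework replacement
import OpenAI from 'openai';

const openai = new OpenAI();

// Returns the fraction of answer claims the judge marks as supported by the context.
export async function faithfulness(
  answer: string,
  contextChunks: string[],
): Promise<number> {
  const res = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // judge model is an assumption — use one you trust
    messages: [
      {
        role: 'system',
        content:
          'Split the answer into atomic claims. For each claim, decide whether it is ' +
          'supported by the context. Reply with JSON: {"supported": n, "total": m}.',
      },
      {
        role: 'user',
        content: `Context:\n${contextChunks.join('\n---\n')}\n\nAnswer:\n${answer}`,
      },
    ],
    response_format: { type: 'json_object' },
  });
  const { supported, total } = JSON.parse(res.choices[0].message.content ?? '{}');
  return total ? supported / total : 0;
}
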
Week 8a
Citation UI
Inline [N] → footnote card

Parse bracketed citations from the streamed answer, render footnote cards keyed by chunk index. Hover or tap reveals source URL, title, and the exact passage the model grounded on. Day of work, half the trust.

Deliverable: citation component
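
The parsing half of the citation UI is only a few lines; a sketch, assuming the model is prompted to emit bracketed markers like [1] that index into the retrieved chunks in order.

// lib/rag/citations.ts — parse inline [N] markers from the streamed answer
export interface Citation { marker: number; chunkIndex: number }

// "The limit is 50 requests/min [2], rising to 200 on the paid tier [3]."
// → [{ marker: 2, chunkIndex: 1 }, { marker: 3, chunkIndex: 2 }]
export function parseCitations(answer: string): Citation[] {
  const seen = new Set<number>();
  const citations: Citation[] = [];
  for (const match of answer.matchAll(/\[(\d+)\]/g)) {
    const marker = Number(match[1]);
    if (!seen.has(marker)) {
      seen.add(marker);
      citations.push({ marker, chunkIndex: marker - 1 }); // [1] → chunk 0
    }
  }
  return citations;
}
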
Week 8b
Hallucination guard
Distance threshold + refusal

If the top retrieved chunk's cosine distance exceeds a calibrated threshold (often 0.4), refuse the question rather than answering. Eliminates the 'confidently wrong on out-of-corpus' failure mode.

Refusal copy + threshold doc
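
The guard itself is a threshold check in front of generation; a sketch, with the 0.4 default treated as a starting point to calibrate against your own labelled out-of-corpus queries.

// lib/rag/guard.ts — refuse when the best chunk is too far from the query
const REFUSAL_THRESHOLD = 0.4; // cosine distance; calibrate on your own corpus

const REFUSAL_COPY =
  "I don't have enough information in the knowledge base to answer that reliably.";

export function guardOrAnswer(
  candidates: { distance: number }[],
): { refuse: boolean; message?: string } {
  const best = candidates[0];
  if (!best || best.distance > REFUSAL_THRESHOLD) {
    return { refuse: true, message: REFUSAL_COPY };
  }
  return { refuse: false };
}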

The most consequential of the five milestones is week seven's faithfulness eval. Without it the team has no way to know whether a prompt-template tweak or a new chunker improved or degraded answer quality on the real query distribution. With it, every subsequent change is measurable — and the CI pipeline can fail a deploy that regresses faithfulness below a threshold. That guardrail is what makes the system iterable for months and years rather than just weeks.

Phase 2 exit gate
By day 60 the system reports faithfulness, answer relevance, and context precision nightly against the labelled set, and every answer in the UI carries inline citations that link back to the source. Anything short of both is unfinished phase-2 work; do not promote to production.

04 · Days 61-90: Production deploy, observability, refresh cadence.

Month three is the operational phase. The system promotes to production behind staged rollout, every query path is instrumented, and the refresh cadence that prevents quality drift is scheduled and rehearsed at least once before the quarter ends. The five milestones below close out the playbook.

Week 9
Staged rollout
5% → 25% → 100%

Promote behind a feature flag. Start at 5% of traffic, watch faithfulness and latency dashboards for at least 48 hours, then step up to 25% and 100% on the same gating. Roll back the flag, not the code, if a regression appears.

Deliverable: rollout plan
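
A sketch of a deterministic percentage flag, assuming an environment-variable rollout knob; hashing the user ID keeps each user's assignment stable as the percentage steps from 5 to 25 to 100, so stepping up only ever adds users.

// lib/flags/rag-rollout.ts — deterministic percentage rollout via hash bucketing
import { createHash } from 'node:crypto';

export const RAG_ROLLOUT_PERCENT = Number(process.env.RAG_ROLLOUT_PERCENT ?? '5');

export function inRagRollout(userId: string): boolean {
  const hash = createHash('sha256').update(`rag-rollout:${userId}`).digest();
  const bucket = hash.readUInt32BE(0) % 100; // stable bucket 0-99 per user
  return bucket < RAG_ROLLOUT_PERCENT;
}
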
Week 10
Observability dashboards
Latency · cost · faithfulness

P50/P95 retrieval latency, P50/P95 first-token latency, cost per query, faithfulness on a daily sample, refusal rate, citation coverage. One dashboard, one alert channel, one on-call owner.

Deliverable: dashboard URL
Week 11
Refresh cadence
Incremental + checksum-gated

Schedule the corpus refresh. Daily incremental re-ingest of changed sources (checksum compare gates re-embedding), weekly full re-run of the eval set, monthly review of any drift. Document who owns each step.

Deliverable: refresh playbook
Week 12a
On-call runbook
Incident → diagnosis → mitigation

Document the top failure modes: stale corpus, embedding-model deprecation, recall regression, hallucination spike. For each, the diagnostic query, the immediate mitigation (often the feature flag), and the escalation path.

Deliverable: runbook.md
Week 12b
Quarterly audit
Faithfulness · drift · spend

Schedule the day-90 review with the audit scorecard: recall, faithfulness, latency, cost, refusal rate, citation coverage, refresh adherence. Anything below threshold gets a remediation ticket before the next quarter starts.

Deliverable: audit doc

The hidden milestone in month three is week eleven's refresh cadence. Most teams ship without one and then watch answer quality degrade silently over the next four to six months as the live sources drift away from the snapshot the corpus was embedded from. The fix is operational, not technical: schedule the incremental refresh, gate it on checksum changes so the embedding bill stays bounded, and rehearse the full re-embed at least once before you actually need it.

Phase 3 exit gate
By day 90 the system serves production traffic behind a feature flag, dashboards report latency, cost, and faithfulness on rolling windows, and the refresh playbook has run end-to-end at least once. The quarter closes with a written audit and a plan for the next quarter's improvements.

05 · Eval Frameworks: Promptfoo, DeepEval, RAGAS — pick by archetype.

Three eval frameworks dominate the production RAG toolbox in 2026 — Promptfoo, DeepEval, and RAGAS. They are not interchangeable; each is shaped for a different team archetype. Pick the one whose archetype matches yours; do not run all three. The maintenance overhead of multiple eval harnesses defeats the discipline they are meant to enforce.

Promptfoo
Test-driven prompt iteration

YAML-driven, CLI-first, snapshot-test feel. Strongest fit if your team treats prompts and retrieval configs as code and wants a CI step that flags regressions on a labelled query set. Lightweight, low overhead, scales linearly with test count.

Pick Promptfoo
DeepEval
Python-first eval suite

Pytest-compatible, broad metric library (G-Eval, hallucination, contextual relevance), local cross-encoder support. Strongest fit if your team is Python-heavy and wants eval as a first-class test category alongside unit and integration tests.

Pick DeepEval
RAGAS
RAG-specific metric suite

Purpose-built for RAG — faithfulness, answer relevance, context precision, context recall as primary citizens. Strongest fit if RAG is the product (not a feature inside a larger LLM stack) and you want the metric vocabulary the rest of the industry uses.

Pick RAGAS
Combination
Promptfoo + RAGAS as judges

Promptfoo runs the test harness, RAGAS-style judges score the outputs. Common pattern for teams that want CLI test ergonomics with the RAG-specific metric vocabulary. Adds a layer of indirection — only justified if the team finds the indirection clarifying rather than confusing.

Hybrid path

The decision rule that holds up: if your team is mostly TS/JS and your retrieval lives in the same repo as your application, Promptfoo is the lowest-friction pick. If your team is mostly Python and you already write pytest-based suites for the rest of your ML stack, DeepEval slots into the existing test discipline. If RAG is the product and you need the standard metric vocabulary for stakeholder reporting, RAGAS is the right pick. The wrong move is picking based on stars or recency — pick based on archetype.

Faithfulness
0.85
Production threshold

Fraction of answer claims supported by the retrieved context. Below 0.85 the system is unreliable enough that users notice; below 0.75 it is actively producing trust-eroding hallucinations. Track nightly, alert on drop.

Primary metric
Answer relevance
0.90
Question-alignment bar

How well the answer addresses the asked question (not the retrieved context). Catches the case where the model produces a faithful answer to a related but different question. Slightly higher bar than faithfulness — users notice mis-targeted answers fast.

Secondary metric
Context precision
0.80
Retrieval purity

Fraction of retrieved chunks that are actually relevant to the question. Below 0.80 the model is filtering noise rather than reasoning over signal, which lifts latency and degrades faithfulness simultaneously. Tune retrieval if this drops.

Diagnostic metric
Eval set size
100
Labelled queries minimum

100 hand-labelled queries with relevance judgments is the smallest set that gives stable measurements. 50 is too noisy for week-on-week comparison; 250 is the right target by end of month two. Refresh quarterly as the corpus evolves.

Eval set governance

06 · Templates: Ingestion pipeline, eval suite, refresh playbook.

Three reusable templates anchor the playbook. The ingestion CLI is the entry point for adding or updating sources; the eval CI step is what guards faithfulness regressions; the refresh cron is what keeps the corpus in sync without a human in the loop. Copy them, rename them, ship them.

Ingestion CLI

// scripts/ingest.ts — drop a source URL or local path, get checksum-gated re-embed
import { readSource } from '@/lib/rag/source';
import { chunkBySlidingWindow } from '@/lib/rag/chunk';
import { embedBatch } from '@/lib/rag/embed';
import { upsertDocument } from '@/lib/rag/upsert';

async function main() {
  const sourceUrl = process.argv[2];
  if (!sourceUrl) throw new Error('Usage: pnpm ingest <source-url>');

  const { title, mimeType, text } = await readSource(sourceUrl);
  const chunks = chunkBySlidingWindow(text, { size: 600, overlap: 50 });
  const embeddings = await embedBatch(chunks.map((c) => c.content));

  await upsertDocument({
    sourceUrl,
    title,
    mimeType,
    content: chunks.map((c) => ({
      ord: c.ord,
      text: c.content,
      tokens: c.tokenCount,
    })),
    embeddings,
  });

  console.log(`Ingested ${sourceUrl} · ${chunks.length} chunks`);
}

main().catch((e) => {
  console.error(e);
  process.exit(1);
});

Eval CI step

# .github/workflows/rag-eval.yml — nightly faithfulness regression gate
name: RAG eval

on:
  schedule:
    - cron: '0 3 * * *'  # nightly at 03:00 UTC
  workflow_dispatch:

jobs:
  faithfulness:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v3
      - run: pnpm install --frozen-lockfile

      # Run the labelled eval set through the RAG pipeline and score
      - run: pnpm eval:rag --set eval/v2.jsonl --out reports/eval.json

      # Fail the job if faithfulness drops below the threshold
      - run: |
          node -e "
            const r = require('./reports/eval.json');
            if (r.faithfulness < 0.85) {
              console.error('faithfulness below threshold:', r.faithfulness);
              process.exit(1);
            }
          "

      - uses: actions/upload-artifact@v4
        with:
          name: rag-eval-report
          path: reports/eval.json

Refresh playbook

// scripts/refresh.ts — incremental, checksum-gated, idempotent
import { listSources } from '@/lib/rag/registry';
import { fetchSource } from '@/lib/rag/source';
import { ingestIfChanged } from '@/lib/rag/upsert';

const MAX_PARALLEL = 8;

async function main() {
  const sources = await listSources();
  let touched = 0;
  let skipped = 0;
  let failed = 0;

  for (let i = 0; i < sources.length; i += MAX_PARALLEL) {
    const batch = sources.slice(i, i + MAX_PARALLEL);
    const results = await Promise.allSettled(
      batch.map(async (s) => {
        const fresh = await fetchSource(s.url);
        const changed = await ingestIfChanged(s, fresh);
        return changed ? 'touched' : 'skipped';
      }),
    );
    for (const r of results) {
      if (r.status === 'fulfilled' && r.value === 'touched') touched++;
      else if (r.status === 'fulfilled') skipped++;
      else {
        // Count and log failures so a dead source URL or expired credential
        // shows up in the refresh logs instead of vanishing silently.
        failed++;
        console.error('refresh failed:', r.reason);
      }
    }
  }

  console.log(`Refresh complete · touched=${touched} skipped=${skipped} failed=${failed}`);
}

main().catch((e) => {
  console.error(e);
  process.exit(1);
});
Template discipline
All three templates assume the schema in our pgvector tutorial — documents / chunks / embeddings, content checksum on documents, cascade delete on chunks. If your schema differs, adapt the upsert layer; the surrounding shape of the pipeline is the part worth copying.

Two operational notes. First, the eval CI job should run on a schedule, not on every PR — the labelled set takes minutes to score and the relevant signal is daily drift, not per-commit regression. Second, the refresh script should write its touched/skipped counts to your observability surface; an unexpected zero on the touched count usually means a source registry pointing at a dead URL.

07 · Pitfalls: Four production RAG failure modes.

Four failure modes recur across production RAG deployments. None of them are exotic; all of them are avoidable with a single deliberate design choice up front. The bars below quantify how often each appears in systems built without the phased plan above — order-of-magnitude indicative, not a controlled study.

Production RAG failure modes · how often each shows up

Indicative frequency across teams Digital Applied has audited
Silent corpus drift: source content changes, re-ingest never runs · common
Hallucination on out-of-corpus questions: no distance threshold, no refusal copy · common
Recall regression on new embedding model: no eval suite to catch the drop in CI · frequent
Citation UI bolted on at launch: treated as polish, not as the trust surface · frequent

1. Silent corpus drift

Source documents change, the re-ingest job either does not exist or runs without a checksum gate, and answer quality degrades silently over months. Mitigation is exactly the week-eleven refresh cadence — incremental, checksum-gated, scheduled, observable.

2. Hallucination on out-of-corpus questions

User asks a question the corpus doesn't cover, the model answers anyway from training data, the answer is confidently wrong with a citation that's only tangentially related. Mitigation is the week-eight hallucination guard — distance threshold on the top retrieved chunk, explicit refusal copy when the threshold is breached.

3. Recall regression on a model upgrade

New embedding model ships, team upgrades without re-running the eval set, recall@10 drops three points, faithfulness tanks two weeks later when users start noticing. Mitigation is the week-seven eval suite plus the CI gate from the template above — no model change ships without a passing eval.

4. Citation UI as launch polish

Citations get pushed to "v2" or "after launch", users have no way to verify the model's claims, trust collapses inside the first month. Mitigation is the week-eight citation UI — built into the answer surface from day one of public access, not bolted on after.

"Every production RAG failure mode we&apos;ve seen maps to a skipped phase. The plan is not academic — it is the failure-mode prevention list."— A reading of four years of post-mortems

If you are looking for an outside-in version of this same discipline as a one-page audit, the companion 80-point RAG quality scorecard scores an existing deployment against the same dimensions this plan builds toward. The two pair cleanly — the plan builds the system; the scorecard audits it once it's live.

Conclusion

Production RAG is engineering — 90 days is the right horizon to do it right.

The 30/60/90-day phasing in this playbook is not arbitrary structure imposed on a free-form problem. It maps to the three layers of the RAG stack that decide whether the system is production-grade — foundation, quality, operations — and it front-loads exactly the work that compounds. The ingestion pipeline from month one becomes the substrate for month two's eval suite, which becomes the guardrail for month three's production deploy, which becomes the ground truth for every subsequent improvement.

The candid framing is the right one: shorter horizons skip evaluation and ship hallucinating chatbots; longer horizons fragment the work across roadmap cycles and ship unevenly. Ninety days is the horizon at which a competent product team can ship a phased system without cutting corners anywhere meaningful. The teams that hold the discipline tend to be the ones with compounding answer quality quarter on quarter; the teams that don't tend to be the ones rewriting the system from scratch six months in.

The broader signal is clear: RAG has matured into a quality-engineering discipline with a stable shape — corpus inventory, chunking, embeddings, hybrid retrieval, re-ranking, faithfulness eval, citation UX, observability, refresh. The shape will keep evolving (agentic retrieval, longer context windows, new embedding models), but the phasing endures. Pick the discipline up front and the evolution becomes an upgrade path rather than a rewrite.

Ship RAG to production

Production RAG is quality engineering — 90 days is the right horizon.

Our team designs and operates production RAG systems — ingestion, retrieval, generation, observability — with phased rollout and quarterly audit.

Free consultation · Expert guidance · Tailored solutions
What we deliver

RAG production engagements

  • Ingestion and chunking strategy design
  • Embedding vendor picks and benchmark
  • Eval suite (RAGAS / DeepEval / Promptfoo)
  • Citation UX and faithfulness implementation
  • Production observability and refresh cadence
FAQ · RAG 90-day plan

The questions teams ask before shipping RAG.

Self-host on Postgres + pgvector if your corpus is under roughly one million chunks, your retrieval-grade content lives alongside relational records (users, accounts, permissions), and you want one backup strategy and one observability surface. Evaluate dedicated vector DBs (Qdrant, Weaviate, Vespa) when you cross 1-5 million chunks, your write rate becomes a primary concern, or multi-tenant fanout starts to dominate operational time. The honest rule of thumb: until you have a measured reason to leave Postgres, do not leave Postgres. Most teams never need to. Sovereignty-bound sectors (healthcare, finance, public sector) often self-host on principle regardless of scale; that is a separate decision driven by where the data is allowed to live, not by performance.