
RAG System Audit: 80-Point Quality Scorecard 2026

Eighty checkpoints across six pipeline layers — ingestion, chunking, embeddings, retrieval, grounding, observability. The same scorecard we run on production RAG engagements, written up as a reusable audit method with severity tiers and remediation defaults.

Digital Applied Team
Senior AI engineers · Published May 15, 2026

Read time: 15 min · Audit points: 80 · Pipeline layers: 6
Open-source eval frameworks: RAGAS, DeepEval, Promptfoo · Typical audit duration: ≈ 5h

A RAG system audit is the highest-leverage time you can spend on a production retrieval pipeline that has been live for more than a quarter. Quality decays silently — embeddings go stale, chunks drift, retrieval thresholds quietly accept noisier candidates, and citations look fine to the eye but fail faithfulness eval. The 80-point scorecard below is the same method we apply on engagements, structured so a senior engineer can grade a stack in roughly five hours.

What's at stake: most teams who built RAG in 2024 or 2025 shipped a v1 with reasonable defaults and have not re-audited since. Models have moved, embedding models have improved by 10-20 points on retrieval benchmarks, hybrid retrieval went from optional to expected, and faithfulness eval frameworks (RAGAS, DeepEval) became production-ready. A v1 RAG that scored 90 on launch can easily score 60 a year later without a single line of code changing — the world moved, the system did not.

This guide covers seven sections: why RAG decays silently, then one section per pipeline layer (ingestion, chunking, embeddings, retrieval, grounding; observability checks are woven through each layer rather than split out), and finally a scale-baselines section with measured numbers at 10k, 100k, and 1M chunks. Each layer has a specific check count totalling eighty, with severity tiers and remediation defaults.

Key takeaways

  1. Ingestion is where most RAG audits surface critical findings. Garbage in, garbage answers. Source hygiene, deduplication, refresh cadence, and silent-failure handling collectively account for the largest fraction of high-severity findings on the engagements we run.
  2. Chunking strategy dominates retrieval quality. 500-800 token chunks with 50-token overlap is the working default for prose; code drops to 200-400 with semantic boundaries. Wrong chunk size is the single largest lever and the most common silent quality loss.
  3. Hybrid retrieval beats pure vector for proper nouns. BM25 plus vector fused with reciprocal rank fusion (RRF) recovers identifier, name, and rare-entity hits that pure vector consistently misses. Production default for knowledge bases with any named-entity surface.
  4. Faithfulness eval is non-negotiable for production RAG. Use DeepEval's faithfulness metric or RAGAS faithfulness to score whether answers are actually grounded in retrieved chunks. A 70% faithfulness baseline is the floor; 90% is the production target.
  5. Re-audit cadence: monthly for the first quarter, quarterly thereafter. Embedding models, retrieval defaults, and eval frameworks drift faster than most ML-ops cycles. A retrieval pipeline that audited clean on launch can quietly degrade in three months without a single deploy.

01 · Why Audit RAG: RAG quality decays silently — audits surface what users feel.

The defining failure mode of production RAG is silence. Unlike a broken deploy or a 500 from the API, a degraded retrieval pipeline keeps serving answers — they just become subtly less grounded, less accurate, less specific. Users feel the drop before any dashboard does, and the lag between user perception and team awareness is where trust evaporates.

Three forces drive that decay. Content drift: the underlying corpus grows, source documents change, stale chunks linger past their useful life. Model drift: new embedding models surpass the one you indexed on by 10-20 points on retrieval benchmarks; the generation model evolves but the prompts and grounding logic do not. Tooling drift: faithfulness eval frameworks (RAGAS, DeepEval, Promptfoo) reach production-readiness, hybrid retrieval and re-rankers become table stakes, and a stack that was state of the art on launch falls one or two generations behind without a single code change.

  • Pre-launch · Full 80-point audit: Before any production traffic. Catches schema mistakes, missing observability, and chunking defaults that would cost months to fix once users depend on the answers. The cheapest audit you will ever run. (Always run)
  • First quarter live · Monthly mini-audit: 30-40 points per month, rotating through layers. Catches embedding-model staleness, chunk-drift in the long tail, and silent ingestion failures. The window where user trust is established or lost. (Monthly cadence)
  • Steady state · Quarterly full audit: Re-grade the full 80 points every three months. Catches industry-level shifts — new embedding models, eval framework upgrades, retrieval pattern changes. Schedule it like a security audit. (Quarterly cadence)
  • Triggered · Incident-driven audit: After any user-reported quality regression, faithfulness-eval drop of 5+ points, or material change to the corpus. Targeted to the layers most likely to have moved. Two to three hours, not five. (Per incident)

The audit method below is layer-by-layer because RAG failures are layer-localized. A faithfulness drop almost always traces to either retrieval (the right chunk never reached the model) or grounding (the right chunk reached the model but was ignored). Distinguishing those two requires the layered scorecard — a single end-to-end metric can flag the regression but cannot point at the cause.

02 · Ingestion: Source hygiene, deduplication, refresh cadence — fifteen checks.

Ingestion is where most audits surface their highest-severity findings. The reason is structural: every downstream layer assumes the input pipeline produced clean, deduplicated, current content. When ingestion is silently broken — a feed stopped updating, a PDF parser dropped tables, a sync job started quietly re-embedding unchanged documents nightly — everything downstream amplifies the problem. The fifteen checks below cover source coverage, parsing correctness, deduplication, refresh cadence, and silent-failure handling.

The fifteen ingestion checks

  • Source coverage. Every source the product promises to answer from is wired into ingestion. No silent gaps where users ask about a section the system never indexed.
  • Parser correctness. PDFs preserve tables and ordered lists; HTML strips navigation and boilerplate; Markdown preserves heading hierarchy. Spot-check ten random documents per source type.
  • MIME-type handling. Each supported MIME type has an explicit parser and a fallback. Unknown types log a warning rather than silently producing empty chunks.
  • Checksum-based change detection. A SHA-256 hash on canonical content prevents nightly re-embedding of unchanged documents. Single largest cost-leak we find.
  • Refresh cadence. Each source has a documented refresh interval matching its real change rate — daily, weekly, monthly — and the schedule actually runs.
  • Last-success heartbeat. Every ingestion job writes a last-success timestamp. A simple dashboard or alert fires when a feed has not refreshed for 2x its expected interval.
  • Deduplication. Identical or near-identical chunks across documents are detected (cosine 0.97+) and either merged or attributed to a canonical source.
  • Cascade-delete on document removal. When a source document disappears, its chunks and embeddings are cleaned up — no orphan rows clogging the retrieval pool.
  • Versioning. Major document changes produce a new version row; old versions are tombstoned rather than hard-deleted so audit trails survive.
  • Content normalization. Whitespace, unicode normalization, smart-quotes, and encoding artefacts are handled consistently before chunking.
  • PII handling. Sources containing PII are flagged and either redacted, segregated, or excluded per policy. Audit logs prove the policy ran.
  • Rate-limit handling. External fetchers (websites, SaaS APIs) respect rate limits and back off cleanly rather than failing partial-document.
  • Partial-failure recovery. If a batch of 1,000 documents fails on document 500, the remaining 500 still process and the failed one is logged for retry.
  • Metadata extraction. Title, author, published date, section heading — captured as structured metadata for filtering and citation, not buried in the chunk text.
  • Ingestion observability. Counts, durations, error rates per source visible in a dashboard. Sudden volume changes (a feed dropping 50%) trigger alerts.
The most common ingestion finding
The single most frequent critical finding we see is no checksum-based change detection — teams paying tens of dollars a night to re-embed unchanged documents, and (worse) losing the ability to detect when a source has actually changed. Five lines of SQL and a hash function fix it; the cost-and-correctness lift is usually the largest single win in the entire audit.
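
A minimal sketch of that fix, assuming psycopg 3 and a documents table with a content_sha256 column (both names are illustrative; adapt them to your schema):

```python
import hashlib

import psycopg  # psycopg 3


def needs_reembed(conn: psycopg.Connection, doc_id: str, canonical_text: str) -> bool:
    """Return True only if the document's canonical content actually changed."""
    new_hash = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT content_sha256 FROM documents WHERE id = %s", (doc_id,)
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # unchanged: skip chunking and re-embedding entirely
    # In production, write the new hash only after the re-embed succeeds,
    # so a failed job retries instead of being silently skipped next run.
    conn.execute(
        """
        INSERT INTO documents (id, content_sha256, updated_at)
        VALUES (%s, %s, now())
        ON CONFLICT (id) DO UPDATE
          SET content_sha256 = EXCLUDED.content_sha256, updated_at = now()
        """,
        (doc_id, new_hash),
    )
    return True
```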

03 · Chunking: Strategy, size, overlap, semantic vs structural — fifteen checks.

Chunking is the single largest lever on retrieval quality and the most common silent quality loss. Wrong chunk size doesn't crash anything — it just means the right semantic unit never ends up in the retrieval candidate set, and no amount of downstream tuning recovers what was discarded at the chunk boundary.

Three patterns dominate. Sliding window with fixed token counts and overlap — simplest, most predictable, the right default. Paragraph chunking that splits on blank lines and merges short paragraphs to a target size — preserves prose boundaries. Semantic chunking that uses sentence embeddings to detect topic shifts — higher quality at higher ingestion cost. Code, tables, and structured documents need different strategies again.
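
As a concrete anchor for the sliding-window default, a minimal sketch using tiktoken's cl100k_base encoding as a stand-in tokenizer (the token-counting check below says to match your embedding model's tokenizer where you can):

```python
import tiktoken


def sliding_window_chunks(text: str, size: int = 600, overlap: int = 50) -> list[str]:
    """600-token windows with 50-token overlap: the prose default above."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= size:
        return [text] if tokens else []
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start : start + size]
        if len(window) <= overlap and chunks:
            break  # tail is already covered by the previous window's overlap
        chunks.append(enc.decode(window))
    return chunks
```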

  • Prose · Default chunk size · 600 tok: 500-800 tokens with 50-token overlap is the working default. Smaller loses surrounding context; larger buries the relevant sentence under noise. Measured against recall@10 on hand-labeled queries. (Overlap: 50 tok)
  • Code · Code chunks · 300 tok: 200-400 tokens with semantic boundaries — function or class. Splitting mid-function destroys the unit a developer is actually asking about. Language-aware splitters (tree-sitter) outperform token windows. (Boundary: function)
  • Reference · Reference docs · 1000 tok: 800-1200 tokens for dense technical material — API docs, regulations, legal text. The reference frame matters more than fine-grained retrieval; longer chunks preserve the surrounding definitions. (Recall over precision)
  • Tables · Table rows · 1 row: Each table row is its own chunk with the header repeated as context. Splitting tables by token count destroys their structure; row-as-chunk preserves the data-shape the model needs to reason about. A minimal sketch follows these cards. (Header injection)
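
A sketch of the row-as-chunk pattern from the last card; the rows input (header row first) and the title prefix are illustrative assumptions:

```python
def table_row_chunks(rows: list[list[str]], doc_title: str) -> list[str]:
    """One chunk per table row, with the header injected so each row
    retrieves correctly in isolation."""
    header = " | ".join(rows[0])
    return [
        f"{doc_title}\n{header}\n{' | '.join(row)}"
        for row in rows[1:]
        if any(cell.strip() for cell in row)  # skip empty rows
    ]
```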

The fifteen chunking checks

  • Strategy documented. The chunking choice (sliding window, paragraph, semantic) is documented with the reasoning. New engineers can answer "why this strategy" in one sentence.
  • Size measured, not guessed. Chunk size was tuned against recall@10 on a hand-labeled query set, not defaulted from a tutorial.
  • Overlap configured. 10-15% overlap (50 tokens on a 600-token chunk) prevents boundary effects where a relevant sentence sits between two chunks.
  • Semantic vs structural decision. Prose uses prose chunking; code uses code-aware splitters; tables preserve row structure. Mixed corpora apply different strategies per mime type.
  • Token-counting matches embedding model. Tokenizer used in chunking matches the embedding model's tokenizer (or close enough), not a generic GPT tokenizer when embedding with a different model.
  • Header injection. Each chunk carries enough context (document title, section heading) prepended or in metadata that retrieval finds it even when the chunk text itself omits the keyword.
  • Chunk-order preserved. Chunks have an ord field within their parent document; adjacent chunks can be retrieved alongside the primary hit.
  • Minimum-size threshold. Very short chunks (under 50 tokens) are merged with neighbours or discarded. Pure header rows or empty list items should not occupy retrieval slots.
  • Maximum-size threshold. Chunks above the embedding-model context window are split rather than truncated silently.
  • Whitespace handling. Excess whitespace is collapsed, but paragraph breaks are preserved as semantic signal.
  • List handling. Ordered and unordered lists stay intact within a chunk where possible; splitting a numbered list mid-item destroys readability.
  • Code-block handling. Code blocks are not split mid-block. A code chunk that ends with if (x) { with no closing brace is unintelligible to the model.
  • Recall measurement. Recall@10 on a hand-labeled query set is the metric, not vibes. Re-measure after any chunking change; a minimal measurement sketch follows this list.
  • Chunk-text deduplication. Identical chunks from different sources are merged or canonicalized to avoid flooding the candidate set with the same content.
  • Re-chunk on strategy change. Changing chunk size or strategy triggers a full re-chunk + re-embed of the corpus. Partial re-chunking creates retrieval inconsistencies that are hard to debug.
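
The recall measurement in check 13 is a few lines once labels exist. A sketch, assuming retrieve returns ranked chunk ids for a query and labels maps each query to its hand-marked relevant ids:

```python
def recall_at_k(retrieve, labels: dict[str, set[str]], k: int = 10) -> float:
    """Mean fraction of hand-labeled relevant chunks found in the top k."""
    scores = []
    for query, relevant in labels.items():
        hits = set(retrieve(query, k=k))
        scores.append(len(hits & relevant) / len(relevant))
    return sum(scores) / len(scores)
```

Re-run it on the same labeled set after any chunking or retrieval change; the delta is the evidence the audit wants.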
"Chunk size is the single largest lever on retrieval quality. Half the audits we run, the highest-impact fix is moving from 1500-token chunks to 600 with 50 overlap."— Working observation from the engagements we ship

04 · Embeddings: Model fit, dimensionality, refresh discipline — ten checks.

Embedding-model choice has stopped being a one-off launch decision and become an ongoing discipline. Model generations move every six to twelve months, and the gap between "state of the art at launch" and "current state of the art" translates directly into retrieval-quality points. The ten checks below cover model fit, dimensionality, versioning, and the discipline of knowing when to re-embed.

For most teams in 2026, the practical embedding-model shortlist is OpenAI text-embedding-3-large (the safe default, $0.13 per million tokens, widely tested with pgvector), Cohere embed-v3 multilingual for cross-language retrieval, and Voyage voyage-3 for the highest-quality English-language retrieval, especially on code and technical content. The honest answer remains: pick OpenAI to ship, swap only after a measured eval.

The ten embedding checks

  • Model choice justified. The embedding model was chosen based on a measured eval on your content type (English prose, multilingual, code, mixed), not defaulted from a tutorial.
  • Model version recorded. The exact model name and version is stored per row in the embeddings table — (chunk_id, model) as composite key — so a partial re-embed across model versions is possible.
  • Dimensionality pinned at column type. vector(1536) not generic vector. An accidental insert from a different-dimension model fails loud.
  • Normalization consistent. If embeddings are L2-normalized at insert time, queries normalize the same way. Cosine distance assumes both sides are normalized; mismatch silently degrades quality.
  • Re-embed plan documented. The team has a written plan for how to re-embed the corpus on a model upgrade — duration, cost, dual-running strategy during cutover.
  • Cost monitoring. Monthly embedding spend is visible. Sudden spikes (nightly re-embedding of unchanged docs, runaway ingestion) trigger alerts.
  • Batch sizing. Embedding API calls batch 100-200 strings per request to amortize HTTP overhead, not one string per call; see the sketch at the end of this section.
  • 429 / rate-limit handling. Exponential backoff with jitter on rate-limit responses, with a cap on retries before failing the job.
  • Query / document parity. Query embeddings use the same model and same preprocessing as document embeddings. Asymmetric retrieval (different models for indexing and querying) is intentional, not accidental.
  • Eval against current SOTA. The current model has been benchmarked against one alternative on a labeled test set within the last twelve months. If you cannot answer "is this still the best choice?", you do not know.
The re-embed trigger
Re-embed when (a) a new model shows 5+ points lift on your specific eval set, (b) the corpus has materially shifted (new domain, language, or content style), or (c) you are unifying two previously separate systems on a single model. The cost is high but the lift compounds across every downstream metric.
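
A sketch of the batching and backoff discipline (checks 7 and 8), with the model version kept alongside each row (check 2). It assumes the OpenAI v1 Python client; the embeddings-table schema in the trailing comment is illustrative:

```python
import random
import time

from openai import OpenAI, RateLimitError

MODEL = "text-embedding-3-large"
client = OpenAI()


def embed_batch(texts: list[str], max_retries: int = 5) -> list[list[float]]:
    """Embed 100-200 texts per call; exponential backoff with jitter on 429s."""
    for attempt in range(max_retries):
        try:
            resp = client.embeddings.create(model=MODEL, input=texts)
            return [d.embedding for d in resp.data]
        except RateLimitError:
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"embedding batch failed after {max_retries} retries")

# Record the model per row so partial re-embeds across versions stay possible:
#   INSERT INTO embeddings (chunk_id, model, embedding) VALUES (%s, %s, %s)
```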

05 · Retrieval: Pure vector, hybrid, re-ranking — fifteen checks.

Retrieval is one SQL query — often with two more layered on top for hybrid search and re-ranking. The audit question at this layer is whether you have the right retrieval mode for your query distribution, whether the ANN index parameters match your recall budget, and whether observability lets you spot a recall regression before users do. For the full retrieval-SQL recipe, see our self-hosted RAG with Postgres + pgvector tutorial; this audit assumes the system is already built.

  • Baseline · Pure vector (ORDER BY <=> LIMIT k): One SQL query, one round trip, ~12ms P50 at 100k chunks. The right default for semantic intent. Audit check: are you measuring recall@10 against exact-search ground truth? (Lowest latency)
  • Default · Hybrid RRF (vector + BM25 fused): Two CTEs, reciprocal rank fusion; a fusion sketch follows these cards. Recovers proper-noun and identifier hits pure vector misses. Audit check: do queries with named entities measurably benefit from BM25? (Production default)
  • Max recall · Hybrid + re-rank (RRF top-40 → reranker → top-8): Cross-encoder scores (query, chunk) pairs jointly. Highest recall at top-k. Audit check: is the rerank lift large enough to justify the 80-200ms latency cost? (When accuracy matters)
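
Reciprocal rank fusion itself is a few lines. A minimal sketch with the canonical k=60, written so a chunk ranked by only one of the two retrievers still scores, which is the property check 7 below audits for:

```python
def rrf_fuse(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked id lists; each list contributes 1/(k + rank) per hit."""
    scores: dict[str, float] = {}
    for ranked in (vector_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The same fusion can live in SQL as two CTEs joined with a full outer join; the Python form is simply the easiest place to unit-test the single-source-hit behavior.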

The fifteen retrieval checks

  • Index type chosen by recall budget. HNSW for recall, IVFFlat for cost. The choice was made on measurement, not preference.
  • Index parameters tuned. HNSW m and ef_search, IVFFlat lists and probes are set based on measured recall, not defaults.
  • Distance metric matches embedding model. Cosine for OpenAI and most modern embedders, dot-product when the model is trained for it. vector_cosine_ops in the index definition matches.
  • k chosen deliberately. The number of chunks retrieved is tuned. Too few — right answer missed; too many — context window blown, attention diffused.
  • Over-retrieval at SQL layer. Retrieve 20-40 candidates and truncate / re-rank to the 6-10 the model actually sees. Cheap insurance against the right chunk being just outside the cutoff.
  • Hybrid retrieval available. BM25 (Postgres tsvector) sits alongside vector for queries with named entities. Either always-on or query-classifier-routed.
  • RRF correctly implemented. Reciprocal rank fusion with k=60 (canonical constant). Both ranks contribute even when only one of vector or BM25 has a hit.
  • Re-ranker decision documented. The team has measured rerank lift on a test set and decided yes or no with numbers. No re-ranker is a valid answer.
  • Metadata filtering. Filters (tenant, source, date) are applied at the SQL layer before ANN, not in the application after retrieval.
  • Recall@10 dashboarded. Recall against a hand-labeled test set is a tracked metric, not a one-time measurement.
  • Tail-latency monitored. P50, P95, P99 of retrieval latency. ANN tail-latency is the early warning for index pressure.
  • Distance-threshold cutoff. Optional but recommended: results with cosine distance above ~0.4 are filtered out as "no good match" rather than fed to the model and hallucinated against.
  • Connection pool sized for ANN. ANN scans hold a connection for the full search duration; pool size accounts for concurrent retrieval traffic.
  • Query-side embedding cached for repeats. Common queries (FAQs, exact-match repeats) have their query embedding cached briefly to avoid re-embedding the same text.
  • Index rebuild plan documented. When and how the ANN index is rebuilt (IVFFlat after large ingest, HNSW never, both on dimension change) is written down.

06 · Grounding: Citations, faithfulness, hallucination guards — fifteen checks.

Grounding is the layer where retrieval becomes a trustworthy answer. Three concerns: how the prompt forces the model to actually use retrieved context, how citations surface in the UI so users can verify, and how hallucinations are detected and blocked before they reach production. Faithfulness eval — using RAGAS, DeepEval, or Promptfoo — is what turns "feels grounded" into a measured metric.

The fifteen grounding checks

  • Grounding-only prompt instruction. System prompt explicitly tells the model to answer only from the numbered context and to refuse otherwise.
  • Refusal behavior tested. Out-of-corpus questions reliably trigger refusal rather than confident hallucination. Test set includes negative examples.
  • Chunks numbered for citation. Retrieved chunks are numbered [1] through [k] in the prompt so the model can cite by index.
  • Citations enforced. Prompt requires inline citations for factual claims. Generated answers are validated post-stream to confirm citations are present.
  • Citation parsing. [N] tokens are parsed out of the streamed text and resolved to chunk metadata for UI rendering.
  • Source attribution in UI. Users can hover or tap a citation to see source URL, title, and exact chunk text. This trust UX is worth roughly 80% of perceived quality.
  • Faithfulness eval running. RAGAS or DeepEval faithfulness score on a sample of production traffic. 90% is the production target; 70% is the floor.
  • Answer-relevance eval running. Separate from faithfulness — does the answer address the question? Faithful-but-irrelevant is still a failure mode.
  • Context-precision eval. Are the retrieved chunks actually used? Low precision means the prompt is wasting context window on irrelevant chunks.
  • Hallucination guard threshold. If the top chunk's distance exceeds ~0.4, the model refuses rather than answers. Distance-based out-of-corpus detection; see the sketch after this list.
  • Temperature tuned. 0.0-0.3 for grounded RAG. Higher temperatures encourage paraphrase and hallucination-by-rephrasing.
  • Context-window budget tracked. Sum of system + retrieved chunks + user message + completion budget fits the model's window with headroom.
  • Eval results dashboarded. Faithfulness and relevance scores tracked over time, with alerts on regression.
  • Human-feedback loop. Users can flag wrong answers. Flagged examples flow into the eval set for regression testing.
  • Citation accuracy spot-checked. 20 random answers per month are manually checked: does the cited chunk actually support the claim? Catches the most insidious failure — confident citations to chunks that say something different.
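
A minimal sketch of the distance-based guard from the hallucination-guard check above; the ~0.4 cosine-distance cutoff is this article's rule of thumb and should be tuned against labeled out-of-corpus queries on your own data:

```python
REFUSAL = "I don't have enough information in the knowledge base to answer that."


def grounded_or_refuse(hits: list[tuple[str, float]], cutoff: float = 0.4) -> str | None:
    """hits: (chunk_text, cosine_distance) pairs sorted ascending by distance.
    Returns a refusal message, or None when it is safe to generate."""
    if not hits or hits[0][1] > cutoff:
        return REFUSAL  # even the best chunk is too far: refuse, don't improvise
    return None
```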
The faithfulness floor
Below 70% faithfulness, the system is unsafe for production — answers are frequently confident, well-cited, and wrong. Between 70% and 90%, the system is usable with explicit "verify the cited source" UX. Above 90%, citation-trust UX becomes a quality multiplier rather than a damage-control mechanism.

07 · Scale Baselines: Latency, cost, recall at 10k / 100k / 1M chunks.

The numbers below come from a single Hetzner CX31 (4 vCPU / 8GB / NVMe) running Postgres 16 + pgvector 0.7 with HNSW (m=16, ef_search=100), 1536-dim embeddings from text-embedding-3-large, and Claude Sonnet 4.7 for generation. The point of publishing these is not that your numbers must match exactly; on similar hardware they should land in the same ballpark, but the variance from corpus shape, query distribution, and shared-buffers cache is real. The point is that the audit needs ground truth.

Latency baselines · audit reference
Hetzner CX31 · pgvector 0.7 · HNSW (m=16, ef_search=100) · text-embedding-3-large · Claude Sonnet 4.7

  • Retrieval P50 · 10k chunks (HNSW ANN query, k=8): 6 ms
  • Retrieval P50 · 100k chunks (same config, 10x corpus): 12 ms
  • Retrieval P50 · 1M chunks (same config, 100x corpus): 38 ms
  • Retrieval P95 · 1M chunks (tail latency at scale): 78 ms
  • End-to-end P50, first token (embed query + retrieve + first model token): 320 ms
  • End-to-end P95, first token (same path, tail): 540 ms

Cost baseline. A single grounded RAG query at mid-2026 pricing costs roughly two cents end-to-end with text-embedding-3-large on the embedding side and Claude Sonnet 4.7 on generation — ~50 tokens to embed the query, ~5,000 tokens of retrieved context, ~300 output tokens. That scales linearly until you add a re-ranker (Cohere Rerank adds ~$0.0002 per query) or move to a larger generation model. Audit check: is your per-query cost in that range, and do you know which line item dominates?
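
The arithmetic behind that estimate, as a sketch. The embedding price is the one quoted earlier in this article; the generation prices are placeholders to replace with your vendor's current rates:

```python
EMBED_PER_MTOK = 0.13     # $/1M tokens, text-embedding-3-large (quoted above)
GEN_IN_PER_MTOK = 3.00    # $/1M input tokens  -- placeholder, check your vendor
GEN_OUT_PER_MTOK = 15.00  # $/1M output tokens -- placeholder, check your vendor

query_tokens, context_tokens, output_tokens = 50, 5_000, 300
cost = (
    query_tokens / 1e6 * EMBED_PER_MTOK
    + context_tokens / 1e6 * GEN_IN_PER_MTOK
    + output_tokens / 1e6 * GEN_OUT_PER_MTOK
)
print(f"${cost:.4f} per query")  # with these placeholder rates, retrieved context dominates
```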

Recall baseline. On a 100k-chunk corpus with hand-labeled relevance judgments, HNSW (ef_search=100) typically recovers recall@10 of approximately 0.98 against exact-search ground truth; IVFFlat (probes=10) approximately 0.94. Adding hybrid retrieval with BM25 RRF tends to lift recall@10 to roughly 0.99 on queries containing proper nouns or identifiers. Audit check: have you measured recall against ground truth on your own queries, or are you reporting numbers from a tutorial?

Projecting forward. Embedding models will continue shifting every six to twelve months; faithfulness eval tooling (RAGAS, DeepEval, Promptfoo) will continue maturing into standard CI; re-ranker quality will continue rising relative to pure-vector retrieval. The audit method scales — the 80 points stay the same, but the bar for "passing" each one moves upward. That's exactly why monthly mini-audits during the first production quarter and quarterly full audits thereafter are not optional in 2026. If your team is standing this up from scratch and wants the broader prompt-engineering hygiene that pairs with grounding-prompt design, our prompt library audit (100-point evaluation) is the complementary scorecard. If you want this audit run on your stack, our AI transformation engagements start with exactly this method.

Conclusion

RAG quality decays silently — quarterly audits keep the system honest.

The eighty checks above are not a one-time launch checklist — they are a quarterly discipline. Production RAG systems are built once and degraded continuously by the world around them: new content, new models, new eval methods, new retrieval patterns. The teams that get the long-run quality story right are the ones who treat the audit like a security review — scheduled, owned, severity-ranked, with documented remediation.

The leverage point is the cadence. A monthly mini-audit during the first production quarter (30-40 points, rotating through layers) catches the embedding-model staleness and silent ingestion failures that destroy user trust in the window where trust is still being established. A quarterly full audit thereafter catches the industry-level shifts — new embedding models, eval framework upgrades, retrieval pattern changes — that would otherwise leave a v1 stack one or two generations behind without a single line of code changing.

The single most important habit underneath all eighty checks: measure, do not guess. Recall@10 against hand-labeled queries. Faithfulness eval scores tracked over time. Per-query cost visible on a dashboard. The audits that produce real lift are the ones grounded in measurement; the audits that produce performative deliverables are the ones grounded in vibes. Production RAG is too consequential, and decays too quietly, for vibes.

Audit your RAG system

RAG quality decays silently — quarterly audits surface the drift before users do.

Our agentic engineering team audits production RAG systems — ingestion, chunking, embeddings, retrieval, grounding, observability — and ships the eval pipeline and remediation roadmap.

Free consultation · Expert guidance · Tailored solutions
What we deliver

RAG audit engagements

  • 80-point RAG quality audit with severity ranking
  • Ingestion and chunking strategy review
  • Embedding model fit testing across vendors
  • Retrieval quality eval suite (RAGAS / DeepEval / Promptfoo)
  • Faithfulness and citation discipline implementation
FAQ · RAG audit

The questions teams ask before auditing a production RAG.

How do we measure retrieval quality without an existing labeled ground-truth set?

Bootstrap a labeled set from your own corpus and queries — 50 to 100 representative questions, each with the hand-marked chunk(s) that should be retrieved. That sample is enough to measure recall@10 meaningfully and to detect regressions across changes. For production scale, layer on LLM-as-judge evaluation: use a stronger model to score whether retrieved chunks contain the information needed to answer the query, and validate the judge against your hand-labeled set first. RAGAS context-precision and context-recall both work without traditional ground truth by leveraging the model's own assessment of retrieval quality, with the caveat that you should periodically audit the judge against human labels to catch model drift.