RAG system metrics turn a quality story from "feels accurate" into a contract — recall, precision, faithfulness, citation accuracy, freshness, latency, and cost, tracked weekly against a labeled eval set and dashboarded so regressions surface before users feel them. The ten KPIs below are the production minimum we instrument on every RAG engagement, mapped to RAGAS, DeepEval, and Promptfoo so the metric definitions are reproducible across teams and stacks.
What's at stake: most production RAG ships with a single end-to-end "is the answer good" eval and no layered instrumentation. When quality drops — and it always does, silently — the team has no way to tell whether the failure is at retrieval, generation, or grounding. The fix is a multi-metric panel that distinguishes a recall regression (the right chunk never reached the model) from a faithfulness regression (the right chunk reached the model but was ignored) from a freshness regression (the chunk reached the model but was stale).
This guide covers seven sections: why RAG metrics matter, then one section per metric family — recall and precision, faithfulness, citation accuracy, freshness and refresh lag, latency and cost — and finally a section mapping each KPI to the open-source eval framework that measures it best. Eval-framework selection comes last on purpose: the metric definitions should drive the framework choice, not the other way around.
- 01 — Recall at k dominates retrieval quality. If the right chunk never lands in the candidate set, no amount of generation tuning recovers it. Recall@10 on a hand-labeled query set is the single highest-leverage retrieval metric and the one most often missing from production dashboards.
- 02 — Faithfulness is the trust metric. Citations can look fine to the eye and still fail measurement. RAGAS or DeepEval faithfulness scores answer the question that matters: does each claim in the response trace back to a retrieved chunk. 90% is the production target; below 70% the system is unsafe to ship.
- 03 — Citation accuracy is the UX. Users do not trust the model — they trust the cited source they can click. Citation accuracy (does the cited chunk actually support the claim) is the metric that converts a faithful answer into a trusted one, and it is the single biggest UX lever above faithfulness itself.
- 04 — Freshness predicts user-perceived staleness. A retrieval pipeline can be accurate, faithful, well-cited — and still feel broken because users ask about content the system indexed six weeks ago and never refreshed. Refresh lag per source plus a last-success heartbeat is the cheapest reliability win in the panel.
- 05 — Cost per query is the production constraint. Quality is necessary but not sufficient. A grounded RAG query in 2026 costs roughly two cents end-to-end on mid-range models; the line items that dominate (long-context input, re-ranker calls, generation tokens) are the levers. Cost per query on the dashboard, alongside quality, is how production stays sustainable.
01 — Why RAG Metrics
Quality from hope to contract — what a metric panel buys you.
The defining property of production RAG quality is that it is multi-causal. A single "the answer was bad" signal collapses retrieval, generation, grounding, and freshness into one opaque outcome, and the team has no way to localize the failure. A metric panel turns the diagnosis from a hunch into a decomposition: recall regressions point at retrieval, faithfulness regressions point at generation or prompt design, citation accuracy regressions point at the grounding instrumentation, freshness regressions point at ingestion. Each metric points at a different team or fix.
The contract framing matters. When the panel is in place, the team can say "the system maintains ≥0.92 recall@10, ≥0.88 faithfulness, ≥0.85 citation accuracy, with a 24-hour P95 freshness lag" and mean it. Those are numbers that survive a vendor review, stakeholder skepticism, a regression in a deploy, or a new team member's onboarding. Without the panel, quality is whatever the loudest user said last week — which is a hope, not a contract.
The complement to a metric panel is the audit method that surfaces structural failures before they show up in the metrics. For the full 80-point checklist that scores every layer of the pipeline, see our RAG system audit (80-point quality scorecard) — the audit catches issues the metrics will only surface weeks later.
Vibe-based quality · production-unsafe
Quality is whatever the most recent user complaint says it is. Regressions surface through customer escalations, not dashboards. The team cannot answer "is the system better or worse than last month" with numbers.
End-to-end only · diagnostic-blind
One score for "is the answer good". Catches the existence of regressions but never their cause. The team knows quality dropped but cannot tell whether retrieval, generation, or freshness moved.
Ten decomposed metrics · production target
Recall, precision, faithfulness, citation accuracy, freshness, latency, cost — each pointing at a specific subsystem. Regressions are localized within hours, not weeks. The production minimum for a system users depend on.
Weighted scorecard · mature operation
Layered panel plus a single composite score that rolls up to a leadership dashboard. The composite communicates direction to stakeholders; the layers communicate causation to engineers. Both audiences served.

The ten metrics below are split deliberately across retrieval (recall@k, precision@k, MRR), generation (faithfulness, answer relevance, citation accuracy), and operations (freshness, latency, cost per query). Every metric has a measurement recipe, a production target, and a recommended eval framework. The framework-mapping section at the end ties everything together.
02 — Recall + Precision
The retrieval foundation — recall@k, MRR, precision.
Retrieval metrics answer a single question: did the right chunk land in the candidate set the generator sees. The three working metrics are recall at k, mean reciprocal rank (MRR), and precision at k. Each captures a different aspect of retrieval quality, and production panels track all three because each has a different failure mode that the others mask.
Recall@k is the highest-leverage metric. It asks: across the top-k chunks returned, did at least one relevant chunk appear. If recall@10 is below 0.85, no downstream tuning matters — the generator never saw the right context to work with. MRR adds position-awareness: a relevant chunk at rank 1 is more valuable than the same chunk at rank 8 because the generator weights early chunks more heavily. Precision at k answers the inverse: of the chunks we returned, how many were actually useful — a low precision means the context window is full of noise that diffuses the model's attention.
Recall@k · target ≥0.90 @ k=10 · highest leverage
Did at least one relevant chunk appear in the top-k. The single most important retrieval metric. Measure against a hand-labeled query set of 50-100 representative questions, each with hand-marked relevant chunks. Re-measure after every retrieval change.
Mean Reciprocal Rank · target ≥0.75 · rank-aware
Position-weighted retrieval quality. Reciprocal of the rank of the first relevant chunk, averaged across queries. Captures whether the right answer is near the top of the list or buried at the bottom — generators weight early chunks more heavily.
Precision@k · target ≥0.45 @ k=10 · context efficiency
Of the k chunks returned, how many were actually relevant. Low precision means the context window is bloated with irrelevant chunks that diffuse generator attention. Often acceptable in the 0.4-0.6 range because over-retrieval is cheap insurance for recall.

Recall@k measurement recipe
Bootstrap a labeled set from your own corpus: 50-100 representative queries, each with the hand-marked chunks that should be retrieved (one or more per query). For each query, run retrieval, capture the top-k chunk IDs, and check whether any chunk ID intersects the labeled set. Recall@k is the fraction of queries for which at least one labeled chunk landed in the top-k. Track recall@1, recall@5, and recall@10 — the curve shape tells you whether relevant chunks are merely ranked too low (recall@10 fine, recall@1 poor) or missing from the candidate set entirely (recall@10 itself poor).
The reason 50-100 queries are enough: at this sample size, a change of a few points in recall is detectable above noise. Beyond ~200 queries, the cost of labeling scales linearly while statistical power saturates. The right way to grow the labeled set is to add queries that failed in production — every user-flagged wrong answer becomes a regression-test case.
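The recipe above fits in a few lines of framework-free Python. A minimal sketch — `results` and `labels` are whatever your own harness produces; all three eval frameworks compute the same quantities from a labeled set you supply:

```python
from typing import Dict, List, Set

def retrieval_metrics(
    results: Dict[str, List[str]],  # query -> ranked list of retrieved chunk IDs
    labels: Dict[str, Set[str]],    # query -> hand-labeled relevant chunk IDs
    k: int = 10,
) -> Dict[str, float]:
    """Compute recall@k, precision@k, and MRR over a labeled query set."""
    recall_hits, precision_sum, rr_sum = 0, 0.0, 0.0
    for query, relevant in labels.items():
        top_k = results.get(query, [])[:k]
        # Recall@k: did at least one labeled chunk land in the top-k?
        if any(cid in relevant for cid in top_k):
            recall_hits += 1
        # Precision@k: fraction of the k returned chunks that are labeled relevant.
        precision_sum += sum(cid in relevant for cid in top_k) / k
        # MRR: reciprocal rank of the first relevant chunk (0 if none appeared).
        rr_sum += next(
            (1.0 / (rank + 1) for rank, cid in enumerate(top_k) if cid in relevant),
            0.0,
        )
    n = len(labels)
    return {"recall@k": recall_hits / n, "precision@k": precision_sum / n, "mrr": rr_sum / n}
```

Run it at k=1, 5, and 10 against the same labeled set to get the curve shape the recipe calls for.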
The recall regression playbook
When recall@10 drops by 5 points or more, the cause is almost always one of four things: chunking strategy changed (often unintentionally via a code path), embedding model drifted (silent re-embed against a new model version), the ANN index parameters degraded (HNSW ef_search, IVFFlat probes lowered for cost), or the corpus shifted faster than the labeled query set (the queries no longer represent real traffic). The metric tells you to look; the audit method tells you where.
03 — Faithfulness
The trust metric — does each claim trace back to a chunk.
Faithfulness is the metric that separates a hallucination-prone system from a trustworthy one. The definition is precise: of the factual claims made in the answer, what fraction can be supported by the retrieved context. A faithfulness score of 1.0 means every claim traces back to a chunk; 0.5 means half the claims are ungrounded; 0.0 means the answer is entirely fabricated relative to the context.
The implementation in both RAGAS and DeepEval follows the same recipe: decompose the answer into atomic claims using an LLM judge, then for each claim, ask whether the retrieved context supports, contradicts, or is silent about it. Supported claims count toward faithfulness; unsupported claims (silent or contradicted) count against. The score is the supported-claim fraction. The judge model matters — use a strong general model (Claude Sonnet 4.7, GPT-5.4) rather than the generator itself to avoid self-grading bias.
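Once the two judge calls are abstracted away, the decompose-and-verify recipe reduces to a small scoring function. A sketch — `decompose_claims` and `judge_claim` are hypothetical stand-ins for the judge-model prompts, not actual RAGAS or DeepEval APIs:

```python
from typing import Callable, List

def faithfulness_score(
    answer: str,
    context: List[str],
    decompose_claims: Callable[[str], List[str]],  # LLM judge: answer -> atomic claims
    judge_claim: Callable[[str, List[str]], str],  # LLM judge: claim + context ->
                                                   #   "supported" | "contradicted" | "silent"
) -> float:
    """Supported-claim fraction: the RAGAS/DeepEval-style faithfulness recipe."""
    claims = decompose_claims(answer)
    if not claims:
        return 1.0  # no factual claims -> vacuously faithful
    supported = sum(judge_claim(claim, context) == "supported" for claim in claims)
    # Silent and contradicted claims both count against the score.
    return supported / len(claims)
```

The structure makes the self-grading-bias point concrete: `judge_claim` is a separate model call, so the judge can (and should) be a different model from the generator.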
Below 70% · block ship
Frequent confident, well-cited, wrong answers. Production-unsafe. Stop and audit grounding-prompt design, refusal behavior, and whether the top chunks are actually relevant to the query. A distance-threshold cutoff at retrieval should refuse rather than ground against weak context.
70-90% · ship with UX
Usable with explicit "verify the cited source" UX. Citations are the user's safety net; the system itself is imperfectly grounded. Acceptable for low-stakes assistive use; not acceptable for autonomous decision support or regulated domains.
Above 90% · production target
The citation-trust UX becomes a quality multiplier rather than damage control. Users learn the system is reliably grounded and lean into the cited sources. The faithfulness floor where production RAG turns from a liability into a trust asset.
95%+ with verification · regulated bar
Sample 20 answers per month and manually verify the cited chunks actually support the claims. Catches the worst failure mode — well-cited answers where the cited chunk says something different from the claim. Required for regulated industries.

How to ship faithfulness eval in CI
The leverage point is not measuring faithfulness in a notebook — it is wiring the metric into continuous integration so a prompt change, retrieval change, or generation-model swap triggers a faithfulness regression test before the change ships. DeepEval integrates with pytest natively (eval cases look like test cases); RAGAS integrates with most data-pipeline frameworks via its dataset abstractions. The CI path: a labeled set of 30-50 queries with expected-faithfulness floors, run on every PR that touches retrieval or generation, block the merge if any case drops below its floor.
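The merge gate itself is framework-agnostic. A minimal sketch of the per-case floor check — case names and thresholds here are illustrative; in DeepEval the same gate is expressed as pytest assertions over eval cases:

```python
from typing import Dict, List

def faithfulness_gate(
    scores: Dict[str, float],  # query ID -> measured faithfulness on this PR's build
    floors: Dict[str, float],  # query ID -> expected-faithfulness floor for that case
) -> List[str]:
    """Return the cases that breach their floor; a non-empty list blocks the merge."""
    return [
        f"{case}: {scores.get(case, 0.0):.2f} < floor {floor:.2f}"
        for case, floor in floors.items()
        if scores.get(case, 0.0) < floor  # a missing score counts as a failure
    ]
```

Per-case floors (rather than one global floor) matter because hard queries legitimately score lower than easy ones; a single averaged threshold hides regressions on the easy cases.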
The production-traffic complement is sampling: every 1,000 live queries, sample one and run faithfulness eval offline. Build a time-series of faithfulness scores; alert on a 3-day rolling drop of 5+ points. This is the cheapest way to catch silent faithfulness regressions caused by drift in the corpus, the model, or the user-query distribution.
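The sampling alert reduces to a rolling-window comparison. A sketch, assuming one averaged faithfulness score per day on a 0-1 scale (so the 5-point drop becomes 0.05):

```python
from statistics import mean
from typing import List

def rolling_drop_alert(
    daily_scores: List[float],  # one averaged faithfulness score per day, oldest first
    window: int = 3,
    drop_points: float = 0.05,  # 5 points on a 0-1 scale
) -> bool:
    """Fire when the latest rolling mean sits drop_points below the prior window's."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two full windows
    current = mean(daily_scores[-window:])
    previous = mean(daily_scores[-2 * window:-window])
    return previous - current >= drop_points
```

Comparing two adjacent windows rather than today against yesterday is what makes the alert robust to single-day sampling noise.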
"Faithfulness is the metric that separates a hallucination-prone system from a trustworthy one. Every other metric in the panel exists to keep faithfulness on contract."— Working observation from production RAG engagements
04 — Citation Accuracy
The UX metric — does the cited chunk support the claim.
Citation accuracy is faithfulness with a higher bar. Faithfulness asks whether each claim is supported somewhere in the retrieved context; citation accuracy asks whether the specific chunk cited beside the claim actually supports it. The distinction matters because a citation that does not support its claim is more damaging than no citation at all — it gives the user a false sense of verification, and when they click through and the source says something different, the trust collapse is sharp.
The measurement is straightforward: parse citations out of the generated answer (the [N] tokens or whatever schema the system uses), pair each citation with its preceding claim, and for each pair, score whether the cited chunk supports the claim. Both RAGAS and DeepEval have citation-accuracy metrics; the DeepEval implementation is somewhat tighter because it scores at the claim level rather than the answer level.
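The parsing step might look like the following sketch, assuming the bracketed-[N] schema and a deliberately naive sentence splitter — a production system would swap in its own citation format and a real segmenter. The claim-scoring judge call is left out; this produces the (claim, citation) pairs that feed it:

```python
import re
from typing import List, Tuple

def pair_claims_with_citations(answer: str) -> List[Tuple[str, int]]:
    """Pair each [N] citation marker with the sentence (claim) that carries it."""
    pairs = []
    # Naive split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        for marker in re.findall(r"\[(\d+)\]", sentence):
            # The claim is the sentence with all citation markers stripped out.
            claim = re.sub(r"\s*\[\d+\]", "", sentence).strip()
            pairs.append((claim, int(marker)))
    return pairs
```

A sentence with two markers yields two pairs, which is what claim-level scoring wants: each cited chunk is judged against the claim independently.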
Citation accuracy · unsafe below 0.80
Below 0.80 the system is teaching users that citations are unreliable — a worse outcome than no citations. Audit the prompt enforcement, the citation parsing, and whether the model is over-citing convenient chunks rather than supporting ones.
Production target · above 0.92
Citation accuracy above 0.92 makes hover-to-verify a genuine quality lift. Users develop trust in the cited sources, click through more, and the system's perceived quality jumps above its raw faithfulness score.
Manual verification · human eyes
Twenty random production answers per month, manually verified by a human reviewer. Catches the insidious failure where the metric reports high accuracy but the citations cluster on a few easy chunks while harder claims go uncited.
Perceived-quality share · disproportionate
Working estimate from production engagements: roughly 80% of the perceived-quality jump from a faithful RAG comes from the citation UX. Hover-to-verify, click-through, source-attribution panel. Citations are the trust contract made visible.

The citation-accuracy regression playbook
When citation accuracy drops without faithfulness moving, the cause is almost always at the prompt or parsing layer: the model is citing the wrong chunk for the claim, even though a supporting chunk is in context. Check the citation instruction in the system prompt — "cite the chunk that directly supports each factual claim, not the chunk you paraphrased" is more effective than a generic citation instruction. Also check the parser: if [2] tokens are being mis-mapped to chunk IDs, the metric reports a citation-accuracy regression that is really a parsing bug.
When citation accuracy drops in lockstep with faithfulness, the cause is upstream — usually a retrieval regression where the generator is doing its best with chunks that genuinely do not support the user's query, and the citations reflect the retrieval quality, not a grounding bug. The metric panel decomposes the two cases; the audit method localizes the fix.
05 — Freshness
The staleness metric — refresh lag per source.
Freshness is the metric most often missing from RAG panels and most often responsible for "the system feels broken" user complaints that the other metrics cannot explain. A retrieval pipeline can be accurate, faithful, well-cited — and still feel broken because users ask about content the system indexed six weeks ago and never refreshed. The user perceives staleness as a quality problem; the engineering team sees clean metrics and cannot reconcile the gap.
The right way to measure freshness is at the source level, not the corpus level. Each ingested source has a real-world change rate — daily for news, weekly for product documentation, monthly for legal text, ad-hoc for blog posts. Refresh lag is the difference between when the source last changed externally and when the system re-indexed it. Track P50, P95, and P99 refresh lag per source; alert on P95 exceeding 2x the source's expected refresh interval.
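A sketch of the per-source P95 check, using a simple nearest-rank percentile — lag observations and expected intervals are in hours, and the 2x threshold comes from the alerting rule above:

```python
from typing import Dict, List

def refresh_lag_alerts(
    lags_by_source: Dict[str, List[float]],  # source -> observed refresh lags (hours)
    expected_interval: Dict[str, float],     # source -> expected refresh interval (hours)
) -> Dict[str, float]:
    """Return sources whose P95 refresh lag exceeds 2x their expected interval."""
    alerts = {}
    for source, lags in lags_by_source.items():
        ordered = sorted(lags)
        # Nearest-rank P95: the observation at the 95th-percentile position.
        p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
        if p95 > 2 * expected_interval[source]:
            alerts[source] = p95
    return alerts
```

The per-source `expected_interval` is the whole point: a 48-hour lag is an incident for a news feed and a non-event for monthly legal text.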
The five freshness signals
- Refresh lag P95. 95th-percentile time between source change and re-index. The leading indicator: when this creeps up, users will feel staleness soon. Per-source granularity matters because the right alert threshold is source-dependent.
- Last-success heartbeat. Per-source timestamp of the last successful ingestion run. A simple dashboard fires when any source has not refreshed for 2x its expected interval. The cheapest reliability check in the panel; usually catches silent ingestion failures within hours.
- Source coverage drift. Number of chunks per source over time. A sudden drop (a feed shedding 30% of its content) usually indicates a parser regression or a source-side schema change. A sudden spike indicates a duplication or re-ingestion bug.
- Checksum miss-rate. Fraction of documents re-embedded because their content actually changed, versus fraction re-embedded by mistake. A high mistake rate (no checksum-based change detection) is a cost leak and a freshness false-positive.
- User-perceived staleness. Flagged answers tagged as "out of date." This is the lagging indicator the panel ultimately serves — feed flagged answers back into the freshness alert thresholds so the system learns which sources actually need tighter refresh cadences.
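The last-success heartbeat is the simplest of the five signals to implement. A sketch — timestamps come from whatever your ingestion jobs record on success, and the 2x-interval threshold matches the rule above:

```python
from datetime import datetime, timedelta
from typing import Dict, List

def stale_sources(
    last_success: Dict[str, datetime],        # source -> timestamp of last successful ingest
    expected_interval: Dict[str, timedelta],  # source -> expected refresh interval
    now: datetime,
) -> List[str]:
    """Sources whose last successful run is older than 2x their expected interval."""
    return [
        source
        for source, ts in last_success.items()
        if now - ts > 2 * expected_interval[source]
    ]
```

Run it on a schedule and page on a non-empty result; this is the check that catches silent ingestion failures within hours rather than weeks.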
The interaction between freshness and faithfulness is subtle and worth naming. A stale-but-faithful answer is still wrong — citations point at a real chunk, the chunk really did support the claim, and the claim really was true when the chunk was indexed. Faithfulness eval scores the answer as correct because the metric does not know the chunk is six months old. The fix is to inject document age into the citation UX ("source published 6 months ago") and into the prompt ("prefer chunks from the last 30 days when the question is time-sensitive"). Freshness as a panel metric is what makes that fix discoverable.
06 — Latency + Cost
The operational metrics — P95 latency, cost per query.
Latency and cost are the production constraints. A RAG pipeline that is faithful, well-cited, and fresh but takes seven seconds to return a first token is not a production system — it is a prototype. Similarly, a quality pipeline that costs ten cents per query when the business model supports two cents is unsustainable. Both metrics belong on the same dashboard as quality because the trade-offs between them are constant and explicit.
The bars below come from a representative production stack: Postgres 16 with pgvector HNSW (m=16, ef_search=100), 1536-dim embeddings from text-embedding-3-large (reduced from the model's native 3,072 dimensions via the API's dimensions parameter), Claude Sonnet 4.7 for generation, no re-ranker. End-to-end is measured from query arrival to first generated token; full-answer timing depends on output length and is tracked separately.
Latency budget · production target P95 ≤ 800 ms
Representative pgvector + Claude Sonnet 4.7 stack · 100k chunks · no re-ranker on this path

The cost-per-query decomposition
A grounded RAG query in 2026 on a mid-range stack costs roughly two cents end-to-end. The decomposition: ~50 tokens to embed the query (~$0.0001 at OpenAI rates), ~5,000 tokens of retrieved context fed to the generator (~$0.015 at Sonnet 4.7 input pricing), ~300 generated output tokens (~$0.0045 at Sonnet 4.7 output pricing), and a Cohere Rerank call if used (~$0.0002). Generation input dominates — long-context input is the biggest single cost line, and the lever with the most leverage for reduction.
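The decomposition fits in a small calculator. The default dollar figures mirror the line items above and are rough working estimates, not quoted prices — swap in your own measured rates:

```python
def cost_per_query(
    embed_cost: float = 0.0001,   # ~50 query tokens embedded
    context_cost: float = 0.015,  # ~5,000 retrieved tokens at generator input pricing
    output_cost: float = 0.0045,  # ~300 generated tokens at generator output pricing
    rerank_cost: float = 0.0002,  # optional re-ranker call
    use_reranker: bool = False,
) -> float:
    """End-to-end dollar cost of one grounded RAG query, by line item."""
    total = embed_cost + context_cost + output_cost
    if use_reranker:
        total += rerank_cost
    return total
```

Plugging in the defaults makes the dominance argument visible: context input is ~76% of the total, which is why halving retrieved context (k=6 instead of k=12, if recall holds) is the first lever.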
The three cost levers in priority order: reduce retrieved context (better retrieval means fewer chunks needed, k=6 instead of k=12 if recall holds), cheaper generation model (Haiku for routing, Sonnet for synthesis — most queries do not need Opus), and caching (common queries cached at the answer level; query embeddings cached briefly for FAQ-style repetition). The first lever is the biggest and the one most often missed; teams over-retrieve as insurance and pay the cost on every query for the rare query where it matters.
07 — Eval Framework Mapping
RAGAS, DeepEval, Promptfoo — which measures what.
The three production-ready RAG eval frameworks in 2026 are RAGAS, DeepEval, and Promptfoo. Each one covers most of the ten metrics above, but the ergonomics, integration patterns, and metric implementations differ enough that the framework choice has real consequences. The right way to decide is not by metric coverage — all three cover the essentials — but by the team's integration model: notebook-and-data-pipeline (RAGAS), pytest and CI (DeepEval), or prompt-level A/B and red-teaming (Promptfoo).
RAGAS — the established baseline · pick to ship
Python · LangChain / LlamaIndex native
The longest production track record and the broadest metric library: faithfulness, answer relevance, context precision, context recall, context relevance. Right choice for typical LangChain-stack RAG teams that want canonical metrics with dataset-pipeline ergonomics.
DeepEval — pytest ergonomics · pick for CI
Python · pytest-first · faster-moving
Eval cases look like test cases. Cleanest CI integration of the three. The metric library is fast-moving — citation accuracy, hallucination, contextual precision/recall, and a stronger faithfulness implementation in 2026 builds. Right choice when eval is first-class in CI.
Promptfoo — A/B and red-team · pick for prompts
Node/TS · YAML test specs · cross-provider
Strongest for prompt-level A/B testing, regression suites across providers, and red-teaming. Less of a RAG-eval framework, more of a prompt-eval framework that includes RAG. Right choice when prompt iteration velocity is the bottleneck.

Metric-to-framework cheat sheet
- Recall@k, precision@k, MRR. RAGAS context recall and context precision both work; DeepEval has equivalent metrics. Promptfoo can run them via custom assertions. All three rely on a labeled set you supply.
- Faithfulness. RAGAS faithfulness and DeepEval's FaithfulnessMetric are both production-grade. DeepEval's implementation has slightly better claim decomposition in 2026 builds; RAGAS has more battle-testing. Either is correct; pick by integration model.
- Citation accuracy. DeepEval has a dedicated citation-accuracy metric; RAGAS measures this indirectly through context-precision and answer-correctness. DeepEval is the cleaner choice for citation-heavy production systems.
- Freshness and refresh lag. None of the three frameworks measure freshness — these are operational metrics, not eval-framework metrics. Build them into the ingestion-observability layer and surface them on the same dashboard as the eval-framework metrics.
- Latency and cost. Same: operational metrics tracked at the application layer (OpenTelemetry, custom-instrumented), not in the eval framework. The eval framework runs offline against the labeled set; latency and cost are measured on production traffic.
The honest recommendation: pick RAGAS to ship because the metric library is broadest and the production track-record is longest. Evaluate DeepEval in quarter two if eval-in-CI becomes a bottleneck. Layer Promptfoo on top for prompt-level A/B and regression testing where it complements rather than replaces the primary framework. Do not let the framework choice block the instrumentation — any of the three is dramatically better than no panel at all.
For teams standing this panel up from scratch on a self-hosted stack, the foundational tutorial is our build a self-hosted RAG with Postgres and pgvector walk-through, which covers the retrieval-SQL layer the metric panel sits on top of. If you want the panel and the audit method run on your stack end-to-end, our AI transformation engagements start with exactly this instrumentation work.
RAG metrics turn quality from a hope into a contract.
The ten KPIs above are the production minimum for a RAG system that users depend on. Recall and precision keep the retrieval honest; faithfulness and citation accuracy keep the generation honest; freshness keeps the corpus honest; latency and cost keep the operation honest. Each metric points at a different subsystem, which means each regression is localized to the team and the fix that owns it. That decomposition — from a single opaque quality signal to ten layered diagnostic metrics — is the entire reason the panel exists.
The leverage point is the cadence. A weekly metric review on a labeled set of 50-100 queries, combined with sampled production-traffic faithfulness scoring, surfaces regressions within days instead of weeks. The teams that get production RAG quality right are the ones who treat the panel like a financial close — scheduled, owned, with named thresholds and named escalation paths when a metric drops below its floor. Without that discipline, the panel is decoration; with it, the panel is the contract.
The single most important habit underneath all ten metrics: measure against your own labeled set, on your own queries, against your own corpus. Public benchmarks and framework defaults are a starting point — they tell you the metric is measurable. Your hand-labeled set is what tells you the metric actually moves on the failures your users care about. The metric panel that produces real lift is the one rooted in your queries, not the one borrowed from a tutorial.