RAG system metrics turn a quality story from "feels accurate" into a contract — recall, precision, faithfulness, citation accuracy, freshness, latency, and cost, tracked weekly against a labeled eval set and dashboarded so regressions surface before users feel them. The ten KPIs below are the production minimum we instrument on every RAG engagement, mapped to RAGAS, DeepEval, and Promptfoo so the metric definitions are reproducible across teams and stacks.
What's at stake: most production RAG ships with a single end-to-end "is the answer good" eval and no layered instrumentation. When quality drops — and it always does, silently — the team has no way to tell whether the failure is at retrieval, generation, or grounding. The fix is a multi-metric panel that distinguishes a recall regression (the right chunk never reached the model) from a faithfulness regression (the right chunk reached the model but was ignored) from a freshness regression (the chunk reached the model but was stale).
This guide covers seven sections: why RAG metrics matter, then one section per metric family — recall and precision, faithfulness, citation accuracy, freshness and refresh lag, latency and cost — and finally a section mapping each KPI to the open-source eval framework that measures it best. Eval-framework selection comes last on purpose: the metric definitions should drive the framework choice, not the other way around.
- 01 — Recall at k dominates retrieval quality. If the right chunk never lands in the candidate set, no amount of generation tuning recovers it. Recall@10 on a hand-labeled query set is the single highest-leverage retrieval metric and the one most often missing from production dashboards.
- 02 — Faithfulness is the trust metric. Citations can look fine to the eye and still fail measurement. RAGAS or DeepEval faithfulness scores answer the question that matters: does each claim in the response trace back to a retrieved chunk. 90% is the production target; below 70% the system is unsafe to ship.
- 03 — Citation accuracy is the UX. Users do not trust the model — they trust the cited source they can click. Citation accuracy (does the cited chunk actually support the claim) is the metric that converts a faithful answer into a trusted one, and it is the single biggest UX lever above faithfulness itself.
- 04 — Freshness predicts user-perceived staleness. A retrieval pipeline can be accurate, faithful, well-cited — and still feel broken because users ask about content the system indexed six weeks ago and never refreshed. Refresh lag per source plus a last-success heartbeat is the cheapest reliability win in the panel.
- 05 — Cost per query is the production constraint. Quality is necessary but not sufficient. A grounded RAG query in 2026 costs roughly two cents end-to-end on mid-range models; the line items that dominate (long-context input, re-ranker calls, generation tokens) are the levers. Cost per query on the dashboard, alongside quality, is how production stays sustainable.
01 — Why RAG Metrics
Quality from hope to contract — what a metric panel buys you.
The defining property of production RAG quality is that it is multi-causal. A single "the answer was bad" signal collapses retrieval, generation, grounding, and freshness into one opaque outcome, and the team has no way to localize the failure. A metric panel turns the diagnosis from a hunch into a decomposition: recall regressions point at retrieval, faithfulness regressions point at generation or prompt design, citation accuracy regressions point at the grounding instrumentation, freshness regressions point at ingestion. Each metric points at a different team or fix.
The contract framing matters. When the panel is in place, the team can say "the system maintains ≥0.92 recall@10, ≥0.88 faithfulness, ≥0.85 citation accuracy, with a 24-hour P95 freshness lag" and mean it. Those are numbers that survive a vendor review, stakeholder skepticism, a regression in a deploy, or a new team member's onboarding. Without the panel, quality is whatever the loudest user said last week — which is a hope, not a contract.
The complement to a metric panel is the audit method that surfaces structural failures before they show up in the metrics. For the full 80-point checklist that scores every layer of the pipeline, see our RAG system audit (80-point quality scorecard) — the audit catches issues the metrics will only surface weeks later.
Vibe-based quality · production-unsafe
Quality is whatever the most recent user complaint says it is. Regressions surface through customer escalations, not dashboards. The team cannot answer "is the system better or worse than last month" with numbers.
End-to-end only · diagnostic-blind
One score for "is the answer good". Catches the existence of regressions but never their cause. The team knows quality dropped but cannot tell whether retrieval, generation, or freshness moved.
Ten decomposed metrics · production target
Recall, precision, faithfulness, citation accuracy, freshness, latency, cost — each pointing at a specific subsystem. Regressions are localized within hours, not weeks. The production minimum for a system users depend on.
Weighted scorecard · mature operation
Layered panel plus a single composite score that rolls up to a leadership dashboard. The composite communicates direction to stakeholders; the layers communicate causation to engineers. Both audiences served.

The ten metrics below are split deliberately across retrieval (recall@k, precision@k, MRR), generation (faithfulness, answer relevance, citation accuracy), and operations (freshness, latency, cost per query). Every metric has a measurement recipe, a production target, and a recommended eval framework. The framework-mapping section at the end ties everything together.
02 — Recall + Precision
The retrieval foundation — recall@k, MRR, precision.
Retrieval metrics answer a single question: did the right chunk land in the candidate set the generator sees. The three working metrics are recall at k, mean reciprocal rank (MRR), and precision at k. Each captures a different aspect of retrieval quality, and production panels track all three because each has a different failure mode that the others mask.
Recall@k is the highest-leverage metric. It asks: across the top-k chunks returned, did at least one relevant chunk appear. If recall@10 is below 0.85, no downstream tuning matters — the generator never saw the right context to work with. MRR adds position-awareness: a relevant chunk at rank 1 is more valuable than the same chunk at rank 8 because the generator weights early chunks more heavily. Precision at k answers the inverse: of the chunks we returned, how many were actually useful — a low precision means the context window is full of noise that diffuses the model's attention.
Recall@k · target ≥0.90 @ k=10 · highest leverage
Did at least one relevant chunk appear in the top-k. The single most important retrieval metric. Measure against a hand-labeled query set of 50-100 representative questions, each with hand-marked relevant chunks. Re-measure after every retrieval change.
Mean Reciprocal Rank · target ≥0.75 · rank-aware
Position-weighted retrieval quality. Reciprocal of the rank of the first relevant chunk, averaged across queries. Captures whether the right answer is near the top of the list or buried at the bottom — generators weight early chunks more heavily.
Precision@k · target ≥0.45 @ k=10 · context efficiency
Of the k chunks returned, how many were actually relevant. Low precision means the context window is bloated with irrelevant chunks that diffuse generator attention. Often acceptable in the 0.4-0.6 range because over-retrieval is cheap insurance for recall.

Recall@k measurement recipe
Bootstrap a labeled set from your own corpus: 50-100 representative queries, each with the hand-marked chunks that should be retrieved (one or more per query). For each query, run retrieval, capture the top-k chunk IDs, and check whether any chunk ID intersects the labeled set. Recall@k is the fraction of queries for which at least one labeled chunk landed in the top-k. Track recall@1, recall@5, and recall@10 — the curve shape tells you whether relevant chunks are merely ranked too low (recall@10 fine, recall@1 poor) or missing from the candidate set entirely (recall@10 itself poor).
The reason 50-100 queries are enough: at this sample size, a change of a few points in recall is detectable above noise. Beyond ~200 queries, the cost of labeling scales linearly while statistical power saturates. The right way to grow the labeled set is to add queries that failed in production — every user-flagged wrong answer becomes a regression-test case.
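The recipe above fits in a few lines of framework-free Python. A minimal sketch — `results` and `labels` are whatever your own harness produces; all three eval frameworks compute the same quantities from a labeled set you supply:

```python
from typing import Dict, List, Set

def retrieval_metrics(
    results: Dict[str, List[str]],  # query -> ranked list of retrieved chunk IDs
    labels: Dict[str, Set[str]],    # query -> hand-labeled relevant chunk IDs
    k: int = 10,
) -> Dict[str, float]:
    """Compute recall@k, precision@k, and MRR over a labeled query set."""
    recall_hits, precision_sum, rr_sum = 0, 0.0, 0.0
    for query, relevant in labels.items():
        top_k = results.get(query, [])[:k]
        # Recall@k: did at least one labeled chunk land in the top-k?
        if any(cid in relevant for cid in top_k):
            recall_hits += 1
        # Precision@k: fraction of the k returned chunks that are labeled relevant.
        precision_sum += sum(cid in relevant for cid in top_k) / k
        # MRR: reciprocal rank of the first relevant chunk (0 if none appeared).
        rr_sum += next(
            (1.0 / (rank + 1) for rank, cid in enumerate(top_k) if cid in relevant),
            0.0,
        )
    n = len(labels)
    return {"recall@k": recall_hits / n, "precision@k": precision_sum / n, "mrr": rr_sum / n}
```

Run it at k=1, 5, and 10 against the same labeled set to get the curve shape the recipe calls for.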
The recall regression playbook
When recall@10 drops by 5 points or more, the cause is almost always one of four things: chunking strategy changed (often unintentionally via a code path), embedding model drifted (silent re-embed against a new model version), the ANN index parameters degraded (HNSW ef_search, IVFFlat probes lowered for cost), or the corpus shifted faster than the labeled query set (the queries no longer represent real traffic). The metric tells you to look; the audit method tells you where.
03 — Faithfulness
The trust metric — does each claim trace back to a chunk.
Faithfulness is the metric that separates a hallucination-prone system from a trustworthy one. The definition is precise: of the factual claims made in the answer, what fraction can be supported by the retrieved context. A faithfulness score of 1.0 means every claim traces back to a chunk; 0.5 means half the claims are ungrounded; 0.0 means the answer is entirely fabricated relative to the context.
The implementation in both RAGAS and DeepEval follows the same recipe: decompose the answer into atomic claims using an LLM judge, then for each claim, ask whether the retrieved context supports, contradicts, or is silent about it. Supported claims count toward faithfulness; unsupported claims (silent or contradicted) count against. The score is the supported-claim fraction. The judge model matters — use a strong general model (Claude Sonnet 4.7, GPT-5.4) rather than the generator itself to avoid self-grading bias.
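Once the two judge calls are abstracted away, the decompose-and-verify recipe reduces to a small scoring function. A sketch — `decompose_claims` and `judge_claim` are hypothetical stand-ins for the judge-model prompts, not actual RAGAS or DeepEval APIs:

```python
from typing import Callable, List

def faithfulness_score(
    answer: str,
    context: List[str],
    decompose_claims: Callable[[str], List[str]],  # LLM judge: answer -> atomic claims
    judge_claim: Callable[[str, List[str]], str],  # LLM judge: claim + context ->
                                                   #   "supported" | "contradicted" | "silent"
) -> float:
    """Supported-claim fraction: the RAGAS/DeepEval-style faithfulness recipe."""
    claims = decompose_claims(answer)
    if not claims:
        return 1.0  # no factual claims -> vacuously faithful
    supported = sum(judge_claim(claim, context) == "supported" for claim in claims)
    # Silent and contradicted claims both count against the score.
    return supported / len(claims)
```

The structure makes the self-grading-bias point concrete: `judge_claim` is a separate model call, so the judge can (and should) be a different model from the generator.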
Below 70% · block ship
Frequent confident, well-cited, wrong answers. Production-unsafe. Stop and audit grounding-prompt design, refusal behavior, and whether the top chunks are actually relevant to the query. A distance-threshold cutoff at retrieval should refuse rather than ground against weak context.
70-90% · ship with UX
Usable with explicit "verify the cited source" UX. Citations are the user's safety net; the system itself is imperfectly grounded. Acceptable for low-stakes assistive use; not acceptable for autonomous decision support or regulated domains.
Above 90% · production target
The citation-trust UX becomes a quality multiplier rather than damage control. Users learn the system is reliably grounded and lean into the cited sources. The faithfulness floor where production RAG turns from a liability into a trust asset.
95%+ with verification · regulated bar
Sample 20 answers per month and manually verify the cited chunks actually support the claims. Catches the worst failure mode — well-cited answers where the cited chunk says something different from the claim. Required for regulated industries.

How to ship faithfulness eval in CI
The leverage point is not measuring faithfulness in a notebook — it is wiring the metric into continuous integration so a prompt change, retrieval change, or generation-model swap triggers a faithfulness regression test before the change ships. DeepEval integrates with pytest natively (eval cases look like test cases); RAGAS integrates with most data-pipeline frameworks via its dataset abstractions. The CI path: a labeled set of 30-50 queries with expected-faithfulness floors, run on every PR that touches retrieval or generation, block the merge if any case drops below its floor.
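The merge gate itself is framework-agnostic. A minimal sketch of the per-case floor check — case names and thresholds here are illustrative; in DeepEval the same gate is expressed as pytest assertions over eval cases:

```python
from typing import Dict, List

def faithfulness_gate(
    scores: Dict[str, float],  # query ID -> measured faithfulness on this PR's build
    floors: Dict[str, float],  # query ID -> expected-faithfulness floor for that case
) -> List[str]:
    """Return the cases that breach their floor; a non-empty list blocks the merge."""
    return [
        f"{case}: {scores.get(case, 0.0):.2f} < floor {floor:.2f}"
        for case, floor in floors.items()
        if scores.get(case, 0.0) < floor  # a missing score counts as a failure
    ]
```

Per-case floors (rather than one global floor) matter because hard queries legitimately score lower than easy ones; a single averaged threshold hides regressions on the easy cases.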
The production-traffic complement is sampling: every 1,000 live queries, sample one and run faithfulness eval offline. Build a time-series of faithfulness scores; alert on a 3-day rolling drop of 5+ points. This is the cheapest way to catch silent faithfulness regressions caused by drift in the corpus, the model, or the user-query distribution.
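The sampling alert reduces to a rolling-window comparison. A sketch, assuming one averaged faithfulness score per day on a 0-1 scale (so the 5-point drop becomes 0.05):

```python
from statistics import mean
from typing import List

def rolling_drop_alert(
    daily_scores: List[float],  # one averaged faithfulness score per day, oldest first
    window: int = 3,
    drop_points: float = 0.05,  # 5 points on a 0-1 scale
) -> bool:
    """Fire when the latest rolling mean sits drop_points below the prior window's."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two full windows
    current = mean(daily_scores[-window:])
    previous = mean(daily_scores[-2 * window:-window])
    return previous - current >= drop_points
```

Comparing two adjacent windows rather than today against yesterday is what makes the alert robust to single-day sampling noise.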
"Faithfulness is the metric that separates a hallucination-prone system from a trustworthy one. Every other metric in the panel exists to keep faithfulness on contract."— Working observation from production RAG engagements
04 — Citation Accuracy
The UX metric — does the cited chunk support the claim.
Citation accuracy is faithfulness with a higher bar. Faithfulness asks whether each claim is supported somewhere in the retrieved context; citation accuracy asks whether the specific chunk cited beside the claim actually supports it. The distinction matters because a citation that does not support its claim is more damaging than no citation at all — it gives the user a false sense of verification, and when they click through and the source says something different, the trust collapse is sharp.
The measurement is straightforward: parse citations out of the generated answer (the [N] tokens or whatever schema the system uses), pair each citation with its preceding claim, and for each pair, score whether the cited chunk supports the claim. Both RAGAS and DeepEval have citation-accuracy metrics; the DeepEval implementation is somewhat tighter because it scores at the claim level rather than the answer level.
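The parsing step might look like the following sketch, assuming the bracketed-[N] schema and a deliberately naive sentence splitter — a production system would swap in its own citation format and a real segmenter. The claim-scoring judge call is left out; this produces the (claim, citation) pairs that feed it:

```python
import re
from typing import List, Tuple

def pair_claims_with_citations(answer: str) -> List[Tuple[str, int]]:
    """Pair each [N] citation marker with the sentence (claim) that carries it."""
    pairs = []
    # Naive split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        for marker in re.findall(r"\[(\d+)\]", sentence):
            # The claim is the sentence with all citation markers stripped out.
            claim = re.sub(r"\s*\[\d+\]", "", sentence).strip()
            pairs.append((claim, int(marker)))
    return pairs
```

A sentence with two markers yields two pairs, which is what claim-level scoring wants: each cited chunk is judged against the claim independently.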
Citation accuracy · unsafe below 0.80
Below 0.80 the system is teaching users that citations are unreliable — a worse outcome than no citations. Audit the prompt enforcement, the citation parsing, and whether the model is over-citing convenient chunks rather than supporting ones.
Production target · above 0.92
Citation accuracy above 0.92 makes hover-to-verify a genuine quality lift. Users develop trust in the cited sources, click through more, and the system's perceived quality jumps above its raw faithfulness score.
Manual verification · human eyes
Twenty random production answers per month, manually verified by a human reviewer. Catches the insidious failure where the metric reports high accuracy but the citations cluster on a few easy chunks while harder claims go uncited.
Perceived-quality share · disproportionate
Working estimate from production engagements: roughly 80% of the perceived-quality jump from a faithful RAG comes from the citation UX. Hover-to-verify, click-through, source-attribution panel. Citations are the trust contract made visible.

The citation-accuracy regression playbook
When citation accuracy drops without faithfulness moving, the cause is almost always at the prompt or parsing layer: the model is citing the wrong chunk for the claim, even though a supporting chunk is in context. Check the citation instruction in the system prompt — "cite the chunk that directly supports each factual claim, not the chunk you paraphrased" is more effective than a generic citation instruction. Also check the parser: if [2] tokens are being mis-mapped to chunk IDs, the metric reports a citation-accuracy regression that is really a parsing bug.
When citation accuracy drops in lockstep with faithfulness, the cause is upstream — usually a retrieval regression where the generator is doing its best with chunks that genuinely do not support the user's query, and the citations reflect the retrieval quality, not a grounding bug. The metric panel decomposes the two cases; the audit method localizes the fix.
05 — Freshness
The staleness metric — refresh lag per source.
Freshness is the metric most often missing from RAG panels and most often responsible for "the system feels broken" user complaints that the other metrics cannot explain. A retrieval pipeline can be accurate, faithful, well-cited — and still feel broken because users ask about content the system indexed six weeks ago and never refreshed. The user perceives staleness as a quality problem; the engineering team sees clean metrics and cannot reconcile the gap.
The right way to measure freshness is at the source level, not the corpus level. Each ingested source has a real-world change rate — daily for news, weekly for product documentation, monthly for legal text, ad-hoc for blog posts. Refresh lag is the difference between when the source last changed externally and when the system re-indexed it. Track P50, P95, and P99 refresh lag per source; alert on P95 exceeding 2x the source's expected refresh interval.
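A sketch of the per-source P95 check, using a simple nearest-rank percentile — lag observations and expected intervals are in hours, and the 2x threshold comes from the alerting rule above:

```python
from typing import Dict, List

def refresh_lag_alerts(
    lags_by_source: Dict[str, List[float]],  # source -> observed refresh lags (hours)
    expected_interval: Dict[str, float],     # source -> expected refresh interval (hours)
) -> Dict[str, float]:
    """Return sources whose P95 refresh lag exceeds 2x their expected interval."""
    alerts = {}
    for source, lags in lags_by_source.items():
        ordered = sorted(lags)
        # Nearest-rank P95: the observation at the 95th-percentile position.
        p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
        if p95 > 2 * expected_interval[source]:
            alerts[source] = p95
    return alerts
```

The per-source `expected_interval` is the whole point: a 48-hour lag is an incident for a news feed and a non-event for monthly legal text.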
The five freshness signals
- Refresh lag P95. 95th-percentile time between source change and re-index. The leading indicator: when this creeps up, users will feel staleness soon. Per-source granularity matters because the right alert threshold is source-dependent.
- Last-success heartbeat. Per-source timestamp of the last successful ingestion run. A simple dashboard fires when any source has not refreshed for 2x its expected interval. The cheapest reliability check in the panel; usually catches silent ingestion failures within hours.
- Source coverage drift. Number of chunks per source over time. A sudden drop (a feed shedding 30% of its content) usually indicates a parser regression or a source-side schema change. A sudden spike indicates a duplication or re-ingestion bug.
- Checksum miss-rate. Fraction of documents re-embedded because their content actually changed, versus fraction re-embedded by mistake. A high mistake rate (no checksum-based change detection) is a cost leak and a freshness false-positive.
- User-perceived staleness. Flagged answers tagged as "out of date." This is the lagging indicator the panel ultimately serves — feed flagged answers back into the freshness alert thresholds so the system learns which sources actually need tighter refresh cadences.
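The last-success heartbeat is the simplest of the five signals to implement. A sketch — timestamps come from whatever your ingestion jobs record on success, and the 2x-interval threshold matches the rule above:

```python
from datetime import datetime, timedelta
from typing import Dict, List

def stale_sources(
    last_success: Dict[str, datetime],        # source -> timestamp of last successful ingest
    expected_interval: Dict[str, timedelta],  # source -> expected refresh interval
    now: datetime,
) -> List[str]:
    """Sources whose last successful run is older than 2x their expected interval."""
    return [
        source
        for source, ts in last_success.items()
        if now - ts > 2 * expected_interval[source]
    ]
```

Run it on a schedule and page on a non-empty result; this is the check that catches silent ingestion failures within hours rather than weeks.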
The interaction between freshness and faithfulness is subtle and worth naming. A stale-but-faithful answer is still wrong — citations point at a real chunk, the chunk really did support the claim, and the claim really was true when the chunk was indexed. Faithfulness eval scores the answer as correct because the metric does not know the chunk is six months old. The fix is to inject document age into the citation UX ("source published 6 months ago") and into the prompt ("prefer chunks from the last 30 days when the question is time-sensitive"). Freshness as a panel metric is what makes that fix discoverable.
06 — Latency + Cost
The operational metrics — P95 latency, cost per query.
Latency and cost are the production constraints. A RAG pipeline that is faithful, well-cited, and fresh but takes seven seconds to return a first token is not a production system — it is a prototype. Similarly, a quality pipeline that costs ten cents per query when the business model supports two cents is unsustainable. Both metrics belong on the same dashboard as quality because the trade-offs between them are constant and explicit.
The bars below come from a representative production stack: Postgres 16 with pgvector HNSW (m=16, ef_search=100), 1536-dim embeddings from text-embedding-3-large (reduced from the model's native 3,072 dimensions via the API's dimensions parameter), Claude Sonnet 4.7 for generation, no re-ranker. End-to-end is measured from query arrival to first generated token; full-answer timing depends on output length and is tracked separately.
Latency budget · production target P95 ≤ 800 ms
Representative pgvector + Claude Sonnet 4.7 stack · 100k chunks · no re-ranker on this path

The cost-per-query decomposition
A grounded RAG query in 2026 on a mid-range stack costs roughly two cents end-to-end. The decomposition: ~50 tokens to embed the query (~$0.0001 at OpenAI rates), ~5,000 tokens of retrieved context fed to the generator (~$0.015 at Sonnet 4.7 input pricing), ~300 generated output tokens (~$0.0045 at Sonnet 4.7 output pricing), and a Cohere Rerank call if used (~$0.0002). Generation input dominates — long-context input is the biggest single cost line, and the lever with the most leverage for reduction.
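The decomposition fits in a small calculator. The default dollar figures mirror the line items above and are rough working estimates, not quoted prices — swap in your own measured rates:

```python
def cost_per_query(
    embed_cost: float = 0.0001,   # ~50 query tokens embedded
    context_cost: float = 0.015,  # ~5,000 retrieved tokens at generator input pricing
    output_cost: float = 0.0045,  # ~300 generated tokens at generator output pricing
    rerank_cost: float = 0.0002,  # optional re-ranker call
    use_reranker: bool = False,
) -> float:
    """End-to-end dollar cost of one grounded RAG query, by line item."""
    total = embed_cost + context_cost + output_cost
    if use_reranker:
        total += rerank_cost
    return total
```

Plugging in the defaults makes the dominance argument visible: context input is ~76% of the total, which is why halving retrieved context (k=6 instead of k=12, if recall holds) is the first lever.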
The three cost levers in priority order: reduce retrieved context (better retrieval means fewer chunks needed, k=6 instead of k=12 if recall holds), cheaper generation model (Haiku for routing, Sonnet for synthesis — most queries do not need Opus), and caching (common queries cached at the answer level; query embeddings cached briefly for FAQ-style repetition). The first lever is the biggest and the one most often missed; teams over-retrieve as insurance and pay the cost on every query for the rare query where it matters.
07 — Eval Framework Mapping
RAGAS, DeepEval, Promptfoo — which measures what.
The three production-ready RAG eval frameworks in 2026 are RAGAS, DeepEval, and Promptfoo. Each one covers most of the ten metrics above, but the ergonomics, integration patterns, and metric implementations differ enough that the framework choice has real consequences. The right way to decide is not by metric coverage — all three cover the essentials — but by the team's integration model: notebook-and-data-pipeline (RAGAS), pytest and CI (DeepEval), or prompt-level A/B and red-teaming (Promptfoo).
RAGAS — the established baseline · pick to ship
Python · LangChain / LlamaIndex native
The longest production track record and the broadest metric library: faithfulness, answer relevance, context precision, context recall, context relevance. Right choice for typical LangChain-stack RAG teams that want canonical metrics with dataset-pipeline ergonomics.
DeepEval — pytest ergonomics · pick for CI
Python · pytest-first · faster-moving
Eval cases look like test cases. Cleanest CI integration of the three. The metric library is fast-moving — citation accuracy, hallucination, contextual precision/recall, and a stronger faithfulness implementation in 2026 builds. Right choice when eval is first-class in CI.
Promptfoo — A/B and red-team · pick for prompts
Node/TS · YAML test specs · cross-provider
Strongest for prompt-level A/B testing, regression suites across providers, and red-teaming. Less of a RAG-eval framework, more of a prompt-eval framework that includes RAG. Right choice when prompt iteration velocity is the bottleneck.

Metric-to-framework cheat sheet
- Recall@k, precision@k, MRR. RAGAS context recall and context precision both work; DeepEval has equivalent metrics. Promptfoo can run them via custom assertions. All three rely on a labeled set you supply.
- Faithfulness. RAGAS faithfulness and DeepEval's FaithfulnessMetric are both production-grade. DeepEval's implementation has slightly better claim decomposition in 2026 builds; RAGAS has more battle-testing. Either is correct; pick by integration model.
- Citation accuracy. DeepEval has a dedicated citation-accuracy metric; RAGAS measures this indirectly through context-precision and answer-correctness. DeepEval is the cleaner choice for citation-heavy production systems.
- Freshness and refresh lag. None of the three frameworks measure freshness — these are operational metrics, not eval-framework metrics. Build them into the ingestion-observability layer and surface them on the same dashboard as the eval-framework metrics.
- Latency and cost. Same: operational metrics tracked at the application layer (OpenTelemetry, custom-instrumented), not in the eval framework. The eval framework runs offline against the labeled set; latency and cost are measured on production traffic.
The honest recommendation: pick RAGAS to ship because the metric library is broadest and the production track-record is longest. Evaluate DeepEval in quarter two if eval-in-CI becomes a bottleneck. Layer Promptfoo on top for prompt-level A/B and regression testing where it complements rather than replaces the primary framework. Do not let the framework choice block the instrumentation — any of the three is dramatically better than no panel at all.
For teams standing this panel up from scratch on a self-hosted stack, the foundational tutorial is our build a self-hosted RAG with Postgres and pgvector walk-through, which covers the retrieval-SQL layer the metric panel sits on top of. If you want the panel and the audit method run on your stack end-to-end, our AI transformation engagements start with exactly this instrumentation work.
RAG metrics turn quality from a hope into a contract.
The ten KPIs above are the production minimum for a RAG system that users depend on. Recall and precision keep the retrieval honest; faithfulness and citation accuracy keep the generation honest; freshness keeps the corpus honest; latency and cost keep the operation honest. Each metric points at a different subsystem, which means each regression is localized to the team and the fix that owns it. That decomposition — from a single opaque quality signal to ten layered diagnostic metrics — is the entire reason the panel exists.
The leverage point is the cadence. A weekly metric review on a labeled set of 50-100 queries, combined with sampled production-traffic faithfulness scoring, surfaces regressions within days instead of weeks. The teams that get production RAG quality right are the ones who treat the panel like a financial close — scheduled, owned, with named thresholds and named escalation paths when a metric drops below its floor. Without that discipline, the panel is decoration; with it, the panel is the contract.
The single most important habit underneath all ten metrics: measure against your own labeled set, on your own queries, against your own corpus. Public benchmarks and framework defaults are a starting point — they tell you the metric is measurable. Your hand-labeled set is what tells you the metric actually moves on the failures your users care about. The metric panel that produces real lift is the one rooted in your queries, not the one borrowed from a tutorial.