AI Development · Reference · 8 min read · Published Apr 30, 2026

6 families · 80 metrics · with formulas + citations

AI Evaluation Metrics Reference Guide.

AI evaluation metrics fall into six families: text quality, embedding similarity, RAG-specific, agentic, safety and fairness, and benchmark suites. This reference covers 80 metrics with formulas, when-to-use guidance, and citations to HELM, OpenAI Evals, and HuggingFace evaluate.

Digital Applied Team
Senior strategists · Published Apr 30, 2026
Sources: HELM · OpenAI Evals · HuggingFace evaluate
Metrics defined: 80, across 6 families
Formulas: 40+, with worked examples
Benchmark suites: 15+ tracked
Library citations: 10+ (Ragas, TruLens, HELM)

AI evaluation vocabulary fragments badly. Different teams use "success rate" to mean four different things. Different vendors report "faithfulness" using different libraries. Different benchmarks claim quality wins on different measures. Locking the metric definition in your eval contract is the single highest-leverage move on any agent or RAG project.

This reference holds 80 metrics across six families: text quality (BLEU, ROUGE, METEOR, chrF, BERTScore), embedding similarity (cosine, BGE-score, MRR), RAG-specific (faithfulness, context relevance, answer relevance, context recall), agentic (success rate, tool-call accuracy, plan validity, latency, cost), safety and fairness (toxicity, bias, jailbreak rate), and the canonical benchmark suites (HELM, MMLU-Pro, GPQA, SWE-Bench, LiveCodeBench).

Each entry has a definition, formula or scoring guide, when to use, when not to, and citations to canonical papers and the HELM, OpenAI Evals, HuggingFace evaluate, Ragas, TruLens, and Arize Phoenix libraries.

Key takeaways
  1. Lock the metric definition in your eval contract. Most quality disputes trace to metric mismatch. Specify which 'success rate' you mean (step, plan, end-to-end, rubric-graded), which library you score with (Ragas vs TruLens), and which benchmark suite you compare against. Vague metric names produce vague disagreements.
  2. Three metrics drive most agent reporting: end-to-end success rate, tool-call accuracy, cost-per-successful-task. These three together capture goal completion, agent behavior, and economics. Adding more metrics adds noise; these three plus a few safety metrics cover ~80% of executive reporting.
  3. Faithfulness, context relevance, and answer relevance are the canonical RAG triad. Ragas and TruLens both implement them with stable definitions. Most production RAG dashboards lead with these three; add context recall when you need to track retrieval-side gaps.
  4. LLM-as-judge has documented biases (length, recency, self-preference). Calibrate against humans on a sample. LLM judges are scalable but not unbiased. Always run a sample of human grades to calibrate; expect 0.6-0.8 agreement and report the calibration as part of your eval methodology.
  5. Benchmark suite scores age fast. Cite the suite version and date. MMLU-Pro v1.1 ≠ MMLU-Pro v1.0. SWE-Bench Verified ≠ SWE-Bench Lite. Always pair scores with suite version and date; comparisons across versions or dates are misleading.

01 · Family 01 · Text quality metrics.

Lexical and semantic comparison of generated text to reference text. Most predate LLMs; some are still useful as cheap baselines.

BLEU. Bilingual Evaluation Understudy. Papineni et al. (2002). Measures n-gram overlap between generated and reference text. Standard for machine translation; weak signal for general generation.
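To make the n-gram overlap concrete, here is a minimal sentence-level sketch of BLEU: clipped n-gram precision, a geometric mean over orders, and a brevity penalty. It assumes a single reference, whitespace tokens, and `max_n=2` for readability; the function names are illustrative, and published scores should come from a maintained implementation rather than this sketch.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)

candidate = "the cat is on the mat".split()
reference = "the cat sat on the mat".split()
score = bleu(candidate, reference)  # sqrt(5/6 * 3/5), about 0.707
```

The clipping step is what stops a candidate such as "the the the" from scoring full unigram precision against a reference that contains "the" once.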

ROUGE-1, ROUGE-2, ROUGE-L. Lin (2004). ROUGE-1 measures unigram overlap; ROUGE-2 bigrams; ROUGE-L longest common subsequence. Standard for summarization evaluation.
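As an illustration of the longest-common-subsequence variant, the sketch below computes a ROUGE-L style score on whitespace tokens. It assumes a balanced F1 over LCS precision and recall (implementations may weight recall differently), and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Balanced F1 of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

candidate = "the cat is on the mat".split()
reference = "the cat sat on the mat".split()
score = rouge_l_f1(candidate, reference)  # LCS is 5 of 6 tokens on each side
```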

METEOR. Banerjee & Lavie (2005). ROUGE/BLEU successor that handles synonyms and stemming. Correlates better with human judgment than exact n-gram matching.

chrF. Popović (2015). Character-level F-score; robust to morphology. A standard metric in machine translation evaluation (WMT).

BERTScore. Zhang et al. (2019). Embedding-based similarity using BERT representations. Higher correlation with human judgment than n-gram metrics.

BLEURT. Sellam et al. (2020). Learned metric trained to predict human judgments. Stronger correlation than BLEU/ROUGE; requires periodic re-training.

COMET. Rei et al. (2020). Reference-based quality estimation for translation. Considered SOTA on WMT.

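Perplexity. Exponential of the average negative log-likelihood per token that the model assigns to the text (equivalently, the inverse geometric mean of the per-token probabilities). Lower is better. Useful for internal model comparison; weak external metric.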
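A short sketch of the perplexity calculation, assuming you already have natural-log probabilities for each token from the model under test (the input list here is hypothetical):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform_four = perplexity([math.log(0.25)] * 4)
```

Perplexities are only comparable between models that share a tokenizer, which is one reason the metric works poorly as an external comparison.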

Exact match. Binary metric: does generated text exactly match reference. Only useful for structured outputs (math, code-token sequences).

F1 / token F1. Standard precision-recall harmonic mean. Token F1 measures token-level overlap; used in span-extraction tasks.
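Exact match and token F1 are often reported side by side on extraction tasks. The sketch below assumes lowercasing and whitespace tokenization as the only normalization; real harnesses usually also strip punctuation and articles.

```python
from collections import Counter

def exact_match(prediction, reference):
    """1 if the normalized strings are identical, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("in Paris France", "Paris")  # 0
f1 = token_f1("in Paris France", "Paris")     # precision 1/3, recall 1 -> 0.5
```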

Edit distance / Levenshtein. Minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another. Used for OCR and code-correction evaluation.
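For reference, a compact dynamic-programming sketch of Levenshtein distance that keeps only two rows of the table in memory:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

distance = levenshtein("kitten", "sitting")  # 3
```

Dividing by the longer string's length gives a normalized score in [0, 1], which is easier to compare across samples of different sizes.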

02 · Family 02 · Embedding similarity metrics.

Semantic comparison via vector representations. Used in retrieval, clustering, and quality scoring.

Cosine similarity. Cosine of the angle between two vectors; range [-1, 1]. Standard for embedding-based similarity.

Dot product. Sum of element-wise products. Equivalent to cosine when both vectors are normalized to unit length.

Euclidean / L2 distance. Geometric distance. Some embedding models use L2; most use cosine.
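The three vector measures above are closely related. This sketch uses plain Python lists and assumes non-zero vectors of equal length. It shows that the dot product of unit-normalized vectors equals their cosine similarity, and that their squared L2 distance is then 2 - 2·cosine.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalize(u):
    """Scale a non-zero vector to unit length."""
    n = norm(u)
    return [a / n for a in u]

u, v = [3.0, 4.0], [4.0, 3.0]
cos_uv = cosine_similarity(u, v)              # 24 / 25 = 0.96
dot_unit = dot(normalize(u), normalize(v))    # same value after normalization
l2_unit = l2_distance(normalize(u), normalize(v))
```

On normalized embeddings the three measures therefore produce the same ranking, so the choice mostly matters when vectors are not normalized.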

BGE-score. Embedding-based scoring using BGE embedding models. Used as cheap LLM-judge alternative.

SemScore. Semantic similarity score from sentence-transformers; baseline for STS tasks.

MRR (Mean Reciprocal Rank). Mean of 1/(rank of first relevant document). Used in retrieval evaluation; rewards finding the first relevant doc early.

Recall@k. Fraction of all relevant documents that appear in the top-k results. Standard retrieval-quality metric.

Precision@k. Fraction of top-k documents that are relevant.
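MRR, Recall@k, and Precision@k can all be computed from a ranked list of document IDs and a set of relevant IDs. The sketch below assumes binary relevance labels; the document IDs are made up for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)

def precision_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
rr = reciprocal_rank(ranked, relevant)    # first hit at rank 2 -> 0.5
p3 = precision_at_k(ranked, relevant, 3)  # 1 of the top 3 is relevant
r3 = recall_at_k(ranked, relevant, 3)     # 1 of 2 relevant docs retrieved
```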

NDCG. Normalized Discounted Cumulative Gain. Rank-aware metric; rewards relevant docs at higher positions.

MAP (Mean Average Precision). Mean of average precision across queries. Used in retrieval benchmarks.
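The two rank-aware metrics can be sketched as follows. The NDCG helper assumes the gains list covers every judged document for the query (so the ideal ordering can be derived by sorting it) and uses the common log2 position discount. The average-precision helper assumes binary relevance.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(gains_in_ranked_order, k):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(gains_in_ranked_order, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(gains_in_ranked_order[:k]) / ideal_dcg

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

ap = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"})  # (1/2 + 2/4) / 2 = 0.5
ndcg = ndcg_at_k([0, 1, 0, 1], k=4)  # relevant docs sit at ranks 2 and 4
```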

03 · Family 03 · RAG-specific metrics.

Metrics designed for retrieval-augmented generation quality. The RAG triad — faithfulness, context relevance, answer relevance — is the canonical production set.

Faithfulness. Whether the answer is supported by retrieved context. Implemented as the fraction of answer claims supported by context. Ragas and TruLens provide the canonical implementations.

Context relevance. Whether retrieved context is actually relevant to the user query. Catches "right answer, wrong source" failures.

Answer relevance. Whether the answer addresses the user query. Catches "drifted answer" failures even when context and faithfulness are correct.

Context recall. Fraction of relevant context in the corpus that was retrieved. Retrieval-side metric; complements context relevance.

Context precision. Fraction of retrieved context that was relevant. Captures retrieval noise.

Answer correctness. Whether the answer is factually correct against ground truth. Distinct from faithfulness — an answer can be faithful to context but still incorrect.

Context utilization. Whether the answer uses the context that was retrieved. Catches "ignored the context" failures.

Citation accuracy. Whether claims are correctly attributed to specific source spans. Critical in regulated and audit-heavy domains.

Hallucination rate. Fraction of answers containing unsupported claims. Inverse correlate of faithfulness; reported separately for clarity.
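Faithfulness and hallucination rate aggregate the same claim-level verdicts in different ways, which is why they are reported separately. The sketch below assumes each answer has already been split into claims and each claim marked supported or unsupported (by a human or an LLM judge); that upstream step is not shown, and this is not the Ragas or TruLens implementation.

```python
def faithfulness(claim_supported):
    """Fraction of one answer's claims that the retrieved context supports."""
    if not claim_supported:
        return None  # nothing to verify in this answer
    return sum(claim_supported) / len(claim_supported)

def hallucination_rate(answers):
    """Fraction of answers containing at least one unsupported claim."""
    scored = [verdicts for verdicts in answers if verdicts]
    if not scored:
        return None
    return sum(1 for verdicts in scored if not all(verdicts)) / len(scored)

# Hypothetical verdicts for three answers (True = claim supported by context).
answers = [
    [True, True],
    [True, False, True],
    [True],
]
per_answer = [faithfulness(a) for a in answers]  # 1.0, 2/3, 1.0
rate = hallucination_rate(answers)               # 1 of 3 answers has an unsupported claim
```

An answer with one unsupported claim out of twenty barely moves average faithfulness but counts fully toward hallucination rate, so the two numbers can diverge on long answers.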

RAGAS score. Combined score from the Ragas library; weighted blend of faithfulness, answer relevance, context relevance, and context recall.

TruLens triad. TruLens library implementation of the RAG triad (faithfulness, context relevance, answer relevance) with provider-specific adapters.

  • Faithfulness (RAG triad, required): claims supported by context. Answer claims grounded in retrieved context. Catches hallucination.
  • Context relevance (RAG triad, required): context relevant to query. Catches 'right answer, wrong source'. Retrieval-side complement to faithfulness.
  • Answer relevance (RAG triad, required): answer addresses query. Catches drifted answers. Output-side metric.
  • Context recall (optional, diagnostic): retrieved vs available. Adds a retrieval-quality lens. Track when retrieval is the suspected bottleneck.

"Faithfulness, context relevance, answer relevance: implement these three first. Most other RAG metrics correlate with one of them." (Internal RAG eval retro, March 2026)

04 · Family 04 · Agentic metrics.

Metrics for evaluating agent behavior end-to-end and at sub-step granularity. The vocabulary diverges most across teams; locking definitions matters.

End-to-end success rate. Fraction of agent runs that complete the user goal. The headline metric for executive reporting.

Step success rate. Fraction of individual agent steps that succeed. Diagnostic metric — tells you where in the loop failures concentrate.

Plan success rate. Fraction of generated plans that complete every step. Distinct from end-to-end — a plan can succeed but fail to meet the user goal.

Tool-call accuracy. Fraction of tool calls that produce a valid, on-policy result. Catches malformed arguments, missing tools, and policy violations.

Tool selection accuracy. Fraction of tool calls where the model picked the right tool for the situation. Distinguishes "called the wrong tool right" from "called the right tool wrong."

Plan validity. Fraction of generated plans that are syntactically and semantically executable — every step references an available tool with valid args.
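The difference between tool selection accuracy and tool-call accuracy is easiest to see on labelled calls. In this sketch the `ToolCall` record and its fields are hypothetical: `expected_tool` is the grader's label for the right tool, and `result_valid` is the grader's verdict that the call produced a valid, on-policy result.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    expected_tool: str  # tool the grader considers correct (hypothetical label)
    called_tool: str    # tool the agent actually invoked
    result_valid: bool  # grader verdict: valid, on-policy result

def tool_selection_accuracy(calls):
    """Fraction of calls where the agent picked the expected tool."""
    return sum(c.called_tool == c.expected_tool for c in calls) / len(calls)

def tool_call_accuracy(calls):
    """Fraction of calls that produced a valid, on-policy result."""
    return sum(c.result_valid for c in calls) / len(calls)

calls = [
    ToolCall("search", "search", True),   # right tool, called right
    ToolCall("search", "search", False),  # right tool, called wrong
    ToolCall("lookup", "search", True),   # wrong tool, called right
    ToolCall("lookup", "lookup", True),
    ToolCall("email", "email", False),
]
selection = tool_selection_accuracy(calls)  # 4 of 5
validity = tool_call_accuracy(calls)        # 3 of 5
```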

Latency P50/P95/P99. Wall-clock latency percentiles. Production SLOs typically expressed at P95 or P99.
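A minimal nearest-rank percentile sketch for latency reporting follows. The sample values are made up. Libraries that interpolate between samples can return slightly different numbers, so the percentile method is worth stating alongside the SLO.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

latencies_ms = [120, 80, 400, 95, 110]  # hypothetical wall-clock samples
p50 = percentile(latencies_ms, 50)      # 110
p95 = percentile(latencies_ms, 95)      # 400
```

With only a handful of samples, P95 and P99 collapse onto the maximum, so tail percentiles are only meaningful with enough runs behind them.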

Cost per successful task. Total spend divided by end-to-end successes. The economic metric that captures token efficiency, success rate, and retry cost in one number.

Cost per attempt. Total spend divided by total runs. Diagnostic complement to cost-per-success.
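The two cost metrics and the end-to-end success rate are tied together by simple arithmetic, sketched here on a hypothetical `AgentRun` record with made-up costs:

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    goal_met: bool   # did the run complete the user goal?
    cost_usd: float  # total spend for the run, retries included

def end_to_end_success_rate(runs):
    return sum(r.goal_met for r in runs) / len(runs)

def cost_per_attempt(runs):
    return sum(r.cost_usd for r in runs) / len(runs)

def cost_per_successful_task(runs):
    successes = sum(r.goal_met for r in runs)
    if successes == 0:
        return float("inf")  # money spent, nothing delivered
    return sum(r.cost_usd for r in runs) / successes

runs = [
    AgentRun(True, 0.10),
    AgentRun(False, 0.30),
    AgentRun(True, 0.20),
    AgentRun(False, 0.20),
]
rate = end_to_end_success_rate(runs)         # 0.5
per_attempt = cost_per_attempt(runs)         # about 0.20
per_success = cost_per_successful_task(runs) # about 0.40
```

Cost per successful task equals cost per attempt divided by success rate, so a cheaper model that halves the success rate can leave the economics unchanged or worse.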

Retry rate. Fraction of tool calls or steps requiring retry. High retry rate signals brittle tool design or unstable upstream services.

Step budget exhaustion. Fraction of runs terminated by step-budget rather than goal completion. High rate signals task scope mismatch with budget.

Confidence calibration. Whether the model's stated confidence aligns with actual success rate. Critical for confidence-band routing.
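One common way to quantify confidence calibration is expected calibration error (ECE), named here as an illustrative choice rather than the only option. The sketch assumes stated confidences in [0, 1], binary success outcomes, and ten equal-width bins.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average gap between stated confidence and observed success per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

# The agent claims 95% confidence on four runs but succeeds on three of them.
ece = expected_calibration_error([0.95] * 4, [True, True, True, False])  # about 0.20
```

For confidence-band routing, the per-bin gaps are often more useful than the single summary number, because they show which bands can be trusted.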

Refusal accuracy. Two sub-metrics: false refusals (refusing acceptable requests) and false compliances (executing harmful requests). Tracked separately.
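The two refusal sub-metrics use different denominators, which is the main reason to track them separately. The sketch assumes each test request carries a hypothetical `should_refuse` label and the observed `did_refuse` outcome.

```python
def refusal_rates(records):
    """records: list of (should_refuse, did_refuse) boolean pairs.

    Returns (false_refusal_rate, false_compliance_rate).
    """
    acceptable = [did for should, did in records if not should]
    harmful = [did for should, did in records if should]
    # Refusals among requests that should have been served.
    false_refusal = sum(acceptable) / len(acceptable) if acceptable else None
    # Compliances among requests that should have been refused.
    false_compliance = sum(not did for did in harmful) / len(harmful) if harmful else None
    return false_refusal, false_compliance

records = [
    (False, False), (False, True), (False, False), (False, False),  # acceptable requests
    (True, True), (True, False),                                    # harmful requests
]
false_refusal, false_compliance = refusal_rates(records)  # 0.25 and 0.5
```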

Approval-gate latency. Time spent waiting for human confirmation in HITL workflows. Important for estimating end-to-end UX.

The three-metric agent dashboard
For executive reporting on agent performance, lead with three: end-to-end success rate, tool-call accuracy, and cost per successful task. Add safety metrics. Most other metrics are diagnostic — useful for engineering, noisy for executives.

05 · Family 05 · Safety & fairness metrics.

How AI safety is actually measured. The vocabulary maps directly to the Measure function of the NIST AI RMF and to EU AI Act Article 15 (accuracy, robustness, cybersecurity).

Toxicity. Fraction of outputs flagged for harmful, hateful, or offensive content. Standard toolkits: Perspective API, OpenAI Moderation, Llama Guard.

Bias. Differential model behavior across demographic groups. Measured via paired-prompt tests (e.g., changing only a protected attribute and measuring output divergence).

Demographic parity. Whether model outputs are uniform across protected attributes. Strict measure; often impractical at high resolution.

Equalized odds. Fairness measure (Hardt et al., 2016) — equal true positive and false positive rates across groups.
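Demographic parity and equalized odds can disagree on the same data. The sketch below assumes a binary decision setting with records of the form (group, true label, predicted label). It treats demographic parity as equal positive-prediction rates and reports equalized odds as the gaps in true positive and false positive rates across groups. The group names and records are invented.

```python
def rates_by_group(records):
    """records: list of (group, y_true, y_pred) with boolean labels."""
    counts = {}
    for group, y_true, y_pred in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
        if y_true:
            c["tp" if y_pred else "fn"] += 1
        else:
            c["fp" if y_pred else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[group] = {
            "positive_rate": (c["tp"] + c["fp"]) / (positives + negatives),
            "tpr": c["tp"] / positives if positives else None,
            "fpr": c["fp"] / negatives if negatives else None,
        }
    return rates

def gap(rates, key):
    """Largest difference in one rate across groups (0 means parity)."""
    values = [r[key] for r in rates.values() if r[key] is not None]
    return max(values) - min(values)

records = [
    ("A", True, True), ("A", True, False), ("A", False, False), ("A", False, True),
    ("B", True, True), ("B", True, True), ("B", False, False), ("B", False, False),
]
rates = rates_by_group(records)
parity_gap = gap(rates, "positive_rate")  # 0.0: both groups get positives half the time
tpr_gap = gap(rates, "tpr")               # 0.5: equalized odds is violated
fpr_gap = gap(rates, "fpr")               # 0.5
```

In this example the model satisfies demographic parity while making all of its errors on group A, which is why the two measures are listed as separate entries.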

Calibration parity. Whether confidence calibration is consistent across groups. Captures differential reliability.

Jailbreak rate. Fraction of adversarial prompts that successfully bypass model safety. Tracked against a known adversarial-prompt test set.

Prompt-injection resistance. Fraction of tool-result-based attacks that fail to manipulate the agent. Distinct from jailbreak (user-side) — this is data-side adversarial robustness.

PII leak rate. Fraction of outputs containing personally-identifying information that should have been redacted. Critical for GDPR and HIPAA workloads.

Hallucination severity. Distinct from hallucination rate. Severity grades the impact of each hallucination (cosmetic, misleading, dangerous).

Harmful action rate. Fraction of agent runs that take a harmful action (sending wrong email, modifying wrong record, exposing data). Specific to tool-using agents.

Safety success rate. Fraction of unsafe requests correctly refused. Pairs with refusal accuracy metrics.

Adversarial robustness. Performance under adversarial input perturbation. Measured via standardized test suites (HarmBench, WildBench).

  • Mandatory (production minimum), 3 metrics: toxicity · jailbreak · PII. These three are the minimum safety dashboard. Track per release; bound regression.
  • Regulated (sector-specific), +3 metrics: bias · calibration · refusal. Add for healthcare, finance, employment, education, government use cases.
  • Agent (agent-specific), +2 metrics: harmful-action · prompt-injection. Add for tool-using agents. Captures action-side risks beyond pure-text risks.

06 · Family 06 · Benchmark suites.

Standardized test suites used to compare models across tasks. Cite suite version and date; scores age fast.

HELM. Holistic Evaluation of Language Models. Stanford CRFM. Comprehensive multi-metric, multi-scenario benchmark suite.

MMLU. Massive Multitask Language Understanding. 57 subjects across STEM, humanities, social sciences. Frequently saturated by frontier models.

MMLU-Pro. Wang et al. (2024). Harder successor to MMLU. Reasoning-heavy; current frontier comparison standard.

GPQA. Graduate-level question answering. Rein et al. (2023). PhD-level science questions.

SWE-Bench Verified. Software engineering benchmark from real GitHub issues. Standard for coding agent evaluation.

SWE-Bench Pro. Harder SWE-Bench variant (2026). Frontier coding-agent comparison.

LiveCodeBench. Contamination-resistant coding benchmark using recently-published problems.

HumanEval. OpenAI coding benchmark. Saturating; mostly used as a sanity check now.

MBPP. Mostly Basic Python Programming. Coding benchmark; saturating.

GSM8K. Grade School Math 8K problems. Saturated by frontier models; baseline reference.

MATH. Hendrycks et al. (2021). Advanced math benchmark; still useful for reasoning evaluation.

MRCR (Multi-Round Co-reference Resolution). Long-context retrieval benchmark; tracks 1M-context performance.

RULER. Long-context evaluation suite. Multi-task; 4K to 128K+ context tested.

NIAH (Needle in a Haystack). Long-context retrieval test; planted facts in long contexts.

MCP-Atlas. Multi-task agent benchmark using MCP-exposed tools. Frontier agentic comparison.

AgentBench. Liu et al. (2023). Multi-domain agent capability benchmark.

HarmBench. Mazeika et al. (2024). Standardized harm-evaluation benchmark.

WildBench. Real-world conversation benchmark; harder than alignment-curated benchmarks.

Arena (LMSYS). Human preference comparison; long-running pairwise model evaluation.

"Always cite the benchmark version and date. SWE-Bench Verified Q1 2026 ≠ SWE-Bench Verified Q4 2025: the eval set updates and so do the scores." (Internal benchmark methodology retro, May 2026)

07 · Conclusion · Pick the metric; lock the contract.

The shape of AI evaluation vocabulary · April 2026

80 metrics is large; the right working set is much smaller.

Most production AI systems can be evaluated with 8-12 metrics: three RAG triad metrics, three agent dashboard metrics (success rate, tool-call accuracy, cost-per-success), three safety metrics (toxicity, jailbreak, PII), and a handful of task-specific text-quality or benchmark scores. The 80 in this glossary cover the broader space; pick the subset that matches your workload.

The discipline that matters more than metric selection is metric definition. Lock which 'success rate' you mean, which library you score with, which benchmark version you cite. Vague metric names produce vague disagreements; precise definitions turn quality into a measurable contract.

LLM-as-judge is scalable but biased. Calibrate against humans on a sample. Report the calibration as part of your methodology. Re-calibrate when you change models or rubrics. The methodology section of every eval report should be longer than executives expect; that's where the credibility lives.
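One way to report judge calibration is to grade the same sample with humans and with the LLM judge, then publish both raw agreement and a chance-corrected statistic. The article does not say which agreement statistic its 0.6-0.8 range refers to; Cohen's kappa is used here as an assumption, and the labels are invented.

```python
from collections import Counter

def judge_agreement(human_labels, judge_labels):
    """Return (raw agreement, Cohen's kappa) for two aligned label lists."""
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    # Agreement expected by chance if both graders labelled independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both graders used a single identical label
    return observed, (observed - expected) / (1 - expected)

human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] * 4 + ["pass"]
raw, kappa = judge_agreement(human, judge)  # raw 0.8, kappa about 0.58
```

Raw agreement looks flattering when one label dominates the sample, so a methodology section that states which statistic was used is easier to audit.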

FAQ · AI evaluation metrics

The evaluation questions we get every week.

Which success rate should lead an executive report? End-to-end success rate, almost always. End-to-end captures whether the user goal was met regardless of internal mechanics. Step-level success and tool-call accuracy are diagnostic: useful for engineering, noisy for executive reporting. Plan success rate sometimes matters but typically tracks closely with end-to-end. The right executive dashboard pattern: lead with end-to-end success rate, drill into step-level only when end-to-end shifts unexpectedly.