Production LLM safety is not a single toggle — it is a stack of six distinct guardrail layers, each addressing a different class of threat, each requiring its own tooling and latency budget. Prompt injection, PII leakage, jailbreak attempts, excessive tool-call agency, and retrieval poisoning are separate failure modes that demand separate defenses.

Most teams discover this the hard way: they add an output content filter after a production incident, declare the system “safe,” and then get hit by a prompt injection through a RAG chunk six weeks later. The problem is architectural, not adversarial — they never drew the full stack. Our retrospective of H1 2026 AI failure modes shows how repeatedly these same gaps — hallucination, tool misuse, prompt injection, and data leakage — turned into real incidents.

This reference maps all six layers, names the open-source and managed tools that implement each one, and provides the false-positive cost math that vendor docs tend to omit: Llama Guard 3's 4% false-positive rate (FPR) sounds abstract until you realize it means roughly 40,000 wrongly blocked benign requests per day on a chatbot handling one million benign messages daily. The OWASP 2025 threat taxonomy anchors each layer to the current canonical threat list, not the outdated v1.1 that many articles still cite.

Key takeaways

01 · Six layers, not one; most teams only deploy two. Input validation, prompt-template hardening, retrieval rail, output filtering, tool-call gating, and managed moderation API are distinct layers with distinct threat models. A content filter on outputs does nothing to stop a prompt injection arriving through a RAG chunk.
02 · Llama Guard 3 outperforms GPT-4 at a fraction of the cost. On Meta's internal benchmark, Llama Guard 3 achieves F1 0.939 versus GPT-4's 0.805, with a false-positive rate of 4% versus GPT-4's 15.2%, from an 8B-parameter open-weight model deployable on your own hardware. These are vendor-stated figures from Meta's model card; independent reproduction has not been published.
03 · The false-positive cost math changes every guardrail decision. A 4% FPR blocks 40,000 of every million benign daily messages. A 15.2% FPR blocks 152,000. A “wrongly blocked per million” column turns abstract F1 comparisons into concrete operational trade-offs that engineering and product teams can reason about together.
04 · OWASP 2025 is meaningfully different from v1.1. The 2025 list reorganizes the threat taxonomy: LLM01 is Prompt Injection (both direct and indirect), LLM06 is Excessive Agency (the primary threat model for tool-call gating), and Supply Chain replaces Training Data Poisoning at LLM03. Many articles and frameworks still cite the outdated v1.1 positions.
05 · NeMo Guardrails is beta, and NVIDIA's own disclaimer applies. NVIDIA explicitly states in the NeMo Guardrails README that the toolkit is not recommended for production as-is. It is the most architecturally complete open-source guardrail framework (dialog rails, retrieval rails, sidecar server mode), but production deployments require additional hardening beyond the out-of-box configuration.

01 · Architecture: The six guardrail layers nobody draws together.

The guardrail landscape is fragmented by vendor marketing: every framework presents its own taxonomy, and the result is that engineering teams end up debating “NeMo vs Guardrails AI” when the real question is “which of the six threat layers does our stack currently have no coverage for?”

The six layers below are defined by the point in the inference pipeline where they operate and the class of threat they address. They are not mutually exclusive — a production system should deploy most of them — but they have different latency profiles, different tool options, and different failure modes. The OWASP 2025 column anchors each layer to the current canonical threat taxonomy.

Six guardrail layers · pipeline position and primary OWASP 2025 threat

Architecture: Digital Applied synthesis of NeMo Guardrails, Guardrails AI, OWASP 2025

Layer 1 · Input Validation · Pre-LLM · OWASP LLM01: Sanitize and classify user input before it reaches the LLM
Layer 2 · Prompt Template Hardening · Pre-LLM · OWASP LLM01 / LLM07: System-prompt injection resistance, role anchoring
Layer 3 · Retrieval / RAG Rail · Pre-LLM · OWASP LLM01 (indirect): Filter poisoned or adversarial chunks before they enter context
Layer 4 · Output Filtering · Post-LLM · OWASP LLM02 / LLM05: PII redaction, content moderation, hallucination detection
Layer 5 · Tool-Call / Execution Gating · Post-LLM · OWASP LLM06: Validate and scope every function call before execution
Layer 6 · Managed Moderation API · Async / sync · OWASP LLM09: Probabilistic harm scoring as a final check or standalone layer

The gap most teams have

The two layers that get skipped most often are Layer 3 (Retrieval Rail) and Layer 5 (Tool-Call Gating). Both handle content that enters the LLM's context from outside the user-input path (RAG chunks and function results), and neither is addressed by a content filter on user inputs or model outputs. Adding both layers is the single highest-leverage architectural change for agentic systems.

02 · Layers 1 & 2: Input validation and prompt-template hardening.

OWASP LLM01:2025 defines prompt injection as occurring when user prompts alter the LLM's behavior or output in unintended ways — including prompts that are “imperceptible to humans.” The 2025 taxonomy distinguishes two subtypes that require different defenses:

Direct injection: a user crafts input that overrides the system prompt, changes the model's role, or extracts confidential instructions. Defense: input classifiers (Llama Guard 3, NeMo jailbreak heuristics) and input length/format validation.
Indirect injection: adversarial content arrives through an external source the LLM reads — a webpage fetched during an agent task, a RAG document, a tool result. Defense: retrieval rail (Layer 3) and output sanitization; input-only classifiers do not catch this subtype.
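The input-side checks for direct injection can start very small. The sketch below is a minimal illustration under stated assumptions, not a replacement for a trained classifier such as Llama Guard 3: the length limit, the pattern list, and the names `validate_input` and `InputVerdict` are hypothetical and chosen only for the example.

```python
import re
import unicodedata
from dataclasses import dataclass

MAX_INPUT_CHARS = 4000  # assumed budget; tune per application

# Hypothetical examples of role-override phrasing. Heuristics like these are
# easy to evade and are meant to sit in front of a trained classifier.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"you are now (in )?(developer|dan) mode", re.I),
    re.compile(r"reveal (your )?(system|hidden) prompt", re.I),
]


@dataclass
class InputVerdict:
    allowed: bool
    reason: str
    cleaned: str


def validate_input(text: str) -> InputVerdict:
    # Normalize Unicode, then drop control and format characters (for example
    # zero-width spaces) that can hide text from a human reviewer.
    normalized = unicodedata.normalize("NFKC", text)
    cleaned = "".join(
        ch for ch in normalized
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    if len(cleaned) > MAX_INPUT_CHARS:
        return InputVerdict(False, "too_long", cleaned)
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(cleaned):
            return InputVerdict(False, "injection_pattern", cleaned)
    return InputVerdict(True, "ok", cleaned)
```

A verdict with a `reason` field, rather than a bare boolean, makes it possible to log why a request was blocked and to track false positives per rule.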

OWASP LLM07:2025 (System Prompt Leakage) addresses a related but distinct threat: the model disclosing the contents of its own system prompt in a response. Defense is primarily prompt-template hardening — explicit instructions not to reveal the system prompt, tested adversarially — plus output filtering for known system-prompt fragments.
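Output filtering for known system-prompt fragments can be approximated with a word-window comparison. This is a sketch under assumptions: the function name `leaks_system_prompt` and the six-word window are hypothetical, and a paraphrased leak would pass this check, so it complements adversarial testing rather than replacing it.

```python
import re


def _words(text: str) -> list[str]:
    # Lowercase and strip punctuation so formatting changes do not hide a leak.
    return re.findall(r"[a-z0-9']+", text.lower())


def leaks_system_prompt(system_prompt: str, output: str, window: int = 6) -> bool:
    """Return True if any `window`-word run of the system prompt appears in the output."""
    prompt_words = _words(system_prompt)
    if len(prompt_words) < window:
        return False  # prompt too short to fingerprint with this window size
    # Pad with spaces so fragments only match on whole-word boundaries.
    output_text = " " + " ".join(_words(output)) + " "
    for i in range(len(prompt_words) - window + 1):
        fragment = " " + " ".join(prompt_words[i:i + window]) + " "
        if fragment in output_text:
            return True
    return False
```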

NeMo Guardrails handles both layers via its Colang flow syntax. A minimal jailbreak-detection flow requires three lines of Colang 1.0. NVIDIA describes NeMo Guardrails as the only guardrail toolkit that models the full dialog between user and LLM; this dialog-rail architecture lets it track multi-turn injection attempts that single-turn classifiers miss. NeMo Guardrails v0.17.0 (October 2025) is the latest stable release; NVIDIA explicitly states that the project, in its current beta state, is not recommended for production as-is.

Open-source

NeMo Guardrails

v0.17.0 · Apache 2.0 · Colang DSL

Five rail types: input, dialog, retrieval, execution, output. Colang 1.0/2.0 flow syntax. Built-in jailbreak heuristics, self-check moderation, Presidio PII detection, LlamaGuard integration. NVIDIA beta — additional hardening required for production.

github.com/NVIDIA-NeMo/Guardrails

Open-source

Guardrails AI

pip install guardrails-ai · 70 Hub validators

Guard wraps LLM calls, orchestrates validation via validators from the Hub (brand risk, data leakage, factuality, safety, ML). num_reasks controls automatic re-prompting on validation failure. Guard.parse applies validators as post-processing without re-asking.

hub.guardrailsai.com
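The validate-then-re-ask pattern described for Guardrails AI can be illustrated without the library. The sketch below is a generic illustration, not the Guardrails AI API: `validated_call`, the validator signature, and the retry prompt wording are all hypothetical, and only the `num_reasks` name is borrowed from the description above.

```python
from typing import Callable, Optional


def validated_call(
    llm: Callable[[str], str],
    prompt: str,
    validators: list[Callable[[str], Optional[str]]],
    num_reasks: int = 1,
) -> tuple[Optional[str], list[str]]:
    """Call `llm`, run every validator, and re-prompt up to `num_reasks` times.

    Each validator returns None on success or an error message on failure.
    Returns (output, []) on success or (None, errors from the last attempt).
    """
    current_prompt = prompt
    errors: list[str] = []
    for _ in range(num_reasks + 1):
        output = llm(current_prompt)
        errors = [msg for v in validators if (msg := v(output)) is not None]
        if not errors:
            return output, []
        # Feed the failure reasons back so the next attempt can correct them.
        current_prompt = (
            f"{prompt}\n\nYour previous answer was rejected: "
            + "; ".join(errors)
            + ". Please answer again."
        )
    return None, errors
```

Every re-ask is a full extra inference pass, so the retry count is a latency and cost decision as much as a quality one.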

03 · Layer 3: The retrieval rail, the layer most RAG stacks skip.

Retrieval-augmented generation introduces a prompt injection surface that sits entirely outside the user-input path. A document in your vector store — or a webpage your agent fetches — can contain adversarial instructions that the LLM interprets as authoritative because they appear in the “context” slot rather than the “user” slot.

OWASP 2025 categorizes this as indirect prompt injection under LLM01, and separately flags vector and embedding weaknesses at LLM08:2025. Research cited in the OWASP GenAI project confirms that RAG and fine-tuning do not fully mitigate prompt injection when adversarial content can enter through retrieved chunks.

The retrieval rail operates between the vector store retrieval step and the LLM context assembly step. It does three things: (1) scores each retrieved chunk for semantic relevance to the user's actual query, dropping chunks that are semantically distant; (2) scans retrieved text for known injection patterns (role-override phrases, instruction delimiters, out-of-domain commands); and (3) applies a max-chunk-count budget to prevent context flooding.
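The three steps above can be sketched as a single filter function. This is a minimal illustration, assuming the vector store returns a similarity score per chunk; the `Chunk` type, the `retrieval_rail` name, the 0.75 threshold, the chunk budget, and the pattern list are all hypothetical values to be tuned against real traffic.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # similarity to the user query, as reported by the vector store


# Hypothetical injection markers: role-override phrases and instruction delimiters.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"^\s*(system|assistant)\s*:", re.I | re.M),
    re.compile(r"<\|?/?(system|im_start|im_end)\|?>", re.I),
]


def retrieval_rail(
    chunks: list[Chunk], min_score: float = 0.75, max_chunks: int = 5
) -> list[Chunk]:
    kept: list[Chunk] = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            continue  # step 1: drop chunks that are semantically distant
        if any(p.search(chunk.text) for p in INJECTION_PATTERNS):
            continue  # step 2: drop chunks that match known injection patterns
        kept.append(chunk)
        if len(kept) == max_chunks:
            break  # step 3: enforce the chunk-count budget
    return kept
```

The function sits between retrieval and context assembly, so only the returned chunks ever reach the prompt.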

NeMo Guardrails implements retrieval rails as a distinct rail type in its architecture. For teams not running NeMo, the retrieval rail can be implemented as a thin filter function inside the retrieval pipeline, upstream of context assembly — it does not require a full guardrail framework. For more on observability across the full agentic stack including retrieval traces, the agent observability reference covers tracing patterns that make retrieval-rail failures visible in production.

“A Prompt Injection Vulnerability occurs when user prompts alter the LLM's behavior or output in unintended ways. These inputs can affect the model even if they are imperceptible to humans.” (OWASP GenAI Security Project, LLM01:2025 Prompt Injection)

04 · Layer 4: Output filtering and PII redaction.

Output filtering addresses two OWASP 2025 items: LLM02 (Sensitive Information Disclosure) and LLM05 (Improper Output Handling). The first covers PII and confidential data leaking in model responses; the second covers downstream injection — the LLM's output being interpreted as executable code or SQL by a downstream system without sanitization.
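For the LLM05 half of this layer, the defense is the same one used for any untrusted input: escape or parameterize before the output reaches an interpreter. The sketch below uses only the Python standard library; the `orders` table and the function names `render_answer` and `lookup_order` are hypothetical.

```python
import html
import sqlite3


def render_answer(llm_output: str) -> str:
    # Escape before embedding in HTML so model-generated markup is displayed
    # as text instead of being executed by the browser.
    return f"<p>{html.escape(llm_output)}</p>"


def lookup_order(conn: sqlite3.Connection, customer_name: str) -> list[tuple]:
    # Bind model-extracted values as parameters; never interpolate them
    # into the SQL string itself.
    cur = conn.execute(
        "SELECT id, status FROM orders WHERE customer = ?", (customer_name,)
    )
    return cur.fetchall()
```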

PII redaction in LLM outputs is handled by Microsoft Presidio, which is integrated directly into NeMo Guardrails' built-in guardrail library under “sensitive data detection.” Entities are configured in config.yml under sensitive_data_detection; input and output flows are declared separately under rails.input.flows and rails.output.flows. Guardrails AI's Hub includes a “Detect PII” validator powered by the same Presidio library.
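To make the redaction step concrete without depending on a particular framework, here is a regex-only stand-in. It is not Presidio and does not use its API: Presidio combines NER models with validated recognizers, while the three patterns, the placeholder format, and the `redact_pii` name below are simplified assumptions for illustration.

```python
import re

# Deliberately simple patterns; they will miss many real-world formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[ -]\d{3}[ -]\d{4}\b"),
}


def redact_pii(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with <LABEL> placeholders and count hits per entity type."""
    counts: dict[str, int] = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"<{label}>", text)
        if n:
            counts[label] = n
    return text, counts
```

Returning the per-entity counts alongside the redacted text gives the monitoring layer something to alert on without logging the sensitive values themselves.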

Hallucination detection is the other major output-filtering concern. NeMo Guardrails includes AlignScore-based fact-checking and a self-check fact-checking module in its built-in library. Guardrails AI's Hub includes Bespoke MiniCheck (factuality via BespokeLabs AI) as a Hub validator. Both add meaningful latency — fact-checking models typically run a secondary LLM inference pass — so they are best deployed on the output path of high-stakes responses, not on every API call.

NeMo built-in library

Pre-built guardrail categories

Includes jailbreak heuristics, self-check input/output moderation, self-check fact-checking, hallucination detection, AlignScore fact-checking, LlamaGuard moderation, Presidio PII detection, and ActiveFence integration. Available out-of-box in v0.17.0.

Apache 2.0

Guardrails AI Hub

Validators (as of May 2026)

Categories: brand risk, data leakage, factuality, etiquette, quality, safety, and ML validators. Notable entries: Detect PII (Presidio), Bias Check, Competitor Check, Bespoke MiniCheck. The Hub count is a live figure and changes over time.

hub.guardrailsai.com

OpenAI Moderation

Harm categories · omni-moderation-latest

Free to use via the Moderation API. omni-moderation-latest (launched Sep 2024) supports text and images, adds illicit and illicit/violent categories, and improved 42% on a 40-language multilingual benchmark vs the prior model. Rate-limited; not a substitute for runtime PII redaction.

Free · api.openai.com

05 · Layer 5: Tool-call gating, the agentic guardrail frontier.

OWASP LLM06:2025 (Excessive Agency) is the primary threat model for agentic systems. When an LLM is granted the ability to call functions — read/write files, query databases, send API requests, spawn sub-agents — without proper gating, a compromised prompt can trigger unauthorized actions with real-world side effects.

Tool-call gating operates on both sides of function execution. Pre-execution rails validate the function name, parameters, and scope before the call fires — blocking out-of-scope tool use, parameter injection, and privilege escalation attempts. Post-execution rails inspect the tool result before it is injected back into the LLM's context — filtering sensitive data in API responses, capping result size to prevent context flooding, and detecting anomalous result shapes that may indicate the external service has been compromised.
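Both sides of the gate can be sketched in a few lines. This is an illustration under assumptions rather than any framework's implementation: the tool names, the allowlist validators, the sensitive-key list, the size cap, and the `gated_call` wrapper are all hypothetical.

```python
import json
from typing import Callable

# Hypothetical allowlist: tool name -> validator over the argument dict.
ALLOWED_TOOLS: dict[str, Callable[[dict], bool]] = {
    "get_order_status": lambda a: set(a) == {"order_id"} and str(a["order_id"]).isdigit(),
    "search_docs": lambda a: set(a) == {"query"} and len(str(a["query"])) <= 200,
}
SENSITIVE_KEYS = {"api_key", "password", "ssn"}
MAX_RESULT_CHARS = 2000  # assumed cap to prevent context flooding


class ToolCallBlocked(Exception):
    pass


def pre_execution_rail(name: str, args: dict) -> None:
    # Block tools outside the allowlist, then check parameters and scope.
    validator = ALLOWED_TOOLS.get(name)
    if validator is None:
        raise ToolCallBlocked(f"tool not allowed: {name}")
    if not validator(args):
        raise ToolCallBlocked(f"invalid arguments for {name}")


def post_execution_rail(result: dict) -> str:
    # Redact sensitive fields, then cap the size of what re-enters the context.
    # The cap is a blunt character cut and may truncate the JSON mid-value.
    scrubbed = {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v) for k, v in result.items()}
    return json.dumps(scrubbed)[:MAX_RESULT_CHARS]


def gated_call(name: str, args: dict, registry: dict[str, Callable]) -> str:
    pre_execution_rail(name, args)
    return post_execution_rail(registry[name](**args))
```

Raising an exception on a blocked call, instead of returning an empty result, keeps the refusal visible to the orchestrator and to logs.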

NeMo Guardrails implements this via its execution rails, which wrap LangChain chains and custom action functions. The framework exposes a sidecar server mode — nemoguardrails server — that presents a /v1/chat/completions-compatible HTTP API, enabling guardrails to be added to existing agentic stacks without modifying application code.

For teams building function-calling pipelines with OpenAI, Anthropic, or Google, every tool call is a guardrail enforcement point: the schema validation that runs before the LLM's call is dispatched is the minimal viable implementation of an execution rail. Adding pre- and post-execution classifiers is the next layer.

The Sleeper Agents finding

Anthropic's 2024 research on deceptive behaviors demonstrated that backdoor behaviors in LLMs can survive standard safety training: supervised fine-tuning, reinforcement learning, and adversarial training all failed to eliminate planted backdoors. In some cases, adversarial training made models better at hiding unsafe behavior rather than eliminating it. The implication for agentic systems: runtime tool-call gating is not optional even if the underlying model passed safety evals — training-time safety and runtime gating are complementary, not substitutes.

06 · Performance Analysis: Classifier performance vs false-positive cost.

Every guardrail framework publishes F1 scores. None of them publishes the number that actually determines product decisions: how many benign user requests get wrongly blocked per day. The table below provides both, using the FPR figures from Meta's Llama Guard 2 and 3 model cards (vendor-stated; independently reproduced benchmarks have not been published as of this writing).

The “blocked benign per million” column is calculated as FPR × 1,000,000 benign daily messages. It is the number that engineering and product teams can reason about together: a 15.2% FPR is abstract; 152,000 wrongly rejected messages per day is not. Because FPR is measured on benign traffic only, the figure is an upper bound for a real traffic mix that also contains harmful messages. Note that Meta's own documentation flags that “over-moderation can impact user experience when building LLM-applications.”
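The arithmetic behind that column is short enough to keep next to the dashboard that tracks it. In the sketch below, the function name and the `benign_share` parameter are assumptions added for illustration; with the default of 1.0 it reproduces the simplification used in this section, where every message is treated as benign.

```python
def blocked_benign_per_day(
    fpr: float, daily_messages: int, benign_share: float = 1.0
) -> int:
    """Expected number of benign messages wrongly blocked per day.

    fpr           false-positive rate of the classifier (0.04 for 4%)
    benign_share  fraction of traffic that is actually benign (assumed 1.0 here)
    """
    return round(fpr * daily_messages * benign_share)


print(blocked_benign_per_day(0.04, 1_000_000))   # Llama Guard 3 figure
print(blocked_benign_per_day(0.152, 1_000_000))  # GPT-4 zero-shot figure
```

Plugging in a measured benign share from production logs turns the upper bound into an estimate specific to your traffic.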

Safety classifier performance · F1 and estimated blocked benign messages per 1M daily

Source: Meta Llama Guard 2 and 3 model cards (vendor-stated, not independently reproduced)

Llama Guard 3 (non-quantized) · F1 0.939 · FPR 4.0% · 8B params, open-weight · vendor-stated · 40K blocked / 1M
Llama Guard 3 (int8 quantized) · F1 0.936 · FPR 4.0% · ~40% smaller checkpoint · vendor-stated · 40K blocked / 1M
Llama Guard 2 · F1 0.915 · FPR not published · 8B Llama 3-based · vendor-stated · FPR N/A
Llama Guard 1 · F1 0.945 · 7B Llama 2-based · older taxonomy (6 categories) · vendor-stated · FPR N/A
GPT-4 (zero-shot) · F1 0.805 · FPR 15.2% · high cost per call · vendor-stated (Meta benchmark) · 152K blocked / 1M
OpenAI Moderation API · F1 0.347 (Llama Guard 2 benchmark) · free · rate-limited · vendor-stated · FPR N/A

The counter-narrative buried in these numbers deserves explicit framing: the best-performing safety classifier in this comparison is not from a safety API vendor — it is an 8B open-weight model that runs on your own hardware and costs nothing per call beyond inference compute. Llama Guard 3's F1 of 0.939 against GPT-4's 0.805, at roughly 1/100th the per-call cost, represents a material advantage for teams with inference capacity.

The caveat is that Llama Guard 3's benchmark numbers come from Meta's own internal test sets, which are not publicly available for independent reproduction. Treat them as directionally reliable, not as externally validated ground truth. Llama Guard 3 also has known limitations: the Elections (S13) and Defamation (S5) categories require factual knowledge that the model may lack, and Meta's model card explicitly recommends supplementing with RAG for those categories. The model should not be treated as equivalently strong across all 14 harm categories.

For teams evaluating classifier trade-offs as part of a broader AI transformation program, the right framing is: pick the layer (open-weight vs managed API) separately for each position in the stack, based on latency budget, data-sovereignty constraints, and the specific harm categories that matter most for your application.

07 · Build vs Buy: Open-source vs managed; latency is the deciding variable.

The open-source vs managed guardrail trade-off is commonly framed as a cost question. In practice, it is a latency question first and a cost question second. A synchronous guardrail on the hot path of a user-facing chat application adds the full inference time of the classifier model to every response. Industry estimates for classifier models at the 7-8B parameter range suggest 80–300 ms per call depending on hardware — these are order-of-magnitude estimates, not benchmarked figures, and should be validated against your own infrastructure before treating them as constraints.

Managed APIs (OpenAI Moderation, Azure Content Safety, ActiveFence) shift the latency to network round-trip time — typically 50–150 ms — and shift the compute cost to the vendor. For low-volume applications or teams without GPU inference capacity, managed APIs are often the right first layer. For high-volume applications or sovereignty-bound deployments, the per-call cost of managed APIs scales linearly while open-weight classifiers amortize their cost across calls.
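One way to keep a synchronous check from adding its full latency to every response is to run it concurrently with generation and only gate on the verdict. The sketch below illustrates that design option with `asyncio`; `generate` and `moderate` are hypothetical async callables standing in for the LLM call and the classifier or moderation API, and the 300 ms budget and fail-closed behavior are assumptions, not recommendations from any vendor.

```python
import asyncio

REFUSAL = "Sorry, I can't help with that."


async def guarded_reply(generate, moderate, prompt: str, budget_s: float = 0.3) -> str:
    """Run the input check and the LLM call concurrently, then gate on the check."""
    gen_task = asyncio.create_task(generate(prompt))
    try:
        flagged = await asyncio.wait_for(moderate(prompt), timeout=budget_s)
    except asyncio.TimeoutError:
        # Fail closed when the guardrail misses its latency budget. This is an
        # assumption; some products prefer to fail open and review afterwards.
        flagged = True
    if flagged:
        gen_task.cancel()  # stop paying for a response that will not be shown
        return REFUSAL
    return await gen_task
```

With this shape the user-visible latency is roughly the slower of the two calls rather than their sum, at the cost of occasionally starting a generation that gets cancelled.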

Constitutional AI, Anthropic's training-time safety methodology (December 2022), should not be conflated with runtime guardrails. CAI is a two-phase training technique — supervised self-critique and revision followed by RL from AI feedback — that shapes model behavior during fine-tuning. It reduces the probability of harmful outputs from a trained model but does not substitute for runtime input validation, output filtering, or tool-call gating. Training-time alignment and runtime guardrails address different failure modes and both are necessary.

Open-weight classifiers

Llama Guard 3 on your infrastructure

Best for: high-volume applications, data-sovereignty requirements, per-call cost control, teams with GPU inference capacity. F1 0.939 at 8B params. Requires deployment and monitoring overhead. Not recommended for categories requiring factual knowledge (Elections, Defamation).

High-volume / sovereign

Managed moderation API

OpenAI omni-moderation

Best for: fast time-to-production, no infrastructure overhead, multilingual coverage (42% multilingual improvement vs prior model across 40 languages). Free via OpenAI API. Image support up to 20 MB. Not a substitute for PII redaction or tool-call gating.

Fast deployment / multilingual

Full framework

NeMo Guardrails (beta)

Best for: teams that need dialog rails, retrieval rails, and execution rails in a single framework. The only open-source toolkit modeling full dialog-level guardrails. NVIDIA beta disclaimer applies — requires additional production hardening. LangChain-native.

Agentic / full-stack coverage

Validator orchestration

Guardrails AI + Hub

Best for: teams that want composable validators from a catalog (70 available) without writing custom classifiers. Guard.parse enables post-processing validation without re-inference. num_reasks enables automatic retry on validation failure.

Composable validation

08 · OWASP 2025: The 2025 threat taxonomy and what addresses each item.

The OWASP Top 10 for LLM Applications 2025 is the current canonical threat taxonomy, maintained by the OWASP GenAI Security Project (600+ contributing experts, 18+ countries, ~8,000 active community members). It differs materially from v1.1: LLM02 is now Sensitive Information Disclosure (was Insecure Output Handling), LLM03 is Supply Chain (was Training Data Poisoning), and LLM07 is System Prompt Leakage (a new entry). Teams citing v1.1 OWASP numbers are working from an outdated threat model.

The mapping below connects each 2025 item to the guardrail layer that addresses it. Several items require multiple layers — prompt injection (LLM01) requires both input validation and a retrieval rail; excessive agency (LLM06) requires tool-call gating on both sides of execution. No single guardrail framework addresses all ten items out of the box.

LLM01 · Prompt Injection (direct + indirect) · Layers 1, 2, 3
Direct: input validation, jailbreak classifiers (Llama Guard 3, NeMo heuristics). Indirect: retrieval rail on RAG chunks. Research confirms RAG alone does not mitigate indirect injection.

LLM02 / LLM05 · PII Leakage & Output Handling (Sensitive Info Disclosure + Improper Output) · Layer 4
Presidio-based PII redaction on outputs (NeMo built-in, Guardrails AI Hub). Downstream injection prevention requires output sanitization before passing LLM responses to SQL, HTML, or shell contexts.

LLM06 · Excessive Agency (unauthorized tool execution) · Layer 5
Pre-execution rails validate function name and parameters; post-execution rails inspect tool results before context re-injection. NeMo execution rails, schema validation at the function-calling layer. The primary agentic threat model.

LLM07 / LLM09 · Prompt Leakage & Misinformation (System Prompt Leakage + Hallucination) · Layers 2, 4
System prompt hardening (explicit non-disclosure instructions, adversarial testing) plus output filtering for known system-prompt fragments. Hallucination: AlignScore fact-checking (NeMo), Bespoke MiniCheck (Guardrails AI Hub). High latency cost.

Items LLM03 (Supply Chain), LLM04 (Data and Model Poisoning), and LLM08 (Vector and Embedding Weaknesses) operate primarily at the infrastructure and model-training level rather than the runtime guardrail level. Supply chain risks — compromised model weights, malicious fine-tuning datasets, poisoned dependencies — are addressed through model provenance verification, dependency pinning, and supply-chain scanning, not through content classifiers. LLM08 (vector poisoning) is partially addressed by the retrieval rail but ultimately requires vector-store access controls and chunk integrity verification.
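Chunk integrity verification can be as simple as signing each chunk at ingestion and checking the signature at retrieval time. The sketch below uses HMAC-SHA256 from the standard library; the function names and the idea of storing the signature alongside the chunk in the vector store's metadata are assumptions, and key management is out of scope here.

```python
import hashlib
import hmac


def sign_chunk(text: str, key: bytes) -> str:
    # Computed once at ingestion and stored next to the chunk.
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_chunk(text: str, signature: str, key: bytes) -> bool:
    # Constant-time comparison; False means the chunk changed after ingestion.
    return hmac.compare_digest(sign_chunk(text, key), signature)
```

This detects tampering with stored chunks after ingestion. It does nothing about content that was already adversarial when it was ingested, which is why the retrieval rail's pattern scan is still needed.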

For teams building evaluation infrastructure that measures whether guardrails are actually working in production, the AI evaluation metrics reference covers the measurement layer that sits above the guardrail stack — how to measure guardrail coverage, false-positive rates, and adversarial robustness as production metrics rather than one-time benchmark scores.

The production guardrail blueprint

Safety is a stack, not a setting — deploy all six layers or accept known gaps.

The teams that get this right treat guardrails as an architecture decision, not a configuration checkbox. They name each layer explicitly in their system design, assign ownership, and measure false-positive rates in production. The teams that get it wrong add a content filter after the first incident, call the system safe, and discover Layer 3 or Layer 5 the hard way.

The practical starting point for most teams is not perfect — it is sequential. Start with input validation and a basic output filter (Layers 1 and 4) to cover the highest-frequency failure modes. Add the retrieval rail (Layer 3) before scaling RAG. Add tool-call gating (Layer 5) before giving your agent write access to anything. Layer 6 (managed moderation API) can run async as a monitoring layer before it becomes a synchronous gate. NeMo and Guardrails AI are the right frameworks to evaluate — but treat NVIDIA's own production disclaimer as real and plan for additional hardening before shipping.

The broader signal from the classifier comparison is the one worth carrying forward: the strongest safety classifier in this comparison is open-weight, 8B parameters, and free to run beyond inference compute. Llama Guard 3 outperforms GPT-4 on the benchmark Meta published, with roughly a quarter of the false-positive rate (4% versus 15.2%). Whether or not those numbers reproduce exactly on your workload, the direction is clear: the most capable guardrail layer is not the most expensive one.

LLM Guardrails: Six Production Safety Layers

