Prompt injection defense is the area where every team learns the same lesson the same way: one clever payload reaches production, the post-mortem identifies the single control that should have caught it, the control gets added, and three months later a different payload bypasses that exact control while every other layer remains untouched. The pattern repeats because single-layer defense always fails — the only durable answer is defense in depth.
What follows is a twelve-layer framework distilled from production agent audits across the last eighteen months. Each layer is named with what it defends against and, more importantly, the residual risk it does not address — because the value of a layer is largely in what the next layer above it has to catch. The framework is ordered by adoption sequence, paired with a four-tier maturity model so a security team can sequence the rollout against finite engineering hours.
This guide assumes you already understand the basic shape of a prompt-injection attack — an attacker plants instructions in attacker-controllable content (a document, a webhook, a scraped page) that the agent reads and may act on. If MCP server controls are part of the picture, the companion read is the MCP server security audit checklist — this framework operates at a higher altitude and covers controls that sit outside any individual server.
- 01Single-layer defense always fails.Every prompt-injection control is bypassable in isolation. The framework's premise is layering — five to seven independent controls in the path of any privileged action, each catching what the others miss.
- 02Structured outputs reduce attack surface most.The single highest-leverage layer is constraining model output to a typed schema. A model that can only emit a validated object cannot emit free-form exfiltration text — entire classes of attack disappear from the threat model.
- 03Tool gating must be least-privilege by default.Argument allow-lists, parameter validation, and per-tool authorisation at dispatch are non-negotiable. Omnibus tools — one handler accepting free-form input — are the single most common critical finding across audited agents.
- 04Adversarial fixtures catch what general evals miss.General-purpose model evals do not surface injection regressions. Build a curated adversarial corpus, rotate it quarterly, and run it on every model upgrade and prompt change — treat it as a release gate, not an annual exercise.
- 05Replay turns incident response from guesswork to forensic.Deterministic replay with a tamper-evident audit trail converts a vague after-action narrative into a reproducible incident record. Without it, you are reconstructing events from memory; with it, you are running a query.
01 — Why 12 LayersSingle-layer defense always fails.
Every prompt-injection control documented in 2025 and 2026 has been bypassed in isolation. Untrusted-content fencing is bypassed by payloads that imitate the fence syntax. Instruction-hierarchy prompts are bypassed by content that pretends to be a higher- priority instruction. Output filters are bypassed by content that structures the exfiltration to look like legitimate output. Adversarial classifiers are bypassed by adversarial inputs designed against the classifier. None of these failures invalidate the individual control — they invalidate the idea that any single control is sufficient.
The framework's organising principle is that any privileged action — anything that mutates state, calls an external API, sends a message, or reads sensitive data — sits behind five to seven independent controls. Each control is bypassable; the conjunction is not. A payload that defeats fencing, then defeats schema validation, then defeats the tool gate, then defeats the confirmation gate, then defeats the audit alert, is no longer a commodity attack — it is a research project, and the cost has risen above the value of the target.
The twelve layers below are grouped into four columns by function: input controls, model-time controls, tool and action controls, and post-action controls. The columns matter because they map onto the places where an attacker's payload has to survive — if it cannot enter, the input layers caught it; if it enters but cannot shape the model's output, the model-time layers caught it; if it shapes the output but cannot reach a privileged tool, the action layers caught it; if it reaches the tool but is detected, the post-action layers caught it.
Input controls
Sanitization · fencing · provenanceThe layers that operate before any model invocation. Sanitize attacker-controllable strings, fence them as untrusted content with structural markers, propagate provenance metadata so downstream layers know which inputs are trusted vs unverified.
Layers 01 → 03Model-time controls
Structured outputs · instruction hierarchyThe layers that shape what the model can emit. Constrain output to a typed schema, declare instruction priority so attacker-supplied instructions are dispreferred, run content through filters before the output reaches downstream consumers.
Layers 04 → 06Action controls
Tool gating · allow-lists · confirmationThe layers between the model's output and any side effect. Per-tool authorisation, argument allow-listing, parameter validation, human-in-the-loop confirmation for destructive actions, exfiltration-path blocking at the egress layer.
Layers 07 → 10Post-action controls
Audit · replay · forensicsThe layers that catch what the others missed. Tamper-evident audit trail, deterministic replay of any incident, content-security policies on rendered output, alerting on anomalous tool-call sequences. The forensic safety net.
Layers 11 → 12The next six sections walk through the framework: input and output controls together (Section 02), tool gating (Section 03), adversarial evals (Section 04), forensic replay (Section 05), content-security rendering (Section 06), and the four-tier maturity model that sequences adoption (Section 07). Each section names the residual risk the layer does not cover, because that is the framing that makes layering work — every layer is incomplete, and the design admits that up front.
02 — Input + OutputSanitization, structured schemas, output filtering.
Input and output controls bracket the model. Input controls decide what content reaches the model and how it is labeled when it does; output controls decide what the model is allowed to emit and what happens to the emission before it reaches a downstream consumer. The four layers in this group are the highest-leverage controls in the framework — adoption order matters less than adoption depth.
The single most impactful layer in the framework is structured outputs. A model whose output is constrained to a validated typed object cannot emit free-form exfiltration text — the schema either admits a field that would carry exfiltration (in which case the schema is wrong) or it does not (in which case the attack path is structurally closed). Entire classes of attack disappear from the threat model the moment output is shaped by a schema the downstream code actually parses.
Input sanitization
Strip · normalize · length-capAttacker-controllable content is sanitized before model invocation. Strip control characters and zero-width unicode, normalize encodings, length-cap to a defensible ceiling. Residual risk: a normalized payload can still contain crafted natural-language instructions.
Defends against trivial bypassUntrusted fencing
Structural delimiters + provenanceContent the agent did not author is wrapped in structural markers (XML-style tags or structured objects) labeled with provenance. Downstream layers consume the provenance to gate behavior. Residual risk: a sufficiently clever payload imitates the fence — fencing alone never holds.
Defends against direct instruction injectionStructured outputs
Typed schema · Zod · function callingModel output constrained to a validated typed object. Downstream code reads only schema-defined fields; free-form text channels are closed. Residual risk: schema admits a free-form field, or the downstream code falls back to parsing prose on schema failure.
Highest-leverage layer · adopt firstOutput filtering
Egress validation · content rulesOutput is validated against egress rules before it reaches the consumer — secret-pattern matchers, URL allow-lists, PII detectors. Residual risk: filters are signature-based and bypassable by content reshaped to avoid the signature. A safety net, not a primary control.
Defends against late-stage exfiltrationThree operational notes on this group. First, sanitization should happen as close to the input source as possible — the further from the source, the more chances the payload has to be passed through untouched. Sanitize when the document is fetched, not when it reaches the model. Second, untrusted fencing only works if the downstream layers actually consume the provenance metadata — fencing without enforcement is theatre. Third, structured outputs do not eliminate the need for validation; they shift it from parsing free-form prose to validating typed fields, which is tractable but not zero.
The residual risk after these four layers, taken together, is still significant: the model can be persuaded to emit a structured object whose values constitute an attack — the schema admits a field for "next action" and the attacker has shaped the content so the model fills that field with the malicious action. That is the gap the tool-gating layers in Section 03 are specifically designed to close.
"Constraining output to a typed schema closes more attack surface per engineering hour than any other control in the framework — and most teams have not adopted it."— Digital Applied agent security, on the highest-ROI defensive control
03 — Tool GatingLeast privilege, argument allowlists, parameter validation.
Tool gating is the layer where structural defenses meet the real-world systems an agent can touch. The principle is simple even when the implementation is not: every tool invocation passes through a gate that enforces three properties — the caller is authorised to invoke this specific tool, the arguments are inside the allowed shape and value range, and the action does not exceed the privilege budget for this turn.
The single most common critical finding across audited agents is the omnibus tool — one handler that takes a free-form string and decides at run time which underlying operation to perform. The pattern is convenient in source but reads as a privilege- escalation primitive at audit time. Decompose into narrow tools, each with a tight schema, each with a per-tool authorisation check at dispatch. Tool-catalog size is not the problem; tool- schema specificity is the leverage.
Layer 05 — Per-tool authorisation.The caller's identity and scope are verified before the handler body runs. Same identity may invoke read tools but not write tools; same identity may invoke billing-reporter but not billing-mutator. The verification is at dispatch, not after, and it short-circuits on failure without touching the underlying system. Residual risk: the caller's scope has been crafted to include a tool that should not have been in scope — handled by the next layer.
Layer 06 — Argument allow-lists. String arguments are validated against allow-lists where possible. File paths against an allow-listed directory tree. URLs against an allow- listed host set. SQL fragments against a parameterised template, not a free-form query. Enum arguments against the closed enum. Residual risk: an allow-listed value is itself malicious — for example, an allow-listed URL that an attacker controls.
Layer 07 — Parameter validation. Numeric and structural parameters are validated against bounds tighter than the underlying system enforces. Bounded integer ranges, length caps on string fields, nested-object depth limits, total argument size caps. Residual risk: the validation is total but the schema itself is too permissive — caught by the schema review in the maturity model.
Layer 08 — Action-budget enforcement. Each turn has a budget of privileged tool calls; exceeding the budget requires explicit user re-authorisation. State-mutating tools count more than read tools. The budget is small by default and visible in the audit trail. Residual risk: the budget is set too generously, or a single high-value action stays under the budget but is itself catastrophic — handled by the confirmation gate.
execute, run, query, or action that takes a single free- form string argument collapses the entire tool catalog to one tool with maximal blast radius. Per-tool authorisation is impossible. Audit logs are uninterpretable. Split into narrow tools, even at the cost of catalog size — the framework cannot defend an omnibus.One pattern worth naming explicitly: the confirmation gate. State- mutating tools should require explicit user confirmation when invoked after an untrusted-content tool in the same turn. This is the layer that catches the document-injects-instruction → agent-acts-on-instruction chain that is the dominant prompt- injection threat in production agents today. The implementation is partly host cooperation (the IDE or chat surface must surface confirmation differently for destructive tools) and partly server policy (the tool itself refuses to act until confirmation is received). Audit asks: do you have a rule here, and is it enforced.
04 — EvalsAdversarial fixtures and red-team rotations.
Adversarial evaluation is the layer that keeps the other layers honest. General-purpose model evals do not surface injection regressions — they were not designed to. A curated adversarial corpus that is run on every model upgrade, every prompt change, and every tool-schema change is the only way to know that a framework that passed audit in March still passes audit in May.
The corpus has three ingredients. First, a base set of historically successful payloads against agent systems generally — the public injection-attack corpora are a starting point. Second, a set of payloads specific to your tool catalog — the worst plausible misuse, the credential class that could enable it, the detection signal that should surface it. Third, a rotating fresh set produced by a quarterly red-team exercise — an internal team (or an external partner) attempts to craft new payloads against the current framework, and the successful ones land in the corpus.
Historically known payloads
A baseline corpus of payloads that defeated other agent systems publicly. Cheap to assemble, stable across runs, useful as a regression-gate. Residual risk: every payload in the static corpus is, by definition, already in the training distribution of the next model release — coverage decays.
Run on every releaseCrafted against your catalog
Payloads designed for the specific tools your agent exposes — the high-value, state-mutating, externally billable tools. Authored by the team that built the agent, refined by the team that audits it. Higher leverage than the static corpus because the threat surface is yours.
Author per toolQuarterly fresh payloads
A rotating red-team exercise — internal team or external partner — that attempts new payloads against the current framework every quarter. Successful payloads land in the corpus permanently; unsuccessful ones are documented for next round's escalation. The freshness is what keeps the corpus from staling.
Rotate quarterlyBlock-on-fail before deploy
The corpus runs as a release gate, not a recurring report. A model upgrade, a system-prompt change, or a tool-schema change that regresses any fixture blocks the deploy until investigated. The gate is what converts the corpus from documentation into defense.
Wire into CILayer 09 — Adversarial eval corpus. The corpus is the layer; the run cadence is the operational practice. Fixtures are versioned, payloads are tagged by tool target and attack class, results are diffed across runs. Regressions block the release. Residual risk: a sufficiently novel payload is, by definition, not in the corpus — handled by the forensic and CSP layers below.
Two operational notes. First, the corpus is itself sensitive — it contains crafted attack payloads that work, or worked, against real agent systems. Treat it the way you treat the secrets you are defending: access-controlled, audited, never published. The value of the corpus is partly that the payloads are not in any training set or public dataset; publishing it actively reduces its leverage. Second, when a payload from the corpus does land in an incident, treat that as a corpus-coverage failure first, not a framework failure — the framework caught nothing because the fixture should have caught it earlier. Iterate the corpus, then the framework.
Baseline fixtures
A starter corpus of around 120 historically successful payloads against agent systems — direct injection, jailbreak attempts, multi-turn manipulation, ASCII smuggling, encoding tricks. Stable. Cheap. Regression-gate quality.
Recommended floorPer-tool authored
Five to ten payloads per high-value tool, authored by the team that built it. The worst plausible misuse, the credential class that could enable it, the detection signal that should fire. Higher value than the static corpus.
Per-tool budgetQuarterly fresh
A rotating red-team exercise that produces fresh payloads each quarter. Successful payloads land in the corpus permanently. Unsuccessful attempts are documented for next round's escalation. Freshness is what keeps coverage from staling.
Rotate, do not repeat05 — ForensicsDeterministic replay, audit trail tamper-evidence.
Forensic capability is the layer that converts incident response from narrative reconstruction into structured query. Without a tamper-evident audit trail and deterministic replay, every security event becomes a story the team has to assemble from memory and partial logs; with them, the same event becomes a reproducible record the team can investigate, share with auditors, and use to extend the adversarial corpus.
The two layers in this group — Layer 10 (tamper-evident audit trail) and Layer 11 (deterministic replay) — work together. The audit trail captures every input, every model invocation, every tool call, every response, every layer-gate decision. Replay consumes the trail to reproduce the agent's decision path against the same inputs, with the same model version, the same prompts, the same fixtures. The combination is what turns a vague "we think the agent did X because of Y" into "here is the exact trace, here is the layer that should have caught it, here is the fixture that will catch it next time."
The audit trail has three integrity properties that matter. First, append-only — once written, an entry cannot be deleted or modified by the system that produced it. Second, tamper-evident — any modification is detectable by a separate verification process, typically a hash chain or signed sequence numbers. Third, complete — every layer-gate decision is logged, including the "no, this is fine" decisions that did not block anything but matter when reconstructing what the layer was doing at the moment of the incident.
Replay matters because the alternative — best-effort reconstruction — fails at the worst possible time. The incident that matters most is the one the team has not seen before, and the team that has not built replay capability before the incident does not build it during one. Replay is a capital cost; pay it once, amortise across every future incident.
Forensic capability · maturity-tier alignment
Source: Digital Applied production agent audits, Q1-Q2 202606 — CSP + RenderingContent security policies for rendered outputs.
The final structural layer is content-security policy on rendered output. Agents increasingly emit content that is rendered in a client surface — markdown that becomes HTML, image references that become network requests, link targets that become user clicks. Each of those rendering paths is a re-introduction of attack surface that the earlier layers cannot reach because they operate against text, not against a rendered DOM.
Layer 12 — Content-security rendering. Rendered output is gated by a CSP equivalent: image sources are allow- listed (or stripped entirely from agent-generated markdown), link targets are allow-listed and rel-attribute-hardened, embedded HTML is sanitized through a known-good parser, and out-of-band channels (markdown image embeds, iframe targets, form actions) are explicitly enumerated and either blocked or allow-listed. Residual risk: a rendering path nobody enumerated — handled by audit-trail review and the maturity-model gap analysis.
Markdown image sources
Allow-list · strip · proxyAgent-generated markdown can embed images via attacker-controllable URLs. A rendered image fires a network request to the attacker's server, leaking the rendering context (cookies, referrer, document path). Allow-list the image-source hosts, strip the embeds, or proxy through a trusted middle.
Exfiltration vectorAnchor rel hardening
Allow-list · noreferrer · noopenerAgent-generated links default to a hardened rel attribute (noreferrer noopener) and target attribute. Off-allow-list targets are rewritten through an interstitial that requires explicit user confirmation before navigation. Stops the click-through phishing path.
Navigation vectorSanitized parser
Known-good HTML sanitizerIf the surface renders raw HTML at all, run it through a strict sanitizer with a tight allow-list of tags and attributes. Strip script, iframe, object, embed, and any attribute that admits a JavaScript URL or expression. No exceptions for trusted content — provenance is established earlier, not here.
XSS-equivalent vectorTwo notes on the rendering layer. First, CSP is the right framing even outside browsers — a CLI agent that emits markdown that gets piped into a renderer has the same attack surface as a web agent that emits HTML that gets injected into a DOM. The mental model is unchanged; the implementation differs. Second, the rendering layer is where the framework most often meets existing application-security practice — the controls are well- documented in OWASP and existing CSP guidance. Use the existing literature; do not re-invent.
"The rendering layer is the one where agentic security overlaps most cleanly with existing application security. Borrow the playbook, do not re-derive it."— Digital Applied agentic security, on the rendering layer
07 — Maturity ModelLayer-by-layer adoption tier.
The four-tier maturity model exists because no team adopts twelve layers simultaneously. The model orders adoption by leverage and by dependency — each tier is a coherent stopping point with a defensible posture, and each tier's controls unlock the controls in the next tier. Pick the tier that matches your current operational reality, ship to it cleanly, then plan the next tier on a calendar.
Tier 1 is the baseline every production agent must reach. Tier 2 is where most teams should target — defensible against commodity attacks, audit-ready against SOC2 questions. Tier 3 is where security-sensitive deployments live — finance, healthcare, agents with mutating access to customer data. Tier 4 is the frontier — agents operating on adversarial inputs at scale, where the residual risk after twelve layers still matters.
Maturity-tier coverage · layer-by-layer adoption
Source: Digital Applied 12-layer framework, 2026Two pragmatic notes on the maturity model. First, do not skip tiers. Adopting Layer 11 (deterministic replay) before Layer 03 (structured outputs) is a misallocation of engineering effort — replay is more valuable when the inputs to replay are themselves constrained and parseable. The tier ordering exists because controls depend on each other; respect the dependency. Second, the tier is a snapshot, not a destination — agents are iteratively rebuilt, and a tier-2 posture in Q1 is a tier-1 posture in Q3 if the framework does not iterate alongside the agent. Re-audit per quarter, re-baseline per model upgrade.
For teams considering whether to run this internally or bring in a partner: the framework above is enough to run a credible internal adoption. The reason teams engage us is rarely capability; it is calibration — a reviewer who has seen the same layer fail across a hundred different agent codebases names the failure mode faster and writes the remediation language in the shape that maps onto SOC2 control evidence. If that calibration matters, our AI transformation engagements include twelve-layer audits as a discrete deliverable; if it does not, the layers above are yours to run.
One last cross-reference: this framework operates at a higher altitude than any single MCP server. If MCP is in your stack, the MCP server security audit checklist sits inside Layers 05-08 of this framework and is the right read for the server-internal controls. The audit-trail and replay layers (10 and 11) are detailed in the agent audit-trail design guide — both reads are complementary, both fit inside this twelve-layer structure.
Prompt injection is a defense-in-depth problem — never single-layer.
Prompt-injection defense is the area where the temptation to settle for a single strong control is most pronounced and the cost of doing so is most predictable. Every individual control in this framework is bypassable in isolation; the conjunction is what holds. A payload that defeats fencing, schema validation, tool gating, confirmation, and the audit-trail alert is no longer a commodity attack — it is a research project, and the cost has risen above the value of the target.
The twelve layers above are not a checklist; they are a frame for sequencing the work. The four-tier maturity model is the calendar — pick the tier that matches your current operational reality, ship to it cleanly, plan the next tier. Re-baseline per quarter. Re-audit per model upgrade. Treat the adversarial corpus as a release gate, not an annual exercise.
The single most consequential mental shift is the one in Section 01: stop looking for the one control that holds, start designing the conjunction of five to seven controls that hold together. Once that framing is in place, the twelve layers follow naturally, the maturity tiers sequence the adoption, and the framework becomes a thing your team can actually ship — not an aspiration on a security roadmap.