AI output PII redaction is the practice of detecting and masking personally identifiable information across every surface an AI system produces — prompts, completions, logs, traces, audit trails, and downstream sinks — before that data lands anywhere it should not. The work is unglamorous, and the wrong place to learn it is the day a regulator, customer, or auditor asks where their data went. This guide is the implementation playbook we run for production agentic systems in 2026.
The framing matters because most teams treat redaction as a feature. They add a regex sweep on the way out, point at it in the architecture diagram, and assume the problem is solved. Six months later a routine audit finds email addresses in trace spans, customer names in feedback dumps, account numbers in evaluation datasets, and full conversation histories archived in storage that nobody mapped to a retention policy. The redaction existed; the pipeline did not.
What follows is the working architecture — why redaction matters on every AI surface, the three detection techniques and where each is appropriate, the structured-logging discipline that redacts at the field level rather than post-hoc, the real-time versus batch trade-off, how to measure false positives so over-redaction does not destroy utility, the four-tier policy that matches aggressiveness to data classification, and a reference implementation that ties it together. Read it as a punch list for production AI compliance work.
- 01 — Every AI surface leaks PII without discipline. Prompts, completions, logs, traces, evals, and audit trails all leak by default. Treat redaction as a pipeline that spans every surface — not a single regex on the response edge.
- 02 — Structured logging redacts at the field level. Field-level redaction at the structured-logging layer beats post-hoc sweeps. Mask the slot before serialisation; bolting on backend redaction is incomplete and never fully reversible.
- 03 — LLM-based redaction has highest coverage, highest latency. Regex is fast and brittle. NER is contextual and medium-cost. LLM-based redaction has the broadest coverage on free-form text but adds hundreds of milliseconds and dollars per million tokens — pick by archetype.
- 04 — Four-tier policy matches data classification. Public, internal, sensitive, regulated — each tier dictates the redaction approach, retention rules, and audit posture. One uniform policy either over-redacts cheap data or under-redacts the costly kind.
- 05 — False positives must be measured. Over-redaction silently destroys agent utility — a customer name redacted out of a thank-you note, a product SKU stripped from a support reply. Sample weekly, score precision and recall, tune.
01 — Why Redaction
Every AI surface is a PII leak waiting to happen.
Production AI systems are unusually leaky compared to traditional web applications. The reason is structural. A conventional CRUD app receives a typed request, writes a known row, and returns a typed response — the data shape is predictable, the storage sites are enumerated, and a single redaction policy at the API boundary covers most of the surface. An AI system, by contrast, receives free-form text, hands it to a model that produces more free-form text, persists both sides for evaluation, replays them during postmortems, and ships portions of them to a half-dozen third-party services for observability, evaluation, and quality assurance. The surfaces multiply; the shape is unknown.
The result is a fundamentally different leak profile. Customer data flows into the prompt by way of conversation history, retrieval results, or tool outputs. The completion echoes back verbatim slices of that context. The observability SDK captures both sides as span bodies and ships them to a SaaS backend with its own retention. The evaluation pipeline samples a subset nightly and stores the worst-performing rows in a regression dataset. The audit log preserves an immutable record for compliance. None of those sinks were designed as a regulated data store; all of them become one the moment unredacted PII arrives.
The diagnostic question is the one auditors actually ask. If a customer files a deletion request under GDPR, CCPA, or POPIA, can you enumerate every storage location their data currently lives in, retrieve the specific records, and prove they have been deleted? In agent deployments without a redaction pipeline the honest answer is no. The data is in the prompt history, in the trace spans, in the eval rows, in the audit log, in the cold storage tier, and in whatever third-party SaaS the team integrated for observability — and most of those backends do not support selective deletion at the field level. The composition of those defaults is a compliance event waiting to happen.
The cost asymmetry is what makes this worth fixing properly. Building the redaction pipeline correctly the first time takes three to six weeks of focused engineering. Retrofitting it after a regulatory incident takes six to nine months of cross-team remediation, an external audit, breach notification, and the sort of customer-trust damage that does not appear on a finance spreadsheet but materially changes the next twelve months of renewals. The teams that ship redaction as a pipeline before the first incident treat compliance as engineering; the teams that ship it as a feature treat it as theatre.
02 — Detection
Regex, NER, LLM-based — pick by archetype.
Three detection techniques cover almost every production redaction need. They are not mutually exclusive; the right architecture composes all three with each one applied to the archetype it handles best. The wrong architecture picks one and hopes it scales across every data shape, which is how teams end up with either brittle regex sweeps that miss half the PII or LLM-based redaction that costs more than the underlying model call.
The first axis is data shape. Structured fields with known patterns (email, phone, credit card, IBAN, government IDs) are regex territory — fast, deterministic, cheap, and accurate when the pattern is well-defined. Contextual entities (names, addresses, organisations, locations) are named-entity-recognition territory — a small fine-tuned model classifies tokens by entity type with reasonable precision and recall, at single-digit milliseconds per request. Free-form text with implicit PII (descriptions of medical conditions, account references in natural language, indirect identifiers) is LLM-based redaction territory — only a general-purpose model has the world knowledge to flag "the customer with the cardiology appointment last Tuesday" as an indirect identifier worth masking.
The second axis is cost. Regex runs at sub-millisecond per request and effectively zero marginal cost. NER runs at 5 to 50 milliseconds per request on commodity hardware with a modest fixed cost for the model server. LLM-based redaction runs at 200 to 800 milliseconds per request and adds a per-token cost that scales with the underlying model. A reasonable production architecture runs regex on every request, NER on every request for surfaces that touch free-form text, and LLM-based redaction only on flagged samples or high-risk pathways where the additional coverage is worth the latency.
Regex · structured patterns
Sub-millisecond latency, zero marginal cost, deterministic. Best for emails, phone numbers, credit cards, IBANs, government IDs, internal account references with a known shape. Brittle on edge cases — international phone formats, embedded credit cards in narrative text — and misses anything contextual. Run on every surface, every request.
Always on
NER · contextual entities
5 to 50 ms per request on commodity hardware. Best for names, addresses, locations, organisations, dates that act as identifiers, and other contextual entities a regex cannot capture. Requires a fine-tuned model (Presidio, spaCy custom, or a small transformer). Precision and recall depend on training data — sample weekly to verify.
Default for free-form
LLM-based · world knowledge
200 to 800 ms per request, dollar-per-million-token cost. Best for indirect identifiers, free-form text where context determines whether something is PII, and high-risk pathways where coverage outweighs latency. The model reads the passage and returns a structured list of spans to redact. Highest coverage; highest cost. Reserve for sampling and high-risk surfaces.
High-risk pathways
The architectural rule is to compose, not to choose. Regex catches the structured patterns at zero cost. NER catches the contextual entities at modest cost. LLM-based redaction catches the indirect identifiers regex and NER both miss, but only on the surfaces and samples where the latency and dollar cost are justified. The teams that pick one technique end up either under-redacting (regex only) or over-spending (LLM only); the teams that compose all three end up with coverage close to ninety-nine percent at a cost that finance signs off on without argument.
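The compose-don't-choose rule can be sketched as a layered detector: regex always runs, while the NER and LLM hooks are optional callables wired in per surface. The patterns and names below are illustrative assumptions, not production-grade recognisers:

```python
import re

# Layer 1: regex recognisers for structured patterns (illustrative, not exhaustive).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def regex_spans(text):
    """Return (start, end, label) spans for every structured-pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return spans

def compose(text, ner_detector=None, llm_detector=None):
    """Compose the three layers: regex on every request, NER on free-form
    surfaces, LLM-based detection only where flagged. The two optional
    detectors are hypothetical callables returning (start, end, label) spans."""
    spans = regex_spans(text)
    if ner_detector:
        spans += ner_detector(text)
    if llm_detector:
        spans += llm_detector(text)
    # Mask right-to-left so earlier offsets stay valid after each substitution.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"[{label.upper()}]" + text[end:]
    return text
```

Wiring the detectors per surface keeps the cost profile described above: the regex layer is free everywhere, and the expensive layers attach only where the archetype justifies them.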
"Regex is fast and brittle. NER is contextual and medium-cost. LLM-based redaction has the broadest coverage on free-form text but adds hundreds of milliseconds — pick by archetype, compose by surface."— Production redaction engagements · 2026
03 — Structured Logging
Redact at the field level, not post-hoc.
The single highest-leverage decision in a redaction pipeline is where the masking actually happens. Two patterns dominate practice. The first is post-hoc redaction — a scrubber that runs against persisted log records, trace spans, or audit-trail rows after the data has already been written. The second is structured-logging redaction — a policy that runs inside the logging or tracing SDK on the way out, before the payload is ever serialised to the wire. The first pattern is the common one; the second is the correct one.
Post-hoc redaction has three fatal weaknesses. First, it is incomplete by construction — the raw payload exists in the backend for some period before the scrubber catches up, and any consumer reading during that window sees the unmasked data. Second, it scales badly — backends do not support efficient field-level rewrites at the document level, so the scrubber ends up rewriting whole records, which is slow and expensive. Third, it does not survive replication — once a log record has been forwarded to a SaaS observability backend, a downstream data warehouse, and a SIEM, post-hoc deletion in one location does not propagate to the others.
Structured-logging redaction inverts the model. The redaction policy lives in code (versioned, reviewed, testable), runs in the SDK or proxy emitting the structured log or trace span, and transforms the payload before serialisation. The backend never sees the raw value. The replication problem disappears because every downstream sink receives the already-masked record. The scale problem disappears because the masking runs once per payload at emit time, not once per record at every storage site. The completeness problem disappears because there is no window where the raw payload exists in a queryable form.
Structured-logging layer
The single correct place to redact is the SDK or proxy emitting the structured log or trace span — before the backend ever sees the raw value. Bolted-on post-ingest redaction is incomplete by construction and never propagates cleanly across downstream sinks.
Architecture rule
Redacted-fields list as attribute
Every redaction emits a structured list of the field names or pattern names that were masked, attached to the same record. Reviewers reading the log or trace know exactly what was redacted and why — preventing the "was this field empty or redacted?" ambiguity that destroys postmortem fidelity.
Auditability
Versioned policy in code
Redaction policies live in source control alongside the application code, not in a separate vendor console. Version, review, and test like any other code path. The diff is auditable; the rollback path is git revert.
Compliance posture
The implementation pattern that works in practice is a two-layer logging interface. The application code calls a high-level logger with structured field-value pairs (for example, logger.info("turn_complete", { user_email, prompt, retrieval_ids })). The logger consults the redaction policy — typically a per-field specification keyed by field name — to determine which fields require which masking strategy. Email fields get the regex mask; prompt fields run through NER plus selective LLM-based redaction depending on the surface; retrieval_ids pass through unmasked because they are stable identifiers, not PII. The serialised record carries the masked values plus a redacted_fields attribute listing what was touched.
Two practical refinements matter. First, the policy must be allow-listed rather than block-listed: any unrecognised field name routes to the most conservative default (mask entirely until classified). Without that posture, every new field added to the codebase becomes a potential leak until someone remembers to update the policy. Second, the redacted_fields attribute should encode the policy version that performed the redaction. When policies evolve — a new pattern is added, an old one is retired — the audit trail tells reviewers which policy version was in force at the time each record was written.
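A minimal sketch of that two-layer interface, folding in both refinements — the allow-listed default and the policy version stamped into the record. The field names, version tag, and mask callable are illustrative assumptions standing in for the per-field strategies described above:

```python
import json

POLICY_VERSION = "2026-02-r3"  # hypothetical policy version tag

# Allow-list keyed by field name. Anything not listed falls through to
# the most conservative default: mask entirely until classified.
FIELD_POLICY = {
    "user_email": "mask",      # regex mask in production
    "prompt": "mask",          # NER + selective LLM in production
    "retrieval_ids": "pass",   # stable identifiers, not PII
}

def emit(event, fields, mask=lambda v: "[REDACTED]"):
    """Apply the per-field policy before serialisation and record what was
    touched, so the backend never sees a raw value for a masked field."""
    out, redacted = {}, []
    for name, value in fields.items():
        strategy = FIELD_POLICY.get(name, "mask")  # conservative default
        if strategy == "pass":
            out[name] = value
        else:
            out[name] = mask(value)
            redacted.append(name)
    # Reviewers can distinguish "empty" from "redacted", and which policy ran.
    out["redacted_fields"] = sorted(redacted)
    out["redaction_policy_version"] = POLICY_VERSION
    return json.dumps({"event": event, **out})
```

Because the policy is a plain dict in code, the diff is auditable and the rollback path is git revert, exactly as the versioned-policy rule above requires.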
04 — Real-time vs Batch
Latency vs coverage trade-off.
Once the redaction site is fixed at the structured-logging layer, the next architectural decision is how aggressively to run the expensive detection techniques. Two patterns dominate production deployments. The first is real-time redaction — every record passes through the full detection stack synchronously before serialisation. The second is batch redaction — fast techniques (regex) run inline, and slow techniques (LLM-based) run asynchronously against a sample, with the results either fed back into the record or used to tune the fast layer.
Real-time redaction is the correct default for any surface where the payload may be read by a human or third-party SaaS before the batch layer would have caught up. Trace spans ship to an observability backend within seconds of emission; eval datasets feed into nightly LLM-judge runs; audit logs are sometimes queried minutes after a customer-facing event. For all of those surfaces, anything less than synchronous coverage on every record creates a window of exposure. The cost is latency — adding 50 to 800 milliseconds of detection on every log emit — which matters for user-facing surfaces and barely matters for asynchronous ones.
Batch redaction is the right pattern for surfaces where the payload remains in a private staging area long enough for the asynchronous pass to complete before any consumer touches it. The architecture is two-stage: regex runs inline to mask the obvious patterns immediately, then a worker dequeues the record, runs NER and LLM-based detection, and writes the refined mask back into the record before promoting it to the production index. The latency window is configurable; the cost-per-record is dramatically lower because the expensive techniques run only once per record (and only on a sample if cost demands it) rather than synchronously on every emit.
Real-time synchronous
regex + NER + LLM · inline
Every record runs through the full detection stack synchronously before serialisation. Latency cost: 50 to 800 ms per emit. Coverage: highest possible. Correct default for surfaces where the payload may be read or shipped before a batch layer would catch up — trace spans, audit logs, anything touching third-party SaaS.
Default for live surfaces
Batch asynchronous
regex inline · NER + LLM async
Regex runs inline for immediate baseline coverage; a worker dequeues the record and runs NER plus LLM-based detection before promoting it to the production index. Latency window: configurable, typically 30 seconds to 5 minutes. Cost: 5 to 20× lower than synchronous. Correct for staged sinks where the payload remains private until the batch layer finishes.
Cost-optimised
The composition pattern is what production systems actually ship. Real-time runs on every surface where exposure could happen before a batch layer would catch up — trace spans, audit logs, anything shipped to a third-party SaaS. Batch runs on internal sinks where the payload is gated by a staging queue — eval datasets, regression archives, long-term cold storage. The two layers share the same policy code so coverage stays consistent; the difference is purely operational. Teams that treat real-time and batch as alternative architectures end up choosing wrong; teams that treat them as complementary layers cover both axes without overspending.
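The two-stage batch pattern can be sketched with a staging queue in a few lines; the `slow_detect` callable stands in for the asynchronous NER-plus-LLM pass, and all names here are illustrative assumptions:

```python
import queue
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

staging = queue.Queue()   # records wait here until the slow pass finishes
production_index = []     # consumers only ever read from here

def ingest(record):
    """Stage 1 (inline): regex masks the obvious patterns immediately,
    so even a staged record never holds a raw structured identifier."""
    record["text"] = EMAIL.sub("[EMAIL]", record["text"])
    staging.put(record)

def batch_worker(slow_detect):
    """Stage 2 (async): run the expensive detection, then promote the
    record to the production index. slow_detect is a hypothetical
    callable standing in for NER + LLM-based detection."""
    while not staging.empty():
        record = staging.get()
        record["text"] = slow_detect(record["text"])
        production_index.append(record)
```

In a real deployment the worker runs on its own schedule (the configurable latency window above) and the queue is a durable broker rather than an in-process one; the invariant that matters is that nothing reads from the production index before stage 2 completes.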
Detection stack composition · coverage vs latency
Coverage estimated from internal redaction engagements · production AI deployments 2026
05 — False Positives
When over-redaction destroys utility.
Coverage is half the metric; precision is the other half. Redaction pipelines that optimise only for recall — "catch everything that might be PII" — produce false positives at a rate that quietly destroys agent utility. The customer name redacted out of a thank-you note. The product SKU stripped from a support reply because it happened to pass a credit-card check-digit test. The address fragment masked inside an internal shipping confirmation. Each individual case is recoverable; the aggregate is an agent that returns sanitised, robotic responses that no longer feel like a competent human assistant.
False positives are also harder to detect than false negatives. Missed PII shows up in audits, in customer complaints, and in the occasional dramatic incident. Over-redaction shows up in slightly worse satisfaction scores, slightly higher escalation rates, and slightly more "the bot is useless" feedback — all of which are easy to dismiss as the normal noise of agent operations. The teams that measure precision properly catch it; the teams that measure only coverage never see it.
The diagnostic workflow is straightforward and worth running weekly. Sample 100 redaction decisions from production traffic — both positive (redacted) and negative (passed through) — and have a reviewer score each one for precision and recall. Precision is the fraction of redactions that removed genuine PII; recall is the fraction of genuine PII that was redacted. Aggregate weekly, plot the trend, and tune the policy whenever precision drops below 95 percent or recall below the team's coverage target. The cost is roughly an hour per week of senior engineering attention; the alternative is a redaction pipeline silently becoming a utility tax on the agent.
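The weekly scoring step reduces to a few lines. One way to compute the two numbers, assuming the reviewer records each sampled decision as a (was_redacted, is_genuine_pii) pair:

```python
def score(samples):
    """samples: list of (was_redacted, is_genuine_pii) pairs from the
    weekly reviewer pass over ~100 sampled redaction decisions."""
    tp = sum(1 for redacted, pii in samples if redacted and pii)        # true positives
    fp = sum(1 for redacted, pii in samples if redacted and not pii)    # over-redaction
    fn = sum(1 for redacted, pii in samples if not redacted and pii)    # missed PII
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Plot both numbers weekly and trigger a policy-tuning pass whenever precision drops below 0.95 or recall below the team's coverage target.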
Precision > 95% · recall > 99%
Both axes measured weekly with a sampled review. The redaction policy catches almost all genuine PII and rarely masks legitimate content. Agent utility scores are stable; compliance posture is defensible. The target state for any production redaction pipeline.
Target state
Precision low · recall high
The policy is tuned for maximum catch rate and masks too aggressively. Customer names disappear from thank-you notes; product SKUs collide with regex patterns. Agent satisfaction quietly degrades. Common in teams who treat "catch everything" as the only goal — and never measure precision.
Utility tax
Precision high · recall low
The policy passes most content through unmasked because the detection stack misses contextual or indirect PII. Easy to spot during a compliance audit, harder to spot week-to-week. Common when only regex is in place and free-form text is the dominant surface.
Compliance debt
Neither axis tracked
The pipeline ships with confidence and no quantitative signal of whether it works. Engineering meetings reference "the redaction layer" as if it were a fixed property of the system. The first time anyone asks for numbers, the team scrambles. Most common state in agent teams who installed redaction but never built the measurement loop.
Common gap
Two patterns help keep precision high without sacrificing recall. First, scope detection by field type. An email-address regex applied inside a field tagged as user_email has near-perfect precision; the same regex applied to free-form prompt text picks up email-looking patterns inside URLs, template literals, and code samples that should not be masked. The field-level structured-logging architecture from Section 03 makes this scoping cheap. Second, prefer reversible masking where the surface allows. Replacing an email with [EMAIL:hash] preserves the information that an email existed and lets downstream consumers join by hash if needed, while still masking the underlying value. Reversible masks make over-redaction less painful because the original value is recoverable under the right access controls; that recovery path turns false positives from incidents into tickets.
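A reversible mask of the [EMAIL:hash] shape can be sketched with an HMAC over the raw value; the secret handling and the vault mapping below are illustrative assumptions — in production the key lives in a KMS and the vault is a separately access-controlled store:

```python
import hashlib
import hmac
import re

SECRET = b"rotate-me"  # hypothetical key; keep in a KMS, never in code
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def reversible_mask(text, vault):
    """Replace each email with [EMAIL:<hash>]. The hash is stable for a
    given value, so downstream consumers can join by hash; the vault
    keeps hash -> raw value for recovery under separate access controls."""
    def sub(match):
        raw = match.group()
        digest = hmac.new(SECRET, raw.encode(), hashlib.sha256).hexdigest()[:8]
        vault[digest] = raw  # the recovery path that turns incidents into tickets
        return f"[EMAIL:{digest}]"
    return EMAIL.sub(sub, text)
```

Using a keyed HMAC rather than a bare hash matters: without the key, an attacker who knows a candidate email could confirm its presence by hashing it themselves.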
"Coverage is half the metric; precision is the other half. Redaction pipelines that optimise only for recall silently turn the agent into a utility tax."— Production redaction engagements · 2026
06 — Four-Tier Policy
Match redaction aggressiveness to data classification.
One uniform redaction policy applied across every surface and every data shape is the wrong design. It either over-redacts the cheap surfaces (public marketing content, anonymised eval datasets) — slowing them down and adding cost — or under-redacts the costly ones (regulated health, payment, or government-ID data) — leaving compliance exposure on the table. The right design is a tiered policy that matches redaction aggressiveness to data classification. Four tiers cover the spectrum that production AI systems actually encounter.
Tier one is public data — content that may be shared openly with no compliance constraint. Marketing copy, documentation, sample prompts, anonymised eval datasets. Redaction at this tier is minimal: regex for the obvious patterns (in case a real email accidentally landed there), no NER, no LLM-based detection. Tier two is internal data — content scoped to employees and operational use. Internal logs, debug traces, non-customer-facing reports. Redaction at this tier runs regex plus NER, masks the standard categories (email, phone, address, name), and preserves stable identifiers.
Tier three is sensitive data — content tied to identified customers or their accounts. Conversation histories, support interactions, agent traces, customer-facing audit trails. Redaction at this tier runs the full detection stack in real-time, masks all categories aggressively, uses reversible masking where utility demands it, and feeds into a 30-day retention default. Tier four is regulated data — content governed by sector-specific compliance (HIPAA, PCI-DSS, GDPR-Article-9, POPIA-special-personal-information). Redaction at this tier runs the full detection stack with conservative thresholds, defaults to irreversible masking, segregates storage into a compliance-bound backend with its own access controls, and reduces retention to the regulatory minimum.
Redaction aggressiveness by data classification tier
Tiering calibrated against GDPR, CCPA, POPIA, HIPAA, and PCI-DSS frameworks · production engagements 2026
The tier assignment is the architectural decision that matters most. Every surface and every field must be classified to a tier at design time, and the classification must be allow-listed — any unclassified surface or field defaults to the most conservative tier until someone explicitly downgrades it. The policy code consults the classification when the redaction runs; the audit log records both the tier in force and the policy version, so reviewers can answer the question "why was this field treated this way?" with reference to the specific decision rather than to the policy as a whole.
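One way to encode the allow-listed classification table with its conservative default; the tier semantics follow the four tiers above, while the field names are illustrative assumptions:

```python
from enum import IntEnum

class Tier(IntEnum):
    PUBLIC = 1      # regex only
    INTERNAL = 2    # regex + NER, standard categories masked
    SENSITIVE = 3   # full stack real-time, reversible masks, 30-day retention
    REGULATED = 4   # full stack, irreversible masks, segregated storage

# Allow-listed classification table, owned by data governance,
# stored alongside the application code (field names are illustrative).
CLASSIFICATION = {
    "marketing_copy": Tier.PUBLIC,
    "debug_trace": Tier.INTERNAL,
    "conversation_history": Tier.SENSITIVE,
    "payment_details": Tier.REGULATED,
}

def tier_for(field):
    """Unclassified fields default to the most conservative tier until
    someone explicitly downgrades them — the allow-list posture."""
    return CLASSIFICATION.get(field, Tier.REGULATED)
```

Because the tiers are ordered, policy code can express rules like "run NER on tier two and above" as a plain comparison (`tier_for(field) >= Tier.INTERNAL`).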
07 — Reference
Code, vendor matrix, anti-patterns.
A reference implementation ties the prior sections together into a deployable shape. The architecture has four moving parts. A classification table — keyed by field name, owned by the data-governance function, stored alongside the application code — that assigns each field to one of the four tiers. A policy module that maps tier plus field-type to a redaction strategy (regex, NER, LLM, reversible mask, irreversible mask). A logger interface that consults the policy on every structured emit and rewrites the payload before serialisation. And a measurement loop that samples production redactions weekly, scores precision and recall, and feeds tuning signals back into the policy module.
The vendor landscape in 2026 has three serious players for the detection layer and a longer tail for surrounding tooling. The choice depends on stack, sovereignty constraints, and the volume of free-form text the pipeline processes. Microsoft Presidio is the open-source default — Python-native, embeds cleanly into a logger interceptor, ships with regex recognisers plus an NER layer based on spaCy or a custom transformer, and integrates with anonymisation operators for reversible and irreversible masking. AWS Comprehend is the cloud-hosted equivalent for teams already on AWS — managed NER endpoints, custom-entity support, and per-request pricing that scales with traffic. AWS Macie targets the bucket-scan archetype rather than the inline-redaction one — useful for auditing existing storage, less useful for the real-time structured-logging path.
Integration with audit trails is the second-order architecture decision worth getting right early. Redaction events themselves are audit-worthy: every masked field generates a record including the field name, the policy version, the tier, the detection technique that fired, and (for reversible masks) the recovery key reference. Those records belong in a separate audit-log backend with its own access controls — distinct from the application log and the trace backend — so that querying "what PII has this system seen for customer X?" returns a complete answer without exposing the raw values.
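A sketch of the per-mask audit record described above; the schema and field names are illustrative assumptions, not a standard format:

```python
import dataclasses
import json
import time
from typing import Optional

@dataclasses.dataclass
class RedactionAuditRecord:
    """One record per masked field, written to a separate audit-log
    backend with its own access controls — never to the application log."""
    field_name: str
    tier: int                     # data-classification tier in force
    policy_version: str           # which policy version performed the redaction
    technique: str                # "regex" | "ner" | "llm"
    recovery_key: Optional[str]   # set only for reversible masks
    timestamp: float = dataclasses.field(default_factory=time.time)

    def to_json(self):
        return json.dumps(dataclasses.asdict(self))
```

A store of these records answers "what PII has this system seen for customer X?" by joining on the recovery-key hashes, without ever exposing the raw values themselves.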
Microsoft Presidio
Open source · Python · self-hosted
Open-source default. Ships with regex recognisers plus NER on spaCy or a custom transformer. Embeds cleanly into a logger interceptor. Anonymisation operators for reversible and irreversible masks. The starting point for most teams without a strong reason to pick something else.
Self-hosted
AWS Comprehend
Managed · per-request pricing
Cloud-hosted NER and custom-entity detection on AWS. Per-request pricing scales with traffic. Correct choice for teams already on AWS who want a managed endpoint rather than a self-hosted Presidio install. Coverage is solid; integration is straightforward via the SDK.
Managed
AWS Macie
Bucket scan · audit posture
Targets the bucket-scan archetype — auditing existing S3 storage for PII rather than redacting inline. Useful for the "what already leaked?" question; not useful for the real-time structured-logging path. Pair with Presidio or Comprehend on the inline side rather than picking it alone.
Audit only
Four anti-patterns recur across redaction engagements and deserve naming so teams can avoid them by default. First, post-hoc redaction — already covered in Section 03; the wrong architecture, full stop, regardless of which vendor implements it. Second, the "regex is fine" trap — a redaction policy that ships with only regex coverage on the assumption that NER and LLM-based detection can be added later. They can be added later, technically; in practice the policy ossifies and the coverage gap persists until the first incident forces a migration. Third, unmeasured precision — discussed in Section 05; the pipeline that ships without the weekly review loop quietly becomes a utility tax. Fourth, single-tier uniform policies — discussed in Section 06; one rule applied across every surface either over-redacts or under-redacts and wastes engineering time arguing about the middle ground.
For teams building this from scratch in 2026, the recommended sequence is: classify every surface and field into the four tiers; deploy a structured-logging interceptor that consults the classification; start with Presidio plus regex for inline detection; layer NER on tier-two-and-above surfaces; layer LLM-based detection on tier-three-and-above samples; ship the weekly precision-and-recall measurement loop in the same sprint as the inline detection; integrate the audit-log backend in the following sprint. The total budget is four to eight weeks of senior engineering for the first ship, with ongoing tuning thereafter. If your team is scoping a similar engagement, our AI digital transformation practice covers the design, vendor selection, and the measurement loop end-to-end; the broader observability picture sits alongside in our observability anti-patterns essay and the SOC 2 controls mapping lives in our agentic SOC 2 framework piece.
PII redaction is a pipeline, not a feature.
The recurring pattern in this guide is that every decision — detection technique, redaction site, real-time versus batch, tiering, measurement — points toward the same conclusion. PII redaction is not a single regex on the response edge, not a vendor checkbox, not a feature in the architecture diagram. It is a continuous pipeline that spans every surface AI output touches, governed by a tiered policy that lives in code, measured weekly for both coverage and precision, and integrated with audit trails that answer the deletion-request question with confidence.
The teams that ship the pipeline before the first incident treat compliance as engineering. The teams that ship a feature and call it done treat it as theatre — and the theatre holds up until the first audit, deletion request, or breach notification reveals the gap. The cost asymmetry is severe enough that the case for shipping the pipeline correctly is mostly an exercise in showing the comparison: four to eight weeks of focused engineering up front, or six to nine months of cross-team remediation after the event. The earlier ship wins on every axis that matters.
One closing observation. The convergence we expect through 2026 is on a smaller set of correct defaults — open-source redaction libraries shipping field-level interceptors out of the box, observability SDKs exposing structured-redaction primitives as first-class options, regulatory guidance stabilising around the "before persistence" principle. None of that convergence removes the need for the tiered policy, the measurement loop, or the audit-log integration. Defaults move faster than legacy production code; the discipline of running the pipeline correctly is the moat that compounds across every regulatory window.