Agent observability anti-patterns are the gap between trace data you collect and trace data you can actually use during an incident. Most teams trace something — the few that trace usefully avoid eight specific failure modes that quietly destroy the signal long before anyone notices. This essay names each anti-pattern, gives the diagnostic signal that surfaces it, ranks its severity, and ships the corrective pattern.
The framing matters because the marketing language around agent observability is uniformly optimistic. Every vendor sells the same promise of full visibility, real-time replay, and one-pane dashboards. The lived reality is that production traces are often unsearchable (cardinality blew up the index), unsendable to the compliance team (PII in the spans), unreplayable (the body was truncated), and unhelpful at 03:14 (sampling threw away the interesting turn). The instrumentation looks busy; the postmortem ends in a shrug.
What follows is the failure-mode catalogue we apply when auditing agent observability for clients. Eight anti-patterns ordered roughly by severity, with the severity stack reconciled at the end. Read it as a punch list — every anti-pattern your stack exhibits is a trace-quality debt accruing daily interest, and the interest comes due the next time an agent misbehaves in front of a real customer.
- 01 · PII in traces is a compliance failure. When customer data lands in the observability backend unredacted, the trace store becomes a regulated data store overnight. Redact at the structured-logging layer, before persistence — bolting it on later is painful and never complete.
- 02 · Cardinality must be bounded. Free-form span attributes (user IDs as labels, raw URLs, timestamps) explode the time-series index and turn the observability bill into a line-item finance asks about. Cap unique label values; push high-cardinality data into the trace body, not the index.
- 03 · Truncation must be detectable. Silent body truncation destroys postmortem signal — the one span you needed to read is the one the SDK chopped at the byte limit. Emit an explicit truncated flag and a length attribute so reviewers know when they are looking at partial data.
- 04 · Replay turns incidents from guesswork into walkthroughs. Without deterministic replay from a trace, every fix is a hypothesis. With replay, the postmortem becomes a recorded session that future on-calls can step through — the single highest-leverage capability in agent ops.
- 05 · Eval signals belong in the trace. Quality and reliability views diverge when they live apart. Inline eval scores as span attributes keep root-cause analysis honest and stop the "quality is fine, reliability is broken" fiction that hides the actual regressions.
01 — Why Traces Fail
Most teams trace something — few teams trace usefully.
The hardest thing about agent observability is that the failure mode is invisible until you need the data. A team installs a vendor SDK, sees spans flowing in, watches the dashboard light up, and ships. Three months later a customer reports a hallucination from a specific turn at 14:32 on a Tuesday — and the on-call engineer opens the trace viewer to discover that the body was truncated, the user ID is missing, the retrieval span was sampled out, and the eval score lives on a different dashboard nobody joined to this trace. The instrumentation works; the data is useless.
Trace quality is the dimension that distinguishes the two. It is the property that a trace, opened cold, lets a competent engineer answer post-hoc questions about what the agent did and why. Trace quality is not the same as trace coverage (how many turns produce a span) or trace volume (how much data lands in the backend); it is the joint property of completeness, fidelity, searchability, replayability, and compliance handling. Each of the anti-patterns below attacks one of those properties — usually silently, often for legitimate-sounding engineering reasons (storage cost, SDK defaults, library auto-instrumentation).
- Pre-incident state: Spans flow · dashboards green. Vendor SDK installed, traces in the backend, dashboards show coverage. Nobody has tried to use a trace to answer a hard question. The anti-patterns are present but unobserved — the bill is small, the index is fine, the customers haven't complained yet.
- Target state: Trace reads cleanly · replay works. Bodies stored with PII redaction, cardinality bounded, truncation flags explicit, eval scores inline, retrieval payloads preserved, parent-child links intact. An on-call engineer can answer the customer's question in under five minutes and validate the fix by replay.
- Common failure mode: Spans flow · postmortem is fiction. Same dashboards as the first state, but a real incident has happened. The trace exists; it can't be read. The postmortem narrative is reconstructed from logs and memory. The fix may have addressed the cause; nobody can prove it. The team writes a confident-sounding doc and ships.
- Honest gap: Trace not captured. Either the path isn't instrumented or the sampling threw it out. This is in some ways the honest state — at least the team knows it can't answer the question. Worse than the target state, better than the "data looks fine, doesn't actually work" failure mode.

One framing note before the catalogue. The anti-patterns are not ranked by frequency — almost every team exhibits several of them — they are ranked roughly by severity. PII in spans is a compliance event; cardinality explosion is a finance event; trace truncation is a postmortem-fidelity event; missing replay is a root-cause event. The severity matters because remediation is expensive enough that you cannot fix all eight at once. Section seven reconciles the priority list so the team knows where to spend the first quarter's budget.
02 — PII in Spans
Trace store becomes a compliance liability.
The default behaviour of every major agent SDK is to capture prompt and response bodies verbatim. The default behaviour of every major observability backend is to store those bodies in a searchable index for 30 to 90 days. The default behaviour of every production agent is to receive customer data — names, email addresses, account numbers, sometimes payment data, occasionally medical or financial records depending on the domain. The composition of those three defaults is a trace store that has quietly become a regulated data store, with all the access controls, retention obligations, and breach-notification exposure that implies.
The diagnostic signal is uncomfortable. Take a random sample of fifty production traces and grep them for the patterns that matter in your jurisdiction — email-address regex, phone-number regex, credit-card check-digit patterns, your domain-specific identifiers. In agent deployments without an explicit redaction policy, the hit rate is typically 30 to 40% on at least one pattern. That is the size of the compliance debt the team has accrued without realising it. Worse, the trace backend usually doesn't support selective deletion at the field level — you can drop a trace, but you cannot scrub a field within a trace — so when a customer requests deletion under GDPR, CCPA, or POPIA the operational answer is to delete the whole record, which destroys the postmortem evidence the engineering team needs.
The corrective pattern is to redact at the structured-logging layer, before the trace ever reaches the backend. The redaction policy lives in code (versioned, reviewed, testable), runs in the SDK or proxy on the way out, and emits a redacted-fields list as a span attribute so downstream consumers know what was masked. Field-level approaches beat regex sweeps where they are available: if the prompt template includes {{customer_email}} as a slot, redact the slot specifically. The trade-off is that replay against redacted bodies is partial; you can reconstruct the agent's decision tree but not the verbatim customer input. For compliance-bound workloads this is the correct trade-off.
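What that looks like in practice is small. A minimal sketch, assuming a generic span-emission hook rather than any particular vendor SDK; the pattern set, the `<redacted:...>` mask format, and the `redacted_fields` attribute key are all illustrative:

```python
import re

# Illustrative regex set -- extend with the jurisdiction-specific
# identifiers that matter for your domain.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "credit_card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Mask PII on the way out; return the masked text plus the names
    of every pattern that fired, for the span attribute."""
    fired = []
    for name, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"<redacted:{name}>", text)
        if count:
            fired.append(name)
    return text, fired

def emit_span(name: str, body: str, attrs: dict) -> dict:
    """Stand-in for the SDK/proxy emission hook: redaction happens here,
    before the backend ever sees the payload."""
    clean_body, redacted = redact(body)
    # Reviewers see exactly which patterns were masked -- no
    # "was this field empty or redacted?" ambiguity downstream.
    return {"name": name, "body": clean_body,
            "attributes": {**attrs, "redacted_fields": redacted}}
```

A field-level redactor that targets prompt-template slots directly would replace the regex sweep where the template is available; the regex set is the fallback for free-form customer text.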
- Audit baseline: Trace samples containing PII. Across agent deployments we audited in 2026, between 30 and 40% of randomly sampled traces contained at least one PII pattern when no explicit redaction policy was in place. Email addresses lead; phone numbers and customer-account identifiers follow.
- Architecture rule: Structured-logging layer. The single correct place to redact is the SDK or proxy emitting the span — before the backend ever sees the data. Bolted-on post-ingest redaction is incomplete and expensive; the only reliable approach is redacting on the way out.
- Auditability: Redacted-fields list as span attribute. Every redaction emits a structured list of the field names (or pattern names) that were masked. Reviewers reading the trace know exactly what is missing and why — preventing the "was this field empty or redacted?" ambiguity that destroys postmortem fidelity.

Severity ranking for this anti-pattern is critical. The cost of a regulatory event — fine, breach notification, customer trust damage — exceeds every other anti-pattern in the catalogue combined. If the audit shows even one occurrence of unredacted PII in the trace backend, this is the first project to ship; it outranks every other improvement on the list.
"The first time the compliance team asks for a deletion under GDPR, you discover whether your observability stack is a tool or a liability. Most teams discover it's the latter."— Production audit notes · agent observability engagements
03 — Cardinality Explosion
Time-series DB becomes a bill.
Cardinality is the silent killer of observability budgets. Every unique value of a span attribute that is indexed for filtering adds a row to the time-series database's label index. User IDs are the canonical mistake — a million unique users becomes a million-row label index, which the database must traverse on every query. Add raw URLs (with query parameters), full prompt hashes, timestamps as strings, or the model-response itself as a label, and the index explodes by orders of magnitude. The backend either starts charging accordingly or starts dropping data; either outcome is bad.
The diagnostic signal is to look at the per-label cardinality report your observability vendor exposes. Most vendors (LangFuse, Phoenix, Datadog, Honeycomb) provide a top-N report of labels ranked by unique-value count. Any label with more than a thousand unique values is a candidate for trouble; any label with more than a hundred thousand is an active problem. The corrective pattern follows from the diagnostic: bound unique label values by design. User IDs do not belong as a label — they belong in the trace body, where they are searchable through full-text indexes rather than the time-series index. URLs get normalised before being labelled (path template instead of full URL with query string). Free-form text never becomes a label.
The architectural principle is simple: labels are for aggregation, bodies are for forensics. If the question is "how many turns per route per hour?" the route belongs as a label. If the question is "what did this specific user see last Tuesday at 14:32?" the user ID belongs as a span attribute in the body — which is queryable but not indexed by the label-cardinality dimension. Most cardinality explosions are caused by treating every interesting attribute as a label because the SDK API made it easy.
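A sketch of the labels-versus-bodies split under that principle, assuming a generic span shape rather than any vendor's schema; `normalise_route` and the field names are illustrative:

```python
import re
from urllib.parse import urlsplit

def normalise_route(url: str) -> str:
    """Collapse a raw URL to a path template: drop the query string and
    replace numeric or UUID-shaped path segments with a placeholder."""
    path = urlsplit(url).path
    return re.sub(r"/(\d+|[0-9a-f]{8}-[0-9a-f-]{27})(?=/|$)", "/{id}", path)

def build_span(route_url: str, user_id: str, model: str) -> dict:
    return {
        # Labels feed the time-series index: bounded values only.
        "labels": {"route": normalise_route(route_url), "model": model},
        # Body attributes hold the high-cardinality data: full-text
        # searchable for forensics, never indexed as labels.
        "attributes": {"user_id": user_id, "url": route_url},
    }
```

Calling `build_span("https://api.example.com/users/8812/orders?page=2", "u-8812", "model-x")` yields the label `route="/users/{id}/orders"` — one index row per route template instead of one per user.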
[Chart: label cardinality vs observability backend health · severity calibrated against Honeycomb, Datadog, and Phoenix billing models · production engagements]

A worked example. In an engagement during the first quarter of 2026, a client's observability bill tripled in a single month with no traffic change. The root cause was a well-meaning developer who added the full conversation ID — a UUID minted per session — as a label rather than a body attribute, because the SDK's set_tag method was easier to call than set_attribute. Three weeks later the label index held two million unique values; the time-series database charged accordingly. The fix was a one-line code change and a backend cardinality reset; the bill stopped growing the same day, but the previous month's charges were already irreversible. Severity ranking: high — financial impact within weeks, no direct customer-facing consequence.
04 — Trace Truncation
Silently destroys the postmortem signal.
Almost every observability SDK truncates span attribute values beyond a default byte limit — typically 4 KB, sometimes 8 KB, occasionally configurable upward to 64 KB or 256 KB. The truncation is well-intentioned (protects the backend from pathological payloads) but it is almost always silent: the SDK chops the value at the byte limit and emits the truncated form with no indication that truncation occurred. For agent workloads this is catastrophic, because agent prompts routinely exceed 4 KB and frequently exceed 64 KB. The exact bytes you need to read during a postmortem are the bytes the SDK threw away.
The diagnostic signal is to inspect a handful of long-context traces and compare the captured prompt-body length against the actual prompt length the model received (which you can derive from token counts and a tokenizer). Any systematic difference indicates silent truncation. The corrective pattern has two components. First, raise the byte limit explicitly — most SDKs allow this through configuration — and verify the new limit against the longest prompts your agent produces in practice. Second, emit a structured truncation indicator: if a value was truncated, the span carries a truncated: true attribute and an original-length attribute, so reviewers know they are looking at partial data.
A subtler variant is response truncation under streaming. Streamed responses are concatenated by the SDK on stream end — but if the stream is interrupted (network error, client disconnect, timeout), the partial response is what lands in the span. Without an explicit indicator that the stream was incomplete, the postmortem reader sees a half-response and attributes it to the model rather than the network. The same pattern applies: emit a structured stream-completion attribute so the trace tells the truth about its own completeness.
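A sketch of both indicators, assuming the SDK's byte limit has been raised explicitly; the attribute names (`truncated`, `original_length_bytes`, `stream_completed`) are illustrative rather than any standard convention:

```python
MAX_ATTR_BYTES = 65_536  # raise explicitly; verify against your longest real prompts

def capture_value(value: str) -> dict:
    """Truncate against an explicit limit but tell the truth about it:
    the flag and the original length ride alongside the (possibly
    partial) value."""
    raw = value.encode("utf-8")
    truncated = len(raw) > MAX_ATTR_BYTES
    # errors="ignore" drops a multi-byte character split at the boundary.
    kept = raw[:MAX_ATTR_BYTES].decode("utf-8", errors="ignore")
    return {"value": kept, "truncated": truncated,
            "original_length_bytes": len(raw)}

def capture_stream(chunks: list[str], completed: bool) -> dict:
    """Concatenate a streamed response and record whether the stream
    actually finished, so a half-response is attributed to the network
    rather than the model."""
    attrs = capture_value("".join(chunks))
    attrs["stream_completed"] = completed
    return attrs
```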
Severity ranking: high. Trace truncation does not produce a compliance event or a finance event, but it does produce the specific failure mode where the engineering team writes a confident postmortem narrative that turns out to be wrong because the underlying evidence was incomplete. That class of error is harder to detect than an outright missing trace — because the reader sees what looks like a complete record — and therefore costs more in eventual rework.
05 — Missing Replay
Incidents become guesswork.
Replay is the ability to take a specific captured trace, reconstruct its inputs deterministically, and re-run the agent in a sandboxed environment to validate hypotheses. Replay is the single highest-leverage capability in agent operations because it is the only way to convert a postmortem from a narrative ("we think this fixes it") into a test ("we re-ran the failing trace against the proposed fix and it passed"). The marketing copy of every observability vendor implies replay is supported; the operational reality is that replay requires storing the agent's actual inputs verbatim, which is in tension with truncation defaults, PII redaction, and storage budgets.
The diagnostic signal is the "replay test": pick a real production trace, hand its ID to an engineer, and ask them to reproduce the agent run on a developer laptop in under thirty minutes. If they can, replay is real. If they cannot, you have the most common agent-observability gap in the industry — and every postmortem your team has written this quarter is partly fiction. The corrective pattern requires three things. First, store the rendered prompt verbatim (after PII redaction) including system prompt, conversation history, and tool-result text. Second, store the retrieval payload — the top-k documents the agent received, identified by stable IDs so the chunks can be re-fetched. Third, store the tool-result payloads, because the agent's behaviour depends on what the tools returned, not just on what it asked for.
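A sketch of what that capture might look like, assuming a sandboxed model client is injected at replay time; `ReplayBundle` and its field names are illustrative, not a vendor schema:

```python
from dataclasses import dataclass, field

@dataclass
class ReplayBundle:
    """Everything a sandbox needs to re-run one captured turn."""
    trace_id: str
    rendered_prompt: str        # verbatim (post-redaction): system prompt,
                                # conversation history, tool-result text
    retrieval_docs: list[dict]  # top-k chunks with stable doc/chunk IDs
    tool_results: list[dict]    # what the tools actually returned
    model: str
    model_params: dict = field(default_factory=dict)  # temperature, seed, ...

def replay(bundle: ReplayBundle, call_model) -> str:
    """Re-run the captured turn against a candidate fix. In this
    single-turn sketch the tool-result text is already embedded in the
    rendered prompt; a multi-step replay would serve bundle.tool_results
    to the agent rather than re-executing tools live."""
    return call_model(model=bundle.model,
                      prompt=bundle.rendered_prompt,
                      **bundle.model_params)
```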
Replay is the place where the storage-cost trade-off becomes most explicit. Storing full bodies for every trace at long retention costs roughly two to five times what storing only metadata costs. The economically defensible policy is tiered: full bodies for short retention (7 to 30 days, covering the incident-response window), metadata-plus-hash for longer retention (compliance lookups), and a fast-restore path from cold storage when an older trace becomes interesting. The anti-pattern is the all-or-nothing pendulum: either store everything forever (cost explosion) or store nothing meaningful (replay impossible). Tier deliberately.
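A sketch of the tiered policy, with retention windows matching the ranges above; the tier names and windows are illustrative:

```python
import hashlib
from datetime import timedelta

FULL_BODY_WINDOW = timedelta(days=30)   # incident-response window: bodies verbatim
METADATA_WINDOW = timedelta(days=365)   # compliance window: metadata plus hash

def retention_tier(trace_age: timedelta) -> str:
    if trace_age <= FULL_BODY_WINDOW:
        return "full_body"
    if trace_age <= METADATA_WINDOW:
        return "metadata_plus_hash"  # body parked in cold storage, restorable
    return "expire"

def downgrade(trace: dict) -> dict:
    """Swap the body for its hash when a trace ages out of the full-body
    tier; the hash proves integrity if the body is later restored from
    cold storage."""
    body = trace.pop("body", "")
    trace["body_sha256"] = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return trace
```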
- Operations mode: Bodies, retrieval, tool results · deterministic. Rendered prompts stored verbatim (post-redaction), retrieval payloads with stable IDs, tool-result payloads preserved. Replay reconstructs the agent run on a developer laptop in under thirty minutes. Every postmortem fix is validated against the failing trace before ship.
- Transitional: Prompts stored · tool results missing. The agent's reasoning can be inspected; what the tools actually returned cannot. Replay is partial — the engineer can re-run the model but must mock the tool layer, which may not reproduce the failure. Better than nothing; not a substitute for full capture.
- Liability mode: Timings and IDs · no bodies. The trace tells you a turn happened, who it was for, and how long it took. It does not tell you what the agent said or what the tools returned. Postmortem becomes a narrative reconstruction from logs and memory. Common — and an active liability for incident response.
- Common gap: Traces captured · replay never built. Bodies are stored, sometimes; nobody has wired up the sandbox path to re-run an agent against a captured trace. The data exists; the operational muscle doesn't. The most common state in agent teams who installed observability but never tested it under fire.

Severity ranking for missing replay: critical. The cost is not financial or regulatory; it is the slow accumulation of unvalidated fixes shipped on the assumption they addressed the cause, with no evidence either way. Over six to twelve months this produces a pattern of recurring incidents whose root causes are never truly identified — the engineering team mistakes symptom suppression for fixes, and the customer-facing quality metric drifts downward. Replay is the cheapest infrastructure investment in agent operations relative to long-term operational stability.
06 — Four More
Naïve sampling, span-name sprawl, missing parent-child, ignored evals.
Four more anti-patterns round out the catalogue. None of them individually rises to the severity of PII, cardinality, truncation, or missing replay — but they compound. A trace store with naïve sampling, span-name sprawl, missing parent-child linkage, and ignored eval signals is functionally unusable even if it avoids the top four mistakes. The four below appear in roughly descending order of frequency in production audits.
- Naïve sampling: head-based sample · same rate for success and failure. Sampling at the SDK's default rate (often 10%) without separating success from failure means the failures — the traces you need — get thrown out 90% of the time. Corrective pattern: always trace 100% of root spans; sample inside the trace only on expensive operations; never sample errors (see the sketch below). Severity: high.
- Span-name sprawl: every code path emits a unique span name. Free-form span naming ("process_user_query_v2_fast", "handle_query_legacy", "run_query_async") destroys aggregation. Backends can't roll up by operation type. Corrective pattern: a small fixed taxonomy of span names — agent.turn, agent.tool_call, agent.retrieval, agent.model_call — with details in attributes. Severity: medium.
- Missing parent-child linkage: spans emitted · context not propagated. Sub-agent and async tool calls emit spans, but the parent trace ID isn't propagated through. The result is orphan traces — fragments of an interaction with no way to reassemble the whole. Corrective pattern: trace-context propagation as part of every cross-boundary call, including queue producers/consumers. Severity: high.
- Ignored eval signals: evals exist · live on a separate dashboard. Eval scores run nightly or per-turn but land in their own UI, never joined to the trace. Quality and reliability look fine independently; the regression sits in the join nobody computed. Corrective pattern: eval scores as span attributes on the same trace, so postmortem inspection includes the quality dimension. Severity: medium-high.

The compounding cost matters more than any individual cost. A team that fixes only the top four but ignores the four above still ends up with traces that can't be aggregated (sprawl), can't be assembled (linkage), can't be quality-scored (evals), and can't be analysed under realistic load (naïve sampling). Trace quality is a portfolio property; partial fixes yield partial returns. The eval-signal anti-pattern in particular is the bridge between agent observability and the broader sixty-point observability audit, which goes deeper on the eval-integration axis specifically.
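A sketch of the error-first sampling decision named in the first item above, assuming a head-based sampler with access to span status; the `expensive` flag and the 10% child rate are illustrative:

```python
import random

EXPENSIVE_CHILD_RATE = 0.10  # thinning applies only inside kept traces

def keep_root_span(span: dict) -> bool:
    """Root spans are never sampled: dropping one throws away an entire
    interaction, including the failures you will need at 03:14."""
    return True

def keep_child_span(span: dict) -> bool:
    """Errors are always kept; only successful, expensive child
    operations are thinned."""
    if span.get("status") == "error":
        return True
    if span.get("expensive"):
        return random.random() < EXPENSIVE_CHILD_RATE
    return True
```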
"Trace quality is a portfolio property — partial fixes yield partial returns. The anti-patterns compound until the trace store is technically populated and operationally useless."— Agent observability engagements · 2026
07 — Severity
Critical, high, medium — fix order.
Eight anti-patterns, three severity tiers, one prioritised fix order. The ranking below orders each anti-pattern by the operational cost of leaving it unfixed — combining regulatory exposure, financial impact, and postmortem-fidelity damage. Severity is a blunt instrument; treat it as a starting point for a quarter-by-quarter remediation plan, not as gospel. The honest way to use this ranking is to run the diagnostic for each anti-pattern against your own traces, count the hits, and let empirical exposure shift the order where it should. PII almost always tops out at critical regardless of the team; replay almost always falls just below.
[Chart: anti-pattern severity ranking · prioritised fix order · severity calibrated against regulatory, financial, and operational dimensions · audit engagements 2026]

A practical reading of this ranking. Quarter one ships PII redaction at the structured-logging layer and the first cut of replay-from-trace. Quarter two ships cardinality bounding, the truncation-aware SDK configuration, and sampling that treats errors as first-class. Quarter three ships parent-child linkage across all cross-boundary calls and integrates eval signals into the trace surface. Quarter four cleans up span-name sprawl — by this point the taxonomy is obvious, because the prior fixes have forced the team to look at every span name during testing. A realistic year of trace-quality remediation looks like that sequence; teams that try to compress it into a quarter end up with partial fixes everywhere and a fragile platform overall.
One closing observation. The most expensive observability work is not the engineering; it is the discipline. Every anti-pattern in this catalogue has a cheap technical fix and a hard organisational fix — getting the prompt template change reviewed for PII implications, getting the SDK configuration audited against actual prompt lengths, getting the eval pipeline to write to the same backend the reliability data lives in. Discipline is the moat. The teams that build it run agents at scale; the teams that don't accumulate trace-quality debt until an incident forces a remediation under pressure. We help clients build the discipline through our AI transformation engagements, and the observability stack TCO calculator quantifies the financial trade-offs we just walked through.
Observability is the difference between an agent in production and an agent in hope — anti-patterns are the gap.
The catalogue is eight anti-patterns long because that is the length we have seen empirically, not because there is something special about the number. New ones will surface as agent architectures evolve — multi-modal traces, agent-to-agent handoff observability, OpenTelemetry semantic conventions for GenAI stabilising — and the list will be revised. What won't change is the underlying principle: trace quality is the property that determines whether observability is leverage or theatre. Every other metric is downstream of that one.
The trajectory we expect through 2026 is convergence on a smaller set of correct defaults. OpenTelemetry's semantic conventions for GenAI are stabilising; vendor SDKs are starting to ship redaction primitives out of the box; sampling APIs are beginning to expose error-first policies. None of that convergence eliminates the need for the audit — defaults move faster than legacy production code — but it does mean teams starting today can adopt better defaults more cheaply than teams who shipped two years ago. The compounding interest on trace-quality debt is one of the strongest arguments for fixing it now rather than postponing.
One closing thought. Observability work always feels like a tax until the first time it saves an incident — at which point it permanently changes how the team operates. The cheapest way to make the case internally isn't the audit document; it is the first live replay of a real production trace performed in front of the team. When the room watches the agent's reasoning step by step, the argument for fixing the rest of the anti-patterns becomes self-evident. The eight failure modes are the gap; this essay is the punch list; the discipline to ship the fixes is the moat.