

Agent Observability Audit: 60-Point Checklist 2026

Sixty checks across six axes — trace coverage, span depth, eval signals, drift detection, cost tracking, and incident-response readiness — that separate observed production agents from the ones running blind. Vendor-neutral, applied across LangSmith, LangFuse, Helicone, and Phoenix.

Digital Applied Team · Agentic engineering · Published May 3, 2026 · 16 min read

Audit points: 60, across six axes (coverage · depth · evals · drift · cost · IR)
Vendors compared: 4 (LangSmith · LangFuse · Helicone · Phoenix)
Typical audit duration: ~3h for a single-team agent

An agent observability audit is the difference between understanding why a production agent failed at 03:14 and guessing from a stack trace at 09:00 the next morning. This sixty-point checklist covers the six axes — trace coverage, span depth, eval signals, drift detection, cost tracking, and incident-response readiness — that separate agents you can operate from agents you merely deploy.

What's at stake is small until it isn't. Agentic systems fan out tool calls, accumulate state across turns, and hit cost cliffs that no monolithic API call ever did. The teams that get burned in 2026 will not be the ones whose models are weakest — they will be the ones whose traces are thinnest and whose drift alarms never fire. Observability is the cheapest insurance the stack offers, and it is almost always under-purchased.

This guide is vendor-neutral and applied. Each axis lists ten checks the way a senior on-call would phrase them, ends with a scoring rubric, and the closing section maps the same checklist across LangSmith, LangFuse, Helicone, and Phoenix so you can run the audit against whatever you already have installed. Total audit time on a single-team agent is roughly three hours.

Key takeaways
  1. Traces are non-negotiable for production agents. Blind production is unsupportable. If you cannot replay yesterday's 03:14 incident from a trace, you do not have an agent in production — you have an agent in hope.
  2. Eval signals belong in the trace, not next to it. Quality and reliability views diverge when they live apart. Inline eval scores on the same trace surface keep root-cause analysis honest and stop the "quality is fine, reliability is broken" fiction.
  3. Cost tracking per-user beats per-month. Hot-spot users — runaway agents, malformed prompts, abusive callers — surface earlier when cost is attributed per-user and per-tenant. Per-month spend dashboards hide the heavy tails until the invoice arrives.
  4. Vendor-neutral spans now pay off later. OpenTelemetry semantic conventions for GenAI are stabilising. Emitting OTel-shaped spans today keeps your portability options open when the vendor landscape shifts — and it always shifts.
  5. Incident runbooks need agent-trace replay. Without the ability to replay a specific trace into a sandboxed agent run, root-cause analysis is guesswork. Replay is the difference between "we fixed it" and "we think we fixed it."

01 · Blind vs Observed: Production agents without traces are a liability.

The fastest way to tell a blind agent from an observed one is to ask the on-call engineer a single question: "Walk me through what happened in this trace from yesterday at 03:14." In a blind operation, the answer is a shrug and a tail of grep commands against unstructured logs. In an observed operation, the answer is a URL to a trace viewer showing every tool call, every model invocation, every retry, with timing and cost attached.

The distinction matters because agents fail differently than traditional services. A monolithic API either returns a 500 or it doesn't. An agent silently chooses the wrong tool, loops on a malformed sub-query, retries seven times because the eval score is fractional, and finally returns a confident-sounding answer that is wrong. None of those failure modes register as a 5xx anywhere. All of them are visible in a trace.

Blind
Unstructured logs · no replay

Stdout to a log aggregator, request IDs that don't propagate across tool calls, prompt and response bodies stripped at the proxy. Root-cause is folklore — "we think it was the retrieval step." Drift is invisible until customers complain.

Liability mode
Observed
Trace per turn · replayable

Every turn is a parent span with child spans per tool call, per model invocation, per retrieval. Prompt and response bodies are stored (with PII redaction). Eval scores are inline on the trace. Cost is attributed at the leaf. Replay reconstructs the run.

Operations mode
Hybrid
Partial coverage · sampling-only

Sampled traces (1-10%), logs for everything else, evals running on a different surface, cost in finance's spreadsheet. Common transitional state — useful, but every incident eventually hits the unsampled portion. Move forward; don't settle.

Transitional
Audit-ready
Trace + eval + drift + cost on one surface

All six axes integrated. The on-call engineer answers any post-mortem question from a single tool, with no spreadsheet correlation. The audit goal — and the level at which observability stops being a tax and starts being leverage.

Target state
The honest test
Run a fire-drill. Pick a real trace from last week, hand the URL (or the timestamp, if you cannot) to a teammate, and ask them to tell you which tool returned a bad result and why in under five minutes. If they can, you're observed. If they can't, the rest of this checklist is your action list.

02 · Trace Coverage: What gets traced — ten checks.

Trace coverage is the breadth axis: how much of the agent's behavior produces a span you can later inspect. The ten checks below are the surface-area questions a senior reviewer asks first. Each is binary — yes, the span exists and is captured, or no, it is missing. A clean audit score on this axis is the prerequisite for every other axis; you cannot audit span depth, eval signals, or cost tracking on traces that were never captured.

  1. Every user turn produces a root span. No turn is silently dropped. Sampling rates are explicit and configured per environment, not implicit per library default.
  2. Every model invocation produces a child span. Including streamed responses (the span closes on stream end, not on first token).
  3. Every tool call produces a child span. Including failed tool calls, retries, and timeouts — those are the most diagnostic.
  4. Every retrieval step is captured. Query embedding, vector lookup, reranking, document selection — each as a separate span with the inputs and outputs preserved.
  5. Sub-agent invocations propagate the parent ID. When the orchestrator delegates to a sub-agent, the trace context follows. No orphan traces.
  6. External-service calls are captured. HTTP clients are auto-instrumented or manually wrapped; the trace does not stop at the agent's edge.
  7. Background jobs are captured. Async tasks, queues, and scheduled re-evaluations all emit spans linked to the originating user turn where applicable.
  8. Failed turns produce a trace. 5xx, exceptions, and circuit-breaker trips emit a final span with the error attached — not a missing trace.
  9. PII redaction runs before persistence. Customer data is filtered or hashed before the trace lands in the observability backend.
  10. Trace IDs propagate to product logs. The application log line for the user-facing response includes the trace ID, so a customer report can be cross-referenced in seconds.
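
To make the coverage checks concrete, here is a minimal sketch of checks 1, 2, 3, 8, and 10 using the OpenTelemetry Python SDK. The call_model and call_tool callables and the non-standard attribute names are placeholders for whatever your stack provides; the span structure is the point, not the specific names.

```python
# Minimal sketch: root span per turn, child spans per model call and tool call,
# errors captured on the failed path, trace ID echoed into the product log line.
import logging

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("agent")
log = logging.getLogger("product")


def handle_turn(user_id: str, session_id: str, message: str, call_model, call_tool) -> str:
    # Check 1: every user turn produces a root span; never sample these away.
    with tracer.start_as_current_span("agent.turn") as turn_span:
        turn_span.set_attribute("user.id", user_id)
        turn_span.set_attribute("session.id", session_id)
        try:
            # Check 2: every model invocation is a child span (closed on stream end).
            with tracer.start_as_current_span("agent.model_call") as model_span:
                plan = call_model(message)                        # hypothetical helper
                model_span.set_attribute("gen_ai.request.model", plan.model)

            # Check 3: every tool call is a child span, retries and timeouts included.
            with tracer.start_as_current_span("agent.tool_call") as tool_span:
                tool_span.set_attribute("tool.name", plan.tool_name)
                result = call_tool(plan)                          # hypothetical helper

            return str(result)
        except Exception as exc:
            # Check 8: failed turns still produce a trace, with the error attached.
            turn_span.record_exception(exc)
            turn_span.set_status(Status(StatusCode.ERROR))
            raise
        finally:
            # Check 10: the product log line carries the trace ID for cross-reference.
            trace_id = format(turn_span.get_span_context().trace_id, "032x")
            log.info("agent turn handled", extra={"trace_id": trace_id})
```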
Coverage rate
≥ 99%
User turns producing a root span

Anything below 99% means traces are being dropped silently. The remaining 1% is the source of most uninvestigatable incidents. Sampling can be aggressive on body content; it should never be aggressive on the root span itself.

Audit floor
Tool-call coverage
100%
Every tool call instrumented

Tool calls are the most diagnostic spans in any agent trace — they are where the agent commits to an action with side effects. Skipping any tool call is unacceptable. Manual wrappers are fine when auto-instrumentation is unavailable.

Non-negotiable
Trace-ID propagation
100%
Product logs reference the trace ID

When a customer reports "the agent gave me a weird answer at 14:32," the support engineer should pull the trace in one query, not three. Application log lines for user-facing responses must include the trace ID — every time.

Support reality check

A practical anti-pattern worth naming: capturing traces for the happy path while letting the error path emit nothing. Most teams instrument the success branch first and intend to come back to errors — and then they don't. Failed turns are exactly the ones you need a trace for. If your audit shows greater than 99% coverage on success and less than 50% on failure, you have a blind operation that happens to look observed in the dashboards.

03 · Span Depth: How deep traces go — ten checks.

Span depth is the resolution axis: once a span exists, how much of the relevant context lives inside it. A root span with no attributes tells you a turn happened. A root span with prompt, response, model name, token counts, latency, eval scores, and tool-decision rationale tells you what happened and why. The ten checks below are the attribute-and-payload completeness questions.

  1. Prompt body stored (or hashed reference). The exact prompt the model received, with all template substitutions applied. Reconstructible.
  2. Response body stored. Including streamed completions concatenated; including tool-call structured outputs verbatim.
  3. Model identifier captured. Provider, model name, version, and any temperature / top-p / max-tokens settings. Cross-vendor agents demand this most.
  4. Token counts captured. Input and output tokens per model invocation, plus cached vs uncached when the provider distinguishes (Anthropic prompt cache, OpenAI cached-input).
  5. Latency captured at every layer. Time-to-first-token, total streaming time, tool-call latency, retrieval latency — each as its own attribute.
  6. Tool inputs and outputs preserved. Structured arguments and structured returns, not just "tool called." The reason the model picked that tool sits in the parent prompt/response pair.
  7. Retrieval results preserved. Top-k document IDs, scores, and either the chunk text or a stable reference to fetch it later. Without this, RAG debugging is guesswork.
  8. User and session identifiers as attributes. For multi-tenant agents, tenant ID is non-negotiable. Even for single-tenant, the session ID lets you reconstruct a conversation.
  9. Eval scores attached inline. When an inline eval runs against the response, its score is an attribute of the same span — not a separate record requiring correlation.
  10. Error context captured. Stack trace, error type, retry count, and the upstream cause when the error is downstream of another span's failure.
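
A sketch of the depth attributes above attached to a single model-invocation span. The gen_ai.* names follow the incubating OTel GenAI semantic conventions, which may still shift; client.complete, score_faithfulness, and the remaining attribute names are illustrative assumptions, not a specific vendor's schema.

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("agent")


def invoke_model(client, prompt: str, tenant_id: str, user_id: str, score_faithfulness):
    with tracer.start_as_current_span("agent.model_call") as span:
        start = time.monotonic()
        response = client.complete(prompt)            # hypothetical client call

        # Checks 1-2: prompt and response bodies (run PII redaction before this point).
        span.set_attribute("gen_ai.prompt", prompt)
        span.set_attribute("gen_ai.completion", response.text)

        # Checks 3-4: model identity, sampling settings, and token counts.
        span.set_attribute("gen_ai.request.model", response.model)
        span.set_attribute("gen_ai.request.temperature", response.temperature)
        span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)

        # Check 5: latency at this layer as its own attribute.
        span.set_attribute("latency.total_ms", (time.monotonic() - start) * 1000)

        # Check 8: tenant and user identity for multi-tenant attribution.
        span.set_attribute("tenant.id", tenant_id)
        span.set_attribute("user.id", user_id)

        # Check 9: the inline eval score is an attribute of this same span.
        span.set_attribute("eval.faithfulness", score_faithfulness(prompt, response.text))
        return response
```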

Span-depth audit thresholds · pass targets (derived from production engagements, single-team agent scope)

  • Bodies stored (prompt + response): 100% · required for replay; the foundation of any incident response
  • Token counts on every model span: 100% · required for cost attribution and prompt-bloat detection
  • Eval scores inline on traces: ≥ 95% · quality and reliability on the same surface — the audit goal
  • Retrieval payloads preserved: ≥ 90% · top-k doc IDs and scores at minimum; chunk text where storage allows
  • Multi-tenant attribution: 100% · tenant ID + user ID on every root span
"A trace without bodies is a receipt. A trace with bodies is a forensic record. The difference shows up the first time you have to explain a regression."— Production lesson · agent observability engagements

04 · Eval Signals: Inline eval integration — ten checks.

Eval signals are the quality axis: the model judgements, heuristic checks, and human spot-grades that score how well the agent did. The audit failure mode here is universal — evals running on a separate surface from traces, with no inline join. When that happens, quality dashboards show 92% and reliability dashboards show 98%, both teams shrug, and nobody notices the 8% intersection that is silently broken.

  1. Inline evals run on every (or sampled) turn. Either a fast heuristic on every turn or an LLM-judge on sampled turns, both writing scores back to the same trace.
  2. Eval scores are attributes, not separate records. The trace UI shows the score on the span; no spreadsheet correlation step is required.
  3. Multiple eval dimensions captured. Faithfulness, relevance, harm, format-compliance, tool-correctness — not a single conflated "quality" score.
  4. Golden dataset evals run on a schedule. A fixed test set re-runs nightly (or per-deploy) and the scores are time-series tracked alongside production scores.
  5. Eval failures alert. Not just "this turn scored 0.3" — the rolling average crossing a threshold produces a paged alert or a ticket.
  6. Human grades feed back into traces. When a human grader reviews a turn, the grade lands on the trace as an attribute and is queryable alongside model judgements.
  7. Eval cost is itself tracked. LLM-judge calls consume tokens — those tokens are attributed and budgeted, not hidden in the platform's margin.
  8. Eval drift is monitored. When you change the judge prompt or the judge model, the score distribution shifts — those shifts are visible and acknowledged.
  9. Per-tool evals exist. Tool-correctness is its own scored dimension, separate from output quality. A correct answer assembled from wrong tool calls is still a bug.
  10. Eval datasets version-controlled. The exact examples and labels behind every score are in git, not in a vendor UI nobody backs up.
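
A sketch of the sampled LLM-judge path, assuming a hypothetical judge_client: the scores land as attributes on the turn's own span (check 2), dimensions stay separate (check 3), and the judge's token spend is attributed as eval cost rather than production cost (check 7).

```python
import random

from opentelemetry import trace

tracer = trace.get_tracer("agent.evals")
JUDGE_SAMPLE_RATE = 0.10  # LLM-judge on ~10% of production turns


def maybe_judge(turn_span, prompt: str, answer: str, judge_client) -> None:
    if random.random() > JUDGE_SAMPLE_RATE:
        return

    # The judge runs in its own child span so its latency and token spend are
    # attributed as eval cost (check 7), not hidden inside the production turn.
    with tracer.start_as_current_span("eval.llm_judge") as judge_span:
        verdict = judge_client.grade(prompt, answer)             # hypothetical judge
        judge_span.set_attribute("gen_ai.usage.input_tokens", verdict.input_tokens)
        judge_span.set_attribute("gen_ai.usage.output_tokens", verdict.output_tokens)
        judge_span.set_attribute("cost.category", "eval")

    # Check 2: scores land as attributes on the turn's own span, so the trace
    # viewer shows quality and reliability on one surface.
    turn_span.set_attribute("eval.faithfulness", verdict.faithfulness)
    turn_span.set_attribute("eval.relevance", verdict.relevance)
    turn_span.set_attribute("eval.tool_correctness", verdict.tool_correctness)
```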
Per-turn
Fast heuristic
format · length · profanity · grounding

Runs on every turn. Sub-50ms latency. Catches the obvious failures (malformed output, banned tokens, ungrounded claims). The first line of defence — cheap, fast, always-on.

100% coverage
Sampled
LLM-judge · multi-dimensional
faithfulness · relevance · harm · tool-correctness

Runs on a sample (5-20% in production, 100% pre-deploy). Slower and more expensive, but catches the subtle failures heuristics miss. Score lands on the trace as an attribute, not a separate record.

Sampled coverage
Scheduled
Golden dataset replay
fixed test set · nightly or per-deploy

Versioned test cases re-run against the live agent on a schedule. Distribution shift on the score time-series is the canary for prompt drift, model upgrade regressions, and retrieval degradation. Tied directly to the deployment pipeline.

Time-series
Human-in-loop
Spot-grading
trace UI · grade button · rubric

Reviewers grade a small daily sample directly in the trace UI. Grades land as attributes on the same trace. Calibrates the LLM-judge over time and provides the gold-standard signal when automated evals disagree.

Daily sample
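
The scheduled tier above can be as small as a nightly job over a JSONL file in git. A sketch, assuming run_agent, judge, and record_metric are supplied by your agent entrypoint, eval harness, and metrics backend respectively:

```python
import json
import statistics
from pathlib import Path


def run_golden_dataset(run_agent, judge, record_metric, deploy_sha: str) -> float:
    # Check 10: the dataset is versioned in git, one JSON case per line.
    cases = [
        json.loads(line)
        for line in Path("evals/golden.jsonl").read_text().splitlines()
        if line.strip()
    ]

    scores = []
    for case in cases:
        answer = run_agent(case["input"])                        # live agent, sandboxed
        scores.append(judge(case["input"], answer, case["expected"]))

    mean_score = statistics.mean(scores)
    # Check 4: one point on the score time-series per run, tagged with the deploy
    # that produced it, so regressions line up with annotations on the drift charts.
    record_metric("golden_dataset.score", mean_score, tags={"deploy": deploy_sha})
    return mean_score
```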

The audit anti-pattern here is the "evals dashboard" living on a separate URL from the trace viewer. It looks like observability — it isn't. When the on-call engineer is triaging at 03:14, the round-trip between a quality dashboard and a reliability trace viewer is exactly when the wrong conclusion gets drawn. Pull the eval signals onto the same surface as the traces, or accept that you have two systems with two on-call rotations and two sets of guesses.

05 · Drift Detection: Output drift, latency drift, cost drift — ten checks.

Drift is the silent failure mode of agentic systems. Nothing breaks; the agent gradually starts doing more retries, hitting longer prompts, returning subtly worse answers, costing more per turn — and the dashboards still read green because every metric is within its individual threshold. Drift detection looks at the rate of change rather than the absolute value, and it is the axis most often missed in production deployments.

  1. Per-route latency time-series tracked. p50, p95, p99 over rolling windows. Step-changes alert.
  2. Per-route token consumption time-series tracked. Input tokens and output tokens separately. Prompt-bloat is the most common cause of cost drift.
  3. Per-route eval-score time-series tracked. The golden-dataset score is the canary; production sampled score is the secondary signal.
  4. Per-route cost-per-turn tracked. Derived metric — token count times unit price — and trended over time. Sudden steps usually correlate with prompt-template changes.
  5. Tool-selection distribution tracked. If the agent used tool X 60% of the time last week and 35% this week, that's a signal — for better or worse.
  6. Retry-rate tracked. Retries are the cheapest early warning. A rising retry rate predicts both cost drift and latency drift before either crosses its individual threshold.
  7. Cache-hit rate tracked. When prompt-cache hit rate falls, both cost and latency go up. Most teams forget to monitor cache health until the bill arrives.
  8. Drift alerts route to humans. A drift signal with no on-call routing is a dashboard, not a detector. Page someone (or open a ticket) when a rolling window shifts more than a configured threshold.
  9. Model-version changes annotate the time-series. When you swap from Sonnet 4.7 to Sonnet 4.8, that change is a vertical line on every drift chart — making before/after comparison instant.
  10. Prompt-template changes annotate the time-series. Same idea, applied to the prompts you control. A drift spike right after a deploy is rarely a coincidence.
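
Drift detection compares rates of change rather than absolute values, so the core primitive is a rolling-window comparison. A minimal sketch with illustrative window sizes and a 1.5× step threshold, applied here to retry rate (check 6); the same detector works for token counts, latency, or cost per turn.

```python
from collections import deque
from statistics import mean


class StepChangeDetector:
    """Alert when the recent rolling window steps above the baseline window."""

    def __init__(self, baseline_len: int = 1440, recent_len: int = 60, threshold: float = 1.5):
        self.baseline = deque(maxlen=baseline_len)   # e.g. previous day of per-minute samples
        self.recent = deque(maxlen=recent_len)       # e.g. the last hour
        self.threshold = threshold                   # alert at 1.5x the baseline mean

    def observe(self, value: float) -> bool:
        alert = False
        if len(self.baseline) == self.baseline.maxlen and self.recent:
            baseline_mean = mean(self.baseline)
            if baseline_mean > 0 and mean(self.recent) > self.threshold * baseline_mean:
                alert = True
        # Samples age out of the recent window into the baseline window.
        if len(self.recent) == self.recent.maxlen:
            self.baseline.append(self.recent[0])
        self.recent.append(value)
        return alert


# Check 8: a True return routes to a human, not just a dashboard.
retry_rate_detector = StepChangeDetector()
if retry_rate_detector.observe(0.09):        # feed one per-minute retry-rate sample
    print("retry-rate step change detected: page on-call or open a ticket")
```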
The drift triangle
Output drift, latency drift, and cost drift co-vary more often than any single one fires alone. A 15% step-up in retry rate generally drags both cost and latency with it and shows up on eval scores within a week. Watch the three axes together — dashboards built around one of them in isolation will mislead.

A worked detection: in February 2026 we ran a drift audit on a client retrieval agent. The single-metric dashboards were all green. The drift view showed a rising retry rate (from 4% to 9% over six weeks), a flat output quality score, a 22% rise in per-turn cost, and a model-upgrade annotation right at the inflection. Root cause: the new model version had a stricter tool-schema and was rejecting half the tool calls until the agent retried with corrected arguments. No single metric had crossed a threshold; the drift triangle showed it inside ten minutes.

06 · Cost Tracking: Per-trace, per-user, per-tenant — ten checks.

Cost tracking is the discipline axis. Most teams know what they spent last month; few can answer "which ten users drove 38% of last week's LLM bill, and why?" The difference between the two is whether cost lives in the trace as a first-class attribute or in finance's spreadsheet as a month-end summary. Per-user attribution surfaces hot-spot users earlier; per-tenant attribution makes the chargeback model defensible.

  1. Token counts on every model span. Input and output separately. Cached vs uncached when the provider distinguishes.
  2. Unit-cost mapping maintained. Provider pricing tables versioned in code, dated, and updated when providers change pricing. Not a one-time spreadsheet.
  3. Cost computed per span and rolled up per trace. A turn's total cost is the sum of its leaf-span costs, visible on the root span as an attribute.
  4. Per-user cost attribution. User ID is a span attribute. Top-N user reports run on demand. Outliers surface in alerts.
  5. Per-tenant cost attribution. For B2B agents and multi-tenant SaaS — tenant ID on every span, with chargeback or unit-economics reports built on top.
  6. Per-route cost rollups. Which feature is expensive per call, which is cheap, and how does that match your monetization?
  7. Budget alerts wired up. Spending crosses 80% of monthly budget → ticket. 95% → page. Hard ceiling → circuit-breaker on non-essential routes.
  8. Cache health monitored. Hit rate, cache size, and TTL effectiveness — because cache savings are the difference between margin and loss on agentic workloads at scale.
  9. Eval cost separated from production cost. LLM-judge calls are their own line item, not buried in the production agent's spend.
  10. Cost trends correlated with eval scores. Rising cost with flat or falling quality is the most diagnostic signal in agent ops. Make it a standing report.
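
A sketch of checks 2 through 4: a pricing table versioned in code, cost computed per model span, and rollups per trace and per user. The model name and prices are placeholders; the structure, not the numbers, is the point.

```python
from collections import defaultdict
from dataclasses import dataclass

# Check 2: unit costs versioned in code, USD per million tokens (placeholder numbers).
PRICING = {
    ("example-model", "input"): 3.00,
    ("example-model", "output"): 15.00,
}


@dataclass
class ModelSpan:
    trace_id: str
    user_id: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        # Check 3: cost is computed at the leaf span.
        return (
            self.input_tokens / 1e6 * PRICING[(self.model, "input")]
            + self.output_tokens / 1e6 * PRICING[(self.model, "output")]
        )


def rollups(spans: list[ModelSpan]) -> tuple[dict[str, float], dict[str, float]]:
    per_trace: dict[str, float] = defaultdict(float)
    per_user: dict[str, float] = defaultdict(float)
    for span in spans:
        per_trace[span.trace_id] += span.cost_usd   # check 3: trace total = sum of leaves
        per_user[span.user_id] += span.cost_usd     # check 4: per-user attribution
    return per_trace, per_user


def top_spenders(per_user: dict[str, float], n: int = 10) -> list[tuple[str, float]]:
    # Check 4: the hot-spot report is a sort away once attribution exists.
    return sorted(per_user.items(), key=lambda kv: kv[1], reverse=True)[:n]
```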

Cost-attribution granularity vs hot-spot detection speed (detection-speed multipliers are illustrative — actual gain depends on traffic distribution)

  • Per-month spend dashboard: baseline · arrives with the invoice · hides the heavy tails
  • Per-route attribution: 2-3× faster detection · feature-level cost · feeds unit economics
  • Per-user attribution: 5-10× faster detection · hot-spot users surface in days, not at month-end
  • Per-tenant + per-user + per-trace: 10-20× faster detection · audit-ready · chargeback-defensible · drift-correlated

The strongest argument for per-user attribution isn't unit-economics — it's incident response. The single highest-cost incident pattern in agentic systems is a runaway user (or a runaway integration acting as a user) that loops on a malformed prompt, consuming tokens at hundreds of times the normal rate. Per-month dashboards catch this when the invoice arrives. Per-user dashboards catch it inside a day. Per-trace attribution feeding a per-user rollup with an alert on the outlier catches it within minutes — which is the difference between a small refund and a board-level conversation. For engineering teams operationalizing this pattern from scratch, our walkthrough on building Claude Code custom subagents shows where to anchor the trace and cost context in an agent definition; the MCP server tutorial covers the same idea applied to the tool layer.

07 · Incident Response: Replay, root-cause, runbooks — ten checks.

Incident response is the synthesis axis: how well the rest of the stack composes when something goes wrong. The defining primitive is replay — the ability to take a specific trace, reconstruct its inputs, and re-run the agent in a sandboxed environment to validate a fix before it ships. Without replay, every incident post-mortem ends with "we think this fixes it" and a deploy that may or may not have addressed the cause.

  1. Replay-from-trace capability exists. Given a trace URL, you can re-run the agent against the captured inputs in a sandbox — same model, same prompts, same retrieval context.
  2. Runbooks reference trace patterns. Not "if errors increase, restart the service" — but "if traces show tool-call rejection rate above 10%, check the schema diff and refer to runbook 4.2."
  3. On-call rotation knows the trace viewer. Not just senior engineers — anyone on rotation can navigate from an alert to a relevant trace inside a minute.
  4. Sampled traces are linked from alerts. Alert fires → page includes 3-5 example trace URLs. No grep stage.
  5. Post-mortems include trace evidence. Not prose-only narratives — actual trace screenshots or links to the relevant spans, with timestamps.
  6. Hotfixes can be validated against historical traces. Replay a sample of yesterday's failed traces against the proposed fix; require pass-rate before ship.
  7. Customer reports map to traces in under five minutes. The support engineer enters the timestamp and tenant ID, gets the trace URL. No engineering escalation for the lookup itself.
  8. Eval-score regressions are an incident class. A drop in golden-dataset score is treated like a production incident — paged, triaged, written up — not a backlog item.
  9. Drift triggers gradual-rollback playbooks. A drift signal that survives investigation has a documented rollback path: previous prompt template, previous model version, previous tool schema.
  10. Chaos / red-team exercises run regularly. Quarterly fire-drills validate the entire incident-response chain end-to-end. Without exercise, the chain rots silently.
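
In outline, replay-from-trace (check 1) and hotfix validation against historical traces (check 6) look roughly like the sketch below. fetch_trace, build_sandbox_agent, and the field names are assumptions about your observability backend and agent harness, not any vendor's actual API.

```python
def replay_trace(trace_id: str, fetch_trace, build_sandbox_agent) -> dict:
    # fetch_trace returns the captured spans with bodies intact (span-depth checks 1-2).
    original = fetch_trace(trace_id)

    agent = build_sandbox_agent(
        model=original.model_version,                # same model as the original run
        prompt_template=original.prompt_template,    # same prompts
        retrieval_snapshot=original.retrieval_docs,  # same retrieval context, frozen
    )

    replay = agent.run(original.user_input)

    # Diff the replay against the original: did the fix change the tool choice,
    # the answer, or the eval score? This diff is the evidence a post-mortem attaches.
    return {
        "original_tools": [s.tool_name for s in original.tool_spans],
        "replayed_tools": [s.tool_name for s in replay.tool_spans],
        "original_answer": original.final_answer,
        "replayed_answer": replay.final_answer,
    }
```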
The replay test
The single hardest question on the audit: can you reproduce yesterday's 03:14 incident on a developer laptop in under thirty minutes? If yes, your replay capability is real. If no, every post-mortem you write is partly fiction and every fix you ship is partly guesswork. Replay is the most under-invested capability in agentic operations.

Two failure modes to watch. First, the "dashboard-only incident response" — alerts page, dashboards open, nobody ever opens a trace because the team isn't fluent in the viewer. Fix by making trace navigation part of on-call onboarding, not a tribal skill. Second, the "no-replay post-mortem" — narratives without trace evidence, fixes without validation runs. Fix by requiring at least one trace link and one replay result on every post-mortem document before it's accepted.

08 · Vendor Comparison: LangSmith, LangFuse, Helicone, Phoenix — same axes.

The same sixty-point checklist runs against any of the four mainstream vendors. The mapping below is how each platform covers the six audit axes as of mid-2026 — what's first-class, what's adequate, what's a gap to fill with custom instrumentation. Treat this as a starting point; verify against current docs before committing, because the vendor landscape moves quarterly.

LangSmith
LangChain's integrated observability

Strong on trace coverage and span depth when paired with LangChain / LangGraph; weaker for non-LangChain stacks. Inline evals first-class. Cost tracking via token counts. Drift detection improving. Best fit when LangChain is already the orchestration framework.

LangChain-native shops
LangFuse
Open-source · self-hostable

Vendor-neutral SDK, self-host or cloud. Strong on trace coverage, span depth, and cost tracking. Eval framework built in; drift detection via the time-series UI. Best fit for sovereignty-bound deployments or teams who want one observability surface across multiple LLM frameworks.

Multi-framework teams
Helicone
Proxy-based capture · low-touch install

Sits between your app and the LLM provider as a proxy — instant trace coverage with no SDK changes. Strong on cost tracking and rate limiting; lighter on agentic span-tree depth and inline evals (improving). Best fit for getting started fast or for non-agentic LLM apps where the proxy model is sufficient.

Fast on-ramp
Phoenix (Arize)
OpenTelemetry-native · ML-ops heritage

Emits OTel-shaped spans by default — strongest portability story. Eval framework solid; drift detection inherits the Arize ML-monitoring DNA (more mature than agent-native competitors). Best fit when OTel semantic conventions are a hard requirement or when an Arize footprint already exists.

OTel-first stacks

The single most consequential cross-vendor decision is whether you want OpenTelemetry-shaped spans (Phoenix, LangFuse with the OTel exporter) or vendor-specific spans (LangSmith). OTel pays off when the vendor landscape shifts — and it always shifts. Vendor-specific spans are usually faster to set up and richer in the short term, at the cost of portability. The audit question to ask: if you needed to migrate the entire observability backend in a quarter, how much instrumentation would have to be rewritten? Under 10% means OTel discipline is paying off. Over 50% means the vendor lock-in is a future liability.
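
The portability argument in concrete terms: with OTel-shaped instrumentation, changing backends is an exporter-configuration change rather than an instrumentation rewrite. A minimal sketch using the OpenTelemetry Python SDK with an OTLP/HTTP exporter; the endpoint default is illustrative, and each backend documents its own ingest URL and auth headers.

```python
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Spans are emitted once, in OTel shape; which backend receives them is a
# configuration change (endpoint plus auth headers), not a code rewrite.
exporter = OTLPSpanExporter(
    endpoint=os.environ.get(
        "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
        "http://localhost:4318/v1/traces",   # illustrative default
    ),
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent")
```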

"The right observability vendor is the one that lets you answer post-mortem questions in seconds. Every other axis is in service of that test."— Agentic engineering · 2026 observability engagements

For teams running the audit for the first time, start by instrumenting the six axes against whatever vendor is already installed — even if the coverage is partial. The audit is more valuable than the vendor choice, and the act of running it usually surfaces the gaps that drive the next vendor decision. When the gaps are clear, our AI transformation engagements ship the implementation against any of the four platforms above — including the OpenTelemetry instrumentation plan that keeps future-vendor migration cheap.

Conclusion

Observability is the difference between an agent in production and an agent in hope.

Sixty checks across six axes. The interesting thing isn't any single check — most of them are obvious once stated. The interesting thing is that almost no production agent passes all sixty, and the gaps cluster predictably: drift detection is usually the weakest axis, replay-from-trace the most under-built capability, per-user cost attribution the largest immediate ROI. Run the audit honestly and the priority list writes itself.

The trajectory we expect through 2026 is twofold. First, OpenTelemetry semantic conventions for GenAI continue to stabilise, and vendor-neutral instrumentation becomes the default rather than the conscientious-objector position. Second, eval signals migrate from separate dashboards onto the same trace surfaces as reliability data — because the on-call engineer at 03:14 will not tolerate two URLs. Teams that invest in both shifts now will run agents at scale without the organisational pain that catches up to teams who don't.

One closing thought. Observability work always feels like a tax until the first time it saves an incident — at which point it permanently changes how the team operates. The fastest way to make the case internally isn't the audit document; it is the first replay session of a real production trace, performed live in a team meeting. When the rest of the room sees the agent's actual reasoning step by step, the argument for investing in the rest of the checklist becomes self-evident.

Audit your agent observability

Agents in production without observability are an outage waiting to happen.

Our agentic engineering team audits agent observability — traces, eval signals, drift, cost tracking, incident-response readiness — and ships the implementation across LangSmith, LangFuse, Helicone, or Phoenix.

Free consultation · Expert guidance · Tailored solutions
What we deliver

Observability audit engagements

  • 60-point observability audit with vendor-neutral mapping
  • OpenTelemetry semantic-convention adoption plan
  • Cost tracking per-trace / per-user / per-tenant
  • Drift-detection cron and alert routing
  • Incident-response runbooks with trace-replay
FAQ · Observability audit

The questions teams ask before auditing agent observability.

OpenTelemetry semantic conventions or vendor-specific spans?
Pick OpenTelemetry semantic conventions if portability matters to you on a 12-to-24-month horizon, and pick vendor-specific spans if time-to-first-trace is the dominant constraint and you accept future migration cost. The honest reality in mid-2026 is that OTel semantic conventions for GenAI are stabilising but not yet uniformly adopted — LangFuse and Phoenix lean in, LangSmith and Helicone partially. The audit question to ask: if you had to migrate the observability backend in a quarter, what percentage of your instrumentation would have to be rewritten? Under 10% means OTel discipline is paying off. Over 50% means vendor lock-in is a future liability worth pricing in now.