An agent observability rollout is the gap between "we shipped an agent" and "we operate an agent." This ninety-day plan is the smallest phased build we've found that takes a team from no traces, no evals, and no replay to a stack that lets the on-call engineer walk a real production incident from page to root cause inside an afternoon — without burning a year of platform-team capacity to get there.
What's at stake is operating leverage. A team without observability ships agents on hope and debugs by folklore. A team with observability ships agents the way it ships any mature service — instrumented, alerted, and replayable — and spends its engineering capacity on the product rather than on speculative bug-hunts. The work to get from the first state to the second is unglamorous but bounded; ninety days is the right horizon to fit it inside a normal quarter without freezing feature work.
This playbook is vendor-neutral. The phases hold whether your team picks LangSmith, LangFuse, Helicone, Phoenix, or some combination — only the implementation details shift. Each phase ends with concrete deliverables, owners, and a four-item list of the failure modes we see most often. The companion piece — our sixty-point observability audit — is the assessment that tells you what to fix; this is the ninety days that fixes it.
- 01 · Observability before production, not after. Ship traces with the first production cutover, not the second incident. The cost of retrofitting visibility into a live agent is two to five times the cost of building it in from day one — every retrospective we've run confirms it.
- 02 · OpenTelemetry semantic conventions matter early. Pick OTel-shaped span names and attributes in week one. Vendor-specific shorthand feels faster initially and becomes the migration tax later. The GenAI semantic conventions are stabilising fast — skating to that puck is cheap insurance.
- 03 · Eval signals belong in the same trace as reliability data. Two URLs at 03:14 is one URL too many. When quality dashboards live separately from reliability traces, root-cause analysis becomes guesswork in the worst possible moment. Inline eval scores on the same span as latency and tokens.
- 04 · Cost attribution per-user beats per-month, every time. Runaway users, malformed integrations, abusive callers — they surface inside a day with per-user attribution and at the invoice with per-month rollups. Build the user-ID and tenant-ID propagation into spans before the dashboards, not after.
- 05 · Replay turns guesswork into walkthroughs. The highest-leverage capability you build in days 61-90 is deterministic trace replay. It is the difference between post-mortems that read "we think this fixes it" and post-mortems that demonstrate "here is the fix replayed against yesterday's failed traces."
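To make point 04 concrete, here is a minimal sketch of per-user cost attribution, assuming spans already carry a user ID and token counts. The price table, field names, and the 50x-median outlier threshold are illustrative, not prescriptive:

```python
from collections import defaultdict
from statistics import median

# Hypothetical unit-price table: USD per 1M tokens, versioned in code and
# dated per provider update. Numbers below are illustrative only.
PRICES = {"gpt-x": {"input": 2.50, "output": 10.00}}

def span_cost_usd(span: dict) -> float:
    """Cost of one model span, computed from its token attributes."""
    p = PRICES[span["model"]]
    return (span["input_tokens"] * p["input"]
            + span["output_tokens"] * p["output"]) / 1_000_000

def per_user_rollup(spans: list[dict]) -> dict[str, float]:
    """Attribute model-span cost to the user ID carried on each span."""
    totals: dict[str, float] = defaultdict(float)
    for s in spans:
        totals[s["user_id"]] += span_cost_usd(s)
    return dict(totals)

def cost_outliers(totals: dict[str, float], factor: float = 50.0) -> list[str]:
    """Users whose spend exceeds factor x the median -- an alert, not an invoice."""
    med = median(totals.values())
    return [u for u, c in totals.items() if med > 0 and c > factor * med]
```

A per-month rollup would sum the same spans over the invoice period — which is exactly why the per-user view surfaces a runaway caller weeks before the bill does.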
01 — Why 90 Days
Observability before production — phased to keep cost manageable.
Ninety days is not a magic number. It is the smallest horizon that fits the three real phases without compressing any of them into theatre. Phase one — vendor and trace substrate — needs a month because vendor evaluation, security review, and the first wave of instrumentation always take three weeks even when the team thinks they will take one. Phase two — evals, drift, cost attribution — needs its own month because each of those streams is a small product in its own right, with its own data model and its own stakeholder. Phase three — replay and incident response — needs the last month because runbooks only earn their keep after the first real incident drills against them.
The alternative — "we'll instrument the agent after launch" — is the most expensive shortcut in agentic engineering. Every team that takes it pays the same tax: retrofitting trace propagation through an already-shipped orchestration layer, backfilling cost attribution onto traces that never carried tenant IDs, building runbooks from scratch during the incident that motivated them. The retrofit multiplier is typically two to five — meaning the same observability stack costs twice to five times as much to build after the agent ships as it does to build alongside it.
The phasing also matches how budgets get approved. A ninety-day program with a clear phase-one deliverable (traces on the top three workloads) is fundable. A vague "observability initiative" with no phase gates is not. The plan below is structured to produce a presentable artifact at day 30, day 60, and day 90 — so the program can be re-justified at each gate rather than swallowed whole at kickoff.
Pre-rollout state · no instrumentation, folklore mode
Stdout to a log aggregator, request IDs that don't propagate across tool calls, no eval surface, no drift monitoring, cost arrives with the invoice. Every incident post-mortem ends with "we think it was the retrieval step." The starting line.

Phase-one gate · top-3 workloads traced
Vendor chosen, security review passed, OTel-shaped spans on the highest-traffic three agents, trace IDs propagating into product logs. Coverage is partial — and that's fine. The substrate is real and the team is fluent in the trace viewer.

Phase-two gate · evals + drift + cost on the trace
Inline eval scores land on the same spans as latency and tokens. Drift cron computes rolling windows on retry rate, cost-per-turn, and golden-dataset score. User and tenant IDs propagate to every span; per-user cost rollups exist. The right to alert is earned.

Phase-three gate · replay, runbooks, alert routing
Trace-replay sandbox restores the exact inputs from any captured trace. Incident-response runbooks cite trace patterns by name. Alerts route by severity with example trace URLs attached. The on-call engineer walks a real incident from page to root cause in an afternoon.

One more framing point. Observability is not a single team's project. It crosses application engineering (the spans), platform engineering (the backend), data science (the evals), finance (the cost model), and security (the PII redaction and access controls). The ninety-day plan only works if each phase has a named owner per stream, with the platform-engineering lead acting as the program manager. Trying to ship this through one team in silo is the most common reason rollouts stall at day 45.
02 — Days 1-30
Vendor pick, OpenTelemetry conventions, top-3 trace coverage.
The first thirty days buy the substrate. The deliverable at day 30 is a working trace pipeline on the three highest-leverage agent workloads — typically the customer-facing agent, the highest-cost internal agent, and one canary workload chosen for its diagnostic value. Everything later in the plan composes on top of this substrate; if phase one slips, phases two and three slip with it.
The trap to avoid is breadth. Teams that try to instrument every agent in the first thirty days finish nothing and reach day 30 with patchy traces on twelve surfaces instead of clean traces on three. The audit principle — pick the workloads where failure has the highest blast radius and instrument those first — applies in full. Coverage breadth is a phase-four problem; this phase is about depth on a small surface.
Vendor selection · scorecard, security review, contract — Owner: platform lead
Score LangSmith, LangFuse, Helicone, and Phoenix against the six audit axes. Run security review in parallel — data residency, PII handling, retention, sub-processor list. Sign by end of week one; every day past day seven compresses later phases.

OpenTelemetry conventions · span names, attribute schema, trace IDs — Owner: app eng lead
Adopt the GenAI semantic conventions for span names (gen_ai.completion, gen_ai.tool_call, gen_ai.retrieval) and attributes (model name, token counts, latencies). Codify them in a small internal spec; every later instrumentation references it.

Top-3 workload instrumentation · root spans, tool spans, model spans — Owner: feature teams
Wrap the three target workloads. Root span per user turn. Child spans for every model invocation, every tool call, every retrieval step. Trace IDs flow into product logs so support engineers can cross-reference customer reports in seconds.

Coverage audit + body storage · ≥99% root coverage, prompt + response stored — Owner: platform lead
Verify root-span coverage at 99%+ on the three workloads. Confirm prompt and response bodies persist with PII redaction in place. Run a fire-drill — pick a real trace from the prior day and verify a teammate can walk it inside five minutes.

Day 30 readout · live demo, drift-detection plan signed — Owner: program lead
Demo the live trace viewer on a real incident from the prior week. Present the drift-detection design for phase two for sign-off. Without the demo, the next phase's budget is at risk; without the design signed, week five drifts.

Two anti-patterns worth naming. First, the "custom wrapper" — a homegrown tracing library written to abstract the vendor SDK before any traces have been captured. This burns week one on plumbing and produces nothing demonstrable by day 30. Use the vendor SDK directly; abstract later if a vendor switch becomes a credible scenario. Second, the "everything in one PR" — adopting OTel conventions, wrapping three workloads, and setting up the backend in a single change. Split it. Conventions first, then one workload, then the next two. Reviewable changes move faster.
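A sketch of what "codify in a small internal spec" can mean in practice — a convention lint that every instrumentation change runs in CI, assuming the span names and attributes from the section 06 template. The required-attribute sets below are a trimmed illustration, not the full spec:

```python
# Required attributes per span name, mirroring the internal convention spec.
# Trimmed for illustration -- the real spec lists every attribute.
REQUIRED = {
    "gen_ai.user_turn":  {"gen_ai.system", "gen_ai.request.model", "gen_ai.user.id"},
    "gen_ai.completion": {"gen_ai.request.model", "gen_ai.usage.input_tokens",
                          "gen_ai.usage.output_tokens"},
    "gen_ai.tool_call":  {"gen_ai.tool.name", "gen_ai.tool.arguments"},
    "gen_ai.retrieval":  {"gen_ai.retrieval.query", "gen_ai.retrieval.top_k"},
}

def lint_span(name: str, attributes: dict) -> list[str]:
    """Return violations: an unknown span name, or missing required attributes."""
    if name not in REQUIRED:
        return [f"unknown span name: {name}"]
    return [f"missing attribute: {a}" for a in sorted(REQUIRED[name] - attributes.keys())]
```

Running this against captured spans in CI is what keeps twelve teams emitting one schema instead of twelve dialects.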
03 — Days 31-60
Eval-signal integration, drift detection, cost attribution.
Phase two earns the right to alert. The substrate from phase one is necessary but insufficient — a trace viewer with no quality signal, no drift detector, and no cost rollup is a forensic tool, not an operating one. The deliverable at day 60 is a stack where the on-call engineer can ask three questions of any production trace and get answers from the same surface: what was the response quality, how does this trace compare to recent baselines, and what did it cost?
The unifying principle across all three streams is "land on the trace." Eval scores are span attributes, not rows in a separate eval database. Drift metrics derive from trace rollups, not parallel telemetry. Cost is computed from token attributes on model spans, then rolled up to the root span and attributed to user and tenant IDs that already flow as span tags. Anything that lives next to the trace surface rather than on it becomes the second URL nobody opens during an incident.
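A hedged sketch of "land on the trace" for the heuristic-eval stream, assuming the root span exposes a mutable attribute dict. The three checks and their thresholds are illustrative placeholders for whatever your workloads actually contract:

```python
import re

def heuristic_evals(response: str) -> dict[str, float]:
    """Sub-50ms checks run on every turn; each score is a span attribute in [0, 1]."""
    return {
        # Illustrative checks -- real ones depend on the workload's output contract.
        "gen_ai.eval.length_ok": 1.0 if 10 <= len(response) <= 4000 else 0.0,
        "gen_ai.eval.format_ok": 1.0 if response.rstrip().endswith((".", "!", "?")) else 0.0,
        "gen_ai.eval.refusal": 1.0 if re.search(r"\bI (can't|cannot|won't)\b", response) else 0.0,
    }

def attach_evals(root_span_attrs: dict, response: str) -> None:
    """Write eval scores onto the same root span that already carries latency,
    tokens, and cost -- one surface, one URL at 03:14."""
    root_span_attrs.update(heuristic_evals(response))
```

The design point is the second function: scores mutate the root span's attributes rather than landing in a side table, so the trace viewer's existing filters apply to quality for free.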
Inline heuristic evals · format, length, grounding, refusal — coverage: 100%
Sub-50ms heuristic evals run on every turn — format compliance, length bounds, grounding flags, refusal detection. Scores land as span attributes on the root span. First line of defence; cheap, fast, always-on.

Sampled LLM-judge evals · faithfulness, relevance, harm, tool-correctness — coverage: sampled
5-20% sampled LLM-judge runs against captured trace bodies. Multi-dimensional scoring — not a single conflated "quality" number. Eval cost is itself tracked as a separate line item, not buried in the production agent's spend.

Drift detection cron · retry rate, cost/turn, golden score — Owner: platform + DS
Hourly rolling windows on retry rate, per-route latency, token consumption, cost-per-turn, golden-dataset eval score, and tool-selection distribution. Step-changes route to alerts; model and prompt deploys annotate the time-series.

Cost attribution per-user · user-ID, tenant-ID, per-route rollups — Owner: app eng + finance
User and tenant IDs propagate to every span. Per-route cost rollups; per-user top-N reports; outlier alerts. Versioned unit-price tables in code, dated per provider update. Cache hit-rate monitored alongside.

Day 60 readout · evals, drift, cost on one surface — Owner: program lead
Walk a real production trace, on the day-60 surface, with eval scores inline, drift context visible, and cost attributed at the root. Present the phase-three replay design. The demo is the proof — staged traces are not allowed.

The hardest implementation detail in phase two is identity propagation. User ID and tenant ID need to flow from the request handler through every model invocation, every tool call, every sub-agent delegation, and every background job retried later. Most teams catch identity at the entry point and lose it at the first tool boundary; the fix is to propagate through trace context (baggage on the OTel side, or the SDK's equivalent) rather than through ad-hoc argument passing. Getting this right in week eight saves weeks of rework in phase three.
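A minimal sketch of that propagation fix, using Python's `contextvars` as a stand-in for OTel baggage — a real implementation would use the SDK's baggage API. The function names and span-dict shape are hypothetical:

```python
from contextvars import ContextVar

# Stand-ins for OTel baggage: set once at the entry point, readable at any depth.
_user_id: ContextVar[str] = ContextVar("user_id", default="unknown")
_tenant_id: ContextVar[str] = ContextVar("tenant_id", default="unknown")

def handle_request(user_id: str, tenant_id: str, turn) -> dict:
    """Entry point: bind identity to the request context, then run the turn."""
    _user_id.set(user_id)
    _tenant_id.set(tenant_id)
    return turn()

def start_span(name: str) -> dict:
    """Every span -- model call, tool call, sub-agent -- inherits identity
    from the context rather than from function arguments."""
    return {"span.name": name,
            "gen_ai.user.id": _user_id.get(),
            "gen_ai.tenant.id": _tenant_id.get()}

def tool_call() -> dict:
    # Several frames below the entry point; note that no IDs are passed in.
    return start_span("gen_ai.tool_call")
```

Because identity lives in the request context rather than in signatures, a tool added six months from now inherits the IDs for free — the property that ad-hoc argument passing loses at the first forgotten parameter.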
Phase-two coverage targets — evals, drift, and cost coverage targets for the day-60 gate.

"Eval signals next to traces is two URLs. Eval signals on traces is one. At 03:14 the difference becomes obvious." — On-call lesson, agent observability engagements
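The drift cron behind the phase-two gate reduces to a small rolling-window comparison. A sketch assuming hourly metric samples, with thresholds borrowed from the alert matrix in section 06 (the sign is flipped for metrics where a drop is the bad direction):

```python
from statistics import mean

def step_change(history: list[float], current: float, window: int = 24) -> float:
    """Relative shift of the current hourly value against the trailing-window mean."""
    baseline = mean(history[-window:])
    return (current - baseline) / baseline if baseline else 0.0

def drift_alerts(metrics: dict[str, tuple[list[float], float]]) -> list[str]:
    """Route step-changes above per-metric thresholds. Thresholds are illustrative;
    negative means 'alert on a drop this large'."""
    thresholds = {"retry_rate": 0.50, "cost_per_turn": 0.30, "golden_score": -0.10}
    fired = []
    for name, (history, current) in metrics.items():
        shift = step_change(history, current)
        t = thresholds[name]
        if (t > 0 and shift > t) or (t < 0 and shift < t):
            fired.append(name)
    return fired
```

A real cron would also annotate the time-series with model and prompt deploys, so a step-change can be read against the change that caused it.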
04 — Days 61-90
Replay infra, incident runbooks, alert routing.
Phase three is where observability stops feeling like a tax and starts feeling like leverage. The substrate is built; the signals land on the right surface; the identity model is in place. The remaining work is composing all of it into an incident-response capability that turns alerts into walkthroughs and post-mortems into demonstrations rather than narratives.
The defining capability is trace replay. Given a trace URL — any trace, including one from last week — the on-call engineer must be able to reconstruct the exact inputs the agent saw (rendered prompt, tool outputs, retrieval payload), re-run the agent in a sandbox, and validate a candidate fix against the historical trace before the fix ships. Without replay, post-mortems read "we think this fixes it" and ship at low confidence. With replay, they read "here is the fix replayed against yesterday's failed traces; pass-rate 18 of 20."
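The replay primitive has a small core. A sketch assuming traces preserve verbatim prompt bodies and tool outputs — `agent` and `accept` are stand-ins for your orchestration entry point and the incident's acceptance check:

```python
def replay(trace: dict, agent, accept) -> bool:
    """Re-run the agent on the exact inputs a captured trace preserved."""
    def stub_tool(name: str, args: dict):
        # Serve the historical tool payload -- never a live call in the sandbox.
        return trace["tool_outputs"][name]
    candidate = agent(trace["rendered_prompt"], tools=stub_tool)
    return accept(candidate)

def validate_fix(failed_traces: list[dict], fixed_agent, accept) -> float:
    """Pass-rate of a candidate fix against yesterday's failed traces --
    the '18 of 20' figure a replay-backed post-mortem can actually cite."""
    passes = sum(replay(t, fixed_agent, accept) for t in failed_traces)
    return passes / len(failed_traces)
```

Note what the sketch depends on: `trace["rendered_prompt"]` and `trace["tool_outputs"]` must hold verbatim bodies, which is why body storage is mandated back in phase one.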
Replay sandbox · trace URL → sandboxed agent run — Owner: platform lead
Replay primitive: given a trace URL, reconstruct the rendered prompt, tool outputs, and retrieval payload from preserved bodies; re-run the agent in a sandbox; compare outputs. The foundation of every later capability.

Incident-response runbooks · trace-pattern triggers, documented actions — Owner: SRE + DS
Runbooks reference trace patterns by name — not "if errors increase, restart," but "if traces show tool-call rejection rate above 10%, check the schema diff and refer to runbook 4.2." Living documents tied to the trace viewer.

Alert routing, severity-graded · page, ticket, digest, with trace URLs — Owner: SRE
Severity 1 pages on-call with 3-5 example trace URLs attached. Severity 2 opens a ticket with the same. Severity 3 lands in a daily digest. No alert ever fires without trace evidence; alerts without evidence get muted in week 12.

Fire-drill + rollback playbooks · chaos exercise, rollback paths, post-mortem template — Owner: program + on-call
End-to-end chaos exercise validates the entire chain — alert fires, on-call pages, trace pulled, replay validates root cause, hotfix replayed against historical traces, deploy ships. Documented rollback paths for prompt, model, and tool schema.

Day 90 readout · live incident walkthrough, runbook execution — Owner: program lead
Run a real (or rehearsed) incident end-to-end on the day-90 stack — from alert to root cause to validated fix — in front of leadership. Demonstrate replay against historical traces. The walkthrough is the artifact.

One subtlety in week nine. Trace replay only works if the captured trace contains enough payload to reconstruct the agent's inputs verbatim. That requirement is the reason phase one mandates prompt and response body storage from day one — without those bodies, replay is impossible regardless of how much infrastructure you build around it. Teams that tried to save storage cost in phase one by hashing prompts instead of storing them discover the trade-off in week nine, and either re-do phase one or accept that replay is partial. Storage cost is a real constraint, but the answer is selective retention rather than no retention — full bodies on 100% of traces with a short window (7 to 30 days), metadata only for older traces.
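A sketch of that selective-retention policy, assuming each stored trace is a dict with a capture timestamp; the field names and the 30-day window are illustrative:

```python
import hashlib
from datetime import datetime, timedelta, timezone

BODY_RETENTION = timedelta(days=30)  # full bodies only inside the replay window

def downgrade(trace: dict, now: datetime) -> dict:
    """Keep verbatim bodies while replay is still possible; afterwards keep
    metadata plus a digest (replay is deliberately given up for old traces)."""
    if now - trace["captured_at"] <= BODY_RETENTION:
        return trace
    slim = {k: v for k, v in trace.items() if k not in ("prompt", "completion")}
    slim["prompt_sha256"] = hashlib.sha256(trace["prompt"].encode()).hexdigest()
    return slim
```

The digest preserves deduplication and tamper-evidence for old traces without pretending they are still replayable.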
05 — Vendor Picks
LangSmith, LangFuse, Helicone, Phoenix.
The vendor scorecard below is how the four mainstream platforms map onto the ninety-day plan as of mid-2026. Each covers the six audit axes — trace coverage, span depth, eval signals, drift detection, cost tracking, incident response — with different strengths, different weaknesses, and different implications for week-one work. Treat the picks as starting points; verify against current docs before committing, because the vendor landscape moves quarterly.
LangSmith · LangChain's integrated observability — best fit: LangChain-native shops
Strong on trace coverage and span depth when paired with LangChain or LangGraph; weaker for non-LangChain stacks. Inline evals are first-class. Cost tracking via token counts works out of the box. Drift detection is improving but lighter than the ML-ops-native competitors.

LangFuse · open-source, self-hostable — best fit: multi-framework teams
Vendor-neutral SDK with self-host or cloud options. Strong on trace coverage, span depth, and cost tracking. Eval framework built in; drift via time-series UI. Best fit for sovereignty-bound deployments or teams who want one observability surface across multiple frameworks.

Helicone · proxy-based capture, low-touch install — best fit: fast on-ramp
Sits between application and LLM provider as a proxy — instant trace coverage with no SDK changes. Strong on cost tracking and rate limiting; lighter on agentic span-tree depth and inline evals, though improving fast. Phase-one substrate ships in days, not weeks.

Phoenix · OpenTelemetry-native, ML-ops heritage — best fit: OTel-first stacks
Emits OTel-shaped spans by default — strongest portability story among the four. Eval framework is solid; drift detection inherits Arize's mature ML-monitoring DNA. Best fit when OTel semantic conventions are a hard requirement or an Arize footprint already exists.

The single most consequential vendor decision is whether your spans emit in OpenTelemetry shape (Phoenix natively, LangFuse with the OTel exporter) or in vendor-specific shape (LangSmith). OTel pays off when the vendor landscape shifts — and it always shifts. Vendor-specific spans are typically faster to set up and richer in the short term, at the cost of portability. The week-one question to ask: if the program had to migrate the observability backend in a single quarter, what percentage of the instrumentation would have to be rewritten? Under 10% means OTel discipline is paying off. Over 50% means vendor lock-in is a future liability worth pricing in now.
For teams running the ninety-day plan for the first time, a practical default is to start with whatever vendor the team can ship traces against in week one — even if it isn't the long-term choice. The act of running the program surfaces the gaps that drive the next vendor decision, and the audit's value is independent of the vendor underneath. Our companion TCO comparison covers the cost side of the decision; this section covers the capability side.
06 — Templates
OpenTelemetry config, alert routing, incident runbook.
Three artifacts to copy on day one. The first is the OpenTelemetry semantic-convention spec — the internal document that defines span names, attributes, and resource tags so every team instruments the same way. The second is the alert-routing matrix — the table that maps every signal to a severity, a destination, and a trace-URL attachment requirement. The third is the incident-runbook skeleton — the template every runbook author copies, so that runbooks written by different teams compose into a single playbook by day 90.
The OpenTelemetry conventions below match the stabilising GenAI semantic conventions as of mid-2026. They are deliberately minimal — every team will extend this for its own domain, but the named attributes are the ones that travel cleanly across vendors and across future SDK versions.
# observability-spec.yaml — internal OTel convention reference

# Resource attributes — applied to every span
service.name: agent-<name>            # e.g. agent-support, agent-research
service.namespace: <product>          # product or business unit
deployment.environment: production    # or staging, dev

# Root span — one per user turn
span.name: gen_ai.user_turn
attributes:
  gen_ai.system: <provider>           # openai, anthropic, deepseek, etc.
  gen_ai.request.model: <model-id>
  gen_ai.user.id: <user-id>           # propagates through child spans
  gen_ai.tenant.id: <tenant-id>       # propagates through child spans
  gen_ai.session.id: <session-id>
  gen_ai.eval.heuristic.score: <0-1>  # populated by inline eval
  gen_ai.cost.usd: <decimal>          # rollup of leaf-span costs

# Model invocation — child span per call
span.name: gen_ai.completion
attributes:
  gen_ai.request.model: <model-id>
  gen_ai.request.temperature: <decimal>
  gen_ai.response.model: <model-id>
  gen_ai.usage.input_tokens: <int>
  gen_ai.usage.output_tokens: <int>
  gen_ai.usage.cached_input_tokens: <int>
  gen_ai.prompt: <stored body or hash reference>
  gen_ai.completion: <stored body>
  gen_ai.latency.ttft_ms: <int>
  gen_ai.latency.total_ms: <int>

# Tool call — child span per invocation
span.name: gen_ai.tool_call
attributes:
  gen_ai.tool.name: <tool-name>
  gen_ai.tool.arguments: <structured>
  gen_ai.tool.result: <structured>
  gen_ai.tool.retry_count: <int>
  gen_ai.tool.error: <string or null>

# Retrieval — child span per step
span.name: gen_ai.retrieval
attributes:
  gen_ai.retrieval.query: <text>
  gen_ai.retrieval.top_k: <int>
  gen_ai.retrieval.doc_ids: <list>
  gen_ai.retrieval.scores: <list>

# Sampling — explicit per environment
sampling:
  root_spans: 1.0        # 100% always
  body_storage: 1.0      # 100% in phases 1-2; selective in phase 3
  llm_judge_evals: 0.1   # 5-20% sampled
  error_traces: 1.0      # never sample down errors

The alert-routing matrix is the smallest table that prevents both alert fatigue and its worst opposite — real incidents lost in a sea of low-severity noise. Every alert maps to a severity, a destination, and a requirement that example trace URLs are attached to the page or ticket before the alert fires. The matrix below is what most teams converge on after the first real on-call rotation; copy it verbatim and tune from incident learnings.
# alert-routing.yaml — severity-graded routing with trace attachment
severities:
  sev1_page:
    destination: pagerduty-oncall
    trace_urls_required: 3      # min 3 example traces on page
    response_sla_minutes: 15
    triggers:
      - golden_dataset_score_drop > 10pct_24h
      - retry_rate_step_change > 50pct_1h
      - per_user_cost_outlier > 50x_median
      - error_rate > 5pct_15min
  sev2_ticket:
    destination: jira-agentops
    trace_urls_required: 5      # paste examples in ticket body
    response_sla_hours: 24
    triggers:
      - drift_window_shift > 20pct_24h
      - cost_per_route_trend > 30pct_7d
      - eval_score_distribution_shift
      - cache_hit_rate_drop > 15pct_24h
  sev3_digest:
    destination: slack-agentops-daily
    trace_urls_required: 10     # daily digest with sampled traces
    response_sla_hours: 168     # weekly review
    triggers:
      - tool_selection_distribution_shift
      - latency_p95_trend > 15pct_7d
      - per_tenant_growth_signal
attachments:
  every_alert_must_include:
    - trace_url: <link to trace viewer>
    - dashboard_url: <link to drift dashboard>
    - runbook_url: <link to matching runbook>

The incident-runbook skeleton below is the shape every runbook should take. The key discipline is that every runbook references concrete trace patterns by name — not "errors are up," but "tool-call rejection rate above 10% on the customer agent." Runbooks written in prose are runbooks nobody reads at 03:14.
# runbook-4.2-tool-call-rejection.md
## Trigger pattern
- Trace signal: gen_ai.tool.error rate > 10% over 15-min window
- Cross-signal: retry_count distribution shifts right
- Drift correlation: cost-per-turn rising; eval score flat or falling
## First 5 minutes
1. Open trace viewer; filter to gen_ai.tool.error != null in last hour
2. Inspect 3-5 traces — collect tool name, error message, schema diff
3. Confirm pattern: same tool? same error class? same caller cohort?
4. Page secondary if pattern unclear after 5 minutes
## Common root causes (ranked by frequency)
1. Tool schema changed; model not updated to new arg shape (60%)
2. Tool downstream dependency degraded (20%)
3. Prompt template change altered tool-selection signal (15%)
4. New model version stricter on tool-call format (5%)
## Validation steps
- Replay one failed trace against current agent; confirm reproduce
- Replay against a candidate fix in sandbox; require pass-rate > 80%
- Cross-check golden-dataset score; verify regression bounded
## Rollback paths
- Tool schema: revert to previous version in tool-registry
- Prompt template: revert prompt-v<n> → prompt-v<n-1>
- Model version: route this agent to previous model-id
## Post-incident
- Attach 3 trace URLs to post-mortem
- Add detection threshold tuning notes if signal was late
- File data-quality ticket if root cause is upstream

All three templates are starting points. The conventions will extend as the agent surface grows. The alert matrix will tune as the team learns which signals matter. The runbook skeleton will spawn fifty runbooks by the end of year one. The point is to start with skeletons everyone shares rather than letting each team invent its own shape — which is the most common reason ninety-day rollouts produce inconsistent operating practices in year two.
07 — Pitfalls
Four observability rollout failure modes.
Across the ninety-day rollouts we've scoped, four failure modes recur often enough to deserve their own section. None are technical surprises; all are organisational dynamics that quietly stall the program weeks before anyone notices. Reading them now is cheaper than learning them in flight.
Treating observability as a procurement decision — Org · Weeks 1-4
Vendor selection becomes the program; instrumentation, evals, and replay get treated as "the vendor's job." The vendor ships a backend; you ship the instrumentation. Procurement is a week-one task, not the program itself.

One team owning the entire rollout — Org · Weeks 5-8
Platform engineering inherits the whole program in silo. Evals stall because data science isn't scoped in. Cost attribution stalls because finance was never consulted. Named owners per stream, with platform as program manager, is the fix.

Saving storage cost by skipping prompt body storage — Tech · Weeks 3-4
Teams economise on storage in phase one by hashing prompts. Phase three reveals that replay is impossible without verbatim bodies. The retrofit costs more than the storage saved. Store bodies on 100% of traces with short retention; downgrade older traces.

Rehearsing the day-90 walkthrough on a staged incident — Process · Week 12
The temptation at gate review is to script a clean walkthrough on a chosen trace. Resist. Pick a real incident from the prior fortnight, ideally one the team hasn't solved yet. The walkthrough is the test — staged demos do not survive production contact.

One closing observation across the four. The technical problems in agent observability are well-understood and have mature solutions; the organisational problems are not, and they are where ninety-day rollouts mostly fail. A team that treats this as a cross-functional program with named owners, phase gates, and unvarnished gate demos is the team that ships at day 90. A team that treats it as a platform-engineering initiative running in parallel to feature work is the team that reaches day 60 with patchy coverage and a vendor invoice. The difference is governance, not engineering.
For organisations starting from scratch and wanting outside help to run the program, our AI transformation engagements ship exactly this phased plan against whichever vendor fits best — including the OpenTelemetry instrumentation, drift cron, cost-attribution model, and replay sandbox. The audit companion piece, the sixty-point observability audit, is the right first read for teams wanting to score the current state before kicking off; the vendor TCO comparison is the right second read for teams entering vendor selection in week one.
Observability is infrastructure — 90 days is the right horizon to make it stick.
Three phases, ninety days, six axes. The plan is unglamorous on purpose. Phase one buys the substrate — vendor, OTel conventions, top-three workload coverage — so every later signal has somewhere to land. Phase two earns the right to alert — eval signals, drift detection, cost attribution, all on the same trace surface. Phase three turns alerts into walkthroughs — replay, runbooks, severity-graded routing. The deliverable at day 90 is the on-call engineer walking a real incident from page to root cause in an afternoon, in front of leadership.
The trajectory we expect through the rest of 2026 is two shifts. First, OpenTelemetry semantic conventions for GenAI continue to stabilise, and vendor-neutral instrumentation becomes the default rather than the conscientious-objector position — making OTel discipline in week one a free decision rather than a deliberate one. Second, eval signals migrate from separate dashboards onto the same trace surfaces as reliability data — because the on-call engineer at 03:14 will not tolerate two URLs. Teams that build to those shifts now will run agents at scale without the organisational pain that catches up to teams who don't.
One closing thought. The day-90 readout, done honestly, is the most persuasive internal case you can make for continued observability investment. The audit document is paper; a live replay session of yesterday's real production trace is undeniable. Schedule it for a leadership audience; the rest of the program funds itself from there.