An agent observability rollout is the gap between "we shipped an agent" and "we operate an agent." This ninety-day plan is the smallest phased build we've found that takes a team from no traces, no evals, and no replay to a stack that lets the on-call engineer walk a real production incident from page to root cause inside an afternoon — without burning a year of platform-team capacity to get there.
What's at stake is operating leverage. A team without observability ships agents on hope and debugs by folklore. A team with observability ships agents the way it ships any mature service — instrumented, alerted, and replayable — and spends its engineering capacity on the product rather than on speculative bug-hunts. The work to get from the first state to the second is unglamorous but bounded; ninety days is the right horizon to fit it inside a normal quarter without freezing feature work.
This playbook is vendor-neutral. The phases hold whether your team picks LangSmith, LangFuse, Helicone, Phoenix, or some combination — only the implementation details shift. Each phase ends with concrete deliverables, owners, and a four-item list of the failure modes we see most often. The companion piece — our sixty-point observability audit — is the assessment that tells you what to fix; this is the ninety days that fixes it.
- 01 · Observability before production, not after. Ship traces with the first production cutover, not the second incident. The cost of retrofitting visibility into a live agent is two to five times the cost of building it in from day one — every retrospective we've run confirms it.
- 02 · OpenTelemetry semantic conventions matter early. Pick OTel-shaped span names and attributes in week one. Vendor-specific shorthand feels faster initially and becomes the migration tax later. The GenAI semantic conventions are stabilising fast — skating to that puck is cheap insurance.
- 03 · Eval signals belong in the same trace as reliability data. Two URLs at 03:14 is one URL too many. When quality dashboards live separately from reliability traces, root-cause analysis becomes guesswork in the worst possible moment. Inline eval scores on the same span as latency and tokens.
- 04 · Cost attribution per-user beats per-month, every time. Runaway users, malformed integrations, abusive callers — they surface inside a day with per-user attribution and at the invoice with per-month rollups. Build the user-ID and tenant-ID propagation into spans before the dashboards, not after.
- 05 · Replay turns guesswork into walkthroughs. The highest-leverage capability you build in days 61-90 is deterministic trace replay. It is the difference between post-mortems that read "we think this fixes it" and post-mortems that demonstrate "here is the fix replayed against yesterday's failed traces."
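To make point 04 concrete, here is a minimal sketch of per-user cost attribution, assuming spans already carry a user ID and token counts. The price table, field names, and the 50x-median outlier threshold are illustrative, not prescriptive:

```python
from collections import defaultdict
from statistics import median

# Hypothetical unit-price table: USD per 1M tokens, versioned in code and
# dated per provider update. Numbers below are illustrative only.
PRICES = {"gpt-x": {"input": 2.50, "output": 10.00}}

def span_cost_usd(span: dict) -> float:
    """Cost of one model span, computed from its token attributes."""
    p = PRICES[span["model"]]
    return (span["input_tokens"] * p["input"]
            + span["output_tokens"] * p["output"]) / 1_000_000

def per_user_rollup(spans: list[dict]) -> dict[str, float]:
    """Attribute model-span cost to the user ID carried on each span."""
    totals: dict[str, float] = defaultdict(float)
    for s in spans:
        totals[s["user_id"]] += span_cost_usd(s)
    return dict(totals)

def cost_outliers(totals: dict[str, float], factor: float = 50.0) -> list[str]:
    """Users whose spend exceeds factor x the median -- an alert, not an invoice."""
    med = median(totals.values())
    return [u for u, c in totals.items() if med > 0 and c > factor * med]
```

A per-month rollup would sum the same spans over the invoice period — which is exactly why the per-user view surfaces a runaway caller weeks before the bill does.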
01 — Why 90 Days
Observability before production — phased to keep cost manageable.
Ninety days is not a magic number. It is the smallest horizon that fits the three real phases without compressing any of them into theatre. Phase one — vendor and trace substrate — needs a month because vendor evaluation, security review, and the first wave of instrumentation always take three weeks even when the team thinks they will take one. Phase two — evals, drift, cost attribution — needs its own month because each of those streams is a small product in its own right, with its own data model and its own stakeholder. Phase three — replay and incident response — needs the last month because runbooks only earn their keep after the first real incident drills against them.
The alternative — "we'll instrument the agent after launch" — is the most expensive shortcut in agentic engineering. Every team that takes it pays the same tax: retrofitting trace propagation through an already-shipped orchestration layer, backfilling cost attribution onto traces that never carried tenant IDs, building runbooks from scratch during the incident that motivated them. The retrofit multiplier is typically two to five — meaning the same observability stack costs twice to five times as much to build after the agent ships as it does to build alongside it.
The phasing also matches how budgets get approved. A ninety-day program with a clear phase-one deliverable (traces on the top three workloads) is fundable. A vague "observability initiative" with no phase gates is not. The plan below is structured to produce a presentable artifact at day 30, day 60, and day 90 — so the program can be re-justified at each gate rather than swallowed whole at kickoff.
Pre-rollout state · no instrumentation, folklore mode
Stdout to a log aggregator, request IDs that don't propagate across tool calls, no eval surface, no drift monitoring, cost arrives with the invoice. Every incident post-mortem ends with "we think it was the retrieval step." The starting line.

Phase-one gate · top-3 workloads traced
Vendor chosen, security review passed, OTel-shaped spans on the highest-traffic three agents, trace IDs propagating into product logs. Coverage is partial — and that's fine. The substrate is real and the team is fluent in the trace viewer.

Phase-two gate · evals + drift + cost on the trace
Inline eval scores land on the same spans as latency and tokens. Drift cron computes rolling windows on retry rate, cost-per-turn, and golden-dataset score. User and tenant IDs propagate to every span; per-user cost rollups exist. The right to alert is earned.

Phase-three gate · replay, runbooks, alert routing
Trace-replay sandbox restores the exact inputs from any captured trace. Incident-response runbooks cite trace patterns by name. Alerts route by severity with example trace URLs attached. The on-call engineer walks a real incident from page to root cause in an afternoon.

One more framing point. Observability is not a single team's project. It crosses application engineering (the spans), platform engineering (the backend), data science (the evals), finance (the cost model), and security (the PII redaction and access controls). The ninety-day plan only works if each phase has a named owner per stream, with the platform-engineering lead acting as the program manager. Trying to ship this through one team in silo is the most common reason rollouts stall at day 45.
02 — Days 1-30
Vendor pick, OpenTelemetry conventions, top-3 trace coverage.
The first thirty days buy the substrate. The deliverable at day 30 is a working trace pipeline on the three highest-leverage agent workloads — typically the customer-facing agent, the highest-cost internal agent, and one canary workload chosen for its diagnostic value. Everything later in the plan composes on top of this substrate; if phase one slips, phases two and three slip with it.
The trap to avoid is breadth. Teams that try to instrument every agent in the first thirty days finish nothing and reach day 30 with patchy traces on twelve surfaces instead of clean traces on three. The audit principle — pick the workloads where failure has the highest blast radius and instrument those first — applies in full. Coverage breadth is a phase-four problem; this phase is about depth on a small surface.
Vendor selection · scorecard, security review, contract — Owner: platform lead
Score LangSmith, LangFuse, Helicone, and Phoenix against the six audit axes. Run security review in parallel — data residency, PII handling, retention, sub-processor list. Sign by end of week one; every day past day seven compresses later phases.

OpenTelemetry conventions · span names, attribute schema, trace IDs — Owner: app eng lead
Adopt the GenAI semantic conventions for span names (gen_ai.completion, gen_ai.tool_call, gen_ai.retrieval) and attributes (model name, token counts, latencies). Codify them in a small internal spec; every later instrumentation references it.

Top-3 workload instrumentation · root spans, tool spans, model spans — Owner: feature teams
Wrap the three target workloads. Root span per user turn. Child spans for every model invocation, every tool call, every retrieval step. Trace IDs flow into product logs so support engineers can cross-reference customer reports in seconds.

Coverage audit + body storage · ≥99% root coverage, prompt + response stored — Owner: platform lead
Verify root-span coverage at 99%+ on the three workloads. Confirm prompt and response bodies persist with PII redaction in place. Run a fire-drill — pick a real trace from the prior day and verify a teammate can walk it inside five minutes.

Day 30 readout · live demo, drift-detection plan signed — Owner: program lead
Demo the live trace viewer on a real incident from the prior week. Present the drift-detection design for phase two for sign-off. Without the demo, the next phase's budget is at risk; without the design signed, week five drifts.

Two anti-patterns worth naming. First, the "custom wrapper" — a homegrown tracing library written to abstract the vendor SDK before any traces have been captured. This burns week one on plumbing and produces nothing demonstrable by day 30. Use the vendor SDK directly; abstract later if a vendor switch becomes a credible scenario. Second, the "everything in one PR" — adopting OTel conventions, wrapping three workloads, and setting up the backend in a single change. Split it. Conventions first, then one workload, then the next two. Reviewable changes move faster.
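A sketch of what "codify in a small internal spec" can mean in practice — a convention lint that every instrumentation change runs in CI, assuming the span names and attributes from the section 06 template. The required-attribute sets below are a trimmed illustration, not the full spec:

```python
# Required attributes per span name, mirroring the internal convention spec.
# Trimmed for illustration -- the real spec lists every attribute.
REQUIRED = {
    "gen_ai.user_turn":  {"gen_ai.system", "gen_ai.request.model", "gen_ai.user.id"},
    "gen_ai.completion": {"gen_ai.request.model", "gen_ai.usage.input_tokens",
                          "gen_ai.usage.output_tokens"},
    "gen_ai.tool_call":  {"gen_ai.tool.name", "gen_ai.tool.arguments"},
    "gen_ai.retrieval":  {"gen_ai.retrieval.query", "gen_ai.retrieval.top_k"},
}

def lint_span(name: str, attributes: dict) -> list[str]:
    """Return violations: an unknown span name, or missing required attributes."""
    if name not in REQUIRED:
        return [f"unknown span name: {name}"]
    return [f"missing attribute: {a}" for a in sorted(REQUIRED[name] - attributes.keys())]
```

Running this against captured spans in CI is what keeps twelve teams emitting one schema instead of twelve dialects.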
03 — Days 31-60
Eval-signal integration, drift detection, cost attribution.
Phase two earns the right to alert. The substrate from phase one is necessary but insufficient — a trace viewer with no quality signal, no drift detector, and no cost rollup is a forensic tool, not an operating one. The deliverable at day 60 is a stack where the on-call engineer can ask three questions of any production trace and get answers from the same surface: what was the response quality, how does this trace compare to recent baselines, and what did it cost?
The unifying principle across all three streams is "land on the trace." Eval scores are span attributes, not rows in a separate eval database. Drift metrics derive from trace rollups, not parallel telemetry. Cost is computed from token attributes on model spans, then rolled up to the root span and attributed to user and tenant IDs that already flow as span tags. Anything that lives next to the trace surface rather than on it becomes the second URL nobody opens during an incident.
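A hedged sketch of "land on the trace" for the heuristic-eval stream, assuming the root span exposes a mutable attribute dict. The three checks and their thresholds are illustrative placeholders for whatever your workloads actually contract:

```python
import re

def heuristic_evals(response: str) -> dict[str, float]:
    """Sub-50ms checks run on every turn; each score is a span attribute in [0, 1]."""
    return {
        # Illustrative checks -- real ones depend on the workload's output contract.
        "gen_ai.eval.length_ok": 1.0 if 10 <= len(response) <= 4000 else 0.0,
        "gen_ai.eval.format_ok": 1.0 if response.rstrip().endswith((".", "!", "?")) else 0.0,
        "gen_ai.eval.refusal": 1.0 if re.search(r"\bI (can't|cannot|won't)\b", response) else 0.0,
    }

def attach_evals(root_span_attrs: dict, response: str) -> None:
    """Write eval scores onto the same root span that already carries latency,
    tokens, and cost -- one surface, one URL at 03:14."""
    root_span_attrs.update(heuristic_evals(response))
```

The design point is the second function: scores mutate the root span's attributes rather than landing in a side table, so the trace viewer's existing filters apply to quality for free.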
Inline heuristic evals · format, length, grounding, refusal — coverage: 100%
Sub-50ms heuristic evals run on every turn — format compliance, length bounds, grounding flags, refusal detection. Scores land as span attributes on the root span. First line of defence; cheap, fast, always-on.

Sampled LLM-judge evals · faithfulness, relevance, harm, tool-correctness — coverage: sampled
5-20% sampled LLM-judge runs against captured trace bodies. Multi-dimensional scoring — not a single conflated "quality" number. Eval cost is itself tracked as a separate line item, not buried in the production agent's spend.

Drift detection cron · retry rate, cost/turn, golden score — Owner: platform + DS
Hourly rolling windows on retry rate, per-route latency, token consumption, cost-per-turn, golden-dataset eval score, and tool-selection distribution. Step-changes route to alerts; model and prompt deploys annotate the time-series.

Cost attribution per-user · user-ID, tenant-ID, per-route rollups — Owner: app eng + finance
User and tenant IDs propagate to every span. Per-route cost rollups; per-user top-N reports; outlier alerts. Versioned unit-price tables in code, dated per provider update. Cache hit-rate monitored alongside.

Day 60 readout · evals, drift, cost on one surface — Owner: program lead
Walk a real production trace, on the day-60 surface, with eval scores inline, drift context visible, and cost attributed at the root. Present the phase-three replay design. The demo is the proof — staged traces are not allowed.

The hardest implementation detail in phase two is identity propagation. User ID and tenant ID need to flow from the request handler through every model invocation, every tool call, every sub-agent delegation, and every background job retried later. Most teams catch identity at the entry point and lose it at the first tool boundary; the fix is to propagate through trace context (baggage on the OTel side, or the SDK's equivalent) rather than through ad-hoc argument passing. Getting this right in week eight saves weeks of rework in phase three.
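A minimal sketch of that propagation fix, using Python's `contextvars` as a stand-in for OTel baggage — a real implementation would use the SDK's baggage API. The function names and span-dict shape are hypothetical:

```python
from contextvars import ContextVar

# Stand-ins for OTel baggage: set once at the entry point, readable at any depth.
_user_id: ContextVar[str] = ContextVar("user_id", default="unknown")
_tenant_id: ContextVar[str] = ContextVar("tenant_id", default="unknown")

def handle_request(user_id: str, tenant_id: str, turn) -> dict:
    """Entry point: bind identity to the request context, then run the turn."""
    _user_id.set(user_id)
    _tenant_id.set(tenant_id)
    return turn()

def start_span(name: str) -> dict:
    """Every span -- model call, tool call, sub-agent -- inherits identity
    from the context rather than from function arguments."""
    return {"span.name": name,
            "gen_ai.user.id": _user_id.get(),
            "gen_ai.tenant.id": _tenant_id.get()}

def tool_call() -> dict:
    # Several frames below the entry point; note that no IDs are passed in.
    return start_span("gen_ai.tool_call")
```

Because identity lives in the request context rather than in signatures, a tool added six months from now inherits the IDs for free — the property that ad-hoc argument passing loses at the first forgotten parameter.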
Phase-two coverage targets — evals, drift, and cost coverage targets for the day-60 gate.

"Eval signals next to traces is two URLs. Eval signals on traces is one. At 03:14 the difference becomes obvious." — On-call lesson, agent observability engagements
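The drift cron behind the phase-two gate reduces to a small rolling-window comparison. A sketch assuming hourly metric samples, with thresholds borrowed from the alert matrix in section 06 (the sign is flipped for metrics where a drop is the bad direction):

```python
from statistics import mean

def step_change(history: list[float], current: float, window: int = 24) -> float:
    """Relative shift of the current hourly value against the trailing-window mean."""
    baseline = mean(history[-window:])
    return (current - baseline) / baseline if baseline else 0.0

def drift_alerts(metrics: dict[str, tuple[list[float], float]]) -> list[str]:
    """Route step-changes above per-metric thresholds. Thresholds are illustrative;
    negative means 'alert on a drop this large'."""
    thresholds = {"retry_rate": 0.50, "cost_per_turn": 0.30, "golden_score": -0.10}
    fired = []
    for name, (history, current) in metrics.items():
        shift = step_change(history, current)
        t = thresholds[name]
        if (t > 0 and shift > t) or (t < 0 and shift < t):
            fired.append(name)
    return fired
```

A real cron would also annotate the time-series with model and prompt deploys, so a step-change can be read against the change that caused it.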
04 — Days 61-90
Replay infra, incident runbooks, alert routing.
Phase three is where observability stops feeling like a tax and starts feeling like leverage. The substrate is built; the signals land on the right surface; the identity model is in place. The remaining work is composing all of it into an incident-response capability that turns alerts into walkthroughs and post-mortems into demonstrations rather than narratives.
The defining capability is trace replay. Given a trace URL — any trace, including one from last week — the on-call engineer must be able to reconstruct the exact inputs the agent saw (rendered prompt, tool outputs, retrieval payload), re-run the agent in a sandbox, and validate a candidate fix against the historical trace before the fix ships. Without replay, post-mortems read "we think this fixes it" and ship at low confidence. With replay, they read "here is the fix replayed against yesterday's failed traces; pass-rate 18 of 20."
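The replay primitive has a small core. A sketch assuming traces preserve verbatim prompt bodies and tool outputs — `agent` and `accept` are stand-ins for your orchestration entry point and the incident's acceptance check:

```python
def replay(trace: dict, agent, accept) -> bool:
    """Re-run the agent on the exact inputs a captured trace preserved."""
    def stub_tool(name: str, args: dict):
        # Serve the historical tool payload -- never a live call in the sandbox.
        return trace["tool_outputs"][name]
    candidate = agent(trace["rendered_prompt"], tools=stub_tool)
    return accept(candidate)

def validate_fix(failed_traces: list[dict], fixed_agent, accept) -> float:
    """Pass-rate of a candidate fix against yesterday's failed traces --
    the '18 of 20' figure a replay-backed post-mortem can actually cite."""
    passes = sum(replay(t, fixed_agent, accept) for t in failed_traces)
    return passes / len(failed_traces)
```

Note what the sketch depends on: `trace["rendered_prompt"]` and `trace["tool_outputs"]` must hold verbatim bodies, which is why body storage is mandated back in phase one.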
Replay sandbox · trace URL → sandboxed agent run — Owner: platform lead
Replay primitive: given a trace URL, reconstruct the rendered prompt, tool outputs, and retrieval payload from preserved bodies; re-run the agent in a sandbox; compare outputs. The foundation of every later capability.

Incident-response runbooks · trace-pattern triggers, documented actions — Owner: SRE + DS
Runbooks reference trace patterns by name — not "if errors increase, restart," but "if traces show tool-call rejection rate above 10%, check the schema diff and refer to runbook 4.2." Living documents tied to the trace viewer.

Alert routing, severity-graded · page, ticket, digest, with trace URLs — Owner: SRE
Severity 1 pages on-call with 3-5 example trace URLs attached. Severity 2 opens a ticket with the same. Severity 3 lands in a daily digest. No alert ever fires without trace evidence; alerts without evidence get muted in week 12.

Fire-drill + rollback playbooks · chaos exercise, rollback paths, post-mortem template — Owner: program + on-call
End-to-end chaos exercise validates the entire chain — alert fires, on-call pages, trace pulled, replay validates root cause, hotfix replayed against historical traces, deploy ships. Documented rollback paths for prompt, model, and tool schema.

Day 90 readout · live incident walkthrough, runbook execution — Owner: program lead
Run a real (or rehearsed) incident end-to-end on the day-90 stack — from alert to root cause to validated fix — in front of leadership. Demonstrate replay against historical traces. The walkthrough is the artifact.

One subtlety in week nine. Trace replay only works if the captured trace contains enough payload to reconstruct the agent's inputs verbatim. That requirement is the reason phase one mandates prompt and response body storage from day one — without those bodies, replay is impossible regardless of how much infrastructure you build around it. Teams that tried to save storage cost in phase one by hashing prompts instead of storing them discover the trade-off in week nine, and either re-do phase one or accept that replay is partial. Storage cost is a real constraint, but the answer is selective retention rather than no retention — full bodies on 100% of traces with a short window (7 to 30 days), metadata only for older traces.
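A sketch of that selective-retention policy, assuming each stored trace is a dict with a capture timestamp; the field names and the 30-day window are illustrative:

```python
import hashlib
from datetime import datetime, timedelta, timezone

BODY_RETENTION = timedelta(days=30)  # full bodies only inside the replay window

def downgrade(trace: dict, now: datetime) -> dict:
    """Keep verbatim bodies while replay is still possible; afterwards keep
    metadata plus a digest (replay is deliberately given up for old traces)."""
    if now - trace["captured_at"] <= BODY_RETENTION:
        return trace
    slim = {k: v for k, v in trace.items() if k not in ("prompt", "completion")}
    slim["prompt_sha256"] = hashlib.sha256(trace["prompt"].encode()).hexdigest()
    return slim
```

The digest preserves deduplication and tamper-evidence for old traces without pretending they are still replayable.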
05 — Vendor Picks
LangSmith, LangFuse, Helicone, Phoenix.
The vendor scorecard below is how the four mainstream platforms map onto the ninety-day plan as of mid-2026. Each covers the six audit axes — trace coverage, span depth, eval signals, drift detection, cost tracking, incident response — with different strengths, different weaknesses, and different implications for week-one work. Treat the picks as starting points; verify against current docs before committing, because the vendor landscape moves quarterly.
LangSmith · LangChain's integrated observability — best fit: LangChain-native shops
Strong on trace coverage and span depth when paired with LangChain or LangGraph; weaker for non-LangChain stacks. Inline evals are first-class. Cost tracking via token counts works out of the box. Drift detection is improving but lighter than the ML-ops-native competitors.

LangFuse · open-source, self-hostable — best fit: multi-framework teams
Vendor-neutral SDK with self-host or cloud options. Strong on trace coverage, span depth, and cost tracking. Eval framework built in; drift via time-series UI. Best fit for sovereignty-bound deployments or teams who want one observability surface across multiple frameworks.

Helicone · proxy-based capture, low-touch install — best fit: fast on-ramp
Sits between application and LLM provider as a proxy — instant trace coverage with no SDK changes. Strong on cost tracking and rate limiting; lighter on agentic span-tree depth and inline evals, though improving fast. Phase-one substrate ships in days, not weeks.

Phoenix · OpenTelemetry-native, ML-ops heritage — best fit: OTel-first stacks
Emits OTel-shaped spans by default — strongest portability story among the four. Eval framework is solid; drift detection inherits Arize's mature ML-monitoring DNA. Best fit when OTel semantic conventions are a hard requirement or an Arize footprint already exists.

The single most consequential vendor decision is whether your spans emit in OpenTelemetry shape (Phoenix natively, LangFuse with the OTel exporter) or in vendor-specific shape (LangSmith). OTel pays off when the vendor landscape shifts — and it always shifts. Vendor-specific spans are typically faster to set up and richer in the short term, at the cost of portability. The week-one question to ask: if the program had to migrate the observability backend in a single quarter, what percentage of the instrumentation would have to be rewritten? Under 10% means OTel discipline is paying off. Over 50% means vendor lock-in is a future liability worth pricing in now.
For teams running the ninety-day plan for the first time, a practical default is to start with whatever vendor the team can ship traces against in week one — even if it isn't the long-term choice. The act of running the program surfaces the gaps that drive the next vendor decision, and the audit's value is independent of the vendor underneath. Our companion TCO comparison covers the cost side of the decision; this section covers the capability side.
06 — Templates
OpenTelemetry config, alert routing, incident runbook.
Three artifacts to copy on day one. The first is the OpenTelemetry semantic-convention spec — the internal document that defines span names, attributes, and resource tags so every team instruments the same way. The second is the alert-routing matrix — the table that maps every signal to a severity, a destination, and a trace-URL attachment requirement. The third is the incident-runbook skeleton — the template every runbook author copies, so that runbooks written by different teams compose into a single playbook by day 90.
The OpenTelemetry conventions below match the stabilising GenAI semantic conventions as of mid-2026. They are deliberately minimal — every team will extend this for its own domain, but the named attributes are the ones that travel cleanly across vendors and across future SDK versions.
# observability-spec.yaml — internal OTel convention reference

# Resource attributes — applied to every span
service.name: agent-<name>            # e.g. agent-support, agent-research
service.namespace: <product>          # product or business unit
deployment.environment: production    # or staging, dev

# Root span — one per user turn
span.name: gen_ai.user_turn
attributes:
  gen_ai.system: <provider>           # openai, anthropic, deepseek, etc.
  gen_ai.request.model: <model-id>
  gen_ai.user.id: <user-id>           # propagates through child spans
  gen_ai.tenant.id: <tenant-id>       # propagates through child spans
  gen_ai.session.id: <session-id>
  gen_ai.eval.heuristic.score: <0-1>  # populated by inline eval
  gen_ai.cost.usd: <decimal>          # rollup of leaf-span costs

# Model invocation — child span per call
span.name: gen_ai.completion
attributes:
  gen_ai.request.model: <model-id>
  gen_ai.request.temperature: <decimal>
  gen_ai.response.model: <model-id>
  gen_ai.usage.input_tokens: <int>
  gen_ai.usage.output_tokens: <int>
  gen_ai.usage.cached_input_tokens: <int>
  gen_ai.prompt: <stored body or hash reference>
  gen_ai.completion: <stored body>
  gen_ai.latency.ttft_ms: <int>
  gen_ai.latency.total_ms: <int>

# Tool call — child span per invocation
span.name: gen_ai.tool_call
attributes:
  gen_ai.tool.name: <tool-name>
  gen_ai.tool.arguments: <structured>
  gen_ai.tool.result: <structured>
  gen_ai.tool.retry_count: <int>
  gen_ai.tool.error: <string or null>

# Retrieval — child span per step
span.name: gen_ai.retrieval
attributes:
  gen_ai.retrieval.query: <text>
  gen_ai.retrieval.top_k: <int>
  gen_ai.retrieval.doc_ids: <list>
  gen_ai.retrieval.scores: <list>

# Sampling — explicit per environment
sampling:
  root_spans: 1.0        # 100% always
  body_storage: 1.0      # 100% in phases 1-2; selective in phase 3
  llm_judge_evals: 0.1   # 5-20% sampled
  error_traces: 1.0      # never sample down errors

The alert-routing matrix is the smallest table that prevents both alert fatigue and its worst opposite — real incidents lost in a sea of low-severity noise. Every alert maps to a severity, a destination, and a requirement that example trace URLs are attached to the page or ticket before the alert fires. The matrix below is what most teams converge on after the first real on-call rotation; copy it verbatim and tune from incident learnings.
# alert-routing.yaml — severity-graded routing with trace attachment
severities:
  sev1_page:
    destination: pagerduty-oncall
    trace_urls_required: 3      # min 3 example traces on page
    response_sla_minutes: 15
    triggers:
      - golden_dataset_score_drop > 10pct_24h
      - retry_rate_step_change > 50pct_1h
      - per_user_cost_outlier > 50x_median
      - error_rate > 5pct_15min
  sev2_ticket:
    destination: jira-agentops
    trace_urls_required: 5      # paste examples in ticket body
    response_sla_hours: 24
    triggers:
      - drift_window_shift > 20pct_24h
      - cost_per_route_trend > 30pct_7d
      - eval_score_distribution_shift
      - cache_hit_rate_drop > 15pct_24h
  sev3_digest:
    destination: slack-agentops-daily
    trace_urls_required: 10     # daily digest with sampled traces
    response_sla_hours: 168     # weekly review
    triggers:
      - tool_selection_distribution_shift
      - latency_p95_trend > 15pct_7d
      - per_tenant_growth_signal
attachments:
  every_alert_must_include:
    - trace_url: <link to trace viewer>
    - dashboard_url: <link to drift dashboard>
    - runbook_url: <link to matching runbook>

The incident-runbook skeleton below is the shape every runbook should take. The key discipline is that every runbook references concrete trace patterns by name — not "errors are up," but "tool-call rejection rate above 10% on the customer agent." Runbooks written in prose are runbooks nobody reads at 03:14.
# runbook-4.2-tool-call-rejection.md
## Trigger pattern
- Trace signal: gen_ai.tool.error rate > 10% over 15-min window
- Cross-signal: retry_count distribution shifts right
- Drift correlation: cost-per-turn rising; eval score flat or falling
## First 5 minutes
1. Open trace viewer; filter to gen_ai.tool.error != null in last hour
2. Inspect 3-5 traces — collect tool name, error message, schema diff
3. Confirm pattern: same tool? same error class? same caller cohort?
4. Page secondary if pattern unclear after 5 minutes
## Common root causes (ranked by frequency)
1. Tool schema changed; model not updated to new arg shape (60%)
2. Tool downstream dependency degraded (20%)
3. Prompt template change altered tool-selection signal (15%)
4. New model version stricter on tool-call format (5%)
## Validation steps
- Replay one failed trace against current agent; confirm reproduce
- Replay against a candidate fix in sandbox; require pass-rate > 80%
- Cross-check golden-dataset score; verify regression bounded
## Rollback paths
- Tool schema: revert to previous version in tool-registry
- Prompt template: revert prompt-v<n> → prompt-v<n-1>
- Model version: route this agent to previous model-id
## Post-incident
- Attach 3 trace URLs to post-mortem
- Add detection threshold tuning notes if signal was late
- File data-quality ticket if root cause is upstream

All three templates are starting points. The conventions will extend as the agent surface grows. The alert matrix will tune as the team learns which signals matter. The runbook skeleton will spawn fifty runbooks by the end of year one. The point is to start with skeletons everyone shares rather than letting each team invent its own shape — which is the most common reason ninety-day rollouts produce inconsistent operating practices in year two.
07 — Pitfalls
Four observability rollout failure modes.
Across the ninety-day rollouts we've scoped, four failure modes recur often enough to deserve their own section. None are technical surprises; all are organisational dynamics that quietly stall the program weeks before anyone notices. Reading them now is cheaper than learning them in flight.
Treating observability as a procurement decision — Org · Weeks 1-4
Vendor selection becomes the program; instrumentation, evals, and replay get treated as "the vendor's job." The vendor ships a backend; you ship the instrumentation. Procurement is a week-one task, not the program itself.

One team owning the entire rollout — Org · Weeks 5-8
Platform engineering inherits the whole program in silo. Evals stall because data science isn't scoped in. Cost attribution stalls because finance was never consulted. Named owners per stream, with platform as program manager, is the fix.

Saving storage cost by skipping prompt body storage — Tech · Weeks 3-4
Teams economise on storage in phase one by hashing prompts. Phase three reveals that replay is impossible without verbatim bodies. The retrofit costs more than the storage saved. Store bodies on 100% of traces with short retention; downgrade older traces.

Rehearsing the day-90 walkthrough on a staged incident — Process · Week 12
The temptation at gate review is to script a clean walkthrough on a chosen trace. Resist. Pick a real incident from the prior fortnight, ideally one the team hasn't solved yet. The walkthrough is the test — staged demos do not survive production contact.

One closing observation across the four. The technical problems in agent observability are well-understood and have mature solutions; the organisational problems are not, and they are where ninety-day rollouts mostly fail. A team that treats this as a cross-functional program with named owners, phase gates, and unvarnished gate demos is the team that ships at day 90. A team that treats it as a platform-engineering initiative running in parallel to feature work is the team that reaches day 60 with patchy coverage and a vendor invoice. The difference is governance, not engineering.
For organisations starting from scratch and wanting outside help to run the program, our AI transformation engagements ship exactly this phased plan against whichever vendor fits best — including the OpenTelemetry instrumentation, drift cron, cost-attribution model, and replay sandbox. The audit companion piece, the sixty-point observability audit, is the right first read for teams wanting to score the current state before kicking off; the vendor TCO comparison is the right second read for teams entering vendor selection in week one.
Observability is infrastructure — 90 days is the right horizon to make it stick.
Three phases, ninety days, six axes. The plan is unglamorous on purpose. Phase one buys the substrate — vendor, OTel conventions, top-three workload coverage — so every later signal has somewhere to land. Phase two earns the right to alert — eval signals, drift detection, cost attribution, all on the same trace surface. Phase three turns alerts into walkthroughs — replay, runbooks, severity-graded routing. The deliverable at day 90 is the on-call engineer walking a real incident from page to root cause in an afternoon, in front of leadership.
The trajectory we expect through the rest of 2026 is two shifts. First, OpenTelemetry semantic conventions for GenAI continue to stabilise, and vendor-neutral instrumentation becomes the default rather than the conscientious-objector position — making OTel discipline in week one a free decision rather than a deliberate one. Second, eval signals migrate from separate dashboards onto the same trace surfaces as reliability data — because the on-call engineer at 03:14 will not tolerate two URLs. Teams that build to those shifts now will run agents at scale without the organisational pain that catches up to teams who don't.
One closing thought. The day-90 readout, done honestly, is the most persuasive internal case you can make for continued observability investment. The audit document is paper; a live replay session of yesterday's real production trace is undeniable. Schedule it for a leadership audience; the rest of the program funds itself from there.