
Blind production agents, weekly cost surprises — how a multi-product SaaS halved MTTR with LangFuse + OpenTelemetry.

Case Study: An Agent Observability Rollout with LangFuse (2026)

A multi-product SaaS shipped agents on hope and debugged by folklore — until weekly cost surprises and silent quality regressions forced the issue. Ninety days later, the on-call engineer walked a real production incident from page to root cause in an afternoon, costs were predictable week-over-week, and MTTR was halved. This is what the rollout actually looked like.

Digital Applied Team · Agentic engineering
Published May 9, 2026 · Read time 13 min · Engagement 90 days

  • MTTR: halved vs pre-rollout baseline
  • Cost predictability: week-over-week, no more invoice surprises
  • OpenTelemetry adoption: yes, vendor-neutral spans
  • Timeline: 90 days, three 30-day phases

Agent observability done badly is invisible until the second outage. This case study walks the ninety-day rollout that took a multi-product SaaS from blind production agents and weekly cost surprises to a stack where on-call engineers walk real incidents from page to root cause inside an afternoon — built on LangFuse with OpenTelemetry semantic conventions, per-tenant cost attribution, drift detection, and deterministic trace replay.

The customer is a mid-sized SaaS with three production agent surfaces: a customer-facing support agent embedded in their product, an internal research agent used by their go-to-market team, and an experimental code-generation agent shipped behind a feature flag to a subset of paying tenants. None of them had observability worth the name. Trace propagation broke at the first tool boundary. Cost showed up monthly at the invoice. Drift was detected by customers, not engineers.

What follows is the rollout exactly as it ran — vendor selection, OpenTelemetry adoption, cost attribution, drift detection, replay infrastructure, and the outcomes measured at day 90. It is meant to be replicated. The phased plan is documented separately in our 30/60/90-day rollout playbook; this post is what happened when a real team ran that plan.

Key takeaways
  1. Vendor selection is the foundation. The four-way bake-off across LangFuse, LangSmith, Helicone, and Phoenix concluded on day five, not day thirty. Five business days time-boxed against the six audit axes, scored honestly, decided cleanly. Every later phase composed on top of that choice.
  2. OpenTelemetry semantic conventions unlock portability. Adopting OTel-shaped spans (gen_ai.completion, gen_ai.tool_call, gen_ai.retrieval) in week one made the LangFuse choice reversible. If the vendor changes in year two, under ten percent of the instrumentation has to be rewritten. Conventions in week one are cheap insurance.
  3. Per-tenant cost attribution unlocks SaaS unit economics. Once user-ID and tenant-ID tags propagated through every span, the finance team had per-tenant cost rollups inside a week. Two outlier tenants (each burning 40x the median in agent cost) were flagged on day three of the new attribution surface and routed to commercial conversations.
  4. Drift detection prevents silent regressions. Hourly rolling-window cron over retry rate, cost-per-turn, golden-dataset eval score, and tool-selection distribution caught two prompt regressions and one tool-schema breakage during the engagement — each within hours, not weeks. Customers never noticed.
  5. Replay turns incident response from guesswork to forensics. Deterministic trace replay built in weeks 9-10 turned every post-mortem from prose narrative into demonstration. The day-90 walkthrough was a real trace from the previous afternoon, replayed live, with the candidate fix validated against twenty historical failed traces before deploy.

01 · Situation: Three production agents, zero observability worth the name.

The customer came to the engagement with the usual symptoms of late-stage observability debt. Three agent surfaces had shipped over eighteen months — each instrumented to a different standard, none of them adequate. The support agent had request IDs that propagated through the first model call and disappeared at the first tool boundary. The research agent had spans, but they lived in a homegrown wrapper that no one had touched in six months. The code-generation agent had nothing — print statements to stdout and a Slack alert when error rate crossed five percent.

The triggering event was a Tuesday-morning Slack message from the head of finance: agent inference cost on the prior month's AWS bill had come in twenty-three percent over forecast, with no obvious cause, no per-product breakdown, and no way to tell whether the overrun was a runaway tenant, a model regression, or a price change that nobody had noticed. The CTO asked the senior platform engineer for a root cause. The platform engineer estimated four weeks of investigation, conditional on whether log retention had captured enough request bodies. The CTO called Digital Applied that afternoon.

The brief was simple: get visibility, get cost predictability, get incident response that didn't depend on guessing. The constraint was the budget cycle — ninety days, one quarter, no extension. The deliverable at day 90 was a live walkthrough in front of the executive team showing that the on-call engineer could trace a real production incident from alert to root cause without leaving the observability surface.

Support agent
Customer-facing · high blast radius

Embedded in product, customer-visible, ~40% of total agent traffic. Trace propagation broke at the first tool boundary. Quality issues surfaced via customer support tickets, not engineering dashboards. The agent the rollout had to instrument first.

Phase-one priority 1
Research agent
Internal · highest-cost

Used by GTM team for account research. Highest per-turn cost (long-context retrieval, frequent tool calls), lowest visibility — homegrown wrapper with stale spans, no eval signal, no drift detection. The cost story sat almost entirely on this surface.

Phase-one priority 2
Code-gen agent
Feature-flagged · diagnostic value

Behind a feature flag, ~50 paying tenants on it. Lowest traffic but most architecturally interesting — multi-step tool use, sub-agent delegation. Chosen as canary because it would surface trace-propagation issues fastest under instrumentation.

Phase-one priority 3
Everything else
Out of scope for the 90 days

Two additional internal agents (one operations, one data) were left uninstrumented for the engagement. Coverage breadth was a phase-four problem; the ninety-day deliverable was depth on the three highest-leverage surfaces, not patchy coverage across five.

Defer to year two

The starting metrics were the kind that look fine on a quarterly dashboard and reveal themselves the moment anyone asks a question. Mean time to resolution on the prior two agent-related incidents had been thirty-one hours and fifty-two hours — both ultimately resolved by an engineer scrolling stdout logs and grepping for clues. Cost variance from forecast to actual was running between fifteen and twenty-five percent month over month. Drift was a word the team used in retrospectives, never in dashboards. The brief was not subtle.

The pre-rollout state in one sentence
Three agent surfaces in production, no propagated identity through tool boundaries, no eval signal on traces, cost arriving monthly at the invoice rather than daily at the dashboard, and incident response measured in days because every post-mortem started with "let's see what stdout captured." The starting line for ninety days of work.

02 · Approach · Vendor Selection: Five-day bake-off · LangFuse won on portability and self-host.

The vendor decision was the first hard call of the engagement. The customer had been talking about vendor selection for six months without a decision; everyone had an opinion, no one had a scorecard, and the conversation had become its own form of paralysis. The first week of phase one was time-boxed to five business days for a structured bake-off — four vendors, six audit axes, one decision by Friday.

Each vendor got a half-day of hands-on instrumentation against the same target workload (the support agent), scored against trace coverage, span depth, eval signals, drift detection, cost tracking, and incident response. The scorecard was shared with the customer's platform lead, head of security, and head of finance — each contributing the constraints from their own seat. Security needed self-host or sovereign cloud, finance needed predictable cost, engineering needed portability.

LangFuse
Open-source · selected

Self-host option satisfied security review on day two. Vendor-neutral SDK accepted OTel-shaped spans natively. Strong on trace coverage, span depth, cost tracking; built-in eval framework. Best fit for multi-framework instrumentation across three agents.

Vendor pick
LangSmith
LangChain-integrated · runner-up

Strong on trace coverage and eval surfaces, weaker for non-LangChain stacks. Two of the three agents were not LangChain-based; instrumenting them would have meant LangSmith-shaped spans plus a separate wrapper for the non-LangChain surfaces. Portability story weakest of the four.

Not selected
Helicone
Proxy-based · fast on-ramp

Shortest time-to-first-trace — proxy install captured coverage in an afternoon. Lighter on agentic span-tree depth and inline evals, though improving. Considered for the support agent specifically; ultimately ruled out because the three-agent fleet needed one consistent surface.

Not selected
Phoenix (Arize)
OTel-native · ML-ops heritage

Strongest OTel portability story; eval framework solid, drift detection inherited Arize's mature ML-monitoring DNA. Lost on commercial fit — pricing and contract terms didn't match the customer's procurement window. A defensible alternate in a different commercial context.

Not selected

The decisive criterion in the end was not feature parity — all four had passable agent-observability stories — but commercial fit plus the self-host option. LangFuse's self-host shipped through security review in two days, because the data residency story collapsed an entire sub-process the customer had been planning to run with the other vendors. The decision was made on Friday afternoon of week one, the contract signed the following Tuesday, and week two opened with the OpenTelemetry convention spec already in draft.

"The vendor conversation had been going for six months. The bake-off finished it in five business days. Time-boxing wasn't a stylistic preference; it was the unlock."— Customer platform lead · post-rollout retrospective
Why the bake-off worked
The structured scorecard against six audit axes turned a recurring debate into a decision. Every stakeholder saw the same numbers; every "but what about" could be answered against the matrix. The lesson is general: vendor selection that runs longer than a week is rarely solving a vendor problem — it's usually a governance problem. A time-boxed scorecard cuts through both.

03 · Approach · OpenTelemetry: Semantic conventions in week one, spans land in week two.

The most consequential architectural decision of the rollout was made before a single trace shipped: every span would emit in OpenTelemetry semantic-convention shape. That decision cost roughly two extra days in week two — codifying a small internal spec, agreeing attribute names across the three feature teams, and writing a thin internal helper that wrapped LangFuse's SDK to emit OTel-shaped attributes by default. It is the single decision the engagement would most want to make again.

The reasoning is portability. If LangFuse stops fitting in year two — pricing change, capability gap, acquisition risk — the customer can migrate the backend with under ten percent of the instrumentation rewritten. Vendor-specific spans would have made the same migration a quarter of engineering work. The two days spent in week two pre-pays a quarter of work that may never be needed and is cheap if it isn't.
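To make the convention spec concrete, here is a minimal sketch of the span shape it defined, written against the plain OpenTelemetry Python SDK (the customer's actual helper wrapped LangFuse's SDK; plain OTel is used here for illustration). Span names like gen_ai.user_turn are the customer's internal extension of the OTel GenAI conventions, and all attribute values are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

def answer_turn(question: str) -> str:
    # One root span per user turn, per the internal convention spec.
    with tracer.start_as_current_span("gen_ai.user_turn"):
        # Child span per retrieval step.
        with tracer.start_as_current_span("gen_ai.retrieval") as retrieval:
            retrieval.set_attribute("retrieval.query", question)  # illustrative key
            docs = ["..."]  # vector-store lookup elided
        # Child span per tool call.
        with tracer.start_as_current_span("gen_ai.tool_call") as tool:
            tool.set_attribute("gen_ai.tool.name", "crm_lookup")  # illustrative
        # Child span per model invocation, carrying model + token attributes.
        with tracer.start_as_current_span("gen_ai.completion") as completion:
            completion.set_attribute("gen_ai.request.model", "example-model-v2")
            completion.set_attribute("gen_ai.usage.input_tokens", 1842)
            completion.set_attribute("gen_ai.usage.output_tokens", 211)
            return "..."  # model response elided
```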

Week 1
Convention spec
internal doc · span names · attributes

Draft the internal OTel semantic-convention spec — gen_ai.user_turn root span, gen_ai.completion / gen_ai.tool_call / gen_ai.retrieval child spans, attribute names for model, tokens, latency, identity. Reviewed by all three feature teams.

Owner: platform lead
Week 2
Support agent · first instrumentation
root span · model spans · tool spans

Wrap the customer-facing support agent. Root span per user turn. Child span per model invocation, per tool call, per retrieval step. Trace IDs flow into product logs so support engineers can cross-reference customer reports.

Owner: support team
Week 3
Research + code-gen instrumentation
parallel team work · same convention spec

Research-agent and code-gen-agent teams instrument in parallel against the same spec. Code-gen surfaces a sub-agent delegation pattern that extends the spec — added cleanly, broadcast back to support and research teams within the day.

Owner: feature teams
Week 4
Coverage audit · body storage
≥99% root coverage · prompt + response stored

Verified root-span coverage at 99%+ on all three agents. Confirmed prompt and response bodies persist with PII redaction in place. Fire-drill: picked a real trace from the prior day, asked a teammate to walk it in five minutes — passed.

Owner: platform lead
Gate
Day 30 readout
live demo · phase-two plan signed

Demoed the live trace viewer on a real support-agent incident from the prior week. Presented the drift-detection and cost-attribution design for phase two. Both signed by the CTO; the program ran to day 60 with budget secured.

Owner: program lead

Two implementation details from this phase are worth capturing. First, identity propagation. User-ID and tenant-ID were added to the trace context (OTel baggage) from day one — every span automatically inherited them through the trace context regardless of where in the call tree it lived. The teams that try to propagate identity through ad-hoc argument passing always lose it at the first tool boundary; baggage was the fix. Second, body storage with PII redaction. The redaction pipeline ran inline on the span exporter — no raw PII ever reached LangFuse — and stored prompt and response bodies on 100% of traces with seven-day retention. That decision became load-bearing in weeks 9-10 when replay was built; without verbatim bodies, replay would have been impossible.
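The baggage pattern reduces to a few lines. A minimal sketch, again assuming the OpenTelemetry Python SDK; the app.user_id / app.tenant_id keys and the processor name are illustrative stand-ins for the customer's internal helper, not a LangFuse API:

```python
from opentelemetry import baggage, context, trace
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider

class IdentitySpanProcessor(SpanProcessor):
    """Copies identity from baggage onto every span the moment it starts,
    so no call site has to thread user/tenant IDs through its arguments."""
    KEYS = ("app.user_id", "app.tenant_id")  # illustrative attribute keys

    def on_start(self, span, parent_context=None):
        for key in self.KEYS:
            value = baggage.get_baggage(key, parent_context)
            if value is not None:
                span.set_attribute(key, str(value))

provider = TracerProvider()
provider.add_span_processor(IdentitySpanProcessor())
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def handle_user_turn(user_id: str, tenant_id: str) -> None:
    # Set identity once at the root; every descendant span inherits it
    # through the trace context, including spans inside tool boundaries.
    ctx = baggage.set_baggage("app.user_id", user_id)
    ctx = baggage.set_baggage("app.tenant_id", tenant_id, context=ctx)
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("gen_ai.user_turn"):
            ...  # model calls, tool calls, retrieval
    finally:
        context.detach(token)
```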

The convention discipline
OpenTelemetry conventions in week one cost two days and bought a quarter of future portability. Vendor-specific spans would have been faster initially and become the migration tax later. The week-one question to ask: if the program had to migrate the observability backend in a single quarter, what percentage of the instrumentation would have to be rewritten? Under ten percent means OTel discipline is paying off.

04 · Approach · Cost Attribution: Four attribution axes, per-user rollups inside a week.

The cost-attribution stream was the highest-leverage piece of phase two for this customer specifically — the original triggering event had been a cost surprise on the AWS bill, and the finance partner was watching closely. The implementation was straightforward because identity propagation had already been built in phase one. Token counts on leaf model spans rolled up to the root span; user-ID and tenant-ID tags propagated through trace context; a versioned unit-price table in code converted tokens to cost. Within a week of the surface going live, two outlier tenants had been identified and routed to commercial conversations.

Axis 1
Per-trace
rollup of leaf-span token counts

Every model span carries input, output, and cached input token counts. Versioned unit-price table in code, dated per provider update. Root span aggregates leaf costs. Per-trace cost lands as a span attribute, queryable in the same surface as latency.

Foundation axis
Axis 2
Per-user
user-ID tag · top-N daily report

User-ID propagates through trace context to every span. Daily top-N report surfaces runaway users — abusive callers, malformed integrations, accidentally-recursive workflows. Two outlier users flagged on day three of the surface, both ultimately legitimate but cost-improvable.

Outlier detection
Axis 3
Per-tenant
tenant-ID rollup · finance chargeback

Tenant-ID propagates alongside user-ID. Per-tenant rollups feed the finance team's chargeback model directly. Two tenants identified burning 40x the median in agent cost; one converted to a higher tier, one had a buggy integration fixed inside a week.

SaaS unit economics
Axis 4
Per-feature · per-route
route tag · feature flag tag

Route tag and feature-flag tag propagate alongside identity. Per-feature cost rollups let product managers see which features were worth their inference spend. One sub-feature retired in week ten because per-trace cost exceeded conversion lift.

Product economics

Two implementation details. First, the unit-price table is versioned in code rather than fetched from a vendor API. Provider prices change quarterly; an in-code table dated per update lets the finance team audit cost calculations historically — a prior incident's cost attribution can be recomputed at the prices that applied that day, not today's prices. Second, eval cost is tracked as a separate line item from production cost. LLM-judge calls consume tokens; budgeting them with the production agent obscures both. Separated, both are interpretable.
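A minimal sketch of the versioned in-code price table and the per-span cost computation; the model name and rates are illustrative, not any provider's actual pricing. Cached-input tokens, which the rollout also tracked, would add a third rate and are omitted for brevity:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Price:
    effective: date       # date this rate took effect, per provider update
    input_per_1k: float   # USD per 1k input tokens
    output_per_1k: float  # USD per 1k output tokens

# Newest-first per model. Historical recomputation (e.g. re-auditing an old
# incident) picks the rate in effect on the trace's date, not today's.
PRICES: dict[str, list[Price]] = {
    "example-model-v2": [
        Price(date(2026, 4, 1), 0.0030, 0.0150),
        Price(date(2026, 1, 15), 0.0025, 0.0125),
    ],
}

def price_on(model: str, day: date) -> Price:
    for p in PRICES[model]:
        if day >= p.effective:
            return p
    raise ValueError(f"no price for {model} on {day}")

def span_cost(model: str, day: date, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model span; the root span sums these across its leaves."""
    p = price_on(model, day)
    return (input_tokens / 1000) * p.input_per_1k + (output_tokens / 1000) * p.output_per_1k
```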

Cost attribution surface · four axes, all live

Cost attribution outcomes · day 60 readout
  • Per-trace cost on every root span (rollup of leaf-span token counts at versioned prices) · 100%
  • Per-user attribution surface (top-N daily report, outlier alerts on 50x median) · Daily
  • Per-tenant chargeback rollup (feeds finance model; two outlier tenants caught wk 6) · Live
  • Per-feature / per-route cost (product-manager surface; one sub-feature retired) · Wk 8
  • Cost variance from forecast (pre-rollout: 15-25% monthly · post: <5% weekly) · Halved+
The finance unlock
The finance team had asked for per-tenant cost attribution twice in the prior year. Both requests stalled because the engineering work to backfill identity onto already-shipped traces was estimated at six weeks. Building identity propagation into the spans from week one of the rollout — a two-day discipline — made the same surface a one-week dashboard build in phase two. The retrofit tax avoided was roughly fivefold.

05 · Approach · Drift + Replay: Drift detection earns the alerts · replay earns the post-mortems.

The final two capabilities of the rollout — drift detection in phase two, deterministic replay in phase three — are the ones that turned observability from a forensic tool into an operating one. Drift detection earned the team the right to alert: instead of static thresholds that fired on noise, hourly rolling-window cron computed rate-of-change on the signals that mattered and routed step-changes to severity-graded alerts with trace URLs attached. Replay closed the loop on the other end: every alert now ended with a walkthrough rather than a guess.

Three drift incidents fired during the engagement, each instructive. The first, in week seven, was a prompt-template regression that dropped golden-dataset eval score by twelve percent on the support agent — caught by drift in under two hours, root cause identified in fifteen minutes (the prompt template change had been deployed without test coverage), fix shipped before the next business day. The second, in week ten, was a tool-schema breakage on the code-gen agent — an upstream API changed a field type, tool-call rejection rate climbed past ten percent on the canary tenants within an hour, the on-call engineer routed the trace through replay and validated the candidate fix against twenty failed traces before deploy. The third, in week eleven, was a runaway tenant on the research agent — per-user cost spiked 50x median, drift fired, the commercial conversation was scheduled the same afternoon.
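A minimal sketch of the rolling-window check behind these catches, with assumed signal names and z-score thresholds (the production thresholds were tuned weekly and are not published here); tool-selection distribution is scalarized to an entropy for simplicity:

```python
from statistics import mean, stdev

def drift_score(history: list[float], current: float) -> float:
    """Z-score of the newest window against the trailing baseline."""
    if len(history) < 8 or stdev(history) == 0:
        return 0.0
    return abs(current - mean(history)) / stdev(history)

# Assumed signal names and thresholds; eval-score drift alerts sooner.
THRESHOLDS = {
    "retry_rate": 4.0,
    "cost_per_turn": 4.0,
    "golden_eval_score": 3.0,
    "tool_selection_entropy": 4.0,
}

def check_drift(windows: dict[str, list[float]]) -> list[str]:
    """windows maps signal name to hourly values, oldest first; the last
    entry is the window just closed. Returns the signals that stepped."""
    fired = []
    for signal, threshold in THRESHOLDS.items():
        *history, current = windows[signal]
        if drift_score(history, current) > threshold:
            fired.append(signal)  # alert router attaches example trace URLs
    return fired
```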

Drift
Rolling-window cron
hourly · retry · cost · eval · tools

Hourly windows on retry rate, cost-per-turn, golden-dataset eval score, tool-selection distribution, per-route latency. Step-changes route to alerts with example trace URLs attached. Three drift incidents caught during engagement, each resolved within hours.

Right to alert
Alerts
Severity-graded routing
page · ticket · digest

Severity 1 pages on-call with 3-5 trace URLs attached. Severity 2 opens a ticket with five URLs. Severity 3 lands in a daily digest. Hard rule: no alert fires without trace evidence. Alert volume tuned weekly from on-call retrospectives. A sketch of this routing rule follows these cards.

Owner: SRE
Replay
Trace URL → sandboxed run
rendered prompt · tool outputs · retrieval payload

Given a trace URL, reconstructs the rendered prompt, tool outputs, retrieval payload from preserved bodies. Re-runs the agent in a sandbox. Compares outputs. Foundation of every later capability — runbook validation, post-mortem demos, regression testing.

Owner: platform lead
Runbooks
Trace-pattern triggers
named patterns · documented actions

Runbooks reference trace patterns by name — tool-call rejection above 10%, retry-rate step change above 50%, per-user cost above 50x median. First-five-minutes section gives explicit filter commands. Rollback paths named per surface. Two-screen format.

Owner: SRE + on-call
Fire-drill
End-to-end chaos exercise
synthetic incident · full chain validation

Week 12: synthetic incident injected, alert fired, on-call paged, trace pulled, replay validated root cause, hotfix replayed against historical traces, deploy shipped. End-to-end in 27 minutes. Documented as the baseline IR cadence for year two.

Owner: program lead
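As promised above, a minimal sketch of the severity-graded routing; the severity mapping and the no-evidence rule are from the alerts card, while the DriftAlert type and return strings are hypothetical stand-ins for the customer's pager and ticketing integrations:

```python
from dataclasses import dataclass

@dataclass
class DriftAlert:
    signal: str
    severity: int          # 1 = page, 2 = ticket, 3 = daily digest
    trace_urls: list[str]  # example traces attached by the drift cron

def route(alert: DriftAlert) -> str:
    # Hard rule from the rollout: no alert fires without trace evidence.
    if not alert.trace_urls:
        raise ValueError(f"alert on {alert.signal!r} has no trace evidence")
    if alert.severity == 1:
        return f"page on-call · {alert.signal} · {len(alert.trace_urls)} traces"
    if alert.severity == 2:
        return f"open ticket · {alert.signal} · {len(alert.trace_urls)} traces"
    return f"daily digest · {alert.signal}"
```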

The single most important implementation detail of phase three was that replay only worked because prompt and response bodies had been stored verbatim from week two. Teams that try to save storage cost in phase one by hashing prompts always discover the trade-off when they reach replay and find it impossible to reconstruct historical inputs. Storage cost here was approximately $40 a month for seven-day retention on full bodies across all three agents — cheap insurance for the replay capability that anchored the day-90 readout.
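A minimal sketch of the replay harness's shape; every type and callable here is a hypothetical stand-in for the customer's internal tooling, not a LangFuse API. The structural point is that replay is a pure function of the stored bodies, which is why hashing prompts would have made it impossible:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CapturedTrace:
    trace_id: str
    rendered_prompt: str           # verbatim body stored at capture time
    tool_outputs: dict[str, str]   # recorded tool results, stubbed in on replay
    final_output: str              # what the agent originally produced

# An agent runner: (rendered prompt, recorded tool outputs) -> output.
AgentRunner = Callable[[str, dict[str, str]], str]

def replay_matches(trace: CapturedTrace, run_agent: AgentRunner) -> bool:
    """Fidelity check: re-run in a sandbox with recorded tool outputs and
    compare bit-for-bit against the original output (the 97%+ number)."""
    return run_agent(trace.rendered_prompt, trace.tool_outputs) == trace.final_output

def validate_fix(failed: list[CapturedTrace], run_fixed_agent: AgentRunner,
                 passes: Callable[[str], bool]) -> float:
    """Re-run historical failed traces under a candidate fix; `passes` judges
    the new output. The day-90 demo showed 19 of 20 passing."""
    return sum(passes(run_fixed_agent(t.rendered_prompt, t.tool_outputs))
               for t in failed) / len(failed)
```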

"Replay was the moment observability stopped feeling like a tax. The first time we replayed yesterday&apos;s failed trace against a candidate fix and watched it pass, the whole program paid for itself."— Customer on-call engineer · day-90 retrospective

06 · Outcomes: MTTR halved, cost predictable, walkthroughs beat narratives.

The day-90 readout was scheduled three weeks in advance and run in front of the executive team — CTO, head of finance, head of customer success, two product leads. The format was the same one the platform lead had used internally for the gate reviews: a real production trace from the prior forty-eight hours, walked live, with replay invoked midway to demonstrate root-cause validation against historical failures. Every number below is from the customer's own measurement, taken at the gate.

Day-90 outcomes · measured against day-0 baseline

Source: customer-measured outcomes · day-90 readout
  • MTTR on agent incidents (pre-rollout: 31h / 52h on last two · post: <12h average) · Halved
  • Cost variance from forecast (pre-rollout: 15-25% monthly · post: <5% weekly) · Wk/wk
  • Drift incidents caught before customers noticed (three during engagement · zero customer tickets) · 3 of 3
  • Trace coverage on top-3 agents (root spans on all three production surfaces) · 99%+
  • Replay fidelity on historical traces (captured re-runs match original output bit-for-bit) · 97%+
  • OpenTelemetry adoption / vendor portability (hypothetical migration: <10% instrumentation rewrite) · ≤10%
  • On-call IR rehearsal, time to root cause (week-12 chaos exercise, alert-to-fix) · 27 min

The narrative outcomes mattered as much as the numerical ones. The on-call engineer who had previously dreaded agent-related pages came out of the day-90 readout asking for the same observability stack to be extended to the two agents that had been left out of scope. The finance team asked for the per-tenant cost surface to be wired into their forecasting model directly. The CTO scheduled the second-quarter program — coverage breadth across the remaining agents, plus a second tier of eval automation — on the basis of the day-90 walkthrough. The replay demonstration in particular was what made the case; the audit document had been compelling on paper, the live walkthrough was undeniable.

What the customer chose not to do is worth naming. They did not extend to coverage breadth during the ninety days — the two operational agents that had been out of scope stayed uninstrumented. They did not build a custom dashboard layer on top of LangFuse — LangFuse's own surfaces were sufficient for the on-call cadence. They did not migrate the existing logging pipeline; trace IDs cross-referenced into product logs were enough integration for the ninety days. The discipline of depth-on-three-agents rather than breadth-on-five was the program's defining choice.

The walkthrough that closed the case
The day-90 readout was a real production trace from the prior afternoon — not a rehearsed staging demo. The on-call engineer walked it live: alert fires from drift, trace URL attached, trace viewer opens, root cause spotted in the tool-call span, candidate fix drafted, replay validates the fix against twenty historical failed traces with pass-rate 19 of 20. Deploy shipped same day. The executive team approved phase four on the spot.

07 · Lessons + Replication: What to replicate, and the three traps to avoid.

Five lessons travel cleanly from this engagement to any multi-product SaaS running agents in production. Three traps are worth naming explicitly, because they are the ones the program lead would warn other teams about most firmly. The patterns are general; the implementation details vary with vendor and stack, but the shape of the rollout is replicable across most engineering organisations of similar scale.

Lesson 1
Box
Time-box vendor selection to one week

Five business days, four vendors, six axes, one scorecard, one decision. The six-month vendor debate the customer arrived with had been a governance problem masquerading as an engineering problem. The bake-off closed it.

Org · Week 1
Lesson 2
OTel
Conventions in week one, instrumentation in week two

Two days of OpenTelemetry-shape discipline pre-pays a quarter of future portability work. The same decision is approximately five times more expensive to retrofit later. Adopt before the first span ships, not after.

Tech · Week 1-2
Lesson 3
ID
Identity propagation through trace context, not arguments

User-ID and tenant-ID on every span via OTel baggage. Ad-hoc argument passing always loses identity at the first tool boundary. Cost attribution and per-tenant rollups depend on this in phase two — build it correctly in phase one.

Tech · Week 2-3
Lesson 4
Body
Store prompt and response bodies on 100% of traces

Verbatim bodies with seven-to-thirty-day retention. Cost is small; the replay capability they enable in phase three is the highest-leverage thing the engagement built. Hashing prompts to save storage is the false economy that kills replay.

Tech · Week 2-4
Lesson 5
Demo
Day-90 readout is a real trace, not a rehearsed one

Pick a real incident from the prior forty-eight hours; demonstrate end-to-end. The walkthrough is the test. Staged demos do not survive first-production contact and do not earn the budget that drives phase four.

Process · Week 12
Trap 1
Wrap
Don't build a custom SDK wrapper before the first trace

Every team feels the urge to abstract the vendor SDK on day one. Resist until vendor switch becomes a credible scenario; the homegrown wrapper otherwise consumes phase one and produces nothing demonstrable at the day-30 gate.

Anti-pattern
Trap 2
Silos
Don't run the rollout from a single team

Platform engineering inheriting the entire program in a silo is the most common reason rollouts stall at day forty-five. Named owners per stream — evals to data science, cost to finance, replay to platform — with platform as program manager. Cross-functional or it doesn't ship.

Anti-pattern
Trap 3
Breadth
Don't spread phase one across every agent on day one

Coverage breadth is a phase-four problem. Depth on the three highest-leverage surfaces buys the substrate everything else composes on. Patchy coverage on twelve agents at day thirty is the failure mode that turns the whole quarter into a stall.

Anti-pattern

For organisations starting from scratch, the replication path is straightforward. Read the 30/60/90-day rollout playbook for the phased plan. Use the vendor TCO calculator to anchor the bake-off scorecard against your own cost profile. The replication risk is rarely technical — the patterns above are well-understood and the vendor surfaces are mature. The replication risk is governance: a program without phase gates and named stream owners is the program that reaches day sixty with patchy coverage and a vendor invoice but no operating leverage.

For teams who want outside help running the program, our AI transformation engagements ship exactly this phased rollout — vendor selection, OpenTelemetry instrumentation, cost attribution, drift detection, replay infrastructure — against whichever vendor fits your stack and sovereignty constraints best. The ninety-day shape is the same; the implementation details vary; the day-90 walkthrough is the bar we hold ourselves to in every engagement.

Conclusion

Observability turns blind production agents into observable production agents — and that's the whole game.

Ninety days. Three agents instrumented. One vendor chosen in five business days. OpenTelemetry conventions in week one. Per-tenant cost attribution live by week eight. Drift detection catching prompt regressions in hours rather than weeks. Deterministic trace replay turning post-mortems into walkthroughs. MTTR halved. Cost predictable week over week. The shape of the rollout is replicable; the outcomes are the kind a quarterly executive review can unambiguously verify.

The deeper lesson is that none of the technical pieces were novel. CSA-style attention compression is novel. Mixture-of-experts routing is novel. Observability for agents, as of mid-2026, is not novel — it is just work that most teams haven't done yet, because the patterns are well-understood and the vendor surfaces are mature enough to ship against. The barrier is governance: a cross-functional program with named owners, phase gates, and unvarnished gate demos. The customer above ran exactly that program, and the day-90 readout was the proof.

For any team operating production agents without observability worth the name, the question is no longer whether to roll it out — the cost of operating blind is now well-priced — but how quickly to start. Ninety days from kickoff to day-90 readout is the right horizon. Time-boxed vendor selection in week one is the unlock. OpenTelemetry discipline in week two is the cheap insurance. Replay infrastructure in weeks nine and ten is the leverage. Everything else composes on top.

Replicate this rollout

Observability turns blind agents into observable ones.

Our team runs agent observability rollouts mirroring this case — vendor selection, OpenTelemetry, cost attribution, drift detection, replay.

Free consultation · Expert guidance · Tailored solutions
What we replicate

Observability rollout engagements

  • Vendor selection (LangFuse / LangSmith / Helicone / Phoenix)
  • OpenTelemetry semantic conventions
  • Cost attribution per-user / per-tenant / per-feature
  • Drift detection cron and alert routing
  • Replay infrastructure for incident response
FAQ · LangFuse case

The questions ops teams ask after the case.

Why did LangFuse win the bake-off?
Six audit axes — trace coverage, span depth, eval signals, drift detection, cost tracking, incident response — scored across four vendors over five business days. LangFuse won on three concrete factors. First, the self-host option satisfied the customer's data-residency requirement in two days, collapsing an entire sub-process the team had been planning to run with closed-cloud alternatives. Second, the vendor-neutral SDK accepted OpenTelemetry-shaped spans natively, which made the OTel discipline cheap to adopt in week one. Third, the cost model was predictable enough for the finance partner to sign off without a multi-month commitment debate. LangSmith was the runner-up; Helicone won on time-to-first-trace but lost on agentic span depth; Phoenix lost on commercial fit rather than capability. The general lesson: vendor selection is rarely won by feature leadership alone — it is won by the combination of capability, commercial fit, and the constraint each non-engineering stakeholder brings.