Observability stack TCO is the cost question every agentic team postpones until the third invoice — and the choice between LangSmith, LangFuse, Helicone, and Phoenix gets harder once production traffic is committed. Four vendors, three volume tiers, six feature deltas — this guide is the per-trace total cost of ownership analysis that decides which stack survives twelve months of growth.
What's at stake is real money and real switching cost. Managed pricing scales linearly with trace volume until it doesn't; self-hosted infrastructure scales with operational headcount until it doesn't. The right answer at one thousand monthly traces is almost never the right answer at ten million, and the migration cost between vendors is the line item nobody models until they have to.
This guide covers four vendors honestly — managed strengths, self-hosted economics, the six feature deltas that decide a multi-year commitment — and gives three team archetypes a specific recommended pick. Every number is sourced from published rate cards as of mid-2026; verify against current vendor documentation before committing.
- 01 — Managed observability wins at low and mid volumes. Build cost versus operational complexity tilts decisively toward managed up to roughly one million monthly traces — paid plans cost less than the engineering hours self-hosting consumes when traffic is still small.
- 02 — Self-hosted wins above 10M monthly traces. Marginal cost on a self-hosted LangFuse or Phoenix deployment approaches the underlying storage cost once volume is high enough to amortise the ops headcount. Managed pricing keeps scaling linearly; storage cost does not.
- 03 — Eval integration is the killer feature most teams under-weight. Inline eval scores on the same trace surface as reliability data prevent the "quality is fine, reliability is broken" fiction. Do not pick a vendor without first-class eval integration; bolting it on later is more painful than switching tools.
- 04 — Cost attribution per-user is non-negotiable for SaaS. Hot-spot users — runaway agents, abusive callers, malformed integrations — surface earlier when cost is attributed per-user and per-tenant. Vendor support for granular attribution varies widely; check before you sign.
- 05 — Migration cost between vendors is moderate. OpenTelemetry semantic conventions for GenAI are stabilising, which makes cross-vendor migration meaningfully cheaper than it was a year ago. Emit OTel-shaped spans today and your future-vendor switching cost stays bounded.
01 — Four Vendors
LangSmith, LangFuse, Helicone, Phoenix.
The agent observability market sorted itself into four mainstream choices by mid-2026 — each with a different origin story and a different deployment model. LangSmith ships as LangChain's integrated observability surface. LangFuse is the open-source vendor-neutral option with both managed cloud and self-host paths. Helicone is the proxy-based fast on-ramp. Phoenix is the OpenTelemetry-native, ML-ops-heritage option from Arize.
The TCO calculation is fundamentally different across the four because the cost structure is different. Managed vendors price per trace (or per event); self-hostable vendors price the managed tier per trace and the self-host tier in infrastructure plus operational headcount. The right comparison is not list price — it is the all-in twelve-month cost given your projected volume and your operational appetite.
LangSmith
First-party observability for LangChain and LangGraph stacks. Strongest when the orchestration framework is already LangChain. Inline evals are first-class; cost tracking via token counts; drift detection improving quarter over quarter.
Managed only

LangFuse
Vendor-neutral SDK with both managed cloud and self-host paths. Strong on trace coverage, span depth, and cost tracking. Eval framework built in; drift via the time-series UI. The default pick when sovereignty or multi-framework is on the requirements list.
Managed + self-host

Helicone
Sits between your application and the LLM provider as a proxy — instant trace coverage with no SDK changes. Strong on cost tracking and rate limiting; lighter on agentic span-tree depth and inline evals (improving). The fast on-ramp for non-agentic LLM apps.
Proxy + managed

Phoenix (Arize)
Emits OpenTelemetry-shaped spans by default — strongest portability story across the four. Eval framework solid; drift detection inherits Arize's ML-monitoring DNA. The right pick when OTel semantic conventions are a hard requirement.
Managed + self-host

02 — Per-Trace Pricing
Managed vs self-hosted — same axis, different breaks.
Managed observability pricing across all four vendors follows the same general shape — a free or low-cost developer tier, a volume-tiered production plan, and an enterprise plan that collapses into a custom quote past some threshold. The shape is similar; the slope and the breakpoints are not.
Self-hosted pricing is a different beast. The infrastructure cost is a function of trace volume, retention window, and the cost of a managed Postgres or ClickHouse instance. The operational cost is a function of how much engineering time the stack consumes — patching, scaling, backup verification, on-call for the observability stack itself. Both numbers matter; both are often underestimated.
Vendor-hosted · per-trace billing
Free tier covers development and small production. Paid plans scale linearly with trace volume until enterprise tiers kick in. Operational complexity is the vendor's problem. The right default below roughly one million monthly traces.
Below 1M traces / month

You run it · infra + headcount cost
Available on LangFuse and Phoenix (not LangSmith or Helicone in the same way). Infrastructure cost scales with storage and retention, not trace count directly. Operational headcount is the line item teams forget — figure 0.2 to 0.5 FTE of senior engineering attention at meaningful scale.
Above 10M traces / month

Helicone-style passthrough
The Helicone proxy captures every LLM call without SDK changes. Cost is per-event with generous free tiers. Less granular for deep agentic span trees, but the on-ramp is unmatched — instrumentation cost is effectively zero on day one.
Non-agentic LLM apps

DIY pipeline · cold tier archive
Emit OpenTelemetry spans to a warehouse (ClickHouse, BigQuery, S3 + Athena). Long-term retention at storage prices, hot queries on a sampled tier. Operationally the most demanding option; the right answer only when retention or sovereignty requirements force it.
Compliance-bound

The single most useful framing is to think of pricing in two zones. Below roughly one million monthly traces, managed wins on almost any metric — paid plans cost less than the engineering attention self-hosting consumes when the stack is small. Above ten million monthly traces, the math flips — managed pricing keeps scaling linearly, self-hosted storage cost grows much more slowly, and the operational headcount amortises across enough traffic to be defensible.
The interesting zone is between one million and ten million — where pricing depends heavily on feature requirements, eval volume, retention policy, and the cost of an in-house operational engineer in your geography. There is no universal right answer in that band; it is the zone where the audit framework in our AI transformation engagements usually pays for itself.
"Managed pricing wins until your traffic is too big to amortise; self-hosted wins once your traffic is big enough to defend the headcount. The interesting decisions live in the middle band."
— Production lesson · 2026 observability engagements
03 — Three Tiers
1k, 100k, 10M monthly traces.
We modelled three representative volume tiers — one thousand, one hundred thousand, and ten million monthly traces — and estimated all-in twelve-month TCO for each vendor at each tier. The numbers below are illustrative ranges based on published rate cards and typical production parameters (median trace size, eval sample rate, retention window). Treat them as ranking signals, not procurement quotes — confirm against current vendor documentation before any commitment.
Twelve-month TCO by volume tier · managed vs self-hosted
Illustrative ranges from mid-2026 published rate cards · verify before committing

Two observations. First, the difference between the cheapest and most expensive managed vendor at one hundred thousand traces per month is genuinely small — the feature deltas in Section 04 will dominate the decision, not the rate card. Second, the gap between managed and self-hosted opens fast past one million traces — and once you cross five-to-ten million, the self-host case becomes increasingly hard to argue against on cost alone.
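A back-of-envelope version of that model is worth keeping in a spreadsheet or a script. The sketch below uses placeholder rates — the per-trace price, storage figures, and FTE cost are assumptions, not any vendor's published numbers — so substitute current rate-card values before drawing conclusions.

```python
# Illustrative 12-month TCO model. Every rate here is a placeholder
# assumption, NOT a published vendor price -- substitute your own numbers.

def managed_tco(monthly_traces, free_tier=50_000, per_trace=0.0005):
    """Managed cost: linear per-trace billing past a free tier."""
    billable = max(0, monthly_traces - free_tier)
    return 12 * billable * per_trace

def self_hosted_tco(monthly_traces, kb_per_trace=25, usd_per_gb_month=0.10,
                    ops_fte=0.3, fte_cost=180_000):
    """Self-hosted cost: storage grows with volume; ops headcount is a floor.

    Storage is modelled as steady-state retained volume -- a simplification
    that ignores retention-window growth.
    """
    storage_gb = monthly_traces * kb_per_trace / 1_000_000
    return 12 * storage_gb * usd_per_gb_month + ops_fte * fte_cost

for tier in (1_000, 100_000, 10_000_000):
    print(f"{tier:>10,} traces/mo"
          f"  managed ${managed_tco(tier):>10,.0f}"
          f"  self-hosted ${self_hosted_tco(tier):>10,.0f}")
```

Under these placeholder rates the model reproduces the shape of the argument: managed is near-free at 1k, trivially cheap at 100k, and self-hosted edges ahead around the ten-million mark.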
The hidden cost in the middle band is eval volume. Inline LLM-judge evaluations consume their own tokens and produce their own spans, and aggressive eval sampling can double or triple the effective trace count visible to your observability vendor. Many teams discover this only after the first quarterly true-up.
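The multiplier is easy to sanity-check before the first true-up. A minimal sketch — the spans-per-eval count is an assumption to confirm against how your vendor actually counts eval-generated events:

```python
def effective_monthly_traces(app_traces, eval_sample_rate, spans_per_eval=2):
    """Traces billed by the vendor once inline LLM-judge evals are counted.

    eval_sample_rate: fraction of traces that get an inline eval (0.0-1.0).
    spans_per_eval:   extra billable events each eval emits (judge call plus
                      score write) -- an assumption; check your vendor's
                      billing model for how eval spans are counted.
    """
    return app_traces * (1 + eval_sample_rate * spans_per_eval)

# 100k application traces with 50% eval sampling bill as 200k effective traces.
print(effective_monthly_traces(100_000, 0.5))  # 200000.0
```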
04 — Feature Deltas
Eval, replay, cost attribution, drift, multi-tenant.
Six feature deltas decide most multi-year commitments. The matrix below shows how each vendor covers each delta as of mid-2026 — what is first-class, what is adequate, and what is a gap to fill with custom instrumentation. The relative rankings shift quarterly; this is a starting point rather than a procurement spec.
Inline evals · same surface as traces
LangSmith and LangFuse are first-class — eval scores land as attributes on the same span. Phoenix is solid with the ML-ops heritage showing. Helicone is improving but still asks teams to bolt evals on separately. Eval integration is the single most under-weighted feature in early-stage decisions.
LangSmith / LangFuse / Phoenix

Reproduce yesterday's incident on a laptop
Replay-from-trace is best on LangSmith and LangFuse, where full prompt and response bodies are stored by default. Phoenix supports replay through its eval framework; Helicone's proxy model preserves what passed through it. Replay is the hardest single audit question to put to any observability vendor.
LangSmith / LangFuse

Per-user, per-tenant, per-trace rollups
Helicone is strongest out of the box — proxy-level capture makes per-user attribution trivial. LangFuse and LangSmith support it through span attributes that propagate. Phoenix supports it but requires manual setup. Non-negotiable for B2B SaaS.
Helicone first-class

Time-series · annotations · rollback playbooks
Phoenix leads because of the Arize ML-monitoring inheritance — drift detection is mature, with time-series UIs and statistical detectors built in. LangFuse covers the basics through its time-series UI; LangSmith is catching up; Helicone covers cost drift well, output drift less so.
Phoenix first-class

Tenant isolation · role-based access
All four support multi-tenant attribution through span attributes; tenant-level access control is a paid-tier feature on every vendor. The differentiator is depth — LangFuse and LangSmith have richer organisational hierarchies, Phoenix supports SAML SSO at enterprise tier, Helicone keeps it simple.
LangFuse / LangSmith depth

Vendor-neutral spans · portability
Phoenix is OTel-native by default; LangFuse offers an OTel exporter that covers most use cases. LangSmith and Helicone emit vendor-specific spans primarily, with partial OTel support. If portability matters on a 12-to-24-month horizon, weight this delta heavily.
Phoenix / LangFuse

Two of these deltas are worth re-reading. Eval integration is the feature most teams under-weight at decision time and most regret skipping at month three — when the inevitable "quality is fine, reliability is broken" conversation happens with no shared surface to triage from. Cost attribution per-user is the feature that goes from nice-to-have to non-negotiable the first time a runaway integration consumes a thousand dollars in tokens overnight.
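Catching that runaway integration is a small aggregation once cost is attributed per-user. A minimal sketch, assuming a per-trace cost export keyed by a user-id span attribute (all names illustrative):

```python
from collections import defaultdict

def hot_spot_users(trace_costs, sigma=3.0):
    """Flag users whose token spend is an outlier versus the population.

    trace_costs: iterable of (user_id, usd_cost) pairs -- e.g. built from a
    per-trace export carrying a user-id span attribute. The statistical
    cut-off (mean + sigma * stddev) is a simple heuristic, not a vendor API.
    """
    spend = defaultdict(float)
    for user_id, cost in trace_costs:
        spend[user_id] += cost
    values = list(spend.values())
    mean = sum(values) / len(values)
    stddev = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    threshold = mean + sigma * stddev
    return {user: total for user, total in spend.items() if total > threshold}

# 99 well-behaved users at $1 and one runaway agent at $1,000:
traces = [(f"u{i}", 1.0) for i in range(99)] + [("runaway", 1000.0)]
print(hot_spot_users(traces))  # {'runaway': 1000.0}
```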
For agentic teams building their own audit framework against these deltas, our 60-point agent observability audit covers the per-axis questions in detail — and runs against any of the four vendors above without rewriting.
05 — Break-Even
Where managed pricing breaks.
The break-even calculation is straightforward in shape and painful in detail. Managed cost grows roughly linearly with trace volume past the free tier. Self-hosted cost has a high floor — the operational engineering attention you owe the stack — and a much shallower slope. The break-even point is where the two lines cross, and the band around that point is wider than vendor brochures suggest.
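Under that framing the crossover volume falls out of one line of algebra: a linear managed cost equals a floor-plus-shallow-slope self-hosted cost at V = floor / (slope gap). The rates below are placeholder assumptions, not quotes:

```python
def break_even_traces(managed_per_trace, ops_floor_annual, selfhost_per_trace):
    """Monthly trace volume where annual managed and self-hosted cost cross.

    Solves  12 * managed_per_trace * V
          = ops_floor_annual + 12 * selfhost_per_trace * V
    for V. All inputs are assumptions to replace with your own numbers.
    """
    slope_gap = 12 * (managed_per_trace - selfhost_per_trace)
    if slope_gap <= 0:
        return float("inf")  # managed is never more expensive per trace
    return ops_floor_annual / slope_gap

# $0.0005/trace managed vs ~$0.00001/trace self-hosted storage, with a
# $54k/year ops floor (~0.3 FTE) -> crossover near 9.2M traces/month.
print(f"{break_even_traces(0.0005, 54_000, 0.00001):,.0f}")
```

Note how sensitive the answer is to the ops floor: halve the FTE commitment and the crossover halves too, which is why the band around the break-even point is wider than vendor brochures suggest.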
Managed wins
engineering attention is the binding constraint

Paid managed plans cost less than the senior-engineer hours self-hosting consumes when traffic is still modest. The decision is not pricing — it is feature fit (see Section 04) and ergonomics for the on-call team. Pick the managed vendor whose feature set matches your stack and move on.
Default at this scale

Middle band · case-by-case
feature deltas dominate the rate card

Pricing depends heavily on eval sample rates, retention window, multi-tenant requirements, and the cost of an in-house operational engineer in your geography. No universal answer. Model both options seriously; the right pick often depends on feature deltas that have nothing to do with rate cards.
Audit the use case

Self-hosted wins
linear managed pricing meets sub-linear storage cost

Managed pricing keeps scaling linearly; storage on a self-hosted LangFuse or Phoenix instance grows much more slowly. Operational headcount amortises across enough traffic to be defensible. Most teams crossing this threshold also have the engineering depth to run the stack — both shifts happen at similar volumes.
Default at this scale

A few practical break-even considerations the cost model usually omits. First, retention policy moves the line — long retention windows favour self-hosted (where storage is cheap) over managed (where retention is often a paid multiplier). Second, regional data residency requirements often force self-hosted regardless of cost — managed vendors may not offer your jurisdiction. Third, the cost of switching vendors is real but bounded — OpenTelemetry semantic conventions are stabilising enough that re-platforming costs roughly two engineer-weeks once instrumentation is OTel-shaped.
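Keeping instrumentation OTel-shaped is mostly attribute discipline. A minimal sketch using the OpenTelemetry GenAI semantic-convention attribute names (the conventions are still incubating, so verify the names against the current spec before standardising on them; the tenant key below is purely illustrative, not part of the conventions):

```python
# Vendor-neutral span attributes following the OpenTelemetry GenAI
# semantic conventions. Attribute names should be verified against the
# current spec -- the conventions are still incubating.

def genai_span_attributes(model, input_tokens, output_tokens,
                          user_id=None, tenant_id=None):
    attrs = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Per-user / per-tenant attribution rides on the same span. The tenant
    # key is an illustrative app-specific name, not a GenAI convention.
    if user_id is not None:
        attrs["user.id"] = user_id
    if tenant_id is not None:
        attrs["app.tenant.id"] = tenant_id
    return attrs

print(genai_span_attributes("gpt-4o", 512, 128, user_id="u-42"))
```

Spans shaped like this land cleanly in Phoenix, export from LangFuse, and keep the two-engineer-week re-platforming estimate honest.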
One pattern worth naming: teams often over-buy retention. Ninety-day hot retention sounds reassuring; in practice most incident response operates on traces from the last seven days and most compliance lookups can be served from cold archival. Tiered retention — short hot window, longer cold archive in S3 or equivalent — meaningfully lowers managed bills without sacrificing operational utility.
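The savings from tiering are easy to estimate. The rates below are placeholder assumptions (the cold rate is in the neighbourhood of commodity object-storage pricing); substitute your actual storage prices:

```python
def retention_cost(monthly_gb, hot_days, cold_days,
                   hot_gb_month=0.50, cold_gb_month=0.023):
    """Monthly storage cost for a hot window plus an S3-style cold archive.

    Rates are placeholder assumptions, not quotes. Days are converted to a
    fraction of a 30-day month of retained volume -- a rough steady-state
    approximation.
    """
    hot = monthly_gb * (hot_days / 30) * hot_gb_month
    cold = monthly_gb * (cold_days / 30) * cold_gb_month
    return hot + cold

# 100 GB of traces per month: 90 hot days vs 7 hot + 83 cold.
print(retention_cost(100, 90, 0))   # all-hot
print(retention_cost(100, 7, 83))   # tiered: roughly an 8x reduction
```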
"The break-even point on observability is where managed pricing meets self-hosted ops headcount. Both numbers are usually underestimated; the line is where the underestimates cancel out."
— Agentic engineering · TCO modelling lesson
06 — Self-Host
When LangFuse + Phoenix win.
Self-hosting is a real option only on LangFuse and Phoenix among the four — LangSmith is managed-only at the depth that matters, and Helicone's proxy model is fundamentally a hosted product. The trade-off when you self-host is the classic one: you trade vendor lock-in and per-trace billing for operational responsibility and a meaningful infrastructure footprint.
The right time to consider self-hosting is when at least one of three conditions holds. First, sovereignty — your data cannot leave your infrastructure for compliance, regulatory, or contractual reasons. Second, scale — your trace volume has crossed the band where managed pricing dominates and operational headcount is defensible. Third, depth of control — you need observability semantics or retention behaviour the managed product does not expose.
Data cannot leave your infrastructure
Healthcare, financial services, defence, EU-bound PII — any context where customer data crossing a managed vendor boundary is a regulatory or contractual non-starter. Self-hosted LangFuse is the most common pick here; Phoenix is the alternative when ML-ops heritage matters.
Compliance-driven

Above the break-even line
Past roughly ten million monthly traces, managed pricing keeps scaling linearly while self-hosted storage cost flattens. Operational headcount amortises across the traffic. Most teams crossing this threshold also have the engineering depth to run the stack confidently.
Cost-driven

Custom semantics or retention
Long retention windows, unusual span attributes, custom eval pipelines, or sampling logic the managed product does not support. Niche but real — and one of the few reasons to self-host even at modest scale. Open-source means you can modify the stack to match your needs.
Engineering-driven

The operational cost of self-hosting is the line item most teams underestimate. Plan on 0.2 to 0.5 FTE of senior engineering attention at meaningful scale — patching, upgrading, scaling Postgres or ClickHouse, backup verification, on-call for the observability stack itself. That cost is invisible until something breaks during an incident response, at which point it is suddenly the most important resource on the team.
A reasonable middle path exists. Run managed for the first twelve months while traffic builds and the team learns the tool. Re-evaluate at the second anniversary using real cost data and a real understanding of operational complexity. The cost of staying managed for an extra year is far smaller than the cost of self-hosting prematurely and discovering at month three that the operational burden is heavier than projected.
07 — Recommendations
Three team archetypes, three picks.
The right pick depends on stack, scale, and operational appetite. Three archetypes cover the majority of teams asking this question in 2026. Find the closest match; the recommendation is a starting point, not a verdict — and the audit work in our AI transformation engagements covers the per-stack details before any vendor commitment.
LangChain shop · small to mid volume
Orchestration is LangChain or LangGraph. Volume is below one million monthly traces. The team wants a managed surface with the lowest setup cost. Pick LangSmith — first-party integration, inline evals, and the shortest path to a working observability surface.
Pick LangSmith

Multi-framework or sovereignty-bound
Stack mixes LangChain, raw SDKs, custom orchestration. Or compliance demands self-hosting. Or the team wants one observability surface across multiple LLM frameworks. Pick LangFuse — vendor-neutral SDK, both managed and self-host paths, strong eval framework.
Pick LangFuse

Enterprise scale · OTel-first discipline
Volume past ten million monthly traces. OpenTelemetry semantic conventions are a hard requirement. Drift detection and ML-ops heritage matter. Pick Phoenix — OTel-native by default, mature drift detection, self-host path for scale, Arize lineage for ML-ops depth.
Pick Phoenix

Non-agentic LLM apps · proxy-first
Single-shot LLM calls without deep agent fan-out. Want instant cost tracking and rate limiting without SDK changes. Pick Helicone for the on-ramp; revisit at one million traces per month or when agentic complexity grows beyond what a proxy can capture.
Pick Helicone

One closing observation on vendor commitment. The observability market is moving fast enough that a three-year commitment is almost never the right shape — annual renewals with a defined OpenTelemetry-shaped instrumentation contract keep optionality alive. The vendor that wins today may not be the vendor that wins in eighteen months, and the cost of staying portable is small relative to the cost of being locked into a tool whose feature set falls behind.
For teams running their first observability commitment, the most useful exercise is not picking the right vendor — it is running the 60-point observability audit against whatever is already installed. The audit surfaces the gaps that matter; the vendor choice falls out of the gap analysis. For teams thinking through agentic security alongside observability, the MCP server security audit is the natural companion piece on the same operational axis.
Observability vendor choice is decided by volume × ops appetite — not by the brochure.
Four vendors, three volume tiers, six feature deltas. The interesting thing about the analysis is that the answer at one thousand monthly traces is almost never the answer at ten million — and the operational appetite of the team matters as much as the rate card. Managed wins at small and medium volume; self-hosted wins at scale; the middle band is decided case-by-case on feature deltas and retention policy.
The trajectory through the rest of 2026 is twofold. First, OpenTelemetry semantic conventions for GenAI continue to stabilise, which keeps cross-vendor migration cost bounded and makes vendor lock-in less catastrophic than it was a year ago. Second, eval integration migrates from a differentiator to a baseline expectation — the gap between vendors on that axis narrows, and the gap between teams that integrate evals inline and teams that do not widens.
One closing thought. The right way to make the vendor decision is to run the audit against the stack you have, let the gap analysis drive the requirements, and treat the rate card as the last input rather than the first. Teams that start from price end up picking the cheapest vendor that covers their current scale; teams that start from operational requirements end up picking the vendor that survives twelve months of growth.