Agent observability is the 2026 production-deployment necessity that most teams underestimated. Workflows that worked in dev fail in prod for reasons traditional APM doesn't surface — model drift, tool-call retry loops, prompt regressions on framework upgrades, cost spikes from runaway loops. Six platforms have emerged to handle this differently than classic APM.
We compare six platforms across seven axes: framework integration depth, eval rigor, deployment model, install effort, broader-APM integration, pricing model, and best-fit team shape. Most production teams pick a primary observability platform (LangSmith, Langfuse, or Arize Phoenix) and pair it with a broader infrastructure observability layer (Datadog, Honeycomb, New Relic) for whole-stack coverage.
This post covers the 7-axis matrix, deep dives on the three primary platforms, the specialist tier (Helicone, Datadog, Honeycomb), and four reference workflows we run for engineering teams today — prototype debugging, prod regression watch, eval-driven dev, and incident root-cause.
- 01 — LangSmith wins LangChain/LangGraph stacks: the deepest framework integration in the field. Purpose-built by the LangChain team, its traces include node-by-node state diffs, full agent execution graphs, model and tool-call breakdowns, and replay against new model versions. For teams building on LangGraph or LangChain, LangSmith is the path of least friction. Pricing is generous on the free tier and usage-priced beyond.
- 02 — Langfuse is the open-source leader: self-hostable and framework-agnostic. Self-hostable on Postgres + ClickHouse, it supports any LLM SDK or agent framework via OpenTelemetry traces, with a cloud tier from $59/seat for teams that don't want to self-host. Right pick for teams that want observability without framework lock-in or that have data-residency requirements.
- 03 — Arize Phoenix wins on eval rigor: ML-grade evaluation primitives for the agent era. Arize built ML observability before LLMs were a thing, and its LLM observability story benefits from that rigor. Phoenix (the open-source layer) ships eval primitives, drift detection, and embeddings analysis that other platforms approximate with weaker statistics. Pairs naturally with Arize cloud for enterprise deployments. Right pick when eval rigor matters more than framework integration.
- 04 — Helicone is the drop-in proxy: simplest install, no SDK changes. Helicone routes LLM API calls through its proxy, capturing observability without SDK changes: change one base URL, get traces. The trade-off is shallower depth than the platform-native integrations; traces sit at the API-call level, not the agent-execution level. Right when minimal install effort is the dominant constraint.
- 05 — Pair the LLM observability platform with whole-stack APM (Datadog, Honeycomb, New Relic). LLM observability and infrastructure observability are different layers: the LLM platform (LangSmith, Langfuse, Arize) handles agent traces, evals, and LLM-specific metrics; the infra platform (Datadog, Honeycomb, New Relic) handles host metrics, app errors, request traces, and deployment health. Most production deployments need both. Datadog ships an LLM Observability product for shops already on Datadog — it pays back when integration consistency matters more than depth.
01 — The Field
The 2026 observability field.
Agent observability emerged as a distinct category in 2023-2024 when production teams realized classic APM (Datadog, New Relic, Honeycomb) wasn't surfacing the failure modes that LLM-driven workflows actually hit. The first wave of platforms (LangSmith, Langfuse, Helicone) shipped LLM-native traces and prompt management. The second wave (Arize Phoenix, Datadog LLM Observability, Honeycomb's LLM tracing) added eval rigor and integration with broader infrastructure observability.
By April 2026 six platforms own the production conversation. The decision dimensions: framework lock-in, deployment model (cloud vs self-host), eval rigor, install effort, integration with broader APM, pricing model, and team shape. Most teams pick one primary platform and pair it with a whole-stack APM for infrastructure-layer coverage.
LangSmith — LangChain-native depth
Cloud · LangChain + LangGraph integration · usage-priced
The deepest framework integration. Built by the LangChain team for LangChain and LangGraph. Trace depth includes node-by-node state diffs, full agent graphs, replay-against-new-models. Right primary for any LangChain or LangGraph stack.
Framework-native
Langfuse — open-source leader
Self-host or cloud · framework-agnostic · OTel-friendly
Open-source observability platform. Self-hostable on Postgres + ClickHouse; cloud tier from $59/seat. Framework-agnostic via OpenTelemetry. Right pick for teams that want OSS, self-host, or framework-agnostic coverage.
Open-source default
Arize Phoenix — eval-rigor leader
Open-source + cloud · ML-grade primitives
ML-grade observability heritage applied to LLMs. Phoenix (OSS) ships eval primitives, drift detection, embeddings analysis. Arize cloud adds enterprise scale. Right pick when eval rigor matters more than framework integration.
Eval rigor
Helicone — drop-in proxy
Proxy-based · zero SDK changes · simplest install
Routes LLM API calls through its proxy. Change one base URL, get traces. Simplest install in the field. Trace depth is at the API-call level, not the agent-execution level. Right when install simplicity dominates.
Simplest install
Datadog LLM Observability
Datadog-native · enterprise APM bundle
Datadog's LLM observability product. Pays back when the team is already on Datadog and wants integrated LLM + infra observability without adding a separate platform. $31+/host/mo + LLM volume add-on.
Datadog shops
Honeycomb LLM Observability
Event-based · deep tracing · query-driven
Honeycomb's LLM extension of its event-based observability. Deep tracing primitives, query-driven exploration. Pays back when the team already runs Honeycomb and values event-based depth over LLM-specific dashboards.
Honeycomb shops

02 — Matrix
Feature matrix, six platforms.
The matrix below covers the seven capabilities that drive 2026 observability decisions: framework integration depth, eval rigor, deployment model, install effort, broader-APM integration, pricing model, and best-fit team shape.
Framework integration depth
LangSmith wins for LangChain + LangGraph (deepest). Langfuse covers any framework via OTel — broad but shallower per-framework. Arize Phoenix integrates well with LlamaIndex, OpenAI Agents SDK. Helicone is API-call level (framework-agnostic but shallower). Datadog + Honeycomb match their broader OTel patterns.
LangSmith (LangChain) · Langfuse (broad)
Eval rigor (drift, scoring, regression)
Arize Phoenix wins. Built on ML-observability heritage; eval primitives are deeper than competitors. LangSmith and Langfuse ship strong eval features but Arize's depth is meaningful for regulated or accuracy-critical workloads.
Arize Phoenix
Deployment model (self-host vs cloud)
Langfuse is the self-host leader (Postgres + ClickHouse, fully OSS-compatible). Arize Phoenix self-hosts as OSS; Arize cloud is enterprise. LangSmith is cloud-only (with VPC-scope enterprise tier). Helicone is proxy-cloud. Datadog + Honeycomb cloud-hosted.
Langfuse · Arize (self-host)
Install effort (time to first trace)
Helicone wins (~5 min via proxy URL change). LangSmith ~15 min for LangChain stacks (instrumentation auto-enabled). Langfuse ~30 min cloud / ~60 min self-host. Arize Phoenix ~30 min. Datadog + Honeycomb depend on existing setup.
Helicone (simplest)
Broader-APM integration
Datadog + Honeycomb are themselves APM platforms — full-stack integration native. LangSmith, Langfuse, Arize integrate with broader APM via OTel + webhooks but pair best with their own primary surfaces. Helicone's proxy traces are easy to forward to APM.
Datadog · Honeycomb
Pricing model
Langfuse self-host is free (compute + storage costs only). Helicone has generous free tier; usage-priced. LangSmith free tier + usage-priced. Arize Phoenix OSS free; cloud enterprise contracts. Datadog $31+/host + LLM add-on. Honeycomb usage-priced.
Langfuse self-host (cheapest)
Best-fit team shape
LangSmith: LangChain/LangGraph engineering teams. Langfuse: OSS-preference + self-host teams. Arize Phoenix: ML-heritage teams that need eval rigor. Helicone: small/early teams that need fastest install. Datadog: enterprises already on Datadog. Honeycomb: enterprises already on Honeycomb.
Match team shape

03 — LangSmith
LangSmith — the framework-native leader.
LangSmith is purpose-built for LangChain and LangGraph. Built by the LangChain team, the integration is the deepest in the field — traces include node-by-node state diffs, full agent execution graphs, model + tool call breakdowns, and replay against new model versions. For teams building on LangChain or LangGraph, LangSmith is the path of least friction.
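A minimal sketch of what first-trace setup looks like, assuming the current langsmith Python SDK and its documented environment-variable conventions (variable names and decorator usage are from memory of the docs; verify for your SDK version):

```python
import os

# Env-var names per the LangSmith docs (assumed current; verify for your SDK).
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-key>"
os.environ["LANGCHAIN_PROJECT"] = "agent-prod"  # traces group under this project

# LangChain/LangGraph calls are traced automatically once the env vars are set.
# Non-LangChain code can opt in with the @traceable decorator:
from langsmith import traceable

@traceable(name="rerank-step")
def rerank(candidates: list[str]) -> list[str]:
    # Shows up as a child span inside the agent's execution trace.
    return sorted(candidates)
```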
Deepest LangGraph integration
Trace depth includes node-by-node state diffs, conditional edge transitions, retry timelines, human-in-the-loop interrupt timing. The integration was co-designed with LangGraph — features ship in lock-step. Right primary for any production LangGraph deployment.
LangGraph-native
Replay-against-new-models
Capture traces in production; replay against new model versions to test regressions before deploying. The eval surface integrates tightly with traces — production behavior becomes the eval dataset. Among the strongest eval workflows in the field.
Eval workflow
Cloud-only deployment
LangSmith is cloud-hosted by default. Enterprise tier supports VPC-scope deployment. Right when cloud-hosted is acceptable; less compelling when data-residency requires self-host. Langfuse is the alternative for self-host requirements.
Cloud-only"If you're on LangGraph, LangSmith is the right choice. If you're framework-agnostic, Langfuse. If eval rigor is the priority, Arize Phoenix."— Internal observability stack retro, March 2026
04 — Langfuse
Langfuse — the open-source leader.
Langfuse leads on the open-source observability path. Fully self-hostable on Postgres + ClickHouse, framework-agnostic via OpenTelemetry, with a cloud tier ($59+/seat) for teams that don't want to self-host. Right pick for teams that want observability without framework lock-in, that have data-residency requirements, or that prefer open-source ownership.
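Because ingestion is OTel-based, instrumentation can be plain OpenTelemetry. A minimal sketch, assuming Langfuse's OTLP endpoint path and Basic-auth scheme (both assumptions to verify against the Langfuse docs for your deployment):

```python
import base64
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Endpoint path and Basic-auth scheme are assumptions -- confirm against the
# Langfuse docs for your deployment (cloud vs self-host).
auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://cloud.langfuse.com/api/public/otel/v1/traces",
            headers={"Authorization": f"Basic {auth}"},
        )
    )
)
trace.set_tracer_provider(provider)

# Any framework that emits OTel spans now lands in Langfuse.
tracer = trace.get_tracer("agent")
with tracer.start_as_current_span("tool-call") as span:
    span.set_attribute("llm.model", "claude-sonnet-4")  # example attribute
```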
Fully self-hostable
Postgres + ClickHouse stack; deploys cleanly to Kubernetes or managed-Postgres setups. Self-host means $0 platform cost (compute + storage only) and full data control. Strong for data-residency requirements and teams that prefer OSS ownership.
Self-host default
Framework-agnostic via OpenTelemetry
Native OTel support means any LLM SDK or agent framework can instrument into Langfuse. Works with LangGraph, Mastra, OpenAI Agents SDK, raw Anthropic SDK, custom code. The OTel-first design is the differentiator vs platform-native peers.
OTel-native
Per-framework depth gap
Langfuse covers any framework but doesn't match LangSmith's depth on LangChain/LangGraph. The OTel-driven approach is broad but shallower per-framework. Right pick when breadth matters more than depth; less ideal when one specific framework is the dominant surface.
Breadth over depth

05 — Arize Phoenix
Arize Phoenix — the eval-rigor leader.
Arize built ML observability before LLMs were a thing. The LLM observability story benefits from that ML rigor. Phoenix (the open-source layer) ships eval primitives, drift detection, and embeddings analysis that other platforms approximate with weaker statistics. Pairs naturally with Arize cloud for enterprise deployments.
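A minimal sketch of standing up the OSS layer locally; the phoenix.otel register helper and the OpenInference instrumentor package are assumptions based on recent Phoenix releases, so check the Phoenix docs for your version:

```python
import phoenix as px
from phoenix.otel import register  # assumed helper in recent Phoenix releases
from openinference.instrumentation.openai import OpenAIInstrumentor  # assumed package

px.launch_app()               # local Phoenix UI, typically http://localhost:6006
tracer_provider = register()  # routes OTel spans to the local Phoenix collector
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here, OpenAI SDK calls appear as traces, and Phoenix's eval and
# embeddings-analysis tooling runs over the captured spans.
```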
ML-grade evaluation primitives
Built on Arize's ML-observability heritage. Eval primitives, drift detection, embeddings analysis are deeper than competitors' equivalents. Right pick when eval rigor matters most — regulated industries, accuracy-critical workloads, model-comparison studies.
Eval depth
Phoenix open-source layer
Phoenix is the open-source layer — full eval + tracing primitives without licensing. Pairs with Arize cloud for enterprise scale. The OSS option is more capable than typical OSS observability tools because the rigor came from a paid ML-observability product.
OSS + cloud
Less LLM-specific dashboard polish
Arize's UI is shaped by ML-observability heritage; the LLM-specific dashboard polish is improving but trails LangSmith and Langfuse's LLM-native presentations. Right pick when eval rigor dominates; less ideal when LLM-specific dashboard ergonomics matter most.
ML-heritage UI

06 — Specialists
Helicone, Datadog, Honeycomb — specialist tier.
Three specialists occupy distinct niches that the primary platforms don't cover as well. Helicone wins on install simplicity (proxy-based, zero SDK changes). Datadog LLM Observability wins for Datadog-native shops (integrated LLM + infra). Honeycomb LLM Observability wins for Honeycomb-native shops (event-based deep tracing).
Helicone — drop-in proxy · simplest install
Routes LLM API calls through its proxy. Change one base URL, get traces. Simplest install in the field (~5 min). Trade-off is shallower depth — traces are at the API-call level, not agent execution level. Right when minimal install effort dominates; a minimal install sketch follows this card.
Simplest install
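What the one-URL install looks like in practice, as a sketch: the gateway URL and auth header name are assumptions to verify against Helicone's current docs.

```python
import os

from openai import OpenAI

# Gateway URL and header name are assumptions from Helicone's docs -- verify.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # proxy in front of api.openai.com
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

# Application code is unchanged; the proxy records the trace in passing.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```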
Datadog LLM Observability — Datadog-native · enterprise APM bundle
Datadog's LLM observability product. Pays back when the team is already on Datadog and wants integrated LLM + infra observability. Trace depth is good but not as deep as LangSmith for LangGraph stacks. $31+/host/mo + LLM volume add-on.
Datadog enterprises
Honeycomb LLM Observability — event-based · deep tracing · query-driven
Honeycomb's LLM extension of event-based observability. Deep tracing primitives, query-driven exploration. Pays back when the team already runs Honeycomb and values event-based depth over LLM-specific dashboards.
Honeycomb enterprises

07 — Reference Workflows
Four reference workflows.
Below are four observability workflows we run for engineering teams in client engagements, with the platform recommendation that consistently wins on each.
Prototype debugging (early-stage agent dev)
Fastest install, lowest friction. Helicone if the team is just standing up; LangSmith if the team is already on LangChain/LangGraph; Langfuse cloud if the team prefers framework-agnostic. The decision criterion: which platform gets the team to first useful trace fastest.
Helicone · LangSmith · Langfuse
Prod regression watch (deployment-day monitoring)
Capture every prod request; alert on regressions in latency, cost, output quality. Pair LangSmith or Langfuse for LLM-layer regressions with Datadog or Honeycomb for infra-layer regressions. The two-layer pattern catches both types of failures.
LangSmith/Langfuse + Datadog/Honeycomb
Eval-driven dev (regression-gated deploys)
Build eval datasets from prod traces; run them on every commit; gate deploys on pass rate. LangSmith excels for LangGraph workflows (replay against new model versions). Arize Phoenix excels for eval rigor across any framework. Pick by which axis matters more; a minimal gate sketch follows these workflow cards.
LangSmith (LangGraph) · Arize (rigor)
Incident root-cause (production failure investigation)
Trace depth is the dominant variable. LangSmith for LangGraph stacks (state diffs are diagnostic gold). Langfuse for framework-agnostic. Honeycomb for event-based exploration when LLM-specific dashboards aren't enough. Datadog when infra-layer correlation matters most.
Match by stack
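Whatever the platform, the eval-driven workflow above reduces to a small CI gate. A platform-neutral sketch; run_eval_suite is a hypothetical stand-in for your platform's eval runner:

```python
import sys

PASS_RATE_THRESHOLD = 0.95  # tune per workload; accuracy-critical flows run higher

def run_eval_suite() -> list[bool]:
    """Hypothetical stand-in: replay the eval dataset built from prod traces
    against the candidate build and return per-case pass/fail."""
    # Wire this to your platform's eval runner (LangSmith, Langfuse, Phoenix).
    return [True, True, False, True]  # placeholder results

def main() -> None:
    results = run_eval_suite()
    pass_rate = sum(results) / len(results)
    print(f"eval pass rate: {pass_rate:.2%} over {len(results)} cases")
    if pass_rate < PASS_RATE_THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI job and blocks the deploy

if __name__ == "__main__":
    main()
```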
08 — Conclusion
Pair LLM observability with whole-stack APM.

There is no single best observability platform. There are right defaults per framework and team shape.
By April 2026 the agent-observability field has consolidated to six production-grade platforms: LangSmith, Langfuse, Arize Phoenix, Helicone, Datadog LLM Observability, and Honeycomb LLM Observability. Each occupies a different spot on the trade-off surface, and each wins on its home territory. There is no "best" platform in the abstract; there is the right default for the framework and team shape.
The pattern that scales: pair an LLM-native observability platform with whole-stack APM. The LLM platform (LangSmith, Langfuse, or Arize) handles agent traces, eval, and LLM-specific metrics. The APM (Datadog, Honeycomb, New Relic) handles host metrics, app errors, deployment health. Most production deployments need both — LLM observability for agent debugging, infra observability for whole-stack health.
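Where both layers ingest OTLP, the pairing can be wired once at the SDK level by fanning spans out to two exporters, extending the TracerProvider pattern from the Langfuse sketch. The endpoints below are placeholders; real backends each have their own ingest config and auth:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# LLM-observability backend: agent traces, evals, token/cost metrics.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://llm-obs.example.com/v1/traces")  # placeholder
))

# Infra APM backend: request traces correlated with host and app telemetry.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://apm.example.com/v1/traces")  # placeholder
))

trace.set_tracer_provider(provider)  # every span now reaches both layers
```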
The right move for most engineering teams: pick the LLM observability platform by framework + team shape (LangSmith for LangChain/LangGraph; Langfuse for OSS or self-host; Arize for eval rigor) and pair it with the team's existing APM. Don't try to make one platform do both jobs — the layered pattern is more reliable in incident response and easier to operate.