Agent observability is the 2026 production-deployment necessity that most teams underestimated. Workflows that worked in dev fail in prod for reasons traditional APM doesn't surface — model drift, tool-call retry loops, prompt regressions on framework upgrades, cost spikes from runaway loops. Six platforms have emerged to handle this differently than classic APM.
We compare six platforms across seven axes: framework integration depth, eval rigor, deployment model, install effort, broader-APM integration, pricing model, and best-fit team shape. Most production teams pick a primary observability platform (LangSmith, Langfuse, or Arize Phoenix) and pair it with a broader infrastructure observability layer (Datadog, Honeycomb, New Relic) for whole-stack coverage.
This post covers the 7-axis matrix, deep dives on the three primary platforms, the specialist tier (Helicone, Datadog, Honeycomb), and four reference workflows we run for engineering teams today — prototype debugging, prod regression watch, eval-driven dev, and incident root-cause.
- 01 — LangSmith wins LangChain/LangGraph stacks: the deepest framework integration in the field. Purpose-built by the LangChain team, its traces include node-by-node state diffs, full agent execution graphs, model and tool-call breakdowns, and replay against new model versions. For teams building on LangGraph or LangChain, LangSmith is the path of least friction. Pricing is generous on the free tier and usage-priced beyond.
- 02 — Langfuse is the open-source leader: self-hostable and framework-agnostic. Self-hostable on Postgres + ClickHouse, it supports any LLM SDK or agent framework via OpenTelemetry traces, with a cloud tier from $59/seat for teams that don't want to self-host. Right pick for teams that want observability without framework lock-in or that have data-residency requirements.
- 03 — Arize Phoenix wins on eval rigor: ML-grade evaluation primitives for the agent era. Arize built ML observability before LLMs were a thing, and its LLM observability story benefits from that rigor. Phoenix (the open-source layer) ships eval primitives, drift detection, and embeddings analysis that other platforms approximate with weaker statistics. Pairs naturally with Arize cloud for enterprise deployments. Right pick when eval rigor matters more than framework integration.
- 04 — Helicone is the drop-in proxy: simplest install, no SDK changes. Helicone routes LLM API calls through its proxy, capturing observability without SDK changes: change one base URL, get traces. The trade-off is shallower depth than the platform-native integrations; traces sit at the API-call level, not the agent-execution level. Right when minimal install effort is the dominant constraint.
- 05 — Pair the LLM observability platform with whole-stack APM (Datadog, Honeycomb, New Relic). LLM observability and infrastructure observability are different layers: the LLM platform (LangSmith, Langfuse, Arize) handles agent traces, evals, and LLM-specific metrics; the infra platform (Datadog, Honeycomb, New Relic) handles host metrics, app errors, request traces, and deployment health. Most production deployments need both. Datadog ships an LLM Observability product for shops already on Datadog — it pays back when integration consistency matters more than depth.
01 — The Field
The 2026 observability field.
Agent observability emerged as a distinct category in 2023-2024 when production teams realized classic APM (Datadog, New Relic, Honeycomb) wasn't surfacing the failure modes that LLM-driven workflows actually hit. The first wave of platforms (LangSmith, Langfuse, Helicone) shipped LLM-native traces and prompt management. The second wave (Arize Phoenix, Datadog LLM Observability, Honeycomb's LLM tracing) added eval rigor and integration with broader infrastructure observability.
By April 2026 six platforms own the production conversation. The decision dimensions: framework lock-in, deployment model (cloud vs self-host), eval rigor, install effort, integration with broader APM, pricing model, and team shape. Most teams pick one primary platform and pair it with a whole-stack APM for infrastructure-layer coverage.
LangSmith — LangChain-native depth
Cloud · LangChain + LangGraph integration · usage-priced
The deepest framework integration. Built by the LangChain team for LangChain and LangGraph. Trace depth includes node-by-node state diffs, full agent graphs, replay-against-new-models. Right primary for any LangChain or LangGraph stack.
Framework-native
Langfuse — open-source leader
Self-host or cloud · framework-agnostic · OTel-friendly
Open-source observability platform. Self-hostable on Postgres + ClickHouse; cloud tier from $59/seat. Framework-agnostic via OpenTelemetry. Right pick for teams that want OSS, self-host, or framework-agnostic coverage.
Open-source default
Arize Phoenix — eval-rigor leader
Open-source + cloud · ML-grade primitives
ML-grade observability heritage applied to LLMs. Phoenix (OSS) ships eval primitives, drift detection, embeddings analysis. Arize cloud adds enterprise scale. Right pick when eval rigor matters more than framework integration.
Eval rigor
Helicone — drop-in proxy
Proxy-based · zero SDK changes · simplest install
Routes LLM API calls through its proxy. Change one base URL, get traces. Simplest install in the field. Trace depth is at the API-call level, not the agent-execution level. Right when install simplicity dominates.
Simplest install
Datadog LLM Observability
Datadog-native · enterprise APM bundle
Datadog's LLM observability product. Pays back when the team is already on Datadog and wants integrated LLM + infra observability without adding a separate platform. $31+/host/mo + LLM volume add-on.
Datadog shops
Honeycomb LLM Observability
Event-based · deep tracing · query-driven
Honeycomb's LLM extension of its event-based observability. Deep tracing primitives, query-driven exploration. Pays back when the team already runs Honeycomb and values event-based depth over LLM-specific dashboards.
Honeycomb shops

02 — Matrix
Feature matrix, six platforms.
The matrix below covers the seven capabilities that drive 2026 observability decisions: framework integration depth, eval rigor, deployment model, install effort, broader-APM integration, pricing model, and best-fit team shape.
Framework integration depth
LangSmith wins for LangChain + LangGraph (deepest). Langfuse covers any framework via OTel — broad but shallower per-framework. Arize Phoenix integrates well with LlamaIndex, OpenAI Agents SDK. Helicone is API-call level (framework-agnostic but shallower). Datadog + Honeycomb match their broader OTel patterns.
LangSmith (LangChain) · Langfuse (broad)
Eval rigor (drift, scoring, regression)
Arize Phoenix wins. Built on ML-observability heritage; eval primitives are deeper than competitors. LangSmith and Langfuse ship strong eval features but Arize's depth is meaningful for regulated or accuracy-critical workloads.
Arize Phoenix
Deployment model (self-host vs cloud)
Langfuse is the self-host leader (Postgres + ClickHouse, fully OSS-compatible). Arize Phoenix self-hosts as OSS; Arize cloud is enterprise. LangSmith is cloud-only (with VPC-scope enterprise tier). Helicone is proxy-cloud. Datadog + Honeycomb cloud-hosted.
Langfuse · Arize (self-host)
Install effort (time to first trace)
Helicone wins (~5 min via proxy URL change). LangSmith ~15 min for LangChain stacks (instrumentation auto-enabled). Langfuse ~30 min cloud / ~60 min self-host. Arize Phoenix ~30 min. Datadog + Honeycomb depend on existing setup.
Helicone (simplest)
Broader-APM integration
Datadog + Honeycomb are themselves APM platforms — full-stack integration native. LangSmith, Langfuse, Arize integrate with broader APM via OTel + webhooks but pair best with their own primary surfaces. Helicone's proxy traces are easy to forward to APM.
Datadog · Honeycomb
Pricing model
Langfuse self-host is free (compute + storage costs only). Helicone has generous free tier; usage-priced. LangSmith free tier + usage-priced. Arize Phoenix OSS free; cloud enterprise contracts. Datadog $31+/host + LLM add-on. Honeycomb usage-priced.
Langfuse self-host (cheapest)
Best-fit team shape
LangSmith: LangChain/LangGraph engineering teams. Langfuse: OSS-preference + self-host teams. Arize Phoenix: ML-heritage teams that need eval rigor. Helicone: small/early teams that need fastest install. Datadog: enterprises already on Datadog. Honeycomb: enterprises already on Honeycomb.
Match team shape

03 — LangSmith
LangSmith — the framework-native leader.
LangSmith is purpose-built for LangChain and LangGraph. Built by the LangChain team, the integration is the deepest in the field — traces include node-by-node state diffs, full agent execution graphs, model + tool call breakdowns, and replay against new model versions. For teams building on LangChain or LangGraph, LangSmith is the path of least friction.
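A minimal sketch of what first-trace setup looks like, assuming the current langsmith Python SDK and its documented environment-variable conventions (variable names and decorator usage are from memory of the docs; verify for your SDK version):

```python
import os

# Env-var names per the LangSmith docs (assumed current; verify for your SDK).
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-key>"
os.environ["LANGCHAIN_PROJECT"] = "agent-prod"  # traces group under this project

# LangChain/LangGraph calls are traced automatically once the env vars are set.
# Non-LangChain code can opt in with the @traceable decorator:
from langsmith import traceable

@traceable(name="rerank-step")
def rerank(candidates: list[str]) -> list[str]:
    # Shows up as a child span inside the agent's execution trace.
    return sorted(candidates)
```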
Deepest LangGraph integration
Trace depth includes node-by-node state diffs, conditional edge transitions, retry timelines, human-in-the-loop interrupt timing. The integration was co-designed with LangGraph — features ship in lock-step. Right primary for any production LangGraph deployment.
LangGraph-native
Replay-against-new-models
Capture traces in production; replay against new model versions to test regressions before deploying. The eval surface integrates tightly with traces — production behavior becomes the eval dataset. Among the strongest eval workflows in the field.
Eval workflow
Cloud-only deployment
LangSmith is cloud-hosted by default. Enterprise tier supports VPC-scope deployment. Right when cloud-hosted is acceptable; less compelling when data-residency requires self-host. Langfuse is the alternative for self-host requirements.
Cloud-only"If you're on LangGraph, LangSmith is the right choice. If you're framework-agnostic, Langfuse. If eval rigor is the priority, Arize Phoenix."— Internal observability stack retro, March 2026
04 — Langfuse
Langfuse — the open-source leader.
Langfuse leads on the open-source observability path. Fully self-hostable on Postgres + ClickHouse, framework-agnostic via OpenTelemetry, with a cloud tier ($59+/seat) for teams that don't want to self-host. Right pick for teams that want observability without framework lock-in, that have data-residency requirements, or that prefer open-source ownership.
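Because ingestion is OTel-based, instrumentation can be plain OpenTelemetry. A minimal sketch, assuming Langfuse's OTLP endpoint path and Basic-auth scheme (both assumptions to verify against the Langfuse docs for your deployment):

```python
import base64
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Endpoint path and Basic-auth scheme are assumptions -- confirm against the
# Langfuse docs for your deployment (cloud vs self-host).
auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://cloud.langfuse.com/api/public/otel/v1/traces",
            headers={"Authorization": f"Basic {auth}"},
        )
    )
)
trace.set_tracer_provider(provider)

# Any framework that emits OTel spans now lands in Langfuse.
tracer = trace.get_tracer("agent")
with tracer.start_as_current_span("tool-call") as span:
    span.set_attribute("llm.model", "claude-sonnet-4")  # example attribute
```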
Fully self-hostable
Postgres + ClickHouse stack; deploys cleanly to Kubernetes or managed-Postgres setups. Self-host means $0 platform cost (compute + storage only) and full data control. Strong for data-residency requirements and teams that prefer OSS ownership.
Self-host default
Framework-agnostic via OpenTelemetry
Native OTel support means any LLM SDK or agent framework can instrument into Langfuse. Works with LangGraph, Mastra, OpenAI Agents SDK, raw Anthropic SDK, custom code. The OTel-first design is the differentiator vs platform-native peers.
OTel-native
Per-framework depth gap
Langfuse covers any framework but doesn't match LangSmith's depth on LangChain/LangGraph. The OTel-driven approach is broad but shallower per-framework. Right pick when breadth matters more than depth; less ideal when one specific framework is the dominant surface.
Breadth over depth

05 — Arize Phoenix
Arize Phoenix — the eval-rigor leader.
Arize built ML observability before LLMs were a thing. The LLM observability story benefits from that ML rigor. Phoenix (the open-source layer) ships eval primitives, drift detection, and embeddings analysis that other platforms approximate with weaker statistics. Pairs naturally with Arize cloud for enterprise deployments.
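A minimal sketch of standing up the OSS layer locally; the phoenix.otel register helper and the OpenInference instrumentor package are assumptions based on recent Phoenix releases, so check the Phoenix docs for your version:

```python
import phoenix as px
from phoenix.otel import register  # assumed helper in recent Phoenix releases
from openinference.instrumentation.openai import OpenAIInstrumentor  # assumed package

px.launch_app()               # local Phoenix UI, typically http://localhost:6006
tracer_provider = register()  # routes OTel spans to the local Phoenix collector
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here, OpenAI SDK calls appear as traces, and Phoenix's eval and
# embeddings-analysis tooling runs over the captured spans.
```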
ML-grade evaluation primitives
Built on Arize's ML-observability heritage. Eval primitives, drift detection, embeddings analysis are deeper than competitors' equivalents. Right pick when eval rigor matters most — regulated industries, accuracy-critical workloads, model-comparison studies.
Eval depth
Phoenix open-source layer
Phoenix is the open-source layer — full eval + tracing primitives without licensing. Pairs with Arize cloud for enterprise scale. The OSS option is more capable than typical OSS observability tools because the rigor came from a paid ML-observability product.
OSS + cloud
Less LLM-specific dashboard polish
Arize's UI is shaped by ML-observability heritage; the LLM-specific dashboard polish is improving but trails LangSmith and Langfuse's LLM-native presentations. Right pick when eval rigor dominates; less ideal when LLM-specific dashboard ergonomics matter most.
ML-heritage UI

06 — Specialists
Helicone, Datadog, Honeycomb — specialist tier.
Three specialists occupy distinct niches that the primary platforms don't cover as well. Helicone wins on install simplicity (proxy-based, zero SDK changes). Datadog LLM Observability wins for Datadog-native shops (integrated LLM + infra). Honeycomb LLM Observability wins for Honeycomb-native shops (event-based deep tracing).
Helicone — drop-in proxy · simplest install
Routes LLM API calls through its proxy. Change one base URL, get traces. Simplest install in the field (~5 min). Trade-off is shallower depth — traces are at the API-call level, not agent execution level. Right when minimal install effort dominates; a minimal install sketch follows this card.
Simplest install
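What the one-URL install looks like in practice, as a sketch: the gateway URL and auth header name are assumptions to verify against Helicone's current docs.

```python
import os

from openai import OpenAI

# Gateway URL and header name are assumptions from Helicone's docs -- verify.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # proxy in front of api.openai.com
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

# Application code is unchanged; the proxy records the trace in passing.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```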
Datadog LLM Observability — Datadog-native · enterprise APM bundle
Datadog's LLM observability product. Pays back when the team is already on Datadog and wants integrated LLM + infra observability. Trace depth is good but not as deep as LangSmith for LangGraph stacks. $31+/host/mo + LLM volume add-on.
Datadog enterprises
Honeycomb LLM Observability — event-based · deep tracing · query-driven
Honeycomb's LLM extension of event-based observability. Deep tracing primitives, query-driven exploration. Pays back when the team already runs Honeycomb and values event-based depth over LLM-specific dashboards.
Honeycomb enterprises

07 — Reference Workflows
Four reference workflows.
Below are four observability workflows we run for engineering teams in client engagements, with the platform recommendation that consistently wins on each.
Prototype debugging (early-stage agent dev)
Fastest install, lowest friction. Helicone if the team is just standing up; LangSmith if the team is already on LangChain/LangGraph; Langfuse cloud if the team prefers framework-agnostic. The decision criterion: which platform gets the team to first useful trace fastest.
Helicone · LangSmith · Langfuse
Prod regression watch (deployment-day monitoring)
Capture every prod request; alert on regressions in latency, cost, output quality. Pair LangSmith or Langfuse for LLM-layer regressions with Datadog or Honeycomb for infra-layer regressions. The two-layer pattern catches both types of failures.
LangSmith/Langfuse + Datadog/Honeycomb
Eval-driven dev (regression-gated deploys)
Build eval datasets from prod traces; run them on every commit; gate deploys on pass rate. LangSmith excels for LangGraph workflows (replay against new model versions). Arize Phoenix excels for eval rigor across any framework. Pick by which axis matters more; a minimal gate sketch follows these workflow cards.
LangSmith (LangGraph) · Arize (rigor)
Incident root-cause (production failure investigation)
Trace depth is the dominant variable. LangSmith for LangGraph stacks (state diffs are diagnostic gold). Langfuse for framework-agnostic. Honeycomb for event-based exploration when LLM-specific dashboards aren't enough. Datadog when infra-layer correlation matters most.
Match by stack
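Whatever the platform, the eval-driven workflow above reduces to a small CI gate. A platform-neutral sketch; run_eval_suite is a hypothetical stand-in for your platform's eval runner:

```python
import sys

PASS_RATE_THRESHOLD = 0.95  # tune per workload; accuracy-critical flows run higher

def run_eval_suite() -> list[bool]:
    """Hypothetical stand-in: replay the eval dataset built from prod traces
    against the candidate build and return per-case pass/fail."""
    # Wire this to your platform's eval runner (LangSmith, Langfuse, Phoenix).
    return [True, True, False, True]  # placeholder results

def main() -> None:
    results = run_eval_suite()
    pass_rate = sum(results) / len(results)
    print(f"eval pass rate: {pass_rate:.2%} over {len(results)} cases")
    if pass_rate < PASS_RATE_THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI job and blocks the deploy

if __name__ == "__main__":
    main()
```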
08 — Conclusion
Pair LLM observability with whole-stack APM.

There is no single best observability platform. There are right defaults per framework and team shape.
By April 2026 the agent-observability field has consolidated to six production-grade platforms: LangSmith, Langfuse, Arize Phoenix, Helicone, Datadog LLM Observability, and Honeycomb LLM Observability. Each occupies a different spot on the trade-off surface, and each wins on its home territory. There is no "best" platform in the abstract; there is the right default for the framework and team shape.
The pattern that scales: pair an LLM-native observability platform with whole-stack APM. The LLM platform (LangSmith, Langfuse, or Arize) handles agent traces, eval, and LLM-specific metrics. The APM (Datadog, Honeycomb, New Relic) handles host metrics, app errors, deployment health. Most production deployments need both — LLM observability for agent debugging, infra observability for whole-stack health.
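Where both layers ingest OTLP, the pairing can be wired once at the SDK level by fanning spans out to two exporters, extending the TracerProvider pattern from the Langfuse sketch. The endpoints below are placeholders; real backends each have their own ingest config and auth:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# LLM-observability backend: agent traces, evals, token/cost metrics.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://llm-obs.example.com/v1/traces")  # placeholder
))

# Infra APM backend: request traces correlated with host and app telemetry.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://apm.example.com/v1/traces")  # placeholder
))

trace.set_tracer_provider(provider)  # every span now reaches both layers
```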
The right move for most engineering teams: pick the LLM observability platform by framework + team shape (LangSmith for LangChain/LangGraph; Langfuse for OSS or self-host; Arize for eval rigor) and pair it with the team's existing APM. Don't try to make one platform do both jobs — the layered pattern is more reliable in incident response and easier to operate.