AI agent observability is the practice of tracing, monitoring, and evaluating autonomous agents in production — capturing every model call, tool execution, and reasoning step as structured spans so you can answer the one question that matters when something goes wrong: why did the agent do that? In 2026 it has become a discipline of its own, with a vendor-neutral standard and a fast-consolidating market of platforms behind it.

The reason agents need their own observability layer is that they fail differently from ordinary software. A traditional service either returns a 200 or throws an error. An agent can return a confident, well-formed, completely wrong answer — having made three unnecessary tool calls and one syntactically valid action that did the wrong thing. Binary pass/fail monitoring is blind to all of it. You need step-level traces.

This guide covers what to actually log, trace, and alert on; the OpenTelemetry GenAI semantic conventions that are becoming the common language for agent telemetry; the new MCP tracing layer; and a practical comparison of seven observability platforms grouped by deployment model — self-hosted, managed SDK, and proxy gateway — so you can pick by cost, data residency, and the way you actually run your agents.

Key takeaways

01
OpenTelemetry GenAI conventions are the emerging standard — but still in Development.As of v1.41, the spec defines agent, workflow, tool, and model spans plus required latency and token-usage metrics. Critically, nearly all gen_ai.* attributes carry Development stability badges, so attribute names can change without a major version bump.
02
Agents fail in ways binary monitoring cannot see.The same input can trigger different tool sequences across runs, and outputs that look correct can be semantically wrong. Step-level tracing — not pass/fail health checks — is the minimum viable signal for an agent in production.
03
Pick your stack by deployment model first, features second.Self-hosted (Langfuse, Arize Phoenix) for data residency and cost control; managed SDK (LangSmith, Braintrust) for speed and built-in evals; proxy gateway (Helicone) for zero-code-change cost tracking. The deployment model usually decides the choice before the feature list does.
04
MCP tracing is the new frontier added in OTel v1.39.Model Context Protocol spans (mcp.method.name, mcp.session.id, mcp.protocol.version) enrich existing execute_tool spans rather than duplicating them — giving agent traces visibility into the tool layer that was previously a black box.
05
2026 is the consolidation year for LLM observability.ClickHouse acquired Langfuse in January and Braintrust raised an $80M Series B in February. The market is reportedly growing at a 30%+ CAGR, yet by early 2026 only about 15% of GenAI deployments instrument observability at all, per a Gartner figure.

01 — The ProblemWhy agents break differently.

A deterministic service is observable in the classic three pillars: metrics, logs, traces. You watch latency and error rates, you read the logs when something throws, and you trace a request across services. An agent breaks that model in two ways at once.

First, the same input does not always produce the same behavior. Temperature, retrieval results, and tool availability all shift the path the agent takes. The same prompt can trigger a different sequence of tool calls on two consecutive runs. That non-determinism makes a single "happy path" trace insufficient — you need to observe the distribution of behaviors, not one example.

Second, failure rarely surfaces as an error. The agent returns something. It is well-formed. It may even be plausible. The problem is that it is wrong, or it took an expensive detour to get there, or it called a tool it never needed. None of that trips a 500. This is why step-level tracing — recording each reasoning step, tool call, and model response as a nested span — is the foundational requirement, and why a health check that only reports "up" is close to useless for an agent.

"Agentic systems fail in ways that look like success: incorrect but well-formed outputs, unnecessary tool calls, or actions that are syntactically valid but semantically wrong."— Aryan Kargwal, PhD Candidate, Polytechnique Montréal

The practical consequence is that observability for agents has to capture intent and process, not just inputs and outputs. You want the reasoning trace, the tools considered, the tools actually invoked, the arguments passed, the responses returned, the tokens spent at each step, and the latency of each hop — all stitched into one hierarchical trace you can replay. Runtime tracing of this kind is the natural complement to offline agent evaluation frameworks, which catch regressions before deployment; tracing catches what production throws at you after.

02 — The StandardOpenTelemetry GenAI: vendor-neutral tracing.

The most important development in this space is not a product — it is a specification. The OpenTelemetry GenAI semantic conventions define a common vocabulary for AI telemetry: a standard set of gen_ai.* span and metric attributes that any instrumentation library can emit and any backend can ingest. Adopt them and you decouple your instrumentation from your vendor — you can switch observability platforms without re-instrumenting your agents.

The spec spans six layers: client (model-call) spans, agent and workflow spans, MCP conventions, semantic events, metrics, and provider-specific attributes. Two histogram metrics are effectively mandatory for any production deployment: gen_ai.client.operation.duration (latency in seconds) and gen_ai.client.token.usage (consumption in tokens, broken down by input and output). Those two signals are the floor — export them or you cannot reason about cost or speed.

The status trap most coverage misses

As of v1.41, the OpenTelemetry GenAI conventions are still in Development status — not Stable. Nearly every gen_ai.* attribute carries a Development badge (the exceptions being error.type, server.address, and server.port). In practice that means an attribute name like gen_ai.usage.input_tokens can change without a major version bump. The escape hatch: OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental enables dual-emission of both legacy (v1.36.0 and earlier) and current attribute names, so a transition does not silently break your dashboards.

Adoption is the encouraging part. For the most common providers, instrumentation is close to free: in Python, OpenAI tracing can be a single line — OpenAIInstrumentor().instrument() — after which semconv-compliant spans are produced automatically with no manual span creation. And the major backends already speak the convention: Datadog natively supports OTel GenAI conventions from v1.37 onward (announced December 1, 2025), mapping gen_ai.* attributes to its own LLM Observability schema automatically.

03 — Span ReferenceThe agent span quick-reference.

The spec defines four span operation types specifically for agents: create_agent, invoke_agent, invoke_workflow, and execute_tool. The subtle part is the span kind. An invoke_agent span is CLIENT when the agent runs remotely (for example an OpenAI Assistants API or AWS Bedrock Agent) and INTERNAL when it runs inside your own process (a LangChain or CrewAI agent). And in multi-agent systems, a single INTERNAL invoke_workflow span is the parent that wraps several invoke_agent children — that hierarchy is what lets you follow a task across agent handoffs in one trace.

The reference below consolidates span types that are otherwise spread across three separate pages of the specification.

Span operation

create_agent

Span kind

INTERNAL

When it fires

Agent definition / instantiation. Carries the agent name, model, and configuration. Fires once when the agent object is created, not per request.

Span operation

invoke_agent

Span kind

CLIENT

When it fires

Remote agent execution — OpenAI Assistants API, AWS Bedrock Agents. The agent runs on someone else's infrastructure; the span measures the round trip.

Span operation

invoke_agent

Span kind

INTERNAL

When it fires

Local framework execution — LangChain, CrewAI, LangGraph agents running inside your process. Parents the model-call and tool spans for that agent.

Span operation

invoke_workflow

Span kind

INTERNAL

When it fires

Multi-agent orchestration. One invoke_workflow parents multiple invoke_agent children — the structure that makes handoffs legible in a single trace.

Span operation

execute_tool

Span kind

INTERNAL

When it fires

A single tool / function call. Captures the tool name, arguments, and result. MCP instrumentation enriches this span rather than creating a duplicate.

Span operation

chat / inference

Span kind

CLIENT

When it fires

The model call itself. Required attributes include model and token usage; input.messages and output.messages are opt-in, not captured by default.

Span operation	Span kind	When it fires
`create_agent`	INTERNAL	Agent definition / instantiation. Carries the agent name, model, and configuration. Fires once when the agent object is created, not per request.
`invoke_agent`	CLIENT	Remote agent execution — OpenAI Assistants API, AWS Bedrock Agents. The agent runs on someone else's infrastructure; the span measures the round trip.
`invoke_agent`	INTERNAL	Local framework execution — LangChain, CrewAI, LangGraph agents running inside your process. Parents the model-call and tool spans for that agent.
`invoke_workflow`	INTERNAL	Multi-agent orchestration. One invoke_workflow parents multiple invoke_agent children — the structure that makes handoffs legible in a single trace.
`execute_tool`	INTERNAL	A single tool / function call. Captures the tool name, arguments, and result. MCP instrumentation enriches this span rather than creating a duplicate.
`chat / inference`	CLIENT	The model call itself. Required attributes include model and token usage; input.messages and output.messages are opt-in, not captured by default.

Privacy by default

Content capture is opt-in, not automatic. The spec defines three modes: not recorded (the default), stored on span attributes, or kept in external storage with only a reference URL on the span. Message bodies — gen_ai.input.messages and gen_ai.output.messages — are not captured unless you explicitly opt in. For production systems handling PII, the external- storage-plus-reference mode is the recommended pattern: you keep the trace structure for debugging without writing customer data into your telemetry pipeline.

04 — MCP TracingThe new layer: MCP span enrichment.

The tool layer used to be the black box in agent traces. You could see that a tool was called and what it returned, but the protocol mechanics underneath — which MCP method, which session, which protocol version — were invisible. OpenTelemetry closed that gap in v1.39, which added MCP semantic conventions with attributes including mcp.method.name, mcp.session.id, and mcp.protocol.version.

The clever design decision is how these attributes attach. When MCP instrumentation detects that an outer GenAI instrumentation already tracks the tool execution, it enriches the existing execute_tool span with the MCP attributes instead of creating a second, duplicate span. You get the protocol-level detail layered onto the tool span you already had — not a noisier trace. This matters for anyone building on Model Context Protocol tracing: the visibility into the tool layer is now standardized, so an agent that calls ten MCP servers can be traced as cleanly as one that calls a single local function.

OTel GenAI conventions

Current spec version

v1.41

Defines agent, workflow, tool, and model spans plus required latency and token metrics — still in Development status, so attribute names are not yet frozen.

Development

MCP spans added

Protocol-level tool visibility

v1.39

mcp.method.name, mcp.session.id, and mcp.protocol.version enrich existing execute_tool spans rather than duplicating them — the tool layer stops being a black box.

Enrich, don't duplicate

Required metrics

The minimum signals

gen_ai.client.operation.duration (latency) and gen_ai.client.token.usage (input/output tokens). Export these two histograms or you cannot reason about cost or speed.

Latency + tokens

05 — Deployment ModelsThree deployment models, three trade-offs.

Before you compare features, decide how the observability layer should be deployed — because that single choice eliminates most of the field. There are three architectures, and each makes a different trade between control, convenience, and risk.

Self-hosted

Run it yourself

Langfuse · Arize Phoenix

You host the platform. Best for data residency, sovereignty, and cost control at scale. Langfuse deploys via Docker Compose in minutes; Phoenix is built directly on OTLP. The cost is operational ownership.

Data stays in your perimeter

Managed SDK

Instrument and ship

LangSmith · Braintrust

You add an SDK; the vendor runs the backend, storage, and UI. Fastest path to step-level tracing plus built-in eval tooling. The trade is per-trace or per-span pricing and your data living on their infrastructure.

Fastest time to value

Proxy gateway

Route through a gateway

Helicone

Point your base URL at the gateway; it logs every request with near-zero code change and tracks cost across 300+ models. The architectural caveat: the gateway is a single point of failure for the whole fleet.

Zero-code-change cost tracking

The proxy single-point-of-failure

The proxy-gateway model has one architectural risk the SDK models do not: if the gateway goes down, every agent loses connectivity to every model provider at once. The class of risk is also different — gateways have a larger attack surface, and at least one gateway product has reportedly required a patch for a server-side request forgery vulnerability. Treat the gateway as critical infrastructure: budget for high availability, and verify any reported security advisory against an authoritative CVE database before drawing conclusions.

06 — Stack ComparisonThe 2026 observability stack compared.

The table below compares seven platforms across the dimensions that actually drive a selection in 2026 — deployment model, free tier, paid entry price, OpenTelemetry support, MCP tracing, and the funding or acquisition signal that tells you how durable the vendor is. Pricing and version figures are taken from each vendor's own documentation and should be re-checked before you commit; this market moves quickly.

Platform	Deployment	Free tier	Paid entry	OTel GenAI	2025–26 signal
Langfuse	Self-hosted / cloud	Hobby (self-host free)	Cloud paid tiers	Yes	Acquired by ClickHouse, Jan 2026
Arize Phoenix	Self-hosted	Open-source (free)	Cloud / Arize AX	Yes (OTLP-native)	~9.9k GitHub stars
LangSmith	Managed SDK	Developer (5K traces/mo)	Plus $39 / seat / mo	Yes	SmithDB: ~12× faster trace queries
Braintrust	Managed SDK	Starter (1M spans/mo)	Pro $249 / mo	Yes	$80M Series B, Feb 2026
Helicone	Proxy gateway	10K requests/mo	Usage-based	Via gateway	300+ models in cost repo
AgentOps	SDK	Open-source (free)	Cloud tiers	Yes	Time-travel replay debugging
Datadog LLM Obs.	Managed agent	40K LLM spans/mo	Pro $160 / mo	Yes (from v1.37)	Bills LLM spans only

Why the column choices matter

No standard comparison includes the acquisition-and-funding column — yet in 2026 it is a real selection signal. ClickHouse acquiring Langfuse tightens the self-hosting story around a ClickHouse database dependency; Braintrust's well-funded Series B signals a managed option that is unlikely to disappear. Pair that durability read with the OpenTelemetry column and you can choose a stack you will not have to rip out in a year.

"We built Langfuse on ClickHouse because LLM observability and evaluation is fundamentally a data problem. Now, as one team, we can deliver a tighter end-to-end product: faster ingestion, deeper evaluation, and a shorter path from a production issue to a measurable improvement."— Marc Klingen, CEO of Langfuse

A note on the open-source options, because their licenses differ in ways that matter for redistribution. Langfuse is MIT-licensed (excluding its enterprise ee folder) and self-hosts via Docker Compose, Kubernetes/Helm, or Terraform. Arize Phoenix uses Elastic License 2.0 and is built on OpenTelemetry with OpenInference instrumentation underneath; AgentOps is MIT-licensed and notable for time-travel debugging — replaying an agent session with point-in-time precision. Read the exact license text before you embed any of them in a commercial product.

07 — Cost & Eval GatesToken cost tracking and eval gates.

Cost observability for agents has a subtlety that pricing models expose unevenly. Consider Datadog: its LLM Observability free tier includes 40,000 LLM spans per month, and the Pro plan starts at $160 per month with 100,000 LLM spans. The detail that changes the math is that only LLM spans are billed — tool spans, embedding spans, retrieval spans, and agent spans are free. A highly agentic system that makes many tool calls but relatively few model calls can therefore be dramatically cheaper to observe on a span-class-aware model than on a flat per-span one.

Per-trace and per-span pricing models diverge fast at scale. LangSmith meters traces — a Developer tier free at 5,000 traces per month, Plus at $39 per seat per month, with overage around $2.50 per thousand traces at standard retention. Braintrust meters spans generously — a Starter tier free at one million spans per month, Pro at $249 per month. Helicone, as a gateway, meters requests — free at 10,000 per month — and computes cost across more than 300 models using its model-cost repository, integrating natively with the Vercel AI SDK. The unit of billing (trace vs span vs request) interacts with your agent's call pattern, so model your own traffic before assuming one is cheaper.

Free-tier volume by platform · note the differing units

Source: vendor pricing pages (units differ — span vs request vs trace)

Braintrust StarterFree tier · spans per month

1M spans

Datadog free tierFree tier · LLM spans per month

40K spans

Helicone free tierFree tier · requests per month

10K req

LangSmith DeveloperFree tier · traces per month

5K traces

Tracing is necessary but not sufficient. The mature pattern in 2026 is to pair runtime tracing with eval gates — automated scorers that grade agent outputs and can block a regression from shipping or flag a live quality drop. The managed platforms increasingly bundle this: LangSmith's natural-language trace assistant lets an engineer ask "why did the agent enter this loop?" and get an answer by analyzing the traces directly, while Braintrust pairs tracing with scorers for human and automated review. Tracing tells you what happened; eval gates tell you whether it was good — and you want both wired into the same pipeline. Security-sensitive deployments should also fold prompt injection detection into that gate, since a clean-looking trace can still hide an injected instruction.

"Teams have never had less conviction about what will fail next. When something does break, it has never been harder to explain why."— Ankur Goyal, CEO of Braintrust

08 — Market SignalsThe market — and the adoption gap.

Two things are true about LLM observability in 2026 at the same time: capital is flooding in, and most teams still are not using it. That gap is the most interesting signal in the space, and it is where the opportunity sits.

Market figures — treat as directional

One market-research estimate puts the LLM observability platform market at roughly $1.97B in 2025 growing to $2.69B in 2026 — a CAGR in the mid-30s percent. Separately, a Gartner figure reported in trade press holds that only about 15% of GenAI deployments instrument observability today, with a forecast that the share could reach 50% by 2028. Both classes of figure come from analyst firms and vendor citations known for wide ranges; read them as direction and momentum, not precision, and source the primary report before quoting an exact number in your own work.

The funding side is concrete. ClickHouse acquired Langfuse on January 16, 2026, as part of a $400M Series D that valued ClickHouse at $15B; at acquisition, Langfuse reported more than 2,000 paying customers and tens of millions of SDK installs per month, and its open-source licensing and self-hosting were stated to remain unchanged. A month later, on February 17, 2026, Braintrust raised an $80M Series B at an $800M valuation, led by Iconiq with participation from Andreessen Horowitz and others. Two of the most-watched names in the category took major capital events inside a single quarter — a clear consolidation signal.

Here is our read on the gap. When 85% of GenAI deployments run without observability while the tooling market grows at a 30%-plus clip, you are looking at a discipline that is being built faster than it is being adopted. The teams instrumenting now are buying an unfair advantage: when an agent misbehaves in production, they can answer "why" in minutes from a trace, while the uninstrumented majority is reduced to guessing and re-running. As agents move from pilots into revenue-bearing workflows, that asymmetry stops being a nice-to-have and becomes the difference between a fixable incident and an unexplained one.

Projecting forward, two forces should converge over the next year. The OpenTelemetry GenAI conventions will likely graduate toward stable status, which removes the last real objection to standardizing on a vendor-neutral layer. And the convergence already underway — OpenInference instrumentations emitting both their own and OTel attributes for backward compatibility — points to a near future where you instrument once and route the same telemetry to a self-hosted tool and a managed backend simultaneously. The likely winners are the platforms that make that dual-destination story painless.

09 — DecisionChoosing the right stack for your workload.

The decision compresses to a few questions about how you run agents and what constraints you carry. The matrix below maps the common situations to a recommended starting point — start there, instrument, and let your own traces tell you whether to move.

Data residency

Sovereignty or regulated data

If telemetry cannot leave your perimeter, self-host. Langfuse via Docker Compose or Kubernetes keeps traces inside your infrastructure; Phoenix on OTLP does the same. The cost is operational ownership of the stack.

Pick self-hosted (Langfuse / Phoenix)

Speed to value

Ship tracing this week

If the priority is step-level traces plus eval tooling with minimal setup, a managed SDK wins. LangSmith and Braintrust give you backend, storage, UI, and scorers out of the box — model your trace or span volume against their pricing first.

Pick managed SDK (LangSmith / Braintrust)

Cost visibility

Zero-code-change cost tracking

If the immediate need is to see and attribute model spend across many providers without re-instrumenting, route through a gateway. Helicone tracks cost across 300+ models — just accept the single-point-of-failure architecture and plan for HA.

Pick proxy gateway (Helicone)

Existing observability

Already on a major APM

If you already run Datadog for the rest of your stack, its LLM Observability extends the same panes — and bills only LLM spans, which favors tool-heavy agents. Consolidating telemetry in one place is often worth more than a marginally better point tool.

Extend your current APM

Whatever you pick, instrument against the OpenTelemetry GenAI conventions rather than a vendor-proprietary SDK wherever the platform supports it. That single discipline keeps the migration door open: if the market consolidates further, or your needs change, you re-point the exporter instead of re-instrumenting every agent. For teams deciding the architecture, our AI digital transformation engagements start with exactly this kind of stack evaluation — mapping your agent traffic, residency constraints, and budget to a concrete recommendation, and our web development team wires the instrumentation into your application so the traces flow from day one.

10 — ConclusionObservability is now part of the agent, not an add-on.

The shape of agent observability, mid-2026

If you cannot explain why an agent did something, you do not control it.

Agent observability stopped being optional the moment agents started making decisions in production. The failure mode that defines the discipline is the one that looks like success — a confident, wrong, expensive answer that no health check will ever catch. The only answer is step-level tracing: every reasoning step, tool call, and model response stitched into a trace you can replay and interrogate.

The good news is that the foundation is now standardized. The OpenTelemetry GenAI conventions give you a vendor-neutral vocabulary for agent, workflow, tool, and MCP spans — and although the spec is still in Development status, instrumenting against it today, with the dual-emit opt-in as a safety net, is the move that keeps your options open as the market consolidates around it.

So choose by deployment model first — self-hosted for residency, managed SDK for speed, proxy gateway for cost visibility — then by features, and instrument against the open standard regardless. With most GenAI deployments still flying blind and serious capital backing the category, the teams that wire in tracing and eval gates now will be the ones who can answer why while everyone else is still guessing.

AI Agent Observability: Tracing & Monitoring in 2026

01 — The ProblemWhy agents break differently.

02 — The StandardOpenTelemetry GenAI: vendor-neutral tracing.

03 — Span ReferenceThe agent span quick-reference.

04 — MCP TracingThe new layer: MCP span enrichment.

Current spec version

Protocol-level tool visibility

The minimum signals

05 — Deployment ModelsThree deployment models, three trade-offs.

Run it yourself

Instrument and ship

Route through a gateway

06 — Stack ComparisonThe 2026 observability stack compared.

07 — Cost & Eval GatesToken cost tracking and eval gates.

Free-tier volume by platform · note the differing units

08 — Market SignalsThe market — and the adoption gap.

09 — DecisionChoosing the right stack for your workload.

Sovereignty or regulated data

Ship tracing this week

Zero-code-change cost tracking

Already on a major APM

10 — ConclusionObservability is now part of the agent, not an add-on.

If you cannot explain why an agent did something, you do not control it.

Step-level tracing turns agent incidents from unexplained into fixable.

Agent observability engagements

The questions we get every week.

Continue exploring AI development.

Observability Stack TCO: LangSmith vs LangFuse vs Helicone

Agent Observability: LangSmith, Langfuse, Arize 2026

Agent Observability 2026: Evals, Traces, Cost Guide

Nous Hermes Blank Slate: Tighter Agent Tool Scoping