AI agent observability is the practice of tracing, monitoring, and evaluating autonomous agents in production — capturing every model call, tool execution, and reasoning step as structured spans so you can answer the one question that matters when something goes wrong: why did the agent do that? In 2026 it has become a discipline of its own, with a vendor-neutral standard and a fast-consolidating market of platforms behind it.
The reason agents need their own observability layer is that they fail differently from ordinary software. A traditional service either returns a 200 or throws an error. An agent can return a confident, well-formed, completely wrong answer — having made three unnecessary tool calls and one syntactically valid action that did the wrong thing. Binary pass/fail monitoring is blind to all of it. You need step-level traces.
This guide covers what to actually log, trace, and alert on; the OpenTelemetry GenAI semantic conventions that are becoming the common language for agent telemetry; the new MCP tracing layer; and a practical comparison of seven observability platforms grouped by deployment model — self-hosted, managed SDK, and proxy gateway — so you can pick by cost, data residency, and the way you actually run your agents.
- 01OpenTelemetry GenAI conventions are the emerging standard — but still in Development.As of v1.41, the spec defines agent, workflow, tool, and model spans plus required latency and token-usage metrics. Critically, nearly all gen_ai.* attributes carry Development stability badges, so attribute names can change without a major version bump.
- 02Agents fail in ways binary monitoring cannot see.The same input can trigger different tool sequences across runs, and outputs that look correct can be semantically wrong. Step-level tracing — not pass/fail health checks — is the minimum viable signal for an agent in production.
- 03Pick your stack by deployment model first, features second.Self-hosted (Langfuse, Arize Phoenix) for data residency and cost control; managed SDK (LangSmith, Braintrust) for speed and built-in evals; proxy gateway (Helicone) for zero-code-change cost tracking. The deployment model usually decides the choice before the feature list does.
- 04MCP tracing is the new frontier added in OTel v1.39.Model Context Protocol spans (mcp.method.name, mcp.session.id, mcp.protocol.version) enrich existing execute_tool spans rather than duplicating them — giving agent traces visibility into the tool layer that was previously a black box.
- 052026 is the consolidation year for LLM observability.ClickHouse acquired Langfuse in January and Braintrust raised an $80M Series B in February. The market is reportedly growing at a 30%+ CAGR, yet by early 2026 only about 15% of GenAI deployments instrument observability at all, per a Gartner figure.
01 — The ProblemWhy agents break differently.
A deterministic service is observable in the classic three pillars: metrics, logs, traces. You watch latency and error rates, you read the logs when something throws, and you trace a request across services. An agent breaks that model in two ways at once.
First, the same input does not always produce the same behavior. Temperature, retrieval results, and tool availability all shift the path the agent takes. The same prompt can trigger a different sequence of tool calls on two consecutive runs. That non-determinism makes a single "happy path" trace insufficient — you need to observe the distribution of behaviors, not one example.
Second, failure rarely surfaces as an error. The agent returns something. It is well-formed. It may even be plausible. The problem is that it is wrong, or it took an expensive detour to get there, or it called a tool it never needed. None of that trips a 500. This is why step-level tracing — recording each reasoning step, tool call, and model response as a nested span — is the foundational requirement, and why a health check that only reports "up" is close to useless for an agent.
"Agentic systems fail in ways that look like success: incorrect but well-formed outputs, unnecessary tool calls, or actions that are syntactically valid but semantically wrong."— Aryan Kargwal, PhD Candidate, Polytechnique Montréal
The practical consequence is that observability for agents has to capture intent and process, not just inputs and outputs. You want the reasoning trace, the tools considered, the tools actually invoked, the arguments passed, the responses returned, the tokens spent at each step, and the latency of each hop — all stitched into one hierarchical trace you can replay. Runtime tracing of this kind is the natural complement to offline agent evaluation frameworks, which catch regressions before deployment; tracing catches what production throws at you after.
02 — The StandardOpenTelemetry GenAI: vendor-neutral tracing.
The most important development in this space is not a product — it is a specification. The OpenTelemetry GenAI semantic conventions define a common vocabulary for AI telemetry: a standard set of gen_ai.* span and metric attributes that any instrumentation library can emit and any backend can ingest. Adopt them and you decouple your instrumentation from your vendor — you can switch observability platforms without re-instrumenting your agents.
The spec spans six layers: client (model-call) spans, agent and workflow spans, MCP conventions, semantic events, metrics, and provider-specific attributes. Two histogram metrics are effectively mandatory for any production deployment: gen_ai.client.operation.duration (latency in seconds) and gen_ai.client.token.usage (consumption in tokens, broken down by input and output). Those two signals are the floor — export them or you cannot reason about cost or speed.
gen_ai.* attribute carries a Development badge (the exceptions being error.type, server.address, and server.port). In practice that means an attribute name like gen_ai.usage.input_tokens can change without a major version bump. The escape hatch: OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental enables dual-emission of both legacy (v1.36.0 and earlier) and current attribute names, so a transition does not silently break your dashboards.Adoption is the encouraging part. For the most common providers, instrumentation is close to free: in Python, OpenAI tracing can be a single line — OpenAIInstrumentor().instrument() — after which semconv-compliant spans are produced automatically with no manual span creation. And the major backends already speak the convention: Datadog natively supports OTel GenAI conventions from v1.37 onward (announced December 1, 2025), mapping gen_ai.* attributes to its own LLM Observability schema automatically.
03 — Span ReferenceThe agent span quick-reference.
The spec defines four span operation types specifically for agents: create_agent, invoke_agent, invoke_workflow, and execute_tool. The subtle part is the span kind. An invoke_agent span is CLIENT when the agent runs remotely (for example an OpenAI Assistants API or AWS Bedrock Agent) and INTERNAL when it runs inside your own process (a LangChain or CrewAI agent). And in multi-agent systems, a single INTERNAL invoke_workflow span is the parent that wraps several invoke_agent children — that hierarchy is what lets you follow a task across agent handoffs in one trace.
The reference below consolidates span types that are otherwise spread across three separate pages of the specification.
create_agentinvoke_agentinvoke_agentinvoke_workflowexecute_toolchat / inference| Span operation | Span kind | When it fires |
|---|---|---|
create_agent | INTERNAL | Agent definition / instantiation. Carries the agent name, model, and configuration. Fires once when the agent object is created, not per request. |
invoke_agent | CLIENT | Remote agent execution — OpenAI Assistants API, AWS Bedrock Agents. The agent runs on someone else's infrastructure; the span measures the round trip. |
invoke_agent | INTERNAL | Local framework execution — LangChain, CrewAI, LangGraph agents running inside your process. Parents the model-call and tool spans for that agent. |
invoke_workflow | INTERNAL | Multi-agent orchestration. One invoke_workflow parents multiple invoke_agent children — the structure that makes handoffs legible in a single trace. |
execute_tool | INTERNAL | A single tool / function call. Captures the tool name, arguments, and result. MCP instrumentation enriches this span rather than creating a duplicate. |
chat / inference | CLIENT | The model call itself. Required attributes include model and token usage; input.messages and output.messages are opt-in, not captured by default. |
gen_ai.input.messages and gen_ai.output.messages — are not captured unless you explicitly opt in. For production systems handling PII, the external- storage-plus-reference mode is the recommended pattern: you keep the trace structure for debugging without writing customer data into your telemetry pipeline.04 — MCP TracingThe new layer: MCP span enrichment.
The tool layer used to be the black box in agent traces. You could see that a tool was called and what it returned, but the protocol mechanics underneath — which MCP method, which session, which protocol version — were invisible. OpenTelemetry closed that gap in v1.39, which added MCP semantic conventions with attributes including mcp.method.name, mcp.session.id, and mcp.protocol.version.
The clever design decision is how these attributes attach. When MCP instrumentation detects that an outer GenAI instrumentation already tracks the tool execution, it enriches the existing execute_tool span with the MCP attributes instead of creating a second, duplicate span. You get the protocol-level detail layered onto the tool span you already had — not a noisier trace. This matters for anyone building on Model Context Protocol tracing: the visibility into the tool layer is now standardized, so an agent that calls ten MCP servers can be traced as cleanly as one that calls a single local function.
Current spec version
Defines agent, workflow, tool, and model spans plus required latency and token metrics — still in Development status, so attribute names are not yet frozen.
Protocol-level tool visibility
mcp.method.name, mcp.session.id, and mcp.protocol.version enrich existing execute_tool spans rather than duplicating them — the tool layer stops being a black box.
The minimum signals
gen_ai.client.operation.duration (latency) and gen_ai.client.token.usage (input/output tokens). Export these two histograms or you cannot reason about cost or speed.
05 — Deployment ModelsThree deployment models, three trade-offs.
Before you compare features, decide how the observability layer should be deployed — because that single choice eliminates most of the field. There are three architectures, and each makes a different trade between control, convenience, and risk.
Run it yourself
You host the platform. Best for data residency, sovereignty, and cost control at scale. Langfuse deploys via Docker Compose in minutes; Phoenix is built directly on OTLP. The cost is operational ownership.
Instrument and ship
You add an SDK; the vendor runs the backend, storage, and UI. Fastest path to step-level tracing plus built-in eval tooling. The trade is per-trace or per-span pricing and your data living on their infrastructure.
Route through a gateway
Point your base URL at the gateway; it logs every request with near-zero code change and tracks cost across 300+ models. The architectural caveat: the gateway is a single point of failure for the whole fleet.
06 — Stack ComparisonThe 2026 observability stack compared.
The table below compares seven platforms across the dimensions that actually drive a selection in 2026 — deployment model, free tier, paid entry price, OpenTelemetry support, MCP tracing, and the funding or acquisition signal that tells you how durable the vendor is. Pricing and version figures are taken from each vendor's own documentation and should be re-checked before you commit; this market moves quickly.
| Platform | Deployment | Free tier | Paid entry | OTel GenAI | 2025–26 signal |
|---|---|---|---|---|---|
| Langfuse | Self-hosted / cloud | Hobby (self-host free) | Cloud paid tiers | Yes | Acquired by ClickHouse, Jan 2026 |
| Arize Phoenix | Self-hosted | Open-source (free) | Cloud / Arize AX | Yes (OTLP-native) | ~9.9k GitHub stars |
| LangSmith | Managed SDK | Developer (5K traces/mo) | Plus $39 / seat / mo | Yes | SmithDB: ~12× faster trace queries |
| Braintrust | Managed SDK | Starter (1M spans/mo) | Pro $249 / mo | Yes | $80M Series B, Feb 2026 |
| Helicone | Proxy gateway | 10K requests/mo | Usage-based | Via gateway | 300+ models in cost repo |
| AgentOps | SDK | Open-source (free) | Cloud tiers | Yes | Time-travel replay debugging |
| Datadog LLM Obs. | Managed agent | 40K LLM spans/mo | Pro $160 / mo | Yes (from v1.37) | Bills LLM spans only |
"We built Langfuse on ClickHouse because LLM observability and evaluation is fundamentally a data problem. Now, as one team, we can deliver a tighter end-to-end product: faster ingestion, deeper evaluation, and a shorter path from a production issue to a measurable improvement."— Marc Klingen, CEO of Langfuse
A note on the open-source options, because their licenses differ in ways that matter for redistribution. Langfuse is MIT-licensed (excluding its enterprise ee folder) and self-hosts via Docker Compose, Kubernetes/Helm, or Terraform. Arize Phoenix uses Elastic License 2.0 and is built on OpenTelemetry with OpenInference instrumentation underneath; AgentOps is MIT-licensed and notable for time-travel debugging — replaying an agent session with point-in-time precision. Read the exact license text before you embed any of them in a commercial product.
07 — Cost & Eval GatesToken cost tracking and eval gates.
Cost observability for agents has a subtlety that pricing models expose unevenly. Consider Datadog: its LLM Observability free tier includes 40,000 LLM spans per month, and the Pro plan starts at $160 per month with 100,000 LLM spans. The detail that changes the math is that only LLM spans are billed — tool spans, embedding spans, retrieval spans, and agent spans are free. A highly agentic system that makes many tool calls but relatively few model calls can therefore be dramatically cheaper to observe on a span-class-aware model than on a flat per-span one.
Per-trace and per-span pricing models diverge fast at scale. LangSmith meters traces — a Developer tier free at 5,000 traces per month, Plus at $39 per seat per month, with overage around $2.50 per thousand traces at standard retention. Braintrust meters spans generously — a Starter tier free at one million spans per month, Pro at $249 per month. Helicone, as a gateway, meters requests — free at 10,000 per month — and computes cost across more than 300 models using its model-cost repository, integrating natively with the Vercel AI SDK. The unit of billing (trace vs span vs request) interacts with your agent's call pattern, so model your own traffic before assuming one is cheaper.
Free-tier volume by platform · note the differing units
Source: vendor pricing pages (units differ — span vs request vs trace)Tracing is necessary but not sufficient. The mature pattern in 2026 is to pair runtime tracing with eval gates— automated scorers that grade agent outputs and can block a regression from shipping or flag a live quality drop. The managed platforms increasingly bundle this: LangSmith's natural-language trace assistant lets an engineer ask "why did the agent enter this loop?" and get an answer by analyzing the traces directly, while Braintrust pairs tracing with scorers for human and automated review. Tracing tells you what happened; eval gates tell you whether it was good — and you want both wired into the same pipeline. Security-sensitive deployments should also fold prompt injection detection into that gate, since a clean-looking trace can still hide an injected instruction.
"Teams have never had less conviction about what will fail next. When something does break, it has never been harder to explain why."— Ankur Goyal, CEO of Braintrust
08 — Market SignalsThe market — and the adoption gap.
Two things are true about LLM observability in 2026 at the same time: capital is flooding in, and most teams still are not using it. That gap is the most interesting signal in the space, and it is where the opportunity sits.
The funding side is concrete. ClickHouse acquired Langfuse on January 16, 2026, as part of a $400M Series D that valued ClickHouse at $15B; at acquisition, Langfuse reported more than 2,000 paying customers and tens of millions of SDK installs per month, and its open-source licensing and self-hosting were stated to remain unchanged. A month later, on February 17, 2026, Braintrust raised an $80M Series B at an $800M valuation, led by Iconiq with participation from Andreessen Horowitz and others. Two of the most-watched names in the category took major capital events inside a single quarter — a clear consolidation signal.
Here is our read on the gap. When 85% of GenAI deployments run without observability while the tooling market grows at a 30%-plus clip, you are looking at a discipline that is being built faster than it is being adopted. The teams instrumenting now are buying an unfair advantage: when an agent misbehaves in production, they can answer "why" in minutes from a trace, while the uninstrumented majority is reduced to guessing and re-running. As agents move from pilots into revenue-bearing workflows, that asymmetry stops being a nice-to-have and becomes the difference between a fixable incident and an unexplained one.
Projecting forward, two forces should converge over the next year. The OpenTelemetry GenAI conventions will likely graduate toward stable status, which removes the last real objection to standardizing on a vendor-neutral layer. And the convergence already underway — OpenInference instrumentations emitting both their own and OTel attributes for backward compatibility — points to a near future where you instrument once and route the same telemetry to a self-hosted tool and a managed backend simultaneously. The likely winners are the platforms that make that dual-destination story painless.
09 — DecisionChoosing the right stack for your workload.
The decision compresses to a few questions about how you run agents and what constraints you carry. The matrix below maps the common situations to a recommended starting point — start there, instrument, and let your own traces tell you whether to move.
Sovereignty or regulated data
If telemetry cannot leave your perimeter, self-host. Langfuse via Docker Compose or Kubernetes keeps traces inside your infrastructure; Phoenix on OTLP does the same. The cost is operational ownership of the stack.
Ship tracing this week
If the priority is step-level traces plus eval tooling with minimal setup, a managed SDK wins. LangSmith and Braintrust give you backend, storage, UI, and scorers out of the box — model your trace or span volume against their pricing first.
Zero-code-change cost tracking
If the immediate need is to see and attribute model spend across many providers without re-instrumenting, route through a gateway. Helicone tracks cost across 300+ models — just accept the single-point-of-failure architecture and plan for HA.
Already on a major APM
If you already run Datadog for the rest of your stack, its LLM Observability extends the same panes — and bills only LLM spans, which favors tool-heavy agents. Consolidating telemetry in one place is often worth more than a marginally better point tool.
Whatever you pick, instrument against the OpenTelemetry GenAI conventions rather than a vendor-proprietary SDK wherever the platform supports it. That single discipline keeps the migration door open: if the market consolidates further, or your needs change, you re-point the exporter instead of re-instrumenting every agent. For teams deciding the architecture, our AI digital transformation engagements start with exactly this kind of stack evaluation — mapping your agent traffic, residency constraints, and budget to a concrete recommendation, and our web development team wires the instrumentation into your application so the traces flow from day one.
10 — ConclusionObservability is now part of the agent, not an add-on.
If you cannot explain why an agent did something, you do not control it.
Agent observability stopped being optional the moment agents started making decisions in production. The failure mode that defines the discipline is the one that looks like success — a confident, wrong, expensive answer that no health check will ever catch. The only answer is step-level tracing: every reasoning step, tool call, and model response stitched into a trace you can replay and interrogate.
The good news is that the foundation is now standardized. The OpenTelemetry GenAI conventions give you a vendor-neutral vocabulary for agent, workflow, tool, and MCP spans — and although the spec is still in Development status, instrumenting against it today, with the dual-emit opt-in as a safety net, is the move that keeps your options open as the market consolidates around it.
So choose by deployment model first — self-hosted for residency, managed SDK for speed, proxy gateway for cost visibility — then by features, and instrument against the open standard regardless. With most GenAI deployments still flying blind and serious capital backing the category, the teams that wire in tracing and eval gates now will be the ones who can answer why while everyone else is still guessing.