AI Development

Agent Observability 2026: Evals, Traces, Cost Guide

Agent observability guide — LangSmith, Braintrust, Langfuse compared, eval patterns, trace sampling, and cost attribution for multi-tenant agents.

Digital Applied Team
April 14, 2026
12 min read

Key Takeaways

Tool Failures Dominate Outages: Most agent incidents stem from tool-call failures, context truncation, and runaway loops rather than model errors; standard APM tools cannot see these without agent-aware instrumentation.
Three Evaluation Layers Matter: Reliable agents need unit evals on discrete steps, LLM-as-judge regression suites for subjective output quality, and continuous production trace sampling to catch real-world drift.
Three Platforms Lead the Category: LangSmith, Braintrust, and Langfuse each occupy distinct niches; Langfuse is the open-source baseline, LangSmith leans into LangChain workflows, and Braintrust targets rigorous eval science.
OpenTelemetry Is the Portable Layer: Instrumenting with OpenTelemetry semantic conventions for generative AI keeps traces vendor-agnostic, letting teams swap or stack observability platforms without rewriting the instrumentation layer.
Cost Attribution Must Be Multi-Dimensional: Per-user, per-task, and per-tenant cost breakdowns are table stakes for multi-tenant agent products; tagging at the trace root and propagating through children is the reliable pattern.
Tail-Based Sampling Beats Head-Based: Keep every failed, expensive, or anomalous trace in full; sample the happy path aggressively. Head-based sampling at high volumes drops exactly the traces you need when an incident hits.
Drift Is a First-Class Signal: Model updates, prompt edits, and tool schema changes all induce silent drift. Scheduled replay of a golden trace set against current production is the most reliable early warning.

Most agent outages aren't model failures. They're tool failures, context failures, and runaway loops that standard APM tools can't see. Agent observability is a different discipline from traditional application performance monitoring, and the teams shipping reliable production agents treat it that way from day one.

This guide walks through the three eval layers every serious agent team needs, compares LangSmith, Braintrust, and Langfuse on the dimensions that matter in production, and covers the infrastructure patterns: OpenTelemetry instrumentation, trace sampling, cost attribution across multi-tenant workloads, and drift detection. Pricing specifics shift often and vary by scale, so we compare qualitatively; verify current tiers before committing.

Why Agent Observability Is Different

A traditional web service trace has a predictable shape: request in, a handful of database queries, maybe a cache hit or two, response out. Latency lives in the database or the network. Errors are HTTP 5xx or uncaught exceptions. APM tools evolved to slice that shape well.

An agent trace looks nothing like that. A single user request can fan out into dozens of LLM calls, tool invocations, sub-agent handoffs, and retry loops. Latency lives in model inference and tool-call round trips. Cost lives in token counts scattered across every child span. "Errors" include successful HTTP responses that contain hallucinated content, tool calls with malformed arguments the model never noticed, and loops where the agent retries the same broken step forty times before giving up.

The Four Failure Modes APM Misses
  • Tool-call failures — the model emits a tool call with invalid arguments, gets an error back, and either loops or silently fabricates a result.
  • Context truncation — the prompt hit the context window and critical instructions or retrieved documents were silently dropped.
  • Runaway loops — the agent keeps calling itself or a tool with minor variations, burning tokens without converging.
  • Silent quality regressions — all spans return 200, but the end-user-facing answer quality has degraded after a prompt edit or model update.

The consequence is that standard request-span traces and red-yellow-green dashboards don't catch the incidents your users actually feel. Agent observability is a specialized discipline with its own data model (traces of LLM calls and tool calls, not HTTP hops), its own quality signals (eval scores, not 5xx rates), and its own cost model (token-weighted per-user attribution, not CPU seconds).

The Three-Layer Eval Model

Teams that ship reliable agents converge on a three-layer evaluation model. Each layer covers different failure modes and runs on different cadences. Missing any one of them creates a predictable blind spot.

Layer 1: Unit Evals on Discrete Steps

Unit evals assert deterministic properties on individual agent steps: the router picked the correct branch, the tool-calling output parsed as valid JSON matching the schema, the retrieval step returned at least one document above a relevance threshold, the date parser produced a valid ISO-8601 string. These are fast, cheap, and belong in CI. They catch regressions on the plumbing that LLM-as-judge cannot reliably detect.

Layer 2: LLM-as-Judge Regression Suites

LLM-as-judge evals score subjective output quality against a rubric using a strong grading model. Typical dimensions include factual grounding against retrieved context, helpfulness, conciseness, tone alignment, and hallucination rate. A regression suite runs the same 100-500 test cases through each candidate prompt or model and produces aggregate scores. These run per-PR and per-release, gating merges on regression thresholds.
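The scoring step can be sketched as a small harness: a rubric prompt, a grading call, and strict parsing of the returned score. The rubric wording is illustrative, and the grading model is injected as a plain callable so the sketch stays provider-neutral:

```python
JUDGE_RUBRIC = """\
Score the ANSWER from 1-5 on factual grounding against CONTEXT.
5 = fully supported, 1 = contradicts or invents facts.
Reply with only the integer score.

CONTEXT: {context}
ANSWER: {answer}
"""

def judge_grounding(answer: str, context: str, grade) -> int:
    """Run one LLM-as-judge scoring pass.

    `grade` is any callable taking a prompt string and returning the
    grading model's text reply -- wrap your provider's client here.
    """
    reply = grade(JUDGE_RUBRIC.format(context=context, answer=answer))
    score = int(reply.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score

# Stub grader for illustration; in production this wraps a strong model.
score = judge_grounding("Paris is the capital of France.",
                        "France's capital is Paris.",
                        grade=lambda prompt: "5")
print(score)  # 5
```

Validating the parsed score range matters in practice: grading models occasionally reply with prose instead of a bare integer, and a judge that silently mis-parses is worse than one that fails loudly.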

Layer 3: Production Trace Sampling

Layers 1 and 2 run on curated test data. Production sampling runs your evals against a slice of real user traffic, either online (real-time scoring of every Nth trace) or offline (batch scoring of a day's sample). This is where you catch distribution shift, edge cases your test set missed, and the slow drift that static eval suites never surface. Production trace sampling is the feedback loop that keeps Layers 1 and 2 honest.

| Layer | What It Catches | Cadence | Cost Profile |
| --- | --- | --- | --- |
| Unit evals | Plumbing regressions, schema drift, routing errors | Every CI run | Cheap, deterministic |
| LLM-as-judge | Subjective quality regressions, hallucination spikes | Per-PR, per-release | Moderate, grader tokens dominate |
| Production sampling | Distribution shift, real-world drift, long-tail failures | Continuous | Tunable via sampling rate |

For a deeper treatment of multi-agent failure modes specifically, see our multi-agent orchestration patterns guide.

LangSmith Deep Dive

LangSmith is LangChain's observability and evaluation platform, built alongside the LangChain and LangGraph frameworks. It ships tracing, dataset management, evaluation runs, a prompt hub, and an annotation UI for human feedback — all integrated tightly with the LangChain SDK so callback-based tracing is nearly automatic for LangChain-built agents.

Where LangSmith Is Strongest

  • LangGraph integration. If your agents are graphs, LangSmith's visualizer shows node-by-node execution, state snapshots, and decision traces without extra instrumentation.
  • Annotation queues. Human reviewers can mark up production traces with labels and feedback directly inside the platform, and those annotations flow back into datasets for the next eval run.
  • Prompt hub. Version-controlled prompts with the ability to reference prompts by name in code and deploy edits without redeploying the application.
  • Studio and playground. Side-by-side prompt experimentation with eval-score comparison.

Integration Path

For a LangChain-built agent, integration is setting an API key and a project environment variable; tracing attaches via the existing callback manager. For non-LangChain agents, the @traceable Python decorator and equivalent TypeScript wrappers instrument functions at the boundary. LangSmith also supports OpenTelemetry ingest for teams that have already standardized there.
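To make the boundary-instrumentation idea concrete without pulling in the SDK, here is a dependency-free sketch of the decorator pattern that `@traceable` follows — wrap a function boundary, record inputs, output, and latency as a span-like record. This is a stand-in for illustration, not LangSmith's actual internals:

```python
import functools
import time

TRACE_BUFFER = []  # stand-in for the SDK's span exporter

def traceable(fn):
    """Minimal illustration of boundary tracing: capture the function's
    name, inputs, output, and wall-clock latency on every call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE_BUFFER.append({
            "name": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "latency_ms": (time.perf_counter() - start) * 1000,
        })
        return result
    return wrapper

@traceable
def plan_step(query: str) -> str:
    return f"plan for: {query}"

plan_step("book a flight")
print(TRACE_BUFFER[0]["name"])  # plan_step
```

The real decorator additionally handles nesting (so child calls attach to the parent span) and async functions, but the shape is the same: instrument at function boundaries, not inside business logic.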

Trade-Offs

LangSmith's tightest wins are inside the LangChain ecosystem. Teams building with vanilla OpenAI, Anthropic, or Vercel AI SDK calls can use it, but lose some of the automatic node-level visualization that makes LangSmith feel magical with LangGraph. It is a closed-source SaaS with enterprise self-hosting as a paid option — not an open-source baseline.

Braintrust Deep Dive

Braintrust positions itself as an evaluation-first platform. The core product is eval science: rigorous dataset management, scoring function authoring, experiment tracking, and prompt iteration with statistical confidence indicators. Tracing and online monitoring ship alongside, but the center of gravity is the eval development loop.

Where Braintrust Is Strongest

  • Eval authoring. First-class scoring function library including numeric scorers, LLM-as-judge templates, and composable custom scorers written in TypeScript or Python.
  • Experiment diffs. Side-by-side comparison of two eval runs at the row level — see which inputs improved, which regressed, and which flipped between runs.
  • Statistical rigor. Confidence intervals and significance indicators on aggregate scores, so a 2% delta on a 40-row dataset is flagged as noise rather than a win.
  • Prompt playground. Fast iteration on prompts with evals running in-browser against your datasets.

Integration Path

Braintrust has SDKs for Python and TypeScript plus an OpenAI proxy wrapper that traces LLM calls automatically when you swap the base URL. For full-trace instrumentation the traced decorator wraps agent steps. OTel ingest is supported, and the eval runner integrates with CI via a CLI.
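The eval-runner shape — dataset in, task function, scorers, aggregate out — can be sketched in a few lines. This is a neutral stand-in mirroring the dataset/task/scores structure of Braintrust's `Eval` runner, not its actual implementation; the dataset and task here are toy examples:

```python
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(data, task, scorers):
    """Run every case through the task, score each row, and aggregate.
    Mirrors the dataset -> task -> scorers -> aggregate loop that eval
    runners like Braintrust's Eval(data=..., task=..., scores=...) use."""
    rows = []
    for case in data:
        output = task(case["input"])
        scores = {s.__name__: s(output, case["expected"]) for s in scorers}
        rows.append({"input": case["input"], "output": output, "scores": scores})
    aggregate = {
        name: sum(r["scores"][name] for r in rows) / len(rows)
        for name in rows[0]["scores"]
    }
    return rows, aggregate

data = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
rows, agg = run_eval(
    data,
    task=lambda q: {"2+2": "4"}.get(q, "Paris"),  # toy task for illustration
    scorers=[exact_match],
)
print(agg)  # {'exact_match': 1.0}
```

What the platform adds on top of this loop is the valuable part: row-level diffs between two runs, and confidence intervals on the aggregate so small datasets don't produce false wins.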

Trade-Offs

Braintrust is a managed SaaS with enterprise deployment options; self-hosting is not the default mode. Teams that treat eval development as a primary workflow — ML engineers, research-adjacent agent teams, groups running rigorous model comparisons — get the most value. Teams that want observability-first with evals as a secondary feature often find Langfuse or LangSmith a closer fit.

Langfuse Deep Dive

Langfuse is the open-source entry in the category. The core is MIT-licensed and can be self-hosted end-to-end, with a managed cloud available for teams that prefer SaaS. It covers tracing, datasets, evaluations (scorable manually or via LLM-as-judge), prompt management, and cost tracking — a broad surface area with a healthy plugin and framework ecosystem.

Where Langfuse Is Strongest

  • Self-hosting. Full open-source deployment suitable for teams with data residency, privacy, or cost constraints that rule out SaaS.
  • Framework neutrality. Broad SDK and decorator support across Python, TypeScript, and integrations with most major agent frameworks — LangChain, LlamaIndex, Vercel AI SDK, OpenAI, Anthropic, and OTel-based pipelines.
  • Pricing at scale. The self-hosted option makes Langfuse the most predictable at high volumes; you pay for infrastructure, not per-trace.
  • Session and user grouping. First-class session and user ID fields on traces make multi-tenant cost attribution straightforward out of the box.

Integration Path

Langfuse SDKs expose decorators and context managers for instrumenting arbitrary Python or TypeScript functions. The @observe decorator traces the call with automatic argument and return-value capture. OpenTelemetry ingest is a first-class path, and framework integrations mean LangChain or LlamaIndex agents often need a few lines of setup to start emitting full traces.

Trade-Offs

The self-hosted deployment does require operational ownership — Postgres, ClickHouse for trace storage at scale, and the usual upgrade and backup duties. Teams that want a pure SaaS experience with zero infrastructure can use Langfuse Cloud instead, but the compelling differentiator (self-host) then goes unused. Feature velocity on rigorous eval science lags Braintrust; feature velocity on LangGraph-native visualization lags LangSmith.

Comparison Matrix

Here is how the three platforms compare across the dimensions that matter most for production agent teams:

| Dimension | LangSmith | Braintrust | Langfuse |
| --- | --- | --- | --- |
| License | Closed (SaaS + enterprise) | Closed (SaaS + enterprise) | MIT open-source core |
| Self-hosting | Enterprise tier only | Enterprise tier only | Full OSS self-host |
| Framework fit | LangChain / LangGraph native | Framework-agnostic | Framework-agnostic |
| Eval authoring depth | Solid | Best-in-class | Solid |
| Prompt management | Hub with versioning | Playground + versioning | Versioned prompts |
| OpenTelemetry ingest | Supported | Supported | First-class |
| Annotation queues | Native | Available | Available |
| Multi-tenant grouping | Metadata-based | Metadata-based | First-class user + session |
| Pricing model | Usage-based SaaS | Usage-based SaaS | SaaS or self-host |
| Best-fit team | LangChain / LangGraph shops | Eval-heavy ML teams | OSS-first, multi-framework |

For a broader look at agent framework trade-offs, see our OpenAI Agents SDK vs LangGraph vs CrewAI matrix.

Trace Sampling Strategies

At modest volumes (under a few thousand traces per day) you can keep 100% of traces. Above that, retention cost and UI signal-to-noise push you toward sampling. The question is how to sample without discarding the traces you'll actually need when an incident lands.

Head-Based Sampling
Decide at trace start

Sample N% of traces at trace start with a deterministic hash or RNG draw. Cheap to implement, zero buffering overhead, predictable cost.

Weakness: random selection drops exactly the anomalous traces you want. A 1% failure rate with 1% sampling gives you ~1 preserved failure per 10,000 traces.

Tail-Based Sampling
Decide at trace end

Buffer traces until complete, then decide based on properties: error status, latency, cost, eval score, trace length. Keeps the tails that matter.

Weakness: buffering adds memory overhead and requires a collector (OTel Collector or platform-side) that understands your trace shape.

Cost-Weighted Sampling
Keep expensive traces

Bias retention toward high-cost traces. A trace burning 50k tokens is worth keeping; a trace burning 500 is not. Tune retention probability as a function of token spend.

Pairs with: tail-based infrastructure; cost is known only at trace end.

Stratified Sampling
Preserve per-tenant coverage

Sample at a per-tenant or per-feature rate to guarantee coverage across your user base. A 1% global rate can leave small tenants with zero retained traces; stratified sampling keeps minimums per group.

Use when: supporting enterprise tenants who expect their traces to be queryable.

A Practical Default Policy

  • 100% of traces with errors, timeouts, or unhandled exceptions.
  • 100% of traces above a cost threshold (for example, top 5% by token spend).
  • 100% of traces below an eval-score threshold when online scoring is enabled.
  • 1-5% of healthy, cheap, passing traces for distribution coverage.
  • Stratify the healthy-trace sampling by tenant or feature so small tenants retain visibility.

Cost Attribution Patterns

Cost attribution is where most agent teams first feel the limits of generic APM. A single user request can fan out into dozens of LLM calls across different models and tools, each with its own token cost. Attribution means rolling that spend back up to the dimensions you need: per-user, per-task, per-tenant, per-feature.

The Tag-at-Root, Propagate-to-Children Pattern

The reliable pattern is simple: attach identifying tags at the root span when the request enters your system, and ensure every child span inherits them. All three platforms support this via metadata, session fields, or tags. The discipline is yours to enforce — a single sub-agent or tool wrapper that forgets to propagate breaks the attribution chain for everything downstream.

# Python example — Langfuse-style tagging
from langfuse.decorators import observe, langfuse_context

@observe()
def handle_request(request):
    langfuse_context.update_current_trace(
        user_id=request.user_id,
        session_id=request.session_id,
        metadata={
            "tenant_id": request.tenant_id,
            "task_type": request.task_type,
            "feature_flag_cohort": request.cohort,
        },
    )
    return run_agent(request)

# Every nested @observe() call inherits these tags
# via context propagation. Cost rolls up by any
# combination of user_id, tenant_id, or task_type.

Three Dimensions That Matter in Production

  • Per-user. Detects abusive or runaway usage, feeds usage-based billing, informs rate limits.
  • Per-task (or per-feature). Tells you which product surfaces are profitable and which are subsidized losses. Essential for product decisions on which agent features to keep, tune, or deprecate.
  • Per-tenant (for B2B). Gross margin per enterprise customer, usage-tier enforcement, and the basis for account-level cost alerts.
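Once tags propagate to every child span, the rollup itself is a straightforward aggregation. A minimal sketch, assuming flattened spans with inherited tags; the per-token prices are illustrative placeholders, since real prices vary by model and change often:

```python
from collections import defaultdict

spans = [  # flattened child spans; tags inherited from the trace root
    {"tenant_id": "acme",   "task_type": "summarize", "input_tokens": 1200, "output_tokens": 300},
    {"tenant_id": "acme",   "task_type": "search",    "input_tokens": 800,  "output_tokens": 150},
    {"tenant_id": "globex", "task_type": "summarize", "input_tokens": 400,  "output_tokens": 90},
]

# Illustrative per-token USD prices -- NOT real rates; look up current
# pricing per model and keep a price table keyed by model ID.
PRICE_IN, PRICE_OUT = 3e-6, 15e-6

def rollup(spans, key):
    """Sum token-weighted cost across spans, grouped by any tag."""
    totals = defaultdict(float)
    for s in spans:
        totals[s[key]] += s["input_tokens"] * PRICE_IN + s["output_tokens"] * PRICE_OUT
    return dict(totals)

print(rollup(spans, "tenant_id"))
print(rollup(spans, "task_type"))
```

The same function answers all three questions above just by changing the grouping key — which is exactly why tagging every dimension at the root is worth the discipline.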

For a deeper treatment of multi-tenant cost attribution specifically, see our LLM agent cost attribution guide.

OpenTelemetry Instrumentation Reference

OpenTelemetry is the open standard for distributed tracing, and it has gained first-class semantic conventions for generative AI workloads. Instrumenting against OTel rather than a vendor-specific SDK is the portability play: the same instrumentation ships to LangSmith, Braintrust, Langfuse, or a self-hosted backend with only config changes.

Key Semantic Conventions for Agents

The OTel GenAI conventions define attribute names for LLM calls, tool invocations, and embedding operations. The canonical subset agent teams instrument first:

  • gen_ai.system — provider (openai, anthropic, etc.)
  • gen_ai.request.model — model ID used for the call
  • gen_ai.usage.input_tokens and gen_ai.usage.output_tokens — token counts for cost rollup
  • gen_ai.operation.name — chat, embedding, tool_call, etc.
  • gen_ai.response.finish_reasons — stop, length, tool_calls, content_filter

Minimal Instrumentation Example

from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
client = OpenAI()

def call_llm(prompt: str, model: str):
    with tracer.start_as_current_span("llm.chat") as span:
        # String keys follow the OTel GenAI semantic conventions directly,
        # avoiding a dependency on any particular semconv package version.
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.operation.name", "chat")

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )

        span.set_attribute(
            "gen_ai.usage.input_tokens",
            response.usage.prompt_tokens,
        )
        span.set_attribute(
            "gen_ai.usage.output_tokens",
            response.usage.completion_tokens,
        )
        return response

Collector-Side Routing

The OTel Collector lets you route the same trace stream to multiple backends. A common pattern: send all traces to self-hosted Langfuse for long-term retention and query, and sample interesting traces to Braintrust for eval development. The application code changes nothing; the Collector configuration is the only place routing rules live.
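A sketch of that routing pattern as a Collector configuration — all traces flow to one backend for retention while a tail-sampled slice goes to a second for eval work. The endpoint URLs are placeholders, and the exporter names and auth details will depend on your deployment:

```yaml
receivers:
  otlp:
    protocols:
      http:

processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-healthy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlphttp/langfuse:
    endpoint: https://langfuse.internal/api/public/otel   # placeholder URL
  otlphttp/braintrust:
    endpoint: https://braintrust.example/otel             # placeholder URL

service:
  pipelines:
    traces/retention:          # everything -> long-term store
      receivers: [otlp]
      exporters: [otlphttp/langfuse]
    traces/evals:              # sampled tail -> eval platform
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlphttp/braintrust]
```

Note that the `tail_sampling` processor ships in the Collector's contrib distribution, not the core build.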

For production agent reference architectures built on OTel, see our enterprise agent platform reference architecture guide.

Drift Detection and Alerting

Drift is the silent killer in agent systems. Model versions shift under you, prompt edits have unexpected downstream effects, tool schemas change, and the input distribution you see in production moves away from the data your eval set was built from. Without active drift detection, teams learn about regressions from user complaints rather than dashboards.

Golden-Set Replay on a Schedule

The most reliable drift signal is a scheduled replay of a curated golden set (50-500 traces with expected outputs or rubric scores) through your current production pipeline. Run it daily, or on every deploy, and track aggregate scores over time. A sustained drop is a drift event — investigate before it becomes an incident.
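The replay harness can stay framework-neutral by injecting both the pipeline and the scorer as callables. A minimal sketch; the noise threshold, the toy pipeline, and the scorer here are placeholders for your own golden set and grading logic:

```python
def replay_golden_set(golden, pipeline, scorer):
    """Run each golden case through the current production pipeline
    and return the aggregate score."""
    scores = [scorer(pipeline(case["input"]), case["expected"])
              for case in golden]
    return sum(scores) / len(scores)

def check_drift(current: float, baseline: float, noise: float = 0.03) -> str:
    """Flag a sustained drop beyond the noise threshold as drift."""
    if current < baseline - noise:
        return "drift"
    return "ok"

# Toy one-case golden set for illustration; real sets run 50-500 cases.
golden = [{"input": "2+2", "expected": "4"}]
agg = replay_golden_set(
    golden,
    pipeline=lambda q: "4",  # stand-in for the real agent pipeline
    scorer=lambda out, exp: 1.0 if out == exp else 0.0,
)
print(check_drift(agg, baseline=0.97))  # ok
```

In production this runs from a scheduler (cron, CI nightly, or on every deploy), persists the aggregate as a time series, and alerts only on consecutive breaches — a single noisy run should not page anyone.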

Production-Side Score Distribution Tracking

Golden-set replay catches regressions against a fixed reference. Production-side scoring catches distribution shift in real traffic. Run your LLM-as-judge scorers on a sampled slice of production traces continuously, bucket by time window, and alert on p50/p95 score movement. A drop here with no drop on the golden set usually means the input distribution changed, not the pipeline.

Tool-Call Schema Validation in CI

External API schema changes are a frequent invisible drift source: the tool your agent depends on added a required field, or deprecated an endpoint, and your model is still calling the old signature. CI-side schema snapshot testing (record tool schemas, diff on each run, fail on breaking changes) gives you a lead time before production notices.
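The diff step reduces to comparing the live schemas against a committed snapshot and failing the build on breaking changes. A sketch, assuming a simplified schema shape where each tool lists its required parameters; in CI you would load the snapshot from a file committed to the repo:

```python
def diff_schemas(live: dict, snapshot: dict) -> list[str]:
    """Flag breaking changes: removed tools, or newly required parameters
    that the model's existing call signature won't supply."""
    breaks = []
    for name, snap in snapshot.items():
        if name not in live:
            breaks.append(f"tool removed: {name}")
            continue
        new_required = (set(live[name].get("required", []))
                        - set(snap.get("required", [])))
        for param in sorted(new_required):
            breaks.append(f"{name}: new required parameter {param!r}")
    return breaks

# Illustrative schemas: the upstream API added a required field.
live = {"search_flights": {"required": ["origin", "date", "cabin_class"]}}
snap = {"search_flights": {"required": ["origin", "date"]}}
print(diff_schemas(live, snap))
```

Wiring `sys.exit(1)` on a non-empty result turns this into a CI gate, giving you lead time before the model starts emitting calls the upstream API rejects.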

Alerting Policy

  • Page-worthy: golden-set aggregate score drops by more than the noise threshold for two consecutive runs; production error rate exceeds baseline by 3x; cost per user spikes by 5x.
  • Ticket-worthy: gradual golden-set drift over seven days; p95 production latency growth of 50%+ week-over-week; new tool-call error patterns appearing above noise.
  • Dashboard-only: per-model, per-tenant, and per-feature score trends; token spend trends; sampling distribution sanity checks.

Prompt injection detection is a specialized drift concern — see our prompt injection production agents taxonomy for the adversarial drift taxonomy.

Conclusion

Agent observability earns its own discipline because the failure modes, cost model, and quality signals are different enough from request-response web services that generic APM falls short. The teams shipping reliable agents converge on a recognizable stack: a three-layer eval model (unit, LLM-as-judge, production sampling), OpenTelemetry as the portable instrumentation layer, tail-based sampling that keeps the traces that matter, and multi-dimensional cost attribution rolled up from tagged root spans.

The platform choice between LangSmith, Braintrust, and Langfuse comes down to the shape of your team and stack. LangGraph-heavy shops lean LangSmith; eval-science-heavy teams lean Braintrust; OSS-first, framework-agnostic teams lean Langfuse. Whichever you pick, instrument against OTel semantic conventions so the decision isn't load-bearing on your codebase. For production patterns around agent SDKs specifically, see our Claude Agent SDK production patterns guide.

Ship Agents You Can Actually Trust

From eval design and OpenTelemetry instrumentation through cost attribution and drift detection, we help teams stand up the observability foundation production agents require.
