AI agent evaluation frameworks reached an inflection point in May 2026: the category now spans five commercial platforms and three open-source standards, each solving a distinct slice of the production-readiness problem. The right choice depends on your team size, compliance tier, existing observability stack, and — following OpenAI's March 9 acquisition of Promptfoo — which model vendor you trust to grade your agents.

The stakes are no longer theoretical. Eval tooling is strategic infrastructure. Braintrust closed an $80M Series B at an $800M valuation in February 2026 with customers including Notion, Replit, Cloudflare, and Ramp. OpenAI acquired Promptfoo, the security and red-team eval leader used by more than 25% of Fortune 500 companies, at a reported $86M valuation. LangSmith, backed by LangChain's 80M monthly downloads, is the default starting point for LangGraph teams. The commercial eval market is consolidating around a handful of serious platforms.

This guide covers all eight frameworks: the five commercial platforms (LangSmith, Braintrust, Helicone, Phoenix by Arize, and Promptfoo), and the three open-source standards (OpenAI Evals, DeepEval v4.0.3, and Inspect AI v0.3.225 from the UK AI Security Institute). It includes a full 8-framework comparison matrix, a breakdown of where the SOC 2 cliff actually falls per platform, a 5-step custom eval design template that goes from production incidents to CI/CD regression gates, and an honest look at the vendor-objectivity question that the Promptfoo acquisition raised but most roundups have ignored. For the deeper observability context, see our agent observability 2026 guide to evals, traces, and cost.

Key takeaways

01
The landscape is 5 commercial + 3 open-source: pick your tier first.
LangSmith, Braintrust, Helicone, Phoenix (Arize), and Promptfoo are the five commercial platforms. OpenAI Evals, DeepEval v4.0.3, and UK AISI Inspect AI v0.3.225 are the three OSS standards. Your choice axis is not features; it is who evaluates the evals, where traces live, and what compliance tier you need. Engineering teams at seed stage can start on OSS and migrate; regulated industries need to know the SOC 2 cliff before signing any contract.
02
SOC 2 sits behind a paid tier on most platforms, and LangSmith Plus is the cheapest way in.
The compliance cost is hidden in the tier structure. Braintrust SOC 2 requires the $249/mo Pro tier (versus free Starter). Helicone SOC 2 requires the $799/mo Team tier (versus $79/mo Pro). Arize SOC 2 is Enterprise-only (custom quote). LangSmith carries SOC 2 Type II at the Plus tier ($39/seat), the lowest entry point in the commercial field. OSS frameworks carry no SOC 2 of their own; compliance is your infrastructure problem.
03
Promptfoo's OpenAI acquisition raises a legitimate objectivity question.
Promptfoo was acquired by OpenAI on March 9, 2026 at an $86M valuation. It remains open source (MIT) and Ian Webster has committed to keeping it vendor-neutral. But teams running non-OpenAI models now face a structural question: if your security and red-team eval tooling is owned by a model vendor, is the grading truly neutral? This is not a fabricated concern; it is the same question applied to any vendor-controlled benchmark. Weigh it deliberately before choosing Promptfoo for non-OpenAI workflows.
04
Phoenix v16 and DeepEval v4.0.3 shipped the same week.
Phoenix v16.0.0 (May 21, 2026) introduced sandboxed Code Evaluators for composite scoring and LLM-jury implementations run server-side. DeepEval v4.0.3 (also May 21, 2026) shipped Decision Graph Logic for granular simulation control, on top of a complete agentic eval harness added in v4.0.2. Both frameworks updated in the same week: the OSS eval space is shipping at commercial velocity. Track both repos if you are evaluating open-source options.
05
The harness effect makes cross-vendor SWE-Bench numbers incomparable.
Claude Opus 4.7 reports 87.6% on SWE-Bench Verified (Anthropic, April 16, 2026). Codex CLI on GPT-5.5 reportedly scores 88.7%. These numbers appear close, but identical model weights in different harnesses produce 10-20 point swings on the same benchmark. When two vendors report different SWE-Bench scores for different models, the harness methodology is usually the dominant variable, not the model. Build your eval harness before drawing conclusions from published leaderboards.

01 — Landscape Overview

Five commercial platforms, three OSS standards — the 2026 split explained.

The AI agent eval landscape matured enough in early 2026 to segment cleanly into two tiers. Commercial platforms provide managed infrastructure — trace ingestion, annotation queues, dataset versioning, SOC 2 certifications, and enterprise SLAs — in exchange for subscription fees. Open-source frameworks provide eval logic, metric libraries, and CI/CD integration that runs on your own infrastructure, with no data leaving your environment and no monthly bill for the framework itself.

The two tiers are not mutually exclusive. Many teams run DeepEval or OpenAI Evals for unit-level CI checks and LangSmith or Braintrust for production trace annotation and dataset management. The workflow pattern that emerging engineering teams are converging on — as described in our deep-dive on LangSmith, Langfuse, and Arize Phoenix observability platforms — is OSS for speed at the PR level and commercial for compliance and audit at the production level.

The five commercial platforms differ primarily on architecture philosophy. LangSmith is tightly integrated with the LangChain ecosystem and optimized for LangGraph agent tracing. Braintrust is dataset-first and model-agnostic, with sandboxed Python custom scorers that no other platform currently offers. Helicone is a proxy wrapper — it requires zero SDK changes and bolts eval capability onto your existing LLM calls. Phoenix (Arize) is OpenTelemetry-native, giving it the strongest portability story for teams that already instrument with OTel. Promptfoo is security and red-team focused, now integrated into the OpenAI Frontier infrastructure following the March 2026 acquisition.

The three OSS standards each fill a distinct gap. OpenAI Evals (18.5k GitHub stars, MIT) is a registry-based framework for reproducible benchmark-style evals, the closest analog to running a published benchmark against your agent. DeepEval (15.7k stars, Apache 2.0) is the pytest-native framework, wrapping G-Eval, DAG metrics, and a full agentic eval harness in developer ergonomics that feel like a standard Python test suite. Inspect AI (2.1k stars, MIT, UK AISI) ships 200+ pre-built evaluations across providers and is the gold standard for public-sector, safety-critical, and multi-provider testing environments.

02 — Commercial Platforms

LangSmith, Braintrust, Helicone, Phoenix, Promptfoo — five architectures, five trade-offs.

Each commercial platform has a distinct architectural identity that shapes everything downstream: what data it captures, how custom metrics are written, whether it supports on-prem deployment, and which compliance tier it can satisfy. The cost delta between entry and SOC 2 tiers is the sharpest selection signal — covered in full in §05, but surfaced here so it is present while reading platform descriptions.

LangSmith — Three published tiers: Developer (free, 5,000 base traces/month), Plus ($39/seat/month, 10,000 base traces), and Enterprise (custom). SOC 2 Type II certification was announced July 15, 2024; HIPAA and GDPR compliance are also certified. Enterprise ships self-hosted deployment via Docker Compose or Kubernetes, plus a hybrid mode that keeps the data plane in the customer's VPC. The Plus-tier SOC 2 coverage is the most accessible entry point in the commercial field. LangSmith's natural home is LangChain and LangGraph teams: Harrison Chase, LangChain CEO, described the founding purpose as helping “close the gap between prototype and production” by giving developers visibility into LLM context. For teams outside the LangChain ecosystem, LangSmith is usable but carries integration friction that Phoenix (OTel) or Braintrust (model-agnostic) does not.

Braintrust — Starter (free, 1 GB processed data, 10k scores, 14-day retention), Pro ($249/mo, SOC 2 Type II, custom topics and environments), Enterprise (custom, on-prem available). The architectural differentiator is sandboxed Python custom scorers — the only commercial platform that runs custom scorer code in an isolated environment, reducing the risk of scorer-side side effects and enabling arbitrary Python logic without trust issues. Braintrust's February 2026 Series B announcement at $800M valuation with ICONIQ as lead investor validates the dataset-driven, model-agnostic positioning. Ankur Goyal, CEO, on the Latent Space podcast: “If you embrace evaluation as the sort of core workflow in AI engineering, meaning every time you make a change, you evaluate it… then you're able to build much, much better AI software.”

Helicone — Hobby (free, 10,000 requests, 1 GB storage, 7-day retention), Pro ($79/mo), Team ($799/mo, SOC 2 Type II + HIPAA), Enterprise (custom, on-prem available). The proxy architecture is Helicone's defining trait: teams add a single base-URL change and Helicone captures every LLM call automatically, with no SDK changes or instrumentation work. Scores and datasets are bolt-on capabilities. CI/CD integration is possible via webhooks. The SOC 2 cliff here is steep — $79/mo to $799/mo is a 10x jump for compliance. See the full breakdown in our observability stack TCO calculator across LangSmith, Langfuse, and Helicone.

Phoenix (Arize) — Phoenix itself is free, open-source (Elastic License 2.0), and self-hosted by default: 9.8k GitHub stars, Python (arize-phoenix-evals) and TypeScript (@arizeai/phoenix-evals) packages. The Arize AX managed cloud adds tiers: AX Free (25k spans/mo, 1 GB, 15-day retention), AX Pro ($50/mo), AX Enterprise (custom, SOC 2 Type II + HIPAA). Phoenix v16.0.0 shipped May 21, 2026 with sandboxed Code Evaluators for composite scoring, embedding-based eval, and LLM-jury implementations executed server-side. The OTel-native architecture is its key portability advantage — any team already instrumented with OpenTelemetry can point traces at Phoenix without SDK lock-in. No other platform in this list matches that portability story. See our deep-dive on LangSmith, Langfuse, and Arize Phoenix observability platforms for the full OTel integration comparison.

Promptfoo — Community tier free (10k probes/month for red teaming); Enterprise and On-Premise pricing is custom. SOC 2 certified and ISO 27001 certified at the organization level. Promptfoo's GitHub repo carries 21.5k stars and an MIT license; founders Ian Webster and Michael D'Angelo have committed to keeping it open source post-acquisition: “Promptfoo will remain open source and we will continue to serve users and customers.” The acquisition is documented in a March 9, 2026 TechCrunch report: Promptfoo had raised just $23M and was valued at $86M after its most recent round in July 2025. The vendor-objectivity concern this raises for non-OpenAI teams is addressed in full in §08.

03 — Open-Source Standards

OpenAI Evals, DeepEval v4.0.3, Inspect AI — three OSS frameworks worth knowing.

Open-source eval frameworks ship no SOC 2, no managed trace storage, and no enterprise SLA. What they ship instead is eval logic you own entirely, metrics you can audit, and CI/CD integration that runs in any Python or Node environment. For teams at the prototype-to-early-production phase, these frameworks provide the fastest path to meaningful eval coverage. For regulated teams, they are the CI layer that sits under a commercial platform's managed data pipeline.

OpenAI Evals (18.5k GitHub stars, MIT, github.com/openai/evals) is a framework for evaluating LLMs and LLM systems, plus an open-source registry of benchmarks. It is template-driven: eval definitions live in YAML, custom code evals are supported, and the registry provides a library of community-contributed benchmarks. One important caveat: OpenAI Evals does not ship a CI/CD runner — it is a Python-runnable framework plus registry. CI integration is bring-your-own glue (a GitHub Actions step that calls the eval script). The repo has 691 commits on main as of May 2026 with no tagged GitHub releases; development cadence is irregular. Best fit: teams running reproducible benchmark-style evals and contributing to or consuming the community benchmark registry.

DeepEval v4.0.3 (15.7k GitHub stars, Apache 2.0, Confident AI, github.com/confident-ai/deepeval) released May 21, 2026 with Decision Graph Logic for granular simulation control. v4.0.2 (May 13, 2026) shipped a coding-agent eval harness and a terminal trace inspection TUI. The framework is fully pytest-integrated: deepeval test run in CI is the canonical command. Metrics cover the full agent lifecycle — G-Eval and DAG for LLM-as-judge, Answer Relevancy and Faithfulness for RAG, Task Completion and Tool Correctness for agents, plus Hallucination, Bias, and Toxicity as safety metrics. DeepEval is the closest open-source analog to a commercial eval platform's metric library, without the managed infrastructure. Confident AI offers a cloud layer for teams that want managed datasets and history. See our AI evaluation metrics reference guide for how G-Eval and DAG compare to deterministic scorers.

Inspect AI v0.3.225 (2.1k GitHub stars, MIT, github.com/UKGovernmentBEIS/inspect_ai) released May 23, 2026, maintained by the UK AI Security Institute and Meridian Labs. The defining feature is 200+ pre-built evals spanning OpenAI, Anthropic, Google, Grok, Mistral, Hugging Face, AWS Bedrock, Azure AI, vLLM, and Ollama — no other framework approaches that provider coverage out of the box. Releases are tracked via PyPI (not GitHub Releases). The official documentation site is the canonical reference. Best fit: public-sector teams with multi-provider requirements, safety-critical workloads, and teams that want to run established safety benchmarks without writing custom metric code.

OpenAI Evals

MIT · Registry-based benchmarks

18.5k★

Template-driven YAML evals + community benchmark registry. No built-in CI runner — bring-your-own GitHub Actions glue. Best for reproducible benchmark-style evals.

MIT · github.com/openai/evals

DeepEval

Apache 2.0 · v4.0.3 · May 21, 2026

15.7k★

Pytest-integrated. G-Eval, DAG, RAG metrics, agent metrics (Task Completion, Tool Correctness), Hallucination/Bias/Toxicity. 'deepeval test run' in CI. Confident AI cloud optional.

Apache 2.0 · confident-ai/deepeval

Inspect AI

MIT · v0.3.225 · UK AISI

200+ evals

200+ pre-built evals across 10+ providers. Built-in agent + tool capture. Model-graded + custom scorers. PyPI releases. Public-sector and safety-critical gold standard.

MIT · UKGovernmentBEIS/inspect_ai

Anthropic Console

Built-in eval · claude.ai dashboard

5-pt grades

5-point quality grading, side-by-side compare, prompt versioning. Double-brace variables. Test cases: manual, Claude-generated, or CSV import. Prompt generator uses Claude Opus 4.1.

platform.claude.com/docs/eval-tool

04 — Framework Comparison

All 8 frameworks in one comparison matrix: price, SOC 2, on-prem, CI/CD.

No existing comparison in May 2026 covers all eight frameworks — commercial and open-source — in a single matrix with SOC 2 tier, on-prem availability, and CI/CD support explicit. Most roundups cover four commercial platforms or three open-source tools; the cross-tier view is absent. The matrix below is sourced from vendor pricing pages (retrieved 2026-05-24), GitHub repositories, and the research data confirmed in this post's fact file. For the deeper observability-first view of the commercial subset, see our 30/60/90-day observability rollout plan.

LangSmith

Commercial · LangChain-native · SOC 2 at Plus

Pricing: Developer (free, 5k traces/mo) → Plus ($39/seat, 10k traces) → Enterprise (custom). SOC 2 Type II + HIPAA + GDPR: certified Jul 15, 2024. On-prem: Enterprise (self-host via Docker Compose / Kubernetes + hybrid VPC mode). CI/CD: yes (native evaluator runs). Custom scorers: yes (online + offline). Trace capture: auto via SDK + partial OTel. Best for: LangChain and LangGraph teams; the lowest-cost SOC 2 entry point in the commercial tier.

Best: LangChain/LangGraph teams

Braintrust

Commercial · dataset-driven · sandboxed scorers

Pricing: Starter (free, 1 GB, 10k scores, 14-day retention) → Pro ($249/mo, SOC 2) → Enterprise (custom, on-prem). SOC 2 Type II: Pro tier and above. On-prem: Enterprise only. CI/CD: yes (GitHub Action). Custom scorers: yes — sandboxed Python, the only platform with this isolation model. Trace capture: manual + SDK. Customers: Notion, Replit, Cloudflare, Ramp, Dropbox. $80M Series B (Feb 2026) at $800M valuation.

Best: dataset-driven model-agnostic teams

Helicone

Commercial · proxy-first · zero SDK changes

Pricing: Hobby (free, 10k requests, 7-day retention) → Pro ($79/mo) → Team ($799/mo, SOC 2 + HIPAA) → Enterprise (custom, on-prem). SOC 2 Type II: Team tier only. On-prem: Enterprise only. CI/CD: via webhooks. Custom scorers: Scores + Datasets APIs. Trace capture: proxy-based (single base-URL change). Steepest SOC 2 cliff in the commercial tier: $79 → $799 is a 10x jump.

Best: observability-first, minimal instrumentation

Phoenix (Arize)

Commercial + OSS · OTel-native · v16.0.0

Pricing: Phoenix OSS (free, self-host, Elastic License 2.0) → AX Free (25k spans/mo) → AX Pro ($50/mo) → AX Enterprise (custom, SOC 2 + HIPAA). SOC 2 Type II: AX Enterprise only. On-prem: Phoenix is the OSS self-hosted product by default. CI/CD: yes. Custom scorers: Code Evaluators (v16), LLM jury, embedding-based eval. Trace capture: OpenTelemetry-native. 9.8k GitHub stars. Strongest portability story for existing OTel instrumentation.

Best: OTel-instrumented and ML-rigor teams

Promptfoo

Commercial (OpenAI) · security-first · red team

Pricing: Community (free, 10k probes/mo red teaming) → Enterprise + On-Premise (custom). SOC 2 + ISO 27001: certified at org level. On-prem: Enterprise. CI/CD: yes (PR review GitHub Action). Custom scorers: yes. Trace capture: CLI + plugins. 21.5k GitHub stars, MIT license. Acquired by OpenAI March 9, 2026 ($86M valuation). >25% Fortune 500 usage (per TechCrunch). Vendor-objectivity consideration for non-OpenAI model teams.

Best: security, red-team, regulated workloads

OpenAI Evals

Open-source · registry-based · 18.5k stars

Pricing: free (MIT). SOC 2: none (self-hosted). On-prem: always (self-host). CI/CD: bring-your-own (no built-in runner — wrap in GitHub Actions). Custom scorers: yes (template + custom code). Trace capture: registry-based. 18.5k GitHub stars, 691 commits. No tagged GitHub releases; active development, irregular cadence. Best for reproducible benchmark-style evals and community benchmark registry consumption.

Best: reproducible benchmarks, community registry

DeepEval

Open-source · pytest-native · v4.0.3

Pricing: free OSS (Apache 2.0) + Confident AI Cloud (separate). SOC 2: none for OSS. On-prem: self-host always. CI/CD: yes ('deepeval test run'). Custom scorers: G-Eval, DAG. Metrics: RAG (Answer Relevancy, Faithfulness, Contextual Recall/Precision), Agent (Task Completion, Tool Correctness, Goal Accuracy), Safety (Hallucination, Bias, Toxicity). 15.7k GitHub stars. v4.0.3 May 21, 2026.

Best: engineering teams wanting pytest ergonomics

Inspect AI

Open-source · UK AISI · 200+ pre-built evals

Pricing: free (MIT). SOC 2: none (self-hosted). On-prem: always. CI/CD: yes. Custom scorers: model-graded + custom. Trace capture: built-in agent + tool capture. 2.1k GitHub stars. v0.3.225 May 23, 2026. Maintained by UK AI Security Institute + Meridian Labs. 200+ pre-built evals across 10+ providers. No GitHub Releases tab — releases tracked on PyPI.

Best: public-sector, safety-critical, multi-provider

05 — Compliance Pricing

The SOC 2 cliff: what compliance actually costs per platform.

Every comparison post we surveyed in May 2026 lists pricing tiers without naming the exact moment when SOC 2 compliance kicks in. That gap matters for procurement decisions. The delta between a functional trial tier and a compliant production tier can be 10x — and for teams that need compliance on day one, the entry price is not the number that matters. The number that matters is the SOC 2 floor.

The analysis below is grounded in vendor pricing pages retrieved 2026-05-24. Pricing changes without notice; re-verify before procurement.

SOC 2 compliance entry price per platform (2026-05-24)

Source: vendor pricing pages (langchain.com/pricing, braintrust.dev/pricing, helicone.ai/pricing, arize.com/pricing, promptfoo.dev/pricing) — retrieved 2026-05-24

LangSmith — SOC 2 entry: Plus ($39/seat/mo)
Developer tier (free) has no SOC 2 signal; Plus tier carries the certification, the lowest commercial entry point.

$39/seat

Braintrust — SOC 2 entry: Pro ($249/mo)
Starter (free) explicitly marked '—' for SOC 2 on pricing page; Pro and above carry it.

$249/mo

Helicone — SOC 2 entry: Team ($799/mo)
Pro ($79/mo) has no SOC 2; Team tier is the first compliant tier, a 10x jump from Pro.

$799/mo

Arize AX — SOC 2 entry: Enterprise (custom)
AX Free ($0) and AX Pro ($50/mo) have no SOC 2; only AX Enterprise carries it, with no published price.

Custom

Promptfoo — SOC 2: org-level (Community free tier)
SOC 2 + ISO 27001 certified at org level; community tier is free with 10k probes/mo. Enterprise pricing custom.

Free / Custom

Three structural patterns emerge from this breakdown. First, LangSmith is the only commercial platform where SOC 2 compliance is accessible at a per-seat price — $39/seat means a team of five pays $195/month for a compliant tier. Braintrust's $249/mo is flat-rate (no per-seat cliff for small teams). Helicone's $799/mo Team tier is flat-rate but jumps from $79 — that delta is the highest absolute cliff in the group.

Second, OSS frameworks carry no SOC 2 of their own. Teams running DeepEval or Inspect AI self-hosted are responsible for their own infrastructure's compliance posture — SOC 2 for the eval layer means SOC 2 for the compute environment running it. This is not a disqualifier for OSS; it is a procurement framing point.

Third, on-prem availability is inconsistent in ways that matter for regulated workloads. LangSmith ships self-hosted plus a hybrid VPC mode on Enterprise. Braintrust ships on-prem on Enterprise. Helicone ships on-prem only on Enterprise. Arize ships Phoenix as the self-hosted product by default (it's the OSS). For regulated data environments where traces cannot leave the customer network, the on-prem availability tier is the dominant constraint — and most tier-comparison posts bury it. Our Agent Success Rate (ASR) methodology covers how to structure eval pipelines for regulated environments where external data transfer is restricted.
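To make the per-seat versus flat-rate comparison in the first pattern concrete, here is a minimal Python sketch that computes the monthly SOC 2 floor for a given seat count. The prices are the list prices quoted in this section (retrieved 2026-05-24). The `Soc2Tier` class and the function names are hypothetical, and the model deliberately ignores usage-based overages, annual discounts, and custom Enterprise quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Soc2Tier:
    platform: str
    tier: str
    monthly_usd: float  # list price of the first SOC 2 tier
    per_seat: bool      # True if the price scales with seat count


# List prices as quoted in this section; re-verify before procurement.
SOC2_FLOORS = [
    Soc2Tier("LangSmith", "Plus", 39.0, per_seat=True),
    Soc2Tier("Braintrust", "Pro", 249.0, per_seat=False),
    Soc2Tier("Helicone", "Team", 799.0, per_seat=False),
]


def monthly_floor(tier: Soc2Tier, seats: int) -> float:
    """Monthly list price of the cheapest SOC 2 tier for a team of `seats`."""
    if seats < 1:
        raise ValueError("seats must be at least 1")
    return tier.monthly_usd * seats if tier.per_seat else tier.monthly_usd


def cheapest_floor(seats: int) -> Soc2Tier:
    """Platform with the lowest SOC 2 floor at this team size (list price only)."""
    return min(SOC2_FLOORS, key=lambda tier: monthly_floor(tier, seats))
```

Under these list prices, LangSmith Plus is the cheapest floor for a team of five ($195/month), while the flat-rate Braintrust Pro tier becomes cheaper from seven seats upward ($249 versus $273).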

“AI is an operating system that changes constantly, beyond what humans can inspect directly, and engineering and product teams need to maintain clarity, accountability, and confidence in every update.” — Ankur Goyal, CEO, Braintrust, Braintrust Series B announcement, February 17, 2026.

06 — Benchmark Methodology

SWE-Bench Verified, the harness effect, and why scores aren't comparable across vendors.

SWE-Bench Verified is the most-cited coding-agent benchmark in the eval ecosystem as of May 2026, and one of the most consistently misread. The numbers most teams reference are Claude Opus 4.7 at 87.6% (Anthropic, April 16, 2026) and Codex CLI on GPT-5.5 at a reported 88.7%. Both figures stand as reported, yet they are not comparable in the way a head-to-head score implies. The 87.6% figure can be confirmed directly against Anthropic's primary announcement. The 88.7% figure for GPT-5.5 via Codex CLI is cross-referenced in our SWE-Bench Live leaderboard analysis and carries the same harness caveat.

The harness effect is the dominant variable. Identical model weights in different evaluation harnesses commonly produce 10-20 point score differences on SWE-Bench. The scoring depends on the task selection within the “Verified” subset, the scaffolding and tool access given to the model, how retries are counted, and how partial-credit patches are graded. When two vendors report different SWE-Bench numbers for their respective models, the harness is usually why — not the models themselves. “My model scored X on SWE-Bench Verified” is a different statement from “my model beats a competitor on SWE-Bench Verified” unless both used identical harness configurations.
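One of the harness variables named above, how retries are counted, can be illustrated with a small Python sketch. The attempt outcomes below are invented for illustration and are not SWE-Bench data; `ATTEMPTS` and `pass_rate` are hypothetical names. The point is only that the same set of model outputs yields different headline scores depending on the retry policy the harness applies.

```python
# Hypothetical per-task attempt outcomes: True means the patch passed the tests.
ATTEMPTS = {
    "task-1": [True],
    "task-2": [False, True],
    "task-3": [False, False, True],
    "task-4": [False, False, False],
}


def pass_rate(attempts: dict, max_attempts: int) -> float:
    """Fraction of tasks solved within the first `max_attempts` tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    solved = sum(any(tries[:max_attempts]) for tries in attempts.values())
    return solved / len(attempts)
```

With this toy data, a single-attempt harness reports 25%, a two-attempt harness reports 50%, and a three-attempt harness reports 75%, all from identical model behavior.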

One additional benchmark deserves mention for context. CursorBench, where Cursor's Composer 2.5 reportedly scores 63.2%, is built and scored by Cursor on Cursor's own harness — vendor-controlled methodology. Aider's polyglot benchmark is similarly self-scored. OSWorld (369 real computer tasks) and AgentBench (1,360 tasks across 8 environments) are academic, externally run benchmarks — a different validity tier from vendor-published numbers.

The practical implication for eval framework selection: before choosing a platform based on a model's published SWE-Bench score, establish your own internal benchmark on a golden set of tasks representative of your actual production workload. Published leaderboard positions tell you about the model in the benchmark vendor's harness. Your harness is the only number that matters for your production agent. This is the foundation of step 4 in the custom eval design template below — establish baseline + acceptance bar against your own golden dataset, not against a published benchmark. See our cost-per-successful-task as an eval metric for the economic framing that ties task completion rates to actual production value.

Identical model weights in different harnesses produce 10-20 point score swings on SWE-Bench Verified. When two vendors publish different scores for different models, the harness is usually why, not the models. Build your own golden dataset before citing any published leaderboard in a procurement decision.
Digital Applied analysis, May 22, 2026

07 — Custom Eval Design

The 5-step custom eval design template: incidents to CI gates.

Every framework in this roundup ships documentation on how to write a metric. None ships a vendor-neutral guide on how to design a custom eval program from scratch — starting from production incidents and ending in a CI/CD gate that blocks deploys. That practitioner gap is what the 5-step template below fills.

The template applies to any framework combination: LangSmith for dataset storage + DeepEval for CI checks, Braintrust end-to-end, or Inspect AI for provider-agnostic coverage with self-hosted trace capture. The framework is the tool; the 5-step process is the methodology. For the related pass-rate and revision-rate framing, see our agent quality metrics — pass rate and revision rate guide, and for the prompt-level regression complement, see the prompt-library regression framework.

Step 01

Define failure modes from production incidents

Output: Failure-mode taxonomy

Start from past production incidents — wrong tool selected, hallucinated price, infinite tool loop, leaked PII. Do not invent hypothetical failure modes. Build a taxonomy of 10-20 real cases. Each failure mode becomes an eval category with its own scorer. Spreadsheet or incident-retro doc is the deliverable here — no eval framework required at this step.

Artifact: failure-mode taxonomy (10-20 cases)

Step 02

Build a golden dataset — 50-200 examples

Output: golden.jsonl versioned in git

Aim for 50 to 200 hand-labeled examples per failure mode; quality matters more than quantity. Label them yourself or with domain experts; do not generate them synthetically. Version the dataset in git alongside the agent code. Tools: Braintrust datasets, LangSmith annotation queue, or a CSV file. The dataset is the most valuable artifact in your eval program, so protect it as a first-class engineering asset.

Artifact: golden.jsonl · git-versioned
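As an illustration of what a git-versioned golden.jsonl might look like in practice, here is a minimal Python sketch that parses and validates the file. The schema (`id`, `failure_mode`, `input`, `expected`) is an assumption made for this example, not a format prescribed by any of the frameworks above, and the function names are hypothetical.

```python
import json
from pathlib import Path

# Assumed schema for illustration; adapt to your own dataset.
REQUIRED_FIELDS = {"id", "failure_mode", "input", "expected"}


def parse_golden_lines(lines):
    """Parse JSONL lines, rejecting malformed records and duplicate ids."""
    examples, seen_ids = [], set()
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        if record["id"] in seen_ids:
            raise ValueError(f"line {lineno}: duplicate id {record['id']!r}")
        seen_ids.add(record["id"])
        examples.append(record)
    return examples


def load_golden(path):
    """Load and validate a golden dataset file such as golden.jsonl."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_golden_lines(handle)


def count_by_failure_mode(examples):
    """Examples per failure mode, to check each one against the 50-200 target."""
    counts = {}
    for example in examples:
        mode = example["failure_mode"]
        counts[mode] = counts.get(mode, 0) + 1
    return counts
```

Running a validator like this in CI is one way to keep hand-labeled data from drifting into an inconsistent shape as more people contribute to it.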

Step 03

Choose a scorer mix — 60/30/10

Output: scorer registry per failure mode

Target mix: 60% deterministic (exact match, regex, JSON-schema validation, latency threshold), 30% LLM-as-judge (G-Eval, DeepEval DAG, Braintrust custom Python scorers, Phoenix Code Evaluators), 10% human-in-the-loop for ambiguous cases. Never rely on LLM-as-judge alone — it introduces scorer-side stochasticity on top of the agent's stochasticity. Deterministic scorers are your ground truth.

60% deterministic · 30% LLM-judge · 10% human
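The deterministic share of the mix can be sketched with the standard library alone. The Python below is a minimal illustration of the four deterministic scorer types named in this step, under stated assumptions: all function names are hypothetical, and `has_required_keys` is a lightweight stand-in for full JSON-schema validation rather than a real schema validator. `scorer_mix` simply reports the share of each scorer kind so a registry can be compared against the 60/30/10 target.

```python
import json
import re
from collections import Counter


def exact_match(output: str, expected: str) -> bool:
    """Exact string match, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """True if the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def has_required_keys(output: str, required: set) -> bool:
    """Stand-in for schema validation: a JSON object containing all required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required.issubset(parsed.keys())


def within_latency(latency_ms: float, threshold_ms: float) -> bool:
    """Latency threshold check."""
    return latency_ms <= threshold_ms


def scorer_mix(kinds: list) -> dict:
    """Share of each scorer kind in a registry, as fractions of the total."""
    total = len(kinds)
    return {kind: count / total for kind, count in Counter(kinds).items()}
```

Scorers of this kind return the same verdict on every run, which is what makes them usable as the ground-truth layer beneath the LLM-as-judge and human-review shares.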

Step 04

Establish baseline + acceptance bar

Output: baseline scorecard + bar doc

Score your current production agent on the golden set before setting acceptance bars — never set an acceptance bar without a baseline. Example: 'Tool-selection failure mode: current baseline 88%, acceptance bar 95% — below baseline blocks deploy.' Block-on-regression, not block-on-absolute-threshold. All 8 frameworks in this roundup can run this scoring pass.

Artifact: baseline scorecard + acceptance bar
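A baseline scorecard can be as simple as a pass rate per failure mode. The Python sketch below assumes each golden example has already been scored as pass or fail; the `scorecard` function and the `(failure_mode, passed)` input shape are hypothetical choices for this example.

```python
from collections import defaultdict


def scorecard(results):
    """Pass rate in percent per failure mode.

    `results` is an iterable of (failure_mode, passed) pairs, one per
    golden example scored against the current production agent.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for mode, passed in results:
        totals[mode] += 1
        passes[mode] += bool(passed)
    return {mode: 100.0 * passes[mode] / totals[mode] for mode in totals}
```

For instance, 22 passes out of 25 tool-selection examples gives the 88% baseline used in the example above; the acceptance bar is then written down next to that number rather than chosen in isolation.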

Step 05

Wire CI/CD regression gates

Output: GitHub Actions workflow

Eval runs on every PR touching agent code or prompts. Failed eval = blocked merge. Slack alert for borderline cases (within 2 percentage points of acceptance bar). Tools: Promptfoo PR review GitHub Action, 'deepeval test run' in CI, LangSmith CI/CD eval trigger, Braintrust GitHub Action. This step makes evals mandatory infrastructure — not an optional quality check.

Artifact: CI workflow + Slack alert
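The gate logic itself is small. The Python sketch below shows one way a CI step might turn per-failure-mode scores into an exit code plus alert messages. It assumes that a "failed eval" means a regression below baseline (the block-on-regression rule from Step 04) and that "borderline" means within 2 percentage points of the acceptance bar; `evaluate_run`, the input shape, and the message format are hypothetical, and posting alerts to Slack is left out.

```python
BORDERLINE_MARGIN = 2.0  # percentage points around the acceptance bar


def evaluate_run(results: dict):
    """Return (exit_code, alerts) for a CI eval run.

    `results` maps failure mode -> (baseline, current, acceptance_bar),
    all in percent. A regression below baseline blocks the merge; a score
    near the acceptance bar produces a borderline alert without blocking.
    """
    exit_code = 0
    alerts = []
    for mode, (baseline, current, bar) in sorted(results.items()):
        if current < baseline:
            exit_code = 1
            alerts.append(
                f"BLOCK {mode}: {current:.1f}% is below baseline {baseline:.1f}%"
            )
        elif abs(current - bar) <= BORDERLINE_MARGIN:
            alerts.append(
                f"BORDERLINE {mode}: {current:.1f}% is within "
                f"{BORDERLINE_MARGIN:.0f} points of the {bar:.1f}% bar"
            )
    return exit_code, alerts
```

A CI script could print the alerts and pass the returned code to `sys.exit`, so that a nonzero code fails the job and blocks the merge under a required status check.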

08 — Vendor Objectivity

Promptfoo + OpenAI: the objectivity question no roundup is asking.

Coverage of the Promptfoo acquisition, from the March 2026 announcement through the I/O 2026 news cycle, essentially rewrote the press release: OpenAI bought Promptfoo, the tool remains open source, Ian Webster committed to vendor neutrality. Those facts are all correct. None of the pieces we surveyed raised the structural question that matters for enterprise procurement: if your security and red-team eval tooling is owned by a model vendor, is the grading structurally neutral for non-OpenAI model workflows?

The concern is not that Promptfoo will actively bias scores against Claude Opus 4.7, Gemini 3.5 Flash, or open-source models. The concern is structural. Promptfoo is integrated into “OpenAI Frontier” for “automated red-teaming, evaluating agentic workflows for security concerns, and monitoring activities for risks and compliance needs” (per Futurum's post-acquisition analysis). The development priorities, benchmark selection, and integration depth will now reflect OpenAI's product roadmap — not a neutral open-source foundation's roadmap.

For teams running Claude Opus 4.7 or Gemini 3.5 Flash in production, the red-team coverage depth for non-OpenAI model attack surfaces may lag behind OpenAI model coverage over time. This is not fabrication — it is the standard analysis applied to any eval tool with a dominant patron. Applied to CursorBench (built by Cursor, scored by Cursor) or Aider's polyglot benchmark (built by Aider), the same skepticism is warranted. Promptfoo's founders have explicitly committed to open-source maintenance. Take that commitment at face value and track whether the commit activity on non-OpenAI provider integrations diverges post-acquisition.

The practical guidance: if your production agents run on non-OpenAI models and red-team coverage for those specific models is a material compliance requirement, supplement Promptfoo with Inspect AI (which covers OpenAI, Anthropic, Google, Grok, Mistral, AWS Bedrock, and Azure AI in its 200+ pre-built evals) or run DeepEval's Hallucination and Bias metrics as an independent scoring layer. The goal is not to avoid Promptfoo — it is to not rely on a single vendor-controlled tool for your entire security eval surface. Our AI transformation advisory work includes eval-stack design for regulated teams with multi-vendor model portfolios precisely because of this kind of structural risk.

09 — Selection Guide
Which framework fits your team — a decision heuristic.

The eight-framework matrix is useful for inventory. The decision heuristic below converts that inventory into a starting point. Three variables drive the initial cut: team size and existing stack, the compliance tier required in the next 12 months, and whether eval needs to run inside the network or can send traces to a cloud provider.

Start on LangSmith if you are already on LangChain or LangGraph. Automatic trace capture, native LangGraph integration, and a $39/seat entry point for SOC 2 make it the zero-friction starting point for that ecosystem. If you are not in the LangChain ecosystem, the integration friction is real; evaluate Phoenix or Braintrust instead.

Start on Braintrust if you need dataset-driven eval with model-agnostic scoring and are willing to pay the $249/mo Pro floor for SOC 2. The sandboxed Python scorer capability is a genuine differentiator for teams that need custom business-logic metrics that cannot be expressed in template-based scorers. If you are not yet at the $249/mo commitment point, start with DeepEval OSS + a git-versioned golden dataset and migrate to Braintrust when the dataset management overhead justifies the subscription.

Start on Phoenix + DeepEval if you need self-hosted trace storage plus a pytest-native CI eval layer. Phoenix OSS handles trace capture and visualization with full OTel portability; DeepEval handles the CI eval run. Both are free and self-hostable. This combination covers the 80% use case for engineering teams that are not yet in a compliance-gated procurement cycle. For the broader observability stack design, see our agent observability 2026 guide to evals, traces, and cost.

Start on Inspect AI if you work in the public sector, a regulated health or finance environment, or need to run evals across five or more model providers. The 200+ pre-built evals and provider-agnostic design mean you can benchmark Claude Opus 4.7, GPT-5.5, Gemini 3.5 Flash, Mistral, and a Bedrock model in a single eval run without writing custom adapter code. The 2.1k GitHub star count understates its importance in the safety and public-sector ecosystem.

Add Promptfoo to your stack if security red-teaming and automated jailbreak testing are material requirements — and you have considered the vendor-objectivity implications for non-OpenAI model portfolios described in §08. Promptfoo's community tier (free, 10k probes/month) makes it the easiest red-team layer to add to any existing stack without procurement friction.
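The heuristic above can be written down as a small function, which makes the precedence between rules explicit and easy to argue about in a design review. This is a sketch under stated assumptions: the `TeamProfile` fields are hypothetical names, and the ordering of the checks (regulated or multi-provider first, then LangChain, then self-hosting, then budget) is one reasonable reading of the guidance, not a ranking the guide itself prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamProfile:
    uses_langchain: bool                 # already on LangChain or LangGraph
    needs_self_hosted_traces: bool       # traces must stay inside the network
    regulated_or_public_sector: bool     # public sector, regulated health or finance
    provider_count: int                  # number of model providers to evaluate
    willing_to_pay_braintrust_pro: bool  # accepts the $249/mo Pro floor
    needs_red_teaming: bool              # security red-teaming is a material requirement


def recommend(profile):
    """Return (starting_point, add_ons) for a team profile.

    ASSUMPTION: the order of checks below is illustrative, not prescribed.
    """
    if profile.regulated_or_public_sector or profile.provider_count >= 5:
        start = "Inspect AI"
    elif profile.uses_langchain:
        start = "LangSmith"
    elif profile.needs_self_hosted_traces:
        start = "Phoenix + DeepEval"
    elif profile.willing_to_pay_braintrust_pro:
        start = "Braintrust"
    else:
        start = "DeepEval OSS + git-versioned golden dataset"
    # Promptfoo is an add-on layer rather than a starting point.
    add_ons = ["Promptfoo"] if profile.needs_red_teaming else []
    return start, add_ons
```

Treat the output as a starting point for evaluation rather than a final answer; teams that match several rules at once should trial the candidates side by side.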

If you are wiring any of these frameworks into a GitHub Actions workflow, use our tutorial on building a Codex test-generation pipeline as the CI integration reference. The patterns apply across DeepEval, LangSmith, and Promptfoo with minimal adaptation.
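Whatever framework runs the evals, the CI step usually reduces to the same shape: score the agent against a golden dataset, then fail the build if the pass rate regresses past a tolerance. The sketch below shows that shape in plain Python so it stays framework-neutral; `stub_agent`, the golden cases, the 0.75 baseline, and the 0.02 tolerance are all placeholder assumptions, and a real pipeline would swap in its framework's scorers.

```python
def exact_match(expected, actual):
    """Deterministic scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def pass_rate(golden, agent, scorer=exact_match, threshold=1.0):
    """Fraction of golden cases whose score meets the threshold."""
    if not golden:
        raise ValueError("golden dataset is empty")
    passed = sum(
        1 for case in golden
        if scorer(case["expected"], agent(case["input"])) >= threshold
    )
    return passed / len(golden)


def regression_gate(current, baseline, tolerance=0.02):
    """True when the pass rate has not dropped more than `tolerance` below baseline."""
    return current >= baseline - tolerance


# Hypothetical stand-ins for a real agent and a real golden dataset.
def stub_agent(prompt):
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


GOLDEN = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "paris"},
    {"input": "capital of Japan", "expected": "Tokyo"},
    {"input": "3*3", "expected": "9"},
]


def test_no_regression():
    # pytest collects this function; a failed assert fails the CI job.
    assert regression_gate(pass_rate(GOLDEN, stub_agent), baseline=0.5)
```

Keeping the gate separate from the scorer lets the same CI check cover deterministic scorers and LLM-judge scorers, as long as each returns a numeric score.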

Conclusion

Eval tooling is now strategic infrastructure — treat the selection decision accordingly.

The eight frameworks in this guide are not interchangeable commodity tools. They represent five distinct architectural philosophies — LangChain-native tracing, dataset-driven model-agnostic scoring, proxy-first zero-instrumentation, OTel-portable self-hosted, and security-first red-team coverage — plus three open-source standards that cover pytest ergonomics, public-sector safety requirements, and reproducible benchmark registries. The selection decision should be made with the same rigor applied to any production infrastructure choice: define the failure modes you need to catch, establish the compliance tier your organization requires in the next 12 months, and map those requirements against the SOC 2 cliff at each platform.

The Promptfoo acquisition and Braintrust's $800M valuation tell the same story: eval tooling has stopped being an optional engineering quality layer and become a strategic infrastructure category. The teams that build a systematic eval program now — production incident taxonomy, golden datasets, deterministic + LLM-judge scorer mix, CI regression gates — will have a structural quality advantage over teams that ship agents on published benchmark scores alone. The harness effect on SWE-Bench is not a footnote; it is the central lesson. Build your own harness, on your own golden data, before citing any published leaderboard in a deployment decision.

AI Agent Eval in 2026: Eight Frameworks, One Honest Comparison