AI Development · Methodology · 14 min read · Published June 2, 2026

pass^k over pass@k · calibrated judges · trace-driven datasets · CI gating

Building an AI Agent Evaluation Pipeline: 2026 Methodology

Most agent eval content optimizes the dashboard, not the product. A pipeline that actually ships measures all-runs consistency with pass^k, calibrates its LLM judge against a human gold set, gates CI on real scores, and grows from production traces. Here is the methodology, end to end.

Digital Applied Team
Senior strategists · Published June 2, 2026
Sources: 12 primary
pass@3 vs pass^3: 97% vs ~34% for the same agent (−63 pp gap)
Min gold-set for judge: 100+ labeled examples
Acceptable judge kappa: ≥ 0.6 vs human reviewers
Eval share of dev time: 60–80% in successful teams

An AI agent evaluation pipeline is the engineering system that decides whether your agent actually works — not in a demo, but across the full distribution of real inputs it will meet in production. The teams that ship reliable agents in 2026 do not buy a dashboard and call it done. They build a measurement loop: a golden dataset drawn from real failures, graders they trust, a judge calibrated against human reviewers, and a CI gate that blocks regressions before they reach users.

What is at stake is the gap between a number that looks good and a number that means something. Most public eval content reports best-case success rates that quietly overstate reliability, and most LLM-as-judge setups are deployed without ever being checked against a human gold set. Both habits produce metrics that optimize the dashboard rather than the product. As agents take on longer, more autonomous tasks, those habits get more expensive to keep.

This guide walks the full pipeline in build order: the reliability math that reframes how you read a success rate, the shared vocabulary that keeps a team aligned, how to construct a golden dataset without waiting for hundreds of examples, the two grader classes and when to use each, how to calibrate an LLM judge you can actually trust, how to gate CI on eval scores, and the production feedback loop that turns live traces into your next test cases.

Key takeaways
  1. pass^k exposes the reliability pass@k hides. pass@k measures whether at least one of k attempts succeeds (best case); pass^k measures whether all k succeed (consistency). A 70%-per-trial agent reads as ~97% on pass@3 but ~34% on pass^3; the gap is the real story.
  2. Start with 20–50 tasks from real failures. Anthropic recommends not waiting for hundreds of curated examples. Early agents have large per-change effect sizes, so a small set sourced from actual production failures gives enough signal to iterate.
  3. Two grader classes, used deliberately. Code-based graders (string match, regex, static analysis, outcome verification) are fast, cheap, and deterministic. Model-based graders (rubric scoring, pairwise comparison, multi-judge consensus) are flexible but require calibration.
  4. An uncalibrated LLM judge is a liability. Judges carry position, verbosity, self-preference, format, and drift biases. Without a human-labeled gold set and a tracked agreement metric, judge scores can show perfect dashboards while diverging hard from expert review.
  5. The feedback loop is the moat. Tracing tells you what the agent did; evaluation tells you whether it was correct. The platforms and pipelines that win share one data layer between production traces and offline eval cases, so failing live scorers automatically grow the dataset.

01 · The Reliability Gap · Why pass^k is the number that matters.

Start with the framing shift that reorders everything else. There are two ways to summarize an agent that you run multiple times on the same task. pass@k is the probability that at least one of k attempts succeeds — a best-case view. pass^k is the probability that all k attempts succeed — a consistency view. They answer different questions, and for production reliability only the second one is honest.

Take an agent with a 70% per-trial success rate, run three times. Assuming independent trials, its pass@3 is 1 − (1 − 0.7)³ ≈ 97.3%, which read alone looks production ready. Its pass^3, the chance all three runs land, is 0.7³ = 34.3%. That is a single agent described two ways, and the 63-point gap is entirely an artifact of which metric you quoted. According to Philipp Schmid's explainer, the biggest production challenge is not peak performance but reliability, and pass@k is built to flatter peak performance.
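The arithmetic behind those two numbers can be sketched in a few lines. This is a minimal illustration that assumes every trial is independent and shares the same success probability p; the function names are ours, not from any library.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


# Print both views, and the gap in percentage points, for a few per-trial rates.
for p in (0.50, 0.70, 0.80, 0.90, 0.95):
    best_case = pass_at_k(p, 3)
    consistent = pass_hat_k(p, 3)
    gap_pp = (best_case - consistent) * 100
    print(f"p={p:.2f}  pass@3={best_case:.2%}  pass^3={consistent:.2%}  gap={gap_pp:.1f} pp")
```

Real trials are rarely perfectly independent, so treat the output as a reading aid for a measured per-trial rate rather than a prediction.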

"The biggest challenge for AI agents in production isn't their peak performance, but their reliability."— Philipp Schmid, Developer Relations, Google DeepMind

The table below is a reliability reality check. Find your agent's measured per-trial success rate, then read across to see what best-case (pass@3) and all-runs (pass^3) consistency actually look like, and how wide the gap between them runs. The headline figure for each rate is the pass^3 value, the consistency number: the reliability you can count on, not the one that looks good in a deck.

pass^3 (all-runs consistency) by per-trial success rate
Source: pass^3 = p³ over the per-trial rate p

Per-trial rate   pass@3      pass^3   Gap (pass@3 − pass^3)
50%              ≈ 87.5%     12.5%    ≈ 75 pp
70%              ≈ 97.3%     34.3%    ≈ 63 pp
80%              ≈ 99.2%     51.2%    ≈ 48 pp
90%              ≈ 99.9%     72.9%    ≈ 27 pp
95%              ≈ 99.99%    85.7%    ≈ 14 pp

How to read the gap
The headline figures are pass^3, the chance every run of three succeeds. Each entry also shows the matching pass@3 best-case number and the gap between them. At a 70% per-trial rate the two metrics differ by more than 60 points; the gap only closes as per-trial reliability climbs toward the high 90s. These pass^3 figures are derived directly from the formula (p cubed), so they hold for any agent at that per-trial rate, provided its trials are independent.

Research on long-horizon agents reinforces the point with real benchmarks rather than formulas. A reliability-science framework paper on arXiv reports pass@k versus pass^k gaps of up to roughly 25 percentage points across agentic benchmarks — evidence that a meaningful share of measured success comes from stochastic exploration across attempts rather than from deterministic capability. The practical takeaway: if your eval reports only pass@1 or pass@k, you do not yet know how reliable your agent is. Quote pass^k alongside it, and the conversation changes from "impressive" to "deployable." For deeper structural work on getting these numbers up, our guide to improving agent reliability picks up where the measurement ends.
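To quote both numbers from recorded trials rather than from an assumed per-trial rate, one option is a combinatorial estimate: given n recorded trials of a task with c successes, ask how likely a random draw of k of those trials is to contain at least one success (pass@k) or only successes (pass^k). The sketch below assumes that approach; the function names and the task IDs are hypothetical.

```python
from math import comb
from typing import Dict, List


def pass_at_k_estimate(outcomes: List[bool], k: int) -> float:
    """Chance that k trials drawn from the recorded ones include a success."""
    n, c = len(outcomes), sum(outcomes)
    if k < 1 or k > n:
        raise ValueError("k must be between 1 and the number of recorded trials")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k_estimate(outcomes: List[bool], k: int) -> float:
    """Chance that k trials drawn from the recorded ones are all successes."""
    n, c = len(outcomes), sum(outcomes)
    if k < 1 or k > n:
        raise ValueError("k must be between 1 and the number of recorded trials")
    return comb(c, k) / comb(n, k)


def suite_report(trials_by_task: Dict[str, List[bool]], k: int) -> Dict[str, float]:
    """Average both estimates over every task in the suite."""
    runs = list(trials_by_task.values())
    return {
        "pass@k": sum(pass_at_k_estimate(r, k) for r in runs) / len(runs),
        "pass^k": sum(pass_hat_k_estimate(r, k) for r in runs) / len(runs),
    }


# Hypothetical results: 10 trials per task.
report = suite_report(
    {
        "refund-flow": [True] * 7 + [False] * 3,
        "reschedule": [True] * 10,
    },
    k=3,
)
print(report)
```

Reporting the two averages side by side keeps the best-case and consistency views in the same table, which is the habit the section argues for.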

02 · Shared Vocabulary · Five terms a team has to agree on first.

Before any tooling, a team needs shared definitions — otherwise two engineers argue about "the eval" while meaning different things. Anthropic's engineering guidance defines five core terms that make the rest of the pipeline legible. Adopt them verbatim; the precision pays off the first time a grader disagreement turns out to be a vocabulary disagreement.

Task & Trial (the unit: 1 task · N trials)
A task is one test with defined inputs and explicit success criteria. A trial is a single attempt at that task. Running the same task across many trials is exactly what makes pass^k measurable.
Define success up front.

Grader (the scorer: code or model)
The grader is the scoring logic that turns a trial into a verdict. It can be deterministic code or a calibrated model. Everything downstream depends on the grader being trustworthy.
Trust is earned, not assumed.

Transcript & Outcome (the record: steps + final state)
The transcript is the full record of the agent's steps; the outcome is the final environmental state: for example, whether a booking actually exists in the database, not just whether the agent said it booked.
Verify the world, not the words.

That last distinction is the one teams skip and regret. Grading on the transcript alone — "did the agent claim it finished?" — is how you ship an agent that confidently reports success while the database stays empty. Outcome verification, checking the actual end state of the world, is what separates an eval that protects users from one that protects feelings.
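A minimal sketch of outcome verification, using the booking example from above: the grader queries the end state directly and treats the agent's claim only as a secondary signal. The table name, columns, and booking references are hypothetical, and an in-memory SQLite database stands in for the real system of record.

```python
import sqlite3


def grade_booking(conn: sqlite3.Connection, transcript: str, booking_ref: str) -> dict:
    """Pass only if a confirmed booking row exists, whatever the agent said."""
    row = conn.execute(
        "SELECT 1 FROM bookings WHERE ref = ? AND status = 'confirmed'",
        (booking_ref,),
    ).fetchone()
    exists = row is not None
    claimed = "booked" in transcript.lower()
    return {
        "passed": exists,
        "claimed_success": claimed,
        # The dangerous case: the agent reports success but the world disagrees.
        "false_claim": claimed and not exists,
    }


# Stand-in environment with one real booking.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (ref TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO bookings VALUES ('BK-1001', 'confirmed')")

real = grade_booking(conn, "Your table is booked.", "BK-1001")
phantom = grade_booking(conn, "Your table is booked.", "BK-2002")
print(real, phantom)
```

Tracking `false_claim` separately is a design choice: it isolates the failure mode where transcript-only grading would have reported a pass.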

03 · Golden Dataset · Start with real failures, not a wishlist.

The most common reason eval projects stall is waiting for the perfect dataset. You do not need one. Anthropic recommends starting with 20–50 tasks sourced from real failures rather than hundreds of curated examples, because early-stage agents have large effect sizes per change — a small, well-chosen set provides adequate signal to iterate. The bar for a good task is concrete: two domain experts should independently reach the same pass/fail verdict. If they would disagree, the task is ambiguous, and an ambiguous task produces an unreliable grader.

As the pipeline matures, two distinct dataset types serve two stages, and conflating them is a frequent mistake. CI datasets are purpose-built — eventually 100-plus examples covering core features and known regressions — and run on every commit. Production evaluation is different: it samples live traces asynchronously and leans on reference-free evaluators, because for most real traffic there is no ground-truth answer to compare against.

Iteration dataset (cold start): 20–50 tasks
Sourced from real failures, not curated wishlists. Large per-change effect sizes mean a small set gives enough signal to move fast in the early phase, per Anthropic's engineering guidance.
Don't wait for hundreds.

Labeled examples (judge calibration): 100+
The minimum a domain expert should label before you trust an LLM-as-judge. Below this, you cannot measure judge agreement against humans with any statistical confidence.
Source: Hamel Husain · LLM Evals FAQ

Human-labeled traces (production gold set): 200–500
A recommended production gold-set size for tracking judge calibration over time. Construction works best two-step: hand-build ~20 dimension tuples, then scale with LLM-generated variations converted to natural language.
Hybrid build beats pure-LLM.

On construction method: pure LLM generation reliably fails on complex domain-specific contexts, low-resource languages, high-stakes applications, and underrepresented groups. The durable approach is hybrid — hand-create roughly 20 dimension-combination tuples to anchor the distribution, then scale through LLM-generated tuples converted into natural language. Humans set the shape; the model fills the volume.
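The tuple step can be sketched as follows. Enumerating the dimension space makes it easier for a human to choose which combinations to hand-write as anchors, and the same tuples can later be turned into generation prompts for the scaling step. The dimensions, their values, and the prompt wording are all hypothetical placeholders.

```python
import random
from itertools import product
from typing import Dict, List

# Hypothetical dimensions for a support agent; replace with your own.
DIMENSIONS: Dict[str, List[str]] = {
    "intent": ["refund", "reschedule", "cancel"],
    "customer_tier": ["free", "paid"],
    "complication": ["none", "missing_order_id", "policy_exception"],
}


def all_tuples(dimensions: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Every combination of dimension values, as one dict per tuple."""
    keys = list(dimensions)
    return [dict(zip(keys, combo)) for combo in product(*dimensions.values())]


def candidate_anchors(dimensions: Dict[str, List[str]], n: int = 20, seed: int = 0) -> List[Dict[str, str]]:
    """A reproducible shortlist for a human to review and write up by hand."""
    pool = all_tuples(dimensions)
    return random.Random(seed).sample(pool, min(n, len(pool)))


def to_generation_prompt(t: Dict[str, str]) -> str:
    """Prompt text asking an LLM to turn one tuple into a natural-language message."""
    described = ", ".join(f"{key}={value}" for key, value in t.items())
    return f"Write one realistic user message for a support agent with: {described}."


anchors = candidate_anchors(DIMENSIONS, n=5)
print(to_generation_prompt(anchors[0]))
```

The shortlist is only a starting point: the human reviewer still decides which tuples matter and writes the anchor tasks, which is the part pure LLM generation tends to get wrong.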

04 · Graders · Code-based and model-based, used deliberately.

Two grader classes dominate production pipelines, and the engineering discipline is knowing which job each is for. Code-based graders — string matching, regex, binary tests, static analysis, and outcome verification — are fast, cheap, and deterministic. Reach for them first for anything checkable in code. Model-based graders — rubric scoring, natural-language assertions, pairwise comparison, and multi-judge consensus — are flexible enough for subjective quality, but they cost more per call and require calibration before you can trust them.
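As an illustration of the code-based class, here is a small deterministic grader that combines a regex check with a string match and returns a binary verdict with a reason. The confirmation-ID format and the refund scenario are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    reason: str


# Hypothetical format for a refund confirmation ID, e.g. RF-123456.
REFUND_ID = re.compile(r"\bRF-\d{6}\b")


def grade_refund_reply(reply: str, expected_amount: str) -> Verdict:
    """Deterministic check: the reply must cite a confirmation ID and the right amount."""
    if not REFUND_ID.search(reply):
        return Verdict(False, "no refund confirmation ID")
    if expected_amount not in reply:
        return Verdict(False, f"expected amount {expected_amount} not stated")
    return Verdict(True, "confirmation ID and amount present")


print(grade_refund_reply("Refund RF-123456 of $42.00 issued.", "$42.00"))
print(grade_refund_reply("All done, your refund is on its way!", "$42.00"))
```

Returning a reason alongside the verdict costs nothing and makes failed trials much faster to triage.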

There is also a choice about what you grade. Trajectory evaluation asks "did the agent take the right path?" and answers engineering questions; outcome evaluation asks "did the agent complete the task?" and answers business questions. Anthropic explicitly warns against over-grading step sequences, because agents regularly find valid approaches their designers never anticipated, and graders can be gamed when they reward the path instead of the result.

Code-based graders (default first)
String match, regex, binary tests, static analysis, and outcome verification against the real end state. Deterministic, near-zero marginal cost, and free of the biases that affect model judges. Use for everything a deterministic check can confirm.
Use whenever checkable.

Model-based graders (when code can't)
Rubric scoring, natural-language assertions, pairwise comparison, and multi-judge consensus for subjective quality code can't capture. Flexible but expensive, and unusable until calibrated against a human gold set.
Calibrate before trusting.

Pass/fail, not 1–5 (binary over scales)
Hamel Husain, who has worked with 50-plus companies and taught thousands of students on eval systems, recommends binary verdicts over Likert scales: they force clearer thinking, need smaller samples for significance, and stop annotators from dodging hard calls.
Prefer binary labels.

Outcome over trajectory (what you grade)
Grade whether the task completed (business question), not whether the agent took the path you imagined (engineering question). Over-grading step sequences punishes valid unexpected approaches and invites grader gaming.
Verify the outcome.

The binary-versus-scale point deserves weight because it is counterintuitive. A 1–5 quality scale feels more informative, but in practice it lets annotators cluster on a noncommittal 3 and requires far larger samples to reach statistical significance. A forced pass/fail call surfaces real disagreement, which is exactly the signal you want early. For the platform-level mechanics of running these graders against traces at scale, see our deeper walkthrough of connecting traces to evaluations.

05 · LLM-as-Judge · The five biases every judge carries.

An LLM-as-judge is the most abused component in the modern eval stack: deployed fast, calibrated never, and quietly producing numbers nobody has checked against a human. Building one you can trust is a five-step process — identify persistent failure modes through error analysis, have domain experts create 100-plus labeled examples, iteratively refine the judge prompt, measure true-positive and true-negative rates against a held-out test set, and deploy only once human alignment is demonstrably there.
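The measurement step can be sketched with plain Python: compare the judge's binary verdicts with the expert labels on the held-out set and report the true-positive and true-negative rates separately. The function name and the sample labels are illustrative.

```python
from typing import List, Tuple


def judge_rates(human: List[bool], judge: List[bool]) -> Tuple[float, float]:
    """Return (true-positive rate, true-negative rate) of the judge against human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("held-out set needs both passing and failing examples")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return true_pos / positives, true_neg / negatives


# Illustrative labels: True means "pass".
human_labels = [True, True, True, False, False]
judge_labels = [True, True, False, False, True]
tpr, tnr = judge_rates(human_labels, judge_labels)
print(f"TPR={tpr:.2f}  TNR={tnr:.2f}")
```

Reporting the two rates separately matters because a judge that passes almost everything can show high raw agreement on a mostly-passing dataset while missing nearly every real failure.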

The reason calibration is non-optional is that judges carry systematic, named biases. Research from Eugene Yan catalogs five. It is important to frame his hard numbers correctly: they come from 2024 research on earlier model generations (the GPT-3.5 era), so they read as a historical baseline rather than current frontier behavior — today's judges likely perform better. What has not changed is that the bias shapes persist, which is why structured calibration remains necessary regardless of model generation.

Read these numbers as history, not headline
The figures in the matrix below are drawn from 2024 research on GPT-3.5-era models. They are a baseline for the kinds of bias to test for, not a measurement of any 2026 frontier judge. Frontier judge accuracy has improved since; what persists is the need to verify each bias type against your own human-labeled gold set before trusting a judge in a quality gate.
Position bias (order favoritism): ~70%
In Eugene Yan's 2024 research, one model family favored the first response in roughly 70% of comparisons. Mitigation: randomize order or average over swapped positions. Cheap to apply, worth doing on every pairwise judge.
2024 baseline · GPT-3.5 era

Verbosity bias (longer-is-better): >90%
Judges in the same 2024 study preferred longer responses over 90% of the time even at equal quality. Mitigation: control for length in the rubric or normalize. Re-test on current models rather than assuming the rate holds.
2024 baseline · re-verify

Self-preference (family bias): +10–25%
Same-family judges over-reward their own lineage: historically a +10% to +25% win-rate inflation in the 2024 data. Mitigation is structural and still sound: use a judge from a different model family than the generator.
Cross-family judging
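The swapped-position mitigation for position bias can be sketched as a thin wrapper around any pairwise judge. The judge interface here (a callable that returns "first" or "second") is an assumption for illustration, not a real library API, and the two stub judges exist only to show the behavior.

```python
from typing import Callable

# Hypothetical interface: judge(prompt, first, second) returns "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_compare(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Ask twice with the order swapped; count a win only if both orderings agree."""
    a_wins_forward = judge(prompt, a, b) == "first"    # a shown first
    a_wins_backward = judge(prompt, b, a) == "second"  # a shown second
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # the verdict flipped with position, so it is not trusted


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Stub judge whose preference does not depend on position."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_compare(always_first, "q", "answer one", "answer two"))
print(debiased_compare(prefers_longer, "q", "a detailed answer", "short"))
```

The share of comparisons that come back as "tie" is itself a useful number: it is a direct estimate of how often the judge's verdict depends on order alone.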

Two more biases round out the set: format bias, where judges over-reward outputs that match a preferred structure regardless of substance, and calibration drift, where a judge that agreed with humans last quarter silently diverges as prompts, models, or data distributions shift. A commonly cited floor for judge calibration is a Cohen's kappa of at least 0.6 against human reviewers, the top of the conventional 0.41–0.60 "moderate agreement" band. A judge that falls below that band (kappa under 0.41) has not reached even moderate agreement, and you should not be gating on it.
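Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance: kappa = (observed − expected) / (1 − expected). A minimal sketch for binary pass/fail labels follows; the sample data is illustrative.

```python
from typing import List


def cohens_kappa(human: List[bool], judge: List[bool]) -> float:
    """Cohen's kappa for two raters giving binary labels to the same items."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    # Chance agreement: both say pass, or both say fail, independently.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)


# Illustrative gold set: 100 items, the judge agrees with the human on 80.
human_labels = [True] * 50 + [False] * 50
judge_labels = [True] * 40 + [False] * 10 + [False] * 40 + [True] * 10
kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa={kappa:.2f}")
```

Note how 80% raw agreement lands at a kappa of about 0.6 on a balanced set; on an imbalanced set the same raw agreement can score much lower, which is why the text tracks kappa rather than accuracy.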

How wrong can an uncalibrated judge be while looking right? In one documented practitioner case, a team that used a same-family model to judge its own outputs saw a dashboard showing essentially perfect metrics for three months — while the judge's actual agreement with domain-expert review sat at a Cohen's kappa around 0.31, well below the moderate-agreement floor. Treat that as an illustrative single case, not a universal statistic, but treat the pattern as real: a green dashboard is not evidence of a calibrated judge.

"If teams don't apply the scientific method, buying another evaluation tool won't save the product."— Eugene Yan, ML Engineer & Author

Calibration is not a one-time setup. Recalibrate on a regular cadence — monthly is a sensible default — and trigger an out-of-cycle recalibration whenever you change the rubric prompt, upgrade the model, suspect gold-set staleness, or see judge-versus-human divergence climb past roughly 20–25%. On cost: a judge that runs on every trace can quietly dominate spend, so a common guardrail is to keep judge cost under 10–15% of production LLM cost and to act — reduce sampling or downgrade the judge model — if it approaches 25%. Distilled judge models can cut that expense by an order of magnitude or more at scale.
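The cadence, event triggers, and cost guardrail above can be expressed as two small checks. The field names and return strings are hypothetical, the divergence limit uses the lower end of the 20–25% range, and treating "approaches 25%" as "at or above 25%" is our simplification.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class JudgeStatus:
    last_calibrated: date
    rubric_changed: bool
    model_upgraded: bool
    gold_set_stale: bool
    divergence: float       # share of sampled verdicts where judge and human disagree
    judge_cost: float       # judge spend for the period
    production_cost: float  # production LLM spend for the same period


def recalibration_reasons(s: JudgeStatus, today: date, cadence_days: int = 30,
                          divergence_limit: float = 0.20) -> List[str]:
    """List every reason to recalibrate now; an empty list means none apply."""
    reasons = []
    if today - s.last_calibrated >= timedelta(days=cadence_days):
        reasons.append("scheduled cadence reached")
    if s.rubric_changed:
        reasons.append("rubric prompt changed")
    if s.model_upgraded:
        reasons.append("model upgraded")
    if s.gold_set_stale:
        reasons.append("gold set may be stale")
    if s.divergence > divergence_limit:
        reasons.append("judge-versus-human divergence too high")
    return reasons


def cost_action(s: JudgeStatus, target: float = 0.15, act_at: float = 0.25) -> str:
    """Compare judge spend with production LLM spend against the guardrail."""
    if s.production_cost <= 0:
        raise ValueError("production_cost must be positive")
    share = s.judge_cost / s.production_cost
    if share >= act_at:
        return "act: reduce sampling or downgrade the judge model"
    if share > target:
        return "warn: judge cost above target share"
    return "ok"


status = JudgeStatus(date(2026, 5, 1), False, True, False, 0.10, 30.0, 100.0)
print(recalibration_reasons(status, date(2026, 5, 10)), cost_action(status))
```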

06 · CI Gating · Block the regression before it merges.

Evals only protect users if they run automatically and can fail a build. The canonical CI pattern is straightforward: a pull request opens, CI triggers an eval run, the agent runs against a dataset of (say) 50 examples, an evaluator scores each output, and the build passes only if the average score clears a defined threshold — commonly an average at or above 0.85 — otherwise the PR fails. Eval-platform tooling integrates with standard CI runners and test frameworks so this fits inside the workflow a team already uses.
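The gate itself is small. The sketch below assumes the eval job has already produced one score per example; it returns a process exit code so a CI step can fail the build. How scores reach the script is left open, and the 0.85 default is the starting point quoted above rather than a recommendation for any particular flow.

```python
from statistics import mean
from typing import Iterable

THRESHOLD = 0.85  # starting point from the text; tune per flow


def gate(scores: Iterable[float], threshold: float = THRESHOLD) -> int:
    """Return 0 if the average score clears the threshold, 1 otherwise."""
    scores = list(scores)
    if not scores:
        # An empty run is treated as a failure so a broken eval job cannot pass silently.
        print("no eval scores produced; failing the build")
        return 1
    average = mean(scores)
    print(f"eval average {average:.3f} over {len(scores)} examples (threshold {threshold})")
    return 0 if average >= threshold else 1


# In a CI step the return value would be passed to sys.exit(...).
exit_code = gate([0.9, 0.8, 0.9])
```

Failing on an empty score list is a deliberate choice: a gate that passes when the eval job crashes is worse than no gate.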

The best-implemented gates surface results where engineers already work. One widely used GitHub Action posts per-scorer improvements and regressions directly as PR comments and blocks merges below a defined quality threshold — so a reviewer sees "this change regressed the refund-flow scorer" inline, not buried in a log. That is the difference between an eval suite people respect and one they route around.
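A per-scorer report of this kind can be sketched independently of any particular CI product: compare scorer averages on the base branch with those on the PR and emit one line per scorer. The scorer names, the tolerance, and the output format are hypothetical, and posting the lines as a PR comment is left to whatever tooling the team uses.

```python
from typing import Dict, List


def scorer_diff(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.01) -> List[str]:
    """One summary line per scorer, flagging regressions and improvements."""
    lines = []
    for name in sorted(set(baseline) | set(candidate)):
        old, new = baseline.get(name), candidate.get(name)
        if old is None or new is None:
            lines.append(f"{name}: not comparable (missing on one side)")
            continue
        delta = new - old
        if delta < -tolerance:
            status = "regressed"
        elif delta > tolerance:
            status = "improved"
        else:
            status = "unchanged"
        lines.append(f"{name}: {old:.2f} -> {new:.2f} ({status})")
    return lines


report_lines = scorer_diff(
    {"refund_flow": 0.92, "tone": 0.80},
    {"refund_flow": 0.81, "tone": 0.86},
)
print("\n".join(report_lines))
```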

Step 1: PR opens → eval runs (CI trigger)
A pull request triggers the eval job on your CI runner. The agent or chain runs against a fixed CI dataset: purpose-built examples covering core features and known regressions, not live traffic.
Runs on every commit.

Step 2: Score against threshold (avg ≥ 0.85, common)
Each output is scored by your calibrated graders. The build passes only if the aggregate clears the agreed bar; a commonly cited threshold is an average score at or above 0.85. Tune the bar to your risk tolerance.
Pass/fail the build.

Step 3: Report in the PR (per-scorer diff)
Surface per-scorer improvements and regressions as PR comments so the regression is visible at review time. A gate that blocks merges below threshold turns evals from a report into a guardrail.
Visible where work happens.

A note on threshold discipline: 0.85 is a starting point, not a law. Set the bar from the cost of a failure in that specific flow — a refund-issuing agent earns a higher gate than a draft-summarizer. And resist the urge to celebrate a perfect pass rate. As Hamel Husain puts it, if you are passing 100% of evals you are probably not stress-testing hard enough; a gate that never fails is a gate that has stopped finding problems.

07 · Feedback Loop · One data layer for traces and evals.

The architectural decision that separates pipelines that compound from pipelines that stagnate is whether production traces and offline eval datasets share the same data layer. When they do, a failing online scorer can automatically promote the offending trace into your offline eval set — your test suite grows from real user behavior without a manual export step. When they do not, your eval dataset slowly drifts away from what users actually do, and your green CI stops meaning anything.
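The promotion mechanism can be sketched with two in-memory types. In a real system both would live in the shared data layer; the class names, fields, and the `needs_human_label` flag are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Trace:
    trace_id: str
    user_input: str
    agent_output: str
    online_scores: Dict[str, bool]  # scorer name -> passed


@dataclass
class EvalSet:
    cases: Dict[str, dict] = field(default_factory=dict)

    def promote(self, trace: Trace) -> bool:
        """Add a trace as an offline eval case if any online scorer failed it."""
        failed = sorted(name for name, ok in trace.online_scores.items() if not ok)
        if not failed or trace.trace_id in self.cases:
            return False  # nothing failed, or the trace was already promoted
        self.cases[trace.trace_id] = {
            "input": trace.user_input,
            "observed_output": trace.agent_output,
            "failed_scorers": failed,
            # The expected outcome still has to come from a person.
            "needs_human_label": True,
        }
        return True


eval_set = EvalSet()
failing = Trace("t-1", "Refund order 88", "Done!", {"refund_flow": False, "tone": True})
passing = Trace("t-2", "Hello", "Hi there", {"tone": True})
for trace in (failing, passing, failing):
    eval_set.promote(trace)
print(len(eval_set.cases))
```

Marking promoted cases as needing a human label keeps the loop honest: the failing scorer nominates the trace, but a person still defines what a pass looks like.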

This is also where tracing and evaluation get conflated, and the distinction is worth stating plainly. Tracing without evaluation tells you what the agent did; it does not tell you whether what the agent did was correct. Observability is necessary but not sufficient — the feedback loop is what turns a stream of traces into an ever-improving test set.

"We can't improve outcomes if we can't measure it."— Eugene Yan, ML Engineer & Author

Why this matters more every quarter: agent tasks are getting longer and more autonomous, which makes consistent evaluation harder. Anthropic's research on agent autonomy reports that the 99.9th percentile turn duration nearly doubled from under 25 minutes in October 2025 to over 45 minutes by January 2026, while only a small fraction of tool calls — under 1% — involve irreversible actions and a large majority still retain human involvement. Longer turns mean more places for an agent to go subtly wrong, and the only defense that scales is an eval set that keeps absorbing the new failure modes real usage surfaces. For a structured way to pressure-test those longer flows, our agentic workflow resilience audit pairs naturally with this measurement loop, and the pipeline health metrics guide covers what to watch once the loop is running.

The core loop, in one line
Failing online scorers should automatically promote their traces into offline eval cases. That single mechanism is what turns a static test suite into a living one — and it is the reason to care about the data-layer architecture before you care about which vendor logo is on the dashboard.

08 · Governance Angle · When calibration stops being optional.

There is a forward-looking reason to get judge calibration right that goes beyond product quality. The EU AI Act's obligations for high-risk AI systems phase in further from August 2026. For teams whose agents fall in scope, demonstrable evaluation rigor — documented graders, a tracked human-alignment metric, a recalibration cadence — is the kind of evidence that maps cleanly onto accountability and quality-management expectations. We frame this as preparation advice, not in-force compliance: nothing here is a statement that calibration is already a codified legal requirement, and in-scope providers should confirm their specific obligations against current regulatory guidance and counsel.

The practical move is to build the audit trail now because it is good engineering anyway. A pipeline that already records which graders ran, what the judge's agreement with humans was, and when it was last recalibrated is both a better product process and a far easier story to tell a reviewer later. Calibration discipline you adopt for reliability doubles as governance readiness — you do not have to choose. If a comparative tooling and methodology assessment would help, that is exactly the kind of work our AI transformation engagements scope first, alongside the agentic SEO programs where the same eval discipline keeps AI-generated work trustworthy.
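One lightweight way to start that audit trail is to write a structured record per eval run. The sketch below is an assumption about what such a record might hold, based on the three items named above (which graders ran, judge agreement with humans, last recalibration); the field names and values are hypothetical and do not reflect any regulatory schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class EvalAuditRecord:
    run_id: str
    graders: List[str]
    judge_model: str
    judge_kappa: float    # agreement with human reviewers at last calibration
    last_calibrated: str  # ISO date of the last recalibration
    gate_threshold: float
    gate_passed: bool


def to_log_line(record: EvalAuditRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = EvalAuditRecord(
    run_id="run-0001",
    graders=["refund_outcome_check", "tone_rubric_judge"],
    judge_model="example-judge-model",
    judge_kappa=0.64,
    last_calibrated="2026-05-15",
    gate_threshold=0.85,
    gate_passed=True,
)
print(to_log_line(record))
```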

Cohen's kappa baseline (acceptable agreement): ≥ 0.6
The commonly cited floor for an LLM judge to be considered calibrated against human reviewers. Below the 0.41–0.60 'moderate' band, do not gate on the judge. Track the metric; do not assume it.
Measured, not assumed.

Default review interval (recalibration cadence): 30 days
Monthly recalibration is a sensible default; trigger out-of-cycle when you change the rubric, upgrade the model, suspect gold-set staleness, or see divergence climb past ~20–25%.
Plus event triggers.

Judge cost ceiling: 10–15% of production LLM cost
A practical guardrail for judge spend, with an action trigger near 25%: reduce sampling rate or downgrade the judge model. Distilled judges can cut judge cost by an order of magnitude at scale.
Act before 25%.

09 · Conclusion · Evaluation is the core engineering activity.

The shape of agent evaluation, mid-2026

A pipeline that ships beats a dashboard that impresses.

The teams building reliable agents in 2026 treat evaluation not as a QA step but as the central engineering activity. Practitioner reports put evaluation at 60–80% of development time in successful AI product teams — most of it spent understanding failures rather than writing automated checks. That number sounds high until you internalize the alternative: shipping agents whose measured success is an artifact of which metric you quoted.

The methodology is a loop, not a checklist. Reframe success around pass^k so the number means something. Agree on a vocabulary. Build a golden dataset from real failures, starting small. Use code-based graders wherever a deterministic check can confirm the outcome, and calibrate any model-based judge against a human gold set before you ever gate on it. Wire that gate into CI so regressions fail the build. Then close the loop: let failing production traces flow back into the offline eval set so the whole system gets harder to fool over time.

The deeper signal is that the moat is process, not tooling. As Eugene Yan frames it, evals are practices that apply the scientific method — and no purchased dashboard substitutes for that discipline. Build the loop, keep the judge honest, and the question shifts from "does the demo work" to "is this agent reliable enough to put in front of users at the scale we actually serve." That is the only eval question that ever mattered.

FAQ · Agent evaluation pipelines

The questions we get every week.

What is the difference between pass@k and pass^k?

pass@k measures whether at least one of k attempts succeeds: a best-case view of capability. pass^k measures whether all k attempts succeed: a consistency view of reliability. They describe the same agent very differently. An agent with a 70% per-trial success rate reads as roughly 97% on pass@3 but only about 34.3% on pass^3, a gap of more than 60 points. For production reliability, pass^k is the honest number, because users experience the all-runs consistency, not the best of three. If your eval reports only pass@1 or pass@k, you do not yet know how reliable your agent actually is.