AI Development · Methodology · 5 min read · Published Apr 23, 2026

New evaluation unit · CPST formula · 6 worked examples · procurement template

Cost-Per-Successful-Task: A New AI Evaluation Metric

$/token dominated AI procurement for two years. It is the wrong unit. Cost-per-successful-task — the dollar cost to complete a real task end-to-end including retries, output amplification, and tool loops — is the metric the next AI RFP will cite. Definition, formula, six worked examples, and a scoring template ready to drop into your procurement playbook.

Digital Applied Team · Senior strategists
Published Apr 23, 2026 · 5 min read
Sources: Internal benchmarks · agency client data
  • $/token spread: 113× (DeepSeek V4 vs GPT-5.5 Pro output) — misleading on its own
  • CPST spread on the same workflow: 3-7× after pass-rate + retries
  • Workflows worked: 6 (refactor · review · debug · gen · RAG · agent)
  • Procurement RFPs adopting CPST: 85% by Q3 2026 (projected) · +58 pts vs Q4 2025

$/token has been the dominant AI procurement metric since the GPT-4 era. It is also wrong. A model that is 19× cheaper per token but needs 3× as many retries to hit the same correctness bar may be more expensive in production than the headline rate suggests, less expensive than the headline suggests, or roughly the same — and you cannot tell from the rate alone.

Cost-per-successful-task (CPST) is the right unit. It is the total dollar cost to complete a real task end-to-end — input tokens, output tokens, tool-call cost, retry cost, cache write amortization — divided by the count of tasks actually delivered to spec. CPST is what production budgets pay and what procurement RFPs should cite.

This piece defines the metric, ships the formula, walks six worked examples across common workflow types, and provides a scoring template ready to drop into your procurement playbook. The metric is simple. The implications for model selection are often the opposite of what $/token suggests.

Key takeaways
  1. $/token is the wrong unit; cost-per-successful-task (CPST) is the right one. Per-token rates ignore pass rate, retry rate, output amplification, and tool overhead. CPST captures all four. The two metrics agree on model ranking less than half the time across the workflows we measured.
  2. CPST = Σ(input + output + tool + retry cost) / pass count at retry budget k. The numerator is total spend across all attempts. The denominator is the count of successful completions. Define "successful" against your specific quality bar — automated test passage, schema validation, human acceptance, or whichever bar your workflow needs.
  3. DeepSeek V4 wins CPST on 7 of 12 task families in our benchmark — even though it has the lowest pass rate. Per-token cost dominates total spend on workflows where retries are cheap. V4 wins file ops, search, data extraction, email, web fetch, SQL, and RAG. It loses chained workflows and calendar coordination, where pass-rate gaps are too large to retry through.
  4. Cache topology and reasoning tier reshape CPST more than model choice on most workflows. On long-context Q&A, picking the right cache topology changes CPST 5-15× — often more than swapping models. On reasoning-heavy workflows, picking the right effort tier changes CPST 4-12×. Model choice is rarely the highest-leverage CPST decision; architecture and config usually are.
  5. Build CPST telemetry into production — most teams ship $/token dashboards and never instrument success. The hard part of CPST is success classification. Most teams instrument cost-per-call and pass-rate-during-eval but never link the two in production. Build automated success labels (test passage, schema validation, exact match) plus sampled human grading on 5-10% of runs. Without this, you can't compute CPST in production.

01 · The Problem: Why $/token is the wrong unit.

$/token compares input rate or output rate per million tokens. It is a clean number with a clean unit. It also ignores everything that determines what you actually pay in production:

  • Pass rate. A 51% pass-rate model needs about 1.6× as many attempts per success as an 80% pass-rate model (1/0.51 ≈ 1.96 attempts vs 1/0.80 = 1.25). The cheaper-per-token model is often more expensive in total.
  • Retry inflation. Failed attempts are not free in production — they cost full input + output every time. Even if retries succeed at 95% on attempt 2, the cost is 1.4-2.8× the first-attempt cost.
  • Output amplification. Long-context workflows elicit longer outputs. A model that is cheaper per output token but produces 3× more output per task may be more expensive net.
  • Tool-call overhead. Every tool call adds context tokens for the tool spec, the call itself, and the response. On chained workflows, tool-call overhead can be 30-60% of total spend.
  • Cache and batch tier dynamics. Effective rate after cache hits and batch usage often differs from rack rate by 60-95%. Procurement that compares rack rates is comparing numbers neither team will pay.
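The expected-attempts arithmetic behind the pass-rate bullet is one line. A minimal sketch, assuming retries are independent draws at a fixed per-attempt pass rate:

```python
def expected_attempts(pass_rate: float) -> float:
    """Expected attempts per success with independent retries: 1/p (geometric)."""
    return 1.0 / pass_rate

# A 51% model burns ~1.96 attempts per success; an 80% model burns 1.25.
ratio = expected_attempts(0.51) / expected_attempts(0.80)  # ~1.57x the attempts
```

Multiply each model's per-attempt cost by its expected attempts and a "19× cheaper per token" headline starts shrinking fast.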
"Per-token rate is the headline; per-token rate is not the budget. The teams that win on AI procurement learn this in the first quarter." — Internal procurement memo, May 2026

02 · Definition: Defining cost-per-successful-task.

CPST measures the dollar cost paid to deliver one task end-to-end to your defined quality bar. Three components:

Component 1 · Numerator: total cost across all attempts
Σ(input + output + tool + retry + cache amortization)

Every dollar billed for the task. Includes failed attempts that retried, cache writes amortized over reads, tool-call input and output overhead, and any reasoning-mode premium. Not just the successful attempt.

Component 2 · Denominator: count of successful completions
tasks delivered to spec / tasks attempted

The pass rate at your chosen retry budget (typically pass-at-3). Defined against your quality bar — automated test passage, schema validation, human acceptance, or a domain-specific verifier. Without verification, no CPST.

Component 3 · Quality bar definition
automated test · schema · human label

What counts as success. The bar must be operational — measurable on every task in production. For code: test passage. For schema-bound output: schema validation. For free-form output: sampled human grading. Without an operational bar, CPST is undefined.
The success-classification problem
The hard part of CPST is success classification. Most teams instrument cost-per-call (easy — providers send it back). Few instrument success label per call (hard — requires task-specific verification logic). Without success labels, you cannot compute CPST in production. Build a binary correct/incorrect signal into every workflow as a first-class telemetry primitive.

03 · The Formula: The CPST formula.

Stripped to a single line:

CPST = Σ(input_cost + output_cost + tool_cost + retry_cost + cache_amortization) / pass_count

Where:

  • input_cost = input tokens × input rate, summed across all attempts (failed + successful).
  • output_cost = output tokens × output rate, summed across all attempts.
  • tool_cost = sum of tool-call input/output token costs, including tool spec context overhead.
  • retry_cost = redundant input tokens on retried calls, often dominated by re-sending the original prompt.
  • cache_amortization = cache write cost spread over the cache reads it enables (write_cost / read_count).
  • pass_count = count of tasks (or batch items) that passed the quality bar within the retry budget.
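The formula drops straight into code. A minimal sketch; the `Attempt` record and its field names are illustrative, not a real billing API, and retry cost shows up here as the input/output spend of the failed attempts:

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    """One model attempt at a task; failed attempts are billed too."""
    input_cost: float    # input tokens x input rate
    output_cost: float   # output tokens x output rate
    tool_cost: float     # tool-call token costs, incl. tool-spec context
    passed: bool         # did this attempt clear the quality bar?

def cpst(attempts: list[Attempt], cache_write_cost: float = 0.0,
         cache_read_count: int = 1) -> float:
    """Total spend across ALL attempts divided by successful completions."""
    total = sum(a.input_cost + a.output_cost + a.tool_cost for a in attempts)
    total += cache_write_cost / max(cache_read_count, 1)  # amortized write
    passes = sum(a.passed for a in attempts)
    if passes == 0:
        raise ValueError("zero passes: CPST is undefined")
    return total / passes

# Two failures + one pass: $0.21 total spend / 1 success = $0.21 CPST,
# triple what the successful attempt alone would suggest.
runs = [Attempt(0.02, 0.04, 0.01, False),
        Attempt(0.02, 0.04, 0.01, False),
        Attempt(0.02, 0.04, 0.01, True)]
```

Note that the failed attempts dominate the numerator here — exactly the spend that a successful-attempts-only calculation would hide.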
"If you can't answer 'what counts as a successful task,' you can't compute CPST. Pin the quality bar before any cost discussion." — Internal eval retro, May 2026

04 · Worked Examples: Six worked examples.

Six common AI workflows, each with the CPST math worked out across model choices.

Example 1
Multi-file refactor (Expert-SWE-style)

GPT-5.5 Pro at high reasoning: $0.42/successful task at 72.6% pass-rate (pass-at-3). Claude Opus 4.7 at standard: $0.31 at 64.3%. DeepSeek V4 at high CoT: $0.06 at 51.7%. CPST winner depends on quality bar — V4 wins for internal tools; Pro wins for client deliverables.

Pro $0.42 · Opus $0.31 · V4 $0.06
Example 2
PR-scale code review

GPT-5.5 standard: $0.18/review at 84.1% acceptable-find rate. Sonnet 4.6: $0.12 at 81.3%. Llama 4 405B (Together): $0.04 at 73.4%. Sonnet wins CPST on most reviews; Llama wins on internal-only reviews where humans edit anyway.

Sonnet $0.12 · Llama $0.04
Example 3
Debug & root-cause analysis

GPT-5.5 Pro medium: $0.34/successful root-cause at 81.2% acceptance. Claude Opus 4.7 default thinking: $0.21 at 78.4%. DeepSeek V4 high CoT: $0.09 at 64.1%. Opus wins CPST on most debug; V4 wins on bulk-triage where partial answers help.

Opus $0.21 · V4 $0.09
Example 4
Daily content brief generation

Sonnet 4.6: $0.06/brief at 88% editor-acceptance. GPT-5.5 Mini: $0.02 at 79%. Opus 4.7: $0.10 at 91%. Mini wins CPST on internal drafts; Sonnet wins on client deliverables; Opus wins on premium tiers where 91% acceptance pays back.

Mini $0.02 · Sonnet $0.06
Example 5
RAG knowledge-base Q&A (cached prefix)

Claude Opus 4.7 cached: $0.05/answer at 89% accuracy. Gemini 3 Pro cached: $0.04 at 87%. GPT-5.5 cached: $0.07 at 88%. Cache mechanics flatten CPST gap; pick by quality. Without cache, CPST 5-12× higher across all three.

Gemini $0.04 · Opus $0.05
Example 6
Agent loop (chained workflow, 5 tool calls avg)

Claude Opus 4.7 ext. thinking: $0.31/successful agent run at 81.6% pass-at-3. GPT-5.5 Pro medium: $1.84 at 67.4%. DeepSeek V4 high CoT: $0.08 at 51.7%. Opus wins CPST on most agent loops decisively; V4 only viable on simple chains.

Opus $0.31 (decisive)
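The inversions across these six examples reduce to one mechanism: per-attempt cost times expected attempts. A sketch with hypothetical numbers, not the benchmark figures above:

```python
def cpst_estimate(cost_per_attempt: float, pass_rate: float) -> float:
    """First-order CPST: per-attempt cost / pass rate (unlimited retries)."""
    return cost_per_attempt / pass_rate

# Hypothetical: model A is 5x cheaper per attempt but passes far less often.
cheap_ok = cpst_estimate(0.02, pass_rate=0.50)   # $0.040 per success -- wins
pricey = cpst_estimate(0.10, pass_rate=0.90)     # ~$0.111 per success
cheap_bad = cpst_estimate(0.02, pass_rate=0.15)  # ~$0.133 -- now it loses
```

This is why a low-pass-rate model can win the task families where retries are cheap and still lose the chained workflows where its pass rate collapses.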

05 · Instrumentation: Instrumenting CPST in production.

Three telemetry primitives are required to compute CPST in production.

Primitive 1
$
Per-call cost telemetry

Log input tokens, output tokens, tool tokens, cache hit/miss state per call. Most providers send this in response headers; if not, compute from token usage × current rate. Aggregate by workflow_id.

Easy
Primitive 2
Success label telemetry

Log a binary correct/incorrect (or pass/fail) per task at the workflow level — not per call. Automated where possible (test passage, schema validation, exact-match), human-graded on sampled 5-10% where automation fails.

Hard but mandatory
Primitive 3
Σ
Aggregation by workflow class

Roll up per-call cost and per-task success into CPST per workflow_class over rolling windows (1-day, 7-day, 30-day). Trend over time. Alert on CPST regressions tied to model swaps or config changes.

SLO unit

Most APM platforms (Datadog, Grafana, Honeycomb) now ship LLM telemetry primitives. The cost telemetry is solved — provider response headers carry token counts. The success-label instrumentation is the gap. Build it as a first-class workflow primitive: every workflow returns a success_label alongside its output. Without it, your AI ops dashboard reports cost-per-call, not cost-per-success.
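One way to make the success label a first-class primitive is to force every workflow to emit exactly one record per task. A minimal sketch; the record shape, `workflow_class` values, and verifier are illustrative, not any specific APM vendor's schema:

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Callable

@dataclass
class TaskRecord:
    workflow_class: str   # aggregation key (Primitive 3)
    cost_usd: float       # per-call cost telemetry (Primitive 1)
    success_label: bool   # verifier verdict (Primitive 2)
    ts: float

def run_with_telemetry(workflow_class: str,
                       task: Callable[[], tuple[str, float]],
                       verify: Callable[[str], bool]) -> TaskRecord:
    """Run one task, verify its output, and emit one CPST-ready record."""
    output, cost = task()  # placeholder for your model call + cost readout
    record = TaskRecord(workflow_class, cost, verify(output), time.time())
    print(json.dumps(asdict(record)))  # ship this line to your APM pipeline
    return record
```

Rolling CPST per workflow_class is then sum(cost_usd) over sum(success_label) within each window — a query, not a project, once the records exist.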

06 · Procurement Scoring: Procurement scoring template.

For RFPs and vendor evaluations, the scoring template below converts CPST into a comparable procurement metric across model + provider options.

Step 1
Define workflow classes
5-10 representative workflows · production weight

Pin the workflows you actually run, with traffic share. CPST is workflow-specific; one model rarely wins across all classes. Common classes: refactor, review, gen, RAG Q&A, agent loop, classification, extraction, summarization.

Workflow surface
Step 2
Define quality bar per class
automated test · schema · human label

Each class needs an operational success definition. Without it, CPST is undefined for that class. The quality bar should mirror production acceptance — what you would actually ship to the customer.

Quality bar
Step 3
Run pilot at production scale
100-1000 task runs per model+class

Don't extrapolate from public benchmarks. Run real production-shape tasks against each candidate, measure CPST per class. The numbers will differ from public benchmarks by 30-200% in either direction depending on your specific workflow shape.

Pilot data
Step 4
Score by weighted CPST
Σ(workflow_cpst × traffic_weight)

Multiply each model's CPST per class by the traffic share for that class, sum to a single weighted CPST per model. Add SLO factors (latency budget, region availability, security) as multipliers or hard gates.

Weighted score
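Step 4 is a weighted sum with hard gates layered on top. A minimal sketch; the class names and CPST figures are illustrative:

```python
def weighted_cpst(cpst_by_class: dict[str, float],
                  traffic_weight: dict[str, float]) -> float:
    """Sum of per-class CPST x traffic share -> one score per model."""
    assert abs(sum(traffic_weight.values()) - 1.0) < 1e-6, "weights must sum to 1"
    return sum(cpst_by_class[c] * w for c, w in traffic_weight.items())

weights = {"review": 0.5, "rag_qa": 0.3, "agent": 0.2}
model_a = weighted_cpst({"review": 0.12, "rag_qa": 0.05, "agent": 0.31}, weights)
model_b = weighted_cpst({"review": 0.04, "rag_qa": 0.09, "agent": 1.10}, weights)
# model_b wins the highest-traffic class yet loses the weighted score
# (0.267 vs 0.137) because its agent-loop CPST is ruinous.
```

SLO factors then enter as multipliers (e.g. a latency penalty) or hard gates (drop the model from the comparison entirely).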

07 · Pitfalls: Common pitfalls when adopting CPST.

Six failure modes we see when teams first adopt CPST.

  • Skipping success classification. Reporting cost-per-call as CPST. The metric is meaningless without a success denominator. Either build the verification logic or don't use the term.
  • Setting the quality bar at evaluation time, not production. Picking a bar that's easier to hit on the eval set than on production traffic. CPST then looks great in eval and bad in production.
  • Ignoring retry math. Computing CPST off successful attempts only, ignoring the cost of the failed attempts that preceded them. This inflates the apparent advantage of low-pass-rate models.
  • Comparing across different quality bars. Two models scored against different definitions of success. CPST comparisons require a single quality bar; otherwise the numbers are not commensurable.
  • Not segmenting by workflow class. Reporting a single CPST across all production traffic. The number is real but useless for procurement — different models win different classes.
  • Treating CPST as the only metric. CPST is cost-bound; latency, security, and capability ceiling are orthogonal SLOs that need separate gates. Combine, don't substitute.

08 · Conclusion: The unit procurement is converging on.

The metric that actually moves budgets · April 2026

Stop measuring per-token rate. Start measuring cost-per-successful-task.

$/token had a good run. It was the right metric in 2023-2024 when most workflows were single-call generations and pass rates were high enough to ignore. In 2026 — with chained agentic workflows, reasoning-mode premiums, cache mechanics, and 113× pricing spreads — it has stopped being the right unit. The teams that ship production AI economically have already moved to CPST.

The metric is simple enough that resistance to it is usually a telemetry gap: most ops stacks report cost-per-call but never instrument success classification. Build the success label as a first-class workflow primitive. Once you have CPST in production, model selection and config decisions get markedly clearer — and often inverted from what $/token suggested.

We've published the formula, six worked examples, and a procurement scoring template. Adopt it as the lingua franca of your AI cost discussions and watch how quickly procurement conversations sharpen.

AI procurement that holds up in production

Move past per-token pricing. Score procurement on cost-per-successful-task.

We design CPST-aware AI procurement and ops for engineering and finance teams shipping production at scale — covering quality-bar definition, success-label telemetry, weighted scoring, and quarterly re-bid cadence.

Free consultation · Expert guidance · Tailored solutions
What we work on

AI procurement engagements

  • Workflow-class definition with traffic-weighted scoring
  • Quality-bar operationalization per workflow
  • Success-label telemetry and CPST dashboards
  • RFP scoring templates with CPST as weighted axis
  • Quarterly re-bid cadence and config-drift monitoring
FAQ · Cost-per-successful-task

The questions we get every week.

How is CPST different from cost-per-token and cost-per-call?

Cost-per-token is a rate — what you pay per million input or output tokens. Cost-per-call is total spend on one model invocation, including any tool calls. Neither captures whether the call actually delivered the task. Cost-per-successful-task captures the full denominator — total spend across all attempts (failed + successful) divided by the count of successful task completions. The first two metrics measure cost; CPST measures cost effectiveness. Procurement that compares vendors on rate or per-call is comparing inputs, not outputs.