$/token has been the dominant AI procurement metric since the GPT-4 era. It is also wrong. A model that is 19× cheaper per token but needs 3× as many retries to hit the same correctness bar may be more expensive in production than the headline rate suggests, less expensive than the headline suggests, or roughly the same — and you cannot tell from the rate alone.
Cost-per-successful-task (CPST) is the right unit. It is the total dollar cost to complete a real task end-to-end — input tokens, output tokens, tool-call cost, retry cost, cache write amortization — divided by the count of tasks actually delivered to spec. CPST is what production budgets pay and what procurement RFPs should cite.
This piece defines the metric, ships the formula, walks through six worked examples across common workflow types, and provides a scoring template ready to drop into your procurement playbook. The metric is simple. The implications for model selection are often the opposite of what $/token suggests.
- 01 — $/token is the wrong unit; cost-per-successful-task (CPST) is the right one. Per-token rates ignore pass rate, retry rate, output amplification, and tool overhead. CPST captures all four. The two metrics agree on model ranking less than half the time across the workflows we measured.
- 02 — CPST = Σ(input + output + tool + retry cost) / successes within the retry budget (pass-at-k). The numerator is total spend across all attempts. The denominator is the count of successful completions. Define 'successful' against your specific quality bar: automated test passage, schema validation, human acceptance, or whichever bar your workflow needs.
- 03 — DeepSeek V4 wins CPST on 7 of 12 task families in our benchmark, even though it has the lowest pass rate. Per-token cost dominates total spend on workflows where retries are cheap. V4 wins file ops, search, data extraction, email, web fetch, SQL, and RAG. It loses chained workflows and calendar coordination, where the pass-rate gaps are too large to retry through.
- 04 — Cache topology and reasoning tier reshape CPST more than model choice on most workflows. On long-context Q&A, picking the right cache topology changes CPST by 5-15×, often more than swapping models. On reasoning-heavy workflows, picking the right effort tier changes CPST by 4-12×. Model choice is rarely the highest-leverage CPST decision; architecture and config usually are.
- 05 — Build CPST telemetry into production; most teams ship $/token dashboards and never instrument success. The hard part of CPST is success classification. Most teams instrument cost-per-call and pass-rate-during-eval but never link the two in production. Build automated success labels (test passage, schema validation, exact-match) plus sampled human grading on 5-10% of runs. Without this, you can't compute CPST in production.
01 — The Problem
Why $/token is the wrong unit.
$/token compares input rate or output rate per million tokens. It is a clean number with a clean unit. It also ignores everything that determines what you actually pay in production:
- Pass rate. A 51% pass-rate model needs about two attempts per success on the same workflow where an 80% pass-rate model needs about 1.25 (see the sketch after this list). The cheaper-per-token model is often more expensive in total.
- Retry inflation. Failed attempts are not free in production — they cost full input + output every time. Even if retries succeed at 95% on attempt 2, the cost is 1.4-2.8× the first-attempt cost.
- Output amplification. Long-context workflows elicit longer outputs. A model that is cheaper per output token but produces 3× more output per task may be more expensive net.
- Tool-call overhead. Every tool call adds context tokens for the tool spec, the call itself, and the response. On chained workflows, tool-call overhead can be 30-60% of total spend.
- Cache and batch tier dynamics. Effective rate after cache hits and batch usage often differs from rack rate by 60-95%. Procurement that compares rack rates is comparing numbers neither side will actually pay.
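To make the retry math concrete, here is a minimal sketch under a deliberately simple model: independent attempts at a flat per-attempt cost, retried until success. The figures are hypothetical; real workflows have correlated failures and partial-context retries.

```python
# Naive retry math: flat per-attempt cost, independent attempts, retry until success.
# Hypothetical figures; real workflows correlate failures across attempts.

def expected_attempts(pass_rate: float) -> float:
    """Expected attempts per successful task for a memoryless retry loop."""
    return 1.0 / pass_rate

def naive_cpst(cost_per_attempt: float, pass_rate: float) -> float:
    """Cost per successful task when every failed attempt is billed in full."""
    return cost_per_attempt * expected_attempts(pass_rate)

print(naive_cpst(0.08, 0.80))  # ~$0.100 per success
print(naive_cpst(0.06, 0.51))  # ~$0.118 per success: cheaper per attempt, pricier per success
print(naive_cpst(0.02, 0.51))  # ~$0.039 per success: cheap enough to retry through failures
```

The ranking flips on the per-attempt price alone, which is exactly why the headline rate cannot settle the question.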
"Per-token rate is the headline; per-token rate is not the budget. The teams that win on AI procurement learn this in the first quarter."— Internal procurement memo, May 2026
02 — Definition
Defining cost-per-successful-task.
CPST measures the dollar cost paid to deliver one task end-to-end to your defined quality bar. Three components:
Numerator: total cost across all attempts
Σ(input + output + tool + retry + cache amortization). Every dollar billed for the task. Includes failed attempts that retried, cache writes amortized over reads, tool-call input and output overhead, and any reasoning-mode premium. Not just the successful attempt.
Denominator: count of successful completions
Tasks delivered to spec out of tasks attempted, i.e. the pass rate at your chosen retry budget (typically pass-at-3). Defined against your quality bar: automated test passage, schema validation, human acceptance, or a domain-specific verifier. Without verification, no CPST.
Quality bar: what counts as success
Automated test · schema · human label. The bar must be operational, measurable on every task in production. For code: test passage. For schema-bound output: schema validation. For free-form output: sampled human grading. Without an operational bar, CPST is undefined.
03 — The Formula
The CPST formula.
Stripped to a single line:
CPST = Σ(input_cost + output_cost + tool_cost + retry_cost + cache_amortization) / pass_count
Where:
- input_cost = input tokens × input rate, summed across all attempts (failed + successful).
- output_cost = output tokens × output rate, summed across all attempts.
- tool_cost = sum of tool-call input/output token costs, including tool-spec context overhead.
- retry_cost = redundant input tokens on retried calls, often dominated by re-sending the original prompt.
- cache_amortization = cache write cost spread over the cache reads it enables (write cost ÷ read_count).
- pass_count = count of attempts (or batches) that passed the quality bar within the retry budget.
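A minimal sketch of the formula in code, assuming one record per attempt and rates expressed in dollars per token. The field names are illustrative rather than any provider's schema, and retry_cost shows up implicitly as the tokens of failed attempts.

```python
# Sketch of CPST = total spend across all attempts / count of passing tasks.
# Rates are dollars per token; field names are illustrative, not a provider schema.
from dataclasses import dataclass

@dataclass
class Attempt:
    input_tokens: int              # prompt tokens, including context re-sent on retries
    output_tokens: int             # completion tokens
    tool_tokens: int = 0           # tool spec + call + response overhead
    cache_write_tokens: int = 0    # tokens written to the prompt cache
    cache_reads_expected: int = 1  # reads the cache write is amortized over
    passed: bool = False           # did this attempt clear the quality bar?

def cpst(attempts: list[Attempt], in_rate: float, out_rate: float,
         cache_write_rate: float) -> float:
    total = 0.0
    for a in attempts:
        total += a.input_tokens * in_rate    # input_cost, failed and successful attempts alike
        total += a.output_tokens * out_rate  # output_cost
        total += a.tool_tokens * in_rate     # tool_cost (billed at the input rate here)
        total += (a.cache_write_tokens * cache_write_rate) / max(a.cache_reads_expected, 1)
    passes = sum(a.passed for a in attempts)
    if passes == 0:
        raise ValueError("no attempt passed the quality bar: CPST is undefined")
    return total / passes
```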
"If you can't answer 'what counts as a successful task,' you can't compute CPST. Pin the quality bar before any cost discussion."— Internal eval retro, May 2026
04 — Worked Examples
Six worked examples.
Six common AI workflows, each with the CPST math worked out across model choices.
Multi-file refactor (Expert-SWE-style)
GPT-5.5 Pro at high reasoning: $0.42/successful task at 72.6% pass-rate (pass-at-3). Claude Opus 4.7 at standard: $0.31 at 64.3%. DeepSeek V4 at high CoT: $0.06 at 51.7%. CPST winner depends on quality bar — V4 wins for internal tools; Pro wins for client deliverables.
Pro $0.42 · Opus $0.31 · V4 $0.06
PR-scale code review
GPT-5.5 standard: $0.18/review at 84.1% acceptable-find rate. Sonnet 4.6: $0.12 at 81.3%. Llama 4 405B (Together): $0.04 at 73.4%. Sonnet wins CPST on most reviews; Llama wins on internal-only reviews where humans edit anyway.
Sonnet $0.12 · Llama $0.04
Debug & root-cause analysis
GPT-5.5 Pro medium: $0.34/successful root-cause at 81.2% acceptance. Claude Opus 4.7 default thinking: $0.21 at 78.4%. DeepSeek V4 high CoT: $0.09 at 64.1%. Opus wins CPST on most debug; V4 wins on bulk-triage where partial answers help.
Opus $0.21 · V4 $0.09
Daily content brief generation
Sonnet 4.6: $0.06/brief at 88% editor-acceptance. GPT-5.5 Mini: $0.02 at 79%. Opus 4.7: $0.10 at 91%. Mini wins CPST on internal drafts; Sonnet wins on client deliverables; Opus wins on premium tiers where 91% acceptance pays back.
Mini $0.02 · Sonnet $0.06
RAG knowledge-base Q&A (cached prefix)
Claude Opus 4.7 cached: $0.05/answer at 89% accuracy. Gemini 3 Pro cached: $0.04 at 87%. GPT-5.5 cached: $0.07 at 88%. Cache mechanics flatten the CPST gap; pick by quality. Without the cache, CPST is 5-12× higher across all three.
Gemini $0.04 · Opus $0.05
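Why the cache matters that much: the blended input rate is a weighted mix of the cached-read rate and the full rate. A rough sketch with illustrative numbers (the 10× cached-read discount and 90% hit rate are assumptions, not measurements from the three models above):

```python
# Blended input rate under a prompt cache, illustrative rates only.
def blended_input_rate(full_rate: float, cached_rate: float, hit_rate: float) -> float:
    """Expected per-token input rate given a cache hit rate."""
    return hit_rate * cached_rate + (1.0 - hit_rate) * full_rate

# A cached-read rate 10x cheaper than the full rate, hit 90% of the time:
print(blended_input_rate(1.0, 0.1, 0.9))   # 0.19, roughly 5x cheaper than the rack rate
```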
Agent loop (chained workflow, 5 tool calls avg)
Claude Opus 4.7 ext. thinking: $0.31/successful agent run at 81.6% pass-at-3. GPT-5.5 Pro medium: $1.84 at 67.4%. DeepSeek V4 high CoT: $0.08 at 51.7%. Opus wins CPST decisively on most agent loops; V4 is only viable on simple chains.
Opus $0.31 (decisive)
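The pass rates above are quoted at a three-attempt retry budget (pass-at-3). As a rough guide to how a single-attempt pass rate translates into pass-at-3 and attempts consumed, here is a sketch that assumes independent attempts, which overstates pass-at-3 for models that fail systematically on the same tasks:

```python
# Pass-at-k under the independence assumption (an optimistic simplification).

def pass_at_k(p1: float, k: int = 3) -> float:
    """Probability of at least one success within k attempts."""
    return 1.0 - (1.0 - p1) ** k

def expected_attempts_within_budget(p1: float, k: int = 3) -> float:
    """Expected attempts consumed per task under a k-attempt retry budget."""
    return sum((1.0 - p1) ** (i - 1) for i in range(1, k + 1))

# A 60% single-attempt model reaches ~93.6% pass-at-3 while consuming
# ~1.56 attempts per task, i.e. ~1.56x the single-attempt cost.
print(pass_at_k(0.60), expected_attempts_within_budget(0.60))
```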
05 — Instrumentation
Instrumenting CPST in production.
Three telemetry primitives are required to compute CPST in production.
Per-call cost telemetry (easy)
Log input tokens, output tokens, tool tokens, cache hit/miss state per call. Most providers send this in response headers; if not, compute from token usage × current rate. Aggregate by workflow_id.
Success label telemetry (hard but mandatory)
Log a binary correct/incorrect (or pass/fail) per task at the workflow level — not per call. Automated where possible (test passage, schema validation, exact-match), human-graded on sampled 5-10% where automation fails.
Aggregation by workflow class (the SLO unit)
Roll up per-call cost and per-task success into CPST per workflow_class over rolling windows (1-day, 7-day, 30-day). Trend over time. Alert on CPST regressions tied to model swaps or config changes.
Most APM platforms (Datadog, Grafana, Honeycomb) now ship LLM telemetry primitives. The cost telemetry is solved — provider response headers carry token counts. The success-label instrumentation is the gap. Build it as a first-class workflow primitive: every workflow returns a success_label alongside its output. Without it, your AI ops dashboard reports cost-per-call, not cost-per-success.
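A minimal roll-up sketch tying the three primitives together. The log shapes and field names (workflow_id, workflow_class, the per-task success label) are illustrative, not a specific APM vendor's schema:

```python
# Roll per-call cost and per-task success labels up into CPST per workflow class.
from collections import defaultdict

# One row per model/tool call: (workflow_id, workflow_class, cost_usd)
call_log = [
    ("wf-1", "rag_qa", 0.012), ("wf-1", "rag_qa", 0.015),   # retry on wf-1
    ("wf-2", "rag_qa", 0.011),
    ("wf-3", "agent_loop", 0.210),
]
# One row per completed task: (workflow_id, workflow_class, success_label)
task_log = [
    ("wf-1", "rag_qa", True),
    ("wf-2", "rag_qa", True),
    ("wf-3", "agent_loop", False),
]

def cpst_by_class(calls, tasks):
    cost, passes = defaultdict(float), defaultdict(int)
    for _, cls, usd in calls:
        cost[cls] += usd
    for _, cls, ok in tasks:
        passes[cls] += int(ok)
    # None flags classes with spend but no successes yet: CPST is undefined there.
    return {cls: (cost[cls] / passes[cls] if passes[cls] else None) for cls in cost}

print(cpst_by_class(call_log, task_log))   # ≈ {'rag_qa': 0.019, 'agent_loop': None}
```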
06 — Procurement Scoring
Procurement scoring template.
For RFPs and vendor evaluations, the scoring template below converts CPST into a comparable procurement metric across model + provider options.
Workflow surface: define workflow classes
5-10 representative workflows · production weight. Pin the workflows you actually run, with traffic share. CPST is workflow-specific; one model rarely wins across all classes. Common classes: refactor, review, generation, RAG Q&A, agent loop, classification, extraction, summarization.
Quality bar: define the quality bar per class
Automated test · schema · human label. Each class needs an operational success definition. Without it, CPST is undefined for that class. The quality bar should mirror production acceptance: what you would ship to the customer.
Pilot data: run a pilot at production scale
100-1,000 task runs per model + class. Don't extrapolate from public benchmarks. Run real production-shape tasks against each candidate and measure CPST per class. The numbers will differ from public benchmarks by 30-200% in either direction, depending on your specific workflow shape.
Weighted score: score by weighted CPST
Σ(workflow_cpst × traffic_weight). Multiply each model's CPST per class by the traffic share for that class, then sum to a single weighted CPST per model. Add SLO factors (latency budget, region availability, security) as multipliers or hard gates.
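A sketch of the weighted roll-up, with hypothetical per-class CPST numbers from a pilot and traffic weights that sum to one; hard SLO gates would drop a vendor before this step rather than adjust the score:

```python
# Weighted CPST = sum over workflow classes of (class CPST × traffic share).
# All figures hypothetical.

pilot_cpst = {            # $/successful task per class, measured in the pilot
    "review":     0.12,
    "rag_qa":     0.05,
    "agent_loop": 0.31,
}
traffic_weight = {        # share of production traffic, sums to 1.0
    "review":     0.5,
    "rag_qa":     0.3,
    "agent_loop": 0.2,
}

def weighted_cpst(cpst_by_class: dict, weights: dict) -> float:
    return sum(cpst_by_class[c] * weights[c] for c in weights)

print(round(weighted_cpst(pilot_cpst, traffic_weight), 3))   # 0.137, one number per vendor
```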
07 — Pitfalls
Common pitfalls when adopting CPST.
Six failure modes we see when teams first adopt CPST.
- Skipping success classification. Reporting cost-per-call as CPST. The metric is meaningless without a success denominator. Either build the verification logic or don't use the term.
- Setting the quality bar at evaluation time, not production. Picking a bar that's easier to hit on the eval set than on production traffic. CPST then looks great in eval and bad in production.
- Ignoring retry math. Computing CPST off successful attempts only, ignoring the cost of failed attempts that preceded them. This inflates the apparent advantage of low-pass-rate models.
- Comparing across different quality bars. Two models scored against different definitions of success. CPST comparisons require a single quality bar; otherwise the numbers are not commensurable.
- Not segmenting by workflow class. Reporting a single CPST across all production traffic. The number is real but useless for procurement — different models win different classes.
- Treating CPST as the only metric. CPST is cost-bound; latency, security, and capability ceiling are orthogonal SLOs that need separate gates. Combine, don't substitute.
08 — Conclusion
The unit procurement is converging on.
Stop measuring per-token rate. Start measuring cost-per-successful-task.
$/token had a good run. It was the right metric in 2023-2024 when most workflows were single-call generations and pass rates were high enough to ignore. In 2026 — with chained agentic workflows, reasoning-mode premiums, cache mechanics, and 113× pricing spreads — it has stopped being the right unit. The teams that ship production AI economically have already moved to CPST.
The metric is simple enough that resistance to it usually comes down to a telemetry gap: most ops stacks report cost-per-call but never instrument success classification. Build the success label as a first-class workflow primitive. Once you have CPST in production, model selection and config decisions get markedly clearer, and often inverted from what $/token suggested.
We've published the formula, six worked examples, and a procurement scoring template. Adopt it as the lingua franca of your AI cost discussions and watch how quickly procurement conversations sharpen.