Tokens were always the wrong unit. The agency conversation in 2024 and most of 2025 obsessed over $/1M tokens — a metric that ignores whether a run produced a shippable artifact, and that mis-prices cheap workflows as wins and expensive ones as losses. The right unit is cost-per-successful-task against attributed revenue lift, and that is what this post measures across 50 production-instrumented workflows.
The dataset covers a six-month window through April 2026, spanning 15 marketing workflows, 20 dev workflows, and 15 hybrid workflows (lead enrichment, competitor analysis, client reporting). Every run was instrumented through OpenTelemetry traces, with per-run token spend reconciled monthly against provider invoices and revenue lift attributed through a 30-day forward window against a matched manual baseline run by a senior strategist.
The headline numbers will surprise even practitioners who track their spend carefully. Median per-run cost ranges from $0.07 (email triage) to $12.40 (long-context competitor analysis), with one outlier run at $87. Median ROI ranges from 11.4× (SEO audit) to 1.6× (client report). And prompt caching cuts cost-per-successful-task by 38–72% on workflows with repeated input — meaning the ROI table reorders once you turn caching on.
- 01 — $/successful-task is the right unit; $/token is a vanity metric. A $0.07 email triage that produces unusable output costs $0.07 and returns $0; a $4.20 SEO audit that drives a retainer expansion costs $4.20 and returns $48. Measure the cost of shippable artifacts against attributed revenue lift, not the cost of generated tokens.
- 02 — SEO audit (11.4× ROI) is the highest-leverage agentic workflow we measured. Median cost $4.20, median attributed revenue lift over 30 days $48 — driven almost entirely by retainer expansion when audits surface findings the client agrees to act on. Lead enrichment (8.9×) and PR backlink outreach (6.2×) round out the top three. Client reports sit at the bottom (1.6×) because they replace operational cost without unlocking new revenue.
- 03 — Prompt caching cuts cost-per-successful-task by 38–72% on repeat-input workflows. Competitor-analysis runs drop 72% with cached site context; multi-page SEO audits drop 58%; recurring client reports drop 47%; content briefs that share a style guide drop 38%. One-shot lead enrichment sees zero cache benefit because the input never repeats. Cache discipline is the highest-ROI infra investment for an agency stack.
- 04 — Use medium reasoning effort for 70% of workflows; high effort is rarely worth its 6–9× cost. High-effort runs cost 6–9× medium-effort runs and add 12–18% quality on hard tasks. On the easy 70% of workflows, the quality lift is statistically zero. Default to medium; reserve high effort for competitor analysis, multi-step refactors, and any task where a wrong answer costs more than the marginal token spend.
- 05 — The model mix matters more than any single model; route by workflow, not by preference. Opus 4.7 wins SEO audit and competitor analysis (the only model with reliable 1M context). GPT-5.5 wins ad-copy iteration (cheapest at scale). Gemini 3 wins multimodal client reports. DeepSeek V4 wins code refactor (open-weight cost floor). A single-model stack is leaving 30–50% of margin on the table.
01 — The Thesis
Tokens were the wrong unit all along.
Through 2024 and most of 2025 the agency-AI conversation was anchored to $/1M tokens. Vendors competed on it, blog posts ranked models by it, and procurement teams built spreadsheets around it. The metric is fine for capacity planning and useless for unit economics — because it measures the cost of a generated string without asking whether the string was useful.
The right unit is cost-per-successful-task: total spend divided by the number of shippable outputs the workflow produces. That number can be ten times higher than the per-token cost would suggest (when failure rates are high) or ten times lower (when caching is aggressive and the workflow is repeated). Both distortions matter, and neither shows up if you only track $/token.
The corollary is that ROI — attributed revenue lift divided by cost-per-successful-task — is the only honest comparison across workflows. A $0.07 email-triage run and a $4.20 SEO-audit run are not comparable on cost; they are comparable on ROI. The audit, at 11.4×, is dramatically more valuable than the triage at 2.1×, even though it costs sixty times more per run.
02 — Methodology
How we measured 50 workflows end-to-end.
The dataset is 50 production workflows running across agency engagements over a six-month window through April 2026. Workflows split 15 marketing, 20 dev, and 15 hybrid. Every run produced an OpenTelemetry trace including model, input tokens, output tokens, cached-read tokens, reasoning effort tier, latency, and a structured success label assigned by a senior reviewer within 48 hours of the run.
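For reference, the per-run instrumentation can be as small as one span per workflow run. Here is a minimal sketch using the standard OpenTelemetry Python SDK; the attribute names, the `call_fn` wrapper, and the shape of the result dict are our illustration, not a schema the post prescribes.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agency.workflows")

def run_instrumented(workflow: str, model: str, call_fn):
    """Wrap one workflow run in a span carrying the cost-relevant fields."""
    with tracer.start_as_current_span("workflow.run") as span:
        result = call_fn()  # the provider-specific model call (assumed to return these fields)
        span.set_attribute("workflow.name", workflow)
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.input_tokens", result["input_tokens"])
        span.set_attribute("llm.output_tokens", result["output_tokens"])
        span.set_attribute("llm.cached_read_tokens", result["cached_read_tokens"])
        span.set_attribute("llm.reasoning_effort", result["reasoning_effort"])
        # the success label is attached out of band, once a reviewer scores the run
        return result
```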
Cost was reconciled monthly against provider invoices — Anthropic, OpenAI, Google, and a small DeepSeek allocation — to catch the gap between rack-rate token math and actual billed amounts (cache credits, volume discounts, and committed-use rebates all change the number). Revenue lift was attributed through a 30-day forward window against a matched manual baseline run by a senior strategist on the same client account in the prior quarter, with 95% confidence intervals on every median.
The key methodology choice is the success label. A run that generated tokens but produced output the strategist would not ship counts as a failed run with full cost charged and zero revenue attributed. This is what makes cost-per-successful-task different from per-run cost: failures are amortized across the successes that pay for them.
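In code, the metric is a few lines. A minimal sketch, where the `Run` record and its fields are assumptions about how runs get labeled, not the post's actual pipeline:

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Run:
    cost_usd: float      # fully billed cost, charged whether or not the run succeeded
    success: bool        # senior reviewer's label: was the output shippable?
    revenue_usd: float   # 30-day attributed revenue lift; 0.0 for failed runs

def cost_per_successful_task(runs: list[Run]) -> float:
    """Total spend, failures included, divided by the number of shippable outputs."""
    return sum(r.cost_usd for r in runs) / sum(r.success for r in runs)

def workflow_roi(runs: list[Run]) -> float:
    """Median attributed revenue lift per success over cost-per-successful-task."""
    return median(r.revenue_usd for r in runs if r.success) / cost_per_successful_task(runs)
```

As an illustrative worked example: ten $4.20 audit runs with eight marked shippable yield $42.00 / 8 = $5.25 cost-per-successful-task, 25% above the naive per-run figure.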
03 — Cost Per Run
The median spend per workflow, ranked.
Median per-run cost spans more than two orders of magnitude — from seven cents for an email triage to over twelve dollars for a long-context competitor analysis, with one observed outlier at $87. The bar chart below shows the median for each workflow class after caching, model mix, and reasoning-effort tuning have been applied.
Chart: Median cost per workflow run · 50 instrumented agency workflows · Nov 2025 – Apr 2026 · 95% CI on medians

Two patterns jump out. First, the high-volume workflows (enrichment, triage, ad-copy iteration) sit at the bottom because they run thousands of times per month and any cost above $0.50 per run becomes economically untenable. Second, the high-cost outliers (competitor analysis, SEO audit) are exactly the workflows where humans previously spent five to ten hours per output — so even at $12 per run the cost-vs-labor swap is dramatic.
The $87 single-run outlier is instructive. It was a competitor analysis on a B2B portfolio with twelve competitor sites, run on Claude Opus 4.7 at full 1M context with extended thinking enabled. The per-run cost looks alarming until you compare it to the seven hours of senior strategist time the manual baseline took: at a $250/hour blended rate, that baseline costs $1,750, so an $87 spend that returns the same artifact in 40 minutes is roughly a 20× cost reduction.
04 — ROI Medians
ROI ranges from 11.4× to 1.6×.
ROI here is attributed revenue lift over a 30-day window divided by cost-per-successful-task. The denominator includes the cost of failed runs amortized over the successes; the numerator counts only revenue plausibly attributable to the workflow output (new retainer expansion, won ad spend, billable deliverable hours replaced). The table flips the order of the cost ranking — the cheapest workflows are not the most valuable.
Chart: Median ROI by workflow · attributed revenue ÷ cost-per-successful-task · 30-day forward attribution window · matched senior-strategist baseline · 95% CI

The pattern is clean: workflows that unlock new revenue (audits, enrichment, outreach) post the highest ROI. Workflows that replace operational cost without unlocking revenue (triage, reports) sit at the bottom. This is intuitive in retrospect, but it inverts the usual cost-first prioritization — agencies that build their AI roadmap "cheapest first" end up shipping the lowest-ROI workflows first.
05 — Prompt Caching
Caching cuts cost-per-successful-task by 38–72%.
Prompt caching is the single highest-ROI infrastructure investment for an agency AI stack. The savings on cost-per-successful-task — not raw $/token, the right unit — range from 0% on one-shot workflows where the input never repeats, to 72% on workflows that re-process a stable corpus across many runs. The four numbers below cover the workflows where caching matters most.
Competitor analysis — 72% reduction
Cost-per-successful-task reduction when the same competitor site context is re-used across multiple positioning queries. The cached site corpus drops to ~10% rack rate; only the differential question per run is uncached. Highest-leverage cache pattern in the dataset.
Multi-page SEO audit — 58% reduction
Site context, brand guidelines, and prior-audit history are cached once per audit; per-page analysis runs against the cached corpus. Audit throughput roughly doubles per dollar with caching enabled because per-page cost drops dramatically.
Recurring client reports — 47% reduction
Account history, prior-month commentary, and brand voice are stable across the monthly report cycle. Caching the stable layer cuts cost-per-report nearly in half. The 1.6× ROI on client reports rises to ~3.0× once caching is correctly configured.
Content briefs (shared style) — 38% reduction
Brand voice guidelines and writer-handoff templates are cached once; the per-brief topic input remains uncached. Modest absolute savings per brief but compounds across hundreds of briefs per month for a content-heavy agency.
Lead enrichment — no cache benefit
No cache benefit. Every lead is unique; the input never repeats. Lead enrichment economics are governed entirely by per-token cost and reasoning-effort tier — caching is a no-op here. Plan accordingly when sizing the workflow's run-rate budget.
The shape of these numbers maps directly to a workflow design rule: identify the stable layer of your prompt (corpus, brand voice, prior context) and the variable layer (the per-run question), then push the stable layer into the cache. Workflows that fit this pattern see double-digit percentage cost reductions; workflows that do not (one-shot enrichment, single-pass triage) see zero benefit and should be optimized through model selection and reasoning-effort tier instead.
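On Anthropic-style APIs the split is a single `cache_control` marker on the stable block. A minimal sketch using the `anthropic` Python SDK, where the model id is a placeholder for the Opus 4.7 tier discussed in this post:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def positioning_query(site_corpus: str, question: str):
    """Stable layer (the competitor site corpus) is cached; only the question is re-billed."""
    return client.messages.create(
        model="claude-opus-4.7",  # placeholder id, not a real API string
        max_tokens=2048,
        system=[{
            "type": "text",
            "text": site_corpus,                      # stable layer, identical across runs
            "cache_control": {"type": "ephemeral"},   # cached reads bill at a fraction of rack rate
        }],
        messages=[{"role": "user", "content": question}],  # variable layer, always uncached
    )
```

The response's `usage` block reports `cache_read_input_tokens` separately from `input_tokens`, which is what lets per-run traces separate cached spend from uncached spend.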
Note the second-order effect on the ROI table. Client reports at 1.6× ROI are the worst performer in the raw table — but with caching applied, cost-per-successful-task drops 47% and ROI rises to roughly 3.0×. The ROI ranking is sensitive to caching discipline; agencies that have not yet implemented caching are looking at a distorted picture of which workflows are working.
06 — Model Mix
A mixed stack beats any single model.
No single 2026 frontier model wins every workflow. The dataset shows clear winners by workflow class — driven by latency, context-window economics, multimodal capability, or open-weight cost floors. Agencies running a single-model stack are leaving 30–50% of margin on the table compared to a routed mix.
Claude Opus 4.7 — SEO audit + competitor analysis
$5/$25 per 1M · 1M context · 90% cached-read discount
Wins the two highest-ROI workflows in the dataset. SEO audits benefit from Opus 4.7's reliable long-context retrieval and structured-output discipline; for competitor analysis it is the only frontier model with reliable 1M context across multi-site corpora. Aggressive prompt-cache pricing keeps the per-run cost competitive even on repeat-context workflows.
GPT-5.5 — ad-copy iteration
$5/$30 per 1M · fastest TTFT in the dataset
Best for high-volume creative iteration where latency per variant matters more than quality lift. Ad-copy workflows generate 15+ variants across 3 platforms per run; GPT-5.5's fast time-to-first-token and consistent variant quality make it the default for creative-ops workflows running at hundreds of iterations per day.
Gemini 3 — multimodal client reports
Native image + text · long-context reasoning
Wins client-report workflows that include screenshots, dashboards, and chart inputs. Native multimodal handling means a single call ingests the dashboard PNG plus the prior-month narrative, where competitors require an OCR or vision-call hop that adds cost and latency. A strong fit for any reporting workflow with visual inputs.
DeepSeek V4-Pro — code refactor
Open weights · 3.1% sparsity · self-hostable
Wins per-file code-refactor workflows for dev agencies running at high token volume. The open-weight cost floor (self-hosted on 8×H100) is dramatically lower than any closed-API frontier model. Pairs naturally with a closed-API fallback for spike protection during peak refactor windows.
The routing logic is simple in principle, harder in practice: workflow class maps to model class, and the stack should route at the workflow boundary rather than at the user-prompt level. Agencies that build a routing layer once — even a simple workflow-to-model dictionary — capture most of the available margin. Agencies that route per-prompt or per-user end up with inconsistent quality and unnecessary infra complexity.
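A sketch of that minimal dictionary, folding in the reasoning-effort defaults from the takeaways above; every model identifier here is a placeholder for the models discussed in this post, not a real API string:

```python
# Workflow-boundary routing: workflow class -> (model, reasoning effort).
ROUTES: dict[str, tuple[str, str]] = {
    "seo_audit":           ("claude-opus-4.7", "high"),    # long context; wrong answers are costly
    "competitor_analysis": ("claude-opus-4.7", "high"),    # needs reliable 1M context
    "ad_copy_iteration":   ("gpt-5.5", "medium"),          # latency-bound, high volume
    "client_report":       ("gemini-3", "medium"),         # multimodal dashboard inputs
    "code_refactor":       ("deepseek-v4-pro", "medium"),  # self-hosted cost floor
}

DEFAULT = ("gpt-5.5", "medium")  # illustrative cheap, medium-effort default

def route(workflow: str) -> tuple[str, str]:
    """Route at the workflow boundary, never per prompt or per user."""
    return ROUTES.get(workflow, DEFAULT)
```

The point is the boundary, not the table contents: routing decisions live in one place, change in one commit, and show up in the traces as a single attribute.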
07 — Cost-Bloat
Four cost-bloat patterns to kill this quarter.
The dataset surfaces four recurring patterns that inflate cost-per-successful-task without lifting ROI. Each one is fixable inside a single sprint, and the combined savings on a typical mid-size agency stack run 40–60% of monthly AI spend.
High reasoning effort on low-value tasks
Default reasoning-effort tier set to high across the stack. High effort costs 6–9× medium effort and adds 12–18% quality on hard tasks — but on the easy 70% of agency workflows the quality lift is statistically zero. Fix: set the default to medium; reserve high effort for tasks where a wrong answer costs more than the marginal token spend.
Re-sending unchanged context every call
Same brand guidelines, account history, or competitor corpus re-sent on every run. Cache hit rate near zero on workflows where 80% of the input is stable. Fix: identify the stable prompt layer and push it into the provider's prompt cache. Expect 38–72% cost-per-successful-task reduction on the affected workflows.
1M context when 200K would suffice
Long-context workflows configured to send the full corpus when only the relevant subset is needed. 1M context costs roughly 5× 200K context on identical models; the marginal recall lift past 200K is small for most agency tasks. Fix: pre-filter to the relevant context window before the call; reserve 1M for genuinely cross-document reasoning.
No per-tool cost dashboards
Aggregate AI spend is tracked but not broken down by workflow, model, or tool call. Cost-bloat hides because no one can see which workflow drives 40% of monthly spend. Fix: instrument every workflow with OpenTelemetry traces and surface a per-tool cost dashboard. The visibility alone surfaces 20–30% of spend that nobody intentionally allocated.
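If the traces already carry the attributes from the methodology section, the dashboard is a single aggregation. A sketch using pandas purely for illustration, assuming one exported row per run with `workflow`, `cost_usd`, and `success` columns:

```python
import pandas as pd

def cost_dashboard(runs: pd.DataFrame) -> pd.DataFrame:
    """Per-workflow spend, share of total, and cost-per-successful-task."""
    by_wf = runs.groupby("workflow").agg(
        spend=("cost_usd", "sum"),
        runs=("cost_usd", "size"),
        successes=("success", "sum"),
    )
    by_wf["share_of_spend"] = by_wf["spend"] / by_wf["spend"].sum()
    by_wf["cost_per_success"] = by_wf["spend"] / by_wf["successes"]
    return by_wf.sort_values("share_of_spend", ascending=False)
```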
08 — Conclusion
What the data tells you to do this quarter.
Stop measuring tokens. Start measuring shippable artifacts.
The unit-economics question for agency AI work was settled by the data months ago: $/token is a vanity metric and $/successful-task is the only number that maps to margin. Six months and 50 instrumented workflows later, the practical playbook is small enough to fit on one page.
Build the roadmap by ROI rank, not cost rank — SEO audits, enrichment, and outreach first; reports and triage later. Implement prompt caching on every workflow with a stable input layer; expect 38–72% cost-per-successful-task reductions where it applies. Default to medium reasoning effort and reserve high effort for the genuinely hard tasks. Route by workflow to the right model — Opus for long-context, GPT-5.5 for ad-copy speed, Gemini for multimodal, DeepSeek for self-hosted code work. And instrument every call so the four cost-bloat patterns cannot hide.
None of this is exotic infrastructure. It is the disciplined unit-economics work that any agency with serious AI ambitions has to do once, properly, and then maintain. The agencies already three quarters into this work are running at 3–5× the margin of those still optimizing for $/token.