Tokens were always the wrong unit. The agency conversation in 2024 and most of 2025 obsessed over $/1M tokens — a metric that ignores whether a run produced a shippable artifact, and that mis-prices cheap workflows as wins and expensive ones as losses. The right unit is cost-per-successful-task against attributed revenue lift, and that is what this post measures across 50 production-instrumented workflows.
The dataset covers a six-month window through April 2026, spanning 15 marketing workflows, 20 dev workflows, and 15 hybrid workflows (lead enrichment, competitor analysis, client reporting). Every run was instrumented through OpenTelemetry traces, with per-run token spend reconciled monthly against provider invoices and revenue lift attributed through a 30-day forward window against a matched manual baseline run by a senior strategist.
The headline numbers will surprise even practitioners who track their spend carefully. Median per-run cost ranges from $0.07 (email triage) to $12.40 (long-context competitor analysis), with one outlier run at $87. Median ROI ranges from 11.4× (SEO audit) to 1.6× (client report). And prompt caching cuts cost-per-successful-task by 38–72% on workflows with repeated input — meaning the ROI table reorders once you turn caching on.
- 01 — $/successful-task is the right unit; $/token is a vanity metric. A $0.07 email triage that produces unusable output costs $0.07 and returns $0; a $4.20 SEO audit that drives a retainer expansion costs $4.20 and returns $48. Measure the cost of shippable artifacts against attributed revenue lift, not the cost of generated tokens.
- 02 — SEO audit (11.4× ROI) is the highest-leverage agentic workflow we measured. Median cost $4.20, median attributed revenue lift over 30 days $48 — driven almost entirely by retainer expansion when audits surface findings the client agrees to act on. Lead enrichment (8.9×) and PR backlink outreach (6.2×) round out the top three. Client reports sit at the bottom (1.6×) because they replace operational cost without unlocking new revenue.
- 03 — Prompt caching cuts cost-per-successful-task by 38–72% on repeat-input workflows. Competitor-analysis runs drop 72% with cached site context; multi-page SEO audits drop 58%; recurring client reports drop 47%; content briefs that share a style guide drop 38%. One-shot lead enrichment sees zero cache benefit because the input never repeats. Cache discipline is the highest-ROI infra investment for an agency stack.
- 04 — Use medium reasoning effort for 70% of workflows; high effort is rarely worth its 6–9× cost. High-effort runs cost 6–9× medium-effort runs and add 12–18% quality on hard tasks. On the easy 70% of workflows, the quality lift is statistically zero. Default to medium; reserve high effort for competitor analysis, multi-step refactors, and any task where a wrong answer costs more than the marginal token spend.
- 05 — The model mix matters more than any single model; route by workflow, not by preference. Opus 4.7 wins SEO audit and competitor analysis (the only model with reliable 1M context). GPT-5.5 wins ad-copy iteration (cheapest at scale). Gemini 3 wins multimodal client reports. DeepSeek V4 wins code refactor (open-weight cost floor). A single-model stack is leaving 30–50% of margin on the table.
01 — The Thesis
Tokens were the wrong unit all along.
Through 2024 and most of 2025 the agency-AI conversation was anchored to $/1M tokens. Vendors competed on it, blog posts ranked models by it, and procurement teams built spreadsheets around it. The metric is fine for capacity planning and useless for unit economics — because it measures the cost of a generated string without asking whether the string was useful.
The right unit is cost-per-successful-task: total spend divided by the number of shippable outputs the workflow produces. That number can be ten times higher than the per-token cost would suggest (when failure rates are high) or ten times lower (when caching is aggressive and the workflow is repeated). Both distortions matter, and neither shows up if you only track $/token.
The corollary is that ROI — attributed revenue lift divided by cost-per-successful-task — is the only honest comparison across workflows. A $0.07 email-triage run and a $4.20 SEO-audit run are not comparable on cost; they are comparable on ROI. The audit, at 11.4×, is dramatically more valuable than the triage at 2.1×, even though it costs sixty times more per run.
02 — Methodology
How we measured 50 workflows end-to-end.
The dataset is 50 production workflows running across agency engagements over a six-month window through April 2026. Workflows split 15 marketing, 20 dev, and 15 hybrid. Every run produced an OpenTelemetry trace including model, input tokens, output tokens, cached-read tokens, reasoning effort tier, latency, and a structured success label assigned by a senior reviewer within 48 hours of the run.
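For reference, the per-run instrumentation can be as small as one span per workflow run. Here is a minimal sketch using the standard OpenTelemetry Python SDK; the attribute names, the `call_fn` wrapper, and the shape of the result dict are our illustration, not a schema the post prescribes.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agency.workflows")

def run_instrumented(workflow: str, model: str, call_fn):
    """Wrap one workflow run in a span carrying the cost-relevant fields."""
    with tracer.start_as_current_span("workflow.run") as span:
        result = call_fn()  # the provider-specific model call (assumed to return these fields)
        span.set_attribute("workflow.name", workflow)
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.input_tokens", result["input_tokens"])
        span.set_attribute("llm.output_tokens", result["output_tokens"])
        span.set_attribute("llm.cached_read_tokens", result["cached_read_tokens"])
        span.set_attribute("llm.reasoning_effort", result["reasoning_effort"])
        # the success label is attached out of band, once a reviewer scores the run
        return result
```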
Cost was reconciled monthly against provider invoices — Anthropic, OpenAI, Google, and a small DeepSeek allocation — to catch the gap between rack-rate token math and actual billed amounts (cache credits, volume discounts, and committed-use rebates all change the number). Revenue lift was attributed through a 30-day forward window against a matched manual baseline run by a senior strategist on the same client account in the prior quarter, with 95% confidence intervals on every median.
The key methodology choice is the success label. A run that generated tokens but produced output the strategist would not ship counts as a failed run with full cost charged and zero revenue attributed. This is what makes cost-per-successful-task different from per-run cost: failures are amortized across the successes that pay for them.
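In code, the metric is a few lines. A minimal sketch, where the `Run` record and its fields are assumptions about how runs get labeled, not the post's actual pipeline:

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Run:
    cost_usd: float      # fully billed cost, charged whether or not the run succeeded
    success: bool        # senior reviewer's label: was the output shippable?
    revenue_usd: float   # 30-day attributed revenue lift; 0.0 for failed runs

def cost_per_successful_task(runs: list[Run]) -> float:
    """Total spend, failures included, divided by the number of shippable outputs."""
    return sum(r.cost_usd for r in runs) / sum(r.success for r in runs)

def workflow_roi(runs: list[Run]) -> float:
    """Median attributed revenue lift per success over cost-per-successful-task."""
    return median(r.revenue_usd for r in runs if r.success) / cost_per_successful_task(runs)
```

As an illustrative worked example: ten $4.20 audit runs with eight marked shippable yield $42.00 / 8 = $5.25 cost-per-successful-task, 25% above the naive per-run figure.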
03 — Cost Per Run
The median spend per workflow, ranked.
Median per-run cost spans more than two orders of magnitude — from seven cents for an email triage to over twelve dollars for a long-context competitor analysis, with one observed outlier at $87. The bar chart below shows the median for each workflow class after caching, model mix, and reasoning-effort tuning have been applied.
Chart: Median cost per workflow run · 50 instrumented agency workflows · Nov 2025 – Apr 2026 · 95% CI on medians

Two patterns jump out. First, the high-volume workflows (enrichment, triage, ad-copy iteration) sit at the bottom because they run thousands of times per month and any cost above $0.50 per run becomes economically untenable. Second, the high-cost outliers (competitor analysis, SEO audit) are exactly the workflows where humans previously spent five to ten hours per output — so even at $12 per run the cost-vs-labor swap is dramatic.
The $87 single-run outlier is instructive. It was a competitor analysis on a B2B portfolio with twelve competitor sites, run on Claude Opus 4.7 at full 1M context with extended thinking enabled. The per-run cost looks alarming until you compare it to the seven hours of senior strategist time the manual baseline took: at a $250/hour blended rate, that baseline costs $1,750, so an $87 spend that returns the same artifact in 40 minutes is roughly a 20× cost reduction.
04 — ROI Medians
ROI ranges from 11.4× to 1.6×.
ROI here is attributed revenue lift over a 30-day window divided by cost-per-successful-task. The denominator includes the cost of failed runs amortized over the successes; the numerator counts only revenue plausibly attributable to the workflow output (new retainer expansion, won ad spend, billable deliverable hours replaced). The table flips the order of the cost ranking — the cheapest workflows are not the most valuable.
Chart: Median ROI by workflow · attributed revenue ÷ cost-per-successful-task · 30-day forward attribution window · matched senior-strategist baseline · 95% CI

The pattern is clean: workflows that unlock new revenue (audits, enrichment, outreach) post the highest ROI. Workflows that replace operational cost without unlocking revenue (triage, reports) sit at the bottom. This is intuitive in retrospect, but it inverts the usual cost-first prioritization — agencies that build their AI roadmap "cheapest first" end up shipping the lowest-ROI workflows first.
05 — Prompt Caching
Caching cuts cost-per-successful-task by 38–72%.
Prompt caching is the single highest-ROI infrastructure investment for an agency AI stack. The savings on cost-per-successful-task — not raw $/token, the right unit — range from 0% on one-shot workflows where the input never repeats, to 72% on workflows that re-process a stable corpus across many runs. The four numbers below cover the workflows where caching matters most.
Competitor analysis — 72% reduction
Cost-per-successful-task reduction when the same competitor site context is re-used across multiple positioning queries. The cached site corpus drops to ~10% rack rate; only the differential question per run is uncached. Highest-leverage cache pattern in the dataset.
Multi-page SEO audit — 58% reduction
Site context, brand guidelines, and prior-audit history are cached once per audit; per-page analysis runs against the cached corpus. Audit throughput roughly doubles per dollar with caching enabled because per-page cost drops dramatically.
Recurring client reports — 47% reduction
Account history, prior-month commentary, and brand voice are stable across the monthly report cycle. Caching the stable layer cuts cost-per-report nearly in half. The 1.6× ROI on client reports rises to ~3.0× once caching is correctly configured.
Content briefs (shared style) — 38% reduction
Brand voice guidelines and writer-handoff templates are cached once; the per-brief topic input remains uncached. Modest absolute savings per brief but compounds across hundreds of briefs per month for a content-heavy agency.
Lead enrichment — no cache benefit
No cache benefit. Every lead is unique; the input never repeats. Lead enrichment economics are governed entirely by per-token cost and reasoning-effort tier — caching is a no-op here. Plan accordingly when sizing the workflow's run-rate budget.
The shape of these numbers maps directly to a workflow design rule: identify the stable layer of your prompt (corpus, brand voice, prior context) and the variable layer (the per-run question), then push the stable layer into the cache. Workflows that fit this pattern see double-digit percentage cost reductions; workflows that do not (one-shot enrichment, single-pass triage) see zero benefit and should be optimized through model selection and reasoning-effort tier instead.
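On Anthropic-style APIs the split is a single `cache_control` marker on the stable block. A minimal sketch using the `anthropic` Python SDK, where the model id is a placeholder for the Opus 4.7 tier discussed in this post:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def positioning_query(site_corpus: str, question: str):
    """Stable layer (the competitor site corpus) is cached; only the question is re-billed."""
    return client.messages.create(
        model="claude-opus-4.7",  # placeholder id, not a real API string
        max_tokens=2048,
        system=[{
            "type": "text",
            "text": site_corpus,                      # stable layer, identical across runs
            "cache_control": {"type": "ephemeral"},   # cached reads bill at a fraction of rack rate
        }],
        messages=[{"role": "user", "content": question}],  # variable layer, always uncached
    )
```

The response's `usage` block reports `cache_read_input_tokens` separately from `input_tokens`, which is what lets per-run traces separate cached spend from uncached spend.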
Note the second-order effect on the ROI table. Client reports at 1.6× ROI are the worst performer in the raw table — but with caching applied, cost-per-successful-task drops 47% and ROI rises to roughly 3.0×. The ROI ranking is sensitive to caching discipline; agencies that have not yet implemented caching are looking at a distorted picture of which workflows are working.
06 — Model Mix
A mixed stack beats any single model.
No single 2026 frontier model wins every workflow. The dataset shows clear winners by workflow class — driven by latency, context-window economics, multimodal capability, or open-weight cost floors. Agencies running a single-model stack are leaving 30–50% of margin on the table compared to a routed mix.
Claude Opus 4.7 — SEO audit + competitor analysis
$5/$25 per 1M · 1M context · 90% cached-read discount
Wins the two highest-ROI workflows in the dataset. SEO audits benefit from Opus 4.7's reliable long-context retrieval and structured-output discipline; for competitor analysis it is the only frontier model with reliable 1M context across multi-site corpora. Aggressive prompt-cache pricing keeps the per-run cost competitive even on repeat-context workflows.
GPT-5.5 — ad-copy iteration
$5/$30 per 1M · fastest TTFT in the dataset
Best for high-volume creative iteration where latency per variant matters more than quality lift. Ad-copy workflows generate 15+ variants across 3 platforms per run; GPT-5.5's fast time-to-first-token and consistent variant quality make it the default for creative-ops workflows running at hundreds of iterations per day.
Gemini 3 — multimodal client reports
Native image + text · long-context reasoning
Wins client-report workflows that include screenshots, dashboards, and chart inputs. Native multimodal handling means a single call ingests the dashboard PNG plus the prior-month narrative, where competitors require an OCR or vision-call hop that adds cost and latency. A strong fit for any reporting workflow with visual inputs.
DeepSeek V4-Pro — code refactor
Open weights · 3.1% sparsity · self-hostable
Wins per-file code-refactor workflows for dev agencies running at high token volume. The open-weight cost floor (self-hosted on 8×H100) is dramatically lower than any closed-API frontier model. Pairs naturally with a closed-API fallback for spike protection during peak refactor windows.
The routing logic is simple in principle, harder in practice: workflow class maps to model class, and the stack should route at the workflow boundary rather than at the user-prompt level. Agencies that build a routing layer once — even a simple workflow-to-model dictionary — capture most of the available margin. Agencies that route per-prompt or per-user end up with inconsistent quality and unnecessary infra complexity.
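A sketch of that minimal dictionary, folding in the reasoning-effort defaults from the takeaways above; every model identifier here is a placeholder for the models discussed in this post, not a real API string:

```python
# Workflow-boundary routing: workflow class -> (model, reasoning effort).
ROUTES: dict[str, tuple[str, str]] = {
    "seo_audit":           ("claude-opus-4.7", "high"),    # long context; wrong answers are costly
    "competitor_analysis": ("claude-opus-4.7", "high"),    # needs reliable 1M context
    "ad_copy_iteration":   ("gpt-5.5", "medium"),          # latency-bound, high volume
    "client_report":       ("gemini-3", "medium"),         # multimodal dashboard inputs
    "code_refactor":       ("deepseek-v4-pro", "medium"),  # self-hosted cost floor
}

DEFAULT = ("gpt-5.5", "medium")  # illustrative cheap, medium-effort default

def route(workflow: str) -> tuple[str, str]:
    """Route at the workflow boundary, never per prompt or per user."""
    return ROUTES.get(workflow, DEFAULT)
```

The point is the boundary, not the table contents: routing decisions live in one place, change in one commit, and show up in the traces as a single attribute.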
07 — Cost-Bloat
Four cost-bloat patterns to kill this quarter.
The dataset surfaces four recurring patterns that inflate cost-per-successful-task without lifting ROI. Each one is fixable inside a single sprint, and the combined savings on a typical mid-size agency stack run 40–60% of monthly AI spend.
High reasoning effort on low-value tasks
Default reasoning-effort tier set to high across the stack. High effort costs 6–9× medium effort and adds 12–18% quality on hard tasks — but on the easy 70% of agency workflows the quality lift is statistically zero. Fix: set the default to medium; reserve high effort for tasks where a wrong answer costs more than the marginal token spend.
Re-sending unchanged context every call
Same brand guidelines, account history, or competitor corpus re-sent on every run. Cache hit rate near zero on workflows where 80% of the input is stable. Fix: identify the stable prompt layer and push it into the provider's prompt cache. Expect 38–72% cost-per-successful-task reduction on the affected workflows.
1M context when 200K would suffice
Long-context workflows configured to send the full corpus when only the relevant subset is needed. 1M context costs roughly 5× 200K context on identical models; the marginal recall lift past 200K is small for most agency tasks. Fix: pre-filter to the relevant context window before the call; reserve 1M for genuinely cross-document reasoning.
No per-tool cost dashboards
Aggregate AI spend is tracked but not broken down by workflow, model, or tool call. Cost-bloat hides because no one can see which workflow drives 40% of monthly spend. Fix: instrument every workflow with OpenTelemetry traces and surface a per-tool cost dashboard. The visibility alone surfaces 20–30% of spend that nobody intentionally allocated.
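If the traces already carry the attributes from the methodology section, the dashboard is a single aggregation. A sketch using pandas purely for illustration, assuming one exported row per run with `workflow`, `cost_usd`, and `success` columns:

```python
import pandas as pd

def cost_dashboard(runs: pd.DataFrame) -> pd.DataFrame:
    """Per-workflow spend, share of total, and cost-per-successful-task."""
    by_wf = runs.groupby("workflow").agg(
        spend=("cost_usd", "sum"),
        runs=("cost_usd", "size"),
        successes=("success", "sum"),
    )
    by_wf["share_of_spend"] = by_wf["spend"] / by_wf["spend"].sum()
    by_wf["cost_per_success"] = by_wf["spend"] / by_wf["successes"]
    return by_wf.sort_values("share_of_spend", ascending=False)
```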
08 — Conclusion
What the data tells you to do this quarter.
Stop measuring tokens. Start measuring shippable artifacts.
The unit-economics question for agency AI work was settled by the data months ago: $/token is a vanity metric and $/successful-task is the only number that maps to margin. Six months and 50 instrumented workflows later, the practical playbook is small enough to fit on one page.
Build the roadmap by ROI rank, not cost rank — SEO audits, enrichment, and outreach first; reports and triage later. Implement prompt caching on every workflow with a stable input layer; expect 38–72% cost-per-successful-task reductions where it applies. Default to medium reasoning effort and reserve high effort for the genuinely hard tasks. Route by workflow to the right model — Opus for long-context, GPT-5.5 for ad-copy speed, Gemini for multimodal, DeepSeek for self-hosted code work. And instrument every call so the four cost-bloat patterns cannot hide.
None of this is exotic infrastructure. It is the disciplined unit-economics work that any agency with serious AI ambitions has to do once, properly, and then maintain. The agencies already three quarters into this work are running at 3–5× the margin of those still optimizing for $/token.