AI Development · Cost Playbook · 14 min read · Published May 16, 2026

Per-task math · 3 workloads · 10 tools · cache-hit sensitivity inside

AI Coding Agent Costs: Per-Task Math for 10 Tools

Every published AI coding agent comparison shows $/Mtok. Nobody shows $/loop, the number teams actually budget against. This post builds the math across 10 tools, 3 reference workloads, and 4 cache-hit scenarios so you can estimate what each agent costs per task.

Digital Applied Team
Senior strategists
Sources: 12 primary sources

Tools modeled: 10 (Cursor · Claude · Codex · Copilot · Kiro · Grok · 4 more)
Per-task range: $0.03 → $5+ (Aider Haiku → Codex Pro 20-loop)
Cache multiplier: 0.1× (Anthropic cache-hit input)
Reference workloads: 3 (light · moderate · heavy)

AI coding agent pricing comparisons almost universally anchor on one number: dollars per million tokens. That framing is practically useless for engineering teams because the dominant cost driver isn't token rate — it's loop count, the number of plan-edit-verify iterations an agent runs before the task passes. This guide builds the missing math: per-task cost across 10 tools, modeled at light (1–3 loops), moderate (8–12 loops), and heavy (20–30 loops) workloads.

The stakes are real. Industry-reported figures suggest average Claude Code usage costs roughly $6 per developer per day, with 90% of users staying under $12, but those averages mask a 100x spread across tool choice and workload type. A heavy scaffolding task on Codex Pro can exceed $5 per run; the same task routed to Aider on Haiku 4.5 costs under $1. Teams that don't run this math before committing to subscriptions can overpay by a factor of three to ten.

What follows covers the pricing models for all 10 tools, the per-loop token math, Anthropic's 0.1× cache-hit multiplier and its dramatic effect on real costs, the subscription-vs-API breakeven for each tier, and a decision framework by team size and workload mix. All token assumptions are modeling choices — disclosed as such — not universal benchmarks.

Key takeaways
  1. $/Mtok hides the dominant cost driver. Loop count, not token rate, determines per-task cost. A moderate 10-loop refactor on Opus 4.7 can cost more than 10× as much as a 2-loop bug fix on the same model. Every published $/Mtok table obscures this.
  2. Per-task cost range is roughly 100x across the 10 tools. From ~$0.03 (Aider on Haiku 4.5, light task) to ~$5+ (Codex or Opus 4.7 on a 20-loop heavy refactor). The spread narrows dramatically once you add Anthropic's cache multiplier and account for loop-count efficiency differences.
  3. Anthropic's 0.1x cache-hit multiplier is the biggest lever most teams ignore. At a 60% cache-hit rate, a Claude Code session on Sonnet 4.6 pays roughly 46% of the uncached input cost. High-cache-hit workloads on Sonnet 4.6 can be cost-competitive with Cursor Composer 2.5 despite Sonnet's nominally higher token rate.
  4. Subscription vs API breakeven sits at 4–8M tokens/month for individuals. Claude Code Pro ($20/mo) breaks even against API-direct at roughly 4–6M input tokens/month (uncached). Add caching and the breakeven shifts toward API-direct. For teams, the crossover sits around 50–100M tokens/month per seat.
  5. Route by workload, not by brand. Smarter models complete heavy tasks in roughly 40% fewer loops (about 60% of the loop count) and can offset their per-token premium. For light tasks (1–3 loops), Cursor Composer 2.5 or Haiku 4.5 wins on cost. For heavy parallel-agent workloads, Opus 4.7 or GPT-5.3-Codex may deliver better cost-per-successful-task despite higher token rates.

01 · The Problem · Every comparison shows $/Mtok. That's the wrong unit.

Token rate tells you the cost of a single API call in isolation. Agentic coding tools don't make single API calls; they run planning loops. Each loop reads the relevant codebase context (input tokens), produces edits or shell commands (output tokens), observes the result, and decides whether to iterate. A model that completes the task in 4 loops can be cheaper than one that requires 12, as long as its token rate is less than 3× higher and per-loop token use is similar.

Three variables determine actual cost per task: (1) the token rate for input and output, (2) the number of loops the model runs before the task passes, and (3) the cache-hit rate, which for Anthropic models reduces the effective input rate to 0.1× on cached tokens. Published comparisons report only the first. This calculator models all three explicitly, anchored on three representative workloads.

Modeling disclosure
Token-per-loop figures used in this calculator (e.g., 5K input / 1.5K output for a light task, 20K / 4K for a moderate refactor) are modeling assumptions, not industry-standard benchmarks. Real consumption varies by codebase size, context strategy, and tool configuration. Calibrate against your own usage data before relying on these figures for budget planning.

02 · Tool Survey · The 10 tools: pricing models and access tiers.

The ten tools modeled in this calculator span three billing architectures: subscription-gated usage (Copilot, Kiro, Codex subscription tiers), token-metered subscriptions (Cursor, Claude Code API, Grok Build), and open-source with bring-your-own-key (Aider, Cline, Continue). Each architecture creates a different cost profile under heavy use.

Cursor Composer 2.5 ships at $0.50/M input and $2.50/M output, the lowest frontier-quality input rate in this survey. A Fast variant at $3.00/$15.00 per M tokens offers the same intelligence with lower latency, at 6× the cost. Claude Code can route to three models: Opus 4.7 ($5/$25), Sonnet 4.6 ($3/$15), or Haiku 4.5 ($1/$5 list; this calculator uses $0.80/$4 per M, the batch-mode figure). GPT-5.3-Codex via API sits at $1.75/$14 per M. Copilot Pro ($10/mo, 300 premium requests) and Pro+ ($39/mo, 1,500 premium requests) are fixed-subscription models where per-task cost depends entirely on utilization. Amazon Kiro Pro charges $20/mo for 1,000 credits, a $0.02/credit effective rate, with overage at $0.04/credit. xAI Grok Build launched at $1.00/$2.00 per M tokens via API plus a SuperHeavy subscription at $99/mo (intro, then $299).

Subscription-gated
Fixed monthly, request quota
Copilot Pro · Copilot Pro+ · Kiro Pro · Codex Plus

A flat monthly fee buys a defined number of premium requests or credits. Marginal per-task cost is effectively zero while you stay within the included quota and jumps once you exceed it. Best when utilization is predictable and moderate.

$10 – $39/mo
Token-metered
Pay per token, no ceiling
Cursor Composer 2.5 · Claude Code API · Grok Build · GPT-5.3-Codex API

Every loop costs money at the published input/output rate. Cache discounts (Anthropic only) reduce effective input costs dramatically at high hit rates. Best for variable or unpredictable usage where you want to pay only for what you run.

$0.50 – $5.00 per Mtok input
BYOK / open-source
$0 license, API spend
Aider · Cline · Continue.dev

Zero licensing cost — you pay only for the underlying API you wire in. Aider and Cline with Haiku 4.5 deliver the lowest possible per-task cost in this survey. Continue.dev Starter charges $3/Mtok blended for managed routing.

$0 + your API bill
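
A minimal sketch of how utilization changes the effective price of a quota-based plan, assuming cost is simply the monthly fee divided by the requests actually used. The function name and the `utilization` parameter are illustrative and not part of any vendor's billing API; the fees and quotas are the Copilot figures quoted above.

```python
def cost_per_request(monthly_fee: float, included_requests: int,
                     utilization: float = 1.0) -> float:
    """Effective USD per request when only `utilization` of the quota is used."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return monthly_fee / (included_requests * utilization)


# (monthly fee in USD, included premium requests) as quoted in this post.
plans = {"Copilot Pro": (10.0, 300), "Copilot Pro+": (39.0, 1500)}

for name, (fee, quota) in plans.items():
    for utilization in (1.0, 0.5, 0.25):
        rate = cost_per_request(fee, quota, utilization)
        print(f"{name} at {utilization:.0%} utilization: ${rate:.3f}/request")
```

Halving utilization doubles the effective per-request price, which is why a flat plan that looks cheap at full quota can be the expensive option for an occasional user.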

03 · The Math · What a loop actually costs.

A coding agent loop consists of: reading the task context and relevant code (input tokens), generating a plan, edits, or shell commands (output tokens), and optionally consuming the tool-call result or error as additional input on the next pass. Output tokens are priced 2–8× higher than input tokens across the token-metered tools modeled here (2× for Grok Build, 5× for Cursor and Claude, 8× for GPT-5.3-Codex), which means a verbose model that generates long explanations alongside its edits can cost far more than a terse model even at the same input rate.

For this calculator, each loop is modeled with a fixed token assumption per workload tier. Light tasks (bug fixes, small refactors): 5K input tokens + 1.5K output per loop. Moderate tasks (multi-file refactors): 20K input + 4K output per loop. Heavy tasks (project scaffolding or large refactors): 30K input + 6K output per loop. These are conservative estimates — real context windows can grow larger, particularly for Anthropic models that retain full conversation history across tool calls.
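
The per-loop assumptions above translate into a per-task figure with one multiplication per token type. The sketch below is a minimal, uncached version; the `Workload` class and `task_cost` function are illustrative names, and the token counts are this post's modeling assumptions rather than measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    loops: int
    input_tokens_per_loop: int
    output_tokens_per_loop: int


# Midpoint loop counts and per-loop tokens from this post's assumptions.
LIGHT = Workload(loops=2, input_tokens_per_loop=5_000, output_tokens_per_loop=1_500)
MODERATE = Workload(loops=10, input_tokens_per_loop=20_000, output_tokens_per_loop=4_000)
HEAVY = Workload(loops=25, input_tokens_per_loop=30_000, output_tokens_per_loop=6_000)


def task_cost(workload: Workload, input_rate: float, output_rate: float) -> float:
    """Uncached USD cost of one task; rates are USD per million tokens."""
    input_tokens = workload.loops * workload.input_tokens_per_loop
    output_tokens = workload.loops * workload.output_tokens_per_loop
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


print(round(task_cost(LIGHT, 5.00, 25.00), 3))     # Opus 4.7, light task -> 0.125
print(round(task_cost(MODERATE, 1.75, 14.00), 2))  # GPT-5.3-Codex, moderate task -> 0.91
```

Swapping in your own measured tokens per loop is the calibration step: the rates are published, but the `Workload` values are where real usage diverges from any reference model.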

The second multiplier is Anthropic's prompt caching. Cache writes cost 1.25× the base input rate but are a one-time charge per context block. Cache reads cost only 0.1× the base rate. A session with 60% cache-hit rate effectively pays 0.1× on 60% of its input and 1.0× on 40% — producing an effective input multiplier of approximately 0.46×. At 90% cache hits, the effective multiplier drops to 0.19×. This is the math that makes high-loop Sonnet 4.6 sessions surprisingly cost-competitive with Cursor Composer 2.5.
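
The blended multiplier described above can be checked in a few lines. This is a sketch under the same simplification the paragraph uses: it weights cache reads (0.1×) against uncached input (1.0×) and ignores the one-time 1.25× write premium. The function name is illustrative.

```python
def effective_input_multiplier(hit_rate: float, read_multiplier: float = 0.1) -> float:
    """Blended input-price multiplier for a given cache-hit rate (0.0 to 1.0).

    Simplification: the one-time cache-write premium is not modeled.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) * 1.0 + hit_rate * read_multiplier


for hit_rate in (0.0, 0.3, 0.6, 0.9):
    print(f"{hit_rate:.0%} hits -> {effective_input_multiplier(hit_rate):.2f}x")
# 0% hits -> 1.00x
# 30% hits -> 0.73x
# 60% hits -> 0.46x
# 90% hits -> 0.19x
```

The multiplier applies to input tokens only; output tokens are billed at the full rate regardless of hit rate.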

"Token costs scale with context size: the more context Claude processes, the more tokens you use. Claude Code automatically optimizes costs through prompt caching."— Anthropic, Claude Code costs documentation

04 · Reference Workloads · Three reference workloads: light, moderate, heavy.

Rather than model a single generic task, the calculator below uses three workloads that cover the realistic range of agentic coding use. Each is defined by loop count and per-loop token consumption — the two variables that drive total cost once you know the token rate.

Light task
Bug fix / small patch
1–3 loops

Locating a specific bug, writing a targeted fix, running tests. Modeled at 5K input + 1.5K output per loop. 2-loop midpoint = 10K input / 3K output total. Low retry risk: well-scoped tasks complete in 1–2 loops for capable models.

~$0.03 – $0.35 depending on tool
Moderate task
Multi-file refactor
8–12 loops

Refactoring a feature across 3–8 files, updating tests, resolving import chains. Modeled at 20K input + 4K output per loop. 10-loop midpoint = 200K input / 40K output total. Claude's cache is most valuable here — 60% hit rate cuts input cost by ~54%.

~$0.35 – $2.50 depending on tool
Heavy task
Project scaffold / large refactor
20–30 loops

Scaffolding a new service or performing a large architectural refactor. Modeled at 30K input + 6K output per loop. 25-loop midpoint = 750K input / 150K output total. At this scale, tool selection has a 100x cost impact. Parallel sub-agents (Grok Build) can reduce wall-clock time but don't change total token cost.

~$0.60 – $5+ depending on tool

05 · The Calculator · Per-task cost for 10 tools × 3 workloads.

The table below calculates total per-task cost for each tool at the midpoint loop count of each workload. Moderate-task figures for Claude Code models list a cached cost at a 60% cache-hit rate, the realistic operating point for most multi-session codebases. Heavy-task figures show both uncached and cached costs where applicable. Subscription tools show effective per-task cost at 100% utilization.

Reading the table: Light = 2 loops × (5K in + 1.5K out); Moderate = 10 loops × (20K in + 4K out) with 60% cache for Anthropic models; Heavy = 25 loops × (30K in + 6K out) with 70% cache for Anthropic models. All costs in USD.

Tool / Model | Input / Output ($/Mtok) | Light (1–3 loops) | Moderate (8–12 loops) | Heavy (20–30 loops) | Cache note
Cursor Composer 2.5 | $0.50 / $2.50 | $0.08 | $1.10 | $6.00 | No published cache discount
Cursor Composer 2.5 Fast | $3.00 / $15.00 | $0.51 | $6.60 | $36.00 | No published cache discount
Claude Code Opus 4.7 | $5.00 / $25.00 | $0.13 | $1.56 ($0.84 cached) | $5.06 ($2.38 cached) | 0.1× on cache reads; moderate = 60%, heavy = 70% assumed
Claude Code Sonnet 4.6 | $3.00 / $15.00 | $0.08 | $0.94 ($0.50 cached) | $3.04 ($1.43 cached) | 0.1× on cache reads; competitive with Cursor at 60%+ hits
Claude Code Haiku 4.5 | $0.80 / $4.00 | $0.02 | $0.25 ($0.14 cached) | $0.81 ($0.38 cached) | Lowest cost in this survey; quality tradeoff on heavy tasks
Codex (GPT-5.3-Codex API) | $1.75 / $14.00 | $0.06 | $0.91 | $5.25 | No published cache multiplier; high output rate hurts heavy tasks
GitHub Copilot Pro | $10/mo · 300 req | $0.03 (at 100% utilization) | $0.20 | $1.00 | Effective cost depends on utilization; sunk cost at low use
GitHub Copilot Pro+ | $39/mo · 1,500 req | $0.03 | $0.16 | $0.78 | Includes Opus 4.7 access; 5× more requests than Pro
Amazon Kiro Pro | $20/mo · 1K credits | $0.04 (~1–2 credits) | $0.40 | $2.00 | Overage at $0.04/credit; credit-to-token conversion not public
xAI Grok Build API | $1.00 / $2.00 | $0.04 | $0.28 | $1.80 | SuperHeavy $99/mo intro adds parallel sub-agent capacity
Aider / Continue on Haiku 4.5 | $0.80 / $4.00 BYOK | $0.02 | $0.25 | $0.81 | Lowest total cost; quality ceiling is Haiku 4.5 capability

Assumes: light = 2 loops × (5K in + 1.5K out); moderate = 10 loops × (20K in + 4K out); heavy = 25 loops × (30K in + 6K out). Anthropic cache-hit at 60% (moderate) and 70% (heavy). Subscription effective costs at 100% utilization. Prices sourced May 2026 — verify before budgeting.
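
For readers who want to re-derive cached figures from their own token totals, here is a minimal sketch that applies the cache discount to input tokens only and bills output at the full rate. The function name is illustrative, and the one-time cache-write premium is not modeled. Results depend on the token totals and on how the discount is applied, so they need not reproduce every cell above.

```python
def cached_task_cost(input_tokens: int, output_tokens: int,
                     input_rate: float, output_rate: float,
                     hit_rate: float, read_multiplier: float = 0.1) -> float:
    """USD cost of one task with the cache discount applied to input only.

    Rates are USD per million tokens. Assumption: cache writes are ignored.
    """
    input_multiplier = (1.0 - hit_rate) + hit_rate * read_multiplier
    input_cost = input_tokens * input_rate * input_multiplier
    output_cost = output_tokens * output_rate  # output is never discounted
    return (input_cost + output_cost) / 1_000_000


# Moderate midpoint from this post's assumptions: 200K input, 40K output, Sonnet 4.6 rates.
uncached = cached_task_cost(200_000, 40_000, 3.00, 15.00, hit_rate=0.0)
cached = cached_task_cost(200_000, 40_000, 3.00, 15.00, hit_rate=0.6)
print(round(uncached, 3), round(cached, 3))  # 1.2 0.876
print(f"total saving: {1 - cached / uncached:.0%}")
```

Because output tokens make up half of this task's bill, a 54% cut in input cost yields a smaller cut in the total. The more output-heavy the workload, the less caching moves the final number.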

The most striking number in the table: Cursor Composer 2.5 Fast at $36.00 per heavy task. The Fast tier's 6× input premium turns a reasonable moderate-task tool into an expensive option for heavy workloads. Unless latency is critical and the task is moderate, the standard Composer 2.5 tier is the correct choice for cost-sensitive teams. See our Cursor 3 deep dive for guidance on when Fast mode actually pays off.

On the other end, Claude Code Haiku 4.5 via Aider or Continue.dev delivers the lowest per-task cost in this survey — $0.02 light, $0.25 moderate, $0.81 heavy. The caveat is Haiku 4.5's quality ceiling: for tasks requiring complex multi-file reasoning or deep architectural understanding, Haiku may require 30–50% more loops than Sonnet or Opus, partially offsetting the token-rate advantage.

06 · The Cache Multiplier · Anthropic's 0.1× cache discount and what it changes.

Anthropic's prompt caching is documented but routinely ignored in cost comparisons. The mechanics: tokens stored in the cache on first use cost 1.25× the base input rate (a cache write premium). On subsequent reads, those same tokens cost only 0.1×, a 90% discount on input. Cache entries expire after five minutes by default, and the timer resets each time the cached content is read.

The effective input multiplier for a session depends on the cache-hit rate. At 0% hits (cold start every loop), you pay full rate. At 30% hits, your effective multiplier is (0.7 × 1.0) + (0.3 × 0.1) = 0.73×. At 60% hits it drops to 0.46×. At 90% hits, you effectively pay only 0.19× the base input rate. For a Sonnet 4.6 session at 90% cache-hit rate, the effective input cost falls to roughly $0.57/Mtok, within about 15% of Cursor Composer 2.5's uncached $0.50/Mtok; matching it would take a hit rate of about 93%.

Effective input rate by Claude model × cache-hit rate

Source: Anthropic prompt-caching docs, May 2026. Cache-hit multiplier = 0.1×. Effective rate = (1 − hit%) × base + hit% × 0.1 × base.

Opus 4.7 · 0% cache: $5.00/Mtok in (full input rate, cold start every loop)
Opus 4.7 · 30% cache: $3.65/Mtok in
Opus 4.7 · 60% cache: $2.30/Mtok in
Opus 4.7 · 90% cache: $0.95/Mtok in
Sonnet 4.6 · 90% cache: $0.57/Mtok in (just above Cursor Composer 2.5's uncached $0.50)
Haiku 4.5 · 90% cache: $0.15/Mtok in

The practical implication: a codebase where Claude Code has already processed the repo map (the dominant input cost in multi-loop sessions) will see 50–80% cache hits on input tokens within the same five-minute session. This is why industry-reported daily costs for Claude Code are lower than a naive token-rate calculation would suggest. The average $6/developer/day figure from independent research already reflects substantial caching — an uncached equivalent would be approximately $12–18/day. For a deeper breakdown, see our Claude Code feature deep dive.

07 · Subscription vs API · When to pay $20 vs go API-direct.

The subscription-vs-API question has a precise answer for each tool once you know your monthly token volume. For Claude Code Pro ($20/mo), the included usage covers a defined amount of Pro plan activity — roughly equivalent to $20–$25 of API-equivalent compute. The breakeven for API-direct sits at the point where your monthly API bill would exceed the subscription fee.

With caching at 60%, Sonnet 4.6 input tokens cost roughly 46% of the uncached rate. For a developer running 10 moderate-task sessions per week (10 × 4 weeks × $0.50 cached = $20/mo), Claude Code Pro is roughly at breakeven. Go above that frequency and Pro pays; go below and API-direct is cheaper. The credit overhaul announced for June 15 may shift these numbers; see our Anthropic credit overhaul post for the latest changes.
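
The breakeven arithmetic above generalizes to any fee and per-task cost. A minimal sketch, assuming a four-week month as in the paragraph; the function names are illustrative.

```python
def monthly_api_cost(tasks_per_week: float, cost_per_task: float,
                     weeks_per_month: int = 4) -> float:
    """API-direct spend per month for a steady weekly task count."""
    return tasks_per_week * weeks_per_month * cost_per_task


def breakeven_tasks_per_week(monthly_fee: float, cost_per_task: float,
                             weeks_per_month: int = 4) -> float:
    """Weekly task count at which API-direct spend equals the subscription fee."""
    return monthly_fee / (cost_per_task * weeks_per_month)


# $20/mo subscription against the $0.50 cached moderate-task figure used above.
print(monthly_api_cost(tasks_per_week=10, cost_per_task=0.50))             # 20.0
print(breakeven_tasks_per_week(monthly_fee=20.0, cost_per_task=0.50))      # 10.0
```

Plugging in your own measured cost per task is the important step; the threshold moves in inverse proportion to it.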

Solo dev · light use
API-direct on Haiku 4.5 or Cursor Composer 2.5

If you run fewer than 5 moderate tasks per week, a subscription is a sunk cost. Aider or Cline on Haiku 4.5 via API costs under $5/month at this frequency. Cursor's $20/mo Individual plan makes sense only if you use it daily.

Use API-direct
Solo dev · heavy use
Claude Code Pro ($20) or Cursor Individual ($20)

10+ moderate tasks per week pushes monthly API cost to $20–40 on Sonnet 4.6. Pro subscription becomes cost-neutral to advantageous. Claude Code Max 5× ($100) makes sense above ~50 moderate sessions/month.

Use subscription
Small team · mixed workloads
Copilot Pro+ ($39/seat) or Cursor Teams ($40/seat)

For teams with varied workload mix — some light completions, some agent tasks — a per-seat subscription normalizes cost. Copilot Pro+ includes Opus 4.7 access and 1,500 premium requests. Route heavy agent tasks to API-direct on Sonnet 4.6 with caching for cost control.

Hybrid: subscription + selective API
Enterprise · parallel agents
Claude Code Max 20× ($200) or Team Premium ($100/seat)

High-frequency parallel agent runs burn through Pro limits quickly. Max 20× at $200/mo or Team Premium at $100/seat cover 20× the Pro usage. At enterprise scale, the subscription vs API breakeven shifts toward API-direct with a negotiated volume discount — worth modeling at your actual monthly token volume.

Model your volume first

One factor that flips the equation: Grok Build's SuperHeavy subscription ($99/mo intro) includes up to 8 parallel sub-agents. For teams that need concurrent agent execution (running tests, writing docs, and refactoring simultaneously), parallel runs do not reduce token consumption per task, but they let a flat-fee seat finish more tasks per month, so the effective per-task cost can be 3–5× lower than sequential agent runs on the same subscription. See our Grok Build parallel agents guide for a full breakdown of the concurrency model.

08 · Cost vs Quality · What you actually pay for at each tier.

The lowest-cost option in this survey, Aider on Haiku 4.5 at ~$0.02 per light task, is not the best value for every team. Quality at the frontier matters because smarter models complete tasks in fewer loops. A model that closes a moderate refactor in 6 loops instead of 10 uses 40% fewer tokens, which offsets a per-token rate up to about 1.67× higher. The per-task cost calculator above assumes fixed loop counts; real-world loop counts vary by model capability.

Rough estimates suggest frontier models (Opus 4.7, GPT-5.3-Codex) complete complex multi-file tasks in roughly 60% of the loops required by smaller models. Applied to the heavy workload: Opus 4.7 at 15 loops instead of 25 scales the table's $2.38 cached figure to approximately $1.43, against $0.81 uncached for Haiku 4.5 at the full 25 loops. The premium narrows from roughly 3× to under 2× before counting retries and developer time. For teams with a high proportion of complex architectural tasks, routing to Opus 4.7 with caching may be the highest-value combination, not the highest-cost one. Our AI analytics engagements routinely surface this pattern in client token-spend audits.

This framing — cost per successful task, not cost per token — is the right evaluation metric for agentic tools. A cheaper model that requires manual intervention to fix its output three times costs more in developer time than a frontier model that gets it right in one pass.
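
A minimal sketch of the cost-per-successful-task framing, assuming token cost scales linearly with loop count and that each manual intervention costs a fixed amount of developer time. All parameter names and the example numbers are hypothetical placeholders, not measurements.

```python
def cost_per_successful_task(cost_per_loop: float, loops: int,
                             interventions: int = 0,
                             minutes_per_intervention: float = 0.0,
                             developer_hourly_rate: float = 0.0) -> float:
    """Token spend plus the developer time spent fixing the agent's output."""
    token_cost = cost_per_loop * loops
    human_cost = interventions * (minutes_per_intervention / 60.0) * developer_hourly_rate
    return token_cost + human_cost


def max_rate_premium(baseline_loops: int, efficient_loops: int) -> float:
    """Largest per-loop price ratio the more efficient model can carry
    while still matching the baseline on token cost alone."""
    return baseline_loops / efficient_loops


print(round(max_rate_premium(25, 15), 2))  # 1.67

# Hypothetical comparison: a cheap model needing three 10-minute fixes
# versus a pricier model that passes unaided, at an assumed $90/hour.
cheap = cost_per_successful_task(0.03, 25, interventions=3,
                                 minutes_per_intervention=10,
                                 developer_hourly_rate=90)
frontier = cost_per_successful_task(0.10, 15)
print(round(cheap, 2), round(frontier, 2))
```

Once developer time enters the sum it tends to dominate token spend, which is why the intervention count is worth tracking alongside loops.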

Heavy-task cost when loop efficiency is factored in

Cached Claude figures at 70% hit rate, scaled linearly from the 25-loop table values. Loop-count reduction for frontier models is an estimate, not a benchmark. Source: Anthropic pricing docs + modeling assumptions.

Haiku 4.5 · 25 loops (uncached): $0.81 heavy. Lower capability: more loops, lower per-token cost.
Sonnet 4.6 · 18 loops (70% cache): ~$1.03 heavy. Balanced capability: roughly 30% loop reduction on complex tasks.
Opus 4.7 · 15 loops (70% cache): ~$1.43 heavy. Frontier capability: 40% loop reduction on hard tasks.
Cursor Composer 2.5 · 25 loops: $6.00 heavy. No cache discount, competitive rate; fixed loop count assumed.
GPT-5.3-Codex · 20 loops (frontier quality): $4.20 heavy. High output rate dominates at heavy workload scale.

09 · Recommendations · Strategic choices by team size and workload mix.

The calculator shows that no single tool wins across all workload types. The optimal configuration depends on three factors: your workload mix (what fraction of tasks are light vs heavy), your loop-count efficiency expectations (how often does the model complete tasks without retries), and your sensitivity to latency vs cost (Fast mode costs 6× more but runs faster). The grid below summarizes the recommended configuration for each team archetype. For Kiro-specific credit modeling, see our Kiro migration playbook.

Solo dev · cost-sensitive
Aider or Continue on Haiku 4.5
Haiku4.5

Sub-$1 for 90% of everyday coding tasks. Route only truly complex architectural tasks to Sonnet or Opus — the incremental quality justifies the cost only when task failure is expensive. Start with Haiku; upgrade when you hit its quality ceiling.

≤$0.81/heavy task
Solo dev · quality-focused
Cursor Composer 2.5 (standard tier)
Cursor2.5

The lowest frontier-quality input rate in this survey at $0.50/Mtok. Best cost-quality ratio for developers who want a managed IDE experience. Avoid Fast mode unless latency is critical — 6× input premium is rarely worth it for standard workflows.

$0.08–$6.00/task
Small team · mixed workloads
Copilot Pro+ + Sonnet 4.6 API for heavy tasks
Pro+hybrid

Pro+ at $39/seat covers the majority of moderate agent tasks within the 1,500 premium request quota. For heavy architectural work, route to Claude Code on Sonnet 4.6 with caching — $1.43 per cached heavy task beats running Pro+ overages. Our AI transformation team can model the breakeven for your specific usage pattern.

$39/seat/mo base
Enterprise · heavy parallel agents
Claude Code Max 20× + Opus 4.7 routing
Opus4.7

At enterprise scale, Opus 4.7's higher loop efficiency closes the cost gap with cheaper models. Max 20× at $200/seat/mo covers 20× Pro usage. Supplement with negotiated API volume where subscriptions cap out. Grok Build's parallel sub-agent architecture is worth evaluating for pipeline workloads that benefit from concurrency.

$200/seat/mo ceiling

The critical habit for any team running AI coding agents at scale: track per-task cost, not per-session spend. A session that runs one heavy task costs the same as ten light tasks — but the business value is categorically different. Integrating token spend into your engineering analytics dashboard is the foundation for optimizing the model-routing decisions above. Our AI transformation service includes cost observability as a first-class deliverable on every agent deployment engagement.
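
One way to start tracking per-task cost is to tag every agent call with a task identifier and aggregate afterwards. The sketch below assumes a hypothetical usage log where each record carries `task_id`, `input_tokens`, and `output_tokens`; real tools expose usage data in different shapes, so the field names are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def cost_by_task(records: Iterable[Mapping], input_rate: float,
                 output_rate: float) -> Dict[str, float]:
    """Sum USD cost per task_id; rates are USD per million tokens."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        cost = (record["input_tokens"] * input_rate
                + record["output_tokens"] * output_rate) / 1_000_000
        totals[record["task_id"]] += cost
    return dict(totals)


# Hypothetical log: one session containing two tasks of very different size.
usage_log = [
    {"task_id": "fix-login-bug", "input_tokens": 5_000, "output_tokens": 1_500},
    {"task_id": "fix-login-bug", "input_tokens": 5_000, "output_tokens": 1_500},
    {"task_id": "refactor-billing", "input_tokens": 200_000, "output_tokens": 40_000},
]

for task_id, cost in cost_by_task(usage_log, input_rate=3.00, output_rate=15.00).items():
    print(f"{task_id}: ${cost:.3f}")
```

A session-level total would report a single number for this log; the per-task split shows that almost all of the spend came from one refactor.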

The bottom line

Per-task cost varies 100x — but the range collapses with two adjustments.

Per-task agent cost varies by roughly 100× across the ten tools modeled here, but most of that range collapses once you account for two factors. First, Anthropic's 0.1× cache multiplier makes high-cache-hit workloads on Sonnet 4.6 surprisingly competitive with Cursor Composer 2.5: at 90% hit rates, Sonnet's effective input rate ($0.57/Mtok) comes within about 15% of Cursor's $0.50/Mtok. Second, loop-count efficiency: smarter models complete complex tasks in roughly 40% fewer loops, which can offset much of their per-token premium on heavy workloads. The published $/Mtok comparisons are misleading because they ignore both.

For solo developers and indies, Cursor Composer 2.5 ($0.50/M input) or Claude Code on Haiku 4.5 ($0.80/M input) wins on cost while delivering 80% of frontier-model quality for everyday tasks. For enterprise teams running heavy parallel agents, the math flips toward Opus 4.7 or GPT-5.3-Codex — fewer retries multiply through at scale. The subscription-vs-API breakeven sits around 4–8M tokens/month for individuals and 50–100M for teams.

Update this calculator quarterly. Pricing changes weekly in this market — Cursor Composer 2.5 launched in May 2026, Grok Build is still in early beta, and Anthropic's credit model is being overhauled in June 2026. The tool that wins today may not win in three months.

FAQ · AI coding agent costs

The questions we get every week.

Token rate is the number providers publish and the easiest apples-to-apples comparison across a single dimension. Per-task cost requires knowing loop count (how many plan-edit-verify cycles the agent runs) and cache-hit rate — both of which vary by workload, codebase, and model capability. Calculating $/task requires actually running tasks, which is time-consuming and non-reproducible across external reviewers. This is the exact gap this calculator is designed to fill: by anchoring on three reference workloads with explicit loop-count and cache-hit assumptions, we make the per-task math transparent and reproducible, even if the assumptions need calibration for your specific environment.