AI unit economics break the classic SaaS playbook in exactly one place — cost of goods sold. Traditional software is built once and served to the next customer for almost nothing, which is why mature SaaS runs 80-90% gross margins. An AI service cannot copy that trick: every query spends real inference, so gross margins land closer to 50-60%. This guide is a working framework for the COGS levers, the cross-vendor caching math, and a pricing-model stress test you can run on your own numbers.

The stakes are concrete and immediate. A flat monthly price that looks healthy on a blended average can be quietly negative on your heaviest 10% of users, because an agentic loop consumes far more tokens than a single chat turn. GitHub watched exactly that happen to Copilot and moved every plan to usage-based billing in 2026. Pricing an AI product or an AI-powered service without modelling the inference underneath it is not optimism — it is exposure.

What follows is the framework we use when we re-price AI offerings: why the margin ceiling is lower than SaaS, the three levers that actually move COGS, a side-by-side of how the three frontier vendors price caching, a worked stress test showing where each pricing structure breaks, and a read on what the market is converging toward. Where a number is illustrative, it is labelled illustrative.

Key takeaways

01. AI COGS is not zero, and it does not shrink with scale. Inference is a real variable cost on every single query, so the SaaS assumption of near-free marginal serving cost does not hold. Counterintuitively, ICONIQ's 2026 survey found model-inference cost rising from 20% to 23% of total spend as products mature and usage grows: COGS becomes a larger share, not smaller.

02. Expect 50-60% gross margins, not SaaS's 80-90%. Bessemer Venture Partners' February 2026 pricing playbook puts AI gross margins at 50-60% versus 80-90% for traditional SaaS, and a16z's durable framing lands in the same range. ICONIQ's surveyed average is 52% for 2026, up from 41% in 2024: improving, but structurally below software.

03. Model routing is the single biggest COGS lever. Within one vendor the price spread runs 5x or more: Claude Haiku 4.5 at $1/$5 per million tokens versus Opus 4.8 at $5/$25. Routing routine work to a cheap model and reserving the frontier model for genuinely hard tasks compresses the inference bill more than any other change.

04. Caching is a near-universal ~90% discount on repeats. Anthropic, OpenAI, and Google all independently land at roughly a 90% discount on cache reads for repeated input. Anthropic charges 1.25x to write to the cache; Google bills cache storage by the hour. The mechanics differ, but the convergence on read pricing is real.

05. Hybrid pricing is the default under uncertainty. A base subscription plus a usage or outcome tier gives customers predictability while capturing upside and capping your downside on heavy users. Bessemer calls it the effective middle ground for early-stage companies, and the market is shifting that way: GitHub was forced from flat to usage-based in roughly a year.

01 — The COGS Problem: Why AI doesn’t earn SaaS margins.

The defining feature of software-as-a-service is that the marginal cost of one more customer is close to zero. You build the product once and serve it to the thousandth user for roughly what it cost to serve the tenth. That is the entire reason mature SaaS runs 80-90% gross margins, and it is the assumption every classic SaaS pricing model quietly rests on.

AI breaks that assumption at the point of cost of goods sold. There is no build-once-serve-infinitely trick for inference: every query sends tokens through a model, and every token has a price. The cost does not amortise away with scale — it recurs on the next request, and the next. A product can look profitable in a blended average and still be losing money on its most engaged users, because those users are the ones generating the most tokens.

Three things keep AI COGS stubbornly non-zero, and each one is a place where a naive flat price springs a leak.

Inference (per query): the token meter never stops. Every request bills input plus output tokens at the model's per-million-token rate. Unlike a SaaS database read, this is a real variable cost that scales linearly with usage: the heavier the user, the larger the bill. In short: the core variable cost.

Agent loops (multi-step): retries and reasoning stack up. An autonomous agent does not make one call; it screenshots, reasons, retries, and re-prompts across a long loop. A single agentic session can consume orders of magnitude more tokens than one chat turn, with no extra revenue attached. In short: where flat plans break.

Human-in-loop (support): people in the margin. Many AI products still need humans reviewing edge cases, correcting outputs, or handling escalations. a16z notes AI companies often spend a meaningful share of revenue on the compute and people it takes to keep quality acceptable. In short: the hidden COGS line.
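The agent-loop point can be made concrete with a rough sketch. It assumes a loop that re-sends its full context on every turn, so each turn's input is the stable system prompt plus all earlier steps. The function name and the token sizes are hypothetical, chosen only to show the shape of the curve, not measured from any product.

```python
def loop_input_tokens(turns: int, system_tokens: int, growth_per_turn: int) -> int:
    """Total input tokens billed across a loop that re-sends its full context.

    Turn t (1-indexed) sends the system prompt plus (t - 1) earlier steps of
    history, so the total grows roughly with the square of the turn count.
    """
    return sum(system_tokens + (t - 1) * growth_per_turn
               for t in range(1, turns + 1))


# Hypothetical sizes (assumptions, not vendor or survey data).
SYSTEM_TOKENS = 2_000     # stable system prompt plus tool schema
GROWTH_PER_TURN = 1_500   # tokens each step adds to the history

single_turn = loop_input_tokens(1, SYSTEM_TOKENS, GROWTH_PER_TURN)
agent_session = loop_input_tokens(40, SYSTEM_TOKENS, GROWTH_PER_TURN)

print(single_turn)                  # 2000
print(agent_session)                # 1250000
print(agent_session / single_turn)  # 625.0
```

Under these assumptions a 40-turn session bills about 625 times the input of a single turn, which is the gap a flat price has to absorb.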

The foundational framing — still cited in 2026

The canonical reference here is Martin Casado and Matt Bornstein’s a16z essay, which argues that AI-company gross margins often sit in the 50-60% range, well below the 60-80%+ benchmark for comparable SaaS, and that AI companies routinely spend a quarter or more of revenue on cloud and compute. That piece predates the current model generation, so treat it as the durable structural argument rather than fresh 2026 data. Its numbers nonetheless line up closely with Bessemer’s independently published February 2026 figures, which is a useful cross-check rather than a coincidence.

02 — The Margin Reality: The 50-60% gross-margin ceiling.

Put the benchmarks next to each other and they agree. Bessemer Venture Partners’ AI pricing and monetization playbook, fetched from the live page, states AI companies see 50-60% gross margins against 80-90% for traditional SaaS. ICONIQ Capital’s State of AI snapshot, surveying roughly 300 software executives, puts the average AI product gross margin at 52% for 2026 — up from 41% in 2024, so the trend is improving, but still a full tier below software.

The composition matters as much as the headline. Within ICONIQ’s sample, companies that combine model-layer and product-layer innovation — what the report calls balanced differentiation — report the highest margins at 53%, while pure application-layer companies reselling someone else’s model sit at 45%. The closer you are to just passing through a frontier API, the thinner the margin you keep.

Gross margin: AI products vs traditional SaaS
Cohort	Gross margin	Basis
Traditional SaaS	80-90%	BVP / a16z benchmark; the ceiling AI does not reach
AI · balanced differentiation	53%	ICONIQ · model + product-layer innovation
AI products · 2026 projected avg	52%	ICONIQ · up from 41% in 2024 (this post's subject)
AI · pure application layer	45%	ICONIQ · reselling someone else's model
AI products · 2024 avg	41%	ICONIQ · the prior baseline

Source: ICONIQ State of AI (Jan 2026) via SaaStr; SaaS benchmark per Bessemer Venture Partners and a16z. Survey and self-reported figures, directional not audited.

Bessemer, in its own words

The line worth pinning above your pricing model: “Companies see 50-60% gross margins vs. 80-90% for SaaS. If the math doesn’t work at 10 customers, it won’t at 1,000.” Unit economics that are underwater at small scale do not get rescued by volume in an AI business, because the dominant cost scales with usage rather than amortising across it. The time to fix the margin is before you sign the tenth customer, not the thousandth.
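One way to see why volume does not rescue the margin is a deliberately simplified model. The model is our assumption for illustration, not Bessemer's: N customers each pay a price p, each costs c in inference to serve, and a fixed serving cost F is shared across all of them.

```latex
\[
\mathrm{GM}(N) \;=\; \frac{Np - Nc - F}{Np}
\;=\; 1 - \frac{c}{p} - \frac{F}{Np},
\qquad
\lim_{N \to \infty} \mathrm{GM}(N) \;=\; 1 - \frac{c}{p}.
\]
```

Scale only dilutes the F term. When c/p is small, as in classic SaaS, the margin climbs toward a high ceiling as N grows. When inference makes c/p large, the ceiling 1 - c/p is low and is reached almost immediately, and if c exceeds p the margin is negative at 10 customers and at 1,000 alike.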

Here is the counterintuitive finding that should reshape how you plan. The instinct from software is that COGS shrinks as a share of revenue with scale. ICONIQ found the opposite for inference: model-inference cost climbed from 20% to 23% of total spend as products moved from early to mature stage, while talent dropped from 32% to 26%. As an AI product succeeds, the meter runs faster, not slower — usage grows, inference grows with it, and the line you most wanted to dilute becomes a bigger part of the bill. Planning for margins to expand automatically with scale is planning on the wrong curve.

03 — Cost Levers: Three levers that actually move COGS.

If the margin ceiling is structural, the response is engineering: drive the cost-to-serve down so a defensible margin survives at your price. Three levers do most of the work, and the largest one is free — it is an architecture decision, not a discount you negotiate.

Model routing (5x+ price spread): send routine work to a cheap model. Within one vendor, Haiku 4.5 ($1/$5 per Mtok) is a fifth the price of Opus 4.8 ($5/$25). Routing simple requests to the small model and reserving the frontier model for hard tasks is the single largest COGS lever available, and it draws directly on verified per-token prices, not a vendor savings claim. In short: the biggest lever.

Prompt caching (≈90% off reads): stop paying for the same input twice. Agents re-send a stable system prompt and tool schema on every turn. Caching that repeated context cuts the read price to roughly 10% of base input across all three frontier vendors. Anthropic charges 1.25x to write the cache once, then 0.10x to read it back. In short: repeat-prompt savings.

Batch / async (~50% off, no SLA): trade latency for a discount. Major providers offer batch and async tiers at roughly half price for work that does not need a synchronous response: nightly enrichment, bulk classification, offline scoring. Treat the figure as a durable structural discount rather than one precise number; it varies by provider and model. In short: for non-urgent work.

Routing is where most teams leave the most money on the table. The reflex is to run everything on the best model, but the price spread between a cheap and a frontier model from the same vendor is routinely 5x or more, and a large share of real traffic is routine enough for the small model to handle. For the engineering detail on the caching lever, our prompt-caching engineering guide walks through breakpoints and cache-life mechanics, and our guide to right-sizing model spend covers the gate-by-gate playbook for cutting an AI bill. A worked example of the routing lever in practice — near-frontier coding performance at a fraction of the price — sits in our Claude Sonnet agentic-coding cost breakdown.
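A minimal sketch of the routing lever, using only the per-million-token prices quoted above. The dictionary keys are plain labels rather than API model identifiers, the routed share is a hypothetical input you would measure on your own traffic, and the sketch assumes both models consume the same token counts per task, which real workloads may not.

```python
# Per-million-token list prices quoted in this article: (input, output) in USD.
PRICES = {
    "haiku-4.5": (1.00, 5.00),
    "opus-4.8": (5.00, 25.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task on one model at list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def blended_cost(cheap_share: float, input_tokens: int = 3_000,
                 output_tokens: int = 1_000) -> float:
    """Average cost per task when `cheap_share` of tasks go to the small model."""
    cheap = task_cost("haiku-4.5", input_tokens, output_tokens)
    frontier = task_cost("opus-4.8", input_tokens, output_tokens)
    return cheap_share * cheap + (1 - cheap_share) * frontier


for share in (0.0, 0.5, 0.8):
    print(f"{share:.0%} routed to the small model: ${blended_cost(share):.4f} per task")
# 0% routed to the small model: $0.0400 per task
# 50% routed to the small model: $0.0240 per task
# 80% routed to the small model: $0.0144 per task
```

With 80% of traffic on the small model, the blended cost per task in this sketch falls from $0.04 to about $0.0144, a 64% reduction before any caching or batching.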

04 — Caching Compared: Three vendors, one caching pattern.

Caching deserves its own table because it is the rare cost lever where the whole frontier has independently converged. We fetched each vendor’s live pricing on June 30, 2026 and laid the mechanics side by side. The read discount is the same story everywhere — roughly 90% off for re-used input — while the way each vendor charges to create and hold a cache differs.

Cross-vendor prompt and context caching mechanics for Anthropic Claude, OpenAI GPT-5.5, and Google Gemini 3.1 Pro, with cache-read discount, cache-write cost, mechanics or minimum cache life, and any separate storage fee. Read-discount figures are derived from each vendor’s own published per-million-token rates retrieved June 30, 2026; write and storage cells are vendor-stated.
Vendor / model	Cache-read discount	Cache-write cost	Mechanics / cache life	Storage fee
Anthropic — Claude (Opus 4.8 / Sonnet 4.6 / Haiku 4.5)	0.10x base input (≈90% off)	1.25x base input (25% premium)	Explicit cache breakpoints	No separate storage fee
OpenAI — GPT-5.5	$0.50 vs $5.00 input (90% off)	No separate write fee (automatic)	Automatic; previewed GPT-5.6 moves to explicit breakpoints + 30-min minimum cache life	No separate storage fee
Google — Gemini 3.1 Pro (≤200K context)	$0.20 vs $2.00 input (90% off)	Billed via storage, not a per-token write premium	Context caching	$4.50 / Mtok / hour

The convergence is the interesting part. When three competing labs independently price the same lever the same way, it stops being a promotional discount and starts looking like a structural property of how transformers serve repeated context. The practical takeaway: if your agent re-sends a stable prompt — and almost every agent does — caching is not an optimisation you get to later, it is part of the cost model from day one. Just price the write premium and any storage fee into the math: Anthropic’s 1.25x write and Google’s hourly storage charge mean caching pays off on reuse, not on a single call.
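The reuse math can be sketched with only the multipliers and prices in the table above. The function names are hypothetical. The storage figure assumes the whole cached prefix is read on every hit and ignores the cost of the initial request that creates the cache.

```python
def cached_input_cost(calls: int, base: float = 1.0,
                      write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost of `calls` requests sharing one prefix: one cache write, then reads."""
    return base * (write_mult + (calls - 1) * read_mult)


def uncached_input_cost(calls: int, base: float = 1.0) -> float:
    """Input cost of the same requests with no caching at all."""
    return base * calls


def break_even_calls(max_calls: int = 100) -> int:
    """Smallest number of calls at which caching is cheaper than not caching."""
    for calls in range(1, max_calls + 1):
        if cached_input_cost(calls) < uncached_input_cost(calls):
            return calls
    raise ValueError("caching never pays off within max_calls")


def storage_break_even_reads_per_hour(base_price: float, cached_price: float,
                                      storage_per_hour: float) -> float:
    """Full-prefix reads per hour needed for read savings to cover hourly storage."""
    return storage_per_hour / (base_price - cached_price)


# Write-premium model (1.25x write, 0.10x read): first call that comes out ahead.
print(break_even_calls())  # 2

# Hourly-storage model ($2.00 base, $0.20 cached, $4.50 / Mtok / hour).
print(f"{storage_break_even_reads_per_hour(2.00, 0.20, 4.50):.1f}")  # 2.5
```

A single call costs 25% more with a cache write than without one, but the second call already puts you ahead (1.35x versus 2.00x of base input). Under the storage model, the saving of $1.80 per million tokens read covers the hourly fee once the prefix is read more than about 2.5 times per hour.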

05 — The Stress Test: Where each pricing model breaks.

Most coverage repeats the 50-60% headline without showing why a flat price can go underwater on its heaviest users while looking profitable on average. So here is the math, worked. The numbers below are illustrative, not market data — but they are built on real per-token prices so the mechanics are exact.

Assume a cost-to-serve of $0.04 per agent task: a 3,000-input plus 1,000-output-token task at Claude Opus 4.8 list prices ($5 and $25 per million tokens) is (3,000 × $5 ÷ 1M) + (1,000 × $25 ÷ 1M) = $0.015 + $0.025 = $0.04. Three monthly intensities: a light user runs 100 tasks ($4 cost), a median user 500 tasks ($20), and a power user on long agent loops 2,500 tasks ($100). Now watch what three pricing structures do to gross margin across those users.

Illustrative pricing-model margin stress test. Gross margin under three pricing structures (flat per-seat at $30 per user per month, pure usage-based priced at cost times two, and a hybrid of a $15 base including 250 tasks plus $0.08 per task overage) for a light user of 100 tasks, a median user of 500 tasks, and a power user of 2,500 tasks per month, assuming a $0.04 cost-to-serve per task. A final column shows who absorbs the cost-spike risk. Numbers are illustrative, derived from real per-token prices, not market data.
Pricing structure	Light user · 100 tasks	Median user · 500 tasks	Power user · 2,500 tasks	Cost-spike risk
Flat per-seat ($30 / user / mo)	86.7% ($30 rev · $4 cost)	33.3% ($30 rev · $20 cost)	−233% ($30 rev · $100 cost)	Vendor: fully exposed to the power user
Pure usage-based (price = cost × 2, $0.08 / task)	50.0% ($8 rev · $4 cost)	50.0% ($40 rev · $20 cost)	50.0% ($200 rev · $100 cost)	Customer: pays in proportion to use
Hybrid, base + overage ($15 base, 250 tasks incl., then $0.08 / task)	73.3% ($15 rev · $4 cost)	42.9% ($35 rev · $20 cost)	48.7% ($195 rev · $100 cost)	Shared: vendor floor, customer predictability

Read the power-user column first. The flat per-seat plan posts a healthy 86.7% margin on a light user and a thinner but still positive 33.3% on a median one, then collapses to roughly negative 233% on the power user: you are paying $100 to serve someone who pays you $30. Pure usage-based pricing holds a flat 50% margin across every user by construction, but it gives up the predictable revenue floor that customers and finance teams both like. Hybrid threads the needle: 73.3% on the light user, 42.9% on the median, and 48.7% on the power user. It is never negative and never wild, with a base that guarantees a floor and overage that follows cost.
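The table can be reproduced in a few lines, which makes it easy to swap in your own numbers. This is a sketch: the function names are ours, and the prices and the $0.04 cost-to-serve are the illustrative figures from the text, not market data.

```python
COST_PER_TASK = 0.04  # USD, the illustrative cost-to-serve from the text


def flat_revenue(tasks: int, seat_price: float = 30.0) -> float:
    """Flat per-seat: revenue does not depend on how many tasks are run."""
    return seat_price


def usage_revenue(tasks: int, price_per_task: float = 0.08) -> float:
    """Pure usage-based: price per task is cost times two."""
    return tasks * price_per_task


def hybrid_revenue(tasks: int, base: float = 15.0, included: int = 250,
                   overage: float = 0.08) -> float:
    """Hybrid: base fee covers `included` tasks, then overage per extra task."""
    return base + max(0, tasks - included) * overage


def gross_margin(revenue: float, tasks: int) -> float:
    """Gross margin as a fraction of revenue."""
    return (revenue - tasks * COST_PER_TASK) / revenue


USERS = {"light": 100, "median": 500, "power": 2_500}
PLANS = {"flat": flat_revenue, "usage": usage_revenue, "hybrid": hybrid_revenue}

for plan, revenue_fn in PLANS.items():
    row = ", ".join(
        f"{user} {gross_margin(revenue_fn(tasks), tasks):.1%}"
        for user, tasks in USERS.items()
    )
    print(f"{plan:>6}: {row}")
#   flat: light 86.7%, median 33.3%, power -233.3%
#  usage: light 50.0%, median 50.0%, power 50.0%
# hybrid: light 73.3%, median 42.9%, power 48.7%
```

One detail the three sample users do not show: in this setup the hybrid plan's lowest margin falls on a user sitting exactly at the 250-task allotment, who pays $15 against $10 of cost for a 33.3% margin. Above that point each extra task earns 50%, so the margin climbs back toward 50%.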

The forward read is that the industry is walking down exactly this table. As agentic usage spreads, the power-user column stops being a tail case and becomes a meaningful slice of the base, which is precisely why flat-fee AI products keep converting to usage and hybrid models rather than the other way around. The structure that survives contact with heavy agentic users is the one with a cost-following component built in. To build the input underneath every cell, an honest cost-to-serve per task and per user, see our per-task, per-user cost framework; you cannot price what you have not measured.

06 — Market Signals: What the market is actually doing.

The theory shows up in live pricing pages. The clearest case study is GitHub Copilot, which moved all plans to usage-based billing effective June 1, 2026, replacing flat request allotments with metered GitHub AI Credits charged on actual token consumption. Base subscription prices were left unchanged; what changed is that consumption is now metered. GitHub’s own stated reason is the exact mismatch this whole article is built on.

"Today, a quick chat question and a multi-hour autonomous coding session can cost the user the same amount."
Mario Rodriguez, Chief Product Officer, GitHub, announcing Copilot's move to usage-based billing, April 27, 2026

That is a flat-fee vendor saying out loud that flat fees stop working once one user’s session can cost orders of magnitude more than another’s. The rest of the market is improvising different answers to the same problem.

GitHub Copilot (effective Jun 1, 2026): flat to usage. All plans moved to usage-based GitHub AI Credits metered on token consumption. Base prices held (Pro $10/mo, Business $19/user/mo), but the flat allotment is gone, which makes it the canonical real-world proof of the margin thesis. In short: the cautionary tale.

Salesforce Agentforce (live, concurrent): three models at once. Runs Flex Credits ($500 per 100K credits, ~$0.10 per standard action), a flat $2 per resolved conversation, and a per-seat add-on, all at the same time. Offering all three is a hedge against any single metric being wrong, although Flex Credits and Conversations cannot be combined in one org. In short: optionality as strategy.

Sierra ($200M ARR, May 2026): outcome pricing at scale. Charges per resolved conversation, not per seat. Investors validated the model with a $950M Series E at a $15.8B post-money valuation in May 2026, with ARR up from ~$26M at end-2024 to $200M, and 40%+ of the Fortune 50 as customers. In short: outcome-based, funded.

Intercom Fin ($0.99 per outcome): pay per outcome. Bills $0.99 per billable outcome (a resolution, handoff, or qualification), with a 50-resolution monthly minimum and only one outcome billed per conversation. Treat the exact rate as vendor-stated current list pricing; Intercom has changed it before. In short: outcome-based, metered.

ICONIQ — what AI companies actually charge

The survey mix is the honest answer to which model is winning: 58% of AI companies still use subscription or platform pricing, 35% use usage-based, and 18% use outcome-based — with outcome-based up from just 2% in Q2 2025 (categories overlap, since many run hybrids). Subscription still dominates, but outcome is the fastest-growing line off a tiny base, and 37% of companies plan to change their pricing model within a year — driven by customer demand for consumption and outcome pricing (46%) and for more predictable pricing (40%). The direction of travel is clear even though no single model has won. Our usage-based pricing decision matrix maps each model to when it fits.

07 — Your Playbook: Pricing your own AI services.

For an agency or a software team putting a price on AI-powered work, the lesson is not to copy whichever vendor is loudest. It is to start from your own cost-to-serve, then choose the structure whose risk profile matches how variable your usage is. Bessemer’s framing is the right one: your charge metric is a strategic statement, not just a billing decision — tokens suit technical buyers but confuse everyone else, and outcomes align value but force you to absorb cost variability. Pick deliberately.

Bounded usage: flat retainer or per-seat. When the workload is stable and predictable, such as a fixed set of automations run at a known cadence, a flat price is simplest and most legible. It only stays safe if you have measured the ceiling and priced above your worst-case cost-to-serve. In short: flat works when usage is bounded.

Variable, token-hungry: usage-based pass-through. For agentic automation where intensity swings wildly between clients, bill usage plus a markup so your margin is constant by construction. You trade a predictable revenue floor for protection against the power-user blowout. In short: usage protects your margin.

Measurable result: outcome-based pricing. When the deliverable is a clean, countable result, such as a qualified lead or a resolved ticket, outcome pricing aligns your fee with client value. It is the strongest alignment story, but you absorb the cost variance, so only offer it once you can model that variance. In short: outcome aligns value, and you carry the variance.

Most engagements: hybrid, base + overage. Under genuine uncertainty about how heavily a client will use the system, a base retainer plus a usage or outcome tier gives them predictability and gives you a floor with capped downside. It is the structure that survived the stress test above. In short: hybrid is the default under uncertainty.
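Two of these choices reduce to simple checks once the cost-to-serve is known. The sketch below uses the illustrative $0.04 per task and the $30 seat from the stress test. The function names and the 50% target margin are assumptions for the example, not recommendations.

```python
def price_for_margin(cost: float, target_margin: float) -> float:
    """Price that earns `target_margin` (a fraction below 1) on a given cost."""
    if not 0 <= target_margin < 1:
        raise ValueError("target_margin must be in [0, 1)")
    return cost / (1 - target_margin)


def break_even_tasks(flat_price: float, cost_per_task: float) -> float:
    """Monthly tasks at which a flat price exactly covers cost-to-serve."""
    return flat_price / cost_per_task


COST_PER_TASK = 0.04  # USD, the illustrative figure from the stress test

# Usage-based: per-task price for a 50% margin.
print(f"{price_for_margin(COST_PER_TASK, 0.50):.2f}")          # 0.08

# Flat: seat price for a 50% margin on a 2,500-task worst case.
print(f"{price_for_margin(2_500 * COST_PER_TASK, 0.50):.2f}")  # 200.00

# Flat: tasks a $30 seat can absorb before it loses money.
print(f"{break_even_tasks(30.0, COST_PER_TASK):.0f}")          # 750
```

Read this way, a flat $30 seat is only safe if you can bound usage below about 750 tasks a month. Holding a 50% margin on the 2,500-task user would take a $200 seat, which is the gap that pushes variable workloads toward usage or hybrid structures.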

This is exactly the work we do when we re-price an AI offering: model the real cost-to-serve, apply the routing and caching levers until a defensible margin survives, then choose the pricing structure that matches the client’s usage profile rather than a template. It is where our AI transformation engagements begin, and it is the difference between an AI-powered service like CRM automation that compounds margin and one that quietly loses money on its best customers. Price the COGS first; everything downstream follows from it.

08 — Conclusion: Price the COGS first.

AI unit economics, June 2026

AI services are a margin business: model the COGS, then price to protect it.

The single fact that reorganises everything is that AI cost of goods is real, variable, and does not amortise away with scale. That is why gross margins sit at 50-60% rather than the 80-90% software has spoiled us into expecting, and why ICONIQ found inference becoming a larger share of spend as products mature. The margin ceiling is structural, so the work is to engineer the cost down and price the rest honestly.

On the cost side, the levers are clear and mostly in your control: route routine work to a model a fifth the price, cache the repeated input every vendor now discounts roughly 90%, and batch anything that does not need an answer this second. On the pricing side, the stress test is the lesson — a flat fee goes underwater on heavy agentic users, and the market is converging on usage and hybrid structures precisely because they keep a cost-following component in the price.

For anyone pricing AI-powered work in 2026, the discipline is the same whether you are a frontier lab or a two-person agency: measure your cost-to-serve before you set a price, choose a structure whose risk profile matches how variable your usage is, and default to hybrid when you are genuinely unsure. The teams that win the AI services market will not be the ones with the lowest headline price. They will be the ones who priced the inference underneath it.

AI Unit Economics: Pricing & Margins for AI Services