AI unit economics break the classic SaaS playbook in exactly one place — cost of goods sold. Traditional software is built once and served to the next customer for almost nothing, which is why mature SaaS runs 80-90% gross margins. An AI service cannot copy that trick: every query spends real inference, so gross margins land closer to 50-60%. This guide is a working framework for the COGS levers, the cross-vendor caching math, and a pricing-model stress test you can run on your own numbers.
The stakes are concrete and immediate. A flat monthly price that looks healthy on a blended average can be quietly negative on your heaviest 10% of users, because an agentic loop consumes far more tokens than a single chat turn. GitHub watched exactly that happen to Copilot and moved every plan to usage-based billing in 2026. Pricing an AI product or an AI-powered service without modelling the inference underneath it is not optimism — it is exposure.
What follows is the framework we use when we re-price AI offerings: why the margin ceiling is lower than SaaS, the three levers that actually move COGS, a side-by-side of how the three frontier vendors price caching, a worked stress test showing where each pricing structure breaks, and a read on what the market is converging toward. Where a number is illustrative, it is labelled illustrative.
- 01AI COGS is not zero — and it does not shrink with scale.Inference is a real variable cost on every single query, so the SaaS assumption of near-free marginal serving cost does not hold. Counterintuitively, ICONIQ's 2026 survey found model-inference cost rising from 20% to 23% of total spend as products mature and usage grows — COGS becomes a larger share, not smaller.
- 02Expect 50-60% gross margins, not SaaS's 80-90%.Bessemer Venture Partners' February 2026 pricing playbook puts AI gross margins at 50-60% versus 80-90% for traditional SaaS, and a16z's durable framing lands in the same range. ICONIQ's surveyed average is 52% for 2026, up from 41% in 2024 — improving, but structurally below software.
- 03Model routing is the single biggest COGS lever.Within one vendor the price spread runs 5x or more — Claude Haiku 4.5 at $1/$5 per million tokens versus Opus 4.8 at $5/$25. Routing routine work to a cheap model and reserving the frontier model for genuinely hard tasks compresses the inference bill more than any other change.
- 04Caching is a near-universal ~90% discount on repeats.Anthropic, OpenAI, and Google all independently land at roughly a 90% discount on cache reads for repeated input. Anthropic charges 1.25x to write to the cache; Google bills cache storage by the hour. The mechanics differ, but the convergence on read pricing is real.
- 05Hybrid pricing is the default under uncertainty.A base subscription plus a usage or outcome tier gives customers predictability while capturing upside and capping your downside on heavy users. Bessemer calls it the effective middle ground for early-stage companies, and the market is shifting that way — GitHub was forced from flat to usage-based in roughly a year.
01 — The COGS ProblemWhy AI doesn’t earn SaaS margins.
The defining feature of software-as-a-service is that the marginal cost of one more customer is close to zero. You build the product once and serve it to the thousandth user for roughly what it cost to serve the tenth. That is the entire reason mature SaaS runs 80-90% gross margins, and it is the assumption every classic SaaS pricing model quietly rests on.
AI breaks that assumption at the point of cost of goods sold. There is no build-once-serve-infinitely trick for inference: every query sends tokens through a model, and every token has a price. The cost does not amortise away with scale — it recurs on the next request, and the next. A product can look profitable in a blended average and still be losing money on its most engaged users, because those users are the ones generating the most tokens.
Three things keep AI COGS stubbornly non-zero, and each one is a place where a naive flat price springs a leak.
The token meter never stops
Every request bills input plus output tokens at the model's per-million-token rate. Unlike a SaaS database read, this is a real variable cost that scales linearly with usage — the heavier the user, the larger the bill.
Retries and reasoning stack up
An autonomous agent does not make one call — it screenshots, reasons, retries, and re-prompts across a long loop. A single agentic session can consume orders of magnitude more tokens than one chat turn, with no extra revenue attached.
People in the margin
Many AI products still need humans reviewing edge cases, correcting outputs, or handling escalations. a16z notes AI companies often spend a meaningful share of revenue on the compute and people it takes to keep quality acceptable.
02 — The Margin RealityThe 50-60% gross-margin ceiling.
Put the benchmarks next to each other and they agree. Bessemer Venture Partners’ AI pricing and monetization playbook, fetched from the live page, states AI companies see 50-60% gross margins against 80-90% for traditional SaaS. ICONIQ Capital’s State of AI snapshot, surveying roughly 300 software executives, puts the average AI product gross margin at 52% for 2026 — up from 41% in 2024, so the trend is improving, but still a full tier below software.
The composition matters as much as the headline. Within ICONIQ’s sample, companies that combine model-layer and product-layer innovation — what the report calls balanced differentiation — report the highest margins at 53%, while pure application-layer companies reselling someone else’s model sit at 45%. The closer you are to just passing through a frontier API, the thinner the margin you keep.
Gross margin: AI products vs traditional SaaS
Source: ICONIQ State of AI (Jan 2026) via SaaStr; SaaS benchmark per Bessemer Venture Partners and a16z — survey and self-reported figures, directional not auditedHere is the counterintuitive finding that should reshape how you plan. The instinct from software is that COGS shrinks as a share of revenue with scale. ICONIQ found the opposite for inference: model-inference cost climbed from 20% to 23% of total spend as products moved from early to mature stage, while talent dropped from 32% to 26%. As an AI product succeeds, the meter runs faster, not slower — usage grows, inference grows with it, and the line you most wanted to dilute becomes a bigger part of the bill. Planning for margins to expand automatically with scale is planning on the wrong curve.
03 — Cost LeversThree levers that actually move COGS.
If the margin ceiling is structural, the response is engineering: drive the cost-to-serve down so a defensible margin survives at your price. Three levers do most of the work, and the largest one is free — it is an architecture decision, not a discount you negotiate.
Send routine work to a cheap model
Within one vendor, Haiku 4.5 ($1/$5 per Mtok) is a fifth the price of Opus 4.8 ($5/$25). Routing simple requests to the small model and reserving the frontier model for hard tasks is the single largest COGS lever available — and it draws directly on verified per-token prices, not a vendor savings claim.
Stop paying for the same input twice
Agents re-send a stable system prompt and tool schema on every turn. Caching that repeated context cuts the read price to roughly 10% of base input across all three frontier vendors. Anthropic charges 1.25x to write the cache once, then 0.10x to read it back.
Trade latency for a discount
Major providers offer batch and async tiers at roughly half price for work that does not need a synchronous response — nightly enrichment, bulk classification, offline scoring. Treat the figure as a durable structural discount rather than one precise number; it varies by provider and model.
Routing is where most teams leave the most money on the table. The reflex is to run everything on the best model, but the price spread between a cheap and a frontier model from the same vendor is routinely 5x or more, and a large share of real traffic is routine enough for the small model to handle. For the engineering detail on the caching lever, our prompt-caching engineering guide walks through breakpoints and cache-life mechanics, and our guide to right-sizing model spend covers the gate-by-gate playbook for cutting an AI bill. A worked example of the routing lever in practice — near-frontier coding performance at a fraction of the price — sits in our Claude Sonnet agentic-coding cost breakdown.
04 — Caching ComparedThree vendors, one caching pattern.
Caching deserves its own table because it is the rare cost lever where the whole frontier has independently converged. We fetched each vendor’s live pricing on June 30, 2026 and laid the mechanics side by side. The read discount is the same story everywhere — roughly 90% off for re-used input — while the way each vendor charges to create and hold a cache differs.
| Vendor / model | Cache-read discount | Cache-write cost | Mechanics / cache life | Storage fee |
|---|---|---|---|---|
| Anthropic — Claude (Opus 4.8 / Sonnet 4.6 / Haiku 4.5) | 0.10x base input (≈90% off) | 1.25x base input (25% premium) | Explicit cache breakpoints | No separate storage fee |
| OpenAI — GPT-5.5 | $0.50 vs $5.00 input (90% off) | No separate write fee (automatic) | Automatic; previewed GPT-5.6 moves to explicit breakpoints + 30-min minimum cache life | No separate storage fee |
| Google — Gemini 3.1 Pro (≤200K context) | $0.20 vs $2.00 input (90% off) | Billed via storage, not a per-token write premium | Context caching | $4.50 / Mtok / hour |
The convergence is the interesting part. When three competing labs independently price the same lever the same way, it stops being a promotional discount and starts looking like a structural property of how transformers serve repeated context. The practical takeaway: if your agent re-sends a stable prompt — and almost every agent does — caching is not an optimisation you get to later, it is part of the cost model from day one. Just price the write premium and any storage fee into the math: Anthropic’s 1.25x write and Google’s hourly storage charge mean caching pays off on reuse, not on a single call.
05 — The Stress TestWhere each pricing model breaks.
Most coverage repeats the 50-60% headline without showing why a flat price can go underwater on its heaviest users while looking profitable on average. So here is the math, worked. The numbers below are illustrative, not market data — but they are built on real per-token prices so the mechanics are exact.
Assume a cost-to-serve of $0.04 per agent task: a 3,000-input plus 1,000-output-token task at Claude Opus 4.8 list prices ($5 and $25 per million tokens) is (3,000 × $5 ÷ 1M) + (1,000 × $25 ÷ 1M) = $0.015 + $0.025 = $0.04. Three monthly intensities: a light user runs 100 tasks ($4 cost), a median user 500 tasks ($20), and a power user on long agent loops 2,500 tasks ($100). Now watch what three pricing structures do to gross margin across those users.
| Pricing structure | Light user · 100 tasks | Median user · 500 tasks | Power user · 2,500 tasks | Cost-spike risk |
|---|---|---|---|---|
| Flat per-seat$30 / user / mo | 86.7%$30 rev · $4 cost | 33.3%$30 rev · $20 cost | −233%$30 rev · $100 cost | Vendor — fully exposed to the power user |
| Pure usage-basedprice = cost × 2 ($0.08 / task) | 50.0%$8 rev · $4 cost | 50.0%$40 rev · $20 cost | 50.0%$200 rev · $100 cost | Customer — pays in proportion to use |
| Hybrid (base + overage)$15 base, 250 tasks incl., then $0.08 / task | 73.3%$15 rev · $4 cost | 42.9%$35 rev · $20 cost | 48.7%$195 rev · $100 cost | Shared — vendor floor, customer predictability |
Read the power-user column first. The flat per-seat plan posts a healthy 86.7% margin on a light user and a comfortable 33.3% on a median one, then collapses to roughly negative 233% on the power user — you are paying $100 to serve someone who pays you $30. Pure usage-based pricing holds a flat 50% margin across every user by construction, but it gives up the predictable revenue floor that customers and finance teams both like. Hybrid threads the needle: 73.3% on the light user, 42.9% on the median, and 48.7% on the power user — never negative, never wild, with a base that guarantees a floor and overage that follows cost.
The forward read is that the industry is walking down exactly this table. As agentic usage spreads, the power-user column stops being a tail case and becomes a meaningful slice of the base, which is precisely why flat-fee AI products keep converting to usage and hybrid models rather than the other way around. The structure that survives contact with heavy agentic users is the one with a cost-following component built in. To build the column on the left — an honest cost-to-serve per task and per user — see our per-task, per-user cost framework; you cannot price what you have not measured.
06 — Market SignalsWhat the market is actually doing.
The theory shows up in live pricing pages. The clearest case study is GitHub Copilot, which moved all plans to usage-based billing effective June 1, 2026, replacing flat request allotments with metered GitHub AI Credits charged on actual token consumption. Base subscription prices were left unchanged; what changed was that consumption now meters. GitHub’s own stated reason is the exact mismatch this whole article is built on.
"Today, a quick chat question and a multi-hour autonomous coding session can cost the user the same amount."— Mario Rodriguez, Chief Product Officer, GitHub, announcing Copilot's move to usage-based billing, April 27, 2026
That is a flat-fee vendor saying out loud that flat fees stop working once one user’s session can cost orders of magnitude more than another’s. The rest of the market is improvising different answers to the same problem.
Flat to usage
All plans moved to usage-based GitHub AI Credits metered on token consumption. Base prices held (Pro $10/mo, Business $19/user/mo), but the flat allotment is gone — the canonical real-world proof of the margin thesis.
Three models at once
Runs Flex Credits ($500 per 100K credits, ~$0.10 per standard action), a flat $2 per resolved conversation, and a per-seat add-on, simultaneously. Flex Credits and Conversations cannot be combined in one org — a hedge against any single metric being wrong.
Outcome pricing at scale
Charges per resolved conversation, not per seat. Investors validated the model with a $950M Series E at a $15.8B post-money valuation in May 2026, with ARR up from ~$26M at end-2024 to $200M, and 40%+ of the Fortune 50 as customers.
Pay per outcome
Bills $0.99 per billable outcome — a resolution, handoff, or qualification — with a 50-resolution monthly minimum, only one outcome billed per conversation. Treat the exact rate as vendor-stated current list pricing; Intercom has changed it before.
07 — Your PlaybookPricing your own AI services.
For an agency or a software team putting a price on AI-powered work, the lesson is not to copy whichever vendor is loudest. It is to start from your own cost-to-serve, then choose the structure whose risk profile matches how variable your usage is. Bessemer’s framing is the right one: your charge metric is a strategic statement, not just a billing decision — tokens suit technical buyers but confuse everyone else, and outcomes align value but force you to absorb cost variability. Pick deliberately.
Flat retainer or per-seat
When the workload is stable and predictable — a fixed set of automations run at a known cadence — a flat price is simplest and most legible. It only stays safe if you have measured the ceiling and priced above your worst-case cost-to-serve.
Usage-based pass-through
For agentic automation where intensity swings wildly between clients, bill usage plus a markup so your margin is constant by construction. You trade a predictable revenue floor for protection against the power-user blowout.
Outcome-based pricing
When the deliverable is a clean, countable result — a qualified lead, a resolved ticket — outcome pricing aligns your fee with client value. It is the strongest alignment story, but you absorb the cost variance, so only offer it once you can model that variance.
Hybrid: base + overage
Under genuine uncertainty about how heavily a client will use the system, a base retainer plus a usage or outcome tier gives them predictability and gives you a floor with capped downside. It is the structure that survived the stress test above.
This is exactly the work we do when we re-price an AI offering: model the real cost-to-serve, apply the routing and caching levers until a defensible margin survives, then choose the pricing structure that matches the client’s usage profile rather than a template. It is where our AI transformation engagements begin, and it is the difference between an AI-powered service like CRM automation that compounds margin and one that quietly loses money on its best customers. Price the COGS first; everything downstream follows from it.
08 — ConclusionPrice the COGS first.
AI services are a margin business: model the COGS, then price to protect it.
The single fact that reorganises everything is that AI cost of goods is real, variable, and does not amortise away with scale. That is why gross margins sit at 50-60% rather than the 80-90% software has spoiled us into expecting, and why ICONIQ found inference becoming a larger share of spend as products mature. The margin ceiling is structural, so the work is to engineer the cost down and price the rest honestly.
On the cost side, the levers are clear and mostly in your control: route routine work to a model a fifth the price, cache the repeated input every vendor now discounts roughly 90%, and batch anything that does not need an answer this second. On the pricing side, the stress test is the lesson — a flat fee goes underwater on heavy agentic users, and the market is converging on usage and hybrid structures precisely because they keep a cost-following component in the price.
For anyone pricing AI-powered work in 2026, the discipline is the same whether you are a frontier lab or a two-person agency: measure your cost-to-serve before you set a price, choose a structure whose risk profile matches how variable your usage is, and default to hybrid when you are genuinely unsure. The teams that win the AI services market will not be the ones with the lowest headline price. They will be the ones who priced the inference underneath it.