
Agent Token Budget Calculator: Cost-Control Framework 2026

Free, Pro, Team, Enterprise — the four-tier token-budget framework that turns agentic features from cost-center to predictable unit economics.

A per-user-per-month token-budget framework for SaaS agentic features. Formula tables, tiered budget design, overage routing, alert thresholds, per-feature attribution, and a worked spreadsheet calibrated against a 10k-MAU production SaaS.

Digital Applied Team · Senior strategists
Published May 4, 2026 · 15 min read · Sources: 12 production deployments
At a glance:
  • Budget tiers: 4 (free / pro / team / enterprise)
  • Overage patterns: 3 (hard-cap, soft-cap, pay-as-you-go)
  • Output-to-input cost ratio: 3–5× (Sonnet-class chat workloads)
  • Calibration: real SaaS at 10k MAUs (spreadsheet model in §07)

An agent token budget calculator is the operational backbone of any SaaS shipping agentic features at scale. Without one, a single power-user's curiosity-driven session can consume the same inference budget as a hundred ordinary users in a day — and the margin model collapses quietly long before the dashboards catch up.

The economics are unforgiving on chat-heavy features. Output tokens on a Sonnet-class model cost three to five times their input counterparts. A long agentic loop with five tool calls and a reasoning trace can run twenty thousand tokens before the user even reads the answer. Multiply that by daily-active behavior across a free tier with no ceiling and the unit-economics math stops working by month two.

This guide is a complete cost-control framework — the formula, four-tier budget design, three overage-routing patterns, alert thresholds, per-feature attribution, and a worked spreadsheet calibrated against a real production SaaS at ten-thousand monthly active users. The recipe is the same one we apply on AI transformation engagements when a feature is moving from beta to general availability.

Key takeaways
  1. Budgets per tier prevent the 'one user eats the budget' failure mode. Hard limits with soft degradation cap the per-user blast radius. Without a ceiling, a single curiosity-driven power user can consume the same monthly inference budget as the next hundred ordinary users combined.
  2. Output token cost dominates for chat-heavy features. Output tokens are typically 3 to 5 times the cost of input tokens on Sonnet-class models. Multiplier-aware formulas surface that asymmetry; flat input+output averages hide it and produce wrong tier sizing.
  3. Hard-cap is harsh — soft-cap-with-degradation is the production default. Hard-cap UX (request denied) earns one-star reviews. Soft-cap-with-degradation silently swaps the model from Sonnet to Haiku once the threshold is hit, preserving the feature while protecting margin. Customers rarely notice on routine turns.
  4. Per-feature attribution surfaces the 80/20 cost drivers. Telemetry tagged by feature and user pays for itself the first time it identifies a single feature consuming 60% of inference budget for 8% of usage. Without attribution, every optimization pass is guesswork.
  5. Quarterly tier-redesign cadence prevents margin drift. Token prices drop year-over-year — Sonnet-tier rates fell roughly 40% across 2025. Pass that saving through to customers, or pocket it as margin, but never let last year's ceiling silently become this year's floor.

01 · Why Budget: Unbounded agentic features eat margin — budgets fix it.

The default state of a newly launched agentic feature is unbounded. A user opens the chat surface, asks a question, the agent fans out into three tool calls and a reasoning trace, and the meter spins. For the first month after launch — when daily-active numbers are still small enough that finance hasn't noticed — that unboundedness feels fine. By month three it isn't.

The failure mode is not the average user. It is the long tail. Across the production deployments behind this framework, the top 5% of users routinely consumed 50 to 70% of total inference spend. That is not abuse — it is curiosity, multi-step workflows, or just the user who happens to use the product the way the marketing site invites them to. Unmetered, those users are the customers who turn an agentic feature into a margin sink.

A token budget is a contract. It states what a user gets for their tier price, what happens when they exceed it, and how the product stays whole when the tail behavior shows up. The four-tier framework below is the shape that fits most SaaS products; the three overage patterns in §04 define what happens when the ceiling is reached.

The failure pattern we keep seeing
A B2B SaaS ships an agentic feature on the free tier without a ceiling. Month one looks great — viral signups, no margin pressure. Month two, the top 3% of users (typically power users sharing screenshots on social) consume 40% of inference budget. Month three, the CFO asks why the AI line item just doubled. The fix is always the same — retrofit a budget, ship a soft-cap, add per-feature attribution. We would rather you shipped the budget on day one.

There is a counter-argument worth taking seriously: that budgets constrain the user experience and make the product feel cheap. The evidence from the deployments behind this framework is the opposite. Well-designed budgets are invisible to the median user. They cap blast radius for the long tail. They make the feature shippable on free and trial tiers — which is the difference between a feature that drives signups and one that gets pulled back behind a paywall after three months because it broke the unit model.

02 · The Formula: monthly_tokens = sessions × turns × (input + output × multiplier).

The whole framework rests on a single per-user formula. It is deliberately simple — the goal is to be reasoned about in a meeting, not lost in a spreadsheet. The right complexity lives in the attribution layer (§06), not in the headline equation.

monthly_tokens_per_user = sessions × turns × (input_tokens + output_tokens × output_multiplier)

Five inputs. The first three describe behavior: sessions per user per month, turns per session, and the average input_tokens a user sends per turn (the question plus any retrieved context). The last two describe the model: output_tokens the model produces per turn, and the output_multiplier — the ratio of output token price to input token price for the chosen model.

The output multiplier is the lever most teams miss

On Claude Sonnet 4.7, input tokens are roughly $3 per million and output tokens are roughly $15 per million — a 5× multiplier. On GPT-5.5, the ratio is closer to 4×. On Haiku and the smaller models, the ratio compresses to around 3×. A budget formula that averages input and output prices into a single per-token number understates the cost of every chat-heavy feature by a factor of two or three.

The formula, worked once end-to-end

Inputs. A median Pro-tier user: 8 sessions per month, 6 turns per session, 600 input tokens per turn, 1,200 output tokens per turn, multiplier 5× (Sonnet 4.7).

Effective tokens per turn. 600 + 1,200 × 5 = 6,600 tokens of equivalent input-priced cost.

Monthly total. 8 × 6 × 6,600 = 316,800 equivalent tokens per user per month. At $3/M, that is ~$0.95 per user in raw inference cost — before any tool-call overhead, retries, or agent loops.

Three adjustments make the formula production-grade. First, multiply turns by a tool_call_factor — typically 1.5 to 2.5× — to account for tool-augmented turns consuming the model multiple times per user turn. Second, add a retry_factor of 1.1 to 1.2× to absorb timeouts and user-initiated regenerations. Third, separate retrieval-augmented turns from raw-chat turns when the input distribution is meaningfully different — RAG turns commonly run 3 to 5× the input token count of unaugmented turns.
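To make the arithmetic repeatable, here is the formula with all three adjustments as a small TypeScript sketch. The names (`UsageProfile`, `monthlyEquivalentTokens`) and defaults are illustrative, not a published API:

```typescript
// Per-user monthly budget in "equivalent tokens": output tokens are
// re-expressed at input price via the multiplier, so a single $/M input
// rate prices the whole result. Names and defaults are illustrative.
interface UsageProfile {
  sessionsPerMonth: number;
  turnsPerSession: number;
  inputTokensPerTurn: number;   // question plus retrieved context
  outputTokensPerTurn: number;
  outputMultiplier: number;     // output price / input price (3-5x Sonnet-class)
  toolCallFactor?: number;      // 1.5-2.5x for tool-augmented turns
  retryFactor?: number;         // 1.1-1.2x for timeouts and regenerations
}

function monthlyEquivalentTokens(p: UsageProfile): number {
  const perTurn = p.inputTokensPerTurn + p.outputTokensPerTurn * p.outputMultiplier;
  const effectiveTurns =
    p.turnsPerSession * (p.toolCallFactor ?? 1) * (p.retryFactor ?? 1);
  return p.sessionsPerMonth * effectiveTurns * perTurn;
}

// The Pro-tier median from the worked example above:
const tokens = monthlyEquivalentTokens({
  sessionsPerMonth: 8,
  turnsPerSession: 6,
  inputTokensPerTurn: 600,
  outputTokensPerTurn: 1_200,
  outputMultiplier: 5,
}); // 316,800 equivalent tokens

console.log(`~$${((tokens * 3) / 1e6).toFixed(2)} per user per month at $3/M input`); // ~$0.95
```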

The output multiplier is also where strategic model choice shows up. A feature pegged to Sonnet 4.7 with no escape hatch costs five times the equivalent feature on Haiku for the output portion alone. Most teams over-pick at the top of the model ladder; the budget calculator surfaces that immediately, which is half the reason to run one at all.

"Average input and output prices into a single per-token number and you understate every chat-heavy feature by a factor of two or three. The multiplier is the lever."— Internal playbook for SaaS agentic unit economics

03 · Four Tiers: Free, Pro, Team, Enterprise — budget by plan.

Four tiers is the right number for most SaaS products with an agentic feature. Fewer than four conflates power users with median users; more than four creates pricing-page paralysis. The four below are calibrated against the deployments behind this framework — adjust the absolute numbers to your stack, but keep the relative shape.

Each tier carries three budget dimensions: a hard monthly token ceiling (the contract), an output-multiplier-aware soft threshold (the warning), and a model-tier policy (which model the feature invokes for that plan). The combination is what makes the budget actually defensible.

Free — acquisition · cap 50k · Haiku

Hard cap at 50k equivalent tokens per month. Pegged to a Haiku-tier model. Designed to make the feature feel responsive without enabling chat-heavy power use. Tool calls limited to two per turn. Sessions throttled at 5/day. Above the cap, the feature pauses with a clear 'upgrade for more' prompt — never silently degrades on Free, because the perception cost outweighs the saved tokens.

Pro — median user · cap 500k · Sonnet→Haiku

Hard ceiling at 500k tokens per month, soft threshold at 350k. Sonnet-tier model by default with Haiku fallback above the soft threshold (the soft-cap-with-degradation pattern from §04). Tool calls uncapped. Sessions unlimited. This is the tier the formula in §02 sizes against: ~$1 in raw inference cost per Pro user at typical usage — well under any reasonable monthly seat price.

Team — shared pool · 3M × seats

Per-workspace pooled budget of 3M tokens × seats. Pooling absorbs intra-team variance — heavy users on the same Team plan share the ceiling with light users. Sonnet-tier model. Workspace admin dashboard shows per-member spend. Overage routes to the pay-as-you-go pattern at the workspace level, billed to the workspace admin's card on file.

Enterprise — contracted pool

Contracted annual token pool sized during the sales motion, with quarterly true-up. Custom model routing per workload class (Sonnet for general, Opus for hard reasoning, Haiku for high-volume retrieval). SSO-bound per-user attribution. Overage handled at a contracted overage rate, never as service interruption. Quarterly business review covers token spend alongside seat growth.
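Wired into code, the three budget dimensions per tier reduce to a small policy table. A sketch under the numbers above; the type and field names are illustrative:

```typescript
// Tier policy table: hard ceiling, soft threshold, and model routing per plan.
// Mirrors the four cards above; names and numbers are illustrative.
type ModelTier = "haiku" | "sonnet" | "opus";

interface TierPolicy {
  hardCeiling: number | "pooled" | "contracted"; // equivalent tokens per month
  softThreshold?: number;      // degradation kicks in here (§04)
  primaryModel: ModelTier;
  fallbackModel?: ModelTier;   // soft-cap-with-degradation target
  overage: "hard-cap" | "degrade" | "pay-as-you-go" | "contracted-rate";
}

const TIER_POLICIES: Record<string, TierPolicy> = {
  free: { hardCeiling: 50_000, primaryModel: "haiku", overage: "hard-cap" },
  pro: {
    hardCeiling: 500_000, softThreshold: 350_000,
    primaryModel: "sonnet", fallbackModel: "haiku", overage: "degrade",
  },
  team: { hardCeiling: "pooled", primaryModel: "sonnet", overage: "pay-as-you-go" },
  enterprise: { hardCeiling: "contracted", primaryModel: "sonnet", overage: "contracted-rate" },
};
```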
The Free-tier discipline most teams skip
Free-tier sizing is the hardest calibration decision in the framework. Too tight and the feature feels broken to evaluators. Too loose and free users drain the budget that paid users are funding. The 50k ceiling on a Haiku-tier model is the floor we see working in practice. It funds roughly a week of real use — enough for an evaluator to genuinely see the product's value, not enough for indefinite daily chat-heavy use.

One operational note on tier transitions. The hardest customer conversation is the one where a user on the Pro tier hits the hard ceiling for the first time. The conversation goes well when the soft-threshold warning at 70% gave them a week of runway to either upgrade or moderate their use. It goes badly when the first signal they receive is the hard cap. Alerts (§05) are not optional — they are the difference between churn and upgrade on this transition.

04 · Overage Routing: Hard-cap, soft-cap, pay-as-you-go.

Three patterns cover the space of what to do when a user hits their budget. Pick one per tier; never mix two patterns on the same tier (the customer will not understand which one applies). The production default for paid tiers is soft-cap-with-degradation. The production default for free is hard-cap with a clear upgrade prompt.

Hard cap · request denied · upgrade prompt

Cleanest contract: at the ceiling, the feature stops responding and shows an upgrade card. Right for the Free tier (where the prompt drives conversion) and for compliance-sensitive Enterprise contracts (where unbounded overage is unacceptable). Harsh on paid tiers — users churn rather than upgrade if hit without warning. Fits: Free · regulated Enterprise.

Soft-cap with degradation · silent Sonnet → Haiku swap

The production default for Pro and Team. At the soft threshold, the backend transparently swaps the model from the premium tier to the budget tier. The feature keeps working, the customer rarely notices on routine turns, the margin stays whole. Telemetry distinguishes degraded turns from premium turns for the eventual upgrade conversation. Fits: Pro · Team default.

Pay-as-you-go overage · metered billing · card on file

Above the ceiling, additional tokens bill at a published per-thousand rate against the card on file. Right for Team workspaces (where the admin is the billing relationship and expects the line item) and as an explicit add-on for Pro users who proactively opt in. Never auto-enable on a tier where the user hasn't agreed to the meter. Fits: Team · opt-in Pro.

The case for soft-cap-with-degradation deserves the most attention, because it is both the highest-leverage pattern and the one teams most often skip in favor of hard-cap simplicity. The mechanism is straightforward: the route handler reads a per-user token-spend counter on each request, and once that counter crosses the soft threshold for the user's tier, the model parameter in the SDK call swaps from the premium tier to the budget tier.
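A minimal sketch of that handler for the Pro tier in a Next.js route, assuming hypothetical helpers (`getMonthlySpend`, `callModel`, `recordTurn`) that you would back with your own telemetry store and provider SDK:

```typescript
// Soft-cap-with-degradation in a chat route handler (sketch).
import { NextRequest, NextResponse } from "next/server";

// Assumed helpers; stubs here, implemented against your own stack.
declare function getMonthlySpend(userId: string): Promise<number>;
declare function callModel(args: {
  model: string;
  messages: { role: string; content: string }[];
}): Promise<{ text: string; usage: { inputTokens: number; outputTokens: number } }>;
declare function recordTurn(event: {
  userId: string;
  model: string;
  degraded: boolean;
  usage: { inputTokens: number; outputTokens: number };
}): Promise<void>;

const PRO = { softThreshold: 350_000, premium: "sonnet", budget: "haiku" };

export async function POST(req: NextRequest) {
  const { userId, messages } = await req.json();

  // One counter read per request; swap the model once the soft threshold is crossed.
  const spend = await getMonthlySpend(userId);
  const degraded = spend >= PRO.softThreshold;
  const model = degraded ? PRO.budget : PRO.premium;

  const completion = await callModel({ model, messages });

  // Tag degraded turns separately so the upgrade conversation has data (§06).
  await recordTurn({ userId, model, degraded, usage: completion.usage });

  return NextResponse.json({ reply: completion.text, degraded });
}
```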

What makes the pattern work is that the swap is genuinely invisible on routine turns. Sonnet to Haiku on a three-turn customer-support chat is imperceptible to most users — both models produce coherent, on-brand answers. The degradation only becomes visible on hard reasoning, multi-step analysis, or long-context tasks. By that point, the user is well into the "heavy usage" territory where the upgrade conversation should be happening anyway. The telemetry layer (§06) tracks degraded turns separately so that in-product messaging can surface "you would have gotten a better answer on the Team tier" at the right moment.

Pay-as-you-go is the right pattern when the relationship is explicitly metered, but it is not the right default. Customers consistently report higher anxiety on metered AI features than on tier-capped ones — the perceived risk of the unbounded bill outweighs the actual cost. Use pay-as-you-go where the relationship already supports it (Team admins, Enterprise contracts), not as a universal overage answer.

The legal note teams forget
Any tier with an automatic model-swap (soft-cap-with-degradation) must disclose the swap somewhere in the product or terms. Not on a pricing-page footnote — visible in the in-product budget UI, so users who hit the threshold understand why subsequent turns may feel different. The disclosure costs nothing and prevents the worst-case customer-trust incident: users discovering the swap from a vendor leak rather than from the product itself.

05 · Alerts: 70%, 90%, 100% — thresholds that don't surprise.

Three thresholds, three different messages. The 70% mark is a soft notice — "you are tracking heavier than the median this month, plenty of room left." The 90% mark is a clear warning — "you have a week of typical usage remaining; consider upgrading." The 100% mark is the contract event — what happens next is determined by the overage pattern from §04.

Both in-product and email signals matter. The in-product signal is what catches the user mid-flow; the email signal is what reaches users who have left the tab open for three days. Send both at 90% and 100%, in-product only at 70%. Never alert at 50% or below — customers experience low-watermark alerts as harassment and tune them out, which makes the high-watermark alerts less effective.

70% · soft notice · in-product only

Discreet badge in the chat surface header: 'tracking heavier than usual this month.' No email. No friction. Surfaces the budget concept without alarming the user. This threshold catches roughly 25% of users on Pro who eventually hit 90%, giving them a week of runway. Channel: in-app, no email.

90% · clear warning · in-product + email

Banner in the chat surface plus a single email to the account owner. Message: 'you have used 90% of this month's budget — about a week of typical usage remaining.' Embed an upgrade CTA. This is the threshold that converts heavy free-tier users into Pro and heavy Pro users into Team. Channel: in-app + email.

100% · contract event · behavior per the §04 pattern

What happens here is determined by the overage routing pattern for the tier — hard-cap blocks, soft-cap degrades, pay-as-you-go starts metered billing. In all three cases: clear in-product message, follow-up email confirming what happened, and a one-click path to the next-tier upgrade or the explicit metered consent. Channel: tier-dependent.

Two operational details. First, alert thresholds should be computed against the user's pace, not just the absolute counter. A user who hits 70% on day three of the month is on a different trajectory than a user who hits 70% on day twenty-eight — the in-product copy should reflect that. "At this pace you'll hit your cap on the 14th" is more useful than "you are at 70% of your monthly budget." Second, the alert pipeline must be idempotent: if telemetry has flapped and re-crossed the 90% threshold three times in a day, the user should receive one email, not three.
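A sketch of both details, plus the threshold-to-channel table, in TypeScript; the names, the 30-day default, and the key format are illustrative:

```typescript
// Threshold table: which channels fire at which watermark.
const ALERT_THRESHOLDS = [
  { pct: 0.7, channels: ["in-app"] },
  { pct: 0.9, channels: ["in-app", "email"] },
  { pct: 1.0, channels: ["in-app", "email"] }, // behavior then follows the §04 pattern
] as const;

// Pace projection: "at this pace you'll hit your cap on the 14th"
// beats "you are at 70% of your monthly budget."
function projectedCapDay(
  spend: number,
  cap: number,
  dayOfMonth: number,
  daysInMonth = 30,
): number | null {
  const dailyPace = spend / dayOfMonth;
  if (dailyPace <= 0) return null;
  const day = Math.ceil(cap / dailyPace);
  return day <= daysInMonth ? day : null; // null: on pace to finish under the cap
}

// Idempotent sends: one email per user, per threshold, per month. Use this
// as a unique-constraint or SETNX key so a flapping counter can't re-send.
const alertDedupeKey = (userId: string, pct: number, month: string) =>
  `${userId}:${pct}:${month}`;
```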

The team-and-enterprise variant is slightly different. Pooled budgets benefit from a workspace-admin dashboard rather than individual user alerts — heavy users on a Team plan are workspace-level information, not their own concern. Send the 70% and 90% alerts to the workspace admin email; surface individual user spend in the admin dashboard. The pattern keeps the personal chat surface uncluttered while preserving the admin's visibility.

06 · Attribution: Per-feature, per-user — where the spend lands.

A budget without attribution is a number with no story. The telemetry layer that pays for itself fastest is the one that tags every model invocation with the feature surface that made the call, the user who triggered it, and the model tier that handled it. The first time that telemetry surfaces a single feature consuming the majority of inference spend for a small minority of usage, the investment is repaid for the year.

Three tags are the floor: feature_id (which surface in the product made the call), user_id (or a hashed equivalent for analytics-only environments), and model_tier (premium, standard, budget). Add tool_call_count and retrieval_size_kb once the first three are in place — those two surface the second-order cost drivers that don't show up in raw token counts.
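As a sketch, the event shape a tagged invocation might carry; the field names are illustrative and should match your observability platform's schema:

```typescript
// Floor telemetry for every model invocation (field names illustrative).
interface InferenceEvent {
  featureId: string;   // which product surface made the call
  userId: string;      // or a salted hash in analytics-only environments
  modelTier: "premium" | "standard" | "budget";
  inputTokens: number;
  outputTokens: number;
  // Second-order cost drivers, added once the three floor tags are flowing:
  toolCallCount?: number;
  retrievalSizeKb?: number;
}
```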

Inference-spend attribution by feature · representative shape
Source: aggregate from 12 production SaaS deployments, 2025–2026

  • Chat surface (general agent): ~45% of total inference spend
  • Document Q&A (RAG over PDFs): ~25% of inference · 8% of usage events
  • Workflow automation (multi-step agent): ~18% of inference · 3% of usage events
  • Inline suggestions (autocomplete-style): ~7% of inference · 60% of usage events
  • Other (notifications, summarization): ~5% of inference · 29% of usage events

The distribution above is the shape that recurs across deployments. Two patterns to internalize. First, the heaviest feature by usage (inline suggestions, accounting for 60% of usage events) is rarely the heaviest by cost — small per-call tokens at high volume add up to less than long agentic loops at low volume. Second, the second-highest cost feature (Document Q&A at 25%) typically represents the smallest user cohort — power users running RAG workflows that the median user never touches. Both observations argue for different optimization strategies: cap the long tail of heavy workflows, route the high-volume features to budget models.

Per-user attribution is the other half of the equation. Within any given feature, the distribution of spend across users follows the standard power-law shape — top 5% of users by spend account for 50 to 70% of feature-level cost. That observation is what makes the soft-cap-with-degradation pattern in §04 work: degrading the model on the heavy 5% recovers the majority of available margin without touching the median user's experience at all.
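Checking that power-law claim against your own telemetry is a few lines. A sketch, with the 5% fraction as a parameter:

```typescript
// Share of feature spend attributable to the top `fraction` of users by spend.
function topUserShare(spends: number[], fraction = 0.05): number {
  const sorted = [...spends].sort((a, b) => b - a);
  const k = Math.max(1, Math.floor(sorted.length * fraction));
  const total = sorted.reduce((sum, x) => sum + x, 0);
  const top = sorted.slice(0, k).reduce((sum, x) => sum + x, 0);
  return total > 0 ? top / total : 0;
}
// A result of 0.5-0.7 matches the deployments behind this framework.
```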

The telemetry stack we recommend
A self-hosted or managed agent-observability platform — LangFuse and Helicone are both well-suited — is the right home for token-spend telemetry. Tag at the SDK call site, not after the fact in logs. Surface a per-feature, per-user dashboard to product and finance weekly. If the telemetry layer is two clicks deeper than the rest of the product analytics, no one will look at it — which means the optimization passes get done on intuition rather than data.

One trap to avoid: do not let attribution telemetry drive per-customer pricing decisions on individual outliers. The right response to a single user crossing 10× the median spend is to check that the tier's overage pattern is doing its job — not to manually intervene with a per-customer rate. Personalized inference pricing is a sales-engineering anti-pattern; it poisons the tier model and creates conversations that can't be repeated at scale.

07 · Worked Example: A SaaS at 10k MAUs — the spreadsheet.

The framework above is most useful when it runs against a concrete shape. The walkthrough below is the calibrated spreadsheet for a 10k-MAU production SaaS, distributed across the four tiers in the ratios we typically observe: 60% Free, 28% Pro, 10% Team, 2% Enterprise. Numbers rounded for readability; the actual deployment sat within 8% of these in month-on-month aggregate.

The shape at 10k MAUs

Free (6,000 users). 50k-token cap × ~70% actually utilizing the agentic feature = ~210M tokens / month. Pegged to Haiku at ~$0.80/M effective. Monthly inference: ~$170.

Pro (2,800 users). Ceiling 500k × ~60% median utilization = ~840M tokens / month at the Sonnet effective rate of ~$5/M (mixed input/output, multiplier-adjusted). Monthly inference: ~$4,200.

Team (1,000 users in ~160 workspaces). Pool 3M × seats × ~50% utilization = ~1.5B tokens / month at the same Sonnet rate. Monthly inference: ~$7,500.

Enterprise (200 users in ~20 contracts). Contracted ~10M / user / month at a blended ~$4/M (Opus on hard reasoning, Sonnet on general, Haiku on retrieval). Monthly inference: ~$8,000.

Total monthly inference cost: ~$19,900. Total monthly SaaS revenue at typical tier ARPUs: ~$280k. Inference cost as a percentage of revenue: ~7.1% — well within the "feature, not a cost-center" band.
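The same walkthrough as code, so the pricing-page questions below become one-line edits. The user counts, utilization rates, and blended effective $/M rates are the rounded figures above, not provider list prices:

```typescript
// The §07 spreadsheet as code. Rates are multiplier-adjusted effective $/M.
const tiers = [
  { name: "Free",       users: 6_000, tokensPerUser: 50_000 * 0.7,      ratePerM: 0.8 },
  { name: "Pro",        users: 2_800, tokensPerUser: 500_000 * 0.6,     ratePerM: 5.0 },
  { name: "Team",       users: 1_000, tokensPerUser: 3_000_000 * 0.5,   ratePerM: 5.0 },
  { name: "Enterprise", users:   200, tokensPerUser: 10_000_000,        ratePerM: 4.0 },
];

const inferenceCost = tiers.reduce(
  (sum, t) => sum + (t.users * t.tokensPerUser * t.ratePerM) / 1e6,
  0,
); // ~$19,870 / month

const revenue = 280_000;
console.log(`inference-to-revenue: ${((inferenceCost / revenue) * 100).toFixed(1)}%`); // ~7.1%
```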

The 7.1% inference-to-revenue ratio is the headline number. Below 10% on a chat-heavy agentic feature is the band where the unit economics are genuinely healthy — agentic features can be a permanent product line rather than a quarterly margin worry. Above 15% and the framework needs revisiting before the next quarter closes. Above 25% and the feature is funding itself off margin rather than off pricing, which is unsustainable past the first year of growth.

The spreadsheet repays its construction time the first time a pricing-page change is proposed. "What happens to inference cost if we double the Pro-tier ceiling?" is a thirty-second answer with the calculator, a two-week debate without it. The same is true for "what does it cost to ship this feature free for the first month?" or "can we route Enterprise Q&A to Opus without breaking the contract?" These are quarterly questions, and a budget calculator is the artifact that keeps them from becoming arguments.

"Below 10% inference-to-revenue on a chat-heavy agentic feature is the band where the unit economics are genuinely healthy. Above 15%, the framework needs revisiting before the quarter closes."— Calibration band from 12 production SaaS deployments

For teams without an existing chatbot to calibrate against, the shortest path from zero to a calibrated calculator is to ship the simplest possible chat surface, instrument it from day one with the attribution tags from §06, and run it on internal users for two weeks. Two weeks of real per-user behavior produces a better sizing model than any amount of analyst-built scenario planning. The companion guide on building a Next.js 16 AI chatbot from scratch covers the scaffold; layer the attribution tags onto the route handler and you have a real measurement environment by week three. And for the rate-card inputs that drive the formula, the LLM API pricing index for Q2 2026 is the rolling reference we update each quarter.

The shape of agentic unit economics, 2026

Token budgets turn agentic features from cost-center to product line.

The recurring failure mode for SaaS agentic features is not that the model is wrong, the prompts are wrong, or the retrieval is wrong. It is that the feature was shipped without a budget — and by the time the margin pressure shows up in finance dashboards, the user behavior is set and the retrofit is painful. A token budget calculator is the prevention, not the cure.

The framework above is deliberately compact: one formula, four tiers, three overage patterns, three alert thresholds, three floor attribution tags. Every piece earns its place. Skip any of them and the system has a hole — skip the multiplier and you under-size paid tiers; skip the soft-cap and you generate one-star reviews; skip the attribution and you optimize on intuition rather than data. The compactness is the point.

Run the calculator against your own product before the next quarterly planning cycle. Forty-five minutes with a spreadsheet, real per-user telemetry, and current rate cards gets a defensible tier model that survives the next year of token-price changes and user-behavior shifts. Re-run it every quarter. The features that stay profitable are the ones whose unit economics are designed into the product, not retrofitted after the margin pressure shows up.

Engineer your token economics

Agentic features are only sustainable when token budgets are designed into the product.

Our team designs token-budget frameworks for SaaS agentic features — tiered budgets, overage routing, alert thresholds, per-feature attribution — and ships the telemetry that keeps margins predictable.

Free consultation · Expert guidance · Tailored solutions
What we work on

Token-budget engagements

  • Tiered budget design across free / pro / team / enterprise plans
  • Overage routing patterns (hard-cap, soft-cap, pay-as-you-go)
  • Per-feature and per-user cost attribution
  • Alert thresholds and customer-facing dashboards
  • Quarterly tier redesign aligned to provider price changes
FAQ · Token budgets

The questions product teams ask before shipping agentic features at scale.

Which overage pattern should each tier default to?

Soft-cap-with-degradation is the production default for paid tiers — Pro, Team, and most Enterprise contracts. The mechanism is a silent model swap from the premium tier (Sonnet 4.7) to the budget tier (Haiku) once the threshold is hit, so the feature keeps working and margin stays whole. Hard-cap is the right pattern only on the Free tier (where the cap is itself a conversion event) and on compliance-sensitive Enterprise contracts (where unbounded overage is contractually unacceptable). Hard-capping a Pro user without warning earns one-star reviews and accelerates churn; soft-capping the same user is typically invisible on routine turns. Run hard-cap on Free and regulated Enterprise, soft-cap on the middle two tiers, and you cover 90% of the realistic operational surface.