Business · Framework · 14 min read · Published May 12, 2026

Eight KPIs that turn token spend from invisible to predictable — per-task, per-user, per-tenant, per-feature attribution.

Agent Cost Metrics: Per-Task, Per-User Framework 2026

A per-task, per-user, per-tenant, per-feature attribution framework for agentic AI deployments. Eight KPIs, four tier patterns, three overage-routing patterns, alert thresholds, and the dashboard cadence that turns invisible token spend into predictable unit economics.

Digital Applied Team · Senior strategists
Published: May 12, 2026 · Read time: 14 min · Sources: 12 production deployments
  • KPIs tracked: 8 (core cost panel)
  • Attribution levels: 3 (task · user · tenant)
  • Tier patterns: 4 (free / pro / team / enterprise)
  • Dashboard cadence: daily review at scale

Agent cost metrics are the operational backbone of any business shipping agentic features at scale. Without per-task, per-user, per-tenant attribution, token spend stays invisible until the quarterly bill arrives — at which point the user behavior is set, the margin damage is done, and the retrofit is painful. This framework is the prevention.

The pattern that keeps repeating across production deployments is simple. A team ships an agentic feature on a flat seat price, the top 5% of users consume 60% of inference budget, one feature silently absorbs the majority of spend for a minority of usage, and the unit-economics model collapses by month three. Every one of those problems is invisible without the right metrics — and every one is trivial to manage with them.

This guide is the eight-KPI cost panel we ship on AI transformation engagements when an agentic feature moves from beta to general availability. Per-task cost. Per-user cost. Per-tenant attribution. Tier budgets. Overage routing. Dashboard cadence. Each KPI with a formula, a target band, and the production failure pattern it surfaces.

Key takeaways
  1. Per-task cost predicts feature ROI. Tag every model invocation with the feature surface that made the call and the workload class it served. Without that, the question of which agentic features earn their token budget stays unanswerable — and the optimization passes get done on intuition rather than data.
  2. Per-user cost surfaces hot-spot users. The top 5% of users routinely consume 50 to 70% of inference budget. That is not abuse — it is the long tail of curiosity-driven power use. Per-user metrics make the hot-spot pattern visible before the margin damage shows up in the finance dashboard.
  3. Per-tenant attribution unlocks SaaS unit economics. On Team and Enterprise tiers, the unit of economic analysis is the workspace, not the user. Per-tenant rollups answer the questions account managers actually ask — which workspaces are over their pool, which are under, which renewal conversations should include a budget expansion.
  4. Tier budgets prevent margin death. Four tier patterns (free, pro, team, enterprise), each with a hard ceiling and a soft threshold, cap the per-tier blast radius. The combination of tier budgets plus the soft-cap-with-degradation overage pattern recovers the majority of available margin without touching the median user experience.
  5. Overage routing makes cost control product-aware. Hard-cap denies the request. Soft-cap silently swaps the model from premium to budget tier. Pay-as-you-go meters above the ceiling. Pick one per tier — never mix two on the same tier — and the cost-control mechanism becomes a product decision rather than a finance one.

01 · Why Cost Metrics: Invisible token spend is the margin killer for agentic features.

The default state of a newly-shipped agentic feature is invisible. A user opens the chat surface, the agent fans out into three tool calls and a reasoning trace, and the meter spins — but the meter is a single line item on the monthly provider invoice. By the time finance asks why the AI line just doubled, the answer requires archaeology rather than a dashboard query.

The cost panel below replaces archaeology with attribution. Tag every model invocation at the SDK call site with three fields — feature, user, tenant — and the next time the bill spikes, the answer is one dashboard query away. The optimization passes that follow stop being guesswork. The pricing-page conversations stop being arguments.
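
A minimal sketch of that call-site tagging in Python. The InvocationRecord shape, the record_invocation helper, and the response.usage field names are illustrative stand-ins, not any specific SDK's API:

```python
# Minimal sketch of call-site tagging (illustrative names, not a specific SDK).
# Every model invocation emits one telemetry row carrying the three
# attribution fields (feature, user, tenant) plus token counts.
from dataclasses import dataclass, asdict
import time

@dataclass
class InvocationRecord:
    ts: float
    feature: str          # which product surface made the call
    user_id: str          # who triggered it
    tenant_id: str        # which workspace it bills against
    workload_class: str   # "chat" | "agent" | "rag"
    input_tokens: int
    output_tokens: int
    model: str

def record_invocation(sink: list, **fields) -> None:
    """Append one tagged row to the telemetry sink (stand-in for a real pipeline)."""
    sink.append(asdict(InvocationRecord(ts=time.time(), **fields)))

def tagged_completion(client, sink, *, feature, user_id, tenant_id, workload_class, **kwargs):
    """Make the model call through whatever client you already use, then tag it."""
    response = client.complete(**kwargs)               # hypothetical client call
    record_invocation(
        sink,
        feature=feature, user_id=user_id, tenant_id=tenant_id,
        workload_class=workload_class,
        input_tokens=response.usage.input_tokens,       # adjust to your SDK's usage object
        output_tokens=response.usage.output_tokens,
        model=response.model,
    )
    return response
```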

There is a counter-argument worth taking seriously: that cost metrics add operational overhead and the agentic feature is fine without them. The evidence from the deployments behind this framework is the opposite. The telemetry layer pays for itself the first time it identifies a single feature consuming 60% of inference budget for 8% of usage. The pay-back window is typically measured in weeks, not quarters.

The failure pattern we keep seeing
A SaaS ships an agentic feature on a flat seat price without per-feature attribution. Month one looks great. Month two, one feature (typically a RAG-over-documents flow) silently absorbs 60% of inference spend for 8% of usage events. Month three, the CFO asks why the AI line item is the second-largest infrastructure expense. The fix is always the same — retrofit the attribution, identify the cost driver, route it to a budget model. We would rather you shipped the attribution on day one.

Eight KPIs make the panel. Per-task cost (§02). Per-user cost (§03). Per-tenant cost (§04). Tier-budget utilization (§05). Overage rate by tier (§06). Soft-cap degradation rate (§06). Top-feature concentration ratio. Cost-to-revenue ratio at the product level. Each one earns its place — drop any and the panel has a hole. The eight together produce the picture finance, product, and engineering can argue from.

02 · Per-Task Cost: Three workload classes, three formulas.

Per-task cost is the foundational metric. Every model invocation costs a measurable amount of money; the question is which task it served. Three workload classes cover the space — single-turn chat, multi-step agent workflows, and retrieval-augmented turns. Each has a different cost structure and a different optimization lever.

The headline formula is the same one we use for tier sizing in the companion token-budget calculator: tokens = sessions × turns × (input + output × multiplier) — but applied per-task rather than per-user. Tag the workload class at the SDK call site and the panel resolves automatically.
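
A hedged sketch of the formula as code. The $/MTok rates, the 4x output multiplier, and the example token counts are placeholders rather than provider pricing:

```python
# Per-task cost from raw token counts. Rates and the output multiplier below
# are illustrative placeholders, not a provider rate card.

def effective_tokens(input_tokens: int, output_tokens: int, output_multiplier: float = 4.0) -> float:
    """Input-equivalent tokens: output weighted by its price multiple over input."""
    return input_tokens + output_tokens * output_multiplier

def per_task_cost_usd(input_tokens: int, output_tokens: int,
                      input_rate_per_mtok: float = 3.0,
                      output_rate_per_mtok: float = 12.0) -> float:
    """Dollar cost of one task, given per-million-token rates."""
    return (input_tokens * input_rate_per_mtok + output_tokens * output_rate_per_mtok) / 1_000_000

# Worked example: a single-turn chat with 800 input and 250 output tokens.
print(effective_tokens(800, 250))    # 1800.0 -> inside the 800-2,000 band for single-turn chat
print(per_task_cost_usd(800, 250))   # 0.0054
```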

Single-turn chat · formula: input + output × multiplier

The simplest workload class. One user message, one model response, no tool calls. Cost is dominated by the output multiplier — typically 3 to 5× the input rate on Sonnet-class models. Target band: 800 to 2,000 effective tokens per turn. Above 5,000 and the prompt is likely over-stuffed; below 400 and the context budget is probably starving the response.

cost driver: output multiplier
Multi-step agent · formula: (input + output × multiplier) × tool_call_factor

Agentic workflows with tool calls and reasoning traces. Cost is dominated by the tool-call factor — typically 1.5 to 2.5× per user-visible turn, sometimes higher for deep-research loops. Target band: 8,000 to 20,000 effective tokens per task. Above 40,000 and the agent is likely looping; below 5,000 and the agent is probably under-using its tools.

cost driver: tool-call depth
RAG turn · formula: (input + retrieved_context + output × multiplier)

Retrieval-augmented turns. Cost is dominated by retrieved context — typically 3 to 5× the input token count of unaugmented turns. Target band: 4,000 to 12,000 effective tokens per turn. Above 25,000 and the retrieval is over-fetching; the chunk-size and top-k decisions are the primary optimization lever, not the model choice.

cost driver: retrieval size

The reason to split workload classes is that the optimization lever for each is different. Chat-heavy features benefit most from model routing (Sonnet to Haiku on routine turns); agent workflows benefit most from tool-call depth caps and reasoning-trace budgets; RAG turns benefit most from chunk-size and retrieval-quality work. A single per-task average masks all three patterns.

One trap to avoid: do not normalize per-task cost against user-visible turns alone. A multi-step agent may consume the model five times for a single user turn. The denominator that matters is the count of model invocations, not the count of UI interactions — otherwise the numbers undercount the most expensive workload class by a factor of two or three.
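
A minimal sketch of that grouping, assuming telemetry rows that carry an illustrative task_id, workload_class, and cost_usd field:

```python
from collections import defaultdict

def per_task_rollup(rows: list[dict]) -> dict[str, dict]:
    """Collapse raw invocation rows into one entry per task_id.

    A multi-step agent task with five model invocations contributes five
    rows to `rows` but one entry here: the denominator is invocations when
    you care about provider spend, task_id when you care about user-visible value.
    """
    tasks: dict[str, dict] = defaultdict(lambda: {"invocations": 0, "cost_usd": 0.0})
    for r in rows:
        t = tasks[r["task_id"]]
        t["invocations"] += 1
        t["cost_usd"] += r["cost_usd"]
        t["workload_class"] = r["workload_class"]
    return dict(tasks)

rows = [
    {"task_id": "t1", "workload_class": "agent", "cost_usd": 0.004},
    {"task_id": "t1", "workload_class": "agent", "cost_usd": 0.006},  # same user turn, second model call
    {"task_id": "t2", "workload_class": "chat",  "cost_usd": 0.002},
]
print(per_task_rollup(rows))
# {'t1': {'invocations': 2, 'cost_usd': ~0.01, 'workload_class': 'agent'}, 't2': {...}}
```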

"A single per-task average masks the pattern. Split the workload classes and the optimization lever for each becomes obvious."— Internal playbook for agentic cost panels

03 · Per-User Cost: Three aggregations, one hot-spot signal.

Per-user cost is the second metric the panel surfaces. The distribution of spend across users follows the standard power-law shape — top 5% account for 50 to 70% of feature-level cost, median user is within an order of magnitude of the bottom decile. Without per-user telemetry, that distribution stays invisible and the soft-cap pattern in §06 has nothing to act on.

Three aggregations are the floor. Total monthly spend per user. Spend trajectory (today vs the trailing 30-day median). Top-decile concentration ratio (what fraction of spend is concentrated in the top 10% of users). Together they answer the operational questions account managers and finance both ask — who are the heavy users, are they trending up, and how skewed is the distribution.
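
A sketch of the three floor aggregations over raw invocation rows. The field names (user_id, cost_usd, day) and the trailing-median convention are assumptions, not a prescribed schema:

```python
from collections import defaultdict
from statistics import median

def per_user_panel(rows: list[dict]) -> dict:
    """Three floor aggregations from per-invocation telemetry rows.

    Rows are assumed to carry user_id, cost_usd, and a day index (oldest first);
    field names are illustrative.
    """
    monthly: dict[str, float] = defaultdict(float)
    daily: dict[str, dict[int, float]] = defaultdict(lambda: defaultdict(float))
    for r in rows:
        monthly[r["user_id"]] += r["cost_usd"]
        daily[r["user_id"]][r["day"]] += r["cost_usd"]

    # Trajectory: latest day's spend vs that user's trailing-30-day median.
    trajectory: dict[str, float] = {}
    for uid, by_day in daily.items():
        days = sorted(by_day)
        latest = by_day[days[-1]]
        trailing = [by_day[d] for d in days[:-1]][-30:]
        baseline = median(trailing) if trailing else latest
        trajectory[uid] = latest / baseline if baseline else float("inf")

    # Top-decile concentration: share of total spend held by the top 10% of users.
    spends = sorted(monthly.values(), reverse=True)
    top_n = max(1, len(spends) // 10)
    concentration = sum(spends[:top_n]) / sum(spends) if spends else 0.0

    return {"monthly_spend": dict(monthly),
            "trajectory_vs_trailing_median": trajectory,
            "top_decile_concentration": concentration}
```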

Median user — seat economics

The median paid-tier user spends 0.5 to 1.5% of their monthly seat price in raw inference cost on a healthy agentic feature. Above 3% and the seat price is funding the feature off margin; below 0.2% and the feature is under-used at the median (a product engagement problem, not a cost problem). This metric anchors the unit-economics conversation.

Target: 0.5–1.5% of seat
Top-decile user — soft-cap signal

The 90th-percentile user is the soft-cap trigger. On a well-sized tier, this user is approaching but not yet at the soft threshold from §05. Above 80% of the soft-cap, the in-product alerts should already be firing; above the soft-cap, the degradation pattern in §06 should be active. This metric calibrates tier ceilings.

Target: 60–80% of soft-cap
Top-percentile user — blast radius

The 99th-percentile user is the blast-radius check. On a properly-bounded tier, this user has either hit the hard ceiling and converted (best case) or hit the soft threshold and degraded (default case). If the 99th-percentile user is still consuming linearly past the ceiling, the overage pattern is not doing its job — and the long tail is silently funding itself off margin.

Cap enforcement check
New vs returning — cohort drift

Per-user cost segmented by signup cohort. Newer cohorts typically run 20 to 40% higher per-user cost than year-old cohorts — the product is more capable, the prompts are richer, the feature gets used more. Without cohort segmentation, the year-over-year comparison conflates capability growth with usage growth and the planning model goes sideways.

Cohort segmentation
The hot-spot signal that pays for the panel
The single most valuable view in a per-user cost dashboard is the top ten by 7-day spend, sorted descending, with the trajectory column showing day-over-day delta. The first time that view identifies a single user spending 30× the median — typically a power user running a multi-step agent workflow on a free tier with no ceiling — the telemetry investment is repaid for the year.

One operational note. Per-user cost telemetry must respect the customer-data boundary your product already enforces. For consumer products with strong privacy commitments, store the user_id as a hashed equivalent in the telemetry pipeline and keep the linking table inside the application database, not in the analytics warehouse. The aggregations work the same; the privacy posture stays intact.
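
One way to implement that, as a sketch. The HMAC-SHA256 choice and the TELEMETRY_SALT constant are assumptions; the salt and the hash-to-user linking table stay inside the application database:

```python
import hashlib
import hmac

# Keep the salt and the hash->user_id linking table inside the application
# database; only the hashed identifier travels to the analytics warehouse.
TELEMETRY_SALT = b"rotate-me-and-store-app-side"   # illustrative; load from your secret store

def telemetry_user_id(user_id: str) -> str:
    """Stable pseudonymous identifier for the analytics pipeline."""
    return hmac.new(TELEMETRY_SALT, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

print(telemetry_user_id("user_8321"))  # same input always maps to the same opaque key
```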

04 · Per-Tenant Attribution: The SaaS unit of economic analysis.

For B2B SaaS, the user is not the unit of economic analysis — the workspace, account, or tenant is. A Team plan with twenty seats renews against workspace-level value, not against the heaviest individual user inside it. Per-tenant cost rollups answer the questions account managers actually ask, and they make the renewal-conversation math defensible rather than anecdotal.

The pattern that recurs across deployments: pool the per-user telemetry from §03 by tenant_id at the warehouse level, surface tenant-level dashboards to account managers and finance, and gate tenant-level overage alerts on workspace-admin email rather than individual user inbox. The mechanic is straightforward; the organizational discipline is harder than the engineering.
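
A warehouse-style rollup sketched in plain Python. The row shape, the contracted-pool and ARR lookups, and the annualize-monthly-spend convention for cost-to-ARR are illustrative assumptions:

```python
from collections import defaultdict

def per_tenant_rollup(rows: list[dict], contracted_pool_tokens: dict[str, int],
                      tenant_arr_usd: dict[str, float]) -> dict[str, dict]:
    """Roll per-user telemetry up to the tenant level.

    Rows carry tenant_id, user_id, tokens, cost_usd; the contracted-pool and
    ARR lookups stand in for billing-system data.
    """
    out: dict[str, dict] = defaultdict(lambda: {"spend_usd": 0.0, "tokens": 0,
                                                "by_user": defaultdict(float)})
    for r in rows:
        t = out[r["tenant_id"]]
        t["spend_usd"] += r["cost_usd"]
        t["tokens"] += r["tokens"]
        t["by_user"][r["user_id"]] += r["cost_usd"]

    for tenant_id, t in out.items():
        pool = contracted_pool_tokens.get(tenant_id)
        t["pool_utilization"] = t["tokens"] / pool if pool else None
        arr = tenant_arr_usd.get(tenant_id)
        # One reasonable convention: annualize this month's spend against ARR.
        t["cost_to_arr"] = (t["spend_usd"] * 12) / arr if arr else None
        spends = sorted(t["by_user"].values(), reverse=True)
        top_n = max(1, len(spends) * 5 // 100)   # top 5% of users inside the workspace
        t["top_user_concentration"] = sum(spends[:top_n]) / t["spend_usd"] if t["spend_usd"] else 0.0
        t["by_user"] = dict(t["by_user"])
    return dict(out)
```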

Tenant spend · total monthly spend per tenant

The raw aggregation. Sum of all model invocations attributed to the workspace. The dashboard view sorts descending; the top 20 workspaces typically account for 60 to 80% of total tenant spend on the Team and Enterprise tiers combined. This is the view account managers and finance share once a month.

monthly · descending
Pool utilization · utilization vs contracted pool

For Team and Enterprise tiers with pooled budgets, the utilization metric is what catches the renewal conversation. Workspaces consistently over 90% pool utilization are upgrade candidates; consistently under 30% are right-sizing candidates. Quarterly review of the utilization-by-workspace distribution feeds the next contract cycle.

team · enterprise
Concentration · top-user concentration per tenant

The fraction of tenant-level spend concentrated in the top 5% of users within the workspace. High concentration (over 70%) signals workspace-level governance issues — a small handful of heavy users on a pooled budget. Workspace admins can act on this dashboard view directly; pooled-budget governance is most often a workspace-internal conversation.

internal governance
Cost-to-ARR · tenant cost as % of tenant ARR

The unit-economics check at the tenant level. Healthy band: 3 to 8% of tenant ARR. Above 12% and the workspace is funding the feature off margin; above 20% and the contract needs revisiting at renewal. Per-tenant cost-to-ARR is the single most actionable number for sales and customer-success teams operating against agentic features.

renewal signal

The cost-to-ARR metric deserves the most operational attention, because it is the one that translates directly into commercial decisions. A workspace running at 15% cost-to-ARR is not a crisis, but it is a quarterly check-in. A workspace running at 25% is a scheduled conversation at renewal — either expanding the contract value to absorb the actual usage, capping the agentic feature with tighter overage governance, or routing the workspace's heaviest workload to a budget model under contract amendment.

One trap to avoid: do not let per-tenant cost rollups drive individual-customer pricing rates. The right response to a single workspace running heavy is to renegotiate the contracted pool at the next renewal cycle — not to issue a one-off rate concession or an off-list overage rate mid-quarter. Personalized inference pricing is a sales-engineering anti-pattern that poisons the tier model and creates conversations that cannot be repeated at scale.

The renewal-conversation pattern
The most valuable artifact a per-tenant cost panel produces is the renewal-prep one-pager — workspace name, contracted pool, actual 12-month consumption, projected next-year consumption based on seat growth and feature-adoption trajectory, recommended pool size for the next contract. Customer-success teams report that running the renewal conversation against this artifact shortens the cycle by weeks and removes the awkward mid-quarter overage conversations almost entirely.

05 · Tier Budgets: Four tiers, budget bands calibrated to each plan.

Four tier patterns cover the space of most SaaS products with an agentic feature: free, pro, team, enterprise. Each carries a different budget shape — different hard ceiling, different soft threshold, different model-tier policy, different overage pattern. The combination is what makes the tier model defensible against real user behavior rather than against an idealized average.

The bands below are calibrated to the production deployments behind this framework. Absolute numbers vary by product and workload class; the relative shape — order-of-magnitude jumps between tiers, soft-threshold at 70% of hard-cap, model-tier policy that escalates with the plan — is consistent across successful deployments.

Tier budget bands · relative shape across plan levels (source: aggregate from 12 production SaaS deployments, 2025–2026)
  • Free tier · acquisition band: hard-cap 50k tokens · Haiku-tier model · hard-cap overage pattern
  • Pro tier · median-user band: hard-cap 500k · soft-cap 350k · Sonnet→Haiku degradation
  • Team tier · pooled workspace band: pool 3M tokens × seats · Sonnet-tier · pay-as-you-go overage
  • Enterprise tier · contracted band: contracted pool · custom routing · contracted overage rate

The free-tier band is the hardest calibration decision. Too tight and the feature feels broken to evaluators; too loose and free users drain the budget that paid users are funding. The 50k ceiling on a Haiku-tier model is the floor we see working in practice — it funds roughly a week of real use, enough for an evaluator to genuinely see the product's value, not enough for indefinite daily chat-heavy use. Tighter than 25k and the feature feels constrained; looser than 100k and the free-tier conversion math stops working.

The pro-tier band is sized against the median-user formula from §02 with a 3× headroom multiple. If the median pro user generates roughly 160k effective tokens per month in real usage, the 500k hard-cap accommodates roughly 3× the median — which catches the 90th-percentile user without throttling them. The soft-threshold at 350k is the alert trigger; the degradation pattern in §06 takes over from there.
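
The sizing arithmetic as a tiny sketch, using the 3x headroom multiple and the 70% soft fraction described above; rounding up to the published 500k / 350k figures is a judgment call, not part of the formula:

```python
# Tier-ceiling sizing from the median-user number.

def size_tier(median_monthly_tokens: int, headroom: float = 3.0,
              soft_fraction: float = 0.7) -> tuple[float, float]:
    """Return (hard_cap, soft_threshold) in tokens for a paid tier."""
    hard_cap = median_monthly_tokens * headroom
    return hard_cap, hard_cap * soft_fraction

print(size_tier(160_000))  # (480000.0, 336000.0) -> published as the round 500k / 350k bands
```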

The quarterly tier-redesign rail
Token prices drop year-over-year. Sonnet-class rates fell roughly 40% across 2025 and a similar trajectory through 2026 is likely. Without a quarterly tier-redesign cadence, last year's ceilings silently become this year's floors — and the product is either passing none of the savings to customers (a competitive risk) or pocketing all of them (a margin-vs-growth tradeoff that should be a conscious decision, not a default). Forty-five minutes per quarter, high return on the time. Schedule it on the planning calendar before the quarter starts, not after the bill arrives.

One operational detail worth surfacing. The tier-budget metric on the dashboard should display two numbers — actual utilization and projected utilization at month-end based on current pace. A user at 70% utilization on day three of the month is on a different trajectory than a user at 70% on day twenty-eight; the in-product alert copy and the dashboard view should reflect the pace, not just the absolute counter. "Projected to hit cap on the 14th" is more actionable than "at 70% of monthly budget".
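
A sketch of that pace projection. Dividing month-to-date spend by the current day of the month to get a daily rate is an assumed convention, and edge cases such as mid-cycle plan changes are ignored:

```python
import calendar
import datetime as dt
import math

def pace_projection(spent_tokens: int, hard_cap: int, today: dt.date) -> dict:
    """Actual vs projected utilization, plus the day the cap is crossed at current pace."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = spent_tokens / today.day
    projected = daily_rate * days_in_month
    cap_day = math.ceil(hard_cap / daily_rate) if daily_rate else None
    return {
        "utilization_now": spent_tokens / hard_cap,
        "projected_utilization": projected / hard_cap,
        "projected_cap_date": today.replace(day=cap_day)
                              if cap_day and cap_day <= days_in_month else None,
    }

# 70% of a 500k cap means very different things on day 3 and on day 28.
print(pace_projection(350_000, 500_000, dt.date(2026, 5, 3)))   # cap projected for May 5
print(pace_projection(350_000, 500_000, dt.date(2026, 5, 28)))  # projected to finish the month under cap
```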

06 · Overage Routing: Hard-cap, soft-cap, pay-as-you-go — one per tier.

Three patterns cover the space of what to do when a user or workspace hits the budget ceiling. Pick one per tier; never mix two patterns on the same tier — the customer will not understand which one applies, and the support burden of explaining a mixed-mode overage rule outweighs any flexibility it adds. The production defaults that recur across deployments: hard-cap on free, soft-cap-with-degradation on pro, pay-as-you-go on team and enterprise.

The companion token-budget calculator post walks through the mechanics of each pattern in detail. The metric this section surfaces is the routing rate itself — what percentage of traffic on each tier is hitting which overage pattern — because the routing rate is the primary feedback signal on whether the tier ceilings are correctly calibrated.
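
A sketch of the per-tier routing decision. The TIER_POLICY values mirror the bands from §05 and the model names are placeholders, not a recommendation:

```python
# One routing pattern per tier. The tier-to-pattern mapping and the numbers
# below mirror the defaults in this guide; model names are placeholders.

TIER_POLICY = {
    "free":       {"pattern": "hard_cap", "cap": 50_000, "model": "haiku-class"},
    "pro":        {"pattern": "soft_cap", "cap": 500_000, "soft": 350_000,
                   "premium_model": "sonnet-class", "budget_model": "haiku-class"},
    "team":       {"pattern": "payg", "model": "sonnet-class"},
    "enterprise": {"pattern": "payg", "model": "contract-routed"},
}

def route_request(tier: str, used_tokens: int, pool_tokens: int | None = None) -> dict:
    """Decide how the next request is served, given month-to-date consumption."""
    p = TIER_POLICY[tier]
    if p["pattern"] == "hard_cap":
        if used_tokens >= p["cap"]:
            return {"allow": False, "reason": "hard_cap", "show_upgrade": True}
        return {"allow": True, "model": p["model"]}
    if p["pattern"] == "soft_cap":
        if used_tokens >= p["cap"]:
            return {"allow": False, "reason": "hard_cap", "show_upgrade": True}
        if used_tokens >= p["soft"]:
            return {"allow": True, "model": p["budget_model"], "degraded": True}
        return {"allow": True, "model": p["premium_model"]}
    # Pay-as-you-go: always serve; meter consumption above the pooled budget.
    over = max(0, used_tokens - pool_tokens) if pool_tokens is not None else 0
    return {"allow": True, "model": p["model"], "metered_overage_tokens": over}

print(route_request("pro", used_tokens=360_000))   # degrades to the budget model
print(route_request("free", used_tokens=52_000))   # denied, upgrade prompt
```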

Hard-cap rate · free-tier hard-cap activation rate

Percentage of free-tier users who hit the hard ceiling in a given month. Healthy band: 5 to 15% — those are the conversion-ready users. Below 5% and the free tier is too generous (free is doing the work that pro should be doing); above 25% and the free tier is too restrictive (the feature feels broken to evaluators).

conversion signal
Soft-cap rate · pro-tier soft-cap degradation rate

Percentage of pro-tier traffic served by the budget model after the soft threshold has been crossed. Healthy band: 10 to 20% — the long-tail recovery the soft-cap pattern is designed for. Above 30% and the pro-tier ceiling is too tight (median users are degrading); below 5% and the soft-cap is doing no work.

margin recovery
PAYG rate · team-tier pay-as-you-go overage rate

Percentage of workspaces whose monthly bill includes a pay-as-you-go overage line. Healthy band: 5 to 15% — those are the expansion conversations. Above 25% and the pooled budget per seat is under-sized for the typical workspace; below 3% and the overage option is not being exercised, which usually means the pool is over-sized.

expansion signal

Two operational details on the overage metrics. First, the routing rates should be displayed alongside the absolute spend view, not separately — the question "is our routing healthy" is meaningless without the context of how much spend is moving through each pattern. Second, the routing rate per tier should trigger an alert at the daily level if it crosses the band — a free-tier hard-cap rate above 25% for three consecutive days is a tier-design problem, not a transient anomaly, and the dashboard should flag it before the support queue does.
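
A sketch of that consecutive-day band alert. The band edges come from the "too low" and "too high" thresholds in the cards above; the three-day run length comes from this paragraph:

```python
# Routing-rate alert: flag a tier when its daily overage-routing rate sits
# outside the healthy band for three consecutive days. Band edges below are
# the illustrative thresholds from this section's cards.

BANDS = {"free_hard_cap": (0.05, 0.25), "pro_soft_cap": (0.05, 0.30), "team_payg": (0.03, 0.25)}

def routing_alerts(daily_rates: dict[str, list[float]], run: int = 3) -> list[str]:
    """daily_rates maps a tier/pattern key to its recent daily rates, oldest first."""
    alerts = []
    for key, rates in daily_rates.items():
        lo, hi = BANDS[key]
        recent = rates[-run:]
        if len(recent) == run and all(r < lo or r > hi for r in recent):
            alerts.append(f"{key}: outside the {lo:.0%}-{hi:.0%} band for {run} consecutive days")
    return alerts

print(routing_alerts({"free_hard_cap": [0.18, 0.27, 0.29, 0.31],
                      "pro_soft_cap": [0.12, 0.14, 0.13, 0.15],
                      "team_payg": [0.06, 0.05, 0.07, 0.06]}))
# ['free_hard_cap: outside the 5%-25% band for 3 consecutive days']
```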

One additional metric worth surfacing on the panel: the conversion rate from each overage event to a tier upgrade. Hard-cap activations should convert to upgrades at 8 to 15% within thirty days on a well-designed free tier; soft-cap-degradation events should generate upgrade-to-team conversions at 5 to 10% on pro; PAYG overage at the workspace level should generate pool-expansion renewals at 30 to 50% at the next contract cycle. The conversion rates are the proof that the overage patterns are doing the commercial work they were designed for.

"Mix two overage patterns on the same tier and the customer never knows which one applies. The support burden outweighs any flexibility it adds."— Production tier-design lessons, 2026

07 · Dashboard Cadence: Daily review at scale — three rhythms, three audiences.

A cost panel without a review cadence becomes a dashboard nobody opens. Three rhythms cover the audiences that need different views of the same telemetry: a daily engineering rhythm focused on anomalies, a weekly product-and-finance rhythm focused on distribution and trajectory, and a quarterly strategic rhythm focused on tier redesign and contract cycles.

The daily rhythm is the load-bearing one. At scale, the cost-anomaly signal — a single user, feature, or workspace spending 5× their trailing-7-day average — is most useful when it is investigated within hours rather than days. The daily review is a fifteen-minute scan of the anomaly view by an on-call engineer; most days nothing is actionable, but the days something is actionable are the days the panel earns its keep.
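
A sketch of the anomaly scan. It baselines against the trailing-7-day median (one reasonable reading of the trailing-7-day average) and uses illustrative entity keys:

```python
from statistics import median

def anomaly_scan(daily_spend: dict[str, list[float]], factor: float = 5.0) -> list[tuple[str, float]]:
    """Flag any entity (user, feature, or workspace) whose latest daily spend is
    `factor`x its trailing-7-day median. daily_spend maps an entity key to its
    daily spend series, oldest first; keys and shapes are illustrative."""
    flagged = []
    for key, series in daily_spend.items():
        if len(series) < 8:
            continue                              # not enough history to baseline
        today, trailing = series[-1], series[-8:-1]
        baseline = median(trailing)
        if baseline > 0 and today >= factor * baseline:
            flagged.append((key, today / baseline))
    return sorted(flagged, key=lambda kv: kv[1], reverse=True)

spend = {"user:4821": [0.8, 1.1, 0.9, 1.0, 1.2, 0.9, 1.0, 7.4],
         "feature:rag_search": [40, 38, 41, 39, 42, 40, 41, 43]}
print(anomaly_scan(spend))   # [('user:4821', 7.4)]
```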

Daily — anomaly scan

Fifteen-minute engineering review. Anomaly view sorted by deviation from trailing-7-day median. Top 10 users by 24-hour spend. Top 5 features by 24-hour spend. Any tier with overage-routing rate outside the band. Most days nothing actionable; the days something is actionable, the response is within hours rather than at end-of-week.

On-call engineering
Weekly — distribution review

Sixty-minute product-plus-finance review. Per-tier utilization distributions. Per-feature attribution shifts week-over-week. Top-decile and top-percentile user trajectories. Tenant cost-to-ARR rollups. The view that feeds the in-quarter pricing and tier-policy conversations.

Product + finance weekly
Quarterly — tier redesign

Forty-five-minute leadership review. Current provider rate cards. Year-over-year per-tier usage growth. Attribution shifts (which features are taking more or less share of total spend). Any new model tiers worth routing specific workloads to. Tier-budget ceilings and soft thresholds recalibrated against the new rate cards.

Leadership quarterly
On-demand — renewal prep

The customer-success-driven view. Per-tenant one-pagers for any workspace approaching renewal. Twelve-month actual consumption. Projected next-year consumption based on seat-growth and feature-adoption trajectory. Recommended pool size for the next contract. The view that shortens renewal cycles by weeks.

Customer success
The dashboard discipline that keeps the panel alive
A cost dashboard two clicks deeper than the rest of the product analytics gets used twice and abandoned. The home for this panel is the same place product and engineering already look — alongside the conversion funnel, the engagement metrics, the reliability dashboards. If the cost panel is part of the same wall as the rest of the product metrics, the optimization passes get done on data. If it lives in a vendor portal that requires a separate login, the optimization passes get done on intuition.

The telemetry stack that hosts the panel matters less than the review cadence that wraps it. Most agent-observability platforms — LangFuse, Helicone, the stacks we covered in the observability-stack TCO calculator — produce the eight KPIs out of the box once the SDK call sites are tagged. The choice between platforms is a tradeoff on ingestion cost, self-hosting effort, and dashboard ergonomics. The choice that matters for cost outcomes is whether the daily, weekly, and quarterly rhythms are scheduled on calendars and attended by the right audiences.

One trap to avoid: do not let the daily rhythm devolve into a cost-only firefighting view. The anomaly signal is most valuable when it is read alongside the engagement signal — a user spending 10× the median for a week is a hot-spot to investigate, but the same user generating a 30× retention lift on the feature is a customer to protect, not a customer to throttle. The cost panel does not replace product judgment; it gives product judgment a real argument to work from.

The shape of agentic cost control, 2026

Cost metrics turn agent features from cost-center to product line.

The recurring failure mode for SaaS agentic features is not that the model is wrong, the prompts are wrong, or the retrieval is wrong. It is that the feature was shipped without cost attribution — and by the time margin pressure shows up in finance dashboards, the user behavior is set and the retrofit is painful. An eight-KPI cost panel is the prevention, not the cure.

The framework above is deliberately compact: three attribution levels (per-task, per-user, per-tenant), four tier patterns (free, pro, team, enterprise), three overage-routing patterns (hard-cap, soft-cap, pay-as-you-go), three review cadences (daily, weekly, quarterly). Every piece earns its place. Skip per-task attribution and the optimization passes get done on intuition; skip per-user telemetry and the soft-cap pattern has nothing to act on; skip the quarterly rhythm and last year's ceilings become this year's margin drift.

Build the panel against your own product before the next quarterly planning cycle. Three days of telemetry instrumentation at the SDK call sites, a week of warehouse work to roll up the aggregations, and a fortnight of dashboard iteration produces a defensible cost panel that survives the next year of token-price changes and user-behavior shifts. The features that stay profitable are the ones whose cost attribution is designed into the product, not retrofitted after the margin pressure shows up.

Make agent costs predictable

Agent features become product lines when costs are predictable.

Our team designs production cost-control panels — per-task, per-user, per-tenant, per-feature — with tier budgets and overage routing.

Free consultation · Expert guidance · Tailored solutions
What we work on

Cost-panel engagements

  • 8-KPI cost panel design
  • Per-task / per-user / per-tenant attribution
  • Tier budget design
  • Overage routing patterns
  • Daily review cadence at scale
FAQ · Agent cost metrics

The questions teams ask before margins erode.

How do we measure per-task cost when one user-visible task makes several model calls?

Tag every model invocation at the SDK call site with a task_id that groups all invocations belonging to the same user-visible task — the chat turn, the agent workflow, the RAG query. The per-task cost metric then aggregates over task_id rather than over individual invocations. A multi-step agent that consumes the model five times for a single user turn produces five rows in the telemetry but one row in the per-task view. The grouping is the difference between a number that reflects user value (per-task) and a number that reflects the raw provider invoice (per-invocation). Both views matter on the panel; the per-task view is the one that anchors the optimization conversation, because it surfaces the workload classes — single-turn, multi-step agent, RAG — that each have different optimization levers.