Agentic AI scale templates are the operational artefacts that turn a working pilot into a platform without breaking the unit economics. By Stage 9 the prototype has shipped, governance is in place, and a small team is running production traffic — yet scale-out almost always exposes that the numbers which worked at pilot were never engineered to multiply. This stage is where that mismatch gets fixed deliberately, not retrofitted under margin pressure.
What changes at Stage 9 is not the technology. The model stack, the orchestrator, the prompts and the retrieval layer all carry forward intact. What changes is the operating envelope. Tokens consumed across ten workloads do not behave like ten times one workload. Ops headcount does not scale with users — it scales with incident surface area and workload heterogeneity. And the single biggest cost driver is no longer the model price, it is the absence of a gate that stops a curiosity-driven workload from silently consuming the budget allocated to three other workloads.
This playbook is the Stage 9 contribution to the broader AI transformation engagement sequence — a scale-out plan, a capacity model, a cost-control gate framework, the fan-out architectural pattern, a four-tier maturity ladder, the three anti-patterns that break scale, and the hand-off into Stage 10 continuous improvement. Every section includes the template shape teams reuse rather than rebuild.
01 Readiness assessment · 02 Strategy & roadmap · 03 Data foundation · 04 Vendor selection · 05 Prototype · 06 Production deploy · 07 Team enablement · 08 Governance · 09 Scale (this playbook) · 10 Continuous improvement.
Stage 9 inherits a production-ready agent (Stage 6), a trained ops team (Stage 7), and an enforceable governance charter (Stage 8). It hands off into Stage 10 — the retrospectives, KPI dashboards and quarterly iteration loops that keep the platform improving rather than drifting.
- 01 · Pilot economics break at scale — assume the math is wrong. A pilot supporting one workload at one thousand users almost never multiplies cleanly to ten workloads at ten thousand. Token costs accumulate non-linearly across heterogeneous workloads, ops surface area compounds, and the long-tail user behaviour that was invisible at pilot becomes the dominant cost line. Treat the existing numbers as input to the capacity model, not as a forecast.
- 02 · The capacity model must include ops headcount, not just compute and tokens. Most scale plans size compute and inference spend correctly and then under-staff the human side by half. Ops headcount scales with incident surface area and workload heterogeneity — adding workloads is what drives the curve, not adding users. Build the capacity model with three explicit budgets so the staffing line cannot quietly disappear.
- 03 · Cost gates prevent margin death — design them before scale-out, not after. Tier budgets, overage routing, and alert thresholds are not retrofit-friendly. A single workload with no ceiling can consume the inference budget of the next three workloads inside a month. Ship the three gate categories alongside the first new workload, not after the second margin-pressure conversation with finance.
- 04 · Fan-out beats monolithic — route across workloads, do not couple them. The most common scale anti-pattern is the monolithic agent that grows tool-by-tool until it owns every workload. Fan-out routing — one entry-point dispatcher and N narrowly-scoped specialist agents — beats the monolith on cost, on reliability, on observability, and on the speed at which the team can ship a new workload without regressing the existing ones.
- 05 · The platform tier requires explicit re-architecture, not extrapolation. Moving from the expansion tier to the platform tier is the second largest architectural step in the whole 10-stage pipeline (the largest being prototype-to-production). It is not an extension of the existing system — it is a redesign that introduces multi-tenant isolation, workload-level SLAs, dedicated capacity pools and a separate ops rotation. Budget for it as a project, not a milestone.
01 — Why Stage 9 · Scale-out breaks economics that worked at pilot.
The single most common cause of a failed Stage 9 is the assumption that the pilot numbers multiply. They almost never do, and the failure mode is consistent across the engagements behind this playbook. A pilot serving one workload at modest user counts has three properties that quietly disappear during scale-out: homogeneous traffic, a single point of operational attention, and a long tail of usage that has not yet appeared. Each of those three reverts toward the mean as the platform grows, and each reversion bends the cost curve upward.
Token consumption is the most visible. A second workload introduced at Stage 9 rarely behaves like the first. Its average input length, its tool-call factor, its retry rate and its output multiplier are all workload-specific, and the formula from the companion token budget calculator framework has to be re-run per workload. Teams that average the new workload into the old budget invariably under-size by 30–60% in the first quarter after scale-out.
Operational surface area is the more dangerous variable, because it is invisible until an incident exposes it. A pilot has one on-call rotation covering one agent, one prompt set, one retrieval index. At three workloads, the rotation is covering three differently-shaped failure modes — different prompts, different tools, different SLA expectations from different internal stakeholders. Without an explicit re-staffing decision, the team simply absorbs the additional load until the first major incident, which is also the first incident report management has seen since the pilot launched.
There is a counter-argument worth taking seriously: that imposing Stage 9 discipline slows scale-out, and that the slowdown is itself an opportunity cost. The evidence from the engagements behind this playbook is the opposite. The teams that shipped Stage 9 templates alongside their first scale-out workload added their next three workloads roughly 40% faster than the teams that retrofitted. The capacity model, the cost gates and the fan-out pattern are leverage, not friction — they reduce the marginal cost of adding the next workload, which is exactly the cost that matters during scale.
02 — Plan · Workload-by-workload expansion path.
The scale-out plan is the artefact that converts "we should do more with agents" into a sequenced, capacity-budgeted roadmap. The template is deliberately simple. One workload per row, one quarter per column, three commitment levels (committed / planned / candidate), and a per-row capacity estimate that feeds the model in §03. The discipline is that no workload moves from candidate to committed until the capacity estimate is signed off by both engineering and finance.
Below is the template shape teams reuse. Drop it into the existing planning surface (Linear, Notion, sheet — the medium is irrelevant), populate it with the real workload backlog, and use it as the single artefact every Stage 9 quarterly review opens with.
| Workload | Owner | Status | Users | Sessions/user/month | Tokens/user/month | Ops hours/week | Model tier | Cost gate | Ship quarter |
|---|---|---|---|---|---|---|---|---|---|
| W-01 · Customer support chat | Sarah | committed | 4,200 | 18 | 520k | 6 | Sonnet→Haiku | soft-cap | Q2-2026 |
| W-02 · Internal Q&A over docs | Marcus | committed | 1,100 | 42 | 980k | 4 | Sonnet | soft-cap | Q2-2026 |
| W-03 · Sales-call summariser | Priya | planned | 280 | 60 | 1.4M | 2 | Sonnet | pay-as-you-go | Q3-2026 |
| W-04 · Code review assistant | Tom | planned | 95 | 30 | 2.2M | 3 | Opus→Sonnet | soft-cap | Q3-2026 |
| W-05 · Marketing copy drafts | Aisha | candidate | — | — | — | — | — | — | Q4-2026 |
Rule: a workload stays in candidate until the per-row capacity estimate is real. Estimates from the spreadsheet model in §03 are signed off by engineering and finance before status moves to planned; committed requires an assigned owner and a ship quarter.
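As a sketch of how that gating rule might be enforced inside whatever tool holds the plan, the snippet below encodes the candidate → planned → committed transitions in Python. The class, field and function names are illustrative; the only rules it carries are the ones stated above.

```python
from dataclasses import dataclass, field

STATUSES = ("candidate", "planned", "committed")

@dataclass
class WorkloadRow:
    workload_id: str
    name: str
    status: str = "candidate"
    owner: str | None = None
    ship_quarter: str | None = None
    # Both "engineering" and "finance" must appear here before 'planned'.
    capacity_signed_off_by: set = field(default_factory=set)

def advance(row: WorkloadRow, new_status: str) -> WorkloadRow:
    """Enforce candidate -> planned -> committed, one step at a time."""
    if STATUSES.index(new_status) != STATUSES.index(row.status) + 1:
        raise ValueError(f"{row.workload_id}: advance one status level at a time")
    if new_status == "planned" and not {"engineering", "finance"} <= row.capacity_signed_off_by:
        raise ValueError(f"{row.workload_id}: capacity estimate needs eng + finance sign-off")
    if new_status == "committed" and not (row.owner and row.ship_quarter):
        raise ValueError(f"{row.workload_id}: committed needs an owner and a ship quarter")
    row.status = new_status
    return row

# Example: W-05 cannot leave candidate until both sign-offs are recorded.
w05 = WorkloadRow("W-05", "Marketing copy drafts", owner="Aisha")
w05.capacity_signed_off_by.update({"engineering", "finance"})
advance(w05, "planned")
```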
Three rules about how to read the template. First, workloads are sized in users per workload, not seats per plan — the same user can appear across multiple workload rows, and that cross-workload usage is what the capacity model in §03 has to forecast. Second, the "Cost gate" column is mandatory even for committed workloads; an empty cell means the workload has no defined behaviour at the budget ceiling, and that is the failure mode the gate framework in §04 exists to prevent. Third, quarters do not stack arbitrarily — the ops hours column has a ceiling, and committing five workloads into a single quarter without re-staffing is the staffing failure mode from §07 (Anti-patterns) waiting to happen.
The plan is a living document. Re-open it at each quarterly review, slide the actuals from shipped workloads back into the capacity model to recalibrate inputs, and update the candidate rows with the latest backlog. The plan is also the artefact that the Stage 8 governance committee uses to approve the per-quarter workload portfolio — that approval gate is what keeps scale-out sequential and reviewed rather than ad-hoc.
"A workload that has no defined behaviour at its budget ceiling is a margin incident waiting for a calendar quarter."— Stage 9 review checklist
03 — Capacity · Compute, tokens, ops headcount.
The capacity model is the spreadsheet that sits behind the scale-out plan. Most teams build a credible model for the first two budgets — compute and tokens — and skip or under-resource the third. The third (ops headcount) is the one that consistently causes the largest scale-out incident in the first year, because under-staffing is invisible right up to the moment the first incident exposes it.
The matrix below lays out the three budgets at the shape we recommend. Each budget has its own forecasting logic, its own review cadence, and its own owner. Treating them as a single blended "AI cost line" is the most common modelling error we see.
Compute — infrastructure
Inference endpoints, vector DB, orchestration runtime, observability stack. Sized against peak concurrent sessions across all workloads, not against average user counts. Forecast monthly; review quarterly; owned by platform engineering. Use the same headroom buffer (typically 30–40% above peak forecast) you would on any other production service.
Owner: platform eng

Tokens — inference spend
The per-workload, per-user token formula from the budget calculator (sessions × turns × (input + output × multiplier) × tool_call_factor × retry_factor). Forecast monthly; review monthly; owned jointly by product and finance. The discipline is per-workload — never average across workloads, because the input distributions are too different.
Owner: product · finance

Ops headcount — human side
Hours per week per workload covering on-call, incident review, prompt drift, eval-set maintenance and stakeholder reporting. Scales with workload count and heterogeneity, not user count. Rough baseline: 4–8 hours/week per stable workload, 12–16 in the first quarter after launch. Forecast quarterly; review quarterly; owned by the agentic ops lead.
Owner: ops lead

Buffer — cross-budget reserve
A reserved 15–20% on top of the three primary budgets that absorbs in-quarter surprises — a workload with usage 2× forecast, a model price change, an unplanned ops vacancy. Held centrally rather than allocated to specific workloads. Reviewed and rebalanced at each quarterly Stage 9 review.
Reserve: 15–20%

Per-workload baseline. 4–8 hours/week per stable workload covers on-call, prompt drift review, weekly eval-set checks, and stakeholder reporting. 12–16 hours/week in the first quarter after launch covers the additional incident response, prompt tuning, and observability buildout.
Compounding factor. Heterogeneous workloads (different stakeholders, different SLA expectations) cost more ops hours than homogeneous additions. A second customer-facing chat workload alongside an existing one might add 4 hours/week; a workload of a fundamentally different shape (a code review assistant alongside customer chat) typically adds 10–12.
Worked example. Five workloads × 6 hours/week average = 30 hours/week steady-state, plus an additional 12 hours/week for whichever workload is in its first quarter, plus ~4 hours/week for cross-workload platform improvements. ~46 hours/week — more than one FTE, less than two, which is exactly the staffing band most under-resourced teams sit in.
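Because the model is just three budgets and a reserve, it is small enough to sketch directly. The snippet below is a minimal Python sketch of that shape, using the token formula and the ops heuristic from this section; the helper names, the plan dictionaries, and the compute and token prices are illustrative assumptions, not engagement figures.

```python
def tokens_per_user_month(sessions, turns, input_tokens, output_tokens,
                          output_multiplier, tool_call_factor, retry_factor):
    """Per-workload token formula from the budget calculator:
    sessions x turns x (input + output x multiplier) x tool_call_factor x retry_factor.
    Run per workload; never average across workloads."""
    return (sessions * turns * (input_tokens + output_tokens * output_multiplier)
            * tool_call_factor * retry_factor)

def ops_hours_per_week(workloads):
    """Ops heuristic: stable workloads carry 4-8 h/week; workloads in their first
    quarter carry the 12-16 h/week ramp load; ~4 h/week of cross-workload platform
    work sits on top."""
    steady = sum(w["ops_hours"] for w in workloads if not w["first_quarter"])
    ramping = sum(w["ops_hours"] + 8 for w in workloads if w["first_quarter"])
    return steady + ramping + 4

def capacity_model(workloads, compute_usd_month, usd_per_million_tokens,
                   buffer_fraction=0.175):   # midpoint of the 15-20% reserve
    """Three budgets plus the cross-budget reserve; ops stays in hours, not dollars."""
    token_cost = (sum(w["users"] * w["tokens_per_user_month"] for w in workloads)
                  / 1_000_000 * usd_per_million_tokens)
    return {
        "compute_usd": compute_usd_month,
        "tokens_usd": round(token_cost, 2),
        "ops_hours_per_week": ops_hours_per_week(workloads),
        "buffer_usd": round((compute_usd_month + token_cost) * buffer_fraction, 2),
    }

# Illustrative only: W-01 and W-02 from the plan above, with invented compute and
# token prices; the tokens_per_user_month figures would come from the formula.
plan = [
    {"users": 4_200, "tokens_per_user_month": 520_000, "ops_hours": 6, "first_quarter": False},
    {"users": 1_100, "tokens_per_user_month": 980_000, "ops_hours": 4, "first_quarter": True},
]
print(capacity_model(plan, compute_usd_month=3_000, usd_per_million_tokens=3.0))
```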
One operational detail worth surfacing. The compute and tokens budgets are both pure-cost variables — they only ever appear on the cost side of the P&L. The ops headcount budget is different; under-staffing it does not save money, it shifts the cost from the ops line into the incident-response line and the engineering-velocity line, both of which are larger and harder to forecast. The cheapest ops decision is rarely the lowest-headcount decision; it is the one that produces the lowest blended cost across all three lines.
The capacity model is also the artefact that drives the "is this workload approvable?" conversation at the governance committee in Stage 8. A workload that lacks a credible row in the capacity model is, by definition, not approvable — the committee should not advance it past candidate status without sign-off on all three budgets.
04 — Cost Gates · Tier budgets, overage routing, alert thresholds.
Cost gates are the production mechanism that translates the capacity model into runtime behaviour. Three gate categories cover the realistic surface — tier budgets, overage routing, and alert thresholds. Each maps to one of the failure modes from §01: tier budgets cap blast radius from per-user heavy use, overage routing defines what happens at the ceiling, alert thresholds prevent the cliff-edge customer experience that otherwise produces churn rather than upgrade.
The three patterns below are the production defaults across the engagements behind this playbook. Pick the right pattern per workload — not per platform — and resist the urge to apply a single answer everywhere. A code-review assistant and a customer-support chat have very different right answers across all three gates.
Tier budgets
Per-workload · per-tier · per-month. A monthly token ceiling per user per workload, sized using the formula in the capacity model. Critical detail: the ceiling is per workload, not platform-wide — a user heavy on one workload should not consume the budget of an unrelated workload. Soft thresholds at 70% trigger the alert flow; hard thresholds at 100% trigger the overage pattern. Re-size every quarter against the latest model rate card.
every workload · every tier

Overage routing
Hard-cap · soft-cap · pay-as-you-go. What the runtime does at the ceiling. Hard-cap blocks (right for free tier and compliance-sensitive workloads). Soft-cap silently degrades the model from premium tier to budget tier (the production default for most paid workloads). Pay-as-you-go bills metered overage to the workspace admin (right for explicitly metered relationships only). Never auto-enable metered billing on a tier that did not opt in.
pattern per workload

Alert thresholds
70% · 90% · 100%. Three thresholds, three different signals. 70% is an in-product soft notice. 90% is an in-product banner plus a single email to the account owner. 100% is the contract event, behaviour determined by the overage pattern. Compute thresholds against the user's pace (days remaining in the month), never just the absolute counter. Never alert below 50% — low-watermark spam tunes the high-watermark alerts out.
in-app + email

The gate framework is the bridge between the capacity model and the runtime. Without it, the capacity model is a forecast that production cannot enforce; with it, the forecast becomes policy. The per-workload discipline is the part teams most often skip — a single platform-wide ceiling is operationally simpler but reliably under-allocates budget to the workloads that need it and over-allocates to the workloads that do not. The additional engineering to express the ceiling per workload pays back inside one quarter.
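A minimal sketch of what such a gate can look like at the route handler, assuming one token counter per user per workload per month. The class name, field names and the premium/budget tier labels are placeholders rather than a prescribed API; the thresholds and overage behaviours are the ones described above.

```python
from dataclasses import dataclass

@dataclass
class CostGate:
    workload: str
    monthly_ceiling_tokens: int           # per user, per workload, per month
    overage: str                          # "hard-cap" | "soft-cap" | "pay-as-you-go"
    primary_model: str = "premium-tier"   # placeholder tier labels
    degraded_model: str = "budget-tier"   # only used under soft-cap

def evaluate(gate: CostGate, tokens_used: int, day_of_month: int, days_in_month: int):
    """Return (action, model, alerts) for one user on one workload this month."""
    alerts = []
    used = tokens_used / gate.monthly_ceiling_tokens
    # Alerts compare against the user's pace through the month, never just the counter.
    pace = day_of_month / days_in_month
    if used >= 1.0:
        alerts.append("100%: ceiling reached, behaviour set by the overage pattern")
        if gate.overage == "hard-cap":
            return "block", None, alerts
        if gate.overage == "soft-cap":
            return "degrade", gate.degraded_model, alerts
        return "meter", gate.primary_model, alerts   # pay-as-you-go, opted in only
    if used >= 0.9 and used > pace:
        alerts.append("90%: in-product banner + single email to account owner")
    elif used >= 0.7 and used > pace:
        alerts.append("70%: in-product soft notice")
    return "allow", gate.primary_model, alerts

gate = CostGate("customer-support", monthly_ceiling_tokens=520_000, overage="soft-cap")
print(evaluate(gate, tokens_used=380_000, day_of_month=12, days_in_month=30))
```

The `degraded_model` field is also where the per-workload model-tier policy described in the next paragraph plugs in.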
One additional gate worth mentioning, though it is workload-specific rather than universal: a model-tier policy per workload. Some workloads are fine on a budget-tier model from day one (high-volume retrieval, autocomplete-style features). Others require the premium tier for the primary path and only permit degradation under the soft-cap pattern (complex reasoning, customer-facing summarisation). The policy is set in the scale-out plan and enforced at the route handler.
05 — Fan-Out · Routing across workloads without coupling.
The fan-out pattern is the architectural answer to the most common scale failure mode in the anti-patterns section: the monolithic agent. Instead of growing a single agent tool-by-tool until it owns every workload on the platform, fan-out introduces a thin entry-point dispatcher and N narrowly-scoped specialist agents — one per workload — each with its own prompt, its own tool set, its own observability and its own cost gate.
The pattern is older than agentic AI; it is the same routing shape used for years in microservice architectures. What is new is the cost benefit. Where service-to-service routing in a classical microservice stack is largely a structural choice, agent routing at platform scale is a cost choice. A monolithic agent serves every workload through the same prompt and model tier, which means the most expensive workload sets the floor for every other workload. Fan-out lets each workload pick its own tier, which is the largest single source of margin in a multi-workload platform.
Entry-point dispatcher. A thin handler — usually a small classifier or a cheap LLM call — that maps an incoming request to one of N workload-specific agents. No business logic. No tool calls. Latency budget: 50–150ms. Owns workload routing, tenant isolation, request ID assignment.
Specialist agents (one per workload). Each agent has its own prompt, its own tool set, its own model tier policy, its own cost gate. They do not call each other. Shared state lives in the platform store, not in agent-to-agent calls.
Shared platform services. Observability, cost attribution, audit logging, identity. Every specialist agent writes to these the same way. The platform services are the only thing that knows about all workloads simultaneously.
The contract: a specialist agent can be replaced, upgraded, or retired without touching any other specialist. That property — independent evolution — is what makes fan-out cheaper to operate as workload count grows.
Three implementation rules. First, the dispatcher must be cheap. A heavyweight router (large LLM, multi-step classification) defeats the entire pattern by re-introducing the monolithic floor at the routing layer. A 200-token system prompt on a budget-tier model is the right shape for a 5-workload dispatcher; even at 10+ workloads a cheap classifier beats the cost of routing through a premium model. Second, specialist agents do not share prompts. Sharing prompts couples their behaviour and re-creates the monolith one prompt modification at a time. Third, the dispatcher must own tenant isolation — multi-tenant workloads do not safely live inside a specialist agent, because the specialist agent does not have the architectural authority to enforce isolation across other tenants.
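A minimal sketch of the fan-out shape under those three rules, in Python. The handler names, the keyword-match classifier and the request fields are all illustrative stand-ins; the structural claims are only the ones the pattern itself makes: a thin dispatcher, one specialist per workload, no specialist-to-specialist calls.

```python
from typing import Callable

# One narrowly-scoped specialist per workload. Each owns its own prompt, tools,
# model-tier policy and cost gate; specialists never call each other.
def handle_support(request: dict) -> str:
    return f"[customer-support agent] handling: {request['text']}"

def handle_internal_qa(request: dict) -> str:
    return f"[internal-Q&A agent] handling: {request['text']}"

SPECIALISTS: dict[str, Callable[[dict], str]] = {
    "customer-support": handle_support,   # W-01 in the scale-out plan
    "internal-qa": handle_internal_qa,    # W-02 in the scale-out plan
}

def classify_workload(request: dict) -> str:
    """Stand-in for the cheap routing step: in production this would be a small
    classifier or a short budget-tier LLM call (~200-token system prompt),
    inside a latency budget of roughly 50-150 ms."""
    return "internal-qa" if "docs" in request["text"].lower() else "customer-support"

def dispatch(request: dict) -> str:
    """Thin entry point: request ID, tenant tagging, workload routing - no business
    logic and no tool calls. Cost gates live inside each specialist, per the gate
    framework in section 04."""
    request = {**request, "request_id": f"req-{abs(hash(request['text'])) % 100_000}"}
    return SPECIALISTS[classify_workload(request)](request)

print(dispatch({"tenant": "acme", "text": "Where do the onboarding docs live?"}))
```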
The pattern also unlocks the per-workload cost gate from §04. With monolithic agents, "per workload" is at best a tag on telemetry and at worst a fiction — the budget is shared across whatever workloads the agent happens to handle that minute. With fan-out, each specialist agent has its own ceiling and its own degradation behaviour, which is the policy the capacity model was forecasting in the first place.
"Monolithic agents set the cost floor at the most expensive workload they serve. Fan-out lets every workload pick its own floor — and that is where the platform-tier margin comes from."— Stage 9 architecture review notes
06 — Maturity · Pilot → expansion → platform → optimised.
Four maturity tiers describe the realistic shape of how an agentic capability grows inside an organisation. They are not arbitrary — each tier carries different staffing, different architecture, different cost-gate complexity, and different governance cadence. Knowing which tier a platform sits in is also the answer to most planning questions; the same question ("should we centralise the eval suite?") has different answers at pilot tier and at platform tier.
The tiers are sequential. Skipping the expansion tier and attempting to jump from pilot directly to platform is the largest single source of scale failure in the engagements behind this playbook. The expansion tier is where the team learns how its specific workloads behave when sequenced, and that learning is the input the platform-tier re-architecture needs.
Tier 1 — Pilot · single workload · single team
One workload, one on-call rotation, one set of evals, one cost line. Targets ship velocity over operational maturity. Production traffic, but limited in scope and clearly labelled as pilot. Lasts roughly one quarter — long enough to demonstrate value, short enough that operational debt does not compound.
Duration: one quarter

Tier 2 — Expansion · two to five workloads · same team
Workloads added sequentially using the Stage 9 templates. Same engineering team scales by ~1.5×; ops headcount budget formalised. Fan-out pattern introduced explicitly. Cost gates per workload, scale-out plan reviewed quarterly with governance. Lasts roughly two to four quarters before the pattern of cross-workload concerns demands platform-tier investment.
Duration: 2–4 quarters

Tier 3 — Platform · five-plus workloads · dedicated ops
Re-architected for multi-tenant isolation, workload-level SLAs, dedicated capacity pools and a separate ops rotation. New roles: agentic ops lead, prompt engineering specialist, eval-set owner per workload class. This is a project, not a milestone — budget for it like a major platform release. Typically a one-quarter project, then steady-state.
Duration: project, then steady-state

Tier 4 — Optimised · continuous improvement · Stage 10
The hand-off to Stage 10. Quarterly retrospectives, weekly KPI dashboard review, monthly capacity-model recalibration, and proactive workload retirement. Cost-per-workload trends down year-on-year as model prices fall and prompts mature. New workloads added at marginal cost rather than as projects. Indefinite duration; the loop never closes.
Duration: indefinite

The transition between tiers 2 and 3 is the one teams most often mis-time. The signal is not user count — it is workload count and workload heterogeneity. When the ops headcount budget crosses two FTE-equivalents, when the third stakeholder asks for workload-specific SLAs, when the second incident reveals that the shared infrastructure cannot isolate one workload from another — those are the signals to start the platform-tier project, not to wait another quarter for things to get worse.
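Stated as a trigger heuristic, with thresholds that simply restate those three signals (the function and argument names are illustrative):

```python
def start_platform_tier_project(ops_fte_equivalents: float,
                                stakeholders_requesting_slas: int,
                                isolation_incidents: int) -> bool:
    """Tier 2 -> tier 3 trigger: any one of the three signals firing is enough."""
    return (ops_fte_equivalents > 2
            or stakeholders_requesting_slas >= 3
            or isolation_incidents >= 2)
```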
The transition between tiers 3 and 4 is the easiest to get right operationally and the hardest to get right culturally. By tier 3 the platform is stable, the incidents are manageable, and the temptation is to call it done. Tier 4 requires re-investing in retrospectives and continuous improvement at exactly the moment the team feels least pressure to do so. The Stage 10 playbook covers the cadences that hold tier 4 steady — the failure mode at tier 4 is not collapse, it is quiet drift as model prices and prompts both move and nobody notices.
07 — Anti-Patterns · Three ways scale breaks.
Three anti-patterns account for the majority of scale failures across the engagements behind this playbook. Each is preventable, each can be recognised early, and each requires a specific corrective action rather than a generalised "improve operations". Naming them explicitly is half the prevention.
Chart — scale failure attribution by anti-pattern (representative shape). Source: aggregate from Stage 9 engagements 2025–2026.

AP-01 · Monolithic agent. The single most common scale anti-pattern. A successful pilot agent acquires its second workload by absorbing it into the existing prompt and tool set; the third workload follows the same way; by the fifth workload, the agent is a monolith whose prompt is too long to reliably improve, whose model tier is set by the most expensive workload, and whose incident surface area is the union of all workloads served. The corrective is the fan-out pattern in §05 — and the right time to introduce it is at the second workload, not the fifth.
AP-02 · Under-staffed ops. The ops headcount line in the capacity model is omitted or under-resourced. Workloads are added without re-staffing, the existing team absorbs the additional load, and the first major incident reveals that the team has been operating at 1.3× capacity for two quarters. The corrective is the explicit ops budget in §03 and the discipline that no committed workload ships without a sign-off on the ops budget impact.
AP-03 · Retrofit cost gates. The cost gates from §04 are deferred until the first margin-pressure conversation with finance, by which time user behaviour is entrenched and the retrofit requires either a customer-facing communication (which costs trust) or a silent change (which costs more trust if discovered). The corrective is to ship the gates alongside the first scale-out workload, not after the third — the engineering work is the same either way, the trust cost is asymmetric.
08 — Next Stage · Hand-off to continuous improvement (Stage 10).
Stage 9 delivers a platform that has crossed the expansion-tier threshold and is operating cleanly at the platform tier. The scale-out plan is live and quarterly-reviewed; the capacity model is calibrated against actuals; the cost gates are shipped per workload; the fan-out pattern is in production; the maturity tier is documented and known. What it does not deliver is the steady-state operating discipline that keeps the platform from drifting once the pressure of scale-out lifts. That is the Stage 10 contribution.
The hand-off into Stage 10 is a deliberate event, not a transition by default. The artefacts the Stage 9 team passes forward are: the live scale-out plan (so Stage 10 inherits the forward pipeline), the capacity model with at least two quarters of actuals (so the calibration baseline is real), the cost-gate configuration per workload (so the runtime policy is inspectable), the fan-out architectural map (so the topology is documented), and an explicit anti-pattern register (so the failure modes the platform has already avoided are not re-introduced quietly).
Stage 10 — covered in detail in the companion continuous improvement templates playbook — picks up those artefacts and adds the cadences that hold them current: weekly KPI dashboard reviews, monthly capacity-model recalibration, quarterly retrospectives, the model-upgrade evaluation checklist, and the stakeholder communication templates. The Stage 10 loop also feeds back into Stage 1 readiness assessment for the next quarterly planning cycle, which is what closes the 10-stage pipeline as a loop rather than a sequence.
Before formally handing off to Stage 10, verify:
(1) The scale-out plan is live and has at least three quarters of forward visibility.
(2) The capacity model has been calibrated against at least two quarters of actuals across all three budgets (compute, tokens, ops).
(3) Every production workload has an explicit cost gate (tier budget + overage pattern + alert thresholds) and that gate is enforced at the runtime, not on a wiki.
(4) The platform sits cleanly at tier 3 or above on the maturity ladder — workload-level SLAs, dedicated ops rotation, multi-tenant isolation.
(5) The three anti-patterns are documented and reviewed at the most recent governance committee meeting.
If any of those five is missing, Stage 9 is not finished. Do not advance to Stage 10 — the cadences in Stage 10 assume these artefacts exist, and operating Stage 10 cadences against an incomplete Stage 9 produces motion without improvement.
One closing observation. The shape of Stage 9 is genuinely different from the previous eight stages. Stages 1 through 8 are largely sequential — each delivers an artefact that the next stage consumes. Stage 9 is the first stage that operates as a sustained activity rather than a milestone, and Stage 10 extends that property indefinitely. The mental model worth holding is that Stages 1–8 build the platform and Stages 9–10 run it. The teams that internalise that distinction make the scale-out decisions that compound; the teams that treat Stage 9 as another milestone and Stage 10 as optional are the teams that show up at the next year's Stage 1 readiness assessment having to start again.
Scale is a re-architecture, not a multiplier.
The recurring failure mode at Stage 9 is not capability — it is the assumption that the numbers from pilot multiply. They do not. The token formula is per-workload, the ops headcount line is non-optional, the cost gates are per-workload-per-tier, and the architectural pattern that beats every alternative as workload count grows is fan-out. Each of the five Stage 9 templates exists because the same anti-pattern recurred across enough engagements that explicit prevention is cheaper than repeated correction.
The playbook is deliberately compact. One scale-out plan with one row per workload; one capacity model with three budgets and a buffer; three gate categories; one architectural pattern; four maturity tiers; three anti-patterns; one hand-off checklist. Skip any of them and the platform has a hole — skip the ops headcount budget and the first incident exposes it; skip the fan-out pattern and the monolithic agent sets the cost floor; skip the cost gates and the margin pressure shows up in finance instead of in product. The compactness is the point.
Run the Stage 9 templates against the platform before the second scale-out workload, not after the third. The templates cost roughly two weeks of investment at the first workload and save roughly six weeks of retrofit at each subsequent workload — the trade-off compounds in the right direction from workload three onwards. Then formally hand off to Stage 10 with the five-point checklist, and the platform is positioned to stay healthy through the next year of model price changes, workload additions, and operational growth.