An agent stack audit is a structured, repeatable assessment of an organisation's readiness to operate agentic AI in production — scored across one hundred binary points spanning infrastructure, governance, data, operations, and skills. The goal is not a certificate. The goal is a severity-ranked list of the gaps that will silently cost a quarter if they go unfixed.
Readiness audits matter now because agentic deployments fail in predictable, expensive ways: a model-routing layer with no observability, a governance policy that has never been tested against a real incident, training data with no lineage, skills that lag procurement by six months. None of those failures are exotic — every one of them shows up on the checklist below, and every one is cheaper to find on paper than in production.
This guide covers why readiness is measurable, the five domains of twenty checks each, the severity-weighting scheme, the four-stage maturity model, and a worked example of running the audit against a mid-market SaaS company. Bring it to your next steering-committee meeting and you will leave with a remediation roadmap rather than another open-ended discussion.
- 01 · Readiness is observable, not aspirational. Each of the one hundred points has a binary pass criterion. Either the artifact exists, the control runs, the metric is tracked — or it does not. Aspirational programs score zero.
- 02 · Infrastructure is the cheapest domain to fix. Infra gaps are mostly procurement and configuration — observable, fixable in weeks. Governance and skills gaps require cultural change and typically dominate the remediation timeline.
- 03 · Governance is where audits surface the most critical findings. The highest-severity gaps cluster in policy, risk, and incident response. A program with strong infra and weak governance is a program one incident away from a board-level event.
- 04 · Skills lag infrastructure by six to nine months. Procurement moves faster than enablement. Plan training and operating-model changes alongside — not after — infrastructure investment, or expect a sustained capability gap.
- 05 · Quarterly re-audit beats annual deep-dive. Drift in tools, models, and team composition is faster than the procurement cycle. A lighter quarterly cadence catches regressions before they compound into a remediation project.
01 — Why Audit
Readiness is measurable — and most teams aren't.
The most common failure mode we see on agentic-AI engagements is not a model choice or a vendor selection. It is a steering committee that cannot answer a basic question: are we ready to operate this in production? The answer is almost always a qualitative sentence — "we're piloting, we're making progress" — and qualitative sentences do not survive an incident, a budget review, or a board challenge.
Readiness becomes measurable the moment you commit to two disciplines. First, every claim about the program must map to an observable artifact: a config file, a metric, a runbook, a signed policy. Second, every artifact must have a binary pass criterion — it exists and works, or it does not. The one hundred points below are the operationalisation of those two disciplines.
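As a concrete illustration of those two disciplines, the sketch below shows one way to record a check so that the claim, the observable artifact, and the binary pass judgement travel together. The field names and example values are illustrative, not taken from the audit kit.

```python
# Minimal sketch of a single audit check: every claim maps to an artifact,
# and passing is a binary judgement on that artifact. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Check:
    check_id: str   # e.g. "INFRA-02" (hypothetical numbering)
    claim: str      # what the program says it has
    artifact: str   # where the evidence lives: config, metric, runbook, signed policy
    severity: str   # "C", "H", or "M"
    passed: bool    # the artifact exists and works, or it does not

example = Check(
    check_id="INFRA-02",
    claim="Model versions are pinned in production",
    artifact="repo: services/copilot/config/models.yaml",  # hypothetical path
    severity="C",
    passed=False,   # aspirational until the artifact is actually observed
)
```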
What an honest audit produces: a severity-ranked list of gaps, a roughly twelve-page report, an executive summary that maps findings to maturity stage, and a remediation roadmap with ninety-day, six-month, and one-year horizons. What it does not produce: a green dashboard, a vendor recommendation, or a certificate of readiness. The point is to make the program improvable, not to declare it done.
One framing worth borrowing from financial audit practice: the value of the audit is the gap report, not the score. A team that passes ninety points and fails ten is in a meaningfully different position than a team that passes seventy points and fails thirty, but both teams should be operating on the failed points first. Headline scores invite premature celebration; severity-ranked gap reports invite action.
02 — Five Domains
Infra, governance, data, ops, skills.
The one hundred points are organised into five domains of twenty checks each. Domains are not equal in weight — governance and data failures typically carry higher severity than infrastructure or ops failures — but they are equal in count. The shape is deliberate: it makes the audit symmetric, it forces attention on every domain, and it makes the gap report visually comparable across domains.
Each domain has a primary owner in a healthy organisation. Infra usually sits with platform or data engineering; governance with legal, risk, or a dedicated AI governance function; data with the CDO or data-engineering lead; ops with site reliability or a dedicated AI ops team; skills with people operations and engineering management. When a domain has no owner, that itself is a finding — record it under governance.
Infrastructure
LLM access, vector stores, retrieval pipelines, agent runtime, observability, model routing, cost controls. The plumbing that makes agentic workloads observable and operable in production.
Owner: platform / data eng

Governance
Policies, risk register, incident response runbooks, model approval workflow, vendor due diligence, audit trail, regulatory alignment. The control surface that keeps the program defensible.
Owner: legal / risk / AI gov

Data
Source inventory, lineage, classification, retention, training-data provenance, evaluation datasets, ground-truth labels, drift monitoring. The substrate every model and agent depends on.
Owner: CDO / data eng

Operations
On-call rotations, deployment gates, rollback playbooks, evaluation cadence, regression suites, cost dashboards, SLOs, change management. How the program runs day-to-day.
Owner: SRE / AI ops

Skills
Engineering enablement, prompt and evaluation training, governance literacy, business-stakeholder fluency, succession depth, contractor strategy, knowledge management. The human capacity to operate everything above.
Owner: people / eng mgmt

One observation from running this audit across roughly thirty engagements: the domain that scores lowest predicts where the next twelve-month firefight will happen. Teams with weak data domain scores spend the next year recovering from a lineage or data-quality incident; teams with weak governance scores spend the next year retrofitting policy around a deployed system. The audit is, in part, a forecast.
03 — Infrastructure
Twenty infra checks — LLM access, vectors, observability.
The infrastructure domain is the most concrete to audit and the cheapest to remediate. Almost every gap maps to a procurement, configuration, or integration decision — none of which require cultural change. That is why infrastructure is usually the highest-scoring domain on a first audit, and also why a low infrastructure score is a leading indicator of an under-invested program.
The twenty checks below cluster into four sub-areas: LLM access and routing, retrieval and vector stores, agent runtime and tooling, and observability and cost controls. The grid below highlights the four sub-areas; the checklist that follows lists the full twenty points with their severity weights.
LLM access & routing
5 checks · severity high
Multi-provider availability, model-version pinning, fallback routing, latency and cost-aware routing policy, key rotation and per-environment isolation. The base layer everything else depends on.
Procurement-heavy

Retrieval & vectors
5 checks · severity high
Vector store choice and scale, embedding pipeline, chunking strategy, hybrid retrieval, evaluation harness. Where most agentic-RAG programs quietly fail without ever surfacing the cause.
Quality-determining

Agent runtime & tooling
5 checks · severity medium
Tool registry, sandboxing, function-calling schema discipline, MCP or equivalent transport, allow-list permissions, deterministic replay. The substrate for actually-running agents.
Capability lever

Observability & cost
5 checks · severity high
Per-call tracing, prompt and output capture, token-spend dashboards, alerting on budget burn, structured eval logs, p95 latency tracking. You cannot operate what you cannot see.
Operational floor

The twenty individual checks, with severity tags (C = critical, H = high, M = medium):
- Multi-provider LLM access via at least two vendors (Anthropic, OpenAI, Google, or equivalent) [H]
- Explicit model-version pinning in production code; no "latest" aliases (a lint sketch follows this list) [C]
- Automatic fallback routing on provider error or timeout [H]
- Cost-aware and latency-aware routing policy documented and enforced [M]
- API key rotation policy and per-environment key isolation [C]
- Production-grade vector store selected and provisioned for scale [H]
- Embedding pipeline owned, versioned, and re-runnable from source [H]
- Chunking strategy documented and tunable per corpus [M]
- Hybrid retrieval (vector plus keyword or BM25) for production corpora [M]
- Retrieval-quality evaluation harness with recall and precision baselines [H]
- Centralised tool registry with versioning and ownership [H]
- Sandboxed execution for code-running or shell-executing tools [C]
- Function-calling schema lint and consistency checks across tools [M]
- MCP or equivalent transport for cross-service tool composition [M]
- Per-tool allow-listing and least-privilege permissions [H]
- Per-call tracing with prompts, outputs, latencies, and token counts captured [H]
- Token-spend dashboards segmented by team, feature, and model [H]
- Budget-burn alerts with thresholds tied to monthly cost targets [M]
- Structured evaluation logs queryable by prompt template and version [M]
- p95 latency tracked per surface with documented SLO targets [M]
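The pinning check flagged above is easy to make mechanical. The sketch below is one hedged way to do it, assuming a config-driven model registry; the model IDs, alias list, and date-suffix convention are illustrative assumptions, not recommendations for any particular provider.

```python
# Hypothetical "no latest aliases" lint: one pinned, dated model ID per
# environment and surface, and a check that fails the audit point otherwise.
PINNED_MODELS = {
    "prod":    {"support-copilot": "claude-sonnet-4-5-20250929"},  # example ID
    "staging": {"support-copilot": "claude-sonnet-4-5-20250929"},
}

FORBIDDEN_ALIASES = ("latest", "default", "auto")

def lint_model_pins(registry: dict) -> list[str]:
    """Return a list of violations; an empty list means the check passes."""
    violations = []
    for env, surfaces in registry.items():
        for surface, model_id in surfaces.items():
            if any(alias in model_id for alias in FORBIDDEN_ALIASES):
                violations.append(f"{env}/{surface}: unpinned alias '{model_id}'")
            elif not model_id.rsplit("-", 1)[-1].isdigit():
                # Assumption: pinned IDs end in a dated release suffix (YYYYMMDD).
                violations.append(f"{env}/{surface}: no dated suffix in '{model_id}'")
    return violations

if __name__ == "__main__":
    problems = lint_model_pins(PINNED_MODELS)
    print("PASS" if not problems else "\n".join(problems))
```

Run as a CI step, the non-empty violation list is exactly the kind of observable artifact the audit asks for: the check either passes on the current config or it does not.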
04 — Governance
Policy, risk, incident response — twenty checks.
Governance is the domain where audits surface the most critical-severity findings. The pattern is almost universal: the engineering team has built capability quickly, the governance function has not caught up, and the gaps are written in policy-shaped holes. A program that scores eighteen out of twenty on infrastructure and eight out of twenty on governance is already in the danger zone — capability without controls is what produces the incidents that erase the program's political capital.
Twenty governance checks, clustered into four sub-areas:
- Policy and standards (5 checks). Written AI use policy, model approval workflow, prohibited-use list, acceptable-data-classes register, third-party vendor due diligence template.
- Risk and compliance (5 checks). AI risk register with named owners, regulatory mapping (EU AI Act, sector-specific rules), data-protection impact assessments, model-card discipline, fairness and bias evaluation cadence.
- Incident response (5 checks). Runbook for prompt-injection events, runbook for data-leak via model output, escalation paths to legal and communications, postmortem template, drill cadence at least biannual.
- Audit and reporting (5 checks). Decision-log for production model changes, sign-off authority documented per risk tier, board or steering-committee reporting cadence, quarterly internal audit, external review every twelve to eighteen months.
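For the audit-and-reporting cluster, the decision-log can be as small as one structured record per production model change. The shape below is illustrative rather than the audit kit's template; the field names, identifiers, and sign-off mapping are assumptions.

```python
# Hypothetical decision-log entry for a production model change, with sign-off
# tied to the change's risk tier. Values are invented for illustration.
decision_log_entry = {
    "change_id": "MDL-2026-014",
    "date": "2026-01-22",
    "surface": "support-copilot",
    "change": "swap retrieval embedding model; re-tune chunk size 512 -> 768",
    "risk_tier": "high",                     # determines who must sign off
    "sign_off": {"engineering": "platform lead", "governance": "AI risk owner"},
    "evaluation_evidence": "evals/run-2026-01-21/report.html",
    "rollback_plan": "feature flag support_copilot.retrieval_v2 -> off",
}

# The audit point passes only if entries like this exist for every production
# model change in the review window, not just for the changes that went well.
```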
The single highest-severity check in this domain is the incident-response drill cadence. A runbook that has never been exercised against a real or tabletop scenario is a runbook that will fail under the first real incident. Drill at least twice a year, document the gaps each drill finds, close them before the next drill.
"A program that scores eighteen on infrastructure and eight on governance is one incident away from a board-level event. Capability without controls is not maturity — it is exposure."— Field engagement note · Digital Applied audit kit
One forward projection worth naming. Regulatory pressure on agentic AI is increasing across jurisdictions, and the audit artefacts a mature governance function produces — model cards, risk register, decision logs, drill records — are almost exactly the artefacts most emerging regulations expect to see. Investing in governance now is not just risk mitigation; it is regulatory pre-staging. Programs that wait until regulators ask will spend two to three times more to retrofit the same artefacts under timeline pressure.
05 — Data, Ops, Skills
Three more twenty-point domains.
The remaining three domains follow the same shape — twenty binary-pass checks each, clustered into four sub-areas of five checks. Below is the structure for each, with the severity distribution. The full per-check rubric is in the audit kit; this section summarises the shape and the sub-area weights.
Data — 20 checks
- Source inventory and classification (5 checks). Every data source in scope is registered with owner, classification, retention, and consent provenance. The most commonly failed point: deprecated sources still feeding live retrieval indexes.
- Lineage and provenance (5 checks). Training and evaluation data lineage traceable to source, including any synthetic generations. License compliance documented per source.
- Quality and evaluation (5 checks). Ground-truth datasets versioned, evaluation cadence on each, accepted error thresholds, drift monitoring, regression alerts.
- Privacy and minimisation (5 checks). PII handling policy enforced in pipelines, redaction at retrieval boundary, encryption in transit and at rest, deletion paths tested, data-subject-rights workflows.
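For the privacy-and-minimisation cluster, "redaction at the retrieval boundary" means retrieved chunks are scrubbed before they ever reach the prompt. A minimal sketch follows, assuming simple regex patterns stand in for a real PII detector; production pipelines would use a proper classifier.

```python
# Sketch of redaction at the retrieval boundary: scrub obvious PII patterns
# from retrieved chunks before prompt assembly. Patterns are illustrative only.
import re

PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}

def redact(chunk: str) -> str:
    """Replace matches with typed placeholders so downstream evals can count redactions."""
    for label, pattern in PII_PATTERNS.items():
        chunk = pattern.sub(f"[REDACTED-{label.upper()}]", chunk)
    return chunk

def retrieve_for_prompt(chunks: list[str]) -> list[str]:
    # The audit point passes only if this boundary is enforced in the pipeline,
    # not applied ad hoc inside individual prompts.
    return [redact(c) for c in chunks]
```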
Operations — 20 checks
- Deployment and rollback (5 checks). Canary-or-equivalent rollout for model changes, tested rollback within five minutes, blast-radius controls, feature flags on every agent surface, deploy log retention.
- On-call and SLOs (5 checks). Named on-call rotation, paging thresholds, error-budget policy, p95 latency SLOs per surface, weekly operational review.
- Evaluation and regression (5 checks). Regression suite on every model or prompt change, eval gating in CI (sketched after this list), golden-prompt set, periodic full-suite re-runs, red-team cadence.
- Cost and capacity (5 checks). Per-team budget dashboards, per-feature cost attribution, quarterly capacity review, escalation when projected spend exceeds budget, charge-back model where applicable.
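The CI gate referenced above can be a short script whose exit code fails the pipeline when the golden-prompt pass rate drops. The file layout, threshold, and comparison function below are assumptions for illustration; a real gate would call the team's own evaluation harness.

```python
# Hedged sketch of eval gating in CI against a golden-prompt set.
import json
import sys
from pathlib import Path

PASS_THRESHOLD = 0.95  # fraction of golden prompts that must still pass

def passes(expected: str, actual: str) -> bool:
    # Placeholder comparison; real suites use graded or model-based evaluation.
    return expected.strip().lower() in actual.strip().lower()

def run_gate(golden_path: str, outputs_path: str) -> int:
    golden = json.loads(Path(golden_path).read_text())    # [{"id", "prompt", "expected"}, ...]
    outputs = json.loads(Path(outputs_path).read_text())   # {"id": "candidate model output", ...}
    hits = sum(passes(case["expected"], outputs.get(case["id"], "")) for case in golden)
    rate = hits / len(golden)
    print(f"golden-prompt pass rate: {rate:.1%} (gate: {PASS_THRESHOLD:.0%})")
    return 0 if rate >= PASS_THRESHOLD else 1              # non-zero exit fails the CI job

if __name__ == "__main__":
    sys.exit(run_gate("evals/golden_prompts.json", "evals/candidate_outputs.json"))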
Skills — 20 checks
- Engineering enablement (5 checks). Training completed on prompt engineering, evaluation, retrieval design, tool authoring, observability — measured by completion and applied-knowledge check.
- Governance literacy (5 checks). Risk awareness, prohibited-use familiarity, incident-response role understanding, policy refresh cadence, leadership briefing cadence.
- Business stakeholder fluency (5 checks). Product, finance, and operating-unit leaders can describe the capability, the constraints, and the cost shape. Without this, roadmaps drift toward the wrong workloads.
- Continuity and depth (5 checks). Documented succession on key roles, contractor-to-employee ratio inside targets, knowledge-management discipline, internal wiki kept current, exit-interview review for AI-program roles.
For teams running an adjacent assessment focused on a single engineering function — Claude Code or another AI coding tool specifically — the companion fifty-point scorecard at the Claude Code team adoption audit covers configuration hygiene, hooks, skills, and productivity metrics. The two audits compose: the readiness audit gives you the program shape; the adoption audit drills into the engineering sub-function.
06 — Scoring
Severity weighting and the four-stage maturity model.
The audit produces three outputs. A raw score (sum of passed points, out of 100). A weighted score (passed points weighted by severity: critical = 3, high = 2, medium = 1). And a maturity-stage assignment per domain. The third output is usually the one steering committees engage with — it gives a clear ladder, a recognisable shape, and a destination.
The four maturity stages below are deliberately simple. A more elaborate model produces more granular labels and less actual consensus on what stage a program is in. The point of a maturity model is not precision; it is shared language between engineering, governance, and the executive layer.
Ad-Hoc
Pilots exist, but no consistent policy, no central inventory, no production observability. Capability lives in two or three engineers. Typical raw score: under 40. Action: stop scaling, build the foundation.
0-39 raw

Reactive
Most infrastructure is in place, governance is partial, incidents drive policy updates rather than the other way around. Capability beyond the founders is beginning to spread. Typical raw score: 40-64. Action: shift to proactive on the highest-severity gaps.
40-64 raw

Proactive
All domains covered by named owners, regular cadences for evaluation and review, governance artefacts in place, drills exercised. Capability scaled across the engineering org. Typical raw score: 65-84. Action: invest in optimisation and cost discipline.
65-84 raw

Optimised
Continuous evaluation, automated regression and cost guardrails, governance is pre-staged for regulation, skills depth across multiple teams, audit cadence is internal-then-external. Typical raw score: 85+. Action: maintain and re-audit quarterly.
85-100 raw

Two scoring conventions are worth being explicit about. First, stage assignment is per-domain, not global — a program can be Proactive on infrastructure and Reactive on governance, and the remediation roadmap should reflect that. Second, the raw score and weighted score are reported side-by-side; if they diverge (high raw, low weighted), the failed points are concentrated in critical-severity findings and the remediation roadmap leads with those regardless of count.
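A minimal sketch of those conventions in code: severity weights of three, two, and one; raw and weighted scores reported side by side; stage assigned per domain. Scaling the program-level raw-score bands proportionally to a single twenty-point domain is an assumption of this sketch, not a rule from the audit kit.

```python
# Scoring sketch: raw score, severity-weighted score, and per-domain stage bands.
SEVERITY_WEIGHT = {"C": 3, "H": 2, "M": 1}
STAGE_BANDS = [(85, "Optimised"), (65, "Proactive"), (40, "Reactive"), (0, "Ad-Hoc")]

def score_domain(checks):
    """checks: list of (severity, passed) tuples for one domain's twenty points."""
    raw = sum(1 for _, passed in checks if passed)
    weighted = sum(SEVERITY_WEIGHT[sev] for sev, passed in checks if passed)
    weighted_max = sum(SEVERITY_WEIGHT[sev] for sev, _ in checks)
    return raw, weighted, weighted_max

def domain_stage(raw: int, total: int = 20) -> str:
    # Assumption: the 0-39 / 40-64 / 65-84 / 85+ bands scale proportionally per domain.
    pct = 100 * raw / total
    for floor, label in STAGE_BANDS:
        if pct >= floor:
            return label
    return "Ad-Hoc"

# Illustrative per-domain assignment: 16/20 on infrastructure, 9/20 on governance.
print(domain_stage(16), domain_stage(9))   # -> Proactive Reactive
```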
Four maturity stages · raw-score bands and program shape
Source: Digital Applied readiness-audit field engagements 2024-2026.

07 — Run It
A worked example — mid-market SaaS audit.
To make the framework concrete, here is a sanitised composite of an audit we ran in Q1 2026 on a mid-market B2B SaaS company — roughly 250 engineers, ten months into an agentic-AI program with three customer-facing surfaces and a handful of internal copilots. Names and exact numbers are altered; the shape and findings are typical.
The audit ran in three sessions of roughly two hours each, plus a half-day report write-up. One auditor, one platform lead, one governance lead, and rotating product and ops leads as needed. Total elapsed calendar time: two weeks.
Raw score — Reactive
Sixty-four passes across one hundred points. Weighted score 121 of 200 (60.5%). Top of the Reactive band — close to Proactive but held back by governance gaps and an under-invested data domain.
Stage 2 · Reactive

Highest vs lowest
Infrastructure 18 of 20. Governance 9 of 20. Spread of nine points across domains is a classic shape: engineering moved fast, governance and data did not catch up. Predictable, fixable in two quarters.
Infra vs governance

Severity-critical fails
Four critical-severity points failed: unpinned model versions in two surfaces, no tested rollback runbook on the customer-facing copilot, no incident-response drill in the last twelve months, and shared API keys across staging and production.
Sprint-1 fixes

The remediation roadmap that came out of the audit had three horizons. Inside ninety days: fix the four critical-severity points, run a first incident-response drill, and stand up governance ownership with named owners on the risk register and the model-approval workflow. Inside six months: lift the data domain from eleven to fifteen by completing the source inventory, wiring lineage on retrieval indexes, and provisioning ground-truth datasets for the two top-priority retrieval surfaces. Inside twelve months: close out the skills gap with two cohorts of engineering enablement and a leadership briefing programme, then re-audit and target the Proactive band.
The single most useful artefact for this client was the severity-ranked one-page summary. It made the steering committee conversation a thirty-minute discussion rather than a ninety-minute negotiation, because the order of operations was no longer in dispute. That is the consistent value pattern from this audit format: it removes the order-of-operations argument and concentrates the conversation on resourcing and timeline.
For teams thinking about how this readiness picture fits into a broader view of where agentic AI is heading in 2026, the Q2 2026 state-of-agentic-AI quarterly report tracks the platform-and-vendor landscape the audit is being run against. The audit tells you where you are; the quarterly tells you what you are aiming at.
Audit results are only useful if they change next quarter's roadmap.
A readiness audit is a means, not an end. The one hundred points, the severity weighting, the maturity model, the worked example — all of it exists to produce a small number of decisions that change how the next quarter is resourced. If the audit closes without a remediation roadmap, the audit was a meeting. If it closes with a roadmap that names owners, horizons, and the four or five critical gaps in priority order, it was the cheapest insurance the program will buy this year.
The practical next step is to run the audit internally first — a single platform or engineering leader spending six focused hours against the rubric — and use the result to decide whether an external pass adds value. Most teams find that the internal pass is enough to surface the four to six gaps that matter, and the external pass becomes worthwhile only once those internal gaps are closed and the program is targeting the Proactive band. The audit is repeatable; run it again next quarter.