Agentic AI vendor selection is the Stage 4 procurement decision that outlives most executive sponsors — pick badly and the next two years of your roadmap inherits the call; pick well and the prototype in Stage 5 starts with the right substrate underneath it. This playbook turns the data foundation built in Stage 3 into a defensible scoring rubric, an RFP that names the AI-specific clauses procurement teams usually miss, and a per-capability build-versus-buy fork at the end.
By the time a team reaches Stage 4, the strategy is documented, the roadmap is sequenced, the data foundation is in production, and the capability list is concrete. What remains is the procurement decision — which vendors carry which capabilities, on what commercial terms, with which contractual protections, and which capabilities the team will build itself. Done casually, this stage produces vendor portfolios that age badly; done deliberately, it produces a scoring artefact that auditors, board members, and future engineering leaders can read three years from now and understand the reasoning.
This guide walks the six templates that turn vendor selection into a defensible operating discipline — scorecard matrix, RFP, evaluation rubric, reference-call script, contract clause checklist, and the one-page build-versus-buy decision. The templates assume the Stage 3 data foundation is in place; if it is not, run that playbook first, because vendor selection without the data foundation documented is a category of decision that gets re-litigated every quarter.
- 01 · Vendor decisions outlive executive sponsors. Average AI exec tenure is roughly eighteen months; a vendor contract signed today is on the books for twenty-four. The scoring artefact is the institutional memory that keeps the decision defensible after the original sponsor leaves.
- 02 · RFP scoring must be defensible — not just the right answer. An RFP is an audit trail. Weighted axes, documented criteria, and named evaluators turn the procurement decision into a reviewable artefact rather than a relationship outcome. Audit and board scrutiny is the actual test.
- 03 · Reference calls reveal more than demos. Vendors control the demo; reference customers do not. Ten structured questions, asked of three or more references, surface integration cost, support reality, and renewal experience that no demo can.
- 04 · AI-specific contract clauses are non-negotiable. Training-data rights, model-update notification, output indemnification, data residency, and termination-for-deprecation are the clauses that distinguish a 2026 contract from a generic SaaS template. Skip them and renewal becomes a hostage negotiation.
- 05 · Build-vs-buy is per-capability, never portfolio-wide. Some capabilities favour build for differentiation, others favour buy for commodity efficiency, some favour the wrapped-buy third path. Decide each capability against the rubric, document the reasoning, and avoid the all-buy or all-build trap.
The ten-stage agentic AI implementation pipeline. Stage 4 turns the documented data foundation into procurement decisions; Stage 5 takes the selected stack into prototype.
- 01 Readiness assessment
- 02 Strategy & roadmap
- 03 Data foundation
- 04 Vendor selection · you are here
- 05 Prototype
- 06 Production deploy
- 07 Team enablement
- 08 Governance
- 09 Scale
- 10 Continuous improvement
01 — Why Stage 4 · Vendor selection outlives most exec sponsors.
The average tenure of a senior AI executive in 2026 is roughly eighteen months — shorter than the typical agentic AI vendor contract, materially shorter than the twenty-four-month roadmap window most teams plan against. That asymmetry is the single most under-appreciated feature of Stage 4. Whoever signs the contract is unlikely to be the person who renews it, defends it to the board after a model-update breaks production, or migrates off it when the vendor strategy shifts. The artefact has to survive the person.
That changes how the stage should be run. The output of Stage 4 is not a vendor — it is a documented scoring rubric, an RFP audit trail, a set of weighted axis scores per vendor, a contract with named AI-specific clauses, and a one-page build-versus-buy decision per capability. The vendor is a downstream consequence of those artefacts; the artefacts are what the next AI leader, the next CFO, and the next board audit will read. Treat them as the deliverable, and the vendor decision falls out cleanly.
The second reason this stage matters more than its budget line suggests: vendor lock-in compounds across the rest of the pipeline. A choice made at Stage 4 propagates into the prototype scaffolding in Stage 5, the production deployment in Stage 6, the team enablement programme in Stage 7, the governance posture in Stage 8, and the scale economics in Stage 9. Decisions taken casually at this stage carry weight through every subsequent stage; decisions taken deliberately at this stage create optionality at every subsequent stage. The framework that follows is designed to make the deliberate path cheaper than the casual one.
"The vendor is downstream of the artefacts. Build a scoring rubric your replacement can read three years from now, and the vendor decision falls out cleanly."— Digital Applied procurement engagements, on Stage 4 outputs
Two failure modes show up reliably when teams skip the artefact work. The first is the relationship-driven shortlist — a vendor invited to the RFP because the AI lead worked with their CTO at a previous role, with the scoring rationalised after the fact. The relationship may or may not be a good signal, but the scoring written after the decision is no longer an audit trail; it is justification. The second is the demo-driven decision — a vendor wins because the polished demo outshone its competitors, even though the demo capability is the easiest part of the vendor offering to replicate and the operational reality is closer to a different vendor's offering. The scorecard, rubric, and reference calls below are designed to neutralise both failure modes.
02 — Scorecard · Capability × maturity × support × pricing.
The vendor scorecard is the headline artefact of Stage 4. Each vendor is rated on the same eight axes, the axes are weighted, and the rolled-up score is the directional signal that drives the shortlist. The scorecard is not the final decision — the rubric in Section 04 refines it, and the build-versus-buy fork in Section 07 decides which capabilities never enter the procurement track at all — but it is the input to every subsequent step.
The matrix below is the template we have iterated on across agentic-AI vendor selections in the last eighteen months. The specific weights are starting points calibrated against mid-market product teams; regulated-sector teams should push the governance axis up, pre-PMF startups should push the differentiation axis up, and enterprise teams should weight the support and roadmap axes more heavily than the template suggests. The structure is the durable part; the weights are tuneable.
# vendor-scorecard.template.md
# Stage 4 · Agentic AI Implementation Pipeline
# Rate each vendor 1-5 per axis. Weighted total drives shortlist.
## Vendor: <name> Date: <yyyy-mm-dd>
## Capability under evaluation: <e.g. retrieval, agent runtime, eval>
## Evaluators (named): <eng-lead, security, finance, product>
| Axis | Weight | Score | Weighted | Evidence / notes |
|-----------------|-------:|------:|---------:|-------------------------------|
| Capability fit | 20% | _ | _ | Does it solve our use case? |
| Maturity | 15% | _ | _ | Years in market, customer N |
| Support | 12% | _ | _ | SLA, response time, on-call |
| Pricing model | 15% | _ | _ | Per-call, seat, flat, hybrid |
| Security posture| 10% | _ | _ | SOC2, ISO 27001, pen-test |
| Roadmap fit | 10% | _ | _ | Aligned with our 24-mo plan? |
| Lock-in risk | 10% | _ | _ | Switching cost, data portability|
| References | 8% | _ | _ | 3+ customers contacted |
|-----------------|--------|-------|----------|-------------------------------|
| WEIGHTED TOTAL | 100% | | _ | Threshold: 3.8/5 to shortlist |
## Disqualifiers (any single failure → out)
[ ] Fails security review (no SOC2 or ISO 27001 within 12 months)
[ ] Refuses training-data clause in contract
[ ] No data-residency option matching our compliance requirements
[ ] Single-customer-concentration risk (>40% of revenue from one customer)
[ ] No named-alternative migration path documented
## Decision
Shortlisted: [ ] yes [ ] no
Sponsor: <name, title>
Reviewed: <name, title>
Next step: <RFP / reference calls / disqualified>
Capability fit
The single largest axis. If the vendor cannot solve the specific use case in the capability brief, no other strength compensates. Score conservatively — a partial fit is a 3, not a 4.
Largest single axis
Commercial structure
Per-call versus per-seat versus flat-fee versus hybrid changes the twenty-four-month TCO by 2-3x at constant volume. The pricing model often matters more than the headline rate.
TCO-defining
Switching cost
Data portability, schema standardisation, named-alternative migration paths. The renewal-delta line on the buy-side TCO is fully quantifiable — model it explicitly here.
Renewal leverage
Two scoring disciplines keep the scorecard honest. The first: named evaluators per axis, with each axis scored independently before the totals are visible. Group consensus scoring produces anchored answers — once the engineering lead says "4 out of 5" on capability, every subsequent score tends to cluster around that anchor. Independent scoring followed by a reconciliation session produces meaningfully better signal. The second: a hard disqualifier list — any single failure removes the vendor from consideration regardless of weighted total. Security posture, training-data rights, data residency, and named-alternative migration paths are not negotiable axes; they are gates.
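The roll-up itself fits in a spreadsheet, but the gate logic is worth making explicit. The sketch below is an illustration only — the axis keys, placeholder scores, and helper names are assumptions, while the weights, the 3.8 shortlist threshold, and the any-single-failure rule come from the template above.

```python
# Illustrative roll-up of the scorecard: weighted total plus hard disqualifier
# gates. Weights and the 3.8 threshold mirror the template above; axis keys,
# scores, and gate names are placeholder assumptions.

WEIGHTS = {
    "capability_fit": 0.20, "maturity": 0.15, "support": 0.12,
    "pricing_model": 0.15, "security_posture": 0.10, "roadmap_fit": 0.10,
    "lock_in_risk": 0.10, "references": 0.08,
}
SHORTLIST_THRESHOLD = 3.8  # out of 5

def weighted_total(scores: dict[str, float]) -> float:
    """Roll per-axis 1-5 scores into a single weighted total."""
    return sum(WEIGHTS[axis] * score for axis, score in scores.items())

def shortlisted(scores: dict[str, float], gate_failures: dict[str, bool]) -> bool:
    """Any single disqualifier failure removes the vendor, regardless of total."""
    if any(gate_failures.values()):
        return False
    return weighted_total(scores) >= SHORTLIST_THRESHOLD

# Example: strong weighted total, failed security gate -> out anyway.
scores = {"capability_fit": 5, "maturity": 4, "support": 4, "pricing_model": 4,
          "security_posture": 2, "roadmap_fit": 4, "lock_in_risk": 3, "references": 4}
gates = {"fails_security_review": True, "refuses_training_data_clause": False,
         "no_matching_data_residency": False, "customer_concentration_over_40pct": False,
         "no_documented_migration_path": False}

print(round(weighted_total(scores), 2))  # 3.9 -- clears the 3.8 threshold
print(shortlisted(scores, gates))        # False -- the gate overrides the total
```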
03 — RFP Template · Scope, requirements, evaluation criteria, SLAs.
The RFP is the formal artefact that goes to shortlisted vendors, and it is the audit trail your replacement reads when the contract is up for renewal. Six sections, in this order, every time. The order matters — the scope section sets the frame, the requirements section is the substantive ask, the evaluation criteria are published transparently so the vendor knows what they are scored on, and the SLA / commercial sections close out the document with the operational and contractual asks.
# RFP-template.md
# Agentic AI Vendor Selection · Stage 4
# Issued: <yyyy-mm-dd> Response due: <yyyy-mm-dd, +21 days typical>
## 1. Scope & background (~1 page)
- Company overview & sector
- Current agentic AI maturity (reference Stage 1 assessment)
- Capability being procured (reference roadmap from Stage 2)
- Data foundation in place (reference Stage 3 artefacts)
- 24-month volume projection: <best / base / worst case>
- Out of scope: <capabilities NOT being procured here>
## 2. Functional requirements (~2 pages)
- Must-have capabilities (numbered list, 8-15 items)
- Nice-to-have capabilities (numbered list, 5-10 items)
- Integration points (existing systems we'll connect)
- Performance targets (latency p50/p95/p99, throughput, accuracy)
- Compliance requirements (SOC2, ISO 27001, sector-specific)
## 3. Non-functional requirements (~1 page)
- Data residency options
- Encryption (in transit, at rest, customer-managed keys)
- Audit logging (retention, export format, SIEM integration)
- Disaster recovery (RPO, RTO, geographic redundancy)
- Multi-tenancy isolation model
## 4. Evaluation criteria (~1 page)
- Published axis weights (the scorecard above)
- Disqualifiers (any single fail → no contract)
- Scoring process (named evaluators, reconciliation timeline)
- Reference call requirements (3+ customers, similar use case)
- Decision date: <yyyy-mm-dd>
## 5. SLA & support (~1 page)
- Required uptime: <99.9%? 99.95%?>
- Incident response: <P1 ack < 30 min, resolution < 4 hrs>
- Support tiers: <business hours vs 24/7, named TAM>
- Escalation path documented
- Service credits formula
## 6. Commercial structure (~1 page)
- Pricing model preference (per-call, per-seat, flat, hybrid)
- 24-month TCO at <best / base / worst> volume
- Renewal terms (max % increase, notice period)
- Termination clauses (for cause, for convenience, for deprecation)
- AI-specific clauses (training data, output rights, model update
notification — see Section 06 of the playbook)
## Submission instructions
- PDF + signed cover letter
- Named contact for clarification questions
- Confidentiality undertaking on the RFP itself
- Decision communicated within 14 days of due date
Two RFP discipline notes worth carrying forward. First, publish the evaluation criteria transparently — the axis weights, the disqualifiers, the threshold. Vendors who know they are scored on lock-in risk will write better answers to the lock-in questions; vendors who do not know will give generic answers and the team has to re-ask everything in clarification rounds. Transparent criteria save weeks of back-and-forth and produce higher-signal responses. Second, name a single point of contact for clarification questions and publish every clarification answer to every vendor in the shortlist. Selective clarification is a fairness failure and it surfaces in the post-decision audit when it matters most.
04 — Rubric · Weighted scoring across eight axes.
The evaluation rubric is the scorecard applied to the actual RFP responses. The eight axes below carry the weights documented in Section 02; the table that follows describes the substantive scoring criteria per axis — what a 5 looks like versus a 3 versus a 1, so that independent evaluators converge on comparable numbers rather than producing scores that mean different things.
Capability fit · weight 20%
5 = solves all must-haves and the majority of nice-to-haves out of the box, with documented evidence. 3 = solves the must-haves but requires custom integration work for some, OR meets all but performance targets are at the edge. 1 = significant capability gaps that would require either custom development or a second vendor.
Largest axis — score conservatively
Maturity · weight 15%
5 = 3+ years in market, 50+ enterprise customers, public case studies in adjacent sectors, named on Gartner / Forrester. 3 = 1-2 years in market, 10-30 enterprise customers, some case studies but in different verticals. 1 = under 12 months in market, fewer than 5 customers, customer concentration risk.
Stability proxy
Support · weight 12%
5 = named technical account manager, 24/7 P1 support with sub-30-min ack, dedicated Slack / Teams channel, quarterly business reviews. 3 = business-hours support, ticketing system, no TAM, ad-hoc escalation. 1 = community / email support only, no SLA, no named contact.
Renewal predictor
Pricing · weight 15%
5 = transparent pricing, predictable at scale, max 7-10% annual renewal increase, no per-seat surprises. 3 = standard pricing but unpredictable at scale (per-call drift), 12-15% renewal increase. 1 = bespoke pricing per customer, unbounded renewal increases, hidden overage charges.
TCO axis
Security · weight 10%
5 = SOC2 Type II + ISO 27001 within last 12 months, third-party pen-test report shareable, customer-managed encryption keys, SIEM-ready audit logs. 3 = SOC2 Type II only, standard encryption, shared audit log access. 1 = no recent SOC2, no pen-test, encryption at rest only.
Gate axis
Roadmap fit · weight 10%
5 = published 12-month roadmap aligned with our 24-month plan, customer advisory board access, beta-feature opt-in. 3 = roadmap shared under NDA but limited customer input. 1 = roadmap opaque, feature requests handled via ticket queue with no transparency.
Strategic alignment
Lock-in risk · weight 10%
5 = full data portability, open standards, documented migration path to named alternatives, no proprietary schema. 3 = data export available but in custom format, migration possible but expensive, mostly open standards. 1 = proprietary lock-in throughout, no export tooling, no documented migration path.
Renewal leverage
References · weight 8%
5 = 3+ reference customers in our sector or with our scale, all reachable, all answer the ten reference questions positively. 3 = references available but in different sectors or smaller scale, mixed answers to reference questions. 1 = fewer than 3 references, or references coached / limited in scope.
External validation
The rubric works hardest in the reconciliation session. Independent evaluators score each axis before they see anyone else's number; the reconciliation meeting works through disagreements axis by axis, with the evidence from the RFP response and the reference calls as the arbiter. The output is a single agreed score per vendor, with the evidence trail attached. That trail is what survives the executive transition.
"Independent scoring before reconciliation produces meaningfully better signal than group consensus from a blank page. Anchor bias is real and the reconciliation session is where you control it."— Pattern across vendor selections we have facilitated
05 — Reference Calls · Ten questions every reference should answer.
Reference calls are the highest-signal artefact in Stage 4 and the most consistently under-used. Vendors control the demo; vendors do not control what a reference customer says when asked the right questions in a structured thirty-minute call. The script below is the version we have iterated on across the last eighteen months — ten questions, in order, with follow-ups built in.
Run the script against three references minimum, ideally five. Vendors will offer references they expect to perform well; ask specifically for one reference at your scale, one in your sector, and one that has been with the vendor for more than eighteen months. The mismatched references are usually where the most useful signal lives.
# reference-call.script.md
# 30 min call · two-evaluator format · transcribed
## Pre-call (5 min before)
- Confirm: vendor X, capability Y, customer Z (sector, scale)
- Read their public case study if one exists
- Note any LinkedIn signals on the reference contact
## Q1 — Use case parity (5 min)
"What capability are you using vendor X for? How similar is it
to <our brief use case in one sentence>?"
Follow-up: scale, volume, integration depth.
## Q2 — Implementation reality (3 min)
"How long did implementation actually take, versus what the
sales team promised? What was the biggest surprise?"
Follow-up: who did the work — them, the vendor, a partner?
## Q3 — Time to first value (2 min)
"How long from contract signature to the first production use
case generating measurable value? What blocked the path?"
## Q4 — Support experience (3 min)
"What is your support experience actually like? Tell me about
the last P1 incident — how was it handled, ack time, resolution time."
Follow-up: do you have a named TAM? Do they actually respond?
## Q5 — Renewal experience (3 min) [if customer >12 months]
"How was your most recent renewal? Did the price increase
match what you signed up for? Any surprises?"
Follow-up: would you renew again?
## Q6 — Hidden costs (2 min)
"What did the contract not cover that you ended up paying for?
Implementation, training, integration, overages — anything?"
## Q7 — Roadmap responsiveness (2 min)
"Have you requested a feature in the last 12 months? What
happened? Was it on their roadmap, did they build it, what
was the timeline?"
## Q8 — Failure modes (3 min)
"What does it look like when this vendor's product fails? How
often, what category of failure, how do they communicate?"
## Q9 — What you'd do differently (2 min)
"If you were doing this procurement again, what would you do
differently with this vendor specifically?"
## Q10 — The unsolicited tell (2 min)
"Is there anything I should be asking that I haven't?"
This is the highest-signal question in the call.
## Post-call (5 min)
- Categorise: positive / neutral / negative on each axis
- Note any specific names, dates, dollar figures cited
- Flag any answer that contradicts the vendor's RFP response
Two operational notes. The two-evaluator format matters — one evaluator drives the questions, the other takes structured notes and watches for tells the lead may miss. Trying to do both is measurably worse, both for capture quality and for follow-up instinct. The other note: Q10 is the highest-signal question on the script. References who are coached or constrained tend to give terse answers to Q1 through Q9 and then volunteer the real story when asked the open-ended close. Almost every reference call we run produces its most useful sentence in the final two minutes.
06 — Contract · Termination rights, data ownership, AI-specific clauses.
The contract is where the scoring work converts into operational reality. Four clause categories distinguish an agentic-AI vendor contract from a generic SaaS template — training-data and output-rights clauses, model-update notification, AI-specific indemnification, and termination-for-deprecation. Each is a recurring negotiation point in 2026 procurement, each is quantifiable in renewal terms, and each is the type of clause that becomes a hostage negotiation if it is not handled at sign time.
Training-data & output rights
Customer data not used for model training
Customer-submitted prompts, documents, and outputs must not be used to train the vendor's models without explicit opt-in. Outputs are owned by the customer, with a vendor licence to operate the service. Without this clause, every prompt you send is training data.
Non-negotiable in 2026
Model-update notification
30-day notice + rollback window
Material model changes (version bumps, deprecations, behaviour drift) require 30 days' advance notice with a documented rollback path during a 60-day window. Without this clause, the vendor can change the product underneath you mid-quarter with no recourse.
Stability requirement
AI-specific indemnification
Output IP, training-data IP, GDPR exposure
Vendor indemnifies the customer against third-party IP claims arising from model outputs, against training-data provenance claims, and against GDPR / CCPA exposure from vendor data handling. Standard SaaS indemnification clauses do not cover AI-specific exposures.
Risk transfer
Termination for deprecation
Customer right to exit if capability degrades
Customer may terminate without penalty if a contracted capability is deprecated, materially degraded, or removed from the product. Includes pro-rata refund of prepaid fees and 90-day data-export support. Without this, you are paying for a product the vendor has stopped maintaining.
Exit ramp
Beyond the four AI-specific clauses, the standard SaaS contract checklist still applies — data residency commitments matching your compliance posture, encryption at rest with customer-managed keys where required, SOC2 / ISO 27001 evidence with annual refresh, named subprocessors with change notification, and a clear escalation path. Walk through each with procurement counsel before signing. The agentic-AI clauses are additions, not replacements.
Pricing-clause discipline is worth a separate note. Renewal caps on annual price increases — 7-10% is healthy, 12-15% is acceptable, anything uncapped is a future renegotiation under duress — are the single line item that protects you against the vendor capturing all the upside as your usage grows. Most vendors will negotiate the cap if pressed; most procurement teams never ask. Ask.
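Both points are easy to model before signing. The sketch below is back-of-envelope only — the call volume, per-call rate, flat fee, and 10% cap are invented numbers — but it shows why the pricing model and the renewal cap, not the headline rate, set the twenty-four-month TCO.

```python
# Back-of-envelope 24-month TCO under two pricing models, with a renewal cap
# applied at month 13. All figures below are placeholder assumptions.

def tco_per_call(monthly_calls: int, price_per_call: float, renewal_cap: float) -> float:
    """Year one at the signed rate, year two at the capped renewal rate."""
    year_one = 12 * monthly_calls * price_per_call
    year_two = year_one * (1 + renewal_cap)
    return year_one + year_two

def tco_flat(monthly_fee: float, renewal_cap: float) -> float:
    """Same two-year shape for a flat monthly fee."""
    return 12 * monthly_fee + 12 * monthly_fee * (1 + renewal_cap)

# Assumed base case: 400k agent calls/month at $0.004/call vs a $1,500/month flat fee.
print(f"{tco_per_call(400_000, 0.004, renewal_cap=0.10):,.0f}")   # 40,320
print(f"{tco_flat(1_500, renewal_cap=0.10):,.0f}")                # 37,800

# Same contracts at 3x volume: the flat line barely moves, the per-call line
# triples, and an uncapped renewal would widen the gap further.
print(f"{tco_per_call(1_200_000, 0.004, renewal_cap=0.10):,.0f}")  # 120,960
```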
"Every AI-specific clause you skip at sign time becomes a hostage negotiation at renewal. Negotiate them when you have leverage, not when you do not."— Standard advice across our vendor selection engagements
07 — Build vs Buy · The one-page decision.
Not every capability on the Stage 2 roadmap goes into the RFP. Some are built internally because they sit on the competitive surface; some are bought because they are commodity plumbing; some are wrapped — the vendor's implementation behind your own schema — to preserve switching optionality. The one-page build-versus-buy decision is the gate that decides which path each capability takes before procurement begins.
The decision sits at the top of the Stage 4 funnel — run it first, for every capability, before any RFP is drafted. The capabilities that fall to the build column never enter procurement; the capabilities that fall to wrap-buy enter procurement with a thin wrapper as part of the integration plan; the capabilities that fall to plain buy go through the standard RFP path described in Sections 02 through 06.
The build-vs-buy decision · per-capability signals
Source: Digital Applied build-vs-buy engagements, 2025-2026
The full version of the build-versus-buy framework lives in our dedicated MCP build-versus-buy TCO calculator — six axes, a twenty-four-month TCO model, an explicit switching-cost component, and the wrap-buy third path discussed there. The capability-by-capability decision is the same shape whether the underlying technology is MCP, retrieval, agent runtime, or anything else procured at this stage; the framework transfers directly.
The discipline that keeps build-vs-buy honest at month eighteen, when the picture has shifted: document the axis ratings and the TCO numbers at decision time, in a short markdown file alongside the capability in your repo. Rerun the framework against the documented baseline at every renewal cycle and at any 2x volume inflection. The teams that compound advantage at this stage are the ones that rerun the decision; the teams that compound regret are the ones that treat the original call as final.
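A minimal sketch of that rerun, assuming the baseline was recorded at decision time and the buy side is a simple per-call contract — every figure below is a placeholder, not a benchmark:

```python
# Rerun the build-vs-buy comparison against the documented baseline at each
# renewal cycle and each 2x volume inflection. All numbers are placeholder
# assumptions recorded at decision time, not benchmarks.

baseline = {
    "capability": "retrieval",
    "decision": "buy",
    "monthly_calls": 300_000,
    "price_per_call": 0.005,     # signed rate
    "build_tco_24mo": 140_000,   # engineering estimate: build + run for 24 months
}

def rerun(monthly_calls: int) -> str:
    """Re-derive the buy-side 24-month TCO at current volume and re-take the call."""
    buy_tco = 24 * monthly_calls * baseline["price_per_call"]
    verdict = "buy holds" if buy_tco < baseline["build_tco_24mo"] else "revisit: build may now win"
    print(f"{monthly_calls:>9,} calls/mo  buy={buy_tco:>9,.0f}  build={baseline['build_tco_24mo']:,}  -> {verdict}")
    return verdict

rerun(300_000)    # decision-time volume: 36,000 vs 140,000 -> buy holds
rerun(600_000)    # first 2x inflection:  72,000 vs 140,000 -> buy holds
rerun(1_200_000)  # second 2x inflection: 144,000 vs 140,000 -> the answer flips
```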
08 — Next Stage · Hand-off to prototype (Stage 5).
Stage 4 concludes with a hand-off package to Stage 5 prototype. Five artefacts go in the hand-off, each produced by the work described above, each named in the prototype brief so the Stage 5 team starts with the documented context rather than reconstructing it from memory.
Selected vendor stack
Named vendor per capability, with the scorecard total, the rubric evidence, and the contract reference. The artefact your Stage 5 prototype team needs first.
Vendor list
Build queue
Capabilities that fell to the build column, with the rationale (differentiation, volume × cadence, thin vendor market) and a placeholder engineering estimate. Goes into the Stage 5 scaffolding.
Internal capabilities
Wrap-buy specs
Capabilities procured plus their wrapper specs — your schema, vendor's implementation. The wrapper specs feed directly into Stage 5 integration design.
Third-path capabilities
Contract clause register
Every executed contract with its AI-specific clauses, SLA terms, renewal dates, and named alternatives documented. The artefact that future renewals and audits read.
Procurement record
TCO baseline
Twenty-four-month projected TCO per capability and rolled up across the portfolio. The baseline against which Stage 5 prototype results, Stage 6 production economics, and Stage 9 scale economics are measured.
Economic baseline
With those five artefacts in hand, Stage 5 begins from a documented baseline — the prototype team knows which vendors are in scope, which capabilities are being built internally, which wrappers need to land before integration, what the contractual constraints are, and what the economic envelope is. The next playbook in the series, the Stage 5 prototype templates, picks up from here — prototype brief, eval harness, success criteria, and the prototype-to-production gate. For teams that want a partner running Stages 4 and 5 alongside the internal team, our AI transformation engagements cover the full procurement-to-prototype hand-off as a single deliverable.
"The Stage 5 prototype works hardest when the Stage 4 artefacts are documented well. Most prototype failure modes trace back to undocumented vendor decisions made one stage earlier."— Digital Applied agentic AI engagements, on the Stage 4 → Stage 5 hand-off
Vendor selection is governance — and the artefact is the rubric.
Stage 4 is the procurement stage of the agentic AI pipeline, and the highest-leverage operating discipline at this stage is the scoring artefact rather than the vendor decision itself. The vendor matters; the documented rubric, RFP, reference notes, contract clauses, and build-versus-buy decisions matter more. They are what survive the eighteen-month executive transition, the twenty-four-month renewal cycle, and the inevitable board audit that arrives sometime in between.
The templates above — eight-axis scorecard, six-section RFP, published evaluation rubric, ten-question reference script, four-clause AI contract checklist, and the per-capability build-versus-buy decision — are the artefact set we have iterated on across the last eighteen months of agentic-AI vendor selections. They are deliberately concrete; the value of Stage 4 comes from running every capability through every template, consistently, and documenting the result. Casual application of this stage produces vendor portfolios that age badly; deliberate application produces portfolios that compound advantage through the rest of the pipeline.
The next concrete step is short. Pick the one capability your team is closest to procuring, run the scorecard against the current shortlist of vendors in the next sixty minutes, draft the RFP from the template against that capability tomorrow, and book the reference calls for the week after. By the time the contract is signed, the artefact set is documented, the rubric is in the repo, and Stage 5 prototype can start the moment the procurement ink dries. That is the cadence Stage 4 is designed to enable.