This MCP server rollout at a marketing agency is the case study we point clients at when they ask whether an agentic AI rollout pays for itself at a sub-fifty-person scale. Twelve consultants across four practice groups — analytics, CRM, content, design — were building one-off AI workflows in isolation, reinventing the same credential plumbing and the same retrieval shapes every quarter. Over ninety days we replaced that with a shared MCP capability catalog, a single distribution model, a security audit baseline, and a governance charter the agency now owns and runs itself.
The agency has asked to stay anonymous. Names, account identifiers, and tool-internal labels are redacted; the four capability areas, sequencing, distribution model, audit baseline, governance gates, and outcome shape are reproduced faithfully from the engagement notebook. Where numbers appear they are the ones in our records; where ranges appear it is because the underlying variance was larger than a single point estimate would honestly convey.
This guide pairs cleanly with our org-deployment playbook — the 30/60/90-day MCP rollout plan — and with the audit baseline used in §04. If you are running this internally on a similar scale, the two documents are designed to be read together; the playbook supplies the abstract sequencing, this case study supplies the concrete shape.
- 01 · Capability catalog stops tool reinvention. Twelve consultants building parallel AI workflows is twelve copies of the same credential plumbing, retrieval shape, and report template. A shared MCP capability catalog across analytics, CRM, content, and design replaces that with one well-audited surface every consultant calls into.
- 02 · Distribution model matches agency scale. Project-scoped distribution is the right pick at twelve consultants and four capability areas. Plugin and registry models add ceremony that pays off at much larger scale; choosing them here would have wasted the platform team's quarter and slowed shipping without measurable upside.
- 03 · Security audit before rollout, not after. Every server passed the 75-point audit before being added to the catalog. Servers that surfaced existing vulnerabilities were remediated before consumer teams could wire against them; the audit became a feature of the rollout, not a retroactive obligation.
- 04 · Governance gates compound. Four gates — addition, change, audit, deprecation — fit on a two-page charter and prevented the failure cluster we usually see at this scale: catalogs without ownership, servers shipping before audit, charters deferred indefinitely, re-audits skipped.
- 05 · Quarterly re-audit aligned to client cycles. Re-audit cadence was anchored to the agency's client-reporting cycles rather than to a calendar quarter. The audit packs surface findings before client deliverables touch them; the cadence became the control without becoming a tax.
01 — Situation
Twelve consultants, four practice groups, one shared problem.
The agency runs four practice groups — paid media and analytics, CRM and lifecycle, content and editorial, and brand and design — staffed by twelve senior consultants and a small platform team shared across all four. Each practice group had been quietly building its own AI scaffolding over the previous year: a custom GA4 retrieval wrapper here, a HubSpot-extract script there, a content brief generator built directly against a vendor API, a brand-system retrieval over the agency's asset library. None of these were unreasonable on their own; the problem was that there were twelve of them.
The first conversation that surfaced the rollout was a senior account director asking why three different consultants had independently built three different versions of a tool that pulled the same campaign-performance metrics out of the same GA4 properties. The honest answer was because no one owned a shared one. The deeper problem was that each version had its own credential scope, its own audit-log story (mostly: none), its own assumptions about which properties were in scope, and its own bug surface. Multiplying that pattern across analytics, CRM, content, and design was untenable, and the agency knew it.
- Senior practitioners across 4 groups (Senior · cross-functional). Most had built at least one bespoke AI workflow against a vendor API. Few had written internal MCP servers; none had reviewed each other's. Tool reinvention was structural, not lazy.
- Paid · CRM · content · design (Vertical specialisation). Each group had its own vendor surface, its own retrieval shape, and its own auth model. The platform team supported all four but had no shared abstraction across them.
- Counted at week one (Inventory: doubled). Nine distinct internal AI tools surfaced in the phase-one inventory survey. Self-reported count was four. The grep-the-org-for-SDK-imports step is what closed the gap.
- Pre-rollout baseline (Sprawl risk · high). Zero shared MCP catalog, zero shared auth library, zero centralised audit log, zero governance charter. Every consultant integrating AI was integrating against a different surface.

Two specific incidents in the six weeks before the engagement sharpened the brief. First, a consultant left and the GA4 integration she had built turned out to have been running under a shared service-account credential that nobody else knew the scope of; the credential was rotated reactively and three live workflows broke. Second, a content-team script that wrapped a vendor API for editorial drafts was found to be logging the full vendor response — including PII passed through in customer-research inputs — to a debug file that had quietly grown to several gigabytes. Neither incident was catastrophic; both made the case for a shared, audited surface much more concrete.
The leadership ask was specific: one capability catalog, one distribution model, one audit baseline, one governance charter, and a quarterly cadence that the platform team can run without our help after handover. The ninety-day horizon was their constraint, not ours; the agency's client-reporting cycle made starting fresh in any other quarter operationally awkward, and the platform team had a roughly three-person-month budget to spend on internal infrastructure before client work re-claimed the calendar.
"The problem wasn't that consultants were building AI tools. The problem was that twelve of them were building the same plumbing in twelve different ways, and nobody owned the shared surface."— Agency platform lead, engagement kickoff notes
02 — Approach · Catalog
Four capability areas, one shared catalog.
The phase-one inventory closed with nine MCP-eligible workflows grouped under four capability areas. Each area got a single shared server in the new catalog, with the per-consultant variants either folded in as tool variations or deprecated outright. The grid below names each area, the tools it exposes, and the downstream systems each one wraps; the design rule across all four was the same — shared auth library, scope-bound tokens, centralised structured audit logging at the tool boundary.
One design decision worth flagging up front: narrow tool surfaces, not omnibus tools. The temptation in phase two was to expose "search the CRM" or "run an analytics query" as a single tool with a generic argument shape; we rejected those in favour of named, narrow tools that map cleanly onto specific consultant workflows. Narrow tools audit better, scope better, and version better; omnibus tools collapse the audit trail and make scope-limiting nearly impossible.
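To make the distinction concrete, here is a minimal sketch of a narrow, named tool at registration time. It assumes the TypeScript MCP SDK's `McpServer.tool()` registration shape, and the server and tool names (`analytics-capability`, `campaign_performance`) are illustrative rather than the agency's actual identifiers.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "analytics-capability", version: "1.0.0" });

// Narrow: one named tool per consultant workflow, with an explicit argument shape.
// The tool name becomes the unit of auditing, scoping, and (later) kill-switching.
server.tool(
  "campaign_performance",
  {
    clientId: z.string().describe("Client scope the caller's token must carry"),
    propertyId: z.string().describe("GA4 property in scope for this client"),
    dateRange: z.enum(["last_7d", "last_28d", "last_90d"]),
  },
  async ({ clientId, propertyId, dateRange }) => {
    // ...read-only retrieval against one property, one window; no mutation path...
    return {
      content: [
        { type: "text", text: `campaign metrics for ${propertyId} (${dateRange}), client ${clientId}` },
      ],
    };
  }
);

// Rejected omnibus shape: a single "run_analytics_query" tool taking free text.
// It would accept anything, so the audit log can no longer say what was asked for,
// and per-client scope checks have nothing concrete to bind to.
```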
- Analytics capability (GA4 · Search Console · ad platforms). Read-only retrieval over GA4 properties, Search Console properties, and the agency's three ad-platform accounts. Narrow tools: campaign performance, audience composition, content-page query mapping, anomaly detection windows. No mutation tools in this server. Read-only · scoped per client.
- CRM capability (HubSpot · vendor lifecycle stack). Read and gated-write against the CRM. Contact and company retrieval, list-membership read, lifecycle-stage read, note creation gated behind explicit confirmation. Mutating tools require both the consultant's and the AI's confirmation in the agent loop. Mixed · confirmation-gated.
- Content capability (Brief library · editorial corpus). Retrieval over the agency's brief library, the per-client editorial corpus, and the brand-voice reference packs. Narrow tools per workflow: brief retrieval, prior-asset lookup, brand-voice anchor retrieval, fact-check source pull. Outputs flow back into the consultant's editor, not directly into publishing surfaces. Read-only · client-scoped.
- Design capability (Asset library · design-system tokens). Retrieval over the asset library, brand-system tokens, and the design-system component reference. Read-only by default; a narrow create-revision tool was prototyped late in phase three and held back for a phase-four engagement after the rollout closed. Read-only · prototype hold.

Two of the four servers in the catalog hold a small number of mutating tools (CRM note creation, design-system revision prototype); the other two are read-only by design. We weighted heavily toward read-only on the first pass because the audit-and-rollback story for mutations is qualitatively harder than for reads, and the agency's appetite for risk on mutations was — sensibly — lower than its appetite for retrieval-driven AI assistance.
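For the confirmation-gated writes described above, the sketch below shows one plausible shape for the two-step gate on a mutating CRM tool: draft first, write only against an explicit confirmation token. The helper names (`draftCrmNote`, `createCrmNote`) and the token mechanics are our illustration, not the agency's shared-library API.

```typescript
import { randomUUID } from "node:crypto";

interface NoteArgs {
  clientId: string;
  contactId: string;
  body: string;
}

// Drafted mutations the consultant has seen but not yet approved.
const pendingConfirmations = new Map<string, NoteArgs>();

// Step 1: the agent calls a draft tool; the note is surfaced to the consultant for
// review, and a confirmation token is returned instead of a write happening.
export function draftCrmNote(args: NoteArgs): { confirmationToken: string; preview: NoteArgs } {
  const confirmationToken = randomUUID();
  pendingConfirmations.set(confirmationToken, args);
  return { confirmationToken, preview: args };
}

// Step 2: the write tool runs only if the token matches the exact arguments the
// consultant previewed; no silent argument drift between draft and write.
export function createCrmNote(confirmationToken: string, args: NoteArgs): void {
  const approved = pendingConfirmations.get(confirmationToken);
  if (!approved || JSON.stringify(approved) !== JSON.stringify(args)) {
    throw new Error("Mutation rejected: arguments were not explicitly confirmed");
  }
  pendingConfirmations.delete(confirmationToken);
  // ...perform the gated write against the CRM here...
}
```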
One pattern that paid off across all four servers: per-client scope tokens, not per-consultant tokens. Every catalog server validates that the calling agent's token carries an explicit client-scope for the workflow it is running. The shared auth library enforces this; the consultant doesn't have to remember. The result is that cross-client data leaks — the single highest-cost failure mode in an agency context — become structurally hard to ship, not merely policy-disallowed.
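A short sketch of how that check can live in the shared auth library, assuming the token carries an explicit list of client scopes; the claim and function names (`clientScopes`, `assertClientScope`) are illustrative.

```typescript
// Shape of the agent token the shared auth library issues and validates.
// The clientScopes claim is the per-client scope described above.
interface AgentToken {
  subject: string;        // the calling consultant or agent runner
  clientScopes: string[]; // explicit client identifiers this token may touch
}

// Called by every catalog tool handler before any downstream request is made.
// Cross-client access fails here, structurally, rather than by policy.
export function assertClientScope(token: AgentToken, requestedClientId: string): void {
  if (!token.clientScopes.includes(requestedClientId)) {
    throw new Error(`Token for ${token.subject} is not scoped for client ${requestedClientId}`);
  }
}
```

Because the check sits in the library rather than in each server, any new capability area inherits it without extra work.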
03 — Approach · Distribution
Project-scoped distribution — the right weight at twelve consultants.
The three distribution models we usually recommend — project-scoped, plugin-marketplace, and internal registry — each fit a different band of organisational scale. The phase-one inventory at this agency closed with nine workflows across twelve consumers; the shape of that population pointed cleanly at the lightest of the three. Heavier models would have brought ceremony — manifest schemas, hosted registries, signing pipelines — that pays off at larger scale and would have impeded shipping here.
Project-scoped in practice meant a single internal monorepo holding the four catalog servers, published to an internal npm scope, consumed by each consultant's Claude Desktop config and the agency's internal agent runner. Versioning was per server. Updates were pull-based on the consumer side, with a shared release-notes feed the platform team posts to. The full decision matrix is in our org-deployment playbook; the choice here was made on three specific signals.
- Project-scoped (Picked · monorepo + npm scope). Lowest ceremony. Single internal monorepo, internal npm scope, per-server versioning, pull-based update path. Fits a small consumer count (twelve here), strong cross-team awareness, and a platform team willing to own the monorepo as a discrete product. 12 consumers.
- Plugin-marketplace (Rejected · ceremony exceeds payoff). Middle weight. Manifest schema, host-driven discovery, plugin packaging. Fits three to ten consumer teams across a shared host. The agency operates as one team with four practice groups, not as multiple consumer teams — the manifest schema would have been ceremony without benefit at this scale. Wrong shape.
- Internal registry (Rejected · scale not yet there). Heaviest weight. Hosted catalog endpoint, signing pipeline, discovery API, supply-chain controls. Fits ten or more consumer teams or strict supply-chain requirements. The agency would re-evaluate this if the catalog grew past roughly fifteen servers or onboarded an external-facing consumption surface. Re-pick gate later.

One framing rule held across the decision: pick the lightest model that still works, with a documented re-pick gate when the population doubles. The platform team committed in writing to revisiting the model when the catalog crossed roughly fifteen servers or when an external-facing consumption surface — for instance, a client-side agent — required signed packages. Until then, the project-scoped model is the explicit choice rather than the default.
A second pattern worth naming: signing was deferred, not skipped. At twelve consumers, git provenance plus internal npm scope plus monorepo access control gave the supply-chain story enough teeth. The governance charter explicitly names the trigger for moving to a signing pipeline; the platform team is not pretending signing doesn't matter, just that the cost of building it now exceeds the marginal safety it would add at this scale.
04 — Approach · Audit
Seventy-five-point audit before the catalog opened.
Every server in the catalog passed the 75-point security audit before being added; servers that failed were remediated before consumer teams could wire against them. The matrix below names the three audit pathways we ran and what each one returned. The audit baseline is the same one documented in our companion checklist; the application here was per-server, with the four servers in the catalog audited sequentially as they hit phase-two completion.
One sequencing rule held across the audit: distribution channel live but empty until audit pass. The npm scope went live in week five with zero published servers. Catalog servers were added one at a time as each audit cleared. The discipline of the freeze is what makes the audit gate real rather than aspirational; without it, the channel becomes convenient, servers ship through it, and the audit becomes a retrospective "we'll get to it" that never lands.
- Audit pathway one (Shared library · scope-bound). Token validation, scope enforcement, secret-store integration, rotation paths. Every catalog server imports the shared auth library; the audit checks each server actually uses it, with no direct token handling. One server initially carried a fallback secret-store path that bypassed the library; this was remediated before catalog entry. Library-mediated only.
- Audit pathway two (Named tools · structured logs). Each tool is named, narrow, declares its scope, and emits one structured audit log line per invocation (see the sketch after this list). Argument shapes are logged; argument values are redacted at source unless explicitly opted in. Tool response bodies live in an access-controlled store separate from the audit stream. Source-side redaction.
- Audit pathway three (Per-server register · enumerated). Each server names its plausible abuse paths — credential exfiltration, prompt-injection-driven mutation, cross-client data leak, rate abuse — and the control that mitigates each. The register becomes a living document, attached to the catalog entry, refreshed at every quarterly re-audit. Living document.
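To ground pathway two, the sketch below shows one plausible shape for the one-line-per-invocation audit record with source-side redaction: argument keys and runtime types are kept, values are dropped unless a field is explicitly opted in. The field and function names are illustrative, not the agency's actual logging schema.

```typescript
interface AuditLine {
  ts: string;                        // ISO timestamp of the invocation
  server: string;                    // catalog server, e.g. the CRM capability
  tool: string;                      // narrow tool name
  clientScope: string;               // client the token was scoped to
  argShape: Record<string, string>;  // argument key -> runtime type, never the value
  outcome: "ok" | "error";
}

// Redaction happens at source: values never enter the audit stream unless a
// field name is explicitly opted in by the tool's author.
function redactArgs(args: Record<string, unknown>, optIn: string[] = []): Record<string, string> {
  return Object.fromEntries(
    Object.entries(args).map(
      ([key, value]): [string, string] => [key, optIn.includes(key) ? String(value) : typeof value]
    )
  );
}

export function auditLine(
  server: string,
  tool: string,
  clientScope: string,
  args: Record<string, unknown>,
  outcome: "ok" | "error"
): AuditLine {
  return {
    ts: new Date().toISOString(),
    server,
    tool,
    clientScope,
    argShape: redactArgs(args),
    outcome,
  };
}

// auditLine("crm-capability", "contact_lookup", "client-a", { contactId: "c_123" }, "ok")
// records { contactId: "string" }; the identifier's value never reaches the stream.
```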
Three categories of finding surfaced across the four-server audit pass. Critical findings — two in total — were closed inside phase two: a service-account credential scoped wider than the consumer-facing tool required, and a content-server tool that wrote vendor responses to a debug log without redaction. Both were the kind of finding the phase-one incidents had foreshadowed; both took less than a day to remediate once named. High findings — eleven in total — were sorted onto a remediation backlog with named owners and target dates; all but one were closed before catalog open. Medium and low findings — twenty-three in total — were filed onto the quarterly re-audit schedule with no phase-two pressure to land.
The most consequential finding in the entire pass was on the smallest server. A content-retrieval tool that wrapped a single read against the editorial corpus turned out to hold a credential scoped for convenience rather than for least authority; the same credential could mutate the corpus the tool was nominally read-only against. Size is uncorrelated with blast radius. The full audit baseline is in our 75-point MCP server security audit checklist — every catalog server was audited against every point in that checklist, every time.
05 — Approach · Governance
Four gates, two pages, one charter.
The governance charter the agency now owns is two pages, names four gates, and was written and ratified inside phase three. The shape is the one we recommend in the org-deployment playbook; the specifics were adapted to the agency's approval flows and to the per-client scope-token rule. Charter ratification — platform lead, senior account director, and the head of the content practice signing — happened in week ten; the charter went live in week eleven alongside the re-audit schedule.
Two practical notes on writing the charter. First, length matters less than enforceability. The two-page draft we shipped is enforceable because every gate names its evidence — what artifact closes the gate, who reviews it, and where the decision is logged. A longer charter without named evidence is a longer document, not a stronger control. Second, the charter is the document that lets the platform team say no without it becoming a political fight every time a consultant proposes a new server. That mattered more than expected once the catalog opened and consumer demand started to outpace the platform team's phase-three throughput.
- Addition gate (Audit · owner · scopes · cadence). A new server is addable only after passing the 75-point audit, naming a primary and secondary owner, declaring its scopes, and accepting the quarterly re-audit cadence. The approval path lists named reviewers — platform lead plus the relevant practice-group senior. Decision logged in the catalog entry. Evidence: catalog entry.
- Change gate (Material vs non-material). Material changes — new tools, new downstream systems, new scopes — trigger a delta audit and re-approval. Non-material changes — bug fixes, refactors that don't touch tool surface — do not. The split is enumerated explicitly in the charter so the practice groups don't have to negotiate the boundary case-by-case. Evidence: change log.
- Audit gate (Quarterly · client-cycle aligned). Quarterly re-audit on every active server, aligned to the agency's client-reporting cycles rather than to a calendar quarter. Each re-audit is a re-run of the 75-point checklist with a one-page diff against the prior pack, sorted by severity, with remediation evidence required to close each finding. Evidence: audit pack.
- Deprecation gate (Owner · audit pass · consumer count). Servers without owners, servers that fail two consecutive re-audits without remediation, and servers without consumers for two quarters are deprecated on a documented timeline. The deprecation is announced, the migration path named, the calendar honoured. No silent removals. Evidence: deprecation notice.

One operational pattern from this phase worth calling out: the per-tool kill-switch. The shared auth library ships with a feature-flag check at each tool entry point. The platform lead and the on-call rotation each have flip authority without further approval; flipping a flag disables a single tool catalog-wide within minutes, independent of deploy. The kill-switch hasn't had to be used in production at this agency yet; the value of having it is that the incident-response runbook stops being theoretical.
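A sketch of the kill-switch shape, assuming a flag check at every tool entry point; the in-memory set below stands in for whatever central flag store the shared library actually consults, and the names are illustrative.

```typescript
// Flags flipped by the platform lead or the on-call rotation, read by every
// catalog server at each tool entry point. No deploy required to change them.
const disabledTools = new Set<string>();

export function disableTool(server: string, tool: string): void {
  disabledTools.add(`${server}/${tool}`);
}

// Called by the shared library before a tool handler runs; a flipped flag turns
// the tool off catalog-wide within minutes, one tool at a time.
export function assertToolEnabled(server: string, tool: string): void {
  if (disabledTools.has(`${server}/${tool}`)) {
    throw new Error(`Tool ${server}/${tool} is disabled by the platform team`);
  }
}

// Example: disableTool("crm-capability", "create_note") stops CRM note creation
// for every consumer while leaving the rest of the catalog untouched.
```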
On the on-call rotation: with a twelve-consultant agency and a small platform team, a formal pager rotation would have been ceremony exceeding the threat model. The compromise the charter names is business-hours coverage with named escalation — the platform lead is the primary contact during agency hours, with a senior consultant in each practice group named as the secondary for tools their group owns. Out-of-hours coverage is opt-in via the on-call channel; this fits the agency's actual incident shape better than imposing twenty-four-seven pager culture on a team that doesn't run client services overnight.
06 — Outcomes
Where the ninety days actually landed.
The outcomes below are the ones we measured at day ninety, with a soft check-in at day one-eighty for the cadence-related metrics (re-audit pass rate, consumer adoption, charter enforcement instances). Numbers are rounded where the underlying variance was larger than the rounding error; ranges appear where a single point estimate would be misleading.
The headline outcome is the one in the takeaways: tool reinvention was eliminated. Nine parallel AI workflows became four shared catalog servers, all twelve consultants now call into the same surface for analytics, CRM, content, and design retrieval, and the platform team owns one well-audited substrate instead of supporting twelve loosely-coupled ones. The structural change is what compounds; the per-metric outcomes below are evidence that the structural change held.
Structural outcomes · pre-rollout vs day 90 (source: engagement notebook, day-90 review)

On softer outcomes — time-to-deliverable, consultant satisfaction, the rate at which the platform team gets pulled into one-off integration questions — the picture at day ninety is consistent rather than dramatic. Time-to-deliverable on analytics-driven reports dropped by a band the agency characterised as "noticeable but not transformational"; the platform team reported being pulled into roughly a third of the one-off integration conversations they had previously been pulled into; consultants surveyed at day eighty named the shared catalog as the biggest single change to their workflow over the quarter. None of those are headline-grade numbers in isolation; the structural change is what makes them compound over the following quarters.
- Day 90 to day 180 (Structural · holding). No new parallel AI workflows surfaced in the six months after catalog open. New consultant proposals routed through the addition gate. The catalog became the path of least resistance.
- First quarterly re-audit (Cadence · enforced). All four catalog servers passed the first quarterly re-audit at the day-180 check-in. Twelve new medium findings filed onto the next-quarter backlog; no critical or high findings opened.
- First six months (Governance · active). Two proposal-stage rejections under the addition gate — both for missing-owner reasons, both resolved by the proposing consultants naming primary and secondary owners before re-submission. The charter held without becoming a political fight.
- Centralised at source (Substrate · single). One structured audit-log stream across all four catalog servers. PII redaction at source. Retention aligned with the agency's client-reporting policy. The stream becomes evidence for the next quarterly re-audit.

"The single change we'd call non-negotiable in retrospect: every server passes the audit before being added to the catalog. Everything else is sequencing — that one is the rule." — Agency platform lead, day-90 retrospective
One caveat worth being honest about: the rollout was paid engineering time the agency reallocated from client work for one quarter. The structural payoff is real; the cost was real too. For agencies considering whether to replicate this internally, the right framing is that ninety days of focused platform-team investment buys you a substrate that the following quarters of client AI work sit cleanly on top of — but the ninety days have to come from somewhere, and the discipline of the freeze (no new servers shipping outside the new pipeline during phase one) is harder than the engineering. That part is procedural, not technical.
07 — Lessons · Replication
What we'd keep, what we'd change, what to copy.
Three lessons we'd carry forward to any similar agency rollout, and one we'd change in retrospect. Each is anchored in something that surfaced during the ninety days, not in abstract best practice.
Keep: narrow tools, not omnibus tools. The single biggest design decision that paid off across all four servers was the discipline of named, narrow tools per workflow. Omnibus tools — a single "query the CRM" with a free-text argument — would have collapsed the audit trail and made scope-limiting nearly impossible. Narrow tools audit cleanly, scope cleanly, and version cleanly; they also give the kill-switch a meaningful unit of operation. We'd hold that line again.
Keep: per-client scope tokens, not per-consultant tokens. Cross-client data leak is the single highest-cost failure mode in an agency context. Per-client scope tokens, validated by the shared auth library, made the leak structurally hard to ship rather than merely policy-disallowed. Every catalog server inherits this from the library; the consultant doesn't have to remember. We'd hold this line again, harder than we held it the first time.
Keep: re-audit cadence aligned to client cycles. Anchoring the quarterly re-audit to the agency's client-reporting cycles rather than to a calendar quarter made the cadence operationally compatible with how the agency actually plans its workload. Calendar-quarter cadence would have collided with peak client deliverable weeks for at least two of the four practice groups. Map the cadence onto the agency's rhythm, not onto an abstract calendar.
Change: prototype the design-system create tool inside phase two, not at the boundary of phase three. The create-revision tool for the design-system server was prototyped late in phase three and held back for a phase-four engagement because we ran out of audit-and-review runway. With hindsight, we should have either committed the prototype tool to phase two with a clear audit slot, or named it out of scope at the engagement brief. The middle path — "let's see if we can squeeze it in" — cost more attention in the closing week than it saved.
For agencies considering whether to run this internally, the plan in this case study is sufficient to drive a credible rollout on your own. The reason agencies engage us is rarely capability; it is calibration — having seen the failure cluster and the sequencing trade-offs across a handful of similar rollouts, the early decisions get made faster and the governance charter lands in the form that prevents the failure modes we've seen most often. If that calibration matters, our agentic AI transformation engagements include agency MCP rollouts as a discrete deliverable. If it doesn't, the plan above is yours to run, and the org-deployment playbook is the abstract sequencing it sits on.
Agency MCP rollouts stop tool reinvention and compound consultant velocity.
The structural change at the heart of this rollout is the one in §02: nine parallel AI workflows became four shared catalog servers, all twelve consultants now call into the same well-audited surface, and the platform team owns one substrate instead of supporting twelve loosely-coupled ones. The structural change is what compounds across the quarters that follow; the per-metric outcomes are evidence that the structural change held.
The sequencing was the second half of the story. Discovery and audit first; distribution channel built but empty until audit pass; two new servers through the new pipeline as the forcing function; governance charter ratified before the catalog opened; quarterly re-audit cadence anchored to the agency's client cycles rather than to a calendar quarter. Each step earns the next; skip the sequencing and the failure cluster — catalogs without ownership, ship-before-audit, charter deferred, re-audit skipped — is what lands.
For agencies at a similar scale, the next concrete step is the phase-one inventory survey across the practice groups and a named platform-team lead with a documented three-month budget. The rest of the plan follows. The rollout doesn't need a larger team or longer runway to ship at this scale; it needs the discipline of the freeze, the audit before the distribution, and the charter before the catalog opens.