This case study covers an AI coding rollout at a 100-developer engineering organisation. The team sustained a 28% productivity lift through month six and grew a shared skill library to 32 reviewed entries — but the headline that mattered to the leadership team was simpler: both tools survived contact with developer preference, the productivity panel survived contact with the CFO, and the quarterly board update produced extension funding rather than scrutiny.
The org runs four product squads, twelve services, and a split-personality editor culture — roughly 60% IntelliJ-family, 30% VS Code-family, and a long tail of Vim and Emacs holdouts. A previous single-tool pilot had been abandoned at the 30-dev mark because mandating one editor created more friction than the AI tooling removed. This rollout took the opposite stance from day one: tolerate the editor choice, standardise everything else.
What follows is the playbook that worked: the situation, the four approach pillars — dual-tool rollout, shared skill library, hook and memory discipline, and the productivity panel — and the outcomes tracked across two quarterly reviews. Names and specifics are anonymised; numbers are exact.
- 01 — Dual-tool tolerates IDE preferences. Running Claude Code and Cursor in parallel cost less than a single-tool mandate would have lost in adoption and goodwill. Editor wars are not the hill to die on; consistency lives one layer up.
- 02 — Shared skill library is the bridge. Thirty-two reviewed skills covering review, refactor, scaffold, debug and observability tasks were the substrate that let both tools produce comparable output. The library — not the tool — was the standards layer.
- 03 — Hook + memory discipline scales. Pre-commit hooks, project memory templates, and a settings.json baseline kept the rollout from drifting into a hundred bespoke configurations. The discipline got harder, not easier, at scale.
- 04 — Productivity panel must be engagement-weighted. Raw lines-of-code or PRs-per-week metrics were rejected by both engineers and finance. The panel that survived weighted engagement and quality signals (successful skill traces, clean PR acceptances, defect-rate drift and developer-reported friction), not tool output volume.
- 05 — Quarterly board updates secure executive air-cover. Two formal quarterly reviews with the executive team — same metric panel, same narrative shape — converted the rollout from an engineering experiment into a funded program. Air-cover compounds.
01 — Situation
A previous pilot had failed on tool mandate.
The engineering organisation had attempted a Claude Code rollout twelve months prior. The pilot covered roughly 30 developers, ran for ten weeks, and was quietly retired before reaching general availability. The post-mortem reached the leadership team with three findings: developer adoption stalled at ~45% inside the pilot cohort, half the cohort was using a different editor than the one the pilot mandated, and the metrics panel commissioned to justify the rollout produced numbers neither engineering nor finance trusted.
The new rollout was scoped explicitly to fix those three failures at once. The first reframe was the most important: the previous pilot had treated editor choice as a controllable variable. This rollout treated it as a constraint. The second reframe was the metrics panel — rebuilt from scratch around engagement and quality signals rather than raw output. The third was governance: quarterly board updates were committed to before the rollout began, not bolted on after the first complaints.
Four product squads
Twelve services spanning a transactional core, a customer-facing web app, two internal admin tools, and a data platform. Tech stack: TypeScript / Go / Python / a small Kotlin perimeter. Editor split roughly 60 / 30 / 10.
60% IntelliJ / 30% VS Code / 10% other

Prior adoption ceiling
The 12-month-prior pilot reached 45% active adoption inside its 30-dev cohort before adoption stalled. Mandated editor switching, no shared skill library, and a metric panel finance refused to sign off on were the three named blockers.
Pilot abandoned at month 3

Board reviews committed up front
Two formal quarterly engineering reviews were committed before the rollout started — one at month 3, one at month 6. Same panel, same narrative shape. Executive air-cover was treated as a deliverable, not an output.
Month 3 + month 6

The strategic insight from the prior pilot is worth naming directly: a developer-tooling rollout at 100-dev scale is not primarily an engineering problem. The engineering work is real but compressed. The hard work is governance, communication, metrics that survive cross-functional scrutiny, and tolerance for local variation in places where the variation is not load-bearing. The tool mandate had failed because the team got the load-bearing layers backwards.
02 — Approach
Dual-tool rollout — both tools, both paid for.
The first architectural decision was to fund both Claude Code and Cursor licences for every developer in scope. The cost delta against a single-tool rollout was material — roughly 1.7× licence spend — but the team treated it as buying optionality rather than buying tools. The framing communicated to engineering was unambiguous: the standards layer is the skill library, the hooks, the memory templates, and the productivity panel. The tool is what each developer prefers to drive.
That stance produced a specific working pattern. Claude Code ran predominantly as a terminal-and-CLI surface for refactors, multi-file edits, codebase navigation, and review automation — heavy users in the JetBrains cohort kept their IDE and used Claude Code in a side pane. Cursor ran predominantly as an in-editor surface for line-level completion, in-context generation, and inline review — heavy users in the VS Code cohort kept their editor and reached for Claude Code only when a task needed terminal automation.
Claude Code
CLI · multi-file · hook-aware
Primary surface for refactors, multi-file edits, codebase navigation, dependency audits, review automation. Hooks and project memory loaded automatically. Heavy users in the JetBrains cohort drove this from the terminal while keeping their IDE.
~62% primary-use share at month 6

Cursor
in-editor · inline · context-aware
Primary surface for line-level completion, in-context generation, inline review, and quick scaffolds. Heavy users in the VS Code cohort drove this and used Claude Code only for terminal-shaped tasks. Shared skill library callable from both surfaces.
~38% primary-use share at month 6

The friction the dual-tool stance produced was not zero. Two specific issues showed up in the first quarter. First, version drift: Claude Code and Cursor ship at different cadences, and the skill library had to be authored against the narrower of the two feature surfaces. Second, prompt portability: a prompt tuned to Claude Code's terminal context occasionally produced different output when invoked from Cursor's in-editor context — same model class, different surrounding context. Both issues had observable fixes; neither justified retreat to a single-tool mandate.
"We stopped paying for tools and started paying for tolerance. The standards layer was never the editor — it was always one level up."— Engineering Director, mid-quarter 1 review
03 — Approach
A shared skill library — 32 reviewed entries, both tools.
The skill library was the rollout's standards layer. Each skill was a versioned bundle of instructions, allow-listed tools, and an example trace — invokable from either Claude Code or Cursor via a shared registry. The starting kit shipped with nine skills authored by the platform team in week one. Six months later the library held thirty-two, with roughly half contributed by squad-level engineers and reviewed by the platform team before merge.
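The case study doesn't publish the registry schema, but the description above maps onto a small typed manifest. A minimal sketch in TypeScript, assuming hypothetical field names and a file-based registry:

```typescript
// Hypothetical shape for a registry entry. Field names are illustrative,
// not the org's actual schema; only the listed components (instructions,
// allow-listed tools, example trace, owner, version) come from the case study.
interface SkillManifest {
  name: string;         // namespaced, e.g. "global/..." or "squad-a/..."
  version: string;      // semver, bumped on every reviewed change
  owner: string;        // accountable engineer or team
  description: string;
  instructions: string; // the prompt body both tools load on invocation
  allowedTools: string[]; // allow-list enforced at invocation time
  exampleTrace: string; // path to the recorded trace used as a review test
}

// Illustrative entry; the skill name and tool identifiers are invented.
const dependencyAudit: SkillManifest = {
  name: "global/dependency-audit",
  version: "1.4.0",
  owner: "platform-team",
  description: "Report outdated or vulnerable dependencies for a service.",
  instructions: "Scan the lockfile, cross-check advisories, emit a summary table.",
  allowedTools: ["read_file", "run_command:audit"],
  exampleTrace: "traces/dependency-audit.example.json",
};
```

Deploying a bundle like this to both surfaces then reduces to syncing the registry, which is what the CLI-command deployment described below amounts to.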
The growth curve was not linear. The library grew quickly in weeks two through six as squads identified obvious repeating tasks — JIRA-to-PR scaffolds, service-template generation, dependency audit reports. Growth slowed deliberately in weeks seven through twelve as the platform team imposed a contribution review process. The slow-down was a feature, not a bug; the team wanted the curve to flatten before unreviewed skills proliferated.
Reviewed skill entries
Authored by the platform team (9) and squad engineers (23). Each skill ships with a description, invocation pattern, allow-listed tools, an example trace, and an owner. Versioned in a dedicated repository; deployment is a CLI command on both surfaces.
9 at week 1 · 32 at month 6

Skill categories
Review (8 skills), Refactor (7), Scaffold (6), Debug (6), Observability (5). The shape was deliberately uneven — the categories developers reached for most were the ones the library invested in. Coverage was not the goal; usage was.
Review-heavy by design

At steady state
Roughly 1,400 skill invocations per week across both tools at month six. Distribution skewed strongly to the top 8 skills (Pareto held). Long-tail skills were retained even at low call rates because contribution velocity was treated as a leading indicator of engagement.
Top 8 ~78% of calls

The most important governance decision was who could ship a skill. Initial proposals from squad engineers ranged from "anyone with merge rights" to "platform team only". The compromise that held: any engineer can propose, but every skill needs a platform-team reviewer and an example-trace test before it lands in the shared registry. Squad-local skills are permitted but namespaced and excluded from the global library — a release valve that reduced contribution friction without diluting the shared catalogue.
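A compromise like that is easy to enforce mechanically. A minimal sketch of the registry gate, with hypothetical submission fields; only the namespacing, platform-reviewer and trace-test rules come from the case study:

```typescript
// Illustrative submission record for a proposed skill.
interface SkillSubmission {
  name: string;              // e.g. "global/dependency-audit" or "squad-a/release-notes"
  platformReviewer?: string; // set once a platform-team member has signed off
  traceTestPassed: boolean;  // the example trace replays cleanly against the skill
}

// Returns true only when the submission may land in the shared global registry.
// Squad-local skills fail the namespace check and stay in their own namespace.
function canEnterGlobalRegistry(s: SkillSubmission): boolean {
  if (!s.name.startsWith("global/")) return false; // namespaced out of the global library
  if (!s.platformReviewer) return false;           // platform-team review is mandatory
  return s.traceTestPassed;                        // example-trace test must pass
}
```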
04 — Approach
Hook + memory discipline — harder at scale.
Hooks and memory templates were the third pillar — and the one that almost broke. At a 30-dev shop, hook configuration is a single shared template that everyone copies. At a 100-dev shop spread across four squads and twelve services, the same shared template approach produces conflicts within weeks: a squad needs a pre-commit hook the platform template doesn't cover, a service needs a memory exclusion the rest of the org doesn't want, an engineer overrides settings.json locally and the override silently propagates through skill invocations.
The rollout produced a discipline layer in response. Hook and memory configuration was split into three named tiers: a global tier owned by the platform team, a squad tier owned by each squad's lead engineer, and a personal tier owned by the individual. Each tier was versioned in source control, audited quarterly, and explicitly documented in the developer onboarding. The discipline cost roughly two hours of platform-team time per week and produced fewer than ten configuration incidents across the six months.
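The tier cards below detail ownership and conflict rules; mechanically, the precedence they describe reduces to a small ordered merge. A minimal sketch, assuming hypothetical config shapes:

```typescript
// Hook config modelled as a flat map from hook name to a versioned hook
// reference. The shape is illustrative; the precedence (global > squad >
// personal, personal additive only) is the case study's stated rule.
type HookConfig = Record<string, string>;

function mergeTiers(
  global: HookConfig,
  squad: HookConfig,
  personal: HookConfig,
  logConflict: (tier: string, key: string) => void,
): HookConfig {
  // Start from the personal tier: it may add entries but never win a conflict.
  const merged: HookConfig = { ...personal };

  // Squad tier overrides personal; overridden keys are logged for the
  // quarterly review rather than silently dropped.
  for (const [key, value] of Object.entries(squad)) {
    if (key in merged && merged[key] !== value) logConflict("squad-over-personal", key);
    merged[key] = value;
  }

  // Global tier always wins, including over squad entries.
  for (const [key, value] of Object.entries(global)) {
    if (key in merged && merged[key] !== value) logConflict("global-over-lower", key);
    merged[key] = value;
  }

  return merged;
}
```

The design point is that personal config is applied first and overwritten last, so it can extend the surface but can never silently override a governed hook — the incident class that killed the prior pilot.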
Platform-team owned
Pre-commit hooks for secret scanning, lint gating and test-suite triggers. Memory templates for the shared engineering style guide and the architecture documentation index. Read-only to squads and engineers; changes require a platform-team PR with a quarterly-review entry.
Audit quarterly

Owned by the squad lead
Squad-specific pre-commit hooks (service-template enforcement, squad-owned deployment guard rails). Squad memory entries pointing at squad-owned runbooks and architecture-decision records. Changes are squad-lead-merged; quarterly review with the platform team checks for drift against the global tier.
Squad lead approves

Individual override
Personal preferences — favourite output formats, editor-shaped tweaks, individual memory bookmarks. Stored in the developer's home directory and never merged. Cannot override a global or squad hook; can extend with additive rules only. The release valve that kept the global and squad tiers clean.
Additive only

Conflict resolution
Global tier always wins on conflict; squad tier wins over personal; personal can never silently override either. Conflicts logged automatically and surfaced in the quarterly review. The discipline cost real platform-team hours but eliminated the silent-override class of incident the prior pilot had been killed by.
Global > squad > personal

The non-obvious finding was that the discipline got harder, not easier, as the rollout matured. At week two, the platform-team tier was three hooks and four memory entries — trivial to maintain. At month six, the global tier was eleven hooks and twenty-three memory entries — a meaningful surface area, and a steady drumbeat of small platform-team work to keep current. The squad tier's surface area grew faster still. Teams that attempt this should plan platform-team capacity assuming the discipline gets harder over time, not lighter.
"At thirty devs, hook config is a shared template everyone copies. At a hundred devs, hook config is governance — and governance is real work."— Platform Engineering Lead, end of quarter 2
05 — Approach
The productivity panel — engagement-weighted, not output-weighted.
The metrics panel was the part most likely to repeat the prior pilot's failure. The prior pilot had used raw output metrics — lines of code, PRs per week, time-to-merge — and the numbers had been rejected by both finance (because they showed unattributable variance) and engineering (because the measurement felt punitive). The replacement panel was rebuilt from scratch with three goals: the engineers had to consent to the measurement, finance had to be able to reason about the attribution, and the panel had to behave the same way at the individual, squad and org levels.
The panel that shipped weighted four signals: skill invocations with successful trace completion, PRs accepted with no review-cycle defects, developer-reported friction (a two-question weekly pulse), and defect-rate drift across the twelve services. Raw output volume was deliberately excluded. Lines of code and PR count were available in the dashboard but relegated to context cards rather than headline metrics. The engineers signed off because the panel measured outcomes they cared about; finance signed off because the four signals triangulated each other.
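The exact weighting was not published. A minimal sketch of how a four-signal, engagement-weighted index could be composed, with placeholder weights that are purely illustrative:

```typescript
// The four panel signals from the case study. Field names are illustrative.
interface WeeklyPanelSignals {
  successfulSkillTraces: number; // skill invocations completing their trace cleanly
  cleanPrAcceptances: number;    // PRs merged with no review-cycle defects
  frictionPulse: number;         // two-question weekly pulse, 0 (none) to 1 (high)
  defectRateDrift: number;       // defect-rate change vs. baseline, e.g. -0.03
}

// Composes an index normalised to 100 at the pre-rollout baseline.
// The weights below are placeholders; the real panel's calibration was
// co-designed with engineering and finance and is not published here.
function engagementIndex(now: WeeklyPanelSignals, baseline: WeeklyPanelSignals): number {
  // Positive signals scale with their ratio to baseline.
  const traces = now.successfulSkillTraces / baseline.successfulSkillTraces;
  const prs = now.cleanPrAcceptances / baseline.cleanPrAcceptances;
  // Friction and defect drift count against the index.
  const friction = 1 - (now.frictionPulse - baseline.frictionPulse);
  const quality = 1 - now.defectRateDrift;

  const index = 100 * (0.35 * traces + 0.3 * prs + 0.2 * friction + 0.15 * quality);
  return Math.round(index);
}
```

At baseline inputs the index evaluates to 100 by construction, which is what lets the same formula behave identically at the individual, squad and org levels.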
[Chart: Engagement-weighted productivity index, 6-month trajectory. Source: anonymised engagement data, engineering-leadership dashboard.]

The shape of the curve is worth noting: the lift was front-loaded into months two and three, plateaued through month four as hook discipline absorbed attention, and reached steady state at month five. The flat stretch from month four onward was not a failure — it was the engineering team converting acute productivity wins into sustained capacity, which the panel captured as a quality-side signal rather than an output-side signal.
The panel survived the first quarterly board update on its own merits and was ratified for the second. That ratification became the rollout's most valuable asset: a productivity measurement the executive team trusted enough to extend funding against. The two-question weekly pulse fed back into the panel as a leading indicator — when developer-reported friction ticked up two weeks in a row, the platform team treated it as a priority signal and audited the global hook tier.
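The two-weeks-up rule is simple enough to state as code. A minimal sketch, assuming a weekly friction series; only the two-consecutive-increases trigger comes from the case study:

```typescript
// Returns true when reported friction has risen two weeks in a row —
// the leading-indicator condition that triggered a global hook-tier audit.
function shouldAuditHookTier(weeklyFriction: number[]): boolean {
  const n = weeklyFriction.length;
  if (n < 3) return false; // need two week-over-week deltas to compare
  return (
    weeklyFriction[n - 1] > weeklyFriction[n - 2] &&
    weeklyFriction[n - 2] > weeklyFriction[n - 3]
  );
}
```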
06 — Outcomes
Six months in, three numbers that mattered.
The outcomes section is short on purpose. Three numbers carried the second quarterly board update — productivity lift, library scale, and adoption depth — and a small set of stability signals confirmed the gains were not paid for in quality. The executive team extended the program; the engineering team absorbed the rollout into its ongoing platform work without ceremony.
Sustained at month 6
Engagement-weighted productivity index reached 128 at month 5 and held through month 6. The lift came from skill-library leverage in review, refactor and scaffold tasks. Lines-of-code and PR-count context cards moved less, as expected — the panel captured the qualitative shift the raw-output panel had missed.
Index 100 → 128

Reviewed entries · 5 families
Nine starting skills grew to thirty-two reviewed entries across review (8), refactor (7), scaffold (6), debug (6) and observability (5). Roughly 1,400 weekly invocations at steady state, with top-8 skills accounting for ~78% of calls. Squad-local skills (~14) added on top, not counted in the global library number.
1.4k invocations / week

Active weekly users
Eighty-seven of one hundred developers used at least one of the two tools in any given week through months four to six. The prior pilot peaked at forty-five percent. The dual-tool stance, not the standards layer, was the principal driver of the depth.
Up from 45% in prior pilot

Quality-side signals confirmed the lift was real. Defect rate across the twelve services drifted slightly downward over the six months — not a step change, but no regression. PR review cycle counts were stable. The two-question weekly pulse trended mildly positive from month three onwards. Developer satisfaction in the annual engineering survey, conducted at month seven, registered the rollout as one of the most-cited positive themes.
The cost ledger was uncomplicated. Licence spend ran at roughly 1.7× a single-tool rollout. Platform-team headcount stayed constant; the rollout absorbed about 0.6 FTE of platform-team attention through quarter one, falling to roughly 0.3 FTE through quarter two as the discipline matured. The executive team approved an extension and a modest expansion of platform-team capacity to support continuing library and hook governance.
"The dual-tool stance was the line item. The skill library, hooks, and panel were the program."— CTO summary, quarter 2 board update
07 — Lessons + Replication
What we'd replicate — and what we'd change.
Five replication notes from the six-month engagement. The pattern generalises across engineering orgs of similar shape and scale; the specifics of metric weights, library coverage and hook tiers will need re-tuning per environment.
- Fund both tools deliberately. The 1.7× licence spend was the cheap line. The single-tool mandate that would have saved it would have cost more in adoption, goodwill and second-pilot-failure risk. Communicate this line in the executive deck as "buying tolerance," not as a procurement choice.
- Build the skill library before the second tool. Nine starter skills covering the three highest-frequency tasks (review, refactor, scaffold) were the precondition for both tools producing comparable output. Without that substrate, the dual-tool stance becomes a two-culture problem within a quarter.
- Plan platform-team time for hook governance. Hook and memory discipline gets harder, not lighter, at scale. Budget two to four hours of platform-team time per week for the global tier from week eight onward. The squad tier will demand similar from each squad lead.
- Co-design the productivity panel with engineering and finance before week one. Engagement-weighted metrics produced cross-functional sign-off; raw-output metrics produced cross-functional rejection. The four-signal panel (skill traces, accepted PRs, friction pulse, defect-rate drift) is a portable starting template.
- Commit to quarterly board updates before the rollout starts. Two formal reviews, same panel, same narrative shape. Executive air-cover compounds — the second review converted the rollout into a funded program rather than an experiment. Skipping the quarterly cadence is the single cheapest way to revert a rollout to pilot status.
Single-tool plus shared kit
Below roughly fifty engineers the dual-tool overhead may not justify the tolerance benefit — a single-tool rollout with a shared skill kit and lightweight hook discipline is typically enough. The 30-dev pattern is documented separately.
Single-tool baseline

Dual-tool + standards layer
The pattern this case study documents. Fund both tools, invest in the skill library, govern hooks in tiers, ship an engagement-weighted productivity panel, commit to quarterly board updates. Plan 0.6 FTE of platform-team capacity through quarter one falling to 0.3 FTE through quarter two.
Adopt this pattern

Federated platform model
Above roughly two hundred fifty engineers the platform-team-as-bottleneck risk grows fast. A federated model — platform team sets policy and audits; per-domain platform engineers run library and hook governance for their domains — is the expected next step. Patterns from this case study become inputs, not the architecture.
Federate the platform

Add compliance overlay
Financial services, healthcare and public-sector orgs will need an audit overlay on the hook and skill registries — every change tracked, every invocation logged, every quarterly review producing a compliance artefact. The pattern accommodates this with discipline overhead rather than redesign.
Layer audit on top

One change we'd make on a redo: ship the productivity panel in week one with deliberately incomplete data, rather than at week six with the first three months ready. The panel-design conversation was the rollout's highest-leverage cross-functional artefact, and front-loading it would have cut three weeks of executive uncertainty out of the first quarter. The cost — looking a little under-prepared in week one — would have been worth it.
For engineering directors evaluating a similar move, the decisive question is whether the standards layer can be built and held one level above the tool. If yes, the dual-tool stance is on the table and the rollout pattern documented here applies with local tuning. If no — if the org cannot sustain a skill library, a tiered hook discipline and an engagement-weighted panel — the dual-tool stance will produce two cultures and should be deferred. Our team runs these engagements as part of our AI & digital transformation practice, and the playbook documented in our 30/60/90 day rollout plan and the companion 30-dev shop case study covers the pre-conditions in detail.
Larger teams scale on shared infrastructure, not shared opinions.
The conclusion the engineering team converged on, mid-quarter two, was simpler than the rollout itself: at 100-dev scale, the standards layer is what matters and the tool layer is what varies. Editor wars are not load-bearing; skill libraries, hooks and panels are. Building the shared infrastructure first and then tolerating tool preference second was the strategic move that produced both the productivity number and the executive air-cover.
The replication path is portable. The dual-tool stance, the 32-entry library, the tiered hook discipline, the four-signal productivity panel and the quarterly board cadence are not the only configuration that works — they are an existence proof that the configuration can work at this scale, and that the executive team will fund it if the rollout earns its cross-functional sign-off. Most of the engineering directors we talk to are closer to that pattern than they realise.
The single insight we'd leave a peer engineering director with: the next rollout's headline is unlikely to be a number on the panel. It is more likely to be the simple organisational fact that two tools, two cultures and a hundred developers held a single set of standards for six months without a single re-org. That is what the productivity lift was paying for.