
Ten thousand tickets, sixty-percent repeat questions — how a fintech reached 51% deflection without CSAT damage.

Case Study: Agentic Customer Support at Fintech, 50% Deflection

A mid-market fintech support function carrying roughly ten thousand inbound tickets per month, sixty percent of them repeat archetypes, reached 51% deflection at month five without dropping CSAT. This piece is the case study underneath the result — the knowledge audit, the RAG build, the CSAT-controlled pilot, the escalation rubric, and the lessons that replicate.

Digital Applied Team · AI strategy
Published May 15, 2026 · Read time: 13 min · Sources: client engagement, 2025-2026

  • Deflection at month 5: 51% (month-5 steady state, from 0%)
  • CSAT delta: ±1.5pts (inside gate band)
  • Hallucinated-policy rate: <0.4% (compliance metric)
  • Timeline: 5 months (audit → ramp)

Agentic customer support at a mid-market fintech reached 51% deflection at month five without CSAT damage. The function was carrying roughly ten thousand inbound tickets a month, sixty percent of them concentrated in a handful of repeating archetypes, and the team had a year-end mandate to reduce cost-per-ticket without conceding the customer experience that had defined the brand. This piece is the case study underneath that outcome — the program that produced it, the gates that held, and the lessons that replicate.

The headline number is not the interesting part. Fifty-percent deflection has been achievable for several years with sufficient tolerance for CSAT damage; the harder problem is fifty-percent deflection while CSAT holds. That problem is mostly about program design — knowledge audit, RAG build sequencing, pilot governance, escalation rubric — and very little about model selection. The model is the easy part of an agentic support program; the program around it is the expensive part to get right.

What follows is the engagement as it ran. The situation as we found it, the four-stage approach (knowledge audit, RAG build, CSAT-controlled pilot, escalation rubric), the outcomes against the original targets, and the lessons that any other support function can apply. The fintech vertical produced specific constraints — regulated communication, hallucinated-policy sensitivity, audit-trail expectations — but the underlying playbook generalises broadly beyond that vertical.

Key takeaways
  1. Knowledge audit sits upstream of every RAG decision. Three weeks of disciplined audit work — owner, last-updated date, product-match status, archetype mapping — set the deflection ceiling for the entire engagement. Skip the audit and the model becomes a confident-hallucination engine; ship the audit and the same model produces durable answers.
  2. CSAT-controlled pilot prevents damage that is hard to undo. Opening the pilot at 1% traffic with three-layer CSAT instrumentation wired before launch gave the program a fortnight of feedback before any ramp decision. Teams that ramp from 0% to 25% in the first month consistently surface CSAT damage at the quarterly review — too late to roll back cleanly.
  3. Escalation rubric is half the customer experience. Per-archetype confidence floors, a never-deflect list signed off by support and legal, and an agent-facing context payload were as load-bearing as the model itself. The CSAT damage we have seen across other deployments is almost always handoff context loss, not bot wrongness.
  4. Hallucinated-policy rate is the compliance metric for fintech support. Standard deflection metrics miss the failure mode that actually matters in regulated contexts: the model stating a policy that is wrong or non-existent. The engagement instrumented hallucinated-policy detection as a first-class metric, held it under 0.4%, and treated any breach as a compliance event rather than a quality event.
  5. Ramp lags pilot signal — that is the point, not a bug. Deflection at month five was 51%, but it was 8% at month two and 22% at month three. The lag is the gates doing their job — holding deflection at the level the archetype mix and the CSAT trend can actually support, rather than the level the ramp plan optimistically promised. Build the gates; trust the gates.

01 · Situation: Ten thousand tickets, sixty-percent repeat questions.

The client was a mid-market fintech, two years past Series B, with a customer base that had grown faster than the support organisation underneath it. Inbound volume was running at roughly ten thousand tickets per month across email and in-product chat. The team had grown to twenty-two agents and was about to make the call to either keep hiring or invest in automation; the math on the hiring path no longer fit the unit-economics targets for the year, so the decision came down to whether automation could safely take a meaningful slice of volume.

The ticket mix was concentrated. A rough sixty percent of volume sat inside the top twenty archetypes — account-status questions, payment confirmations, settlement-timing inquiries, password and access flows, document re-issuance, basic on-boarding questions. The remaining forty percent was long-tail, including the categories where automation was already disallowed by policy: dispute initiation, fraud reports, churn-saves, and anything regulated. The shape of the volume distribution was the engagement's opportunity — a deflection ceiling that was likely real, but only if the program could avoid contaminating the long-tail with confident-but-wrong responses.

The original brief was "reach 60% deflection" — a number the client had pulled from a vendor pitch deck. The engagement reframed the target inside the first two weeks. The right target was the highest deflection number the CSAT band would allow, instrumented and gated, with hallucinated-policy rate as a hard ceiling. That reframing is the most consequential decision we made in the entire engagement.

The brief, restated
The client asked for 60% deflection. The engagement delivered 51% deflection with CSAT held inside a ±1.5-point band and hallucinated-policy rate under 0.4%. The lower deflection number is the right outcome — the gates did exactly what they were designed to do. Headline deflection alone is not the metric; CSAT-protected deflection is.

Two further situational notes mattered for the design of the program. First, the client had a strong existing knowledge base that had not been audited in roughly eighteen months — every article was reachable from the help centre, but product behaviour had moved underneath several dozen of them. Second, the regulatory context was sensitive enough that any policy statement produced by the AI had to be defensibly grounded in current documentation, with an audit trail back to the source. Both points shaped the knowledge audit and the RAG build that followed.

02 · Approach · Knowledge Audit: Four intent categories, audited before the model.

The first three weeks of the engagement were spent entirely upstream of the model. No customer traffic touched any AI in this window; the work was an inventory of every help-centre article, internal SOP, and macro template, mapped against the top archetypes pulled from the previous ninety days of ticket data. The output was a four-bucket intent catalog and a flag list — every document that was stale, missing, contradictory, or unowned. That flag list became the work backlog for the remainder of the audit window.

The decision to organise the audit into four intent categories rather than a flat archetype list was load-bearing for the rest of the program. Each category carried a different deflection ceiling, a different confidence floor, and a different escalation pattern. Bundling them into a single "top archetypes" list would have produced an undifferentiated model with under-tuned confidence thresholds — the most common failure mode we see in support-AI deployments.

Category 01
Account & access · high-volume
~38% of volume · password, login, MFA, profile

Highest deflection ceiling. Stable archetypes, well-documented resolution paths, low compliance risk. The category where the program could ramp confidence thresholds aggressively without compliance exposure.

Deflection ceiling: high
Category 02
Payment status & settlement timing
~22% of volume · order state, settlement windows, receipts

Lookup-heavy archetypes that required clean order-state and settlement API integration. Deflection ceiling depended more on tool-integration quality than on language-model quality — the audit flagged the API surface as the limiting factor here.

Deflection ceiling: tool-bound
Category 03
Policy & documentation
~18% of volume · fees, limits, eligibility, terms

The compliance-sensitive category. Hallucinated-policy rate would be measured primarily inside this bucket. The audit found roughly a third of policy documents were stale or contradictory — these became the highest-priority remediation work before the pilot opened.

Compliance-bound
Category 04
Triage, escalation & long-tail
~22% of volume · disputes, fraud, churn, ambiguous

The never-deflect category. The program's job inside this bucket was clean triage to a human queue with full context — not deflection. Designing the escalation rubric for this category was a separate workstream because the rubric was the customer experience, not the model.

Escalation-bound

One operational detail is worth surfacing from the audit work. The six-column row shape (doc ID, title, owner, last-updated date, product-match status, mapped archetype) was deliberately narrow — the goal was a spreadsheet the support team would actually maintain after the engagement closed, not an elaborate tracker that would atrophy. The audit found a hundred and forty-three articles needing some level of update; roughly forty were updated before the pilot opened, the rest were triaged into "update by day 60" or "deprecate" buckets. The deflection ceiling at month five would have been lower by an estimated ten to fifteen percentage points without this work.
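The audited row shape can be sketched as a small record type with a triage rule layered on top. The field names and buckets below follow the description above; the priority rule (remediate priority archetypes before the pilot opens) is an illustrative assumption, not the engagement's exact logic.

```python
from dataclasses import dataclass

# Sketch of the six-column audit row described above.
@dataclass
class AuditRow:
    doc_id: str
    title: str
    owner: str
    last_updated: str      # ISO date of the last edit
    product_match: str     # "current" | "stale" | "unknown"
    archetype: str         # mapped ticket archetype ("" if unmapped)

def triage(row: AuditRow, priority_archetypes: set[str]) -> str:
    """Bucket a flagged article into the remediation backlog."""
    if not row.archetype:
        return "deprecate"                 # no archetype maps to it
    if row.product_match == "current":
        return "ok"
    # Stale or unknown grounding: fix priority archetypes before the pilot.
    if row.archetype in priority_archetypes:
        return "update-before-pilot"
    return "update-by-day-60"
```

The narrow shape is the point: a row a support-ops owner can keep current in a spreadsheet, with the triage logic simple enough to apply by hand.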

"The audit is the launch. The model is just the interface."— Field note · this engagement, week 2

03 · Approach · RAG Build: Top-200 archetypes, category-aware retrieval.

The retrieval layer was built against the top-200 ticket archetypes identified during the audit. Two-hundred rather than one-hundred because the category split needed enough breadth inside each bucket to avoid retrieval gaps — the long tail inside a category produces the awkward edge-case answers that erode CSAT faster than the head archetypes produce deflection wins.

Chunking strategy and re-ranker selection were category-driven rather than uniform. Account-and-access flows tolerated coarser chunking because the resolution paths were short. Policy-and-documentation flows required fine-grained chunking with policy-version metadata attached to every chunk, so the model could refuse to answer when the most recent policy version was not in the retrieval set. Payment-status queries pulled from structured API data, not embedded text, and were treated as a tool-call path rather than a retrieval path. This category-aware design was the most consequential engineering decision in the build.
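That category-aware split can be sketched as a small routing config plus a grounding check. The category keys, chunk sizes, and field names here are illustrative assumptions; the behavior they encode (coarse chunks for access flows, version-gated policy grounding, tool calls for payment lookups, escalation for the long tail) follows the description above.

```python
# Hypothetical per-category grounding config; values are illustrative.
CATEGORY_CONFIG = {
    "account_access": {"path": "retrieval", "chunk_tokens": 800},  # coarse chunks
    "payment_status": {"path": "tool_call"},                       # order-state / settlement APIs
    "policy_docs":    {"path": "retrieval", "chunk_tokens": 200,   # fine-grained
                       "require_current_version": True},
    "long_tail":      {"path": "escalate"},                        # never answered from retrieval
}

def grounding_ok(category: str, chunks: list[dict], current_version: str) -> bool:
    """Refuse to answer when a version-gated category lacks current grounding."""
    cfg = CATEGORY_CONFIG[category]
    if cfg["path"] != "retrieval":
        return False                       # handled by a tool call or a human
    if not cfg.get("require_current_version"):
        return bool(chunks)
    # Policy answers require at least one chunk at the current policy version.
    return any(c.get("policy_version") == current_version for c in chunks)
```

The refusal path is the load-bearing part: a policy question with only stale chunks in the retrieval set produces a handoff, not an answer.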

The build, in one paragraph
Retrieval grounded against the top-200 archetypes, chunked category-by-category, re-ranked with a small per-category re-ranker tuned on held-out queries. Policy chunks carried version metadata so the model could refuse on stale grounding. Payment lookups bypassed retrieval and called the order-state and settlement APIs directly. Two weeks of build, one week of retrieval validation, one week of held-out evaluation before any generation ran through it.

Retrieval quality was validated against a held-out set of three hundred and fifty real customer queries — fifty per category, plus an additional long-tail batch — before any generation step ran through it. The held-out evaluation produced a per-category retrieval-quality score that drove chunking adjustments for the categories that landed below target. The discipline of validating retrieval before generation is one of the cheapest improvements to a RAG program; debugging a hallucination caused by a retrieval gap is meaningfully harder than debugging one caused by the model itself, and the held-out evaluation surfaces the former before it ever reaches a customer.
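The per-category validation step can be sketched as a recall-at-k score over the held-out set; the record shape (category, ranked doc IDs, gold doc ID) is an assumption for illustration.

```python
def per_category_recall(heldout: list[tuple[str, list[str], str]],
                        k: int = 5) -> dict[str, float]:
    """heldout rows: (category, ranked_doc_ids, gold_doc_id).
    Returns recall@k per category, driving chunking fixes where it lands low."""
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    for category, ranked, gold in heldout:
        totals[category] = totals.get(category, 0) + 1
        if gold in ranked[:k]:
            hits[category] = hits.get(category, 0) + 1
    return {c: hits.get(c, 0) / n for c, n in totals.items()}
```

Running this before any generation step isolates retrieval gaps from model behavior, which is exactly the debugging separation the paragraph above argues for.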

One nuance worth noting on tool integration. The payment-status category was the place where the engagement spent the most engineering time relative to language-model time. Settlement-timing answers in particular required not just the current state of the payment but the predicted settlement window from the payment processor — a derived value, not a direct API field. Getting this right meant the agent could answer the question definitively ("your transfer is scheduled to settle Tuesday morning") rather than vaguely ("transfers typically take 1-3 business days"), which had a measurable effect on resolution CSAT inside this category.

04 · Approach · CSAT-Controlled Pilot: 1% traffic, three CSAT gates, human decisions at every step.

The pilot opened in week four at 1% of inbound volume. Routing was a simple session-hash rule — one customer in a hundred, deterministic, repeatable. The other ninety-nine percent of volume stayed on the human-only path for the duration of the pilot window, which ran four weeks.
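A deterministic one-in-a-hundred session-hash rule of that kind can be sketched with any stable hash; the use of SHA-256 here is an illustrative choice, not the engagement's implementation.

```python
import hashlib

def route_to_pilot(session_id: str, pilot_pct: int = 1) -> bool:
    """Deterministic bucket: the same session always lands in the same arm,
    and roughly pilot_pct sessions in a hundred hit the AI path."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < pilot_pct
```

Determinism is what makes the pilot repeatable: a customer who hit the AI path on Monday hits it again on Tuesday, so delayed CSAT at the seventy-two-hour mark measures the same arm the resolution CSAT did.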

Three measurement layers were live before the pilot opened. Resolution CSAT measured immediately after each conversation closed; delayed CSAT measured at the seventy-two-hour mark to catch the cases where the conversation closed cleanly but the underlying issue resurfaced; and a model-scored conversation CSAT — the model rating its own confidence and resolution quality on every turn — was surfaced to a daily QA queue. Baseline readings on all three tiers were taken from human-only traffic in the fortnight before the pilot opened, so every ramp gate had a defensible comparison point.

Gate 01
1% pilot open · week 4

Pre-condition: audit complete, retrieval validated on the 350-query held-out set, three-layer CSAT baseline captured from human-only traffic. Measurement window: four weeks. Rollback rule: any CSAT tier drifting more than two points against baseline rolls the deployment back to 0% within the shift.

Open after week 3 audit
Gate 02
8% ramp · month 2

Pre-condition: pilot CSAT neutral or positive across all three tiers, false-positive escalation rate inside agreed bound, hallucinated-policy rate under 0.5% on the daily QA sample. Measurement window: three weeks. Rollback rule: same two-point threshold, plus a hallucinated-policy breach triggers immediate fall-back regardless of CSAT.

Open after month-1 review
Gate 03
25% ramp · month 3

Pre-condition: month-2 CSAT trends neutral or positive, escalation handoff QA passes the conversation-context check across a sampled batch, agent training delivered to the broader support team. Measurement window: four weeks. Rollback rule: any signal of handoff context loss in the QA sample triggers ramp pause.

Open after month-2 review
Gate 04
51% steady state · month 5

Pre-condition: production observability live, alerting wired against the trailing two-week baseline. Measurement window: ongoing. Rollback rule: deployment now lives inside production alerting — any sustained CSAT regression or hallucinated-policy breach triggers automated ramp-down, not a meeting.

Held from month 4 onward
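The rollback rules the four gates share can be sketched as a single decision function. The thresholds mirror the gates above (two-point CSAT band, 0.5% hallucinated-policy ceiling at Gate 02); the function itself and its return labels are illustrative.

```python
def gate_decision(csat_deltas: dict[str, float],
                  hallucinated_policy_rate: float,
                  csat_band: float = 2.0,
                  policy_ceiling: float = 0.005) -> str:
    """Rollback fires on a hallucinated-policy breach regardless of CSAT,
    or on any CSAT tier drifting past the band against baseline."""
    if hallucinated_policy_rate >= policy_ceiling:
        return "rollback"   # treated as a compliance event, not a quality event
    if any(abs(delta) > csat_band for delta in csat_deltas.values()):
        return "rollback"
    return "hold-or-ramp"
```

Note the ordering: the compliance check runs first and is not offset by good CSAT, which is the asymmetry the gates were designed to enforce.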

The pilot produced two early findings that would have been invisible at faster ramp speeds. First, the model-scored conversation CSAT diverged from resolution CSAT inside the policy category — the model was rating its own confidence as high while customers were rating the resolution as middling. The signal pointed to a confidence-calibration issue rather than a quality issue, and surfaced two weeks earlier than either survey-based layer would have caught it. That kind of leading signal is the entire reason model-scored CSAT exists inside the instrumentation stack.

Second, the false-positive escalation rate in the account-and-access category landed above expectations — the model was handing off cleanly resolvable password-reset conversations because the confidence floor for the category had been set conservatively at launch. The floor was lowered on a per-archetype basis at the month-one review and the category's deflection rate climbed substantially in month two without CSAT damage. Both adjustments were the gates doing their job — surfacing tuning decisions inside a window small enough to make them safely.

"Every ramp gate is a CSAT gate first. Deflection is a byproduct of the gates holding, not the other way around."— Field note · this engagement, month 2 review

05 · Approach · Escalation Rubric: Confidence floors, never-deflect lists, context payloads.

The escalation rubric was the document with the highest operational consequence in the engagement. For each of the top-200 archetypes, the rubric defined a confidence floor (below which the AI handed off automatically), a never-deflect flag (always escalates, regardless of confidence), a context payload (transcript, intent classification, confidence score, relevant customer state, retrieved policy chunks), and a routing queue. Sign-off came from the support lead, the legal lead for the regulated categories, and the product owner before the pilot opened.

The single most consequential design decision inside the rubric was the breadth of the never-deflect list. We pushed for a generous launch posture — dispute initiation, fraud reports, churn-saves, account-compromise reports, anything regulated, and any conversation where the customer used a specific set of trigger phrases ("legal", "regulator", "cancel", "close my account"). The cost of an AI deflection on the wrong archetype was much higher than the value of dozens of correctly-deflected tier-one tickets, and the never-deflect list was the mechanism that priced that asymmetry into the program.

The never-deflect principle
Be generous with the never-deflect list at launch; relax it later when the data supports it. The cost of an AI deflection on a churn-save or a fraud report is much higher than the value of dozens of correctly deflected tier-one tickets — the rubric is the mechanism that prices that asymmetry into the program.
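A per-archetype rubric entry and the escalation decision it drives can be sketched as follows. The archetype names, floor values, and queue names are illustrative; the trigger phrases are the ones named above, and the decision order (never-deflect first, trigger phrases second, confidence floor last) encodes the asymmetry the callout describes.

```python
from dataclasses import dataclass

@dataclass
class ArchetypeRule:
    confidence_floor: float   # below this, hand off automatically
    never_deflect: bool       # always escalates, regardless of confidence
    queue: str                # routing target for the human handoff

# Hypothetical entries; the real rubric covered the top-200 archetypes.
RUBRIC = {
    "password_reset": ArchetypeRule(0.75, False, "tier1"),
    "fraud_report":   ArchetypeRule(0.00, True,  "fraud-desk"),
}

TRIGGER_PHRASES = ("legal", "regulator", "cancel", "close my account")

def should_escalate(archetype: str, confidence: float, message: str) -> bool:
    rule = RUBRIC[archetype]
    if rule.never_deflect:
        return True
    if any(phrase in message.lower() for phrase in TRIGGER_PHRASES):
        return True
    return confidence < rule.confidence_floor
```

The month-one tuning described below amounts to lowering `confidence_floor` per archetype once real traffic shows the floor was set conservatively, without touching the never-deflect flags.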

The handoff context payload was the second load-bearing piece of the rubric. When an escalation fired, the receiving agent received the full conversation transcript, the model's stated intent, the confidence score, the customer's relevant account state, and any policy chunks the model had retrieved during the conversation. Agents were trained on a thirty-second confirmation pattern — the agent confirms the AI summary inside the first thirty seconds of the live conversation, so the customer experiences continuity rather than a context restart.

That thirty-second confirmation was, in our analysis of the month-three data, the single largest driver of escalation CSAT. Conversations where the agent confirmed the AI summary inside the window scored substantially higher on resolution CSAT than conversations where the agent opened with "can you tell me what this is about". The number is not surprising in retrospect — customers experience the handoff as a continuity test, not as a feature — but the discipline of measuring it, training to it, and QAing it explicitly is what made the number show up in the program rather than as an after-the-fact regret.

A post-handoff QA loop sampled escalations weekly, flagging any case where the customer had to repeat themselves. Flags drove both rubric refinement and individual agent re-training. The pattern that emerged across the engagement is that context loss is, by a wide margin, the single largest source of CSAT damage in support-AI deployments — worse than the bot being wrong, because the customer experiences wrong-and-frustrating rather than wrong-and-escalated.

06 · Outcomes: Month-by-month numbers, CSAT-protected throughout.

The headline outcomes ran against a five-month timeline: month one, audit and build, no customer-facing traffic; month two, 8% deflection with the pilot opened to 1% then ramped to 8% mid-month; month three, 22% deflection with the ramp gate opened to 25%; month four, 38% deflection at the same gate with the model and the rubric tuning settling; month five, 51% steady state. CSAT held inside a ±1.5-point band against baseline across all three measurement tiers throughout the engagement. Hallucinated-policy rate landed under 0.4% by month three and held there.

Deflection ramp · month 1 through month 5

Source: client engagement dashboards, May 2026
  • Month 1: 0% (audit + build · no customer traffic)
  • Month 2: 8% (pilot opens at 1%, ramps to 8% mid-month)
  • Month 3: 22% (ramp gate to 25% · per-archetype tuning)
  • Month 4: 38% (rubric refinement · tooling depth)
  • Month 5: 51% (steady state · production observability live)

The ramp shape is the part of the chart worth reading twice. Deflection at month two was 8%, not 25% or the originally-promised 40% — the gates were holding the program at the level the archetype mix and the CSAT trend could actually support, rather than the level the ramp plan promised on slide eleven of the kick-off deck. Month three climbed substantially because the audit work paid off, the confidence thresholds had been tuned against real traffic, and the escalation rubric had been refined twice. Month four and five were the program reaching its CSAT-protected ceiling and settling.

The CSAT story is the more important half of the outcome. Across the engagement, resolution CSAT moved by 1.2 points (positive) against baseline, delayed CSAT moved by 0.8 points (positive), and model-scored conversation CSAT held essentially flat. None of the three drifted outside the ±2-point gating band at any point during the ramp, which meant no rollback event fired. Hallucinated-policy rate spiked briefly to 0.7% in week six of the pilot — a retrieval issue caused by a policy update that had not propagated to the index — and was resolved inside the same day; the rate sat under 0.4% from month three onward.

Unit-economics outcomes were consistent with the deflection ramp. The cost-per-resolved-ticket curve fell substantially once month-five steady-state held, while the function avoided the hiring path that had triggered the engagement in the first place. The agent team did not shrink — it shifted upmarket, with the highest-complexity disputes, fraud reports, and churn-saves now consuming a larger share of human attention. That is the right destination for the human side of the function.

"The deflection number is interesting. The CSAT band is the result."— Engagement closeout · this case study

07 · Lessons + Replication: What replicates, what does not.

Most of the engagement is replicable. The knowledge audit shape, the four-category split, the category-aware RAG build, the CSAT-controlled pilot structure, the escalation rubric template, and the gating discipline are vertical-independent patterns. We have shipped variants of all of these inside ecommerce, SaaS, and B2B service contexts. The fintech-specific adaptations — hallucinated-policy rate as a first-class compliance metric, policy-version metadata on retrieval chunks, generous never-deflect lists for regulated trigger phrases — are the parts that need to be re-thought for a different vertical, but the rest of the program transfers.

What did not replicate from the original brief is the sixty-percent deflection target. The lower number we delivered is not a shortfall — it is the gates working. Any team committing to a deflection ceiling before measuring the archetype mix and the CSAT response of their own customer base is committing to a number that was pulled from somebody else's slide deck. The right target is the highest deflection the CSAT band will hold, instrumented and gated, on this customer base, for this archetype mix.

Lesson 01 · Audit first
Three weeks upstream of the model

Three weeks of audit work moved the deflection ceiling by an estimated 10-15 percentage points at month five. The audit is the cheapest improvement to a support-AI program — and the one most teams skip because it does not feel like a launch activity.

Owner: support ops

Lesson 02 · Gates hold
Ramp at the speed the data supports

The gates held deflection at 8% for the first month of pilot traffic, climbed to 22% in month three, and only reached 51% in month five. The gates are the program's risk-management layer; ramping faster than the gates produces CSAT damage that is hard to undo.

Owner: ramp gates

Lesson 03 · Handoff is UX
Thirty-second confirmation pattern

The handoff context payload and the agent thirty-second confirmation pattern were the single largest driver of escalation CSAT. Customers experience handoff as a continuity test — pass it or lose them. The pattern transfers cleanly across verticals.

Owner: handoff script

Lesson 04 · Compliance metric
Hallucinated-policy rate as first-class

In any compliance-sensitive context, standard CSAT metrics miss the failure mode that actually matters. Hallucinated-policy rate needs to be instrumented as a hard ceiling with its own rollback rule, separate from CSAT. The mechanism transfers wherever regulated communication exists.

Owner: compliance

For teams designing a comparable program, the sequencing matters as much as the components. Knowledge audit before anything else, RAG build against the audited corpus rather than the raw documentation set, CSAT instrumentation wired before the pilot opens, escalation rubric signed off by support and legal before any traffic touches the AI, and gates that fire on measurement rather than calendar. Skipping any of these is the most common way successful month-five programs become disappointing month-twelve programs.

The companion piece on the 90-day launch plan walks through the operational cadence in more detail, including the templates we ship with engagements. The companion piece on the support ROI calculator and deflection formula covers the unit-economics math underneath a program like this one — break-even thresholds, cost-per-ticket ladders, and the vendor comparison work that informs build-versus-buy.

For teams that want the program delivered as a managed engagement rather than run internally, our AI transformation engagements ship the knowledge audit, the RAG build, the CSAT instrumentation, the pilot, and the agent handoff training as a phased program — calibrated to the archetype mix of the specific support function and with measurable CSAT-controlled outcomes throughout.

Conclusion

CSAT-protected support deflection is the only support deflection that counts.

The engagement delivered 51% deflection at month five against an original brief of 60%, with CSAT held inside a ±1.5-point band and hallucinated-policy rate under 0.4%. The lower deflection number is the correct outcome — the gates did exactly what they were designed to do, and the program is now in a steady state that can be defended at the executive review and at the regulatory review with equal confidence.

The deeper pattern across the engagement is that CSAT precedes deflection. Every load-bearing decision — the audit upstream of the model, the category-aware RAG build, the 1% pilot with three-layer instrumentation, the per-archetype confidence floors, the generous never-deflect list, the thirty-second agent confirmation pattern — was a decision that traded short-term deflection numbers for long-term CSAT stability. That trade is what separated the engagement from the deflection-first deployments we have seen unravel at the quarterly review.

The honest framing on replication is that most of the program transfers and some of it does not. The audit, the RAG build, the gating discipline, and the escalation rubric template generalise across verticals. The fintech-specific adaptations need re-thinking elsewhere. But the underlying principle is the same wherever an AI customer-support program ships: CSAT-protected deflection is the only deflection that counts, and the program is the mechanism that protects it.

Replicate this support program

CSAT-protected deflection is the only deflection that counts.

Our team designs AI customer-support programs mirroring this case — knowledge audit, RAG build, CSAT-controlled pilot, escalation rubric.

What we deliver

Support program engagements

  • Knowledge audit and intent catalog
  • RAG build on top-200 tickets
  • CSAT-controlled deflection pilot
  • Escalation rubric and handoff training
  • Quarterly model-upgrade eval cadence
FAQ · Fintech support case

The questions support leaders ask after the case.

What did the knowledge audit actually involve, and how was it different from a standard documentation review?

The audit was an inventory of every help-centre article, internal SOP, and macro template against the top archetypes pulled from the previous ninety days of ticket data. Each row carried six fields — doc ID, title, owner, last-updated date, product-match status (current, stale, unknown), and the archetype the document mapped to. The difference from a standard documentation review is that the audit was archetype-anchored rather than category-anchored: each article had to map to one of the top archetypes the model would encounter, which surfaced both the gaps (archetypes with no source document) and the rot (archetypes whose source document had drifted from current product behaviour). The audit flagged a hundred and forty-three articles for some level of update; roughly forty were updated before the pilot opened, the remainder were triaged into update-by-day-60 or deprecate buckets. The deflection ceiling at month five would have been ten to fifteen percentage points lower without this work.