AI Development · Case Study · 14 min read · Published May 15, 2026

A fifty-year case archive, zero tolerance for hallucinated citations — how a legal research firm shipped a 99.2% citation-accuracy RAG.

Case Study: RAG Deployment at a Legal Research Firm

A fifty-year legal research archive — thousands of opinions, briefs, and statutes — turned into a production retrieval system where every answer carries a verifiable citation. Six months from corpus design to partner adoption, with citation accuracy as the only gate that mattered.

Digital Applied Team
Senior AI engineers · Published May 15, 2026
Sector · Legal research
  • Citation accuracy: 99.2% of cited references verified (production gate)
  • Hallucinated citations: <0.2% of grounded responses (faithfulness gate)
  • Partner adoption: 78% weekly active partners at month 6
  • Timeline: 6 months from corpus design to GA

A legal research firm with a fifty-year case archive needed a retrieval system their partners would actually trust enough to put in front of clients. The brief was simple to state and unforgiving to satisfy: every answer cites a real source, no fabricated case names, no invented section numbers, and the moment a citation fails verification the system has to fail loud rather than fail quietly.

What's at stake in legal RAG isn't the model layer — it's the gap between a plausible-sounding answer and a verifiable one. One hallucinated citation in a brief filed at court is a reputational event the firm cannot absorb. The technical question was never whether the model could generate fluent legal prose; it was whether retrieval, attribution, and evaluation could be tied together tightly enough to guarantee that fluency was always grounded.

This case study covers seven stages of the build: the situation we inherited, the corpus and hybrid retrieval design, the source attribution UX that made trust visible, the faithfulness evaluation harness that became the production gate, the refresh cadence aligned to the court calendar, the measured outcomes after six months, and the lessons we would replicate at a smaller firm tomorrow.

Key takeaways
  1. Hybrid retrieval beats pure vector for legal. Case names, statute numbers, and party identifiers are exactly the proper-noun patterns BM25 catches and pure vector embeddings smear. Reciprocal rank fusion of vector and keyword recall was the single largest quality lever in the build.
  2. Source attribution is the trust UX. Citation cards rendered inline next to bracketed references — with hover-reveal of the exact source paragraph — turned 'the AI said' into 'the AI said because this opinion says'. Partner trust tracked attribution UX more closely than raw answer quality.
  3. Faithfulness eval prevents hallucinated citations. A RAGAS-based harness measuring faithfulness, answer relevance, and context precision became the production gate. Every model change, prompt edit, and index rebuild had to pass the eval suite before reaching the partner UI.
  4. Court-calendar refresh prevents staleness. Ingestion ran on a cadence aligned to the publication rhythm of opinions, statutes, and rule amendments — not a flat nightly cron. Hot windows around supreme court decision days got tight refresh; quiet weeks ran cheap.
  5. Partner adoption tracks citation accuracy. Adoption curves we measured showed weekly active partners moved with citation accuracy and attribution legibility, not with answer fluency or response latency. Trust, once earned, was sticky; once broken, it was expensive to recover.

01 · Situation · A fifty-year archive, a zero-tolerance brief.

The firm walked in with a clear research workflow problem and an unambiguous quality bar. Partners and associates spent meaningful time each week on case research — pulling opinions, locating controlling statutes, building citation chains, sanity-checking argument structure against precedent. Existing tools answered the first half of that workflow well enough but produced confidently-worded summaries with citations that, on inspection, were sometimes correct, sometimes garbled, and occasionally fabricated entirely.

That gap — between fluency and verifiability — was the entire engagement. The firm did not need a smarter writing assistant. It needed a research assistant whose citations partners could trust without re-verifying every one of them by hand. The acceptance criterion was numeric and explicit: a citation accuracy rate above 99% on a hand-graded production sample, sustained across model changes and corpus refreshes, or the system would not be permitted in front of clients.

What we inherited

  • Roughly fifty years of case material — a mix of opinions, briefs, statutes, regulations, and internal memoranda, stored across three different document management systems with inconsistent metadata and partial OCR on the older holdings.
  • No existing retrieval infrastructure. Search was keyword-only and lived inside the DMS interfaces; there was no vector index, no chunked corpus, and no evaluation framework.
  • A hard regulatory and confidentiality envelope. Privileged client material could not leave the firm's trust boundary; managed vector providers and untrusted egress paths were off the table from day one.
  • Partner skepticism, earned honestly. Multiple prior pilots of generic legal AI tools had produced exactly the hallucinated-citation failure mode the firm was trying to avoid. The bar to clear was not just technical — it was political.
The trust ledger
One fabricated citation surfaced inside a partner's research session would have ended adoption permanently. The build had to be designed around never letting that happen — not around recovering gracefully when it did.

That framing shaped every downstream decision. Retrieval had to over-recall so the right authority was never absent from the candidate set. Attribution had to be inline and verifiable, not a generic source list at the bottom of a response. Evaluation had to run as a gate, not as a dashboard. Refresh had to track when new opinions actually published, not a convenient cron window. And generation had to be permitted to refuse — to say plainly that the corpus did not support an answer — rather than reach for fluency when the retrieval signal was weak.
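As a concrete illustration of that last rule, the refusal path can be as small as a guard between the reranker and the generation prompt. The following is a minimal sketch, assuming a per-candidate reranker score is available; the threshold and minimum-support values are illustrative, not the firm's tuned numbers.

```python
from dataclasses import dataclass

@dataclass
class RankedChunk:
    chunk_id: str
    text: str
    rerank_score: float  # relevance score from the reranker; higher means more relevant

# Illustrative values -- real thresholds would be tuned against the labeled eval set.
MIN_RERANK_SCORE = 0.35
MIN_SUPPORTING_CHUNKS = 2

def ground_or_refuse(candidates: list[RankedChunk]) -> list[RankedChunk] | None:
    """Return chunks worth grounding on, or None to signal an explicit refusal."""
    supported = [c for c in candidates if c.rerank_score >= MIN_RERANK_SCORE]
    if len(supported) < MIN_SUPPORTING_CHUNKS:
        return None  # the caller renders "the corpus does not support an answer"
    return supported
```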

02 · Approach: Corpus + Hybrid Retrieval · The four-stage corpus pipeline.

Legal text breaks several of the assumptions that hold up well on general prose. Case names are proper nouns that pure vector embeddings tend to smear with semantically similar but legally distinct cases. Statute numbers and section references are identifiers that benefit dramatically from keyword recall. Citations within opinions follow rigid formats — Bluebook, jurisdiction-specific short forms — that carry their own retrieval signal if you preserve them through chunking. Hybrid retrieval was not optional; it was the architecture.

The pipeline ran in four stages. Each stage had its own quality check, and a document could not advance to the next stage until the check passed. That discipline meant ingestion was slower than a naive pipeline, but the failure mode of a corrupted chunk slipping into the retrieval index — and surfacing as a hallucinated citation months later — never materialized.

Stage 1
Normalize and enrich
OCR repair · citation parsing · metadata join

Older opinions arrived with broken OCR, missing headers, and inconsistent reporter formats. A normalization pass repaired OCR, parsed citations into a structured Bluebook representation, and joined missing metadata (court, jurisdiction, decision date) against authoritative reference sets.

Quality gate · OCR confidence ≥ 0.97
Stage 2
Chunk by legal structure
section · headnote · holding

Sliding-window chunking destroys the structure that makes legal documents searchable. Chunking respected section boundaries, headnotes, syllabus paragraphs, and holding statements — preserving the units a partner would actually cite.

500-900 tokens · structure-aware
Stage 3
Dual index — vector + lexical
embeddings · BM25 · entity index

Each chunk was embedded for semantic retrieval and indexed lexically for BM25. A separate entity index captured case names, statute citations, and party identifiers as exact-match keys — the proper-noun layer where pure vector retrieval falls down.

Three retrieval signals per chunk
Stage 4
Hybrid query — fuse and rerank
vector + BM25 + entity → RRF → rerank

Each query ran all three retrieval modes in parallel, fused the rankings with reciprocal rank fusion, then passed the top 30 candidates through a legal-domain reranker before the top 8 reached the generation prompt. Over-retrieve, then narrow.

Top-30 → rerank → top-8

The single largest quality lift came from the entity index. Once case names and statute numbers were indexed as exact-match keys alongside the vector and BM25 signals, queries that referenced a specific authority — "what does Marbury say about jurisdiction", "28 USC 1331 commentary" — stopped missing the authoritative chunk and stopped surfacing semantically-adjacent-but-legally-irrelevant alternatives. Pure-vector retrieval, even with strong embeddings, kept conflating cases with similar fact patterns but different holdings; the entity layer cut through that.
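A minimal sketch of how those exact-match keys can be produced, at ingest and again at query time. The two patterns below are purely illustrative; the firm's Bluebook parser covered far more citation forms and jurisdiction-specific short forms.

```python
import re

# Illustrative patterns only -- a production Bluebook parser handles many more forms.
STATUTE = re.compile(r"\b(\d+)\s+U\.?S\.?C\.?\s+§?\s*(\d+[a-z]?)\b", re.IGNORECASE)
CASE_NAME = re.compile(r"\b([A-Z][A-Za-z.'-]+)\s+v\.\s+([A-Z][A-Za-z.'-]+)\b")

def entity_keys(text: str) -> set[str]:
    """Normalize statute references and case names into exact-match entity keys."""
    keys: set[str] = set()
    for title, section in STATUTE.findall(text):
        keys.add(f"usc:{title}:{section.lower()}")
    for plaintiff, defendant in CASE_NAME.findall(text):
        keys.add(f"case:{plaintiff.lower()}-v-{defendant.lower()}")
    return keys

# entity_keys("28 USC 1331 commentary")     -> {"usc:28:1331"}
# entity_keys("Marbury v. Madison holding") -> {"case:marbury-v-madison"}
```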

Structure-aware chunking carried the second-largest lift. A sliding-window chunker would split a holding statement in half, or join the end of one section with the beginning of an unrelated one. Chunking on section breaks, headnote boundaries, and syllabus paragraphs kept the retrievable units aligned with the units partners actually cited — which made attribution legible and faithfulness checks tractable. For broader background on the three-table schema and the SQL patterns underneath this kind of pipeline, the self-hosted RAG with pgvector tutorial covers the canonical shape.
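A minimal sketch of the chunker's core loop, assuming the normalization stage has already split each opinion into tagged structural units (section, headnote, syllabus, holding). The unit fields, the token proxy, and the 900-token ceiling are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    kind: str   # e.g. "section", "headnote", "syllabus", "holding"
    text: str

def chunk_opinion(units: list[Unit], max_tokens: int = 900) -> list[str]:
    """Pack structural units into chunks capped at max_tokens, never splitting a unit."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for unit in units:
        unit_len = len(unit.text.split())  # crude token proxy; a real tokenizer goes here
        if current and current_len + unit_len > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(unit.text)
        current_len += unit_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```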

Hybrid retrieval, simply
Vector caught semantic intent. BM25 caught proper nouns and identifiers. The entity index caught the exact authorities partners actually reach for. Reciprocal rank fusion stitched the three signals together; the legal-domain reranker tightened the top.
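A minimal sketch of that fusion step, assuming each retrieval mode returns an ordered list of chunk IDs and that vector search, BM25, entity lookup, and the reranker are supplied as callables. The k constant of 60 is the conventional RRF default; the per-mode recall depth of 50 is an assumption, while the 30-to-8 narrowing mirrors the numbers above.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs using reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query, vector_search, bm25_search, entity_search, reranker):
    """Run all three retrieval modes, fuse, then rerank the top 30 down to 8."""
    fused = reciprocal_rank_fusion([
        vector_search(query, top_k=50),   # per-mode recall depth is an assumption
        bm25_search(query, top_k=50),
        entity_search(query),             # exact-match hits on case-name / statute keys
    ])
    return reranker(query, fused[:30])[:8]
```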

03 · Approach: Source Attribution · Citation cards — making trust visible.

The single highest-ROI UX investment in the engagement was the citation rendering. The generation prompt required the model to reference retrieved chunks with bracketed numeric markers — [1], [2,3] — exactly the pattern that's become standard across grounded AI products. The interface then parsed those markers out of the streamed answer in real time and rendered each as a small inline card showing the case name, jurisdiction, decision date, and the exact paragraph the model grounded on.

Hover-reveal expanded the card to show surrounding context plus a link directly into the document management system. Partners could verify a citation in under a second without leaving the research session. That single interaction — the speed and confidence of verifying one citation — was what moved the trust dial faster than any model-layer improvement.
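The parsing half of that interaction is small. Below is a minimal sketch, assuming the answer arrives as a token stream, the retrieved chunks are numbered 1..N in prompt order, and render_card stands in for whatever actually draws the inline card.

```python
import re
from typing import Callable, Iterable

MARKER = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")  # matches [1] as well as [2,3]

def stream_with_citations(
    tokens: Iterable[str],
    chunks: dict[int, dict],               # chunk number -> metadata (case name, date, paragraph)
    render_card: Callable[[dict], None],   # placeholder for the inline citation-card renderer
) -> str:
    """Emit a citation card the moment its bracketed marker appears in the stream."""
    buffer = ""
    seen: set[int] = set()
    for token in tokens:
        buffer += token
        for match in MARKER.finditer(buffer):
            for num in (int(n) for n in match.group(1).split(",")):
                if num in chunks and num not in seen:
                    seen.add(num)
                    render_card(chunks[num])
    return buffer
```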

Three attribution rules

  • Citations are inline, not appended. A source list at the bottom of a response forces the reader to map claims back to authorities mentally. Inline markers tied to hover-reveal cards make the mapping explicit.
  • The card shows the exact chunk, not a snippet summary. Partners need to see the language the model read, not a paraphrase of it. Summarizing the source defeats the purpose of attribution.
  • Unsupported claims are flagged. Any sentence in a response that the post-generation verifier could not tie to a retrieved chunk got a visible "unsupported" marker. That made the refusal-shaped failure mode legible rather than hidden.
"The hover-card was worth more than the next model upgrade. Once a partner could verify a citation in a second, they actually used the tool."— Engagement retrospective, month four

Two implementation details earned their keep. First, citation parsing ran on the stream — markers became cards as the answer arrived, so the partner never saw an unattributed claim render even briefly. Second, the post-generation verifier ran in parallel with the stream and could revoke a citation marker after the fact if the verifier found that the cited chunk did not actually support the claim. A revoked citation collapsed into the unsupported marker described above. That belt-and-suspenders approach — inline rendering plus post-hoc verification — was the backbone of how the system stayed under the 0.2% hallucinated-citation threshold across the engagement.
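The revocation path reduces to a claim-by-claim check against whatever the claim cites. A minimal sketch follows; the supports callable is deliberately abstract, because the engagement does not detail how the judgment was made, so treat it as an assumption to be filled with an NLI model or an LLM judge.

```python
from typing import Callable

def verify_citations(
    claims: list[tuple[str, list[int]]],   # (claim sentence, cited chunk numbers)
    chunks: dict[int, str],                # chunk number -> chunk text
    supports: Callable[[str, str], bool],  # assumed judge: does this chunk support this claim?
) -> dict[int, bool]:
    """Return, for each claim index, whether at least one cited chunk supports it."""
    verdicts: dict[int, bool] = {}
    for i, (claim, cited) in enumerate(claims):
        verdicts[i] = any(num in chunks and supports(claim, chunks[num]) for num in cited)
    return verdicts

# A claim whose verdict comes back False has its citation marker revoked in the UI
# and is rendered with the "unsupported" flag instead.
```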

04 · Approach: Faithfulness Eval · The evaluation harness as a production gate.

Evaluation was not a dashboard. It was a deployment gate that every model change, prompt edit, and index rebuild had to pass before reaching the partner UI. The harness measured three properties on a hand-labeled query set of approximately 400 representative research questions, with relevance and correctness annotated by senior associates over a two-week labeling sprint.

The three metrics — faithfulness, answer relevance, and context precision — came from the RAGAS framework, with DeepEval running alongside as a secondary check. The three signals catch different failure modes; reading them together produced a coherent picture of how the system behaved under change.

Faithfulness
Is every claim grounded?

Decomposes the generated answer into atomic claims, checks each against the retrieved context, and reports the fraction supported. The headline metric for hallucinated-citation prevention. Production gate: ≥ 0.98 on the labeled set.

Primary gate
Answer relevance
Does the answer address the question?

A grounded but tangential answer still fails the partner — the system has to be both faithful and on-point. Production gate: ≥ 0.92 on the labeled set, measured by embedding-similarity between the question and a back-generated question from the answer.

Secondary gate
Context precision
Did retrieval surface the right chunks?

Measures how concentrated the relevant chunks are at the top of the retrieval ranking. A weak signal here means the generation layer is being asked to compensate for retrieval misses — fragile and slow to fix. Production gate: ≥ 0.85 on the labeled set.

Retrieval gate
Citation accuracy
Does every cited authority verify?

Independent of the LLM-judged metrics above — a deterministic check that every bracketed citation resolves to a real chunk and the chunk actually supports the claim. The hard production gate: ≥ 99% sustained across releases.

Hard gate

Operationally, the gate ran on every pull request that touched retrieval, prompts, models, or indexes. A regression on any of the four metrics blocked the merge. The labeled query set itself was versioned alongside the code; new questions were added each sprint as partner queries surfaced failure modes the original sample missed. That tight feedback loop — partner asks a hard question, that question becomes a labeled eval case, the next regression test catches the failure before it reaches production — is what kept the system honest over six months of iteration.
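Wired into CI, the gate looks roughly like the sketch below. It uses the ragas evaluation API as documented, but result handling and column names vary across ragas versions, the labeled-set loading is left out, and the threshold constants simply mirror the gates described above.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Thresholds mirror the production gates above; metric names follow recent ragas releases.
GATES = {"faithfulness": 0.98, "answer_relevancy": 0.92, "context_precision": 0.85}

def run_gate(labeled_set: dict[str, list]) -> int:
    """Evaluate the labeled query set; return a non-zero exit code on any regression."""
    # Expected columns: question, answer, contexts, ground_truth.
    dataset = Dataset.from_dict(labeled_set)
    result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
    scores = result.to_pandas()  # per-question metric scores
    failed = False
    for metric, floor in GATES.items():
        mean = scores[metric].mean()
        print(f"{metric}: {mean:.3f} (gate {floor}) {'ok' if mean >= floor else 'FAIL'}")
        failed = failed or mean < floor
    return 1 if failed else 0

# In CI: sys.exit(run_gate(load_labeled_set())) so a regression blocks the merge
# on any PR touching retrieval, prompts, models, or indexes.
```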

For a more general treatment of how to score a production RAG system end to end — including the broader rubric items beyond faithfulness — the 80-point RAG quality scorecard is the companion reference. The harness above implements the faithfulness and retrieval-precision dimensions of that scorecard as enforceable gates rather than soft signals.

The hard truth about eval
Soft eval is not eval. A dashboard that everyone agrees to monitor drifts the moment someone is in a hurry. The only metric that holds is the one that blocks the merge when it regresses.

05 · Approach: Refresh + Court Calendar · Ingestion cadence aligned to the calendar.

A flat nightly refresh would have been wrong in two directions at once. On quiet weeks it would have burned embedding spend re-processing material that had not changed. On supreme court decision days — when a single new opinion can reshape the answer to every research question that touches it — a 24-hour refresh window was a quality failure waiting to happen.

The cadence we settled on tracked the publication rhythm of the sources themselves. Opinions, statutes, and regulations each have their own update patterns; aligning ingestion to those patterns kept the corpus fresh where freshness mattered and cheap where it didn't.

Decision days
15min
Hot-window refresh

On supreme court decision days and major regulatory announcement windows, ingestion ran every 15 minutes with priority routing through the chunking and embedding pipeline. New opinions reached the index within an hour of publication.

Court calendar–driven
Standard days
4hr
Working-hours cadence

During normal court terms, the pipeline ran four-hourly during business hours and once overnight. That covered the bulk of new filings, briefs, and lower-court opinions without saturating the embedding budget.

Cost-efficient default
Quiet weeks
24hr
Daily-only refresh

During court recess and slow regulatory windows, the pipeline dropped to a single nightly run. Checksum-based change detection skipped re-embedding for unchanged sources, so even the nightly run rarely processed much.

Recess-mode
Annual rebuild
1x/yr
Full index re-embed

Once per year — scheduled into the quietest week of the court calendar — the entire corpus was re-embedded against the current embedding model. That kept embedding-model drift from accumulating and gave the harness a fresh baseline.

Model-version hygiene

Two engineering details made the cadence reliable. First, change detection was content-checksum based, not timestamp based — a document touched but not edited never re-embedded. Second, the hot-window logic was driven by a published court calendar feed rather than ad-hoc operator triggers; the system knew without anyone telling it that the first Monday of October was a hot window, and so were the announced opinion-release dates.
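Both details are small in code. A minimal sketch, assuming a calendar feed that exposes announced decision days and a store of previously seen checksums; the interval values simply restate the cadences above.

```python
import hashlib
from datetime import date, timedelta

def content_checksum(document_bytes: bytes) -> str:
    """Content-based change detection: identical bytes never trigger a re-embed."""
    return hashlib.sha256(document_bytes).hexdigest()

def needs_reembedding(doc_id: str, document_bytes: bytes, seen: dict[str, str]) -> bool:
    """Skip documents that were touched but not edited."""
    checksum = content_checksum(document_bytes)
    if seen.get(doc_id) == checksum:
        return False
    seen[doc_id] = checksum
    return True

def refresh_interval(today: date, decision_days: set[date], in_recess: bool) -> timedelta:
    """Pick the ingestion cadence from the published court calendar."""
    if today in decision_days:
        return timedelta(minutes=15)   # hot window on decision days
    if in_recess:
        return timedelta(hours=24)     # quiet weeks: nightly only
    return timedelta(hours=4)          # standard working-hours cadence
```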

The result was a corpus that felt current to partners. Asking about a decision issued that morning surfaced the opinion and its holding within the hour. Asking about an older question still grounded against an index that had been re-embedded recently enough that semantic drift was not measurable. The refresh architecture earned its keep on the highest-stakes queries — the ones tied to material the firm needed in production research the same week it published.

06 · Outcomes · What the numbers said at month six.

Six months after the first partner pilot, the system was in production for the full partnership and was being used in daily research workflows. The headline metrics — citation accuracy, hallucinated-citation rate, partner adoption — sat where the engagement design targeted them. The harness gates had held through approximately a dozen model upgrades, two embedding-model swaps, three index reorganizations, and the annual full re-embed.

Outcomes after six months in production

Source: production telemetry and engagement retrospective, month 6
  • Citation accuracy (hand-graded production sample): 99.2% of citations verified
  • Faithfulness score (RAGAS, 400-question labeled set): 0.984
  • Answer relevance (RAGAS, production gate ≥ 0.92): 0.94
  • Context precision (top-k retrieval concentration, gate ≥ 0.85): 0.88
  • Partner adoption (weekly active partners, month 6): 78%
  • Research time reduction (self-reported median time-to-first-draft): −42%
  • Hallucinated-citation rate (per-response, production sample): <0.2%

The adoption number deserves a closer reading. 78% weekly active partners is not 78% of partner research happening through the system — it is the fraction of partners who used the system in any given week. Inside that group, the depth of use varied widely: some partners ran ten queries a day, others used it for spot checks on argument structure. The relevant signal was that the system had crossed the trust threshold for the partnership as a whole — citation skepticism was no longer the reason partners stayed away.

The 42% self-reported reduction in time-to-first-draft on research-heavy work is the soft business metric we report with the most caution. It is self-reported, sample-biased toward the partners who adopted earliest, and likely overstates the system-attributable portion. The hard metrics — citation accuracy, faithfulness, adoption — are the ones we trust. The productivity number is the partnership's framing of why those technical metrics mattered.

What stayed honest
Six months in, the harness gates were what kept the system on the right side of the citation-accuracy line. Every regression on faithfulness blocked a release; every blocked release became a new labeled eval case. The system stayed conservative by construction rather than by promise.

07 · Lessons + Replication · What we would build first at a smaller firm tomorrow.

Re-running this engagement at a firm with one-tenth the corpus size and one-tenth the engineering budget, the prioritization order is sharper than it was at the start of this project. Three decisions create the asymmetric upside; everything else is execution.

What to build first

  • The entity index, before the vector index. For any domain with strong proper-noun retrieval signal — legal, medical, financial, academic — exact-match recall on identifiers is the highest single quality lever. Building it first means even an early-version system surfaces the right authority on the queries that matter most.
  • Citation cards, before model upgrades. The fastest path to partner trust is making verification cheap. Spend a day on the inline citation card with hover-reveal before spending a week on prompt tuning or model selection.
  • The faithfulness gate, before the dashboard. Eval that blocks releases changes behavior. Eval that informs releases gets ignored under deadline pressure. Wire faithfulness as a CI gate on the first PR, even with a tiny labeled set, and grow the set from there.

What to skip until you need it

  • A custom reranker. A general-purpose reranker (Cohere, Voyage) gets you most of the way. Custom domain rerankers earn their keep only after the entity index, hybrid retrieval, and faithfulness gate are stable.
  • A bespoke chunking pipeline. Structure-aware chunking matters; rolling your own structure parser for every document type does not. Use the legal-document parsers that already exist and patch their edge cases rather than rebuilding from scratch.
  • An over-engineered refresh scheduler. Start with a fixed cadence and a checksum-based change detector. Layer in calendar-aware hot windows only when production usage shows the flat cadence is wrong.

One closing observation. The engagement did not produce a clever model trick or a novel retrieval algorithm; it produced a stack of pedestrian engineering decisions, each of which was disciplined enough to hold up under the next decision. Structure-aware chunking made hybrid retrieval possible. Hybrid retrieval made entity-index recall possible. Entity recall made faithful generation possible. Faithful generation made inline citations verifiable. Verifiable citations made partner trust possible. None of the steps were exotic on their own. The composition — holding all five steps to their respective bars at the same time — was the actual work. Replicating it at a smaller firm is more about that compositional discipline than about budget or headcount; the same five steps still apply, just at the scale the firm actually needs.

Conclusion

Legal RAG works when citation accuracy is the gate, not the hope.

The lesson the engagement crystallized is narrower and more actionable than the usual case-study takeaway. Legal RAG works when the team treats citation accuracy as a hard gate — measured, blocked on regression, enforced in CI — rather than as an aspirational property of the system. Every other architectural choice in this build flowed from that single commitment. Hybrid retrieval mattered because pure vector could not hit the citation bar. Structure-aware chunking mattered because sloppy chunking broke faithfulness. The citation card UI mattered because unverifiable attribution would have ended adoption regardless of the underlying accuracy.

The broader implication for any high-stakes RAG deployment — legal, medical, financial, regulatory — is that the gating metric needs to be explicit, measurable, and binding before the first line of retrieval code gets written. A team that knows it must hit 99% citation accuracy and has a verifier that says so on every PR will build a meaningfully different system than a team that intends to monitor accuracy after launch. The gating metric is the architecture.

For firms considering a similar build, the realistic six-month shape is: month one on corpus design and the harness, month two on retrieval and ingestion, month three on generation and attribution UX, month four on the partner pilot, months five and six on hardening, refresh cadence, and adoption work. The heaviest engineering happens in the first three months; the heaviest organizational work — earning partner trust, wiring the system into research workflows, closing the labeled-eval feedback loop — happens in months four through six. Both halves are necessary. Skipping either produces a system that technically works and is never used, or a system that is used until the first hallucinated citation surfaces and is then quietly abandoned.

Replicate this RAG


Our team designs production RAG systems for vertical industries — corpus design, hybrid retrieval, source attribution, faithfulness eval, refresh cadence.

What we build · Vertical RAG engagements

  • Corpus design for vertical archives
  • Hybrid retrieval implementation
  • Source-attribution UX (citation cards)
  • Faithfulness eval (RAGAS / DeepEval)
  • Refresh-cadence design aligned to industry calendar
FAQ · Legal RAG case

The questions teams ask after the case.

How was the fifty-year corpus structured and chunked?

The corpus split into three logical layers — primary authorities (opinions, statutes, regulations), secondary authorities (treatises, restatements, internal memoranda), and procedural material (briefs, motions, filings). Each layer carried its own chunking strategy: structure-aware on primary authorities (section, headnote, holding); semantic on secondary; metadata-rich but lightly chunked on procedural. Older holdings went through an OCR repair pass with a confidence gate before they could enter the index. Citation parsing standardized references into a Bluebook representation at ingest time, which let the entity index treat citations as exact-match keys rather than free-text tokens. The whole pipeline was designed so a citation parsed once at ingest never had to be re-parsed at query time.