AI Development · Case Study · 14 min read · Published May 15, 2026

A fifty-year case archive, zero tolerance for hallucinated citations — how a legal research firm shipped a 99.2% citation-accuracy RAG.

Case Study: RAG Deployment at a Legal Research Firm

A fifty-year legal research archive — thousands of opinions, briefs, and statutes — turned into a production retrieval system where every answer carries a verifiable citation. Six months from corpus design to partner adoption, with citation accuracy as the only gate that mattered.

Digital Applied Team
Senior AI engineers · Published May 15, 2026
Sector · Legal research
  • Citation accuracy: 99.2% of cited references verified (production gate)
  • Hallucinated citations: <0.2% of grounded responses (faithfulness gate)
  • Partner adoption: 78% weekly active partners at month 6
  • Timeline: 6 months from corpus design to GA

A legal research firm with a fifty-year case archive needed a retrieval system their partners would actually trust enough to put in front of clients. The brief was simple to state and unforgiving to satisfy: every answer cites a real source, no fabricated case names, no invented section numbers, and the moment a citation fails verification the system has to fail loud rather than fail quietly.

What's at stake in legal RAG isn't the model layer — it's the gap between a plausible-sounding answer and a verifiable one. One hallucinated citation in a brief filed at court is a reputational event the firm cannot absorb. The technical question was never whether the model could generate fluent legal prose; it was whether retrieval, attribution, and evaluation could be tied together tightly enough to guarantee that fluency was always grounded.

This case study covers seven stages of the build: the situation we inherited, the corpus and hybrid retrieval design, the source attribution UX that made trust visible, the faithfulness evaluation harness that became the production gate, the refresh cadence aligned to the court calendar, the measured outcomes after six months, and the lessons we would replicate at a smaller firm tomorrow.

Key takeaways
  1. Hybrid retrieval beats pure vector for legal. Case names, statute numbers, and party identifiers are exactly the proper-noun patterns BM25 catches and pure vector embeddings smear. Reciprocal rank fusion of vector and keyword recall was the single largest quality lever in the build.
  2. Source attribution is the trust UX. Citation cards rendered inline next to bracketed references — with hover-reveal of the exact source paragraph — turned 'the AI said' into 'the AI said because this opinion says'. Partner trust tracked attribution UX more closely than raw answer quality.
  3. Faithfulness eval prevents hallucinated citations. A RAGAS-based harness measuring faithfulness, answer relevance, and context precision became the production gate. Every model change, prompt edit, and index rebuild had to pass the eval suite before reaching the partner UI.
  4. Court-calendar refresh prevents staleness. Ingestion ran on a cadence aligned to the publication rhythm of opinions, statutes, and rule amendments — not a flat nightly cron. Hot windows around supreme court decision days got tight refresh; quiet weeks ran cheap.
  5. Partner adoption tracks citation accuracy. Adoption curves we measured showed weekly active partners moved with citation accuracy and attribution legibility, not with answer fluency or response latency. Trust, once earned, was sticky; once broken, it was expensive to recover.

01 · Situation · A fifty-year archive, a zero-tolerance brief.

The firm walked in with a clear research workflow problem and an unambiguous quality bar. Partners and associates spent meaningful time each week on case research — pulling opinions, locating controlling statutes, building citation chains, sanity-checking argument structure against precedent. Existing tools answered the first half of that workflow well enough but produced confidently-worded summaries with citations that, on inspection, were sometimes correct, sometimes garbled, and occasionally fabricated entirely.

That gap — between fluency and verifiability — was the entire engagement. The firm did not need a smarter writing assistant. It needed a research assistant whose citations partners could trust without re-verifying every one of them by hand. The acceptance criterion was numeric and explicit: a citation accuracy rate above 99% on a hand-graded production sample, sustained across model changes and corpus refreshes, or the system would not be permitted in front of clients.

What we inherited

  • Roughly fifty years of case material — a mix of opinions, briefs, statutes, regulations, and internal memoranda, stored across three different document management systems with inconsistent metadata and partial OCR on the older holdings.
  • No existing retrieval infrastructure. Search was keyword-only and lived inside the DMS interfaces; there was no vector index, no chunked corpus, and no evaluation framework.
  • A hard regulatory and confidentiality envelope. Privileged client material could not leave the firm's trust boundary; managed vector providers and untrusted egress paths were off the table from day one.
  • Partner skepticism, earned honestly. Multiple prior pilots of generic legal AI tools had produced exactly the hallucinated-citation failure mode the firm was trying to avoid. The bar to clear was not just technical — it was political.
The trust ledger
One fabricated citation surfaced inside a partner's research session would have ended adoption permanently. The build had to be designed around never letting that happen — not around recovering gracefully when it did.

That framing shaped every downstream decision. Retrieval had to over-recall so the right authority was never absent from the candidate set. Attribution had to be inline and verifiable, not a generic source list at the bottom of a response. Evaluation had to run as a gate, not as a dashboard. Refresh had to track when new opinions actually published, not a convenient cron window. And generation had to be permitted to refuse — to say plainly that the corpus did not support an answer — rather than reach for fluency when the retrieval signal was weak.
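As a concrete illustration of that last rule, the refusal path can be as small as a guard between the reranker and the generation prompt. The following is a minimal sketch, assuming a per-candidate reranker score is available; the threshold and minimum-support values are illustrative, not the firm's tuned numbers.

```python
from dataclasses import dataclass

@dataclass
class RankedChunk:
    chunk_id: str
    text: str
    rerank_score: float  # relevance score from the reranker; higher means more relevant

# Illustrative values -- real thresholds would be tuned against the labeled eval set.
MIN_RERANK_SCORE = 0.35
MIN_SUPPORTING_CHUNKS = 2

def ground_or_refuse(candidates: list[RankedChunk]) -> list[RankedChunk] | None:
    """Return chunks worth grounding on, or None to signal an explicit refusal."""
    supported = [c for c in candidates if c.rerank_score >= MIN_RERANK_SCORE]
    if len(supported) < MIN_SUPPORTING_CHUNKS:
        return None  # the caller renders "the corpus does not support an answer"
    return supported
```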

02 · Approach: Corpus + Hybrid Retrieval · The four-stage corpus pipeline.

Legal text breaks several of the assumptions that hold up well on general prose. Case names are proper nouns that pure vector embeddings tend to smear with semantically similar but legally distinct cases. Statute numbers and section references are identifiers that benefit dramatically from keyword recall. Citations within opinions follow rigid formats — Bluebook, jurisdiction-specific short forms — that carry their own retrieval signal if you preserve them through chunking. Hybrid retrieval was not optional; it was the architecture.

The pipeline ran in four stages. Each stage had its own quality check, and a document could not advance to the next stage until the check passed. That discipline meant ingestion was slower than a naive pipeline, but the failure mode of a corrupted chunk slipping into the retrieval index — and surfacing as a hallucinated citation months later — never materialized.

Stage 1
Normalize and enrich
OCR repair · citation parsing · metadata join

Older opinions arrived with broken OCR, missing headers, and inconsistent reporter formats. A normalization pass repaired OCR, parsed citations into a structured Bluebook representation, and joined missing metadata (court, jurisdiction, decision date) against authoritative reference sets.

Quality gate · OCR confidence ≥ 0.97
Stage 2
Chunk by legal structure
section · headnote · holding

Sliding-window chunking destroys the structure that makes legal documents searchable. Chunking respected section boundaries, headnotes, syllabus paragraphs, and holding statements — preserving the units a partner would actually cite.

500-900 tokens · structure-aware
Stage 3
Dual index — vector + lexical
embeddings · BM25 · entity index

Each chunk was embedded for semantic retrieval and indexed lexically for BM25. A separate entity index captured case names, statute citations, and party identifiers as exact-match keys — the proper-noun layer where pure vector retrieval falls down.

Three retrieval signals per chunk
Stage 4
Hybrid query — fuse and rerank
vector + BM25 + entity → RRF → rerank

Each query ran all three retrieval modes in parallel, fused the rankings with reciprocal rank fusion, then passed the top 30 candidates through a legal-domain reranker before the top 8 reached the generation prompt. Over-retrieve, then narrow.

Top-30 → rerank → top-8

The single largest quality lift came from the entity index. Once case names and statute numbers were indexed as exact-match keys alongside the vector and BM25 signals, queries that referenced a specific authority — "what does Marbury say about jurisdiction", "28 USC 1331 commentary" — stopped missing the authoritative chunk and stopped surfacing semantically-adjacent-but-legally-irrelevant alternatives. Pure-vector retrieval, even with strong embeddings, kept conflating cases with similar fact patterns but different holdings; the entity layer cut through that.
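A minimal sketch of how those exact-match keys can be produced, at ingest and again at query time. The two patterns below are purely illustrative; the firm's Bluebook parser covered far more citation forms and jurisdiction-specific short forms.

```python
import re

# Illustrative patterns only -- a production Bluebook parser handles many more forms.
STATUTE = re.compile(r"\b(\d+)\s+U\.?S\.?C\.?\s+§?\s*(\d+[a-z]?)\b", re.IGNORECASE)
CASE_NAME = re.compile(r"\b([A-Z][A-Za-z.'-]+)\s+v\.\s+([A-Z][A-Za-z.'-]+)\b")

def entity_keys(text: str) -> set[str]:
    """Normalize statute references and case names into exact-match entity keys."""
    keys: set[str] = set()
    for title, section in STATUTE.findall(text):
        keys.add(f"usc:{title}:{section.lower()}")
    for plaintiff, defendant in CASE_NAME.findall(text):
        keys.add(f"case:{plaintiff.lower()}-v-{defendant.lower()}")
    return keys

# entity_keys("28 USC 1331 commentary")     -> {"usc:28:1331"}
# entity_keys("Marbury v. Madison holding") -> {"case:marbury-v-madison"}
```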

Structure-aware chunking carried the second-largest lift. A sliding-window chunker would split a holding statement in half, or join the end of one section with the beginning of an unrelated one. Chunking on section breaks, headnote boundaries, and syllabus paragraphs kept the retrievable units aligned with the units partners actually cited — which made attribution legible and faithfulness checks tractable. For broader background on the three-table schema and the SQL patterns underneath this kind of pipeline, the self-hosted RAG with pgvector tutorial covers the canonical shape.
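A minimal sketch of the chunker's core loop, assuming the normalization stage has already split each opinion into tagged structural units (section, headnote, syllabus, holding). The unit fields, the token proxy, and the 900-token ceiling are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    kind: str   # e.g. "section", "headnote", "syllabus", "holding"
    text: str

def chunk_opinion(units: list[Unit], max_tokens: int = 900) -> list[str]:
    """Pack structural units into chunks capped at max_tokens, never splitting a unit."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for unit in units:
        unit_len = len(unit.text.split())  # crude token proxy; a real tokenizer goes here
        if current and current_len + unit_len > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(unit.text)
        current_len += unit_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```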

Hybrid retrieval, simply
Vector caught semantic intent. BM25 caught proper nouns and identifiers. The entity index caught the exact authorities partners actually reach for. Reciprocal rank fusion stitched the three signals together; the legal-domain reranker tightened the top.
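A minimal sketch of that fusion step, assuming each retrieval mode returns an ordered list of chunk IDs and that vector search, BM25, entity lookup, and the reranker are supplied as callables. The k constant of 60 is the conventional RRF default; the per-mode recall depth of 50 is an assumption, while the 30-to-8 narrowing mirrors the numbers above.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs using reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query, vector_search, bm25_search, entity_search, reranker):
    """Run all three retrieval modes, fuse, then rerank the top 30 down to 8."""
    fused = reciprocal_rank_fusion([
        vector_search(query, top_k=50),   # per-mode recall depth is an assumption
        bm25_search(query, top_k=50),
        entity_search(query),             # exact-match hits on case-name / statute keys
    ])
    return reranker(query, fused[:30])[:8]
```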

03 · Approach: Source Attribution · Citation cards — making trust visible.

The single highest-ROI UX investment in the engagement was the citation rendering. The generation prompt required the model to reference retrieved chunks with bracketed numeric markers — [1], [2,3] — exactly the pattern that's become standard across grounded AI products. The interface then parsed those markers out of the streamed answer in real time and rendered each as a small inline card showing the case name, jurisdiction, decision date, and the exact paragraph the model grounded on.

Hover-reveal expanded the card to show surrounding context plus a link directly into the document management system. Partners could verify a citation in under a second without leaving the research session. That single interaction — the speed and confidence of verifying one citation — was what moved the trust dial faster than any model-layer improvement.
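The parsing half of that interaction is small. Below is a minimal sketch, assuming the answer arrives as a token stream, the retrieved chunks are numbered 1..N in prompt order, and render_card stands in for whatever actually draws the inline card.

```python
import re
from typing import Callable, Iterable

MARKER = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")  # matches [1] as well as [2,3]

def stream_with_citations(
    tokens: Iterable[str],
    chunks: dict[int, dict],               # chunk number -> metadata (case name, date, paragraph)
    render_card: Callable[[dict], None],   # placeholder for the inline citation-card renderer
) -> str:
    """Emit a citation card the moment its bracketed marker appears in the stream."""
    buffer = ""
    seen: set[int] = set()
    for token in tokens:
        buffer += token
        for match in MARKER.finditer(buffer):
            for num in (int(n) for n in match.group(1).split(",")):
                if num in chunks and num not in seen:
                    seen.add(num)
                    render_card(chunks[num])
    return buffer
```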

Three attribution rules

  • Citations are inline, not appended. A source list at the bottom of a response forces the reader to map claims back to authorities mentally. Inline markers tied to hover-reveal cards make the mapping explicit.
  • The card shows the exact chunk, not a snippet summary. Partners need to see the language the model read, not a paraphrase of it. Summarizing the source defeats the purpose of attribution.
  • Unsupported claims are flagged. Any sentence in a response that the post-generation verifier could not tie to a retrieved chunk got a visible "unsupported" marker. That made the refusal-shaped failure mode legible rather than hidden.
"The hover-card was worth more than the next model upgrade. Once a partner could verify a citation in a second, they actually used the tool."— Engagement retrospective, month four

Two implementation details earned their keep. First, citation parsing ran on the stream — markers became cards as the answer arrived, so the partner never saw an unattributed claim render even briefly. Second, the post-generation verifier ran in parallel with the stream and could revoke a citation marker after the fact if the verifier found that the cited chunk did not actually support the claim. A revoked citation collapsed into the unsupported marker described above. That belt-and-suspenders approach — inline rendering plus post-hoc verification — was the backbone of how the system stayed under the 0.2% hallucinated-citation threshold across the engagement.
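The revocation path reduces to a claim-by-claim check against whatever the claim cites. A minimal sketch follows; the supports callable is deliberately abstract, because the engagement does not detail how the judgment was made, so treat it as an assumption to be filled with an NLI model or an LLM judge.

```python
from typing import Callable

def verify_citations(
    claims: list[tuple[str, list[int]]],   # (claim sentence, cited chunk numbers)
    chunks: dict[int, str],                # chunk number -> chunk text
    supports: Callable[[str, str], bool],  # assumed judge: does this chunk support this claim?
) -> dict[int, bool]:
    """Return, for each claim index, whether at least one cited chunk supports it."""
    verdicts: dict[int, bool] = {}
    for i, (claim, cited) in enumerate(claims):
        verdicts[i] = any(num in chunks and supports(claim, chunks[num]) for num in cited)
    return verdicts

# A claim whose verdict comes back False has its citation marker revoked in the UI
# and is rendered with the "unsupported" flag instead.
```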

04 · Approach: Faithfulness Eval · The evaluation harness as a production gate.

Evaluation was not a dashboard. It was a deployment gate that every model change, prompt edit, and index rebuild had to pass before reaching the partner UI. The harness measured three properties on a hand-labeled query set of approximately 400 representative research questions, with relevance and correctness annotated by senior associates over a two-week labeling sprint.

The three metrics — faithfulness, answer relevance, and context precision — came from the RAGAS framework, with DeepEval running alongside as a secondary check. The three signals catch different failure modes; reading them together produced a coherent picture of how the system behaved under change.

Faithfulness
Is every claim grounded?

Decomposes the generated answer into atomic claims, checks each against the retrieved context, and reports the fraction supported. The headline metric for hallucinated-citation prevention. Production gate: ≥ 0.98 on the labeled set.

Primary gate
Answer relevance
Does the answer address the question?

A grounded but tangential answer still fails the partner — the system has to be both faithful and on-point. Production gate: ≥ 0.92 on the labeled set, measured by embedding-similarity between the question and a back-generated question from the answer.

Secondary gate
Context precision
Did retrieval surface the right chunks?

Measures how concentrated the relevant chunks are at the top of the retrieval ranking. A weak signal here means the generation layer is being asked to compensate for retrieval misses — fragile and slow to fix. Production gate: ≥ 0.85 on the labeled set.

Retrieval gate
Citation accuracy
Does every cited authority verify?

Independent of the LLM-judged metrics above — a deterministic check that every bracketed citation resolves to a real chunk and the chunk actually supports the claim. The hard production gate: ≥ 99% sustained across releases.

Hard gate

Operationally, the gate ran on every pull request that touched retrieval, prompts, models, or indexes. A regression on any of the four metrics blocked the merge. The labeled query set itself was versioned alongside the code; new questions were added each sprint as partner queries surfaced failure modes the original sample missed. That tight feedback loop — partner asks a hard question, that question becomes a labeled eval case, the next regression test catches the failure before it reaches production — is what kept the system honest over six months of iteration.
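Wired into CI, the gate looks roughly like the sketch below. It uses the ragas evaluation API as documented, but result handling and column names vary across ragas versions, the labeled-set loading is left out, and the threshold constants simply mirror the gates described above.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Thresholds mirror the production gates above; metric names follow recent ragas releases.
GATES = {"faithfulness": 0.98, "answer_relevancy": 0.92, "context_precision": 0.85}

def run_gate(labeled_set: dict[str, list]) -> int:
    """Evaluate the labeled query set; return a non-zero exit code on any regression."""
    # Expected columns: question, answer, contexts, ground_truth.
    dataset = Dataset.from_dict(labeled_set)
    result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
    scores = result.to_pandas()  # per-question metric scores
    failed = False
    for metric, floor in GATES.items():
        mean = scores[metric].mean()
        print(f"{metric}: {mean:.3f} (gate {floor}) {'ok' if mean >= floor else 'FAIL'}")
        failed = failed or mean < floor
    return 1 if failed else 0

# In CI: sys.exit(run_gate(load_labeled_set())) so a regression blocks the merge
# on any PR touching retrieval, prompts, models, or indexes.
```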

For a more general treatment of how to score a production RAG system end to end — including the broader rubric items beyond faithfulness — the 80-point RAG quality scorecard is the companion reference. The harness above implements the faithfulness and retrieval-precision dimensions of that scorecard as enforceable gates rather than soft signals.

The hard truth about eval
Soft eval is not eval. A dashboard that everyone agrees to monitor drifts the moment someone is in a hurry. The only metric that holds is the one that blocks the merge when it regresses.

05 · Approach: Refresh + Court Calendar · Ingestion cadence aligned to the calendar.

A flat nightly refresh would have been wrong in two directions at once. On quiet weeks it would have burned embedding spend re-processing material that had not changed. On supreme court decision days — when a single new opinion can reshape the answer to every research question that touches it — a 24-hour refresh window was a quality failure waiting to happen.

The cadence we settled on tracked the publication rhythm of the sources themselves. Opinions, statutes, and regulations each have their own update patterns; aligning ingestion to those patterns kept the corpus fresh where freshness mattered and cheap where it didn't.

Decision days
15min
Hot-window refresh

On supreme court decision days and major regulatory announcement windows, ingestion ran every 15 minutes with priority routing through the chunking and embedding pipeline. New opinions reached the index within an hour of publication.

Court calendar–driven
Standard days
4hr
Working-hours cadence

During normal court terms, the pipeline ran four-hourly during business hours and once overnight. That covered the bulk of new filings, briefs, and lower-court opinions without saturating the embedding budget.

Cost-efficient default
Quiet weeks
24hr
Daily-only refresh

During court recess and slow regulatory windows, the pipeline dropped to a single nightly run. Checksum-based change detection skipped re-embedding for unchanged sources, so even the nightly run rarely processed much.

Recess-mode
Annual rebuild
1x/yr
Full index re-embed

Once per year — scheduled into the quietest week of the court calendar — the entire corpus was re-embedded against the current embedding model. That kept embedding-model drift from accumulating and gave the harness a fresh baseline.

Model-version hygiene

Two engineering details made the cadence reliable. First, change detection was content-checksum based, not timestamp based — a document touched but not edited never re-embedded. Second, the hot-window logic was driven by a published court calendar feed rather than ad-hoc operator triggers; the system knew without anyone telling it that the first Monday of October was a hot window, and so were the announced opinion-release dates.
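Both details are small in code. A minimal sketch, assuming a calendar feed that exposes announced decision days and a store of previously seen checksums; the interval values simply restate the cadences above.

```python
import hashlib
from datetime import date, timedelta

def content_checksum(document_bytes: bytes) -> str:
    """Content-based change detection: identical bytes never trigger a re-embed."""
    return hashlib.sha256(document_bytes).hexdigest()

def needs_reembedding(doc_id: str, document_bytes: bytes, seen: dict[str, str]) -> bool:
    """Skip documents that were touched but not edited."""
    checksum = content_checksum(document_bytes)
    if seen.get(doc_id) == checksum:
        return False
    seen[doc_id] = checksum
    return True

def refresh_interval(today: date, decision_days: set[date], in_recess: bool) -> timedelta:
    """Pick the ingestion cadence from the published court calendar."""
    if today in decision_days:
        return timedelta(minutes=15)   # hot window on decision days
    if in_recess:
        return timedelta(hours=24)     # quiet weeks: nightly only
    return timedelta(hours=4)          # standard working-hours cadence
```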

The result was a corpus that felt current to partners. Asking about a decision issued that morning surfaced the opinion and its holding within the hour. Asking about an older question still grounded against an index that had been re-embedded recently enough that semantic drift was not measurable. The refresh architecture earned its keep on the highest-stakes queries — the ones tied to material the firm needed in production research the same week it published.

06 · Outcomes · What the numbers said at month six.

Six months after the first partner pilot, the system was in production for the full partnership and was being used in daily research workflows. The headline metrics — citation accuracy, hallucinated-citation rate, partner adoption — sat where the engagement design targeted them. The harness gates had held through approximately a dozen model upgrades, two embedding-model swaps, three index reorganizations, and the annual full re-embed.

Outcomes after six months in production

Source: production telemetry and engagement retrospective, month 6
  • Citation accuracy (hand-graded production sample): 99.2% of citations verified
  • Faithfulness score (RAGAS, 400-question labeled set): 0.984
  • Answer relevance (RAGAS, production gate ≥ 0.92): 0.94
  • Context precision (top-k retrieval concentration, gate ≥ 0.85): 0.88
  • Partner adoption (weekly active partners, month 6): 78%
  • Research time reduction (self-reported median time-to-first-draft): −42%
  • Hallucinated-citation rate (per-response, production sample): <0.2%

The adoption number deserves a closer reading. 78% weekly active partners is not 78% of partner research happening through the system — it is the fraction of partners who used the system in any given week. Inside that group, the depth of use varied widely: some partners ran ten queries a day, others used it for spot checks on argument structure. The relevant signal was that the system had crossed the trust threshold for the partnership as a whole — citation skepticism was no longer the reason partners stayed away.

The 42% self-reported reduction in time-to-first-draft on research-heavy work is the soft business metric we report with the most caution. It is self-reported, sample-biased toward the partners who adopted earliest, and likely overstates the system-attributable portion. The hard metrics — citation accuracy, faithfulness, adoption — are the ones we trust. The productivity number is the partnership's framing of why those technical metrics mattered.

What stayed honest
Six months in, the harness gates were what kept the system on the right side of the citation-accuracy line. Every regression on faithfulness blocked a release; every blocked release became a new labeled eval case. The system stayed conservative by construction rather than by promise.

07 · Lessons + Replication · What we would build first at a smaller firm tomorrow.

Re-running this engagement at a firm with one-tenth the corpus size and one-tenth the engineering budget, the prioritization order is sharper than it was at the start of this project. Three decisions create the asymmetric upside; everything else is execution.

What to build first

  • The entity index, before the vector index. For any domain with strong proper-noun retrieval signal — legal, medical, financial, academic — exact-match recall on identifiers is the highest single quality lever. Building it first means even an early-version system surfaces the right authority on the queries that matter most.
  • Citation cards, before model upgrades. The fastest path to partner trust is making verification cheap. Spend a day on the inline citation card with hover-reveal before spending a week on prompt tuning or model selection.
  • The faithfulness gate, before the dashboard. Eval that blocks releases changes behavior. Eval that informs releases gets ignored under deadline pressure. Wire faithfulness as a CI gate on the first PR, even with a tiny labeled set, and grow the set from there.

What to skip until you need it

  • A custom reranker. A general-purpose reranker (Cohere, Voyage) gets you most of the way. Custom domain rerankers earn their keep only after the entity index, hybrid retrieval, and faithfulness gate are stable.
  • A bespoke chunking pipeline. Structure-aware chunking matters; rolling your own structure parser for every document type does not. Use the legal-document parsers that already exist and patch their edge cases rather than rebuilding from scratch.
  • An over-engineered refresh scheduler. Start with a fixed cadence and a checksum-based change detector. Layer in calendar-aware hot windows only when production usage shows the flat cadence is wrong.

One closing observation. The engagement did not produce a clever model trick or a novel retrieval algorithm; it produced a stack of pedestrian engineering decisions, each of which was disciplined enough to hold up under the next decision. Structure-aware chunking made hybrid retrieval possible. Hybrid retrieval made entity-index recall possible. Entity recall made faithful generation possible. Faithful generation made inline citations verifiable. Verifiable citations made partner trust possible. None of the steps were exotic on their own. The composition — holding all five steps to their respective bars at the same time — was the actual work. Replicating it at a smaller firm is more about that compositional discipline than about budget or headcount; the same five steps still apply, just at the scale the firm actually needs.

Conclusion

Legal RAG works when citation accuracy is the gate, not the hope.

The lesson the engagement crystallized is narrower and more actionable than the usual case-study takeaway. Legal RAG works when the team treats citation accuracy as a hard gate — measured, blocked on regression, enforced in CI — rather than as an aspirational property of the system. Every other architectural choice in this build flowed from that single commitment. Hybrid retrieval mattered because pure vector could not hit the citation bar. Structure-aware chunking mattered because sloppy chunking broke faithfulness. The citation card UI mattered because unverifiable attribution would have ended adoption regardless of the underlying accuracy.

The broader implication for any high-stakes RAG deployment — legal, medical, financial, regulatory — is that the gating metric needs to be explicit, measurable, and binding before the first line of retrieval code gets written. A team that knows it must hit 99% citation accuracy and has a verifier that says so on every PR will build a meaningfully different system than a team that intends to monitor accuracy after launch. The gating metric is the architecture.

For firms considering a similar build, the realistic six-month shape is: month one on corpus design and the harness, month two on retrieval and ingestion, month three on generation and attribution UX, month four on the partner pilot, months five and six on hardening, refresh cadence, and adoption work. The heaviest engineering happens in the first three months; the heaviest organizational work — earning partner trust, wiring the system into research workflows, closing the labeled-eval feedback loop — happens in months four through six. Both halves are necessary. Skipping either produces a system that technically works and is never used, or a system that is used until the first hallucinated citation surfaces and is then quietly abandoned.

Replicate this RAG


Our team designs production RAG systems for vertical industries — corpus design, hybrid retrieval, source attribution, faithfulness eval, refresh cadence.

What we build · Vertical RAG engagements

  • Corpus design for vertical archives
  • Hybrid retrieval implementation
  • Source-attribution UX (citation cards)
  • Faithfulness eval (RAGAS / DeepEval)
  • Refresh-cadence design aligned to industry calendar
FAQ · Legal RAG case

The questions teams ask after the case.

How was the fifty-year corpus structured and chunked?

The corpus split into three logical layers — primary authorities (opinions, statutes, regulations), secondary authorities (treatises, restatements, internal memoranda), and procedural material (briefs, motions, filings). Each layer carried its own chunking strategy: structure-aware on primary authorities (section, headnote, holding); semantic on secondary; metadata-rich but lightly chunked on procedural. Older holdings went through an OCR repair pass with a confidence gate before they could enter the index. Citation parsing standardized references into a Bluebook representation at ingest time, which let the entity index treat citations as exact-match keys rather than free-text tokens. The whole pipeline was designed so a citation parsed once at ingest never had to be re-parsed at query time.