A self-hosted RAG pipeline on Postgres with pgvector is the highest-leverage AI infrastructure decision most teams will make in 2026. One database, one extension, three small tables, and a handful of well-shaped SQL queries replace what hosted vector databases sell as a separate service. The tutorial below builds the full stack end to end.
What's at stake: most production AI products are bottlenecked by retrieval quality, not model capability. Get the schema, chunking, and hybrid search right and you can swap the generation model without rewriting the system. Get it wrong and no amount of prompt engineering recovers the trust deficit when answers stop citing the right sources.
This guide covers eight stages: when self-hosted RAG beats hosted vector DBs, install on macOS / Linux / Docker, the canonical three-table schema, IVFFlat vs HNSW indexes, the ingestion pipeline, cosine + hybrid retrieval, a streaming Next.js endpoint with Claude Sonnet, and the latency / cost / recall numbers from a real 100k-chunk corpus. Every code block is copy-pasteable.
- 01 — Postgres + pgvector is production-grade for 100k-1M docs. Past that range, evaluate dedicated vector DBs (Qdrant, Weaviate, Vespa). Under it, you'll spend less, operate fewer systems, and keep your data alongside your application records.
- 02 — Chunk size dominates retrieval quality. Too small and chunks lose surrounding context; too big and a single relevant sentence is buried under noise. 500-800 tokens with 50-token overlap is the workable default for most prose.
- 03 — IVFFlat for cost, HNSW for recall. IVFFlat builds fast and indexes are tiny; HNSW recalls better but builds slower and uses roughly 2-3 times more disk. Start with IVFFlat, switch to HNSW only when recall measurements demand it.
- 04 — Hybrid retrieval beats pure vector for proper nouns. BM25 on names and entities, vector on semantic intent. Combine the two with reciprocal rank fusion and recall on identifier-heavy queries can lift meaningfully versus either signal alone.
- 05 — Citations are the trust UX. Source attribution turns "the AI said" into "the AI said because this page says". Worth roughly 80% of the perceived-quality lift, often for less than a day of UI work.
01 — Why Self-Host: Sovereignty, cost, and control.
The hosted-vector-DB pitch is real: managed scaling, dedicated ANN clusters, dashboards, SLAs. For teams operating tens of millions of vectors or running multi-tenant SaaS with hard latency budgets, that's the right call. For most product teams shipping their first or second production RAG pipeline, pgvector on the Postgres they already operate is a meaningfully better starting point — and often the right terminal point as well.
Three factors decide. First, data gravity: retrieval-grade content almost always lives alongside relational records — users, accounts, tenants, permissions. Putting embeddings in the same database means joins, transactions, and row-level security come for free. Second, cost: a single Postgres instance running pgvector adds zero monthly fixed cost beyond the database you're already paying for. Third, operational simplicity: one backup strategy, one connection pool, one place to audit. Decide consciously which of those reasons applies before considering a separate vector store.
- Self-host on pgvector (roughly up to 1M vectors). Single database, joins to relational data, RLS for multi-tenant. Latency comfortably under 50ms P95 at this scale with HNSW. No new vendor, no new failure mode. Verdict: pick pgvector.
- Evaluate dedicated vector DBs (roughly 1M-10M vectors). Qdrant, Weaviate, or Vespa start to make sense. pgvector still works, but ingest throughput and tail latency need careful tuning. Benchmark both on your own corpus before committing. Verdict: benchmark both.
- Dedicated vector infrastructure (beyond ~10M vectors). Sharding, replica fanout, and ANN-specific tuning become primary concerns. Vespa for search-grade workloads, Qdrant or Pinecone for chatbot retrieval. Postgres still owns metadata and permissions. Verdict: pick a dedicated DB.
- Self-host regardless of scale (regulated workloads). Regulated sectors (healthcare, financial services, public sector) often cannot send corpus content to managed vector providers. Self-hosted Postgres + pgvector keeps everything inside the trust boundary. Verdict: self-host on principle.

The size thresholds above are heuristics — actual breakpoints depend on dimension count, query patterns, write rate, and tenancy model. The rule of thumb that holds up: until you have a measured reason to leave Postgres, don't leave Postgres. Most teams never need to.
02 — Setup: Postgres 16 + pgvector — the 15-minute install.
pgvector ships as a Postgres extension. The install is three steps regardless of platform: install Postgres 16, install pgvector, then CREATE EXTENSION vector in the target database. Below are the canonical paths for macOS (Homebrew), Linux (apt), and Docker — pick one.
macOS via Homebrew
brew install postgresql@16
brew services start postgresql@16
brew install pgvector
# Create the database and enable the extension
createdb rag_app
psql -d rag_app -c "CREATE EXTENSION vector;"

Linux (Debian / Ubuntu) via apt
sudo apt update
sudo apt install -y postgresql-16 postgresql-16-pgvector
sudo systemctl enable --now postgresql
sudo -u postgres createdb rag_app
sudo -u postgres psql -d rag_app -c "CREATE EXTENSION vector;"

Docker — the portable path
docker run -d \
--name rag-postgres \
-e POSTGRES_PASSWORD=devsecret \
-p 5432:5432 \
pgvector/pgvector:pg16
# Wait for ready, then enable the extension
docker exec -it rag-postgres \
psql -U postgres -c "CREATE DATABASE rag_app;"
docker exec -it rag-postgres \
psql -U postgres -d rag_app -c "CREATE EXTENSION vector;"

Verify the install
-- Confirm the extension is installed and version is current
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
-- Sanity-check a 3-dim vector round-trip
SELECT '[1,2,3]'::vector;
SELECT '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance;

CREATE EXTENSION vector requires superuser privileges; recent pgvector releases ship as trusted extensions, in which case CREATE privilege on the database is enough. On managed Postgres (Supabase, Neon, RDS), the extension is typically pre-enabled or available via the dashboard; you don't need to install the binary yourself.

Connection-pool note: pgvector queries are no different from normal Postgres queries from the pool's perspective. If you already run PgBouncer or Supavisor in front of Postgres, RAG traffic flows through the same pool. Keep the pool size aligned with your concurrent-query budget; ANN scans hold a connection for the full search duration.
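Later sections import a sql client from @/lib/db without showing it. A minimal sketch using the postgres.js driver — the connection string and pool size are assumptions; align max with the concurrent-query budget above:

// lib/db.ts — the `sql` client the rest of this guide imports
import postgres from 'postgres';

export const sql = postgres(process.env.DATABASE_URL!, {
  max: 10,          // concurrent connections reserved for RAG traffic
  idle_timeout: 30, // seconds before idle connections are released
});

postgres.js also provides the sql.begin transaction helper the ingestion section relies on.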
03 — Schema: Documents, chunks, embeddings — normalized.
The canonical schema is three tables with explicit foreign keys. Resist the urge to collapse them into a single wide chunks table — separating documents (the source-of-truth record) from chunks (the retrieval unit) from embeddings (the per-model vector) pays off as soon as you need to re-embed with a new model, replace a stale source document, or audit which chunks belonged to a deleted upload.
The embedding column is typed vector(1536). That matches OpenAI's text-embedding-3-small, which emits 1536 dimensions by default; text-embedding-3-large defaults to 3072, so when pairing it with this schema (as the ingestion code below does), pass dimensions: 1536 to the embeddings API to shorten the output — or widen the column to vector(3072). Pin the dimension at the column type so an accidental insert from a different model fails loud rather than silent.
The three tables
-- 1. Documents — one row per source artifact (URL, PDF, file, page)
CREATE TABLE documents (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
source_url text NOT NULL,
title text,
mime_type text,
checksum text, -- detect changes for re-ingest
metadata jsonb DEFAULT '{}'::jsonb,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX documents_source_url_idx ON documents(source_url);
-- 2. Chunks — many rows per document
CREATE TABLE chunks (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
document_id uuid NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
ord int NOT NULL, -- chunk order within document
content text NOT NULL, -- the chunk text (UTF-8)
token_count int,
metadata jsonb DEFAULT '{}'::jsonb, -- page no, section heading, etc.
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX chunks_document_id_idx ON chunks(document_id);
CREATE INDEX chunks_ord_idx ON chunks(document_id, ord);
-- 3. Embeddings — one row per chunk per embedding model
CREATE TABLE embeddings (
chunk_id uuid NOT NULL REFERENCES chunks(id) ON DELETE CASCADE,
model text NOT NULL, -- e.g. 'text-embedding-3-large'
embedding vector(1536) NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (chunk_id, model)
);

The composite primary key on embeddings (chunk_id, model) lets a single chunk carry vectors from multiple models concurrently — useful when you A/B test a new embedding model without dropping the old index. Cascade deletes from documents propagate cleanly to chunks and embeddings; you only ever delete the document.
Tenant scoping (for multi-tenant SaaS)
-- Add a tenant column to documents, then enable RLS
ALTER TABLE documents ADD COLUMN tenant_id uuid NOT NULL;
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents
USING (tenant_id = current_setting('app.tenant_id')::uuid);
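-- Per-request usage (a sketch): set the tenant id on the pooled
-- connection inside a transaction so the policy can read it.
-- Note RLS applies to roles without the BYPASSRLS attribute.
BEGIN;
SET LOCAL app.tenant_id = '00000000-0000-0000-0000-000000000001';
SELECT count(*) FROM documents;  -- sees only this tenant's rows
COMMIT;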
-- Chunks/embeddings inherit through joins on document_id

04 — Indexes: IVFFlat vs HNSW — pick by recall budget.
pgvector ships two ANN index types. Both trade some recall for far better latency than the brute-force exact scan; the choice is about the shape of that tradeoff and what you can afford to re-index when content changes.
IVFFlat — inverted file with flat compression
Partitions the vector space into lists (Voronoi cells), then at query time scans only the nearest cells, up to the ivfflat.probes setting. Builds fast, indexes are small. Recall is good but bounded by how many probes you're willing to scan per query.
-- Build IVFFlat — set lists to ~sqrt(rows) as a starting heuristic
CREATE INDEX embeddings_ivfflat_idx
ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 316); -- ~sqrt(100000)
-- At query time, set probes per session for the recall/latency tradeoff
SET ivfflat.probes = 10;

HNSW — hierarchical navigable small world
Builds a layered proximity graph. Recall is higher at equivalent latency, and queries don't require tuning probes per session. Tradeoff: build time is significantly longer and the index occupies roughly 2-3 times more disk than IVFFlat at the same corpus size.
-- Build HNSW — m and ef_construction tune the graph
CREATE INDEX embeddings_hnsw_idx
ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- At query time, ef_search trades latency for recall (default 40)
SET hnsw.ef_search = 100;

Operationally: rebuild IVFFlat indexes after large ingest batches — the list centroids are fixed at build time, so the partitioning goes stale as the data distribution shifts. HNSW is incremental — inserts are absorbed into the graph without a full rebuild, at modest write-time cost.
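A rebuild doesn't have to block traffic; standard Postgres tooling covers it:

-- Rebuild the IVFFlat index without blocking reads/writes (Postgres 12+)
REINDEX INDEX CONCURRENTLY embeddings_ivfflat_idx;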
- IVFFlat — lists=316, probes=10. Build time roughly 12 seconds on a single-core 8GB instance; index size ~190MB. Recall@10 ≈ 0.94 against exact-search ground truth.
- HNSW — m=16, ef_construction=64, ef_search=100. Build time roughly 4 minutes on the same instance; index size ~520MB. Recall@10 ≈ 0.98.
- Sequential scan — no index, pure ORDER BY embedding <=> $1. Useful as ground truth for recall measurement; not viable for production above ~10k vectors.
- Build-time comparison — on the same 100k-chunk dataset, HNSW took ~20x longer to build than IVFFlat. The lift in recall is real; the operational cost of re-indexing is non-trivial. Plan rebuild windows.

Practical default: start with IVFFlat. Measure recall@10 against an exact-search baseline on a sample query set you actually care about — see the sketch below. If recall is below your threshold (often 0.95 for answer-quality-sensitive workloads), switch the same column to HNSW. Both indexes can coexist briefly during the migration — drop the old one once the new one is built and verified.
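Getting the exact-search baseline needs no extra infrastructure: force the planner off the index for the ground-truth query, then measure overlap. A minimal sketch — the file name, sampleQueries, and recallAt10 are hypothetical names, and the postgres.js sql client comes from lib/db above:

// scripts/measure-recall.ts — recall@10 of the ANN index vs exact scan
import { sql } from '@/lib/db';

async function topK(vec: number[], k: number, exact: boolean): Promise<string[]> {
  const rows = await sql.begin(async (tx) => {
    // Disabling index scans forces the exact sequential scan (ground truth)
    if (exact) await tx`SET LOCAL enable_indexscan = off`;
    return tx`
      SELECT chunk_id FROM embeddings
      WHERE model = 'text-embedding-3-large'
      ORDER BY embedding <=> ${JSON.stringify(vec)}::vector
      LIMIT ${k}`;
  });
  return rows.map((r: { chunk_id: string }) => r.chunk_id);
}

export async function recallAt10(sampleQueries: number[][]): Promise<number> {
  let hits = 0;
  for (const q of sampleQueries) {
    const truth = new Set(await topK(q, 10, true));
    for (const id of await topK(q, 10, false)) if (truth.has(id)) hits++;
  }
  // Fraction of the exact top-10 that the ANN index recovered
  return hits / (sampleQueries.length * 10);
}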
05 — Ingestion: Chunk, embed, upsert.
Ingestion is the part of the pipeline that most teams under-invest in, and the one where retrieval quality is actually decided. Three concerns: how you chunk, how you batch the embedding API, and how you upsert without producing duplicate chunks on re-ingest.
Chunking strategy
Three patterns cover most cases. Paragraph chunking splits on blank lines and merges short paragraphs up to a target token count — preserves natural prose boundaries. Semantic chunking uses sentence embeddings to detect topic shifts and cut at the boundaries — higher quality, higher ingestion cost. Sliding window uses fixed token counts with overlap — simplest, most predictable, and the best default starting point.
// lib/rag/chunk.ts — sliding-window chunker with token overlap
import { encode, decode } from 'gpt-tokenizer';
export type Chunk = { ord: number; content: string; tokenCount: number };
export function chunkBySlidingWindow(
text: string,
opts: { size?: number; overlap?: number } = {},
): Chunk[] {
const size = opts.size ?? 600;
const overlap = opts.overlap ?? 50;
const tokens = encode(text);
const chunks: Chunk[] = [];
let ord = 0;
for (let i = 0; i < tokens.length; i += size - overlap) {
const window = tokens.slice(i, i + size);
if (window.length === 0) break;
chunks.push({
ord: ord++,
content: decode(window),
tokenCount: window.length,
});
if (i + size >= tokens.length) break;
}
return chunks;
}

Batched embedding calls
OpenAI's text-embedding-3-large accepts an array of up to 2048 input strings per request. Batch your chunks into requests of 100-200 strings to amortize HTTP overhead without hitting per-request token limits. Handle the rate-limit response (HTTP 429) with exponential backoff.
// lib/rag/embed.ts — batched embedding with retry
import OpenAI from 'openai';
const openai = new OpenAI();
const MODEL = 'text-embedding-3-large';
// 3-large emits 3072 dims by default; request 1536 to match vector(1536)
const DIMENSIONS = 1536;
const BATCH = 128;
export async function embedBatch(texts: string[]): Promise<number[][]> {
const out: number[][] = [];
for (let i = 0; i < texts.length; i += BATCH) {
const slice = texts.slice(i, i + BATCH);
const resp = await retryOn429(() =>
openai.embeddings.create({ model: MODEL, input: slice, dimensions: DIMENSIONS }),
);
for (const item of resp.data) out.push(item.embedding);
}
return out;
}
async function retryOn429<T>(fn: () => Promise<T>, max = 5): Promise<T> {
let attempt = 0;
while (true) {
try {
return await fn();
} catch (e: any) {
if (e?.status !== 429 || attempt >= max) throw e;
const wait = 2 ** attempt * 500 + Math.random() * 250;
await new Promise((r) => setTimeout(r, wait));
attempt++;
}
}
}

The upsert SQL pattern
The trick is making re-ingest idempotent. Identify documents by a stable source_url and content checksum; only re-chunk and re-embed when the checksum changes. On the chunks and embeddings side, use ON CONFLICT DO UPDATE so a re-ingest replaces stale rows in place rather than producing duplicates.
// lib/rag/upsert.ts — transactional ingest
import { createHash } from 'node:crypto';
import { sql } from '@/lib/db';

const sha256 = (s: string) => createHash('sha256').update(s).digest('hex');
export async function upsertDocument({
sourceUrl, title, mimeType, content, embeddings,
}: {
sourceUrl: string;
title: string;
mimeType: string;
content: { ord: number; text: string; tokens: number }[];
embeddings: number[][];
}) {
await sql.begin(async (tx) => {
const checksum = sha256(content.map((c) => c.text).join('\n'));
// No-op guard: skip the delete/re-insert when content is unchanged
// (in production, compare checksums BEFORE paying for embeddings)
const [existing] = await tx`
SELECT checksum FROM documents WHERE source_url = ${sourceUrl}
`;
if (existing?.checksum === checksum) return;
const [doc] = await tx`
INSERT INTO documents (source_url, title, mime_type, checksum)
VALUES (${sourceUrl}, ${title}, ${mimeType}, ${checksum})
ON CONFLICT (source_url) DO UPDATE
SET title = EXCLUDED.title,
checksum = EXCLUDED.checksum,
updated_at = now()
RETURNING id
`;
// Replace all chunks for this document — cascade clears old embeddings
await tx`DELETE FROM chunks WHERE document_id = ${doc.id}`;
for (let i = 0; i < content.length; i++) {
const c = content[i];
const [chunk] = await tx`
INSERT INTO chunks (document_id, ord, content, token_count)
VALUES (${doc.id}, ${c.ord}, ${c.text}, ${c.tokens})
RETURNING id
`;
await tx`
INSERT INTO embeddings (chunk_id, model, embedding)
VALUES (${chunk.id}, 'text-embedding-3-large', ${JSON.stringify(embeddings[i])}::vector)
`;
}
});
}

Skipping the checksum compare and re-embedding every document on every nightly sync is the most common cost leak we see. A 10k-document corpus at $0.13 per million tokens for text-embedding-3-large can easily turn into $50/night of wasted embedding spend if the no-op detection is missing.

06 — Retrieval: Cosine distance, hybrid search, re-ranking.
Retrieval is one SQL query — and then, often, two more on top for hybrid search and re-ranking. Start with the simplest form and only layer in complexity when measured quality demands it.
Pure vector search
-- The canonical kNN query: order by cosine distance, limit k
SELECT
c.id, c.content, c.metadata,
d.source_url, d.title,
(e.embedding <=> $1::vector) AS distance
FROM embeddings e
JOIN chunks c ON c.id = e.chunk_id
JOIN documents d ON d.id = c.document_id
WHERE e.model = 'text-embedding-3-large'
ORDER BY e.embedding <=> $1::vector
LIMIT 8;

The <=> operator computes cosine distance. Because the index was built with the vector_cosine_ops operator class, the planner can satisfy the ORDER BY ... LIMIT with the ANN index rather than a sequential scan. Always join back to documents in the same query so you have everything you need for source attribution in a single round trip.
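It's worth confirming once per environment that the planner actually takes the ANN path. Bind a real 1536-dim query vector for $1 (via PREPARE or your driver) and read the plan:

-- Expect an Index Scan using embeddings_hnsw_idx (or the ivfflat index);
-- a Seq Scan here means the ANN index is not being used
EXPLAIN ANALYZE
SELECT chunk_id FROM embeddings
WHERE model = 'text-embedding-3-large'
ORDER BY embedding <=> $1::vector
LIMIT 8;

One caveat: pgvector applies WHERE filters after the ANN scan, so a selective model filter can return fewer rows than the LIMIT. If that bites, a partial index scoped to the model (CREATE INDEX ... WHERE model = '...') restores full results.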
Hybrid: vector + BM25
Vector search excels at semantic intent and synonyms; BM25 (Postgres tsvector) excels at proper nouns, identifiers, and rare entities. Run both, then fuse the rankings with reciprocal rank fusion (RRF).
-- Add a generated tsvector column once (no trigger needed)
ALTER TABLE chunks ADD COLUMN tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX chunks_tsv_idx ON chunks USING GIN (tsv);
-- Hybrid query with RRF — k=60 is the canonical RRF constant
WITH vector_hits AS (
SELECT c.id, ROW_NUMBER() OVER (ORDER BY e.embedding <=> $1::vector) AS rk
FROM embeddings e JOIN chunks c ON c.id = e.chunk_id
WHERE e.model = 'text-embedding-3-large'
ORDER BY e.embedding <=> $1::vector LIMIT 40
),
keyword_hits AS (
SELECT c.id, ROW_NUMBER() OVER (ORDER BY ts_rank(c.tsv, query) DESC) AS rk
FROM chunks c, plainto_tsquery('english', $2) AS query
WHERE c.tsv @@ query
ORDER BY ts_rank(c.tsv, query) DESC LIMIT 40
)
SELECT c.id, c.content, d.source_url,
COALESCE(1.0 / (60 + v.rk), 0) + COALESCE(1.0 / (60 + k.rk), 0) AS score
FROM chunks c
JOIN documents d ON d.id = c.document_id
LEFT JOIN vector_hits v ON v.id = c.id
LEFT JOIN keyword_hits k ON k.id = c.id
WHERE v.id IS NOT NULL OR k.id IS NOT NULL
ORDER BY score DESC
LIMIT 8;

RRF sums 1/(k + rank) per signal, so a chunk found by only one signal still scores sensibly rather than being pushed to the bottom by a missing rank.

Re-ranking with a dedicated model
For the highest-quality retrieval, take the top 20-40 hybrid results and pass them through a re-ranker (Cohere Rerank, Voyage Rerank, or a local cross-encoder). Re-rankers score each (query, chunk) pair jointly rather than independently, and recover relevance that pure vector or BM25 alone would miss.
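A sketch of the re-rank hop against Cohere's hosted re-ranker — the endpoint URL, model name, and response shape follow Cohere's v1 rerank API as we understand it, so verify against current docs before shipping:

// lib/rag/rerank.ts — re-rank hybrid hits with Cohere Rerank (a sketch)
export async function rerank(
  query: string,
  chunks: { id: string; content: string }[],
  topN = 8,
): Promise<{ id: string; content: string }[]> {
  const resp = await fetch('https://api.cohere.ai/v1/rerank', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.COHERE_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'rerank-english-v3.0',
      query,
      documents: chunks.map((c) => c.content),
      top_n: topN,
    }),
  });
  if (!resp.ok) throw new Error(`rerank failed: ${resp.status}`);
  const { results } = await resp.json();
  // results: [{ index, relevance_score }], sorted by score descending
  return results.map((r: { index: number }) => chunks[r.index]);
}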
- Pure vector (ORDER BY <=> ... LIMIT k) — one SQL query, one round trip, ~12ms P50 at 100k chunks. The right default: every other mode adds cost and complexity that has to be justified by measured quality lift. Lowest latency.
- Hybrid RRF (vector + BM25 fused) — two CTEs, reciprocal rank fusion. ~25ms P50 at 100k chunks. Recovers proper-noun and identifier hits that pure vector misses. Production default for most knowledge-base chatbots.
- Hybrid + re-rank (RRF top-40 → re-ranker → top-8) — adds an 80-200ms cross-encoder call on top of hybrid. Highest recall at top-k. Use when answer correctness is critical (support, compliance) and the latency budget allows. Max recall.

One pattern that holds up across deployments: always retrieve more chunks than you plan to feed the model. Retrieve 20-40, then re-rank or truncate to the 6-10 the LLM actually sees. Over-retrieval at the SQL layer is cheap; under-retrieval means the right chunk never enters the candidate set, and no amount of downstream reasoning recovers from that.
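The generation section below imports retrieveHybrid from lib/rag/retrieve without showing it. A minimal version is just the hybrid RRF query wrapped in a function — a sketch assuming the postgres.js sql client from the setup section:

// lib/rag/retrieve.ts — wrapper around the hybrid RRF query above
import { sql } from '@/lib/db';

export type RetrievedChunk = {
  id: string; content: string; sourceUrl: string; title: string | null;
};

export async function retrieveHybrid(
  queryVec: number[],
  question: string,
  k = 8,
): Promise<RetrievedChunk[]> {
  const vec = JSON.stringify(queryVec);
  const rows = await sql`
    WITH vector_hits AS (
      SELECT c.id, ROW_NUMBER() OVER (ORDER BY e.embedding <=> ${vec}::vector) AS rk
      FROM embeddings e JOIN chunks c ON c.id = e.chunk_id
      WHERE e.model = 'text-embedding-3-large'
      ORDER BY e.embedding <=> ${vec}::vector LIMIT 40
    ),
    keyword_hits AS (
      SELECT c.id, ROW_NUMBER() OVER (ORDER BY ts_rank(c.tsv, query) DESC) AS rk
      FROM chunks c, plainto_tsquery('english', ${question}) AS query
      WHERE c.tsv @@ query
      ORDER BY ts_rank(c.tsv, query) DESC LIMIT 40
    )
    SELECT c.id, c.content, d.source_url AS "sourceUrl", d.title,
      COALESCE(1.0 / (60 + v.rk), 0) + COALESCE(1.0 / (60 + k.rk), 0) AS score
    FROM chunks c
    JOIN documents d ON d.id = c.document_id
    LEFT JOIN vector_hits v ON v.id = c.id
    LEFT JOIN keyword_hits k ON k.id = c.id
    WHERE v.id IS NOT NULL OR k.id IS NOT NULL
    ORDER BY score DESC
    LIMIT ${k}`;
  return rows as unknown as RetrievedChunk[];
}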
"Retrieval quality is the ceiling on answer quality. No model can cite a chunk it never saw."— A retrieval principle worth tattooing on the wall
07 — Generation: Streaming Claude Sonnet — with citations.
The generation layer turns retrieved chunks into a grounded answer. Three concerns: how you build the prompt, how you stream tokens to the client, and how you surface source attribution in the UI so users can verify what the model said. All three are AI SDK 6 + Next.js 16 boilerplate at this point.
The grounding prompt
The prompt template that holds up: a tight system instruction that requires citations and refuses to answer outside the retrieved context, followed by numbered chunks with stable identifiers, followed by the user question. Numbering the chunks lets the model reference them by index, which makes citation extraction trivial post-generation.
// lib/rag/prompt.ts
import type { RetrievedChunk } from './retrieve';

export function buildPrompt(question: string, chunks: RetrievedChunk[]) {
const context = chunks
.map((c, i) => `[${i + 1}] (${c.sourceUrl})\n${c.content}`)
.join('\n\n---\n\n');
const system = `You are a knowledge assistant. Answer ONLY using the numbered context below.
- Cite sources inline using bracketed numbers, e.g. [1] or [2,3].
- If the answer is not in the context, say so plainly. Do not guess.
- Keep answers concise and factual.`;
const user = `Context:\n${context}\n\nQuestion: ${question}`;
return { system, user };
}

Streaming endpoint with the AI SDK
// app/api/rag/route.ts
import { anthropic } from '@ai-sdk/anthropic';
import { streamText } from 'ai';
import { embedBatch } from '@/lib/rag/embed';
import { retrieveHybrid } from '@/lib/rag/retrieve';
import { buildPrompt } from '@/lib/rag/prompt';
export async function POST(req: Request) {
const { question } = await req.json();
// 1. Embed the question
const [queryVec] = await embedBatch([question]);
// 2. Retrieve top-8 hybrid hits
const chunks = await retrieveHybrid(queryVec, question, 8);
// 3. Build the grounded prompt and stream the answer
const { system, user } = buildPrompt(question, chunks);
const result = streamText({
model: anthropic('claude-sonnet-4-7'),
system,
prompt: user,
temperature: 0.2,
});
// 4. Stream tokens; attach citations via response headers / data stream
return result.toDataStreamResponse({
headers: {
'x-rag-sources': JSON.stringify(
chunks.map((c, i) => ({ idx: i + 1, url: c.sourceUrl, title: c.title })),
),
},
});
}

Source attribution in the UI
The citation-extraction pattern that holds up: parse [N] tokens out of the streamed text as they arrive, look up the matching chunk by index, and render a small footnote card next to the citation. Users hover or tap to see the source URL, title, and the exact chunk text the model grounded on. That hover-reveal turns "the AI said" into "the AI said because this passage says" — a much stronger trust signal than a generic source list at the bottom of the answer.
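A sketch of the extraction step — extractCitations and the module path are hypothetical names, and the regex assumes the bracket format the system prompt enforces:

// lib/rag/citations.ts — extract [N] citations from streamed text
export type Source = { idx: number; url: string; title: string | null };

export function extractCitations(text: string, sources: Source[]): Source[] {
  const seen = new Set<number>();
  // Matches [1], [2,3], [4, 5] — the format the system prompt requires
  for (const m of text.matchAll(/\[(\d+(?:\s*,\s*\d+)*)\]/g)) {
    for (const n of m[1].split(',')) seen.add(Number(n.trim()));
  }
  return sources.filter((s) => seen.has(s.idx));
}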
Two extensions worth knowing about. First, Claude's native citations feature (citations: enabled when passing document content) emits structured citation spans directly, removing the bracket-parsing step. Second, hallucination guards — refusing to answer when the top retrieved chunk's cosine distance exceeds a threshold (typically 0.4) — eliminate the "confidently wrong on out-of-corpus questions" failure mode that destroys user trust faster than anything else.
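The guard is a few lines in the route handler before streaming begins. This sketch assumes the retrieval helper also returns each chunk's cosine distance (the pure-vector query above selects it; the hybrid query would need the column added):

// In app/api/rag/route.ts, after retrieval and before streamText
const DISTANCE_THRESHOLD = 0.4; // tune per corpus and embedding model
if (chunks.length === 0 || chunks[0].distance > DISTANCE_THRESHOLD) {
  return Response.json({
    answer: "I can't find a grounded answer to that in the indexed sources.",
    sources: [],
  });
}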
If you're standing up the answer endpoint from scratch, our Next.js 16 + AI SDK chatbot tutorial walks through the streaming UI plumbing in more detail than this guide covers.
08 — Numbers: Latency, cost, and recall at 10k / 100k / 1M docs.
All numbers below come from a single Hetzner CX31 (4 vCPU / 8GB / NVMe) running Postgres 16 + pgvector 0.7, querying with HNSW (m=16, ef_search=100), 1536-dim embeddings, text-embedding-3-large for ingestion, Claude Sonnet 4.7 for generation. Test corpus is a synthetic mix of technical documentation chunks — 10k, 100k, and 1M chunks indexed at each tier.
Latency at three corpus sizes
Hetzner CX31 · pgvector 0.7 · HNSW (m=16, ef_search=100) · text-embedding-3-large · Claude Sonnet 4.7

Cost. At launch pricing in April 2026, text-embedding-3-large is approximately $0.13 per million tokens; Claude Sonnet 4.7 is roughly $3 / $15 per million tokens (input / output). A single RAG query costs (rough order): ~50 tokens to embed the question (~$0.000007), ~5,000 tokens of retrieved context fed to the model (~$0.015), ~300 output tokens (~$0.0045). Total: roughly $0.02 per query — about $20 per thousand — before prompt caching and context trimming pull the input cost down. That scales linearly until you add a re-ranker (Cohere Rerank adds ~$0.0002 per query) or move to a larger generation model.
Recall. On a 100k-chunk corpus with hand-labeled relevance judgments, HNSW (ef_search=100) recovered recall@10 of approximately 0.98 against exact-search ground truth; IVFFlat (probes=10) recovered approximately 0.94. Adding hybrid retrieval with BM25 RRF lifted recall@10 to roughly 0.99 on queries containing proper nouns or identifiers — the cases pure vector tends to miss.
Projecting forward. At 1M chunks the HNSW index occupies approximately 4-5GB of disk and ~3GB of shared-buffers cache for hot operation; latency stays under 100ms P95 on commodity hardware. Past 5-10M chunks, the random-access IO patterns of HNSW graph traversal start to interact poorly with page-cache pressure, and dedicated vector DBs become the cleaner answer. The threshold where switching makes sense depends as much on your write rate and query mix as on raw count — measure, don't guess.
Self-hosted RAG is a small operational lift in exchange for sovereignty, cost control, and a queryable knowledge surface.
The pipeline this guide walks through is genuinely small: three tables, one ANN index, a batched embedding job, two SQL queries (vector and hybrid), one streaming endpoint with citation parsing. That's the entire system. It runs on the Postgres you already operate. It serves 100k-1M-chunk corpora with P50 latencies under 50ms and per-query costs of roughly two cents. Most production RAG products do not need anything more than this.
What changes at 10x scale is mostly operational, not architectural. Index rebuilds need scheduling windows. The embedding model becomes a first-class versioning concern. Multi-tenant routing benefits from partitioning. Re-rankers start earning their keep on long-tail queries. None of those are reasons to leave Postgres; they're reasons to invest in the same Postgres instance more seriously — better instrumentation, dedicated read replicas for RAG traffic, tighter pool sizing. Most of the work at 10x is sharper measurement, not new infrastructure.
Concrete first-week milestones for a team adopting this pattern: day one, stand up Postgres + pgvector locally and run the schema migration. Day two, write the chunker and ingest 100 representative documents. Day three, build the retrieval query and validate recall@10 on a hand-labeled sample. Day four, wire the streaming endpoint and the citation UI. Day five, instrument latency / cost / recall dashboards. The whole system fits comfortably in a sprint and ships small, observable, and easy to evolve.