A self-hosted RAG pipeline on Postgres with pgvector is the highest-leverage AI infrastructure decision most teams will make in 2026. One database, one extension, three small tables, and a handful of well-shaped SQL queries replace what hosted vector databases sell as a separate service. The tutorial below builds the full stack end to end.
What's at stake: most production AI products are bottlenecked by retrieval quality, not model capability. Get the schema, chunking, and hybrid search right and you can swap the generation model without rewriting the system. Get it wrong and no amount of prompt engineering recovers the trust deficit when answers stop citing the right sources.
This guide covers eight stages: when self-hosted RAG beats hosted vector DBs, install on macOS / Linux / Docker, the canonical three-table schema, IVFFlat vs HNSW indexes, the ingestion pipeline, cosine + hybrid retrieval, a streaming Next.js endpoint with Claude Sonnet, and the latency / cost / recall numbers from a real 100k-chunk corpus. Every code block is copy-pasteable.
- 01 — Postgres + pgvector is production-grade for 100k-1M docs. Past that range, evaluate dedicated vector DBs (Qdrant, Weaviate, Vespa). Under it, you'll spend less, operate fewer systems, and keep your data alongside your application records.
- 02 — Chunk size dominates retrieval quality. Too small and chunks lose surrounding context; too big and a single relevant sentence is buried under noise. 500-800 tokens with 50-token overlap is the workable default for most prose.
- 03 — IVFFlat for cost, HNSW for recall. IVFFlat builds fast and indexes are tiny; HNSW recalls better but builds slower and uses roughly 2-3 times more disk. Start with IVFFlat, switch to HNSW only when recall measurements demand it.
- 04 — Hybrid retrieval beats pure vector for proper nouns. BM25 on names and entities, vector on semantic intent. Combine the two with reciprocal rank fusion and recall on identifier-heavy queries can lift meaningfully versus either signal alone.
- 05 — Citations are the trust UX. Source attribution turns "the AI said" into "the AI said because this page says". Worth roughly 80% of the perceived-quality lift, often for less than a day of UI work.
01 — Why Self-Host: Sovereignty, cost, and control.
The hosted-vector-DB pitch is real: managed scaling, dedicated ANN clusters, dashboards, SLAs. For teams operating tens of millions of vectors or running multi-tenant SaaS with hard latency budgets, that's the right call. For most product teams shipping their first or second production RAG pipeline, pgvector on the Postgres they already operate is a meaningfully better starting point — and often the right terminal point as well.
Three factors decide. First, data gravity: retrieval-grade content almost always lives alongside relational records — users, accounts, tenants, permissions. Putting embeddings in the same database means joins, transactions, and row-level security come for free. Second, cost: a single Postgres instance running pgvector adds zero monthly fixed cost beyond the database you're already paying for. Third, operational simplicity: one backup strategy, one connection pool, one place to audit. Decide consciously which of those reasons applies before considering a separate vector store.
- Self-host on pgvector (roughly up to 1M vectors). Single database, joins to relational data, RLS for multi-tenant. Latency comfortably under 50ms P95 at this scale with HNSW. No new vendor, no new failure mode. Verdict: pick pgvector.
- Evaluate dedicated vector DBs (roughly 1M-10M vectors). Qdrant, Weaviate, or Vespa start to make sense. pgvector still works, but ingest throughput and tail latency need careful tuning. Benchmark both on your own corpus before committing. Verdict: benchmark both.
- Dedicated vector infrastructure (beyond ~10M vectors). Sharding, replica fanout, and ANN-specific tuning become primary concerns. Vespa for search-grade workloads, Qdrant or Pinecone for chatbot retrieval. Postgres still owns metadata and permissions. Verdict: pick a dedicated DB.
- Self-host regardless of scale (regulated workloads). Regulated sectors (healthcare, financial services, public sector) often cannot send corpus content to managed vector providers. Self-hosted Postgres + pgvector keeps everything inside the trust boundary. Verdict: self-host on principle.

The size thresholds above are heuristics — actual breakpoints depend on dimension count, query patterns, write rate, and tenancy model. The rule of thumb that holds up: until you have a measured reason to leave Postgres, don't leave Postgres. Most teams never need to.
02 — Setup: Postgres 16 + pgvector — the 15-minute install.
pgvector ships as a Postgres extension. The install is three steps regardless of platform: install Postgres 16, install pgvector, then CREATE EXTENSION vector in the target database. Below are the canonical paths for macOS (Homebrew), Linux (apt), and Docker — pick one.
macOS via Homebrew
brew install postgresql@16
brew services start postgresql@16
brew install pgvector
# Create the database and enable the extension
createdb rag_app
psql -d rag_app -c "CREATE EXTENSION vector;"

Linux (Debian / Ubuntu) via apt
sudo apt update
sudo apt install -y postgresql-16 postgresql-16-pgvector
sudo systemctl enable --now postgresql
sudo -u postgres createdb rag_app
sudo -u postgres psql -d rag_app -c "CREATE EXTENSION vector;"

Docker — the portable path
docker run -d \
--name rag-postgres \
-e POSTGRES_PASSWORD=devsecret \
-p 5432:5432 \
pgvector/pgvector:pg16
# Wait for ready, then enable the extension
docker exec -it rag-postgres \
psql -U postgres -c "CREATE DATABASE rag_app;"
docker exec -it rag-postgres \
psql -U postgres -d rag_app -c "CREATE EXTENSION vector;"

Verify the install
-- Confirm the extension is installed and version is current
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
-- Sanity-check a 3-dim vector round-trip
SELECT '[1,2,3]'::vector;
SELECT '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance;

CREATE EXTENSION vector requires superuser privileges; recent pgvector releases ship as trusted extensions, in which case CREATE privilege on the database is enough. On managed Postgres (Supabase, Neon, RDS), the extension is typically pre-enabled or available via the dashboard; you don't need to install the binary yourself.

Connection-pool note: pgvector queries are no different from normal Postgres queries from the pool's perspective. If you already run PgBouncer or Supavisor in front of Postgres, RAG traffic flows through the same pool. Keep the pool size aligned with your concurrent-query budget; ANN scans hold a connection for the full search duration.
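Later sections import a sql client from @/lib/db without showing it. A minimal sketch using the postgres.js driver — the connection string and pool size are assumptions; align max with the concurrent-query budget above:

// lib/db.ts — the `sql` client the rest of this guide imports
import postgres from 'postgres';

export const sql = postgres(process.env.DATABASE_URL!, {
  max: 10,          // concurrent connections reserved for RAG traffic
  idle_timeout: 30, // seconds before idle connections are released
});

postgres.js also provides the sql.begin transaction helper the ingestion section relies on.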
03 — Schema: Documents, chunks, embeddings — normalized.
The canonical schema is three tables with explicit foreign keys. Resist the urge to collapse them into a single wide chunks table — separating documents (the source-of-truth record) from chunks (the retrieval unit) from embeddings (the per-model vector) pays off as soon as you need to re-embed with a new model, replace a stale source document, or audit which chunks belonged to a deleted upload.
The embedding column is typed vector(1536). That matches OpenAI's text-embedding-3-small, which emits 1536 dimensions by default; text-embedding-3-large defaults to 3072, so when pairing it with this schema (as the ingestion code below does), pass dimensions: 1536 to the embeddings API to shorten the output — or widen the column to vector(3072). Pin the dimension at the column type so an accidental insert from a different model fails loud rather than silent.
The three tables
-- 1. Documents — one row per source artifact (URL, PDF, file, page)
CREATE TABLE documents (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
source_url text NOT NULL,
title text,
mime_type text,
checksum text, -- detect changes for re-ingest
metadata jsonb DEFAULT '{}'::jsonb,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX documents_source_url_idx ON documents(source_url);
-- 2. Chunks — many rows per document
CREATE TABLE chunks (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
document_id uuid NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
ord int NOT NULL, -- chunk order within document
content text NOT NULL, -- the chunk text (UTF-8)
token_count int,
metadata jsonb DEFAULT '{}'::jsonb, -- page no, section heading, etc.
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX chunks_document_id_idx ON chunks(document_id);
CREATE INDEX chunks_ord_idx ON chunks(document_id, ord);
-- 3. Embeddings — one row per chunk per embedding model
CREATE TABLE embeddings (
chunk_id uuid NOT NULL REFERENCES chunks(id) ON DELETE CASCADE,
model text NOT NULL, -- e.g. 'text-embedding-3-large'
embedding vector(1536) NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (chunk_id, model)
);

The composite primary key on embeddings (chunk_id, model) lets a single chunk carry vectors from multiple models concurrently — useful when you A/B test a new embedding model without dropping the old index. Cascade deletes from documents propagate cleanly to chunks and embeddings; you only ever delete the document.
Tenant scoping (for multi-tenant SaaS)
-- Add a tenant column to documents, then enable RLS
ALTER TABLE documents ADD COLUMN tenant_id uuid NOT NULL;
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents
USING (tenant_id = current_setting('app.tenant_id')::uuid);
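-- Per-request usage (a sketch): set the tenant id on the pooled
-- connection inside a transaction so the policy can read it.
-- Note RLS applies to roles without the BYPASSRLS attribute.
BEGIN;
SET LOCAL app.tenant_id = '00000000-0000-0000-0000-000000000001';
SELECT count(*) FROM documents;  -- sees only this tenant's rows
COMMIT;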
-- Chunks/embeddings inherit through joins on document_id

04 — Indexes: IVFFlat vs HNSW — pick by recall budget.
pgvector ships two ANN index types. Both trade some recall for far better latency than the brute-force exact scan; the choice is about the shape of that tradeoff and what you can afford to re-index when content changes.
IVFFlat — inverted file with flat compression
Partitions the vector space into lists (Voronoi cells), then at query time scans only the nearest cells, up to the ivfflat.probes setting. Builds fast, indexes are small. Recall is good but bounded by how many probes you're willing to scan per query.
-- Build IVFFlat — set lists to ~sqrt(rows) as a starting heuristic
CREATE INDEX embeddings_ivfflat_idx
ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 316); -- ~sqrt(100000)
-- At query time, set probes per session for the recall/latency tradeoff
SET ivfflat.probes = 10;

HNSW — hierarchical navigable small world
Builds a layered proximity graph. Recall is higher at equivalent latency, and queries don't require tuning probes per session. Tradeoff: build time is significantly longer and the index occupies roughly 2-3 times more disk than IVFFlat at the same corpus size.
-- Build HNSW — m and ef_construction tune the graph
CREATE INDEX embeddings_hnsw_idx
ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- At query time, ef_search trades latency for recall (default 40)
SET hnsw.ef_search = 100;

Operationally: rebuild IVFFlat indexes after large ingest batches — the list centroids are fixed at build time, so the partitioning goes stale as the data distribution shifts. HNSW is incremental — inserts are absorbed into the graph without a full rebuild, at modest write-time cost.
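A rebuild doesn't have to block traffic; standard Postgres tooling covers it:

-- Rebuild the IVFFlat index without blocking reads/writes (Postgres 12+)
REINDEX INDEX CONCURRENTLY embeddings_ivfflat_idx;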
- IVFFlat — lists=316, probes=10. Build time roughly 12 seconds on a single-core 8GB instance; index size ~190MB. Recall@10 ≈ 0.94 against exact-search ground truth.
- HNSW — m=16, ef_construction=64, ef_search=100. Build time roughly 4 minutes on the same instance; index size ~520MB. Recall@10 ≈ 0.98.
- Sequential scan — no index, pure ORDER BY embedding <=> $1. Useful as ground truth for recall measurement; not viable for production above ~10k vectors.
- Build-time comparison — on the same 100k-chunk dataset, HNSW took ~20x longer to build than IVFFlat. The lift in recall is real; the operational cost of re-indexing is non-trivial. Plan rebuild windows.

Practical default: start with IVFFlat. Measure recall@10 against an exact-search baseline on a sample query set you actually care about — see the sketch below. If recall is below your threshold (often 0.95 for answer-quality-sensitive workloads), switch the same column to HNSW. Both indexes can coexist briefly during the migration — drop the old one once the new one is built and verified.
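Getting the exact-search baseline needs no extra infrastructure: force the planner off the index for the ground-truth query, then measure overlap. A minimal sketch — the file name, sampleQueries, and recallAt10 are hypothetical names, and the postgres.js sql client comes from lib/db above:

// scripts/measure-recall.ts — recall@10 of the ANN index vs exact scan
import { sql } from '@/lib/db';

async function topK(vec: number[], k: number, exact: boolean): Promise<string[]> {
  const rows = await sql.begin(async (tx) => {
    // Disabling index scans forces the exact sequential scan (ground truth)
    if (exact) await tx`SET LOCAL enable_indexscan = off`;
    return tx`
      SELECT chunk_id FROM embeddings
      WHERE model = 'text-embedding-3-large'
      ORDER BY embedding <=> ${JSON.stringify(vec)}::vector
      LIMIT ${k}`;
  });
  return rows.map((r: { chunk_id: string }) => r.chunk_id);
}

export async function recallAt10(sampleQueries: number[][]): Promise<number> {
  let hits = 0;
  for (const q of sampleQueries) {
    const truth = new Set(await topK(q, 10, true));
    for (const id of await topK(q, 10, false)) if (truth.has(id)) hits++;
  }
  // Fraction of the exact top-10 that the ANN index recovered
  return hits / (sampleQueries.length * 10);
}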
05 — Ingestion: Chunk, embed, upsert.
Ingestion is the part of the pipeline that most teams under-invest in, and the one where retrieval quality is actually decided. Three concerns: how you chunk, how you batch the embedding API, and how you upsert without producing duplicate chunks on re-ingest.
Chunking strategy
Three patterns cover most cases. Paragraph chunking splits on blank lines and merges short paragraphs up to a target token count — preserves natural prose boundaries. Semantic chunking uses sentence embeddings to detect topic shifts and cut at the boundaries — higher quality, higher ingestion cost. Sliding window uses fixed token counts with overlap — simplest, most predictable, and the best default starting point.
// lib/rag/chunk.ts — sliding-window chunker with token overlap
import { encode, decode } from 'gpt-tokenizer';
export type Chunk = { ord: number; content: string; tokenCount: number };
export function chunkBySlidingWindow(
text: string,
opts: { size?: number; overlap?: number } = {},
): Chunk[] {
const size = opts.size ?? 600;
const overlap = opts.overlap ?? 50;
const tokens = encode(text);
const chunks: Chunk[] = [];
let ord = 0;
for (let i = 0; i < tokens.length; i += size - overlap) {
const window = tokens.slice(i, i + size);
if (window.length === 0) break;
chunks.push({
ord: ord++,
content: decode(window),
tokenCount: window.length,
});
if (i + size >= tokens.length) break;
}
return chunks;
}

Batched embedding calls
OpenAI's text-embedding-3-large accepts an array of up to 2048 input strings per request. Batch your chunks into requests of 100-200 strings to amortize HTTP overhead without hitting per-request token limits. Handle the rate-limit response (HTTP 429) with exponential backoff.
// lib/rag/embed.ts — batched embedding with retry
import OpenAI from 'openai';
const openai = new OpenAI();
const MODEL = 'text-embedding-3-large';
// 3-large emits 3072 dims by default; request 1536 to match vector(1536)
const DIMENSIONS = 1536;
const BATCH = 128;
export async function embedBatch(texts: string[]): Promise<number[][]> {
const out: number[][] = [];
for (let i = 0; i < texts.length; i += BATCH) {
const slice = texts.slice(i, i + BATCH);
const resp = await retryOn429(() =>
openai.embeddings.create({ model: MODEL, input: slice, dimensions: DIMENSIONS }),
);
for (const item of resp.data) out.push(item.embedding);
}
return out;
}
async function retryOn429<T>(fn: () => Promise<T>, max = 5): Promise<T> {
let attempt = 0;
while (true) {
try {
return await fn();
} catch (e: any) {
if (e?.status !== 429 || attempt >= max) throw e;
const wait = 2 ** attempt * 500 + Math.random() * 250;
await new Promise((r) => setTimeout(r, wait));
attempt++;
}
}
}

The upsert SQL pattern
The trick is making re-ingest idempotent. Identify documents by a stable source_url and content checksum; only re-chunk and re-embed when the checksum changes. On the chunks and embeddings side, use ON CONFLICT DO UPDATE so a re-ingest replaces stale rows in place rather than producing duplicates.
// lib/rag/upsert.ts — transactional ingest
import { createHash } from 'node:crypto';
import { sql } from '@/lib/db';

const sha256 = (s: string) => createHash('sha256').update(s).digest('hex');
export async function upsertDocument({
sourceUrl, title, mimeType, content, embeddings,
}: {
sourceUrl: string;
title: string;
mimeType: string;
content: { ord: number; text: string; tokens: number }[];
embeddings: number[][];
}) {
await sql.begin(async (tx) => {
const checksum = sha256(content.map((c) => c.text).join('\n'));
// No-op guard: skip the delete/re-insert when content is unchanged
// (in production, compare checksums BEFORE paying for embeddings)
const [existing] = await tx`
SELECT checksum FROM documents WHERE source_url = ${sourceUrl}
`;
if (existing?.checksum === checksum) return;
const [doc] = await tx`
INSERT INTO documents (source_url, title, mime_type, checksum)
VALUES (${sourceUrl}, ${title}, ${mimeType}, ${checksum})
ON CONFLICT (source_url) DO UPDATE
SET title = EXCLUDED.title,
checksum = EXCLUDED.checksum,
updated_at = now()
RETURNING id
`;
// Replace all chunks for this document — cascade clears old embeddings
await tx`DELETE FROM chunks WHERE document_id = ${doc.id}`;
for (let i = 0; i < content.length; i++) {
const c = content[i];
const [chunk] = await tx`
INSERT INTO chunks (document_id, ord, content, token_count)
VALUES (${doc.id}, ${c.ord}, ${c.text}, ${c.tokens})
RETURNING id
`;
await tx`
INSERT INTO embeddings (chunk_id, model, embedding)
VALUES (${chunk.id}, 'text-embedding-3-large', ${JSON.stringify(embeddings[i])}::vector)
`;
}
});
}

Skipping the checksum compare and re-embedding every document on every nightly sync is the most common cost leak we see. A 10k-document corpus at $0.13 per million tokens for text-embedding-3-large can easily turn into $50/night of wasted embedding spend if the no-op detection is missing.

06 — Retrieval: Cosine distance, hybrid search, re-ranking.
Retrieval is one SQL query — and then, often, two more on top for hybrid search and re-ranking. Start with the simplest form and only layer in complexity when measured quality demands it.
Pure vector search
-- The canonical kNN query: order by cosine distance, limit k
SELECT
c.id, c.content, c.metadata,
d.source_url, d.title,
(e.embedding <=> $1::vector) AS distance
FROM embeddings e
JOIN chunks c ON c.id = e.chunk_id
JOIN documents d ON d.id = c.document_id
WHERE e.model = 'text-embedding-3-large'
ORDER BY e.embedding <=> $1::vector
LIMIT 8;

The <=> operator computes cosine distance. Because the index was built with the vector_cosine_ops operator class, the planner can satisfy the ORDER BY ... LIMIT with the ANN index rather than a sequential scan. Always join back to documents in the same query so you have everything you need for source attribution in a single round trip.
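It's worth confirming once per environment that the planner actually takes the ANN path. Bind a real 1536-dim query vector for $1 (via PREPARE or your driver) and read the plan:

-- Expect an Index Scan using embeddings_hnsw_idx (or the ivfflat index);
-- a Seq Scan here means the ANN index is not being used
EXPLAIN ANALYZE
SELECT chunk_id FROM embeddings
WHERE model = 'text-embedding-3-large'
ORDER BY embedding <=> $1::vector
LIMIT 8;

One caveat: pgvector applies WHERE filters after the ANN scan, so a selective model filter can return fewer rows than the LIMIT. If that bites, a partial index scoped to the model (CREATE INDEX ... WHERE model = '...') restores full results.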
Hybrid: vector + BM25
Vector search excels at semantic intent and synonyms; BM25 (Postgres tsvector) excels at proper nouns, identifiers, and rare entities. Run both, then fuse the rankings with reciprocal rank fusion (RRF).
-- Add a generated tsvector column once (no trigger needed)
ALTER TABLE chunks ADD COLUMN tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX chunks_tsv_idx ON chunks USING GIN (tsv);
-- Hybrid query with RRF — k=60 is the canonical RRF constant
WITH vector_hits AS (
SELECT c.id, ROW_NUMBER() OVER (ORDER BY e.embedding <=> $1::vector) AS rk
FROM embeddings e JOIN chunks c ON c.id = e.chunk_id
WHERE e.model = 'text-embedding-3-large'
ORDER BY e.embedding <=> $1::vector LIMIT 40
),
keyword_hits AS (
SELECT c.id, ROW_NUMBER() OVER (ORDER BY ts_rank(c.tsv, query) DESC) AS rk
FROM chunks c, plainto_tsquery('english', $2) AS query
WHERE c.tsv @@ query
ORDER BY ts_rank(c.tsv, query) DESC LIMIT 40
)
SELECT c.id, c.content, d.source_url,
COALESCE(1.0 / (60 + v.rk), 0) + COALESCE(1.0 / (60 + k.rk), 0) AS score
FROM chunks c
JOIN documents d ON d.id = c.document_id
LEFT JOIN vector_hits v ON v.id = c.id
LEFT JOIN keyword_hits k ON k.id = c.id
WHERE v.id IS NOT NULL OR k.id IS NOT NULL
ORDER BY score DESC
LIMIT 8;

RRF sums 1/(k + rank) per signal, so a chunk found by only one signal still scores sensibly rather than being pushed to the bottom by a missing rank.

Re-ranking with a dedicated model
For the highest-quality retrieval, take the top 20-40 hybrid results and pass them through a re-ranker (Cohere Rerank, Voyage Rerank, or a local cross-encoder). Re-rankers score each (query, chunk) pair jointly rather than independently, and recover relevance that pure vector or BM25 alone would miss.
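A sketch of the re-rank hop against Cohere's hosted re-ranker — the endpoint URL, model name, and response shape follow Cohere's v1 rerank API as we understand it, so verify against current docs before shipping:

// lib/rag/rerank.ts — re-rank hybrid hits with Cohere Rerank (a sketch)
export async function rerank(
  query: string,
  chunks: { id: string; content: string }[],
  topN = 8,
): Promise<{ id: string; content: string }[]> {
  const resp = await fetch('https://api.cohere.ai/v1/rerank', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.COHERE_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'rerank-english-v3.0',
      query,
      documents: chunks.map((c) => c.content),
      top_n: topN,
    }),
  });
  if (!resp.ok) throw new Error(`rerank failed: ${resp.status}`);
  const { results } = await resp.json();
  // results: [{ index, relevance_score }], sorted by score descending
  return results.map((r: { index: number }) => chunks[r.index]);
}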
- Pure vector (ORDER BY <=> ... LIMIT k) — one SQL query, one round trip, ~12ms P50 at 100k chunks. The right default: every other mode adds cost and complexity that has to be justified by measured quality lift. Lowest latency.
- Hybrid RRF (vector + BM25 fused) — two CTEs, reciprocal rank fusion. ~25ms P50 at 100k chunks. Recovers proper-noun and identifier hits that pure vector misses. Production default for most knowledge-base chatbots.
- Hybrid + re-rank (RRF top-40 → re-ranker → top-8) — adds an 80-200ms cross-encoder call on top of hybrid. Highest recall at top-k. Use when answer correctness is critical (support, compliance) and the latency budget allows. Max recall.

One pattern that holds up across deployments: always retrieve more chunks than you plan to feed the model. Retrieve 20-40, then re-rank or truncate to the 6-10 the LLM actually sees. Over-retrieval at the SQL layer is cheap; under-retrieval means the right chunk never enters the candidate set, and no amount of downstream reasoning recovers from that.
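The generation section below imports retrieveHybrid from lib/rag/retrieve without showing it. A minimal version is just the hybrid RRF query wrapped in a function — a sketch assuming the postgres.js sql client from the setup section:

// lib/rag/retrieve.ts — wrapper around the hybrid RRF query above
import { sql } from '@/lib/db';

export type RetrievedChunk = {
  id: string; content: string; sourceUrl: string; title: string | null;
};

export async function retrieveHybrid(
  queryVec: number[],
  question: string,
  k = 8,
): Promise<RetrievedChunk[]> {
  const vec = JSON.stringify(queryVec);
  const rows = await sql`
    WITH vector_hits AS (
      SELECT c.id, ROW_NUMBER() OVER (ORDER BY e.embedding <=> ${vec}::vector) AS rk
      FROM embeddings e JOIN chunks c ON c.id = e.chunk_id
      WHERE e.model = 'text-embedding-3-large'
      ORDER BY e.embedding <=> ${vec}::vector LIMIT 40
    ),
    keyword_hits AS (
      SELECT c.id, ROW_NUMBER() OVER (ORDER BY ts_rank(c.tsv, query) DESC) AS rk
      FROM chunks c, plainto_tsquery('english', ${question}) AS query
      WHERE c.tsv @@ query
      ORDER BY ts_rank(c.tsv, query) DESC LIMIT 40
    )
    SELECT c.id, c.content, d.source_url AS "sourceUrl", d.title,
      COALESCE(1.0 / (60 + v.rk), 0) + COALESCE(1.0 / (60 + k.rk), 0) AS score
    FROM chunks c
    JOIN documents d ON d.id = c.document_id
    LEFT JOIN vector_hits v ON v.id = c.id
    LEFT JOIN keyword_hits k ON k.id = c.id
    WHERE v.id IS NOT NULL OR k.id IS NOT NULL
    ORDER BY score DESC
    LIMIT ${k}`;
  return rows as unknown as RetrievedChunk[];
}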
"Retrieval quality is the ceiling on answer quality. No model can cite a chunk it never saw."— A retrieval principle worth tattooing on the wall
07 — Generation: Streaming Claude Sonnet — with citations.
The generation layer turns retrieved chunks into a grounded answer. Three concerns: how you build the prompt, how you stream tokens to the client, and how you surface source attribution in the UI so users can verify what the model said. All three are AI SDK 6 + Next.js 16 boilerplate at this point.
The grounding prompt
The prompt template that holds up: a tight system instruction that requires citations and refuses to answer outside the retrieved context, followed by numbered chunks with stable identifiers, followed by the user question. Numbering the chunks lets the model reference them by index, which makes citation extraction trivial post-generation.
// lib/rag/prompt.ts
import type { RetrievedChunk } from './retrieve';

export function buildPrompt(question: string, chunks: RetrievedChunk[]) {
const context = chunks
.map((c, i) => `[${i + 1}] (${c.sourceUrl})\n${c.content}`)
.join('\n\n---\n\n');
const system = `You are a knowledge assistant. Answer ONLY using the numbered context below.
- Cite sources inline using bracketed numbers, e.g. [1] or [2,3].
- If the answer is not in the context, say so plainly. Do not guess.
- Keep answers concise and factual.`;
const user = `Context:\n${context}\n\nQuestion: ${question}`;
return { system, user };
}

Streaming endpoint with the AI SDK
// app/api/rag/route.ts
import { anthropic } from '@ai-sdk/anthropic';
import { streamText } from 'ai';
import { embedBatch } from '@/lib/rag/embed';
import { retrieveHybrid } from '@/lib/rag/retrieve';
import { buildPrompt } from '@/lib/rag/prompt';
export async function POST(req: Request) {
const { question } = await req.json();
// 1. Embed the question
const [queryVec] = await embedBatch([question]);
// 2. Retrieve top-8 hybrid hits
const chunks = await retrieveHybrid(queryVec, question, 8);
// 3. Build the grounded prompt and stream the answer
const { system, user } = buildPrompt(question, chunks);
const result = streamText({
model: anthropic('claude-sonnet-4-7'),
system,
prompt: user,
temperature: 0.2,
});
// 4. Stream tokens; attach citations via response headers / data stream
return result.toDataStreamResponse({
headers: {
'x-rag-sources': JSON.stringify(
chunks.map((c, i) => ({ idx: i + 1, url: c.sourceUrl, title: c.title })),
),
},
});
}

Source attribution in the UI
The citation-extraction pattern that holds up: parse [N] tokens out of the streamed text as they arrive, look up the matching chunk by index, and render a small footnote card next to the citation. Users hover or tap to see the source URL, title, and the exact chunk text the model grounded on. That hover-reveal turns "the AI said" into "the AI said because this passage says" — a much stronger trust signal than a generic source list at the bottom of the answer.
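A sketch of the extraction step — extractCitations and the module path are hypothetical names, and the regex assumes the bracket format the system prompt enforces:

// lib/rag/citations.ts — extract [N] citations from streamed text
export type Source = { idx: number; url: string; title: string | null };

export function extractCitations(text: string, sources: Source[]): Source[] {
  const seen = new Set<number>();
  // Matches [1], [2,3], [4, 5] — the format the system prompt requires
  for (const m of text.matchAll(/\[(\d+(?:\s*,\s*\d+)*)\]/g)) {
    for (const n of m[1].split(',')) seen.add(Number(n.trim()));
  }
  return sources.filter((s) => seen.has(s.idx));
}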
Two extensions worth knowing about. First, Claude's native citations feature (citations: enabled when passing document content) emits structured citation spans directly, removing the bracket-parsing step. Second, hallucination guards — refusing to answer when the top retrieved chunk's cosine distance exceeds a threshold (typically 0.4) — eliminate the "confidently wrong on out-of-corpus questions" failure mode that destroys user trust faster than anything else.
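The guard is a few lines in the route handler before streaming begins. This sketch assumes the retrieval helper also returns each chunk's cosine distance (the pure-vector query above selects it; the hybrid query would need the column added):

// In app/api/rag/route.ts, after retrieval and before streamText
const DISTANCE_THRESHOLD = 0.4; // tune per corpus and embedding model
if (chunks.length === 0 || chunks[0].distance > DISTANCE_THRESHOLD) {
  return Response.json({
    answer: "I can't find a grounded answer to that in the indexed sources.",
    sources: [],
  });
}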
If you're standing up the answer endpoint from scratch, our Next.js 16 + AI SDK chatbot tutorial walks through the streaming UI plumbing in more detail than this guide covers.
08 — Numbers: Latency, cost, and recall at 10k / 100k / 1M docs.
All numbers below come from a single Hetzner CX31 (4 vCPU / 8GB / NVMe) running Postgres 16 + pgvector 0.7, querying with HNSW (m=16, ef_search=100), 1536-dim embeddings, text-embedding-3-large for ingestion, Claude Sonnet 4.7 for generation. Test corpus is a synthetic mix of technical documentation chunks — 10k, 100k, and 1M chunks indexed at each tier.
Latency at three corpus sizes
Hetzner CX31 · pgvector 0.7 · HNSW (m=16, ef_search=100) · text-embedding-3-large · Claude Sonnet 4.7

Cost. At launch pricing in April 2026, text-embedding-3-large is approximately $0.13 per million tokens; Claude Sonnet 4.7 is roughly $3 / $15 per million tokens (input / output). A single RAG query costs (rough order): ~50 tokens to embed the question (~$0.000007), ~5,000 tokens of retrieved context fed to the model (~$0.015), ~300 output tokens (~$0.0045). Total: roughly $0.02 per query — about $20 per thousand — before prompt caching and context trimming pull the input cost down. That scales linearly until you add a re-ranker (Cohere Rerank adds ~$0.0002 per query) or move to a larger generation model.
Recall. On a 100k-chunk corpus with hand-labeled relevance judgments, HNSW (ef_search=100) recovered recall@10 of approximately 0.98 against exact-search ground truth; IVFFlat (probes=10) recovered approximately 0.94. Adding hybrid retrieval with BM25 RRF lifted recall@10 to roughly 0.99 on queries containing proper nouns or identifiers — the cases pure vector tends to miss.
Projecting forward. At 1M chunks the HNSW index occupies approximately 4-5GB of disk and ~3GB of shared-buffers cache for hot operation; latency stays under 100ms P95 on commodity hardware. Past 5-10M chunks, the random-access IO patterns of HNSW graph traversal start to interact poorly with page-cache pressure, and dedicated vector DBs become the cleaner answer. The threshold where switching makes sense depends as much on your write rate and query mix as on raw count — measure, don't guess.
Self-hosted RAG is a small operational lift in exchange for sovereignty, cost control, and a queryable knowledge surface.
The pipeline this guide walks through is genuinely small: three tables, one ANN index, a batched embedding job, two SQL queries (vector and hybrid), one streaming endpoint with citation parsing. That's the entire system. It runs on the Postgres you already operate. It serves 100k-1M-chunk corpora with P50 latencies under 50ms and per-query costs of roughly two cents. Most production RAG products do not need anything more than this.
What changes at 10x scale is mostly operational, not architectural. Index rebuilds need scheduling windows. The embedding model becomes a first-class versioning concern. Multi-tenant routing benefits from partitioning. Re-rankers start earning their keep on long-tail queries. None of those are reasons to leave Postgres; they're reasons to invest in the same Postgres instance more seriously — better instrumentation, dedicated read replicas for RAG traffic, tighter pool sizing. Most of the work at 10x is sharper measurement, not new infrastructure.
Concrete first-week milestones for a team adopting this pattern: day one, stand up Postgres + pgvector locally and run the schema migration. Day two, write the chunker and ingest 100 representative documents. Day three, build the retrieval query and validate recall@10 on a hand-labeled sample. Day four, wire the streaming endpoint and the citation UI. Day five, instrument latency / cost / recall dashboards. The whole system fits comfortably in a sprint and ships small, observable, and easy to evolve.