RAG & Search — Data Model (Postgres + pgvector)

This page defines recommended schema patterns for:

Keep schemas:

RAG knowledge store tables

Purpose: track a document/source and its indexing lifecycle.

Recommended columns:

Recommended constraints:

(scope_id, source_locator) unique (prevents duplicates per scope)
foreign keys use ON DELETE CASCADE so chunks are removed when a document is removed

Purpose: the retrieval unit for RAG (the rows you search over).

Recommended columns:

id (string/uuid): primary key
scope_id (string/uuid): same scope key (required)
document_id (fk): references rag_documents.id
chunk_number (int): stable ordering within a document
canonical_source (string): canonical URL or stable ID for citation
title (string): optional section/page title
summary (text): optional short summary
content (text): the chunk text used for retrieval and context
metadata (jsonb): lightweight enrichments (headings, tags, contextual prefixes, etc.)
embedding (vector(<dim>)): pgvector embedding
embedding_version (string): identifies how embeddings were produced (model + preprocessing version)
created_at, updated_at (timestamps)

Optional but recommended for hybrid retrieval:

search_vector (tsvector): derived from title/summary/content (+ selective metadata)

Recommended indexes:

btree(scope_id) and btree(document_id)
btree(scope_id, canonical_source) if you need per-source lookups
gin(search_vector) if using lexical search
a pgvector index (HNSW/IVFFlat) on embedding once the table is large enough to justify it

For entity tables (catalog items, FAQs, services, etc.), keep two derived search fields:

Define a deterministic function that produces the embedding input text and the FTS input text.

Rules:

Example pattern (pseudocode):

name: <name>
description: <description>
tags: <tag1> <tag2> <tag3>
guidance: <ai_guidance>

Write-path expectations:

on create: persist the entity row, then populate embedding and search_vector
on update: recompute only if a searchable field changed
on provider failure: keep the entity valid and mark embedding/indexing status for later backfill

Always scope rows by scope_id to prevent cross-tenant leakage.
Deletion of a source should remove dependent chunks via ON DELETE CASCADE.
Prefer soft-delete only when you need audit history; otherwise hard-delete is acceptable for derived retrieval stores.