RAG & Search — Data Model (Postgres + pgvector)
This page defines recommended schema patterns for:
- RAG knowledge stores (documents + chunks)
- Hybrid search on structured entities (embedding +
tsvector)
Keep schemas:
- scoped (tenant/org/project identifiers)
- versioned (embedding model/version metadata)
- indexable (pgvector + FTS indexes)
RAG knowledge store tables
rag_documents
Purpose: track a document/source and its indexing lifecycle.
Recommended columns:
id(string/uuid): primary keyscope_id(string/uuid): tenant/org/project scope key (required)source_type(enum):url,file,cms, etc.source_locator(string): canonical URL or stable external IDtitle(string): optionalstatus(enum):pending,processing,indexed,errorerror_message(text): optional, for last failurecreated_at,updated_at(timestamps)
Recommended constraints:
(scope_id, source_locator)unique (prevents duplicates per scope)- foreign keys use
ON DELETE CASCADEso chunks are removed when a document is removed
rag_chunks
Purpose: the retrieval unit for RAG (the rows you search over).
Recommended columns:
id(string/uuid): primary keyscope_id(string/uuid): same scope key (required)document_id(fk): referencesrag_documents.idchunk_number(int): stable ordering within a documentcanonical_source(string): canonical URL or stable ID for citationtitle(string): optional section/page titlesummary(text): optional short summarycontent(text): the chunk text used for retrieval and contextmetadata(jsonb): lightweight enrichments (headings, tags, contextual prefixes, etc.)embedding(vector(<dim>)): pgvector embeddingembedding_version(string): identifies how embeddings were produced (model + preprocessing version)created_at,updated_at(timestamps)
Optional but recommended for hybrid retrieval:
search_vector(tsvector): derived fromtitle/summary/content(+ selective metadata)
Recommended indexes:
btree(scope_id)andbtree(document_id)btree(scope_id, canonical_source)if you need per-source lookupsgin(search_vector)if using lexical search- a pgvector index (HNSW/IVFFlat) on
embeddingonce the table is large enough to justify it
Structured entity hybrid search
For entity tables (catalog items, FAQs, services, etc.), keep two derived search fields:
embedding vector(<dim>): semantic searchsearch_vector tsvector: lexical search
Standard: “search text” generation
Define a deterministic function that produces the embedding input text and the FTS input text.
Rules:
- include only fields that should influence search
- include labels for clarity (stable prefixes like
name:,description:) - keep ordering stable
- keep it deterministic (no timestamps, no random ordering)
Example pattern (pseudocode):
name: <name>
description: <description>
tags: <tag1> <tag2> <tag3>
guidance: <ai_guidance>
Standard: keeping derived fields up to date
Write-path expectations:
- on create: persist the entity row, then populate
embeddingandsearch_vector - on update: recompute only if a searchable field changed
- on provider failure: keep the entity valid and mark embedding/indexing status for later backfill
Data ownership and deletions
- Always scope rows by
scope_idto prevent cross-tenant leakage. - Deletion of a source should remove dependent chunks via
ON DELETE CASCADE. - Prefer soft-delete only when you need audit history; otherwise hard-delete is acceptable for derived retrieval stores.