flowCreate.solutions

RAG & Search — Data Model (Postgres + pgvector)

This page defines recommended schema patterns for:

  • RAG knowledge stores (documents + chunks)
  • Hybrid search on structured entities (embedding + tsvector)

Keep schemas:

  • scoped (tenant/org/project identifiers)
  • versioned (embedding model/version metadata)
  • indexable (pgvector + FTS indexes)

RAG knowledge store tables

rag_documents

Purpose: track a document/source and its indexing lifecycle.

Recommended columns:

  • id (string/uuid): primary key
  • scope_id (string/uuid): tenant/org/project scope key (required)
  • source_type (enum): url, file, cms, etc.
  • source_locator (string): canonical URL or stable external ID
  • title (string): optional
  • status (enum): pending, processing, indexed, error
  • error_message (text): optional, for last failure
  • created_at, updated_at (timestamps)

Recommended constraints:

  • (scope_id, source_locator) unique (prevents duplicates per scope)
  • foreign keys use ON DELETE CASCADE so chunks are removed when a document is removed

rag_chunks

Purpose: the retrieval unit for RAG (the rows you search over).

Recommended columns:

  • id (string/uuid): primary key
  • scope_id (string/uuid): same scope key (required)
  • document_id (fk): references rag_documents.id
  • chunk_number (int): stable ordering within a document
  • canonical_source (string): canonical URL or stable ID for citation
  • title (string): optional section/page title
  • summary (text): optional short summary
  • content (text): the chunk text used for retrieval and context
  • metadata (jsonb): lightweight enrichments (headings, tags, contextual prefixes, etc.)
  • embedding (vector(<dim>)): pgvector embedding
  • embedding_version (string): identifies how embeddings were produced (model + preprocessing version)
  • created_at, updated_at (timestamps)

Optional but recommended for hybrid retrieval:

  • search_vector (tsvector): derived from title/summary/content (+ selective metadata)

Recommended indexes:

  • btree(scope_id) and btree(document_id)
  • btree(scope_id, canonical_source) if you need per-source lookups
  • gin(search_vector) if using lexical search
  • a pgvector index (HNSW/IVFFlat) on embedding once the table is large enough to justify it

For entity tables (catalog items, FAQs, services, etc.), keep two derived search fields:

  • embedding vector(<dim>): semantic search
  • search_vector tsvector: lexical search

Standard: “search text” generation

Define a deterministic function that produces the embedding input text and the FTS input text.

Rules:

  • include only fields that should influence search
  • include labels for clarity (stable prefixes like name:, description:)
  • keep ordering stable
  • keep it deterministic (no timestamps, no random ordering)

Example pattern (pseudocode):

name: <name>
description: <description>
tags: <tag1> <tag2> <tag3>
guidance: <ai_guidance>

Standard: keeping derived fields up to date

Write-path expectations:

  • on create: persist the entity row, then populate embedding and search_vector
  • on update: recompute only if a searchable field changed
  • on provider failure: keep the entity valid and mark embedding/indexing status for later backfill

Data ownership and deletions

  • Always scope rows by scope_id to prevent cross-tenant leakage.
  • Deletion of a source should remove dependent chunks via ON DELETE CASCADE.
  • Prefer soft-delete only when you need audit history; otherwise hard-delete is acceptable for derived retrieval stores.