RAG & Search — Ingestion & Indexing
This page defines the end-to-end pipeline for producing retrievable content:
- ingest → normalize → chunk → embed → index → serve
The goal is to make indexing repeatable, auditable, and safe.
Ingestion sources
Treat all inputs as untrusted and normalize early.
Common source types:
- URLs (web pages)
- files (PDF, DOCX, HTML, Markdown, TXT)
- CMS exports / knowledge bases
- internal database records (for entity hybrid search)
Canonical source identity
Every ingested item must have a stable identity per scope:
- scope_id (tenant/org/project)
- source_locator (canonical URL or stable external ID)
Rules:
- canonicalize URLs (strip tracking params, normalize scheme/host/path when safe)
- enforce maximum lengths on source strings
- never store raw HTML as retrieval content unless you have a clear rendering/sanitization policy
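A minimal sketch of the canonicalization and length rules above, using only the Python standard library; the tracking-parameter set and length cap are illustrative assumptions, not project constants:

```python
# Illustrative URL canonicalization; TRACKING_PARAMS and MAX_SOURCE_URL_LENGTH are assumed values.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_SOURCE_URL_LENGTH = 2048
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def canonicalize_url(raw_url: str) -> str:
    """Lowercase scheme/host, drop fragment and known tracking params."""
    if len(raw_url) > MAX_SOURCE_URL_LENGTH:
        raise ValueError("source URL exceeds maximum length")
    parts = urlsplit(raw_url.strip())
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
         if k not in TRACKING_PARAMS]
    )
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, ""))
```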
Normalization and safety
Sanitization and validation
- Apply the project-wide sanitization standards to every user-provided string input:
docs/backend/01_security/xss_prevention.md
- Validate and cap sizes:
- max URL length
- max content length per chunk
- max number of chunks per document (guardrail)
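For illustration, a sketch of these size guardrails; the specific limits are assumptions to tune per project:

```python
# Illustrative guardrails; the numbers are assumptions, not project constants.
MAX_URL_LENGTH = 2048
MAX_CHUNK_CHARS = 8_000
MAX_CHUNKS_PER_DOCUMENT = 2_000

def enforce_ingest_limits(url: str, chunks: list[str]) -> None:
    if len(url) > MAX_URL_LENGTH:
        raise ValueError("source URL exceeds maximum length")
    if len(chunks) > MAX_CHUNKS_PER_DOCUMENT:
        raise ValueError("document produced too many chunks")
    for number, text in enumerate(chunks, start=1):
        if len(text) > MAX_CHUNK_CHARS:
            raise ValueError(f"chunk {number} exceeds maximum content length")
```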
Text extraction
Standards:
- Extract to plain text or Markdown-like text.
- Strip scripts/iframes and other executable content.
- Keep headings when possible (they improve chunk quality and lexical search).
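One possible extraction sketch, assuming BeautifulSoup as the HTML parser; any extractor that strips executable content and preserves headings works:

```python
# Extraction sketch using BeautifulSoup (an assumed choice of library).
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Strip executable/embedded content before extracting text.
    for tag in soup(["script", "iframe", "style", "object", "embed"]):
        tag.decompose()
    lines = []
    for element in soup.find_all(["h1", "h2", "h3", "h4", "p", "li"]):
        text = element.get_text(" ", strip=True)
        if not text:
            continue
        # Keep headings, marked Markdown-style, to aid chunking and lexical search.
        lines.append(f"# {text}" if element.name.startswith("h") else text)
    return "\n\n".join(lines)
```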
Chunking strategy (RAG)
Chunking turns a long document into retrieval units.
Default guidance
- Chunk by semantic boundaries when possible (headings/sections/paragraphs).
- Use a token/character target with overlap as a fallback.
Recommended starting ranges (tune per domain):
- target chunk size: 400–1200 tokens (or equivalent chars)
- overlap: 50–150 tokens (enough to preserve continuity)
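A minimal character-based fallback chunker with overlap; the defaults are rough character equivalents of the token ranges above (assuming roughly 4 characters per token), and a boundary- or token-aware splitter is preferred when available:

```python
# Character-based fallback chunker with overlap; sizes are assumptions.
def chunk_text(text: str, target_chars: int = 3200, overlap_chars: int = 400) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        if current and len(current) + len(paragraph) + 2 > target_chars:
            chunks.append(current)
            current = current[-overlap_chars:]  # carry a tail forward to preserve continuity
        current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks
```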
Required fields per chunk
- chunk_number must be stable and start at 1
- store title (document title or section title) when available
- store canonical_source for citations (URL or stable ID)
- store metadata for optional enrichments (e.g., contextual_prefix, headings path)
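A sketch of the resulting per-chunk record; the dataclass shape and defaults are illustrative:

```python
# Per-chunk record sketch; field layout is an assumption.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Chunk:
    chunk_number: int                  # stable, starts at 1
    content: str
    canonical_source: str              # URL or stable ID used for citations
    title: str | None = None           # document or section title when available
    metadata: dict[str, Any] = field(default_factory=dict)  # e.g. contextual_prefix, headings path
```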
Embeddings generation
Standard utility contract
All embeddings should flow through a single utility interface, not ad-hoc provider calls.
Minimum contract:
- input: text: str
- output: List[float] (length = embedding dimension)
- behavior:
  - timeouts and retries
  - provider errors mapped to typed exceptions
  - structured logging (duration, success/failure, provider/model identifiers)
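A sketch of such a utility; the provider adapter, retry counts, and exception name are assumptions for illustration:

```python
# Single embedding utility sketch. _call_provider is a hypothetical adapter stub.
import logging
import time
from typing import List

logger = logging.getLogger(__name__)

class EmbeddingProviderError(Exception):
    """Typed wrapper for provider-side failures."""

def _call_provider(text: str, *, timeout_s: float) -> List[float]:
    raise NotImplementedError("placeholder for the real provider SDK call")

def generate_embedding(text: str, *, timeout_s: float = 10.0, retries: int = 2) -> List[float]:
    start = time.monotonic()
    last_error: Exception | None = None
    for attempt in range(retries + 1):
        try:
            vector = _call_provider(text, timeout_s=timeout_s)
            logger.info(
                "embedding generated",
                extra={"duration_s": time.monotonic() - start, "provider": "example", "model": "example-model"},
            )
            return vector
        except Exception as exc:  # map any provider error to the typed exception below
            last_error = exc
            logger.warning("embedding attempt %d failed: %s", attempt + 1, exc)
    logger.error("embedding failed", extra={"duration_s": time.monotonic() - start})
    raise EmbeddingProviderError(str(last_error)) from last_error
```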
Embedding version metadata
Store an embedding_version string (or structured columns) so you can:
- backfill only outdated rows
- support migrations across model versions or preprocessing changes
Example:
embedding_version = "<provider>:<model>:<preprocess_version>"
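For illustration, composing the version string and checking whether a stored row is outdated; the values are placeholders:

```python
# Version composition sketch; provider/model/preprocess values are placeholders.
EMBEDDING_PROVIDER = "example-provider"
EMBEDDING_MODEL = "example-model-v1"
PREPROCESS_VERSION = "p2"
EMBEDDING_VERSION = f"{EMBEDDING_PROVIDER}:{EMBEDDING_MODEL}:{PREPROCESS_VERSION}"

def is_outdated(row_embedding_version: str | None) -> bool:
    # Rows with a missing or different version need re-embedding during backfill.
    return row_embedding_version != EMBEDDING_VERSION
```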
Indexing flow and failure modes
Sync vs background
Default standard:
- Create/ingest is allowed to succeed even if embedding generation fails, as long as the row is marked for later backfill.
Choose one of these patterns (document which one the project uses):
- Synchronous indexing: block the request until embeddings and indexes are written.
- best when immediate searchability is required
- higher latency and higher blast radius for provider outages
- Asynchronous indexing (recommended): write raw rows first, then compute embeddings in a worker.
- resilient to provider outages
- requires an explicit status lifecycle (pending → processing → indexed|error)
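A sketch of an asynchronous worker with that lifecycle, reusing the embedding utility sketch above; fetch_pending_rows, save_row, and the row fields are hypothetical storage helpers:

```python
# Async indexing worker sketch; storage helpers and row attributes are assumptions.
def index_pending_rows(batch_size: int = 100) -> None:
    for row in fetch_pending_rows(limit=batch_size):   # rows with status == "pending"
        row.status = "processing"
        save_row(row)
        try:
            row.embedding = generate_embedding(row.search_text)
            row.embedding_version = EMBEDDING_VERSION
            row.status = "indexed"
            row.error_message = None
        except EmbeddingProviderError as exc:
            row.status = "error"                       # eligible for retry/backfill later
            row.error_message = str(exc)
        save_row(row)
```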
Derived fields
For hybrid search, keep derived fields updated:
- search_vector: derived from deterministic search text using to_tsvector(...)
- embedding: derived from the same (or compatible) search text
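For illustration, one way to refresh search_vector in PostgreSQL; the table and column names are assumptions:

```python
# Keeping search_vector in sync (PostgreSQL); documents/search_text are assumed names.
UPDATE_SEARCH_VECTOR_SQL = """
UPDATE documents
SET search_vector = to_tsvector('english', search_text)
WHERE id = %(id)s
"""
# The embedding column is refreshed from the same search_text by the indexing worker.
```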
Partial failure standards
- If embedding fails:
  - persist the row
  - set status to error (or pending) with an error message
  - ensure retrieval/search endpoints degrade gracefully (lexical-only where possible)
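A sketch of query-time degradation; hybrid_search and lexical_search are hypothetical helpers, and generate_embedding / EmbeddingProviderError come from the utility sketch above:

```python
# Graceful degradation sketch; search helpers are assumptions.
def search(scope_id: str, query: str) -> list[dict]:
    try:
        query_vector = generate_embedding(query)
        return hybrid_search(scope_id, query, query_vector)  # lexical + vector
    except EmbeddingProviderError:
        return lexical_search(scope_id, query)               # degrade to lexical-only
```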
Backfill readiness
Every indexing pipeline must be designed for backfills:
- idempotent processing per (scope_id, source_locator) or row ID
- explicit “needs indexing” criteria:
  - missing embedding
  - missing search_vector
  - outdated embedding_version
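For illustration, a “needs indexing” selection for backfills in PostgreSQL; the table and column names are assumptions:

```python
# Backfill selection sketch (PostgreSQL); documents and its columns are assumed names.
NEEDS_INDEXING_SQL = """
SELECT id
FROM documents
WHERE embedding IS NULL
   OR search_vector IS NULL
   OR embedding_version IS DISTINCT FROM %(current_embedding_version)s
"""
```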