RAG & Search — Ingestion & Indexing
This page defines the end-to-end pipeline for producing retrievable content:
- ingest → normalize → chunk → embed → index → serve
The goal is to make indexing repeatable, auditable, and safe.
Ingestion sources
Treat all inputs as untrusted and normalize early.
Common source types:
- URLs (web pages)
- files (PDF, DOCX, HTML, Markdown, TXT)
- CMS exports / knowledge bases
- internal database records (for entity hybrid search)
Canonical source identity
Every ingested item must have a stable identity per scope:
- scope_id (tenant/org/project)
- source_locator (canonical URL or stable external ID)
Rules:
- canonicalize URLs (strip tracking params, normalize scheme/host/path when safe)
- enforce maximum lengths on source strings
- never store raw HTML as retrieval content unless you have a clear rendering/sanitization policy
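A minimal sketch of the canonicalization and length rules above, using only the Python standard library; the tracking-parameter set and length cap are illustrative assumptions, not project constants:

```python
# Illustrative URL canonicalization; TRACKING_PARAMS and MAX_SOURCE_URL_LENGTH are assumed values.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_SOURCE_URL_LENGTH = 2048
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def canonicalize_url(raw_url: str) -> str:
    """Lowercase scheme/host, drop fragment and known tracking params."""
    if len(raw_url) > MAX_SOURCE_URL_LENGTH:
        raise ValueError("source URL exceeds maximum length")
    parts = urlsplit(raw_url.strip())
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
         if k not in TRACKING_PARAMS]
    )
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, ""))
```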
Normalization and safety
Sanitization and validation
- Apply the project-wide sanitization standards to every user-provided string input:
docs/backend/01_security/xss_prevention.md
- Validate and cap sizes:
- max URL length
- max content length per chunk
- max number of chunks per document (guardrail)
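For illustration, a sketch of these size guardrails; the specific limits are assumptions to tune per project:

```python
# Illustrative guardrails; the numbers are assumptions, not project constants.
MAX_URL_LENGTH = 2048
MAX_CHUNK_CHARS = 8_000
MAX_CHUNKS_PER_DOCUMENT = 2_000

def enforce_ingest_limits(url: str, chunks: list[str]) -> None:
    if len(url) > MAX_URL_LENGTH:
        raise ValueError("source URL exceeds maximum length")
    if len(chunks) > MAX_CHUNKS_PER_DOCUMENT:
        raise ValueError("document produced too many chunks")
    for number, text in enumerate(chunks, start=1):
        if len(text) > MAX_CHUNK_CHARS:
            raise ValueError(f"chunk {number} exceeds maximum content length")
```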
Text extraction
Standards:
- Extract to plain text or Markdown-like text.
- Strip scripts/iframes and other executable content.
- Keep headings when possible (they improve chunk quality and lexical search).
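One possible extraction sketch, assuming BeautifulSoup as the HTML parser; any extractor that strips executable content and preserves headings works:

```python
# Extraction sketch using BeautifulSoup (an assumed choice of library).
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Strip executable/embedded content before extracting text.
    for tag in soup(["script", "iframe", "style", "object", "embed"]):
        tag.decompose()
    lines = []
    for element in soup.find_all(["h1", "h2", "h3", "h4", "p", "li"]):
        text = element.get_text(" ", strip=True)
        if not text:
            continue
        # Keep headings, marked Markdown-style, to aid chunking and lexical search.
        lines.append(f"# {text}" if element.name.startswith("h") else text)
    return "\n\n".join(lines)
```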
Chunking strategy (RAG)
Chunking turns a long document into retrieval units.
Default guidance
- Chunk by semantic boundaries when possible (headings/sections/paragraphs).
- Use a token/character target with overlap as a fallback.
Recommended starting ranges (tune per domain):
- target chunk size: 400–1200 tokens (or equivalent chars)
- overlap: 50–150 tokens (enough to preserve continuity)
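A minimal character-based fallback chunker with overlap; the defaults are rough character equivalents of the token ranges above (assuming roughly 4 characters per token), and a boundary- or token-aware splitter is preferred when available:

```python
# Character-based fallback chunker with overlap; sizes are assumptions.
def chunk_text(text: str, target_chars: int = 3200, overlap_chars: int = 400) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        if current and len(current) + len(paragraph) + 2 > target_chars:
            chunks.append(current)
            current = current[-overlap_chars:]  # carry a tail forward to preserve continuity
        current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks
```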
Required fields per chunk
- chunk_number must be stable and start at 1
- store title (document title or section title) when available
- store canonical_source for citations (URL or stable ID)
- store metadata for optional enrichments (e.g., contextual_prefix, headings path)
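A sketch of the resulting per-chunk record; the dataclass shape and defaults are illustrative:

```python
# Per-chunk record sketch; field layout is an assumption.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Chunk:
    chunk_number: int                  # stable, starts at 1
    content: str
    canonical_source: str              # URL or stable ID used for citations
    title: str | None = None           # document or section title when available
    metadata: dict[str, Any] = field(default_factory=dict)  # e.g. contextual_prefix, headings path
```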
Embeddings generation
Standard utility contract
All embeddings should flow through a single utility interface, not ad-hoc provider calls.
Minimum contract:
- input: text: str
- output: List[float] (length = embedding dimension)
- behavior:
  - timeouts and retries
  - provider errors mapped to typed exceptions
  - structured logging (duration, success/failure, provider/model identifiers)
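A sketch of such a utility; the provider adapter, retry counts, and exception name are assumptions for illustration:

```python
# Single embedding utility sketch. _call_provider is a hypothetical adapter stub.
import logging
import time
from typing import List

logger = logging.getLogger(__name__)

class EmbeddingProviderError(Exception):
    """Typed wrapper for provider-side failures."""

def _call_provider(text: str, *, timeout_s: float) -> List[float]:
    raise NotImplementedError("placeholder for the real provider SDK call")

def generate_embedding(text: str, *, timeout_s: float = 10.0, retries: int = 2) -> List[float]:
    start = time.monotonic()
    last_error: Exception | None = None
    for attempt in range(retries + 1):
        try:
            vector = _call_provider(text, timeout_s=timeout_s)
            logger.info(
                "embedding generated",
                extra={"duration_s": time.monotonic() - start, "provider": "example", "model": "example-model"},
            )
            return vector
        except Exception as exc:  # map any provider error to the typed exception below
            last_error = exc
            logger.warning("embedding attempt %d failed: %s", attempt + 1, exc)
    logger.error("embedding failed", extra={"duration_s": time.monotonic() - start})
    raise EmbeddingProviderError(str(last_error)) from last_error
```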
Embedding version metadata
Store an embedding_version string (or structured columns) so you can:
- backfill only outdated rows
- support migrations across model versions or preprocessing changes
Example:
embedding_version = "<provider>:<model>:<preprocess_version>"
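For illustration, composing the version string and checking whether a stored row is outdated; the values are placeholders:

```python
# Version composition sketch; provider/model/preprocess values are placeholders.
EMBEDDING_PROVIDER = "example-provider"
EMBEDDING_MODEL = "example-model-v1"
PREPROCESS_VERSION = "p2"
EMBEDDING_VERSION = f"{EMBEDDING_PROVIDER}:{EMBEDDING_MODEL}:{PREPROCESS_VERSION}"

def is_outdated(row_embedding_version: str | None) -> bool:
    # Rows with a missing or different version need re-embedding during backfill.
    return row_embedding_version != EMBEDDING_VERSION
```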
Indexing flow and failure modes
Sync vs background
Default standard:
- Create/ingest is allowed to succeed even if embedding generation fails, as long as the row is marked for later backfill.
Choose one of these patterns (document which one the project uses):
- Synchronous indexing: block the request until embeddings and indexes are written.
- best when immediate searchability is required
- higher latency and higher blast radius for provider outages
- Asynchronous indexing (recommended): write raw rows first, then compute embeddings in a worker.
- resilient to provider outages
- requires an explicit status lifecycle (pending → processing → indexed|error)
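A sketch of an asynchronous worker with that lifecycle, reusing the embedding utility sketch above; fetch_pending_rows, save_row, and the row fields are hypothetical storage helpers:

```python
# Async indexing worker sketch; storage helpers and row attributes are assumptions.
def index_pending_rows(batch_size: int = 100) -> None:
    for row in fetch_pending_rows(limit=batch_size):   # rows with status == "pending"
        row.status = "processing"
        save_row(row)
        try:
            row.embedding = generate_embedding(row.search_text)
            row.embedding_version = EMBEDDING_VERSION
            row.status = "indexed"
            row.error_message = None
        except EmbeddingProviderError as exc:
            row.status = "error"                       # eligible for retry/backfill later
            row.error_message = str(exc)
        save_row(row)
```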
Derived fields
For hybrid search, keep derived fields updated:
- search_vector: derived from deterministic search text using to_tsvector(...)
- embedding: derived from the same (or compatible) search text
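For illustration, one way to refresh search_vector in PostgreSQL; the table and column names are assumptions:

```python
# Keeping search_vector in sync (PostgreSQL); documents/search_text are assumed names.
UPDATE_SEARCH_VECTOR_SQL = """
UPDATE documents
SET search_vector = to_tsvector('english', search_text)
WHERE id = %(id)s
"""
# The embedding column is refreshed from the same search_text by the indexing worker.
```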
Partial failure standards
- If embedding fails:
  - persist the row
  - set status to error (or pending) with an error message
  - ensure retrieval/search endpoints degrade gracefully (lexical-only where possible)
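A sketch of query-time degradation; hybrid_search and lexical_search are hypothetical helpers, and generate_embedding / EmbeddingProviderError come from the utility sketch above:

```python
# Graceful degradation sketch; search helpers are assumptions.
def search(scope_id: str, query: str) -> list[dict]:
    try:
        query_vector = generate_embedding(query)
        return hybrid_search(scope_id, query, query_vector)  # lexical + vector
    except EmbeddingProviderError:
        return lexical_search(scope_id, query)               # degrade to lexical-only
```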
Backfill readiness
Every indexing pipeline must be designed for backfills:
- idempotent processing per (scope_id, source_locator) or row ID
- explicit “needs indexing” criteria:
  - missing embedding
  - missing search_vector
  - outdated embedding_version
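For illustration, a “needs indexing” selection for backfills in PostgreSQL; the table and column names are assumptions:

```python
# Backfill selection sketch (PostgreSQL); documents and its columns are assumed names.
NEEDS_INDEXING_SQL = """
SELECT id
FROM documents
WHERE embedding IS NULL
   OR search_vector IS NULL
   OR embedding_version IS DISTINCT FROM %(current_embedding_version)s
"""
```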