RAG & Search — Operations
This page defines operational standards for keeping RAG/search reliable over time:
- backfills
- re-indexing
- batching and concurrency
- degradation modes during outages
- cost controls
Operational principles
- Idempotent jobs: every job must be safe to re-run without duplicating data or corrupting state.
- Scoped safety: jobs must support running for a single scope_id (tenant/org/project) or a small subset of IDs.
- Observable: every run must report counts, durations, and failures.
- Failure-tolerant: provider outages should not break core CRUD or cause cascading failures.
Backfills
Backfills exist because embeddings and indexes are derived and may become:
- missing (provider failure, partial indexing)
- stale (new embedding_version)
- invalid (bug fix in preprocessing)
What to backfill
Common targets:
- embedding IS NULL
- search_vector IS NULL
- embedding_version != <current_version>
- status in (pending, error)
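The targets above can be expressed as a single predicate. The following is a minimal sketch, not a prescribed implementation: the Row shape, the reason labels, and the CURRENT_EMBEDDING_VERSION value are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

CURRENT_EMBEDDING_VERSION = "v2"  # hypothetical current version label


@dataclass
class Row:
    # Hypothetical row shape; field names mirror the targets listed above.
    id: int
    embedding: Optional[List[float]]
    search_vector: Optional[str]
    embedding_version: Optional[str]
    status: str


def backfill_reasons(row: Row) -> List[str]:
    """Return every reason this row qualifies for backfill (empty if none)."""
    reasons = []
    if row.embedding is None:
        reasons.append("missing_embedding")
    if row.search_vector is None:
        reasons.append("missing_search_vector")
    # Note: a NULL version counts as stale here. In SQL, `NULL != 'v2'`
    # evaluates to NULL, so a query would need an explicit IS NULL check.
    if row.embedding_version != CURRENT_EMBEDDING_VERSION:
        reasons.append("stale_embedding_version")
    if row.status in ("pending", "error"):
        reasons.append("status_" + row.status)
    return reasons
```

Returning reasons rather than a boolean makes it easy to report why each row was selected in the job output.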
Backfill job contract
Required inputs:
- scope_id (optional but strongly recommended)
- batch_size
- max_concurrency
- dry_run mode (recommended)
Required outputs:
- counts: scanned, updated, skipped, failed
- timings: total duration and per-batch duration
- failure samples: top N errors with row IDs
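One way to make this contract concrete is a pair of typed structures plus a runner that fills in the report. This is a sketch under assumptions: BackfillConfig, BackfillReport, run_backfill, and the dict-shaped rows (with "id" and "scope_id" keys) are hypothetical names, and the loop is sequential, so max_concurrency is carried in the config but not used here.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class BackfillConfig:
    batch_size: int
    max_concurrency: int  # carried for the contract; unused in this sequential sketch
    scope_id: Optional[str] = None  # optional but strongly recommended
    dry_run: bool = False


@dataclass
class BackfillReport:
    scanned: int = 0
    updated: int = 0  # in dry_run mode: rows that would be updated
    skipped: int = 0
    failed: int = 0
    total_seconds: float = 0.0
    batch_seconds: List[float] = field(default_factory=list)
    failure_samples: List[Tuple[object, str]] = field(default_factory=list)


def run_backfill(
    rows: List[Dict],
    needs_update: Callable[[Dict], bool],
    update_row: Callable[[Dict], None],
    config: BackfillConfig,
    max_failure_samples: int = 5,
) -> BackfillReport:
    if config.scope_id is not None:
        rows = [r for r in rows if r["scope_id"] == config.scope_id]
    report = BackfillReport()
    started = time.perf_counter()
    for start in range(0, len(rows), config.batch_size):
        batch_started = time.perf_counter()
        for row in rows[start:start + config.batch_size]:
            report.scanned += 1
            if not needs_update(row):
                report.skipped += 1
                continue
            if config.dry_run:
                report.updated += 1  # count only; write nothing
                continue
            try:
                update_row(row)
                report.updated += 1
            except Exception as exc:
                report.failed += 1
                # Keep the first N failures with their row IDs.
                if len(report.failure_samples) < max_failure_samples:
                    report.failure_samples.append((row["id"], repr(exc)))
        report.batch_seconds.append(time.perf_counter() - batch_started)
    report.total_seconds = time.perf_counter() - started
    return report
```

A failure in one row is recorded and the run continues, so a single bad row does not block the rest of the scope.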
Idempotency standards
- Use stable IDs for rows.
- Avoid “insert new chunk rows every run” unless you have strong dedupe guarantees.
- Prefer UPSERT/update-in-place patterns.
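The sketch below illustrates stable IDs combined with an upsert, so that re-running a job updates rows in place instead of duplicating them. SQLite (3.24 or newer) stands in for the real database only to keep the example self-contained; the chunks table, its columns, and the chunk_id and upsert_chunk helpers are assumptions.

```python
import hashlib
import sqlite3


def chunk_id(document_id: str, chunk_index: int) -> str:
    """Stable ID: the same document and position always map to the same row."""
    key = f"{document_id}:{chunk_index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def upsert_chunk(conn: sqlite3.Connection, document_id: str,
                 chunk_index: int, text: str) -> None:
    # Insert the row, or update it in place if the stable ID already exists.
    conn.execute(
        """
        INSERT INTO chunks (id, document_id, chunk_index, text)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET text = excluded.text
        """,
        (chunk_id(document_id, chunk_index), document_id, chunk_index, text),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    "id TEXT PRIMARY KEY, document_id TEXT, chunk_index INTEGER, text TEXT)"
)
# Running the same write three times leaves exactly one row.
for _ in range(3):
    upsert_chunk(conn, "doc-1", 0, "hello world")
conn.commit()
```

Deriving the ID from (document_id, chunk_index) is one possible choice; any key that stays the same across runs gives the same guarantee.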
Re-indexing (FTS)
If search_vector is stored:
- define a deterministic “search text” generator
- compute search_vector = to_tsvector(<lang>, <search_text>)
Re-indexing standards:
- can be run independently of embedding backfills
- safe to run many times
- use batching to avoid large transactions
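A deterministic "search text" generator might look like the sketch below. The choice of fields (title, body, tags), the normalization rules, and the documents table in the SQL string are assumptions; the SQL is shown as a parameterized PostgreSQL statement and is not executed here.

```python
import re
from typing import Iterable, Optional

_WHITESPACE = re.compile(r"\s+")


def build_search_text(title: Optional[str], body: Optional[str],
                      tags: Iterable[str] = ()) -> str:
    """Build the text fed to to_tsvector; same inputs always give the same output."""
    parts = [title or "", body or ""]
    # Sort and dedupe tags so their input order never changes the result.
    parts.extend(sorted(set(tags)))
    joined = " ".join(p for p in parts if p)
    # Collapse all whitespace runs so formatting differences do not matter.
    return _WHITESPACE.sub(" ", joined).strip()


# Hypothetical per-row statement a re-index job could run (not executed here).
REINDEX_SQL = (
    "UPDATE documents "
    "SET search_vector = to_tsvector(%s::regconfig, %s) "
    "WHERE id = %s"
)
```

Because the generator is pure and order-insensitive, re-running the job on unchanged rows produces identical search_vector values.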
Embedding provider outages and degradation
Degradation modes
Define behavior for each layer:
- Create/update write-path:
  - resilient mode (recommended): accept the write and mark for backfill
  - strict mode: fail the write if search fields cannot be updated
- Read-path search:
  - if vector retrieval fails: lexical-only
  - if lexical retrieval fails: vector-only
  - if both fail: empty results with a safe message
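The read-path rules can be sketched as a small wrapper around two retrievers. The function name, the result shape, the mode labels, and the user-facing message are assumptions, and the merge step is a naive dedupe standing in for whatever fusion the real system uses.

```python
from typing import Callable, Dict, List, Optional

SAFE_MESSAGE = "Search is temporarily unavailable. Please try again shortly."


def degraded_search(
    query: str,
    vector_search: Callable[[str], List[str]],
    lexical_search: Callable[[str], List[str]],
) -> Dict[str, object]:
    vector_hits: Optional[List[str]] = None
    lexical_hits: Optional[List[str]] = None
    try:
        vector_hits = vector_search(query)
    except Exception:
        pass  # fall back to lexical-only below
    try:
        lexical_hits = lexical_search(query)
    except Exception:
        pass  # fall back to vector-only below

    if vector_hits is None and lexical_hits is None:
        # Both layers failed: empty results plus a message safe to show users.
        return {"mode": "unavailable", "results": [], "message": SAFE_MESSAGE}
    if vector_hits is None:
        return {"mode": "lexical_only", "results": lexical_hits, "message": None}
    if lexical_hits is None:
        return {"mode": "vector_only", "results": vector_hits, "message": None}
    # Placeholder merge: keep first-seen order and drop duplicate IDs.
    merged = list(dict.fromkeys(vector_hits + lexical_hits))
    return {"mode": "hybrid", "results": merged, "message": None}
```

Returning the mode alongside the results lets callers log how often each degraded path is taken.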
Circuit breaking and budgets
Standards:
- enforce a per-request budget for embedding calls (especially in AI tool contexts)
- implement timeouts and capped retries
- consider a circuit breaker to reduce repeated timeouts during provider incidents
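A circuit breaker for embedding calls can be quite small. The sketch below is one possible shape, assuming a consecutive-failure threshold and a fixed cool-down; the class name, parameters, and defaults are illustrative, not values this page mandates.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the provider."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                # Fail fast instead of waiting on another provider timeout.
                raise CircuitOpenError("embedding provider circuit is open")
            # Cool-down elapsed: allow a trial call; one failure reopens.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

The injectable clock exists only to make the breaker testable without real waits.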
Batching and concurrency
Batch sizing
Guidelines:
- choose batch_size to keep memory stable and avoid long transactions
- commit per batch (or smaller) so partial progress is preserved
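Commit-per-batch processing can be sketched as follows. The batched helper, the items table, and the status value are assumptions, and SQLite is used only so the example runs on its own.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items without loading everything."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def mark_indexed_in_batches(conn: sqlite3.Connection,
                            row_ids: Iterable[int], batch_size: int) -> int:
    """Update rows batch by batch; returns the number of committed batches."""
    committed = 0
    for batch in batched(row_ids, batch_size):
        conn.executemany(
            "UPDATE items SET status = 'indexed' WHERE id = ?",
            [(row_id,) for row_id in batch],
        )
        # Commit per batch so a later failure keeps earlier progress.
        conn.commit()
        committed += 1
    return committed
```

Each transaction touches at most batch_size rows, which bounds both memory use and lock duration.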
Concurrency
Standards:
- cap concurrent embedding calls (max_concurrency)
- use exponential backoff with jitter on provider errors
- ensure DB connections are not exhausted by background workers
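The concurrency cap and backoff can be combined in a few lines of asyncio. This is a sketch under assumptions: embed is a hypothetical async provider call, the retry count and delays are placeholder values, and "full jitter" is one common jitter variant chosen here for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, List, Sequence


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


async def embed_all(
    texts: Sequence[str],
    embed: Callable[[str], Awaitable[List[float]]],
    max_concurrency: int,
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> List[List[float]]:
    # The semaphore caps how many embedding calls are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def embed_one(text: str) -> List[float]:
        async with semaphore:
            for attempt in range(max_retries + 1):
                try:
                    return await embed(text)
                except Exception:
                    if attempt == max_retries:
                        raise  # retries are capped; surface the error
                    # The slot stays held while sleeping, which keeps
                    # pressure off a struggling provider.
                    await asyncio.sleep(backoff_delay(attempt, base=base_delay))
        raise RuntimeError("unreachable")

    # gather preserves input order in its result list.
    return await asyncio.gather(*(embed_one(t) for t in texts))
```

Jitter spreads retries out so that many workers failing at the same moment do not all retry at the same moment.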
Cost controls
Standards:
- cache query embeddings per request (and optionally across requests with a short TTL)
- prefer hybrid retrieval with smaller final_k rather than huge context payloads
- store summaries/contextual prefixes when they reduce the needed chunk count
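Query-embedding caching with a short TTL might be sketched like this. The class name, the default TTL, and the embed callable are assumptions; the cache is unbounded, so a real deployment would also need an eviction policy.

```python
import time
from typing import Callable, Dict, List, Tuple


class QueryEmbeddingCache:
    def __init__(self, embed: Callable[[str], List[float]],
                 embedding_version: str, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._embed = embed
        self._version = embedding_version
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, List[float]]] = {}

    def get(self, query: str) -> List[float]:
        # Key on the version too, so a version bump never serves old vectors.
        key = (self._version, query)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: no provider call, no cost
        vector = self._embed(query)
        self._entries[key] = (now, vector)
        return vector
```

Creating one cache per request gives the per-request behavior; sharing an instance across requests gives the short-TTL variant.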
Monitoring and alerting (minimum viable)
Track and alert on:
- embedding error rate (by provider + embedding_version)
- indexing lag (pending/error counts)
- retrieval latency percentiles
- DB query latency for vector and FTS queries