RAG & Search — Operations

This page defines operational standards for keeping RAG/search reliable over time:

backfills
re-indexing
batching and concurrency
degradation modes during outages
cost controls

Operational principles

Idempotent jobs: every job must be safe to re-run without duplicating data or corrupting state.
Scoped safety: jobs must support running for a single scope_id (tenant/org/project) or a small subset of IDs.
Observable: every run must report counts, durations, and failures.
Failure-tolerant: provider outages should not break core CRUD or cause cascading failures.

Backfills

Backfills exist because embeddings and indexes are derived data and can become:

missing (provider failure, partial indexing)
stale (new embedding_version)
invalid (bug fix in preprocessing)

What to backfill

Common targets:

embedding IS NULL
search_vector IS NULL
embedding_version != <current_version>
status IN ('pending', 'error')
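
One way to combine the targets above into a single selection query is sketched below. The table name documents, the helper build_backfill_query, and the psycopg-style %(name)s placeholders are assumptions for illustration; only the column names come from the list above.

```python
def build_backfill_query(current_version, scope_id=None, batch_size=500):
    """Return (sql, params) selecting IDs of rows that need a backfill.

    Hypothetical helper: `documents` and the placeholder style are assumed.
    """
    conditions = [
        "embedding IS NULL",
        "search_vector IS NULL",
        # `<>` does not match rows whose embedding_version is NULL;
        # handle those separately if your schema allows that state.
        "embedding_version <> %(current_version)s",
        "status IN ('pending', 'error')",
    ]
    where = "(" + " OR ".join(conditions) + ")"
    params = {"current_version": current_version, "batch_size": batch_size}

    # Scoping to one tenant/org/project keeps the blast radius small.
    if scope_id is not None:
        where += " AND scope_id = %(scope_id)s"
        params["scope_id"] = scope_id

    sql = (
        f"SELECT id FROM documents WHERE {where} "
        "ORDER BY id LIMIT %(batch_size)s"
    )
    return sql, params
```

Ordering by id and limiting to batch_size means each run picks up a bounded, repeatable slice of the remaining work.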

Backfill job contract

Inputs (required unless noted otherwise):

scope_id (optional but strongly recommended)
batch_size
max_concurrency
dry_run mode (recommended)

Required outputs:

counts: scanned, updated, skipped, failed
timings: total duration and per-batch duration
failure samples: top N errors with row IDs
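
The contract above could be modelled roughly as follows. This is a minimal sequential sketch: BackfillConfig, BackfillReport, run_backfill, and the process_row callback are hypothetical names, max_concurrency is carried in the config but not used here, and failure samples are simply the first N errors seen.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BackfillConfig:
    batch_size: int
    max_concurrency: int
    scope_id: Optional[str] = None  # optional, but strongly recommended
    dry_run: bool = False


@dataclass
class BackfillReport:
    scanned: int = 0
    updated: int = 0
    skipped: int = 0
    failed: int = 0
    total_seconds: float = 0.0
    batch_seconds: list = field(default_factory=list)
    failure_samples: list = field(default_factory=list)  # (row_id, message)


def run_backfill(config, batches, process_row, max_failure_samples=10):
    """Process batches of row IDs and return a BackfillReport.

    process_row(row_id) returns True if the row was updated, False if it
    was skipped, and raises on failure.
    """
    report = BackfillReport()
    started = time.monotonic()
    for batch in batches:
        batch_started = time.monotonic()
        for row_id in batch:
            report.scanned += 1
            if config.dry_run:
                # Dry run: count the row but never touch it.
                report.skipped += 1
                continue
            try:
                if process_row(row_id):
                    report.updated += 1
                else:
                    report.skipped += 1
            except Exception as exc:
                report.failed += 1
                if len(report.failure_samples) < max_failure_samples:
                    report.failure_samples.append((row_id, str(exc)))
        report.batch_seconds.append(time.monotonic() - batch_started)
    report.total_seconds = time.monotonic() - started
    return report
```

A failing row is recorded and the run continues, so one bad record does not block the rest of the batch.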

Idempotency standards

Use stable IDs for rows.
Avoid “insert new chunk rows every run” unless you have strong dedupe guarantees.
Prefer UPSERT/update-in-place patterns.
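
As an illustration of the update-in-place pattern, the sketch below uses SQLite (standard library) so it runs anywhere; the chunks table, the upsert_chunk helper, and the "doc_id:chunk_index" ID scheme are assumptions, not a prescribed schema. The same ON CONFLICT ... DO UPDATE form exists in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE chunks (
           chunk_id TEXT PRIMARY KEY,
           content TEXT NOT NULL,
           embedding_version TEXT NOT NULL
       )"""
)


def upsert_chunk(conn, doc_id, chunk_index, content, embedding_version):
    """Insert or update one chunk under a stable, derived ID."""
    # Stable ID: the same document position always maps to the same row.
    chunk_id = f"{doc_id}:{chunk_index}"
    # UPSERT syntax requires SQLite 3.24 or newer.
    conn.execute(
        """INSERT INTO chunks (chunk_id, content, embedding_version)
           VALUES (?, ?, ?)
           ON CONFLICT(chunk_id) DO UPDATE SET
               content = excluded.content,
               embedding_version = excluded.embedding_version""",
        (chunk_id, content, embedding_version),
    )
    conn.commit()
    return chunk_id


# Re-running the same job updates the row instead of duplicating it.
upsert_chunk(conn, "doc1", 0, "hello", "v1")
upsert_chunk(conn, "doc1", 0, "hello again", "v2")
```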

Re-indexing (FTS)

If search_vector is stored:

define a deterministic “search text” generator
compute search_vector = to_tsvector(<lang>, <search_text>)

Re-indexing standards:

can be run independently of embedding backfills
safe to run many times
use batching to avoid large transactions
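
A deterministic search-text generator might look like the sketch below. The fields title, body, and tags, the table name documents, and the helper name build_search_text are assumptions; the point is that the same row always yields the same text, so re-running the re-index is a no-op for unchanged rows.

```python
import re


def build_search_text(title, body, tags):
    """Build the text that feeds to_tsvector, deterministically."""
    # Sort tags so their storage order cannot change the output.
    parts = [title or "", body or "", " ".join(sorted(tags or []))]
    text = " ".join(p for p in parts if p)
    # Collapse whitespace so formatting-only edits do not change the text.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical per-row statement (PostgreSQL, psycopg-style placeholders).
REINDEX_SQL = (
    "UPDATE documents "
    "SET search_vector = to_tsvector(%(lang)s::regconfig, %(search_text)s) "
    "WHERE id = %(id)s"
)
```

Running this statement in batches, with a commit after each batch, keeps transactions short.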

Embedding provider outages and degradation

Degradation modes

Define behavior for each layer:

Create/update write-path:
- resilient mode (recommended): accept the write and mark for backfill
- strict mode: fail the write if search fields cannot be updated
Read-path search:
- if vector retrieval fails: lexical-only
- if lexical retrieval fails: vector-only
- if both fail: return empty results with a safe, user-facing message (no internal error details)
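
The read-path fallbacks can be sketched as below. The function hybrid_search, the two retriever callbacks, the result shape, and the order-preserving merge are all assumptions for illustration; a real implementation would log each failure and likely use a proper rank-fusion step.

```python
def hybrid_search(query, vector_search, lexical_search):
    """Run both retrievers and degrade gracefully if either fails."""
    vector_hits = None
    lexical_hits = None
    try:
        vector_hits = vector_search(query)
    except Exception:
        pass  # in real code: log and emit a metric
    try:
        lexical_hits = lexical_search(query)
    except Exception:
        pass  # in real code: log and emit a metric

    if vector_hits is None and lexical_hits is None:
        return {
            "mode": "unavailable",
            "results": [],
            "message": "Search is temporarily unavailable.",
        }
    if vector_hits is None:
        return {"mode": "lexical_only", "results": lexical_hits, "message": None}
    if lexical_hits is None:
        return {"mode": "vector_only", "results": vector_hits, "message": None}

    # Both succeeded: naive merge that keeps first-seen order and dedupes.
    merged = list(dict.fromkeys(vector_hits + lexical_hits))
    return {"mode": "hybrid", "results": merged, "message": None}
```

Returning the mode alongside the results makes degraded responses visible to callers and dashboards.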

Circuit breaking and budgets

Standards:

enforce a per-request budget for embedding calls (especially in AI tool contexts)
implement timeouts and capped retries
consider a circuit breaker to reduce repeated timeouts during provider incidents
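
A minimal circuit breaker, assuming a consecutive-failure threshold and a fixed cool-down, could look like this. CircuitBreaker, CircuitOpenError, and the default values are hypothetical; the injectable clock exists only to make the behaviour testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the provider."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_seconds:
                # Fail fast instead of waiting for another timeout.
                raise CircuitOpenError("embedding provider circuit is open")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers can treat CircuitOpenError like any other provider failure and fall back to the degraded mode.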

Batching and concurrency

Batch sizing

Guidelines:

choose batch_size to keep memory stable and avoid long transactions
commit after each batch (or more often) so partial progress is preserved if the job fails
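
The commit-per-batch guideline can be illustrated with SQLite from the standard library. The items table, its status column values, and the helpers batched and mark_indexed are assumptions made for the example.

```python
import sqlite3
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def mark_indexed(conn, row_ids, batch_size):
    """Update rows in batches, committing after each batch."""
    committed_batches = 0
    for batch in batched(row_ids, batch_size):
        conn.executemany(
            "UPDATE items SET status = 'indexed' WHERE id = ?",
            [(row_id,) for row_id in batch],
        )
        # Short transaction: a crash loses at most the current batch.
        conn.commit()
        committed_batches += 1
    return committed_batches
```

Because batched only materialises one batch at a time, memory use stays flat even when row_ids is a large generator.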

Concurrency

Standards:

cap concurrent embedding calls (max_concurrency)
use exponential backoff with jitter on provider errors
ensure DB connections are not exhausted by background workers
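
The first two standards can be sketched with asyncio: a semaphore caps in-flight calls and retries wait with exponential backoff and full jitter. The names backoff_delay, embed_with_retry, embed_all, and the embed coroutine are hypothetical, and the defaults are placeholders rather than recommended values. Database connection limits are not covered by this sketch.

```python
import asyncio
import random


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


async def embed_with_retry(embed, text, semaphore, max_retries=3,
                           base=0.5, cap=30.0):
    """Call embed(text) under the semaphore, retrying on any error."""
    for attempt in range(max_retries + 1):
        try:
            async with semaphore:
                return await embed(text)
        except Exception:
            if attempt == max_retries:
                raise
        # Sleep outside the semaphore so the slot is free during backoff.
        await asyncio.sleep(backoff_delay(attempt, base, cap))


async def embed_all(embed, texts, max_concurrency):
    """Embed all texts with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = (embed_with_retry(embed, t, semaphore) for t in texts)
    return await asyncio.gather(*tasks)
```

Jitter spreads retries out so that many workers recovering from the same incident do not hit the provider at the same instant.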

Cost controls

Standards:

cache query embeddings within a request (and optionally across requests with a short TTL)
prefer hybrid retrieval with a smaller final_k over sending huge context payloads
store summaries/contextual prefixes when they reduce the needed chunk count
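
A short-TTL cache for query embeddings could be as small as the sketch below. QueryEmbeddingCache, its embed callback, and the whitespace-only normalisation are assumptions; keying on embedding_version keeps vectors from different model versions apart. The sketch has no size limit or eviction, which a long-lived process would need.

```python
import time


class QueryEmbeddingCache:
    def __init__(self, embed, ttl_seconds=60.0, clock=time.monotonic):
        self._embed = embed
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, vector)

    def get(self, query, embedding_version):
        """Return a cached vector, or embed the query and cache the result."""
        # Normalise whitespace only; case is left untouched.
        key = (embedding_version, " ".join(query.split()))
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        vector = self._embed(query)
        self._entries[key] = (now, vector)
        return vector
```

Creating one instance per request gives per-request caching; sharing an instance across requests gives the optional short-TTL behaviour.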

Monitoring and alerting (minimum viable)

Track and alert on:

embedding error rate (by provider and embedding_version)
indexing lag (counts of rows with status pending or error)
retrieval latency percentiles
DB query latency for vector and FTS queries