Background Workers

This document defines generic engineering standards for background workers in backend services.

Workers are long-running async loops that perform maintenance, monitoring, and scheduled processing outside the request/response path.

Architecture

Workers run as asyncio tasks inside the application runtime and are started/stopped through FastAPI lifespan. Keep the architecture simple-by-default; only split into separate processes/services when required by scale or isolation.

Lifespan Events: Workers are started/stopped via FastAPI lifespan events (a dedicated lifespan.py is recommended).

import asyncio
from contextlib import asynccontextmanager, suppress

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: start one or more worker loops
    # (worker_loop_a / worker_loop_b are the worker coroutines defined elsewhere)
    tasks = [
        asyncio.create_task(worker_loop_a()),
        asyncio.create_task(worker_loop_b()),
    ]
    yield
    # Shutdown: cancel tasks gracefully
    for t in tasks:
        t.cancel()
    for t in tasks:
        with suppress(asyncio.CancelledError):
            await t


app = FastAPI(lifespan=lifespan)

Common worker categories

Workers should fall into clear categories:

  • Scheduled monitors: periodic scans and health checks (e.g., security scans, DB health).
  • Reconciliation/sync loops: keep derived state consistent (e.g., periodic reconciliation jobs).
  • Queue processors: claim pending work items from the database and process them safely.
  • Maintenance: periodic cleanup or aggregation tasks.
  • Operational reporting: periodic suite/report generation where appropriate.

Concurrency Control

Background workers must be safe under multiple app processes (e.g., multiple Uvicorn workers). Use layered concurrency controls:

  • Startup single-run lock (process-level): ensure only one process starts the worker tasks.
    • A common pattern is a file lock in the OS temp directory created with O_EXCL so only one process “wins” (see the sketch after this list).
  • Per-loop single-run lock (database-level): ensure only one process executes a cycle at a time.
    • A common pattern is a PostgreSQL advisory lock (pg_try_advisory_lock) held for the duration of the cycle.
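
A minimal sketch of the startup file lock, assuming POSIX-style O_EXCL semantics; the lock path and function name are illustrative:

import os
import tempfile

# Illustrative lock path; use a name unique to this app's worker set.
_LOCK_PATH = os.path.join(tempfile.gettempdir(), "myapp_workers.lock")


def try_acquire_startup_lock() -> bool:
    """Return True if this process should start the worker tasks."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists,
        # so exactly one process "wins" and starts the workers.
        fd = os.open(_LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return True
    except FileExistsError:
        return False

The winning process should remove the file on shutdown (or treat a stale PID as releasable at startup), otherwise a crashed process can block later restarts.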

Queue/claiming pattern (safe parallelism)

When processing “pending items” from the database:

  • Prefer SELECT ... FOR UPDATE SKIP LOCKED to claim work without contention (see the claiming sketch after this list).
  • If items must be processed sequentially per tenant/resource, add a second lock keyed by that id (e.g., transaction-scoped advisory locks).
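
A sketch of the claiming step using SQLAlchemy async sessions; the work_items table, its columns, and the batch size are assumptions:

from sqlalchemy import text

# Assumed table/columns for illustration: work_items(id, status, created_at).
CLAIM_SQL = text("""
    UPDATE work_items
    SET status = 'processing'
    WHERE id IN (
        SELECT id FROM work_items
        WHERE status = 'pending'
        ORDER BY created_at
        LIMIT :batch_size
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id
""")


async def claim_batch(session, batch_size: int = 10) -> list[int]:
    # Rows already locked by another process are skipped instead of
    # blocking, so concurrent app processes claim disjoint batches.
    result = await session.execute(CLAIM_SQL, {"batch_size": batch_size})
    claimed_ids = [row.id for row in result]
    await session.commit()
    return claimed_ids

For per-tenant sequencing, take a transaction-scoped advisory lock (pg_try_advisory_xact_lock) keyed by the tenant/resource id inside the transaction that processes that tenant's items.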

Configuration

Workers should be enabled/disabled via environment variables. Standards:

  • Each worker loop should check its flag on each cycle so it can be disabled without redeploying.
  • Defaults may be environment-aware (e.g., disable some workers in local/test by default).

Example pattern:

ENABLE_<WORKER_NAME>_WORKER=1
<WORKER_NAME>_INTERVAL_SECONDS=300
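
A sketch of reading these values fresh on each cycle; the helper names are illustrative, and this assumes the settings source (env vars or a settings object backed by them) can be re-read while the process runs:

import os


def worker_enabled(name: str) -> bool:
    # Checked on every cycle so a worker can be turned off without a redeploy.
    return os.getenv(f"ENABLE_{name}_WORKER", "0") == "1"


def worker_interval(name: str, default: int = 300) -> int:
    return int(os.getenv(f"{name}_INTERVAL_SECONDS", str(default)))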

Loop structure standards

Every worker loop should follow a predictable structure (a skeleton is sketched after this list):

  • Initialize required resources (e.g., ensure DB engine/session factory exists).
  • Optional jitter on startup (small randomized delay) to avoid thundering-herd starts.
  • while True:
    • Check enable flag; exit if disabled.
    • Acquire per-loop lock (if used).
    • Execute one cycle of work inside a fresh DB session.
    • Release lock.
    • Sleep for the configured interval (plus optional jitter).
  • Catch and log loop-level exceptions; sleep briefly and continue (avoid tight crash loops).
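
Putting these standards together, a hedged skeleton of a single loop; worker_enabled (sketched under Configuration above), run_one_cycle, the session factory, and the lock key are placeholders. It uses the transaction-scoped pg_try_advisory_xact_lock so the per-loop lock is released when the cycle's transaction commits; the session-level pg_try_advisory_lock also works if it is acquired and released on the same connection.

import asyncio
import logging
import random

from sqlalchemy import text

logger = logging.getLogger("workers.example")

EXAMPLE_LOCK_KEY = 420001  # placeholder: a stable integer key per worker


async def example_worker_loop(session_factory, interval_seconds: int = 300):
    # Small randomized startup delay to avoid thundering-herd starts.
    await asyncio.sleep(random.uniform(0, 5))
    logger.info("example worker loop started")

    while True:
        try:
            # Re-check the flag each cycle; exit if disabled
            # (or sleep and `continue` if re-enabling without a restart matters).
            if not worker_enabled("EXAMPLE"):
                logger.info("example worker disabled; exiting loop")
                return

            async with session_factory() as session:
                async with session.begin():
                    # Per-loop lock: only one process runs a cycle at a time.
                    result = await session.execute(
                        text("SELECT pg_try_advisory_xact_lock(:key)"),
                        {"key": EXAMPLE_LOCK_KEY},
                    )
                    if result.scalar():
                        await run_one_cycle(session)  # one cycle of work
                # The transaction commits here, which also releases the lock.
        except asyncio.CancelledError:
            raise  # let lifespan shutdown cancel the task cleanly
        except Exception:
            logger.exception("example worker cycle failed")

        # Sleep between cycles, plus jitter so processes drift apart.
        await asyncio.sleep(interval_seconds + random.uniform(0, 10))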

Database session standards

  • Use a dedicated worker DB session/engine (separate from request sessions) so long-running loops do not interfere with request handling (see the sketch below).
  • Keep each cycle’s DB session scope tight (open → work → commit/close).
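
A sketch of a dedicated engine and session factory for workers, assuming SQLAlchemy async with asyncpg; the DSN and pool sizes are illustrative:

from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

# A separate small pool so long worker cycles never compete with
# request handlers for the request engine's connections.
worker_engine = create_async_engine(
    "postgresql+asyncpg://app:app@localhost/app",  # illustrative DSN
    pool_size=2,
    max_overflow=0,
    pool_pre_ping=True,
)

worker_session_factory = async_sessionmaker(worker_engine, expire_on_commit=False)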

Observability + safety

  • Log lifecycle events: loop started, lock acquired/released, cycle complete, cycle error.
  • Prefer structured counts/summaries (e.g., scanned, processed, errors) so operators can see impact; see the example below.
  • Never let a worker crash the app: failures should be isolated to the loop and retried safely.
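
For example, a per-cycle summary logged in one place (field names are illustrative):

import logging

logger = logging.getLogger("workers.example")


def log_cycle_summary(scanned: int, processed: int, errors: int) -> None:
    # One key=value line per cycle makes impact easy to scan and graph.
    logger.info(
        "cycle complete scanned=%d processed=%d errors=%d",
        scanned, processed, errors,
    )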