Runbooks / Operations Documentation Standard

This document defines the minimum operational documentation expected for any module/system that runs in production.

Why this matters

Operational docs are how we reduce:

Mean time to detect (MTTD)
Mean time to recover (MTTR)
Institutional knowledge trapped in a few people

Where runbooks live

Multi-repo: docs/operations.md + (optionally) docs/runbooks/
Monorepo:
- System-level: docs/operations.md + (optionally) docs/runbooks/
- Module-level: <module>/docs/operations.md + (optionally) <module>/docs/runbooks/
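
The layout above can be checked mechanically. The following is a minimal sketch, not an official tool: the function names and the `modules` parameter are hypothetical, and it only looks for the required `operations.md` files, since `docs/runbooks/` is optional.

```python
from pathlib import Path


def expected_operations_docs(repo_root, modules=()):
    """Return the operations.md paths this standard expects.

    `modules` lists module directory names in a monorepo (hypothetical
    parameter); leave it empty for a repository that holds one system.
    """
    root = Path(repo_root)
    paths = [root / "docs" / "operations.md"]
    paths += [root / module / "docs" / "operations.md" for module in modules]
    return paths


def missing_operations_docs(repo_root, modules=()):
    """Return the expected operations.md files that do not exist yet."""
    return [
        path
        for path in expected_operations_docs(repo_root, modules)
        if not path.is_file()
    ]
```

A check like this could run in CI and fail the build when the returned list is non-empty.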

Minimum required content (per module/system)

Health and readiness

How to determine whether the service is healthy
Key health endpoints (if applicable)
Dependencies required for “healthy” (DB, queue, external services)
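
One way to make "healthy" concrete is to define it as "every required dependency check passes". The sketch below assumes each dependency can be probed by a zero-argument function that returns a boolean. The function name and the dependency names in the usage note are illustrative, not part of any existing service.

```python
from typing import Callable, Dict


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency check and summarize the result.

    A check that returns a falsy value or raises is reported as failing,
    so a broken dependency cannot crash the health evaluation itself.
    """
    failing = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failing.append(name)
    return {"healthy": not failing, "failing": sorted(failing)}
```

For example, calling it with a passing `db` check and a `queue` check that raises would report `healthy` as `False` and list `queue` as failing. A health endpoint could return this dictionary directly, which also documents which dependencies the service treats as required.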

Observability access (no secrets)

Where logs live (tooling, dashboards, queries)
Where metrics live
Where traces live (if applicable)
Error tracking location (if applicable)

Deploy and rollback

How to deploy (high level)
How to confirm deploy success
How to roll back safely
Known risky deploy steps (migrations, cache invalidation, feature flags)
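
"Deploy success" is easier to confirm when the runbook states it as an explicit condition. The sketch below is one possible definition, under the assumption that each instance can report its running version and that an error rate is available from metrics. The function name, its inputs, and the 1% default threshold are all hypothetical.

```python
def deploy_looks_successful(
    expected_version,
    instance_versions,
    error_rate,
    max_error_rate=0.01,
):
    """Return True when every instance runs the expected version and the
    observed error rate is at or below the threshold.

    An empty `instance_versions` list counts as failure: no instance has
    confirmed the new version.
    """
    if not instance_versions:
        return False
    all_updated = all(v == expected_version for v in instance_versions)
    return all_updated and error_rate <= max_error_rate
```

If the condition does not hold within the window the runbook specifies, that is the trigger to follow the rollback steps.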

Common incidents

Document real, common failure modes. For each one, record:

Symptom
Likely causes
What to check first
Safe actions to take
When to escalate

Safe-by-default rules

Never put secrets in docs
Prefer deterministic, copy/pasteable commands
If a runbook step is destructive, label it clearly and include guardrails
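
As an illustration of guardrails on a destructive step, the sketch below defaults to a dry run and requires the operator to repeat the target name before anything is deleted. `purge_queue` and its in-memory `messages` list are hypothetical stand-ins for a real queue client.

```python
def purge_queue(queue_name, messages, *, dry_run=True, confirm=None):
    """DESTRUCTIVE: delete all messages from a queue.

    Guardrails:
    - dry_run defaults to True, so a copy/pasted call changes nothing.
    - a real purge requires confirm to equal the queue name.
    """
    if dry_run:
        return f"DRY RUN: would delete {len(messages)} messages from {queue_name}"
    if confirm != queue_name:
        raise PermissionError(
            f"refusing to purge: pass confirm={queue_name!r} to proceed"
        )
    deleted = len(messages)
    messages.clear()
    return f"deleted {deleted} messages from {queue_name}"
```

The same pattern (safe default, explicit confirmation, a message stating what will happen) applies to any destructive runbook command, whatever tool it is written for.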

Recommended structure for incidents

Use “symptom → diagnosis → actions” and keep each entry short.

Example headings:

### Symptom: Elevated 5xx
### Symptom: Worker backlog grows
### Symptom: Database connection saturation
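
To keep entries consistent, the fields listed under "Common incidents" can be captured as a small data structure and rendered under a heading of the form shown above. This is a sketch only: the `IncidentEntry` class and `render_incident` function are hypothetical, and the output layout is one reasonable choice rather than the format of the official templates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentEntry:
    symptom: str
    likely_causes: List[str]
    check_first: List[str]
    safe_actions: List[str]
    escalate_when: str


def render_incident(entry: IncidentEntry) -> str:
    """Render one incident as symptom, then diagnosis, then actions."""
    lines = [f"### Symptom: {entry.symptom}", ""]
    sections = (
        ("Likely causes", entry.likely_causes),
        ("What to check first", entry.check_first),
        ("Safe actions to take", entry.safe_actions),
    )
    for title, items in sections:
        lines.append(f"{title}:")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    lines.append(f"When to escalate: {entry.escalate_when}")
    return "\n".join(lines)
```

Making every field required means an entry cannot be created without stating what to check first and when to escalate, which are the parts most often left out.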

Templates

Use the operations/runbook templates in docs/documentation/templates.md.