Runbooks / Operations Documentation Standard
This document defines the minimum operational documentation expected for any module/system that runs in production.
Why this matters
Operational docs are how we reduce:
- Mean time to detect (MTTD)
- Mean time to recover (MTTR)
- Institutional knowledge trapped in a few people
Where runbooks live
- Multi-repo: docs/operations.md + (optionally) docs/runbooks/
- Monorepo:
  - System-level: docs/operations.md + (optionally) docs/runbooks/
  - Module-level: <module>/docs/operations.md + (optionally) <module>/docs/runbooks/
Minimum required content (per module/system)
Health and readiness
- How to determine if the service is healthy
- Key health endpoints (if applicable)
- Dependencies required for “healthy” (DB, queue, external services)
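One way to make "healthy" concrete is to roll the dependency checks up into a single result. The following is a minimal sketch, not a prescribed implementation: the function name `aggregate_health` and the check names (`db`, `queue`, `external_api`) are hypothetical, and the lambdas stand in for real connectivity checks.

```python
from typing import Callable, Dict


def aggregate_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency check and report overall health.

    A check that raises is recorded as failed, so one broken
    dependency cannot crash the health report itself.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    # The service counts as healthy only if every dependency check passed.
    return {"healthy": all(results.values()), "dependencies": results}


if __name__ == "__main__":
    # Hypothetical stand-ins; replace with real checks for your service.
    report = aggregate_health({
        "db": lambda: True,
        "queue": lambda: True,
        "external_api": lambda: False,
    })
    print(report)
```

Reporting per-dependency results alongside the overall flag lets whoever is on call see which dependency failed without reading logs first.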
Observability access (no secrets)
- Where logs live (tooling, dashboards, queries)
- Where metrics live
- Where traces live (if applicable)
- Error tracking location (if applicable)
Deploy and rollback
- How to deploy (high level)
- How to confirm deploy success
- How to roll back safely
- Known risky deploy steps (migrations, cache invalidation, feature flags)
Common incidents
Document real, common failure modes:
- Symptom
- Likely causes
- What to check first
- Safe actions to take
- When to escalate
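To keep incident entries consistent, the five fields above can be captured in a small structure and rendered as text. This is an illustrative sketch only: the `IncidentEntry` class and its field names are hypothetical, and the example values are placeholders rather than real diagnoses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentEntry:
    """One documented failure mode, mirroring the five fields listed above."""
    symptom: str
    likely_causes: List[str]
    check_first: List[str]
    safe_actions: List[str]
    escalate_when: str

    def render(self) -> str:
        """Render the entry as a short runbook section."""
        def bullets(items: List[str]) -> List[str]:
            return [f"- {item}" for item in items]

        lines = [f"### Symptom: {self.symptom}", "Likely causes:"]
        lines += bullets(self.likely_causes)
        lines.append("What to check first:")
        lines += bullets(self.check_first)
        lines.append("Safe actions to take:")
        lines += bullets(self.safe_actions)
        lines.append(f"When to escalate: {self.escalate_when}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder content; replace with failure modes you have actually seen.
    entry = IncidentEntry(
        symptom="Elevated 5xx",
        likely_causes=["<cause observed in past incidents>"],
        check_first=["<dashboard or query to open first>"],
        safe_actions=["<non-destructive action>"],
        escalate_when="<condition and who to page>",
    )
    print(entry.render())
```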
Safe-by-default rules
- Never put secrets in docs
- Prefer deterministic, copy/pasteable commands
- If a runbook step is destructive, label it clearly and include guardrails
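As an illustration of guardrails for a destructive step, the sketch below defaults to a dry run and requires the operator to type the target environment name before anything executes. The function `confirm_destructive` and its parameters are hypothetical, and this is one possible pattern rather than a required one.

```python
def confirm_destructive(action: str, target_env: str, typed: str,
                        dry_run: bool = True) -> str:
    """Decide whether a destructive step may proceed.

    Guardrails:
    - dry_run defaults to True, so nothing runs unless explicitly requested.
    - The operator must type the exact environment name to confirm.
    """
    if dry_run:
        return f"DRY RUN: would {action} in {target_env}"
    if typed != target_env:
        raise PermissionError(
            f"confirmation mismatch: expected '{target_env}', got '{typed}'"
        )
    return f"EXECUTING: {action} in {target_env}"


if __name__ == "__main__":
    # Safe by default: prints what would happen without doing it.
    print(confirm_destructive("purge the cache", "production", typed=""))
```

Defaulting to a dry run means a copy/pasted command is harmless until the operator opts in deliberately.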
Recommended structure for incidents
Use “symptom → diagnosis → actions” and keep it short.
Example headings:
- ### Symptom: Elevated 5xx
- ### Symptom: Worker backlog grows
- ### Symptom: Database connection saturation
Templates
Use the operations/runbook templates in docs/documentation/templates.md.