flowCreate.solutions

Runbooks / Operations Documentation Standard

This document defines the minimum operational documentation expected for any module/system that runs in production.

Why this matters

Operational docs are how we reduce:

  • Mean time to detect (MTTD)
  • Mean time to recover (MTTR)
  • Dependence on institutional knowledge held by only a few people

Where runbooks live

  • Multi-repo: docs/operations.md + (optionally) docs/runbooks/
  • Monorepo:
    • System-level: docs/operations.md + (optionally) docs/runbooks/
    • Module-level: <module>/docs/operations.md + (optionally) <module>/docs/runbooks/
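
The layout above can be checked automatically, for example in CI. The following is a minimal sketch, not a prescribed tool: the function name `missing_operations_docs` and the `modules` parameter are hypothetical, and it only checks the required `docs/operations.md` files, not the optional `docs/runbooks/` directories.

```python
from pathlib import Path
from typing import Iterable, List


def missing_operations_docs(repo_root: str, modules: Iterable[str] = ()) -> List[str]:
    """Return the expected operations docs that do not exist under repo_root.

    `modules` is a hypothetical list of module directory names in a monorepo;
    leave it empty for a multi-repo layout.
    """
    root = Path(repo_root)
    # System-level doc is expected in both layouts.
    expected = [Path("docs") / "operations.md"]
    # Module-level docs are expected only for the modules passed in.
    for module in modules:
        expected.append(Path(module) / "docs" / "operations.md")
    return [p.as_posix() for p in expected if not (root / p).is_file()]
```

An empty result means every expected file is present; a non-empty result lists the paths to create.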

Minimum required content (per module/system)

Health and readiness

  • How to determine if the service is healthy
  • Key health endpoints (if applicable)
  • Dependencies required for “healthy” (DB, queue, external services)
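
One way to make "healthy" concrete is to express it as a set of named dependency checks. The sketch below is illustrative only: `evaluate_health` and the check names `db` and `queue` are assumptions, and real checks would call the actual dependencies rather than the stand-ins shown here.

```python
from typing import Callable, Dict


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency check and report overall health.

    A check that raises is recorded as failed, so one broken dependency
    cannot crash the health report itself.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    # The service counts as healthy only if every dependency check passed.
    return {"healthy": all(results.values()), "dependencies": results}


def failing_queue() -> bool:
    # Stand-in for a real queue ping that cannot connect.
    raise ConnectionError("queue unreachable")


report = evaluate_health({"db": lambda: True, "queue": failing_queue})
# report == {"healthy": False, "dependencies": {"db": True, "queue": False}}
```

Listing the same dependency names in the operations doc and in the health check keeps the two from drifting apart.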

Observability access (no secrets)

  • Where logs live (tooling, dashboards, queries)
  • Where metrics live
  • Where traces live (if applicable)
  • Error tracking location (if applicable)

Deploy and rollback

  • How to deploy (high level)
  • How to confirm deploy success
  • How to roll back safely
  • Known risky deploy steps (migrations, cache invalidation, feature flags)
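
"How to confirm deploy success" is easiest to follow when it is written as an explicit check. The following is a sketch under stated assumptions: the function name, the idea of comparing per-instance versions, and the 1% error-rate threshold are all placeholders to be replaced with whatever the system actually exposes.

```python
from typing import Sequence


def deploy_succeeded(
    expected_version: str,
    reported_versions: Sequence[str],
    error_rate: float,
    max_error_rate: float = 0.01,  # placeholder threshold, not a recommendation
) -> bool:
    """True when every instance reports the expected version and the
    observed error rate is within the allowed threshold."""
    if not reported_versions:
        # No instances reporting is treated as a failed deploy, not a success.
        return False
    all_updated = all(v == expected_version for v in reported_versions)
    return all_updated and error_rate <= max_error_rate
```

When the check returns False after the expected rollout window, the documented rollback procedure applies.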

Common incidents

Document real, common failure modes:

  • Symptom
  • Likely causes
  • What to check first
  • Safe actions to take
  • When to escalate

Safe-by-default rules

  • Never put secrets in docs
  • Prefer deterministic, copy/pasteable commands
  • If a runbook step is destructive, label it clearly and include guardrails
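
As an illustration of guardrails on a destructive step, the sketch below defaults to a dry run and requires the operator to type the target name before anything runs. The function `run_destructive` and its parameters are hypothetical; the pattern, not the names, is the point.

```python
from typing import Callable


def run_destructive(
    action: Callable[[], None],
    target: str,
    typed_confirmation: str,
    dry_run: bool = True,
) -> str:
    """Run `action` only when dry_run is disabled and the operator
    typed the exact target name as confirmation."""
    if dry_run:
        # Default path: describe what would happen, change nothing.
        return f"DRY RUN: would run destructive action on {target}"
    if typed_confirmation != target:
        raise PermissionError(
            f"confirmation {typed_confirmation!r} does not match target {target!r}"
        )
    action()
    return f"DONE: destructive action ran on {target}"
```

Making the safe path the default means a copy/pasted command does nothing harmful until the operator deliberately opts in.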

Structure each runbook entry as “symptom → diagnosis → actions” and keep it short.

Example headings:

  • ### Symptom: Elevated 5xx
  • ### Symptom: Worker backlog grows
  • ### Symptom: Database connection saturation
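
To keep entries consistent, the incident fields listed under "Common incidents" can be captured in a small structure and rendered under a symptom heading. This is a sketch only: the `RunbookEntry` class, its field names, and the sample content are assumptions for illustration, not part of the templates.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunbookEntry:
    symptom: str
    likely_causes: List[str] = field(default_factory=list)
    first_checks: List[str] = field(default_factory=list)
    safe_actions: List[str] = field(default_factory=list)
    escalate_when: str = ""

    def render(self) -> str:
        """Render the entry as text, starting with a '### Symptom:' heading."""
        lines = [f"### Symptom: {self.symptom}", ""]
        sections = (
            ("Likely causes", self.likely_causes),
            ("What to check first", self.first_checks),
            ("Safe actions to take", self.safe_actions),
        )
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
            lines.append("")
        lines.append(f"When to escalate: {self.escalate_when}")
        return "\n".join(lines)


# Sample content is invented for illustration.
entry = RunbookEntry(
    symptom="Elevated 5xx",
    likely_causes=["Bad deploy"],
    first_checks=["Recent deploys"],
    safe_actions=["Roll back"],
    escalate_when="Error rate stays high after rollback",
)
text = entry.render()
```

Every entry rendered this way has the same sections in the same order, which makes gaps such as a missing escalation rule easy to spot in review.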

Templates

Use the operations/runbook templates in docs/documentation/templates.md.