diff --git a/docs/decisions/022-backup.md b/docs/decisions/022-backup.md
index 6afc790..6a4980f 100644
--- a/docs/decisions/022-backup.md
+++ b/docs/decisions/022-backup.md
@@ -123,8 +123,9 @@ silent" point.)
 **`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`, `VERIFY.md`,
 and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting: what state
 exists, what is backed up, the dump command, and the per-service restore
-procedure. A template lives at `docs/backup/service-backup-template.md`. Stateless
-services record `backup__state: false` in their vars and note it in `BACKUP.md`.
+procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
+service declares `backup__state: false` (with a reason) in its role vars and gets **no**
+`BACKUP.md`.
 
 **Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
 touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
@@ -223,7 +224,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 
 - **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
   monitor**; no ping in ~25 h → alert.
-- **Immediate failure → ntfy** on any job or `predump` error.
+- **Immediate failure → ntfy** on any job or dump-step error.
 - **Weekly `restic check`** for repo integrity → alert on corruption.
 - **Tier-1 restore-verify failures → ntfy.**
 - *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
@@ -232,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 ### 13. Schedule
 
 - **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
-  `predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
+  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
   → `rclone sync` → pCloud. Sequential, off-hours.
 - **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
 - **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
diff --git a/docs/superpowers/specs/2026-06-10-backup-strategy-design.md b/docs/superpowers/specs/2026-06-10-backup-strategy-design.md
index eed9c1c..4393107 100644
--- a/docs/superpowers/specs/2026-06-10-backup-strategy-design.md
+++ b/docs/superpowers/specs/2026-06-10-backup-strategy-design.md
@@ -1,7 +1,7 @@
 # Design — Backup & disaster recovery strategy
 
 - **Date:** 2026-06-10
-- **Status:** Approved design — pending implementation plan
+- **Status:** Approved design — implementation plan written; Plan 1 (foundation) complete (see ADR-022)
 - **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up, incl. DB
   dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap, all
   "planned")
@@ -113,13 +113,18 @@ database. Each **service role declares its backup needs** in role vars — the s
 render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
 
 ```yaml
-backup__paths:               # bind-mount dirs / files holding state
+backup__service: nextcloud   # identifier; matches the role / compose project
+backup__state: true          # false = stateless → no BACKUP.md (pair with a reason)
+backup__paths:               # bind-mount dirs / files holding state ([] = none)
   - /srv/nextcloud/data
-backup__dumps:               # logical app-consistent dumps (list; [] = none)
+backup__dumps:               # logical app-consistent dumps (list; [] = none)
   - cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
     dest: nextcloud-db.sql
+backup__quiesce: false       # true = stop→back up→restart escape hatch
 ```
 
+(ADR-022 is authoritative for the contract.)
+
 The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
 run the dumps → pull the dump files + declared paths read-only → `restic` snapshot.
 A service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
@@ -219,7 +224,7 @@ anything (a rogue USB does nothing).
 Success/failure pings alone miss the worst case (*the job silently stopped running*):
 - **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
   monitor** (already in the planned stack); no ping in ~25 h → alert.
-- **Immediate failure → ntfy** on any job or `predump` error.
+- **Immediate failure → ntfy** on any job or dump-step error.
 - **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
 - **Tier-1 restore-verify failures → ntfy.**
 - *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
@@ -228,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 ### 13. Schedule
 
 - **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
-  `predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
+  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
   (Decision 9) → `rclone sync` → pCloud. Sequential, off-hours.
 - **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
 - **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
@@ -240,7 +245,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
                          ┌─────────────────────────────────────────┐
   docker_hosts / etc.    │ fisi (backup node)                      │
   ┌───────────┐   SSH    │ pull orchestrator (reads backup__* )    │
-  │ service A │◀─────────│ 1. ssh host → run predump (pg_dump…)    │
+  │ service A │◀─────────│ 1. ssh host → run dumps (pg_dump…)      │
   │ + DB      │ pull RO  │ 2. pull dump + backup__paths (read-only)│
   └───────────┘─────────▶│ 3. restic snapshot → local repo (mirror)│
   ┌───────────┐          │ 4. restic forget --prune (GFS)          │