docs(backup): final-review fixes — stateless BACKUP.md, dump-step wording, spec sync

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:32:06 +02:00
commit ed6d5463aa
parent 1e85c11ede
2 changed files with 16 additions and 10 deletions
--- a/docs/decisions/022-backup.md
+++ b/docs/decisions/022-backup.md
@@ -123,8 +123,9 @@ silent" point.)
 **`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
 `VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
 what state exists, what is backed up, the dump command, and the per-service restore
-procedure. A template lives at `docs/backup/service-backup-template.md`. Stateless
-services record `backup__state: false` in their vars and note it in `BACKUP.md`.
+procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
+service declares `backup__state: false` (with a reason) in its role vars and gets **no**
+`BACKUP.md`.

 **Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
 touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
@@ -223,7 +224,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni

 - **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
  monitor**; no ping in ~25 h → alert.
-- **Immediate failure → ntfy** on any job or `predump` error.
+- **Immediate failure → ntfy** on any job or dump-step error.
 - **Weekly `restic check`** for repo integrity → alert on corruption.
 - **Tier-1 restore-verify failures → ntfy.**
 - *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
@@ -232,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 ### 13. Schedule

 - **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
-  `predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
+  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
  → `rclone sync` → pCloud. Sequential, off-hours.
 - **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
 - **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
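
Reviewer sketch for the stateless rule changed in the first hunk above: a minimal Python check, assuming the role vars are already loaded into a dict. The ADR text says a stateless service declares `backup__state: false` "with a reason" but does not name the reason variable, so `backup__state_reason` is a hypothetical key, and treating a missing `backup__state` as an error is likewise an assumption, not something the ADR states.

```python
from typing import Any, Mapping

# Hypothetical key: the ADR requires "a reason" but does not name the variable.
REASON_KEY = "backup__state_reason"


def needs_backup_doc(role_vars: Mapping[str, Any]) -> bool:
    """Return True if the role should get a rendered BACKUP.md, False if stateless."""
    # Assumption: every role must declare backup__state explicitly.
    if "backup__state" not in role_vars:
        raise ValueError("backup__state must be declared")
    state = role_vars["backup__state"]
    if not isinstance(state, bool):
        raise ValueError("backup__state must be true or false")
    if state:
        return True
    # Stateless: no BACKUP.md, but a non-empty reason is required.
    reason = role_vars.get(REASON_KEY)
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError(f"stateless service must set {REASON_KEY}")
    return False


stateful = {"backup__state": True, "backup__paths": ["/srv/nextcloud/data"]}
stateless = {"backup__state": False, REASON_KEY: "no persistent data"}
```

A gate could call `needs_backup_doc` per role and fail when a `BACKUP.md` exists for a role where it returns `False`, or is missing where it returns `True`.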
--- a/docs/superpowers/specs/2026-06-10-backup-strategy-design.md
+++ b/docs/superpowers/specs/2026-06-10-backup-strategy-design.md
@@ -1,7 +1,7 @@
 # Design — Backup & disaster recovery strategy

 - **Date:** 2026-06-10
-- **Status:** Approved design — pending implementation plan
+- **Status:** Approved design — implementation plan written; Plan 1 (foundation) complete (see ADR-022)
 - **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up,
  incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap,
  all "planned")
@@ -113,13 +113,18 @@ database. Each **service role declares its backup needs** in role vars — the s
 render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):

 ```yaml
-backup__paths:          # bind-mount dirs / files holding state
+backup__service: nextcloud   # identifier; matches the role / compose project
+backup__state: true          # false = stateless → no BACKUP.md (pair with a reason)
+backup__paths:               # bind-mount dirs / files holding state ([] = none)
  - /srv/nextcloud/data
-backup__dumps:          # logical app-consistent dumps (list; [] = none)
+backup__dumps:               # logical app-consistent dumps (list; [] = none)
  - cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
    dest: nextcloud-db.sql
+backup__quiesce: false       # true = stop→back up→restart escape hatch
 ```

+(ADR-022 is authoritative for the contract.)
+
 The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
 run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
 service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
@@ -219,7 +224,7 @@ anything (a rogue USB does nothing).
 Success/failure pings alone miss the worst case (*the job silently stopped running*):
 - **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
  monitor** (already in the planned stack); no ping in ~25 h → alert.
-- **Immediate failure → ntfy** on any job or `predump` error.
+- **Immediate failure → ntfy** on any job or dump-step error.
 - **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
 - **Tier-1 restore-verify failures → ntfy.**
 - *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
@@ -228,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 ### 13. Schedule

 - **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
-  `predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
+  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
  (Decision 9) → `rclone sync` → pCloud. Sequential, off-hours.
 - **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
 - **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
@@ -240,7 +245,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
                         ┌─────────────────────────────────────────┐
  docker_hosts / etc.    │                fisi (backup node)        │
  ┌───────────┐  SSH     │  pull orchestrator (reads backup__* )    │
-  │ service A │◀─────────│   1. ssh host → run predump (pg_dump…)   │
+  │ service A │◀─────────│   1. ssh host → run dumps (pg_dump…)     │
  │  + DB     │  pull RO │   2. pull dump + backup__paths (read-only)│
  └───────────┘─────────▶│   3. restic snapshot → local repo (mirror)│
  ┌───────────┐          │   4. restic forget --prune (GFS)         │