docs(backup): final-review fixes — stateless BACKUP.md, dump-step wording, spec sync

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-10 11:32:06 +02:00
parent 1e85c11ede
commit ed6d5463aa
2 changed files with 16 additions and 10 deletions


@ -123,8 +123,9 @@ silent" point.)
**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
`VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
what state exists, what is backed up, the dump command, and the per-service restore
procedure. A template lives at `docs/backup/service-backup-template.md`. Stateless
services record `backup__state: false` in their vars and note it in `BACKUP.md`.
procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
service declares `backup__state: false` (with a reason) in its role vars and gets **no**
`BACKUP.md`.
**Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
@ -223,7 +224,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
monitor**; no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job or `predump` error.
- **Immediate failure → ntfy** on any job or dump-step error.
- **Weekly `restic check`** for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
@ -232,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
### 13. Schedule
- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
`predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
`rclone sync` → pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.


@ -1,7 +1,7 @@
# Design — Backup & disaster recovery strategy
- **Date:** 2026-06-10
- **Status:** Approved design — pending implementation plan
- **Status:** Approved design — implementation plan written; Plan 1 (foundation) complete (see ADR-022)
- **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up,
incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap,
all "planned")
@ -113,13 +113,18 @@ database. Each **service role declares its backup needs** in role vars — the s
render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
```yaml
backup__paths: # bind-mount dirs / files holding state
backup__service: nextcloud # identifier; matches the role / compose project
backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
backup__paths: # bind-mount dirs / files holding state ([] = none)
- /srv/nextcloud/data
backup__dumps: # logical app-consistent dumps (list; [] = none)
backup__dumps: # logical app-consistent dumps (list; [] = none)
- cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
  dest: nextcloud-db.sql
backup__quiesce: false # true = stop→back up→restart escape hatch
```
(ADR-022 is authoritative for the contract.)
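As an illustration of how the declared vars could be sanity-checked before rendering, here is a minimal Python sketch. It is not the contract from ADR-022. The function names, the `backup__state_reason` key, and the rule that a stateless service declares no paths or dumps are all assumptions made for this example.

```python
def check_backup_vars(role_vars: dict) -> list[str]:
    """Return a list of problems; an empty list means the vars look consistent."""
    problems = []
    if not role_vars.get("backup__service"):
        problems.append("backup__service is missing")

    state = role_vars.get("backup__state")
    if not isinstance(state, bool):
        problems.append("backup__state must be true or false")

    paths = role_vars.get("backup__paths", [])
    dumps = role_vars.get("backup__dumps", [])

    if state is False:
        # Hypothetical key name: the design only says "pair with a reason".
        if not role_vars.get("backup__state_reason"):
            problems.append("stateless service needs a reason")
        # Assumption: a stateless service declares neither paths nor dumps.
        if paths or dumps:
            problems.append("stateless service must not declare paths or dumps")

    # Each dump entry needs a command to run and a destination file name.
    for i, dump in enumerate(dumps):
        for key in ("cmd", "dest"):
            if not dump.get(key):
                problems.append(f"backup__dumps[{i}] is missing {key}")
    return problems


def needs_backup_md(role_vars: dict) -> bool:
    """Only stateful services (backup__state: true) get a rendered BACKUP.md."""
    return role_vars.get("backup__state") is True
```

A check of this kind would let a missing reason or a half-declared dump surface at lint time instead of during the nightly run.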
The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
@ -219,7 +224,7 @@ anything (a rogue USB does nothing).
Success/failure pings alone miss the worst case (*the job silently stopped running*):
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
monitor** (already in the planned stack); no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job or `predump` error.
- **Immediate failure → ntfy** on any job or dump-step error.
- **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
@ -228,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
### 13. Schedule
- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
`predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
(Decision 9) → `rclone sync` → pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
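The ordering of the nightly run can be sketched as a plan builder in Python. This is only an illustration of the sequence, not the orchestrator itself. The `host` key and the function name are hypothetical, and running `restic forget --prune` and `rclone sync` once after all services is an assumption.

```python
def nightly_plan(services: list[dict]) -> list[str]:
    """Build the ordered list of steps for one sequential nightly run."""
    plan = []
    for svc in services:
        # Stateless services are declared but contribute no steps.
        if svc.get("backup__state") is False:
            continue
        name, host = svc["backup__service"], svc["host"]  # "host" is hypothetical
        # Dumps run on the service host first, so the pulled files are app-consistent.
        for dump in svc.get("backup__dumps", []):
            plan.append(f"{host}: run dump for {name} -> {dump['dest']}")
            plan.append(f"{host}: pull {dump['dest']} read-only")
        for path in svc.get("backup__paths", []):
            plan.append(f"{host}: pull {path} read-only")
        plan.append(f"restic snapshot for {name}")
    # Assumption: retention and the off-site sync run once, after every snapshot.
    plan.append("restic forget --prune")
    plan.append("rclone sync to pCloud")
    return plan
```

Building the plan as data before executing anything keeps the sequence easy to dry-run and to log when a step fails.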
@ -240,7 +245,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
┌─────────────────────────────────────────┐
docker_hosts / etc. │ fisi (backup node) │
┌───────────┐ SSH │ pull orchestrator (reads backup__* ) │
│ service A │◀─────────│ 1. ssh host → run predump (pg_dump…) │
│ service A │◀─────────│ 1. ssh host → run dumps (pg_dump…) │
│ + DB │ pull RO │ 2. pull dump + backup__paths (read-only)│
└───────────┘─────────▶│ 3. restic snapshot → local repo (mirror)│
┌───────────┐ │ 4. restic forget --prune (GFS) │