docs(backup): final-review fixes — stateless BACKUP.md, dump-step wording, spec sync
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 1e85c11ede
commit ed6d5463aa
2 changed files with 16 additions and 10 deletions
@@ -123,8 +123,9 @@ silent" point.)
 **`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
 `VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
 what state exists, what is backed up, the dump command, and the per-service restore
-procedure. A template lives at `docs/backup/service-backup-template.md`. Stateless
-services record `backup__state: false` in their vars and note it in `BACKUP.md`.
+procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
+service declares `backup__state: false` (with a reason) in its role vars and gets **no**
+`BACKUP.md`.
 
 **Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
 touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
@@ -223,7 +224,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 
 - **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
   monitor**; no ping in ~25 h → alert.
-- **Immediate failure → ntfy** on any job or `predump` error.
+- **Immediate failure → ntfy** on any job or dump-step error.
 - **Weekly `restic check`** for repo integrity → alert on corruption.
 - **Tier-1 restore-verify failures → ntfy.**
 - *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
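The dead-man's-switch rule in the hunk above comes down to a staleness check. The sketch below only illustrates that logic, assuming the ~25 h window stated in the text. In the design the check is done by the Uptime Kuma push monitor itself, and `is_overdue` and `MAX_SILENCE` are hypothetical names that do not appear in the repository.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the "~25 h" window from the monitoring bullet, as a constant.
MAX_SILENCE = timedelta(hours=25)


def is_overdue(last_ping: datetime, now: datetime,
               max_silence: timedelta = MAX_SILENCE) -> bool:
    """Return True when no successful-run ping arrived within the window."""
    return now - last_ping > max_silence


if __name__ == "__main__":
    last_ok = datetime(2026, 6, 10, 3, 0, tzinfo=timezone.utc)
    # One missed nightly run: 26 h of silence exceeds the 25 h window.
    print(is_overdue(last_ok, last_ok + timedelta(hours=26)))
```

A window slightly longer than 24 h lets a nightly run finish a little late without raising a false alert, while a skipped run is still caught the following night.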
@@ -232,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 ### 13. Schedule
 
 - **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
-  `predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
+  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
   → `rclone sync` → pCloud. Sequential, off-hours.
 - **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
 - **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
@@ -1,7 +1,7 @@
 # Design — Backup & disaster recovery strategy
 
 - **Date:** 2026-06-10
-- **Status:** Approved design — pending implementation plan
+- **Status:** Approved design — implementation plan written; Plan 1 (foundation) complete (see ADR-022)
 - **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up,
   incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap,
   all "planned")
@@ -113,13 +113,18 @@ database. Each **service role declares its backup needs** in role vars — the s
 render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
 
 ```yaml
-backup__paths: # bind-mount dirs / files holding state
+backup__service: nextcloud # identifier; matches the role / compose project
+backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
+backup__paths: # bind-mount dirs / files holding state ([] = none)
   - /srv/nextcloud/data
-backup__dumps: # logical app-consistent dumps (list; [] = none)
+backup__dumps: # logical app-consistent dumps (list; [] = none)
   - cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
     dest: nextcloud-db.sql
 backup__quiesce: false # true = stop→back up→restart escape hatch
 ```
+
+(ADR-022 is authoritative for the contract.)
+
 The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
 run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
 service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
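To make the `backup__*` contract in the hunk above concrete, here is a minimal sketch of a shape check over one role's vars. It is an illustration under assumptions, not the project's gate: ADR-022 remains authoritative, the function names `check_backup_vars` and `needs_backup_md` are hypothetical, and only rules readable from the YAML comments are encoded (lists may be empty, each dump carries `cmd` and `dest`, `backup__state: false` means no `BACKUP.md`).

```python
from typing import Any


def check_backup_vars(role_vars: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: list[str] = []
    if not role_vars.get("backup__service"):
        problems.append("backup__service is missing")
    if not isinstance(role_vars.get("backup__state"), bool):
        problems.append("backup__state must be true or false")
    # Both lists are optional in this sketch; [] means "none".
    for key in ("backup__paths", "backup__dumps"):
        if not isinstance(role_vars.get(key, []), list):
            problems.append(f"{key} must be a list ([] = none)")
    dumps = role_vars.get("backup__dumps", [])
    if isinstance(dumps, list):
        for i, dump in enumerate(dumps):
            if not isinstance(dump, dict) or not {"cmd", "dest"} <= dump.keys():
                problems.append(f"backup__dumps[{i}] needs both cmd and dest")
    return problems


def needs_backup_md(role_vars: dict[str, Any]) -> bool:
    """Stateless services (backup__state: false) get no BACKUP.md."""
    return role_vars.get("backup__state") is True
```

The key that holds the stateless "reason" is not named in the diff, so the sketch does not check for it.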
@@ -219,7 +224,7 @@ anything (a rogue USB does nothing).
 Success/failure pings alone miss the worst case (*the job silently stopped running*):
 - **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
   monitor** (already in the planned stack); no ping in ~25 h → alert.
-- **Immediate failure → ntfy** on any job or `predump` error.
+- **Immediate failure → ntfy** on any job or dump-step error.
 - **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
 - **Tier-1 restore-verify failures → ntfy.**
 - *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
@@ -228,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 ### 13. Schedule
 
 - **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
-  `predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
+  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
   (Decision 9) → `rclone sync` → pCloud. Sequential, off-hours.
 - **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
 - **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
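The nightly sequence in the hunk above (run dumps, pull paths, snapshot, forget and prune, sync off-site) can be sketched as an ordered list of commands. This is a rough illustration only: `nightly_plan` and all of its arguments are hypothetical, `rsync` is assumed as the pull tool because the diff does not name one, and the retention numbers are placeholders, since the real policy lives in Decision 9.

```python
def nightly_plan(host: str, dumps: list[dict], paths: list[str],
                 staging: str, repo: str, remote: str) -> list[list[str]]:
    """Build the per-host command sequence in the order the schedule lists it."""
    plan: list[list[str]] = []
    # 1. Run each declared dump on the host over SSH; a real orchestrator
    #    would also capture the output into dump["dest"].
    for dump in dumps:
        plan.append(["ssh", host, dump["cmd"]])
    # 2. Pull the declared paths into a local staging directory
    #    (rsync is an assumption).
    for path in paths:
        plan.append(["rsync", "-a", f"{host}:{path}", staging])
    # 3. Snapshot the staged data into the local restic repository.
    plan.append(["restic", "-r", repo, "backup", staging])
    # 4. Apply retention and prune; the keep-* values are placeholders.
    plan.append(["restic", "-r", repo, "forget", "--prune",
                 "--keep-daily", "7", "--keep-weekly", "4",
                 "--keep-monthly", "12"])
    # 5. Mirror the repository to the off-site remote.
    plan.append(["rclone", "sync", repo, remote])
    return plan


if __name__ == "__main__":
    steps = nightly_plan(
        host="docker-host-1",
        dumps=[{"cmd": "pg_dump nextcloud", "dest": "nextcloud-db.sql"}],
        paths=["/srv/nextcloud/data"],
        staging="/var/backup/staging",
        repo="/var/backup/restic",
        remote="pcloud:backups",
    )
    for step in steps:
        print(" ".join(step))
```

Building the plan as data before executing anything makes the ordering easy to inspect or dry-run, and it suits a sequential off-hours job where each step should stop the run on failure.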
@@ -240,7 +245,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
                        ┌─────────────────────────────────────────┐
 docker_hosts / etc.    │ fisi (backup node)                      │
 ┌───────────┐   SSH    │ pull orchestrator (reads backup__* )    │
-│ service A │◀─────────│ 1. ssh host → run predump (pg_dump…)    │
+│ service A │◀─────────│ 1. ssh host → run dumps (pg_dump…)      │
 │   + DB    │ pull RO  │ 2. pull dump + backup__paths (read-only)│
 └───────────┘─────────▶│ 3. restic snapshot → local repo (mirror)│
 ┌───────────┐          │ 4. restic forget --prune (GFS)          │