docs(backup): final-review fixes — stateless BACKUP.md, dump-step wording, spec sync
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 1e85c11ede
commit ed6d5463aa
2 changed files with 16 additions and 10 deletions
@@ -123,8 +123,9 @@ silent" point.)
 **`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
 `VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
 what state exists, what is backed up, the dump command, and the per-service restore
-procedure. A template lives at `docs/backup/service-backup-template.md`. Stateless
-services record `backup__state: false` in their vars and note it in `BACKUP.md`.
+procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
+service declares `backup__state: false` (with a reason) in its role vars and gets **no**
+`BACKUP.md`.
 
 **Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
 touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
@@ -223,7 +224,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 
 - **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
 monitor**; no ping in ~25 h → alert.
-- **Immediate failure → ntfy** on any job or `predump` error.
+- **Immediate failure → ntfy** on any job or dump-step error.
 - **Weekly `restic check`** for repo integrity → alert on corruption.
 - **Tier-1 restore-verify failures → ntfy.**
 - *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
@@ -232,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 ### 13. Schedule
 
 - **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
-`predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
+run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
 → `rclone sync` → pCloud. Sequential, off-hours.
 - **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
 - **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
@@ -1,7 +1,7 @@
 # Design — Backup & disaster recovery strategy
 
 - **Date:** 2026-06-10
-- **Status:** Approved design — pending implementation plan
+- **Status:** Approved design — implementation plan written; Plan 1 (foundation) complete (see ADR-022)
 - **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up,
 incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap,
 all "planned")
@@ -113,13 +113,18 @@ database. Each **service role declares its backup needs** in role vars — the s
 render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
 
 ```yaml
-backup__paths: # bind-mount dirs / files holding state
+backup__service: nextcloud # identifier; matches the role / compose project
+backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
+backup__paths: # bind-mount dirs / files holding state ([] = none)
   - /srv/nextcloud/data
 backup__dumps: # logical app-consistent dumps (list; [] = none)
   - cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
     dest: nextcloud-db.sql
+backup__quiesce: false # true = stop→back up→restart escape hatch
 ```
 
+(ADR-022 is authoritative for the contract.)
+
 The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
 run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
 service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
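The role-var contract in the hunk above lends itself to a mechanical check. The following is a minimal sketch, not the project's implementation: `check_backup_vars` and `needs_backup_md` are hypothetical helpers, the key name `backup__state_reason` is an assumption (the diff says only that `backup__state: false` is paired with a reason, without naming the variable), and treating a missing `backup__state` as `true` is also an assumption. ADR-022 remains the authority on the real contract.

```python
from typing import Any


def check_backup_vars(role_vars: dict[str, Any]) -> list[str]:
    """Return contract problems; an empty list means the vars look consistent."""
    problems: list[str] = []
    if not role_vars.get("backup__service"):
        problems.append("backup__service is missing")

    # Assumption: a role that omits backup__state is treated as stateful.
    stateful = role_vars.get("backup__state", True)
    if not stateful and not role_vars.get("backup__state_reason"):
        # Assumption: the reason is stored under backup__state_reason.
        problems.append("backup__state: false needs a reason")

    # Each dump entry in the example carries both a command and a destination.
    for i, dump in enumerate(role_vars.get("backup__dumps", [])):
        for key in ("cmd", "dest"):
            if not dump.get(key):
                problems.append(f"backup__dumps[{i}] is missing {key}")
    return problems


def needs_backup_md(role_vars: dict[str, Any]) -> bool:
    """Per the diff, a stateless service gets no BACKUP.md."""
    return bool(role_vars.get("backup__state", True))
```

Returning a list of problems rather than raising on the first one lets a gate report every violation for a role in a single pass.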
@@ -219,7 +224,7 @@ anything (a rogue USB does nothing).
 Success/failure pings alone miss the worst case (*the job silently stopped running*):
 - **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
 monitor** (already in the planned stack); no ping in ~25 h → alert.
-- **Immediate failure → ntfy** on any job or `predump` error.
+- **Immediate failure → ntfy** on any job or dump-step error.
 - **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
 - **Tier-1 restore-verify failures → ntfy.**
 - *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
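The dead-man's-switch bullet in the hunk above comes down to one comparison: how long ago did the last successful run report in? In the design that comparison is made by the Uptime Kuma push monitor, so the sketch below is illustrative only; `is_overdue` and `MAX_SILENCE` are hypothetical names, and the 25-hour window is taken from the "~25 h" figure in the text.

```python
from datetime import datetime, timedelta, timezone

# "~25 h" from the design text; presumably one nightly cycle plus some slack.
MAX_SILENCE = timedelta(hours=25)


def is_overdue(last_ping: datetime, now: datetime,
               max_silence: timedelta = MAX_SILENCE) -> bool:
    """True when no successful run has pinged within the allowed window."""
    return now - last_ping > max_silence


last_ok = datetime(2026, 6, 10, 3, 0, tzinfo=timezone.utc)
print(is_overdue(last_ok, last_ok + timedelta(hours=24)))  # still inside the window
print(is_overdue(last_ok, last_ok + timedelta(hours=26)))  # silence has lasted too long
```

Because the alert keys off the absence of a ping, it also fires when the job never starts, which is the case a failure notification alone cannot cover.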
@@ -228,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
 ### 13. Schedule
 
 - **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
-`predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
+run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
 (Decision 9) → `rclone sync` → pCloud. Sequential, off-hours.
 - **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
 - **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
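The nightly run in the hunk above is a fixed, sequential chain of stages. The sketch below shows one way to drive such a chain; it is a hedged illustration, not the orchestrator's code. `STAGE_ORDER` copies the stage names from the schedule bullet, `run_stages` is a hypothetical helper, and stopping at the first failing stage is an assumption. The sketch also leaves open whether the prune and sync stages run per host or once per night, since the text does not say.

```python
from typing import Callable

# Stage names as listed in the schedule bullet, in run order.
STAGE_ORDER = [
    "run dumps",
    "pull paths read-only",
    "restic snapshot",
    "restic forget --prune",
    "rclone sync",
]


def run_stages(actions: dict[str, Callable[[], None]],
               on_failure: Callable[[str, Exception], None]) -> list[str]:
    """Run each stage in STAGE_ORDER; report the first error and stop."""
    completed: list[str] = []
    for name in STAGE_ORDER:
        try:
            actions[name]()
        except Exception as exc:
            # Assumption: a failed stage aborts the rest of the run.
            on_failure(name, exc)
            break
        completed.append(name)
    return completed
```

Passing the failure handler in as a callback keeps the notification channel (ntfy in the design) out of the sequencing logic, so the ordering can be tested without any network access.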
@@ -240,7 +245,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni
                        ┌─────────────────────────────────────────┐
 docker_hosts / etc.    │ fisi (backup node)                      │
 ┌───────────┐   SSH    │ pull orchestrator (reads backup__* )    │
-│ service A │◀─────────│ 1. ssh host → run predump (pg_dump…)    │
+│ service A │◀─────────│ 1. ssh host → run dumps (pg_dump…)      │
 │  + DB     │ pull RO  │ 2. pull dump + backup__paths (read-only)│
 └───────────┘─────────▶│ 3. restic snapshot → local repo (mirror)│
 ┌───────────┐          │ 4. restic forget --prune (GFS)          │