docs(backup): add foundation-layer implementation plan (ADR-022)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-10 11:05:17 +02:00
parent eaffd8d900
commit 2041bd3b70


@ -0,0 +1,476 @@
# Backup & DR Strategy — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Land the *foundation layer* of the backup strategy — ADR-022, the per-service `backup__*` data contract + `BACKUP.md` governance triad (template + checklist gate + runbook step + dormant verifier), and the doc/inventory updates — so every future service role is born backup-aware, before any live infrastructure exists.
**Architecture:** This is the first of three sequenced plans (see *Decomposition & roadmap* below). It is **doc/governance only** — no Ansible role, no live restic/rclone, no host contact. It mirrors exactly how ADR-021 delivered operational-access governance: a template under `docs/<concern>/`, one line in `docs/security/service-checklist.md`, a step in `docs/runbooks/new-role.md`, and a *dormant* verifier command (`/check-access` → here `/check-backup`). boma deliberately gates these per-service docs via checklist+runbook, **not** an automated lint script — so this plan adds **no** `scripts/check-*.py`. (This reconciles the design doc's casual "make lint gates its presence" phrasing with boma's actual governance choice; the ADR records the reconciliation.)
**Tech Stack:** Markdown docs, Ansible role-var conventions (`backup__*`, double-underscore namespace per CLAUDE.md), `make lint` (yamllint + ansible-lint + `check-tags.py`) as the only automated gate, `git` trunk-based on a feature branch.
**Source spec:** `docs/superpowers/specs/2026-06-10-backup-strategy-design.md` (Decisions 1–13 referenced by number throughout).
---
## Decomposition & roadmap
The full spec spans three subsystems with hard ordering dependencies (STATUS.md: no service roles exist, `fisi` unprovisioned, Terraform never `init`ed, no staging cluster, no Uptime Kuma/pCloud). Each becomes its own plan and produces working, testable software on its own:
- **Plan 1 — Foundation (THIS PLAN).** ADR + `backup__*` contract + `BACKUP.md` governance + doc/inventory updates. Buildable and verifiable **today** with zero live infra. Unblocks every service role.
- **Plan 2 — The `backup` role (FUTURE).** `make new-role NAME=backup`: pull orchestrator, restic wrapper, `rclone→pCloud`, retention prune, udev air-gap unit + `restic copy`, systemd timers, ntfy + Uptime-Kuma heartbeat. Built with Molecule render/syntax tests + pytest, the way the `firewall` concern was — buildable now, *functionally* testable only once `fisi` + hosts exist. **Blocked on:** `fisi` provisioned (SATA power cable), `backup_hosts` inventory group, at least one service role declaring `backup__*`.
- **Plan 3 — Live wire-up + restore testing (FUTURE).** Deploy the role, pCloud rclone auth, Uptime Kuma push monitor, Tier-1 restore-verify on `ubongo`, semi-annual Tier-2 DR rehearsal on staging, the printed break-glass runbook + its annual drill. **Blocked on:** Plan 2 deployed, real VMs/staging, services with `VERIFY.md`, Vaultwarden live.
Write Plans 2 and 3 with this same skill when their prerequisites land. Everything below is Plan 1.
---
## Plan 1 file map
| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/022-backup.md` | create | ADR of record; distils the spec's Decisions 1–13 |
| `docs/backup/service-backup-template.md` | create | `BACKUP.md` template; defines the `backup__*` contract shape |
| `.claude/commands/check-backup.md` | create | Dormant verifier (mirrors `check-access.md`) |
| `CLAUDE.md` | modify | Role-conventions: BACKUP.md required for service roles; Further-reading row |
| `docs/security/service-checklist.md` | modify | Strengthen the Operability backup line to the ADR-022 gate |
| `docs/runbooks/new-role.md` | modify | Add the per-service BACKUP.md step (new §12, renumber commit) |
| `docs/hardware/reference.md` | modify | `ubongo` → M70q/1TB; add `fisi` node + capacity row |
| `docs/CAPABILITIES.md` | modify | §9: restic+rclone+USB committed; PBS deferred; ref ADR-022 |
| `STATUS.md` | modify | Add "Designed but not built" rows for backup role + contract |
| `docs/TODO.md` | modify | Mark item 3.8 decided; reference ADR-022 |
**Working branch (all tasks):** AI-driven multi-file change → review as one diff (CLAUDE.md git conventions).
```bash
git checkout -b feat/backup-foundation
```
Before any commit, confirm `rbw unlocked` exits 0 (the pre-commit hook decrypts `vault.yml`); if not, stop and ask the operator to `rbw unlock`.
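The precondition above can be wrapped in a small guard so a commit never starts against a locked vault. This is a minimal sketch, not something the plan adds to the repo: `require_unlocked` is a hypothetical helper name, and the probe command is passed in as arguments (defaulting to `rbw unlocked`) so the guard can be exercised on a machine without `rbw`.

```bash
# Hypothetical helper: run an "is the vault unlocked?" probe and refuse to
# continue if it fails. With no arguments the probe defaults to `rbw unlocked`.
require_unlocked() {
    if [ "$#" -eq 0 ]; then
        set -- rbw unlocked
    fi
    if "$@" >/dev/null 2>&1; then
        echo "vault unlocked: safe to commit"
    else
        echo "vault locked: ask the operator to run 'rbw unlock'" >&2
        return 1
    fi
}
```

Typical use would be `require_unlocked && git commit ...`, so a locked vault stops the commit before the pre-commit hook is reached.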
---
### Task 1: Author ADR-022 and wire the decision into CLAUDE.md / STATUS.md / TODO.md
**Files:**
- Create: `docs/decisions/022-backup.md`
- Modify: `CLAUDE.md` (Further-reading table; role-conventions block)
- Modify: `STATUS.md` ("Designed but not built" table)
- Modify: `docs/TODO.md` (item 3.8)
- [ ] **Step 1: Write `docs/decisions/022-backup.md`**
Mirror the structure of `docs/decisions/021-operational-access.md` (`## Context`, `## Decision`, subsections, `## Consequences`). Transcribe the spec's settled decisions — do not re-derive. The ADR body must state, each as its own labelled decision:
1. **Recovery model A** — data-only restic backups, rebuild-from-code; no PBS in v1 (deferred as Model B/C). (spec Decision 1)
2. **One tier, ~24 h RPO.** (Decision 2)
3. **Engine:** restic (data) + rclone (pCloud off-site); restic encrypts → rclone moves ciphertext only, no second layer. (Decision 3)
4. **Topology:** central off-cluster **pull** node (`fisi`, provisional), 2×8 TB mirror, owns the repo, runs rclone + the USB dock; hosts hold no backup creds. New `backup_hosts` inventory group, `base` role applies. (Decision 4)
5. **3-2-1 mapping** incl. USB air-gap as the immutable backstop. (Decision 5)
6. **Per-service contract:** `backup__*` role vars + required `BACKUP.md`, rendered from the data (the ADR-021 pattern). **Governance reconciliation:** gated via the per-service checklist + new-role runbook + dormant `/check-backup` verifier — **not** an automated lint script (consistent with ADR-021's "runbook+gate, not scaffold" choice). State this explicitly so it supersedes the design doc's "make lint gates its presence" wording. (Decision 6)
7. **Consistency:** logical dumps first (`pg_dump`/`mysqldump`), `quiesce` escape hatch; FS snapshots not the sole DB method. (Decision 7)
8. **Restore testing:** Tier-1 weekly rolling container restore-verify on `ubongo` (reuses `VERIFY.md`); Tier-2 semi-annual full DR rehearsal on staging, ≥1/yr exercises the paper break-glass. `ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged). (Decision 8)
9. **Retention (GFS):** `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`. (Decision 9)
10. **Encryption + escrow + break-glass:** one restic password protects all copies; escrowed to `fisi`(+vault) / Vaultwarden / **paper**; paper holds **both** the restic password **and** the Ansible vault password (breaks the Model-A circular dependency); `mamba` is the break-glass clone (ADR-015). (Decision 10)
11. **USB air-gap:** udev serial-allowlist → `restic copy` to a USB restic repo → `restic check` → ntfy; rotate off-site. (Decision 11)
12. **Failure alerting:** Uptime-Kuma dead-man's-switch + ntfy on failure + weekly `restic check`. (Decision 12)
13. **Schedule.** (Decision 13)
`## Consequences` must note: pCloud is off-site but **sync-coupled** (deletes propagate) → USB is the only immutable copy; `fisi` is the crown-jewel host (full base hardening); pCloud's 1 TB is the off-site capacity ceiling. End with a one-line pointer back to the design doc and to Plans 2–3 as the build path.
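To make the retention policy in Decision 9 concrete, the sketch below prints (it does not run) the `restic forget` invocation those flags imply. The helper name `gfs_forget_cmd`, the per-service `--tag` filter, and the `--prune --dry-run` suffix are assumptions for illustration; the real invocation belongs to the Plan 2 role and may differ.

```bash
# Hypothetical helper: print the GFS prune command for one service.
# Only the four --keep-* values come from ADR-022 Decision 9; the tag filter
# and the --prune --dry-run suffix are illustrative assumptions.
gfs_forget_cmd() {
    printf '%s %s %s\n' \
        "restic forget --tag $1" \
        "--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1" \
        "--prune --dry-run"
}
```

With these values a snapshot group keeps at most 7 + 4 + 6 + 1 = 18 snapshots, and fewer when one snapshot satisfies several rules at once.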
- [ ] **Step 2: Add the Further-reading row in `CLAUDE.md`**
In the Further-reading table, immediately after the `Operational access … 021-operational-access.md` row, add:
```
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
```
- [ ] **Step 3: Add the BACKUP.md role-convention in `CLAUDE.md`**
In the "Role conventions" list, immediately after the `ACCESS.md (ADR-021)` bullet, add:
```
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
data. A stateless service records `backup__state: false` with a reason.
```
- [ ] **Step 4: Add STATUS.md rows**
In the "Designed but not built" table in `STATUS.md`, add two rows:
```
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
```
- [ ] **Step 5: Update TODO item 3.8**
In `docs/TODO.md`, change the item-3.8 line:
From:
```
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
```
To:
```
8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~
DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster
node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via
pCloud + rotated USB air-gap. Build: Plans 2–3.
```
- [ ] **Step 6: Verify**
Run: `make lint`
Expected: PASS (yamllint, ansible-lint, `check-tags: OK …`). No new YAML/tags introduced, so this confirms nothing regressed.
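Run: `grep -n "022-backup" CLAUDE.md && grep -c "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md`
Expected: a match in `CLAUDE.md`, then a non-zero count for each of the three listed files (cross-references resolve). A `0` next to any file means that file was missed.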
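Because `grep` over several files exits 0 as soon as any one of them matches, its exit status alone does not prove that every file carries the reference. A stricter per-file check can be sketched as follows; `all_files_mention` is a hypothetical ad hoc helper, not a script this plan adds.

```bash
# Hypothetical helper: succeed only if EVERY listed file contains the pattern.
# Prints one "MISSING: <file>" line per file that lacks it.
all_files_mention() (
    pattern=$1
    shift
    status=0
    for file in "$@"; do
        if ! grep -q -- "$pattern" "$file" 2>/dev/null; then
            echo "MISSING: $file"
            status=1
        fi
    done
    exit "$status"
)
```

For this step it would be called as `all_files_mention "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md && echo XREF_OK`.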
- [ ] **Step 7: Commit**
```bash
git add docs/decisions/022-backup.md CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO"
```
---
### Task 2: Create the `BACKUP.md` template and define the `backup__*` contract
**Files:**
- Create: `docs/backup/service-backup-template.md`
- [ ] **Step 1: Create the template**
Mirror `docs/access/service-access-template.md` (preamble that says copy-to-role-and-delete; structured tables rendered from data; a hand-written prose tail). Write exactly:
````markdown
# Per-service backup record — template
Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
role (ADR-022). It is the per-service **backup record**: what state the service holds,
how it is captured consistently, and how it is restored. The structured parts are
**rendered from the role's `backup__*` data** (the single source of truth that also
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
than hand-editing the tables. The prose "Restore notes" tail is hand-written.
A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
`backup__state: false` with a reason in its role defaults instead.
Delete this preamble in the copy and start from the heading below.
---
# Backup — <service>
## State captured
Rendered from `backup__*`:
| What | Source | How captured |
|---|---|---|
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |
- **Quiesce:** `<backup__quiesce>`; `true` means the service is stopped → backed up →
restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).
## Restore procedure
1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
3. Replay each `<backup__dumps[*].dest>` into its database.
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).
## Restore notes
Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
known-tricky migrations.
- <none yet>
````
The `backup__*` contract this template renders from (document it here and in the ADR; the role in Plan 2 consumes it):
```yaml
backup__service: <name> # identifier; matches the role / compose project
backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
backup__paths: # bind-mount dirs/files holding state ([] = none)
- /srv/<service>/data
backup__dumps: # logical app-consistent dumps (Decision 7 default; [] = none)
- cmd: "docker compose -p <service> exec -T db pg_dump -U {{ vault.<service>.db_user }} <db>"
dest: <service>-db.sql
backup__quiesce: false # true = stop→back up→restart escape hatch (Decision 7 B)
```
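One plausible reading of a `backup__dumps` entry is that the standard output of `cmd` becomes the file named by `dest`. The sketch below illustrates that reading only; `run_dump` and the staging-directory argument are hypothetical, and the actual consumer is the Plan 2 role. It assumes `cmd` has already been rendered by Ansible, so no `{{ … }}` placeholders remain.

```bash
# Hypothetical helper: capture one logical dump.
#   $1 = staging directory, $2 = dump command (a single string), $3 = dest file name
run_dump() (
    staging=$1
    cmd=$2
    dest=$3
    mkdir -p "$staging" || exit 1
    # Write to a .partial name first so a failed dump never leaves a
    # plausible-looking but truncated file behind.
    if sh -c "$cmd" > "$staging/$dest.partial"; then
        mv "$staging/$dest.partial" "$staging/$dest"
    else
        rm -f "$staging/$dest.partial"
        echo "dump failed: $dest" >&2
        exit 1
    fi
)
```

Writing to a temporary name and renaming on success means the backup only ever picks up complete dumps, which matters once `/check-backup` treats the presence of `dest` as proof of capture.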
- [ ] **Step 2: Verify**
Run: `test -f docs/backup/service-backup-template.md && echo PRESENT`
Expected: `PRESENT`
Run: `make lint`
Expected: PASS (markdown only; confirms no regression).
- [ ] **Step 3: Commit**
```bash
git add docs/backup/service-backup-template.md
git commit -m "docs(backup): add BACKUP.md template + backup__* contract (ADR-022)"
```
---
### Task 3: Strengthen the per-service checklist gate
**Files:**
- Modify: `docs/security/service-checklist.md` (Operability section)
- [ ] **Step 1: Replace the weak backup line with the ADR-022 gate**
In the "Operability (security-adjacent)" section, replace this line:
```
- [ ] Backup/restore is covered if the service holds state
```
with (mirroring the existing ADR-021 access line directly below it):
```
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
`backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
reports the declared paths/dumps captured in the latest snapshot — or the service
sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
```
- [ ] **Step 2: Verify**
Run: `grep -n "ADR-022" docs/security/service-checklist.md`
Expected: one match (the new gate line).
Run: `grep -c "Backup/restore is covered if the service holds state" docs/security/service-checklist.md`
Expected: `0` (old weak line gone).
- [ ] **Step 3: Commit**
```bash
git add docs/security/service-checklist.md
git commit -m "docs(backup): gate BACKUP.md in service checklist (ADR-022)"
```
---
### Task 4: Add the BACKUP.md step to the new-role runbook
**Files:**
- Modify: `docs/runbooks/new-role.md` (insert a new step after the §11 ACCESS step; renumber the commit step)
- [ ] **Step 1: Insert the new step**
Immediately after the §11 "Write the per-service operational-access record" block and before "### 12. Commit", insert:
```markdown
### 12. Write the per-service backup record (stateful services)
For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
`backup__paths`, `backup__dumps` (`cmd` + `dest` per logical dump), and `backup__quiesce`;
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
is rendered from that data. A **stateless** service sets `backup__state: false` with a
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
proves the declared state is captured — part of the service-clearance gate
(`docs/security/service-checklist.md`).
```
- [ ] **Step 2: Renumber the commit step**
Change the heading `### 12. Commit` (now the following heading) to `### 13. Commit`.
- [ ] **Step 3: Verify**
Run: `grep -nE "^### (11|12|13)\." docs/runbooks/new-role.md`
Expected: §11 access, §12 backup, §13 commit — in that order, no duplicate numbers.
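The "no duplicate numbers" expectation can also be checked mechanically rather than by eye. A minimal sketch, assuming the runbook's step headings all follow the `### N. Title` form; `duplicate_step_numbers` is a hypothetical helper name.

```bash
# Hypothetical helper: print every "### N." heading number that occurs more
# than once in the given file. Empty output means the numbering is unique.
duplicate_step_numbers() {
    grep -E '^### [0-9]+\.' "$1" \
        | sed -E 's/^### ([0-9]+)\..*/\1/' \
        | sort -n \
        | uniq -d
}
```

Running `duplicate_step_numbers docs/runbooks/new-role.md` after the renumbering should print nothing; a printed `12` would mean the old commit heading was not bumped to 13.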
- [ ] **Step 4: Commit**
```bash
git add docs/runbooks/new-role.md
git commit -m "docs(backup): add BACKUP.md step to new-role runbook (ADR-022)"
```
---
### Task 5: Create the dormant `/check-backup` verifier command
**Files:**
- Create: `.claude/commands/check-backup.md`
- [ ] **Step 1: Write the command**
Mirror the sibling `.claude/commands/check-access.md` (same frontmatter/sections, same "dormant until infra exists" framing). Write:
````markdown
---
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
---
Verify that a service's **declared** backup data (`backup__*`) is actually captured in
the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
command reports `not-yet-available` rather than failing.
## Preconditions
- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
`not-yet-available` and stop.
## Checks (when live)
Load the `backup__*` data for the resolved role, then:
| Check | How | Green when |
|---|---|---|
| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
| integrity | `restic check --read-data-subset=5%` (sampled; the flag needs a value, subset size illustrative) | no errors |
Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
````
- [ ] **Step 2: Verify**
Run: `test -f .claude/commands/check-backup.md && head -1 .claude/commands/check-backup.md`
Expected: file present, first line `---` (valid frontmatter).
Run: `grep -n "not-yet-available" .claude/commands/check-backup.md`
Expected: matches (dormancy explicit).
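The freshness row in the checks table reduces to a timestamp comparison once the latest snapshot's time is known. The sketch below shows only that comparison; `snapshot_is_fresh` is a hypothetical helper, extracting the timestamp from restic's output is not shown, and the 24 h default is an assumption (the command file says "~24 h", so a caller may want some slack).

```bash
# Hypothetical helper: is a snapshot recent enough to count as fresh?
#   $1 = snapshot time, $2 = current time (both Unix epoch seconds)
#   $3 = optional maximum age in seconds (default 86400 = 24 h)
# Returns 0 when fresh, non-zero when stale or dated in the future.
snapshot_is_fresh() {
    max_age=${3:-86400}
    age=$(( $2 - $1 ))
    [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]
}
```

For example, `snapshot_is_fresh "$snap_epoch" "$(date +%s)" 93600` would allow a 26 h window so a slightly late nightly run does not flip the check to red.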
- [ ] **Step 3: Commit**
```bash
git add .claude/commands/check-backup.md
git commit -m "feat(backup): add dormant /check-backup verifier (ADR-022)"
```
---
### Task 6: Update hardware reference and capabilities
**Files:**
- Modify: `docs/hardware/reference.md` (`ubongo` spec; new `fisi` node; capacity table)
- Modify: `docs/CAPABILITIES.md` (§9 Data & backup)
- [ ] **Step 1: Update the `ubongo` prose block**
In `docs/hardware/reference.md` §1, replace the `ubongo` Storage line target with the real machine:
From:
```
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
```
To:
```
- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
```
- [ ] **Step 2: Add a `fisi` prose block**
After the `ubongo` block in §1, add:
```
### fisi (backup node — outside the cluster; provisional)
- **Model / form factor:** HP Elite 600 G9 (tower)
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
- **RAM:** 16 GB+ (TBD exact)
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
- **NICs:** wired GbE
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
```
- [ ] **Step 3: Update the machine-readable capacity table**
In §4 "Node capacity", change the `ubongo` row disk from `250` to `1000` and add a `fisi` row. Keep the header and integer/decimal format intact (parsed by `capacity-scan.py`):
From:
```
| ubongo | 4 | 16 | 250 |
```
To:
```
| ubongo | 4 | 16 | 1000 |
| fisi | 4 | 16 | 8000 |
```
- [ ] **Step 4: Update CAPABILITIES §9**
In `docs/CAPABILITIES.md` §9 table, replace the three backup rows:
From:
```
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
```
To:
```
| Backup engine | restic (data-only) | S | committed | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
| Off-site target | pCloud (via rclone) | S | committed | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
| Air-gap target | USB hard drives | S | committed | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
```
- [ ] **Step 5: Verify**
Run: `make lint`
Expected: PASS.
Run: `python3 scripts/capacity-scan.py >/dev/null && echo CAPACITY_OK`
Expected: `CAPACITY_OK` (the capacity table headers are still parseable; new `fisi` row accepted).
Run: `grep -n "ADR-022" docs/CAPABILITIES.md`
Expected: three matches (the updated backup rows).
- [ ] **Step 6: Commit**
```bash
git add docs/hardware/reference.md docs/CAPABILITIES.md
git commit -m "docs(backup): update hardware ref (ubongo M70q, add fisi) + CAPABILITIES §9 (ADR-022)"
```
---
### Task 7: Final review and merge
- [ ] **Step 1: Full lint + capacity sanity**
Run: `make lint && python3 scripts/capacity-scan.py >/dev/null && echo ALL_GREEN`
Expected: `ALL_GREEN`.
- [ ] **Step 2: Cross-reference audit**
Run: `grep -rln "ADR-022\|022-backup" CLAUDE.md STATUS.md docs/ .claude/`
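Expected: ADR file, CLAUDE.md, STATUS.md, TODO.md, service-checklist.md, new-role.md, CAPABILITIES.md, check-backup.md, plus service-backup-template.md and hardware/reference.md (both also cite ADR-022), all listed; no dangling reference, no file missed. Files under `docs/superpowers/` may appear as well.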
- [ ] **Step 3: Merge to main and delete the branch**
```bash
git checkout main
git merge --no-ff feat/backup-foundation -m "feat(backup): backup strategy foundation layer (ADR-022)"
git branch -d feat/backup-foundation
git push origin main
```
---
## Self-review (completed by plan author)
- **Spec coverage:** All 13 decisions are recorded in ADR-022 (Task 1, Step 1). The *foundation* obligations of Decisions 6 (contract + BACKUP.md), 7 (dumps-first wording in template/runbook), and the doc/inventory facts (Decisions 4/8 hardware) are implemented as concrete files in Tasks 2–6. Decisions whose *implementation* is live infra, namely 1/3/9/11/12/13 (engine, retention, air-gap mechanism, alerting, schedule) and 8's restore-testing, are explicitly deferred to Plans 2–3 (see *Decomposition & roadmap*), not silently dropped.
- **Placeholder scan:** No "TBD/implement later" steps; every edit shows exact from→to text or full file content. (`<service>`/`<name>` inside template/contract bodies are intentional doc placeholders for the eventual role author, not plan gaps.)
- **Consistency:** `backup__*` field names (`backup__service`, `backup__state`, `backup__paths`, `backup__dumps[].cmd/.dest`, `backup__quiesce`) are identical across the ADR (Task 1), template + contract (Task 2), checklist (Task 3), runbook (Task 4), and `/check-backup` (Task 5). The governance triad matches ADR-021's (template / checklist line / runbook step / dormant verifier), and the "no lint script" choice is stated in both the plan header and the ADR.