boma/docs/superpowers/plans/2026-06-10-backup-strategy.md
sjat 2041bd3b70 docs(backup): add foundation-layer implementation plan (ADR-022)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:05:17 +02:00

# Backup & DR Strategy — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Land the *foundation layer* of the backup strategy: ADR-022, the per-service `backup__*` data contract + `BACKUP.md` governance triad (template + checklist gate + runbook step + dormant verifier), and the doc/inventory updates. The aim is that every future service role is born backup-aware, before any live infrastructure exists.
**Architecture:** This is the first of three sequenced plans (see *Decomposition & roadmap* below). It is **doc/governance only** — no Ansible role, no live restic/rclone, no host contact. It mirrors exactly how ADR-021 delivered operational-access governance: a template under `docs/<concern>/`, one line in `docs/security/service-checklist.md`, a step in `docs/runbooks/new-role.md`, and a *dormant* verifier command (`/check-access` → here `/check-backup`). boma deliberately gates these per-service docs via checklist+runbook, **not** an automated lint script — so this plan adds **no** `scripts/check-*.py`. (This reconciles the design doc's casual "make lint gates its presence" phrasing with boma's actual governance choice; the ADR records the reconciliation.)
**Tech Stack:** Markdown docs, Ansible role-var conventions (`backup__*`, double-underscore namespace per CLAUDE.md), `make lint` (yamllint + ansible-lint + `check-tags.py`) as the only automated gate, `git` trunk-based on a feature branch.
**Source spec:** `docs/superpowers/specs/2026-06-10-backup-strategy-design.md` (Decisions 1–13 referenced by number throughout).
---
## Decomposition & roadmap
The full spec spans three subsystems with hard ordering dependencies (STATUS.md: no service roles exist, `fisi` unprovisioned, Terraform never `init`ed, no staging cluster, no Uptime Kuma/pCloud). Each becomes its own plan and produces working, testable software on its own:
- **Plan 1 — Foundation (THIS PLAN).** ADR + `backup__*` contract + `BACKUP.md` governance + doc/inventory updates. Buildable and verifiable **today** with zero live infra. Unblocks every service role.
- **Plan 2 — The `backup` role (FUTURE).** `make new-role NAME=backup`: pull orchestrator, restic wrapper, `rclone→pCloud`, retention prune, udev air-gap unit + `restic copy`, systemd timers, ntfy + Uptime-Kuma heartbeat. Built with Molecule render/syntax tests + pytest, the way the `firewall` concern was — buildable now, *functionally* testable only once `fisi` + hosts exist. **Blocked on:** `fisi` provisioned (SATA power cable), `backup_hosts` inventory group, at least one service role declaring `backup__*`.
- **Plan 3 — Live wire-up + restore testing (FUTURE).** Deploy the role, pCloud rclone auth, Uptime Kuma push monitor, Tier-1 restore-verify on `ubongo`, semi-annual Tier-2 DR rehearsal on staging, the printed break-glass runbook + its annual drill. **Blocked on:** Plan 2 deployed, real VMs/staging, services with `VERIFY.md`, Vaultwarden live.
Write Plans 2 and 3 with this same skill when their prerequisites land. Everything below is Plan 1.
---
## Plan 1 file map
| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/022-backup.md` | create | ADR of record; distils the spec's Decisions 1–13 |
| `docs/backup/service-backup-template.md` | create | `BACKUP.md` template; defines the `backup__*` contract shape |
| `.claude/commands/check-backup.md` | create | Dormant verifier (mirrors `check-access.md`) |
| `CLAUDE.md` | modify | Role-conventions: BACKUP.md required for service roles; Further-reading row |
| `docs/security/service-checklist.md` | modify | Strengthen the Operability backup line to the ADR-022 gate |
| `docs/runbooks/new-role.md` | modify | Add the per-service BACKUP.md step (new §12, renumber commit) |
| `docs/hardware/reference.md` | modify | `ubongo` → M70q/1TB; add `fisi` node + capacity row |
| `docs/CAPABILITIES.md` | modify | §9: restic+rclone+USB committed; PBS deferred; ref ADR-022 |
| `STATUS.md` | modify | Add "Designed but not built" rows for backup role + contract |
| `docs/TODO.md` | modify | Mark item 3.8 decided; reference ADR-022 |
**Working branch (all tasks):** AI-driven multi-file change → review as one diff (CLAUDE.md git conventions).
```bash
git checkout -b feat/backup-foundation
```
Before any commit, confirm `rbw unlocked` exits 0 (the pre-commit hook decrypts `vault.yml`); if not, stop and ask the operator to `rbw unlock`.
---
### Task 1: Author ADR-022 and wire the decision into CLAUDE.md / STATUS.md / TODO.md
**Files:**
- Create: `docs/decisions/022-backup.md`
- Modify: `CLAUDE.md` (Further-reading table; role-conventions block)
- Modify: `STATUS.md` ("Designed but not built" table)
- Modify: `docs/TODO.md` (item 3.8)
- [ ] **Step 1: Write `docs/decisions/022-backup.md`**
Mirror the structure of `docs/decisions/021-operational-access.md` (`## Context`, `## Decision`, subsections, `## Consequences`). Transcribe the spec's settled decisions — do not re-derive. The ADR body must state, each as its own labelled decision:
1. **Recovery model A** — data-only restic backups, rebuild-from-code; no PBS in v1 (deferred as Model B/C). (spec Decision 1)
2. **One tier, ~24 h RPO.** (Decision 2)
3. **Engine:** restic (data) + rclone (pCloud off-site); restic encrypts → rclone moves ciphertext only, no second layer. (Decision 3)
4. **Topology:** central off-cluster **pull** node (`fisi`, provisional), 2×8 TB mirror, owns the repo, runs rclone + the USB dock; hosts hold no backup creds. New `backup_hosts` inventory group, `base` role applies. (Decision 4)
5. **3-2-1 mapping** incl. USB air-gap as the immutable backstop. (Decision 5)
6. **Per-service contract:** `backup__*` role vars + required `BACKUP.md`, rendered from the data (the ADR-021 pattern). **Governance reconciliation:** gated via the per-service checklist + new-role runbook + dormant `/check-backup` verifier — **not** an automated lint script (consistent with ADR-021's "runbook+gate, not scaffold" choice). State this explicitly so it supersedes the design doc's "make lint gates its presence" wording. (Decision 6)
7. **Consistency:** logical dumps first (`pg_dump`/`mysqldump`), `quiesce` escape hatch; FS snapshots not the sole DB method. (Decision 7)
8. **Restore testing:** Tier-1 weekly rolling container restore-verify on `ubongo` (reuses `VERIFY.md`); Tier-2 semi-annual full DR rehearsal on staging, ≥1/yr exercises the paper break-glass. `ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged). (Decision 8)
9. **Retention (GFS):** `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`. (Decision 9)
10. **Encryption + escrow + break-glass:** one restic password protects all copies; escrowed to `fisi`(+vault) / Vaultwarden / **paper**; paper holds **both** the restic password **and** the Ansible vault password (breaks the Model-A circular dependency); `mamba` is the break-glass clone (ADR-015). (Decision 10)
11. **USB air-gap:** udev serial-allowlist → `restic copy` to a USB restic repo → `restic check` → ntfy; rotate off-site. (Decision 11)
12. **Failure alerting:** Uptime-Kuma dead-man's-switch + ntfy on failure + weekly `restic check`. (Decision 12)
13. **Schedule.** (Decision 13)
`## Consequences` must note: pCloud is off-site but **sync-coupled** (deletes propagate) → USB is the only immutable copy; `fisi` is the crown-jewel host (full base hardening); pCloud's 1 TB is the off-site capacity ceiling. End with a one-line pointer back to the design doc and to Plans 2–3 as the build path.
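To sanity-check the retention wording in decision 9 before it goes into the ADR, the keep rules can be modelled in a few lines. This is a simplified approximation of a keep-daily/weekly/monthly/yearly policy, not restic's own implementation; `select_kept`, the bucket keys, and the end date are invented for illustration.

```python
from datetime import date, timedelta


def select_kept(snapshot_days, daily=7, weekly=4, monthly=6, yearly=1):
    """Return the set of snapshot dates a GFS-style policy would keep.

    Simplified model (assumption): for each rule, walk snapshots newest
    first and keep the newest snapshot in each of the last N distinct
    buckets (day, ISO week, month, year) that contain a snapshot.
    """
    rules = [
        (daily, lambda d: d.toordinal()),
        (weekly, lambda d: tuple(d.isocalendar())[:2]),
        (monthly, lambda d: (d.year, d.month)),
        (yearly, lambda d: d.year),
    ]
    ordered = sorted(set(snapshot_days), reverse=True)
    kept = set()
    for limit, bucket_of in rules:
        seen = set()
        for day in ordered:
            if len(seen) >= limit:
                break
            bucket = bucket_of(day)
            if bucket not in seen:
                seen.add(bucket)
                kept.add(day)
    return kept


# One snapshot per night for 400 nights, ending on a hypothetical date.
last = date(2026, 6, 10)
nightly = [last - timedelta(days=n) for n in range(400)]
kept = select_kept(nightly)
```

Because the rules overlap (the newest snapshot satisfies all four), the kept set is smaller than the sum of the four limits. That is worth keeping in mind when estimating repo size against the pCloud ceiling.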
- [ ] **Step 2: Add the Further-reading row in `CLAUDE.md`**
In the Further-reading table, immediately after the `Operational access … 021-operational-access.md` row, add:
```
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
```
- [ ] **Step 3: Add the BACKUP.md role-convention in `CLAUDE.md`**
In the "Role conventions" list, immediately after the `ACCESS.md (ADR-021)` bullet, add:
```
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
data. A stateless service records `backup__state: false` with a reason.
```
- [ ] **Step 4: Add STATUS.md rows**
In the "Designed but not built" table in `STATUS.md`, add two rows:
```
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
```
- [ ] **Step 5: Update TODO item 3.8**
In `docs/TODO.md`, change the item-3.8 line:
From:
```
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
```
To:
```
8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~
DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster
node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via
pCloud + rotated USB air-gap. Build: Plans 2–3.
```
- [ ] **Step 6: Verify**
Run: `make lint`
Expected: PASS (yamllint, ansible-lint, `check-tags: OK …`). No new YAML/tags introduced, so this confirms nothing regressed.
Run: `grep -n "022-backup" CLAUDE.md && grep -rn "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md`
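Expected: the first `grep` prints the new CLAUDE.md row, and the second prints at least one match line prefixed by each of the three file names. An exit status of 0 only proves that one file matched, so read the output rather than the status (cross-references resolve).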
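Since `grep` over several files succeeds as soon as any one of them matches, a per-file check makes the "every listed file" expectation explicit. The following is a minimal sketch for ad-hoc use, not a new lint gate; `files_missing` is a hypothetical helper, and the demo runs on throwaway files rather than the real repo.

```python
import re
import tempfile
from pathlib import Path


def files_missing(pattern, paths):
    """Return the paths whose text does not match `pattern` (or that do not exist)."""
    regex = re.compile(pattern)
    missing = []
    for path in map(Path, paths):
        if not path.is_file() or not regex.search(path.read_text(encoding="utf-8")):
            missing.append(str(path))
    return missing


# Demo on temporary files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "STATUS.md")
    bad = Path(tmp, "TODO.md")
    good.write_text("| Backup role | ADR-022 | Does not exist. |\n", encoding="utf-8")
    bad.write_text("8. Ensure the right things are backed up.\n", encoding="utf-8")
    demo_missing = files_missing(r"ADR-022", [good, bad])
```

In the repo this would be called with the three paths from the `grep` above; an empty list means every file carries the reference.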
- [ ] **Step 7: Commit**
```bash
git add docs/decisions/022-backup.md CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO"
```
---
### Task 2: Create the `BACKUP.md` template and define the `backup__*` contract
**Files:**
- Create: `docs/backup/service-backup-template.md`
- [ ] **Step 1: Create the template**
Mirror `docs/access/service-access-template.md` (preamble that says copy-to-role-and-delete; structured tables rendered from data; a hand-written prose tail). Write exactly:
````markdown
# Per-service backup record — template
Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
role (ADR-022). It is the per-service **backup record**: what state the service holds,
how it is captured consistently, and how it is restored. The structured parts are
**rendered from the role's `backup__*` data** (the single source of truth that also
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
than hand-editing the tables. The prose "Restore notes" tail is hand-written.
A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
`backup__state: false` with a reason in its role defaults instead.
Delete this preamble in the copy and start from the heading below.
---
# Backup — <service>
## State captured
Rendered from `backup__*`:
| What | Source | How captured |
|---|---|---|
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |
- **Quiesce:** `<backup__quiesce>` — `true` means the service is stopped → backed up →
restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).
## Restore procedure
1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
3. Replay each `<backup__dumps[*].dest>` into its database.
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).
## Restore notes
Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
known-tricky migrations.
- <none yet>
````
The `backup__*` contract this template renders from (document it here and in the ADR; the role in Plan 2 consumes it):
```yaml
backup__service: <name>   # identifier; matches the role / compose project
backup__state: true       # false = stateless → no BACKUP.md (pair with a reason)
backup__paths:            # bind-mount dirs/files holding state ([] = none)
  - /srv/<service>/data
backup__dumps:            # logical app-consistent dumps (Decision 7 default; [] = none)
  - cmd: "docker compose -p <service> exec -T db pg_dump -U {{ vault.<service>.db_user }} <db>"
    dest: <service>-db.sql
backup__quiesce: false    # true = stop→back up→restart escape hatch (Decision 7 B)
```
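To make "rendered from the data" concrete, here is a minimal sketch of how the "State captured" table could be produced from a `backup__*` mapping. The renderer itself belongs to later work and is not specified by this plan; `render_state_table` and the `wiki` example values are assumptions for illustration only.

```python
def render_state_table(contract):
    """Render the 'State captured' table from a backup__* mapping.

    Returns None for a stateless role, which gets no BACKUP.md.
    """
    if contract.get("backup__state", True) is False:
        return None
    lines = ["| What | Source | How captured |", "|---|---|---|"]
    for path in contract.get("backup__paths", []):
        lines.append(f"| data dir(s) | `{path}` | file-level, pulled read-only |")
    for dump in contract.get("backup__dumps", []):
        lines.append(f"| database | `{dump['cmd']}` → `{dump['dest']}` | logical dump |")
    return "\n".join(lines)


# Hypothetical service used only to exercise the sketch.
example = {
    "backup__service": "wiki",
    "backup__state": True,
    "backup__paths": ["/srv/wiki/data"],
    "backup__dumps": [{"cmd": "pg_dump wiki", "dest": "wiki-db.sql"}],
    "backup__quiesce": False,
}
table = render_state_table(example)
```

Keeping the table a pure function of the data is what lets the verifier and `BACKUP.md` stay in agreement: both read the same mapping.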
- [ ] **Step 2: Verify**
Run: `test -f docs/backup/service-backup-template.md && echo PRESENT`
Expected: `PRESENT`
Run: `make lint`
Expected: PASS (markdown only; confirms no regression).
- [ ] **Step 3: Commit**
```bash
git add docs/backup/service-backup-template.md
git commit -m "docs(backup): add BACKUP.md template + backup__* contract (ADR-022)"
```
---
### Task 3: Strengthen the per-service checklist gate
**Files:**
- Modify: `docs/security/service-checklist.md` (Operability section)
- [ ] **Step 1: Replace the weak backup line with the ADR-022 gate**
In the "Operability (security-adjacent)" section, replace this line:
```
- [ ] Backup/restore is covered if the service holds state
```
with (mirroring the existing ADR-021 access line directly below it):
```
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
`backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
reports the declared paths/dumps captured in the latest snapshot — or the service
sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
```
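The static half of this gate (everything except the dormant `/check-backup` snapshot check) reduces to a few conditions. The sketch below spells them out for a reviewer working through the checklist by hand; it is not an automated gate, and `gate_problems` and the `backup__state_reason` key are hypothetical, since the plan asks for "a reason" without naming a field.

```python
def gate_problems(contract, backup_md_exists):
    """List the reasons a role would fail the static part of the backup gate.

    `contract` is the role's backup__* mapping; `backup_md_exists` says
    whether roles/<service>/BACKUP.md is present.
    """
    problems = []
    if contract.get("backup__state", True) is False:
        # Stateless branch: a reason is required and no BACKUP.md is expected.
        if not contract.get("backup__state_reason"):  # hypothetical key
            problems.append("stateless role gives no reason")
        if backup_md_exists:
            problems.append("stateless role should not carry BACKUP.md")
        return problems
    if not contract.get("backup__service"):
        problems.append("backup__service missing")
    for dump in contract.get("backup__dumps", []):
        if not dump.get("cmd") or not dump.get("dest"):
            problems.append("dump entry needs both cmd and dest")
    if not backup_md_exists:
        problems.append("BACKUP.md not rendered")
    return problems


stateful_ok = gate_problems(
    {"backup__service": "wiki", "backup__paths": ["/srv/wiki/data"], "backup__dumps": []},
    backup_md_exists=True,
)
stateless_bad = gate_problems({"backup__state": False}, backup_md_exists=False)
```

An empty list means the static conditions hold; anything else is either fixed or recorded in `docs/security/accepted-risks.md`.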
- [ ] **Step 2: Verify**
Run: `grep -n "ADR-022" docs/security/service-checklist.md`
Expected: one match (the new gate line).
Run: `grep -c "Backup/restore is covered if the service holds state" docs/security/service-checklist.md`
Expected: `0` (old weak line gone).
- [ ] **Step 3: Commit**
```bash
git add docs/security/service-checklist.md
git commit -m "docs(backup): gate BACKUP.md in service checklist (ADR-022)"
```
---
### Task 4: Add the BACKUP.md step to the new-role runbook
**Files:**
- Modify: `docs/runbooks/new-role.md` (insert a new step after the §11 ACCESS step; renumber the commit step)
- [ ] **Step 1: Insert the new step**
Immediately after the §11 "Write the per-service operational-access record" block and before "### 12. Commit", insert:
```markdown
### 12. Write the per-service backup record (stateful services)
For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
`backup__paths`, `backup__dumps` — `cmd` + `dest` per logical dump — and `backup__quiesce`;
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
is rendered from that data. A **stateless** service sets `backup__state: false` with a
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
proves the declared state is captured — part of the service-clearance gate
(`docs/security/service-checklist.md`).
```
- [ ] **Step 2: Renumber the commit step**
Change the heading `### 12. Commit` (now the following heading) to `### 13. Commit`.
- [ ] **Step 3: Verify**
Run: `grep -nE "^### (11|12|13)\." docs/runbooks/new-role.md`
Expected: §11 access, §12 backup, §13 commit — in that order, no duplicate numbers.
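If eyeballing the `grep` output feels error-prone, the same check can be expressed as a small script. This is a sketch for one-off use; `numbered_headings` and `is_sequential` are illustrative names, and the sample text is modelled on the three headings named in this task.

```python
import re


def numbered_headings(markdown):
    """Return the integers from headings of the form '### N. Title', in order."""
    return [int(m.group(1)) for m in re.finditer(r"^### (\d+)\.", markdown, re.MULTILINE)]


def is_sequential(numbers):
    """True when numbers rise by exactly one, with no gaps or duplicates."""
    return all(b == a + 1 for a, b in zip(numbers, numbers[1:]))


sample = (
    "### 11. Write the per-service operational-access record\n"
    "### 12. Write the per-service backup record (stateful services)\n"
    "### 13. Commit\n"
)
found = numbered_headings(sample)
```

Running it over the whole runbook also catches a stale number earlier in the file, which the `(11|12|13)` pattern would not.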
- [ ] **Step 4: Commit**
```bash
git add docs/runbooks/new-role.md
git commit -m "docs(backup): add BACKUP.md step to new-role runbook (ADR-022)"
```
---
### Task 5: Create the dormant `/check-backup` verifier command
**Files:**
- Create: `.claude/commands/check-backup.md`
- [ ] **Step 1: Write the command**
Mirror the sibling `.claude/commands/check-access.md` (same frontmatter/sections, same "dormant until infra exists" framing). Write:
````markdown
---
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
---
Verify that a service's **declared** backup data (`backup__*`) is actually captured in
the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
command reports `not-yet-available` rather than failing.
## Preconditions
- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
`not-yet-available` and stop.
## Checks (when live)
Load the `backup__*` data for the resolved role, then:
| Check | How | Green when |
|---|---|---|
| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
| integrity | `restic check --read-data-subset=<subset>` (sampled, e.g. `5%`; the flag requires a value) | no errors |
Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
````
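The freshness row and the dormancy rule can be sketched together. The code below assumes the snapshot list comes from `restic snapshots --json` and relies only on each entry's `time` field; `check_freshness`, the exact 24-hour threshold, and the sample entry are assumptions, and how the list is fetched from `fisi` is left open.

```python
import json
import re
from datetime import datetime, timedelta, timezone


def parse_restic_time(value):
    """Parse an RFC 3339 timestamp, normalising the fraction to microseconds."""
    value = value.replace("Z", "+00:00")
    # Pad or trim fractional seconds to six digits for datetime.fromisoformat.
    value = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), value, count=1)
    return datetime.fromisoformat(value)


def check_freshness(snapshots, now, max_age=timedelta(hours=24)):
    """Return 'not-yet-available', 'pass', or 'fail' for the freshness check.

    `snapshots` is None when the backup repo cannot be queried (dormant),
    otherwise a list of snapshot objects for the service's tag.
    """
    if snapshots is None:
        return "not-yet-available"
    if not snapshots:
        return "fail"
    newest = max(parse_restic_time(s["time"]) for s in snapshots)
    return "pass" if now - newest <= max_age else "fail"


# Hypothetical snapshot entry for the example service named above.
sample = json.loads(
    '[{"time": "2026-06-10T02:00:00.123456789+02:00", "tags": ["nextcloud"]}]'
)
now = datetime(2026, 6, 10, 9, 0, tzinfo=timezone.utc)
status = check_freshness(sample, now)
```

Treating "repo unreachable" (`None`) separately from "repo reachable but empty" (`[]`) is what keeps the command dormant today without masking a genuinely missing backup later.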
- [ ] **Step 2: Verify**
Run: `test -f .claude/commands/check-backup.md && head -1 .claude/commands/check-backup.md`
Expected: file present, first line `---` (valid frontmatter).
Run: `grep -n "not-yet-available" .claude/commands/check-backup.md`
Expected: matches (dormancy explicit).
- [ ] **Step 3: Commit**
```bash
git add .claude/commands/check-backup.md
git commit -m "feat(backup): add dormant /check-backup verifier (ADR-022)"
```
---
### Task 6: Update hardware reference and capabilities
**Files:**
- Modify: `docs/hardware/reference.md` (`ubongo` spec; new `fisi` node; capacity table)
- Modify: `docs/CAPABILITIES.md` (§9 Data & backup)
- [ ] **Step 1: Update the `ubongo` prose block**
In `docs/hardware/reference.md` §1, replace the placeholder `ubongo` Storage line with the real machine's spec:
From:
```
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
```
To:
```
- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
```
- [ ] **Step 2: Add a `fisi` prose block**
After the `ubongo` block in §1, add:
```
### fisi (backup node — outside the cluster; provisional)
- **Model / form factor:** HP Elite 600 G9 (tower)
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
- **RAM:** 16 GB+ (TBD exact)
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
- **NICs:** wired GbE
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
```
- [ ] **Step 3: Update the machine-readable capacity table**
In §4 "Node capacity", change the `ubongo` row disk from `250` to `1000` and add a `fisi` row. Keep the header and integer/decimal format intact (parsed by `capacity-scan.py`):
From:
```
| ubongo | 4 | 16 | 250 |
```
To:
```
| ubongo | 4 | 16 | 1000 |
| fisi | 4 | 16 | 8000 |
```
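Why the number format matters can be seen from how a table like this is typically parsed. The sketch below is not `capacity-scan.py` (its internals are not shown in this plan); it assumes four columns in the order node, CPU, RAM, disk, and the header row in the sample is hypothetical.

```python
def parse_capacity_rows(markdown):
    """Parse '| node | cpu | ram | disk |' rows into a dict keyed by node.

    Assumption: four columns in that order. Header and separator rows are
    skipped because their cells do not parse as numbers.
    """
    nodes = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue
        try:
            cpu, ram, disk = (float(c) for c in cells[1:])
        except ValueError:
            continue
        nodes[cells[0]] = {"cpu": cpu, "ram": ram, "disk": disk}
    return nodes


sample = """\
| Node | CPU | RAM | Disk |
|---|---|---|---|
| ubongo | 4 | 16 | 1000 |
| fisi | 4 | 16 | 8000 |
"""
nodes = parse_capacity_rows(sample)
```

A cell such as `1 TB` or `8000 GB` would fail the numeric conversion and the row would be dropped silently in a parser like this one, which is why the rows above use bare numbers.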
- [ ] **Step 4: Update CAPABILITIES §9**
In `docs/CAPABILITIES.md` §9 table, replace the three backup rows:
From:
```
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
```
To:
```
| Backup engine | restic (data-only) | S | committed | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
| Off-site target | pCloud (via rclone) | S | committed | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
| Air-gap target | USB hard drives | S | committed | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
```
- [ ] **Step 5: Verify**
Run: `make lint`
Expected: PASS.
Run: `python3 scripts/capacity-scan.py >/dev/null && echo CAPACITY_OK`
Expected: `CAPACITY_OK` (the capacity table headers are still parseable; new `fisi` row accepted).
Run: `grep -n "ADR-022" docs/CAPABILITIES.md`
Expected: three matches (the updated backup rows).
- [ ] **Step 6: Commit**
```bash
git add docs/hardware/reference.md docs/CAPABILITIES.md
git commit -m "docs(backup): update hardware ref (ubongo M70q, add fisi) + CAPABILITIES §9 (ADR-022)"
```
---
### Task 7: Final review and merge
- [ ] **Step 1: Full lint + capacity sanity**
Run: `make lint && python3 scripts/capacity-scan.py >/dev/null && echo ALL_GREEN`
Expected: `ALL_GREEN`.
- [ ] **Step 2: Cross-reference audit**
Run: `grep -rln "ADR-022\|022-backup" CLAUDE.md STATUS.md docs/ .claude/`
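Expected: the ADR file, CLAUDE.md, STATUS.md, TODO.md, service-checklist.md, new-role.md, CAPABILITIES.md, check-backup.md, service-backup-template.md, and hardware/reference.md are all listed (this plan under `docs/superpowers/` will also appear). No dangling reference, no file missed.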
- [ ] **Step 3: Merge to main and delete the branch**
```bash
git checkout main
git merge --no-ff feat/backup-foundation -m "feat(backup): backup strategy foundation layer (ADR-022)"
git branch -d feat/backup-foundation
git push origin main
```
---
## Self-review (completed by plan author)
- **Spec coverage:** All 13 decisions are recorded in ADR-022 (Task 1, Step 1). The *foundation* obligations of Decisions 6 (contract + BACKUP.md), 7 (dumps-first wording in template/runbook), and the doc/inventory facts (Decisions 4/8 hardware) are implemented as concrete files in Tasks 2–6. Decisions whose *implementation* is live infra, namely 1/3/9/11/12/13 (engine, retention, air-gap mechanism, alerting, schedule) and 8's restore-testing, are explicitly deferred to Plans 2–3 (see *Decomposition & roadmap*), not silently dropped.
- **Placeholder scan:** No "TBD/implement later" steps; every edit shows exact from→to text or full file content. (`<service>`/`<name>` inside template/contract bodies are intentional doc placeholders for the eventual role author, not plan gaps.)
- **Consistency:** `backup__*` field names (`backup__service`, `backup__state`, `backup__paths`, `backup__dumps[].cmd/.dest`, `backup__quiesce`) are identical across the ADR (Task 1), template + contract (Task 2), checklist (Task 3), runbook (Task 4), and `/check-backup` (Task 5). The governance triad matches ADR-021's (template / checklist line / runbook step / dormant verifier), and the "no lint script" choice is stated in both the plan header and the ADR.