boma/docs/superpowers/plans/2026-06-10-backup-strategy.md
sjat 2041bd3b70 docs(backup): add foundation-layer implementation plan (ADR-022)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:05:17 +02:00

Backup & DR Strategy — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Land the foundation layer of the backup strategy — ADR-022, the per-service backup__* data contract + BACKUP.md governance triad (template + checklist gate + runbook step + dormant verifier), and the doc/inventory updates — so every future service role is born backup-aware, before any live infrastructure exists.

Architecture: This is the first of three sequenced plans (see Decomposition & roadmap below). It is doc/governance only — no Ansible role, no live restic/rclone, no host contact. It mirrors exactly how ADR-021 delivered operational-access governance: a template under docs/<concern>/, one line in docs/security/service-checklist.md, a step in docs/runbooks/new-role.md, and a dormant verifier command (/check-access → here /check-backup). boma deliberately gates these per-service docs via checklist+runbook, not an automated lint script — so this plan adds no scripts/check-*.py. (This reconciles the design doc's casual "make lint gates its presence" phrasing with boma's actual governance choice; the ADR records the reconciliation.)

Tech Stack: Markdown docs, Ansible role-var conventions (backup__*, double-underscore namespace per CLAUDE.md), make lint (yamllint + ansible-lint + check-tags.py) as the only automated gate, git trunk-based on a feature branch.

Source spec: docs/superpowers/specs/2026-06-10-backup-strategy-design.md (Decisions 1–13 referenced by number throughout).


Decomposition & roadmap

The full spec spans three subsystems with hard ordering dependencies (STATUS.md: no service roles exist, fisi unprovisioned, Terraform never inited, no staging cluster, no Uptime Kuma/pCloud). Each becomes its own plan and produces working, testable software on its own:

  • Plan 1 — Foundation (THIS PLAN). ADR + backup__* contract + BACKUP.md governance + doc/inventory updates. Buildable and verifiable today with zero live infra. Unblocks every service role.
  • Plan 2 — The backup role (FUTURE). make new-role NAME=backup: pull orchestrator, restic wrapper, rclone→pCloud, retention prune, udev air-gap unit + restic copy, systemd timers, ntfy + Uptime-Kuma heartbeat. Built with Molecule render/syntax tests + pytest, the way the firewall concern was — buildable now, functionally testable only once fisi + hosts exist. Blocked on: fisi provisioned (SATA power cable), backup_hosts inventory group, at least one service role declaring backup__*.
  • Plan 3 — Live wire-up + restore testing (FUTURE). Deploy the role, pCloud rclone auth, Uptime Kuma push monitor, Tier-1 restore-verify on ubongo, semi-annual Tier-2 DR rehearsal on staging, the printed break-glass runbook + its annual drill. Blocked on: Plan 2 deployed, real VMs/staging, services with VERIFY.md, Vaultwarden live.

Write Plans 2 and 3 with this same skill when their prerequisites land. Everything below is Plan 1.


Plan 1 file map

| File | Action | Responsibility |
|---|---|---|
| docs/decisions/022-backup.md | create | ADR of record; distils the spec's Decisions 1–13 |
| docs/backup/service-backup-template.md | create | BACKUP.md template; defines the backup__* contract shape |
| .claude/commands/check-backup.md | create | Dormant verifier (mirrors check-access.md) |
| CLAUDE.md | modify | Role-conventions: BACKUP.md required for service roles; Further-reading row |
| docs/security/service-checklist.md | modify | Strengthen the Operability backup line to the ADR-022 gate |
| docs/runbooks/new-role.md | modify | Add the per-service BACKUP.md step (new §12, renumber commit) |
| docs/hardware/reference.md | modify | ubongo → M70q/1TB; add fisi node + capacity row |
| docs/CAPABILITIES.md | modify | §9: restic+rclone+USB committed; PBS deferred; ref ADR-022 |
| STATUS.md | modify | Add "Designed but not built" rows for backup role + contract |
| docs/TODO.md | modify | Mark item 3.8 decided; reference ADR-022 |

Working branch (all tasks): AI-driven multi-file change → review as one diff (CLAUDE.md git conventions).

git checkout -b feat/backup-foundation

Before any commit, confirm rbw unlocked exits 0 (the pre-commit hook decrypts vault.yml); if not, stop and ask the operator to rbw unlock.


Task 1: Author ADR-022 and wire the decision into CLAUDE.md / STATUS.md / TODO.md

Files:

  • Create: docs/decisions/022-backup.md

  • Modify: CLAUDE.md (Further-reading table; role-conventions block)

  • Modify: STATUS.md ("Designed but not built" table)

  • Modify: docs/TODO.md (item 3.8)

  • Step 1: Write docs/decisions/022-backup.md

Mirror the structure of docs/decisions/021-operational-access.md (## Context, ## Decision, subsections, ## Consequences). Transcribe the spec's settled decisions — do not re-derive. The ADR body must state, each as its own labelled decision:

  1. Recovery model A — data-only restic backups, rebuild-from-code; no PBS in v1 (deferred as Model B/C). (spec Decision 1)
  2. One tier, ~24 h RPO. (Decision 2)
  3. Engine: restic (data) + rclone (pCloud off-site); restic encrypts → rclone moves ciphertext only, no second layer. (Decision 3)
  4. Topology: central off-cluster pull node (fisi, provisional), 2×8 TB mirror, owns the repo, runs rclone + the USB dock; hosts hold no backup creds. New backup_hosts inventory group, base role applies. (Decision 4)
  5. 3-2-1 mapping incl. USB air-gap as the immutable backstop. (Decision 5)
  6. Per-service contract: backup__* role vars + required BACKUP.md, rendered from the data (the ADR-021 pattern). Governance reconciliation: gated via the per-service checklist + new-role runbook + dormant /check-backup verifier — not an automated lint script (consistent with ADR-021's "runbook+gate, not scaffold" choice). State this explicitly so it supersedes the design doc's "make lint gates its presence" wording. (Decision 6)
  7. Consistency: logical dumps first (pg_dump/mysqldump), quiesce escape hatch; FS snapshots not the sole DB method. (Decision 7)
  8. Restore testing: Tier-1 weekly rolling container restore-verify on ubongo (reuses VERIFY.md); Tier-2 semi-annual full DR rehearsal on staging, ≥1/yr exercises the paper break-glass. ubongo stays bare Debian, not a hypervisor (ADR-015 unchanged). (Decision 8)
  9. Retention (GFS): --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1. (Decision 9)
  10. Encryption + escrow + break-glass: one restic password protects all copies; escrowed to fisi(+vault) / Vaultwarden / paper; paper holds both the restic password and the Ansible vault password (breaks the Model-A circular dependency); mamba is the break-glass clone (ADR-015). (Decision 10)
  11. USB air-gap: udev serial-allowlist → restic copy to a USB restic repo → restic check → ntfy; rotate off-site. (Decision 11)
  12. Failure alerting: Uptime-Kuma dead-man's-switch + ntfy on failure + weekly restic check. (Decision 12)
  13. Schedule. (Decision 13)

## Consequences must note: pCloud is off-site but sync-coupled (deletes propagate) → USB is the only immutable copy; fisi is the crown-jewel host (full base hardening); pCloud's 1 TB is the off-site capacity ceiling. End with a one-line pointer back to the design doc and to Plans 2–3 as the build path.
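
To make the retention decision (item 9) concrete for the ADR reader, the sketch below models what a `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1` policy keeps from a run of nightly snapshots. It is a simplified model of calendar-bucket retention written for illustration, not restic's own implementation; the function name `gfs_keep` and the `RULES` table are hypothetical.

```python
from datetime import datetime, timedelta

# (rule name, how many buckets to keep, bucket key for a snapshot time).
# Counts are the ones listed in Decision 9.
RULES = [
    ("daily", 7, lambda t: t.strftime("%Y-%m-%d")),
    ("weekly", 4, lambda t: "%d-W%02d" % tuple(t.isocalendar())[:2]),
    ("monthly", 6, lambda t: t.strftime("%Y-%m")),
    ("yearly", 1, lambda t: t.strftime("%Y")),
]


def gfs_keep(snapshots):
    """Return the set of snapshot times a GFS policy would keep.

    Simplified model: for each rule, walk the snapshots newest-first and
    keep the newest snapshot in each of the first N distinct buckets.
    A snapshot survives if any rule keeps it.
    """
    kept = set()
    ordered = sorted(snapshots, reverse=True)
    for _name, count, bucket in RULES:
        seen = []
        for snap in ordered:
            key = bucket(snap)
            if key in seen:
                continue  # an older snapshot in a bucket already covered
            if len(seen) == count:
                break  # this rule has filled all of its buckets
            seen.append(key)
            kept.add(snap)
    return kept


if __name__ == "__main__":
    newest = datetime(2026, 6, 10, 2, 0)
    nightly = [newest - timedelta(days=i) for i in range(400)]
    for snap in sorted(gfs_keep(nightly), reverse=True):
        print(snap.date())
```

Because the rules overlap (the newest snapshot satisfies all four), the number of retained snapshots is smaller than the sum 7 + 4 + 6 + 1, which is worth stating in the ADR when sizing the repo against pCloud's 1 TB ceiling.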

  • Step 2: Add the Further-reading row in CLAUDE.md

In the Further-reading table, immediately after the Operational access … 021-operational-access.md row, add:

| Backup & disaster recovery | `docs/decisions/022-backup.md`        |
  • Step 3: Add the BACKUP.md role-convention in CLAUDE.md

In the "Role conventions" list, immediately after the ACCESS.md (ADR-021) bullet, add:

- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
  copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
  data. A stateless service records `backup__state: false` with a reason.
  • Step 4: Add STATUS.md rows

In the "Designed but not built" table in STATUS.md, add two rows:

| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
  • Step 5: Update TODO item 3.8

In docs/TODO.md, change the item-3.8 line:

From:

   8. Ensure the right things are backed up (incl. database dumps if we land on PBS).

To:

   8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~
      DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster
      node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via
      pCloud + rotated USB air-gap. Build: Plans 2–3.
  • Step 6: Verify

Run: make lint Expected: PASS (yamllint, ansible-lint, check-tags: OK …). No new YAML/tags introduced, so this confirms nothing regressed.

Run: grep -n "022-backup" CLAUDE.md && grep -rn "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md Expected: matches in every listed file (cross-references resolve).
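
A multi-file `grep` exits 0 as soon as any one file matches, so the exit status alone does not prove that every listed file carries the reference. A stricter per-file check could look like the following sketch; `files_missing` is a hypothetical helper, not an existing boma script.

```python
from pathlib import Path


def files_missing(pattern, paths):
    """Return the paths that are absent or do not contain `pattern`."""
    missing = []
    for path in paths:
        p = Path(path)
        # A missing file counts as a gap, same as a file without the reference.
        if not p.is_file() or pattern not in p.read_text(encoding="utf-8"):
            missing.append(str(path))
    return missing
```

Calling `files_missing("ADR-022", ["docs/decisions/022-backup.md", "STATUS.md", "docs/TODO.md"])` from the repo root should return an empty list once Steps 1 to 5 are done; any path it returns is a file that still lacks the cross-reference.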

  • Step 7: Commit
git add docs/decisions/022-backup.md CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO"

Task 2: Create the BACKUP.md template and define the backup__* contract

Files:

  • Create: docs/backup/service-backup-template.md

  • Step 1: Create the template

Mirror docs/access/service-access-template.md (preamble that says copy-to-role-and-delete; structured tables rendered from data; a hand-written prose tail). Write exactly:

# Per-service backup record — template

Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
role (ADR-022). It is the per-service **backup record**: what state the service holds,
how it is captured consistently, and how it is restored. The structured parts are
**rendered from the role's `backup__*` data** (the single source of truth that also
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
than hand-editing the tables. The prose "Restore notes" tail is hand-written.

A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
`backup__state: false` with a reason in its role defaults instead.

Delete this preamble in the copy and start from the heading below.

---

# Backup — <service>

## State captured

Rendered from `backup__*`:

| What | Source | How captured |
|---|---|---|
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |

- **Quiesce:** `<backup__quiesce>`; `true` means the service is stopped → backed up →
  restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).

## Restore procedure

1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
3. Replay each `<backup__dumps[*].dest>` into its database.
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).

## Restore notes

Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
known-tricky migrations.

- <none yet>

The backup__* contract this template renders from (document it here and in the ADR; the role in Plan 2 consumes it):

backup__service: <name>          # identifier; matches the role / compose project
backup__state: true              # false = stateless → no BACKUP.md (pair with a reason)
backup__paths:                   # bind-mount dirs/files holding state ([] = none)
  - /srv/<service>/data
backup__dumps:                   # logical app-consistent dumps (Decision 7 default; [] = none)
  - cmd: "docker compose -p <service> exec -T db pg_dump -U {{ vault.<service>.db_user }} <db>"
    dest: <service>-db.sql
backup__quiesce: false           # true = stop→back up→restart escape hatch (Decision 7 B)
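
The contract above is only data, so its shape can be checked mechanically once a role's defaults are loaded into a dict. The sketch below shows one way to do that. Several rules in it are assumptions not stated by the contract: that `backup__state` defaults to true when absent, that paths must be absolute, and that a stateful service must declare at least one path or dump. `validate_backup_contract` is a hypothetical name, and this plan still adds no lint script.

```python
def validate_backup_contract(data):
    """Return a list of problems with a role's backup__* data (empty = OK)."""
    problems = []

    service = data.get("backup__service")
    if not isinstance(service, str) or not service:
        problems.append("backup__service must be a non-empty string")

    # Assumption: a role that omits backup__state is treated as stateful.
    state = data.get("backup__state", True)
    if not isinstance(state, bool):
        problems.append("backup__state must be true or false")
    if state is False:
        # Stateless: nothing else to declare. How the reason is recorded is
        # not specified by the contract, so it is not checked here.
        return problems

    paths = data.get("backup__paths", [])
    dumps = data.get("backup__dumps", [])

    # Assumption: declared paths are absolute host paths.
    if not isinstance(paths, list) or not all(
        isinstance(p, str) and p.startswith("/") for p in paths
    ):
        problems.append("backup__paths must be a list of absolute paths")

    if not isinstance(dumps, list):
        problems.append("backup__dumps must be a list")
        dumps = []
    for i, dump in enumerate(dumps):
        if not isinstance(dump, dict) or not dump.get("cmd") or not dump.get("dest"):
            problems.append(f"backup__dumps[{i}] needs both cmd and dest")

    # Assumption: a stateful service with nothing to capture is a mistake.
    if isinstance(paths, list) and not paths and not dumps:
        problems.append("stateful service declares neither paths nor dumps")

    if not isinstance(data.get("backup__quiesce", False), bool):
        problems.append("backup__quiesce must be true or false")
    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer see every gap in a role's declaration in a single pass.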
  • Step 2: Verify

Run: test -f docs/backup/service-backup-template.md && echo PRESENT Expected: PRESENT

Run: make lint Expected: PASS (markdown only; confirms no regression).

  • Step 3: Commit
git add docs/backup/service-backup-template.md
git commit -m "docs(backup): add BACKUP.md template + backup__* contract (ADR-022)"

Task 3: Strengthen the per-service checklist gate

Files:

  • Modify: docs/security/service-checklist.md (Operability section)

  • Step 1: Replace the weak backup line with the ADR-022 gate

In the "Operability (security-adjacent)" section, replace this line:

- [ ] Backup/restore is covered if the service holds state

with (mirroring the existing ADR-021 access line directly below it):

- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
      `backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
      reports the declared paths/dumps captured in the latest snapshot — or the service
      sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
  • Step 2: Verify

Run: grep -n "ADR-022" docs/security/service-checklist.md Expected: one match (the new gate line).

Run: grep -c "Backup/restore is covered if the service holds state" docs/security/service-checklist.md Expected: 0 (old weak line gone).

  • Step 3: Commit
git add docs/security/service-checklist.md
git commit -m "docs(backup): gate BACKUP.md in service checklist (ADR-022)"

Task 4: Add the BACKUP.md step to the new-role runbook

Files:

  • Modify: docs/runbooks/new-role.md (insert a new step after the §11 ACCESS step; renumber the commit step)

  • Step 1: Insert the new step

Immediately after the §11 "Write the per-service operational-access record" block and before "### 12. Commit", insert:

### 12. Write the per-service backup record (stateful services)

For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
`backup__paths`, `backup__dumps` with `cmd` + `dest` per logical dump, and `backup__quiesce`;
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
is rendered from that data. A **stateless** service sets `backup__state: false` with a
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
proves the declared state is captured — part of the service-clearance gate
(`docs/security/service-checklist.md`).
  • Step 2: Renumber the commit step

Change the heading ### 12. Commit (now the following heading) to ### 13. Commit.

  • Step 3: Verify

Run: grep -nE "^### (11|12|13)\." docs/runbooks/new-role.md Expected: §11 access, §12 backup, §13 commit — in that order, no duplicate numbers.
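
The grep above only covers §11 to §13 and leaves the ordering check to the reader. If a mechanical check over the whole runbook is wanted, a sketch along these lines would flag duplicate or skipped step numbers. It assumes every numbered step in the runbook uses the `### N. Title` heading form seen here; `numbering_problems` is a hypothetical helper.

```python
import re

# Matches level-3 numbered headings such as "### 12. Commit".
HEADING = re.compile(r"^### (\d+)\. (.+)$", re.MULTILINE)


def numbering_problems(markdown):
    """Return problems with '### N. Title' headings: duplicates or gaps."""
    numbers = [int(m.group(1)) for m in HEADING.finditer(markdown)]
    problems = []
    for prev, cur in zip(numbers, numbers[1:]):
        if cur == prev:
            problems.append(f"duplicate step number {cur}")
        elif cur != prev + 1:
            problems.append(f"step {prev} is followed by step {cur}")
    return problems
```

An empty result for the text of docs/runbooks/new-role.md means the renumbering in Step 2 was applied; a `duplicate step number 12` entry means the old commit heading was left untouched.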

  • Step 4: Commit
git add docs/runbooks/new-role.md
git commit -m "docs(backup): add BACKUP.md step to new-role runbook (ADR-022)"

Task 5: Create the dormant /check-backup verifier command

Files:

  • Create: .claude/commands/check-backup.md

  • Step 1: Write the command

Mirror the sibling .claude/commands/check-access.md (same frontmatter/sections, same "dormant until infra exists" framing). Write:

---
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
---

Verify that a service's **declared** backup data (`backup__*`) is actually captured in
the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).

**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
command reports `not-yet-available` rather than failing.

## Preconditions

- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
  `not-yet-available` and stop.

## Checks (when live)

Load the `backup__*` data for the resolved role, then:

| Check | How | Green when |
|---|---|---|
| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
| integrity | `restic check --read-data-subset=<subset>` (sampled, e.g. `5%`) | no errors |

Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
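
For whoever implements the live checks in Plan 2/3, the first three rows of the table reduce to a pure comparison between the declared data and the contents of the latest snapshot. The sketch below illustrates that comparison under stated assumptions: the snapshot is already parsed into a hypothetical record `{"time": datetime, "files": set of paths}`, the freshness limit is exactly 24 h, and dumps are matched by file name. It does not call restic, and the integrity row is left out because it requires running `restic check` against the repo.

```python
from datetime import datetime, timedelta, timezone

CHECKS = ("snapshot freshness", "paths present", "dumps present")


def evaluate_snapshot(contract, snapshot, now, max_age=timedelta(hours=24)):
    """Return {check name: "pass" | "fail" | "n/a (stateless)"}.

    `snapshot` is a hypothetical pre-parsed record (see lead-in);
    None means no snapshot exists for the service.
    """
    if contract.get("backup__state", True) is False:
        return {name: "n/a (stateless)" for name in CHECKS}
    if snapshot is None:
        return {name: "fail" for name in CHECKS}

    files = snapshot["files"]

    def path_present(path):
        # A declared directory counts as present if it, or anything below it,
        # appears in the snapshot's file list.
        base = path.rstrip("/")
        return any(f == base or f.startswith(base + "/") for f in files)

    def dump_present(dest):
        # Assumption: dumps are located by file name, wherever they were staged.
        return any(f == dest or f.endswith("/" + dest) for f in files)

    results = {
        "snapshot freshness": now - snapshot["time"] <= max_age,
        "paths present": all(path_present(p) for p in contract.get("backup__paths", [])),
        "dumps present": all(dump_present(d["dest"]) for d in contract.get("backup__dumps", [])),
    }
    return {name: "pass" if ok else "fail" for name, ok in results.items()}
```

Keeping the comparison free of I/O means it can be unit-tested today, while the restic queries that feed it wait for `fisi` to exist.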
  • Step 2: Verify

Run: test -f .claude/commands/check-backup.md && head -1 .claude/commands/check-backup.md Expected: file present, first line --- (valid frontmatter).

Run: grep -n "not-yet-available" .claude/commands/check-backup.md Expected: matches (dormancy explicit).

  • Step 3: Commit
git add .claude/commands/check-backup.md
git commit -m "feat(backup): add dormant /check-backup verifier (ADR-022)"

Task 6: Update hardware reference and capabilities

Files:

  • Modify: docs/hardware/reference.md (ubongo spec; new fisi node; capacity table)

  • Modify: docs/CAPABILITIES.md (§9 Data & backup)

  • Step 1: Update the ubongo prose block

In docs/hardware/reference.md §1, replace the placeholder target on the ubongo Storage line with the real machine's spec:

From:

- **Storage:** _TBD (target 250 GB SSD/NVMe)_

To:

- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
  • Step 2: Add a fisi prose block

After the ubongo block in §1, add:

### fisi (backup node — outside the cluster; provisional)
- **Model / form factor:** HP Elite 600 G9 (tower)
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
- **RAM:** 16 GB+ (TBD exact)
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk failure)
- **NICs:** wired GbE
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
  docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
  Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
  • Step 3: Update the machine-readable capacity table

In §4 "Node capacity", change the ubongo row disk from 250 to 1000 and add a fisi row. Keep the header and integer/decimal format intact (parsed by capacity-scan.py):

From:

| ubongo | 4   | 16     | 250     |

To:

| ubongo | 4   | 16     | 1000    |
| fisi   | 4   | 16     | 8000    |
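
Since the rows are machine-parsed, it can help to sanity-check an edited row before running the real scanner. The sketch below is not `capacity-scan.py` (whose parsing logic is not shown in this plan); it is a hypothetical stand-in that assumes data rows have a name cell followed by purely numeric cells, and that header and separator rows can be recognised by failing that test.

```python
def parse_capacity_rows(table_text):
    """Parse '| name | n | n | n |' rows into {name: [numbers]}.

    Rows whose value cells are not all numeric (header, |---| separator,
    blank lines) are skipped rather than reported.
    """
    rows = {}
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue
        try:
            values = [float(c) for c in cells[1:]]
        except ValueError:
            continue  # not a data row
        rows[cells[0]] = values
    return rows
```

A row with a stray unit such as `8 TB` in the disk cell would silently drop out of the result under this model, which is the kind of formatting slip the "keep the integer/decimal format intact" instruction guards against.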
  • Step 4: Update CAPABILITIES §9

In docs/CAPABILITIES.md §9 table, replace the three backup rows:

From:

| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |

To:

| Backup engine | restic (data-only) | S | committed | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
| Off-site target | pCloud (via rclone) | S | committed | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
| Air-gap target | USB hard drives | S | committed | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
  • Step 5: Verify

Run: make lint Expected: PASS.

Run: python3 scripts/capacity-scan.py >/dev/null && echo CAPACITY_OK Expected: CAPACITY_OK (the capacity table headers are still parseable; new fisi row accepted).

Run: grep -n "ADR-022" docs/CAPABILITIES.md Expected: three matches (the updated backup rows).

  • Step 6: Commit
git add docs/hardware/reference.md docs/CAPABILITIES.md
git commit -m "docs(backup): update hardware ref (ubongo M70q, add fisi) + CAPABILITIES §9 (ADR-022)"

Task 7: Final review and merge

  • Step 1: Full lint + capacity sanity

Run: make lint && python3 scripts/capacity-scan.py >/dev/null && echo ALL_GREEN Expected: ALL_GREEN.

  • Step 2: Cross-reference audit

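Run: grep -rln "ADR-022\|022-backup" CLAUDE.md STATUS.md docs/ .claude/ Expected: the ADR file, CLAUDE.md, STATUS.md, TODO.md, service-checklist.md, new-role.md, CAPABILITIES.md, and check-backup.md are all listed, with no dangling reference and no file missed. Because the search covers all of docs/, other files that cite ADR-022 (service-backup-template.md, hardware/reference.md, this plan) also appear; that is expected.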

  • Step 3: Merge to main and delete the branch
git checkout main
git merge --no-ff feat/backup-foundation -m "feat(backup): backup strategy foundation layer (ADR-022)"
git branch -d feat/backup-foundation
git push origin main

Self-review (completed by plan author)

  • Spec coverage: All 13 decisions are recorded in ADR-022 (Task 1, Step 1). The foundation obligations of Decisions 6 (contract + BACKUP.md), 7 (dumps-first wording in template/runbook), and the doc/inventory facts (Decisions 4/8 hardware) are implemented as concrete files in Tasks 2–6. Decisions whose implementation is live infra, namely 1/3/9/11/12/13 (engine, retention, air-gap mechanism, alerting, schedule) and 8's restore-testing, are explicitly deferred to Plans 2–3 (see Decomposition & roadmap), not silently dropped.
  • Placeholder scan: No "TBD/implement later" steps; every edit shows exact from→to text or full file content. (<service>/<name> inside template/contract bodies are intentional doc placeholders for the eventual role author, not plan gaps.)
  • Consistency: backup__* field names (backup__service, backup__state, backup__paths, backup__dumps[].cmd/.dest, backup__quiesce) are identical across the ADR (Task 1), template + contract (Task 2), checklist (Task 3), runbook (Task 4), and /check-backup (Task 5). The governance triad matches ADR-021's (template / checklist line / runbook step / dormant verifier), and the "no lint script" choice is stated in both the plan header and the ADR.