# Backup & DR Strategy — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Land the *foundation layer* of the backup strategy — ADR-022, the per-service `backup__*` data contract + `BACKUP.md` governance triad (template + checklist gate + runbook step + dormant verifier), and the doc/inventory updates — so every future service role is born backup-aware, before any live infrastructure exists.

**Architecture:** This is the first of three sequenced plans (see *Decomposition & roadmap* below). It is **doc/governance only** — no Ansible role, no live restic/rclone, no host contact. It mirrors exactly how ADR-021 delivered operational-access governance: a template under `docs//`, one line in `docs/security/service-checklist.md`, a step in `docs/runbooks/new-role.md`, and a *dormant* verifier command (`/check-access` → here `/check-backup`). boma deliberately gates these per-service docs via checklist+runbook, **not** an automated lint script — so this plan adds **no** `scripts/check-*.py`. (This reconciles the design doc's casual "make lint gates its presence" phrasing with boma's actual governance choice; the ADR records the reconciliation.)

**Tech Stack:** Markdown docs, Ansible role-var conventions (`backup__*`, double-underscore namespace per CLAUDE.md), `make lint` (yamllint + ansible-lint + `check-tags.py`) as the only automated gate, `git` trunk-based on a feature branch.

**Source spec:** `docs/superpowers/specs/2026-06-10-backup-strategy-design.md` (Decisions 1–13 referenced by number throughout).

---

## Decomposition & roadmap

The full spec spans three subsystems with hard ordering dependencies (STATUS.md: no service roles exist, `fisi` unprovisioned, Terraform never `init`ed, no staging cluster, no Uptime Kuma/pCloud). Each becomes its own plan and produces working, testable software on its own:

- **Plan 1 — Foundation (THIS PLAN).** ADR + `backup__*` contract + `BACKUP.md` governance + doc/inventory updates. Buildable and verifiable **today** with zero live infra. Unblocks every service role.
- **Plan 2 — The `backup` role (FUTURE).** `make new-role NAME=backup`: pull orchestrator, restic wrapper, `rclone→pCloud`, retention prune, udev air-gap unit + `restic copy`, systemd timers, ntfy + Uptime-Kuma heartbeat. Built with Molecule render/syntax tests + pytest, the way the `firewall` concern was — buildable now, *functionally* testable only once `fisi` + hosts exist. **Blocked on:** `fisi` provisioned (SATA power cable), `backup_hosts` inventory group, at least one service role declaring `backup__*`.
- **Plan 3 — Live wire-up + restore testing (FUTURE).** Deploy the role, pCloud rclone auth, Uptime Kuma push monitor, Tier-1 restore-verify on `ubongo`, semi-annual Tier-2 DR rehearsal on staging, the printed break-glass runbook + its annual drill. **Blocked on:** Plan 2 deployed, real VMs/staging, services with `VERIFY.md`, Vaultwarden live.

Write Plans 2 and 3 with this same skill when their prerequisites land. Everything below is Plan 1.
---

## Plan 1 file map

| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/022-backup.md` | create | ADR of record; distils the spec's Decisions 1–13 |
| `docs/backup/service-backup-template.md` | create | `BACKUP.md` template; defines the `backup__*` contract shape |
| `.claude/commands/check-backup.md` | create | Dormant verifier (mirrors `check-access.md`) |
| `CLAUDE.md` | modify | Role-conventions: BACKUP.md required for service roles; Further-reading row |
| `docs/security/service-checklist.md` | modify | Strengthen the Operability backup line to the ADR-022 gate |
| `docs/runbooks/new-role.md` | modify | Add the per-service BACKUP.md step (new §12, renumber commit) |
| `docs/hardware/reference.md` | modify | `ubongo` → M70q/1TB; add `fisi` node + capacity row |
| `docs/CAPABILITIES.md` | modify | §9: restic+rclone+USB committed; PBS deferred; ref ADR-022 |
| `STATUS.md` | modify | Add "Designed but not built" rows for backup role + contract |
| `docs/TODO.md` | modify | Mark item 3.8 decided; reference ADR-022 |

**Working branch (all tasks):** AI-driven multi-file change → review as one diff (CLAUDE.md git conventions).

```bash
git checkout -b feat/backup-foundation
```

Before any commit, confirm `rbw unlocked` exits 0 (the pre-commit hook decrypts `vault.yml`); if not, stop and ask the operator to `rbw unlock`.

---

### Task 1: Author ADR-022 and wire the decision into CLAUDE.md / STATUS.md / TODO.md

**Files:**
- Create: `docs/decisions/022-backup.md`
- Modify: `CLAUDE.md` (Further-reading table; role-conventions block)
- Modify: `STATUS.md` ("Designed but not built" table)
- Modify: `docs/TODO.md` (item 3.8)

- [ ] **Step 1: Write `docs/decisions/022-backup.md`**

Mirror the structure of `docs/decisions/021-operational-access.md` (`## Context`, `## Decision`, subsections, `## Consequences`). Transcribe the spec's settled decisions — do not re-derive. The ADR body must state, each as its own labelled decision:

1. **Recovery model A** — data-only restic backups, rebuild-from-code; no PBS in v1 (deferred as Model B/C). (spec Decision 1)
2. **One tier, ~24 h RPO.** (Decision 2)
3. **Engine:** restic (data) + rclone (pCloud off-site); restic encrypts → rclone moves ciphertext only, no second layer. (Decision 3)
4. **Topology:** central off-cluster **pull** node (`fisi`, provisional), 2×8 TB mirror, owns the repo, runs rclone + the USB dock; hosts hold no backup creds. New `backup_hosts` inventory group, `base` role applies. (Decision 4)
5. **3-2-1 mapping** incl. USB air-gap as the immutable backstop. (Decision 5)
6. **Per-service contract:** `backup__*` role vars + required `BACKUP.md`, rendered from the data (the ADR-021 pattern). **Governance reconciliation:** gated via the per-service checklist + new-role runbook + dormant `/check-backup` verifier — **not** an automated lint script (consistent with ADR-021's "runbook+gate, not scaffold" choice). State this explicitly so it supersedes the design doc's "make lint gates its presence" wording. (Decision 6)
7. **Consistency:** logical dumps first (`pg_dump`/`mysqldump`), `quiesce` escape hatch; FS snapshots not the sole DB method. (Decision 7)
8. **Restore testing:** Tier-1 weekly rolling container restore-verify on `ubongo` (reuses `VERIFY.md`); Tier-2 semi-annual full DR rehearsal on staging, ≥1/yr exercises the paper break-glass. `ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged). (Decision 8)
9. **Retention (GFS):** `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`. (Decision 9)
10. **Encryption + escrow + break-glass:** one restic password protects all copies; escrowed to `fisi`(+vault) / Vaultwarden / **paper**; paper holds **both** the restic password **and** the Ansible vault password (breaks the Model-A circular dependency); `mamba` is the break-glass clone (ADR-015). (Decision 10)
11. **USB air-gap:** udev serial-allowlist → `restic copy` to a USB restic repo → `restic check` → ntfy; rotate off-site. (Decision 11)
12. **Failure alerting:** Uptime-Kuma dead-man's-switch + ntfy on failure + weekly `restic check`. (Decision 12)
13. **Schedule.** (Decision 13)

`## Consequences` must note: pCloud is off-site but **sync-coupled** (deletes propagate) → USB is the only immutable copy; `fisi` is the crown-jewel host (full base hardening); pCloud's 1 TB is the off-site capacity ceiling. End with a one-line pointer back to the design doc and to Plans 2–3 as the build path.

- [ ] **Step 2: Add the Further-reading row in `CLAUDE.md`**

In the Further-reading table, immediately after the `Operational access … 021-operational-access.md` row, add:

```
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
```

- [ ] **Step 3: Add the BACKUP.md role-convention in `CLAUDE.md`**

In the "Role conventions" list, immediately after the `ACCESS.md (ADR-021)` bullet, add:

```
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) — copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*` data. A stateless service records `backup__state: false` with a reason.
```

- [ ] **Step 4: Add STATUS.md rows**

In the "Designed but not built" table in `STATUS.md`, add two rows:

```
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
```

- [ ] **Step 5: Update TODO item 3.8**

In `docs/TODO.md`, change the item-3.8 line:

From:

```
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
```

To:

```
8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~ DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via pCloud + rotated USB air-gap. Build: Plans 2–3.
```

- [ ] **Step 6: Verify**

Run: `make lint`
Expected: PASS (yamllint, ansible-lint, `check-tags: OK …`). No new YAML/tags introduced, so this confirms nothing regressed.

Run: `grep -n "022-backup" CLAUDE.md && grep -rn "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md`
Expected: matches in every listed file (cross-references resolve).

- [ ] **Step 7: Commit**

```bash
git add docs/decisions/022-backup.md CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO"
```

---

### Task 2: Create the `BACKUP.md` template and define the `backup__*` contract

**Files:**
- Create: `docs/backup/service-backup-template.md`

- [ ] **Step 1: Create the template**

Mirror `docs/access/service-access-template.md` (preamble that says copy-to-role-and-delete; structured tables rendered from data; a hand-written prose tail). Write exactly:

````markdown
# Per-service backup record — template

Copy this file to `roles//BACKUP.md` when building a **stateful** service role (ADR-022). It is the per-service **backup record**: what state the service holds, how it is captured consistently, and how it is restored. The structured parts are **rendered from the role's `backup__*` data** (the single source of truth that also drives `/check-backup`) — keep the data authoritative and regenerate this file rather than hand-editing the tables. The prose "Restore notes" tail is hand-written.

A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets `backup__state: false` with a reason in its role defaults instead.

Delete this preamble in the copy and start from the heading below.

---

# Backup —

## State captured

Rendered from `backup__*`:

| What | Source | How captured |
|---|---|---|
| data dir(s) | `` | file-level, pulled read-only |
| database | `` → `` | logical dump (default; ADR-022 Decision 7) |

- **Quiesce:** `` — `true` means the service is stopped → backed up → restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).

## Restore procedure

1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
2. `restic restore` the latest snapshot for `` into ``.
3. Replay each `` into its database.
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).

## Restore notes

Prose the data can't capture — ordering gotchas, "restore the DB before the data dir", known-tricky migrations.

-
````

The `backup__*` contract this template renders from (document it here and in the ADR; the role in Plan 2 consumes it):

```yaml
backup__service:         # identifier; matches the role / compose project
backup__state: true      # false = stateless → no BACKUP.md (pair with a reason)
backup__paths:           # bind-mount dirs/files holding state ([] = none)
  - /srv//data
backup__dumps:           # logical app-consistent dumps (Decision 7 default; [] = none)
  - cmd: "docker compose -p exec -T db pg_dump -U {{ vault..db_user }} "
    dest: -db.sql
backup__quiesce: false   # true = stop→back up→restart escape hatch (Decision 7 B)
```

- [ ] **Step 2: Verify**

Run: `test -f docs/backup/service-backup-template.md && echo PRESENT`
Expected: `PRESENT`

Run: `make lint`
Expected: PASS (markdown only; confirms no regression).
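To make the shape of the `backup__*` contract concrete, here is a minimal Python sketch of the checks a consumer of that data (for example the Plan 2 role or `/check-backup`) might apply. It is illustrative only and not a file this plan creates: the plan deliberately adds no lint script, `validate_backup_contract` is a hypothetical name, and the rule that a stateful service must declare at least one path or dump is an assumption rather than something the spec states.

```python
from __future__ import annotations


def validate_backup_contract(role_vars: dict) -> list[str]:
    """Return the problems found in a role's backup__* vars (empty list = OK)."""
    problems: list[str] = []

    if not role_vars.get("backup__service"):
        problems.append("backup__service must be a non-empty identifier")

    state = role_vars.get("backup__state", True)
    if not isinstance(state, bool):
        problems.append("backup__state must be a boolean")
        return problems
    if state is False:
        # Stateless service: no paths/dumps expected. How the "reason" is
        # recorded is not specified by the contract, so it is not checked here.
        return problems

    paths = role_vars.get("backup__paths", [])
    dumps = role_vars.get("backup__dumps", [])

    if not isinstance(paths, list) or not all(isinstance(p, str) and p for p in paths):
        problems.append("backup__paths must be a list of non-empty strings")
    if not isinstance(dumps, list):
        problems.append("backup__dumps must be a list")
    else:
        for i, dump in enumerate(dumps):
            if not isinstance(dump, dict) or not dump.get("cmd") or not dump.get("dest"):
                problems.append(f"backup__dumps[{i}] needs both cmd and dest")

    # Assumption: a stateful service with nothing declared is a mistake.
    if paths == [] and dumps == []:
        problems.append("stateful service declares neither paths nor dumps")

    if not isinstance(role_vars.get("backup__quiesce", False), bool):
        problems.append("backup__quiesce must be a boolean")

    return problems
```

Calling it with a stateless role such as `{"backup__service": "whoami", "backup__state": False}` returns an empty list, while a dump entry missing `dest` is reported by index.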
- [ ] **Step 3: Commit**

```bash
git add docs/backup/service-backup-template.md
git commit -m "docs(backup): add BACKUP.md template + backup__* contract (ADR-022)"
```

---

### Task 3: Strengthen the per-service checklist gate

**Files:**
- Modify: `docs/security/service-checklist.md` (Operability section)

- [ ] **Step 1: Replace the weak backup line with the ADR-022 gate**

In the "Operability (security-adjacent)" section, replace this line:

```
- [ ] Backup/restore is covered if the service holds state
```

with (mirroring the existing ADR-021 access line directly below it):

```
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries `backup__*` data, `roles//BACKUP.md` is rendered, and `/check-backup` reports the declared paths/dumps captured in the latest snapshot — or the service sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
```

- [ ] **Step 2: Verify**

Run: `grep -n "ADR-022" docs/security/service-checklist.md`
Expected: one match (the new gate line).

Run: `grep -c "Backup/restore is covered if the service holds state" docs/security/service-checklist.md`
Expected: `0` (old weak line gone).

- [ ] **Step 3: Commit**

```bash
git add docs/security/service-checklist.md
git commit -m "docs(backup): gate BACKUP.md in service checklist (ADR-022)"
```

---

### Task 4: Add the BACKUP.md step to the new-role runbook

**Files:**
- Modify: `docs/runbooks/new-role.md` (insert a new step after the §11 ACCESS step; renumber the commit step)

- [ ] **Step 1: Insert the new step**

Immediately after the §11 "Write the per-service operational-access record" block and before "### 12. Commit", insert:

```markdown
### 12. Write the per-service backup record (stateful services)

For a **stateful** service role, copy `docs/backup/service-backup-template.md` to `roles//BACKUP.md` and populate the role's `backup__*` data (`backup__service`, `backup__paths`, `backup__dumps` — `cmd` + `dest` per logical dump — and `backup__quiesce`; ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md` is rendered from that data.

A **stateless** service sets `backup__state: false` with a reason and gets no `BACKUP.md`.

Once the backup node exists, `/check-backup ` proves the declared state is captured — part of the service-clearance gate (`docs/security/service-checklist.md`).
```

- [ ] **Step 2: Renumber the commit step**

Change the heading `### 12. Commit` (now the following heading) to `### 13. Commit`.

- [ ] **Step 3: Verify**

Run: `grep -nE "^### (11|12|13)\." docs/runbooks/new-role.md`
Expected: §11 access, §12 backup, §13 commit — in that order, no duplicate numbers.

- [ ] **Step 4: Commit**

```bash
git add docs/runbooks/new-role.md
git commit -m "docs(backup): add BACKUP.md step to new-role runbook (ADR-022)"
```

---

### Task 5: Create the dormant `/check-backup` verifier command

**Files:**
- Create: `.claude/commands/check-backup.md`

- [ ] **Step 1: Write the command**

Mirror the sibling `.claude/commands/check-access.md` (same frontmatter/sections, same "dormant until infra exists" framing). Write:

````markdown
---
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
---

Verify that a service's **declared** backup data (`backup__*`) is actually captured in the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern, applied to backups).

Argument: a service/role name (e.g. `/check-backup nextcloud`).

**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this command reports `not-yet-available` rather than failing.

## Preconditions

- `roles//` carries `backup__*` data (or `backup__state: false` with a reason).
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report `not-yet-available` and stop.

## Checks (when live)

Load the `backup__*` data for the resolved role, then:

| Check | How | Green when |
|---|---|---|
| snapshot freshness | `restic snapshots --tag --latest 1` | a snapshot ≤ ~24 h old exists |
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
| integrity | `restic check --read-data-subset` (sampled) | no errors |

Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
````

- [ ] **Step 2: Verify**

Run: `test -f .claude/commands/check-backup.md && head -1 .claude/commands/check-backup.md`
Expected: file present, first line `---` (valid frontmatter).

Run: `grep -n "not-yet-available" .claude/commands/check-backup.md`
Expected: matches (dormancy explicit).
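As an illustration of the "snapshot freshness" check in the command above, the following Python sketch evaluates the JSON that `restic snapshots --json` prints (a list of snapshot objects, each with a `time` field). It is a sketch under assumptions, not part of the command file: the function names are hypothetical, the 24 h default is a strict reading of "≤ ~24 h", and the JSON is passed in as a string so the logic can be exercised before any `fisi` repo exists.

```python
import json
import re
from datetime import datetime, timedelta, timezone


def parse_restic_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp as written by restic into an aware datetime."""
    # restic may emit anywhere from 0 to 9 fractional digits; normalise the
    # fraction to exactly 6 so datetime.fromisoformat accepts it everywhere.
    value = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), value)
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def snapshot_is_fresh(
    snapshots_json: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumption: strict 24 h window
) -> bool:
    """True when the newest snapshot in the JSON list is no older than max_age."""
    snapshots = json.loads(snapshots_json)
    if not snapshots:
        return False  # no snapshot at all can never be "fresh"
    newest = max(parse_restic_time(s["time"]) for s in snapshots)
    return now - newest <= max_age
```

A caller would pass the current time as an aware datetime, for example `datetime.now(timezone.utc)`. A nightly job whose runtime varies may need a slightly larger `max_age` to avoid false alarms.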
- [ ] **Step 3: Commit**

```bash
git add .claude/commands/check-backup.md
git commit -m "feat(backup): add dormant /check-backup verifier (ADR-022)"
```

---

### Task 6: Update hardware reference and capabilities

**Files:**
- Modify: `docs/hardware/reference.md` (`ubongo` spec; new `fisi` node; capacity table)
- Modify: `docs/CAPABILITIES.md` (§9 Data & backup)

- [ ] **Step 1: Update the `ubongo` prose block**

In `docs/hardware/reference.md` §1, replace the `ubongo` Storage line target with the real machine:

From:

```
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
```

To:

```
- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
```

- [ ] **Step 2: Add a `fisi` prose block**

After the `ubongo` block in §1, add:

```
### fisi (backup node — outside the cluster; provisional)

- **Model / form factor:** HP Elite 600 G9 (tower)
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
- **RAM:** 16 GB+ (TBD exact)
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
- **NICs:** wired GbE
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud, docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs. Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
```

- [ ] **Step 3: Update the machine-readable capacity table**

In §4 "Node capacity", change the `ubongo` row disk from `250` to `1000` and add a `fisi` row. Keep the header and integer/decimal format intact (parsed by `capacity-scan.py`):

From:

```
| ubongo | 4 | 16 | 250 |
```

To:

```
| ubongo | 4 | 16 | 1000 |
| fisi | 4 | 16 | 8000 |
```

- [ ] **Step 4: Update CAPABILITIES §9**

In `docs/CAPABILITIES.md` §9 table, replace the three backup rows:

From:

```
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
```

To:

```
| Backup engine | restic (data-only) | S | committed | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
| Off-site target | pCloud (via rclone) | S | committed | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
| Air-gap target | USB hard drives | S | committed | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
```

- [ ] **Step 5: Verify**

Run: `make lint`
Expected: PASS.

Run: `python3 scripts/capacity-scan.py >/dev/null && echo CAPACITY_OK`
Expected: `CAPACITY_OK` (the capacity table headers are still parseable; new `fisi` row accepted).

Run: `grep -n "ADR-022" docs/CAPABILITIES.md`
Expected: three matches (the updated backup rows).

- [ ] **Step 6: Commit**

```bash
git add docs/hardware/reference.md docs/CAPABILITIES.md
git commit -m "docs(backup): update hardware ref (ubongo M70q, add fisi) + CAPABILITIES §9 (ADR-022)"
```

---

### Task 7: Final review and merge

- [ ] **Step 1: Full lint + capacity sanity**

Run: `make lint && python3 scripts/capacity-scan.py >/dev/null && echo ALL_GREEN`
Expected: `ALL_GREEN`.
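For intuition about why the capacity rows must stay purely numeric, here is a small Python sketch of how a parser could read such a table. It is not the repository's `capacity-scan.py`, whose behaviour this plan does not describe: `parse_capacity_rows` is a hypothetical helper, and the header names in the usage example are invented because the real header is not shown here.

```python
def parse_capacity_rows(table_md: str) -> dict:
    """Map node name -> tuple of numeric cells for each data row of a pipe table."""
    rows = {}
    for line in table_md.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue  # not a four-column row of the capacity table
        name, *numbers = cells
        try:
            rows[name] = tuple(float(n) for n in numbers)
        except ValueError:
            continue  # header or separator row: cells are not numbers
    return rows


# Hypothetical header; only the two data rows come from this plan.
example = """\
| node | cores | ram_gb | disk_gb |
|---|---|---|---|
| ubongo | 4 | 16 | 1000 |
| fisi | 4 | 16 | 8000 |
"""
print(parse_capacity_rows(example))
```

A row such as `| fisi | 4 | 16+ | 8000 |` would be skipped silently by this sketch, which is the kind of failure the "keep the integer/decimal format intact" instruction guards against.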
- [ ] **Step 2: Cross-reference audit**

Run: `grep -rln "ADR-022\|022-backup" CLAUDE.md STATUS.md docs/ .claude/`
Expected: ADR file, CLAUDE.md, STATUS.md, TODO.md, service-backup-template.md, service-checklist.md, new-role.md, hardware/reference.md, CAPABILITIES.md, check-backup.md all listed — no dangling reference, no file missed.

- [ ] **Step 3: Merge to main and delete the branch**

```bash
git checkout main
git merge --no-ff feat/backup-foundation -m "feat(backup): backup strategy foundation layer (ADR-022)"
git branch -d feat/backup-foundation
git push origin main
```

---

## Self-review (completed by plan author)

- **Spec coverage:** All 13 decisions are recorded in ADR-022 (Task 1, Step 1). The *foundation* obligations of Decisions 6 (contract + BACKUP.md), 7 (dumps-first wording in template/runbook), and the doc/inventory facts (Decisions 4/8 hardware) are implemented as concrete files in Tasks 2–6. Decisions whose *implementation* is live infra — 1/3/9/11/12/13 (engine, retention, air-gap mechanism, alerting, schedule) and 8's restore-testing — are explicitly deferred to Plans 2–3 (see *Decomposition & roadmap*), not silently dropped.
- **Placeholder scan:** No "TBD/implement later" steps; every edit shows exact from→to text or full file content. (``/`` inside template/contract bodies are intentional doc placeholders for the eventual role author, not plan gaps.)
- **Consistency:** `backup__*` field names (`backup__service`, `backup__state`, `backup__paths`, `backup__dumps[].cmd/.dest`, `backup__quiesce`) are identical across the ADR (Task 1), template + contract (Task 2), checklist (Task 3), runbook (Task 4), and `/check-backup` (Task 5). The governance triad matches ADR-021's (template / checklist line / runbook step / dormant verifier), and the "no lint script" choice is stated in both the plan header and the ADR.
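The consistency claim about `backup__*` field names can be spot-checked during review with a few lines of Python. The sketch below is a one-off reviewer aid, not a gate to add to the repository (the plan keeps governance in the checklist and runbook); `unknown_backup_fields` is a hypothetical name, and the caller supplies the document texts.

```python
from __future__ import annotations

import re

# The five field names defined by the contract in Task 2.
CONTRACT_FIELDS = {
    "backup__service",
    "backup__state",
    "backup__paths",
    "backup__dumps",
    "backup__quiesce",
}


def unknown_backup_fields(documents: dict[str, str]) -> dict[str, set[str]]:
    """Return, per document, any backup__ identifiers that are not in the contract."""
    findings: dict[str, set[str]] = {}
    for name, text in documents.items():
        # `backup__*` itself does not match: the pattern needs a name after the prefix.
        used = set(re.findall(r"\bbackup__[a-z0-9_]+", text))
        extra = used - CONTRACT_FIELDS
        if extra:
            findings[name] = extra
    return findings
```

Feeding it the text of the ADR, template, checklist, runbook, and `/check-backup` command should return an empty dict; a typo such as `backup__quiese` would be reported against the file that contains it.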