Backup & DR Strategy — Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Land the foundation layer of the backup strategy — ADR-022, the per-service backup__* data contract + BACKUP.md governance triad (template + checklist gate + runbook step + dormant verifier), and the doc/inventory updates — so every future service role is born backup-aware, before any live infrastructure exists.
Architecture: This is the first of three sequenced plans (see Decomposition & roadmap below). It is doc/governance only — no Ansible role, no live restic/rclone, no host contact. It mirrors exactly how ADR-021 delivered operational-access governance: a template under docs/<concern>/, one line in docs/security/service-checklist.md, a step in docs/runbooks/new-role.md, and a dormant verifier command (/check-access → here /check-backup). boma deliberately gates these per-service docs via checklist+runbook, not an automated lint script — so this plan adds no scripts/check-*.py. (This reconciles the design doc's casual "make lint gates its presence" phrasing with boma's actual governance choice; the ADR records the reconciliation.)
Tech Stack: Markdown docs, Ansible role-var conventions (backup__*, double-underscore namespace per CLAUDE.md), make lint (yamllint + ansible-lint + check-tags.py) as the only automated gate, git trunk-based on a feature branch.
Source spec: docs/superpowers/specs/2026-06-10-backup-strategy-design.md (Decisions 1–13 referenced by number throughout).
Decomposition & roadmap
The full spec spans three subsystems with hard ordering dependencies (STATUS.md: no service roles exist, fisi unprovisioned, Terraform never inited, no staging cluster, no Uptime Kuma/pCloud). Each becomes its own plan and produces working, testable software on its own:
- Plan 1: Foundation (THIS PLAN). ADR + backup__* contract + BACKUP.md governance + doc/inventory updates. Buildable and verifiable today with zero live infra. Unblocks every service role.
- Plan 2: The backup role (FUTURE). make new-role NAME=backup: pull orchestrator, restic wrapper, rclone→pCloud, retention prune, udev air-gap unit + restic copy, systemd timers, ntfy + Uptime-Kuma heartbeat. Built with Molecule render/syntax tests + pytest, the way the firewall concern was: buildable now, functionally testable only once fisi + hosts exist. Blocked on: fisi provisioned (SATA power cable), backup_hosts inventory group, at least one service role declaring backup__*.
- Plan 3: Live wire-up + restore testing (FUTURE). Deploy the role, pCloud rclone auth, Uptime Kuma push monitor, Tier-1 restore-verify on ubongo, semi-annual Tier-2 DR rehearsal on staging, the printed break-glass runbook + its annual drill. Blocked on: Plan 2 deployed, real VMs/staging, services with VERIFY.md, Vaultwarden live.
Write Plans 2 and 3 with this same skill when their prerequisites land. Everything below is Plan 1.
Plan 1 file map
| File | Action | Responsibility |
|---|---|---|
| docs/decisions/022-backup.md | create | ADR of record; distils the spec's Decisions 1–13 |
| docs/backup/service-backup-template.md | create | BACKUP.md template; defines the backup__* contract shape |
| .claude/commands/check-backup.md | create | Dormant verifier (mirrors check-access.md) |
| CLAUDE.md | modify | Role-conventions: BACKUP.md required for service roles; Further-reading row |
| docs/security/service-checklist.md | modify | Strengthen the Operability backup line to the ADR-022 gate |
| docs/runbooks/new-role.md | modify | Add the per-service BACKUP.md step (new §12, renumber commit) |
| docs/hardware/reference.md | modify | ubongo → M70q/1TB; add fisi node + capacity row |
| docs/CAPABILITIES.md | modify | §9: restic+rclone+USB committed; PBS deferred; ref ADR-022 |
| STATUS.md | modify | Add "Designed but not built" rows for backup role + contract |
| docs/TODO.md | modify | Mark item 3.8 decided; reference ADR-022 |
Working branch (all tasks): AI-driven multi-file change → review as one diff (CLAUDE.md git conventions).
git checkout -b feat/backup-foundation
Before any commit, confirm rbw unlocked exits 0 (the pre-commit hook decrypts vault.yml); if not, stop and ask the operator to rbw unlock.
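The guard described above can be scripted so a worker stops before the pre-commit hook fails. This is a minimal sketch, assuming rbw is on PATH; the function name and the RBW_BIN override are hypothetical and exist only so the guard can be exercised without a real vault.

```shell
# Sketch: refuse to continue while the rbw agent is locked.
# RBW_BIN is a hypothetical override for testing; by default the real rbw is called.
require_rbw_unlocked() {
  if "${RBW_BIN:-rbw}" unlocked >/dev/null 2>&1; then
    return 0
  fi
  echo "rbw is locked: ask the operator to run 'rbw unlock', then retry" >&2
  return 1
}

# Usage before each commit step:
#   require_rbw_unlocked && git commit -m "..."
```

The function only reports and returns non-zero; it never attempts to unlock, which keeps the decision with the operator as the plan requires.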
Task 1: Author ADR-022 and wire the decision into CLAUDE.md / STATUS.md / TODO.md
Files:
- Create: docs/decisions/022-backup.md
- Modify: CLAUDE.md (Further-reading table; role-conventions block)
- Modify: STATUS.md ("Designed but not built" table)
- Modify: docs/TODO.md (item 3.8)
- Step 1: Write docs/decisions/022-backup.md
Mirror the structure of docs/decisions/021-operational-access.md (## Context, ## Decision, subsections, ## Consequences). Transcribe the spec's settled decisions — do not re-derive. The ADR body must state, each as its own labelled decision:
- Recovery model A: data-only restic backups, rebuild-from-code; no PBS in v1 (deferred as Model B/C). (spec Decision 1)
- One tier, ~24 h RPO. (Decision 2)
- Engine: restic (data) + rclone (pCloud off-site); restic encrypts → rclone moves ciphertext only, no second layer. (Decision 3)
- Topology: central off-cluster pull node (fisi, provisional), 2×8 TB mirror, owns the repo, runs rclone + the USB dock; hosts hold no backup creds. New backup_hosts inventory group, base role applies. (Decision 4)
- 3-2-1 mapping incl. USB air-gap as the immutable backstop. (Decision 5)
- Per-service contract: backup__* role vars + required BACKUP.md, rendered from the data (the ADR-021 pattern). Governance reconciliation: gated via the per-service checklist + new-role runbook + dormant /check-backup verifier, not an automated lint script (consistent with ADR-021's "runbook+gate, not scaffold" choice). State this explicitly so it supersedes the design doc's "make lint gates its presence" wording. (Decision 6)
- Consistency: logical dumps first (pg_dump/mysqldump), quiesce escape hatch; FS snapshots not the sole DB method. (Decision 7)
- Restore testing: Tier-1 weekly rolling container restore-verify on ubongo (reuses VERIFY.md); Tier-2 semi-annual full DR rehearsal on staging, ≥1/yr exercises the paper break-glass. ubongo stays bare Debian, not a hypervisor (ADR-015 unchanged). (Decision 8)
- Retention (GFS): --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1. (Decision 9)
- Encryption + escrow + break-glass: one restic password protects all copies; escrowed to fisi (+vault) / Vaultwarden / paper; paper holds both the restic password and the Ansible vault password (breaks the Model-A circular dependency); mamba is the break-glass clone (ADR-015). (Decision 10)
- USB air-gap: udev serial-allowlist → restic copy to a USB restic repo → restic check → ntfy; rotate off-site. (Decision 11)
- Failure alerting: Uptime-Kuma dead-man's-switch + ntfy on failure + weekly restic check. (Decision 12)
- Schedule. (Decision 13)
## Consequences must note: pCloud is off-site but sync-coupled (deletes propagate) → USB is the only immutable copy; fisi is the crown-jewel host (full base hardening); pCloud's 1 TB is the off-site capacity ceiling. End with a one-line pointer back to the design doc and to Plans 2–3 as the build path.
- Step 2: Add the Further-reading row in CLAUDE.md
In the Further-reading table, immediately after the Operational access … 021-operational-access.md row, add:
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
- Step 3: Add the BACKUP.md role-convention in CLAUDE.md
In the "Role conventions" list, immediately after the ACCESS.md (ADR-021) bullet, add:
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
data. A stateless service records `backup__state: false` with a reason.
- Step 4: Add STATUS.md rows
In the "Designed but not built" table in STATUS.md, add two rows:
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
- Step 5: Update TODO item 3.8
In docs/TODO.md, change the item-3.8 line:
From:
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
To:
8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~
DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster
node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via
pCloud + rotated USB air-gap. Build: Plans 2–3.
- Step 6: Verify
Run: make lint
Expected: PASS (yamllint, ansible-lint, check-tags: OK …). No new YAML/tags introduced, so this confirms nothing regressed.
Run: grep -n "022-backup" CLAUDE.md && grep -rn "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md
Expected: matches in every listed file (cross-references resolve).
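grep exits 0 as soon as any one of several files matches, so the exit status of the second command does not by itself prove that every listed file carries the reference; the output has to be read. A per-file loop makes the check explicit. This is a sketch, and require_in_all is a hypothetical helper name, not an existing script in the repo.

```shell
# Fail if any listed file lacks the pattern.
# (A single grep over several files exits 0 when just one of them matches.)
require_in_all() {
  pattern=$1
  shift
  missing=0
  for f in "$@"; do
    if ! grep -q -- "$pattern" "$f" 2>/dev/null; then
      echo "MISSING: $pattern in $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage from the repo root:
#   require_in_all "022-backup" CLAUDE.md
#   require_in_all "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md
```

It is meant as an ad-hoc check during this task only, in line with the plan's choice not to add a lint script.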
- Step 7: Commit
git add docs/decisions/022-backup.md CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO"
Task 2: Create the BACKUP.md template and define the backup__* contract
Files:
- Create: docs/backup/service-backup-template.md
- Step 1: Create the template
Mirror docs/access/service-access-template.md (preamble that says copy-to-role-and-delete; structured tables rendered from data; a hand-written prose tail). Write exactly:
# Per-service backup record — template
Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
role (ADR-022). It is the per-service **backup record**: what state the service holds,
how it is captured consistently, and how it is restored. The structured parts are
**rendered from the role's `backup__*` data** (the single source of truth that also
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
than hand-editing the tables. The prose "Restore notes" tail is hand-written.
A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
`backup__state: false` with a reason in its role defaults instead.
Delete this preamble in the copy and start from the heading below.
---
# Backup — <service>
## State captured
Rendered from `backup__*`:
| What | Source | How captured |
|---|---|---|
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |
- **Quiesce:** `<backup__quiesce>` — `true` means the service is stopped → backed up →
restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).
## Restore procedure
1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
3. Replay each `<backup__dumps[*].dest>` into its database.
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).
## Restore notes
Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
known-tricky migrations.
- <none yet>
The backup__* contract this template renders from (document it here and in the ADR; the role in Plan 2 consumes it):
backup__service: <name>  # identifier; matches the role / compose project
backup__state: true  # false = stateless → no BACKUP.md (pair with a reason)
backup__paths:  # bind-mount dirs/files holding state ([] = none)
  - /srv/<service>/data
backup__dumps:  # logical app-consistent dumps (Decision 7 default; [] = none)
  - cmd: "docker compose -p <service> exec -T db pg_dump -U {{ vault.<service>.db_user }} <db>"
    dest: <service>-db.sql
backup__quiesce: false  # true = stop→back up→restart escape hatch (Decision 7 B)
- Step 2: Verify
Run: test -f docs/backup/service-backup-template.md && echo PRESENT
Expected: PRESENT
Run: make lint
Expected: PASS (markdown only; confirms no regression).
- Step 3: Commit
git add docs/backup/service-backup-template.md
git commit -m "docs(backup): add BACKUP.md template + backup__* contract (ADR-022)"
Task 3: Strengthen the per-service checklist gate
Files:
- Modify: docs/security/service-checklist.md (Operability section)
- Step 1: Replace the weak backup line with the ADR-022 gate
In the "Operability (security-adjacent)" section, replace this line:
- [ ] Backup/restore is covered if the service holds state
with (mirroring the existing ADR-021 access line directly below it):
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
`backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
reports the declared paths/dumps captured in the latest snapshot — or the service
sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
- Step 2: Verify
Run: grep -n "ADR-022" docs/security/service-checklist.md
Expected: one match (the new gate line).
Run: grep -c "Backup/restore is covered if the service holds state" docs/security/service-checklist.md
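Expected: 0 (old weak line gone). Note that grep -c exits with status 1 when the count is 0, so a non-zero exit status is the passing result here.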
- Step 3: Commit
git add docs/security/service-checklist.md
git commit -m "docs(backup): gate BACKUP.md in service checklist (ADR-022)"
Task 4: Add the BACKUP.md step to the new-role runbook
Files:
- Modify: docs/runbooks/new-role.md (insert a new step after the §11 ACCESS step; renumber the commit step)
- Step 1: Insert the new step
Immediately after the §11 "Write the per-service operational-access record" block and before "### 12. Commit", insert:
### 12. Write the per-service backup record (stateful services)
For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
`backup__paths`, `backup__dumps` — `cmd` + `dest` per logical dump — and `backup__quiesce`;
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
is rendered from that data. A **stateless** service sets `backup__state: false` with a
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
proves the declared state is captured — part of the service-clearance gate
(`docs/security/service-checklist.md`).
- Step 2: Renumber the commit step
Change the heading ### 12. Commit (now the following heading) to ### 13. Commit.
- Step 3: Verify
Run: grep -nE "^### (11|12|13)\." docs/runbooks/new-role.md
Expected: §11 access, §12 backup, §13 commit — in that order, no duplicate numbers.
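The grep above prints the three headings for a visual check. To have the ordering asserted rather than eyeballed, something like the following could be used. It is a sketch that assumes every numbered step in new-role.md is a `### N.` heading, as the headings quoted in this task are; check_step_numbering is a hypothetical name.

```shell
# Exit non-zero if the "### N." headings in a file are not strictly consecutive
# (this catches both a duplicated number and a skipped one).
check_step_numbering() {
  grep -oE '^### [0-9]+\.' "$1" \
    | grep -oE '[0-9]+' \
    | awk 'NR > 1 && $1 != prev + 1 { bad = 1 } { prev = $1 } END { exit bad + 0 }'
}

# Usage from the repo root:
#   check_step_numbering docs/runbooks/new-role.md && echo NUMBERING_OK
```

A file with no matching headings passes trivially, so this complements the grep above rather than replacing it.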
- Step 4: Commit
git add docs/runbooks/new-role.md
git commit -m "docs(backup): add BACKUP.md step to new-role runbook (ADR-022)"
Task 5: Create the dormant /check-backup verifier command
Files:
- Create: .claude/commands/check-backup.md
- Step 1: Write the command
Mirror the sibling .claude/commands/check-access.md (same frontmatter/sections, same "dormant until infra exists" framing). Write:
---
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
---
Verify that a service's **declared** backup data (`backup__*`) is actually captured in
the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
command reports `not-yet-available` rather than failing.
## Preconditions
- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
`not-yet-available` and stop.
## Checks (when live)
Load the `backup__*` data for the resolved role, then:
| Check | How | Green when |
|---|---|---|
| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
| integrity | `restic check --read-data-subset=<subset>` (sampled; the flag needs a value such as `5%` or `1/10`) | no errors |
Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
- Step 2: Verify
Run: test -f .claude/commands/check-backup.md && head -1 .claude/commands/check-backup.md
Expected: file present, first line --- (valid frontmatter).
Run: grep -n "not-yet-available" .claude/commands/check-backup.md
Expected: matches (dormancy explicit).
- Step 3: Commit
git add .claude/commands/check-backup.md
git commit -m "feat(backup): add dormant /check-backup verifier (ADR-022)"
Task 6: Update hardware reference and capabilities
Files:
- Modify: docs/hardware/reference.md (ubongo spec; new fisi node; capacity table)
- Modify: docs/CAPABILITIES.md (§9 Data & backup)
- Step 1: Update the ubongo prose block
In docs/hardware/reference.md §1, replace the placeholder ubongo Storage line with the real machine's spec:
From:
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
To:
- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
- Step 2: Add a fisi prose block
After the ubongo block in §1, add:
### fisi (backup node — outside the cluster; provisional)
- **Model / form factor:** HP Elite 600 G9 (tower)
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
- **RAM:** 16 GB+ (TBD exact)
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
- **NICs:** wired GbE
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
- Step 3: Update the machine-readable capacity table
In §4 "Node capacity", change the ubongo row disk from 250 to 1000 and add a fisi row. Keep the header and integer/decimal format intact (parsed by capacity-scan.py):
From:
| ubongo | 4 | 16 | 250 |
To:
| ubongo | 4 | 16 | 1000 |
| fisi | 4 | 16 | 8000 |
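Because the table is machine-parsed, a stray unit or a missing cell in the edited rows would only surface later, in Step 5. A rough pre-check of the row shape might look like the sketch below. It does not reproduce scripts/capacity-scan.py; the four-column "name, number, number, number" layout is an assumption inferred from the rows shown in this step, and check_capacity_rows is a hypothetical name.

```shell
# Sketch: accept only rows shaped like "| name | number | number | number |".
# The layout is inferred from the rows in this step, not from capacity-scan.py.
check_capacity_rows() {
  awk -F'|' '
    { gsub(/[ \t]/, "") }                    # drop cell padding; awk re-splits the fields
    NF != 6 || $2 == "" { bad = 1; next }    # "|a|b|c|d|" yields 6 fields, first and last empty
    { for (i = 3; i <= 5; i++) if ($i !~ /^[0-9]+(\.[0-9]+)?$/) bad = 1 }
    END { exit bad + 0 }
  '
}

# Usage from the repo root (assumes the rows start with "| ubongo" / "| fisi"):
#   grep -E '^\| (ubongo|fisi) ' docs/hardware/reference.md | check_capacity_rows && echo ROWS_OK
```

The authoritative check remains the capacity-scan.py run in Step 5; this only catches obvious shape errors earlier.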
- Step 4: Update CAPABILITIES §9
In docs/CAPABILITIES.md §9 table, replace the three backup rows:
From:
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
To:
| Backup engine | restic (data-only) | S | committed | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
| Off-site target | pCloud (via rclone) | S | committed | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
| Air-gap target | USB hard drives | S | committed | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
- Step 5: Verify
Run: make lint
Expected: PASS.
Run: python3 scripts/capacity-scan.py >/dev/null && echo CAPACITY_OK
Expected: CAPACITY_OK (the capacity table headers are still parseable; new fisi row accepted).
Run: grep -n "ADR-022" docs/CAPABILITIES.md
Expected: three matches (the updated backup rows).
- Step 6: Commit
git add docs/hardware/reference.md docs/CAPABILITIES.md
git commit -m "docs(backup): update hardware ref (ubongo M70q, add fisi) + CAPABILITIES §9 (ADR-022)"
Task 7: Final review and merge
- Step 1: Full lint + capacity sanity
Run: make lint && python3 scripts/capacity-scan.py >/dev/null && echo ALL_GREEN
Expected: ALL_GREEN.
- Step 2: Cross-reference audit
Run: grep -rln "ADR-022\|022-backup" CLAUDE.md STATUS.md docs/ .claude/
Expected: the ADR file, CLAUDE.md, STATUS.md, TODO.md, service-backup-template.md, service-checklist.md, new-role.md, hardware/reference.md, CAPABILITIES.md, and check-backup.md are all listed (no dangling reference, no file missed).
- Step 3: Merge to main and delete the branch
git checkout main
git merge --no-ff feat/backup-foundation -m "feat(backup): backup strategy foundation layer (ADR-022)"
git branch -d feat/backup-foundation
git push origin main
Self-review (completed by plan author)
- Spec coverage: All 13 decisions are recorded in ADR-022 (Task 1, Step 1). The foundation obligations of Decisions 6 (contract + BACKUP.md), 7 (dumps-first wording in template/runbook), and the doc/inventory facts (Decisions 4/8 hardware) are implemented as concrete files in Tasks 2–6. Decisions whose implementation is live infra — 1/3/9/11/12/13 (engine, retention, air-gap mechanism, alerting, schedule) and 8's restore-testing — are explicitly deferred to Plans 2–3 (see Decomposition & roadmap), not silently dropped.
- Placeholder scan: No "TBD/implement later" steps; every edit shows exact from→to text or full file content. (<service>/<name> inside template/contract bodies are intentional doc placeholders for the eventual role author, not plan gaps.)
- Consistency: backup__* field names (backup__service, backup__state, backup__paths, backup__dumps[].cmd/.dest, backup__quiesce) are identical across the ADR (Task 1), template + contract (Task 2), checklist (Task 3), runbook (Task 4), and /check-backup (Task 5). The governance triad matches ADR-021's (template / checklist line / runbook step / dormant verifier), and the "no lint script" choice is stated in both the plan header and the ADR.