boma/docs/superpowers/specs/2026-06-10-backup-strategy-design.md
sjat f5c97d1f36 docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 11:19:01 +02:00

# Design — Backup & disaster recovery strategy
- **Date:** 2026-06-10
- **Status:** Approved design — pending implementation plan
- **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up,
incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap,
all "planned")
- **Grounds:** the backup substrate that ADR-011 (update management) already leans on
("snapshot-before + backups remain the rollback mechanism", "always dumps the DB /
takes a backup first") but never defined
- **Reuses:** ADR-004 (one service = one role; per-service doc conventions),
ADR-008/017 (`VERIFY.md` per-service checks), ADR-021 (`ACCESS.md` rendered from
role `access__*` data — the same render-from-data pattern), ADR-015 (`ubongo`
recovery model; `mamba` break-glass clone)
- **Becomes:** ADR-022 (this design is the basis for that ADR)
---
## Problem
boma has no defined backup policy. The ADRs assume one exists — ADR-011 makes
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
copies live, *how* they're encrypted, or *whether restores actually work*.
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
but commits to nothing.
This design defines the policy end-to-end: recovery model, what is captured and how,
the 3-2-1 topology, encryption and key escrow with a break-glass path, restore
testing, retention, failure alerting, and the air-gap mechanism.
## Scope
- **In:** application *state* backup for boma's hosts and services; off-site and
air-gapped copies; encryption + key escrow; restore testing; failure alerting;
retention; the backup node.
- **Out (for now):** whole-VM image backup (Proxmox Backup Server) — explicitly
deferred, see Decision 1; a central-vs-per-app database decision (TODO 3.9 — this
design is agnostic to it); Prometheus backup metrics (noted as a later add).
## Decisions (as settled)
### 1. Recovery model — data-only backups, rebuild from code (Model A)
boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
Ansible re-renders the Docker Compose stack. So backups protect **state only** — DB
contents, bind-mount data dirs, Vaultwarden's vault — not whole-VM images.
To recover a host: Terraform re-provisions the VM → Ansible redeploys → restic
restores the data. **No Proxmox Backup Server.** This keeps 3-2-1 cheap, fits
pCloud's 1 TB comfortably, and turns every restore into a continuous proof that the
IaC *and* the backups both work.
Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run +
data restore, potentially hours), and it bets the repo is complete enough to rebuild
from nothing — which Tier-2 restore testing (Decision 8) exists to verify. **PBS
(Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO proves
too slow; nothing here precludes it.
### 2. One backup tier, ~24 h RPO
A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
the board. No per-data-type tiering yet — revisit once there is real-world data and
experience to justify the added machinery.
### 3. Engine — restic (data) + rclone (off-site); no PBS
- **restic** captures state into an encrypted, deduplicated repository.
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
rclone has a first-class pCloud backend).
- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
encryption layer, no pCloud "crypto folder."
### 4. Topology — central pull node (`fisi`), off the cluster
A single backup node owns the canonical restic repo. It is **off the Proxmox
cluster** — an independent failure domain, so copy 2 survives a PVE node (or the whole
cluster) dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
(off-site): a manually-provisioned physical node in its own inventory group, still
Ansible-managed (base hardening + a `backup` role).
**Pull model.** The backup node holds SSH keys to each host; per service it runs the
declared dump command remotely, pulls the declared paths read-only, then `restic`
snapshots the staged data into its *local* repo. **Hosts hold no backup credentials
and cannot reach the repo** — so a compromised or ransomwared service host cannot
delete backup history.
**Backup node assignment:** `fisi` (an HP Elite 600 G9 tower), provisionally. The
*role* ("the backup node") is load-bearing; the physical assignment may be
revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
(ZFS or mdraid → 8 TB usable, survives one disk failure; not a stripe). It owns the
repo, runs the pull orchestration, runs `rclone → pCloud`, and **docks the USB
air-gap drives** (Decision 11). Pending one hardware item: the SATA power cable from
the board/PSU to the drives. A data-only restic node is a featherweight workload, so
the G9 is comfortably over-specced.
### 5. 3-2-1 mapping
| Copy | Location | Medium | Off-site | Notes |
|---|---|---|---|---|
| 1 | Live data on each host | NVMe/SSD | no | The working data |
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Decision 9 / threat model) |
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |
≥3 copies, ≥2 media, ≥1 off-site — satisfied, with the air-gap drive as a fourth,
offline copy that no online compromise can reach.
### 6. Per-service backup contract — `backup__*` data + `BACKUP.md` (hard convention)
Almost every boma service is the same shape: a Docker bind-mount data dir + maybe a
database. Each **service role declares its backup needs** in role vars — the same
render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
```yaml
backup__paths:    # bind-mount dirs / files holding state
  - /srv/nextcloud/data
backup__dumps:    # logical app-consistent dumps (list; [] = none)
  - cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
    dest: nextcloud-db.sql
```
The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
silent).
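The per-service flow above can be sketched as a plan builder. This is a minimal sketch under stated assumptions, not boma's orchestrator: `build_pull_plan`, the `name` key, the staging directory, the repo path, the use of `rsync` for the read-only pull, and capturing each dump's stdout over SSH are all hypothetical choices made for illustration. Only the `backup__paths` / `backup__dumps` keys come from the contract.

```python
def build_pull_plan(host, service,
                    repo="/srv/restic/repo",              # hypothetical repo path on the backup node
                    staging_root="/srv/backup-staging"):  # hypothetical staging dir
    """Return the ordered steps the backup node would run for one service."""
    paths = service.get("backup__paths", [])
    dumps = service.get("backup__dumps", [])
    if not paths and not dumps:
        return []  # declared "nothing to back up"

    stage = f"{staging_root}/{host}/{service['name']}"
    plan = []
    for dump in dumps:
        # Assumption: the dump command writes to stdout, which the backup
        # node captures into its own staging dir (nothing is left on the host).
        plan.append({"argv": ["ssh", host, dump["cmd"]],
                     "stdout": f"{stage}/dumps/{dump['dest']}"})
    for path in paths:
        # --relative recreates the full source path under the staging dir.
        plan.append({"argv": ["rsync", "-a", "--relative", "--delete",
                              f"{host}:{path}", f"{stage}/paths/"],
                     "stdout": None})
    # Snapshot the staged data into the local repo, labelled by host and service.
    plan.append({"argv": ["restic", "-r", repo, "backup",
                          "--host", host, "--tag", service["name"], stage],
                 "stdout": None})
    return plan
```

Returning a plan rather than executing it keeps the ordering (dumps, then paths, then snapshot) easy to inspect and test; a real runner would still need to handle the repo password and SSH identity, which are left out here.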
**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md` /
`VERIFY.md` / `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
what state exists, what is backed up, the dump command, and the per-service **restore**
procedure. A template lives at `docs/backup/service-backup-template.md`. `make lint`
gates its presence for service roles.
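One way the lint gate could work is a presence check over the service roles. The sketch below assumes a hypothetical layout in which each role keeps its docs at `roles/<role>/<DOC>.md`; the function name and layout are illustrative, and how `make lint` is actually wired is not specified here.

```python
from pathlib import Path

# The per-service docs named above; BACKUP.md is the new addition.
REQUIRED_DOCS = ("SECURITY.md", "VERIFY.md", "ACCESS.md", "BACKUP.md")


def missing_docs(roles_dir, service_roles):
    """Map each service role to the required docs it lacks (empty dict = pass)."""
    missing = {}
    for role in service_roles:
        absent = [doc for doc in REQUIRED_DOCS
                  if not (Path(roles_dir) / role / doc).is_file()]
        if absent:
            missing[role] = absent
    return missing
```

A lint target could call this and exit non-zero whenever the returned dict is non-empty.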
### 7. Consistency — logical dumps first, quiesce as an escape hatch
- **Default (A):** databases are captured with logical dumps (`pg_dump` /
`mysqldump`) — portable, version-independent, restorable to a fresh DB. Plain data
dirs are backed up as files. No downtime. Cost: every stateful service must declare
a working dump command, *tested by restore drills*.
- **Escape hatch (B):** a service whose data cannot be dumped live declares a
quiesce step (stop container → back up volume → restart) in the same contract.
- ZFS/filesystem snapshots are **not** used as the sole DB method: a snapshot of a
live database is only crash-consistent.
This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
way, each service declares how to dump its own data.
### 8. Restore testing — two tiers
- **Tier 1 — frequent, automated, rolling restore-verify (weekly).** Pick the next
service in rotation, restore its latest snapshot into a throwaway **container on
`ubongo`** (reusing boma's existing Molecule harness, ADR-015), start the app
against the restored data, and **run that service's `VERIFY.md` checks**
(ADR-008/017) against it, then tear down. This catches the failure that actually
kills people — *silently corrupt or unrestorable backups*. Failures alert via ntfy.
- **Tier 2 — rare, full DR rehearsal (semi-annual), driven from `ubongo` onto PVE
staging.** Rebuild a host from zero via Terraform + Ansible + restic restore on the
staging cluster (only a real PVE node can host the VM; `ubongo` orchestrates). This
validates the whole Model-A recovery chain, not just "can I read a snapshot."
**At least once a year the rehearsal exercises the paper-secret break-glass path**
(Decision 10) end-to-end.
`ubongo` stays **bare Debian, not a hypervisor** (ADR-015 unchanged): its job is to be
the independent recovery anchor — "the tool used to rebuild the cluster must not live
inside the thing it rebuilds." Higher-fidelity real-VM testing is *better* served by
the PVE staging env (same hardware class, same cluster, same provisioning path) than
by converting `ubongo`. `ubongo`'s real spec is a ThinkCentre M70q (i3-10100T / 16 GB
/ **1 TB NVMe**) — the 1 TB gives ample room for Tier-1 dataset restores; disk
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).
### 9. Retention — GFS via restic
Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
Tune once real repo growth is observed.
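To get a feel for how many snapshots this policy retains, the keep rules can be modelled in a few lines. This is a simplified model of the documented `--keep-*` behaviour (keep the most recent snapshot in each of the last *n* periods that contain one, then take the union across rules), not restic's own code; `kept`, `POLICY`, and `BUCKETS` are illustrative names.

```python
from datetime import date, timedelta

POLICY = {"daily": 7, "weekly": 4, "monthly": 6, "yearly": 1}

# How each rule groups snapshots into periods.
BUCKETS = {
    "daily": lambda d: d.isoformat(),
    "weekly": lambda d: d.isocalendar()[:2],   # (ISO year, ISO week)
    "monthly": lambda d: (d.year, d.month),
    "yearly": lambda d: d.year,
}


def kept(snapshot_dates, policy=POLICY):
    """Return the snapshot dates that survive the keep rules, oldest first."""
    keep = set()
    newest_first = sorted(set(snapshot_dates), reverse=True)
    for rule, count in policy.items():
        seen = set()
        for d in newest_first:
            if len(seen) >= count:
                break
            bucket = BUCKETS[rule](d)
            if bucket not in seen:   # newest snapshot of a new period
                seen.add(bucket)
                keep.add(d)
    return sorted(keep)


# Example input: one snapshot per night for 400 nights.
nightly = [date(2026, 6, 10) - timedelta(days=i) for i in range(400)]
survivors = kept(nightly)
```

Because the rules overlap (the newest daily snapshot is also the newest weekly, monthly, and yearly one), the retained count is well below the 18 that the four numbers add up to.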
### 10. Encryption + key escrow + break-glass
restic already encrypts the repo, so **one secret — the restic repo password —
protects all copies uniformly** (fisi, pCloud, USB). One thing to escrow, not three.
**Escrow locations:**
- **`fisi`, root-only** (+ in the Ansible vault) — so backups run non-interactively
and `fisi` is redeployable.
- **Vaultwarden** — the day-to-day human-accessible copy.
- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
copy that survives "everything is down."
**Model-A twist — the paper holds *two* secrets, not one:**
1. the **restic repo password** (to read any backup at all), and
2. the **Ansible vault master password** (to rebuild hosts from the repo — normally
from Vaultwarden via `rbw`, which is itself down in a from-zero recovery).
With both on paper, the break-glass chain has **no circular dependency**: paper →
restic restores Vaultwarden + repo data → the vault password (from paper) drives
Terraform/Ansible re-provisioning → services return, `rbw` works again. `ubongo`'s
ADR-015 recovery model already establishes **`mamba` (laptop) as a break-glass clone**
(repo + toolchain + mesh + `rbw`, with Terraform state synced to it) — the rebuild can
be driven from `mamba` if `ubongo` is also gone. The printed sheet is a short
**break-glass runbook** assuming zero running boma infrastructure: install restic on
any machine, point it at pCloud *or* a USB drive with the password, restore Vaultwarden
first, then rebuild with the vault password.
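The "no circular dependency" claim can be checked mechanically by writing the break-glass chain as a dependency graph. The node names below are informal labels for the steps described above, chosen for this sketch; `recovery_order` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter

# Each key lists what must already be available before it can be obtained.
WITH_PAPER = {
    "restic_password": {"paper"},
    "vault_password": {"paper"},
    "vaultwarden_data": {"restic_password"},
    "host_rebuild": {"vault_password"},
    "services": {"host_rebuild", "vaultwarden_data"},
    "rbw": {"services"},
}

# Same chain, but with the vault password obtainable only through rbw.
WITHOUT_PAPER_VAULT_PW = dict(WITH_PAPER, vault_password={"rbw"})


def recovery_order(depends_on):
    """Return a valid recovery order, or None if the chain is circular."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None
```

With both secrets on paper an order exists and starts from `paper`; if the vault password were reachable only via `rbw`, the graph would contain the loop `vault_password → host_rebuild → services → rbw → vault_password` and no order would exist.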
### 11. USB air-gap trigger (plug-and-go cold copy)
A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
systemd unit → script that: mounts the drive, confirms it is an expected drive, runs
**`restic copy` from the local repo → a restic repo on the USB drive** (dedup-aware,
same password → ciphertext if lost/stolen), runs `restic check` on the USB copy,
unmounts, and **notifies via ntfy** with the result. Only allowlisted serials trigger
anything (a rogue USB does nothing).
`restic copy` (not rsync) so the USB is itself a valid restic repo — restorable
**directly** in a break-glass with nothing else alive. Rotate among a few drives,
**stored off-site** → also a second *geographic* off-site copy independent of pCloud.
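The script behind the systemd unit might reduce to an allowlist check followed by a fixed command sequence. The following is a sketch under stated assumptions: the serial numbers, mount point, and repo paths are placeholders, `airgap_steps` is a hypothetical name, and password handling (for example restic's `--password-file` and `--from-password-file` options) is omitted. `--from-repo` requires restic 0.14 or newer.

```python
# Hypothetical allowlist; real values would be the serials of the rotation drives.
ALLOWED_SERIALS = {"EXAMPLE-SERIAL-0001", "EXAMPLE-SERIAL-0002"}


def airgap_steps(serial, device,
                 local_repo="/srv/restic/repo",   # hypothetical path to the local repo
                 mountpoint="/mnt/airgap"):       # hypothetical mount point
    """Return the commands to run for a plugged drive, or [] if it is unknown."""
    if serial not in ALLOWED_SERIALS:
        return []  # a rogue USB does nothing
    usb_repo = f"{mountpoint}/restic"
    return [
        ["mount", device, mountpoint],
        # Copy snapshots from the local repo into the repo on the drive.
        ["restic", "-r", usb_repo, "copy", "--from-repo", local_repo],
        # Verify the copy before the drive goes back on the shelf.
        ["restic", "-r", usb_repo, "check"],
        ["umount", mountpoint],
    ]
```

For the copy to deduplicate well, the USB repo would typically be initialised once with `restic init --from-repo <local repo> --copy-chunker-params`; the ntfy notification would wrap whatever runs these steps.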
### 12. Failure alerting — guard against silent death
Success/failure pings alone miss the worst case (*the job silently stopped running*):
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
monitor** (already in the planned stack); no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job error, including a failed `predump` (the per-service dump step).
- **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
Grafana panel (fits ADR-018's monitoring direction; not required for v1).
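The dead-man's-switch logic is small enough to state directly. Uptime Kuma's push monitor performs this check itself; the sketch below only illustrates the rule, with `is_overdue` and the 25-hour default as illustrative names and values taken from the bullet above.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_success, now, grace=timedelta(hours=25)):
    """True when no successful run has been recorded within the grace window."""
    return now - last_success > grace


last_ok = datetime(2026, 6, 10, 2, 30, tzinfo=timezone.utc)
on_time = is_overdue(last_ok, last_ok + timedelta(hours=24))   # next night's run is due
too_late = is_overdue(last_ok, last_ok + timedelta(hours=26))  # a run was missed
```

The grace window is deliberately a little longer than the 24 h schedule so that a run finishing slightly later than the previous night's does not raise a false alarm.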
### 13. Schedule
- **Nightly backup run (~02:00 to 04:00),** driven by `fisi` (pull): per host →
`predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
(Decision 9) → `rclone sync` → pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
- **USB air-gap:** manual, ~monthly, whenever a drive is docked.
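The tail of the nightly run (prune, off-site sync, heartbeat) can be sketched as a sequential pipeline that stops at the first failure and only pings the heartbeat when every step succeeded. The repo path, the rclone remote name `pcloud:boma-restic`, and the function names are assumptions for illustration; the retention flags are the starting policy from Decision 9.

```python
def nightly_tail(repo="/srv/restic/repo",        # hypothetical local repo path
                 remote="pcloud:boma-restic"):   # hypothetical rclone remote
    """Commands that follow the per-host snapshots in the nightly run."""
    return [
        ["restic", "-r", repo, "forget", "--prune",
         "--keep-daily", "7", "--keep-weekly", "4",
         "--keep-monthly", "6", "--keep-yearly", "1"],
        ["rclone", "sync", repo, remote],
    ]


def run_pipeline(steps, run, on_success, on_failure):
    """Run steps in order; stop and report at the first non-zero exit code."""
    for argv in steps:
        if run(argv) != 0:
            on_failure(argv)   # e.g. send an ntfy alert
            return False
    on_success()               # e.g. ping the Uptime Kuma push monitor
    return True
```

Passing `run` in as a callable (in practice something like `lambda argv: subprocess.run(argv).returncode`) keeps the ordering logic testable without touching a real repo.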
## Architecture & data flow (nightly run)
```
                       ┌──────────────────────────────────────────┐
docker_hosts / etc.    │ fisi (backup node)                       │
┌───────────┐   SSH    │ pull orchestrator (reads backup__* )     │
│ service A │◀─────────│ 1. ssh host → run predump (pg_dump…)     │
│ + DB      │ pull RO  │ 2. pull dump + backup__paths (read-only) │
└───────────┘─────────▶│ 3. restic snapshot → local repo (mirror) │
┌───────────┐          │ 4. restic forget --prune (GFS)           │
│ service B │          │ 5. rclone sync repo → pCloud (offsite)   │
└───────────┘          │ 6. heartbeat → Uptime Kuma; errors→ntfy  │
                       └───────────────┬──────────────────────────┘
                                       │ (manual, ~monthly)
                           udev: known drive plugged
                   restic copy → USB repo (air-gap, offline)
```
Restore (Model A): Terraform re-provisions the VM → Ansible redeploys the role →
restic restores `backup__paths` + replays the dump → `VERIFY.md` confirms.
## Components & boundaries
- **`backup` role (on `fisi`):** pull orchestrator, restic repo management, retention
prune, rclone→pCloud sync, udev/air-gap unit, alerting hooks. New inventory group
(e.g. `backup_hosts`) with the `base` role applied, like `control`/`offsite_hosts`.
- **Per-service backup contract:** `backup__*` role vars + rendered `BACKUP.md`
(Decision 6); a hard convention enforced by `make lint`.
- **`ubongo`:** schedules/drives Tier-1 (local container) and Tier-2 (onto staging);
unchanged role per ADR-015.
- **Secrets:** restic password + rclone token in `fisi` (root-only) and the Ansible
vault; escrowed per Decision 10.
## Threat model / 3-2-1 honesty
- **`rclone sync` propagates deletions** — a prune, or a *malicious* wipe of `fisi`'s
repo, replicates to pCloud. pCloud is therefore the **off-site** copy but **not
immutable**. Mitigations: the **USB air-gap drive is the immutable backstop**
(offline = unreachable by any online compromise) and **pCloud's own file-version
history** is enabled as a recovery cushion.
- **Pull model** stops a compromised *service host* from touching the repo.
- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it
gets full base hardening and tight access. restic encryption means a stolen `fisi`
(or USB, or pCloud blob) yields ciphertext only.
- **pCloud's 1 TB is the smallest copy → the off-site capacity ceiling.** Data-only
backups fit for years at homelab scale; flag for `/capacity-review` if the repo
trends toward ~1 TB.
## What this changes in the repo (for the plan)
- New `backup` role + `backup_hosts` inventory group; `fisi` hardware-reference entry.
- New per-service convention: `backup__*` vars + `BACKUP.md` (template at
`docs/backup/service-backup-template.md`); `make lint` gate; update role-conventions
in `CLAUDE.md` and the new-role scaffolding/runbook.
- Update `docs/hardware/reference.md`: `ubongo` = M70q (i3-10100T/16 GB/**1 TB**);
add `fisi`.
- Update `CAPABILITIES.md` §9 (PBS → deferred; restic+rclone+USB the committed engine).
- Close `docs/TODO.md` 3.8; cross-reference from ADR-011.
- The break-glass runbook (printed sheet + `docs/runbooks/`), referencing ADR-015's
`mamba` clone and Terraform-state survival.
## Non-goals / YAGNI
- No PBS / whole-VM images in v1 (Decision 1).
- No per-data-type RPO tiering in v1 (Decision 2).
- No second encryption layer over restic (Decision 3).
- No central NAS/file-share scope creep on `fisi` — it stays single-purpose.
## Open / deferred
- Central vs per-app database (TODO 3.9) — orthogonal; this design works either way.
- Prometheus backup metrics — later add (Decision 12).
- PBS (Model B) or hybrid (Model C) — revisit if real-world RTO is too slow.