boma/docs/decisions/022-backup.md
sjat ed6d5463aa docs(backup): final-review fixes — stateless BACKUP.md, dump-step wording, spec sync
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:32:06 +02:00

# ADR-022 — Backup & disaster recovery: data-only restic, off-cluster pull node, 3-2-1
## Status
Accepted (2026-06-10). Resolves TODO 3.8 ("ensure the right things are backed up,
incl. DB dumps") and `CAPABILITIES.md` §9 (backup engine / off-site / air-gap, all
"planned"). Grounds ADR-011's "backup-first" and "snapshot + backup" language, which
assumed a backup policy existed but never defined one.
**Doctrine ADR.** It pins the recovery model, backup engine, topology, per-service
contract, encryption/escrow, restore-testing tiers, retention, alerting, and USB
air-gap mechanism. It does **not** build any of them — the `backup` role, `fisi`
node, per-service `backup__*` declarations, and `BACKUP.md` files do not exist yet.
Designed now, built in the implementation plan referenced at the foot of this ADR.
## Context
boma has no defined backup policy. The ADRs assume one exists — ADR-011 makes
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
copies live, *how* they are encrypted, or *whether restores actually work*.
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
but commits to nothing.
The gap is not just theoretical. Every boma service is stateful in some dimension:
DB contents, bind-mount data dirs, the Vaultwarden vault that holds every secret in
the stack. Without a backup policy the IaC is not reproducible from nothing; it is
reproducible-modulo-data. This ADR closes that gap.
## Decision
### 1. Recovery model — data-only backups, rebuild from code (Model A)
boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
Ansible re-renders the Docker Compose stack. Backups therefore protect **state only**
(DB contents, bind-mount data dirs, Vaultwarden's vault), not whole-VM images.
Recovery sequence: Terraform re-provisions the VM → Ansible redeploys → restic
restores the data. **No Proxmox Backup Server (PBS) in v1.** This keeps the 3-2-1
topology cheap, fits pCloud's 1 TB comfortably, and turns every restore drill into
a continuous proof that the IaC *and* the backups both work.
Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run
plus data restore, potentially hours), and it bets the repo is complete enough to
rebuild from nothing — which Tier-2 restore testing (Decision 8) exists to verify.
**PBS (Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO
proves too slow; nothing here precludes it.
### 2. One backup tier, ~24 h RPO
A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
the board. No per-data-type tiering yet — revisit once there is real-world data and
experience to justify the added machinery.
### 3. Engine — restic (data) + rclone (off-site); no second encryption layer
- **restic** captures state into an encrypted, deduplicated repository.
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
rclone has a first-class pCloud backend).
- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
encryption layer, no pCloud "crypto folder."
No PBS in v1 (see Decision 1).
### 4. Topology — central pull node (`fisi`), off the cluster; `backup_hosts` group
A single backup node owns the canonical restic repo. It is **off the Proxmox cluster**
— an independent failure domain, so copy 2 survives a PVE node (or the whole cluster)
dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
(off-site): a manually-provisioned physical node in its own inventory group, still
Ansible-managed (the `base` role applies, plus a `backup` role).
**Pull model.** `fisi` holds SSH keys to each host; per service it runs the declared
dump command remotely, pulls the declared paths read-only, then `restic` snapshots the
staged data into its local repo. **Hosts hold no backup credentials and cannot reach
the repo** — a compromised or ransomwared service host cannot delete backup history.
**Node assignment:** `fisi` (an HP Elite 600 G9 tower) is a provisional choice. The
*role* ("the backup node") is load-bearing; the physical assignment may be
revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
(ZFS or mdraid → 8 TB usable, survives one disk failure). It owns the repo, runs the
pull orchestration, runs `rclone → pCloud`, and docks the USB air-gap drives
(Decision 11).
**Inventory:** a new `backup_hosts` group is added to both inventories, structured
like `control` and `offsite_hosts`. The `base` role applies.
### 5. 3-2-1 mapping
| Copy | Location | Medium | Off-site? | Notes |
|---|---|---|---|---|
| 1 | Live data on each host | NVMe/SSD | no | The working data |
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Consequences) |
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |

≥3 copies, ≥2 media, ≥1 off-site: 3-2-1 is satisfied, with the air-gap drive as a
fourth, offline copy that no online compromise can reach.
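
The 3-2-1 claim can be checked mechanically. The sketch below transcribes the table rows into data and tests the three thresholds; the `Copy` class and `satisfies_3_2_1` helper are illustrative names, not part of boma.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str
    offsite: bool


# Rows of the mapping table, transcribed as data.
COPIES = [
    Copy("live data on each host", "NVMe/SSD", offsite=False),
    Copy("fisi restic repo", "HDD mirror", offsite=False),
    Copy("pCloud via rclone", "cloud", offsite=True),
    Copy("USB air-gap drive", "removable HDD", offsite=True),
]


def satisfies_3_2_1(copies):
    """True when there are >=3 copies on >=2 media with >=1 off-site."""
    media = {c.medium for c in copies}
    return len(copies) >= 3 and len(media) >= 2 and any(c.offsite for c in copies)
```

Dropping the USB row still passes the check, which matches the table's framing of the air-gap drive as an extra fourth copy rather than a requirement of 3-2-1.
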
### 6. Per-service backup contract — `backup__*` data + `BACKUP.md`; governance
Each service role declares its backup needs in role vars — the same render-from-data
pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
```yaml
backup__service: nextcloud   # identifier; matches the role / compose project
backup__state: true          # false = stateless → no BACKUP.md (pair with a reason)
backup__paths:               # bind-mount dirs / files holding state ([] = none)
  - /srv/nextcloud/data
backup__dumps:               # logical app-consistent dumps ([] = none)
  - cmd: "docker compose -p nextcloud exec -T db pg_dump -U {{ vault.nextcloud.db_user }} nextcloud"
    dest: nextcloud-db.sql
backup__quiesce: false       # true = stop → back up → restart (Decision 7, escape hatch)
```
The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
service with **no** `backup__paths` and no `backup__dumps` must explicitly declare
`backup__state: false` with a reason; omission is never an implicit "nothing to back
up." (`backup__state` and the
list-form `backup__dumps` are this ADR's resolution of the spec's open "declared, not
silent" point.)
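
The "declared, not silent" rule lends itself to a small validator. The following is a minimal sketch, not the planned `/check-backup` verifier: the function name is invented, and `backup__state_reason` is a hypothetical key, since the contract above says a reason is required but does not name the variable that carries it.

```python
def validate_backup_contract(role_vars):
    """Return a list of problems; an empty list means the contract is acceptable."""
    if "backup__state" not in role_vars:
        # Omission is never an implicit "nothing to back up".
        return ["backup__state must be declared explicitly"]

    problems = []
    paths = role_vars.get("backup__paths", [])
    dumps = role_vars.get("backup__dumps", [])

    if role_vars["backup__state"]:
        if not paths and not dumps:
            problems.append("stateful service declares no paths and no dumps")
        for dump in dumps:
            if not dump.get("cmd") or not dump.get("dest"):
                problems.append("each dump needs both cmd and dest")
    else:
        if paths or dumps:
            problems.append("stateless service must not declare paths or dumps")
        # Hypothetical key name: the ADR requires a reason but does not name it.
        if not role_vars.get("backup__state_reason"):
            problems.append("stateless service must give a reason")
    return problems
```

Returning a list of problems rather than raising on the first one keeps the output usable as a review checklist for a role.
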
**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
`VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
what state exists, what is backed up, the dump command, and the per-service restore
procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
service declares `backup__state: false` (with a reason) in its role vars and gets **no**
`BACKUP.md`.
**Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
service checklist (`docs/security/service-checklist.md`) gains a backup item; the
`new-role` runbook gains a fill/render/`check-backup` step (copy
`docs/backup/service-backup-template.md` into `roles/<service>/BACKUP.md` and
populate the `backup__*` data); and a checklist gate blocks service clearance until
the record exists and a restore drill confirms it (or a deviation is recorded in
`accepted-risks.md`). The dormant `/check-backup` verifier is the automated
counterpart of `/check-access` (ADR-021). **No automated lint script gates `BACKUP.md`
presence** — same manual-copy-plus-review pattern the sibling records use. The design
document's "make lint gates its presence" wording is superseded by this governance
choice.
### 7. Consistency — logical dumps first; quiesce as escape hatch
- **Default:** databases are captured with logical dumps (`pg_dump` / `mysqldump`) —
portable, version-independent, restorable to a fresh DB. Plain data dirs are backed
up as files. No downtime required.
- **Escape hatch:** a service whose data cannot be dumped live declares a quiesce step
(stop container → back up volume → restart) via `backup__quiesce` in the same contract.
- ZFS/filesystem snapshots are **not** used as the sole DB method (only
crash-consistent for a live database).
This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
way, each service declares how to dump its own data.
### 8. Restore testing — two tiers; `ubongo` stays bare Debian
- **Tier 1 — weekly, automated, rolling restore-verify.** Pick the next service in
rotation, restore its latest snapshot into a throwaway container on `ubongo`
(reusing the Molecule harness, ADR-015), start the app against the restored data,
and run that service's `VERIFY.md` checks (ADR-008/017). This catches the failure
that actually defeats recoveries: *silently corrupt or unrestorable backups*. Failures
alert via ntfy.
- **Tier 2 — semi-annual full DR rehearsal,** driven from `ubongo` onto PVE staging.
Rebuild a host from zero via Terraform + Ansible + restic restore on the staging
cluster. This validates the whole Model-A recovery chain. **At least once a year the
rehearsal exercises the paper-secret break-glass path** (Decision 10) end-to-end.
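
One stateless way to implement the Tier-1 "next service in rotation" pick is to derive it from the calendar week, so no rotation state has to be stored. This is a sketch under that assumption; the ADR does not say how the rotation is tracked, and `service_for_week` is an illustrative name.

```python
from datetime import date


def service_for_week(services, day):
    """Pick one service per calendar week, cycling through the sorted names."""
    ordered = sorted(services)
    if not ordered:
        raise ValueError("no services to verify")
    # date.toordinal() counts days from 0001-01-01 (a Monday), so this index
    # increases by one every Monday and never resets at year boundaries.
    week_index = (day.toordinal() - 1) // 7
    return ordered[week_index % len(ordered)]
```

With N services, each one is restore-verified every N weeks. Adding or removing a service shifts the sequence, which is acceptable for a rolling check but worth knowing when reading the history.
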
**`ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged).** Its role is to
be the independent recovery anchor — "the tool used to rebuild the cluster must not
live inside the thing it rebuilds." Higher-fidelity real-VM testing is better served
by the PVE staging environment (same hardware class, same cluster, same provisioning
path). `ubongo`'s 1 TB NVMe gives ample room for Tier-1 dataset restores; disk
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).
### 9. Retention — GFS via restic
Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
Tune once real repo growth is observed.
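
The starting policy can be kept as data and turned into the `restic forget` invocation. A minimal sketch follows; the repository path used in any call is a placeholder, and the helper names are illustrative.

```python
# Starting GFS policy from this decision.
RETENTION = {"daily": 7, "weekly": 4, "monthly": 6, "yearly": 1}


def forget_command(repo, policy=RETENTION):
    """Build the argv for a restic forget run that also prunes."""
    cmd = ["restic", "-r", repo, "forget"]
    for bucket, count in policy.items():
        cmd += [f"--keep-{bucket}", str(count)]
    cmd.append("--prune")
    return cmd


def max_snapshots_kept(policy=RETENTION):
    """Upper bound on snapshots retained per snapshot group.

    One snapshot can satisfy several keep rules at once, so the real
    count is usually lower than this sum.
    """
    return sum(policy.values())
```

For the starting policy the bound is 7 + 4 + 6 + 1 = 18 snapshots per group. restic groups snapshots by host and paths by default, so the bound applies to each group separately.
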
### 10. Encryption + key escrow + break-glass
restic encrypts the repo at rest, so **one secret — the restic repo password —
protects all copies uniformly** (`fisi`, pCloud, USB). One thing to escrow, not three.
**Escrow locations:**
- **`fisi`, root-only** (plus in the Ansible vault) — so backups run non-interactively
and `fisi` is redeployable.
- **Vaultwarden** — the day-to-day human-accessible copy.
- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
copy that survives "everything is down."
**The paper holds *two* secrets:** (1) the **restic repo password** (to read any
backup at all) and (2) the **Ansible vault master password** (to rebuild hosts from
the repo — normally from Vaultwarden via `rbw`, which is itself down in a from-zero
recovery). With both on paper, the break-glass chain has **no circular dependency**:
paper → restic restores Vaultwarden + repo data → the vault password (from paper)
drives Terraform/Ansible re-provisioning → services return, `rbw` works again.
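
The "no circular dependency" argument can be illustrated as a dependency graph. The sketch below is a simplified model of the chain described above, with node names chosen for illustration; it uses Python's standard `graphlib` (3.9 or later) to show that the chain has a valid order with paper and a cycle without it.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the items in its set.
WITH_PAPER = {
    "restic_password": {"paper"},
    "vault_password": {"paper"},
    "restored_data": {"restic_password"},
    "rebuilt_hosts": {"vault_password", "restored_data"},
    "rbw": {"rebuilt_hosts"},
}

# Without the paper copy, the vault password is only reachable through rbw,
# which itself needs the rebuilt hosts.
WITHOUT_PAPER = {**WITH_PAPER, "vault_password": {"rbw"}}


def recovery_order(deps):
    """Return a valid recovery order, or None if the dependencies are circular."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None
```

`recovery_order(WITH_PAPER)` starts from `paper`, while `recovery_order(WITHOUT_PAPER)` returns `None` because the vault password, `rbw`, and the rebuilt hosts each wait on one another.
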
**`mamba` (laptop) is the break-glass clone** (ADR-015): repo + toolchain + mesh +
`rbw`, with Terraform state synced to it — the rebuild can be driven from `mamba` if
`ubongo` is also gone. The paper sheet doubles as a short break-glass runbook assuming
zero running boma infrastructure: install restic on any machine, point it at pCloud
*or* a USB drive with the password, restore Vaultwarden first, then rebuild with the
vault password.
### 11. USB air-gap — plug-and-go cold copy
A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
systemd unit / script that: mounts the drive, confirms it is an expected drive, runs
**`restic copy` from the local repo → a restic repo on the USB drive** (same
password → ciphertext if lost/stolen), runs `restic check` on the USB copy, unmounts,
and **notifies via ntfy** with the result. Only allowlisted serials trigger anything —
a rogue USB does nothing.
`restic copy` (not rsync) so the USB is itself a valid restic repo, restorable
directly in a break-glass with nothing else alive. Drives are rotated and **stored
off-site** — a second geographic off-site copy independent of pCloud.
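
The air-gap sequence can be sketched as a planner that emits nothing for unknown drives. This is an illustration, not the planned udev-triggered unit: the function name, the repo directory on the drive, and all argument values are assumptions. `--from-repo` is the source-repository flag of `restic copy` in restic 0.14 and later.

```python
def usb_copy_plan(serial, allowlist, device, local_repo, mountpoint):
    """Return the ordered commands for one air-gap copy, or [] for unknown drives."""
    if serial not in allowlist:
        # A rogue USB does nothing.
        return []
    # Assumed layout: the restic repo lives in a "restic" directory on the drive.
    usb_repo = f"{mountpoint}/restic"
    return [
        ["mount", device, mountpoint],
        ["restic", "-r", usb_repo, "copy", "--from-repo", local_repo],
        ["restic", "-r", usb_repo, "check"],
        ["umount", mountpoint],
        # The ntfy notification with the result would follow here.
    ]
```

Keeping the allowlist test as the first statement means no mount or restic command is even constructed for a drive that is not expected.
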
### 12. Failure alerting — guard against silent death
Success/failure pings alone miss the worst case (*the job silently stopped running*):
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
monitor**; no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job or dump-step error.
- **Weekly `restic check`** for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
Grafana panel (fits ADR-018's monitoring direction; not required for v1).
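
The dead-man's-switch rule reduces to one comparison. The sketch below only illustrates the rule that the push monitor is expected to apply; the names are illustrative and nothing here calls Uptime Kuma.

```python
from datetime import datetime, timedelta, timezone

# Nightly cadence (24 h) plus about one hour of slack for run-time variation.
DEADLINE = timedelta(hours=25)


def backup_is_overdue(last_success, now, deadline=DEADLINE):
    """True when the last successful run is older than the allowed gap."""
    return now - last_success > deadline
```

Because the check keys on the absence of a success ping, it fires both when a run fails and when the job never starts, which is the case that failure notifications alone cannot catch.
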
### 13. Schedule
- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune` →
`rclone sync` → pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
- **USB air-gap:** manual, approximately monthly, whenever a drive is docked.
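
The nightly sequence can be sketched as a planner that turns service declarations into ordered steps. Several details are assumptions made for illustration: the `host` key (the contract has no such field; it would come from inventory), `rsync` as the pull tool, the staging layout, and every path and remote name passed in.

```python
# Retention flags from Decision 9.
KEEP = ["--keep-daily", "7", "--keep-weekly", "4",
        "--keep-monthly", "6", "--keep-yearly", "1"]


def nightly_plan(services, repo, staging, remote):
    """Return the ordered (phase, argv) steps for one nightly run."""
    steps = []
    for svc in services:
        host, name = svc["host"], svc["backup__service"]
        stage = f"{staging}/{name}"
        for dump in svc.get("backup__dumps", []):
            # Assumption: the orchestrator captures stdout into stage/dest.
            steps.append(("dump", ["ssh", host, dump["cmd"]]))
        for path in svc.get("backup__paths", []):
            # rsync is an assumed pull tool; the ADR only says "pull read-only".
            steps.append(("pull", ["rsync", "-a", f"{host}:{path}", stage]))
        steps.append(("snapshot", ["restic", "-r", repo, "backup", stage, "--tag", name]))
    # Prune and replicate once, after every service has been snapshotted.
    steps.append(("prune", ["restic", "-r", repo, "forget", *KEEP, "--prune"]))
    steps.append(("replicate", ["rclone", "sync", repo, remote]))
    return steps
```

Running the prune before `rclone sync` means the pruned repository is what pCloud receives, in line with Decision 9.
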
## Consequences
- boma now has a defined, end-to-end backup policy that closes the gap ADR-011 left
open; "backup-first" and "snapshot + backup" are no longer assumed.
- Every service role that holds state must declare its backup contract (`backup__*`
vars + `BACKUP.md`); stateless services declare `backup__state: false`. Cost:
per-service declarations and a rendered doc to maintain (mitigated by the new-role
runbook step + checklist gate).
- **pCloud is off-site but sync-coupled** — `rclone sync` propagates deletions (a
prune, or a malicious wipe of `fisi`'s repo, replicates to pCloud). The **USB
air-gap drive is the only truly immutable copy**; pCloud's own file-version history
is enabled as a secondary cushion.
- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it
receives full `base` hardening and tight access. restic encryption means a stolen
`fisi`, USB drive, or pCloud blob yields ciphertext only.
- **pCloud's 1 TB is the off-site capacity ceiling.** Data-only backups fit for years
at homelab scale; flag for `/capacity-review` if the repo trends toward ~1 TB.
- Recovery time under Model A (full Ansible run + data restore) is potentially hours —
slower than a VM-image restore. PBS/Model B is explicitly deferred, not rejected.
- The paper break-glass must be kept current (restic password + vault password). An
outdated paper sheet is the one failure mode this ADR cannot prevent mechanically —
the semi-annual DR rehearsal is the human control.
Full design rationale and worked examples: `docs/superpowers/specs/2026-06-10-backup-strategy-design.md`.
Build path (roles, topology, tests): `docs/superpowers/plans/2026-06-10-backup-strategy.md`.
## Related
ADR-002 (security baseline: hardening applied to `fisi`), ADR-004 (one service = one
role; per-service doc conventions), ADR-008 (testing methodology; Molecule harness
reused for Tier-1), ADR-011 (update management: backup-first rule now grounded),
ADR-015 (`ubongo` recovery model; `mamba` break-glass clone; bare-Debian invariant),
ADR-017 (`VERIFY.md` checks reused in Tier-1 restore-verify), ADR-018 (logging/Alloy
→ ntfy alerting path), ADR-019 (Proxmox tags; `backup_hosts` group), ADR-021
(render-from-data pattern: `access__*`/`ACCESS.md` → `backup__*`/`BACKUP.md`;
runbook+gate governance model).