# ADR-022 — Backup & disaster recovery: data-only restic, off-cluster pull node, 3-2-1

## Status

Accepted (2026-06-10). Resolves TODO 3.8 ("ensure the right things are backed up,
incl. DB dumps") and `CAPABILITIES.md` §9 (backup engine / off-site / air-gap, all
"planned"). Grounds ADR-011's "backup-first" and "snapshot + backup" language, which
assumed a backup policy existed but never defined one.

**Doctrine ADR.** It pins the recovery model, backup engine, topology, per-service
contract, encryption/escrow, restore-testing tiers, retention, alerting, and USB
air-gap mechanism. It does **not** build any of them: the `backup` role, `fisi`
node, per-service `backup__*` declarations, and `BACKUP.md` files do not exist yet.
They are designed here and built under the implementation plan referenced at the
foot of this ADR.

## Context

boma has no defined backup policy. The ADRs assume one exists (ADR-011 makes
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
path), but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
copies live, *how* they are encrypted, or *whether restores actually work*.
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
but commits to nothing.

The gap is not just theoretical. Every boma service is stateful in some dimension:
DB contents, bind-mount data dirs, the Vaultwarden vault that holds every secret in
the stack. Without a backup policy the IaC is not reproducible from nothing; it is
reproducible-modulo-data. This ADR closes that gap.

## Decision

### 1. Recovery model: data-only backups, rebuild from code (Model A)

boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
Ansible re-renders the Docker Compose stack. Backups therefore protect **state only**
(DB contents, bind-mount data dirs, Vaultwarden's vault), not whole-VM images.

Recovery sequence: Terraform re-provisions the VM → Ansible redeploys → restic
restores the data. **No Proxmox Backup Server (PBS) in v1.** This keeps the 3-2-1
topology cheap, fits pCloud's 1 TB comfortably, and turns every restore drill into
a continuous proof that the IaC *and* the backups both work.

Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run
plus data restore, potentially hours), and it bets the repo is complete enough to
rebuild from nothing, which Tier-2 restore testing (Decision 8) exists to verify.
**PBS (Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO
proves too slow; nothing here precludes it.

### 2. One backup tier, ~24 h RPO

A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
the board. No per-data-type tiering yet; revisit once there is real-world data and
experience to justify the added machinery.

### 3. Engine: restic (data) + rclone (off-site); no second encryption layer

- **restic** captures state into an encrypted, deduplicated repository.
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
  rclone has a first-class pCloud backend).
- restic encrypts the repo at rest, so rclone copies **ciphertext only**: no second
  encryption layer, no pCloud "crypto folder."

No PBS in v1 (see Decision 1).

### 4. Topology: central pull node (`fisi`), off the cluster; `backup_hosts` group

A single backup node owns the canonical restic repo. It is **off the Proxmox cluster**,
an independent failure domain, so copy 2 survives a PVE node (or the whole cluster)
dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
(off-site): a manually-provisioned physical node in its own inventory group, still
Ansible-managed (the `base` role applies, plus a `backup` role).

**Pull model.** `fisi` holds SSH keys to each host; per service it runs the declared
dump command remotely, pulls the declared paths read-only, then `restic` snapshots the
staged data into its local repo. **Hosts hold no backup credentials and cannot reach
the repo**, so a compromised or ransomwared service host cannot delete backup history.

**Node assignment:** `fisi` (an HP Elite 600 G9 tower) is provisional. The *role*
("the backup node") is load-bearing; the physical assignment may be revisited when
all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
(ZFS or mdraid → 8 TB usable, survives one disk failure). It owns the repo, runs the
pull orchestration, runs `rclone → pCloud`, and docks the USB air-gap drives
(Decision 11).

**Inventory:** a new `backup_hosts` group is added to both inventories, structured
like `control` and `offsite_hosts`. The `base` role applies.

### 5. 3-2-1 mapping

| Copy | Location | Medium | Off-site? | Notes |
|---|---|---|---|---|
| 1 | Live data on each host | NVMe/SSD | no | The working data |
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Consequences) |
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |

≥3 copies, ≥2 media, ≥1 off-site: 3-2-1 is satisfied, with the air-gap drive as a
fourth, offline copy that no online compromise can reach.

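The 3-2-1 count can be checked mechanically. The following is a minimal sketch, not part of the planned `backup` role: the `Copy` class, the `satisfies_3_2_1` function, and the simplified medium labels are illustrative assumptions; only the four rows come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str
    offsite: bool


def satisfies_3_2_1(copies):
    """Return True for >=3 copies on >=2 distinct media with >=1 off-site."""
    media = {c.medium for c in copies}
    offsite = [c for c in copies if c.offsite]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1


# Rows transcribed from the 3-2-1 table; medium labels are simplified.
COPIES = [
    Copy("live data on each host", "nvme/ssd", offsite=False),
    Copy("fisi restic repo", "hdd mirror", offsite=False),
    Copy("pCloud via rclone", "cloud", offsite=True),
    Copy("USB air-gap drive", "removable hdd", offsite=True),
]

if __name__ == "__main__":
    print("all four copies:", satisfies_3_2_1(COPIES))
    # Losing pCloud still leaves three copies, three media, one off-site.
    print("without pCloud:", satisfies_3_2_1([c for c in COPIES if c.medium != "cloud"]))
```

Dropping either off-site copy alone still passes the check, which is the practical meaning of treating the USB drive as a fourth copy rather than as part of the minimum.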
### 6. Per-service backup contract: `backup__*` data + `BACKUP.md`; governance

Each service role declares its backup needs in role vars, the same render-from-data
pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):

```yaml
backup__service: nextcloud          # identifier; matches the role / compose project
backup__state: true                 # false = stateless → no BACKUP.md (pair with a reason)
backup__paths:                      # bind-mount dirs / files holding state ([] = none)
  - /srv/nextcloud/data
backup__dumps:                      # logical app-consistent dumps ([] = none)
  - cmd: "docker compose -p nextcloud exec -T db pg_dump -U {{ vault.nextcloud.db_user }} nextcloud"
    dest: nextcloud-db.sql
backup__quiesce: false              # true = stop→back up→restart escape hatch (Decision 7 B)
```

The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
service with **no** `backup__paths` must explicitly declare `backup__state: false` with
a reason; omission is never an implicit "nothing to back up." (`backup__state` and the
list-form `backup__dumps` are this ADR's resolution of the spec's open "declared, not
silent" point.)

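The "declared, not silent" rule lends itself to a small validator. This is a sketch under stated assumptions rather than the planned `/check-backup` verifier: `validate_contract` is a hypothetical helper, `backup__state_reason` is a hypothetical variable name (the ADR requires a reason but does not name the key), and rejecting paths or dumps on a stateless service is an added assumption.

```python
def validate_contract(contract):
    """Return a list of problems found in one service's backup__* declaration."""
    problems = []
    if not contract.get("backup__service"):
        problems.append("backup__service is missing")

    if "backup__state" not in contract:
        # Omission is never an implicit "nothing to back up".
        problems.append("backup__state must be declared explicitly")
        return problems

    paths = contract.get("backup__paths", [])
    dumps = contract.get("backup__dumps", [])

    if contract["backup__state"]:
        if not paths and not dumps:
            problems.append("stateful service declares no paths and no dumps")
        for dump in dumps:
            if not dump.get("cmd") or not dump.get("dest"):
                problems.append("each dump needs both cmd and dest")
    else:
        # Hypothetical key name: the ADR asks for a reason but does not name it.
        if not contract.get("backup__state_reason"):
            problems.append("backup__state: false needs a reason")
        # Assumption: a stateless service should not list anything to back up.
        if paths or dumps:
            problems.append("stateless service must not declare paths or dumps")
    return problems


NEXTCLOUD = {
    "backup__service": "nextcloud",
    "backup__state": True,
    "backup__paths": ["/srv/nextcloud/data"],
    "backup__dumps": [{"cmd": "pg_dump ...", "dest": "nextcloud-db.sql"}],
    "backup__quiesce": False,
}

if __name__ == "__main__":
    print(validate_contract(NEXTCLOUD))
    print(validate_contract({"backup__service": "example-app"}))
```

Returning a list of problems instead of raising on the first one suits a checklist gate, where a reviewer wants to see every gap in a role at once.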
**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
`VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
what state exists, what is backed up, the dump command, and the per-service restore
procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
service declares `backup__state: false` (with a reason) in its role vars and gets **no**
`BACKUP.md`.

**Governance: runbook + gate, not scaffold (consistent with ADR-021).** Three light
touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
service checklist (`docs/security/service-checklist.md`) gains a backup item; the
`new-role` runbook gains a fill/render/`check-backup` step (copy
`docs/backup/service-backup-template.md` into `roles/<service>/BACKUP.md` and
populate the `backup__*` data); and a checklist gate blocks service clearance until
the record exists and a restore drill confirms it (or a deviation is recorded in
`accepted-risks.md`). The dormant `/check-backup` verifier is the automated check
analogue of `/check-access` (ADR-021). **No automated lint script gates `BACKUP.md`
presence**; this is the same manual-copy-plus-review pattern the sibling records use.
The design document's "make lint gates its presence" wording is superseded by this
governance choice.

### 7. Consistency: logical dumps first; quiesce as escape hatch

- **Default:** databases are captured with logical dumps (`pg_dump` / `mysqldump`),
  which are portable, version-independent, and restorable to a fresh DB. Plain data
  dirs are backed up as files. No downtime required.
- **Escape hatch:** a service whose data cannot be dumped live declares a quiesce step
  (stop container → back up volume → restart) via `backup__quiesce` in the same contract.
- ZFS/filesystem snapshots are **not** used as the sole DB method (only
  crash-consistent for a live database).

This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
way, each service declares how to dump its own data.

### 8. Restore testing: two tiers; `ubongo` stays bare Debian

- **Tier 1: weekly, automated, rolling restore-verify.** Pick the next service in
  rotation, restore its latest snapshot into a throwaway container on `ubongo`
  (reusing the Molecule harness, ADR-015), start the app against the restored data,
  and run that service's `VERIFY.md` checks (ADR-008/017). This catches the failure
  that actually kills people: *silently corrupt or unrestorable backups*. Failures
  alert via ntfy.
- **Tier 2: semi-annual full DR rehearsal,** driven from `ubongo` onto PVE staging.
  Rebuild a host from zero via Terraform + Ansible + restic restore on the staging
  cluster. This validates the whole Model-A recovery chain. **At least once a year the
  rehearsal exercises the paper-secret break-glass path** (Decision 10) end-to-end.

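The Tier-1 rotation could be as simple as indexing a sorted service list by run number. A minimal sketch, assuming a hypothetical `next_service` helper and a weekly run counter kept by the scheduler; `example-app` is a placeholder service name, not a boma role.

```python
def next_service(services, run_index):
    """Pick the service for a given weekly run by rolling through a sorted list."""
    ordered = sorted(set(services))
    if not ordered:
        raise ValueError("no stateful services to verify")
    return ordered[run_index % len(ordered)]


def weeks_between_verifications(services):
    """With one service per weekly run, each service is re-verified every N weeks."""
    return len(set(services))


if __name__ == "__main__":
    stateful = ["vaultwarden", "nextcloud", "example-app"]  # example-app is a placeholder
    for week in range(4):
        print(week, next_service(stateful, week))
    print("weeks between verifications:", weeks_between_verifications(stateful))
```

One consequence worth noting: with N stateful services, a silently broken backup for one service can go unnoticed for up to N weeks, so the rotation interval grows with the service count.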
**`ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged).** Its role is to
be the independent recovery anchor: "the tool used to rebuild the cluster must not
live inside the thing it rebuilds." Higher-fidelity real-VM testing is better served
by the PVE staging environment (same hardware class, same cluster, same provisioning
path). `ubongo`'s 1 TB NVMe gives ample room for Tier-1 dataset restores; disk
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).

### 9. Retention: GFS via restic

Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
Tune once real repo growth is observed.

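To see what the starting policy retains, the bucket logic can be approximated in a few lines. This is a simplified model for illustration, not restic's implementation: `kept_snapshots` is a hypothetical helper that keeps the newest snapshot in each of the most recent N days, ISO weeks, months, and years, and it assumes at most one snapshot per day.

```python
from datetime import date, timedelta

POLICY = {"daily": 7, "weekly": 4, "monthly": 6, "yearly": 1}

BUCKET_KEYS = {
    "daily": lambda d: (d.year, d.month, d.day),
    "weekly": lambda d: tuple(d.isocalendar())[:2],  # ISO year + ISO week number
    "monthly": lambda d: (d.year, d.month),
    "yearly": lambda d: (d.year,),
}


def kept_snapshots(snapshot_dates, policy=POLICY):
    """Return the snapshot dates retained by a keep-daily/weekly/monthly/yearly policy."""
    newest_first = sorted(snapshot_dates, reverse=True)
    kept = set()
    for period, count in policy.items():
        key = BUCKET_KEYS[period]
        seen = set()
        for snap in newest_first:
            bucket = key(snap)
            if bucket in seen:
                continue  # an newer snapshot already represents this bucket
            if len(seen) == count:
                break  # this period's quota is used up
            seen.add(bucket)
            kept.add(snap)
    return kept


if __name__ == "__main__":
    last_run = date(2026, 6, 10)
    nightly = [last_run - timedelta(days=n) for n in range(400)]
    for snap in sorted(kept_snapshots(nightly), reverse=True):
        print(snap.isoformat())
```

Because one snapshot can satisfy several periods at once, the policy never retains more than 7 + 4 + 6 + 1 = 18 snapshots, and usually fewer.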
### 10. Encryption + key escrow + break-glass

restic encrypts the repo at rest, so **one secret, the restic repo password,
protects all copies uniformly** (`fisi`, pCloud, USB). One thing to escrow, not three.

**Escrow locations:**

- **`fisi`, root-only** (plus in the Ansible vault), so backups run non-interactively
  and `fisi` is redeployable.
- **Vaultwarden**: the day-to-day human-accessible copy.
- **Paper, in a physical safe (off-site)**: the break-glass root of trust; the only
  copy that survives "everything is down."

**The paper holds *two* secrets:** (1) the **restic repo password** (to read any
backup at all) and (2) the **Ansible vault master password** (to rebuild hosts from
the repo; it normally comes from Vaultwarden via `rbw`, which is itself down in a
from-zero recovery). With both on paper, the break-glass chain has **no circular dependency**:
paper → restic restores Vaultwarden + repo data → the vault password (from paper)
drives Terraform/Ansible re-provisioning → services return, `rbw` works again.

**`mamba` (laptop) is the break-glass clone** (ADR-015): repo + toolchain + mesh +
`rbw`, with Terraform state synced to it, so the rebuild can be driven from `mamba` if
`ubongo` is also gone. The paper sheet doubles as a short break-glass runbook assuming
zero running boma infrastructure: install restic on any machine, point it at pCloud
*or* a USB drive with the password, restore Vaultwarden first, then rebuild with the
vault password.

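The "no circular dependency" claim can be illustrated with a topological sort over a simplified dependency graph. The node names and the `recovery_order` helper below are illustrative assumptions; the graph is a coarse model of the break-glass chain, not an inventory of real components.

```python
from graphlib import CycleError, TopologicalSorter

# Each key needs everything in its set before it is usable in a from-zero recovery.
WITH_PAPER = {
    "restic_password": {"paper"},
    "vault_password": {"paper"},
    "restored_data": {"restic_password"},
    "rebuilt_services": {"vault_password", "restored_data"},
    "rbw": {"rebuilt_services"},
}

# Counterfactual: the vault password is reachable only through rbw.
VAULT_PASSWORD_ONLY_IN_RBW = dict(WITH_PAPER, vault_password={"rbw"})


def recovery_order(dependencies):
    """Return one valid recovery order, or None if the chain is circular."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        return None


if __name__ == "__main__":
    print(recovery_order(WITH_PAPER))
    print(recovery_order(VAULT_PASSWORD_ONLY_IN_RBW))
```

In the counterfactual graph the vault password needs `rbw`, `rbw` needs rebuilt services, and rebuilt services need the vault password, so no order exists. Putting the vault password on paper is what breaks that loop.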
### 11. USB air-gap: plug-and-go cold copy

A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
systemd unit / script that: mounts the drive, confirms it is an expected drive, runs
**`restic copy` from the local repo → a restic repo on the USB drive** (same
password → ciphertext if lost/stolen), runs `restic check` on the USB copy, unmounts,
and **notifies via ntfy** with the result. Only allowlisted serials trigger anything;
a rogue USB does nothing.

`restic copy` (not rsync) so the USB is itself a valid restic repo, restorable
directly in a break-glass with nothing else alive. Drives are rotated and **stored
off-site**, a second geographic off-site copy independent of pCloud.

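The allowlist gate and the two restic steps might be assembled as follows. This is a sketch only: `airgap_commands` and the placeholder serials are hypothetical, mounting, unmounting, and the ntfy notification are omitted, and the `--from-repo` form of `restic copy` assumes a restic version of 0.14 or later. Password handling (for example via restic's password-file options) is left to the caller.

```python
# Placeholder serials; the real allowlist would hold the serials of the rotated drives.
ALLOWED_SERIALS = frozenset({"EXAMPLE-SERIAL-A", "EXAMPLE-SERIAL-B"})


def airgap_commands(serial, local_repo, usb_repo, allowed=ALLOWED_SERIALS):
    """Return the restic commands for a docked drive, or [] for an unknown drive."""
    if serial not in allowed:
        return []  # a rogue USB does nothing
    return [
        # Copy snapshots from the local repo into the restic repo on the USB drive.
        ["restic", "--repo", usb_repo, "copy", "--from-repo", local_repo],
        # Verify the USB repo before the drive is unmounted and taken off-site.
        ["restic", "--repo", usb_repo, "check"],
    ]


if __name__ == "__main__":
    for argv in airgap_commands("EXAMPLE-SERIAL-A", "/srv/restic", "/mnt/usb/restic"):
        print(" ".join(argv))
    print(airgap_commands("UNKNOWN-SERIAL", "/srv/restic", "/mnt/usb/restic"))
```

Building argument lists rather than shell strings keeps the drive path and repo path out of shell interpolation, which matters for a script that udev triggers as root.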
### 12. Failure alerting: guard against silent death

Success/failure pings alone miss the worst case (*the job silently stopped running*):

- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
  monitor**; no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job or dump-step error.
- **Weekly `restic check`** for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
  Grafana panel (fits ADR-018's monitoring direction; not required for v1).

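The dead-man's-switch in the list above inverts the usual check: silence, not an error, raises the alert. Uptime Kuma would evaluate this on its side; the sketch below only illustrates the rule, and `is_overdue` and `MAX_SILENCE` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

MAX_SILENCE = timedelta(hours=25)  # one nightly interval plus about an hour of slack


def is_overdue(last_success, now, max_silence=MAX_SILENCE):
    """True when no successful run has been reported within the allowed window."""
    if last_success is None:
        return True  # never reported: treat as dead, not as healthy
    return now - last_success > max_silence


if __name__ == "__main__":
    now = datetime(2026, 6, 12, 9, 0, tzinfo=timezone.utc)
    last_night = datetime(2026, 6, 12, 3, 0, tzinfo=timezone.utc)
    two_nights_ago = datetime(2026, 6, 11, 3, 0, tzinfo=timezone.utc)
    print(is_overdue(last_night, now))
    print(is_overdue(two_nights_ago, now))
```

The window is deliberately longer than 24 h so that a run finishing slightly later than the previous night's does not trigger a false alarm.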
### 13. Schedule

- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
  → `rclone sync` → pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
- **USB air-gap:** manual, approximately monthly, whenever a drive is docked.

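The nightly sequence can be expressed as an ordered plan built from a service's `backup__*` contract. This is a rough sketch, not the planned orchestrator: `Step`, `service_steps`, `closing_steps`, the staging layout, the use of rsync for the pull, and the `pcloud:boma-restic` remote name are all assumptions. Staging directories are assumed to exist already.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    argv: List[str]
    stdout_to: Optional[str] = None  # file that captures stdout, if any


def service_steps(host, contract, staging_root, repo):
    """Ordered steps for one service: dumps, read-only pull, then a restic snapshot."""
    service = contract["backup__service"]
    staging = f"{staging_root}/{service}"
    steps = []
    for dump in contract.get("backup__dumps", []):
        # The dump runs on the host; its stdout is captured on the backup node.
        steps.append(Step(["ssh", host, dump["cmd"]], f"{staging}/{dump['dest']}"))
    for path in contract.get("backup__paths", []):
        # Pulling from the host never writes to the source path.
        steps.append(Step(["rsync", "-a", "--delete", f"{host}:{path}/", f"{staging}{path}/"]))
    steps.append(Step(["restic", "--repo", repo, "backup", staging, "--host", host, "--tag", service]))
    return steps


def closing_steps(repo, remote):
    """Steps that run once per night after every service has been snapshotted."""
    return [
        Step(["restic", "--repo", repo, "forget", "--prune",
              "--keep-daily", "7", "--keep-weekly", "4",
              "--keep-monthly", "6", "--keep-yearly", "1"]),
        Step(["rclone", "sync", repo, remote]),
    ]


if __name__ == "__main__":
    contract = {
        "backup__service": "nextcloud",
        "backup__paths": ["/srv/nextcloud/data"],
        "backup__dumps": [{"cmd": "pg_dump ...", "dest": "nextcloud-db.sql"}],
    }
    plan = service_steps("nextcloud-host", contract, "/var/backup/staging", "/srv/restic")
    plan += closing_steps("/srv/restic", "pcloud:boma-restic")  # hypothetical remote name
    for step in plan:
        print(" ".join(step.argv), "->", step.stdout_to)
```

Separating plan construction from execution makes the order easy to review and to dry-run before anything touches a host or the repo.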
## Consequences

- boma now has a defined, end-to-end backup policy that closes the gap ADR-011 left
  open; "backup-first" and "snapshot + backup" are no longer assumed.
- Every service role that holds state must declare its backup contract (`backup__*`
  vars + `BACKUP.md`); stateless services declare `backup__state: false`. Cost:
  per-service declarations and a rendered doc to maintain (mitigated by the new-role
  runbook step + checklist gate).
- **pCloud is off-site but sync-coupled**: `rclone sync` propagates deletions (a
  prune, or a malicious wipe of `fisi`'s repo, replicates to pCloud). The **USB
  air-gap drive is the only truly immutable copy**; pCloud's own file-version history
  is enabled as a secondary cushion.
- **`fisi` is the crown-jewel host**: it holds an encrypted copy of all state, so it
  receives full `base` hardening and tight access. restic encryption means a stolen
  `fisi`, USB drive, or pCloud blob yields ciphertext only.
- **pCloud's 1 TB is the off-site capacity ceiling.** Data-only backups fit for years
  at homelab scale; flag for `/capacity-review` if the repo trends toward ~1 TB.
- Recovery time under Model A (full Ansible run + data restore) is potentially hours,
  slower than a VM-image restore. PBS/Model B is explicitly deferred, not rejected.
- The paper break-glass must be kept current (restic password + vault password). An
  outdated paper sheet is the one failure mode this ADR cannot prevent mechanically;
  the DR rehearsal, which exercises the paper path at least once a year (Decision 8),
  is the human control.

Full design rationale and worked examples: `docs/superpowers/specs/2026-06-10-backup-strategy-design.md`.
Build path (roles, topology, tests): `docs/superpowers/plans/2026-06-10-backup-strategy.md`.

## Related

ADR-002 (security baseline: hardening applied to `fisi`), ADR-004 (one service = one
role; per-service doc conventions), ADR-008 (testing methodology; Molecule harness
reused for Tier-1), ADR-011 (update management: backup-first rule now grounded),
ADR-015 (`ubongo` recovery model; `mamba` break-glass clone; bare-Debian invariant),
ADR-017 (`VERIFY.md` checks reused in Tier-1 restore-verify), ADR-018 (logging/Alloy
→ ntfy alerting path), ADR-019 (Proxmox tags; `backup_hosts` group), ADR-021
(render-from-data pattern: `access__*`/`ACCESS.md` → `backup__*`/`BACKUP.md`;
runbook+gate governance model).