# Design — Backup & disaster recovery strategy

- **Date:** 2026-06-10
- **Status:** Approved design — pending implementation plan
- **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up,
  incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap,
  all "planned")
- **Grounds:** the backup substrate that ADR-011 (update management) already leans on
  ("snapshot-before + backups remain the rollback mechanism", "always dumps the DB /
  takes a backup first") but never defined
- **Reuses:** ADR-004 (one service = one role; per-service doc conventions),
  ADR-008/017 (`VERIFY.md` per-service checks), ADR-021 (`ACCESS.md` rendered from
  role `access__*` data — the same render-from-data pattern), ADR-015 (`ubongo`
  recovery model; `mamba` break-glass clone)
- **Becomes:** ADR-022 (this design is the basis for that ADR)

---

## Problem

boma has no defined backup policy. The ADRs assume one exists — ADR-011 makes
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
copies live, *how* they're encrypted, or *whether restores actually work*.
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
but commits to nothing.

This design defines the policy end-to-end: recovery model, what is captured and how,
the 3-2-1 topology, encryption and key escrow with a break-glass path, restore
testing, retention, failure alerting, and the air-gap mechanism.

## Scope

- **In:** application *state* backup for boma's hosts and services; off-site and
  air-gapped copies; encryption + key escrow; restore testing; failure alerting;
  retention; the backup node.
- **Out (for now):** whole-VM image backup (Proxmox Backup Server) — explicitly
  deferred, see Decision 1; a central-vs-per-app database decision (TODO 3.9 — this
  design is agnostic to it); Prometheus backup metrics (noted as a later add).

## Decisions (as settled)

### 1. Recovery model — data-only backups, rebuild from code (Model A)

boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
Ansible re-renders the Docker Compose stack. So backups protect **state only** — DB
contents, bind-mount data dirs, Vaultwarden's vault — not whole-VM images.

To recover a host: Terraform re-provisions the VM → Ansible redeploys → restic
restores the data. **No Proxmox Backup Server.** This keeps 3-2-1 cheap, fits
pCloud's 1 TB comfortably, and turns every restore into a continuous proof that the
IaC *and* the backups both work.

Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run +
data restore, potentially hours), and it bets the repo is complete enough to rebuild
from nothing — which Tier-2 restore testing (Decision 8) exists to verify. **PBS
(Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO proves
too slow; nothing here precludes it.

### 2. One backup tier, ~24 h RPO

A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
the board. No per-data-type tiering yet — revisit once there is real-world data and
experience to justify the added machinery.

### 3. Engine — restic (data) + rclone (off-site); no PBS

- **restic** captures state into an encrypted, deduplicated repository.
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
  rclone has a first-class pCloud backend).
- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
  encryption layer, no pCloud "crypto folder."

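As a rough illustration of how the two tools compose, the sketch below builds (but does not run) the command lines for one snapshot and one off-site sync. The repository path, password-file path, staging directory, and the rclone remote name `pcloud:boma-restic` are hypothetical placeholders, and Python is only an assumed glue language; the design does not prescribe one.

```python
from typing import List

# Hypothetical locations; the design does not fix any of these names.
REPO = "/srv/restic/repo"
PASSWORD_FILE = "/root/.restic-password"
REMOTE = "pcloud:boma-restic"


def restic_backup_cmd(staged_dir: str, service: str) -> List[str]:
    """argv for snapshotting one service's staged data into the local repo."""
    return [
        "restic", "-r", REPO, "--password-file", PASSWORD_FILE,
        "backup", staged_dir, "--tag", service,
    ]


def rclone_sync_cmd() -> List[str]:
    """argv for mirroring the already-encrypted repo directory to the remote."""
    return ["rclone", "sync", REPO, REMOTE]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    print(" ".join(restic_backup_cmd("/srv/staging/nextcloud", "nextcloud")))
    print(" ".join(rclone_sync_cmd()))
```

Because rclone only ever sees the repository directory, it handles ciphertext and needs no knowledge of the restic password.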
### 4. Topology — central pull node (`fisi`), off the cluster

A single backup node owns the canonical restic repo. It is **off the Proxmox
cluster** — an independent failure domain, so copy 2 survives a PVE node (or the whole
cluster) dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
(off-site): a manually-provisioned physical node in its own inventory group, still
Ansible-managed (base hardening + a `backup` role).

**Pull model.** The backup node holds SSH keys to each host; per service it runs the
declared dump command remotely, pulls the declared paths read-only, then `restic`
snapshots the staged data into its *local* repo. **Hosts hold no backup credentials
and cannot reach the repo** — so a compromised or ransomwared service host cannot
delete backup history.

**Backup node assignment:** `fisi` (an HP Elite 600 G9 tower), penciled in / provisional
— the *role* ("the backup node") is load-bearing; the physical assignment may be
revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
(ZFS or mdraid → 8 TB usable, survives one disk failure; not a stripe). It owns the
repo, runs the pull orchestration, runs `rclone → pCloud`, and **docks the USB
air-gap drives** (Decision 11). Pending one hardware item: the SATA power cable from
the board/PSU to the drives. A data-only restic node is a featherweight workload, so
the G9 is comfortably over-specced.

### 5. 3-2-1 mapping

| Copy | Location | Medium | Off-site | Notes |
|---|---|---|---|---|
| 1 | Live data on each host | NVMe/SSD | no | The working data |
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Decision 9 / threat model) |
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |

≥3 copies, ≥2 media, ≥1 off-site — satisfied, with the air-gap drive as a fourth,
offline copy that no online compromise can reach.

### 6. Per-service backup contract — `backup__*` data + `BACKUP.md` (hard convention)

Almost every boma service is the same shape: a Docker bind-mount data dir + maybe a
database. Each **service role declares its backup needs** in role vars — the same
render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):

```yaml
backup__paths:            # bind-mount dirs / files holding state
  - /srv/nextcloud/data
backup__predump:          # optional: command that emits an app-consistent dump
  cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
  dest: "nextcloud-db.sql"
```

The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
run `predump` → pull the dump + declared paths read-only → `restic` snapshot. A
service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
silent).

**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md` /
`VERIFY.md` / `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
what state exists, what is backed up, the dump command, and the per-service **restore**
procedure. A template lives at `docs/backup/service-backup-template.md`. `make lint`
gates its presence for service roles.

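The per-service sequence described in this decision can be sketched as a small planning function that turns one role's `backup__*` data into ordered steps. The function name, the step strings, and the use of a plain dict for the rendered contract are illustrative assumptions, not the orchestrator's actual interface. One further assumption is labeled in the code: a role is treated as "nothing to back up" only when it declares neither paths nor a predump.

```python
from typing import Dict, List, Optional


def plan_service(name: str, contract: Dict) -> List[str]:
    """Turn one service's backup__* data into an ordered list of steps."""
    paths: List[str] = contract.get("backup__paths") or []
    predump: Optional[Dict] = contract.get("backup__predump")

    # Assumption: "nothing to back up" means no paths AND no predump.
    if not paths and not predump:
        # Recorded explicitly so the service is declared, not silently skipped.
        return [f"{name}: nothing to back up (declared)"]

    steps: List[str] = []
    if predump:
        steps.append(f"{name}: run predump -> {predump['dest']}")
        steps.append(f"{name}: pull {predump['dest']} (read-only)")
    for path in paths:
        steps.append(f"{name}: pull {path} (read-only)")
    steps.append(f"{name}: restic snapshot")
    return steps


if __name__ == "__main__":
    nextcloud = {
        "backup__paths": ["/srv/nextcloud/data"],
        "backup__predump": {"cmd": "(elided)", "dest": "nextcloud-db.sql"},
    }
    for step in plan_service("nextcloud", nextcloud):
        print(step)
```

Keeping the plan as data before anything is executed makes it straightforward to render the same information into `BACKUP.md`.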
### 7. Consistency — logical dumps first, quiesce as an escape hatch

- **Default (A):** databases are captured with logical dumps (`pg_dump` /
  `mysqldump`) — portable, version-independent, restorable to a fresh DB. Plain data
  dirs are backed up as files. No downtime. Cost: every stateful service must declare
  a working dump command, *tested by restore drills*.
- **Escape hatch (B):** a service whose data cannot be dumped live declares a
  quiesce step (stop container → back up volume → restart) in the same contract.
- ZFS/filesystem snapshots are **not** used as the sole DB method (only
  crash-consistent for a live database).

This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
way, each service declares how to dump its own data.

### 8. Restore testing — two tiers

- **Tier 1 — frequent, automated, rolling restore-verify (weekly).** Pick the next
  service in rotation, restore its latest snapshot into a throwaway **container on
  `ubongo`** (reusing boma's existing Molecule harness, ADR-015), start the app
  against the restored data, and **run that service's `VERIFY.md` checks**
  (ADR-008/017) against it, then tear down. This catches the failure that actually
  kills people — *silently corrupt or unrestorable backups*. Failures alert via ntfy.
- **Tier 2 — rare, full DR rehearsal (semi-annual), driven from `ubongo` onto PVE
  staging.** Rebuild a host from zero via Terraform + Ansible + restic restore on the
  staging cluster (only a real PVE node can host the VM; `ubongo` orchestrates). This
  validates the whole Model-A recovery chain, not just "can I read a snapshot."
  **At least once a year the rehearsal exercises the paper-secret break-glass path**
  (Decision 10) end-to-end.

`ubongo` stays **bare Debian, not a hypervisor** (ADR-015 unchanged): its job is to be
the independent recovery anchor — "the tool used to rebuild the cluster must not live
inside the thing it rebuilds." Higher-fidelity real-VM testing is *better* served by
the PVE staging env (same hardware class, same cluster, same provisioning path) than
by converting `ubongo`. `ubongo`'s real spec is a ThinkCentre M70q (i3-10100T / 16 GB
/ **1 TB NVMe**) — the 1 TB gives ample room for Tier-1 dataset restores; disk
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).

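One possible way to implement "pick the next service in rotation" for Tier 1 without storing any state is to derive the choice from the calendar. This is only a sketch under assumptions: the function name is hypothetical, and the design does not say how the rotation is tracked.

```python
from datetime import date
from typing import List


def service_for_week(services: List[str], day: date) -> str:
    """Pick the service to restore-verify in the 7-day window containing `day`."""
    if not services:
        raise ValueError("no services declare a backup contract")
    ordered = sorted(services)           # stable order regardless of inventory order
    week_index = day.toordinal() // 7    # increases by exactly 1 every 7 days
    return ordered[week_index % len(ordered)]


if __name__ == "__main__":
    print(service_for_week(["nextcloud", "vaultwarden", "gitea"], date.today()))
```

With N services, each one is verified once every N weeks. Adding or removing a service shifts the rotation, which is acceptable for a rolling check but worth knowing when reading the history.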
### 9. Retention — GFS via restic

Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
Tune once real repo growth is observed.

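To make the starting policy concrete, the sketch below assembles the `restic forget` command line from the four keep values and computes a simple upper bound on retained snapshots. The repository path is a hypothetical placeholder. The bound is a sum because restic keeps the union of the buckets, and a single snapshot may satisfy more than one of them.

```python
from typing import Dict, List

# Starting policy from the design; tune once repo growth is observed.
KEEP: Dict[str, int] = {"daily": 7, "weekly": 4, "monthly": 6, "yearly": 1}


def forget_cmd(repo: str, keep: Dict[str, int] = KEEP) -> List[str]:
    """argv for applying the retention policy and pruning unreferenced data."""
    cmd = ["restic", "-r", repo, "forget", "--prune"]
    for bucket, count in keep.items():
        cmd += [f"--keep-{bucket}", str(count)]
    return cmd


def max_snapshots_kept(keep: Dict[str, int] = KEEP) -> int:
    """Upper bound per snapshot group; overlapping buckets can only lower it."""
    return sum(keep.values())


if __name__ == "__main__":
    print(" ".join(forget_cmd("/srv/restic/repo")))  # hypothetical repo path
    print(max_snapshots_kept())
```

With the starting values the bound is 7 + 4 + 6 + 1 = 18 snapshots per group. By default restic groups snapshots by host and paths, so the policy applies to each service's snapshots separately.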
### 10. Encryption + key escrow + break-glass

restic already encrypts the repo, so **one secret — the restic repo password —
protects all copies uniformly** (fisi, pCloud, USB). One thing to escrow, not three.

**Escrow locations:**
- **`fisi`, root-only** (+ in the Ansible vault) — so backups run non-interactively
  and `fisi` is redeployable.
- **Vaultwarden** — the day-to-day human-accessible copy.
- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
  copy that survives "everything is down."

**Model-A twist — the paper holds *two* secrets, not one:**
1. the **restic repo password** (to read any backup at all), and
2. the **Ansible vault master password** (to rebuild hosts from the repo — normally
   from Vaultwarden via `rbw`, which is itself down in a from-zero recovery).

With both on paper, the break-glass chain has **no circular dependency**: paper →
restic restores Vaultwarden + repo data → the vault password (from paper) drives
Terraform/Ansible re-provisioning → services return, `rbw` works again. `ubongo`'s
ADR-015 recovery model already establishes **`mamba` (laptop) as a break-glass clone**
(repo + toolchain + mesh + `rbw`, with Terraform state synced to it) — the rebuild can
be driven from `mamba` if `ubongo` is also gone. The printed sheet is a short
**break-glass runbook** assuming zero running boma infrastructure: install restic on
any machine, point it at pCloud *or* a USB drive with the password, restore Vaultwarden
first, then rebuild with the vault password.

### 11. USB air-gap trigger (plug-and-go cold copy)

A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
systemd unit → script that: mounts the drive, confirms it is an expected drive, runs
**`restic copy` from the local repo → a restic repo on the USB drive** (dedup-aware,
same password → ciphertext if lost/stolen), runs `restic check` on the USB copy,
unmounts, and **notifies via ntfy** with the result. Only allowlisted serials trigger
anything (a rogue USB does nothing).

`restic copy` (not rsync) so the USB is itself a valid restic repo — restorable
**directly** in a break-glass with nothing else alive. Rotate among a few drives,
**stored off-site** → also a second *geographic* off-site copy independent of pCloud.

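The script's "confirm it is an expected drive" step and the copy-then-check sequence might look roughly like the sketch below. The serial numbers, paths, and function names are hypothetical, and the command lines are built rather than executed. It also assumes restic 0.14 or newer, where the source repository for `copy` is given with `--from-repo`.

```python
from typing import FrozenSet, List

# Hypothetical serials; the real allowlist would come from inventory.
ALLOWED_SERIALS: FrozenSet[str] = frozenset({"WD-EXAMPLE0001", "WD-EXAMPLE0002"})


def is_expected_drive(serial: str, allowed: FrozenSet[str] = ALLOWED_SERIALS) -> bool:
    """Second check inside the script, after the udev rule has already matched."""
    return serial.strip() in allowed


def airgap_cmds(local_repo: str, usb_repo: str, password_file: str) -> List[List[str]]:
    """argv lists for copying snapshots to the USB repo, then verifying it."""
    common = ["restic", "-r", usb_repo, "--password-file", password_file]
    return [
        # Same password on both sides, as the design specifies.
        common + ["copy", "--from-repo", local_repo,
                  "--from-password-file", password_file],
        common + ["check"],
    ]


if __name__ == "__main__":
    if is_expected_drive("WD-EXAMPLE0001"):
        for cmd in airgap_cmds("/srv/restic/repo", "/mnt/airgap/repo",
                               "/root/.restic-password"):
            print(" ".join(cmd))
```

For the copy to deduplicate well, the USB repository generally needs to be initialized once with the same chunker parameters as the source (restic's `init` offers `--copy-chunker-params` for this); that one-time setup is outside this sketch.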
### 12. Failure alerting — guard against silent death

Success/failure pings alone miss the worst case (*the job silently stopped running*):
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
  monitor** (already in the planned stack); no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job or `predump` error.
- **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
  Grafana panel (fits ADR-018's monitoring direction; not required for v1).

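The dead-man's-switch rule reduces to a single comparison, shown here only to make the ~25 h window explicit. In the design this check is performed by the Uptime Kuma push monitor itself; the function and constant names are illustrative.

```python
from datetime import datetime, timedelta, timezone

MAX_SILENCE = timedelta(hours=25)   # nightly cadence plus about 1 h of slack


def is_overdue(last_success: datetime, now: datetime,
               max_silence: timedelta = MAX_SILENCE) -> bool:
    """True when no successful run has reported within the allowed window."""
    return now - last_success > max_silence


if __name__ == "__main__":
    last = datetime(2026, 6, 10, 3, 0, tzinfo=timezone.utc)
    print(is_overdue(last, last + timedelta(hours=24)))   # run arrived on time
    print(is_overdue(last, last + timedelta(hours=26)))   # one run was missed
```

The key property is that the alert fires on the absence of a success ping, so it still triggers when the job never starts and therefore never reports a failure.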
### 13. Schedule

- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
  `predump` → pull paths read-only → `restic` snapshot → `restic forget --prune`
  (Decision 9) → `rclone sync` to pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
- **USB air-gap:** manual, ~monthly, whenever a drive is docked.

## Architecture & data flow (nightly run)

```
                          ┌────────────────────────────────────────────┐
  docker_hosts / etc.     │ fisi (backup node)                         │
  ┌───────────┐   SSH     │ pull orchestrator (reads backup__*)        │
  │ service A │◀──────────│ 1. ssh host → run predump (pg_dump…)       │
  │  + DB     │  pull RO  │ 2. pull dump + backup__paths (read-only)   │
  └───────────┘──────────▶│ 3. restic snapshot → local repo (mirror)   │
  ┌───────────┐           │ 4. restic forget --prune (GFS)             │
  │ service B │           │ 5. rclone sync repo → pCloud (offsite)     │
  └───────────┘           │ 6. heartbeat → Uptime Kuma; errors → ntfy  │
                          └────────────────┬───────────────────────────┘
                                           │  (manual, ~monthly)
                               udev: known drive plugged
                                           ▼
                       restic copy → USB repo (air-gap, offline)
```

Restore (Model A): Terraform re-provisions the VM → Ansible redeploys the role →
restic restores `backup__paths` + replays the dump → `VERIFY.md` confirms.

## Components & boundaries

- **`backup` role (on `fisi`):** pull orchestrator, restic repo management, retention
  prune, rclone→pCloud sync, udev/air-gap unit, alerting hooks. New inventory group
  (e.g. `backup_hosts`) with the `base` role applied, like `control`/`offsite_hosts`.
- **Per-service backup contract:** `backup__*` role vars + rendered `BACKUP.md`
  (Decision 6); a hard convention enforced by `make lint`.
- **`ubongo`:** schedules/drives Tier-1 (local container) and Tier-2 (onto staging);
  unchanged role per ADR-015.
- **Secrets:** restic password + rclone token in `fisi` (root-only) and the Ansible
  vault; escrowed per Decision 10.

## Threat model / 3-2-1 honesty

- **`rclone sync` propagates deletions** — a prune, or a *malicious* wipe of `fisi`'s
  repo, replicates to pCloud. pCloud is therefore the **off-site** copy but **not
  immutable**. Mitigations: the **USB air-gap drive is the immutable backstop**
  (offline = unreachable by any online compromise) and **pCloud's own file-version
  history** is enabled as a recovery cushion.
- **Pull model** stops a compromised *service host* from touching the repo.
- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it
  gets full base hardening and tight access. restic encryption means a stolen `fisi`
  (or USB, or pCloud blob) yields ciphertext only.
- **pCloud's 1 TB is the smallest copy → the off-site capacity ceiling.** Data-only
  backups fit for years at homelab scale; flag for `/capacity-review` if the repo
  trends toward ~1 TB.

## What this changes in the repo (for the plan)

- New `backup` role + `backup_hosts` inventory group; `fisi` hardware-reference entry.
- New per-service convention: `backup__*` vars + `BACKUP.md` (template at
  `docs/backup/service-backup-template.md`); `make lint` gate; update role-conventions
  in `CLAUDE.md` and the new-role scaffolding/runbook.
- Update `docs/hardware/reference.md`: `ubongo` = M70q (i3-10100T/16 GB/**1 TB**);
  add `fisi`.
- Update `CAPABILITIES.md` §9 (PBS → deferred; restic+rclone+USB the committed engine).
- Close `docs/TODO.md` 3.8; cross-reference from ADR-011.
- The break-glass runbook (printed sheet + `docs/runbooks/`), referencing ADR-015's
  `mamba` clone and Terraform-state survival.

## Non-goals / YAGNI

- No PBS / whole-VM images in v1 (Decision 1).
- No per-data-type RPO tiering in v1 (Decision 2).
- No second encryption layer over restic (Decision 3).
- No central NAS/file-share scope creep on `fisi` — it stays single-purpose.

## Open / deferred

- Central vs per-app database (TODO 3.9) — orthogonal; this design works either way.
- Prometheus backup metrics — later add (Decision 12).
- PBS (Model B) or hybrid (Model C) — revisit if real-world RTO is too slow.