Compare commits
11 commits
032adf1525
...
9be4366ac3
| Author | SHA1 | Date | |
|---|---|---|---|
| 9be4366ac3 | |||
| ed6d5463aa | |||
| 1e85c11ede | |||
| 5f946ac640 | |||
| 01e47d0890 | |||
| 81dac4f28b | |||
| f3f80443d0 | |||
| f5c97d1f36 | |||
| da116e1d92 | |||
| 2041bd3b70 | |||
| eaffd8d900 |
13 changed files with 1197 additions and 8 deletions
29
.claude/commands/check-backup.md
Normal file
29
.claude/commands/check-backup.md
Normal file
|
|
@ -0,0 +1,29 @@
|
|||
---
|
||||
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
|
||||
---
|
||||
|
||||
Verify that a service's **declared** backup data (`backup__*`) is actually captured in
|
||||
the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
|
||||
applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
|
||||
|
||||
**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
|
||||
command reports `not-yet-available` rather than failing.
|
||||
|
||||
## Preconditions
|
||||
|
||||
- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
|
||||
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
|
||||
`not-yet-available` and stop.
|
||||
|
||||
## Checks (when live)
|
||||
|
||||
Load the `backup__*` data for the resolved role, then:
|
||||
|
||||
| Check | How | Green when |
|
||||
|---|---|---|
|
||||
| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
|
||||
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
|
||||
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
|
||||
| integrity | `restic check --read-data-subset` (sampled) | no errors |
|
||||
|
||||
Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
|
||||
|
|
@ -89,6 +89,9 @@ Full design rationale: `docs/decisions/`
|
|||
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
|
||||
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
|
||||
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
|
||||
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
|
||||
copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
|
||||
data. A stateless service records `backup__state: false` with a reason.
|
||||
- One service = one self-contained role; no shared multi-service roles (ADR-004)
|
||||
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
|
||||
- Use `make new-role NAME=<name>` to scaffold — never create role structure by hand
|
||||
|
|
@ -227,6 +230,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
|
|||
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
|
||||
| Firewall strategy | `docs/decisions/020-firewall.md` |
|
||||
| Operational access | `docs/decisions/021-operational-access.md` |
|
||||
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
|
||||
| Adding a new role | `docs/runbooks/new-role.md` |
|
||||
| Adding a new host | `docs/runbooks/new-host.md` |
|
||||
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
|
||||
|
|
|
|||
|
|
@ -63,6 +63,8 @@ So `make deploy PLAYBOOK=site` is still incomplete — `base` is only partially
|
|||
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
|
||||
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
|
||||
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
|
||||
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
|
||||
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
|
||||
|
||||
## Keeping this honest
|
||||
|
||||
|
|
|
|||
|
|
@ -104,9 +104,9 @@ role from a shared `group_vars` service catalog. The host `nftables` layer is bu
|
|||
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|
||||
|---|---|---|---|---|---|
|
||||
| Databases | Postgres/MariaDB — central *vs* per-app | P | candidate | Backing store for stateful apps | Open: central server vs per-service (TODO 3.9) |
|
||||
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
|
||||
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
|
||||
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
|
||||
| Backup engine | restic (data-only) | S | planned | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
|
||||
| Off-site target | pCloud (via rclone) | S | planned | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
|
||||
| Air-gap target | USB hard drives | S | planned | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
|
||||
|
||||
## 10. Operations & support — [S]
|
||||
|
||||
|
|
|
|||
|
|
@ -136,3 +136,17 @@ earning its keep.
|
|||
user's standing override. → The standing preference outranks skill scripts: when a
|
||||
skill's handoff offers the execution-mode menu, skip it and proceed subagent-driven;
|
||||
only ask if the user signals otherwise this session.
|
||||
|
||||
## 2026-06-10
|
||||
|
||||
- `[recurring]` **Asked the execution-mode question AGAIN** — presented the
|
||||
"subagent-driven (recommended) vs inline" menu at the `writing-plans` → execution
|
||||
handoff (backup-strategy plan), despite the 2026-06-05 standing preference, the
|
||||
`always-subagent-driven-execution` memory, and two prior FRICTION entries (06-06,
|
||||
06-09) all saying don't ask. **Fourth occurrence.** Doc/memory escalations are not
|
||||
holding: each session I re-read the skill's scripted menu and follow it over the
|
||||
standing override. → Prose reminders have demonstrably failed four times; the fix is
|
||||
no longer "try harder to remember" but **mechanical** — a hook or a `writing-plans`
|
||||
local override that suppresses the handoff menu (cf. `update-config`: standing
|
||||
automated behaviours need a hook, not memory). Flag as the top systematization
|
||||
candidate for the next kaizen review.
|
||||
|
|
|
|||
|
|
@ -39,7 +39,10 @@
|
|||
7. ~~Define a tagging standard that lets us target runs without over-tagging.~~
|
||||
DECIDED (ADR-019): two-tier — role-name tags (auto, at play level) + a closed
|
||||
9-tag concern list (`tests/tags.yml`); union-only targeting; enforced by `make lint`.
|
||||
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
|
||||
8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~
|
||||
DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster
|
||||
node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via
|
||||
pCloud + rotated USB air-gap. Build: Plans 2–3.
|
||||
9. Decide: a central database server, or individual database services per app?
|
||||
10. Should we keep the custom base-container (Molecule test image) method for role testing, or revisit it as boma's testing approach matures (ADR-008)?
|
||||
11. ~~Deliberate tagging strategy.~~ DECIDED (ADR-019) — folded into 3.7.
|
||||
|
|
|
|||
44
docs/backup/service-backup-template.md
Normal file
44
docs/backup/service-backup-template.md
Normal file
|
|
@ -0,0 +1,44 @@
|
|||
# Per-service backup record — template
|
||||
|
||||
Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
|
||||
role (ADR-022). It is the per-service **backup record**: what state the service holds,
|
||||
how it is captured consistently, and how it is restored. The structured parts are
|
||||
**rendered from the role's `backup__*` data** (the single source of truth that also
|
||||
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
|
||||
than hand-editing the tables. The prose "Restore notes" tail is hand-written.
|
||||
|
||||
A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
|
||||
`backup__state: false` with a reason in its role defaults instead.
|
||||
|
||||
Delete this preamble in the copy and start from the heading below.
|
||||
|
||||
---
|
||||
|
||||
# Backup — <service>
|
||||
|
||||
## State captured
|
||||
|
||||
Rendered from `backup__*`:
|
||||
|
||||
| What | Source | How captured |
|
||||
|---|---|---|
|
||||
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
|
||||
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |
|
||||
|
||||
- **Quiesce:** `<backup__quiesce>` — `true` means the service is stopped → backed up →
|
||||
restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
|
||||
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).
|
||||
|
||||
## Restore procedure
|
||||
|
||||
1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
|
||||
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
|
||||
3. Replay each `<backup__dumps[*].dest>` into its database.
|
||||
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).
|
||||
|
||||
## Restore notes
|
||||
|
||||
Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
|
||||
known-tricky migrations.
|
||||
|
||||
- <none yet>
|
||||
277
docs/decisions/022-backup.md
Normal file
277
docs/decisions/022-backup.md
Normal file
|
|
@ -0,0 +1,277 @@
|
|||
# ADR-022 — Backup & disaster recovery: data-only restic, off-cluster pull node, 3-2-1
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-06-10). Resolves TODO 3.8 ("ensure the right things are backed up,
|
||||
incl. DB dumps") and `CAPABILITIES.md` §9 (backup engine / off-site / air-gap, all
|
||||
"planned"). Grounds ADR-011's "backup-first" and "snapshot + backup" language, which
|
||||
assumed a backup policy existed but never defined one.
|
||||
|
||||
**Doctrine ADR.** It pins the recovery model, backup engine, topology, per-service
|
||||
contract, encryption/escrow, restore-testing tiers, retention, alerting, and USB
|
||||
air-gap mechanism. It does **not** build any of them — the `backup` role, `fisi`
|
||||
node, per-service `backup__*` declarations, and `BACKUP.md` files do not exist yet.
|
||||
Designed now, built in the implementation plan referenced at the foot of this ADR.
|
||||
|
||||
## Context
|
||||
|
||||
boma has no defined backup policy. The ADRs assume one exists — ADR-011 makes
|
||||
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
|
||||
path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
|
||||
copies live, *how* they are encrypted, or *whether restores actually work*.
|
||||
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
|
||||
but commits to nothing.
|
||||
|
||||
The gap is not just theoretical. Every boma service is stateful in some dimension:
|
||||
DB contents, bind-mount data dirs, the Vaultwarden vault that holds every secret in
|
||||
the stack. Without a backup policy the IaC is not reproducible from nothing; it is
|
||||
reproducible-modulo-data. This ADR closes that gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### 1. Recovery model — data-only backups, rebuild from code (Model A)
|
||||
|
||||
boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
|
||||
Ansible re-renders the Docker Compose stack. Backups therefore protect **state only** —
|
||||
DB contents, bind-mount data dirs, Vaultwarden's vault — not whole-VM images.
|
||||
|
||||
Recovery sequence: Terraform re-provisions the VM → Ansible redeploys → restic
|
||||
restores the data. **No Proxmox Backup Server (PBS) in v1.** This keeps the 3-2-1
|
||||
topology cheap, fits pCloud's 1 TB comfortably, and turns every restore drill into
|
||||
a continuous proof that the IaC *and* the backups both work.
|
||||
|
||||
Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run
|
||||
plus data restore, potentially hours), and it bets the repo is complete enough to
|
||||
rebuild from nothing — which Tier-2 restore testing (Decision 8) exists to verify.
|
||||
**PBS (Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO
|
||||
proves too slow; nothing here precludes it.
|
||||
|
||||
### 2. One backup tier, ~24 h RPO
|
||||
|
||||
A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
|
||||
the board. No per-data-type tiering yet — revisit once there is real-world data and
|
||||
experience to justify the added machinery.
|
||||
|
||||
### 3. Engine — restic (data) + rclone (off-site); no second encryption layer
|
||||
|
||||
- **restic** captures state into an encrypted, deduplicated repository.
|
||||
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
|
||||
rclone has a first-class pCloud backend).
|
||||
- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
|
||||
encryption layer, no pCloud "crypto folder."
|
||||
|
||||
No PBS in v1 (see Decision 1).
|
||||
|
||||
### 4. Topology — central pull node (`fisi`), off the cluster; `backup_hosts` group
|
||||
|
||||
A single backup node owns the canonical restic repo. It is **off the Proxmox cluster**
|
||||
— an independent failure domain, so copy 2 survives a PVE node (or the whole cluster)
|
||||
dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
|
||||
(off-site): a manually-provisioned physical node in its own inventory group, still
|
||||
Ansible-managed (the `base` role applies, plus a `backup` role).
|
||||
|
||||
**Pull model.** `fisi` holds SSH keys to each host; per service it runs the declared
|
||||
dump command remotely, pulls the declared paths read-only, then `restic` snapshots the
|
||||
staged data into its local repo. **Hosts hold no backup credentials and cannot reach
|
||||
the repo** — a compromised or ransomwared service host cannot delete backup history.
|
||||
|
||||
**Node assignment:** `fisi` (an HP Elite 600 G9 tower) is penciled in / provisional —
|
||||
the *role* ("the backup node") is load-bearing; the physical assignment may be
|
||||
revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
|
||||
(ZFS or mdraid → 8 TB usable, survives one disk failure). It owns the repo, runs the
|
||||
pull orchestration, runs `rclone → pCloud`, and docks the USB air-gap drives
|
||||
(Decision 11).
|
||||
|
||||
**Inventory:** a new `backup_hosts` group is added to both inventories, structured
|
||||
like `control` and `offsite_hosts`. The `base` role applies.
|
||||
|
||||
### 5. 3-2-1 mapping
|
||||
|
||||
| Copy | Location | Medium | Off-site? | Notes |
|
||||
|---|---|---|---|---|
|
||||
| 1 | Live data on each host | NVMe/SSD | no | The working data |
|
||||
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
|
||||
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Consequences) |
|
||||
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |
|
||||
|
||||
≥3 copies, ≥2 media, ≥1 off-site — 3-2-1 satisfied, with the air-gap drive as a
|
||||
fourth, offline copy that no online compromise can reach.
|
||||
|
||||
### 6. Per-service backup contract — `backup__*` data + `BACKUP.md`; governance
|
||||
|
||||
Each service role declares its backup needs in role vars — the same render-from-data
|
||||
pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
|
||||
|
||||
```yaml
|
||||
backup__service: nextcloud # identifier; matches the role / compose project
|
||||
backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
|
||||
backup__paths: # bind-mount dirs / files holding state ([] = none)
|
||||
- /srv/nextcloud/data
|
||||
backup__dumps: # logical app-consistent dumps ([] = none)
|
||||
- cmd: "docker compose -p nextcloud exec -T db pg_dump -U {{ vault.nextcloud.db_user }} nextcloud"
|
||||
dest: nextcloud-db.sql
|
||||
backup__quiesce: false # true = stop→back up→restart escape hatch (Decision 7 B)
|
||||
```
|
||||
|
||||
The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
|
||||
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
|
||||
service with **no** `backup__paths` must explicitly declare `backup__state: false` with
|
||||
a reason; omission is never an implicit "nothing to back up." (`backup__state` and the
|
||||
list-form `backup__dumps` are this ADR's resolution of the spec's open "declared, not
|
||||
silent" point.)
|
||||
|
||||
**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
|
||||
`VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
|
||||
what state exists, what is backed up, the dump command, and the per-service restore
|
||||
procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
|
||||
service declares `backup__state: false` (with a reason) in its role vars and gets **no**
|
||||
`BACKUP.md`.
|
||||
|
||||
**Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
|
||||
touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
|
||||
service checklist (`docs/security/service-checklist.md`) gains a backup item; the
|
||||
`new-role` runbook gains a fill/render/`check-backup` step (copy
|
||||
`docs/backup/service-backup-template.md` into `roles/<service>/BACKUP.md` and
|
||||
populate the `backup__*` data); and a checklist gate blocks service clearance until
|
||||
the record exists and a restore drill confirms it (or a deviation is recorded in
|
||||
`accepted-risks.md`). The dormant `/check-backup` verifier is the automated check
|
||||
analogue of `/check-access` (ADR-021). **No automated lint script gates `BACKUP.md`
|
||||
presence** — same manual-copy-plus-review pattern the sibling records use. The design
|
||||
document's "make lint gates its presence" wording is superseded by this governance
|
||||
choice.
|
||||
|
||||
### 7. Consistency — logical dumps first; quiesce as escape hatch
|
||||
|
||||
- **Default:** databases are captured with logical dumps (`pg_dump` / `mysqldump`) —
|
||||
portable, version-independent, restorable to a fresh DB. Plain data dirs are backed
|
||||
up as files. No downtime required.
|
||||
- **Escape hatch:** a service whose data cannot be dumped live declares a quiesce step
|
||||
(stop container → back up volume → restart) via `backup__quiesce` in the same contract.
|
||||
- ZFS/filesystem snapshots are **not** used as the sole DB method (only
|
||||
crash-consistent for a live database).
|
||||
|
||||
This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
|
||||
way, each service declares how to dump its own data.
|
||||
|
||||
### 8. Restore testing — two tiers; `ubongo` stays bare Debian
|
||||
|
||||
- **Tier 1 — weekly, automated, rolling restore-verify.** Pick the next service in
|
||||
rotation, restore its latest snapshot into a throwaway container on `ubongo`
|
||||
(reusing the Molecule harness, ADR-015), start the app against the restored data,
|
||||
and run that service's `VERIFY.md` checks (ADR-008/017). This catches the failure
|
||||
that actually kills people — *silently corrupt or unrestorable backups*. Failures
|
||||
alert via ntfy.
|
||||
- **Tier 2 — semi-annual full DR rehearsal,** driven from `ubongo` onto PVE staging.
|
||||
Rebuild a host from zero via Terraform + Ansible + restic restore on the staging
|
||||
cluster. This validates the whole Model-A recovery chain. **At least once a year the
|
||||
rehearsal exercises the paper-secret break-glass path** (Decision 10) end-to-end.
|
||||
|
||||
**`ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged).** Its role is to
|
||||
be the independent recovery anchor — "the tool used to rebuild the cluster must not
|
||||
live inside the thing it rebuilds." Higher-fidelity real-VM testing is better served
|
||||
by the PVE staging environment (same hardware class, same cluster, same provisioning
|
||||
path). `ubongo`'s 1 TB NVMe gives ample room for Tier-1 dataset restores; disk
|
||||
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).
|
||||
|
||||
### 9. Retention — GFS via restic
|
||||
|
||||
Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
|
||||
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
|
||||
Tune once real repo growth is observed.
|
||||
|
||||
### 10. Encryption + key escrow + break-glass
|
||||
|
||||
restic encrypts the repo at rest, so **one secret — the restic repo password —
|
||||
protects all copies uniformly** (`fisi`, pCloud, USB). One thing to escrow, not three.
|
||||
|
||||
**Escrow locations:**
|
||||
- **`fisi`, root-only** (plus in the Ansible vault) — so backups run non-interactively
|
||||
and `fisi` is redeployable.
|
||||
- **Vaultwarden** — the day-to-day human-accessible copy.
|
||||
- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
|
||||
copy that survives "everything is down."
|
||||
|
||||
**The paper holds *two* secrets:** (1) the **restic repo password** (to read any
|
||||
backup at all) and (2) the **Ansible vault master password** (to rebuild hosts from
|
||||
the repo — normally from Vaultwarden via `rbw`, which is itself down in a from-zero
|
||||
recovery). With both on paper, the break-glass chain has **no circular dependency**:
|
||||
paper → restic restores Vaultwarden + repo data → the vault password (from paper)
|
||||
drives Terraform/Ansible re-provisioning → services return, `rbw` works again.
|
||||
|
||||
**`mamba` (laptop) is the break-glass clone** (ADR-015): repo + toolchain + mesh +
|
||||
`rbw`, with Terraform state synced to it — the rebuild can be driven from `mamba` if
|
||||
`ubongo` is also gone. The paper sheet doubles as a short break-glass runbook assuming
|
||||
zero running boma infrastructure: install restic on any machine, point it at pCloud
|
||||
*or* a USB drive with the password, restore Vaultwarden first, then rebuild with the
|
||||
vault password.
|
||||
|
||||
### 11. USB air-gap — plug-and-go cold copy
|
||||
|
||||
A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
|
||||
systemd unit / script that: mounts the drive, confirms it is an expected drive, runs
|
||||
**`restic copy` from the local repo → a restic repo on the USB drive** (same
|
||||
password → ciphertext if lost/stolen), runs `restic check` on the USB copy, unmounts,
|
||||
and **notifies via ntfy** with the result. Only allowlisted serials trigger anything —
|
||||
a rogue USB does nothing.
|
||||
|
||||
`restic copy` (not rsync) so the USB is itself a valid restic repo, restorable
|
||||
directly in a break-glass with nothing else alive. Drives are rotated and **stored
|
||||
off-site** — a second geographic off-site copy independent of pCloud.
|
||||
|
||||
### 12. Failure alerting — guard against silent death
|
||||
|
||||
Success/failure pings alone miss the worst case (*the job silently stopped running*):
|
||||
|
||||
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
|
||||
monitor**; no ping in ~25 h → alert.
|
||||
- **Immediate failure → ntfy** on any job or dump-step error.
|
||||
- **Weekly `restic check`** for repo integrity → alert on corruption.
|
||||
- **Tier-1 restore-verify failures → ntfy.**
|
||||
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
|
||||
Grafana panel (fits ADR-018's monitoring direction; not required for v1).
|
||||
|
||||
### 13. Schedule
|
||||
|
||||
- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
|
||||
run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
|
||||
→ `rclone sync` → pCloud. Sequential, off-hours.
|
||||
- **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
|
||||
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
|
||||
- **USB air-gap:** manual, approximately monthly, whenever a drive is docked.
|
||||
|
||||
## Consequences
|
||||
|
||||
- boma now has a defined, end-to-end backup policy that closes the gap ADR-011 left
|
||||
open; "backup-first" and "snapshot + backup" are no longer assumed.
|
||||
- Every service role that holds state must declare its backup contract (`backup__*`
|
||||
vars + `BACKUP.md`); stateless services declare `backup__state: false`. Cost:
|
||||
per-service declarations and a rendered doc to maintain (mitigated by the new-role
|
||||
runbook step + checklist gate).
|
||||
- **pCloud is off-site but sync-coupled** — `rclone sync` propagates deletions (a
|
||||
prune, or a malicious wipe of `fisi`'s repo, replicates to pCloud). The **USB
|
||||
air-gap drive is the only truly immutable copy**; pCloud's own file-version history
|
||||
is enabled as a secondary cushion.
|
||||
- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it
|
||||
receives full `base` hardening and tight access. restic encryption means a stolen
|
||||
`fisi`, USB drive, or pCloud blob yields ciphertext only.
|
||||
- **pCloud's 1 TB is the off-site capacity ceiling.** Data-only backups fit for years
|
||||
at homelab scale; flag for `/capacity-review` if the repo trends toward ~1 TB.
|
||||
- Recovery time under Model A (full Ansible run + data restore) is potentially hours —
|
||||
slower than a VM-image restore. PBS/Model B is explicitly deferred, not rejected.
|
||||
- The paper break-glass must be kept current (restic password + vault password). An
|
||||
outdated paper sheet is the one failure mode this ADR cannot prevent mechanically —
|
||||
the semi-annual DR rehearsal is the human control.
|
||||
|
||||
Full design rationale and worked examples: `docs/superpowers/specs/2026-06-10-backup-strategy-design.md`.
|
||||
Build path (roles, topology, tests): `docs/superpowers/plans/2026-06-10-backup-strategy.md`.
|
||||
|
||||
## Related
|
||||
|
||||
ADR-002 (security baseline: hardening applied to `fisi`), ADR-004 (one service = one
|
||||
role; per-service doc conventions), ADR-008 (testing methodology; Molecule harness
|
||||
reused for Tier-1), ADR-011 (update management: backup-first rule now grounded),
|
||||
ADR-015 (`ubongo` recovery model; `mamba` break-glass clone; bare-Debian invariant),
|
||||
ADR-017 (`VERIFY.md` checks reused in Tier-1 restore-verify), ADR-018 (logging/Alloy
|
||||
→ ntfy alerting path), ADR-019 (Proxmox tags; `backup_hosts` group), ADR-021
|
||||
(render-from-data pattern: `access__*`/`ACCESS.md` → `backup__*`/`BACKUP.md`;
|
||||
runbook+gate governance model).
|
||||
|
|
@ -22,10 +22,20 @@
|
|||
- **Model / form factor:** _TBD (x86-64 mini-PC / USFF, e.g. N100 or refurb micro)_
|
||||
- **CPU:** _TBD (target 4 cores, x86-64)_
|
||||
- **RAM:** _TBD (target 16 GB)_
|
||||
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
|
||||
- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
|
||||
- **NICs:** _wired GbE_
|
||||
- **Notes:** _always-on; control plane + AI-worker + local test runner (ADR-015); not a Proxmox guest_
|
||||
|
||||
### fisi (backup node — outside the cluster; provisional)
|
||||
- **Model / form factor:** HP Elite 600 G9 (tower)
|
||||
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
|
||||
- **RAM:** 16 GB+ (TBD exact)
|
||||
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
|
||||
- **NICs:** wired GbE
|
||||
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
|
||||
docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
|
||||
Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
|
||||
|
||||
_(repeat for pve1, pve2, askari)_
|
||||
|
||||
## 2. Network gear
|
||||
|
|
@ -54,7 +64,8 @@ Physical totals per node. Integers; `ram_gb` and `disk_gb` may be decimals.
|
|||
|------|-------|--------|---------|
|
||||
| pve0 | 20 | 64 | 4000 |
|
||||
| pve1 | 20 | 64 | 4000 |
|
||||
| ubongo | 4 | 16 | 250 |
|
||||
| ubongo | 4 | 16 | 1000 |
|
||||
| fisi | 4 | 16 | 8000 |
|
||||
|
||||
## 5. Capacity notes
|
||||
|
||||
|
|
|
|||
|
|
@ -103,7 +103,18 @@ rendered from that data; the admin-API path must `firewall_ref` an entry in the
|
|||
`/check-access <rolename>` proves the documented paths are live — part of the
|
||||
service-clearance gate (`docs/security/service-checklist.md`).
|
||||
|
||||
### 12. Commit
|
||||
### 12. Write the per-service backup record (stateful services)
|
||||
|
||||
For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
|
||||
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
|
||||
`backup__paths`, `backup__dumps` — `cmd` + `dest` per logical dump — and `backup__quiesce`;
|
||||
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
|
||||
is rendered from that data. A **stateless** service sets `backup__state: false` with a
|
||||
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
|
||||
proves the declared state is captured — part of the service-clearance gate
|
||||
(`docs/security/service-checklist.md`).
|
||||
|
||||
### 13. Commit
|
||||
|
||||
```bash
|
||||
git checkout -b role/<rolename>
|
||||
|
|
|
|||
|
|
@ -47,7 +47,10 @@ This checklist is the generic **bar**. Each service answers it in its own
|
|||
## Operability (security-adjacent)
|
||||
|
||||
- [ ] Logs go somewhere reviewable (central aggregation when available)
|
||||
- [ ] Backup/restore is covered if the service holds state
|
||||
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
|
||||
`backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
|
||||
reports the declared paths/dumps captured in the latest snapshot — or the service
|
||||
sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
|
||||
- [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the
|
||||
service has a populated `roles/<service>/VERIFY.md` and its critical journeys
|
||||
verified (ADR-008 Level 4 / ADR-017)
|
||||
|
|
|
|||
476
docs/superpowers/plans/2026-06-10-backup-strategy.md
Normal file
476
docs/superpowers/plans/2026-06-10-backup-strategy.md
Normal file
|
|
@ -0,0 +1,476 @@
|
|||
# Backup & DR Strategy — Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Land the *foundation layer* of the backup strategy — ADR-022, the per-service `backup__*` data contract + `BACKUP.md` governance triad (template + checklist gate + runbook step + dormant verifier), and the doc/inventory updates — so every future service role is born backup-aware, before any live infrastructure exists.
|
||||
|
||||
**Architecture:** This is the first of three sequenced plans (see *Decomposition & roadmap* below). It is **doc/governance only** — no Ansible role, no live restic/rclone, no host contact. It mirrors exactly how ADR-021 delivered operational-access governance: a template under `docs/<concern>/`, one line in `docs/security/service-checklist.md`, a step in `docs/runbooks/new-role.md`, and a *dormant* verifier command (`/check-access` → here `/check-backup`). boma deliberately gates these per-service docs via checklist+runbook, **not** an automated lint script — so this plan adds **no** `scripts/check-*.py`. (This reconciles the design doc's casual "make lint gates its presence" phrasing with boma's actual governance choice; the ADR records the reconciliation.)
|
||||
|
||||
**Tech Stack:** Markdown docs, Ansible role-var conventions (`backup__*`, double-underscore namespace per CLAUDE.md), `make lint` (yamllint + ansible-lint + `check-tags.py`) as the only automated gate, `git` trunk-based on a feature branch.
|
||||
|
||||
**Source spec:** `docs/superpowers/specs/2026-06-10-backup-strategy-design.md` (Decisions 1–13 referenced by number throughout).
|
||||
|
||||
---
|
||||
|
||||
## Decomposition & roadmap
|
||||
|
||||
The full spec spans three subsystems with hard ordering dependencies (STATUS.md: no service roles exist, `fisi` unprovisioned, Terraform never `init`ed, no staging cluster, no Uptime Kuma/pCloud). Each becomes its own plan and produces working, testable software on its own:
|
||||
|
||||
- **Plan 1 — Foundation (THIS PLAN).** ADR + `backup__*` contract + `BACKUP.md` governance + doc/inventory updates. Buildable and verifiable **today** with zero live infra. Unblocks every service role.
|
||||
- **Plan 2 — The `backup` role (FUTURE).** `make new-role NAME=backup`: pull orchestrator, restic wrapper, `rclone→pCloud`, retention prune, udev air-gap unit + `restic copy`, systemd timers, ntfy + Uptime-Kuma heartbeat. Built with Molecule render/syntax tests + pytest, the way the `firewall` concern was — buildable now, *functionally* testable only once `fisi` + hosts exist. **Blocked on:** `fisi` provisioned (SATA power cable), `backup_hosts` inventory group, at least one service role declaring `backup__*`.
|
||||
- **Plan 3 — Live wire-up + restore testing (FUTURE).** Deploy the role, pCloud rclone auth, Uptime Kuma push monitor, Tier-1 restore-verify on `ubongo`, semi-annual Tier-2 DR rehearsal on staging, the printed break-glass runbook + its annual drill. **Blocked on:** Plan 2 deployed, real VMs/staging, services with `VERIFY.md`, Vaultwarden live.
|
||||
|
||||
Write Plans 2 and 3 with this same skill when their prerequisites land. Everything below is Plan 1.
|
||||
|
||||
---
|
||||
|
||||
## Plan 1 file map
|
||||
|
||||
| File | Action | Responsibility |
|
||||
|---|---|---|
|
||||
| `docs/decisions/022-backup.md` | create | ADR of record; distils the spec's Decisions 1–13 |
|
||||
| `docs/backup/service-backup-template.md` | create | `BACKUP.md` template; defines the `backup__*` contract shape |
|
||||
| `.claude/commands/check-backup.md` | create | Dormant verifier (mirrors `check-access.md`) |
|
||||
| `CLAUDE.md` | modify | Role-conventions: BACKUP.md required for service roles; Further-reading row |
|
||||
| `docs/security/service-checklist.md` | modify | Strengthen the Operability backup line to the ADR-022 gate |
|
||||
| `docs/runbooks/new-role.md` | modify | Add the per-service BACKUP.md step (new §12, renumber commit) |
|
||||
| `docs/hardware/reference.md` | modify | `ubongo` → M70q/1TB; add `fisi` node + capacity row |
|
||||
| `docs/CAPABILITIES.md` | modify | §9: restic+rclone+USB committed; PBS deferred; ref ADR-022 |
|
||||
| `STATUS.md` | modify | Add "Designed but not built" rows for backup role + contract |
|
||||
| `docs/TODO.md` | modify | Mark item 3.8 decided; reference ADR-022 |
|
||||
|
||||
**Working branch (all tasks):** AI-driven multi-file change → review as one diff (CLAUDE.md git conventions).
|
||||
|
||||
```bash
|
||||
git checkout -b feat/backup-foundation
|
||||
```
|
||||
|
||||
Before any commit, confirm `rbw unlocked` exits 0 (the pre-commit hook decrypts `vault.yml`); if not, stop and ask the operator to `rbw unlock`.
|
||||
|
||||
---
|
||||
|
||||
### Task 1: Author ADR-022 and wire the decision into CLAUDE.md / STATUS.md / TODO.md
|
||||
|
||||
**Files:**
|
||||
- Create: `docs/decisions/022-backup.md`
|
||||
- Modify: `CLAUDE.md` (Further-reading table; role-conventions block)
|
||||
- Modify: `STATUS.md` ("Designed but not built" table)
|
||||
- Modify: `docs/TODO.md` (item 3.8)
|
||||
|
||||
- [ ] **Step 1: Write `docs/decisions/022-backup.md`**
|
||||
|
||||
Mirror the structure of `docs/decisions/021-operational-access.md` (`## Context`, `## Decision`, subsections, `## Consequences`). Transcribe the spec's settled decisions — do not re-derive. The ADR body must state, each as its own labelled decision:
|
||||
|
||||
1. **Recovery model A** — data-only restic backups, rebuild-from-code; no PBS in v1 (deferred as Model B/C). (spec Decision 1)
|
||||
2. **One tier, ~24 h RPO.** (Decision 2)
|
||||
3. **Engine:** restic (data) + rclone (pCloud off-site); restic encrypts → rclone moves ciphertext only, no second layer. (Decision 3)
|
||||
4. **Topology:** central off-cluster **pull** node (`fisi`, provisional), 2×8 TB mirror, owns the repo, runs rclone + the USB dock; hosts hold no backup creds. New `backup_hosts` inventory group, `base` role applies. (Decision 4)
|
||||
5. **3-2-1 mapping** incl. USB air-gap as the immutable backstop. (Decision 5)
|
||||
6. **Per-service contract:** `backup__*` role vars + required `BACKUP.md`, rendered from the data (the ADR-021 pattern). **Governance reconciliation:** gated via the per-service checklist + new-role runbook + dormant `/check-backup` verifier — **not** an automated lint script (consistent with ADR-021's "runbook+gate, not scaffold" choice). State this explicitly so it supersedes the design doc's "make lint gates its presence" wording. (Decision 6)
|
||||
7. **Consistency:** logical dumps first (`pg_dump`/`mysqldump`), `quiesce` escape hatch; FS snapshots not the sole DB method. (Decision 7)
|
||||
8. **Restore testing:** Tier-1 weekly rolling container restore-verify on `ubongo` (reuses `VERIFY.md`); Tier-2 semi-annual full DR rehearsal on staging, ≥1/yr exercises the paper break-glass. `ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged). (Decision 8)
|
||||
9. **Retention (GFS):** `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`. (Decision 9)
|
||||
10. **Encryption + escrow + break-glass:** one restic password protects all copies; escrowed to `fisi`(+vault) / Vaultwarden / **paper**; paper holds **both** the restic password **and** the Ansible vault password (breaks the Model-A circular dependency); `mamba` is the break-glass clone (ADR-015). (Decision 10)
|
||||
11. **USB air-gap:** udev serial-allowlist → `restic copy` to a USB restic repo → `restic check` → ntfy; rotate off-site. (Decision 11)
|
||||
12. **Failure alerting:** Uptime-Kuma dead-man's-switch + ntfy on failure + weekly `restic check`. (Decision 12)
|
||||
13. **Schedule.** (Decision 13)
|
||||
|
||||
`## Consequences` must note: pCloud is off-site but **sync-coupled** (deletes propagate) → USB is the only immutable copy; `fisi` is the crown-jewel host (full base hardening); pCloud's 1 TB is the off-site capacity ceiling. End with a one-line pointer back to the design doc and to Plans 2–3 as the build path.
|
||||
|
||||
- [ ] **Step 2: Add the Further-reading row in `CLAUDE.md`**
|
||||
|
||||
In the Further-reading table, immediately after the `Operational access … 021-operational-access.md` row, add:
|
||||
|
||||
```
|
||||
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Add the BACKUP.md role-convention in `CLAUDE.md`**
|
||||
|
||||
In the "Role conventions" list, immediately after the `ACCESS.md (ADR-021)` bullet, add:
|
||||
|
||||
```
|
||||
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
|
||||
copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
|
||||
data. A stateless service records `backup__state: false` with a reason.
|
||||
```
|
||||
|
||||
- [ ] **Step 4: Add STATUS.md rows**
|
||||
|
||||
In the "Designed but not built" table in `STATUS.md`, add two rows:
|
||||
|
||||
```
|
||||
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
|
||||
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
|
||||
```
|
||||
|
||||
- [ ] **Step 5: Update TODO item 3.8**
|
||||
|
||||
In `docs/TODO.md`, change the item-3.8 line:
|
||||
|
||||
From:
|
||||
```
|
||||
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
|
||||
```
|
||||
To:
|
||||
```
|
||||
8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~
|
||||
DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster
|
||||
node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via
|
||||
pCloud + rotated USB air-gap. Build: Plans 2–3.
|
||||
```
|
||||
|
||||
- [ ] **Step 6: Verify**
|
||||
|
||||
Run: `make lint`
|
||||
Expected: PASS (yamllint, ansible-lint, `check-tags: OK …`). No new YAML/tags introduced, so this confirms nothing regressed.
|
||||
|
||||
Run: `grep -n "022-backup" CLAUDE.md && grep -rn "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md`
|
||||
Expected: matches in every listed file (cross-references resolve).
|
||||
|
||||
- [ ] **Step 7: Commit**
|
||||
|
||||
```bash
|
||||
git add docs/decisions/022-backup.md CLAUDE.md STATUS.md docs/TODO.md
|
||||
git commit -m "docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 2: Create the `BACKUP.md` template and define the `backup__*` contract
|
||||
|
||||
**Files:**
|
||||
- Create: `docs/backup/service-backup-template.md`
|
||||
|
||||
- [ ] **Step 1: Create the template**
|
||||
|
||||
Mirror `docs/access/service-access-template.md` (preamble that says copy-to-role-and-delete; structured tables rendered from data; a hand-written prose tail). Write exactly:
|
||||
|
||||
````markdown
|
||||
# Per-service backup record — template
|
||||
|
||||
Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
|
||||
role (ADR-022). It is the per-service **backup record**: what state the service holds,
|
||||
how it is captured consistently, and how it is restored. The structured parts are
|
||||
**rendered from the role's `backup__*` data** (the single source of truth that also
|
||||
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
|
||||
than hand-editing the tables. The prose "Restore notes" tail is hand-written.
|
||||
|
||||
A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
|
||||
`backup__state: false` with a reason in its role defaults instead.
|
||||
|
||||
Delete this preamble in the copy and start from the heading below.
|
||||
|
||||
---
|
||||
|
||||
# Backup — <service>
|
||||
|
||||
## State captured
|
||||
|
||||
Rendered from `backup__*`:
|
||||
|
||||
| What | Source | How captured |
|
||||
|---|---|---|
|
||||
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
|
||||
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |
|
||||
|
||||
- **Quiesce:** `<backup__quiesce>` — `true` means the service is stopped → backed up →
|
||||
restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
|
||||
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).
|
||||
|
||||
## Restore procedure
|
||||
|
||||
1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
|
||||
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
|
||||
3. Replay each `<backup__dumps[*].dest>` into its database.
|
||||
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).
|
||||
|
||||
## Restore notes
|
||||
|
||||
Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
|
||||
known-tricky migrations.
|
||||
|
||||
- <none yet>
|
||||
````
|
||||
|
||||
The `backup__*` contract this template renders from (document it here and in the ADR; the role in Plan 2 consumes it):
|
||||
|
||||
```yaml
|
||||
backup__service: <name> # identifier; matches the role / compose project
|
||||
backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
|
||||
backup__paths: # bind-mount dirs/files holding state ([] = none)
|
||||
- /srv/<service>/data
|
||||
backup__dumps: # logical app-consistent dumps (Decision 7 default; [] = none)
|
||||
- cmd: "docker compose -p <service> exec -T db pg_dump -U {{ vault.<service>.db_user }} <db>"
|
||||
dest: <service>-db.sql
|
||||
backup__quiesce: false # true = stop→back up→restart escape hatch (Decision 7 B)
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify**
|
||||
|
||||
Run: `test -f docs/backup/service-backup-template.md && echo PRESENT`
|
||||
Expected: `PRESENT`
|
||||
|
||||
Run: `make lint`
|
||||
Expected: PASS (markdown only; confirms no regression).
|
||||
|
||||
- [ ] **Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add docs/backup/service-backup-template.md
|
||||
git commit -m "docs(backup): add BACKUP.md template + backup__* contract (ADR-022)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: Strengthen the per-service checklist gate
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/security/service-checklist.md` (Operability section)
|
||||
|
||||
- [ ] **Step 1: Replace the weak backup line with the ADR-022 gate**
|
||||
|
||||
In the "Operability (security-adjacent)" section, replace this line:
|
||||
|
||||
```
|
||||
- [ ] Backup/restore is covered if the service holds state
|
||||
```
|
||||
|
||||
with (mirroring the existing ADR-021 access line directly below it):
|
||||
|
||||
```
|
||||
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
|
||||
`backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
|
||||
reports the declared paths/dumps captured in the latest snapshot — or the service
|
||||
sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify**
|
||||
|
||||
Run: `grep -n "ADR-022" docs/security/service-checklist.md`
|
||||
Expected: one match (the new gate line).
|
||||
|
||||
Run: `grep -c "Backup/restore is covered if the service holds state" docs/security/service-checklist.md`
|
||||
Expected: `0` (old weak line gone).
|
||||
|
||||
- [ ] **Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add docs/security/service-checklist.md
|
||||
git commit -m "docs(backup): gate BACKUP.md in service checklist (ADR-022)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Add the BACKUP.md step to the new-role runbook
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/runbooks/new-role.md` (insert a new step after the §11 ACCESS step; renumber the commit step)
|
||||
|
||||
- [ ] **Step 1: Insert the new step**
|
||||
|
||||
Immediately after the §11 "Write the per-service operational-access record" block and before "### 12. Commit", insert:
|
||||
|
||||
```markdown
|
||||
### 12. Write the per-service backup record (stateful services)
|
||||
|
||||
For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
|
||||
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
|
||||
`backup__paths`, `backup__dumps` — `cmd` + `dest` per logical dump — and `backup__quiesce`;
|
||||
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
|
||||
is rendered from that data. A **stateless** service sets `backup__state: false` with a
|
||||
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
|
||||
proves the declared state is captured — part of the service-clearance gate
|
||||
(`docs/security/service-checklist.md`).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Renumber the commit step**
|
||||
|
||||
Change the heading `### 12. Commit` (now the following heading) to `### 13. Commit`.
|
||||
|
||||
- [ ] **Step 3: Verify**
|
||||
|
||||
Run: `grep -nE "^### (11|12|13)\." docs/runbooks/new-role.md`
|
||||
Expected: §11 access, §12 backup, §13 commit — in that order, no duplicate numbers.
|
||||
|
||||
- [ ] **Step 4: Commit**
|
||||
|
||||
```bash
|
||||
git add docs/runbooks/new-role.md
|
||||
git commit -m "docs(backup): add BACKUP.md step to new-role runbook (ADR-022)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 5: Create the dormant `/check-backup` verifier command
|
||||
|
||||
**Files:**
|
||||
- Create: `.claude/commands/check-backup.md`
|
||||
|
||||
- [ ] **Step 1: Write the command**
|
||||
|
||||
Mirror the sibling `.claude/commands/check-access.md` (same frontmatter/sections, same "dormant until infra exists" framing). Write:
|
||||
|
||||
````markdown
|
||||
---
|
||||
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
|
||||
---
|
||||
|
||||
Verify that a service's **declared** backup data (`backup__*`) is actually captured in
|
||||
the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
|
||||
applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
|
||||
|
||||
**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
|
||||
command reports `not-yet-available` rather than failing.
|
||||
|
||||
## Preconditions
|
||||
|
||||
- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
|
||||
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
|
||||
`not-yet-available` and stop.
|
||||
|
||||
## Checks (when live)
|
||||
|
||||
Load the `backup__*` data for the resolved role, then:
|
||||
|
||||
| Check | How | Green when |
|
||||
|---|---|---|
|
||||
| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
|
||||
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
|
||||
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
|
||||
| integrity | `restic check --read-data-subset` (sampled) | no errors |
|
||||
|
||||
Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
|
||||
````
|
||||
|
||||
- [ ] **Step 2: Verify**
|
||||
|
||||
Run: `test -f .claude/commands/check-backup.md && head -1 .claude/commands/check-backup.md`
|
||||
Expected: file present, first line `---` (valid frontmatter).
|
||||
|
||||
Run: `grep -n "not-yet-available" .claude/commands/check-backup.md`
|
||||
Expected: matches (dormancy explicit).
|
||||
|
||||
- [ ] **Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add .claude/commands/check-backup.md
|
||||
git commit -m "feat(backup): add dormant /check-backup verifier (ADR-022)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6: Update hardware reference and capabilities
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/hardware/reference.md` (`ubongo` spec; new `fisi` node; capacity table)
|
||||
- Modify: `docs/CAPABILITIES.md` (§9 Data & backup)
|
||||
|
||||
- [ ] **Step 1: Update the `ubongo` prose block**
|
||||
|
||||
In `docs/hardware/reference.md` §1, replace the `ubongo` Storage line target with the real machine:
|
||||
|
||||
From:
|
||||
```
|
||||
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
|
||||
```
|
||||
To:
|
||||
```
|
||||
- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Add a `fisi` prose block**
|
||||
|
||||
After the `ubongo` block in §1, add:
|
||||
|
||||
```
|
||||
### fisi (backup node — outside the cluster; provisional)
|
||||
- **Model / form factor:** HP Elite 600 G9 (tower)
|
||||
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
|
||||
- **RAM:** 16 GB+ (TBD exact)
|
||||
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
|
||||
- **NICs:** wired GbE
|
||||
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
|
||||
docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
|
||||
Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Update the machine-readable capacity table**
|
||||
|
||||
In §4 "Node capacity", change the `ubongo` row disk from `250` to `1000` and add a `fisi` row. Keep the header and integer/decimal format intact (parsed by `capacity-scan.py`):
|
||||
|
||||
From:
|
||||
```
|
||||
| ubongo | 4 | 16 | 250 |
|
||||
```
|
||||
To:
|
||||
```
|
||||
| ubongo | 4 | 16 | 1000 |
|
||||
| fisi | 4 | 16 | 8000 |
|
||||
```
|
||||
|
||||
- [ ] **Step 4: Update CAPABILITIES §9**
|
||||
|
||||
In `docs/CAPABILITIES.md` §9 table, replace the three backup rows:
|
||||
|
||||
From:
|
||||
```
|
||||
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
|
||||
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
|
||||
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
|
||||
```
|
||||
To:
|
||||
```
|
||||
| Backup engine | restic (data-only) | S | committed | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
|
||||
| Off-site target | pCloud (via rclone) | S | committed | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
|
||||
| Air-gap target | USB hard drives | S | committed | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
|
||||
```
|
||||
|
||||
- [ ] **Step 5: Verify**
|
||||
|
||||
Run: `make lint`
|
||||
Expected: PASS.
|
||||
|
||||
Run: `python3 scripts/capacity-scan.py >/dev/null && echo CAPACITY_OK`
|
||||
Expected: `CAPACITY_OK` (the capacity table headers are still parseable; new `fisi` row accepted).
|
||||
|
||||
Run: `grep -n "ADR-022" docs/CAPABILITIES.md`
|
||||
Expected: three matches (the updated backup rows).
|
||||
|
||||
- [ ] **Step 6: Commit**
|
||||
|
||||
```bash
|
||||
git add docs/hardware/reference.md docs/CAPABILITIES.md
|
||||
git commit -m "docs(backup): update hardware ref (ubongo M70q, add fisi) + CAPABILITIES §9 (ADR-022)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 7: Final review and merge
|
||||
|
||||
- [ ] **Step 1: Full lint + capacity sanity**
|
||||
|
||||
Run: `make lint && python3 scripts/capacity-scan.py >/dev/null && echo ALL_GREEN`
|
||||
Expected: `ALL_GREEN`.
|
||||
|
||||
- [ ] **Step 2: Cross-reference audit**
|
||||
|
||||
Run: `grep -rln "ADR-022\|022-backup" CLAUDE.md STATUS.md docs/ .claude/`
|
||||
Expected: ADR file, CLAUDE.md, STATUS.md, TODO.md, service-checklist.md, new-role.md, CAPABILITIES.md, check-backup.md all listed — no dangling reference, no file missed.
|
||||
|
||||
- [ ] **Step 3: Merge to main and delete the branch**
|
||||
|
||||
```bash
|
||||
git checkout main
|
||||
git merge --no-ff feat/backup-foundation -m "feat(backup): backup strategy foundation layer (ADR-022)"
|
||||
git branch -d feat/backup-foundation
|
||||
git push origin main
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Self-review (completed by plan author)
|
||||
|
||||
- **Spec coverage:** All 13 decisions are recorded in ADR-022 (Task 1, Step 1). The *foundation* obligations of Decisions 6 (contract + BACKUP.md), 7 (dumps-first wording in template/runbook), and the doc/inventory facts (Decisions 4/8 hardware) are implemented as concrete files in Tasks 2–6. Decisions whose *implementation* is live infra — 1/3/9/11/12/13 (engine, retention, air-gap mechanism, alerting, schedule) and 8's restore-testing — are explicitly deferred to Plans 2–3 (see *Decomposition & roadmap*), not silently dropped.
|
||||
- **Placeholder scan:** No "TBD/implement later" steps; every edit shows exact from→to text or full file content. (`<service>`/`<name>` inside template/contract bodies are intentional doc placeholders for the eventual role author, not plan gaps.)
|
||||
- **Consistency:** `backup__*` field names (`backup__service`, `backup__state`, `backup__paths`, `backup__dumps[].cmd/.dest`, `backup__quiesce`) are identical across the ADR (Task 1), template + contract (Task 2), checklist (Task 3), runbook (Task 4), and `/check-backup` (Task 5). The governance triad matches ADR-021's (template / checklist line / runbook step / dormant verifier), and the "no lint script" choice is stated in both the plan header and the ADR.
|
||||
315
docs/superpowers/specs/2026-06-10-backup-strategy-design.md
Normal file
315
docs/superpowers/specs/2026-06-10-backup-strategy-design.md
Normal file
|
|
@ -0,0 +1,315 @@
|
|||
# Design — Backup & disaster recovery strategy
|
||||
|
||||
- **Date:** 2026-06-10
|
||||
- **Status:** Approved design — implementation plan written; Plan 1 (foundation) complete (see ADR-022)
|
||||
- **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up,
|
||||
incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap,
|
||||
all "planned")
|
||||
- **Grounds:** the backup substrate that ADR-011 (update management) already leans on
|
||||
("snapshot-before + backups remain the rollback mechanism", "always dumps the DB /
|
||||
takes a backup first") but never defined
|
||||
- **Reuses:** ADR-004 (one service = one role; per-service doc conventions),
|
||||
ADR-008/017 (`VERIFY.md` per-service checks), ADR-021 (`ACCESS.md` rendered from
|
||||
role `access__*` data — the same render-from-data pattern), ADR-015 (`ubongo`
|
||||
recovery model; `mamba` break-glass clone)
|
||||
- **Becomes:** ADR-022 (this design is the basis for that ADR)
|
||||
|
||||
---
|
||||
|
||||
## Problem
|
||||
|
||||
boma has no defined backup policy. The ADRs assume one exists — ADR-011 makes
|
||||
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
|
||||
path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
|
||||
copies live, *how* they're encrypted, or *whether restores actually work*.
|
||||
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
|
||||
but commits to nothing.
|
||||
|
||||
This design defines the policy end-to-end: recovery model, what is captured and how,
|
||||
the 3-2-1 topology, encryption and key escrow with a break-glass path, restore
|
||||
testing, retention, failure alerting, and the air-gap mechanism.
|
||||
|
||||
## Scope
|
||||
|
||||
- **In:** application *state* backup for boma's hosts and services; off-site and
|
||||
air-gapped copies; encryption + key escrow; restore testing; failure alerting;
|
||||
retention; the backup node.
|
||||
- **Out (for now):** whole-VM image backup (Proxmox Backup Server) — explicitly
|
||||
deferred, see Decision 1; a central-vs-per-app database decision (TODO 3.9 — this
|
||||
design is agnostic to it); Prometheus backup metrics (noted as a later add).
|
||||
|
||||
## Decisions (as settled)
|
||||
|
||||
### 1. Recovery model — data-only backups, rebuild from code (Model A)
|
||||
|
||||
boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
|
||||
Ansible re-renders the Docker Compose stack. So backups protect **state only** — DB
|
||||
contents, bind-mount data dirs, Vaultwarden's vault — not whole-VM images.
|
||||
|
||||
To recover a host: Terraform re-provisions the VM → Ansible redeploys → restic
|
||||
restores the data. **No Proxmox Backup Server.** This keeps 3-2-1 cheap, fits
|
||||
pCloud's 1 TB comfortably, and turns every restore into a continuous proof that the
|
||||
IaC *and* the backups both work.
|
||||
|
||||
Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run +
|
||||
data restore, potentially hours), and it bets the repo is complete enough to rebuild
|
||||
from nothing — which Tier-2 restore testing (Decision 8) exists to verify. **PBS
|
||||
(Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO proves
|
||||
too slow; nothing here precludes it.
|
||||
|
||||
### 2. One backup tier, ~24 h RPO
|
||||
|
||||
A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
|
||||
the board. No per-data-type tiering yet — revisit once there is real-world data and
|
||||
experience to justify the added machinery.
|
||||
|
||||
### 3. Engine — restic (data) + rclone (off-site); no PBS
|
||||
|
||||
- **restic** captures state into an encrypted, deduplicated repository.
|
||||
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
|
||||
rclone has a first-class pCloud backend).
|
||||
- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
|
||||
encryption layer, no pCloud "crypto folder."
|
||||
|
||||
### 4. Topology — central pull node (`fisi`), off the cluster
|
||||
|
||||
A single backup node owns the canonical restic repo. It is **off the Proxmox
|
||||
cluster** — an independent failure domain, so copy 2 survives a PVE node (or the whole
|
||||
cluster) dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
|
||||
(off-site): a manually-provisioned physical node in its own inventory group, still
|
||||
Ansible-managed (base hardening + a `backup` role).
|
||||
|
||||
**Pull model.** The backup node holds SSH keys to each host; per service it runs the
|
||||
declared dump command remotely, pulls the declared paths read-only, then `restic`
|
||||
snapshots the staged data into its *local* repo. **Hosts hold no backup credentials
|
||||
and cannot reach the repo** — so a compromised or ransomwared service host cannot
|
||||
delete backup history.
|
||||
|
||||
**Backup node assignment:** `fisi` (an HP Elite 600 G9 tower), penciled in / provisional
|
||||
— the *role* ("the backup node") is load-bearing; the physical assignment may be
|
||||
revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
|
||||
(ZFS or mdraid → 8 TB usable, survives one disk failure; not a stripe). It owns the
|
||||
repo, runs the pull orchestration, runs `rclone → pCloud`, and **docks the USB
|
||||
air-gap drives** (Decision 11). Pending one hardware item: the SATA power cable from
|
||||
the board/PSU to the drives. A data-only restic node is a featherweight workload, so
|
||||
the G9 is comfortably over-specced.
|
||||
|
||||
### 5. 3-2-1 mapping
|
||||
|
||||
| Copy | Location | Medium | Off-site | Notes |
|
||||
|---|---|---|---|---|
|
||||
| 1 | Live data on each host | NVMe/SSD | no | The working data |
|
||||
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
|
||||
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Decision 9 / threat model) |
|
||||
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |
|
||||
|
||||
≥3 copies, ≥2 media, ≥1 off-site — satisfied, with the air-gap drive as a fourth,
|
||||
offline copy that no online compromise can reach.
|
||||
|
||||
### 6. Per-service backup contract — `backup__*` data + `BACKUP.md` (hard convention)
|
||||
|
||||
Almost every boma service is the same shape: a Docker bind-mount data dir + maybe a
|
||||
database. Each **service role declares its backup needs** in role vars — the same
|
||||
render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
|
||||
|
||||
```yaml
|
||||
backup__service: nextcloud # identifier; matches the role / compose project
|
||||
backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
|
||||
backup__paths: # bind-mount dirs / files holding state ([] = none)
|
||||
- /srv/nextcloud/data
|
||||
backup__dumps: # logical app-consistent dumps (list; [] = none)
|
||||
- cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
|
||||
dest: nextcloud-db.sql
|
||||
backup__quiesce: false # true = stop→back up→restart escape hatch
|
||||
```
|
||||
|
||||
(ADR-022 is authoritative for the contract.)
|
||||
|
||||
The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
|
||||
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
|
||||
service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
|
||||
silent).
|
||||
|
||||
**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md` /
|
||||
`VERIFY.md` / `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
|
||||
what state exists, what is backed up, the dump command, and the per-service **restore**
|
||||
procedure. A template lives at `docs/backup/service-backup-template.md`. `make lint`
|
||||
gates its presence for service roles.
|
||||
|
||||
### 7. Consistency — logical dumps first, quiesce as an escape hatch
|
||||
|
||||
- **Default (A):** databases are captured with logical dumps (`pg_dump` /
|
||||
`mysqldump`) — portable, version-independent, restorable to a fresh DB. Plain data
|
||||
dirs are backed up as files. No downtime. Cost: every stateful service must declare
|
||||
a working dump command, *tested by restore drills*.
|
||||
- **Escape hatch (B):** a service whose data cannot be dumped live declares a
|
||||
quiesce step (stop container → back up volume → restart) in the same contract.
|
||||
- ZFS/filesystem snapshots are **not** used as the sole DB method (only
|
||||
crash-consistent for a live database).
|
||||
|
||||
This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
|
||||
way, each service declares how to dump its own data.
|
||||
|
||||
### 8. Restore testing — two tiers
|
||||
|
||||
- **Tier 1 — frequent, automated, rolling restore-verify (weekly).** Pick the next
|
||||
service in rotation, restore its latest snapshot into a throwaway **container on
|
||||
`ubongo`** (reusing boma's existing Molecule harness, ADR-015), start the app
|
||||
against the restored data, and **run that service's `VERIFY.md` checks**
|
||||
(ADR-008/017) against it, then tear down. This catches the failure that actually
|
||||
kills people — *silently corrupt or unrestorable backups*. Failures alert via ntfy.
|
||||
- **Tier 2 — rare, full DR rehearsal (semi-annual), driven from `ubongo` onto PVE
|
||||
staging.** Rebuild a host from zero via Terraform + Ansible + restic restore on the
|
||||
staging cluster (only a real PVE node can host the VM; `ubongo` orchestrates). This
|
||||
validates the whole Model-A recovery chain, not just "can I read a snapshot."
|
||||
**At least once a year the rehearsal exercises the paper-secret break-glass path**
|
||||
(Decision 10) end-to-end.
|
||||
|
||||
`ubongo` stays **bare Debian, not a hypervisor** (ADR-015 unchanged): its job is to be
|
||||
the independent recovery anchor — "the tool used to rebuild the cluster must not live
|
||||
inside the thing it rebuilds." Higher-fidelity real-VM testing is *better* served by
|
||||
the PVE staging env (same hardware class, same cluster, same provisioning path) than
|
||||
by converting `ubongo`. `ubongo`'s real spec is a ThinkCentre M70q (i3-10100T / 16 GB
|
||||
/ **1 TB NVMe**) — the 1 TB gives ample room for Tier-1 dataset restores; disk
|
||||
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).
|
||||
|
||||
### 9. Retention — GFS via restic
|
||||
|
||||
Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
|
||||
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
|
||||
Tune once real repo growth is observed.
|
||||
|
||||
### 10. Encryption + key escrow + break-glass
|
||||
|
||||
restic already encrypts the repo, so **one secret — the restic repo password —
|
||||
protects all copies uniformly** (fisi, pCloud, USB). One thing to escrow, not three.
|
||||
|
||||
**Escrow locations:**
|
||||
- **`fisi`, root-only** (+ in the Ansible vault) — so backups run non-interactively
|
||||
and `fisi` is redeployable.
|
||||
- **Vaultwarden** — the day-to-day human-accessible copy.
|
||||
- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
|
||||
copy that survives "everything is down."
|
||||
|
||||
**Model-A twist — the paper holds *two* secrets, not one:**
|
||||
1. the **restic repo password** (to read any backup at all), and
|
||||
2. the **Ansible vault master password** (to rebuild hosts from the repo — normally
|
||||
from Vaultwarden via `rbw`, which is itself down in a from-zero recovery).
|
||||
|
||||
With both on paper, the break-glass chain has **no circular dependency**: paper →
|
||||
restic restores Vaultwarden + repo data → the vault password (from paper) drives
|
||||
Terraform/Ansible re-provisioning → services return, `rbw` works again. `ubongo`'s
|
||||
ADR-015 recovery model already establishes **`mamba` (laptop) as a break-glass clone**
|
||||
(repo + toolchain + mesh + `rbw`, with Terraform state synced to it) — the rebuild can
|
||||
be driven from `mamba` if `ubongo` is also gone. The printed sheet is a short
|
||||
**break-glass runbook** assuming zero running boma infrastructure: install restic on
|
||||
any machine, point it at pCloud *or* a USB drive with the password, restore Vaultwarden
|
||||
first, then rebuild with the vault password.
|
||||
|
||||
### 11. USB air-gap trigger (plug-and-go cold copy)
|
||||
|
||||
A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
|
||||
systemd unit → script that: mounts the drive, confirms it is an expected drive, runs
|
||||
**`restic copy` from the local repo → a restic repo on the USB drive** (dedup-aware,
|
||||
same password → ciphertext if lost/stolen), runs `restic check` on the USB copy,
|
||||
unmounts, and **notifies via ntfy** with the result. Only allowlisted serials trigger
|
||||
anything (a rogue USB does nothing).
|
||||
|
||||
`restic copy` (not rsync) so the USB is itself a valid restic repo — restorable
|
||||
**directly** in a break-glass with nothing else alive. Rotate among a few drives,
|
||||
**stored off-site** → also a second *geographic* off-site copy independent of pCloud.
|
||||
|
||||
### 12. Failure alerting — guard against silent death
|
||||
|
||||
Success/failure pings alone miss the worst case (*the job silently stopped running*):
|
||||
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
|
||||
monitor** (already in the planned stack); no ping in ~25 h → alert.
|
||||
- **Immediate failure → ntfy** on any job or dump-step error.
|
||||
- **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
|
||||
- **Tier-1 restore-verify failures → ntfy.**
|
||||
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
|
||||
Grafana panel (fits ADR-018's monitoring direction; not required for v1).
|
||||
|
||||
### 13. Schedule
|
||||
|
||||
- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
|
||||
run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
|
||||
(Decision 9) → `rclone sync` → pCloud. Sequential, off-hours.
|
||||
- **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
|
||||
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
|
||||
- **USB air-gap:** manual, ~monthly, whenever a drive is docked.
|
||||
|
||||
## Architecture & data flow (nightly run)
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
docker_hosts / etc. │ fisi (backup node) │
|
||||
┌───────────┐ SSH │ pull orchestrator (reads backup__* ) │
|
||||
│ service A │◀─────────│ 1. ssh host → run dumps (pg_dump…) │
|
||||
│ + DB │ pull RO │ 2. pull dump + backup__paths (read-only)│
|
||||
└───────────┘─────────▶│ 3. restic snapshot → local repo (mirror)│
|
||||
┌───────────┐ │ 4. restic forget --prune (GFS) │
|
||||
│ service B │ │ 5. rclone sync repo → pCloud (offsite) │
|
||||
└───────────┘ │ 6. heartbeat → Uptime Kuma; errors→ntfy│
|
||||
└───────────────┬──────────────────────────┘
|
||||
│ (manual, ~monthly)
|
||||
udev: known drive plugged
|
||||
▼
|
||||
restic copy → USB repo (air-gap, offline)
|
||||
```
|
||||
|
||||
Restore (Model A): Terraform re-provisions the VM → Ansible redeploys the role →
|
||||
restic restores `backup__paths` + replays the dump → `VERIFY.md` confirms.
|
||||
|
||||
## Components & boundaries
|
||||
|
||||
- **`backup` role (on `fisi`):** pull orchestrator, restic repo management, retention
|
||||
prune, rclone→pCloud sync, udev/air-gap unit, alerting hooks. New inventory group
|
||||
(e.g. `backup_hosts`) with the `base` role applied, like `control`/`offsite_hosts`.
|
||||
- **Per-service backup contract:** `backup__*` role vars + rendered `BACKUP.md`
|
||||
(Decision 6); a hard convention enforced by `make lint`.
|
||||
- **`ubongo`:** schedules/drives Tier-1 (local container) and Tier-2 (onto staging);
|
||||
unchanged role per ADR-015.
|
||||
- **Secrets:** restic password + rclone token in `fisi` (root-only) and the Ansible
|
||||
vault; escrowed per Decision 10.
|
||||
|
||||
## Threat model / 3-2-1 honesty
|
||||
|
||||
- **`rclone sync` propagates deletions** — a prune, or a *malicious* wipe of `fisi`'s
|
||||
repo, replicates to pCloud. pCloud is therefore the **off-site** copy but **not
|
||||
immutable**. Mitigations: the **USB air-gap drive is the immutable backstop**
|
||||
(offline = unreachable by any online compromise) and **pCloud's own file-version
|
||||
history** is enabled as a recovery cushion.
|
||||
- **Pull model** stops a compromised *service host* from touching the repo.
|
||||
- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it
|
||||
gets full base hardening and tight access. restic encryption means a stolen `fisi`
|
||||
(or USB, or pCloud blob) yields ciphertext only.
|
||||
- **pCloud's 1 TB is the smallest copy → the off-site capacity ceiling.** Data-only
|
||||
backups fit for years at homelab scale; flag for `/capacity-review` if the repo
|
||||
trends toward ~1 TB.
|
||||
|
||||
## What this changes in the repo (for the plan)
|
||||
|
||||
- New `backup` role + `backup_hosts` inventory group; `fisi` hardware-reference entry.
|
||||
- New per-service convention: `backup__*` vars + `BACKUP.md` (template at
|
||||
`docs/backup/service-backup-template.md`); `make lint` gate; update role-conventions
|
||||
in `CLAUDE.md` and the new-role scaffolding/runbook.
|
||||
- Update `docs/hardware/reference.md`: `ubongo` = M70q (i3-10100T/16 GB/**1 TB**);
|
||||
add `fisi`.
|
||||
- Update `CAPABILITIES.md` §9 (PBS → deferred; restic+rclone+USB the committed engine).
|
||||
- Close `docs/TODO.md` 3.8; cross-reference from ADR-011.
|
||||
- The break-glass runbook (printed sheet + `docs/runbooks/`), referencing ADR-015's
|
||||
`mamba` clone and Terraform-state survival.
|
||||
|
||||
## Non-goals / YAGNI
|
||||
|
||||
- No PBS / whole-VM images in v1 (Decision 1).
|
||||
- No per-data-type RPO tiering in v1 (Decision 2).
|
||||
- No second encryption layer over restic (Decision 3).
|
||||
- No central NAS/file-share scope creep on `fisi` — it stays single-purpose.
|
||||
|
||||
## Open / deferred
|
||||
|
||||
- Central vs per-app database (TODO 3.9) — orthogonal; this design works either way.
|
||||
- Prometheus backup metrics — later add (Decision 12).
|
||||
- PBS (Model B) or hybrid (Model C) — revisit if real-world RTO is too slow.
|
||||
Loading…
Add table
Reference in a new issue