diff --git a/.claude/commands/check-backup.md b/.claude/commands/check-backup.md
new file mode 100644
index 0000000..cacb043
--- /dev/null
+++ b/.claude/commands/check-backup.md
@@ -0,0 +1,29 @@
+---
+description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
+---
+
+Verify that a service's **declared** backup data (`backup__*`) is actually captured in
+the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
+applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
+
+**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
+command reports `not-yet-available` rather than failing.
+
+## Preconditions
+
+- `roles//` carries `backup__*` data (or `backup__state: false` with a reason).
+- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
+  `not-yet-available` and stop.
+
+## Checks (when live)
+
+Load the `backup__*` data for the resolved role, then:
+
+| Check | How | Green when |
+|---|---|---|
+| snapshot freshness | `restic snapshots --tag --latest 1` | a snapshot ≤ ~24 h old exists |
+| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
+| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
+| integrity | `restic check --read-data-subset` (sampled) | no errors |
+
+Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
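The first three checks in the `check-backup.md` table reduce to comparisons between the declared `backup__*` data and the latest snapshot. Below is a minimal Python sketch of that comparison. It assumes the snapshot has already been fetched and flattened into a dict with a `time` and a `files` list; that shape, the name `evaluate_checks`, and the exact-match path comparison are illustrative assumptions, not part of the command. The integrity check is left out because it needs a live repo.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # the table's "<= ~24 h" freshness bound
CHECKS = ("snapshot freshness", "paths present", "dumps present")


def evaluate_checks(declared, snapshot, now):
    """Map each check name to 'pass', 'fail', or 'n/a (stateless)'."""
    # An omitted backup__state is treated as stateful, never as "nothing to back up".
    if declared.get("backup__state", True) is False:
        return {name: "n/a (stateless)" for name in CHECKS}
    if snapshot is None:  # no snapshot at all: every check fails
        return {name: "fail" for name in CHECKS}

    # Hypothetical shape: the entries listed in the latest snapshot.
    captured = set(snapshot["files"])
    results = {
        "snapshot freshness": now - snapshot["time"] <= MAX_AGE,
        "paths present": all(p in captured for p in declared.get("backup__paths", [])),
        "dumps present": all(
            d["dest"] in captured for d in declared.get("backup__dumps", [])
        ),
    }
    return {name: "pass" if ok else "fail" for name, ok in results.items()}


# Usage with the nextcloud example values from the contract.
now = datetime(2026, 6, 11, 3, 0, tzinfo=timezone.utc)
declared = {
    "backup__state": True,
    "backup__paths": ["/srv/nextcloud/data"],
    "backup__dumps": [{"cmd": "pg_dump ...", "dest": "nextcloud-db.sql"}],
}
snapshot = {
    "time": now - timedelta(hours=2),
    "files": ["/srv/nextcloud/data", "nextcloud-db.sql"],
}
report = evaluate_checks(declared, snapshot, now)
for check, outcome in report.items():
    print(f"{check}: {outcome}")
```

Keeping the evaluation separate from the restic calls means the pass/fail logic can be exercised while the command is still dormant.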
diff --git a/CLAUDE.md b/CLAUDE.md index 0891b2f..cab6b27 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -89,6 +89,9 @@ Full design rationale: `docs/decisions/` - Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md` - Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy `docs/access/service-access-template.md`; rendered from the role's `access__*` data +- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) — + copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*` + data. A stateless service records `backup__state: false` with a reason. - One service = one self-contained role; no shared multi-service roles (ADR-004) - Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`) - Use `make new-role NAME=` to scaffold — never create role structure by hand @@ -227,6 +230,7 @@ Single-contributor, trunk-based (no merge requests / approval gates): | Tagging & run-targeting | `docs/decisions/019-tagging.md` | | Firewall strategy | `docs/decisions/020-firewall.md` | | Operational access | `docs/decisions/021-operational-access.md` | +| Backup & disaster recovery | `docs/decisions/022-backup.md` | | Adding a new role | `docs/runbooks/new-role.md` | | Adding a new host | `docs/runbooks/new-host.md` | | Rotating vault secrets | `docs/runbooks/rotate-secrets.md` | diff --git a/STATUS.md b/STATUS.md index 32cdb7b..3c395f3 100644 --- a/STATUS.md +++ b/STATUS.md @@ -63,6 +63,8 @@ So `make deploy PLAYBOOK=site` is still incomplete — `base` is only partially | `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. | | `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). 
**Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). | | Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. | +| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. | +| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. | ## Keeping this honest diff --git a/docs/CAPABILITIES.md b/docs/CAPABILITIES.md index efa33f6..56cee84 100644 --- a/docs/CAPABILITIES.md +++ b/docs/CAPABILITIES.md @@ -104,9 +104,9 @@ role from a shared `group_vars` service catalog. The host `nftables` layer is bu | Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open | |---|---|---|---|---|---| | Databases | Postgres/MariaDB — central *vs* per-app | P | candidate | Backing store for stateful apps | Open: central server vs per-service (TODO 3.9) | -| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 | -| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | | -| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation | +| Backup engine | restic (data-only) | S | planned | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) | +| Off-site target | pCloud (via rclone) | S | planned | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled | +| Air-gap target | USB hard drives | S | planned | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` | ## 10. Operations & support — [S] diff --git a/docs/TODO.md b/docs/TODO.md index 6bf3a60..f5bf4e8 100644 --- a/docs/TODO.md +++ b/docs/TODO.md @@ -39,7 +39,10 @@ 7. 
~~Define a tagging standard that lets us target runs without over-tagging.~~ DECIDED (ADR-019): two-tier — role-name tags (auto, at play level) + a closed 9-tag concern list (`tests/tags.yml`); union-only targeting; enforced by `make lint`. - 8. Ensure the right things are backed up (incl. database dumps if we land on PBS). + 8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~ + DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster + node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via + pCloud + rotated USB air-gap. Build: Plans 2–3. 9. Decide: a central database server, or individual database services per app? 10. Should we keep the custom base-container (Molecule test image) method for role testing, or revisit it as boma's testing approach matures (ADR-008)? 11. ~~Deliberate tagging strategy.~~ DECIDED (ADR-019) — folded into 3.7. diff --git a/docs/backup/service-backup-template.md b/docs/backup/service-backup-template.md new file mode 100644 index 0000000..911ffc1 --- /dev/null +++ b/docs/backup/service-backup-template.md @@ -0,0 +1,44 @@ +# Per-service backup record — template + +Copy this file to `roles//BACKUP.md` when building a **stateful** service +role (ADR-022). It is the per-service **backup record**: what state the service holds, +how it is captured consistently, and how it is restored. The structured parts are +**rendered from the role's `backup__*` data** (the single source of truth that also +drives `/check-backup`) — keep the data authoritative and regenerate this file rather +than hand-editing the tables. The prose "Restore notes" tail is hand-written. + +A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets +`backup__state: false` with a reason in its role defaults instead. + +Delete this preamble in the copy and start from the heading below. 
+
+---
+
+# Backup —
+
+## State captured
+
+Rendered from `backup__*`:
+
+| What | Source | How captured |
+|---|---|---|
+| data dir(s) | `` | file-level, pulled read-only |
+| database | `` → `` | logical dump (default; ADR-022 Decision 7) |
+
+- **Quiesce:** `` — `true` means the service is stopped → backed up →
+  restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
+- **RPO:** ~24 h (nightly; ADR-022 Decision 2).
+
+## Restore procedure
+
+1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
+2. `restic restore` the latest snapshot for `` into ``.
+3. Replay each `` into its database.
+4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).
+
+## Restore notes
+
+Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
+known-tricky migrations.
+
+-
diff --git a/docs/decisions/022-backup.md b/docs/decisions/022-backup.md
new file mode 100644
index 0000000..6a4980f
--- /dev/null
+++ b/docs/decisions/022-backup.md
@@ -0,0 +1,277 @@
+# ADR-022 — Backup & disaster recovery: data-only restic, off-cluster pull node, 3-2-1
+
+## Status
+
+Accepted (2026-06-10). Resolves TODO 3.8 ("ensure the right things are backed up,
+incl. DB dumps") and `CAPABILITIES.md` §9 (backup engine / off-site / air-gap, all
+"planned"). Grounds ADR-011's "backup-first" and "snapshot + backup" language, which
+assumed a backup policy existed but never defined one.
+
+**Doctrine ADR.** It pins the recovery model, backup engine, topology, per-service
+contract, encryption/escrow, restore-testing tiers, retention, alerting, and USB
+air-gap mechanism. It does **not** build any of them — the `backup` role, `fisi`
+node, per-service `backup__*` declarations, and `BACKUP.md` files do not exist yet.
+Designed now, built in the implementation plan referenced at the foot of this ADR.
+
+## Context
+
+boma has no defined backup policy.
The ADRs assume one exists — ADR-011 makes
+"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
+path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
+copies live, *how* they are encrypted, or *whether restores actually work*.
+`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
+but commits to nothing.
+
+The gap is not just theoretical. Every boma service is stateful in some dimension:
+DB contents, bind-mount data dirs, the Vaultwarden vault that holds every secret in
+the stack. Without a backup policy the IaC is not reproducible from nothing; it is
+reproducible-modulo-data. This ADR closes that gap.
+
+## Decision
+
+### 1. Recovery model — data-only backups, rebuild from code (Model A)
+
+boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
+Ansible re-renders the Docker Compose stack. Backups therefore protect **state only** —
+DB contents, bind-mount data dirs, Vaultwarden's vault — not whole-VM images.
+
+Recovery sequence: Terraform re-provisions the VM → Ansible redeploys → restic
+restores the data. **No Proxmox Backup Server (PBS) in v1.** This keeps the 3-2-1
+topology cheap, fits pCloud's 1 TB comfortably, and turns every restore drill into
+a continuous proof that the IaC *and* the backups both work.
+
+Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run
+plus data restore, potentially hours), and it bets the repo is complete enough to
+rebuild from nothing — which Tier-2 restore testing (Decision 8) exists to verify.
+**PBS (Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO
+proves too slow; nothing here precludes it.
+
+### 2. One backup tier, ~24 h RPO
+
+A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
+the board.
No per-data-type tiering yet — revisit once there is real-world data and
+experience to justify the added machinery.
+
+### 3. Engine — restic (data) + rclone (off-site); no second encryption layer
+
+- **restic** captures state into an encrypted, deduplicated repository.
+- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
+  rclone has a first-class pCloud backend).
+- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
+  encryption layer, no pCloud "crypto folder."
+
+No PBS in v1 (see Decision 1).
+
+### 4. Topology — central pull node (`fisi`), off the cluster; `backup_hosts` group
+
+A single backup node owns the canonical restic repo. It is **off the Proxmox cluster**
+— an independent failure domain, so copy 2 survives a PVE node (or the whole cluster)
+dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
+(off-site): a manually-provisioned physical node in its own inventory group, still
+Ansible-managed (the `base` role applies, plus a `backup` role).
+
+**Pull model.** `fisi` holds SSH keys to each host; per service it runs the declared
+dump command remotely, pulls the declared paths read-only, then `restic` snapshots the
+staged data into its local repo. **Hosts hold no backup credentials and cannot reach
+the repo** — a compromised or ransomwared service host cannot delete backup history.
+
+**Node assignment:** `fisi` (an HP Elite 600 G9 tower) is penciled in / provisional —
+the *role* ("the backup node") is load-bearing; the physical assignment may be
+revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
+(ZFS or mdraid → 8 TB usable, survives one disk failure). It owns the repo, runs the
+pull orchestration, runs `rclone → pCloud`, and docks the USB air-gap drives
+(Decision 11).
+
+**Inventory:** a new `backup_hosts` group is added to both inventories, structured
+like `control` and `offsite_hosts`. The `base` role applies.
+
+### 5. 3-2-1 mapping
+
+| Copy | Location | Medium | Off-site? | Notes |
+|---|---|---|---|---|
+| 1 | Live data on each host | NVMe/SSD | no | The working data |
+| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
+| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Consequences) |
+| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |
+
+≥3 copies, ≥2 media, ≥1 off-site — 3-2-1 satisfied, with the air-gap drive as a
+fourth, offline copy that no online compromise can reach.
+
+### 6. Per-service backup contract — `backup__*` data + `BACKUP.md`; governance
+
+Each service role declares its backup needs in role vars — the same render-from-data
+pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
+
+```yaml
+backup__service: nextcloud  # identifier; matches the role / compose project
+backup__state: true  # false = stateless → no BACKUP.md (pair with a reason)
+backup__paths:  # bind-mount dirs / files holding state ([] = none)
+  - /srv/nextcloud/data
+backup__dumps:  # logical app-consistent dumps ([] = none)
+  - cmd: "docker compose -p nextcloud exec -T db pg_dump -U {{ vault.nextcloud.db_user }} nextcloud"
+    dest: nextcloud-db.sql
+backup__quiesce: false  # true = stop→back up→restart escape hatch (Decision 7 B)
+```
+
+The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
+run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
+service with **no** `backup__paths` must explicitly declare `backup__state: false` with
+a reason; omission is never an implicit "nothing to back up." (`backup__state` and the
+list-form `backup__dumps` are this ADR's resolution of the spec's open "declared, not
+silent" point.)
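The "declared, not silent" rule in Decision 6 can be checked mechanically before the orchestrator ever runs. The sketch below is one possible reading, not the ADR's implementation. The function name `validate_contract` and the key `backup__state_reason` are hypothetical (the ADR says "with a reason" but does not name the variable), and treating "no paths and no dumps" as the trigger for an explicit stateless declaration is an interpretation.

```python
def validate_contract(role_vars):
    """Return a list of problems; an empty list means the declaration is acceptable."""
    # Omission is never an implicit "nothing to back up".
    if "backup__state" not in role_vars:
        return ["backup__state must be declared explicitly (true or false)"]

    problems = []
    paths = role_vars.get("backup__paths", [])
    dumps = role_vars.get("backup__dumps", [])

    if role_vars["backup__state"]:
        if not role_vars.get("backup__service"):
            problems.append("backup__service is required for a stateful role")
        if not paths and not dumps:
            problems.append("stateful role declares neither backup__paths nor backup__dumps")
        for index, dump in enumerate(dumps):
            missing = [key for key in ("cmd", "dest") if not dump.get(key)]
            if missing:
                problems.append(f"backup__dumps[{index}] is missing: {', '.join(missing)}")
    else:
        # Hypothetical key name: the ADR only requires that a reason be recorded.
        if not role_vars.get("backup__state_reason"):
            problems.append("backup__state: false needs a reason")
        if paths or dumps:
            problems.append("stateless role must not declare paths or dumps")
    return problems


# Usage: the nextcloud declaration from the ADR passes; a silent role does not.
nextcloud = {
    "backup__service": "nextcloud",
    "backup__state": True,
    "backup__paths": ["/srv/nextcloud/data"],
    "backup__dumps": [{"cmd": "pg_dump ...", "dest": "nextcloud-db.sql"}],
    "backup__quiesce": False,
}
print(validate_contract(nextcloud))
print(validate_contract({"backup__service": "whoami"}))
```

Returning a list of problems instead of raising on the first one lets a review see every gap in a role's declaration at once.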
+
+**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
+`VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
+what state exists, what is backed up, the dump command, and the per-service restore
+procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
+service declares `backup__state: false` (with a reason) in its role vars and gets **no**
+`BACKUP.md`.
+
+**Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
+touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
+service checklist (`docs/security/service-checklist.md`) gains a backup item; the
+`new-role` runbook gains a fill/render/`check-backup` step (copy
+`docs/backup/service-backup-template.md` into `roles//BACKUP.md` and
+populate the `backup__*` data); and a checklist gate blocks service clearance until
+the record exists and a restore drill confirms it (or a deviation is recorded in
+`accepted-risks.md`). The dormant `/check-backup` verifier is the automated check
+analogue of `/check-access` (ADR-021). **No automated lint script gates `BACKUP.md`
+presence** — same manual-copy-plus-review pattern the sibling records use. The design
+document's "make lint gates its presence" wording is superseded by this governance
+choice.
+
+### 7. Consistency — logical dumps first; quiesce as escape hatch
+
+- **Default:** databases are captured with logical dumps (`pg_dump` / `mysqldump`) —
+  portable, version-independent, restorable to a fresh DB. Plain data dirs are backed
+  up as files. No downtime required.
+- **Escape hatch:** a service whose data cannot be dumped live declares a quiesce step
+  (stop container → back up volume → restart) via `backup__quiesce` in the same contract.
+- ZFS/filesystem snapshots are **not** used as the sole DB method (only
+  crash-consistent for a live database).
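To make the ordering in Decision 7 concrete, the following Python sketch turns one service's `backup__*` data into an ordered list of capture steps. It is an illustration under stated assumptions, not the orchestrator's design: the function `capture_plan` and the `(action, target)` tuples are hypothetical, and two ordering choices are guesses the ADR does not spell out (dumps run while the service is still up, and the service restarts before the `restic` snapshot of the staged data).

```python
def capture_plan(role_vars):
    """Return ordered (action, target) steps for one stateful service."""
    service = role_vars["backup__service"]
    paths = role_vars.get("backup__paths", [])
    dumps = role_vars.get("backup__dumps", [])
    quiesce = role_vars.get("backup__quiesce", False)

    steps = []
    # Default path: logical dumps are taken live, so they come first.
    steps.extend(("dump", dump["dest"]) for dump in dumps)
    if quiesce:
        # Escape hatch: stop the service before its files are pulled.
        steps.append(("stop", service))
    steps.extend(("pull", dump["dest"]) for dump in dumps)
    steps.extend(("pull", path) for path in paths)
    if quiesce:
        # Assumption: restart as soon as the pull finishes to keep downtime short.
        steps.append(("start", service))
    steps.append(("snapshot", service))
    return steps


# Usage: a live-dumped service and a quiesced, dump-less one (values are examples).
live = capture_plan({
    "backup__service": "nextcloud",
    "backup__paths": ["/srv/nextcloud/data"],
    "backup__dumps": [{"cmd": "pg_dump ...", "dest": "nextcloud-db.sql"}],
})
quiesced = capture_plan({
    "backup__service": "legacyapp",
    "backup__paths": ["/srv/legacyapp"],
    "backup__quiesce": True,
})
for step in live + quiesced:
    print(step)
```

Producing a plan as data rather than executing it directly keeps the ordering reviewable and testable without SSH access to any host.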
+
+This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
+way, each service declares how to dump its own data.
+
+### 8. Restore testing — two tiers; `ubongo` stays bare Debian
+
+- **Tier 1 — weekly, automated, rolling restore-verify.** Pick the next service in
+  rotation, restore its latest snapshot into a throwaway container on `ubongo`
+  (reusing the Molecule harness, ADR-015), start the app against the restored data,
+  and run that service's `VERIFY.md` checks (ADR-008/017). This catches the failure
+  that actually kills people — *silently corrupt or unrestorable backups*. Failures
+  alert via ntfy.
+- **Tier 2 — semi-annual full DR rehearsal,** driven from `ubongo` onto PVE staging.
+  Rebuild a host from zero via Terraform + Ansible + restic restore on the staging
+  cluster. This validates the whole Model-A recovery chain. **At least once a year the
+  rehearsal exercises the paper-secret break-glass path** (Decision 10) end-to-end.
+
+**`ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged).** Its role is to
+be the independent recovery anchor — "the tool used to rebuild the cluster must not
+live inside the thing it rebuilds." Higher-fidelity real-VM testing is better served
+by the PVE staging environment (same hardware class, same cluster, same provisioning
+path). `ubongo`'s 1 TB NVMe gives ample room for Tier-1 dataset restores; disk
+headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).
+
+### 9. Retention — GFS via restic
+
+Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
+`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
+Tune once real repo growth is observed.
+
+### 10. Encryption + key escrow + break-glass
+
+restic encrypts the repo at rest, so **one secret — the restic repo password —
+protects all copies uniformly** (`fisi`, pCloud, USB). One thing to escrow, not three.
+
+**Escrow locations:**
+- **`fisi`, root-only** (plus in the Ansible vault) — so backups run non-interactively
+  and `fisi` is redeployable.
+- **Vaultwarden** — the day-to-day human-accessible copy.
+- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
+  copy that survives "everything is down."
+
+**The paper holds *two* secrets:** (1) the **restic repo password** (to read any
+backup at all) and (2) the **Ansible vault master password** (to rebuild hosts from
+the repo — normally from Vaultwarden via `rbw`, which is itself down in a from-zero
+recovery). With both on paper, the break-glass chain has **no circular dependency**:
+paper → restic restores Vaultwarden + repo data → the vault password (from paper)
+drives Terraform/Ansible re-provisioning → services return, `rbw` works again.
+
+**`mamba` (laptop) is the break-glass clone** (ADR-015): repo + toolchain + mesh +
+`rbw`, with Terraform state synced to it — the rebuild can be driven from `mamba` if
+`ubongo` is also gone. The paper sheet doubles as a short break-glass runbook assuming
+zero running boma infrastructure: install restic on any machine, point it at pCloud
+*or* a USB drive with the password, restore Vaultwarden first, then rebuild with the
+vault password.
+
+### 11. USB air-gap — plug-and-go cold copy
+
+A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
+systemd unit / script that: mounts the drive, confirms it is an expected drive, runs
+**`restic copy` from the local repo → a restic repo on the USB drive** (same
+password → ciphertext if lost/stolen), runs `restic check` on the USB copy, unmounts,
+and **notifies via ntfy** with the result. Only allowlisted serials trigger anything —
+a rogue USB does nothing.
+
+`restic copy` (not rsync) so the USB is itself a valid restic repo, restorable
+directly in a break-glass with nothing else alive.
Drives are rotated and **stored
+off-site** — a second geographic off-site copy independent of pCloud.
+
+### 12. Failure alerting — guard against silent death
+
+Success/failure pings alone miss the worst case (*the job silently stopped running*):
+
+- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
+  monitor**; no ping in ~25 h → alert.
+- **Immediate failure → ntfy** on any job or dump-step error.
+- **Weekly `restic check`** for repo integrity → alert on corruption.
+- **Tier-1 restore-verify failures → ntfy.**
+- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
+  Grafana panel (fits ADR-018's monitoring direction; not required for v1).
+
+### 13. Schedule
+
+- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
+  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
+  → `rclone sync` → pCloud. Sequential, off-hours.
+- **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
+- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
+- **USB air-gap:** manual, approximately monthly, whenever a drive is docked.
+
+## Consequences
+
+- boma now has a defined, end-to-end backup policy that closes the gap ADR-011 left
+  open; "backup-first" and "snapshot + backup" are no longer assumed.
+- Every service role that holds state must declare its backup contract (`backup__*`
+  vars + `BACKUP.md`); stateless services declare `backup__state: false`. Cost:
+  per-service declarations and a rendered doc to maintain (mitigated by the new-role
+  runbook step + checklist gate).
+- **pCloud is off-site but sync-coupled** — `rclone sync` propagates deletions (a
+  prune, or a malicious wipe of `fisi`'s repo, replicates to pCloud). The **USB
+  air-gap drive is the only truly immutable copy**; pCloud's own file-version history
+  is enabled as a secondary cushion.
+- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it + receives full `base` hardening and tight access. restic encryption means a stolen + `fisi`, USB drive, or pCloud blob yields ciphertext only. +- **pCloud's 1 TB is the off-site capacity ceiling.** Data-only backups fit for years + at homelab scale; flag for `/capacity-review` if the repo trends toward ~1 TB. +- Recovery time under Model A (full Ansible run + data restore) is potentially hours — + slower than a VM-image restore. PBS/Model B is explicitly deferred, not rejected. +- The paper break-glass must be kept current (restic password + vault password). An + outdated paper sheet is the one failure mode this ADR cannot prevent mechanically — + the semi-annual DR rehearsal is the human control. + +Full design rationale and worked examples: `docs/superpowers/specs/2026-06-10-backup-strategy-design.md`. +Build path (roles, topology, tests): `docs/superpowers/plans/2026-06-10-backup-strategy.md`. + +## Related + +ADR-002 (security baseline: hardening applied to `fisi`), ADR-004 (one service = one +role; per-service doc conventions), ADR-008 (testing methodology; Molecule harness +reused for Tier-1), ADR-011 (update management: backup-first rule now grounded), +ADR-015 (`ubongo` recovery model; `mamba` break-glass clone; bare-Debian invariant), +ADR-017 (`VERIFY.md` checks reused in Tier-1 restore-verify), ADR-018 (logging/Alloy +→ ntfy alerting path), ADR-019 (Proxmox tags; `backup_hosts` group), ADR-021 +(render-from-data pattern: `access__*`/`ACCESS.md` → `backup__*`/`BACKUP.md`; +runbook+gate governance model). diff --git a/docs/hardware/reference.md b/docs/hardware/reference.md index 8098ebf..3df2c3a 100644 --- a/docs/hardware/reference.md +++ b/docs/hardware/reference.md @@ -22,10 +22,20 @@ - **Model / form factor:** _TBD (x86-64 mini-PC / USFF, e.g. 
N100 or refurb micro)_ - **CPU:** _TBD (target 4 cores, x86-64)_ - **RAM:** _TBD (target 16 GB)_ -- **Storage:** _TBD (target 250 GB SSD/NVMe)_ +- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022) - **NICs:** _wired GbE_ - **Notes:** _always-on; control plane + AI-worker + local test runner (ADR-015); not a Proxmox guest_ +### fisi (backup node — outside the cluster; provisional) +- **Model / form factor:** HP Elite 600 G9 (tower) +- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node +- **RAM:** 16 GB+ (TBD exact) +- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk) +- **NICs:** wired GbE +- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud, + docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs. + Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand). + _(repeat for pve1, pve2, askari)_ ## 2. Network gear @@ -54,7 +64,8 @@ Physical totals per node. Integers; `ram_gb` and `disk_gb` may be decimals. |------|-------|--------|---------| | pve0 | 20 | 64 | 4000 | | pve1 | 20 | 64 | 4000 | -| ubongo | 4 | 16 | 250 | +| ubongo | 4 | 16 | 1000 | +| fisi | 4 | 16 | 8000 | ## 5. Capacity notes diff --git a/docs/runbooks/new-role.md b/docs/runbooks/new-role.md index 037dc2c..714e1fe 100644 --- a/docs/runbooks/new-role.md +++ b/docs/runbooks/new-role.md @@ -103,7 +103,18 @@ rendered from that data; the admin-API path must `firewall_ref` an entry in the `/check-access ` proves the documented paths are live — part of the service-clearance gate (`docs/security/service-checklist.md`). -### 12. Commit +### 12. 
Write the per-service backup record (stateful services) + +For a **stateful** service role, copy `docs/backup/service-backup-template.md` to +`roles//BACKUP.md` and populate the role's `backup__*` data (`backup__service`, +`backup__paths`, `backup__dumps` — `cmd` + `dest` per logical dump — and `backup__quiesce`; +ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md` +is rendered from that data. A **stateless** service sets `backup__state: false` with a +reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup ` +proves the declared state is captured — part of the service-clearance gate +(`docs/security/service-checklist.md`). + +### 13. Commit ```bash git checkout -b role/ diff --git a/docs/security/service-checklist.md b/docs/security/service-checklist.md index ea30151..4184dc7 100644 --- a/docs/security/service-checklist.md +++ b/docs/security/service-checklist.md @@ -47,7 +47,10 @@ This checklist is the generic **bar**. Each service answers it in its own ## Operability (security-adjacent) - [ ] Logs go somewhere reviewable (central aggregation when available) -- [ ] Backup/restore is covered if the service holds state +- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries + `backup__*` data, `roles//BACKUP.md` is rendered, and `/check-backup` + reports the declared paths/dumps captured in the latest snapshot — or the service + sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`. 
- [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the service has a populated `roles//VERIFY.md` and its critical journeys verified (ADR-008 Level 4 / ADR-017) diff --git a/docs/superpowers/specs/2026-06-10-backup-strategy-design.md b/docs/superpowers/specs/2026-06-10-backup-strategy-design.md index ce62454..4393107 100644 --- a/docs/superpowers/specs/2026-06-10-backup-strategy-design.md +++ b/docs/superpowers/specs/2026-06-10-backup-strategy-design.md @@ -1,7 +1,7 @@ # Design — Backup & disaster recovery strategy - **Date:** 2026-06-10 -- **Status:** Approved design — pending implementation plan +- **Status:** Approved design — implementation plan written; Plan 1 (foundation) complete (see ADR-022) - **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up, incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap, all "planned") @@ -113,15 +113,20 @@ database. Each **service role declares its backup needs** in role vars — the s render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021): ```yaml -backup__paths: # bind-mount dirs / files holding state +backup__service: nextcloud # identifier; matches the role / compose project +backup__state: true # false = stateless → no BACKUP.md (pair with a reason) +backup__paths: # bind-mount dirs / files holding state ([] = none) - /srv/nextcloud/data -backup__predump: # optional: command that emits an app-consistent dump - cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud" - dest: "nextcloud-db.sql" +backup__dumps: # logical app-consistent dumps (list; [] = none) + - cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud" + dest: nextcloud-db.sql +backup__quiesce: false # true = stop→back up→restart escape hatch ``` +(ADR-022 is authoritative for the contract.) 
+ The pull orchestrator reads these (rendered from inventory) and, per service: SSH in → -run `predump` → pull the dump + declared paths read-only → `restic` snapshot. A +run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not silent). @@ -219,7 +224,7 @@ anything (a rogue USB does nothing). Success/failure pings alone miss the worst case (*the job silently stopped running*): - **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push monitor** (already in the planned stack); no ping in ~25 h → alert. -- **Immediate failure → ntfy** on any job or `predump` error. +- **Immediate failure → ntfy** on any job or dump-step error. - **Periodic `restic check`** (weekly) for repo integrity → alert on corruption. - **Tier-1 restore-verify failures → ntfy.** - *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a @@ -228,7 +233,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni ### 13. Schedule - **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host → - `predump` → pull paths read-only → `restic` snapshot → `restic forget --prune` + run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune` (Decision 9) → `rclone sync` → pCloud. Sequential, off-hours. - **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`. - **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path. @@ -240,7 +245,7 @@ Success/failure pings alone miss the worst case (*the job silently stopped runni ┌─────────────────────────────────────────┐ docker_hosts / etc. │ fisi (backup node) │ ┌───────────┐ SSH │ pull orchestrator (reads backup__* ) │ - │ service A │◀─────────│ 1. ssh host → run predump (pg_dump…) │ + │ service A │◀─────────│ 1. ssh host → run dumps (pg_dump…) │ │ + DB │ pull RO │ 2. 
pull dump + backup__paths (read-only)│ └───────────┘─────────▶│ 3. restic snapshot → local repo (mirror)│ ┌───────────┐ │ 4. restic forget --prune (GFS) │