sjat/boma

sjat 175777e36a docs: reconcile 2026-06-14 review findings (O1-O7,O18,O22)

- STATUS: docker_host is built+applied, not scaffold-only (O1)
- ADR-004: backup points to ADR-022, not "out of scope"; service-role file
  table gains ACCESS.md + BACKUP.md rows (O2, O5)
- Finish Traefik->Caddy: ADR-008/011/017/019, CAPABILITIES, TODO (O3); scope
  ADR-024's custom-image/NetBird claims to the deferred DNS-01/M4b paths (O22)
- ADR-016/017/018 now lead with ## Status per ADR-023 (O4)
- ADR-002: caveat `PLAYBOOK=upgrade` as planned/unbuilt (O6)
- CAPABILITIES: carve out ubongo's dev_env from the nvim/tmux exclusion (O7)
- ADR-007: one authoritative boma.baobab.band -> boma.wingu.me transition note (O18)
- new-host Part E: note ubongo is managed as sjat, ansible-user bootstrap pending (O15)

O9 (hosts.yml header) left open: the file is generator-owned (hook-protected);
fixing it needs a tf_to_inventory.py change or a tf-inventory run, not a hand-edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-14 19:06:33 +02:00


ADR-011 — Update and upgrade management

Status

Proposed (2026-06-04) — draft for discussion; not yet accepted. The core decisions below are settled in intent, but several specifics remain open (see "Open questions").

Context

boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things drift over time and must be kept current without breaking the homelab: the host OS (kernel, libc, packages → sometimes a reboot) and the container images.

Decision

1. Every service is classified stateful or stateless

Each container role declares its class, e.g. <role>__stateful: true|false (default false). The split is the load-bearing classification for the whole policy.

- Stateless: no durable data of its own; losing the container loses nothing. Rebuild = re-pull. Examples: the *arr stack, Jellyfin, exporters, whoami, FlareSolverr, Caddy and other reverse proxies.
- Stateful: owns data, schema, or migrations. This covers databases and apps with their own store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT). When in doubt, classify stateful (the safer, slower path).
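As a minimal sketch of how the flag could be read, assuming role variables are available as a plain dictionary: the helper names `is_stateful` and `split_roles` are hypothetical and not part of boma. Only the `<role>__stateful` key and its `false` default come from the text above.

```python
# Hypothetical helpers; only the "<role>__stateful" key and its default
# of false are taken from the ADR text.
def is_stateful(role: str, role_vars: dict) -> bool:
    """Return the declared class for a role; an undeclared role counts as stateless."""
    value = role_vars.get(f"{role}__stateful", False)
    if not isinstance(value, bool):
        raise ValueError(f"{role}__stateful must be true or false, got {value!r}")
    return value


def split_roles(roles: list[str], role_vars: dict) -> tuple[list[str], list[str]]:
    """Partition roles into (stateful, stateless), preserving input order."""
    stateful = [r for r in roles if is_stateful(r, role_vars)]
    stateless = [r for r in roles if not is_stateful(r, role_vars)]
    return stateful, stateless
```

Because the default is `false`, a role that forgets to declare its class lands in the stateless tier, so the "when in doubt, classify stateful" rule has to be applied by whoever writes the role.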

2. Image pinning follows the split

- Stateless → rolling tags (latest/stable), refreshed by the weekly run and watched by DIUN. Always-current, cheap to roll back. No digest pin, since it would defeat the rolling design.
- Stateful → pinned tag@digest: a readable minor tag where the image offers it (e.g. mariadb:11.4, not bare :11) plus its digest (mariadb:11.4@sha256:…). Reproducible and tamper-evident; upgrades are deliberate (bump tag and digest together), never incidental.

Readable tag and digest, not one or the other: the tag keeps diffs legible, the digest pins the exact bytes for supply-chain integrity (ADR-002, accepted-risk R1). Snapshot-before + backups remain the rollback mechanism for a broken update; the digest is what guards against a swapped image, which snapshots cannot.
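The pinning rule lends itself to a lint check. The sketch below is one possible shape, not an existing boma tool: `split_image`, `pin_violation`, and the `ROLLING_TAGS` set are hypothetical names, and the check only requires that a stateful image has some non-rolling tag, since a minor tag is not always offered.

```python
import re
from typing import Optional

ROLLING_TAGS = {"latest", "stable"}  # assumption: the only rolling tags in use
DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def split_image(image: str) -> tuple[str, str, str]:
    """Split 'repo:tag@digest' into (repo, tag, digest); missing parts are ''."""
    name, _, digest = image.partition("@")
    repo, sep, tag = name.rpartition(":")
    if not sep or "/" in tag:  # no tag given; any colon was a registry port
        repo, tag = name, ""
    return repo, tag, digest


def pin_violation(image: str, stateful: bool) -> Optional[str]:
    """Return a reason the image breaks the pinning rule, or None if it complies."""
    _, tag, digest = split_image(image)
    if stateful:
        if not tag or tag in ROLLING_TAGS:
            return "stateful image needs a readable version tag"
        if not DIGEST.match(digest):
            return "stateful image needs a sha256 digest next to its tag"
        return None
    if digest:
        return "stateless image must not carry a digest pin"
    if tag not in ROLLING_TAGS:
        return "stateless image should use a rolling tag"
    return None
```

A check like this could run against the rendered Compose files before a deploy, so a stateful service can never slip onto a rolling tag unnoticed.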

3. Weekly OS + stateless run — Friday night, fail-stop, staggered

A scheduled run on Friday night (giving the weekend to fix anything it breaks), per host, in strict order with a verification gate between every phase:

- OS update: apt upgrade.
- Reboot: only if required (kernel/libc); detect via /var/run/reboot-required.
- Verify: run the health-check harness. Fail-stop: if a host fails, halt that host's run, leave it as-is, and alert loudly; do not proceed to container updates on a wobbly host.
- Stateless container update: compose pull + recreate-if-changed.
- Verify again; alert on failure.
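The phase order and the fail-stop rule can be sketched as a single function. Everything here is illustrative: `run_host` and its callable parameters are hypothetical placeholders for whatever actually performs each phase (an Ansible play, an SSH call), and are passed in so the control flow can be read and tested on its own.

```python
from typing import Callable


def run_host(
    host: str,
    *,
    os_update: Callable[[str], None],
    reboot_required: Callable[[str], bool],
    reboot: Callable[[str], None],
    update_stateless: Callable[[str], None],
    verify: Callable[[str], bool],
    alert: Callable[[str], None],
) -> bool:
    """Run the weekly phases for one host; return True only if both gates pass."""
    os_update(host)
    if reboot_required(host):  # e.g. /var/run/reboot-required exists
        reboot(host)
    if not verify(host):
        # Fail-stop: leave the host as-is and never reach the container phase.
        alert(f"{host}: verification failed after OS update; container update skipped")
        return False
    update_stateless(host)
    if not verify(host):
        alert(f"{host}: verification failed after stateless container update")
        return False
    return True
```

Returning a boolean rather than raising keeps the caller in charge of what a failed host means for the rest of the fleet.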

Host ordering: infrastructure hosts (DNS, then reverse proxy) update and validate before the rest follow — so a DNS/Caddy failure doesn't make every host look broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
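A rough sketch of the ordering rule follows. The tier labels `"dns"` and `"proxy"` and the function names are hypothetical. Halting the fleet when an infrastructure host fails is one reading of "validate before the rest follow", not something the ADR states outright.

```python
from typing import Callable

INFRA_ORDER = ("dns", "proxy")  # hypothetical tier labels, in update order


def order_hosts(tiers: dict[str, str]) -> list[str]:
    """Sort hosts so DNS comes first, then reverse proxy, then everything else."""
    rank = {tier: i for i, tier in enumerate(INFRA_ORDER)}
    return sorted(tiers, key=lambda host: (rank.get(tiers[host], len(INFRA_ORDER)), host))


def run_fleet(tiers: dict[str, str], run_host: Callable[[str], bool]) -> dict[str, bool]:
    """Run hosts one at a time; a failed infrastructure host holds back the rest."""
    results: dict[str, bool] = {}
    for host in order_hosts(tiers):
        results[host] = run_host(host)
        if not results[host] and tiers[host] in INFRA_ORDER:
            break  # assumption: later hosts wait until infrastructure is healthy
    return results
```

Running hosts strictly one after another also satisfies the rule that the whole fleet never reboots at the same time.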

4. Snapshot-before is the rollback mechanism

Because these are primarily Proxmox VMs, take a VM snapshot before the Friday window and auto-expire it after ~1 week if health checks stayed green.
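The expiry rule reduces to a small predicate. This is a sketch under two assumptions: "~1 week" is taken as exactly seven days, and a snapshot is kept indefinitely when health checks were not green. The name `should_expire` is hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # assumption: "~1 week" read as exactly 7 days


def should_expire(taken_at: datetime, now: datetime, checks_green: bool) -> bool:
    """Expire a pre-update snapshot only once it is old enough and nothing went red."""
    return checks_green and now - taken_at >= RETENTION
```

Keeping the snapshot while checks are red means the rollback point survives for as long as someone might still need it.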

5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first

Stateful services are never touched by the weekly run. Instead, every 8 weeks an automated analysis job (a scheduled claude -p, per the scheduled_jobs design in docs/TODO.md 8.3, not yet built) does:

- Read changelogs / breaking-change notes for each pinned stateful image; diff the pinned tag against what's available.
- Emit a recommended upgrade plan as a Forgejo issue/PR (proposed target version, migration steps, and a backup-first checklist) for a human to approve.
- On approval, the upgrade runs in a deliberate maintenance window (not the Friday auto-run), and always dumps the DB / takes a backup first (ties to the backup work, TODO 3.8). DB major-version bumps are the highest-risk case and get their own migration plan.
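The "diff the pinned tag against what's available" step might look roughly like this. It is a sketch only: it assumes purely numeric dotted tags, compares only tags with the same number of components as the pin, and leaves out fetching the tag list from a registry. The names `parse_version` and `upgrade_candidates` are hypothetical.

```python
import re
from typing import Optional

_VERSION = re.compile(r"^\d+(\.\d+)*$")


def parse_version(tag: str) -> Optional[tuple[int, ...]]:
    """Parse '11.4' into (11, 4); return None for non-numeric tags such as 'latest'."""
    if not _VERSION.match(tag):
        return None
    return tuple(int(part) for part in tag.split("."))


def upgrade_candidates(pinned: str, available: list[str]) -> list[dict]:
    """List newer tags of the same shape as the pin, flagging major-version bumps."""
    current = parse_version(pinned)
    if current is None:
        raise ValueError(f"pinned tag {pinned!r} is not a numeric version")
    found = []
    for tag in available:
        version = parse_version(tag)
        if version is None or len(version) != len(current) or version <= current:
            continue
        found.append((version, tag))
    return [
        {"tag": tag, "major_bump": version[0] > current[0]}
        for version, tag in sorted(found)
    ]
```

The `major_bump` flag is there so the generated plan can single out the DB major-version case that needs its own migration plan.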

6. The verification gate is the health-check harness

"Check everything still works" is the load-bearing 80% of this ADR, and it is the same capability as the test methodology (ADR-008) and the sanity checks already on the TODO (2.2 — API/curl/log/headless checks; 8.2 — "does PhotoPrism have its pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around that harness.

Sequencing is deliberate: the health-check harness is built first, and no update automation (decisions 3–5) is deployed until it is in order. This is not a blocker to work around — it's the order of operations. An update run without a working verification gate is "update and pray," so it simply does not ship until the gate is real.
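As a hedged illustration of what the gate could look like at its simplest, the harness can be treated as a set of named checks that each return true or false. The name `run_gate` and the shape of the checks are assumptions; the real harness and its checks are still an open question in this ADR.

```python
from typing import Callable


def run_gate(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every named check and return the names of those that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure, not a pass
        if not ok:
            failures.append(name)
    return failures
```

An empty list means the gate is green. Running every check instead of stopping at the first failure gives the alert a complete picture of what broke.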

7. Security fast-path overrides the slow cadence

The 8-weekly stateful cadence is for routine drift. It is too slow for a critical CVE in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's new-image alert stays as the out-of-band trigger: an urgent advisory gets a manual, backup-first upgrade immediately, rather than waiting up to 8 weeks for the next analysis. Routine = scheduled; urgent = alert-driven.
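Taken together, decisions 3, 5, and 7 route each update down one of three paths. The sketch below only restates that routing. The function name and the returned labels are hypothetical, and deciding what counts as "urgent" is left to the human reading the alert.

```python
def upgrade_path(stateful: bool, urgent_advisory: bool = False) -> str:
    """Pick the update path for a service (labels are illustrative only)."""
    if not stateful:
        return "weekly-run"  # decision 3: rolling tags, Friday night
    if urgent_advisory:
        return "fast-path"  # decision 7: manual, backup-first, immediate
    return "8-weekly-analysis"  # decision 5: human-gated plan
```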

Open questions (for discussion)

- Where does the Proxmox snapshot get driven from? Control node calling the Proxmox API as a pre-step in the run, vs a Proxmox-side hook. Crosses the Terraform/Ansible boundary (ADR-006/009): TF "owns VM existence," but a snapshot isn't existence.
- Exact cadences: Friday weekly and 8-weekly stateful are starting points. Is weekly OS patching the right rhythm, or should reboots be rarer than apt upgrades?
- Where does the health-check harness live, and what is the minimum bar that counts as "in order" before the weekly run ships (decision 6 fixes the sequencing; this pins down the threshold)?
- Classification home: a per-role __stateful flag (proposed) vs a list in group_vars.
- Staging first? Should the weekly run hit a staging host before production, or is snapshot-before + Friday timing enough for a homelab of this size?
- Notification + control channel: boma defines its own ntfy topics (decided fresh per ADR-013, not reused from V4). Does the run also need a "skip this week" / "pause updates" switch? (Relates to TODO item 9, a tool→user messaging function.)

What was ruled out

| Option | Reason |
| --- | --- |
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data. |
| Digest-only pin (no readable tag) for stateful | Unreadable in diffs; the tiered rule pins `tag@digest` (readable tag and digest) instead (Decision 2). |
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the analysis is automated. |
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
| 8-weekly as the only stateful path | Too slow for urgent CVEs, hence the DIUN security fast-path. |

Consequences

- A single uniform update policy is rejected: the stateful/stateless split is load-bearing, so stateless services roll on rolling tags while stateful services are pinned tag@digest, human-gated, and backup-first (see "What was ruled out").
- The weekly run never touches stateful services and the whole fleet is never updated at once, accepting the added orchestration of host ordering and an 8-weekly + fast-path cadence in exchange for bounded blast radius (see "What was ruled out").
- No update automation ships until the health-check verification gate is in order; the pipeline is deliberately sequenced behind that harness (see Decision 6).
- Several points remain open for discussion (see "Open questions"): where the Proxmox snapshot is driven from across the TF/Ansible boundary; the exact cadences; where the health-check harness lives and the minimum bar that counts as "in order"; whether classification is a per-role __stateful flag or a group_vars list; whether the weekly run hits staging first; and the notification + "skip/pause" control channel.
