# ADR-011: Update and upgrade management

**Status: Proposed, draft for discussion (not yet accepted).**

## Context

boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things drift over time and must be kept current without breaking the homelab: the **host OS** (kernel, libc, packages → sometimes a reboot) and the **container images**.

---

## Decisions

### 1. Every service is classified stateful or stateless

Each container role declares its class, e.g. `__stateful: true|false` (default `false`). The split is the load-bearing classification for the whole policy.

- **Stateless:** no durable data of its own; losing the container loses nothing. Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik and other reverse proxies, FlareSolverr.
- **Stateful:** owns data, schema, or migrations: databases, and apps with their own store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).

When in doubt, classify **stateful** (the safer, slower path).

### 2. Image pinning follows the split

- **Stateless → rolling tags** (`latest`/`stable`), refreshed by the weekly run and watched by DIUN. Always-current, cheap to roll back. No digest pin: it would defeat the rolling design.
- **Stateful → pinned `tag@digest`:** a readable **minor** tag where the image offers it (e.g. `mariadb:11.4`, not bare `:11`) **plus its digest** (`mariadb:11.4@sha256:…`). Reproducible and tamper-evident; upgrades are deliberate (bump tag and digest together), never incidental.

Readable tag **and** digest, not one or the other: the tag keeps diffs legible, the digest pins the exact bytes for supply-chain integrity (ADR-002, accepted-risk R1). Snapshot-before + backups remain the rollback mechanism for a *broken* update; the digest is what guards against a *swapped* image, which snapshots cannot.

### 3. Weekly OS + stateless run: Friday night, fail-stop, staggered

A scheduled run on **Friday night** (giving the weekend to fix anything it breaks), per host, in strict order with a verification gate between every phase:

1. **OS update:** `apt` upgrade.
2. **Reboot:** only if required (kernel/libc); detect via `/var/run/reboot-required`.
3. **Verify:** health-check harness. **Fail-stop:** if a host fails, halt _that host's_ run, leave it as-is, alert loudly, and do **not** proceed to container updates on a wobbly host.
4. **Stateless container update:** `compose pull` + recreate-if-changed.
5. **Verify** again; alert on failure.

**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate **before** the rest follow, so that a DNS/Traefik failure doesn't make every host look broken at once and hide the real cause. Never reboot the whole fleet simultaneously.

### 4. Snapshot-before is the rollback mechanism

Because these are primarily Proxmox VMs, take a **VM snapshot before the Friday window** and **auto-expire it after ~1 week** if health checks stayed green.

### 5. Stateful upgrades: 8-weekly analysis, human-gated, backup-first

Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks** an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` plan and ADR-010) does the following:

1. Read changelogs / breaking-change notes for each pinned stateful image; diff the pinned tag against what's available.
2. Emit a **recommended upgrade plan** as a Forgejo issue/PR (proposed target version, migration steps, and a backup-first checklist) for a human to approve.
3. On approval, the upgrade runs in a **deliberate maintenance window** (not the Friday auto-run), and **always dumps the DB / takes a backup first** (ties to the backup work, TODO 3.8).

DB **major**-version bumps are the highest-risk case and get their own migration plan.

### 6. The verification gate is the health-check harness

"Check everything still works" is the load-bearing 80% of this ADR, and it is the **same capability** as the test methodology (ADR-008) and the sanity checks already on the TODO (2.2: API/curl/log/headless checks; 8.2: "does PhotoPrism have its pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around that harness.

**Sequencing is deliberate: the health-check harness is built first, and no update automation (decisions 3–5) is deployed until it is in order.** This is not a blocker to work around; it's the order of operations. An update run without a working verification gate is "update and pray," so it simply does not ship until the gate is real.

### 7. Security fast-path overrides the slow cadence

The 8-weekly stateful cadence is for routine drift. It is **too slow for a critical CVE** in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's new-image alert stays as the **out-of-band trigger**: an urgent advisory gets a manual, backup-first upgrade immediately, not in up to 8 weeks. Routine = scheduled; urgent = alert-driven.

---

## Open questions (for discussion)

1. **Where does the Proxmox snapshot get driven from?** Control node calling the Proxmox API as a pre-step in the run, vs a Proxmox-side hook. This crosses the Terraform/Ansible boundary (ADR-006/009): TF "owns VM existence," but a snapshot isn't existence.
2. **Exact cadences:** Friday weekly and 8-weekly stateful are starting points. Is weekly OS patching the right rhythm, or should reboots be rarer than `apt` upgrades?
3. **Where does the health-check harness live, and what is the minimum bar that counts as "in order"** before the weekly run ships (decision 6 fixes the sequencing; this pins down the threshold)?
4. **Classification home:** a per-role `__stateful` flag (proposed) vs a list in group_vars.
5. **Staging first?** Should the weekly run hit a staging host before production, or is snapshot-before + Friday timing enough for a homelab of this size?
6. **Notification + control channel:** boma defines its own ntfy topics (decided fresh per ADR-013, not reused from V4). Does the run also need a "skip this week" / "pause updates" switch? (Relates to TODO item 9, a tool→user messaging function.)

---

## What was ruled out

| Option | Reason |
| --- | --- |
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data. |
| Digest-only pinning for the stateful tier | Unreadable in diffs; decision 2 pairs a readable tag with the digest instead. |
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the _analysis_ is automated. |
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
| 8-weekly as the only stateful path | Too slow for urgent CVEs; hence the DIUN security fast-path. |

---
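
As an illustration of decision 2, the pinning rule could be linted along these lines. This is a minimal sketch, not boma's tooling: `parse_image_ref`, `pinning_violations`, and `ROLLING_TAGS` are hypothetical names, the parser only handles the common `name[:tag][@digest]` shape, and treating exactly `latest`/`stable` as the rolling tags is an assumption taken from the examples in decision 2.

```python
from typing import List, NamedTuple, Optional

# Assumption: these are the only tags treated as "rolling".
ROLLING_TAGS = {"latest", "stable"}


class ImageRef(NamedTuple):
    name: str
    tag: Optional[str]
    digest: Optional[str]


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'name[:tag][@digest]' into its parts."""
    rest, _, digest = ref.partition("@")
    # A colon marks a tag only if it comes after the last slash;
    # otherwise it is a registry port (e.g. 'registry:5000/app').
    last_slash = rest.rfind("/")
    colon = rest.rfind(":")
    if colon > last_slash:
        name, tag = rest[:colon], rest[colon + 1:]
    else:
        name, tag = rest, ""
    return ImageRef(name, tag or None, digest or None)


def pinning_violations(ref: str, stateful: bool) -> List[str]:
    """Return the ways an image reference breaks the decision 2 policy."""
    image = parse_image_ref(ref)
    problems: List[str] = []
    if stateful:
        if image.tag is None:
            problems.append("stateful image needs a readable tag")
        elif image.tag in ROLLING_TAGS:
            problems.append("stateful image must not use a rolling tag")
        if image.digest is None or not image.digest.startswith("sha256:"):
            problems.append("stateful image needs a sha256 digest")
    elif image.digest is not None:
        problems.append("stateless image should not be digest-pinned")
    return problems
```

A check like this could run in CI over every role's image reference together with its `__stateful` flag. It deliberately does not enforce the "minor tag" preference, since decision 2 applies that only where the image offers one.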
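
The fail-stop ordering in decision 3 can be sketched as a small runner. Everything here is illustrative: `run_host`, `run_fleet`, and `HostResult` are hypothetical names, each phase is an injected callable standing in for the real `apt`, reboot, harness, and compose steps, and stopping the whole run when an infrastructure host fails is an assumption (the ADR only says infrastructure hosts validate before the rest follow).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A phase is a name plus an action that returns True on success.
Phase = Tuple[str, Callable[[str], bool]]


@dataclass
class HostResult:
    host: str
    completed: List[str] = field(default_factory=list)
    failed_phase: Optional[str] = None


def run_host(host: str, phases: List[Phase]) -> HostResult:
    """Run the phases in strict order; stop at the first failure."""
    result = HostResult(host)
    for name, action in phases:
        if not action(host):
            # Fail-stop: leave the host as-is and skip later phases.
            result.failed_phase = name
            break
        result.completed.append(name)
    return result


def run_fleet(
    infra_hosts: List[str],
    other_hosts: List[str],
    phases: List[Phase],
) -> List[HostResult]:
    """Update infrastructure hosts first, then the rest, one at a time."""
    results: List[HostResult] = []
    for host in infra_hosts:
        result = run_host(host, phases)
        results.append(result)
        if result.failed_phase is not None:
            # Assumption: a broken DNS/proxy host halts the whole run,
            # so later hosts are not judged against broken infrastructure.
            return results
    for host in other_hosts:
        # A failure here halts only that host; the others still run.
        results.append(run_host(host, phases))
    return results
```

In use, `phases` would be the five steps from the list in decision 3 (OS update, reboot if required, verify, stateless update, verify), and any `HostResult` with a non-empty `failed_phase` would feed the alert. Processing hosts sequentially also means the fleet is never rebooted at the same moment.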