ADR-011 — Update and upgrade management
Status
Proposed (2026-06-04) — draft for discussion; not yet accepted. The core decisions below are settled in intent, but several specifics remain open (see "Open questions").
Context
boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things drift over time and must be kept current without breaking the homelab: the host OS (kernel, libc, packages → sometimes a reboot) and the container images.
Decision
1. Every service is classified stateful or stateless
Each container role declares its class, e.g. <role>__stateful: true|false (default false). The split is the load-bearing classification for the whole policy.
- Stateless — no durable data of its own; losing the container loses nothing. Rebuild = re-pull. Examples: the *arr stack, Jellyfin, exporters, whoami, Caddy, reverse proxies, FlareSolverr.
- Stateful — owns data, schema, or migrations: databases, and apps with their own store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT). When in doubt, classify stateful (the safer, slower path).
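A minimal sketch of how the per-role flag could be resolved, assuming role variables arrive as a plain mapping. The function name `service_class` and the example role names are illustrative, not part of boma. An unset flag falls back to stateless, so the "when in doubt, classify stateful" rule only takes effect if the role sets the flag explicitly.

```python
from typing import Mapping


def service_class(role: str, role_vars: Mapping[str, object]) -> str:
    """Resolve a role's class from its <role>__stateful flag (hypothetical helper)."""
    # Default is false, as stated in decision 1.
    flag = role_vars.get(f"{role}__stateful", False)
    if not isinstance(flag, bool):
        # Refuse strings like "yes": a typo here would silently pick the wrong path.
        raise TypeError(f"{role}__stateful must be true or false, got {flag!r}")
    return "stateful" if flag else "stateless"
```

For example, `service_class("vaultwarden", {"vaultwarden__stateful": True})` resolves to the stateful path, while a role with no flag at all resolves to stateless.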
2. Image pinning follows the split
- Stateless → rolling tags (latest/stable), refreshed by the weekly run and watched by DIUN. Always-current, cheap to roll back. No digest pin: it would defeat the rolling design.
- Stateful → pinned tag@digest: a readable minor tag where the image offers it (e.g. mariadb:11.4, not bare :11) plus its digest (mariadb:11.4@sha256:…). Reproducible and tamper-evident; upgrades are deliberate (bump tag and digest together), never incidental.
Readable tag and digest, not one or the other: the tag keeps diffs legible, the digest pins the exact bytes for supply-chain integrity (ADR-002, accepted-risk R1). Snapshot-before + backups remain the rollback mechanism for a broken update; the digest is what guards against a swapped image, which snapshots cannot.
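The pinning rule lends itself to a lint-style check. The sketch below is one possible shape, not an existing boma tool: `check_image_ref` and the `ROLLING_TAGS` set are assumptions, and the "minor tag where the image offers it" preference is deliberately not enforced because it depends on the upstream image.

```python
import re

# Rolling tags named in decision 2; treated here as invalid for stateful services.
ROLLING_TAGS = {"latest", "stable"}
_DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def check_image_ref(ref: str, stateful: bool) -> list[str]:
    """Return a list of policy problems for one image reference (empty = OK)."""
    name_tag, _, digest = ref.partition("@")
    # The tag sits after the last ':' of the final path segment, so a
    # registry port such as registry:5000/app is not mistaken for a tag.
    tag = name_tag.rsplit("/", 1)[-1].partition(":")[2]
    problems: list[str] = []
    if stateful:
        if not tag or tag in ROLLING_TAGS:
            problems.append("stateful image needs a readable, non-rolling tag")
        if not _DIGEST.match(digest):
            problems.append("stateful image needs an @sha256 digest pin")
    elif digest:
        problems.append("stateless image should not carry a digest pin")
    return problems
```

A stateful reference passes only when both halves are present, which mirrors the "tag and digest, not one or the other" rule.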
3. Weekly OS + stateless run — Friday night, fail-stop, staggered
A scheduled run on Friday night (giving the weekend to fix anything it breaks), per host, in strict order with a verification gate between every phase:
- OS update: apt upgrade.
- Reboot: only if required (kernel/libc); detect via /var/run/reboot-required.
- Verify: health-check harness. Fail-stop: if a host fails, halt that host's run, leave it as-is, alert loudly; do not proceed to container updates on a wobbly host.
- Stateless container update: compose pull + recreate-if-changed.
- Verify again; alert on failure.
Host ordering: infrastructure hosts (DNS, then reverse proxy) update and validate before the rest follow — so a DNS/Caddy failure doesn't make every host look broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
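The phase order, the fail-stop rule, and the host ordering can be sketched together. This is an illustration under stated assumptions: the tier labels in `TIER_ORDER`, the phase names, and the idea that each phase is a callable returning success or failure are all hypothetical, and the real run would be driven by the playbooks rather than by this code.

```python
import os
from typing import Callable, Mapping

# Hypothetical tier labels; any other tier runs after the infrastructure hosts.
TIER_ORDER = {"dns": 0, "proxy": 1}

# Phase order from decision 3.
PHASES = ("os_update", "reboot_if_required", "verify", "stateless_update", "verify_again")


def reboot_required(marker: str = "/var/run/reboot-required") -> bool:
    """True when the marker file named in the ADR exists on the host."""
    return os.path.exists(marker)


def order_hosts(tiers: Mapping[str, str]) -> list[str]:
    """Sort hosts so DNS comes first, then reverse proxy, then everything else."""
    return sorted(tiers, key=lambda h: (TIER_ORDER.get(tiers[h], len(TIER_ORDER)), h))


def run_host(host: str, actions: Mapping[str, Callable[[str], bool]]) -> tuple[bool, list[str]]:
    """Run phases in order; stop at the first failure and leave the host as-is."""
    completed: list[str] = []
    for phase in PHASES:
        if not actions[phase](host):
            return False, completed  # fail-stop: later phases never run
        completed.append(phase)
    return True, completed
```

Alerting is left to the caller: a `False` result together with the list of completed phases says exactly where the host stopped.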
4. Snapshot-before is the rollback mechanism
Because these are primarily Proxmox VMs, take a VM snapshot before the Friday window and auto-expire it after ~1 week if health checks stayed green.
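The expiry rule reduces to a small decision function. A sketch, assuming the caller already knows when the snapshot was taken and whether health checks stayed green; `snapshot_action` is a hypothetical name and the 7-day default stands in for the "~1 week" above.

```python
from datetime import datetime, timedelta


def snapshot_action(taken_at: datetime, now: datetime, health_green: bool,
                    max_age: timedelta = timedelta(days=7)) -> str:
    """Return 'expire' or 'keep' for a pre-update snapshot."""
    if not health_green:
        # A host that went red may still need this snapshot for rollback.
        return "keep"
    return "expire" if now - taken_at >= max_age else "keep"
```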
5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first
Stateful services are never touched by the weekly run. Instead, every 8 weeks an automated analysis job (a scheduled claude -p, per the scheduled_jobs design in docs/TODO.md 8.3, not yet built) does:
- Read changelogs / breaking-change notes for each pinned stateful image; diff the pinned tag against what's available.
- Emit a recommended upgrade plan as a Forgejo issue/PR — proposed target version, migration steps, and a backup-first checklist — for a human to approve.
- On approval, the upgrade runs in a deliberate maintenance window (not the Friday auto-run), and always dumps the DB / takes a backup first (ties to the backup work — TODO 3.8). DB major-version bumps are the highest-risk case and get their own migration plan.
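The "diff the pinned tag against what's available" step could look roughly like the following. It is a sketch only: it assumes tags are plain dotted numbers (anything else, such as latest or suffixed variants, is ignored), and `newer_versions` is a hypothetical helper rather than part of the unbuilt analysis job. Fetching the tag list from a registry is out of scope here.

```python
import re

_NUMERIC = re.compile(r"^\d+(\.\d+)*$")


def _key(tag: str) -> tuple[int, ...]:
    """Turn '11.4' into (11, 4) so versions compare numerically, not as text."""
    return tuple(int(part) for part in tag.split("."))


def newer_versions(pinned: str, available: list[str]) -> dict:
    """List numeric tags newer than the pinned one and flag a major-version jump."""
    current = _key(pinned)
    newer = sorted(
        (t for t in available if _NUMERIC.match(t) and _key(t) > current),
        key=_key,
    )
    return {
        "newer": newer,
        # DB major bumps are the highest-risk case and need their own plan.
        "major_bump": any(_key(t)[0] > current[0] for t in newer),
    }
```

The result feeds the human-approved plan; nothing in this step performs an upgrade.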
6. The verification gate is the health-check harness
"Check everything still works" is the load-bearing 80% of this ADR, and it is the same capability as the test methodology (ADR-008) and the sanity checks already on the TODO (2.2 — API/curl/log/headless checks; 8.2 — "does PhotoPrism have its pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around that harness.
Sequencing is deliberate: the health-check harness is built first, and no update automation (decisions 3–5) is deployed until it is in order. This is not a blocker to work around — it's the order of operations. An update run without a working verification gate is "update and pray," so it simply does not ship until the gate is real.
7. Security fast-path overrides the slow cadence
The 8-weekly stateful cadence is for routine drift. It is too slow for a critical CVE in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's new-image alert stays as the out-of-band trigger: an urgent advisory gets a manual, backup-first upgrade immediately, not in up to 8 weeks. Routine = scheduled; urgent = alert-driven.
Open questions (for discussion)
- Where does the Proxmox snapshot get driven from? Control node calling the Proxmox API as a pre-step in the run, vs a Proxmox-side hook. Crosses the Terraform/Ansible boundary (ADR-006/009) — TF "owns VM existence," but a snapshot isn't existence.
- Exact cadences: Friday weekly and 8-weekly stateful are starting points. Is weekly OS patching the right rhythm, or should reboots be rarer than apt upgrades?
- Where does the health-check harness live, and what is the minimum bar that counts as "in order" before the weekly run ships (decision 6 fixes the sequencing; this pins down the threshold)?
- Classification home: a per-role __stateful flag (proposed) vs a list in group_vars.
- Staging first? Should the weekly run hit a staging host before production, or is snapshot-before + Friday timing enough for a homelab of this size?
- Notification + control channel: boma defines its own ntfy topics (decided fresh per ADR-013, not reused from V4). Does the run also need a "skip this week" / "pause updates" switch? (Relates to TODO item 9, a tool→user messaging function.)
What was ruled out
| Option | Reason |
|---|---|
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
| Rolling latest for stateful services | Unattended schema/migration changes are how you lose data. |
| Digest-only pin (no readable tag) for stateful | Unreadable in diffs — the tiered rule pins tag@digest (readable tag and digest) instead (Decision 2). |
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the analysis is automated. |
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
| 8-weekly as the only stateful path | Too slow for urgent CVEs — hence the DIUN security fast-path. |
Consequences
- A single uniform update policy is rejected: the stateful/stateless split is load-bearing, so stateless services roll on rolling tags while stateful services are pinned tag@digest, human-gated, and backup-first (see "What was ruled out").
- The weekly run never touches stateful services and the whole fleet is never updated at once, accepting the added orchestration of host ordering and an 8-weekly + fast-path cadence in exchange for bounded blast radius (see "What was ruled out").
- No update automation ships until the health-check verification gate is in order; the pipeline is deliberately sequenced behind that harness (see Decision 6).
- Several points remain open for discussion (see "Open questions"): where the Proxmox snapshot is driven from across the TF/Ansible boundary; the exact cadences; where the health-check harness lives and the minimum bar that counts as "in order"; whether classification is a per-role __stateful flag or a group_vars list; whether the weekly run hits staging first; and the notification + "skip/pause" control channel.