boma/docs/decisions/011-update-management.md
sjat 0e4050fa59 Add ADR-013 (V4 heritage policy); track ADR-011
ADR-013 sets how boma draws on AnsibleBaobabV4 without inheriting it:
translate-don't-transplant — V4 is evidence, never authority. It is a legitimate
source only of operational gotchas and working config snippets (re-derived on
boma's terms); never requirements, domain values, structure, or conventions.
Provenance stays transient (commits/conversation), durable docs stay clean. AI
consultation guardrails included. Resolves TODO 3.3 and 10.1.

Also bring ADR-011 (update management, Proposed draft) under version control:
- fix its "reuse V4's ntfy topics" line to "boma defines its own" (ADR-013)
- track its 6 open questions in TODO 16, plus a 7th: reconcile its tags-not-digests
  pinning with the digest-pinning the security work now mandates (R1 / checklist /
  15.6) — they currently conflict.

CLAUDE.md gains a V4 guardrail + ADR-013 pointer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 19:07:48 +02:00


ADR-011 — Update and upgrade management

Status: Proposed — draft for discussion (not yet accepted).

Context

boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things drift over time and must be kept current without breaking the homelab: the host OS (kernel, libc, packages → sometimes a reboot) and the container images.


Decisions

1. Every service is classified stateful or stateless

Each container role declares its class, e.g. <role>__stateful: true|false (default false). The split is the load-bearing classification for the whole policy.

  • Stateless — no durable data of its own; losing the container loses nothing. Rebuild = re-pull. Examples: the *arr stack, Jellyfin, exporters, whoami, Traefik, reverse proxies, FlareSolverr.
  • Stateful — owns data, schema, or migrations: databases, and apps with their own store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT). When in doubt, classify stateful (the safer, slower path).
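
A minimal sketch of how the per-role flag could be resolved, assuming role variables have already been loaded into a plain dictionary. The helper name `is_stateful` and the example variables are hypothetical; only the `<role>__stateful` naming and the default of false come from the text above.

```python
def is_stateful(role: str, role_vars: dict) -> bool:
    """Return the role's declared class; an undeclared role defaults to stateless."""
    value = role_vars.get(f"{role}__stateful", False)
    if not isinstance(value, bool):
        # Reject strings like "yes" so a typo cannot silently pick a class.
        raise TypeError(f"{role}__stateful must be true or false, got {value!r}")
    return value


# Hypothetical variables, as they might look after loading role defaults.
example_vars = {"mariadb__stateful": True, "whoami__stateful": False}
```

Because the default is false, the "when in doubt, classify stateful" rule has to be applied by whoever writes the role; the code cannot infer it.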

2. Image pinning follows the split

  • Stateless → rolling tags (latest/stable), refreshed by the weekly run and watched by DIUN. Always-current, cheap to roll back.
  • Stateful → pinned to a readable tag, minor where the image offers it (e.g. mariadb:11.4, not bare :11 and not a digest). Reproducible; upgrades are deliberate, never incidental.

Tags, not digests — readable in diffs; immutability is bought instead via snapshot-before and backups.
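
The pinning rule could be checked mechanically, for example as a lint step over the compose files. The sketch below is one possible shape, not boma's implementation: `split_image`, `pin_problem`, and the "minor tag" regular expression are assumptions, and images that publish no minor tags would need an exception list.

```python
import re
from typing import Optional, Tuple

ROLLING_TAGS = {"latest", "stable"}   # the rolling tags named in the text
MINOR_TAG = re.compile(r"^\d+\.\d+")  # matches 11.4 or 11.4.2, not a bare 11


def split_image(ref: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split 'name[:tag][@digest]' into (name, tag, digest)."""
    name, _, digest = ref.partition("@")
    tag = None
    # Only a colon after the last slash is a tag; earlier ones are registry ports.
    if ":" in name.rsplit("/", 1)[-1]:
        name, tag = name.rsplit(":", 1)
    return name, tag, digest or None


def pin_problem(ref: str, stateful: bool) -> Optional[str]:
    """Return a description of the policy violation, or None if the ref is fine."""
    _, tag, digest = split_image(ref)
    if digest:
        return "digest pin: this policy uses readable tags"
    if stateful:
        if tag is None or tag in ROLLING_TAGS:
            return "stateful image must be pinned to a version tag"
        if not MINOR_TAG.match(tag):
            return "stateful image should pin the minor version"
        return None
    if tag is not None and tag not in ROLLING_TAGS:
        return "stateless image is expected to use a rolling tag"
    return None
```

For example, `pin_problem("mariadb:11", True)` reports a missing minor version, while `pin_problem("mariadb:11.4", True)` returns `None`.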

3. Weekly OS + stateless run — Friday night, fail-stop, staggered

A scheduled run on Friday night (giving the weekend to fix anything it breaks), per host, in strict order with a verification gate between every phase:

  1. OS update: apt upgrade.
  2. Reboot — only if required (kernel/libc); detect via /var/run/reboot-required.
  3. Verify — health-check harness. Fail-stop: if a host fails, halt that host's run, leave it as-is, alert loudly — do not proceed to container updates on a wobbly host.
  4. Stateless container update: compose pull + recreate-if-changed.
  5. Verify again; alert on failure.
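
The fail-stop behaviour of the five phases above can be sketched as a small runner. This is an illustration under assumptions: the phase names, `run_host`, and the stub steps are hypothetical, and real steps would call apt, the health-check harness, and compose instead of returning canned results.

```python
from typing import Callable, List, Tuple

Phase = Tuple[str, Callable[[], bool]]


def run_host(phases: List[Phase]) -> Tuple[bool, List[str]]:
    """Run phases in order and stop at the first failure (fail-stop)."""
    log: List[str] = []
    for name, step in phases:
        if not step():
            log.append(f"FAILED: {name}")
            return False, log  # later phases never run; the host is left as-is
        log.append(name)
    return True, log


# Stub phases for illustration only.
ran: List[str] = []


def stub(name: str, ok: bool = True) -> Callable[[], bool]:
    def _run() -> bool:
        ran.append(name)
        return ok
    return _run


ok, log = run_host([
    ("os_update", stub("os_update")),
    ("reboot_if_required", stub("reboot_if_required")),
    ("verify", stub("verify", ok=False)),  # simulate a wobbly host
    ("stateless_update", stub("stateless_update")),
    ("verify_again", stub("verify_again")),
])
```

In this run the first verification fails, so `stateless_update` is never called and the log ends with the failed phase, which is what the alert would report.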

Host ordering: infrastructure hosts (DNS, then reverse proxy) update and validate before the rest follow — so a DNS/Traefik failure doesn't make every host look broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
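
One way to express the host ordering is a sort key over a per-host tier label. The labels `dns` and `reverse_proxy`, the host names, and `update_order` are all hypothetical; the sketch only shows the ordering, not the validation between hosts.

```python
from typing import Dict, List

# Lower number updates first; anything unlisted falls into the last tier.
TIER_ORDER = {"dns": 0, "reverse_proxy": 1}


def update_order(hosts: Dict[str, str]) -> List[str]:
    """Order hosts so DNS goes first, then the reverse proxy, then the rest."""
    return sorted(hosts, key=lambda h: (TIER_ORDER.get(hosts[h], 2), h))


# Hypothetical inventory: host name -> tier label.
example_hosts = {
    "media01": "app",
    "dns01": "dns",
    "proxy01": "reverse_proxy",
    "cloud01": "app",
}
```

Iterating over `update_order(example_hosts)` one host at a time also satisfies the "never reboot the whole fleet simultaneously" rule.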

4. Snapshot-before is the rollback mechanism

Because these are primarily Proxmox VMs, take a VM snapshot before the Friday window and auto-expire it after ~1 week if health checks stayed green.
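
The expiry rule reduces to a two-condition check. In the sketch below the exact retention of 7 days is an assumption standing in for "~1 week", and `should_expire` is a hypothetical helper. How the snapshot is actually created and deleted is left open, as it is in the open questions.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # assumption: "~1 week" taken as exactly 7 days


def should_expire(taken_at: datetime, now: datetime, health_green: bool) -> bool:
    """Expire a snapshot only if it is old enough and checks stayed green."""
    return health_green and (now - taken_at) >= RETENTION
```

A snapshot on a host whose checks went red is kept regardless of age, since it may still be needed for a rollback.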

5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first

Stateful services are never touched by the weekly run. Instead, every 8 weeks an automated analysis job (a scheduled claude -p, per the scheduled_jobs plan and ADR-010) does:

  1. Read changelogs / breaking-change notes for each pinned stateful image; diff the pinned tag against what's available.
  2. Emit a recommended upgrade plan as a Forgejo issue/PR — proposed target version, migration steps, and a backup-first checklist — for a human to approve.
  3. On approval, the upgrade runs in a deliberate maintenance window (not the Friday auto-run), and always dumps the DB / takes a backup first (ties to the backup work — TODO 3.8). DB major-version bumps are the highest-risk case and get their own migration plan.
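
The "diff the pinned tag against what's available" step in item 1 could start from something like the sketch below. It assumes purely numeric dotted tags and that the list of available tags has already been fetched; `parse` and `newer_tags` are hypothetical names, and tags with suffixes or a different number of components are simply skipped.

```python
from typing import List, Tuple


def parse(tag: str) -> Tuple[int, ...]:
    """Turn '11.4' into (11, 4); raises ValueError for non-numeric tags."""
    return tuple(int(part) for part in tag.split("."))


def newer_tags(pinned: str, available: List[str]) -> List[Tuple[str, str]]:
    """List tags newer than the pin, each marked as a major or minor bump."""
    current = parse(pinned)
    found = []
    for tag in available:
        try:
            version = parse(tag)
        except ValueError:
            continue  # skip 'latest', '11.4-jammy', and similar
        if len(version) != len(current) or version <= current:
            continue
        kind = "major" if version[0] > current[0] else "minor"
        found.append((version, tag, kind))
    return [(tag, kind) for _, tag, kind in sorted(found)]
```

Anything marked `major` for a database image would be routed to its own migration plan rather than the routine upgrade checklist.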

6. The verification gate is the health-check harness

"Check everything still works" is the load-bearing 80% of this ADR, and it is the same capability as the test methodology (ADR-008) and the sanity checks already on the TODO (2.2 — API/curl/log/headless checks; 8.2 — "does PhotoPrism have its pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around that harness.

Sequencing is deliberate: the health-check harness is built first, and no update automation (decisions 3 to 5) is deployed until it is in order. This is not a blocker to work around; it is the order of operations. An update run without a working verification gate is "update and pray," so it simply does not ship until the gate is real.
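
As a rough picture of the gate, the harness can be thought of as a set of named checks that each return true or false. The sketch below is an assumption about its shape, not the ADR-008 design: `failed_checks` and the example check names are hypothetical, and real checks would perform the API, curl, and log probes from the TODO.

```python
from typing import Callable, Dict, List


def failed_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that did not pass."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed


# Hypothetical checks with canned results.
example_checks = {
    "photoprism_has_pictures": lambda: True,
    "email_flowing": lambda: False,
    "traefik_api": lambda: 1 / 0,  # simulates a probe that raises
}
```

The gate passes only when `failed_checks(...)` returns an empty list; running all checks instead of stopping at the first failure gives the alert the full picture.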

7. Security fast-path overrides the slow cadence

The 8-weekly stateful cadence is for routine drift. It is too slow for a critical CVE in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's new-image alert stays as the out-of-band trigger: an urgent advisory gets a manual, backup-first upgrade immediately, not in up to 8 weeks. Routine = scheduled; urgent = alert-driven.
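
The routing between the three paths can be summarised as a small decision function. The function name, its parameters, and the returned labels are hypothetical; in practice the "critical advisory" input would be a human judgement triggered by a DIUN alert.

```python
def upgrade_path(stateful: bool, internet_facing: bool, critical_advisory: bool) -> str:
    """Pick the update path for a service under this policy."""
    if not stateful:
        return "weekly"          # handled by the Friday run
    if critical_advisory and internet_facing:
        return "fast-path"       # manual, backup-first, immediate
    return "8-weekly"            # routine, human-gated analysis
```

For instance, `upgrade_path(True, True, True)` yields `"fast-path"`, while the same service with no advisory stays on `"8-weekly"`.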


Open questions (for discussion)

  1. Where does the Proxmox snapshot get driven from? Control node calling the Proxmox API as a pre-step in the run, vs a Proxmox-side hook. Crosses the Terraform/Ansible boundary (ADR-006/009) — TF "owns VM existence," but a snapshot isn't existence.
  2. Exact cadences — Friday weekly and 8-weekly stateful are starting points. Is weekly OS patching the right rhythm, or should reboots be rarer than apt upgrades?
  3. Where does the health-check harness live, and what is the minimum bar that counts as "in order" before the weekly run ships (decision 6 fixes the sequencing; this pins down the threshold)?
  4. Classification home — a per-role __stateful flag (proposed) vs a list in group_vars.
  5. Staging first? Should the weekly run hit a staging host before production, or is snapshot-before + Friday timing enough for a homelab of this size?
  6. Notification + control channel: boma defines its own ntfy topics (decided fresh per ADR-013, not reused from V4). Still open: does the run need a "skip this week" / "pause updates" switch? (Relates to TODO item 9, a tool→user messaging function.)

What was ruled out

| Option | Reason |
|---|---|
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
| Rolling latest for stateful services | Unattended schema/migration changes are how you lose data. |
| Digest-pinning the stateful tier | Unreadable in diffs; snapshot-before + backups give the immutability instead. |
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the analysis is automated. |
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
| 8-weekly as the only stateful path | Too slow for urgent CVEs, hence the DIUN security fast-path. |