boma/docs/decisions/011-update-management.md
sjat db76be2a63 review-repo: clear O7-O12 clarity items
- ADR-011: ruled-out row was "digest-pinning stateful" (contradicted Decision 2);
  now "digest-only (no readable tag)" — tag@digest is adopted (O7)
- ADR-003/010: act_runner names ubongo as the runner host, runner VM as a future
  option (O8)
- ADR-008: WireGuard Molecule-exclusion row reframed to NetBird wt0 data plane (O9)
- ADR-011: scheduled_jobs xref points to TODO 8.3, not ADR-010 (O10)
- CAPABILITIES: add /verify-service Level 4 capability row (O11)
- TODO 3.10: rewrite the garbled base-container question (O12)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 19:28:07 +02:00


# ADR-011 — Update and upgrade management
**Status: Proposed — draft for discussion (not yet accepted).**
## Context
boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things
drift over time and must be kept current without breaking the homelab: the **host OS**
(kernel, libc, packages → sometimes a reboot) and the **container images**.
---
## Decisions
### 1. Every service is classified stateful or stateless
Each container role declares its class, e.g. `<role>__stateful: true|false` (default
`false`). The split is the load-bearing classification for the whole policy.
- **Stateless** — no durable data of its own; losing the container loses nothing.
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
reverse proxies, FlareSolverr.
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
When in doubt, classify **stateful** (the safer, slower path).
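
The classification rule can be sketched in Python as a lookup with a `false` default. This is a minimal sketch, not boma's implementation: only the `<role>__stateful` naming comes from this ADR, while the helper names and the example variable values are hypothetical.

```python
def is_stateful(role: str, role_vars: dict) -> bool:
    """Return the role's declared class; an undeclared role falls back to False."""
    return bool(role_vars.get(f"{role}__stateful", False))


def split_by_class(roles: list[str], role_vars: dict) -> tuple[list[str], list[str]]:
    """Partition roles into (stateless, stateful), preserving input order."""
    stateless = [r for r in roles if not is_stateful(r, role_vars)]
    stateful = [r for r in roles if is_stateful(r, role_vars)]
    return stateless, stateful


# Hypothetical variable values, for illustration only.
example_vars = {"mariadb__stateful": True, "vaultwarden__stateful": True}
stateless, stateful = split_by_class(
    ["whoami", "mariadb", "vaultwarden", "jellyfin"], example_vars
)
```

Because the default is `false`, a role that nobody classified lands on the weekly stateless path. The "when in doubt, classify stateful" rule therefore has to be applied by explicitly setting the flag.
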
### 2. Image pinning follows the split
- **Stateless → rolling tags** (`latest`/`stable`), refreshed by the weekly run and
watched by DIUN. Always-current, cheap to roll back. No digest pin — it would
defeat the rolling design.
- **Stateful → pinned `tag@digest`** — a readable **minor** tag where the image
offers it (e.g. `mariadb:11.4`, not bare `:11`) **plus its digest**
(`mariadb:11.4@sha256:…`). Reproducible and tamper-evident; upgrades are deliberate
(bump tag and digest together), never incidental.
Readable tag **and** digest, not one or the other: the tag keeps diffs legible, the
digest pins the exact bytes for supply-chain integrity (ADR-002, accepted-risk R1).
Snapshot-before + backups remain the rollback mechanism for a *broken* update; the
digest is what guards against a *swapped* image, which snapshots cannot.
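
A lint-style check for the pinning rule might look like the sketch below. It is an illustration under assumptions: the function names are hypothetical, the rolling-tag set is limited to the two tags named above, and an image reference with no tag is treated as rolling because Docker resolves it to `latest`.

```python
import re

ROLLING_TAGS = {"latest", "stable"}  # the rolling tags named in this ADR
DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")
MINOR_TAG = re.compile(r"^\d+\.\d+")  # at least MAJOR.MINOR, e.g. 11.4


def parse_ref(ref: str) -> tuple[str, str, str]:
    """Split an image reference into (name, tag, digest); missing parts are ''."""
    name_tag, _, digest = ref.partition("@")
    last_segment = name_tag.rsplit("/", 1)[-1]
    if ":" in last_segment:  # a colon before the last '/' is a registry port
        name, _, tag = name_tag.rpartition(":")
    else:
        name, tag = name_tag, ""
    return name, tag, digest


def check_image_ref(ref: str, stateful: bool) -> list[str]:
    """Return a list of policy problems; an empty list means the ref conforms."""
    _, tag, digest = parse_ref(ref)
    problems = []
    if stateful:
        if not tag or tag in ROLLING_TAGS:
            problems.append("stateful image needs a readable version tag")
        elif not MINOR_TAG.match(tag):
            problems.append("prefer a minor tag over a bare major tag where offered")
        if not DIGEST.match(digest):
            problems.append("stateful image needs a sha256 digest")
    else:
        if digest:
            problems.append("stateless image should not carry a digest pin")
        if tag and tag not in ROLLING_TAGS:
            problems.append("stateless image should use a rolling tag")
    return problems
```

The bare-major-tag message is advisory: the ADR asks for a minor tag only "where the image offers it", so a real check would likely downgrade that case to a warning.
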
### 3. Weekly OS + stateless run — Friday night, fail-stop, staggered
A scheduled run on **Friday night** (giving the weekend to fix anything it breaks),
per host, in strict order with a verification gate between every phase:
1. **OS update** via `apt` upgrade.
2. **Reboot** — only if required (kernel/libc); detect via `/var/run/reboot-required`.
3. **Verify** — health-check harness. **Fail-stop:** if a host fails,
halt _that host's_ run, leave it as-is, alert loudly — do **not** proceed
to container updates on a wobbly host.
4. **Stateless container update** via `compose pull` + recreate-if-changed.
5. **Verify** again; alert on failure.
**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
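
The phase order, the fail-stop gate, and the host ordering above can be sketched as a small driver. This is a sketch under assumptions, not the planned implementation: `HostOps` and its hooks are hypothetical stand-ins for whatever Ansible tasks end up doing the work, and stopping the rest of the fleet when an infrastructure host fails is one reading of "before the rest follow".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HostOps:
    """Hypothetical hooks; a real run would back these with Ansible tasks."""
    os_update: Callable[[str], None]
    reboot_required: Callable[[str], bool]  # e.g. checks /var/run/reboot-required
    reboot: Callable[[str], None]
    verify: Callable[[str], bool]           # the health-check harness
    update_stateless: Callable[[str], None]
    alert: Callable[[str, str], None]


def run_host(host: str, ops: HostOps) -> bool:
    """Run phases 1-5 for one host; return False as soon as a gate fails."""
    ops.os_update(host)
    if ops.reboot_required(host):
        ops.reboot(host)
    if not ops.verify(host):
        # Fail-stop: leave the host as-is and skip container updates.
        ops.alert(host, "verification failed after OS update")
        return False
    ops.update_stateless(host)
    if not ops.verify(host):
        ops.alert(host, "verification failed after stateless container update")
        return False
    return True


def run_fleet(infra_hosts: list[str], other_hosts: list[str], ops: HostOps) -> dict[str, bool]:
    """Infrastructure hosts first, one host at a time, never the fleet at once."""
    results: dict[str, bool] = {}
    for host in infra_hosts:
        results[host] = run_host(host, ops)
        if not results[host]:
            return results  # assumption: the rest wait until infra is healthy
    for host in other_hosts:
        results[host] = run_host(host, ops)  # a failure halts only that host
    return results
```

Hosts are processed sequentially, so at most one host is rebooting at any time. A failed non-infrastructure host does not block the hosts after it.
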
### 4. Snapshot-before is the rollback mechanism
Because these are primarily Proxmox VMs, take a **VM snapshot before the Friday window** and
**auto-expire it after ~1 week** if health checks stayed green.
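
The expiry rule can be written as a small decision function. This is a sketch under assumptions: the retention is fixed at exactly seven days for "~1 week", and the `keep-and-alert` outcome for a snapshot whose host did not stay green is a guess, since the ADR only specifies the green case.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # "~1 week" in this ADR; the exact value is an assumption


def snapshot_action(taken_at: datetime, now: datetime, health_green: bool) -> str:
    """Decide what to do with a pre-update snapshot."""
    if now - taken_at < RETENTION:
        return "keep"  # still inside the rollback window
    # Past the window: expire only if health checks stayed green throughout.
    return "expire" if health_green else "keep-and-alert"
```
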
### 5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first
Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks**
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` design in
`docs/TODO.md` 8.3, not yet built) does:
1. Read changelogs / breaking-change notes for each pinned stateful image; diff the
pinned tag against what's available.
2. Emit a **recommended upgrade plan** as a Forgejo issue/PR — proposed target version,
migration steps, and a backup-first checklist — for a human to approve.
3. On approval, the upgrade runs in a **deliberate maintenance window** (not the Friday
auto-run), and **always dumps the DB / takes a backup first** (ties to the backup
work — TODO 3.8). DB **major**-version bumps are the highest-risk case and get their
own migration plan.
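
Step 1's "diff the pinned tag against what's available" could start from a comparison like the sketch below. It assumes dotted numeric tags such as `11.4` or `16.2-alpine`; images that do not version this way fall through to `unknown` and need a human to read them. The function names are hypothetical.

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Leading numeric dotted part of a tag: '11.4' -> (11, 4), 'latest' -> ()."""
    parts = []
    for piece in tag.lstrip("v").split("-", 1)[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def classify_bump(pinned: str, candidate: str) -> str:
    """Classify the jump from the pinned tag to a candidate tag."""
    old, new = parse_version(pinned), parse_version(candidate)
    if not old or not new:
        return "unknown"
    if new <= old:
        return "none"
    if new[0] != old[0]:
        return "major"
    if new[:2] != old[:2]:
        return "minor"
    return "patch"


def needs_own_migration_plan(pinned: str, candidate: str) -> bool:
    """Major-version bumps are the highest-risk case and get their own plan."""
    return classify_bump(pinned, candidate) == "major"
```

The classification only ranks the jump. The changelog reading and the backup-first checklist remain the substance of the recommended plan.
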
### 6. The verification gate is the health-check harness
"Check everything still works" is the load-bearing 80% of this ADR, and it is the
**same capability** as the test methodology (ADR-008) and the sanity checks already on
the TODO (2.2 — API/curl/log/headless checks; 8.2 — "does PhotoPrism have its
pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around
that harness.
**Sequencing is deliberate: the health-check harness is built first, and no update
automation (decisions 3–5) is deployed until it is in order.** This is not a blocker to
work around — it's the order of operations. An update run without a working verification
gate is "update and pray," so it simply does not ship until the gate is real.
### 7. Security fast-path overrides the slow cadence
The 8-weekly stateful cadence is for routine drift. It is **too slow for a critical CVE**
in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's
new-image alert stays as the **out-of-band trigger**: an urgent advisory gets a manual,
backup-first upgrade immediately, not in up to 8 weeks. Routine = scheduled; urgent =
alert-driven.
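
Taken together with decisions 3 and 5, the routing reduces to a three-way choice. The sketch below is illustrative only: the path labels are hypothetical, and sending a critical advisory for a stateful service that is not internet-facing to the routine cadence is an assumption, since the ADR only names the internet-facing case.

```python
def upgrade_path(stateful: bool, internet_facing: bool, critical_advisory: bool) -> str:
    """Pick the update path for one service."""
    if not stateful:
        return "weekly-run"        # decision 3: refreshed every Friday
    if critical_advisory and internet_facing:
        return "fast-path"         # manual, backup-first, immediately
    return "8-weekly-analysis"     # decision 5: human-gated routine cadence
```
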
---
## Open questions (for discussion)
1. **Where does the Proxmox snapshot get driven from?** Control node calling the Proxmox
API as a pre-step in the run, vs a Proxmox-side hook. Crosses the Terraform/Ansible
boundary (ADR-006/009) — TF "owns VM existence," but a snapshot isn't existence.
2. **Exact cadences** — Friday weekly and 8-weekly stateful are starting points. Is
weekly OS patching the right rhythm, or should reboots be rarer than `apt` upgrades?
3. **Where does the health-check harness live, and what is the minimum bar that counts
as "in order"** before the weekly run ships (decision 6 fixes the sequencing; this
pins down the threshold)?
4. **Classification home** — a per-role `__stateful` flag (proposed) vs a list in
group_vars.
5. **Staging first?** Should the weekly run hit a staging host before production, or is
snapshot-before + Friday timing enough for a homelab of this size?
6. **Notification + control channel** — boma defines its own ntfy topics (decided
fresh per ADR-013, not reused from V4). Does the run also need a "skip this week" /
"pause updates" switch? (Relates to TODO item 9 — a tool→user messaging function.)
---
## What was ruled out
| Option | Reason |
| -------------------------------------- | ----------------------------------------------------------------------------- |
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data. |
| Digest-_only_ pin (no readable tag) for stateful | Unreadable in diffs — the tiered rule pins `tag@digest` (readable tag *and* digest) instead (Decision 2). |
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the _analysis_ is automated. |
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
| 8-weekly as the only stateful path | Too slow for urgent CVEs — hence the DIUN security fast-path. |
---