boma/docs/decisions/011-update-management.md
sjat 0e4050fa59 Add ADR-013 (V4 heritage policy); track ADR-011
ADR-013 sets how boma draws on AnsibleBaobabV4 without inheriting it:
translate-don't-transplant — V4 is evidence, never authority. It is a legitimate
source only of operational gotchas and working config snippets (re-derived on
boma's terms); never requirements, domain values, structure, or conventions.
Provenance stays transient (commits/conversation), durable docs stay clean. AI
consultation guardrails included. Resolves TODO 3.3 and 10.1.

Also bring ADR-011 (update management, Proposed draft) under version control:
- fix its "reuse V4's ntfy topics" line to "boma defines its own" (ADR-013)
- track its 6 open questions in TODO 16, plus a 7th: reconcile its tags-not-digests
  pinning with the digest-pinning the security work now mandates (R1 / checklist /
  15.6) — they currently conflict.

CLAUDE.md gains a V4 guardrail + ADR-013 pointer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 19:07:48 +02:00

# ADR-011 — Update and upgrade management
**Status: Proposed — draft for discussion (not yet accepted).**
## Context
boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things
drift over time and must be kept current without breaking the homelab: the **host OS**
(kernel, libc, packages → sometimes a reboot) and the **container images**.
---
## Decisions
### 1. Every service is classified stateful or stateless
Each container role declares its class, e.g. `<role>__stateful: true|false` (default
`false`). The split is the load-bearing classification for the whole policy.
- **Stateless** — no durable data of its own; losing the container loses nothing.
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik and
other reverse proxies, FlareSolverr.
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
When in doubt, classify **stateful** (the safer, slower path).
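As an illustration only, the per-role flag could be resolved like this. The function names and the flat `role_vars` dictionary are hypothetical stand-ins for however the inventory actually exposes role variables; the only things taken from this ADR are the `<role>__stateful` naming and the `false` default.

```python
def is_stateful(role, role_vars):
    """Return the declared class for a role; an undeclared role defaults to stateless."""
    value = role_vars.get(f"{role}__stateful", False)
    if not isinstance(value, bool):
        # Refuse strings like "yes" so a typo cannot silently pick a class.
        raise ValueError(f"{role}__stateful must be true or false, got {value!r}")
    return value


def split_roles(roles, role_vars):
    """Partition roles into (stateful, stateless), preserving input order."""
    stateful = [r for r in roles if is_stateful(r, role_vars)]
    stateless = [r for r in roles if not is_stateful(r, role_vars)]
    return stateful, stateless
```

Note that the default only covers roles that declare nothing; the "when in doubt, classify stateful" rule still has to be applied by whoever writes the declaration.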
### 2. Image pinning follows the split
- **Stateless → rolling tags** (`latest`/`stable`), refreshed by the weekly run and
watched by DIUN. Always-current, cheap to roll back.
- **Stateful → pinned** to a readable tag, **minor** where the image offers it
(e.g. `mariadb:11.4`, not bare `:11` and not a digest). Reproducible; upgrades are
deliberate, never incidental.
Tags, not digests — readable in diffs; immutability is bought instead via
snapshot-before and backups.
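A lint-style check for this rule might look like the sketch below. It is not part of boma; `split_image`, `check_pin`, and the "at least major.minor" regular expression are assumptions made for illustration, and the parser only handles the common `name[:tag][@digest]` shape.

```python
import re

ROLLING_TAGS = {"latest", "stable"}
# Assumption: "pinned to minor" means the tag starts with major.minor, e.g. 11.4 or v1.32.
MINOR_TAG = re.compile(r"^v?\d+\.\d+")


def split_image(ref):
    """Split an image reference into (name, tag, digest); missing parts are ''."""
    name, _, digest = ref.partition("@")
    tag = ""
    # Only a colon in the last path segment is a tag separator;
    # a colon earlier on belongs to a registry port such as localhost:5000.
    if ":" in name.rsplit("/", 1)[-1]:
        name, _, tag = name.rpartition(":")
    return name, tag, digest


def check_pin(ref, stateful):
    """Return a list of policy violations for one image (empty means compliant)."""
    _, tag, digest = split_image(ref)
    problems = []
    if digest:
        problems.append("digest pin: this draft uses readable tags instead")
    if stateful:
        if not tag or tag in ROLLING_TAGS:
            problems.append("stateful image must not use a rolling tag")
        elif not MINOR_TAG.match(tag):
            problems.append("stateful image should pin at least major.minor")
    elif tag and tag not in ROLLING_TAGS:
        problems.append("stateless image is expected to use a rolling tag")
    return problems
```

For example, `check_pin("mariadb:11.4", True)` returns an empty list, while `check_pin("mariadb:11", True)` reports the bare major tag.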
### 3. Weekly OS + stateless run — Friday night, fail-stop, staggered
A scheduled run on **Friday night** (giving the weekend to fix anything it breaks),
per host, in strict order with a verification gate between every phase:
1. **OS update:** `apt` upgrade.
2. **Reboot:** only if required (kernel/libc); detect via `/var/run/reboot-required`.
3. **Verify:** health-check harness. **Fail-stop:** if a host fails,
halt _that host's_ run, leave it as-is, alert loudly, and do **not** proceed
to container updates on a wobbly host.
4. **Stateless container update:** `compose pull` + recreate-if-changed.
5. **Verify** again; alert on failure.
**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
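The phase order and the fail-stop gate can be sketched as plain control flow. Everything here is hypothetical scaffolding: the `actions` mapping, `reboot_required`, and `verify` are injected callables standing in for the real Ansible tasks and health-check harness. One assumption goes beyond the text above: a failed infrastructure host holds back all remaining hosts, rather than only halting its own run.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HostResult:
    host: str
    completed: List[str] = field(default_factory=list)
    failed_at: str = ""

    @property
    def ok(self):
        return not self.failed_at


def run_host(host, actions, reboot_required, verify):
    """Run the phases on one host, stopping at the first failed verification gate."""
    result = HostResult(host)
    actions["os_update"](host)
    result.completed.append("os_update")
    if reboot_required(host):  # e.g. a check for /var/run/reboot-required
        actions["reboot"](host)
        result.completed.append("reboot")
    if not verify(host):
        result.failed_at = "verify_after_os"
        return result  # fail-stop: no container updates on a wobbly host
    actions["stateless_update"](host)
    result.completed.append("stateless_update")
    if not verify(host):
        result.failed_at = "verify_after_containers"
    return result


def run_fleet(infra_hosts, other_hosts, run_one):
    """Update infrastructure hosts first, one at a time, then the rest."""
    results = []
    for host in infra_hosts:
        result = run_one(host)
        results.append(result)
        if not result.ok:
            return results  # assumption: broken DNS/proxy holds back everyone else
    for host in other_hosts:
        results.append(run_one(host))  # a failure here halts only that host
    return results
```

Hosts are processed strictly one after another, so the fleet is never rebooted simultaneously. Alerting is left out; a caller would inspect `failed_at` on each result.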
### 4. Snapshot-before is the rollback mechanism
Because these are primarily Proxmox VMs, take a **VM snapshot before the Friday window** and
**auto-expire it after ~1 week** if health checks stayed green.
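The expiry rule reduces to a small predicate. This is a sketch under assumptions: the TTL of exactly seven days, the function name, and the representation of health history as a list of booleans are all illustrative, and how the snapshot is actually deleted on Proxmox is out of scope here.

```python
from datetime import datetime, timedelta

SNAPSHOT_TTL = timedelta(days=7)  # assumption: "~1 week" taken as exactly 7 days


def should_expire(taken_at, now, health_history):
    """Expire only if the snapshot is old enough and every check since was green.

    An empty history is treated as "unknown", so the snapshot is kept.
    """
    old_enough = now - taken_at >= SNAPSHOT_TTL
    all_green = bool(health_history) and all(health_history)
    return old_enough and all_green
```

Keeping the snapshot when there is no health data errs on the side of retaining the rollback point.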
### 5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first
Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks**
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` plan and
ADR-010) does:
1. Read changelogs / breaking-change notes for each pinned stateful image; diff the
pinned tag against what's available.
2. Emit a **recommended upgrade plan** as a Forgejo issue/PR — proposed target version,
migration steps, and a backup-first checklist — for a human to approve.
3. On approval, the upgrade runs in a **deliberate maintenance window** (not the Friday
auto-run), and **always dumps the DB / takes a backup first** (ties to the backup
work — TODO 3.8). DB **major**-version bumps are the highest-risk case and get their
own migration plan.
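The "diff the pinned tag against what's available" step could start from something like the following. It is a rough sketch, not the analysis job itself: the helper names are hypothetical, only purely numeric tags (optionally prefixed with `v`) are understood, and fetching the tag list from a registry is assumed to happen elsewhere.

```python
import re


def parse_version(tag):
    """Return numeric components of a tag like '11.4' or 'v1.32.5', else None."""
    match = re.match(r"^v?(\d+(?:\.\d+)*)$", tag)
    return tuple(int(part) for part in match.group(1).split(".")) if match else None


def bump_kind(pinned, candidate):
    """Classify candidate relative to pinned as 'major', 'minor', 'patch', or None."""
    old, new = parse_version(pinned), parse_version(candidate)
    if old is None or new is None or new <= old:
        return None  # not a version tag, or not newer than the pin
    if new[0] != old[0]:
        return "major"
    if new[1:2] != old[1:2]:
        return "minor"
    return "patch"


def newest_candidates(pinned, available):
    """Map each bump kind to the newest available tag of that kind."""
    best = {}
    for tag in available:
        kind = bump_kind(pinned, tag)
        if kind and (kind not in best or parse_version(tag) > parse_version(best[kind])):
            best[kind] = tag
    return best
```

Separating `major` from `minor` candidates lets the resulting plan flag database major-version bumps for their own migration plan.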
### 6. The verification gate is the health-check harness
"Check everything still works" is the load-bearing 80% of this ADR, and it is the
**same capability** as the test methodology (ADR-008) and the sanity checks already on
the TODO (2.2 — API/curl/log/headless checks; 8.2 — "does PhotoPrism have its
pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around
that harness.
**Sequencing is deliberate: the health-check harness is built first, and no update
automation (decisions 3-5) is deployed until it is in order.** This is not a blocker to
work around; it is the order of operations. An update run without a working verification
gate is "update and pray," so it simply does not ship until the gate is real.
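To make the idea of a gate concrete, here is a minimal sketch of one. It is not the harness from ADR-008; `run_gate` and `http_ok` are hypothetical names, and the only check type shown is a plain HTTP probe using the standard library.

```python
import urllib.request


def http_ok(url, timeout=5.0):
    """Build a check that passes when the URL answers with a 2xx status."""
    def check():
        # urlopen raises on 4xx/5xx and on connection errors; run_gate records that.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    return check


def run_gate(checks):
    """Run every named check and return {name: reason} for each failure.

    A check that raises counts as failed, so one broken probe
    cannot crash the whole update run.
    """
    failures = {}
    for name, check in checks.items():
        try:
            if not check():
                failures[name] = "returned false"
        except Exception as exc:
            failures[name] = f"raised {type(exc).__name__}: {exc}"
    return failures
```

An empty result means the gate is green; anything else is the list to put in the alert. Service-specific checks (for instance, whether PhotoPrism still sees its pictures) would be further callables in the same dictionary.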
### 7. Security fast-path overrides the slow cadence
The 8-weekly stateful cadence is for routine drift. It is **too slow for a critical CVE**
in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's
new-image alert stays as the **out-of-band trigger**: an urgent advisory gets a manual,
backup-first upgrade immediately, not in up to 8 weeks. Routine = scheduled; urgent =
alert-driven.
---
## Open questions (for discussion)
1. **Where does the Proxmox snapshot get driven from?** Control node calling the Proxmox
API as a pre-step in the run, vs a Proxmox-side hook. Crosses the Terraform/Ansible
boundary (ADR-006/009) — TF "owns VM existence," but a snapshot isn't existence.
2. **Exact cadences** — Friday weekly and 8-weekly stateful are starting points. Is
weekly OS patching the right rhythm, or should reboots be rarer than `apt` upgrades?
3. **Where does the health-check harness live, and what is the minimum bar that counts
as "in order"** before the weekly run ships (decision 6 fixes the sequencing; this
pins down the threshold)?
4. **Classification home** — a per-role `__stateful` flag (proposed) vs a list in
group_vars.
5. **Staging first?** Should the weekly run hit a staging host before production, or is
snapshot-before + Friday timing enough for a homelab of this size?
6. **Notification + control channel:** boma defines its own ntfy topics (decided
fresh per ADR-013, not reused from V4). Still open: does the run need a "skip this week" /
"pause updates" switch? (Relates to TODO item 9, a tool→user messaging function.)
---
## What was ruled out
| Option | Reason |
| -------------------------------------- | ----------------------------------------------------------------------------- |
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data. |
| Digest-pinning the stateful tier | Unreadable in diffs; snapshot-before + backups give the immutability instead. |
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the _analysis_ is automated. |
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
| 8-weekly as the only stateful path | Too slow for urgent CVEs — hence the DIUN security fast-path. |
---