ADR-013 sets how boma draws on AnsibleBaobabV4 without inheriting it: translate-don't-transplant — V4 is evidence, never authority. It is a legitimate source only of operational gotchas and working config snippets (re-derived on boma's terms); never requirements, domain values, structure, or conventions. Provenance stays transient (commits/conversation), durable docs stay clean. AI consultation guardrails included. Resolves TODO 3.3 and 10.1.

Also bring ADR-011 (update management, Proposed draft) under version control:
- fix its "reuse V4's ntfy topics" line to "boma defines its own" (ADR-013)
- track its 6 open questions in TODO 16, plus a 7th: reconcile its tags-not-digests pinning with the digest-pinning the security work now mandates (R1 / checklist / 15.6) — they currently conflict.

CLAUDE.md gains a V4 guardrail + ADR-013 pointer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

130 lines
6.9 KiB
Markdown
# ADR-011 — Update and upgrade management

**Status: Proposed — draft for discussion (not yet accepted).**

## Context

boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things
drift over time and must be kept current without breaking the homelab: the **host OS**
(kernel, libc, packages → sometimes a reboot) and the **container images**.

---

## Decisions

### 1. Every service is classified stateful or stateless

Each container role declares its class, e.g. `<role>__stateful: true|false` (default
`false`). The split is the load-bearing classification for the whole policy.

- **Stateless** — no durable data of its own; losing the container loses nothing.
  Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
  reverse proxies, FlareSolverr.
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
  store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).

When in doubt, classify **stateful** (the safer, slower path).

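A minimal sketch of how the declared class might be resolved, assuming role variables arrive as a plain mapping. The function name `is_stateful` and the sample roles are hypothetical; only the `<role>__stateful` naming and the `false` default come from the text above.

```python
def is_stateful(role: str, role_vars: dict) -> bool:
    """Return the declared class for a role; undeclared roles fall back to False."""
    value = role_vars.get(f"{role}__stateful", False)
    if not isinstance(value, bool):
        # Refuse strings like "yes" rather than guessing: a misclassified
        # stateful service would be swept into the weekly run.
        raise TypeError(f"{role}__stateful must be true or false, got {value!r}")
    return value


# Hypothetical inventory: one declared stateful role, one left at the default.
role_vars = {"vaultwarden__stateful": True}
print(is_stateful("vaultwarden", role_vars))  # True
print(is_stateful("whoami", role_vars))       # False
```

Because the default is `false`, an undeclared role is treated as stateless in this sketch, so applying the "when in doubt" rule means setting the flag explicitly.
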
### 2. Image pinning follows the split

- **Stateless → rolling tags** (`latest`/`stable`), refreshed by the weekly run and
  watched by DIUN. Always-current, cheap to roll back.
- **Stateful → pinned** to a readable tag, **minor** where the image offers it
  (e.g. `mariadb:11.4`, not bare `:11` and not a digest). Reproducible; upgrades are
  deliberate, never incidental.

Tags, not digests — readable in diffs; immutability is bought instead via
snapshot-before and backups.

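The pinning rule can be expressed as a small lint over image references. This is a sketch under assumptions: `pin_problem`, `MINOR_TAG`, and `ROLLING_TAGS` are hypothetical names, and the regex treats "minor-pinned" as `major.minor` with an optional suffix. It does not handle the "where the image offers it" exception for images that publish no minor tags.

```python
import re
from typing import Optional

# major.minor, optionally followed by a patch or variant suffix
# (e.g. 11.4, 11.4.2, 16.3-alpine, v1.32).
MINOR_TAG = re.compile(r"^v?\d+\.\d+([.\-][\w.\-]+)?$")
ROLLING_TAGS = {"latest", "stable"}


def pin_problem(image: str, stateful: bool) -> Optional[str]:
    """Return a reason the image reference breaks the policy, or None if it is fine."""
    if "@" in image:
        return "digest reference: the policy uses readable tags"
    # Split off the tag; a colon followed by a slash is a registry port, not a tag.
    _name, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        tag = "latest"  # Docker's implicit tag when none is given
    if not stateful:
        return None  # rolling tags are the expected case for stateless services
    if tag in ROLLING_TAGS:
        return f"rolling tag '{tag}' on a stateful service"
    if not MINOR_TAG.match(tag):
        return f"tag '{tag}' is not pinned to at least a minor version"
    return None


print(pin_problem("mariadb:11.4", stateful=True))    # None
print(pin_problem("mariadb:11", stateful=True))      # tag '11' is not pinned ...
print(pin_problem("traefik:latest", stateful=False)) # None
```
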
### 3. Weekly OS + stateless run — Friday night, fail-stop, staggered

A scheduled run on **Friday night** (giving the weekend to fix anything it breaks)
works through each host in strict order, with a verification gate between every phase:

1. **OS update** — `apt` upgrade.
2. **Reboot** — only if required (kernel/libc); detect via `/var/run/reboot-required`.
3. **Verify** — health-check harness. **Fail-stop:** if a host fails,
   halt _that host's_ run, leave it as-is, alert loudly — do **not** proceed
   to container updates on a wobbly host.
4. **Stateless container update** — `docker compose pull` + recreate-if-changed.
5. **Verify** again; alert on failure.

**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.

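The phase ordering and fail-stop behaviour above can be sketched as a plain loop. Everything named here (`run_host`, `run_fleet`, the host names, the lambda phases) is hypothetical and stands in for the real `apt`, reboot, and Compose steps. Whether a failed infrastructure host should also block the hosts behind it is not specified in the text; this sketch halts only the failing host.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Phase = Tuple[str, Callable[[str], bool]]  # (phase name, action returning True on success)


def run_host(host: str, phases: Iterable[Phase]) -> Tuple[bool, List[str]]:
    """Run phases in order for one host; stop at the first failure (fail-stop)."""
    completed: List[str] = []
    for name, action in phases:
        if not action(host):
            # Leave the host as-is; later phases never run.
            return False, completed
        completed.append(name)
    return True, completed


def run_fleet(hosts_in_order: List[str], phases: List[Phase], alert=print) -> Dict[str, bool]:
    """Visit hosts one at a time in the given order; a failure halts only that host."""
    results: Dict[str, bool] = {}
    for host in hosts_in_order:
        ok, done = run_host(host, phases)
        results[host] = ok
        if not ok:
            alert(f"{host}: halted after {done or 'no phases'}")
    return results


# Stand-in phases: media01 is made to fail its first verification gate.
phases = [
    ("os_update", lambda h: True),
    ("reboot_if_required", lambda h: True),
    ("verify", lambda h: h != "media01"),
    ("stateless_update", lambda h: True),
    ("verify_again", lambda h: True),
]
results = run_fleet(["dns01", "proxy01", "media01"], phases)
```

Processing hosts sequentially from an ordered list is one simple way to get both the DNS-then-proxy ordering and the "never reboot the whole fleet simultaneously" rule.
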
### 4. Snapshot-before is the rollback mechanism

Because these are primarily Proxmox VMs, take a **VM snapshot before the Friday window** and
**auto-expire it after ~1 week** if health checks stayed green.

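The expiry rule reduces to a two-condition check. In this sketch the 7-day `RETENTION` value is an assumption standing in for "~1 week", and `should_expire` is a hypothetical helper. A snapshot whose host did not stay green is kept, which is one reading of the sentence above rather than a stated rule.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # assumed value for "~1 week"


def should_expire(taken_at: datetime, now: datetime, stayed_green: bool) -> bool:
    """Expire a snapshot only when it is old enough AND health checks stayed green."""
    return stayed_green and (now - taken_at) >= RETENTION


taken = datetime(2025, 1, 3, 22, 0)  # a Friday-night snapshot
print(should_expire(taken, datetime(2025, 1, 11), stayed_green=True))   # True
print(should_expire(taken, datetime(2025, 1, 11), stayed_green=False))  # False
print(should_expire(taken, datetime(2025, 1, 5), stayed_green=True))    # False
```
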
### 5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first

Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks**
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` plan and
ADR-010) does:

1. Read changelogs / breaking-change notes for each pinned stateful image; diff the
   pinned tag against what's available.
2. Emit a **recommended upgrade plan** as a Forgejo issue/PR — proposed target version,
   migration steps, and a backup-first checklist — for a human to approve.
3. On approval, the upgrade runs in a **deliberate maintenance window** (not the Friday
   auto-run), and **always dumps the DB / takes a backup first** (ties to the backup
   work — TODO 3.8). DB **major**-version bumps are the highest-risk case and get their
   own migration plan.

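The "diff the pinned tag against what's available" part of step 1 can be illustrated with a version comparison that flags major bumps, the case singled out in step 3. This is a sketch only: `parse_version` and `bump_kind` are hypothetical helpers, tags are assumed to start with dotted numbers, and nothing here queries a registry or reads a changelog.

```python
import re
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, ...]:
    """Extract the leading numeric components of a tag: '11.4' -> (11, 4)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", tag)
    if not match:
        raise ValueError(f"no numeric version in tag {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def bump_kind(pinned: str, candidate: str) -> str:
    """Classify the jump from the pinned tag to a candidate tag."""
    old, new = parse_version(pinned), parse_version(candidate)
    if new <= old:
        return "none"   # same version or a downgrade: nothing to propose
    if new[0] != old[0]:
        return "major"  # highest-risk case; gets its own migration plan
    if new[:2] != old[:2]:
        return "minor"
    return "patch"


print(bump_kind("11.4", "11.8"))  # minor
print(bump_kind("11.4", "12.0"))  # major
print(bump_kind("11.4", "11.4"))  # none
```
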
### 6. The verification gate is the health-check harness

"Check everything still works" is the load-bearing 80% of this ADR, and it is the
**same capability** as the test methodology (ADR-008) and the sanity checks already on
the TODO (2.2 — API/curl/log/headless checks; 8.2 — "does PhotoPrism have its
pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around
that harness.

**Sequencing is deliberate: the health-check harness is built first, and no update
automation (decisions 3–5) is deployed until it is in order.** This is not a blocker to
work around — it's the order of operations. An update run without a working verification
gate is "update and pray," so it simply does not ship until the gate is real.

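One possible shape for the gate, sketched under assumptions: checks are zero-argument callables returning a truthy value on success, and `run_checks` / `gate_passes` are hypothetical names. An empty check set is treated as a closed gate, mirroring the rule that no update automation ships before real checks exist.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check; an exception counts as a failure, not a crash."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def gate_passes(results: Dict[str, bool]) -> bool:
    """The gate opens only if there is at least one check and all of them passed."""
    return bool(results) and all(results.values())


def mail_flowing() -> bool:
    # Stand-in for a real probe; raises to simulate an unreachable service.
    raise ConnectionError("smtp unreachable")


results = run_checks({"photoprism_has_pictures": lambda: True, "mail_flowing": mail_flowing})
print(results)               # {'photoprism_has_pictures': True, 'mail_flowing': False}
print(gate_passes(results))  # False
print(gate_passes({}))       # False
```
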
### 7. Security fast-path overrides the slow cadence

The 8-weekly stateful cadence is for routine drift. It is **too slow for a critical CVE**
in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's
new-image alert stays as the **out-of-band trigger**: an urgent advisory gets a manual,
backup-first upgrade immediately, not in up to 8 weeks. Routine = scheduled; urgent =
alert-driven.

---

## Open questions (for discussion)

1. **Where does the Proxmox snapshot get driven from?** Control node calling the Proxmox
   API as a pre-step in the run, vs a Proxmox-side hook. Crosses the Terraform/Ansible
   boundary (ADR-006/009) — TF "owns VM existence," but a snapshot isn't existence.
2. **Exact cadences** — Friday weekly and 8-weekly stateful are starting points. Is
   weekly OS patching the right rhythm, or should reboots be rarer than `apt` upgrades?
3. **Where does the health-check harness live, and what is the minimum bar that counts
   as "in order"** before the weekly run ships (decision 6 fixes the sequencing; this
   pins down the threshold)?
4. **Classification home** — a per-role `__stateful` flag (proposed) vs a list in
   group_vars.
5. **Staging first?** Should the weekly run hit a staging host before production, or is
   snapshot-before + Friday timing enough for a homelab of this size?
6. **Notification + control channel** — boma defines its own ntfy topics (decided
   fresh per ADR-013, not reused from V4). Does the run also need a "skip this week" /
   "pause updates" switch? (Relates to TODO item 9 — a tool→user messaging function.)

---

## What was ruled out

| Option                                 | Reason                                                                        |
| -------------------------------------- | ----------------------------------------------------------------------------- |
| One uniform policy for all services    | Ignores blast radius; stateful data loss ≠ stateless re-pull.                 |
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data.                    |
| Digest-pinning the stateful tier       | Unreadable in diffs; snapshot-before + backups give the immutability instead. |
| Pinning the stateless tier             | No durable data to protect; pins just add churn DIUN already covers.          |
| Auto-updating stateful on a timer      | Must be human-gated and backup-first; only the _analysis_ is automated.       |
| Updating the whole fleet at once       | Simultaneous reboots hide which host/phase actually broke.                    |
| 8-weekly as the only stateful path     | Too slow for urgent CVEs — hence the DIUN security fast-path.                 |

---