boma/docs/decisions/011-update-management.md

# ADR-011 — Update and upgrade management

**Status: Proposed — draft for discussion (not yet accepted).**

## Context

boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things
drift over time and must be kept current without breaking the homelab: the **host OS**
(kernel, libc, packages → sometimes a reboot) and the **container images**.

---

## Decisions

### 1. Every service is classified stateful or stateless

Each container role declares its class, e.g. `<role>__stateful: true|false` (default
`false`). The split is the load-bearing classification for the whole policy.

- **Stateless** — no durable data of its own; losing the container loses nothing.
  Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami,
  FlareSolverr, and reverse proxies such as Traefik.
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
  store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
  When in doubt, classify **stateful** (the safer, slower path).
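
A minimal sketch of how the flag could be resolved, assuming role variables arrive as a flat
mapping. The helper name `is_stateful` and the sample variables are hypothetical; only the
`<role>__stateful` naming and the `false` default come from the decision above.

```python
def is_stateful(role: str, role_vars: dict) -> bool:
    """Return the declared class for a role, defaulting to stateless."""
    value = role_vars.get(f"{role}__stateful", False)
    if not isinstance(value, bool):
        # Refuse truthy strings such as "false"; the flag must be a real boolean.
        raise TypeError(f"{role}__stateful must be true or false, got {value!r}")
    return value


# Hypothetical role variables, for illustration only.
role_vars = {"vaultwarden__stateful": True, "whoami__stateful": False}
```

Rejecting non-boolean values is a design choice of this sketch: a quoted `"false"` would
otherwise be truthy and silently move a service onto the wrong path.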

### 2. Image pinning follows the split

- **Stateless → rolling tags** (`latest`/`stable`), refreshed by the weekly run and
  watched by DIUN. Always-current, cheap to roll back. No digest pin — it would
  defeat the rolling design.
- **Stateful → pinned `tag@digest`** — a readable **minor** tag where the image
  offers it (e.g. `mariadb:11.4`, not bare `:11`) **plus its digest**
  (`mariadb:11.4@sha256:…`). Reproducible and tamper-evident; upgrades are deliberate
  (bump tag and digest together), never incidental.

Readable tag **and** digest, not one or the other: the tag keeps diffs legible, the
digest pins the exact bytes for supply-chain integrity (ADR-002, accepted-risk R1).
Snapshot-before + backups remain the rollback mechanism for a *broken* update; the
digest is what guards against a *swapped* image, which snapshots cannot.
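
The pinning rule lends itself to a lint step. The following is a sketch under stated
assumptions: the function name `pin_problems` and the regular expressions are illustrative,
registry hosts with ports are not handled, and a missing minor-level tag is treated as a
violation even though the rule only asks for one "where the image offers it".

```python
import re

# name[:tag][@sha256:<64 hex chars>]; registry ports are out of scope for this sketch.
IMAGE_RE = re.compile(
    r"^(?P<name>[a-z0-9][a-z0-9._/-]*)"
    r"(?::(?P<tag>[A-Za-z0-9_][A-Za-z0-9._-]*))?"
    r"(?:@(?P<digest>sha256:[0-9a-f]{64}))?$"
)
MINOR_TAG_RE = re.compile(r"^\d+\.\d+")  # e.g. 11.4 or 11.4.2, but not bare 11


def pin_problems(image: str, stateful: bool) -> list[str]:
    """Return policy violations for one image reference (empty list = compliant)."""
    match = IMAGE_RE.match(image)
    if match is None:
        return ["unparseable image reference"]
    tag, digest = match["tag"], match["digest"]
    problems = []
    if stateful:
        if tag is None:
            problems.append("stateful image needs a readable tag")
        elif not MINOR_TAG_RE.match(tag):
            problems.append("stateful tag should be at least minor-level")
        if digest is None:
            problems.append("stateful image needs a sha256 digest")
    elif digest is not None:
        problems.append("stateless image must not be digest-pinned")
    return problems
```

For example, `pin_problems("mariadb:11", stateful=True)` reports both the major-only tag and
the missing digest, while `pin_problems("traefik:latest", stateful=False)` returns an empty
list.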

### 3. Weekly OS + stateless run — Friday night, fail-stop, staggered

A scheduled run on **Friday night** (giving the weekend to fix anything it breaks),
per host, in strict order with a verification gate between every phase:

1. **OS update** — `apt` upgrade.
2. **Reboot** — only if required (kernel/libc); detect via `/var/run/reboot-required`.
3. **Verify** — health-check harness. **Fail-stop:** if a host fails,
   halt _that host's_ run, leave it as-is, alert loudly — do **not** proceed
   to container updates on a wobbly host.
4. **Stateless container update** — `docker compose pull`, then `docker compose up -d`,
   which recreates only the containers whose image or configuration changed.
5. **Verify** again; alert on failure.
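
The phase order and the fail-stop gate can be sketched as plain control flow. This is an
illustration, not the run's implementation: `run_host` and all of its callable parameters
are hypothetical stand-ins for whatever actually performs each phase.

```python
from typing import Callable

Step = Callable[[str], None]   # performs a phase on a host
Check = Callable[[str], bool]  # returns True when the condition holds


def run_host(host: str, os_update: Step, reboot_required: Check, reboot: Step,
             update_stateless: Step, verify: Check,
             alert: Callable[[str], None]) -> bool:
    """Run the weekly phases on one host; return True only if every gate passed."""
    os_update(host)
    if reboot_required(host):
        reboot(host)
    if not verify(host):
        # Fail-stop: leave the host as-is and skip the container update.
        alert(f"{host}: verification failed after OS update; run halted")
        return False
    update_stateless(host)
    if not verify(host):
        alert(f"{host}: verification failed after container update")
        return False
    return True
```

Passing the phases in as callables keeps the ordering logic testable without touching a real
host.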

**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
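
A sketch of the staggered ordering, assuming each host carries a tier label. The tier names,
`ordered_hosts`, and `run_fleet` are hypothetical, and stopping the fleet when an
infrastructure host fails is this sketch's reading of "validate before the rest follow".

```python
TIER_ORDER = {"dns": 0, "reverse_proxy": 1}  # anything else runs afterwards


def ordered_hosts(host_tiers: dict[str, str]) -> list[str]:
    """Sort hosts so DNS comes first, then reverse proxies, then the rest."""
    return sorted(host_tiers, key=lambda h: (TIER_ORDER.get(host_tiers[h], 2), h))


def run_fleet(host_tiers: dict[str, str], run_one) -> dict[str, bool]:
    """Update hosts one at a time; stop early if an infrastructure host fails."""
    results = {}
    for host in ordered_hosts(host_tiers):
        results[host] = run_one(host)
        if not results[host] and host_tiers[host] in TIER_ORDER:
            break  # assumption: later hosts wait until infrastructure is healthy
    return results
```

A failing application host does not stop the loop here, since its failure cannot make other
hosts look broken the way a DNS or proxy failure can.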

### 4. Snapshot-before is the rollback mechanism

Because these are primarily Proxmox VMs, take a **VM snapshot before the Friday window** and
**auto-expire it after ~1 week** if health checks stayed green.
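
The expiry rule reduces to a small predicate. A sketch, assuming a fixed seven-day retention
for "~1 week" and a boolean that records whether health checks stayed green; the names
`RETENTION` and `should_expire` are illustrative.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # "~1 week"; the exact value is an assumption


def should_expire(taken_at: datetime, now: datetime, checks_stayed_green: bool) -> bool:
    """Expire a pre-update snapshot only when it is old enough and nothing went red."""
    return checks_stayed_green and (now - taken_at) >= RETENTION


taken = datetime(2026, 6, 5, 22, 0, tzinfo=timezone.utc)
print(should_expire(taken, taken + timedelta(days=8), checks_stayed_green=True))
```

A snapshot whose host went red is kept past the retention period, so the rollback point
survives until a human has looked at it.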

### 5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first

Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks**
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` plan and
ADR-010) does:

1. Read changelogs / breaking-change notes for each pinned stateful image; diff the
   pinned tag against what's available.
2. Emit a **recommended upgrade plan** as a Forgejo issue/PR — proposed target version,
   migration steps, and a backup-first checklist — for a human to approve.
3. On approval, the upgrade runs in a **deliberate maintenance window** (not the Friday
   auto-run), and **always dumps the DB / takes a backup first** (ties to the backup
   work — TODO 3.8). DB **major**-version bumps are the highest-risk case and get their
   own migration plan.
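
The "diff the pinned tag against what's available" step can be sketched as a version
comparison that groups candidates by risk. This assumes purely numeric dotted tags; the
function names are hypothetical, and fetching the available tags from a registry is left
out.

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Turn '11.4' or '11.4.2' into a tuple of ints; raises ValueError otherwise."""
    return tuple(int(part) for part in tag.split("."))


def classify_upgrades(pinned: str, available: list[str]) -> dict[str, list[str]]:
    """Group tags newer than the pinned one into major, minor, and patch bumps."""
    current = parse_version(pinned)
    groups = {"major": [], "minor": [], "patch": []}
    for tag in available:
        try:
            candidate = parse_version(tag)
        except ValueError:
            continue  # skip rolling or suffixed tags such as 'latest' or '11.4-jammy'
        if candidate <= current:
            continue  # same or older than the pin
        if candidate[0] != current[0]:
            groups["major"].append(tag)
        elif candidate[1:2] != current[1:2]:
            groups["minor"].append(tag)
        else:
            groups["patch"].append(tag)
    return groups
```

Anything landing in the `major` group would be the case that gets its own migration plan.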

### 6. The verification gate is the health-check harness

"Check everything still works" is the load-bearing 80% of this ADR, and it is the
**same capability** as the test methodology (ADR-008) and the sanity checks already on
the TODO (2.2 — API/curl/log/headless checks; 8.2 — "does PhotoPrism have its
pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around
that harness.

**Sequencing is deliberate: the health-check harness is built first, and no update
automation (decisions 3–5) is deployed until it is in order.** This is not a blocker to
work around — it's the order of operations. An update run without a working verification
gate is "update and pray," so it simply does not ship until the gate is real.
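
A minimal sketch of the gate's shape, assuming each health check is a named callable that
returns `True` when healthy. The `run_checks` name and the check registry are hypothetical;
the real harness is still an open question.

```python
from typing import Callable


def run_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every named check; return the names of those that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # a crashing check counts as a failure
        if not healthy:
            failed.append(name)
    return failed


def gate_passes(checks: dict[str, Callable[[], bool]]) -> bool:
    """The gate is green only when no check failed."""
    return not run_checks(checks)
```

Running all checks instead of stopping at the first failure gives the alert a complete list
of what broke.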

### 7. Security fast-path overrides the slow cadence

The 8-weekly stateful cadence is for routine drift. It is **too slow for a critical CVE**
in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's
new-image alert stays as the **out-of-band trigger**: an urgent advisory gets a manual,
backup-first upgrade immediately, not in up to 8 weeks. Routine = scheduled; urgent =
alert-driven.
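
The routine/urgent split can be written as a small routing function. A sketch only: the
function name, its inputs, and the returned labels are illustrative, and limiting the fast
path to internet-facing services follows the wording above and is not a settled rule.

```python
def upgrade_path(stateful: bool, internet_facing: bool, critical_advisory: bool) -> str:
    """Pick the update path for one service (labels are illustrative)."""
    if not stateful:
        return "weekly-run"
    if internet_facing and critical_advisory:
        return "fast-path"      # manual, backup-first, immediately
    return "8-weekly-analysis"  # routine drift, human-gated
```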

---

## Open questions (for discussion)

1. **Where does the Proxmox snapshot get driven from?** Control node calling the Proxmox
   API as a pre-step in the run, vs a Proxmox-side hook. Crosses the Terraform/Ansible
   boundary (ADR-006/009) — TF "owns VM existence," but a snapshot isn't existence.
2. **Exact cadences** — Friday weekly and 8-weekly stateful are starting points. Is
   weekly OS patching the right rhythm, or should reboots be rarer than `apt` upgrades?
3. **Where does the health-check harness live, and what is the minimum bar that counts
   as "in order"** before the weekly run ships (decision 6 fixes the sequencing; this
   pins down the threshold)?
4. **Classification home** — a per-role `__stateful` flag (proposed) vs a list in
   group_vars.
5. **Staging first?** Should the weekly run hit a staging host before production, or is
   snapshot-before + Friday timing enough for a homelab of this size?
6. **Notification + control channel** — boma defines its own ntfy topics (decided
   fresh per ADR-013, not reused from V4). Does the run also need a "skip this week" /
   "pause updates" switch? (Relates to TODO item 9 — a tool→user messaging function.)

---

## What was ruled out

| Option                                 | Reason                                                                        |
| -------------------------------------- | ----------------------------------------------------------------------------- |
| One uniform policy for all services    | Ignores blast radius; stateful data loss ≠ stateless re-pull.                 |
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data.                    |
| Digest-only pinning (no readable tag)  | Unreadable in diffs; `tag@digest` keeps the tag legible and the bytes pinned. |
| Pinning the stateless tier             | No durable data to protect; pins just add churn DIUN already covers.          |
| Auto-updating stateful on a timer      | Must be human-gated and backup-first; only the _analysis_ is automated.       |
| Updating the whole fleet at once       | Simultaneous reboots hide which host/phase actually broke.                    |
| 8-weekly as the only stateful path     | Too slow for urgent CVEs — hence the DIUN security fast-path.                 |

---