Add ADR-013 (V4 heritage policy); track ADR-011

ADR-013 sets how boma draws on AnsibleBaobabV4 without inheriting it:
translate-don't-transplant — V4 is evidence, never authority. It is a legitimate
source only of operational gotchas and working config snippets (re-derived on
boma's terms); never requirements, domain values, structure, or conventions.
Provenance stays transient (commits/conversation); durable docs stay clean. AI
consultation guardrails included. Resolves TODO 3.3 and 10.1.

Also bring ADR-011 (update management, Proposed draft) under version control:
- fix its "reuse V4's ntfy topics" line to "boma defines its own" (ADR-013)
- track its 6 open questions in TODO 16, plus a 7th: reconcile its tags-not-digests
  pinning with the digest-pinning the security work now mandates (R1 / checklist /
  15.6) — they currently conflict.

CLAUDE.md gains a V4 guardrail + ADR-013 pointer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-04 19:07:48 +02:00
parent 3b029352b6
commit 0e4050fa59
4 changed files with 225 additions and 2 deletions

@@ -160,6 +160,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)
- Justify a decision by AnsibleBaobabV4 precedent, or import its structure/requirements/values — consult V4 only per ADR-013 (gotchas/configs, announced, re-derived on boma's terms)
---
@@ -172,6 +173,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Accepted security risks | `docs/security/accepted-risks.md` |
| Per-service security checklist | `docs/security/service-checklist.md` |
| Per-service security record (template) | `docs/security/service-security-template.md` |
| Heritage / V4 policy | `docs/decisions/013-heritage-v4.md` |
| Toolchain choices | `docs/decisions/003-toolchain.md` |
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |

@@ -13,7 +13,9 @@
3. **Building services**
1. Decide how to manage logs.
2. Decide how to manage APIs / API access.
3. Decide how to import or integrate from baobabAnsibleV4.
3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013):
translate-don't-transplant — V4 is a source only of gotchas + working config
snippets, re-derived on boma's terms; never structure/requirements/values.
4. Decide what each node runs — base packages plus which apps/services.
5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
@@ -71,7 +73,7 @@
10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
files); hooks + slash commands + `/review-repo` for enforcement at scale. Any
remaining setup to carry out from this decision?
1. Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.
1. ~~Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.~~ DECIDED — ADR-013.
2. Policy for how we write key documents like ADRs.
3. Further development on how we collaborate on designing the foundation for the project - separate from how we implement new containers etc.
4. How do we make sure agents always use the latest official documentation for the technologies etc. we use?
@@ -114,3 +116,20 @@
6. Supply-chain hygiene: enforce image digest pinning + official/verified images
via the service checklist; revisit active scanning (Trivy/Grype) once a
triage stack exists (accepted-risk R1).
16. **ADR-011 (update management) — resolve open questions + accept.** Committed as
**Proposed**; resolve before marking Accepted:
1. Snapshot driver — control node calling the Proxmox API vs a Proxmox-side hook
(crosses the TF/Ansible boundary, ADR-006/009).
2. Cadences — is weekly OS patching right; should reboots be rarer than `apt`?
3. Health-check harness — where it lives and the minimum bar that counts as
"in order" before the weekly run ships (ties to ADR-008, TODO 2.2 / 8.2).
4. Stateful classification home — per-role `__stateful` flag vs a group_vars list.
5. Staging-first? — hit a staging host before production, or is snapshot-before +
Friday timing enough at this scale?
6. Notification/control channel — boma's own ntfy topics (ADR-013) + a "skip this
week" / "pause" switch (ties to TODO 9).
7. **Reconcile pinning conflict:** ADR-011 decision 2 chose *tags, not digests*
(readability + snapshot/backup immutability), but the security work says
*digest pinning* (accepted-risk R1, service checklist, 15.6 above). Decide one
coherent rule (e.g. readable tag + recorded digest?) and align all of them.

@@ -0,0 +1,130 @@
# ADR-011 — Update and upgrade management
**Status: Proposed — draft for discussion (not yet accepted).**
## Context
boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things
drift over time and must be kept current without breaking the homelab: the **host OS**
(kernel, libc, packages → sometimes a reboot) and the **container images**.
---
## Decisions
### 1. Every service is classified stateful or stateless
Each container role declares its class, e.g. `<role>__stateful: true|false` (default
`false`). The split is the load-bearing classification for the whole policy.
- **Stateless** — no durable data of its own; losing the container loses nothing.
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
reverse proxies, FlareSolverr.
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
When in doubt, classify **stateful** (the safer, slower path).
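A minimal sketch of how the per-role flag could be read, assuming role variables arrive as a plain mapping. The function name `split_roles` and the example role names are illustrative, not part of boma.

```python
from __future__ import annotations


def split_roles(role_vars: dict, roles: list[str]) -> tuple[list[str], list[str]]:
    """Partition roles into (stateful, stateless) using `<role>__stateful` flags."""
    stateful: list[str] = []
    stateless: list[str] = []
    for role in roles:
        # Documented default: a role that does not set the flag counts as stateless.
        if role_vars.get(f"{role}__stateful", False) is True:
            stateful.append(role)
        else:
            stateless.append(role)
    return stateful, stateless


if __name__ == "__main__":
    # Hypothetical variables: only vaultwarden declares itself stateful.
    print(split_roles({"vaultwarden__stateful": True}, ["vaultwarden", "jellyfin"]))
```

Because the default is `false`, the "when in doubt, classify stateful" rule has to be applied by whoever writes the role; this sketch cannot infer it.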
### 2. Image pinning follows the split
- **Stateless → rolling tags** (`latest`/`stable`), refreshed by the weekly run and
watched by DIUN. Always-current, cheap to roll back.
- **Stateful → pinned** to a readable tag, **minor** where the image offers it
(e.g. `mariadb:11.4`, not bare `:11` and not a digest). Reproducible; upgrades are
deliberate, never incidental.
Tags, not digests — readable in diffs; immutability is bought instead via
snapshot-before and backups.
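The pinning rule above could be checked mechanically. The sketch below is one possible reading, not boma's implementation: it assumes image references of the form `name[:tag][@digest]`, treats an untagged stateless image as rolling, and is stricter than the text in that it always demands a `major.minor` tag for stateful images (the ADR only asks for minor "where the image offers it"). `parse_image` and `pin_ok` are hypothetical names.

```python
from __future__ import annotations

import re
from typing import Optional

ROLLING_TAGS = {"latest", "stable"}
MINOR_TAG = re.compile(r"^\d+\.\d+")  # e.g. "11.4" or "11.4.2", but not bare "11"


def parse_image(ref: str) -> tuple[str, Optional[str], Optional[str]]:
    """Split 'name[:tag][@digest]' into (name, tag, digest)."""
    name, _, digest = ref.partition("@")
    tag = None
    # Only a colon after the last slash is a tag separator;
    # an earlier colon belongs to a registry port such as host:5000.
    if ":" in name.rsplit("/", 1)[-1]:
        name, tag = name.rsplit(":", 1)
    return name, tag, digest or None


def pin_ok(ref: str, stateful: bool) -> bool:
    """Return True if the reference follows the pinning rule for its class."""
    _, tag, digest = parse_image(ref)
    if digest is not None:
        return False  # decision 2: tags, not digests
    if stateful:
        return tag is not None and MINOR_TAG.match(tag) is not None
    return tag is None or tag in ROLLING_TAGS


if __name__ == "__main__":
    for ref, stateful in [("mariadb:11.4", True), ("mariadb:11", True), ("jellyfin/jellyfin:latest", False)]:
        print(ref, stateful, pin_ok(ref, stateful))
```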
### 3. Weekly OS + stateless run — Friday night, fail-stop, staggered
A scheduled run on **Friday night** (giving the weekend to fix anything it breaks),
per host, in strict order with a verification gate between every phase:
1. **OS update** — `apt` upgrade.
2. **Reboot** — only if required (kernel/libc); detect via `/var/run/reboot-required`.
3. **Verify** — health-check harness. **Fail-stop:** if a host fails,
halt _that host's_ run, leave it as-is, alert loudly — do **not** proceed
to container updates on a wobbly host.
4. **Stateless container update** — `compose pull` + recreate-if-changed.
5. **Verify** again; alert on failure.
**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
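The fail-stop ordering can be sketched as a small driver. This is an illustration under assumptions: each phase is a callable that takes a host name and returns `True` on success, and a failed infrastructure host stops the rest of the fleet from following (one reading of "before the rest follow"). `run_host` and `run_fleet` are hypothetical names, and the phase callables stand in for the real `apt`, reboot, health-check and compose steps.

```python
from __future__ import annotations

from typing import Callable

Phase = tuple[str, Callable[[str], bool]]  # (phase name, step that returns True on success)


def run_host(host: str, phases: list[Phase]) -> list[str]:
    """Run phases in order and stop at the first one that reports failure."""
    log: list[str] = []
    for name, step in phases:
        ok = step(host)
        log.append(f"{host}:{name}:{'ok' if ok else 'FAILED'}")
        if not ok:
            break  # fail-stop: leave the host as-is and skip later phases
    return log


def run_fleet(infra: list[str], others: list[str], phases: list[Phase]) -> dict[str, list[str]]:
    """Update infrastructure hosts first, one at a time, then the remaining hosts."""
    results: dict[str, list[str]] = {}
    for host in infra:
        results[host] = run_host(host, phases)
        if results[host] and results[host][-1].endswith("FAILED"):
            return results  # assumption: the rest do not follow a broken DNS/proxy host
    for host in others:
        results[host] = run_host(host, phases)
    return results


if __name__ == "__main__":
    demo: list[Phase] = [
        ("os_update", lambda h: True),
        ("verify", lambda h: h != "app1"),  # pretend app1 fails its health check
        ("stateless_update", lambda h: True),
    ]
    for host, log in run_fleet(["dns", "proxy"], ["app1", "app2"], demo).items():
        print(host, log)
```

Hosts are processed strictly one after another here, which is the simplest way to guarantee the fleet never reboots at once; a real run might stagger with more parallelism.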
### 4. Snapshot-before is the rollback mechanism
Because these are primarily Proxmox VMs, take a **VM snapshot before the Friday window** and
**auto-expire it after ~1 week** if health checks stayed green.
### 5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first
Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks**
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` plan and
ADR-010) does:
1. Read changelogs / breaking-change notes for each pinned stateful image; diff the
pinned tag against what's available.
2. Emit a **recommended upgrade plan** as a Forgejo issue/PR — proposed target version,
migration steps, and a backup-first checklist — for a human to approve.
3. On approval, the upgrade runs in a **deliberate maintenance window** (not the Friday
auto-run), and **always dumps the DB / takes a backup first** (ties to the backup
work — TODO 3.8). DB **major**-version bumps are the highest-risk case and get their
own migration plan.
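Step 1's "diff the pinned tag against what's available" could start from a comparison like the one below. It is a sketch under assumptions: tags begin with a dotted numeric version (for example `11.4` or `11.4.2-suffix`), and anything else raises an error rather than being guessed at. `parse_version` and `bump_kind` are illustrative names.

```python
from __future__ import annotations

import re


def parse_version(tag: str) -> tuple[int, ...]:
    """Return the leading dotted-numeric part of a tag, e.g. '11.4' -> (11, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", tag)
    if match is None:
        raise ValueError(f"tag has no leading numeric version: {tag!r}")
    return tuple(int(part) for part in match.group().split("."))


def bump_kind(pinned: str, candidate: str) -> str:
    """Classify the step from pinned to candidate as none, patch, minor or major."""
    old, new = parse_version(pinned), parse_version(candidate)
    if new <= old:
        return "none"
    if new[0] != old[0]:
        return "major"  # highest-risk case: gets its own migration plan
    if new[1:2] != old[1:2]:
        return "minor"
    return "patch"


if __name__ == "__main__":
    for candidate in ["11.4", "11.4.2", "11.8", "12.0"]:
        print("11.4 ->", candidate, bump_kind("11.4", candidate))
```

A classification like this only orders versions; it does not replace reading the changelog, which remains the substance of the analysis job.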
### 6. The verification gate is the health-check harness
"Check everything still works" is the load-bearing 80% of this ADR, and it is the
**same capability** as the test methodology (ADR-008) and the sanity checks already on
the TODO (2.2 — API/curl/log/headless checks; 8.2 — "does PhotoPrism have its
pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around
that harness.
**Sequencing is deliberate: the health-check harness is built first, and no update
automation (decisions 3–5) is deployed until it is in order.** This is not a blocker to
work around — it's the order of operations. An update run without a working verification
gate is "update and pray," so it simply does not ship until the gate is real.
### 7. Security fast-path overrides the slow cadence
The 8-weekly stateful cadence is for routine drift. It is **too slow for a critical CVE**
in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's
new-image alert stays as the **out-of-band trigger**: an urgent advisory gets a manual,
backup-first upgrade immediately, not in up to 8 weeks. Routine = scheduled; urgent =
alert-driven.
---
## Open questions (for discussion)
1. **Where does the Proxmox snapshot get driven from?** Control node calling the Proxmox
API as a pre-step in the run, vs a Proxmox-side hook. Crosses the Terraform/Ansible
boundary (ADR-006/009) — TF "owns VM existence," but a snapshot isn't existence.
2. **Exact cadences** — Friday weekly and 8-weekly stateful are starting points. Is
weekly OS patching the right rhythm, or should reboots be rarer than `apt` upgrades?
3. **Where does the health-check harness live, and what is the minimum bar that counts
as "in order"** before the weekly run ships (decision 6 fixes the sequencing; this
pins down the threshold)?
4. **Classification home** — a per-role `__stateful` flag (proposed) vs a list in
group_vars.
5. **Staging first?** Should the weekly run hit a staging host before production, or is
snapshot-before + Friday timing enough for a homelab of this size?
6. **Notification + control channel** — boma defines its own ntfy topics (decided
fresh per ADR-013, not reused from V4), and does the run need a "skip this week" /
"pause updates" switch? (Relates to TODO item 9 — a tool→user messaging function.)
---
## What was ruled out
| Option | Reason |
| -------------------------------------- | ----------------------------------------------------------------------------- |
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data. |
| Digest-pinning the stateful tier | Unreadable in diffs; snapshot-before + backups give the immutability instead. |
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the _analysis_ is automated. |
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
| 8-weekly as the only stateful path | Too slow for urgent CVEs — hence the DIUN security fast-path. |
---

@@ -0,0 +1,72 @@
# ADR-013 — Heritage: learning from AnsibleBaobabV4 without inheriting it
## Context
boma is the methodology successor to AnsibleBaobabV4 (and V3 before it) — not a new
version of the same project, but a deliberate restart on different principles. V4 (a
~100+ role project) still exists on disk and is a genuine reservoir of hard-won,
real-world knowledge. The standing risk is that referencing it lets V4's old
structure and assumptions creep back in under the guise of "inspiration." This ADR
sets the policy for drawing on V4 without inheriting it. (Resolves the questions
previously parked in TODO 3.3 and 10.1.)
## Principle — translate, don't transplant
V4 is **evidence, never authority.** It can show what was needed or what went wrong;
it can never be the reason boma does something a certain way.
- **Banned:** justifying a choice by V4 precedent — "we do X because V4 did X."
- **Required:** decide on boma's own terms; V4 may simply be where a need or a
pitfall was first noticed.
- **Acceptance test** for anything V4-derived: *can it be justified purely from
boma's principles, with zero reference to V4?* If not, it does not land.
## What V4 is — and is not — a source of
| Legitimate source of | Never a source of |
|---|---|
| **Operational lessons / gotchas** — evidence of a pitfall that *informs* a boma decision (not the decision itself) | **Requirements & scope** — boma decides what it runs, from scratch |
| **Working config snippets** — adapted and re-templated on boma's terms, never copied wholesale | **Domain facts / values** — ntfy topics, IPs, hostnames, etc. are decided fresh; no reuse |
| | **Structure, role design, layout, naming, conventions** |
| | **Methodology, assumptions, and authority / justification** |
Only concrete, verifiable, low-level knowledge crosses over — precisely because it is
safe to re-derive, whereas structure and requirements drag assumptions along.
## Provenance — transient only
When a boma decision was prompted by a V4 lesson, or a config was adapted from V4, the
lineage is recorded only in **transient** places: the commit message, the working
conversation, or a clearly-temporary migration-notes scratch doc if a structured
extraction warrants one. **Durable artifacts (ADRs, role READMEs, `SECURITY.md`)
stand on boma's own terms with no V4 reference.** Honest about lineage in history;
clean in the living repo.
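The "no V4 reference in durable artifacts" rule lends itself to a simple review aid. The sketch below is an assumption about how such a check might look, not an existing boma tool: it scans text for the project names used in this repo and for a bare, capitalised `V4`, and reports the matching lines. Policy documents that must name V4 (this ADR, for one) would need an explicit exemption, which is left out here.

```python
from __future__ import annotations

import re

# Case-sensitive on purpose, so "IPv4" or "uuid v4" are not flagged.
V4_PATTERN = re.compile(r"\b(?:AnsibleBaobabV4|baobabAnsibleV4|V4)\b")


def v4_mentions(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for every line that mentions V4."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if V4_PATTERN.search(line)
    ]


if __name__ == "__main__":
    sample = "Uses nftables default-deny.\nCopied from the V4 role.\nIPv4 only."
    for number, line in v4_mentions(sample):
        print(f"line {number}: {line}")
```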
## AI consultation guardrails
The AI is the main consumer of V4 — it is on disk and readable. When consulting it:
- **May** read V4 when building something V4 also did — but only to mine **gotchas
and working config snippets**, nothing else.
- **Must announce** it is consulting V4 ("checking how V4 handled X").
- **Must re-derive** the result on boma's terms and pass the acceptance test before
it lands.
- **Must flag any V4 ↔ boma conflict** (an approach that assumes something boma
rejects) rather than absorbing it.
- **Must never** import V4's requirements, scope, domain values, structure, or
conventions, and never cite V4 as justification.
V4 is a searchable field-notes archive consulted under a leash — not a template to
copy.
## Consequences
- V4's real value (what broke, what worked at the config level) stays accessible; its
structure and assumptions do not follow.
- Some things V4 already "solved" get re-decided from scratch (e.g. ntfy topics —
ADR-011 defines boma's own rather than reusing V4's). That re-work is the intended
cost of a clean methodological break.
- The policy is enforceable in review and by the AI guardrails above.
See also: ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
(update management — ntfy topics decided fresh per this policy).