Add ADR-013 (V4 heritage policy); track ADR-011

ADR-013 sets how boma draws on AnsibleBaobabV4 without inheriting it:
translate-don't-transplant — V4 is evidence, never authority. It is a
legitimate source only of operational gotchas and working config snippets
(re-derived on boma's terms); never requirements, domain values, structure,
or conventions. Provenance stays transient (commits/conversation), durable
docs stay clean. AI consultation guardrails included. Resolves TODO 3.3 and
10.1.

Also bring ADR-011 (update management, Proposed draft) under version control:
- fix its "reuse V4's ntfy topics" line to "boma defines its own" (ADR-013)
- track its 6 open questions in TODO 16, plus a 7th: reconcile its
  tags-not-digests pinning with the digest-pinning the security work now
  mandates (R1 / checklist / 15.6) — they currently conflict.

CLAUDE.md gains a V4 guardrail + ADR-013 pointer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

parent 3b029352b6
commit 0e4050fa59
4 changed files with 225 additions and 2 deletions

@@ -160,6 +160,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)
- Justify a decision by AnsibleBaobabV4 precedent, or import its structure/requirements/values — consult V4 only per ADR-013 (gotchas/configs, announced, re-derived on boma's terms)

---

@@ -172,6 +173,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Accepted security risks | `docs/security/accepted-risks.md` |
| Per-service security checklist | `docs/security/service-checklist.md` |
| Per-service security record (template) | `docs/security/service-security-template.md` |
| Heritage / V4 policy | `docs/decisions/013-heritage-v4.md` |
| Toolchain choices | `docs/decisions/003-toolchain.md` |
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |

docs/TODO.md (23 changes)

@@ -13,7 +13,9 @@
3. **Building services**
   1. Decide how to manage logs.
   2. Decide how to manage APIs / API access.
   3. Decide how to import or integrate from baobabAnsibleV4.
   3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013):
      translate-don't-transplant — V4 is a source only of gotchas + working config
      snippets, re-derived on boma's terms; never structure/requirements/values.
   4. Decide what each node runs — base packages plus which apps/services.
   5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
   6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime

@@ -71,7 +73,7 @@
10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
    files); hooks + slash commands + `/review-repo` for enforcement at scale. Any
    remaining setup to carry out from this decision?
    1. Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.
    1. ~~Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.~~ DECIDED — ADR-013.
    2. Policy for how we write key documents like ADRs.
    3. Further development on how we collaborate on designing the foundation for the project, separate from how we implement new containers etc.
    4. How do we make sure agents always use the latest official documentation for the technologies etc. we use?

@@ -114,3 +116,20 @@
    6. Supply-chain hygiene: enforce image digest pinning + official/verified images
       via the service checklist; revisit active scanning (Trivy/Grype) once a
       triage stack exists (accepted-risk R1).

16. **ADR-011 (update management) — resolve open questions + accept.** Committed as
    **Proposed**; resolve before marking Accepted:
    1. Snapshot driver — control node calling the Proxmox API vs a Proxmox-side hook
       (crosses the TF/Ansible boundary, ADR-006/009).
    2. Cadences — is weekly OS patching right; should reboots be rarer than `apt`?
    3. Health-check harness — where it lives and the minimum bar that counts as
       "in order" before the weekly run ships (ties to ADR-008, TODO 2.2 / 8.2).
    4. Stateful classification home — per-role `__stateful` flag vs a group_vars list.
    5. Staging-first? — hit a staging host before production, or is snapshot-before +
       Friday timing enough at this scale?
    6. Notification/control channel — boma's own ntfy topics (ADR-013) + a "skip this
       week" / "pause" switch (ties to TODO 9).
    7. **Reconcile pinning conflict:** ADR-011 decision 2 chose *tags, not digests*
       (readability + snapshot/backup immutability), but the security work says
       *digest pinning* (accepted-risk R1, service checklist, 15.6 above). Decide one
       coherent rule (e.g. readable tag + recorded digest?) and align all of them.
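
One possible reading of the "readable tag + recorded digest" option in item 7 is a single image reference that carries both parts, which the Docker reference syntax allows (`name:tag@sha256:<hex>`). The Python sketch below only illustrates that shape; `split_image_ref` and `has_tag_and_digest` are hypothetical helper names and are not part of boma.

```python
import re
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    name: str
    tag: Optional[str]
    digest: Optional[str]


_DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def split_image_ref(ref: str) -> ImageRef:
    """Split 'name[:tag][@sha256:hex]' into its parts (hypothetical helper)."""
    name_tag, _, digest = ref.partition("@")
    if digest and not _DIGEST.match(digest):
        raise ValueError(f"malformed digest in {ref!r}")
    # Only a colon after the last "/" separates a tag; an earlier colon is a registry port.
    last_segment = name_tag.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, _, tag = name_tag.rpartition(":")
    else:
        name, tag = name_tag, ""
    return ImageRef(name, tag or None, digest or None)


def has_tag_and_digest(ref: str) -> bool:
    """True when the reference is both readable (tag) and immutable (digest)."""
    parsed = split_image_ref(ref)
    return parsed.tag is not None and parsed.digest is not None
```

Under this assumption the tag stays readable in diffs while the digest is what actually gets pulled; whether that is the rule boma adopts is exactly what item 7 leaves open.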

docs/decisions/011-update-management.md (Normal file, 130 changes)

@@ -0,0 +1,130 @@
# ADR-011 — Update and upgrade management

**Status: Proposed — draft for discussion (not yet accepted).**

## Context

boma runs Debian 13 VMs, each hosting a set of Docker Compose services. Two things
drift over time and must be kept current without breaking the homelab: the **host OS**
(kernel, libc, packages → sometimes a reboot) and the **container images**.

---

## Decisions

### 1. Every service is classified stateful or stateless

Each container role declares its class, e.g. `<role>__stateful: true|false` (default
`false`). The split is the load-bearing classification for the whole policy.

- **Stateless** — no durable data of its own; losing the container loses nothing.
  Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
  reverse proxies, FlareSolverr.
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
  store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
  When in doubt, classify **stateful** (the safer, slower path).

### 2. Image pinning follows the split

- **Stateless → rolling tags** (`latest`/`stable`), refreshed by the weekly run and
  watched by DIUN. Always-current, cheap to roll back.
- **Stateful → pinned** to a readable tag, **minor** where the image offers it
  (e.g. `mariadb:11.4`, not bare `:11` and not a digest). Reproducible; upgrades are
  deliberate, never incidental.

Tags, not digests — readable in diffs; immutability is bought instead via
snapshot-before and backups.
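
The pinning rule above could be checked mechanically. The following Python sketch is one way to express it, under stated assumptions: `pin_violation` is a hypothetical helper, the rolling-tag set is just the two tags named above, and "minor pin" is read as `MAJOR.MINOR` with an optional patch component.

```python
import re
from typing import Optional

ROLLING_TAGS = {"latest", "stable"}  # the two rolling tags named in decision 2
# Assumption: a "minor" pin is MAJOR.MINOR, optionally with a patch component.
_MINOR_PIN = re.compile(r"^\d+\.\d+(\.\d+)?$")


def pin_violation(image: str, stateful: bool) -> Optional[str]:
    """Return a reason if `image` breaks the decision-2 rule, else None."""
    if "@" in image:
        return "digest pin; decision 2 asks for a readable tag"
    last_segment = image.rsplit("/", 1)[-1]
    tag = last_segment.partition(":")[2] or "latest"  # Docker defaults a missing tag to latest
    if stateful:
        if tag in ROLLING_TAGS:
            return f"stateful image on rolling tag {tag!r}"
        if not _MINOR_PIN.match(tag):
            return f"stateful tag {tag!r} is not pinned to a minor version"
        return None
    if tag not in ROLLING_TAGS:
        return f"stateless image pinned to {tag!r}; expected a rolling tag"
    return None
```

For example, `pin_violation("mariadb:11", stateful=True)` returns a reason string, while `pin_violation("mariadb:11.4", stateful=True)` returns `None`. Images that offer no minor tag would need an explicit exception, which this sketch does not model.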

### 3. Weekly OS + stateless run — Friday night, fail-stop, staggered

A scheduled run on **Friday night** (giving the weekend to fix anything it breaks),
per host, in strict order with a verification gate between every phase:

1. **OS update** — `apt` upgrade.
2. **Reboot** — only if required (kernel/libc); detect via `/var/run/reboot-required`.
3. **Verify** — health-check harness. **Fail-stop:** if a host fails,
   halt _that host's_ run, leave it as-is, alert loudly — do **not** proceed
   to container updates on a wobbly host.
4. **Stateless container update** — `compose pull` + recreate-if-changed.
5. **Verify** again; alert on failure.

**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
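
The phase order and fail-stop behaviour above can be sketched as a small driver. This is an illustration only, not the actual scheduler: `run_host` and `run_fleet` are hypothetical names, each phase is modelled as a callable returning `True` on success, and stopping the remaining hosts when an infrastructure host fails is an assumed reading of "before the rest follow".

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Phase = Tuple[str, Callable[[str], bool]]  # (phase name, action returning True on success)


def run_host(host: str, phases: Iterable[Phase]) -> Tuple[bool, List[str]]:
    """Run phases in order on one host and stop at the first failure (fail-stop)."""
    log: List[str] = []
    for name, action in phases:
        if not action(host):
            log.append(f"FAILED:{name}")
            return False, log  # leave the host as-is; later phases never run
        log.append(name)
    return True, log


def run_fleet(
    infra_hosts: Sequence[str],
    other_hosts: Sequence[str],
    phases: Sequence[Phase],
) -> Dict[str, List[str]]:
    """Update infrastructure hosts first, one at a time, then the remaining hosts."""
    results: Dict[str, List[str]] = {}
    for host in infra_hosts:
        ok, results[host] = run_host(host, phases)
        if not ok:
            return results  # assumption: do not touch other hosts while DNS/proxy is in doubt
    for host in other_hosts:
        _, results[host] = run_host(host, phases)  # a failure halts that host only
    return results
```

Hosts are processed sequentially, so the fleet is never rebooted at once. Alerting is left out; in practice each `FAILED:` entry would trigger a notification.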

### 4. Snapshot-before is the rollback mechanism

Because these are primarily Proxmox VMs, take a **VM snapshot before the Friday window** and
**auto-expire it after ~1 week** if health checks stayed green.
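
The expiry condition has two parts: age and health. A minimal Python sketch, assuming a 7-day retention (the text only says "~1 week") and a hypothetical `may_expire` helper that is told whether checks stayed green since the snapshot:

```python
from datetime import datetime, timedelta

SNAPSHOT_TTL = timedelta(days=7)  # assumption: "~1 week" taken as exactly 7 days


def may_expire(taken_at: datetime, now: datetime, green_since_snapshot: bool) -> bool:
    """A snapshot may be deleted only if it is old enough AND checks stayed green."""
    return green_since_snapshot and (now - taken_at) >= SNAPSHOT_TTL
```

A snapshot on a host that went red is therefore kept regardless of age, since it may still be needed for rollback.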

### 5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first

Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks**
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` plan and
ADR-010) does:

1. Read changelogs / breaking-change notes for each pinned stateful image; diff the
   pinned tag against what's available.
2. Emit a **recommended upgrade plan** as a Forgejo issue/PR — proposed target version,
   migration steps, and a backup-first checklist — for a human to approve.
3. On approval, the upgrade runs in a **deliberate maintenance window** (not the Friday
   auto-run), and **always dumps the DB / takes a backup first** (ties to the backup
   work — TODO 3.8). DB **major**-version bumps are the highest-risk case and get their
   own migration plan.
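
Step 1's "diff the pinned tag against what's available" could look roughly like the Python sketch below. It assumes purely numeric dotted tags and that the list of available tags has already been fetched by some other means; `newer_candidates` is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def parse_version(tag: str) -> Tuple[int, ...]:
    """Parse '11.4' or '11.4.2' into a tuple of ints; raise ValueError otherwise."""
    return tuple(int(part) for part in tag.split("."))


def newer_candidates(pinned: str, available: Iterable[str]) -> List[Tuple[str, bool]]:
    """Return (tag, is_major_bump) for each available tag newer than the pinned one."""
    current = parse_version(pinned)
    found = []
    for tag in available:
        try:
            version = parse_version(tag)
        except ValueError:
            continue  # skip non-numeric tags such as 'latest' or '11.4-alpine'
        if version > current:
            found.append((tag, version[0] > current[0]))
    return sorted(found, key=lambda item: parse_version(item[0]))
```

The `is_major_bump` flag marks the candidates that, per step 3, would need their own migration plan.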

### 6. The verification gate is the health-check harness

"Check everything still works" is the load-bearing 80% of this ADR, and it is the
**same capability** as the test methodology (ADR-008) and the sanity checks already on
the TODO (2.2 — API/curl/log/headless checks; 8.2 — "does PhotoPrism have its
pictures?", "is email flowing?"). The update pipeline is a scheduler wrapped around
that harness.

**Sequencing is deliberate: the health-check harness is built first, and no update
automation (decisions 3–5) is deployed until it is in order.** This is not a blocker to
work around — it's the order of operations. An update run without a working verification
gate is "update and pray," so it simply does not ship until the gate is real.

### 7. Security fast-path overrides the slow cadence

The 8-weekly stateful cadence is for routine drift. It is **too slow for a critical CVE**
in an internet-facing stateful service (Vaultwarden, Nextcloud, Forgejo). DIUN's
new-image alert stays as the **out-of-band trigger**: an urgent advisory gets a manual,
backup-first upgrade immediately, not in up to 8 weeks. Routine = scheduled; urgent =
alert-driven.

---

## Open questions (for discussion)

1. **Where does the Proxmox snapshot get driven from?** Control node calling the Proxmox
   API as a pre-step in the run, vs a Proxmox-side hook. Crosses the Terraform/Ansible
   boundary (ADR-006/009) — TF "owns VM existence," but a snapshot isn't existence.
2. **Exact cadences** — Friday weekly and 8-weekly stateful are starting points. Is
   weekly OS patching the right rhythm, or should reboots be rarer than `apt` upgrades?
3. **Where does the health-check harness live, and what is the minimum bar that counts
   as "in order"** before the weekly run ships (decision 6 fixes the sequencing; this
   pins down the threshold)?
4. **Classification home** — a per-role `__stateful` flag (proposed) vs a list in
   group_vars.
5. **Staging first?** Should the weekly run hit a staging host before production, or is
   snapshot-before + Friday timing enough for a homelab of this size?
6. **Notification + control channel** — boma defines its own ntfy topics (decided
   fresh per ADR-013, not reused from V4), and does the run need a "skip this week" /
   "pause updates" switch? (Relates to TODO item 9 — a tool→user messaging function.)

---

## What was ruled out

| Option                                 | Reason                                                                        |
| -------------------------------------- | ----------------------------------------------------------------------------- |
| One uniform policy for all services    | Ignores blast radius; stateful data loss ≠ stateless re-pull.                 |
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data.                    |
| Digest-pinning the stateful tier       | Unreadable in diffs; snapshot-before + backups give the immutability instead. |
| Pinning the stateless tier             | No durable data to protect; pins just add churn DIUN already covers.          |
| Auto-updating stateful on a timer      | Must be human-gated and backup-first; only the _analysis_ is automated.       |
| Updating the whole fleet at once       | Simultaneous reboots hide which host/phase actually broke.                    |
| 8-weekly as the only stateful path     | Too slow for urgent CVEs — hence the DIUN security fast-path.                 |

---

docs/decisions/013-heritage-v4.md (Normal file, 72 changes)

@@ -0,0 +1,72 @@
# ADR-013 — Heritage: learning from AnsibleBaobabV4 without inheriting it

## Context

boma is the methodology successor to AnsibleBaobabV4 (and V3 before it) — not a new
version of the same project, but a deliberate restart on different principles. V4 (a
~100+ role project) still exists on disk and is a genuine reservoir of hard-won,
real-world knowledge. The standing risk is that referencing it lets V4's old
structure and assumptions creep back in under the guise of "inspiration." This ADR
sets the policy for drawing on V4 without inheriting it. (Resolves the questions
previously parked in TODO 3.3 and 10.1.)

## Principle — translate, don't transplant

V4 is **evidence, never authority.** It can show what was needed or what went wrong;
it can never be the reason boma does something a certain way.

- **Banned:** justifying a choice by V4 precedent — "we do X because V4 did X."
- **Required:** decide on boma's own terms; V4 may simply be where a need or a
  pitfall was first noticed.
- **Acceptance test** for anything V4-derived: *can it be justified purely from
  boma's principles, with zero reference to V4?* If not, it does not land.

## What V4 is — and is not — a source of

| Legitimate source of | Never a source of |
|---|---|
| **Operational lessons / gotchas** — evidence of a pitfall that *informs* a boma decision (not the decision itself) | **Requirements & scope** — boma decides what it runs, from scratch |
| **Working config snippets** — adapted and re-templated on boma's terms, never copied wholesale | **Domain facts / values** — ntfy topics, IPs, hostnames, etc. are decided fresh; no reuse |
| | **Structure, role design, layout, naming, conventions** |
| | **Methodology, assumptions, and authority / justification** |

Only concrete, verifiable, low-level knowledge crosses over — precisely because it is
safe to re-derive, whereas structure and requirements drag assumptions along.

## Provenance — transient only

When a boma decision was prompted by a V4 lesson, or a config adapted from V4, the
lineage is recorded only in **transient** places: the commit message, the working
conversation, or a clearly-temporary migration-notes scratch doc if a structured
extraction warrants one. **Durable artifacts (ADRs, role READMEs, `SECURITY.md`)
stand on boma's own terms with no V4 reference.** Honest about lineage in history;
clean in the living repo.
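
The "no V4 reference in durable artifacts" rule lends itself to a simple lint. The sketch below is one hypothetical way to do it in Python, not something this ADR mandates: `find_v4_mentions` is an invented name, and the allowlist is an assumption needed because some documents (this ADR, for instance) must name V4 to state the policy at all.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Matches both spellings used in the repo plus the bare "V4"; case-sensitive, so "IPv4" is ignored.
V4_PATTERN = re.compile(r"\b(AnsibleBaobabV4|baobabAnsibleV4|V4)\b")


def find_v4_mentions(
    paths: Iterable[Path], allow: Iterable[str] = ()
) -> List[Tuple[str, int, str]]:
    """Return (path, line number, line) for every V4 mention outside allowlisted files."""
    allowed = set(allow)
    hits: List[Tuple[str, int, str]] = []
    for path in paths:
        if path.name in allowed:
            continue  # assumption: policy documents may name V4 explicitly
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if V4_PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

A call such as `find_v4_mentions(Path("docs").rglob("*.md"), allow={"013-heritage-v4.md"})` would list candidate lines for a reviewer; which files belong on the allowlist is a judgment this sketch leaves open.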

## AI consultation guardrails

The AI is the main consumer of V4 — it is on disk and readable. When consulting it:

- **May** read V4 when building something V4 also did — but only to mine **gotchas
  and working config snippets**, nothing else.
- **Must announce** it is consulting V4 ("checking how V4 handled X").
- **Must re-derive** the result on boma's terms and pass the acceptance test before
  it lands.
- **Must flag any V4 ↔ boma conflict** (an approach that assumes something boma
  rejects) rather than absorbing it.
- **Must never** import V4's requirements, scope, domain values, structure, or
  conventions, and never cite V4 as justification.

V4 is a searchable field-notes archive consulted under a leash — not a template to
copy.

## Consequences

- V4's real value (what broke, what worked at the config level) stays accessible; its
  structure and assumptions do not follow.
- Some things V4 already "solved" get re-decided from scratch (e.g. ntfy topics —
  ADR-011 defines boma's own rather than reusing V4's). That re-work is the intended
  cost of a clean methodological break.
- The policy is enforceable in review and by the AI guardrails above.

See also: ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
(update management — ntfy topics decided fresh per this policy).