Compare commits
327 commits
chore/ubon
...
main
| SHA1 | Author | Date | |
|---|---|---|---|
| d1c3eb681a | |||
| 1299eef6ea | |||
| 0030b45bbd | |||
| a483f4e55c | |||
| c09b7fe6a5 | |||
| 74e54b359b | |||
| f83d68d7a0 | |||
| 0286c78f36 | |||
| 3ba22d199a | |||
| f10fe8bb60 | |||
| dfc64da2eb | |||
| 0194865437 | |||
| d6e80990b2 | |||
| d1941c987e | |||
| dc5cc8933f | |||
| 4933186d31 | |||
| 9f0626040b | |||
| 8ca42c389c | |||
| 1042f161b6 | |||
| d9b8676fce | |||
| ab328a2f79 | |||
| 61cbcc6c18 | |||
| 6be758bece | |||
| a178729587 | |||
| ef5e049e9b | |||
| 215060bac1 | |||
| fa2c4c6368 | |||
| a881185c73 | |||
| 180af46879 | |||
| 8d8c86fa39 | |||
| 468f8c3a92 | |||
| 26bb7e442d | |||
| 6ac5afaf67 | |||
| b3e14decb4 | |||
| b10a33f439 | |||
| 66a9a0af08 | |||
| e14e347047 | |||
| 24a1d909c9 | |||
| 77a20b8d40 | |||
| a23ecd708d | |||
| bc8592616b | |||
| d7bd31babb | |||
| cc772ff845 | |||
| 3fe6f68316 | |||
| b1aa0f49d9 | |||
| 172ae37953 | |||
| 051c040343 | |||
| c7194ca147 | |||
| 35446538df | |||
| 83983d739c | |||
| 941141e270 | |||
| f27514860e | |||
| 65bacb25fa | |||
| e5256696d6 | |||
| 147eb874ea | |||
| ed1187d1c3 | |||
| f51ae1a13d | |||
| 4732730515 | |||
| edcc347a95 | |||
| d68734267b | |||
| 3769c9ebb9 | |||
| 10121e72d3 | |||
| 0989f047eb | |||
| 4fb4cf99c3 | |||
| 68abd67ce6 | |||
| 8ea9966d88 | |||
| d1c91930ac | |||
| fdd4df34b1 | |||
| af76763c16 | |||
| a8dc3c787a | |||
| 6f53d00b71 | |||
| b5d5dffeaf | |||
| 64767ac187 | |||
| ac6a01296a | |||
| 65533be4d9 | |||
| 02e1eb7449 | |||
| 69faaf5e43 | |||
| 958e35e3c3 | |||
| 847d9885e2 | |||
| b0511179cb | |||
| cc21344ab1 | |||
| 3b30e70ba5 | |||
| 39d2ad38ca | |||
| dfa363cecd | |||
| 292c204752 | |||
| e5a8e5d3b9 | |||
| 5947ba8756 | |||
| a0762c563e | |||
| c1323a3f29 | |||
| 39904a778a | |||
| 8f1c7d47ec | |||
| b0c0150db2 | |||
| 959f9b30b5 | |||
| 5d14efc864 | |||
| 8d2a064542 | |||
| 4c8fb9e03b | |||
| d202b89480 | |||
| 9b3f8f826f | |||
| 44c4978b5f | |||
| 98eb09d8ba | |||
| 4cfc3cddd5 | |||
| 55776fb03c | |||
| 4142bb15f8 | |||
| 94dd6da14c | |||
| 684718f4a5 | |||
| 3a31b8e6f4 | |||
| 0e8d448f2b | |||
| 070d6f293b | |||
| 1333ec181f | |||
| 3762be4622 | |||
| ab1b0678ab | |||
| 19e675fa5a | |||
| b3468b34e4 | |||
| 6e38693499 | |||
| d407aeabb2 | |||
| 293c1f88d8 | |||
| 13ae674cc9 | |||
| d1e1e38879 | |||
| 8d2f564382 | |||
| fd1e83a378 | |||
| b185ac4765 | |||
| c6f66ee634 | |||
| 72b9262f34 | |||
| 859732b04d | |||
| d14639e80a | |||
| 1a0e30e278 | |||
| e5867422d0 | |||
| f821006e9e | |||
| 9e0c264658 | |||
| 9b5851ba4b | |||
| 175777e36a | |||
| cb8f924d4b | |||
| 718781053f | |||
| 64f1e821d8 | |||
| e3461375f5 | |||
| 1862b7a828 | |||
| b7e919d6b3 | |||
| 9c169561d7 | |||
| 1ee343dfca | |||
| 50b6445bdd | |||
| 456c27d12b | |||
| d10f6de84b | |||
| dd8c6825ba | |||
| 65cf20a993 | |||
| 181a02fd3a | |||
| 9d787a4f53 | |||
| db1e5db138 | |||
| a111a20cc8 | |||
| deec75de0f | |||
| 22021210c4 | |||
| cff368ece2 | |||
| a1c0f4814b | |||
| 917005174a | |||
| e83c777b44 | |||
| 839fc632a1 | |||
| 9d4a49d49d | |||
| 09b0aad342 | |||
| 3588904528 | |||
| fd86ec6848 | |||
| 07af037ff3 | |||
| 127ade59a3 | |||
| bbc287900a | |||
| 29921428c4 | |||
| 993d7885e4 | |||
| 76bd1d63fc | |||
| 078d1ad9d9 | |||
| 3cb6436ad2 | |||
| f170ffd936 | |||
| e247af6e55 | |||
| a0a3e4d356 | |||
| bd84dd0213 | |||
| 9311968363 | |||
| 91ad629c02 | |||
| 70c302d7e5 | |||
| 6f5c7b2bfb | |||
| e96480692d | |||
| b131ee317e | |||
| 602550fdaa | |||
| 32d480efcf | |||
| 79f2315eee | |||
| 43e5a4aa53 | |||
| f7fac5f5e3 | |||
| 7a47dd9dec | |||
| be2679cc66 | |||
| 3cfcb1c2e9 | |||
| 03d33f83dd | |||
| 1da117d65b | |||
| 67f2aba9d8 | |||
| aea4f8c3d6 | |||
| 6203513220 | |||
| 607423d0e7 | |||
| a2bb99928c | |||
| f3f382ae69 | |||
| b9daf2a0ad | |||
| 349d10d65c | |||
| 7b5fd17e55 | |||
| 7b190e4313 | |||
| 7ebbc113ab | |||
| fa3db421dc | |||
| d0a3307822 | |||
| 0df24909e3 | |||
| 40a428975a | |||
| 6d7d27b03b | |||
| b3ca510380 | |||
| 44dbd4628f | |||
| 188882449d | |||
| 9b1502cf7d | |||
| a9aab9d040 | |||
| 3c920ae630 | |||
| ab14d65aa1 | |||
| 89179dd7c9 | |||
| a3ea0f7d80 | |||
| ce3319cbed | |||
| dfbe37916f | |||
| 4116286ed0 | |||
| 91713127cb | |||
| 2dbcac11a0 | |||
| 9be4366ac3 | |||
| ed6d5463aa | |||
| 1e85c11ede | |||
| 5f946ac640 | |||
| 01e47d0890 | |||
| 81dac4f28b | |||
| f3f80443d0 | |||
| f5c97d1f36 | |||
| da116e1d92 | |||
| 2041bd3b70 | |||
| eaffd8d900 | |||
| 032adf1525 | |||
| f151e99d04 | |||
| 13f0d482bd | |||
| 649925b303 | |||
| 384b94e34b | |||
| 0c507bbace | |||
| 46d091e82e | |||
| f8098c2e15 | |||
| 0fe9e45f57 | |||
| cdbd66410a | |||
| fd4bbbc977 | |||
| fcfb056591 | |||
| 402913efb3 | |||
| 90683c7912 | |||
| 6fb104e934 | |||
| b006196cc5 | |||
| 026a29f609 | |||
| bca74458fb | |||
| eeab5ed8de | |||
| 7dae93e4e1 | |||
| 4127f8bc6b | |||
| 390cd3b335 | |||
| 2486e31f7d | |||
| 03329d7d25 | |||
| d7fbaca554 | |||
| 2ad50e4d5b | |||
| a9287427e3 | |||
| e24aab28b2 | |||
| d311f67098 | |||
| 8d1d8a88ea | |||
| f700f4a475 | |||
| 2a65391c0e | |||
| 86bb3559ad | |||
| fac438cc92 | |||
| 5aeeb094eb | |||
| 2e5a1e1e23 | |||
| 24b5e9361e | |||
| 9584cc2c76 | |||
| 0b59107b33 | |||
| a3ea2aceb2 | |||
| b45118dac3 | |||
| 24397fa280 | |||
| 04bfc26422 | |||
| 4ed9e9a8bf | |||
| 9bdb3017bb | |||
| 12baeba750 | |||
| 1021c6d25d | |||
| c6aa45037d | |||
| 687d623a52 | |||
| 6f68f8b8c5 | |||
| 30c6a93c28 | |||
| 2894319f01 | |||
| 96f8f20c05 | |||
| 8eb5ccf97d | |||
| 568729e7bd | |||
| db76be2a63 | |||
| 8e4bf3dd88 | |||
| d8afa94c4b | |||
| f0d189ca09 | |||
| 3dd03d4198 | |||
| 666ad42634 | |||
| f566fd17eb | |||
| 66d11cc352 | |||
| d5c62c99ad | |||
| 91d851fe4d | |||
| 01e4f96983 | |||
| eb415db96e | |||
| 920e47b50d | |||
| 22c0747c0b | |||
| 25f04002df | |||
| 05abb3b6a5 | |||
| 2df1f98153 | |||
| cc3337502f | |||
| be6a064f44 | |||
| 2bd11b5aa9 | |||
| 5322cce5c6 | |||
| cd62c5e098 | |||
| ed9fdcc10a | |||
| 787aa3b8e1 | |||
| 841f666de9 | |||
| 08165ffb68 | |||
| 2ae5cf4535 | |||
| 5a32dd46d3 | |||
| ff796c64ca | |||
| 4b85b14f1f | |||
| 99ace3eb48 | |||
| a53941dffe | |||
| 7a48a60f14 | |||
| a30c1af3f0 | |||
| 9653a34241 | |||
| 55a3666d16 | |||
| a2db8058e7 | |||
| b89ca8835a | |||
| 3fb780c286 | |||
| 66064be7b2 | |||
| 07bc1c83f0 | |||
| 1064716d49 | |||
| 15779be086 | |||
| 5aca796fa0 |
288 changed files with 26579 additions and 379 deletions
@@ -6,6 +6,7 @@ exclude_paths:
 - .venv/
 - .collections/
 - .scaffold/
+- tests/integration/.run/ # transient harness run dir (gitignored, generated)
 - "**/vault.yml" # ansible-vault encrypted — not lintable YAML
 
 # Warn only (don't fail) on these rules during initial setup
49
.claude/commands/check-access.md
Normal file
@@ -0,0 +1,49 @@
+Operational-access verification (ADR-021)
+
+Probe every documented way in to a service or host from `ubongo` and report which paths
+are live. Reads the target's `access__*` data (and host baseline), so the verifier and
+`ACCESS.md` can never disagree. Argument: a service/role name or a host
+(e.g. `/check-access photoprism`, `/check-access docker01`).
+
+## Prerequisites (this is forward-looking — ADR-021 dependencies)
+
+This skill cannot run until these exist; if any is missing, say so and stop — do not
+improvise around it:
+
+- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
+- The target host/service is deployed (staging or production inventory).
+- `roles/<name>/` carries `access__*` data (services) / the host baseline applies.
+- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.
+
+## Process
+
+### Phase 0 — resolve the target
+
+Resolve the argument to a host or a service role + its host. Load the `access__*` data
+(service) or the host-baseline + break-glass record (host). State what you will probe.
+
+### Phase 1 — probe each declared path
+
+| Path | Probe | Green = |
+|---|---|---|
+| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
+| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
+| exec + compose | `docker compose -p <project> ps`; exec `true` in each `access__containers` entry | stack up, exec works |
+| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
+| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
+| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |
+
+Break-glass is **never exercised** — firing a serial console is invasive; confirm the
+fallback exists, do not drive it.
+
+### Phase 2 — report
+
+Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
+in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
+`ssh-from-control` rule"). Verdict line: e.g. "3/4 paths green; admin API red".
+
+## Notes
+
+- Read-only and non-destructive — probes confirm reachability, they do not change state.
+- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
+control node + hosts exist.
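The Phase 2 verdict line in `check-access.md` can be illustrated with a small sketch. The function name `access_verdict` and the `name=status` argument format are assumptions made for this example; nothing in the diff defines them, and the real command is driven by the agent rather than by a script like this.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: fold per-path probe results
# into the one-line verdict that Phase 2 asks for.
# Each argument has the assumed form "<path name>=<green|red>".
access_verdict() {
    local total=0 green=0 red="" entry name status
    for entry in "$@"; do
        name=${entry%=*}       # text before the last "="
        status=${entry##*=}    # text after the last "="
        total=$((total + 1))
        if [ "$status" = "green" ]; then
            green=$((green + 1))
        else
            # Anything not explicitly green is reported as red.
            red="${red:+$red, }$name"
        fi
    done
    if [ -z "$red" ]; then
        printf '%d/%d paths green\n' "$green" "$total"
    else
        printf '%d/%d paths green; %s red\n' "$green" "$total" "$red"
    fi
}

# Prints: 3/4 paths green; admin API red
access_verdict "wt0 mesh SSH=green" "LAN SSH=green" "logs=green" "admin API=red"
```

Break-glass would be passed like any other path; the sketch only formats results and performs no probing itself.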
29
.claude/commands/check-backup.md
Normal file
@@ -0,0 +1,29 @@
+---
+description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
+---
+
+Verify that a service's **declared** backup data (`backup__*`) is actually captured in
+the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
+applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
+
+**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
+command reports `not-yet-available` rather than failing.
+
+## Preconditions
+
+- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
+- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
+`not-yet-available` and stop.
+
+## Checks (when live)
+
+Load the `backup__*` data for the resolved role, then:
+
+| Check | How | Green when |
+|---|---|---|
+| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
+| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
+| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
+| integrity | `restic check --read-data-subset` (sampled) | no errors |
+
+Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
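The "snapshot freshness" check in `check-backup.md` comes down to an age comparison. The sketch below shows only that arithmetic, assuming the snapshot time has already been extracted as Unix epoch seconds; the helper name `snapshot_is_fresh` and the exact 86400-second default are assumptions (the table says "≤ ~24 h", which is deliberately approximate).

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: decide whether a snapshot
# timestamp is recent enough.
# Arguments: snapshot time and current time as Unix epoch seconds, plus an
# optional maximum age in seconds (assumed default 86400, i.e. 24 h).
snapshot_is_fresh() {
    local snap_epoch=$1 now_epoch=$2 max_age=${3:-86400}
    local age=$((now_epoch - snap_epoch))
    # A negative age means the timestamp lies in the future; treat that as not fresh.
    [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]
}

now=$(date +%s)
if snapshot_is_fresh "$((now - 3600))" "$now"; then
    echo "snapshot freshness: pass"
else
    echo "snapshot freshness: fail"
fi
```

Keeping the threshold as a parameter leaves room for the tolerance the "~" in the table implies.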
63
.claude/commands/kaizen.md
Normal file
@@ -0,0 +1,63 @@
+# Kaizen — curate the friction log into improvements
+
+Consume the **Open signals** in `docs/FRICTION.md`: decide a verdict for each, migrate
+durable knowledge into the right docs, and archive consumed signals into the decisions
+ledger. **Curate-only** — do not hunt for new signals; capture stays manual. This is an
+interactive, judgment-dense pass: propose, the operator decides, you apply on approval.
+
+Design: `docs/superpowers/specs/2026-06-14-kaizen-command-design.md`.
+
+## Phase 0 — scan
+Run `python3 scripts/friction-scan.py > /tmp/kaizen.json`. It returns each Open signal as
+`{tag, first_seen, age_days, recurrence_count, referenced_paths, still_exists, text}`.
+Treat `still_exists: false` as a hint the signal may already be resolved.
+
+## Phase 1 — triage
+Order signals by `recurrence_count` desc, then `age_days` desc, then tag. **Group signals
+that share a root cause** and curate them together. Present the agenda before editing
+anything: total open, how many recurring (≥3), how many look already-resolved.
+
+## Phase 2 — per-signal curation (interactive)
+For each signal/group, present: a one-line restatement, the evidence (age, recurrence,
+still-real), and a proposed **verdict**. Verdicts:
+
+- **SYSTEMATIZE** — migrate the durable lesson into its right home (a runbook, an ADR,
+`CLAUDE.md`, a new `scripts/repo-scan.py` check, or a hook).
+- **CHANGE** — adjust an existing tool/convention/config rather than document it.
+- **PARK** — *out-of-phase but not obsolete*. Remove from the active tree, but write a
+ledger row recording **where it now lives (git SHA/branch/doc) and a resurrection
+trigger**. The default for "not touched lately but not wrong."
+- **REMOVE** — *obsolete*: superseded, wrong, never worked, duplicated. Ledger row states
+why.
+- **ALREADY-BUILT** — the systematization already exists / the fix landed; archive.
+- **ACCEPTED** — conscious no-op (revisit-if-recurs); archive.
+- **KEEP-OPEN** — still accruing, not ripe; leave it in *Open signals* (no ledger row).
+
+Rules:
+- **Knowledge is never removed** — SYSTEMATIZE/migrate it; only *active surface* (scripts,
+checks, conventions, plugins) is parked/removed.
+- Every reductive verdict must classify *why unused*: **obsolete → REMOVE**,
+**out-of-phase → PARK**.
+- The operator approves / modifies / rejects each verdict. On approval: do the mechanical
+edit (migrate text into the target doc; **move the signal from *Open signals* into the
+ledger table**; delete the parked/removed file) and show the diff.
+- PARK and REMOVE both delete from the active tree — the difference is the ledger row.
+Git history + the ledger row are the park mechanism; never create a `parked/` directory.
+
+## Phase 3 — close-out
+- Add a new dated block under `## Kaizen reviews — decisions ledger` (newest first), same
+shape as the existing block: a table with columns **Signal (first seen) | Verdict |
+Resolution / where it lives now**.
+- **Bias-to-remove discipline check:** if every verdict this pass was SYSTEMATIZE/CHANGE
+(only accreting), say so explicitly.
+- **Self-eval (light):** is `/kaizen` being run often enough (oldest consumed age)? Should
+the nudge thresholds in `scripts/friction-scan.py` change? Note it.
+- Run `make lint` if any code/docs changed; revert anything that breaks it.
+- Commit per `CLAUDE.md` git conventions (one logical unit — straight to `main` if
+small/safe, a branch if sweeping; show the diff first for a branch).
+- Print a one-line summary: `consumed X · parked Y · removed Z · kept-open W · migrated → <docs>`.
+
+## Headless / cron (future)
+Deferred until the notify + cron stack exists (`docs/TODO.md` 11.3). When run
+non-interactively, **report only**: print the proposed verdicts and the nudge, do not edit
+or commit.
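The Phase 1 triage order in `kaizen.md` (`recurrence_count` descending, then `age_days` descending, then tag) maps onto a three-key sort. The sketch below assumes the signals have been flattened to one tab-separated line each; that intermediate format, the function name `triage_order`, and the sample tags are all hypothetical, since the diff only states that `friction-scan.py` emits JSON with those fields.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not part of the repository.
# Assumed input on stdin, one signal per line:
#   tag<TAB>age_days<TAB>recurrence_count
triage_order() {
    # Key 3 (recurrence) numeric descending, key 2 (age) numeric descending,
    # key 1 (tag) ascending as the final tie-break.
    LC_ALL=C sort -t "$(printf '\t')" -k3,3nr -k2,2nr -k1,1
}

# Sample signals with made-up tags.
printf 'vault-guard\t10\t1\nmenu-reask\t5\t3\nlint-noise\t20\t1\n' | triage_order
```

With the sample input, `menu-reask` sorts first because it recurs most often, and `lint-noise` precedes `vault-guard` because it is older at equal recurrence.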
@@ -25,7 +25,17 @@ report the rest, and write a tracked report to `docs/reviews/`.
 ### Phase 0 — deterministic pre-scan
 Run `python3 scripts/repo-scan.py > /tmp/repo-scan.json`. It returns the **inventory**
 (roles, ADRs, runbooks, playbooks, scripts — your shard list) and **exact findings**
-(markers, broken refs, unencrypted vaults). Fold these into the report verbatim.
+(markers, broken refs, unencrypted vaults, ADR-structure violations). Fold these into
+the report verbatim.
+
+It also emits two deferral checks (see Phase 2): `open-deferred-item` (every still-open
+ADR "Deferred/Open" entry — a checklist to confirm) and `stale-deferred` (an entry
+another file describes as resolved but which isn't marked resolved in place —
+high-confidence, usually auto-fixable by marking the source ADR's entry RESOLVED).
+
+Also run `python3 scripts/friction-scan.py --nudge` and include its one-line output in the
+report's summary — it flags when the kaizen loop (`/kaizen`) is overdue (recurring signals,
+backlog size, or age). This is a reminder only; do not act on `FRICTION.md` from here.
 
 ### Phase 1 — fan-out judgement review
 Scale to repo size:
@@ -42,6 +52,13 @@ location (file:line), description, suggested_fix, auto_fixable (bool)}`.
 - Merge and dedupe all findings (deterministic + reviewer).
 - Run **one cross-cutting reviewer** over the full ADR set + `STATUS.md` + `CLAUDE.md`
 to catch contradictions that span files (per-shard agents can't see these).
+- **Resolve the deferral checklist.** For every `open-deferred-item` from Phase 0,
+judge whether it is *genuinely* still open: search later ADRs / `STATUS.md` for a
+decision on that subject (a deferred item often resolves silently when a later ADR
+lands). If it has been decided, it is a stale-deferred finding — the fix is to mark
+that entry RESOLVED in its **source ADR's** Deferred list (the spot the resolving
+ADR's own change won't have touched). Treat every `stale-deferred` finding as
+high-confidence. This is the recurring miss logged in `docs/FRICTION.md`.
 - Diff against the previous run's `docs/reviews/<prev>-findings.json` and tag each
 finding **new / recurring / resolved**.
 - Prioritise by severity; split into auto-fixable vs report-only.
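The "tag each finding new / recurring / resolved" step in the hunk above is a set comparison between two runs. The sketch below assumes each run has been reduced to a plain text file with one stable finding key per line; that key format and the function name `tag_findings` are hypothetical, as the diff only says the runs are stored as `docs/reviews/<prev>-findings.json`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not part of the repository.
# Usage: tag_findings <previous-keys-file> <current-keys-file>
# Each file is assumed to hold one stable finding key per line.
tag_findings() {
    local prev_sorted cur_sorted
    prev_sorted=$(mktemp) || return 1
    cur_sorted=$(mktemp) || return 1
    # comm requires sorted input; -u also drops duplicate keys.
    LC_ALL=C sort -u "$1" > "$prev_sorted"
    LC_ALL=C sort -u "$2" > "$cur_sorted"
    # Only in the current run: new.
    LC_ALL=C comm -13 "$prev_sorted" "$cur_sorted" | sed 's/^/new /'
    # In both runs: recurring.
    LC_ALL=C comm -12 "$prev_sorted" "$cur_sorted" | sed 's/^/recurring /'
    # Only in the previous run: resolved.
    LC_ALL=C comm -23 "$prev_sorted" "$cur_sorted" | sed 's/^/resolved /'
    rm -f "$prev_sorted" "$cur_sorted"
}
```

A call such as `tag_findings prev-keys.txt cur-keys.txt` prints one `new`, `recurring`, or `resolved` line per key; how a key is derived from a finding (for example from its location and description) is left open here.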
65
.claude/commands/verify-service.md
Normal file
@@ -0,0 +1,65 @@
+Exploratory service-UI verification (ADR-008 Level 4 / ADR-017)
+
+Drive a browser against a **staging** deploy of a service, exercise its
+`roles/<service>/VERIFY.md` acceptance journeys plus free exploration, and write a
+tracked report. Argument: the service/role name (e.g. `/verify-service photoprism`).
+
+## Prerequisites (this is forward-looking — ADR-017 dependencies)
+
+This skill cannot run until all of these exist; if any is missing, say so and stop —
+do not improvise around it:
+
+- `ubongo` with the `playwright` Claude Code plugin (browser automation tools).
+- A **staging** deploy of the target service (ADR-008 Level 2).
+- Authentik (staging) for test-user provisioning + SSO.
+- `roles/<name>/VERIFY.md` present.
+
+## Process
+
+### Phase 0 — safety gate (staging only)
+
+Confirm the target resolves to the **staging** environment/inventory, never production.
+If you cannot prove it is staging, **stop** — exploratory clicking is destructive
+(ADR-002). State why you stopped.
+
+### Phase 1 — read intent
+
+Read `roles/<name>/VERIFY.md`: the Critical user journeys, What good looks like, Not
+browser-verifiable, and Test data sections.
+
+### Phase 2 — test user
+
+Provision (reuse-or-create) a test user in the staging Authentik `test` group, with
+ephemeral credentials held only for this run. Never use a real/production account.
+
+### Phase 3 — drive the browser
+
+Via the `playwright` plugin, on `ubongo`: open the service's staging URL (resolved via
+boma DNS), authenticate through the real Traefik + Authentik SSO flow, then execute each
+`VERIFY.md` journey — judging pass/fail and screenshotting key states — and free-explore
+for anything obviously broken. Save screenshots to the git-ignored `.verify-runs/`
+working dir; avoid capturing credential screens.
+
+### Phase 4 — write the report
+
+Save to `docs/testing/reviews/YYYY-MM-DD-<name>.md` and overwrite
+`docs/testing/reviews/latest.md`. Structure:
+
+- **One-line verdict** — e.g. "5/5 journeys passed; one manual check pending".
+- **Run metadata** — date, service, staging env, test user, reviewed commit SHA.
+- **Per-journey result** — pass/fail against `VERIFY.md`, with the evidence (linked
+screenshot path) and any observation.
+- **Free-exploration findings** — anything noticed beyond the listed journeys.
+- **Manual-test checklist** — the "Not browser-verifiable" items plus anything Claude
+couldn't do: numbered steps, expected result, and why it was handed off.
+
+### Phase 5 — clean up + commit
+
+Offer to clean up the `test`-group user (or note that the staging rebuild will).
+Commit the report markdown per CLAUDE.md git conventions. **Do not** commit
+`.verify-runs/` (git-ignored).
+
+## Notes
+
+- Reports (markdown) are committed; screenshots stay local on `ubongo` in `.verify-runs/`.
+- Exploratory and interactive — this is not a deterministic CI gate.
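The report naming rule from Phase 4 of `verify-service.md` (`docs/testing/reviews/YYYY-MM-DD-<name>.md`) can be shown as a small sketch. The function name `report_path` and the choice to default to today's date are assumptions for illustration only.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: build the dated report path
# described in Phase 4.
# Usage: report_path <service-name> [YYYY-MM-DD]
report_path() {
    local name=$1
    # Assumed default: today's date in ISO form when none is given.
    local run_date=${2:-$(date +%F)}
    printf 'docs/testing/reviews/%s-%s.md\n' "$run_date" "$name"
}

# Prints: docs/testing/reviews/2026-06-17-photoprism.md
report_path photoprism 2026-06-17
```

Per the same phase, the finished report would also be copied over `docs/testing/reviews/latest.md`; the sketch leaves that step out.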
70
.claude/hooks/guard-execution-mode-menu.sh
Executable file
@@ -0,0 +1,70 @@
+#!/usr/bin/env bash
+#
+# Stop guard for two external-skill gates that conflict with boma conventions, where
+# prose reminders repeatedly failed to hold (docs/FRICTION.md):
+#
+# 1. The execution-mode menu — writing-plans / subagent-driven-development script a
+# "Subagent-Driven vs Inline Execution — which approach?" menu at the plan→execution
+# handoff. boma's standing preference is to NEVER present it and proceed
+# subagent-driven. (Recorded by the 2026-06-10 kaizen review; the 2026-06-17 review
+# widened the matcher to also catch free-form *prose* re-asks of the same choice —
+# e.g. "which execution approach?" — which the literal-menu matcher missed. The
+# sibling push-vs-not re-ask is deliberately NOT hooked: a genuine "should I push?"
+# is sometimes legitimate, so it stays a soft default via the
+# dont-reask-settled-defaults memory rather than a hard block.)
+# 2. The brainstorming spec-review gate — the brainstorming skill scripts "Spec written
+# and committed … please review it before … the implementation plan." The standing
+# agreement is to move directly from the committed spec to writing-plans. (Recorded
+# by the 2026-06-14 kaizen review; 06-10/06-14 recurrences.)
+#
+# Fails OPEN: any parse/read problem → allow the stop. Respects stop_hook_active so a
+# block can never loop. Match signatures are deliberately tight so they fire on the
+# actual gate text, not on meta-discussion of it.
+#
+set -uo pipefail
+
+input=$(cat 2>/dev/null) || exit 0
+
+# Loop guard: if we already blocked once for this stop, let it through.
+active=$(printf '%s' "$input" | jq -r '.stop_hook_active // false' 2>/dev/null) || exit 0
+[ "$active" = "true" ] && exit 0
+
+transcript=$(printf '%s' "$input" | jq -r '.transcript_path // empty' 2>/dev/null) || exit 0
+[ -z "$transcript" ] || [ ! -r "$transcript" ] && exit 0
+
+# Last assistant message's text blocks, joined.
+text=$(jq -rs '
+  ([ .[] | select(.type=="assistant") ] | last) as $a
+  | ($a.message.content // [])
+  | if type=="array" then [ .[] | select(.type=="text") | .text ] | join("\n")
+    elif type=="string" then .
+    else "" end
+' "$transcript" 2>/dev/null) || exit 0
+
+low="${text,,}"
+
+if [[ "$low" == *"inline execution"* \
+      && ( "$low" == *"which approach"* || "$low" == *"two execution options"* ) ]] \
+   || [[ "$low" == *"subagent-driven or inline"* || "$low" == *"inline or subagent"* ]] \
+   || [[ "$low" == *"subagent-driven vs inline"* || "$low" == *"subagent vs inline"* \
+         || "$low" == *"inline vs subagent"* ]] \
+   || [[ "$low" == *"execution approach"* && "$low" == *"?"* ]]; then
+    cat <<'JSON'
+{"decision":"block","reason":"Execution-mode menu detected in your final message. boma standing preference (docs/FRICTION.md + always-subagent-driven-execution memory): never present the subagent-driven-vs-inline menu. Drop the menu and proceed with subagent-driven execution directly (superpowers:subagent-driven-development)."}
+JSON
+    exit 0
+fi
+
+# Brainstorming spec-review gate: asking the user to review the committed spec before
+# the implementation plan. Tight signature: "implementation plan" present, plus either the
+# skill's literal "spec written and committed" line, or the review+spec+before combination.
+if [[ "$low" == *"implementation plan"* \
+      && ( "$low" == *"spec written and committed"* \
+           || ( "$low" == *"review"* && "$low" == *"the spec"* && "$low" == *"before"* ) ) ]]; then
+    cat <<'JSON'
+{"decision":"block","reason":"Brainstorming spec-review gate detected in your final message. boma standing agreement (docs/FRICTION.md): once the spec is written and committed, move directly to the implementation plan (superpowers:writing-plans) — do not stop to ask the user to review the spec first. Drop the gate and proceed."}
+JSON
+    exit 0
+fi
+
+exit 0
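To see which messages the execution-mode matcher in `guard-execution-mode-menu.sh` would catch, the four signatures can be restated as a standalone predicate. This is a sketch, not the hook itself: the function name `is_execution_menu` is hypothetical, and the signatures are re-expressed with `tr` and `case` patterns instead of the script's `${text,,}` and `[[ ... ]]` so the predicate can be exercised without a transcript or `jq`.

```bash
#!/usr/bin/env bash
# Hypothetical restatement of the hook's execution-mode signatures.
# Returns 0 when the given text would match, 1 otherwise.
is_execution_menu() {
    local low
    low=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    # Signature 1: "inline execution" plus one of the menu phrasings.
    case "$low" in
        *"inline execution"*)
            case "$low" in
                *"which approach"*|*"two execution options"*) return 0 ;;
            esac ;;
    esac
    # Signatures 2 and 3: the "or" and "vs" forms of the choice.
    case "$low" in
        *"subagent-driven or inline"*|*"inline or subagent"*) return 0 ;;
        *"subagent-driven vs inline"*|*"subagent vs inline"*|*"inline vs subagent"*) return 0 ;;
    esac
    # Signature 4: "execution approach" together with a question mark, in either order.
    case "$low" in
        *"execution approach"*"?"*|*"?"*"execution approach"*) return 0 ;;
    esac
    return 1
}

for sample in \
    "Two execution options: Subagent-Driven or Inline Execution. Which approach?" \
    "Which execution approach should I take?" \
    "The Stop hook is wired in settings.json."; do
    if is_execution_menu "$sample"; then
        echo "block: $sample"
    else
        echo "allow: $sample"
    fi
done
```

The first two samples are reported as `block` and the third as `allow`, which reflects the script's stated aim of firing on the gate text rather than on discussion of it.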
@@ -1,12 +1,16 @@
 #!/usr/bin/env bash
 #
-# PreToolUse guard (Bash): block `git commit` when the rbw vault agent is locked.
-# The pre-commit ansible-lint hook decrypts vault.yml via rbw, so a commit while
-# locked fails deep with a confusing error. This catches it early with a clear fix.
+# PreToolUse guard (Bash): block `git commit` ONLY when the rbw vault agent is locked
+# AND the commit would actually need the vault. The pre-commit ansible-lint hook decrypts
+# vault.yml via rbw — but it is scoped (`files: ^(roles|playbooks|inventories)/.*\.ya?ml$`,
+# always_run:false), so a docs-/config-only commit never triggers it and needs no vault.
+# (2026-06-17 kaizen, docs/FRICTION.md: the old guard blocked *every* locked commit, so a
+# docs-only commit got snagged needing a vault password it never uses.)
 #
-# Fails OPEN: only blocks on a definitive "rbw present AND not unlocked" signal.
-# If rbw is missing, the command isn't a plain `git commit`, or `--no-verify` is
-# used, the action is allowed.
+# Fails OPEN: blocks only on a definitive "Ansible content staged AND rbw locked" signal.
+# rbw missing, not a plain `git commit`, `--no-verify`, or no Ansible content staged → allow.
+# When unsure it errs toward blocking (asking for an unlock is cheap; a deep pre-commit
+# failure is not).
 #
 set -uo pipefail
 
@@ -22,14 +26,25 @@ case "$cmd" in
 esac
 
 command -v rbw >/dev/null 2>&1 || exit 0 # rbw not installed — allow
+rbw unlocked >/dev/null 2>&1 && exit 0 # unlocked — allow
 
-if rbw unlocked >/dev/null 2>&1; then
-    exit 0 # unlocked — allow
-fi
+# rbw is LOCKED. Only block if this commit would run the vault-decrypting ansible-lint
+# hook — i.e. staged content matches its `files:` scope. Mirror that regex exactly.
+ANSIBLE_RE='^(roles|playbooks|inventories)/.*\.ya?ml$'
 
-# rbw present but not unlocked (locked or agent not running) — the commit would
-# fail in the pre-commit hook, so block early with guidance.
+cd "${CLAUDE_PROJECT_DIR:-.}" 2>/dev/null || exit 0
+files=$(git diff --cached --name-only 2>/dev/null) || exit 0
+# `git commit -a/--all` also sweeps in modified tracked files that aren't staged yet.
+# (Substring match — errs toward including them, which only ever over-blocks. Safe.)
+case " $cmd " in
+    *" -a"*|*"--all"*) files="$files"$'\n'"$(git diff --name-only 2>/dev/null)" ;;
+esac
+
+# No Ansible content in the fileset → ansible-lint hook won't run → no vault needed → allow.
+printf '%s\n' "$files" | grep -Eq "$ANSIBLE_RE" || exit 0
+
+# Ansible content staged AND rbw locked — the commit would fail deep in pre-commit. Block.
 cat <<'JSON'
-{"hookSpecificOutput":{"hookEventName":"PreToolUse","permissionDecision":"deny","permissionDecisionReason":"rbw is locked — the pre-commit ansible-lint hook needs the vault password to decrypt vault.yml. Run: rbw unlock"}}
+{"hookSpecificOutput":{"hookEventName":"PreToolUse","permissionDecision":"deny","permissionDecisionReason":"rbw is locked and this commit stages Ansible content — the pre-commit ansible-lint hook needs the vault password to decrypt vault.yml. Run: rbw unlock (docs-/config-only commits are exempt and won't hit this guard.)"}}
 JSON
 exit 0
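The scoping rule the vault guard adds can be checked in isolation: a commit needs the vault only if some file in it matches the ansible-lint `files:` regex quoted in the script. The sketch below reuses that regex verbatim; the function name `stages_ansible_content` and the sample paths are hypothetical, and the file list is fed on stdin instead of coming from `git diff --cached --name-only`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the guard's scope test, not the hook itself.
# The regex is the one quoted in the guard script.
ANSIBLE_RE='^(roles|playbooks|inventories)/.*\.ya?ml$'

# Reads one path per line on stdin; returns 0 if any path is in scope.
stages_ansible_content() {
    grep -Eq "$ANSIBLE_RE"
}

# Docs-only change set: out of scope, so a locked vault would not block it.
if printf '%s\n' "docs/FRICTION.md" ".claude/settings.json" | stages_ansible_content; then
    echo "vault needed"
else
    echo "vault not needed"
fi

# A change set that touches a role's YAML (made-up path): in scope.
if printf '%s\n' "docs/FRICTION.md" "roles/example/tasks/main.yml" | stages_ansible_content; then
    echo "vault needed"
else
    echo "vault not needed"
fi
```

Note that a non-YAML file under `roles/`, such as a README, stays out of scope, because the regex anchors on the `.yml`/`.yaml` suffix as well as the directory prefix.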
@@ -56,6 +56,23 @@
           }
         ]
       }
+    ],
+    "Stop": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "bash \"${CLAUDE_PROJECT_DIR:-.}/.claude/hooks/guard-execution-mode-menu.sh\"",
+            "timeout": 10,
+            "statusMessage": "Checking for execution-mode menu"
+          }
         ]
       }
+    ]
+  },
+  "statusLine": {
+    "type": "command",
+    "command": "bash \"${CLAUDE_PROJECT_DIR:-.}/.claude/statusline.sh\"",
+    "padding": 0
+  }
 }
63
.claude/statusline.sh
Executable file
|
|
@ -0,0 +1,63 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
#
|
||||||
|
# Claude Code statusLine — shows working dir, model, and context-window usage.
|
||||||
|
# Wired via .claude/settings.json (statusLine.command). Receives the statusLine
|
||||||
|
# JSON on stdin; first stdout line is rendered (ANSI colour supported).
|
||||||
|
#
|
||||||
|
# Context usage comes straight from the input JSON — no transcript parsing:
|
||||||
|
# .context_window.used_percentage pre-calculated % of the window in use (input side)
|
||||||
|
# .context_window.context_window_size window size in tokens (1000000 for the 1M models)
|
||||||
|
# verified: Claude Code statusLine schema · code.claude.com/docs/en/statusline · 2026-06-17
|
||||||
|
#
|
||||||
|
# Fails soft: any parse problem prints nothing and exits 0 (never breaks the prompt).
|
||||||
|
set -uo pipefail
|
||||||
|
|
||||||
|
input=$(cat 2>/dev/null) || exit 0
|
||||||
|
command -v jq >/dev/null 2>&1 || exit 0
|
||||||
|
|
||||||
|
# pct<TAB>window<TAB>dir-basename<TAB>model-name (used_percentage preferred,
|
||||||
|
# else derived from current_usage, else 0). @tsv keeps spaces in the dir safe.
|
||||||
|
parsed=$(printf '%s' "$input" | jq -r '
|
||||||
|
(.workspace.current_dir // .cwd // "" | sub(".*/"; "")) as $dir
|
||||||
|
| (.model.display_name // "?") as $model
|
||||||
|
| (.context_window.context_window_size // 200000) as $win
|
||||||
|
| (
|
||||||
|
if (.context_window.used_percentage // null) != null then
|
||||||
|
.context_window.used_percentage
|
||||||
|
elif (.context_window.current_usage // null) != null then
|
||||||
|
((.context_window.current_usage.input_tokens
|
||||||
|
+ (.context_window.current_usage.cache_creation_input_tokens // 0)
|
||||||
|
+ (.context_window.current_usage.cache_read_input_tokens // 0)) / $win * 100)
|
||||||
|
else 0 end | floor
|
||||||
|
) as $pct
|
||||||
|
| [$pct, $win, $dir, $model] | @tsv
|
||||||
|
' 2>/dev/null) || exit 0
|
||||||
|
[ -z "$parsed" ] && exit 0
|
||||||
|
|
||||||
|
IFS=$'\t' read -r pct win dir model <<<"$parsed"
|
||||||
|
|
||||||
|
# Human window label: 1000000 -> 1M, 200000 -> 200k, else Nk.
|
||||||
|
case "$win" in
|
||||||
|
1000000) wlabel="1M" ;;
|
||||||
|
*) wlabel="$((win / 1000))k" ;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
# Colour the bar/percentage by pressure: green <70, yellow 70–89, red >=90.
|
||||||
|
if [ "$pct" -ge 90 ]; then col=$'\033[31m' # red
|
||||||
|
elif [ "$pct" -ge 70 ]; then col=$'\033[33m' # yellow
|
||||||
|
else col=$'\033[32m' # green
|
||||||
|
fi
|
||||||
|
dim=$'\033[2m'; rst=$'\033[0m'
|
||||||
|
|
||||||
|
# 10-cell bar; clamp fill to [0,10] so an over-100 reading can't overflow.
|
||||||
|
filled=$((pct / 10)); [ "$filled" -gt 10 ] && filled=10; [ "$filled" -lt 0 ] && filled=0
|
||||||
|
bar=""
|
||||||
|
for ((i = 0; i < 10; i++)); do
|
||||||
|
if [ "$i" -lt "$filled" ]; then bar+="█"; else bar+="░"; fi
|
||||||
|
done
|
||||||
|
|
||||||
|
printf '%s%s%s · %s · %s%s %d%%%s %sctx/%s%s\n' \
|
||||||
|
"$dim" "$dir" "$rst" \
|
||||||
|
"$model" \
|
||||||
|
"$col" "$bar" "$pct" "$rst" \
|
||||||
|
"$dim" "$wlabel" "$rst"
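The bar and colour thresholds in the script above are easy to check without a live Claude Code session. The sketch below lifts the same arithmetic into two functions; `render_bar` and `pressure_colour` are hypothetical names introduced for illustration, and the colour is printed as a word rather than an ANSI escape so the result is easy to compare.

```bash
#!/usr/bin/env bash
# Sketch only: render_bar and pressure_colour are hypothetical helpers that
# mirror the clamping and threshold logic of the statusline script.

# render_bar PCT: print a 10-cell bar, with the fill clamped to [0,10].
render_bar() {
  local pct=$1 filled bar="" i
  filled=$((pct / 10))
  [ "$filled" -gt 10 ] && filled=10
  [ "$filled" -lt 0 ] && filled=0
  for ((i = 0; i < 10; i++)); do
    if [ "$i" -lt "$filled" ]; then bar+="█"; else bar+="░"; fi
  done
  printf '%s\n' "$bar"
}

# pressure_colour PCT: green below 70, yellow from 70 to 89, red from 90.
pressure_colour() {
  if [ "$1" -ge 90 ]; then echo red
  elif [ "$1" -ge 70 ]; then echo yellow
  else echo green
  fi
}
```

For example, `render_bar 45` fills four cells, and `render_bar 130` fills all ten rather than overflowing.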
|
||||||
22
.docker/caddy-gandi/Dockerfile
Normal file
|
|
@ -0,0 +1,22 @@
|
||||||
|
# syntax=docker/dockerfile:1
|
||||||
|
# Custom Caddy image: vanilla Caddy + the Gandi DNS-01 plugin (ADR-024).
|
||||||
|
#
|
||||||
|
# WHY: mesh/LAN-only services have no public A-record, so they cannot satisfy ACME
|
||||||
|
# HTTP-01; they need DNS-01 against Gandi (the M1 *.<domain> wildcard strategy).
|
||||||
|
# Caddy's official image ships no third-party DNS plugins, so we compile one in.
|
||||||
|
#
|
||||||
|
# WHERE to build: on ubongo (the control node) — NOT on askari/Hetzner. Google's Go
|
||||||
|
# module proxy 403s Hetzner IP ranges, which broke the original on-host build (M4a).
|
||||||
|
# Build here, push the pinned tag/digest to the Forgejo registry, pull on askari.
|
||||||
|
#
|
||||||
|
# Versions pinned (ADR-011/ADR-014). caddy-dns/gandi v1.1.0 -> libdns/gandi v1.1.0,
|
||||||
|
# which authenticates with a Gandi Personal Access Token via "Authorization: Bearer"
|
||||||
|
# against https://api.gandi.net/v5/livedns (the legacy Apikey scheme is gone — using
|
||||||
|
# a PAT in the old Apikey slot 403s, which is what sank the M4a attempt).
|
||||||
|
# verified: caddy-dns/gandi v1.1.0 sends the PAT as Bearer · WebFetch libdns/gandi
|
||||||
|
# client.go @master (go.mod requires v1.1.0) · 2026-06-15
|
||||||
|
FROM caddy:2.11.4-builder AS build
|
||||||
|
RUN xcaddy build v2.11.4 --with github.com/caddy-dns/gandi@v1.1.0
|
||||||
|
|
||||||
|
FROM caddy:2.11.4
|
||||||
|
COPY --from=build /usr/bin/caddy /usr/bin/caddy
|
||||||
6
.gitignore
vendored
|
|
@ -31,3 +31,9 @@ terraform/**/*.tfstate
|
||||||
terraform/**/*.tfstate.backup
|
terraform/**/*.tfstate.backup
|
||||||
terraform/**/terraform.tfvars
|
terraform/**/terraform.tfvars
|
||||||
# .terraform.lock.hcl is intentionally tracked (pins provider versions)
|
# .terraform.lock.hcl is intentionally tracked (pins provider versions)
|
||||||
|
|
||||||
|
# Service-UI verification screenshots (kept locally on ubongo, not committed — ADR-017)
|
||||||
|
.verify-runs/
|
||||||
|
|
||||||
|
# Integration-test transient run dir (ADR-025); diagnostics live under ~/integration-runs
|
||||||
|
tests/integration/.run/
|
||||||
|
|
|
||||||
|
|
@ -19,6 +19,15 @@ repos:
|
||||||
rev: v24.12.2 # keep in sync with requirements.txt
|
rev: v24.12.2 # keep in sync with requirements.txt
|
||||||
hooks:
|
hooks:
|
||||||
- id: ansible-lint
|
- id: ansible-lint
|
||||||
|
# Only run on Ansible content. ansible-lint loads the play context, which
|
||||||
|
# auto-decrypts inventories/*/group_vars/all/vault.yml via the wired
|
||||||
|
# vault_password_file (→ rbw) — so it needs `rbw unlock`. The upstream hook is
|
||||||
|
# always_run+pass_filenames:false (lints the whole project, every commit); we
|
||||||
|
# override always_run:false and add a files filter so docs-/config-only commits
|
||||||
|
# skip it (no vault needed). pass_filenames stays false → still a project lint
|
||||||
|
# when any Ansible file is staged.
|
||||||
|
always_run: false
|
||||||
|
files: ^(roles|playbooks|inventories)/.*\.ya?ml$
|
||||||
additional_dependencies:
|
additional_dependencies:
|
||||||
- ansible-core==2.17.* # pin (not >=) — keep in sync with requirements.txt
|
- ansible-core==2.17.* # pin (not >=) — keep in sync with requirements.txt
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -24,4 +24,5 @@ ignore: |
|
||||||
.venv/
|
.venv/
|
||||||
.collections/
|
.collections/
|
||||||
.scaffold/
|
.scaffold/
|
||||||
|
tests/integration/.run/
|
||||||
**/vault.yml
|
**/vault.yml
|
||||||
|
|
|
||||||
71
CLAUDE.md
|
|
@ -14,7 +14,8 @@ Keep it dense and command-focused. Verbose detail lives in `docs/`.
|
||||||
Homelab infrastructure automation for a Proxmox cluster running 2–5 Debian 13 VMs.
|
Homelab infrastructure automation for a Proxmox cluster running 2–5 Debian 13 VMs.
|
||||||
All hosts share a hardened base configuration. Each host runs a defined set of Docker
|
All hosts share a hardened base configuration. Each host runs a defined set of Docker
|
||||||
services deployed via Compose files rendered from Ansible templates. Ansible runs from
|
services deployed via Compose files rendered from Ansible templates. Ansible runs from
|
||||||
a dedicated control VM. CI runs on Forgejo Actions (self-hosted).
|
a dedicated physical control node (`ubongo`) outside the cluster. CI runs on Forgejo
|
||||||
|
Actions (self-hosted).
|
||||||
|
|
||||||
Full design rationale: `docs/decisions/`
|
Full design rationale: `docs/decisions/`
|
||||||
|
|
||||||
|
|
@ -32,6 +33,8 @@ Full design rationale: `docs/decisions/`
|
||||||
| Scaffold a new role | `make new-role NAME=<name>` |
|
| Scaffold a new role | `make new-role NAME=<name>` |
|
||||||
| Review repo for drift/cruft | `/review-repo` (Claude command) |
|
| Review repo for drift/cruft | `/review-repo` (Claude command) |
|
||||||
| Review hardware capacity | `/capacity-review` (Claude command) |
|
| Review hardware capacity | `/capacity-review` (Claude command) |
|
||||||
|
| Edit the vault (nvim, auto re-encrypt) | `make edit-vault [VAULT=<path>]` |
|
||||||
|
| Validate vault structure | `make check-vault [VAULT=<path>]` |
|
||||||
| Encrypt a vault file | `make encrypt FILE=<path>` |
|
| Encrypt a vault file | `make encrypt FILE=<path>` |
|
||||||
| Decrypt a vault file | `make decrypt FILE=<path>` |
|
| Decrypt a vault file | `make decrypt FILE=<path>` |
|
||||||
| Install Python deps | `make setup` |
|
| Install Python deps | `make setup` |
|
||||||
|
|
@ -40,6 +43,8 @@ Full design rationale: `docs/decisions/`
|
||||||
| Terraform plan | `make tf-plan [TF_ENV=staging]` |
|
| Terraform plan | `make tf-plan [TF_ENV=staging]` |
|
||||||
| Terraform apply | `make tf-apply [TF_ENV=staging]` |
|
| Terraform apply | `make tf-apply [TF_ENV=staging]` |
|
||||||
| Regenerate Ansible inventory | `make tf-inventory TF_ENV=<staging\|production>` |
|
| Regenerate Ansible inventory | `make tf-inventory TF_ENV=<staging\|production>` |
|
||||||
|
| Integration-test a host on a local VM | `make test-integration HOST=<name> [CERTS=…]` |
|
||||||
|
| Clean up integration test VMs | `make test-integration-clean` |
|
||||||
|
|
||||||
**Always `tf-plan` before `tf-apply`. Always `check` before `deploy`. Never skip lint.**
|
**Always `tf-plan` before `tf-apply`. Always `check` before `deploy`. Never skip lint.**
|
||||||
|
|
||||||
|
|
@ -50,11 +55,18 @@ Full design rationale: `docs/decisions/`
|
||||||
## Ansible conventions
|
## Ansible conventions
|
||||||
|
|
||||||
- **FQCN always**: `ansible.builtin.template`, never `template`
|
- **FQCN always**: `ansible.builtin.template`, never `template`
|
||||||
- **Tags**: every task must have at least one tag; playbooks support `--tags` filtering
|
- **Tags** (ADR-019): import each role with its role-name tag once at the play level
|
||||||
|
(Ansible inherits it to every task). Tag a task/block with a concern tag from the
|
||||||
|
approved list (`tests/tags.yml`) only where it genuinely belongs to that concern —
|
||||||
|
don't invent tags or tag for tagging's sake. Target one axis at a time (role/service
|
||||||
|
*or* concern; tags are union/OR, never intersected). `make lint` enforces the vocabulary and that each role import carries its role-name tag.
|
||||||
- **Handlers**: use `listen:` topic strings, not direct name references
|
- **Handlers**: use `listen:` topic strings, not direct name references
|
||||||
- **Variables**: `rolename__varname` double-underscore namespace for role defaults
|
- **Variables**: `rolename__varname` double-underscore namespace for role defaults
|
||||||
- **No inline vars in playbooks**: use `group_vars/` or `host_vars/` only
|
- **No inline vars in playbooks**: use `group_vars/` or `host_vars/` only
|
||||||
- **Loops**: prefer `loop:` over `with_items:`
|
- **Loops**: prefer `loop:` over `with_items:`
|
||||||
|
- **Loop var keys**: index with `item['key']`, never `item.key` — a key named
|
||||||
|
`values`/`keys`/`items`/`get`/… resolves to the dict *method* (silently corrupt +
|
||||||
|
non-idempotent), not the value
|
||||||
- **Conditionals**: prefer `true`/`false` over `yes`/`no`
|
- **Conditionals**: prefer `true`/`false` over `yes`/`no`
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
@ -71,7 +83,21 @@ Full design rationale: `docs/decisions/`
|
||||||
git commit** — the pre-commit ansible-lint hook decrypts `vault.yml`), run `rbw
|
git commit** — the pre-commit ansible-lint hook decrypts `vault.yml`), run `rbw
|
||||||
unlocked`; if it exits non-zero, ask the user to `rbw unlock` and wait rather than
|
unlocked`; if it exits non-zero, ask the user to `rbw unlock` and wait rather than
|
||||||
starting and failing partway. The agent stays unlocked 5h.
|
starting and failing partway. The agent stays unlocked 5h.
|
||||||
- To edit a vault file: `make decrypt FILE=<path>`, edit, `make encrypt FILE=<path>`
|
- To edit the vault: `make edit-vault` — decrypts → opens nvim → re-encrypts on `:wq`
|
||||||
|
(abort with `:cq`), then `check-vault` validates structure. No plaintext lands in the
|
||||||
|
work tree. Override the file with `VAULT=<path>`. (The lower-level `make decrypt`/
|
||||||
|
`encrypt FILE=<path>` still exist for scripted/non-interactive edits.)
|
||||||
|
- `make check-vault` validates the vault decrypts, is valid YAML, keeps secrets under the
|
||||||
|
nested `vault:` map, and has no empty leaves — printing a structure view with values
|
||||||
|
masked. Needs `rbw` unlocked. It also **flags any leaf still set to `CHANGEME`** (see
|
||||||
|
next bullet).
|
||||||
|
- **Stubbing a secret the operator must supply** (don't ping-pong over chat): when a new
|
||||||
|
secret is needed, the agent itself adds the vault entry with the sentinel value
|
||||||
|
**`CHANGEME`** plus a comment stating *what it is and how to obtain it*, wires the code
|
||||||
|
to `{{ vault.<service>.<key> }}`, and commits that. Then prompt the operator to run
|
||||||
|
`make edit-vault`, replace the `CHANGEME`(s) with the real value(s) — which never touch
|
||||||
|
the conversation — and re-encrypt. `make check-vault` lists any outstanding `CHANGEME`
|
||||||
|
placeholders so nothing is forgotten. The agent never handles the real secret.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -81,6 +107,12 @@ Full design rationale: `docs/decisions/`
|
||||||
- Every role must have a populated `README.md`
|
- Every role must have a populated `README.md`
|
||||||
- Every role must have `meta/main.yml` filled in
|
- Every role must have `meta/main.yml` filled in
|
||||||
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
|
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
|
||||||
|
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
|
||||||
|
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
|
||||||
|
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
|
||||||
|
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
|
||||||
|
copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
|
||||||
|
data. A stateless service records `backup__state: false` with a reason.
|
||||||
- One service = one self-contained role; no shared multi-service roles (ADR-004)
|
- One service = one self-contained role; no shared multi-service roles (ADR-004)
|
||||||
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
|
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
|
||||||
- Use `make new-role NAME=<name>` to scaffold — never create role structure by hand
|
- Use `make new-role NAME=<name>` to scaffold — never create role structure by hand
|
||||||
|
|
@ -99,13 +131,17 @@ inventories/
|
||||||
vault.yml
|
vault.yml
|
||||||
docker_hosts/ # hosts running Docker services
|
docker_hosts/ # hosts running Docker services
|
||||||
proxmox_hosts/ # Proxmox nodes themselves
|
proxmox_hosts/ # Proxmox nodes themselves
|
||||||
|
offsite_hosts/ # off-site hosts (askari) — NetBird coordinator + watchdog
|
||||||
host_vars/ # per-host overrides
|
host_vars/ # per-host overrides
|
||||||
staging/ # safe to run freely
|
staging/ # safe to run freely
|
||||||
```
|
```
|
||||||
|
|
||||||
Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`
|
Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`
|
||||||
|
|
||||||
(`control` holds the one manually-provisioned control node — see ADR-009.)
|
(`control` holds `ubongo`, the one manually-provisioned **physical** control node
|
||||||
|
outside the cluster; `offsite_hosts` holds `askari`, the off-site Hetzner host that
|
||||||
|
runs the NetBird coordinator + watchdog — also added manually. See ADR-009, ADR-015,
|
||||||
|
ADR-016.)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -138,12 +174,18 @@ Single-contributor, trunk-based (no merge requests / approval gates):
|
||||||
## Terraform conventions
|
## Terraform conventions
|
||||||
|
|
||||||
- Terraform owns VM existence only — nothing inside a VM, and no DNS records
|
- Terraform owns VM existence only — nothing inside a VM, and no DNS records
|
||||||
|
- Every TF-managed VM carries three Proxmox tags — `<env>`, its inventory `group`, and
|
||||||
|
`managed-by=terraform` — as **metadata only** (ADR-019). They do not feed inventory
|
||||||
|
or run-targeting; `tf_to_inventory.py` still groups by the `group` output field.
|
||||||
- Internal DNS is entirely Ansible (the `dns` role renders the zone from inventory)
|
- Internal DNS is entirely Ansible (the `dns` role renders the zone from inventory)
|
||||||
- OPNsense is entirely Ansible; do not reach for a Terraform OPNsense provider
|
- OPNsense is entirely Ansible; do not reach for a Terraform OPNsense provider
|
||||||
- Environments are separate directories (`staging/`, `production/`), not workspaces
|
- Environments are separate directories (`staging/`, `production/`), not workspaces
|
||||||
- Secrets via `TF_VAR_*` env vars only — never in `.tfvars` files
|
- Secrets via `TF_VAR_*` env vars only — never in `.tfvars` files
|
||||||
- `terraform.tfvars.example` is tracked; `terraform.tfvars` is gitignored
|
- `terraform.tfvars.example` is tracked; `terraform.tfvars` is gitignored
|
||||||
- `.terraform.lock.hcl` is tracked (pins provider versions)
|
- `.terraform.lock.hcl` is tracked (pins provider versions)
|
||||||
|
- Every module declares its own `required_providers` (in `versions.tf`) for any
|
||||||
|
non-hashicorp provider — otherwise TF infers `hashicorp/<name>` and `init` fails
|
||||||
|
(caught only by a live `tf-init`, not by static review)
|
||||||
- Full rationale: `docs/decisions/006-terraform.md`
|
- Full rationale: `docs/decisions/006-terraform.md`
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
@ -156,7 +198,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
|
||||||
- Edit vault-encrypted files directly — decrypt first, re-encrypt after
|
- Edit vault-encrypted files directly — decrypt first, re-encrypt after
|
||||||
- Force-push or rewrite already-pushed history on `main`
|
- Force-push or rewrite already-pushed history on `main`
|
||||||
- Add a collection to `requirements.yml` without a specific module need in existing role tasks
|
- Add a collection to `requirements.yml` without a specific module need in existing role tasks
|
||||||
- Open a firewall port anywhere but the `group_vars` firewall definitions — never ad-hoc on a host (ADR-002)
|
- Open a firewall port anywhere but the `group_vars` service catalog — never ad-hoc on a host. If it's not in the catalog, it doesn't exist (ADR-002, ADR-020)
|
||||||
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
|
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
|
||||||
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
|
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
|
||||||
- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)
|
- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)
|
||||||
|
|
@ -187,24 +229,39 @@ Single-contributor, trunk-based (no merge requests / approval gates):
|
||||||
| Topic | File |
|
| Topic | File |
|
||||||
|------------------------|---------------------------------------|
|
|------------------------|---------------------------------------|
|
||||||
| Architecture overview | `docs/decisions/001-architecture.md` |
|
| Architecture overview | `docs/decisions/001-architecture.md` |
|
||||||
| Capabilities overview (what boma does) | `docs/capabilities.md` |
|
| Build order / roadmap | `docs/ROADMAP.md` |
|
||||||
|
| Capabilities overview (what boma does) | `docs/CAPABILITIES.md` |
|
||||||
| Security baseline & strategy | `docs/decisions/002-security.md` |
|
| Security baseline & strategy | `docs/decisions/002-security.md` |
|
||||||
| Accepted security risks | `docs/security/accepted-risks.md` |
|
| Accepted security risks | `docs/security/accepted-risks.md` |
|
||||||
| Per-service security checklist | `docs/security/service-checklist.md` |
|
| Per-service security checklist | `docs/security/service-checklist.md` |
|
||||||
| Per-service security record (template) | `docs/security/service-security-template.md` |
|
| Per-service security record (template) | `docs/security/service-security-template.md` |
|
||||||
|
| Per-service verification spec (template) | `docs/testing/service-verify-template.md` |
|
||||||
| Heritage / V4 policy | `docs/decisions/013-heritage-v4.md` |
|
| Heritage / V4 policy | `docs/decisions/013-heritage-v4.md` |
|
||||||
| Sourcing tech knowledge | `docs/decisions/014-knowledge-sourcing.md` |
|
| Sourcing tech knowledge | `docs/decisions/014-knowledge-sourcing.md` |
|
||||||
| Toolchain choices | `docs/decisions/003-toolchain.md` |
|
| Toolchain choices | `docs/decisions/003-toolchain.md` |
|
||||||
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
|
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
|
||||||
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
|
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
|
||||||
|
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
|
||||||
| Terraform | `docs/decisions/006-terraform.md` |
|
| Terraform | `docs/decisions/006-terraform.md` |
|
||||||
| Network topology | `docs/decisions/007-network.md` |
|
| Network topology | `docs/decisions/007-network.md` |
|
||||||
|
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
|
||||||
| Testing methodology | `docs/decisions/008-testing.md` |
|
| Testing methodology | `docs/decisions/008-testing.md` |
|
||||||
|
| Service-UI verification (Level 4) | `docs/decisions/017-service-ui-verification.md` |
|
||||||
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
|
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
|
||||||
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
|
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
|
||||||
| Update management | `docs/decisions/011-update-management.md` |
|
| Update management | `docs/decisions/011-update-management.md` |
|
||||||
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
|
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
|
||||||
|
| Logging & log integrity | `docs/decisions/018-logging.md` |
|
||||||
|
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
|
||||||
|
| Firewall strategy | `docs/decisions/020-firewall.md` |
|
||||||
|
| Operational access | `docs/decisions/021-operational-access.md` |
|
||||||
|
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
|
||||||
|
| ADR structure & lifecycle | `docs/decisions/023-adr-structure.md` |
|
||||||
|
| Reverse proxy (Caddy) | `docs/decisions/024-reverse-proxy.md` |
|
||||||
|
| Local VM integration testing (ADR-025) | `docs/decisions/025-local-vm-integration-testing.md` |
|
||||||
|
| Integration testing runbook | `docs/runbooks/integration-testing.md` |
|
||||||
| Adding a new role | `docs/runbooks/new-role.md` |
|
| Adding a new role | `docs/runbooks/new-role.md` |
|
||||||
| Adding a new host | `docs/runbooks/new-host.md` |
|
| Adding a new host | `docs/runbooks/new-host.md` |
|
||||||
|
| Enrolling a NetBird client (laptop/phone) | `docs/runbooks/netbird-client.md` |
|
||||||
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
|
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
|
||||||
| Claude Code setup (per machine) | `docs/runbooks/claude-code-setup.md` |
|
| Claude Code setup (per machine) | `docs/runbooks/claude-code-setup.md` |
|
||||||
|
|
|
||||||
111
Makefile
|
|
@ -5,24 +5,45 @@ VENV := .venv
|
||||||
PYTHON := $(VENV)/bin/python
|
PYTHON := $(VENV)/bin/python
|
||||||
PIP := $(VENV)/bin/pip
|
PIP := $(VENV)/bin/pip
|
||||||
ANSIBLE := $(VENV)/bin/ansible
|
ANSIBLE := $(VENV)/bin/ansible
|
||||||
PLAYBOOK := $(VENV)/bin/ansible-playbook
|
PLAYBOOK_BIN := $(VENV)/bin/ansible-playbook
|
||||||
GALAXY := $(VENV)/bin/ansible-galaxy
|
GALAXY := $(VENV)/bin/ansible-galaxy
|
||||||
LINT := $(VENV)/bin/ansible-lint
|
LINT := $(VENV)/bin/ansible-lint
|
||||||
MOLECULE := $(VENV)/bin/molecule
|
MOLECULE := $(VENV)/bin/molecule
|
||||||
# Vault password is resolved via ansible.cfg (vault_password_file); no flag needed.
|
# Vault password is resolved via ansible.cfg (vault_password_file); no flag needed.
|
||||||
VAULT_ARGS :=
|
VAULT_ARGS :=
|
||||||
INVENTORY := -i inventories/production/hosts.yml
|
# Default vault file for edit-vault / check-vault (override with VAULT=<path>).
|
||||||
|
VAULT ?= inventories/production/group_vars/all/vault.yml
|
||||||
|
INVENTORY := -i inventories/production/
|
||||||
|
|
||||||
TF := terraform
|
TF := terraform
|
||||||
TF_ENV ?= staging
|
TF_ENV ?= staging
|
||||||
MOLECULE_IMAGE := forgejo.nyumbani.baobab.band/sjat/molecule-debian13:latest
|
MOLECULE_IMAGE := forgejo.nyumbani.baobab.band/sjat/molecule-debian13:latest
|
||||||
MOLECULE_DOCKERFILE := .docker/molecule-debian13/Dockerfile
|
MOLECULE_DOCKERFILE := .docker/molecule-debian13/Dockerfile
|
||||||
|
# Custom Caddy + Gandi DNS-01 plugin (ADR-024). Build on ubongo, NOT askari/Hetzner
|
||||||
|
# (the Go module proxy 403s Hetzner IPs); push the pinned tag to the Forgejo registry.
|
||||||
|
CADDY_IMAGE := forgejo.nyumbani.baobab.band/sjat/caddy-gandi:2.11.4
|
||||||
|
CADDY_DOCKERFILE := .docker/caddy-gandi/Dockerfile
|
||||||
|
# Forgejo container registry (same host/user as the image tags above). `make registry-login`
|
||||||
|
# logs the Docker daemon in using vault.forgejo.registry_token (2026-06-17 kaizen) so image
|
||||||
|
# pushes are agent-completable non-interactively.
|
||||||
|
REGISTRY_HOST := forgejo.nyumbani.baobab.band
|
||||||
|
REGISTRY_USER := sjat
|
||||||
|
|
||||||
|
# For TF_ENV=offsite, source the Hetzner token from the vault into the environment
|
||||||
|
# (rbw must be unlocked). Read in-memory; never written to a tfvars file (CLAUDE.md).
|
||||||
|
ifeq ($(TF_ENV),offsite)
|
||||||
|
TF_TOKEN_ENV := TF_VAR_hcloud_token="$$($(ANSIBLE)-vault view inventories/production/group_vars/all/vault.yml | $(PYTHON) -c 'import sys, yaml; print(yaml.safe_load(sys.stdin)["vault"]["hetzner"]["token"])')"
|
||||||
|
else
|
||||||
|
TF_TOKEN_ENV :=
|
||||||
|
endif
|
||||||
|
|
||||||
.DEFAULT_GOAL := help
|
.DEFAULT_GOAL := help
|
||||||
|
|
||||||
.PHONY: help setup collections lint test test-all check deploy encrypt decrypt new-role \
|
.PHONY: help setup collections lint test test-all test-integration test-integration-clean \
|
||||||
tf-init tf-plan tf-apply tf-output tf-inventory \
|
check deploy encrypt decrypt \
|
||||||
molecule-image molecule-image-push
|
edit-vault check-vault new-role \
|
||||||
|
tf-init tf-plan tf-apply tf-output tf-inventory tf-inventory-offsite \
|
||||||
|
molecule-image molecule-image-push caddy-image caddy-image-push registry-login
|
||||||
|
|
||||||
help:
|
help:
|
||||||
@echo ""
|
@echo ""
|
||||||
|
|
@ -33,8 +54,12 @@ help:
|
||||||
@echo " make lint Run yamllint + ansible-lint"
|
@echo " make lint Run yamllint + ansible-lint"
|
||||||
@echo " make test ROLE=<name> Run Molecule tests for a role"
|
@echo " make test ROLE=<name> Run Molecule tests for a role"
|
||||||
@echo " make test-all Run Molecule tests for all roles"
|
@echo " make test-all Run Molecule tests for all roles"
|
||||||
@echo " make check PLAYBOOK=<name> Dry-run a playbook (check mode)"
|
@echo " make test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1] Run ADR-025 integration cycle against a VM"
|
||||||
@echo " make deploy PLAYBOOK=<name> Run a playbook against production"
|
@echo " make test-integration-clean Prune stale integration-test VM snapshots"
|
||||||
|
@echo " make check PLAYBOOK=<name> [LIMIT=<host>] [TAGS=<tags>] [EXTRA=<args>] Dry-run a playbook (check mode)"
|
||||||
|
@echo " make deploy PLAYBOOK=<name> [LIMIT=<host>] [TAGS=<tags>] [EXTRA=<args>] Run a playbook against production"
|
||||||
|
@echo " make edit-vault [VAULT=<path>] Edit the vault in nvim (auto re-encrypts + checks)"
|
||||||
|
@echo " make check-vault [VAULT=<path>] Validate vault structure (values masked)"
|
||||||
@echo " make encrypt FILE=<path> Encrypt a vault file"
|
@echo " make encrypt FILE=<path> Encrypt a vault file"
|
||||||
@echo " make decrypt FILE=<path> Decrypt a vault file"
|
@echo " make decrypt FILE=<path> Decrypt a vault file"
|
||||||
@echo " make new-role NAME=<name> Scaffold a new role"
|
@echo " make new-role NAME=<name> Scaffold a new role"
|
||||||
|
|
@ -44,11 +69,15 @@ help:
|
||||||
@echo " make tf-apply [TF_ENV=staging] Apply Terraform changes"
|
@echo " make tf-apply [TF_ENV=staging] Apply Terraform changes"
|
||||||
@echo " make tf-output [TF_ENV=staging] Print Terraform outputs as JSON"
|
@echo " make tf-output [TF_ENV=staging] Print Terraform outputs as JSON"
|
||||||
@echo " make tf-inventory [TF_ENV=staging] Regenerate Ansible inventory from Terraform outputs"
|
@echo " make tf-inventory [TF_ENV=staging] Regenerate Ansible inventory from Terraform outputs"
|
||||||
|
@echo " make tf-inventory-offsite Generate offsite_hosts inventory (askari) into inventories/production/"
|
||||||
@echo ""
|
@echo ""
|
||||||
@echo " TF_ENV defaults to 'staging'. Use TF_ENV=production for production."
|
@echo " TF_ENV defaults to 'staging'. Use TF_ENV=production for production."
|
||||||
@echo ""
|
@echo ""
|
||||||
@echo " make molecule-image Build the Molecule test image locally"
|
@echo " make molecule-image Build the Molecule test image locally"
|
||||||
@echo " make molecule-image-push Push the test image to the Forgejo registry"
|
@echo " make molecule-image-push Push the test image to the Forgejo registry"
|
||||||
|
@echo " make caddy-image Build the custom Caddy + Gandi DNS-01 image (run on ubongo)"
|
||||||
|
@echo " make caddy-image-push Push the Caddy image to the Forgejo registry"
|
||||||
|
@echo " make registry-login Log Docker into the Forgejo registry (vaulted token)"
|
||||||
@echo ""
|
@echo ""
|
||||||
|
|
||||||
# ── Environment setup ─────────────────────────────────────────────────────────
|
# ── Environment setup ─────────────────────────────────────────────────────────
|
||||||
|
|
@ -67,6 +96,7 @@ collections:
|
||||||
lint:
|
lint:
|
||||||
$(VENV)/bin/yamllint .
|
$(VENV)/bin/yamllint .
|
||||||
$(LINT)
|
$(LINT)
|
||||||
|
$(PYTHON) scripts/check-tags.py
|
||||||
|
|
||||||
# ── Testing ───────────────────────────────────────────────────────────────────
|
# ── Testing ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
@ -74,30 +104,50 @@ test:
|
||||||
ifndef ROLE
|
ifndef ROLE
|
||||||
$(error ROLE is required: make test ROLE=<rolename>)
|
$(error ROLE is required: make test ROLE=<rolename>)
|
||||||
endif
|
endif
|
||||||
cd roles/$(ROLE) && ../../$(MOLECULE) test
|
cd roles/$(ROLE) && PATH="$(CURDIR)/$(VENV)/bin:$$PATH" molecule test
|
||||||
|
|
||||||
test-all:
|
test-all:
|
||||||
@for role in roles/*/; do \
|
@for role in roles/*/; do \
|
||||||
echo "── Testing $$role ──"; \
|
echo "── Testing $$role ──"; \
|
||||||
cd $$role && ../../$(MOLECULE) test; cd ../..; \
|
cd $$role && PATH="$(CURDIR)/$(VENV)/bin:$$PATH" molecule test; cd ../..; \
|
||||||
done
|
done
|
||||||
|
|
||||||
|
test-integration:
|
||||||
|
ifndef HOST
|
||||||
|
$(error HOST is required: make test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1])
|
||||||
|
endif
|
||||||
|
PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py cycle \
|
||||||
|
--host $(HOST) $(if $(CERTS),--certs $(CERTS)) $(if $(KEEP),--keep)
|
||||||
|
|
||||||
|
test-integration-clean:
|
||||||
|
PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py prune
|
||||||
|
|
||||||
# ── Playbook execution ────────────────────────────────────────────────────────
|
# ── Playbook execution ────────────────────────────────────────────────────────
|
||||||
|
|
||||||
check:
|
check:
|
||||||
ifndef PLAYBOOK
|
ifndef PLAYBOOK
|
||||||
$(error PLAYBOOK is required: make check PLAYBOOK=<name>)
|
$(error PLAYBOOK is required: make check PLAYBOOK=<name>)
|
||||||
endif
|
endif
|
||||||
$(PLAYBOOK) $(INVENTORY) $(VAULT_ARGS) --check --diff playbooks/$(PLAYBOOK).yml
|
$(PLAYBOOK_BIN) $(INVENTORY) $(VAULT_ARGS) $(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS)) $(EXTRA) --check --diff playbooks/$(PLAYBOOK).yml
|
||||||
|
|
||||||
deploy:
|
deploy:
|
||||||
ifndef PLAYBOOK
|
ifndef PLAYBOOK
|
||||||
$(error PLAYBOOK is required: make deploy PLAYBOOK=<name>)
|
$(error PLAYBOOK is required: make deploy PLAYBOOK=<name>)
|
||||||
endif
|
endif
|
||||||
$(PLAYBOOK) $(INVENTORY) $(VAULT_ARGS) playbooks/$(PLAYBOOK).yml
|
$(PLAYBOOK_BIN) $(INVENTORY) $(VAULT_ARGS) $(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS)) $(EXTRA) playbooks/$(PLAYBOOK).yml
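
The optional-flag expansion in the `check` and `deploy` recipes can be sketched in Python as below. This is an illustration only: `build_playbook_args` is a hypothetical helper, not part of the repo, and it leaves out the `$(INVENTORY)` and `$(VAULT_ARGS)` parts. It mirrors how `$(if $(LIMIT),--limit $(LIMIT))` emits a flag only when the variable is non-empty.

```python
from __future__ import annotations

from typing import Optional, Sequence


def build_playbook_args(
    playbook: str,
    limit: Optional[str] = None,
    tags: Optional[str] = None,
    extra: Sequence[str] = (),
    check: bool = False,
) -> list[str]:
    """Hypothetical mirror of the recipe's argument order."""
    args: list[str] = []
    # $(if $(LIMIT),--limit $(LIMIT)): emitted only when LIMIT is non-empty.
    if limit:
        args += ["--limit", limit]
    # $(if $(TAGS),--tags $(TAGS)): same rule for TAGS.
    if tags:
        args += ["--tags", tags]
    # $(EXTRA) is passed through unchanged.
    args += list(extra)
    # `make check` appends --check --diff; `make deploy` does not.
    if check:
        args += ["--check", "--diff"]
    args.append(f"playbooks/{playbook}.yml")
    return args


if __name__ == "__main__":
    print(build_playbook_args("site", limit="askari", tags="hardening"))
```

With no `LIMIT` or `TAGS` set, the result reduces to the playbook path alone, which matches the behaviour of the old recipes.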
|
||||||
|
|
||||||
# ── Vault ─────────────────────────────────────────────────────────────────────
|
# ── Vault ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
# Streamlined edit: ansible-vault edit decrypts to a temp file, opens nvim, and
|
||||||
|
# re-encrypts on :wq (abort with :cq) — no plaintext ever lands in the work tree.
|
||||||
|
# Then validate structure. Override the file with VAULT=<path>.
|
||||||
|
edit-vault:
|
||||||
|
EDITOR=nvim $(ANSIBLE)-vault edit $(VAULT)
|
||||||
|
@$(PYTHON) scripts/check-vault.py $(VAULT)
|
||||||
|
|
||||||
|
check-vault:
|
||||||
|
@$(PYTHON) scripts/check-vault.py $(VAULT)
|
||||||
|
|
||||||
encrypt:
|
encrypt:
|
||||||
ifndef FILE
|
ifndef FILE
|
||||||
$(error FILE is required: make encrypt FILE=<path>)
|
$(error FILE is required: make encrypt FILE=<path>)
|
||||||
|
|
@@ -118,19 +168,36 @@ molecule-image:
|
||||||
molecule-image-push: molecule-image
|
molecule-image-push: molecule-image
|
||||||
docker push $(MOLECULE_IMAGE)
|
docker push $(MOLECULE_IMAGE)
|
||||||
|
|
||||||
|
# ── Custom Caddy image (Gandi DNS-01 plugin, ADR-024) ─────────────────────────
|
||||||
|
# DNS-01 (wildcard / mesh-LAN-only certs) needs the caddy-dns/gandi plugin compiled
|
||||||
|
# in via xcaddy. Build on ubongo — Google's Go module proxy 403s Hetzner IPs.
|
||||||
|
|
||||||
|
caddy-image:
|
||||||
|
docker build -t $(CADDY_IMAGE) -f $(CADDY_DOCKERFILE) .docker/caddy-gandi
|
||||||
|
|
||||||
|
caddy-image-push: caddy-image
|
||||||
|
docker push $(CADDY_IMAGE)
|
||||||
|
|
||||||
|
# Log the local Docker daemon into the Forgejo registry using the vaulted token, so the
|
||||||
|
# *-image-push targets above are agent-completable non-interactively (rbw must be unlocked).
|
||||||
|
registry-login:
|
||||||
|
@ANSIBLE_VAULT="$(ANSIBLE)-vault" PYTHON="$(PYTHON)" VAULT="$(VAULT)" \
|
||||||
|
REGISTRY_HOST="$(REGISTRY_HOST)" REGISTRY_USER="$(REGISTRY_USER)" \
|
||||||
|
bash scripts/registry-login.sh
|
||||||
|
|
||||||
# ── Terraform ─────────────────────────────────────────────────────────────────
|
# ── Terraform ─────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
tf-init:
|
tf-init:
|
||||||
$(TF) -chdir=terraform/environments/$(TF_ENV) init
|
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) init
|
||||||
|
|
||||||
tf-plan:
|
tf-plan:
|
||||||
$(TF) -chdir=terraform/environments/$(TF_ENV) plan
|
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) plan
|
||||||
|
|
||||||
tf-apply:
|
tf-apply:
|
||||||
$(TF) -chdir=terraform/environments/$(TF_ENV) apply
|
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) apply
|
||||||
|
|
||||||
tf-output:
|
tf-output:
|
||||||
$(TF) -chdir=terraform/environments/$(TF_ENV) output -json
|
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) output -json
|
||||||
|
|
||||||
tf-inventory:
|
tf-inventory:
|
||||||
ifndef TF_ENV
|
ifndef TF_ENV
|
||||||
|
|
@@ -140,6 +207,11 @@ endif
|
||||||
| $(PYTHON) scripts/tf_to_inventory.py > inventories/$(TF_ENV)/hosts.yml
|
| $(PYTHON) scripts/tf_to_inventory.py > inventories/$(TF_ENV)/hosts.yml
|
||||||
@echo "Inventory written to inventories/$(TF_ENV)/hosts.yml"
|
@echo "Inventory written to inventories/$(TF_ENV)/hosts.yml"
|
||||||
|
|
||||||
|
tf-inventory-offsite:
|
||||||
|
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/offsite output -json \
|
||||||
|
| $(PYTHON) scripts/tf_to_inventory.py > inventories/production/offsite.yml
|
||||||
|
@echo "Offsite inventory written to inventories/production/offsite.yml"
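
A minimal sketch of the kind of transformation the `tf-inventory*` targets pipe through. It is not the actual `scripts/tf_to_inventory.py`. It assumes a Terraform output named `hosts` whose value maps host name to address; that output name and shape are hypothetical. The only thing taken from Terraform itself is that `terraform output -json` wraps each output in an object with a `value` key.

```python
from __future__ import annotations

import json


def inventory_from_tf_output(tf_output: dict, group: str) -> dict:
    """Build an Ansible-style inventory mapping from `terraform output -json` data."""
    # Each output is wrapped as {"value": ..., "type": ..., "sensitive": ...}.
    hosts = tf_output.get("hosts", {}).get("value", {})
    return {
        "all": {
            "children": {
                group: {
                    "hosts": {
                        name: {"ansible_host": addr}
                        for name, addr in sorted(hosts.items())
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Sample data only; 192.0.2.10 is a documentation address.
    sample = {"hosts": {"value": {"example-host": "192.0.2.10"}}}
    print(json.dumps(inventory_from_tf_output(sample, "offsite_hosts"), indent=2))
```

The sketch prints JSON for simplicity; the real script writes `hosts.yml` / `offsite.yml`, so its output format and field names may differ.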
|
||||||
|
|
||||||
# ── Role scaffolding ──────────────────────────────────────────────────────────
|
# ── Role scaffolding ──────────────────────────────────────────────────────────
|
||||||
|
|
||||||
new-role:
|
new-role:
|
||||||
|
|
@@ -151,7 +223,14 @@ endif
|
||||||
roles/$(NAME)/molecule/default
|
roles/$(NAME)/molecule/default
|
||||||
echo "---" > roles/$(NAME)/tasks/main.yml
|
echo "---" > roles/$(NAME)/tasks/main.yml
|
||||||
echo "---" > roles/$(NAME)/handlers/main.yml
|
echo "---" > roles/$(NAME)/handlers/main.yml
|
||||||
echo "---" > roles/$(NAME)/defaults/main.yml
|
printf '%s\n' '---' \
|
||||||
|
'# Role defaults use the <rolename>__var double-underscore namespace.' \
|
||||||
|
'#' \
|
||||||
|
'# Service roles (ADR-004) also declare access__*/backup__* data here. Those are' \
|
||||||
|
'# cross-role conventions (not rolename-prefixed), so EACH such line needs a trailing' \
|
||||||
|
'# noqa: var-naming[no-role-prefix] (ansible-lint 24.x has no per-prefix allowlist).' \
|
||||||
|
'# Reference: roles/reverse_proxy/defaults/main.yml' \
|
||||||
|
> roles/$(NAME)/defaults/main.yml
|
||||||
echo "---" > roles/$(NAME)/meta/main.yml
|
echo "---" > roles/$(NAME)/meta/main.yml
|
||||||
printf '# %s\n\nRole description here.\n' "$(NAME)" > roles/$(NAME)/README.md
|
printf '# %s\n\nRole description here.\n' "$(NAME)" > roles/$(NAME)/README.md
|
||||||
cp .scaffold/molecule.yml roles/$(NAME)/molecule/default/molecule.yml
|
cp .scaffold/molecule.yml roles/$(NAME)/molecule/default/molecule.yml
|
||||||
|
|
|
||||||
28
README.md
|
|
@@ -57,7 +57,13 @@ See `Makefile` for the full list of targets.
|
||||||
│
|
│
|
||||||
├── docs/
|
├── docs/
|
||||||
│ ├── decisions/ # Architecture decision records (ADRs)
|
│ ├── decisions/ # Architecture decision records (ADRs)
|
||||||
│ └── runbooks/ # Step-by-step operational procedures
|
│ ├── runbooks/ # Step-by-step operational procedures
|
||||||
|
│ ├── security/ # Per-service security checklist + templates + accepted risks
|
||||||
|
│ ├── testing/ # VERIFY.md template + service-UI verification reports
|
||||||
|
│ ├── access/ # ACCESS.md template (ADR-021)
|
||||||
|
│ ├── backup/ # BACKUP.md template (ADR-022)
|
||||||
|
│ ├── hardware/ # Physical capacity reference + reviews
|
||||||
|
│ └── reviews/ # /review-repo reports
|
||||||
│
|
│
|
||||||
├── inventories/
|
├── inventories/
|
||||||
│ ├── production/ # Live hosts — edit carefully
|
│ ├── production/ # Live hosts — edit carefully
|
||||||
|
|
@@ -65,10 +71,12 @@ See `Makefile` for the full list of targets.
|
||||||
│
|
│
|
||||||
├── playbooks/ # Orchestration playbooks
|
├── playbooks/ # Orchestration playbooks
|
||||||
│ ├── site.yml # Full standard state
|
│ ├── site.yml # Full standard state
|
||||||
|
│ ├── workstation.yml # Developer environment (control group)
|
||||||
│ └── bootstrap.yml # First-run new host setup
|
│ └── bootstrap.yml # First-run new host setup
|
||||||
│
|
│
|
||||||
├── roles/ # Ansible roles
|
├── roles/ # Ansible roles
|
||||||
│ ├── base/ # OS baseline applied to all hosts
|
│ ├── base/ # OS baseline applied to all hosts
|
||||||
|
│ ├── dev_env/ # Interactive developer environment
|
||||||
│ └── docker_host/ # Docker runtime setup
|
│ └── docker_host/ # Docker runtime setup
|
||||||
│
|
│
|
||||||
├── terraform/ # VM provisioning only — no DNS (see ADR-006/009)
|
├── terraform/ # VM provisioning only — no DNS (see ADR-006/009)
|
||||||
|
|
@@ -92,6 +100,24 @@ See `Makefile` for the full list of targets.
|
||||||
- Network topology: `docs/decisions/007-network.md`
|
- Network topology: `docs/decisions/007-network.md`
|
||||||
- Testing methodology: `docs/decisions/008-testing.md`
|
- Testing methodology: `docs/decisions/008-testing.md`
|
||||||
- Terraform ↔ Ansible handoff: `docs/decisions/009-provisioning-handoff.md`
|
- Terraform ↔ Ansible handoff: `docs/decisions/009-provisioning-handoff.md`
|
||||||
|
- Forgejo & CI: `docs/decisions/010-forgejo-ci.md`
|
||||||
|
- Update management: `docs/decisions/011-update-management.md`
|
||||||
|
- Hardware & capacity: `docs/decisions/012-hardware-capacity.md`
|
||||||
|
- Heritage / V4 policy: `docs/decisions/013-heritage-v4.md`
|
||||||
|
- Sourcing technical knowledge: `docs/decisions/014-knowledge-sourcing.md`
|
||||||
|
- Control / AI-worker host (`ubongo`): `docs/decisions/015-control-host.md`
|
||||||
|
- Mesh VPN (NetBird): `docs/decisions/016-mesh-vpn.md`
|
||||||
|
- Service-UI verification (Level 4): `docs/decisions/017-service-ui-verification.md`
|
||||||
|
- Logging & log integrity: `docs/decisions/018-logging.md`
|
||||||
|
- Tagging & run-targeting: `docs/decisions/019-tagging.md`
|
||||||
|
- Firewall strategy: `docs/decisions/020-firewall.md`
|
||||||
|
- Operational access: `docs/decisions/021-operational-access.md`
|
||||||
|
- Backup & disaster recovery: `docs/decisions/022-backup.md`
|
||||||
|
- ADR structure & lifecycle: `docs/decisions/023-adr-structure.md`
|
||||||
|
- Reverse proxy (Caddy): `docs/decisions/024-reverse-proxy.md`
|
||||||
|
|
||||||
|
(CLAUDE.md carries the full cross-referenced table, including the runbooks and
|
||||||
|
security/testing docs.)
|
||||||
|
|
||||||
## Contributing
|
## Contributing
|
||||||
|
|
||||||
|
|
|
||||||
54
STATUS.md
|
|
@@ -5,7 +5,7 @@ This repo is partly aspirational: the ADRs in `docs/decisions/` describe the
|
||||||
truth. **Before relying on a role, provider, or pipeline existing, check here.**
|
truth. **Before relying on a role, provider, or pipeline existing, check here.**
|
||||||
If something is listed as "designed, not built", do not assume it works.
|
If something is listed as "designed, not built", do not assume it works.
|
||||||
|
|
||||||
_Last reviewed: 2026-05-30._
|
_Last reviewed: 2026-06-19._
|
||||||
|
|
||||||
## Real and working today
|
## Real and working today
|
||||||
|
|
||||||
|
|
@@ -20,30 +20,47 @@ _Last reviewed: 2026-05-30._
|
||||||
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with `pre-commit install` after `make setup`. |
|
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with `pre-commit install` after `make setup`. |
|
||||||
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
|
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
|
||||||
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
|
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
|
||||||
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
|
| `/kaizen` | Curate `docs/FRICTION.md` Open signals → decisions ledger (`scripts/friction-scan.py` Phase 0, unit-tested, + `.claude/commands/kaizen.md`). Interactive, on-demand; `--nudge` (recurrence/age/backlog) surfaces in `/review-repo`. Headless/cron deferred (TODO 11.3). |
|
||||||
|
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below. Offsite env also written — see "Designed but not built". |
|
||||||
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
|
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
|
||||||
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
|
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
|
||||||
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
|
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
|
||||||
| Service-role standard + per-service `SECURITY.md` convention | Defined (ADR-004 + `docs/security/service-security-template.md`); not yet applied — no service roles exist |
|
| Service-role standard + per-service `SECURITY.md` convention | Defined (ADR-004 + `docs/security/service-security-template.md`); not yet applied — no service roles exist |
|
||||||
|
| Tag standard + enforcement (ADR-019) | Works — `tests/tags.yml` (closed vocabulary) + `scripts/check-tags.py` (run by `make lint`, unit-tested): enforces the tag vocabulary and that each role import in a play's `roles:` block carries its role-name tag. Governs mostly-unbuilt roles, but the linter is live now. Proxmox VM tag convention (`<env>`, group, `managed-by=terraform`) is in the Terraform HCL but unprovisioned. |
|
||||||
|
| `roles/dev_env/` — interactive developer environment | **Built + applied.** zsh + oh-my-zsh + oh-my-posh, tmux + TPM plugins, neovim; dotfiles deployed via GNU stow (re-derived from V4/fisi per ADR-013). Node.js from a pinned upstream tarball (not Debian's npm). Lint + Molecule (idempotent) green. **Applied to `ubongo`** for users `sjat` + `claude` (verified: zsh login shells, stow-symlinked `.zshrc`/`.tmux.conf` + nvim config, oh-my-zsh, tmux plugins; nvim v0.12.2, oh-my-posh 29.0.1). Run via `playbooks/workstation.yml` against the `control` group (no dedicated `workstations` group yet). |
|
||||||
|
| `make check` / `make deploy PLAYBOOK=<name>` | **Works.** First end-to-end run (applying `dev_env`) surfaced + fixed latent bugs: Makefile `PLAYBOOK` var collision (binary path vs playbook-name arg) meant the targets never ran; `ansible.cfg` referenced uninstalled community.general callbacks (now built-in `default` + `ansible.posix.profile_tasks`); `acl` package added so Ansible can `become_user` an unprivileged user. The make targets now function — though `site`/`base`/`docker_host` content is still incomplete (see below). |
|
||||||
|
| `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`). **Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
|
||||||
|
| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker + libvirt groups, **`NOPASSWD:ALL` sudo** — ADR-015 amended 2026-06-18; operator `sjat` uses password-required sudo via `sudo` group; the former `sjat-ansible` NOPASSWD drop-in removed 2026-06-18). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **NetBird mesh-enrolled (M5, 2026-06-17):** `wt0` up at `100.99.146.14` via the `base` `mesh` concern. **`base` firewall applied (mesh-hardening 2/3, 2026-06-19):** INPUT-only default-deny — input locked to `wt0` + ssh-from-control (`10.20.10.151`) + workstations (`10.20.10.50` mamba, `10.20.10.17`); forward `accept` (Docker/libvirt-NAT safe). Live-verified (SSH self-path + Docker egress, after a post-apply `restart docker` — base's flush wipes Docker nat, FRICTION); **real-host reboot-validated (2026-06-19):** after an operator reboot, the `policy drop` input chain + full allow-list re-applied on boot and the `wt0` mesh + SSH self-path came back clean. `claude` now self-SSHes (ad-hoc `authorized_keys` grant so the agent can run SSH-based deploys with the auto-rollback safety; fold into the control-node bootstrap). **Pending:** full `base` hardening (auditd/CIS); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservations (10.20.10.151 MAC `88:a4:c2:e0:ee:da` + the `.50`/`.17` workstation leases); Terraform state backup (now relevant — the offsite tfstate exists). |
|
||||||
|
| `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me` → `77.42.120.136`. **SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **NetBird coordinator (M4b):** `netbird_coordinator` deployed — dashboard live at `https://netbird.askari.wingu.me` (valid LE cert), management API behind embedded Dex (401 unauth), STUN on 3478/udp. **NetBird peer (M5, 2026-06-17):** also enrolled as a mesh agent (`base` `mesh` concern) — `wt0` at `100.99.226.39`, Management+Signal Connected; the agent coexists with the coordinator. **Mesh-hardening redesign applied + live reboot-validated (2026-06-20):** `base` INPUT-only nftables default-deny (`inet filter` input `policy drop`; forward `accept`, Docker-safe via a post-apply `restart docker`), SSH `wt0`-primary + a permanent WAN break-glass (ubongo's WAN `91.226.145.80`; the Hetzner console is the OOB ultimate fallback), managed over `wt0`; `netbird_coordinator` geolocation disabled (`NB_DISABLE_GEOLOCATION`) so a no-egress boot can't FATAL it. A real reboot recovered **unattended** — firewall persisted, Docker forwarding + public services (Caddy 80/443, STUN 3478) up, coordinator geo-disabled (no FATAL), `wt0`/mesh (Management+Signal Connected) + both SSH paths back. **Pending:** offsite tfstate backup (ADR-022); relay-SPOF reduction (next mesh-hardening sub-project — `ubongo→askari` is currently `Relayed` through askari's own relay). |
|
||||||
|
| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). **DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15):** custom `caddy-gandi` image (`.docker/caddy-gandi/`, `make caddy-image`, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`. Verified end-to-end — a real wildcard cert issued via LE **staging** + Gandi DNS-01 with `vault.gandi.pat`. M4a's deferral (version skew + Hetzner-IP build) is closed; image **pending registry push** (`make caddy-image-push` needs `docker login`). The `reverse_proxy` Caddyfile is bind-mounted as a **directory** (`./caddy` → `/etc/caddy`) so atomic re-renders are visible in-container and `caddy reload` actually applies new routes (a single-file mount pinned the stale inode). |
|
||||||
|
| `roles/netbird_coordinator/` — NetBird control plane (ADR-016, M4b) | **Built + applied (askari, 2026-06-16). boma's FIRST real service role.** Self-hosted NetBird **v0.72.4**: a single combined `netbird-server` container (management + signal + relay + STUN + **embedded Dex IdP** at `/oauth2`) + `dashboard:v2.39.0`, on the shared `boma` network behind the M4a Caddy via gRPC-h2c + WebSocket + path routing (`reverse_proxy__routes` gained a raw-`caddy` route type). Secrets `vault.netbird.{auth_secret,datastore_key}` (self-generated). Carries the full service-role file set (SECURITY/VERIFY/ACCESS/BACKUP) — **first stateful role** (`backup__state: true`; encrypted SQLite at `/var/lib/netbird`, off-site backup pending `fisi`/ADR-022). **Verified live:** dashboard 200 + valid LE cert, `/api` 401 (auth-gated, routes OK), STUN up. **Not yet configured:** first-boot `/setup` admin + peer enrolment = M5. |
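
The two rules in the tag-standard row above (closed vocabulary, and each role import in a play's `roles:` block carrying its role-name tag) can be sketched as follows. This is a hedged illustration, not the actual `scripts/check-tags.py`: the function name, the already-parsed play dict, and the error wording are all assumptions.

```python
from __future__ import annotations


def check_play_tags(play: dict, vocabulary: set[str]) -> list[str]:
    """Return error strings for one already-parsed play (hypothetical checker)."""
    errors: list[str] = []
    for entry in play.get("roles", []):
        # A role may be listed as a bare name or as a mapping with role/tags keys.
        if isinstance(entry, str):
            name, tags = entry, []
        else:
            name = entry.get("role") or entry.get("name", "")
            tags = entry.get("tags", [])
        # Ansible accepts a single tag as a plain string.
        if isinstance(tags, str):
            tags = [tags]
        # Rule 1: every tag must come from the closed vocabulary.
        for tag in tags:
            if tag not in vocabulary:
                errors.append(f"{name}: tag '{tag}' is not in the vocabulary")
        # Rule 2: the role import must carry its own role-name tag.
        if name not in tags:
            errors.append(f"{name}: missing role-name tag")
    return errors


if __name__ == "__main__":
    play = {"roles": [{"role": "base", "tags": ["base", "hardening"]}, "docker_host"]}
    for line in check_play_tags(play, {"base", "hardening", "docker_host"}):
        print(line)
```

Returning a list of errors rather than raising on the first one lets a lint run report every violation in a single pass.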
|
||||||
|
|
||||||
## Scaffolded but empty — NOT implemented
|
## Scaffolded but empty — NOT implemented
|
||||||
|
|
||||||
| Thing | State |
|
| Thing | State |
|
||||||
|---|---|
|
|---|---|
|
||||||
| `roles/base/` | Not in git — only an empty dir on disk (untracked). `site.yml` references it, so a clean clone errors on `make deploy PLAYBOOK=site` until it is built. |
|
| `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is **applied to ubongo** (mesh-hardening 2/3, 2026-06-19) **and askari** (mesh-hardening redesign, 2026-06-20) — both INPUT-only default-deny via the `base__firewall_input_only` knob (input default-deny + `wt0`/ssh-from-control/`base__firewall_admin_addrs` allow-list; forward left `accept` so Docker/libvirt-NAT survive), both **live reboot-validated**. On a Docker host (askari) base's `flush ruleset` wipes Docker's nat, so the cutover follows the firewall apply with a `restart docker` to rebuild it (FRICTION). Not built: auditd, packages, users (Phase 2 / TODO 15). The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`) so a local-DNS hiccup can't strand the mesh — **applied + live on ubongo (2026-06-20)**: `getent hosts netbird.askari.wingu.me` → `77.42.120.136`, mesh unaffected. The single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment). |
|
||||||
| `roles/docker_host/` | Not in git. Same. |
|
|
||||||
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
|
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
|
||||||
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
|
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
|
||||||
|
|
||||||
So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `base` and
|
(`roles/docker_host/` is no longer scaffold-only — it installs the Docker engine + Compose
|
||||||
`docker_host` roles it calls do not exist yet.
|
and is built + applied to askari; see "Real and working today". Its deferred scope —
|
||||||
|
daemon hardening + `nftables.d` container rules, ADR-004/ADR-020 — is still pending.)
|
||||||
|
|
||||||
|
A `make deploy PLAYBOOK=site` run now applies real content — `base` (its `firewall` +
|
||||||
|
`hardening` concerns) plus a functional `docker_host` (Docker engine) on docker hosts —
|
||||||
|
but in practice it is still limited: the production cluster has no docker hosts yet, and
|
||||||
|
`base`'s `firewall` concern is now applied to `ubongo` (control) but not yet to cluster docker hosts (none exist), so a full cluster `site` run does not
|
||||||
|
yet exist. (The `make check`/`deploy` machinery itself works — first proven by applying
|
||||||
|
`dev_env` via `playbooks/workstation.yml`, then `base`/`docker_host`/`reverse_proxy` on
|
||||||
|
askari.)
|
||||||
|
|
||||||
## Designed but not built
|
## Designed but not built
|
||||||
|
|
||||||
| Thing | Designed in | Notes |
|
| Thing | Designed in | Notes |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
|
| `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
|
||||||
| Terraform actually provisioning | ADR-006 / ADR-009 | Never `terraform init`ed: no `.terraform.lock.hcl`, no state, no real `local.vms` entries |
|
| Terraform actually provisioning (Proxmox) | ADR-006 / ADR-009 | Never `terraform init`ed: no `.terraform.lock.hcl`, no state, no real `local.vms` entries |
|
||||||
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; not implemented |
|
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; not implemented |
|
||||||
| Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
|
| Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
|
||||||
| Per-service roles | ADR-004 | Model defined; no service roles built |
|
| Per-service roles | ADR-004 | Model defined; no service roles built |
|
||||||
|
|
@@ -52,6 +69,29 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
|
||||||
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
|
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
|
||||||
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
|
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
|
||||||
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
|
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
|
||||||
|
| NetBird mesh — coordinator on `askari` | ADR-016 | **BUILT + applied (M4b, 2026-06-16)** — moved up to "Real and working today" (`roles/netbird_coordinator/`). Self-hosted control plane on askari; replaces ADR-007 WireGuard. Mesh **peer enrolment = M5** (next row). |
|
||||||
|
| NetBird agent enrollment in `base` | ADR-016 | **BUILT + applied (M5, 2026-06-17).** The `base` `mesh` concern (opt-in `base__mesh_enabled`) installs the pinned NetBird agent + runs `netbird up` with the reusable scoped key from `vault.netbird.setup_key`. Applied to **askari (`100.99.226.39`) + ubongo (`100.99.146.14`)** — both Management+Signal Connected; ubongo↔askari mesh ping verified. Enrollment is **additive** — the "SSH only on `wt0`" firewall lockdown is the deferred mesh-hardening follow-on, NOT applied. **Road-warrior clients (`mamba` + work laptop) enrolled (2026-06-17) → `ubongo` reachable from anywhere: the mobile-access goal is met and Phase 1 (remote access) is COMPLETE.** Client enrollment runbook: `docs/runbooks/netbird-client.md`. |
|
||||||
|
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
|
||||||
|
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
|
||||||
|
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
|
||||||
|
| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
|
||||||
|
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
|
||||||
|
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
|
||||||
|
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
|
||||||
|
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
|
||||||
|
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
|
||||||
|
|
||||||
|
## Integration test harness (ADR-025)
|
||||||
|
|
||||||
|
| Thing | State |
|
||||||
|
|---|---|
|
||||||
|
| `roles/integration_test/` | **Built** — installs/enables libvirt+QEMU+virtinst on `control` group hosts; adds `sjat`/`claude` to `libvirt` group; creates image-cache dir. Lint clean; applied live to ubongo (substrate installed); molecule scenario present, not run in the build env. |
|
||||||
|
| `scripts/integration-vm.py` | **Built** — stdlib-only lifecycle driver over `virsh`/`virt-install`/`cloud-localds`: `up / apply / reboot / assert / cycle / down / prune / console`. Lazily ensures the golden Debian-13 genericcloud image. pytest clean (transient-inventory generation, var/overlay merge, `--certs` mapping, DHCP-lease parsing, resource-guard math). |
|
||||||
|
| `tests/integration/` (profile, verify, overrides) | **Built** — "be askari" profile + var overlay + `verify.yml` outcome assertions (Docker active, forward-chain accepts present, published-port DNAT alive). Validated end-to-end by the RED→GREEN acceptance run. |
|
||||||
|
| `make test-integration` / `make test-integration-clean` | **Built** — wired into `Makefile`. |
|
||||||
|
| ADR-025 | **Accepted (2026-06-18)** — decision recorded, approach A, cert tiers, safety invariants, UEFI boot requirement, and claude-sudo dependency documented. |
|
||||||
|
| **RED/GREEN acceptance (ubongo live pass)** | **PASSED (2026-06-18).** A throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base nftables forward default-deny kills Docker forwarding on reboot) = RED. With the `docker_host` container-forward drop-in applied, Docker forwarding survived a reboot = GREEN. Nine shakedown findings captured in `docs/FRICTION.md`; key learnings (UEFI boot, claude sudo) recorded in ADR-025. `docs/TODO.md` item 2.4 closed. |
|
||||||
|
| `le-staging` cert validation | **Pending** — wired in v1 but not yet exercised on a real VM (separate from the RED/GREEN acceptance gate). |
|
||||||
|
|
||||||
## Keeping this honest
|
## Keeping this honest
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@@ -1,11 +1,12 @@
|
||||||
[defaults]
|
[defaults]
|
||||||
inventory = inventories/production/hosts.yml
|
inventory = inventories/production/
|
||||||
roles_path = roles
|
roles_path = roles
|
||||||
collections_path = .collections
|
collections_path = .collections
|
||||||
vault_password_file = scripts/vault-pass-client.sh
|
vault_password_file = scripts/vault-pass-client.sh
|
||||||
interpreter_python = auto_silent
|
interpreter_python = auto_silent
|
||||||
stdout_callback = yaml
|
stdout_callback = default
|
||||||
callbacks_enabled = timer, profile_tasks
|
callback_result_format = yaml
|
||||||
|
callbacks_enabled = ansible.posix.profile_tasks
|
||||||
|
|
||||||
# Avoid slow DNS lookups
|
# Avoid slow DNS lookups
|
||||||
[ssh_connection]
|
[ssh_connection]
|
||||||
|
|
|
||||||
|
|
@@ -24,13 +24,19 @@ decisions this frame enables.
|
||||||
|
|
||||||
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|
||||||
|---|---|---|---|---|---|
|
|---|---|---|---|---|---|
|
||||||
| Reverse proxy / TLS | Traefik | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
|
| Reverse proxy / TLS | Caddy (ADR-024) | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
|
||||||
| Internal DNS | `dns` role → dns1/dns2 | P | core | Authoritative internal zone (ADR-007) | Ansible-rendered zone |
|
| Internal DNS | `dns` role → dns1/dns2 | P | core | Authoritative internal zone (ADR-007) | Ansible-rendered zone |
|
||||||
| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh |
|
| Public DNS | `public_dns` role → Gandi LiveDNS | P | core | wingu.me zone as code (ADR-007) | anti-spoof baseline; mesh/LAN-only default; applied (M1) |
|
||||||
|
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
|
||||||
| Service portal / dashboard | Homepage | A | candidate | One landing page listing all services — a "what does what" front door | Gap surfaced by V4; fits boma's legibility goal |
|
| Service portal / dashboard | Homepage | A | candidate | One landing page listing all services — a "what does what" front door | Gap surfaced by V4; fits boma's legibility goal |
|
||||||
|
|
||||||
_(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_
|
_(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_
|
||||||
|
|
||||||
|
_Firewalling is two-layer (ADR-020): OPNsense at the perimeter + inter-VLAN, plus
|
||||||
|
per-host `nftables` (default-deny inbound + east-west allowlist) rendered by the `base`
|
||||||
|
role from a shared `group_vars` service catalog. The host `nftables` layer is built (the
|
||||||
|
`base` firewall concern); the OPNsense layer is still to be built._
|
||||||
|
|
||||||
## 2. Identity & access — [P]
|
## 2. Identity & access — [P]
|
||||||
|
|
||||||
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|
||||||
|
|
@@ -43,8 +49,9 @@ _(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not cont
|
||||||
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|
||||||
|---|---|---|---|---|---|
|
|---|---|---|---|---|---|
|
||||||
| Metrics | Prometheus | P | planned | Time-series metrics + alert rules | TODO 3.6 |
|
| Metrics | Prometheus | P | planned | Time-series metrics + alert rules | TODO 3.6 |
|
||||||
| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |
|
| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** |
|
||||||
| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 |
|
| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** |
|
||||||
|
| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 |
|
||||||
| Uptime checks | Uptime Kuma | P | planned | Endpoint up/down checks | TODO 3.6 |
|
| Uptime checks | Uptime Kuma | P | planned | Endpoint up/down checks | TODO 3.6 |
|
||||||
| External watchdog | askari (Hetzner VPS) | P | core | Off-site monitoring that survives a homelab outage | ADR-007 |
|
| External watchdog | askari (Hetzner VPS) | P | core | Off-site monitoring that survives a homelab outage | ADR-007 |
|
||||||
| Notify / alerting | ntfy · Matrix · email (multi-channel) | S | planned | Deliver alerts to the user across channels | TODO 9; Matrix homeserver in §8 |
|
| Notify / alerting | ntfy · Matrix · email (multi-channel) | S | planned | Deliver alerts to the user across channels | TODO 9; Matrix homeserver in §8 |
|
||||||
|
|
@@ -98,9 +105,9 @@ _(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not cont
|
||||||
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|
||||||
|---|---|---|---|---|---|
|
|---|---|---|---|---|---|
|
||||||
| Databases | Postgres/MariaDB — central *vs* per-app | P | candidate | Backing store for stateful apps | Open: central server vs per-service (TODO 3.9) |
|
| Databases | Postgres/MariaDB — central *vs* per-app | P | candidate | Backing store for stateful apps | Open: central server vs per-service (TODO 3.9) |
|
||||||
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
|
| Backup engine | restic (data-only) | S | planned | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
|
||||||
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
|
| Off-site target | pCloud (via rclone) | S | planned | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
|
||||||
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
|
| Air-gap target | USB hard drives | S | planned | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
|
||||||
|
|
||||||
## 10. Operations & support — [S]
|
## 10. Operations & support — [S]
|
||||||
|
|
||||||
|
|
@@ -109,6 +116,11 @@ _(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not cont
|
||||||
| Update watcher | DIUN | S | planned | New-image alerts driving the update process | ADR-011 |
|
| Update watcher | DIUN | S | planned | New-image alerts driving the update process | ADR-011 |
|
||||||
| Scheduled jobs | `scheduled_jobs` role + `claude -p` jobs | S | planned | Declarative cron: `/review-repo`, security/capacity reviews, sanity checks | TODO 8 |
|
| Scheduled jobs | `scheduled_jobs` role + `claude -p` jobs | S | planned | Declarative cron: `/review-repo`, security/capacity reviews, sanity checks | TODO 8 |
|
||||||
| Sanity / smoke | whoami + health checks | S | planned | Verification endpoints + "is it actually working" checks | ADR-011 / TODO 8.2 |
|
| Sanity / smoke | whoami + health checks | S | planned | Verification endpoints + "is it actually working" checks | ADR-011 / TODO 8.2 |
|
||||||
|
| Service-UI verification | `/verify-service` skill | S | planned | Claude-driven exploratory Level 4 acceptance check of a deployed service's UI | Decided (ADR-017); running deferred on ubongo + playwright + Authentik |
|
||||||
|
|
||||||
|
- **Targeted runs** (ADR-019): playbooks are sliced with `--tags` along two axes —
|
||||||
|
role/service (tag = role name) or a closed list of cross-cutting concerns
|
||||||
|
(`firewall`, `logging`, `config`, `deploy`, …); the vocabulary is lint-enforced.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@@ -136,8 +148,11 @@ AI/LLM, a game server (Minecraft), generic static-site hosting. Plausible someda
|
||||||
none are committed.
|
none are committed.
|
||||||
|
|
||||||
**Confirmed exclusions (V4 had them; boma deliberately does not).** V4 mixed in a lot
|
**Confirmed exclusions (V4 had them; boma deliberately does not).** V4 mixed in a lot
|
||||||
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, nvim/kitty/tmux,
|
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, LibreOffice,
|
||||||
LibreOffice, antivirus, remote desktop. boma is **server-only**, so these are correctly
|
antivirus, remote desktop. boma's **managed cluster/server hosts** stay server-only, so
|
||||||
absent. Likewise the removed Knowledge domain (Discourse, Snipe-IT, MRBS booking) and
|
these are correctly absent. (One scoped exception: the control / AI-worker host `ubongo`
|
||||||
V4-specific project websites — out of boma's scope by design. The narrower surface is
|
runs an interactive `dev_env` — zsh/tmux/neovim — per ADR-015; that is the developer
|
||||||
intentional, not an oversight.
|
environment of an infrastructure worker host, not a personal desktop, and does not apply
|
||||||
|
to managed service hosts.) Likewise the removed Knowledge domain (Discourse, Snipe-IT,
|
||||||
|
MRBS booking) and V4-specific project websites — out of boma's scope by design. The
|
||||||
|
narrower surface is intentional, not an oversight.
|
||||||
|
|
|
||||||
373
docs/FRICTION.md
|
|
@@ -1,13 +1,15 @@
|
||||||
# FRICTION.md — kaizen friction log
|
# FRICTION.md — kaizen friction log
|
||||||
|
|
||||||
Raw signals for the periodic **kaizen review** (the methodology retrospective; see
|
Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
|
||||||
`docs/TODO.md`). This is the input that keeps our tooling and conventions sharpening
|
the input that keeps our tooling and conventions sharpening over time instead of only
|
||||||
over time instead of only accreting.
|
accreting.
|
||||||
|
|
||||||
**How to use:** append freely _during_ work — don't curate, don't fix here. Capture
|
**How to use:** append freely _during_ work under **Open signals** — don't curate,
|
||||||
friction, surprises, fixes that keep recurring, and tooling that isn't earning its
|
don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
|
||||||
keep. The kaizen review reads this, then proposes **add / change / remove** (biased
|
that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
|
||||||
toward _remove_) and records the decisions as ADRs.
|
(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
|
||||||
|
toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
|
||||||
|
and moves consumed signals into the **decisions ledger** below.
|
||||||
|
|
||||||
**Entry format:** `date — [tag] observation — (optional) → systematization idea`
|
**Entry format:** `date — [tag] observation — (optional) → systematization idea`
|
||||||
Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
|
Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
|
||||||
|
|
@@ -16,47 +18,330 @@ earning its keep.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 2026-05-30 — initial seed (from the Claude-Code setup session)
|
## Open signals
|
||||||
|
|
||||||
- `[recurring]` Every `git commit` needs `rbw` unlocked (the pre-commit ansible-lint
|
_(append new raw signals here; the next kaizen review consumes them)_
|
||||||
hook decrypts `vault.yml` for its syntax-check). Mitigated with a 5h lock timeout
|
|
||||||
and an `rbw unlocked` pre-flight convention. → _Open:_ could ansible-lint skip vault
|
|
||||||
decryption for syntax-check, so committing doesn't need the vault at all?
|
|
||||||
- `[gotcha]` pre-commit stashes _unstaged_ changes before running hooks, so a partial
|
|
||||||
commit reverted an interdependent file (`ansible.cfg`) and failed. → Commit
|
|
||||||
interdependent changes together, or stage the config change first.
|
|
||||||
- `[gotcha]` `make new-role` had never worked on this host: `mkdir {a,b,c}` brace
|
|
||||||
expansion fails under `/bin/sh` (dash). Fixed with explicit paths. → A real run
|
|
||||||
catches what static review can't; consider smoke-testing scaffold commands.
|
|
||||||
- `[gotcha]` `rbw sync` is required after adding a Vaultwarden item before `rbw get`
|
|
||||||
finds it (stale local cache).
|
|
||||||
- `[gotcha]` This shell is zsh — unquoted `$VAR` does not word-split, so a variable
|
|
||||||
holding a file list was passed as a single argument. → Use explicit args/arrays.
|
|
||||||
- `[friction]` Long sessions: I make a batch of edits but can't commit until you
|
|
||||||
`rbw unlock`. The 5h timeout + pre-flight check address the symptom; watch whether
|
|
||||||
it still bites.
|
|
||||||
- `[gotcha]` Hooks (or any new `.claude/settings.json`) added mid-session don't
|
|
||||||
activate until a Claude Code **restart** — the settings watcher only tracks settings
|
|
||||||
files that existed at session start. Opening `/hooks` and dismissing did _not_ load
|
|
||||||
them. → Fresh sessions load them normally; restart after adding hooks.
|
|
||||||
|
|
||||||
## 2026-05-31
|
- `[friction]` **Re-asked settled defaults (push + subagent-driven) at the plan→execute handoff**
|
||||||
|
(2026-06-19): despite the standing preference (memory `dont-reask-settled-defaults`: push to
|
||||||
|
origin as off-machine backup **and** go subagent-driven, both WITHOUT asking), I again asked the
|
||||||
|
operator "which execution approach?" and "want me to push?". The `writing-plans` skill scripts
|
||||||
|
that handoff question ("Which approach?"), and confirming a push felt natural — both overrode the
|
||||||
|
memory. → at the writing-plans → execution handoff, default to subagent-driven execution and push
|
||||||
|
to origin without a confirmation gate; reserve questions for genuine forks. Recurrence of an
|
||||||
|
already-recorded signal — treat the skill's scripted "Which approach?" as pre-answered
|
||||||
|
(subagent-driven) for this operator.
|
||||||
|
|
||||||
- I asked to draft an ADR and got: No formal status-header convention, but since this is a draft for discussion I'll mark it Proposed so it isn't mistaken for an
|
<!-- The six below are from the 2026-06-17 mesh-hardening-1/3 incident: applying base's
|
||||||
accepted decision. Here's the draft.
|
nftables default-deny + wt0-only sshd to askari (the off-site Docker host that ALSO runs
|
||||||
|
the NetBird coordinator) took it down on reboot; recovery needed the Hetzner console +
|
||||||
|
a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*. -->
|
||||||
|
|
||||||
## 2026-06-01
|
- `[gotcha]` **`base`'s nftables `forward policy drop` breaks Docker hosts on reboot**
|
||||||
|
(2026-06-17): `base/templates/nftables.conf.j2` sets `chain forward { ... policy drop; }`.
|
||||||
|
On a Docker host, container traffic is *forwarded* (published-port DNAT → container, and
|
||||||
|
inter-container over the bridge), so the drop kills it. It worked right after `make
|
||||||
|
deploy` (Docker's runtime rules coexisted) but after a reboot nftables loaded our
|
||||||
|
default-deny *before* Docker, breaking WAN→Caddy and Caddy→coordinator → the public
|
||||||
|
services and the mesh went down. The `docker_host` "`nftables.d` container-forward rules"
|
||||||
|
that would make this Docker-safe are explicitly **pending** (STATUS.md). → the `base`
|
||||||
|
firewall (`base__firewall_apply`) must NOT be applied to any Docker host until
|
||||||
|
`docker_host` ships the container-forward rules; add a guard/check (a Docker host with
|
||||||
|
`firewall_apply: true` and no container-forward drop-in is a misconfiguration), and the
|
||||||
|
firewall design (ADR-020) should state the Docker-host dependency explicitly.
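
The guard suggested in this signal could be sketched as a small pre-flight check. This is only an illustration under assumptions: `base__firewall_apply` comes from the text, while `runs_docker`, `docker_host__container_forward`, and the host names are hypothetical placeholders, not variables the repo is known to define.

```python
def docker_firewall_misconfigured(hostvars):
    """Return True for a Docker host that applies base's firewall
    without the container-forward drop-in."""
    runs_docker = bool(hostvars.get("runs_docker", False))  # hypothetical flag
    applies_firewall = bool(hostvars.get("base__firewall_apply", False))
    has_forward_dropin = bool(
        hostvars.get("docker_host__container_forward", False)  # hypothetical flag
    )
    return runs_docker and applies_firewall and not has_forward_dropin


def misconfigured_hosts(inventory):
    """List, in sorted order, the hosts that fail the guard."""
    return sorted(
        name for name, hv in inventory.items() if docker_firewall_misconfigured(hv)
    )


if __name__ == "__main__":
    # Hypothetical inventory: only "docker-a" should be flagged.
    example = {
        "docker-a": {"runs_docker": True, "base__firewall_apply": True},
        "docker-b": {
            "runs_docker": True,
            "base__firewall_apply": True,
            "docker_host__container_forward": True,
        },
        "plain-c": {"base__firewall_apply": True},
    }
    print(misconfigured_hosts(example))
```

A check of this shape could run as a lint step or as an early assertion in the play, so the misconfiguration fails before any ruleset is loaded.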
|
||||||
|
|
||||||
- `[friction]` The `finishing-a-development-branch` flow (and generic AI/dev tooling)
|
- `[gotcha]` **`ip_nonlocal_bind` did NOT beat the sshd boot-race** (2026-06-17): the
|
||||||
offers "push and open a Pull Request," but our Forgejo `origin` is trunk-based with
|
mesh-hardening plan bound sshd `ListenAddress` to the `wt0` IP and set
|
||||||
no merge-request / approval gate (CLAUDE.md git conventions). That option doesn't
|
`net.ipv4.ip_nonlocal_bind=1` so sshd could bind the mesh IP before `wt0` exists at
|
||||||
apply — the real path is local fast-forward merge to `main`, then push. → Skills and
|
boot. In practice the console still showed sshd *"could not assign the address"* at boot
|
||||||
conventions that assume a GitHub-style PR workflow need a homelab-aware variant;
|
— so the protection did not work as designed, and because `wt0` never came up (the
|
||||||
encode that here "finishing a branch" means merge-locally-then-push, not open-a-PR.
|
coordinator was down), sshd had no listener at all → no SSH path. → the entire
|
||||||
|
"sshd listens on `wt0` only" premise is unsound without (a) a *verified* boot-race fix
|
||||||
|
and (b) a guaranteed non-mesh break-glass. Re-investigate why `ip_nonlocal_bind` didn't
|
||||||
|
help (ordering vs the sysctl drop-in load? the sysctl not applied before sshd start?),
|
||||||
|
or drop ListenAddress-on-mesh entirely and rely on the host firewall for SSH scoping.
|
||||||
|
|
||||||
## 2026-06-05
|
- `[gotcha]` **The coordinator host can't bootstrap the mesh it depends on** (2026-06-17):
|
||||||
|
`askari` runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird
|
||||||
|
agent needs the coordinator (a local container) to be serving to bring up `wt0` — but
|
||||||
|
the coordinator wasn't healthy, so `wt0` never came up. Circular. Combined with sshd
|
||||||
|
being `wt0`-only, the host was reachable only via the Hetzner console. → the
|
||||||
|
coordinator host must keep a **non-mesh management path always** (don't move its SSH onto
|
||||||
|
`wt0`), or the mesh-hardening must treat the coordinator host as a special case. General
|
||||||
|
rule: never make a host's only management path depend on a service that host itself
|
||||||
|
hosts.
|
||||||
|
|
||||||
- `[recurring]` The `writing-plans` skill ends by asking "subagent-driven vs inline
|
- `[gotcha]` **NetBird `netbird-server` FATAL-loops on the geolocation DB download with no
|
||||||
execution?" — always answer subagent-driven here. Don't ask; default straight to
|
egress** (2026-06-17): on startup the combined `netbird-server:0.72.4` tries to download
|
||||||
subagent-driven (fresh subagent per task + review between tasks). → Standing
|
the GeoLite2 DB from `pkgs.netbird.io` and treats failure as **FATAL** (crash-loop) — so
|
||||||
preference; skip the execution-mode prompt.
|
any loss of container egress (here: Docker NAT masquerade wiped when `nftables` was
|
||||||
|
flushed, not re-added by a plain `restart docker`) takes the whole control plane down.
|
||||||
|
Recovery was `restart docker` (rebuild NAT) → force-recreate the container so it could
|
||||||
|
download. → for the `netbird_coordinator` role: pre-seed/persist the geo DB in the data
|
||||||
|
dir (or pin a local copy), or disable the geolocation requirement, so a transient egress
|
||||||
|
blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT)
|
||||||
|
is fragile across `nft flush` + reboot.
|
||||||
|
|
||||||
|
- `[friction]` **No off-site coordinator backup turned a 2-minute restore into a long live
|
||||||
|
recovery** (2026-06-17): the NetBird coordinator's stateful store (`/var/lib/netbird`,
|
||||||
|
encrypted SQLite) has **no off-site backup yet** (ADR-022 `backup` role pending,
|
||||||
|
flagged in STATUS as the coordinator's deferred backup). During the incident there was a
|
||||||
|
real fear the unclean reboots had corrupted the store, with no restore path. It turned
|
||||||
|
out to be a runtime/egress issue, not corruption — but the absence of a backup made the
|
||||||
|
whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the
|
||||||
|
`netbird_coordinator` store ahead of the rest of the backup role; a recent off-host copy
|
||||||
|
would have made "rebuild askari from scratch" a safe option.
|
||||||
|
|
||||||
|
- `[friction]` **The plan tested reboot-recovery AFTER removing the break-glass**
|
||||||
|
(2026-06-17): the mesh-hardening plan's live cutover closed the WAN `:22` (step 5)
|
||||||
|
*before* the reboot-resilience test (step 7), so the one fallback path was gone exactly
|
||||||
|
when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for
|
||||||
|
lockout-risky cutovers: **validate reboot-recovery while the old access path is still
|
||||||
|
open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
|
||||||
|
Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
|
||||||
|
|
||||||
|
<!-- The below are from the 2026-06-18 ADR-025 build: standing up the local-VM integration
|
||||||
|
harness on ubongo and shaking it down against real KVM (spec/plan in docs/superpowers/). -->
|
||||||
|
|
||||||
|
- `[gotcha]` **Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS** (2026-06-18):
|
||||||
|
`virt-install --import` of the genericcloud qcow2 with the default (SeaBIOS) firmware
|
||||||
|
triple-faults at the real-mode kernel handoff — GRUB loops, no "Decompressing Linux", no
|
||||||
|
DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test
|
||||||
|
VMs via **UEFI** (`virt-install --boot uefi`; OVMF→efistub).
|
||||||
|
|
||||||
|
- `[friction]` **The no-sudo `claude` model blocked diagnosing a failed VM** (2026-06-18):
|
||||||
|
under ADR-015 `claude` had no sudo, so when the VM wouldn't network there was no way to
|
||||||
|
introspect it (serial logs are `root:0600`, libguestfs not installed, mounting needs
|
||||||
|
root). Diagnosis was fully blocked until the operator granted `claude` sudo. → DECISION:
|
||||||
|
`claude` gets `NOPASSWD:ALL` (reverses ADR-015's "no local sudo"); compensating control
|
||||||
|
is auditd/Loki attribution (already in ADR-015). Amend ADR-015/ADR-021 + accepted-risks;
|
||||||
|
codify the sudoers drop-in in Ansible.
|
||||||
|
|
||||||
|
- `[gotcha]` **Non-root `virsh`/`virt-install` default to `qemu:///session`** (2026-06-18):
|
||||||
|
the substrate (NAT net, /dev/kvm) lives on `qemu:///system`. → pin
|
||||||
|
`LIBVIRT_DEFAULT_URI=qemu:///system` in the driver.
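
A minimal sketch of pinning the URI for every `virsh` call a driver makes, assuming the driver shells out through `subprocess`. The helper names `libvirt_env` and `run_virsh` are hypothetical and are not taken from `scripts/integration-vm.py`.

```python
import os
import subprocess

SYSTEM_URI = "qemu:///system"


def libvirt_env(base_env=None):
    """Return a copy of the environment with the libvirt URI pinned."""
    env = dict(os.environ if base_env is None else base_env)
    env["LIBVIRT_DEFAULT_URI"] = SYSTEM_URI
    return env


def run_virsh(*args):
    """Run virsh against the system instance, even as a non-root user."""
    return subprocess.run(
        ["virsh", *args],
        env=libvirt_env(),
        check=True,
        capture_output=True,
        text=True,
    )
```

Setting the variable per call, not exporting it once in a shell profile, keeps the driver's behaviour independent of how the session was started.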
|
||||||
|
|
||||||
|
- `[gotcha]` **`qemu:///system` (libvirt-qemu) can't traverse `/home`** (2026-06-18): VM
|
||||||
|
disk/seed/console under the repo/home failed "Permission denied (search permissions for
|
||||||
|
/home/claude)". → put per-VM artifacts in a system-readable dir (`/var/lib/boma-integration`,
|
||||||
|
group libvirt); the inventory (read by ansible as the user) can stay in the repo.
|
||||||
|
|
||||||
|
- `[gotcha]` **`ansible-playbook -i <dir>/` parses sibling non-inventory files as INI**
|
||||||
|
(2026-06-18): pointing `-i` at a run-dir holding a state file + qcow2s made the directory
|
||||||
|
inventory loader parse the state file as INI → phantom hosts INCLUDING the real `askari`
|
||||||
|
(with its real vars), breaking the single-host isolation invariant. → point `-i` at the
|
||||||
|
single `hosts.yml`. Caught by the holistic cross-file review BEFORE any hardware run.
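
One way to make this hard to get wrong is to have the driver build the `-i` argument itself and refuse anything that is not a single file. The sketch below assumes the transient inventory is named `hosts.yml` inside the run directory; the function name is hypothetical.

```python
from pathlib import Path


def inventory_args(run_dir, filename="hosts.yml"):
    """Return `-i` arguments that name one inventory file, never a directory."""
    inventory = Path(run_dir) / filename
    if not inventory.is_file():
        raise FileNotFoundError(f"expected a single inventory file at {inventory}")
    return ["-i", str(inventory)]
```

Because the function never returns the directory itself, sibling files such as state files or disk images are never handed to the inventory loader.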
|
||||||
|
|
||||||
|
- `[gotcha]` **Jinja `{%- -%}` + ansible `trim_blocks=True` double-strip newlines**
|
||||||
|
(2026-06-18): a template edit used `{%- -%}`, reviewed by rendering with RAW jinja2
|
||||||
|
(trim_blocks=False) which looked fine; ansible (trim_blocks=True) then collapsed the
|
||||||
|
rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify
|
||||||
|
templates with ansible's whitespace (trim_blocks=True), not raw jinja2; prefer plain
|
||||||
|
`{% %}` at column 0 (the repo's existing style).
|
||||||
|
|
||||||
|
- `[gotcha]` **Fresh cloud images have empty apt lists** (2026-06-18): `apt install
|
||||||
|
nftables` failed "No package matching 'nftables' is available" on a fresh genericcloud
|
||||||
|
VM whose cloud-init had `package_update: false`. → `package_update: true` AND block on
|
||||||
|
`cloud-init status --wait` before applying.
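
As a rough sketch, the two halves of this fix (seed with `package_update: true`, then block on `cloud-init status --wait`) might look like the following in a driver. The function names and the SSH target are hypothetical, and a real seed would carry more keys than this one.

```python
WAIT_FOR_CLOUD_INIT = ["cloud-init", "status", "--wait"]


def cloud_config(package_update=True):
    """Render a minimal cloud-config user-data document."""
    flag = "true" if package_update else "false"
    # The first line must be the literal "#cloud-config" header.
    return "\n".join(["#cloud-config", f"package_update: {flag}"]) + "\n"


def wait_command(ssh_target):
    """Build the command that blocks until cloud-init has finished on the VM."""
    return ["ssh", ssh_target, *WAIT_FOR_CLOUD_INIT]
```

Running the wait command before the first apply means the apt lists are populated by the time any package task executes.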
|
||||||
|
|
||||||
|
- `[gotcha]` **base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is
|
||||||
|
allowed** (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway
|
||||||
|
(192.168.150.1). `ct established,related accept` saves the in-flight apply connection,
|
||||||
|
but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets
|
||||||
|
`base__firewall_control_addr` to the NAT gateway.
|
||||||
|
|
||||||
|
- `[recurring]` **Real-hardware shakedown and static review each caught what the other
|
||||||
|
couldn't** (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render
|
||||||
|
bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in
|
||||||
|
the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial
|
||||||
|
cross-file review AND a real-hardware run; neither alone would have shipped it working.
|
||||||
|
|
||||||
|
<!-- From the 2026-06-19 mesh-hardening-2/3 design (ubongo INPUT-only default-deny). -->
|
||||||
|
|
||||||
|
- `[friction]` **Raw DHCP leases pinned in ubongo's host firewall (admin-addr SSH allows)**
|
||||||
|
(2026-06-19): mesh-hardening 2/3 lets the operator workstations reach ubongo's LAN SSH by
|
||||||
|
*raw lease* — `base__firewall_admin_addrs: ["10.20.10.50" (mamba), "10.20.10.17"]` — because
|
||||||
|
there is no DHCP reservation yet (OPNsense isn't managed as code). A lease reassignment
|
||||||
|
silently moves the allow to whatever host next holds the IP (still SSH-key-gated) and drops
|
||||||
|
the workstation's *LAN* path (mesh still works, so never a full lockout). → when
|
||||||
|
OPNsense-as-code lands (ADR-020 perimeter / TODO 3.5), replace both with **MAC-pinned DHCP
|
||||||
|
reservations** (`10.20.10.17` = MAC `bc:0f:f3:c8:4a:8a`; mamba's MAC TBD) and allow the
|
||||||
|
reserved IPs. Spec: `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`.
|
||||||
|
|
||||||
|
- `[gotcha]` **`make test-integration` on ubongo fails (`qemu-img` "Permission denied") when
|
||||||
|
the agent session predates the `libvirt` group grant** (2026-06-19): the `integration_test`
|
||||||
|
role adds `claude` to `libvirt`+`kvm` and makes the cache dir `/var/lib/boma-integration`
|
||||||
|
`root:libvirt 2775` — correct — but a `claude` session whose shell started *before* that
|
||||||
|
grant carries a stale process group set (`id` → `claude,docker` only, no `libvirt`), so
|
||||||
|
`qemu-img create` of the VM overlay into the group-owned dir is denied. `virsh`/`virt-install`
|
||||||
|
still work (they reach system libvirtd via polkit/socket, and the real KVM runs server-side
|
||||||
|
as `libvirt-qemu`), so ONLY claude's own file-writes break. Unblock without restarting the
|
||||||
|
session: **`sg libvirt -c 'make test-integration HOST=<name>'`** (claude needs only `libvirt`
|
||||||
|
for the dir; `kvm` is server-side; note `sg` adds one group, not the full set). → self-heal
|
||||||
|
in `scripts/integration-vm.py`: if the `libvirt` gid is absent from `os.getgroups()`, re-exec
|
||||||
|
under `sg libvirt` (or have the Makefile target do it), so a stale-session agent never hits
|
||||||
|
this opaque symptom. New agent sessions pick the groups up on login, so it's a stale-session
|
||||||
|
transient — but high-confusion, worth self-healing.
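
The self-heal proposed here could be sketched roughly as follows. This is an assumption-laden illustration, not the driver's actual code: the function names and the `BOMA_SG_REEXEC` loop-guard variable are hypothetical, and it assumes `sg` passes the environment through to the command it runs.

```python
import grp
import os
import shlex
import sys

REEXEC_MARKER = "BOMA_SG_REEXEC"  # hypothetical env var; prevents a re-exec loop


def has_group(gid, supplementary_gids, effective_gid):
    """True when the gid is in the process group set or is the effective gid."""
    return gid in supplementary_gids or gid == effective_gid


def sg_argv(group, argv):
    """Build the argv that re-runs `argv` as `sg <group> -c '<command>'`."""
    return ["sg", group, "-c", shlex.join(argv)]


def ensure_group(group="libvirt"):
    """Re-exec under `sg` when a stale session lacks the group."""
    if os.environ.get(REEXEC_MARKER):
        return  # already re-executed once; do not loop
    try:
        gid = grp.getgrnam(group).gr_gid
    except KeyError:
        return  # the group does not exist on this host; nothing to heal
    if has_group(gid, os.getgroups(), os.getegid()):
        return
    os.environ[REEXEC_MARKER] = "1"
    cmd = sg_argv(group, [sys.executable, *sys.argv])
    os.execvp(cmd[0], cmd)  # replaces the current process; does not return
```

Calling `ensure_group()` at the top of the driver's entry point would make the stale-session case transparent; a fresh session already has the group and falls straight through.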
|
||||||
|
|
||||||
|
- `[friction]` **No standard for when the agent may run local-VM integration tests on ubongo
|
||||||
|
without asking** (2026-06-19): `make test-integration HOST=<name>` spins an ISOLATED throwaway
|
||||||
|
KVM VM (its own libvirt NAT; never touches the real host's firewall/network; guards:
|
||||||
|
one-VM-at-a-time + a 4 GiB free-RAM floor + auto-destroy on success), so it is safe and
|
||||||
|
self-contained — yet the agent paused for a go-ahead before running it (mesh-hardening 2/3,
|
||||||
|
Task 4). The operator wants a STANDARD that pre-authorises VM-testing on ubongo so the agent
|
||||||
|
just runs it. → decide + record the rule: e.g. a `.claude/settings.json` permission allow for
|
||||||
|
`make test-integration*` / `scripts/integration-vm.py` (and the `sg libvirt -c '…'` form per
|
||||||
|
the gotcha above), plus a CLAUDE.md line distinguishing the pre-authorised isolated VM tests
|
||||||
|
from the genuinely-gated live steps (`make deploy` to real hosts, host reboots, cutovers —
|
||||||
|
still need a go-ahead). Ties to the `test-risky-infra-before-live-deploy` +
|
||||||
|
`dont-reask-settled-defaults` memories + ADR-025.
|
||||||
|
|
||||||
|
- `[gotcha]` **Molecule covers only the `input_only`-OFF (forward drop) branch of the base
firewall** (2026-06-19): mesh-hardening 2/3 added `base__firewall_input_only` (forward policy
drop↔accept). The `default` Molecule scenario renders ONE fixture, set to the secure default
(drop) — so the fast `make test ROLE=base` gate locks the drop default (security-critical for
service hosts) but does NOT exercise the `=true` → forward-`accept` rendering; only
`make test-integration HOST=ubongo` does (passed GREEN). An in-converge re-render can't cheaply
cover it (role defaults aren't in scope outside the role run). → decide in kaizen: a second
Molecule scenario (`molecule/input-only/`) asserting forward `policy accept`, vs accepting the
integration-only coverage. Final-review finding; not a cutover blocker (the accept branch is a
literal, and a var-name break would fail the drop branch too → caught).

- `[gotcha]` **Applying base's firewall to a Docker host flushes Docker's nat → container
egress dies until `restart docker`** (2026-06-19, mesh-hardening 2/3 live cutover): base's
`nftables.conf.j2` starts with `flush ruleset`, which wipes ALL tables incl. Docker's
`ip nat`/`ip filter` (+ libvirt's). On ubongo I chose INPUT-only so `forward` stays `accept`
— yet the apply STILL broke CONTAINER egress: `docker pull` worked (dockerd uses HOST egress)
but a container `ping` FAILED — the masquerade (SNAT) was gone, so replies couldn't return.
`forward accept` permits forwarding but can't replace the missing nat. The spec's "input-only
keeps Docker egress working" was therefore **incomplete**, and the local-VM harness couldn't
catch it (the test VM runs no Docker). Fix on the live host: `systemctl restart docker`
re-adds its `ip nat`/`ip filter` (egress restored; coexists fine with base's `inet filter`).
On REBOOT it self-heals (dockerd re-adds nat on boot; `forward accept` doesn't block — unlike
the 2026-06-17 `forward drop` incident). → (1) any cutover/runbook applying base firewall to a
Docker host MUST `restart docker` + check container egress after the apply; (2) the pending
`docker_host` nftables integration should own re-adding/persisting Docker's rules so base's
`flush` is safe; (3) the firewall final-review checklist should include "does the host run
Docker/libvirt? the flush wipes their nat."

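The post-apply check in the Docker entry above could be automated along these lines. This is a hypothetical sketch, not part of the repo: the function names are invented, and the expected table set is taken only from the entry's own description of Docker's `ip nat`/`ip filter` tables. It parses the `table <family> <name>` lines that `nft list tables` prints and reports which expected tables are gone.

```python
"""Hypothetical post-apply check: did the flush remove Docker's tables?"""

# Tables the entry above attributes to Docker; an assumption for any other setup.
EXPECTED_DOCKER_TABLES = {("ip", "nat"), ("ip", "filter")}


def parse_tables(nft_output: str) -> set[tuple[str, str]]:
    """Parse `nft list tables` output lines of the form 'table <family> <name>'."""
    tables = set()
    for line in nft_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "table":
            tables.add((parts[1], parts[2]))
    return tables


def missing_docker_tables(nft_output: str) -> set[tuple[str, str]]:
    """Return the expected Docker tables that are absent from the live ruleset."""
    return EXPECTED_DOCKER_TABLES - parse_tables(nft_output)
```

A runbook step could feed it the captured output of `nft list tables` and treat a non-empty result as the cue to run `systemctl restart docker` and then re-test container egress.
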
<!-- From the 2026-06-19 mesh-hardening 3/3 (askari INPUT-only integration gate). -->

- `[gotcha]` **`inet filter` default-deny blocks libvirt dnsmasq DHCP — silent, hard to diagnose**
(2026-06-19, task-3 integration gate): when `base__firewall_input_only: true` is applied to
ubongo, the `table inet filter { chain input { policy drop; } }` blocks DHCP packets that arrive
via the libvirt bridge (`virbr-boma`). In nftables, multiple tables at the same hook priority all
run independently; an `accept` verdict in `table ip filter LIBVIRT_INP` does NOT prevent
`table inet filter` from seeing and dropping the same packet. VMs never got DHCP leases (dnsmasq
socket confirmed by strace to never receive POLLIN despite tcpdump seeing the packet on
`virbr-boma`). Diagnosed by temporarily changing `inet filter input` to `policy accept` → fd=3
immediately fired. Fix: `/etc/nftables.d/10-libvirt-boma.nft` drop-in adding
`iifname "virbr-boma" accept` (survives service restarts via `include "/etc/nftables.d/*.nft"`).
→ The `base` role's template needs a `base__firewall_trusted_bridges` variable so this is
encoded at the Ansible level, not in a manual host drop-in. Every host that runs Docker or
libvirt and also has `base__firewall_input_only: true` needs an analogous exception.

- `[gotcha]` **libvirt `leaseshelper` PID-file permission: `virPidFileReleasePath` unlinks
`/run/leaseshelper.pid` after EVERY call; `nobody` cannot recreate it** (2026-06-19, task-3
integration gate): dnsmasq runs as `nobody`; `libvirt_leaseshelper` is its `--dhcp-script`. The
helper acquires a PID-file mutex at `/run/leaseshelper.pid`, but `virPidFileReleasePath`
UNLINKS the file on exit. `/run/` is `root:root 755`, so `nobody` cannot create the file after the
first unlink → every subsequent `add` call fails with `errno=13` (`EACCES`), and dnsmasq silently
drops the DHCP grant (no log, no error to the client). Fix: suid root C wrapper at
`/usr/lib/libvirt/libvirt_leaseshelper` (original moved to `.real`) that pre-creates
`/run/leaseshelper.pid` owned by `nobody`, then drops privileges and execs the real helper. The
root dnsmasq fork calls the wrapper; suid gives it permission to touch `/run/`; on return to
the `nobody` uid the PID file stays. Also: `/var/lib/libvirt/dnsmasq/` must be `nobody:nogroup 775`
so leaseshelper can update `virbr-boma.status`. This fix is host-local on ubongo and NOT in
Ansible — encode it in an `integration_test` role task (or a libvirt role) before the harness
can be safely re-deployed.

- `[gotcha]` **cloud-init rejects underscores in `local-hostname` → silently skips
network-config → VM never gets DHCP** (2026-06-19, task-3 integration gate): setting
`local-hostname: boma-it-askari_inputonly-<uuid>` caused cloud-init-local to consider the
hostname invalid and skip writing the network-config to the system. systemd-networkd then
used the genericcloud default (no DHCP), so VMs got only IPv6 link-local. Fix in
`scripts/integration-vm.py`: `name.replace("_", "-")` in the meta-data hostname (disk paths
and virsh domain names keep the original underscore). Sanitization rule: RFC-952 hostnames
allow hyphens, not underscores.

- `[friction]` **Molecule Docker image can't `apt install` → roles with real package tasks
have no Molecule substrate coverage** (2026-06-19): the Docker Molecule image ships with
cleared apt-lists and no internet access, so any role whose core work is `apt install` —
`base`, `docker_host`, `integration_test` — cannot cover its package/substrate tasks in
Molecule. Those tasks are validated only by `make test-integration` (ADR-025, real KVM).
The gap is systemic: it affects every role with non-trivial package or system-level setup.
→ systematization idea: provide a Molecule image or driver that can install packages (e.g.
a custom Docker image with pre-seeded apt-lists, or a `prepare.yml` that pre-installs
packages from a local cache), or an alternative driver (e.g. `molecule-libvirt` using the
same KVM harness), so substrate tasks get real Molecule unit coverage rather than relying
entirely on the integration harness.

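The `inet filter` default-deny entry above turns on one rule of nftables evaluation: an `accept` only ends evaluation in the chain that issued it, while a `drop` in any base chain on the same hook is final. The toy model below illustrates that behaviour under simplifying assumptions. It is not nftables code; the names `accept_iif`, `chain_verdict` and `hook_verdict` are invented for the illustration, and priorities are reduced to list order.

```python
"""Toy model: several base chains on one hook each judge the packet independently."""

from typing import Callable, Optional

Packet = dict
# A rule returns "accept", "drop", or None when it does not match the packet.
Rule = Callable[[Packet], Optional[str]]


def accept_iif(iface: str) -> Rule:
    """Build a rule in the spirit of `iifname "<iface>" accept`."""
    return lambda pkt: "accept" if pkt.get("iifname") == iface else None


def chain_verdict(rules: list[Rule], policy: str, pkt: Packet) -> str:
    """Inside one chain the first matching rule wins; otherwise the policy applies."""
    for rule in rules:
        verdict = rule(pkt)
        if verdict is not None:
            return verdict
    return policy


def hook_verdict(chains: list[tuple[list[Rule], str]], pkt: Packet) -> str:
    """The packet passes the hook only if every base chain accepts it."""
    for rules, policy in chains:
        if chain_verdict(rules, policy, pkt) == "drop":
            return "drop"  # a drop is final; later chains never see the packet
    return "accept"


dhcp_request = {"iifname": "virbr-boma", "dport": 67}
libvirt_chain = ([accept_iif("virbr-boma")], "accept")   # stands in for `table ip filter`
base_input = ([], "drop")                                 # `inet filter input`, policy drop
base_input_fixed = ([accept_iif("virbr-boma")], "drop")   # with the drop-in rule added

print(hook_verdict([libvirt_chain, base_input], dhcp_request))        # drop
print(hook_verdict([libvirt_chain, base_input_fixed], dhcp_request))  # accept
```
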
---

## Kaizen reviews — decisions ledger

Consumed signals and where their resolution now lives. Newest first.

### 2026-06-17

Second `/kaizen` run. 7 signals triaged; all 7 consumed (0 kept open). Two heavier items
(the `rename-incomplete` scan check and the Forgejo registry-login path) were built by
parallel subagents and verified against the diff. **Bias-to-remove note:** one PARK
(the ubongo self-management gap — out-of-phase, already tracked in STATUS) and zero
REMOVE; the rest accreted (migrate/change). None of the open signals were `[unused]`
*tooling*, so there was nothing to delete — the only reductive move available was parking
the out-of-phase build. **Cadence:** healthy — 3 days after the first run, every signal
0–2 days old except the one carried over from 2026-06-14; the "recurring ≥3" nudge in
`scripts/friction-scan.py` didn't fire this pass (all recurrence counts were 1), so the
thresholds need no change.

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| ADRs claim cross-doc reconciliation they didn't perform (06-14) | SYSTEMATIZE | New `rename-incomplete` check in `scripts/repo-scan.py` (+7 tests): when a numbered ADR announces a rename `Old`→`New`, flag any design-doc line where `Old` still appears in present tense (skips the announcing ADR, lines also naming `New`, and historical/negation cues; rejects `ADR-NNN` tokens as terms). 0 findings on the current tree — the Traefik→Caddy ripple edits have landed. Structural cousin of `stale-deferred`; run by `/review-repo`. (Was KEEP-OPEN on 2026-06-14 — now built.) |
| Image push to the Forgejo registry needs an interactive `docker login` (06-15) | SYSTEMATIZE → vault | Vault-backed login path so pushes are agent-completable: `vault.forgejo.registry_token` stub (CHANGEME, operator-minted) + `scripts/registry-login.sh` (reads the token, `docker login --password-stdin`, never echoes it) + `make registry-login` + a prereq note in `docs/runbooks/claude-code-setup.md`. Works once the operator fills the token via `make edit-vault`. |
| Single-file bind mount + atomic rewrite = stale config (06-16) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Single-file bind mount + atomic rewrite = stale config (reload-in-place only)": `template` writes a new inode, a single-file bind mount pins the old one, so an in-container reload reads stale config. Mount the config *directory* for reload-in-place roles; restart-based roles are fine with a single-file mount. |
| `make check` always fails on the first-ever deploy of a compose service role (06-16) | CHANGE | `check_mode: false` on the `state: directory` scaffold tasks in `roles/reverse_proxy` + `roles/netbird_coordinator`, so the base dirs exist under `--check` and the rest of the dry-run (templates + compose) evaluates instead of failing on a missing `project_src`. Inert under converge → Molecule unchanged. |
| Re-asked settled defaults — push + execution mode, in prose (06-17) | CHANGE (exec) + ACCEPTED (push) | Widened `.claude/hooks/guard-execution-mode-menu.sh` to also catch free-form *prose* re-asks of the subagent-vs-inline choice (`"which execution approach?"`, `"subagent vs inline"`, …), not just the literal menu; tested. The push re-ask stays a soft default via the `dont-reask-settled-defaults` memory — a genuine "should I push?" is sometimes legitimate, so it is deliberately not hard-blocked. |
| Docs-only commit tripped the rbw-locked pre-commit guard (06-17) | CHANGE | Root cause was NOT the ansible-lint `files:` scope (innocent) — it was `.claude/hooks/guard-vault-preflight.sh` blocking *every* locked `git commit`. Rewrote it to inspect the staged set (`git diff --cached`, plus `-a`/`--all`) and block only when Ansible content (`^(roles\|playbooks\|inventories)/.*\.ya?ml$`) is staged; docs-/config-only commits are now exempt. Fail-safe to block when unsure. Tested. |
| Agent can't self-manage `ubongo` (the control node it runs on) without operator grants (06-17) | PARK | The knowledge already lives in `STATUS.md` (control-node row: the interim `claude`-key + `sjat` NOPASSWD grants, and **Pending:** the proper `ansible`-user bootstrap) and the `ubongo-self-sufficiency` memory. Out-of-phase — the fix is the control-node bootstrap recipe, a tracked future build. **Resurrection trigger:** when building ubongo's `base` hardening / `ansible`-user bootstrap, fold in key-trusted NOPASSWD self-management so control-node self-management needs no ad-hoc operator grants. |

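The staged-set rule in the vault-preflight row above can be illustrated in a few lines. The real guard is a shell hook (`.claude/hooks/guard-vault-preflight.sh`) whose contents are not shown here, so the following is only a Python rendering of the stated decision, using the path pattern quoted in the table. The function name `needs_vault` is hypothetical, and the "fail-safe to block when unsure" behaviour is deliberately left out.

```python
"""Hypothetical Python rendering of the staged-path decision described above."""

import re

# Path pattern quoted in the ledger row (without the Markdown table escapes).
ANSIBLE_CONTENT = re.compile(r"^(roles|playbooks|inventories)/.*\.ya?ml$")


def needs_vault(staged_paths: list[str]) -> bool:
    """True when any staged path is Ansible content, i.e. the commit needs an unlocked vault."""
    return any(ANSIBLE_CONTENT.match(path) is not None for path in staged_paths)
```

In use, the list would come from something like `git diff --cached --name-only`; a docs-only list returns `False`, so the commit proceeds without an `rbw` unlock.
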
### 2026-06-14

First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
which migrate or archive (knowledge is never deleted).

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → **RESOLVED 2026-06-15** | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral **closed** — root cause was **version skew** (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |

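The `item.values` row in the table above comes down to lookup order: Jinja2's dot syntax prefers an attribute of the object over a key of the same name, and every dict has a `values` method. The snippet below is a simplified emulation of that order in plain Python, not Jinja2's actual implementation; `dot_lookup` is a hypothetical helper written only for this illustration.

```python
"""Simplified emulation of dot-lookup order: attribute first, then key."""


def dot_lookup(obj, name):
    """Mimic `obj.name` in a template: try the attribute, fall back to the item."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


item = {"name": "web", "values": [1, 2]}

print(dot_lookup(item, "name"))              # key lookup works: no attribute called "name"
print(callable(dot_lookup(item, "values")))  # the dict method shadows the "values" key
print(item["values"])                        # subscript access always returns the key
```

This is why the convention recorded in the table indexes loop variables as `item['key']`: subscript access never collides with a method name.
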
### 2026-06-10

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a *mechanical* guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was **not** the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so **docs-/config-only commits skip it and need no vault**. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |

**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
built). The 2026-06-14 block was the **first run of `/kaizen`** itself
(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
cleared the backlog and validated the command.

@ -6,6 +6,15 @@ Project documentation.
Numbered from 001; each records context, the decision, and what was ruled out.
- `runbooks/` — step-by-step operational procedures (add a host, add a role, rotate
secrets).
- `security/` — security baseline, accepted-risk register, per-service checklist +
template (ADR-002/004).
- `testing/` — testing methodology artifacts + the `VERIFY.md` template (ADR-008/017).
- `access/` — operational-access doctrine + the `ACCESS.md` template (ADR-021).
- `backup/` — backup doctrine + the `BACKUP.md` template (ADR-022).
- `hardware/` — capacity reference + `/capacity-review` output (ADR-012).
- `reviews/` — `/review-repo` audit trail.
- `CAPABILITIES.md` / `ROADMAP.md` / `TODO.md` / `FRICTION.md` — what boma does, the
build order, the backlog, and recurring-friction notes.

For what is actually **built vs only designed**, see `STATUS.md` at the repo root —
the ADRs describe intent, not necessarily current reality.

227
docs/ROADMAP.md
Normal file
@ -0,0 +1,227 @@
# ROADMAP — boma build order

High-level **build order** for the project. Almost everything in `docs/decisions/`
(the ADRs) is *designed, not built* — this file sequences that backlog into milestones
and records *why* the order is what it is.

- **What is built vs planned:** `STATUS.md` (ground truth — always check there first).
- **The backlog of decisions:** `docs/TODO.md` (this roadmap sequences it).
- **The design rationale:** `docs/decisions/` (ADRs).

This is a **living document**: update it as milestones land (move them to `STATUS.md`),
as ordering changes, or as new milestones appear. Each milestone gets its own
spec → plan → implementation cycle (`docs/superpowers/specs/` then `…/plans/`) when it
comes up; this file stays high-level.

_Last updated: 2026-06-19._

---

## Strategy — "remote-access first" (Approach A)

One focused track now (**Off-site / Remote-access**), a **procurement gate**, then the
**Cluster** track. Cross-cutting/ongoing work runs underneath both.

**Why this order.** The only physical machine that exists today is `ubongo` (the control
node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal
— reach `ubongo` from `mamba` / a work laptop while on the move — needs only things
already available or cheap to spin up (`askari` at Hetzner, the laptops). Doing the
remote-access track first:

1. **delivers the mobile-access goal in the first phase**, and
2. **doubles as the proving ground** for boma's core machinery — the first real *service
role* (NetBird), the `base` role on a *real, internet-facing* host, the `offsite_hosts`
pattern, public DNS + ACME, the backup contract, and `rbw`/vault in anger — all on two
cheap, low-stakes hosts **before** spending on the cluster.

Cluster hardware is then procured *after* those patterns are proven and a
`/capacity-review` informs the sizing — so the spend happens once, with knowledge.

Rejected alternatives: **B — procure now, build strictly bottom-up** (mobile access lands
late; spend precedes any proven pattern). **C — two parallel tracks** (for a solo operator
this collapses into interleaving with extra context-switching cost).

---

## Phase 1 — Off-site / Remote-access — ✅ COMPLETE (2026-06-17)

Delivers mobile access to `ubongo`; proves the machinery. Ordered by *real* dependencies.
All milestones (M1–M5) done; the mobile-access goal is met. Next: the Procurement gate.

### M1 · boma's DNS home — a new domain at Gandi, managed as code

Register a **new Swahili-themed domain at Gandi** for boma and manage its records **as
code (IaC)**. Greenfield, not a migration: investigating the existing domains ruled them
out as boma's home — `baobab.band` is the **live legacy homelab** (Cloudflare; vaultwarden
/ nextcloud / matrix in daily use), and `ziethen.dk` is the **family's primary email**
(Fastmail); moving either's authoritative DNS risks breaking production. A fresh domain is
zero-risk and *born at Gandi*.

- **Driver:** values/sovereignty (Gandi) + a clean, decoupled home so boma builds without
endangering anything live. `baobab.band`'s Cloudflare exit / V4 decommission is a
**separate, later track**, not part of this build. `ziethen.dk` is untouched.
- **IaC approach:** follow boma's grain — internal DNS is already Ansible-rendered and
Terraform owns *no* DNS (CLAUDE.md), so **public DNS is Ansible-managed too** (Gandi
LiveDNS via an Ansible module — exact module pinned in M1's spec, verified per ADR-014).
- **Naming scheme (decided):** three tiers (on boma's new domain, `<boma-domain>`) —
`<host>.boma.<boma-domain>` (infra, internal-only) · `<service>.<boma-domain>`
(home/cluster services, split-horizon) · `<service>.askari.<boma-domain>` (off-site/VPS,
public). **`nyumbani` dropped.** Home services are **mesh/LAN-only by default** (no
public record; reached over LAN or the NetBird mesh), with public Gandi records only for
deliberate exceptions. The NetBird mesh carries the `<boma-domain>` match-domain to
road-warriors (resolver = dns1/dns2 over `wt0`); a `*.<boma-domain>` ACME **DNS-01**
wildcard cert (Gandi API) gives even unexposed services real TLS. Resolves TODO 4 and
review finding O12.
- **Records as a new/updated ADR:** amends ADR-007 — boma's public zone is
`<boma-domain>` at Gandi LiveDNS managed as code; the three-tier naming scheme;
`nyumbani` removed; mesh/LAN-only default; `baobab.band` (legacy, Cloudflare) is out of
scope.
- **Maps to:** ADR-007 (network/DNS), ADR-016 (mesh DNS), TODO 4 (**resolved here**).

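The three-tier naming scheme in M1 can be expressed as a small helper. This is an illustrative sketch only: the tier labels (`infra`, `service`, `offsite`) and the function name `fqdn` are invented here, and `example.org` stands in for the `<boma-domain>` placeholder.

```python
"""Hypothetical helper that builds names for the three tiers described in M1."""

# Tier label -> name template; the labels are assumptions made for this sketch.
TIER_TEMPLATES = {
    "infra": "{name}.boma.{domain}",      # <host>.boma.<boma-domain>, internal-only
    "service": "{name}.{domain}",         # <service>.<boma-domain>, split-horizon
    "offsite": "{name}.askari.{domain}",  # <service>.askari.<boma-domain>, public
}


def fqdn(tier: str, name: str, domain: str) -> str:
    """Return the fully qualified name for a host or service in the given tier."""
    try:
        template = TIER_TEMPLATES[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None
    return template.format(name=name, domain=domain)


print(fqdn("infra", "ubongo", "example.org"))     # ubongo.boma.example.org
print(fqdn("offsite", "netbird", "example.org"))  # netbird.askari.example.org
```

Keeping the templates in one table means an inventory-driven DNS render and a certificate request would derive names from the same place.
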
### M2 · `askari` provisioned + under Ansible

Provision the Hetzner VPS **as IaC with Terraform** (Helsinki / Debian 13, behind a
TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap it.
**Shipped as cx23/x86** (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec
x86, cheaper). Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.

- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
module, `offsite` stack). Token via `TF_VAR_hcloud_token` from `vault.hetzner.token`.
- **Proves:** the `offsite_hosts` pattern, the TF→Ansible handoff for a non-Proxmox host
(`tf_to_inventory.py` extended), bootstrap of a non-cluster host. Closes review finding
O6 (`offsite_hosts` missing from `hosts.yml`).
- **Amends:** ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud
Firewall = perimeter), ADR-007/016 (`askari` TF-provisioned, not "added manually").

### M3 · `base` matured to a "remote-access-sufficient" subset — ✅ DONE

Added the `hardening` concern to `base` (sshd drop-in key-only + `PermitRootLogin no`;
fail2ban sshd jail 5/1h; ADR-002) and **applied it to askari** by tag
(`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`) — SSH still works, fail2ban
active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO 15).

- **NetBird agent → M4** (deferred from M3: it enrolls against the coordinator, which
doesn't exist until M4 — ADR-016's coordinator-first bootstrap order).
- **Host firewall on askari + ubongo hardening → M5** (applying default-deny pre-mesh
would lock out SSH; the Hetzner Cloud Firewall is askari's perimeter until then).
- **Spec/plan:** `docs/superpowers/{specs,plans}/2026-06-14-base-ssh-fail2ban-m3*`.
- **Maps to:** ADR-002 (security baseline), ADR-020 (firewall — built, not yet applied),
TODO 15 (the rest of hardening → Phase 2).

### M4 · NetBird control plane on `askari` — first real service role

Built in two phases. **M4a (platform) — ✅ DONE:** Docker on askari + boma's standard
**Caddy** reverse proxy (ADR-024), proven by `https://test.askari.wingu.me` serving a
valid Let's Encrypt cert (HTTP-01; the Gandi **DNS-01** path is now built + proven —
2026-06-15, see ADR-024 — for mesh/LAN-only cluster services).
Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` /
`…2026-06-14-m4a-docker-caddy.md` / `…2026-06-14-m4b-netbird.md`.

**M4b — ✅ DONE (2026-06-16):** the `netbird_coordinator` service role, deployed to askari.
Reality differed from the original plan (captured fresh per ADR-014): NetBird **v0.72.4**
ships a **single combined `netbird-server`** container (management + signal + relay + STUN +
**embedded Dex** IdP at `/oauth2`) plus `dashboard:v2.39.0` — **no separate signal/relay
container and no Coturn**. Fronted by the M4a Caddy via gRPC-h2c + WebSocket + path routing.
Dashboard live at `https://netbird.askari.wingu.me` (valid LE cert); `/api` auth-gated.
**M5 (enroll peers) is next** — incl. the first-boot `/setup` admin + setup keys.

- **First exercise of:** the service-role conventions (`SECURITY.md` / `VERIFY.md` /
`ACCESS.md` / `BACKUP.md`), public **TLS / ACME**, and the **backup contract** —
NetBird's management datastore is *stateful*, so it gets encrypted off-host backup
(ADR-016 §recovery, ADR-022).
- **Open design choice (decide in M4's spec):** a minimal ACME-terminating reverse proxy
(e.g. Caddy) just for NetBird on `askari`, vs leaning on NetBird's bundled setup.
- **Maps to:** ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access),
ADR-022 (backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).

### M5 · Enroll peers → goal reached — ✅ DONE (2026-06-17)

The `base` `mesh` concern enrolled **`ubongo` (`100.99.146.14`) + `askari`
(`100.99.226.39`)** as NetBird peers — both Management+Signal Connected, the ubongo↔askari
mesh link ping-verified. NetBird ships a default **Allow-All** peer policy, so any enrolled
peer reaches `ubongo` over `wt0`. The road-warrior clients (**`mamba` + the work laptop**)
are enrolled (operator, via `docs/runbooks/netbird-client.md`) → **`ubongo` is reachable
from anywhere. ← the mobile-access goal is met; Phase 1 is complete.**

- **Deferred to a "mesh-hardening" follow-on** (was folded into M5; split out as the
lockout-risky part): apply `base` nftables **default-deny** to `ubongo` + set
`base__firewall_control_addr` (ADR-021 `ssh-from-control`, built/dormant); tighten the
NetBird ACL off Allow-All to scoped policies; move `askari`'s SSH onto `wt0` (retiring
the Hetzner-firewall WAN allow). Safe to do now that the `wt0` path exists.
- **Maps to:** ADR-016, ADR-021 (SSH ladder: `wt0` + ssh-from-control), ADR-020.

---

## Gate — Procurement decision

Run `/capacity-review` (intent-based) to size the cluster, **then procure the Proxmox
hardware**. Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access)
has by now been rehearsed on two cheap hosts, so the spend happens once and informed.

- **Maps to:** ADR-012 (hardware & capacity), `/capacity-review`.

---

## Phase 2 — Cluster (gated on procurement; coarse until M5 is near)

Canonical dependency order:

1. **Terraform provisioning** — `terraform init`/apply the Proxmox VM module; regenerate
inventory via `make tf-inventory` (ADR-006, ADR-009).
2. **`base` full** — CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the
VM disk layout for CIS L2 is decided **before** provisioning (ADR-002, TODO 15).
3. **`docker_host`** — real Docker engine + Compose, daemon hardening, `nftables.d`
container rules (currently a scaffold; ADR-004, ADR-020).
4. **`dns` role** — render the internal zone from inventory (ADR-007).
5. **Auth + reverse proxy** — Authentik + **Caddy** (ADR-024): the foundation every
service sits behind with authentication (ADR-002).
6. **Monitoring** — Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters +
Uptime Kuma; decide which alerts live where (TODO 3.6).
7. **Service roles** — PhotoPrism, email, indexers, … (`docs/CAPABILITIES.md`); each
clears `docs/security/service-checklist.md` and carries its standard files.
8. **`backup` role + `fisi` pull node** — restic Model A, pCloud + USB air-gap (ADR-022).
9. **Forgejo Actions CI** — runner + workflows (ADR-003/010, TODO 1).

---

## Underneath both — Cross-cutting / ongoing

- **Accept ADR-011** (update management) — resolve its 6 open questions before the first
  scheduled patch run (TODO 16).
- **Kaizen `/retro`** + keep appending to `docs/FRICTION.md` (TODO 11); **`/security-review`**
  skill (TODO 8.5); **`/review-repo` fortnightly cron** + headless email (TODO 8.1);
  `scheduled_jobs` role (TODO 8.3).
- **User-notification function** — ntfy / matrix / email so tools + AI can reach the
  operator (TODO 9; ties to ADR-011 control channel).

### Parked decisions — decide when they bite, not before

- Split-horizon FQDN with or without `nyumbani` (TODO 4) — likely settled in M1.
- Central database server vs per-app databases (TODO 3.9) — at the service phase.
- Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
- Keep the custom Molecule base-image method as testing matures (TODO 3.10).

---

## Next step

**Phase 1 complete (M1–M5); mesh-hardening: ubongo (2/3) DONE 2026-06-19, askari redesign DONE 2026-06-20.**
Both hosts now run INPUT-only nftables default-deny (`base__firewall_input_only`), live reboot-validated.
askari's redesign (spec/plan `docs/superpowers/{specs,plans}/2026-06-19-mesh-hardening-askari-redesign*`)
applied INPUT-only default-deny + `wt0`-primary SSH + a permanent WAN break-glass + a geo-disabled
coordinator; a real reboot recovered unattended. Remaining mesh-hardening sub-projects:

1. ~~`ubongo` nftables default-deny + `ssh-from-control`~~ → **DONE (2026-06-19).**
2. ~~**redesign** `askari`'s SSH → `wt0`~~ → **DONE (2026-06-20)** — boot-race, coordinator-bootstrap
   chicken-egg, and Docker-nat-flush all resolved + live reboot-validated.
3. ~~**askari relay-SPOF reduction**~~ → **DONE (2026-06-20)** — assessed + **accepted** as a
   documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
   narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
   second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
   DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
   BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).

**Then** the Procurement gate (`/capacity-review` → buy Proxmox hardware) opens Phase 2.

119
docs/TODO.md
|
|
@ -1,47 +1,53 @@
|
||||||
# ToDo
|
# ToDo
|
||||||
|
|
||||||
|
> **Build order lives in `docs/ROADMAP.md`** — that sequences this backlog into
|
||||||
|
> milestones. This file is the decision backlog; the roadmap is the order we build them.
|
||||||
|
>
|
||||||
|
> **Open items only.** Item numbers are stable cross-references (cited by ROADMAP,
|
||||||
|
> STATUS, ADRs, scripts) — **never renumber**. When an item is decided or built, collapse
|
||||||
|
> it to a one-line pointer in place; the full record lives in its ADR / `STATUS.md` / the
|
||||||
|
> `FRICTION.md` decisions ledger.
|
||||||
|
|
||||||
1. **Forgejo CI** — what CI work remains after ADR-010 (which workflows, runner
|
1. **Forgejo CI** — what CI work remains after ADR-010 (which workflows, runner
|
||||||
setup, etc. still need to be built)?
|
setup, etc. still need to be built)?
|
||||||
|
|
||||||
2. **Testing**
|
2. **Testing**
|
||||||
1. Choose and configure code-testing tooling (Molecule, etc.).
|
1. Choose and configure code-testing tooling (Molecule, etc.).
|
||||||
2. Decide how the AI interprets Molecule output and performs live testing:
|
2. Decide how the AI interprets Molecule output and performs live testing — API
|
||||||
API calls, curl pulls of web products, log reviews, and headless browsing.
|
calls, curl pulls of web products, log reviews. Headless browsing → ADR-017
|
||||||
3. Define a standard for generating test users and for instructing the user to
|
(`/verify-service`); the API/curl/log-review siblings remain open.
|
||||||
perform relevant manual tests.
|
3. ~~Standard for test users + manual-test instructions.~~ → ADR-017.
|
||||||
|
4. ~~Local VM integration testing on ubongo.~~ → ADR-025 / `make test-integration` (built + RED→GREEN validated 2026-06-18).
|
||||||
|
|
||||||
3. **Building services**
|
3. **Building services**
|
||||||
1. Decide how to manage logs.
|
1. ~~Decide how to manage logs.~~ → ADR-018.
|
||||||
2. Decide how to manage APIs / API access.
|
2. ~~Decide how to manage APIs / API access.~~ → ADR-021.
|
||||||
3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013):
|
3. ~~Decide how to import/integrate from baobabAnsibleV4.~~ → ADR-013.
|
||||||
translate-don't-transplant — V4 is a source only of gotchas + working config
|
|
||||||
snippets, re-derived on boma's terms; never structure/requirements/values.
|
|
||||||
4. Decide what each node runs — base packages plus which apps/services.
|
4. Decide what each node runs — base packages plus which apps/services.
|
||||||
5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
|
5. ~~Decide the firewall strategy.~~ → ADR-020 (builds: host nftables in `base` done; OPNsense-as-code pending).
|
||||||
6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
|
6. Wire up the monitoring stack — Prometheus + metric exporters, Uptime Kuma, and
|
||||||
Kuma alerts on askari.
|
exactly which alerts live where. (Logging topology → ADR-018.)
|
||||||
7. Define a tagging standard that lets us target runs without over-tagging.
|
7. ~~Define a tagging standard.~~ → ADR-019.
|
||||||
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
|
8. ~~Ensure the right things are backed up.~~ → ADR-022 (build: the `backup` role, Plans 2–3, pending).
|
||||||
9. Decide: a central database server, or individual database services per app?
|
9. Decide: a central database server, or individual database services per app?
|
||||||
10. Should we continue to use the base-container method, or maybe something in the improvements of the methods in boma moods the point?
|
10. Should we keep the custom base-container (Molecule test image) method for role
|
||||||
|
testing, or revisit it as boma's testing approach matures (ADR-008)?
|
||||||
|
11. ~~Deliberate tagging strategy.~~ → ADR-019 (folded into 3.7).
|
||||||
|
|
||||||
4. **Split-horizon FQDN** — adopt split-horizon FQDN with or without nyumbani?
|
4. ~~**Split-horizon FQDN.**~~ → ADR-007 / M1 (`wingu.me` three-tier; `nyumbani` dropped; mesh/LAN-only default).
|
||||||
|
|
||||||
5. **Control node**
|
5. **Control node**
|
||||||
1. Set up and test the control node while waiting for hardware.
|
1. Set up and test the control node while waiting for hardware.
|
||||||
2. Define control-node bootstrapping — a dedicated recipe and playbook?
|
2. Define control-node bootstrapping — a dedicated recipe and playbook?
|
||||||
3. Decide the role of mamba — access/availability vs compute power and ease?
|
3. Set up rbw on the control node.
|
||||||
4. Set up rbw on the control node.
|
|
||||||
|
|
||||||
6. **Updating**
|
6. **Updating** — 1. Decide the update strategy across services & containers vs packages
|
||||||
1. Decide pinning vs latest for versions.
|
& builds / GitHub pulls / Flatpaks. 2. Define scheduling of updates and reboots,
|
||||||
2. Decide the update strategy across services & containers vs packages &
|
including post-update testing. (Tracked in item 16 / ADR-011.)
|
||||||
builds / GitHub pulls / Flatpaks.
|
|
||||||
3. Define scheduling of updates and reboots, including post-update testing.
|
|
||||||
|
|
||||||
7. **Shell setup**
|
7. **Shell setup**
|
||||||
1. Decide what shell setup matters for the AI's work on the control node.
|
1. Decide what shell setup matters for the AI's work on the control node.
|
||||||
2. Decide what to set up on the hosts, given that direct access will be rare.
|
2. ~~Decide what to set up on the hosts (direct access rare).~~ → ADR-021.
|
||||||
|
|
||||||
8. **Scheduled work**
|
8. **Scheduled work**
|
||||||
1. Run `/review-repo` as `claude -p` via cron every two weeks?
|
1. Run `/review-repo` as `claude -p` via cron every two weeks?
|
||||||
|
|
@ -67,45 +73,44 @@
|
||||||
accepted-risk register (`docs/security/accepted-risks.md`). Could pair a
|
accepted-risk register (`docs/security/accepted-risks.md`). Could pair a
|
||||||
deterministic pre-scan (undeclared open ports, disabled baseline controls,
|
deterministic pre-scan (undeclared open ports, disabled baseline controls,
|
||||||
world-readable secrets, services not behind auth) with a judgement pass.
|
world-readable secrets, services not behind auth) with a judgement pass.
|
||||||
Open question: standalone, or folded into the kaizen `/retro` (item 11)?
|
Open question: standalone, or folded into `/kaizen` (item 11)?
|
||||||
9. Should we make a basic function so that tools (and AI) can send messages to the user - email, matrix or ntfy?
|
9. Should we make a basic function so that tools (and AI) can send messages to the user - email, matrix or ntfy?
|
||||||
|
|
||||||
10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
|
10. **Claude setup** — DECIDED: brainstorm for intent → ADRs; hooks + slash commands +
|
||||||
files); hooks + slash commands + `/review-repo` for enforcement at scale. Any
|
`/review-repo` for enforcement at scale. Remaining:
|
||||||
remaining setup to carry out from this decision?
|
1. ~~V4 collaboration policy.~~ → ADR-013.
|
||||||
1. ~~Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.~~ DECIDED — ADR-013.
|
2. ~~Policy for how we write key documents like ADRs.~~ → ADR-023.
|
||||||
2. Policy for how we write key documents like ADRs.
|
3. Further development on how we collaborate on designing the foundation for the project - separate from how we implement new containers etc.
|
||||||
3. Further development on how we we collaborate on designing the foundation for the project - seperate from how we implement new containers etc.
|
4. ~~Always-latest official documentation for our tech.~~ → ADR-014.
|
||||||
4. ~~How do we make sure agents always use the latest official documentation for the technologies etc. we use?~~ DECIDED — ADR-014 (facts → version-matched docs, cited + stamped; best practices → translated per ADR-013; risk-based triggers; graceful fallback to WebFetch).
|
5. ~~Always subagent-driven?~~ → DECIDED: yes (standing agreement; enforced by `.claude/hooks/guard-execution-mode-menu.sh`).
|
||||||
5. Always subagent driven?
|
|
||||||
6. When AI deploys, i.e. runs playbooks etc., should we make a methodology so that it does not have to poll all the time or review all the output? Perhaps something about the MAKE method could provide only the relevant feedback?
|
6. When AI deploys, i.e. runs playbooks etc., should we make a methodology so that it does not have to poll all the time or review all the output? Perhaps something about the MAKE method could provide only the relevant feedback?
|
||||||
7. ~~Reproducible agent toolchain (surfaced by ADR-014).~~ DONE — repo
|
7. ~~Reproducible agent toolchain.~~ → `.claude/settings.json` + `docs/runbooks/claude-code-setup.md`.
|
||||||
`.claude/settings.json` declares `extraKnownMarketplaces` + `enabledPlugins`
|
8. **Screenshot hand-off to the agent.** Give the operator a smooth way to hand the
|
||||||
(active set: superpowers · context7 · terraform · claude-md-management) and a
|
agent a screenshot (e.g. of a Hetzner/VNC console during an incident) — the agent
|
||||||
conservative permissions allowlist; bootstrap procedure in
|
can already read image files; the gap is the hand-off. During the 2026-06-17
|
||||||
`docs/runbooks/claude-code-setup.md`. Deferred plugins listed there with
|
incident the only diagnostic channel was console screenshots, copied manually to
|
||||||
triggers. (Plugin install is still a per-machine `/plugin` action — no native
|
`/tmp` and `find`-located. Options: a known drop path the agent checks (e.g.
|
||||||
auto-install.)
|
`~/screenshots/`), a small `screenshot`/paste helper or slash-command, or a
|
||||||
|
clipboard→file convention. Cheap, high-value for incident work.
|
||||||
|
|
||||||
11. **Kaizen loop** — set up ~2026-06-06 (one week from now).
|
11. **Kaizen loop** — `/kaizen` built (STATUS).
|
||||||
1. Build `/retro`: reads `docs/FRICTION.md` + recurring `/review-repo`
|
1. ~~Build the loop command.~~ → `/kaizen` (`scripts/friction-scan.py` + `.claude/commands/kaizen.md`; spec `docs/superpowers/specs/2026-06-14-kaizen-command-design.md`).
|
||||||
findings + a tooling-usage inventory; proposes add / change / **remove**
|
2. Keep appending raw signals to `docs/FRICTION.md` (ongoing practice; see FRICTION.md).
|
||||||
(biased to remove); records decisions as ADRs; evaluates itself.
|
3. **Automation deferred** (revisit when the notify + cron stack is up): wire a
|
||||||
Recurrence-triggered plus a light periodic sweep.
|
**scheduled headless** run — report-only (proposes verdicts + notifies, does not
|
||||||
2. Keep appending raw signals to `docs/FRICTION.md` (live now) until the
|
auto-curate/commit). The on-demand command + recurrence/age nudge ship now.
|
||||||
retro consumes them.
|
|
||||||
|
|
||||||
12. **Spin-up order** — what is the right order of operations when spinning up
|
12. **Spin-up / build order** — what is the right order of operations when spinning up
|
||||||
from scratch (OS, DNS, Authentik, Traefik, …)?
|
from scratch (OS, DNS, Authentik, Caddy, …)?
|
||||||
|
|
||||||
13. **Intentions** - Is the current setup clearly identifying intentions throughout? We have the README files, but is that enough? Also, how do we re-challenge decisions and how they interact over time? For example, we have two services running, but extending one a little could make the other redundant, so we could remove it. Or an alternative to a service has emerged, and it is actually better.
|
13. **Intentions** - Is the current setup clearly identifying intentions throughout? We have the README files, but is that enough? Also, how do we re-challenge decisions and how they interact over time? For example, we have two services running, but extending one a little could make the other redundant, so we could remove it. Or an alternative to a service has emerged, and it is actually better.
|
||||||
|
|
||||||
14. **Script dependencies policy** — utility scripts (`tf_to_inventory.py`,
|
14. **Script dependencies policy** — utility scripts (`tf_to_inventory.py`,
|
||||||
`repo-scan.py`, `capacity-scan.py`) are stdlib-only by convention, for
|
`repo-scan.py`, `capacity-scan.py`, `friction-scan.py`) are stdlib-only by
|
||||||
run-anywhere portability (control node, CI, bare clone, no venv). Reevaluate
|
convention, for run-anywhere portability (control node, CI, bare clone, no venv).
|
||||||
whether selectively allowing libraries (e.g. PyYAML — already present via
|
Reevaluate whether selectively allowing libraries (e.g. PyYAML — already present via
|
||||||
Ansible) is a better fit in general: weigh the parsing-correctness win
|
Ansible) is a better fit in general: weigh the parsing-correctness win against losing
|
||||||
against losing zero-setup portability. Decide a clear rule and record it.
|
zero-setup portability. Decide a clear rule and record it.
|
||||||
|
|
||||||
15. **Security hardening implementation** — build out the ADR-002 hardening standard.
|
15. **Security hardening implementation** — build out the ADR-002 hardening standard.
|
||||||
1. Implement the CIS Debian Benchmark **Level 1 + Level 2** in the `base` role
|
1. Implement the CIS Debian Benchmark **Level 1 + Level 2** in the `base` role
|
||||||
|
|
@ -123,6 +128,7 @@
|
||||||
6. Supply-chain hygiene: enforce tiered image pinning (stateful `tag@digest`;
|
6. Supply-chain hygiene: enforce tiered image pinning (stateful `tag@digest`;
|
||||||
stateless rolling tags — ADR-011) + official/verified images via the service
|
stateless rolling tags — ADR-011) + official/verified images via the service
|
||||||
checklist; revisit active scanning (Trivy/Grype) once a triage stack exists (R1).
|
checklist; revisit active scanning (Trivy/Grype) once a triage stack exists (R1).
|
||||||
|
7. Is our network setup as it should be? It is unclear whether all traffic between ubongo and notes goes via askari. What if askari breaks: will the rest still work?
|
||||||
|
|
||||||
16. **ADR-011 (update management) — resolve open questions + accept.** Committed as
|
16. **ADR-011 (update management) — resolve open questions + accept.** Committed as
|
||||||
**Proposed**; resolve before marking Accepted:
|
**Proposed**; resolve before marking Accepted:
|
||||||
|
|
@ -136,7 +142,4 @@
|
||||||
Friday timing enough at this scale?
|
Friday timing enough at this scale?
|
||||||
6. Notification/control channel — boma's own ntfy topics (ADR-013) + a "skip this
|
6. Notification/control channel — boma's own ntfy topics (ADR-013) + a "skip this
|
||||||
week" / "pause" switch (ties to TODO 9).
|
week" / "pause" switch (ties to TODO 9).
|
||||||
7. ~~Reconcile pinning conflict (tags vs digests).~~ DECIDED: tiered rule —
|
7. ~~Reconcile pinning conflict (tags vs digests).~~ → DECIDED: tiered (stateful `tag@digest`, stateless rolling); ADR-011 dec. 2 / ADR-004 / ADR-002.
|
||||||
**stateful `tag@digest`** (readable tag + integrity digest), **stateless
|
|
||||||
rolling tags**. Aligned across ADR-011 (dec. 2), ADR-004, ADR-002 supply-chain
|
|
||||||
row + accepted-risk R1, the service checklist, and 15.6.
|
|
||||||
|
|
|
||||||
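The tiered pinning rule recorded in items 15.6 and 16.7 (stateful images as `tag@digest`, stateless images on rolling tags) lends itself to a mechanical check. The following is a minimal sketch only: the function name, the regular expression, and the choice to accept any reference for stateless services are assumptions made for illustration, not something the repo is known to contain.

```python
import re

# Hypothetical pattern for the "tag@digest" form: name:tag@sha256:<64 hex chars>.
# The tag part excludes "/" so a registry port (registry:5000/app) is not
# mistaken for a tag.
_TAG_AT_DIGEST = re.compile(
    r".+:[A-Za-z0-9_][A-Za-z0-9_.-]*@sha256:[0-9a-f]{64}"
)


def meets_pinning_rule(image: str, stateful: bool) -> bool:
    """Check an image reference against the tiered pinning rule.

    Assumption: stateless services may use rolling tags, so any reference
    passes for them; only stateful services are held to tag@digest.
    """
    if not stateful:
        return True
    return _TAG_AT_DIGEST.fullmatch(image) is not None


if __name__ == "__main__":
    digest = "sha256:" + "a" * 64
    for ref, stateful in [
        ("postgres:16@" + digest, True),   # readable tag + integrity digest
        ("postgres:16", True),             # tag only: not enough for stateful
        ("caddy:latest", False),           # rolling tag on a stateless service
    ]:
        print(ref[:24], stateful, meets_pinning_rule(ref, stateful))
```

A check like this could sit in the deterministic pre-scan that feeds a review, leaving the judgement about which services count as stateful to the role data.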
38
docs/access/service-access-template.md
Normal file
|
|
@ -0,0 +1,38 @@
# Per-service operational-access record — template

Copy this file to `roles/<service>/ACCESS.md` when building a service role (ADR-021).
It is the per-service **operational-access record**: every documented, verifiable way in
for troubleshooting. The structured parts are **rendered from the role's `access__*`
data** (the single source of truth that also drives `/check-access`) — keep the data
authoritative and regenerate this file rather than hand-editing the tables. The prose
"Operational notes" tail is hand-written.

Delete this preamble in the copy and start from the heading below.

---

# Access — <service>

## Access paths

The documented ways in, by tier (rendered from `access__*`):

| Tier | Path | Invocation |
|---|---|---|
| primary | `wt0` mesh SSH | `ssh <host>` (over the NetBird mesh) |
| secondary | LAN SSH from `ubongo` | `ssh <host>` (from the control node, LAN address) |
| — | container exec + compose | `docker compose -p <access__compose_project> -f <access__compose_path> ps` / `exec` |
| — | logs | Loki query for labels `<access__log.loki_labels>` (Grafana; ADR-018) |
| — | admin API | `curl -H 'Authorization: …(vault_ref)' <access__api.base_url><health_path>` — or `n/a` |

## Break-glass

Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):

- <Proxmox serial/VNC console for cluster VMs · Hetzner rescue for `askari` · local console for `ubongo`>

## Operational notes

Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.

- <none yet>
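Since the access template says its tables are rendered from the role's `access__*` data, a small sketch may help show what that rendering could look like. Everything here is an assumption for illustration: the function name, the dictionary shape (including `health_path` living under `access__api`), and the omission of the `Authorization` header, whose vault reference the template leaves elided.

```python
def access_invocations(role_vars):
    """Return (path, invocation) pairs for the data-driven access rows.

    `role_vars` is a hypothetical dict of the role's `access__*` variables.
    The Authorization header is left out on purpose: the template only
    gives it as an elided vault reference.
    """
    compose = (
        f"docker compose -p {role_vars['access__compose_project']} "
        f"-f {role_vars['access__compose_path']} ps"
    )
    # Build a Loki stream selector such as {host="h1",service="x"}.
    labels = ",".join(
        f'{key}="{value}"'
        for key, value in sorted(role_vars["access__log"]["loki_labels"].items())
    )
    pairs = [
        ("container exec + compose", compose),
        ("logs", "{" + labels + "}"),
    ]
    api = role_vars.get("access__api")
    if api:
        pairs.append(("admin API", f"curl {api['base_url']}{api['health_path']}"))
    else:
        # The template allows `n/a` for services without an admin API.
        pairs.append(("admin API", "n/a"))
    return pairs


if __name__ == "__main__":
    example = {
        "access__compose_project": "photoprism",
        "access__compose_path": "/opt/photoprism/compose.yml",
        "access__log": {"loki_labels": {"service": "photoprism", "host": "h1"}},
    }
    for path, invocation in access_invocations(example):
        print(f"{path}: {invocation}")
```

Keeping the rendering as a pure function of the role data fits the template's stated intent: the data stays authoritative and the file is regenerated rather than hand-edited.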
44
docs/backup/service-backup-template.md
Normal file
|
|
@ -0,0 +1,44 @@
# Per-service backup record — template

Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
role (ADR-022). It is the per-service **backup record**: what state the service holds,
how it is captured consistently, and how it is restored. The structured parts are
**rendered from the role's `backup__*` data** (the single source of truth that also
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
than hand-editing the tables. The prose "Restore notes" tail is hand-written.

A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
`backup__state: false` with a reason in its role defaults instead.

Delete this preamble in the copy and start from the heading below.

---

# Backup — <service>

## State captured

Rendered from `backup__*`:

| What | Source | How captured |
|---|---|---|
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |

- **Quiesce:** `<backup__quiesce>` — `true` means the service is stopped → backed up →
  restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).

## Restore procedure

1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
3. Replay each `<backup__dumps[*].dest>` into its database.
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).

## Restore notes

Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
known-tricky migrations.

- <none yet>
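The backup template implies a few consistency rules on the `backup__*` data: a stateless role sets `backup__state: false` with a reason, and a stateful role declares paths and/or dumps with a command and a destination. A sketch of such a check follows. It is not the `/check-backup` implementation; the function name and the `backup__state_reason` variable are hypothetical, since the template does not name the variable that holds the reason.

```python
def backup_problems(role_vars):
    """Return a list of human-readable problems in a role's backup data.

    `role_vars` is a hypothetical dict of the role's `backup__*` defaults.
    `backup__state_reason` is an assumed name for the stateless reason.
    """
    problems = []

    # Stateless roles opt out explicitly and must say why.
    if role_vars.get("backup__state") is False:
        if not role_vars.get("backup__state_reason"):
            problems.append("stateless role gives no reason for backup__state: false")
        return problems

    paths = role_vars.get("backup__paths") or []
    dumps = role_vars.get("backup__dumps") or []
    if not paths and not dumps:
        problems.append("stateful role declares neither backup__paths nor backup__dumps")

    # Each dump needs a command to run and a destination to write to.
    for index, dump in enumerate(dumps):
        for key in ("cmd", "dest"):
            if not dump.get(key):
                problems.append(f"backup__dumps[{index}] is missing '{key}'")
    return problems


if __name__ == "__main__":
    print(backup_problems({"backup__state": False}))
    print(backup_problems({"backup__paths": ["/srv/app/data"]}))
```

Returning a list of problems, rather than raising on the first one, lets a report-only run show everything that needs fixing in one pass.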
|
|
@ -1,5 +1,9 @@
|
||||||
# ADR-001 — Architecture overview
|
# ADR-001 — Architecture overview
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
This document describes the overall architecture of the homelab infrastructure
|
This document describes the overall architecture of the homelab infrastructure
|
||||||
|
|
@ -10,15 +14,16 @@ and the boundaries of what this Ansible monorepo manages.
|
||||||
- **Hypervisor**: Proxmox cluster (2+ nodes)
|
- **Hypervisor**: Proxmox cluster (2+ nodes)
|
||||||
- **Guest OS**: Debian 13 (all managed hosts)
|
- **Guest OS**: Debian 13 (all managed hosts)
|
||||||
- **Scale**: 2–5 VMs, small fleet — treated as individuals, not cattle
|
- **Scale**: 2–5 VMs, small fleet — treated as individuals, not cattle
|
||||||
- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
|
- **Control node**: `ubongo` — a dedicated always-on **physical** x86-64 machine
|
||||||
The control node is the one host that cannot fully bootstrap itself from scratch
|
**outside** the cluster. Ansible runs from here. It cannot be created by the
|
||||||
and requires manual initial setup (see `docs/runbooks/new-host.md`).
|
Terraform it hosts, so it is provisioned manually (see ADR-015 and
|
||||||
|
`docs/runbooks/new-host.md`).
|
||||||
|
|
||||||
## What this repo manages
|
## What this repo manages
|
||||||
|
|
||||||
| Layer | Managed by | Notes |
|
| Layer | Managed by | Notes |
|
||||||
|--------------------|--------------------|--------------------------------------------|
|
|--------------------|--------------------|--------------------------------------------|
|
||||||
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
|
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; `ubongo` (control node) is a physical box outside the cluster, the one manual exception (see ADR-009/ADR-015) |
|
||||||
| Internal DNS records | Ansible `dns` role | Internal zone rendered from inventory (see ADR-007/009) |
|
| Internal DNS records | Ansible `dns` role | Internal zone rendered from inventory (see ADR-007/009) |
|
||||||
| OS baseline | Ansible `base` role | Users, SSH, firewall, updates, audit |
|
| OS baseline | Ansible `base` role | Users, SSH, firewall, updates, audit |
|
||||||
| Docker runtime | Ansible `docker_host` role | Engine, daemon config, log driver |
|
| Docker runtime | Ansible `docker_host` role | Engine, daemon config, log driver |
|
||||||
|
|
@ -32,14 +37,17 @@ describes the *intended* design — see STATUS.md for what is actually built.
|
||||||
|
|
||||||
```
|
```
|
||||||
all
|
all
|
||||||
├── control # the control node itself — baseline config only, runs no services
|
├── control # ubongo — physical control node outside the cluster; baseline config only, runs no services
|
||||||
├── docker_hosts # VMs running Docker services (most hosts)
|
├── docker_hosts # VMs running Docker services (most hosts)
|
||||||
└── proxmox_hosts # Proxmox nodes themselves (limited management scope)
|
├── proxmox_hosts # Proxmox nodes themselves (limited management scope)
|
||||||
|
└── offsite_hosts # askari (off-site Hetzner) — NetBird coordinator + external watchdog
|
||||||
```
|
```
|
||||||
|
|
||||||
The `control` group holds the single manually-provisioned control node; it is
|
The `control` group holds the single manually-provisioned control node; it is
|
||||||
managed for baseline config (SSH, firewall, updates) but never runs the
|
managed for baseline config (SSH, firewall, updates) but never runs the
|
||||||
`docker_host` role. Proxmox nodes are managed only for basic baseline tasks (SSH).
|
`docker_host` role. The `offsite_hosts` group holds `askari`, the off-site Hetzner
|
||||||
|
host — also manually provisioned (ADR-016), managed for baseline config plus the
|
||||||
|
`netbird_coordinator` service role. Proxmox nodes are managed only for basic baseline tasks (SSH).
|
||||||
Proxmox configuration itself (storage, clustering, networking)
|
Proxmox configuration itself (storage, clustering, networking)
|
||||||
is out of scope.
|
is out of scope.
|
||||||
|
|
||||||
|
|
@ -61,3 +69,21 @@ This architecture prioritises:
|
||||||
- **Simplicity**: few moving parts, no orchestration layer (no Kubernetes, no Swarm)
|
- **Simplicity**: few moving parts, no orchestration layer (no Kubernetes, no Swarm)
|
||||||
- **Reproducibility**: any host can be rebuilt from scratch via Ansible
|
- **Reproducibility**: any host can be rebuilt from scratch via Ansible
|
||||||
- **Legibility**: a human reading the repo can understand what runs where
|
- **Legibility**: a human reading the repo can understand what runs where
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
Drawn from the boundaries this ADR already states:
|
||||||
|
|
||||||
|
- The small fleet (2–5 VMs) is treated as individuals, not cattle (per Infrastructure),
|
||||||
|
and forgoing an orchestration layer is the cost of the simplicity priority (per
|
||||||
|
Decision).
|
||||||
|
- The control node `ubongo` cannot be created by the Terraform it hosts, so it is
|
||||||
|
provisioned manually — the one documented exception to Terraform-owned VM existence
|
||||||
|
(per Infrastructure / Host groups; ADR-009, ADR-015).
|
||||||
|
- Management scope is deliberately bounded: Proxmox configuration itself (storage,
|
||||||
|
clustering, networking) is out of scope, and the `control` group never runs the
|
||||||
|
`docker_host` role (per Host groups).
|
||||||
|
- Compose files are always regenerated by Ansible on deploy; no hand-edited Compose
|
||||||
|
files exist on hosts (per Service interaction model).
|
||||||
|
- The "What this repo manages" table describes the *intended* design — STATUS.md
|
||||||
|
records what is actually built (per that section).
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,9 @@
|
||||||
# ADR-002 — Security baseline and strategy
|
# ADR-002 — Security baseline and strategy
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
Security here is not a single control but the sum of several combined efforts —
|
Security here is not a single control but the sum of several combined efforts —
|
||||||
|
|
@ -75,7 +79,8 @@ time. Each heading tags the threat(s) it primarily serves.
|
||||||
### Updates — *opportunistic*
|
### Updates — *opportunistic*
|
||||||
|
|
||||||
- `unattended-upgrades` enabled for **security patches only**
|
- `unattended-upgrades` enabled for **security patches only**
|
||||||
- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
|
- Full system upgrades triggered deliberately via Ansible (planned — a dedicated upgrade
|
||||||
|
playbook per ADR-011; not yet built, no `upgrade.yml` exists today)
|
||||||
- No automatic reboots — reboots are a conscious operational decision
|
- No automatic reboots — reboots are a conscious operational decision
|
||||||
|
|
||||||
### Minimal attack surface — *opportunistic, blast radius*
|
### Minimal attack surface — *opportunistic, blast radius*
|
||||||
|
|
@ -87,7 +92,9 @@ time. Each heading tags the threat(s) it primarily serves.
|
||||||
### Audit trail — *agent error, blast radius*
|
### Audit trail — *agent error, blast radius*
|
||||||
|
|
||||||
- `auditd` installed and running with a baseline ruleset
|
- `auditd` installed and running with a baseline ruleset
|
||||||
- Logs shipped to a central location if a log aggregation service is available
|
- Logs shipped to a central location in near-real-time — all logs to an on-cluster
|
||||||
|
Loki, plus a security-relevant subset write-only off-site to `askari` so the audit
|
||||||
|
trail survives host (and full-cluster) compromise (ADR-018)
|
||||||
|
|
||||||
### Mandatory access control — *blast radius*
|
### Mandatory access control — *blast radius*
|
||||||
|
|
||||||
|
|
@ -102,8 +109,9 @@ time. Each heading tags the threat(s) it primarily serves.
|
||||||
- **AIDE** file-integrity monitoring (required by the CIS Debian benchmark) — detects
|
- **AIDE** file-integrity monitoring (required by the CIS Debian benchmark) — detects
|
||||||
unexpected changes to system files
|
unexpected changes to system files
|
||||||
- **Network IDS** — Suricata on OPNsense (planned; see STATUS.md / TODO)
|
- **Network IDS** — Suricata on OPNsense (planned; see STATUS.md / TODO)
|
||||||
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
|
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus
|
||||||
monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
|
log-source-silence (a host that stops shipping) — into Grafana alerting on the
|
||||||
|
Loki/Grafana stack (ADR-018; planned)
|
||||||
|
|
||||||
## Secrets management — *agent error, opportunistic*
|
## Secrets management — *agent error, opportunistic*
|
||||||
|
|
||||||
|
|
@ -180,3 +188,27 @@ This posture was chosen to be:
|
||||||
Out-of-scope items and conscious trade-offs are recorded in
|
Out-of-scope items and conscious trade-offs are recorded in
|
||||||
`docs/security/accepted-risks.md` rather than here, so this decision record stays
|
`docs/security/accepted-risks.md` rather than here, so this decision record stays
|
||||||
stable while the risk posture evolves.
|
stable while the risk posture evolves.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
Drawn from the trade-offs, scoping, and follow-on work this ADR already states:
|
||||||
|
|
||||||
|
- Targeted/physical adversaries are out of scope at this scale, and supply chain is
|
||||||
|
consciously deprioritized — active vuln scanning is deferred as an accepted risk
|
||||||
|
(per Threat model; `docs/security/accepted-risks.md`).
|
||||||
|
- SELinux is not used (non-native to Debian, redundant with AppArmor), recorded as an
|
||||||
|
accepted risk (per Mandatory access control).
|
||||||
|
- Some CIS L2 items require separate partitions with restrictive mount options, which
|
||||||
|
reaches into VM disk layout — a provisioning concern (Terraform / cloud-init, ADR-006),
|
||||||
|
not just the `base` role (per Hardening standard). Any impractical CIS item is exempted
|
||||||
|
into the accepted-risk register with rationale, recording named exceptions rather than a
|
||||||
|
blanket opt-out.
|
||||||
|
- Several controls and governance mechanisms are stated as planned, not yet built:
|
||||||
|
Suricata network IDS, active alerting wiring AIDE/`auditd`/`fail2ban`/Suricata plus
|
||||||
|
log-source-silence into Grafana, the `/security-review` skill and its aggregation of
|
||||||
|
every `roles/*/SECURITY.md`, and the periodic security review (per File integrity /
|
||||||
|
Governance; STATUS.md / `docs/TODO.md`).
|
||||||
|
- The per-service security bar is enforced manually in review today, pending the planned
|
||||||
|
`/security-review` automation (per Governance).
|
||||||
|
- The accepted-risk register is kept out of this ADR so the record stays stable while the
|
||||||
|
risk posture evolves (per Decision; `docs/security/accepted-risks.md`).
|
||||||
|
|
|
||||||
|
|
@ -1,6 +1,20 @@
|
||||||
# ADR-003 — Toolchain decisions
|
# ADR-003 — Toolchain decisions
|
||||||
|
|
||||||
## Execution engine
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
boma needs a defined, reproducible toolchain for running and testing its Ansible
|
||||||
|
monorepo: an execution engine, a Python environment, secrets handling, a testing
|
||||||
|
framework, linting, CI/CD, developer-ergonomics conventions, and a collections/roles
|
||||||
|
policy. This ADR records the choice made for each, together with the alternatives
|
||||||
|
weighed and why they were not adopted.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### Execution engine
|
||||||
|
|
||||||
**Choice**: `ansible-core` (pip-installed, pinned version) + explicit `requirements.yml`
|
**Choice**: `ansible-core` (pip-installed, pinned version) + explicit `requirements.yml`
|
||||||
|
|
||||||
|
|
@ -12,7 +26,7 @@ that isn't needed in a maintained monorepo.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Python environment
|
### Python environment
|
||||||
|
|
||||||
**Choice**: `python3-venv` (system Python on Debian 13) + pinned `requirements.txt`
|
**Choice**: `python3-venv` (system Python on Debian 13) + pinned `requirements.txt`
|
||||||
|
|
||||||
|
|
@ -24,7 +38,7 @@ reproducible, and has no extra dependencies.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Secrets
|
### Secrets
|
||||||
|
|
||||||
**Choice**: Ansible Vault (file-based, built-in)
|
**Choice**: Ansible Vault (file-based, built-in)
|
||||||
|
|
||||||
|
|
@ -40,7 +54,7 @@ CLAUDE.md → Secrets).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Testing
|
### Testing
|
||||||
|
|
||||||
**Choice**: Molecule with Docker driver (`molecule-plugins[docker]`)
|
**Choice**: Molecule with Docker driver (`molecule-plugins[docker]`)
|
||||||
|
|
||||||
|
|
@ -59,7 +73,7 @@ are needed.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Linting
|
### Linting
|
||||||
|
|
||||||
**Choice**: `ansible-lint` + `yamllint` + `pre-commit`
|
**Choice**: `ansible-lint` + `yamllint` + `pre-commit`
|
||||||
|
|
||||||
|
|
@ -71,7 +85,7 @@ Config files: `.ansible-lint`, `.yamllint` in repo root.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## CI/CD
|
### CI/CD
|
||||||
|
|
||||||
**Choice**: Forgejo Actions (self-hosted at forgejo.nyumbani.baobab.band) + `act_runner`
|
**Choice**: Forgejo Actions (self-hosted at forgejo.nyumbani.baobab.band) + `act_runner`
|
||||||
|
|
||||||
|
|
@ -82,11 +96,12 @@ Config files: `.ansible-lint`, `.yamllint` in repo root.
|
||||||
2. On green → deploy to staging
|
2. On green → deploy to staging
|
||||||
3. [manual promote gate] → deploy to production
|
3. [manual promote gate] → deploy to production
|
||||||
|
|
||||||
`act_runner` runs as a Docker container on the control node or a dedicated runner VM.
|
`act_runner` runs as a Docker container on `ubongo` (the control node — ADR-015), or on
|
||||||
|
a dedicated runner VM later if CI load warrants a separate host.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Developer ergonomics
|
### Developer ergonomics
|
||||||
|
|
||||||
**Choice**: `Makefile` as the single interface for all operations
|
**Choice**: `Makefile` as the single interface for all operations
|
||||||
|
|
||||||
|
|
@ -101,7 +116,7 @@ The venv is activated in the user's shell profile.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Collections and roles policy
|
### Collections and roles policy
|
||||||
|
|
||||||
**No Galaxy roles.** All roles are written and maintained locally in `roles/`.
|
**No Galaxy roles.** All roles are written and maintained locally in `roles/`.
|
||||||
Galaxy roles introduce external state, versioning surprises, and implicit
|
Galaxy roles introduce external state, versioning surprises, and implicit
|
||||||
|
|
@ -135,3 +150,24 @@ are removed. Each entry in `requirements.yml` must justify its presence.
|
||||||
| NixOS targets | Poor Ansible fit; all hosts standardised on Debian 13 |
|
| NixOS targets | Poor Ansible fit; all hosts standardised on Debian 13 |
|
||||||
|
|
||||||
Terraform is **adopted** for VM provisioning only (no DNS) — see `docs/decisions/006-terraform.md`.
|
Terraform is **adopted** for VM provisioning only (no DNS) — see `docs/decisions/006-terraform.md`.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
Drawn from the rationale and trade-offs this ADR already states:
|
||||||
|
|
||||||
|
- Pinning `ansible-core` + an explicit `requirements.yml` and a plain pinned venv keeps
|
||||||
|
the control-node environment small and fully reproducible, at the cost of maintaining
|
||||||
|
the pins (per Execution engine / Python environment).
|
||||||
|
- Ansible Vault's whole-file encryption makes diffs unreadable regardless of layout, so
|
||||||
|
secrets are organised for human lookup (`vault.<service>.<key>`) rather than diff
|
||||||
|
ergonomics, the trade-off accepted in choosing Vault over SOPS/age (per Secrets).
|
||||||
|
- The `Makefile` is the single interface: Claude Code and CI invoke the same targets, so
|
||||||
|
local and CI behaviour can't drift and collaborators need not know raw flags (per
|
||||||
|
Developer ergonomics).
|
||||||
|
- Collections are added only on demand, so `requirements.yml` stays minimal; this defers
|
||||||
|
`community.crypto` (use `openssl` CLI until a role needs certs) and `community.general`
|
||||||
|
(add only the specific sub-module needed) until a real need appears (per Collections
|
||||||
|
and roles policy).
|
||||||
|
- The heavier orchestration tools were declined for this scale, each with a named
|
||||||
|
revisit trigger — e.g. Semaphore if non-SSH operators must trigger runs, AWX-adjacent
|
||||||
|
tooling only if AWX/AAP is ever adopted (per "What was explicitly ruled out").
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,9 @@
|
||||||
# ADR-004 — Docker and Compose service model
|
# ADR-004 — Docker and Compose service model
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
All services run as Docker containers managed via Docker Compose. This document
|
All services run as Docker containers managed via Docker Compose. This document
|
||||||
|
|
@ -42,8 +46,18 @@ below). Each service role contains a standard set of files:
|
||||||
| `defaults/main.yml` | Tuneables, `rolename__` namespace |
|
| `defaults/main.yml` | Tuneables, `rolename__` namespace |
|
||||||
| `README.md` | Purpose, variables, usage (role convention) |
|
| `README.md` | Purpose, variables, usage (role convention) |
|
||||||
| `SECURITY.md` | Per-service security record — see ADR-002 and `docs/security/service-security-template.md` |
|
| `SECURITY.md` | Per-service security record — see ADR-002 and `docs/security/service-security-template.md` |
|
||||||
|
| `VERIFY.md` | Per-service UI acceptance spec — see ADR-008 Level 4 / ADR-017 and `docs/testing/service-verify-template.md` |
|
||||||
|
| `ACCESS.md` | Per-service operational-access record — see ADR-021 and `docs/access/service-access-template.md` |
|
||||||
|
| `BACKUP.md` | Per-service backup record — see ADR-022 and `docs/backup/service-backup-template.md` (a stateless service declares `backup__state: false` with a reason) |
|
||||||
| `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |
|
| `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |
|
||||||
|
|
||||||
|
The `access__*` (ADR-021) and `backup__*` (ADR-022) data in `defaults/main.yml` are
|
||||||
|
**cross-role conventions** — shared field names that deliberately do *not* carry the
|
||||||
|
`<rolename>__` prefix. ansible-lint's `var-naming[no-role-prefix]` has no per-prefix
|
||||||
|
allowlist, so each such line carries a trailing `# noqa: var-naming[no-role-prefix]` (the
|
||||||
|
rule stays enforced for genuinely role-scoped vars). `make new-role` scaffolds a reminder;
|
||||||
|
`roles/reverse_proxy/defaults/main.yml` is the reference.
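
As a rough illustration of the convention above, the following Python sketch flags cross-role variables in a `defaults/main.yml` that lack the trailing `noqa`. The helper name `missing_noqa`, the sample keys `reverse_proxy__port` and `access__url`, and the simple line-based matching are assumptions made for this example, not part of the repo; only the two prefixes, `backup__state`, and the `noqa` text come from this ADR.

```python
import re

# Prefixes this ADR names as cross-role conventions (ADR-021 / ADR-022).
CROSS_ROLE_PREFIXES = ("access__", "backup__")
NOQA = "# noqa: var-naming[no-role-prefix]"

# A top-level YAML key at column 0, e.g. "backup__state: false".
KEY_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:")


def missing_noqa(defaults_text: str) -> list[str]:
    """Return cross-role variable names whose line lacks the trailing noqa."""
    missing = []
    for line in defaults_text.splitlines():
        match = KEY_RE.match(line)
        if not match:
            continue  # comments, nested keys, list items, blank lines
        name = match.group(1)
        if name.startswith(CROSS_ROLE_PREFIXES) and not line.rstrip().endswith(NOQA):
            missing.append(name)
    return missing


# Hypothetical defaults content: only `backup__state` appears in the ADR.
sample = """\
reverse_proxy__port: 443
backup__state: false  # noqa: var-naming[no-role-prefix]
access__url: https://example.invalid
"""
print(missing_noqa(sample))  # ['access__url']
```

A check like this would only complement ansible-lint, which still enforces the role prefix for every other variable.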
|
||||||
|
|
||||||
### Standard deploy mechanics
|
### Standard deploy mechanics
|
||||||
|
|
||||||
Every service role's `tasks/main.yml` follows the same sequence, so all roles are
|
Every service role's `tasks/main.yml` follows the same sequence, so all roles are
|
||||||
|
|
@ -97,7 +111,9 @@ Managed by the `docker_host` role. Key settings:
|
||||||
|
|
||||||
- Bind mounts preferred over named volumes for data that must be backed up
|
- Bind mounts preferred over named volumes for data that must be backed up
|
||||||
- All bind mount paths are under `/opt/services/<name>/data/`
|
- All bind mount paths are under `/opt/services/<name>/data/`
|
||||||
- Backup strategy is defined separately (not in scope of this repo)
|
- Backup strategy is defined in **ADR-022** — the bind mounts under
|
||||||
|
`/opt/services/<name>/data/` are exactly the unit ADR-022's per-service `backup__*`
|
||||||
|
contract (and `BACKUP.md`) captures
|
||||||
|
|
||||||
## Decision
|
## Decision
|
||||||
|
|
||||||
|
|
@ -106,3 +122,23 @@ Docker Compose was chosen over Kubernetes/Swarm because:
|
||||||
- Compose files are human-readable and easily auditable
|
- Compose files are human-readable and easily auditable
|
||||||
- No distributed state to manage
|
- No distributed state to manage
|
||||||
- Straightforward to back up and restore
|
- Straightforward to back up and restore
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
Drawn from the trade-offs and deferred items this ADR already states:
|
||||||
|
|
||||||
|
- A shared `compose_service` engine role is intentionally not built: the ~5 standard
|
||||||
|
tasks are duplicated per role in favour of legible, self-contained roles, with a stated
|
||||||
|
revisit trigger — extract a shared engine if maintaining the duplicated mechanics
|
||||||
|
becomes painful (a pattern change touching many roles, or drift this standard alone
|
||||||
|
isn't preventing) (per "Why not a shared engine").
|
||||||
|
- Forgoing Kubernetes/Swarm is the deliberate cost of matching complexity to a 2–5 host
|
||||||
|
fleet with no distributed state to manage (per Decision).
|
||||||
|
- User-namespace remapping is not enabled by default — evaluated per use case (per Docker
|
||||||
|
daemon configuration).
|
||||||
|
- Bare `latest` is acceptable only on the stateless tier; the stateful tier is always
|
||||||
|
pinned `tag@digest`, and image updates are a deliberate operation (per Image management;
|
||||||
|
ADR-011).
|
||||||
|
- Backup strategy is defined in ADR-022 (not in this ADR); the persistent bind mounts
|
||||||
|
under `/opt/services/<name>/data/` are the unit ADR-022's per-service `backup__*`
|
||||||
|
contract captures (per Persistent data).
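
The `tag@digest` pinning rule for the stateful tier, stated in the list above, lends itself to a mechanical check. The following is a minimal sketch, assuming image references of the usual `name:tag@sha256:<hex>` shape; the function name `is_tag_at_digest` and the example image names are hypothetical and do not come from the repo.

```python
import re

# A sha256 digest as it appears after the "@" in an image reference.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def is_tag_at_digest(image: str) -> bool:
    """True if the reference carries both a tag and a sha256 digest."""
    name_and_tag, sep, digest = image.partition("@")
    if not sep or not DIGEST_RE.fullmatch(digest):
        return False
    # The tag lives in the last path component, so a registry port such as
    # "registry.example:5000/app" is not mistaken for a tag.
    last_component = name_and_tag.rsplit("/", 1)[-1]
    repo, colon, tag = last_component.partition(":")
    return bool(repo and colon and tag)


fake_digest = "sha256:" + "a" * 64  # placeholder, not a real image digest
print(is_tag_at_digest(f"example.invalid/app:1.2.3@{fake_digest}"))  # True
print(is_tag_at_digest("example.invalid/app:latest"))                # False
```

A bare `latest` reference fails the check, which matches the rule that it is acceptable only on the stateless tier.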
|
||||||
|
|
|
||||||
|
|
@ -1,13 +1,17 @@
|
||||||
# ADR-005 — Host bootstrapping
|
# ADR-005 — Host bootstrapping
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
This document defines the **cloud-init template** that managed VMs are cloned
|
This document defines the **cloud-init template** that managed VMs are cloned
|
||||||
from, and the **control-node** bootstrapping special case. The per-host
|
from, and the **control-node** bootstrapping special case. The per-host
|
||||||
provisioning pipeline — how a VM is created from this template and handed off to
|
provisioning pipeline — how a VM is created from this template and handed off to
|
||||||
Ansible — is owned by ADR-009. Terraform clones the template defined here; the
|
Ansible — is owned by ADR-009. Terraform clones the template defined here; the
|
||||||
template is the base image both for Terraform-managed hosts and for the manually
|
template is the base image for Terraform-managed hosts. The control node (`ubongo`)
|
||||||
provisioned control node.
|
is a physical machine installed directly, not cloned from this template (ADR-015).
|
||||||
|
|
||||||
## Approach: Proxmox cloud-init template
|
## Approach: Proxmox cloud-init template
|
||||||
|
|
||||||
|
|
@ -32,10 +36,10 @@ High-level steps:
|
||||||
|
|
||||||
## VM provisioning (per new host)
|
## VM provisioning (per new host)
|
||||||
|
|
||||||
Per-host VMs are created by **Terraform**, which clones this template, sets the
|
Per-host VMs are created by **Terraform**, which clones this template and sets the
|
||||||
cloud-init values (hostname, SSH public key, IP/gateway), and writes the host's
|
cloud-init values (hostname, SSH public key, IP/gateway). Cloud-init runs at first
|
||||||
DNS A record. Cloud-init runs at first boot (~30–60 seconds), leaving the VM
|
boot (~30–60 seconds), leaving the VM reachable via SSH with the ansible user's key.
|
||||||
reachable via SSH with the ansible user's key.
|
Terraform writes no DNS records — the `dns` role owns the internal zone (ADR-009).
|
||||||
|
|
||||||
The full create → inventory → configure pipeline, and the Terraform↔Ansible data
|
The full create → inventory → configure pipeline, and the Terraform↔Ansible data
|
||||||
contract, are defined in **ADR-009 (provisioning handoff)**. There is no manual
|
contract, are defined in **ADR-009 (provisioning handoff)**. There is no manual
|
||||||
|
|
@ -51,11 +55,12 @@ for the end-to-end commands and `docs/runbooks/new-host.md` for the full procedu
|
||||||
## Control node bootstrapping
|
## Control node bootstrapping
|
||||||
|
|
||||||
The control node is a special case — it runs Terraform and Ansible, so it cannot
|
The control node is a special case — it runs Terraform and Ansible, so it cannot
|
||||||
be created by the Terraform it hosts (chicken-and-egg). It is the one documented
|
be created by the Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated
|
||||||
exception to Terraform-owned VM existence (see ADR-009). The control node requires:
|
**physical** machine outside the cluster, and the one documented exception to
|
||||||
|
Terraform-owned VM existence (see ADR-009 and ADR-015). The control node requires:
|
||||||
|
|
||||||
1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or
|
1. Manual OS provisioning — install Debian 13 on the physical box by hand (it is not
|
||||||
`qm clone`), since Terraform is not yet available to do it
|
a Proxmox guest, so there is no template to clone)
|
||||||
2. Manual setup of the Ansible environment:
|
2. Manual setup of the Ansible environment:
|
||||||
```bash
|
```bash
|
||||||
git clone <repo> ~/ansible
|
git clone <repo> ~/ansible
|
||||||
|
|
@ -68,9 +73,10 @@ exception to Terraform-owned VM existence (see ADR-009). The control node requir
|
||||||
```
|
```
|
||||||
3. After that, the control node can manage all other hosts normally
|
3. After that, the control node can manage all other hosts normally
|
||||||
|
|
||||||
The control node itself is listed in `inventories/production/hosts.yml` under
|
`ubongo` is listed in `inventories/production/hosts.yml` under the `control` group
|
||||||
a `control` group and can be managed for baseline config (SSH, firewall, updates)
|
and can be managed for baseline config (SSH, firewall, updates) but not for the
|
||||||
but not for the `docker_host` role (it does not run services).
|
`docker_host` role (it does not run services). Hardware target and recovery model
|
||||||
|
are in ADR-015.
|
||||||
|
|
||||||
## Decision
|
## Decision
|
||||||
|
|
||||||
|
|
@ -79,3 +85,19 @@ Cloud-init with Proxmox templates provides:
|
||||||
- No manual installer interaction
|
- No manual installer interaction
|
||||||
- A clean handoff point to Ansible
|
- A clean handoff point to Ansible
|
||||||
- Easy rebuilds — destroy VM, clone template, run Ansible
|
- Easy rebuilds — destroy VM, clone template, run Ansible
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
Drawn from the trade-offs and special cases this ADR already states:
|
||||||
|
|
||||||
|
- The cloud-init image was chosen over a manual Debian installer (slow, error-prone,
|
||||||
|
not reproducible) and over preseed/netboot (powerful but complex to maintain) (per
|
||||||
|
Approach).
|
||||||
|
- Template creation is a one-time manual procedure per Proxmox cluster, and the template
|
||||||
|
is never booted directly (per Template creation).
|
||||||
|
- There is no manual `qm clone` path for managed hosts; the full create → inventory →
|
||||||
|
configure pipeline and the Terraform↔Ansible contract live in ADR-009 (per VM
|
||||||
|
provisioning / Ansible handoff).
|
||||||
|
- The control node is the sole documented exception — `ubongo`, a physical machine
|
||||||
|
installed by hand because it cannot be created by the Terraform it hosts (chicken-and-egg);
|
||||||
|
its hardware target and recovery model live in ADR-015 (per Control node bootstrapping).
|
||||||
|
|
|
||||||
|
|
@ -1,10 +1,14 @@
|
||||||
# ADR-006 — Terraform for infrastructure provisioning
|
# ADR-006 — Terraform for infrastructure provisioning
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
Ansible manages host configuration well but has no state model for infrastructure
|
Ansible manages host configuration well but has no state model for infrastructure
|
||||||
existence. Adding Terraform handles the "what exists" layer — creating and destroying
|
existence. Adding Terraform handles the "what exists" layer — creating and destroying
|
||||||
VMs on Proxmox — while Ansible continues to own everything that runs inside them,
|
VMs on Proxmox and Hetzner — while Ansible continues to own everything that runs inside them,
|
||||||
including all internal DNS records.
|
including all internal DNS records.
|
||||||
|
|
||||||
This complements rather than replaces Ansible. The two tools do not overlap. The
|
This complements rather than replaces Ansible. The two tools do not overlap. The
|
||||||
|
|
@ -13,7 +17,9 @@ exact boundary, handoff pipeline, and data contract between them live in **ADR-0
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Responsibility split
|
## Decision
|
||||||
|
|
||||||
|
### Responsibility split
|
||||||
|
|
||||||
The canonical responsibility-split table lives in **ADR-009**. In short: Terraform
|
The canonical responsibility-split table lives in **ADR-009**. In short: Terraform
|
||||||
owns VM existence only; Ansible owns everything inside a VM, including all internal
|
owns VM existence only; Ansible owns everything inside a VM, including all internal
|
||||||
|
|
@ -26,11 +32,16 @@ cadence, making them a poor fit for Terraform state.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Providers
|
### Providers
|
||||||
|
|
||||||
**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
|
**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
|
||||||
full Proxmox 8 API support, and better cloud-init integration. This is the only
|
full Proxmox 8 API support, and better cloud-init integration. This is the provider
|
||||||
provider.
|
for Proxmox VMs.
|
||||||
|
|
||||||
|
**`hetznercloud/hcloud` (`~> 1.65`)**: owns off-site VM existence (`askari`). ADR-006's
|
||||||
|
scope is now **Proxmox + Hetzner** — "Terraform owns VM existence" generalizes across
|
||||||
|
providers. The `offsite` environment and `hetzner_vm` module live alongside the Proxmox
|
||||||
|
environments and `proxmox_vm` module; each environment has its own local state.
|
||||||
|
|
||||||
Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136)
|
Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136)
|
||||||
to write A records, but that created a bootstrap cycle — the first DNS server cannot
|
to write A records, but that created a bootstrap cycle — the first DNS server cannot
|
||||||
|
|
@ -42,7 +53,7 @@ Terraform manages its own provider dependencies via `required_providers` and
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## State backend
|
### State backend
|
||||||
|
|
||||||
**Choice**: Local state on the control node.
|
**Choice**: Local state on the control node.
|
||||||
|
|
||||||
|
|
@ -59,15 +70,17 @@ integration boundary.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Structure
|
### Structure
|
||||||
|
|
||||||
```
|
```
|
||||||
terraform/
|
terraform/
|
||||||
modules/
|
modules/
|
||||||
proxmox_vm/ # reusable VM module — Proxmox only, no DNS
|
proxmox_vm/ # reusable VM module — Proxmox only, no DNS
|
||||||
|
hetzner_vm/ # reusable VM module — Hetzner Cloud, no DNS
|
||||||
environments/
|
environments/
|
||||||
staging/ # staging VMs, separate state file
|
staging/ # staging Proxmox VMs, separate state file
|
||||||
production/ # production VMs, separate state file
|
production/ # production Proxmox VMs, separate state file
|
||||||
|
offsite/ # off-site Hetzner VMs (askari), separate state file
|
||||||
```
|
```
|
||||||
|
|
||||||
Separate environment directories (not Terraform workspaces) for the clearest
|
Separate environment directories (not Terraform workspaces) for the clearest
|
||||||
|
|
@ -75,7 +88,7 @@ isolation — no risk of accidentally applying the wrong state.
|
||||||
|
|
||||||
Each environment directory contains:
|
Each environment directory contains:
|
||||||
- `providers.tf` — provider version pins and configuration
|
- `providers.tf` — provider version pins and configuration
|
||||||
- `backend.tf` — Forgejo state backend (environment-specific path)
|
- `backend.tf` — backend configuration (local state on the control node; no remote backend — see "State backend" above)
|
||||||
- `variables.tf` — input declarations
|
- `variables.tf` — input declarations
|
||||||
- `terraform.tfvars.example` — tracked template; copy to `terraform.tfvars` for actual values
|
- `terraform.tfvars.example` — tracked template; copy to `terraform.tfvars` for actual values
|
||||||
- `main.tf` — `local.vms` map and module calls (no DNS resources)
|
- `main.tf` — `local.vms` map and module calls (no DNS resources)
|
||||||
|
|
@ -83,7 +96,7 @@ Each environment directory contains:
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Secrets handling
|
### Secrets handling
|
||||||
|
|
||||||
The only secret input (the Proxmox API token) is passed via a `TF_VAR_*`
|
The only secret input (the Proxmox API token) is passed via a `TF_VAR_*`
|
||||||
environment variable and declared `sensitive = true` in `variables.tf`. It never
|
environment variable and declared `sensitive = true` in `variables.tf`. It never
|
||||||
|
|
@ -92,7 +105,7 @@ appears in `.tfvars` files. Non-secret configuration lives in tracked
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Ansible integration
|
### Ansible integration
|
||||||
|
|
||||||
After `terraform apply`, run `make tf-inventory TF_ENV=<env>` to regenerate
|
After `terraform apply`, run `make tf-inventory TF_ENV=<env>` to regenerate
|
||||||
`inventories/<env>/hosts.yml` from the `vms` output. The full handoff pipeline,
|
`inventories/<env>/hosts.yml` from the `vms` output. The full handoff pipeline,
|
||||||
|
|
@ -102,7 +115,7 @@ handoff)**.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## What was ruled out
|
### What was ruled out
|
||||||
|
|
||||||
| Option | Reason |
|
| Option | Reason |
|
||||||
|---|---|
|
|---|---|
|
||||||
|
|
@ -110,3 +123,26 @@ handoff)**.
|
||||||
| OPNsense Terraform provider | Community-maintained; provider rot risk across OPNsense releases |
|
| OPNsense Terraform provider | Community-maintained; provider rot risk across OPNsense releases |
|
||||||
| Terraform workspaces | Single state file with workspace prefix; accidental cross-env apply possible |
|
| Terraform workspaces | Single state file with workspace prefix; accidental cross-env apply possible |
|
||||||
| Separate Terraform repo | Cross-referencing between infra and config adds friction; monorepo keeps the full picture together |
|
| Separate Terraform repo | Cross-referencing between infra and config adds friction; monorepo keeps the full picture together |
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
Drawn from the "What was ruled out" section and the decisions stated above:
|
||||||
|
|
||||||
|
- `bpg/proxmox` is the provider for Proxmox VMs; `telmate/proxmox` was ruled out for its weaker
|
||||||
|
maintenance, Proxmox 8 API support, and cloud-init integration (Providers; What was ruled out).
|
||||||
|
- `hetznercloud/hcloud` is the provider for off-site VM existence (`askari`); ADR-006's
|
||||||
|
scope now covers Proxmox + Hetzner (Providers).
|
||||||
|
- OPNsense stays entirely in Ansible — no Terraform OPNsense provider — to avoid
|
||||||
|
community-provider rot across OPNsense releases (Responsibility split; What was
|
||||||
|
ruled out).
|
||||||
|
- Terraform writes no DNS records; Ansible's `dns` role owns the entire internal
|
||||||
|
zone, avoiding the bootstrap cycle and split DNS ownership the earlier
|
||||||
|
`hashicorp/dns` design created (Providers).
|
||||||
|
- State is local on the control node because Forgejo offers no usable HTTP state
|
||||||
|
backend; this is sufficient at solo-operator scale (no concurrent applies, no
|
||||||
|
remote locking), with a real backend such as MinIO/S3 to be added later if
|
||||||
|
warranted (State backend).
|
||||||
|
- Separate environment directories are used instead of Terraform workspaces to
|
||||||
|
remove the risk of applying the wrong state (Structure; What was ruled out).
|
||||||
|
- Terraform and Ansible code live in one monorepo rather than in a separate
|
||||||
|
Terraform repo, avoiding cross-referencing friction between infra and config (What was ruled out).
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,9 @@
|
||||||
# ADR-007 — Network topology and addressing
|
# ADR-007 — Network topology and addressing
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
The boma homelab is a Proxmox cluster on a dedicated private network behind an
|
The boma homelab is a Proxmox cluster on a dedicated private network behind an
|
||||||
|
|
@ -10,7 +14,9 @@ and OPNsense configuration.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Physical topology
|
## Decision
|
||||||
|
|
||||||
|
### Physical topology
|
||||||
|
|
||||||
```
|
```
|
||||||
ISP
|
ISP
|
||||||
|
|
@ -38,7 +44,7 @@ ISP
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## VLAN design
|
### VLAN design
|
||||||
|
|
||||||
| VLAN | Name | Subnet | Purpose |
|
| VLAN | Name | Subnet | Purpose |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
|
|
@ -47,13 +53,13 @@ ISP
|
||||||
| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. |
|
| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. |
|
||||||
| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. |
|
| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. |
|
||||||
| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. |
|
| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. |
|
||||||
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
|
| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## IP addressing
|
### IP addressing
|
||||||
|
|
||||||
### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP
|
#### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP
|
||||||
|
|
||||||
| Address | Host |
|
| Address | Host |
|
||||||
|---|---|
|
|---|---|
|
||||||
|
|
@ -63,7 +69,7 @@ ISP
|
||||||
| `10.10.0.201` | `pve1` |
|
| `10.10.0.201` | `pve1` |
|
||||||
| `10.10.0.202` | `pve2` |
|
| `10.10.0.202` | `pve2` |
|
||||||
|
|
||||||
### VLAN 20 — srv (10.20.0.0/24) — no DHCP, all static
|
#### VLAN 20 — srv (10.20.0.0/24) — no DHCP, all static
|
||||||
|
|
||||||
| Range | Purpose |
|
| Range | Purpose |
|
||||||
|---|---|
|
|---|---|
|
||||||
|
|
@ -81,36 +87,45 @@ Assigned infrastructure addresses:
|
||||||
| `10.20.0.12` | `proxy` | Reverse proxy |
|
| `10.20.0.12` | `proxy` | Reverse proxy |
|
||||||
| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
|
| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
|
||||||
|
|
||||||
### VLAN 30 — lan (10.30.0.0/24)
|
> **Control node `ubongo` — legacy V4 network (transitional).** `ubongo` (ADR-015) is the
|
||||||
|
> manually-provisioned physical control node and currently lives on the **legacy V4
|
||||||
|
> homelab network at `10.20.10.151`** — boma is being built up from the V4 base, and the
|
||||||
|
> physical LAN has not yet been re-cut to this VLAN scheme. That address is therefore
|
||||||
|
> **outside** the planned `srv` `10.20.0.0/24`; `base__firewall_control_addr` and the
|
||||||
|
> inventory point at the real (V4) address. When the network is migrated to these VLANs,
|
||||||
|
> `ubongo` moves into `mgmt`/`srv` and this note is retired.
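
The claim in the note above, that the legacy address sits outside every planned subnet, can be verified with Python's standard `ipaddress` module. This is a small sketch: the subnets and addresses are taken from this ADR, while the names `PLANNED`, `UBONGO_LEGACY`, and `containing_vlans` are made up for the example.

```python
import ipaddress

# Planned subnets from the VLAN design in this ADR.
PLANNED = {
    "mgmt": ipaddress.ip_network("10.10.0.0/24"),
    "srv": ipaddress.ip_network("10.20.0.0/24"),
    "lan": ipaddress.ip_network("10.30.0.0/24"),
    "iot": ipaddress.ip_network("10.40.0.0/24"),
    "guest": ipaddress.ip_network("10.50.0.0/24"),
}

# ubongo's current address on the legacy V4 network.
UBONGO_LEGACY = ipaddress.ip_address("10.20.10.151")


def containing_vlans(addr):
    """Return the names of planned VLANs whose subnet contains addr."""
    return [name for name, net in PLANNED.items() if addr in net]


print(containing_vlans(UBONGO_LEGACY))                       # []
print(containing_vlans(ipaddress.ip_address("10.20.0.12")))  # ['srv']
```

The empty result for `10.20.10.151` is why the inventory and `base__firewall_control_addr` have to carry the real address rather than one derived from the `srv` range.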
|
||||||
|
|
||||||
|
#### VLAN 30 — lan (10.30.0.0/24)
|
||||||
|
|
||||||
| Range | Purpose |
|
| Range | Purpose |
|
||||||
|---|---|
|
|---|---|
|
||||||
| `10.30.0.1` | OPNsense gateway |
|
| `10.30.0.1` | OPNsense gateway |
|
||||||
| `10.30.0.100`–`.249` | DHCP pool |
|
| `10.30.0.100`–`.249` | DHCP pool |
|
||||||
|
|
||||||
### VLAN 40 — iot (10.40.0.0/24)
|
#### VLAN 40 — iot (10.40.0.0/24)
|
||||||
|
|
||||||
| Range | Purpose |
|
| Range | Purpose |
|
||||||
|---|---|
|
|---|---|
|
||||||
| `10.40.0.1` | OPNsense gateway |
|
| `10.40.0.1` | OPNsense gateway |
|
||||||
| `10.40.0.100`–`.249` | DHCP pool |
|
| `10.40.0.100`–`.249` | DHCP pool |
|
||||||
|
|
||||||
### VLAN 50 — guest (10.50.0.0/24)
|
#### VLAN 50 — guest (10.50.0.0/24)
|
||||||
|
|
||||||
| Range | Purpose |
|
| Range | Purpose |
|
||||||
|---|---|
|
|---|---|
|
||||||
| `10.50.0.1` | OPNsense gateway |
|
| `10.50.0.1` | OPNsense gateway |
|
||||||
| `10.50.0.100`–`.249` | DHCP pool |
|
| `10.50.0.100`–`.249` | DHCP pool |
|
||||||
|
|
||||||
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
|
#### VLAN 99 — vpn — retired
|
||||||
|
|
||||||
| Address | Host |
|
The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
|
||||||
|---|---|
|
(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
|
||||||
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
|
self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
|
||||||
| `10.99.0.2` | `askari` (Hetzner VPS) |
|
NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
|
||||||
| `10.99.0.10`+ | Road-warrior clients |
|
(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
|
||||||
|
`10.99.0.0/24` is freed.
|
||||||
|
|
||||||
### Corosync ring (172.16.0.0/24) — not on managed switch
|
#### Corosync ring (172.16.0.0/24) — not on managed switch
|
||||||
|
|
||||||
| Address | Host |
|
| Address | Host |
|
||||||
|---|---|
|
|---|---|
|
||||||
|
|
@@ -120,7 +135,7 @@ Assigned infrastructure addresses:
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## OPNsense firewall rules (intent)
|
### OPNsense firewall rules (intent)
|
||||||
|
|
||||||
| Source | Destination | Policy |
|
| Source | Destination | Policy |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
|
|
@@ -132,8 +147,8 @@ Assigned infrastructure addresses:
|
||||||
| `iot` | internet | allow egress only |
|
| `iot` | internet | allow egress only |
|
||||||
| `iot` | `srv` (HA IP only) | allow on integration ports |
|
| `iot` | `srv` (HA IP only) | allow on integration ports |
|
||||||
| `guest` | internet | allow, isolated from all internal |
|
| `guest` | internet | allow, isolated from all internal |
|
||||||
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
|
| mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) |
|
||||||
| `vpn` | `mgmt` | allow (administration from askari) |
|
| mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) |
|
||||||
|
|
||||||
**Home Assistant ↔ IoT**: HA VM at `10.20.0.13` can reach IoT VLAN on required
|
**Home Assistant ↔ IoT**: HA VM at `10.20.0.13` can reach IoT VLAN on required
|
||||||
ports. OPNsense Avahi (mDNS reflector) bridges `srv` ↔ `iot` for device discovery.
|
ports. OPNsense Avahi (mDNS reflector) bridges `srv` ↔ `iot` for device discovery.
|
||||||
|
|
@@ -141,7 +156,7 @@ IoT devices cannot initiate connections to `srv`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Naming scheme
|
### Naming scheme
|
||||||
|
|
||||||
| Layer | Convention | Examples |
|
| Layer | Convention | Examples |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
|
|
@@ -150,37 +165,74 @@ IoT devices cannot initiate connections to `srv`.
|
||||||
| Infrastructure VMs | `<role><n>` | `dns1`, `dns2`, `proxy` |
|
| Infrastructure VMs | `<role><n>` | `dns1`, `dns2`, `proxy` |
|
||||||
| Hetzner VPS | `askari` | Swahili for guard/sentinel |
|
| Hetzner VPS | `askari` | Swahili for guard/sentinel |
|
||||||
| Internal FQDN | `<host>.boma.baobab.band` | `dns1.boma.baobab.band` |
|
| Internal FQDN | `<host>.boma.baobab.band` | `dns1.boma.baobab.band` |
|
||||||
| Public service FQDN | `<service>.baobab.band` | `forgejo.nyumbani.baobab.band` |
|
| Public service FQDN | `<service>.wingu.me` | `vaultwarden.wingu.me` |
|
||||||
|
| Off-site (VPS) FQDN | `<service>.askari.wingu.me` | `netbird.askari.wingu.me` |
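
The three FQDN tiers in the table can be expressed as a small lookup. This is only a sketch: the function `fqdn` and the tier labels `internal`, `service`, and `offsite` are hypothetical names chosen for illustration, and the internal zone shown is the current one (`boma.baobab.band`), not the Phase-2 target.

```python
# Zone per naming tier, taken from the naming table.
ZONES = {
    "internal": "boma.baobab.band",  # current internal zone
    "service": "wingu.me",           # public service tier
    "offsite": "askari.wingu.me",    # off-site (VPS) tier
}


def fqdn(name, tier):
    """Build the FQDN for a host or service in the given tier."""
    if tier not in ZONES:
        raise ValueError(f"unknown tier: {tier!r}")
    return f"{name}.{ZONES[tier]}"


if __name__ == "__main__":
    for name, tier in [("dns1", "internal"),
                       ("vaultwarden", "service"),
                       ("netbird", "offsite")]:
        print(fqdn(name, tier))
```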
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## DNS zones and split-horizon
|
### DNS zones and split-horizon
|
||||||
|
|
||||||
**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
|
**Internal zone**: `boma.baobab.band` **today**, to be served (once the `dns` role is built) by
|
||||||
|
`dns1` and `dns2`. **Target:** it will be renamed to `boma.wingu.me` in Phase 2, when the `dns`
|
||||||
|
role lands. Until then `boma.baobab.band` is the authoritative internal name **everywhere
|
||||||
|
it appears** (the naming table above, split-horizon below, the OPNsense forwarder, and
|
||||||
|
ADR-009/016). This is the single source for that transition; other references use the
|
||||||
|
current name and inherit this caveat.
|
||||||
The zone is rendered by the Ansible `dns` role: host A records come from the
|
The zone is rendered by the Ansible `dns` role: host A records come from the
|
||||||
inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
|
inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
|
||||||
and service/alias/split-horizon records are explicit zone data in `group_vars`.
|
and service/alias/split-horizon records are explicit zone data in `group_vars`.
|
||||||
Terraform itself writes no DNS records — see ADR-009.
|
Terraform itself writes no DNS records — see ADR-009.
|
||||||
|
|
||||||
**Public zone**: `baobab.band` — served by external DNS (Cloudflare or equivalent).
|
**Public zone**: `wingu.me` — Gandi LiveDNS, **managed as code** by the `public_dns`
|
||||||
Public-facing services resolve to the public IP or Cloudflare proxy.
|
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal — the
|
||||||
|
Phase-2 target; currently `boma.baobab.band`, see *Internal zone* above), services
|
||||||
|
`<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
|
||||||
|
`nyumbani` is retired. **Mesh/LAN-only by default**: home services have no public record
|
||||||
|
(reached over LAN or the NetBird mesh); only deliberate exceptions are published. The
|
||||||
|
project is `boma`; the domain is `wingu.me`. The legacy `baobab.band` zone (Cloudflare)
|
||||||
|
is out of scope here.
|
||||||
|
|
||||||
**Split-horizon**: `dns1`/`dns2` serve internal answers for any hostname that has
|
**Split-horizon**: `dns1`/`dns2` serve internal answers for any hostname that has
|
||||||
both a public and private face. Example: `forgejo.nyumbani.baobab.band` resolves to
|
both a public and private face. Example: `vaultwarden.wingu.me` resolves to
|
||||||
`10.20.0.12` (proxy) internally and to the public IP externally.
|
`10.20.0.12` (proxy) internally and to the public IP externally (the internal
|
||||||
|
zone will be renamed to `boma.wingu.me` when the `dns` role is built — Phase 2).
|
||||||
|
|
||||||
OPNsense DNS resolver forwards `boma.baobab.band` queries to `dns1`/`dns2`.
|
OPNsense DNS resolver forwards `boma.baobab.band` queries to `dns1`/`dns2`.
|
||||||
All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).
|
All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).
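
The forwarding rule described above amounts to a suffix match on the query name. The sketch below models that decision only; it is not OPNsense configuration. `pick_resolvers` is a hypothetical helper, and `dns1`/`dns2` appear as hostname labels because their addresses are not given in this excerpt.

```python
INTERNAL_ZONE = "boma.baobab.band"
INTERNAL_RESOLVERS = ["dns1", "dns2"]        # labels, not real addresses
UPSTREAM_RESOLVERS = ["1.1.1.1", "9.9.9.9"]  # examples named in the text


def pick_resolvers(qname: str) -> list[str]:
    """Return the resolvers a query for qname would be forwarded to."""
    name = qname.rstrip(".").lower()
    # Match the zone apex or any name beneath it, on a label boundary.
    if name == INTERNAL_ZONE or name.endswith("." + INTERNAL_ZONE):
        return INTERNAL_RESOLVERS
    return UPSTREAM_RESOLVERS


if __name__ == "__main__":
    for q in ("dns1.boma.baobab.band", "example.org"):
        print(q, "->", pick_resolvers(q))
```

Matching on `"." + INTERNAL_ZONE` rather than the bare suffix avoids forwarding look-alike names such as `notboma.baobab.band` to the internal servers.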
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## External monitoring — askari
|
### External monitoring — askari
|
||||||
|
|
||||||
`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
|
`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
|
||||||
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
|
the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
|
||||||
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
|
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
|
||||||
for administration.
|
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
|
||||||
|
|
||||||
`askari` is provisioned and managed independently of the Proxmox cluster — it
|
`askari` is provisioned as **Terraform IaC** (`hetznercloud/hcloud`), managed
|
||||||
must be reachable even when the homelab is down (its entire purpose).
|
independently of the Proxmox cluster (its own provider + local state in
|
||||||
FQDN: `askari.baobab.band`.
|
`terraform/environments/offsite/`). It must be reachable even when the homelab is down
|
||||||
|
(its entire purpose), which is also why the mesh coordinator lives here: an off-site
|
||||||
|
control plane survives a homelab outage.
|
||||||
|
FQDN: `askari.wingu.me` (off-site tier; record added by `public_dns` when askari exists — M2/M4).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
Drawn from the implications already stated above:
|
||||||
|
|
||||||
|
- VLAN 99 (`vpn`, `10.99.0.0/24`) is retired and the subnet freed; remote access is
|
||||||
|
carried by the self-hosted NetBird mesh instead of an OPNsense WireGuard subnet
|
||||||
|
(VLAN design; IP addressing — VLAN 99 retired).
|
||||||
|
- Mesh-peer firewall allowances (to `srv` metrics ports and `mgmt`) are enforced by
|
||||||
|
NetBird ACLs, not OPNsense rules (OPNsense firewall rules (intent)).
|
||||||
|
- IoT devices cannot initiate connections to `srv`; only Home Assistant at
|
||||||
|
`10.20.0.13` may reach the IoT VLAN, with OPNsense Avahi bridging `srv` ↔ `iot`
|
||||||
|
for discovery (OPNsense firewall rules (intent)).
|
||||||
|
- Terraform writes no DNS records; the Ansible `dns` role renders the internal zone
|
||||||
|
from inventory plus `group_vars`, with `dns1`/`dns2` serving split-horizon answers
|
||||||
|
(DNS zones and split-horizon).
|
||||||
|
- `askari` runs independently of the cluster so it survives a homelab outage, which
|
||||||
|
is why the off-site NetBird control plane lives there (External monitoring —
|
||||||
|
askari).
|
||||||
|
|
|
||||||
|
|
@@ -1,5 +1,12 @@
|
||||||
# ADR-008 — Testing methodology
|
# ADR-008 — Testing methodology
|
||||||
|
|
||||||
|
> Practical point-of-use pitfalls (nft render checks, Molecule `community.docker`,
|
||||||
|
> apply-path coverage blind spots) live in `docs/testing/gotchas.md`.
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
Ansible roles must be idempotent and correct before they touch production hosts.
|
Ansible roles must be idempotent and correct before they touch production hosts.
|
||||||
|
|
@@ -8,11 +15,13 @@ This document records the testing strategy, what each level covers, and — crit
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Three testing levels
|
## Decision
|
||||||
|
|
||||||
### Level 1 — Molecule (per role, always required)
|
### Three testing levels
|
||||||
|
|
||||||
Runs in Docker on the control node or in CI. Fast (~5 min per role).
|
#### Level 1 — Molecule (per role, always required)
|
||||||
|
|
||||||
|
Runs in Docker on the control node (`ubongo`) or in CI. Fast (~5 min per role).
|
||||||
|
|
||||||
**What happens during `molecule test`:**
|
**What happens during `molecule test`:**
|
||||||
1. `create` — start the test container
|
1. `create` — start the test container
|
||||||
|
|
@@ -38,7 +47,7 @@ The idempotency step is non-negotiable. Every role must pass it cleanly.
|
||||||
that: svc.stdout == "active"
|
that: svc.stdout == "active"
|
||||||
```
|
```
|
||||||
|
|
||||||
### Level 2 — Staging playbook (full stack, real VMs)
|
#### Level 2 — Staging playbook (full stack, real VMs)
|
||||||
|
|
||||||
`make check PLAYBOOK=site` followed by `make deploy PLAYBOOK=site` on
|
`make check PLAYBOOK=site` followed by `make deploy PLAYBOOK=site` on
|
||||||
Terraform-provisioned staging VMs. Catches inter-role dependencies and ordering
|
Terraform-provisioned staging VMs. Catches inter-role dependencies and ordering
|
||||||
|
|
@@ -47,15 +56,35 @@ have already run and configured the firewall).
|
||||||
|
|
||||||
Run before every merge to `main`.
|
Run before every merge to `main`.
|
||||||
|
|
||||||
### Level 3 — External smoke test from askari
|
#### Level 3 — External smoke test from askari
|
||||||
|
|
||||||
Once `askari` is operational: scripted checks from outside the network confirming
|
Once `askari` is operational: scripted checks from outside the network confirming
|
||||||
that public-facing services respond correctly. Catches firewall and reverse proxy
|
that public-facing services respond correctly. Catches firewall and reverse proxy
|
||||||
configuration issues invisible to Ansible check mode.
|
configuration issues invisible to Ansible check mode.
|
||||||
|
|
||||||
|
#### Level 4 — Service-UI acceptance (Claude-driven exploratory)
|
||||||
|
|
||||||
|
A Claude-driven exploratory check of a service's **application UI**, run as
|
||||||
|
`/verify-service <name>` on `ubongo` (ADR-017). Claude drives Chromium via the
|
||||||
|
`playwright` plugin against a **staging** deploy, authenticates through the real
|
||||||
|
Caddy (ADR-024) + Authentik SSO flow using a test user in the staging `test` group, then
|
||||||
|
executes the service's `roles/<service>/VERIFY.md` acceptance journeys *and*
|
||||||
|
free-explores — judging pass/fail, screenshotting key states. It writes a dated report
|
||||||
|
to `docs/testing/reviews/` and hands the operator a manual-test checklist for anything
|
||||||
|
it can't verify (hardware, paid/external flows, subjective judgment).
|
||||||
|
|
||||||
|
Catches application-level regressions no lower level sees ("does PhotoPrism actually
|
||||||
|
serve photos?"). Placement: after Level 2 (staging deploy), before production
|
||||||
|
promotion. Exploratory and interactive by design — *not* a deterministic CI/cron gate
|
||||||
|
(that role belongs to health checks / Uptime Kuma).
|
||||||
|
|
||||||
|
**Status:** the skill, the `VERIFY.md` template, and standards are authorable now;
|
||||||
|
running it is deferred pending `ubongo`, the `playwright` plugin, Authentik, and a staging
|
||||||
|
deploy (STATUS.md). Full design: ADR-017.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Molecule test image
|
### Molecule test image
|
||||||
|
|
||||||
**No external images.** The project builds and hosts its own test image.
|
**No external images.** The project builds and hosts its own test image.
|
||||||
|
|
||||||
|
|
@@ -80,7 +109,7 @@ functionally equivalent and fully owned.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Idempotency requirements
|
### Idempotency requirements
|
||||||
|
|
||||||
Every role task must satisfy one of these:
|
Every role task must satisfy one of these:
|
||||||
|
|
||||||
|
|
@@ -98,9 +127,9 @@ catches anything lint misses.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## What Molecule tests — and what it does not
|
### What Molecule tests — and what it does not
|
||||||
|
|
||||||
### Tested in Molecule
|
#### Tested in Molecule
|
||||||
|
|
||||||
| Capability | Notes |
|
| Capability | Notes |
|
||||||
|---|---|
|
|---|---|
|
||||||
|
|
@@ -116,7 +145,7 @@ catches anything lint misses.
|
||||||
| auditd installation and configuration | Install and config file |
|
| auditd installation and configuration | Install and config file |
|
||||||
| Idempotency of all of the above | Enforced by Molecule's idempotency step |
|
| Idempotency of all of the above | Enforced by Molecule's idempotency step |
|
||||||
|
|
||||||
### Not tested in Molecule — explicit exceptions
|
#### Not tested in Molecule — explicit exceptions
|
||||||
|
|
||||||
The following require a real kernel or real hardware and are validated only at
|
The following require a real kernel or real hardware and are validated only at
|
||||||
Level 2 (staging) or Level 3 (external). This is a conscious, documented decision
|
Level 2 (staging) or Level 3 (external). This is a conscious, documented decision
|
||||||
|
|
@@ -125,7 +154,8 @@ Level 2 (staging) or Level 3 (external). This is a conscious, documented decisio
|
||||||
| Capability | Reason not testable in Molecule |
|
| Capability | Reason not testable in Molecule |
|
||||||
|---|---|
|
|---|---|
|
||||||
| `nftables` rule loading | Requires `nf_tables` kernel module; not available in Docker |
|
| `nftables` rule loading | Requires `nf_tables` kernel module; not available in Docker |
|
||||||
| WireGuard tunnel establishment | Requires `wireguard` kernel module |
|
| **Reboot-survivability / host-firewall × Docker interaction / boot-ordering** | **Requires a real kernel reboot — the class that caused the 2026-06-17 mesh-hardening incident. Now covered by local VM integration testing (ADR-025).** |
|
||||||
|
| NetBird mesh data plane (`wt0` WireGuard interface) | Requires the `wireguard` kernel module; Molecule checks only that the agent is installed/configured (ADR-016) |
|
||||||
| `unattended-upgrades` behaviour | Installs correctly; actual upgrade behaviour requires a real apt environment |
|
| `unattended-upgrades` behaviour | Installs correctly; actual upgrade behaviour requires a real apt environment |
|
||||||
| DHCP behaviour (OPNsense) | OPNsense is managed by Ansible but not testable in a container |
|
| DHCP behaviour (OPNsense) | OPNsense is managed by Ansible but not testable in a container |
|
||||||
| mDNS reflector (Avahi cross-VLAN) | Requires real network interfaces and VLANs |
|
| mDNS reflector (Avahi cross-VLAN) | Requires real network interfaces and VLANs |
|
||||||
|
|
@@ -136,9 +166,14 @@ For the above, Molecule tests only what it can: that the relevant packages are
|
||||||
installed, that configuration files render correctly, and that services are enabled.
|
installed, that configuration files render correctly, and that services are enabled.
|
||||||
Behavioural correctness is confirmed on staging.
|
Behavioural correctness is confirmed on staging.
|
||||||
|
|
||||||
|
**ADR-025 is the concrete build of Level 2/3** — local VM integration testing on
|
||||||
|
`ubongo` (libvirt/KVM, throwaway overlay VMs, stdlib-only driver). It specifically
|
||||||
|
targets the reboot-survivability / host-firewall × Docker / boot-ordering class that
|
||||||
|
Molecule structurally cannot reach. See `docs/decisions/025-local-vm-integration-testing.md`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## CI pipeline
|
### CI pipeline
|
||||||
|
|
||||||
```
|
```
|
||||||
push to main
|
push to main
|
||||||
|
|
@@ -155,3 +190,27 @@ promote to production
|
||||||
|
|
||||||
Manual gates are intentional. Automated tests prove correctness in isolation;
|
Manual gates are intentional. Automated tests prove correctness in isolation;
|
||||||
a human confirms the change is safe to promote.
|
a human confirms the change is safe to promote.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
Drawn from the limitations and trade-offs already stated above:
|
||||||
|
|
||||||
|
- The Molecule idempotency step is non-negotiable; every role must pass it cleanly
|
||||||
|
(Three testing levels — Level 1).
|
||||||
|
- A class of capabilities (nftables rule loading, NetBird mesh data plane,
|
||||||
|
unattended-upgrades behaviour, OPNsense DHCP, Avahi mDNS reflection, hardware
|
||||||
|
passthrough, corosync cluster formation) cannot be verified in Molecule and is
|
||||||
|
validated only at Level 2 (staging) or Level 3 (external) — a conscious,
|
||||||
|
documented decision, not a gap (What Molecule tests — and what it does not).
|
||||||
|
- The project builds and hosts its own `molecule-debian13` image rather than relying
|
||||||
|
on an external Docker Hub image (e.g. geerlingguy), accepting the maintenance of a
|
||||||
|
custom image to avoid drift, disappearance, or unexpected changes outside project
|
||||||
|
control (Molecule test image).
|
||||||
|
- Level 4 service-UI acceptance is authorable now but its execution is deferred,
|
||||||
|
pending `ubongo`, the `playwright` plugin, Authentik, and a staging deploy (Three
|
||||||
|
testing levels — Level 4).
|
||||||
|
- Promotion to staging and to production stays behind intentional manual approval
|
||||||
|
gates; automation proves isolated correctness, a human confirms promotion safety
|
||||||
|
(CI pipeline).
|
||||||
|
|
|
||||||
|
|
@@ -1,5 +1,9 @@
|
||||||
# ADR-009 — Terraform ↔ Ansible provisioning handoff
|
# ADR-009 — Terraform ↔ Ansible provisioning handoff
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
Two tools touch every managed host. Terraform owns **what exists** — VMs on
|
Two tools touch every managed host. Terraform owns **what exists** — VMs on
|
||||||
|
|
@@ -14,7 +18,9 @@ the cloud-init template that VMs are cloned from. This ADR covers how they conne
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## The boundary
|
## Decision
|
||||||
|
|
||||||
|
### The boundary
|
||||||
|
|
||||||
| Layer | Tool | Notes |
|
| Layer | Tool | Notes |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
|
|
@@ -31,7 +37,7 @@ below).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## The handoff pipeline
|
### The handoff pipeline
|
||||||
|
|
||||||
There is one path by which a managed host comes into existence and reaches its
|
There is one path by which a managed host comes into existence and reaches its
|
||||||
configured state:
|
configured state:
|
||||||
|
|
@@ -55,7 +61,7 @@ this pipeline — **never** by hand-editing the inventory.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## The data contract
|
### The data contract
|
||||||
|
|
||||||
The seam's interface is a single Terraform output consumed by a single script.
|
The seam's interface is a single Terraform output consumed by a single script.
|
||||||
|
|
||||||
|
|
@@ -75,7 +81,12 @@ The seam's interface is a single Terraform output consumed by a single script.
|
||||||
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
|
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
|
||||||
group against the allowed set and fails loudly on an unknown group.
|
group against the allowed set and fails loudly on an unknown group.
|
||||||
|
|
||||||
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`.
|
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
|
||||||
|
|
||||||
|
`control` holds `ubongo`, a physical machine not managed by Terraform (see the
|
||||||
|
control-node exception below and ADR-015). `offsite_hosts` holds `askari`, which is
|
||||||
|
Terraform-managed via the `hetznercloud/hcloud` provider in the `offsite` environment
|
||||||
|
(see the off-site handoff note below and ADR-016).
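
The group validation described above can be sketched as follows. This is not the actual `scripts/tf_to_inventory.py`; it is a minimal illustration that assumes the Terraform output is named `vms`, carries the `{ host: { ip, group } }` shape, and maps to a conventional Ansible `all.children.<group>.hosts` layout. The function name `build_inventory` and the `ansible_host` key placement are assumptions.

```python
import json

VALID_GROUPS = {"control", "docker_hosts", "proxmox_hosts", "offsite_hosts"}


def build_inventory(tf_output_json: str) -> dict:
    """Turn `terraform output -json` text into an inventory dict.

    Raises ValueError on a group outside VALID_GROUPS (fail loudly).
    """
    # `terraform output -json` wraps each output under a "value" key.
    vms = json.loads(tf_output_json)["vms"]["value"]
    children: dict = {}
    for host, attrs in sorted(vms.items()):
        group = attrs["group"]
        if group not in VALID_GROUPS:
            raise ValueError(f"host {host!r}: unknown group {group!r}")
        hosts = children.setdefault(group, {"hosts": {}})["hosts"]
        hosts[host] = {"ansible_host": attrs["ip"]}
    return {"all": {"children": children}}


if __name__ == "__main__":
    # Illustrative input; the group assignment for `proxy` is assumed.
    sample = json.dumps(
        {"vms": {"value": {"proxy": {"ip": "10.20.0.12",
                                     "group": "docker_hosts"}}}}
    )
    print(json.dumps(build_inventory(sample), indent=2))
```

Rejecting unknown groups at generation time keeps a typo in `local.vms` from silently producing a host that no playbook targets.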
|
||||||
|
|
||||||
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
|
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
|
||||||
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
|
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
|
||||||
|
|
@@ -83,7 +94,7 @@ Terraform, and the inventory is regenerated, never edited.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Cloud-init's role
|
### Cloud-init's role
|
||||||
|
|
||||||
Cloud-init is the thin first-boot layer between Terraform and Ansible:
|
Cloud-init is the thin first-boot layer between Terraform and Ansible:
|
||||||
|
|
||||||
|
|
@@ -98,7 +109,7 @@ The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Internal DNS — owned by Ansible, no chicken-and-egg
|
### Internal DNS — owned by Ansible, no chicken-and-egg
|
||||||
|
|
||||||
Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
|
Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
|
||||||
rendered entirely by the Ansible `dns` role:
|
rendered entirely by the Ansible `dns` role:
|
||||||
|
|
@@ -108,7 +119,8 @@ rendered entirely by the Ansible `dns` role:
|
||||||
remains the ultimate source of truth for which hosts exist; the data simply flows
|
remains the ultimate source of truth for which hosts exist; the data simply flows
|
||||||
through the inventory instead of through a direct Terraform→DNS write.
|
through the inventory instead of through a direct Terraform→DNS write.
|
||||||
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
|
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
|
||||||
gateway, `forgejo.nyumbani.baobab.band` → proxy) are explicit zone data in `group_vars`.
|
gateway, `vaultwarden.wingu.me` → proxy split-horizon) are explicit zone data in
|
||||||
|
`group_vars`.
|
||||||
|
|
||||||
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
|
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
|
||||||
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
|
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
|
||||||
|
|
@@ -124,14 +136,16 @@ convention only — it no longer implies any difference in how records are writt
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## The control-node exception
|
### The control-node exception
|
||||||
|
|
||||||
The control node — the host that runs Terraform and Ansible — is the one VM
|
The control node — the host that runs Terraform and Ansible — is `ubongo`, a
|
||||||
Terraform does **not** create. It cannot provision the infrastructure that would
|
dedicated **physical** machine outside the cluster. It is not a VM at all, so
|
||||||
provision itself (chicken-and-egg). It is therefore the single documented exception
|
Terraform genuinely never touches it: it cannot provision the infrastructure that
|
||||||
to "Terraform owns VM existence":
|
would provision itself (chicken-and-egg). It is therefore the single documented
|
||||||
|
exception to "Terraform owns VM existence":
|
||||||
|
|
||||||
- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
|
- Provisioned and bootstrapped manually on bare metal, per the control-node section
|
||||||
|
of ADR-005; rationale, hardware, and recovery model in ADR-015.
|
||||||
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
|
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
|
||||||
Ansible for baseline config only (no `docker_host` role).
|
Ansible for baseline config only (no `docker_host` role).
|
||||||
|
|
||||||
|
|
@@ -139,7 +153,28 @@ Every other host is Terraform-managed.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## What was ruled out
|
### The off-site handoff (`offsite` environment → `offsite_hosts`)
|
||||||
|
|
||||||
|
`askari` (Hetzner VPS, ADR-016) follows the same handoff pipeline as Proxmox hosts but
|
||||||
|
with its own provider and environment:
|
||||||
|
|
||||||
|
- **Producer** — `terraform/environments/offsite/outputs.tf` emits a `vms` map in the
|
||||||
|
same `{ host: { ip, group } }` shape as Proxmox environments; `askari`'s group is
|
||||||
|
`offsite_hosts`.
|
||||||
|
- **Consumer** — `scripts/tf_to_inventory.py` reads `terraform output -json` from the
|
||||||
|
`offsite` environment and writes `inventories/production/offsite.yml`.
|
||||||
|
- **Makefile target** — `make tf-inventory-offsite` runs the generator for the offsite
|
||||||
|
environment.
|
||||||
|
|
||||||
|
The production inventory is a **directory** (`inventories/production/`) that Ansible
|
||||||
|
merges at runtime: `hosts.yml` (Proxmox-generated) and `offsite.yml`
|
||||||
|
(offsite-generated) together form the full production host list. Each file is a build
|
||||||
|
artifact — never hand-edited; their source of truth is `local.vms` in the respective
|
||||||
|
environment's `main.tf`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### What was ruled out
|
||||||
|
|
||||||
| Option | Reason |
|
| Option | Reason |
|
||||||
|---|---|
|
|---|---|
|
||||||
|
|
@@ -147,3 +182,28 @@ Every other host is Terraform-managed.
|
||||||
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
|
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
|
||||||
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
|
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
|
||||||
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |
|
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
Drawn from the boundary, the data contract, and the "What was ruled out" section above:
|
||||||
|
|
||||||
|
- Adding a host means editing `local.vms` and running the handoff pipeline; the
|
||||||
|
generated `hosts.yml` is a build artifact and must never be hand-edited — manual
|
||||||
|
edits are overwritten on the next `make tf-inventory` (The handoff pipeline; The
|
||||||
|
data contract; What was ruled out).
|
||||||
|
- Manual `qm clone` is rejected as a general provisioning path so the inventory and
|
||||||
|
real infrastructure cannot drift; Terraform is the single way VMs come into
|
||||||
|
existence (What was ruled out).
|
||||||
|
- Terraform writes no DNS records: the Ansible `dns` role renders the whole internal
|
||||||
|
zone from inventory plus `group_vars`, dissolving the bootstrap cycle a
|
||||||
|
Terraform-managed zone (`hashicorp/dns` + RFC 2136) would create (Internal DNS —
|
||||||
|
owned by Ansible, no chicken-and-egg; What was ruled out).
|
||||||
|
- The control node (`ubongo`) is the single documented exception to "Terraform owns
|
||||||
|
VM existence" — a physical machine provisioned manually and managed by Ansible for
|
||||||
|
baseline config only (The control-node exception).
|
||||||
|
- The `offsite` Terraform environment's `vms` output feeds the `offsite_hosts` group via
|
||||||
|
`tf_to_inventory.py` (`make tf-inventory-offsite` → `inventories/production/offsite.yml`);
|
||||||
|
the production inventory is a directory that merges `hosts.yml` (Proxmox) and
|
||||||
|
`offsite.yml` (offsite) (The off-site handoff).
|
||||||
|
- The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link
|
||||||
|
here rather than restating it (What was ruled out).
|
||||||
|
|
|
||||||
|
|
@@ -1,5 +1,9 @@
|
||||||
# ADR-010 — Forgejo integration and CI
|
# ADR-010 — Forgejo integration and CI
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-05-30)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
boma's git host, container registry, and (planned) CI all run on a self-hosted
|
boma's git host, container registry, and (planned) CI all run on a self-hosted
|
||||||
|
|
@@ -20,7 +24,7 @@ held to the same standard as the rest of the repo's secrets.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Decisions
|
## Decision
|
||||||
|
|
||||||
### 1. API tokens are managed secrets, least-privilege
|
### 1. API tokens are managed secrets, least-privilege
|
||||||
|
|
||||||
|
|
@@ -63,8 +67,8 @@ Trunk-based, matching ADR-003 / ADR-008:
|
||||||
push to main → lint + Molecule → deploy staging → [manual gate] → deploy production
|
push to main → lint + Molecule → deploy staging → [manual gate] → deploy production
|
||||||
```
|
```
|
||||||
|
|
||||||
Runner: `act_runner` on the control node or a dedicated runner VM. Actions is not
|
Runner: `act_runner` on `ubongo` (the control node — ADR-015), or a dedicated runner VM
|
||||||
yet enabled — see STATUS.md.
|
later if CI load warrants a separate host. Actions is not yet enabled — see STATUS.md.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -75,3 +79,21 @@ yet enabled — see STATUS.md.
|
||||||
| Terraform Forgejo HTTP state backend | Forgejo's `/raw/` API is read-only; state can't be written there. Local state instead (ADR-006). |
|
| Terraform Forgejo HTTP state backend | Forgejo's `/raw/` API is read-only; state can't be written there. Local state instead (ADR-006). |
|
||||||
| Admin-scoped automation tokens | Unnecessary privilege; scope to `read:repository` + `read`/`write:package`. |
|
| Admin-scoped automation tokens | Unnecessary privilege; scope to `read:repository` + `read`/`write:package`. |
|
||||||
| Ad-hoc UI/API configuration as the norm | Becomes undocumented drift; codify or document instead. |
|
| Ad-hoc UI/API configuration as the norm | Becomes undocumented drift; codify or document instead. |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- The planned CI pipeline (see "CI pipeline (planned)") is trunk-based per ADR-003 /
|
||||||
|
ADR-008 — `push to main → lint + Molecule → deploy staging → [manual gate] → deploy
|
||||||
|
production` — running `act_runner` on `ubongo` (or a dedicated runner VM later if CI
|
||||||
|
load warrants); Actions is not yet enabled, so this remains future work tracked in
|
||||||
|
STATUS.md.
|
||||||
|
- Terraform state is not held in Forgejo: its `/raw/` API is read-only and cannot be
|
||||||
|
written, so local state is used instead (ADR-006) (see "What was ruled out").
|
||||||
|
- Automation tokens are scoped to `read:repository` + `read`/`write:package` rather
|
||||||
|
than admin, accepting the limits that least-privilege imposes on what automation can
|
||||||
|
do (see "What was ruled out").
|
||||||
|
- Instance/repo configuration must be codified or documented rather than changed
|
||||||
|
ad-hoc, to avoid the undocumented drift `/review-repo` exists to catch (see "What was
|
||||||
|
ruled out").
|
||||||
|
|
|
||||||
|
|
@ -1,6 +1,9 @@
|
||||||
# ADR-011 — Update and upgrade management
|
# ADR-011 — Update and upgrade management
|
||||||
|
|
||||||
**Status: Proposed — draft for discussion (not yet accepted).**
|
## Status
|
||||||
|
|
||||||
|
Proposed (2026-06-04) — draft for discussion; not yet accepted. The core decisions
|
||||||
|
below are settled in intent, but several specifics remain open (see "Open questions").
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
|
|
@ -10,7 +13,7 @@ drift over time and must be kept current without breaking the homelab: the **hos
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Decisions
|
## Decision
|
||||||
|
|
||||||
### 1. Every service is classified stateful or stateless
|
### 1. Every service is classified stateful or stateless
|
||||||
|
|
||||||
|
|
@ -18,7 +21,7 @@ Each container role declares its class, e.g. `<role>__stateful: true|false` (def
|
||||||
`false`). The split is the load-bearing classification for the whole policy.
|
`false`). The split is the load-bearing classification for the whole policy.
|
||||||
|
|
||||||
- **Stateless** — no durable data of its own; losing the container loses nothing.
|
- **Stateless** — no durable data of its own; losing the container loses nothing.
|
||||||
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
|
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Caddy,
|
||||||
reverse proxies, FlareSolverr.
|
reverse proxies, FlareSolverr.
|
||||||
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
|
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
|
||||||
store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
|
store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
|
||||||
|
|
@ -53,7 +56,7 @@ per host, in strict order with a verification gate between every phase:
|
||||||
5. **Verify** again; alert on failure.
|
5. **Verify** again; alert on failure.
|
||||||
|
|
||||||
**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
|
**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
|
||||||
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
|
**before** the rest follow — so a DNS/Caddy failure doesn't make every host look
|
||||||
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
|
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
|
||||||
|
|
||||||
### 4. Snapshot-before is the rollback mechanism
|
### 4. Snapshot-before is the rollback mechanism
|
||||||
|
|
@ -64,8 +67,8 @@ Because these are primarily Proxmox VMs, take a **VM snapshot before the Friday
|
||||||
### 5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first
|
### 5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first
|
||||||
|
|
||||||
Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks**
|
Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks**
|
||||||
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` plan and
|
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` design in
|
||||||
ADR-010) does:
|
`docs/TODO.md` 8.3, not yet built) does:
|
||||||
|
|
||||||
1. Read changelogs / breaking-change notes for each pinned stateful image; diff the
|
1. Read changelogs / breaking-change notes for each pinned stateful image; diff the
|
||||||
pinned tag against what's available.
|
pinned tag against what's available.
|
||||||
|
|
@ -125,10 +128,26 @@ alert-driven.
|
||||||
| -------------------------------------- | ----------------------------------------------------------------------------- |
|
| -------------------------------------- | ----------------------------------------------------------------------------- |
|
||||||
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
|
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
|
||||||
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data. |
|
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data. |
|
||||||
| Digest-pinning the stateful tier | Unreadable in diffs; snapshot-before + backups give the immutability instead. |
|
| Digest-_only_ pin (no readable tag) for stateful | Unreadable in diffs — the tiered rule pins `tag@digest` (readable tag *and* digest) instead (Decision 2). |
|
||||||
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
|
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
|
||||||
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the _analysis_ is automated. |
|
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the _analysis_ is automated. |
|
||||||
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
|
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
|
||||||
| 8-weekly as the only stateful path | Too slow for urgent CVEs — hence the DIUN security fast-path. |
|
| 8-weekly as the only stateful path | Too slow for urgent CVEs — hence the DIUN security fast-path. |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- A single uniform update policy is rejected: the stateful/stateless split is
|
||||||
|
load-bearing, so stateless services roll on rolling tags while stateful services are
|
||||||
|
pinned `tag@digest`, human-gated, and backup-first (see "What was ruled out").
|
||||||
|
- The weekly run never touches stateful services and the whole fleet is never updated
|
||||||
|
at once, accepting the added orchestration of host ordering and an 8-weekly +
|
||||||
|
fast-path cadence in exchange for bounded blast radius (see "What was ruled out").
|
||||||
|
- No update automation ships until the health-check verification gate is in order; the
|
||||||
|
pipeline is deliberately sequenced behind that harness (see Decision 6).
|
||||||
|
- Several points remain open for discussion (see "Open questions"): where the Proxmox
|
||||||
|
snapshot is driven from across the TF/Ansible boundary; the exact cadences; where the
|
||||||
|
health-check harness lives and the minimum bar that counts as "in order"; whether
|
||||||
|
classification is a per-role `__stateful` flag or a group_vars list; whether the
|
||||||
|
weekly run hits staging first; and the notification + "skip/pause" control channel.
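
The tiered pinning rule above (stateful services pinned `tag@digest`, stateless services free to roll) could be checked mechanically along these lines. This is a minimal sketch under stated assumptions: the `services` mapping, its `stateful` and `image` keys, the `registry.example` names, and the function `pin_violations` are hypothetical and not taken from the repo; only the `tag@digest` requirement and the `false` default come from the ADR.

```python
import re

# Readable tag AND digest, e.g. "name:1.0.0@sha256:<64 hex chars>".
PINNED = re.compile(r"^[^@\s]+:[^@:\s/]+@sha256:[0-9a-f]{64}$")


def pin_violations(services: dict) -> list:
    """Return names of stateful services whose image is not pinned tag@digest."""
    return sorted(
        name
        for name, svc in services.items()
        # Stateless is the default, mirroring `<role>__stateful` defaulting to false.
        if svc.get("stateful", False) and not PINNED.match(svc["image"])
    )


digest = "a" * 64  # placeholder, not a real image digest
services = {
    "vaultwarden": {"stateful": True, "image": f"registry.example/vaultwarden:1.0.0@sha256:{digest}"},
    "forgejo": {"stateful": True, "image": "registry.example/forgejo:latest"},
    "whoami": {"stateful": False, "image": "registry.example/whoami:latest"},
}
print(pin_violations(services))  # → ['forgejo']
```

A digest-only reference (no readable tag) is also reported, which matches the "Digest-only pin" row in the ruled-out table.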
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,9 @@
|
||||||
# ADR-012 — Hardware reference & capacity evaluation
|
# ADR-012 — Hardware reference & capacity evaluation
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-01)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
The repo modelled the logical/network layer (Terraform VM specs, ADR-007
|
The repo modelled the logical/network layer (Terraform VM specs, ADR-007
|
||||||
|
|
@ -13,6 +17,8 @@ workload that should move, or a node due an upgrade.
|
||||||
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
|
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
|
||||||
physical compute + network gear and workload placement intent. Two
|
physical compute + network gear and workload placement intent. Two
|
||||||
machine-readable tables (node capacity, workload placement) carry the numbers.
|
machine-readable tables (node capacity, workload placement) carry the numbers.
|
||||||
|
This includes `ubongo`, the physical control node (ADR-015), even though it sits
|
||||||
|
outside the Proxmox cluster.
|
||||||
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
|
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
|
||||||
parses those tables, computes per-node allocated-vs-physical rollups, and
|
parses those tables, computes per-node allocated-vs-physical rollups, and
|
||||||
cross-checks workload hostnames against `terraform output -json` /
|
cross-checks workload hostnames against `terraform output -json` /
|
||||||
|
|
@ -34,5 +40,11 @@ workload that should move, or a node due an upgrade.
|
||||||
- Right-sizing advice is intent-based until usage data exists; reports say so.
|
- Right-sizing advice is intent-based until usage data exists; reports say so.
|
||||||
- `reference.md` table headers are a parser contract — changing them needs a
|
- `reference.md` table headers are a parser contract — changing them needs a
|
||||||
matching `capacity-scan.py` change.
|
matching `capacity-scan.py` change.
|
||||||
|
- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
|
||||||
|
budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
|
||||||
|
**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
|
||||||
|
not assumed.
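
The allocated-vs-physical rollup that `scripts/capacity-scan.py` is described as computing can be illustrated with a small stdlib-only sketch. It is not the actual script: the column names (`node`, `host`, `ram_gib`), the node and host names, and both helper functions are assumptions made for the example, and the real `reference.md` headers may differ.

```python
def parse_table(markdown: str) -> list:
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.strip().splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    return [dict(zip(header, row)) for row in body]


def ram_rollup(nodes: list, workloads: list) -> dict:
    """Map each node to (allocated GiB, physical GiB)."""
    allocated = {n["node"]: 0 for n in nodes}
    for w in workloads:
        allocated[w["node"]] += int(w["ram_gib"])
    return {n["node"]: (allocated[n["node"]], int(n["ram_gib"])) for n in nodes}


# Hypothetical tables standing in for the two machine-readable tables.
NODES = """
| node | ram_gib |
|---|---|
| pve1 | 32 |
| pve2 | 16 |
"""
WORKLOADS = """
| host | node | ram_gib |
|---|---|---|
| dns1 | pve1 | 2 |
| media1 | pve1 | 8 |
| git1 | pve2 | 4 |
"""
rollup = ram_rollup(parse_table(NODES), parse_table(WORKLOADS))
for node, (used, total) in rollup.items():
    print(f"{node}: {used}/{total} GiB allocated")
```

Because the parser keys every row by header text, renaming a header breaks the lookup, which is why the ADR treats the table headers as a parser contract.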
|
||||||
|
|
||||||
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).
|
## Related
|
||||||
|
|
||||||
|
ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,9 @@
|
||||||
# ADR-013 — Heritage: learning from AnsibleBaobabV4 without inheriting it
|
# ADR-013 — Heritage: learning from AnsibleBaobabV4 without inheriting it
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-04)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
boma is the methodology successor to AnsibleBaobabV4 (and V3 before it) — not a new
|
boma is the methodology successor to AnsibleBaobabV4 (and V3 before it) — not a new
|
||||||
|
|
@ -10,7 +14,9 @@ structure and assumptions creep back in under the guise of "inspiration." This A
|
||||||
sets the policy for drawing on V4 without inheriting it. (Resolves the questions
|
sets the policy for drawing on V4 without inheriting it. (Resolves the questions
|
||||||
previously parked in TODO 3.3 and 10.1.)
|
previously parked in TODO 3.3 and 10.1.)
|
||||||
|
|
||||||
## Principle — translate, don't transplant
|
## Decision
|
||||||
|
|
||||||
|
### Principle — translate, don't transplant
|
||||||
|
|
||||||
V4 is **evidence, never authority.** It can show what was needed or what went wrong;
|
V4 is **evidence, never authority.** It can show what was needed or what went wrong;
|
||||||
it can never be the reason boma does something a certain way.
|
it can never be the reason boma does something a certain way.
|
||||||
|
|
@ -21,7 +27,7 @@ it can never be the reason boma does something a certain way.
|
||||||
- **Acceptance test** for anything V4-derived: *can it be justified purely from
|
- **Acceptance test** for anything V4-derived: *can it be justified purely from
|
||||||
boma's principles, with zero reference to V4?* If not, it does not land.
|
boma's principles, with zero reference to V4?* If not, it does not land.
|
||||||
|
|
||||||
## What V4 is — and is not — a source of
|
### What V4 is — and is not — a source of
|
||||||
|
|
||||||
| Legitimate source of | Never a source of |
|
| Legitimate source of | Never a source of |
|
||||||
|---|---|
|
|---|---|
|
||||||
|
|
@ -33,7 +39,7 @@ it can never be the reason boma does something a certain way.
|
||||||
Only concrete, verifiable, low-level knowledge crosses over — precisely because it is
|
Only concrete, verifiable, low-level knowledge crosses over — precisely because it is
|
||||||
safe to re-derive, whereas structure and requirements drag assumptions along.
|
safe to re-derive, whereas structure and requirements drag assumptions along.
|
||||||
|
|
||||||
## Provenance — transient only
|
### Provenance — transient only
|
||||||
|
|
||||||
When a boma decision was prompted by a V4 lesson, or a config adapted from V4, the
|
When a boma decision was prompted by a V4 lesson, or a config adapted from V4, the
|
||||||
lineage is recorded only in **transient** places: the commit message, the working
|
lineage is recorded only in **transient** places: the commit message, the working
|
||||||
|
|
@ -42,7 +48,7 @@ extraction warrants one. **Durable artifacts (ADRs, role READMEs, `SECURITY.md`)
|
||||||
stand on boma's own terms with no V4 reference.** Honest about lineage in history;
|
stand on boma's own terms with no V4 reference.** Honest about lineage in history;
|
||||||
clean in the living repo.
|
clean in the living repo.
|
||||||
|
|
||||||
## AI consultation guardrails
|
### AI consultation guardrails
|
||||||
|
|
||||||
The AI is the main consumer of V4 — it is on disk and readable. When consulting it:
|
The AI is the main consumer of V4 — it is on disk and readable. When consulting it:
|
||||||
|
|
||||||
|
|
@ -68,5 +74,7 @@ copy.
|
||||||
cost of a clean methodological break.
|
cost of a clean methodological break.
|
||||||
- The policy is enforceable in review and by the AI guardrails above.
|
- The policy is enforceable in review and by the AI guardrails above.
|
||||||
|
|
||||||
See also: ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
|
## Related
|
||||||
|
|
||||||
|
ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
|
||||||
(update management — ntfy topics decided fresh per this policy).
|
(update management — ntfy topics decided fresh per this policy).
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,9 @@
|
||||||
# ADR-014 — Sourcing technical knowledge (docs and best practices)
|
# ADR-014 — Sourcing technical knowledge (docs and best practices)
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-04)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
Most work in boma is done by AI agents drawing on training memory, which is stale
|
Most work in boma is done by AI agents drawing on training memory, which is stale
|
||||||
|
|
@ -85,10 +89,11 @@ The accelerators this policy prefers (`context7`, `deep-research`, `superpowers`
|
||||||
`claude-code-guide`) are **plugins under `~/.claude/`** — local per machine, **not**
|
`claude-code-guide`) are **plugins under `~/.claude/`** — local per machine, **not**
|
||||||
synced by Claude account and **not** carried by the git repo (only `.claude/commands`,
|
synced by Claude account and **not** carried by the git repo (only `.claude/commands`,
|
||||||
`.claude/hooks`, `.claude/settings.json` travel). A fresh clone therefore lacks the
|
`.claude/hooks`, `.claude/settings.json` travel). A fresh clone therefore lacks the
|
||||||
plugin toolchain until it is reinstalled. Making it reproducible from the repo
|
plugin toolchain until it is reinstalled. Making it reproducible from the repo is
|
||||||
(`extraKnownMarketplaces` + `enabledPlugins` in `.claude/settings.json`, plus a
|
**done** (TODO 10.7): `.claude/settings.json` declares `extraKnownMarketplaces` +
|
||||||
bootstrap step) is tracked in `docs/TODO.md` and tied to control-node/AI setup. Until
|
`enabledPlugins`, and `docs/runbooks/claude-code-setup.md` documents the per-machine
|
||||||
then, the graceful-degradation fallback above keeps the policy working.
|
bootstrap. Until a fresh clone runs that bootstrap, the graceful-degradation fallback
|
||||||
|
above keeps the policy working.
|
||||||
|
|
||||||
## Decision
|
## Decision
|
||||||
|
|
||||||
|
|
@ -99,5 +104,27 @@ then, the graceful-degradation fallback above keeps the policy working.
|
||||||
- Commit to the principle, not a tool — degrade to `WebFetch`/`WebSearch` when plugins
|
- Commit to the principle, not a tool — degrade to `WebFetch`/`WebSearch` when plugins
|
||||||
are absent.
|
are absent.
|
||||||
|
|
||||||
See also: ADR-013 (heritage / translate-don't-transplant), ADR-011 (version pinning),
|
## Consequences
|
||||||
ADR-008 (testing/verification).
|
|
||||||
|
Drawn from the follow-on work and limitations this ADR already states:
|
||||||
|
|
||||||
|
- Verified facts carry a durable, greppable stamp; a stamp binds a fact to a pinned
|
||||||
|
version, so a `requirements` change or image upgrade marks exactly what to re-check
|
||||||
|
(per Capture / Re-verification).
|
||||||
|
- Stale-stamp detection — a `/review-repo` or `/security-review` check flagging stamps
|
||||||
|
whose recorded version no longer matches what is pinned — is a noted enhancement, not
|
||||||
|
built yet (per Re-verification).
|
||||||
|
- Any version-specific claim given from memory must be marked "from memory, unverified"
|
||||||
|
as a transparency backstop, since agent self-assessed certainty is unreliable (per
|
||||||
|
When consulting is required).
|
||||||
|
- The policy commits to the principle rather than a specific plugin, so it degrades to
|
||||||
|
`WebFetch`/`WebSearch` on a bare install; reproducing the plugin toolchain from the
|
||||||
|
repo is done via `.claude/settings.json` and `docs/runbooks/claude-code-setup.md`,
|
||||||
|
with the graceful-degradation fallback covering a fresh clone until bootstrap runs
|
||||||
|
(per Source hierarchy / Reproducibility of the toolchain).
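
The stale-stamp check noted above as a future enhancement might look something like the sketch below. It assumes a single-line stamp of the form `> verified: <claim> · <tool> <version> on <host> · <date>`, inferred from the one stamp visible in this diff; the `pinned` mapping, the function `stale_stamps`, and the version `1.16.0` are hypothetical.

```python
import re

# Captures the "<tool> <version>" field that follows the first "·" of a stamp.
STAMP = re.compile(r"verified:.*?·\s*(?P<tool>[\w.-]+)\s+(?P<version>\d+(?:\.\d+)*)\b")


def stale_stamps(text: str, pinned: dict) -> list:
    """Return (tool, stamped_version, pinned_version) for stamps that no longer match."""
    stale = []
    for line in text.splitlines():
        m = STAMP.search(line)
        if m and m["tool"] in pinned and pinned[m["tool"]] != m["version"]:
            stale.append((m["tool"], m["version"], pinned[m["tool"]]))
    return stale


doc = "> verified: rbw offline-cache decryption · rbw 1.15.0 on ubongo · 2026-06-11"
print(stale_stamps(doc, {"rbw": "1.15.0"}))  # → []
print(stale_stamps(doc, {"rbw": "1.16.0"}))  # → [('rbw', '1.15.0', '1.16.0')]
```

A real check would also need to handle stamps wrapped across several lines, which this sketch does not attempt.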
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
- ADR-013 — heritage / translate-don't-transplant.
|
||||||
|
- ADR-011 — version pinning.
|
||||||
|
- ADR-008 — testing / verification.
|
||||||
|
|
|
||||||
192
docs/decisions/015-control-host.md
Normal file
|
|
@ -0,0 +1,192 @@
|
||||||
|
# ADR-015 — Control / development / AI-worker host (`ubongo`)
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-05). **Amended 2026-06-18:** the `claude` AI-worker account now has
|
||||||
|
`NOPASSWD:ALL` sudo on `ubongo` — reversing the original "no local sudo" sub-decision.
|
||||||
|
The amendment is recorded in §Access & security below; rationale and accepted risk are
|
||||||
|
in ADR-021 and `docs/security/accepted-risks.md` (R7).
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
|
||||||
|
as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
|
||||||
|
documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
|
||||||
|
the control node purely as a control-plane runner.
|
||||||
|
|
||||||
|
It fails four needs, all confirmed as drivers:
|
||||||
|
|
||||||
|
1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
|
||||||
|
something else creates it; the bootstrap is circular and awkward.
|
||||||
|
2. **Always-on availability** — the operator wants to SSH in from a work PC or
|
||||||
|
anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
|
||||||
|
or being rebuilt.
|
||||||
|
3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
|
||||||
|
inside the thing it rebuilds.
|
||||||
|
4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
|
||||||
|
with production VM lifecycle.
|
||||||
|
|
||||||
|
A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start
|
||||||
|
and recovery. A small dedicated always-on physical machine outside the cluster
|
||||||
|
satisfies all four.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
|
||||||
|
single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
|
||||||
|
It becomes *the* control node and collapses four roles into one box:
|
||||||
|
|
||||||
|
- Terraform + Ansible runner (control plane)
|
||||||
|
- Claude Code / AI-worker host the operator SSHes into
|
||||||
|
- Local test runner (Molecule/Docker, lint, and later a browser stack)
|
||||||
|
- Persistent dev home for the repo
|
||||||
|
|
||||||
|
There is **no longer a control VM on the cluster.** The `control` inventory group
|
||||||
|
points at this physical box. This *strengthens* the ADR-009 control-node exception:
|
||||||
|
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
|
||||||
|
Every other host stays a Terraform-managed VM exactly as designed.
|
||||||
|
|
||||||
|
`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a production
|
||||||
|
hypervisor and runs no `docker_host` services. It does run **ephemeral KVM test VMs**
|
||||||
|
as part of its local-test-runner role (ADR-025 — local VM integration testing): one
|
||||||
|
throwaway VM at a time (~3 GiB RAM), against ~13 GiB free of the 16 GiB sized here.
|
||||||
|
This is not a production workload — it is the concrete implementation of ADR-008 Level
|
||||||
|
2/3, and the resource guard enforces one-at-a-time to stay within the RAM ceiling.
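
One possible way to enforce the one-at-a-time rule is an exclusive, non-blocking file lock held for the lifetime of each test VM. The ADR does not say how the resource guard is implemented, so everything here is an assumption: the lock path, the `single_vm_slot` name, and the choice of `flock` itself. The sketch is POSIX-only.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_vm_slot(lock_path: str):
    """Hold an exclusive lock while one test VM runs; refuse a second holder."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes a second attempt fail immediately with BlockingIOError
        # instead of queueing behind the running VM.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


lock = os.path.join(tempfile.mkdtemp(), "test-vm.lock")  # hypothetical path
with single_vm_slot(lock):
    try:
        with single_vm_slot(lock):
            second = "started"
    except BlockingIOError:
        second = "refused"
print(second)  # → refused
```

Failing fast rather than waiting suits a RAM ceiling: a queued second VM would still start as soon as the first exits, but a refused one leaves the decision to the caller.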
|
||||||
|
|
||||||
|
### Hardware target
|
||||||
|
|
||||||
|
| Spec | Target | Why |
|
||||||
|
|---|---|---|
|
||||||
|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
|
||||||
|
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
|
||||||
|
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
|
||||||
|
| Network | Wired GbE | Always-on reliability over Wi-Fi |
|
||||||
|
| Power | Low draw (≤15 W idle) | Runs 24/7 |
|
||||||
|
|
||||||
|
Indicative hardware: a refurbished Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250).
|
||||||
|
Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
|
||||||
|
is **all testing being local** — Molecule (Docker), lint, and a future
|
||||||
|
headless-Chromium/Playwright stack.
|
||||||
|
|
||||||
|
### Provisioning (bootstrap path)
|
||||||
|
|
||||||
|
Manual, on bare metal:
|
||||||
|
|
||||||
|
1. Install Debian 13 on the box (one-time, by hand).
|
||||||
|
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
|
||||||
|
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
|
||||||
|
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
|
||||||
|
baseline config via the `control` group (`base` role only).
|
||||||
|
|
||||||
|
### Access & security
|
||||||
|
|
||||||
|
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
|
||||||
|
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
|
||||||
|
stays inside ADR-002.
|
||||||
|
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
|
||||||
|
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
|
||||||
|
denied on the physical NIC.
|
||||||
|
- **Operational reality (until the mesh exists):** the "SSH only on the mesh interface"
|
||||||
|
target above is the end state, not yet in force. Today remote access is **LAN SSH
|
||||||
|
only** — key-only, with password auth and root login disabled — until the NetBird mesh
|
||||||
|
(ADR-016) is stood up.
|
||||||
|
- **AI-worker identity:** `ubongo` runs the AI worker under a dedicated,
|
||||||
|
password-locked `claude` user (in the `docker` and `libvirt` groups; **`NOPASSWD:ALL`
|
||||||
|
sudo** via a repo-managed drop-in — see amendment below). It is reached via `sudo -iu
|
||||||
|
claude` or its own SSH key. The rationale is **attribution + revocation, not
|
||||||
|
containment**: auditd/Loki (ADR-018) can separate human from agent actions, and the
|
||||||
|
account/key can be revoked without touching the operator's access. (ADR-021 left the
|
||||||
|
on-`ubongo` agent identity unspecified; this records it.)
|
||||||
|
|
||||||
|
**Amendment (2026-06-18) — `claude` now has `NOPASSWD:ALL` sudo.**
|
||||||
|
> **Superseded by [ADR-025](025-local-vm-integration-testing.md)** (per ADR-023 §4): the
|
||||||
|
> "no local sudo" sub-decision is reversed. The shakedown that necessitated it is ADR-025;
|
||||||
|
> the resulting two-account access model is ADR-021; the accepted risk is R7.
|
||||||
|
|
||||||
|
During the
|
||||||
|
integration-testing harness shakedown, the original "no local sudo" sub-decision was
|
||||||
|
reversed. No-sudo blocked the AI-worker from diagnosing a failed VM: `virsh`,
|
||||||
|
`virt-install`, `cloud-localds`, `journalctl`, `nft` — nearly all low-level
|
||||||
|
diagnostic commands — require root. The AI-worker must autonomously spin up,
|
||||||
|
inspect, and tear down test VMs without operator hand-holding; that is the harness's
|
||||||
|
core value proposition. Compensating controls make the risk acceptable:
|
||||||
|
|
||||||
|
1. `claude`'s password is **locked** (no interactive login, no `su claude` without the
|
||||||
|
operator's own credentials) — `NOPASSWD` sudo is the *only* sudo path.
|
||||||
|
2. `auditd` + Loki attribution (ADR-018) separates human from agent root actions.
|
||||||
|
3. The drop-in is **repo-managed** via `base__ai_worker_user` — revocable in one commit
|
||||||
|
and one deploy.
|
||||||
|
4. Single-operator homelab: everything in git, off-machine backups (ADR-022).
|
||||||
|
|
||||||
|
The operator (`sjat`) uses **password-required sudo** via the `sudo` group; their
|
||||||
|
former `NOPASSWD` drop-in was removed 2026-06-18 as redundant once `claude` had sudo
|
||||||
|
(least-privilege cleanup). The accepted risk is registered as R7 in
|
||||||
|
`docs/security/accepted-risks.md`. ADR-021 records the resulting sudo model for both
|
||||||
|
accounts.
|
||||||
|
- **Disk encryption:** `ubongo`'s SSD is **not encrypted at rest** — the SanDisk X600 is
|
||||||
|
TCG-Opal-capable but Opal is unused. This is an accepted risk recorded in
|
||||||
|
`docs/security/accepted-risks.md` (control-node disk not encrypted at rest),
|
||||||
|
compensated by physical security, a BIOS supervisor password, and disabled
|
||||||
|
external/USB boot.
|
||||||
|
|
||||||
|
### Recovery model
|
||||||
|
|
||||||
|
`ubongo` is the rebuild tool, so three things must survive a full cluster loss:
|
||||||
|
|
||||||
|
1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
|
||||||
|
able to drive the fleet if `ubongo` dies.
|
||||||
|
2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
|
||||||
|
`mamba`). For a 2–5 VM fleet it is also reconstructable via `terraform import`.
|
||||||
|
3. **Vault password** — `ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
|
||||||
|
local encrypted copy of the vault and decrypts it offline with the operator's
|
||||||
|
Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
|
||||||
|
whole cluster down — provided `rbw` has synced once and the operator keeps the
|
||||||
|
Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
|
||||||
|
`mamba`.
|
||||||
|
|
||||||
|
There is always exactly one irreducible offline root secret; here it is the
|
||||||
|
Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
|
||||||
|
would make the control node run a service (against its remit) and still need that
|
||||||
|
master password.
|
||||||
|
|
||||||
|
> verified: rbw offline-cache decryption · rbw 1.15.0 on ubongo · with the Vaultwarden
|
||||||
|
> host blocked, `rbw sync` failed but `rbw get` decrypted the cached vault offline ·
|
||||||
|
> 2026-06-11
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- The control node is physical compute outside the cluster, so it appears in
|
||||||
|
`docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
|
||||||
|
- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
|
||||||
|
- A future **service-UI acceptance** testing level (Claude driving a headless browser
|
||||||
|
against a deployed service) is anticipated; `ubongo` is sized for it. The harness
|
||||||
|
is a separate spec.
|
||||||
|
|
||||||
|
## Deferred (separate specs / discussions)
|
||||||
|
|
||||||
|
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
|
||||||
|
(off-site, so it survives a homelab outage and stays out of the cluster it
|
||||||
|
administers). Replaces ADR-007's OPNsense WireGuard.
|
||||||
|
2. **Browser-E2E verification harness — RESOLVED (ADR-017):** Claude-driven
|
||||||
|
exploratory service-UI verification (`/verify-service`, ADR-008 Level 4), against
|
||||||
|
staging with test users in Authentik. Design + skill + standards complete; running
|
||||||
|
it is deferred until the stack is built.
|
||||||
|
3. **`rbw` offline-cache verification — RESOLVED (2026-06-11 build):** confirmed offline
|
||||||
|
cache decryption on rbw 1.15.0 — `rbw sync` fails with Vaultwarden unreachable while
|
||||||
|
`rbw get` still decrypts from the local cache (ADR-014).
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. |
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
|
||||||
|
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).
|
||||||
166
docs/decisions/016-mesh-vpn.md
Normal file
|
|
@ -0,0 +1,166 @@
|
||||||
|
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
|
||||||
|
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
|
||||||
|
`base` exists.
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
|
||||||
|
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
|
||||||
|
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
|
||||||
|
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
|
||||||
|
weigh. This ADR settles it.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
|
||||||
|
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.
|
||||||
|
|
||||||
|
The decision in five parts:
|
||||||
|
|
||||||
|
1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
|
||||||
|
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
|
||||||
|
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
|
||||||
|
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
|
||||||
|
survives a homelab outage and stays out of the cluster it administers.
|
||||||
|
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
|
||||||
|
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
|
||||||
|
parity) — ruled out below.
|
||||||
|
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (2–5
|
||||||
|
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
|
||||||
|
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
|
||||||
|
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
|
||||||
|
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
|
||||||
|
5. **Identity — embedded local users** (Dex in the management container); external SSO
|
||||||
|
(Zitadel/Keycloak) stays an optional future.
|
||||||
|
|
||||||
|
## Verified facts (ADR-014)
|
||||||
|
|
||||||
|
verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
|
||||||
|
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
|
||||||
|
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
|
||||||
|
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
|
||||||
|
|
||||||
|
verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
|
||||||
|
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
|
||||||
|
open-core feature gating.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
|
||||||
|
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
|
||||||
|
allocated for it.
|
||||||
|
|
||||||
|
- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
|
||||||
|
- `ubongo` — agent.
|
||||||
|
- All Linux managed hosts — agent via the `base` role.
|
||||||
|
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
|
||||||
|
- OPNsense / `mgmt` — single non-agent exception.
|
||||||
|
|
||||||
|
## Security
|
||||||
|
|
||||||
|
- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
|
||||||
|
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
|
||||||
|
privilege.
|
||||||
|
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
|
||||||
|
consumed by `base`; prefer ephemeral/scoped keys.
|
||||||
|
- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
|
||||||
|
(primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
|
||||||
|
mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
|
||||||
|
outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
|
||||||
|
explicit the control-node SSH allow that the recovery model already implied; the access
|
||||||
|
doctrine and the three-tier access ladder live in **ADR-021**.
|
||||||
|
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
|
||||||
|
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
|
||||||
|
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
|
||||||
|
Recorded as accepted-risk R3.
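
The ACL intent in the first bullet of the list above can be sketched as a default-deny lookup. This is only an illustration of the policy shape, not NetBird's policy format: the group names (`mesh`, `admin`, `client`), the destination labels (`srv`, `srv-metrics`, `mgmt`) and the `is_allowed` helper are hypothetical names chosen for the sketch.

```python
# Hypothetical model of the ACL intent; not NetBird's actual policy schema.
ALLOW = {
    "mesh": {"srv-metrics"},                  # mesh peers: srv metrics ports only
    "admin": {"srv", "srv-metrics", "mgmt"},  # admin peers (ubongo, mamba): srv + mgmt
    "client": set(),                          # clients: least privilege, nothing by default
}


def is_allowed(group: str, destination: str) -> bool:
    """Default-deny: permit only (group, destination) pairs that are listed."""
    return destination in ALLOW.get(group, set())
```

An unknown group falls through to the empty set, so anything not explicitly granted is denied.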
|
||||||
|
|
||||||
|
## Recovery & operations
|
||||||
|
|
||||||
|
- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
|
||||||
|
mesh/coordinator outage never blocks on-LAN runs.
|
||||||
|
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
|
||||||
|
`base` enrolls the fleet.
|
||||||
|
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
|
||||||
|
NetBird's management datastore is **intended** to be backed up encrypted off `askari`
|
||||||
|
(synced to `ubongo`/`mamba`; not yet built — see the Availability amendment / R8); peers
|
||||||
|
keep last-known config through a brief coordinator outage.
|
||||||
|
- **`askari` is Ansible-managed:** its own inventory group `offsite_hosts` — provisioned
|
||||||
|
as **Terraform IaC** (`hetznercloud/hcloud`), managed independently of the Proxmox
|
||||||
|
cluster (its own provider + local state). Ansible configuration: `base` role, plus a
|
||||||
|
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
|
||||||
|
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
|
||||||
|
version-pinned (ADR-011). boma's `dns` role stays authoritative for
|
||||||
|
`boma.baobab.band`; NetBird built-in DNS scoped/off.
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support; weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- A new public surface appears on `askari` — management API + dashboard (80/443) +
|
||||||
|
Coturn (3478) — mitigated by TLS, embedded-IdP login, source-IP limits where
|
||||||
|
practical, `base` hardening and version-pinned NetBird, and recorded as accepted-risk
|
||||||
|
R3 (Security).
|
||||||
|
- On-LAN SSH never depends on the mesh: `base` allows inbound SSH from `ubongo`'s LAN
|
||||||
|
address as a mesh-independent secondary path, so a mesh/coordinator outage never
|
||||||
|
blocks on-LAN SSH and Ansible stays off the mesh (Security; Recovery & operations).
|
||||||
|
- The mesh survives a homelab outage because the coordinator is off-site on `askari`,
|
||||||
|
with its management datastore **intended** to be backed up encrypted off `askari` (not yet built — see the Availability amendment / R8) and peers keeping
|
||||||
|
last-known config through a brief coordinator outage (Recovery & operations).
|
||||||
|
- Choosing NetBird over plain OPNsense WireGuard, Tailscale, Tailscale+Headscale, an
|
||||||
|
on-cluster coordinator, a `ubongo` subnet router, and a standalone IdP gains
|
||||||
|
identity/ACL policy, self-hosted sovereignty, no routing SPOF, and a light
single-operator footprint (What was ruled out).
|
||||||
|
- Implementation is pending: the role tasks land only once the unbuilt `base` role and
|
||||||
|
service-role machinery exist (Status).
|
||||||
|
|
||||||
|
## Availability — an `askari` outage (amendment 2026-06-20)
|
||||||
|
|
||||||
|
The coordinator is deliberately **single** (one off-site host). Recorded here so its
|
||||||
|
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).
|
||||||
|
|
||||||
|
The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
|
||||||
|
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
|
||||||
|
radius**:
|
||||||
|
|
||||||
|
| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |
|
||||||
|
|
||||||
|
Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
|
||||||
|
is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
|
||||||
|
operations, above).
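
The "not a default gateway" point can be illustrated with a small sketch: only destinations inside the overlay CIDR leave via `wt0`, and everything else follows the host's default route. This is a simplified model (a real host does longest-prefix matching over its full routing table), and the `egress_interface` helper and the `"default"` label are hypothetical.

```python
import ipaddress

# Overlay CIDR as stated in this amendment.
OVERLAY = ipaddress.ip_network("100.99.0.0/16")


def egress_interface(destination: str) -> str:
    """Return 'wt0' for overlay addresses, otherwise the default route."""
    if ipaddress.ip_address(destination) in OVERLAY:
        return "wt0"
    return "default"
```

Under this model LAN and internet destinations never touch the mesh interface, which is why an `askari` outage leaves them unaffected.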
|
||||||
|
|
||||||
|
**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
|
||||||
|
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
|
||||||
|
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
|
||||||
|
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
|
||||||
|
hosts get the same pin via `base__mesh_coordinator_pin`.
|
||||||
|
|
||||||
|
**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
|
||||||
|
default-deny posture; only helps established sessions), a second relay (needs another public
|
||||||
|
host / reintroduces the home public surface), a second coordinator (unsupported by
|
||||||
|
self-hosted NetBird; against this ADR).
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
|
||||||
|
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
|
||||||
|
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
|
||||||
|
ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).
|
||||||
112
docs/decisions/017-service-ui-verification.md
Normal file
|
|
@ -0,0 +1,112 @@
|
||||||
|
# ADR-017 — Service-UI acceptance verification (Level 4)
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
|
||||||
|
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
|
||||||
|
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
|
||||||
|
Level 4 stub. Nothing below Level 4 exercises a service's **application UI** — none
|
||||||
|
answer "does PhotoPrism actually let me log in, upload a photo, and see a thumbnail?"
|
||||||
|
(TODO 8.2). The operator's ask (TODO 2.2 headless browsing + TODO 2.3 test users +
|
||||||
|
manual-test instruction): Claude spins up a browser, *sees* the service UI, exercises
|
||||||
|
it, generates test users, and instructs the operator on manual tests. Today Claude sees
|
||||||
|
a browser only passively (`/screenshot` fetches operator-taken shots from `mamba`); this
|
||||||
|
is the active counterpart.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
A Claude-driven exploratory service-UI verification harness — **Level 4** — invoked as
|
||||||
|
`/verify-service <name>` on `ubongo`. Five settled forks:
|
||||||
|
|
||||||
|
1. **Claude-driven exploratory** — Claude navigates with judgment, not deterministic
|
||||||
|
scripts. A scripted regression suite is explicitly not built here.
|
||||||
|
2. **Interactive, Claude-in-the-loop** — exploratory judgment can't be a headless cron
|
||||||
|
gate; scheduled smoke testing is a deterministic job, left to health checks / Uptime Kuma later.
|
||||||
|
3. **Staging, full exercise** — Claude creates test users and exercises features
|
||||||
|
(incl. destructive flows) against a *staging* deploy; the rebuildable sandbox
|
||||||
|
resolves safety.
|
||||||
|
4. **Test users in Authentik (central IdP), real SSO flow** — authenticates through
|
||||||
|
Caddy (ADR-024) + Authentik as a real user would.
|
||||||
|
5. **Per-service `VERIFY.md` backbone + free exploration** — each service role ships an
|
||||||
|
acceptance spec of critical journeys; Claude executes it and explores beyond it.
|
||||||
|
|
||||||
|
## VERIFY.md standard
|
||||||
|
|
||||||
|
Every service role ships a populated `roles/<service>/VERIFY.md`, copied from
|
||||||
|
`docs/testing/service-verify-template.md` — parallel to `SECURITY.md` from
|
||||||
|
`service-security-template.md`. A new role convention. It lists the service's critical
|
||||||
|
user journeys (what "working" means), what good looks like, and what is not
|
||||||
|
browser-verifiable (→ manual handoff). It also joins the pre-production gate in
|
||||||
|
`docs/security/service-checklist.md`.
|
||||||
|
|
||||||
|
## Test-user standard (TODO 2.3)
|
||||||
|
|
||||||
|
Test identities live only in the **staging** Authentik (never production): a dedicated
|
||||||
|
`test` group / naming prefix; ephemeral per-run credentials (staging is rebuildable, so
|
||||||
|
nothing persisted, none in `vault.yml`); reuse-or-create; teardown via staging rebuild
|
||||||
|
or explicit `test`-group cleanup.
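
A minimal sketch of the reuse-or-create and cleanup rules, assuming a `test-` naming prefix and an in-memory dictionary standing in for the staging Authentik user store. Neither the prefix value nor the helper names come from this ADR, and no Authentik API is called here.

```python
import secrets

TEST_PREFIX = "test-"  # assumed prefix; the ADR only requires a dedicated one


def reuse_or_create(store: dict, service: str) -> tuple:
    """Return (username, password) for the service's test user.

    `store` stands in for the staging identity store during one run.
    Credentials stay in memory only; nothing is written to vault.yml.
    """
    username = f"{TEST_PREFIX}{service}"
    if username not in store:
        store[username] = secrets.token_urlsafe(24)  # fresh per-run secret
    return username, store[username]


def cleanup(store: dict) -> None:
    """Explicit test-group cleanup: drop every prefixed identity."""
    for name in [n for n in store if n.startswith(TEST_PREFIX)]:
        del store[name]
```

A second call for the same service within a run returns the existing credentials rather than creating a duplicate user.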
|
||||||
|
|
||||||
|
## Reporting & manual handoff
|
||||||
|
|
||||||
|
`/verify-service` writes `docs/testing/reviews/YYYY-MM-DD-<service>.md` (+ `latest.md`),
|
||||||
|
mirroring `/review-repo` and `/capacity-review`: pass/fail per `VERIFY.md` journey,
|
||||||
|
observations, the test-user/env used, a verdict, and a structured **manual-test
|
||||||
|
checklist** for anything Claude can't do (physical device, paid/external flow,
|
||||||
|
subjective judgment) — the "instruct me on tests" output. Screenshots are saved to a
|
||||||
|
git-ignored working dir on `ubongo` (PNG bloat + secret-leak risk); the report links
|
||||||
|
them.
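
The report naming convention can be sketched as below. Placing `latest.md` in the same directory as the dated report is an assumption, and `report_paths` is a hypothetical helper rather than part of the `/verify-service` skill.

```python
from datetime import date
from pathlib import PurePosixPath

REVIEWS_DIR = PurePosixPath("docs/testing/reviews")


def report_paths(service: str, run_date: date) -> tuple:
    """Return the dated report path and the assumed `latest.md` path."""
    dated = REVIEWS_DIR / f"{run_date.isoformat()}-{service}.md"
    return dated, REVIEWS_DIR / "latest.md"
```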
|
||||||
|
|
||||||
|
## Safety
|
||||||
|
|
||||||
|
- **Staging-only guard** — the skill refuses to run against production (exploratory
|
||||||
|
clicking is destructive); ADR-002-aligned hard stop.
|
||||||
|
- **Confined blast radius** — test users only in the staging `test` group; the run
|
||||||
|
sticks to the target service.
|
||||||
|
- **No secrets leaked** — the git-ignored screenshot dir is the safety boundary;
|
||||||
|
avoid capturing credential screens.
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- `ubongo` (ADR-015) — runs the browser. Designed, not built.
|
||||||
|
- `playwright` Claude Code plugin — enabled when this lands (`claude-code-setup.md`).
|
||||||
|
- Authentik (CAPABILITIES §2, planned) — central IdP for test users + SSO.
|
||||||
|
- A staging deploy of the service (ADR-008 Level 2) — staging is currently empty stubs.
|
||||||
|
- `make new-role` scaffolding `VERIFY.md` — deferred to when that scaffold is next touched.
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
| Option | Reason |
|---|---|
| Scripted Playwright regression suite | Operator wants exploratory judgment; scripts add maintenance burden. Could be a later layer, not this. |
| Scheduled headless smoke gate | Needs determinism the exploratory nature excludes; belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. |
| Free-form, no per-service spec | Non-repeatable, can miss a critical flow; `VERIFY.md` gives a backbone. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Caddy+Authentik path; central test users are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`. |
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- The harness is confined to staging by a hard stop: it refuses to run against
|
||||||
|
production because exploratory clicking is destructive, the blast radius is bounded to
|
||||||
|
the target service, and test users live only in the staging `test` group (Safety).
|
||||||
|
- No secrets leak: the git-ignored screenshot dir is the safety boundary and credential
|
||||||
|
screens are avoided (Safety; Reporting & manual handoff).
|
||||||
|
- Test identities are ephemeral per-run credentials in the staging Authentik only —
|
||||||
|
never production, none persisted in `vault.yml` — created reuse-or-create and torn
|
||||||
|
down via staging rebuild or `test`-group cleanup (Test-user standard).
|
||||||
|
- Anything Claude cannot exercise (physical device, paid/external flow, subjective
|
||||||
|
judgment) is handed off via a structured manual-test checklist in the run report
|
||||||
|
(Reporting & manual handoff).
|
||||||
|
- Authoring is possible now (this ADR, the `VERIFY.md` template, the `/verify-service`
|
||||||
|
skill, conventions/checklist edits), but running is deferred on its dependencies:
|
||||||
|
`ubongo`, the `playwright` plugin, Authentik, a staging deploy, and `make new-role`
|
||||||
|
scaffolding `VERIFY.md` (Status; Dependencies).
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
|
||||||
|
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).
|
||||||
124
docs/decisions/018-logging.md
Normal file
|
|
@ -0,0 +1,124 @@
|
||||||
|
# ADR-018 — Logging and log integrity
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
|
||||||
|
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
|
||||||
|
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
|
||||||
|
and the live pipeline.
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
boma wants all logs in one queryable store for troubleshooting, spotting issues over
|
||||||
|
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
|
||||||
|
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
|
||||||
|
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
|
||||||
|
off-site watchdog). Undecided: the architecture and the **integrity** question — an
|
||||||
|
attacker who roots a host will try to clear logs to cover their tracks.
|
||||||
|
|
||||||
|
The framing insight: the biggest anti-tampering win is that logs **leave the host in
|
||||||
|
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
|
||||||
|
local copy is futile. How far to harden the central store is set by the threat model.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
|
||||||
|
forensic-grade.
|
||||||
|
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
|
||||||
|
trends. Near-real-time shipping already defeats per-host track-covering.
|
||||||
|
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only** —
|
||||||
|
tamper-resistant against full-cluster compromise, at bounded volume.
|
||||||
|
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
|
||||||
|
proportionate control.
|
||||||
|
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
|
||||||
|
retention + wearout monitoring.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
|
||||||
|
+ container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
|
||||||
|
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
|
||||||
|
single-binary mode; NVMe; bounded retention.
|
||||||
|
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
|
||||||
|
only, write-only, long retention, tiny volume.
|
||||||
|
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
|
||||||
|
+ the alerting ADR-002 calls for.
|
||||||
|
|
||||||
|
## Data flow & the security subset
|
||||||
|
|
||||||
|
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
|
||||||
|
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
|
||||||
|
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
|
||||||
|
it syslog-forwards its alerts to the ingest point), and key container security events.
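
The fan-out described above can be modelled in a few lines. This is an illustration of the routing rule only, not Alloy configuration: the `route` helper, the entry shape and the lowercase source names are hypothetical, and "key container security events" are left out because the ADR does not enumerate them.

```python
# Sources named in the security subset above (lowercased for the sketch).
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


def route(entry: dict) -> dict:
    """Label one log entry and list the Loki instances that receive it."""
    labels = dict(entry.get("labels", {}))
    destinations = ["cluster"]              # everything goes to the cluster Loki
    if entry["source"] in SECURITY_SOURCES:
        labels["security"] = "true"         # tag applied by the match stage
        destinations.append("askari")       # filtered copy goes off-site
    return {"labels": labels, "destinations": destinations}
```

Every entry reaches the cluster store, and only tagged entries add the off-site destination, which keeps the `askari` volume bounded.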
|
||||||
|
|
||||||
|
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
|
||||||
|
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
|
||||||
|
hosts. The push API has no edit/delete verb, so a compromised host can append but not
|
||||||
|
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
|
||||||
|
(WAL) + retries across a brief outage.
|
||||||
|
|
||||||
|
## Security, integrity & residual risks
|
||||||
|
|
||||||
|
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
|
||||||
|
(append-only, off-cluster). The security trail survives full-cluster compromise.
|
||||||
|
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
|
||||||
|
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
|
||||||
|
shipping but not alter shipped history; **a host going silent is itself an alert**; a
|
||||||
|
stolen push credential appends noise but can't delete; an `askari` outage buffers +
|
||||||
|
flushes on reconnect.
|
||||||
|
|
||||||
|
## Retention & disk-wear
|
||||||
|
|
||||||
|
Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
|
||||||
|
bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr).
|
||||||
|
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
|
||||||
|
verbosity at source (sane levels, selective access logging, a targeted `auditd`
|
||||||
|
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
|
||||||
|
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
|
||||||
|
tracked allocation in `docs/hardware/reference.md` (ADR-012).
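
To put the off-site estimate in daily terms, the yearly range converts to an average rate as sketched below. This assumes decimal units (1 GB = 1000 MB) and an even spread over 365 days; the inputs are the ADR's own intent-based estimates, not measurements.

```python
def gb_per_year_to_mb_per_day(gb_per_year: float) -> float:
    """Convert a yearly volume in GB to an average daily rate in MB."""
    return gb_per_year * 1000 / 365


low = gb_per_year_to_mb_per_day(5)    # about 13.7 MB/day
high = gb_per_year_to_mb_per_day(25)  # about 68.5 MB/day
```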
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
|
||||||
|
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
|
||||||
|
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
|
||||||
|
(sibling effort, TODO 3.6).
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model: YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Opportunistic track-covering and host-pivot-to-store are defeated because logs leave
|
||||||
|
the host in near-real-time and the off-cluster security trail is append-only, so it
|
||||||
|
survives full-cluster compromise (Security, integrity & residual risks).
|
||||||
|
- Conscious residuals remain: append-only is not cryptographic WORM (root-on-`askari`
|
||||||
|
could edit chunks — R4); there is a few-seconds un-shipped window; agent compromise
|
||||||
|
can stop future shipping but not alter shipped history; a stolen push credential
|
||||||
|
appends noise but cannot delete; and an `askari` outage buffers then flushes on
|
||||||
|
reconnect (Security, integrity & residual risks).
|
||||||
|
- A host going silent is itself an alert (Security, integrity & residual risks).
|
||||||
|
- Only a bounded security subset ships off-site — `auditd`, `authpriv`, `fail2ban`,
|
||||||
|
AIDE, Suricata and key container security events tagged `security="true"` — while the
|
||||||
|
cluster Loki holds everything, keeping off-site volume small (Data flow & the security
|
||||||
|
subset).
|
||||||
|
- Disk-wear is a managed parameter: log storage on NVMe/SSD or HDD never SD/USB flash,
|
||||||
|
bounded verbosity at source, tuned Loki retention/compaction, and monitored SSD
|
||||||
|
wearout/TBW with an alert; log storage is a tracked allocation in
|
||||||
|
`docs/hardware/reference.md` (Retention & disk-wear).
|
||||||
|
- The decision is authorable now but the live pipeline is deferred on the stack:
|
||||||
|
Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog config, and the
|
||||||
|
push-only credential (Status; Dependencies).
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
|
||||||
|
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
|
||||||
|
standard), ADR-011 (health checks — distinct from this).
|
||||||
113
docs/decisions/019-tagging.md
Normal file
|
|
@ -0,0 +1,113 @@
|
||||||
|
# ADR-019 — Tagging standard for targeted, predictable runs
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-06). Resolves TODO 3.7 ("Define a tagging standard that lets us
|
||||||
|
target runs without over-tagging") and TODO 3.11 ("Deliberate tagging strategy").
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
boma wants to run playbooks **targeted** — a single service, a single layer, or a
|
||||||
|
single cross-cutting concern — **transparently and predictably**: a reader should
|
||||||
|
know from a `--tags` invocation exactly what it will and won't touch. CLAUDE.md
|
||||||
|
already requires tag-filterable tasks, but no vocabulary or convention existed, and
|
||||||
|
the TODO explicitly warns against the opposite failure mode: **over-tagging**.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### Two-tier tagging
|
||||||
|
|
||||||
|
**Tier 1 — role/service tag (mechanical).** The tag equals the role name, applied
|
||||||
|
once at the role-import level:
|
||||||
|
|
||||||
|
```yaml
roles:
  - role: photoprism
    tags: [photoprism]
```
|
||||||
|
|
||||||
|
Ansible propagates it to every task in the role. Because one service = one role
|
||||||
|
(ADR-004), this single rule covers both the *layer/role* and *single-service*
|
||||||
|
targeting axes with zero per-task burden. Role-less lifecycle playbooks
|
||||||
|
(e.g. `bootstrap.yml`) carry a single playbook-identity tag instead.
|
||||||
|
|
||||||
|
**Tier 2 — concern tag (curated).** A small **closed list** of cross-cutting concern
|
||||||
|
tags, applied per-task/block **only where a task genuinely belongs to that concern**.
|
||||||
|
|
||||||
|
### The closed concern list
|
||||||
|
|
||||||
|
A concern earns a tag only if it (a) appears in 2+ roles, (b) is worth running as a
|
||||||
|
slice on its own, and (c) doesn't overlap confusingly with another.
|
||||||
|
|
||||||
|
| Tag | Covers |
|-----|--------|
| `packages` | apt package install/management |
| `users` | accounts, groups, sudo |
| `firewall` | nftables rulesets & port definitions (ADR-002) |
| `hardening` | security baseline: sshd config, fail2ban, auditd, sysctl |
| `logging` | Alloy / log-shipping config (ADR-018) |
| `monitoring` | metric exporters / health checks |
| `config` | render templated config/compose files to disk, **no restart** |
| `deploy` | bring services up / restart (`compose up -d`) |
| `proxy` | reverse-proxy + TLS registration (Caddy routes, Authentik) |
|
||||||
|
|
||||||
|
The `config`/`deploy` split lets you re-render and diff configuration (`--tags
|
||||||
|
config`) without bouncing services, then restart deliberately (`--tags deploy`).
|
||||||
|
`backup` and `secrets` are intentionally omitted until the roles needing them exist.
|
||||||
|
|
||||||
|
### `always` / `never`
|
||||||
|
|
||||||
|
- **`always`** — reserved for cheap preflight assertions (vault unlocked, OS is
|
||||||
|
Debian 13, required vars present), so even `--tags config` runs its safety guards.
|
||||||
|
- **`never`** — reserved for destructive/expensive opt-in tasks, each paired with a
|
||||||
|
descriptive tag (e.g. `tags: [never, force_pull]`); they run only when named.
|
||||||
|
|
||||||
|
### Predictability principle: tags are union-only
|
||||||
|
|
||||||
|
`--tags a,b` runs tasks tagged a **OR** b — Ansible has no native AND. boma therefore
|
||||||
|
targets **one axis at a time**: either a role/service *or* a concern, never an
|
||||||
|
intersection like "photoprism's firewall only." If that's ever needed, just run
|
||||||
|
`--tags photoprism` (idempotent and fast). Designing for intersection is the
|
||||||
|
over-tagging trap; we decline it on purpose.
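
The union-only behaviour, together with the `always` / `never` rules above, can be modelled as a small predicate. This is a simplified sketch of Ansible's tag selection, not its implementation: it ignores `--skip-tags` and the `tagged` / `untagged` / `all` keywords, and the `runs` helper and the `jellyfin` role are hypothetical.

```python
from typing import Optional, Set


def runs(task_tags: Set[str], requested: Optional[Set[str]] = None) -> bool:
    """Return True if a task would run under this simplified --tags model."""
    if requested is None:
        # No --tags given: everything runs except opt-in `never` tasks.
        return "never" not in task_tags
    if task_tags & requested:
        # Union semantics: any one matching tag selects the task.
        return True
    # Preflight guards tagged `always` run even when not named.
    return "always" in task_tags and "never" not in task_tags


photoprism_fw = {"photoprism", "firewall"}
jellyfin_fw = {"jellyfin", "firewall"}  # hypothetical second role
wanted = {"photoprism", "firewall"}
```

With `wanted`, both firewall tasks are selected, so the request cannot express "photoprism's firewall only"; that is the intersection the standard declines to support.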
|
||||||
|
|
||||||
|
### Terraform / Proxmox VM tags (metadata only)
|
||||||
|
|
||||||
|
Every Terraform-managed VM carries exactly three Proxmox tags:
|
||||||
|
|
||||||
|
| Tag | Value | Purpose |
|-----|-------|---------|
| env | `staging` \| `production` | which environment |
| role/group | `docker_hosts`, `proxmox_hosts`, … | matches the inventory group |
| managed-by | `terraform` | distinguishes IaC VMs from hand-made ones |
|
||||||
|
|
||||||
|
These are **pure metadata for transparency** (glanceable in the Proxmox UI). They do
|
||||||
|
**not** drive run-targeting and do **not** feed inventory — `scripts/tf_to_inventory.py`
|
||||||
|
keeps building groups from the `group` output field, the single source of truth.
|
||||||
|
|
||||||
|
## Enforcement
|
||||||
|
|
||||||
|
`tests/tags.yml` is the single source of truth for the allowed concern/special/
|
||||||
|
opt-in/playbook tags. `scripts/check-tags.py` (run by `make lint`, covered by
|
||||||
|
`tests/test_check_tags.py`) scans `roles/` and `playbooks/` and fails on any tag
|
||||||
|
outside `{role directory names} ∪ {tests/tags.yml entries}`.
|
||||||
|
Molecule scenario files (`roles/*/molecule/**`) are excluded from the scan — they are test orchestration, not the production run-targeting surface this standard governs.
|
||||||
|
It also checks that every role imported in a play's `roles:` block carries its own role name as a tag (additional tags are allowed).
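The two checks described above reduce to set arithmetic. The sketch below is not the contents of `scripts/check-tags.py`; it only illustrates the rule, assuming the tags, role names, and vocabulary have already been collected into plain Python collections. Both function names are hypothetical.

```python
def unknown_tags(used_tags, role_names, vocabulary):
    """Return tags outside {role directory names} ∪ {tests/tags.yml entries}."""
    allowed = set(role_names) | set(vocabulary)
    return sorted(set(used_tags) - allowed)


def roles_missing_own_tag(role_imports):
    """role_imports: (role_name, tags) pairs taken from a play's `roles:` block.

    Returns the roles that do not carry their own name as a tag.
    """
    return [name for name, tags in role_imports if name not in tags]


# Hypothetical inputs for illustration only.
print(unknown_tags(["photoprism", "firewall", "fw"], ["photoprism"], ["firewall", "config"]))
print(roles_missing_own_tag([("base", ["base", "firewall"]), ("photoprism", ["deploy"])]))
```

A lint run would fail whenever either list is non-empty; extra tags on a role import are fine as long as the role's own name is among them.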
|
||||||
|
|
||||||
|
## Extending the vocabulary
|
||||||
|
|
||||||
|
To add a concern tag: (1) add it to `tests/tags.yml`; (2) add a row to the concern
|
||||||
|
table above with a one-line justification showing it passes the litmus test
|
||||||
|
(cross-cutting, 2+ roles, distinct). That is the whole gate — lightweight, but it
|
||||||
|
leaves a paper trail.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Targeted runs are predictable: only two kinds of tags exist, one of them mechanical.
|
||||||
|
- Over-tagging is structurally resisted (closed list + lint enforcement).
|
||||||
|
- Intersection targeting is unavailable by design.
|
||||||
|
- Authors must keep role tags = role names. `make lint` enforces both the *vocabulary* (every tag is a known role name or approved tag) and that each role import in a `roles:` block carries its own role-name tag (extra tags allowed).
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-002 (security baseline / firewall), ADR-004 (one service = one role),
|
||||||
|
ADR-009 (TF↔Ansible handoff / inventory), ADR-018 (logging).
|
||||||
150
docs/decisions/020-firewall.md
Normal file
|
|
@ -0,0 +1,150 @@
|
||||||
|
# ADR-020 — Firewall strategy: two-layer model with a shared service catalog
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-06). Resolves TODO 3.5 ("Decide the firewall strategy — which
|
||||||
|
firewall, ruleset, per-host vs central").
|
||||||
|
|
||||||
|
**Strategy ADR.** It pins the architecture and each layer's responsibilities; the
|
||||||
|
detailed builds are separate follow-up efforts (see *Scope*).
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
boma needs a firewall strategy that is predictable, declarative, and defends the stated
|
||||||
|
threat model — opportunistic external, lateral movement / blast radius, operator/agent
|
||||||
|
error (ADR-002). The pieces were already committed across other ADRs (`nftables`
|
||||||
|
default-deny on hosts — ADR-002; OPNsense at the perimeter — ADR-007; Docker with
|
||||||
|
`iptables: false` — ADR-004), but nothing tied them together: which layer owns what,
|
||||||
|
where firewall intent is declared, and how the layers stay consistent. Without that,
|
||||||
|
ports drift open ad-hoc and "per-host vs central" stays unanswered.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### Two layers, distinct jobs
|
||||||
|
|
||||||
|
**OPNsense — perimeter + inter-VLAN.** Owns the WAN edge and all policy *between zones*:
|
||||||
|
`lan`/`iot`/`guest` → `srv`, `mgmt` access, and the per-VLAN egress rules (ADR-007). It
|
||||||
|
is **structurally blind to intra-`srv` traffic** — services share the switched `srv`
|
||||||
|
subnet (VLAN 20), which never reaches the gateway.
|
||||||
|
|
||||||
|
**Host nftables — host-local + east-west within `srv`** (in the `base` role, every VM):
|
||||||
|
|
||||||
|
- **Default-deny inbound**; allow loopback + established/related.
|
||||||
|
- **East-west allowlist**: a service host accepts a connection only from declared
|
||||||
|
sources (e.g. the reverse proxy, a named peer) — the lateral-movement control OPNsense
|
||||||
|
cannot provide.
|
||||||
|
- **Permissive egress**: allow outbound + established/related; per-VLAN egress
|
||||||
|
restriction stays at OPNsense (ADR-007). Host-level egress allowlisting is
|
||||||
|
high-friction (every DNS/NTP/update/registry/webhook must be enumerated) for limited
|
||||||
|
added benefit once the VLAN already bounds where a host can go.
|
||||||
|
- **Docker**: daemon runs with `"iptables": false`; nftables owns all filtering,
|
||||||
|
including container traffic (ADR-004).
|
||||||
|
- **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
|
||||||
|
ADR-016), and the control node's LAN address (`base__firewall_control_addr`,
|
||||||
|
the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
|
||||||
|
catalog, applied atomically — a malformed or empty catalog can never lock out
|
||||||
|
management. The control-node source is part of the guaranteed plane, not the service
|
||||||
|
catalog (it is management, not a service); see ADR-021 for the access doctrine.
|
||||||
|
|
||||||
|
So "per-host vs central" is answered: **both**, with clear ownership.
|
||||||
|
|
||||||
|
### Single source of truth — a shared service catalog
|
||||||
|
|
||||||
|
A central, declarative **service catalog** in `group_vars/` is the one source of truth
|
||||||
|
for firewall intent (aligning with ADR-002's "port definitions live in `group_vars/`",
|
||||||
|
and keeping connectivity *topology* in inventory rather than in any one self-contained
|
||||||
|
service role — ADR-004). Each entry describes a service's **ingress**:
|
||||||
|
|
||||||
|
```yaml
photoprism:
  ingress:
    - { from: reverse_proxy, port: 2342, proto: tcp }
reverse_proxy:
  ingress:
    - { from: lan, port: 443, proto: tcp }
```
|
||||||
|
|
||||||
|
`from` is **symbolic**, resolved at render time: a host/group → IP(s) from inventory; a
|
||||||
|
role (`reverse_proxy`) → the host(s) filling it; a VLAN/zone (`lan`) → the subnet from
|
||||||
|
the ADR-007 table. This keeps the catalog readable and resilient to IP changes.
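As a rough illustration of render-time resolution, the sketch below maps a symbolic `from` value to concrete addresses. Everything concrete in it is an assumption: the host names, the documentation-range addresses (RFC 5737), the lookup order (zone, then group/role, then host), and the function name `resolve_source`. The real catalog schema is deferred to the follow-up spec.

```python
# Hypothetical inventory and zone data; addresses are RFC 5737 documentation ranges.
HOSTS = {"proxy01": "192.0.2.10", "photos01": "192.0.2.20"}
GROUPS = {"reverse_proxy": ["proxy01"]}   # role/group -> hosts filling it
ZONES = {"lan": "198.51.100.0/24"}        # VLAN/zone -> subnet


def resolve_source(name: str) -> list[str]:
    """Resolve a symbolic `from` value to a list of IPs or CIDR subnets."""
    if name in ZONES:
        return [ZONES[name]]
    if name in GROUPS:
        return [HOSTS[host] for host in GROUPS[name]]
    if name in HOSTS:
        return [HOSTS[name]]
    # Failing loudly keeps a typo from silently rendering an empty rule.
    raise KeyError(f"unknown firewall source: {name!r}")


print(resolve_source("reverse_proxy"))
print(resolve_source("lan"))
```

If the proxy host is re-addressed, only the inventory entry changes; every catalog line naming `reverse_proxy` picks up the new address on the next render.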
|
||||||
|
|
||||||
|
### Each layer renders only its own slice
|
||||||
|
|
||||||
|
| Ingress rule | Host nftables | OPNsense |
|---|---|---|
| `from: reverse_proxy` (a `srv` peer) | allow proxy IP → port | none (intra-`srv`, invisible) |
| `from: lan` (cross-VLAN) | allow `lan` subnet → port | allow `lan` → host:port |
|
||||||
|
|
||||||
|
The dominant pattern falls out naturally: most services are **proxied** — their only
|
||||||
|
ingress is `from: reverse_proxy`, and users reach them through the reverse proxy, which
|
||||||
|
alone carries `from: lan, port: 443` (matches "services sit behind the reverse proxy
|
||||||
|
with authentication", ADR-002).
|
||||||
|
|
||||||
|
This was chosen over a single connectivity model that generates both rule sets (too much machinery,
|
||||||
|
tight coupling of two very different rule domains) and over fully independent per-layer
|
||||||
|
declarations (real drift risk).
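The per-layer split can be sketched as a filter over the same catalog: host nftables takes every ingress rule, while OPNsense takes only those whose source is another zone. This is a minimal sketch under assumptions, not the deferred implementation. The function name and the "source is a cross-VLAN zone" membership test are hypothetical; the zone names come from the text above.

```python
CATALOG = {
    "photoprism": {"ingress": [{"from": "reverse_proxy", "port": 2342, "proto": "tcp"}]},
    "reverse_proxy": {"ingress": [{"from": "lan", "port": 443, "proto": "tcp"}]},
}
# Zones whose traffic to `srv` crosses the gateway (assumed membership test).
CROSS_VLAN_ZONES = {"lan", "iot", "guest"}


def split_by_layer(catalog, cross_vlan_zones):
    """Return (host_rules, opnsense_rules) as (service, source, port, proto) tuples."""
    host_rules, opnsense_rules = [], []
    for service, entry in catalog.items():
        for rule in entry.get("ingress", []):
            item = (service, rule["from"], rule["port"], rule["proto"])
            host_rules.append(item)           # the host always enforces its own ingress
            if rule["from"] in cross_vlan_zones:
                opnsense_rules.append(item)   # only cross-VLAN traffic reaches the gateway
    return host_rules, opnsense_rules


host_rules, opnsense_rules = split_by_layer(CATALOG, CROSS_VLAN_ZONES)
print(host_rules)
print(opnsense_rules)
```

Both layers read one declaration, so a port or source changed in the catalog changes in both renders together.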
|
||||||
|
|
||||||
|
### Off-cluster hosts — `askari` (Hetzner)
|
||||||
|
|
||||||
|
`askari` sits outside the Proxmox cluster and has no OPNsense. Its **perimeter** layer
|
||||||
|
is a TF-managed **Hetzner Cloud Firewall** (declared in `terraform/environments/offsite/`)
|
||||||
|
alongside the VM itself. Rule set: SSH inbound from `ubongo`'s public IP (M2), plus
|
||||||
|
TCP 80/443 + UDP 3478 opened in **M4a** (Caddy + NetBird). The `netbird_coordinator`
|
||||||
|
service role that uses 3478 lands in **M4b**; the ports are already open.
|
||||||
|
|
||||||
|
The `group_vars` service catalog remains authoritative for `askari`'s **host nftables**
|
||||||
|
layer — the same two-layer model applies, with Hetzner Cloud Firewall substituting for
|
||||||
|
OPNsense at the perimeter.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### OPNsense automation — owned here, mechanism deferred
|
||||||
|
|
||||||
|
OPNsense is Ansible-managed (CLAUDE.md: "OPNsense is entirely Ansible; no Terraform
|
||||||
|
OPNsense provider"). It renders the cross-VLAN slice of the catalog plus the static
|
||||||
|
ADR-007 facts. The **how** — config-XML templating vs the OPNsense API vs a plugin — is
|
||||||
|
deferred to the OPNsense-as-code follow-up spec. Recorded as an explicit open
|
||||||
|
sub-decision.
|
||||||
|
|
||||||
|
## Guardrails
|
||||||
|
|
||||||
|
- **The catalog is authoritative.** If a port is not in the catalog, it does not exist —
|
||||||
|
hardening the existing rule "never open a firewall port ad-hoc on a host" (ADR-002).
|
||||||
|
- **The `firewall` tag** (ADR-019) marks firewall tasks; `--tags firewall` re-renders
|
||||||
|
rules.
|
||||||
|
- **Drift detection (aspiration).** A deterministic check — in the spirit of
|
||||||
|
`scripts/check-tags.py` — comparing each host's live `nft` ruleset / listening ports
|
||||||
|
against the catalog and flagging anything undeclared. Ties to TODO 8.5
|
||||||
|
(`/security-review`). Not necessarily built first.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Lateral movement within `srv` is constrained — the gap OPNsense structurally can't
|
||||||
|
close.
|
||||||
|
- One declarative catalog → no ad-hoc ports and no cross-layer drift on shared facts
|
||||||
|
(ports, IPs, sources).
|
||||||
|
- Cost: the catalog + render-per-layer machinery must be built and maintained; east-west
|
||||||
|
allowlisting adds per-service ingress declarations (mitigated by proxied-by-default,
|
||||||
|
which keeps most entries to a single line).
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
**Decided here:** the two-layer model and responsibilities; host nftables = default-deny
|
||||||
|
inbound + east-west allowlist + permissive egress + guaranteed management plane + Docker
|
||||||
|
`iptables:false`; the shared `group_vars` catalog as single source of truth with
|
||||||
|
symbolic sources; each layer renders its own slice; the no-ad-hoc-ports guardrail.
|
||||||
|
|
||||||
|
**Deferred to follow-up specs (each its own brainstorm → plan):**
|
||||||
|
|
||||||
|
1. **Host nftables implementation** in `base` — catalog schema, nftables template,
|
||||||
|
Docker `iptables:false` integration, fail-safe ordering, Molecule tests. The natural
|
||||||
|
next spec.
|
||||||
|
2. **OPNsense-as-code** — tooling mechanism + cross-VLAN rule rendering.
|
||||||
|
3. **Drift-detection check** — if/when built.
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-002 (security baseline: nftables default-deny, fail2ban, blast radius),
|
||||||
|
ADR-004 (Docker model: `iptables:false`), ADR-007 (network topology, VLANs, OPNsense,
|
||||||
|
per-VLAN egress), ADR-016 (NetBird mesh: SSH on `wt0` only), ADR-019 (`firewall` tag),
|
||||||
|
ADR-021 (operational access doctrine; `ssh-from-control` management-plane source).
|
||||||
238
docs/decisions/021-operational-access.md
Normal file
|
|
@ -0,0 +1,238 @@
|
||||||
|
# ADR-021 — Operational access: documented, verifiable ways in
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-09). Resolves TODO 7.2 (what to set up on hosts given direct access
|
||||||
|
will be rare) and TODO 3.2 (the service admin-API access question). **Amended
|
||||||
|
2026-06-18:** the on-`ubongo` sudo model for the two local accounts is now settled
|
||||||
|
(see §Sudo model on `ubongo` below).
|
||||||
|
|
||||||
|
**Doctrine ADR.** It pins the operational-access doctrine, the declarative `access__*`
|
||||||
|
data model, the rendered `ACCESS.md` record, and the `/check-access` verifier. It does
|
||||||
|
**not** build any of them — `base`'s non-firewall concerns, service roles, and live
|
||||||
|
hosts do not exist yet. Designed now, built when there is something to access (see
|
||||||
|
*Scope*). Reconciles a latent contradiction between ADR-016 and ADR-020 (see
|
||||||
|
*Reconciliation*).
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
|
||||||
|
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
|
||||||
|
ports (ADR-002/ADR-020). That posture is correct — but it leaves one operational
|
||||||
|
question unanswered: **when a host or service breaks, how does the operator (and the AI
|
||||||
|
working from `ubongo`) actually get in to troubleshoot it?**
|
||||||
|
|
||||||
|
Troubleshooting is far more effective with *several* documented ways in (SSH, container
exec, logs, an admin API), so a single broken path does not leave the operator blind. Today boma has no
|
||||||
|
standard guaranteeing those paths exist, are documented, or still work. The risk is the
|
||||||
|
classic one: the access you assumed you had is stale exactly when you need it (key
|
||||||
|
rotated, API disabled, token expired).
|
||||||
|
|
||||||
|
boma already has the right *shape*. Service roles carry record docs — `SECURITY.md`
|
||||||
|
(security answers) and `VERIFY.md` (acceptance spec). What is missing is the third
|
||||||
|
sibling — an operational-access record — and the doctrine behind it.
|
||||||
|
|
||||||
|
Two constraints shape the decision:
|
||||||
|
|
||||||
|
1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
|
||||||
|
paths over *trusted* interfaces, never new exposed ports.
|
||||||
|
2. **A documented path that is never tested drifts** — it fails exactly when needed. So
|
||||||
|
the access facts must be *data* that both renders the doc and drives an active
|
||||||
|
verifier; the two can then never disagree.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### The doctrine
|
||||||
|
|
||||||
|
> **Every host and every service guarantees at least one documented, verifiable way in
|
||||||
|
> for operational troubleshooting — and the deploy that creates it also records and
|
||||||
|
> proves it.**
|
||||||
|
|
||||||
|
Access is a deployment deliverable, not something rediscovered under pressure. The deploy
|
||||||
|
that creates a host/service also records its access paths and (by design) proves them.
|
||||||
|
|
||||||
|
### Two layers
|
||||||
|
|
||||||
|
- **Host layer** (resolves TODO 7.2). Every host, via the `base` role, guarantees a fixed
|
||||||
|
access baseline: SSH over `wt0` and from `ubongo` (the ladder below), Docker/Compose
|
||||||
|
tooling present, and log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a
|
||||||
|
known, uniform set of paths exists over trusted interfaces. The break-glass console per
|
||||||
|
host class is recorded once at this layer. This is boma's answer to "what every host
|
||||||
|
runs for access."
|
||||||
|
- **Service layer** (resolves TODO 3.2). Every service role guarantees and records its
|
||||||
|
own paths: container exec + compose management, its Loki log labels, and its admin API
|
||||||
|
where one exists (enabled, token in vault, endpoint + health probe documented) — or an
|
||||||
|
explicit "no API."
|
||||||
|
|
||||||
|
### The three-tier access ladder
|
||||||
|
|
||||||
|
1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
|
||||||
|
before SSH sees it. The preferred path (ADR-016's original rationale).
|
||||||
|
2. **LAN SSH from `ubongo` only — secondary, mesh-independent.** All hardware but
|
||||||
|
`askari` shares a LAN. SSH from `ubongo`'s LAN address is allowed, giving a fallback
|
||||||
|
that survives a NetBird/`wt0` outage. It is gated by *source IP* (spoofable on a LAN)
|
||||||
|
**plus** the standing keys-only + fail2ban SSH hardening (ADR-002), so the marginal
|
||||||
|
cost is "SSH daemon reachable from one trusted LAN host" — modest and deliberate. All
|
||||||
|
*other* LAN hosts stay default-denied.
|
||||||
|
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, never
|
||||||
|
exercised for routine work:
|
||||||
|
   - **Cluster VMs** → Proxmox serial/VNC console, independent of the guest network,
     `wt0`, and even a broken guest nftables ruleset.
   - **`askari`** (bare-metal Hetzner) → provider rescue/console.
   - **`ubongo`** (physical) → local console.
|
||||||
|
|
||||||
|
A total mesh outage therefore still leaves at least one documented way in to each box.
|
||||||
|
|
||||||
|
### Reconciliation, not weakening
|
||||||
|
|
||||||
|
ADR-016 already requires Ansible to reach the fleet by LAN IP — "a mesh/coordinator
|
||||||
|
outage never blocks on-LAN runs" — which **requires** LAN SSH from `ubongo`. Yet ADR-016
|
||||||
|
also stated "SSH only on `wt0`," and ADR-020's guaranteed management plane listed only
|
||||||
|
`wt0`. That was a latent contradiction. ADR-021 resolves it by making the control-node
|
||||||
|
SSH allow **explicit** and adding it to the guaranteed management plane. This does **not**
|
||||||
|
weaken default-deny: it admits exactly one extra trusted source on the LAN (`ubongo`),
|
||||||
|
keys-only + fail2ban-gated; every other LAN host stays denied. ADR-016 and ADR-020 are
|
||||||
|
amended to cross-reference this ladder.
|
||||||
|
|
||||||
|
### The declarative `access__*` data model
|
||||||
|
|
||||||
|
Structured access facts live as **data** — the single source of truth that both renders
|
||||||
|
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge
|
||||||
|
(the firewall-catalog philosophy of ADR-020, applied to access).
|
||||||
|
|
||||||
|
Each service role's defaults carry:
|
||||||
|
|
||||||
|
```yaml
access__service: photoprism
access__compose_project: photoprism          # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db]   # exec targets
access__log:
  loki_labels: { service: photoprism }       # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "http://photoprism.srv:2342"     # reachable over the mesh
  firewall_ref: photoprism-api               # the catalog entry that opens it (ADR-020)
  auth: { vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status"              # what /check-access pings
# where the service has no API:
# access__api: { enabled: false, reason: "<none upstream>" }
```
|
||||||
|
|
||||||
|
**Invariant — `access__api` never opens a port.** It `firewall_ref`s an entry in the
|
||||||
|
`group_vars` firewall catalog; ADR-020 stays the **sole owner of exposure**. The access
|
||||||
|
data adds only *how to use* the path (endpoint, token ref, health probe) — no duplication,
|
||||||
|
no ad-hoc ports (CLAUDE.md: ports only in the catalog).
|
||||||
|
|
||||||
|
The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
|
||||||
|
uniform, so it is asserted by `base` and recorded once at the host/group level, not
|
||||||
|
re-stated per service.
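The invariant lends itself to a mechanical check: an enabled `access__api` must point at an existing catalog entry, and a disabled one must say why. The sketch below assumes the catalog's entry names are available as a set and that a disabled API carries a `reason` key, as in the commented YAML line. The function name and error strings are hypothetical.

```python
def check_api_ref(access: dict, catalog_entries: set[str]) -> list[str]:
    """Return a list of problems with one service's `access__api` block."""
    api = access.get("access__api")
    if api is None:
        return ["access__api missing: declare an API or an explicit 'no API'"]
    if not api.get("enabled"):
        # An explicit "no API" must carry its justification.
        return [] if api.get("reason") else ["disabled access__api needs a reason"]
    ref = api.get("firewall_ref")
    if ref not in catalog_entries:
        # The access data may only reference exposure; it never creates it.
        return [f"firewall_ref {ref!r} is not in the firewall catalog"]
    return []


photoprism = {"access__api": {"enabled": True, "firewall_ref": "photoprism-api"}}
print(check_api_ref(photoprism, {"photoprism-api"}))  # no problems
print(check_api_ref(photoprism, set()))               # dangling reference
```

A check of this kind would sit naturally next to the tag lint, since both verify that declarations refer only to names that already exist elsewhere.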
|
||||||
|
|
||||||
|
### The rendered record — `ACCESS.md`
|
||||||
|
|
||||||
|
`ACCESS.md` is a first-class sibling of `SECURITY.md`/`VERIFY.md`, **rendered** from the
|
||||||
|
`access__*` data with a prose tail for the narrative parts:
|
||||||
|
|
||||||
|
- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
|
||||||
|
exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
|
||||||
|
invocation.
|
||||||
|
- **Break-glass (generated from host class)** — the Proxmox/provider/local console line.
|
||||||
|
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
|
||||||
|
part a template cannot know.
|
||||||
|
|
||||||
|
A `docs/access/service-access-template.md` defines the shape, alongside the existing
|
||||||
|
security/verify templates.
|
||||||
|
|
||||||
|
### The verifier — `/check-access`
|
||||||
|
|
||||||
|
`/check-access <service|host>` runs from `ubongo` and turns the `access__*` data into
|
||||||
|
live probes, reporting which declared paths are green right now — the access analogue of
|
||||||
|
`/verify-service` (ADR-017). It probes mesh SSH, LAN SSH, exec + compose, Loki logs, and
|
||||||
|
the admin API health path; on any red it names the path and the likely cause. **Break-glass
|
||||||
|
is checked for reachability only, never exercised** — firing a serial console is invasive,
|
||||||
|
so the verifier confirms the fallback *exists* without disrupting anything. Designed now,
|
||||||
|
**build-pending on infra** (needs live hosts + staging + vault), exactly like
|
||||||
|
`/verify-service` under ADR-017.
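To make "turns the `access__*` data into live probes" concrete, here is a minimal sketch that only *plans* the service-layer probes and executes nothing. It is not the design of `/check-access`: the function name, the probe labels, and the choice of `docker compose ... ps` and `docker exec <container> true` as liveness checks are assumptions (the latter also assumes the image ships a `true` binary). SSH and break-glass probes are left out.

```python
PHOTOPRISM = {
    "access__compose_project": "photoprism",
    "access__containers": ["photoprism", "photoprism-db"],
    "access__log": {"loki_labels": {"service": "photoprism"}},
    "access__api": {
        "enabled": True,
        "base_url": "http://photoprism.srv:2342",
        "health_path": "/api/v1/status",
    },
}


def probe_plan(access: dict) -> list[tuple[str, str]]:
    """Derive (path, probe) pairs from one service's access data."""
    plan = [("compose", f"docker compose -p {access['access__compose_project']} ps")]
    for container in access["access__containers"]:
        plan.append(("exec", f"docker exec {container} true"))
    labels = access["access__log"]["loki_labels"]
    selector = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    plan.append(("logs", "{" + selector + "}"))  # LogQL stream selector
    api = access.get("access__api", {})
    if api.get("enabled"):
        plan.append(("api", api["base_url"].rstrip("/") + api["health_path"]))
    return plan


for path, probe in probe_plan(PHOTOPRISM):
    print(f"{path:8} {probe}")
```

Because the plan is derived from the same data that renders `ACCESS.md`, a path that appears in the record is by construction a path the verifier probes.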
|
||||||
|
|
||||||
|
### Governance
|
||||||
|
|
||||||
|
Three light touches, mirroring how `SECURITY.md`/`VERIFY.md` are enforced: the service
|
||||||
|
checklist (`docs/security/service-checklist.md`) gains an access item; the `new-role`
|
||||||
|
runbook gains a fill/render/`check-access` step (step 11: copy
|
||||||
|
`docs/access/service-access-template.md` into `roles/<service>/ACCESS.md` and populate the
|
||||||
|
`access__*` data); and a service-checklist gate item blocks clearance until the record
|
||||||
|
exists and `/check-access` is green (or a deviation is recorded in `accepted-risks.md`).
|
||||||
|
No scaffold change — same manual-copy-plus-review pattern the sibling records
|
||||||
|
(`SECURITY.md`/`VERIFY.md`) use.
|
||||||
|
|
||||||
|
### Sudo model on `ubongo` (amendment 2026-06-18)
|
||||||
|
|
||||||
|
The original ADR left on-`ubongo` local sudo unspecified. The integration-testing
|
||||||
|
harness shakedown settled it:
|
||||||
|
|
||||||
|
| Account | Role | Sudo |
|---|---|---|
| `claude` | Automated AI-worker | `NOPASSWD:ALL` via repo-managed drop-in (`base__ai_worker_user`) |
| `sjat` | Human operator | Password-required sudo via the `sudo` group |
|
||||||
|
|
||||||
|
**Rationale for `claude NOPASSWD`.** Without sudo, the AI-worker could not diagnose a
|
||||||
|
failed test VM: `virsh`, `virt-install`, `cloud-localds`, `nft`, `journalctl` —
|
||||||
|
almost every low-level diagnostic tool — require root. The harness's core value is
|
||||||
|
autonomous spin-up → apply → reboot → assert → diagnose; that loop collapses without
|
||||||
|
local root access.
|
||||||
|
|
||||||
|
**Compensating controls (R7 in `docs/security/accepted-risks.md`):**
|
||||||
|
- `claude`'s password is locked — `NOPASSWD` is the account's *only* sudo path; no
|
||||||
|
interactive login is possible.
|
||||||
|
- `auditd` + Loki attribution (ADR-018) separates human from agent root actions in the
|
||||||
|
audit trail.
|
||||||
|
- The drop-in is repo-managed and revocable in one commit + one deploy.
|
||||||
|
- Single-operator homelab; everything in git; off-machine backups (ADR-022).
|
||||||
|
|
||||||
|
**`sjat` NOPASSWD removed.** The operator's former `NOPASSWD` drop-in
|
||||||
|
(`/etc/sudoers.d/sjat-ansible`, added as an interim measure during M5 NetBird
|
||||||
|
enrolment) was removed 2026-06-18. It was redundant once `claude` held sudo, and its
|
||||||
|
removal restores least-privilege for the human operator. `sjat` retains full sudo
|
||||||
|
capability via the `sudo` group (password required).
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Every host and service has at least one documented, verifiable way in — and a verifier
|
||||||
|
that proves it, so stale access is caught before an outage, not during one.
|
||||||
|
- Doc and verifier share one source of truth (`access__*`), so they cannot drift apart.
|
||||||
|
- The management plane gains exactly one extra trusted LAN source (`ubongo`); attack
|
||||||
|
surface grows by one keys-only + fail2ban-gated SSH path, no new exposed ports.
|
||||||
|
- Cost: per-service `access__*` declarations and a rendered `ACCESS.md` to maintain
|
||||||
|
(mitigated by the uniform host baseline + the new-role runbook step + checklist gate), plus `/check-access` to build.
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
Delivered by ADR-021's implementation plan
|
||||||
|
(`docs/superpowers/plans/2026-06-09-operational-access.md`), task by task, and tracked in
|
||||||
|
`STATUS.md` as it lands — not all of it exists at the moment this ADR is written. The split
|
||||||
|
below is near-term tranche vs longer-term build-pending work, not what exists today vs what does not.
|
||||||
|
|
||||||
|
**Near-term tranche (this plan):** the doctrine; this ADR; the `ACCESS.md` template; the
|
||||||
|
`ssh-from-control` firewall management-plane source — added to ADR-020's *guaranteed
|
||||||
|
management plane* (the always-allowed block that already holds the `wt0` SSH/Ansible allow
|
||||||
|
and is explicitly independent of the service catalog), not added to the catalog itself (the
|
||||||
|
catalog owns service ingress only) — via the `base__firewall_control_addr` knob and its
|
||||||
|
nftables rule, both of which do **not** exist in `roles/base` yet and land with the
|
||||||
|
`firewall` concern of `base`; and the governance wiring (checklist item, new-role runbook step). ADR-016 and ADR-020 are amended to reference the ladder.
|
||||||
|
|
||||||
|
**Build-pending on infra:** per-service `access__*` data and rendered `ACCESS.md` files
|
||||||
|
(wait on service roles), `/check-access` *running* (waits on live hosts + staging + vault),
|
||||||
|
and the real `ubongo` LAN address value behind `base__firewall_control_addr`. Designed now,
|
||||||
|
built when there is something to verify.
|
||||||
|
|
||||||
|
**Out of scope:** broader LAN SSH (a management VLAN) — explicitly rejected, `ubongo`-only;
|
||||||
|
exercising (vs reachability-probing) the break-glass console; any access path that is not
|
||||||
|
over the mesh or the one `ubongo` LAN source.
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-002 (security baseline: SSH hardening, default-deny, fail2ban), ADR-004 (Docker
|
||||||
|
model, Compose), ADR-016 (NetBird mesh; amended — SSH on `wt0` **and** from `ubongo`'s
|
||||||
|
LAN address), ADR-017 (`/verify-service` Level-4 verification), ADR-018 (logging:
|
||||||
|
Alloy → Loki/Grafana), ADR-020 (firewall: service catalog + guaranteed management plane;
|
||||||
|
amended — adds the `ssh-from-control` management-plane source), ADR-019 (`firewall` tag).
|
||||||
277
docs/decisions/022-backup.md
Normal file
|
|
@ -0,0 +1,277 @@
|
||||||
|
# ADR-022 — Backup & disaster recovery: data-only restic, off-cluster pull node, 3-2-1
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-10). Resolves TODO 3.8 ("ensure the right things are backed up,
|
||||||
|
incl. DB dumps") and `CAPABILITIES.md` §9 (backup engine / off-site / air-gap, all
|
||||||
|
"planned"). Grounds ADR-011's "backup-first" and "snapshot + backup" language, which
|
||||||
|
assumed a backup policy existed but never defined one.
|
||||||
|
|
||||||
|
**Doctrine ADR.** It pins the recovery model, backup engine, topology, per-service
|
||||||
|
contract, encryption/escrow, restore-testing tiers, retention, alerting, and USB
|
||||||
|
air-gap mechanism. It does **not** build any of them — the `backup` role, `fisi`
|
||||||
|
node, per-service `backup__*` declarations, and `BACKUP.md` files do not exist yet.
|
||||||
|
Designed now, built in the implementation plan referenced at the foot of this ADR.
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
boma has no defined backup policy. The ADRs assume one exists — ADR-011 makes
|
||||||
|
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
|
||||||
|
path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
|
||||||
|
copies live, *how* they are encrypted, or *whether restores actually work*.
|
||||||
|
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
|
||||||
|
but commits to nothing.
|
||||||
|
|
||||||
|
The gap is not just theoretical. Every boma service is stateful in some dimension:
|
||||||
|
DB contents, bind-mount data dirs, the Vaultwarden vault that holds every secret in
|
||||||
|
the stack. Without a backup policy the IaC is not reproducible from nothing; it is
|
||||||
|
reproducible-modulo-data. This ADR closes that gap.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### 1. Recovery model — data-only backups, rebuild from code (Model A)
|
||||||
|
|
||||||
|
boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
|
||||||
|
Ansible re-renders the Docker Compose stack. Backups therefore protect **state only** —
|
||||||
|
DB contents, bind-mount data dirs, Vaultwarden's vault — not whole-VM images.
|
||||||
|
|
||||||
|
Recovery sequence: Terraform re-provisions the VM → Ansible redeploys → restic
|
||||||
|
restores the data. **No Proxmox Backup Server (PBS) in v1.** This keeps the 3-2-1
|
||||||
|
topology cheap, fits pCloud's 1 TB comfortably, and turns every restore drill into
|
||||||
|
a continuous proof that the IaC *and* the backups both work.
|
||||||
|
|
||||||
|
Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run
|
||||||
|
plus data restore, potentially hours), and it bets the repo is complete enough to
|
||||||
|
rebuild from nothing — which Tier-2 restore testing (Decision 8) exists to verify.
|
||||||
|
**PBS (Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO
|
||||||
|
proves too slow; nothing here precludes it.
|
||||||
|
|
||||||
|
### 2. One backup tier, ~24 h RPO
|
||||||
|
|
||||||
|
A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
|
||||||
|
the board. No per-data-type tiering yet — revisit once there is real-world data and
|
||||||
|
experience to justify the added machinery.
|
||||||
|
|
||||||
|
### 3. Engine — restic (data) + rclone (off-site); no second encryption layer
|
||||||
|
|
||||||
|
- **restic** captures state into an encrypted, deduplicated repository.
|
||||||
|
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
|
||||||
|
rclone has a first-class pCloud backend).
|
||||||
|
- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
|
||||||
|
encryption layer, no pCloud "crypto folder."
|
||||||
|
|
||||||
|
No PBS in v1 (see Decision 1).
|
||||||
|
|
||||||
|
### 4. Topology — central pull node (`fisi`), off the cluster; `backup_hosts` group

A single backup node owns the canonical restic repo. It is **off the Proxmox cluster**
— an independent failure domain, so copy 2 survives a PVE node (or the whole cluster)
dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
(off-site): a manually-provisioned physical node in its own inventory group, still
Ansible-managed (the `base` role applies, plus a `backup` role).

**Pull model.** `fisi` holds SSH keys to each host; per service it runs the declared
dump command remotely, pulls the declared paths read-only, then `restic` snapshots the
staged data into its local repo. **Hosts hold no backup credentials and cannot reach
the repo** — a compromised or ransomwared service host cannot delete backup history.

**Node assignment:** `fisi` (an HP Elite 600 G9 tower) is penciled in / provisional —
the *role* ("the backup node") is load-bearing; the physical assignment may be
revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
(ZFS or mdraid → 8 TB usable, survives one disk failure). It owns the repo, runs the
pull orchestration, runs `rclone → pCloud`, and docks the USB air-gap drives
(Decision 11).

**Inventory:** a new `backup_hosts` group is added to both inventories, structured
like `control` and `offsite_hosts`. The `base` role applies.

### 5. 3-2-1 mapping

| Copy | Location | Medium | Off-site? | Notes |
|---|---|---|---|---|
| 1 | Live data on each host | NVMe/SSD | no | The working data |
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Consequences) |
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |

≥3 copies, ≥2 media, ≥1 off-site — 3-2-1 satisfied, with the air-gap drive as a
fourth, offline copy that no online compromise can reach.

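The 3-2-1 claim can be checked mechanically against the table rows. The sketch below is illustrative only: the `Copy` class and `satisfies_3_2_1` helper are hypothetical names, not part of boma, and the media labels are simplified from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str
    offsite: bool


def satisfies_3_2_1(copies):
    """True when there are >=3 copies, on >=2 distinct media, with >=1 off-site."""
    media = {c.medium for c in copies}
    offsite = [c for c in copies if c.offsite]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1


# Rows transcribed (and simplified) from the 3-2-1 mapping table.
COPIES = [
    Copy("live data on each host", "NVMe/SSD", False),
    Copy("fisi restic repo", "HDD mirror", False),
    Copy("pCloud via rclone", "cloud", True),
    Copy("USB air-gap drive", "removable HDD", True),
]
```

Dropping both off-site rows (pCloud and USB) makes the check fail, which is the situation the air-gap drive is meant to guard against if pCloud is ever wiped by a sync.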
### 6. Per-service backup contract — `backup__*` data + `BACKUP.md`; governance

Each service role declares its backup needs in role vars — the same render-from-data
pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):

```yaml
backup__service: nextcloud          # identifier; matches the role / compose project
backup__state: true                 # false = stateless → no BACKUP.md (pair with a reason)
backup__paths:                      # bind-mount dirs / files holding state ([] = none)
  - /srv/nextcloud/data
backup__dumps:                      # logical app-consistent dumps ([] = none)
  - cmd: "docker compose -p nextcloud exec -T db pg_dump -U {{ vault.nextcloud.db_user }} nextcloud"
    dest: nextcloud-db.sql
backup__quiesce: false              # true = stop→back up→restart escape hatch (Decision 7 B)
```

The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
service with **no** `backup__paths` must explicitly declare `backup__state: false` with
a reason; omission is never an implicit "nothing to back up." (`backup__state` and the
list-form `backup__dumps` are this ADR's resolution of the spec's open "declared, not
silent" point.)

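The "declared, not silent" rule lends itself to a small validation pass over each role's vars. The following is a minimal sketch under stated assumptions: `validate_contract` is a hypothetical helper, the key `backup__state_reason` is an invented name for "the reason" (the ADR does not name that variable), and treating "no paths and no dumps" as the trigger is one reading of the rule above.

```python
def validate_contract(contract):
    """Return a list of problems found in one service's backup__* declaration."""
    problems = []
    # Omission is never an implicit "nothing to back up".
    if "backup__state" not in contract:
        problems.append("backup__state must be declared explicitly")
        return problems

    paths = contract.get("backup__paths", [])
    dumps = contract.get("backup__dumps", [])

    if contract["backup__state"]:
        if not paths and not dumps:
            problems.append("stateful service declares no paths or dumps")
    else:
        # Hypothetical key name; the ADR only says "with a reason".
        if not contract.get("backup__state_reason"):
            problems.append("backup__state: false needs a reason")
        if paths or dumps:
            problems.append("stateless service must not declare paths or dumps")

    for dump in dumps:
        if not dump.get("cmd") or not dump.get("dest"):
            problems.append("each dump needs both cmd and dest")
    return problems
```

An empty list means the declaration is acceptable; anything else is a message a gate or reviewer could surface.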
**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
`VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
what state exists, what is backed up, the dump command, and the per-service restore
procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
service declares `backup__state: false` (with a reason) in its role vars and gets **no**
`BACKUP.md`.

**Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
service checklist (`docs/security/service-checklist.md`) gains a backup item; the
`new-role` runbook gains a fill/render/`check-backup` step (copy
`docs/backup/service-backup-template.md` into `roles/<service>/BACKUP.md` and
populate the `backup__*` data); and a checklist gate blocks service clearance until
the record exists and a restore drill confirms it (or a deviation is recorded in
`accepted-risks.md`). The dormant `/check-backup` verifier is the automated
analogue of `/check-access` (ADR-021). **No automated lint script gates `BACKUP.md`
presence** — same manual-copy-plus-review pattern the sibling records use. The design
document's "make lint gates its presence" wording is superseded by this governance
choice.

### 7. Consistency — logical dumps first; quiesce as escape hatch

- **Default:** databases are captured with logical dumps (`pg_dump` / `mysqldump`) —
  portable, version-independent, restorable to a fresh DB. Plain data dirs are backed
  up as files. No downtime required.
- **Escape hatch:** a service whose data cannot be dumped live declares a quiesce step
  (stop container → back up volume → restart) via `backup__quiesce` in the same contract.
- ZFS/filesystem snapshots are **not** used as the sole DB method (only
  crash-consistent for a live database).

This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
way, each service declares how to dump its own data.

### 8. Restore testing — two tiers; `ubongo` stays bare Debian

- **Tier 1 — weekly, automated, rolling restore-verify.** Pick the next service in
  rotation, restore its latest snapshot into a throwaway container on `ubongo`
  (reusing the Molecule harness, ADR-015), start the app against the restored data,
  and run that service's `VERIFY.md` checks (ADR-008/017). This catches the failure
  that actually kills people — *silently corrupt or unrestorable backups*. Failures
  alert via ntfy.
- **Tier 2 — semi-annual full DR rehearsal,** driven from `ubongo` onto PVE staging.
  Rebuild a host from zero via Terraform + Ansible + restic restore on the staging
  cluster. This validates the whole Model-A recovery chain. **At least once a year the
  rehearsal exercises the paper-secret break-glass path** (Decision 10) end-to-end.

**`ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged).** Its role is to
be the independent recovery anchor — "the tool used to rebuild the cluster must not
live inside the thing it rebuilds." Higher-fidelity real-VM testing is better served
by the PVE staging environment (same hardware class, same cluster, same provisioning
path). `ubongo`'s 1 TB NVMe gives ample room for Tier-1 dataset restores; disk
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).

### 9. Retention — GFS via restic

Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
Tune once real repo growth is observed.

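To get a feel for what the starting policy retains, the bucket logic can be modelled in a few lines. This is a simplified model of grandfather-father-son selection (newest snapshot per day, ISO week, month and year, up to each limit), not restic's implementation; `gfs_keep` is a hypothetical name and the sample dates are invented for illustration.

```python
from datetime import date, timedelta


def gfs_keep(snapshot_dates, daily=7, weekly=4, monthly=6, yearly=1):
    """Return the set of snapshot dates a GFS-style policy would retain."""
    buckets = [
        (daily, lambda d: (d.year, d.month, d.day)),
        (weekly, lambda d: tuple(d.isocalendar())[:2]),  # ISO year + ISO week
        (monthly, lambda d: (d.year, d.month)),
        (yearly, lambda d: (d.year,)),
    ]
    keep = set()
    for limit, key in buckets:
        seen = set()
        for snap in sorted(snapshot_dates, reverse=True):  # newest first
            if len(seen) >= limit:
                break
            if key(snap) not in seen:
                # First (newest) snapshot encountered in this bucket is kept.
                seen.add(key(snap))
                keep.add(snap)
    return keep


# Ninety nightly snapshots: 2026-01-01 through 2026-03-31.
snaps = [date(2026, 1, 1) + timedelta(days=i) for i in range(90)]
kept = gfs_keep(snaps)
```

A snapshot selected by several buckets is stored once, so the retained set is smaller than the sum of the four limits; that overlap is why repo growth under this policy is best judged from observation, as the section says.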
### 10. Encryption + key escrow + break-glass

restic encrypts the repo at rest, so **one secret — the restic repo password —
protects all copies uniformly** (`fisi`, pCloud, USB). One thing to escrow, not three.

**Escrow locations:**
- **`fisi`, root-only** (plus in the Ansible vault) — so backups run non-interactively
  and `fisi` is redeployable.
- **Vaultwarden** — the day-to-day human-accessible copy.
- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
  copy that survives "everything is down."

**The paper holds *two* secrets:** (1) the **restic repo password** (to read any
backup at all) and (2) the **Ansible vault master password** (to rebuild hosts from
the repo — normally from Vaultwarden via `rbw`, which is itself down in a from-zero
recovery). With both on paper, the break-glass chain has **no circular dependency**:
paper → restic restores Vaultwarden + repo data → the vault password (from paper)
drives Terraform/Ansible re-provisioning → services return, `rbw` works again.

**`mamba` (laptop) is the break-glass clone** (ADR-015): repo + toolchain + mesh +
`rbw`, with Terraform state synced to it — the rebuild can be driven from `mamba` if
`ubongo` is also gone. The paper sheet doubles as a short break-glass runbook assuming
zero running boma infrastructure: install restic on any machine, point it at pCloud
*or* a USB drive with the password, restore Vaultwarden first, then rebuild with the
vault password.

### 11. USB air-gap — plug-and-go cold copy

A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
systemd unit / script that: mounts the drive, confirms it is an expected drive, runs
**`restic copy` from the local repo → a restic repo on the USB drive** (same
password → ciphertext if lost/stolen), runs `restic check` on the USB copy, unmounts,
and **notifies via ntfy** with the result. Only allowlisted serials trigger anything —
a rogue USB does nothing.

`restic copy` (not rsync) so the USB is itself a valid restic repo, restorable
directly in a break-glass with nothing else alive. Drives are rotated and **stored
off-site** — a second geographic off-site copy independent of pCloud.

### 12. Failure alerting — guard against silent death

Success/failure pings alone miss the worst case (*the job silently stopped running*):

- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
  monitor**; no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job or dump-step error.
- **Weekly `restic check`** for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
  Grafana panel (fits ADR-018's monitoring direction; not required for v1).

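The dead-man's-switch reduces to one comparison: is the last successful ping older than the grace window? In boma that check is delegated to the Uptime Kuma push monitor; the sketch below only illustrates the logic, and `is_overdue` is a hypothetical helper rather than anything the monitor exposes.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_ping, now, grace=timedelta(hours=25)):
    """True when no successful run has pinged within the grace window."""
    return now - last_ping > grace


# Example: a nightly run that last succeeded at 03:10 UTC two days ago.
last_ok = datetime(2026, 6, 10, 3, 10, tzinfo=timezone.utc)
checked_at = datetime(2026, 6, 12, 6, 0, tzinfo=timezone.utc)
alert = is_overdue(last_ok, checked_at)
```

A 25 h window on a 24 h job leaves an hour of slack for a slow run while still firing before a second night is missed.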
### 13. Schedule

- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
  → `rclone sync` → pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
- **USB air-gap:** manual, approximately monthly, whenever a drive is docked.

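The repo-side tail of the nightly run (snapshot, prune, replicate) can be expressed as an ordered list of commands. This is a sketch under assumptions: `nightly_steps` is a hypothetical helper, the repo path, staging directory and rclone remote name are invented placeholders, and the earlier dump and pull steps are omitted because their exact form depends on each service's contract.

```python
def nightly_steps(repo, staging_dir, remote, keep):
    """Return ordered argv lists for the repo-side part of one nightly run."""
    forget = ["restic", "-r", repo, "forget", "--prune"]
    for period, count in keep.items():
        forget += [f"--keep-{period}", str(count)]
    return [
        ["restic", "-r", repo, "backup", staging_dir],  # snapshot the staged data
        forget,                                          # apply the retention policy
        ["rclone", "sync", repo, remote],                # mirror the pruned repo off-site
    ]


# Placeholder paths and remote name, for illustration only.
STEPS = nightly_steps(
    repo="/srv/restic-repo",
    staging_dir="/srv/backup-staging",
    remote="pcloud:boma-restic",
    keep={"daily": 7, "weekly": 4, "monthly": 6, "yearly": 1},
)
```

Keeping the steps as data makes the ordering easy to assert in a test and easy to log before anything runs; executing them sequentially and stopping at the first failure matches the "sequential, off-hours" intent.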
## Consequences

- boma now has a defined, end-to-end backup policy that closes the gap ADR-011 left
  open; "backup-first" and "snapshot + backup" are no longer assumed.
- Every service role that holds state must declare its backup contract (`backup__*`
  vars + `BACKUP.md`); stateless services declare `backup__state: false`. Cost:
  per-service declarations and a rendered doc to maintain (mitigated by the new-role
  runbook step + checklist gate).
- **pCloud is off-site but sync-coupled** — `rclone sync` propagates deletions (a
  prune, or a malicious wipe of `fisi`'s repo, replicates to pCloud). The **USB
  air-gap drive is the only truly immutable copy**; pCloud's own file-version history
  is enabled as a secondary cushion.
- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it
  receives full `base` hardening and tight access. restic encryption means a stolen
  `fisi`, USB drive, or pCloud blob yields ciphertext only.
- **pCloud's 1 TB is the off-site capacity ceiling.** Data-only backups fit for years
  at homelab scale; flag for `/capacity-review` if the repo trends toward ~1 TB.
- Recovery time under Model A (full Ansible run + data restore) is potentially hours —
  slower than a VM-image restore. PBS/Model B is explicitly deferred, not rejected.
- The paper break-glass must be kept current (restic password + vault password). An
  outdated paper sheet is the one failure mode this ADR cannot prevent mechanically —
  the semi-annual DR rehearsal is the human control.

Full design rationale and worked examples: `docs/superpowers/specs/2026-06-10-backup-strategy-design.md`.
Build path (roles, topology, tests): `docs/superpowers/plans/2026-06-10-backup-strategy.md`.

## Related

ADR-002 (security baseline: hardening applied to `fisi`), ADR-004 (one service = one
role; per-service doc conventions), ADR-008 (testing methodology; Molecule harness
reused for Tier-1), ADR-011 (update management: backup-first rule now grounded),
ADR-015 (`ubongo` recovery model; `mamba` break-glass clone; bare-Debian invariant),
ADR-017 (`VERIFY.md` checks reused in Tier-1 restore-verify), ADR-018 (logging/Alloy
→ ntfy alerting path), ADR-019 (Proxmox tags; `backup_hosts` group), ADR-021
(render-from-data pattern: `access__*`/`ACCESS.md` → `backup__*`/`BACKUP.md`;
runbook+gate governance model).

106
docs/decisions/023-adr-structure.md
Normal file
@ -0,0 +1,106 @@
# ADR-023 — ADR structure & lifecycle

## Status

Accepted (2026-06-10). Meta/doctrine ADR — pins how ADRs are written; the
`adr-structure` check (`scripts/repo-scan.py`) and `docs/decisions/adr-template.md`
ship with it, and ADRs 001–018 were retroactively restructured to conform. Resolves
the FRICTION signal (2026-05-31) about ADR-writing policy being unsettled.

## Context

boma records architectural decisions as numbered ADRs in `docs/decisions/`, and
CLAUDE.md treats them as load-bearing. Yet no ADR said how an ADR is written. The
newest ADRs (019–022) converged on a clean shape — Status → Context → Decision →
Consequences → Related — but only by imitation. ADRs 001–018 predate it and drifted
widely: most lacked a `## Status` section entirely (016–018 carried only a trailing
build-state note), and many lacked an explicit `## Decision` or `## Consequences`
heading, their decisions spread across ad-hoc topical sections. The result was
structural drift and no uniform way to tell an active decision from a superseded or
deprecated one.

## Decision

### 1. Title & filename

Title line: `# ADR-NNN — <Title>: <optional clarifying subtitle>` (em-dash). Filename:
`NNN-kebab-title.md`, zero-padded 3-digit, monotonic, never reused — a superseded ADR
keeps its number and file. A new ADR is registered as a row in the CLAUDE.md
"Further reading" table.

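The naming convention is regular enough to check with two patterns. This is a sketch only: the regexes and the `title_matches_filename` helper are hypothetical, and nothing here is claimed to be what `scripts/repo-scan.py` does. The em-dash in the title is written as the escape `\u2014`.

```python
import re

# NNN-kebab-title.md: three digits, then lowercase kebab-case words.
FILENAME_RE = re.compile(r"^(\d{3})-[a-z0-9]+(?:-[a-z0-9]+)*\.md$")
# "# ADR-NNN <em-dash> <Title>"
TITLE_RE = re.compile(r"^# ADR-(\d{3}) \u2014 \S.*$")


def title_matches_filename(filename, title_line):
    """True when both follow the convention and carry the same ADR number."""
    f = FILENAME_RE.match(filename)
    t = TITLE_RE.match(title_line)
    return bool(f and t and f.group(1) == t.group(1))
```

Comparing the two captured numbers catches the easy mistake of copying an existing ADR file and forgetting to renumber the title.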
### 2. Mandatory sections, in this order

- `## Status` — a lifecycle line, usually `Accepted (YYYY-MM-DD)` (see §4), plus an
  optional one-line note.
- `## Context` — the forces, the problem, what exists today, why now.
- `## Decision` — what we are doing; numbered sub-decisions for multi-part ADRs.
- `## Consequences` — results, trade-offs explicitly accepted, follow-on work.

### 3. Optional sections (use only where they genuinely apply)

`## Related`, `## Scope`, `## Guardrails` / `## Enforcement`, `## What was ruled out`,
`## Verified facts (ADR-014)`.

### 4. Status lifecycle

Four states. Because boma is single-contributor and trunk-based with no review gate,
most ADRs are **born `Accepted (YYYY-MM-DD)`** — committed-to on writing. A
**`Proposed`** state exists for a genuine draft whose core direction is recorded but
whose specifics are still open for discussion (e.g. ADR-011); it is promoted to
`Accepted` once settled.

- **`Proposed (YYYY-MM-DD)`** — drafted, under discussion, not yet committed-to. May
  carry open questions. Promoted to `Accepted (YYYY-MM-DD)` when decided.
- **`Accepted (YYYY-MM-DD)`** — committed-to. The common starting state.
- Replaced → old ADR's Status becomes **`Superseded by ADR-NNN (YYYY-MM-DD)`**; the new
  ADR records `Supersedes ADR-MMM` in its Status and `## Related`. The link is
  **bidirectional**.
- Retired with no replacement → **`Deprecated (YYYY-MM-DD)`** + a one-line reason.

**No silent rewrites.** An Accepted ADR is not edited to reverse its decision. Typo and
clarity fixes are fine; a material reversal requires a new ADR and a `Superseded by`
marker on the old one.

### 5. Template & enforcement

`docs/decisions/adr-template.md` is the scaffold for new ADRs. The `/review-repo`
command's pre-scan (`scripts/repo-scan.py`) emits an `adr-structure` finding for any
numbered ADR missing a mandatory section or with an unparseable Status line. It checks
**presence and Status, not section order** — order is a convention the template carries,
deliberately not gated, to keep enforcement lightweight (consistent with boma's other
doctrine ADRs adding no CI gate).

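A presence-and-Status check of this kind fits in a short function. The sketch below is an illustration written from the rules in this ADR, not the contents of `scripts/repo-scan.py`; `adr_structure_findings`, the finding strings, and the exact Status regex are all assumptions.

```python
import re

MANDATORY = ("## Status", "## Context", "## Decision", "## Consequences")

# Lifecycle word(s), then "(YYYY-MM-DD" closed by ")" or continued with ";".
STATUS_RE = re.compile(
    r"^(?:Proposed|Accepted|Deprecated|Superseded by ADR-\d{3}) "
    r"\(\d{4}-\d{2}-\d{2}[;)]"
)


def adr_structure_findings(text):
    """Return findings for one ADR: missing sections and a bad Status line."""
    lines = text.splitlines()
    headings = {ln.strip() for ln in lines if ln.startswith("## ")}
    findings = [f"missing section: {h}" for h in MANDATORY if h not in headings]

    if "## Status" in headings:
        idx = next(i for i, ln in enumerate(lines) if ln.strip() == "## Status")
        # The Status line is the first non-blank line under the heading.
        body = next((ln.strip() for ln in lines[idx + 1:] if ln.strip()), "")
        if not STATUS_RE.match(body):
            findings.append("unparseable Status line")
    return findings
```

Section order is deliberately not inspected, matching the decision to gate presence and Status only.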
### 6. Retroactive conformance of the back-catalogue

ADRs 001–018 are restructured to satisfy this standard rather than grandfathered. The
restructure is **presentational** — existing headings are relabelled, regrouped, or
demoted under a `## Decision` umbrella; a dated `## Status` is added; a `## Consequences`
section is assembled from implications the ADR already states. **The substance of no
decision is changed.** This keeps the check uniform (no number threshold) and the corpus
a consistent, legible decision history.

## Consequences

- New ADRs have one obvious shape and a scaffold; structural drift stops.
- Every ADR declares its lifecycle state uniformly, and reversals are traceable.
- The whole corpus conforms; the check needs no grandfathering and stays simple.
- One-time restructure churn across ADRs 001–018 (heading reorganization + a Status and
  a Consequences section per file; no decision substance changed).
- `/review-repo` grows one deterministic check; no new CI machinery.
- This ADR is the first conformant example and is held to its own check.

## What was ruled out

- **A `make lint` / CI gate for ADR structure** — heavier than the risk warrants;
  the `/review-repo` check and the template suffice.
- **Machine-enforcing section order** — brittle for marginal value; left as a
  template-demonstrated convention.
- **Grandfathering 001–018 from the check** — rejected in favour of restructuring the
  whole corpus to conform, so the standard applies uniformly with no exceptions.

## Related

- ADR-014 — knowledge sourcing (the `Verified facts` optional section).
- ADR-019/020/021/022 — the emergent structure this ADR codifies.
- `docs/decisions/adr-template.md` — the scaffold.
- `scripts/repo-scan.py` — the `adr-structure` enforcement check.

145
docs/decisions/024-reverse-proxy.md
Normal file
@ -0,0 +1,145 @@
# ADR-024 — Reverse proxy: Caddy (ACME — HTTP-01 public, DNS-01 private)

## Status

Accepted (2026-06-14; DNS-01 path resolved + proven 2026-06-15). Amends the soft
Traefik assumption carried by the roadmap (Phase-2 step 5) and ADR-017 prose; those
are updated to read "Caddy (ADR-024)".

> **Cert method follows exposure.** The cert *challenge* depends on whether a host is
> publicly reachable: **public hosts** (askari) use **HTTP-01** with **vanilla Caddy** —
> simplest, no plugin; **mesh/LAN-only cluster services** (no public A-record) use
> **DNS-01** via Gandi (the M1 capability), since they can't satisfy HTTP-01.
>
> **DNS-01 resolved + proven (2026-06-15) — the M4a deferral is closed.** The original
> failure was diagnosed as **version skew**: the image built at M4a used a pre-Bearer
> `libdns/gandi` that sent Gandi's **deprecated `Apikey` header** (→ 403 on a
> verified-valid token), and the `xcaddy` build ran *on a Hetzner IP* (Google's Go
> module proxy 403s those ranges). Both have clean, boma-aligned fixes: **pin
> caddy-dns/gandi v1.1.0** (→ `libdns/gandi` v1.1.0, which sends the PAT as
> `Authorization: Bearer` to `https://api.gandi.net/v5/livedns`) and **build the image
> on ubongo, not Hetzner**. Verified end-to-end (2026-06-15): the custom image issues a
> real **wildcard** cert (`*.dns01test.wingu.me`) against Let's Encrypt **staging** via
> Gandi DNS-01 using `vault.gandi.pat`; `caddy validate` accepts `acme_dns gandi` on the
> custom image and rejects it on vanilla `caddy:2`. Build with `make caddy-image`; the
> `reverse_proxy` role enables it per-instance via
> `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`.
> **Traefik was reconsidered and rejected again** —
> lego's Gandi provider faces the *same* PAT-vs-Apikey question, so switching would not
> have dodged the issue, and would reverse this ADR for nothing. askari (M4a) stays on
> HTTP-01 (a public host needs no DNS-01).

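"Cert method follows exposure" is a two-branch rule, sketched here as a function. `proxy_settings` and its return shape are hypothetical, and the custom image reference is passed in as a parameter because the registry path is not given in this ADR; only `caddy:2`, HTTP-01, DNS-01 and the `gandi` provider come from the text.

```python
def proxy_settings(publicly_reachable, custom_image):
    """Pick the ACME challenge and Caddy image from a host's exposure."""
    if publicly_reachable:
        # Public host: HTTP-01 works, so vanilla Caddy with no plugin is enough.
        return {"challenge": "http-01", "image": "caddy:2", "acme_dns_provider": None}
    # Mesh/LAN-only: no public A-record, so DNS-01 via the Gandi plugin image.
    return {"challenge": "dns-01", "image": custom_image, "acme_dns_provider": "gandi"}


# The image reference below is a placeholder, not the real registry path.
askari = proxy_settings(True, "registry.example/caddy-gandi:pinned")
cluster_service = proxy_settings(False, "registry.example/caddy-gandi:pinned")
```

Keying everything off one boolean keeps the vanilla image on public hosts, so only DNS-01 hosts depend on the custom build.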
## Context

boma needs a reverse proxy to front its services with TLS. ADR-002 requires every
service to sit behind a proxy with authentication before it is reachable; ADR-007/M1
delivers a `*.<domain>` wildcard cert via ACME DNS-01 against Gandi (the apex `boma`
domain, matching ROADMAP M1) — the only viable cert path for mesh/LAN-only services
that cannot satisfy HTTP-01 (no public A-record to point at).

The roadmap (Phase-2, step 5) and ADR-017 prose assumed **Traefik + Authentik** as the
auth-and-proxy pair without an ADR ever pinning Traefik. On closer inspection:

- Traefik's headline feature is **dynamic Docker-label discovery** — it discovers and
  routes services automatically from container labels without any static config.
- boma already renders *all* config from Ansible templates and the `group_vars` catalog
  (ADR-004). That makes dynamic label discovery a disadvantage: a service that is not in
  the catalog does not exist (CLAUDE.md), so any route that Traefik auto-discovers
  outside the catalog would be unaudited.
- The first reverse-proxy instance is needed on `askari` for M4 (NetBird), a host where
  `docker_hosts` patterns are being established under off-site/VPS constraints, not a
  full Proxmox cluster with many services.

No production investment in Traefik config has been made; the decision can be made
cleanly here.

## Decision

boma's reverse proxy is **Caddy**.

### 1. Rationale for Caddy over Traefik

1. Traefik's dynamic label discovery is wasted — boma renders config from the catalog;
   Caddy's static Caddyfile maps naturally to "render from templates" (ADR-004).
2. Caddy's Caddyfile is simple to template with `ansible.builtin.template`; one file,
   one `ansible_managed` header, no side-channel label state.
3. **Automatic HTTPS** via ACME DNS-01: the `caddy-dns/gandi` plugin satisfies the
   Gandi DNS-01 challenge, which is the only cert path for services with no public
   A-record (ADR-007/M1 wildcard strategy).
4. Far simpler for a solo operator: no dashboard-as-a-service, no routing-rule DSL,
   no dynamic config files to reconcile.
5. `forward_auth` to Authentik is a first-class Caddy directive — the planned
   Authentik auth story (ADR-002) is preserved without Traefik as the middleman.

### 2. Custom image (DNS-01 path — built)

> Applies only to the **DNS-01** path. M4a ships **vanilla `caddy:2`** on askari
> (HTTP-01) — no custom image; only DNS-01 hosts pull the custom one.

Caddy's official Docker image does not include third-party DNS plugins. The
`caddy-dns/gandi` plugin must be compiled in via `xcaddy`. boma builds a custom image
(`.docker/caddy-gandi/Dockerfile`, `make caddy-image`), **pinned** (ADR-011/ADR-014):

```dockerfile
FROM caddy:2.11.4-builder AS build
RUN xcaddy build v2.11.4 --with github.com/caddy-dns/gandi@v1.1.0

FROM caddy:2.11.4
COPY --from=build /usr/bin/caddy /usr/bin/caddy
```

Two hard constraints, both learned from the M4a failure:

1. **Build on ubongo, not Hetzner.** Google's Go module proxy 403s Hetzner IP ranges, so
   the on-host build on askari failed. ubongo (the control node) builds it in ~1 min,
   then it is pushed to the Forgejo registry (`make caddy-image-push`) and pulled by
   DNS-01 hosts — the same artifact pattern as the Molecule image.
2. **Pin a Bearer-capable plugin.** caddy-dns/gandi v1.1.0 → libdns/gandi v1.1.0 sends
   the PAT as `Authorization: Bearer`. Older versions used the deprecated `Apikey`
   header and 403 on a PAT — that was the M4a "valid token but no TXT record" symptom.

### 3. Deployment scope

The first Caddy instance runs on `askari` (M4a), serving a test vhost over HTTP-01 to
prove the proxy + ACME path. It fronts the NetBird stack in **M4b** (when the
`netbird_coordinator` role is built). The pattern generalises to the Proxmox cluster in
Phase 2 when services multiply.

### 4. Authentik integration (deferred)

`forward_auth` to Authentik is deferred to Phase 2 (when Authentik is deployed on the
cluster). The Caddyfile template will carry a placeholder comment. No Traefik-Authentik
middleware migration is required.

## Consequences

- **Roadmap Phase-2 step 5** is updated from "Authentik + Traefik" to "Authentik +
  Caddy (ADR-024)".
- **ADR-017 prose** that mentioned Traefik is updated to read "Caddy (ADR-024)".
- M4a (public hosts, HTTP-01) runs **vanilla `caddy:2`** — no custom image. The DNS-01
  custom Caddy image (`xcaddy` + `caddy-dns/gandi`, `.docker/caddy-gandi/`) is **built and
  proven**; it must be pushed to the Forgejo registry (`make caddy-image-push`, needs
  `docker login`) and kept current (plugin + base-image version bumps, pinned per
  ADR-011/ADR-014) as DNS-01 cluster services come online.
- Caddyfile config is rendered by Ansible from `group_vars` — consistent with ADR-004
  and easier to review than distributed container labels.
- `forward_auth` to Authentik is available when Authentik is deployed; no extra
  middleware layer required.
- The `proxy` concern tag (already in `tests/tags.yml`) covers Caddy config tasks.

## What was ruled out

- **Traefik** — dynamic label discovery is a mismatch for boma's catalog-rendered
  config model (ADR-004); more complex for a solo operator; no prior investment to
  protect.
- **nginx / HAProxy** — no built-in ACME; require a separate ACME client (certbot,
  acme.sh) adding operational surface; Caddy's integrated ACME is simpler.
- **NetBird's bundled TLS** — NetBird's management UI can serve its own TLS, but that
  doesn't generalise; a real proxy separates concerns and applies to every service.

## Related

- ADR-002 — services behind a proxy with authentication (the requirement this satisfies).
- ADR-004 — Docker & Compose model (template-rendered config, catalog-driven).
- ADR-007 / M1 — Gandi DNS-01 ACME path (the TLS strategy Caddy implements).
- ADR-016 — NetBird (M4 is the first deployment of this proxy).
- ADR-017 — service-UI verification; forward_auth to Authentik is the future auth story.

180
docs/decisions/025-local-vm-integration-testing.md
Normal file
@ -0,0 +1,180 @@
# ADR-025 — Local VM integration testing on ubongo

## Status

Accepted (2026-06-18). Implements ADR-008 Level 2/3 (deferred for lack of hosts; now
viable on ubongo). **RED→GREEN acceptance PASSED on real hardware (2026-06-18):** a
throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base's nftables forward
default-deny kills Docker forwarding on reboot) — RED — and survived the reboot once
the `docker_host` container-forward drop-in was applied — GREEN. Two shakedown
learnings added below.

## Context

Molecule (ADR-008 Level 1) tests each role in a single Docker container: one
`converge`, no real kernel netfilter, no real Docker daemon in the loop, and **no
reboot**. That structurally cannot catch an entire class of bug — reboot-survivability,
host-firewall × Docker interaction, and boot-ordering — which is exactly the class
that caused the **2026-06-17 mesh-hardening incident**.

During that incident, `base`'s nftables `forward { policy drop; }` killed the askari
Docker host **on reboot**: nftables loaded its default-deny before Docker, breaking
published-port DNAT and inter-container forwarding. Public services and the mesh went
down. It had worked right after `make deploy`, when Docker's runtime rules still
coexisted. `ip_nonlocal_bind` also failed to beat the sshd boot-race, leaving the mesh
listener absent at boot. Recovery required the Hetzner console and a WAN-SSH
break-glass. Molecule had passed.

ADR-008's Level 2/3 was deferred "for lack of hosts." ubongo breaks that deferral:

> verified: ubongo KVM capability · Bash (2026-06-18 session) · `/dev/kvm` present +
> accessible (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM
> free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant **not yet installed** ·
> 2026-06-18.

## Decision
|
||||||
|
|
||||||
|
### 1. Virtualisation approach: libvirt/KVM directly (Approach A)
|
||||||
|
|
||||||
|
A golden Debian-13 genericcloud qcow2 is cached locally on ubongo. Each run boots an
|
||||||
|
ephemeral qcow2 **overlay** backed by it (the golden image is never mutated), seeded
|
||||||
|
via cloud-init NoCloud, driven by a **stdlib-only** Python driver (`scripts/
|
||||||
|
integration-vm.py`) over `virsh` / `virt-install` / `cloud-localds`. No `libvirt-
|
||||||
|
python` dependency — the driver stays portable and the role stays lean.
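
As a rough illustration of the overlay approach described above, the sketch below only builds the command lines a stdlib-only driver might hand to `subprocess.run`; it does not execute anything. The function name, the file layout under `run_dir`, and the libvirt network name `boma-it` are assumptions made for this example, not details of the real `scripts/integration-vm.py`. The 2 vCPU / 3 GiB sizing and `--boot uefi` follow the figures given elsewhere in this ADR, and a real invocation may need further flags (for example an OS hint for `virt-install`).

```python
from pathlib import Path


def overlay_commands(golden: Path, run_dir: Path, name: str) -> list[list[str]]:
    """Build (but do not run) the commands for one ephemeral test VM."""
    overlay = run_dir / f"{name}.qcow2"      # hypothetical file layout
    seed = run_dir / f"{name}-seed.img"
    return [
        # Copy-on-write overlay: guest writes land here, the golden image is only read.
        ["qemu-img", "create", "-f", "qcow2",
         "-b", str(golden), "-F", "qcow2", str(overlay)],
        # NoCloud seed image built from cloud-init user-data and meta-data files.
        ["cloud-localds", str(seed),
         str(run_dir / "user-data"), str(run_dir / "meta-data")],
        # Import the existing overlay disk and boot it with UEFI firmware.
        ["virt-install", "--name", name, "--memory", "3072", "--vcpus", "2",
         "--disk", f"path={overlay},format=qcow2",
         "--disk", f"path={seed},device=cdrom",
         "--import", "--boot", "uefi",
         "--network", "network=boma-it",      # assumed network name
         "--noautoconsole"],
    ]
```

Keeping command construction separate from execution makes the sequence easy to review and to unit-test without touching libvirt.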

### 2. Fidelity envelope

The bugs are **post-boot**, not in the provisioning path. A lightweight local hypervisor
is sufficient: real OS, real kernel netfilter, real Docker daemon, real published-port
DNAT, a **real reboot**, and the coordinator running inside the VM (so the VM forms its
own one-node mesh, reproducing the circular bootstrap). The Proxmox provisioning chrome
is not mirrored.

### 3. Scope: one throwaway VM at a time, instantiated from real inventory

The first profile is **"be askari"** — a single box running Docker host + NetBird
coordinator + mesh peer, mirroring the host whose incident motivates this work. The
mechanism is generic: swap the profile to "be" any inventory host. Multi-VM topologies
are a deferred extension.

### 4. Acceptance: self-validating against the real failure

The harness is accepted when it can, on a local VM:

1. Apply `base` (firewall on, no `docker_host` container-forward drop-in) to a Docker
   host, reboot, and observe the **2026-06-17 breakage** (Docker forwarding dead,
   services down). If the host instead comes back healthy after this step, the harness
   is not faithful to the real failure.
2. Apply the `docker_host` container-forward fix, re-run, and **survive the reboot**.
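
The two acceptance steps above reduce to a small decision rule: the unfixed run must fail and the fixed run must survive. The sketch below expresses that rule; the names `Verdict` and `acceptance_verdict` are hypothetical and not taken from the driver.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPTED = "accepted"
    NOT_FAITHFUL = "harness not faithful"
    FIX_INEFFECTIVE = "fix ineffective"


def acceptance_verdict(red_survived: bool, green_survived: bool) -> Verdict:
    """Combine the outcomes of the RED (no fix) and GREEN (fix applied) runs."""
    if red_survived:
        # Without the drop-in the reboot should break Docker forwarding.
        # If it did not, the VM is not reproducing the incident.
        return Verdict.NOT_FAITHFUL
    if not green_survived:
        # The incident reproduced, but the fix did not survive the reboot.
        return Verdict.FIX_INEFFECTIVE
    return Verdict.ACCEPTED
```

Checking RED first matters: a GREEN pass proves nothing if the harness cannot reproduce the failure in the first place.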

### 5. Tiered cert fidelity via a `--certs` knob

DNS-01 is what makes real certs possible without public inbound access (validation is
out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which
the isolated NAT network provides):

| Tier | Description | Default? |
|---|---|---|
| `internal` | Caddy `tls internal` — zero deps, instant. For incident repro and runs where certs are not under test. | Yes |
| `le-staging` | Real DNS-01 ACME against Let's Encrypt **staging** — real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. | No. Built in v1; use when testing the ACME/cert path. |
| `le-prod-wildcard` | A real trusted `*.test.wingu.me` wildcard, **issued once, persisted on ubongo, reused** across runs. | No. On-demand only. Accepted risk recorded as R6 in `docs/security/accepted-risks.md`. |

A deliberate "no-egress" failure scenario (reproducing FRICTION 2026-06-17 #4 —
`netbird-server` FATAL-loops on GeoLite2 download when egress is lost) forces
`internal`, since ACME requires egress.
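
The tier rules above can be sketched as a single resolution function. Only the tier names and the `--certs` knob come from this ADR; the function name and the `no_egress` parameter are assumptions made for illustration.

```python
CERT_TIERS = ("internal", "le-staging", "le-prod-wildcard")


def resolve_cert_tier(requested: str = "internal", no_egress: bool = False) -> str:
    """Return the cert tier a run should use.

    `requested` stands in for the value of the --certs knob;
    `no_egress` is a hypothetical flag for the no-egress failure scenario.
    """
    if requested not in CERT_TIERS:
        raise ValueError(f"unknown cert tier: {requested!r}")
    if no_egress:
        # ACME needs outbound access, so a no-egress run can only use Caddy's internal CA.
        return "internal"
    return requested
```

Forcing the tier rather than raising an error keeps the no-egress scenario runnable even when a cert tier was requested by habit; a stricter driver could reasonably reject the combination instead.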

### 6. The toolchain is Ansible-managed

A new non-service role (`integration_test`, `control` group) installs and enables
libvirt + QEMU + virtinst reproducibly. The driver manages the golden image lazily on
first run (keeping the role lean; no fiddly download/refresh logic in Ansible). The
repo owns ubongo's state.

### 7. Stubs live in an overlay file, never in the real inventory

Transient inventory entries for the test VM are generated at runtime as a single-host
file. Stubs (cert tier, in-VM coordinator endpoint, VM connection details) live in
`tests/integration/overrides/<host>.yml` — an explicit, reviewable overlay. The real
inventory is never touched, so `make tf-inventory` and "don't edit inventory directly"
stay intact.
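
A minimal sketch of the single-host transient inventory, assuming the driver already has the overlay file's contents as a dict. It emits JSON, which is also valid YAML, so it avoids a third-party YAML dependency. The function name and the `cert_tier` key are hypothetical; only `ansible_host` is a standard Ansible variable.

```python
import json


def transient_inventory(host: str, vm_ip: str, overrides: dict) -> str:
    """Render an inventory that contains only the test VM."""
    # The VM's address is applied last so an override can never point
    # the run at a real host by accident.
    hostvars = {**overrides, "ansible_host": vm_ip}
    return json.dumps({"all": {"hosts": {host: hostvars}}}, indent=2)
```

Because the structure is built from scratch for each run, there is no code path that reads or rewrites the real inventory.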

## Consequences

- **Reconciles ADR-015:** ubongo runs ephemeral KVM test VMs as part of its
  local-test-runner role — it is still not a production hypervisor. A default VM
  (~2 vCPU / 3 GiB / 20 GiB thin overlay) against ~13 GiB free is comfortable; the
  driver enforces **one integration VM at a time** (resource guard, name-prefix
  `boma-it-*`) and refuses to start below a free-RAM threshold.
- **Operationalises the standing rule:** "firewall/sshd/boot changes must be tested on
  a real VM with a real reboot before they touch a live host" (FRICTION 2026-06-17 #6)
  becomes a concrete, runnable step documented in `docs/runbooks/integration-testing.md`.
- **Accepted risk R6:** `le-prod-wildcard` runs pass the production Gandi PAT
  (`vault.gandi.pat`) to an ephemeral local VM and write transient `_acme-challenge`
  TXT records into the real `wingu.me` zone. Scope: on-demand only; the default tier is
  `internal`, with `le-staging` used for ACME-path testing. Compensating controls:
  ephemeral VM, isolated NAT network, TXT records auto-removed by Caddy after validation.
- **Three safety invariants** make the test tool itself safe:
  1. The transient inventory contains only the test VM — no real host is ever in scope.
  2. "Be askari" points NetBird at the in-VM coordinator — the VM forms its own one-node
     mesh; it never enrols in the real mesh.
  3. Test VMs sit on an isolated libvirt NAT network — outbound NAT for ACME/image pulls
     only; they cannot reach the LAN (`10.20.x`) or the real mesh.
- **Diagnostics on failure** (catching a bug is the point): failure keeps the VM and
  dumps `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`,
  `systemd-analyze critical-chain`. `make test-integration-clean` reaps all `boma-it-*`
  orphans. Diagnostics land in gitignored `~/integration-runs/<ts>-<host>/`.
- **Future pinch:** the Level-4 Chromium/Playwright stack (ADR-017) competes for ubongo
  RAM when run concurrently. The resource guard is the v1 answer — one integration VM at
  a time; don't run alongside a heavy Level-4 session. Revisit at `/capacity-review`.
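
The resource guard mentioned in the first bullet could look roughly like the sketch below. It takes the text of `/proc/meminfo` and a list of running libvirt domain names as plain inputs so it stays testable. The 4096 MiB threshold is an assumption chosen for the example (the ADR does not state the real value), and the function names are hypothetical.

```python
def mem_available_mib(meminfo_text: str) -> int:
    """Extract MemAvailable from /proc/meminfo content (reported in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")


def guard_reasons(meminfo_text: str, running_domains: list[str],
                  min_free_mib: int = 4096, prefix: str = "boma-it-") -> list[str]:
    """Return the reasons to refuse a new test VM; an empty list means go ahead."""
    reasons = []
    active = [name for name in running_domains if name.startswith(prefix)]
    if active:
        # One integration VM at a time.
        reasons.append(f"integration VM already running: {', '.join(active)}")
    free = mem_available_mib(meminfo_text)
    if free < min_free_mib:
        reasons.append(f"only {free} MiB available, need {min_free_mib} MiB")
    return reasons
```

Returning every reason at once, instead of stopping at the first, gives the operator the full picture in a single refusal message.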

## Scope

**In scope:** reboot-survivability, host-firewall × Docker interaction, boot-ordering,
cert/ACME paths, mesh bootstrap on one box.

**Out of scope (v1):** multi-VM mini-cluster (inter-host mesh dataplane); CI gate
(this is an interactive, agent-driven pre-deploy check; CI stays lint + Molecule per
ADR-008/010); the Proxmox provisioning path (the bugs live in the boot/kernel/Docker
layer, not provisioning).

## What was ruled out

| Option | Reason |
|---|---|
| **Proxmox VE nested on ubongo** | Highest fidelity including the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs do not live in provisioning. |
| **Vagrant + vagrant-libvirt** | Mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin; boxes drift from the real Debian cloud image; the reboot→assert sequence still needs custom logic. |
| **terraform-provider-libvirt** | Declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence; adds throwaway state; blurs ADR-006's "TF owns *production* VM existence on Proxmox" boundary. |

## Verified facts (ADR-014)

- verified: ubongo KVM capability · Bash · `/dev/kvm` present + accessible (kvm group),
  Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB
  disk free · 2026-06-18.

## Shakedown learnings (2026-06-18 live run)

Two findings from the RED→GREEN acceptance run that affect anyone operating the harness:

1. **Boot firmware: UEFI required.** The Debian 13 genericcloud image triple-faults
   under legacy BIOS/SeaBIOS and does not reach the kernel. Boot the VM with UEFI
   (`virt-install --boot uefi`; `ovmf` package). The driver does this by default; it is
   noted here so the requirement is findable.

2. **`claude` sudo is load-bearing.** VM management (`virsh`, `virt-install`,
   `cloud-localds`) and offline diagnostics (`nft list ruleset`, `journalctl -b`,
   `systemd-analyze critical-chain`) all require root. The harness assumes the
   AI-worker has `NOPASSWD:ALL` sudo on `ubongo` — settled as the ADR-015 amendment
   (2026-06-18) and registered as R7 in `docs/security/accepted-risks.md`. A `claude`
   account without sudo will block the harness at the first `virsh` call.

The full list of nine shakedown findings (including the UEFI boot-loop) is in
`docs/FRICTION.md`.

## Related

- ADR-006 — Terraform owns production VM existence (boundary this ADR respects).
- ADR-008 — Testing methodology (Levels 1–4); this ADR is the concrete build of Level 2/3.
- ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs. **Supersedes** ADR-015's "no local sudo" sub-decision for the AI-worker — the shakedown necessitated `claude` NOPASSWD sudo (ADR-023 §4; access model in ADR-021, risk R7).
- ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role.
- ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests.
- ADR-021 — Operational access; sudo model for `claude` and `sjat` on `ubongo`.
- ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.
40
docs/decisions/adr-template.md
Normal file
@ -0,0 +1,40 @@
# ADR-NNN — <Title>: <optional clarifying subtitle>

<!-- Filename: NNN-kebab-title.md (zero-padded, monotonic, never reused).
     Register a row in CLAUDE.md "Further reading" when this ADR is created.
     Sections below in order. Mandatory: Status, Context, Decision, Consequences.
     Delete this comment and any optional section you don't use. -->

## Status

Accepted (YYYY-MM-DD)
<!-- Lifecycle: usually born "Accepted (YYYY-MM-DD)"; use "Proposed (YYYY-MM-DD)" for a
     genuine draft (open questions), promoted to Accepted once settled. Later:
     "Superseded by ADR-NNN (YYYY-MM-DD)" or "Deprecated (YYYY-MM-DD)" + one-line why.
     Optional trailing note OK, e.g.
     "Accepted (2026-06-10). Doctrine ADR — pins policy, builds nothing yet." -->

## Context

<!-- The forces, the problem, what exists today, why now. -->

## Decision

<!-- What we are doing. Use numbered sub-decisions (### 1. ...) for multi-part ADRs. -->

## Consequences

<!-- Results, trade-offs explicitly accepted, follow-on work. -->

<!-- Optional sections — uncomment any that genuinely apply; never pad:

## Scope — explicit in / out-of-scope boundaries.

## Guardrails — how the decision is mechanically enforced (lint, CI, hooks).

## What was ruled out — rejected alternatives, each with its reason.

## Verified facts (ADR-014) — verified: <subject> · <tool> <version> · <source> · <YYYY-MM-DD>

## Related — links to other ADRs by number; bidirectional for Supersedes/Superseded-by.
-->
@ -18,6 +18,25 @@
- **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
- **Notes:** _warranty, quirks_

### ubongo (control node — outside the cluster)
- **Model / form factor:** Lenovo ThinkCentre M70q Tiny (machine type 11DUS7XP00); 1-litre tiny/USFF
- **CPU:** Intel Core i3-10100T — 4 cores / 8 threads, 35 W TDP
- **RAM:** 16 GB DDR4-3200 (2×8 GB SODIMM)
- **Storage:** 256 GB SanDisk X600 SATA 2.5" SSD (model SD9TB8W256G1001; TCG Opal-capable, Opal unused — no disk encryption)
- **NICs:** wired GbE, interface eno1, MAC 88:a4:c2:e0:ee:da
- **BIOS:** Lenovo M2WKT5AA (2023-06-20)
- **Notes:** always-on; control plane + AI-worker (dedicated `claude` user) + local test runner (Molecule/Docker) per ADR-015; not a Proxmox guest; remote access currently LAN SSH only (mesh deferred). Also runs **one ephemeral KVM integration test VM** (~3 GiB RAM) at a time per ADR-025 — the resource guard enforces one-at-a-time; do not run a test-integration cycle alongside a heavy Level-4 browser session (Chromium/Playwright).

### fisi (backup node — outside the cluster; provisional)
- **Model / form factor:** HP Elite 600 G9 (tower)
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
- **RAM:** 16 GB+ (TBD exact)
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
- **NICs:** wired GbE
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
  docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
  Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).

_(repeat for pve1, pve2, askari)_

## 2. Network gear

@ -46,6 +65,8 @@ Physical totals per node. Integers; `ram_gb` and `disk_gb` may be decimals.
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
| ubongo | 4 | 16 | 250 |
| fisi | 4 | 16 | 8000 |

## 5. Capacity notes
88
docs/reviews/2026-06-05-findings.json
Normal file
@ -0,0 +1,88 @@
{
  "date": "2026-06-05",
  "reviewed_commit": "f566fd1",
  "fixes_commit": "666ad42",
  "mode": "on-demand",
  "counts": {
    "auto_fixed": 4,
    "open": 12,
    "scan": {"broken-path-ref": 14, "marker": 35, "open-deferred-item": 6, "stale-deferred": 0}
  },
  "auto_fixed": [
    {"id": "AF1", "dimension": "consistency", "severity": "high",
     "location": "docs/decisions/005-bootstrapping.md:36; docs/runbooks/new-host.md:62,71",
     "description": "Terraform 'writes the host's DNS A record' contradicts ADR-009 (dns role owns the zone)",
     "fix": "removed the DNS-write clause; noted Terraform writes no DNS records",
     "tag": "recurring"},
    {"id": "AF2", "dimension": "consistency", "severity": "high",
     "location": "docs/decisions/005-bootstrapping.md:8",
     "description": "control node described as cloned from the cloud-init template; ADR-015 makes ubongo physical",
     "fix": "control node is a physical box installed directly, not cloned (ADR-015)",
     "tag": "new"},
    {"id": "AF3", "dimension": "consistency", "severity": "low",
     "location": "CLAUDE.md:197",
     "description": "Further reading missing the VERIFY.md template row",
     "fix": "added docs/testing/service-verify-template.md row",
     "tag": "new"},
    {"id": "AF4", "dimension": "cruft", "severity": "low",
     "location": "docs/TODO.md:79",
     "description": "typos: 'we we', 'seperate'",
     "fix": "corrected to 'we' and 'separate'",
     "tag": "new"}
  ],
  "open": [
    {"id": "O1", "dimension": "consistency", "severity": "medium",
     "location": "docs/decisions/004-docker-model.md",
     "description": "service-role standard file table lists SECURITY.md but not VERIFY.md (ADR-017/CLAUDE.md:85 mandate it)",
     "suggested_fix": "add a VERIFY.md row to ADR-004's file table", "tag": "new"},
    {"id": "O2", "dimension": "consistency", "severity": "medium",
     "location": "docs/runbooks/new-role.md",
     "description": "no step to write VERIFY.md for service roles; STATUS.md:17 'runbooks reconciled' now overstated",
     "suggested_fix": "add a VERIFY.md step mirroring the SECURITY.md step", "tag": "new"},
    {"id": "O3", "dimension": "cruft", "severity": "low",
     "location": "README.md:58-60,94",
     "description": "ADR list stops at 001-009; docs/ tree omits security/, testing/, hardware/",
     "suggested_fix": "extend ADR list + docs/ subtree", "tag": "new"},
    {"id": "O4", "dimension": "consistency", "severity": "medium",
     "location": "CLAUDE.md:106; docs/decisions/009-provisioning-handoff.md:78; scripts/tf_to_inventory.py:24",
     "description": "ADR-016 says askari gets its own inventory group but none is named; valid-groups set excludes it",
     "suggested_fix": "name the group; add to host-groups + ADR-009 valid groups", "tag": "new"},
    {"id": "O5", "dimension": "consistency", "severity": "medium",
     "location": "docs/decisions/006-terraform.md:78",
     "description": "backend.tf labelled 'Forgejo state backend' contradicts ADR-006's own local-state section",
     "suggested_fix": "relabel to local state backend (no remote backend)", "tag": "new"},
    {"id": "O6", "dimension": "drift", "severity": "medium",
     "location": "docs/decisions/014-knowledge-sourcing.md:88",
     "description": "plugin reproducibility described as open, but TODO 10.7 is DONE",
     "suggested_fix": "update to resolved state; drop the forward-pointer", "tag": "new"},
    {"id": "O7", "dimension": "consistency", "severity": "low",
     "location": "docs/decisions/011-update-management.md:128",
     "description": "ruled-out 'Digest-pinning the stateful tier' contradicts Decision #2 (adopts tag@digest); ADR-011 is draft",
     "suggested_fix": "remove/replace the ruled-out row when accepting ADR-011 (TODO 16)", "tag": "new"},
    {"id": "O8", "dimension": "consistency", "severity": "low",
     "location": "docs/decisions/003-toolchain.md:85; docs/decisions/010-forgejo-ci.md:66",
     "description": "'act_runner on control node or a dedicated runner VM' ambiguous vs ADR-015",
     "suggested_fix": "name ubongo as runner host; cross-ref ADR-015", "tag": "new"},
    {"id": "O9", "dimension": "consistency", "severity": "low",
     "location": "docs/decisions/008-testing.md:148",
     "description": "WireGuard Molecule-exclusion row framed for retired OPNsense VLAN-99 WireGuard",
     "suggested_fix": "reframe to NetBird wt0 data plane (ADR-016)", "tag": "new"},
    {"id": "O10", "dimension": "consistency", "severity": "low",
     "location": "docs/decisions/011-update-management.md:67",
     "description": "cross-refs 'scheduled_jobs plan and ADR-010'; ADR-010 has no such plan (TODO 8.3)",
     "suggested_fix": "point to TODO 8.3", "tag": "new"},
    {"id": "O11", "dimension": "consistency", "severity": "low",
     "location": "docs/CAPABILITIES.md",
     "description": "no row for the /verify-service (Level 4) capability decided in ADR-017",
     "suggested_fix": "add an Operations row for /verify-service", "tag": "new"},
    {"id": "O12", "dimension": "cruft", "severity": "low",
     "location": "docs/TODO.md:30",
     "description": "item 3.10 is garbled/unfollowable",
     "suggested_fix": "rewrite clearly or strike", "tag": "new"}
  ],
  "scan_noise": [
    "broken-path-ref x14: illustrative report-name templates (YYYY-MM-DD-<service>.md) and not-yet-created latest.md files; scanner stops at the <placeholder> boundary",
    "marker x35: mostly prose references to TODO.md items, not code markers",
    "open-deferred-item x6: all confirmed genuinely open (ADR-011 #1-5, ADR-015 #3); 0 stale-deferred"
  ]
}
93
docs/reviews/2026-06-05-review.md
Normal file
@ -0,0 +1,93 @@
# Repo review — 2026-06-05

- **Reviewed commit:** `f566fd1` (scan); auto-fixes landed in `666ad42`
- **Mode:** on-demand (interactive)
- **Scope:** whole repo — 2 roles, 17 ADRs, 4 runbooks, 7 scripts; doc-heavy
- **Prior run:** 2026-05-30 (`de38d1c`) — 7 auto-fixed, 17 open

## Summary

| | high | medium | low | total |
|---|---|---|---|---|
| Auto-fixed | 2 | 0 | 2 | 4 |
| Open (report-only) | 0 | 5 | 7 | 12 |
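
The summary rows can be re-derived from the findings JSON rather than counted by hand. A minimal sketch, assuming each finding is a dict with a `severity` key as in `docs/reviews/2026-06-05-findings.json`; the function name is hypothetical.

```python
from collections import Counter


def severity_row(findings: list[dict]) -> dict[str, int]:
    """Tally findings by severity for one row of the summary table."""
    counts = Counter(item["severity"] for item in findings)
    row = {sev: counts.get(sev, 0) for sev in ("high", "medium", "low")}
    row["total"] = len(findings)
    return row
```

Run against the `auto_fixed` and `open` arrays, this also cross-checks the `counts` block in the same file.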

This review followed a session of heavy documentation work (ADR-015 `ubongo`,
ADR-016 NetBird mesh, ADR-017 Level-4 verification). Most findings are **propagation
gaps** — a new decision landed but an older doc still reflects the prior design.

**New deferral check exercised.** `repo-scan.py` now enumerates open ADR
"Deferred/Open" items and flags any that another file calls resolved but that the ADR
has not marked as such. This run: 6 open-deferred-items surfaced, **all confirmed
genuinely open** by the cross-cutting reviewer (ADR-011 #1–5, ADR-015 #3),
**0 stale-deferred**. The check produced no false resolutions and the judgement layer
agreed — working as designed.

## Auto-fixes applied (`666ad42`)

| id | dim | sev | location | fix |
|---|---|---|---|---|
| AF1 | consistency | high | `docs/decisions/005-bootstrapping.md:36`, `docs/runbooks/new-host.md:62,71` | Removed "Terraform writes the host's DNS A record" — contradicts ADR-009 (the `dns` role owns the zone). **Recurring**: the 2026-05-30 run fixed the same contradiction in README/ADR-003; it reappeared in two more files. |
| AF2 | consistency | high | `docs/decisions/005-bootstrapping.md:8` | Control node described as cloned from the cloud-init template; ADR-015 makes `ubongo` a physical box installed directly. Corrected. |
| AF3 | consistency | low | `CLAUDE.md:197` | Added the missing `docs/testing/service-verify-template.md` row to Further reading (parallels the security-template row). |
| AF4 | cruft | low | `docs/TODO.md:79` | Typos: "we we" → "we"; "seperate" → "separate". |

## Open findings (report-only)

### VERIFY.md propagation cluster (ADR-017 not fully threaded through)

| id | sev | location | finding | suggested fix |
|---|---|---|---|---|
| O1 | medium | `docs/decisions/004-docker-model.md` (file table) | The service-role standard lists `SECURITY.md` but not `VERIFY.md`, though ADR-017 + CLAUDE.md:85 now mandate it. | Add a `VERIFY.md` row to ADR-004's file table. |
| O2 | medium | `docs/runbooks/new-role.md` (step 9 → Commit) | No step to write `VERIFY.md` for service roles (only `SECURITY.md`). Makes `STATUS.md:17` ("runbooks current and mutually reconciled") slightly overstated. | Add a "write the per-service verification spec" step mirroring the SECURITY.md step. |
| O3 | low | `README.md:58-60, 94` | ADR list stops at 001–009 (010–017 absent); the `docs/` tree omits `security/`, `testing/`, `hardware/`. | Extend the ADR list (or point to `docs/decisions/` + CLAUDE.md's table); expand the `docs/` subtree. |

### Design gaps from the recent ADRs

| id | sev | location | finding | suggested fix |
|---|---|---|---|---|
| O4 | medium | `CLAUDE.md:106`, `docs/decisions/009-provisioning-handoff.md:78`, `scripts/tf_to_inventory.py:24` | ADR-016 says "`askari` is Ansible-managed — its own inventory group", but no group is named anywhere; host-groups list + valid-groups set don't include it. | Decide the group name (e.g. `edge_hosts`/`hetzner_hosts`), add to CLAUDE.md host groups + ADR-009 valid groups. (`askari` is manual like the control node, so `tf_to_inventory.py` need not generate it, but the group must be valid.) |
| O5 | medium | `docs/decisions/006-terraform.md:78` | `backend.tf` labelled "Forgejo state backend", contradicting ADR-006's own State-backend section (local state on `ubongo`; Forgejo's API is read-only). | Relabel to "local state backend (no remote backend)". |
| O6 | medium | `docs/decisions/014-knowledge-sourcing.md:88` | Plugin-reproducibility described as open ("tracked in `docs/TODO.md`"), but TODO 10.7 is marked DONE (settings.json declares the plugin set; claude-code-setup.md covers bootstrap). | Update to reflect the resolved state; drop the forward-pointer. |

### Clarity / lower-priority consistency

| id | sev | location | finding | suggested fix |
|---|---|---|---|---|
| O7 | low | `docs/decisions/011-update-management.md:128` | "Digest-pinning the stateful tier" sits in the ruled-out table, but Decision #2 *adopts* `tag@digest` for stateful (TODO 16 confirms). ADR-011 is still **Proposed/draft**. | Remove/replace the ruled-out row when accepting ADR-011 (TODO 16). |
| O8 | low | `docs/decisions/003-toolchain.md:85`, `docs/decisions/010-forgejo-ci.md:66` | "act_runner on the control node **or a dedicated runner VM**" reads ambiguously against ADR-015 (no cluster control VM). Not wrong (a runner VM is a separate option) but worth disambiguating. | Name `ubongo` as the runner host; cross-ref ADR-015; keep "dedicated runner VM" as an explicit future option. |
| O9 | low | `docs/decisions/008-testing.md:148` | The "WireGuard tunnel establishment" Molecule-exclusion row is framed for the retired OPNsense VLAN-99 WireGuard; NetBird still uses WireGuard (`wt0`) as its data plane. | Reframe the row to the NetBird `wt0` data-plane (ADR-016). |
| O10 | low | `docs/decisions/011-update-management.md:67` | Cross-references "the `scheduled_jobs` plan and ADR-010"; ADR-010 is Forgejo CI, not scheduled jobs (that's TODO 8.3, unbuilt). | Point to TODO 8.3 instead. |
| O11 | low | `docs/CAPABILITIES.md` §10 | No row for the `/verify-service` (Level 4) capability though ADR-017 decided it. | Add an Operations row for `/verify-service`. |
| O12 | low | `docs/TODO.md:30` (item 3.10) | Garbled text ("maybe something in the improvements of the methods in boma moods the point?") — unfollowable. | Rewrite the question clearly or strike it. |

### Deterministic-scan noise (not fixed — known limitations)

- **`broken-path-ref` ×14** — all illustrative/future paths: report-name templates
  (`docs/testing/reviews/YYYY-MM-DD-<service>.md`) and `latest.md` files not yet
  created. The path-ref check stops at the `<placeholder>` boundary, so a templated
  path registers as a partial broken ref. *Potential scanner improvement: skip a path
  ref immediately followed by a placeholder char or a `YYYY-MM-DD` token.*
- **`marker` ×35** — mostly prose references to `TODO.md` items, not code markers.
  Known noise; the regex already excludes `TODO.md`/alternations but not "TODO 8.2"
  prose.
- **`open-deferred-item` ×6** — all confirmed genuinely open (see above). `0`
  stale-deferred. New check healthy.
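
The scanner improvement suggested for `broken-path-ref` could be sketched as follows. This is not the logic in `repo-scan.py`; the path regex, the list of top-level directories, and the set of placeholder characters are all assumptions made for illustration.

```python
import re

# Hypothetical path-ref pattern: a known top-level directory followed by path characters.
PATH_REF = re.compile(r"\b(?:docs|roles|scripts|playbooks|tests)/[\w./-]*")
PLACEHOLDER_CHARS = ("<", "{", "*")
DATE_TOKEN = "YYYY-MM-DD"


def checkable_path_refs(text: str) -> list[str]:
    """Return path refs worth checking on disk, skipping templated ones."""
    refs = []
    for match in PATH_REF.finditer(text):
        ref = match.group(0)
        next_char = text[match.end():match.end() + 1]
        if next_char in PLACEHOLDER_CHARS and next_char:
            continue  # e.g. docs/x/<service>.md: the match stopped at a placeholder
        if DATE_TOKEN in ref:
            continue  # report-name template, not a real file
        refs.append(ref.rstrip("."))  # drop sentence-ending punctuation
    return refs
```

Skipping templated refs before the existence check removes the noise at the source instead of filtering it out of the report afterwards.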

## Diff vs prior run (2026-05-30)

- **Recurring:** the Terraform-writes-DNS contradiction (AF1) — fixed in README/ADR-003
  last run, reappeared in ADR-005/new-host.md. Signal that this phrasing keeps being
  copied; worth a `/review-repo`-time grep for "writes … DNS A record".
- **New:** everything else — the repo gained ADR-010…017 and the `ubongo`/NetBird/
  Level-4 work since the prior run, so most findings are fresh propagation gaps.
- **Resolved:** prior-run open items were largely addressed during the intervening
  doc work (control-node-as-VM, WireGuard framing, etc., now mostly reconciled).
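
The recurring-phrase grep suggested in the first bullet might look like the sketch below. The regex is one guess at how to match "writes … DNS A record" with a short gap, and the function takes file contents as a dict so it can be exercised without touching the repo; neither is part of the existing `/review-repo` tooling.

```python
import re

# "writes", then up to 40 characters on the same line, then "DNS A record".
RECURRING = re.compile(r"writes\b.{0,40}?\bDNS A record", re.IGNORECASE)


def find_recurring_phrase(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for every line that matches the phrase."""
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if RECURRING.search(line):
                hits.append((path, lineno, line.strip()))
    return hits
```

The corrected wording from AF1 ("Terraform writes no DNS records") does not match, so already-fixed files stay quiet.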

## Follow-up prompt

> Thread the ADR-017 `VERIFY.md` convention through the remaining docs (O1–O3): add a
> `VERIFY.md` row to ADR-004's service-role file table, a VERIFY.md step to
> `new-role.md` (and reconcile STATUS.md:17), and refresh `README.md`'s ADR list +
> `docs/` tree. Then settle the `askari` inventory group name (O4) and propagate it to
> CLAUDE.md host-groups + ADR-009 valid-groups. Finally clear the stale labels O5
> (ADR-006 backend.tf) and O6 (ADR-014 plugin reproducibility = DONE).
65
docs/reviews/2026-06-11-findings.json
Normal file
@ -0,0 +1,65 @@
|
||||||
|
{
|
||||||
|
"date": "2026-06-11",
|
||||||
|
"reviewed_commit": "67f2aba",
|
||||||
|
"fixes_commit": null,
|
||||||
|
"mode": "on-demand",
|
||||||
|
"counts": {
|
||||||
|
"auto_fixed": 5,
|
||||||
|
"open": 18,
|
||||||
|
"scan": {
|
||||||
|
"broken-adr-ref": 4,
|
||||||
|
"broken-path-ref": 1,
|
||||||
|
"marker": 14,
|
||||||
|
"open-deferred-item": 5,
|
||||||
|
"stale-deferred": 0
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"deferral_checklist": {
|
||||||
|
"adr-011-open-items": "all 5 (snapshot driver, cadences, health-check harness home, classification home, staging-first) confirmed genuinely still open; cross-checked against later ADRs + TODO 16. No stale-deferred.",
|
||||||
|
"adr-015-deferred": "deferred #1 (mesh VPN) #2 (service-UI) #3 (build) all confirmed marked RESOLVED in place. No stale-deferred.",
|
||||||
|
"stale_deferred_found": 0
|
||||||
|
},
  "scan_false_positives": [
    {"check": "broken-path-ref", "location": "STATUS.md:38", "why": "STATUS legitimately documents roles/docker_host/ as 'Not in git.' — intentional reference to an unbuilt role."},
    {"check": "broken-adr-ref", "location": "tests/test_repo_scan.py:10,43; docs/superpowers/plans/2026-06-10-adr-structure.md:50,83", "why": "ADR-099/ADR-100 are intentional test fixtures exercising the scanner's bad-ref detection."},
    {"check": "marker", "location": "docs/superpowers/plans/*, docs/superpowers/specs/*, docs/decisions/019-tagging.md:14", "why": "All 14 markers are in historical planning artifacts (commit-message TODOs, plan steps) or prose discussing 'over-tagging' as a concept — not actionable cruft."}
  ],
  "auto_fixed": [
    {"id": "AF1", "dimension": "drift", "severity": "high", "location": "roles/README.md:11-13", "description": "'base and docker_host not built yet — empty, untracked dirs, so site.yml would fail on a clean clone' contradicts STATUS.md: base is partially built (firewall concern, tracked), docker_host does not exist, dev_env is built+applied.", "fix": "rewrote Current-state paragraph: base partially built (firewall), docker_host not yet created, dev_env built+applied.", "tag": "new"},
    {"id": "AF2", "dimension": "drift", "severity": "medium", "location": "playbooks/site.yml:4-5", "description": "NOTE claimed base + docker_host 'not built yet ... fails on a clean clone'; base's firewall concern is built+applied per STATUS.md.", "fix": "NOTE now states base is partially built (firewall) and only docker_host is missing.", "tag": "new"},
    {"id": "AF3", "dimension": "drift", "severity": "medium", "location": "playbooks/README.md:6-8", "description": "site.yml described as 'currently a no-op' (roles empty); base's firewall now applies real nftables state. workstation.yml (applies dev_env) was unlisted.", "fix": "reworded the no-op claim and added a workstation.yml bullet.", "tag": "new"},
    {"id": "AF4", "dimension": "drift", "severity": "low", "location": "README.md:58-76", "description": "project-structure tree omitted docs/access/, docs/backup/, roles/dev_env/, and playbooks/workstation.yml — all present on disk.", "fix": "added the four missing tree entries.", "tag": "recurring"},
    {"id": "AF5", "dimension": "consistency", "severity": "low", "location": "docs/decisions/016-mesh-vpn.md:110; docs/decisions/020-firewall.md:135", "description": "ADR-021 states it amends ADR-016 and ADR-020 to cross-reference the SSH ladder, but neither listed ADR-021 back in its See-also/Related section.", "fix": "added the reciprocal ADR-021 cross-reference to both.", "tag": "new"}
  ],
  "open": [
    {"id": "O1", "dimension": "conformance", "severity": "high", "location": "playbooks/site.yml:18", "description": "`make lint` is RED on `main`: site.yml imports the `docker_host` role which does not exist, so ansible-lint syntax-check fails on a clean checkout. Violates CLAUDE.md 'main must always work' and 'Never skip lint' (pre-commit would block every commit unless bypassed).", "suggested_fix": "Decide an interim posture: guard the docker_host play (e.g. skip until the role exists), stub the role via `make new-role NAME=docker_host`, or exclude site.yml from syntax-check until built — and record it. Judgement call.", "tag": "new", "auto_fixable": false},
    {"id": "O2", "dimension": "consistency", "severity": "high", "location": "docs/decisions/004-docker-model.md:105 ↔ docs/decisions/022-backup.md", "description": "ADR-004 'Persistent data' says 'Backup strategy is defined separately (not in scope of this repo).' ADR-022 defines a full in-repo backup strategy (backup role, fisi pull node, per-service backup__* + BACKUP.md). Direct ADR↔ADR contradiction on scope.", "suggested_fix": "Update ADR-004's line to point at ADR-022 (backup is now in-repo scope) and cross-link, per ADR-023's no-silent-reversal rule. Design decision — report only.", "tag": "new", "auto_fixable": false},
    {"id": "O3", "dimension": "consistency", "severity": "medium", "location": "docs/decisions/004-docker-model.md:48-49", "description": "ADR-004's service-role file table (the canonical standard) lists only SECURITY.md + VERIFY.md, but CLAUDE.md + ADR-021/ADR-022 now mandate ACCESS.md (every service role) and BACKUP.md (stateful service roles).", "suggested_fix": "Add ACCESS.md (ADR-021) and BACKUP.md (ADR-022) rows to ADR-004's service-role file table. (Prior O1 'missing VERIFY.md' is now resolved — this is the next evolution.)", "tag": "new", "auto_fixable": false},
    {"id": "O4", "dimension": "consistency", "severity": "medium", "location": "docs/CAPABILITIES.md:149-154 ↔ STATUS.md:29", "description": "CAPABILITIES lists nvim/tmux/shell config as a CONFIRMED EXCLUSION ('boma is server-only, so these are correctly absent'), but the dev_env role (built+applied to ubongo) installs exactly zsh+oh-my-zsh+tmux+neovim.", "suggested_fix": "Carve out an exception for the control-node developer/AI-worker environment (ubongo, ADR-015) rather than flatly excluding nvim/tmux; distinguish infra worker-host config from personal desktops.", "tag": "new", "auto_fixable": false},
    {"id": "O5", "dimension": "drift", "severity": "medium", "location": "docs/decisions/002-security.md:82", "description": "References `make deploy PLAYBOOK=upgrade` as the deliberate full-upgrade mechanism, but no upgrade.yml playbook exists (only bootstrap/site/workstation) and ADR-011 update-management is still Proposed/unbuilt — stated without the '(planned)' caveat ADR-002 uses for its other unbuilt controls.", "suggested_fix": "Add a '(planned — ADR-011, not yet built)' caveat to the upgrade line.", "tag": "new", "auto_fixable": false},
    {"id": "O6", "dimension": "drift", "severity": "medium", "location": "inventories/production/hosts.yml:7-16; inventories/staging/hosts.yml:7-14", "description": "Committed hosts.yml stubs omit the offsite_hosts group, but it is one of the four VALID_GROUPS in tf_to_inventory.py and in ADR-009/ADR-016/CLAUDE.md; the next `make tf-inventory` would add it, so the hand-stubs have drifted. (Prior O4 'askari group unnamed' is resolved — naming is now consistent; this is the residual stub gap.)", "suggested_fix": "Regenerate via `make tf-inventory TF_ENV=production` and `TF_ENV=staging` (do NOT hand-edit hosts.yml — CLAUDE.md), or accept the stubs lag until TF runs.", "tag": "new", "auto_fixable": false},
    {"id": "O7", "dimension": "drift", "severity": "medium", "location": "docs/runbooks/new-host.md:81-130", "description": "Part E (control node ubongo) instructs creating an 'ansible' user and 'ssh ansible@<IP>', but STATUS.md records ubongo is deliberately managed as the operator account sjat (group_vars/control ansible_user: sjat) with the ansible-user bootstrap listed as Pending.", "suggested_fix": "Update Part E to reflect ubongo managed as sjat (no ansible user yet), ansible-user bootstrap a pending item per STATUS.md.", "tag": "new", "auto_fixable": false},
    {"id": "O8", "dimension": "conformance", "severity": "medium", "location": "roles/dev_env/tasks/per_user.yml:2-9", "description": "The getent + `set_fact: dev_env__home` preflight is untagged, but downstream tasks that consume dev_env__home carry concern tags (users, config). A partial `--tags users` or `--tags config` run skips the set_fact, leaving dev_env__home undefined and failing the tagged tasks — against ADR-019's concern-runnable-in-isolation intent.", "suggested_fix": "Tag the preflight with the union of dependent concerns ([users, config]) or `always`.", "tag": "new", "auto_fixable": false},
    {"id": "O9", "dimension": "consistency", "severity": "medium", "location": "STATUS.md:31 ↔ docs/decisions/007-network.md", "description": "STATUS places ubongo at 10.20.10.151; ADR-007 defines srv as 10.20.0.0/24 and mgmt as 10.10.0.0/24 — 10.20.10.151 is in neither. base__firewall_control_addr (ADR-021 recovery path) depends on this address being correct. Already a tracked follow-up in the ubongo-build plan (line 147).", "suggested_fix": "Either correct ubongo's recorded address to a valid ADR-007 subnet, or amend ADR-007 to document the actual VLAN/subnet ubongo's physical port lives on, before base__firewall_control_addr is populated.", "tag": "new", "auto_fixable": false},
    {"id": "O10", "dimension": "drift", "severity": "low", "location": "README.md:104-106", "description": "README's Documentation ADR list stops at 017; ADRs 018 (logging), 019 (tagging), 020 (firewall), 021 (access), 022 (backup), 023 (ADR structure) exist and are in CLAUDE.md's full table. Partial enumeration is now stale. (Evolved from prior O3, which is otherwise resolved — the docs/ tree omissions were fixed in AF4.)", "suggested_fix": "Extend the list through 023, or trim it to a pointer at CLAUDE.md's full table to avoid a stale partial list.", "tag": "recurring", "auto_fixable": false},
    {"id": "O11", "dimension": "conformance", "severity": "low", "location": "docs/decisions/008-testing.md:3; 014-knowledge-sourcing.md:98; 016-mesh-vpn.md:91; 017-service-ui-verification.md:66; 018-logging.md:73", "description": "ADR-023 §2 mandates section order Status→Context→Decision→Consequences. ADR-008 injects a gotchas blockquote before ## Status; ADR-014's ## Decision is a late summary after six topical sections; ADR-016/017/018 place ## Status mid-document. The scan checks presence, not order, so all pass lint — but they don't match the stated standard.", "suggested_fix": "Presentational restructure per ADR-023 §6 (move Status first; pull Decision up). No decision substance changes. Judgement call — report.", "tag": "new", "auto_fixable": false},
    {"id": "O12", "dimension": "consistency", "severity": "low", "location": "docs/decisions/007-network.md:160", "description": "The naming-scheme table states the public FQDN convention is `<service>.baobab.band`, but its own example is `forgejo.nyumbani.baobab.band` (extra nyumbani label). The nyumbani split-horizon sub-label is still OPEN (TODO 4); convention and example disagree.", "suggested_fix": "Change the example to forgejo.baobab.band, or note nyumbani is an unresolved split-horizon sub-label (TODO 4). Ties to an open decision — report.", "tag": "new", "auto_fixable": false},
    {"id": "O13", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/files/dotfiles/zsh/.zshrc:28,55", "description": "Shipped .zshrc hard-codes `alias rclone=\"/usr/bin/rclone\"` (rclone is not installed by dev_env) and `eval \"$(direnv hook zsh)\"` unguarded (unlike the guarded oh-my-posh block) — heritage fisi/V4 carryovers. If direnv is dropped from dev_env__packages every shell startup errors.", "suggested_fix": "Drop the rclone alias (role doesn't install it) and guard the direnv hook with `command -v direnv`, or document direnv as a hard dependency of the shipped .zshrc.", "tag": "new", "auto_fixable": false},
    {"id": "O14", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/tasks/oh_my_posh.yml:15-26", "description": "The zen.toml theme-directory + deploy tasks render config to disk but carry no `config` tag, while analogous dotfile tasks in per_user.yml are tagged `config` — inconsistent concern tagging within the role.", "suggested_fix": "Add tags: [config] to the zen.toml directory + deploy tasks.", "tag": "new", "auto_fixable": false},
    {"id": "O15", "dimension": "consistency", "severity": "low", "location": "terraform/environments/production/terraform.tfvars.example:9-11; staging/terraform.tfvars.example", "description": "proxmox_node/endpoint examples use pve01 / pve01.baobab.band, but ADR-007 defines Proxmox node names as pve0/pve1/pve2 (single digit, no leading zero). Example contradicts the naming convention.", "suggested_fix": "Change example values to pve0 / pve0.baobab.band (both envs). Verify the actual node name first — report rather than auto-fix.", "tag": "new", "auto_fixable": false},
    {"id": "O16", "dimension": "consistency", "severity": "low", "location": "docs/decisions/013-heritage-v4.md:77; docs/decisions/015-control-host.md", "description": "ADR-013 and ADR-015 close with an inline 'See also:' prose line, whereas ADRs 014/019/020/021/022 and the adr-template use a dedicated `## Related` section. Stylistic inconsistency (## Related is optional per ADR-023 §3).", "suggested_fix": "Convert the 'See also:' prose in ADR-013/015 into ## Related sections for uniformity. Cosmetic.", "tag": "new", "auto_fixable": false},
    {"id": "O17", "dimension": "cruft", "severity": "low", "location": "roles/dev_env/handlers/main.yml; roles/base/handlers/main.yml", "description": "Both roles ship an empty handlers/main.yml (only `---`); neither defines or uses handlers (base's firewall apply/rollback is deliberately in tasks). Scaffold artifacts from make new-role.", "suggested_fix": "Confirm whether empty scaffold files are an intentional convention; if not, delete. Low priority.", "tag": "new", "auto_fixable": false},
    {"id": "O18", "dimension": "consistency", "severity": "low", "location": "docs/README.md:5-8; inventories/README.md:1-12", "description": "docs/README.md lists only decisions/ + runbooks/ (omits security/testing/access/backup/hardware/reviews/superpowers); inventories/README.md omits the offsite_hosts group documented in CLAUDE.md. Both are narrower than current reality.", "suggested_fix": "Add the missing subdir rows / note offsite_hosts, or explicitly defer to the canonical list. Low priority.", "tag": "new", "auto_fixable": false}
  ],
  "prior_resolved": [
    {"id": "O1@2026-06-05", "description": "ADR-004 service-role table missing VERIFY.md row", "status": "resolved — table now lists SECURITY.md + VERIFY.md (next gap ACCESS/BACKUP tracked as O3)"},
    {"id": "O2@2026-06-05", "description": "new-role runbook missing VERIFY.md step", "status": "resolved — step 10 present"},
    {"id": "O3@2026-06-05", "description": "README ADR list + docs/ tree omissions", "status": "partial — docs tree security/testing/hardware now present; access/backup fixed in AF4; ADR-list staleness carried as O10"},
    {"id": "O4@2026-06-05", "description": "askari inventory group unnamed", "status": "resolved — offsite_hosts named consistently (residual stub gap = O6)"},
    {"id": "O5@2026-06-05", "description": "backend.tf mislabelled Forgejo state backend", "status": "resolved — now labelled local state"},
    {"id": "O6@2026-06-05", "description": "ADR-014 plugin reproducibility described open but TODO done", "status": "resolved"},
    {"id": "O11@2026-06-05", "description": "CAPABILITIES missing /verify-service Level-4 row", "status": "resolved — present (§10)"},
    {"id": "O12@2026-06-05", "description": "TODO 3.10 garbled", "status": "resolved — readable"},
    {"id": "O7-O10@2026-06-05", "description": "ADR-011 digest-pinning row; act_runner ambiguity; WireGuard Molecule row; ADR-011 scheduled_jobs cross-ref", "status": "not re-detected this run (ADR-011 still Proposed) — verify on next run"}
  ]
}
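The `counts` block in the findings file above duplicates information that can be derived from the `auto_fixed` and `open` arrays, so the two can drift apart. A minimal sketch of a recount, assuming only the field names visible in the file (`counts`, `auto_fixed`, `open`, `severity`, `dimension`); the function name `summarise` and the inline sample are hypothetical and not part of the repo:

```python
from collections import Counter


def summarise(findings: dict) -> dict:
    """Recount a findings document and compare the result with its `counts` block."""
    auto_fixed = findings.get("auto_fixed", [])
    open_items = findings.get("open", [])
    declared = findings.get("counts", {})
    return {
        # True when the declared totals equal the actual array lengths.
        "auto_fixed_matches": declared.get("auto_fixed") == len(auto_fixed),
        "open_matches": declared.get("open") == len(open_items),
        # Tallies that a review summary table could be generated from.
        "open_by_severity": dict(Counter(item["severity"] for item in open_items)),
        "open_by_dimension": dict(Counter(item["dimension"] for item in open_items)),
    }


# Tiny inline sample (hypothetical data, not the real findings file).
sample = {
    "counts": {"auto_fixed": 1, "open": 2},
    "auto_fixed": [{"id": "AF1", "dimension": "drift", "severity": "high"}],
    "open": [
        {"id": "O1", "dimension": "conformance", "severity": "high"},
        {"id": "O2", "dimension": "consistency", "severity": "low"},
    ],
}
print(summarise(sample))
```

To run it against a real file, load the document with `json.load` and pass the resulting dict to `summarise`; the per-dimension tally can then be compared with the "By dimension" line of the matching review.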
161
docs/reviews/2026-06-11-review.md
Normal file
@@ -0,0 +1,161 @@
# Repo review — 2026-06-11

- **Reviewed commit:** `67f2aba` (main)
- **Mode:** on-demand (interactive)
- **Previous run:** `2026-06-05` (commit `f566fd1`)
- **Process:** Phase 0 deterministic scan → 5 parallel shard reviewers + 1 cross-cutting
  reviewer → synthesis, deferral-checklist resolution, prior-run diff → safe auto-fixes.

## Summary

| | High | Medium | Low | Total |
|---|---|---|---|---|
| **Auto-fixed** | 1 | 2 | 2 | 5 |
| **Open (report-only)** | 2 | 7 | 9 | 18 |

By dimension (open): conformance 3 · consistency 10 · drift 4 · cruft 1.

**Headline:** `make lint` is currently **red on `main`** — `playbooks/site.yml` imports the
not-yet-existent `docker_host` role (confirmed at clean HEAD, unrelated to this run's
edits). That breaks CLAUDE.md's "main must always work" / "Never skip lint" contract and
is the top open finding (O1). The bulk of the rest is documentation drift created by the
recent `base` (firewall) + `dev_env` build wave: several READMEs/playbook notes still
described the roles as "empty / not built." Those were the safe auto-fixes.

**Good news:** 7 of the 12 open findings from the 2026-06-05 run are confirmed resolved
(VERIFY.md row + runbook step, backend.tf relabel, askari group naming, ADR-014
reproducibility, CAPABILITIES Level-4 row, TODO 3.10). The deferral checklist is clean —
**0 stale-deferred** this run (the recurring miss logged in FRICTION.md did not recur).

## Auto-fixes applied

Markdown / YAML-comment only; no runtime behaviour, logic, vars, or task order touched.

| ID | Sev | File(s) | What |
|---|---|---|---|
| AF1 | high | `roles/README.md` | Rewrote stale "base & docker_host are empty untracked dirs, site.yml would fail on a clean clone" → base partially built (firewall), docker_host not yet created, dev_env built+applied. |
| AF2 | med | `playbooks/site.yml` | NOTE no longer claims base is unbuilt / "fails on a clean clone"; now reflects firewall-only base + missing docker_host. |
| AF3 | med | `playbooks/README.md` | Dropped the "currently a no-op" claim; added a `workstation.yml` bullet. |
| AF4 | low | `README.md` | Added `docs/access/`, `docs/backup/`, `roles/dev_env/`, `playbooks/workstation.yml` to the project-structure tree. |
| AF5 | low | `docs/decisions/016-mesh-vpn.md`, `docs/decisions/020-firewall.md` | Added the reciprocal `ADR-021` cross-reference to both; ADR-021 states that it amends them, but neither linked back. |

> `make lint` was re-run after the fixes: it fails **only** on the pre-existing
> `docker_host` syntax-check (O1), identical to clean HEAD. No auto-fix introduced or
> changed any lint result, so none were reverted.

## Open findings (prioritised)

### High

- **O1 — `make lint` is red on `main`** · `playbooks/site.yml:18` · *conformance*
  site.yml imports the `docker_host` role, which does not exist, so ansible-lint's
  syntax-check fails on a clean checkout. Violates "main must always work" + "Never skip
  lint" (pre-commit would block every commit unless bypassed).
  *Fix (judgement):* guard/skip the docker_host play until the role exists, scaffold a
  stub via `make new-role NAME=docker_host`, or exclude site.yml from syntax-check until
  built — and record the choice. **new**

- **O2 — ADR-004 ↔ ADR-022 backup-scope contradiction** ·
  `docs/decisions/004-docker-model.md:105` · *consistency*
  ADR-004 says "Backup strategy is defined separately (not in scope of this repo)";
  ADR-022 defines a full in-repo backup strategy. Per ADR-023 (no silent reversals),
  update ADR-004's line to defer to ADR-022 and cross-link. Design decision — report. **new**

### Medium

- **O3 — ADR-004 service-role file table missing ACCESS.md + BACKUP.md** ·
  `docs/decisions/004-docker-model.md:48` · *consistency* — CLAUDE.md + ADR-021/022 now
  mandate both for service roles; the canonical table lists only SECURITY.md + VERIFY.md.
  (Prior "missing VERIFY.md" is resolved; this is the next evolution.) **new**
- **O4 — CAPABILITIES nvim/tmux exclusion ↔ dev_env built** ·
  `docs/CAPABILITIES.md:149` · *consistency* — listed as a confirmed exclusion
  ("server-only"), but `dev_env` (built+applied to ubongo) installs exactly that. Carve
  out the control-node/AI-worker exception (ADR-015). **new**
- **O5 — phantom `make deploy PLAYBOOK=upgrade`** · `docs/decisions/002-security.md:82` ·
  *drift* — no `upgrade.yml` exists; ADR-011 is unbuilt. Add a "(planned)" caveat. **new**
- **O6 — hosts.yml stubs missing `offsite_hosts` group** ·
  `inventories/{production,staging}/hosts.yml` · *drift* — the generator emits it (one of
  four VALID_GROUPS); the hand-stubs predate the standard. Regenerate via
  `make tf-inventory` (don't hand-edit). (Prior "askari group unnamed" is resolved.) **new**
- **O7 — new-host runbook Part E vs ubongo reality** · `docs/runbooks/new-host.md:81-130`
  · *drift* — instructs creating an `ansible` user / `ssh ansible@`; STATUS records ubongo
  is managed as `sjat`, ansible-user bootstrap pending. **new**
- **O8 — dev_env untagged `set_fact` under tagged consumers** ·
  `roles/dev_env/tasks/per_user.yml:2-9` · *conformance* — partial `--tags users|config`
  runs skip the `dev_env__home` set_fact and fail. Tag the preflight `[users, config]` or
  `always`. **new**
- **O9 — ubongo address outside ADR-007 subnets** · `STATUS.md:31 ↔ 007-network.md` ·
  *consistency* — 10.20.10.151 is in neither srv (10.20.0.0/24) nor mgmt (10.10.0.0/24);
  `base__firewall_control_addr` depends on it. Already a tracked follow-up in the
  ubongo-build plan. Reconcile address or ADR-007. **new**
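
The subnet claim in O9 can be checked mechanically with Python's standard `ipaddress` module. This is only a sketch: the two networks are the values quoted in the finding (attributed there to ADR-007), and the helper name `matching_subnets` is hypothetical.

```python
import ipaddress

# Subnets as quoted in finding O9 (attributed there to ADR-007).
SUBNETS = {
    "srv": ipaddress.ip_network("10.20.0.0/24"),
    "mgmt": ipaddress.ip_network("10.10.0.0/24"),
}


def matching_subnets(address: str) -> list[str]:
    """Return the names of the listed subnets that contain `address`."""
    ip = ipaddress.ip_address(address)
    return [name for name, network in SUBNETS.items() if ip in network]


print(matching_subnets("10.20.10.151"))  # prints [] : the address is in neither subnet
print(matching_subnets("10.20.0.151"))   # prints ['srv']
```

A /24 starting at 10.20.0.0 covers only 10.20.0.0 through 10.20.0.255, which is why the third octet `10` puts the recorded address outside `srv`.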

### Low

- **O10 — README ADR list stops at 017** · `README.md:104` · *drift* — 018–023 exist;
  extend or trim to a pointer. **recurring** (evolved from prior O3)
- **O11 — ADR section-order vs ADR-023 §2** · `008:3, 014:98, 016:91, 017:66, 018:73` ·
  *conformance* — Status-not-first / Decision-late; passes lint (order not gated) but not
  the standard. Presentational restructure. **new**
- **O12 — ADR-007 FQDN convention vs its own example** · `007-network.md:160` ·
  *consistency* — `<service>.baobab.band` vs `forgejo.nyumbani.baobab.band`; ties to open
  TODO 4 (split-horizon). **new**
- **O13 — dev_env `.zshrc` heritage carryovers** ·
  `roles/dev_env/files/dotfiles/zsh/.zshrc:28,55` · *consistency* — hard-coded
  `/usr/bin/rclone` alias (not installed by the role) + unguarded `direnv` hook. **new**
- **O14 — oh_my_posh config tasks untagged** · `roles/dev_env/tasks/oh_my_posh.yml:15-26`
  · *consistency* — inconsistent `config` tagging vs per_user.yml. **new**
- **O15 — tfvars.example `pve01` vs ADR-007 `pve0`** ·
  `terraform/environments/*/terraform.tfvars.example:9` · *consistency* — verify the real
  node name, then align. **new**
- **O16 — ADR-013/015 "See also:" vs `## Related`** · *consistency* — stylistic; convert
  for uniformity. **new**
- **O17 — empty scaffold `handlers/main.yml`** · `roles/{dev_env,base}/handlers/main.yml`
  · *cruft* — confirm convention or delete. **new**
- **O18 — docs/README.md + inventories/README.md narrower than reality** · *consistency*
  — omit several real subdirs / the offsite_hosts group. **new**
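
O11 notes that the scan checks section presence but not order. A minimal sketch of an order check follows, assuming ADR sections are level-2 Markdown headings named exactly as in the finding (Status, Context, Decision, Consequences). The function name is hypothetical and this is not the repo's scanner. It only compares the relative order of the four headings, so it would not flag content placed before `## Status`, as reported for ADR-008.

```python
import re

# Section order quoted in finding O11 (attributed there to ADR-023 §2).
REQUIRED_ORDER = ["Status", "Context", "Decision", "Consequences"]


def section_order_ok(markdown: str) -> bool:
    """True when all four required `##` sections exist and appear in order."""
    # Level-2 headings only; `###` and deeper do not match because a space
    # must follow the second `#`.
    headings = re.findall(r"^## +(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    if any(name not in headings for name in REQUIRED_ORDER):
        return False
    positions = [headings.index(name) for name in REQUIRED_ORDER]
    return positions == sorted(positions)


good = "# ADR\n\n## Status\nAccepted\n\n## Context\n...\n\n## Decision\n...\n\n## Consequences\n...\n"
late_status = "# ADR\n\n## Context\n...\n\n## Status\nAccepted\n\n## Decision\n...\n\n## Consequences\n...\n"
print(section_order_ok(good))         # prints True
print(section_order_ok(late_status))  # prints False
```

Extra sections such as `## Related` between or after the required ones do not affect the result, which matches the finding's note that `## Related` is optional.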

## Deferral checklist (Phase 2)

| Source | Items | Verdict |
|---|---|---|
| ADR-011 Deferred/Open | 5 (snapshot driver, cadences, health-check harness home, classification home, staging-first) | **All genuinely still open** — cross-checked against later ADRs + TODO 16. None silently resolved. |
| ADR-015 Deferred | #1 mesh VPN, #2 service-UI, #3 build | **All marked RESOLVED in place** (ADR-016 / ADR-017 / 2026-06-11 build). |

**Stale-deferred found: 0.** The recurring FRICTION.md miss did not recur this run.

## Scan false positives (folded in, not actionable)

- `broken-path-ref STATUS.md:38` — STATUS legitimately documents `roles/docker_host/` as
  "Not in git." (intentional reference to an unbuilt role).
- `broken-adr-ref` ×4 — `ADR-099`/`ADR-100` in `tests/test_repo_scan.py` and the
  adr-structure plan are intentional **test fixtures** for the scanner's bad-ref check.
- `marker` ×14 — all in `docs/superpowers/{plans,specs}/*` (historical commit-message
  TODOs / plan steps) or prose discussing "over-tagging" as a concept. Not cruft.
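
The `broken-adr-ref` false positives above arise because a reference check cannot tell a deliberate fixture from a real dangling reference. A rough sketch of such a check with an explicit allowlist, assuming references take the form `ADR-NNN`; the function, its parameters, and the sample set of known ADR numbers are illustrative assumptions, not the repo's actual scanner:

```python
import re

ADR_REF = re.compile(r"\bADR-(\d{3})\b")


def broken_adr_refs(text: str, known: set, allow: frozenset = frozenset()) -> list:
    """Return ADR references in `text` that match no known ADR number.

    `allow` lists numbers that are deliberately unresolved, such as test fixtures.
    """
    found = {int(number) for number in ADR_REF.findall(text)}
    return [f"ADR-{number:03d}" for number in sorted(found - known - allow)]


# Hypothetical set of existing ADR numbers, for illustration only.
known_adrs = set(range(1, 24))
line = "See ADR-021; the fixtures use ADR-099 and ADR-100."
print(broken_adr_refs(line, known_adrs))                              # prints ['ADR-099', 'ADR-100']
print(broken_adr_refs(line, known_adrs, allow=frozenset({99, 100})))  # prints []
```

Whether fixtures are better suppressed by number, by path, or simply triaged by a reviewer as this report does is a design choice the source does not settle.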

## Prior-run diff (vs 2026-06-05)

**Resolved (7):** O1 VERIFY.md row · O2 new-role VERIFY step · O4 askari group naming ·
O5 backend.tf relabel · O6 ADR-014 reproducibility · O11 CAPABILITIES Level-4 row ·
O12 TODO 3.10. **Partial:** O3 (docs tree fixed in AF4; ADR-list carried as O10).
**Not re-detected (verify next run):** O7–O10 (ADR-011 still Proposed).

## Follow-up prompt (copy-paste)

> Act on the open findings from `docs/reviews/2026-06-11-review.md`. Priority order:
> 1. **O1 (high):** `make lint` is red on `main` — `playbooks/site.yml` imports the
>    non-existent `docker_host` role. Pick an interim posture (guard/skip the play, or
>    `make new-role NAME=docker_host` to scaffold a stub, or exclude from syntax-check
>    until built) so the trunk lints clean again, and record the choice in STATUS.md.
> 2. **O2 (high):** Resolve the ADR-004 ↔ ADR-022 backup-scope contradiction —
>    update ADR-004's "not in scope of this repo" line to defer to ADR-022 (per ADR-023's
>    no-silent-reversal rule) and cross-link.
> 3. **O3:** Add ACCESS.md + BACKUP.md rows to ADR-004's service-role file table.
> 4. **O4:** Reconcile CAPABILITIES' nvim/tmux exclusion with the built `dev_env` role
>    (carve out the ubongo control-node exception).
> 5. **O8 (conformance):** Tag the `dev_env__home` preflight `set_fact` so partial
>    `--tags users|config` runs don't fail.
> 6. **O6 / O9:** Regenerate the inventory stubs to include `offsite_hosts`; reconcile
>    ubongo's 10.20.10.151 against ADR-007's subnets (or amend ADR-007).
> 7. Sweep the remaining doc items (O5 caveat, O7 runbook, O10 ADR list, O11 ADR
>    section order, O12–O18) as a single docs-hygiene batch.
> Run `make lint` before committing; commit per CLAUDE.md git conventions.
76
docs/reviews/2026-06-14-findings.json
Normal file
@@ -0,0 +1,76 @@
{
  "date": "2026-06-14",
  "reviewed_commit": "e346137",
  "fixes_commit": null,
  "mode": "on-demand",
  "counts": {
    "auto_fixed": 11,
    "open": 29,
    "scan": {
      "broken-adr-ref": 4,
      "broken-path-ref": 2,
      "marker": 14,
      "open-deferred-item": 5,
      "stale-deferred": 0
    }
  },
  "deferral_checklist": {
    "adr-011-open-items": "all 5 ('Open questions': Proxmox snapshot driver, exact cadences, health-check harness home, classification home, staging-first) confirmed genuinely still open. ADR-011 is still Proposed/unbuilt; the same questions are echoed open in docs/TODO.md item 16; no later ADR or STATUS decides any of them. No stale-deferred.",
    "stale_deferred_found": 0
  },
  "scan_false_positives": [
    {"check": "broken-adr-ref", "location": "tests/test_repo_scan.py:10,43; docs/superpowers/plans/2026-06-10-adr-structure.md:50,83", "why": "ADR-099/ADR-100 are intentional test fixtures exercising the scanner's bad-ref detection."},
    {"check": "broken-path-ref", "location": "docs/superpowers/plans/2026-06-14-m4b-netbird.md:28,56", "why": "roles/netbird/ is referenced by the M4b implementation plan for a role to be scaffolded via make new-role; forward-looking plan for unbuilt work, not a dead ref."},
    {"check": "marker", "location": "docs/decisions/019-tagging.md:14 + docs/superpowers/plans/* + docs/superpowers/specs/*", "why": "019-tagging.md:14 is prose discussing 'over-tagging' as a concept ('the TODO explicitly warns against...'), not an actionable TODO. The 13 superpowers markers are historical planning artifacts (commit-message TODOs, plan steps)."}
  ],
|
||||||
|
"auto_fixed": [
|
||||||
|
{"id": "AF1", "dimension": "drift", "severity": "high", "location": "roles/reverse_proxy/meta/main.yml:4-6", "description": "meta description said 'ACME DNS-01 TLS via Gandi ... builds the custom image on-host (caddy-dns/gandi)' — but the role is now vanilla Caddy + HTTP-01 (commit b7e919d dropped the custom image); README/defaults/compose/STATUS all reflect vanilla. Only meta was stale and contradicted the code.", "fix": "rewrote description to 'Vanilla Caddy reverse proxy (ADR-024); TLS via ACME HTTP-01 for public hosts. Routes from reverse_proxy__routes, managed via Docker Compose.'", "tag": "new"},
|
||||||
|
{"id": "AF2", "dimension": "cruft", "severity": "medium", "location": "roles/README.md:11-15", "description": "Current-state paragraph said base hardening (SSH/fail2ball), auditd, packages, users 'not yet built' and docker_host 'scaffolded but has no tasks yet' — but STATUS records the hardening concern built+tested+applied to askari, and docker_host/reverse_proxy/public_dns all built.", "fix": "rewrote to: base firewall+hardening built (hardening applied to askari), docker_host/reverse_proxy/public_dns/dev_env built; auditd/packages/users pending.", "tag": "recurring"},
|
||||||
|
{"id": "AF3", "dimension": "drift", "severity": "medium", "location": "playbooks/README.md:6-13", "description": "site.yml note said docker_host 'scaffolded with no tasks yet' (now installs Docker engine) and the file omitted dns.yml and offsite.yml entirely.", "fix": "reworded site.yml note (base firewall+hardening, no cluster docker hosts yet) and added dns.yml + offsite.yml bullets.", "tag": "new"},
|
||||||
|
{"id": "AF4", "dimension": "cruft", "severity": "low", "location": "roles/public_dns/README.md:7-9", "description": "'the anti-spoof baseline now; askari in M4' — M4a is done; askari + *.askari records are applied.", "fix": "updated to note askari.wingu.me + *.askari wildcard applied in M4a.", "tag": "new"},
|
||||||
|
{"id": "AF5", "dimension": "cruft", "severity": "low", "location": "scripts/README.md:17", "description": "Helper-script list omitted check-tags.py, which exists and is run by make lint (ADR-019).", "fix": "added a check-tags.py bullet.", "tag": "new"},
|
||||||
|
{"id": "AF6", "dimension": "drift", "severity": "medium", "location": "terraform/README.md:7-15", "description": "Top-level terraform README omitted modules/hetzner_vm and environments/offsite — the only built+applied TF environment (askari).", "fix": "added hetzner_vm + offsite env bullets; scoped 'not yet init'ed' to the Proxmox envs.", "tag": "new"},
|
||||||
|
{"id": "AF7", "dimension": "cruft", "severity": "low", "location": "terraform/environments/offsite/providers.tf:1", "description": "Verified-stamp said 'cax11@hel1' but the deployed server is cx23 (CAX11 out of stock).", "fix": "stamp now reads cx23@hel1.", "tag": "new"},
|
||||||
|
{"id": "AF8", "dimension": "cruft", "severity": "low", "location": "terraform/modules/hetzner_vm/variables.tf:7", "description": "server_type description example was 'e.g. cax11 (ARM)'; the only consumer uses cx23.", "fix": "example now 'e.g. cx23 (x86) or cax11 (ARM)'.", "tag": "new"},
|
||||||
|
{"id": "AF9", "dimension": "drift", "severity": "medium", "location": "inventories/production/group_vars/all/public_dns.yml:16-17", "description": "Comment on the *.askari wildcard said 'Caddy gets a *.askari.wingu.me cert via DNS-01 (M4a)' — M4a uses HTTP-01 (the wildcard A record itself is still legitimately needed for name resolution).", "fix": "comment now says per-host certs via ACME HTTP-01 (M4a).", "tag": "new"},
|
||||||
|
{"id": "AF10", "dimension": "drift", "severity": "high", "location": "docs/CAPABILITIES.md:27,29", "description": "Capability table named Traefik as the reverse-proxy candidate (ADR-024 chose Caddy, built+applied) and marked public DNS 'apply pending' (applied 2026-06-14).", "fix": "reverse-proxy row -> 'Caddy (ADR-024)'; public DNS note -> 'applied (M1)'. (The V4-history Traefik mention at line 134 is correct and left as-is.)", "tag": "new"},
{"id": "AF11", "dimension": "cruft", "severity": "low", "location": "README.md:110-119", "description": "README 'Documentation' ADR list stopped at ADR-017; ADR-018..024 exist.", "fix": "extended the list through ADR-024 (logging, tagging, firewall, access, backup, ADR-structure, reverse-proxy).", "tag": "recurring"}
],
"open": [
{"id": "O1", "dimension": "drift", "severity": "high", "location": "STATUS.md:41 (+ 45-48) ↔ STATUS.md:33-34", "description": "The 'Scaffolded but empty — NOT implemented' table still lists roles/docker_host as 'Scaffolded, no tasks ... applying it is a no-op', and the trailing prose (45-48) repeats it. This contradicts STATUS.md:33-34 ('Built + applied', installs Docker CE + compose) and the actual roles/docker_host/tasks/main.yml. An internal STATUS contradiction; one side is plainly correct (docker_host is built).", "suggested_fix": "Remove/rewrite the docker_host row in the 'Scaffolded but empty' table and the 45-48 paragraph: docker_host now installs the Docker engine; only its deferred daemon-hardening + nftables.d scope (ADR-004/020) remains. Report (STATUS is the operator's ground-truth doc — reword deliberately).", "tag": "new", "auto_fixable": false},
{"id": "O2", "dimension": "consistency", "severity": "high", "location": "docs/decisions/004-docker-model.md:105,131 ↔ docs/decisions/022-backup.md", "description": "ADR-004 states twice that 'Backup strategy is defined separately (not in scope of this repo)'. ADR-022 defines a full in-repo backup/DR doctrine (restic, fisi pull node, per-service backup__* + BACKUP.md). Direct ADR↔ADR scope contradiction.", "suggested_fix": "Reword ADR-004's lines to point at ADR-022 (backup is now in-repo scope) and cross-link, per ADR-023's no-silent-reversal rule. Design decision — report.", "tag": "recurring", "auto_fixable": false},
{"id": "O3", "dimension": "consistency", "severity": "high", "location": "docs/decisions/024-reverse-proxy.md (Consequences) ↔ 008-testing.md:70; 017-service-ui-verification.md:27,88; 019-tagging.md:52", "description": "ADR-024's Consequences claim 'ADR-017 prose that mentioned Traefik is updated to read Caddy'. That update was NOT done: ADR-017:27,88 still say 'Traefik + Authentik'; ADR-008:70 'Traefik + Authentik SSO flow'; ADR-019:52 'Traefik routes, Authentik'. The doc set still designs around Traefik while ADR-024 overclaims the reconciliation was completed.", "suggested_fix": "Replace Traefik with Caddy (ADR-024) in ADR-008:70, ADR-017:27,88, ADR-019:52, OR soften ADR-024's Consequences to 'to be updated'. ADR prose = design docs — report (not auto-fixed).", "tag": "new", "auto_fixable": false},
{"id": "O4", "dimension": "conformance", "severity": "high", "location": "docs/decisions/023-adr-structure.md:7-8,77-80 ↔ 016-mesh-vpn.md:3; 017-service-ui-verification.md:3; 018-logging.md:3", "description": "ADR-023 §2 mandates ## Status as the first section and §6 explicitly claims ADRs 001–018 were retroactively restructured to lead with Status (calling out 016–018). But ADR-016/017/018 still open with ## Context, Status buried late (016:~92, 017:~66, 018:~73). ADR-023's own conformance claim is contradicted by three in-scope files. (Older ADRs 001–010 lead with Status but place Decision/Consequences after topical sections — an accepted presentational trade-off per ADR-023 §5/§6.)", "suggested_fix": "Either add a top-of-file ## Status section to ADR-016/017/018 (move the existing build-state line up), or correct ADR-023 §6 to exclude them. Reordering judgement — report.", "tag": "recurring", "auto_fixable": false},
{"id": "O5", "dimension": "consistency", "severity": "medium", "location": "docs/decisions/004-docker-model.md:48-50", "description": "The service-role file table (the canonical standard) lists only README/SECURITY/VERIFY; it omits ACCESS.md (ADR-021) and BACKUP.md (ADR-022), both of which CLAUDE.md + those ADRs mandate as required per-service-role files.", "suggested_fix": "Add ACCESS.md (ADR-021) and BACKUP.md (ADR-022, stateful) rows to ADR-004's file table.", "tag": "recurring", "auto_fixable": false},
{"id": "O6", "dimension": "drift", "severity": "medium", "location": "docs/decisions/002-security.md:82", "description": "References 'make deploy PLAYBOOK=upgrade' as the deliberate full-upgrade mechanism, but no upgrade.yml exists (only bootstrap/dns/offsite/site/workstation) and ADR-011 is still Proposed/unbuilt — stated without the '(planned)' caveat ADR-002 uses for its other unbuilt controls.", "suggested_fix": "Add a '(planned — ADR-011, not yet built)' caveat to the upgrade line, or drop the concrete command until upgrade.yml exists.", "tag": "recurring", "auto_fixable": false},
{"id": "O7", "dimension": "drift", "severity": "medium", "location": "docs/CAPABILITIES.md:150-155 ↔ STATUS.md:29", "description": "CAPABILITIES still lists nvim/kitty/tmux among 'Confirmed exclusions' boma 'deliberately does not' have, but the dev_env role (built+applied to ubongo) installs neovim + tmux. (The reverse-proxy/public-DNS rows in this file were auto-fixed in AF10; this exclusions block was left because it needs a scoped carve-out, not a token swap.)", "suggested_fix": "Scope the exclusion to managed cluster/server hosts and note the control/dev host (ubongo, ADR-015) runs an interactive dev_env, or drop nvim/tmux from the list.", "tag": "recurring", "auto_fixable": false},
{"id": "O8", "dimension": "conformance", "severity": "medium", "location": "roles/dev_env/tasks/main.yml (include_tasks per_user.yml) + roles/dev_env/tasks/per_user.yml:4-9", "description": "per_user.yml's getent + set_fact dev_env__home preflight is untagged, and the include_tasks that pulls it in carries no 'apply: tags:'. base/tasks/main.yml documents and guards exactly this gotcha with apply: tags:; dev_env does not. A partial --tags users or --tags config run selects only the include statement (running nothing) or, if made tag-aware, skips the set_fact and fails the dependent [config] tasks on an undefined dev_env__home. Against ADR-019's concern-runnable-in-isolation intent.", "suggested_fix": "Add apply: tags: [users, config] to the per_user.yml include (mirroring base), and tag the getent+set_fact with 'always' (or the union [users, config]).", "tag": "recurring", "auto_fixable": false},
{"id": "O9", "dimension": "drift", "severity": "medium", "location": "inventories/production/hosts.yml:1-17", "description": "Header claims 'Generated from Terraform outputs: make tf-inventory TF_ENV=production', but the file is hand-maintained: it carries the manual control host (ubongo) and omits the offsite_hosts group that tf_to_inventory.py always emits (VALID_GROUPS). Running tf-inventory against the empty production env would DROP ubongo and ADD offsite_hosts, so the header misrepresents how the file is managed.", "suggested_fix": "Make the header honest (hand-maintained for the manual control-node exception while production TF has no VMs; offsite hosts live in offsite.yml), and reconcile the declared group set with tf_to_inventory.py. Do NOT hand-regenerate hosts.yml in a way that drops ubongo.", "tag": "recurring", "auto_fixable": false},
{"id": "O10", "dimension": "consistency", "severity": "medium", "location": "inventories/production/group_vars/all/vars.yml:42 + hosts.yml:12 ↔ docs/decisions/007-network.md", "description": "ubongo's address is 10.20.10.151 (control host_var + base__firewall_control_addr), but ADR-007 defines srv as 10.20.0.0/24 (network__srv_subnet) and mgmt as 10.10.0.0/24 — 10.20.10.151 is in neither, and ADR-007's addressing tables don't record where the physical control node lives. base__firewall_control_addr (ADR-021 recovery path) depends on this being right.", "suggested_fix": "Add ubongo to ADR-007's addressing table (which VLAN/segment 10.20.10.151 belongs to, clearly outside srv 10.20.0.0/24), or correct the address. Confirm the real address with the operator first.", "tag": "recurring", "auto_fixable": false},
{"id": "O11", "dimension": "consistency", "severity": "medium", "location": "terraform/environments/{staging,production}/terraform.tfvars.example:9-11 + variables.tf:5", "description": "Proxmox node naming uses 'pve01' (two-digit) in both tfvars.example files and the proxmox_endpoint var descriptions; ADR-007 defines single-digit node names pve0/pve1/pve2, and internal FQDNs as <host>.boma.<domain>. Example contradicts the naming convention.", "suggested_fix": "Align example values with ADR-007 (proxmox_node = pve0; endpoint = https://pve0.boma.<domain>:8006/). Verify the intended node name with the operator before changing — report rather than auto-fix.", "tag": "recurring", "auto_fixable": false},
{"id": "O12", "dimension": "conformance", "severity": "medium", "location": "roles/reverse_proxy/ (missing SECURITY.md, VERIFY.md, ACCESS.md, BACKUP.md)", "description": "CLAUDE.md requires every service role to carry SECURITY.md (ADR-002/004), VERIFY.md (ADR-008/017), ACCESS.md (ADR-021), and a stateful BACKUP.md (ADR-022); a stateless service records backup__state: false with a reason. reverse_proxy is the first real built+applied service role (askari, M4a) but ships only README.md. (Judgement recorded: public_dns is exempt — it runs on the control node against an external DNS API, provisioning no host-resident service/port, so it is not a 'service' role in the ADR-004 sense.)", "suggested_fix": "Add the four files from docs/security|testing|access|backup/ templates. BACKUP.md can declare backup__state: false (Caddy state = re-issuable ACME certs).", "tag": "new", "auto_fixable": false},
{"id": "O13", "dimension": "consistency", "severity": "low", "location": "docs/decisions/012-hardware-capacity.md; 013-heritage-v4.md:77; 015-control-host.md; 016-mesh-vpn.md; 017-service-ui-verification.md; 018-logging.md", "description": "Inconsistent cross-reference convention: ADRs 014/019/020/021/022/023 + adr-template use a dedicated '## Related' section, while 012/013/015/016/017/018 use an inline 'See also:' prose line (placed mid-document in 016/017/018). ADR-023 §3 names ## Related as the optional section; 'See also:' is an undocumented variant.", "suggested_fix": "Convert the 'See also:' prose into ## Related sections (after Consequences) in ADR-012/013/015/016/017/018 for uniformity. Cosmetic.", "tag": "recurring", "auto_fixable": false},
{"id": "O14", "dimension": "consistency", "severity": "low", "location": "docs/README.md:4-8; inventories/README.md", "description": "docs/README.md lists only decisions/ + runbooks/ (omits security/testing/access/backup/hardware/reviews); inventories/README.md omits the offsite_hosts group documented in CLAUDE.md. Both narrower than current reality.", "suggested_fix": "Add the missing subdir rows / note offsite_hosts, or explicitly defer to the canonical list in the repo README / CLAUDE.md.", "tag": "recurring", "auto_fixable": false},
{"id": "O15", "dimension": "drift", "severity": "medium", "location": "docs/runbooks/new-host.md:82,114-138 (Part E)", "description": "Part E (control node ubongo) still instructs 'ssh ansible@<IP>' / an ansible-user flow, but STATUS records ubongo is deliberately managed as the operator account sjat (group_vars/control ansible_user: sjat) with the ansible-user bootstrap listed as Pending.", "suggested_fix": "Update Part E to reflect ubongo managed as sjat (no ansible user yet), the ansible-user bootstrap a pending item per STATUS.md.", "tag": "recurring", "auto_fixable": false},
{"id": "O16", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/files/dotfiles/zsh/.zshrc:28,55", "description": "Shipped .zshrc hard-codes alias rclone=\"/usr/bin/rclone\" (rclone not installed by dev_env) and 'eval \"$(direnv hook zsh)\"' unguarded (unlike the guarded oh-my-posh block) — heritage fisi/V4 carryovers. If direnv is dropped from dev_env__packages, every shell startup errors.", "suggested_fix": "Drop the rclone alias and guard the direnv hook with 'command -v direnv', or document direnv as a hard dependency of the shipped .zshrc.", "tag": "recurring", "auto_fixable": false},
{"id": "O17", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/tasks/oh_my_posh.yml:15-26", "description": "The zen.toml theme-directory + deploy tasks render config to disk but carry no 'config' tag, while analogous dotfile tasks in per_user.yml are tagged config — inconsistent concern tagging within the role.", "suggested_fix": "Add tags: [config] to the zen.toml directory + deploy tasks.", "tag": "recurring", "auto_fixable": false},
{"id": "O18", "dimension": "drift", "severity": "medium", "location": "docs/decisions/007-network.md:159,167,186 + 009-provisioning-handoff.md:114 + 016-mesh-vpn.md:90 ↔ 007-network.md:174,184", "description": "Internal-zone name is inconsistent across the doc set: ADR-007:159/167/186, ADR-009:114, ADR-016:90 call it 'boma.baobab.band', while ADR-007:174/184 says infra is '<host>.boma.wingu.me' and the internal zone 'will be renamed to boma.wingu.me' (Phase 2). M1 moved boma's home to wingu.me. A reader can't tell which domain the unbuilt dns role should render.", "suggested_fix": "State the transitional state in one authoritative place (current = boma.baobab.band, target = boma.wingu.me in Phase 2), or align all references on the target. Report.", "tag": "new", "auto_fixable": false},
{"id": "O19", "dimension": "consistency", "severity": "low", "location": "docs/decisions/009-provisioning-handoff.md:122", "description": "M1 retired 'nyumbani' as a naming tier (ROADMAP:70, ADR-007:176). ADR-009:122 still uses 'forgejo.nyumbani.baobab.band' as the worked example of internal-zone data the dns role would render. (Note: STATUS:19 + ADR-003/008/010 use the same name for the LIVE legacy Forgejo host, which is legitimately legacy infra — distinguish.)", "suggested_fix": "Update the ADR-009:122 example to a non-nyumbani name consistent with the retired-nyumbani decision; annotate the legacy Forgejo references as intentionally legacy where they remain.", "tag": "recurring", "auto_fixable": false},
{"id": "O20", "dimension": "drift", "severity": "low", "location": "docs/ROADMAP.md:82-83", "description": "ROADMAP M2 still describes askari as 'CAX11 ARM / Helsinki', but STATUS records it provisioned as cx23/x86 (CAX11/ARM out of stock EU-wide on 2026-06-14). M3/M4 sections got DONE notes; M2's spec line wasn't corrected.", "suggested_fix": "Update ROADMAP M2 to note askari shipped as cx23/x86 (CAX11 unavailable), or add a DONE note mirroring M3/M4.", "tag": "new", "auto_fixable": false},
{"id": "O21", "dimension": "drift", "severity": "low", "location": "docs/decisions/020-firewall.md:91-93", "description": "ADR-020 says askari's Hetzner Cloud Firewall 'NetBird ports (UDP 3478 + TCP 80/443) will be added in M4 when the coordinator role is built' — but M4a is DONE and the firewall already opens 80/443/3478. Future-tense is stale; only the netbird role (M4b) remains.", "suggested_fix": "Update ADR-020 to past tense (80/443/3478 opened in M4a); keep the netbird coordinator role (M4b) caveated as unbuilt.", "tag": "new", "auto_fixable": false},
{"id": "O22", "dimension": "consistency", "severity": "low", "location": "docs/decisions/024-reverse-proxy.md:60-92", "description": "ADR-024 is internally inconsistent post-revision: the revised Status note says askari ships HTTP-01 with vanilla Caddy (custom-image DNS-01 deferred to Phase 2), but Decision §2 still asserts boma builds/maintains the custom xcaddy+gandi image, §3 says 'fronts the NetBird stack on askari (M4)' (M4b unbuilt), and Consequences still lists 'a custom Caddy image must be built/pushed/kept current' as a present obligation.", "suggested_fix": "Scope the custom-image obligation (§2, Consequences) to the deferred Phase-2 DNS-01 path; soften §3 to reflect that M4a ships a test vhost and the NetBird front-end is M4b. Report (touches decision substance).", "tag": "new", "auto_fixable": false},
{"id": "O23", "dimension": "consistency", "severity": "low", "location": "docs/decisions/001-architecture.md:50 + 016-mesh-vpn.md:87 ↔ docs/ROADMAP.md:116", "description": "The future NetBird service role is named 'netbird_coordinator' in ADR-001:50 + ADR-016:87 (coordinator framing also in STATUS), but ROADMAP M4b:116 calls it 'the netbird service role'. make new-role creates one directory name; the committed names will mismatch the actual role at build time. (The M4b plan at docs/superpowers/plans/2026-06-14-m4b-netbird.md also uses 'netbird'.)", "suggested_fix": "Settle one role name and align ADR-001/016, ROADMAP, and the M4b plan before scaffolding.", "tag": "new", "auto_fixable": false},
{"id": "O24", "dimension": "consistency", "severity": "low", "location": "docs/decisions/024-reverse-proxy.md:22 ↔ docs/ROADMAP.md:71", "description": "ADR-024 describes the M1 ACME DNS-01 wildcard as '*.boma.<domain>' (infra subdomain), while ROADMAP:71 specifies '*.<boma-domain>' (apex). Different name spaces — the cert's actual SAN coverage for unexposed services is ambiguous across the two docs.", "suggested_fix": "Align the wildcard scope (decide *.wingu.me vs *.boma.wingu.me vs both) and state it identically in ADR-024 and ROADMAP.", "tag": "new", "auto_fixable": false},
{"id": "O25", "dimension": "consistency", "severity": "low", "location": "roles/reverse_proxy/molecule/default/verify.yml:11,22; roles/public_dns/molecule/default/verify.yml:12", "description": "Molecule verify tasks use tags: [verify], which is not in the tests/tags.yml vocabulary (concerns/special/opt_ins/playbooks). check-tags.py exempts molecule/ paths so the linter doesn't flag it, and 4 roles use this de-facto convention — but it's an out-of-vocabulary tag the ADR-019 standard doesn't sanction.", "suggested_fix": "Either drop the tags from molecule verify tasks (the linter ignores molecule anyway) or add 'verify' as a sanctioned testing-only tag in tests/tags.yml with an ADR-019 note. Repo-wide convention call.", "tag": "new", "auto_fixable": false},
{"id": "O26", "dimension": "consistency", "severity": "low", "location": "roles/reverse_proxy/templates/Caddyfile.j2:1; docker-compose.yml.j2:1", "description": "Neither rendered template carries an {{ ansible_managed }} header, though ADR-024 §1.2 cites 'one ansible_managed header' as a Caddy advantage. (No template in the repo currently uses ansible_managed — consistent with current practice but inconsistent with the ADR's stated intent.)", "suggested_fix": "Add a commented '# {{ ansible_managed }}' header to both templates (and ideally adopt the convention repo-wide).", "tag": "new", "auto_fixable": false},
{"id": "O27", "dimension": "consistency", "severity": "low", "location": "inventories/production/group_vars/all/reverse_proxy.yml", "description": "reverse_proxy production vars live in group_vars/all/ (every host) though the role only runs on offsite_hosts via offsite.yml; CLAUDE.md establishes an offsite_hosts/ group_vars dir for askari-specific config, which doesn't exist on disk. Harmless today (only askari imports the role) but broader scope than intended.", "suggested_fix": "Consider moving reverse_proxy.yml (and the offsite firewall opens) to group_vars/offsite_hosts/ for scope clarity, or leave if intentionally global. Judgement call.", "tag": "new", "auto_fixable": false},
{"id": "O28", "dimension": "drift", "severity": "low", "location": "scripts/capacity-scan.py:133", "description": "capacity-scan.py cross-checks workload hostnames only against inventories/<env>/hosts.yml. askari lives in inventories/production/offsite.yml, not hosts.yml, so the drift cross-check never sees it. Minor (capacity is intent-based today) but a latent gap as offsite hosts grow.", "suggested_fix": "Also read offsite.yml (or glob inventories/<env>/*.yml host files) so offsite_hosts are included.", "tag": "new", "auto_fixable": false},
{"id": "O29", "dimension": "consistency", "severity": "low", "location": "inventories/production/offsite.yml:1-16 ↔ inventories/production/hosts.yml:7-16", "description": "offsite.yml (generated by tf-inventory-offsite) re-declares control/docker_hosts/proxmox_hosts with empty host maps because tf_to_inventory.py always emits all four VALID_GROUPS — duplicating groups in hosts.yml in the same inventory dir. Ansible merges them harmlessly, but the duplication/merge is undocumented.", "suggested_fix": "Document in inventories/README.md that offsite.yml is a second generated inventory file merged with hosts.yml, or have tf_to_inventory.py emit only non-empty groups for offsite. Leave as-is if intended; just document.", "tag": "new", "auto_fixable": false}
],
"prior_resolved": [
{"id": "O1@2026-06-11", "description": "make lint RED on main (site.yml imported nonexistent docker_host role)", "status": "resolved — docker_host scaffolded (03d33f8) then built (456c27d); make lint green this run."},
{"id": "O10@2026-06-11", "description": "README ADR list stopped early (recurring)", "status": "resolved — auto-fixed this run (AF11), extended through ADR-024."},
{"id": "O17@2026-06-11", "description": "empty handlers/main.yml scaffold artifacts in base/dev_env", "status": "resolved (accepted) — treated as an intentional make new-role scaffold convention; not re-raised."},
{"id": "O2,O3,O4,O5,O6,O7,O8,O9,O11,O12,O13,O14,O15,O16,O18@2026-06-11", "description": "ADR-004 backup scope; ADR-004 ACCESS/BACKUP table; CAPABILITIES nvim/tmux; ADR-002 upgrade caveat; hosts.yml offsite_hosts; new-host Part E; dev_env set_fact tag; ubongo subnet; ADR section order; ADR-007 example; .zshrc rclone/direnv; oh_my_posh config tag; tfvars pve01; See-also vs Related; docs/inventories README narrowness", "status": "still open — carried forward as O2,O5,O7,O6,O9,O15,O8,O10,O4,O18/O19,O16,O17,O11,O13,O14 respectively (renumbered)."}
]
}
157
docs/reviews/2026-06-14-review.md
Normal file
|
|
@ -0,0 +1,157 @@
# Repo review — 2026-06-14

- **Reviewed commit:** `e346137` (docs(plan): M4b — NetBird coordinator service role)
- **Mode:** on-demand (interactive — auto-fixes applied + committed)
- **Previous run:** 2026-06-11 (`67f2aba`)
- **`make lint`:** green before and after fixes (260 files, profile production; check-tags OK).

## Summary

A lot shipped since the last review (M4a: `docker_host` Docker engine, `reverse_proxy`
Caddy applied to askari; offsite Terraform env live; ADR-024). Most findings this run are
the predictable **docs-lagging-the-build** kind — stale "not built yet" notes, a
reverse-proxy that switched from DNS-01/custom-image to vanilla HTTP-01 leaving stale
descriptions behind, and the **Traefik→Caddy** rename only half-propagated through the
ADR set. The previous run's blocker (O1, `make lint` RED) is **resolved**.

### Counts

| Dimension | High | Medium | Low | Total |
|---|---|---|---|---|
| Cruft / staleness | 0 | 0 | 0 | 0 |
| Design conformance | 1 | 2 | 0 | 3 |
| Consistency & intent | 2 | 3 | 12 | 17 |
| Docs-vs-reality drift | 1 | 5 | 3 | 9 |
| **Open total** | **4** | **10** | **15** | **29** |

Plus **11 auto-fixes applied** (2 high, 4 medium, 5 low).

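The counts can be cross-checked mechanically against the findings JSON. The following is a minimal sketch, assuming the file exposes an `open` list whose entries carry `dimension` and `severity` keys (as the entries in `2026-06-14-findings.json` do). The function names `tally_open` and `render` and the three-entry sample are illustrative and not part of the repo.

```python
import json
from collections import Counter

SEVERITIES = ("high", "medium", "low")


def tally_open(findings: dict) -> dict[str, Counter]:
    """Count open findings per dimension, broken down by severity."""
    table: dict[str, Counter] = {}
    for item in findings.get("open", []):
        table.setdefault(item["dimension"], Counter())[item["severity"]] += 1
    return table


def render(table: dict[str, Counter]) -> str:
    """Render the tally as Markdown table rows plus a total row."""
    rows = []
    totals = Counter()
    for dimension in sorted(table):
        counts = table[dimension]
        totals.update(counts)
        cells = [str(counts[s]) for s in SEVERITIES]
        rows.append(f"| {dimension} | " + " | ".join(cells) + f" | {sum(counts.values())} |")
    cells = [str(totals[s]) for s in SEVERITIES]
    rows.append("| total | " + " | ".join(cells) + f" | {sum(totals.values())} |")
    return "\n".join(rows)


# Hypothetical three-entry sample; real input would be json.load() of the findings file.
sample = json.loads("""
{"open": [
  {"id": "O1", "dimension": "drift", "severity": "high"},
  {"id": "O2", "dimension": "consistency", "severity": "high"},
  {"id": "O8", "dimension": "conformance", "severity": "medium"}
]}
""")
print(render(tally_open(sample)))
```

Running the same tally over the full `open` array is a cheap guard against the summary table and the JSON drifting apart between runs.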
### Phase-0 scan

`repo-scan.py`: 5 roles, 25 ADRs · broken-adr-ref=4, broken-path-ref=2, marker=14,
open-deferred-item=5, **stale-deferred=0**. Every scan finding is a known false-positive
(test fixtures ADR-099/100; the `roles/netbird/` references in the M4b *plan* for unbuilt
work; superpowers planning artifacts; `019-tagging.md:14` is prose about "over-tagging",
not a TODO). Details in the findings JSON.

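To illustrate what a `broken-adr-ref` check amounts to, here is a minimal sketch. It is not `repo-scan.py`'s actual implementation: the function name, the `ADR-NNN` regex, and the sample inputs are all assumptions made for illustration.

```python
import re

# Matches references written as ADR-NNN (three digits); other spellings are ignored.
ADR_REF = re.compile(r"\bADR-(\d{3})\b")


def broken_adr_refs(text: str, existing: set[int]) -> list[str]:
    """Return ADR references in `text` whose number is not in `existing`, in first-seen order."""
    seen: list[str] = []
    for match in ADR_REF.finditer(text):
        ref = match.group(0)
        if int(match.group(1)) not in existing and ref not in seen:
            seen.append(ref)
    return seen


# Hypothetical inputs: ADRs 001..024 exist; ADR-099 stands in for a test fixture.
existing_adrs = set(range(1, 25))
prose = "Caddy was chosen in ADR-024; the fixture cites ADR-099 and ADR-099 again."
print(broken_adr_refs(prose, existing_adrs))
```

A check of this shape flags fixture references such as ADR-099 by design, which is why findings like these have to be triaged as false positives rather than fixed.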
### Deferral checklist

All 5 ADR-011 "Open questions" (Proxmox snapshot driver, exact cadences, health-check
harness home, classification home, staging-first) confirmed **genuinely still open** —
ADR-011 is still Proposed/unbuilt, the same questions sit open in `docs/TODO.md` item 16,
and no later ADR or STATUS decides any of them. **No stale-deferred** (same as last run).

## Auto-fixes applied

All safe/obvious (stale text contradicting code/reality, partial enumerations, broken
descriptions) — no logic, variable, secret, or task-order changes.

| ID | Sev | File | What |
|---|---|---|---|
| AF1 | high | `roles/reverse_proxy/meta/main.yml` | description still said DNS-01 + custom on-host image → rewrote to vanilla Caddy + HTTP-01 (matches the role since b7e919d) |
| AF2 | med | `roles/README.md` | base hardening + docker_host/reverse_proxy/public_dns build-state was stale → reconciled with STATUS |
| AF3 | med | `playbooks/README.md` | stale "docker_host has no tasks" note; added missing `dns.yml` + `offsite.yml` bullets |
| AF4 | low | `roles/public_dns/README.md` | "askari in M4" → askari + `*.askari` records applied in M4a |
| AF5 | low | `scripts/README.md` | added the missing `check-tags.py` entry (run by `make lint`) |
| AF6 | med | `terraform/README.md` | added `modules/hetzner_vm` + `environments/offsite` (the one applied env) |
| AF7 | low | `terraform/environments/offsite/providers.tf` | verified-stamp `cax11@hel1` → `cx23@hel1` (actual server) |
| AF8 | low | `terraform/modules/hetzner_vm/variables.tf` | `server_type` example `cax11 (ARM)` → `cx23 (x86) or cax11 (ARM)` |
| AF9 | med | `inventories/production/group_vars/all/public_dns.yml` | wildcard comment "cert via DNS-01" → ACME HTTP-01 (M4a) |
| AF10 | high | `docs/CAPABILITIES.md` | reverse-proxy candidate `Traefik` → `Caddy (ADR-024)`; public DNS "apply pending" → "applied (M1)" |
| AF11 | low | `README.md` | Documentation ADR list extended ADR-017 → ADR-024 |

## Open findings (prioritised)

### High

- **O1 — drift — STATUS.md:41 (+45-48) ↔ 33-34** *(new)*: docker_host still appears in
  the "Scaffolded but empty — NOT implemented" table as a no-op, contradicting its own
  "Built + applied" rows and the real tasks file. Reword the scaffold row + closing
  paragraph (left for the operator — STATUS is the ground-truth doc).
- **O2 — consistency — ADR-004:105,131 ↔ ADR-022** *(recurring)*: ADR-004 says backup is
  "not in scope of this repo"; ADR-022 defines a full in-repo backup doctrine. Repoint
  ADR-004 at ADR-022 (ADR↔ADR design decision — report).
- **O3 — consistency — ADR-024 Consequences ↔ ADR-008:70/017:27,88/019:52** *(new)*:
  ADR-024 claims it updated ADR-017's Traefik prose to Caddy; it didn't, and ADR-008/019
  still say Traefik too. Either finish the rename or soften ADR-024's claim.
- **O4 — conformance — ADR-023:7-8,77-80 ↔ ADR-016/017/018** *(recurring)*: ADR-023
  claims ADRs 001–018 were restructured to lead with `## Status`, but 016/017/018 still
  open with `## Context` and bury Status. Fix the three ADRs or correct ADR-023 §6.

### Medium

- **O5 — ADR-004:48-50** *(recurring)*: service-role file table omits ACCESS.md +
  BACKUP.md rows (now mandated by CLAUDE.md/ADR-021/022).
- **O6 — ADR-002:82** *(recurring)*: `make deploy PLAYBOOK=upgrade` cited as real, but no
  `upgrade.yml` exists and ADR-011 is unbuilt — needs a `(planned)` caveat.
- **O7 — CAPABILITIES:150-155 ↔ STATUS:29** *(recurring)*: nvim/tmux listed as a
  "confirmed exclusion" while `dev_env` installs them on ubongo; needs a control-host
  carve-out (not a token swap, so left from AF10).
- **O8 — dev_env tasks (include_tasks + per_user.yml:4-9)** *(recurring)*: untagged
  `set_fact dev_env__home` preflight + include without `apply: tags:`; a partial
  `--tags users|config` run breaks (base guards this; dev_env doesn't).
- **O9 — inventories/production/hosts.yml** *(recurring)*: header claims TF-generated but
  it's hand-maintained (carries ubongo, omits offsite_hosts); `tf-inventory` would drop
  ubongo. Make the header honest.
- **O10 — group_vars/all/vars.yml:42 ↔ ADR-007** *(recurring)*: ubongo `10.20.10.151` is
  in no ADR-007 subnet and undocumented; `base__firewall_control_addr` depends on it.
- **O11 — terraform tfvars.example (both envs)** *(recurring)*: `pve01` vs ADR-007's
  `pve0`; verify the real node name before changing.
- **O12 — roles/reverse_proxy/** *(new)*: first built+applied service role, but missing
  SECURITY/VERIFY/ACCESS/BACKUP.md. (Recorded judgement: public_dns is exempt —
  control-node external-API role, not a host service.)
- **O15 — runbooks/new-host.md Part E** *(recurring)*: still describes an `ansible` user
  on ubongo; STATUS says ubongo is managed as `sjat` (ansible-user bootstrap pending).
- **O18 — ADR-007/009/016 internal-zone name** *(new)*: `boma.baobab.band` vs target
  `boma.wingu.me` used inconsistently across the doc set after M1; state the transition
  in one place.

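O10 can be confirmed with a quick membership test. The following is a minimal sketch using Python's standard `ipaddress` module. The subnet values are the ones the findings JSON quotes from ADR-007 (srv `10.20.0.0/24`, mgmt `10.10.0.0/24`); the dictionary layout, the function name, and the contrasting `10.20.0.42` address are illustrative assumptions.

```python
import ipaddress

# Subnets as quoted from ADR-007 in the findings JSON.
ADR_007_SUBNETS = {
    "srv": ipaddress.ip_network("10.20.0.0/24"),
    "mgmt": ipaddress.ip_network("10.10.0.0/24"),
}


def containing_subnets(address: str) -> list[str]:
    """Return the names of the documented subnets that contain `address`."""
    ip = ipaddress.ip_address(address)
    return [name for name, net in ADR_007_SUBNETS.items() if ip in net]


print(containing_subnets("10.20.10.151"))  # ubongo, per group_vars/all/vars.yml:42
print(containing_subnets("10.20.0.42"))    # hypothetical srv address for contrast
```

The first call returns an empty list, which is the gap O10 describes: the address is valid but belongs to no segment that ADR-007 records.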
### Low

- **O13** *(recurring)*: See-also vs `## Related` in ADR-012/013/015/016/017/018.
- **O14** *(recurring)*: docs/README + inventories/README narrow enumerations.
- **O16** *(recurring)*: .zshrc rclone alias + unguarded direnv hook.
- **O17** *(recurring)*: oh_my_posh zen.toml tasks missing `config` tag.
- **O19** *(recurring)*: ADR-009:122 `nyumbani` example after retirement.
- **O20** *(new)*: ROADMAP M2 CAX11/ARM vs cx23/x86.
- **O21** *(new)*: ADR-020 "ports will be added in M4" stale; already opened in M4a.
- **O22** *(new)*: ADR-024 body still asserts custom-image obligation contradicting its
  revised Status.
- **O23** *(new)*: `netbird_coordinator` vs `netbird` role name across ADRs/ROADMAP/plan.
- **O24** *(new)*: `*.boma.<domain>` vs `*.<boma-domain>` wildcard scope, ADR-024 vs ROADMAP.
- **O25** *(new)*: `tags: [verify]` out of the ADR-019 vocabulary in molecule verify.
- **O26** *(new)*: reverse_proxy templates lack `ansible_managed` header.
- **O27** *(new)*: reverse_proxy vars in `group_vars/all/` not `offsite_hosts/`.
- **O28** *(new)*: capacity-scan.py ignores `offsite.yml`.
- **O29** *(new)*: offsite.yml duplicates empty groups from hosts.yml, undocumented merge.

Full detail + suggested fixes in `2026-06-14-findings.json`.

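For O28 in particular, the gap is that only `hosts.yml` is consulted. The following is a minimal sketch of collecting hostnames across several already-parsed inventory documents, assuming the standard Ansible YAML inventory shape (`hosts` and `children` keys under each group). This is not `capacity-scan.py`'s code: the function names and the inline sample data are hypothetical, and YAML loading is left out so the block stays standard-library only.

```python
def hosts_in_group(group: dict) -> set[str]:
    """Recursively collect hostnames from one inventory group mapping."""
    found = set((group.get("hosts") or {}).keys())
    for child in (group.get("children") or {}).values():
        found |= hosts_in_group(child or {})
    return found


def hosts_in_inventories(documents: list[dict]) -> set[str]:
    """Union of hostnames across several parsed inventory files."""
    found: set[str] = set()
    for doc in documents:
        for group in (doc or {}).values():
            found |= hosts_in_group(group or {})
    return found


# Hypothetical stand-ins for parsed hosts.yml and offsite.yml.
hosts_yml = {"all": {"children": {"control": {"hosts": {"ubongo": None}}}}}
offsite_yml = {
    "all": {
        "children": {
            "offsite_hosts": {"hosts": {"askari": None}},
            "control": {"hosts": {}},  # empty re-declared group, as O29 describes
        }
    }
}
print(sorted(hosts_in_inventories([hosts_yml, offsite_yml])))
```

Taking the union over every inventory file in the environment directory also tolerates the empty duplicate groups noted in O29, since they contribute no hostnames.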
## Themes worth a deliberate pass

1. **Finish the Traefik→Caddy rename** (O3, and ADR-024 over-claimed it was done). One
   sweep across ADR-008/017/019 closes it.
2. **STATUS docker_host self-contradiction** (O1) — quick, but it's the ground-truth doc.
3. **ADR-024 internal consistency** (O22) — the role went vanilla/HTTP-01 but the ADR
   body still mandates the custom image; reconcile §2/§3/Consequences with its own Status.
4. **dev_env tag-isolation** (O8) — the one real conformance bug with runtime impact;
   mirror base's `apply: tags:` guard.
5. **First service-role doc quartet** (O12) — reverse_proxy is the template for every
   future service role; getting SECURITY/VERIFY/ACCESS/BACKUP.md right now pays forward.

## Follow-up prompt
|
||||||
|
|
||||||
|
> Work the open findings from `docs/reviews/2026-06-14-review.md`. Priority order:
|
||||||
|
> (1) **O1** — fix the STATUS.md docker_host contradiction (it's built+applied, not a
|
||||||
|
> no-op; reword the "Scaffolded but empty" row + the 45-48 paragraph).
|
||||||
|
> (2) **O3 + O22** — finish the Traefik→Caddy rename in ADR-008:70, ADR-017:27,88,
|
||||||
|
> ADR-019:52, and reconcile ADR-024's body (§2 custom image, §3 NetBird, Consequences)
|
||||||
|
> with its own revised HTTP-01 Status note.
|
||||||
|
> (3) **O2 + O5** — repoint ADR-004's "backup not in scope" line at ADR-022 and add
|
||||||
|
> ACCESS.md + BACKUP.md rows to its service-role file table.
|
||||||
|
> (4) **O8** — add `apply: tags: [users, config]` to dev_env's per_user.yml include and
|
||||||
|
> tag the `dev_env__home` set_fact `always`; add a Molecule assertion that a partial
|
||||||
|
> `--tags config` run still resolves the home dir.
|
||||||
|
> (5) **O12** — author the four service-role doc files for `roles/reverse_proxy/` from the
|
||||||
|
> templates (BACKUP.md = `backup__state: false`, re-issuable certs).
|
||||||
|
> (6) **O4** — restructure ADR-016/017/018 to lead with `## Status`, or correct ADR-023 §6.
|
||||||
|
> Then the medium drift items (O6 upgrade caveat, O7 nvim/tmux carve-out, O9 hosts.yml
|
||||||
|
> header, O15 new-host Part E, O18 internal-zone naming). Run `make lint` after each
|
||||||
|
> batch; commit per CLAUDE.md git conventions.
|
||||||
|
|
@@ -1,23 +1,157 @@
|
||||||
# Latest repo review
|
# Repo review — 2026-06-14
|
||||||
|
|
||||||
Most recent: **2026-05-30** → full report: `docs/reviews/2026-05-30-review.md`
|
- **Reviewed commit:** `e346137` (docs(plan): M4b — NetBird coordinator service role)
|
||||||
|
- **Mode:** on-demand (interactive — auto-fixes applied + committed)
|
||||||
|
- **Previous run:** 2026-06-11 (`67f2aba`)
|
||||||
|
- **`make lint`:** green before and after fixes (260 files, profile production; check-tags OK).
|
||||||
|
|
||||||
| | high | medium | low | total |
|
## Summary
|
||||||
|
|
||||||
|
A lot shipped since the last review (M4a: `docker_host` Docker engine, `reverse_proxy`
|
||||||
|
Caddy applied to askari; offsite Terraform env live; ADR-024). Most findings this run are
|
||||||
|
the predictable **docs-lagging-the-build** kind — stale "not built yet" notes, a
|
||||||
|
reverse-proxy that switched from DNS-01/custom-image to vanilla HTTP-01 leaving stale
|
||||||
|
descriptions behind, and the **Traefik→Caddy** rename only half-propagated through the
|
||||||
|
ADR set. The previous run's blocker (O1, `make lint` RED) is **resolved**.
|
||||||
|
|
||||||
|
### Counts
|
||||||
|
|
||||||
|
| Dimension | High | Medium | Low | Total |
|
||||||
|---|---|---|---|---|
|
|---|---|---|---|---|
|
||||||
| Auto-fixed | 2 | 3 | 2 | 7 |
|
| Cruft / staleness | 0 | 0 | 0 | 0 |
|
||||||
| Open | 4 | 4 | 9 | 17 |
|
| Design conformance | 1 | 2 | 2 | 5 |
|
||||||
|
| Consistency & intent | 2 | 2 | 9 | 13 |
|
||||||
|
| Docs-vs-reality drift | 1 | 4 | 5 | 10 |
|
||||||
|
| **Open total** | **4** | **8** | **16** | **29** |
|
||||||
|
|
||||||
Dominant theme: drift from this session's own changes — residual `.vault_pass`
|
Plus **11 auto-fixes applied** (3 high, 5 medium, 3 low).
|
||||||
references after the Vaultwarden/rbw switch, and leftover PR/merge-request language
|
|
||||||
after going trunk-based.
|
|
||||||
|
|
||||||
## Suggested follow-up prompt
|
### Phase-0 scan
|
||||||
|
|
||||||
> Remediate the boma 2026-05-30 review (`docs/reviews/2026-05-30-review.md`):
|
`repo-scan.py`: 5 roles, 25 ADRs · broken-adr-ref=4, broken-path-ref=2, marker=14,
|
||||||
> 1. Purge the residual `.vault_pass` references R1–R5 → the rbw/Vaultwarden flow.
|
open-deferred-item=5, **stale-deferred=0**. Every scan finding is a known false-positive
|
||||||
> 2. Decide the workflow model R6–R7 — I lean "keep deploy approval gates, drop the
|
(test fixtures ADR-099/100; the `roles/netbird/` references in the M4b *plan* for unbuilt
|
||||||
> PR/merge-request framing"; reconcile ADR-003/008 and CLAUDE.md to match.
|
work; superpowers planning artifacts; `019-tagging.md:14` is prose about "over-tagging",
|
||||||
> 3. Resolve R8 — scaffold `base`/`docker_host` via `make new-role`, or correct
|
not a TODO). Details in the findings JSON.
|
||||||
> STATUS.md/roles/README.md to say the roles don't exist yet.
|
|
||||||
> 4. Fix the Terraform `vlan_tag` wiring (R9).
|
### Deferral checklist
|
||||||
> Report on the rest.
|
|
||||||
|
All 5 ADR-011 "Open questions" (Proxmox snapshot driver, exact cadences, health-check
|
||||||
|
harness home, classification home, staging-first) confirmed **genuinely still open** —
|
||||||
|
ADR-011 is still Proposed/unbuilt, the same questions sit open in `docs/TODO.md` item 16,
|
||||||
|
and no later ADR or STATUS decides any of them. **No stale-deferred** (same as last run).
|
||||||
|
|
||||||
|
## Auto-fixes applied
|
||||||
|
|
||||||
|
All safe/obvious (stale text contradicting code/reality, partial enumerations, broken
|
||||||
|
descriptions) — no logic, variable, secret, or task-order changes.
|
||||||
|
|
||||||
|
| ID | Sev | File | What |
|
||||||
|
|---|---|---|---|
|
||||||
|
| AF1 | high | `roles/reverse_proxy/meta/main.yml` | description still said DNS-01 + custom on-host image → rewrote to vanilla Caddy + HTTP-01 (matches the role since b7e919d) |
|
||||||
|
| AF2 | med | `roles/README.md` | base hardening + docker_host/reverse_proxy/public_dns build-state was stale → reconciled with STATUS |
|
||||||
|
| AF3 | med | `playbooks/README.md` | stale "docker_host has no tasks" note; added missing `dns.yml` + `offsite.yml` bullets |
|
||||||
|
| AF4 | low | `roles/public_dns/README.md` | "askari in M4" → askari + `*.askari` records applied in M4a |
|
||||||
|
| AF5 | low | `scripts/README.md` | added the missing `check-tags.py` entry (run by `make lint`) |
|
||||||
|
| AF6 | med | `terraform/README.md` | added `modules/hetzner_vm` + `environments/offsite` (the one applied env) |
|
||||||
|
| AF7 | low | `terraform/environments/offsite/providers.tf` | verified-stamp `cax11@hel1` → `cx23@hel1` (actual server) |
|
||||||
|
| AF8 | low | `terraform/modules/hetzner_vm/variables.tf` | `server_type` example `cax11 (ARM)` → `cx23 (x86) or cax11 (ARM)` |
|
||||||
|
| AF9 | med | `inventories/production/group_vars/all/public_dns.yml` | wildcard comment "cert via DNS-01" → ACME HTTP-01 (M4a) |
|
||||||
|
| AF10 | high | `docs/CAPABILITIES.md` | reverse-proxy candidate `Traefik` → `Caddy (ADR-024)`; public DNS "apply pending" → "applied (M1)" |
|
||||||
|
| AF11 | low | `README.md` | Documentation ADR list extended ADR-017 → ADR-024 |
|
||||||
|
|
||||||
|
## Open findings (prioritised)
|
||||||
|
|
||||||
|
### High
|
||||||
|
|
||||||
|
- **O1 — drift — STATUS.md:41 (+45-48) ↔ 33-34** *(new)*: docker_host still appears in
|
||||||
|
the "Scaffolded but empty — NOT implemented" table as a no-op, contradicting its own
|
||||||
|
"Built + applied" rows and the real tasks file. Reword the scaffold row + closing
|
||||||
|
paragraph (left for the operator — STATUS is the ground-truth doc).
|
||||||
|
- **O2 — consistency — ADR-004:105,131 ↔ ADR-022** *(recurring)*: ADR-004 says backup is
|
||||||
|
"not in scope of this repo"; ADR-022 defines a full in-repo backup doctrine. Repoint
|
||||||
|
ADR-004 at ADR-022 (ADR↔ADR design decision — report).
|
||||||
|
- **O3 — consistency — ADR-024 Consequences ↔ ADR-008:70/017:27,88/019:52** *(new)*:
|
||||||
|
ADR-024 claims it updated ADR-017's Traefik prose to Caddy; it didn't, and ADR-008/019
|
||||||
|
still say Traefik too. Either finish the rename or soften ADR-024's claim.
|
||||||
|
- **O4 — conformance — ADR-023:7-8,77-80 ↔ ADR-016/017/018** *(recurring)*: ADR-023
|
||||||
|
claims ADRs 001–018 were restructured to lead with `## Status`, but 016/017/018 still
|
||||||
|
open with `## Context` and bury Status. Fix the three ADRs or correct ADR-023 §6.
|
||||||
|
|
||||||
|
### Medium
|
||||||
|
|
||||||
|
- **O5 — ADR-004:48-50** *(recurring)*: service-role file table omits ACCESS.md +
|
||||||
|
BACKUP.md rows (now mandated by CLAUDE.md/ADR-021/022).
|
||||||
|
- **O6 — ADR-002:82** *(recurring)*: `make deploy PLAYBOOK=upgrade` cited as real, but no
|
||||||
|
`upgrade.yml` exists and ADR-011 is unbuilt — needs a `(planned)` caveat.
|
||||||
|
- **O7 — CAPABILITIES:150-155 ↔ STATUS:29** *(recurring)*: nvim/tmux listed as a
|
||||||
|
"confirmed exclusion" while `dev_env` installs them on ubongo; needs a control-host
|
||||||
|
carve-out (not a token swap, so left from AF10).
|
||||||
|
- **O8 — dev_env tasks (include_tasks + per_user.yml:4-9)** *(recurring)*: untagged
|
||||||
|
`set_fact dev_env__home` preflight + include without `apply: tags:`; a partial
|
||||||
|
`--tags users|config` run breaks (base guards this; dev_env doesn't).
|
||||||
|
- **O9 — inventories/production/hosts.yml** *(recurring)*: header claims TF-generated but
|
||||||
|
it's hand-maintained (carries ubongo, omits offsite_hosts); `tf-inventory` would drop
|
||||||
|
ubongo. Make the header honest.
|
||||||
|
- **O10 — group_vars/all/vars.yml:42 ↔ ADR-007** *(recurring)*: ubongo `10.20.10.151` is
|
||||||
|
in no ADR-007 subnet and undocumented; `base__firewall_control_addr` depends on it.
|
||||||
|
- **O11 — terraform tfvars.example (both envs)** *(recurring)*: `pve01` vs ADR-007's
|
||||||
|
`pve0`; verify the real node name before changing.
|
||||||
|
- **O12 — roles/reverse_proxy/** *(new)*: first built+applied service role, but missing
|
||||||
|
SECURITY/VERIFY/ACCESS/BACKUP.md. (Recorded judgement: public_dns is exempt — control-
|
||||||
|
node external-API role, not a host service.)
|
||||||
|
- **O15 — runbooks/new-host.md Part E** *(recurring)*: still describes an `ansible` user
|
||||||
|
on ubongo; STATUS says ubongo is managed as `sjat` (ansible-user bootstrap pending).
|
||||||
|
- **O18 — ADR-007/009/016 internal-zone name** *(new)*: `boma.baobab.band` vs target
|
||||||
|
`boma.wingu.me` used inconsistently across the doc set after M1; state the transition
|
||||||
|
in one place.
|
||||||
|
|
||||||
|
### Low
|
||||||
|
|
||||||
|
O13 (See-also vs `## Related` in ADR-012/013/015/016/017/018 — recurring), O14
|
||||||
|
(docs/README + inventories/README narrow enumerations — recurring), O16 (.zshrc rclone
|
||||||
|
alias + unguarded direnv hook — recurring), O17 (oh_my_posh zen.toml tasks missing
|
||||||
|
`config` tag — recurring), O19 (ADR-009:122 `nyumbani` example after retirement —
|
||||||
|
recurring), O20 (ROADMAP M2 CAX11/ARM vs cx23/x86 — new), O21 (ADR-020 "ports will be
|
||||||
|
added in M4" stale; already opened in M4a — new), O22 (ADR-024 body still asserts custom-
|
||||||
|
image obligation contradicting its revised Status — new), O23 (`netbird_coordinator` vs
|
||||||
|
`netbird` role name across ADRs/ROADMAP/plan — new), O24 (`*.boma.<domain>` vs
|
||||||
|
`*.<boma-domain>` wildcard scope ADR-024 vs ROADMAP — new), O25 (`tags: [verify]` out of
|
||||||
|
the ADR-019 vocabulary in molecule verify — new), O26 (reverse_proxy templates lack
|
||||||
|
`ansible_managed` header — new), O27 (reverse_proxy vars in `group_vars/all/` not
|
||||||
|
`offsite_hosts/` — new), O28 (capacity-scan.py ignores `offsite.yml` — new), O29
|
||||||
|
(offsite.yml duplicates empty groups from hosts.yml, undocumented merge — new).
|
||||||
|
|
||||||
|
Full detail + suggested fixes in `2026-06-14-findings.json`.
|
||||||
|
|
||||||
|
## Themes worth a deliberate pass
|
||||||
|
|
||||||
|
1. **Finish the Traefik→Caddy rename** (O3, and ADR-024 over-claimed it was done). One
|
||||||
|
sweep across ADR-008/017/019 closes it.
|
||||||
|
2. **STATUS docker_host self-contradiction** (O1) — quick, but it's the ground-truth doc.
|
||||||
|
3. **ADR-024 internal consistency** (O22) — the role went vanilla/HTTP-01 but the ADR
|
||||||
|
body still mandates the custom image; reconcile §2/§3/Consequences with its own Status.
|
||||||
|
4. **dev_env tag-isolation** (O8) — the one real conformance bug with runtime impact;
|
||||||
|
mirror base's `apply: tags:` guard.
|
||||||
|
5. **First service-role doc quartet** (O12) — reverse_proxy is the template for every
|
||||||
|
future service role; getting SECURITY/VERIFY/ACCESS/BACKUP.md right now pays forward.
|
||||||
|
|
||||||
|
## Follow-up prompt
|
||||||
|
|
||||||
|
> Work the open findings from `docs/reviews/2026-06-14-review.md`. Priority order:
|
||||||
|
> (1) **O1** — fix the STATUS.md docker_host contradiction (it's built+applied, not a
|
||||||
|
> no-op; reword the "Scaffolded but empty" row + the 45-48 paragraph).
|
||||||
|
> (2) **O3 + O22** — finish the Traefik→Caddy rename in ADR-008:70, ADR-017:27,88,
|
||||||
|
> ADR-019:52, and reconcile ADR-024's body (§2 custom image, §3 NetBird, Consequences)
|
||||||
|
> with its own revised HTTP-01 Status note.
|
||||||
|
> (3) **O2 + O5** — repoint ADR-004's "backup not in scope" line at ADR-022 and add
|
||||||
|
> ACCESS.md + BACKUP.md rows to its service-role file table.
|
||||||
|
> (4) **O8** — add `apply: tags: [users, config]` to dev_env's per_user.yml include and
|
||||||
|
> tag the `dev_env__home` set_fact `always`; add a Molecule assertion that a partial
|
||||||
|
> `--tags config` run still resolves the home dir.
|
||||||
|
> (5) **O12** — author the four service-role doc files for `roles/reverse_proxy/` from the
|
||||||
|
> templates (BACKUP.md = `backup__state: false`, re-issuable certs).
|
||||||
|
> (6) **O4** — restructure ADR-016/017/018 to lead with `## Status`, or correct ADR-023 §6.
|
||||||
|
> Then the medium drift items (O6 upgrade caveat, O7 nvim/tmux carve-out, O9 hosts.yml
|
||||||
|
> header, O15 new-host Part E, O18 internal-zone naming). Run `make lint` after each
|
||||||
|
> batch; commit per CLAUDE.md git conventions.
|
||||||
|
|
|
||||||
|
|
@@ -50,6 +50,13 @@ Don't install these until their trigger lands — then add them here and to
|
||||||
- **The venv-activate hook** — this repo expects the Python `.venv` active for Bash
|
- **The venv-activate hook** — this repo expects the Python `.venv` active for Bash
|
||||||
commands. If you use the user-level `~/.claude/hooks/activate-venv.sh` pattern,
|
commands. If you use the user-level `~/.claude/hooks/activate-venv.sh` pattern,
|
||||||
replicate it; otherwise `source .venv/bin/activate` per session after `make setup`.
|
replicate it; otherwise `source .venv/bin/activate` per session after `make setup`.
|
||||||
|
- **Forgejo registry login (for image pushes)** — `make caddy-image-push` /
|
||||||
|
`molecule-image-push` need the Docker daemon authenticated to
|
||||||
|
`forgejo.nyumbani.baobab.band`. Run **`make registry-login`** once per machine: it reads
|
||||||
|
`vault.forgejo.registry_token` from the vault and does `docker login --password-stdin`
|
||||||
|
(no interactive prompt, so an agent can complete a push). The token is operator-minted
|
||||||
|
(Forgejo → Settings → Applications → Generate Token, package read+write) and set via
|
||||||
|
`make edit-vault`; until then `registry-login` prints how to obtain it. (2026-06-17 kaizen.)
|
||||||
|
|
||||||
## 4. A note on user-level settings
|
## 4. A note on user-level settings
|
||||||
|
|
||||||
|
|
@@ -58,6 +65,23 @@ The dangerous-mode permission prompt (`skipDangerousModePermissionPrompt`) is a
|
||||||
"operator/agent error" threat, prefer leaving that prompt **on** unless you
|
"operator/agent error" threat, prefer leaving that prompt **on** unless you
|
||||||
deliberately rely on bypass mode.
|
deliberately rely on bypass mode.
|
||||||
|
|
||||||
|
## Environment gotchas
|
||||||
|
|
||||||
|
Migrated from `docs/FRICTION.md` by the 2026-06-10 kaizen review — surprises that bite
|
||||||
|
on this kind of host/toolchain:
|
||||||
|
|
||||||
|
- **Hooks (and any new `.claude/settings.json`) added mid-session don't activate until a
|
||||||
|
Claude Code restart.** The settings watcher only tracks settings files that existed at
|
||||||
|
session start; opening `/hooks` and dismissing does *not* load them. Fresh sessions
|
||||||
|
load them normally — restart after adding a hook.
|
||||||
|
- **pre-commit stashes *unstaged* changes before running hooks**, so a partial commit of
|
||||||
|
interdependent files can revert one and fail (e.g. an `ansible.cfg` change left
|
||||||
|
unstaged). Commit interdependent changes together, or stage the config change first.
|
||||||
|
- **`rbw sync` is required after adding a Vaultwarden item before `rbw get` finds it**
|
||||||
|
(the local cache is stale otherwise).
|
||||||
|
- **This shell is zsh** — unquoted `$VAR` does *not* word-split, so a variable holding a
|
||||||
|
file list is passed as a single argument. Use explicit args/arrays.
|
||||||
|
|
||||||
## Verifying
|
## Verifying
|
||||||
|
|
||||||
After setup, a quick check: the project commands (`/review-repo`, `/capacity-review`,
|
After setup, a quick check: the project commands (`/review-repo`, `/capacity-review`,
|
||||||
|
|
|
||||||
229
docs/runbooks/integration-testing.md
Normal file
|
|
@@ -0,0 +1,229 @@
|
||||||
|
# Runbook — Local VM integration testing
|
||||||
|
|
||||||
|
## When to use this
|
||||||
|
|
||||||
|
Run a local VM integration test before deploying any change that touches:
|
||||||
|
|
||||||
|
- **nftables / firewall rules** (the `firewall` concern of `base`)
|
||||||
|
- **sshd configuration** (listener address, port, key types, `base` hardening)
|
||||||
|
- **boot ordering or kernel parameters** (systemd units, sysctl)
|
||||||
|
- **Docker host networking** (`docker_host` DNAT rules, published-port forwarding, `daemon.json`)
|
||||||
|
|
||||||
|
These are the change classes that Molecule (ADR-008 Level 1) cannot catch: they require
|
||||||
|
a real kernel reboot to surface. This harness is the concrete tool for ADR-008 Level 2/3
|
||||||
|
(see ADR-025) and directly operationalises two standing rules:
|
||||||
|
|
||||||
|
- **"Test risky infra before live deploy"** (standing rule, ubongo memory) — firewall/sshd/boot changes must be tested on a real VM with a real reboot before touching a live host.
|
||||||
|
- **FRICTION 2026-06-17 #6 — validate reboot-recovery before retiring the break-glass** — the lesson crystallised from the mesh-hardening incident: confirm the host recovers from reboot *while you still have the break-glass open*, not after.
|
||||||
|
|
||||||
|
You do not need this runbook for pure-config changes (template rendering, package lists, user management) — Molecule covers those.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## First-deploy (one-time setup)
|
||||||
|
|
||||||
|
The `integration_test` role installs libvirt + QEMU + virtinst on ubongo and adds the
|
||||||
|
operator accounts (`sjat`, `claude`) to the `libvirt` and `kvm` groups.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
make deploy PLAYBOOK=site LIMIT=ubongo TAGS=integration_test
|
||||||
|
```
|
||||||
|
|
||||||
|
**Re-login after this run** — group membership changes do not take effect in the current
|
||||||
|
session. The driver (`scripts/integration-vm.py`) requires both `libvirt` and `kvm`
|
||||||
|
group membership to create and manage VMs.
|
||||||
|
|
||||||
|
The golden Debian-13 genericcloud qcow2 image is downloaded lazily on the first run
|
||||||
|
(one-time cost, ~500 MB); subsequent runs reuse the cached image.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Running a cycle
|
||||||
|
|
||||||
|
### Makefile interface (recommended)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Full cycle (provision → apply → reboot → assert → teardown on pass)
|
||||||
|
make test-integration HOST=askari
|
||||||
|
|
||||||
|
# With a specific cert tier
|
||||||
|
make test-integration HOST=askari CERTS=le-staging
|
||||||
|
|
||||||
|
# Keep the VM alive after the run (for manual inspection)
|
||||||
|
make test-integration HOST=askari KEEP=1
|
||||||
|
|
||||||
|
# Destroy all orphan integration VMs (name-prefix boma-it-*)
|
||||||
|
make test-integration-clean
|
||||||
|
```
|
||||||
|
|
||||||
|
`HOST` is a hostname from the production inventory (the profile
`tests/integration/profiles/<host>.json` must exist; see Adding a new profile below).
`CERTS` defaults to `internal`.
|
||||||
|
|
||||||
|
### Lower-level driver
|
||||||
|
|
||||||
|
The driver (`scripts/integration-vm.py`) exposes individual lifecycle steps for manual
|
||||||
|
or scripted use:
|
||||||
|
|
||||||
|
| Sub-command | What it does |
|
||||||
|
|---|---|
|
||||||
|
| `up` | Ensure golden image → create ephemeral overlay → cloud-init seed → boot |
|
||||||
|
| `apply` | Run the site playbook against the transient inventory (real apply) |
|
||||||
|
| `reboot` | `virsh reboot` + wait for a verified reboot (boot-id change) — the step Molecule cannot do |
|
||||||
|
| `assert` | Run `tests/integration/verify.yml` (outcome assertions) |
|
||||||
|
| `cycle` | `up` → `apply` → `reboot` → `assert` → `down` (default: destroy on pass) |
|
||||||
|
| `down` | Destroy the VM + overlay |
|
||||||
|
| `prune` | Destroy all `boma-it-*` VMs + overlays (orphan cleanup) |
|
||||||
|
| `console` | Print the VM's captured serial-console log |
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Example: step through manually
|
||||||
|
python3 scripts/integration-vm.py up --host askari
|
||||||
|
python3 scripts/integration-vm.py apply --host askari
|
||||||
|
python3 scripts/integration-vm.py reboot --host askari
|
||||||
|
python3 scripts/integration-vm.py assert --host askari
|
||||||
|
python3 scripts/integration-vm.py down --host askari
|
||||||
|
```
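
The `reboot` step's "verified reboot (boot-id change)" check can be sketched in Python. This is a minimal sketch under stated assumptions, not the driver's actual code: `wait_for_reboot` and `read_boot_id` are hypothetical names, and the boot ID is assumed to be read from the guest's `/proc/sys/kernel/random/boot_id`, which the Linux kernel regenerates on every boot.

```python
import time
from typing import Callable, Optional


def wait_for_reboot(
    read_boot_id: Callable[[], Optional[str]],
    before: str,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the guest reports a boot ID different from `before`.

    `read_boot_id` is a hypothetical callable returning the contents of
    /proc/sys/kernel/random/boot_id from the guest, or None while the
    guest is unreachable (for example, SSH is still down mid-reboot).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = read_boot_id()
        # None means "not reachable yet"; an unchanged ID means the guest
        # answered before it actually went down, so keep waiting.
        if current is not None and current.strip() != before.strip():
            return current.strip()
        sleep(interval_s)
    raise TimeoutError("guest did not come back with a new boot ID")
```

Comparing boot IDs, rather than only waiting for SSH to answer, avoids a false pass when the guest responds once more before it actually goes down.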
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Cert tiers
|
||||||
|
|
||||||
|
| Tier | Flag | Use when |
|
||||||
|
|---|---|---|
|
||||||
|
| `internal` | `CERTS=internal` (default) | Incident repro, firewall/sshd/boot changes where certs are not under test. Zero deps, instant. |
|
||||||
|
| `le-staging` | `CERTS=le-staging` | Testing the Caddy DNS-01 ACME path, cert renewal logic, or the `caddy-gandi` plugin. Real cert files, untrusted root, effectively no rate limits. Requires `vault.gandi.pat`. |
|
||||||
|
| `le-prod-wildcard` | `CERTS=le-prod-wildcard` | Verifying TLS behaviour with a real trusted cert. On-demand only — accepted risk R6 (`docs/security/accepted-risks.md`): the production Gandi PAT reaches an ephemeral VM and transient TXT records are written into the real `wingu.me` zone. |
|
||||||
|
|
||||||
|
> A deliberate "no-egress" scenario (reproducing FRICTION 2026-06-17 #4 — the
|
||||||
|
> `netbird-server` GeoLite2 FATAL-loop when NAT masquerade is wiped) **must** use
|
||||||
|
> `CERTS=internal`: the egress loss is the fault being simulated, and ACME requires egress.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Diagnostics and inspecting a failed VM
|
||||||
|
|
||||||
|
### Where diagnostics land
|
||||||
|
|
||||||
|
Diagnostics from every run are captured in:
|
||||||
|
|
||||||
|
```
|
||||||
|
~/integration-runs/<timestamp>-<host>/
|
||||||
|
```
|
||||||
|
|
||||||
|
This directory is gitignored. On a failed assert step, the driver dumps:
|
||||||
|
|
||||||
|
- `nft list ruleset` — the live nftables state at failure
|
||||||
|
- `docker ps -a` — container states
|
||||||
|
- `ss -tlnp` — listening sockets
|
||||||
|
- `journalctl -b` — full boot log
|
||||||
|
- `systemd-analyze critical-chain` — boot timing
|
||||||
|
- Serial console capture (on boot/SSH failure — the automated equivalent of the Hetzner
|
||||||
|
console, addressing FRICTION 2026-06-17 #5)
|
||||||
|
|
||||||
|
The agent reads these directly from `~/integration-runs/` — no manual download needed.
|
||||||
|
|
||||||
|
### Inspecting a kept or failed VM
|
||||||
|
|
||||||
|
When a run fails or when `KEEP=1` is passed, the VM is left running. Connect to it:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Serial console (no SSH needed — useful when SSH is the fault)
|
||||||
|
python3 scripts/integration-vm.py console --host askari
|
||||||
|
# or directly:
|
||||||
|
virsh console boma-it-askari
|
||||||
|
# Exit with Ctrl-]
|
||||||
|
|
||||||
|
# SSH (as the ansible user, IP from virsh)
|
||||||
|
virsh domifaddr boma-it-askari --source lease
|
||||||
|
ssh ansible@<IP>
|
||||||
|
|
||||||
|
# List all integration VMs
|
||||||
|
virsh list --all | grep boma-it-
|
||||||
|
```
|
||||||
|
|
||||||
|
### Cleanup
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Destroy a specific VM
|
||||||
|
python3 scripts/integration-vm.py down --host askari
|
||||||
|
|
||||||
|
# Reap all orphans
|
||||||
|
make test-integration-clean
|
||||||
|
# or:
|
||||||
|
python3 scripts/integration-vm.py prune
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Safety invariants
|
||||||
|
|
||||||
|
These make the test tool itself safe — the harness cannot reach or modify production:
|
||||||
|
|
||||||
|
1. **Single-host transient inventory** — the playbook apply runs against a generated
|
||||||
|
single-host inventory (`ansible_host=<VM lease IP>`). No real host is ever in scope.
|
||||||
|
2. **In-VM coordinator only** — "be askari" points NetBird at the coordinator running
|
||||||
|
inside the VM itself (localhost endpoint). The VM forms its own one-node mesh; it
|
||||||
|
never enrols in the real NetBird mesh.
|
||||||
|
3. **Isolated NAT network** — test VMs sit on a dedicated libvirt NAT network.
|
||||||
|
Outbound NAT provides ACME/image-pull access, but the VM is not reachable from
|
||||||
|
the LAN (`10.20.x`) or the real mesh.
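
Invariants 1 and 3 above could be enforced together along these lines. This is a hedged sketch only: `render_inventory` and its `nat_cidr` parameter are hypothetical, and the CIDR in the usage note is a placeholder, not the harness's real network.

```python
import ipaddress


def render_inventory(host: str, lease_ip: str, nat_cidr: str) -> str:
    """Render a one-host INI inventory for the test VM.

    Refuses any address outside the dedicated NAT network, so a real
    host's address can never end up in the generated inventory.
    """
    ip = ipaddress.ip_address(lease_ip)   # ValueError on a malformed address
    net = ipaddress.ip_network(nat_cidr)  # assumption: the libvirt NAT range
    if ip not in net:
        raise ValueError(f"{ip} is outside the test network {net}")
    # Exactly one host line: nothing else is in scope for the apply.
    return f"{host} ansible_host={ip}\n"
```

For example, `render_inventory("askari", "192.168.150.23", "192.168.150.0/24")` yields a single host line, while a LAN address such as `10.20.10.151` is rejected with `ValueError`.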
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Resource constraints
|
||||||
|
|
||||||
|
The default VM profile is ~2 vCPU / 3 GiB RAM / 20 GiB thin-provisioned overlay. The
|
||||||
|
driver enforces **one integration VM at a time** (refusing to start if another
|
||||||
|
`boma-it-*` VM is already running) and refuses to start below the free-RAM threshold
|
||||||
|
(~13 GiB available on ubongo at baseline, per ADR-025).
|
||||||
|
|
||||||
|
**Do not run a test-integration cycle alongside a Level-4 browser session**
|
||||||
|
(Chromium/Playwright, ADR-017) — both compete for ubongo RAM. The resource guard is the
|
||||||
|
enforcement mechanism, not a suggestion.
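
A resource guard of this kind might look like the sketch below. The function names are hypothetical and the threshold is deliberately a required argument, because the exact value the driver uses is not stated here. `MemAvailable` is the standard `/proc/meminfo` field, reported in kB.

```python
def mem_available_gib(meminfo_text: str) -> float:
    """Return MemAvailable from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # second column is the size in kB
            return kib / (1024 * 1024)
    raise ValueError("MemAvailable not found")


def guard(running_domains, meminfo_text, min_free_gib, prefix="boma-it-"):
    """Return a list of reasons to refuse; an empty list means safe to start."""
    reasons = []
    # Rule 1: only one integration VM at a time.
    busy = [d for d in running_domains if d.startswith(prefix)]
    if busy:
        reasons.append("integration VM already running: " + ", ".join(busy))
    # Rule 2: refuse to start below the free-RAM threshold.
    free = mem_available_gib(meminfo_text)
    if free < min_free_gib:
        reasons.append(f"only {free:.1f} GiB available, need {min_free_gib}")
    return reasons
```

Returning reasons instead of a bare boolean lets the caller print why a run was refused, which matters when the refusal is caused by a forgotten `KEEP=1` VM.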
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Adding a new profile
|
||||||
|
|
||||||
|
To make the harness "be" a different host:
|
||||||
|
|
||||||
|
1. Create `tests/integration/profiles/<hostname>.json` — specifies which roles to apply
|
||||||
|
and base VM sizing for that host.
|
||||||
|
2. Create `tests/integration/overrides/<hostname>.yml` — the explicit stub overlay:
|
||||||
|
cert tier, in-VM coordinator endpoint (if the host runs the coordinator),
|
||||||
|
`ansible_host` placeholder, and any other variables that must differ from the real
|
||||||
|
inventory (e.g. public DNS → local resolution, geo-DB disable for coordinator).
|
||||||
|
3. Add assertions to `tests/integration/verify.yml` (or extend an existing task with a
|
||||||
|
`when: inventory_hostname == '<hostname>'` guard) for any host-specific outcomes.
|
||||||
|
4. Run `make test-integration HOST=<hostname>` to validate the new profile.
|
||||||
|
|
||||||
|
All stubs must be explicit in the overlay — the real inventory is never edited.
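
Before step 4, a quick pre-flight can confirm that the files from steps 1 and 2 exist. This is a small sketch: `missing_profile_files` is a hypothetical helper, and only the two paths named above are taken from this runbook.

```python
from pathlib import Path


def missing_profile_files(repo_root: str, host: str) -> list[str]:
    """List the profile/override files for `host` that do not exist yet."""
    root = Path(repo_root)
    wanted = [
        root / "tests" / "integration" / "profiles" / f"{host}.json",
        root / "tests" / "integration" / "overrides" / f"{host}.yml",
    ]
    # Report repo-relative paths so the output reads the same on any machine.
    return [p.relative_to(root).as_posix() for p in wanted if not p.is_file()]
```

An empty list means both files are in place; anything else names what is still to be created.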
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Reproducing the 2026-06-17 incident
|
||||||
|
|
||||||
|
The acceptance test for the harness (ADR-025) deliberately reproduces the incident:
|
||||||
|
|
||||||
|
1. Run with today's `base` (firewall on, no `docker_host` container-forward drop-in):
|
||||||
|
```bash
|
||||||
|
make test-integration HOST=askari CERTS=internal
|
||||||
|
```
|
||||||
|
The assert step **must FAIL** after reboot (Docker forwarding dead, published ports
|
||||||
|
unreachable). If it passes, the harness is not faithful.
|
||||||
|
|
||||||
|
2. Implement the `docker_host` container-forward rules (FRICTION 2026-06-17 #1 fix) and
|
||||||
|
re-run. The assert step **must PASS** across the reboot.
|
||||||
|
|
||||||
|
This round-trip proves: (a) the harness faithfully reproduces the incident, and (b) the
|
||||||
|
fix survives a real reboot.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
- ADR-025 — decision record for this harness (approach, cert tiers, safety invariants)
|
||||||
|
- ADR-008 — testing methodology; this is Level 2/3
|
||||||
|
- `docs/security/accepted-risks.md` R6 — `le-prod-wildcard` accepted risk
|
||||||
|
- `docs/FRICTION.md` — 2026-06-17 signals that motivated this runbook
|
||||||
144
docs/runbooks/netbird-client.md
Normal file
|
|
@@ -0,0 +1,144 @@
|
||||||
|
# Runbook — Enrolling a NetBird client (road-warrior device)
|
||||||
|
|
||||||
|
Joins a **client/road-warrior device** (laptop, desktop, phone) to the boma NetBird mesh
|
||||||
|
so it can reach `ubongo` and other peers from anywhere. The self-hosted coordinator is on
|
||||||
|
`askari` (ADR-016, M4b); enrollment lands a device on the `100.64.0.0/10` overlay.
|
||||||
|
|
||||||
|
> **Hosts vs clients.** Managed **Linux hosts** join via the `base` role's `mesh` concern
|
||||||
|
> (`base__mesh_enabled: true` + the reusable key in `vault.netbird.setup_key`) — see
|
||||||
|
> ADR-016 / the `base` README, *not* this runbook. This runbook is for **user devices**
|
||||||
|
> that are not managed with Ansible.
|
||||||
|
|
||||||
|
verified: NetBird client install + self-hosted `--management-url` flow · docs.netbird.io
|
||||||
|
(`/get-started/install/windows`, `/get-started/cli`) · 2026-06-17
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
- The coordinator's first-boot `/setup` admin exists and you can log in at
|
||||||
|
`https://netbird.askari.wingu.me`.
|
||||||
|
- **Auth, pick one:**
|
||||||
|
- **SSO** (recommended for a personal device) — your dashboard account; no secret to copy.
|
||||||
|
- **Setup key** — dashboard → **Settings → Setup Keys** → a reusable key (mint a
|
||||||
|
client-specific one for clean ACL grouping, or reuse the existing reusable key).
|
||||||
|
- Local **admin rights** on the device (the client installs a service).
|
||||||
|
- **Coordinator facts:** management URL `https://netbird.askari.wingu.me`; `ubongo`
|
||||||
|
= `100.99.146.14` (`ubongo.netbird.selfhosted`); `askari` = `100.99.226.39`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Part A — Windows 11
|
||||||
|
|
||||||
|
1. **Install:** download + run the MSI **https://pkgs.netbird.io/windows/msi/x64**
|
||||||
|
(official x64 client; installs the tray app + the `netbird` service).
|
||||||
|
2. **Connect** from an **elevated** Windows Terminal / PowerShell ("Run as administrator"):
|
||||||
|
```powershell
|
||||||
|
netbird up --management-url https://netbird.askari.wingu.me
|
||||||
|
```
|
||||||
|
A browser opens — sign in with your dashboard account. (SSO won't open a browser?
|
||||||
|
Use a key: `netbird up --setup-key <KEY> --management-url https://netbird.askari.wingu.me`.)
|
||||||
|
3. Proceed to **Part C** (verify).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Part B — Other platforms (same management URL)

- **macOS / Linux desktop:** install the client (macOS: NetBird app / Homebrew; Linux:
  `pkgs.netbird.io` per the distro — same apt/rpm flow as `base`'s `mesh` concern), then
  `netbird up --management-url https://netbird.askari.wingu.me` (Linux: prefix `sudo`).
- **Android / iOS:** install the **NetBird** app, then in **Settings → Advanced /
  Server** set the management server to `https://netbird.askari.wingu.me` **before**
  logging in; connect and complete the SSO login. (Setup keys are supported in-app too.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Part C — Verify + use

```sh
netbird status     # expect: Management: Connected, Signal: Connected, a 100.x NetBird IP
netbird status -d  # peer detail — ubongo (100.99.146.14) + askari (100.99.226.39) listed
```

Reach `ubongo` over the mesh:

```sh
ssh sjat@100.99.146.14  # or: ssh sjat@ubongo.netbird.selfhosted
```

**SSH auth is separate from the mesh:** `ubongo` is key-only (passwords disabled), so the
device needs an SSH key authorised for `sjat@ubongo`. The mesh provides the network path;
the SSH key provides auth.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting — mesh drops / SSH to `ubongo` times out

Symptom: SSH to `ubongo` (or any peer) times out for minutes and recovers on its own;
`netbird status` shows **Management/Signal: Disconnected** or peers stuck **Connecting**.

verified: client DNS/relay behaviour + NRPT scope read from a 0.72.4 debug bundle;
mitigations per docs.netbird.io (`/manage/dns/troubleshooting`,
`/help/troubleshooting-client`) · 2026-06-18

**1. Triage — is it your device or the coordinator?** On the device:

```sh
netbird status -d                 # Management/Signal Connected? peers P2P/Relayed?
nslookup netbird.askari.wingu.me  # coordinator FQDN
nslookup pkgs.netbird.io          # a PUBLIC name — control test
```

If the relay/handshake errors say `lookup netbird.askari.wingu.me: no such host` **and**
a *public* name (`pkgs.netbird.io`) also fails to resolve, your **local resolver is
dead** — the coordinator and `ubongo` are almost certainly fine. NetBird only manages
`*.netbird.selfhosted` resolution (a single NRPT rule), so it is **not** the cause.
Confirm from the other side if you can: the dashboard shows peer *last-seen*; `askari`/
`ubongo` staying green ⇒ the fault is your device's network.
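
The triage logic above can be sketched as a small script. This is a minimal sketch, not part of the runbook: the function names (`resolves`, `classify`, `triage`) and the verdict strings are hypothetical, it assumes a Linux device with `getent` (on Windows, keep using `nslookup`), and the two mixed cases are not covered by the runbook, so their verdicts are only illustrative.

```sh
#!/bin/sh
# Sketch only: names and verdict strings are illustrative, not from the runbook.

# resolves NAME: exit 0 if the local resolver returns an address for NAME.
# Assumption: getent is available (Linux).
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# classify COORD_OK PUBLIC_OK: print a verdict; each argument is "yes" or "no".
classify() {
    case "$1:$2" in
        yes:yes) echo "dns-ok" ;;                 # DNS is fine; read netbird status -d
        no:no)   echo "local-resolver-dead" ;;    # the case the runbook describes
        no:yes)  echo "coordinator-name-only" ;;  # illustrative: only the coordinator FQDN fails
        yes:no)  echo "public-name-only" ;;       # illustrative: only the control name fails
        *)       echo "usage: classify yes|no yes|no" >&2; return 2 ;;
    esac
}

# triage: run both lookups from the runbook and print the verdict.
triage() {
    coord=no
    public=no
    resolves netbird.askari.wingu.me && coord=yes
    resolves pkgs.netbird.io && public=yes
    classify "$coord" "$public"
}
```

Calling `triage` on the device prints one verdict; `local-resolver-dead` corresponds to the "fix the device's network, not the coordinator" conclusion above.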
|
||||||
|
|
||||||
|
**Why it cascades:** NetBird re-resolves the coordinator FQDN on every reconnect. A
network transition (Wi-Fi ↔ phone hotspot, sleep/wake) that briefly kills DNS means it
can't reach management/signal/relay — and since `ubongo` is **relay-only** (below), there
is no direct path to fall back to, so SSH dies until DNS recovers.

**2. Make the device resilient:**

- **Reliable resolvers** — set the device's DNS to public resolvers (`1.1.1.1`, `8.8.8.8`)
  rather than a network-handed or homelab-internal resolver that's unreachable off-LAN.
  Windows: inspect with `Get-DnsClientServerAddress`.
- **Pin the coordinator** so a DNS hiccup can't strand the client — add to the hosts file
  (`C:\Windows\System32\drivers\etc\hosts` as admin, or `/etc/hosts`):

  ```
  77.42.120.136 netbird.askari.wingu.me
  ```

  `77.42.120.136` is `askari`'s stable WAN IP; TLS still validates against the hostname.
  The pin removes the multi-minute reconnect deadlocks.
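
On Linux or macOS the hosts-file pin can be applied idempotently, so re-running it never duplicates the entry. The following is a minimal sketch under that assumption; `pin_host` is a hypothetical helper name, not something the repo provides, and it must run with write access to the target file (root for `/etc/hosts`).

```sh
#!/bin/sh
# Sketch only: pin_host is a hypothetical helper, not part of the repo.

# pin_host FILE IP NAME: append "IP NAME" to FILE unless NAME is already
# present as a hostname field on a non-comment line.
pin_host() {
    file=$1
    ip=$2
    name=$3
    # awk exits 0 when NAME is found as an exact field after the address.
    if awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file" 2>/dev/null; then
        echo "already pinned: $name"
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
    echo "pinned: $ip $name"
}
```

Run as root, `pin_host /etc/hosts 77.42.120.136 netbird.askari.wingu.me` adds the line once; a second run reports `already pinned` and leaves the file unchanged.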
|
||||||
|
|
||||||
|
**3. Break-glass — reach `ubongo` without the mesh.** When the mesh is down you still need
a way in. On the home LAN, go straight to `ubongo`'s wired address (bypasses the mesh and
coordinator DNS entirely):

```sh
ssh sjat@10.20.10.151  # ubongo eno1 (LAN) — verify this works from your device NOW
```

> ⚠️ This works **today** only because `ubongo`'s host-firewall default-deny is not yet
> applied. When the deferred mesh-hardening lands (SSH only on `wt0`), this path closes
> unless a break-glass SSH rule is added to the firewall catalog. That hardening **must**
> keep a non-mesh break-glass (catalog SSH rule from a trusted LAN/admin source) — else a
> DNS/mesh outage = full lockout. (ADR-021 break-glass.)

**Why `ubongo` is relay-only (and P2P is not the fix).** Peers connect to `ubongo` as
`Relayed`, never `P2P`: its `nftables` default-deny drops the inbound UDP that ICE
hole-punching needs (egress is open, so STUN itself succeeds). This is the **intended
current posture** — P2P / NAT-traversal is the *deferred mesh-hardening* (ADR-016/020,
STATUS.md). Enabling it needs a firewall-catalog UDP entry **plus** an `accepted-risks.md`
deviation or ADR amendment, and OPNsense NAT work — and it would **not** have prevented a
DNS-driven outage (a re-handshake still needs signal, which needs DNS). Tracked as future
hardening, not a quick fix.
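
To confirm the relay-only posture from a client, the peer detail can be summarised by connection type. This is a rough sketch that assumes `netbird status -d` prints one `Connection type: Relayed` or `Connection type: P2P` line per peer; that label is an assumption about the client's output and may differ between versions. `count_conn_types` is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch only: assumes per-peer lines of the form "Connection type: <Relayed|P2P>".

# count_conn_types: read `netbird status -d` text on stdin and print
# "relayed=N p2p=M".
count_conn_types() {
    awk '
        /Connection type:/ {
            if ($NF == "Relayed") relayed++
            else if ($NF == "P2P") p2p++
        }
        END { printf "relayed=%d p2p=%d\n", relayed, p2p }
    '
}
```

Usage: `netbird status -d | count_conn_types`. Under the posture described above, the peer entry for `ubongo` is expected to count as relayed.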
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Notes

- **Split-tunnel:** NetBird routes only the `100.x` overlay by default — normal/work
  networking is unaffected.
- **Persistence:** the service auto-starts on boot and reconnects; the tray app has
  Connect/Disconnect; CLI `netbird down` / `netbird up` (no flags after first setup).
- **Troubleshooting** — *"failed while getting Management Service public key"* / won't
  register: confirm `https://netbird.askari.wingu.me` loads in a browser from the device
  (DNS + TLS + the gRPC routing through Caddy are reachable), the URL is exact, and the
  terminal is elevated. For peers stuck Disconnected/Connecting or SSH-to-`ubongo`
  timeouts that recover on their own, see **Troubleshooting — mesh drops** above.
- **Removing a device:** `netbird down` then uninstall; revoke its peer in the dashboard
  (and the setup key if one-off).
|
||||||
|
|
@ -2,7 +2,8 @@
|
||||||
|
|
||||||
## Prerequisites
|
## Prerequisites
|
||||||
|
|
||||||
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not)
|
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not).
|
||||||
|
Not needed for the control node `ubongo`, which is bare-metal (Part E).
|
||||||
- `rbw` is installed and unlocked (`rbw unlock`) so the vault password resolves from Vaultwarden
|
- `rbw` is installed and unlocked (`rbw unlock`) so the vault password resolves from Vaultwarden
|
||||||
- The host's intended hostname and IP are decided
|
- The host's intended hostname and IP are decided
|
||||||
|
|
||||||
|
|
@ -57,9 +58,9 @@ locals {
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Terraform clones the cloud-init template from Part A, sets the cloud-init values
|
Terraform clones the cloud-init template from Part A and sets the cloud-init values
|
||||||
(hostname, SSH key, IP/gateway), and writes the host's DNS A record. See ADR-009
|
(hostname, SSH key, IP/gateway). It writes no DNS records — the `dns` role owns the
|
||||||
for the full handoff and the `vms` output → inventory data contract.
|
internal zone. See ADR-009 for the full handoff and the `vms` output → inventory data contract.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -67,7 +68,7 @@ for the full handoff and the `vms` output → inventory data contract.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
make tf-plan TF_ENV=production # review — confirm only the new VM is added
|
make tf-plan TF_ENV=production # review — confirm only the new VM is added
|
||||||
make tf-apply TF_ENV=production # create the VM + write its DNS A record
|
make tf-apply TF_ENV=production # create the VM (no DNS records written)
|
||||||
make tf-inventory TF_ENV=production # regenerate inventories/production/hosts.yml
|
make tf-inventory TF_ENV=production # regenerate inventories/production/hosts.yml
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
@ -108,29 +109,47 @@ make check PLAYBOOK=site
|
||||||
# Should report no changes
|
# Should report no changes
|
||||||
```
|
```
|
||||||
|
|
||||||
|
> **Pre-flight before lockout-risky changes (firewall / sshd / boot):** before applying
|
||||||
|
> any change that touches nftables rules, SSH configuration, or boot ordering, run
|
||||||
|
> `make test-integration HOST=<name>` and confirm reboot-recovery on the local VM
|
||||||
|
> **while the break-glass (Proxmox console / Hetzner console) is still open**. Do not
|
||||||
|
> retire the break-glass until the integration test passes. See
|
||||||
|
> `docs/runbooks/integration-testing.md` and ADR-025.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Part E — Control node (manual exception)
|
## Part E — Control node (`ubongo`, manual exception)
|
||||||
|
|
||||||
The control node runs Terraform and Ansible, so it cannot be created by the
|
The control node runs Terraform and Ansible, so it cannot be created by the
|
||||||
Terraform it hosts (chicken-and-egg). It is the **one** host provisioned manually —
|
Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
|
||||||
see ADR-009 and the control-node section of ADR-005. Use the template from Part A:
|
machine outside the cluster — not a Proxmox guest. It is the **one** host
|
||||||
|
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
|
||||||
|
|
||||||
|
> **Current state (STATUS.md):** `ubongo` is today managed as the operator account
|
||||||
|
> `sjat` (`group_vars/control` sets `ansible_user: sjat`); it has **no** dedicated
|
||||||
|
> `ansible` service user yet. The dedicated-`ansible`-user bootstrap (step 2) is a
|
||||||
|
> **pending** item. Steps below describe the intended end state.
|
||||||
|
|
||||||
|
1. Install Debian 13 on the physical box by hand (no template to clone).
|
||||||
|
2. Create the `ansible` user and install its SSH public key. *(Pending for `ubongo` —
|
||||||
|
currently managed as `sjat`; see the note above.)*
|
||||||
|
3. Set up the Ansible environment on it:
|
||||||
```bash
|
```bash
|
||||||
# Clone the template by hand (Proxmox UI or qm clone)
|
git clone <repo> ~/ansible
|
||||||
qm clone 9000 <VMID> --name <hostname> --full
|
cd ~/ansible
|
||||||
qm set <VMID> --memory 2048 --cores 2 \
|
make setup # venv + Python deps
|
||||||
--ciuser ansible \
|
make collections # Ansible collections
|
||||||
--sshkeys /path/to/ansible_ed25519.pub \
|
rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
|
||||||
--ipconfig0 ip=<IP>/24,gw=<GATEWAY>
|
|
||||||
qm start <VMID>
|
|
||||||
```
|
```
|
||||||
|
4. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016) — so it is
|
||||||
|
reachable over SSH from elsewhere.
|
||||||
|
5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.
|
||||||
|
|
||||||
Then set up the Ansible environment on it (`make setup`, `make collections`, set up
|
Because `ubongo` is not in `local.vms`, this is the only case where editing
|
||||||
`rbw` and `rbw unlock`) per ADR-005, and add it to `inventories/<env>/hosts.yml` under the
|
`hosts.yml` by hand is expected. **Known limitation:** `make tf-inventory`
|
||||||
`control` group. Because the control node is not in `local.vms`, this is the only
|
regenerates `hosts.yml` from Terraform outputs and will overwrite a hand-added
|
||||||
case where editing `hosts.yml` by hand is expected — every other host comes from
|
`control` entry — re-add `ubongo` after running it (preserving the control entry in
|
||||||
`make tf-inventory`.
|
the generator is tracked separately, not yet built).
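
Until the generator preserves the control entry, a post-regeneration check can catch the overwrite. This is a minimal sketch: `control_entry_present` is a hypothetical helper, and it assumes the inventory lists each host as a YAML mapping key such as `ubongo:`, which is the usual Ansible YAML layout but is not confirmed by this diff.

```sh
#!/bin/sh
# Sketch only: assumes hosts appear as YAML keys ("ubongo:") in hosts.yml.

# control_entry_present FILE HOST: exit 0 if HOST appears as a YAML key in FILE.
control_entry_present() {
    grep -Eq "^[[:space:]]*$2:" "$1"
}
```

For example, after `make tf-inventory TF_ENV=production`, running `control_entry_present inventories/production/hosts.yml ubongo || echo "re-add ubongo under the control group"` flags a regenerated file that has lost the hand-added entry.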
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -82,7 +82,52 @@ service clears the security bar — record any conscious deviation in
|
||||||
manual in review today, with the planned `/security-review` aggregating every
|
manual in review today, with the planned `/security-review` aggregating every
|
||||||
`roles/*/SECURITY.md` to automate it.
|
`roles/*/SECURITY.md` to automate it.
|
||||||
|
|
||||||
### 10. Commit
|
### 10. Write the per-service verification spec (services)
|
||||||
|
|
||||||
|
For a **service** role, copy `docs/testing/service-verify-template.md` to
|
||||||
|
`roles/<rolename>/VERIFY.md` and fill it in: the critical user journeys that define
|
||||||
|
"working" for this service, what good looks like, what is not browser-verifiable
|
||||||
|
(→ manual handoff), and the test data needed. This is the per-service backbone for the
|
||||||
|
Level 4 `/verify-service` check (ADR-008 / ADR-017) and is part of the pre-production
|
||||||
|
service-clearance gate (`docs/security/service-checklist.md`).
|
||||||
|
|
||||||
|
### 11. Write the per-service operational-access record (services)
|
||||||
|
|
||||||
|
For a **service** role, copy `docs/access/service-access-template.md` to
|
||||||
|
`roles/<rolename>/ACCESS.md` and populate the role's `access__*` data
|
||||||
|
(`access__service`, `access__compose_project`/`_path`, `access__containers`,
|
||||||
|
`access__log.loki_labels`, and `access__api` — `enabled` + endpoint + `firewall_ref` +
|
||||||
|
`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
|
||||||
|
rendered from that data; the admin-API path must `firewall_ref` an entry in the
|
||||||
|
`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
|
||||||
|
`/check-access <rolename>` proves the documented paths are live — part of the
|
||||||
|
service-clearance gate (`docs/security/service-checklist.md`).
|
||||||
|
|
||||||
|
### 12. Write the per-service backup record (stateful services)
|
||||||
|
|
||||||
|
For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
|
||||||
|
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
|
||||||
|
`backup__paths`, `backup__dumps` — `cmd` + `dest` per logical dump — and `backup__quiesce`;
|
||||||
|
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
|
||||||
|
is rendered from that data. A **stateless** service sets `backup__state: false` with a
|
||||||
|
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
|
||||||
|
proves the declared state is captured — part of the service-clearance gate
|
||||||
|
(`docs/security/service-checklist.md`).
|
||||||
|
|
||||||
|
### 13. Pre-flight for lockout-risky roles
|
||||||
|
|
||||||
|
If the new role touches nftables rules, SSH configuration, or boot ordering, run a
|
||||||
|
local VM integration test and confirm reboot-recovery **before** deploying to a live
|
||||||
|
host and while the host's break-glass (Proxmox console / Hetzner console) is still
|
||||||
|
open:
|
||||||
|
|
||||||
|
```bash
make test-integration HOST=<target-host>
```
|
||||||
|
|
||||||
|
See `docs/runbooks/integration-testing.md` and ADR-025.
|
||||||
|
|
||||||
|
### 14. Commit
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git checkout -b role/<rolename>
|
git checkout -b role/<rolename>
|
||||||
|
|
|
||||||
|
|
@ -30,6 +30,28 @@ clear "run: rbw unlock" error rather than a hang.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Break-glass — vault access during a full cluster outage
|
||||||
|
|
||||||
|
The control node `ubongo` (ADR-015) is the tool used to rebuild the cluster, so it
|
||||||
|
must be able to decrypt the vault even when Vaultwarden (if hosted on the cluster)
|
||||||
|
is down. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
|
||||||
|
it **offline** with your Vaultwarden master password — no live server needed for
|
||||||
|
entries it has already synced. The recovery design therefore requires:
|
||||||
|
|
||||||
|
- `rbw` on `ubongo` (and on `mamba`, the break-glass laptop) has **synced at least
|
||||||
|
once** while Vaultwarden was reachable (`rbw sync`).
|
||||||
|
- Your **Vaultwarden master password** is kept **offline** — in a password manager on
|
||||||
|
`mamba` and on paper in a safe — independent of any cluster-hosted Vaultwarden.
|
||||||
|
|
||||||
|
There is always exactly one irreducible offline root secret; here it is the
|
||||||
|
Vaultwarden master password. Keep it recoverable without the cluster.
|
||||||
|
|
||||||
|
> **Verified (2026-06-11, ADR-014):** confirmed on `ubongo` with rbw 1.15.0 — with
|
||||||
|
> the Vaultwarden host unreachable, `rbw sync` fails but `rbw get boma-ansible-vault`
|
||||||
|
> still decrypts from the local cache. Re-verify after an `rbw` major-version bump.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Rotating a single secret value
|
## Rotating a single secret value
|
||||||
|
|
||||||
1. Ensure the agent is unlocked: `rbw unlock`
|
1. Ensure the agent is unlocked: `rbw unlock`
|
||||||
|
|
|
||||||
|
|
@ -15,8 +15,14 @@ revisit (trigger).
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
|
| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
|
||||||
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
|
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
|
||||||
|
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and STUN (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh (NetBird v0.72.4 embeds STUN in the combined server — no separate Coturn) | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
|
||||||
|
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
|
||||||
|
| R5 | **No disk encryption on `ubongo`** — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them | `ubongo` is always-on in a physically controlled location; compensating controls are a **BIOS supervisor password** and **disabled external/USB + PXE boot** (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level | `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken |
|
||||||
|
| R6 | **`le-prod-wildcard` integration runs** — when `CERTS=le-prod-wildcard` is passed to `make test-integration`, the production Gandi PAT (`vault.gandi.pat`) is passed to an ephemeral local test VM via the var overlay, and transient `_acme-challenge` TXT records are written into the real `wingu.me` DNS zone to satisfy the Let's Encrypt DNS-01 challenge. A compromised or long-lived test VM could exfiltrate the PAT; the real zone is briefly (seconds) modified | Scope is **on-demand only** — `le-staging` is the default cert tier (`CERTS=internal` for incident repro); `le-prod-wildcard` is an explicit opt-in. Compensating controls: the VM is ephemeral and destroyed on success; it sits on an isolated libvirt NAT network (no LAN/mesh access); TXT records are auto-removed by Caddy immediately after validation; the PAT is not persisted inside the VM after the run. ADR-025 documents the cert-tier design and the three isolation invariants | The PAT is exfiltrated from a test VM; the `wingu.me` zone shows unexpected records; a `CERTS=le-prod-wildcard` run must be audited or the tier must be revoked |
|
||||||
|
| R7 | **`claude` AI-worker has `NOPASSWD:ALL` sudo on `ubongo`** — the automated AI-worker account can execute any command as root on the control node without a password prompt. A compromised or misbehaving agent session could make arbitrary root-level changes to ubongo | The account is **password-locked** (no interactive `claude` login; `NOPASSWD` sudo is the account's only escalation path, so there is no "su to claude + sudo" attack). `auditd` + Loki attribution (ADR-018) logs every `sudo` invocation with the originating user. The drop-in (`/etc/sudoers.d/claude-ai-worker`) is repo-managed via `base__ai_worker_user` — revocable in one commit + one deploy. Single-operator homelab; all changes in git; off-machine backups (ADR-022). Full rationale: ADR-015 amendment (2026-06-18) + ADR-021 §Sudo model. | The AI-worker executes a destructive action that cannot be rolled back via git; the account key is compromised; the threat model shifts toward targeted remote attackers |
|
||||||
|
| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access** — `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`) — LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
|
||||||
|
|
||||||
_Last reviewed: 2026-06-04. The prior gaps (full CIS hardening, SELinux/AppArmor,
|
_Last reviewed: 2026-06-20. The prior gaps (full CIS hardening, SELinux/AppArmor,
|
||||||
IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
|
IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
|
||||||
Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
|
Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
|
||||||
part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build
|
part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build
|
||||||
|
|
|
||||||
|
|
@ -47,7 +47,17 @@ This checklist is the generic **bar**. Each service answers it in its own
|
||||||
## Operability (security-adjacent)
|
## Operability (security-adjacent)
|
||||||
|
|
||||||
- [ ] Logs go somewhere reviewable (central aggregation when available)
|
- [ ] Logs go somewhere reviewable (central aggregation when available)
|
||||||
- [ ] Backup/restore is covered if the service holds state
|
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
|
||||||
|
`backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
|
||||||
|
reports the declared paths/dumps captured in the latest snapshot — or the service
|
||||||
|
sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
|
||||||
|
- [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the
|
||||||
|
service has a populated `roles/<service>/VERIFY.md` and its critical journeys
|
||||||
|
verified (ADR-008 Level 4 / ADR-017)
|
||||||
|
- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
|
||||||
|
data, `roles/<service>/ACCESS.md` is rendered, and `/check-access` reports the
|
||||||
|
documented paths green — or a deviation is recorded in
|
||||||
|
`docs/security/accepted-risks.md`
|
||||||
|
|
||||||
> Deviations are allowed but must be **conscious**: record them in
|
> Deviations are allowed but must be **conscious**: record them in
|
||||||
> `docs/security/accepted-risks.md`, don't leave them implicit.
|
> `docs/security/accepted-risks.md`, don't leave them implicit.
|
||||||
|
|
|
||||||
484
docs/superpowers/plans/2026-06-05-mesh-vpn-netbird.md
Normal file
|
|
@ -0,0 +1,484 @@
|
||||||
|
# Mesh VPN (NetBird) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Record the decision that boma's mesh VPN is NetBird (self-hosted on `askari`), by authoring ADR-016 and reconciling every doc that currently assumes OPNsense WireGuard or an undecided VPN.
|
||||||
|
|
||||||
|
**Architecture:** Documentation-only change. NetBird replaces ADR-007's VLAN-99 OPNsense WireGuard as the single remote-access overlay for `ubongo`, `askari`, and road-warrior clients; coordinator self-hosted off-site on `askari`; agent-per-host enrollment via the (unbuilt) `base` role; embedded local-user identity. The role/service implementation waits on the `base` role and service-role machinery that STATUS.md lists as not-yet-built — this plan settles the decision and the doc reconciliation only.
|
||||||
|
|
||||||
|
**Tech Stack:** Markdown only. Verification is the repo's pre-commit hooks (trailing-whitespace, end-of-file, gitleaks, ansible-lint, vault-encryption guard) plus a final cross-reference/staleness sweep. No markdown linter exists, so "tests" are hook-pass + grep checks.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Pre-flight (read once before starting)
|
||||||
|
|
||||||
|
- **`rbw` must be unlocked before every commit** (the pre-commit ansible-lint hook decrypts `vault.yml`). Run `rbw unlocked` (exit 0 = good); if not, stop and ask the user to `rbw unlock`.
|
||||||
|
- **Commit style:** one commit per task, imperative subject ≤72 chars.
|
||||||
|
- **Order matters:** Task 1 (ADR-016) lands first — every later task links to it.
|
||||||
|
- **Spec reference:** `docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md`.
|
||||||
|
- **Branch:** start by creating `chore/mesh-vpn-netbird-docs` off `main` (the controller does this before dispatching Task 1; do not implement on `main`).
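
The `rbw` pre-flight check above can be wrapped in a guard that fails fast with a hint instead of letting the pre-commit hook hang or error later. This is a sketch only: `require_unlocked` is a hypothetical helper, and the optional arguments exist purely so the check command can be substituted; by default it runs `rbw unlocked`, the command named in the bullet above.

```sh
#!/bin/sh
# Sketch only: require_unlocked is a hypothetical helper, not part of the repo.

# require_unlocked [CHECK_CMD...]: run the check (default: rbw unlocked);
# exit 0 if it succeeds, otherwise print a hint and exit 1.
require_unlocked() {
    [ "$#" -gt 0 ] || set -- rbw unlocked
    if "$@" >/dev/null 2>&1; then
        return 0
    fi
    echo "vault agent is locked; run: rbw unlock" >&2
    return 1
}
```

Usage: `require_unlocked && git commit -m "<subject>"` only commits when the agent is unlocked.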
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File map
|
||||||
|
|
||||||
|
| File | Action | Responsibility after change |
|
||||||
|
|---|---|---|
|
||||||
|
| `docs/decisions/016-mesh-vpn.md` | Create | Home of record for the NetBird mesh decision |
|
||||||
|
| `docs/decisions/007-network.md` | Modify | VLAN-99 WireGuard retired; askari rides the mesh + hosts the coordinator |
|
||||||
|
| `docs/decisions/015-control-host.md` | Modify | Resolve deferred item #1 (mesh = NetBird on askari) |
|
||||||
|
| `docs/security/accepted-risks.md` | Modify | Replace R3 placeholder with the concrete residual risk |
|
||||||
|
| `docs/CAPABILITIES.md` | Modify | VPN row decided: NetBird, self-hosted |
|
||||||
|
| `STATUS.md` | Modify | Two rows: NetBird coordinator + agent enrollment (designed, not built) |
|
||||||
|
| `CLAUDE.md` | Modify | ADR-016 in Further reading |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Author ADR-016 (the home of record)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/decisions/016-mesh-vpn.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the ADR file**
|
||||||
|
|
||||||
|
Create `docs/decisions/016-mesh-vpn.md` with exactly this content (preserve em-dashes —, backticks, table pipes, and the `verified:` stamps):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
|
||||||
|
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
|
||||||
|
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
|
||||||
|
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
|
||||||
|
weigh. This ADR settles it.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
|
||||||
|
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.
|
||||||
|
|
||||||
|
The decision in five parts:
|
||||||
|
|
||||||
|
1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
|
||||||
|
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
|
||||||
|
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
|
||||||
|
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
|
||||||
|
survives a homelab outage and stays out of the cluster it administers.
|
||||||
|
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
|
||||||
|
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
|
||||||
|
parity) — ruled out below.
|
||||||
|
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (2–5
|
||||||
|
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
|
||||||
|
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
|
||||||
|
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
|
||||||
|
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
|
||||||
|
5. **Identity — embedded local users** (Dex in the management container); external SSO
|
||||||
|
(Zitadel/Keycloak) stays an optional future.
|
||||||
|
|
||||||
|
## Verified facts (ADR-014)
|
||||||
|
|
||||||
|
verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
|
||||||
|
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
|
||||||
|
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
|
||||||
|
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
|
||||||
|
|
||||||
|
verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
|
||||||
|
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
|
||||||
|
open-core feature gating.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
|
||||||
|
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
|
||||||
|
allocated for it.
|
||||||
|
|
||||||
|
- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
|
||||||
|
- `ubongo` — agent.
|
||||||
|
- All Linux managed hosts — agent via the `base` role.
|
||||||
|
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
|
||||||
|
- OPNsense / `mgmt` — single non-agent exception.
|
||||||
|
|
||||||
|
## Security
|
||||||
|
|
||||||
|
- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
|
||||||
|
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
|
||||||
|
privilege.
|
||||||
|
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
|
||||||
|
consumed by `base`; prefer ephemeral/scoped keys.
|
||||||
|
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
|
||||||
|
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
|
||||||
|
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
|
||||||
|
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
|
||||||
|
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
|
||||||
|
Recorded as accepted-risk R3.
|
||||||
|
|
||||||
|
## Recovery & operations
|
||||||
|
|
||||||
|
- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
|
||||||
|
mesh/coordinator outage never blocks on-LAN runs.
|
||||||
|
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
|
||||||
|
`base` enrolls the fleet.
|
||||||
|
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
|
||||||
|
NetBird's management datastore is backed up encrypted off `askari` (synced to
|
||||||
|
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
|
||||||
|
- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a
|
||||||
|
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
|
||||||
|
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
|
||||||
|
version-pinned (ADR-011). boma's `dns` role stays authoritative for
|
||||||
|
`boma.baobab.band`; NetBird built-in DNS scoped/off.
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Designed, not built — depends on the unbuilt `base` role and service-role machinery
|
||||||
|
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
|
||||||
|
`base` exists.
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
| Option | Reason |
|
||||||
|
|---|---|
|
||||||
|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
|
||||||
|
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
|
||||||
|
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
|
||||||
|
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
|
||||||
|
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
|
||||||
|
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
|
||||||
|
|
||||||
|
See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
|
||||||
|
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
|
||||||
|
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted).
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/decisions/016-mesh-vpn.md`
|
||||||
|
Expected: Passed/Skipped (ansible-lint Skipped for non-YAML).
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/016-mesh-vpn.md
|
||||||
|
git commit -m "Add ADR-016 (mesh VPN — NetBird self-hosted on askari)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Amend ADR-007 (retire VLAN-99 WireGuard, askari on the mesh)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/decisions/007-network.md`
|
||||||
|
|
||||||
|
Read the file first, then make FOUR exact edits. Preserve em-dashes —, backticks, table pipes.
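An exact find/replace is only safe when the Find text occurs exactly once in the target file. The following is a minimal sketch of a pre-check, assuming a POSIX-style `grep`; `count_exact` and `require_unique` are hypothetical helper names and not part of the repo. Because `grep` works line by line, it can check only one line of a multi-line Find block, so pass the most distinctive line.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (illustration only, not part of the plan or the repo).

# Print how many lines of FILE contain NEEDLE as a literal string.
count_exact() {
  local needle="$1" file="$2"
  # -F: literal match (no regex), -c: count matching lines, --: end of options.
  # grep exits 1 when the count is 0, so "|| true" keeps the helper usable
  # in scripts that run under "set -e".
  grep -cF -- "$needle" "$file" || true
}

# Succeed only when NEEDLE appears on exactly one line of FILE.
require_unique() {
  [ "$(count_exact "$1" "$2")" -eq 1 ]
}
```

A possible use before Step 1 would be `require_unique '| 99 | `+"`vpn`"+` |' docs/decisions/007-network.md`, written with the needle in single quotes so the backticks stay literal; a non-zero exit means the row is missing or duplicated and the file should be read again before editing.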
|
||||||
|
|
||||||
|
- [ ] **Step 1: Update the VLAN-99 row in the VLAN design table**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Replace the VLAN-99 addressing subsection**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
|
||||||
|
|
||||||
|
| Address | Host |
|
||||||
|
|---|---|
|
||||||
|
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
|
||||||
|
| `10.99.0.2` | `askari` (Hetzner VPS) |
|
||||||
|
| `10.99.0.10`+ | Road-warrior clients |
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
### VLAN 99 — vpn — retired
|
||||||
|
|
||||||
|
The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
|
||||||
|
(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
|
||||||
|
self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
|
||||||
|
NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
|
||||||
|
(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
|
||||||
|
`10.99.0.0/24` is freed.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Update the two `vpn` rows in the OPNsense firewall-rules table**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
|
||||||
|
| `vpn` | `mgmt` | allow (administration from askari) |
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
| mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) |
|
||||||
|
| mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Rewrite the "External monitoring — askari" section**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
|
||||||
|
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
|
||||||
|
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
|
||||||
|
for administration.
|
||||||
|
|
||||||
|
`askari` is provisioned and managed independently of the Proxmox cluster — it
|
||||||
|
must be reachable even when the homelab is down (its entire purpose).
|
||||||
|
FQDN: `askari.baobab.band`.
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
|
||||||
|
the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
|
||||||
|
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
|
||||||
|
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
|
||||||
|
|
||||||
|
`askari` is provisioned and managed independently of the Proxmox cluster — it must
|
||||||
|
be reachable even when the homelab is down (its entire purpose), which is also why
|
||||||
|
the mesh coordinator lives here: an off-site control plane survives a homelab outage.
|
||||||
|
FQDN: `askari.baobab.band`.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/decisions/007-network.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/007-network.md
|
||||||
|
git commit -m "ADR-007: retire VLAN-99 WireGuard for the NetBird mesh (ADR-016)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Resolve ADR-015 deferred item #1
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/decisions/015-control-host.md`
|
||||||
|
|
||||||
|
Read the file first, then make THREE exact edits.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Update provisioning step 3**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
3. Join the mesh VPN (choice deferred — see below).
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Update the Access & security mesh line**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
|
||||||
|
mesh; nothing is published to the public internet — this stays inside ADR-002.
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
|
||||||
|
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
|
||||||
|
stays inside ADR-002.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Resolve deferred item #1**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
|
||||||
|
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
|
||||||
|
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
|
||||||
|
or it recreates the chicken-and-egg.
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
|
||||||
|
(off-site, so it survives a homelab outage and stays out of the cluster it
|
||||||
|
administers). Replaces ADR-007's OPNsense WireGuard.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/decisions/015-control-host.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/015-control-host.md
|
||||||
|
git commit -m "ADR-015: resolve mesh-VPN deferral — NetBird on askari (ADR-016)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Replace accepted-risks R3 with the concrete residual risk
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/security/accepted-risks.md`
|
||||||
|
|
||||||
|
Read the file first, then make ONE exact edit. (The row is long — match it whole.)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Replace the R3 row**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Bump the "Last reviewed" date**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
_Last reviewed: 2026-06-05. The prior gaps
|
||||||
|
```
|
||||||
|
This already reads `2026-06-05` (today) from the previous work, so **no change is needed** — confirm it says `2026-06-05` and move on. (If it shows an earlier date, set it to `2026-06-05`.)
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add docs/security/accepted-risks.md
|
||||||
|
git commit -m "accepted-risks: R3 now the concrete NetBird coordinator risk"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Update the CAPABILITIES VPN row
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/CAPABILITIES.md`
|
||||||
|
|
||||||
|
Read the file first, then make ONE exact edit.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Replace the VPN / remote access row**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh |
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add docs/CAPABILITIES.md
|
||||||
|
git commit -m "CAPABILITIES: VPN decided — NetBird self-hosted (ADR-016)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: Add NetBird rows to STATUS.md
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `STATUS.md`
|
||||||
|
|
||||||
|
Read the file first, then make ONE exact edit (add two rows after the `ubongo` row).
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the two rows**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
|
||||||
|
```
|
||||||
|
Replace with that SAME line followed by the two new rows:
|
||||||
|
```
|
||||||
|
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
|
||||||
|
| NetBird mesh — coordinator on `askari` | ADR-016 | Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Decision recorded; not deployed (askari + service-role machinery not built). |
|
||||||
|
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files STATUS.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add STATUS.md
|
||||||
|
git commit -m "STATUS: record NetBird mesh (coordinator + base enrollment)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 7: Link ADR-016 from CLAUDE.md
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `CLAUDE.md`
|
||||||
|
|
||||||
|
Read the file first, then make ONE exact edit.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the Further reading row after Network topology**
|
||||||
|
|
||||||
|
Find:
|
||||||
|
```
|
||||||
|
| Network topology | `docs/decisions/007-network.md` |
|
||||||
|
```
|
||||||
|
Replace with that SAME line followed by the new row:
|
||||||
|
```
|
||||||
|
| Network topology | `docs/decisions/007-network.md` |
|
||||||
|
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add CLAUDE.md
|
||||||
|
git commit -m "CLAUDE.md: link ADR-016 (mesh VPN)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 8: Final consistency sweep
|
||||||
|
|
||||||
|
**Files:** none modified (verification only)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Confirm no doc still treats OPNsense WireGuard / `10.99` as the active remote-access path, and no "pending/deferred VPN" language remains**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
grep -rniE "choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
|
||||||
|
```
|
||||||
|
Expected: the ONLY hits are in `007-network.md` and `016-mesh-vpn.md`, where they describe the **retirement** of `10.99.0.0/24` (e.g. "`10.99.0.0/24` is freed", "no `10.99.0.0/24` routing"); those are correct and expected. There must be **no** hit that still treats OPNsense WireGuard or `10.99.0.x` as the *live* remote-access path, and **no** `choice deferred` / `pending VPN choice` anywhere. Legitimate mentions of "WireGuard" as NetBird's *data plane* are fine and won't match this pattern (it matches `WireGuard` only when followed by `endpoint`, `peers`, or `to OPNsense`). If a canonical doc still names the WireGuard VPN as live, fix it as in the relevant task above and amend that commit.
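The sweep above can also be narrowed to the files where any hit is unexpected. This is a sketch under stated assumptions: `stale_vpn_hits` is a hypothetical function name, and it excludes `007-network.md` and `016-mesh-vpn.md` wholesale, so those two files still need to be read by eye to confirm their hits describe the retirement.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): print the files that still carry
# stale VPN wording, skipping plans/specs and the two ADRs where retirement
# language is expected. Search roots are passed as arguments.
stale_vpn_hits() {
  # Same pattern as the sweep in this step; -l prints file names only.
  grep -rliE "choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)" "$@" \
    | grep -vE "superpowers/(plans|specs)/" \
    | grep -vE "(007-network|016-mesh-vpn)\.md$" \
    || true   # empty output (no stale files) is the success case, not an error
}
```

Run as `stale_vpn_hits docs/ CLAUDE.md STATUS.md`; empty output would mean no unexpected file matches, and any printed path points at a doc to fix under the relevant task.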
|
||||||
|
|
||||||
|
- [ ] **Step 2: Confirm ADR-016 exists and is cross-linked**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
test -f docs/decisions/016-mesh-vpn.md && echo "ADR-016 present"
|
||||||
|
grep -rl "ADR-016\|016-mesh-vpn" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
|
||||||
|
```
|
||||||
|
Expected: the file exists and the referencing docs (007, 015, accepted-risks, CAPABILITIES, STATUS, CLAUDE.md) appear.
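Reading the `grep -rl` output against the expected list is easy to get wrong by eye. A minimal sketch of the inverse check follows, printing the docs that do not yet reference the ADR; `missing_adr_links` is a hypothetical name, and the file list is whatever the caller passes in.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): print each given file that does
# NOT mention ADR-016 or its file name. No output means all are cross-linked.
missing_adr_links() {
  local f
  for f in "$@"; do
    # -q: quiet, exit status only; a missing or unreadable file is reported too.
    grep -qE "ADR-016|016-mesh-vpn" "$f" 2>/dev/null || echo "$f"
  done
}
```

For example, `missing_adr_links docs/decisions/007-network.md docs/decisions/015-control-host.md docs/security/accepted-risks.md docs/CAPABILITIES.md STATUS.md CLAUDE.md` should print nothing once Tasks 2 to 7 have landed.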
|
||||||
|
|
||||||
|
- [ ] **Step 3: Full hook run**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --all-files`
|
||||||
|
Expected: all hooks Passed/Skipped. Fix anything that fails (most likely trailing whitespace / end-of-file) and amend the owning commit.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Push (only if the user asks)**
|
||||||
|
|
||||||
|
Per CLAUDE.md, push to `origin` is the off-machine backup. If the user wants it pushed:
|
||||||
|
```bash
|
||||||
|
git push origin <branch-or-main-after-merge>
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-review notes (author)
|
||||||
|
|
||||||
|
- **Spec coverage:** decision/architecture/security/recovery → Task 1 (ADR-016); the spec's "Documentation & implementation changes" table → Tasks 2–7; deferrals (external SSO, OPNsense mesh specifics, role implementation) are recorded in ADR-016/STATUS, not implemented here (correct — they need the unbuilt `base`/service-role machinery). ✓
|
||||||
|
- **Not in scope (intentional):** the `netbird_coordinator` service role, the `base`-role agent task, vault `setup_key` material, and any live deployment — all wait on `base`/service-role machinery (STATUS-honest). ✓
|
||||||
|
- **No placeholders:** every edit shows exact find/replace text; the `_(retired)_` token in ADR-007 is deliberate table content. ✓
|
||||||
|
- **Name consistency:** ADR file is `016-mesh-vpn.md` everywhere; `vault.netbird.setup_key`, `netbird_coordinator`, and `wt0` are used identically across ADR-016 and the sweep. ✓
|
||||||
|
```
|
||||||
605
docs/superpowers/plans/2026-06-05-service-ui-verification.md
Normal file
|
|
@ -0,0 +1,605 @@
|
||||||
|
# Service-UI Verification (Level 4) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Build the authorable-now parts of ADR-008 Level 4 — a Claude-driven exploratory service-UI verification harness — namely ADR-017, the `/verify-service` skill, the per-service `VERIFY.md` template/convention, and the doc reconciliations; the *live run* stays deferred on `ubongo`/Authentik/staging.
|
||||||
|
|
||||||
|
**Architecture:** Mostly documentation + two new authorable artifacts (the `/verify-service` Claude Code command and the `VERIFY.md` template). No application code, no Ansible roles (none of the prerequisite roles exist). The harness *mechanism* is the `playwright` Claude Code plugin driving Chromium on `ubongo`; this plan does not install or run it — it records the decision, the standards, and the orchestration logic.
|
||||||
|
|
||||||
|
**Tech Stack:** Markdown + a Claude Code command file. Verification is the repo's pre-commit hooks plus a final cross-reference/staleness sweep. No markdown linter exists, so "tests" are hook-pass + grep checks.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Pre-flight (read once before starting)
|
||||||
|
|
||||||
|
- **`rbw` must be unlocked before every commit** (the pre-commit ansible-lint hook decrypts `vault.yml`). Run `rbw unlocked`; if it exits non-zero, stop and ask the user to `rbw unlock`.
|
||||||
|
- **Commit style:** one commit per task, imperative subject ≤72 chars.
|
||||||
|
- **Order matters:** Task 1 (ADR-017) lands first — later tasks link to it.
|
||||||
|
- **Spec reference:** `docs/superpowers/specs/2026-06-05-service-ui-verification-design.md`.
|
||||||
|
- **Branch:** the controller creates `chore/service-ui-verification-docs` off `main` before dispatching Task 1; do not implement on `main`.
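The unlock rule in the first bullet can be expressed as a small guard that runs before each commit. This is a sketch, not the repo's tooling: `vault_ready` is a hypothetical function name, and the optional arguments (a substitute check command) exist only so the guard can be dry-run without `rbw` installed.

```bash
#!/usr/bin/env bash
# Hypothetical guard (illustration only): succeed when the vault is unlocked,
# otherwise print the stop instruction and fail.
vault_ready() {
  # With no arguments, use the real check from the pre-flight list.
  if [ "$#" -eq 0 ]; then
    set -- rbw unlocked
  fi
  if "$@" >/dev/null 2>&1; then
    return 0
  fi
  echo "vault locked: stop and ask the user to run 'rbw unlock'" >&2
  return 1
}
```

A task's verify step could then read `vault_ready && pre-commit run --files docs/decisions/017-service-ui-verification.md`, which behaves like the `rbw unlocked && ...` form used in the plan but adds the reminder message on failure.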
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File map
|
||||||
|
|
||||||
|
| File | Action | Responsibility |
|
||||||
|
|---|---|---|
|
||||||
|
| `docs/decisions/017-service-ui-verification.md` | Create | Home of record for Level 4 verification |
|
||||||
|
| `docs/decisions/008-testing.md` | Modify | Expand the Level 4 stub; link ADR-017 |
|
||||||
|
| `docs/testing/service-verify-template.md` | Create | The `VERIFY.md` template (parallels `service-security-template.md`) |
|
||||||
|
| `.claude/commands/verify-service.md` | Create | The `/verify-service <name>` orchestrating skill |
|
||||||
|
| `docs/security/service-checklist.md` | Modify | Add "passed Level 4" to the pre-deploy gate |
|
||||||
|
| `CLAUDE.md` | Modify | Role-convention bullet (`VERIFY.md`); Further-reading ADR-017 row |
|
||||||
|
| `.gitignore` | Modify | Ignore the screenshot working dir |
|
||||||
|
| `docs/testing/reviews/README.md` | Create | Explains the committed-report dir (also makes the dir exist in git) |
|
||||||
|
| `STATUS.md` | Modify | Row: Level 4 verification (skill/template authorable; running deferred) |
|
||||||
|
| `docs/TODO.md` | Modify | Mark 2.2 (browser) + 2.3 addressed by ADR-017 |
|
||||||
|
|
||||||
|
**Deferred (not in this plan):** scaffolding `VERIFY.md` into `make new-role` (do it when that scaffold is next touched — noted in ADR-017); the Authentik test-user provisioning automation; per-service `VERIFY.md` files (no service roles exist); installing/running the `playwright` plugin.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Author ADR-017 (the home of record)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/decisions/017-service-ui-verification.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the ADR file**
|
||||||
|
|
||||||
|
Create `docs/decisions/017-service-ui-verification.md` with exactly this content (preserve em-dashes —, backticks, table pipes):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# ADR-017 — Service-UI acceptance verification (Level 4)
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
|
||||||
|
Level 4 stub. Nothing below Level 4 exercises a service's **application UI** — none
|
||||||
|
answer "does PhotoPrism actually let me log in, upload a photo, and see a thumbnail?"
|
||||||
|
(TODO 8.2). The operator's ask (TODO 2.2 headless browsing + TODO 2.3 test users +
|
||||||
|
manual-test instruction): Claude spins up a browser, *sees* the service UI, exercises
|
||||||
|
it, generates test users, and instructs the operator on manual tests. Today Claude sees
|
||||||
|
a browser only passively (`/screenshot` fetches operator-taken shots from `mamba`); this
|
||||||
|
is the active counterpart.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
A Claude-driven exploratory service-UI verification harness — **Level 4** — invoked as
|
||||||
|
`/verify-service <name>` on `ubongo`. Five settled forks:
|
||||||
|
|
||||||
|
1. **Claude-driven exploratory** — Claude navigates with judgment, not deterministic
|
||||||
|
scripts. A scripted regression suite is explicitly not built here.
|
||||||
|
2. **Interactive, Claude-in-the-loop** — exploratory judgment can't be a headless cron
|
||||||
|
gate; scheduled smoke is a determinism job for health checks / Uptime Kuma later.
|
||||||
|
3. **Staging, full exercise** — Claude creates test users and exercises features
|
||||||
|
(incl. destructive flows) against a *staging* deploy; the rebuildable sandbox
|
||||||
|
resolves safety.
|
||||||
|
4. **Test users in Authentik (central IdP), real SSO flow** — authenticates through
|
||||||
|
Traefik + Authentik as a real user would.
|
||||||
|
5. **Per-service `VERIFY.md` backbone + free exploration** — each service role ships an
|
||||||
|
acceptance spec of critical journeys; Claude executes it and explores beyond it.
|
||||||
|
|
||||||
|
## VERIFY.md standard
|
||||||
|
|
||||||
|
Every service role ships a populated `roles/<service>/VERIFY.md`, copied from
|
||||||
|
`docs/testing/service-verify-template.md` — parallel to `SECURITY.md` from
|
||||||
|
`service-security-template.md`. A new role convention. It lists the service's critical
|
||||||
|
user journeys (what "working" means), what good looks like, and what is not
|
||||||
|
browser-verifiable (→ manual handoff). It also joins the pre-production gate in
|
||||||
|
`docs/security/service-checklist.md`.
|
||||||
|
|
||||||
|
## Test-user standard (TODO 2.3)
|
||||||
|
|
||||||
|
Test identities live only in the **staging** Authentik (never production): a dedicated
|
||||||
|
`test` group / naming prefix; ephemeral per-run credentials (staging is rebuildable, so
|
||||||
|
nothing persisted, none in `vault.yml`); reuse-or-create; teardown via staging rebuild
|
||||||
|
or explicit `test`-group cleanup.
|
||||||
|
|
||||||
|
## Reporting & manual handoff
|
||||||
|
|
||||||
|
`/verify-service` writes `docs/testing/reviews/YYYY-MM-DD-<service>.md` (+ `latest.md`),
|
||||||
|
mirroring `/review-repo` and `/capacity-review`: pass/fail per `VERIFY.md` journey,
|
||||||
|
observations, the test-user/env used, a verdict, and a structured **manual-test
|
||||||
|
checklist** for anything Claude can't do (physical device, paid/external flow,
|
||||||
|
subjective judgment) — the "instruct me on tests" output. Screenshots are saved to a
|
||||||
|
git-ignored working dir on `ubongo` (PNG bloat + secret-leak risk); the report links
|
||||||
|
them.
|
||||||
|
|
||||||
|
## Safety
|
||||||
|
|
||||||
|
- **Staging-only guard** — the skill refuses to run against production (exploratory
|
||||||
|
clicking is destructive); ADR-002-aligned hard stop.
|
||||||
|
- **Confined blast radius** — test users only in the staging `test` group; the run
|
||||||
|
sticks to the target service.
|
||||||
|
- **No secrets leaked** — the git-ignored screenshot dir is the safety boundary;
|
||||||
|
avoid capturing credential screens.
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
|
||||||
|
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
|
||||||
|
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- `ubongo` (ADR-015) — runs the browser. Designed, not built.
|
||||||
|
- `playwright` Claude Code plugin — enabled when this lands (`claude-code-setup.md`).
|
||||||
|
- Authentik (CAPABILITIES §2, planned) — central IdP for test users + SSO.
|
||||||
|
- A staging deploy of the service (ADR-008 Level 2) — staging is currently empty stubs.
|
||||||
|
- `make new-role` scaffolding `VERIFY.md` — deferred to when that scaffold is next touched.
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
| Option | Reason |
|---|---|
| Scripted Playwright regression suite | Operator wants exploratory judgment; scripts add maintenance burden. Could be a later layer, not this. |
| Scheduled headless smoke gate | Needs determinism the exploratory nature excludes; belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. |
| Free-form, no per-service spec | Non-repeatable, can miss a critical flow; `VERIFY.md` gives a backbone. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Traefik+Authentik path; central test users are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`. |
|
||||||
|
|
||||||
|
See also: ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
|
||||||
|
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/decisions/017-service-ui-verification.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/017-service-ui-verification.md
|
||||||
|
git commit -m "Add ADR-017 (service-UI acceptance verification, Level 4)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Expand the ADR-008 Level 4 stub
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/decisions/008-testing.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Replace the Level 4 stub with the full definition**
|
||||||
|
|
||||||
|
Find this exact block:
|
||||||
|
```
|
||||||
|
### Level 4 — Service-UI acceptance (planned, not built)
|
||||||
|
|
||||||
|
Claude drives a headless browser from `ubongo` against a *deployed* service: loads
|
||||||
|
the rendered UI, creates test users, exercises features, and hands the operator a
|
||||||
|
manual test script for the rest. Catches application-level regressions that no lower
|
||||||
|
level sees. The harness (Playwright/headless-Chromium, screenshot-back-to-Claude) is
|
||||||
|
a **separate spec**; `ubongo` is sized for it (ADR-015). Status: designed, not built
|
||||||
|
(STATUS.md).
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
### Level 4 — Service-UI acceptance (Claude-driven exploratory)
|
||||||
|
|
||||||
|
A Claude-driven exploratory check of a service's **application UI**, run as
|
||||||
|
`/verify-service <name>` on `ubongo` (ADR-017). Claude drives Chromium via the
|
||||||
|
`playwright` plugin against a **staging** deploy, authenticates through the real
|
||||||
|
Traefik + Authentik SSO flow using a test user in the staging `test` group, then
|
||||||
|
executes the service's `roles/<service>/VERIFY.md` acceptance journeys *and*
|
||||||
|
free-explores — judging pass/fail, screenshotting key states. It writes a dated report
|
||||||
|
to `docs/testing/reviews/` and hands the operator a manual-test checklist for anything
|
||||||
|
it can't verify (hardware, paid/external flows, subjective judgment).
|
||||||
|
|
||||||
|
Catches application-level regressions no lower level sees ("does PhotoPrism actually
|
||||||
|
serve photos?"). Placement: after Level 2 (staging deploy), before production
|
||||||
|
promotion. Exploratory and interactive by design — *not* a deterministic CI/cron gate
|
||||||
|
(that role belongs to health checks / Uptime Kuma).
|
||||||
|
|
||||||
|
**Status:** the skill, the `VERIFY.md` template, and standards are authorable now;
|
||||||
|
running it is deferred on `ubongo` + the `playwright` plugin + Authentik + a staging
|
||||||
|
deploy (STATUS.md). Full design: ADR-017.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/decisions/008-testing.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/008-testing.md
|
||||||
|
git commit -m "ADR-008: expand Level 4 into the verify-service harness (ADR-017)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Create the `VERIFY.md` template
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/testing/service-verify-template.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the template**
|
||||||
|
|
||||||
|
Create `docs/testing/service-verify-template.md` with exactly this content (preserve `<`/`>` HTML escapes, em-dashes, backticks):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# Per-service verification record — template
|
||||||
|
|
||||||
|
Copy this file to `roles/<service>/VERIFY.md` and fill it in when building a service
|
||||||
|
role (ADR-008 Level 4 / ADR-017). It is the per-service **acceptance spec**: the
|
||||||
|
critical user journeys that define "working" for this service. `/verify-service <name>`
|
||||||
|
reads it, drives a browser through them against the staging deploy, and explores beyond
|
||||||
|
them.
|
||||||
|
|
||||||
|
Delete this preamble in the copy and start from the heading below.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# Verify — <service>
|
||||||
|
|
||||||
|
## Critical user journeys
|
||||||
|
|
||||||
|
The acceptance criteria — what "working" means for this service. Numbered; each is an
|
||||||
|
action and its expected result. Example shape (replace with this service's flows):
|
||||||
|
|
||||||
|
1. SSO login via Authentik succeeds and lands on the service's home/dashboard.
|
||||||
|
2. <core action> — e.g. "upload a test image" → <expected> — "a thumbnail renders".
|
||||||
|
3. <core action> → <expected>.
|
||||||
|
|
||||||
|
## What good looks like
|
||||||
|
|
||||||
|
Key states/screens Claude should confirm (and screenshot) — the visual/textual signals
|
||||||
|
that the journeys above actually succeeded.
|
||||||
|
|
||||||
|
- <e.g. "the uploaded image appears in the library grid within ~10s">
|
||||||
|
|
||||||
|
## Not browser-verifiable
|
||||||
|
|
||||||
|
Items to route to the manual-test handoff — things a headless browser can't or
|
||||||
|
shouldn't judge.
|
||||||
|
|
||||||
|
- <e.g. hardware passthrough, a paid/external integration, subjective media quality>
|
||||||
|
|
||||||
|
## Test data
|
||||||
|
|
||||||
|
What the journeys need, provisioned in the **staging** Authentik `test` group
|
||||||
|
(ephemeral, torn down by staging rebuild).
|
||||||
|
|
||||||
|
- <e.g. "one test user; no pre-seeded content">
|
||||||
|
```
|
||||||
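The skill reads four sections of a service's `VERIFY.md`, so a copy of the template that has lost one of them would leave a phase with nothing to read. A minimal sketch of a structural check follows, assuming the four H2 headings from the template above; the helper name `missing_sections` and the sample text are hypothetical and not part of the repo.

```python
import re

# Headings taken from the template; the helper itself is hypothetical.
REQUIRED_SECTIONS = (
    "Critical user journeys",
    "What good looks like",
    "Not browser-verifiable",
    "Test data",
)


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required H2 headings that are absent from a VERIFY.md body."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^## +(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Hand-written sample that deliberately omits one section.
sample = """# Verify - photoprism

## Critical user journeys

1. SSO login via Authentik succeeds.

## What good looks like

- The library grid renders.

## Test data

- One test user.
"""

print(missing_sections(sample))
```

A check like this only proves the headings exist; whether the journeys under them are meaningful still needs a human or the skill itself to judge.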
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/testing/service-verify-template.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add docs/testing/service-verify-template.md
|
||||||
|
git commit -m "Add VERIFY.md template for service-UI acceptance (ADR-017)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Create the `/verify-service` skill
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `.claude/commands/verify-service.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the command file**
|
||||||
|
|
||||||
|
Create `.claude/commands/verify-service.md` with exactly this content (preserve em-dashes, backticks, code fences):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
Exploratory service-UI verification (ADR-008 Level 4 / ADR-017)
|
||||||
|
|
||||||
|
Drive a browser against a **staging** deploy of a service, exercise its
|
||||||
|
`roles/<service>/VERIFY.md` acceptance journeys plus free exploration, and write a
|
||||||
|
tracked report. Argument: the service/role name (e.g. `/verify-service photoprism`).
|
||||||
|
|
||||||
|
## Prerequisites (this is forward-looking — ADR-017 dependencies)
|
||||||
|
|
||||||
|
This skill cannot run until all of these exist; if any is missing, say so and stop —
|
||||||
|
do not improvise around it:
|
||||||
|
|
||||||
|
- `ubongo` with the `playwright` Claude Code plugin (browser automation tools).
|
||||||
|
- A **staging** deploy of the target service (ADR-008 Level 2).
|
||||||
|
- Authentik (staging) for test-user provisioning + SSO.
|
||||||
|
- `roles/<name>/VERIFY.md` present.
|
||||||
|
|
||||||
|
## Process
|
||||||
|
|
||||||
|
### Phase 0 — safety gate (staging only)
|
||||||
|
|
||||||
|
Confirm the target resolves to the **staging** environment/inventory, never production.
|
||||||
|
If you cannot prove it is staging, **stop** — exploratory clicking is destructive
|
||||||
|
(ADR-002). State why you stopped.
|
||||||
|
|
||||||
|
### Phase 1 — read intent
|
||||||
|
|
||||||
|
Read `roles/<name>/VERIFY.md`: the Critical user journeys, What good looks like, Not
|
||||||
|
browser-verifiable, and Test data sections.
|
||||||
|
|
||||||
|
### Phase 2 — test user
|
||||||
|
|
||||||
|
Provision (reuse-or-create) a test user in the staging Authentik `test` group, with
|
||||||
|
ephemeral credentials held only for this run. Never use a real/production account.
|
||||||
|
|
||||||
|
### Phase 3 — drive the browser
|
||||||
|
|
||||||
|
Via the `playwright` plugin, on `ubongo`: open the service's staging URL (resolved via
|
||||||
|
boma DNS), authenticate through the real Traefik + Authentik SSO flow, then execute each
|
||||||
|
`VERIFY.md` journey — judging pass/fail and screenshotting key states — and free-explore
|
||||||
|
for anything obviously broken. Save screenshots to the git-ignored `.verify-runs/`
|
||||||
|
working dir; avoid capturing credential screens.
|
||||||
|
|
||||||
|
### Phase 4 — write the report
|
||||||
|
|
||||||
|
Save to `docs/testing/reviews/YYYY-MM-DD-<name>.md` and overwrite
|
||||||
|
`docs/testing/reviews/latest.md`. Structure:
|
||||||
|
|
||||||
|
- **One-line verdict** — e.g. "5/5 journeys passed; one manual check pending".
|
||||||
|
- **Run metadata** — date, service, staging env, test user, reviewed commit SHA.
|
||||||
|
- **Per-journey result** — pass/fail against `VERIFY.md`, with the evidence (linked
|
||||||
|
screenshot path) and any observation.
|
||||||
|
- **Free-exploration findings** — anything noticed beyond the listed journeys.
|
||||||
|
- **Manual-test checklist** — the "Not browser-verifiable" items plus anything Claude
|
||||||
|
couldn't do: numbered steps, expected result, and why it was handed off.
|
||||||
|
|
||||||
|
### Phase 5 — clean up + commit
|
||||||
|
|
||||||
|
Offer to clean up the `test`-group user (or note that the staging rebuild will).
|
||||||
|
Commit the report markdown per CLAUDE.md git conventions. **Do not** commit
|
||||||
|
`.verify-runs/` (git-ignored).
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- Reports (markdown) are committed; screenshots stay local on `ubongo` in `.verify-runs/`.
|
||||||
|
- Exploratory and interactive — this is not a deterministic CI gate.
|
||||||
|
```
|
||||||
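Phase 0 of the skill above is a fail-closed rule: a target that cannot be proven to be staging is treated as not staging. The sketch below illustrates that rule under stated assumptions. The `environments` mapping, the hostnames, and the names `require_staging` and `NotStagingError` are all hypothetical; the skill does not say how the real inventory exposes the environment.

```python
class NotStagingError(RuntimeError):
    """Raised when a target cannot be proven to be a staging host."""


def require_staging(target: str, environments: dict[str, str]) -> str:
    """Return the target only if it is explicitly recorded as staging.

    `environments` is a hypothetical hostname -> environment mapping.
    Unknown hosts fail closed: absence of proof counts as "not staging".
    """
    label = environments.get(target)
    if label != "staging":
        reason = "unknown host" if label is None else f"environment is {label!r}"
        raise NotStagingError(f"refusing to verify {target}: {reason}")
    return target


# Made-up hostnames for illustration only.
inventory = {
    "photoprism.staging.example": "staging",
    "photoprism.example": "production",
}

print(require_staging("photoprism.staging.example", inventory))
for candidate in ("photoprism.example", "typo.example"):
    try:
        require_staging(candidate, inventory)
    except NotStagingError as err:
        print(err)
```

The design choice worth noting is that a typo or a missing inventory entry stops the run in the same way a production host does, which matches the "if you cannot prove it is staging, stop" wording.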
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files .claude/commands/verify-service.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add .claude/commands/verify-service.md
|
||||||
|
git commit -m "Add /verify-service skill for Level 4 UI verification (ADR-017)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Add Level 4 to the service-clearance gate
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/security/service-checklist.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add an Operability bullet for Level 4**
|
||||||
|
|
||||||
|
Find this exact block:
|
||||||
|
```
|
||||||
|
## Operability (security-adjacent)
|
||||||
|
|
||||||
|
- [ ] Logs go somewhere reviewable (central aggregation when available)
|
||||||
|
- [ ] Backup/restore is covered if the service holds state
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
## Operability (security-adjacent)
|
||||||
|
|
||||||
|
- [ ] Logs go somewhere reviewable (central aggregation when available)
|
||||||
|
- [ ] Backup/restore is covered if the service holds state
|
||||||
|
- [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the
|
||||||
|
service has a populated `roles/<service>/VERIFY.md` and its critical journeys
|
||||||
|
verified (ADR-008 Level 4 / ADR-017)
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/security/service-checklist.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add docs/security/service-checklist.md
|
||||||
|
git commit -m "service-checklist: add Level 4 UI verification to the gate"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: Update CLAUDE.md (role convention + Further reading)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `CLAUDE.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the `VERIFY.md` role-convention bullet**
|
||||||
|
|
||||||
|
Find this exact line:
|
||||||
|
```
|
||||||
|
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
|
||||||
|
```
|
||||||
|
Replace with that SAME line followed by a new bullet:
|
||||||
|
```
|
||||||
|
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
|
||||||
|
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add the ADR-017 Further-reading row**
|
||||||
|
|
||||||
|
Find this exact line:
|
||||||
|
```
|
||||||
|
| Testing methodology | `docs/decisions/008-testing.md` |
|
||||||
|
```
|
||||||
|
Replace with that SAME line followed by a new row:
|
||||||
|
```
|
||||||
|
| Testing methodology | `docs/decisions/008-testing.md` |
|
||||||
|
| Service-UI verification (Level 4) | `docs/decisions/017-service-ui-verification.md` |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add CLAUDE.md
|
||||||
|
git commit -m "CLAUDE.md: VERIFY.md role convention; link ADR-017"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 7: Git-ignore screenshots + create the reviews dir
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `.gitignore`
|
||||||
|
- Create: `docs/testing/reviews/README.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the screenshot working dir to `.gitignore`**
|
||||||
|
|
||||||
|
Find this exact block at the end of `.gitignore`:
|
||||||
|
```
|
||||||
|
# Terraform
|
||||||
|
terraform/**/.terraform/
|
||||||
|
terraform/**/*.tfstate
|
||||||
|
terraform/**/*.tfstate.backup
|
||||||
|
terraform/**/terraform.tfvars
|
||||||
|
# .terraform.lock.hcl is intentionally tracked (pins provider versions)
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
# Terraform
|
||||||
|
terraform/**/.terraform/
|
||||||
|
terraform/**/*.tfstate
|
||||||
|
terraform/**/*.tfstate.backup
|
||||||
|
terraform/**/terraform.tfvars
|
||||||
|
# .terraform.lock.hcl is intentionally tracked (pins provider versions)
|
||||||
|
|
||||||
|
# Service-UI verification screenshots (kept locally on ubongo, not committed — ADR-017)
|
||||||
|
.verify-runs/
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Create the reviews dir README (so the dir exists in git)**
|
||||||
|
|
||||||
|
Create `docs/testing/reviews/README.md` with exactly this content:
|
||||||
|
```markdown
|
||||||
|
# Service-UI verification reports
|
||||||
|
|
||||||
|
Dated reports written by `/verify-service` (ADR-008 Level 4 / ADR-017), one per run:
|
||||||
|
`YYYY-MM-DD-<service>.md`, plus `latest.md`. These markdown reports are committed; the
|
||||||
|
screenshots they reference stay local on `ubongo` in the git-ignored `.verify-runs/`
|
||||||
|
working dir.
|
||||||
|
|
||||||
|
No reports yet — the harness is designed, not yet runnable (see STATUS.md).
|
||||||
|
```
|
||||||
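The naming rule in the README above (`YYYY-MM-DD-<service>.md` plus `latest.md`) can be sketched as a small path helper. This is an illustration only: the function name `report_paths` and the rejection of names containing a slash are assumptions, not something the plan specifies.

```python
from datetime import date
from pathlib import PurePosixPath

REVIEWS_DIR = PurePosixPath("docs/testing/reviews")


def report_paths(service: str, run_date: date) -> tuple[PurePosixPath, PurePosixPath]:
    """Return (dated report path, latest.md path) for one verification run."""
    # Assumed guard: a service name must not be empty or escape the reviews dir.
    if not service or "/" in service:
        raise ValueError(f"not a plain service name: {service!r}")
    dated = REVIEWS_DIR / f"{run_date.isoformat()}-{service}.md"
    return dated, REVIEWS_DIR / "latest.md"


dated, latest = report_paths("photoprism", date(2026, 6, 6))
print(dated)
print(latest)
```

Because `latest.md` is overwritten on every run, only the dated file preserves history; the helper returns both so the caller writes them together.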
|
|
||||||
|
- [ ] **Step 3: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files .gitignore docs/testing/reviews/README.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add .gitignore docs/testing/reviews/README.md
|
||||||
|
git commit -m "Git-ignore verify screenshots; add testing/reviews dir"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 8: Add the Level 4 row to STATUS.md
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `STATUS.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add a row to the "Designed but not built" table**
|
||||||
|
|
||||||
|
Find this exact line:
|
||||||
|
```
|
||||||
|
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
|
||||||
|
```
|
||||||
|
Replace with that SAME line followed by the new row:
|
||||||
|
```
|
||||||
|
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
|
||||||
|
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | `/verify-service` skill + `VERIFY.md` template + standards are authorable and present; *running* deferred on ubongo + `playwright` plugin + Authentik + a staging deploy. |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files STATUS.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add STATUS.md
|
||||||
|
git commit -m "STATUS: record Level 4 service-UI verification (ADR-017)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 9: Mark TODO 2.2/2.3 addressed
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/TODO.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Annotate the Testing items**
|
||||||
|
|
||||||
|
Find this exact block:
|
||||||
|
```
|
||||||
|
2. **Testing**
|
||||||
|
1. Choose and configure code-testing tooling (Molecule, etc.).
|
||||||
|
2. Decide how the AI interprets Molecule output and performs live testing:
|
||||||
|
API calls, curl pulls of web products, log reviews, and headless browsing.
|
||||||
|
3. Define a standard for generating test users and for instructing the user to
|
||||||
|
perform relevant manual tests.
|
||||||
|
```
|
||||||
|
Replace with:
|
||||||
|
```
|
||||||
|
2. **Testing**
|
||||||
|
1. Choose and configure code-testing tooling (Molecule, etc.).
|
||||||
|
2. Decide how the AI interprets Molecule output and performs live testing:
|
||||||
|
API calls, curl pulls of web products, log reviews, and headless browsing.
|
||||||
|
— Headless browsing DECIDED (ADR-017): the `/verify-service` Level 4 harness.
|
||||||
|
The API/curl/log-review siblings remain open.
|
||||||
|
3. ~~Define a standard for generating test users and for instructing the user to
|
||||||
|
perform relevant manual tests.~~ DECIDED (ADR-017): test users in the staging
|
||||||
|
Authentik `test` group; manual tests handed off as a checklist in the
|
||||||
|
`/verify-service` report.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and commit**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --files docs/TODO.md`
|
||||||
|
Expected: Passed/Skipped.
|
||||||
|
```bash
|
||||||
|
git add docs/TODO.md
|
||||||
|
git commit -m "TODO: mark headless-browsing + test-user standard decided (ADR-017)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 10: Final consistency sweep
|
||||||
|
|
||||||
|
**Files:** none modified (verification only)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Confirm ADR-017 is present and cross-linked**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
test -f docs/decisions/017-service-ui-verification.md && echo "ADR-017 present"
grep -rl "ADR-017\|017-service-ui-verification" docs/ CLAUDE.md STATUS.md .claude/ | grep -vE "superpowers/(plans|specs)/"
```
|
||||||
|
Expected: the file exists and the referencing files appear — ADR-008, CLAUDE.md, STATUS.md, the `VERIFY.md` template, the `/verify-service` skill, service-checklist, TODO, the reviews README.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Confirm the new artifacts exist and the Level 4 stub is gone**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
ls docs/testing/service-verify-template.md .claude/commands/verify-service.md docs/testing/reviews/README.md
grep -n "planned, not built" docs/decisions/008-testing.md || echo "Level 4 stub replaced (good)"
grep -n "\.verify-runs/" .gitignore && echo "screenshot dir ignored (good)"
```
|
||||||
|
Expected: all three files listed; the old Level 4 "planned, not built" stub line gone; `.verify-runs/` in `.gitignore`.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Full hook run**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --all-files`
|
||||||
|
Expected: all hooks Passed/Skipped. Fix anything that fails (likely trailing whitespace / end-of-file) and amend the owning commit.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Push (only if the user asks)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git push origin <branch-or-main-after-merge>
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-review notes (author)
|
||||||
|
|
||||||
|
- **Spec coverage:** decision/forks/architecture → Task 1 (ADR-017) + Task 2 (ADR-008); `VERIFY.md` standard → Task 3 (template) + Task 6 (convention) + Task 5 (gate); skill/mechanism/reporting/safety → Task 4 (`/verify-service`); reporting dir + screenshot policy → Task 7; STATUS/TODO reconciliation → Tasks 8–9. ✓
|
||||||
|
- **Buildable-now vs deferred:** every task is authorable without `ubongo`/Authentik/staging; the skill carries an explicit Prerequisites gate so it cannot pretend to run. Deferred items (new-role scaffold, Authentik automation, per-service `VERIFY.md`, plugin install) are recorded in ADR-017/STATUS, not implemented. ✓
|
||||||
|
- **No placeholders:** every create/edit shows exact content; the `<…>` tokens in the template are deliberate (match `service-security-template.md`'s house style). ✓
|
||||||
|
- **Name consistency:** `/verify-service`, `roles/<service>/VERIFY.md`, `docs/testing/service-verify-template.md`, `docs/testing/reviews/`, `.verify-runs/`, and the `test` Authentik group are used identically across all tasks. ✓
|
||||||
|
```
|
||||||
331
docs/superpowers/plans/2026-06-06-firewall-strategy.md
Normal file
|
|
@ -0,0 +1,331 @@
|
||||||
|
# Firewall Strategy (ADR-020) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Land the firewall *strategy* as ADR-020 and fold it into the living docs — no firewall code is built here (the host-nftables and OPNsense-as-code builds are separate follow-up specs).
|
||||||
|
|
||||||
|
**Architecture:** This is a documentation-only change. It creates `docs/decisions/020-firewall.md` from the approved design spec, then updates CLAUDE.md (Further reading + the firewall guardrail), `docs/TODO.md` (mark 3.5 decided), and `docs/CAPABILITIES.md` (point the firewall note at ADR-020). There is no executable code, so verification is consistency greps + `make lint`.
|
||||||
|
|
||||||
|
**Tech Stack:** Markdown docs only. `make lint` (yamllint + ansible-lint + check-tags) must stay green; none of these tools lint Markdown content, but the run confirms nothing else broke.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File structure
|
||||||
|
|
||||||
|
| File | Responsibility | Action |
|------|----------------|--------|
| `docs/decisions/020-firewall.md` | The firewall strategy ADR (two-layer model, shared catalog, deferred builds) | Create |
| `CLAUDE.md` | Add ADR-020 to *Further reading*; harden the firewall guardrail bullet to reference the catalog/ADR-020 | Modify |
| `docs/TODO.md` | Mark item 3.5 DECIDED (ADR-020) | Modify |
| `docs/CAPABILITIES.md` | Point the existing firewall parenthetical at ADR-020 + the two-layer model | Modify |
|
||||||
|
|
||||||
|
Notes for the implementer:
|
||||||
|
- The design spec this ADR is based on is `docs/superpowers/specs/2026-06-06-firewall-strategy-design.md` — read it if you need the full rationale, but the ADR text below is complete and self-contained.
|
||||||
|
- Existing ADRs live in `docs/decisions/` numbered 001–019; this is 020. Match their concise, decision-focused tone (ADR-019 is a good recent reference).
|
||||||
|
- Before any `git commit`, the pre-commit hook runs and decrypts `vault.yml`, so the vault agent must be unlocked: run `rbw unlocked` (exit 0 = good). If locked, ask the user to `rbw unlock` and wait. None of these tasks touch vault files.
|
||||||
|
- Run `make lint` via the repo venv wiring (the Makefile handles paths).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Write ADR-020
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/decisions/020-firewall.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the ADR**
|
||||||
|
|
||||||
|
Create `docs/decisions/020-firewall.md` with exactly this content:
|
||||||
|
|
||||||
|
````markdown
|
||||||
|
# ADR-020 — Firewall strategy: two-layer model with a shared service catalog
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-06). Resolves TODO 3.5 ("Decide the firewall strategy — which
|
||||||
|
firewall, ruleset, per-host vs central").
|
||||||
|
|
||||||
|
**Strategy ADR.** It pins the architecture and each layer's responsibilities; the
|
||||||
|
detailed builds are separate follow-up efforts (see *Scope*).
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
boma needs a firewall strategy that is predictable, declarative, and defends the stated
|
||||||
|
threat model — opportunistic external, lateral movement / blast radius, operator/agent
|
||||||
|
error (ADR-002). The pieces were already committed across other ADRs (`nftables`
|
||||||
|
default-deny on hosts — ADR-002; OPNsense at the perimeter — ADR-007; Docker with
|
||||||
|
`iptables: false` — ADR-004), but nothing tied them together: which layer owns what,
|
||||||
|
where firewall intent is declared, and how the layers stay consistent. Without that,
|
||||||
|
ports drift open ad-hoc and "per-host vs central" stays unanswered.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### Two layers, distinct jobs
|
||||||
|
|
||||||
|
**OPNsense — perimeter + inter-VLAN.** Owns the WAN edge and all policy *between zones*:
|
||||||
|
`lan`/`iot`/`guest` → `srv`, `mgmt` access, and the per-VLAN egress rules (ADR-007). It
|
||||||
|
is **structurally blind to intra-`srv` traffic** — services share the switched `srv`
|
||||||
|
subnet (VLAN 20), which never reaches the gateway.
|
||||||
|
|
||||||
|
**Host nftables — host-local + east-west within `srv`** (in the `base` role, every VM):
|
||||||
|
|
||||||
|
- **Default-deny inbound**; allow loopback + established/related.
|
||||||
|
- **East-west allowlist**: a service host accepts a connection only from declared
|
||||||
|
sources (e.g. the reverse proxy, a named peer) — the lateral-movement control OPNsense
|
||||||
|
cannot provide.
|
||||||
|
- **Permissive egress**: allow outbound + established/related; per-VLAN egress
|
||||||
|
restriction stays at OPNsense (ADR-007). Host-level egress allowlisting is
|
||||||
|
high-friction (every DNS/NTP/update/registry/webhook must be enumerated) for limited
|
||||||
|
added benefit once the VLAN already bounds where a host can go.
|
||||||
|
- **Docker**: daemon runs with `"iptables": false`; nftables owns all filtering,
|
||||||
|
including container traffic (ADR-004).
|
||||||
|
- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird,
|
||||||
|
ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied
|
||||||
|
atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH
|
||||||
|
is allowed only on `wt0`.)
|
||||||
|
|
||||||
|
So "per-host vs central" is answered: **both**, with clear ownership.
|
||||||
|
|
||||||
|
### Single source of truth — a shared service catalog
|
||||||
|
|
||||||
|
A central, declarative **service catalog** in `group_vars/` is the one source of truth
|
||||||
|
for firewall intent (aligning with ADR-002's "port definitions live in `group_vars/`",
|
||||||
|
and keeping connectivity *topology* in inventory rather than in any one self-contained
|
||||||
|
service role — ADR-004). Each entry describes a service's **ingress**:
|
||||||
|
|
||||||
|
```yaml
photoprism:
  ingress:
    - { from: reverse_proxy, port: 2342, proto: tcp }
reverse_proxy:
  ingress:
    - { from: lan, port: 443, proto: tcp }
```
|
||||||
|
|
||||||
|
`from` is **symbolic**, resolved at render time: a host/group → IP(s) from inventory; a
|
||||||
|
role (`reverse_proxy`) → the host(s) filling it; a VLAN/zone (`lan`) → the subnet from
|
||||||
|
the ADR-007 table. This keeps the catalog readable and resilient to IP changes.
|
||||||
|
|
||||||
|
### Each layer renders only its own slice
|
||||||
|
|
||||||
|
| Ingress rule | Host nftables | OPNsense |
|
||||||
|
|---|---|---|
|
||||||
|
| `from: reverse_proxy` (a `srv` peer) | allow proxy IP → port | — (intra-`srv`, invisible) |
|
||||||
|
| `from: lan` (cross-VLAN) | allow `lan` subnet → port | allow `lan` → host:port |
|
||||||
|
|
||||||
|
The dominant pattern falls out naturally: most services are **proxied** — their only
|
||||||
|
ingress is `from: reverse_proxy`, and users reach them through the reverse proxy, which
|
||||||
|
alone carries `from: lan, port: 443` (matches "services sit behind the reverse proxy
|
||||||
|
with authentication", ADR-002).
|
||||||
|
|
||||||
|
This was chosen over a single connectivity model that generates both rule sets (too
much machinery, tight coupling of two very different rule domains) and over fully
independent per-layer declarations (real drift risk).
|
||||||
|
|
||||||
|
### OPNsense automation — owned here, mechanism deferred
|
||||||
|
|
||||||
|
OPNsense is Ansible-managed (CLAUDE.md: "OPNsense is entirely Ansible; no Terraform
|
||||||
|
OPNsense provider"). It renders the cross-VLAN slice of the catalog plus the static
|
||||||
|
ADR-007 facts. The **how** — config-XML templating vs the OPNsense API vs a plugin — is
|
||||||
|
deferred to the OPNsense-as-code follow-up spec. Recorded as an explicit open
|
||||||
|
sub-decision.
|
||||||
|
|
||||||
|
## Guardrails
|
||||||
|
|
||||||
|
- **The catalog is authoritative.** If a port is not in the catalog, it does not exist —
|
||||||
|
hardening the existing rule "never open a firewall port ad-hoc on a host" (ADR-002).
|
||||||
|
- **The `firewall` tag** (ADR-019) marks firewall tasks; `--tags firewall` re-renders
|
||||||
|
rules.
|
||||||
|
- **Drift detection (aspiration).** A deterministic check — in the spirit of
|
||||||
|
`scripts/check-tags.py` — comparing each host's live `nft` ruleset / listening ports
|
||||||
|
against the catalog and flagging anything undeclared. Ties to TODO 8.5
|
||||||
|
(`/security-review`). Not necessarily built first.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Lateral movement within `srv` is constrained — the gap OPNsense structurally can't
|
||||||
|
close.
|
||||||
|
- One declarative catalog → no ad-hoc ports and no cross-layer drift on shared facts
|
||||||
|
(ports, IPs, sources).
|
||||||
|
- Cost: the catalog + render-per-layer machinery must be built and maintained; east-west
|
||||||
|
allowlisting adds per-service ingress declarations (mitigated by proxied-by-default,
|
||||||
|
which keeps most entries to a single line).
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
**Decided here:** the two-layer model and responsibilities; host nftables = default-deny
|
||||||
|
inbound + east-west allowlist + permissive egress + guaranteed management plane + Docker
|
||||||
|
`iptables:false`; the shared `group_vars` catalog as single source of truth with
|
||||||
|
symbolic sources; each layer renders its own slice; the no-ad-hoc-ports guardrail.
|
||||||
|
|
||||||
|
**Deferred to follow-up specs (each its own brainstorm → plan):**
|
||||||
|
|
||||||
|
1. **Host nftables implementation** in `base` — catalog schema, nftables template,
|
||||||
|
Docker `iptables:false` integration, fail-safe ordering, Molecule tests. The natural
|
||||||
|
next spec.
|
||||||
|
2. **OPNsense-as-code** — tooling mechanism + cross-VLAN rule rendering.
|
||||||
|
3. **Drift-detection check** — if/when built.
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-002 (security baseline: nftables default-deny, fail2ban, blast radius),
|
||||||
|
ADR-004 (Docker model: `iptables:false`), ADR-007 (network topology, VLANs, OPNsense,
|
||||||
|
per-VLAN egress), ADR-016 (NetBird mesh: SSH on `wt0` only), ADR-019 (`firewall` tag).
|
||||||
|
````
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify the file is well-formed**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
test -f docs/decisions/020-firewall.md && grep -c "^## " docs/decisions/020-firewall.md
|
||||||
|
```
|
||||||
|
Expected: exit 0 and a printed count of `7` (the H2 sections: Status, Context, Decision, Guardrails, Consequences, Scope, Related — H3 subsections under Decision are not counted by `^## `).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/020-firewall.md
|
||||||
|
git commit -m "docs(adr): ADR-020 firewall strategy (two-layer + shared catalog)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Wire ADR-020 into CLAUDE.md
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `CLAUDE.md` (Further reading table; firewall guardrail bullet)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add ADR-020 to the Further reading table**
|
||||||
|
|
||||||
|
In `CLAUDE.md`, find this row (around line 225):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
|
||||||
|
```
|
||||||
|
|
||||||
|
Add this row immediately after it:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
| Firewall strategy | `docs/decisions/020-firewall.md` |
|
||||||
|
```
|
||||||
|
|
||||||
|
(Exact column padding need not match perfectly — just produce a valid Markdown table row consistent with the surrounding rows.)
|
||||||
|
|
||||||
|
- [ ] **Step 2: Harden the firewall guardrail bullet**
|
||||||
|
|
||||||
|
In `CLAUDE.md`, find this bullet (around line 172, under "What Claude must not do without explicit instruction"):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- Open a firewall port anywhere but the `group_vars` firewall definitions — never ad-hoc on a host (ADR-002)
|
||||||
|
```
|
||||||
|
|
||||||
|
Replace it with:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- Open a firewall port anywhere but the `group_vars` service catalog — never ad-hoc on a host. If it's not in the catalog, it doesn't exist (ADR-002, ADR-020)
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify both edits**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
grep -n "020-firewall" CLAUDE.md && grep -n "service catalog" CLAUDE.md
|
||||||
|
```
|
||||||
|
Expected: the Further reading row matches `020-firewall`, and the guardrail bullet now contains "service catalog".
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add CLAUDE.md
|
||||||
|
git commit -m "docs: link ADR-020; harden firewall guardrail to the service catalog"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Mark TODO 3.5 decided
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/TODO.md` (item 3.5)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Strike through and annotate item 3.5**
|
||||||
|
|
||||||
|
In `docs/TODO.md`, find this line (around line 26):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
|
||||||
|
```
|
||||||
|
|
||||||
|
Replace it with:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
5. ~~Decide the firewall strategy (which firewall, ruleset, per-host vs central).~~
|
||||||
|
DECIDED (ADR-020): two layers — OPNsense (perimeter + inter-VLAN) + host nftables
|
||||||
|
(default-deny inbound + east-west allowlist, permissive egress). Single source of
|
||||||
|
truth: a `group_vars` service catalog with symbolic sources; each layer renders
|
||||||
|
its own slice. Builds deferred to follow-up specs (host nftables in `base`, then
|
||||||
|
OPNsense-as-code).
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify**
|
||||||
|
|
||||||
|
Run: `grep -n "DECIDED (ADR-020)" docs/TODO.md`
|
||||||
|
Expected: one match on the item 3.5 annotation.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/TODO.md
|
||||||
|
git commit -m "docs(todo): mark 3.5 firewall strategy decided (ADR-020)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Update CAPABILITIES.md firewall note
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/CAPABILITIES.md` (the firewall parenthetical in §1 Edge & networking, around line 32)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Point the firewall note at ADR-020**
|
||||||
|
|
||||||
|
In `docs/CAPABILITIES.md`, find this line (around line 32, just under the §1 table):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
_(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_
|
||||||
|
```
|
||||||
|
|
||||||
|
Replace it with:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
_(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_
|
||||||
|
|
||||||
|
_Firewalling is two-layer (ADR-020): OPNsense at the perimeter + inter-VLAN, plus
|
||||||
|
per-host `nftables` (default-deny inbound + east-west allowlist) rendered by the `base`
|
||||||
|
role from a shared `group_vars` service catalog. Both layers are still to be built._
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify and run the full lint suite**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
grep -n "ADR-020" docs/CAPABILITIES.md && make lint
|
||||||
|
```
|
||||||
|
Expected: the new ADR-020 note is found, and `make lint` passes (yamllint clean, ansible-lint clean, `check-tags: OK`).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/CAPABILITIES.md
|
||||||
|
git commit -m "docs(capabilities): note two-layer firewall model (ADR-020)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Final verification
|
||||||
|
|
||||||
|
- [ ] Confirm cross-references resolve:
|
||||||
|
```bash
|
||||||
|
ls docs/decisions/020-firewall.md && grep -rl "ADR-020\|020-firewall" CLAUDE.md docs/TODO.md docs/CAPABILITIES.md
|
||||||
|
```
|
||||||
|
Expected: the ADR file exists and all three living docs reference it.
|
||||||
|
- [ ] `make lint` passes end to end.
|
||||||
|
- [ ] `git log --oneline -4` shows the four task commits.
|
||||||
|
- [ ] Sanity: the ADR's *Scope* section names the two deferred build specs (host nftables in `base`, OPNsense-as-code) so the next brainstorm has an obvious starting point.
|
||||||
712
docs/superpowers/plans/2026-06-06-host-nftables-firewall.md
Normal file
|
|
@ -0,0 +1,712 @@
|
||||||
|
# Host nftables Firewall (`base` firewall concern) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Build the `firewall`-tagged concern of the `base` role — default-deny nftables rendered from a shared `group_vars` service catalog, applied with an auto-rollback safety net.
|
||||||
|
|
||||||
|
**Architecture:** A pure Python filter plugin resolves the global `firewall_catalog`/`firewall_zones` into a flat per-host rule list; a Jinja template renders `/etc/nftables.conf` (validated at render time with `nft -c`); tasks apply it safely (snapshot → armed `systemd-run` revert → apply → confirm/disarm → persist). Molecule renders + syntax-checks only (never applies — it shares the host kernel); the resolver is unit-tested with pytest; real enforcement is a Level-2 staging concern.
|
||||||
|
|
||||||
|
**Tech Stack:** Ansible (`ansible.builtin` only — no new collections), nftables, Python 3 filter plugin + pytest, Molecule (Docker driver), systemd (`systemd-run` transient timer).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File structure
|
||||||
|
|
||||||
|
| File | Responsibility | Action |
|
||||||
|
|------|----------------|--------|
|
||||||
|
| `roles/base/` (scaffold) | the base role skeleton | Create via `make new-role` |
|
||||||
|
| `roles/base/meta/main.yml` | role metadata (galaxy_info) | Fill |
|
||||||
|
| `roles/base/defaults/main.yml` | `base__firewall_*` behaviour knobs | Create |
|
||||||
|
| `inventories/{staging,production}/group_vars/all/firewall.yml` | shared `firewall_zones` + `firewall_catalog` | Create |
|
||||||
|
| `roles/base/filter_plugins/firewall_rules.py` | pure catalog→rules resolver | Create |
|
||||||
|
| `tests/test_firewall_rules.py` | pytest units for the resolver | Create |
|
||||||
|
| `roles/base/templates/nftables.conf.j2` | the ruleset | Create |
|
||||||
|
| `roles/base/tasks/main.yml` | include `firewall.yml` (tagged) | Replace scaffold |
|
||||||
|
| `roles/base/tasks/firewall.yml` | install + render + safe-apply | Create |
|
||||||
|
| `roles/base/molecule/default/molecule.yml` | fixture `ansible_host` | Adjust scaffold |
|
||||||
|
| `roles/base/molecule/default/converge.yml` | fixture catalog/zones + `apply:false` | Replace scaffold |
|
||||||
|
| `roles/base/molecule/default/verify.yml` | assert rendered rules + `nft -c` | Replace scaffold |
|
||||||
|
| `roles/base/README.md` | document the firewall concern | Fill |
|
||||||
|
| `STATUS.md`, `docs/CAPABILITIES.md` | reflect the build | Modify |
|
||||||
|
|
||||||
|
Notes for the implementer:
|
||||||
|
- Run Ansible/Python via the repo venv (`.venv/bin/...`); the Makefile wires paths. Molecule: `make test ROLE=base`.
|
||||||
|
- The Molecule platform pulls `forgejo.nyumbani.baobab.band/sjat/molecule-debian13:latest`. If the registry/image is unreachable in your environment, `make test` can't run — report DONE_WITH_CONCERNS for that step; the pytest units (Task 3) still fully validate the resolver logic, which is the only non-trivial code.
|
||||||
|
- Before any `git commit`, the pre-commit hook decrypts `vault.yml`, so the vault agent must be unlocked: run `rbw unlocked` (exit 0 = good); if locked, ask the user to `rbw unlock`. None of these tasks touch vault files.
|
||||||
|
- `make lint` must stay green (yamllint + ansible-lint over the new role + `check-tags`). Use FQCN, a tag on every task, string `mode:`, and `changed_when:` on every `command`/`shell`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Scaffold the `base` role
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `roles/base/` (via `make new-role`)
|
||||||
|
- Fill: `roles/base/meta/main.yml`, `roles/base/README.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Scaffold**
|
||||||
|
|
||||||
|
Run: `make new-role NAME=base`
|
||||||
|
Expected: prints "Role base scaffolded at roles/base/". Creates `roles/base/{tasks,handlers,defaults,templates,files,meta,molecule/default}` and a scaffold `tasks/main.yml` (`---`), `molecule/default/{molecule.yml,converge.yml,verify.yml}`, `README.md`.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Fill `roles/base/meta/main.yml`**
|
||||||
|
|
||||||
|
Replace the scaffold `---` with:
|
||||||
|
|
||||||
|
```yaml
---
galaxy_info:
  author: sjat
  description: Hardened baseline configuration for all boma hosts (Debian 13).
  license: MIT
  min_ansible_version: "2.17"
  platforms:
    - name: Debian
      versions:
        - trixie
dependencies: []
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Write `roles/base/README.md`**
|
||||||
|
|
||||||
|
Replace the scaffold content with:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# base
|
||||||
|
|
||||||
|
Hardened baseline applied to every boma host. Built incrementally; the first concern
|
||||||
|
implemented is the **host firewall** (`firewall` tag).
|
||||||
|
|
||||||
|
## Firewall (nftables)
|
||||||
|
|
||||||
|
Default-deny inbound + east-west allowlisting + permissive egress, per ADR-020. Rules
|
||||||
|
are rendered from the shared `firewall_catalog` / `firewall_zones` (in `group_vars/all`)
|
||||||
|
by the `resolve_firewall_rules` filter, written to `/etc/nftables.conf`, syntax-checked
|
||||||
|
with `nft -c` at render time, and applied with an **auto-rollback safety net**
|
||||||
|
(`systemd-run` arms a revert that a follow-up task cancels once connectivity is
|
||||||
|
confirmed). The apply sequence lives in tasks rather than a handler so the confirm/cancel
|
||||||
|
step is controllable.
|
||||||
|
|
||||||
|
`/etc/nftables.d/*.nft` is `include`d by the ruleset — the extension hook the
|
||||||
|
`docker_host` role uses for container forward/NAT rules.
|
||||||
|
|
||||||
|
### Variables
|
||||||
|
See `defaults/main.yml` (`base__firewall_*`). SSH is accepted only on
|
||||||
|
`base__firewall_mgmt_interface` (default `wt0`, the NetBird overlay — ADR-016); set it to
|
||||||
|
a reachable interface/source until NetBird is built. Set `base__firewall_apply: false` to
|
||||||
|
render + validate without applying (used by Molecule).
|
||||||
|
|
||||||
|
### Testing
|
||||||
|
- `tests/test_firewall_rules.py` — pytest units for the resolver.
|
||||||
|
- `make test ROLE=base` — Molecule renders + `nft -c` syntax-checks (never applies; it
|
||||||
|
shares the host kernel). Enforcement + the apply/rollback path are verified at ADR-008
|
||||||
|
Level 2 on staging VMs.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Verify scaffold + lint**
|
||||||
|
|
||||||
|
Run: `test -d roles/base/molecule/default && .venv/bin/ansible-lint roles/base`
|
||||||
|
Expected: directory exists; ansible-lint passes (the scaffold `tasks/main.yml` is empty `---`, meta is now filled).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add roles/base
|
||||||
|
git commit -m "feat(base): scaffold role + meta/README (firewall concern incoming)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Shared catalog/zones + role defaults
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `inventories/staging/group_vars/all/firewall.yml`
|
||||||
|
- Create: `inventories/production/group_vars/all/firewall.yml`
|
||||||
|
- Create: `roles/base/defaults/main.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the shared firewall data (both envs)**
|
||||||
|
|
||||||
|
Write this identical content to **both** `inventories/staging/group_vars/all/firewall.yml`
|
||||||
|
**and** `inventories/production/group_vars/all/firewall.yml`:
|
||||||
|
|
||||||
|
```yaml
---
# Shared firewall topology — single source of truth for the host nftables layer
# (base role) and OPNsense (future). See docs/decisions/020-firewall.md.

# Zone → subnet (from ADR-007).
firewall_zones:
  mgmt: 10.10.0.0/24
  srv: 10.20.0.0/24
  lan: 10.30.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24

# Service catalog: <name> → placement (host | group | hosts) + ingress[].
# Empty until services are built; hosts still get default-deny + the management plane.
firewall_catalog: {}
```
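The catalog ships empty, so the entry shape is easy to lose sight of. As a rough illustration only, the Python sketch below shows what two populated entries might look like once parsed from YAML, together with a small shape check. The service names and ports are borrowed from this plan's test fixtures; `check_entry` is a hypothetical helper, not part of the role, and restricting `proto` to `tcp`/`udp` is an assumption of the sketch.

```python
PLACEMENT_KEYS = ("host", "group", "hosts")

# Hypothetical catalog content, as it would look after YAML parsing.
example_catalog = {
    "reverse_proxy": {
        "host": "docker01",
        "ingress": [{"from": "lan", "port": 443, "proto": "tcp"}],
    },
    "photoprism": {
        "host": "docker01",
        "ingress": [{"from": "reverse_proxy", "port": 2342, "proto": "tcp"}],
    },
}


def check_entry(name, entry):
    """Return a list of problems found in one catalog entry (empty list = looks fine)."""
    problems = []
    # An entry is placed by exactly one of host / group / hosts.
    placements = [k for k in PLACEMENT_KEYS if k in entry]
    if len(placements) != 1:
        problems.append(f"{name}: expected exactly one of {PLACEMENT_KEYS}, got {placements}")
    for i, ing in enumerate(entry.get("ingress", [])):
        for key in ("from", "port"):
            if key not in ing:
                problems.append(f"{name}: ingress[{i}] is missing '{key}'")
        # Assumption of this sketch: only tcp and udp are expected.
        if ing.get("proto", "tcp") not in ("tcp", "udp"):
            problems.append(f"{name}: ingress[{i}] has unsupported proto {ing.get('proto')!r}")
    return problems


for service, spec in example_catalog.items():
    print(service, check_entry(service, spec))
```

A check like this could run before the resolver so that a malformed entry is reported by service name rather than surfacing later as a template error.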
|
||||||
|
|
||||||
|
- [ ] **Step 2: Create `roles/base/defaults/main.yml`**
|
||||||
|
|
||||||
|
Replace the scaffold `---` with:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
---
|
||||||
|
# Host firewall (nftables) behaviour knobs. Shared topology (firewall_catalog/
|
||||||
|
# firewall_zones) lives in group_vars/all, not here. See docs/decisions/020-firewall.md.
|
||||||
|
base__firewall_mgmt_interface: wt0 # SSH accepted only on this iface (NetBird, ADR-016)
|
||||||
|
base__firewall_ssh_port: 22
|
||||||
|
base__firewall_rollback_timeout: 45 # seconds before the auto-revert fires on a bad apply
|
||||||
|
base__firewall_dropin_dir: /etc/nftables.d
|
||||||
|
base__firewall_apply: true # set false to render+validate without applying (CI/Molecule)
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify + lint**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -c "import yaml; [print(sorted(yaml.safe_load(open(p))['firewall_zones'])) for p in ['inventories/staging/group_vars/all/firewall.yml','inventories/production/group_vars/all/firewall.yml']]" && make lint`
|
||||||
|
Expected: prints the sorted zone list twice (`['guest', 'iot', 'lan', 'mgmt', 'srv']`); `make lint` passes.
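Step 1 asks for identical content in both environments, while the command above only compares zone names. If a stricter check is wanted, a byte-for-byte comparison is one option. This is a sketch with a hypothetical helper name, not something the repo provides.

```python
import filecmp


def same_content(path_a, path_b):
    """True when both files hold identical bytes.

    shallow=False forces a content comparison instead of trusting
    the os.stat() signature (size and mtime).
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Called with the staging and production `firewall.yml` paths, it returns `False` as soon as the two files drift apart, including in comments or whitespace.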
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add inventories/staging/group_vars/all/firewall.yml inventories/production/group_vars/all/firewall.yml roles/base/defaults/main.yml
|
||||||
|
git commit -m "feat(base): shared firewall catalog/zones + firewall defaults"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: The resolver filter plugin (TDD)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `roles/base/filter_plugins/firewall_rules.py`
|
||||||
|
- Test: `tests/test_firewall_rules.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing tests**
|
||||||
|
|
||||||
|
Create `tests/test_firewall_rules.py`:
|
||||||
|
|
||||||
|
```python
import importlib.util
import pathlib

import pytest

_PATH = (
    pathlib.Path(__file__).resolve().parent.parent
    / "roles" / "base" / "filter_plugins" / "firewall_rules.py"
)
_spec = importlib.util.spec_from_file_location("firewall_rules", _PATH)
fr = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(fr)

ZONES = {"lan": "10.30.0.0/24", "srv": "10.20.0.0/24"}
HOSTVARS = {
    "docker01": {"ansible_host": "10.20.0.50"},
    "docker02": {"ansible_host": "10.20.0.51"},
}
GROUPS = {"docker_hosts": ["docker01", "docker02"]}


def test_zone_source():
    cat = {"reverse_proxy": {"host": "docker01",
                             "ingress": [{"from": "lan", "port": 443, "proto": "tcp"}]}}
    out = fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)
    assert out == [{"proto": "tcp", "port": 443, "sources": ["10.30.0.0/24"]}]


def test_service_source_resolves_to_host_ip():
    cat = {
        "reverse_proxy": {"host": "docker01", "ingress": []},
        "photoprism": {"host": "docker01",
                       "ingress": [{"from": "reverse_proxy", "port": 2342, "proto": "tcp"}]},
    }
    out = fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)
    assert out == [{"proto": "tcp", "port": 2342, "sources": ["10.20.0.50/32"]}]


def test_group_placement_and_source_multi_host():
    cat = {"dns": {"group": "docker_hosts",
                   "ingress": [{"from": "docker_hosts", "port": 53, "proto": "udp"}]}}
    out = fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)
    assert out == [{"proto": "udp", "port": 53,
                    "sources": ["10.20.0.50/32", "10.20.0.51/32"]}]


def test_host_with_no_services_returns_empty():
    cat = {"photoprism": {"host": "docker02",
                          "ingress": [{"from": "lan", "port": 2342, "proto": "tcp"}]}}
    assert fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS) == []


def test_unresolvable_from_raises():
    cat = {"x": {"host": "docker01",
                 "ingress": [{"from": "nope", "port": 80, "proto": "tcp"}]}}
    with pytest.raises(ValueError):
        fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)


def test_duplicate_rules_deduped():
    cat = {"app": {"host": "docker01", "ingress": [
        {"from": "lan", "port": 8080, "proto": "tcp"},
        {"from": "lan", "port": 8080, "proto": "tcp"},
    ]}}
    out = fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)
    assert out == [{"proto": "tcp", "port": 8080, "sources": ["10.30.0.0/24"]}]


def test_missing_ansible_host_raises():
    cat = {"x": {"host": "docker01",
                 "ingress": [{"from": "docker02", "port": 80, "proto": "tcp"}]}}
    with pytest.raises(ValueError):
        fr.resolve_firewall_rules(cat, ZONES, "docker01", {"docker01": {}, "docker02": {}}, GROUPS)
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run tests to verify they fail**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_firewall_rules.py -v`
|
||||||
|
Expected: FAIL — `FileNotFoundError` / import error (the module doesn't exist yet).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Write the filter plugin**
|
||||||
|
|
||||||
|
Create `roles/base/filter_plugins/firewall_rules.py`:
|
||||||
|
|
||||||
|
```python
"""Resolve the shared firewall catalog into concrete nftables ingress rules for one host.

Used by the base role's nftables template (ADR-020 / host-nftables design). Pure
functions — unit-tested in tests/test_firewall_rules.py.
"""


def _placement_hosts(entry, groups):
    """Hostnames a catalog entry is placed on (exactly one of host/group/hosts)."""
    if "host" in entry:
        return [entry["host"]]
    if "group" in entry:
        return list(groups.get(entry["group"], []))
    if "hosts" in entry:
        return list(entry["hosts"])
    raise ValueError(f"catalog entry has no placement (host/group/hosts): {entry!r}")


def _host_cidr(host, hostvars):
    hv = hostvars.get(host) or {}
    ip = hv.get("ansible_host")
    if not ip:
        raise ValueError(f"no ansible_host for '{host}' — cannot resolve firewall source")
    return f"{ip}/32"


def _resolve_source(frm, catalog, zones, hostvars, groups):
    """Resolve a symbolic `from` to a sorted list of source CIDRs."""
    if frm in zones:
        return [zones[frm]]
    if frm in catalog:
        return sorted(_host_cidr(h, hostvars)
                      for h in _placement_hosts(catalog[frm], groups))
    if frm in groups:
        return sorted(_host_cidr(h, hostvars) for h in groups[frm])
    if frm in hostvars:
        return [_host_cidr(frm, hostvars)]
    raise ValueError(f"unresolvable firewall source '{frm}'")


def resolve_firewall_rules(catalog, zones, inventory_hostname, hostvars, groups):
    """Return sorted, de-duped [{proto, port, sources:[cidr,...]}] for services on this host."""
    catalog = catalog or {}
    zones = zones or {}
    groups = groups or {}

    rules = []
    for _name, entry in sorted(catalog.items()):
        if inventory_hostname not in _placement_hosts(entry, groups):
            continue
        for ing in entry.get("ingress", []):
            rules.append({
                "proto": ing.get("proto", "tcp"),
                "port": int(ing["port"]),
                "sources": _resolve_source(ing["from"], catalog, zones, hostvars, groups),
            })

    seen = set()
    out = []
    for r in sorted(rules, key=lambda x: (x["port"], x["proto"], x["sources"])):
        key = (r["proto"], r["port"], tuple(r["sources"]))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


class FilterModule:
    """Ansible filter plugin entry point."""

    def filters(self):
        return {"resolve_firewall_rules": resolve_firewall_rules}
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run tests to verify they pass**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_firewall_rules.py -v`
|
||||||
|
Expected: PASS (all 7 tests).
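One behaviour worth keeping in mind when naming things: the resolver tries zones first, then catalog services, then inventory groups, then hosts, so a name that exists in two namespaces resolves in the earlier one. The standalone sketch below mirrors only that lookup order and does not produce CIDRs. `classify_source` and the sample data are illustrative assumptions, not part of the plugin.

```python
def classify_source(frm, catalog, zones, hostvars, groups):
    """Return the namespace a symbolic `from` would resolve in, using the resolver's order."""
    for kind, namespace in (
        ("zone", zones),
        ("service", catalog),
        ("group", groups),
        ("host", hostvars),
    ):
        if frm in namespace:
            return kind
    raise ValueError(f"unresolvable firewall source '{frm}'")


# Hypothetical data: "dns" is deliberately both a service and an inventory group.
zones = {"lan": "10.30.0.0/24"}
catalog = {"dns": {"host": "docker01", "ingress": []}}
groups = {"dns": ["docker01", "docker02"], "docker_hosts": ["docker01"]}
hostvars = {"docker01": {}, "docker02": {}}

# The service wins over the group because the catalog is checked first.
print(classify_source("dns", catalog, zones, hostvars, groups))
print(classify_source("docker_hosts", catalog, zones, hostvars, groups))
```

If such collisions are a concern, keeping zone, service, group and host names disjoint avoids relying on the lookup order at all.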
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add roles/base/filter_plugins/firewall_rules.py tests/test_firewall_rules.py
|
||||||
|
git commit -m "feat(base): firewall catalog resolver filter plugin + tests"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Template + render tasks + Molecule fixtures
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `roles/base/templates/nftables.conf.j2`
|
||||||
|
- Create: `roles/base/tasks/firewall.yml`
|
||||||
|
- Replace: `roles/base/tasks/main.yml`
|
||||||
|
- Adjust: `roles/base/molecule/default/molecule.yml`
|
||||||
|
- Replace: `roles/base/molecule/default/converge.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the template**
|
||||||
|
|
||||||
|
Create `roles/base/templates/nftables.conf.j2`:
|
||||||
|
|
||||||
|
```jinja
#!/usr/sbin/nft -f
# Ansible managed — do not edit by hand. Source: roles/base (ADR-020).
flush ruleset

table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    iif "lo" accept
    ct state established,related accept
    ct state invalid drop
    # iifname matches by name, so the ruleset still loads (and passes `nft -c`)
    # when the interface does not exist yet; iif needs an existing interface.
    iifname "{{ base__firewall_mgmt_interface }}" tcp dport {{ base__firewall_ssh_port }} accept
    ip protocol icmp accept
    ip6 nexthdr ipv6-icmp accept
{% for r in base__firewall_resolved %}
    ip saddr { {{ r.sources | join(', ') }} } {{ r.proto }} dport {{ r.port }} accept
{% endfor %}
  }
  chain forward { type filter hook forward priority 0; policy drop; }
  chain output { type filter hook output priority 0; policy accept; }
}

include "{{ base__firewall_dropin_dir }}/*.nft"
```
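To make the loop body concrete, here is a small Python sketch of the line each resolved rule would turn into. It only imitates the string formatting of the template's `for` loop; `render_rule` is a hypothetical helper for illustration, and Jinja still does the real rendering.

```python
def render_rule(rule):
    """Format one resolved rule dict the way the template's loop body does."""
    sources = ", ".join(rule["sources"])
    # Doubled braces in an f-string emit literal "{" and "}" for the nft set.
    return f"ip saddr {{ {sources} }} {rule['proto']} dport {rule['port']} accept"


# Sample resolver output, shaped like the values asserted in the unit tests.
resolved = [
    {"proto": "tcp", "port": 443, "sources": ["10.30.0.0/24"]},
    {"proto": "udp", "port": 53, "sources": ["10.20.0.50/32", "10.20.0.51/32"]},
]
for rule in resolved:
    print(render_rule(rule))
```

Because the line uses `ip saddr`, it matches IPv4 sources only.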
|
||||||
|
|
||||||
|
- [ ] **Step 2: Create `roles/base/tasks/firewall.yml`** (render path only; apply added in Task 5)
|
||||||
|
|
||||||
|
```yaml
---
- name: Install nftables
  ansible.builtin.apt:
    name: nftables
    state: present
  tags: [firewall]

- name: Ensure nftables drop-in dir exists
  ansible.builtin.file:
    path: "{{ base__firewall_dropin_dir }}"
    state: directory
    mode: "0755"
  tags: [firewall]

- name: Resolve firewall ingress rules for this host
  ansible.builtin.set_fact:
    base__firewall_resolved: >-
      {{ firewall_catalog | default({})
         | resolve_firewall_rules(firewall_zones | default({}),
                                  inventory_hostname, hostvars, groups) }}
  tags: [firewall]

- name: Render nftables ruleset (syntax-checked before install)
  ansible.builtin.template:
    src: nftables.conf.j2
    dest: /etc/nftables.conf
    mode: "0644"
    validate: "nft -c -f %s"
  register: base__firewall_render
  tags: [firewall]
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Replace `roles/base/tasks/main.yml`**
|
||||||
|
|
||||||
|
```yaml
---
- name: Configure host firewall (nftables)
  ansible.builtin.include_tasks: firewall.yml
  tags: [firewall]
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Add a fixture IP in `roles/base/molecule/default/molecule.yml`**
|
||||||
|
|
||||||
|
In the `provisioner.inventory.host_vars.instance` map (which already sets
|
||||||
|
`ansible_user: root`), add `ansible_host: 10.20.0.50` so the resolver can map the
|
||||||
|
instance to an IP. The block becomes:
|
||||||
|
|
||||||
|
```yaml
provisioner:
  name: ansible
  inventory:
    host_vars:
      instance:
        ansible_user: root
        ansible_host: 10.20.0.50
```
|
||||||
|
|
||||||
|
(The Molecule Docker connection addresses the container by name, not `ansible_host`, so
|
||||||
|
this is data-only and won't affect connectivity.)
|
||||||
|
|
||||||
|
- [ ] **Step 5: Replace `roles/base/molecule/default/converge.yml`** with a fixture catalog and `apply: false`
|
||||||
|
|
||||||
|
```yaml
---
- name: Converge
  hosts: all
  become: true
  gather_facts: true
  vars:
    base__firewall_apply: false
    firewall_zones:
      lan: 10.30.0.0/24
      srv: 10.20.0.0/24
      mgmt: 10.10.0.0/24
    firewall_catalog:
      reverse_proxy:
        host: instance
        ingress:
          - { from: lan, port: 443, proto: tcp }
      photoprism:
        host: instance
        ingress:
          - { from: reverse_proxy, port: 2342, proto: tcp }
  roles:
    - role: base
```
|
||||||
|
|
||||||
|
- [ ] **Step 6: Run Molecule (scaffold verify still trivially passes) + lint**
|
||||||
|
|
||||||
|
Run: `make lint && make test ROLE=base`
|
||||||
|
Expected: `make lint` passes. Molecule creates the container, converges (installs nftables, renders `/etc/nftables.conf`, and the `nft -c` `validate` succeeds), passes the idempotence run (second converge reports no changes), runs the scaffold `verify.yml` (asserts `true`), and destroys. If the registry image is unreachable, report DONE_WITH_CONCERNS and confirm `make lint` + Task 3 pytest still pass.
|
||||||
|
|
||||||
|
- [ ] **Step 7: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add roles/base/templates/nftables.conf.j2 roles/base/tasks/firewall.yml roles/base/tasks/main.yml roles/base/molecule/default/molecule.yml roles/base/molecule/default/converge.yml
|
||||||
|
git commit -m "feat(base): render nftables ruleset from catalog (+ molecule fixture)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Safe apply with auto-rollback
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `roles/base/tasks/firewall.yml` (append the apply block)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Append the safe-apply block to `roles/base/tasks/firewall.yml`**
|
||||||
|
|
||||||
|
Add at the end of the file:
|
||||||
|
|
||||||
|
```yaml
- name: Apply firewall ruleset safely (with auto-rollback)
  when:
    - base__firewall_apply | bool
    - base__firewall_render is changed
  tags: [firewall]
  block:
    # `nft -f` adds to the live ruleset, so the snapshot starts with
    # `flush ruleset` to make the revert replace the rules instead.
    - name: Snapshot the current ruleset as the rollback point
      ansible.builtin.shell: "{ echo 'flush ruleset'; nft list ruleset; } > /etc/nftables.rollback"
      changed_when: false

    - name: Clear any stale rollback unit
      ansible.builtin.shell: >-
        systemctl stop nft-rollback.timer nft-rollback.service 2>/dev/null;
        systemctl reset-failed nft-rollback.timer nft-rollback.service 2>/dev/null;
        true
      changed_when: false

    - name: Arm the auto-rollback timer
      ansible.builtin.command:
        cmd: >-
          systemd-run --on-active={{ base__firewall_rollback_timeout }}
          --unit=nft-rollback /usr/sbin/nft -f /etc/nftables.rollback
      changed_when: true

    - name: Apply the new ruleset
      ansible.builtin.command: nft -f /etc/nftables.conf
      changed_when: true

    - name: Confirm connectivity survived, then disarm the rollback
      ansible.builtin.shell: >-
        systemctl stop nft-rollback.timer nft-rollback.service 2>/dev/null;
        systemctl reset-failed nft-rollback.timer nft-rollback.service 2>/dev/null;
        true
      changed_when: false

- name: Enable nftables.service so the ruleset persists across reboot
  ansible.builtin.systemd:
    name: nftables
    enabled: true
  when: base__firewall_apply | bool
  tags: [firewall]
```
|
||||||
|
|
||||||
|
(The "Confirm" step runs only if the play reached it — i.e. the apply did not sever the
connection. If the apply locked the host out, the play cannot continue, the armed timer
fires after `base__firewall_rollback_timeout` seconds, and the host self-heals to the
snapshot. Molecule sets `base__firewall_apply: false`, so this block is skipped there.)
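
The arm, apply, disarm sequence described above can also be walked through by hand. The following is a minimal sketch, not the role's implementation: the `run` and `safe_apply` helpers, the `DRY_RUN` switch, and the 90-second default for `TIMEOUT` are assumptions made for illustration, while the commands mirror the tasks in the block. The snapshot step prepends `flush ruleset` so that reloading the snapshot replaces the live rules rather than merging into them. With `DRY_RUN=1` (the default here) it only prints what it would execute.

```bash
#!/usr/bin/env bash
# Hand-run sketch of arm -> apply -> disarm. Illustrative only.
# Assumed names (not from the role): run(), safe_apply(), DRY_RUN, TIMEOUT.

TIMEOUT="${TIMEOUT:-90}"   # stands in for base__firewall_rollback_timeout
DRY_RUN="${DRY_RUN:-1}"    # 1 = print the commands instead of running them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

safe_apply() {
  # 1. Snapshot the live ruleset as the rollback point.
  run sh -c "(echo 'flush ruleset'; nft list ruleset) > /etc/nftables.rollback"
  # 2. Arm a transient timer that reloads the snapshot unless it is stopped.
  run systemd-run --on-active="$TIMEOUT" --unit=nft-rollback \
    /usr/sbin/nft -f /etc/nftables.rollback
  # 3. Apply the new ruleset. If this cuts the session, step 4 never runs.
  run nft -f /etc/nftables.conf
  # 4. Still connected: disarm the timer so the new ruleset stays.
  run systemctl stop nft-rollback.timer
}

safe_apply
```

While the timer is armed, `systemctl list-timers nft-rollback.timer` should show when the rollback would fire.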

- [ ] **Step 2: Re-run Molecule + lint (apply still skipped, must stay idempotent)**

Run: `make lint && make test ROLE=base`
Expected: `make lint` passes (no `no-changed-when`/FQCN findings — every command/shell has `changed_when`). Molecule still green and idempotent (the apply block is gated off by `base__firewall_apply: false`). DONE_WITH_CONCERNS if the image is unreachable.

- [ ] **Step 3: Commit**

```bash
git add roles/base/tasks/firewall.yml
git commit -m "feat(base): safe nftables apply with systemd-run auto-rollback"
```

---

### Task 6: Molecule verify — assert rendered rules + syntax

**Files:**
- Replace: `roles/base/molecule/default/verify.yml`

- [ ] **Step 1: Replace `roles/base/molecule/default/verify.yml`**

```yaml
---
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Read the rendered ruleset
      ansible.builtin.slurp:
        src: /etc/nftables.conf
      register: ruleset

    - name: Decode it
      ansible.builtin.set_fact:
        nft: "{{ ruleset.content | b64decode }}"

    - name: Assert default-deny input policy and management plane
      ansible.builtin.assert:
        that:
          - "'type filter hook input priority 0; policy drop;' in nft"
          - "'ct state established,related accept' in nft"
          - "'iif \"wt0\" tcp dport 22 accept' in nft"
        fail_msg: "input chain is missing default-deny or the management plane"

    - name: Assert the lan->reverse_proxy:443 ingress rule
      ansible.builtin.assert:
        that:
          - "'10.30.0.0/24' in nft"
          - "'tcp dport 443 accept' in nft"
        fail_msg: "missing lan->443 rule for reverse_proxy"

    - name: Assert the reverse_proxy->photoprism:2342 ingress rule (resolved to host IP)
      ansible.builtin.assert:
        that:
          - "'10.20.0.50/32' in nft"
          - "'tcp dport 2342 accept' in nft"
        fail_msg: "missing reverse_proxy->2342 rule for photoprism"

    - name: Assert the docker_host extension hook is present
      ansible.builtin.assert:
        that:
          - "'include \"/etc/nftables.d/*.nft\"' in nft"
        fail_msg: "missing drop-in include hook"

    - name: Syntax-check the rendered ruleset (no apply)
      ansible.builtin.command: nft -c -f /etc/nftables.conf
      changed_when: false
```

- [ ] **Step 2: Run the full Molecule sequence + lint**

Run: `make lint && make test ROLE=base`
Expected: `make lint` passes; Molecule converge renders, then `verify.yml` passes all
assertions and the `nft -c` check. DONE_WITH_CONCERNS if the image is unreachable (note
that the assertions could not be exercised).
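
When Molecule cannot run, the same substring checks can be exercised by hand against any rendered file. This is a rough sketch under stated assumptions: `check_ruleset` is a hypothetical helper that does not exist in the repo, and the strings it looks for are copied from the `verify.yml` assertions (including the fixture addresses).

```bash
#!/usr/bin/env bash
# Hand-run counterpart to the verify.yml substring assertions. Illustrative only.
# Assumed name: check_ruleset() is hypothetical, not part of the repo.

check_ruleset() {
  local file="$1" missing=0 needle
  local needles=(
    'type filter hook input priority 0; policy drop;'
    'ct state established,related accept'
    'iif "wt0" tcp dport 22 accept'
    '10.30.0.0/24'
    'tcp dport 443 accept'
    '10.20.0.50/32'
    'tcp dport 2342 accept'
    'include "/etc/nftables.d/*.nft"'
  )
  for needle in "${needles[@]}"; do
    # -F treats the needle as a fixed string, so "*" and "." match literally.
    if ! grep -qF -- "$needle" "$file"; then
      printf 'missing: %s\n' "$needle"
      missing=1
    fi
  done
  return "$missing"
}
```

On a converged host this would be called as `check_ruleset /etc/nftables.conf`; it prints each missing fragment and returns non-zero if any is absent. Like the assertions it mirrors, it checks for substrings only and does not replace the `nft -c` syntax check.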

- [ ] **Step 3: Commit**

```bash
git add roles/base/molecule/default/verify.yml
git commit -m "test(base): molecule verify asserts rendered firewall rules + nft -c"
```

---

### Task 7: Reflect the build in STATUS + CAPABILITIES

**Files:**
- Modify: `STATUS.md`
- Modify: `docs/CAPABILITIES.md`

- [ ] **Step 1: Update the `roles/base/` row in STATUS.md**

In `STATUS.md`, under "## Scaffolded but empty — NOT implemented", find the row:

```markdown
| `roles/base/` | Not in git — only an empty dir on disk (untracked). `site.yml` references it, so a clean clone errors on `make deploy PLAYBOOK=site` until it is built. |
```

Replace it with:

```markdown
| `roles/base/` | **Partially built.** The `firewall` concern is implemented (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) with pytest + Molecule render/syntax tests. Other concerns (SSH hardening, fail2ban, auditd, packages, users) are **not** built yet, so `make deploy PLAYBOOK=site` is still incomplete. |
```

- [ ] **Step 2: Update the firewall note in CAPABILITIES.md**

In `docs/CAPABILITIES.md` (§1 Edge & networking), find the line added for ADR-020:

```markdown
_Firewalling is two-layer (ADR-020): OPNsense at the perimeter + inter-VLAN, plus
per-host `nftables` (default-deny inbound + east-west allowlist) rendered by the `base`
role from a shared `group_vars` service catalog. Both layers are still to be built._
```

Replace the final sentence so it reads:

```markdown
_Firewalling is two-layer (ADR-020): OPNsense at the perimeter + inter-VLAN, plus
per-host `nftables` (default-deny inbound + east-west allowlist) rendered by the `base`
role from a shared `group_vars` service catalog. The host `nftables` layer is built (the
`base` firewall concern); the OPNsense layer is still to be built._
```

- [ ] **Step 3: Update the `_Last reviewed_` date in STATUS.md**

In `STATUS.md`, change the `_Last reviewed: ..._` line to `_Last reviewed: 2026-06-06._`
(if it is not already that date).

- [ ] **Step 4: Verify + lint**

Run: `grep -n "Partially built" STATUS.md && grep -n "host .nftables. layer is built" docs/CAPABILITIES.md && make lint`
Expected: both greps match; `make lint` passes.

- [ ] **Step 5: Commit**

```bash
git add STATUS.md docs/CAPABILITIES.md
git commit -m "docs: record base firewall concern built (ADR-020 host layer)"
```

---

## Final verification

- [ ] `make lint` passes end to end (yamllint + ansible-lint over `roles/base` + `check-tags: OK`).
- [ ] `.venv/bin/python -m pytest tests/ -v` passes (the `check-tags` suite + the 7 new `firewall_rules` tests).
- [ ] `make test ROLE=base` is green (or DONE_WITH_CONCERNS with a clear note if the Molecule image is unreachable in this environment).
- [ ] `git log --oneline -7` shows the seven task commits.
- [ ] Sanity: `roles/base/tasks/firewall.yml` never applies when `base__firewall_apply` is false, and every `command`/`shell` task has `changed_when` (ansible-lint clean).
480
docs/superpowers/plans/2026-06-06-logging-log-integrity.md
Normal file
@ -0,0 +1,480 @@
# Logging & Log Integrity Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Record the logging architecture (all logs → on-cluster Loki; a security subset also write-only off-site to `askari`) by authoring ADR-018 and reconciling every doc that touches logging/observability.

**Architecture:** Documentation-only. The runtime pieces — Alloy in the `base` role, the `loki`/`grafana` service roles, OPNsense syslog forwarding — wait on the `base` + service-role machinery STATUS.md lists as not-yet-built. This plan settles the decision and the doc reconciliation.

**Tech Stack:** Markdown. Verification is the repo's pre-commit hooks + a final cross-reference sweep. No markdown linter, so "tests" are hook-pass + grep checks.

---

## Pre-flight (read once)

- **`rbw` must be unlocked before every commit** (pre-commit ansible-lint decrypts `vault.yml`). Run `rbw unlocked`; if it exits non-zero, stop and ask the user to run `rbw unlock`.
- **Commit style:** one commit per task, imperative subject ≤72 chars.
- **Order:** Task 1 (ADR-018) first — later tasks link to it.
- **Spec:** `docs/superpowers/specs/2026-06-05-logging-log-integrity-design.md`.
- **Branch:** controller creates `chore/logging-log-integrity-docs` off `main` before Task 1; do not implement on `main`.
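
The unlock rule above lends itself to a small guard placed in front of each commit. This is only a sketch: `require_unlocked` and the `RBW` override variable are hypothetical names introduced here; the one thing taken from the plan is the `rbw unlocked` check itself.

```bash
#!/usr/bin/env bash
# Guard sketch for the "rbw must be unlocked" rule. Illustrative only.
# Assumed names: require_unlocked() and the RBW override are hypothetical.

require_unlocked() {
  local rbw_bin="${RBW:-rbw}"   # override lets the guard be exercised without rbw
  if "$rbw_bin" unlocked >/dev/null 2>&1; then
    return 0
  fi
  echo "rbw is locked: ask the user to run 'rbw unlock', then retry" >&2
  return 1
}
```

A step would then chain it ahead of the hook run, for example `require_unlocked && pre-commit run --files <path>`, so a locked vault stops the step before ansible-lint tries to decrypt `vault.yml`.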

---

## File map

| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/018-logging.md` | Create | Home of record for the logging architecture |
| `docs/decisions/002-security.md` | Modify | Make the "logs to central" + "active alerting" bullets concrete (→ ADR-018) |
| `docs/security/accepted-risks.md` | Modify | Add R4 — no cryptographic WORM for logs |
| `docs/CAPABILITIES.md` | Modify | Loki row → decided; add Alloy agent row; note security alerting |
| `docs/decisions/012-hardware-capacity.md` | Modify | Log-storage allocation + SSD-wearout tracked metric |
| `STATUS.md` | Modify | Rows: logging pipeline (designed, not built) |
| `docs/TODO.md` | Modify | Mark 3.1 decided; reconcile 3.6's "on askari" phrasing |
| `CLAUDE.md` | Modify | ADR-018 in Further reading |

**Deferred (not in this plan):** the Alloy task in `base`, the `loki`/`grafana` service roles, OPNsense Suricata syslog forwarding, the push-only `vault.loki.*` credential, and the live pipeline — all recorded in ADR-018/STATUS, built when the stack exists.

---

### Task 1: Author ADR-018 (the home of record)

**Files:**
- Create: `docs/decisions/018-logging.md`

- [ ] **Step 1: Create the ADR**

Create `docs/decisions/018-logging.md` with exactly this content (preserve em-dashes —, backticks, table pipes, `≠`, `~`):

```markdown
# ADR-018 — Logging and log integrity

## Context

boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question — an
attacker who roots a host will try to clear logs to cover their tracks.

The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.

## Decision

1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
   forensic-grade.
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
   trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only** —
   tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
   proportionate control.
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
   retention + wearout monitoring.

## Architecture

- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
  + container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
  single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
  only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
  + the alerting ADR-002 calls for.

## Data flow & the security subset

Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
it syslog-forwards its alerts to the ingest point), and key container security events.

**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.

## Security, integrity & residual risks

Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history; **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; an `askari` outage buffers +
flushes on reconnect.

## Retention & disk-wear

Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).

## Status

Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.

## Dependencies

`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).

## What was ruled out

| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |

See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).
```
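
For orientation only (this is not part of the ADR file content): the append-only property rests on the shape of a push request, which carries label sets and timestamped lines and nothing else. The sketch below builds one such payload in the JSON form accepted by `/loki/api/v1/push`. The helper names `json_escape` and `loki_payload`, the `job` label, and the sample values are assumptions for illustration; Alloy would produce the real requests, and the escaping here covers only backslashes and double quotes.

```bash
#!/usr/bin/env bash
# Sketch: build a Loki push payload for one security-tagged log line.
# Assumed names: json_escape(), loki_payload(), the job label, and the sample
# values are hypothetical. Only the JSON shape comes from Loki's push API.

json_escape() {
  # Escape backslashes and double quotes so the value is a valid JSON string.
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

loki_payload() {
  local job="$1" ts_ns="$2" line="$3"
  # ts_ns is a Unix timestamp in nanoseconds, passed as a string.
  printf '{"streams":[{"stream":{"job":"%s","security":"true"},"values":[["%s","%s"]]}]}\n' \
    "$(json_escape "$job")" "$ts_ns" "$(json_escape "$line")"
}

loki_payload auditd 1700000000000000000 'type=USER_LOGIN res="failed"'
```

A payload like this would be POSTed with `Content-Type: application/json` using the push-only credential. The request body has no field for editing or deleting existing entries, which is the property the ADR relies on.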

- [ ] **Step 2: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/decisions/018-logging.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/018-logging.md
git commit -m "Add ADR-018 (logging and log integrity)"
```

---

### Task 2: Make ADR-002's logging bullets concrete

**Files:**
- Modify: `docs/decisions/002-security.md`

Read the file first, then two exact edits.

- [ ] **Step 1: The audit-trail bullet**

Find:
```
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location if a log aggregation service is available
```
Replace with:
```
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location in near-real-time — all logs to an on-cluster
  Loki, plus a security-relevant subset write-only off-site to `askari` so the audit
  trail survives host (and full-cluster) compromise (ADR-018)
```

- [ ] **Step 2: The active-alerting bullet**

Find:
```
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
  monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
```
Replace with:
```
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus
  log-source-silence (a host that stops shipping) — into Grafana alerting on the
  Loki/Grafana stack (ADR-018; planned)
```

- [ ] **Step 3: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/decisions/002-security.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/002-security.md
git commit -m "ADR-002: make central-logging + alerting controls concrete (ADR-018)"
```

---

### Task 3: Add accepted-risk R4 (no WORM for logs)

**Files:**
- Modify: `docs/security/accepted-risks.md`

Read the file first, then one exact edit (add R4 after R3).

- [ ] **Step 1: Add the R4 row**

Find this exact line (the R3 row):
```
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
```
Add immediately **after** it:
```
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks), and the trail survives even full-cluster compromise. True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
```

- [ ] **Step 2: Bump the "Last reviewed" date**

Find:
```
_Last reviewed: 2026-06-05. The prior gaps
```
Replace with:
```
_Last reviewed: 2026-06-06. The prior gaps
```

- [ ] **Step 3: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
Expected: Passed/Skipped.
```bash
git add docs/security/accepted-risks.md
git commit -m "accepted-risks: add R4 (no cryptographic WORM for logs)"
```

---

### Task 4: Update CAPABILITIES §3 (Observability)

**Files:**
- Modify: `docs/CAPABILITIES.md`

Read the file first, then two exact find/replace edits (three row changes: the Loki row, a new Alloy row, and the Dashboards row).

- [ ] **Step 1: Loki row → decided, note the off-site sink**

Find:
```
| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |
```
Replace with:
```
| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** |
```

- [ ] **Step 2: Add the Alloy agent row** (right after the Loki row just edited) **and note security alerting on the Dashboards row**

Find:
```
| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 |
```
Replace with:
```
| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** |
| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 |
```

- [ ] **Step 3: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
Expected: Passed/Skipped.
```bash
git add docs/CAPABILITIES.md
git commit -m "CAPABILITIES: Loki decided + Alloy agent + security alerting (ADR-018)"
```

---

### Task 5: ADR-012 — log-storage allocation + wearout metric

**Files:**
- Modify: `docs/decisions/012-hardware-capacity.md`

Read the file first, then one exact edit (add a Consequences bullet).

- [ ] **Step 1: Add a Consequences bullet**

Find this exact block:
```
## Consequences

- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
  matching `capacity-scan.py` change.
```
Replace with:
```
## Consequences

- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
  matching `capacity-scan.py` change.
- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
  budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
  **wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
  not assumed.
```

- [ ] **Step 2: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/decisions/012-hardware-capacity.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/012-hardware-capacity.md
git commit -m "ADR-012: track log-storage allocation + SSD wearout (ADR-018)"
```

---

### Task 6: Add logging rows to STATUS.md

**Files:**
- Modify: `STATUS.md`

Read the file first, then one exact edit (add two rows after the Level 4 row).

- [ ] **Step 1: Add the rows**

Find this exact line:
```
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
```
Replace with that SAME line followed by the two new rows:
```
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
```

- [ ] **Step 2: Verify and commit**

Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: Passed/Skipped.
```bash
git add STATUS.md
git commit -m "STATUS: record logging pipeline + security alerting (ADR-018)"
```

---

### Task 7: Reconcile TODO 3.1 and 3.6

**Files:**
- Modify: `docs/TODO.md`

Read the file first, then two exact edits. (Preserve the `~~strikethrough~~` markers.)

- [ ] **Step 1: Mark 3.1 decided**

Find:
```
3. **Building services**
   1. Decide how to manage logs.
```
Replace with:
```
3. **Building services**
   1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
      Grafana Alloy (in `base`); a security subset also ships write-only off-site to
      `askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
```

- [ ] **Step 2: Reconcile 3.6's "on askari" phrasing**

Find:
```
   6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
      Kuma alerts on askari.
```
Replace with:
```
   6. Wire up the monitoring stack. Logging topology DECIDED (ADR-018): cluster Loki
      (all logs) + off-site security subset on `askari` + Grafana on-cluster (not the
      whole stack on `askari`). Still to design/build: Prometheus + metric exporters,
      Uptime Kuma, and exactly which alerts live where.
```

- [ ] **Step 3: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/TODO.md`
Expected: Passed/Skipped.
```bash
git add docs/TODO.md
git commit -m "TODO: mark log management decided (ADR-018); reconcile 3.6"
```

---

### Task 8: Link ADR-018 from CLAUDE.md

**Files:**
- Modify: `CLAUDE.md`

Read the file first, then one exact edit.

- [ ] **Step 1: Add the Further-reading row after Hardware & capacity**

Find:
```
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
```
Replace with that SAME line followed by the new row:
```
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Logging & log integrity | `docs/decisions/018-logging.md` |
```

- [ ] **Step 2: Verify and commit**

Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: Passed/Skipped.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: link ADR-018 (logging)"
```

---

### Task 9: Final consistency sweep
|
||||||
|
|
||||||
|
**Files:** none modified (verification only)
|
||||||
|
|
||||||
|
- [ ] **Step 1: ADR-018 present + cross-linked (canonical docs only)**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
test -f docs/decisions/018-logging.md && echo "ADR-018 present"
|
||||||
|
grep -rl "ADR-018\|018-logging" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
|
||||||
|
```
|
||||||
|
Expected: the file exists and the referencing docs appear — ADR-002, accepted-risks, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md.
|
||||||
|
|
||||||
|
- [ ] **Step 2: No stale "logging undecided / if available" language**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
grep -rniE "log aggregation service is available|Logs \| Loki \| P \| planned|Decide how to manage logs\.($|[^~])" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
|
||||||
|
```
|
||||||
|
Expected: no hits — the ADR-002 conditional, the "planned" Loki row, and the open "Decide how to manage logs" TODO are all now updated.
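As a rough illustration of what the grep pattern above accepts and rejects, the same three alternatives can be tried in Python. This is a sketch only: `is_stale` and the sample lines are hypothetical and not taken from the repo. The key detail is the last alternative, which matches the open TODO wording but not the struck-through (`~~…~~`) form.

```python
import re

# Same three alternatives as the grep -E pattern above, matched case-insensitively.
STALE = re.compile(
    r"log aggregation service is available"
    r"|Logs \| Loki \| P \| planned"
    r"|Decide how to manage logs\.($|[^~])",
    re.IGNORECASE,
)


def is_stale(line: str) -> bool:
    """Return True if the line still contains pre-ADR-018 wording."""
    return STALE.search(line) is not None


# Hypothetical sample lines, not copied from the repo.
samples = [
    "6. Decide how to manage logs.",
    "6. ~~Decide how to manage logs.~~ DECIDED (ADR-018)",
    "| Logs | Loki | P | planned |",
]
for line in samples:
    print(is_stale(line), line)
```

The struck-through TODO is not reported because the period is followed by `~`, which the `[^~]` class excludes.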
|
||||||
|
|
||||||
|
- [ ] **Step 3: Full hook run**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && pre-commit run --all-files`
|
||||||
|
Expected: all hooks Passed/Skipped. Fix anything that fails (likely trailing whitespace / end-of-file) and amend the owning commit.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Push (only if the user asks)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git push origin <branch-or-main-after-merge>
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-review notes (author)
|
||||||
|
|
||||||
|
- **Spec coverage:** decision/architecture/data-flow/security/retention → Task 1 (ADR-018); the spec's "Documentation & implementation changes" table → Tasks 2–8 (ADR-002, accepted-risks R4, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md). The role/pipeline rows in that table are deferred (recorded in ADR-018/STATUS), not implemented here. ✓
|
||||||
|
- **Deferred, intentional:** Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog forwarding, the `vault.loki.*` credential, the metrics-stack dependency — all need the unbuilt machinery; named in ADR-018/STATUS. ✓
|
||||||
|
- **No placeholders:** every create/edit shows exact text. ✓
|
||||||
|
- **Name consistency:** `ADR-018` / `018-logging.md`, "security subset", `offsite_hosts`, Grafana Alloy, push-only credential, R4 used identically across tasks. ✓
|
||||||
|
```
|
||||||
728
docs/superpowers/plans/2026-06-06-tagging-strategy.md
Normal file
|
|
@@ -0,0 +1,728 @@
|
||||||
|
# Ansible Tagging Standard Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Establish a two-tier Ansible tagging standard (role-name tags + a closed concern list) with machine-enforced vocabulary, plus a Proxmox VM metadata-tag convention, so playbook runs are targeted, transparent, and predictable.
|
||||||
|
|
||||||
|
**Architecture:** A single source-of-truth YAML (`tests/tags.yml`) lists the allowed concern/special/opt-in/playbook tags. A Python checker (`scripts/check-tags.py`) scans `roles/` and `playbooks/`, computes the allowed set as `{role dir names} ∪ {tags.yml entries}`, and fails `make lint` on any unknown tag. Terraform gets a documented three-tag VM convention (metadata only). The standard is recorded as ADR-019 and folded into CLAUDE.md.
|
||||||
|
|
||||||
|
**Tech Stack:** Python 3 (stdlib + PyYAML, already present via ansible-core), pytest (already in `requirements.txt`), Make, Terraform (HCL edit only — not `init`ed), Markdown docs.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File structure
|
||||||
|
|
||||||
|
| File | Responsibility | Action |
|
||||||
|
|------|----------------|--------|
|
||||||
|
| `tests/tags.yml` | Single source of truth: allowed concern/special/opt-in/playbook tags | Create |
|
||||||
|
| `scripts/check-tags.py` | Scan `roles/`+`playbooks/`, fail on tags outside the allowed set | Create |
|
||||||
|
| `tests/test_check_tags.py` | Unit tests for the checker (mirrors `tests/test_capacity_scan.py`) | Create |
|
||||||
|
| `Makefile` | Wire `check-tags.py` into the `lint` target | Modify |
|
||||||
|
| `playbooks/site.yml` | Fix `docker_host` role tag (`docker` → `docker_host`) | Modify |
|
||||||
|
| `docs/decisions/019-tagging.md` | The ADR (the standard itself) | Create |
|
||||||
|
| `CLAUDE.md` | Reword tag rule; add Proxmox tag convention; add ADR-019 to Further reading | Modify |
|
||||||
|
| `terraform/environments/staging/main.tf` | Add `managed-by=terraform` tag | Modify |
|
||||||
|
| `terraform/environments/production/main.tf` | Add `managed-by=terraform` tag | Modify |
|
||||||
|
| `docs/TODO.md` | Mark 3.7 and 3.11 DECIDED | Modify |
|
||||||
|
| `docs/CAPABILITIES.md` | Note targeted runs as a capability | Modify |
|
||||||
|
|
||||||
|
Notes for the implementer:
|
||||||
|
- The repo venv is `.venv`. Run Python as `.venv/bin/python` (Makefile vars: `PYTHON := .venv/bin/python`). If `.venv` is missing, run `make setup` first.
|
||||||
|
- PyYAML is available in the venv (ansible-core depends on it) — `import yaml` works.
|
||||||
|
- Terraform is **not** `init`ed in this repo, so `terraform validate`/`plan` will fail offline. Only use `terraform fmt` (offline-safe) for the HCL tasks.
|
||||||
|
- Before any `git commit`, the pre-commit hook decrypts `vault.yml`, so the vault agent must be unlocked: run `rbw unlocked` (exit 0 = good). If locked, ask the user to `rbw unlock` and wait. None of these tasks touch vault files, but the hook still runs.
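The unlock check in the last note can be scripted as a small preflight. The sketch below is an assumption-laden illustration, not part of the plan: `agent_unlocked` is a hypothetical helper, and it relies only on the exit-code convention stated above (`rbw unlocked` exits 0 when the agent is unlocked).

```python
import subprocess
import sys


def agent_unlocked(cmd=("rbw", "unlocked")) -> bool:
    """Return True when the check command exits 0 (the note's 'exit 0 = good')."""
    try:
        return subprocess.run(list(cmd), check=False).returncode == 0
    except FileNotFoundError:
        # The command is not installed; treat that the same as "locked".
        return False


def preflight() -> None:
    """Stop before committing if the vault agent is locked."""
    if not agent_unlocked():
        sys.exit("vault agent is locked: ask the user to run `rbw unlock`")
```

Calling `preflight()` before `git commit` mirrors the manual instruction: it does nothing when the agent is unlocked and exits with a message otherwise.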
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Tag vocabulary file (`tests/tags.yml`)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `tests/tags.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the vocabulary file**
|
||||||
|
|
||||||
|
Create `tests/tags.yml` with exactly this content:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
---
# Allowed Ansible tag vocabulary — single source of truth for scripts/check-tags.py.
# Authoritative reference & rationale: docs/decisions/019-tagging.md.
#
# The full allowed set the linter enforces is:
#   {role directory names under roles/} ∪ everything listed below.
#
# To add a CONCERN tag: add it here AND add a row to the ADR-019 table with a
# one-line justification (cross-cutting, used in 2+ roles, distinct).

# Cross-cutting concern tags, applied per-task/block where a task belongs to the
# concern. Targeted one at a time (tags are union/OR, never intersected).
concerns:
  - packages    # apt package install/management
  - users       # accounts, groups, sudo
  - firewall    # nftables rulesets & port definitions (ADR-002)
  - hardening   # security baseline — sshd config, fail2ban, auditd, sysctl
  - logging     # Alloy / log-shipping config (ADR-018)
  - monitoring  # metric exporters / health checks
  - config      # render templated config/compose files to disk — no restart
  - deploy      # bring services up / restart (compose up -d)
  - proxy       # reverse-proxy + TLS registration (Traefik routes, Authentik)

# Ansible built-in special tags. Narrow use only:
#   always — cheap preflight assertions (run regardless of --tags)
#   never  — destructive/expensive tasks, paired with an opt-in tag below
special:
  - always
  - never

# `never`-paired opt-in tags: destructive/expensive tasks that only run when
# named explicitly (e.g. `tags: [never, force_pull]`). Empty until a role adds one.
opt_ins: []

# Playbook-level identity tags for role-less lifecycle plays (e.g. bootstrap.yml).
playbooks:
  - bootstrap
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify it parses and has the expected shape**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
.venv/bin/python -c "import yaml; d=yaml.safe_load(open('tests/tags.yml')); assert len(d['concerns'])==9, d['concerns']; assert d['special']==['always','never']; assert d['opt_ins']==[]; assert d['playbooks']==['bootstrap']; print('tags.yml OK')"
|
||||||
|
```
|
||||||
|
Expected: prints `tags.yml OK` and exits 0.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add tests/tags.yml
|
||||||
|
git commit -m "feat(tags): add allowed-tag vocabulary (tests/tags.yml)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Checker core — tag collection & allowed-set helpers
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `scripts/check-tags.py`
|
||||||
|
- Test: `tests/test_check_tags.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing tests**
|
||||||
|
|
||||||
|
Create `tests/test_check_tags.py`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
import importlib.util
import pathlib

# check-tags.py has a hyphen in its name, so load it by path instead of a normal import.
_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "check-tags.py"
_spec = importlib.util.spec_from_file_location("check_tags", _PATH)
ct = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(ct)


def test_collect_tags_list_form():
    node = {"name": "t", "tags": ["firewall", "users"]}
    assert ct.collect_tags(node) == {"firewall", "users"}


def test_collect_tags_string_form():
    node = {"name": "t", "tags": "always"}
    assert ct.collect_tags(node) == {"always"}


def test_collect_tags_nested_blocks_and_roles():
    doc = [
        {"hosts": "all", "roles": [{"role": "base", "tags": ["base"]}]},
        {"block": [{"name": "x", "tags": ["config"]}], "tags": ["deploy"]},
    ]
    assert ct.collect_tags(doc) == {"base", "config", "deploy"}


def test_collect_tags_ignores_templated_values():
    node = {"tags": ["{{ dynamic }}", "logging"]}
    assert ct.collect_tags(node) == {"logging"}


def test_load_vocab_unions_all_categories():
    vocab = ct.load_vocab()
    assert "firewall" in vocab   # concern
    assert "always" in vocab     # special
    assert "bootstrap" in vocab  # playbook identity
    assert len(vocab) >= 12      # 9 concerns + 2 special + 1 playbook


def test_role_names_reads_role_dirs():
    names = ct.role_names()
    assert "base" in names
    assert "docker_host" in names
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run tests to verify they fail**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_check_tags.py -v`
|
||||||
|
Expected: FAIL with a collection error, `FileNotFoundError` for `scripts/check-tags.py` (the file doesn't exist yet, so the module can't be loaded).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Write the minimal implementation**
|
||||||
|
|
||||||
|
Create `scripts/check-tags.py`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
#!/usr/bin/env python3
"""
Validate that every Ansible tag used under roles/ and playbooks/ belongs to the
approved vocabulary. Single source of truth: tests/tags.yml. Rationale: ADR-019.

Allowed set = {role directory names under roles/} ∪ {concerns, special, opt_ins,
playbooks from tests/tags.yml}. Templated tags (containing "{{") are skipped —
they can't be statically validated.

Usage: python3 scripts/check-tags.py
Exit 0 = all tags allowed; exit 1 = unknown tag(s) found.
"""
import pathlib
import sys

import yaml

REPO = pathlib.Path(__file__).resolve().parent.parent
VOCAB_FILE = REPO / "tests" / "tags.yml"
SCAN_DIRS = ("roles", "playbooks")


class _IgnoreUnknownTags(yaml.SafeLoader):
    """SafeLoader that tolerates custom YAML tags (e.g. !vault) instead of crashing."""


def _ignore(loader, tag_suffix, node):
    return None


_IgnoreUnknownTags.add_multi_constructor("", _ignore)
_IgnoreUnknownTags.add_multi_constructor("!", _ignore)


def _static_str(value):
    return isinstance(value, str) and "{{" not in value


def load_vocab(path=VOCAB_FILE):
    data = yaml.safe_load(path.read_text()) or {}
    vocab = set()
    for key in ("concerns", "special", "opt_ins", "playbooks"):
        vocab.update(data.get(key) or [])
    return vocab


def role_names(repo=REPO):
    roles_dir = repo / "roles"
    if not roles_dir.is_dir():
        return set()
    return {p.name for p in roles_dir.iterdir() if p.is_dir()}


def collect_tags(node):
    """Recursively collect every static tag string under any 'tags:' key."""
    tags = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "tags":
                if _static_str(value):
                    tags.add(value)
                elif isinstance(value, list):
                    tags.update(t for t in value if _static_str(t))
            tags |= collect_tags(value)
    elif isinstance(node, list):
        for item in node:
            tags |= collect_tags(item)
    return tags


if __name__ == "__main__":  # pragma: no cover
    sys.exit(0)
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run tests to verify they pass**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_check_tags.py -v`
|
||||||
|
Expected: PASS (all 6 tests).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add scripts/check-tags.py tests/test_check_tags.py
|
||||||
|
git commit -m "feat(tags): checker helpers — tag collection & allowed-set"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Checker validation — scan files and fail on unknown tags
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `scripts/check-tags.py`
|
||||||
|
- Test: `tests/test_check_tags.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing tests**
|
||||||
|
|
||||||
|
Append to `tests/test_check_tags.py`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def test_scan_text_collects_from_yaml_string():
    text = """
- hosts: all
  roles:
    - role: base
      tags: [base]
  tasks:
    - name: open port
      tags: [firewall]
"""
    assert ct.scan_text(text) == {"base", "firewall"}


def test_scan_text_tolerates_custom_yaml_tags():
    text = "- name: t\n  secret: !vault xxx\n  tags: [users]\n"
    assert ct.scan_text(text) == {"users"}


def test_find_violations_flags_unknown_tag():
    allowed = {"base", "firewall"}
    used = {"base", "frewall"}  # typo
    assert ct.find_violations(used, allowed) == ["frewall"]


def test_find_violations_empty_when_all_allowed():
    assert ct.find_violations({"base", "firewall"}, {"base", "firewall"}) == []
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run tests to verify they fail**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_check_tags.py -v`
|
||||||
|
Expected: FAIL — `AttributeError: module 'check_tags' has no attribute 'scan_text'` (and `find_violations`).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add the scanning + validation functions**
|
||||||
|
|
||||||
|
In `scripts/check-tags.py`, replace the final block:
|
||||||
|
|
||||||
|
```python
|
||||||
|
if __name__ == "__main__":  # pragma: no cover
    sys.exit(0)
|
||||||
|
```
|
||||||
|
|
||||||
|
with:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def scan_text(text):
    """Collect static tags from a (possibly multi-document) YAML string."""
    found = set()
    for doc in yaml.load_all(text, Loader=_IgnoreUnknownTags):
        found |= collect_tags(doc)
    return found


def iter_yaml_files(repo=REPO, scan_dirs=SCAN_DIRS):
    for name in scan_dirs:
        base = repo / name
        if not base.is_dir():
            continue
        for ext in ("*.yml", "*.yaml"):
            yield from sorted(base.rglob(ext))


def find_violations(used, allowed):
    return sorted(used - allowed)


def main():
    allowed = load_vocab() | role_names()
    violations = []
    for path in iter_yaml_files():
        try:
            used = scan_text(path.read_text())
        except yaml.YAMLError as exc:
            print(f"warning: could not parse {path}: {exc}", file=sys.stderr)
            continue
        for tag in find_violations(used, allowed):
            violations.append((path.relative_to(REPO), tag))

    if violations:
        print(
            "error: Ansible tag(s) not in tests/tags.yml or role names "
            "(see docs/decisions/019-tagging.md):",
            file=sys.stderr,
        )
        for relpath, tag in violations:
            print(f"  {relpath}: '{tag}'", file=sys.stderr)
        print(f"\nallowed: {', '.join(sorted(allowed))}", file=sys.stderr)
        sys.exit(1)

    print(f"check-tags: OK ({len(allowed)} tags allowed across {len(SCAN_DIRS)} dirs)")


if __name__ == "__main__":
    main()
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run tests to verify they pass**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_check_tags.py -v`
|
||||||
|
Expected: PASS (all 10 tests).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add scripts/check-tags.py tests/test_check_tags.py
|
||||||
|
git commit -m "feat(tags): scan roles/+playbooks/ and fail on unknown tags"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Reconcile existing tags & wire into `make lint`
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `playbooks/site.yml:18-19`
|
||||||
|
- Modify: `Makefile` (the `lint:` target)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Run the checker against the current repo (expect one violation)**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python scripts/check-tags.py`
|
||||||
|
Expected: FAIL (exit 1) reporting `playbooks/site.yml: 'docker'` — because the `docker_host` role is tagged `[docker]`, which is neither a role name nor a vocabulary tag. This confirms the checker works end-to-end.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Fix the role tag to equal the role name**
|
||||||
|
|
||||||
|
In `playbooks/site.yml`, change:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- role: docker_host
  tags: [docker]
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- role: docker_host
  tags: [docker_host]
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Re-run the checker (expect clean)**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python scripts/check-tags.py`
|
||||||
|
Expected: PASS — prints `check-tags: OK (... tags allowed across 2 dirs)` and exits 0.
|
||||||
|
(The allowed set includes the role names `base` and `docker_host`; the used tags are now `base`, `docker_host`, and `bootstrap`, all of which are allowed.)
|
||||||
|
|
||||||
|
- [ ] **Step 4: Wire the checker into `make lint`**
|
||||||
|
|
||||||
|
In `Makefile`, change the `lint:` target from:
|
||||||
|
|
||||||
|
```makefile
|
||||||
|
lint:
	$(VENV)/bin/yamllint .
	$(LINT)
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```makefile
|
||||||
|
lint:
	$(VENV)/bin/yamllint .
	$(LINT)
	$(PYTHON) scripts/check-tags.py
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Run the full lint suite and the test suite**
|
||||||
|
|
||||||
|
Run: `make lint && .venv/bin/python -m pytest tests/test_check_tags.py -v`
|
||||||
|
Expected: yamllint passes, ansible-lint passes, `check-tags: OK`, and all pytest tests PASS.
|
||||||
|
|
||||||
|
- [ ] **Step 6: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add playbooks/site.yml Makefile
|
||||||
|
git commit -m "feat(tags): enforce tag vocabulary in make lint; fix docker_host tag"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Terraform Proxmox VM tag convention
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `terraform/environments/staging/main.tf` (the `tags =` line in `module "vms"`)
|
||||||
|
- Modify: `terraform/environments/production/main.tf` (the `tags =` line in `module "vms"`)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add `managed-by=terraform` to the staging VM tags**
|
||||||
|
|
||||||
|
In `terraform/environments/staging/main.tf`, change:
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
tags = ["staging", each.value.group]
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
tags = ["staging", each.value.group, "managed-by=terraform"]
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add `managed-by=terraform` to the production VM tags**
|
||||||
|
|
||||||
|
In `terraform/environments/production/main.tf`, change:
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
tags = ["production", each.value.group]
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
tags = ["production", each.value.group, "managed-by=terraform"]
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Format-check the HCL (offline-safe)**
|
||||||
|
|
||||||
|
Run: `terraform -chdir=terraform/environments/staging fmt && terraform -chdir=terraform/environments/production fmt`
|
||||||
|
Expected: either no output (already formatted) or the filename printed (reformatted). Exit 0.
|
||||||
|
(Do NOT run `terraform validate`/`plan` — Terraform is not `init`ed in this repo and they will fail offline.)
|
||||||
|
|
||||||
|
- [ ] **Step 4: Confirm the edits**
|
||||||
|
|
||||||
|
Run: `grep -n "managed-by=terraform" terraform/environments/staging/main.tf terraform/environments/production/main.tf`
|
||||||
|
Expected: one match in each file.
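Beyond the grep, the three-tag convention itself can be expressed as a small check. The following is a minimal sketch under stated assumptions: `check_vm_tags`, `ENVIRONMENTS`, and `MANAGED_BY` are hypothetical names for illustration and are not part of the repo or of Terraform; the tag values come from the edits above.

```python
ENVIRONMENTS = {"staging", "production"}
MANAGED_BY = "managed-by=terraform"


def check_vm_tags(tags, env, group):
    """Return a list of problems; an empty list means the tags follow the convention."""
    problems = []
    if len(tags) != 3:
        problems.append(f"expected exactly 3 tags, got {len(tags)}")
    if env not in ENVIRONMENTS:
        problems.append(f"unknown environment {env!r}")
    # Every VM should carry its environment, its inventory group, and the managed-by marker.
    for required in (env, group, MANAGED_BY):
        if required not in tags:
            problems.append(f"missing tag {required!r}")
    return problems
```

For example, `check_vm_tags(["staging", "docker_hosts", "managed-by=terraform"], "staging", "docker_hosts")` returns an empty list, while the pre-edit tag list `["staging", "docker_hosts"]` yields two problems (wrong count and the missing `managed-by=terraform` tag).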
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add terraform/environments/staging/main.tf terraform/environments/production/main.tf
|
||||||
|
git commit -m "feat(tags): Proxmox VM metadata convention (managed-by=terraform)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: Documentation — ADR-019, CLAUDE.md, TODO, CAPABILITIES
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/decisions/019-tagging.md`
|
||||||
|
- Modify: `CLAUDE.md` (Ansible conventions; Terraform conventions; Further reading)
|
||||||
|
- Modify: `docs/TODO.md` (items 3.7 and 3.11)
|
||||||
|
- Modify: `docs/CAPABILITIES.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the ADR**
|
||||||
|
|
||||||
|
Create `docs/decisions/019-tagging.md`:
|
||||||
|
|
||||||
|
````markdown
|
||||||
|
# ADR-019 — Tagging standard for targeted, predictable runs
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-06). Resolves TODO 3.7 ("Define a tagging standard that lets us
|
||||||
|
target runs without over-tagging") and TODO 3.11 ("Deliberate tagging strategy").
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
boma wants to run playbooks **targeted** — a single service, a single layer, or a
|
||||||
|
single cross-cutting concern — **transparently and predictably**: a reader should
|
||||||
|
know from a `--tags` invocation exactly what it will and won't touch. CLAUDE.md
|
||||||
|
already requires tag-filterable tasks, but no vocabulary or convention existed, and
|
||||||
|
the TODO explicitly warns against the opposite failure mode: **over-tagging**.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### Two-tier tagging
|
||||||
|
|
||||||
|
**Tier 1 — role/service tag (mechanical).** The tag equals the role name, applied
|
||||||
|
once at the role-import level:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
roles:
  - role: photoprism
    tags: [photoprism]
|
||||||
|
```
|
||||||
|
|
||||||
|
Ansible propagates it to every task in the role. Because one service = one role
|
||||||
|
(ADR-004), this single rule covers both the *layer/role* and *single-service*
|
||||||
|
targeting axes with zero per-task burden. Role-less lifecycle playbooks
|
||||||
|
(e.g. `bootstrap.yml`) carry a single playbook-identity tag instead.
|
||||||
|
|
||||||
|
**Tier 2 — concern tag (curated).** A small **closed list** of cross-cutting concern
|
||||||
|
tags, applied per-task/block **only where a task genuinely belongs to that concern**.
|
||||||
|
|
||||||
|
### The closed concern list
|
||||||
|
|
||||||
|
A concern earns a tag only if it (a) appears in 2+ roles, (b) is worth running as a
|
||||||
|
slice on its own, and (c) doesn't overlap confusingly with another.
|
||||||
|
|
||||||
|
| Tag | Covers |
|
||||||
|
|-----|--------|
|
||||||
|
| `packages` | apt package install/management |
|
||||||
|
| `users` | accounts, groups, sudo |
|
||||||
|
| `firewall` | nftables rulesets & port definitions (ADR-002) |
|
||||||
|
| `hardening` | security baseline — sshd config, fail2ban, auditd, sysctl |
|
||||||
|
| `logging` | Alloy / log-shipping config (ADR-018) |
|
||||||
|
| `monitoring` | metric exporters / health checks |
|
||||||
|
| `config` | render templated config/compose files to disk — **no restart** |
|
||||||
|
| `deploy` | bring services up / restart (`compose up -d`) |
|
||||||
|
| `proxy` | reverse-proxy + TLS registration (Traefik routes, Authentik) |
|
||||||
|
|
||||||
|
The `config`/`deploy` split lets you re-render and diff configuration (`--tags
|
||||||
|
config`) without bouncing services, then restart deliberately (`--tags deploy`).
|
||||||
|
`backup` and `secrets` are intentionally omitted until the roles needing them exist.
|
||||||
|
|
||||||
|
### `always` / `never`
|
||||||
|
|
||||||
|
- **`always`** — reserved for cheap preflight assertions (vault unlocked, OS is
|
||||||
|
Debian 13, required vars present), so even `--tags config` runs its safety guards.
|
||||||
|
- **`never`** — reserved for destructive/expensive opt-in tasks, each paired with a
|
||||||
|
descriptive tag (e.g. `tags: [never, force_pull]`); they run only when named.
|
||||||
|
|
||||||
|
### Predictability principle: tags are union-only
|
||||||
|
|
||||||
|
`--tags a,b` runs tasks tagged a **OR** b — Ansible has no native AND. boma therefore
|
||||||
|
targets **one axis at a time**: either a role/service *or* a concern, never an
|
||||||
|
intersection like "photoprism's firewall only." If that's ever needed, just run
|
||||||
|
`--tags photoprism` (idempotent and fast). Designing for intersection is the
|
||||||
|
over-tagging trap; we decline it on purpose.
|
||||||
|
|
||||||
|
### Terraform / Proxmox VM tags (metadata only)
|
||||||
|
|
||||||
|
Every Terraform-managed VM carries exactly three Proxmox tags:
|
||||||
|
|
||||||
|
| Tag | Value | Purpose |
|
||||||
|
|-----|-------|---------|
|
||||||
|
| env | `staging` \| `production` | which environment |
|
||||||
|
| role/group | `docker_hosts`, `proxmox_hosts`, … | matches the inventory group |
|
||||||
|
| managed-by | `terraform` | distinguishes IaC VMs from hand-made ones |
|
||||||
|
|
||||||
|
These are **pure metadata for transparency** (glanceable in the Proxmox UI). They do
|
||||||
|
**not** drive run-targeting and do **not** feed inventory — `scripts/tf_to_inventory.py`
|
||||||
|
keeps building groups from the `group` output field, the single source of truth.
|
||||||
|
|
||||||
|
## Enforcement
|
||||||
|
|
||||||
|
`tests/tags.yml` is the single source of truth for the allowed concern/special/
|
||||||
|
opt-in/playbook tags. `scripts/check-tags.py` (run by `make lint`, covered by
|
||||||
|
`tests/test_check_tags.py`) scans `roles/` and `playbooks/` and fails on any tag
|
||||||
|
outside `{role directory names} ∪ {tests/tags.yml entries}`.
|
||||||
|
|
||||||
|
## Extending the vocabulary
|
||||||
|
|
||||||
|
To add a concern tag: (1) add it to `tests/tags.yml`; (2) add a row to the concern
|
||||||
|
table above with a one-line justification showing it passes the litmus test
|
||||||
|
(cross-cutting, 2+ roles, distinct). That is the whole gate — lightweight, but it
|
||||||
|
leaves a paper trail.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Targeted runs are predictable: only two kinds of tags exist, one of them mechanical.
|
||||||
|
- Over-tagging is structurally resisted (closed list + lint enforcement).
|
||||||
|
- Intersection targeting is unavailable by design.
|
||||||
|
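- Authors must keep role tags = role names; the linter rejects any tag that is neither a role name nor a vocabulary entry, but it does not verify that a role is tagged with its own name.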
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-002 (security baseline / firewall), ADR-004 (one service = one role),
|
||||||
|
ADR-009 (TF↔Ansible handoff / inventory), ADR-018 (logging).
|
||||||
|
````
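The union-only rule and the `always` / `never` behaviour described in the ADR can be illustrated with a toy selection model. This is a simplified sketch, not Ansible's implementation: `task_runs` is a hypothetical function, it ignores `--skip-tags` and the `all` / `tagged` / `untagged` keywords, and it only models runs where `--tags` is given.

```python
def task_runs(task_tags, requested):
    """Simplified model of `--tags` selection.

    task_tags: tags on the task, including any inherited from its role.
    requested: tags passed via --tags.
    """
    task_tags, requested = set(task_tags), set(requested)
    if "never" in task_tags:
        # Opt-in tasks run only when one of their tags is named explicitly.
        return bool(task_tags & requested)
    if "always" in task_tags:
        # Preflight guards run whatever tags were requested.
        return True
    # Everything else: union/OR across the requested tags.
    return bool(task_tags & requested)


# --tags firewall selects a photoprism firewall task (OR, not AND).
print(task_runs({"photoprism", "firewall"}, {"firewall"}))
# --tags config does not trigger an opt-in task.
print(task_runs({"never", "force_pull"}, {"config"}))
```

One consequence of this model is worth checking against real Ansible behaviour before relying on it: a `never` task that also inherits a requested role tag (for example `{"never", "force_pull", "photoprism"}` with `--tags photoprism`) is selected, because the match is on any shared tag.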
|
||||||
|
|
||||||
|
- [ ] **Step 2: Reword the tag rule in CLAUDE.md**
|
||||||
|
|
||||||
|
In `CLAUDE.md`, under **Ansible conventions**, change:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- **Tags**: every task must have at least one tag; playbooks support `--tags` filtering
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- **Tags** (ADR-019): import each role with its role-name tag once at the play level
|
||||||
|
(Ansible inherits it to every task). Tag a task/block with a concern tag from the
|
||||||
|
approved list (`tests/tags.yml`) only where it genuinely belongs to that concern —
|
||||||
|
don't invent tags or tag for tagging's sake. Target one axis at a time (role/service
|
||||||
|
*or* concern; tags are union/OR, never intersected). `make lint` enforces the vocabulary.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add the Proxmox tag convention to CLAUDE.md**
|
||||||
|
|
||||||
|
In `CLAUDE.md`, under **Terraform conventions**, add this bullet after the existing
|
||||||
|
"Terraform owns VM existence only" bullet:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- Every TF-managed VM carries three Proxmox tags — `<env>`, its inventory `group`, and
|
||||||
|
`managed-by=terraform` — as **metadata only** (ADR-019). They do not feed inventory
|
||||||
|
or run-targeting; `tf_to_inventory.py` still groups by the `group` output field.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Add ADR-019 to the Further reading table**
|
||||||
|
|
||||||
|
In `CLAUDE.md`, in the **Further reading** table, add this row immediately after the
|
||||||
|
`Logging & log integrity` row:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Mark the TODO items decided**
|
||||||
|
|
||||||
|
In `docs/TODO.md`, change the line for item 3.7:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
7. Define a tagging standard that lets us target runs without over-tagging.
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
7. ~~Define a tagging standard that lets us target runs without over-tagging.~~
|
||||||
|
DECIDED (ADR-019): two-tier — role-name tags (auto, at play level) + a closed
|
||||||
|
9-tag concern list (`tests/tags.yml`); union-only targeting; enforced by `make lint`.
|
||||||
|
```
|
||||||
|
|
||||||
|
and change item 3.11:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
11. Deliberate tagging strategy.
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
11. ~~Deliberate tagging strategy.~~ DECIDED (ADR-019) — folded into 3.7.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 6: Note the capability in CAPABILITIES.md**
|
||||||
|
|
||||||
|
Run: `grep -n "^## \|^### " docs/CAPABILITIES.md` to locate the section covering
|
||||||
|
operations / CI / how playbooks are run. Add this bullet under the most appropriate
|
||||||
|
existing section (operations or testing/CI):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- **Targeted runs** (ADR-019): playbooks are sliced with `--tags` along two axes —
|
||||||
|
role/service (tag = role name) or a closed list of cross-cutting concerns
|
||||||
|
(`firewall`, `logging`, `config`, `deploy`, …); the vocabulary is lint-enforced.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 7: Verify docs are consistent and lint still passes**
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
grep -n "019-tagging" CLAUDE.md && grep -c "managed-by=terraform" CLAUDE.md && make lint
|
||||||
|
```
|
||||||
|
Expected: the ADR-019 row is found in CLAUDE.md, `managed-by=terraform` appears at
|
||||||
|
least once, and `make lint` passes (including `check-tags: OK`).
|
||||||
|
|
||||||
|
- [ ] **Step 8: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/019-tagging.md CLAUDE.md docs/TODO.md docs/CAPABILITIES.md
|
||||||
|
git commit -m "docs(tags): ADR-019 + CLAUDE.md/TODO/CAPABILITIES (tagging standard)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Final verification
|
||||||
|
|
||||||
|
- [ ] Run the full suite once more: `make lint && .venv/bin/python -m pytest tests/ -v`
|
||||||
|
Expected: yamllint + ansible-lint pass, `check-tags: OK`, all tests PASS.
|
||||||
|
- [ ] Confirm a deliberate violation is caught: temporarily add `tags: [bogus]` to a
|
||||||
|
task in `playbooks/site.yml`, run `.venv/bin/python scripts/check-tags.py`, confirm it
|
||||||
|
exits 1 reporting `'bogus'`, then revert the edit.
|
||||||
|
- [ ] `git log --oneline -7` shows the six task commits.
|
||||||
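The closed-vocabulary rule that the deliberate-violation check exercises can be pictured roughly as follows. This is a minimal sketch, not the contents of `scripts/check-tags.py`: the function name `unknown_tags`, the `docker` role name, and the partial concern list (only the four tags named in this plan; the full list lives in `tests/tags.yml`) are assumptions made for illustration.

```python
def unknown_tags(used_tags, role_names, concern_tags):
    """Return the tags that are neither a role name nor a listed concern tag."""
    allowed = set(role_names) | set(concern_tags)
    return sorted(set(used_tags) - allowed)


# Hypothetical inputs: two role names and a partial concern list.
ROLE_NAMES = {"base", "docker"}
CONCERN_TAGS = {"firewall", "logging", "config", "deploy"}

violations = unknown_tags(["base", "firewall", "bogus"], ROLE_NAMES, CONCERN_TAGS)
print(violations)
```

A real checker would exit non-zero whenever the returned list is non-empty, which is the behaviour the `tags: [bogus]` experiment above expects.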
544
docs/superpowers/plans/2026-06-09-operational-access.md
Normal file
|
|
@@ -0,0 +1,544 @@
|
||||||
|
# Operational Access (ADR-021) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Establish operational access as a deployment deliverable — a documented, verifiable set of mesh-reachable troubleshooting paths for every host and service — by writing ADR-021, reconciling the latent ADR-016/020 SSH contradiction, adding the control-node SSH source to the host firewall, and wiring the `ACCESS.md` record + `/check-access` verifier into boma's governance.
|
||||||
|
|
||||||
|
**Architecture:** Source of truth is the committed design spec `docs/superpowers/specs/2026-06-09-operational-access-design.md`. Structured access facts live as declarative `access__*` data that renders `ACCESS.md` and drives `/check-access` (the access analogue of `VERIFY.md` + `/verify-service`). Work is split into **Tranche A — land now** (doctrine docs, the one firewall code change, the dormant `/check-access` command, governance wiring) and **Tranche B — build-pending on infra** (per-service `access__*` population, rendered `ACCESS.md` files, and `/check-access` *running*), which arrive with service roles and live hosts and require no action in this plan.
|
||||||
|
|
||||||
|
**Tech Stack:** Markdown ADRs/docs; Ansible role `base` (Jinja2 nftables template + `defaults/main.yml`); Molecule (Debian 13, render + `nft -c`, no apply) for the firewall test; Claude Code command file for `/check-access`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File structure
|
||||||
|
|
||||||
|
| File | Tranche | Responsibility |
|
||||||
|
|---|---|---|
|
||||||
|
| `docs/decisions/021-operational-access.md` | A | NEW — the doctrine (two layers, three-tier ladder, break-glass, `access__*` model, `/check-access`) |
|
||||||
|
| `docs/decisions/016-mesh-vpn.md` | A | MODIFY — reconcile: SSH on `wt0` **and** from `ubongo`'s LAN address |
|
||||||
|
| `docs/decisions/020-firewall.md` | A | MODIFY — guaranteed management plane gains the control-node SSH source |
|
||||||
|
| `docs/access/service-access-template.md` | A | NEW — the `ACCESS.md` record shape (rendered-from-data + prose tail) |
|
||||||
|
| `roles/base/defaults/main.yml` | A | MODIFY — add `base__firewall_control_addr` knob (default empty → no-op) |
|
||||||
|
| `roles/base/templates/nftables.conf.j2` | A | MODIFY — conditional management-plane SSH rule for the control address |
|
||||||
|
| `roles/base/molecule/default/converge.yml` | A | MODIFY — set the knob for the test |
|
||||||
|
| `roles/base/molecule/default/verify.yml` | A | MODIFY — assert the rendered rule |
|
||||||
|
| `.claude/commands/check-access.md` | A | NEW — the `/check-access` verifier command (dormant until infra exists) |
|
||||||
|
| `docs/security/service-checklist.md` | A | MODIFY — one new gate item |
|
||||||
|
| `docs/runbooks/new-role.md` | A | MODIFY — new step: write `ACCESS.md` (mirrors SECURITY/VERIFY steps) |
|
||||||
|
| `CLAUDE.md` | A | MODIFY — `ACCESS.md` in Role conventions; ADR-021 in Further reading |
|
||||||
|
| `STATUS.md` | A | MODIFY — new rows for the doctrine, the firewall source, `/check-access` |
|
||||||
|
| `docs/TODO.md` | A | MODIFY — mark 3.2 + 7.2 DECIDED → ADR-021 |
|
||||||
|
|
||||||
|
**Tranche B (no tasks here — captured for the record):** per-service `access__*` blocks + rendered `roles/<svc>/ACCESS.md` land when each service role is built (governed by the Tranche-A checklist + runbook); `/check-access` *running* lands when `ubongo` + staging + vault exist. Both are designed-now, build-pending — exactly like `/verify-service` under ADR-017.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tranche A — Land now
|
||||||
|
|
||||||
|
### Task 1: Write ADR-021
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/decisions/021-operational-access.md`
|
||||||
|
|
||||||
|
The ADR is the durable decision record derived from the committed spec
|
||||||
|
`docs/superpowers/specs/2026-06-09-operational-access-design.md`. Match the prose style and
|
||||||
|
heading shape of an existing ADR (read `docs/decisions/020-firewall.md` first). The ADR
|
||||||
|
**must** state these specifics — they are the parts easy to get wrong:
|
||||||
|
|
||||||
|
- **Doctrine sentence (verbatim):** *"Every host and every service guarantees at least one
|
||||||
|
documented, verifiable way in for operational troubleshooting — and the deploy that
|
||||||
|
creates it also records and proves it."*
|
||||||
|
- **Two layers:** host baseline (resolves TODO 7.2) + per-service record (resolves TODO 3.2).
|
||||||
|
- **Three-tier access ladder:** (1) `wt0` mesh SSH — primary, WireGuard-authenticated;
|
||||||
|
(2) LAN SSH from `ubongo` only — secondary, mesh-independent, source-IP-gated **plus**
|
||||||
|
keys-only + fail2ban; all other LAN hosts stay default-denied; (3) console — break-glass
|
||||||
|
per host class: cluster VMs → Proxmox serial/VNC console, `askari` → Hetzner
|
||||||
|
rescue/console, `ubongo` → local console; reachability-checked, never exercised.
|
||||||
|
- **Reconciliation, not weakening (state this explicitly):** ADR-016 already requires
|
||||||
|
Ansible to reach the fleet by LAN IP ("a mesh/coordinator outage never blocks on-LAN
|
||||||
|
runs"), which *requires* LAN SSH from `ubongo`; yet ADR-016 also said "SSH only on `wt0`"
|
||||||
|
and ADR-020's guaranteed management plane listed only `wt0`. ADR-021 resolves that latent
|
||||||
|
contradiction by making the control-node SSH allow explicit and adding it to the
|
||||||
|
guaranteed management plane. It does **not** weaken default-deny: exactly one extra
|
||||||
|
trusted source on the LAN.
|
||||||
|
- **Declarative `access__*` data model:** service-role defaults carry `access__service`,
|
||||||
|
`access__compose_project`, `access__compose_path`, `access__containers`,
|
||||||
|
`access__log.loki_labels`, and `access__api` (`enabled`, `base_url`, `firewall_ref`,
|
||||||
|
`auth.vault_ref`, `health_path`; or `enabled: false` + `reason`). **Invariant:**
|
||||||
|
`access__api` never opens a port — it `firewall_ref`s the `group_vars` firewall catalog;
|
||||||
|
ADR-020 stays the sole owner of exposure.
|
||||||
|
- **Rendered record:** `ACCESS.md` is rendered from that data + a prose tail (operational
|
||||||
|
notes / gotchas). First-class sibling of `SECURITY.md`/`VERIFY.md`.
|
||||||
|
- **`/check-access`:** the verifier that probes each declared path and reports which are
|
||||||
|
live; break-glass reachability-only; designed now, build-pending on infra.
|
||||||
|
- **Status / consequences:** what lands now vs build-pending (mirror this plan's split).
|
||||||
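The `access__api` invariant in the bullets above (enabled with a full set of fields, or disabled with a reason, and never opening a port itself) could be checked along these lines. This is a sketch under stated assumptions: the function name `access_api_problems` is hypothetical, firewall catalog entries are assumed to be addressable by name, and the example values are invented.

```python
REQUIRED_WHEN_ENABLED = ("base_url", "firewall_ref", "health_path")


def access_api_problems(api, catalog_names):
    """List the ways an `access__api` mapping breaks the rules described above."""
    problems = []
    if not api.get("enabled", False):
        # A disabled admin API must still say why it is absent.
        if not api.get("reason"):
            problems.append("enabled is false but no reason is given")
        return problems
    for key in REQUIRED_WHEN_ENABLED:
        if not api.get(key):
            problems.append(f"missing {key}")
    if not (api.get("auth") or {}).get("vault_ref"):
        problems.append("missing auth.vault_ref")
    # The record may only point at an existing catalog entry, never open a port.
    ref = api.get("firewall_ref")
    if ref and ref not in catalog_names:
        problems.append(f"firewall_ref {ref!r} is not in the firewall catalog")
    return problems


# Hypothetical service data; every value here is invented for the example.
example_api = {
    "enabled": True,
    "base_url": "https://photoprism.example.internal",
    "firewall_ref": "photoprism-admin",
    "auth": {"vault_ref": "photoprism/admin-token"},
    "health_path": "/api/v1/status",
}
print(access_api_problems(example_api, {"photoprism-admin"}))
```

Returning a list of problems rather than raising on the first one lets a reviewer see every gap in a role's data in one pass.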
|
|
||||||
|
- [ ] **Step 1: Author the ADR**
|
||||||
|
|
||||||
|
Write `docs/decisions/021-operational-access.md` covering every bullet above, in the
|
||||||
|
house style of `docs/decisions/020-firewall.md` (problem → decision → layers/ladder →
|
||||||
|
data model → verifier → consequences). Open with a one-line title heading
|
||||||
|
`# ADR-021 — Operational access: documented, verifiable ways in`.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Sanity-check internal links**
|
||||||
|
|
||||||
|
Run: `grep -n "ADR-01[67]\|ADR-020\|access__\|check-access\|ACCESS.md" docs/decisions/021-operational-access.md`
|
||||||
|
Expected: references to ADR-016, ADR-020, the `access__*` keys, `/check-access`, and
|
||||||
|
`ACCESS.md` all present.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/021-operational-access.md
|
||||||
|
git commit -m "docs(access): add ADR-021 operational-access doctrine"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Reconcile ADR-016 and ADR-020
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/decisions/016-mesh-vpn.md` (the "Host firewall" bullet, ~lines 64-65)
|
||||||
|
- Modify: `docs/decisions/020-firewall.md` (the "Guaranteed management plane" bullet, ~lines 42-45)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Amend ADR-016's Host-firewall bullet**
|
||||||
|
|
||||||
|
Replace the existing bullet:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
|
||||||
|
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
|
||||||
|
```
|
||||||
|
|
||||||
|
with:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
|
||||||
|
(primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
|
||||||
|
mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
|
||||||
|
outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
|
||||||
|
explicit the control-node SSH allow that the recovery model already implied; the access
|
||||||
|
doctrine and the three-tier access ladder live in **ADR-021**.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Amend ADR-020's guaranteed-management-plane bullet**
|
||||||
|
|
||||||
|
Replace:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird,
|
||||||
|
ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied
|
||||||
|
atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH
|
||||||
|
is allowed only on `wt0`.)
|
||||||
|
```
|
||||||
|
|
||||||
|
with:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
|
||||||
|
ADR-016), and SSH from the control node's LAN address (`base__firewall_control_addr`,
|
||||||
|
the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
|
||||||
|
catalog, applied atomically — a malformed or empty catalog can never lock out
|
||||||
|
management. The control-node source is part of the guaranteed plane, not the service
|
||||||
|
catalog (it is management, not a service); see ADR-021 for the access doctrine.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/016-mesh-vpn.md docs/decisions/020-firewall.md
|
||||||
|
git commit -m "docs(access): reconcile ADR-016/020 with control-node SSH source (ADR-021)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: The `ACCESS.md` record template
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/access/service-access-template.md`
|
||||||
|
|
||||||
|
Match the preamble convention of `docs/security/service-security-template.md` and
|
||||||
|
`docs/testing/service-verify-template.md` (a "copy this to `roles/<service>/ACCESS.md`"
|
||||||
|
preamble, then a `---`, then the record).
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the template**
|
||||||
|
|
||||||
|
Create `docs/access/service-access-template.md`:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# Per-service operational-access record — template
|
||||||
|
|
||||||
|
Copy this file to `roles/<service>/ACCESS.md` when building a service role (ADR-021).
|
||||||
|
It is the per-service **operational-access record**: every documented, verifiable way in
|
||||||
|
for troubleshooting. The structured parts are **rendered from the role's `access__*`
|
||||||
|
data** (the single source of truth that also drives `/check-access`) — keep the data
|
||||||
|
authoritative and regenerate this file rather than hand-editing the tables. The prose
|
||||||
|
"Operational notes" tail is hand-written.
|
||||||
|
|
||||||
|
Delete this preamble in the copy and start from the heading below.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# Access — <service>
|
||||||
|
|
||||||
|
## Access paths
|
||||||
|
|
||||||
|
The documented ways in, by tier (rendered from `access__*`):
|
||||||
|
|
||||||
|
| Tier | Path | Invocation |
|
||||||
|
|---|---|---|
|
||||||
|
| primary | `wt0` mesh SSH | `ssh <host>` (over the NetBird mesh) |
|
||||||
|
| secondary | LAN SSH from `ubongo` | `ssh <host>` (from the control node, LAN address) |
|
||||||
|
| — | container exec + compose | `docker compose -p <access__compose_project> -f <access__compose_path> ps` / `exec` |
|
||||||
|
| — | logs | Loki query for labels `<access__log.loki_labels>` (Grafana; ADR-018) |
|
||||||
|
| — | admin API | `curl -H 'Authorization: …(vault_ref)' <access__api.base_url><health_path>` — or `n/a` |
|
||||||
|
|
||||||
|
## Break-glass
|
||||||
|
|
||||||
|
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
|
||||||
|
|
||||||
|
- <Proxmox serial/VNC console for cluster VMs · Hetzner rescue for `askari` · local console for `ubongo`>
|
||||||
|
|
||||||
|
## Operational notes
|
||||||
|
|
||||||
|
Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.
|
||||||
|
|
||||||
|
- <none yet>
|
||||||
|
```
|
||||||
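Because the template's tables are meant to be rendered from `access__*` data rather than hand-edited, a small illustration of that rendering may help. The sketch below is an assumption about shape only: the function name `access_rows`, the two-column simplification (the tier column is omitted), and the compose path are hypothetical, and the plan does not specify the real renderer.

```python
def access_rows(data):
    """Build Markdown table rows for the compose and admin-API access paths."""
    compose = (
        f"docker compose -p {data['access__compose_project']} "
        f"-f {data['access__compose_path']} ps"
    )
    rows = [f"| container exec + compose | `{compose}` |"]
    api = data["access__api"]
    if api.get("enabled"):
        rows.append(f"| admin API | `{api['base_url']}{api['health_path']}` |")
    else:
        # A service without an admin API is recorded as n/a, as in the template.
        rows.append("| admin API | n/a |")
    return rows


# Hypothetical role data for a service with no admin API.
example = {
    "access__compose_project": "photoprism",
    "access__compose_path": "/opt/photoprism/compose.yml",
    "access__api": {"enabled": False, "reason": "no admin API"},
}
print("\n".join(access_rows(example)))
```

Generating the rows from the same mapping that `/check-access` reads is what keeps the record and the verifier from drifting apart.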
|
|
||||||
|
- [ ] **Step 2: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/access/service-access-template.md
|
||||||
|
git commit -m "docs(access): add ACCESS.md service record template"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Add the control-node SSH source to the host firewall (TDD)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `roles/base/defaults/main.yml`
|
||||||
|
- Modify: `roles/base/templates/nftables.conf.j2`
|
||||||
|
- Modify: `roles/base/molecule/default/converge.yml`
|
||||||
|
- Modify: `roles/base/molecule/default/verify.yml`
|
||||||
|
|
||||||
|
This is the only code in Tranche A. It adds an **optional** guaranteed-management-plane
|
||||||
|
allow for SSH from the control node's LAN address. Default empty ⇒ no rule rendered ⇒
|
||||||
|
no behaviour change until a real `ubongo` address is set in `group_vars` (build-pending).
|
||||||
|
Test path is the established one for this role: Molecule render + `nft -c` (no apply).
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing test — converge sets the knob, verify asserts the rule**
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/converge.yml`, add the knob under `vars:` (alongside
|
||||||
|
`base__firewall_apply: false`):
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
base__firewall_control_addr: 10.10.0.99 # test control-node LAN address
|
||||||
|
```
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/verify.yml`, extend the "management plane" assert block's
|
||||||
|
`that:` list (the task asserting default-deny + `wt0` SSH) with:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- "'ip saddr 10.10.0.99 tcp dport 22 accept' in nft"
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run the test to verify it fails**
|
||||||
|
|
||||||
|
Run: `make test ROLE=base`
|
||||||
|
Expected: FAIL — the verify assert "input chain is missing default-deny or the management
|
||||||
|
plane" fires, because the template does not yet render the control-address rule.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add the default knob**
|
||||||
|
|
||||||
|
In `roles/base/defaults/main.yml`, after the `base__firewall_mgmt_interface` line, add:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
base__firewall_control_addr: "" # control-node LAN address (ubongo); SSH allowed from it
|
||||||
|
# as the guaranteed-management-plane `ssh-from-control`
|
||||||
|
# source (ADR-021). Empty = no rule. Set in group_vars
|
||||||
|
# once ubongo exists.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Render the rule in the template**
|
||||||
|
|
||||||
|
In `roles/base/templates/nftables.conf.j2`, immediately after the `wt0` SSH line (the
|
||||||
|
`iifname "{{ base__firewall_mgmt_interface }}" ...` line), add:
|
||||||
|
|
||||||
|
```jinja
|
||||||
|
{% if base__firewall_control_addr %}
|
||||||
|
ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
|
||||||
|
{% endif %}
|
||||||
|
```
|
||||||
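The conditional in the template fragment above can be mirrored in plain Python to make the "empty knob renders nothing" behaviour easy to see. This is only an illustration and not how the role renders its firewall: the function name `control_ssh_rule` is hypothetical, the default port of 22 is assumed from the expected rule string in this task, and the IPv4 validation is an extra that the Jinja2 template itself does not perform.

```python
import ipaddress


def control_ssh_rule(control_addr, ssh_port=22):
    """Return the nftables accept rule for the control node, or "" when unset."""
    if not control_addr:
        # Empty knob: nothing is rendered, so behaviour is unchanged.
        return ""
    # Raises ValueError for anything that is not a valid IPv4 address.
    addr = ipaddress.IPv4Address(control_addr)
    return f"ip saddr {addr} tcp dport {ssh_port} accept"


print(control_ssh_rule("10.10.0.99"))
print(repr(control_ssh_rule("")))
```

With the test address from `converge.yml` the function yields the same string that `verify.yml` asserts on, and with the empty default it yields no rule at all.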
|
|
||||||
|
- [ ] **Step 5: Run the test to verify it passes**
|
||||||
|
|
||||||
|
Run: `make test ROLE=base`
|
||||||
|
Expected: PASS — the rule `ip saddr 10.10.0.99 tcp dport 22 accept` renders, `nft -c`
|
||||||
|
syntax-check succeeds, and all prior assertions (default-deny, `wt0` SSH, zone rules,
|
||||||
|
drop-in hook) still pass.
|
||||||
|
|
||||||
|
- [ ] **Step 6: Lint**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: PASS (no tag/FQCN/yaml regressions).
|
||||||
|
|
||||||
|
- [ ] **Step 7: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
|
||||||
|
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
|
||||||
|
git commit -m "feat(base): add ssh-from-control management-plane source (ADR-021)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Author the `/check-access` command (dormant until infra)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `.claude/commands/check-access.md`
|
||||||
|
|
||||||
|
Mirror the structure of `.claude/commands/verify-service.md` (a forward-looking command
|
||||||
|
with a hard Prerequisites gate). It does not run until `ubongo` + live/staging hosts +
|
||||||
|
vault exist; if a prerequisite is missing it must say so and stop.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the command**
|
||||||
|
|
||||||
|
Create `.claude/commands/check-access.md`:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
Operational-access verification (ADR-021)
|
||||||
|
|
||||||
|
Probe every documented way in to a service or host from `ubongo` and report which paths
|
||||||
|
are live. Reads the target's `access__*` data (and host baseline), so the verifier and
|
||||||
|
`ACCESS.md` can never disagree. Argument: a service/role name or a host
|
||||||
|
(e.g. `/check-access photoprism`, `/check-access docker01`).
|
||||||
|
|
||||||
|
## Prerequisites (forward-looking — ADR-021 dependencies)
|
||||||
|
|
||||||
|
This command cannot run until these exist. If any is missing, say so and stop; do not
|
||||||
|
improvise around it:
|
||||||
|
|
||||||
|
- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
|
||||||
|
- The target host/service is deployed (staging or production inventory).
|
||||||
|
- `roles/<name>/` carries `access__*` data (services) / the host baseline applies.
|
||||||
|
- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.
|
||||||
|
|
||||||
|
## Process
|
||||||
|
|
||||||
|
### Phase 0 — resolve the target
|
||||||
|
|
||||||
|
Resolve the argument to a host or a service role + its host. Load the `access__*` data
|
||||||
|
(service) or the host-baseline + break-glass record (host). State what you will probe.
|
||||||
|
|
||||||
|
### Phase 1 — probe each declared path
|
||||||
|
|
||||||
|
| Path | Probe | Green = |
|
||||||
|
|---|---|---|
|
||||||
|
| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
|
||||||
|
| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
|
||||||
|
| exec + compose | `docker compose -p <project> ps`; exec `true` in each `access__containers` entry | stack up, exec works |
|
||||||
|
| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
|
||||||
|
| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
|
||||||
|
| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |
|
||||||
|
|
||||||
|
Break-glass is **never exercised** — firing a serial console is invasive; confirm the
|
||||||
|
fallback exists, do not drive it.
|
||||||
|
|
||||||
|
### Phase 2 — report
|
||||||
|
|
||||||
|
Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
|
||||||
|
in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
|
||||||
|
`ssh-from-control` rule"). Verdict line: e.g. "3/4 paths green; admin API red".
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- Read-only and non-destructive — probes confirm reachability, they do not change state.
|
||||||
|
- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
|
||||||
|
control node + hosts exist.
|
||||||
|
```
|
||||||
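The Phase 2 verdict line in the command above is simple enough to sketch. The helper below is hypothetical (the command file is prose instructions, not code), and the path names in the example are just labels chosen to reproduce the sample verdict quoted in the command.

```python
def verdict(results):
    """Summarise probe results (path name -> bool) as a one-line verdict."""
    red = [name for name, ok in results.items() if not ok]
    green = len(results) - len(red)
    summary = f"{green}/{len(results)} paths green"
    if red:
        summary += "; " + ", ".join(red) + " red"
    return summary


# Hypothetical probe outcomes for one service.
probes = {
    "wt0 mesh SSH": True,
    "LAN SSH": True,
    "logs": True,
    "admin API": False,
}
print(verdict(probes))
```

Naming each red path in the verdict keeps the summary actionable without the reader having to scan the full pass/fail table.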
|
|
||||||
|
- [ ] **Step 2: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add .claude/commands/check-access.md
|
||||||
|
git commit -m "feat(access): add /check-access verifier command (ADR-021, dormant)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: Governance wiring — checklist + runbook
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/security/service-checklist.md` (the "Operability (security-adjacent)" section)
|
||||||
|
- Modify: `docs/runbooks/new-role.md` (after step 10, the VERIFY.md step)
|
||||||
|
|
||||||
|
ACCESS.md mirrors how SECURITY.md/VERIFY.md are enforced: a manual runbook step + a
|
||||||
|
checklist gate (the scaffold does not auto-drop SECURITY/VERIFY today either, so ACCESS
|
||||||
|
follows the same manual-copy pattern — no Makefile change).
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the checklist gate item**
|
||||||
|
|
||||||
|
In `docs/security/service-checklist.md`, under `## Operability (security-adjacent)`, add a
|
||||||
|
bullet after the `/verify-service` item:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
|
||||||
|
data, `roles/<service>/ACCESS.md` is rendered, and `/check-access` reports the
|
||||||
|
documented paths green — or a deviation is recorded in
|
||||||
|
`docs/security/accepted-risks.md`
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add the runbook step**
|
||||||
|
|
||||||
|
In `docs/runbooks/new-role.md`, insert a new step between step 10 (VERIFY.md) and the
|
||||||
|
final commit step, and renumber the commit step to 12:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
### 11. Write the per-service operational-access record (services)
|
||||||
|
|
||||||
|
For a **service** role, copy `docs/access/service-access-template.md` to
|
||||||
|
`roles/<rolename>/ACCESS.md` and populate the role's `access__*` data
|
||||||
|
(`access__service`, `access__compose_project`/`_path`, `access__containers`,
|
||||||
|
`access__log.loki_labels`, and `access__api` — `enabled` + endpoint + `firewall_ref` +
|
||||||
|
`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
|
||||||
|
rendered from that data; the admin-API path must `firewall_ref` an entry in the
|
||||||
|
`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
|
||||||
|
`/check-access <rolename>` proves the documented paths are live — part of the
|
||||||
|
service-clearance gate (`docs/security/service-checklist.md`).
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify renumbering**
|
||||||
|
|
||||||
|
Run: `grep -n "^### 1[12]\." docs/runbooks/new-role.md`
|
||||||
|
Expected: `### 11. Write the per-service operational-access record` and `### 12. Commit`.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/security/service-checklist.md docs/runbooks/new-role.md
|
||||||
|
git commit -m "docs(access): gate ACCESS.md in checklist + new-role runbook (ADR-021)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 7: Index wiring — CLAUDE.md, STATUS.md, TODO.md
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `CLAUDE.md` (Role conventions list + Further reading table)
|
||||||
|
- Modify: `STATUS.md` (Designed-but-not-built table)
|
||||||
|
- Modify: `docs/TODO.md` (items 3.2 and 7.2)
|
||||||
|
|
||||||
|
- [ ] **Step 1: CLAUDE.md — Role conventions**
|
||||||
|
|
||||||
|
In the `## Role conventions` list, after the `VERIFY.md` bullet
|
||||||
|
("Every **service** role must have a populated `VERIFY.md` ..."), add:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
|
||||||
|
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: CLAUDE.md — Further reading**
|
||||||
|
|
||||||
|
In the Further reading table, after the Firewall strategy row, add:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
| Operational access | `docs/decisions/021-operational-access.md` |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: STATUS.md — new rows**
|
||||||
|
|
||||||
|
In the `## Designed but not built` table, add:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
|
||||||
|
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
|
||||||
|
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
|
||||||
|
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: docs/TODO.md — mark 3.2 and 7.2 DECIDED**
|
||||||
|
|
||||||
|
In `docs/TODO.md`, change item **3.2** from:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
2. Decide how to manage APIs / API access.
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
2. ~~Decide how to manage APIs / API access.~~ DECIDED (ADR-021): per-service `access__*`
|
||||||
|
data declares the admin API (endpoint + `firewall_ref` to the catalog + vault token
|
||||||
|
ref + health path); rendered into `ACCESS.md` and probed by `/check-access`. Part of
|
||||||
|
the two-layer operational-access doctrine.
|
||||||
|
```
|
||||||
|
|
||||||
|
And change item **7.2** from:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
2. Decide what to set up on the hosts, given that direct access will be rare.
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
2. ~~Decide what to set up on the hosts, given that direct access will be rare.~~
|
||||||
|
DECIDED (ADR-021): the host-layer access baseline — SSH on `wt0` + from `ubongo`,
|
||||||
|
Docker/Compose tooling, Alloy log shipping, and a recorded break-glass console per
|
||||||
|
host class.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Verify and commit**
|
||||||
|
|
||||||
|
Run: `grep -n "021-operational-access\|ACCESS.md\|ssh-from-control" CLAUDE.md STATUS.md`
|
||||||
|
Expected: the new Role-conventions bullet, the Further-reading row, and the STATUS rows
|
||||||
|
are present.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add CLAUDE.md STATUS.md docs/TODO.md
|
||||||
|
git commit -m "docs(access): wire ADR-021 into CLAUDE.md, STATUS, TODO"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tranche B — Build-pending on infra (no tasks now)
|
||||||
|
|
||||||
|
Recorded so the boundary is explicit; nothing here is actioned by this plan.
|
||||||
|
|
||||||
|
- **Per-service `access__*` + rendered `ACCESS.md`** — authored when each service role is
|
||||||
|
built, governed by the Task 6 checklist item + runbook step. The first real service role
|
||||||
|
is where this first runs.
|
||||||
|
- **`/check-access` running** — needs `ubongo` + a live/staging host + vault. The command
|
||||||
|
(Task 5) already gates on these and stops cleanly until then.
|
||||||
|
- **Real `base__firewall_control_addr` value** — set in `group_vars/all` to `ubongo`'s LAN
|
||||||
|
address once `ubongo` is in inventory; the machinery + test landed in Task 4.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-review
|
||||||
|
|
||||||
|
**Spec coverage:** doctrine + two layers → Task 1; three-tier ladder + ADR-016/020
|
||||||
|
reconciliation → Tasks 1–2, 4; `access__*` model + invariant → Tasks 1, 3, 6; rendered
|
||||||
|
`ACCESS.md` → Task 3; `/check-access` → Task 5; governance (checklist/runbook) → Task 6;
|
||||||
|
repo wiring (CLAUDE/STATUS/TODO) → Task 7; build-now vs build-pending split → Tranches
|
||||||
|
A/B. All spec sections map to a task.
|
||||||
|
|
||||||
|
**Deviations from the spec (deliberate, flagged for the user):**
|
||||||
|
1. The spec called `ssh-from-control` a *catalog* source; the plan places it in the
|
||||||
|
*guaranteed management plane* (`base__firewall_control_addr`) instead — ADR-020 already
|
||||||
|
houses SSH/Ansible management allows there, independent of the catalog, and the spec's
|
||||||
|
own invariant says the catalog owns *service* exposure only. Same intent, correct home.
|
||||||
|
2. The spec said `make new-role` would *scaffold* an `ACCESS.md` stub; the plan instead adds
|
||||||
|
a manual runbook step (Task 6) mirroring how `SECURITY.md`/`VERIFY.md` are handled today
|
||||||
|
(also manual copies, not scaffolded). Avoids unilaterally restructuring the scaffold;
|
||||||
|
the "can't be forgotten" intent is met by the checklist gate + runbook step.
|
||||||
|
|
||||||
|
**Type/name consistency:** `base__firewall_control_addr` (knob), `access__service` /
|
||||||
|
`access__compose_project` / `access__compose_path` / `access__containers` /
|
||||||
|
`access__log.loki_labels` / `access__api.{enabled,base_url,firewall_ref,auth.vault_ref,health_path}`
|
||||||
|
are used identically across Tasks 1, 3, 5, 6. The rendered nftables rule string
|
||||||
|
`ip saddr <addr> tcp dport 22 accept` matches between Task 4's template (Step 4) and its
|
||||||
|
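assertion (Step 1), given that `base__firewall_ssh_port` renders as 22 in the Molecule scenario (the rule string Step 5 expects).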
|
||||||
556
docs/superpowers/plans/2026-06-10-adr-structure.md
Normal file
556
docs/superpowers/plans/2026-06-10-adr-structure.md
Normal file
|
|
@@ -0,0 +1,556 @@
|
||||||
|
# ADR Structure & Lifecycle Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Codify how boma's ADRs are structured: a canonical section set, a Proposed/Accepted/Superseded/Deprecated lifecycle, a template, a lightweight enforcement check, and a one-time restructure of the back-catalogue (ADRs 001–018) to conform.
|
||||||
|
|
||||||
|
**Architecture:** Five independent units. (1) A pure-function `adr-structure` check added to the existing `scripts/repo-scan.py` (stdlib only, pytest-tested like its siblings), verifying that every numbered ADR has the four mandatory sections and a parseable Status line; it checks presence only, not order. (2) An `adr-template.md` scaffold. (3) ADR-023 itself, written to pass its own check. (4) Wiring into CLAUDE.md and the `/review-repo` command doc. (5) A presentational restructure of ADRs 001–018 to full four-section conformance, with each Status dated from the file's first git commit.
|
||||||
|
|
||||||
|
**Tech Stack:** Python 3 stdlib (`scripts/repo-scan.py`), pytest (`.venv/bin/pytest`), Markdown, git.
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-10-adr-structure-design.md`
|
||||||
|
|
||||||
|
**Branch:** `feat/adr-structure` (already created; the design spec is the first commit).
|
||||||
|
|
||||||
|
**Convention reminders (from CLAUDE.md):** docs-/script-only commits skip the ansible-lint pre-commit hook and need no `rbw` unlock. Imperative subject ≤72 chars. `Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>` trailer on every commit.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decisions locked by the spec (do not re-litigate)
|
||||||
|
|
||||||
|
- **Mandatory sections, in this order:** `## Status`, `## Context`, `## Decision`, `## Consequences`.
|
||||||
|
- **Optional sections:** `## Related`, `## Scope`, `## Guardrails` / `## Enforcement`, `## What was ruled out`, `## Verified facts (ADR-014)`.
|
||||||
|
- **Status lifecycle (4 states):** `Proposed (YYYY-MM-DD)` (genuine drafts, e.g. ADR-011) → `Accepted (YYYY-MM-DD)` (the common starting state) → optionally `Superseded by ADR-NNN (YYYY-MM-DD)` or `Deprecated (YYYY-MM-DD)`. (`Proposed` was added on the evidence of ADR-011, which is a real draft with open questions.)
|
||||||
|
- **No silent rewrites:** material reversal = new ADR + `Superseded by` marker; bidirectional link.
|
||||||
|
- **Enforcement checks presence + parseable Status line, NOT section order.** Order is demonstrated by the template, not machine-enforced.
|
||||||
|
- **Back-catalogue is fully restructured (no grandfathering)** — ADRs 001–018 are brought to all-four-section conformance. The restructure is **presentational**: relabel/regroup/demote existing headings, add a dated Status, assemble a Consequences section from implications the ADR already states. **The substance of no decision is changed.** If a faithful Consequences cannot be drawn from existing content, escalate that file rather than inventing one.
|
||||||
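The lifecycle bullet above can be read as a small state machine. The sketch below encodes only the transitions this plan names; the `Status` enum, `ALLOWED` table, and `can_transition` helper are illustrative names and are not part of `scripts/repo-scan.py`.

```python
from enum import Enum


class Status(Enum):
    PROPOSED = "Proposed"
    ACCEPTED = "Accepted"
    SUPERSEDED = "Superseded"
    DEPRECATED = "Deprecated"


# Only the transitions named in the plan; anything else is treated as
# not allowed (an assumption of this sketch, not a stated rule).
ALLOWED = {
    Status.PROPOSED: {Status.ACCEPTED},
    Status.ACCEPTED: {Status.SUPERSEDED, Status.DEPRECATED},
    Status.SUPERSEDED: set(),
    Status.DEPRECATED: set(),
}


def can_transition(current, target):
    """Return True when the plan names a move from `current` to `target`."""
    return target in ALLOWED[current]
```

Nothing in the plan enforces transitions mechanically; the check in Task 1 only looks at the current Status line.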
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 1: `adr-structure` check in repo-scan.py
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `scripts/repo-scan.py` (add module-level regexes near the other `_RE` definitions ~line 38–44; add `adr_structure_findings()` next to `deferred_findings()` ~line 96; wire it into `scan()` at the `findings.extend(...)` site ~line 215)
|
||||||
|
- Test: `tests/test_repo_scan.py` (new)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing test**
|
||||||
|
|
||||||
|
Create `tests/test_repo_scan.py`:
|
||||||
|
|
||||||
|
```python
import importlib.util
import pathlib

# repo-scan.py has a hyphen in its name, so it is loaded by path, not imported.
_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "repo-scan.py"
_spec = importlib.util.spec_from_file_location("repo_scan", _PATH)
rs = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(rs)

GOOD = [
    "# ADR-099 — Example\n", "\n",
    "## Status\n", "\n", "Accepted (2026-06-10)\n", "\n",
    "## Context\n", "\n", "Why.\n", "\n",
    "## Decision\n", "\n", "What.\n", "\n",
    "## Consequences\n", "\n", "So what.\n",
]


def _checks(findings):
    return [f for f in findings if f["check"] == "adr-structure"]


def test_good_adr_has_no_findings():
    out = rs.adr_structure_findings({"docs/decisions/099-example.md": GOOD})
    assert _checks(out) == []


def test_missing_mandatory_section_is_flagged():
    lines = [ln for ln in GOOD if not ln.startswith("## Consequences")]
    out = _checks(rs.adr_structure_findings({"docs/decisions/099-example.md": lines}))
    assert len(out) == 1
    assert "Consequences" in out[0]["detail"]


def test_unparseable_status_is_flagged():
    lines = [("Designed, not built.\n" if ln == "Accepted (2026-06-10)\n" else ln)
             for ln in GOOD]
    out = _checks(rs.adr_structure_findings({"docs/decisions/099-example.md": lines}))
    assert len(out) == 1
    assert "Status not parseable" in out[0]["detail"]


def test_superseded_status_is_accepted():
    lines = [("Superseded by ADR-100 (2026-06-11)\n" if ln == "Accepted (2026-06-10)\n"
              else ln) for ln in GOOD]
    out = _checks(rs.adr_structure_findings({"docs/decisions/099-example.md": lines}))
    assert out == []


def test_non_numbered_file_is_skipped():
    bare = ["# ADR template\n", "\n", "## Status\n", "\n", "<!-- hint -->\n"]
    out = _checks(rs.adr_structure_findings({"docs/decisions/adr-template.md": bare}))
    assert out == []
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run the test to verify it fails**
|
||||||
|
|
||||||
|
Run: `.venv/bin/pytest tests/test_repo_scan.py -q`
|
||||||
|
Expected: FAIL — `AttributeError: module 'repo_scan' has no attribute 'adr_structure_findings'`.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add the regexes**
|
||||||
|
|
||||||
|
In `scripts/repo-scan.py`, after the `RESOLVE_WORD_RE = ...` line (~line 44), add:
|
||||||
|
|
||||||
|
```python
# ADR-structure check (ADR-023): numbered ADRs must carry the four mandatory
# sections and a parseable Status line. Presence only: section ORDER is a
# template-demonstrated convention, not machine-enforced.
ADR_FILE_RE = re.compile(r"^\d{3}-.*\.md$")
ADR_REQUIRED_SECTIONS = ("Status", "Context", "Decision", "Consequences")
# Prefix match on the four lifecycle states; a trailing note is tolerated.
ADR_STATUS_LINE_RE = re.compile(
    r"^(Proposed \(\d{4}-\d{2}-\d{2}\)"
    r"|Accepted \(\d{4}-\d{2}-\d{2}\)"
    r"|Superseded by ADR-\d{3}"
    r"|Deprecated \(\d{4}-\d{2}-\d{2}\))")
```
|
||||||
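A quick way to sanity-check a Status pattern against the lifecycle lines this plan lists. This is a standalone sketch, not the code in `repo-scan.py`: `STATUS_RE` and `is_parseable` are illustrative names, and the pattern assumes all four lifecycle states from the locked decisions (including `Proposed`) should be recognised.

```python
import re

# Hypothetical pattern covering the four lifecycle states named in this plan.
# It is a prefix match, so a trailing note after the lifecycle clause passes.
DATE = r"\(\d{4}-\d{2}-\d{2}\)"
STATUS_RE = re.compile(
    rf"^(Proposed {DATE}"
    rf"|Accepted {DATE}"
    rf"|Superseded by ADR-\d{{3}}"
    rf"|Deprecated {DATE})"
)


def is_parseable(status_line):
    """Return True when the line starts with a recognised lifecycle clause."""
    return STATUS_RE.match(status_line.strip()) is not None


for sample in (
    "Accepted (2026-06-10). Doctrine ADR.",
    "Proposed (2026-06-10)",
    "Superseded by ADR-100 (2026-06-11)",
    "Designed, not built.",
):
    print(f"{sample!r}: {is_parseable(sample)}")
```

Because the match is anchored only at the start, a build-state note such as the one carried by ADRs 016–018 can follow the lifecycle clause without tripping the check.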
|
|
||||||
|
- [ ] **Step 4: Add the check function**
|
||||||
|
|
||||||
|
In `scripts/repo-scan.py`, immediately after the `deferred_findings(...)` function (it ends ~line 96, just before `def walk_files():`), add:
|
||||||
|
|
||||||
|
```python
def adr_structure_findings(adr_files):
    """adr_files: {rel_path: [lines]} for docs/decisions/*.md.
    Flags numbered ADRs (NNN-*.md) missing a mandatory section or whose Status
    section has no parseable lifecycle line. Non-numbered files (e.g.
    adr-template.md) are skipped. Section order is NOT checked (ADR-023)."""
    out = []
    for rpath, lines in sorted(adr_files.items()):
        if not ADR_FILE_RE.match(os.path.basename(rpath)):
            continue
        # Key each level-2 heading by its first word; first occurrence wins.
        headings = {}
        for i, line in enumerate(lines):
            m = re.match(r"^##\s+(\w+)", line)
            if m:
                headings.setdefault(m.group(1), i)
        missing = [s for s in ADR_REQUIRED_SECTIONS if s not in headings]
        if missing:
            out.append({"check": "adr-structure", "severity": "medium",
                        "path": rpath, "line": 1,
                        "detail": f"missing mandatory section(s): {', '.join(missing)}"})
        if "Status" in headings:
            # Collect the Status body up to the next level-2 heading.
            body = []
            for line in lines[headings["Status"] + 1:]:
                if line.startswith("## "):
                    break
                body.append(line)
            # Only the first non-blank line is treated as the lifecycle line.
            status_text = next((ln.strip() for ln in body if ln.strip()), "")
            if not ADR_STATUS_LINE_RE.match(status_text):
                out.append({"check": "adr-structure", "severity": "medium",
                            "path": rpath, "line": headings["Status"] + 1,
                            "detail": "Status not parseable (want e.g. 'Accepted (YYYY-MM-DD)', "
                                      "'Superseded by ADR-NNN', or 'Deprecated (YYYY-MM-DD)'); "
                                      f"got: {status_text[:60]!r}"})
    return out
```
|
||||||
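The heading scan in the check keys each level-2 heading by its first word only. The sketch below isolates that step to show what it implies; `heading_keys` is an illustrative helper name, and the sample lines are made up for the demonstration.

```python
import re

HEADING_RE = re.compile(r"^##\s+(\w+)")  # same pattern the check uses


def heading_keys(lines):
    """Map the first word of each level-2 heading to its first line index."""
    keys = {}
    for i, line in enumerate(lines):
        m = HEADING_RE.match(line)
        if m:
            # setdefault keeps the first occurrence if a heading repeats.
            keys.setdefault(m.group(1), i)
    return keys


sample = [
    "# ADR-099 Example\n",
    "## Status\n",
    "## Decisions\n",          # plural: keyed as "Decisions", not "Decision"
    "### 1. Sub-decision\n",   # level 3: ignored by the pattern
    "## What was ruled out\n",  # keyed by first word only: "What"
]
print(heading_keys(sample))
```

Two consequences worth keeping in mind: a `## Decisions` heading does not satisfy the `Decision` requirement (which is why Task 5 relabels it in 010/011), and `###` sub-headings never count toward the mandatory set.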
|
|
||||||
|
- [ ] **Step 5: Run the test to verify it passes**
|
||||||
|
|
||||||
|
Run: `.venv/bin/pytest tests/test_repo_scan.py -q`
|
||||||
|
Expected: PASS — 5 passed.
|
||||||
|
|
||||||
|
- [ ] **Step 6: Wire the check into `scan()`**
|
||||||
|
|
||||||
|
In `scripts/repo-scan.py`, find (~line 215):
|
||||||
|
|
||||||
|
```python
    findings.extend(deferred_findings(adr_files, defer_refs))
    return findings
```
|
||||||
|
|
||||||
|
Replace with:
|
||||||
|
|
||||||
|
```python
    findings.extend(deferred_findings(adr_files, defer_refs))
    findings.extend(adr_structure_findings(adr_files))
    return findings
```
|
||||||
|
|
||||||
|
- [ ] **Step 7: Confirm the check fires on the real (not-yet-backfilled) repo**
|
||||||
|
|
||||||
|
Run: `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; print(sorted({f['path'] for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure'}))"`
|
||||||
|
Expected: a list running from `docs/decisions/001-architecture.md` through `docs/decisions/018-logging.md` (001–015 missing Status; 016–018 unparseable Status). 019–022 must NOT appear, and 023 cannot appear yet because it is not created until Task 3. This proves the check works and previews Task 5's worklist.
|
||||||
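The inline `python3 -c` filter above is dense; the same logic is easier to read as a function. This is a sketch: `adr_structure_paths` is an illustrative name, and the sample report (including the `markers` check name and the `roles/...` path) is invented to show the `{"findings": [...]}` shape the one-liner reads.

```python
import json


def adr_structure_paths(report):
    """Return the sorted, de-duplicated paths carrying an adr-structure finding."""
    return sorted({f["path"] for f in report["findings"]
                   if f["check"] == "adr-structure"})


# Illustrative report; real input would come from scripts/repo-scan.py.
sample = json.loads("""
{"findings": [
  {"check": "adr-structure", "path": "docs/decisions/002-security.md"},
  {"check": "markers", "path": "roles/base/tasks/main.yml"},
  {"check": "adr-structure", "path": "docs/decisions/001-architecture.md"},
  {"check": "adr-structure", "path": "docs/decisions/001-architecture.md"}
]}
""")
print(adr_structure_paths(sample))
```

The set comprehension matters because one ADR can produce two findings (a missing section and an unparseable Status), and the worklist should name each file once.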
|
|
||||||
|
- [ ] **Step 8: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add scripts/repo-scan.py tests/test_repo_scan.py
|
||||||
|
git commit -m "feat(review): add adr-structure check to repo-scan
|
||||||
|
|
||||||
|
Flags numbered ADRs missing a mandatory section (Status/Context/Decision/
|
||||||
|
Consequences) or with an unparseable Status line. Presence only, not order.
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 2: ADR template
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/decisions/adr-template.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the template**
|
||||||
|
|
||||||
|
Create `docs/decisions/adr-template.md` with exactly:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# ADR-NNN — <Title>: <optional clarifying subtitle>
|
||||||
|
|
||||||
|
<!-- Filename: NNN-kebab-title.md (zero-padded, monotonic, never reused).
|
||||||
|
Register a row in CLAUDE.md "Further reading" when this ADR is created.
|
||||||
|
Sections below in order. Mandatory: Status, Context, Decision, Consequences.
|
||||||
|
Delete this comment and any optional section you don't use. -->
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (YYYY-MM-DD)
|
||||||
|
<!-- Lifecycle: "Accepted (YYYY-MM-DD)" → later "Superseded by ADR-NNN (YYYY-MM-DD)"
|
||||||
|
or "Deprecated (YYYY-MM-DD)" + one-line why. Optional trailing note OK, e.g.
|
||||||
|
"Accepted (2026-06-10). Doctrine ADR — pins policy, builds nothing yet." -->
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
<!-- The forces, the problem, what exists today, why now. -->
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
<!-- What we are doing. Use numbered sub-decisions (### 1. ...) for multi-part ADRs. -->
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
<!-- Results, trade-offs explicitly accepted, follow-on work. -->
|
||||||
|
|
||||||
|
<!-- Optional sections — uncomment any that genuinely apply; never pad:
|
||||||
|
|
||||||
|
## Scope — explicit in / out-of-scope boundaries.
|
||||||
|
|
||||||
|
## Guardrails — how the decision is mechanically enforced (lint, CI, hooks).
|
||||||
|
|
||||||
|
## What was ruled out — rejected alternatives, each with its reason.
|
||||||
|
|
||||||
|
## Verified facts (ADR-014) — verified: <subject> · <tool> <version> · <source> · <YYYY-MM-DD>
|
||||||
|
|
||||||
|
## Related — links to other ADRs by number; bidirectional for Supersedes/Superseded-by.
|
||||||
|
-->
|
||||||
|
```
|
||||||
|
|
||||||
|
(HTML comments do not nest — optional sections use one flat comment block with inline
|
||||||
|
em-dash descriptions, not commented sub-hints inside an outer comment.)
|
||||||
|
|
||||||
|
- [ ] **Step 2: Confirm the template is skipped by the check**
|
||||||
|
|
||||||
|
Run: `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; print([f for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure' and 'adr-template' in f['path']])"`
|
||||||
|
Expected: `[]` (non-numbered filename → skipped).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/adr-template.md
|
||||||
|
git commit -m "docs(adr): add adr-template.md scaffold (ADR-023)
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 3: ADR-023 itself
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/decisions/023-adr-structure.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write ADR-023**
|
||||||
|
|
||||||
|
Create `docs/decisions/023-adr-structure.md`. It must pass its own check (Status/Context/Decision/Consequences present; parseable Status line). Use this content:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# ADR-023 — ADR structure & lifecycle
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (2026-06-10). Meta/doctrine ADR — pins how ADRs are written; the
|
||||||
|
`adr-structure` check (`scripts/repo-scan.py`) and `docs/decisions/adr-template.md`
|
||||||
|
ship with it, and ADRs 001–018 were retroactively restructured to conform. Resolves
|
||||||
|
the FRICTION signal (2026-05-31) about ADR-writing policy being unsettled.
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
boma records architectural decisions as numbered ADRs in `docs/decisions/`, and
|
||||||
|
CLAUDE.md treats them as load-bearing. Yet no ADR said how an ADR is written. The
|
||||||
|
newest ADRs (019–022) converged on a clean shape — Status → Context → Decision →
|
||||||
|
Consequences → Related — but only by imitation. ADRs 001–018 predate it and drifted
|
||||||
|
widely: most lacked a `## Status` section entirely (016–018 carried only a trailing
|
||||||
|
build-state note), and many lacked an explicit `## Decision` or `## Consequences`
|
||||||
|
heading, their decisions spread across ad-hoc topical sections. The result was
|
||||||
|
structural drift and no uniform way to tell an active decision from a superseded or
|
||||||
|
deprecated one.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### 1. Title & filename
|
||||||
|
|
||||||
|
Title line: `# ADR-NNN — <Title>: <optional clarifying subtitle>` (em-dash). Filename:
|
||||||
|
`NNN-kebab-title.md`, zero-padded 3-digit, monotonic, never reused — a superseded ADR
|
||||||
|
keeps its number and file. A new ADR is registered as a row in the CLAUDE.md
|
||||||
|
"Further reading" table.
|
||||||
|
|
||||||
|
### 2. Mandatory sections, in this order
|
||||||
|
|
||||||
|
- `## Status` — a lifecycle line, usually `Accepted (YYYY-MM-DD)` (see §4), plus an
|
||||||
|
optional one-line note.
|
||||||
|
- `## Context` — the forces, the problem, what exists today, why now.
|
||||||
|
- `## Decision` — what we are doing; numbered sub-decisions for multi-part ADRs.
|
||||||
|
- `## Consequences` — results, trade-offs explicitly accepted, follow-on work.
|
||||||
|
|
||||||
|
### 3. Optional sections (use only where they genuinely apply)
|
||||||
|
|
||||||
|
`## Related`, `## Scope`, `## Guardrails` / `## Enforcement`, `## What was ruled out`,
|
||||||
|
`## Verified facts (ADR-014)`.
|
||||||
|
|
||||||
|
### 4. Status lifecycle
|
||||||
|
|
||||||
|
Four states. Because boma is single-contributor and trunk-based with no review gate,
|
||||||
|
most ADRs are **born `Accepted (YYYY-MM-DD)`** — committed-to on writing. A
|
||||||
|
**`Proposed`** state exists for a genuine draft whose core direction is recorded but
|
||||||
|
whose specifics are still open for discussion (e.g. ADR-011); it is promoted to
|
||||||
|
`Accepted` once settled.
|
||||||
|
|
||||||
|
- **`Proposed (YYYY-MM-DD)`** — drafted, under discussion, not yet committed-to. May
|
||||||
|
carry open questions. Promoted to `Accepted (YYYY-MM-DD)` when decided.
|
||||||
|
- **`Accepted (YYYY-MM-DD)`** — committed-to. The common starting state.
|
||||||
|
- Replaced → old ADR's Status becomes **`Superseded by ADR-NNN (YYYY-MM-DD)`**; the new
|
||||||
|
ADR records `Supersedes ADR-MMM` in its Status and `## Related`. The link is
|
||||||
|
**bidirectional**.
|
||||||
|
- Retired with no replacement → **`Deprecated (YYYY-MM-DD)`** + a one-line reason.
|
||||||
|
|
||||||
|
**No silent rewrites.** An Accepted ADR is not edited to reverse its decision. Typo and
|
||||||
|
clarity fixes are fine; a material reversal requires a new ADR and a `Superseded by`
|
||||||
|
marker on the old one.
|
||||||
|
|
||||||
|
### 5. Template & enforcement
|
||||||
|
|
||||||
|
`docs/decisions/adr-template.md` is the scaffold for new ADRs. The `/review-repo`
|
||||||
|
command's pre-scan (`scripts/repo-scan.py`) emits an `adr-structure` finding for any
|
||||||
|
numbered ADR missing a mandatory section or with an unparseable Status line. It checks
|
||||||
|
**presence and Status, not section order** — order is a convention the template carries,
|
||||||
|
deliberately not gated, to keep enforcement lightweight (consistent with boma's other
|
||||||
|
doctrine ADRs adding no CI gate).
|
||||||
|
|
||||||
|
### 6. Retroactive conformance of the back-catalogue
|
||||||
|
|
||||||
|
ADRs 001–018 are restructured to satisfy this standard rather than grandfathered. The
|
||||||
|
restructure is **presentational** — existing headings are relabelled, regrouped, or
|
||||||
|
demoted under a `## Decision` umbrella; a dated `## Status` is added; a `## Consequences`
|
||||||
|
section is assembled from implications the ADR already states. **The substance of no
|
||||||
|
decision is changed.** This keeps the check uniform (no number threshold) and the corpus
|
||||||
|
a consistent, legible decision history.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- New ADRs have one obvious shape and a scaffold; structural drift stops.
|
||||||
|
- Every ADR declares its lifecycle state uniformly, and reversals are traceable.
|
||||||
|
- The whole corpus conforms; the check needs no grandfathering and stays simple.
|
||||||
|
- One-time restructure churn across ADRs 001–018 (heading reorganization + a Status and
|
||||||
|
a Consequences section per file; no decision substance changed).
|
||||||
|
- `/review-repo` grows one deterministic check; no new CI machinery.
|
||||||
|
- This ADR is the first conformant example and is held to its own check.
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
- **A `make lint` / CI gate for ADR structure** — heavier than the risk warrants;
|
||||||
|
the `/review-repo` check and the template suffice.
|
||||||
|
- **Machine-enforcing section order** — brittle for marginal value; left as a
|
||||||
|
template-demonstrated convention.
|
||||||
|
- **Grandfathering 001–018 from the check** — rejected in favour of restructuring the
|
||||||
|
whole corpus to conform, so the standard applies uniformly with no exceptions.
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
- ADR-014 — knowledge sourcing (the `Verified facts` optional section).
|
||||||
|
- ADR-019/020/021/022 — the emergent structure this ADR codifies.
|
||||||
|
- `docs/decisions/adr-template.md` — the scaffold.
|
||||||
|
- `scripts/repo-scan.py` — the `adr-structure` enforcement check.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Confirm ADR-023 passes its own check**
|
||||||
|
|
||||||
|
Run: `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; print([f for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure' and '023-' in f['path']])"`
|
||||||
|
Expected: `[]`.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/023-adr-structure.md
|
||||||
|
git commit -m "docs(adr): ADR-023 — ADR structure & lifecycle
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 4: Wire into CLAUDE.md and the review-repo command doc
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `CLAUDE.md` ("Further reading" table)
|
||||||
|
- Modify: `.claude/commands/review-repo.md` (the deterministic-findings description, ~line 26–28)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the CLAUDE.md "Further reading" row**
|
||||||
|
|
||||||
|
In `CLAUDE.md`, in the "Further reading" table, after the `Backup & disaster recovery` row, add:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
| ADR structure & lifecycle | `docs/decisions/023-adr-structure.md` |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Mention the new check in review-repo.md**
|
||||||
|
|
||||||
|
In `.claude/commands/review-repo.md`, find (~line 27–28):
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
(roles, ADRs, runbooks, playbooks, scripts — your shard list) and **exact findings**
|
||||||
|
(markers, broken refs, unencrypted vaults). Fold these into the report verbatim.
|
||||||
|
```
|
||||||
|
|
||||||
|
Replace the parenthetical with:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
(roles, ADRs, runbooks, playbooks, scripts — your shard list) and **exact findings**
|
||||||
|
(markers, broken refs, unencrypted vaults, ADR-structure violations). Fold these into
|
||||||
|
the report verbatim.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify the CLAUDE.md link resolves**
|
||||||
|
|
||||||
|
Run: `test -f docs/decisions/023-adr-structure.md && echo OK`
|
||||||
|
Expected: `OK`.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add CLAUDE.md .claude/commands/review-repo.md
|
||||||
|
git commit -m "docs(adr): register ADR-023 and note adr-structure check
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 5: Retroactively restructure ADRs 001–018 to full conformance
|
||||||
|
|
||||||
|
**Goal:** every ADR in 001–018 ends with all four mandatory sections present and a
|
||||||
|
parseable Status line, so the `adr-structure` check reports zero findings — **without
|
||||||
|
changing the substance of any decision.**
|
||||||
|
|
||||||
|
**Files (current findings — the exact worklist):**
|
||||||
|
- Missing `Status` + `Consequences`: `001-architecture.md`, `002-security.md`, `004-docker-model.md`, `005-bootstrapping.md`, `014-knowledge-sourcing.md`
|
||||||
|
- Missing `Status` + `Decision` + `Consequences`: `006-terraform.md`, `007-network.md`, `008-testing.md`, `009-provisioning-handoff.md`, `010-forgejo-ci.md`, `011-update-management.md`
|
||||||
|
- Missing all four: `003-toolchain.md`
|
||||||
|
- Missing `Status` + `Decision`: `013-heritage-v4.md`
|
||||||
|
- Missing `Status` only: `012-hardware-capacity.md`, `015-control-host.md`
|
||||||
|
- Have unparseable `Status` + missing `Consequences`: `016-mesh-vpn.md`, `017-service-ui-verification.md`, `018-logging.md`
|
||||||
|
|
||||||
|
(`010`/`011` use `## Decisions` (plural) → relabel to `## Decision`. The "missing
|
||||||
|
Decision" cases generally have the decision spread across topical `##` headings.)
|
||||||
|
|
||||||
|
**THE FAITHFULNESS RULE (non-negotiable):** This is a *presentational* restructure.
|
||||||
|
You MAY: add a `## Status` section; relabel a heading (`## Decisions` → `## Decision`);
|
||||||
|
introduce a `## Decision` umbrella heading and **demote** existing topical `##` headings
|
||||||
|
to `###` beneath it; add a `## Consequences` section. You MUST NOT alter any existing
|
||||||
|
sentence of decision prose, reword arguments, or add new policy. A `## Consequences`
|
||||||
|
section is assembled **only** from implications the ADR already states (its trade-offs,
|
||||||
|
"what was ruled out", "open questions", named follow-on work). **If an ADR states
|
||||||
|
nothing that can be faithfully cast as a consequence, STOP and report it as
|
||||||
|
DONE_WITH_CONCERNS / escalate — do not invent consequences.**
|
||||||
|
|
||||||
|
**Per-file date source:** the file's first git-commit (add) date —
|
||||||
|
`git log --diff-filter=A --format=%as -- <path> | tail -1` (yields `YYYY-MM-DD`).
|
||||||
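The date lookup above can be scripted when working through all eighteen files. A minimal sketch, assuming it runs inside the repository with `git` on the PATH; `first_add_date`, `oldest_date`, and `status_block` are illustrative helper names, not existing code.

```python
import subprocess


def oldest_date(git_log_output):
    """git log prints newest first, so the last non-empty line is the oldest add."""
    dates = [ln.strip() for ln in git_log_output.splitlines() if ln.strip()]
    if not dates:
        raise ValueError("file has no add commit in history")
    return dates[-1]


def first_add_date(path, repo_root="."):
    """Return the YYYY-MM-DD author date of the commit that first added `path`."""
    out = subprocess.run(
        ["git", "log", "--diff-filter=A", "--format=%as", "--", path],
        cwd=repo_root, check=True, capture_output=True, text=True,
    ).stdout
    return oldest_date(out)


def status_block(date):
    """Render the Status section that Step 1 inserts."""
    return f"## Status\n\nAccepted ({date})\n"
```

Usage would look like `status_block(first_add_date("docs/decisions/001-architecture.md"))`. If a file was ever renamed, the plain command may report the rename commit rather than the original add, so spot-check any date that looks too recent.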
|
|
||||||
|
- [ ] **Step 1: Add a dated `## Status` section to each ADR**
|
||||||
|
|
||||||
|
For 001–015 (no Status today): insert, between the title line and the first `##`
|
||||||
|
heading, a Status section:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (<d>)
|
||||||
|
```
|
||||||
|
|
||||||
|
where `<d>` is the file's first-git-commit date. For 016/017/018 (unparseable Status
|
||||||
|
today): prepend a parseable `Accepted (<d>). ` clause to the first line of their
|
||||||
|
existing `## Status` section so the build-state note becomes its tail, e.g.
|
||||||
|
`Accepted (2026-06-05). Designed. **Authorable now:** ...`.
|
||||||
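Both Status edits in this step are mechanical line operations. The sketch below shows one way to express them over a list of lines; `insert_status` and `prepend_accepted` are hypothetical helpers, and the step can equally be done by hand.

```python
def insert_status(lines, date):
    """Insert a Status section before the first level-2 heading (001-015 case)."""
    for i, line in enumerate(lines):
        if line.startswith("## "):
            block = ["## Status\n", "\n", f"Accepted ({date})\n", "\n"]
            return lines[:i] + block + lines[i:]
    raise ValueError("no level-2 heading found")


def prepend_accepted(lines, date):
    """Prefix the first non-blank Status line with 'Accepted (<date>). '
    so the existing build-state note becomes its tail (016-018 case)."""
    start = lines.index("## Status\n") + 1  # ValueError if there is no Status
    for i in range(start, len(lines)):
        if lines[i].startswith("## "):
            break
        if lines[i].strip():
            return lines[:i] + [f"Accepted ({date}). {lines[i]}"] + lines[i + 1:]
    raise ValueError("Status section has no body line")
```

Neither helper touches any line other than the one it inserts or prefixes, which keeps the change within the faithfulness rule.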
|
|
||||||
|
- [ ] **Step 2: Ensure a `## Decision` section exists**
|
||||||
|
|
||||||
|
For ADRs flagged "missing Decision" (003, 006, 007, 008, 009, 010, 011, 013): relabel a
|
||||||
|
plural/synonym heading where one exists (`## Decisions` → `## Decision` in 010/011), or
|
||||||
|
introduce a `## Decision` umbrella immediately after `## Context` and demote the existing
|
||||||
|
topical `##` body headings (e.g. in 003: "Execution engine", "Python environment", …) to
|
||||||
|
`###`. Do not move or rewrite the prose under them.
|
||||||
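The umbrella-and-demote operation can be sketched as a pure function over lines. Which headings count as topical is a human judgement per ADR, so the caller passes them in; `add_decision_umbrella` is an illustrative name and the sample document is invented.

```python
def add_decision_umbrella(lines, topical):
    """Insert '## Decision' before the first topical heading and demote each
    topical '## X' heading to '### X'. Body lines pass through untouched.

    `topical` is a set of heading titles chosen by the editor."""
    out, inserted = [], False
    for line in lines:
        title = line[3:].strip() if line.startswith("## ") else None
        if title in topical:
            if not inserted:
                out += ["## Decision\n", "\n"]
                inserted = True
            out.append("#" + line)  # '## X' becomes '### X'
        else:
            out.append(line)
    return out


doc = [
    "# ADR-099 Example\n", "\n",
    "## Context\n", "\n", "Why.\n", "\n",
    "## Execution engine\n", "\n", "Text A.\n", "\n",
    "## Python environment\n", "\n", "Text B.\n",
]
print("".join(add_decision_umbrella(doc, {"Execution engine", "Python environment"})))
```

Headings not named in `topical` (for example `## Context` or a trailing `## What was ruled out`) keep their level, and no prose line is moved or rewritten.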
|
|
||||||
|
- [ ] **Step 3: Ensure a `## Consequences` section exists**
|
||||||
|
|
||||||
|
For every ADR flagged "missing Consequences" (001, 002, 003, 004, 005, 006, 007, 008,
|
||||||
|
009, 010, 011, 014, 016, 017, 018): add a `## Consequences` section near the end,
|
||||||
|
assembled strictly from implications the ADR already states. Where an ADR has a trailing
|
||||||
|
section that *is* consequences under another name (e.g. "What was ruled out", "Open
|
||||||
|
questions", "Trade-offs"), you may keep that section and add a short `## Consequences`
|
||||||
|
that references/summarizes the already-stated trade-offs — without introducing new
|
||||||
|
claims. **Honour the faithfulness rule; escalate any ADR where no faithful Consequences
|
||||||
|
can be drawn.**
|
||||||
|
|
||||||
|
- [ ] **Step 4: Verify the whole corpus passes the check**
|
||||||
|
|
||||||
|
Run: `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; v=[f for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure']; print('adr-structure findings:', len(v)); [print(' ', f['path'], '—', f['detail']) for f in v]"`
|
||||||
|
Expected: `adr-structure findings: 0`.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Verify faithfulness via diff**
|
||||||
|
|
||||||
|
Run: `git diff --stat` and spot-check `git diff docs/decisions/003-toolchain.md`.
|
||||||
|
Expected: changes are heading additions/relabels/level-demotions, a new Status section,
|
||||||
|
and a new Consequences section — **no edits to existing decision sentences.**
|
||||||
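A diff spot-check can be backed by a small script that confirms no existing prose line was dropped or reworded. This is a rough sketch using the standard library's `difflib`; `dropped_prose` is a hypothetical helper. It treats any line starting with `#` as a heading, so `#` comments inside fenced code would be ignored too, and the deliberately prefixed Status line in 016–018 will be reported and needs a manual look.

```python
import difflib


def dropped_prose(old_lines, new_lines):
    """Return non-blank, non-heading lines of the old file that do not
    survive verbatim and in order in the new file."""
    def prose(lines):
        return [ln.rstrip("\n") for ln in lines
                if ln.strip() and not ln.lstrip().startswith("#")]

    old, new = prose(old_lines), prose(new_lines)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'delete' and 'replace' both mean an old line is gone or altered.
        if tag in ("delete", "replace"):
            dropped.extend(old[i1:i2])
    return dropped
```

An empty result means every original prose line is still present in its original order; added Status and Consequences lines show up only as insertions and are not reported.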
|
|
||||||
|
- [ ] **Step 6: Run the repo-scan test suite**
|
||||||
|
|
||||||
|
Run: `.venv/bin/pytest tests/test_repo_scan.py -q`
|
||||||
|
Expected: PASS — 5 passed.
|
||||||
|
|
||||||
|
- [ ] **Step 7: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/0*.md
|
||||||
|
git commit -m "docs(adr): restructure ADRs 001-018 to ADR-023 conformance
|
||||||
|
|
||||||
|
Presentational only: add a dated Status section, relabel/regroup headings
|
||||||
|
under Decision, and add a Consequences section assembled from each ADR's
|
||||||
|
already-stated implications. No decision substance changed.
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Final verification (after all tasks)
|
||||||
|
|
||||||
|
- [ ] **Lint:** `make lint` — Expected: passes (docs + a stdlib script touched; ansible content unchanged).
|
||||||
|
- [ ] **Full deterministic scan clean for our check:** `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; print('adr-structure:', sum(1 for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure'))"` → `adr-structure: 0`.
|
||||||
|
- [ ] **Tests green:** `.venv/bin/pytest tests/ -q` → all pass.
|
||||||
|
- [ ] **Branch ready:** invoke `superpowers:finishing-a-development-branch` to merge `feat/adr-structure` to `main` (trunk-based, no PR) and delete the branch.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-review notes
|
||||||
|
|
||||||
|
- **Spec coverage:** §1 title/filename → Task 3 + template; §2 sections → Tasks 2/3 + check; §3 lifecycle → Task 3; §4 cross-refs → Task 3 `## Related`; §5 template → Task 2; §6 retroactive restructure → Task 5; §7 enforcement → Task 1 + Task 4. All covered.
|
||||||
|
- **Order nuance:** spec says sections come "in this order"; the check enforces presence + Status only. This is intentional and stated in both the spec's enforcement wording ("the four mandatory sections and a parseable Status line") and ADR-023's Decision §5 / "What was ruled out". Not a gap.
|
||||||
|
- **Type/name consistency:** `adr_structure_findings` and the `"adr-structure"` check key are used identically in the function, the `scan()` wiring, the tests, and every verification one-liner.
|
||||||
476
docs/superpowers/plans/2026-06-10-backup-strategy.md
Normal file
|
|
@@ -0,0 +1,476 @@
|
||||||
|
# Backup & DR Strategy — Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Land the *foundation layer* of the backup strategy — ADR-022, the per-service `backup__*` data contract + `BACKUP.md` governance triad (template + checklist gate + runbook step + dormant verifier), and the doc/inventory updates — so every future service role is born backup-aware, before any live infrastructure exists.
|
||||||
|
|
||||||
|
**Architecture:** This is the first of three sequenced plans (see *Decomposition & roadmap* below). It is **doc/governance only** — no Ansible role, no live restic/rclone, no host contact. It mirrors exactly how ADR-021 delivered operational-access governance: a template under `docs/<concern>/`, one line in `docs/security/service-checklist.md`, a step in `docs/runbooks/new-role.md`, and a *dormant* verifier command (`/check-access` → here `/check-backup`). boma deliberately gates these per-service docs via checklist+runbook, **not** an automated lint script — so this plan adds **no** `scripts/check-*.py`. (This reconciles the design doc's casual "make lint gates its presence" phrasing with boma's actual governance choice; the ADR records the reconciliation.)
|
||||||
|
|
||||||
|
**Tech Stack:** Markdown docs, Ansible role-var conventions (`backup__*`, double-underscore namespace per CLAUDE.md), `make lint` (yamllint + ansible-lint + `check-tags.py`) as the only automated gate, `git` trunk-based on a feature branch.
|
||||||
|
|
||||||
|
**Source spec:** `docs/superpowers/specs/2026-06-10-backup-strategy-design.md` (Decisions 1–13 referenced by number throughout).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decomposition & roadmap
|
||||||
|
|
||||||
|
The full spec spans three subsystems with hard ordering dependencies (STATUS.md: no service roles exist, `fisi` unprovisioned, Terraform never `init`ed, no staging cluster, no Uptime Kuma/pCloud). Each becomes its own plan and produces working, testable software on its own:
|
||||||
|
|
||||||
|
- **Plan 1 — Foundation (THIS PLAN).** ADR + `backup__*` contract + `BACKUP.md` governance + doc/inventory updates. Buildable and verifiable **today** with zero live infra. Unblocks every service role.
|
||||||
|
- **Plan 2 — The `backup` role (FUTURE).** `make new-role NAME=backup`: pull orchestrator, restic wrapper, `rclone→pCloud`, retention prune, udev air-gap unit + `restic copy`, systemd timers, ntfy + Uptime-Kuma heartbeat. Built with Molecule render/syntax tests + pytest, the way the `firewall` concern was — buildable now, *functionally* testable only once `fisi` + hosts exist. **Blocked on:** `fisi` provisioned (SATA power cable), `backup_hosts` inventory group, at least one service role declaring `backup__*`.
|
||||||
|
- **Plan 3 — Live wire-up + restore testing (FUTURE).** Deploy the role, pCloud rclone auth, Uptime Kuma push monitor, Tier-1 restore-verify on `ubongo`, semi-annual Tier-2 DR rehearsal on staging, the printed break-glass runbook + its annual drill. **Blocked on:** Plan 2 deployed, real VMs/staging, services with `VERIFY.md`, Vaultwarden live.
|
||||||
|
|
||||||
|
Write Plans 2 and 3 with this same skill when their prerequisites land. Everything below is Plan 1.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Plan 1 file map
|
||||||
|
|
||||||
|
| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/022-backup.md` | create | ADR of record; distils the spec's Decisions 1–13 |
| `docs/backup/service-backup-template.md` | create | `BACKUP.md` template; defines the `backup__*` contract shape |
| `.claude/commands/check-backup.md` | create | Dormant verifier (mirrors `check-access.md`) |
| `CLAUDE.md` | modify | Role-conventions: BACKUP.md required for service roles; Further-reading row |
| `docs/security/service-checklist.md` | modify | Strengthen the Operability backup line to the ADR-022 gate |
| `docs/runbooks/new-role.md` | modify | Add the per-service BACKUP.md step (new §12, renumber commit) |
| `docs/hardware/reference.md` | modify | `ubongo` → M70q/1TB; add `fisi` node + capacity row |
| `docs/CAPABILITIES.md` | modify | §9: restic+rclone+USB committed; PBS deferred; ref ADR-022 |
| `STATUS.md` | modify | Add "Designed but not built" rows for backup role + contract |
| `docs/TODO.md` | modify | Mark item 3.8 decided; reference ADR-022 |
|
||||||
|
|
||||||
|
**Working branch (all tasks):** AI-driven multi-file change → review as one diff (CLAUDE.md git conventions).
|
||||||
|
|
||||||
|
```bash
git checkout -b feat/backup-foundation
```
|
||||||
|
|
||||||
|
Before any commit, confirm `rbw unlocked` exits 0 (the pre-commit hook decrypts `vault.yml`); if not, stop and ask the operator to `rbw unlock`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Author ADR-022 and wire the decision into CLAUDE.md / STATUS.md / TODO.md
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/decisions/022-backup.md`
|
||||||
|
- Modify: `CLAUDE.md` (Further-reading table; role-conventions block)
|
||||||
|
- Modify: `STATUS.md` ("Designed but not built" table)
|
||||||
|
- Modify: `docs/TODO.md` (item 3.8)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write `docs/decisions/022-backup.md`**
|
||||||
|
|
||||||
|
Mirror the structure of `docs/decisions/021-operational-access.md` (`## Context`, `## Decision`, subsections, `## Consequences`). Transcribe the spec's settled decisions — do not re-derive. The ADR body must state, each as its own labelled decision:
|
||||||
|
|
||||||
|
1. **Recovery model A** — data-only restic backups, rebuild-from-code; no PBS in v1 (deferred as Model B/C). (spec Decision 1)
|
||||||
|
2. **One tier, ~24 h RPO.** (Decision 2)
|
||||||
|
3. **Engine:** restic (data) + rclone (pCloud off-site); restic encrypts → rclone moves ciphertext only, no second layer. (Decision 3)
|
||||||
|
4. **Topology:** central off-cluster **pull** node (`fisi`, provisional), 2×8 TB mirror, owns the repo, runs rclone + the USB dock; hosts hold no backup creds. New `backup_hosts` inventory group, `base` role applies. (Decision 4)
|
||||||
|
5. **3-2-1 mapping** incl. USB air-gap as the immutable backstop. (Decision 5)
|
||||||
|
6. **Per-service contract:** `backup__*` role vars + required `BACKUP.md`, rendered from the data (the ADR-021 pattern). **Governance reconciliation:** gated via the per-service checklist + new-role runbook + dormant `/check-backup` verifier — **not** an automated lint script (consistent with ADR-021's "runbook+gate, not scaffold" choice). State this explicitly so it supersedes the design doc's "make lint gates its presence" wording. (Decision 6)
|
||||||
|
7. **Consistency:** logical dumps first (`pg_dump`/`mysqldump`), `quiesce` escape hatch; FS snapshots not the sole DB method. (Decision 7)
|
||||||
|
8. **Restore testing:** Tier-1 weekly rolling container restore-verify on `ubongo` (reuses `VERIFY.md`); Tier-2 semi-annual full DR rehearsal on staging, ≥1/yr exercises the paper break-glass. `ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged). (Decision 8)
|
||||||
|
9. **Retention (GFS):** `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`. (Decision 9)
|
||||||
|
10. **Encryption + escrow + break-glass:** one restic password protects all copies; escrowed to `fisi`(+vault) / Vaultwarden / **paper**; paper holds **both** the restic password **and** the Ansible vault password (breaks the Model-A circular dependency); `mamba` is the break-glass clone (ADR-015). (Decision 10)
|
||||||
|
11. **USB air-gap:** udev serial-allowlist → `restic copy` to a USB restic repo → `restic check` → ntfy; rotate off-site. (Decision 11)
|
||||||
|
12. **Failure alerting:** Uptime-Kuma dead-man's-switch + ntfy on failure + weekly `restic check`. (Decision 12)
|
||||||
|
13. **Schedule.** (Decision 13)
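To make Decision 9's retention flags concrete, here is a minimal sketch of how a grandfather-father-son policy picks which snapshots survive. It is a simplified illustration of the "keep the newest snapshot in each of the last N days/weeks/months/years that have one" rule, not restic's implementation; the function name `gfs_keep` and the nightly 02:00 schedule are assumptions made for the example.

```python
from datetime import datetime, timedelta


def gfs_keep(times, daily=7, weekly=4, monthly=6, yearly=1):
    """Return the set of snapshot times a simplified GFS policy would keep."""
    newest_first = sorted(times, reverse=True)
    rules = [
        (daily, lambda t: t.date()),               # one bucket per calendar day
        (weekly, lambda t: t.isocalendar()[:2]),   # (ISO year, ISO week)
        (monthly, lambda t: (t.year, t.month)),    # one bucket per month
        (yearly, lambda t: t.year),                # one bucket per year
    ]
    keep = set()
    for limit, bucket_of in rules:
        seen = []
        for t in newest_first:
            bucket = bucket_of(t)
            if bucket in seen:
                continue  # a newer snapshot already represents this bucket
            if len(seen) == limit:
                break  # this rule has filled all of its buckets
            seen.append(bucket)
            keep.add(t)
    return keep


# Hypothetical history: one snapshot per night at 02:00 for 400 nights.
nightly = [datetime(2025, 1, 1, 2, 0) + timedelta(days=d) for d in range(400)]
kept = gfs_keep(nightly)
print(len(kept), "of", len(nightly), "snapshots kept")
```

Because one snapshot can satisfy several rules at once (the newest daily is usually also the newest weekly and monthly), the kept set is at most 7 + 4 + 6 + 1 = 18 snapshots and typically fewer.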
|
||||||
|
|
||||||
|
`## Consequences` must note: pCloud is off-site but **sync-coupled** (deletes propagate) → USB is the only immutable copy; `fisi` is the crown-jewel host (full base hardening); pCloud's 1 TB is the off-site capacity ceiling. End with a one-line pointer back to the design doc and to Plans 2–3 as the build path.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add the Further-reading row in `CLAUDE.md`**
|
||||||
|
|
||||||
|
In the Further-reading table, immediately after the `Operational access … 021-operational-access.md` row, add:
|
||||||
|
|
||||||
|
```
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add the BACKUP.md role-convention in `CLAUDE.md`**
|
||||||
|
|
||||||
|
In the "Role conventions" list, immediately after the `ACCESS.md (ADR-021)` bullet, add:
|
||||||
|
|
||||||
|
```
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
  copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
  data. A stateless service records `backup__state: false` with a reason.
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Add STATUS.md rows**
|
||||||
|
|
||||||
|
In the "Designed but not built" table in `STATUS.md`, add two rows:
|
||||||
|
|
||||||
|
```
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Update TODO item 3.8**
|
||||||
|
|
||||||
|
In `docs/TODO.md`, change the item-3.8 line:
|
||||||
|
|
||||||
|
From:
|
||||||
|
```
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
```
|
||||||
|
To:
|
||||||
|
```
8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~
   DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster
   node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via
   pCloud + rotated USB air-gap. Build: Plans 2–3.
```
|
||||||
|
|
||||||
|
- [ ] **Step 6: Verify**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: PASS (yamllint, ansible-lint, `check-tags: OK …`). No new YAML/tags introduced, so this confirms nothing regressed.
|
||||||
|
|
||||||
|
Run: `grep -n "022-backup" CLAUDE.md && grep -rn "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md`
|
||||||
|
Expected: matches in every listed file (cross-references resolve).
|
||||||
|
|
||||||
|
- [ ] **Step 7: Commit**
|
||||||
|
|
||||||
|
```bash
git add docs/decisions/022-backup.md CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Create the `BACKUP.md` template and define the `backup__*` contract
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `docs/backup/service-backup-template.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the template**
|
||||||
|
|
||||||
|
Mirror `docs/access/service-access-template.md` (preamble that says copy-to-role-and-delete; structured tables rendered from data; a hand-written prose tail). Write exactly:
|
||||||
|
|
||||||
|
````markdown
# Per-service backup record — template

Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
role (ADR-022). It is the per-service **backup record**: what state the service holds,
how it is captured consistently, and how it is restored. The structured parts are
**rendered from the role's `backup__*` data** (the single source of truth that also
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
than hand-editing the tables. The prose "Restore notes" tail is hand-written.

A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
`backup__state: false` with a reason in its role defaults instead.

Delete this preamble in the copy and start from the heading below.

---

# Backup — <service>

## State captured

Rendered from `backup__*`:

| What | Source | How captured |
|---|---|---|
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |

- **Quiesce:** `<backup__quiesce>` — `true` means the service is stopped → backed up →
  restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).

## Restore procedure

1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
3. Replay each `<backup__dumps[*].dest>` into its database.
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).

## Restore notes

Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
known-tricky migrations.

- <none yet>
````
|
||||||
|
|
||||||
|
The `backup__*` contract this template renders from (document it here and in the ADR; the role in Plan 2 consumes it):
|
||||||
|
|
||||||
|
```yaml
backup__service: <name>      # identifier; matches the role / compose project
backup__state: true          # false = stateless → no BACKUP.md (pair with a reason)
backup__paths:               # bind-mount dirs/files holding state ([] = none)
  - /srv/<service>/data
backup__dumps:               # logical app-consistent dumps (Decision 7 default; [] = none)
  - cmd: "docker compose -p <service> exec -T db pg_dump -U {{ vault.<service>.db_user }} <db>"
    dest: <service>-db.sql
backup__quiesce: false       # true = stop→back up→restart escape hatch (Decision 7 B)
```
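As an illustration of what "rendered from the data" could mean in practice, the sketch below validates a `backup__*` mapping and renders the "State captured" table rows from it. It is a hedged example, not part of this plan's deliverables (the plan deliberately adds no check script): the function names, the `backup__state_reason` key used to hold the stateless reason, and the `example` service values are all assumptions.

```python
def validate_contract(contract):
    """Return a list of problems with a backup__* mapping (empty list = OK)."""
    problems = []
    if not contract.get("backup__service"):
        problems.append("backup__service is required")
    if contract.get("backup__state", True) is False:
        # Stateless: only a reason is expected. The key name is hypothetical;
        # the contract says "with a reason" but does not name a field for it.
        if not contract.get("backup__state_reason"):
            problems.append("stateless service needs a reason")
        return problems
    for i, dump in enumerate(contract.get("backup__dumps", [])):
        for key in ("cmd", "dest"):
            if not dump.get(key):
                problems.append(f"backup__dumps[{i}].{key} is missing")
    if not isinstance(contract.get("backup__quiesce", False), bool):
        problems.append("backup__quiesce must be true or false")
    return problems


def render_state_rows(contract):
    """Render the 'State captured' table of BACKUP.md from the contract data."""
    rows = ["| What | Source | How captured |", "|---|---|---|"]
    for path in contract.get("backup__paths", []):
        rows.append(f"| data dir | `{path}` | file-level, pulled read-only |")
    for dump in contract.get("backup__dumps", []):
        rows.append(
            f"| database | `{dump['cmd']}` → `{dump['dest']}` | logical dump |"
        )
    return "\n".join(rows)


# Hypothetical stateful service used only for illustration.
example = {
    "backup__service": "example",
    "backup__state": True,
    "backup__paths": ["/srv/example/data"],
    "backup__dumps": [{"cmd": "pg_dump example", "dest": "example-db.sql"}],
    "backup__quiesce": False,
}
print(validate_contract(example))
print(render_state_rows(example))
```

Keeping validation and rendering as two functions over the same mapping mirrors the intent that the data stays authoritative and the Markdown is regenerated from it.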
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify**
|
||||||
|
|
||||||
|
Run: `test -f docs/backup/service-backup-template.md && echo PRESENT`
|
||||||
|
Expected: `PRESENT`
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: PASS (markdown only; confirms no regression).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
git add docs/backup/service-backup-template.md
git commit -m "docs(backup): add BACKUP.md template + backup__* contract (ADR-022)"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Strengthen the per-service checklist gate
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/security/service-checklist.md` (Operability section)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Replace the weak backup line with the ADR-022 gate**
|
||||||
|
|
||||||
|
In the "Operability (security-adjacent)" section, replace this line:
|
||||||
|
|
||||||
|
```
- [ ] Backup/restore is covered if the service holds state
```
|
||||||
|
|
||||||
|
with (mirroring the existing ADR-021 access line directly below it):
|
||||||
|
|
||||||
|
```
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
      `backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
      reports the declared paths/dumps captured in the latest snapshot — or the service
      sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify**
|
||||||
|
|
||||||
|
Run: `grep -n "ADR-022" docs/security/service-checklist.md`
|
||||||
|
Expected: one match (the new gate line).
|
||||||
|
|
||||||
|
Run: `grep -c "Backup/restore is covered if the service holds state" docs/security/service-checklist.md`
|
||||||
|
Expected: `0` (old weak line gone).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
git add docs/security/service-checklist.md
git commit -m "docs(backup): gate BACKUP.md in service checklist (ADR-022)"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Add the BACKUP.md step to the new-role runbook
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/runbooks/new-role.md` (insert a new step after the §11 ACCESS step; renumber the commit step)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Insert the new step**
|
||||||
|
|
||||||
|
Immediately after the §11 "Write the per-service operational-access record" block and before "### 12. Commit", insert:
|
||||||
|
|
||||||
|
```markdown
### 12. Write the per-service backup record (stateful services)

For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
`backup__paths`, `backup__dumps` — `cmd` + `dest` per logical dump — and `backup__quiesce`;
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
is rendered from that data. A **stateless** service sets `backup__state: false` with a
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
proves the declared state is captured — part of the service-clearance gate
(`docs/security/service-checklist.md`).
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Renumber the commit step**
|
||||||
|
|
||||||
|
Change the heading `### 12. Commit` (now the following heading) to `### 13. Commit`.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify**
|
||||||
|
|
||||||
|
Run: `grep -nE "^### (11|12|13)\." docs/runbooks/new-role.md`
|
||||||
|
Expected: §11 access, §12 backup, §13 commit — in that order, no duplicate numbers.
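The `grep` above relies on a human reading the three headings. As a hedged alternative, this small sketch checks the same property mechanically: numbered `### N.` headings appear in order with no duplicates. The function names are hypothetical and the sample text only mimics the runbook's heading shape.

```python
import re


def numbered_headings(markdown_text):
    """Return [(number, title)] for '### N. Title' headings, in document order."""
    pattern = re.compile(r"^### (\d+)\. (.+)$", re.MULTILINE)
    return [(int(n), title.strip()) for n, title in pattern.findall(markdown_text)]


def numbering_problems(headings):
    """Report duplicate or out-of-order step numbers."""
    numbers = [n for n, _ in headings]
    problems = []
    if len(set(numbers)) != len(numbers):
        problems.append("duplicate step number")
    if numbers != sorted(numbers):
        problems.append("steps out of order")
    return problems


# Sample text shaped like the runbook after Steps 1 and 2 of this task.
sample = """\
### 11. Write the per-service operational-access record
### 12. Write the per-service backup record (stateful services)
### 13. Commit
"""
print(numbering_problems(numbered_headings(sample)))
```

If the renumbering in Step 2 is forgotten, the two `### 12.` headings show up as a duplicate step number.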
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
git add docs/runbooks/new-role.md
git commit -m "docs(backup): add BACKUP.md step to new-role runbook (ADR-022)"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Create the dormant `/check-backup` verifier command
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `.claude/commands/check-backup.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the command**
|
||||||
|
|
||||||
|
Mirror the sibling `.claude/commands/check-access.md` (same frontmatter/sections, same "dormant until infra exists" framing). Write:
|
||||||
|
|
||||||
|
````markdown
|
||||||
|
---
|
||||||
|
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
|
||||||
|
---
|
||||||
|
|
||||||
|
Verify that a service's **declared** backup data (`backup__*`) is actually captured in
|
||||||
|
the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
|
||||||
|
applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
|
||||||
|
|
||||||
|
**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
|
||||||
|
command reports `not-yet-available` rather than failing.
|
||||||
|
|
||||||
|
## Preconditions
|
||||||
|
|
||||||
|
- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
|
||||||
|
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
|
||||||
|
`not-yet-available` and stop.
|
||||||
|
|
||||||
|
## Checks (when live)
|
||||||
|
|
||||||
|
Load the `backup__*` data for the resolved role, then:
|
||||||
|
|
||||||
|
| Check | How | Green when |
|---|---|---|
| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
| integrity | `restic check --read-data-subset=<subset>` (sampled; the flag needs a value such as `1/10` or `5%`) | no errors |
|
||||||
|
|
||||||
|
Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
|
||||||
|
````
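The reporting rules in the command file (dormant, stateless, per-check pass/fail) can be sketched as a small decision function. This is an illustration under stated assumptions, not the verifier itself: `check_backup`, the shape of `latest_snapshot` (a time plus a flat set of captured names), and the choice to test statelessness before reachability are all hypothetical, and the integrity check is left out because it needs a live restic repo.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # the ~24 h RPO from ADR-022 Decision 2


def check_backup(contract, repo_reachable, latest_snapshot, now):
    """Return a report dict for one service.

    latest_snapshot is None or {"time": datetime, "paths": set of str}.
    """
    if contract.get("backup__state", True) is False:
        return {"status": "n/a (stateless)"}
    if not repo_reachable:
        return {"status": "not-yet-available"}  # dormant until the node exists
    if latest_snapshot is None:
        return {"status": "fail", "checks": {"snapshot freshness": False}}
    contents = latest_snapshot["paths"]
    checks = {
        "snapshot freshness": now - latest_snapshot["time"] <= MAX_AGE,
        "paths present": all(
            p in contents for p in contract.get("backup__paths", [])
        ),
        "dumps present": all(
            d["dest"] in contents for d in contract.get("backup__dumps", [])
        ),
    }
    status = "pass" if all(checks.values()) else "fail"
    return {"status": status, "checks": checks}


# Hypothetical service; before the backup node exists the result is dormant.
contract = {
    "backup__service": "example",
    "backup__paths": ["/srv/example/data"],
    "backup__dumps": [{"cmd": "pg_dump example", "dest": "example-db.sql"}],
}
now = datetime(2026, 6, 11, 3, 0, tzinfo=timezone.utc)
print(check_backup(contract, repo_reachable=False, latest_snapshot=None, now=now))
```

Returning `not-yet-available` as a status rather than raising keeps the command usable in the service checklist before any backup infrastructure is live.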
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify**
|
||||||
|
|
||||||
|
Run: `test -f .claude/commands/check-backup.md && head -1 .claude/commands/check-backup.md`
|
||||||
|
Expected: file present, first line `---` (valid frontmatter).
|
||||||
|
|
||||||
|
Run: `grep -n "not-yet-available" .claude/commands/check-backup.md`
|
||||||
|
Expected: matches (dormancy explicit).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
git add .claude/commands/check-backup.md
git commit -m "feat(backup): add dormant /check-backup verifier (ADR-022)"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: Update hardware reference and capabilities
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/hardware/reference.md` (`ubongo` spec; new `fisi` node; capacity table)
|
||||||
|
- Modify: `docs/CAPABILITIES.md` (§9 Data & backup)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Update the `ubongo` prose block**
|
||||||
|
|
||||||
|
In `docs/hardware/reference.md` §1, replace the placeholder `ubongo` Storage line with the real machine's spec:
|
||||||
|
|
||||||
|
From:
|
||||||
|
```
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
```
|
||||||
|
To:
|
||||||
|
```
- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add a `fisi` prose block**
|
||||||
|
|
||||||
|
After the `ubongo` block in §1, add:
|
||||||
|
|
||||||
|
```
### fisi (backup node — outside the cluster; provisional)
- **Model / form factor:** HP Elite 600 G9 (tower)
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
- **RAM:** 16 GB+ (TBD exact)
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
- **NICs:** wired GbE
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
  docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
  Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Update the machine-readable capacity table**
|
||||||
|
|
||||||
|
In §4 "Node capacity", change the `ubongo` row disk from `250` to `1000` and add a `fisi` row. Keep the header and integer/decimal format intact (parsed by `capacity-scan.py`):
|
||||||
|
|
||||||
|
From:
|
||||||
|
```
| ubongo | 4 | 16 | 250 |
```
|
||||||
|
To:
|
||||||
|
```
| ubongo | 4 | 16 | 1000 |
| fisi | 4 | 16 | 8000 |
```
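To show why the row format matters, here is a minimal sketch of parsing one capacity row into numbers. It is not the code of `capacity-scan.py`, whose internals this plan does not show; the function name and the column names (`cpu`, `ram_gb`, `disk_gb`) are assumptions inferred from the `ubongo` row, where the last cell is the disk size.

```python
def parse_capacity_row(line):
    """Parse '| node | cpu | ram | disk |' into a dict of numbers.

    Raises ValueError if the row has the wrong shape or a non-numeric cell,
    which is the failure a placeholder such as 'TBD' would trigger.
    """
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != 4:
        raise ValueError(f"expected 4 cells, got {len(cells)}: {line!r}")
    node, cpu, ram, disk = cells
    return {
        "node": node,
        "cpu": float(cpu),       # assumed: core count
        "ram_gb": float(ram),    # assumed: RAM in GB
        "disk_gb": float(disk),  # disk in GB, per the 250 -> 1000 edit above
    }


for row in ("| ubongo | 4 | 16 | 1000 |", "| fisi | 4 | 16 | 8000 |"):
    print(parse_capacity_row(row))
```

A parser of this kind accepts both integers and decimals but rejects free text, which is why the table cells should stay purely numeric.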
|
||||||
|
|
||||||
|
- [ ] **Step 4: Update CAPABILITIES §9**
|
||||||
|
|
||||||
|
In `docs/CAPABILITIES.md` §9 table, replace the three backup rows:
|
||||||
|
|
||||||
|
From:
|
||||||
|
```
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
```
|
||||||
|
To:
|
||||||
|
```
| Backup engine | restic (data-only) | S | committed | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
| Off-site target | pCloud (via rclone) | S | committed | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
| Air-gap target | USB hard drives | S | committed | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Verify**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: PASS.
|
||||||
|
|
||||||
|
Run: `python3 scripts/capacity-scan.py >/dev/null && echo CAPACITY_OK`
|
||||||
|
Expected: `CAPACITY_OK` (the capacity table headers are still parseable; new `fisi` row accepted).
|
||||||
|
|
||||||
|
Run: `grep -n "ADR-022" docs/CAPABILITIES.md`
|
||||||
|
Expected: three matches (the updated backup rows).
|
||||||
|
|
||||||
|
- [ ] **Step 6: Commit**
|
||||||
|
|
||||||
|
```bash
git add docs/hardware/reference.md docs/CAPABILITIES.md
git commit -m "docs(backup): update hardware ref (ubongo M70q, add fisi) + CAPABILITIES §9 (ADR-022)"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 7: Final review and merge
|
||||||
|
|
||||||
|
- [ ] **Step 1: Full lint + capacity sanity**
|
||||||
|
|
||||||
|
Run: `make lint && python3 scripts/capacity-scan.py >/dev/null && echo ALL_GREEN`
|
||||||
|
Expected: `ALL_GREEN`.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Cross-reference audit**
|
||||||
|
|
||||||
|
Run: `grep -rln "ADR-022\|022-backup" CLAUDE.md STATUS.md docs/ .claude/`
|
||||||
|
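Expected: ADR file, CLAUDE.md, STATUS.md, TODO.md, service-checklist.md, new-role.md, CAPABILITIES.md, service-backup-template.md, hardware/reference.md, and check-backup.md are all listed; no dangling reference, no file missed. Files under `docs/superpowers/` (such as this plan) also match because they mention ADR-022, and can be ignored.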
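Reading a `grep -l` listing by eye makes it easy to miss an absent file. The audit can instead be expressed as "which expected files lack the reference", as in this hedged sketch; the function name is hypothetical, and the `expected` list is taken from the files this plan's Tasks 1–6 write an ADR-022 reference into.

```python
import re
from pathlib import Path

PATTERN = re.compile(r"ADR-022|022-backup")


def missing_references(root, expected):
    """Return the expected files that are absent or never mention ADR-022."""
    missing = []
    for rel in expected:
        path = Path(root) / rel
        if not path.is_file() or not PATTERN.search(path.read_text(encoding="utf-8")):
            missing.append(rel)
    return missing


# Files that Tasks 1-6 of this plan touch with an ADR-022 reference.
expected = [
    "docs/decisions/022-backup.md",
    "CLAUDE.md",
    "STATUS.md",
    "docs/TODO.md",
    "docs/backup/service-backup-template.md",
    "docs/security/service-checklist.md",
    "docs/runbooks/new-role.md",
    "docs/hardware/reference.md",
    "docs/CAPABILITIES.md",
    ".claude/commands/check-backup.md",
]

if __name__ == "__main__":
    # Run from the repository root; an empty list means the audit is clean.
    print(missing_references(".", expected))
```

An empty result means every file carries the cross-reference; any entry in the list names a file to revisit before merging.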
|
||||||
|
|
||||||
|
- [ ] **Step 3: Merge to main and delete the branch**
|
||||||
|
|
||||||
|
```bash
git checkout main
git merge --no-ff feat/backup-foundation -m "feat(backup): backup strategy foundation layer (ADR-022)"
git branch -d feat/backup-foundation
git push origin main
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-review (completed by plan author)
|
||||||
|
|
||||||
|
- **Spec coverage:** All 13 decisions are recorded in ADR-022 (Task 1, Step 1). The *foundation* obligations of Decisions 6 (contract + BACKUP.md), 7 (dumps-first wording in template/runbook), and the doc/inventory facts (Decisions 4/8 hardware) are implemented as concrete files in Tasks 2–6. Decisions whose *implementation* is live infra — 1/3/9/11/12/13 (engine, retention, air-gap mechanism, alerting, schedule) and 8's restore-testing — are explicitly deferred to Plans 2–3 (see *Decomposition & roadmap*), not silently dropped.
|
||||||
|
- **Placeholder scan:** No "TBD/implement later" steps; every edit shows exact from→to text or full file content. (`<service>`/`<name>` inside template/contract bodies are intentional doc placeholders for the eventual role author, not plan gaps.)
|
||||||
|
- **Consistency:** `backup__*` field names (`backup__service`, `backup__state`, `backup__paths`, `backup__dumps[].cmd/.dest`, `backup__quiesce`) are identical across the ADR (Task 1), template + contract (Task 2), checklist (Task 3), runbook (Task 4), and `/check-backup` (Task 5). The governance triad matches ADR-021's (template / checklist line / runbook step / dormant verifier), and the "no lint script" choice is stated in both the plan header and the ADR.
|
||||||
58
docs/superpowers/plans/2026-06-11-dev-env-role.md
Normal file
|
|
@ -0,0 +1,58 @@
|
||||||
|
# `dev_env` Role — Implementation Plan (iteration 1)
|
||||||
|
|
||||||
|
> Built in the same 2026-06-11 session as the `ubongo` bring-up. A developer
|
||||||
|
> interactive environment (zsh/tmux/nvim) for **workstation-class** hosts.
|
||||||
|
|
||||||
|
**Goal:** Give `ubongo` (and future `mamba`) a clean interactive shell/editor setup,
|
||||||
|
reproducibly, as a boma-native Ansible role — so the operator (and the `claude` agent
|
||||||
|
user) can work comfortably over SSH.
|
||||||
|
|
||||||
|
## Decisions
|
||||||
|
|
||||||
|
- **Separate role, never part of `base`.** `base` is the security/infra baseline for
|
||||||
|
*every* host; a dev environment is only for human workstation-class hosts. Servers and
|
||||||
|
service VMs must never get it.
|
||||||
|
- **Stow, not templating.** Dotfiles are **real files** under `files/dotfiles/{zsh,tmux,nvim}/`
|
||||||
|
(re-derived `$HOME`-relative from `fisi`'s live configs), symlinked into `~` with GNU
|
||||||
|
stow. No Jinja-templated dotfiles (they rot; you'd edit templates not configs).
|
||||||
|
- **Users:** `dev_env__users` (default `[]`). Set to `[sjat, claude]` for `ubongo` in
|
||||||
|
`group_vars/control`.
|
||||||
|
- **V4 (ADR-013):** configs/package-lists/install-mechanism *consulted* from V4 and
|
||||||
|
**re-derived on boma's terms** — not its structure. V4 identifiers stripped from the
|
||||||
|
dotfiles.
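The "stow, not templating" decision rests on dotfiles being real files that are symlinked into the home directory. The sketch below imitates that idea in a few lines so the behaviour is easy to see. It is a simplified stand-in, not GNU stow: real stow can also link whole directories ("tree folding"), and the `stow` function here is a hypothetical helper that links individual files only.

```python
import tempfile
from pathlib import Path


def stow(package_dir, target_dir):
    """Symlink every file under package_dir into target_dir at the same relative path."""
    package_dir, target_dir = Path(package_dir), Path(target_dir)
    linked = []
    for src in sorted(package_dir.rglob("*")):
        if src.is_dir():
            continue
        dest = target_dir / src.relative_to(package_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        if dest.is_symlink() and dest.resolve() == src.resolve():
            continue  # already stowed; rerunning is a no-op
        if dest.exists() or dest.is_symlink():
            raise FileExistsError(f"refusing to overwrite {dest}")
        dest.symlink_to(src)
        linked.append(dest)
    return linked


# Demonstration in a throwaway directory standing in for ~/.dotfiles and ~.
with tempfile.TemporaryDirectory() as tmp:
    pkg, home = Path(tmp) / "dotfiles" / "zsh", Path(tmp) / "home"
    pkg.mkdir(parents=True)
    home.mkdir()
    (pkg / ".zshrc").write_text("# managed by dev_env\n", encoding="utf-8")
    for link in stow(pkg, home):
        print(link.name, "is a symlink:", link.is_symlink())
```

Because the home directory only holds symlinks, editing `~/.zshrc` edits the real file in the dotfiles tree, which is the property the plan contrasts with Jinja-templated configs.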
|
||||||
|
|
||||||
|
## Re-derivations vs V4
|
||||||
|
|
||||||
|
- **No Nerd Font** on `ubongo` — it's headless; fonts are a client-side concern.
|
||||||
|
- **No system-wide LSP suite** — the operator's nvim uses **mason**, which self-installs
|
||||||
|
LSPs/formatters inside nvim (needs only nvim + git + a C compiler + node).
|
||||||
|
- **Pinned versions** (ADR-014): nvim `v0.12.2`, oh-my-posh `29.0.1` (V4 tracks "latest").
|
||||||
|
- **Plugins self-bootstrap**: lazy.nvim installs nvim plugins on first launch; the role
|
||||||
|
only lays down config + pre-clones omz/tmux plugins.
|
||||||
|
|
||||||
|
## Tasks (role: `roles/dev_env/`)
|
||||||
|
|
||||||
|
- `tasks/main.yml` — apt packages (`packages` tag) → include `neovim.yml`, `oh_my_posh.yml`
|
||||||
|
→ loop `per_user.yml` over `dev_env__users`.
|
||||||
|
- `tasks/neovim.yml` — install pinned nvim release to `/opt`, symlink, version sentinel.
|
||||||
|
- `tasks/oh_my_posh.yml` — install pinned oh-my-posh binary + deploy `zen.toml` to `/etc`.
|
||||||
|
- `tasks/per_user.yml` — set login shell to zsh (`users`); clone oh-my-zsh + custom
|
||||||
|
plugins + tmux/TPM plugins; copy dotfiles to `~/.dotfiles`; `stow` into `~` (`config`).
|
||||||
|
- `defaults/main.yml`, `meta/main.yml`, `README.md`, `requirements.yml`.
|
||||||
|
- `molecule/default/{converge,verify}.yml` — create a `tester` user, apply, assert
|
||||||
|
packages + nvim/omp/zen present + shell=zsh + dotfiles stowed (symlinks).
|
||||||
|
- `playbooks/workstation.yml` — apply `dev_env` to the `control` group (ubongo).
|
||||||
|
- `inventories/production/group_vars/control/vars.yml` — `dev_env__users: [sjat, claude]`.
|
||||||
|
|
||||||
|
## Verify / apply
|
||||||
|
|
||||||
|
- `make lint`; `make test ROLE=dev_env` (Molecule, Debian 13) must pass.
|
||||||
|
- Apply to `ubongo`: run `make check PLAYBOOK=workstation`, then `make deploy PLAYBOOK=workstation`,
  from a host that can SSH to `ubongo` as `sjat` with `--ask-become-pass` (the
  Ansible-manages-ubongo connection isn't bootstrapped yet; handle at apply time).
|
||||||
|
|
||||||
|
## Deferred (iteration 2+)
|
||||||
|
|
||||||
|
- A proper `workstations` inventory group (when `mamba` joins) instead of reusing `control`.
|
||||||
|
- lazygit, extra CLI tooling, any system LSP/formatters mason can't cover.
|
||||||
|
- Pinning tmux plugins to commits (currently `master` except catppuccin `v1.0.3`).
|
||||||
150
docs/superpowers/plans/2026-06-11-ubongo-build.md
Normal file
|
|
@ -0,0 +1,150 @@
|
||||||
|
# Ubongo Physical Build — Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** Execute task-by-task. This is the **physical bring-up** of
|
||||||
|
> `ubongo`. The 2026-06-05 plan (`2026-06-05-ubongo-control-host.md`) was
|
||||||
|
> *documentation-only* (it authored ADR-015); this is its sequel — taking the actual
|
||||||
|
> box from bare Debian 13 to a working control / AI-worker node.
|
||||||
|
|
||||||
|
**Goal:** Bring the Lenovo ThinkCentre M70q from a fresh Debian 13 install to a working
|
||||||
|
control node: toolchain, dedicated `claude` identity, repo + Claude Code, vault access,
|
||||||
|
inventory wiring, keys-only SSH, and reconciliation of the docs to "built."
|
||||||
|
|
||||||
|
**Spec / decisions of record:** ADR-015 + `docs/superpowers/specs/2026-06-05-ubongo-control-host-design.md`,
|
||||||
|
plus the interactive build decisions captured below (2026-06-11 session).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decisions made this session (2026-06-11)
|
||||||
|
|
||||||
|
- **Hardware:** Lenovo ThinkCentre M70q Tiny · i3-10100T (4c/8t) · 16 GB · 256 GB
|
||||||
|
SanDisk X600 SATA SSD (TCG **Opal**-capable; Opal **unused**, see encryption).
|
||||||
|
- **BIOS:** auto-power-on after loss; Wake-on-LAN on; ErP/deep-S5 off; **supervisor
|
||||||
|
password set**; external/USB + PXE boot **disabled**; Secure Boot on; TPM (PTT) on;
|
||||||
|
VT-x/VT-d on; Better-Thermal cooling.
|
||||||
|
- **Disk encryption: NONE.** Accepted risk — compensated by physical security + BIOS
|
||||||
|
supervisor password + disabled external boot. Recorded in `accepted-risks.md` (Task H1).
|
||||||
|
- **Partitioning:** simple single ext4 root (`/dev/sda2`, 221 G) + 12 G swap, no LVM.
|
||||||
|
Revisit via reinstall onto LVM/bigger drive only if the layout bites.
|
||||||
|
- **Identity:** dedicated **`claude`** user — for **attribution + revocation, not
|
||||||
|
containment**. In the `docker` group (Molecule); **no local sudo** (boma deploys run
|
||||||
|
over SSH as `ansible`; the agent needs Docker, not root). Reached via `sudo -iu claude`
|
||||||
|
from `sjat`. Own `ed25519` key for Forgejo. ADR-021 leaves this identity open — note it.
|
||||||
|
- **Access:** LAN SSH only for now — the NetBird mesh (ADR-016) is deferred (`askari` +
|
||||||
|
service machinery unbuilt). Keys-only enforced after bootstrap.
|
||||||
|
- **Address:** `10.20.10.151/24` on `eno1`. Make stable via an OPNsense DHCP reservation.
|
||||||
|
|
||||||
|
**Pinned versions (match `fisi`):** docker 29.5.2 · rbw 1.15.0 · node 20.19.2 ·
|
||||||
|
claude 2.1.173. Terraform is absent on `fisi` (TF un-init'd) — install deferred.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Pre-flight
|
||||||
|
|
||||||
|
- **Temp passwordless sudo** for `sjat` during the build (`/etc/sudoers.d/99-boma-build`);
|
||||||
|
**removed in Task F2**. Without it, non-interactive SSH `sudo` hangs.
|
||||||
|
- **`rbw unlock`** on `fisi` before any commit (pre-commit decrypts `vault.yml`).
|
||||||
|
- **Commit style:** one commit per logical unit; imperative subject ≤72 chars; trailer
|
||||||
|
`Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>`.
|
||||||
|
- Drive the live box (`ubongo`) directly over SSH; do repo/doc tasks (H) as clean commits.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Stage A — Toolchain (on `ubongo`, via `sjat` sudo)
|
||||||
|
|
||||||
|
- [ ] **A1.** apt base: `git make build-essential python3-venv python3-pip curl
|
||||||
|
ca-certificates gnupg jq` (+ `apt update`).
|
||||||
|
- [ ] **A2.** Docker Engine from Docker's official apt repo (Debian 13/trixie); enable +
|
||||||
|
start; confirm `docker --version` ≈ 29.5.2.
|
||||||
|
- [ ] **A3.** `rbw` 1.15.0 — try `apt install rbw`; if the version doesn't match, install
|
||||||
|
the pinned release binary to `/usr/local/bin` (match `fisi`).
|
||||||
|
- [ ] **A4.** Node 20.19.2 (nodesource or distro) — only if Claude Code needs it; the
|
||||||
|
native installer bundles its runtime, so Node may be optional.
|
||||||
|
- [ ] **A5.** Claude Code via the **native installer** (matches `fisi`'s
|
||||||
|
`~/.local/share/claude/versions/`), installed under the `claude` user in Stage C.
|
||||||
|
- [ ] Defer Terraform (absent on `fisi`).
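
The version confirmations in Stage A (for example A2's `docker --version` ≈ 29.5.2) can be scripted. A minimal sketch, assuming "≈" means the same major.minor with any patch level; the helper name `matches_pin` is hypothetical and not part of the repo.

```python
import re


def matches_pin(version_output: str, pin: str) -> bool:
    """Return True when the reported version shares major.minor with the pin.

    Assumption: "≈ 29.5.2" tolerates patch-level drift only.
    """
    found = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if found is None:
        # No dotted version in the output at all.
        return False
    reported = tuple(int(part) for part in found.groups())
    pinned = tuple(int(part) for part in pin.split("."))
    return reported[:2] == pinned[:2]
```

Feeding it the text printed by `docker --version` and the pin from the "Pinned versions" line gives a yes/no answer that can be recorded in the session notes.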
|
||||||
|
|
||||||
|
## Stage B — Identity (`claude` user)
|
||||||
|
|
||||||
|
- [ ] **B1.** `useradd -m -s /bin/bash claude`; lock the password (`passwd -l claude`) —
|
||||||
|
reached only via `sudo -iu claude` from `sjat` or its own key.
|
||||||
|
- [ ] **B2.** Add `claude` to the `docker` group.
|
||||||
|
- [ ] **B3.** No sudo for `claude` (explicit decision). Confirm `sudo -iu claude` works.
|
||||||
|
|
||||||
|
## Stage C — Repo + Claude Code (as `claude`)
|
||||||
|
|
||||||
|
- [ ] **C1.** Generate `claude`'s `ed25519` key; **[USER]** register the public key in
|
||||||
|
Forgejo (Settings → SSH keys).
|
||||||
|
- [ ] **C2.** Clone `ssh://git@forgejo.nyumbani.baobab.band:7577/sjat/boma.git` into
|
||||||
|
`/home/claude/Projects/boma`.
|
||||||
|
- [ ] **C3.** `make setup` (venv + `requirements.txt`); `make collections`.
|
||||||
|
- [ ] **C4.** Install Claude Code (native installer) for `claude`; set up plugins/MCP/
|
||||||
|
settings per `docs/runbooks/claude-code-setup.md`. Set git `user.name`/`user.email`.
|
||||||
|
|
||||||
|
## Stage D — Vault (`rbw`)
|
||||||
|
|
||||||
|
- [ ] **D1.** `rbw config set base_url https://vaultwarden.baobab.band`; set email.
|
||||||
|
- [ ] **D2. [USER]** `rbw login` (master password) on `ubongo`; then `rbw sync`,
|
||||||
|
`rbw unlock`; verify `rbw get boma-ansible-vault` returns the vault password.
|
||||||
|
- [ ] **D3.** **Offline-cache verification (ADR-015 open item, security-relevant):**
|
||||||
|
confirm `rbw` decrypts its local cache with Vaultwarden unreachable. Stamp the result
|
||||||
|
into ADR-015 / `rotate-secrets.md` (replaces the `TO VERIFY` note).
|
||||||
|
|
||||||
|
## Stage E — Inventory + base (partial)
|
||||||
|
|
||||||
|
- [ ] **E1.** Add `ubongo` to `inventories/production/hosts.yml` under `control`
|
||||||
|
(manual exception; note `tf-inventory` will overwrite — re-add after).
|
||||||
|
- [ ] **E2.** Set `base__firewall_control_addr` to `10.20.10.151` in the appropriate
|
||||||
|
`group_vars` (the dormant `ssh-from-control` knob, ADR-020/021).
|
||||||
|
- [ ] **E3.** `make check PLAYBOOK=site` against `control`; apply the built `firewall`
|
||||||
|
concern only (SSH-hardening/fail2ban/auditd concerns are unbuilt — note the gap).
|
||||||
|
|
||||||
|
## Stage F — Hardening / address
|
||||||
|
|
||||||
|
- [ ] **F1.** Disable SSH password auth (keys-only) via `/etc/ssh/sshd_config.d/`;
|
||||||
|
`PermitRootLogin no`; reload `sshd` (we're on a key, so safe).
|
||||||
|
- [ ] **F2.** **Remove the temp NOPASSWD** drop-in (`/etc/sudoers.d/99-boma-build`).
|
||||||
|
- [ ] **F3. [USER]** OPNsense DHCP reservation for `10.20.10.151`.
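
F1 amounts to a two-directive drop-in. A minimal sketch that renders the text and checks an existing file's content against it; the helper names are hypothetical, and only the two directives named in F1 are included.

```python
REQUIRED = {
    "PasswordAuthentication": "no",
    "PermitRootLogin": "no",
}


def render_dropin(settings):
    """Render sshd_config directives, one `Keyword value` pair per line."""
    return "".join(f"{key} {value}\n" for key, value in settings.items())


def missing_directives(text, required):
    """List required directives that the text does not set to the required value."""
    present = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        if len(parts) == 2:
            # sshd keeps the first value it obtains for a keyword,
            # so later duplicates are ignored here as well.
            present.setdefault(parts[0].lower(), parts[1].strip().lower())
    return [key for key, value in required.items() if present.get(key.lower()) != value]
```

Running `sshd -t` before the reload validates the complete configuration, including any drop-ins, and is a cheap guard against locking the session out.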
|
||||||
|
|
||||||
|
## Stage H — Docs reconciliation (repo commits)
|
||||||
|
|
||||||
|
- [ ] **H1.** `accepted-risks.md`: add the plaintext-disk accepted risk (compensations:
|
||||||
|
physical security, BIOS supervisor password, no external boot).
|
||||||
|
- [ ] **H2.** `docs/hardware/reference.md`: fill `ubongo`'s real specs (M70q, i3-10100T,
|
||||||
|
16 GB, 256 GB SanDisk X600) into the TBD skeleton; node-capacity row already present.
|
||||||
|
- [ ] **H3.** `STATUS.md`: move `ubongo` from "Designed but not built" toward built
|
||||||
|
(note what's live vs. still pending — mesh, full `base`).
|
||||||
|
- [ ] **H4.** Note the dedicated-`claude` identity decision (short amendment to ADR-021
|
||||||
|
or ADR-015) and the LAN address.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Out of scope this session
|
||||||
|
|
||||||
|
- **Mesh VPN** (NetBird) — needs `askari` + service roles (ADR-016). SSH stays LAN-only.
|
||||||
|
- **Full `base` hardening** — SSH/fail2ban/auditd concerns not built (only `firewall`).
|
||||||
|
- **Recovery wiring (G)** — TF-state backup to `mamba`, rbw mirror — no TF state yet
|
||||||
|
(TF un-init'd). `mamba` as break-glass clone tracked separately.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Outcome (2026-06-11)
|
||||||
|
|
||||||
|
`STATUS.md` is the live source of truth; this is the session record.
|
||||||
|
|
||||||
|
**Done:** A (toolchain — Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173; Node deferred),
|
||||||
|
B (dedicated `claude` user — docker group, no sudo), C (repo cloned, `make setup` +
|
||||||
|
`collections`, git identity; plugins install on first interactive launch), D (vault via
|
||||||
|
rbw + **offline-cache decryption verified**), E1/E2 (inventory + `ssh-from-control`
|
||||||
|
knob), F1 (key-only SSH), F2 (temp NOPASSWD removed), H1–H4 (docs reconciled).
|
||||||
|
|
||||||
|
**Deferred, with reason:**
|
||||||
|
- **E3 — apply `base` to `ubongo`:** would push nftables default-deny with SSH allowed
|
||||||
|
*only on the mesh interface*, but no mesh exists yet → would deny inbound SSH on `eno1`
|
||||||
|
and strand the box. Wait for NetBird (ADR-016). `base` is also firewall-concern-only.
|
||||||
|
- **F3 — OPNsense DHCP reservation** for `10.20.10.151` (MAC `88:a4:c2:e0:ee:da`): operator action.
|
||||||
|
- **Mesh enrollment, full `base` hardening, recovery wiring (G):** out of scope (above).
|
||||||
|
|
||||||
|
**Follow-ups flagged:** (1) `ubongo` sits in `10.20.10.0/24`, which doesn't match
|
||||||
|
ADR-007's zone map (`srv: 10.20.0.0/24`) — network-design drift to reconcile. (2) The
|
||||||
|
hardware reference previously assumed `ubongo` had 1 TB NVMe for an ADR-022 "restore-verify"
|
||||||
|
role; the real disk is 256 GB — check ADR-022 doesn't bank on the larger size.
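
Follow-up (1) can be confirmed with a short membership check. A minimal sketch using the two values quoted above; the helper name `in_zone` is hypothetical.

```python
import ipaddress

SRV_ZONE = ipaddress.ip_network("10.20.0.0/24")   # ADR-007 `srv` zone as quoted above
UBONGO = ipaddress.ip_interface("10.20.10.151/24")  # ubongo's address on eno1


def in_zone(interface, zone) -> bool:
    """Return True when the interface's address lies inside the zone."""
    return interface.ip in zone


# The network ubongo actually sits in, derived from its address and prefix.
UBONGO_NETWORK = UBONGO.network
```

`in_zone(UBONGO, SRV_ZONE)` evaluates to false, and `UBONGO_NETWORK` is `10.20.10.0/24`, which is the drift the follow-up describes.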
|
||||||
538
docs/superpowers/plans/2026-06-14-askari-provisioning-m2.md
Normal file
|
|
@ -0,0 +1,538 @@
|
||||||
|
# askari Provisioning (M2) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Provision `askari` (the off-site Hetzner VPS) as Terraform IaC — a `hetzner_vm` module + an `offsite` stack — behind a TF-managed cloud firewall, hand it into the `offsite_hosts` inventory, and bootstrap it.
|
||||||
|
|
||||||
|
**Architecture:** Generalize boma's "Terraform owns VM existence" principle (ADR-006) from Proxmox to Hetzner. A reusable `hetzner_vm` module wraps `hcloud_server` + `hcloud_firewall` + `hcloud_ssh_key`; an `offsite` environment (own local state) declares `askari` (CAX11/ARM, Helsinki, Debian 13). cloud-init creates the `ansible` user with ubongo's key; the firewall allows SSH from ubongo only. Handoff stays ADR-009-shaped: the offsite env outputs `vms`, and `tf_to_inventory.py` (already offsite-aware) generates an inventory file merged via a **directory inventory**.
|
||||||
|
|
||||||
|
**Tech Stack:** Terraform (`hetznercloud/hcloud` provider), Hetzner Cloud, cloud-init, Ansible. Token from `vault.hetzner.token` → `TF_VAR_hcloud_token`.
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`
|
||||||
|
|
||||||
|
**Execution context:** Tasks 1–6 + 9 are authoring + `terraform fmt/validate/plan` (need `terraform` installed + the token, but no resources are created). **Task 7 (`terraform apply`) and Task 8 (bootstrap) create a real, billed VPS** — gated, run with explicit user go, `tf-plan` shown first (CLAUDE.md). If `terraform` is absent in the working env, Tasks 6–8 defer to ubongo.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File Structure
|
||||||
|
|
||||||
|
- `terraform/modules/hetzner_vm/{variables,main,outputs}.tf` (create) — wraps server + firewall + ssh key + cloud-init.
|
||||||
|
- `terraform/environments/offsite/{providers,variables,main,outputs,backend}.tf` + `terraform.tfvars.example` (create) — the askari stack, own local state.
|
||||||
|
- `Makefile` (modify) — inject `TF_VAR_hcloud_token` for `TF_ENV=offsite`; directory inventory; `tf-inventory-offsite` target.
|
||||||
|
- `scripts/tf_to_inventory.py` (no change — already offsite-aware) + `tests/test_tf_to_inventory.py` (create) — lock the offsite handoff.
|
||||||
|
- `docs/decisions/{006,009,020,007,016}-*.md`, `STATUS.md` (modify) — ADR amendments + status.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Verify the Hetzner provider/image facts (ADR-014)
|
||||||
|
|
||||||
|
**Files:** none (research; pin values used by later tasks).
|
||||||
|
|
||||||
|
- [ ] **Step 1: Verify and record**
|
||||||
|
|
||||||
|
Verify (WebFetch registry.terraform.io / docs.hetzner.com, or `terraform` once init'd):
|
||||||
|
- latest `hetznercloud/hcloud` provider version to pin (expected `~> 1.48` or newer),
|
||||||
|
- the Debian 13 image slug (expected `debian-13`),
|
||||||
|
- that server type `cax11` exists in location `hel1`.
|
||||||
|
|
||||||
|
Record a stamp in the offsite `providers.tf` comment, e.g.:
|
||||||
|
`# verified: hetznercloud/hcloud <ver> · debian-13 image · cax11@hel1 · <source> · <date>`
|
||||||
|
|
||||||
|
- [ ] **Step 2: No commit** (values land in later tasks).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: The `hetzner_vm` module
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `terraform/modules/hetzner_vm/variables.tf`, `main.tf`, `outputs.tf`
|
||||||
|
|
||||||
|
- [ ] **Step 1: `variables.tf`**
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
variable "name" {
|
||||||
|
description = "Server name (and hostname)"
|
||||||
|
type = string
|
||||||
|
}
|
||||||
|
|
||||||
|
variable "server_type" {
|
||||||
|
description = "Hetzner server type, e.g. cax11 (ARM)"
|
||||||
|
type = string
|
||||||
|
}
|
||||||
|
|
||||||
|
variable "location" {
|
||||||
|
description = "Hetzner location, e.g. hel1"
|
||||||
|
type = string
|
||||||
|
}
|
||||||
|
|
||||||
|
variable "image" {
|
||||||
|
description = "OS image slug, e.g. debian-13"
|
||||||
|
type = string
|
||||||
|
}
|
||||||
|
|
||||||
|
variable "ansible_ssh_pubkey" {
|
||||||
|
description = "Public SSH key provisioned for the ansible user via cloud-init"
|
||||||
|
type = string
|
||||||
|
}
|
||||||
|
|
||||||
|
variable "ssh_admin_cidrs" {
|
||||||
|
description = "Source CIDRs allowed to reach SSH (e.g. ubongo's address/32)"
|
||||||
|
type = list(string)
|
||||||
|
}
|
||||||
|
|
||||||
|
variable "labels" {
|
||||||
|
description = "Hetzner resource labels (metadata only)"
|
||||||
|
type = map(string)
|
||||||
|
default = {}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: `main.tf`**
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
# cloud-init: create the unprivileged `ansible` user with ubongo's key + sudo.
|
||||||
|
# (Mirrors the proxmox_vm module's user_account; Hetzner has no structured field.)
|
||||||
|
locals {
|
||||||
|
user_data = <<-EOT
|
||||||
|
#cloud-config
|
||||||
|
users:
|
||||||
|
- name: ansible
|
||||||
|
groups: [sudo]
|
||||||
|
sudo: "ALL=(ALL) NOPASSWD:ALL"
|
||||||
|
shell: /bin/bash
|
||||||
|
ssh_authorized_keys:
|
||||||
|
- ${var.ansible_ssh_pubkey}
|
||||||
|
package_update: true
|
||||||
|
packages:
|
||||||
|
- python3
|
||||||
|
EOT
|
||||||
|
}
|
||||||
|
|
||||||
|
resource "hcloud_ssh_key" "ansible" {
|
||||||
|
name = "${var.name}-ansible"
|
||||||
|
public_key = var.ansible_ssh_pubkey
|
||||||
|
}
|
||||||
|
|
||||||
|
resource "hcloud_firewall" "this" {
|
||||||
|
name = "${var.name}-fw"
|
||||||
|
|
||||||
|
# SSH from the control node only (NetBird ports are added in M4 when the
|
||||||
|
# coordinator deploys — see ADR-020; the host nftables layer is catalog-driven).
|
||||||
|
rule {
|
||||||
|
direction = "in"
|
||||||
|
protocol = "tcp"
|
||||||
|
port = "22"
|
||||||
|
source_ips = var.ssh_admin_cidrs
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
resource "hcloud_server" "this" {
|
||||||
|
name = var.name
|
||||||
|
server_type = var.server_type
|
||||||
|
location = var.location
|
||||||
|
image = var.image
|
||||||
|
ssh_keys = [hcloud_ssh_key.ansible.id]
|
||||||
|
user_data = local.user_data
|
||||||
|
firewall_ids = [hcloud_firewall.this.id]
|
||||||
|
labels = var.labels
|
||||||
|
|
||||||
|
public_net {
|
||||||
|
ipv4_enabled = true
|
||||||
|
ipv6_enabled = true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: `outputs.tf`**
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
output "ipv4_address" {
|
||||||
|
description = "Server public IPv4"
|
||||||
|
value = hcloud_server.this.ipv4_address
|
||||||
|
}
|
||||||
|
|
||||||
|
output "name" {
|
||||||
|
description = "Server name"
|
||||||
|
value = hcloud_server.this.name
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Format**
|
||||||
|
|
||||||
|
Run: `terraform fmt terraform/modules/hetzner_vm/`
|
||||||
|
Expected: files formatted (or already formatted).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add terraform/modules/hetzner_vm
|
||||||
|
git commit -m "feat(tf): hetzner_vm module (server + firewall + ssh key + cloud-init)"
|
||||||
|
```
|
||||||
|
(append `Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>`)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: The `offsite` environment
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `terraform/environments/offsite/{providers,variables,main,outputs,backend}.tf`, `terraform.tfvars.example`
|
||||||
|
|
||||||
|
- [ ] **Step 1: `providers.tf`** (pin the version from Task 1)
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
# verified: hetznercloud/hcloud ~> 1.48 · debian-13 · cax11@hel1 · <source> · <date>
|
||||||
|
terraform {
|
||||||
|
required_version = ">= 1.9"
|
||||||
|
|
||||||
|
required_providers {
|
||||||
|
hcloud = {
|
||||||
|
source = "hetznercloud/hcloud"
|
||||||
|
version = "~> 1.48"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
provider "hcloud" {
|
||||||
|
token = var.hcloud_token
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: `variables.tf`**
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
variable "hcloud_token" {
|
||||||
|
description = "Hetzner Cloud API token — set via TF_VAR_hcloud_token (from vault.hetzner.token)"
|
||||||
|
type = string
|
||||||
|
sensitive = true
|
||||||
|
}
|
||||||
|
|
||||||
|
variable "ansible_ssh_pubkey" {
|
||||||
|
description = "ubongo's control SSH public key, provisioned for the ansible user"
|
||||||
|
type = string
|
||||||
|
}
|
||||||
|
|
||||||
|
variable "ssh_admin_cidrs" {
|
||||||
|
description = "Source CIDRs allowed to SSH askari (ubongo's address/32)"
|
||||||
|
type = list(string)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: `main.tf`**
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
# offsite/main.tf — off-site Hetzner hosts. Terraform owns VM existence (ADR-006,
|
||||||
|
# generalized to Hetzner). ALWAYS `make tf-plan TF_ENV=offsite` and review before
|
||||||
|
# `make tf-apply TF_ENV=offsite`.
|
||||||
|
|
||||||
|
module "askari" {
|
||||||
|
source = "../../modules/hetzner_vm"
|
||||||
|
|
||||||
|
name = "askari"
|
||||||
|
server_type = "cax11" # ARM, 2 vCPU / 4 GB
|
||||||
|
location = "hel1" # Helsinki
|
||||||
|
image = "debian-13"
|
||||||
|
ansible_ssh_pubkey = var.ansible_ssh_pubkey
|
||||||
|
ssh_admin_cidrs = var.ssh_admin_cidrs
|
||||||
|
labels = {
|
||||||
|
env = "offsite"
|
||||||
|
group = "offsite_hosts"
|
||||||
|
managed-by = "terraform"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: `outputs.tf`** (the `tf_to_inventory.py` contract — `vms` map)
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
output "vms" {
|
||||||
|
description = "Hostname → IP and Ansible group — consumed by make tf-inventory-offsite"
|
||||||
|
value = {
|
||||||
|
askari = {
|
||||||
|
ip = module.askari.ipv4_address
|
||||||
|
group = "offsite_hosts"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: `backend.tf`**
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
# Terraform state: LOCAL, on the control node (like the Proxmox envs; ADR-006).
|
||||||
|
# askari survives a homelab outage by design, so a lost state is recovered by
|
||||||
|
# `terraform import` of the running server — not a rebuild. Back the state up with
|
||||||
|
# the control node (ADR-022).
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 6: `terraform.tfvars.example`**
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
# offsite environment — non-secret values. Copy to terraform.tfvars and fill in.
|
||||||
|
#
|
||||||
|
# Secret is exported as an env var (never in this file):
|
||||||
|
# export TF_VAR_hcloud_token="$(...from vault.hetzner.token...)" # make handles this
|
||||||
|
#
|
||||||
|
# State is local (see backend.tf).
|
||||||
|
|
||||||
|
ansible_ssh_pubkey = "ssh-ed25519 AAAA... ansible@ubongo"
|
||||||
|
ssh_admin_cidrs = ["10.20.10.151/32"] # ubongo's LAN address (ADR-021); if SSH is refused, compare with ubongo's egress IP (Task 8 Step 1)
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 7: Format + commit**
|
||||||
|
|
||||||
|
Run: `terraform fmt terraform/environments/offsite/`
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add terraform/environments/offsite
|
||||||
|
git commit -m "feat(tf): offsite environment — askari (CAX11/hel1/debian-13)"
|
||||||
|
```
|
||||||
|
(Co-Authored-By trailer)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Makefile — token injection, directory inventory, offsite handoff
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `Makefile`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Inject the Hetzner token for `TF_ENV=offsite`**
|
||||||
|
|
||||||
|
The `tf-*` targets need `TF_VAR_hcloud_token` for offsite, sourced from the vault. Add a guarded helper variable near the `TF` definition:
|
||||||
|
|
||||||
|
```makefile
|
||||||
|
# For TF_ENV=offsite, export the Hetzner token from the vault (rbw unlocked).
|
||||||
|
# Reads vault.hetzner.token in-memory; never written to a tfvars file (CLAUDE.md).
|
||||||
|
ifeq ($(TF_ENV),offsite)
|
||||||
|
TF_TOKEN_ENV = TF_VAR_hcloud_token="$$($(VENV)/bin/ansible-vault view inventories/production/group_vars/all/vault.yml | $(VENV)/bin/python -c 'import sys,yaml; print(yaml.safe_load(sys.stdin)["vault"]["hetzner"]["token"])')"
|
||||||
|
else
|
||||||
|
TF_TOKEN_ENV =
|
||||||
|
endif
|
||||||
|
```
|
||||||
|
|
||||||
|
Then prefix the `tf-init`/`tf-plan`/`tf-apply`/`tf-output` recipes with `$(TF_TOKEN_ENV)`, e.g.:
|
||||||
|
|
||||||
|
```makefile
|
||||||
|
tf-plan:
|
||||||
|
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) plan
|
||||||
|
```
|
||||||
|
|
||||||
|
(Apply the same prefix to `tf-init`, `tf-apply`, `tf-output`.)
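
The inline Python in `TF_TOKEN_ENV` is easier to review when read as a standalone function. A minimal sketch, assuming the decrypted vault is a YAML mapping with the token at `vault.hetzner.token`; the function names are hypothetical, and the YAML parser is passed in by the caller (for example PyYAML's `yaml.safe_load`) so the lookup itself has no third-party dependency.

```python
def extract_token(vault_doc):
    """Return vault.hetzner.token from the decrypted vault mapping."""
    try:
        return vault_doc["vault"]["hetzner"]["token"]
    except (KeyError, TypeError) as exc:
        # TypeError covers an empty document, which parses to None.
        raise LookupError("vault.hetzner.token not found") from exc


def token_from_yaml(text, load):
    """Parse decrypted vault text with `load` and return the Hetzner token.

    `load` is any callable that turns YAML text into Python objects.
    """
    return extract_token(load(text))
```

In the recipe, `text` would be the standard output of `ansible-vault view`, read from standard input; the token is only ever held in memory and exported as `TF_VAR_hcloud_token`.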
|
||||||
|
|
||||||
|
- [ ] **Step 2: Directory inventory**
|
||||||
|
|
||||||
|
Change the inventory so multiple TF envs can each generate a file:
|
||||||
|
|
||||||
|
```makefile
|
||||||
|
INVENTORY := -i inventories/production/
|
||||||
|
```
|
||||||
|
|
||||||
|
(Ansible reads every file in the directory as an inventory source and merges them; `group_vars/`/`host_vars/` remain variable dirs. Verify `ansible.cfg` does not also hard-set `inventory=`; if it does, update it to match.)
|
||||||
|
|
||||||
|
- [ ] **Step 3: `tf-inventory-offsite` target**
|
||||||
|
|
||||||
|
Add (writes the offsite hosts into the production inventory dir, beside the Proxmox-generated `hosts.yml`):
|
||||||
|
|
||||||
|
```makefile
|
||||||
|
tf-inventory-offsite:
|
||||||
|
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/offsite output -json \
|
||||||
|
| $(PYTHON) scripts/tf_to_inventory.py > inventories/production/offsite.yml
|
||||||
|
@echo "Offsite inventory written to inventories/production/offsite.yml"
|
||||||
|
```
|
||||||
|
Add `tf-inventory-offsite` to `.PHONY` and a help line.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Verify existing playbooks still resolve under the directory inventory**
|
||||||
|
|
||||||
|
Run: `make check PLAYBOOK=dns 2>&1 | tail -3`
|
||||||
|
Expected: still resolves the `control` host and runs (no inventory errors). If `connection:`/group_vars break, fix before committing.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add Makefile
|
||||||
|
git commit -m "feat(make): offsite TF token injection + directory inventory + tf-inventory-offsite"
|
||||||
|
```
|
||||||
|
(Co-Authored-By trailer)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Lock the offsite inventory handoff (TDD)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Test: `tests/test_tf_to_inventory.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the test** (a characterization test: it locks existing behaviour, so it is expected to pass on first run)
|
||||||
|
|
||||||
|
```python
|
||||||
|
import json
|
||||||
|
import pathlib
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
|
||||||
|
_SCRIPT = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "tf_to_inventory.py"
|
||||||
|
|
||||||
|
|
||||||
|
def _run(tf_output: dict) -> str:
|
||||||
|
return subprocess.run(
|
||||||
|
[sys.executable, str(_SCRIPT)],
|
||||||
|
input=json.dumps(tf_output), capture_output=True, text=True, check=True,
|
||||||
|
).stdout
|
||||||
|
|
||||||
|
|
||||||
|
def test_offsite_host_lands_in_offsite_hosts():
|
||||||
|
out = _run({"vms": {"value": {"askari": {"ip": "203.0.113.7", "group": "offsite_hosts"}}}})
|
||||||
|
assert "offsite_hosts:" in out
|
||||||
|
assert "askari:" in out
|
||||||
|
assert "ansible_host: 203.0.113.7" in out
|
||||||
|
|
||||||
|
|
||||||
|
def test_unknown_group_rejected():
|
||||||
|
proc = subprocess.run(
|
||||||
|
[sys.executable, str(_SCRIPT)],
|
||||||
|
input=json.dumps({"vms": {"value": {"x": {"ip": "1.2.3.4", "group": "nope"}}}}),
|
||||||
|
capture_output=True, text=True,
|
||||||
|
)
|
||||||
|
assert proc.returncode == 1
|
||||||
|
assert "unknown group" in proc.stderr
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run it**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_tf_to_inventory.py -v`
|
||||||
|
Expected: PASS — `tf_to_inventory.py` already supports `offsite_hosts` and rejects unknown groups (this test locks that behaviour for the M2 handoff; no code change needed). If it fails, fix `scripts/tf_to_inventory.py` minimally and report.
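
The contract these tests lock can be pictured with a small stand-in. This is a hypothetical re-implementation for illustration, not the contents of `scripts/tf_to_inventory.py`: the allowed-group set and the exact YAML layout are assumptions, and only the `vms.value` shape, the `offsite_hosts` group, the `ansible_host` key, and the rejection of unknown groups come from the tests above.

```python
ALLOWED_GROUPS = {"offsite_hosts"}  # assumption: the real script accepts more groups


def to_inventory_yaml(tf_output: dict) -> str:
    """Render `terraform output -json` data as a YAML inventory string."""
    vms = tf_output["vms"]["value"]
    by_group = {}
    for host, attrs in sorted(vms.items()):
        group = attrs["group"]
        if group not in ALLOWED_GROUPS:
            raise ValueError(f"unknown group {group!r} for host {host!r}")
        by_group.setdefault(group, {})[host] = attrs["ip"]

    # Standard Ansible YAML inventory nesting: all -> children -> group -> hosts.
    lines = ["all:", "  children:"]
    for group in sorted(by_group):
        lines += [f"    {group}:", "      hosts:"]
        for host, ip in by_group[group].items():
            lines += [f"        {host}:", f"          ansible_host: {ip}"]
    return "\n".join(lines) + "\n"
```

The real script signals an unknown group through exit status 1 and a message on stderr, as the second test asserts; the sketch raises `ValueError` instead to stay importable.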
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add tests/test_tf_to_inventory.py
|
||||||
|
git commit -m "test(tf): lock the offsite_hosts inventory handoff"
|
||||||
|
```
|
||||||
|
(Co-Authored-By trailer)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: Init, validate, plan (gated — needs terraform + token)
|
||||||
|
|
||||||
|
> Needs `terraform` installed and `rbw` unlocked. Creates **no** resources. If `terraform` is absent, defer Tasks 6–8 to ubongo.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Set tfvars**
|
||||||
|
|
||||||
|
`cp terraform/environments/offsite/terraform.tfvars.example terraform/environments/offsite/terraform.tfvars` and set `ansible_ssh_pubkey` to ubongo's real control public key and `ssh_admin_cidrs` to ubongo's address (`10.20.10.151/32`). (`terraform.tfvars` is gitignored.)
|
||||||
|
|
||||||
|
- [ ] **Step 2: Init (tracks the lock file)**
|
||||||
|
|
||||||
|
Run: `make tf-init TF_ENV=offsite`
|
||||||
|
Expected: providers installed; `terraform/environments/offsite/.terraform.lock.hcl` created. `git add` the lock file (tracked per CLAUDE.md).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Validate + plan**
|
||||||
|
|
||||||
|
Run: `terraform -chdir=terraform/environments/offsite validate` → `Success`.
|
||||||
|
Run: `make tf-plan TF_ENV=offsite` → review: **1 server + 1 firewall + 1 ssh key to add**. Confirm CAX11/hel1/debian-13 and the SSH-from-ubongo rule.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit the lock file**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add terraform/environments/offsite/.terraform.lock.hcl
|
||||||
|
git commit -m "chore(tf): pin offsite provider lock (hcloud)"
|
||||||
|
```
|
||||||
|
(Co-Authored-By trailer)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 7: Apply — create askari (GATED, real billed VPS)
|
||||||
|
|
||||||
|
> **Explicit user go required.** Run on ubongo. The plan from Task 6 must be reviewed first (CLAUDE.md: never apply without a shown plan).
|
||||||
|
|
||||||
|
- [ ] **Step 1: Apply**
|
||||||
|
|
||||||
|
Run: `make tf-apply TF_ENV=offsite`
|
||||||
|
Expected: `hcloud_ssh_key.ansible`, `hcloud_firewall.this`, and `hcloud_server.this` created under `module.askari`; outputs show `askari`'s IPv4.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Generate the offsite inventory**
|
||||||
|
|
||||||
|
Run: `make tf-inventory-offsite`
|
||||||
|
Expected: `inventories/production/offsite.yml` written with `askari` under `offsite_hosts`.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify the inventory merges**
|
||||||
|
|
||||||
|
Run: `.venv/bin/ansible-inventory -i inventories/production/ --host askari` (or `--list`)
|
||||||
|
Expected: `askari` present with its `ansible_host`.
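
The merge check can also be asserted on the JSON printed by `ansible-inventory --list`. A minimal sketch; the helper name `host_in_group` is hypothetical, and it relies only on the standard `--list` layout (a top-level key per group with a `hosts` list, plus `_meta.hostvars`).

```python
import json


def host_in_group(inventory_json: str, host: str, group: str) -> bool:
    """Check that a host is a member of a group and has ansible_host set."""
    data = json.loads(inventory_json)
    members = data.get(group, {}).get("hosts", [])
    hostvars = data.get("_meta", {}).get("hostvars", {}).get(host, {})
    return host in members and "ansible_host" in hostvars
```

Passing the command's output with `host="askari"` and `group="offsite_hosts"` confirms both that the generated `offsite.yml` was picked up from the directory and that the address survived the merge.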
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit the generated inventory**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add inventories/production/offsite.yml
|
||||||
|
git commit -m "chore(inventory): askari in offsite_hosts (generated)"
|
||||||
|
```
|
||||||
|
(Co-Authored-By trailer)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 8: Bootstrap askari (GATED — needs the live host)
|
||||||
|
|
||||||
|
> Run on ubongo after Task 7. `rbw` unlocked.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Reach it**
|
||||||
|
|
||||||
|
Run: `ssh ansible@<askari-ip>` (cloud-init created the `ansible` user with ubongo's key) — expect a shell. If refused, check the firewall `ssh_admin_cidrs` matches ubongo's egress IP.
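
A refused connection is most likely when `ssh_admin_cidrs` holds a private address, because a cloud firewall sees traffic from a NATed LAN with its public egress address as the source. A minimal sketch that flags such entries before an apply; the helper name `private_sources` is hypothetical, and only IPv4 RFC 1918 ranges are checked.

```python
import ipaddress

RFC1918 = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def private_sources(cidrs):
    """Return the CIDRs that fall entirely inside an RFC 1918 range."""
    flagged = []
    for cidr in cidrs:
        # strict=False tolerates host bits being set, e.g. 10.20.10.151/24.
        net = ipaddress.ip_network(cidr, strict=False)
        if net.version == 4 and any(net.subnet_of(block) for block in RFC1918):
            flagged.append(cidr)
    return flagged
```

An empty result does not prove the rule is right; it only rules out the private-range case. The value still has to equal the address Hetzner sees for ubongo.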
|
||||||
|
|
||||||
|
- [ ] **Step 2: Bootstrap**
|
||||||
|
|
||||||
|
Run: `make check PLAYBOOK=bootstrap` (review) then `make deploy PLAYBOOK=bootstrap` — expect the `ansible` user + sudoers confirmed/created on askari (idempotent).
|
||||||
|
|
||||||
|
- [ ] **Step 3: No repo commit** — this configures the host, not the repo. (`base` subset = M3.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 9: ADR amendments + STATUS
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/decisions/006-terraform.md`, `009-provisioning-handoff.md`, `020-firewall.md`, `007-network.md`, `016-mesh-vpn.md`, `STATUS.md`
|
||||||
|
|
||||||
|
For each: **Read the relevant section first**, then apply the change.
|
||||||
|
|
||||||
|
- [ ] **Step 1: ADR-006 — generalize the provider scope**
|
||||||
|
|
||||||
|
In the **Providers** section, the line "`bpg/proxmox` … This is the only provider." → note a second provider:
|
||||||
|
```
|
||||||
|
**`hetznercloud/hcloud`**: owns off-site VM existence (`askari`). ADR-006's scope is
|
||||||
|
**Proxmox + Hetzner** — "Terraform owns VM existence" generalizes across providers; the
|
||||||
|
`offsite` environment + `hetzner_vm` module live alongside the Proxmox env + module.
|
||||||
|
```
|
||||||
|
Also adjust the Context line "creating and destroying VMs on Proxmox" → "on Proxmox and Hetzner".
|
||||||
|
|
||||||
|
- [ ] **Step 2: ADR-009 — offsite handoff**
|
||||||
|
|
||||||
|
Add a note that `offsite` is a TF environment whose `vms` output feeds `offsite_hosts` via `tf_to_inventory.py` (`make tf-inventory-offsite` → `inventories/production/offsite.yml`), and that the production inventory is a **directory** merging the Proxmox + offsite generated files.
|
||||||
|
|
||||||
|
- [ ] **Step 3: ADR-020 — askari's perimeter**
|
||||||
|
|
||||||
|
Note that off-cluster `askari` has no OPNsense; its **perimeter** is a TF-managed Hetzner Cloud Firewall (SSH-from-ubongo now; NetBird ports in M4). The `group_vars` catalog stays authoritative for the host nftables layer.
|
||||||
|
|
||||||
|
- [ ] **Step 4: ADR-007 / ADR-016 — askari is TF-provisioned**
|
||||||
|
|
||||||
|
Replace "provisioned … independently … added manually" wording for askari with "provisioned as Terraform IaC (hcloud), managed independently of the Proxmox cluster (own provider + state)."
|
||||||
|
|
||||||
|
- [ ] **Step 5: STATUS.md**
|
||||||
|
|
||||||
|
Move/realize askari's row per how far Task 7/8 got. If applied: under "Real and working today" — `askari` **Built + applied** (CAX11/hel1/debian-13, cloud firewall SSH-from-ubongo, bootstrapped, in `offsite_hosts`). If only authored (apply deferred): note the TF is written + `tf-plan` clean, apply pending on ubongo.
|
||||||
|
|
||||||
|
- [ ] **Step 6: Lint + commit**
|
||||||
|
|
||||||
|
Run: `make lint` (must pass).
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/006-terraform.md docs/decisions/009-provisioning-handoff.md \
|
||||||
|
docs/decisions/020-firewall.md docs/decisions/007-network.md \
|
||||||
|
docs/decisions/016-mesh-vpn.md STATUS.md
|
||||||
|
git commit -m "docs(askari): amend ADR-006/009/020/007/016 for TF-provisioned offsite host; STATUS"
|
||||||
|
```
|
||||||
|
(Co-Authored-By trailer)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-Review (completed)
|
||||||
|
|
||||||
|
- **Spec coverage:** TF owns existence / generalize ADR-006 (Decision 1) → Tasks 2,3,9; CAX11/hel1/debian-13 (Decision 2) → Task 3; TF cloud firewall, SSH-from-ubongo, NetBird ports later (Decision 3) → Task 2 + Task 9 ADR-020; token via `TF_VAR_hcloud_token` from vault (Decision 4) → Task 4; ADR-009 handoff via `tf_to_inventory` (Decision 5) → Tasks 4,5,7; cloud-init `ansible` user + bootstrap → Tasks 2,8; state + DR (import) → Task 3 backend; ADR amendments → Task 9. All covered.
|
||||||
|
- **Placeholder scan:** none — HCL, make, and test content are concrete. `<askari-ip>`/`<source>`/`<date>` are runtime/verification values, not unspecified logic.
|
||||||
|
- **Type/name consistency:** module vars (`name`, `server_type`, `location`, `image`, `ansible_ssh_pubkey`, `ssh_admin_cidrs`, `labels`) match between module + env call; the `vms` output shape (`{ip, group}`) matches `tf_to_inventory.py`'s contract; `TF_VAR_hcloud_token` ↔ `var.hcloud_token`; `vault.hetzner.token` matches the stored key.
|
||||||
|
- **Notes for the implementer:** (a) confirm Ansible merges the directory inventory's two files so `askari` resolves (Task 7 Step 3); (b) verify `hcloud_server` arg names against the pinned provider version (Task 1) — adjust `public_net`/`firewall_ids` if the provider differs; (c) Tasks 7–8 create a billed VPS — gated on explicit go.
|
||||||
250
docs/superpowers/plans/2026-06-14-base-ssh-fail2ban-m3.md
Normal file
|
|
@ -0,0 +1,250 @@
|
||||||
|
# base SSH hardening + fail2ban (M3) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Add SSH-hardening + fail2ban concerns to the `base` role (ADR-002 baseline) and apply them to askari — without locking anything out.
|
||||||
|
|
||||||
|
**Architecture:** Two new `base` task files (`ssh.yml`, `fail2ban.yml`), both under the existing `hardening` concern tag, included after `firewall.yml`. Applied to askari **by tag** (`hardening`) so the host firewall (default-deny) is NOT applied pre-mesh — the Hetzner Cloud Firewall remains askari's perimeter until M5. A `LIMIT=`/`TAGS=` passthrough on `make check/deploy` enables the targeted apply.
|
||||||
|
|
||||||
|
**Tech Stack:** Ansible (`ansible.builtin`, `ansible.posix.authorized_key` — already vendored), sshd drop-in config, fail2ban.
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-14-base-ssh-fail2ban-m3-design.md`
|
||||||
|
|
||||||
|
**Execution context:** Tasks 1–3 author + Molecule (Docker available). **Task 4 applies to live askari** (gated; reachable from ubongo). No new billed resources.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: `make check/deploy` LIMIT + TAGS passthrough
|
||||||
|
|
||||||
|
**Files:** Modify `Makefile` (the `check` and `deploy` recipes).
|
||||||
|
|
||||||
|
- [ ] **Step 1:** In the `check:` recipe, change the command line to:
|
||||||
|
```makefile
|
||||||
|
$(PLAYBOOK_BIN) $(INVENTORY) $(VAULT_ARGS) $(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS)) --check --diff playbooks/$(PLAYBOOK).yml
|
||||||
|
```
|
||||||
|
- [ ] **Step 2:** In the `deploy:` recipe, change the command line to:
|
||||||
|
```makefile
|
||||||
|
$(PLAYBOOK_BIN) $(INVENTORY) $(VAULT_ARGS) $(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS)) playbooks/$(PLAYBOOK).yml
|
||||||
|
```
|
||||||
|
- [ ] **Step 3:** Add help lines noting `[LIMIT=<host>] [TAGS=<tags>]` are optional on check/deploy.
|
||||||
|
- [ ] **Step 4:** Sanity-check it parses: `make check PLAYBOOK=dns LIMIT=control TAGS=public_dns 2>&1 | tail -2` (should run check-mode scoped to control). Expected: no make/syntax error.
|
||||||
|
- [ ] **Step 5:** Commit:
|
||||||
|
```bash
|
||||||
|
git add Makefile
|
||||||
|
git commit -m "feat(make): optional LIMIT= and TAGS= passthrough on check/deploy"
|
||||||
|
```
|
||||||
|
(append `Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>`)
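
The `$(if $(LIMIT),--limit $(LIMIT))` idiom expands to nothing when the variable is empty, so a plain `make check` behaves as before. A minimal Python sketch of that expansion, for illustration only: `build_playbook_argv` is a hypothetical name, `ansible-playbook` is an assumed stand-in for `$(PLAYBOOK_BIN)`, and the inventory and vault arguments are left out.

```python
def build_playbook_argv(playbook, limit="", tags="", check=False):
    """Mimic the recipe: optional flags appear only when the variable is set."""
    argv = ["ansible-playbook"]  # assumed stand-in for $(PLAYBOOK_BIN)
    if limit:  # $(if $(LIMIT),--limit $(LIMIT))
        argv += ["--limit", limit]
    if tags:  # $(if $(TAGS),--tags $(TAGS))
        argv += ["--tags", tags]
    if check:  # the check recipe adds --check --diff
        argv += ["--check", "--diff"]
    argv.append(f"playbooks/{playbook}.yml")
    return argv


if __name__ == "__main__":
    # Targeted dry-run, as used later for askari.
    print(build_playbook_argv("site", limit="askari", tags="hardening", check=True))
    # No LIMIT/TAGS: the command line is unchanged from today's behaviour.
    print(build_playbook_argv("site"))
```

Because an empty variable contributes no arguments at all, existing invocations that never set `LIMIT` or `TAGS` are unaffected.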
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: base `hardening` concern — ssh + fail2ban
|
||||||
|
|
||||||
|
**Files:** Create `roles/base/tasks/ssh.yml`, `roles/base/tasks/fail2ban.yml`, `roles/base/templates/sshd_hardening.conf.j2`, `roles/base/templates/fail2ban_sshd.local.j2`; modify `roles/base/tasks/main.yml`, `roles/base/defaults/main.yml`, `roles/base/handlers/main.yml`, `inventories/production/group_vars/all/vars.yml`.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** Append to `roles/base/defaults/main.yml`:
|
||||||
|
```yaml
|
||||||
|
|
||||||
|
# SSH hardening + fail2ban (ADR-002) — `hardening` concern.
|
||||||
|
base__ssh_password_authentication: "no"
|
||||||
|
base__ssh_permit_root_login: "no"
|
||||||
|
base__fail2ban_maxretry: 5
|
||||||
|
base__fail2ban_bantime: 1h
|
||||||
|
base__fail2ban_findtime: 10m
|
||||||
|
# base__ssh_authorised_keys lives in group_vars/all/vars.yml (per-person control keys).
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2:** Create `roles/base/templates/sshd_hardening.conf.j2`:
|
||||||
|
```
|
||||||
|
# Managed by Ansible (base role, ADR-002). Do not edit on the host.
|
||||||
|
PasswordAuthentication {{ base__ssh_password_authentication }}
|
||||||
|
PermitRootLogin {{ base__ssh_permit_root_login }}
|
||||||
|
PubkeyAuthentication yes
|
||||||
|
KbdInteractiveAuthentication no
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3:** Create `roles/base/templates/fail2ban_sshd.local.j2`:
|
||||||
|
```
|
||||||
|
# Managed by Ansible (base role, ADR-002).
|
||||||
|
[sshd]
|
||||||
|
enabled = true
|
||||||
|
maxretry = {{ base__fail2ban_maxretry }}
|
||||||
|
bantime = {{ base__fail2ban_bantime }}
|
||||||
|
findtime = {{ base__fail2ban_findtime }}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4:** Create `roles/base/tasks/ssh.yml`:
|
||||||
|
```yaml
|
||||||
|
---
- name: Ensure openssh-server is installed
  ansible.builtin.apt:
    name: openssh-server
    state: present
    update_cache: true

- name: Render hardened sshd drop-in
  ansible.builtin.template:
    src: sshd_hardening.conf.j2
    dest: /etc/ssh/sshd_config.d/10-boma.conf
    owner: root
    group: root
    mode: "0644"
  notify: reload sshd

- name: Validate the full sshd config (drop-in included)
  ansible.builtin.command: sshd -t
  changed_when: false

- name: Authorise control SSH keys for the ansible user
  ansible.posix.authorized_key:
    user: "{{ ansible_user | default('ansible') }}"
    key: "{{ base__ssh_authorised_keys | join('\n') }}"
    exclusive: true
  when: base__ssh_authorised_keys | length > 0
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 5:** Create `roles/base/tasks/fail2ban.yml`:
|
||||||
|
```yaml
|
||||||
|
---
- name: Install fail2ban
  ansible.builtin.apt:
    name: fail2ban
    state: present
    update_cache: true

- name: Configure the sshd jail
  ansible.builtin.template:
    src: fail2ban_sshd.local.j2
    dest: /etc/fail2ban/jail.d/sshd.local
    owner: root
    group: root
    mode: "0644"
  notify: restart fail2ban

- name: Enable and start fail2ban
  ansible.builtin.service:
    name: fail2ban
    enabled: true
    state: started
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 6:** Replace `roles/base/handlers/main.yml`:
|
||||||
|
```yaml
|
||||||
|
---
- name: Reload sshd
  listen: reload sshd
  ansible.builtin.service:
    name: ssh
    state: reloaded

- name: Restart fail2ban
  listen: restart fail2ban
  ansible.builtin.service:
    name: fail2ban
    state: restarted
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 7:** In `roles/base/tasks/main.yml`, add after the firewall include:
|
||||||
|
```yaml
|
||||||
|
- name: SSH hardening
  ansible.builtin.include_tasks:
    file: ssh.yml
    # apply: is needed so --tags hardening also reaches the included tasks
    apply:
      tags: [hardening]
  tags: [hardening]

- name: fail2ban intrusion deterrence
  ansible.builtin.include_tasks:
    file: fail2ban.yml
    apply:
      tags: [hardening]
  tags: [hardening]
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 8:** In `inventories/production/group_vars/all/vars.yml`, set `base__ssh_authorised_keys` (replace the empty `[]`):
|
||||||
|
```yaml
|
||||||
|
base__ssh_authorised_keys:
|
||||||
|
- "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKSx1TFLJ9H8vCe5ZJSu7MYmAiH0/OC8evloQjGR0Bqw claude@ubongo"
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 9:** `make lint` — expect `0 failure(s)` + `check-tags: OK` (the `hardening` tag is already in `tests/tags.yml`).
|
||||||
|
- [ ] **Step 10:** Commit:
|
||||||
|
```bash
|
||||||
|
git add roles/base inventories/production/group_vars/all/vars.yml
|
||||||
|
git commit -m "feat(base): ssh hardening + fail2ban (hardening concern, ADR-002)"
|
||||||
|
```
|
||||||
|
(Co-Authored-By trailer)
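
The three jail settings interact: an address is banned once it accumulates `maxretry` failures inside a `findtime` window, and the ban lasts `bantime`. The sketch below is a simplified model of that rule using the defaults from Step 1 (5 failures, 10m, 1h). It is not fail2ban's implementation, and `SshdJailModel` is a hypothetical name used only for illustration.

```python
from collections import defaultdict, deque

MAXRETRY = 5          # base__fail2ban_maxretry
FINDTIME = 10 * 60    # base__fail2ban_findtime (10m) in seconds
BANTIME = 60 * 60     # base__fail2ban_bantime (1h) in seconds


class SshdJailModel:
    """Toy sliding-window model of the sshd jail (illustrative only)."""

    def __init__(self, maxretry=MAXRETRY, findtime=FINDTIME, bantime=BANTIME):
        self.maxretry = maxretry
        self.findtime = findtime
        self.bantime = bantime
        self._failures = defaultdict(deque)  # ip -> recent failure timestamps
        self._banned_until = {}              # ip -> ban expiry timestamp

    def is_banned(self, ip, now):
        return self._banned_until.get(ip, 0) > now

    def record_failure(self, ip, now):
        """Record a failed login; return True if this failure triggers a ban."""
        window = self._failures[ip]
        window.append(now)
        # Forget failures that fell out of the findtime window.
        while window and now - window[0] > self.findtime:
            window.popleft()
        if len(window) >= self.maxretry:
            self._banned_until[ip] = now + self.bantime
            window.clear()
            return True
        return False


if __name__ == "__main__":
    jail = SshdJailModel()
    # Five failures one minute apart: only the fifth triggers the ban.
    print([jail.record_failure("203.0.113.7", t) for t in (0, 60, 120, 180, 240)])
```

Failures spaced further apart than `findtime` never accumulate in this model, which is why a slow attacker is deterred only by key-only authentication, not by the jail.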
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Molecule coverage
|
||||||
|
|
||||||
|
**Files:** Modify `roles/base/molecule/default/converge.yml`, `roles/base/molecule/default/verify.yml`.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** In `converge.yml`, the role already runs with `base__firewall_apply: false`. Leave `base__ssh_authorised_keys` unset (defaults to `[]` → the `authorized_key` task is skipped, no test user needed). No converge change needed unless vars are missing — confirm the play still has `roles: [base]`.
|
||||||
|
|
||||||
|
- [ ] **Step 2:** Append assertions to `verify.yml` (after the existing firewall checks):
|
||||||
|
```yaml
|
||||||
|
- name: sshd drop-in present and config valid
  ansible.builtin.command: sshd -t
  changed_when: false
  tags: [verify]

- name: PasswordAuthentication is disabled
  ansible.builtin.command: grep -q '^PasswordAuthentication no' /etc/ssh/sshd_config.d/10-boma.conf
  changed_when: false
  tags: [verify]

- name: fail2ban sshd jail configured
  ansible.builtin.command: grep -q '^\[sshd\]' /etc/fail2ban/jail.d/sshd.local
  changed_when: false
  tags: [verify]
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3:** Run `make test ROLE=base`. Expected: converge installs openssh-server + fail2ban, renders the drop-ins, validates sshd, starts fail2ban; verify passes; idempotence clean. If the Molecule image lacks systemd-for-fail2ban or apt fails offline, capture the error (the image is systemd-enabled per `molecule.yml`).
|
||||||
|
- [ ] **Step 4:** Commit:
|
||||||
|
```bash
|
||||||
|
git add roles/base/molecule
|
||||||
|
git commit -m "test(base): Molecule coverage for ssh hardening + fail2ban"
|
||||||
|
```
|
||||||
|
(Co-Authored-By trailer)
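
The `grep` assertion checks the drop-in's content, not the effective setting. sshd uses the first value it obtains for each keyword, and a glob `Include` reads files in lexical order, so `10-boma.conf` only wins if nothing earlier sets the same keyword. The sketch below models that resolution. It assumes the main `sshd_config` includes `sshd_config.d/*.conf` before setting any of these keywords, it ignores `Match` blocks, and the `50-cloud-init.conf` file is a hypothetical example of a competing drop-in.

```python
def effective_sshd_options(dropins):
    """Resolve keywords first-wins across drop-ins.

    dropins maps filename -> config text. Files are read in lexical order,
    and the first value seen for a keyword is kept.
    """
    effective = {}
    for name in sorted(dropins):
        for line in dropins[name].splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            keyword, value = parts
            # sshd keywords are case-insensitive; keep the first value only.
            effective.setdefault(keyword.lower(), value.strip())
    return effective


if __name__ == "__main__":
    dropins = {
        "50-cloud-init.conf": "PasswordAuthentication yes\n",  # hypothetical
        "10-boma.conf": (
            "# Managed by Ansible\n"
            "PasswordAuthentication no\n"
            "PermitRootLogin no\n"
        ),
    }
    print(effective_sshd_options(dropins))
```

On a live host, `sshd -T` prints the effective configuration and is a stronger check than grepping one file.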
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Apply to askari (gated — live host)
|
||||||
|
|
||||||
|
> Runs against live askari (reachable from ubongo). `rbw` unlocked. Applies ONLY the
|
||||||
|
> `hardening` concern (`--tags hardening`) so the host firewall is not touched.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Dry-run.** `make check PLAYBOOK=site LIMIT=askari TAGS=hardening` — review: openssh-server present, sshd drop-in (`PasswordAuthentication no`, `PermitRootLogin no`), authorized_key for `ansible`, fail2ban installed + sshd jail. Confirm NO firewall tasks appear.
|
||||||
|
- [ ] **Step 2: Apply.** `make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening` — expect changed for the drop-in, fail2ban install/config; `failed=0`.
|
||||||
|
- [ ] **Step 3: Verify SSH still works (lock-out guard).** `.venv/bin/ansible offsite_hosts -m ping` → `pong`. And `.venv/bin/ansible offsite_hosts -b -m command -a 'sshd -t'` → rc=0.
|
||||||
|
- [ ] **Step 4: Verify fail2ban.** `.venv/bin/ansible offsite_hosts -b -m command -a 'fail2ban-client status sshd'` → shows the sshd jail active.
|
||||||
|
- [ ] **Step 5: Idempotence.** Re-run Step 2 → `changed=0`.
|
||||||
|
- [ ] **Step 6: No repo commit** (configures the host, not the repo).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Docs
|
||||||
|
|
||||||
|
**Files:** Modify `STATUS.md`, `docs/ROADMAP.md`.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** In `STATUS.md`, update the `roles/base/` row (under "Scaffolded but empty"/partial) to note the `hardening` concern (ssh + fail2ban) is now built, and **applied to askari**; firewall concern still pending application (mesh-gated). If askari's row exists in "Real and working today," append "SSH hardened + fail2ban (M3)".
|
||||||
|
- [ ] **Step 2:** In `docs/ROADMAP.md`, mark **M3** as done (ssh + fail2ban built + applied to askari; NetBird agent deferred to M4; host firewall + ubongo hardening at M5).
|
||||||
|
- [ ] **Step 3:** `make lint`; commit:
|
||||||
|
```bash
|
||||||
|
git add STATUS.md docs/ROADMAP.md
|
||||||
|
git commit -m "docs(base): M3 — ssh hardening + fail2ban applied to askari; STATUS + roadmap"
|
||||||
|
```
|
||||||
|
(Co-Authored-By trailer)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-Review (completed)
|
||||||
|
|
||||||
|
- **Spec coverage:** ssh + fail2ban concerns under `hardening` (Decision 1) → Task 2;
|
||||||
|
apply-by-tag, no firewall (Decision 2) → Task 4 (`TAGS=hardening`); `base__ssh_authorised_keys`
|
||||||
|
populated (Decision 3) → Task 2 Step 8; LIMIT/TAGS passthrough (Decision 4) → Task 1;
|
||||||
|
ADR-002 controls (key-only, no root, fail2ban 5/1h) → Task 2; Molecule + live verify
|
||||||
|
(testing) → Tasks 3, 4. Deferrals (agent/M4, host-fw+ubongo/M5, auditd/Phase 2) honoured.
|
||||||
|
- **Placeholder scan:** none — all task/template/handler content is concrete.
|
||||||
|
- **Name consistency:** `base__ssh_*` / `base__fail2ban_*` / `base__ssh_authorised_keys`
|
||||||
|
used identically across defaults, templates, tasks, and group_vars; handler listen-topics
|
||||||
|
(`reload sshd`, `restart fail2ban`) match the `notify:` strings.
|
||||||
|
- **Lock-out guard:** sshd hardening only disables password+root (we use key+sudo); the
|
||||||
|
`ansible` user's key is preserved (`base__ssh_authorised_keys` has it); `sshd -t`
|
||||||
|
validates before reload; firewall untouched (`--tags hardening`). Task 4 verifies SSH
|
||||||
|
post-apply.
|
||||||
641
docs/superpowers/plans/2026-06-14-kaizen-command.md
Normal file
|
|
@ -0,0 +1,641 @@
|
||||||
|
# `/kaizen` Command Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Build the `/kaizen` kaizen-loop command — a stdlib scanner that parses `docs/FRICTION.md` *Open signals* plus an interactive command that curates them (add/change/park/remove) into the decisions ledger.
|
||||||
|
|
||||||
|
**Architecture:** Mirrors `/review-repo` exactly: a deterministic stdlib Phase-0 scanner (`scripts/friction-scan.py`, unit-tested) feeds a markdown command (`.claude/commands/kaizen.md`) that drives the interactive curation. The same scanner powers a stage-2 nudge surfaced in `/review-repo`.
|
||||||
|
|
||||||
|
**Tech Stack:** Python 3 standard library only (matches `scripts/repo-scan.py`); pytest; markdown command docs.
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-14-kaizen-command-design.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File structure
|
||||||
|
|
||||||
|
- Create: `scripts/friction-scan.py` — stdlib parser of `FRICTION.md` *Open signals*; `--json` (default) and `--nudge` modes. One responsibility: turn the prose signal log into structured data + the nudge line.
|
||||||
|
- Create: `tests/test_friction_scan.py` — unit tests for the parser (string-based, deterministic via `--today`), matching `tests/test_repo_scan.py`.
|
||||||
|
- Create: `.claude/commands/kaizen.md` — the interactive curation process.
|
||||||
|
- Modify: `.claude/commands/review-repo.md` — add the stage-2 nudge line to its report.
|
||||||
|
- Modify: `STATUS.md` — add a `/kaizen` row.
|
||||||
|
- Modify: `docs/TODO.md` — mark item 11.1 in progress / built.
|
||||||
|
|
||||||
|
All scanner logic lives in functions that take strings/data (not files) so tests need no fixtures on disk; only `load_signals(path, today)` and `main()` touch the filesystem.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 1: Scanner scaffold — section extraction + signal splitting
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `scripts/friction-scan.py`
|
||||||
|
- Test: `tests/test_friction_scan.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing test**
|
||||||
|
|
||||||
|
```python
|
||||||
|
# tests/test_friction_scan.py
import importlib.util
import os

_SPEC = importlib.util.spec_from_file_location(
    "friction_scan",
    os.path.join(os.path.dirname(__file__), "..", "scripts", "friction-scan.py"),
)
fs = importlib.util.module_from_spec(_SPEC)
_SPEC.loader.exec_module(fs)

SAMPLE = """# FRICTION.md

## Open signals

_(append new raw signals here)_

- `[gotcha]` **First thing** (2026-06-01): body line one.
  continuation line two.
- `[friction]` **Second thing** (2026-06-10): only one line.

---

## Kaizen reviews — decisions ledger

- `[gotcha]` **Should not be parsed** (2026-01-01): in the ledger.
"""


def test_extract_open_section_stops_at_next_heading():
    section = fs.extract_open_section(SAMPLE)
    assert "First thing" in section
    assert "Second thing" in section
    assert "Should not be parsed" not in section


def test_split_signals_finds_two_items_and_joins_continuations():
    signals = fs.split_signals(fs.extract_open_section(SAMPLE))
    assert len(signals) == 2
    assert "continuation line two" in signals[0]
    assert signals[1].startswith("`[friction]`")
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run test to verify it fails**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
|
||||||
|
Expected: FAIL — `friction-scan.py` does not exist / `extract_open_section` undefined.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Write minimal implementation**
|
||||||
|
|
||||||
|
```python
|
||||||
|
#!/usr/bin/env python3
"""Parse docs/FRICTION.md 'Open signals' into structured data for /kaizen.

Stdlib only. Modes:
  --json (default): emit the open signals as JSON (Phase-0 input for /kaizen)
  --nudge         : print a one-line 'loop overdue?' summary

Authoritative design: docs/superpowers/specs/2026-06-14-kaizen-command-design.md
"""
import argparse
import datetime
import json
import os
import re

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
FRICTION = os.path.join(REPO_ROOT, "docs", "FRICTION.md")


def extract_open_section(text):
    """Return the body between '## Open signals' and the next '## ' heading."""
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.strip().lower() == "## open signals":
            start = i + 1
            break
    if start is None:
        return ""
    end = len(lines)
    for j in range(start, len(lines)):
        if lines[j].startswith("## "):
            end = j
            break
    return "\n".join(lines[start:end])


def split_signals(section):
    """Split the Open-signals body into raw per-signal blocks.

    A signal starts with a top-level '- ' bullet; indented or blank lines are
    continuations. Returns a list of multi-line strings with the leading '- '
    stripped from the first line."""
    signals = []
    current = None
    for line in section.splitlines():
        if line.startswith("- "):
            if current is not None:
                signals.append("\n".join(current).strip())
            current = [line[2:]]
        elif current is not None:
            if line.strip() == "" or line.startswith(" "):
                current.append(line.strip())
            else:
                signals.append("\n".join(current).strip())
                current = None
    if current is not None:
        signals.append("\n".join(current).strip())
    return [s for s in signals if s]


if __name__ == "__main__":  # pragma: no cover (filled in Task 4)
    pass
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run test to verify it passes**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
|
||||||
|
Expected: PASS (2 tests).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add scripts/friction-scan.py tests/test_friction_scan.py
|
||||||
|
git commit -m "feat(kaizen): friction-scan section extraction + signal split"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 2: Per-signal fields — tag, first_seen, age_days
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `scripts/friction-scan.py`
|
||||||
|
- Test: `tests/test_friction_scan.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing test**
|
||||||
|
|
||||||
|
```python
|
||||||
|
import datetime

TODAY = datetime.date(2026, 6, 15)


def test_parse_signal_extracts_tag_and_date_and_age():
    raw = fs.split_signals(fs.extract_open_section(SAMPLE))[0]
    sig = fs.parse_signal(raw, TODAY)
    assert sig["tag"] == "gotcha"
    assert sig["first_seen"] == "2026-06-01"
    assert sig["age_days"] == 14
    assert "First thing" in sig["text"]


def test_parse_signal_handles_missing_date():
    sig = fs.parse_signal("`[unused]` **No date here** something", TODAY)
    assert sig["tag"] == "unused"
    assert sig["first_seen"] is None
    assert sig["age_days"] is None
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run test to verify it fails**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py::test_parse_signal_extracts_tag_and_date_and_age -v`
|
||||||
|
Expected: FAIL — `parse_signal` undefined.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Write minimal implementation**
|
||||||
|
|
||||||
|
Add near the top (after imports):
|
||||||
|
|
||||||
|
```python
|
||||||
|
TAG_RE = re.compile(r"`\[(friction|gotcha|recurring|unused)\]`")
|
||||||
|
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
|
||||||
|
```
|
||||||
|
|
||||||
|
Add the function (above the `__main__` block):
|
||||||
|
|
||||||
|
```python
|
||||||
|
def parse_signal(raw, today):
    """Turn one raw signal block into a structured dict."""
    tag_m = TAG_RE.search(raw)
    date_m = DATE_RE.search(raw)
    if date_m:
        first_seen = date_m.group(0)
        seen = datetime.date(int(date_m.group(1)), int(date_m.group(2)), int(date_m.group(3)))
        age_days = (today - seen).days
    else:
        first_seen = None
        age_days = None
    return {
        "tag": tag_m.group(1) if tag_m else None,
        "first_seen": first_seen,
        "age_days": age_days,
        "recurrence_count": 1,  # refined in Task 3
        "referenced_paths": [],  # filled in Task 3
        "still_exists": True,  # filled in Task 3
        "text": " ".join(raw.split()),
    }
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run test to verify it passes**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
|
||||||
|
Expected: PASS (4 tests).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add scripts/friction-scan.py tests/test_friction_scan.py
|
||||||
|
git commit -m "feat(kaizen): parse tag/first_seen/age per signal"
|
||||||
|
```
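
One edge case worth knowing about: `DATE_RE` matches any `NNNN-NN-NN` run, so a malformed entry such as `2026-13-45` would reach `datetime.date(...)` and raise `ValueError`. The sketch below shows one possible way to tolerate that by skipping to the next candidate. `first_valid_date` is a hypothetical helper, not part of the plan, and whether the scanner should skip or fail loudly is a design choice left to the implementer.

```python
import datetime
import re

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def first_valid_date(raw):
    """Return the first YYYY-MM-DD in raw that is a real calendar date, else None."""
    for m in DATE_RE.finditer(raw):
        try:
            return datetime.date(*(int(g) for g in m.groups()))
        except ValueError:
            continue  # pattern matched, but it is not a calendar date
    return None


if __name__ == "__main__":
    today = datetime.date(2026, 6, 15)
    seen = first_valid_date("`[gotcha]` **x** (2026-13-45), really 2026-06-01")
    print(seen, (today - seen).days)
    print(first_valid_date("no date here"))
```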
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 3: Recurrence count + referenced paths + still_exists
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `scripts/friction-scan.py`
|
||||||
|
- Test: `tests/test_friction_scan.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing test**
|
||||||
|
|
||||||
|
```python
|
||||||
|
def test_recurrence_from_ordinal():
    assert fs.parse_recurrence("blah 5th occurrence (06-05/06/06) blah") == 5


def test_recurrence_from_datelist_when_no_ordinal():
    # three slash-separated date fragments → recurrence 3
    assert fs.parse_recurrence("recurred (06-05/06-09/06-10) again") == 3


def test_recurrence_defaults_to_one():
    assert fs.parse_recurrence("a one-off gotcha") == 1


def test_parse_paths_picks_repo_paths_only():
    paths = fs.parse_paths("see `scripts/repo-scan.py` and `latest` and `foo.yml`")
    assert "scripts/repo-scan.py" in paths
    assert "foo.yml" in paths
    assert "latest" not in paths


def test_still_exists_false_for_missing_path():
    sig = fs.parse_signal("`[unused]` **x** (2026-06-01): `scripts/nope-not-real.py`", TODAY)
    assert sig["still_exists"] is False


def test_still_exists_true_for_real_path():
    sig = fs.parse_signal("`[gotcha]` **x** (2026-06-01): `scripts/repo-scan.py`", TODAY)
    assert sig["still_exists"] is True
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run test to verify it fails**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -k "recurrence or paths or still_exists" -v`
|
||||||
|
Expected: FAIL — `parse_recurrence` / `parse_paths` undefined.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Write minimal implementation**
|
||||||
|
|
||||||
|
Add the regexes near the others:
|
||||||
|
|
||||||
|
```python
|
||||||
|
ORDINAL_RE = re.compile(r"(\d+)(?:st|nd|rd|th)\s+(?:occurrence|reinforcement|time)", re.I)
|
||||||
|
DATELIST_RE = re.compile(r"\((\d{2}-\d{2}(?:/[\d/-]+)+)\)")
|
||||||
|
BACKTICK_RE = re.compile(r"`([^`]+)`")
|
||||||
|
PATH_EXTS = (".py", ".yml", ".yaml", ".md", ".sh", ".tf", ".j2", ".toml", ".cfg", ".hcl")
|
||||||
|
```
|
||||||
|
|
||||||
|
Add the helpers (above `parse_signal`):
|
||||||
|
|
||||||
|
```python
|
||||||
|
def parse_recurrence(text):
    """Best-effort recurrence count from explicit markers; default 1."""
    counts = [1]
    m = ORDINAL_RE.search(text)
    if m:
        counts.append(int(m.group(1)))
    dl = DATELIST_RE.search(text)
    if dl:
        counts.append(dl.group(1).count("/") + 1)
    return max(counts)


def parse_paths(text):
    """Backtick tokens that look like repo paths (contain '/' or a known ext)."""
    out, seen = [], set()
    for m in BACKTICK_RE.finditer(text):
        tok = m.group(1).strip()
        if ("/" in tok or tok.endswith(PATH_EXTS)) and tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out
|
||||||
|
```
|
||||||
|
|
||||||
|
Then update `parse_signal` — replace the three placeholder fields:
|
||||||
|
|
||||||
|
```python
|
||||||
|
    paths = parse_paths(raw)
    still_exists = all(os.path.exists(os.path.join(REPO_ROOT, p)) for p in paths) if paths else True
    return {
        "tag": tag_m.group(1) if tag_m else None,
        "first_seen": first_seen,
        "age_days": age_days,
        "recurrence_count": parse_recurrence(raw),
        "referenced_paths": paths,
        "still_exists": still_exists,
        "text": " ".join(raw.split()),
    }
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run test to verify it passes**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
|
||||||
|
Expected: PASS (10 tests).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add scripts/friction-scan.py tests/test_friction_scan.py
|
||||||
|
git commit -m "feat(kaizen): recurrence count + referenced-path existence"
|
||||||
|
```
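
The path heuristic is deliberately loose: any backtick token containing `/` counts as a path, and `still_exists` requires every such token to exist. The sketch below repeats the plan's `parse_paths` and adds a hypothetical `missing_paths` helper to show which token would flip `still_exists` to `False`. It runs against a temporary directory rather than the real repo, and the sample signal text is made up for illustration.

```python
import os
import re
import tempfile

BACKTICK_RE = re.compile(r"`([^`]+)`")
PATH_EXTS = (".py", ".yml", ".yaml", ".md", ".sh", ".tf", ".j2", ".toml", ".cfg", ".hcl")


def parse_paths(text):
    """Same heuristic as the plan: backtick tokens with '/' or a known extension."""
    out, seen = [], set()
    for m in BACKTICK_RE.finditer(text):
        tok = m.group(1).strip()
        if ("/" in tok or tok.endswith(PATH_EXTS)) and tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out


def missing_paths(paths, root):
    """Hypothetical helper: the subset of paths that do not exist under root."""
    return [p for p in paths if not os.path.exists(os.path.join(root, p))]


if __name__ == "__main__":
    text = "fix `scripts/repo-scan.py`; affects `make check/deploy` and `latest`"
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "scripts"))
        open(os.path.join(root, "scripts", "repo-scan.py"), "w").close()
        paths = parse_paths(text)
        print(paths)                       # tokens treated as paths
        print(missing_paths(paths, root))  # tokens that would fail still_exists
```

A command-like token such as `make check/deploy` is treated as a path and reported missing, so a `still_exists: false` result is a prompt to look at the signal rather than proof that a file was deleted.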
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 4: CLI — `load_signals`, `--json`, `--nudge`
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `scripts/friction-scan.py`
|
||||||
|
- Test: `tests/test_friction_scan.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing test**
|
||||||
|
|
||||||
|
```python
|
||||||
|
def test_nudge_line_overdue_on_recurrence():
    sigs = [{"age_days": 2, "recurrence_count": 5}]
    line = fs.nudge_line(sigs)
    assert "OVERDUE" in line
    assert "max recurrence 5x" in line


def test_nudge_line_ok_when_quiet():
    sigs = [{"age_days": 3, "recurrence_count": 1}, {"age_days": 1, "recurrence_count": 1}]
    line = fs.nudge_line(sigs)
    assert "ok" in line
    assert "OVERDUE" not in line


def test_nudge_line_overdue_on_count():
    sigs = [{"age_days": 1, "recurrence_count": 1} for _ in range(8)]
    assert "OVERDUE" in fs.nudge_line(sigs)


def test_load_signals_reads_real_friction_file():
    path = os.path.join(os.path.dirname(__file__), "..", "docs", "FRICTION.md")
    sigs = fs.load_signals(path, TODAY)
    assert len(sigs) >= 1
    assert all(s["tag"] in {"friction", "gotcha", "recurring", "unused"} for s in sigs)
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run test to verify it fails**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -k "nudge or load_signals" -v`
|
||||||
|
Expected: FAIL — `nudge_line` / `load_signals` undefined.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Write minimal implementation**
|
||||||
|
|
||||||
|
Add thresholds near the top (after `FRICTION = ...`):
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Nudge thresholds (tunable; the /kaizen self-eval phase revisits these).
|
||||||
|
NUDGE_MIN_OPEN = 8
|
||||||
|
NUDGE_MAX_AGE_DAYS = 21
|
||||||
|
NUDGE_MIN_RECURRENCE = 3
|
||||||
|
```
|
||||||
|
|
||||||
|
Add the functions and replace the `__main__` block:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def load_signals(path, today):
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    return [parse_signal(s, today) for s in split_signals(extract_open_section(text))]


def nudge_line(signals):
    n = len(signals)
    ages = [s["age_days"] for s in signals if s.get("age_days") is not None]
    oldest = max(ages) if ages else 0
    max_rec = max((s["recurrence_count"] for s in signals), default=0)
    overdue = n >= NUDGE_MIN_OPEN or oldest >= NUDGE_MAX_AGE_DAYS or max_rec >= NUDGE_MIN_RECURRENCE
    status = "OVERDUE — run /kaizen" if overdue else "ok"
    return f"kaizen: {n} open signals, oldest {oldest}d, max recurrence {max_rec}x — {status}"


def main():
    parser = argparse.ArgumentParser(description="Parse FRICTION.md Open signals for /kaizen.")
    parser.add_argument("--nudge", action="store_true", help="print a one-line overdue summary")
    parser.add_argument("--today", help="override today's date (YYYY-MM-DD) for testing")
    parser.add_argument("--file", default=FRICTION, help="path to FRICTION.md")
    args = parser.parse_args()

    if args.today:
        y, m, d = args.today.split("-")
        today = datetime.date(int(y), int(m), int(d))
    else:
        today = datetime.date.today()

    signals = load_signals(args.file, today)
    if args.nudge:
        print(nudge_line(signals))
    else:
        print(json.dumps(signals, indent=2))


if __name__ == "__main__":
    main()
|
||||||
|
```
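
The three thresholds are combined with a plain `or`, so any single condition is enough to flag the loop as overdue. A minimal sketch of that behaviour follows. It restates the constants and `nudge_line` from this step so it runs on its own; the sample signal dicts are made up for illustration and carry only the two keys `nudge_line` reads.

```python
# Illustrative only: restates the thresholds and nudge_line from this step.
NUDGE_MIN_OPEN = 8
NUDGE_MAX_AGE_DAYS = 21
NUDGE_MIN_RECURRENCE = 3


def nudge_line(signals):
    n = len(signals)
    ages = [s["age_days"] for s in signals if s.get("age_days") is not None]
    oldest = max(ages) if ages else 0
    max_rec = max((s["recurrence_count"] for s in signals), default=0)
    overdue = (
        n >= NUDGE_MIN_OPEN
        or oldest >= NUDGE_MAX_AGE_DAYS
        or max_rec >= NUDGE_MIN_RECURRENCE
    )
    status = "OVERDUE — run /kaizen" if overdue else "ok"
    return f"kaizen: {n} open signals, oldest {oldest}d, max recurrence {max_rec}x — {status}"


# Hypothetical sample data: two fresh, non-recurring signals stay under every threshold.
quiet = [
    {"age_days": 2, "recurrence_count": 1},
    {"age_days": None, "recurrence_count": 0},  # undated signals are skipped for age
]
# One signal seen three times trips NUDGE_MIN_RECURRENCE on its own.
recurring = quiet + [{"age_days": 5, "recurrence_count": 3}]

print(nudge_line(quiet))
print(nudge_line(recurring))
```

An empty list falls through to the `ok` branch because both `max` calls have a zero fallback, which is worth keeping in mind if the Open signals section is ever fully drained.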
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run tests + smoke-test the CLI**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
|
||||||
|
Expected: PASS (14 tests).
|
||||||
|
|
||||||
|
Run: `python3 scripts/friction-scan.py --nudge`
|
||||||
|
Expected: one line like `kaizen: 13 open signals, oldest 14d, max recurrence 5x — OVERDUE — run /kaizen`.
|
||||||
|
|
||||||
|
Run: `python3 scripts/friction-scan.py | head -20`
|
||||||
|
Expected: a JSON array of signal objects.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add scripts/friction-scan.py tests/test_friction_scan.py
|
||||||
|
git commit -m "feat(kaizen): friction-scan CLI (JSON output by default, --nudge)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 5: The `/kaizen` command document
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `.claude/commands/kaizen.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the command file**
|
||||||
|
|
||||||
|
Write `.claude/commands/kaizen.md` with exactly this content:
|
||||||
|
|
||||||
|
````markdown
|
||||||
|
# Kaizen — curate the friction log into improvements
|
||||||
|
|
||||||
|
Consume the **Open signals** in `docs/FRICTION.md`: decide a verdict for each, migrate
|
||||||
|
durable knowledge into the right docs, and archive consumed signals into the decisions
|
||||||
|
ledger. **Curate-only** — do not hunt for new signals; capture stays manual. This is an
|
||||||
|
interactive, judgment-dense pass: propose, the operator decides, you apply on approval.
|
||||||
|
|
||||||
|
Design: `docs/superpowers/specs/2026-06-14-kaizen-command-design.md`.
|
||||||
|
|
||||||
|
## Phase 0 — scan
|
||||||
|
Run `python3 scripts/friction-scan.py > /tmp/kaizen.json`. It returns each Open signal as
|
||||||
|
`{tag, first_seen, age_days, recurrence_count, referenced_paths, still_exists, text}`.
|
||||||
|
Treat `still_exists: false` as a hint the signal may already be resolved.
|
||||||
|
|
||||||
|
## Phase 1 — triage
|
||||||
|
Order signals by `recurrence_count` desc, then `age_days` desc, then tag. **Group signals
|
||||||
|
that share a root cause** and curate them together. Present the agenda before editing
|
||||||
|
anything: total open, how many recurring (≥3), how many look already-resolved.
|
||||||
|
|
||||||
|
## Phase 2 — per-signal curation (interactive)
|
||||||
|
For each signal/group, present: a one-line restatement, the evidence (age, recurrence,
|
||||||
|
still-real), and a proposed **verdict**. Verdicts:
|
||||||
|
|
||||||
|
- **SYSTEMATIZE** — migrate the durable lesson into its right home (a runbook, an ADR,
|
||||||
|
`CLAUDE.md`, a new `scripts/repo-scan.py` check, or a hook).
|
||||||
|
- **CHANGE** — adjust an existing tool/convention/config rather than document it.
|
||||||
|
- **PARK** — *out-of-phase but not obsolete*. Remove from the active tree, but write a
|
||||||
|
ledger row recording **where it now lives (git SHA/branch/doc) and a resurrection
|
||||||
|
trigger**. The default for "not touched lately but not wrong."
|
||||||
|
- **REMOVE** — *obsolete*: superseded, wrong, never worked, duplicated. Ledger row states
|
||||||
|
why.
|
||||||
|
- **ALREADY-BUILT** — the systematization already exists / the fix landed; archive.
|
||||||
|
- **ACCEPTED** — conscious no-op (revisit-if-recurs); archive.
|
||||||
|
- **KEEP-OPEN** — still accruing, not ripe; leave it in *Open signals* (no ledger row).
|
||||||
|
|
||||||
|
Rules:
|
||||||
|
- **Knowledge is never removed** — SYSTEMATIZE/migrate it; only *active surface* (scripts,
|
||||||
|
checks, conventions, plugins) is parked/removed.
|
||||||
|
- Every reductive verdict must classify *why unused*: **obsolete → REMOVE**,
|
||||||
|
**out-of-phase → PARK**.
|
||||||
|
- The operator approves / modifies / rejects each verdict. On approval: do the mechanical
|
||||||
|
edit (migrate text into the target doc; **move the signal from *Open signals* into the
|
||||||
|
ledger table**; delete the parked/removed file) and show the diff.
|
||||||
|
- PARK and REMOVE both delete from the active tree — the difference is the ledger row.
|
||||||
|
Git history + the ledger row are the park mechanism; never create a `parked/` directory.
|
||||||
|
|
||||||
|
## Phase 3 — close-out
|
||||||
|
- Add a new dated block under `## Kaizen reviews — decisions ledger` (newest first), same
|
||||||
|
shape as the existing block: a table with columns **Signal (first seen) | Verdict |
|
||||||
|
Resolution / where it lives now**.
|
||||||
|
- **Bias-to-remove discipline check:** if every verdict this pass was SYSTEMATIZE/CHANGE
|
||||||
|
(only accreting), say so explicitly.
|
||||||
|
- **Self-eval (light):** is `/kaizen` being run often enough (oldest consumed age)? Should
|
||||||
|
the nudge thresholds in `scripts/friction-scan.py` change? Note it.
|
||||||
|
- Run `make lint` if any code/docs changed; revert anything that breaks it.
|
||||||
|
- Commit per `CLAUDE.md` git conventions (one logical unit — straight to `main` if
|
||||||
|
small/safe, a branch if sweeping; show the diff first for a branch).
|
||||||
|
- Print a one-line summary: `consumed X · parked Y · removed Z · kept-open W · migrated → <docs>`.
|
||||||
|
|
||||||
|
## Headless / cron (future)
|
||||||
|
Deferred until the notify + cron stack exists (`docs/TODO.md` 11.3). When run
|
||||||
|
non-interactively, **report only**: print the proposed verdicts and the nudge, do not edit
|
||||||
|
or commit.
|
||||||
|
````
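
The Phase 1 ordering in the command document amounts to a three-key sort. A minimal sketch, assuming signals arrive as the dicts that `scripts/friction-scan.py` emits. `triage_order` is a hypothetical helper name, and treating a missing `age_days` as 0 is an assumption; the command document does not say how undated signals should sort.

```python
def triage_order(signals):
    """Sort by recurrence_count desc, then age_days desc, then tag (ascending)."""
    return sorted(
        signals,
        key=lambda s: (
            -s["recurrence_count"],
            -(s.get("age_days") or 0),  # assumption: undated signals sort as age 0
            s["tag"],
        ),
    )


# Made-up sample signals carrying only the keys the sort reads.
sample = [
    {"tag": "gotcha", "recurrence_count": 1, "age_days": 30},
    {"tag": "friction", "recurrence_count": 5, "age_days": 14},
    {"tag": "unused", "recurrence_count": 1, "age_days": None},
    {"tag": "friction", "recurrence_count": 1, "age_days": 30},
]

for s in triage_order(sample):
    print(s["recurrence_count"], s["age_days"], s["tag"])
```

Grouping by shared root cause stays a judgment call for the operator; the sort only fixes the order in which candidates are presented.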
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify it parses against the real log**
|
||||||
|
|
||||||
|
Run: `python3 scripts/friction-scan.py --today 2026-06-15 | python3 -c "import sys,json; print(len(json.load(sys.stdin)), 'signals')"`
|
||||||
|
Expected: prints a non-zero signal count with no traceback.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Lint**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: passes (markdown isn't linted by yamllint/ansible-lint, but this confirms nothing else broke).
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add .claude/commands/kaizen.md
|
||||||
|
git commit -m "feat(kaizen): /kaizen command — interactive friction curation"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 6: Stage-2 nudge in `/review-repo` + STATUS/TODO
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `.claude/commands/review-repo.md`
|
||||||
|
- Modify: `STATUS.md`
|
||||||
|
- Modify: `docs/TODO.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the nudge to the review-repo command**
|
||||||
|
|
||||||
|
In `.claude/commands/review-repo.md`, find the "Phase 0 — deterministic pre-scan" section
|
||||||
|
(it runs `scripts/repo-scan.py`). Immediately after that paragraph, add:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
Also run `python3 scripts/friction-scan.py --nudge` and include its one-line output in the
|
||||||
|
report's summary — it flags when the kaizen loop (`/kaizen`) is overdue (recurring signals,
|
||||||
|
backlog size, or age). This is a reminder only; do not act on `FRICTION.md` from here.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add a STATUS row**
|
||||||
|
|
||||||
|
In `STATUS.md`, under "Real and working today", add a row:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
| `/kaizen` | Curate `docs/FRICTION.md` Open signals → decisions ledger (`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`). On-demand; `--nudge` surfaces in `/review-repo`. Headless/cron deferred (TODO 11.3). |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Update TODO 11**
|
||||||
|
|
||||||
|
In `docs/TODO.md` item 11, mark sub-item 1 built:
|
||||||
|
|
||||||
|
Strike through the existing text of sub-item 1, so that ``1. Build `/retro`: ...`` begins ``1. ~~Build `/retro` ...`` with the closing `~~` at the end of the original text, then append:
``DONE — built as `/kaizen` (scope narrowed to curate-only per the 2026-06-14 spec; `/retro` name dropped). `scripts/friction-scan.py` + `.claude/commands/kaizen.md`.``
|
||||||
|
|
||||||
|
- [ ] **Step 4: Lint**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: passes.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add .claude/commands/review-repo.md STATUS.md docs/TODO.md
|
||||||
|
git commit -m "feat(kaizen): nudge in /review-repo; STATUS + TODO"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 7: Dogfood — first real `/kaizen` run
|
||||||
|
|
||||||
|
This task is **not** automated; it is the first real use, done interactively with the operator.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** Run `/kaizen` against the current Open signals (there are several,
|
||||||
|
including the 3 added 2026-06-14 and the 5× execution-mode-menu signal).
|
||||||
|
- [ ] **Step 2:** Work the interactive curation (Phase 2) with the operator, applying
|
||||||
|
verdicts on approval.
|
||||||
|
- [ ] **Step 3:** Confirm the close-out: ledger updated, `make lint` green, summary printed.
|
||||||
|
This both processes the backlog and validates the command end-to-end.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-review notes (author)
|
||||||
|
|
||||||
|
- **Spec coverage:** scope-A curate-only → Task 5 Phase 0–2; verdict model incl. PARK →
|
||||||
|
Task 5 Phase 2 + ledger; single source FRICTION.md → Task 4 `load_signals`; interactive
|
||||||
|
apply (B) → Task 5; ledger format → Task 5 Phase 3; scanner schema → Tasks 2–4; nudge +
|
||||||
|
thresholds → Task 4 + Task 6; out-of-scope items → not built (correct); `/review-repo`
|
||||||
|
relationship → Task 6 nudge. All covered.
|
||||||
|
- **No placeholders:** every code step shows complete code; the command doc is written in
|
||||||
|
full.
|
||||||
|
- **Type consistency:** the signal dict keys (`tag, first_seen, age_days,
|
||||||
|
recurrence_count, referenced_paths, still_exists, text`) are identical across Tasks 2–4
|
||||||
|
and the command doc; `nudge_line` reads `age_days`/`recurrence_count` only.
|
||||||
146
docs/superpowers/plans/2026-06-14-m4a-docker-caddy.md
Normal file
|
|
@ -0,0 +1,146 @@
|
||||||
|
# M4a — Docker + Caddy reverse proxy (platform) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (`- [ ]`) syntax.
|
||||||
|
|
||||||
|
**Goal:** Stand up the platform NetBird needs — Docker on askari + boma's standard Caddy reverse proxy with Gandi DNS-01 wildcard certs — proven end-to-end by serving a test route over TLS.
|
||||||
|
|
||||||
|
**Architecture:** `docker_host` installs Docker engine + compose (pinned). A custom Caddy image (`xcaddy` + `caddy-dns/gandi`) gives DNS-01 via `vault.gandi.pat`. The `reverse_proxy` role renders a Caddyfile from `reverse_proxy__routes` data + an `.env`. The M2 Hetzner firewall opens 80/443; `public_dns` publishes `*.askari.wingu.me`. M4b adds NetBird as a route.
|
||||||
|
|
||||||
|
**Tech Stack:** Docker CE, Caddy (custom xcaddy build), ACME DNS-01 (Gandi), Ansible, Terraform (hcloud firewall).
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-14-netbird-coordinator-m4-design.md`
|
||||||
|
|
||||||
|
**Execution context:** Tasks author here; **Task 7 applies live to askari + issues a real cert** (gated). The custom image builds with Docker (available).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: ADR — boma's reverse proxy is Caddy
|
||||||
|
|
||||||
|
- [ ] **Step 1:** Create `docs/decisions/024-reverse-proxy.md` following ADR-023's
|
||||||
|
structure (Status: Accepted; Context; Decision; Consequences; Related). Decision:
|
||||||
|
**Caddy** is boma's reverse proxy (rationale from the M4 spec Decision 1: Ansible-rendered
|
||||||
|
config fits Caddy not Traefik's discovery; automatic HTTPS + Gandi DNS-01; simpler at
|
||||||
|
this scale; `forward_auth` to Authentik preserved). Note it amends the soft Traefik
|
||||||
|
assumption in the roadmap/ADR-017 prose (no prior ADR pinned Traefik).
|
||||||
|
- [ ] **Step 2:** Add the ADR-024 row to `CLAUDE.md`'s Further-reading table and update
|
||||||
|
the roadmap Phase-2 "auth + reverse proxy" line (Authentik + **Caddy**, not Traefik).
|
||||||
|
- [ ] **Step 3:** `make lint`; commit `docs(adr): ADR-024 — Caddy is boma's reverse proxy`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: `docker_host` — install Docker engine
|
||||||
|
|
||||||
|
**Files:** `roles/docker_host/{defaults,tasks}/main.yml`, `roles/docker_host/README.md`.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** `defaults/main.yml` — `docker_host__compose_version`-style pins (use the
|
||||||
|
Docker apt repo; pin via apt or accept repo latest with a comment). Variables:
|
||||||
|
`docker_host__packages: [docker-ce, docker-ce-cli, containerd.io, docker-compose-plugin]`.
|
||||||
|
- [ ] **Step 2:** `tasks/main.yml` — add the Docker apt repo + GPG key (prefer `ansible.builtin.deb822_repository`, which handles the key and the source entry together; `ansible.builtin.apt_key` relies on the deprecated `apt-key` tool),
`apt` install `docker_host__packages`, enable+start `docker`. (Tag: role-name; concern `packages`.)
|
||||||
|
- [ ] **Step 3:** Fill `README.md` (purpose, vars). `make lint`.
|
||||||
|
- [ ] **Step 4:** Molecule: converge installs Docker; verify `docker --version` + service active. (`make test ROLE=docker_host`; build the image if needed.)
|
||||||
|
- [ ] **Step 5:** Commit `feat(docker_host): install Docker engine + compose plugin`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Custom Caddy image (xcaddy + caddy-dns/gandi)
|
||||||
|
|
||||||
|
**Files:** `.docker/caddy-gandi/Dockerfile`, `Makefile` (a `caddy-image` target).
|
||||||
|
|
||||||
|
- [ ] **Step 1:** `.docker/caddy-gandi/Dockerfile` (verify the latest stable Caddy + plugin tags per ADR-014):
|
||||||
|
```dockerfile
|
||||||
|
FROM caddy:2-builder AS build
|
||||||
|
RUN xcaddy build --with github.com/caddy-dns/gandi
|
||||||
|
|
||||||
|
FROM caddy:2
|
||||||
|
COPY --from=build /usr/bin/caddy /usr/bin/caddy
|
||||||
|
```
|
||||||
|
- [ ] **Step 2:** `Makefile` — add `caddy-image` (build, tagged for the Forgejo registry like the Molecule image) + `caddy-image-push`. Add to `.PHONY` + help.
|
||||||
|
- [ ] **Step 3:** Build it: `make caddy-image`; verify `docker run --rm <img> caddy list-modules | grep dns.providers.gandi`. Expected: the module is listed.
|
||||||
|
- [ ] **Step 4:** Commit `feat(docker): custom Caddy image with the Gandi DNS-01 plugin`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: `reverse_proxy` role (Caddy)
|
||||||
|
|
||||||
|
**Files:** create `roles/reverse_proxy/{defaults,tasks}/main.yml`, `templates/{docker-compose.yml.j2,Caddyfile.j2,env.j2}`, `README.md`; `inventories/production/group_vars/all/reverse_proxy.yml`.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** `group_vars/all/reverse_proxy.yml` — route data:
|
||||||
|
```yaml
|
||||||
|
reverse_proxy__image: "<forgejo-registry>/sjat/caddy-gandi:latest"
|
||||||
|
reverse_proxy__base_dir: /opt/services/reverse_proxy
|
||||||
|
reverse_proxy__acme_domain: askari.wingu.me # wildcard *.askari.wingu.me
|
||||||
|
reverse_proxy__routes: [] # M4b appends: {host: netbird.askari.wingu.me, upstream: "netbird-dashboard:80"}
|
||||||
|
```
|
||||||
|
- [ ] **Step 2:** `templates/Caddyfile.j2` — global TLS via Gandi DNS-01 + a per-route block:
|
||||||
|
```
{
    email admin@wingu.me
}
*.{{ reverse_proxy__acme_domain }} {
    tls {
        dns gandi {env.GANDI_BEARER_TOKEN}
    }
{% for r in reverse_proxy__routes %}
    @{{ r.host | replace('.', '_') }} host {{ r.host }}
    handle @{{ r.host | replace('.', '_') }} {
        reverse_proxy {{ r.upstream }}
    }
{% endfor %}
    handle {
        respond "boma reverse proxy" 200
    }
}
```
|
||||||
|
- [ ] **Step 3:** `templates/env.j2` — `GANDI_BEARER_TOKEN={{ vault.gandi.pat }}`.
|
||||||
|
- [ ] **Step 4:** `templates/docker-compose.yml.j2` — the Caddy service (image `reverse_proxy__image`, ports 80:80 + 443:443, env_file, volumes for the Caddyfile + cert data, restart unless-stopped).
|
||||||
|
- [ ] **Step 5:** `tasks/main.yml` — ADR-004 deploy mechanics: ensure `base_dir`, render compose+Caddyfile+env, `community.docker.docker_compose_v2` up. (Adds `community.docker` to `requirements.yml` with the on-demand comment.)
|
||||||
|
- [ ] **Step 6:** `README.md`; `make lint`.
|
||||||
|
- [ ] **Step 7:** Molecule (render-only): converge renders the files (compose `apply:false`-style or skip the up in container); verify `caddy validate --config Caddyfile` passes. Commit `feat(reverse_proxy): Caddy role (Gandi DNS-01, route catalog)`.
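
To make the per-route loop in `Caddyfile.j2` (Step 2) concrete, here is a small sketch of the text it would produce for one route. It imitates the template with plain string formatting rather than running Jinja2, so it illustrates the expected output shape and is not the role's renderer. `render_site_block` is a hypothetical helper, and the sample route is the one the `reverse_proxy__routes` comment in Step 1 says M4b appends.

```python
def render_site_block(acme_domain, routes):
    """Imitate the wildcard site block that Caddyfile.j2 renders (sketch only)."""
    lines = [
        f"*.{acme_domain} {{",
        "    tls {",
        "        dns gandi {env.GANDI_BEARER_TOKEN}",
        "    }",
    ]
    for r in routes:
        # Same idea as the template's replace('.', '_') filter: a named matcher per host.
        matcher = "@" + r["host"].replace(".", "_")
        lines += [
            f"    {matcher} host {r['host']}",
            f"    handle {matcher} {{",
            f"        reverse_proxy {r['upstream']}",
            "    }",
        ]
    # Fallback for hosts under the wildcard that match no route.
    lines += ["    handle {", '        respond "boma reverse proxy" 200', "    }", "}"]
    return "\n".join(lines)


routes = [{"host": "netbird.askari.wingu.me", "upstream": "netbird-dashboard:80"}]
print(render_site_block("askari.wingu.me", routes))
```

With an empty route list only the `tls` block and the fallback `handle` remain, which is the state Task 7 tests against `test.askari.wingu.me`.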
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Open the firewall (TF) + DNS
|
||||||
|
|
||||||
|
- [ ] **Step 1:** In `terraform/modules/hetzner_vm/main.tf`, add the public ports to the firewall: inbound **80/tcp** + **443/tcp** from `0.0.0.0/0` (Caddy), plus **3478/udp** (NetBird; M4b uses it). Keep the rules variable-driven so other hosts can differ: gate them behind a `var.public_web` bool defaulting to `false`, and set it to `true` for askari in `environments/offsite/main.tf`. Run `terraform fmt`.
|
||||||
|
- [ ] **Step 2:** `make tf-plan TF_ENV=offsite` (review: firewall adds 80/443[/3478]) → **gated** `make tf-apply TF_ENV=offsite`.
|
||||||
|
- [ ] **Step 3:** Add `*.askari.wingu.me` A → askari's IP to `public_dns__records` (`group_vars/all/public_dns.yml`); `make deploy PLAYBOOK=dns`; `dig +short test.askari.wingu.me` → askari IP.
|
||||||
|
- [ ] **Step 4:** Commit the TF + DNS changes.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: Playbook wiring
|
||||||
|
|
||||||
|
- [ ] **Step 1:** Create `playbooks/offsite.yml` targeting `offsite_hosts`: roles `docker_host` then `reverse_proxy` (each with its role-name tag). `make lint` (check-tags verifies the role-name tags).
|
||||||
|
- [ ] **Step 2:** Commit `feat(offsite): playbook applying docker_host + reverse_proxy to askari`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 7: Apply to askari + prove TLS (gated, live)
|
||||||
|
|
||||||
|
> Live on askari. Issues a **real cert** via DNS-01. `rbw` unlocked.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** `make check PLAYBOOK=offsite LIMIT=askari` — review.
|
||||||
|
- [ ] **Step 2:** `make deploy PLAYBOOK=offsite LIMIT=askari` — Docker installs, Caddy comes up.
|
||||||
|
- [ ] **Step 3:** Prove it (from ubongo): `curl -sSI https://test.askari.wingu.me` → `HTTP/2 200` with a **valid Let's Encrypt cert** (the wildcard `*.askari.wingu.me` issued via Gandi DNS-01). `curl -s https://test.askari.wingu.me` → `boma reverse proxy`.
|
||||||
|
- [ ] **Step 4:** `.venv/bin/ansible offsite_hosts -b -m command -a 'docker compose -f /opt/services/reverse_proxy/docker-compose.yml ps'` → Caddy healthy.
|
||||||
|
- [ ] **Step 5:** No repo commit (host state).
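
Step 3 relies on a single wildcard certificate covering every service host. A wildcard stands for exactly one leftmost label, so `*.askari.wingu.me` covers `test.askari.wingu.me` and `netbird.askari.wingu.me` but neither the bare `askari.wingu.me` nor a deeper name. A minimal sketch of that matching rule in plain Python; `wildcard_covers` is a hypothetical helper for illustration, not how curl or Caddy implement the check.

```python
def wildcard_covers(pattern, hostname):
    """Return True if a certificate name (optionally '*.'-prefixed) covers hostname."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if not pattern.startswith("*."):
        return pattern == hostname
    suffix = pattern[1:]  # e.g. ".askari.wingu.me"
    if not hostname.endswith(suffix):
        return False
    leftmost = hostname[: -len(suffix)]
    # The wildcard stands for exactly one non-empty label (no dots).
    return bool(leftmost) and "." not in leftmost


for name in ("test.askari.wingu.me", "askari.wingu.me", "a.b.askari.wingu.me"):
    print(name, wildcard_covers("*.askari.wingu.me", name))
```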
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 8: Docs
|
||||||
|
|
||||||
|
- [ ] **Step 1:** STATUS.md — Docker on askari + the `reverse_proxy` (Caddy) role built + applied; `*.askari.wingu.me` cert live. ROADMAP M4 — note M4a done, M4b (NetBird) next.
|
||||||
|
- [ ] **Step 2:** `make lint`; commit.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-Review (completed)
|
||||||
|
|
||||||
|
- **Spec coverage:** Caddy-as-standard ADR (Decision 1) → Task 1; docker_host (Decision 4) →
|
||||||
|
Task 2; custom Caddy image + DNS-01 (Decision 2) → Task 3; reverse_proxy role + route
|
||||||
|
catalog (Decision 4) → Task 4; firewall 80/443/3478 (Decision 5) → Task 5; DNS (Decision 6)
|
||||||
|
→ Task 5; live cert proof (testing) → Task 7. NetBird itself (Decisions 3,7,8) → **M4b**, correct.
|
||||||
|
- **Placeholder scan:** `<forgejo-registry>` is the known registry host (`forgejo.nyumbani.baobab.band/...`) — fill from the Molecule image var; not a logic gap. Version pins (Caddy, Docker, plugin) are flagged ADR-014 verifications, done in their tasks.
|
||||||
|
- **Name consistency:** `reverse_proxy__*`, `vault.gandi.pat`→`GANDI_BEARER_TOKEN`, `*.askari.wingu.me` used consistently across role, templates, firewall, and DNS.
|
||||||
|
- **Risk:** the custom image + DNS-01 is the novel bit — Task 3 verifies the module loads and Task 7 proves a real cert issues before M4b depends on it.
|
||||||
91
docs/superpowers/plans/2026-06-14-m4b-netbird.md
Normal file
|
|
@ -0,0 +1,91 @@
|
||||||
|
# M4b — NetBird coordinator (service role) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use `- [ ]` checkboxes.
|
||||||
|
|
||||||
|
**Goal:** Deploy the self-hosted NetBird control plane on askari as boma's first real service role (`netbird_coordinator`), fronted by the M4a Caddy, reachable at `https://netbird.askari.wingu.me` with the embedded Dex login.
|
||||||
|
|
||||||
|
**Architecture:** NetBird's own `configure.sh` generates the canonical compose + config for a pinned version; boma **captures that reference once and translates it into role templates** (ADR-004/013 — don't run their imperative script in production, render from templates). Runs in **external-reverse-proxy mode** (no bundled Traefik); Caddy adds a `netbird.askari.wingu.me` route. Secrets (datastore encryption key, TURN password, Dex secrets) are generated into vault; the setup key is stubbed `CHANGEME` for M5.
|
||||||
|
|
||||||
|
**Tech Stack:** NetBird (combined `netbird-server` container if stable for the pinned version, else the multi-container set), embedded Dex IdP, Coturn, Docker Compose, Caddy (M4a), Ansible.
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-14-netbird-coordinator-m4-design.md` · **Prereq:** M4a (Docker + Caddy) ✓ on askari.
|
||||||
|
|
||||||
|
**Execution context:** Task 1 runs `configure.sh` in a scratch dir (capture only). Tasks 2–6 author. **Task 7 deploys live to askari** (gated). NetBird self-hosting is finicky — expect live debugging.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Capture NetBird's reference setup (pin the version)
|
||||||
|
|
||||||
|
- [ ] **Step 1:** Pick + pin the NetBird version (ADR-014 — check the latest stable release). Record it.
|
||||||
|
- [ ] **Step 2:** In a scratch dir (on ubongo, throwaway), fetch NetBird's `getting-started`/`configure.sh` for that version and run it with answers for: domain `netbird.askari.wingu.me`, **external reverse proxy** (disable bundled Traefik/Caddy), **embedded Dex** (no external SSO), Let's Encrypt off (Caddy terminates TLS).
|
||||||
|
- [ ] **Step 3:** Capture the generated files verbatim into the plan/notes: `docker-compose.yml`, `management.json` (or `config.yaml`), `turnserver.conf`, `openid-configuration.json`, dashboard env. Also capture NetBird's **Caddy external-proxy template** (their docs ship one) — it shows the exact upstreams + HTTP/2/gRPC routing the dashboard/management/signal/relay need.
|
||||||
|
- [ ] **Step 4:** No commit (reference capture; informs Tasks 2–4).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: `netbird_coordinator` service role — templates
|
||||||
|
|
||||||
|
**Files:** `roles/netbird_coordinator/` (scaffold via `make new-role NAME=netbird_coordinator`): `defaults/main.yml`, `tasks/main.yml`, `templates/{docker-compose.yml,management.json,turnserver.conf,openid-configuration.json,dashboard.env}.j2`, `handlers/main.yml`, `README.md`.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** Translate the captured compose into `templates/docker-compose.yml.j2` — containers, the shared `boma` Docker network (so Caddy reaches them by name), **no host port mappings except what Caddy/Coturn need** (Coturn 3478/udp; everything else internal, Caddy fronts it). Pin image tags (ADR-011).
|
||||||
|
- [ ] **Step 2:** Translate `management.json`/`config.yaml` into a template — fill `Datadir`, `DataStoreEncryptionKey` (`{{ vault.netbird.datastore_key }}`), `HttpConfig` (public URL `https://netbird.askari.wingu.me`), `TURNConfig` (coturn host + `{{ vault.netbird.turn_password }}`), `Signal`, `Relay`, `Store` (sqlite), and the embedded-Dex IdP block (DeviceAuthorizationFlow/PKCE, `openid-configuration.json` URL).
|
||||||
|
- [ ] **Step 3:** `turnserver.conf.j2` (realm = `netbird.askari.wingu.me`, the TURN secret), `openid-configuration.json.j2`, `dashboard.env.j2` (`NETBIRD_MGMT_API_ENDPOINT=https://netbird.askari.wingu.me`, the `AUTH_*` Dex values).
|
||||||
|
- [ ] **Step 4:** `defaults/main.yml` (`netbird__*` knobs: version, base_dir `/opt/services/netbird`, domain) + `tasks/main.yml` (ADR-004 deploy mechanics: ensure dir, render all files, `community.docker.docker_compose_v2` up; `netbird__manage` toggle for Molecule).
|
||||||
|
- [ ] **Step 5:** `make lint`; commit `feat(netbird): coordinator service role (compose + config templates)`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Secrets (CHANGEME convention + generated)
|
||||||
|
|
||||||
|
- [ ] **Step 1:** Add to vault (`make edit-vault`): `vault.netbird.datastore_key`, `vault.netbird.turn_password`, any Dex client secret — **generate** strong values (or stub `CHANGEME` + a comment if operator-supplied). Add `vault.netbird.setup_key: CHANGEME` with a comment "created in the NetBird dashboard after first boot — M5 enrolment".
|
||||||
|
- [ ] **Step 2:** `make check-vault` confirms structure + lists the `setup_key` placeholder.
|
||||||
|
- [ ] **Step 3:** Commit the vault.
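
For the generated values in Step 1, Python's standard `secrets` module is one way to produce them before pasting into `make edit-vault`. This is a sketch under assumptions: the 32-byte base64 form for the datastore key and the URL-safe token for the TURN password are illustrative choices, and the helper names are hypothetical. Confirm the exact format the pinned NetBird version expects against the reference captured in Task 1.

```python
import base64
import secrets


def random_key_b64(num_bytes=32):
    """Base64-encode num_bytes of cryptographically secure random data."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def random_password(num_bytes=32):
    """URL-safe random string, convenient for config files without escaping."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Print once and paste into the vault; avoid leaving the output in logs.
    print("datastore_key:", random_key_b64())
    print("turn_password:", random_password())
```

The `setup_key` stays `CHANGEME` regardless: it is issued by the NetBird dashboard after first boot, so there is nothing to generate locally.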
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Wire Caddy + DNS
|
||||||
|
|
||||||
|
- [ ] **Step 1:** Append to `reverse_proxy__routes` (`group_vars/all/reverse_proxy.yml`): `{host: netbird.askari.wingu.me, upstream: "<netbird container:port>"}` — per the captured Caddy template (NetBird needs HTTP/2 + gRPC; add the required Caddy directives, e.g. separate handles for the management gRPC path if the template shows them).
|
||||||
|
- [ ] **Step 2:** `netbird.askari.wingu.me` already resolves via the `*.askari.wingu.me` wildcard (M4a) — no new DNS record.
|
||||||
|
- [ ] **Step 3:** Commit.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Service-role standard files (ADR-004, authored)
|
||||||
|
|
||||||
|
- [ ] **Step 1:** Author `roles/netbird_coordinator/SECURITY.md` (copy `docs/security/service-security-template.md`; record the public surface = Caddy 443 + Coturn 3478, embedded-Dex auth, accepted-risk R3).
|
||||||
|
- [ ] **Step 2:** `VERIFY.md` (copy the template; the `/verify-service` UI spec — run later when the playwright harness exists).
|
||||||
|
- [ ] **Step 3:** `ACCESS.md` (ADR-021; the dashboard/admin access + `access__*` intent).
|
||||||
|
- [ ] **Step 4:** `BACKUP.md` (ADR-022; the **datastore is stateful** → `backup__*` data; record that off-site backup is **pending `fisi`** — an accepted risk for now).
|
||||||
|
- [ ] **Step 5:** `make lint`; commit `docs(netbird): service-role standard files (SECURITY/VERIFY/ACCESS/BACKUP)`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: Add netbird to the offsite playbook
|
||||||
|
|
||||||
|
- [ ] **Step 1:** In `playbooks/offsite.yml`, add `netbird_coordinator` after `reverse_proxy` (role-name tag). `make lint`. Commit.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 7: Deploy to askari + verify (gated, live — expect debugging)
|
||||||
|
|
||||||
|
> NetBird self-hosting is finicky; budget for iterating on the management config + Caddy routing.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** `make check PLAYBOOK=offsite LIMIT=askari TAGS=netbird` — review.
|
||||||
|
- [ ] **Step 2:** `make deploy PLAYBOOK=offsite LIMIT=askari TAGS=netbird` → `make deploy ... TAGS=reverse_proxy` (Caddy reloads with the netbird route).
|
||||||
|
- [ ] **Step 3:** Verify: `docker compose ps` all healthy; `curl -sI https://netbird.askari.wingu.me` → 200 with the M4a cert; the **dashboard loads** in a browser; the management API responds. Iterate on config/routing until green.
|
||||||
|
- [ ] **Step 4:** No repo commit (host state).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 8: Docs
|
||||||
|
|
||||||
|
- [ ] **Step 1:** STATUS — `netbird_coordinator` built + applied (dashboard live); the first service role. ROADMAP M4b done; **M5 (enrol) next**. `make lint`; commit.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-Review (completed)
|
||||||
|
|
||||||
|
- **Spec coverage:** external-proxy NetBird + embedded Dex (Decision 3) → Tasks 1,2,4; first service role + standard files (Decision 7) → Tasks 2,5; firewall 3478 (Decision 5) → done in M4a; setup key M5 + CHANGEME (Decision 8) → Task 3; Caddy front (M4a) → Task 4. Enrolment → M5, correct.
|
||||||
|
- **Placeholder scan:** the concrete config field *values* are intentionally captured from `configure.sh` (Task 1) rather than invented — version-sensitive, and inventing them would be wrong. The plan pins the method, not guesses.
|
||||||
|
- **Risk:** NetBird's external-proxy + gRPC routing is the hard part — Task 1 captures NetBird's own Caddy template to get it right, and Task 7 budgets for live iteration.
|
||||||
551
docs/superpowers/plans/2026-06-14-public-dns-m1.md
Normal file
|
|
@ -0,0 +1,551 @@
|
||||||
|
# Public DNS (M1) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Build the `public_dns` role that manages `wingu.me`'s records at Gandi LiveDNS as code, purging Gandi's seeded defaults and applying boma's anti-spoof baseline.
|
||||||
|
|
||||||
|
**Architecture:** A control-node role drives `community.general.gandi_livedns` over declarative record lists in `group_vars/all/public_dns.yml` (mirroring the firewall-catalog pattern). Records to keep are `state: present`; Gandi's auto-seeded defaults are `state: absent`. A `public_dns__apply` toggle lets Molecule converge without calling the API; a pytest validates the data shape; the live run happens via `make check`/`deploy PLAYBOOK=dns` on ubongo.
|
||||||
|
|
||||||
|
**Tech Stack:** Ansible (`community.general.gandi_livedns`, PAT auth), pytest, Gandi LiveDNS API. Secrets from `vault.gandi.pat`.
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-11-public-dns-gandi-migration-design.md`
|
||||||
|
|
||||||
|
**Execution context:** Tasks 1–6 + 8 are authoring (any machine with the venv). **Task 7 runs on ubongo** (has the vault + Gandi egress) and is the only one that touches live Gandi.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File Structure
|
||||||
|
|
||||||
|
- `requirements.yml` (modify) — add `community.general` (≥9.0.0) for `gandi_livedns`.
|
||||||
|
- `roles/public_dns/` (create) — `defaults/main.yml`, `tasks/main.yml`, `meta/main.yml`, `README.md`, `molecule/default/`.
|
||||||
|
- `inventories/production/group_vars/all/public_dns.yml` (create) — `public_dns__domain` + `public_dns__records` (present) + `public_dns__absent` (Gandi defaults).
|
||||||
|
- `playbooks/dns.yml` (create) — control-node play running the role.
|
||||||
|
- `tests/test_public_dns.py` (create) — pytest over the record data.
|
||||||
|
- `docs/decisions/007-network.md`, `STATUS.md`, `docs/TODO.md`, `docs/CAPABILITIES.md` (modify) — doc reconciliation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Add the `community.general` collection
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `requirements.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the collection with the on-demand comment**
|
||||||
|
|
||||||
|
In `requirements.yml`, under `collections:`, append:
|
||||||
|
|
||||||
|
```yaml
  # community.general: gandi_livedns (public_dns role manages wingu.me at Gandi
  # LiveDNS). PAT auth requires >= 9.0.0.
  - name: community.general
    version: ">=9.0.0"
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Install it**
|
||||||
|
|
||||||
|
Run: `make collections`
|
||||||
|
Expected: installs `community.general` (≥9.0.0) with no errors.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify the module is available**
|
||||||
|
|
||||||
|
Run: `.venv/bin/ansible-doc community.general.gandi_livedns | head -5`
|
||||||
|
Expected: prints the module doc header (confirms the module resolves), mentioning `personal_access_token`.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add requirements.yml
|
||||||
|
git commit -m "deps: add community.general for gandi_livedns (public_dns)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Scaffold the role
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `roles/public_dns/` (via the scaffolder)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Scaffold**
|
||||||
|
|
||||||
|
Run: `make new-role NAME=public_dns`
|
||||||
|
Expected: `Role public_dns scaffolded at roles/public_dns/` (creates `tasks/`, `handlers/`, `defaults/`, `meta/`, `templates/`, `files/`, `molecule/default/`, `README.md`).
|
||||||
|
|
||||||
|
- [ ] **Step 2: Commit the scaffold**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add roles/public_dns
|
||||||
|
git commit -m "scaffold(public_dns): empty role structure"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Record data + validation test (TDD)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Test: `tests/test_public_dns.py`
|
||||||
|
- Create: `inventories/production/group_vars/all/public_dns.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing test**
|
||||||
|
|
||||||
|
Create `tests/test_public_dns.py`:
|
||||||
|
|
||||||
|
```python
import pathlib

import yaml

_DATA = (
    pathlib.Path(__file__).resolve().parent.parent
    / "inventories" / "production" / "group_vars" / "all" / "public_dns.yml"
)

# Gandi auto-seeds these on a fresh .me zone; boma purges them (verified 2026-06-14).
GANDI_DEFAULTS_ABSENT = {
    ("@", "A"), ("www", "CNAME"), ("webmail", "CNAME"),
    ("gm1._domainkey", "CNAME"), ("gm2._domainkey", "CNAME"), ("gm3._domainkey", "CNAME"),
    ("_imap._tcp", "SRV"), ("_imaps._tcp", "SRV"), ("_pop3._tcp", "SRV"),
    ("_pop3s._tcp", "SRV"), ("_submission._tcp", "SRV"),
}


def _load():
    return yaml.safe_load(_DATA.read_text())


def test_domain_is_wingu():
    assert _load()["public_dns__domain"] == "wingu.me"


def test_present_records_well_formed():
    for r in _load()["public_dns__records"]:
        assert r["record"] and r["type"]
        assert isinstance(r["values"], list) and r["values"]


def test_anti_spoof_baseline_present():
    recs = {(r["record"], r["type"]): r["values"] for r in _load()["public_dns__records"]}
    assert recs[("@", "MX")] == ["0 ."]  # null MX
    assert recs[("@", "TXT")] == ['"v=spf1 -all"']  # SPF deny-all
    assert recs[("_dmarc", "TXT")] == ['"v=DMARC1; p=reject;"']


def test_gandi_defaults_marked_absent():
    absent = {(r["record"], r["type"]) for r in _load()["public_dns__absent"]}
    assert GANDI_DEFAULTS_ABSENT <= absent


def test_no_record_both_present_and_absent():
    present = {(r["record"], r["type"]) for r in _load()["public_dns__records"]}
    absent = {(r["record"], r["type"]) for r in _load()["public_dns__absent"]}
    assert present.isdisjoint(absent)


def test_no_duplicate_present_records():
    keys = [(r["record"], r["type"]) for r in _load()["public_dns__records"]]
    assert len(keys) == len(set(keys))
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run it to verify it fails**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_public_dns.py -v`
|
||||||
|
Expected: FAIL (the data file does not exist yet — `FileNotFoundError`).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Create the record data**
|
||||||
|
|
||||||
|
Create `inventories/production/group_vars/all/public_dns.yml`:
|
||||||
|
|
||||||
|
```yaml
---
# Public DNS: wingu.me at Gandi LiveDNS, managed by the public_dns role (M1).
# Mesh/LAN-only by default: only deliberate public records live here. PAT in
# vault.gandi.pat. See docs/decisions/007-network.md and the M1 spec.
public_dns__domain: wingu.me

# Present: anti-spoof baseline for a no-mail domain (overwrites Gandi's seeded mail set).
public_dns__records:
  - { record: "@", type: MX, values: ["0 ."], ttl: 3600 }
  - { record: "@", type: TXT, values: ['"v=spf1 -all"'], ttl: 3600 }
  - { record: _dmarc, type: TXT, values: ['"v=DMARC1; p=reject;"'], ttl: 3600 }
  # Service records appear as public-tier needs arise (askari A in M4).
  # Mesh/LAN-only services never appear here.

# Absent: Gandi's auto-seeded defaults we don't want (purged once, idempotent thereafter).
public_dns__absent:
  - { record: "@", type: A }                  # Gandi parking IP
  - { record: www, type: CNAME }              # Gandi web-redirect
  - { record: webmail, type: CNAME }          # Gandi webmail
  - { record: gm1._domainkey, type: CNAME }   # Gandi DKIM
  - { record: gm2._domainkey, type: CNAME }
  - { record: gm3._domainkey, type: CNAME }
  - { record: _imap._tcp, type: SRV }         # Gandi mail autodiscovery
  - { record: _imaps._tcp, type: SRV }
  - { record: _pop3._tcp, type: SRV }
  - { record: _pop3s._tcp, type: SRV }
  - { record: _submission._tcp, type: SRV }
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run the test to verify it passes**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_public_dns.py -v`
|
||||||
|
Expected: PASS (6 passed).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add tests/test_public_dns.py inventories/production/group_vars/all/public_dns.yml
|
||||||
|
git commit -m "feat(public_dns): wingu.me record data + validation test"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Role implementation (defaults, tasks, meta, README)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `roles/public_dns/defaults/main.yml`
|
||||||
|
- Modify: `roles/public_dns/tasks/main.yml`
|
||||||
|
- Modify: `roles/public_dns/meta/main.yml`
|
||||||
|
- Modify: `roles/public_dns/README.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write `defaults/main.yml`**
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
---
|
||||||
|
# public_dns — manage the public zone at Gandi LiveDNS as code (M1).
|
||||||
|
# Record data (public_dns__domain / __records / __absent) lives in group_vars/all.
|
||||||
|
# See docs/decisions/007-network.md.
|
||||||
|
public_dns__apply: true # set false to validate without calling the Gandi API (Molecule)
|
||||||
|
public_dns__default_ttl: 1800 # TTL when a record omits one
|
||||||
|
public_dns__domain: "" # overridden in group_vars/all
|
||||||
|
public_dns__records: [] # present records
|
||||||
|
public_dns__absent: [] # records to remove
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Write `tasks/main.yml`**
|
||||||
|
|
||||||
|
```yaml
---
- name: Assert public DNS data is sane
  ansible.builtin.assert:
    that:
      - public_dns__domain | length > 0
      - public_dns__records | selectattr('type', 'equalto', 'MX') | list | length > 0
    fail_msg: >-
      public_dns__domain must be set and a null-MX anti-spoof record declared in
      public_dns__records (group_vars/all/public_dns.yml).
  run_once: true

- name: Ensure desired records are present (Gandi LiveDNS)
  community.general.gandi_livedns:
    domain: "{{ public_dns__domain }}"
    record: "{{ item.record }}"
    type: "{{ item.type }}"
    # Bracket access on purpose: in Jinja2, `item.values` resolves to the dict method.
    values: "{{ item['values'] }}"
    ttl: "{{ item.ttl | default(public_dns__default_ttl) }}"
    state: present
    personal_access_token: "{{ vault.gandi.pat }}"
  loop: "{{ public_dns__records }}"
  loop_control:
    label: "{{ item.record }} {{ item.type }}"
  run_once: true
  when: public_dns__apply | bool

- name: Ensure unwanted records are absent (Gandi LiveDNS)
  community.general.gandi_livedns:
    domain: "{{ public_dns__domain }}"
    record: "{{ item.record }}"
    type: "{{ item.type }}"
    state: absent
    personal_access_token: "{{ vault.gandi.pat }}"
  loop: "{{ public_dns__absent }}"
  loop_control:
    label: "{{ item.record }} {{ item.type }}"
  run_once: true
  when: public_dns__apply | bool
```
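One detail worth checking when wiring the record loop: Jinja2 resolves dotted access by trying the Python attribute first and the mapping key second, so on a plain dict `item.values` can resolve to the built-in `dict.values` method instead of the `values` key. Bracket access (`item['values']`) looks the key up first. The sketch below only imitates that lookup order in plain Python; `jinja_like_dot` and `jinja_like_bracket` are hypothetical helpers written for illustration, not Jinja2 or Ansible APIs.

```python
def jinja_like_dot(obj, name):
    """Imitate Jinja2's `obj.name`: attribute lookup first, then item lookup."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


def jinja_like_bracket(obj, name):
    """Imitate Jinja2's `obj['name']`: item lookup first, then attribute lookup."""
    try:
        return obj[name]
    except (KeyError, TypeError):
        return getattr(obj, name)


# One entry shaped like the items in public_dns__records.
record = {"record": "@", "type": "MX", "values": ["0 ."], "ttl": 3600}

dotted = jinja_like_dot(record, "values")
bracketed = jinja_like_bracket(record, "values")

print(callable(dotted))  # the bound dict method, not the record data
print(bracketed)         # the list stored under the "values" key
```

Keys such as `record`, `type` and `ttl` do not collide with dict attributes, so dotted access is unambiguous for them; `values`, `keys` and `items` are the names to watch.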
|
||||||
|
|
||||||
|
- [ ] **Step 3: Write `meta/main.yml`**
|
||||||
|
|
||||||
|
```yaml
---
galaxy_info:
  author: sjat
  description: Manage boma's public DNS zone (wingu.me) at Gandi LiveDNS as code.
  license: MIT
  min_ansible_version: "2.17"
  platforms:
    - name: Debian
      versions:
        - trixie
dependencies: []
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Write `README.md`**
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# public_dns
|
||||||
|
|
||||||
|
Manages boma's public DNS zone (**wingu.me**) at **Gandi LiveDNS** as code, via
|
||||||
|
`community.general.gandi_livedns` (PAT auth from `vault.gandi.pat`). Provider-agnostic
|
||||||
|
name on purpose. Run from the control node: `make check PLAYBOOK=dns`, then `make deploy PLAYBOOK=dns`.
|
||||||
|
|
||||||
|
Mesh/LAN-only by default — only deliberate public records live in the zone (the
|
||||||
|
anti-spoof baseline now; `askari` in M4). Everything else is reached over LAN/mesh and
|
||||||
|
never appears here.
|
||||||
|
|
||||||
|
## Data (in `group_vars/all/public_dns.yml`)
|
||||||
|
|
||||||
|
| Var | Meaning |
|
||||||
|
|---|---|
|
||||||
|
| `public_dns__domain` | the zone (`wingu.me`) |
|
||||||
|
| `public_dns__records` | records to ensure **present** (`record`, `type`, `values`, optional `ttl`) |
|
||||||
|
| `public_dns__absent` | records to ensure **absent** (Gandi's auto-seeded defaults) |
|
||||||
|
|
||||||
|
## Behaviour knobs (`defaults/main.yml`)
|
||||||
|
|
||||||
|
| Var | Default | Meaning |
|
||||||
|
|---|---|---|
|
||||||
|
| `public_dns__apply` | `true` | set `false` to validate without calling the Gandi API (Molecule) |
|
||||||
|
| `public_dns__default_ttl` | `1800` | TTL when a record omits one |
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
The zone is reconciled **additively** plus an explicit `absent` list (Gandi seeds 13
|
||||||
|
default records on a new `.me`; we purge the unwanted 11 and overwrite MX/SPF with the
|
||||||
|
anti-spoof baseline). Full-zone authoritative pruning is a future enhancement (TODO 8.3).
|
||||||
|
```
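The additive reconciliation described in the README notes can be pictured with a small model: listed records are set, listed removals are dropped, and anything not mentioned is left alone. This is a minimal sketch under stated assumptions, not how `gandi_livedns` is implemented; the `reconcile` function, the starting zone and its addresses are all hypothetical.

```python
def reconcile(zone, present, absent):
    """Return the zone after an additive apply plus explicit removals.

    zone:    dict mapping (record, type) -> list of values (current state)
    present: list of dicts with record/type/values to set
    absent:  list of dicts with record/type to remove
    """
    result = dict(zone)
    for rec in present:
        result[(rec["record"], rec["type"])] = list(rec["values"])
    for rec in absent:
        # Removing a record that is already gone is a no-op (idempotent).
        result.pop((rec["record"], rec["type"]), None)
    return result


# Hypothetical starting zone: two seeded defaults plus one hand-made record.
zone = {
    ("@", "A"): ["192.0.2.1"],
    ("www", "CNAME"): ["parking.example."],
    ("manual", "A"): ["198.51.100.7"],
}
present = [{"record": "@", "type": "MX", "values": ["0 ."]}]
absent = [{"record": "@", "type": "A"}, {"record": "www", "type": "CNAME"}]

after = reconcile(zone, present, absent)
print(sorted(after))
```

In this model the hand-made `manual` record survives because neither list mentions it. That is the gap full-zone authoritative pruning would close.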
|
||||||
|
|
||||||
|
- [ ] **Step 5: Lint**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.
|
||||||
|
|
||||||
|
- [ ] **Step 6: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add roles/public_dns
|
||||||
|
git commit -m "feat(public_dns): role tasks, defaults, meta, README"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Molecule scenario (no live API)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `roles/public_dns/molecule/default/converge.yml`
|
||||||
|
- Modify: `roles/public_dns/molecule/default/verify.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write `converge.yml` (apply disabled, sample data)**
|
||||||
|
|
||||||
|
```yaml
---
- name: Converge
  hosts: all
  gather_facts: true
  vars:
    public_dns__apply: false  # never call the Gandi API from a container
    public_dns__domain: example.test
    public_dns__records:
      - { record: "@", type: MX, values: ["0 ."], ttl: 3600 }
      - { record: "@", type: TXT, values: ['"v=spf1 -all"'], ttl: 3600 }
    public_dns__absent:
      - { record: www, type: CNAME }
  roles:
    - role: public_dns
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Write `verify.yml`**
|
||||||
|
|
||||||
|
```yaml
---
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    # Assumes these vars are visible to this play (for example via the scenario's
    # inventory group_vars); play-level vars in converge.yml are not shared with verify.yml.
    - name: Role variables resolved
      ansible.builtin.assert:
        that:
          - public_dns__domain == "example.test"
          - not (public_dns__apply | bool)
        msg: "public_dns defaults/vars did not resolve as expected"
      tags: [verify]
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Run Molecule**
|
||||||
|
|
||||||
|
Run: `make test ROLE=public_dns`
|
||||||
|
Expected: PASS — converge applies the role (the `assert` passes; the `gandi_livedns` tasks are skipped because `public_dns__apply: false`), verify passes, idempotence clean.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add roles/public_dns/molecule
|
||||||
|
git commit -m "test(public_dns): Molecule scenario (apply disabled, no live API)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: The `dns.yml` playbook
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `playbooks/dns.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the play**
|
||||||
|
|
||||||
|
```yaml
---
# dns.yml: manage the public DNS zone (wingu.me) at Gandi LiveDNS as code.
# Runs on the control node (ubongo) against the Gandi API; no host config.
# Run: make check PLAYBOOK=dns   then   make deploy PLAYBOOK=dns
- name: Manage public DNS (Gandi LiveDNS)
  hosts: control
  connection: local
  gather_facts: false
  become: false
  roles:
    - role: public_dns
      tags: [public_dns]
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Lint (verifies the role-name tag on the import)**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: `Passed: 0 failure(s)` and `check-tags: OK (... role imports verified)`.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add playbooks/dns.yml
|
||||||
|
git commit -m "feat(public_dns): dns.yml play (control-node, Gandi LiveDNS)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 7: Live run on ubongo (purge + baseline) — gated
|
||||||
|
|
||||||
|
> **Runs on ubongo only** (vault + Gandi egress). `rbw unlock` first. This is the one
|
||||||
|
> task that mutates live Gandi; review the check-mode diff before deploying.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Dry-run (check mode + diff)**
|
||||||
|
|
||||||
|
Run: `make check PLAYBOOK=dns`
|
||||||
|
Expected: the diff shows the 3 present records being set (null MX, SPF `-all`, DMARC `reject`) and the 11 Gandi defaults being removed. **Review it.**
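As a review aid for the dry run, the expected change list can be derived from the record data and compared with the check-mode diff by eye. The sketch below copies the data from `public_dns.yml` inline so it needs no YAML library; `expected_changes` is a hypothetical helper, and the line format it prints is an illustration, not the format of Ansible's diff output.

```python
# Mirrors public_dns__records in group_vars/all/public_dns.yml.
PRESENT = [
    {"record": "@", "type": "MX", "values": ["0 ."]},
    {"record": "@", "type": "TXT", "values": ['"v=spf1 -all"']},
    {"record": "_dmarc", "type": "TXT", "values": ['"v=DMARC1; p=reject;"']},
]

# Mirrors public_dns__absent as (record, type) pairs.
ABSENT = [
    ("@", "A"), ("www", "CNAME"), ("webmail", "CNAME"),
    ("gm1._domainkey", "CNAME"), ("gm2._domainkey", "CNAME"), ("gm3._domainkey", "CNAME"),
    ("_imap._tcp", "SRV"), ("_imaps._tcp", "SRV"), ("_pop3._tcp", "SRV"),
    ("_pop3s._tcp", "SRV"), ("_submission._tcp", "SRV"),
]


def expected_changes(present, absent):
    """Build one human-readable line per record the first apply should touch."""
    lines = [
        f"set    {r['record']} {r['type']} -> {', '.join(r['values'])}"
        for r in present
    ]
    lines += [f"remove {name} {rtype}" for name, rtype in absent]
    return lines


changes = expected_changes(PRESENT, ABSENT)
for line in changes:
    print(line)
```

The list has 3 `set` lines and 11 `remove` lines, which is the shape the check-mode diff should have on the first run.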
|
||||||
|
|
||||||
|
- [ ] **Step 2: Apply**
|
||||||
|
|
||||||
|
Run: `make deploy PLAYBOOK=dns`
|
||||||
|
Expected: `changed` for the present + absent records; no errors.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify idempotence**
|
||||||
|
|
||||||
|
Run: `make deploy PLAYBOOK=dns`
|
||||||
|
Expected: `ok=... changed=0` — a second run makes no changes.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Verify with dig**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
dig +short MX wingu.me # expect: 0 .
|
||||||
|
dig +short TXT wingu.me # expect: "v=spf1 -all"
|
||||||
|
dig +short TXT _dmarc.wingu.me # expect: "v=DMARC1; p=reject;"
|
||||||
|
dig +short www.wingu.me # expect: empty (CNAME removed)
|
||||||
|
```
|
||||||
|
Expected: as annotated (allow for TTL/propagation).
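If the `dig` answers are captured as text, the comparison against the annotated expectations can be made mechanical. This is a minimal sketch that works on already-captured `dig +short` output and does not query DNS itself; `parse_short`, `mismatches` and the sample captures are hypothetical.

```python
# Expected `dig +short` answers after the apply, keyed by (query type, name).
EXPECTED = {
    ("MX", "wingu.me"): ["0 ."],
    ("TXT", "wingu.me"): ['"v=spf1 -all"'],
    ("TXT", "_dmarc.wingu.me"): ['"v=DMARC1; p=reject;"'],
    ("A", "www.wingu.me"): [],  # CNAME removed, so no answer
}


def parse_short(output):
    """Split `dig +short` output into a sorted list of non-empty answer lines."""
    return sorted(line.strip() for line in output.splitlines() if line.strip())


def mismatches(observed):
    """Return (key, wanted, got) for every query whose answer differs."""
    problems = []
    for key, want in EXPECTED.items():
        got = parse_short(observed.get(key, ""))
        if got != sorted(want):
            problems.append((key, want, got))
    return problems


# Hypothetical captures: a clean zone, and one where a resolver still has the old www.
observed_ok = {
    ("MX", "wingu.me"): "0 .\n",
    ("TXT", "wingu.me"): '"v=spf1 -all"\n',
    ("TXT", "_dmarc.wingu.me"): '"v=DMARC1; p=reject;"\n',
    ("A", "www.wingu.me"): "",
}
observed_stale = dict(observed_ok)
observed_stale[("A", "www.wingu.me")] = "parking.example.\n192.0.2.1\n"

print(mismatches(observed_ok))
print(mismatches(observed_stale))
```

A non-empty result right after the apply may only mean a cached answer has not expired yet, so it is worth re-checking after the old TTL has passed.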
|
||||||
|
|
||||||
|
- [ ] **Step 5: No commit** — this task changes live Gandi, not the repo.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 8: Documentation reconciliation
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/decisions/007-network.md`
|
||||||
|
- Modify: `STATUS.md`
|
||||||
|
- Modify: `docs/TODO.md`
|
||||||
|
- Modify: `docs/CAPABILITIES.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Amend ADR-007 — naming scheme row**
|
||||||
|
|
||||||
|
Replace the `Public service FQDN` row of the naming-scheme table:
|
||||||
|
|
||||||
|
```
|
||||||
|
| Public service FQDN | `<service>.baobab.band` | `forgejo.nyumbani.baobab.band` |
|
||||||
|
```
|
||||||
|
with:
|
||||||
|
|
||||||
|
```
|
||||||
|
| Public service FQDN | `<service>.wingu.me` | `vaultwarden.wingu.me` |
|
||||||
|
| Off-site (VPS) FQDN | `<service>.askari.wingu.me` | `netbird.askari.wingu.me` |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Amend ADR-007 — public zone + scheme**
|
||||||
|
|
||||||
|
Replace the **Public zone** paragraph:
|
||||||
|
|
||||||
|
```
|
||||||
|
**Public zone**: `baobab.band` — served by external DNS (Cloudflare or equivalent).
|
||||||
|
Public-facing services resolve to the public IP or Cloudflare proxy.
|
||||||
|
```
|
||||||
|
with:
|
||||||
|
|
||||||
|
```
|
||||||
|
**Public zone**: `wingu.me` — Gandi LiveDNS, **managed as code** by the `public_dns`
|
||||||
|
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal),
|
||||||
|
services `<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
|
||||||
|
`nyumbani` is retired. **Mesh/LAN-only by default**: home services have no public record
|
||||||
|
(reached over LAN or the NetBird mesh); only deliberate exceptions are published.
|
||||||
|
The project is `boma`; the domain is `wingu.me` (see the M1 spec). The legacy
|
||||||
|
`baobab.band` zone (Cloudflare) is out of scope here.
|
||||||
|
```
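The three-tier naming in the replacement paragraph is regular enough to express as a function, which can help when checking names in reviews. This is an illustrative sketch only; `fqdn` and the tier labels `infra`, `service` and `offsite` are hypothetical names, not identifiers from the repository.

```python
DOMAIN = "wingu.me"


def fqdn(tier, name):
    """Build a name under the three-tier scheme described in ADR-007."""
    if tier == "infra":      # internal hosts: <host>.boma.wingu.me
        return f"{name}.boma.{DOMAIN}"
    if tier == "service":    # split-horizon services: <service>.wingu.me
        return f"{name}.{DOMAIN}"
    if tier == "offsite":    # VPS-hosted services: <service>.askari.wingu.me
        return f"{name}.askari.{DOMAIN}"
    raise ValueError(f"unknown tier: {tier!r}")


print(fqdn("infra", "ubongo"))
print(fqdn("service", "vaultwarden"))
print(fqdn("offsite", "netbird"))
```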
|
||||||
|
|
||||||
|
- [ ] **Step 3: Update the split-horizon example**
|
||||||
|
|
||||||
|
In the **Split-horizon** paragraph, replace the example `forgejo.nyumbani.baobab.band`
with `vaultwarden.wingu.me` (internal → private proxy IP; public → only if a deliberate
exception). Leave the internal-zone wording (`boma.baobab.band`) unchanged for now, since
the zone only becomes `boma.wingu.me` when the `dns` role lands in Phase 2, and add a
parenthetical: *(internal zone is renamed to `boma.wingu.me` when the `dns` role is built
in Phase 2)*.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Mark STATUS — public_dns built**
|
||||||
|
|
||||||
|
In `STATUS.md`, under "Real and working today", add a row:
|
||||||
|
|
||||||
|
```
|
||||||
|
| `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); purged Gandi's seeded defaults, applied the anti-spoof baseline (null MX, SPF `-all`, DMARC reject). Mesh/LAN-only default. M1 of the roadmap. |
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Resolve TODO 4**
|
||||||
|
|
||||||
|
In `docs/TODO.md`, change item 4 to struck-through/decided:
|
||||||
|
|
||||||
|
```
|
||||||
|
4. ~~**Split-horizon FQDN** — adopt split-horizon FQDN with or without nyumbani?~~
|
||||||
|
DECIDED (M1): three-tier scheme on `wingu.me`; `nyumbani` dropped; mesh/LAN-only
|
||||||
|
default. See `docs/decisions/007-network.md` + the M1 spec.
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 6: Add a CAPABILITIES row**
|
||||||
|
|
||||||
|
In `docs/CAPABILITIES.md`, near the Internal DNS row, add:
|
||||||
|
|
||||||
|
```
|
||||||
|
| Public DNS | `public_dns` role → Gandi LiveDNS | P | core | wingu.me zone as code (ADR-007) | anti-spoof baseline; mesh/LAN-only |
|
||||||
|
```
|
||||||
|
(Match the surrounding table's column shape; adjust the status letter to the table's convention.)
|
||||||
|
|
||||||
|
- [ ] **Step 7: Lint + commit**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: clean.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/decisions/007-network.md STATUS.md docs/TODO.md docs/CAPABILITIES.md
|
||||||
|
git commit -m "docs(public_dns): amend ADR-007 to wingu.me/Gandi; resolve TODO 4; STATUS + CAPABILITIES"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-Review (completed)
|
||||||
|
|
||||||
|
- **Spec coverage:** role + group_vars data (Decisions 4,5) → Tasks 3,4; `gandi_livedns` + PAT (Decision 5, Verified facts) → Task 4; collections-on-demand (Decision 5) → Task 1; anti-spoof baseline + Gandi-defaults purge (Problem, Data model) → Tasks 3,7; cert scope (Decision 6) → out of scope (no cert tasks, correct); testing (check-mode/idempotence/dig + pytest) → Tasks 5,7,3; ADR-007 amendment + TODO 4/O12 → Task 8. All covered.
|
||||||
|
- **Placeholder scan:** none — every code/content step is concrete.
|
||||||
|
- **Type/name consistency:** `public_dns__domain`/`__records`/`__absent`/`__apply`/`__default_ttl` and `vault.gandi.pat` used identically across data, role, play, and tests. `gandi_livedns` params match the verified module signature.
|
||||||
|
- **Note for the implementer:** Task 7 assumes ubongo. If the `gandi_livedns` `absent` call needs `values` for some record types, add them from `public_dns__absent` (verify against the pinned `community.general` version per ADR-014).
|
||||||
234
docs/superpowers/plans/2026-06-17-m5-mesh-enrollment.md
Normal file
|
|
@@ -0,0 +1,234 @@
|
||||||
|
# M5 — Mesh enrollment (NetBird agents) Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax.
|
||||||
|
|
||||||
|
**Goal:** `ubongo` reachable from anywhere over the NetBird mesh — enrol NetBird agents on `ubongo` + `askari` via a new opt-in `base` `mesh` concern; the operator enrols the laptops.
|
||||||
|
|
||||||
|
**Architecture:** A new `base` concern (`roles/base/tasks/mesh.yml`) installs a pinned NetBird agent and runs `netbird up` with a reusable scoped setup key from vault. Gated by `base__mesh_enabled` (per-host opt-in) and `base__mesh_manage` (skips network/daemon actions for Molecule). **No firewall change** — enrollment is additive (`wt0` comes up, SSH keeps listening), so there is zero lockout risk. The host nftables default-deny + NetBird ACL tightening are a separate, deferred follow-on.
|
||||||
|
|
||||||
|
**Tech Stack:** NetBird agent (apt, pinned), Ansible (`base` role), Molecule, the M4b coordinator at `https://netbird.askari.wingu.me`.
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md`
|
||||||
|
|
||||||
|
**Execution context:** Tasks 1–4 author + commit (need nothing from the operator). **Task 5 is an operator handoff** (dashboard `/setup` + mint key). **Task 6 applies live to `ubongo` + `askari`** (gated). Task 7 is operator-only (laptops). Task 8 docs.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File structure
|
||||||
|
|
||||||
|
| File | Change | Responsibility |
|
||||||
|
|---|---|---|
|
||||||
|
| `tests/tags.yml` | modify | add the `mesh` concern to the closed tag vocabulary |
|
||||||
|
| `roles/base/defaults/main.yml` | modify | `base__mesh_*` knobs |
|
||||||
|
| `roles/base/tasks/mesh.yml` | **create** | the enrollment concern (install + `netbird up`) |
|
||||||
|
| `roles/base/tasks/main.yml` | modify | include `mesh.yml` (gated, tagged) |
|
||||||
|
| `roles/base/README.md` | modify | document the `mesh` concern + knobs |
|
||||||
|
| `roles/base/molecule/default/converge.yml` | modify | enable mesh (manage off) + dummy key |
|
||||||
|
| `roles/base/molecule/default/verify.yml` | modify | assert mesh wiring / no-op |
|
||||||
|
| `inventories/production/group_vars/control/vars.yml` | modify | `base__mesh_enabled: true` (ubongo) |
|
||||||
|
| `inventories/production/group_vars/offsite_hosts/vars.yml` | **create** | `base__mesh_enabled: true` (askari) |
|
||||||
|
| `inventories/production/group_vars/all/vault.yml` | modify (vault) | `vault.netbird.setup_key: CHANGEME` |
|
||||||
|
| `STATUS.md`, `docs/ROADMAP.md`, `docs/FRICTION.md` | modify | M5 done; deferred hardening; friction note |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Verify + pin the NetBird agent; add the `mesh` tag
|
||||||
|
|
||||||
|
- [ ] **Step 1 (ADR-014 verification: record the answers):** confirm against current NetBird docs/repo (WebFetch `docs.netbird.io`, `pkgs.netbird.io`):
  - the **apt repo** URL + signing-key URL + suite/component (the install script publishes an apt source; capture the exact `deb` line and key URL);
  - the **package name** (headless agent, expected `netbird`) and that **version `0.72.4`** (matching the coordinator) is installable, plus the apt **version-pin syntax**;
  - the exact **`netbird status`** output string that indicates an established management connection (for the idempotency guard, e.g. `Management: Connected`);
  - the **`netbird up`** flags (`--management-url`, `--setup-key`);
  - whether the pinned NetBird's **default peer policy is allow-by-default** (decides Task 6, Step 4).

  Record all of this in the commit message or a note block.
|
||||||
|
- [ ] **Step 2:** add `mesh` to `tests/tags.yml` under `concerns:`:
|
||||||
|
```yaml
|
||||||
|
- mesh # NetBird agent enrollment (ADR-016)
|
||||||
|
```
|
||||||
|
- [ ] **Step 3:** `make lint` → expect 0 failures and `check-tags: OK` (an unused vocabulary entry is allowed; nothing references `mesh` yet).
|
||||||
|
- [ ] **Step 4:** commit `feat(base): add the 'mesh' concern tag (NetBird agent, ADR-016)`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: `base` `mesh` concern — defaults + tasks + include + README
|
||||||
|
|
||||||
|
**Files:** `roles/base/defaults/main.yml`, `roles/base/tasks/mesh.yml` (create), `roles/base/tasks/main.yml`, `roles/base/README.md`.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** append the knobs to `roles/base/defaults/main.yml`:
|
||||||
|
```yaml
|
||||||
|
# NetBird mesh agent enrollment (ADR-016). Opt-in: default off so applying `base` to a
|
||||||
|
# host not (yet) on the mesh is a no-op for this concern. The live actions (apt install
|
||||||
|
# over the network, `netbird up` against the coordinator) are additionally gated by
|
||||||
|
# base__mesh_manage so Molecule can exercise the wiring without a coordinator.
|
||||||
|
base__mesh_enabled: false
|
||||||
|
base__mesh_manage: true
|
||||||
|
base__mesh_management_url: "https://netbird.askari.wingu.me"
|
||||||
|
base__mesh_setup_key: "{{ vault.netbird.setup_key }}"  # carries the base__ prefix, so no var-naming lint exemption is needed
|
||||||
|
base__mesh_version: "0.72.4" # match the coordinator; confirmed installable in Task 1
|
||||||
|
```
|
||||||
|
- [ ] **Step 2:** create `roles/base/tasks/mesh.yml` (use the Task-1-verified repo URL/key/pin; the values below are the expected ones to confirm):
|
||||||
|
```yaml
---
# NetBird agent enrollment (ADR-016). Additive only: no firewall change here.
- name: Ensure /etc/apt/keyrings exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"
  tags: [mesh]

- name: Add the NetBird APT GPG key
  ansible.builtin.get_url:
    url: https://pkgs.netbird.io/debian/public.key  # confirm in Task 1
    dest: /etc/apt/keyrings/netbird.asc
    mode: "0644"
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Add the NetBird APT repository
  ansible.builtin.apt_repository:
    # Repo line to confirm in Task 1. The note lives here because a trailing "#"
    # inside the folded scalar would become part of the repo string.
    repo: >-
      deb [signed-by=/etc/apt/keyrings/netbird.asc]
      https://pkgs.netbird.io/debian stable main
    filename: netbird
    state: present
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Install the NetBird agent (pinned)
  ansible.builtin.apt:
    name: "netbird={{ base__mesh_version }}"  # confirm pin syntax in Task 1
    state: present
    update_cache: true
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Check current NetBird connection status
  ansible.builtin.command: netbird status
  register: _netbird_status
  changed_when: false
  failed_when: false
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Enrol this host in the mesh
  ansible.builtin.command: >-
    netbird up
    --management-url {{ base__mesh_management_url }}
    --setup-key {{ base__mesh_setup_key }}
  register: _netbird_up
  changed_when: _netbird_up.rc == 0
  when:
    - base__mesh_manage | bool
    - "'Management: Connected' not in (_netbird_status.stdout | default(''))"  # confirm string in Task 1
  no_log: true  # setup key is on the argv
  tags: [mesh]
```
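The idempotency guard on the enrol task reduces to a substring test on the `netbird status` output. The sketch below restates that test in Python so the edge cases are easy to see. `needs_enrolment` is a hypothetical helper, and the marker string is the plan's expected value, still to be confirmed in Task 1.

```python
# Assumption: the marker the plan expects `netbird status` to print once the
# agent has an established management connection (to be confirmed in Task 1).
CONNECTED_MARKER = "Management: Connected"


def needs_enrolment(status_stdout):
    """Return True when `netbird up` should run, mirroring the task's `when:`."""
    # A missing or empty stdout (daemon not running, command failed) counts as
    # "not connected", like `default('')` in the Ansible condition.
    return CONNECTED_MARKER not in (status_stdout or "")


print(needs_enrolment(""))
print(needs_enrolment("Management: Connected\nSignal: Connected\n"))
print(needs_enrolment("Management: Disconnected\n"))
```

Because the test is a plain substring match, any change in the status wording between NetBird versions would make the guard always true and re-run `netbird up` on every apply, which is why the exact string is worth pinning down together with the agent version.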
|
||||||
|
- [ ] **Step 3:** in `roles/base/tasks/main.yml`, add the include (after the existing concerns), gated by `base__mesh_enabled`:
|
||||||
|
```yaml
- name: NetBird mesh enrollment
  ansible.builtin.include_tasks:
    file: mesh.yml
    apply:
      tags: [mesh]
  when: base__mesh_enabled | bool
  tags: [mesh]
```
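The two gates compose as follows: `base__mesh_enabled` decides whether the concern is included at all, and `base__mesh_manage` decides whether the network and daemon actions inside it run. A minimal sketch of that truth table, assuming the task layout of `mesh.yml` as written in this plan; `mesh_actions` and the action labels are hypothetical.

```python
def mesh_actions(enabled, manage):
    """List the mesh.yml actions that would run for a given pair of gates."""
    if not enabled:
        return []  # the include is skipped entirely
    actions = ["ensure /etc/apt/keyrings"]  # not gated by manage
    if manage:
        actions += [
            "add apt key",
            "add apt repository",
            "install netbird (pinned)",
            "netbird status",
            "netbird up (only if not already connected)",
        ]
    return actions


for enabled in (False, True):
    for manage in (False, True):
        print(enabled, manage, mesh_actions(enabled, manage))
```

With `enabled=True, manage=False` (the Molecule case) only the keyrings directory task runs, so nothing touches the network or the daemon.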
|
||||||
|
- [ ] **Step 4:** document the concern in `roles/base/README.md` (purpose; the `base__mesh_*` knobs table; that it is additive/no-firewall; that the setup key comes from `vault.netbird.setup_key`; the `enabled`/`manage` gating).
|
||||||
|
- [ ] **Step 5:** `make lint` → 0 failures. Commit `feat(base): NetBird agent enrollment concern (mesh)`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Molecule coverage
|
||||||
|
|
||||||
|
**Files:** `roles/base/molecule/default/converge.yml`, `roles/base/molecule/default/verify.yml`.
|
||||||
|
|
||||||
|
> The concern is install + a daemon command needing a live coordinator, so the hermetic Molecule surface is thin (the known "render-only misses the real call" gotcha). Molecule proves: (a) enabling mesh with `manage: false` does not break the base converge and is idempotent; (b) `base__mesh_enabled: false` (the default, already exercised by the existing firewall test) is a clean no-op. Full install+enrol is proven live in Task 6.
|
||||||
|
|
||||||
|
- [ ] **Step 1:** in `converge.yml` add to `vars:`:
|
||||||
|
```yaml
|
||||||
|
base__mesh_enabled: true
|
||||||
|
base__mesh_manage: false # skip network/daemon actions
|
||||||
|
base__mesh_setup_key: "dummy-molecule-key"
|
||||||
|
```
|
||||||
|
- [ ] **Step 2:** in `verify.yml` add a task asserting the concern is a clean no-op under `manage: false` — `netbird` is NOT installed and `wt0` does not exist (since all live actions are gated off):
|
||||||
|
```yaml
- name: Confirm mesh manage=false did not install/enrol
  ansible.builtin.command: which netbird
  register: _nb
  changed_when: false
  failed_when: false
- name: Assert netbird absent under manage=false
  ansible.builtin.assert:
    that:
      - _nb.rc != 0
    fail_msg: "netbird should not be installed when base__mesh_manage is false"
```
|
||||||
|
- [ ] **Step 3:** `make test ROLE=base` → converge + idempotence + verify pass (`failed=0`). The existing firewall assertions still pass (mesh vars don't affect them).
|
||||||
|
- [ ] **Step 4:** commit `test(base): molecule coverage for the mesh concern (manage-off no-op)`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Vault stub + per-host opt-in
|
||||||
|
|
||||||
|
- [ ] **Step 1 (vault — needs `rbw` unlocked):** `make decrypt FILE=inventories/production/group_vars/all/vault.yml`; add under `vault.netbird` (alongside `auth_secret`/`datastore_key`):
|
||||||
|
```yaml
|
||||||
|
# Reusable, scoped (group "boma-hosts"), expiring NetBird setup key. Mint it in the
|
||||||
|
# dashboard (Setup Keys) AFTER the first-boot /setup admin exists. Consumed by the
|
||||||
|
# base 'mesh' concern. CHANGEME until the operator supplies it via `make edit-vault`.
|
||||||
|
setup_key: CHANGEME
|
||||||
|
```
|
||||||
|
`make encrypt FILE=...`; `make check-vault` → confirms structure + lists the `setup_key` CHANGEME.
|
||||||
|
- [ ] **Step 2:** set the opt-in. In `inventories/production/group_vars/control/vars.yml` add `base__mesh_enabled: true` (ubongo). Create `inventories/production/group_vars/offsite_hosts/vars.yml`:
|
||||||
|
```yaml
|
||||||
|
---
|
||||||
|
# askari is a NetBird peer as well as the coordinator host (ADR-016).
|
||||||
|
base__mesh_enabled: true
|
||||||
|
```
|
||||||
|
- [ ] **Step 3:** `make lint` → 0 failures. Commit `feat(base): vault setup_key stub + enable mesh on ubongo + askari`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Operator handoff — first-boot admin + setup key (GATED, operator does this)
|
||||||
|
|
||||||
|
> Nothing here is automatable — the agent cannot create a dashboard admin or mint a key.
|
||||||
|
|
||||||
|
- [ ] **Step 1 (operator):** browse `https://netbird.askari.wingu.me`, complete the one-time `/setup` to create the admin user, log in.
|
||||||
|
- [ ] **Step 2 (operator):** create a **reusable** setup key, **scoped** to auto-assign peers to a `boma-hosts` group, with an **expiry**. Copy the key value.
|
||||||
|
- [ ] **Step 3 (operator):** `make edit-vault` → replace `vault.netbird.setup_key`'s `CHANGEME` with the real key → `:wq` (re-encrypts) → `make check-vault` shows no outstanding CHANGEME. The key never enters the chat.
|
||||||
|
- [ ] **Step 4:** no repo commit beyond the (already-encrypted) vault file, whose structure is unchanged; only the `setup_key` value differs.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 6: Enrol `ubongo` + `askari` (GATED, live — needs Task 5 done + `rbw` unlocked)
|
||||||
|
|
||||||
|
- [ ] **Step 1:** `make check PLAYBOOK=site LIMIT=askari TAGS=mesh` — review (askari is `ansible`-user managed; cleaner first target than the control node). Then `make deploy PLAYBOOK=site LIMIT=askari TAGS=mesh`.
|
||||||
|
- [ ] **Step 2:** verify on askari: `netbird status` shows `Management: Connected`; `ip link show wt0` exists. (Agent coexists with the coordinator container; it reaches the coordinator via the public URL.)
|
||||||
|
- [ ] **Step 3:** `make check PLAYBOOK=site LIMIT=ubongo TAGS=mesh` — review. Note: ubongo is managed as `sjat` with `become: true` (same path `dev_env` used via `playbooks/workstation.yml`); confirm `sjat` sudo works (the run will prompt/fail clearly if a become password is needed). Then `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=mesh`.
|
||||||
|
- [ ] **Step 4:** verify the mesh link from ubongo: `netbird status` shows `ubongo` connected and lists `askari` as a peer; ping askari's NetBird (`100.x`) address. If the pinned NetBird is NOT allow-by-default (Task 1, Step 1), add one minimal dashboard policy permitting the admin group → `ubongo` SSH (or temporarily the default policy) so Task 7 can connect.
|
||||||
|
- [ ] **Step 5:** no repo commit (host state).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 7: Enrol the road-warrior clients → goal lands (operator)
|
||||||
|
|
||||||
|
- [ ] **Step 1 (operator):** install the NetBird client on `mamba` + the work laptop; log in via the dashboard (Dex SSO) so they join the mesh.
|
||||||
|
- [ ] **Step 2 (operator):** from a laptop (anywhere), `ssh sjat@<ubongo-netbird-ip>` (or the mesh hostname) — connection succeeds. **← the mobile-access goal lands here.**
|
||||||
|
- [ ] **Step 3:** confirm with the operator that remote access works end-to-end.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 8: Docs
|
||||||
|
|
||||||
|
- [ ] **Step 1:** `STATUS.md` — move "NetBird agent enrollment in `base`" to **built + applied** (ubongo + askari enrolled; reachability achieved). Note the `mesh` concern + opt-in. ubongo row: mesh-enrolled (its other base concerns still pending). askari row: NetBird peer.
|
||||||
|
- [ ] **Step 2:** `docs/ROADMAP.md` — **M5 ✅ DONE**; Phase 1 (remote access) complete. Next: the **Procurement gate** (`/capacity-review` → buy cluster hardware). Record the deferred "mesh hardening" follow-on (ubongo nftables default-deny + NetBird ACL tightening + askari SSH→`wt0`).
|
||||||
|
- [ ] **Step 3:** `docs/FRICTION.md`: add a signal that a **docs-only commit still tripped the `rbw`-locked pre-commit guard** (2026-06-17), even though the 2026-06-10 kaizen fix was meant to let docs-only and config-only commits through without the vault. Either the hook's scoping or a blanket guard needs a look.
|
||||||
|
- [ ] **Step 4:** `make lint`; commit `docs: M5 done — Phase 1 remote access complete`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-Review (completed)
|
||||||
|
|
||||||
|
- **Spec coverage:** `mesh` concern (spec §1) → Tasks 1–3; vault stub (spec §2) → Task 4; ubongo+askari enrol (spec §3) → Tasks 4,6; laptops (spec §3) → Task 7; reachability via default policy (spec §4) → Task 6 step 4; deferred hardening (spec §6) → recorded in Task 8; operator handoff (spec) → Task 5. Testing (spec) → Task 3 (hermetic) + Task 6 (live). All covered.
|
||||||
|
- **Placeholder scan:** the "confirm in Task 1" markers are ADR-014 verification points executed in Task 1 (the repo URL/key/pin/status-string), not vague TODOs — Task 2's code carries the expected values to confirm, matching how M4a/M4b pinned versions in-plan.
|
||||||
|
- **Consistency:** `base__mesh_enabled` (opt-in) vs `base__mesh_manage` (test gate) used consistently across defaults, tasks, include, converge, and the no-op assertion; `vault.netbird.setup_key` matches between defaults, vault stub, and Task 5; `mesh` tag added (Task 1) before it is used (Task 2).
|
||||||
|
- **Risk:** the only live risk is Task 6 on the control node — mitigated because the `mesh` concern makes **no firewall change** (SSH stays open on all paths), askari is enrolled first as the lower-risk rehearsal, and the host nftables lockdown is explicitly out of scope.
|
||||||
|
|
@ -0,0 +1,466 @@
|
||||||
|
# Mesh-hardening 1/3 — askari SSH onto wt0 — Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Make askari's SSH reachable only over the NetBird mesh (`wt0`) and close the WAN `:22` surface at both the host nftables layer and the Hetzner Cloud Firewall, without dropping askari's public services.
|
||||||
|
|
||||||
|
**Architecture:** Three enforcement layers — (1) sshd `ListenAddress` bound to the live `wt0` IP (fail-closed, `ip_nonlocal_bind` to beat the post-boot bind race); (2) the base role's catalog-driven nftables default-deny (SSH already restricted to `wt0` via `base__firewall_mgmt_interface`; add a `public` zone + askari service entries so 80/443/3478 survive); (3) Terraform drops the Hetzner Cloud Firewall WAN `:22` rule. Tasks 1–4 are code (subagent-driven, each Molecule/lint/plan-verified). Task 5 is the live, operator-supervised cutover on the real host.
|
||||||
|
|
||||||
|
**Tech Stack:** Ansible (role `base`, FQCN), nftables, Molecule on Debian 13, `ansible.posix.sysctl`, pytest (filter unit tests), Terraform (`hcloud` provider).
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-17-mesh-hardening-askari-ssh-wt0-design.md`
|
||||||
|
|
||||||
|
**Conventions:** `make lint` and `make test ROLE=base` before each commit; `make check` before `make deploy`; `make tf-plan` before `make tf-apply`; never hand-edit the generated `offsite.yml`; rbw unlocked for commits touching ansible content.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: base role — sshd `ListenAddress` on wt0 + `ip_nonlocal_bind` (fail-closed)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `roles/base/defaults/main.yml`
|
||||||
|
- Modify: `roles/base/tasks/ssh.yml`
|
||||||
|
- Modify: `roles/base/templates/sshd_hardening.conf.j2`
|
||||||
|
- Modify: `roles/base/molecule/default/converge.yml` (fixture)
|
||||||
|
- Modify: `roles/base/molecule/default/verify.yml` (assertions = the test)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing test (extend Molecule verify)**
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/verify.yml`, add these tasks after the existing "Sshd drop-in present and config valid" block:
|
||||||
|
|
||||||
|
```yaml
- name: ListenAddress bound to the fixture mesh IP (mesh-only mode)
  ansible.builtin.command: grep -q '^ListenAddress 100\.99\.0\.1$' /etc/ssh/sshd_config.d/10-boma.conf
  changed_when: false

- name: ip_nonlocal_bind sysctl drop-in is present
  # Tolerates both `key=value` and `key = value`; the spacing depends on what wrote the file.
  ansible.builtin.command: grep -Eq '^net\.ipv4\.ip_nonlocal_bind *= *1$' /etc/sysctl.d/60-boma-nonlocal-bind.conf
  changed_when: false

- name: ip_nonlocal_bind is live in this netns
  ansible.builtin.command: sysctl -n net.ipv4.ip_nonlocal_bind
  register: _nonlocal
  changed_when: false
  failed_when: _nonlocal.stdout | trim != '1'
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add the fixture that drives it (Molecule converge)**
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/converge.yml`, add to the `vars:` block (alongside the existing `base__mesh_*`):
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
base__ssh_listen_mesh_only: true
|
||||||
|
base__ssh_listen_addr: "100.99.0.1" # fixture mesh IP (no wt0 in the container)
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Run the test to verify it fails**
|
||||||
|
|
||||||
|
Run: `make test ROLE=base`
|
||||||
|
Expected: FAIL — converge errors or verify fails (`ListenAddress` not rendered; sysctl drop-in absent), because the feature isn't implemented yet.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Add the defaults**
|
||||||
|
|
||||||
|
In `roles/base/defaults/main.yml`, after the `base__ssh_authorised_keys: []` line (end of the hardening block), add:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# SSH listen-on-mesh (mesh-hardening 1/3, ADR-016/021). Opt-in: when true, sshd binds
|
||||||
|
# ListenAddress to this host's mesh IP only (not the WAN). The IP comes from the live wt0
|
||||||
|
# fact (ansible_facts.wt0.ipv4.address); base__ssh_listen_addr overrides it. ip_nonlocal_bind
|
||||||
|
# lets sshd bind the mesh IP before wt0 exists at boot. Fails closed: the play asserts a
|
||||||
|
# non-empty address rather than silently listening on all interfaces.
|
||||||
|
base__ssh_listen_mesh_only: false
|
||||||
|
base__ssh_listen_addr: ""
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Resolve + assert + sysctl in `ssh.yml`**
|
||||||
|
|
||||||
|
In `roles/base/tasks/ssh.yml`, insert these tasks at the TOP of the file (before "Ensure openssh-server is installed"):
|
||||||
|
|
||||||
|
```yaml
- name: Resolve the sshd mesh listen address (override, else live wt0 fact)
  ansible.builtin.set_fact:
    base__ssh_listen_addr_resolved: >-
      {{ base__ssh_listen_addr
         or ansible_facts.get('wt0', {}).get('ipv4', {}).get('address', '') }}
  when: base__ssh_listen_mesh_only | bool

- name: Fail closed; refuse to render sshd without a known mesh address
  ansible.builtin.assert:
    that:
      - base__ssh_listen_addr_resolved | length > 0
    fail_msg: >-
      base__ssh_listen_mesh_only is true but no mesh address resolved (set
      base__ssh_listen_addr or ensure wt0 is up so its fact is gathered). Refusing to
      render sshd ListenAddress empty (which would listen on ALL interfaces).
  when: base__ssh_listen_mesh_only | bool

- name: Allow sshd to bind the mesh IP before wt0 exists at boot
  ansible.posix.sysctl:
    name: net.ipv4.ip_nonlocal_bind
    value: "1"
    sysctl_set: true
    state: present
    reload: true
    sysctl_file: /etc/sysctl.d/60-boma-nonlocal-bind.conf
  when: base__ssh_listen_mesh_only | bool
```
|
||||||
|
|
||||||
|
- [ ] **Step 6: Render the conditional `ListenAddress`**
|
||||||
|
|
||||||
|
In `roles/base/templates/sshd_hardening.conf.j2`, append after the existing `KbdInteractiveAuthentication no` line:
|
||||||
|
|
||||||
|
```jinja
|
||||||
|
{% if base__ssh_listen_mesh_only | bool %}
|
||||||
|
ListenAddress {{ base__ssh_listen_addr_resolved }}
|
||||||
|
{% endif %}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 7: Run the test to verify it passes**
|
||||||
|
|
||||||
|
Run: `make test ROLE=base`
|
||||||
|
Expected: PASS — converge succeeds; verify confirms `ListenAddress 100.99.0.1`, the sysctl drop-in, and the live value `1`.
|
||||||
|
|
||||||
|
> **Checkpoint (environmental):** if `make test` fails on the sysctl task because the Molecule container can't write `net.ipv4.ip_nonlocal_bind`, add `sysctls: {net.ipv4.ip_nonlocal_bind: "0"}` to the platform in `roles/base/molecule/default/molecule.yml` (pre-creates the namespaced sysctl so the task can set it), then re-run. Note the change in the commit.
|
||||||
|
|
||||||
|
- [ ] **Step 8: Lint**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.
|
||||||
|
|
||||||
|
- [ ] **Step 9: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add roles/base/defaults/main.yml roles/base/tasks/ssh.yml \
|
||||||
|
roles/base/templates/sshd_hardening.conf.j2 \
|
||||||
|
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
|
||||||
|
git commit -m "feat(base): opt-in sshd ListenAddress on the mesh IP (fail-closed)
|
||||||
|
|
||||||
|
base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
|
||||||
|
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
|
||||||
|
unresolved address never silently listens on all interfaces. Molecule covers
|
||||||
|
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
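
The resolve-then-assert logic from Step 5 can be mirrored in plain Python to check the edge cases before touching a host. This is a sketch under stated assumptions: `resolve_listen_addr` is a hypothetical name, and `facts` stands in for `ansible_facts` as a nested dictionary.

```python
from typing import Any, Mapping


def resolve_listen_addr(override: str, facts: Mapping[str, Any]) -> str:
    """Return the sshd listen address, or raise if none can be resolved.

    Illustrative mirror of the set_fact + assert pair: an explicit override
    wins, otherwise the live wt0 IPv4 fact is used. An empty result is
    rejected instead of being rendered, because an empty ListenAddress
    would make sshd listen on all interfaces.
    """
    addr = override or facts.get("wt0", {}).get("ipv4", {}).get("address", "")
    if not addr:
        raise ValueError("no mesh address resolved; refusing to render ListenAddress")
    return addr


if __name__ == "__main__":
    print(resolve_listen_addr("", {"wt0": {"ipv4": {"address": "100.99.226.39"}}}))
```

The Molecule fixture exercises the override branch (`100.99.0.1`, no `wt0` in the container); the live host is expected to take the fact branch.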
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: firewall catalog — `public` zone + askari's public services
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `inventories/production/group_vars/all/firewall.yml`
|
||||||
|
- Modify: `roles/base/molecule/default/converge.yml` (fixture: public-zone rule)
|
||||||
|
- Modify: `roles/base/molecule/default/verify.yml` (assert the 0.0.0.0/0 rule)
|
||||||
|
- Test: `tests/test_firewall_rules.py` (unit: a `public` zone resolves to `0.0.0.0/0`)
|
||||||
|
|
||||||
|
Rationale: `base__firewall_mgmt_interface` already accepts `:22` on `wt0`. The gap is that the catalog is empty and has no "anywhere" source, so applying default-deny to askari would drop 80/443/3478. We add a `public` zone (`0.0.0.0/0`) and askari's service ingress.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing unit test**
|
||||||
|
|
||||||
|
In `tests/test_firewall_rules.py`, add:
|
||||||
|
|
||||||
|
```python
def test_public_zone_resolves_to_anywhere():
    catalog = {"web": {"host": "askari",
                       "ingress": [{"from": "public", "port": 443, "proto": "tcp"}]}}
    zones = {"public": "0.0.0.0/0"}
    rules = rs.resolve_firewall_rules(catalog, zones, "askari",
                                      {"askari": {"ansible_host": "100.99.226.39"}}, {})
    assert rules == [{"proto": "tcp", "port": 443, "sources": ["0.0.0.0/0"]}]
```
|
||||||
|
|
||||||
|
(Module is loaded by the existing importlib shim at the top of the test file as `rs`. If the filter is imported under a different alias there, match it.)
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run it to verify it fails (or passes trivially)**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m pytest tests/test_firewall_rules.py -q`
|
||||||
|
Expected: this test PASSES immediately if the filter already resolves arbitrary zones (it does — `_resolve_source` treats any `zones` key generically). That is fine: the unit test documents/locks the `public`-zone contract. If it fails, fix the filter. Either way it must end green.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add the Molecule fixture (public-zone rule)**
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/converge.yml`, under `firewall_zones:` add `public: 0.0.0.0/0`, and under `firewall_catalog:` add:
|
||||||
|
|
||||||
|
```yaml
netbird_stun:
  host: instance
  ingress:
    - { from: public, port: 3478, proto: udp }
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Add the Molecule assertion (the test)**
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/verify.yml`, after the photoprism assertion block, add:
|
||||||
|
|
||||||
|
```yaml
- name: Assert the public->stun:3478/udp ingress rule (0.0.0.0/0 source)
  ansible.builtin.assert:
    that:
      - "'0.0.0.0/0' in nft"
      - "'udp dport 3478 accept' in nft"
    fail_msg: "missing public->3478/udp rule for netbird_stun"
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Run the tests**
|
||||||
|
|
||||||
|
Run: `make test ROLE=base` then `.venv/bin/python -m pytest tests/test_firewall_rules.py -q`
|
||||||
|
Expected: both PASS (the rendered ruleset now contains the `0.0.0.0/0 ... udp dport 3478 accept` rule).
|
||||||
|
|
||||||
|
- [ ] **Step 6: Populate the real catalog**
|
||||||
|
|
||||||
|
In `inventories/production/group_vars/all/firewall.yml`, replace the `firewall_zones`/`firewall_catalog` blocks with:
|
||||||
|
|
||||||
|
```yaml
# Zone → subnet (from ADR-007). `public` = the WAN (anywhere) for deliberately public
# off-site services (askari); home/cluster services use the internal zones only.
firewall_zones:
  mgmt: 10.10.0.0/24
  srv: 10.20.0.0/24
  lan: 10.30.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24
  public: 0.0.0.0/0

# Service catalog: <name> → placement (host | group | hosts) + ingress[].
# askari's public surface (ADR-024 Caddy + ADR-016 NetBird STUN). NOTE: the host
# nftables template renders IPv4 source rules only; askari is reached via its A record
# (no AAAA), so IPv4-only public rules are sufficient (see the spec's IPv6 note).
firewall_catalog:
  reverse_proxy:
    host: askari
    ingress:
      - { from: public, port: 80, proto: tcp }
      - { from: public, port: 443, proto: tcp }
  netbird_stun:
    host: askari
    ingress:
      - { from: public, port: 3478, proto: udp }
```
|
||||||
|
|
||||||
|
- [ ] **Step 7: Lint**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: clean pass (`check-tags: OK`).
|
||||||
|
|
||||||
|
- [ ] **Step 8: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add inventories/production/group_vars/all/firewall.yml \
|
||||||
|
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
|
||||||
|
tests/test_firewall_rules.py
|
||||||
|
git commit -m "feat(firewall): public zone + askari's public services in the catalog
|
||||||
|
|
||||||
|
Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
|
||||||
|
(3478/udp) ingress so the base nftables default-deny does not drop the live
|
||||||
|
public services when applied to askari. Molecule + filter unit test cover the
|
||||||
|
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
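
To make the catalog contract from this task concrete, here is a deliberately simplified Python sketch of zone resolution. It is not the repo's filter: `resolve_rules_sketch` is a hypothetical stand-in that handles only `host` placement and named zones, and it ignores the hostvars and groups arguments the real `resolve_firewall_rules` call takes in the unit test.

```python
from typing import Any, Dict, List


def resolve_rules_sketch(catalog: Dict[str, Any],
                         zones: Dict[str, str],
                         host: str) -> List[Dict[str, Any]]:
    """Flatten catalog ingress entries for one host into rule dicts.

    Assumption: every `from` value is a key of `zones`; an unknown zone
    raises KeyError rather than silently producing an open rule.
    """
    rules = []
    for service in catalog.values():
        if service.get("host") != host:
            continue  # service is placed on a different host
        for ingress in service.get("ingress", []):
            rules.append({
                "proto": ingress["proto"],
                "port": ingress["port"],
                "sources": [zones[ingress["from"]]],
            })
    return rules


if __name__ == "__main__":
    zones = {"public": "0.0.0.0/0"}
    catalog = {
        "reverse_proxy": {"host": "askari", "ingress": [
            {"from": "public", "port": 80, "proto": "tcp"},
            {"from": "public", "port": 443, "proto": "tcp"}]},
        "netbird_stun": {"host": "askari", "ingress": [
            {"from": "public", "port": 3478, "proto": "udp"}]},
    }
    for rule in resolve_rules_sketch(catalog, zones, "askari"):
        print(rule)
```

With the production catalog from Step 6, the sketch yields three rules for askari and none for any other host, which is the property that keeps 80/443/3478 alive under default-deny.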
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: inventory — point Ansible at wt0 + enable mesh-only SSH on askari
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `inventories/production/host_vars/askari.yml`
|
||||||
|
- Modify: `inventories/production/group_vars/offsite_hosts/vars.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the host_var override**
|
||||||
|
|
||||||
|
Create `inventories/production/host_vars/askari.yml`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
---
|
||||||
|
# Manage askari over the NetBird mesh (wt0), not its WAN IP. This OVERRIDES the
|
||||||
|
# TF-generated inventories/production/offsite.yml (ansible_host = 77.42.120.136); host_vars
|
||||||
|
# outrank the generated inventory and are NOT touched by `make tf-inventory-offsite`.
|
||||||
|
# Mesh-hardening 1/3 — once SSH is wt0-only, the WAN IP is no longer reachable for SSH.
|
||||||
|
ansible_host: 100.99.226.39 # askari's wt0 address (NetBird, M5)
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Enable mesh-only SSH for offsite hosts**
|
||||||
|
|
||||||
|
In `inventories/production/group_vars/offsite_hosts/vars.yml`, replace the file body with:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
---
|
||||||
|
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
|
||||||
|
# (ADR-016, M5). Mesh-hardening 1/3 (2026-06-17): SSH is moved onto wt0 — sshd binds the
|
||||||
|
# mesh IP only (base__ssh_listen_mesh_only) and the base nftables default-deny applies
|
||||||
|
# (base__firewall_apply defaults true; SSH allowed on wt0 via base__firewall_mgmt_interface,
|
||||||
|
# public services via the catalog). base__mesh_enabled stays true (precondition from M5).
|
||||||
|
base__mesh_enabled: true
|
||||||
|
base__ssh_listen_mesh_only: true
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify the override resolves**
|
||||||
|
|
||||||
|
Run: `.venv/bin/ansible-inventory -i inventories/production/ --host askari 2>/dev/null | grep ansible_host`
|
||||||
|
Expected: `"ansible_host": "100.99.226.39"` (the host_var wins over the generated `offsite.yml`).
|
||||||
|
|
||||||
|
- [ ] **Step 4: Lint**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: clean pass.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add inventories/production/host_vars/askari.yml \
|
||||||
|
inventories/production/group_vars/offsite_hosts/vars.yml
|
||||||
|
git commit -m "feat(inventory): manage askari over wt0 + enable mesh-only SSH
|
||||||
|
|
||||||
|
host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
|
||||||
|
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Terraform — retire the Hetzner WAN `:22` rule
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `terraform/modules/hetzner_vm/main.tf`
|
||||||
|
- Modify: `terraform/modules/hetzner_vm/variables.tf`
|
||||||
|
- Modify: `terraform/environments/offsite/main.tf`
|
||||||
|
|
||||||
|
This task makes the SSH rule conditional and sets askari's admin CIDRs to empty (mesh-only). The live `tf-plan`/`tf-apply` happens in Task 5 — here we only change + format/validate the code.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Gate the SSH rule on a non-empty CIDR list**
|
||||||
|
|
||||||
|
In `terraform/modules/hetzner_vm/main.tf`, replace the static SSH `rule { ... }` block (the one with `port = "22"`) with a dynamic block:
|
||||||
|
|
||||||
|
```hcl
  # SSH from the control node only, and only when admin CIDRs are set. An empty
  # ssh_admin_cidrs removes the WAN :22 rule entirely (mesh-only SSH; reach the host over
  # wt0, break-glass = Hetzner console). Mesh-hardening 1/3.
  dynamic "rule" {
    for_each = length(var.ssh_admin_cidrs) > 0 ? [1] : []
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = "22"
      source_ips = var.ssh_admin_cidrs
    }
  }
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Default the variable to empty**
|
||||||
|
|
||||||
|
In `terraform/modules/hetzner_vm/variables.tf`, change the `ssh_admin_cidrs` variable to default to an empty list:
|
||||||
|
|
||||||
|
```hcl
variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to reach SSH over the WAN. Empty = no WAN SSH rule (mesh-only)."
  type        = list(string)
  default     = []
}
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Set askari to mesh-only SSH**
|
||||||
|
|
||||||
|
In `terraform/environments/offsite/main.tf`, change the `ssh_admin_cidrs` argument in the `module "askari"` block to:
|
||||||
|
|
||||||
|
```hcl
|
||||||
|
ssh_admin_cidrs = [] # mesh-only: SSH is reached over wt0; WAN :22 retired (mesh-hardening 1/3)
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Format + validate**
|
||||||
|
|
||||||
|
Run: `cd terraform/environments/offsite && terraform fmt -recursive ../.. && terraform validate && cd -`
|
||||||
|
Expected: `fmt` lists any reformatted files (re-add them); `validate` prints `Success! The configuration is valid.` (offsite is already `init`ed — it has live state.)
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add terraform/modules/hetzner_vm/main.tf terraform/modules/hetzner_vm/variables.tf \
|
||||||
|
terraform/environments/offsite/main.tf
|
||||||
|
git commit -m "feat(tf/offsite): retire askari's WAN :22 (mesh-only SSH)
|
||||||
|
|
||||||
|
The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
|
||||||
|
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
|
||||||
|
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
|
||||||
|
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Live staged cutover (operator-supervised — NOT a subagent task)
|
||||||
|
|
||||||
|
> This task touches the real askari over the network and is lockout-risky. Run it
|
||||||
|
> interactively with the operator, in order, verifying each step before the next. The
|
||||||
|
> firewall's auto-rollback timer + `wait_for_connection` over wt0 is the safety net; the
|
||||||
|
> Hetzner web console is the ultimate break-glass. Do NOT hand this to an unattended agent.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Pre-check the mesh SSH path (before any change)**
|
||||||
|
|
||||||
|
Run: `.venv/bin/ansible askari -i inventories/production/ -m ping`
|
||||||
|
Expected: `SUCCESS` — confirms Ansible reaches askari over `wt0` (Tasks 1–3 are merged, so `ansible_host` is now `100.99.226.39`). If this fails, STOP — the mesh path must work before closing the WAN.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Dry-run the base apply (firewall + sshd)**
|
||||||
|
|
||||||
|
Run: `make check PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening`
|
||||||
|
Expected: shows the nftables ruleset diff (default-deny + wt0 SSH + public 80/443/3478) and the sshd drop-in diff (`ListenAddress 100.99.226.39`); no errors. Review that the public service rules are present (so they won't be dropped).
|
||||||
|
|
||||||
|
- [ ] **Step 3: Apply the host firewall + sshd (auto-rollback armed)**
|
||||||
|
|
||||||
|
Run: `make deploy PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening`
|
||||||
|
Expected: the firewall concern arms the rollback timer, applies, resets the connection, and `wait_for_connection` succeeds over wt0; sshd reloads with the mesh ListenAddress. If connectivity is lost, the timer auto-reverts the ruleset within `base__firewall_rollback_timeout` (45 s).
|
||||||
|
|
||||||
|
- [ ] **Step 4: Verify services + WAN SSH still open at the cloud edge**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -sSf -o /dev/null -w '%{http_code}\n' https://test.askari.wingu.me # expect 200
|
||||||
|
curl -sSf -o /dev/null -w '%{http_code}\n' https://netbird.askari.wingu.me # expect 200
|
||||||
|
```
|
||||||
|
Expected: both `200` (valid certs); the host firewall did not drop the public services. (WAN `:22` is now dropped by the host nftables, but the Hetzner FW still allows it until Step 5 — that's fine.)
|
||||||
|
|
||||||
|
- [ ] **Step 5: Retire the Hetzner WAN `:22` — plan, review, apply**
|
||||||
|
|
||||||
|
Run: `make tf-plan TF_ENV=offsite`
|
||||||
|
Expected: the plan shows the SSH firewall rule being **destroyed** (and nothing else of substance). Review it.
|
||||||
|
|
||||||
|
Then: `make tf-apply TF_ENV=offsite`
|
||||||
|
Expected: apply succeeds; the WAN `:22` rule is gone.
|
||||||
|
|
||||||
|
- [ ] **Step 6: Verify the end-state (out-of-band)**
|
||||||
|
|
||||||
|
From an OFF-MESH host (e.g. the operator's laptop with NetBird disconnected), so that the probes travel over the WAN rather than `wt0`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
nc -vz -w5 77.42.120.136 22 # expect: refused / timeout (WAN SSH closed)
|
||||||
|
nc -vz -w5 77.42.120.136 443 # expect: open (public service intact)
|
||||||
|
```
|
||||||
|
And from ubongo over the mesh: `.venv/bin/ansible askari -i inventories/production/ -m ping` → `SUCCESS`.
|
||||||
|
|
||||||
|
- [ ] **Step 7: Reboot resilience check (optional but recommended)**
|
||||||
|
|
||||||
|
Reboot askari from the Hetzner console; after it comes back, confirm `ansible askari -m ping` succeeds over wt0 without intervention (proves `ip_nonlocal_bind` beat the post-boot bind race).
|
||||||
|
|
||||||
|
- [ ] **Step 8: Update STATUS + ROADMAP**
|
||||||
|
|
||||||
|
- In `STATUS.md`, update the askari row: SSH is now wt0-only; the host nftables default-deny is applied; the Hetzner WAN `:22` is retired. Move "host firewall + moving askari's SSH onto wt0" out of *Pending*.
|
||||||
|
- In `docs/ROADMAP.md`, mark mesh-hardening sub-project 1 (askari SSH→wt0) done; next is sub-project 2 (ubongo default-deny).
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add STATUS.md docs/ROADMAP.md
|
||||||
|
git commit -m "docs: askari SSH moved onto wt0 (mesh-hardening 1/3 done)
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 9: Push**
|
||||||
|
|
||||||
|
Run: `git push origin main`
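
The out-of-band `nc` probes in Step 6 can also be scripted, which helps when the check is repeated after a reboot. A minimal sketch using only the standard library; `tcp_port_open` is a hypothetical helper, and like `nc -z` it only reports whether a TCP connection can be established, not whether the service behind it is healthy.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Refused connections and timeouts both surface as OSError subclasses,
    so "closed" and "filtered" are reported the same way (False).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Run from an off-mesh host; the address is askari's WAN IP from the plan.
    wan_ip = "77.42.120.136"
    print("ssh  :22 open:", tcp_port_open(wan_ip, 22))
    print("https:443 open:", tcp_port_open(wan_ip, 443))
```

After the cutover the intended end state is `False` for port 22 and `True` for port 443 when run off-mesh.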
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-review (against the spec)
|
||||||
|
|
||||||
|
- **§ three layers** → Task 1 (sshd ListenAddress), Task 2 (nftables catalog; SSH-on-wt0 pre-existing via `base__firewall_mgmt_interface`), Task 4 (Hetzner WAN :22). ✓
|
||||||
|
- **§ boot-race fix** (`ip_nonlocal_bind` + fail-closed assert + live wt0 fact) → Task 1 Steps 4–6. ✓
|
||||||
|
- **§ new code/vars** (`base__ssh_listen_mesh_only`, `base__ssh_listen_addr`, host_vars/askari.yml, offsite flag, catalog, TF) → Tasks 1–4. ✓
|
||||||
|
- **§ staged cutover** → Task 5 Steps 1–6, with the firewall auto-rollback as the gate. ✓
|
||||||
|
- **§ testing** → Molecule render asserts (ListenAddress, sysctl, public-zone rule) + filter unit test + live out-of-band checks. The fail-closed assert is exercised by code; to spot-check it, temporarily blank `base__ssh_listen_addr` in the converge fixture and confirm `make test ROLE=base` fails on the assert, then revert (manual, not automated — a deliberate-failure Molecule scenario is non-idiomatic). ✓
|
||||||
|
- **§ risks/rollback** → auto-rollback timer (Task 5 Step 3), `ip_nonlocal_bind` (Task 1), Hetzner console break-glass, re-addable TF rule. ✓
|
||||||
|
- **IPv6 note** → recorded in the catalog comment (Task 2 Step 6); acceptable because askari has only an A record.
|
||||||
1179
docs/superpowers/plans/2026-06-18-local-vm-integration-testing.md
Normal file
File diff suppressed because it is too large
|
|
@ -0,0 +1,409 @@
|
||||||
|
# Mesh-hardening redesign (askari) — Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Harden askari's inbound surface with the proven ubongo INPUT-only default-deny pattern (SSH scoped by `iifname "wt0"` + a permanent WAN break-glass), and make the NetBird coordinator survive a no-egress startup — reboot-safe, no boot-race, no lockout.
|
||||||
|
|
||||||
|
**Architecture:** Mirror mesh-hardening 2/3 (ubongo): `base` firewall INPUT-only (`base__firewall_input_only: true`, forward stays `policy accept` so Docker forwarding/NAT survive), **no** sshd `ListenAddress` change (the firewall, not sshd, scopes `:22`). The coordinator-host exception: WAN `:22` stays open from ubongo's static WAN IP as the always-available non-mesh break-glass (the Hetzner console is the ultimate fallback). A `netbird_coordinator` change disables geolocation so a transient egress loss can't FATAL the control plane. Validate firewall reboot-safety on a throwaway VM (ADR-025 harness) GREEN before a supervised live cutover.
|
||||||
|
|
||||||
|
**Tech Stack:** Ansible (`base`, `netbird_coordinator` roles), nftables, Docker Compose, Molecule (Debian 13), the `scripts/integration-vm.py` ADR-025 harness, NetBird self-hosted `netbird-server:0.72.4`.
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md`
|
||||||
|
|
||||||
|
## Global Constraints
|
||||||
|
|
||||||
|
- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
|
||||||
|
- **No sshd `ListenAddress` change** — `base__ssh_listen_mesh_only` stays `false` everywhere here (this is what sidesteps the 2026-06-17 boot-race).
|
||||||
|
- **WAN `:22` is never closed** — no Terraform / Hetzner-Cloud-Firewall change in this plan.
|
||||||
|
- **`base__firewall_input_only: true` on askari** — the forward chain must stay `policy accept` (Docker host). Never apply a forward-`drop` firewall to askari.
|
||||||
|
- **ubongo's WAN IP is `91.226.145.80`** (operator-confirmed static 2026-06-19) — the break-glass anchor.
|
||||||
|
- **askari `wt0` IP is `100.99.226.39`**; askari domain `netbird.askari.wingu.me`.
|
||||||
|
- **Before any commit:** `rbw unlocked` must succeed (the pre-commit hook decrypts `vault.yml`); run `make lint` and it must be clean.
|
||||||
|
- **Tags:** import each role at play level with its role-name tag; only use concern tags from `tests/tags.yml`.
|
||||||
|
- **Harness GREEN before live** (Task 3 before Task 4). The live cutover (Task 4) is **operator-gated** — never run autonomously.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Disable geolocation in `netbird_coordinator` (FRICTION 2026-06-17 #4)
|
||||||
|
|
||||||
|
Make the control plane survive a startup with no container egress: NetBird's combined server downloads the GeoLite2 DB at boot and treats failure as FATAL. boma uses no geo posture (ACL is Allow-All), so disable geolocation entirely via the documented env var. TDD'd through the role's render-only Molecule scenario.
|
||||||
|
|
||||||
|
> verified: NetBird self-hosted geolocation knobs (`NB_DISABLE_GEOLOCATION`, `disableGeoliteUpdate`, GeoLite2 pre-seed) · WebFetch · docs.netbird.io/selfhosted/geo-support · 2026-06-19 — *from a docs summary; the live "healthy with egress blocked" check in Task 4 is the real gate, with a concrete pre-seed fallback there.*
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `roles/netbird_coordinator/defaults/main.yml` (add the knob)
|
||||||
|
- Modify: `roles/netbird_coordinator/templates/docker-compose.yml.j2:14-27` (add `environment:` to `netbird-server`)
|
||||||
|
- Test: `roles/netbird_coordinator/molecule/default/verify.yml:21-32` (assert the rendered compose)
|
||||||
|
- Modify: `roles/netbird_coordinator/README.md` (one line documenting the knob)
|
||||||
|
|
||||||
|
**Interfaces:**
|
||||||
|
- Produces: role default `netbird_coordinator__disable_geolocation` (bool, default `true`); rendered compose env `NB_DISABLE_GEOLOCATION: "true"` on the `netbird-server` service.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing Molecule assertion**
|
||||||
|
|
||||||
|
Append to `roles/netbird_coordinator/molecule/default/verify.yml` (after the existing compose-tags assert, inside the same `tasks:` list):
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- name: Assert geolocation is disabled (FRICTION 2026-06-17 #4 — no geo-DB download FATAL)
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- "'NB_DISABLE_GEOLOCATION: \"true\"' in (_compose.content | b64decode)"
|
||||||
|
fail_msg: >-
|
||||||
|
compose must set NB_DISABLE_GEOLOCATION=true so a no-egress startup can't FATAL
|
||||||
|
the coordinator on the GeoLite2 download
|
||||||
|
success_msg: "geolocation disabled in compose"
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run Molecule to verify it fails**
|
||||||
|
|
||||||
|
Run: `make test ROLE=netbird_coordinator`
|
||||||
|
Expected: FAIL at "Assert geolocation is disabled" — the rendered compose has no `NB_DISABLE_GEOLOCATION`.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add the default knob**
|
||||||
|
|
||||||
|
Add to `roles/netbird_coordinator/defaults/main.yml` (after line 7, the `__domain` line):
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
|
||||||
|
# Disable NetBird's GeoLite2 geolocation (download + lookups). boma uses no geo posture
|
||||||
|
# (ACL is Allow-All), and the combined server treats a failed GeoLite2 download as FATAL —
|
||||||
|
# so a transient egress loss (NAT wiped on `nft flush`, or the boot window before Docker
|
||||||
|
# re-adds NAT) would crash-loop the whole control plane (FRICTION 2026-06-17 #4). Disabling
|
||||||
|
# removes that dependency. Revisit if a future ACL sub-project wants geo-based posture.
|
||||||
|
netbird_coordinator__disable_geolocation: true
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Render the env in the compose template**
|
||||||
|
|
||||||
|
In `roles/netbird_coordinator/templates/docker-compose.yml.j2`, add an `environment:` block to the `netbird-server` service, immediately after its `command:` line (line 18):
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
environment:
|
||||||
|
# Disable geolocation so a no-egress startup can't FATAL the control plane
|
||||||
|
# (FRICTION 2026-06-17 #4). boma uses no geo posture (ACL Allow-All).
|
||||||
|
NB_DISABLE_GEOLOCATION: "{{ netbird_coordinator__disable_geolocation | string | lower }}"
|
||||||
|
```
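To see why the Molecule assertion looks for the exact string `NB_DISABLE_GEOLOCATION: "true"`, it helps to trace what `| string | lower` does to a boolean default. The sketch below mimics that filter chain in plain Python (it does not invoke Jinja2 or Ansible); `render_geo_env` and `compose_disables_geolocation` are hypothetical names used for illustration only.

```python
def render_geo_env(disable_geolocation: bool) -> str:
    """Mimic `{{ flag | string | lower }}` for a boolean role default."""
    # Stringifying a Python bool gives "True"/"False"; lower() folds the case,
    # which is the form the compose file needs.
    value = str(disable_geolocation).lower()
    return f'NB_DISABLE_GEOLOCATION: "{value}"'


def compose_disables_geolocation(compose_text: str) -> bool:
    """The same substring check the Molecule assertion applies to the rendered file."""
    return 'NB_DISABLE_GEOLOCATION: "true"' in compose_text


# Usage:
#   render_geo_env(True)   gives  NB_DISABLE_GEOLOCATION: "true"
#   render_geo_env(False)  gives  NB_DISABLE_GEOLOCATION: "false"
```

One consequence of the design: the chain only lowercases, so a string override such as `"yes"` would render as `"yes"` and fail the assertion. Keeping the variable a real boolean avoids that.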
|
||||||
|
|
||||||
|
- [ ] **Step 5: Run Molecule to verify it passes**
|
||||||
|
|
||||||
|
Run: `make test ROLE=netbird_coordinator`
|
||||||
|
Expected: PASS — all asserts green, including "geolocation disabled in compose"; Molecule idempotence clean.
|
||||||
|
|
||||||
|
- [ ] **Step 6: Document the knob**
|
||||||
|
|
||||||
|
Add one line to `roles/netbird_coordinator/README.md` under its variables/defaults section:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
- `netbird_coordinator__disable_geolocation` (default `true`) — sets `NB_DISABLE_GEOLOCATION` so a no-egress startup can't FATAL the server on the GeoLite2 download (FRICTION 2026-06-17 #4).
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 7: Lint and commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
rbw unlocked && make lint
|
||||||
|
git add roles/netbird_coordinator/defaults/main.yml \
|
||||||
|
roles/netbird_coordinator/templates/docker-compose.yml.j2 \
|
||||||
|
roles/netbird_coordinator/molecule/default/verify.yml \
|
||||||
|
roles/netbird_coordinator/README.md
|
||||||
|
git commit -m "feat(netbird_coordinator): disable geolocation so no-egress startup can't FATAL the control plane" \
|
||||||
|
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Enable askari's host firewall (INPUT-only) + WAN break-glass + manage over `wt0`
|
||||||
|
|
||||||
|
Flip askari from "firewall not applied" to the redesigned INPUT-only default-deny, add the permanent WAN break-glass source, and point Ansible at the mesh. Pure inventory change — validated by lint + inventory resolution (the firewall *behavior* is proven in Task 3).
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `inventories/production/group_vars/offsite_hosts/vars.yml` (replace the whole file body)
|
||||||
|
- Create: `inventories/production/host_vars/askari.yml`
|
||||||
|
|
||||||
|
**Interfaces:**
|
||||||
|
- Consumes: `base` knobs `base__firewall_apply`, `base__firewall_input_only`, `base__firewall_admin_addrs`, `base__ssh_listen_mesh_only`, `base__mesh_enabled` (all defined in `roles/base/defaults/main.yml`).
|
||||||
|
- Produces: askari resolves `ansible_host: 100.99.226.39`, `base__firewall_apply: true`, `base__firewall_input_only: true`, `base__firewall_admin_addrs: ["91.226.145.80"]`.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Rewrite the offsite group_vars**
|
||||||
|
|
||||||
|
Replace the body of `inventories/production/group_vars/offsite_hosts/vars.yml` with:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
---
|
||||||
|
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
|
||||||
|
# (ADR-016, M5).
|
||||||
|
#
|
||||||
|
# Mesh-hardening REDESIGN (2026-06-19): the 2026-06-17 attempt was backed out (forward
|
||||||
|
# `policy drop` broke Docker on reboot; wt0-only sshd left no break-glass; ip_nonlocal_bind
|
||||||
|
# did not beat the boot-race). The redesign mirrors the proven ubongo 2/3 pattern:
|
||||||
|
# - INPUT-only default-deny (base__firewall_input_only) — forward stays `policy accept`
|
||||||
|
# so Docker container forwarding/NAT survive a reboot;
|
||||||
|
# - SSH scoped by the host firewall (iifname wt0 + admin-addr), NOT a sshd ListenAddress
|
||||||
|
# change — base__ssh_listen_mesh_only stays false, so there is no boot-race;
|
||||||
|
# - WAN :22 is DELIBERATELY left open from ubongo's WAN IP (base__firewall_admin_addrs)
|
||||||
|
# as the permanent non-mesh break-glass — the coordinator-host exception (a host's only
|
||||||
|
# management path must never depend on a service that host itself hosts).
|
||||||
|
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
|
||||||
|
base__mesh_enabled: true
|
||||||
|
base__firewall_apply: true
|
||||||
|
base__firewall_input_only: true # forward stays `policy accept` → Docker-safe
|
||||||
|
base__ssh_listen_mesh_only: false # no sshd ListenAddress change → no boot-race
|
||||||
|
base__firewall_admin_addrs:
|
||||||
|
- 91.226.145.80 # ubongo's (static) WAN IP — the permanent non-mesh SSH break-glass
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Create the askari host_vars to manage over the mesh**
|
||||||
|
|
||||||
|
Create `inventories/production/host_vars/askari.yml`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
---
|
||||||
|
# Manage askari over the NetBird mesh (wt0). Overrides the TF-generated WAN `ansible_host`
|
||||||
|
# in offsite.yml (host_vars are NOT regenerated by tf_to_inventory.py). The WAN :22 path
|
||||||
|
# (Hetzner Cloud Firewall + base__firewall_admin_addrs = ubongo's WAN) stays as the
|
||||||
|
# break-glass; the Hetzner web console is the IP-independent ultimate fallback.
|
||||||
|
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
|
||||||
|
ansible_host: 100.99.226.39
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify the inventory resolves**
|
||||||
|
|
||||||
|
Run: `ansible-inventory -i inventories/production --host askari`
|
||||||
|
Expected: JSON shows `"ansible_host": "100.99.226.39"`, `"base__firewall_apply": true`, `"base__firewall_input_only": true`, and `"base__firewall_admin_addrs": ["91.226.145.80"]`.
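Reading the four values out of the JSON by eye is easy to get wrong, so the comparison can be scripted. A minimal sketch, assuming the `ansible-inventory --host` output is passed in as text; `EXPECTED` simply restates the values listed above, and `hostvar_mismatches` is a hypothetical helper, not an existing script in the repo.

```python
import json

# The values this step expects askari to resolve to.
EXPECTED = {
    "ansible_host": "100.99.226.39",
    "base__firewall_apply": True,
    "base__firewall_input_only": True,
    "base__firewall_admin_addrs": ["91.226.145.80"],
}


def hostvar_mismatches(inventory_json: str, expected: dict = EXPECTED) -> dict:
    """Return {key: (expected, actual)} for every variable that differs or is missing."""
    hostvars = json.loads(inventory_json)
    return {
        key: (want, hostvars.get(key))
        for key, want in expected.items()
        if hostvars.get(key) != want
    }


# Usage: feed it the stdout of
#   ansible-inventory -i inventories/production --host askari
# An empty dict means the inventory resolves as intended.
```

A missing key shows up with `None` as the actual value, which distinguishes "not set" from "set to the wrong thing".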
|
||||||
|
|
||||||
|
- [ ] **Step 4: Lint**
|
||||||
|
|
||||||
|
Run: `rbw unlocked && make lint`
|
||||||
|
Expected: clean (no yamllint/ansible-lint errors).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add inventories/production/group_vars/offsite_hosts/vars.yml \
|
||||||
|
inventories/production/host_vars/askari.yml
|
||||||
|
git commit -m "feat(inventory): askari INPUT-only firewall + WAN break-glass + manage over wt0" \
|
||||||
|
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Integration harness "askari_inputonly" profile — the reboot-safety GREEN gate
|
||||||
|
|
||||||
|
Prove on a throwaway VM (ADR-025) that the redesigned firewall is reboot-safe BEFORE touching the real host: INPUT default-deny + forward accept + the admin-addr break-glass + published-port DNAT all survive a reboot. New profile (keeps the existing `askari` profile, which validates the `docker_host` container-forward drop-in path, intact).
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `tests/integration/profiles/askari_inputonly.json`
|
||||||
|
- Create: `tests/integration/overrides/askari_inputonly.yml`
|
||||||
|
- Modify: `tests/integration/verify.yml` (allow-list + a new profile branch)
|
||||||
|
|
||||||
|
**Interfaces:**
|
||||||
|
- Consumes: the `scripts/integration-vm.py` harness; `make test-integration HOST=<profile>` maps `HOST` to `profiles/<HOST>.json` (a profile name, not a production inventory host).
|
||||||
|
- Produces: profile `askari_inputonly` with `integration_profile: askari_inputonly`.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the new profile to the verify allow-list and a failing branch**
|
||||||
|
|
||||||
|
In `tests/integration/verify.yml`, change the allow-list assert (line 14) from:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- integration_profile in ['askari', 'ubongo']
|
||||||
|
```
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- integration_profile in ['askari', 'askari_inputonly', 'ubongo']
|
||||||
|
```
|
||||||
|
|
||||||
|
and update its `fail_msg` (line 15) to `"integration_profile must be set in the profile overlay (askari|askari_inputonly|ubongo)"`. Then append this block to the `tasks:` list (after the ubongo block):
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# ── askari_inputonly profile — the mesh-hardening REDESIGN (2026-06-19) ──
|
||||||
|
# INPUT-only default-deny on a Docker host: input policy drop, forward policy ACCEPT
|
||||||
|
# (Docker-safe), SSH via the admin-addr break-glass, published-port DNAT survives reboot.
|
||||||
|
- name: (askari_inputonly) Read the live nftables ruleset
|
||||||
|
when: integration_profile == 'askari_inputonly'
|
||||||
|
ansible.builtin.command: nft list ruleset
|
||||||
|
register: _nft_io
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: (askari_inputonly) INPUT default-deny, forward permissive, admin-addr break-glass
|
||||||
|
when: integration_profile == 'askari_inputonly'
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- "'hook input priority filter; policy drop;' in _nft_io.stdout"
|
||||||
|
- "'hook forward priority filter; policy accept;' in _nft_io.stdout"
|
||||||
|
- "'ip saddr 192.168.150.1 tcp dport 22 accept' in _nft_io.stdout"
|
||||||
|
fail_msg: >-
|
||||||
|
askari_inputonly: expected input policy drop, forward policy accept (input-only),
|
||||||
|
and the admin-addr break-glass (192.168.150.1) SSH allow in the live ruleset.
|
||||||
|
|
||||||
|
- name: (askari_inputonly) Gather service facts
|
||||||
|
when: integration_profile == 'askari_inputonly'
|
||||||
|
ansible.builtin.service_facts:
|
||||||
|
|
||||||
|
- name: (askari_inputonly) Docker daemon is active
|
||||||
|
when: integration_profile == 'askari_inputonly'
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that: "ansible_facts.services['docker.service'].state == 'running'"
|
||||||
|
fail_msg: "docker.service is not running"
|
||||||
|
|
||||||
|
- name: (askari_inputonly) Published port answers from the controller (DNAT + forward alive)
|
||||||
|
when: integration_profile == 'askari_inputonly'
|
||||||
|
delegate_to: localhost
|
||||||
|
become: false
|
||||||
|
ansible.builtin.uri:
|
||||||
|
url: "http://{{ ansible_host }}/"
|
||||||
|
follow_redirects: none
|
||||||
|
status_code: [200, 301, 308, 404, 502, 503]
|
||||||
|
timeout: 10
|
||||||
|
register: _probe_io
|
||||||
|
retries: 5
|
||||||
|
delay: 6
|
||||||
|
until: _probe_io is succeeded
|
||||||
|
```
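The three ruleset assertions in the block above are plain substring checks against `nft list ruleset` output. The sketch below restates them in Python so the logic can be tried offline. `SAMPLE_RULESET` is a hand-written illustration, not captured output from a real host, and `missing_fragments` is a hypothetical helper.

```python
# Hand-written illustration of the shape the assertions expect; NOT real host output.
SAMPLE_RULESET = """\
table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        ip saddr 192.168.150.1 tcp dport 22 accept
    }
    chain forward {
        type filter hook forward priority filter; policy accept;
    }
}
"""


def missing_fragments(ruleset: str, admin_addr: str) -> list[str]:
    """Return the required fragments that do not appear in the ruleset text."""
    required = [
        "hook input priority filter; policy drop;",      # INPUT default-deny
        "hook forward priority filter; policy accept;",  # forward stays Docker-safe
        f"ip saddr {admin_addr} tcp dport 22 accept",     # admin-addr break-glass
    ]
    return [fragment for fragment in required if fragment not in ruleset]


# Usage:
#   missing_fragments(SAMPLE_RULESET, "192.168.150.1")  gives  []
```

Substring matching is sensitive to how `nft` prints a rule, so an assertion can fail on formatting alone even when the rule is semantically present.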
|
||||||
|
|
||||||
|
- [ ] **Step 2: Create the profile descriptor**
|
||||||
|
|
||||||
|
Create `tests/integration/profiles/askari_inputonly.json`:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"groups": ["offsite_hosts"],
|
||||||
|
"applies": [
|
||||||
|
{"playbook": "site.yml", "tags": ["base"]},
|
||||||
|
{"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
|
||||||
|
],
|
||||||
|
"extra_vars_files": ["overrides/askari_inputonly.yml"],
|
||||||
|
"mem_mib": 3072,
|
||||||
|
"vcpus": 2
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Create the overlay**
|
||||||
|
|
||||||
|
Create `tests/integration/overrides/askari_inputonly.yml`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
---
|
||||||
|
# Integration overlay (ADR-025) — the askari mesh-hardening REDESIGN (2026-06-19).
|
||||||
|
# Validates INPUT-only default-deny on a Docker host: input policy drop, forward policy
|
||||||
|
# accept (Docker-safe), SSH via the admin-addr break-glass, reboot-survivable.
|
||||||
|
integration_profile: askari_inputonly
|
||||||
|
base__firewall_apply: true
|
||||||
|
base__firewall_input_only: true
|
||||||
|
# No sshd ListenAddress change — never wt0-only in a throwaway VM.
|
||||||
|
base__ssh_listen_mesh_only: false
|
||||||
|
# Isolated VM: never touch the real mesh.
|
||||||
|
base__mesh_enabled: false
|
||||||
|
# The non-mesh SSH break-glass = the admin-addr path the real design uses. Point it at the
|
||||||
|
# VM's libvirt-NAT gateway (where the harness connects from), by source IP so it is
|
||||||
|
# interface-independent and the default-deny + reboot don't lock out the driver. This
|
||||||
|
# mirrors askari's real base__firewall_admin_addrs (ubongo's WAN) in the test topology.
|
||||||
|
base__firewall_admin_addrs:
|
||||||
|
- 192.168.150.1
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run the harness — the GREEN gate**
|
||||||
|
|
||||||
|
Run: `make test-integration HOST=askari_inputonly`
|
||||||
|
Expected: GREEN. The harness boots a VM, applies `base` (INPUT-only) + `docker_host` + `reverse_proxy`, **reboots**, re-SSHes (proving the admin-addr break-glass survives), then `verify.yml` asserts input `policy drop`, forward `policy accept`, the `192.168.150.1` SSH allow, Docker active, and the published `:80` answering. Clean up: `make test-integration-clean`.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
rbw unlocked && make lint
|
||||||
|
git add tests/integration/profiles/askari_inputonly.json \
|
||||||
|
tests/integration/overrides/askari_inputonly.yml \
|
||||||
|
tests/integration/verify.yml
|
||||||
|
git commit -m "test(integration): askari_inputonly profile — INPUT-only default-deny reboot gate" \
|
||||||
|
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Supervised live cutover + STATUS/ROADMAP update — ⚠️ OPERATOR-GATED
|
||||||
|
|
||||||
|
> **⚠️ DO NOT run this task autonomously.** It changes the live off-site host (lockout risk) and runs `make deploy`. An automated executor must STOP here and hand back to the operator. Preconditions: Tasks 1–3 committed and GREEN; `rbw unlocked`; the **Hetzner web console** open in a browser (the out-of-band ultimate break-glass); the operator present. The WAN `:22` break-glass is never removed, so a fallback path is open throughout (FRICTION 2026-06-17 #6).
|
||||||
|
|
||||||
|
**Files (Step 7 only):**
|
||||||
|
- Modify: `STATUS.md` (askari row), `docs/ROADMAP.md` (Next step)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Pre-check both paths are healthy**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
|
||||||
|
ansible askari -i inventories/production -m ping
|
||||||
|
curl -sI https://test.askari.wingu.me | head -1
|
||||||
|
curl -sI https://netbird.askari.wingu.me | head -1
|
||||||
|
```
|
||||||
|
Expected: wt0 SSH OK; ping `pong`; both curls `HTTP/2 200`.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Dry-run the converge (mandatory `check` before `deploy`)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
make check PLAYBOOK=site LIMIT=askari
|
||||||
|
```
|
||||||
|
Expected: changes limited to the `base` firewall (input-only ruleset, admin-addr) + the `netbird_coordinator` compose env (`NB_DISABLE_GEOLOCATION`). Review and show the output before proceeding.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Apply (operator present, console open, auto-rollback armed)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
make deploy PLAYBOOK=site LIMIT=askari
|
||||||
|
```
|
||||||
|
The `base` firewall concern arms the auto-rollback timer (`base__firewall_rollback_timeout: 45`) and reconnects over `wt0` — a bad ruleset reverts itself. Expected: converge OK; SSH-over-`wt0` stays up.
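This plan does not show how the role implements the timer, so the following is only a generic sketch of the confirm-or-revert pattern that an auto-rollback relies on: arm a deadline, apply the change, and revert unless a reconnect confirms in time. `RollbackGuard` and its methods are hypothetical; the injected clock keeps the example deterministic.

```python
import time
from typing import Callable


class RollbackGuard:
    """Revert a change unless it is confirmed before the deadline."""

    def __init__(self, rollback: Callable[[], None], timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rollback = rollback
        self._clock = clock
        self._deadline = clock() + timeout_s
        self.state = "armed"  # armed -> confirmed | rolled_back

    def poll(self) -> str:
        """Fire the rollback once the deadline has passed without confirmation."""
        if self.state == "armed" and self._clock() >= self._deadline:
            self._rollback()
            self.state = "rolled_back"
        return self.state

    def confirm(self) -> bool:
        """Call after reconnecting over the new ruleset; disarms the guard if in time."""
        self.poll()
        if self.state == "armed":
            self.state = "confirmed"
        return self.state == "confirmed"


# Usage (45 s matches base__firewall_rollback_timeout in this step):
#   guard = RollbackGuard(rollback=restore_previous_ruleset, timeout_s=45)
#   ... apply the new ruleset, reconnect over wt0 ...
#   guard.confirm()
# `restore_previous_ruleset` is a placeholder for whatever restores the old rules.
```

The point of the pattern is that a lockout needs no operator action: if the reconnect never happens, nothing calls `confirm()` and the old ruleset comes back by itself.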
|
||||||
|
|
||||||
|
- [ ] **Step 4: Rebuild NAT and confirm the coordinator is healthy with geo disabled**
|
||||||
|
|
||||||
|
`base`'s `flush ruleset` wipes Docker's NAT rules (FRICTION), so restart Docker to make it rebuild them, then confirm the control plane:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ssh sjat@100.99.226.39 'sudo systemctl restart docker'
|
||||||
|
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
|
||||||
|
ssh sjat@100.99.226.39 'docker logs --since 2m netbird-server 2>&1 | grep -iE "geo|fatal" || echo "no geo/fatal log lines"'
|
||||||
|
```
|
||||||
|
Expected: `netbird-server` + `netbird-dashboard` Up; no geo-DB FATAL.
|
||||||
|
|
||||||
|
> **Contingency (only if `netbird-server` still FATALs on geolocation):** `NB_DISABLE_GEOLOCATION` was not honored by the pinned image. Pre-seed the DB into the volume instead — `ssh sjat@100.99.226.39 'sudo curl -fSL -o /var/lib/docker/volumes/netbird_data/_data/GeoLite2-City_20260101.mmdb https://pkgs.netbird.io/geolite2/GeoLite2-City.mmdb && sudo docker restart netbird-server'` — and add `disableGeoliteUpdate: true` under `server:` in `config.yaml.j2` so it never re-downloads. Re-verify, then fold the working fix back into the role (amend Task 1).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Verify the new steady state (both SSH paths + services)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
|
||||||
|
# From ubongo: SSH to askari's WAN IP. ubongo's packets egress via OPNsense, SNAT'd to the
|
||||||
|
# WAN IP 91.226.145.80 — matching askari's admin-addr break-glass rule. (No BindAddress:
|
||||||
|
# ubongo does not hold 91.226.145.80; OPNsense does.)
|
||||||
|
ssh sjat@77.42.120.136 true && echo "WAN break-glass OK"
|
||||||
|
curl -sI https://test.askari.wingu.me | head -1
|
||||||
|
nc -vz -u -w5 77.42.120.136 3478   # UDP: "succeeded" only means no ICMP port-unreachable came back; it does not prove a STUN reply
|
||||||
|
```
|
||||||
|
Expected: both SSH paths succeed; cert valid; STUN reachable.
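A UDP port probe cannot confirm that a STUN server actually replied, so a stronger check is to send a real STUN Binding request and look for the matching success response. This sketch follows the standard 20-byte STUN header layout (message type, length, magic cookie `0x2112A442`, 96-bit transaction ID). The function names are hypothetical, and whether the NetBird server answers an attribute-less Binding request is an assumption to verify against the live host.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_binding_request(txid: bytes) -> bytes:
    """STUN header only: type, length 0 (no attributes), magic cookie, transaction ID."""
    if len(txid) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + txid


def is_binding_success(response: bytes, txid: bytes) -> bool:
    """True if the datagram is a Binding success response to our transaction."""
    if len(response) < 20:
        return False
    msg_type, _length, cookie = struct.unpack("!HHI", response[:8])
    return (msg_type == BINDING_SUCCESS
            and cookie == MAGIC_COOKIE
            and response[8:20] == txid)


def stun_answers(host: str, port: int = 3478, timeout: float = 5.0) -> bool:
    """Send one Binding request over UDP and wait for a matching success response."""
    txid = os.urandom(12)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_binding_request(txid), (host, port))
            response, _addr = sock.recvfrom(2048)
        except OSError:
            return False  # timeout, ICMP unreachable, or no route
    return is_binding_success(response, txid)


# Usage: stun_answers("77.42.120.136") is True only if the server sent a real reply.
```

Matching on the transaction ID is what separates a genuine answer from unrelated traffic arriving on the socket.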
|
||||||
|
|
||||||
|
- [ ] **Step 6: Reboot-resilience — the real test (console available)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ssh sjat@100.99.226.39 'sudo systemctl reboot'
|
||||||
|
# wait ~60s, then from ubongo — no manual intervention:
|
||||||
|
sleep 60; ssh sjat@100.99.226.39 'sudo nft list chain inet filter input | grep -E "policy drop|wt0|91\.226\.145\.80"'
|
||||||
|
curl -sI https://netbird.askari.wingu.me | head -1
|
||||||
|
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
|
||||||
|
```
|
||||||
|
Expected, unattended: input `policy drop` with the `wt0` + `91.226.145.80` allows; public cert valid; both containers Up; `wt0` SSH back. (If lost: recover via the Hetzner console — the firewall auto-rollback and the WAN break-glass should make that unnecessary.)
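A fixed `sleep 60` either wastes time or is too short if the host boots slowly. A bounded polling loop is a common alternative. The sketch below is a generic helper under that assumption; `wait_until` is a hypothetical name, and the clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool], timeout_s: float, interval_s: float = 5.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns True or timeout_s elapses; return the verdict."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage after the reboot, with an SSH reachability probe:
#   import subprocess
#   ssh_up = lambda: subprocess.run(
#       ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
#        "sjat@100.99.226.39", "true"]).returncode == 0
#   wait_until(ssh_up, timeout_s=300)
```

Returning `False` on timeout, rather than raising, leaves the decision to fall back to the Hetzner console with the operator.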
|
||||||
|
|
||||||
|
- [ ] **Step 7: Record reality in the ground-truth docs and commit**
|
||||||
|
|
||||||
|
Update `STATUS.md` (the askari row): firewall now **applied** — INPUT-only default-deny, SSH `wt0`-primary + permanent WAN break-glass (ubongo's WAN), managed over `wt0`, geolocation disabled, **reboot-validated**. Update `docs/ROADMAP.md` "Next step": mark the askari SSH→`wt0` redesign **DONE**; the next mesh-hardening sub-project is the **SPOF reduction** (askari relay single-point-of-failure) — confirmed by the `ubongo → askari` `Relayed` finding (2026-06-19).
|
||||||
|
|
||||||
|
```bash
|
||||||
|
rbw unlocked && make lint
|
||||||
|
git add STATUS.md docs/ROADMAP.md
|
||||||
|
git commit -m "docs(status): mesh-hardening redesign — askari INPUT-only + WAN break-glass applied + reboot-validated" \
|
||||||
|
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Notes / out of scope (carry to the SPOF sub-project)
|
||||||
|
|
||||||
|
- **SPOF reduction is the next sub-project** (operator decision 2026-06-19): `ubongo → askari` is currently `Relayed` through askari's own relay; if askari is down, relayed peers lose the mesh data plane. Its own spec.
|
||||||
|
- **NetBird ACL stays Allow-All** — any enrolled peer can reach askari `wt0:22` until a later sub-project.
|
||||||
|
- **Full forward-chain hardening** (`docker_host` container-forward drop-in over the `input_only` baseline) — a later tightening; the existing `askari` integration profile already covers that path.
|
||||||
|
- **Coordinator off-site backup** (FRICTION 2026-06-17 #5, ADR-022) — still pending; not in scope.
|
||||||
|
|
@ -0,0 +1,470 @@
|
||||||
|
# Mesh-hardening 2/3 — ubongo INPUT-only default-deny — Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Apply base's nftables firewall to the control node (ubongo) as an INPUT-only default-deny — hardening its inbound surface — while leaving the forward chain permissive so Docker egress and the libvirt-NAT integration harness keep working, and without any sshd `ListenAddress` change.
|
||||||
|
|
||||||
|
**Architecture:** Two new `base` knobs make the existing firewall concern fit a control node: `base__firewall_input_only` flips the forward chain to `policy accept` (host-local input filtering only), and `base__firewall_admin_addrs` adds operator-workstation LAN sources to the SSH allow-list (alongside `wt0` and `ssh-from-control`). sshd is untouched (nftables does the scoping → no `ip_nonlocal_bind` boot-race). The change is validated on a throwaway VM via the ADR-025 integration harness (a new "be ubongo" profile) before an operator-supervised live cutover whose safety net is the firewall auto-rollback timer plus the permanent on-prem physical console.
|
||||||
|
|
||||||
|
**Tech Stack:** Ansible (role `base`, FQCN), nftables, Jinja2, Molecule on Debian 13, pytest (none new), the ADR-025 integration harness (`scripts/integration-vm.py`, JSON profiles, `-e @` overlays).
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`
|
||||||
|
|
||||||
|
**Conventions:** `make lint` and `make test ROLE=base` before each commit; `make check` before `make deploy`; never hand-edit the generated `offsite.yml`; `rbw unlocked` for any commit touching Ansible content and for the integration/live applies (the production `group_vars/all/vault.yml` is in inventory scope and gets decrypted at playbook load). Tasks 1–3 are code (subagent-driven, each lint/Molecule-verified). Task 4 is a real-VM validation gate on ubongo. Task 5 is the live, operator-supervised cutover.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File Structure
|
||||||
|
|
||||||
|
| File | Create/Modify | Responsibility |
|
||||||
|
|---|---|---|
|
||||||
|
| `roles/base/defaults/main.yml` | Modify | Declare `base__firewall_input_only` + `base__firewall_admin_addrs` (defaults: off / empty). |
|
||||||
|
| `roles/base/templates/nftables.conf.j2` | Modify | Conditional forward policy; render an SSH-allow rule per admin address. |
|
||||||
|
| `roles/base/molecule/default/converge.yml` | Modify | Fixture: an admin-addr source (input-only stays at its default → forward drop). |
|
||||||
|
| `roles/base/molecule/default/verify.yml` | Modify | Assert forward-drop default + the admin-addr rule render. |
|
||||||
|
| `inventories/production/group_vars/control/vars.yml` | Modify | Turn the knobs on for ubongo (input-only; mamba's LAN IP). |
|
||||||
|
| `tests/integration/overrides/ubongo.yml` | Create | The "be ubongo" overlay (input-only firewall; harness SSH lifeline). |
|
||||||
|
| `tests/integration/profiles/ubongo.json` | Create | The "be ubongo" VM profile (group `control`, applies `site.yml:base`). |
|
||||||
|
| `tests/integration/overrides/askari.yml` | Modify | Add the `integration_profile` marker (verify is now profile-aware). |
|
||||||
|
| `tests/integration/verify.yml` | Modify | Gate the askari (Docker/DNAT) block; add the ubongo (input-only) block + a guard. |
|
||||||
|
| `STATUS.md`, `docs/ROADMAP.md` | Modify (Task 5) | Record mesh-hardening 2/3 done. |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: base role — `base__firewall_input_only` (forward policy) + `base__firewall_admin_addrs` (LAN SSH allow)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `roles/base/defaults/main.yml`
|
||||||
|
- Modify: `roles/base/templates/nftables.conf.j2`
|
||||||
|
- Modify: `roles/base/molecule/default/converge.yml`
|
||||||
|
- Modify: `roles/base/molecule/default/verify.yml`
|
||||||
|
|
||||||
|
> **Test strategy (note):** Molecule renders one fixture, so it locks the *secure default* —
|
||||||
|
> `input_only` **off** → forward `policy drop` — plus the new admin-addr rule (red→green). The
|
||||||
|
> `input_only` **on** → forward `policy accept` path is exercised on a real VM by the
|
||||||
|
> integration "be ubongo" profile (Tasks 3–4), whose verify fails red until this template
|
||||||
|
> conditional exists. Both branches are covered, across the two test layers.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing test (extend Molecule verify)**
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/verify.yml`, after the `Assert the docker_host extension hook is present` block, add:
|
||||||
|
|
||||||
|
```yaml
    - name: Assert the forward chain defaults to policy drop (input_only off)
      ansible.builtin.assert:
        that:
          - "'hook forward priority 0; policy drop;' in nft"
        fail_msg: >-
          forward chain must default to policy drop when base__firewall_input_only is
          false (container isolation stays the norm on real service hosts)

    - name: Assert the admin-addr SSH allow rule (operator workstation on the LAN)
      ansible.builtin.assert:
        that:
          - "'ip saddr 10.30.0.77 tcp dport 22 accept' in nft"
        fail_msg: "missing admin-addr SSH allow rule from base__firewall_admin_addrs"
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add the fixture that drives it (Molecule converge)**
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/converge.yml`, add to the `vars:` block (after the `base__firewall_control_addr` line):
|
||||||
|
|
||||||
|
```yaml
    base__firewall_admin_addrs:
      - "10.30.0.77"   # fixture: an operator-workstation LAN source (admin-addr SSH allow)
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Run the test to verify it fails**
|
||||||
|
|
||||||
|
Run: `make test ROLE=base`
|
||||||
|
Expected: FAIL on `Assert the admin-addr SSH allow rule` (the template does not consume `base__firewall_admin_addrs` yet, so the `ip saddr 10.30.0.77 …` rule is absent). The forward-drop assertion passes already (the template currently hardcodes `policy drop`).
|
||||||
|
|
||||||
|
- [ ] **Step 4: Add the defaults**
|
||||||
|
|
||||||
|
In `roles/base/defaults/main.yml`, after the `base__firewall_apply: true` line (end of the firewall behaviour block, currently line 13), add:
|
||||||
|
|
||||||
|
```yaml
base__firewall_input_only: false    # true → the forward chain is `policy accept` (host-local
                                    # INPUT filtering only). For hosts that forward/route
                                    # container or NAT traffic (the control node's Docker +
                                    # libvirt-NAT) where a forward default-deny would break
                                    # them. Real service hosts keep this false (forward drop).
base__firewall_admin_addrs: []      # extra LAN source IPs allowed to SSH, besides wt0 +
                                    # ssh-from-control. For an operator workstation reaching
                                    # the host over the LAN (no mesh). Key-gated. (ADR-021)
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Make the forward policy conditional + render the admin-addr rules**
|
||||||
|
|
||||||
|
In `roles/base/templates/nftables.conf.j2`:
|
||||||
|
|
||||||
|
(a) Replace the forward-chain line (currently line 21):
|
||||||
|
|
||||||
|
```jinja
|
||||||
|
chain forward { type filter hook forward priority 0; policy {{ 'accept' if base__firewall_input_only | bool else 'drop' }}; }
|
||||||
|
```
|
||||||
|
|
||||||
|
(b) After the `ssh-from-control` `{% endif %}` (currently line 14) and before the `ip protocol icmp accept` line, add the admin-addr loop:
|
||||||
|
|
||||||
|
```jinja
|
||||||
|
{% for addr in base__firewall_admin_addrs %}
|
||||||
|
ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept
|
||||||
|
{% endfor %}
|
||||||
|
```
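As a cross-check on what the two template changes should render, here is a minimal plain-Python model of the conditional and the loop. It is a sketch, not the template: the function names are hypothetical, and the default SSH port of 22 is an assumption inferred from the Molecule assertion (`tcp dport 22`), since the default of `base__firewall_ssh_port` is not shown in this plan.

```python
def forward_chain(input_only: bool) -> str:
    """Model of change (a): policy is accept only when input_only is true."""
    policy = "accept" if input_only else "drop"
    return f"chain forward {{ type filter hook forward priority 0; policy {policy}; }}"


def admin_rules(admin_addrs, ssh_port=22):
    """Model of change (b): one SSH allow rule per admin source address."""
    return [f"ip saddr {addr} tcp dport {ssh_port} accept" for addr in admin_addrs]


# Secure default (what Molecule locks): forward drop, one fixture admin address.
print(forward_chain(False))
print(admin_rules(["10.30.0.77"]))

# Input-only host (what the integration profile exercises): forward accept.
print(forward_chain(True))
```

An empty `base__firewall_admin_addrs` list produces no rules at all, which is why the new default (`[]`) leaves existing hosts' rulesets unchanged.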
|
||||||
|
|
||||||
|
- [ ] **Step 6: Run the test to verify it passes**
|
||||||
|
|
||||||
|
Run: `make test ROLE=base`
|
||||||
|
Expected: PASS — converge renders the ruleset; verify confirms the forward chain is `policy drop` (input_only defaults false) and the `ip saddr 10.30.0.77 tcp dport 22 accept` rule is present; all pre-existing assertions stay green.
|
||||||
|
|
||||||
|
- [ ] **Step 7: Lint**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.
|
||||||
|
|
||||||
|
- [ ] **Step 8: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
|
||||||
|
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
|
||||||
|
git commit -m "feat(base): input-only forward policy + admin-addr SSH allow
|
||||||
|
|
||||||
|
base__firewall_input_only renders the forward chain policy accept (host-local
|
||||||
|
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
|
||||||
|
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
|
||||||
|
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
|
||||||
|
ssh-from-control. Molecule locks the secure default + the admin rule.
|
||||||
|
Mesh-hardening 2/3 (ADR-020/021).
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: inventory — enable input-only default-deny + mamba on ubongo (control group)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `inventories/production/group_vars/control/vars.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Turn the knobs on for the control group**
|
||||||
|
|
||||||
|
Append to `inventories/production/group_vars/control/vars.yml`:
|
||||||
|
|
||||||
|
```yaml

# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0 (mesh), the
# ssh-from-control self-path (base__firewall_control_addr, group_vars/all = 10.20.10.151), or
# mamba on the LAN. Break-glass: the physical console. (base__firewall_apply defaults true.)
base__firewall_input_only: true
base__firewall_admin_addrs:
  - "10.20.10.50"   # mamba over the LAN (NetBird off). Raw DHCP lease — revisit with an
                    # OPNsense reservation when OPNsense-as-code lands; backstopped by wt0.
  - "10.20.10.17"   # 2nd operator workstation (MAC bc:0f:f3:c8:4a:8a). Raw lease — ditto.
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify the vars resolve for ubongo**
|
||||||
|
|
||||||
|
Run: `.venv/bin/ansible-inventory -i inventories/production/ --host ubongo 2>/dev/null | grep -E 'firewall_input_only|firewall_admin_addrs|10.20.10.(50|17)'`
|
||||||
|
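Expected: shows `"base__firewall_input_only": true` and the `"base__firewall_admin_addrs"` list. Because `ansible-inventory --host` pretty-prints its JSON, the list spans several lines: `"base__firewall_admin_addrs": [`, then `"10.20.10.50",` and `"10.20.10.17"` on their own lines.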
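If the multi-line grep output is awkward to eyeball, the same check can be done on the parsed JSON. The following is a minimal sketch under the assumption that the host vars arrive as the JSON text printed by `ansible-inventory --host ubongo`; the helper name `check_host_vars` is hypothetical and not part of the repo.

```python
import json

EXPECTED_ADMIN_ADDRS = ["10.20.10.50", "10.20.10.17"]


def check_host_vars(host_vars_json: str) -> list:
    """Return a list of problems; an empty list means both knobs resolved as planned."""
    host_vars = json.loads(host_vars_json)
    problems = []
    if host_vars.get("base__firewall_input_only") is not True:
        problems.append("base__firewall_input_only is not true")
    if host_vars.get("base__firewall_admin_addrs") != EXPECTED_ADMIN_ADDRS:
        problems.append("base__firewall_admin_addrs does not match the planned list")
    return problems


# Hypothetical, trimmed sample of the host-vars JSON (real output has many more keys).
sample = """
{
    "base__firewall_admin_addrs": [
        "10.20.10.50",
        "10.20.10.17"
    ],
    "base__firewall_input_only": true
}
"""
print(check_host_vars(sample) or "OK")
```

Comparing the parsed list rather than grepping for substrings also catches a wrong order or an extra entry, which the grep would let through.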
|
||||||
|
|
||||||
|
- [ ] **Step 3: Lint**
|
||||||
|
|
||||||
|
Run: `make lint`
|
||||||
|
Expected: clean pass (`check-tags: OK`).
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add inventories/production/group_vars/control/vars.yml
|
||||||
|
git commit -m "feat(inventory): ubongo gets INPUT-only host firewall + mamba LAN SSH
|
||||||
|
|
||||||
|
Enables base__firewall_input_only on the control group (forward chain stays
|
||||||
|
permissive so Docker egress + the integration-test libvirt NAT survive) and
|
||||||
|
allows the operator workstations' LAN IPs (mamba 10.20.10.50 + 10.20.10.17;
|
||||||
|
raw leases, backstopped by wt0). Mesh-hardening 2/3.
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: integration harness — "be ubongo" profile (overlay + profile + profile-aware verify)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `tests/integration/overrides/ubongo.yml`
|
||||||
|
- Create: `tests/integration/profiles/ubongo.json`
|
||||||
|
- Modify: `tests/integration/overrides/askari.yml`
|
||||||
|
- Modify: `tests/integration/verify.yml`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create the "be ubongo" overlay**
|
||||||
|
|
||||||
|
Create `tests/integration/overrides/ubongo.yml`:
|
||||||
|
|
||||||
|
```yaml
---
# Integration-test overlay for the "ubongo" profile (ADR-025). Passed via `-e @`.
# Exercises mesh-hardening 2/3: base's INPUT-only default-deny on the control node — input
# chain default-deny, forward chain left permissive (Docker/libvirt-NAT safe), no sshd
# ListenAddress change (so no boot-race).
integration_profile: ubongo
base__firewall_apply: true
base__firewall_input_only: true          # forward chain renders `policy accept`
base__firewall_admin_addrs:
  - "192.168.150.98"                     # two representative LAN sources — exercises the
  - "192.168.150.99"                     # admin-addr loop with a multi-entry list (like ubongo)
# Never wt0-only; never touch the real mesh from a throwaway VM.
base__ssh_listen_mesh_only: false
base__mesh_enabled: false
# Allow SSH from the libvirt-NAT gateway (where the driver/ansible connect from) so the
# default-deny apply + the reboot don't lock out the harness. By source IP (interface-
# independent). This is the harness's lifeline; the admin-addr above is only exercised.
base__firewall_control_addr: "192.168.150.1"
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Create the "be ubongo" VM profile**
|
||||||
|
|
||||||
|
Create `tests/integration/profiles/ubongo.json`:
|
||||||
|
|
||||||
|
```json
{
  "groups": ["control"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]}
  ],
  "extra_vars_files": ["overrides/ubongo.yml"],
  "mem_mib": 2048,
  "vcpus": 2
}
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Mark the askari overlay with its profile name**
|
||||||
|
|
||||||
|
In `tests/integration/overrides/askari.yml`, after the two header comment lines (before `base__firewall_apply: true`), add:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
integration_profile: askari
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Make `verify.yml` profile-aware (the test)**
|
||||||
|
|
||||||
|
Replace the entire contents of `tests/integration/verify.yml` with:
|
||||||
|
|
||||||
|
```yaml
---
# Integration verify (ADR-025). Outcome-based, profile-aware: the active profile is named by
# `integration_profile` (set in each profile's overlay). Each profile asserts its own success
# criteria; an unknown/unset profile fails loudly (never a silent pass).
- name: Verify the rebooted host
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: A known integration_profile must be set (no silent pass)
      ansible.builtin.assert:
        that:
          - integration_profile is defined
          - integration_profile in ['askari', 'ubongo']
        fail_msg: "integration_profile must be set in the profile overlay (askari|ubongo)"

    # ── askari profile — Docker host: published-port forwarding survives the reboot ──
    # The load-bearing check probes the VM's published :80 FROM the controller (ubongo) — if
    # base's forward-drop killed DNAT, this times out (the FRICTION 2026-06-17 #1 bug).
    - name: (askari) Gather service facts
      when: integration_profile == 'askari'
      ansible.builtin.service_facts:

    - name: (askari) Docker daemon is active
      when: integration_profile == 'askari'
      ansible.builtin.assert:
        that: "ansible_facts.services['docker.service'].state == 'running'"
        fail_msg: "docker.service is not running"

    - name: (askari) Forward chain permits container traffic (drop-in loaded)
      when: integration_profile == 'askari'
      ansible.builtin.command: nft list chain inet filter forward
      register: _fwd
      changed_when: false

    - name: (askari) Assert container forwarding is allowed (not pure drop)
      when: integration_profile == 'askari'
      ansible.builtin.assert:
        that: "'accept' in _fwd.stdout"
        fail_msg: >-
          forward chain is pure drop — container forwarding will die on reboot
          (FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.

    - name: (askari) Published port answers from the controller (DNAT + forward alive)
      when: integration_profile == 'askari'
      delegate_to: localhost
      become: false
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}/"
        follow_redirects: none
        status_code: [200, 301, 308, 404, 502, 503]
        timeout: 10
      register: _probe
      retries: 5
      delay: 6
      until: _probe is succeeded

    # ── ubongo profile — control node: INPUT-only default-deny survives the reboot ──
    # SSH reachability across the reboot is proven by the harness itself (it re-SSHes and
    # checks boot_id changed before this verify runs). Here we assert the ruleset shape.
    - name: (ubongo) Read the live nftables ruleset
      when: integration_profile == 'ubongo'
      # -y (--numeric-priority) prints the hook priority as `0`; without it, recent nft
      # versions print the name (`priority filter`) and the assertions below would not match.
      ansible.builtin.command: nft -y list ruleset
      register: _nft
      changed_when: false

    - name: (ubongo) INPUT default-deny, forward permissive, admin-addr allow
      when: integration_profile == 'ubongo'
      ansible.builtin.assert:
        that:
          - "'hook input priority 0; policy drop;' in _nft.stdout"
          - "'hook forward priority 0; policy accept;' in _nft.stdout"
          - "'ip saddr 192.168.150.98 tcp dport 22 accept' in _nft.stdout"
          - "'ip saddr 192.168.150.99 tcp dport 22 accept' in _nft.stdout"
        fail_msg: >-
          ubongo profile: expected input policy drop, forward policy accept (input-only),
          and both admin-addr (192.168.150.98/99) SSH allows in the live ruleset.
```
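One detail worth keeping in mind when matching live `nft list ruleset` output as text: depending on the nftables version and flags, a base chain's priority may be printed by name (`priority filter`) rather than numerically (`priority 0`). The sketch below shows a shape check that accepts either spelling. The helper names and the sample ruleset are hypothetical illustrations, not harness code.

```python
import re


def chain_policy(ruleset: str, hook: str):
    """Return the base-chain policy for `hook`, or None if no such chain is found.

    The priority may be spelled numerically (0) or by name (filter).
    """
    pattern = rf"hook {re.escape(hook)} priority (?:0|filter); policy (\w+);"
    match = re.search(pattern, ruleset)
    return match.group(1) if match else None


def has_admin_ssh_allow(ruleset: str, addr: str, port: int = 22) -> bool:
    """True if the exact admin-addr SSH allow rule appears in the ruleset text."""
    return f"ip saddr {addr} tcp dport {port} accept" in ruleset


# Hypothetical, trimmed sample of a live ruleset with named priorities.
SAMPLE_RULESET = """\
table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        ip saddr 192.168.150.98 tcp dport 22 accept
        ip saddr 192.168.150.99 tcp dport 22 accept
    }
    chain forward {
        type filter hook forward priority filter; policy accept;
    }
}
"""

print(chain_policy(SAMPLE_RULESET, "input"), chain_policy(SAMPLE_RULESET, "forward"))
```

The same four facts the ubongo block asserts (input drop, forward accept, both admin addresses) can be read off with these two helpers regardless of how the priority is printed.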
|
||||||
|
|
||||||
|
- [ ] **Step 5: Validate the JSON + lint**
|
||||||
|
|
||||||
|
Run: `.venv/bin/python -m json.tool tests/integration/profiles/ubongo.json >/dev/null && echo OK` then `make lint`
|
||||||
|
Expected: `OK`, then a clean lint pass (`check-tags: OK`).
|
||||||
|
|
||||||
|
- [ ] **Step 6: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add tests/integration/overrides/ubongo.yml tests/integration/profiles/ubongo.json \
|
||||||
|
tests/integration/overrides/askari.yml tests/integration/verify.yml
|
||||||
|
git commit -m "test(integration): add the 'be ubongo' profile (input-only default-deny)
|
||||||
|
|
||||||
|
A control-group VM that applies base with INPUT-only default-deny (forward
|
||||||
|
policy accept; admin-addr SSH allow). verify.yml is now profile-aware via an
|
||||||
|
integration_profile marker — the askari Docker/DNAT block is gated, and a ubongo
|
||||||
|
block asserts input drop + forward accept + the admin-addr rule. Enables
|
||||||
|
\`make test-integration HOST=ubongo\`. Mesh-hardening 2/3 (ADR-025).
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Validate on the integration harness (`make test-integration HOST=ubongo`) — the GREEN gate
|
||||||
|
|
||||||
|
> Runs a throwaway UEFI VM on ubongo: boots it, applies the base role with the ubongo
|
||||||
|
> overlay (INPUT-only default-deny), **reboots it**, and asserts the ruleset + SSH-returns.
|
||||||
|
> This proves the change survives a reboot before the real control node is ever touched
|
||||||
|
> (spec §cutover step 1; FRICTION signal-6). No code change / no commit — a validation gate.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Ensure the vault is unlocked**
|
||||||
|
|
||||||
|
The run loads `inventories/production/group_vars/all/vault.yml` (symlinked into the run dir), which is decrypted at playbook load.
|
||||||
|
|
||||||
|
Run: `rbw unlocked || rbw unlock`
|
||||||
|
Expected: exits 0 (unlocked). If it prompts, the operator unlocks.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run the integration cycle**
|
||||||
|
|
||||||
|
Run: `make test-integration HOST=ubongo`
|
||||||
|
Expected (the `cycle`: up → apply → reboot → assert): the VM gets a `192.168.150.x` lease; `site.yml --tags base` applies cleanly; `… rebooted (boot_id changed), SSH back at 192.168.150.x`; then `VERIFY PASSED for boma-it-ubongo-…`. The VM is destroyed on success.
|
||||||
|
|
||||||
|
- [ ] **Step 3: On failure, read the diagnostics**
|
||||||
|
|
||||||
|
If it prints `VERIFY FAILED`, diagnostics are in `~/integration-runs/boma-it-ubongo-<id>/` (`nft.txt`, `console.log`, `journal.txt`). The likely suspects: the admin-addr/forward assertion (Task 1/3 wiring) or SSH not returning post-reboot (the `base__firewall_control_addr: 192.168.150.1` lifeline in the overlay). Fix the implicated task, re-commit, and re-run Step 2. If a VM was left defined, run `make test-integration-clean` first.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Record the result**
|
||||||
|
|
||||||
|
Capture the `VERIFY PASSED` line in the task notes (this is the gate Task 5 step 1 depends on). No commit.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: Live staged cutover (operator-supervised — NOT a subagent task)
|
||||||
|
|
||||||
|
> Touches the **real ubongo** (the control node Ansible runs from) and reboots it — lockout-
|
||||||
|
> risky. Run it interactively with the operator, in order, verifying each step before the
|
||||||
|
> next. The firewall auto-rollback timer (`base__firewall_rollback_timeout`, 45 s) +
|
||||||
|
> `wait_for_connection` over the live path is the safety net; the **on-prem physical console**
|
||||||
|
> is the permanent break-glass. Do NOT hand this to an unattended agent.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Pre-checks (gate: Task 4 GREEN)**
|
||||||
|
|
||||||
|
- `rbw unlocked || rbw unlock`.
|
||||||
|
- SSH to ubongo over `wt0` from a road-warrior succeeds.
|
||||||
|
- SSH to ubongo from mamba on the LAN (`10.20.10.50`) succeeds.
|
||||||
|
- `.venv/bin/ansible ubongo -i inventories/production/ -m ping` → `SUCCESS` (over `10.20.10.151`).
|
||||||
|
- The physical console is reachable. If any path fails, STOP.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Dry-run the firewall apply**
|
||||||
|
|
||||||
|
Run: `make check PLAYBOOK=site LIMIT=ubongo TAGS=firewall`
|
||||||
|
Expected: the nftables diff shows `policy drop` on input, `iifname "wt0" … accept`, `ip saddr 10.20.10.151 … accept`, `ip saddr 10.20.10.50 … accept`, `ip saddr 10.20.10.17 … accept`, and the forward chain as `policy accept`. No errors.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Apply the host firewall (auto-rollback armed)**
|
||||||
|
|
||||||
|
Run: `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=firewall`
|
||||||
|
Expected: the firewall concern snapshots `/etc/nftables.rollback`, arms the 45 s `systemd-run` revert, applies the ruleset, `reset_connection` → `wait_for_connection` over `10.20.10.151` succeeds, then cancels the timer. If connectivity is lost, the timer reverts the ruleset within 45 s and the console is the fallback.
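The arm, apply, confirm, cancel sequence above is a general "commit-confirm" pattern. As an illustration only, here is a minimal in-process Python model of it. The real mechanism is a `systemd-run` timer on the host that fires independently of the controller's connection; the function name and callbacks below are hypothetical.

```python
import threading


def apply_with_auto_rollback(apply, confirm, rollback, timeout_s):
    """Arm a revert timer, apply, and cancel the timer only if confirm() succeeds.

    Returns (confirmed, timer) so the caller can join() the timer afterwards.
    """
    timer = threading.Timer(timeout_s, rollback)
    timer.start()                 # arm the revert before touching anything
    apply()                       # apply the new ruleset
    confirmed = bool(confirm())   # e.g. "can the controller still reach the host?"
    if confirmed:
        timer.cancel()            # connectivity proven: disarm the revert
    return confirmed, timer


state = {"rules": "old"}

# Happy path: connectivity is confirmed, so the new rules stay.
ok, timer = apply_with_auto_rollback(
    lambda: state.update(rules="new"), lambda: True,
    lambda: state.update(rules="old"), timeout_s=0.2)
timer.join()
print(ok, state["rules"])

# Lockout path: no confirmation, so the timer fires and reverts.
ok, timer = apply_with_auto_rollback(
    lambda: state.update(rules="new"), lambda: False,
    lambda: state.update(rules="old"), timeout_s=0.2)
timer.join()
print(ok, state["rules"])
```

The key design point is ordering: the revert is armed before the apply, so a lockout needs no further action from the controller to recover.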
|
||||||
|
|
||||||
|
- [ ] **Step 4: Verify every path + forwarding still works**
|
||||||
|
|
||||||
|
```bash
# from a road-warrior over wt0, and from mamba on the LAN:
ssh sjat@100.99.146.14 true && echo "wt0 OK"
ssh sjat@10.20.10.151 true && echo "mamba-LAN OK"   # run from mamba (10.20.10.50)
# Ansible self-path:
.venv/bin/ansible ubongo -i inventories/production/ -m ping
# a disallowed LAN host (any source NOT in base__firewall_admin_addrs; 10.20.10.17 is on the
# allow-list from Task 2, so do not use it for this check) must now be refused/timeout on :22
# Docker egress (forward chain still permissive):
docker run --rm busybox wget -qO- https://cloudflare.com/cdn-cgi/trace | head -1
# libvirt-NAT forwarding intact — a fresh integration VM still reaches apt:
make test-integration HOST=ubongo   # expect VERIFY PASSED (proves the NAT path survived)
```
|
||||||
|
Expected: `wt0 OK`, `mamba-LAN OK`, Ansible `SUCCESS`, the disallowed host refused, the Docker egress line returns, and the integration cycle passes.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Reboot resilience — while the console is present (FRICTION signal-6)**
|
||||||
|
|
||||||
|
With the operator at the physical console, reboot ubongo (`sudo systemctl reboot`). After it returns, confirm SSH comes back on all paths **unaided**:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ssh sjat@100.99.146.14 true && echo "wt0 OK after reboot"
|
||||||
|
.venv/bin/ansible ubongo -i inventories/production/ -m ping
|
||||||
|
```
|
||||||
|
Expected: SSH returns with no manual intervention (no `ListenAddress`, so nothing to race). Only now is the cutover complete.
|
||||||
|
|
||||||
|
- [ ] **Step 6: Update STATUS + ROADMAP**
|
||||||
|
|
||||||
|
- In `STATUS.md`: in the `roles/base/` row of "Scaffolded but empty", change the firewall note: the `firewall` concern is now **applied to ubongo** as INPUT-only default-deny (it is no longer "not yet applied to any host"); note the `base__firewall_input_only` knob and that the forward default-deny still awaits the `docker_host` drop-in for real service hosts. In the ubongo control-node row, mark the "Pending" default-deny item as done.
|
||||||
|
- In `docs/ROADMAP.md`: mark **mesh-hardening sub-project 2 (ubongo default-deny) done**; the remaining follow-ons are sub-project 1 (askari SSH→`wt0` *redesign*) and sub-project 3 (NetBird ACL). Update the "Next step" section accordingly.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add STATUS.md docs/ROADMAP.md
|
||||||
|
git commit -m "docs: ubongo INPUT-only default-deny applied (mesh-hardening 2/3 done)
|
||||||
|
|
||||||
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 7: Push**
|
||||||
|
|
||||||
|
Run: `git push origin main`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-review (against the spec)
|
||||||
|
|
||||||
|
- **§ Design — INPUT-only default-deny** → Task 1 (forward-policy knob) + Task 2 (enabled on ubongo). ✓
|
||||||
|
- **§ Design — admin-addrs (operator workstations on LAN)** → Task 1 (`base__firewall_admin_addrs` + template loop) + Task 2 (`10.20.10.50` mamba, `10.20.10.17`). ✓
|
||||||
|
- **§ Design — no sshd ListenAddress change** → nothing touches `ssh.yml`/`sshd_hardening.conf.j2`; only nftables. ✓ (verified: Tasks 1–3 file lists exclude them).
|
||||||
|
- **§ allow-list** (lo, established, wt0, ssh-from-control, admin-addr, icmp; forward accept) → template already renders lo/established/wt0/control/icmp; Task 1 adds admin-addr + forward-accept. ✓
|
||||||
|
- **§ Why-safe (incident signals 1/2/3/6)** → signal 1 (forward accept, Task 1); signal 2 (no ListenAddress); signal 3 (ubongo keeps LAN + console); signal 6 (Task 4 harness reboot + Task 5 step 5 reboot-while-console). ✓
|
||||||
|
- **§ New & changed code** (defaults, template, molecule, group_vars/control, integration profile) → Tasks 1–3. ✓
|
||||||
|
- **§ admin raw-leases + revisit** → Task 2 comments record both leases + the OPNsense-reservation revisit trigger; backstop (wt0) noted; flagged in `FRICTION.md`. ✓
|
||||||
|
- **§ Testing** (Molecule render asserts; `make test-integration HOST=ubongo`; live checks) → Task 1 (Molecule), Task 4 (harness), Task 5 step 4 (live). ✓ Coverage split (default in Molecule, input_only on the VM) noted in Task 1.
|
||||||
|
- **§ Staged cutover (signal-6 order)** → Task 5 steps 1–7; reboot-recovery (step 5) precedes nothing that retires a break-glass (the console is permanent). ✓
|
||||||
|
- **§ Risks/rollback** → auto-rollback (Task 5 step 3), redundant paths + physical console, raw-lease backstop. ✓
|
||||||
|
- **Type/name consistency:** `base__firewall_input_only` (bool) and `base__firewall_admin_addrs` (list) are spelled identically in defaults, template, converge, group_vars, and the overlay. `integration_profile` is spelled identically in both overlays and the three gates in `verify.yml`. ✓
|
||||||
|
- **Placeholder scan:** no TBD/TODO; every code/command step shows the actual content. ✓
|
||||||
237
docs/superpowers/plans/2026-06-20-mesh-spof-accept-resilience.md
Normal file
|
|
@ -0,0 +1,237 @@
|
||||||
|
# Mesh SPOF — accept + targeted resilience — Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Accept askari's single-coordinator SPOF as a documented availability trade-off, and harden the one real gap — a `base` mesh knob that pins the coordinator FQDN in `/etc/hosts` on managed mesh hosts so a local-DNS hiccup can't strand the mesh.
|
||||||
|
|
||||||
|
**Architecture:** One additive, idempotent `base` `mesh`-concern task (a `/etc/hosts` line via `lineinfile`, gated on a new opt-in knob), Molecule-tested; plus documentation (accepted-risk R8 + an ADR-016 availability amendment + STATUS/ROADMAP). No new infra, no Terraform, no live-deploy gate.
|
||||||
|
|
||||||
|
**Tech Stack:** Ansible (`base` role, `lineinfile`), Molecule (Debian 13), Markdown docs.
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md`
|
||||||
|
|
||||||
|
## Global Constraints
|
||||||
|
|
||||||
|
- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
|
||||||
|
- **No new collection**: derive the coordinator FQDN with the builtin `regex_replace` filter. (`urlsplit` is also part of `ansible.builtin`, so it would not pull in `community.general` either; this plan uses `regex_replace`.)
|
||||||
|
- The pin is **opt-in and additive**: gated on `base__mesh_enabled | bool` AND `base__mesh_coordinator_pin | length > 0`. Empty knob (the default) = a clean no-op. The coordinator host (`askari`/`offsite_hosts`) is **exempt** — leave its pin empty.
|
||||||
|
- **askari's coordinator IP = `77.42.120.136`** (stable WAN; the A record for `netbird.askari.wingu.me`); ubongo is in the `control` group.
|
||||||
|
- `make lint` clean + `rbw unlocked` before any commit (the pre-commit hook decrypts the vault).
|
||||||
|
- **No new infra** — no P2P, no second relay/coordinator, no Terraform. The coordinator off-site backup is **out of scope** (ADR-022 kickoff).
|
||||||
|
- Tags: the new task carries the `mesh` concern tag (it belongs to the mesh concern).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: `base` mesh coordinator-FQDN `/etc/hosts` pin (DNS-resilience)
|
||||||
|
|
||||||
|
Add an opt-in knob that pins the coordinator FQDN (derived from `base__mesh_management_url`) to a stable IP in `/etc/hosts`, so a managed mesh host survives a local-DNS failure. TDD'd through the role's Molecule scenario (which already exercises the `mesh` concern with `manage: false`).
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `roles/base/defaults/main.yml` (add the knob after the mesh block, ~line 53)
|
||||||
|
- Modify: `roles/base/tasks/mesh.yml` (append the pin task)
|
||||||
|
- Modify: `roles/base/molecule/default/converge.yml` (add a fixture pin to the vars block)
|
||||||
|
- Modify: `roles/base/molecule/default/verify.yml` (assert the rendered `/etc/hosts` line)
|
||||||
|
- Modify: `inventories/production/group_vars/control/vars.yml` (set the pin for ubongo)
|
||||||
|
|
||||||
|
**Interfaces:**
|
||||||
|
- Produces: role default `base__mesh_coordinator_pin` (string, default `""`); when set + `base__mesh_enabled`, an `/etc/hosts` line `<pin-ip> <fqdn>` where `<fqdn>` is `base__mesh_management_url` minus scheme/port/path.
|
||||||
|
|
||||||
|
- [ ] **Step 1: Write the failing Molecule test (fixture + assertion)**
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/converge.yml`, add one line to the `vars:` block (after `base__mesh_setup_key`, ~line 15):
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
base__mesh_coordinator_pin: "203.0.113.9" # fixture coordinator IP (TEST-NET-3); pins the FQDN from base__mesh_management_url
|
||||||
|
```
|
||||||
|
|
||||||
|
In `roles/base/molecule/default/verify.yml`, append to the `tasks:` list (after the mesh no-op assertion at the end):
|
||||||
|
|
||||||
|
```yaml
    - name: Read /etc/hosts (coordinator pin)
      ansible.builtin.slurp:
        src: /etc/hosts
      register: _etchosts
    - name: Assert the coordinator FQDN is pinned to the fixture IP (DNS-resilience / R8)
      ansible.builtin.assert:
        that:
          - "'203.0.113.9 netbird.askari.wingu.me' in (_etchosts.content | b64decode)"
        fail_msg: "base__mesh_coordinator_pin did not render the /etc/hosts coordinator pin"
        success_msg: "coordinator FQDN pinned in /etc/hosts"
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run Molecule to verify it fails**
|
||||||
|
|
||||||
|
Run: `make test ROLE=base`
|
||||||
|
Expected: FAIL at "Assert the coordinator FQDN is pinned…" — no pin task exists yet, so `/etc/hosts` has no such line.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add the default knob**
|
||||||
|
|
||||||
|
In `roles/base/defaults/main.yml`, after `base__mesh_version` (~line 53), add:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
|
||||||
|
# DNS-resilience (ADR-016 availability / accepted-risk R8): when set to the coordinator's
|
||||||
|
# stable IP, pin the coordinator FQDN (derived from base__mesh_management_url) in /etc/hosts
|
||||||
|
# so a managed mesh host survives a local-DNS hiccup (the 2026-06-18 incident class). Empty
|
||||||
|
# = no pin. The coordinator host itself (askari/offsite_hosts) is exempt — leave it empty.
|
||||||
|
base__mesh_coordinator_pin: ""
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Add the pin task**
|
||||||
|
|
||||||
|
Append to `roles/base/tasks/mesh.yml`:
|
||||||
|
|
||||||
|
```yaml

- name: Pin the NetBird coordinator FQDN in /etc/hosts (DNS-resilience, ADR-016 availability / R8)
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: '\s{{ _coordinator_fqdn | regex_escape }}$'
    line: "{{ base__mesh_coordinator_pin }} {{ _coordinator_fqdn }}"
    state: present
  vars:
    _coordinator_fqdn: "{{ base__mesh_management_url | regex_replace('^https?://', '') | regex_replace('[:/].*', '') }}"
  when:
    - base__mesh_enabled | bool
    - base__mesh_coordinator_pin | length > 0
  tags: [mesh]
```
|
||||||
|
|
||||||
|
(`_coordinator_fqdn` strips the scheme then anything from the first `:`/`/` → `netbird.askari.wingu.me`. The `regexp` matches an existing ` <fqdn>` at line end so a changed IP updates in place — idempotent; absent → appended.)
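To sanity-check the derivation outside Ansible, the two chained `regex_replace` filters and the `lineinfile` `regexp` can be mirrored with Python's `re` module, which is what those filters use underneath. This is a minimal sketch: the helper names are hypothetical, and the URL shapes are assumptions, since the actual value of `base__mesh_management_url` is not shown in this plan.

```python
import re


def coordinator_fqdn(management_url: str) -> str:
    """Mirror the two chained regex_replace filters from the pin task."""
    without_scheme = re.sub(r"^https?://", "", management_url)
    # Drop everything from the first ':' or '/' onward (port and/or path).
    return re.sub(r"[:/].*", "", without_scheme)


def pin_line_matches(existing_line: str, fqdn: str) -> bool:
    """Mirror the lineinfile regexp: whitespace, then the escaped FQDN at end of line."""
    return re.search(r"\s" + re.escape(fqdn) + r"$", existing_line) is not None


# Assumed URL shapes: bare host, host with port, host with port and path.
for url in (
    "https://netbird.askari.wingu.me",
    "https://netbird.askari.wingu.me:443",
    "https://netbird.askari.wingu.me:443/api",
):
    print(url, "->", coordinator_fqdn(url))

# An existing pin with a different IP still matches, so lineinfile rewrites it in place.
print(pin_line_matches("203.0.113.9 netbird.askari.wingu.me", "netbird.askari.wingu.me"))
```

Escaping the FQDN matters: without `regex_escape`, the dots would match any character, and a lookalike hostname at the end of a line could be overwritten by the pin.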
|
||||||
|
|
||||||
|
- [ ] **Step 5: Run Molecule to verify it passes**
|
||||||
|
|
||||||
|
Run: `make test ROLE=base`
|
||||||
|
Expected: PASS — the new assertion is green and Molecule idempotence is clean (re-running the pin task reports `ok`, not `changed`). The idempotence pass is what proves the `regexp` matches the line it wrote.
|
||||||
|
|
||||||
|
> Note: the empty-knob no-op (the production default for non-mesh / coordinator hosts) is guaranteed by the `when: base__mesh_coordinator_pin | length > 0` gate, not a separate Molecule case — a single converge can't hold both var-states, and boma uses one default scenario per role. The fixture exercises the meaningful path (rendering + FQDN extraction + idempotence).
|
||||||
|
|
||||||
|
- [ ] **Step 6: Wire the production pin for ubongo**
|
||||||
|
|
||||||
|
In `inventories/production/group_vars/control/vars.yml`, after the `base__mesh_enabled: true` block, add:
|
||||||
|
|
||||||
|
```yaml
# DNS-resilience (ADR-016 availability / R8): pin the coordinator FQDN to askari's stable WAN
# IP in /etc/hosts so a local-DNS hiccup (the 2026-06-18 incident class) can't strand ubongo's
# mesh. askari (offsite_hosts) is exempt — it reaches the coordinator locally.
base__mesh_coordinator_pin: "77.42.120.136"
```
|
||||||
|
|
||||||
|
- [ ] **Step 7: Lint and commit**
|
||||||
|
|
||||||
|
```bash
rbw unlocked && make lint
git add roles/base/defaults/main.yml roles/base/tasks/mesh.yml \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
  inventories/production/group_vars/control/vars.yml
git commit -m "feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)" \
  -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Accept + document the SPOF (R8, ADR-016 amendment, STATUS/ROADMAP)
|
||||||
|
|
||||||
|
Record the single-coordinator SPOF as a conscious, revisitable trade-off and capture the availability analysis + recovery. Pure documentation; references the pin from Task 1.
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `docs/security/accepted-risks.md` (add row R8; bump the review date)
|
||||||
|
- Modify: `docs/decisions/016-mesh-vpn.md` (add the availability amendment subsection)
|
||||||
|
- Modify: `STATUS.md` (note the SPOF accepted + the coordinator-pin knob)
|
||||||
|
- Modify: `docs/ROADMAP.md` (mark sub-project 3 addressed; surface ADR-022 backup + ACL as next)
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add accepted-risk R8**
|
||||||
|
|
||||||
|
In `docs/security/accepted-risks.md`, add this row to the table after R7:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access** — `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`) — LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
|
||||||
|
```
|
||||||
|
|
||||||
|
Then update the closing line's date: change `_Last reviewed: 2026-06-18.` to `_Last reviewed: 2026-06-20.`
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add the ADR-016 availability amendment**
|
||||||
|
|
||||||
|
In `docs/decisions/016-mesh-vpn.md`, add this subsection immediately before the `## Related` section:
|
||||||
|
|
||||||
|
```markdown
## Availability — an `askari` outage (amendment 2026-06-20)

The coordinator is deliberately **single** (one off-site host). Recorded here so its
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).

The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
radius**:

| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |

Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
operations, above).

**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
hosts get the same pin via `base__mesh_coordinator_pin`.

**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
default-deny posture; only helps established sessions), a second relay (needs another public
host / reintroduces the home public surface), a second coordinator (unsupported by
self-hosted NetBird; against this ADR).
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Update STATUS.md**
|
||||||
|
|
||||||
|
In `STATUS.md`, in the `roles/base/` row, append this sentence to the end of the firewall/mesh description (before the closing ` |`); it notes the pin and the accepted SPOF:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`, set for ubongo) so a local-DNS hiccup can't strand the mesh; the single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment).
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Update ROADMAP.md**
|
||||||
|
|
||||||
|
In `docs/ROADMAP.md`, in the "Remaining mesh-hardening sub-projects" list, change item 3 from the SPOF-reduction "(next)" wording to **DONE**, and make the NetBird ACL the next item. Replace the current items 3–4 block with:
|
||||||
|
|
||||||
|
```markdown
3. ~~**askari relay-SPOF reduction**~~ → **DONE (2026-06-20)** — assessed + **accepted** as a
   documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
   narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
   second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
   DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
   BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
```
|
||||||
|
|
||||||
|
- [ ] **Step 5: Consistency check + commit**
|
||||||
|
|
||||||
|
```bash
grep -q "^| R8 " docs/security/accepted-risks.md && \
  grep -q "Availability — an .askari. outage" docs/decisions/016-mesh-vpn.md && \
  echo "docs OK"
```
|
||||||
|
Expected: `docs OK`.
|
||||||
|
|
||||||
|
```bash
rbw unlocked
git add docs/security/accepted-risks.md docs/decisions/016-mesh-vpn.md STATUS.md docs/ROADMAP.md
git commit -m "docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment" \
  -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Notes / out of scope
|
||||||
|
|
||||||
|
- **Coordinator off-site backup → ADR-022 kickoff** (next sub-project). Not built here.
|
||||||
|
- **Direct P2P / second relay / second coordinator** — deliberately not pursued (spec §Design).
|
||||||
|
- No live deploy is required to land this — the pin is additive/idempotent and applies to ubongo on the next routine `base` apply (`make deploy PLAYBOOK=site LIMIT=ubongo`, operator's discretion). Optional post-deploy spot-check: `getent hosts netbird.askari.wingu.me` on ubongo resolves to `77.42.120.136`.
|
||||||
|
|
@ -0,0 +1,212 @@
|
||||||
|
# Design — Logging and log integrity (ship all logs to Loki)
|
||||||
|
|
||||||
|
- **Date:** 2026-06-05
|
||||||
|
- **Status:** Approved design — pending implementation plan
|
||||||
|
- **Resolves:** TODO 3.1 ("Decide how to manage logs"); makes concrete ADR-002's
|
||||||
|
"logs shipped to a central location" + "active alerting" controls; advances TODO 3.6
|
||||||
|
- **Becomes:** ADR-018 (this design is the basis for that ADR)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
boma wants **all logs in one queryable store** for three things: day-to-day
|
||||||
|
troubleshooting, spotting issues/trends over time, and **detecting intrusions /
|
||||||
|
malicious activity**. ADR-002 already commits in principle ("`auditd`… Logs shipped to
|
||||||
|
a central location if a log aggregation service is available"; "Active alerting wires
|
||||||
|
AIDE/`auditd`/`fail2ban`/Suricata into the monitoring/alerting stack… ties to the
|
||||||
|
Loki/Grafana effort"), and CAPABILITIES lists Loki (planned) + `askari` as the off-site
|
||||||
|
watchdog. What's undecided is the **architecture** and, critically, the **integrity**
|
||||||
|
dimension: an attacker who roots a host will try to clear logs to cover their tracks.
|
||||||
|
|
||||||
|
The key insight that frames the integrity question: **the biggest anti-tampering win is
|
||||||
|
that logs leave the host in near-real-time.** Once a line is in a store the attacker
|
||||||
|
doesn't control, wiping the local copy is futile. The remaining question is only *how
|
||||||
|
far* to harden the central store — set by the threat model.
|
||||||
|
|
||||||
|
## Decisions (the settled forks)
|
||||||
|
|
||||||
|
1. **Threat model — opportunistic + blast-radius**, per ADR-002 / accepted-risk R1.
|
||||||
|
Not forensic-grade. This sizes everything below.
|
||||||
|
2. **Ship all logs to an on-cluster Loki** — the single monitoring DB for
|
||||||
|
troubleshooting + trends. Near-real-time shipping already defeats per-host
|
||||||
|
track-covering.
|
||||||
|
3. **Split: a security-relevant subset ALSO ships off-site to `askari`, write-only.**
|
||||||
|
Tamper-resistant against full-cluster compromise, at bounded volume.
|
||||||
|
4. **Skip WORM/object-lock (Tier 3)** — recorded as accepted-risk R4; append-only push
|
||||||
|
+ off-site is the proportionate control.
|
||||||
|
5. **Disk-wear is a managed design parameter, not a blocker** — storage media choice +
|
||||||
|
bounded verbosity + tuned retention + wearout monitoring (Section: Retention & wear).
|
||||||
|
|
||||||
|
## Architecture & components
|
||||||
|
|
||||||
|
**Agent — Grafana Alloy on every host, installed by the `base` role.** Alloy reads
|
||||||
|
journald + container logs + the security sources (`auditd`, `authpriv`, `fail2ban`,
|
||||||
|
AIDE) on every host (docker_hosts, proxmox nodes, `ubongo`, `askari`) and ships them.
|
||||||
|
Placing it in `base` ties it to ADR-002's baseline "logs shipped to central" control.
|
||||||
|
|
||||||
|
**Two Loki instances, one Grafana:**
|
||||||
|
|
||||||
|
```
┌──────────────────── per host (base role) ─────────────────────┐
│ Grafana Alloy: collect journald + container + auditd/auth/... │
└──────────┬───────────────────────────────────┬────────────────┘
  ALL logs │                   security subset │ (over the NetBird mesh)
           ▼                                   ▼
┌────────────────────────┐      ┌──────────────────────────────┐
│ Loki (cluster) all logs│      │ Loki (askari) security only  │
│ docker_host, NVMe,     │      │ off-site, write-only push,   │
│ bounded hot retention  │      │ long retention, append-only  │
└───────────┬────────────┘      └──────────────┬───────────────┘
            └───────────────┬──────────────────┘
                            ▼
         ┌─────────────────────────────────────┐
         │ Grafana (cluster): both datasources │
         │ dashboards + alerts (AIDE/auditd/   │
         │ fail2ban/Suricata + log-silence)    │
         └─────────────────────────────────────┘
```
|
||||||
|
|
||||||
|
- **Loki (cluster)** — `loki` service role on a docker_host; **all** logs; monolithic
|
||||||
|
single-binary mode (ample at this scale); NVMe; bounded retention.
|
||||||
|
- **Loki (`askari`)** — the same role parameterised, deployed to the `offsite_hosts`
|
||||||
|
group; **security subset only**, **write-only**, long retention, tiny volume.
|
||||||
|
- **Grafana** — `grafana` service role on the cluster; both Lokis as datasources (one
|
||||||
|
pane queries both); where ADR-002's "active alerting" lands.
|
||||||
|
|
||||||
|
Reuses what boma already has: `askari` (off-site, on the mesh per ADR-016) and the
|
||||||
|
`base`/service-role machinery.
|
||||||
|
|
||||||
|
## Data flow & the security subset
|
||||||
|
|
||||||
|
Each host's Alloy pipeline writes **everything** to the cluster Loki and a **filtered
|
||||||
|
copy** of security events to the `askari` Loki — a relabel/match stage tags security
|
||||||
|
sources (`security="true"`) and routes only those to the second `loki.write` target.
|
||||||
|
One agent, two destinations.
|
||||||
|
|
||||||
|
**Security subset** (high-value, bounded volume): `auditd` (auth, privilege, file
|
||||||
|
watches), `authpriv` (SSH, `sudo`), `fail2ban` (bans), AIDE (file-integrity reports),
|
||||||
|
**Suricata** (OPNsense isn't a `base` host, so it **syslog-forwards** alerts to the
|
||||||
|
ingest point), and key container security events (reverse-proxy 401/403, Authentik
|
||||||
|
login events, Docker daemon events).
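The routing rule is small enough to state as code. The sketch below is plain Python, not Alloy configuration: `SECURITY_SOURCES`, `route`, and the destination names `loki-cluster` / `loki-askari` are hypothetical stand-ins for the relabel/match stage and the two `loki.write` targets, and the source names are an assumed normalisation of the subset listed above.

```python
# Hypothetical source names for the security subset described in this design.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


def route(entry: dict) -> list:
    """Return destinations for one log entry.

    Every entry goes to the cluster Loki; security-relevant entries are also
    tagged security="true" and copied to the off-site askari Loki.
    """
    labels = dict(entry.get("labels", {}))
    destinations = ["loki-cluster"]
    if entry.get("source") in SECURITY_SOURCES:
        labels["security"] = "true"
        destinations.append("loki-askari")
    entry["labels"] = labels
    return destinations


if __name__ == "__main__":
    print(route({"source": "auditd", "line": "type=USER_AUTH ..."}))
    print(route({"source": "nginx", "line": "GET / 200"}))
```

Keeping the decision to a single label means the off-site copy is bounded by what is tagged at the source, which is the lever the design relies on to keep `askari` volume small.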
|
||||||
|
|
||||||
|
**Write-only / append-only** (the tamper-resistance mechanism):
|
||||||
|
- The `askari` Loki push endpoint (`/loki/api/v1/push`) is reachable only over the
|
||||||
|
**NetBird mesh**, with a **push-only credential**; hosts hold *only* that.
|
||||||
|
- Loki's query/admin/delete APIs on `askari` are **not exposed to hosts** (localhost /
|
||||||
|
mesh-ACL'd to operator + Grafana). The push API has no edit/delete verb, so a
|
||||||
|
compromised host can **append but not read/edit/delete**. Deletion needs the
|
||||||
|
admin/compactor API or filesystem — unreachable from a host.
|
||||||
|
- The cluster Loki uses the same push-only credential, blocking per-host log-clearing
|
||||||
|
via API there too.
|
||||||
|
|
||||||
|
**Reliability:** Alloy buffers (WAL) and retries, so a brief `askari`/mesh outage
|
||||||
|
doesn't lose logs — they flush on reconnect with only a small local buffer.
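The buffering behaviour, including the drop-oldest case noted under residual risks, can be pictured with a toy model. This is not Alloy's WAL implementation; `BoundedBuffer`, its `capacity`, and the `send` callback are assumptions made up for the illustration.

```python
from collections import deque


class BoundedBuffer:
    """Toy stand-in for a size-capped send buffer: when full, the oldest entry is dropped."""

    def __init__(self, capacity: int) -> None:
        self._queue = deque(maxlen=capacity)
        self.dropped = 0

    def append(self, line: str) -> None:
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1  # deque(maxlen=...) discards from the left on overflow
        self._queue.append(line)

    def flush(self, send) -> int:
        """Send buffered lines oldest-first; stop and keep the rest if send() fails."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break
            self._queue.popleft()
            sent += 1
        return sent


if __name__ == "__main__":
    buf = BoundedBuffer(capacity=3)
    for n in range(5):  # five lines arrive while the sink is unreachable
        buf.append(f"line {n}")
    delivered = []
    buf.flush(lambda line: delivered.append(line) or True)
    print(delivered, buf.dropped)
```

A short outage fits inside the buffer and nothing is lost; only an outage long enough to overflow it drops the oldest lines, which is why the drop counter is worth monitoring.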
|
||||||
|
|
||||||
|
## Security, integrity & residual risks
|
||||||
|
|
||||||
|
**Defeated:** opportunistic track-covering (`rm`/`vacuum`) — lines are already off the
|
||||||
|
host; **host pivot to the store** — an attacker rooting any cluster host can append but
|
||||||
|
not delete, and cannot reach `askari`'s admin plane. **The security trail survives full
|
||||||
|
cluster compromise.**
|
||||||
|
|
||||||
|
**Honest residual risks (conscious, recorded):**
|
||||||
|
1. **Append-only ≠ cryptographic WORM** — a root-on-`askari` attacker could edit chunk
|
||||||
|
files on disk. Skipping object-lock is **accepted-risk R4**; mitigated by `askari`
|
||||||
|
being minimal/hardened/operator-only/mesh-only.
|
||||||
|
2. **Un-shipped window** — a few seconds of not-yet-flushed logs live on the host;
|
||||||
|
near-real-time minimises it. Accept.
|
||||||
|
3. **Agent compromise (forward-looking)** — rooting a host lets the attacker stop *that
|
||||||
|
host's* Alloy or inject *future* false logs, but **cannot alter shipped history**.
|
||||||
|
4. **Detection as a feature** — a host that **goes silent** (Alloy stops) is an
|
||||||
|
**alert**; the tamper attempt becomes a signal. "Log-source silence" is wired into
|
||||||
|
Grafana alerting.
|
||||||
|
5. **Credential theft / `askari` outage** — a stolen push credential allows appending
|
||||||
|
noise, not deletion (bounded, rotatable); an `askari` outage buffers on hosts and
|
||||||
|
flushes on reconnect (a very long outage eventually drops oldest — monitor it).
|
||||||
|
|
||||||
|
**ADR-002 fit:** realises "logs shipped to central" + "active alerting"; the off-site +
|
||||||
|
append-only model is a clean blast-radius-containment enhancement for the opportunistic
|
||||||
|
threat model.
|
||||||
|
|
||||||
|
## Retention, sizing & disk-wear
|
||||||
|
|
||||||
|
**Sizing (estimates — intent-based until measured, like `/capacity-review`):** a 2–5
|
||||||
|
host homelab generates ~1–3 GB/day raw "typical" (≪1 GB/day quiet; 5–15 GB/day very
|
||||||
|
chatty); Loki compresses ~7–10× → ~0.1–0.4 GB/day stored; the security subset is
|
||||||
|
~10–20% of that.
|
||||||
|
|
||||||
|
**Retention (tunable in `group_vars`):**
|
||||||
|
- **Cluster Loki (all logs):** bounded hot retention, start **30–90 days** (~10–35 GB
|
||||||
|
at 90d on NVMe).
|
||||||
|
- **`askari` Loki (security subset):** **1 year+** (~5–25 GB/yr) — small enough to keep
|
||||||
|
the security trail long for over-time detection.
|
||||||
|
- Defaults now; **re-measure real volume after a few weeks live** and tune.
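The sizing and retention figures follow from simple arithmetic, sketched here so they can be re-run once real volume is measured. The function names are hypothetical, and the inputs are the estimates quoted above rather than measurements.

```python
def stored_gb_per_day(raw_gb_per_day: float, compression_ratio: float) -> float:
    """Compressed volume written to Loki per day."""
    return raw_gb_per_day / compression_ratio


def retained_gb(stored_per_day: float, retention_days: int) -> float:
    """Steady-state store size for a given retention window."""
    return stored_per_day * retention_days


if __name__ == "__main__":
    low = stored_gb_per_day(1.0, 10.0)   # quiet end: 1 GB/day raw, 10x compression
    high = stored_gb_per_day(3.0, 7.0)   # busy end: 3 GB/day raw, 7x compression

    cluster_90d = (retained_gb(low, 90), retained_gb(high, 90))
    # Security subset assumed to be 10-20% of stored volume, kept for one year.
    askari_1y = (retained_gb(low * 0.10, 365), retained_gb(high * 0.20, 365))

    print(f"cluster, 90 days: {cluster_90d[0]:.0f}-{cluster_90d[1]:.0f} GB")
    print(f"askari, 1 year:   {askari_1y[0]:.0f}-{askari_1y[1]:.0f} GB")
```

With these inputs the cluster store comes out at roughly 9 to 39 GB for 90 days and the security subset at roughly 4 to 31 GB per year, the same order as the rounded estimates above.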
|
||||||
|
|
||||||
|
**Disk-wear (the lore is real only for specific media/misconfig; mitigated as design
|
||||||
|
rules):** at boma's volume even ~10–40 GB/day of amplified writes is decades of life on
|
||||||
|
a ~600-TBW/TB NVMe. Rules:
|
||||||
|
1. Log storage on **NVMe/SSD** (or **HDD** for a long-retention cold tier — sequential,
|
||||||
|
endurance-unlimited); **never SD/USB flash**.
|
||||||
|
2. **Bounded verbosity at source** (sane log levels, selective access logging, a
|
||||||
|
*targeted* `auditd` ruleset) — the one lever that controls wear *and* firehose size.
|
||||||
|
3. Tuned Loki **retention + compaction** so neither store grows unbounded.
|
||||||
|
4. **SSD wearout/TBW is a monitored metric** (Proxmox wearout %, `node_exporter`
|
||||||
|
smartmon) with an alert — wear is a graph, not a surprise. (Depends on the metrics
|
||||||
|
stack — see Dependencies.)
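The "decades of life" claim can be checked with one division. The sketch assumes a 1 TB drive rated at 600 TBW (the "~600-TBW/TB" figure above) and a constant write rate; `years_of_endurance` is a hypothetical helper, not part of any tool.

```python
def years_of_endurance(tbw_terabytes: float, writes_gb_per_day: float) -> float:
    """Years until the rated terabytes-written budget is spent at a constant write rate."""
    total_gb = tbw_terabytes * 1000.0  # decimal TB to GB
    return total_gb / writes_gb_per_day / 365.0


if __name__ == "__main__":
    for rate in (10.0, 40.0):  # amplified writes in GB/day, the range quoted above
        print(f"{rate:>4.0f} GB/day -> {years_of_endurance(600.0, rate):.0f} years")
```

At 40 GB/day this gives about 41 years and at 10 GB/day about 164 years, so wear only becomes a concern if write volume is far above the estimate, which the wearout alert is there to catch.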
|
||||||
|
|
||||||
|
Capacity bookkeeping ties into ADR-012: a log-storage allocation line (cluster +
|
||||||
|
`askari`) and SSD-wearout as a tracked metric.
|
||||||
|
|
||||||
|
## Documentation & implementation changes
|
||||||
|
|
||||||
|
This is a substantial capability → its own ADR-018, with reconciliations:
|
||||||
|
|
||||||
|
| Doc / artifact | Change |
|---|---|
| ADR-018 (new) | Home of record: ship-all-to-Loki, the off-site write-only security subset, append-only model, skip-WORM (R4), disk-wear rules. |
| `base` role (when built) | Install + configure Alloy (all → cluster Loki; subset → `askari` write-only). |
| `loki` service role (new, when built) | One role, two deployments (cluster all-logs; `askari` security-subset write-only). `SECURITY.md` + `VERIFY.md`. |
| `grafana` service role (new, when built) | Both Lokis as datasources; dashboards + alerting (AIDE/`auditd`/`fail2ban`/Suricata + log-silence). |
| OPNsense (Ansible-managed) | Syslog-forward Suricata alerts to the ingest point. |
| ADR-002 | "Logs shipped to central" + "active alerting" bullets point to ADR-018. |
| `docs/security/accepted-risks.md` | Add **R4** — no cryptographic WORM for logs (append-only + off-site is the control). |
| `docs/CAPABILITIES.md` §3 | Loki → decided; add the off-site security sink + Alloy agent rows; mark the alerting wiring. |
| `docs/decisions/012-hardware-capacity.md` | Log-storage allocation (cluster + `askari`) + SSD-wearout tracked metric. |
| `STATUS.md` + `docs/TODO.md` (3.1 / 3.6) | Mark "how to manage logs" decided by ADR-018; rows as designed-not-built. |
| `vault.yml` | Push-only Loki credential (`vault.loki.*`). |
|
||||||
|
|
||||||
|
**Buildable now:** ADR-018 + the ADR-002/CAPABILITIES/ADR-012/accepted-risks/STATUS/TODO
|
||||||
|
reconciliations. **Deferred on the stack:** the Alloy-in-`base`, `loki`/`grafana`
|
||||||
|
service roles, OPNsense syslog config, and the live pipeline.
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- `base` role + service-role machinery (unbuilt) — STATUS.md.
|
||||||
|
- The running cluster + `askari` (`offsite_hosts`, designed) — ADR-016.
|
||||||
|
- OPNsense automation (for Suricata syslog forwarding) — ADR-007.
|
||||||
|
- The **metrics stack** (Prometheus / `node_exporter`) for SSD-wearout + log-silence
|
||||||
|
alerting — sibling effort, TODO 3.6.
|
||||||
|
|
||||||
|
## Deferred / out of scope
|
||||||
|
|
||||||
|
1. **WORM / object-lock (Tier 3)** — accepted-risk R4; revisit only if the threat model
|
||||||
|
shifts to targeted/forensic.
|
||||||
|
2. **The metrics pipeline** (Prometheus/`node_exporter`) — sibling effort; this spec is
|
||||||
|
**logs**. SSD-wearout + silence alerting depend on it.
|
||||||
|
3. **Cold archival beyond Loki retention** (export to backups) and **structured/parsed
|
||||||
|
per-service log standards** — future refinements.
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose (tens–hundreds of GB/yr) is disk-hungry on a small VPS; keep volume where storage is cheap (on-cluster) and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only logging (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail needs to be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-size-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice; one agent for logs (and later metrics). |
|
||||||
|
|
||||||
|
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
|
||||||
|
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
|
||||||
|
standard), ADR-011 (health checks — distinct from this).
|
||||||
206
docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md
Normal file
|
|
@ -0,0 +1,206 @@
|
||||||
|
# Design — Mesh VPN (NetBird, self-hosted on `askari`)
|
||||||
|
|
||||||
|
- **Date:** 2026-06-05
|
||||||
|
- **Status:** Approved design — pending implementation plan
|
||||||
|
- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md`
|
||||||
|
R3 "pending VPN choice" placeholder
|
||||||
|
- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
|
||||||
|
- **Becomes:** ADR-016 (this design is the basis for that ADR)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone)
|
||||||
|
without exposing anything to the public internet. ADR-015 left the access mechanism —
|
||||||
|
the "mesh VPN" — deferred to this discussion.
|
||||||
|
|
||||||
|
Meanwhile ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN
|
||||||
|
(VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers to
|
||||||
|
OPNsense, plus road-warrior clients. And `docs/CAPABILITIES.md` already flags the open
|
||||||
|
question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real
|
||||||
|
alternative to weigh."*
|
||||||
|
|
||||||
|
So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs
|
||||||
|
Tailscale), with an architectural sub-question of whether a mesh replaces or coexists
|
||||||
|
with the ADR-007 WireGuard.
|
||||||
|
|
||||||
|
## Decisions (as settled)
|
||||||
|
|
||||||
|
1. **Scope — the mesh *replaces* WireGuard.** A single overlay becomes the sole
|
||||||
|
remote-access path for `ubongo`, `askari`, and road-warrior clients. ADR-007's
|
||||||
|
VLAN-99 OPNsense WireGuard design is retired.
|
||||||
|
2. **Control plane — self-hosted, on `askari`.** Maximum sovereignty (boma already
|
||||||
|
self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site
|
||||||
|
coordinator that survives a homelab outage and stays out of the cluster it
|
||||||
|
administers.
|
||||||
|
3. **Tool — NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be
|
||||||
|
self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path
|
||||||
|
means Headscale, a separate third-party reimplementation with partial parity — ruled
|
||||||
|
out below.)
|
||||||
|
4. **Routing — NetBird agent on every (Linux) host**, not a subnet router. At boma's
|
||||||
|
scale (2–5 hosts, treated as individuals) the usual "agent everywhere" downside is
|
||||||
|
moot, and the `base` role already runs on every host, so enrollment is one uniform
|
||||||
|
role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs
|
||||||
|
that match ADR-007's firewall intent. **One exception:** OPNsense (FreeBSD) is not a
|
||||||
|
first-class NetBird agent target, so `mgmt`/gateway reachability is handled by a
|
||||||
|
single advertised route or by administering OPNsense from an on-LAN meshed peer.
|
||||||
|
5. **Identity — embedded local users** (Dex, built into the management container), not
|
||||||
|
a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a
|
||||||
|
documented future option.
|
||||||
|
|
||||||
|
## Verified facts (ADR-014)
|
||||||
|
|
||||||
|
> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
|
||||||
|
> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
|
||||||
|
> the core services are **merged into a single container**; deploy via Docker Compose.
|
||||||
|
> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
|
||||||
|
> external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
|
||||||
|
> required.
|
||||||
|
> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
|
||||||
|
> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
|
||||||
|
>
|
||||||
|
> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
|
||||||
|
> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
|
||||||
|
> elsewhere. Fully open source, self-hostable, no open-core feature gating.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Architecture & topology
|
||||||
|
|
||||||
|
A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99
|
||||||
|
WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird
|
||||||
|
on `askari`.
|
||||||
|
|
||||||
|
**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the
|
||||||
|
cluster per ADR-007) runs the **NetBird management stack** (single container:
|
||||||
|
management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 + UDP
|
||||||
|
3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh survive a
|
||||||
|
full homelab outage and keeps the coordinator out of the cluster it administers (no
|
||||||
|
chicken-and-egg).
|
||||||
|
|
||||||
|
**Peers:**
|
||||||
|
- `askari` — coordinator + peer.
|
||||||
|
- `ubongo` (control/AI-worker host) — agent.
|
||||||
|
- All Linux managed hosts (`dns1/2`, `proxy`, …) — agent via the `base` role.
|
||||||
|
- Road-warrior clients — `mamba`, phone, work PC — agent/app.
|
||||||
|
- OPNsense / `mgmt` — the single non-agent exception (advertised route or LAN-side
|
||||||
|
admin from a meshed peer).
|
||||||
|
|
||||||
|
**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the
|
||||||
|
`10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird
|
||||||
|
ACLs instead of OPNsense routing `10.99.0.0/24`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Security model, ACLs, and attack surface
|
||||||
|
|
||||||
|
**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):
|
||||||
|
- `vpn` peers → `srv` **metrics ports only** (askari's monitoring scope).
|
||||||
|
- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
|
||||||
|
- road-warrior clients → only what each needs; nothing by default.
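Default-deny means a flow passes only when a rule names it. The sketch below models that evaluation in plain Python; it is not NetBird's policy format or API, and the group names, the `admin` / `metrics` service labels, and `is_allowed` are hypothetical labels chosen to mirror the bullets above.

```python
# Hypothetical (source_group, destination_group, service) allow rules.
ALLOW_RULES = {
    ("vpn", "srv", "metrics"),   # askari's monitoring scope
    ("admin", "srv", "admin"),   # ubongo / mamba administering services
    ("admin", "mgmt", "admin"),  # ubongo / mamba administering management hosts
}


def is_allowed(src_group: str, dst_group: str, service: str) -> bool:
    """Default-deny: a flow passes only if an explicit rule names it."""
    return (src_group, dst_group, service) in ALLOW_RULES


if __name__ == "__main__":
    print(is_allowed("vpn", "srv", "metrics"))         # named by a rule
    print(is_allowed("vpn", "mgmt", "admin"))          # no rule, so denied
    print(is_allowed("roadwarrior", "srv", "metrics")) # nothing by default
```

Road-warrior clients have no entry at all, so they reach nothing until a rule is added for the specific peer and service they need.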
|
||||||
|
|
||||||
|
**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup
|
||||||
|
keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base`
|
||||||
|
role. Prefer ephemeral/scoped keys (ADR-002).
|
||||||
|
|
||||||
|
**Host firewall interaction.** NetBird creates a `wt0` mesh interface. The `base`
|
||||||
|
role's nftables default-deny allows inbound admin (SSH) **only on `wt0`**, denied on
|
||||||
|
the physical NIC — the pattern ADR-015 set for `ubongo`, now applied fleet-wide. Mesh
|
||||||
|
+ nftables are defence-in-depth.
|
||||||
|
|
||||||
|
**The new attack surface — a public control plane on `askari`.** Today `askari`
|
||||||
|
exposes a WireGuard UDP port; with NetBird self-hosted it exposes the **management API
|
||||||
|
+ dashboard (80/443)** and **Coturn (3478)** publicly, and the management API is
|
||||||
|
keys-to-the-kingdom for the whole mesh. Mitigations baked in:
|
||||||
|
- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where
|
||||||
|
practical.
|
||||||
|
- `askari` runs `base` hardening (already a public managed host) and NetBird is
|
||||||
|
**version-pinned** (ADR-011) and patched on boma's cadence — self-hosting means
|
||||||
|
owning the CVE cadence (AGPLv3 server).
|
||||||
|
|
||||||
|
Net vs ADR-002: nothing from the **cluster** is publicly exposed; the only public
|
||||||
|
surface is on `askari` (a public VPS by design), shifting from "WireGuard port" to
|
||||||
|
"NetBird control plane."
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recovery, bootstrap ordering, and operations
|
||||||
|
|
||||||
|
**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the
|
||||||
|
fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to
|
||||||
|
`ubongo`/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and
|
||||||
|
there is no chicken-and-egg in the critical path.
|
||||||
|
|
||||||
|
**Bootstrap order** (askari-first):
|
||||||
|
1. Stand up the NetBird coordinator on `askari`.
|
||||||
|
2. Enroll `ubongo`.
|
||||||
|
3. `base` role enrolls the rest of the fleet via setup keys from vault.
|
||||||
|
|
||||||
|
**Recovery.** Coordinator off-site on `askari` ⇒ the mesh survives a full homelab
|
||||||
|
outage. Two must-haves:
|
||||||
|
- **Back up NetBird's management datastore** off `askari` — encrypted, synced to
|
||||||
|
`ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
|
||||||
|
- Existing peer tunnels keep running on last-known config through a brief coordinator
|
||||||
|
outage; only changes/new enrollments need it live — so `askari` is important but not
|
||||||
|
instantly fatal.
|
||||||
|
|
||||||
|
**`askari` becomes Ansible-managed.** It joins the inventory under its own group and
|
||||||
|
gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one
|
||||||
|
service = one role per ADR-004, with its own `SECURITY.md` per the service-role
|
||||||
|
standard). Agent install/enrollment lives in `base`.
|
||||||
|
|
||||||
|
**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`;
|
||||||
|
NetBird's built-in DNS is scoped/off to avoid overlap. NetBird server (on `askari`)
|
||||||
|
and agents (via `base`) are version-pinned (ADR-011).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Documentation & implementation changes
|
||||||
|
|
||||||
|
This is a substantial decision → its own ADR, with amendments linking to it.
|
||||||
|
|
||||||
|
| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29): decided, NetBird mesh, self-hosted on `askari`. |
| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |
|
||||||
|
|
||||||
|
**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base`
|
||||||
|
and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the
|
||||||
|
decision and the doc reconciliation; the role tasks land when `base` is built.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Deferred / out of scope
|
||||||
|
|
||||||
|
1. **External SSO IdP** (Zitadel/Keycloak) — embedded local users now; SSO later if a
|
||||||
|
second operator or service-SSO need appears.
|
||||||
|
2. **OPNsense mesh integration specifics** — the exact `mgmt` reachability mechanism
|
||||||
|
(single advertised route vs LAN-side admin) is settled during implementation when
|
||||||
|
OPNsense automation is built.
|
||||||
|
3. **The `base` / `netbird_coordinator` role implementation** — depends on the
|
||||||
|
unbuilt `base` role and service-role standard.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support; weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |
|
||||||
|
|
||||||
|
See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011
|
||||||
|
(version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
|
||||||
|
ADR-013 (heritage — V4 used WireGuard; NetBird is translated, not transplanted).
|
||||||
|
|
@@ -0,0 +1,203 @@
|
||||||
|
# Design — Service-UI acceptance verification (ADR-008 Level 4)
|
||||||
|
|
||||||
|
- **Date:** 2026-06-05
|
||||||
|
- **Status:** Approved design — pending implementation plan
|
||||||
|
- **Resolves:** ADR-015 deferred item #2 (browser-E2E verification harness); TODO 2.2
|
||||||
|
(browser portion) + TODO 2.3 (test users + manual-test instruction)
|
||||||
|
- **Expands:** ADR-008 Level 4 (currently a stub)
|
||||||
|
- **Becomes:** ADR-017 (this design is the basis for that ADR)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
|
||||||
|
**Level 4 stub**: "Claude drives a headless browser from `ubongo` against a deployed
|
||||||
|
service: loads the rendered UI, creates test users, exercises features, and hands the
|
||||||
|
operator a manual test script." Nothing below Level 4 actually exercises a service's
|
||||||
|
**application UI** — Molecule tests the role in a container, Level 2 confirms the stack
|
||||||
|
converges, Level 3 confirms public endpoints respond. None answer "does PhotoPrism
|
||||||
|
actually let me log in, upload a photo, and see a thumbnail?" (TODO 8.2).
|
||||||
|
|
||||||
|
The operator's original ask: *"Claude could spin up a browser and actually see the
|
||||||
|
generated service web-UIs to verify various things. Perhaps even generate test users
|
||||||
|
and test features and instruct me on tests as well."* That is TODO 2.2 (headless
|
||||||
|
browsing) + TODO 2.3 (test-user generation + manual-test instruction).
|
||||||
|
|
||||||
|
Today Claude "sees" a browser only **passively** — the `/screenshot` skill fetches
|
||||||
|
screenshots the operator took on `mamba`. This harness is the **active** counterpart:
|
||||||
|
Claude drives the browser itself.
|
||||||
|
|
||||||
|
## Decisions (the settled forks)
|
||||||
|
|
||||||
|
1. **Nature — Claude-driven exploratory.** Claude navigates the live UI with judgment
|
||||||
|
(look, click, reason about whether it works, notice anything off), not deterministic
|
||||||
|
scripts. This is the distinctive value; a scripted Playwright regression suite is
|
||||||
|
explicitly *not* built here.
|
||||||
|
2. **Mode — interactive, Claude-in-the-loop.** Follows from #1: exploratory judgment
|
||||||
|
can't be a headless cron gate. Scheduled smoke-testing stays out of scope (that is a
|
||||||
|
determinism job for health checks / Uptime Kuma later).
|
||||||
|
3. **Environment — staging, full exercise.** Claude creates test users and exercises
|
||||||
|
features (including destructive flows) against a *staging* deploy. Staging is a
|
||||||
|
rebuildable sandbox, so this resolves safety: no production-data risk, no prod
|
||||||
|
pollution.
|
||||||
|
4. **Auth — test users in Authentik (central IdP), real SSO flow.** Claude's browser
|
||||||
|
authenticates through Traefik + Authentik exactly as a real user would, faithfully
|
||||||
|
testing the real access path.
|
||||||
|
5. **Structure — per-service `VERIFY.md` backbone + free exploration.** Each service
|
||||||
|
role ships an acceptance spec of critical user journeys; Claude executes it *and*
|
||||||
|
explores beyond it. Repeatable + intent-capturing, without losing exploratory value.
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
In scope: the **browser/UI** verification harness (TODO 2.2 browser portion) + the
|
||||||
|
**test-user** and **manual-test-instruction** standards (TODO 2.3) = ADR-008 **Level 4**.
|
||||||
|
|
||||||
|
Out of scope (siblings, noted not built): the other TODO-2.2 "live testing" methods —
|
||||||
|
API calls, `curl` pulls, log review. They share the spirit but are not browser work.
|
||||||
|
Also out: a scripted/CI regression suite; scheduled headless smoke checks.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Architecture, mechanism, and workflow placement
|
||||||
|
|
||||||
|
**Mechanism.** Claude drives a real Chromium on `ubongo` via the **`playwright` Claude
|
||||||
|
Code plugin** (already earmarked in `claude-code-setup.md`, enabled when this lands).
|
||||||
|
No bespoke browser code — Claude calls the Playwright MCP tools (navigate, click, type,
|
||||||
|
screenshot, read DOM) and reasons over what it sees. Active counterpart to the passive
|
||||||
|
`/screenshot`-from-`mamba` pattern.
|
||||||
|
|
||||||
|
**Orchestration.** A boma skill/command — **`/verify-service <name>`** — run
|
||||||
|
interactively on `ubongo`. It:
|
||||||
|
1. Reads the service's `roles/<name>/VERIFY.md` acceptance spec.
|
||||||
|
2. Provisions/uses a test user in the **staging** Authentik.
|
||||||
|
3. Drives the browser through the real SSO flow into the staging service.
|
||||||
|
4. Executes the listed journeys exploratorily (judging pass/fail, screenshotting key
|
||||||
|
states) and free-explores.
|
||||||
|
5. Writes a dated verification report with linked screenshots.
|
||||||
|
6. Emits a manual-test checklist for anything it couldn't do.
|
||||||
|
|
||||||
|
**Pipeline placement.** Level 4 runs after Level 2 (staging deploy) and before
|
||||||
|
production promotion:
|
||||||
|
`build role → molecule (L1) → staging deploy (L2) → /verify-service (L4) → promote`.
|
||||||
|
It reaches the staging service over the LAN from `ubongo` (services on `srv`; resolved
|
||||||
|
via boma DNS), through Traefik + Authentik as a real user would.
|
||||||
|
|
||||||
|
**Boundaries (one unit, clear interface):** the skill *orchestrates*; `VERIFY.md`
|
||||||
|
*declares intent* (per service); Authentik *provides identity*; the report *captures
|
||||||
|
results*. Each is independently understandable and swappable.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The `VERIFY.md` standard
|
||||||
|
|
||||||
|
Every service role ships a populated `roles/<service>/VERIFY.md`, copied from a new
|
||||||
|
template `docs/testing/service-verify-template.md` — parallel to how each role ships
|
||||||
|
`SECURITY.md` from `service-security-template.md`. It becomes a **role convention**
|
||||||
|
(every *service* role must have a populated `VERIFY.md`).
|
||||||
|
|
||||||
|
Contents:
|
||||||
|
- **Critical user journeys** — the acceptance criteria that define "working" for this
|
||||||
|
service (e.g. PhotoPrism: *SSO login → library loads → upload a test image →
|
||||||
|
thumbnail generates → search finds it*).
|
||||||
|
- **What good looks like** — states/screenshots to confirm.
|
||||||
|
- **Not browser-verifiable** — items to route to the manual-test handoff (hardware,
|
||||||
|
paid/external flows, subjective quality).
|
||||||
|
|
||||||
|
`/verify-service` reads `roles/<name>/VERIFY.md`, executes those journeys, and explores
|
||||||
|
beyond them.
|
||||||
|
|
||||||
|
## Test-user generation standard (TODO 2.3)
|
||||||
|
|
||||||
|
Test identities are provisioned in the **staging** Authentik (never the production IdP
|
||||||
|
— test accounts must not exist in prod):
|
||||||
|
- **Convention:** a dedicated `test` group / naming prefix (e.g. `test-<service>@…`) so
|
||||||
|
accounts are identifiable and bulk-removable.
|
||||||
|
- **Credentials:** ephemeral, generated per run (staging is rebuildable); held only for
|
||||||
|
the run. No test creds in `vault.yml`.
|
||||||
|
- **Idempotent:** reuse-or-create.
|
||||||
|
- **Teardown:** primary teardown is the staging rebuild (sandbox); the skill also
|
||||||
|
offers explicit cleanup of the `test` group.
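The convention above (named prefix, `test` group, ephemeral credentials, reuse-or-create) could be sketched roughly as follows. This is a minimal illustration only: the `StagingTestUser` type, the `make_test_user` helper, and the in-memory `existing` dict are hypothetical stand-ins for whatever lookup the staging Authentik provisioning ends up using, and no real Authentik API is called.

```python
import secrets
from dataclasses import dataclass

TEST_GROUP = "test"  # group name taken from the convention; adjust if it differs


@dataclass(frozen=True)
class StagingTestUser:
    """Hypothetical record for one test identity (not an Authentik type)."""
    username: str
    password: str
    group: str


def make_test_user(service: str, existing: dict[str, StagingTestUser]) -> StagingTestUser:
    """Reuse-or-create a test identity for one service.

    `existing` stands in for a lookup against the staging IdP. The password
    is generated per run and lives only in memory, never in vault.yml.
    """
    username = f"test-{service}"
    if username in existing:
        return existing[username]  # idempotent: reuse the existing account
    user = StagingTestUser(username, secrets.token_urlsafe(24), TEST_GROUP)
    existing[username] = user
    return user
```

Calling `make_test_user("photoprism", users)` twice with the same `users` dict returns the same object, which is the reuse-or-create behaviour the standard asks for.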
|
||||||
|
|
||||||
|
## Reporting & manual-test handoff
|
||||||
|
|
||||||
|
- **Report:** `/verify-service` writes `docs/testing/reviews/YYYY-MM-DD-<service>.md`
|
||||||
|
(plus `latest.md`), mirroring `/review-repo`→`docs/reviews/` and
|
||||||
|
`/capacity-review`→`docs/hardware/reviews/`. It contains pass/fail per `VERIFY.md`
|
||||||
|
journey, observations, the test-user/env used, a verdict, and the manual-test
|
||||||
|
checklist. The committed markdown is the durable artifact.
|
||||||
|
- **Screenshots:** saved to a **git-ignored** dir on `ubongo` (PNGs would bloat the
|
||||||
|
repo); the report links them and inlines only a few key evidence shots.
|
||||||
|
- **Manual-test handoff (TODO 2.3):** anything Claude can't do — physical device,
|
||||||
|
paid/external flow, subjective judgment — becomes a **structured checklist** in the
|
||||||
|
report (numbered steps, expected result, why handed off). The operator runs them and
|
||||||
|
reports back. This is the "instruct me on tests" half of the vision, as a first-class
|
||||||
|
output.
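As a rough sketch of the report layout described above, the snippet below builds the dated report path and renders a manual-test checklist with numbered steps, expected result, and handoff reason. The `ManualCheck` type and both helper names are hypothetical, and treating `latest.md` as a single file in the same directory is an assumption; the source only says "plus `latest.md`".

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath

REVIEWS_DIR = PurePosixPath("docs/testing/reviews")


@dataclass(frozen=True)
class ManualCheck:
    """One item Claude could not verify and hands to the operator."""
    steps: tuple[str, ...]
    expected: str
    reason: str


def report_paths(service: str, run_date: date) -> tuple[PurePosixPath, PurePosixPath]:
    """Return (dated report, latest pointer) for one verification run."""
    dated = REVIEWS_DIR / f"{run_date.isoformat()}-{service}.md"
    return dated, REVIEWS_DIR / "latest.md"  # single latest.md is assumed


def render_checklist(checks: list[ManualCheck]) -> str:
    """Render the manual-test handoff as a numbered markdown checklist."""
    lines = []
    for number, check in enumerate(checks, start=1):
        lines.append(f"{number}. [ ] {'; '.join(check.steps)}")
        lines.append(f"   - Expected: {check.expected}")
        lines.append(f"   - Handed off because: {check.reason}")
    return "\n".join(lines)
```

Keeping the checklist as structured data until render time makes it easy to emit the same items in the report and in the operator-facing summary.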
|
||||||
|
|
||||||
|
## Safety
|
||||||
|
|
||||||
|
Even though staging is a sandbox:
|
||||||
|
- **Staging-only guard.** The skill refuses to run against production (verifies it is
|
||||||
|
pointed at the staging environment/inventory before acting) — an ADR-002-aligned hard
|
||||||
|
stop, since exploratory clicking is destructive by nature.
|
||||||
|
- **Confined blast radius.** Test users live only in the staging `test` group; the run
|
||||||
|
sticks to the target service.
|
||||||
|
- **No secrets leaked.** Screenshots can capture on-screen tokens/credentials, so the
|
||||||
|
git-ignored screenshot dir is also the safety boundary (evidence isn't committed by
|
||||||
|
default), and the skill avoids capturing credential screens.
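The staging-only guard could look something like the sketch below. It assumes an `inventories/<env>/...` path layout and an environment literally named `staging`; both are assumptions for illustration, as are the function and exception names.

```python
from pathlib import PurePosixPath


class ProductionTargetError(RuntimeError):
    """Raised when a verification run is pointed at anything but staging."""


def assert_staging(inventory_path: str, allowed_env: str = "staging") -> None:
    """Hard stop unless the inventory path sits under inventories/<allowed_env>/.

    Anything that cannot be positively identified as staging is refused,
    so a malformed path fails closed rather than open.
    """
    parts = PurePosixPath(inventory_path).parts
    if "inventories" not in parts:
        raise ProductionTargetError(f"cannot determine environment: {inventory_path}")
    env_index = parts.index("inventories") + 1
    if env_index >= len(parts) or parts[env_index] != allowed_env:
        raise ProductionTargetError(f"refusing non-staging inventory: {inventory_path}")
```

The guard is written as an allowlist (only `staging` passes) rather than a denylist of production names, which matches the hard-stop intent.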
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Documentation & implementation changes
|
||||||
|
|
||||||
|
This is a substantial capability → its own ADR-017, with reconciliations:
|
||||||
|
|
||||||
|
| Doc / artifact | Change |
|---|---|
| ADR-017 (new) | Home of record: harness, the five settled forks, `VERIFY.md` standard, test-user + manual-handoff standards, safety. |
| ADR-008 (testing) | Expand the Level 4 stub into the full definition; link ADR-017. |
| `docs/testing/service-verify-template.md` (new) | The `VERIFY.md` template (parallels `service-security-template.md`). |
| `.claude/commands/verify-service.md` (new) | The `/verify-service <name>` orchestrating skill. |
| `CLAUDE.md` | Role conventions: every *service* role must ship a populated `VERIFY.md`. Further reading: ADR-017. |
| `docs/security/service-checklist.md` | Add "passed Level 4 (`/verify-service`)" to the pre-production service-clearance gate. |
| `.gitignore` + `docs/testing/reviews/` | Ignore the screenshot dir; create the reviews dir (README/`.gitkeep`). |
| `STATUS.md` | Row: Level 4 verification (skill + template authorable; *running* deferred). |
| `docs/TODO.md` | Mark 2.2 (browser portion) + 2.3 addressed by ADR-017; note API/`curl`/log siblings remain. |
| `make new-role` scaffold | Scaffold `VERIFY.md` into new service roles (when that scaffold is next touched). |
|
||||||
|
|
||||||
|
**Buildable now** (no `ubongo`/Authentik/staging needed): ADR-017, the ADR-008
|
||||||
|
expansion, the `VERIFY.md` template, the `/verify-service` skill logic, the convention +
|
||||||
|
checklist + Further-reading edits, `.gitignore`/dir, STATUS/TODO. This spec yields real
|
||||||
|
working artifacts immediately — the skill and standards exist and are reviewable; only
|
||||||
|
the *live run* waits on the stack.
|
||||||
|
|
||||||
|
**Deferred** (needs the stack): actually running it (`ubongo` + `playwright` plugin +
|
||||||
|
Authentik + a staging deploy); the Authentik test-user provisioning automation;
|
||||||
|
per-service `VERIFY.md` files (need the service roles, which don't exist yet).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- `ubongo` (ADR-015) — the host that runs the browser. Designed, not built.
|
||||||
|
- `playwright` Claude Code plugin — enabled when this lands (`claude-code-setup.md`).
|
||||||
|
- Authentik (CAPABILITIES §2, planned) — central IdP for test users + SSO.
|
||||||
|
- A staging environment with the service deployed (ADR-008 Level 2) — staging is
|
||||||
|
currently empty stubs.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What was ruled out
|
||||||
|
|
||||||
|
| Option | Reason |
|---|---|
| Scripted Playwright regression suite | The operator wants exploratory judgment, not deterministic scripts; scripts add authoring/maintenance burden. A scripted layer could come later but is not this. |
| Scheduled headless smoke gate (cron) | Needs determinism, which the exploratory nature excludes; that role belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. Production gets non-destructive checks elsewhere, not here. |
| Free-form exploration with no per-service spec | Flexible but non-repeatable and can miss a service's critical flow; `VERIFY.md` gives a backbone while keeping free exploration. |
| Staging bypasses SSO / per-app local users | Wouldn't exercise the real Traefik+Authentik access path; central test users in Authentik are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`, markdown report committed. |
|
||||||
|
|
||||||
|
See also: ADR-008 (testing — expanded), ADR-015 (control host — runs the browser),
|
||||||
|
ADR-002 (security), ADR-004 (one service = one role — `VERIFY.md` parallels
|
||||||
|
`SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).
|
||||||
164
docs/superpowers/specs/2026-06-06-firewall-strategy-design.md
Normal file
|
|
@@ -0,0 +1,164 @@
|
||||||
|
# Design — Firewall strategy (two-layer model + shared catalog)
|
||||||
|
|
||||||
|
- **Date:** 2026-06-06
|
||||||
|
- **Status:** Approved design — pending implementation plan
|
||||||
|
- **Resolves:** TODO 3.5 ("Decide the firewall strategy — which firewall, ruleset,
|
||||||
|
per-host vs central")
|
||||||
|
- **Becomes:** ADR-020 (this design is the basis for that ADR)
|
||||||
|
- **Scope note:** This is the **strategy** ADR. It pins the architecture and
|
||||||
|
responsibilities; the detailed builds (host nftables in `base`, OPNsense-as-code) are
|
||||||
|
separate follow-up specs (see *Scope*).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
boma needs a firewall strategy that is **predictable, declarative, and defends the
|
||||||
|
stated threat model** (opportunistic external, lateral movement / blast radius,
|
||||||
|
operator/agent error — ADR-002). The ADRs already commit to pieces of this — `nftables`
|
||||||
|
default-deny on hosts (ADR-002), OPNsense at the perimeter (ADR-007), Docker with
|
||||||
|
`iptables: false` (ADR-004) — but no document ties them together: *which layer owns
|
||||||
|
what, where firewall intent is declared, and how the two layers stay consistent.*
|
||||||
|
Without that, ports drift open ad-hoc and "per-host vs central" stays unanswered.
|
||||||
|
|
||||||
|
The roles that would hold the host firewall (`base`, `docker_host`) are empty, and there
|
||||||
|
is no OPNsense automation yet — so this is greenfield strategy work.
|
||||||
|
|
||||||
|
## The two-layer model
|
||||||
|
|
||||||
|
Two firewall layers, each with a distinct job; the host layer adds deliberate
|
||||||
|
defense-in-depth for the one thing the perimeter structurally cannot see.
|
||||||
|
|
||||||
|
### OPNsense — perimeter + inter-VLAN
|
||||||
|
|
||||||
|
Owns everything *between zones* and at the edge:
|
||||||
|
|
||||||
|
- WAN edge (the internet boundary).
|
||||||
|
- Inter-VLAN policy: `lan`/`iot`/`guest` → `srv`, `mgmt` access, the documented
|
||||||
|
per-VLAN egress rules (ADR-007).
|
||||||
|
- **Structurally blind to intra-`srv` traffic**: services share the `srv` subnet
|
||||||
|
(VLAN 20), which is switched and never reaches the OPNsense gateway.
|
||||||
|
|
||||||
|
### Host nftables — host-local + east-west within `srv` (in `base`)
|
||||||
|
|
||||||
|
Runs on every Debian VM:
|
||||||
|
|
||||||
|
- **Default-deny inbound**; allow loopback + established/related.
|
||||||
|
- **East-west allowlist**: a service host accepts a connection only from declared
|
||||||
|
sources (e.g. the reverse proxy, a named peer). This is the lateral-movement control
|
||||||
|
OPNsense cannot provide — the blast-radius goal in ADR-002.
|
||||||
|
- **Permissive egress**: allow outbound + established/related. Per-VLAN egress
|
||||||
|
restriction stays at OPNsense (where it already lives, ADR-007). Rationale: host-level
|
||||||
|
egress allowlisting is high-friction (every DNS/NTP/update/registry/webhook call must
|
||||||
|
be enumerated) for limited additional benefit given OPNsense already bounds where each
|
||||||
|
VLAN can go.
|
||||||
|
- **Docker integration**: Docker daemon runs with `"iptables": false`; nftables owns all
|
||||||
|
filtering, including container traffic (ADR-004).
|
||||||
|
- **Guaranteed management plane**: loopback, established/related, and `wt0` (the NetBird
|
||||||
|
overlay, ADR-016) for SSH + Ansible are *always* allowed, independent of the catalog,
|
||||||
|
and the ruleset is applied atomically — so a malformed or empty catalog can never lock
|
||||||
|
out management. (ADR-016: SSH is allowed only on `wt0`, not the LAN.)
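To illustrate the "guaranteed management plane" idea, here is a hedged sketch that renders an input chain in which the loopback, established/related, and `wt0` SSH rules are emitted unconditionally, before any catalog-derived rule. The table name `inet filter`, the helper name, and the shape of the `allow_rules` dicts are assumptions for illustration; the actual template and fail-safe ordering are deferred to the host-nftables follow-up.

```python
def render_input_chain(allow_rules, mgmt_iface="wt0", ssh_port=22):
    """Render an nftables input chain as text.

    The management rules come first and do not depend on `allow_rules`,
    so an empty or malformed catalog still yields a ruleset that keeps
    SSH reachable on the overlay interface.
    """
    lines = [
        "table inet filter {",
        "  chain input {",
        "    type filter hook input priority 0; policy drop;",
        '    iif "lo" accept',
        "    ct state established,related accept",
        f'    iifname "{mgmt_iface}" tcp dport {ssh_port} accept',
    ]
    # Catalog-derived allow rules: source is an IPv4 address or CIDR string.
    for rule in allow_rules:
        lines.append(
            f"    ip saddr {rule['source']} {rule['proto']} dport {rule['port']} accept"
        )
    lines += ["  }", "}"]
    return "\n".join(lines) + "\n"
```

Rendering the whole ruleset to one file and loading it in a single step is what lets the apply be atomic: either the new ruleset (management rules included) is in force, or the old one is.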
|
||||||
|
|
||||||
|
## The shared service catalog (single source of truth)
|
||||||
|
|
||||||
|
A central, declarative **service catalog** in `group_vars/` is the one source of truth
|
||||||
|
for firewall intent. This aligns with ADR-002's existing rule that "port definitions
|
||||||
|
live in `group_vars/` so rules stay in sync with deployed services," and keeps
|
||||||
|
connectivity *topology* (inherently cross-cutting) in inventory rather than in any one
|
||||||
|
self-contained service role (ADR-004).
|
||||||
|
|
||||||
|
Each entry describes a service's **ingress** as a list of allow rules:
|
||||||
|
|
||||||
|
```yaml
photoprism:
  ingress:
    - { from: reverse_proxy, port: 2342, proto: tcp }
reverse_proxy:
  ingress:
    - { from: lan, port: 443, proto: tcp }
```
|
||||||
|
|
||||||
|
`from` is **symbolic**, resolved at render time:
|
||||||
|
|
||||||
|
- a **host or group** → IP(s) from inventory;
|
||||||
|
- a **role** (e.g. `reverse_proxy`) → the host(s) filling it;
|
||||||
|
- a **VLAN/zone** (e.g. `lan`) → the subnet from the ADR-007 table.
|
||||||
|
|
||||||
|
Symbolic sources keep the catalog readable and resilient to IP changes.
|
||||||
|
|
||||||
|
### Each layer renders only its own slice
|
||||||
|
|
||||||
|
The same catalog feeds both layers; each filters for the rules it owns:
|
||||||
|
|
||||||
|
| Ingress rule | Host nftables | OPNsense |
|---|---|---|
| `from: reverse_proxy` (a `srv` peer) | allow proxy IP → port | none (intra-`srv`, invisible) |
| `from: lan` (cross-VLAN) | allow `lan` subnet → port | allow `lan` → host:port |
|
||||||
|
|
||||||
|
The dominant pattern falls out naturally: most services are **proxied** — their only
|
||||||
|
ingress is `from: reverse_proxy`; users reach them *through* the reverse proxy, which
|
||||||
|
alone carries `from: lan, port: 443`. This matches "services sit behind the reverse
|
||||||
|
proxy with authentication" (ADR-002).
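The slicing rule in the table above can be sketched as a small function: the host layer takes every ingress rule, while the OPNsense layer takes only rules whose `from` names a zone other than `srv`. The function name, the `zones` set, and the flattened rule dicts are assumptions for illustration; the real rendering mechanism is deferred.

```python
def split_by_layer(catalog, zones, local_zone="srv"):
    """Split catalog ingress rules into (host_rules, opnsense_rules).

    Host nftables enforces every rule. OPNsense only sees traffic that
    crosses a VLAN boundary, so it gets rules sourced from another zone.
    """
    host_rules, opnsense_rules = [], []
    for service, entry in catalog.items():
        for rule in entry.get("ingress", []):
            item = {"service": service, **rule}
            host_rules.append(item)
            if rule["from"] in zones and rule["from"] != local_zone:
                opnsense_rules.append(item)
    return host_rules, opnsense_rules
```

With the two-service catalog shown earlier, both rules land in the host slice, and only the reverse proxy's `from: lan` rule lands in the OPNsense slice.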
|
||||||
|
|
||||||
|
"Shared catalog, each layer renders its own" was chosen over a single
|
||||||
|
connectivity-model-generates-both (too much machinery, tight coupling of two very
|
||||||
|
different rule domains) and over fully independent per-layer declarations (real drift
|
||||||
|
risk: a port opened on the host but not at OPNsense, or vice versa).
|
||||||
|
|
||||||
|
## OPNsense automation — owned here, mechanism deferred
|
||||||
|
|
||||||
|
OPNsense is **Ansible-managed** (CLAUDE.md: "OPNsense is entirely Ansible; do not reach
|
||||||
|
for a Terraform OPNsense provider"). It renders the **cross-VLAN slice** of the catalog
|
||||||
|
(every `from: <other-zone>` rule) plus the static ADR-007 facts (WAN edge, per-VLAN
|
||||||
|
egress, mgmt access, inter-VLAN defaults).
|
||||||
|
|
||||||
|
This ADR pins **what** OPNsense owns and that it renders from the shared catalog. The
|
||||||
|
**how** — config-XML templating vs the OPNsense API vs a plugin — is a substantial,
|
||||||
|
separate tooling decision, **deferred to the OPNsense-as-code follow-up spec**. Recorded
|
||||||
|
here as an explicit open sub-decision so it is not lost.
|
||||||
|
|
||||||
|
## Guardrails & enforcement
|
||||||
|
|
||||||
|
- **The catalog is authoritative.** If a port is not in the catalog, it does not exist.
|
||||||
|
This hardens the existing CLAUDE.md guardrail ("never open a firewall port ad-hoc on a
|
||||||
|
host") into a positive contract.
|
||||||
|
- **The `firewall` tag** (ADR-019) marks firewall tasks, so `--tags firewall` re-renders
|
||||||
|
rules on `base` and any service role that contributes them.
|
||||||
|
- **Drift detection (aspiration).** A deterministic check — in the spirit of
|
||||||
|
`scripts/check-tags.py` — compares each host's actual listening ports / live `nft`
|
||||||
|
ruleset against the catalog and flags anything undeclared. Ties to TODO 8.5
|
||||||
|
(`/security-review`) and the "undeclared open ports" pre-scan idea. Listed as a
|
||||||
|
consequence and future guardrail; not necessarily built in the first implementation.
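The drift check described above reduces to a set difference between what a host is actually listening on and what the catalog declares for it. A minimal sketch, assuming the listening sockets have already been collected as `(port, proto)` pairs by some other means; the function name and the `always_allowed` parameter (for management ports that live outside the catalog) are hypothetical.

```python
def undeclared_listeners(listening, declared_rules, always_allowed=frozenset()):
    """Return sorted (port, proto) pairs that are open but not declared.

    `listening` is an iterable of (port, proto) tuples observed on the host;
    `declared_rules` are the catalog ingress rules that apply to that host.
    """
    declared = {(rule["port"], rule["proto"]) for rule in declared_rules}
    return sorted(set(listening) - declared - set(always_allowed))
```

An empty result means no drift; anything returned is a port that exists on the host without a catalog entry, which is exactly what the "catalog is authoritative" guardrail forbids.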
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- "Per-host vs central" is answered: **both**, with clear ownership — central perimeter
|
||||||
|
(OPNsense) + per-host default-deny with east-west allowlisting, fed by one catalog.
|
||||||
|
- Lateral movement within `srv` is constrained (the gap OPNsense can't close).
|
||||||
|
- One declarative catalog means no ad-hoc ports and no cross-layer drift on the shared
|
||||||
|
facts (ports, IPs, sources).
|
||||||
|
- Cost: the catalog and the render-per-layer machinery must be built and maintained;
|
||||||
|
east-west allowlisting adds per-service ingress declarations (mitigated by the
|
||||||
|
proxied-by-default pattern, which keeps most entries to a single line).
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
**This ADR decides:** the two-layer model and each layer's responsibilities; host
|
||||||
|
nftables = default-deny inbound + east-west allowlist + permissive egress + guaranteed
|
||||||
|
management plane + Docker `iptables:false`; the shared `group_vars` service catalog as
|
||||||
|
single source of truth with symbolic sources; each layer renders its own slice; the
|
||||||
|
no-ad-hoc-ports guardrail.
|
||||||
|
|
||||||
|
**Deferred to follow-up specs (each its own brainstorm → plan):**
|
||||||
|
|
||||||
|
1. **Host nftables implementation** in `base` — exact catalog schema, nftables template
|
||||||
|
structure, Docker `iptables:false` integration, fail-safe ordering, Molecule tests.
|
||||||
|
The natural next spec.
|
||||||
|
2. **OPNsense-as-code** — the tooling mechanism + cross-VLAN rule rendering.
|
||||||
|
3. **Drift-detection check** — if/when we build it.
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-002 (security baseline: nftables default-deny, fail2ban, blast radius),
|
||||||
|
ADR-004 (Docker model: `iptables:false`), ADR-007 (network topology, VLANs, OPNsense,
|
||||||
|
per-VLAN egress), ADR-016 (NetBird mesh: SSH on `wt0` only), ADR-019 (`firewall` tag).
|
||||||
|
|
@@ -0,0 +1,219 @@
|
||||||
|
# Design — Host nftables firewall (the `firewall` concern of `base`)
|
||||||
|
|
||||||
|
- **Date:** 2026-06-06
|
||||||
|
- **Status:** Approved design — pending implementation plan
|
||||||
|
- **Implements:** ADR-020 deferred build #1 (host nftables in `base`)
|
||||||
|
- **Scope:** The **`firewall`-tagged concern of the `base` role only**. Other `base`
|
||||||
|
concerns (SSH hardening, fail2ban, auditd, packages, users) are separate future efforts.
|
||||||
|
Docker netfilter is deferred to the `docker_host` role.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
ADR-020 settled the firewall *strategy*: a per-host nftables layer doing default-deny
|
||||||
|
inbound + east-west allowlisting + permissive egress, rendered from a shared
|
||||||
|
`group_vars` service catalog. Nothing is built yet — `roles/base/` is empty. This spec
|
||||||
|
designs the concrete host firewall: the catalog schema, how rules are resolved and
|
||||||
|
rendered, how they are applied without locking out the host, and how it is tested.
|
||||||
|
|
||||||
|
Two hard constraints shape the design:
|
||||||
|
|
||||||
|
1. **Molecule runs in a privileged Docker container sharing the dev host (`ubongo`)
|
||||||
|
kernel netfilter** — applying real nftables rules there could mutate the live host.
|
||||||
|
So Level-1 testing renders and syntax-checks but does **not** apply.
|
||||||
|
2. **Lockout risk**: a bad ruleset can cut off SSH/Ansible. On-cluster hosts have the
Proxmox console as break-glass; offsite `askari` (Hetzner) has no comparably cheap fallback.
|
||||||
|
|
||||||
|
## Scope decisions (settled in brainstorming)
|
||||||
|
|
||||||
|
- **Host firewall only**, coherent on any host (even one with no services). Docker
|
||||||
|
`iptables:false` + container forward/NAT/masquerade are **deferred to `docker_host`**,
|
||||||
|
which contributes rules via an extension hook (below).
|
||||||
|
- **Placement lives in the catalog** (`host:` | `group:` | `hosts:`), giving one source
|
||||||
|
of truth that also resolves symbolic sources. Proxmox HA/migration moves a *VM*
|
||||||
|
between physical nodes but the VM keeps its static `srv` IP and inventory identity, so
|
||||||
|
node-level failover is invisible to the firewall. A planned service relocation is a
|
||||||
|
one-line catalog edit + `--tags firewall` re-deploy (which re-renders opened ports
|
||||||
|
*and* every source resolution consistently). Within-group HA is handled by placing a
|
||||||
|
service on a `group`/`hosts` list — the allowlist then already covers every member.
|
||||||
|
- **Level-1 testing = render + `nft -c` syntax check, no apply.** Enforcement is
|
||||||
|
verified at Level 2 on staging VMs.
|
||||||
|
- **Auto-rollback safety net** on apply (critical for offsite `askari`).
|
||||||
|
|
||||||
|
## Role layout
|
||||||
|
|
||||||
|
Scaffold with `make new-role base`, then implement the firewall concern:
|
||||||
|
|
||||||
|
```
roles/base/
  tasks/main.yml                    # include_tasks firewall.yml (tags: [firewall]); grows later
  tasks/firewall.yml                # install nftables, render, validate, safe-apply
  filter_plugins/firewall_rules.py  # pure catalog→resolved-rules resolver (pytest-unit-tested)
  templates/nftables.conf.j2
  defaults/main.yml                 # base__firewall_* behaviour knobs
  handlers/main.yml
  molecule/default/                 # fixture catalog + inventory; converge + verify
  README.md, meta/main.yml
```
|
||||||
|
|
||||||
|
`base` is infrastructure, not a *service* role, so the service-role `SECURITY.md` /
|
||||||
|
`VERIFY.md` conventions (ADR-004) do not apply. The firewall role import in a playbook
|
||||||
|
carries the `base` role-name tag (enforced by `check-tags.py`, ADR-019); the firewall
|
||||||
|
tasks within carry the `firewall` concern tag.
|
||||||
|
|
||||||
|
## Data model — shared catalog + zones
|
||||||
|
|
||||||
|
Two new **global inventory facts** (read by `base` now and OPNsense later, so plain
|
||||||
|
names, not role-namespaced) in `inventories/<env>/group_vars/all/firewall.yml`:
|
||||||
|
|
||||||
|
```yaml
# Zone → subnet (from ADR-007)
firewall_zones:
  lan: 10.30.0.0/24
  srv: 10.20.0.0/24
  mgmt: 10.10.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24

# Service catalog: name → placement + ingress
firewall_catalog:
  reverse_proxy:
    host: docker01          # placement: host | group | hosts:[...]
    ingress:
      - { from: lan, port: 443, proto: tcp }
  photoprism:
    host: docker01
    ingress:
      - { from: reverse_proxy, port: 2342, proto: tcp }
```
|
||||||
|
|
||||||
|
- **Placement** is exactly one of `host: <name>`, `group: <group>`, or `hosts: [<name>, …]`.
|
||||||
|
- **`from`** resolves three ways, checked in this order: (1) a key in `firewall_zones`
|
||||||
|
→ that subnet; (2) a key in `firewall_catalog` → that service's placement → host
|
||||||
|
IP(s) as `/32`; (3) an inventory group or host name → its IP(s) as `/32`. An
|
||||||
|
unresolvable `from` is a hard error (fail fast, never silently open/skip).
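
The "exactly one placement key" rule can be checked before any resolution happens. The following is a minimal sketch, not the role's actual code; the helper name `placement_kind` and the error wording are assumptions made for illustration.

```python
# Hypothetical helper: enforce "exactly one of host / group / hosts".
PLACEMENT_KEYS = ("host", "group", "hosts")


def placement_kind(name, entry):
    """Return which placement key a catalog entry uses, or fail fast."""
    present = [key for key in PLACEMENT_KEYS if key in entry]
    if len(present) != 1:
        raise ValueError(
            f"{name}: expected exactly one of {PLACEMENT_KEYS}, found {present}"
        )
    return present[0]
```

Failing here, rather than during rendering, keeps a malformed catalog entry from ever producing a ruleset.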
|
||||||
|
|
||||||
|
Role **behaviour knobs** stay role-namespaced in `roles/base/defaults/main.yml`:
|
||||||
|
|
||||||
|
| Default | Value | Purpose |
|---|---|---|
| `base__firewall_mgmt_interface` | `wt0` | interface SSH is accepted on (NetBird overlay, ADR-016) |
| `base__firewall_ssh_port` | `22` | SSH port allowed on the mgmt interface |
| `base__firewall_rollback_timeout` | `45` | seconds before auto-revert fires |
| `base__firewall_dropin_dir` | `/etc/nftables.d` | extension dir included by the ruleset |
|
||||||
|
|
||||||
|
## Resolution & rendering
|
||||||
|
|
||||||
|
The resolver is a **pure Python filter plugin**, `roles/base/filter_plugins/firewall_rules.py`,
|
||||||
|
exposing `resolve_firewall_rules(catalog, zones, inventory_hostname, hostvars)`. It:
|
||||||
|
|
||||||
|
1. selects catalog entries placed on `inventory_hostname` (matching `host`, membership
|
||||||
|
in `group`, or presence in `hosts`);
|
||||||
|
2. for each entry's `ingress` rules, resolves `from` to a list of source CIDRs (zone /
|
||||||
|
service-placement / group-or-host, per the order above);
|
||||||
|
3. returns a **deterministic, de-duplicated, sorted** list of
|
||||||
|
`{proto, port, sources: [cidr, …]}`.
|
||||||
|
|
||||||
|
Chosen over inline Jinja (unreadable, untestable) and a `set_fact` loop (awkward to
|
||||||
|
unit-test) — a filter plugin matches the house style of `check-tags.py` /
|
||||||
|
`capacity-scan.py` and is pytest-unit-testable in isolation. Host→IP resolution reads
|
||||||
|
`hostvars[<host>].ansible_host` (the static `srv` IP the Terraform-generated inventory
|
||||||
|
provides).
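
A minimal sketch of what such a resolver could look like, under stated assumptions: group membership is read from each host's `group_names` entry in `hostvars`, rules sharing the same `(proto, port)` are merged, and the private helper names are hypothetical. It is an illustration of the three-step description above, not the role's implementation.

```python
def _placed_hosts(entry, hostvars):
    """Hosts a catalog entry is placed on (host | hosts | group)."""
    if "host" in entry:
        return [entry["host"]]
    if "hosts" in entry:
        return list(entry["hosts"])
    if "group" in entry:
        # Assumption: membership comes from hostvars[h]["group_names"].
        return [h for h, v in hostvars.items()
                if entry["group"] in v.get("group_names", [])]
    raise ValueError("catalog entry has no placement key")


def _cidrs(hosts, hostvars):
    return [f"{hostvars[h]['ansible_host']}/32" for h in hosts]


def _resolve_from(source, catalog, zones, hostvars):
    if source in zones:                      # (1) zone -> subnet
        return [zones[source]]
    if source in catalog:                    # (2) service -> placement -> /32
        return _cidrs(_placed_hosts(catalog[source], hostvars), hostvars)
    if source in hostvars:                   # (3) inventory host ...
        return _cidrs([source], hostvars)
    members = [h for h, v in hostvars.items()
               if source in v.get("group_names", [])]
    if members:                              # ... or inventory group
        return _cidrs(members, hostvars)
    raise ValueError(f"unresolvable firewall source: {source!r}")


def resolve_firewall_rules(catalog, zones, inventory_hostname, hostvars):
    merged = {}
    for entry in catalog.values():
        if inventory_hostname not in _placed_hosts(entry, hostvars):
            continue
        for rule in entry.get("ingress", []):
            key = (rule["proto"], int(rule["port"]))
            merged.setdefault(key, set()).update(
                _resolve_from(rule["from"], catalog, zones, hostvars))
    # Sorted keys and sorted sources give deterministic output.
    return [{"proto": proto, "port": port, "sources": sorted(sources)}
            for (proto, port), sources in sorted(merged.items())]


class FilterModule:
    """Standard Ansible filter-plugin entry point."""

    def filters(self):
        return {"resolve_firewall_rules": resolve_firewall_rules}
```

Because the function takes plain dictionaries and returns plain lists, a pytest case can call it directly with a fixture catalog and compare the result to a literal.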
|
||||||
|
|
||||||
|
`tasks/firewall.yml` builds `base__firewall_resolved` from the filter; the template
|
||||||
|
renders that flat list:
|
||||||
|
|
||||||
|
```jinja
#!/usr/sbin/nft -f
flush ruleset
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    iif "lo" accept
    ct state established,related accept
    ct state invalid drop
    iif "{{ base__firewall_mgmt_interface }}" tcp dport {{ base__firewall_ssh_port }} accept
    ip protocol icmp accept
    ip6 nexthdr ipv6-icmp accept
    {% for r in base__firewall_resolved %}
    ip saddr { {{ r.sources | join(', ') }} } {{ r.proto }} dport {{ r.port }} accept
    {% endfor %}
  }
  chain forward { type filter hook forward priority 0; policy drop; }
  chain output { type filter hook output priority 0; policy accept; }
}
include "{{ base__firewall_dropin_dir }}/*.nft"
```
|
||||||
|
|
||||||
|
A host with no catalog entries still gets a valid default-deny + management-plane
|
||||||
|
ruleset. The `include` is the `docker_host` extension hook (forward/NAT drop-ins).
|
||||||
|
Sorted resolved rules → stable diffs and deterministic tests.
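
To make the "flat list in, accept lines out" relationship concrete, here is a small Python mirror of the template's `for` loop. It is only an illustration of what the rendered lines are expected to look like (for example when writing Molecule `verify` assertions); the function name `render_accept_lines` is hypothetical and the real rendering is done by Jinja.

```python
def render_accept_lines(resolved):
    """Mirror the template loop: one accept line per resolved rule."""
    return [
        f"ip saddr {{ {', '.join(rule['sources'])} }} "
        f"{rule['proto']} dport {rule['port']} accept"
        for rule in resolved
    ]
```

Since the resolver's output is already sorted, the same catalog always yields the same lines in the same order.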
|
||||||
|
|
||||||
|
## Safe apply (lockout protection)
|
||||||
|
|
||||||
|
`tasks/firewall.yml` renders `/etc/nftables.conf`; when it changes, a **linear**
|
||||||
|
safe-apply sequence runs (deliberately in tasks, not a handler, so the confirm/cancel
|
||||||
|
step is controllable — a small, justified deviation from the handler idiom, noted in the
|
||||||
|
role README):
|
||||||
|
|
||||||
|
1. **Validate** — `nft -c -f /etc/nftables.conf`; fail the play if invalid, before
|
||||||
|
touching the live ruleset.
|
||||||
|
2. **Snapshot** — `nft list ruleset > /etc/nftables.rollback` (empty/flush on first run).
|
||||||
|
3. **Arm revert** — `systemd-run --on-active={{ base__firewall_rollback_timeout }}
|
||||||
|
--unit=nft-rollback nft -f /etc/nftables.rollback` (transient timer, no `at`
|
||||||
|
dependency).
|
||||||
|
4. **Apply** — `nft -f /etc/nftables.conf`.
|
||||||
|
5. **Confirm + disarm** — the next Ansible task running proves the connection survived →
|
||||||
|
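`systemctl stop nft-rollback.timer` (with `--on-active`, `systemd-run` creates a transient timer unit next to the service; stopping the bare unit name would target the not-yet-started service and leave the timer armed). If the apply bricked connectivity, the play cannot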
|
||||||
|
continue, the timer fires, and the host self-heals (the offsite-`askari` safeguard).
|
||||||
|
6. **Persist** — enable `nftables.service` so `/etc/nftables.conf` loads on boot.
|
||||||
|
|
||||||
|
`established/related` (rendered in the ruleset) means the in-flight Ansible session
|
||||||
|
survives the swap; atomic `nft -f` avoids partial states.
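
The ordering of the safe-apply steps matters more than any single command, so it can help to see them as one ordered plan. The sketch below only builds the command list; it does not run anything. Two points are assumptions rather than part of the design text: the snapshot is prefixed with `flush ruleset` (plain `nft list ruleset` output does not flush, so restoring it alone would add to the live ruleset instead of replacing it), and the disarm step stops the transient `.timer` unit that `systemd-run --on-active` creates.

```python
def safe_apply_plan(timeout=45,
                    conf="/etc/nftables.conf",
                    rollback="/etc/nftables.rollback",
                    unit="nft-rollback"):
    """Return the safe-apply steps as (name, argv) pairs in execution order."""
    snapshot_cmd = f"{{ echo 'flush ruleset'; nft list ruleset; }} > {rollback}"
    return [
        # 1. Syntax-check the rendered file before touching the live ruleset.
        ("validate", ["nft", "-c", "-f", conf]),
        # 2. Save the current ruleset (assumption: prefixed with a flush).
        ("snapshot", ["sh", "-c", snapshot_cmd]),
        # 3. Arm the auto-revert as a transient timer.
        ("arm", ["systemd-run", f"--on-active={timeout}", f"--unit={unit}",
                 "nft", "-f", rollback]),
        # 4. Apply the new ruleset atomically.
        ("apply", ["nft", "-f", conf]),
        # 5. Reaching this step proves connectivity survived; cancel the revert.
        ("disarm", ["systemctl", "stop", f"{unit}.timer"]),
        # 6. Load the ruleset on boot.
        ("persist", ["systemctl", "enable", "nftables.service"]),
    ]
```

In the role these would be individual Ansible tasks; the point of the sketch is that "arm" must come before "apply" and "disarm" must come after it.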
|
||||||
|
|
||||||
|
**NetBird dependency:** locking SSH to `wt0`-only assumes NetBird (ADR-016) is built.
|
||||||
|
Until then, `base__firewall_mgmt_interface` (and, if needed, an additional management
|
||||||
|
source) is set to a reachable path so the role is deployable independently. This is a
|
||||||
|
config knob, not a code dependency.
|
||||||
|
|
||||||
|
## Testing (ADR-008)
|
||||||
|
|
||||||
|
- **Level 1 / pytest** — unit-test `firewall_rules.py` against fixture catalogs: zone
|
||||||
|
resolution, service→host-IP resolution, `group`/`hosts` multi-host placement, a host
|
||||||
|
with no services, source de-dup/sort, and an unresolvable `from` raising. Mirrors
|
||||||
|
`tests/test_check_tags.py` (import the module, assert on return values).
|
||||||
|
- **Level 1 / Molecule** — fixture `firewall_catalog` + fixture inventory (host_vars/
|
||||||
|
group_vars) in the scenario; `converge` renders `/etc/nftables.conf`; `verify` asserts
|
||||||
|
(a) expected accept lines are present for the fixture and (b) `nft -c -f
|
||||||
|
/etc/nftables.conf` validates syntax. **No apply** (kernel safety).
|
||||||
|
- **Level 2 / staging** — real apply on staging VMs verifies enforcement *and* the
|
||||||
|
safe-apply + auto-rollback path (steps 2–5), which Level 1 cannot safely cover.
|
||||||
|
|
||||||
|
The Molecule base image is not guaranteed to ship `nft`. The role installs the
|
||||||
|
`nftables` package as its first firewall task, so by the time `verify` runs the `nft -c`
|
||||||
|
syntax check, `nft` is present (installed during `converge`).
|
||||||
|
|
||||||
|
## Open dependencies / notes
|
||||||
|
|
||||||
|
- **NetBird/ADR-016 unbuilt** — see the mgmt-interface knob above; full `wt0`-only
|
||||||
|
lockdown lands when NetBird does.
|
||||||
|
- The safe-apply orchestration (steps 2–5) has **no Level-1 coverage** by design; it is
|
||||||
|
integration-tested at Level 2. Called out so the gap is explicit.
|
||||||
|
|
||||||
|
## Scope summary
|
||||||
|
|
||||||
|
**Built here:** `firewall_catalog`/`firewall_zones` schema; `firewall_rules.py` resolver
|
||||||
|
+ pytest; `nftables.conf.j2` (default-deny input, mgmt plane, permissive egress, drop-in
|
||||||
|
`include` hook); safe-apply-with-rollback tasks; Molecule render/syntax scenario;
|
||||||
|
`base` role scaffolding (README, meta, defaults, handlers).
|
||||||
|
|
||||||
|
**Deferred:** Docker `iptables:false` + container forward/NAT (→ `docker_host` spec, via
|
||||||
|
the drop-in hook); OPNsense rendering from the same catalog (→ OPNsense-as-code spec);
|
||||||
|
drift-detection check (ADR-020); all other `base` concerns.
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
ADR-020 (firewall strategy), ADR-002 (security baseline), ADR-004 (Docker model —
|
||||||
|
`iptables:false`, one service = one role), ADR-007 (VLANs/subnets), ADR-008 (testing
|
||||||
|
levels), ADR-016 (NetBird mesh — SSH on `wt0`), ADR-019 (`firewall` tag).
|
||||||
188
docs/superpowers/specs/2026-06-06-tagging-strategy-design.md
Normal file
|
|
@ -0,0 +1,188 @@
|
||||||
|
# Design — Ansible tagging standard (targeted, predictable runs)
|
||||||
|
|
||||||
|
- **Date:** 2026-06-06
|
||||||
|
- **Status:** Approved design — pending implementation plan
|
||||||
|
- **Resolves:** TODO 3.7 ("Define a tagging standard that lets us target runs without
|
||||||
|
over-tagging") and TODO 3.11 ("Deliberate tagging strategy") — the same thread
|
||||||
|
- **Becomes:** ADR-019 (this design is the basis for that ADR)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
boma wants to run playbooks **targeted** — a single service, a single layer, or a
|
||||||
|
single cross-cutting concern — and to do so **transparently and predictably**: you
|
||||||
|
should be able to look at a `--tags` invocation and know exactly what it will and won't
|
||||||
|
touch. CLAUDE.md already mandates that every task be tag-filterable, but no *vocabulary*
|
||||||
|
or *naming convention* exists. Without one, tags proliferate ad-hoc per role and the
|
||||||
|
"predictable" property is lost — and the TODO explicitly warns against the opposite
|
||||||
|
failure mode, **over-tagging**.
|
||||||
|
|
||||||
|
The repo is effectively greenfield for this: `base` and `docker_host` are empty, and the
|
||||||
|
only tags in existence are `[base]`/`[docker]` in `site.yml` and `[bootstrap]` in
|
||||||
|
`bootstrap.yml`. So we can bake the standard into role-authoring conventions *before*
|
||||||
|
there are a dozen service roles to retrofit.
|
||||||
|
|
||||||
|
## Targeting axes (what we want to slice by)
|
||||||
|
|
||||||
|
1. **Layer / role** — `--tags base`, `--tags docker`
|
||||||
|
2. **Single service** — `--tags photoprism`, `--tags traefik`
|
||||||
|
3. **Concern / function** — `--tags firewall`, `--tags logging`, …
|
||||||
|
|
||||||
|
Lifecycle phases (bootstrap/config/deploy) are **not** a tag axis — `bootstrap.yml` vs
|
||||||
|
`site.yml` already separate those as whole playbooks.
|
||||||
|
|
||||||
|
Key simplification: because of ADR-004 (*one service = one role*, role name = service
|
||||||
|
name), axes 1 and 2 are the **same mechanism** — a tag equal to the role name. Only the
|
||||||
|
concern axis needs a curated vocabulary.
|
||||||
|
|
||||||
|
## Approach (chosen): two-tier tagging
|
||||||
|
|
||||||
|
**Tier 1 — role/service tag (mechanical).** The tag *equals the role name*, applied
|
||||||
|
**once** at the role-import level in the playbook:
|
||||||
|
|
||||||
|
```yaml
roles:
  - role: photoprism
    tags: [photoprism]
```
|
||||||
|
|
||||||
|
Ansible propagates the tag to every task in the role. This covers both the layer/role
|
||||||
|
and single-service axes with one rule and **zero per-task burden**.
|
||||||
|
|
||||||
|
**Tier 2 — concern tag (curated).** A small **closed, documented list** of cross-cutting
|
||||||
|
concern tags, applied per-task/block **only where a task genuinely belongs to that
|
||||||
|
concern**. `--tags firewall` then hits firewall tasks in `base` and in every service
|
||||||
|
role.
|
||||||
|
|
||||||
|
Rejected alternatives: *concern-only/flat* (loses natural `--tags <service>` ergonomics);
|
||||||
|
*rich multi-dimensional* (role+service+concern+lifecycle+ad-hoc per task) — that is
|
||||||
|
precisely the over-tagging the TODO warns against.
|
||||||
|
|
||||||
|
## The closed concern list
|
||||||
|
|
||||||
|
Litmus test for earning a spot: a concern must (a) appear in **2+ roles**, (b) be
|
||||||
|
something you'd realistically want to run as a slice on its own, and (c) not overlap
|
||||||
|
confusingly with another.
|
||||||
|
|
||||||
|
**Baseline concerns** (mostly in `base`, some echoed in service roles):
|
||||||
|
|
||||||
|
| Tag | Covers |
|-----|--------|
| `packages` | apt package install/management |
| `users` | accounts, groups, sudo |
| `firewall` | nftables rulesets & port definitions (ADR-002) |
| `hardening` | security baseline: sshd config, fail2ban, auditd, sysctl |
| `logging` | Alloy / log-shipping config (ADR-018) |
| `monitoring` | metric exporters / health checks |
|
||||||
|
|
||||||
|
**Service concerns** (in every service role, ADR-004):
|
||||||
|
|
||||||
|
| Tag | Covers |
|-----|--------|
| `config` | render templated config/compose files to disk; **no restart** |
| `deploy` | bring services up / restart (`compose up -d`) |
| `proxy` | reverse-proxy + TLS registration (Traefik routes, Authentik) |
|
||||||
|
|
||||||
|
Nine tags total. The `config`/`deploy` split is deliberate and high-value: `--tags
|
||||||
|
config` re-renders and lets you diff configuration without bouncing services; `--tags
|
||||||
|
deploy` does the restart.
|
||||||
|
|
||||||
|
`backup` and `secrets` are **intentionally omitted** until the roles that need them
|
||||||
|
exist — they enter via the extend process, not speculative reservation.
|
||||||
|
|
||||||
|
## `always` / `never` policy
|
||||||
|
|
||||||
|
boma uses Ansible's two built-in special tags, narrowly:
|
||||||
|
|
||||||
|
- **`always`** — reserved strictly for **cheap preflight assertions** (vault unlocked,
|
||||||
|
OS is Debian 13, required vars present). Ensures even `--tags config` runs its safety
|
||||||
|
guards.
|
||||||
|
- **`never`** — reserved for **destructive/expensive opt-in tasks**, each paired with a
|
||||||
|
descriptive tag (e.g. `never, force_pull` or `never, restore`). They never run unless
|
||||||
|
explicitly named, keeping dangerous actions out of normal runs. The descriptive
|
||||||
|
partner tag is a documented `never`-paired opt-in (allowed by the linter).
|
||||||
|
|
||||||
|
## Predictability principle: tags are union-only
|
||||||
|
|
||||||
|
`--tags a,b` runs tasks tagged a **OR** b — Ansible has no native AND. Rather than fight
|
||||||
|
this, we make it an explicit principle: **boma targets one axis at a time** — *either* a
|
||||||
|
role/service (`--tags photoprism`) *or* a concern (`--tags firewall`), never an
|
||||||
|
intersection like "photoprism's firewall only." If that is ever genuinely needed, the
|
||||||
|
answer is "just run `--tags photoprism`" (idempotent and fast). Designing for
|
||||||
|
intersection is the over-tagging trap; we decline it on purpose.
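
The union-only behaviour can be illustrated with a deliberately simplified model of tag selection. This sketch assumes only `--tags` is in play; it ignores `--skip-tags` and the special selectors `all`, `tagged`, and `untagged`, and the function names are hypothetical.

```python
def effective_tags(role_tag, task_tags=()):
    """A task's tags: the inherited role tag plus its own concern tags."""
    return {role_tag, *task_tags}


def selected(tags, requested):
    """Simplified selection: would this task run for the given --tags list?"""
    tags, requested = set(tags), set(requested)
    if "always" in tags:
        return True
    if not requested:                 # plain run without --tags
        return "never" not in tags
    return bool(tags & requested)     # union: any shared tag is enough


photoprism_fw = effective_tags("photoprism", ["firewall"])
photoprism_cfg = effective_tags("photoprism", ["config"])
base_fw = effective_tags("base", ["firewall"])

# "--tags photoprism,firewall" is a union, not "photoprism's firewall only":
union = ["photoprism", "firewall"]
hits = [selected(t, union) for t in (photoprism_fw, photoprism_cfg, base_fw)]
```

Here `hits` is `[True, True, True]`: the request also pulls in the `config` task of `photoprism` and the firewall task of `base`, which is why the standard targets one axis at a time.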
|
||||||
|
|
||||||
|
## Reconciling the existing CLAUDE.md rule
|
||||||
|
|
||||||
|
CLAUDE.md currently says *"every task must have at least one tag."* Under the two-tier
|
||||||
|
model the role tag is applied **once at the play/import level** and **inherited** by
|
||||||
|
every task, so tasks are always reachable without hand-tagging each one. The rule is
|
||||||
|
**reworded** to:
|
||||||
|
|
||||||
|
> Import each role with its role-name tag (once, at the play level). Within a role, tag a
|
||||||
|
> task/block with a concern tag from the approved list **only where it genuinely belongs
|
||||||
|
> to that concern** — don't invent tags or tag for tagging's sake.
|
||||||
|
|
||||||
|
This directly resolves the "without over-tagging" tension.
|
||||||
|
|
||||||
|
## Terraform / Proxmox VM tags (metadata only)
|
||||||
|
|
||||||
|
Formalize the convention that already half-exists in `staging/main.tf`
|
||||||
|
(`tags = ["staging", each.value.group]`). Every TF-managed VM gets exactly three tags:
|
||||||
|
|
||||||
|
| Tag | Value | Purpose |
|-----|-------|---------|
| env | `staging` \| `production` | which environment |
| role/group | `docker_hosts`, `proxmox_hosts`, … | matches the inventory group |
| managed-by | `terraform` | distinguishes IaC VMs from hand-made ones |
|
||||||
|
|
||||||
|
Set as `tags = ["${env}", each.value.group, "managed-by=terraform"]` in the env
|
||||||
|
`main.tf` (env is constant per directory).
|
||||||
|
|
||||||
|
**Explicit non-goals** (stated so nobody wires them up later): these tags are **pure
|
||||||
|
metadata for transparency** — glanceable in the Proxmox UI. They do **not** drive
|
||||||
|
run-targeting and do **not** feed inventory. `scripts/tf_to_inventory.py` keeps building
|
||||||
|
groups from the `group` output field, which stays the single source of truth.
|
||||||
|
|
||||||
|
## Enforcement
|
||||||
|
|
||||||
|
A small **lint check wired into `make lint`**: a script collects every `tags:` value
|
||||||
|
across `roles/` and `playbooks/` and fails if any tag is not in the allowed set:
|
||||||
|
|
||||||
|
```
{role names} ∪ {9 concern tags} ∪ {always, never} ∪ {documented never-paired opt-ins}
```
|
||||||
|
|
||||||
|
The allowed concern list (and the `never`-paired opt-ins) live in **one
|
||||||
|
machine-readable file, `tests/tags.yml`**, which both the linter reads and the ADR
|
||||||
|
documents — so doc and enforcement cannot drift. This is more honest than ansible-lint's
|
||||||
|
limited built-in tags rule. A unit test (mirroring `tests/test_capacity_scan.py`) covers
|
||||||
|
the checker.
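
The core of such a checker is small. The sketch below works on already-parsed YAML (plain dicts and lists), so loading files and reading `tests/tags.yml` are left out; the function names are hypothetical and this is not the actual `check-tags.py`.

```python
def _as_list(value):
    """Accept both `tags: [a, b]` and the comma-separated string form."""
    if isinstance(value, str):
        return [tag.strip() for tag in value.split(",") if tag.strip()]
    return list(value)


def collect_tags(node):
    """Recursively gather every value found under a `tags` key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "tags":
                found.update(_as_list(value))
            else:
                found |= collect_tags(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_tags(item)
    return found


def disallowed_tags(documents, role_names, concern_tags, never_opt_ins):
    """Return the sorted tags that fall outside the allowed set."""
    allowed = (set(role_names) | set(concern_tags)
               | {"always", "never"} | set(never_opt_ins))
    used = set()
    for document in documents:
        used |= collect_tags(document)
    return sorted(used - allowed)
```

A lint wrapper would exit non-zero whenever `disallowed_tags(...)` returns a non-empty list and print the offending names.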
|
||||||
|
|
||||||
|
## The "propose to extend" process
|
||||||
|
|
||||||
|
To add a concern tag: (1) add it to `tests/tags.yml`; (2) add a row to the ADR-019 table
|
||||||
|
with a one-line justification showing it passes the litmus test (cross-cutting, 2+
|
||||||
|
roles, distinct). That is the whole gate — lightweight, but it leaves a paper trail.
|
||||||
|
|
||||||
|
## Deliverables
|
||||||
|
|
||||||
|
- **New `docs/decisions/019-tagging.md`** — the standard: rationale, two-tier model,
|
||||||
|
concern table, union-only principle, `always`/`never` policy, Proxmox tag convention,
|
||||||
|
extend process.
|
||||||
|
- **`tests/tags.yml`** — machine-readable allowed concern list + `never`-paired opt-ins.
|
||||||
|
- **Lint checker script** (e.g. `scripts/check-tags.py`) + **`make lint`** wiring +
|
||||||
|
**`tests/test_check_tags.py`**.
|
||||||
|
- **CLAUDE.md** — reword the tag bullet under *Ansible conventions*; add the Proxmox tag
|
||||||
|
convention under *Terraform conventions*; add ADR-019 to *Further reading*.
|
||||||
|
- **`terraform/environments/{staging,production}/main.tf`** — apply the three-tag
|
||||||
|
convention.
|
||||||
|
- **`docs/TODO.md`** — mark 3.7 and 3.11 DECIDED (ADR-019).
|
||||||
|
- **`docs/CAPABILITIES.md`** — note targeted runs as a capability, if it fits.
|
||||||
|
|
||||||
|
## Out of scope
|
||||||
|
|
||||||
|
- Intersection targeting (role ∩ concern) — declined on purpose (see principle).
|
||||||
|
- Lifecycle-phase tags — handled by separate playbooks.
|
||||||
|
- Proxmox tags feeding inventory or run-targeting — metadata only.
|
||||||
|
- `backup`/`secrets` concern tags — added later via the extend process.
|
||||||
214
docs/superpowers/specs/2026-06-09-operational-access-design.md
Normal file
|
|
@ -0,0 +1,214 @@
|
||||||
|
# Design — Operational access (ADR-021)
|
||||||
|
|
||||||
|
- **Date:** 2026-06-09
|
||||||
|
- **Status:** Approved design — pending implementation plan
|
||||||
|
- **Implements:** New ADR-021. Resolves TODO 3.2 (API / API access) and TODO 7.2
|
||||||
|
(what to set up on hosts, given direct access will be rare).
|
||||||
|
- **Amends:** ADR-016 (SSH was mesh-only; now also from `ubongo`'s LAN address) and
|
||||||
|
ADR-020 (adds an `ssh-from-control` symbolic catalog source).
|
||||||
|
- **Scope:** The operational-access *doctrine* + the declarative `access__*` data model,
|
||||||
|
the rendered `ACCESS.md` record, and the `/check-access` verifier design. It does **not**
|
||||||
|
build any of it — `base`/service roles and live hosts don't exist yet. Designed now,
|
||||||
|
built when there is something to access.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
|
||||||
|
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
|
||||||
|
ports (ADR-002/020). That posture is correct — but it leaves an unanswered operational
|
||||||
|
question: **when a service or host breaks, how does the operator (and the AI working on
|
||||||
|
boma's behalf from `ubongo`) actually get in to troubleshoot it?**
|
||||||
|
|
||||||
|
Experience on similar projects shows troubleshooting is far more effective with *several*
|
||||||
|
documented ways in — SSH, container exec, logs, an admin API — so a single broken path
|
||||||
|
doesn't leave the operator blind. Today boma has no standard guaranteeing those paths exist, are
|
||||||
|
documented, or still work. The risk is the classic one: the access you assumed you had is
|
||||||
|
stale exactly when you need it (key rotated, API disabled, token expired).
|
||||||
|
|
||||||
|
boma already has the right *shape* for the fix. Service roles carry record docs —
|
||||||
|
`SECURITY.md` (security answers) and `VERIFY.md` (acceptance spec) — gated by the service
|
||||||
|
checklist and the `new-role` runbook. What's missing is the third sibling: an
|
||||||
|
**operational access record**, plus the doctrine behind it.
|
||||||
|
|
||||||
|
Two constraints shape the design:
|
||||||
|
|
||||||
|
1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
|
||||||
|
paths over the *trusted* interface, never new exposed ports. Resolution: all routine
|
||||||
|
access runs over the mesh from `ubongo`.
|
||||||
|
2. **A documented path that is never tested drifts.** It fails exactly when needed. So
|
||||||
|
the structured access facts must be *data* that both renders the doc and drives an
|
||||||
|
active verifier — the two can then never disagree.
|
||||||
|
|
||||||
|
## Decisions settled in brainstorming
|
||||||
|
|
||||||
|
- **Access is a deployment deliverable.** The deploy that creates a host/service also
|
||||||
|
records and (by design) proves its access paths. Not rediscovered under pressure.
|
||||||
|
- **All routine access over the mesh** (`wt0`, from `ubongo`). No new WAN exposure, and no new LAN exposure beyond the single `ubongo` SSH source listed below.
|
||||||
|
- **Two layers:** a host-level access baseline (resolves TODO 7.2) and a per-service
|
||||||
|
access record (resolves TODO 3.2).
|
||||||
|
- **Baseline paths, every service:** host SSH, container exec + compose, logs
|
||||||
|
(Loki/Grafana, ADR-018), and the service admin API where one exists (`n/a` otherwise).
|
||||||
|
- **A new first-class sibling record** `ACCESS.md` (next to `SECURITY.md`/`VERIFY.md`),
|
||||||
|
**rendered from declarative data** — not hand-written prose (the firewall-catalog
|
||||||
|
philosophy of ADR-020 applied to access).
|
||||||
|
- **Active verification designed in:** a `/check-access` skill probes the declared paths
|
||||||
|
and reports which are live — the access analogue of `/verify-service` (ADR-017).
|
||||||
|
- **Direct LAN SSH from `ubongo` only** is added as a second, mesh-independent path
|
||||||
|
(amends ADR-016); all other LAN hosts stay blocked by default-deny.
|
||||||
|
|
||||||
|
## The doctrine
|
||||||
|
|
||||||
|
> **Every host and every service guarantees at least one documented, verifiable way in
|
||||||
|
> for operational troubleshooting — and the deploy that creates it also records and
|
||||||
|
> proves it.**
|
||||||
|
|
||||||
|
### Two layers
|
||||||
|
|
||||||
|
- **Host layer** (TODO 7.2). Every host, via the `base` role, guarantees a fixed access
|
||||||
|
baseline: SSH over `wt0` and from `ubongo` (below), Docker/Compose tooling present, and
|
||||||
|
log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a known, uniform set of
|
||||||
|
paths exists over the mesh. This is boma's answer to "what every host runs for access."
|
||||||
|
- **Service layer** (TODO 3.2). Every service role guarantees and records its paths:
|
||||||
|
container exec + compose management, its Loki log labels, and its admin API where one
|
||||||
|
exists (enabled, token in vault, endpoint + health probe documented) or explicit `n/a`.
|
||||||
|
|
||||||
|
### The three-tier access ladder
|
||||||
|
|
||||||
|
1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
|
||||||
|
before SSH sees it. The preferred path (ADR-016's original rationale).
|
||||||
|
2. **LAN SSH from `ubongo` — secondary, mesh-independent.** Most hardware (all but
|
||||||
|
`askari`) shares a LAN. SSH from `ubongo`'s LAN address is allowed via a new catalog
|
||||||
|
source, giving a fallback that survives a NetBird/`wt0` outage. It is gated by *source
|
||||||
|
IP* (spoofable on a LAN) **plus** the standing keys-only + fail2ban SSH hardening, so
|
||||||
|
the marginal cost is "the SSH daemon accepts LAN connections, but only from one
|
||||||
|
trusted host" — modest and deliberate. All *other* LAN hosts remain default-denied.
|
||||||
|
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, not
|
||||||
|
used for routine work:
|
||||||
|
- **Cluster VMs** → Proxmox serial/VNC console (`qm terminal` / console via the
|
||||||
|
Proxmox host) — independent of the guest network, `wt0`, and even a broken guest
|
||||||
|
nftables ruleset.
|
||||||
|
- **`askari`** (bare-metal Hetzner) → provider rescue/console.
|
||||||
|
- **`ubongo`** (physical) → local console.
|
||||||
|
|
||||||
|
A total mesh outage therefore still leaves at least one documented way in to each box: LAN SSH from `ubongo` for hosts on the shared LAN, and the console for every host.
|
||||||
|
|
||||||
|
## The declarative access data model (Approach B)
|
||||||
|
|
||||||
|
Structured access facts live as **data** — the single source of truth that both renders
|
||||||
|
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge.
|
||||||
|
|
||||||
|
### Service-layer — `access__*` in each service role's defaults
|
||||||
|
|
||||||
|
```yaml
access__service: photoprism
access__compose_project: photoprism              # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db]  # exec targets
access__log:
  loki_labels: { service: photoprism }           # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "https://photoprism.host:2342"       # reachable over the mesh
  firewall_ref: photoprism-api                   # the catalog entry that opens it (ADR-020)
  auth: { type: token, vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status"                  # what /check-access pings
# where the service has no API:
# access__api: { enabled: false, reason: "<none upstream>" }
```
|
||||||
|
|
||||||
|
**Single-source-of-truth rule:** `access__api` **never opens a port**. It `firewall_ref`s
|
||||||
|
the entry in the `group_vars` firewall catalog — ADR-020 stays the sole owner of
|
||||||
|
*exposure*. The access data adds only *how to use* the path (endpoint, token ref, health
|
||||||
|
probe). No duplication, no ad-hoc ports (CLAUDE.md: ports only in the catalog).
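
One way to keep this rule honest is a cross-check that every enabled `access__api` points at an entry that really exists in the firewall catalog. The following is a sketch under assumptions: the function name and the shape of `services` (service name mapped to its `access__*` variables) are hypothetical.

```python
def missing_firewall_refs(services, firewall_catalog):
    """List (service, firewall_ref) pairs whose ref is not in the catalog."""
    missing = []
    for name, access in sorted(services.items()):
        api = access.get("access__api", {})
        if not api.get("enabled"):
            continue  # no API declared, so nothing needs to be opened
        ref = api.get("firewall_ref")
        if ref not in firewall_catalog:
            missing.append((name, ref))
    return missing
```

An empty result means every declared API path is backed by a catalog entry, so the access data never implies a port that the firewall catalog does not own.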
|
||||||
|
|
||||||
|
### Host-layer — a fixed baseline, stated once
|
||||||
|
|
||||||
|
The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
|
||||||
|
uniform, so it is asserted by `base` and recorded once at the host/group level — not
|
||||||
|
re-stated per service. The break-glass console per host class is recorded with it.
|
||||||
|
|
||||||
|
## The rendered record — `ACCESS.md`
|
||||||
|
|
||||||
|
`ACCESS.md` is **rendered** from the `access__*` data, with a prose tail for the genuinely
|
||||||
|
narrative parts:
|
||||||
|
|
||||||
|
- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
|
||||||
|
exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
|
||||||
|
invocation (`ssh host`, `docker compose -p <project> …`, the Loki query, the `curl`
|
||||||
|
against the API health path).
|
||||||
|
- **Break-glass (generated from host class)** — the Proxmox/provider console line.
|
||||||
|
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
|
||||||
|
part a template cannot know.
|
||||||
|
|
||||||
|
A `docs/access/service-access-template.md` defines the shape, alongside the existing
|
||||||
|
security/verify templates.
|
||||||
|
|
||||||
|
## The verifier — `/check-access` (designed now, build-pending on infra)
|
||||||
|
|
||||||
|
Runs from `ubongo`; turns the `access__*` data into live probes. Invoked
|
||||||
|
`/check-access <service>` (or `<host>` for the host baseline). The access analogue of
|
||||||
|
`/verify-service` (ADR-017).
|
||||||
|
|
||||||
|
| Path | Probe | Green = |
|---|---|---|
| `wt0` mesh SSH | connect over mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via LAN addr, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each container | stack up, exec works |
| logs | query Loki for `loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` the `health_path` with the vault token | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint only | console host reachable |
|
||||||
|
|
||||||
|
- **Break-glass is checked for reachability, not exercised** — firing a serial console is
|
||||||
|
invasive; the verifier confirms the fallback *exists* without disrupting anything.
|
||||||
|
- **Output:** a pass/fail table; on any red, it names the path and the likely cause
|
||||||
|
("API token in vault stale", "Alloy not shipping", "`ssh-from-control` catalog source
|
||||||
|
missing"). The payoff: not "the doc *says* you can get in" but "verified — three of four
|
||||||
|
paths green right now, here's the broken one."
|
||||||
|
- **Status:** designed now, build-pending on infra (needs live hosts + staging + vault),
|
||||||
|
exactly like `/verify-service` under ADR-017.
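
As a rough illustration of "data in, probes out", the sketch below turns one service's `access__*` variables into an ordered list of probe commands. It is an assumption-laden outline, not the verifier's design: `build_probes`, `mesh_host`, and `lan_addr` are hypothetical names, the Loki and break-glass probes are omitted, and the API probe leaves out the auth header because the token header format is not specified here.

```python
def build_probes(access, mesh_host, lan_addr):
    """Return (path, argv) pairs for the probes that can be derived from data."""
    project = access["access__compose_project"]
    probes = [
        ("mesh-ssh", ["ssh", mesh_host, "true"]),
        ("lan-ssh", ["ssh", lan_addr, "true"]),
        ("compose", ["ssh", mesh_host,
                     "docker", "compose", "-p", project, "ps"]),
    ]
    for container in access.get("access__containers", []):
        probes.append((f"exec:{container}",
                       ["ssh", mesh_host, "docker", "exec", container, "true"]))
    api = access.get("access__api", {})
    if api.get("enabled"):
        # -f makes curl fail on HTTP errors, so a non-2xx answer turns red.
        probes.append(("api", ["curl", "-fsS",
                               api["base_url"] + api["health_path"]]))
    return probes
```

Running each argv and recording the exit status would give the pass/fail table; a service with `access__api.enabled: false` simply produces no API probe.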
|
||||||
|
|
||||||
|
## Governance — so it can't be forgotten

Three light touches mirror how `SECURITY.md`/`VERIFY.md` are enforced:

1. **Service checklist** (`docs/security/service-checklist.md`) gains one item: *"Access
   paths declared (`access__*`), `ACCESS.md` rendered, `/check-access` green — or
   deviation recorded in `accepted-risks.md`."*
2. **`new-role` runbook** (`docs/runbooks/new-role.md`) gains a step: fill `access__*`,
   render `ACCESS.md`, run `/check-access`.
3. **`make new-role` scaffold** drops a stub `access__*` block + the `ACCESS.md` template
   into the role — the same way roles already get `SECURITY.md`/`VERIFY.md` stubs, so it
   is structurally impossible to ship a service role with no access record.

## Repo wiring

- **`docs/decisions/021-operational-access.md`** — the new ADR (doctrine, both layers,
  the three-tier ladder, break-glass, the `access__*` model, `/check-access`).
- **`docs/decisions/016-mesh-vpn.md`** — amend: SSH on `wt0` **and** from `ubongo`'s LAN
  address (was mesh-only). Cross-link ADR-021.
- **`docs/decisions/020-firewall.md`** — note the new `ssh-from-control` symbolic source.
- **`docs/access/service-access-template.md`** — the rendered `ACCESS.md` shape.
- **`docs/security/service-checklist.md`** — the one new gate item.
- **`docs/runbooks/new-role.md`** — the fill/render/`check-access` step.
- **`CLAUDE.md`** — `ACCESS.md` under "Role conventions"; ADR-021 in Further reading.
- **`STATUS.md`** — rows: ADR-021 doctrine *(designed)*; `ssh-from-control` catalog source
  *(designed, builds with `base` firewall)*; `/check-access` *(designed, build-pending)*.
- **`docs/TODO.md`** — mark 3.2 and 7.2 DECIDED → ADR-021.

## What is buildable now vs later

- **Now:** the doctrine, ADR-021, the `ACCESS.md` template, the checklist/runbook/scaffold
  wiring, and the `ssh-from-control` catalog source (the `firewall` concern of `base`
  already exists, so the source can land with it).
- **Later (build-pending on infra):** `/check-access` *running*, and per-service
  `ACCESS.md` *files* — both wait on service roles + live hosts. Designed now, built when
  there is something to verify.

## Out of scope

- Building `base`'s non-firewall concerns, any service role, or live hosts.
- Broader LAN SSH (a management VLAN) — explicitly rejected; `ubongo`-only.
- Exercising (vs reachability-probing) the break-glass console.
- Any access path that is not over the mesh or the one `ubongo` LAN source.

164
docs/superpowers/specs/2026-06-10-adr-structure-design.md
Normal file
@@ -0,0 +1,164 @@
# Design — ADR structure & lifecycle

- **Date:** 2026-06-10
- **Status:** Approved design — implementation plan to follow
- **Resolves:** the absence of a written standard for how ADRs in
  `docs/decisions/` are structured. The newest ADRs (019–022) have converged on a
  clean pattern (`Status` → `Context` → `Decision` → `Consequences` → `Related`),
  but it lives only as imitation; ADRs 001–018 predate it and most lack a `Status`
  section.
- **Becomes:** ADR-023 (this design is the basis for that ADR).
- **Reuses:** boma's existing `*-template.md` convention (`service-security-template.md`,
  `service-verify-template.md`, `service-access-template.md`, `service-backup-template.md`);
  ADR-014 (knowledge-sourcing → the optional `Verified facts` section); ADR-019/020/021/022
  (the emergent structure being codified); the `/review-repo` command (enforcement home).

---

## Problem

boma documents architectural decisions as numbered ADRs in `docs/decisions/`, and
CLAUDE.md treats them as load-bearing ("Before assuming a role, provider, or pipeline
exists, check STATUS.md"; the entire "Further reading" table points into them). Yet
there is no ADR that says how an ADR is written. The result:

- **Structural drift.** ADRs 001–018 are freeform; 019–022 converged on a consistent
  shape but only by imitation. A new ADR's structure depends on which existing one the
  author happened to copy.
- **No status discipline.** Most early ADRs have no `## Status` section, so there is no
  uniform way to tell an active decision from a superseded or deprecated one — and no
  written rule for how a decision gets reversed without silently rewriting history.
- **No scaffold.** Every other recurring document type in boma has a template
  (`service-security-template.md`, etc.). ADRs do not.

This design codifies the structure 019–022 already demonstrate, pins a status
lifecycle, ships a template, and reconciles the back-catalogue.

## Scope

- **In:** the canonical section set (mandatory + optional); title and filename
  convention; the `Proposed / Accepted / Superseded / Deprecated` status lifecycle and the
  no-silent-rewrite rule; cross-reference convention; an ADR template file; a
  lightweight `/review-repo` structure check; a **one-time retroactive restructure of
  ADRs 001–018** to full conformance (all four mandatory sections + a parseable Status
  line), reorganizing existing content under canonical headings.
- **Out (for now):** *changing the substance of* any existing decision (the restructure
  is presentational — relabel/regroup/demote existing content, add a dated Status, never
  alter what was decided); a `make lint` / CI gate for ADR structure (explicitly
  rejected in favour of the `/review-repo` check — consistent with boma's other doctrine
  ADRs, which add no CI gate); grandfathering pre-convention ADRs from the check
  (rejected — the whole corpus is brought to conformance instead).

The lifecycle uses four states — `Proposed / Accepted / Superseded / Deprecated`. An
earlier draft of this design omitted `Proposed`, but ADR-011 (a real draft with open
questions) is evidence boma occasionally needs it, so it was adopted.

## Decision

### 1. Title & filename
- Title line: `# ADR-NNN — <Title>: <optional clarifying subtitle>` (em-dash `—`,
  matching every existing ADR).
- Filename: `NNN-kebab-title.md`, zero-padded 3-digit, monotonic, **never reused**
  (a superseded ADR keeps its number and file).
- A new ADR is registered as a row in the CLAUDE.md "Further reading" table.

### 2. Canonical sections

**Mandatory — every ADR, in this order:**

| Section | Holds |
|---|---|
| `## Status` | `Accepted (YYYY-MM-DD)`, plus an optional one-line note (what it resolves/supersedes, or a doctrine-not-yet-built caveat as ADR-022 uses) |
| `## Context` | the forces, the problem, what exists today, why now |
| `## Decision` | what we are doing — numbered sub-decisions for multi-part ADRs, as 020/021/022 do |
| `## Consequences` | results, trade-offs *explicitly accepted*, follow-on work |

**Optional — use only where genuinely applicable, never as padding:**

- `## Related` — links to other ADRs by number.
- `## Scope` — explicit in/out-of-scope boundaries.
- `## Guardrails` / `## Enforcement` — how the decision is mechanically enforced
  (lint, CI, hooks).
- `## What was ruled out` — rejected alternatives, each with its reason.
- `## Verified facts (ADR-014)` — version-stamped facts per the knowledge-sourcing rule.

### 3. Status lifecycle

Four states. Most ADRs are **born `Accepted (YYYY-MM-DD)`** — the sole author commits
to it on writing (boma is single-contributor and trunk-based with no review gate).

- **`Proposed (YYYY-MM-DD)`** — a genuine draft whose core direction is recorded but
  whose specifics are still open (e.g. ADR-011, which carries open questions). Promoted
  to `Accepted (YYYY-MM-DD)` once settled.
- **`Accepted (YYYY-MM-DD)`** — committed-to; the common starting state.
- Replaced by a later decision → the old ADR's Status becomes
  **`Superseded by ADR-NNN (YYYY-MM-DD)`**; the superseding ADR records
  `Supersedes ADR-MMM` in its own `## Status` and `## Related`. The link is
  **bidirectional** — both files must point at each other.
- Retired with no replacement → **`Deprecated (YYYY-MM-DD)`** plus a one-line reason.

**Load-bearing rule — no silent rewrites.** An `Accepted` ADR is not edited to reverse
its decision. Typo and clarity fixes are fine; a *material reversal* requires a new ADR
and a `Superseded by` marker on the old one. The history of decisions stays legible.

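What "a parseable Status line" might mean in practice can be sketched as a small parser for the four states. This is an illustration under assumptions: the exact grammar is not fixed by this design, `parse_status` is a hypothetical helper, and ADR numbers are assumed to be three digits to match the filename convention. Anything after the dated clause is treated as the optional one-line note.

```python
import re
from datetime import date

# Assumed grammar: "<State> (YYYY-MM-DD)" optionally followed by a note.
STATUS_RE = re.compile(
    r"^(?:(?P<state>Proposed|Accepted|Deprecated)"
    r"|Superseded by ADR-(?P<by>\d{3}))"
    r" \((?P<date>\d{4}-\d{2}-\d{2})\)"
)


def parse_status(line: str):
    """Return (state, superseded_by, date), or None if the line is unparseable."""
    m = STATUS_RE.match(line.strip())
    if m is None:
        return None
    try:
        when = date.fromisoformat(m.group("date"))
    except ValueError:  # well-formed pattern but not a real calendar date
        return None
    state = m.group("state") or "Superseded"
    return state, m.group("by"), when
```

A bidirectional-link check could build on this: for every ADR whose state is `Superseded`, confirm the ADR named in `superseded_by` mentions `Supersedes ADR-MMM` in return.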
### 4. Cross-references
Reference other ADRs by number inline (`ADR-019`), and collect the relationships in a
`## Related` section.

### 5. Template file
Ship `docs/decisions/adr-template.md` — consistent with boma's existing
`*-template.md` convention. It contains the mandatory section headers pre-filled with
short HTML-comment hints, and the optional sections listed as commented stubs to
uncomment when relevant. It is a skeleton, not a numbered decision, so it does not take
an ADR number.

### 6. Retroactive restructure (001–018)
A **separate step** after the ADR and template land: bring every pre-convention ADR to
full conformance — all four mandatory sections present and a parseable Status line. This
is a **presentational** restructure, governed by a strict faithfulness rule:

- **Add** a `## Status` section valued `Accepted (YYYY-MM-DD)`, the date reconstructed
  from the file's **first git-commit date**. For 016–018, whose existing trailing
  build-state note is unparseable, prepend the dated `Accepted (...)` clause so the note
  becomes a parseable Status line's tail.
- **Reorganize** existing content under the canonical headings: relabel a synonym
  (`## Decisions` → `## Decision`), or introduce a `## Decision` umbrella and **demote**
  the existing topical `##` headings to `###` beneath it. No sentence of existing prose
  is altered.
- **Add** a `## Consequences` section built **only** from implications the ADR already
  states (trade-offs, "what was ruled out", "open questions", follow-on work already
  named). If an ADR genuinely states nothing that can be faithfully cast as a
  consequence, that file is escalated for a human decision rather than inventing one.
- **Never** change the substance of a decision. A `git diff` of the restructure should
  show heading-level changes, a new Status section, and a Consequences section assembled
  from existing material — not edits to existing argument.

ADRs already conformant (019–022) are left alone. End state: the `adr-structure` check
reports zero findings across the whole corpus, with no grandfathering.

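The "introduce a `## Decision` umbrella and demote" step can be pictured as a pure heading transform that leaves every prose line untouched. The sketch below is only an illustration: `demote_under_decision` is a hypothetical helper, the caller is assumed to name which topical headings belong under the umbrella, and any `###` headings already nested inside those sections are not handled.

```python
def demote_under_decision(text: str, topical: set[str]) -> str:
    """Insert a '## Decision' umbrella and demote the named '##' headings to '###'.

    Only heading lines change; prose lines pass through unmodified.
    """
    out: list[str] = []
    inserted = False
    for line in text.splitlines():
        if line.startswith("## ") and line[3:].strip() in topical:
            if not inserted:
                # The umbrella goes immediately before the first topical heading.
                out += ["## Decision", ""]
                inserted = True
            out.append("#" + line)  # "## X" becomes "### X"
        else:
            out.append(line)
    return "\n".join(out)
```

Because the function only rewrites heading lines, a diff of its output against the input shows exactly the heading-level changes the faithfulness rule allows.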
### 7. Enforcement
Lightweight, no CI gate. The `/review-repo` command gains an ADR-structure check:
every file in `docs/decisions/` matching `NNN-*.md` has the four mandatory sections and
a parseable `## Status` line. The template carries the convention forward for new ADRs.

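A structure check of this kind could look roughly like the following. It is a sketch under assumptions, not the `/review-repo` implementation: `is_adr_file` and `check_adr` are hypothetical names, the filename pattern is one reading of `NNN-kebab-title.md`, and the Status grammar is assumed to be a state followed by a parenthesised ISO date.

```python
import re

MANDATORY = ["Status", "Context", "Decision", "Consequences"]
ADR_FILENAME = re.compile(r"^\d{3}-[a-z0-9]+(?:-[a-z0-9]+)*\.md$")
STATUS_LINE = re.compile(
    r"^(Proposed|Accepted|Deprecated|Superseded by ADR-\d{3}) \(\d{4}-\d{2}-\d{2}\)"
)


def is_adr_file(filename: str) -> bool:
    """True for numbered ADR files; adr-template.md is deliberately excluded."""
    return ADR_FILENAME.match(filename) is not None


def check_adr(text: str) -> list[str]:
    """Return structural findings for one ADR; an empty list means conformant."""
    findings = []
    headings = re.findall(r"^## (.+?)\s*$", text, flags=re.MULTILINE)
    present = [h for h in headings if h in MANDATORY]
    for name in MANDATORY:
        if name not in present:
            findings.append(f"missing section: ## {name}")
    # Sections that are present must appear in the canonical order.
    if present != [m for m in MANDATORY if m in present]:
        findings.append("mandatory sections out of order")
    # The first non-blank line under "## Status" must be a dated state.
    m = re.search(r"^## Status\s*\n+(.+)$", text, flags=re.MULTILINE)
    if m and not STATUS_LINE.match(m.group(1).strip()):
        findings.append("unparseable Status line")
    return findings
```

Applied uniformly to every file for which `is_adr_file` is true, the check needs no number threshold, which is the property the no-grandfathering decision relies on.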
## Consequences

- New ADRs have one obvious shape and a scaffold to start from; structural drift stops.
- Every ADR declares its lifecycle state uniformly, and reversals are traceable rather
  than silent — the back-catalogue becomes a legible decision history.
- One-time churn: a restructure touching ~18 files (heading reorganization + a Status
  section + a Consequences section per file). Larger and more judgment-heavy than a
  Status-only backfill, hence the faithfulness rule and per-file review.
- The whole corpus conforms — the check needs no grandfathering or number threshold, and
  stays simple (presence + parseable Status, applied uniformly).
- `/review-repo` grows a new check; no new CI machinery, matching boma's habit of not
  gating doctrine in CI.
- This ADR is itself the first conformant example — it must follow its own structure.

## Open questions

None outstanding — title/filename, the **4-state lifecycle** (`Proposed / Accepted /
Superseded / Deprecated`; `Proposed` adopted on the evidence of ADR-011), template name
(`adr-template.md`), enforcement (`/review-repo`, no CI gate), and the **full
retroactive restructure** of 001–018 (no grandfathering) were all confirmed during
brainstorming and execution.

315
docs/superpowers/specs/2026-06-10-backup-strategy-design.md
Normal file
@@ -0,0 +1,315 @@
# Design — Backup & disaster recovery strategy

- **Date:** 2026-06-10
- **Status:** Approved design — implementation plan written; Plan 1 (foundation) complete (see ADR-022)
- **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up,
  incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap,
  all "planned")
- **Grounds:** the backup substrate that ADR-011 (update management) already leans on
  ("snapshot-before + backups remain the rollback mechanism", "always dumps the DB /
  takes a backup first") but never defined
- **Reuses:** ADR-004 (one service = one role; per-service doc conventions),
  ADR-008/017 (`VERIFY.md` per-service checks), ADR-021 (`ACCESS.md` rendered from
  role `access__*` data — the same render-from-data pattern), ADR-015 (`ubongo`
  recovery model; `mamba` break-glass clone)
- **Becomes:** ADR-022 (this design is the basis for that ADR)

---

## Problem

boma has no defined backup policy. The ADRs assume one exists — ADR-011 makes
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
copies live, *how* they're encrypted, or *whether restores actually work*.
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
but commits to nothing.

This design defines the policy end-to-end: recovery model, what is captured and how,
the 3-2-1 topology, encryption and key escrow with a break-glass path, restore
testing, retention, failure alerting, and the air-gap mechanism.

## Scope

- **In:** application *state* backup for boma's hosts and services; off-site and
  air-gapped copies; encryption + key escrow; restore testing; failure alerting;
  retention; the backup node.
- **Out (for now):** whole-VM image backup (Proxmox Backup Server) — explicitly
  deferred, see Decision 1; a central-vs-per-app database decision (TODO 3.9 — this
  design is agnostic to it); Prometheus backup metrics (noted as a later add).

## Decisions (as settled)

### 1. Recovery model — data-only backups, rebuild from code (Model A)

boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
Ansible re-renders the Docker Compose stack. So backups protect **state only** — DB
contents, bind-mount data dirs, Vaultwarden's vault — not whole-VM images.

To recover a host: Terraform re-provisions the VM → Ansible redeploys → restic
restores the data. **No Proxmox Backup Server.** This keeps 3-2-1 cheap, fits
pCloud's 1 TB comfortably, and turns every restore into a continuous proof that the
IaC *and* the backups both work.

Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run +
data restore, potentially hours), and it bets the repo is complete enough to rebuild
from nothing — which Tier-2 restore testing (Decision 8) exists to verify. **PBS
(Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO proves
too slow; nothing here precludes it.

### 2. One backup tier, ~24 h RPO

A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
the board. No per-data-type tiering yet — revisit once there is real-world data and
experience to justify the added machinery.

### 3. Engine — restic (data) + rclone (off-site); no PBS

- **restic** captures state into an encrypted, deduplicated repository.
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
  rclone has a first-class pCloud backend).
- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
  encryption layer, no pCloud "crypto folder."

### 4. Topology — central pull node (`fisi`), off the cluster

A single backup node owns the canonical restic repo. It is **off the Proxmox
cluster** — an independent failure domain, so copy 2 survives a PVE node (or the whole
cluster) dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
(off-site): a manually-provisioned physical node in its own inventory group, still
Ansible-managed (base hardening + a `backup` role).

**Pull model.** The backup node holds SSH keys to each host; per service it runs the
declared dump command remotely, pulls the declared paths read-only, then `restic`
snapshots the staged data into its *local* repo. **Hosts hold no backup credentials
and cannot reach the repo** — so a compromised or ransomwared service host cannot
delete backup history.

**Backup node assignment:** `fisi` (an HP Elite 600 G9 tower), penciled in / provisional
— the *role* ("the backup node") is load-bearing; the physical assignment may be
revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
(ZFS or mdraid → 8 TB usable, survives one disk failure; not a stripe). It owns the
repo, runs the pull orchestration, runs `rclone → pCloud`, and **docks the USB
air-gap drives** (Decision 11). Pending one hardware item: the SATA power cable from
the board/PSU to the drives. A data-only restic node is a featherweight workload, so
the G9 is comfortably over-specced.

### 5. 3-2-1 mapping

| Copy | Location | Medium | Off-site | Notes |
|---|---|---|---|---|
| 1 | Live data on each host | NVMe/SSD | no | The working data |
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Decision 9 / threat model) |
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |

≥3 copies, ≥2 media, ≥1 off-site — satisfied, with the air-gap drive as a fourth,
offline copy that no online compromise can reach.

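The 3-2-1 claim can be checked mechanically against the table. The sketch below is illustrative only: the `Copy` type and the medium labels are invented for the example, while the four rows mirror the mapping above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str            # illustrative label, not a boma identifier
    offsite: bool
    offline: bool = False


COPIES = [
    Copy("live data on each host", "nvme/ssd", offsite=False),
    Copy("fisi restic repo", "hdd-mirror", offsite=False),
    Copy("pCloud via rclone", "cloud", offsite=True),
    Copy("USB air-gap drive", "removable-hdd", offsite=True, offline=True),
]


def satisfies_321(copies: list[Copy]) -> bool:
    """At least 3 copies, on at least 2 distinct media, at least 1 off-site."""
    media = {c.medium for c in copies}
    return len(copies) >= 3 and len(media) >= 2 and any(c.offsite for c in copies)
```

Note that the rule still holds with the USB drive removed; the air-gap copy adds offline immutability, which 3-2-1 itself does not measure.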
### 6. Per-service backup contract — `backup__*` data + `BACKUP.md` (hard convention)

Almost every boma service is the same shape: a Docker bind-mount data dir + maybe a
database. Each **service role declares its backup needs** in role vars — the same
render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):

```yaml
backup__service: nextcloud   # identifier; matches the role / compose project
backup__state: true          # false = stateless → no BACKUP.md (pair with a reason)
backup__paths:               # bind-mount dirs / files holding state ([] = none)
  - /srv/nextcloud/data
backup__dumps:               # logical app-consistent dumps (list; [] = none)
  - cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
    dest: nextcloud-db.sql
backup__quiesce: false       # true = stop→back up→restart escape hatch
```

(ADR-022 is authoritative for the contract.)

The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
silent).

**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md` /
`VERIFY.md` / `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
what state exists, what is backed up, the dump command, and the per-service **restore**
procedure. A template lives at `docs/backup/service-backup-template.md`. `make lint`
gates its presence for service roles.

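How the pull orchestrator might turn one service's `backup__*` data into an ordered plan is sketched below. This is an assumption-laden illustration: `plan_backup` is a hypothetical helper, the contract is passed as an already-rendered dict, and the placement of the quiesce stop/restart around the path pull (after any live dumps) is a guess, since ADR-022 is the authority on the contract.

```python
def plan_backup(contract: dict) -> list[str]:
    """Return the ordered steps for one service, as human-readable strings."""
    svc = contract["backup__service"]
    if not contract.get("backup__state", True):
        return [f"{svc}: stateless, nothing to back up"]
    paths = contract.get("backup__paths", [])
    dumps = contract.get("backup__dumps", [])
    if not paths and not dumps:
        return [f"{svc}: nothing to back up (declared)"]

    steps = []
    # Logical dumps run first, while the service is still live.
    for d in dumps:
        steps.append(f"{svc}: run dump -> {d['dest']}")
    for d in dumps:
        steps.append(f"{svc}: pull dump file {d['dest']} (read-only)")
    # Assumed: the quiesce escape hatch brackets only the file-level pull.
    quiesce = contract.get("backup__quiesce", False)
    if quiesce:
        steps.append(f"{svc}: stop containers")
    for p in paths:
        steps.append(f"{svc}: pull {p} (read-only)")
    if quiesce:
        steps.append(f"{svc}: restart containers")
    steps.append(f"{svc}: restic snapshot of staged data")
    return steps
```

Producing a plan before executing anything keeps the "declared, not silent" property visible: a stateless service yields an explicit line rather than no output at all.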
### 7. Consistency — logical dumps first, quiesce as an escape hatch

- **Default (A):** databases are captured with logical dumps (`pg_dump` /
  `mysqldump`) — portable, version-independent, restorable to a fresh DB. Plain data
  dirs are backed up as files. No downtime. Cost: every stateful service must declare
  a working dump command, *tested by restore drills*.
- **Escape hatch (B):** a service whose data cannot be dumped live declares a
  quiesce step (stop container → back up volume → restart) in the same contract.
- ZFS/filesystem snapshots are **not** used as the sole DB method (only
  crash-consistent for a live database).

This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
way, each service declares how to dump its own data.

### 8. Restore testing — two tiers

- **Tier 1 — frequent, automated, rolling restore-verify (weekly).** Pick the next
  service in rotation, restore its latest snapshot into a throwaway **container on
  `ubongo`** (reusing boma's existing Molecule harness, ADR-015), start the app
  against the restored data, and **run that service's `VERIFY.md` checks**
  (ADR-008/017) against it, then tear down. This catches the failure that actually
  kills people — *silently corrupt or unrestorable backups*. Failures alert via ntfy.
- **Tier 2 — rare, full DR rehearsal (semi-annual), driven from `ubongo` onto PVE
  staging.** Rebuild a host from zero via Terraform + Ansible + restic restore on the
  staging cluster (only a real PVE node can host the VM; `ubongo` orchestrates). This
  validates the whole Model-A recovery chain, not just "can I read a snapshot."
  **At least once a year the rehearsal exercises the paper-secret break-glass path**
  (Decision 10) end-to-end.

`ubongo` stays **bare Debian, not a hypervisor** (ADR-015 unchanged): its job is to be
the independent recovery anchor — "the tool used to rebuild the cluster must not live
inside the thing it rebuilds." Higher-fidelity real-VM testing is *better* served by
the PVE staging env (same hardware class, same cluster, same provisioning path) than
by converting `ubongo`. `ubongo`'s real spec is a ThinkCentre M70q (i3-10100T / 16 GB
/ **1 TB NVMe**) — the 1 TB gives ample room for Tier-1 dataset restores; disk
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).

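"Pick the next service in rotation" for the weekly Tier-1 run could be as simple as a round-robin over the sorted service names. The sketch below assumes that policy; `next_service` is a hypothetical helper, and where the last-tested name is stored is left open.

```python
from typing import Optional


def next_service(services: list[str], last_tested: Optional[str]) -> str:
    """Round-robin over service names in sorted order.

    If nothing has been tested yet, or the last-tested service no longer
    exists, start again from the first name.
    """
    ordered = sorted(services)
    if not ordered:
        raise ValueError("no stateful services declared")
    if last_tested not in ordered:
        return ordered[0]
    return ordered[(ordered.index(last_tested) + 1) % len(ordered)]
```

With one service per week, a rotation of N services re-verifies each backup every N weeks, which is worth keeping in mind as the service count grows.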
### 9. Retention — GFS via restic

Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
Tune once real repo growth is observed.

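Keeping the retention policy as data and deriving the `restic forget` command line from it makes later tuning a one-line change. The sketch below only builds the argument list and does not run anything; `forget_argv` is a hypothetical helper, the repo path in the usage note is made up, and the repo password is assumed to be supplied separately (for example through restic's `RESTIC_PASSWORD_FILE` environment variable).

```python
# Starting policy from this section; tune once real repo growth is observed.
RETENTION = {"daily": 7, "weekly": 4, "monthly": 6, "yearly": 1}


def forget_argv(repo: str, retention: dict[str, int] = RETENTION) -> list[str]:
    """Build the argv for a 'restic forget --prune' run with the given policy."""
    argv = ["restic", "--repo", repo, "forget"]
    for period, count in retention.items():
        argv += [f"--keep-{period}", str(count)]
    argv.append("--prune")
    return argv
```

For example, `forget_argv("/srv/restic/repo")` yields a list suitable for `subprocess.run(..., check=True)`, so a non-zero exit from restic surfaces as an exception the alerting hook can report.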
### 10. Encryption + key escrow + break-glass

restic already encrypts the repo, so **one secret — the restic repo password —
protects all copies uniformly** (fisi, pCloud, USB). One thing to escrow, not three.

**Escrow locations:**
- **`fisi`, root-only** (+ in the Ansible vault) — so backups run non-interactively
  and `fisi` is redeployable.
- **Vaultwarden** — the day-to-day human-accessible copy.
- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
  copy that survives "everything is down."

**Model-A twist — the paper holds *two* secrets, not one:**
1. the **restic repo password** (to read any backup at all), and
2. the **Ansible vault master password** (to rebuild hosts from the repo — normally
   from Vaultwarden via `rbw`, which is itself down in a from-zero recovery).

With both on paper, the break-glass chain has **no circular dependency**: paper →
restic restores Vaultwarden + repo data → the vault password (from paper) drives
Terraform/Ansible re-provisioning → services return, `rbw` works again. `ubongo`'s
ADR-015 recovery model already establishes **`mamba` (laptop) as a break-glass clone**
(repo + toolchain + mesh + `rbw`, with Terraform state synced to it) — the rebuild can
be driven from `mamba` if `ubongo` is also gone. The printed sheet is a short
**break-glass runbook** assuming zero running boma infrastructure: install restic on
any machine, point it at pCloud *or* a USB drive with the password, restore Vaultwarden
first, then rebuild with the vault password.

### 11. USB air-gap trigger (plug-and-go cold copy)

A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
systemd unit → script that: mounts the drive, confirms it is an expected drive, runs
**`restic copy` from the local repo → a restic repo on the USB drive** (dedup-aware,
same password → ciphertext if lost/stolen), runs `restic check` on the USB copy,
unmounts, and **notifies via ntfy** with the result. Only allowlisted serials trigger
anything (a rogue USB does nothing).

`restic copy` (not rsync) so the USB is itself a valid restic repo — restorable
**directly** in a break-glass with nothing else alive. Rotate among a few drives,
**stored off-site** → also a second *geographic* off-site copy independent of pCloud.

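The allowlist gate and the step order of the triggered script can be sketched as follows. Everything named here is hypothetical: the serials are placeholders, `airgap_steps` is an illustrative helper, and the steps are returned as labels rather than executed, since the real unit would shell out to `mount`, `restic`, and an ntfy client.

```python
# Placeholder serials for illustration only; real ones would come from role vars.
ALLOWED_SERIALS = frozenset({"EXAMPLE-SERIAL-0001", "EXAMPLE-SERIAL-0002"})


def airgap_steps(serial: str, allowed: frozenset = ALLOWED_SERIALS) -> list[str]:
    """Return the ordered air-gap steps for a plugged drive, or [] if not allowlisted."""
    if serial not in allowed:
        return []  # a rogue USB does nothing
    return [
        "mount drive",
        "confirm it is an expected drive",
        "restic copy: local repo -> USB repo",
        "restic check on USB repo",
        "unmount drive",
        "notify via ntfy with the result",
    ]
```

Checking the serial again inside the script, and not only in the udev match, keeps the "does nothing" guarantee even if the rule is later loosened by mistake.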
### 12. Failure alerting — guard against silent death

Success/failure pings alone miss the worst case (*the job silently stopped running*):
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
  monitor** (already in the planned stack); no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job or dump-step error.
- **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
  Grafana panel (fits ADR-018's monitoring direction; not required for v1).

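The dead-man's-switch reduces to one comparison: has more than the grace window passed since the last successful ping? In the design that evaluation belongs to the Uptime Kuma push monitor; the sketch below only illustrates the logic, with `is_overdue` as a hypothetical name and 25 hours taken from the "~25 h" figure above.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=25)  # nightly cadence plus an hour of slack


def is_overdue(last_success: datetime, now: datetime, grace: timedelta = GRACE) -> bool:
    """True when no successful run has been recorded within the grace window."""
    return now - last_success > grace
```

The key property is that the alert fires on the absence of a success signal, so it also covers the case where the job never starts and therefore never reports a failure.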
### 13. Schedule

- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
  run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
  (Decision 9) → `rclone sync` → pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
- **USB air-gap:** manual, ~monthly, whenever a drive is docked.

## Architecture & data flow (nightly run)

```
                          ┌────────────────────────────────────────────┐
  docker_hosts / etc.     │  fisi (backup node)                        │
  ┌───────────┐   SSH     │  pull orchestrator (reads backup__*)       │
  │ service A │◀──────────│  1. ssh host → run dumps (pg_dump…)        │
  │  + DB     │  pull RO  │  2. pull dump + backup__paths (read-only)  │
  └───────────┘──────────▶│  3. restic snapshot → local repo (mirror)  │
  ┌───────────┐           │  4. restic forget --prune (GFS)            │
  │ service B │           │  5. rclone sync repo → pCloud (offsite)    │
  └───────────┘           │  6. heartbeat → Uptime Kuma; errors→ntfy   │
                          └───────────────┬────────────────────────────┘
                                          │  (manual, ~monthly)
                              udev: known drive plugged
                                          ▼
                      restic copy → USB repo (air-gap, offline)
```

Restore (Model A): Terraform re-provisions the VM → Ansible redeploys the role →
restic restores `backup__paths` + replays the dump → `VERIFY.md` confirms.

## Components & boundaries

- **`backup` role (on `fisi`):** pull orchestrator, restic repo management, retention
  prune, rclone→pCloud sync, udev/air-gap unit, alerting hooks. New inventory group
  (e.g. `backup_hosts`) with the `base` role applied, like `control`/`offsite_hosts`.
- **Per-service backup contract:** `backup__*` role vars + rendered `BACKUP.md`
  (Decision 6); a hard convention enforced by `make lint`.
- **`ubongo`:** schedules/drives Tier-1 (local container) and Tier-2 (onto staging);
  unchanged role per ADR-015.
- **Secrets:** restic password + rclone token in `fisi` (root-only) and the Ansible
  vault; escrowed per Decision 10.

## Threat model / 3-2-1 honesty

- **`rclone sync` propagates deletions** — a prune, or a *malicious* wipe of `fisi`'s
  repo, replicates to pCloud. pCloud is therefore the **off-site** copy but **not
  immutable**. Mitigations: the **USB air-gap drive is the immutable backstop**
  (offline = unreachable by any online compromise) and **pCloud's own file-version
  history** is enabled as a recovery cushion.
- **Pull model** stops a compromised *service host* from touching the repo.
- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it
  gets full base hardening and tight access. restic encryption means a stolen `fisi`
  (or USB, or pCloud blob) yields ciphertext only.
- **pCloud's 1 TB is the smallest copy → the off-site capacity ceiling.** Data-only
  backups fit for years at homelab scale; flag for `/capacity-review` if the repo
  trends toward ~1 TB.

## What this changes in the repo (for the plan)

- New `backup` role + `backup_hosts` inventory group; `fisi` hardware-reference entry.
- New per-service convention: `backup__*` vars + `BACKUP.md` (template at
  `docs/backup/service-backup-template.md`); `make lint` gate; update role-conventions
  in `CLAUDE.md` and the new-role scaffolding/runbook.
- Update `docs/hardware/reference.md`: `ubongo` = M70q (i3-10100T/16 GB/**1 TB**);
  add `fisi`.
- Update `CAPABILITIES.md` §9 (PBS → deferred; restic+rclone+USB the committed engine).
- Close `docs/TODO.md` 3.8; cross-reference from ADR-011.
- The break-glass runbook (printed sheet + `docs/runbooks/`), referencing ADR-015's
  `mamba` clone and Terraform-state survival.

## Non-goals / YAGNI

- No PBS / whole-VM images in v1 (Decision 1).
- No per-data-type RPO tiering in v1 (Decision 2).
- No second encryption layer over restic (Decision 3).
- No central NAS/file-share scope creep on `fisi` — it stays single-purpose.

## Open / deferred

- Central vs per-app database (TODO 3.9) — orthogonal; this design works either way.
- Prometheus backup metrics — later add (Decision 12).
- PBS (Model B) or hybrid (Model C) — revisit if real-world RTO is too slow.

Some files were not shown because too many files have changed in this diff.